Category Archives: Hadoop

How-to: Analyze Fantasy Sports with Apache Spark and SQL (Part 2: Data Exploration)

Categories: Hadoop Spark Use Case

Learn how analyzing stats from professional sports leagues is an instructive use case for data analytics using Apache Spark with SQL. Covered in this installment: data exploration with Apache Impala (incubating) and Hue.

In Part 1 of this series, I introduced the topic of using fantasy sports analytics as an instructive use case for exploring the Apache Hadoop ecosystem. In that installment, we focused on data processing by taking a collection of data from Basketball-Reference.com and enriching it with z-scores and normalized z-scores to analyze the relative value of NBA players.

Read More

Untangling Apache Hadoop YARN, Part 4: Fair Scheduler Queue Basics

Categories: Hadoop YARN

In this installment, we provide insight into how the Fair Scheduler works, and why it works the way it does.

In Part 3 of this series, you got a quick introduction to Fair Scheduler, one of the scheduler choices in Apache Hadoop YARN (and the one recommended by Cloudera). In Part 4, we will cover most of the queue properties, some examples of their use, as well as their limitations.

Read More

How-to: Use Impala and Kudu Together for Analytic Workloads

Categories: Data Science Hadoop How-to Impala Kudu Performance

Using Apache Impala (incubating) on top of Apache Kudu (incubating) has significant performance benefits

Apache Kudu (incubating) is the newest addition to the set of storage engines that integrate with the Apache Hadoop ecosystem. The promise of Kudu is to deliver high-scan performance, targeting analytical workloads, while allowing users to concurrently insert, update, and delete records. With these properties, Kudu becomes a viable alternative to existing combinations of HDFS and/or Apache HBase to achieve similar results with less complicated ETL pipelines,

Read More

Building, Benchmarking, and Tuning Syslog Ingest Architecture at Vodafone UK

Categories: Flume Hadoop Kafka Security Use Case

Vodafone UK’s new SIEM system relies on Apache Flume and Apache Kafka to ingest nearly 1 million events per second. In this post, learn about the architecture and performance-tuning techniques and that got it there.

SIEM platforms provide a useful tool for identifying indicators of compromise across disparate infrastructure. The catch is, they’re only as accurate as the fidelity of the data involved, which is why Apache Hadoop is becoming such a valuable platform for that use case.

Read More

Meet the Authors: “Data Analytics with Hadoop” from O’Reilly Media

Categories: Books Data Science General Hadoop

I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.

Why did you decide to write this book?

Ben: The content was originally part of a class that Jenny and I were teaching together.

Read More