Learn how analyzing stats from professional sports leagues is an instructive use case for data analytics using Apache Spark with SQL. Covered in this installment: data exploration with Apache Impala (incubating) and Hue.
In Part 1 of this series, I introduced the topic of using fantasy sports analytics as an instructive use case for exploring the Apache Hadoop ecosystem. In that installment, we focused on data processing by taking a collection of data from Basketball-Reference.com and enriching it with z-scores and normalized z-scores to analyze the relative value of NBA players.
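For readers new to the term, a z-score expresses how many standard deviations a value sits above or below the mean of its group. The snippet below is a minimal sketch of that calculation only. The player names and per-game figures are made up for illustration, the use of the population standard deviation is an assumption, and the "normalized" variant mentioned above is not defined in this excerpt, so it is not shown.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score of each value relative to the whole list."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # Every value is identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical points-per-game figures, not real player data.
points_per_game = {"player_a": 10.0, "player_b": 20.0, "player_c": 30.0}

scores = dict(zip(points_per_game, z_scores(list(points_per_game.values()))))

for player, z in scores.items():
    print(f"{player}: {z:+.3f}")
```

A positive score marks a player above the group average in that statistic and a negative score marks one below it, which is what makes z-scores convenient for comparing players across categories measured on different scales.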
In this installment, we provide insight into how the Fair Scheduler works, and why it works the way it does.
In Part 3 of this series, you got a quick introduction to Fair Scheduler, one of the scheduler choices in Apache Hadoop YARN (and the one recommended by Cloudera). In Part 4, we will cover most of the queue properties, some examples of their use, as well as their limitations.
Using Apache Impala (incubating) on top of Apache Kudu (incubating) has significant performance benefits.
Apache Kudu (incubating) is the newest addition to the set of storage engines that integrate with the Apache Hadoop ecosystem. The promise of Kudu is to deliver high-scan performance, targeting analytical workloads, while allowing users to concurrently insert, update, and delete records. With these properties, Kudu becomes a viable alternative to existing combinations of HDFS and/or Apache HBase to achieve similar results with less complicated ETL pipelines,
Vodafone UK’s new SIEM system relies on Apache Flume and Apache Kafka to ingest nearly 1 million events per second. In this post, learn about the architecture and performance-tuning techniques that got it there.
SIEM platforms provide a useful tool for identifying indicators of compromise across disparate infrastructure. The catch is, they’re only as accurate as the fidelity of the data involved, which is why Apache Hadoop is becoming such a valuable platform for that use case.
I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.
Why did you decide to write this book?
Ben: The content was originally part of a class that Jenny and I were teaching together.