Thanks to Sam Shuster, Software Engineer at Edmunds.com, for the guest post below about his company’s use case for Spark Streaming, SparkOnHBase, and Morphlines.
Every year, the Super Bowl brings parties, food, and hopefully a great game to appease everyone’s football appetites until the fall. With an event that draws around 114 million viewers, and more each year, Americans have also grown accustomed to commercials with production budgets on par with television shows and entertainment value that tries to rival the game itself.
Our thanks to Micah Whitacre, a senior software architect on Cerner Corp.’s Big Data Platforms team, for the post below about Cerner’s use case for CDH + Apache Kafka. (Kafka integration with CDH is currently incubating in Cloudera Labs.)
Over the years, Cerner Corp., a leading Healthcare IT provider, has utilized several of the core technologies available in CDH, Cloudera’s software platform containing Apache Hadoop and related projects—including HDFS,
This Spark Streaming use case is a great example of how near-real-time processing can be brought to Hadoop.
Spark Streaming is one of the most interesting components within the Apache Spark stack. With Spark Streaming, you can create data pipelines that process streamed data using the same API that you use for processing batch-loaded data. Furthermore, Spark Streaming’s “micro-batching” approach provides decent resiliency should a job fail for some reason.
Our thanks to Melanie Imhof, Jonas Looser, Thierry Musy, and Kurt Stockinger of the Zurich University of Applied Sciences in Switzerland for the post below about their research into the query performance of Impala for mixed workloads.
Recently, we were approached by an industry partner to research and create a blueprint for a new Big Data, near real-time, query processing architecture that would replace its current architecture based on a popular open source database system.
The ability to quickly and accurately count complex events is a legitimate business advantage.
In our work as data scientists, we spend most of our time counting things. It is the foundational skill that is used in data cleansing, reporting, feature engineering, and simple-but-effective machine learning models like Naive Bayes classifiers. Hilary Mason has a quote about the benefits of counting that I love:
Understand that what big data really means is to be able to count things in data sets of any size,