There are two clear trends in the big-data ecosystem: the growth of machine learning use cases that leverage large distributed data sets, and the growth of Spark’s Machine Learning libraries (often referred to as MLlib) for these use cases. In fact, Spark’s MLlib library is arguably the leading solution for machine learning on large distributed data sets.
Intel and Cloudera have collaborated to speed up Spark’s ML algorithms, via integration with Intel’s Math Kernel Library (Intel® MKL).
Today, Cloudera announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform.
Apache Spark 2.0 is tremendously exciting (read this post for more background) because (among other things):
- The Dataset API further enhances Spark’s claim as the best tool for data engineering by providing compile-time type safety along with the benefits of a query-optimization engine.
- The Structured Streaming API enables the modeling of streaming data as a continuous DataFrame and expresses operations on that data with a SQL-like API.
Livy, which streamlines Spark architecture for web/mobile apps, is the newest addition to Cloudera Labs.
With respect to the impact of Apache Spark on the Apache Hadoop ecosystem, its virtual overnight adoption as the default data processing engine—and as a standard for powering advanced analytic applications—speaks for itself. But, that’s not to say that there isn’t work yet to be done, particularly in the areas of performance at scale/under multi-tenancy,
Cloudera has announced support for Spark SQL/DataFrame API and MLlib. This post explains their benefits for app developers, data analysts, data engineers, and data scientists.
In July 2015, Cloudera re-affirmed its position since 2013: that Apache Spark is on course to replace MapReduce as the default general-purpose data processing engine for Apache Hadoop. Thanks to initiatives like the One Platform Initiative,
This post contains answers to common questions about deploying and configuring Apache Kafka as part of a Cloudera-powered enterprise data hub.
Cloudera added support for Apache Kafka, the open standard for streaming data, in February 2015 after its brief incubation period in Cloudera Labs. Apache Kafka now is an integrated part of CDH, manageable via Cloudera Manager, and we are witnessing rapid adoption of Kafka across our customer base.