Category Archives: Spark

Blacklisting in Apache Spark

Categories: Hadoop Spark

At Cloudera, we’re always working to provide our customers and the Apache Spark community with the most robust, most reliable software possible. This article describes some recent engineering work on [SPARK-8425] that is available in CDH 5.10 and CDH 5.11, as well as in upstream Apache Spark starting with the 2.2 release.

The work pertains to the Blacklist Tracker mechanism in Spark’s scheduler. This was the subject of a recent Spark Summit talk,

Read more

How-to: Log Analytics with Solr, Spark, OpenTSDB and Grafana

Categories: Hadoop How-to Search Spark

Organizations analyze logs for a variety of reasons. Some typical use cases include predicting server failures, analyzing customer behavior, and fighting cybercrime. However, one of the most overlooked use cases is helping companies write better software. In this digital age, most companies write applications, whether for their employees or for external users. The cost of faulty software can be severe, ranging from customer churn to a firm’s complete demise, as was the case with Knight Capital in 2012.

Read more

How To Set Up a Shared Amazon RDS as Your Hive Metastore

Categories: Cloud Hadoop Hive How-to Impala Spark Use Case

Before CDH 5.10, every CDH cluster had to have its own Apache Hive Metastore (HMS) backend database. This model works well when each cluster stores its data locally alongside the metadata. In the cloud, however, many CDH clusters run directly on a shared object store (like Amazon S3), making it possible for the data to live across multiple clusters and beyond any cluster’s lifespan. In this scenario, each cluster has to regenerate and coordinate metadata for the underlying shared data individually.

Read more

Accelerating Apache Spark MLlib with Intel® Math Kernel Library (Intel® MKL)

Categories: Data Science Spark

There are two clear trends in the big-data ecosystem: the growth of machine learning use cases that leverage large distributed data sets, and the growth of Spark’s machine learning libraries (often referred to as MLlib) for these use cases. In fact, MLlib is arguably the leading solution for machine learning on large distributed data sets.

Intel and Cloudera have collaborated to speed up Spark’s ML algorithms through integration with Intel’s Math Kernel Library (Intel® MKL).

Read more

Analyzing US flight data on Amazon S3 with sparklyr and Apache Spark 2.0

Categories: CDH Data Science Hadoop Spark Use Case

We have published several blog posts about sparklyr (introduction, automation), which lets you analyze big data with Apache Spark seamlessly from R. sparklyr, developed by RStudio, is an R interface to Spark that lets you use Spark as the backend for dplyr, the popular data manipulation package for R.

If you are interested in sparklyr, you can learn how to use it with the official documentation,

Read more