Tag Archives: platform

How-to: Build a Complex Event Processing App on Apache Spark and Drools

Categories: HBase, How-to, Kafka, Spark, Use Case

Combining CDH with a business rules engine such as Drools can serve as a solid foundation for complex event processing on big data.

Event processing involves tracking and analyzing streams of event data to support better insight and decision making. With the recent explosion in data volume and diversity of data sources, this goal can be quite challenging for architects to achieve.

Complex event processing (CEP) is a type of event processing that combines data from multiple sources to identify patterns and complex relationships across various events.
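To make that concrete, here is a minimal sketch of how Drools rules might be evaluated against events inside a Spark job. It assumes a hypothetical AlertEvent fact class and rules already packaged on the executor classpath via a standard kmodule.xml; the full how-to covers the complete pipeline.

    import org.apache.spark.rdd.RDD
    import org.kie.api.KieServices
    import scala.beans.BeanProperty

    // Hypothetical fact class; the Drools rules are assumed to inspect it and set `flagged`.
    class AlertEvent(@BeanProperty val source: String,
                     @BeanProperty val value: Double,
                     @BeanProperty var flagged: Boolean = false) extends Serializable

    def applyRules(events: RDD[AlertEvent]): RDD[AlertEvent] =
      events.mapPartitions { partition =>
        // The KieContainer and session are not serializable, so build them on the
        // executor, once per partition, from rules packaged on the classpath (kmodule.xml).
        val kieContainer = KieServices.Factory.get().getKieClasspathContainer
        val session = kieContainer.newStatelessKieSession()
        partition.map { event =>
          session.execute(event) // fire the rules against this event
          event
        }
      }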


How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark

Categories: CDH, Data Science, Guest, How-to, Spark

Thanks to Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai for providing the following guest post about building ML apps using Sparkling Water and Apache Spark on CDH.

The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. Sparkling Water combines H2O's machine-learning algorithms with the execution power of Apache Spark.
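As a quick taste of what that combination looks like, here is a minimal sketch of publishing a Spark DataFrame to H2O so its algorithms can train on it. H2OContext.getOrCreate and asH2OFrame are assumed to be available in your Sparkling Water release (names have shifted between versions), and trainingDF is a hypothetical DataFrame.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.h2o._

    // `sc` is an existing SparkContext; `trainingDF` is a hypothetical Spark DataFrame.
    def publishToH2O(sc: SparkContext, trainingDF: DataFrame) = {
      // Start (or reuse) the H2O services running inside the Spark executors.
      val h2oContext = H2OContext.getOrCreate(sc)

      // Publish the Spark DataFrame as an H2OFrame; H2O algorithms (GBM, GLM, deep
      // learning, ...) can then train on it, and results can be pulled back into Spark.
      h2oContext.asH2OFrame(trainingDF)
    }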


Continuous Distribution Goodness-of-Fit in MLlib: Kolmogorov-Smirnov Testing in Apache Spark

Categories: Spark

Thanks to former Cloudera intern Jose Cambronero for the post below about his summer project, which involved contributions to MLlib in Apache Spark.

Data can come in many shapes and forms, and can be described in many ways. Statistics like the mean and standard deviation of a sample provide descriptions of some of its important qualities. Less commonly used statistics such as skewness and kurtosis provide additional perspective on the data’s profile.
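The resulting Kolmogorov-Smirnov test lives in MLlib's Statistics API; below is a minimal sketch of a one-sample test against a standard normal distribution. The sample values are placeholders and sc is assumed to be an existing SparkContext.

    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    // Placeholder sample; `sc` is assumed to be an existing SparkContext.
    val sample: RDD[Double] = sc.parallelize(Seq(0.1, -0.4, 1.2, 0.7, -2.3, 0.05))

    // One-sample, two-sided KS test against a standard normal (mean 0.0, stddev 1.0).
    val result = Statistics.kolmogorovSmirnovTest(sample, "norm", 0.0, 1.0)

    println(s"KS statistic = ${result.statistic}, p-value = ${result.pValue}")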


How-to: Prepare Your Apache Hadoop Cluster for PySpark Jobs

Categories: CDH, Hadoop, How-to, Spark

Proper configuration of your Python environment is a critical prerequisite for using Apache Spark’s Python API.

One of the most enticing aspects of Apache Spark for data scientists is the APIs it provides in non-JVM languages: Python (via PySpark) and R (via SparkR). There are a few reasons these language bindings have generated so much excitement: most data scientists think writing Java or Scala is a drag, …


Spotlight on Apache Spark at Strata + Hadoop World NYC 2015

Categories: Community, Events, Spark

Apache Spark continues to be a major theme in the Strata + Hadoop World conference series; here are the highlights for next week’s NYC event.

Strata + Hadoop World NYC 2015 (Sept. 29-Oct. 1; if you haven’t registered yet, a 20% discount is still available) is a learning bonanza for many reasons, but this year the focus on Apache Spark and its growing importance in the Apache Hadoop ecosystem is notable.
