Tag Archives: analytics

How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark

Categories: CDH Data Science Guest How-to Spark

Thanks to Michal Malohlava, Amy Wang, and Avni Wadhwa of H20.ai for providing the following guest post about building ML apps using Sparkling Water and Apache Spark on CDH.

The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark.

Read More

Continuous Distribution Goodness-of-Fit in MLlib: Kolmogorov-Smirnov Testing in Apache Spark

Categories: Spark

Thanks to former Cloudera intern Jose Cambronero for the post below about his summer project, which involved contributions to MLlib in Apache Spark.

Data can come in many shapes and forms, and can be described in many ways. Statistics like the mean and standard deviation of a sample provide descriptions of some of its important qualities. Less commonly used statistics such as skewness and kurtosis provide additional perspective into the data’s profile.

Read More

Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data

Categories: Hadoop HBase HDFS Impala Kudu Performance Spark

This new open source complement to HDFS and Apache HBase is designed to fill gaps in Hadoop’s storage layer that have given rise to stitched-together, hybrid architectures.

The set of data storage and processing technologies that define the Apache Hadoop ecosystem are expansive and ever-improving, covering a very diverse set of customer use cases used in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop—making it faster,

Read More

How-to: Prepare Your Apache Hadoop Cluster for PySpark Jobs

Categories: CDH Hadoop How-to Spark

Proper configuration of your Python environment is a critical pre-condition for using Apache Spark’s Python API.

One of the most enticing aspects of Apache Spark for data scientists is the API it provides in non-JVM languages for Python (via PySpark) and for R (via SparkR). There are a few reasons that these language bindings have generated a lot of excitement: Most data scientists think writing Java or Scala is a drag,

Read More

Spotlight on Apache Spark at Strata + Hadoop World NYC 2015

Categories: Community Events Spark

Apache Spark continues to be a major theme in the Strata + Hadoop World conference series; here are highlights at NYC next week.

Strata + Hadoop World NYC 2015 (Sept. 29-Oct. 1; if you haven’t registered yet, a 20% discount is still available) is a learning bonanza for many reasons, but this year the focus on Apache Spark and its growing importance in the Apache Hadoop ecosystem is notable.

Read More