Cloudera Engineering Blog · Spark Posts

Building Lambda Architecture with Spark Streaming

The versatility of Apache Spark’s API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the real world.

Few things help you concentrate like a last-minute change to a major project.

Bayesian Machine Learning on Apache Spark

Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that is capably enabled by Apache Spark.

During my internship at Cloudera, I have been working on integrating PyMC with Apache Spark. PyMC is an open source Python package that allows users to easily apply Bayesian machine learning methods to their data, while Spark is a new, general framework for distributed computing on Hadoop. Together, they provide a scalable framework for scalable Markov Chain Monte Carlo (MCMC) methods. In this blog post, I am going to describe my work on distributing large-scale graphical models and MCMC computation.

Markov Chain Monte Carlo Methods

How-to: Use IPython Notebook with Apache Spark

IPython Notebook and Spark’s Python API are a powerful combination for data science.

The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPythonshell instead of the builtin Python REPL.

New in CDH 5.1: Apache Spark 1.0

Spark 1.0 reflects a lot of hard work from a very diverse community.

Cloudera’s latest platform release, CDH 5.1, includes Apache Spark 1.0, a milestone release for the Spark project that locks down APIs for Spark’s core functionality. The release reflects the work of hundreds of contributors (including our own Diana Carroll, Mark Grover, Ted Malaska, Colin McCabe, Sean Owen, Hari Shreedharan, Marcelo Vanzin, and me).

New Apache Spark Developer Training: Beyond the Basics

While the new Spark Developer training from Cloudera University is valuable for developers who are new to Big Data, it’s also a great call for MapReduce veterans.

When I set out to learn Apache Spark (which ships inside Cloudera’s open source platform) about six months ago, I started where many other people do: by following the various online tutorials available from UC Berkeley’s AMPLab, the creators of Spark. I quickly developed an appreciation for the elegant, easy-to-use API and super-fast results, and was eager to learn more.

Estimating Financial Risk with Apache Spark

Learn how Spark facilitates the calculation of computationally-intensive statistics such as VaR via the Monte Carlo method.

Under reasonable circumstances, how much money can you expect to lose? The financial statistic value at risk (VaR) seeks to answer this question. Since its development on Wall Street soon after the stock market crash of 1987, VaR has been widely adopted across the financial services industry. Some organizations report the statistic to satisfy regulations, some use it to better understand the risk characteristics of large portfolios, and others compute it before executing trades to help make informed and immediate decisions.

Apache Hive on Apache Spark: Motivations and Design Principles

Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.

Apache Hive is a popular SQL interface for batch processing and ETL using Apache Hadoop. Until recently, MapReduce was the only execution engine in the Hadoop ecosystem, and Hive queries could only run on MapReduce. But today, alternative execution engines to MapReduce are available — such as Apache Spark and Apache Tez (incubating).

Meet the Data Scientist: Sandy Ryza

Meet Sandy Ryza (@SandySifting), the newest member of Cloudera’s data science team. See Sandy present at Spark Summit 2014 (June 30-July 1 in San Francisco; register here for a 20% discount).

What is your definition of a “data scientist”?

This Month in the Ecosystem (May 2014)

Welcome to our ninth edition of “This Month in the Ecosystem,” a digest of highlights from May/early June 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

More good news!

Apache Spark 1.0 is Released

Spark 1.0 is its biggest release yet, with a list of new features for enterprise customers.

Congratulations to the Apache Spark community for today’s release of Spark 1.0, which includes contributions from more than 100 people (including Cloudera’s own Diana Carroll, Mark Grover, Ted Malaska, Sean Owen, Sandy Ryza, and Marcelo Vanzin). We think this release is an important milestone in the continuing rapid uptake of Spark by enterprises — which is supported by Cloudera via Cloudera Enterprise 5 — as a modern, general-purpose processing engine for Apache Hadoop.

Older Posts