Cloudera Engineering Blog · Spark Posts
Our thanks to Mayur Rustagi (@mayur_rustagi), CTO at Sigmoid Analytics, for allowing us to re-publish his post about the Spork (Pig-on-Spark) project below. (Related: Read about the ongoing upstream to bring Spark-based data processing to Hive here.)
Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis. At Sigmoid Analytics, we want to streamline this data processing pipeline so that analysts can truly focus on value generation and not data preparation.
The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
Venerable MapReduce has been Apache Hadoop‘s work-horse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing, and batch-oriented ETL (extract-transform-load) operations.
The versatility of Apache Spark’s API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the real world.
Few things help you concentrate like a last-minute change to a major project.
Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that is capably enabled by Apache Spark.
During my internship at Cloudera, I have been working on integrating PyMC with Apache Spark. PyMC is an open source Python package that allows users to easily apply Bayesian machine learning methods to their data, while Spark is a new, general framework for distributed computing on Hadoop. Together, they provide a scalable framework for scalable Markov Chain Monte Carlo (MCMC) methods. In this blog post, I am going to describe my work on distributing large-scale graphical models and MCMC computation.
Markov Chain Monte Carlo Methods
IPython Notebook and Spark’s Python API are a powerful combination for data science.
The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPythonshell instead of the builtin Python REPL.
Spark 1.0 reflects a lot of hard work from a very diverse community.
Cloudera’s latest platform release, CDH 5.1, includes Apache Spark 1.0, a milestone release for the Spark project that locks down APIs for Spark’s core functionality. The release reflects the work of hundreds of contributors (including our own Diana Carroll, Mark Grover, Ted Malaska, Colin McCabe, Sean Owen, Hari Shreedharan, Marcelo Vanzin, and me).
While the new Spark Developer training from Cloudera University is valuable for developers who are new to Big Data, it’s also a great call for MapReduce veterans.
When I set out to learn Apache Spark (which ships inside Cloudera’s open source platform) about six months ago, I started where many other people do: by following the various online tutorials available from UC Berkeley’s AMPLab, the creators of Spark. I quickly developed an appreciation for the elegant, easy-to-use API and super-fast results, and was eager to learn more.
Learn how Spark facilitates the calculation of computationally-intensive statistics such as VaR via the Monte Carlo method.
Under reasonable circumstances, how much money can you expect to lose? The financial statistic value at risk (VaR) seeks to answer this question. Since its development on Wall Street soon after the stock market crash of 1987, VaR has been widely adopted across the financial services industry. Some organizations report the statistic to satisfy regulations, some use it to better understand the risk characteristics of large portfolios, and others compute it before executing trades to help make informed and immediate decisions.
Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.
Apache Hive is a popular SQL interface for batch processing and ETL using Apache Hadoop. Until recently, MapReduce was the only execution engine in the Hadoop ecosystem, and Hive queries could only run on MapReduce. But today, alternative execution engines to MapReduce are available — such as Apache Spark and Apache Tez (incubating).
What is your definition of a “data scientist”?