New in CDH 5.1: Apache Spark 1.0
Spark 1.0 reflects a lot of hard work from a very diverse community.
Cloudera’s latest platform release, CDH 5.1, includes Apache Spark 1.0, a milestone release for the Spark project that locks down APIs for Spark’s core functionality. The release reflects the work of hundreds of contributors (including our own Diana Carroll, Mark Grover, Ted Malaska, Colin McCabe, Sean Owen, Hari Shreedharan, Marcelo Vanzin, and me).
In this post, we’ll describe some of Spark 1.0’s changes and new features that are relevant to CDH users.
In anticipation of some features coming down the pipe, the release includes a few incompatible changes that will enable Spark to avoid breaking compatibility in the future. Most applications will require a recompile to run against Spark 1.0, and some will require changes in source code.
There are two changes in the core Scala API:
groupByKeyoperators now return
Iterators over their values instead of
Seqs. This change means that the set of values corresponding to a particular key need not all reside in memory at the same time.
Spark’s Java APIs were updated to accommodate Java 8 lambdas. Details on these changes are available under the Java section here.
The MLLib API contains a set of changes that allows it to support sparse and dense vectors in a unified way. Details on these changes are available here. (Note that MLLib is still a beta component, meaning that its APIs may change in the future.)
Details on the future-proofing of Spark streaming APIs are available here.
Most Spark programming examples focus on spark-shell, but prior to 1.0, users who wanted to submit compiled Spark applications to a cluster found it to be a convoluted process requiring guess-and-check, different invocations depending on the cluster manager and deploy mode, and oodles of boilerplate. Spark 1.0 includes
spark-submit, a command that abstracts across the variety of deploy modes that Spark supports and takes care of assembling the classpath for you. A sample invocation:
spark-submit --class com.yourcompany.MainClass --deploy-mode cluster --master yarn appjar.jar apparg1 apparg2
We fixed a couple critical Apache Avro bugs that were preventing Spark from reading and writing Avro data. Stay tuned for a future post explaining best practices on interacting with Avro and Apache Parquet (incubating) data from Spark.
PySpark on YARN
One of the remaining items in Spark on YARN compared to other cluster managers was lack of PySpark support. Spark 1.0 allows you to launch PySpark apps against YARN clusters. PySpark currently only works in yarn-client mode. Starting a PySpark shell against a YARN installation is as simple as running:
and running a PySpark script is as simple as running:
spark-submit --master yarn yourscript.py apparg1 apparg1
Spark History Server
A common complaint with Spark has been that the per-application UI, which displays task metrics and other useful information, disappears after an app completes. That leaves users in a rut when trying to debug failures. Instead, Spark 1.0 offers a History Server that displays information about applications after they have completed. Cloudera Manager provides easy setup and configuration of this daemon.
Spark SQL, which deserves a blog post of its own, is a new Spark component that allows you to run SQL statements inside of a Spark application that manipulate and produce RDDs. Due to its immaturity and alpha component status, Cloudera does not currently offer commercial support for Spark SQL. However, we bundle it with our distribution so that users can try it out.
A Note on Stability
While we at Cloudera are quite bullish on Spark, it’s important the acknowledge that even its core components are not yet as stable as many of the more mature Hadoop ecosystem components. The 1.0 mark does not mean that Spark is now bug-free and ready to replace all your production MapReduce uses — but it does mean that people building apps on top of Spark core should be safe from surprises in future releases. Existing APIs will maintain compatibility, existing deploy modes will remain supported, and the general architecture will remain the same.
Cloudera engineers are working hard to make Spark more stable, easier to use, easier to debug, and easier to manage. Expect future releases to greater robustness, enhanced scalability, and deeper insight into what is going on inside a Spark application, both while running and after it has completed.
Sandy Ryza is a data scientist at Cloudera, and an Apache Hadoop committer.