New in CDH 5.1: Apache Spark 1.0

Spark 1.0 reflects a lot of hard work from a very diverse community.

Cloudera’s latest platform release, CDH 5.1, includes Apache Spark 1.0, a milestone release for the Spark project that locks down APIs for Spark’s core functionality. The release reflects the work of hundreds of contributors (including our own Diana Carroll, Mark Grover, Ted Malaska, Colin McCabe, Sean Owen, Hari Shreedharan, Marcelo Vanzin, and me).

In this post, we’ll describe some of Spark 1.0’s changes and new features that are relevant to CDH users.

API Incompatibilities

In anticipation of features coming down the pike, the release includes a few incompatible changes that will let Spark avoid breaking compatibility in the future. Most applications will require a recompile to run against Spark 1.0, and some will require changes in source code.

There are two changes in the core Scala API:

  • The cogroup and groupByKey operators now return Iterables over their values instead of Seqs. This change means that the set of values corresponding to a particular key need not all reside in memory at the same time. (See the sketch after this list.)
  • SparkContext.jarOfClass now returns Option[String] instead of Seq[String].
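
To illustrate the first change, here is a minimal sketch (assuming an existing SparkContext named sc; the data is made up):

    import org.apache.spark.SparkContext._  // pair-RDD operations

    // A small pair RDD with duplicate keys.
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // In 1.0 this is RDD[(String, Iterable[Int])] rather than
    // RDD[(String, Seq[Int])], so values can be consumed incrementally.
    val grouped: org.apache.spark.rdd.RDD[(String, Iterable[Int])] =
      pairs.groupByKey()
    grouped.mapValues(_.sum).collect()  // e.g. Array((a,3), (b,3))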

Spark’s Java APIs were updated to accommodate Java 8 lambdas. Details on these changes are available under the Java section here.

The MLlib API contains a set of changes that allow it to support sparse and dense vectors in a unified way. Details on these changes are available here. (Note that MLlib is still a beta component, meaning that its APIs may change in the future.)
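
As a rough sketch of the unified interface (types and values are illustrative):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Dense and sparse representations share the same Vector interface,
    // so MLlib algorithms can accept either one.
    val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
    // A sparse vector of size 3 with non-zero entries at indices 0 and 2.
    val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))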

Details on the future-proofing of Spark Streaming APIs are available here.

spark-submit

Most Spark programming examples focus on spark-shell, but prior to 1.0, users who wanted to submit compiled Spark applications to a cluster faced a convoluted process: guess-and-check, different invocations depending on the cluster manager and deploy mode, and oodles of boilerplate. Spark 1.0 includes spark-submit, a command that abstracts across the variety of deploy modes Spark supports and takes care of assembling the classpath for you. A sample invocation (the class, jar, and application arguments are placeholders):
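
    # Launch a compiled Spark application on YARN in cluster mode.
    spark-submit \
      --class com.example.YourApp \
      --master yarn-cluster \
      --executor-memory 2g \
      yourapp.jar arg1 arg2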

 

Avro Support

We fixed a couple of critical Apache Avro bugs that were preventing Spark from reading and writing Avro data. Stay tuned for a future post explaining best practices for interacting with Avro and Apache Parquet (incubating) data from Spark.

PySpark on YARN

One of the remaining gaps in Spark on YARN compared to other cluster managers was the lack of PySpark support. Spark 1.0 allows you to launch PySpark apps against YARN clusters, though PySpark currently works only in yarn-client mode. Starting a PySpark shell against a YARN installation is as simple as running:
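
    # Launch the PySpark shell against YARN (yarn-client is the only
    # deploy mode PySpark supports in 1.0); assumes pyspark is on the PATH.
    MASTER=yarn-client pyspark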

 

and running a PySpark script is as simple as running:
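
    # Submit a PySpark script to YARN; the script name and arguments
    # are placeholders.
    spark-submit --master yarn-client yourscript.py arg1 arg2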

 

Spark History Server

A common complaint about Spark has been that the per-application UI, which displays task metrics and other useful information, disappears after an app completes, leaving users with little to go on when debugging failures. Spark 1.0 addresses this with a History Server that displays information about applications after they have completed. Cloudera Manager provides easy setup and configuration of this daemon.
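
For clusters not managed by Cloudera Manager, the manual setup is roughly the following (a sketch based on the Spark 1.0 configuration properties; the HDFS path is a placeholder):

    # In spark-defaults.conf, have applications log their events:
    #   spark.eventLog.enabled   true
    #   spark.eventLog.dir       hdfs:///user/spark/applicationHistory
    # Then start the daemon, pointing it at the same directory:
    ./sbin/start-history-server.sh hdfs:///user/spark/applicationHistory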

Spark SQL

Spark SQL, which deserves a blog post of its own, is a new Spark component that lets you run SQL queries inside a Spark application, manipulating and producing RDDs. Because of its immaturity and alpha-component status, Cloudera does not currently offer commercial support for Spark SQL. However, we bundle it with our distribution so that users can try it out.
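
As a taste, here is a minimal sketch using the 1.0 Scala API (the case class, data, and table name are illustrative; sc is an existing SparkContext):

    import org.apache.spark.sql.SQLContext

    // An illustrative record type; Spark SQL infers the schema from it.
    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

    // Build an RDD, register it as a table, and query it with SQL.
    val people = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 29)))
    people.registerAsTable("people")
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 30")
    adults.map(row => row(0)).collect()  // the result is itself an RDD of Rows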

A Note on Stability

While we at Cloudera are quite bullish on Spark, it's important to acknowledge that even its core components are not yet as stable as many of the more mature Hadoop ecosystem components. The 1.0 mark does not mean that Spark is now bug-free and ready to replace all your production MapReduce uses, but it does mean that people building apps on top of Spark core should be safe from surprises in future releases: existing APIs will maintain compatibility, existing deploy modes will remain supported, and the general architecture will remain the same.

Conclusion

Cloudera engineers are working hard to make Spark more stable, easier to use, easier to debug, and easier to manage. Expect future releases to bring greater robustness, enhanced scalability, and deeper insight into what is going on inside a Spark application, both while it is running and after it has completed.

Sandy Ryza is a data scientist at Cloudera, and an Apache Hadoop committer.

14 Responses
  • Alonso Isidoro Roman / July 29, 2014 / 11:21 AM

    Hi, is it possible to upgrade from CDH5.0 to CDH5.1 with latest spark support through the vmware image?

    Thank you very much

    • Justin Kestelyn (@kestelyn) / July 29, 2014 / 12:05 PM

I think it would be easier/cleaner to simply import the CDH 5.1 version of the QuickStart VM (which contains Spark 1.0), unless you have brought some of your own data into your VM.

  • msjian / August 08, 2014 / 12:05 AM

hi, I didn't find a way to install Shark, can you help me?

  • msjian / August 15, 2014 / 1:04 AM

    hi, thanks for your reply
    now, i used CDH5.1.0, and find spark in
    http://archive.cloudera.com/spark/parcels/
    seems all are the same for CDH4,
    do you have the spark for version CDH5.1.0?
    or do you know how can i downgrade my CDH to 4.6
    thanks

  • msjian / August 17, 2014 / 7:00 PM

    i use
    CDH-5.1.0-1.cdh5.1.0.p0.53
    scala-2.10.3
    shark-0.9.2
    spark 1.0.0
    Hive – Version 0.12.0
    Hadoop 2.3.0-cdh5.1.0

    after Compile Shark:$ sbt/sbt package, get the follows error

    java.lang.RuntimeException: Server IPC version 9 cannot communicate with client version 4
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1362)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1146)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:952)
    at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:340)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
    at shark.SharkCliDriver$.main(SharkCliDriver.scala:237)
    at shark.SharkCliDriver.main(SharkCliDriver.scala)
    FAILED: Execution Error, return code -101 from shark.execution.SparkTask

    i tried to test the shark both 0.9.1 or 0.9.2 on CDH5.0.0 or CDH5.1.0
    all of them will get the above error.
    i don’t know how to fix this.

    when i input the command “show tables” in the spark shell, can work well.
    but input “select * from src” this kind command will get this error.

  • Vladimir / August 21, 2014 / 12:59 PM

    Looking forward for the post explaining best practices on interacting with Avro.
    We store all our data in Avro format, and I would like to introduce our engineering team to Apache Spark – so reading Avro would be the first question they would ask.
    Any idea when it will be available?

  • Waled Tayib / August 25, 2014 / 2:34 PM

Is it possible to download the latest Spark from the Spark website and somehow connect that to YARN from Cloudera? The issue is, with all the new classes available in the newer versions of Spark, I need to use the newer Spark distros. But I want all my normal Hadoop components with Cloudera.

    • Justin Kestelyn (@kestelyn) / August 25, 2014 / 2:46 PM

The current version upstream is only 1.0.2, whereas CDH 5.1 ships with 1.0. That's not much of a gap, I think.

      If it is too wide a gap for you, I suppose you could try to use the upstream version — but it hasn’t been tested yet with CDH.

  • rakesh / November 10, 2014 / 8:46 AM

    Hello!!

    I am having CDH 5 installed on my cluster, version:
    **********************************
    Hadoop 2.3.0-cdh5.1.3
    Subversion git://github.sf.cloudera.com/CDH/cdh.git -r 8e266e052e423af592871e2dfe09d54c03f6a0e8
    **********************************
    I have installed and configured a prebuilt version of Spark 1.1.0 (Apache Version), built for hadoop 2.3 on my cluster.

when I run the Pi example in 'client mode', it runs successfully, but it fails in 'yarn-cluster' mode. The Spark job is successfully submitted, but fails after some time saying:
    ***********************************
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 2 --driver-memory 500m --executor-cores 2 lib/spark-examples*.jar 3

    Logs:
    14/11/05 20:47:47 INFO yarn.Client: Application report from ResourceManager:
    application identifier: application_1415193640322_0013
    appId: 13
    clientToAMToken: null
    appDiagnostics: Application application_1415193640322_0013 failed 2 times due to AM Container for appattempt_1415193640322_0013_000002 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
    org.apache.hadoop.util.Shell$ExitCodeException:
    ***********************************

Can you please suggest any solution? Do you think I should compile the Spark code on my cluster, or should I use the Spark provided with CDH 5.1?

    Any help will be appreciated!

    • Justin Kestelyn (@kestelyn) / November 10, 2014 / 8:50 AM

      Rakesh,

      I suggest you post your issue to the “Advanced Analytics” area at community.cloudera.com.

  • RakeshGupta / November 11, 2014 / 4:35 AM

    Thanks Justin!
