Since the launch of sparklyr, working with Apache Spark in Apache Hadoop has become much easier for R users. sparklyr contains a dplyr interface into Spark and allows users to leverage crucial machine learning algorithms from Spark MLlib and H2O Sparkling Water. This greatly reduces the barrier of entry for R users in adopting Spark as a tool for big data and should go a long way in enabling R workloads to migrate to Hadoop.
A common design pattern often emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to EDH, where they are then unpacked, transformed into optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or happen very often, these simple steps can significantly slow down an ingest pipeline. Part of this delay is inevitable;
Get started with scalable graph analysis via simple examples that utilize GraphFrames and Spark SQL on HDFS.
Graphs—also known as “networks”—are ubiquitous across web applications. As a refresher, a graph consists of nodes and edges. A node can be any object, such as a person or an airport, and an edge is a relation between two nodes, such as a friendship or an airline connection between two cities.
Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more.
Today, Cloudera announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform.
- The Dataset API further enhances Spark’s claim as the best tool for data engineering by providing compile-time type safety along with the benefits of a query-optimization engine.
- The Structured Streaming API enables the modeling of streaming data as a continuous DataFrame and expresses operations on that data with a SQL-like API.