Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more.
Today, Cloudera announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform.
Apache Spark 2.0 is tremendously exciting (read this post for more background) because (among other things):
- The Dataset API further enhances Spark’s claim as the best tool for data engineering by providing compile-time type safety along with the benefits of a query-optimization engine.
- The Structured Streaming API enables the modeling of streaming data as a continuous DataFrame and expresses operations on that data with a SQL-like API.
Last week, the open source Open Network Insights (ONI) project, now called Spot, was accepted into the ASF Incubator. Here are the highlights about its open data model approach and initial use cases.
One of the biggest challenges organizations face today in combating cyber threats is collecting and normalizing data from numerous security event data sources (often up to thousands of them) to build the required analytics.
Impala users can expect new performance and usability benefits via improved integration with Kudu.
It’s been nearly one year since the public beta announcement of Kudu (now a top-level Apache project) and a noteworthy milestone has been reached: its 1.0 release. This is particularly exciting as Kudu extends the use cases that can be supported on the Apache Hadoop platform, whether it be on-premises or in the cloud,
As measured across multiple dimensions (see analysis below), Impala provides a better cloud-native experience than Redshift for a number of common use cases.
Impala 2.6 brings read/write support on Amazon S3, which provides cloud capabilities such as direct querying of data from S3, elastic scaling of compute, and seamless data portability and flexibility that are unique amongst cloud-based analytic databases. With more and more users looking to deploy and run in public-cloud environments,