We have published several blog posts about sparklyr (introduction, automation), which lets you analyze big data with Apache Spark seamlessly from R. Developed by RStudio, sparklyr is an R interface to Spark that lets you use Spark as the backend for dplyr, the popular data manipulation package for R.
If you are interested in sparklyr, you can learn how to use it with the official documentation,
Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:
- Interactively manipulate Spark data using both dplyr and SQL (via DBI).
This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to explore opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.
One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same.
Creating and training machine-learning models is more complex on distributed systems, but there are lots of frameworks for abstracting that complexity.
There are more options than ever from proven open source projects for doing distributed analytics, with Python and R becoming increasingly popular. In this post, you’ll learn the options for setting up a simple read-eval-print-loop (REPL) environment with Python and R within the Cloudera QuickStart VM, using APIs for two of the most popular cluster computing frameworks: Apache Spark (with MLlib) and H2O (from the company of the same name).
Data scientists have hundreds of probability distributions from which to choose. Where to start?
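One low-stakes way to start is to draw samples from a few common distributions and check that the sample mean lands near the theoretical mean. The sketch below is only an illustration using Python's standard library. The choice of distributions, their parameters, the seed, and the sample size are all assumptions made for the example, not recommendations from the original post.

```python
import random
import statistics

# Fixed seed (an arbitrary choice) so repeated runs give the same draws.
rng = random.Random(42)
N = 50_000  # number of draws per distribution, chosen for illustration

# Each entry maps a label to (zero-argument sampler, theoretical mean).
distributions = {
    "uniform(0, 1)": (lambda: rng.uniform(0, 1), 0.5),
    "normal(mu=10, sigma=2)": (lambda: rng.gauss(10, 2), 10.0),
    # expovariate takes the rate lambda; the mean is 1 / lambda.
    "exponential(rate=0.5)": (lambda: rng.expovariate(0.5), 2.0),
    # A Bernoulli trial built from a uniform draw; the mean equals p.
    "bernoulli(p=0.3)": (lambda: 1 if rng.random() < 0.3 else 0, 0.3),
}

sample_means = {}
for name, (draw, expected) in distributions.items():
    samples = [draw() for _ in range(N)]
    sample_means[name] = statistics.fmean(samples)
    print(f"{name:24s} sample mean = {sample_means[name]:.3f} (theory: {expected})")
```

With this many draws, each sample mean should sit close to its theoretical value, which gives a quick sanity check before reaching for less familiar distributions.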
Data science, whatever it may be, remains a big deal. “A data scientist is better at statistics than any software engineer,” you may overhear a pundit say, at your local tech get-togethers and hackathons. The applied mathematicians have their revenge, because statistics hasn’t been this talked-about since the roaring 20s. They have their own legitimizing Venn diagram of which people don’t make fun.