How Cloudera Enables R Users to Optimize Their Data Science and Machine Learning Workflows

by Ian Cook

Posted in Technical | January 29, 2020 3 min read

This week, R users from around the world convene in San Francisco for rstudio::conf 2020. With a packed agenda of new package announcements and case studies highlighting successful applications of R across different industries, it’s evident that R and the ecosystem of tools around it make up a vital part of the data science and machine learning landscape.

At Cloudera, we enable numerous users to run data science workloads and deploy machine learning code on our platforms every day. Many of these users rely on R for tasks ranging from cleaning datasets to training deep neural networks. As a long-time R user and package developer, I’m representing Cloudera at rstudio::conf; if you’re attending, stop by my session (described below) or catch me between sessions to pick up a Cloudera hex sticker!

How Cloudera Shows Love to R Users

Cloudera strives to provide a delightful experience to our users who develop and run R code. Here are some of the ways we show love to R users and the R community:

- R is a native language in Cloudera Machine Learning: Cloudera Machine Learning (CML) enables you to develop and run R code in an intuitive browser-based environment, using containers to isolate workloads and provide scalable compute resources on demand. CML is the next generation of Cloudera Data Science Workbench (CDSW) optimized for the cloud. It offers true self-service access, allowing you to safely install and use R packages in self-contained project environments without help from an administrator.
- Train, tune, and deploy models with R code: The Experiments and Models capabilities in CML enable users to train and test machine learning models, iterate to identify the optimal combination of model and hyperparameters, and deploy the model to generate predictions in real time—all using R code.
- Use RStudio in CML: If you know and love the RStudio IDE, you’ll find yourself right at home using RStudio Server inside CML, enabled by the built-in third-party editors capability.
- Deploy Shiny apps seamlessly: The Applications capability in CML enables data scientists to create, launch, and securely share long-running interactive applications with just a few clicks, using Shiny and other web-based app building tools.
- Learn from expert instructors: Cloudera offers several training courses that can be taught using R, including Cloudera Data Scientist Training (available as a private instructor-led training using sparklyr) and CDSW Training (a video-based course on Cloudera OnDemand featuring a track for R users).
- Confidently use Cloudera and RStudio products together: RStudio is a certified Cloudera partner. Cloudera and RStudio can work together to help you find a combination of our products that best solves your business challenges, and to ensure you get the support you need.
- Contributing to the open ecosystem of R packages: Cloudera contributes to the development of packages that help R users work productively on Cloudera platforms. By funding projects like implyr (an optimized dplyr interface for Impala) and tidyquery (a SQL interface to R data frames), Cloudera aims to provide a more uniform experience to R users working with data of all sizes from different sources. See the examples highlighting this below.

SQL or dplyr? Take your pick

Our customers love it when they can use familiar syntax to work with data regardless of its size or its source. The popularity of sparklyr is a case in point: it enables R users to use either SQL or dplyr—both familiar to most R users—to work with large-scale data using Apache Spark. Two R packages developed at Cloudera—implyr and tidyquery—aim to provide this same choice of either SQL or dplyr when querying tables with Apache Impala and when manipulating R data frames.

To set up implyr, first see the details in the implyr README. After installing the implyr package and connecting to Impala as described there, you can query an Impala table either by using dplyr functions such as group_by() and summarise(), or by using a SQL SELECT statement:

With the release of tidyquery, which I’m announcing in a talk here at rstudio::conf, you now have the option to use SQL SELECT statements to query R data frames:

For more details, please see the tidyquery README. If you are attending rstudio::conf, stop by my session “Bridging the gap between SQL and R: Introducing queryparser and tidyquery” and pick up a Cloudera hex sticker!

Ian Cook

Senior Curriculum Developer @ianmcook

More by this author

Editor's Choice

Business

Generative AI for the Enterprise

Technical

Building Trust in Public Sector AI Starts with Trusting Your Data

3 Comments

by curt on Apr 14, 2020 @ 7:21 pm EDT

This is just what I was looking for, Ian, and is very helpful. But it looks like query() needs the dataframe to exist in the top level environment if it is not provided as the first argument. Is there a workaround? Not a lot up on StackExchange just yet. Thanks,
x<-function() {
aa <- data.frame(a=1:10, b=letters[1:10])
rc 5″)
return(rc)
}
x()
# Error: No data frame exists with the name aa

by Ian Cook on Apr 15, 2020 @ 8:47 am EDT

Thank you Curt for reporting this issue. I just fixed this in the version of tidyquery that’s on GitHub. If possible, please install tidyquery from GitHub, using:

remotes::install_github("ianmcook/tidyquery")

I will submit a new version to CRAN soon, and it will include this and some other small bugfixes and improvements.

by curt on Apr 15, 2020 @ 3:42 pm EDT

That was fast, Ian. The update works just fine. And it looks like the code paste got botched, glad you could work through that.

How Cloudera Enables R Users to Optimize Their Data Science and Machine Learning Workflows

How Cloudera Shows Love to R Users

SQL or dplyr? Take your pick

Editor's Choice

3 Comments

Leave a comment Cancel reply