Category Archives: Data Science

Introducing sparklyr, an R Interface for Apache Spark

Categories: Data Science, Guest, Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.

Over the past couple of years, we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more.

Read More

Solving Real-Life Mysteries with Big Data and Apache Spark

Categories: Data Science, Spark

Can using simple statistical techniques in combination with big data help solve the Tamam Shud mystery?

Everyone loves a good real-life mystery. That’s why the three most popular TV shows of the 80s and 90s were Jack Palance’s reboot of Ripley’s Believe It or Not!, Unsolved Mysteries with Robert Stack, and Beyond Belief: Fact or Fiction hosted by Commander Riker.

Read More

Building a Data Science Portfolio: Storytelling with Data (Part 2: Data Exploration)

Categories: Data Science, Guest

The following post (Part 2 of 2) by Vik Paruchuri, founder of data science learning platform Dataquest, offers detailed and instructive insight into data science workflow (regardless of the tech stack involved, but in this case, using Python). We republish it here for your convenience.

Before we dive into exploring the data [see Part 1 for steps relating to data preparation], we’ll want to set the context,

Read More

Building a Data Science Portfolio: Storytelling with Data

Categories: Data Science, Guest

The following post by Vik Paruchuri, founder of data science learning platform Dataquest, offers detailed and instructive insight into data science workflow (regardless of the tech stack involved, but in this case, using Python). We republish it here for your convenience.

Data science companies are increasingly looking at portfolios when making hiring decisions. One of the reasons for this is that a portfolio is the best way to judge someone’s real-world skills.

Read More

Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive

Categories: Data Science, Hive, Impala, Tools

This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.

Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating on a new C++ library intended to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Hive and Impala.

Read More