Category Archives: Data Science

Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive

Categories: Data Science Hive Impala Tools

This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.

Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating to create a new C++ library to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Hive and Impala.

Read More

How-to: Use Impala and Kudu Together for Analytic Workloads

Categories: Data Science Hadoop How-to Impala Kudu Performance

Using Apache Impala (incubating) on top of Apache Kudu (incubating) has significant performance benefits

Apache Kudu (incubating) is the newest addition to the set of storage engines that integrate with the Apache Hadoop ecosystem. The promise of Kudu is to deliver high-scan performance, targeting analytical workloads, while allowing users to concurrently insert, update, and delete records. With these properties, Kudu becomes a viable alternative to existing combinations of HDFS and/or Apache HBase to achieve similar results with less complicated ETL pipelines,

Read More

Genome Analysis Toolkit: Now Using Apache Spark for Data Processing

Categories: Data Science Spark Use Case

Users of the latest release of the Genome Analysis Toolkit, an open source framework for analyzing high-throughput DNA sequencing data, can now choose Apache Spark for data processing.

Ever since the Human Genome Project produced the first draft sequence of the human genome in 2000, the cost of sequencing has dropped exponentially, from around US$100 million per genome then to around US$1,000 today. Over the same period, we have seen massive growth in the storage and processing capabilities of big data technologies like Apache Hadoop.

Read More

Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

Categories: Data Science

This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to explore opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.

One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same.

Read More

Meet the Authors: “Data Analytics with Hadoop” from O’Reilly Media

Categories: Books Data Science General Hadoop

I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.

Why did you decide to write this book?

Ben: The content was originally part of a class that Jenny and I were teaching together.

Read More