Tag Archives: python

Building a Data Science Portfolio: Storytelling with Data (Part 2: Data Exploration)

Categories: Data Science Guest

The following post (Part 2 of two parts) by Vik Paruchuri, founder of data science learning platform Dataquest, offers some detailed and instructive insight about data science workflow (regardless of the tech stack involved, but in this case, using Python). We re-publish it here for your convenience.

Before we dive into exploring the data [see Part 1 for steps relating to data preparation], we’ll want to set the context,

Read More

Building a Data Science Portfolio: Storytelling with Data

Categories: Data Science Guest

The following post by Vik Paruchuri, founder of data science learning platform Dataquest, offers some detailed and instructive insight about data science workflow (regardless of the tech stack involved, but in this case, using Python). We re-publish it here for your convenience.

Data science companies are increasingly looking at portfolios when making hiring decisions. One of the reasons for this is that a portfolio is the best way to judge someone’s real-world skills.

Read More

Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive

Categories: Data Science Hive Impala Tools

This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.

Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating to create a new C++ library to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Hive and Impala.

Read More

Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

Categories: Data Science

This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to explore opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.

One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same.

Read More

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard

Categories: Data Science General HDFS Impala Kudu Performance

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange. Here’s how it works.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

  • A columnar memory-layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors.

Read More