Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

Categories: Data Science

This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to explore opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.

One thing that struck us was that, while R’s data frames and Python’s pandas data frames use different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, date-and-time, categorical (factors), or string. Additionally, these columns must support missing (null) values.
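To make those shared semantics concrete, here is a toy model in plain Python. It is only an illustration: the `Column` and `Frame` names are hypothetical, `None` stands in for a missing value, and neither R, pandas, nor Feather stores data this way internally.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Column:
    name: str
    dtype: str                   # e.g. "float64", "bool", "datetime", "category", "string"
    values: List[Optional[Any]]  # None stands in for a missing (null) value

    def null_count(self) -> int:
        # Count the entries that are missing.
        return sum(v is None for v in self.values)


class Frame:
    """A list of named columns that must all have the same length."""

    def __init__(self, columns: List[Column]):
        lengths = {len(c.values) for c in columns}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = list(columns)
        self.n_rows = lengths.pop() if lengths else 0

    def __getitem__(self, name: str) -> Column:
        # Look a column up by name.
        for col in self.columns:
            if col.name == name:
                return col
        raise KeyError(name)


frame = Frame([
    Column("price", "float64", [9.5, None, 12.0]),
    Column("in_stock", "bool", [True, False, None]),
    Column("grade", "category", ["a", "b", "a"]),
])
```

Any format shared between the two languages has to carry exactly this information: column names, a type per column, and which entries are missing.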

Around this time, the open source community had just started the new Apache Arrow project, designed to improve data interoperability for systems dealing with columnar tabular data. In discussing Arrow in the context of Python and R, we wanted to see if we could design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.

What is Feather?

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

  • Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
  • Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.
  • High read and write performance. When possible, Feather operations should be bound by local disk performance.

Code Examples

The Feather API is designed to make reading and writing data frames as easy as possible. In R, the code might look like:

Analogously, in Python, we have:
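A minimal sketch of the Python round trip, assuming pandas and the `feather-format` package (imported as `feather`) are installed. `write_dataframe` and `read_dataframe` are the package's functions; the helper name `feather_roundtrip` and the file name are hypothetical.

```python
def feather_roundtrip(df, path):
    """Write a pandas DataFrame to `path` as Feather, then read it back."""
    # Imported here so this sketch can be loaded even where the
    # third-party package is not installed.
    import feather  # provided by the `feather-format` package

    feather.write_dataframe(df, path)
    return feather.read_dataframe(path)


# Example use (requires pandas and feather-format):
#   import pandas as pd
#   df = pd.DataFrame({"x": [1.0, 2.0, None], "y": ["a", "b", "c"]})
#   same_df = feather_roundtrip(df, "my_data.feather")
```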

How Fast is Feather?

Feather is extremely fast. Because Feather does not currently use any compression internally, it works best with solid-state drives, such as those in most of today’s laptops. For this first release, we prioritized a simple implementation and are thus writing unmodified Arrow memory to disk.
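The idea of writing a column's memory to disk unchanged can be sketched with the standard library alone. This is not the Feather file layout (there is no metadata, null handling, or type information here), and the function names are hypothetical; it only shows why such a write involves no encoding step.

```python
import array
import os
import tempfile


def write_column(path, values):
    """Dump a column of doubles to disk exactly as it sits in memory."""
    buf = array.array("d", values)  # contiguous 8-byte floats, native byte order
    with open(path, "wb") as f:
        buf.tofile(f)               # raw bytes, no compression or conversion
    return len(buf)


def read_column(path, n_values):
    """Load n_values doubles straight back into a contiguous buffer."""
    buf = array.array("d")
    with open(path, "rb") as f:
        buf.fromfile(f, n_values)
    return buf


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    n = write_column(path, [1.5, 2.0, -3.25])
    size_on_disk = os.path.getsize(path)  # 8 bytes per value
    restored = read_column(path, n)
```

Because nothing is transformed on the way out or in, the cost is dominated by how quickly the disk can move the bytes, which is why an uncompressed format benefits from fast storage.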

To give you an idea of Feather’s speed, here is a Python benchmark writing an approximately 800MB pandas DataFrame to disk:
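A sketch of such a benchmark, assuming NumPy, pandas, and `feather-format` are installed. The data construction follows the interpreter session a reader pasted in the comments on this post; the timing wrapper and the name `time_feather_write` are hypothetical additions.

```python
import time


def time_feather_write(path, n_rows=10_000_000, n_cols=10):
    """Build a frame of float64 columns and time one Feather write, in seconds."""
    # Third-party imports are kept local so the sketch loads without them.
    import numpy as np
    import pandas as pd
    import feather  # provided by the `feather-format` package

    arr = np.random.randn(n_rows)
    arr[::10] = np.nan  # 10% nulls
    df = pd.DataFrame({"column_{0}".format(i): arr for i in range(n_cols)})

    start = time.perf_counter()
    feather.write_dataframe(df, path)
    return time.perf_counter() - start


# Example use (writes roughly 800 MB to the current directory):
#   seconds = time_feather_write("test.feather")
```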

On Wes’s laptop (latest-gen Intel processor with SSD), this takes:

This is an effective write throughput of over 600 MB/s. Of course, the performance you see will depend on your hardware configuration.
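The arithmetic behind that figure can be checked in a few lines. The shape used here (10 columns of 10,000,000 float64 values) is an assumption taken from the session a reader quotes in the comments; it matches the "approximately 800MB" stated above.

```python
N_ROWS = 10_000_000
N_COLS = 10
BYTES_PER_VALUE = 8  # one float64


def megabytes(n_rows, n_cols, bytes_per_value=BYTES_PER_VALUE):
    """Size of the raw column data in MB (1 MB = 10**6 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e6


def throughput_mb_per_s(size_mb, seconds):
    """Effective throughput for moving size_mb in the given wall time."""
    return size_mb / seconds


size_mb = megabytes(N_ROWS, N_COLS)  # 800.0
# Writing 800 MB at more than 600 MB/s means the write took
# less than this many seconds (about 1.33).
slowest_write_s = size_mb / 600
```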

And in R (on Hadley’s laptop, which is very similar):

How Can I Get Feather?

The Feather source code is hosted at http://github.com/wesm/feather.

Installing Feather for R

Feather is currently available from GitHub, and you can install it with:

Feather uses C++11, so if you’re on Windows, you’ll need the new gcc 4.9.3 toolchain. (All going well, this toolchain will be included in R 3.3.0, which is scheduled for release within weeks. We’ll aim for a CRAN release soon after that.)

Installing Feather for Python

For Python, you can install Feather from PyPI with pip install feather-format. Note that the PyPI package named feather is a different project.

Feather has only been tested on OS X and Linux, and requires a C++11 compiler (gcc 4.8 or higher, or Xcode 6 or higher). We will look into providing more installation options, such as conda builds, in the future.

When Should You Not Use Feather?

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable across versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

Feather, Apache Arrow, and the Community

One of the great parts of Feather is that the file format is language agnostic. Other languages, such as Julia or Scala (for Spark users), can read and write the format without knowledge of details of Python or R.

Feather is one of the first projects to bring the tangible benefits of the Arrow spec to users in the form of an efficient, language-agnostic representation of tabular data on disk. Since Arrow does not define a file format, we are using Google’s FlatBuffers library to serialize column types and related metadata in a language-independent way in the file.

The Python interface uses Cython to expose Feather’s C++11 core to users, while the R interface uses Rcpp for the same task.

We’re interested in evolving the Feather format to support the needs of more Python and R users, as well as in seeing bindings for Feather built in other data analysis languages. There will also be plenty of opportunities to get involved in improving Feather’s performance or storage efficiency through compression or other techniques.

Wes McKinney is a Software Engineer at Cloudera, the founder of the pandas and Ibis projects, and an Apache Arrow committer. 

Hadley Wickham is Chief Scientist at RStudio and Adjunct Professor of Statistics at Rice University.

13 responses on “Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow”

  1. Scott Paul Jones

    This seems like a very useful package, and in particular, I was curious about support for Julia.
    Is there a pure C API for Feather? Currently, Julia cannot directly call in to C++ (although hopefully Keno Fischer’s Cxx.jl package will be ready for v0.5 of Julia soon), so that is an issue.
    Also, looking a bit at the code, it seems like it might be better to create a native version in Julia, rather than have the overhead of calling into a C++ library.
    Another question I have is if there is any way to add new types to the format? (Even if they are just opaque fixed length or variable length items for the languages that don’t support those types). Examples might be: 128-bit signed/unsigned integers, IEEE 754-2008 packed or binary format decimal floating point numbers, arbitrary precision integers and floats (BigInt, BigFloat in Julia).
    In Julia, you might want to use Feather as mentioned above: “for short-term storage of data frames as part of some analysis.”, where using some types not supported in all languages using Feather was not an issue.

  2. Matthew Pancia

    This is awesome, thanks!

    FYI: There are smart quotes in the R code to install the package, which makes it not work when you copy it.

  3. Blaine Mooers

    I installed feather with pip on a Mac with Yosemite and macports python.

    The python example is not working. Feather is missing the write_dataframe method.

    bash-3.2$ python
    Python 2.7.11 (default, Mar 1 2016, 19:44:31)
    [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import feather
    >>> import pandas as pd
    /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
    warnings.warn(self.msg_depr % (key, alt_key))
    >>> import numpy as np
    >>> arr = np.random.randn(10000000)
    >>> # 10% nulls
    ... arr[::10] = np.nan
    >>> df = pd.DataFrame({'column_{0}'.format(i): arr for i in range(10)})
    >>> feather.write_dataframe(df, 'test.feather')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'write_dataframe'

    1. Sean

      Had the same problem! My guess is you did the following: pip install feather
      This is the wrong package. You need to: pip install feather-format
      That should solve it. :)

  4. Nathaniel

    Does one have to worry about endian-ness with regards to feather? i.e. if at some point in time, I’m transferring the data from a little-endian system to a big-endian one?

  5. Lance

    Looks cool. Only concern is that I might not be able to read the data later if I upgraded a version. I almost never have to do this anyway so it might not be a justified concern.

  6. Jeff

    Maybe you don’t care but if you want this to travel outside of academia I suggest two enhancements. First, remove dependencies on software with viral licenses. Second, add encryption and decryption.

  7. Evgeniy

    >> One of the great parts of Feather is that the file format is language agnostic. Other languages, such as Julia or Scala (for Spark users), can read and write the format without knowledge of details of Python or R.

    I’m struggling to find a library to operate with this format in Java or Scala. Do you know if there are any?
