Learn about the architecture of Ibis, the roadmaps for Ibis and Impala, and how to get started and contribute.
We created Ibis, a new Python data analysis framework now incubating in Cloudera Labs, with the goal of enabling data scientists and data engineers to be as productive working with big data as they are working with small and medium data today. In doing so, we will enable Python to become a true first-class language for Apache Hadoop, without compromises in functionality, usability, or performance. Having spent much of the last decade improving the usability of the single-node Python experience (with pandas and other projects), we are looking to achieve:
- 100% Python end-to-end user workflows
- Native hardware speeds for a broad set of use cases
- Full-fidelity data analysis without extractions or sampling
- Scalability for big data
- Integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, and so on)
(Read more about the technical vision for Ibis in this post.)
The Ibis user interface centers around a general pandas-like data expression API. Users compose relational algebra, data transformations, and analytics on data in HDFS, and these operations are executed transparently, returning results in the familiar pandas data frame format.
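To make the idea of composing expressions that are executed later more concrete, here is a minimal sketch of a deferred expression builder. It is a toy illustration only and not the Ibis API: the `TableExpr` class, its methods, and the `purchases` table are hypothetical names chosen for this example, and it only produces a SQL string rather than running anything on Impala.

```python
# Toy illustration of deferred expressions. This is NOT the Ibis API;
# TableExpr and all of its methods are hypothetical names.
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class TableExpr:
    """An immutable, deferred description of a query over a named table."""
    name: str
    predicates: Tuple[str, ...] = ()
    group_keys: Tuple[str, ...] = ()
    aggregates: Tuple[str, ...] = ()
    row_limit: Optional[int] = None

    def filter(self, predicate: str) -> "TableExpr":
        # Each call returns a new expression; nothing is executed yet.
        return replace(self, predicates=self.predicates + (predicate,))

    def group_by(self, *keys: str) -> "TableExpr":
        return replace(self, group_keys=self.group_keys + keys)

    def aggregate(self, **metrics: str) -> "TableExpr":
        # metrics maps an output column name to a SQL aggregate expression.
        aggs = tuple(f"{expr} AS {alias}" for alias, expr in metrics.items())
        return replace(self, aggregates=self.aggregates + aggs)

    def limit(self, n: int) -> "TableExpr":
        return replace(self, row_limit=n)

    def compile(self) -> str:
        """Translate the accumulated operations into a single SQL string."""
        columns = ", ".join(self.group_keys + self.aggregates) or "*"
        sql = f"SELECT {columns} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql


expr = (
    TableExpr("purchases")
    .filter("amount > 0")
    .group_by("region")
    .aggregate(total="SUM(amount)")
    .limit(10)
)
print(expr.compile())
```

In this sketch the user only chains Python method calls; the translation to the engine's query language happens in one place, at `compile()`. A real system would then send the query to the execution engine and hand the result back as a pandas data frame.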
This first beta release of Ibis utilizes Impala as a first-class execution engine. It includes comprehensive support for the analytical capabilities provided by Impala, with tools to drastically simplify ETL workflows, data wrangling, and analysis tasks on top of HDFS.
We have focused Ibis development on integration with Impala due to architectural synergies (namely, Impala’s C++ and LLVM-based engine) that will enable Python for the first time to achieve native hardware performance at Hadoop scale and integrate with the existing Python data analysis and high-performance computing ecosystem. See the roadmap below (and the recently updated Impala roadmap) for more detail on how we will achieve these goals.
Ibis Source Code, Installation, and Trying It Out
Ibis is a 100% open source, Apache-licensed codebase hosted on GitHub.
The Ibis documentation includes installation instructions, API documentation, a tutorial, and details on how to get involved in the development process.
As Ibis currently requires Impala, we have created a standalone Linux virtual machine based on the Cloudera QuickStart VM to enable users on OS X and Windows to try out Ibis. See complete instructions here.
Roadmaps for Ibis and Impala
As part of the Ibis roadmap, we are evolving Impala to enable Ibis users to achieve these performance and scalability goals with user-supplied Python code. In particular, the immediate areas of focus are:
- More natural data modeling through support for Impala’s forthcoming complex types, enabling expressive analytics on JSON-like data and allowing for more powerful user-defined functions
- New in-memory columnar format to enable user-defined Python logic to execute efficiently on complex data without serialization overhead
- Efficient interpreted user-defined functions that will enable full use of existing Python libraries
- LLVM IR generation to achieve native hardware performance from user-defined Python code, integrating with the new canonical columnar data format
- Machine-learning capabilities through integration with MADlib
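As a rough illustration of why a columnar layout helps user-defined Python logic, the sketch below contrasts row-oriented records with per-column typed buffers using only the standard library. It does not represent the planned Impala format: the column names, the `apply_udf` helper, and the `add_tax` function are hypothetical and exist only for this example.

```python
# Toy contrast between row-oriented and column-oriented layouts.
# This does not model Impala's in-memory format; all names are hypothetical.
from array import array

# Row-oriented: each record carries every field, so touching one field
# still means walking whole records.
rows = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": 20.5},
    {"user_id": 3, "amount": 4.5},
]

# Column-oriented: each field is one contiguous, typed buffer.
columns = {
    "user_id": array("q", [1, 2, 3]),
    "amount": array("d", [10.0, 20.5, 4.5]),
}


def apply_udf(column, udf):
    """Apply a scalar Python function to every value of a single column."""
    return array(column.typecode, (udf(value) for value in column))


def add_tax(amount):
    """Hypothetical user-defined function."""
    return amount * 1.25


# Only the "amount" buffer is read; "user_id" is never touched.
taxed = apply_udf(columns["amount"], add_tax)

# A memoryview exposes the same underlying buffer without copying or
# re-encoding it, which is the kind of sharing a common format allows.
view = memoryview(columns["amount"])
columns["amount"][0] = 11.0  # the change is visible through the view
```

The point of the sketch is the access pattern: a function over one column reads one contiguous buffer, and that buffer can be handed to other code without a serialization step.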
In future blog posts, we will go into detail about these roadmap items as they develop, and about how they will enable new data problems to be solved with Ibis.
Community and Contributing to Ibis
We are excited to hear from users tackling big data with Python and how Ibis can become a better solution for their use cases.
Furthermore, we welcome involvement in Ibis development from the Python, Hadoop, and Apache Spark communities. There are many ways that open source developers can get involved right now to add valuable functionality, such as:
- Enhancing interoperability with pandas
- High-level analytical functionality utilizing Ibis data expression syntax
- New C++ user-defined function implementations for Impala for high-performance analytics and data processing functionality
- Utilizing PySpark to integrate with the Spark module ecosystem (e.g. MLlib)
- Support for additional SQL-based execution backends (e.g. Presto, Apache Hive, Spark SQL, PostgreSQL)
- Ibis website: http://ibis-project.org/
- Ibis source code: http://github.com/cloudera/ibis
- Ibis documentation: http://docs.ibis-project.org
- Ibis blog: http://blog.ibis-project.org
- Ibis user mailing list: email@example.com
- Cloudera Labs Discussion Forum: http://community.cloudera.com/t5/Cloudera-Labs/bd-p/ClouderaLabs
- Trying out Ibis with the Cloudera QuickStart VM: http://docs.ibis-project.org/getting-started.html?highlight=quickstart#using-ibis-with-the-cloudera-quickstart-vm
Wes McKinney is a Software Engineer at Cloudera. He is the creator of Python’s ubiquitous pandas library and the author of the O’Reilly Media best-seller Python for Data Analysis. Previously, Wes was the founder and CEO of DataPad.
Marcel Kornacker is Chief Architect for Database Technology at Cloudera, and the creator of Impala.