Cloudera Data Science Workbench provides freedom for data scientists. It gives them the flexibility to work with their favorite libraries using isolated environments with a container for each project.
In JVM world such as Java or Scala, using your favorite packages on a Spark cluster is easy. Each application manages preferred packages using fat JARs, and it brings independent environments with the Spark cluster. Many data scientists prefer Python to Scala for data science,
The following post (Part 2 of two parts) by Vik Paruchuri, founder of data science learning platform Dataquest, offers some detailed and instructive insight about data science workflow (regardless of the tech stack involved, but in this case, using Python). We re-publish it here for your convenience.
Before we dive into exploring the data [see Part 1 for steps relating to data preparation], we’ll want to set the context,
The following post by Vik Paruchuri, founder of data science learning platform Dataquest, offers some detailed and instructive insight about data science workflow (regardless of the tech stack involved, but in this case, using Python). We re-publish it here for your convenience.
Data science companies are increasingly looking at portfolios when making hiring decisions. One of the reasons for this is that a portfolio is the best way to judge someone’s real-world skills.
This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.
Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating to create a new C++ library to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Hive and Impala.
This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to explore opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.
One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same.