Category Archives: Data Science

Common Probability Distributions: The Data Scientist’s Crib Sheet

Categories: Data Science

Data scientists have hundreds of probability distributions from which to choose. Where to start?

Data science, whatever it may be, remains a big deal.  “A data scientist is better at statistics than any software engineer,” you may overhear a pundit say, at your local tech get-togethers and hackathons. The applied mathematicians have their revenge, because statistics hasn’t been this talked-about since the roaring 20s. They have their own legitimizing Venn diagram of which people don’t make fun.

Read More

What We Learned at Wrangle 2015 (Data Science is About People)

Categories: Community Data Science Events

The Wrangle conference was a huge hit. Look for it to return in 2016!

Wrangle, the conference for and by data science practitioners from startup to enterprise, made a noticeable splash in San Francisco last week. As the conference host and organizer, we (Cloudera) couldn’t be happier about its attendees’ happiness.

Wrangle Conference 2015

With presenter/panelist representation from the data science teams at Uber,

Read More

How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark

Categories: CDH Data Science Guest How-to Spark

Thanks to Michal Malohlava, Amy Wang, and Avni Wadhwa of H20.ai for providing the following guest post about building ML apps using Sparkling Water and Apache Spark on CDH.

The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark.

Read More

The New Wrangle Conference: Solving the Hardest Data Science Challenges from Startup to Enterprise

Categories: Community Data Science Events

Wrangle, a new conference dedicated to the practice of data science from startup to enterprise, debuts in San Francisco on Oct. 22, 2015.

Even as Cloudera introduce new tools for analytics and machine learning into its platform (like the recently announced Ibis project, for example), we are mindful of the fact that many of the hardest problems in data science cannot be solved by technology alone. From the smallest startups to the largest enterprises,

Read More

Ibis on Impala: Python at Scale for Data Science

Categories: Cloudera Labs Data Science Impala

This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale.

Across the user community, you will find general agreement that the Apache Hadoop stack has progressed dramatically in just the past few years. For example, Search and Impala have moved Hadoop beyond batch processing, while developers are seeing significant productivity gains and additional use cases by transitioning from MapReduce to Apache Spark.

Thanks to such advances in the ecosystem,

Read More