Enabling Python development on CDH clusters (for PySpark, for example) is now much easier thanks to new integration with Continuum Analytics’ Python platform (Anaconda).
Python has become an increasingly popular tool for data analysis, including data processing, feature engineering, machine learning, and visualization. Data scientists and data engineers enjoy Python’s rich numerical and analytical libraries—such as NumPy, pandas, and scikit-learn—and have long wanted to apply them to large datasets stored in Apache Hadoop clusters.
Spark MLlib is growing in popularity for machine-learning model development due to its elegance and usability. In this post, you’ll learn why.
Spark MLlib is a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take a couple of lines of code and leverage hundreds of machines. MLlib greatly simplifies the model development process.
In this post,
Creating and training machine-learning models is more complex on distributed systems, but there are lots of frameworks for abstracting that complexity.
There are now more options than ever among proven open source projects for distributed analytics, with Python and R becoming increasingly popular. In this post, you’ll learn the options for setting up a simple read-eval-print loop (REPL) environment with Python and R within the Cloudera QuickStart VM using APIs for two of the most popular cluster computing frameworks: Apache Spark (with MLlib) and H2O (from the company with the same name).
The 0.2.0 release of the spark-ts package includes a fleshed-out Java API, among other things.
The spark-ts library, which was initially developed by Cloudera’s Data Science team, enables analysis of datasets comprising millions of time series, each with millions of measurements. It runs atop Apache Spark, and exposes Scala, Java, and Python APIs. Check out this recent post for a closer look at the library and how to use it.
Time-series analysis is becoming mainstream across multiple data-rich industries. The new spark-ts library helps analysts and data scientists focus on business questions, not on building their own algorithms.
Have you ever wanted to build models over measurements coming in every second from sensors across the world? Dig into intra-day trading prices of millions of financial instruments? Compare hourly view statistics across every page on Wikipedia? To do any of these things,