Tag Archives: analytics

Latest Impala Cookbook

Categories: Impala

Over the past year (and through several releases), Apache Impala (incubating) has added numerous new features and performance enhancements better enabling high-performance SQL analytics over big data.  Thus, it is time again for an update to the Impala cookbook, which contains best practices for these new features, updated guidelines, and more detailed examples.

Note: This cookbook does not yet capture best practices for the major new advancements available with the recent GA of Kudu.

Read More

New in Cloudera Enterprise 5.10: Hue SQL Editor and Security Improvements

Categories: Hadoop Hue Oozie

Cloudera Enterprise 5.10 includes the latest updates of Hue, the intelligent editor for SQL Developers and Analysts.

As part of Cloudera’s continuing investments in user experience and productivity, Cloudera Enterprise 5.10 includes an updated version of Hue. We provide a summary of the main enhancements in the following part of this blog post. (Hue from C5.10 is also available for a quick try in one click on demo.gethue.com.)

SQL Improvements

The Hue editor keeps getting better with these major improvements:

Row Count

The number of rows returned is displayed so you can quickly see the size of the dataset.

Read More

A New Library for Analyzing Time-Series Data with Apache Spark

Categories: Data Science Spark

Time-series analysis is becoming mainstream across multiple data-rich industries. The new spark-ts library helps analysts and data scientists focus on business questions, not on building their own algorithms.

Have you ever wanted to build models over measurements coming in every second from sensors across the world? Dig into intra-day trading prices of millions of financial instruments? Compare hourly view statistics across every page on Wikipedia? To do any of these things,

Read More

Common Probability Distributions: The Data Scientist’s Crib Sheet

Categories: Data Science

Data scientists have hundreds of probability distributions from which to choose. Where to start?

Data science, whatever it may be, remains a big deal.  “A data scientist is better at statistics than any software engineer,” you may overhear a pundit say, at your local tech get-togethers and hackathons. The applied mathematicians have their revenge, because statistics hasn’t been this talked-about since the roaring 20s. They have their own legitimizing Venn diagram of which people don’t make fun.

Read More

Sustained Innovation in Apache Spark: DataFrames, Spark SQL, and MLlib

Categories: CDH Spark

Cloudera has announced support for Spark SQL/DataFrame API and MLlib. This post explains their benefits for app developers, data analysts, data engineers, and data scientists.

In July 2015, Cloudera re-affirmed its position since 2013: that Apache Spark is on course to replace MapReduce as the default general-purpose data processing engine for Apache Hadoop. Thanks to initiatives like the One Platform Initiative,

Read More