Tag Archives: statistics

Common Probability Distributions: The Data Scientist’s Crib Sheet

Categories: Data Science

Data scientists have hundreds of probability distributions from which to choose. Where to start?

Data science, whatever it may be, remains a big deal.  “A data scientist is better at statistics than any software engineer,” you may overhear a pundit say, at your local tech get-togethers and hackathons. The applied mathematicians have their revenge, because statistics hasn’t been this talked-about since the roaring 20s. They have their own legitimizing Venn diagram of which people don’t make fun.

Read More

Introducing Cloudera Navigator Optimizer: For Optimal SQL Workload Efficiency on Apache Hadoop

Categories: Cloudera Navigator Impala Performance

Cloudera Navigator Optimizer, a new (beta) component of Cloudera Enterprise, helps optimize inefficient query workloads for best results on Apache Hadoop.

With the proliferation of Apache Hadoop deployments, more and more customers are looking to reduce operational overheads in their enterprise data warehouse (EDW) installations by exploiting low-cost, highly scalable, open source SQL-on-Hadoop frameworks such as Impala and Apache Hive. Processing portions of SQL workloads better suited to Hadoop on these frameworks,

Read More

How-to: Build a Complex Event Processing App on Apache Spark and Drools

Categories: HBase How-to Kafka Spark Use Case

Combining CDH with a business execution engine can serve as a solid foundation for complex event processing on big data.

Event processing involves tracking and analyzing streams of data from events to support better insight and decision making. With the recent explosion in data volume and diversity of data sources, this goal can be quite challenging for architects to achieve.

Complex event processing (CEP) is a type of event processing that combines data from multiple sources to identify patterns and complex relationships across various events.

Read More

How-to: Use Apache Solr to Query Indexed Data for Analytics

Categories: How-to Search

Bet you didn’t know this: In some cases, Solr offers lightning-fast response times for business-style queries.

If you were to ask well informed technical people about use cases for Solr, the most likely response would be that Solr (in combination with Apache Lucene) is an open source text search engine: one can use Solr to index documents, and after indexing, these same documents can be easily searched using free-form queries in much the same way as you would query Google.

Read More

Continuous Distribution Goodness-of-Fit in MLlib: Kolmogorov-Smirnov Testing in Apache Spark

Categories: Spark

Thanks to former Cloudera intern Jose Cambronero for the post below about his summer project, which involved contributions to MLlib in Apache Spark.

Data can come in many shapes and forms, and can be described in many ways. Statistics like the mean and standard deviation of a sample provide descriptions of some of its important qualities. Less commonly used statistics such as skewness and kurtosis provide additional perspective into the data’s profile.

Read More