Category Archives: Performance

Demystifying Spark Jobs to Optimize for Cost and Performance

Categories: Performance, Spark

Apache Spark is one of the most popular engines for distributed data processing on Big Data clusters. Spark jobs come in all shapes, sizes and cluster form factors: they range from tens to thousands of nodes and executors, from seconds to hours or even days of job duration, from megabytes to petabytes of data, and from simple data scans to complicated analytical workloads. Throw in a growing number of streaming workloads on top of the huge body of batch and machine learning jobs ...


Using Native Math Libraries to Accelerate Spark Machine Learning Applications

Categories: AI and Machine Learning, CDH, Performance, Spark

[Editor’s note: The original version of this article was published as part of our Guru How-To series for Data Science. Be sure to also check out the series for Cloudera Data Warehouse.]

Spark ML is one of the dominant frameworks for many major machine learning algorithms, such as the Alternating Least Squares (ALS) algorithm for recommendation systems, the Principal Component Analysis algorithm, and the Random Forest algorithm.


Faster Swarms of Data: Accelerating Hive Queries with Parquet Vectorization

Categories: CDH, Hive, Parquet, Performance

Background

Apache Hive is a widely adopted data warehouse engine that runs on Apache Hadoop. Features that improve Hive performance can significantly improve the overall utilization of resources on the cluster. Hive processes data using a chain of operators within the Hive execution engine. These operators are scheduled in the various tasks (for example, MapTask, ReduceTask, or SparkTask) of the query execution plan. Traditionally, these operators are designed to process one row at a time.


Assessment of Apache Impala Performance using Cloudera Manager Metrics – Part 1 of 3

Categories: CDH, Cloudera Manager, Impala, Performance

For a user-facing system like Apache Impala, bad performance and downtime can have serious negative impacts on your business. Given the complexity of the system and all the moving parts, troubleshooting can be time-consuming and overwhelming.

In this blog post series, we show how the charts and metrics in Cloudera Manager (CM) can help troubleshoot Impala performance issues. They can also help you monitor the system to predict and prevent future outages.


Evaluating Partner Platforms

Categories: CDH, Hardware, How-to, Performance

As a member of Cloudera’s Partner Engineering team, I evaluate hardware and cloud computing platforms offered by commercial partners who want to certify their products for use with Cloudera software. One of my primary goals is to make sure that these platforms provide a stable and well-performing base upon which our products will run, a state of operation that a wide variety of customers performing an even wider variety of tasks can appreciate.
