Spark ML is one of the dominant frameworks for many major machine learning algorithms, such as the Alternating Least Squares (ALS) algorithm for recommendation systems, the Principal Component Analysis algorithm, and the Random Forest algorithm.
Apache Hive is a widely adopted data warehouse engine that runs on Apache Hadoop. Features that improve Hive performance can significantly improve the overall utilization of resources on the cluster. Hive processes data using a chain of operators within the Hive execution engine. These operators are scheduled in the various tasks (for example, MapTask, ReduceTask, or SparkTask) of the query execution plan. Traditionally, these operators are designed to process one row at a time.
For a user-facing system like Apache Impala, bad performance and downtime can have serious negative impacts on your business. Given the complexity of the system and all the moving parts, troubleshooting can be time-consuming and overwhelming.
In this blog post series, we are going to show how the charts and metrics on Cloudera Manager (CM) can help troubleshoot Impala performance issues. They can also help to monitor the system to predict and prevent future outages.
As a member of Cloudera’s Partner Engineering team, I evaluate hardware and cloud computing platforms offered by commercial partners who want to certify their products for use with Cloudera software. One of my primary goals is to make sure that these platforms provide a stable and well-performing base upon which our products will run, a state of operation that a wide variety of customers performing an even wider variety of tasks can appreciate.
Cloudera Data Science Workbench (CDSW) provides data science teams with a self-service platform for quickly developing machine learning workloads in their preferred language, with secure access to enterprise data and simple provisioning of compute. Individuals can request schedulable resources (e.g. compute, memory, GPUs) on a shared cluster that is managed centrally.
While self-service provisioning of resources is critical to the rapid interaction cycle of data scientists, it can pose a challenge to administrators.