Introducing Cloudera Navigator Optimizer: For Optimal SQL Workload Efficiency on Apache Hadoop

Categories: Cloudera Navigator Impala Performance

Cloudera Navigator Optimizer, a new (beta) component of Cloudera Enterprise, helps optimize inefficient query workloads for best results on Apache Hadoop.

With the proliferation of Apache Hadoop deployments, more and more customers are looking to reduce operational overheads in their enterprise data warehouse (EDW) installations by exploiting low-cost, highly scalable, open source SQL-on-Hadoop frameworks such as Impala and Apache Hive. Processing portions of SQL workloads better suited to Hadoop on these frameworks, while retaining operational queries on existing EDW systems, is an increasingly popular workload-optimization strategy.

However, legacy EDW workloads can comprise of millions of SQL queries, so manually identifying queries that benefit from Hadoop is a formidable task. Even after these queries are identified, deploying them to Hadoop as-is may not be prudent given the disparities in the underlying architectures of EDW and SQL-on-Hadoop solutions. Choosing the right data models, and re-designing the queries by breaking down complex SQL operations to simpler, more efficient constructs, are critical for ensuring a successful Hadoop deployment.

Another complicating factor is the size of these SQL queries; legacy business critical queries can contain thousands of lines of un-formatted SQL. Scanning such queries manually to gauge suitability for Hadoop is usually not a viable option.

Cloudera Navigator Optimizer, now available in a limited beta, is a unique, workload optimization tool (available via SaaS) that helps DBAs, data warehouse architects, and data analysts adopt an informed and systematic approach to getting optimal results with Hadoop. In summary, Cloudera Navigator Optimizer profiles and analyzes the SQL text in large, complex SQL workloads so users can gain an in-depth understanding of their workloads, identify queries best-suited for Hadoop and modify them as needed for optimal efficiency on Hadoop—all via an easy-to-use web UI.

In the remainder of this post, we’ll provide an overview of what it can do.

Supported SQL

Cloudera Navigator Optimizer can analyze SQL queries from commercially available platforms including Teradata, Oracle Database, IBM DB2 and Netezza, Microsoft SQL Server, and Hadoop (Impala). Some of these platforms store all queries in a query log; SQL scripts can be used to extract queries from such logs. The elapsed time per query, the schema for the underlying tables, and table and column statistics (if available) enhance the quality of the recommendations.

Architecture and Functionality

Cloudera Navigator Optimizer complements existing optimization tools offered by traditional EDW vendors. While these vendor tools focus on performance tuning for SQL queries on their own platforms, Cloudera Navigator Optimizer uniquely combines an analytical framework with advanced domain-specific knowledge extraction logic to analyze entire SQL workloads and optimize performance on Hadoop.

The product has a highly distributed, fault-tolerant backend architecture to:

  • Parse SQL queries to reveal table and query usage patterns
  • Store results in an in-memory store for quick retrieval (a modern database provides persistence)
  • Apply advanced learning algorithms and analytics to derive actionable insights

Anonymization techniques such as stripping literals from SQL queries are automatically applied to protect customer data privacy. The rich UI allows self-service exploration of the analytical results and actionable insights, guiding users to adopt optimization strategies best-suited to the workload and pinpointing high impact queries to focus on.

The framework for query discovery and analysis of these high impact queries is built on three pillars:

  • Duplication: Many workloads consist of highly duplicated queries or query sets. The tool uses advanced semantic analysis and query similarity algorithms to zero-in on the core set of unique queries in a workload. Users then have a much smaller set of queries in which to explore optimization opportunities for Hadoop. Our analytics also detects duplication of tables, columns, and query usage patterns across queries in the workload, including highly duplicated joins, popular column filters, and oft-used subqueries—all constructs that can be optimized for Hadoop.
  • Complexity: Categorizing queries by analyzing the complexity of their SQL constructs (based on joins, subqueries, aggregation functions, inline views, CASE expressions, UNIONs) and deriving popular patterns across the workload allows users to find optimization opportunities at the broader workload level, as well as on a per-query basis.
  • Compatibility: Checking syntax compatibility between the SQL query from the EDW system against SQL supported on SQL-on-Hadoop platforms helps users find and resolve query syntax differences. Highlighting these differences in the context of the query is also supported.

With respect to features and functionality, Cloudera Navigator Optimizer allows you to:

  • Profile SQL workloads to get visibility across multiple dimensions including:
    • A ‘dashboard’ view: of the SQL operations, table usage, and query usage patterns
    • A popular view: of the most frequently used tables, queries, and column patterns
    • A deep-dive view: of each table and query’s main characteristics, patterns, and complexities
    • A highlights view: of the major insights drawn from analyzing the entire workload
  • Analyze optimization opportunities by exploring important and popular table and query access patterns in the workload including SQL join patterns and join distributions, access patterns within queries such as subqueries, and column usage information such as filter columns
  • Download assessment reports to share summarized workload findings with project team(s)
  • Identify candidate query sets to deploy on Hadoop, using metrics such as the number of times similar queries are repeated in the workload OR the average elapsed time of queries
  • Evaluate candidate queries to identify complex, oft-repeated SQL operations that may require or benefit from schema and/or query design for efficient execution on Hadoop; and leverage SQL syntax compatibility for Impala and Hive
  • Get design recommendations, including data modeling and schema-design recommendations, such as partitioning key design for Impala and Hive, and pertinent query-design recommendations, such as inline view materialization for efficient execution on Hadoop
optimizer

To learn more about these functions, check out the product demo:

Use Cases

Alpha versions of Cloudera Navigator Optimizer have been used by Cloudera customers spanning multiple market segments including healthcare, telecommunications, retail, banking, and finance. The workload profiling, reporting, query identification and evaluation features have been used for diverse workloads including ETL, BI, complex query redesign, and ad-hoc query workloads—contributing to significant operational and developmental cost savings.

What’s Next

In future releases, Cloudera Navigator Optimizer will continue to identify optimization opportunities for EDW and SQL-on-Hadoop workloads and provide recommendations for efficient execution on Hadoop. For example, it will auto-generate re-designed SQL queries for Hadoop based on query design recommendations. It will also offer “active optimization” capabilities by monitoring SQL performance in CDH and then capturing changes in business logic and data volume.

Getting Started

Cloudera Navigator Optimizer is now available in limited beta. To learn more and request beta access, visit optimizer.cloudera.com and register for the December 15 webinar, “Unlocking Hadoop Success with Cloudera Navigator Optimizer.”

We appreciate any and all feedback in the “Beta Releases” area at community.cloudera.com.

facebooktwittergoogle_pluslinkedinmailfacebooktwittergoogle_pluslinkedinmail

2 responses on “Introducing Cloudera Navigator Optimizer: For Optimal SQL Workload Efficiency on Apache Hadoop

Leave a Reply

Your email address will not be published. Required fields are marked *