Tag Archives: reliability

Blacklisting in Apache Spark

Categories: Hadoop Spark

At Cloudera, we’re always working to provide our customers and the Apache Spark community with the most robust, most reliable software possible. This article describes some recent engineering work on [SPARK-8425] that is available in CDH 5.10 and CDH5.11, as well as in upstream Apache Spark starting with the 2.2 release.

The work pertains to the Blacklist Tracker mechanism in Spark’s scheduler. This was the subject of a recent Spark Summit talk,

Read More

Impala’s Next Step: Proposal to Join the Apache Software Foundation

Categories: Impala Kudu

The Impala project has already passed several important milestones on the way to its status as the leader and open standard for BI and SQL analytics on modern big data architecture. Today’s milestone is the submission of proposals for Impala and Kudu to join the Apache Software Foundation (ASF) Incubator.

[Update: Read the text of the Impala and Kudu proposals here and here, respectively.]

Since its initial release nearly five years ago,

Read More

Inside Santander’s Near Real-Time Data Ingest Architecture

Categories: Flume HBase Kafka

Learn about the near real-time data ingest architecture for transforming and enriching data streams using Apache Flume, Apache Kafka, and RocksDB at Santander UK.

Cloudera Professional Services has been working with Santander UK to build a near real-time (NRT) transactional analytics system on Apache Hadoop. The objective is to capture, transform, enrich, count, and store a transaction within a few seconds of a card purchase taking place. The system receives the bank’s retail customer card transactions and calculates the associated trend information aggregated by account holder and over a number of dimensions and taxonomies.

Read More

What’s Next for Impala: More Reliability, Usability, and Performance at Even Greater Scale

Categories: Impala

This year will close out with new features for reliability, usability, and nested types, and in 2016, performance-related enhancements promise >20x gains.

It’s been roughly a year since we provided an update about the Impala roadmap. During that time, a number of milestones have been reached:

  • Most Cloudera customers have deployed Impala to production across industries including financial services, retail, healthcare, gaming, government, advertising, and telecom.

Read More

Security, Hive-on-Spark, and Other Improvements in Apache Hive 1.2.0

Categories: Community Hive Spark

Apache Hive 1.2.0, although not a major release, contains significant improvements.

Recently, the Apache Hive community moved to a more frequent, incremental release schedule. So, a little while ago, we covered the Apache Hive 1.0.0 release and explained how it was renamed from 0.14.1 with only minor feature additions since 0.14.0.

Shortly thereafter, Apache Hive 1.1.0 was released (renamed from Apache Hive 0.15.0), which included more significant features—including Hive-on-Spark.

Read More