Cloudera Engineering Blog

Big Data best practices, how-to's, and internals from Cloudera Engineering and the community


Here’s Your Getting Started with Impala Book

Getting Started with Impala (now in early release)—another book in the Hadoop ecosystem books canon—is indispensable for people who want to get familiar with Impala, the open source MPP query engine for Apache Hadoop. We spoke with its author, Impala docs writer John Russell, about the book’s origin and mission.

Why did you decide to write this book?

Secrets of Cloudera Support: Using OpenStack to Shorten Time-to-Resolution

Automating the creation of short-lived clusters for testing purposes frees our support engineers to spend more time on customer issues.

The first step for any support engineer is often to replicate the customer’s environment in order to identify the problem or issue. Given the complexity of Cloudera customer environments, reproducing a specific issue is often quite difficult, as a customer’s problem might only surface in an environment with specific versions of Cloudera Enterprise (CDH + Cloudera Manager), configuration settings, certain number of nodes, or the structure of the dataset itself. Even with Cloudera Manager’s awesome setup wizards, setting up Apache Hadoop can be quite time consuming, as the software was never designed with ephemeral clusters in mind.

New Benchmarks for SQL-on-Hadoop: Impala 1.4 Widens the Performance Gap

With 1.4, Impala’s performance lead over the SQL-on-Hadoop ecosystem gets wider, especially under multi-user load.

As noted in our recent post about the Impala 2.x roadmap (“What’s Next for Impala: Focus on Advanced SQL Functionality”), Impala’s ecosystem momentum continues to accelerate, with nearly 1 million downloads since the GA of 1.0, deployment by most of Cloudera’s enterprise data hub customers, and adoption by MapR, Amazon, and Oracle as a shipping product. Furthermore, in the past few months, independent sources such as IBM Research have confirmed that “Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce- or Tez-based runtime.”

Community Meetups during Strata + Hadoop World 2014

The meetup opportunities during the conference week are more expansive than ever — spanning Impala, Spark, HBase, Kafka, and more.

Strata + Hadoop World 2014 is a kaleidoscope of experiences for attendees, and those experiences aren’t contained within the conference center’s walls. For example, the meetups that occur during the conf week (which is concurrent with NYC DataWeek) are a virtual track for developers — and with Strata + Hadoop World being bigger than ever, so is the scope of that track.

How Impala Supports Mixed Workloads in Multi-User Environments

Our thanks to Melanie Imhof, Jonas Looser, Thierry Musy, and Kurt Stockinger of the Zurich University of Applied Science in Switzerland for the post below about their research into the query performance of Impala for mixed workloads.

Recently, we were approached by an industry partner to research and create a blueprint for a new Big Data, near real-time, query processing architecture that would replace its current architecture based on a popular open source database system.

How-to: Install CDH on Mac OSX 10.9 Mavericks

This overview will cover the basic tarball setup for your Mac.

If you’re an engineer building applications on CDH and becoming familiar with all the rich features for designing the next big solution, it becomes essential to have a native Mac OSX install. Sure, you may argue that your MBP with its four-core, hyper-threaded i7, SSD, 16GB of DDR3 memory are sufficient for spinning up a VM, and in most instances — such as using a VM for a quick demo — you’re right.  However, when experimenting with a slightly heavier workload that is a bit more resource intensive, you’ll want to explore a native install.

Apache Kafka for Beginners

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.

Getting Started with Big Data Architecture

What does a “Big Data engineer” do, and what does “Big Data architecture” look like? In this post, you’ll get answers to both questions.

Apache Hadoop has come a long way in its relatively short lifespan. From its beginnings as a reliable storage pool with integrated batch processing using the scalable, parallelizable (though inherently sequential) MapReduce framework, we have witnessed the recent additions of real-time (interactive) components like Impala for interactive SQL queries and integration with Apache Solr as a search engine for free-form text exploration.

The Early Release Books Keep Coming: This Time, Hadoop Security

Hadoop Security is the latest book from Cloudera engineers in the Hadoop ecosystem books canon.

We are thrilled to announce the availability of the early release of Hadoop Security, a new book about security in the Apache Hadoop ecosystem published by O’Reilly Media. The early release contains two chapters on System Architecture and Securing Data Ingest and is available in O’Reilly’s catalog and in Safari Books.

This Month in the Ecosystem (August 2014)

Welcome to our 12th (first annual!) edition of “This Month in the Ecosystem,” a digest of highlights from August 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Newer Posts Older Posts