Cloudera Engineering Blog · Hadoop Posts
Strata + Hadoop World San Jose 2015 (Feb. 17-20) is a focal point for learning about production-izing Hadoop.
Strata + Hadoop World sessions have always been indispensable for learning about Hadoop internals, use cases, and admin best practices. When deep learning is needed, however—and deep dives are a necessity if you’re running Hadoop in production, or aspire to—tutorials are your ticket.
Learn how to set up a Hadoop cluster in a way that maximizes successful production-ization of Hadoop and minimizes ongoing, long-term adjustments.
Previously, we published some recommendations on selecting new hardware for Apache Hadoop deployments. That post covered some important ideas regarding cluster planning and deployment such as workload profiling and general recommendations for CPU, disk, and memory allocations. In this post, we’ll provide some best practices and guidelines for the next part of the implementation process: configuring the machines once they arrive. Between the two posts, you’ll have a great head start toward production-izing Hadoop.
A new Spark tutorial and Trifacta deployment option make Cloudera Live even more useful for getting started with Apache Hadoop.
When it comes to learning Hadoop and CDH (Cloudera’s open source platform including Hadoop), there is no better place to start than Cloudera Live (cloudera.com/live). With a quick, one-button deployment option, Cloudera Live launches a four-node Cloudera cluster that you can learn and experiment in, free, for two weeks. To help plan and extend the capabilities of your cluster, we also offer various partner deployments. Building on the addition of interactive tutorials and Tableau and Zoomdata integration, we have added a new tutorial on Apache Spark and a new Trifacta partner deployment.
Support for transparent, end-to-end encryption in HDFS is now available and production-ready (and shipping inside CDH 5.3 and later). Here’s how it works.
Apache Hadoop 2.6 adds support for transparent encryption to HDFS. Once configured, data read from and written to specified HDFS directories will be transparently encrypted and decrypted, without requiring any changes to user application code. This encryption is also end-to-end, meaning that data can only be encrypted and decrypted by the client. HDFS itself never handles unencrypted data or data encryption keys. All these characteristics improve security, and HDFS encryption can be an important part of an organization-wide data protection story.
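The end-to-end property described above can be illustrated with a toy sketch. This is not HDFS code: the class and function names (`ToyKeyServer`, `ToyStorage`, `client_write`, `client_read`) are hypothetical, and the SHA-256-based keystream is an insecure stand-in for a real cipher. The point is only the division of labor: the client encrypts and decrypts, while the storage layer holds nothing but ciphertext and a wrapped (encrypted) per-file key.

```python
import hashlib
import os


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). NOT secure; a stand-in for a real cipher."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class ToyKeyServer:
    """Hypothetical key service: holds a zone key and wraps/unwraps per-file keys."""

    def __init__(self) -> None:
        self._zone_key = os.urandom(32)

    def new_wrapped_key(self) -> bytes:
        file_key = os.urandom(32)
        return keystream_xor(self._zone_key, file_key)

    def unwrap(self, wrapped_key: bytes) -> bytes:
        return keystream_xor(self._zone_key, wrapped_key)


class ToyStorage:
    """Hypothetical storage layer: sees only wrapped keys and ciphertext."""

    def __init__(self) -> None:
        self.files: dict[str, tuple[bytes, bytes]] = {}

    def put(self, path: str, wrapped_key: bytes, ciphertext: bytes) -> None:
        self.files[path] = (wrapped_key, ciphertext)

    def get(self, path: str) -> tuple[bytes, bytes]:
        return self.files[path]


def client_write(storage: ToyStorage, keys: ToyKeyServer, path: str, plaintext: bytes) -> None:
    # Encryption happens on the client; storage never sees plaintext or the raw file key.
    wrapped = keys.new_wrapped_key()
    file_key = keys.unwrap(wrapped)
    storage.put(path, wrapped, keystream_xor(file_key, plaintext))


def client_read(storage: ToyStorage, keys: ToyKeyServer, path: str) -> bytes:
    # Decryption also happens on the client, after unwrapping the per-file key.
    wrapped, ciphertext = storage.get(path)
    return keystream_xor(keys.unwrap(wrapped), ciphertext)
```

Application code only calls `client_write` and `client_read` with plain bytes, which mirrors the "no changes to user application code" claim: the encryption is invisible to the caller, and the storage layer could be inspected without revealing the data.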
Our “Top 10” list of blog posts published during a calendar year is a crowd favorite (see the 2013 version here), in particular because it serves as informal, crowdsourced research about popular interests. Page views don’t lie (although skew for publishing date has to be taken into account: clearly, posts that publish earlier in the year have pole position).
In 2014, a strong interest in various new components that bring real time or near-real time capabilities to the Apache Hadoop ecosystem is apparent. And we’re particularly proud that the most popular post was authored by a non-employee.
- How-to: Create a Simple Hadoop Cluster with VirtualBox
by Christian Javet
Explains how to set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager.
- Why Apache Spark is a Crossover Hit for Data Scientists
by Sean Owen
An explanation of why Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.
- How-to: Run a Simple Spark App in CDH 5
by Sandy Ryza
Helps you get started with Spark using a simple example.
- New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead
by Justin Erickson, Marcel Kornacker & Dileep Kumar
Open benchmark testing of Impala 1.3 demonstrates performance leadership compared to alternatives (by 950% or more), while providing greater query throughput and with a far smaller CPU footprint.
- Apache Kafka for Beginners
by Gwen Shapira & Jeff Holoman
When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.
- Apache Hadoop YARN: Avoiding 6 Time-Consuming “Gotchas”
by Jeff Bean
Understanding some key differences between MR1 and MR2/YARN will make your migration much easier.
- Impala Performance Update: Now Reaching DBMS-Class Speed
by Justin Erickson, Greg Rahn, Marcel Kornacker & Yanpei Chen
As of release 1.1.1, Impala’s speed beat the fastest SQL-on-Hadoop alternatives–including a popular analytic DBMS running on its own proprietary data store.
- The Truth About MapReduce Performance on SSDs
by Karthik Kambatla & Yanpei Chen
It turns out that cost-per-performance, not cost-per-capacity, is the better metric for evaluating the true value of SSDs. (See the session on this topic at Strata+Hadoop World San Jose in Feb. 2015!)
- How-to: Translate from MapReduce to Spark
by Sean Owen
The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
- How-to: Write and Run Apache Giraph Jobs on Hadoop
by Mirko Kämpf
Explains how to create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.
Benchmarking Big Data systems is nontrivial. Avoid these traps!
Here at Cloudera, we know how hard it is to get reliable performance benchmarking results. Benchmarking matters because one of the defining characteristics of Big Data systems is the ability to process large datasets faster. “How large” and “how fast” drive technology choices, purchasing decisions, and cluster operations. Even with the best intentions, performance benchmarking is fraught with pitfalls—easy to get numbers, hard to tell if they are sound.
A significant vulnerability affecting the entire Apache Hadoop ecosystem has now been patched. What was involved?
By now, you may have heard about the POODLE (Padding Oracle On Downgraded Legacy Encryption) attack on TLS (Transport Layer Security). This attack combines a cryptographic flaw in the obsolete SSLv3 protocol with the ability of an attacker to downgrade TLS connections to use that protocol. The result is that an active attacker on the same network as the victim can potentially decrypt parts of an otherwise encrypted channel. The only immediately workable fix has been to disable the SSLv3 protocol entirely.
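As a generic illustration of "disable SSLv3 entirely" (not the Hadoop patch itself), here is a minimal Python sketch that sets a protocol floor on a TLS client context so a downgrade to SSLv3 cannot be negotiated. The helper name `make_client_context` is hypothetical, and the choice of TLS 1.2 as the floor is an assumption; the text above only requires that SSLv3 be excluded.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context()
    # Any handshake below this version is rejected, which rules out SSLv3
    # and therefore the downgrade step that POODLE depends on.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed floor, see lead-in
    return ctx
```

A context built this way can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=make_client_context())`.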
The Apache Hadoop community has voted to release Hadoop 2.6. Congrats to all contributors!
This new release contains a variety of improvements, particularly in the storage layer and in YARN. We’re particularly excited about the encryption-at-rest feature in HDFS!
Cloudera’s culture is premised on innovation and teamwork, and there’s no better example of them in action than our internal hackathon.
Cloudera Engineering doubled down on its “hackathon” tradition last week, with this year’s edition taking an around-the-clock approach thanks to the HQ building upgrade since the 2013 edition (just look at all that space!).
The number of powerful data query tools in the Apache Hadoop ecosystem can be confusing, but understanding a few simple things about your needs usually makes the choice easy.
Ah, the good old days. I recall vividly that in 2007, I was faced with storing 1 billion XML documents and making them accessible as well as searchable. I had few choices given a shoestring budget: build something on my own (it was the rage back then, and still is), use an existing open source database like PostgreSQL or MySQL, or try this thing that Google built successfully and that was now implemented in open source under the Apache umbrella: Hadoop.