Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
These new Apache HBase features in CDH 5.2 make multi-tenant environments easier to manage.
Historically, Apache HBase treats all tables, users, and workloads with equal weight. This approach is sufficient for a single workload, but when multiple users and multiple workloads were applied on the same cluster or table, conflicts can arise. Fortunately, starting with HBase in CDH 5.2 (HBase 0.98 + backports), workloads and users can now be prioritized.
November was busy, even accounting for the US Thanksgiving holiday:
A significant vulnerability affecting the entire Apache Hadoop ecosystem has now been patched. What was involved?
By now, you may have heard about the POODLE (Padding Oracle On Downgraded Legacy Encryption) attack on TLS (Transport Layer Security). This attack combines a cryptographic flaw in the obsolete SSLv3 protocol with the ability of an attacker to downgrade TLS connections to use that protocol. The result is that an active attacker on the same network as the victim can potentially decrypt parts of an otherwise encrypted channel. The only immediately workable fix has been to disable the SSLv3 protocol entirely.
This guest post from Intel Java performance architect Eric Kaczmarek (originally published here) explores how to tune Java garbage collection (GC) for Apache HBase focusing on 100% YCSB reads.
Apache HBase is an Apache open source project offering NoSQL data storage. Often used together with HDFS, HBase is widely used across the world. Well-known users include Facebook, Twitter, Yahoo, and more. From the developer’s perspective, HBase is a “distributed, versioned, non-relational database modeled after Google’s Bigtable, a distributed storage system for structured data”. HBase can easily handle very high throughput by either scaling up (i.e., deployment on a larger server) or scaling out (i.e., deployment on more servers).
The Apache Hadoop community has voted to release Hadoop 2.6. Congrats to all contributors!
This new release contains a variety of improvements, particularly in the storage layer and in YARN. We’re particularly excited about the encryption-at-rest feature in HDFS!
Learn about BigBench, the new industrywide effort to create a sorely needed Big Data benchmark.
Benchmarking Big Data systems is an open problem. To address this concern, numerous hardware and software vendors are working together to create a comprehensive end-to-end big data benchmark suite called BigBench. BigBench builds upon and borrows elements from existing benchmarking efforts in the Big Data space (such as YCSB, TPC-xHS, GridMix, PigMix, HiBench, Big Data Benchmark, and TPC-DS). Intel and Cloudera, along with other industry partners, are working to define and implement extensions to BigBench 1.0. (A TPC proposal for BigBench 2.0 is in the works.)
The community effort to make Apache Spark an execution engine for Apache Hive is making solid progress.
Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.
Installing CDH on newer unsupported operating systems (such as Ubuntu 13.04 and later) can lead to conflicts. These guidelines will help you avoid them.
Some of the more recently released operating systems that bundle portions of the Apache Hadoop stack in their respective distro repositories can conflict with software from Cloudera repositories. Consequently, when you set up CDH for installation on such an OS, you may end up picking up packages with the same name from the OS’s distribution instead of Cloudera’s distribution. Package installation may succeed, but using the installed packages may lead to unforeseen errors.
Thanks to Guy Harrison of Dell Inc. for the guest post below about time-tested performance optimizations for connecting Oracle Database with Apache Hadoop that are now available in Apache Sqoop 1.4.5 and later.
Back in 2009, I attended a presentation by a Cloudera employee named Aaron Kimball at the MySQL User Conference in which he unveiled a new tool for moving data from relational databases into Hadoop. This tool was to become, of course, the now very widely known and beloved Sqoop!
Cloudera’s culture is premised on innovation and teamwork, and there’s no better example of them in action than our internal hackathon.
Cloudera Engineering doubled-down on its “hackathon” tradition last week, with this year’s edition taking an around-the-clock approach thanks to the HQ building upgrade since the 2013 edition (just look at all that space!).