Taking a thoughtful approach to data serialization can yield significant performance improvements for HBase deployments.
The question of using tall versus wide tables in Apache HBase is a commonly discussed design decision (see references here and here). However, there is more to consider than that simple choice. Because HBase stores each column value of a table as an independent entry in the underlying HFiles, repeating the row key, column family, and qualifier with every cell, significant storage overhead can occur when storing small pieces of information.
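To make that overhead concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the classic uncompressed KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and ignores compression, data block encoding, tags, and MVCC information, all of which change the real numbers. The function names, sample row key, and column names are hypothetical and chosen only for illustration.

```python
# Fixed bytes per cell in the assumed KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate on-disk size of a single cell before compression/encoding."""
    return FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)


def wide_row_size(row: bytes, family: bytes, columns: dict) -> int:
    """Size when every field is stored as its own column (one cell per field)."""
    return sum(keyvalue_size(row, family, q, v) for q, v in columns.items())


def packed_row_size(row: bytes, family: bytes, qualifier: bytes, columns: dict) -> int:
    """Size when all fields are serialized into one value under one qualifier.

    Plain concatenation stands in for a real serialization format here;
    a format such as Avro would add a small amount of its own framing.
    """
    blob = b"".join(columns.values())
    return keyvalue_size(row, family, qualifier, blob)


# Hypothetical record: three small fields under an 8-byte row key.
row, family = b"user0001", b"d"
fields = {b"age": b"\x2a", b"zip": b"60601", b"st": b"IL"}

wide = wide_row_size(row, family, fields)
packed = packed_row_size(row, family, b"rec", fields)
print(f"one cell per field: {wide} bytes, single packed cell: {packed} bytes")
```

Under these assumptions, the three-cell layout spends most of its bytes repeating the key, while the packed layout pays that cost once. The trade-off is that individual fields in a packed value can no longer be read or updated independently.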
Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working at Allstate Insurance, for the guest post below about his experiences comparing use of the Apache Avro and Apache Parquet file formats with Apache Spark.
Over the last few months, numerous hallway conversations, informal discussions, and meetings have occurred at Allstate about the relative merits of different file formats for data stored in Apache Hadoop—including CSV,
Learn how to improve Apache HBase usability by creating a custom formatter for viewing binary data types in the HBase shell.
Cloudera customers are looking to store complex data types in Apache HBase to provide fast retrieval of complex information such as banking transactions, web analytics records, and related metadata associated with those records. Serialization formats such as Apache Avro, Thrift, and Protocol Buffers greatly assist in meeting this goal,
Thanks to Big Data Solutions Architect Matthieu Lieber for allowing us to republish the post below.
A customer of mine wants the best of both worlds: to keep working with his existing Apache Avro data, with all of the advantages that it confers, while also taking advantage of the predicate push-down features that Parquet provides. How can the two be reconciled?
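For readers unfamiliar with predicate push-down, the following toy Python model illustrates the general idea: a columnar file keeps minimum and maximum statistics per group of rows, so a reader can skip whole groups that cannot satisfy a filter. This is a conceptual sketch only, not Parquet's actual API or file layout; the `RowGroup` class, the `scan_greater_than` function, and the `amount` column are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A toy stand-in for a group of rows plus its column statistics."""
    rows: List[Dict[str, int]]

    def stats(self, column: str) -> Tuple[int, int]:
        # A real columnar format records min/max at write time in file
        # metadata; this toy model recomputes them on demand for brevity.
        values = [r[column] for r in self.rows]
        return min(values), max(values)


def scan_greater_than(groups: List[RowGroup], column: str, threshold: int):
    """Return rows where column > threshold, and how many groups were read."""
    matches, groups_read = [], 0
    for group in groups:
        _, hi = group.stats(column)
        if hi <= threshold:
            # No row in this group can match, so skip it entirely.
            continue
        groups_read += 1
        matches.extend(r for r in group.rows if r[column] > threshold)
    return matches, groups_read


groups = [
    RowGroup([{"amount": 5}, {"amount": 12}]),
    RowGroup([{"amount": 40}, {"amount": 95}]),
    RowGroup([{"amount": 20}, {"amount": 30}]),
]
matches, groups_read = scan_greater_than(groups, "amount", 35)
print(f"{len(matches)} matching rows, {groups_read} of {len(groups)} groups read")
```

A row-oriented format without such per-group statistics would generally have to deserialize every record to evaluate the same filter, which is the gap the customer wanted to close.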
For more information about combining these formats,
At Cloudera, there is a long and proud tradition of employees creating new open source projects intended to help fill gaps in platform functionality (in addition to hiring new employees who have done so in the past). In fact, more than a dozen ecosystem projects — including Apache Hadoop itself — were founded by Clouderans, more than can be attributed to employees of any other single company. Cloudera was also the first vendor to ship most of those projects as enterprise-ready bits inside its platform.