Cloudera Developer Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
The team behind Hue, the open source Web UI that makes Apache Hadoop easier to use, strikes again with a new Spark app.
Editor’s note: This post was recently published on the Hue blog. We republish it here for your convenience.
From Python, to ZooKeeper, to Impala, to Parquet, blog readers in 2013 were interested in a variety of topics.
Clouderans and guest authors from across the ecosystem (LinkedIn, Netflix, Concurrent, Etsy, Stripe, Databricks, Oracle, Tableau, Alteryx, Talend, Twitter, Dell, SFDC, Endgame, MicroStrategy, Hazy Research, Wibidata, StackIQ, ZoomData, Damballa, Mu Sigma) published prolifically on the Cloudera Developer blog in 2013, with more than 250 new posts, an average of about one per business day.
Apache Accumulo is now generally available on CDH 4.
Cloudera is pleased to announce the immediate availability of its first release of Accumulo packaged to run under CDH, our open source distribution of Apache Hadoop and related projects and the foundational infrastructure for Enterprise Data Hubs.
More and more customers are using automation/configuration management frameworks alongside Cloudera Manager.
As Apache Hadoop clusters continue to grow in size, complexity, and business importance as the foundational infrastructure for an Enterprise Data Hub, the use cases for a robust and mature management console expand.
CDK has a new moniker, but the goals remain the same.
We are pleased to announce a new name for the Cloudera Development Kit (CDK): Kite. We’ve just released Kite version 0.10.0, which is purely a rename of CDK 0.9.0.
Developers, rejoice: Impala is now available on EMR for testing and evaluation.
Very recently, Amazon Web Services announced support for running Cloudera Impala queries on its Elastic MapReduce (EMR) service. This is very good news for EMR users — as well as for users of other platforms interested in kicking Impala’s tires in a friction-free way. It’s also yet another sign that Impala is rapidly being adopted across the ecosystem as the gold standard for interactive SQL and BI queries on Apache Hadoop.
The new RImpala package brings the speed and interactivity of Impala to queries from R.
Our thanks to Austin Chungath, Sachin Sudarshana, and Vikas Raguttahalli of Mu Sigma, a Decision Sciences and Big Data analytics company, for the guest post below.
Flavio Junqueira (PMC Chair of the Apache ZooKeeper project and a member of the Systems and Networking Group at Microsoft Research) and Benjamin Reed (PMC Member and Software Engineer at Facebook) are the co-authors of the new O’Reilly Media book ZooKeeper: Distributed Process Coordination. We had a chat with Flavio and Ben recently about the rationale for writing the book, and what it will add to the distributed systems conversation.
Learn the new features and enhancements in Cloudera Manager 5, including support for YARN, management of third-party apps and frameworks, and more.
The response to the Oct. 2013 release of Cloudera Enterprise 5 Beta has been overwhelming, and Cloudera is working closely with several customers to incorporate their feedback.
The compactions model is changing drastically with CDH 5/HBase 0.96. Here’s what you need to know.
Apache HBase is a distributed data store based upon a log-structured merge tree, so optimal read performance would come from having only one file per store (Column Family). However, that ideal isn’t possible during periods of heavy incoming writes. Instead, HBase will try to combine HFiles to reduce the maximum number of disk seeks needed for a read. This process is called compaction.
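To make the read-cost argument concrete, here is a minimal Python sketch of the idea. It is a toy model, not HBase code: each "HFile" is assumed to be a plain sorted list of (key, value) pairs, and the names `read` and `compact` are hypothetical helpers invented for this illustration. It ignores timestamps, delete markers, Bloom filters, and HBase's actual file-selection policy.

```python
import bisect
import heapq

# Toy model (an assumption for illustration, not HBase's on-disk format):
# each "HFile" is a list of (key, value) pairs sorted by key, and a store
# is a list of such files ordered newest first.

def read(store_files, key):
    """Return (value, files_checked), consulting files newest first."""
    for checked, hfile in enumerate(store_files, start=1):
        # Binary search within one sorted file; (key,) sorts before (key, value).
        i = bisect.bisect_left(hfile, (key,))
        if i < len(hfile) and hfile[i][0] == key:
            return hfile[i][1], checked
    return None, len(store_files)

def compact(store_files):
    """Merge every file into one sorted file, keeping the newest value per key."""
    # Tag entries with the file's age (0 = newest) so that, for equal keys,
    # the newest entry comes out of the merge first.
    tagged = (
        [(key, age, value) for key, value in hfile]
        for age, hfile in enumerate(store_files)
    )
    merged = []
    for key, _age, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence is the newest
    return [merged]

store = [
    [("row2", "v2-new")],                  # newest flush
    [("row1", "v1"), ("row3", "v3")],
    [("row2", "v2-old"), ("row4", "v4")],  # oldest flush
]

before = read(store, "row4")      # must look in all three files
compacted = compact(store)
after = read(compacted, "row4")   # a single file to look in
print(before, after)
```

In this sketch, the worst-case number of files a read touches equals the number of files in the store, which is why merging them lowers read cost. Merging every file at once loosely resembles what HBase calls a major compaction; real compactions typically select only a subset of files.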