Cloudera Developer Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
Understanding how checkpointing works in HDFS can make the difference between a healthy cluster or a failing one.
Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health. However, checkpointing can also be a source of confusion for operators of Apache Hadoop clusters.
Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.
Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)
Hue users can learn a lot about new features by following a steady stream of new demos.
Hue, the open source Web UI that makes Apache Hadoop easier to use, is now a standard across the ecosystem — shipping within multiple software distributions and sandboxes. One of the reasons for its success is an agile developer community behind it that is constantly rolling out new features to its users.
Cloudera’s own enterprise data hub is yielding great results for providing world-class customer support.
Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.
Hadoop 2.3.0 includes hundreds of new fixes and features, but none more important than HDFS caching.
The Apache Hadoop community has voted to release Hadoop 2.3.0, which includes (among many other things):
Learn how to use Cloudera Search along with RBL-JE to search and index documents in multiple languages.
Our thanks to Basis Technology for providing the how-to below!
Bringing Parquet support to Hive was a community effort that deserves congratulations!
Previously, this blog introduced Parquet, an efficient ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that blog post, Parquet encodes data extremely efficiently and as described in Google’s original Dremel paper. (For more technical details on the Parquet format read Dremel made simple with Parquet, or go directly to the open and community-driven Parquet Format specification.)
Integrating Hue with LDAP can help make your secure Hadoop apps as widely consumed as possible.
Hue, the open source Web UI that makes Apache Hadoop easier to use, easily integrates with your corporation’s existing identity management systems and provides authentication mechanisms for SSO providers. So, by changing a few configuration parameters, your employees can start analyzing Big Data in their own browsers under an existing security policy.
Thanks to the improvements described here, CDH 5 will ship with a version of MapReduce 2 that is just as fast (or faster) than MapReduce 1.
Performance fixes are tiny, easy, and boring, once you know what the problem is. The hard work is in putting your finger on that problem: narrowing, drilling down, and measuring, measuring, measuring.
This FAQ contains answers to the most frequently asked questions about the architecture and configuration choices involved.
In December 2013, Cloudera and Amazon Web Services (AWS) announced a partnership to support Cloudera Enterprise on AWS infrastructure. Along with this announcement, we released a Deployment Reference Architecture Whitepaper. In this post, you’ll get answers to the most frequently asked questions about the architecture and the configuration choices that have been highlighted in that whitepaper.