Cloudera Engineering Blog · Hadoop Posts
Using an appropriate network representation and the right tool set are the key factors in successfully merging structured and time-series data for analysis.
In Part 1 of this series, you took your first steps for using Apache Giraph, the highly scalable graph-processing system, alongside Apache Hadoop. In this installment, you’ll explore a general use case for analyzing time-dependent, Big Data graphs using data from multiple sources. You’ll learn how to generate random large graphs and small-world networks using Giraph – as well as play with several parameters to probe the limits of your cluster.
In its relatively short lifetime (co-founded by Twitter and Cloudera in July 2013), Parquet has already become the de facto standard for columnar storage of Apache Hadoop data — with native support in Impala, Apache Hive, Apache Pig, Apache Spark, MapReduce, Apache Tajo, Apache Drill, Apache Crunch, and Cascading (and forthcoming in Presto and Shark). Parquet adoption is also broad-based, with employees of the following companies (partial list) actively contributing:
Learn how to convert your data to the Parquet columnar format to get big performance gains.
Using a columnar storage format for your data offers significant performance advantages for a large subset of real-world queries. (Click here for a great introduction.)
Thanks to recent work upstream, YARN is now a highly available service. This post explains its architecture and configuration details.
YARN, the next-generation compute and resource management framework in Apache Hadoop, until recently had a single point of failure: the ResourceManager, which coordinates work in a YARN cluster. With planned (upgrades) or unplanned (node crashes) events, this central service, and YARN itself, could become unavailable.
Understanding some key differences between MR1 and MR2/YARN will make your migration much easier.
Here at Cloudera, we recently finished a push to get Cloudera Enterprise 5 (containing CDH 5.0.0 + Cloudera Manager 5.0.0) out the door along with more than 100 partner certifications.
The community has voted to release Apache Hadoop 2.4.0.
Hadoop 2.4.0 includes myriad improvements to HDFS and MapReduce, including (but not limited to):
More good news for the ecosystem!
The GA release of Cloudera Enterprise 5 signifies the evolution of the platform from a mere Apache Hadoop distribution into an enterprise data hub.
We are thrilled to announce the GA release of Cloudera Enterprise 5 (comprising CDH 5.0 and Cloudera Manager 5.0).
The integration of Apache Sentry with Apache Solr helps Cloudera Search meet important security requirements.
As you have learned in previous blog posts, Cloudera Search brings the power of Apache Hadoop to a wide variety of business users via the ease and flexibility of full-text querying provided by Apache Solr. We have also done significant work to make Cloudera Search easy to add to an existing Hadoop cluster:
Cost-per-performance, not cost-per-capacity, turns out to be the better metric for evaluating the true value of SSDs.
In the Big Data ecosystem, solid-state drives (SSDs) are increasingly considered a viable, higher-performance alternative to rotational hard-disk drives (HDDs). However, few results from actual testing are available to the public.