Cloudera Developer Blog · HDFS Posts

High Availability for the Hadoop Distributed File System (HDFS)

Background

Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.

HDFS has long been considered a highly reliable file system.  An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.

Apache Hadoop for Archiving Email – Part 2

Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.

Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:

Apache Lucene and Apache Solr

Hadoop World 2011: A Glimpse into Development

The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.

Preview of Development Track Sessions

CDH3 Update 1 Released

Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution.  For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.

For those of you who are recent Cloudera users, here is a refresh on our update policy:

Hoop – Hadoop HDFS over HTTP

What is Hoop?

Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.

Hoop can be used to:

Apache Hadoop Availability

A common question on the Apache Hadoop mailing lists is what’s going on with availability? This post takes a look at availability in the context of Hadoop, gives an overview of the work in progress and where things are headed.

Background

When discussing Hadoop availability people often start with the NameNode since it is a single point of failure (SPOF) in HDFS, and most components in the Hadoop ecosystem (MapReduce, Apache HBase, Apache Pig, Apache Hive etc) rely on HDFS directly, and are therefore limited by its availability. However, Hadoop availability is a larger, more general issue, so it’s helpful to establish some context before diving in.

CDH2 Update 3 Now Available

Cloudera is happy to announce the availability of the third update to version 2 of our distribution for Apache Hadoop (CDH2). CDH2 Update 3 contains a number of important fixes like HADOOP-5203, HDFS-1377, MAPREDUCE-1699, MAPREDUCE-1853, and MAPREDUCE-270. Check out the release notes and change log for more details on what’s in this release. You can find the packages and tarballs on our website, or simply update your systems if you are already using our repositories. More instructions can be found in our CDH documentation.

We appreciate feedback! Get in touch with us on the CDH user list, twitter or IRC (#cloudera on freenode.net) and let us know how the update is working for you.

Using Apache Hadoop for Fraud Detection and Prevention

Fraud has multiple meanings and the term can be easily abused.  The definition of fraud has undergone multiple changes throughout the years and is elusive as well as fraud itself.  The modern legal definition of fraud usually contains a few elements that have to be proven in court and depends on the state/country.  For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California Courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage.  A more general definition may contain up to 9 elements.

From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.

Hadoop Administrator Training Comes to London

Cloudera’s Apache Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Apache Hive, Apache Pig, Apache HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.

Enter the code “london_10pct” when registering and receive a 10% discount!

Hadoop/HBase Capacity Planning

Apache Hadoop and Apache HBase are gaining popularity due to their flexibility and tremendous work that has been done to simplify their installation and use.  This blog is to provide guidance in sizing your first Hadoop/HBase cluster.  First, there are significant differences in Hadoop and HBase usage.  Hadoop MapReduce is primarily an analytic tool to run analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum).  HBase is much better for real-time read/write/modify access to tabular data.  Both applications are designed for high concurrency and large data sizes.  For a general discussions about Hadoop/HBase architecture and differences please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://blog.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase], or Lars George blogs [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html].  We expect a new edition of the Tom White’s Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future as well.

Newer Posts Older Posts