5 Common Questions About Apache Hadoop

Categories: General Hadoop

There’s been a lot of buzz about Apache Hadoop lately. Just the other day, some of our friends at Yahoo! reclaimed the terasort record from Google using Hadoop, and the folks at Facebook let on that they ingest 15 terabytes a day into their 2.5 petabyte Hadoop-powered data warehouse.

But many people still find themselves wondering just how all this works, and what it means to them. We get a lot of common questions while working with customers, speaking at conferences, and teaching new users about Hadoop. If you, or folks you know, are trying to wrap their head around this Hadoop thing, we hope you find this post helpful.

Introduction: Throwing out Fundamental Assumptions
When Google began ingesting and processing the entire web on a regular basis, no existing system was up for the task. Managing and processing data at this scale was simply never considered before.

To address the massive scale of data first introduced by the web, but now commonplace in many industries, Google built systems from the ground up to reliably store and process petabytes of data.

Many of us have been trained and conditioned to accept the common assumptions that traditional systems are built upon. Hadoop throws many of these assumptions out. If you can set aside these fundamental assumptions you’ll be one step closer to understanding the power of Hadoop.

Assumption 1: Hardware can be reliable.
It is true that you can pay a significant premium for hardware with a mean time to failure (MTTF) that exceeds its expected lifespan. However, working with web scale data requires thousands of disks and servers. Even an MTTF of 4 years results in nearly 5 failures per week in a cluster of 1,000 nodes. For a fraction of the cost, using commodity hardware with an MTTF of 2 years, you can expect just shy of 10 failures per week. In the big scheme of things, these scenarios are nearly identical, and both require fundamentally rethinking fault tolerance. In order to provide reliable storage and computation at scale, fault tolerance must be provided through software. When this is achieved, the economics of “reliable” hardware quickly fall apart.

Assumption 2: Machines have identities.
Once you accept that all machines will eventually fail, you need to stop thinking of them as individuals with identities, or you will quickly find yourself trying to identify a machine which no longer exists. It is obvious that if you want to leverage many machines to accomplish a task, they must be able to communicate with each other. This is true, but to deal effectively with the reality of unreliable hardware, communication must be implicit. It must not depend on machine X sending some data Y to machine Z, but rather some machine saying that some other machine must process some data Y. If you maintain explicit communication at scale, you face a verification problem at least as large as your data processing problem. This shift from explicit to implicit communication allows the underlying software system to reliably ensure data is stored and processed without requiring the programmer to verify successful communication, or more importantly, allow them to make mistakes doing so.

Assumption 3: A data set can be stored on a single machine.
When working with large amounts of data, we are quickly confronted with data sets that exceed the capacity of a single disk and are intractable for a single processor. This requires changing our assumptions about how data is stored and processed. A large data set can actually be stored in many pieces across many machines to facilitate parallel computation. If each machine in your cluster stores a small piece of each data set, any machine can process part of any data set by reading from its local disk. When many machines are used in parallel, you can process your entire data set by pushing the computation to the data and thus conserving precious network bandwidth.

By abandoning these three assumptions, you can began to understand how “shared nothing” architecture principles enable Hadoop to provide reliable infrastructure using unreliable commodity hardware.

With this laid out, let’s tackle a few questions we hear a lot…

Does Hadoop replace databases or other existing systems?
No. Hadoop is not a database nor does it need to replace any existing data systems you may have. Hadoop is a massively scalable storage and batch data processing system. It provides an integrated storage and processing fabric that scales horizontally with commodity hardware and provides fault tolerance through software. Rather than replace existing systems, Hadoop augments them by offloading the particularly difficult problem of simultaneously ingesting, processing and delivering/exporting large volumes of data so existing systems can focus on what they were designed to do: whether that be serve real time transactional data or provide interactive business intelligence. Furthermore, Hadoop can absorb any type of data, structured or not, from any number of sources. Data from many sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide. Lastly, these results can be delivered to any existing enterprise system for further use independent of Hadoop.

For example, consider an RDBMS used for serving real-time data and ensuring transactional consistency. Asking that same database to generate complex analytical reports over large volumes of data is performance intensive and detracts from its ability to do what it does well. Hadoop is designed to store vast volumes of data, process that data in arbitrary ways, and deliver that data to wherever it is needed. By regularly exporting data from an RDBMS, you can tune your database for its interactive workload and use Hadoop to conduct arbitrarily complex analysis offline without impacting your real-time systems.

How does MapReduce relate to Hadoop and other systems?
Hadoop is an open source implementation of both the MapReduce programming model, and the underlying file system Google developed to support web scale data.

The MapReduce programming model was designed by Google to enable a clean abstraction between large scale data analysis tasks and the underlying systems challenges involved in ensuring reliable large-scale computation. By adhering to the MapReduce model, your data processing job can be easily parallelized and the programmer doesn’t have to think about the system level details of synchronization, concurrency, hardware failure, etc.

What limits the scalability of any MapReduce implementation is the underlying system storing your data and feeding it to MapReduce.

Google chose not to use a RDBMS for this because the overhead incurred from maintaining indexes, relationships and transactional guarantees isn’t necessary for batch processing. Rather, they designed a new distributed file system, from the ground up, that worked well with “shared nothing” architecture principles. They purposefully abandoned indexes, relationships and transactional guarantees because they limit horizontal scalability and slow down loading and batch data processing for semi and unstructured data.

Some RDBMSs provide MapReduce functionality. In doing so, they provide an easy way for programmers to create more expressive queries than SQL allows in a way that does not induce any additional scalability constraints on the database. However, MapReduce alone does not address the fundamental challenge of scaling out a RDBMS in a horizontal manner.

If you need indexes, relationships and transactional guarantees, you need a database. If you need a database, one that supports MapReduce will allow for more expressive analysis than one that does not.

However, if your primary need is a highly scalable storage and batch data processing system, you will often find that Hadoop utilizes commodity resources effectively to deliver a lower cost per TB for data storage and processing.

How do existing systems interact with Hadoop?
Hadoop often serves as a sink for many sources of data because Hadoop allows you to store data cost effectively and process that data in arbitrary ways at a later time. Because Hadoop doesn’t maintain indexes or relationships, you don’t need to decide how you want analyze your data in advance. Let’s look at how various systems get data into Hadoop.

Databases: Hadoop has native support for extracting data over JDBC. Many databases also have bulk dump / load functionality. In either case, depending on the type of data, it is easy to dump the entire database to Hadoop on a regular basis, or just export the updates since the last dump. Often, you will find that by dumping data to Hadoop regularly, you can store less data in your interactive systems and lower your licensing costs going forward.

Log Generators: Many systems, from web servers to sensors, generate log data and some do so at astonishing rates. These log records often have a semi-structured nature and change over time. Analyzing these logs is often difficult because they don’t “fit” nicely in relational databases and take too long to process on a single machine. Hadoop makes it easy for any number of systems to reliably stream any volume of logs to a central repository for later analysis. You often see users dump each day’s logs into a new directory so they can easily run analysis over any arbitrary time-frame’s worth of logs.

Scientific Equipment: As sensor technology improves, we see many scientific devices ranging from imagery systems (medical, satellite, etc) to DNA sequencers to high energy physics detectors, generating data at rates that vastly exceed both the write speed and capacity of a single disk. These systems can write data directly to Hadoop and as the data generation rate and processing demands increase, they can be addressed simply adding more commodity hardware to your Hadoop cluster.

Hadoop is agnostic to what type of data you store. It breaks data into manageable chunks, replicates them, and distributes multiple copies across all the nodes in your cluster so you can process your data quickly and reliably later. You can now conduct analysis that consumes all of your data. These aggregates, summaries, and reports can be exported to any other system using raw files, JDBC, or custom connectors.

How can users across an organization interact with Hadoop?
One of the great things about Hadoop is that it exposes massive amounts of data to a diverse set of users across your organization. It creates and drives a culture of data, which in turn empowers individuals at every level to make better business decisions.

When a DBA designs and optimizes a database, she considers many factors. First and foremost is the structure of the data, the access patterns for the data, and the views / reports required of the data. These early decisions limit the types of queries the database can respond to efficiently. As business users request more views on the data, it becomes a constant struggle to maintain performance and deliver new reports in timely manner. Hadoop enables a DBA to optimize the database for its primary workload, and regularly export the data for analytical purposes.

Once data previously locked up in database systems is available for easy processing, programmers can transform this data in any number of ways. They can craft more expressive queries and generate more data / cpu intensive reports without impacting production database performance. They can build pipelines that leverage data from many sources for research, development, and business processes. Programmers may find our online Hadoop training useful, especially Programming with Hadoop.

But working closely with data doesn’t stop with DBAs and programmers. By providing simple high-level interfaces, Hadoop enables less technical users to ask quick, ad-hoc questions about any data in the enterprise. This enables everyone from product managers, to analysts, to executives to participate in, and drive a culture focused on data. For more details, check out our online training for Hive (a data warehouse for Hadoop with an SQL interface) and Pig (a high level language for ad-hoc analysis).

How do I understand and predict the cost of running Hadoop?
One of the nice things about Hadoop is that understanding your costs in advance is relatively simple. Hadoop is free software and it runs on commodity hardware, which includes cloud providers like Amazon. It has been demonstrated to scale beyond tens of petabytes (PB). More importantly, it does so with linear performance characteristics and cost.

Hadoop uses commodity hardware, so every month, your costs decrease or provide more capacity for the same price point. You probably have a vendor you like, and they probably sell a dual quad-core (8 cores total) machine with 4 1TB SATA disks (you specifically don’t want RAID). Your budget and workload will decide whether you want 8 or 16GB of RAM per machine. Hadoop uses thee-way replication for data durability, so your “usable” capacity will roughly be your raw disk capacity divided by 3. For machines with 4x1TB disks, 1 TB is a good estimate for usable space, as it leaves some overhead for intermediate data and the like. It also makes the math really easy. If you use EC2, two extra large instances provide about 1 usable TB.

Armed with your initial data size, the growth rate of that data, and the cost per usable TB, you should be able to estimate the cost of your Hadoop cluster. You will also incur operational costs, but, because the software is common across all the machines, and requires little per-machine tuning, the operational cost scales sub-linearly.

To make things even easier, Cloudera provides our distribution for both local deployments and EC2 free of charge under the Apache 2.0 license. You can learn more at http://blog.cloudera.com//hadoop

Summary: Looking Forward
Today, many enterprise users have the ability to generate and access ever increasing volumes of data. A TB really isn’t that much anymore, and when you consider that you can buy such a disk for about $100, you need to reconsider how much you pay for the surrounding infrastructure to manage that data.

Looking beyond reliable storage and processing, Hadoop represents an understanding of how computer architecture has fundamentally changed in recent years and how those trends effect the foreseeable future. As processors get more parallel, disk density increases, and software becomes more effective at managing unreliable hardware, Hadoop enables you to reliably scale with your data and align costs with commodity trends.


15 responses on “5 Common Questions About Apache Hadoop

  1. seymourjane

    “Hadoop is a massively scalable storage and batch data processing system.”

    This is really helpful and insightly. Thank you

  2. Christophe Bisciglia Post author

    Pete, I’ll try and get that graphic cleaned up and added to this post. Thanks!

  3. Joe Kraska

    In your section “How do I understand and predict the cost of running Hadoop?” you answer the question on how to understand and predict the cost of BUYING a Hadoop system, but, unforunately, say nothing about the cost of RUNNING it. Could you help us understand the staffing requirements for a large Hadoop system better?

    Joe Kraska
    San Diego CA

  4. Patricia McAlernon

    I am in Sardegna at the ISBIS Summer School for two weeks, William Cleveland is presenting on RHIPE, makes it sound all very easy