High Energy Hadoop
We asked Brian Bockelman, a Post Doc Research Associate in the Computer Science & Engineering Department at the University of Nebraska–Lincoln, to tell us how Hadoop is being used to process the results from High-Energy Physics experiments. His response gives insights into the kind and volume of data that High-Energy Physics experiments generate and how Hadoop is being used at the University of Nebraska. -Matt
In the least technical language, most High Energy Physics (HEP) experiments involve colliding highly energetic, common particles together in order to create small, exotic, and incredibly short-lived particles. The Large Hadron Collider (LHC), in particular, collides protons together at an energy of 7 TeV per particle (design energy; turn-on energy will be less than this). The headline search for the LHC is the search for the Higgs boson, but it really probes an energy scale never before seen–it’s expected to be ripe with new physics beyond the Standard Model.
There are two major components of a HEP experiment–the accelerator and the detector (in the case of the LHC, the detectors). The accelerator ring at the LHC is a 27 km ring that runs underneath Switzerland at a depth of 100 to 175 meters, depending on the part of the ring. The design is such that small bunches of protons (about 2808 bunches in the ring at a time) travel around the rings and are collided inside huge particle detectors. The design is that a collision occurs once every 25 nanoseconds (ns), but it will probably be closer to one collision every 75 ns when it turns on later this year.
There are four major experiments on the LHC rings–ATLAS, CMS, LHCb, and ALICE. ATLAS and CMS are general purpose detectors, while LHCb and ALICE are specialized ones–I work for CMS computing (not a physicist). You can think of a detector as an incredibly specialized digital camera wrapped around the beam pipe. Compact Muon Solenoid (CMS) is a layered design, with highly accurate silicon detector nearest to the beam pipe, a very large magnet (which causes charged particles to change course, allowing us to distinguish charged vs non-charged particles, as well as get their energy), and huge, heavy outer layers that capture the energy of very energetic particles. In CMS, we will be getting collisions at a rate of 40MHz; each collision has about 1MB worth of data coming from the various components. This immediately is a problem: 40MHz x 1MB = 320 Tbps, a completely unmanageable amount of data. Luckily, nature provides an answer: many of these beam crossings will result in non-energetic collisions (particles glancing off each other instead of direct collisions) and quantum physics dictates that even in highly energetic collisions, the outcome is statistically determined. This means that a huge percentage of beam crossings are completely uninteresting; there’s a very complex custom compute system (called the trigger) that sifts through the data and records only the most interesting ones. This cuts down the entire collision rate to about 300Hz.
The raw data coming in at 300Hz x 1MB transfers from the online computing system to the offline computing system, which is not time-sensitive and does the majority of the processing. This data goes through the following transformations on its journey from a network in Switzerland down to a physicist sitting anywhere in the world:
- Raw sensor data is turned into reconstructed data. This takes the information about which pixels / elements are turned on and off into possible tracks of particles as well as a lot of data about how the information was constructed. The reconstructed data is again about 1MB.
- Reconstructed data is turned into analysis-oriented data (AOD). This strips out the per-event data until it only contains physics activity–no more detector information, just tracks and types of particles that came out of the event. The AOD data is about 110KB.
- The AOD data is turned physicist-specific ntuples. This will be a small, column-oriented format containing exactly the data the physicist is interested in. This could be as small as a KB per event.
At each transformation, the data is organized into more specific datasets–at the last step, where the physicist selects his specific data, he is usually very specific. The billions of events in all of CMS are condensed down into a few thousand interesting events.
Because of the computational and data requirements, not all of the transformations can occur centrally at CERN. CERN has just enough computing power to write the raw data to tape and reconstruct a small percentage of the data (for data quality monitoring); immediately, it transfers the raw data to large data centers around the world (called Tier 1 centers; there are 7 for CMS), where the raw data is again written to tape and the reconstruction and transformation into AODs is performed. From the Tier 1s, the data (right now, the reconstructed data–the reconstruction algorithms are new enough that folks don’t yet trust the AODs) is transferred to Tier 2s, where it is analyzed by physicists.
In the US, there are 7 Tier 2 sites (about 50 worldwide for CMS); each had a target of 200 TB in 2008 and should get up to 400TB in 2009. Each T2 has to be accessible by all the worldwide collaborators–so we see a wide variety of jobs–and is maintained by 2 administrators: about one for the computing side, one for the storage side.
So, what are the characteristics of the Tier 2 system? We have:
- A large amount of data (400TB is still big these days, right? It seems so small sometimes.)
- A large data rate: Tier 2s download the data the physicists are interested in; this means that we need to be able to burst data as fast as possible–preferably in the range of 10Gbps.
- Need for reliable, but not archival storage. Losing data is a huge headache, but at least most of it can be re-downloaded. For user data, it can usually be regenerated, but folks can be seriously upset if they get pushed back by a few weeks.
- A small amount of effort (For running cutting-edge technology, 2 people get spread awfully thin).
- A need for interoperability for other sites. Not all T2 sites will ever use the same technology, and the Tier 1 sites need deep integration with tape libraries; we need to use standard interfaces to transfer data between sites.
So, how did we end up looking at Hadoop? Grid computing in the LHC has existed for around 10 years, so there are other choices available in the area that we can choose from and have many years of experience with. However, we believe that the Hadoop Distributed File System (HDFS) is superior to these for Tier 2s because of:
- Manageability: A lot of the storage systems in our area are designed either for reliable hardware or to be backed up by tape, and then adopted to the Tier 2 sites. It turns out that, beyond a certain number of boxes, even reliable hardware is “unreliable”. Because HDFS is designed from ground-up with commodity hardware in mind, some of the common tasks–decommissioning a node for example–are straightforward and built into the filesystem itself. In the case of many node failures, the fsck utility is *wonderful*: simple, fast, and straightforward.
- Reliability: The replication features of HDFS work strikingly well. We keep two replicas, which makes data loss theoretically fairly unlikely but still possible: we find that the replication features work just as well as advertised. This is very important: it decouples the loss of a hardware with admin interventions.
- Usability: This is perhaps less important to the physicists (who have large computing organizations that can provide a layer to make I/O transparent to the end user) as it is to our site: having a mountable file system through FUSE means that we can now re-use the system we’ve built for the physicists for a wide range of users (not to mention that we can offer MapReduce services to our campus users).
- Scalability: By having the worker nodes and the storage systems co-located, we get far superior scalability compared to having a separate storage cluster. Hands-down, there are few systems that can pump out as much data as fast as HDFS does.
We use HDFS as a generic, distributed file system for now by mounting it onto our worker nodes with FUSE. On top of that, we layer SRM (which is a generic web-services interface) and a Globus GridFTP server (a standard grid protocol for WAN transfers). The FUSE components allow our physicists’ applications–consisting of several million lines of C++, probably of comparable size to the Linux kernel–to access HDFS without modification, while the two grid components allow us to interoperate with other, non-Hadoop sites.
So, where do I see us going in the future? I understand that the batch scheduler we are using (Condor) is starting to explore ways to integrate HDFS into its own system in order to deliver both compute and data scheduling, which has many potential possibilities. I believe that the physicists’ transformation and reduction of data is very similar to the MapReduce paradigm, and there might someday be small explorations into using the MapReduce components of Hadoop–but that’s pretty far off. With millions of lines, and a new detector coming online, the emphasis in this field is on stability and bullet-proofing existing systems.
What do I see as needed developments? Off the top of my head, here’s what I think:
- As we work with national labs, there’s an emphasis on security. We’re getting awfully anxious about seeing strong authentication come to the file system.
- As we are getting closer to having a production system, we’d really like to see a high-availability solution for HDFS start to happen–especially if it included a way to automatically sync checkpoints from within the system, instead of externally.
- The physicist’s workflow isn’t entirely event based: it’s somewhat akin to a column store. Each event has many, many pieces of data. If a physicist only needs a small subset of the data (such as, a few KB out of a MB event), then it’s much more convenient to use a column-based layout (putting similar types of data contiguously in the file) instead of an event-based layout (putting each event contiguously in the file). The underlying application they use, ROOT, uses somewhat of a hybrid between the two. Because of this, the I/O patterns can be horribly random. I’d really like to have a strong understanding of HDFS’s random I/O performance, a somewhat predictable model or guideline for how it works out, and a strong knowledge of how this affects the ROOT-based I/O.
I guess you can say that HEP is a large field with a long background in big computing. A lot of folks are watching the investigations happening in Nebraska, and seem enthusiastic to see it succeed. However, as in any field with a broad range of users and sites, we are proceeding carefully and cautiously, and trying to do thorough investigations so we make sure we understand the pro’s and con’s before diving in. HDFS has held up very well in our trials at Nebraska; I fully expect it to be accepted as a CMS storage element and probably deployed at other sites who are also looking for new storage solutions.