Cloudera Engineering Blog · Guest Posts
This post was contributed by Friso van Vollenhoven from Xebia. Friso is a consultant at Xebia and he is currently working at the RIPE NCC’s Information Services department on migrating an existing MySQL based solution to use Hadoop and HBase. Xebia performs large scale development projects, consultancy in architecture & auditing and helps to bring your middleware under control. Guiding clients in how to leverage agile is Xebia’s passion.
The RIPE NCC is one of five Regional Internet Registries (RIRs) providing Internet resource allocations, registration services and co-ordination activities that support the operation of the Internet globally. The RIPE NCC also provides services for the benefit of the Internet community at large. Amongst these is the Routing Information Service, which collects and stores Internet routing data from several locations around the globe. At RIPE NCC, we are in the progress of migrating an existing MySQL based system to use Apache HBase as storage backend and Hadoop MapReduce as framework for processing import jobs. In this post we will provide some background on our efforts, how we implemented it on top of Apache Hadoop and HBase and our experiences with using Hadoop and HBase in a real life scenario.
This post was contributed by Jonathan Seidman from Orbitz. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide . You can hear more from Jonathan at Hadoop World October 12th in NYC.
Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub, Additionally, the company operates business-to-business service: Orbitz Worldwide Distribution provides third parties such as Amtrak, Delta, LAN, KLM, Air France and a number of other leading airlines hotel booking capabilities, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. The Orbitz Worldwide sites process millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day. Not all of that data necessarily has value, but much of it does. Unfortunately storing and processing all of that data in our existing data warehouse infrastructure is impractical because of expense and space considerations.
This post was contributed by John Sichi, a committer on the Apache Hive project and a member of the Data Infrastructure team at Facebook.
As many readers may already know, Hive was initially developed at Facebook for dealing with explosive growth in our multi-petabyte data warehouse. Since its release as an Apache project, it has been put into use at a number of other companies for solving big data problems. Hive storage is based on Hadoop‘s underlying append-only filesystem architecture, meaning that it is ideal for capturing and analyzing streams of events (e.g. web logs). However, a data warehouse also has to relate these event streams to application objects; in Facebook’s case, these include familiar items such as fan pages, user profiles, photo albums, or status messages.
Hive can store this information easily, even for hundreds of millions of users, but keeping the warehouse up to date with the latest information published by users can be a challenge, as the append-only constraint makes it impossible to directly apply individual updates to warehouse tables. Up until now, the only practical option has been to periodically pull snapshots of all of the information from live MySQL databases and dump them to new Hive partitions. This is a costly operation, meaning it can be done at most daily (leading to stale data in the warehouse), and does not scale well as data volumes continue to shoot through the roof.
How Raytheon BBN Technologies Researchers are Using Hadoop to Build a Scalable, Distributed Triple Store
This post was contributed by Kurt Rohloff, a researcher in the Information and Knowledge Technologies group of Raytheon BBN Technologies, a wholly owned subsidiary of Raytheon Company.
Using Hadoop to Build a Scalable, Distributed Triple Store
At Cloudera, we frequently work with leading Hadoop developers to produce guest blog posts of general interest to the community. We started a project with Pete Skomoroch a while back, and we were so impressed with his work, we’ve decided to bring Pete on as a regular guest blogger. Pete can show you how to do some pretty amazing things with Hadoop, Pig and Hive and has a particular bias towards Amazon EC2. With that, I’m happy to welcome Pete to the blog, and hope you enjoy his first post as much as we did. -Christophe
Trendingtopics.org was built by Data Wrangling to demonstrate how Hadoop and Amazon EC2 can be used with Rails to power a data-driven website. This post will give an overview of how trendingtopics.org was put together and show some basic approaches for finding trends in log data with Hive. The source code for trendingtopics is available on Github and a tutorial is provided on the Cloudera site which describes many of the data processing steps in greater detail.
Update (May 1 2013): The post below, which is based on an outdated VM, is deprecated. Rather please see the Cloudera QuickStart VM, which runs on VirtualBox, VMware, and KVM.
Cloudera’s Training VM is one of the most popular resources on our website. It was created with VMware Workstation, and plays nicely with the VMware Player for Windows, Linux, and Mac. But VMware isn’t for everyone. Thomas Lockney has managed to get our VM image running on Virtual Box, and has written a step-by-step guide for the community. Thanks Thomas! – Christophe
Disclaimer: Cloudera no longer approves of the recommendations in this post. Please see this documentation for configuration recommendations.
One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep thier NameNode up at ContextWeb. – Christophe
An important part of making sure Apache Hadoop works well for all users is developing and maintaining strong relationships with the folks who run Hadoop day in and day out. Edward Capriolo keeps About.com’s Hadoop cluster happy, and we frequently chew the fat with Ed on issues ranging from administrative best practices to monitoring. Ed’s been an invaluable resource as we beta test our distribution and chase down bugs before our official releases. Today’s article looks at some of Ed’s tricks for monitoring Hadoop with Cacti through JMX. -Christophe
You may have already read Philip’s Hadoop Metrics post, which provides a general overview of the Hadoop Metrics system. Here, we’ll examine Hadoop monitoring with Cacti through JMX.
What is Cacti?
Yesterday, Chris Goffinet from Digg made a great blog post about LZO and Hadoop. Many users have been frustrated because LZO has been removed from Apache Hadoop’s core, and Chris highlights a great way to mitigate this while the project identifies an alternative with a compatible license. We liked the post so much, we asked Chris to share it with our audience. Thanks Chris! -Christophe
So at Digg, we have been working our own Apache Hadoop cluster using Cloudera’s distribution. One of the things we have been working through is how can we split our large compressed data and run them in parallel on Hadoop? One of the biggest drawbacks from compression algorithms like Gzip is that you can’t split them into multiple mappers. This is where LZO comes in.