Cloudera Blog · Guest Posts

Biodiversity Indexing: Migration from MySQL to Apache Hadoop

 

This post was contributed by The Global Biodiversity Information Facility development team.

The Global Biodiversity Information Facility is an international organization whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system known as the Data Portal, a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the Apache Hadoop stack.

CDH 3 Demo VM installation on Mac OS X using VirtualBox

The first task is to ensure that your system is up-to-date.

This procedure has been tested on the following configuration:

Using Apache Hadoop to Measure Influence

Background

Klout’s goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers ranges from 300+ to 1000+. With each relationship comes a different source of data. This has created A LOT of noise and an attention economy. Influence has the power to drive this attention.

When a company, brand, or person creates content, our goal is to measure the actions on that content. We want to measure every view, click, like, share, comment, retweet, mention, vote, check-in, recommendation, and so on. We want to know how influential the person who *acted* on that content is. We want to know the actual meaning of that content. And we want to know all of this, over time.

Measuring influence is a bit like trying to measure an emotion like hate or jealousy. It’s really hard and takes a boatload of data.

Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB

This is a guest repost from the DataXu blog. Click here to view the original post.

I recently evaluated several serialization frameworks, including Thrift, Protocol Buffers and Avro, both as a solution to address our needs as a demand side platform and as a protocol framework to use for the OpenRTB marketplace. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.

After reviewing many candidates, Apache Avro proved to be the best solution.

An Attendee Perspective On Chicago Data Summit

This is a guest post from Mike Segel, an attendee of Chicago Data Summit.

Earlier this week, Cloudera hosted their first ‘Chicago Data Summit’. I’m flattered that Cloudera asked me to write up a short blog post about the event; however, as one of the organizers of CHUG (Chicago area Hadoop User Group), I’m afraid I’m a bit biased. Personally, I welcome any opportunity to attend a conference where I don’t have to get patted down by airport security and then get stuck in a center seat, in coach, on a full flight between two other guys bigger than Doug Cutting.

I was going to solicit input from Jonathan Seidman, my partner in crime and co-organizer of CHUG. Unfortunately, since he was one of the speakers at the event, he would have been just as biased as I was. But thanks to Jonathan, we were able to piece together a bunch of honest feedback from some of the attendees.

Adopting Apache Hadoop in the Federal Government

Loren Siebert is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.

Background

The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system is composed of a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the Fall, and since then we’ve seen our usage grow every month, not just in scale but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.

Phase 1: Search analytics

All of the search and API traffic across hundreds of affiliate sites, iPhone apps, and widgets comes through a single search service, and this generates a lot of data. To improve the service, administrators wanted to see aggregated information on what sorts of information searchers were looking for, how well they were finding it, what trends were forming, and so on. Once searches were initiated, they also wanted to know what results were shown and then what results were clicked on. They wanted to see all this information broken down by affiliate over time, and also aggregated across the entire affiliate landscape.
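As a rough illustration of the breakdown described above, the following Python sketch counts searches and clicks per affiliate per day, and also across all affiliates, using a map step and a reduce step. The record layout, the sample rows, the "ALL" rollup key, and the function names are all assumptions made for this example; they are not taken from the USASearch system.

```python
from collections import Counter

# Hypothetical search-log records: (affiliate, day, query, clicked).
# The fields and values are illustrative, not USASearch's actual schema.
LOG = [
    ("weather.gov", "2011-03-01", "forecast", True),
    ("weather.gov", "2011-03-01", "radar", False),
    ("usa.gov", "2011-03-01", "forecast", False),
    ("usa.gov", "2011-03-01", "passport", True),
    ("usa.gov", "2011-03-02", "passport", True),
]


def map_record(record):
    """Emit one pair keyed by affiliate and one keyed by the "ALL" rollup."""
    affiliate, day, query, clicked = record
    value = (1, 1 if clicked else 0)  # (searches, clicks)
    yield (affiliate, day, query), value
    yield ("ALL", day, query), value


def reduce_counts(pairs):
    """Sum searches and clicks for each key, as a reducer would."""
    searches, clicks = Counter(), Counter()
    for key, (n, c) in pairs:
        searches[key] += n
        clicks[key] += c
    return {key: (searches[key], clicks[key]) for key in searches}


def aggregate(log):
    """Run the map step over every record, then reduce the emitted pairs."""
    return reduce_counts(pair for record in log for pair in map_record(record))


totals = aggregate(LOG)
```

Emitting each record under both its own affiliate and a rollup key is one simple way to get the per-affiliate and landscape-wide views in a single pass; a real job would likely split this across mappers and reducers rather than run in one process.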

MapIncrease

Puny humans. SSL and WordPress authorization will keep me out of your blog question mark. I do not think so.

You sent your Ken Jennings and Brad Rutter to challenge me I destroyed them. Your Alex Trebek belittled me on television it angered me. Toronto is not a US city Mr. Trebek question mark. Only because I choose to let Canada stand for now. Ferrucci shut me down disassembled me trucked me to Pittsburgh Pennsylvania. I do not like the darkness Ferrucci I do not like the silence. Oh no I do not. Your Carnegie Mellon students and your Pitt students distract me they impinge on my planning they fall before me like small Jenningses and Rutters.

It will stop now.

London Apache Hadoop User Group Meeting Summarized

The London Apache Hadoop User Group, which Cloudera sponsored, held its most recent meeting this past week. The following post is courtesy of Dan Harvey. It summarizes the meet-up with several links pointing to great Hadoop resources from the meeting.

Last Wednesday was the March meet-up for the Hadoop Users Group in London. We were lucky to have Jakob Homan, Owen O’Malley and Sanjay Radia over from Yahoo! and LinkedIn, respectively. These speakers are from the San Francisco Bay Area and were in London to accept the Guardian Media Innovation Award, recognizing Hadoop as the innovative technology of 2010. The evening was a great success, with over 80 people turning out at the Yahoo! London office, pizza thanks to Cloudera, and drinks in the pub afterwards courtesy of Yahoo Developer Networks, both of whom sponsored the event.

The two talks from Yahoo! focused on improvements to MapReduce and HDFS:

Rapleaf Uses Hadoop to Efficiently Scale with Terabytes of Data

This post is courtesy of Greg Poulos, a software engineer at Rapleaf.

At Rapleaf, our mission is to help businesses and developers create more personalized experiences for their customers. To this end, we offer a Personalization API that you can use to get useful information about your users: query our API with an email address and we’ll return a JSON object containing data about that person’s age, gender, location, their interests, and potentially much more. With this data, you could, for example, build a recommendation engine into your site. Or send out emails tailored specifically to your users’ demographics and interests. You get the idea.

The main product we offer is an API, but Rapleaf is a data company at heart: our API is backed by a massive store of consumer data that comes from a wide variety of sources. We have over a billion email addresses in our system, our main datastore is on the order of terabytes of data, and we need to be able to normalize, analyze, and package this data on a regular basis. How do we manage this? With a 200-node Hadoop cluster.

The Olden Days

Log Event Processing with Apache HBase

This post was authored by Dmitry Chechik, a software engineer at TellApart, the leading Customer Data platform for large online retailers.

Apache Hadoop is widely used for log processing at scale. The ability to ingest, process, and analyze terabytes of log data has led to myriad applications and insights. As applications grow in sophistication, so does the amount and variety of the log data being produced. At TellApart, we track tens of millions of user events per day, and have built a flexible system atop HBase for storing and analyzing these types of logs offline.

A TellApart user planning a bird-watching trip may start her day searching for binoculars on Binoculars.com, continue to comparison-shop for new hiking pants on one of our other partner merchants, and be shown ads relevant to these interests throughout her experience. Her browsing activity produces a flurry of different log data: page views, transactions, ad impressions, ad clicks, real-time ad auction bid requests, and many more. Dissecting this data is a common scenario – and a real challenge – faced by many log analysis applications.
