Cloudera Developer Blog · Guest Posts
This post was contributed by Bob Gourley, editor, CTOvision.com.
You are no doubt aware of the interesting situation we face with data today: the amount of data being created is growing faster than humans can analyze it, yet fast analysis of that data can help humanity solve some very tough challenges. This reality is moving the globe toward new “Big Data” solutions.
Government use of Big Data is of particular note.
The government has special abilities to focus research in areas like Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, and Computer Security. Each of those can be well served by Big Data approaches. There are also many areas of direct government service to citizens that can benefit from Big Data solutions. Today’s citizens receive government information faster than ever before, thanks to Apache Hadoop-based solutions like those at USAsearch.gov.
Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), explained why the analyses of these recordings are interesting from a scientific perspective, and detailed our implementation of these analyses using Apache Hadoop and Apache Hive (Part II). This final part cuts straight to the results, then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.
Here are two plots of the output data from our benchmark run. Both plots show the same data, one in three dimensions and the other in a two-dimensional density format.
As mentioned in Part I, although Apache Hadoop and other Big Data technologies are typically applied to I/O-intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU-intensive workloads. In this work, we used Hadoop and Hive to perform digital signal processing on individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for the intermediate output data. With Hadoop and Apache Hive, we were not only able to apply parallelism to the various processing steps, but also gained the benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will describe the tradeoffs between the Matlab and Hadoop/Hive approaches, performance results, and several issues identified with using Hadoop/Hive in this type of application.
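To make the parallelization idea concrete, here is a minimal, hypothetical sketch (not the authors’ actual code) of a MapReduce mapper that keys raw voltage samples by electrode, so that each reducer receives one electrode’s complete signal and can run the CPU-intensive signal-processing step over it. The tab-separated input layout and class name are assumptions for illustration; a Hive table over the same files would then support the ad hoc queries mentioned above.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: groups voltage samples by electrode so each reducer
// can apply the CPU-intensive DSP step to one electrode's full recording.
// Input lines are assumed to be: electrodeId <TAB> timestampMicros <TAB> voltage
public class VoltageByElectrodeMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Text electrode = new Text();
  private final Text sample = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t");
    if (fields.length < 3) {
      return; // skip malformed lines rather than failing the whole task
    }
    electrode.set(fields[0]);
    sample.set(fields[1] + "\t" + fields[2]); // timestamp + voltage
    context.write(electrode, sample);         // parallelism across electrodes
  }
}
```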
For this work, we used a university Hadoop computing cluster. Note that it is blade-based and not an ideal configuration for Hadoop, because each node has only two drive bays. It has these specifications:
In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for readers considering Apache Hadoop and Apache Hive as options for meeting their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, Part III wraps up the series with a description of some of our main results and, perhaps most importantly, the lessons we learned along the way and possibilities for future improvements.
About a year ago, after hearing increasing buzz about big data in general, and Hadoop in particular, I (Brad Rubin) saw an opportunity to learn more at our Twin Cities (Minnesota) Java User Group. Brock Noland, the local Cloudera representative, gave an introductory talk. I was really intrigued by the thought of leveraging commodity computing to tackle large-scale data processing. I teach several courses at the University of St. Thomas Graduate Programs in Software, including one in information retrieval. While I had taught the abstract principles behind the scale and performance solutions for indexing web-sized document collections, I saw an opportunity to integrate a real-world solution into the course.
Our department had an idle computing cluster. While it wasn’t an ideal Hadoop platform because of the limited number of disk arms in the blade configuration, our computing support staff and a grad student installed Ubuntu and Hadoop on it. We immediately had trouble with frequent crashes, and Brock came by and diagnosed the problem as a hardware memory configuration issue. We got the cluster running just in time for a few student projects in my information retrieval class. We decided to go with Cloudera’s Distribution Including Apache Hadoop (CDH) because learning the technologies and bringing up a new cluster is complex enough on its own, and we wanted the benefit of a software collection, including patches, that was already configured to work together. The mailing lists were also an important benefit, letting us search for solutions to problems others had already hit and get quick responses to new questions from Cloudera employees and other users.
This is a guest re-post from Datameer’s Director of Marketing, Rich Taylor. The original post can be found on the Datameer blog.
Datameer uses D3.js to power our Business Infographic™ designer, so I thought I would show how we visualized the connections in the Apache Hadoop ecosystem: first using only D3.js, and then using Datameer 2.0.
This is a guest post by Assaf Yardeni, Head of R&D for Treato, an online social healthcare solution, headquartered in Israel.
Three years ago I joined Treato, a social healthcare analysis firm, to help treato.com scale up to its present capability. Treato is a new source for healthcare information in which health-related user-generated content (UGC) from across the Internet is aggregated and organized into usable insights for patients, physicians, and other healthcare professionals. With oceans of patient-written, health-related information available on the Web, and more being published every day, Treato needs to collect and process vast amounts of data. Treato is Big Data par excellence, and my job has been to bring it to this stage.
Before the Hadoop era
When I arrived at Treato, the team had already developed a Microsoft-based prototype that could organize a limited amount of health-related UGC into relevant insights, as a proof of concept. The system would:
This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
A year ago I rolled my first Apache Hadoop system into production. Since then, I’ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having backgrounds and data problems similar to mine, few of these people have sunk their teeth into Hadoop. When I go to Hadoop Meetups in San Francisco, I often meet new people who are evaluating Hadoop but have yet to launch a cluster. Based on my own background and experience, I have some ideas about why this is the case.
I studied computer science in school and have worked on a wide variety of computer systems in my career, with a lot of focus on server-side Java. I learned a bit about building distributed systems and working with large amounts of data when I built a pay-per-click (PPC) ad network in 2004. The system is still in operation and at one point was handling several thousand searches per second. As the sole technical resource on the system, I had to educate myself very quickly about how to scale up.
This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.
In early 2010 at RichRelevance, we were searching for a new way to store our long-lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year and had started with the basics: text formats and SequenceFiles. Neither was sufficient. Text formats are not compact enough and can be painful to maintain over time. A basic binary format may be more compact, but it has the same maintenance issues as text. Furthermore, we needed rich data types, including lists and nested records.
After an analysis similar to the one in Doug Cutting’s blog post, we chose Apache Avro. As a result, we were able to eliminate manual version management, reduce joins during data processing, and adopt a new vision for what data belongs in our event logs. On Cyber Monday 2011, we logged 343 million page view events and nearly 100 million other events into Avro data files.
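As a rough illustration of the kind of rich, self-describing records involved, here is a minimal sketch using Avro’s generic Java API to write an event containing a nested record and a list to an Avro data file. The schema, field names, and file path are hypothetical, not RichRelevance’s actual event schema.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroEventLogSketch {
  // Hypothetical page-view event: a nested "user" record plus an array of
  // product ids. The schema travels with the data file, so readers never
  // need out-of-band version information.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
      + "{\"name\":\"timestamp\",\"type\":\"long\"},"
      + "{\"name\":\"user\",\"type\":{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"segment\",\"type\":[\"null\",\"string\"],\"default\":null}]}},"
      + "{\"name\":\"productIds\",\"type\":{\"type\":\"array\",\"items\":\"long\"}}]}";

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    GenericRecord user = new GenericData.Record(schema.getField("user").schema());
    user.put("id", "u-123");
    user.put("segment", null);

    GenericRecord event = new GenericData.Record(schema);
    event.put("timestamp", System.currentTimeMillis());
    event.put("user", user);
    event.put("productIds", Arrays.asList(1001L, 2002L));

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("pageviews.avro")); // compact, splittable binary file
    writer.append(event);
    writer.close();
  }
}
```

Because each data file embeds the schema it was written with, a field added later with a default value lets old and new files be read through a single reader schema, which is the kind of relief from manual version management described above.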
Avoiding Version Management Baggage
This post was contributed by Bob Gourley, editor, CTOvision.com.
The missions and data of governments make the government sector one of particular importance for Big Data solutions. Federal, state, and local governments have special abilities to focus research in areas like Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, Information Search/Discovery, and Computer Security. Government-industry teams are working to deliver Big Data solutions in all of these areas.
The Government Big Data Solutions Award was established by the technology blog CTOvision.com to help facilitate the exchange of best practices, lessons learned and creative ideas for solutions to hard data challenges in the government sector. Nominations are being sought and your input would be most appreciated. Please nominate capabilities and solutions you know hold great potential for mission impact in the government sector.
Pero works on research and development in new technologies for online advertising at Aol Advertising R&D in Palo Alto. Over the past four years he has been the Chief Architect of an R&D distributed ecosystem comprising more than a thousand nodes across multiple data centers. He has also led large-scale contextual analysis, segmentation, and machine learning efforts at AOL, Yahoo, and Cadence Design Systems, and has published patents and research papers in these areas.
A critical premise for the success of online advertising networks is the ability to collect, organize, analyze, and use large volumes of data for decision making. Given their online orientation and dynamics, it is critical that these processes be automated to the greatest extent possible.
Specifically, the success of advertising technology, and its impact on revenue, is directly proportional to its ability to use large amounts of data to compute the proper value of an impression given the unique circumstances of each ad-serving event: the characteristics of the impression, the ad, and the user, as well as the content and the context. As a general rule, more data results in more accurate predictions.
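To make that relationship concrete, here is a toy sketch, entirely hypothetical rather than anything AOL ships, of scoring a single ad-serving event: features of the impression, ad, user, and context are combined with learned weights into a predicted click probability, and the quality of those weights improves with the amount of logged data behind them.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: a logistic model that turns features of an
// ad-serving event into an estimated impression value. The feature names
// and weights are invented; in practice the weights would be learned
// offline from large volumes of logged events.
public class ImpressionValueSketch {

  static double predictedClickProbability(Map<String, Double> features,
                                          Map<String, Double> weights,
                                          double bias) {
    double z = bias;
    for (Map.Entry<String, Double> f : features.entrySet()) {
      z += weights.getOrDefault(f.getKey(), 0.0) * f.getValue();
    }
    return 1.0 / (1.0 + Math.exp(-z)); // logistic link
  }

  public static void main(String[] args) {
    Map<String, Double> weights = new HashMap<String, Double>();
    weights.put("user_segment_sports", 0.8); // learned from historical data
    weights.put("ad_category_shoes", 0.5);
    weights.put("page_topic_running", 1.1);

    Map<String, Double> event = new HashMap<String, Double>();
    event.put("user_segment_sports", 1.0);
    event.put("ad_category_shoes", 1.0);
    event.put("page_topic_running", 1.0);

    double pClick = predictedClickProbability(event, weights, -3.0);
    double valuePerClick = 0.40; // advertiser's value of a click, in dollars
    System.out.printf("expected impression value: $%.4f%n", pClick * valuePerClick);
  }
}
```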