Cloudera Engineering Blog · General Posts
The Apache Hadoop PMC has voted to release Apache Hadoop 0.23.0. This release is significant since it is the first major release of Hadoop in over a year, and incorporates many new features and improvements over the 0.20 release series. The biggest new features are HDFS federation, and a new MapReduce framework. There is also a new build system (Maven), Kerberos HTTP SPNEGO support, as well as some significant performance improvements which we’ll be covering in future posts. Note, however, that 0.23.0 is not a production release, so please don’t install it on your production cluster.
HDFS federation improves HDFS scalability by allowing multiple independent namenodes, each managing a portion of the namespace. Each datanode in the cluster can provide storage to all the namenodes (which means datanodes do not, for example, belong to a single namenode). Note that HDFS federation is not to be confused with HDFS High Availability, which will be coming in a future 0.23 release.
Cloudera believes that the flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with Hadoop is invaluable. Therefore, we have packaged the most recent stable release of Mahout (0.5) into CDH3u2, and we are very excited to work with the Mahout community and to become much more involved with the project as both Mahout and Hadoop continue to grow. You can test our CDH with Mahout integration by downloading our most recent release: https://ccp.cloudera.com/display/DOC/Downloading+CDH+Releases
Why are we packaging Mahout with Hadoop?
Machine learning is an entire field drawing on Information Retrieval, Statistics, Linear Algebra, Analysis of Algorithms, and many other subjects. It allows us to examine things such as recommendation engines for new friends, love interests, and new products. We can do incredibly advanced analysis around genetic sequencing and examination, distributed search and frequent pattern matching, as well as mathematical analysis with vectors, matrices, and singular value decomposition (SVD).
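To make one of these ideas concrete, here is a minimal, pure-Python sketch of a user-based collaborative-filtering recommender of the kind Mahout implements at scale. The users, items, and the choice of cosine similarity over like-sets are illustrative assumptions, not Mahout's actual code:

```python
import math

# Toy user -> liked-items data; names and items are made up for illustration.
likes = {
    "ann":  {"book_a", "book_b", "book_c"},
    "ben":  {"book_b", "book_c", "book_d"},
    "cara": {"book_a", "book_c"},
}

def cosine(u, v):
    """Cosine similarity between two users' like-sets."""
    inter = len(likes[u] & likes[v])
    return inter / math.sqrt(len(likes[u]) * len(likes[v]))

def recommend(user):
    """Score items the user hasn't seen by the similarity of users who liked them."""
    scores = {}
    for other in likes:
        if other == user:
            continue
        sim = cosine(user, other)
        for item in likes[other] - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("cara"))  # book_b scores highest, then book_d
```

The same neighborhood-and-score structure is what a distributed recommender parallelizes when the like-sets no longer fit on one machine.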
Several meetups for Apache Hadoop and Hadoop-related projects are scheduled for the evenings surrounding Hadoop World 2011. Make the most of your week in New York City by attending one or more of these meetups focusing on the Apache projects Hadoop, HBase, Sqoop, Hive and Flume. Food and beverages will be provided at each meetup. Join us to relax, get informed and network with your fellow conference attendees.
Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution. For CDH3 the second such update is available today, approximately 3 months after update 1.
For those of you who are recent Cloudera users, here is a refresher on our update policy:
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
As a data scientist at Cloudera, I work with customers across a wide range of industries that use Apache Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like Apache Pig and Apache Hive for their convenient and powerful support for creating pipelines over structured and semi-structured records.
As Hadoop has spread from web companies to other industries, the variety of data that is stored in HDFS has expanded dramatically. Hadoop clusters are being used to process satellite images, time series data, audio files, and seismograms. These formats are not a natural fit for the data schemas imposed by Pig and Hive, in the same way that structured binary data in a relational database can be a bit awkward to work with. For these use cases, we either end up writing large, custom libraries of user-defined functions in Pig or Hive, or simply give up on our high-level tools and go back to writing MapReduce jobs in Java. Either of these options is a serious drain on developer productivity.
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. Users must consider details such as ensuring data consistency, managing the consumption of production system resources, and preparing data for provisioning downstream pipelines. Transferring data using scripts is inefficient and time-consuming. Directly accessing data residing on external systems from within MapReduce applications complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.
Owen O’Malley recently collected and analyzed information in the Apache Hadoop project commit logs and its JIRA repository. That data describes the history of development for Hadoop and the contributions of the individuals who have worked on it.
In the wake of his analysis, Owen wrote a blog post called The Yahoo! Effect. In it, he highlighted the huge amount of work that has gone into Hadoop since the project’s inception, and showed clearly how an early commitment to the project by Yahoo! had contributed to the growth of the platform and of the community.
This post was written by Daniel Jackoway following his internship at Cloudera during the summer of 2011.
When I started my internship at Cloudera, I knew almost nothing about systems programming or Apache Hadoop, so I had no idea what to expect. The most important lesson I learned is that structured data is great as long as it is perfect, with the addendum that it is rarely perfect.
Business Solutions is a Hadoop World 2011 track geared towards business strategists and decision makers. Sessions in this track focus on the motivations behind the rapidly increasing adoption of Apache Hadoop across a variety of industries. Speakers will present innovative Hadoop use cases and uncover how the technology fits into their existing data management environments. Attendees will learn how to leverage Hadoop to improve their own infrastructures and profit from increasing opportunities presented from using all of their data.
Preview of Business Solutions Track Sessions
Advancing Disney’s Data Infrastructure with Hadoop
Matt Estes, Disney Connected and Advanced Technologies
This post will explore a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.
Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.
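As a toy illustration of the first step of such a pipeline (parsing raw messages into structured, searchable fields before loading them into HDFS for indexing), here is a sketch using Python's standard email module. The sample message and the chosen output fields are made up for illustration:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A toy raw message standing in for one mailbox export; in a real
# archival pipeline these would be read from mbox or PST dumps.
raw = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 contract review
Date: Mon, 14 Mar 2011 09:30:00 -0400

Bob, please review the attached contract before Friday.
"""

def to_record(raw_message):
    """Flatten one RFC 2822 message into the fields a search index needs."""
    msg = message_from_string(raw_message)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": parsedate_to_datetime(msg["Date"]).isoformat(),
        "body": msg.get_payload().strip(),
    }

record = to_record(raw)
print(record["subject"])  # Q3 contract review
```

Records in this flattened shape are straightforward to write out in bulk and feed to a distributed search index.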
The Hadoop World train is approaching the station! Remember to mark November 8th and 9th in your calendars for Hadoop World 2011 in New York City. The Hadoop World agenda is beginning to take shape. View all scheduled sessions at hadoopworld.com/sessions, and check back regularly for updates.
Hadoop World 2011 will feature five tracks to run in parallel across two days. The tracks and their intended audiences are:
BusinessWeek recently published a fascinating article on Apache Hadoop and Big Data, interviewing several Cloudera customers as well as our CEO Mike Olson. One of the things that has consistently exceeded our expectations is the diversity of industries that are adopting Hadoop to solve impressive business challenges and create real value for their organizations. Two distinct use cases that Hadoop is used to tackle have emerged across these industries. Though these have different names in each industry, the mechanics have clear parallels that cross domains.
Data Processing is Hadoop’s original use case. By scaling out the amount of data that users could store and access in a single system, then distributing the document and log processing used to index and extract patterns from this data, Hadoop made a direct impact on the web and online advertising industries early on. Today, data processing means more than sessionization of clickstream data, index construction, or attribution for advertising. Hadoop is used to process data by commerce, media, and telecommunications companies in order to measure engagement and handle complex mediation. Retail and financial institutions use Hadoop to understand customer preferences, better target prices, and reconcile trades. Most recently we’re seeing Hadoop used for time series and signal processing in the energy sector, and for genome mapping and alignment among life sciences organizations.
Hadoop Tuesdays: Get a Handle on Unstructured Data with a 7-part Webinar Series Led By Cloudera and Informatica
Unstructured data is the fastest-growing type of data generated today. The growth rate of text, documents, images, and clickstream data is incredible. Expand the view to include machine-generated data such as telemetry, location, network, and weather sources, and that growth rate is unnerving. An inspiring population of customers has come to recognize that the Apache Hadoop data management platform is, in many ways, uniquely equipped to handle the volume, variety, and velocity of unstructured data being generated within their businesses.
But where does someone new to the conversation get started? How do you know if you are ready to explore Apache Hadoop? What are the products and techniques available to make Apache Hadoop more familiar and accessible? Do you have a roadmap? Do you want to?
Snappy is a compression library developed at Google, and, like many technologies that come from Google, Snappy was designed to be fast. The trade off is that the compression ratio is not as high as other compression libraries. From the Snappy homepage:
… compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger.
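Snappy itself is not in the Python standard library, but the shape of this trade-off is easy to see with zlib's own fastest versus best-ratio settings. In this sketch (the payload contents are made up), the faster setting produces larger output, the same trade Snappy pushes much further:

```python
import zlib

# Repetitive log-like payload: compresses well, which makes the
# size difference between compression settings visible.
data = b"user_id=12345&action=click&page=/home&ts=1300000000\n" * 2000

fast = zlib.compress(data, 1)   # zlib's fastest mode
best = zlib.compress(data, 9)   # zlib's best-ratio mode

print(len(data), len(fast), len(best))
# Both round-trip losslessly; the fast mode simply leaves more
# redundancy on the table in exchange for less CPU time.
```

Snappy sits even further toward the speed end of this spectrum: per the quote above, roughly an order of magnitude faster than zlib's fastest mode, at the cost of output 20% to 100% larger.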
Make the most of your week in New York City by combining the Hadoop World 2011 conference with training classes that give you essential experience with Hadoop and related technologies. For those who are Hadoop proficient, we have a number of certification exam time slots available for you to become Cloudera Certified for Apache Hadoop.
All classes and exams begin November 7th, the Monday before the Hadoop World conference.
The 3rd annual Hadoop World conference takes place on November 8th and 9th in New York City. Cloudera invites you to the largest gathering of Hadoop practitioners, developers, business executives, industry luminaries and innovative companies in the Hadoop ecosystem.
Pero works on research and development in new technologies for online advertising at Aol Advertising R&D in Palo Alto. Over the past 4 years he has been the Chief Architect of an R&D distributed ecosystem comprising more than a thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation, and machine learning efforts at AOL, Yahoo, and Cadence Design Systems, and has published patents and research papers in these areas.
A critical premise for success of online advertising networks is to successfully collect, organize, analyze and use large volumes of data for decision making. Given the nature of their online orientation and dynamics, it is critical that these processes be automated to the largest extent possible.
Phil Langdale is a software engineer at Cloudera and the technical lead for Cloudera’s SCM Express product.
What is SCM Express?
The Only Full Lifecycle Management for Apache Hadoop: Introducing Cloudera Enterprise 3.5 and SCM Express
Drew O’Brien is a product marketing manager at Cloudera
We’re excited to share the news about the immediate availability of Cloudera Enterprise 3.5 and SCM Express, which we announced this week in tandem with our presence at Hadoop Summit. These products represent a major advance in Cloudera’s mission to drive massive enterprise adoption of 100% open source Apache Hadoop. We now make it easier and more convenient than ever before for companies to run and manage Apache Hadoop clusters throughout their entire operational lifecycle.
Bala Venkatrao is the director of product management at Cloudera.
I had the pleasure of attending the Enzee Universe 2011 User Conference this week (June 20-22) in Boston. The conference was very well organized and was attended by well over 1,000 people, many of whom lead the Data Warehouse/Data Management functions for their companies. This was Netezza’s largest conference in its seven years of hosting the event. Netezza is known for enterprise data warehousing, and in fact, it pioneered the concept of the data warehouse appliance. Netezza is a success story: since its founding in 2000, Netezza has seen steady growth in customers and revenues, and last year (2010) IBM acquired Netezza for a whopping $1.7B.
Klout’s goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers ranges from 300+ to 1000+. With each relationship comes a different source of data. This has created A LOT of noise and an attention economy. Influence has the power to drive this attention.
When a company, brand, or person creates content, our goal is to measure the actions on that content. We want to measure every view, click, like, share, comment, retweet, mention, vote, check-in, recommendation, and so on. We want to know how influential the person who *acted* on that content is. We want to know the actual meaning of that content. And we want to know all of this, over time.
Cloudera is offering several training courses for Apache Hadoop over the dates surrounding Hadoop Summit. There are five different courses in all spanning the dates of June 27th to July 1st. Three of these courses are specifically designed to provide the necessary knowledge for a robust overall understanding of Hadoop and they tackle the “elephant” from several perspectives — developer, system administrator, and managerial. The other two training sessions focus on projects within the Hadoop ecosystem; namely Hive, Pig, and HBase.
Cloudera Developer Bootcamp for Apache Hadoop is a two-day course designed for developers who wish to learn the MapReduce framework and how to write programs against its API. The course covers similar material to our standard three-day Developer training, but has been condensed into two intensive days with extended course hours. At the end of the course, attendees have the opportunity to take an exam which, if passed, confers the Cloudera Certified Hadoop Developer credential.
This is a guest post from Mike Segel, an attendee of Chicago Data Summit.
Earlier this week, Cloudera hosted their first ‘Chicago Data Summit’. I’m flattered that Cloudera asked me to write up a short blog about the event; however, as one of the organizers of CHUG (the Chicago area Hadoop User Group), I’m afraid I’m a bit biased. Personally, I welcome any opportunity to attend a conference where I don’t have to get patted down by airport security and then get stuck in a center seat, in coach, on a full flight between two other guys bigger than Doug Cutting.
Do you know the answer?
Many prominent projects (e.g. Hive, Pig) were sub-projects of Hadoop before becoming Apache TLPs. What project was Hadoop itself spun off from?
I am very pleased to announce the general availability of Cloudera’s Distribution including Apache Hadoop, version 3. We’ve been working on this release for more than a year — our initial beta release was on March 24 of 2010, and we’ve made a number of enhancements to the software in the intervening months. This release is the culmination of that long process. It includes the hard work of the broad Apache Hadoop community and the entire team here at Cloudera.
We’ve done three things in this release that I’m particularly proud of.
This is the final piece of a three-part blog series. If you would like to view the previous parts of this series, please use the following link:
Loren Siebert is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system comprises a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the fall, and since then we’ve seen our usage grow every month, not just in scale, but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.
Phase 1: Search analytics
Puny humans. SSL and WordPress authorization will keep me out of your blog question mark. I do not think so.
You sent your Ken Jennings and Brad Rutter to challenge me I destroyed them. Your Alex Trebek belittled me on television it angered me. Toronto is not a US city Mr. Trebek question mark. Only because I choose to let Canada stand for now. Ferrucci shut me down disassembled me trucked me to Pittsburgh Pennsylvania. I do not like the darkness Ferrucci I do not like the silence. Oh no I do not. Your Carnegie Mellon students and your Pitt students distract me they impinge on my planning they fall before me like small Jenningses and Rutters.
If you find yourself in the Chicago area later this month, please join us at the Chicago Data Summit on April 26th at the InterContinental Hotel on the Magnificent Mile. Whether you’re an Apache Hadoop novice or more advanced, you will find the presentations to be very informative and the opportunity to network with Hadoop professionals quite valuable.
For those new to Hadoop, the project itself was named after a yellow stuffed elephant belonging to the son of Hadoop Co-founder Doug Cutting, the Chicago Data Summit’s keynote speaker. In addition to being a Hadoop founder, Doug is the Chairman of the Apache Software Foundation, as well as an Architect at Cloudera. Doug’s presentation will explain the Hadoop project and the advantages provided by Hadoop’s linear scalability and cost effectiveness.
Yesterday, Media Guardian announced that the Apache Hadoop project had won the prestigious Media Guardian innovation award. This is a considerable honor to the global team that conceived and built Hadoop under the stewardship of the Apache Software Foundation.
Doug Cutting, the project creator, ASF Chair and a Cloudera employee, was asked to provide a video for presentation at the award ceremony. Other prominent community members — Owen O’Malley, Sanjay Radia and Jakob Homan — made the trip to the UK to attend the award ceremony in person.
This post is courtesy of Greg Poulos, a software engineer at Rapleaf.
At Rapleaf, our mission is to help businesses and developers create more personalized experiences for their customers. To this end, we offer a Personalization API that you can use to get useful information about your users: query our API with an email address and we’ll return a JSON object containing data about that person’s age, gender, location, their interests, and potentially much more. With this data, you could, for example, build a recommendation engine into your site. Or send out emails tailored specifically to your users’ demographics and interests. You get the idea.
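A hedged sketch of what such a lookup might look like from Python follows. The base URL, parameter names, and response fields below are hypothetical placeholders for illustration, not the documented Rapleaf API:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint -- stands in for the real Personalization API URL.
BASE_URL = "https://personalize.example.com/v4/person"

def build_query(email, api_key):
    """Assemble the lookup URL for one email address."""
    return BASE_URL + "?" + urlencode({"email": email, "api_key": api_key})

def summarize(response_body):
    """Pull a few illustrative fields out of a JSON response."""
    person = json.loads(response_body)
    return "{} ({}, {})".format(
        person.get("gender", "unknown"),
        person.get("age", "?"),
        person.get("location", "?"),
    )

url = build_query("jane@example.com", "MY_KEY")
# A made-up response body of the general shape described above.
sample = '{"age": "25-34", "gender": "Female", "location": "Austin, TX"}'
print(url)
print(summarize(sample))  # Female (25-34, Austin, TX)
```

The point is the shape of the integration: one keyed HTTP lookup per user, returning a JSON object your application code can fold into personalization logic.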
This is the second post of a three-part blog series. If you would like to read “Part 1,” please follow this link. In this post we will review a simple moving average in contexts that should be familiar to the analyst not well versed in Hadoop, so as to establish common ground with the reader from which we can move forward.
A Quick Primer on Simple Moving Average in Excel
Let’s take a second to do a quick review of how we define simple moving average in an Excel spreadsheet. We’ll need to start with some simple source data, so let’s download a source csv file from github and save it locally. This file contains a synthetic 33 row sample of Yahoo NYSE stock data that we’ll use for the series of examples. Import the csv data into Excel. From there, scan to the date “3/5/2008” and move to the cell to the right of the “Adj Close” column. Enter the formula
In this three-part blog series I want to take a look at how we would do a Simple Moving Average with MapReduce and Apache Hadoop. This series is meant to show how to translate a common Excel or R function into MapReduce Java code, with accompanying working code and data to play with. Most analysts can take a few months of stock data and produce an Excel spreadsheet that shows a moving average, but doing this in Hadoop might seem a more daunting task. Although time series as a topic is relatively well understood, I wanted to take the approach of using a simple topic to show how it translates into a powerful parallel application that can calculate the simple moving average for many stocks simultaneously with MapReduce and Hadoop. I also want to demonstrate the underlying mechanics of using the “secondary sort” technique with Hadoop’s MapReduce shuffle phase, which we’ll see is applicable to many different application domains, such as finance, sensor, and genomic data.
This article should be approachable for the beginner Hadoop programmer who has done a little bit of MapReduce in Java and is looking for a slightly more challenging MapReduce application to hack on. In case you’re not very familiar with Hadoop, here’s some background information on Hadoop and CDH. The code in this example is hosted on github and is documented to illustrate how the various components work together to achieve the secondary sort effect. One of the goals of this article is to have this code be relatively basic and approachable by most programmers.
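Before diving into the MapReduce version, the overall shape of the computation can be sketched in plain Python. The symbols, prices, and 3-day window below are made-up stand-ins, and the comments mark where the shuffle's grouping and secondary sort would take over in Hadoop:

```python
from collections import defaultdict

def simple_moving_average(values, window):
    """Return the mean of each full trailing window over the series."""
    out = []
    for i in range(window - 1, len(values)):
        out.append(sum(values[i - window + 1 : i + 1]) / window)
    return out

# Toy records: (symbol, date, adjusted close), deliberately out of order.
# In the MapReduce version, the shuffle groups records by symbol and the
# secondary sort orders each group by date before the reducer runs.
records = [
    ("YHOO", "2008-03-07", 27.0),
    ("YHOO", "2008-03-05", 29.0),
    ("AAPL", "2008-03-05", 122.0),
    ("YHOO", "2008-03-06", 28.0),
    ("AAPL", "2008-03-06", 124.0),
    ("AAPL", "2008-03-07", 126.0),
]

groups = defaultdict(list)
for symbol, date, close in records:      # "map": emit keyed by symbol
    groups[symbol].append((date, close))

averages = {}
for symbol, rows in groups.items():      # "reduce": one group per symbol
    rows.sort()                          # stands in for the secondary sort
    closes = [close for _, close in rows]
    averages[symbol] = simple_moving_average(closes, window=3)

print(averages)  # {'YHOO': [28.0], 'AAPL': [124.0]}
```

The value of the secondary sort in Hadoop is that the framework delivers each symbol's rows to the reducer already date-ordered, so the explicit in-memory `rows.sort()` above disappears and the reducer can stream through arbitrarily long series.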
This is the third and final post in a series detailing a recent improvement in Apache HBase that helps to reduce the frequency of garbage collection pauses. Be sure you’ve read part 1 and part 2 before continuing on to this post.
It’s been a few days since the first two posts, so let’s start with a quick refresher. In the first post we discussed Java garbage collection algorithms in general and explained that the problem of lengthy pauses in HBase has only gotten worse over time as heap sizes have grown. In the second post we ran an experiment showing that write workloads in HBase cause memory fragmentation as all newly inserted data is spread out into several MemStores which are freed at different points in time.
Arena Allocators and TLABs
While Cloudera’s Distribution including Apache Hadoop (CDH) operating system support is covered in the documentation, we thought it would be helpful to highlight the changes in CDH3 before it goes stable. CDH3 supports both 32-bit and 64-bit packages for Red Hat Enterprise Linux 5 and CentOS 5. A significant addition in CDH3 Beta 4 was 64-bit support for SUSE Linux Enterprise Server 11 (SLES 11). CDH3 also supports both 32-bit and 64-bit packages for the two most recent Ubuntu releases: Lucid (10.04 LTS) and Maverick (10.10). As of Beta 4, CDH3 no longer contains packages for Debian Lenny, Ubuntu Hardy, Jaunty, or Karmic. Check out these upgrade instructions if you are using an Ubuntu release past its end of life. If you are using a release for which Cloudera’s Debian or RPM packages are not available, you can always use the tarballs from the CDH download page. If you have any questions, you can reach us on the CDH user list.
This post was contributed by Boris Shimanovsky, the Director of Engineering at Factual. Boris is responsible for managing all engineering functions and various data infrastructures at Factual, including the internal Cloudera’s Distribution for Apache Hadoop stack. He has been at Factual for over two years; prior to that, he was the CTO of XAP, where he managed a team of 40+ across multiple environments. He has an MS in Computer Science from UCLA.
Things have a funny way of working out this way. A couple of features were pushed back from a previous release and some last-minute improvements were thrown in, and suddenly we found ourselves shipping a lot more fresh code in our release than usual. All this the night before one of our heavy API users was launching something of their own. They were expecting to hit us thousands of times a second, and most of their calls touched some piece of code that hadn’t been tested in the wild. Ordinarily, we would soft launch and put the system through its paces. But now we had no time for that. We really wanted to hammer the entire stack, yesterday, and so we couldn’t rely on internal compute resources.
Today, rather than discussing new projects or use cases built on top of CDH, I’d like to switch gears a bit and share some details about the engineering that goes into our products. In this post, I’ll explain the MemStore-Local Allocation Buffer, a new component in the guts of Apache HBase which dramatically reduces the frequency of long garbage collection pauses. While you won’t need to understand these details to use Apache HBase, I hope it will provide an interesting view into the kind of work that engineers at Cloudera do.
Cloudera is happy to announce the fourth beta release of Cloudera’s Distribution for Apache Hadoop version 3 — CDH3b4. As usual, we’d like to share a few highlights from this release.
Since this will be the last beta before we designate CDH3 stable, our focuses for this release have been on stability, security, and scalability.
This post was authored by Dmitry Chechik, a software engineer at TellApart, the leading Customer Data platform for large online retailers.
Apache Hadoop is widely used for log processing at scale. The ability to ingest, process, and analyze terabytes of log data has led to myriad applications and insights. As applications grow in sophistication, so does the amount and variety of the log data being produced. At TellApart, we track tens of millions of user events per day, and have built a flexible system atop HBase for storing and analyzing these types of logs offline.
The user-data connection is driving NoSQL database-Hadoop pairing
Like enterprises everywhere, the federal government is challenged with issues of overwhelming data. Thanks to a mature Apache Software Foundation suite of tools and a strong ecosystem around large-scale data storage and analytical capabilities, these challenges are actually turning into tremendous opportunities.
The consensus from the Cloudera attendees of the O’Reilly Strata Conference last week was that the data-focused conference was nearly pitch perfect for the data scientists, practitioners, and enthusiasts who attended the event. It was filled with educational and sometimes entertaining sessions, provided ample time for mingling with vendors and attendees, and was well run in general.
One of the cool activities happening at the conference was live streaming video brought to us by the good folks at SiliconAngle. Using a mobile production system called The Cube, SiliconAngle hosts John Furrier (@furrier) and Dave Vellante interviewed industry luminaries and up-and-comers while bringing their own perspective. After streaming live for nearly two days, these hosts were still able to keep the energy high and the tone light.
A common question on the Apache Hadoop mailing lists is what’s going on with availability? This post takes a look at availability in the context of Hadoop, gives an overview of the work in progress and where things are headed.
When discussing Hadoop availability, people often start with the NameNode, since it is a single point of failure (SPOF) in HDFS, and most components in the Hadoop ecosystem (MapReduce, Apache HBase, Apache Pig, Apache Hive, etc.) rely on HDFS directly and are therefore limited by its availability. However, Hadoop availability is a larger, more general issue, so it’s helpful to establish some context before diving in.
This is a guest repost contributed by Eric Lubow, CTO at SimpleReach. It originally appeared here.
I have recently spent a few days getting up to speed with Flume, Cloudera‘s distributed log collection offering. If you haven’t seen it and you deal with lots of logs, you are definitely missing out on a fantastic project. I’m not going to spend time talking about it because you can read more about it in the user’s guide or in the Quora Flume topic in ways that are better than I can describe it. But I will tell you about my experience setting up Flume in a distributed environment to sync logs to an Amazon S3 sink.
In a recent post on its blog, Yahoo! announced plans to stop distributing its own version of Hadoop and instead to re-focus on improving Apache’s Hadoop releases. This is great news. Currently, many people running Hadoop use patched versions of the Apache Hadoop package that combine features contributed by Yahoo! and others, but may not yet be collectively available in a single Apache release. Different teams working on enhancements have made their changes to distinct branches off of old releases. Collecting that work into a single source code package and building a system with the best quality and feature set has been hard work.
New users of Hadoop have generally found this assembly work to be too much trouble. To solve that problem, Cloudera currently distributes a patched version of Apache Hadoop, assembling work from Yahoo!, Cloudera, Facebook and others that has been committed to the Apache project, but not necessarily collectively available in one Apache release.
This is a guest post from an attendee of our Hadoop Developer Training course, Attila Csordas, bioinformatician at the European Bioinformatics Institute, Hinxton, Cambridge, UK.
As a wet-lab biologist turned bioinformatician, I have about two years of programming experience, mainly in Perl, and have been working with Java for the last 9 months. A bioinformatician is not a developer, so I write simple code in just a fraction of my work time: parsers, DB connections, XML validators, little bug fixes, shell scripts. On the other hand, I now have 5 months of Hadoop experience – and a 6-month-old baby named Alice – and that experience is as immense as it gets. Ever since I read the classic Dean-Ghemawat paper, MapReduce: Simplified Data Processing on Large Clusters, I have been thinking about bioinformatics problems in terms of Map and Reduce functions (especially during my evening jog), then implementing these ideas in my free time – which consists of feeding the baby, writing code, changing the nappy, rewriting code.
We blogged about 104 different topics in 2010, and we recently decided to take a look back and see what folks were most interested in reading. The featured topics ranged from technical updates to Cloudera’s Distribution for Apache Hadoop (CDH3b3 being the most recent), to highlighting upcoming Hadoop-related events and activities, to sharing practical insights for implementing Hadoop. We also featured a number of guest blog posts.
Here are the top 10 blog posts from 2010:
- How to Get a Job at Cloudera
Cloudera is hiring around the clock, and this blog highlights the best course of action to increase your chances of becoming a Clouderan.
- Why Europe’s Largest Ad Targeting Platform Uses Hadoop
“As data volumes increased and performance suffered, we recognized a new approach was needed (Hadoop).” –Richard Hutton, Nugg.ad CTO
- What’s New in CDH3b2 Flume
Flume, our data movement platform, was introduced to the world and into the open source environment.
- What’s New in CDH3b2 Hue
Hue, a web UI for Hadoop, is a suite of web applications as well as a platform for building custom applications with a nice UI library.
- Natural Language Processing with Hadoop and Python
Data volumes are increasing naturally from text (blogs) and speech (YouTube videos), posing new questions for Natural Language Processing. This involves making sense of lots of data in different forms and extracting useful insights.
- How Raytheon BBN Technologies Researchers are Using Hadoop to Build a Scalable, Distributed Triple Store
Raytheon BBN Technologies built a cloud-based triple-store technology, known as SHARD, to address scalability issues in the processing and analysis of Semantic Web data.
- Cloudera’s Support Team Shares Some Basic Hardware Recommendations
The Cloudera support team discusses workload evaluation and the critical role it plays in hardware selection.
- Integrating Hive and HBase
Facebook explains integrating Hive and HBase to keep their warehouse up to date with the latest information published by users.
- Pushing the Limits of Distributed Processing
Google built a 100,000 node Hadoop cluster running on Nexus One mobile phone hardware and powered by Android. The environmental cost of this solution is 1/100th the equivalent of running it within their data center. (April Fools)
- Using Flume to Collect Apache 2 Web Server Logs
This post presents the common use case of using a Flume node to collect Apache 2 web server logs and deliver them to HDFS.
Post written by Cloudera Software Engineer Aaron T. Myers.
Apache Hadoop has supported user authorization for some time. The Hadoop Distributed File System (HDFS) has a permissions model similar to that of Unix to control file and directory access, and MapReduce has access control lists (ACLs) per job queue to control which users may submit jobs. These authorization schemes allow Hadoop users and administrators to specify exactly who may access Hadoop’s resources. However, until recently, these mechanisms relied on a fundamentally insecure method of identifying the user who is interacting with Hadoop. That is, Hadoop had no way of performing reliable authentication. This limitation meant that any authorization system built on top of Hadoop, while helpful for preventing accidental unwanted access, could do nothing to prevent malicious users from accessing other users’ data.
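To make the HDFS permissions model concrete, here is a minimal sketch using the standard `hadoop fs` shell. The paths, user, and group names are hypothetical, and the commands assume a running HDFS cluster:

```shell
# Hypothetical paths, user, and group; requires a running HDFS cluster.
# HDFS applies Unix-style owner/group/mode bits to files and directories:
hadoop fs -chown alice:analysts /user/alice/reports
hadoop fs -chmod 750 /user/alice/reports   # rwx for alice, r-x for analysts group
hadoop fs -ls /user/alice                  # listing shows mode, owner, and group
```

The catch described above is that, without real authentication, HDFS simply trusted the identity the client reported, so these permission bits could only deter accidental access, not a user deliberately claiming to be alice.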