Cloudera Engineering Blog · Community Posts
This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the Apache HBase community. I started in June with very limited knowledge of both HBase and distributed systems in general, and by September I had managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.
The amount of memory available on a commodity server has increased drastically in step with Moore’s law. Today, it’s very feasible to have up to 96 gigabytes of RAM on a midrange commodity server. This extra memory is good for databases such as HBase, which rely on in-memory caching to boost read performance.
This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
A year ago I rolled my first Apache Hadoop system into production. Since then, I’ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to Hadoop Meetups in San Francisco, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.
This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.
In early 2010 at RichRelevance, we were searching for a new way to store our long-lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the basics – text formats and SequenceFiles. Neither was sufficient. Text formats are not compact enough, and can be painful to maintain over time. A basic binary format may be more compact, but it has the same maintenance issues as text. Furthermore, we needed rich data types, including lists and nested records.
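That need for rich, nested types is exactly what drew us to Avro. As a small illustration (the record and field names below are made up for the sketch, but the notation is Avro’s JSON schema language), here is a record containing a nested record and a list, written as a Python dict:

```python
import json

# A sketch of an Avro schema with a nested record and an array field.
# The names are illustrative; the structure is standard Avro JSON schema.
schema = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "visitor", "type": {           # nested record
            "type": "record",
            "name": "Visitor",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "segments",            # a list of strings
                 "type": {"type": "array", "items": "string"}},
            ],
        }},
    ],
}

schema_json = json.dumps(schema, indent=2)  # what you would store in a .avsc file
```

Because the schema travels with the data, fields can be added (with defaults) over time without breaking old readers, which is a large part of what makes the format maintainable.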
David joined us as part of our intern program, and built the prototype for the distributed log search functionality that’s available as part of Cloudera Manager 3.7. He did an awesome job, and wrote the following blog post which, now that CM3.7 has been released, we’re pleased to publish.
My intern project was to build a log-search tool specialized for Apache Hadoop. My mini-app lets Hadoop cluster admins and operators search their error logs across many machines, filtering by time range, by text in the log message, or by machine (to find entries from the NameNode, for example). The results are then ordered by time and shown to the user.
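The core of that query path can be sketched in a few lines of Python (the function and tuple layout are illustrative, not the tool’s actual code):

```python
from datetime import datetime

# Each log entry is (timestamp, host, message). Keep entries in the
# requested time range whose message contains the search term, then
# order the hits by time, as the tool does before display.
def search_logs(entries, term, start, end):
    hits = [e for e in entries if start <= e[0] <= end and term in e[2]]
    return sorted(hits, key=lambda e: e[0])

entries = [
    (datetime(2011, 8, 1, 10, 5), "datanode-3", "INFO block replicated"),
    (datetime(2011, 8, 1, 10, 7), "namenode-1", "ERROR edit log sync failed"),
    (datetime(2011, 8, 1, 10, 2), "namenode-1", "ERROR checkpoint aborted"),
]

results = search_logs(entries, "ERROR",
                      datetime(2011, 8, 1, 10, 0),
                      datetime(2011, 8, 1, 11, 0))
# results holds the two ERROR entries, earliest first
```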
This blog was originally posted on the Apache Blog: https://blogs.apache.org/flume/entry/flume_ng_architecture
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/flume. Flume NG is the work on a new major revision of Flume and is the subject of this post.
San Francisco, Salesforce.com HQ - Recently there was an Apache HBase pow-wow where project contributors gathered in person to discuss the direction of future HBase releases. This group included a quorum of the core committers from Facebook, StumbleUpon, Salesforce, eBay, and Cloudera, as well as many contributors and users from other companies. This was an open discussion, and in compliance with Apache Software Foundation policies, the agenda and detailed minutes were shared with the community at large so that everyone can chime in before any final decisions are made.
We summarize some of the high-level discussion topics:
This blog was originally posted on the Apache Blog:
Over 30 people attended the inaugural Sqoop Meetup on the eve of Hadoop World in NYC. Faces were put to names, troubleshooting tips were swapped, and stories were topped – with the table-to-end-all-tables weighing in at 28 billion rows.
The third annual Hadoop World conference has come and gone. The two days of conference keynotes and sessions were surrounded by receptions, meetups, and training classes, and marked by plenty of time for networking in hallways and the exhibit hall. The energy at the conference was infectious and the exchange of ideas outstanding. Nearly 1,500 people – almost double last year’s number – attended. They came from 580 companies, 27 countries and 40 of the United States. Big data is clearly a global phenomenon.
Cloudera believes that the flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with Hadoop is invaluable. Therefore, we have packaged the most recent stable release of Mahout (0.5) into CDH3u2, and we are very excited to work with the Mahout community and become much more involved with the project as both Mahout and Hadoop continue to grow. You can test our CDH with Mahout integration by downloading our most recent release: https://ccp.cloudera.com/display/DOC/Downloading+CDH+Releases
Why are we packaging Mahout with Hadoop?
Machine learning is an entire field that draws on information retrieval, statistics, linear algebra, analysis of algorithms, and many other subjects. It allows us to build things such as recommendation engines for new friends, love interests, and new products. We can do incredibly advanced analysis around genetic sequencing and examination, distributed search and frequent-pattern matching, as well as mathematical analysis with vectors, matrices, and singular value decomposition (SVD).
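Mahout’s value is running these techniques at Hadoop scale, but the kernel of, say, an item-to-item recommender is simple enough to sketch in plain Python (the shopping-basket data here is invented for the example):

```python
from collections import Counter
from itertools import combinations

# A toy item-to-item recommender built on co-occurrence counts, the
# simplest relative of the recommendation techniques Mahout implements
# at scale over Hadoop.
def cooccurrence(baskets):
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(item, counts, k=2):
    # Rank every other item by how often it appeared alongside `item`.
    scored = [(pair[1], n) for pair, n in counts.items() if pair[0] == item]
    return [other for other, _ in sorted(scored, key=lambda x: -x[1])[:k]]

baskets = [["milk", "bread", "eggs"],
           ["milk", "bread"],
           ["bread", "eggs"],
           ["milk", "bread"]]

counts = cooccurrence(baskets)
# recommend("milk", counts) ranks "bread" (3 co-occurrences) above "eggs" (1)
```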
Several meetups for Apache Hadoop and Hadoop-related projects are scheduled for the evenings surrounding Hadoop World 2011. Make the most of your week in New York City by attending one or more of these meetups focusing on the Apache projects Hadoop, HBase, Sqoop, Hive and Flume. Food and beverages will be provided at each meetup. Join us to relax, get informed and network with your fellow conference attendees.
This post was contributed by Bob Gourley, editor, CTOvision.com.
The missions and data of governments make the government sector one of particular importance for Big Data solutions. Federal, State and Local governments have special abilities to focus research in areas like Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, Information Search/Discovery, and Computer Security. Government-Industry teams are working to field Big Data solutions in all these fields.
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. Users must consider details like ensuring consistency of the data, the consumption of production system resources, and preparation of the data for provisioning a downstream pipeline. Transferring data using scripts is inefficient and time-consuming. Directly accessing data residing on external systems from within MapReduce applications complicates the applications and exposes the production system to the risk of excessive load originating from cluster nodes.
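This is the gap Sqoop fills. A typical import, sketched below with placeholder connection details (the database, table, and paths are invented for the example), pulls a table into HDFS with a bounded number of parallel tasks:

```python
# The argument list for a typical "sqoop import", assembled here rather
# than executed. The JDBC URL, credentials, table, and target directory
# are placeholders, not a real deployment.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--username", "etl_user",
    "--table", "orders",
    "--target-dir", "/user/etl/orders",
    "--num-mappers", "4",   # cap parallelism to limit load on the source database
]

command_line = " ".join(sqoop_import)
```

Capping `--num-mappers` is how Sqoop addresses the excessive-load risk described above: each map task opens its own connection to the source system, so bounding the task count bounds the load.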
The Enterprise Architecture track at Hadoop World 2011 will provide insight into how Hadoop is powering today’s advanced data management ecosystems and how Hadoop fits into modern enterprise environments. Speakers will discuss architecture and models, demonstrating how Hadoop connects to surrounding platforms. Attendees of the Enterprise Architecture track will learn Hadoop deployment design patterns; enterprise models and system architecture; types of systems managing data that is transferred to Hadoop using Apache Sqoop and Apache Flume; and how to publish data via Apache Hive, Apache HBase and Apache Sqoop to systems that consume data from Hadoop.
Owen O’Malley recently collected and analyzed information in the Apache Hadoop project commit logs and its JIRA repository. That data describes the history of development for Hadoop and the contributions of the individuals who have worked on it.
In the wake of his analysis, Owen wrote a blog post called The Yahoo! Effect. In it, he highlighted the huge amount of work that has gone into Hadoop since the project’s inception, and showed clearly how an early commitment to the project by Yahoo! had contributed to the growth of the platform and of the community.
Business Solutions is a Hadoop World 2011 track geared towards business strategists and decision makers. Sessions in this track focus on the motivations behind the rapidly increasing adoption of Apache Hadoop across a variety of industries. Speakers will present innovative Hadoop use cases and uncover how the technology fits into their existing data management environments. Attendees will learn how to leverage Hadoop to improve their own infrastructures and profit from increasing opportunities presented from using all of their data.
Preview of Business Solutions Track Sessions
Advancing Disney’s Data Infrastructure with Hadoop
Matt Estes, Disney Connected and Advanced Technologies
The Hadoop World train is approaching the station! Remember to mark November 8th and 9th in your calendars for Hadoop World 2011 in New York City. The Hadoop World agenda is beginning to take shape. View all scheduled sessions at hadoopworld.com/sessions, and check back regularly for updates.
Hadoop World 2011 will feature five tracks to run in parallel across two days. The tracks and their intended audiences are
Snappy is a compression library developed at Google, and, like many technologies that come from Google, Snappy was designed to be fast. The trade-off is that its compression ratio is not as high as that of other compression libraries. From the Snappy homepage:
… compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger.
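Snappy itself needs native bindings, but the trade-off the quote describes can be demonstrated with nothing beyond the Python standard library, using zlib’s own fastest and slowest settings as a stand-in:

```python
import zlib

# Compress the same input at zlib's fastest (1) and best (9) levels to
# see the speed-versus-ratio trade-off: less effort, bigger output.
data = b"the quick brown fox jumps over the lazy dog " * 1000

fast = zlib.compress(data, 1)   # fastest, larger result
best = zlib.compress(data, 9)   # slowest, smaller result

assert len(fast) >= len(best)          # better ratio costs more CPU
assert zlib.decompress(fast) == data   # both round-trip losslessly
assert zlib.decompress(best) == data
```

Snappy simply picks a point much further toward the “fast” end of this curve than even zlib’s level 1.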
Make the most of your week in New York City by combining the Hadoop World 2011 conference with training classes that give you essential experience with Hadoop and related technologies. For those who are Hadoop proficient, we have a number of certification exam time slots available for you to become Cloudera Certified for Apache Hadoop.
All classes and exams begin November 7th, the Monday before the Hadoop World conference.
Take advantage of the Early Bird price which has been extended to Friday, September 9, 2011.
The 3rd annual Hadoop World conference takes place on November 8th and 9th in New York City. Cloudera invites you to the largest gathering of Hadoop practitioners, developers, business executives, industry luminaries and innovative companies in the Hadoop ecosystem.
What is Hoop?
Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.
Hoop can be used to:
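As an illustration of what “HDFS operations over HTTP/S” means in practice, the sketch below builds the kind of URL such a gateway serves. The host, port, and parameter names are assumptions for the example, not Hoop’s documented API:

```python
from urllib.parse import urlencode

# Illustrative only: build the kind of URL an HTTP/S file-system gateway
# like Hoop exposes, with the operation carried as a query parameter.
# Host, port, and parameter names here are assumptions for the sketch.
def hdfs_over_http_url(host, path, op, secure=False, **params):
    scheme = "https" if secure else "http"
    query = urlencode(dict(params, op=op))
    return f"{scheme}://{host}{path}?{query}"

url = hdfs_over_http_url("hoop.example.com:14000", "/user/alice/data.txt", "read")
```

Any HTTP client – curl, a browser, or code in any language – can then read and write files without linking against the HDFS client libraries.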
Philip Zeyliger is a software engineer at Cloudera and started the SCM
Two weeks ago, at Hadoop Summit, we released our Service and Configuration Manager (SCM) Express. It’s a dramatically simpler and faster way to get started with Cloudera’s Distribution including Apache Hadoop (CDH). In a previous blog post, we talked in some detail about SCM Express and what it can do for you.
This is a guest repost from Shopzilla’s Tech Blog written by Andrew Look, a Software Engineer at Shopzilla.com. Andrew is responsible for maintaining and constructing SEM systems to manage keyword-based marketing operations; he also has a strong background in highly concurrent web applications and service-oriented architectures.
Having gained a strong interest in Hadoop/NoSQL after prototyping a workflow based on MapReduce/Pig, he is now co-organizer of the Los Angeles Hadoop Users’ Group, evangelizing use of the Hadoop project within the Southern California software community.
Ed Albanese leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.
This week’s announcement about the availability of the Cloudera Connector for IBM Netezza is the achievement of a major milestone, but not necessarily the one you might expect. It’s not just the delivery of a useful software component; it’s also the introduction of a new generation of data management architectures. For literally decades, data management architecture consisted of an RDBMS, a BI tool and an ETL engine. Those three components assembled together gave you a bona fide data management environment. That architecture has been relevant for long enough to withstand the onslaught of data driven by the introduction of ERP, the rise and fall of client/server and several versions of web architecture. But the machines are unrelenting. They keep generating data. And there’s not just more of it, there is more you can—and often need to—do with it.
The times they are a-changin’, and unstructured data is taking over
Bala Venkatrao is the director of product management at Cloudera.
I had the pleasure of attending the Enzee Universe 2011 User Conference this week (June 20-22) in Boston. The conference was very well organized and was attended by well over 1,000 attendees, many of whom lead the data warehouse/data management functions for their companies. This was Netezza’s largest conference in the seven years it has been held. Netezza is known for enterprise data warehousing, and in fact, they pioneered the concept of the data warehouse appliance. Netezza is a success story: since its founding in 2000, Netezza has seen a steady growth in customers and revenues, and last year (2010) IBM acquired Netezza for a whopping $1.7B.
This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion‘s Hadoop infrastructure to support Adconion’s next-generation ad optimization and reporting systems.
This is the first of a two part series about moving away from Amazon’s EMR service to an in-house Apache Hadoop cluster.
This post was contributed by The Global Biodiversity Information Facility development team.
Take advantage of the opportunity to become a Cloudera Certified Developer or Administrator for Apache Hadoop the day before Hadoop Summit, June 28th. This is the first time these certifications have been offered apart from their respective courses – so don’t miss the chance to validate your Hadoop expertise!
There are several exam times throughout the day for your convenience. The Developer exam lasts for 90 minutes, the Administrator exam for 60 minutes.
This is a guest post from Mike Segel, an attendee of Chicago Data Summit.
Earlier this week, Cloudera hosted their first ‘Chicago Data Summit’. I’m flattered that Cloudera asked me to write up a short blog about the event; however, as one of the organizers of CHUG (the Chicago area Hadoop User Group), I’m afraid I’m a bit biased. Personally, I welcome any opportunity to attend a conference where I don’t have to get patted down by airport security and then get stuck in a center seat, in coach, on a full flight, wedged between two other guys bigger than Doug Cutting.
Do you know the answer?
Many prominent projects (e.g. Hive, Pig) were sub-projects of Hadoop before becoming Apache TLPs. What project was Hadoop itself spun off from?
I recently gave a talk at the LA Hadoop User Group about Apache HBase do’s and don’ts. The audience was excellent and had very informed and well-articulated questions. Jody from Shopzilla was an excellent host and I owe him a big thanks for giving me the opportunity to speak with over 60 LA Hadoopers. Since not everyone lives in LA or could make it to the meetup, I’ve summarized some of the salient points here. For those of you with a busy day, here’s the tl;dr:
Loren Siebert is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system comprises a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the fall, and since then we’ve seen our usage grow every month—not just in scale, but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.
Phase 1: Search analytics
The most recent London Apache Hadoop User Group met this past week, which Cloudera sponsored. The following post is courtesy of Dan Harvey. It summarizes the meet-up with several links pointing to great Hadoop resources from the meeting.
Last Wednesday was the March meet-up for the Hadoop Users Group in London. We were lucky to have Jakob Homan, Owen O’Malley, and Sanjay Radia over from Yahoo! and LinkedIn. These speakers are from the San Francisco Bay Area and were in London to accept the Guardian Media Innovation Award, recognizing Hadoop as the innovative technology of 2010. The evening was a great success, with over 80 people turning out at the Yahoo! London office, along with pizza thanks to Cloudera and drinks in the pub afterwards courtesy of the Yahoo! Developer Network; both were sponsors of the event.
If you find yourself in the Chicago area later this month, please join us at the Chicago Data Summit on April 26th at the InterContinental Hotel on the Magnificent Mile. Whether you’re an Apache Hadoop novice or more advanced, you will find the presentations to be very informative and the opportunity to network with Hadoop professionals quite valuable.
For those new to Hadoop, the project itself was named after a yellow stuffed elephant belonging to the son of Hadoop Co-founder Doug Cutting, the Chicago Data Summit’s keynote speaker. In addition to being a Hadoop founder, Doug is the Chairman of the Apache Software Foundation, as well as an Architect at Cloudera. Doug’s presentation will explain the Hadoop project and the advantages provided by Hadoop’s linear scalability and cost effectiveness.
On Monday, we held our second Flume Office Hours at Cloudera HQ in Palo Alto. The intent was to meet informally, to talk about what’s new, to answer questions, and to get feedback from the community to help prioritize features for future releases.
Below is the slide deck from Flume Office Hours:
This post was authored by Dmitry Chechik, a software engineer at TellApart, the leading Customer Data platform for large online retailers.
Apache Hadoop is widely used for log processing at scale. The ability to ingest, process, and analyze terabytes of log data has led to myriad applications and insights. As applications grow in sophistication, so does the amount and variety of the log data being produced. At TellApart, we track tens of millions of user events per day, and have built a flexible system atop HBase for storing and analyzing these types of logs offline.
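A common building block for this kind of system is the row-key design: keys that group one user’s events together and sort the newest first. The sketch below shows the general pattern, not necessarily TellApart’s actual design:

```python
import struct

# An illustrative HBase-style row key for time-ordered user events:
# user id first (so one user's events are contiguous in the table),
# then a big-endian "reversed" timestamp so the newest event sorts
# first in a lexicographic scan.
LONG_MAX = 2**63 - 1

def event_row_key(user_id: str, ts_millis: int) -> bytes:
    return user_id.encode() + b"\x00" + struct.pack(">q", LONG_MAX - ts_millis)

k_old = event_row_key("user42", 1300000000000)
k_new = event_row_key("user42", 1300000100000)
# k_new sorts before k_old, so a scan from the row prefix "user42"
# returns that user's most recent events first.
```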
The user-data connection is driving NoSQL database-Hadoop pairing
Like enterprises everywhere, the federal government is challenged with issues of overwhelming data. Thanks to a mature Apache Software Foundation suite of tools and a strong ecosystem around large-scale data storage and analytical capabilities, these challenges are actually turning into tremendous opportunities.
The consensus from the Cloudera attendees of the O’Reilly Strata Conference last week was that the data-focused conference was nearly pitch-perfect for the data scientists, practitioners, and enthusiasts who attended the event. It was filled with educational and sometimes entertaining sessions, provided ample time for mingling with vendors and attendees, and was well run in general.
One of the cool activities happening at the conference was live streaming video brought to us from the good folks at SiliconAngle. Using a mobile production system called The Cube, SiliconAngle hosts John Furrier (@furrier) and Dave Vellante interviewed industry luminaries and up-and-comers while bringing their own perspective. After streaming live for nearly two days, these hosts were still able to keep the energy high and the tone light.
This post is courtesy of Kumanan Rajamanikkam, Lead Engineer at Wordnik.
Wordnik’s Processing Challenge
At Wordnik, our goal is to build the most comprehensive, high-quality understanding of English text. We make our findings available through a robust REST API and www.wordnik.com. Our corpus grows quickly—up to 8,000 words per second. Performing deep lexical analysis on data arriving at this rate is challenging, to say the least.
Apache Hadoop is increasingly being adopted for storage and processing of large-scale complex data. There are more Hadoop user groups in more locations than ever before and the community surrounding Hadoop is alive and vibrant.
The questions we are all asking include:
Cloudera is happy to announce the availability of the third update to version 2 of our distribution for Apache Hadoop (CDH2). CDH2 Update 3 contains a number of important fixes like HADOOP-5203, HDFS-1377, MAPREDUCE-1699, MAPREDUCE-1853, and MAPREDUCE-270. Check out the release notes and change log for more details on what’s in this release. You can find the packages and tarballs on our website, or simply update your systems if you are already using our repositories. More instructions can be found in our CDH documentation.
What are Kerberos & SPNEGO?
Kerberos is an authentication protocol that provides mutual authentication and single sign-on capabilities. SPNEGO (the Simple and Protected GSSAPI Negotiation Mechanism) is what carries that authentication over HTTP: the server challenges the client to negotiate, and the client answers with a Kerberos token.
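The shape of that HTTP negotiation can be sketched as a toy simulation (no real Kerberos cryptography here, just the challenge/response choreography):

```python
import base64

# A toy simulation of the SPNEGO handshake shape: the server challenges
# with "Negotiate", the client answers with a base64 token, and the
# server verifies it. Real SPNEGO wraps Kerberos tickets, not a shared
# secret as in this sketch.
def server_respond(headers, expected_token):
    auth = headers.get("Authorization", "")
    if not auth.startswith("Negotiate "):
        # No token yet: challenge the client to negotiate.
        return 401, {"WWW-Authenticate": "Negotiate"}
    token = base64.b64decode(auth.split(" ", 1)[1])
    if token == expected_token:
        return 200, {}
    return 403, {}

secret = b"krb5-service-ticket"   # stands in for a real Kerberos ticket

# Round 1: unauthenticated request draws a 401 Negotiate challenge.
status, challenge = server_respond({}, secret)

# Round 2: the client resends with its token and is admitted.
auth_header = "Negotiate " + base64.b64encode(secret).decode()
status2, _ = server_respond({"Authorization": auth_header}, secret)
```

Against a real SPNEGO-protected endpoint, curl drives this exchange for you with `curl --negotiate -u :`.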
We blogged about 104 different topics in 2010, and we recently decided to take a look back and see what folks were most interested in reading. The featured topics ranged from Cloudera’s Distribution for Apache Hadoop technical updates (CDH3b3 being the most recent), to highlighting upcoming Hadoop-related events and activities, to sharing practical insights for implementing Hadoop. We also featured a number of guest blog posts.
Here are the top 10 blog posts from 2010:
- How to Get a Job at Cloudera
Cloudera is hiring around the clock, and this blog highlights the best course of action to increase your chances of becoming a Clouderan.
- Why Europe’s Largest Ad Targeting Platform Uses Hadoop
“As data volumes increased and performance suffered, we recognized a new approach was needed (Hadoop).” –Richard Hutton, Nugg.ad CTO
- What’s New in CDH3b2 Flume
Flume, our data movement platform, was introduced to the world and into the open source environment.
- What’s New in CDH3b2 Hue
Hue, a web UI for Hadoop, is a suite of web applications as well as a platform for building custom applications with a nice UI library.
- Natural Language Processing with Hadoop and Python
Data volumes are increasing naturally from text (blogs) and speech (YouTube videos) posing new questions for Natural Language Processing. This involves making sense of lots of data in different forms and extracting useful insights.
- How Raytheon BBN Technologies Researchers are Using Hadoop to Build a Scalable, Distributed Triple Store
Raytheon BBN Technologies built a cloud-based triple-store technology, known as SHARD, to address scalability issues in the processing and analysis of Semantic Web data.
- Cloudera’s Support Team Shares Some Basic Hardware Recommendations
The Cloudera support team discusses workload evaluation and the critical role it plays in hardware selection.
- Integrating Hive and HBase
Facebook explains integrating Hive and HBase to keep their warehouse up to date with the latest information published by users.
- Pushing the Limits of Distributed Processing
Google built a 100,000 node Hadoop cluster running on Nexus One mobile phone hardware and powered by Android. The environmental cost of this solution is 1/100th the equivalent of running it within their data center. (April Fools)
- Using Flume to Collect Apache 2 Web Server Logs
This post presents the common use case of using a Flume node to collect Apache 2 web server logs and deliver them to HDFS.
This is a guest re-post courtesy of Arun Jacob, Data Architect at Disney; prior to that, he was an engineer at RichRelevance and Evri. For the last couple of years, Arun has been focused on data mining and information extraction, using a mix of custom and open-source technologies.
A New Machine
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
Previously we proposed how we measure performance in Hadoop MapReduce applications in an effort to better understand computing efficiency. In this part, we’ll describe some results and illuminate both good and bad characteristics.
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
We were asked by one of our customers to investigate Hadoop MapReduce for solving distributed computing problems. We were particularly interested in how effectively MapReduce applications utilize computing resources. Computing efficiency is important not only for speed-up and scale-out performance but also for power consumption. Consider a hypothetical High-Performance Computing (HPC) system of 10,000 nodes running 50% idle at 50 watts per idle node, and assume 10 cents per kilowatt-hour. It would cost $219,000 per year to power just the idle time. Keeping a large HPC system busy is difficult and requires huge datasets and efficient parallel algorithms. We wanted to analyze Hadoop applications to determine their computing efficiency and gain insight into tuning and optimizing these applications. We installed CDH3 onto a number of different clusters as part of our comparative study. CDH3 was preferred over the standard Hadoop installation for its recent patches and the support offered by Cloudera. In the first part of this two-part article, we’ll more formally define computing efficiency as it relates to evaluating Hadoop MapReduce applications and describe the performance metrics we gathered for our assessment. The second part will describe our results, conclude with suggestions for improvements, and hopefully instigate further study in Hadoop MapReduce performance analysis.
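The idle-power figure checks out; the arithmetic behind it is:

```python
# Checking the idle-power cost from the paragraph above: 10,000 nodes,
# half idle, 50 W per idle node, $0.10 per kWh, over a full year.
nodes = 10000
idle_fraction = 0.5
watts_per_idle_node = 50
dollars_per_kwh = 0.10
hours_per_year = 24 * 365          # 8,760 hours

idle_kw = nodes * idle_fraction * watts_per_idle_node / 1000   # 250 kW
annual_cost = idle_kw * hours_per_year * dollars_per_kwh       # $219,000
```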
Neil Kodner, an independent consultant, is the guest author of this post. Neil found inspiration, which spurred innovation, at Hadoop World 2010 from a moment’s decision to capture the streaming #hw2010 Twitter feed.
During the Hadoop World 2010 keynote, a majority of attendees were typing away on their laptops as Mike Olson and Tim O’Reilly dazzled the audience. Many of these laptop users appeared to be tweeting as the keynote was taking place. Since I have more than a passing interest in Twitter, Hadoop, and text mining, I thought it would be a great idea to track and store everyone’s Hadoop World tweets.