Cloudera Blog · Hadoop Posts
In this installment of “Meet the Instructor,” we speak to San Francisco-based Glynn Durham, one of the big brains behind Cloudera’s Introduction to Data Science training and certification.
What is your role at Cloudera?
I am a Senior Instructor with Cloudera University, which means I am a road warrior: I will travel anywhere to teach anything to anyone. I teach all the courses Cloudera offers, including custom private training events that I run at customer sites. Right now, I’m especially enjoying teaching Cloudera’s new course, Introduction to Data Science: Building Recommender Systems. In tandem with the rollout of the course, we’re developing Cloudera Certified Professional: Data Scientist exams, which will include a challenging performance-based lab component in addition to the written test.
Before joining Cloudera, my background was primarily in databases. My first corporate job was at Oracle just before it went public. I spent a year producing Oracle’s first batch of course materials for developers and database administrators and then spent several years teaching all kinds of people all over the world. For some time, I was an Oracle Database Administrator. I eventually moved on to the LAMP stack, and I later worked for MySQL.
It has been a busy time for announcements coinciding with this week’s Strata conference. There’s no corner of the technology world that has not embraced Apache Hadoop as the new platform for big data. Apache Hadoop began as a telegram from the future from Google, turned into real software by Doug Cutting while on a freelance assignment. While Hadoop’s origins are surprising, its ongoing popularity is not – open source has been a major contributing factor to Hadoop’s current ubiquity. Easy to trial, fast to evolve, inexpensive to own: open source makes a compelling case for itself.
From the founding of the company, Cloudera recognized the importance of Apache open source to Hadoop’s continued evolution. We’re now entering our fifth year of shipping a 100% open source platform. Every significant advance we have added to the platform has stayed consistent with our open source strategy. In the process, Cloudera has now sponsored the development of seven new open source projects: Apache Flume, Apache Sqoop, Apache Bigtop, Apache MRUnit, Cloudera Hue, Apache Crunch, and most recently, Cloudera Impala. Acknowledging the maxim “innovation happens elsewhere,” we’ve also managed to convince the founders and/or PMC chairs of Apache Hadoop, Apache Oozie, Apache ZooKeeper, and Apache HBase to join Cloudera.
Our investment in open source is not altruistic — we think it is good business. Today, Cloudera employees contribute more patches to the Apache Hadoop ecosystem than every other software vendor combined. Meanwhile more enterprises have adopted our open source platform than every other Hadoop distribution combined. We do not think it is a coincidence that these two things are simultaneously true.
(Added Feb. 25 2013: Early Bird registration is now open – closes April 23, 2013!)
Now that Apache Hadoop is seven years old, use-case patterns for Big Data have emerged. In this post, I’m going to describe the three main ones (reflected in the post’s title) that we see across Cloudera’s growing customer base.
Transformations (T, for short) are a fundamental part of BI systems: They are the process through which data is converted from a source format (which can be relational or otherwise) into a relational data model that can be queried via BI tools.
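As a hypothetical illustration (the field names and source format below are invented for this sketch, not taken from the post), a T step might parse semi-structured source records into flat relational rows that a BI tool could query:

```python
import csv
import io
import json

# Hypothetical transformation (T) step: convert semi-structured JSON
# event records (the source format) into flat relational rows with
# fixed columns. All field names here are invented for illustration.

raw_events = [
    '{"user": "alice", "action": "click", "ts": "2013-02-01T10:00:00"}',
    '{"user": "bob", "action": "view", "ts": "2013-02-01T10:05:00"}',
]

def transform(lines):
    # Source format (JSON text) -> relational rows (fixed columns).
    rows = []
    for line in lines:
        rec = json.loads(line)
        rows.append((rec["user"], rec["action"], rec["ts"]))
    return rows

rows = transform(raw_events)

# Emit the relational result as CSV, a shape BI tools can load.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["user", "action", "ts"])
writer.writerows(rows)
print(out.getvalue())
```

The essential point is the shape change: the input schema is implicit and per-record, while the output has a fixed, queryable column layout.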
In the late 1980s, the first BI data stacks started to materialize, and they typically looked like Figure 1.
Last week the Apache Hadoop PMC voted to release Apache Hadoop 2.0.3-alpha, the latest in the Hadoop 2 release series. This release fixes over 500 issues (covering the Common, HDFS, MapReduce, and YARN sub-projects) since the 2.0.2-alpha release in October last year. In addition to bug fixes and general improvements, the more noteworthy changes include:
In my previous post, you learned how to write a basic MapReduce job and run it on Apache Hadoop. In this post, we’ll delve deeper into MapReduce programming and cover some of the framework’s more advanced features. In particular, we’ll explore:
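For readers who want a refresher before the advanced material, the shape of a basic job can be sketched in a Hadoop Streaming style. This is a simplified stand-in for illustration, not the actual Java code from the earlier post:

```python
from itertools import groupby

# Minimal word-count sketch of the MapReduce model: the mapper emits
# (word, 1) pairs, the framework sorts by key, and the reducer sums
# the counts for each word. A simplified stand-in, not a real Hadoop job.

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Hadoop sorts and groups mapper output by key before the reduce
    # phase; sorted() + groupby simulates that shuffle step here.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["hadoop is big", "big data is big"])))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 1, 'is': 2}
```

In a real deployment the mapper and reducer run on separate nodes and the sort/shuffle is handled by the framework, but the per-record contract is the same.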
Organizations of all types and sizes are waking up to the idea that integrating the Apache Hadoop stack into their IT infrastructure solves very common, near-term data management problems. At the same time, deploying Hadoop offers the long-term promise of rapid innovation via Big Data analytics. But, how do you get from Point A to Point Z with the least possible exposure to risk?
Coming to a U.S. city near you, The Cloudera Sessions are single-day, interactive events – with presentations in the morning, and technical breakouts in the afternoon – designed to help you identify where you are on your journey with Apache Hadoop, and how to keep that journey going in a low-risk, productive way. You’ll benefit not only from Cloudera’s experiences with real-world deployments, but also hear directly from some of the Hadoop users who planned and manage them.
If you’re the business owner of a data warehouse, an enterprise architect, an IT leader, or an analyst or developer, you can’t afford to miss this opportunity to learn about the very real and widely applicable benefits of Hadoop.
In this installment of “Meet the Engineer”, get to know Customer Operations Engineering Manager/Apache Sqoop committer Kathleen Ting (@kate_ting).
What do you do at Cloudera, and in what open-source projects are you involved?
I’m a support manager at Cloudera, and an Apache Sqoop committer and PMC member. I also contribute to the Apache Flume and Apache ZooKeeper mailing lists and organize and present at meetups, as well as speak at conferences, about those projects.
My role is a hybrid “player/coach” model: in addition to doing managerial things like leading a team and addressing customer escalations, I also answer customer support cases directly, which is an unusual combination. This approach is effective: it gives me direct insight into customer concerns that I otherwise wouldn’t get, helps me stay grounded, and ensures I appreciate the work the team is doing, first-hand.
Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post discusses how we use runtime code generation to significantly improve CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code generation eliminates and look in more detail at one of the queries in the TPC-H workload where code generation improves overall query speed by close to 3x.
Why Code Generation?
The baseline for “optimal” query engine performance is a native application that is written specifically for your data format, written only to support your query. For example, it would be ideal if a query engine could execute this query:
select count(*) from tbl where col like '%XYZ%'
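To make the idea concrete, here is a minimal sketch of the difference between an interpreted evaluator and runtime code generation. This is an illustration of the technique only, not Impala’s actual (C++/LLVM-based) implementation; all names are invented:

```python
# Generic, interpreted path: the engine must re-check the operator type
# and re-index the column on every single row, because the query isn't
# known until runtime.
def generic_eval(rows, col_index, op, pattern):
    matched = 0
    for row in rows:
        val = row[col_index]
        if op == "like_contains":      # per-row branch on the operator
            if pattern in val:
                matched += 1
        elif op == "eq":
            if val == pattern:
                matched += 1
    return matched

# "Code generation" path: build a function specialized to this exact
# predicate, compile it once, and reuse it for every row. The branching
# on operator and column offset happens at generation time, not per row.
def codegen_eval(col_index, pattern):
    src = (
        "def specialized(rows):\n"
        f"    return sum(1 for row in rows if {pattern!r} in row[{col_index}])\n"
    )
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["specialized"]

rows = [("aXYZb",), ("nope",), ("XYZ",)]
count = codegen_eval(0, "XYZ")(rows)
print(count)  # 2
assert count == generic_eval(rows, 0, "like_contains", "XYZ")
```

The generated function is the analogue of the “native application written only to support your query”: all per-query decisions are baked in, leaving only the per-row work in the hot loop.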
Introduction: Training is Key
Apache Hadoop is extremely important to maximizing the value Syncsort’s technology delivers to our customers. That value promise starts with a solid foundation of knowledge and skills among key technical staff across the company.
We chose Cloudera University’s private training option to ensure Syncsort’s cross-functional team of engineering, support, services, and technical sales professionals had the expertise to optimize our data products for the end-user. Because the members of our team had different levels of prior Hadoop experience, the private class enabled us to freely share information and ask tough questions, resulting in a high level of engagement throughout the course.