Cloudera Engineering Blog · Hadoop Posts

Visualization on Impala: Big, Real-Time, and Raw

The guest post below is provided by Justin Langseth, Founder & CEO of Zoomdata, Inc. Thanks, Justin!

What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.

How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop

One of the key principles behind Apache Hadoop is the idea that moving computation is cheaper than moving data — we prefer to move the computation to the data whenever possible, rather than the other way around. Because of this, the Hadoop Distributed File System (HDFS) typically handles many “local reads” reads where the reader is on the same node as the data:

Spotlight: How National Institutes of Health Advances Genomic Research with Big Data

This week, I’d like to shine a spotlight on innovative work the National Institutes of Health (NIH) is working on, leveraging Big Data, in the area of genomic research. Understanding DNA structure and functions is a very data-intensive, complex, and expensive undertaking. Apache Hadoop is making it more affordable and feasible to process, store, and analyze this data, and the NIH is embracing the technology for this reason. In fact, it has initiated a Big Data center of excellence — which it calls Big Data to Knowledge (BD2K) — to accelerate innovations in bioinformatics using Big Data, which will ultimately help us better understand and control various diseases and disorders.

Bob Gourley — a friend of Cloudera’s who wears many hats including publisher of, CTO of Crucial Point LLC, and GigaOm analyst — recently interviewed Dr. Mark Guyer, the deputy director of the NIH’s National Human Genome Research Institute (NHGRI), about the BD2K effort.

Flexpod Select with Cloudera

Earlier this week, our partners NetApp and Cisco announced the Flexpod Select Family with support for Cloudera’s Distribution including Apache Hadoop (CDH).

We’re looking forward to the expansion of Flexpod Select to include Hadoop, as it provides additional options for customers to consume the benefits of the Cloudera Enterprise platform.

Cloudera at Strata + Hadoop World 2013

Strata Conference + Hadoop World 2013 is looming on the horizon and pacing to be the largest gathering of Big Data professionals on the globe. As co-hosts with O’Reilly, we have seen the conference thrive, grow, and are excited about the upcoming Oct. 28 – 30 event!


This Month in the Ecosystem

The ecosystem is evolving at a rapid pace – so rapidly, that important developments are often passing through the public attention zone too quickly. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)

Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.

Customer Spotlight: Cerner’s Ryan Brush Presents “Thinking in MapReduce” at StampedeCon

For those of you attending this week’s StampedeCon event in St. Louis, I’d encourage you to check out the “Thinking in MapReduce” session presented by Cerner’s Ryan Brush. The session will cover the value that MapReduce and Apache Hadoop offer to the healthcare space, and provide tips on how to effectively use Hadoop ecosystem tools to solve healthcare problems.

Big Data challenges within the healthcare space stem from the standard practice of storing data in many siloed systems. Hadoop is allowing pharmaceutical companies and healthcare providers to revolutionize their approach to business by making it easier and more cost efficient to bring together all of these fragmented systems for a single, more accurate view of health. The end result: smarter clinical care decisions, better understanding of health risks for individuals and populations, and proactive measures to improve health and reduce healthcare costs.

Announcing Parquet 1.0: Columnar Storage for Hadoop

We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.

With Sentry, Cloudera Fills Hadoop’s Enterprise Security Gap

Every day, more data, users, and applications are accessing ever-larger Apache Hadoop clusters. Although this is good news for data driven organizations overall, for security administrators and compliance officers, there are still lingering questions about how to enable end-users under existing Hadoop infrastructure without compromising security or compliance requirements.

While Hadoop has strong security at the filesystem level, it lacks the granular support needed to adequately secure access to data by users and BI applications. Today, this problem forces organizations in industries for which security is paramount (such as financial services, healthcare, and government) to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.

Customer Spotlight: Motorola Mobility’s Award-Winning Unified Data Repository

The Data Warehousing Institute (TDWI) runs an annual Best Practices Awards program to recognize organizations for their achievements in business intelligence and data warehousing. A few months ago, I was introduced to Motorola Mobility’s VP of cloud platforms and services, Balaji Thiagarajan. After learning about its interesting Apache Hadoop use case and the success it has delivered, Balaji and I worked together to nominate Motorola Mobility for the TDWI Best Practices Award for Emerging Technologies and Methods. And to my delight, it won!

Chances are, you’ve heard of Motorola Mobility. It released the first commercial portable cell phone back in 1984, later dominated the mobile phone market with the super-thin RAZR, and today a large portion of the massive smartphone market runs on its Android operating system.

How HiveServer2 Brings Security and Concurrency to Apache Hive

Apache Hive was one of the first projects to bring higher-level languages to Apache Hadoop. Specifically, Hive enables the legions of trained SQL users to use industry-standard SQL to process their Hadoop data.

However, as you probably have gathered from all the recent community activity in the SQL-over-Hadoop area, Hive has a few limitations for users in the enterprise space. Until recently, two in particular – concurrency and security – were largely unaddressed.

Guide to Using Apache HBase Ports

For those people new to Apache HBase (version 0.90 and later), the configuration of network ports used by the system can be a little overwhelming.

In this blog post, you will learn all the TCP ports used by the different HBase processes and how and why they are used (all in one place) — to help administrators troubleshoot and set up firewall settings, and help new developers how to debug.

Myrrix Joins Cloudera to Bring "Big Learning" to Hadoop

What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.

And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.

What is Old is New Again

Data-savvy companies of all sizes can now accomplish many viable machine learning projects.

How Does Cloudera Manager Work?

At Cloudera, we believe that Cloudera Manager is the best way to install, configure, manage, and monitor your Apache Hadoop stack. Of course, most users prefer not to take our word for it — they want to know how Cloudera Manager works under the covers, first. 

In this post, I’ll explain some of its inner workings. 

The Vocabulary of Cloudera Manager

Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop

This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.

Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

Where to Find Cloudera Tech Talks Through September 2013

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for July through September 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.

As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!

Date City Venue Speaker(s)
July 11 Boston Boston HUG Solr Committer Mark Miller on Solr+Hadoop
July 11 Santa Clara, Calif. Big Data Gurus Patrick Hunt on Solr+Hadoop
July 11 Palo Alto, Calif. Cloudera Manager Meetup Phil Zeyliger on Cloudera Manager internals
July 11 Kansas City, Mo. KC Big Data Matt Harris on Impala
July 17 Mountain View, Calif. Bay Area Hadoop Meetups Patrick Hunt on Solr+Hadoop
July 22 Chicago Chicago Big Data Hadoop and Lucene founder Doug Cutting on Solr+Hadoop
July 22 Portland, Ore. OSCON 2013 Tom Wheeler on “Introduction to Apache Hadoop”
July 24 Portland, Ore. OSCON 2013 Sqoop Committer Kate Ting on “Building an Impenetrable ZooKeeper”
July 24 Portland, Ore. OSCON 2013 Jesse Anderson on “Doing Data Science On NFL Play by Play”
July 24 Portland, Ore. OSCON 2013 Bigtop Committer Mark Grover on “Getting Hadoop, Hive and HBase up and running in less than 15 minutes”
July 24 Portland, Ore. OSCON 2013 Hadoop Committer Colin McCabe on Locksmith
July 25 San Francisco SF Data Engineering Wolfgang Hoschek on Morphlines
July 25 Washington DC Hadoop-DC Joey Echeverria on Accumulo
Aug. 14 San Francisco SF Hadoop Users TBD, but we’re hosting!
Aug. 14 LA LA HBase Users Meetup HBase Committer/PMC Chair Michael Stack on HBase
Aug. 29 London London Java Community Hadoop Committer Tom White on CDK
Sept. 11 San Francisco Cloudera Sessions (SOLD OUT) Eric Sammer-led CDK lab
Sept. 12 New York NYC Search, Discovery & Analytics Meetup Solr Committer Mark Miller on Solr+Hadoop
Sept. 12 Cambridge, UK Enterprise Search Cambridge UK Tom White on Solr+Hadoop
Sept. 12 Los Angeles LA Hadoop Users Group Greg Chanan on Solr+Hadoop
Sept. 16 Sunnyvale, Calif. Big Data Gurus Eric Sammer on CDK
Sept. 17 Sunnyvale, Calif. SF Large-Scale Production Engineering Darren Lo on Hadoop Ops
Sept. 18 Mountain View, Calif. Silicon Valley JUG Wolfgang Hoschek on Morphlines
Sept. 19 El Dorado Hills, Calif. NorCal Big Data Apache Bigtop Committer Sean Mackrory on Bigtop & QuickStart VM
Sept. 24 Washington DC Hadoop-DC Doug Cutting on Apache Lucene

The Blur Project: Marrying Hadoop with Lucene

Doug Cutting’s recent post about Cloudera Search included a hat-tip to Aaron McCurry, founder of the Blur project, for inspiring some of its design principles. We thought you would be interested in hearing more about Blur (which is mentored by Doug and Cloudera’s Patrick Hunt) from Aaron himself – thanks, Aaron, for the guest post below!

Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.

What a Great Year for Hue Users!

HueWith the recent release of CDH 4.3, which contains Hue 2.3, I’d like to report on the fantastic progress of Hue in the past year.

For those who are unfamiliar with it, Hue is a very popular, end-user focused, fully open source Web UI designed for interaction with Apache Hadoop and its ecosystem components. Founded by Cloudera employees, Hue has been around for quite some time, but only in the last 12 months has it evolved into the great ramp-up and interaction tool it is today. It’s fair to say that Hue is the most popular open source GUI for the Hadoop ecosystem among beginners — as well as a valuable tool for seasoned Hadoop users (and users generally in an enterprise environment) – and it is the only end-user tool that ships with Hadoop distributions today. In fact, Hue is even redistributed and marketed as part of other user-experience and ramp-up-on-Hadoop VMs in the market.

Apache Bigtop: The "Fedora of Hadoop" is Now Built on Hadoop 2.x

BigtopJust in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

Customer Spotlight: Big Data Making a Big Impact in Healthcare and Life Sciences

In this Customer Spotlight, I’d like to emphasize some undeniably positive use cases for Big Data, by looking at some of the ways the healthcare and life sciences industries are innovating to benefit humankind. Here are just a few examples:

Mount Sinai School of Medicine has partnered with Cloudera’s own Jeff Hammerbacher to apply Big Data to better predict and understand disease processes and treatments. The Mount Sinai School of Medicine is a top medical school in the US, noted for innovation in biomedical research, clinical care delivery, and community services. With Cloudera’s Big Data technology and Jeff’s data science expertise, Mount Sinai is better equipped to develop solutions designed for high-performance, scalable data analysis and multi-scale measurements. For example, medical research and discovery areas in genotype, gene expression and organ health will benefit from these Big Data applications.

Hadoop for Everyone: Inside Cloudera Search

CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.

However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.

QuickStart VM: Now with Real-Time Big Data

For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.

We’re constantly updating and improving the QuickStart VM, and in the latest release there are two of Cloudera’s new products that give you easier and faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue – an open source web-based interface for Hadoop, and the easiest way to interact with your data.

Meetups at Hadoop Summit

Hadoop Summit convenes next week, and even if you’re not attending, there are a host of meetup opportunities available to you during the week.

Here are just a few, and you can find a full list here.

Improvements in the Hadoop YARN Fair Scheduler

Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly all that it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule non-MapReduce jobs, schedule based on fine-grained memory instead of slots, and support hierarchical queues. In this post, you’ll learn what the Fair Scheduler’s role is and how it fulfills it, what it means to be a YARN “scheduler,” and dive into its new features and how to get them running on your cluster.

YARN/MR2 vs. MR1

YARN uses an updated terminology to reflect that it no longer just manages resources for MapReduce. From YARN’s perspective, a MapReduce job is an application. YARN schedules containers for map and reduce tasks to live in. What was referred to as pools in the MR1 Fair Scheduler has been updated to queue for consistency with the capacity scheduler. An excellent and deeper explanation is available here.

How Does it Work?

Customer Spotlight: It’s HBase Week!

This is the week of Apache HBase, with HBaseCon 2013 taking place Thursday, followed by WibiData’s KijiCon on Friday. In the many conversations I’ve had with Cloudera customers over the past 18 months, I’ve noticed a trend: Those that run HBase stand out. They tend to represent a group of very sophisticated Hadoop users that are accomplishing impressive things with Big Data. They deploy HBase because they require random, real-time read/write access to the data in Hadoop. Hadoop is a core component of their data management infrastructures, and these users rely on the latest and greatest components of the Hadoop stack to satisfy their mission-critical data needs.

Today I’d like to shine a spotlight on one innovative company that is putting top engineering talent (and HBase) to work, helping to save the planet — literally.

HBaseCon 2013: "Ecosystem" Track Preview

Unbelievably, HBaseCon 2013 is only one week away (June 13 in San Francisco)!

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers

One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.

Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.

Customer Spotlight:’s Climb to the Social Gaming Throne

This week I’d like to highlight, a European social gaming giant that recently claimed the throne for having the most daily active users (more than 66 million). has methodically and successfully expanded its reach beyond mainstream social gaming to dominate the mobile gaming market — it offers a streamlined experience that allows gamers to pick up their gaming session from wherever they left off, in any game and on any device.’s top games include “Candy Crush Saga” and “Bubble Saga”.

And — you guessed it — runs on CDH.

If It’s Tuesday, There Must Be a "Data Ride"

Mark your calendars, all you data cyclists!

I’m visiting Paris, London, and Edinburgh this June. When I travel I like to talk to locals. And, wherever I am, I like to bicycle. So, I thought I might combine these interests and host “data rides” in these three cities.

Customer Spotlight: Gravity Creates Personalized Web Experience, 300-400% Higher Click-through

According to Jim Benedetto, Gravity’s co-founder and CTO, there have been two paradigm shifts that have transformed consumers’ web experience to date:

Meet the Project Founder: Roman Shaposhnik

Todd This installment of “Meet the Project Founder” features Apache Bigtop founder and PMC Chair/VP Roman Shaposhnik.

What led you to your project idea(s)?

How-to: Configure Eclipse for Hadoop Contributions

Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.

This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

How-to: Automate Your Hadoop Cluster from Java

One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. In this post, I’ll describe an approach to cluster automation that works for us, as well as many of our customers and partners.

Taming Complexity

At Cloudera engineering, we have a big support matrix: We work on many versions of CDH (multiple release trains, plus things like rolling upgrade testing), and CDH works across a wide variety of OS distros (RHEL 5 & 6, Ubuntu Precise & Lucid, Debian Squeeze, and SLES 11), and complex configuration combinations — highly available HDFS or simple HDFS, Kerberized or non-secure, using YARN or MR1 as the execution framework, etc. Clearly, we need an easy way to spin-up a new cluster that has the desired setup, which we can subsequently use for integration, testing, customer support, demos, and so on.

Tracking Hadoop Jobs from Your Mac: There’s an App for That

Our thanks to Etsy developer Brad Greenlee (@bgreenlee) for the post below. We think his Mac OS app for JobTracker is great! is a Mac menu bar app interface to the Hadoop JobTracker. It provides Growl/Notification Center notices of starting, completed, and failed jobs and gives easy access to the detail pages of those jobs.

Cloudera Development Kit (CDK): Hadoop Application Development Made Easier

Editor’s Note (Dec. 11, 2013): As of Dec. 2013, the Cloudera Development Kit is now known as the Kite SDK. Links below are updated accordingly.

At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.

How the SAS and Cloudera Platforms Work Together

On Monday April 29, Cloudera announced a strategic alliance with SAS. As the industry leader in business analytics software, SAS brings a formidable toolset to bear on the problem of extracting business value from large volumes of data.

Over the past few months, Cloudera has been hard at work along with the SAS team to integrate a number of SAS products with Apache Hadoop, delivering the ability for our customers to use these tools in their interaction with data on the Cloudera platform. In this post, we will delve into the major mechanisms that are available for connecting SAS to CDH, Cloudera’s 100% open-source distribution including Hadoop.

SAS/ACCESS to Hadoop

Cloudera Impala 1.0: It’s Here, It’s Real, It’s Already the Standard for SQL on Hadoop

In October 2012, we introduced the Impala project, at that time the first known effort to bring a modern, open source, distributed SQL query engine to Apache Hadoop. Our release of source code and a beta implementation were met with widespread acclaim — and later inspired similar efforts across the industry that now measure themselves against the Impala standard.

Today, we are proud to announce the first production drop of Impala (download here), which reflects feedback from across the user community based on multiple types of real-world workloads. Just as a refresher, the main design principle behind Impala is complete integration with the Hadoop platform (jointly utilizing a single pool of storage, metadata model, security framework, and set of system resources). This integration allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL delays needed.

The Platform for Big Data is Here

It has been an exciting couple of days for new product announcements at Cloudera — exciting especially for me as the edges of the new platform for big data we have been talking about since Strata + Hadoop World 2012 come into focus.

Yesterday, Cloudera announced a strategic alliance with SAS. SAS is the industry leader in business analytics software, especially predictive analytics. Ninety percent of the Fortune 100 run SAS today. We have been working with SAS to make a number of its products work well with Cloudera including SAS Access, SAS Visual Analytics, and SAS High Performance Analytics (HPA). SAS HPA is an excellent case example of the future direction of Apache Hadoop as a data management platform:

Meet the Project Founder: Doug Cutting (First in a Series)

ToddAt Cloudera, there is a long and proud tradition of employees creating new open source projects intended to help fill gaps in platform functionality (in addition to hiring new employees who have done so in the past). In fact, more than a dozen ecosystem projects — including Apache Hadoop itself — were founded by Clouderans, more than can be attributed to employees of any other single company. Cloudera was also the first vendor to ship most of those projects as enterprise-ready bits inside its platform.

We thought you might be interested in meeting some of them over the next few months, in a new “Meet the Project Founder” series. It’s only appropriate that we begin with Doug Cutting himself – Cloudera’s chief architect and the quadruple-threat founder of Apache Lucene, Apache Nutch, Apache Hadoop, and Apache Avro.

Customer Spotlight: Nokia’s Big Data Ecosystem Connects Cloudera, Teradata, Oracle, and Others

As Cloudera’s keeper of customer stories, it’s dawned on me that others might benefit from the information I’ve spent the past year collecting: the many use cases and deployment patterns for Hadoop amongst our customer base.

This week I’d like to highlight Nokia, a global company that we’re all familiar with as a large mobile phone provider, and whose Senior Director of Analytics – Amy O’Connor – will be speaking at tomorrow’s Cloudera Sessions event in Boston.

Cloudera Academic Partnership Program: Creating Hadoop Lovers in Universities Worldwide

Today Cloudera announced a new Cloudera Academic Partnership program, in which participating universities worldwide get access to curriculum, training, certification, and software. 

As noted in the press release, the global demand for people with Apache Hadoop and data science skills is dwarfing all supply. We consider it an important mission to help accredited universities meet that demand, by equipping them with the content and training they need to educate students in the Hadoop arts.

Learn How To Hadoop from Tom White in Dr. Dobb’s

It’s always a great thing for everybody when the experts are willing and eager to share.

So, it’s with special pleasure that I can point you toward a new three-part series by Cloudera’s own Tom White (@tom_e_white) to be published in Dr Dobb’s, which has long been one of the publications of record in the mainstream developer world – from which many original programmers learned basics like BASIC. Now, Dobb’s turns its attention to Apache Hadoop, which says a lot about Hadoop’s continuing adoption.

Where to Find Cloudera Tech Talks Through June 2013

It’s time for me to give you a quarterly update (here’s the one for Q1) about where to find tech talks by Cloudera employees in 2013. Committers, contributors, and other engineers will travel to meetups and conferences near and far to do their part in the community to make Apache Hadoop a household word!

(Remember, we’re always ready to assist your meetup by providing speakers, sponsorships, and schwag.)

Congrats to OSCON 2013 Speakers!

Cloudera will be a proud exhibitor at O’Reilly OSCON 2013 (July 22-26 in Portland, OR), which in our opinion is a shining light in the open source community. So be sure to look for us at Booth #420!

Seven Thoughts on Hadoop’s Seventh Birthday

On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop:

  1. Open source accelerates adoption.

    If Hadoop had been created as proprietary software it would not have spread as rapidly. We’ve seen incredible growth in the use of Hadoop. Partly that’s because it’s useful. But many would have been cautious to make a vendor-controlled platform part of their infrastructure, useful or not.

  2.  Apache builds collaborative communities.

Cloudera’s Jeff Hammerbacher on Charlie Rose

In this Charlie Rose interview that aired on March 22, 2013, Cloudera’s Chief Scientist Jeff Hammerbacher (@hackingdata) offers fascinating insights into the origins of Big Data and data science techniques at Google and their re-implementation into open source used by consumer Web companies. Furthermore, he offers great detail about their positive application across healthcare diagnostics and delivery – as well as the overall need for better balance between “numerical imagination” and “narrative imagination” in everything we do (in order to “ask bigger questions”, as some would say).

It’s an incredibly valuable look into where Big Data came from, where it’s going, and how Cloudera is helping it get there.

Cloudera Speakers at Hadoop Summit Europe

Hadoop Summit Europe is coming up in Amsterdam next week, so this is an appropriate time to make you aware of the Cloudera speaker program there (all three talks on Thursday, March 21):

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop

Below you’ll find the official announcement from Cloudera and Twitter about Parquet, an efficient general-purpose columnar file format for Apache Hadoop.

Parquet is designed to bring efficient columnar storage to Hadoop. Compared to, and learning from, the initial work done toward this goal in Trevni, Parquet includes the following enhancements:

How-to: Set Up a Hadoop Cluster with Network Encryption

Hadoop network encryption is a feature introduced in Apache Hadoop 2.0.2-alpha and in CDH4.1.

In this blog post, we’ll first cover Hadoop’s pre-existing security capabilities. Then, we’ll explain why network encryption may be required. We’ll also provide some details on how it has been implemented. At the end of this blog post, you’ll get step-by-step instructions to help you set up a Hadoop cluster with network encryption.

A Bit of History on Hadoop Security

Meet the Instructor: Glynn Durham

In this installment of “Meet the Instructor,” we speak to San Francisco-based Glynn Durham, one of the big brains behind Cloudera’s Introduction to Data Science training and certification. 

What is your role at Cloudera?
I am a Senior Instructor with Cloudera University, which means I am a road warrior: I will travel anywhere to teach anything to anyone. I teach all the courses Cloudera offers, including custom private training events that I run at customer sites. Right now, I’m especially enjoying teaching Cloudera’s new course, Introduction to Data Science: Building Recommender Systems. In tandem with the rollout of the course, we’re developing Cloudera Certified Professional: Data Scientist exams, which will include a challenging performance-based lab component in addition to the written test.

Newer Posts Older Posts