Cloudera Engineering Blog · Hadoop Posts

Customer Spotlight: Cerner’s Ryan Brush Presents “Thinking in MapReduce” at StampedeCon

For those of you attending this week’s StampedeCon event in St. Louis, I’d encourage you to check out the “Thinking in MapReduce” session presented by Cerner’s Ryan Brush. The session will cover the value that MapReduce and Apache Hadoop offer to the healthcare space, and provide tips on how to effectively use Hadoop ecosystem tools to solve healthcare problems.

Big Data challenges within the healthcare space stem from the standard practice of storing data in many siloed systems. Hadoop is allowing pharmaceutical companies and healthcare providers to revolutionize their approach to business by making it easier and more cost efficient to bring together all of these fragmented systems for a single, more accurate view of health. The end result: smarter clinical care decisions, better understanding of health risks for individuals and populations, and proactive measures to improve health and reduce healthcare costs.

Announcing Parquet 1.0: Columnar Storage for Hadoop

We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.

With Sentry, Cloudera Fills Hadoop’s Enterprise Security Gap

Every day, more data, users, and applications are accessing ever-larger Apache Hadoop clusters. Although this is good news for data driven organizations overall, for security administrators and compliance officers, there are still lingering questions about how to enable end-users under existing Hadoop infrastructure without compromising security or compliance requirements.

While Hadoop has strong security at the filesystem level, it lacks the granular support needed to adequately secure access to data by users and BI applications. Today, this problem forces organizations in industries for which security is paramount (such as financial services, healthcare, and government) to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.

Customer Spotlight: Motorola Mobility’s Award-Winning Unified Data Repository

The Data Warehousing Institute (TDWI) runs an annual Best Practices Awards program to recognize organizations for their achievements in business intelligence and data warehousing. A few months ago, I was introduced to Motorola Mobility’s VP of cloud platforms and services, Balaji Thiagarajan. After learning about its interesting Apache Hadoop use case and the success it has delivered, Balaji and I worked together to nominate Motorola Mobility for the TDWI Best Practices Award for Emerging Technologies and Methods. And to my delight, it won!

Chances are, you’ve heard of Motorola Mobility. It released the first commercial portable cell phone back in 1984, later dominated the mobile phone market with the super-thin RAZR, and today a large portion of the massive smartphone market runs on its Android operating system.

How HiveServer2 Brings Security and Concurrency to Apache Hive

Apache Hive was one of the first projects to bring higher-level languages to Apache Hadoop. Specifically, Hive enables the legions of trained SQL users to use industry-standard SQL to process their Hadoop data.

However, as you probably have gathered from all the recent community activity in the SQL-over-Hadoop area, Hive has a few limitations for users in the enterprise space. Until recently, two in particular – concurrency and security – were largely unaddressed.

Guide to Using Apache HBase Ports

For those people new to Apache HBase (version 0.90 and later), the configuration of network ports used by the system can be a little overwhelming.

In this blog post, you will learn all the TCP ports used by the different HBase processes and how and why they are used (all in one place) — to help administrators troubleshoot and set up firewall settings, and help new developers how to debug.

Myrrix Joins Cloudera to Bring "Big Learning" to Hadoop

What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.

And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.

What is Old is New Again

Data-savvy companies of all sizes can now accomplish many viable machine learning projects.

How Does Cloudera Manager Work?

At Cloudera, we believe that Cloudera Manager is the best way to install, configure, manage, and monitor your Apache Hadoop stack. Of course, most users prefer not to take our word for it — they want to know how Cloudera Manager works under the covers, first. 

In this post, I’ll explain some of its inner workings. 

The Vocabulary of Cloudera Manager

Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop

This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.

Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

Where to Find Cloudera Tech Talks Through September 2013

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for July through September 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.

As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!

Date City Venue Speaker(s)
July 11 Boston Boston HUG Solr Committer Mark Miller on Solr+Hadoop
July 11 Santa Clara, Calif. Big Data Gurus Patrick Hunt on Solr+Hadoop
July 11 Palo Alto, Calif. Cloudera Manager Meetup Phil Zeyliger on Cloudera Manager internals
July 11 Kansas City, Mo. KC Big Data Matt Harris on Impala
July 17 Mountain View, Calif. Bay Area Hadoop Meetups Patrick Hunt on Solr+Hadoop
July 22 Chicago Chicago Big Data Hadoop and Lucene founder Doug Cutting on Solr+Hadoop
July 22 Portland, Ore. OSCON 2013 Tom Wheeler on “Introduction to Apache Hadoop”
July 24 Portland, Ore. OSCON 2013 Sqoop Committer Kate Ting on “Building an Impenetrable ZooKeeper”
July 24 Portland, Ore. OSCON 2013 Jesse Anderson on “Doing Data Science On NFL Play by Play”
July 24 Portland, Ore. OSCON 2013 Bigtop Committer Mark Grover on “Getting Hadoop, Hive and HBase up and running in less than 15 minutes”
July 24 Portland, Ore. OSCON 2013 Hadoop Committer Colin McCabe on Locksmith
July 25 San Francisco SF Data Engineering Wolfgang Hoschek on Morphlines
July 25 Washington DC Hadoop-DC Joey Echeverria on Accumulo
Aug. 14 San Francisco SF Hadoop Users TBD, but we’re hosting!
Aug. 14 LA LA HBase Users Meetup HBase Committer/PMC Chair Michael Stack on HBase
Aug. 29 London London Java Community Hadoop Committer Tom White on CDK
Sept. 11 San Francisco Cloudera Sessions (SOLD OUT) Eric Sammer-led CDK lab
Sept. 12 New York NYC Search, Discovery & Analytics Meetup Solr Committer Mark Miller on Solr+Hadoop
Sept. 12 Cambridge, UK Enterprise Search Cambridge UK Tom White on Solr+Hadoop
Sept. 12 Los Angeles LA Hadoop Users Group Greg Chanan on Solr+Hadoop
Sept. 16 Sunnyvale, Calif. Big Data Gurus Eric Sammer on CDK
Sept. 17 Sunnyvale, Calif. SF Large-Scale Production Engineering Darren Lo on Hadoop Ops
Sept. 18 Mountain View, Calif. Silicon Valley JUG Wolfgang Hoschek on Morphlines
Sept. 19 El Dorado Hills, Calif. NorCal Big Data Apache Bigtop Committer Sean Mackrory on Bigtop & QuickStart VM
Sept. 24 Washington DC Hadoop-DC Doug Cutting on Apache Lucene
Newer Posts Older Posts