Cloudera Engineering Blog · Hadoop Posts

Announcing Parquet 1.0: Columnar Storage for Hadoop

We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.

With Sentry, Cloudera Fills Hadoop’s Enterprise Security Gap

Every day, more data, users, and applications are accessing ever-larger Apache Hadoop clusters. Although this is good news for data-driven organizations overall, security administrators and compliance officers still have lingering questions about how to enable end users on existing Hadoop infrastructure without compromising security or compliance requirements.

While Hadoop has strong security at the filesystem level, it lacks the granular support needed to adequately secure access to data by users and BI applications. Today, this problem forces organizations in industries for which security is paramount (such as financial services, healthcare, and government) to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.

Customer Spotlight: Motorola Mobility’s Award-Winning Unified Data Repository

The Data Warehousing Institute (TDWI) runs an annual Best Practices Awards program to recognize organizations for their achievements in business intelligence and data warehousing. A few months ago, I was introduced to Motorola Mobility’s VP of cloud platforms and services, Balaji Thiagarajan. After I learned about the company’s interesting Apache Hadoop use case and the success it has delivered, Balaji and I worked together to nominate Motorola Mobility for the TDWI Best Practices Award for Emerging Technologies and Methods. And to my delight, it won!

Chances are, you’ve heard of Motorola Mobility. It released the first commercial portable cell phone back in 1984, later dominated the mobile phone market with the super-thin RAZR, and today builds smartphones on the Android operating system, which runs a large portion of the massive smartphone market.

How HiveServer2 Brings Security and Concurrency to Apache Hive

Apache Hive was one of the first projects to bring higher-level languages to Apache Hadoop. Specifically, Hive enables the legions of trained SQL users to use industry-standard SQL to process their Hadoop data.

However, as you probably have gathered from all the recent community activity in the SQL-over-Hadoop area, Hive has a few limitations for users in the enterprise space. Until recently, two in particular – concurrency and security – were largely unaddressed.

Guide to Using Apache HBase Ports

For those people new to Apache HBase (version 0.90 and later), the configuration of network ports used by the system can be a little overwhelming.

In this blog post, you will learn about all the TCP ports used by the different HBase processes, and how and why they are used, all in one place. This should help administrators troubleshoot and configure firewall settings, and help new developers learn how to debug.

Myrrix Joins Cloudera to Bring "Big Learning" to Hadoop

What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.

And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.

What is Old is New Again

Data-savvy companies of all sizes can now accomplish many viable machine learning projects.

How Does Cloudera Manager Work?

At Cloudera, we believe that Cloudera Manager is the best way to install, configure, manage, and monitor your Apache Hadoop stack. Of course, most users prefer not to take our word for it; they first want to know how Cloudera Manager works under the covers.

In this post, I’ll explain some of its inner workings. 

The Vocabulary of Cloudera Manager

Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop

This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.

Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

Where to Find Cloudera Tech Talks Through September 2013

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for July through September 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.

As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!

Date | City | Venue | Speaker(s)
July 11 | Boston | Boston HUG | Solr Committer Mark Miller on Solr+Hadoop
July 11 | Santa Clara, Calif. | Big Data Gurus | Patrick Hunt on Solr+Hadoop
July 11 | Palo Alto, Calif. | Cloudera Manager Meetup | Phil Zeyliger on Cloudera Manager internals
July 11 | Kansas City, Mo. | KC Big Data | Matt Harris on Impala
July 17 | Mountain View, Calif. | Bay Area Hadoop Meetups | Patrick Hunt on Solr+Hadoop
July 22 | Chicago | Chicago Big Data | Hadoop and Lucene founder Doug Cutting on Solr+Hadoop
July 22 | Portland, Ore. | OSCON 2013 | Tom Wheeler on “Introduction to Apache Hadoop”
July 24 | Portland, Ore. | OSCON 2013 | Sqoop Committer Kate Ting on “Building an Impenetrable ZooKeeper”
July 24 | Portland, Ore. | OSCON 2013 | Jesse Anderson on “Doing Data Science On NFL Play by Play”
July 24 | Portland, Ore. | OSCON 2013 | Bigtop Committer Mark Grover on “Getting Hadoop, Hive and HBase up and running in less than 15 minutes”
July 24 | Portland, Ore. | OSCON 2013 | Hadoop Committer Colin McCabe on Locksmith
July 25 | San Francisco | SF Data Engineering | Wolfgang Hoschek on Morphlines
July 25 | Washington DC | Hadoop-DC | Joey Echeverria on Accumulo
Aug. 14 | San Francisco | SF Hadoop Users | TBD, but we’re hosting!
Aug. 14 | LA | LA HBase Users Meetup | HBase Committer/PMC Chair Michael Stack on HBase
Aug. 29 | London | London Java Community | Hadoop Committer Tom White on CDK
Sept. 11 | San Francisco | Cloudera Sessions (SOLD OUT) | Eric Sammer-led CDK lab
Sept. 12 | New York | NYC Search, Discovery & Analytics Meetup | Solr Committer Mark Miller on Solr+Hadoop
Sept. 12 | Cambridge, UK | Enterprise Search Cambridge UK | Tom White on Solr+Hadoop
Sept. 12 | Los Angeles | LA Hadoop Users Group | Greg Chanan on Solr+Hadoop
Sept. 16 | Sunnyvale, Calif. | Big Data Gurus | Eric Sammer on CDK
Sept. 17 | Sunnyvale, Calif. | SF Large-Scale Production Engineering | Darren Lo on Hadoop Ops
Sept. 18 | Mountain View, Calif. | Silicon Valley JUG | Wolfgang Hoschek on Morphlines
Sept. 19 | El Dorado Hills, Calif. | NorCal Big Data | Apache Bigtop Committer Sean Mackrory on Bigtop & QuickStart VM
Sept. 24 | Washington DC | Hadoop-DC | Doug Cutting on Apache Lucene

The Blur Project: Marrying Hadoop with Lucene

Doug Cutting’s recent post about Cloudera Search included a hat-tip to Aaron McCurry, founder of the Blur project, for inspiring some of its design principles. We thought you would be interested in hearing more about Blur (which is mentored by Doug and Cloudera’s Patrick Hunt) from Aaron himself – thanks, Aaron, for the guest post below!

Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.
