Cloudera Engineering Blog · Hadoop Posts
For those of you attending this week’s StampedeCon event in St. Louis, I’d encourage you to check out the “Thinking in MapReduce” session presented by Cerner’s Ryan Brush. The session will cover the value that MapReduce and Apache Hadoop offer to the healthcare space, and provide tips on how to effectively use Hadoop ecosystem tools to solve healthcare problems.
Big Data challenges within the healthcare space stem from the standard practice of storing data in many siloed systems. Hadoop is allowing pharmaceutical companies and healthcare providers to revolutionize their approach to business by making it easier and more cost-efficient to bring together all of these fragmented systems for a single, more accurate view of health. The end result: smarter clinical care decisions, better understanding of health risks for individuals and populations, and proactive measures to improve health and reduce healthcare costs.
We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).
Every day, more data, users, and applications are accessing ever-larger Apache Hadoop clusters. Although this is good news for data-driven organizations overall, security administrators and compliance officers still have lingering questions about how to give end users access on existing Hadoop infrastructure without compromising security or compliance requirements.
While Hadoop has strong security at the filesystem level, it lacks the granular support needed to adequately secure access to data by users and BI applications. Today, this problem forces organizations in industries for which security is paramount (such as financial services, healthcare, and government) to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.
The Data Warehousing Institute (TDWI) runs an annual Best Practices Awards program to recognize organizations for their achievements in business intelligence and data warehousing. A few months ago, I was introduced to Motorola Mobility’s VP of cloud platforms and services, Balaji Thiagarajan. After I learned about the company’s interesting Apache Hadoop use case and the success it has delivered, Balaji and I worked together to nominate Motorola Mobility for the TDWI Best Practices Award for Emerging Technologies and Methods. And to my delight, it won!
Chances are, you’ve heard of Motorola Mobility. It released the first commercial portable cell phone back in 1984, later dominated the mobile phone market with the super-thin RAZR, and today builds smartphones on Android, the operating system that runs a large portion of the massive smartphone market.
Apache Hive was one of the first projects to bring higher-level languages to Apache Hadoop. Specifically, Hive enables the legions of trained SQL users to use industry-standard SQL to process their Hadoop data.
However, as you probably have gathered from all the recent community activity in the SQL-over-Hadoop area, Hive has a few limitations for users in the enterprise space. Until recently, two in particular – concurrency and security – were largely unaddressed.
For those people new to Apache HBase (version 0.90 and later), the configuration of network ports used by the system can be a little overwhelming.
In this blog post, you will learn about all the TCP ports used by the different HBase processes, and how and why they are used, all in one place. The goal is to help administrators troubleshoot and configure firewalls, and to help new developers learn how to debug.
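As a quick illustration of the firewall troubleshooting described above, the following Python sketch checks whether the usual HBase ports are reachable on a host. The port numbers are the stock defaults for HBase releases of this era, assuming nothing has been overridden in `hbase-site.xml`. The helper names `is_port_open` and `check_ports` are our own and are not part of HBase.

```python
import socket

# Stock defaults for HBase 0.90-era releases. This is an assumption: any of
# these may have been changed in hbase-site.xml on a given cluster.
DEFAULT_HBASE_PORTS = {
    "hbase.master.port": 60000,             # Master RPC
    "hbase.master.info.port": 60010,        # Master web UI
    "hbase.regionserver.port": 60020,       # RegionServer RPC
    "hbase.regionserver.info.port": 60030,  # RegionServer web UI
    "hbase.zookeeper.property.clientPort": 2181,  # ZooKeeper client port
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=DEFAULT_HBASE_PORTS, timeout=1.0):
    """Map each configuration property name to whether its port is reachable."""
    return {name: is_port_open(host, port, timeout) for name, port in ports.items()}


if __name__ == "__main__":
    # "localhost" is a placeholder; substitute a Master or RegionServer host.
    for name, reachable in check_ports("localhost").items():
        print(f"{name}: {'open' if reachable else 'closed or filtered'}")
```

A port reported as closed on a host where the process is known to be running usually points to a firewall rule or a non-default port setting, and both are worth checking before you debug HBase itself.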
What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.
And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.
What is Old is New Again
Data-savvy companies of all sizes can now accomplish many viable machine learning projects.
At Cloudera, we believe that Cloudera Manager is the best way to install, configure, manage, and monitor your Apache Hadoop stack. Of course, most users prefer not to take our word for it; they first want to know how Cloudera Manager works under the covers.
In this post, I’ll explain some of its inner workings.
The Vocabulary of Cloudera Manager
This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.
Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.
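To give a feel for what "command-based" means here, the sketch below models a transformation pipeline as a chain of small commands, each of which takes a record, returns a transformed record, or drops it. This is only a conceptual illustration in Python and is not the Morphlines API. Real morphlines are declared in configuration files, not in code, and every name below (`parse_csv_line`, `to_int`, `drop_field`, `run_pipeline`) is hypothetical.

```python
# Conceptual sketch only: NOT the Morphlines API. All names are hypothetical.

def parse_csv_line(field_names):
    """Build a command that splits record["message"] into named fields."""
    def command(record):
        values = str(record["message"]).split(",")
        if len(values) != len(field_names):
            return None  # malformed line: drop the record
        out = dict(record)
        out.update(zip(field_names, (v.strip() for v in values)))
        return out
    return command


def to_int(field):
    """Build a command that converts one field to an integer."""
    def command(record):
        try:
            return {**record, field: int(record[field])}
        except (KeyError, TypeError, ValueError):
            return None  # missing or non-numeric value: drop the record
    return command


def drop_field(field):
    """Build a command that removes one field from the record."""
    def command(record):
        return {k: v for k, v in record.items() if k != field}
    return command


def run_pipeline(lines, commands):
    """Pass each input line through the command chain, keeping survivors."""
    results = []
    for line in lines:
        record = {"message": line}
        for command in commands:
            record = command(record)
            if record is None:
                break  # a command dropped the record; skip remaining commands
        if record is not None:
            results.append(record)
    return results


pipeline = [parse_csv_line(["user", "count"]), to_int("count"), drop_field("message")]
print(run_pipeline(["alice, 42", "broken", "bob, x"], pipeline))
```

Only the first input line survives the chain. The second has the wrong number of fields and the third has a non-numeric count, so both are dropped partway through.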
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for July through September 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.
As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!
| Date | City | Event | Speaker and topic |
|---|---|---|---|
| July 11 | Boston | Boston HUG | Solr Committer Mark Miller on Solr+Hadoop |
| July 11 | Santa Clara, Calif. | Big Data Gurus | Patrick Hunt on Solr+Hadoop |
| July 11 | Palo Alto, Calif. | Cloudera Manager Meetup | Phil Zeyliger on Cloudera Manager internals |
| July 11 | Kansas City, Mo. | KC Big Data | Matt Harris on Impala |
| July 17 | Mountain View, Calif. | Bay Area Hadoop Meetups | Patrick Hunt on Solr+Hadoop |
| July 22 | Chicago | Chicago Big Data | Hadoop and Lucene founder Doug Cutting on Solr+Hadoop |
| July 22 | Portland, Ore. | OSCON 2013 | Tom Wheeler on “Introduction to Apache Hadoop” |
| July 24 | Portland, Ore. | OSCON 2013 | Sqoop Committer Kate Ting on “Building an Impenetrable ZooKeeper” |
| July 24 | Portland, Ore. | OSCON 2013 | Jesse Anderson on “Doing Data Science On NFL Play by Play” |
| July 24 | Portland, Ore. | OSCON 2013 | Bigtop Committer Mark Grover on “Getting Hadoop, Hive and HBase up and running in less than 15 minutes” |
| July 24 | Portland, Ore. | OSCON 2013 | Hadoop Committer Colin McCabe on Locksmith |
| July 25 | San Francisco | SF Data Engineering | Wolfgang Hoschek on Morphlines |
| July 25 | Washington DC | Hadoop-DC | Joey Echeverria on Accumulo |
| Aug. 14 | San Francisco | SF Hadoop Users | TBD, but we’re hosting! |
| Aug. 14 | LA | LA HBase Users Meetup | HBase Committer/PMC Chair Michael Stack on HBase |
| Aug. 29 | London | London Java Community | Hadoop Committer Tom White on CDK |
| Sept. 11 | San Francisco | Cloudera Sessions (SOLD OUT) | Eric Sammer-led CDK lab |
| Sept. 12 | New York | NYC Search, Discovery & Analytics Meetup | Solr Committer Mark Miller on Solr+Hadoop |
| Sept. 12 | Cambridge, UK | Enterprise Search Cambridge UK | Tom White on Solr+Hadoop |
| Sept. 12 | Los Angeles | LA Hadoop Users Group | Greg Chanan on Solr+Hadoop |
| Sept. 16 | Sunnyvale, Calif. | Big Data Gurus | Eric Sammer on CDK |
| Sept. 17 | Sunnyvale, Calif. | SF Large-Scale Production Engineering | Darren Lo on Hadoop Ops |
| Sept. 18 | Mountain View, Calif. | Silicon Valley JUG | Wolfgang Hoschek on Morphlines |
| Sept. 19 | El Dorado Hills, Calif. | NorCal Big Data | Apache Bigtop Committer Sean Mackrory on Bigtop & QuickStart VM |
| Sept. 24 | Washington DC | Hadoop-DC | Doug Cutting on Apache Lucene |