Cloudera Engineering Blog · Hadoop Posts
For those people new to Apache HBase (version 0.90 and later), the configuration of network ports used by the system can be a little overwhelming.
In this blog post, you will learn, all in one place, which TCP ports the different HBase processes use, and how and why they use them. The goal is to help administrators troubleshoot and configure firewalls, and to help new developers debug.
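As a starting point for firewall troubleshooting, here is a minimal sketch that tests whether the main HBase ports accept TCP connections from a given machine. The port numbers are the commonly cited defaults for HBase releases of this era and are an assumption here, not something taken from this post; every one of them can be overridden in `hbase-site.xml`, so check your own configuration first. The helper names `port_is_open` and `check_ports` are hypothetical.

```python
import socket

# Assumed defaults for HBase 0.90-era releases; each is configurable via the
# property named in the label, so verify against your hbase-site.xml.
DEFAULT_PORTS = {
    "master RPC (hbase.master.port)": 60000,
    "master web UI (hbase.master.info.port)": 60010,
    "region server RPC (hbase.regionserver.port)": 60020,
    "region server web UI (hbase.regionserver.info.port)": 60030,
    "ZooKeeper client (hbase.zookeeper.property.clientPort)": 2181,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=DEFAULT_PORTS, timeout=1.0):
    """Map each labelled port to whether it accepted a connection."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_open in check_ports("localhost").items():
        print(f"{'open  ' if is_open else 'closed'}  {name}")
```

A port reported as closed can mean the process is down, it is bound to a different port or interface, or a firewall is dropping the traffic; running the check both on the server itself and from a client machine helps separate those cases.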
What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.
And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.
What is Old is New Again
Data-savvy companies of all sizes can now accomplish many viable machine learning projects.
At Cloudera, we believe that Cloudera Manager is the best way to install, configure, manage, and monitor your Apache Hadoop stack. Of course, most users prefer not to take our word for it — they want to know how Cloudera Manager works under the covers, first.
In this post, I’ll explain some of its inner workings.
The Vocabulary of Cloudera Manager
This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.
Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.
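To make the "command-based" idea concrete, here is a conceptual sketch of a record passing through a chain of small transformation commands. It is not the Morphlines API or its configuration syntax; the command names (`parse_line`, `drop_debug`, `rename_field`) and the `run_pipeline` helper are hypothetical and only illustrate the general pattern of composing a pipeline from reusable steps.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]
# A command takes a record and returns a transformed record, or None to drop it.
Command = Callable[[Record], Optional[Record]]


def parse_line(record: Record) -> Optional[Record]:
    """Split a raw 'key=value key=value' message into separate fields."""
    fields = dict(pair.split("=", 1) for pair in str(record["message"]).split())
    return {**record, **fields}


def drop_debug(record: Record) -> Optional[Record]:
    """Filter out records whose level field is DEBUG."""
    return None if record.get("level") == "DEBUG" else record


def rename_field(old: str, new: str) -> Command:
    """Build a command that renames one field, leaving the others untouched."""
    def command(record: Record) -> Optional[Record]:
        out = dict(record)
        if old in out:
            out[new] = out.pop(old)
        return out
    return command


def run_pipeline(commands: List[Command], records: Iterable[Record]) -> List[Record]:
    """Pipe each record through the commands in order, stopping if one drops it."""
    results = []
    for record in records:
        current: Optional[Record] = record
        for command in commands:
            current = command(current)
            if current is None:
                break
        if current is not None:
            results.append(current)
    return results
```

With this shape, changing the pipeline means reordering or swapping entries in a list rather than rewriting processing code, which is the kind of low-effort change the framework description above is aiming at.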
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for July through September 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.
As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!
|Date|Location|Event|Speaker and Topic|
|---|---|---|---|
|July 11|Boston|Boston HUG|Solr Committer Mark Miller on Solr+Hadoop|
|July 11|Santa Clara, Calif.|Big Data Gurus|Patrick Hunt on Solr+Hadoop|
|July 11|Palo Alto, Calif.|Cloudera Manager Meetup|Phil Zeyliger on Cloudera Manager internals|
|July 11|Kansas City, Mo.|KC Big Data|Matt Harris on Impala|
|July 17|Mountain View, Calif.|Bay Area Hadoop Meetups|Patrick Hunt on Solr+Hadoop|
|July 22|Chicago|Chicago Big Data|Hadoop and Lucene founder Doug Cutting on Solr+Hadoop|
|July 22|Portland, Ore.|OSCON 2013|Tom Wheeler on “Introduction to Apache Hadoop”|
|July 24|Portland, Ore.|OSCON 2013|Sqoop Committer Kate Ting on “Building an Impenetrable ZooKeeper”|
|July 24|Portland, Ore.|OSCON 2013|Jesse Anderson on “Doing Data Science On NFL Play by Play”|
|July 24|Portland, Ore.|OSCON 2013|Bigtop Committer Mark Grover on “Getting Hadoop, Hive and HBase up and running in less than 15 minutes”|
|July 24|Portland, Ore.|OSCON 2013|Hadoop Committer Colin McCabe on Locksmith|
|July 25|San Francisco|SF Data Engineering|Wolfgang Hoschek on Morphlines|
|July 25|Washington DC|Hadoop-DC|Joey Echeverria on Accumulo|
|Aug. 14|San Francisco|SF Hadoop Users|TBD, but we’re hosting!|
|Aug. 14|LA|LA HBase Users Meetup|HBase Committer/PMC Chair Michael Stack on HBase|
|Aug. 29|London|London Java Community|Hadoop Committer Tom White on CDK|
|Sept. 11|San Francisco|Cloudera Sessions (SOLD OUT)|Eric Sammer-led CDK lab|
|Sept. 12|New York|NYC Search, Discovery & Analytics Meetup|Solr Committer Mark Miller on Solr+Hadoop|
|Sept. 12|Cambridge, UK|Enterprise Search Cambridge UK|Tom White on Solr+Hadoop|
|Sept. 12|Los Angeles|LA Hadoop Users Group|Greg Chanan on Solr+Hadoop|
|Sept. 16|Sunnyvale, Calif.|Big Data Gurus|Eric Sammer on CDK|
|Sept. 17|Sunnyvale, Calif.|SF Large-Scale Production Engineering|Darren Lo on Hadoop Ops|
|Sept. 18|Mountain View, Calif.|Silicon Valley JUG|Wolfgang Hoschek on Morphlines|
|Sept. 19|El Dorado Hills, Calif.|NorCal Big Data|Apache Bigtop Committer Sean Mackrory on Bigtop & QuickStart VM|
|Sept. 24|Washington DC|Hadoop-DC|Doug Cutting on Apache Lucene|
Doug Cutting’s recent post about Cloudera Search included a hat-tip to Aaron McCurry, founder of the Blur project, for inspiring some of its design principles. We thought you would be interested in hearing more about Blur (which is mentored by Doug and Cloudera’s Patrick Hunt) from Aaron himself – thanks, Aaron, for the guest post below!
Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.
For those who are unfamiliar with it, Hue is a very popular, end-user focused, fully open source Web UI designed for interaction with Apache Hadoop and its ecosystem components. Founded by Cloudera employees, Hue has been around for quite some time, but only in the last 12 months has it evolved into the great ramp-up and interaction tool it is today. It’s fair to say that Hue is the most popular open source GUI for the Hadoop ecosystem among beginners, as well as a valuable tool for seasoned Hadoop users (and for users in enterprise environments generally), and it is the only end-user tool that ships with Hadoop distributions today. In fact, Hue is even redistributed and marketed as part of other user-experience and ramp-up-on-Hadoop VMs in the market.
Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: the very first release of a fully integrated Big Data management distribution built on Hadoop 2.0.5-alpha, currently the most advanced Hadoop 2.x release.
Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which uses Bigtop’s packaging code. Cloudera takes pride in developing open source packaging code and contributing it back to the community.
In this Customer Spotlight, I’d like to emphasize some undeniably positive use cases for Big Data, by looking at some of the ways the healthcare and life sciences industries are innovating to benefit humankind. Here are just a few examples:
Mount Sinai School of Medicine has partnered with Cloudera’s own Jeff Hammerbacher to apply Big Data to better predict and understand disease processes and treatments. The Mount Sinai School of Medicine is a top medical school in the US, noted for innovation in biomedical research, clinical care delivery, and community services. With Cloudera’s Big Data technology and Jeff’s data science expertise, Mount Sinai is better equipped to develop solutions designed for high-performance, scalable data analysis and multi-scale measurements. For example, medical research and discovery areas in genotype, gene expression and organ health will benefit from these Big Data applications.
CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest large volumes of widely varied data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged, so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.
However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle to find a time-efficient way to interact with, and explore, the data in Hadoop or HBase.