Cloudera Engineering Blog · Community Posts
With the close of 2013, we also thought it appropriate to include some high points from across the year (not listed in any particular order):
The new Cloudera Developer Newsletter makes its debut in January 2014.
Developers and data scientists, we realize you’re special – as are operators and analysts, in their own particular ways.
Find Cloudera tech talks in Berlin, Budapest, London, Stockholm, Tokyo, and across the US during this calendar quarter.
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees – this time, for the first calendar quarter of 2014 (January through March). Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).
Join us at Cloudera’s San Francisco office on Feb. 20 for tech talks, T-shirts, and adult refreshments!
As an extension of the DeveloperWeek Conf & Festival 2014 experience in San Francisco next month, join us at Cloudera’s San Francisco office for a Developer Happy Hour (beer + tech talks), focusing on Apache Hadoop 2 application development. Anyone (conference attendee or not) is free to attend, but RSVP now because seats (and “Data is the New Bacon” T-shirts) are limited!
From Python, to ZooKeeper, to Impala, to Parquet, blog readers in 2013 were interested in a variety of topics.
Clouderans and guest authors from across the ecosystem (LinkedIn, Netflix, Concurrent, Etsy, Stripe, Databricks, Oracle, Tableau, Alteryx, Talend, Twitter, Dell, SFDC, Endgame, MicroStrategy, Hazy Research, Wibidata, StackIQ, ZoomData, Damballa, Mu Sigma) published prolifically on the Cloudera Developer blog in 2013, with more than 250 new posts, averaging roughly one per business day.
Flavio Junqueira (PMC Chair of the Apache ZooKeeper project and a member of the Systems and Networking Group at Microsoft Research) and Benjamin Reed (PMC Member and Software Engineer at Facebook) are the co-authors of the new O’Reilly Media book ZooKeeper: Distributed Process Coordination. We had a chat with Flavio and Ben recently about the rationale for writing the book, and what it will add to the distributed systems conversation.
Welcome to our fifth edition of “This Month in the Ecosystem,” a digest of highlights from November 2013 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).
With the holidays upon us, the news in November was sparse. Even so, the ecosystem never stops churning!
Some things for which we are thankful, the 2013 edition (not listed in order):
1. The entire Apache Hadoop community for its constant and hard work to Make the Platform Better,
Since its inception, Cloudera has been an enthusiastic supporter of user groups and meetups worldwide. And now, we’re extending that support yet further, by incubating new Cloudera User Groups (CUGs) in the San Francisco Bay Area, Chicago area, and New York City.
Unlike grass-roots user groups, which are inherently community-oriented and have no particular vendor preference, CUGs are designed and intended for users of Cloudera Standard (our free offering containing CDH and Cloudera Manager) and customers of Cloudera Enterprise (our paid, supported offering containing CDH, Cloudera Manager, and enterprise functionality such as rolling upgrades and Cloudera Navigator). For that reason, I predict that CUG conversations will tend to focus on the differentiated aspects of the Cloudera platform.
Welcome to our fourth edition of “This Month in the Ecosystem,” a digest of highlights from October 2013 (never intended to be comprehensive; for completeness, see Hadoop Weekly).
For generating sheer excitement, that month set a high bar to meet in the future:
In the wake of the Strata + Hadoop World 2013 afterglow, speaker slides and video have been posted. For your convenience, they are aggregated below:
For those of you attending virtually/in spirit, I thought it would be nice to bring you a selection of photos from the week so far. Credit goes to Alex Moundalexis (@technmsg) for the majority of these shots.
Kate Ting, Apache Sqoop cookbook co-chef.
We are just a weekend away from the Biggest. Strata + Hadoop World. Ever.
The following post, by Apache HBase 0.96 Release Manager/Cloudera Software Engineer Michael Stack, was published originally at blogs.apache.org and is provided below for your convenience. Our thanks to the release’s numerous contributors!
Note: HBase 0.96 will be packaged in the next release of CDH (CDH 5).
The release of Apache Hadoop 2, as announced today by the Apache Software Foundation, is an exciting one for the entire Hadoop ecosystem.
Cloudera engineers have been working hard for many months with the rest of the vast Hadoop community to ensure that Hadoop 2 is the best it can possibly be, for the users of Cloudera’s platform as well as all Hadoop users generally. Hadoop 2 contains many major advances, including (but not limited to):
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for October through December 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.
As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!
|Oct. 1||Aarhus, Denmark||GOTO Aarhus||Eva Andreasson on Hadoop use cases|
|Oct. 8||Sunnyvale, Calif.||Hadoop Happy Hour||Kathleen Ting and Jarek Cecho sign books!|
|Oct. 9||Santa Clara, Calif.||IEEE BigData Conference||Amr Awadallah on Hadoop use cases|
|Oct. 9||San Francisco||SF Hadoop Users||Eric Sammer on Hadoop app development (panelist)|
|Oct. 10||Sydney||DataCon||Sean Owen on data science|
|Oct. 15||Durham, NC||TriHUG||Mark Miller on Solr+Hadoop|
|Oct. 15||Mountain View, Calif.||Oracle NoSQL & Big Data Meetup||Mike Olson on virtues of key-value stores|
|Oct. 15-17||Burlingame, Calif.||Big Data TechCon||Apache Hive workshop with Mark Grover|
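|Oct. 15-17||Burlingame, Calif.||Big Data TechCon||Doug Cutting on the Hadoop revolution|
|Oct. 15-17||Burlingame, Calif.||Big Data TechCon||Hadoop app development (CDK) workshop with Ryan Blue|
|Oct. 15-17||Burlingame, Calif.||Big Data TechCon||Jonathan Seidman on extending data infrastructure with Hadoop|
|Oct. 15-17||Burlingame, Calif.||Big Data TechCon||Jonathan Seidman on the Hadoop ecosystem|
|Oct. 15-17||Burlingame, Calif.||Big Data TechCon||Himanshu Vashishtha on HBase use cases|
|Oct. 15-17||Burlingame, Calif.||Big Data TechCon||Kate Ting on Apache ZooKeeper|
|Oct. 15-17||Burlingame, Calif.||Big Data TechCon||Kate Ting on 7 Deadly Hadoop Misconfigurations|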
|Oct. 16||Dallas, Tex.||DFW Big Data||John Ringhofer on Impala|
|Oct. 17||Milwaukee, Wis.||Cloudera Sessions||Hadoop app development lab (on CDK) with Ryan Blue|
|Oct. 17||St. Louis, Mo.||St. Louis HUG||Tom Wheeler on Parquet|
|Oct. 18||Munich||HUG Munich||Lars George on Impala|
|Oct. 22||London||UK HUG||Sean Owen on Scalable Big Learning|
|Oct. 23||Seattle||Seattle Scalability Meetup||Ronan Stokes on Cloudera Search|
|Oct. 24||Palo Alto, Calif.||Bay Area HBase User Group||Michael Stack on HBase 0.96|
|Oct. 24||Raleigh, NC||All Things Open||Josh Wills on open source innovation|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Mike Olson on Hadoop’s impact on data management|
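|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Doug Cutting on the future of Hadoop|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Henry Robinson on workload diversity in Hadoop|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Hadoop app development (CDK) workshop with Eric Sammer|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Matt Brandwein on leveraging mainframe data with Hadoop|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Aaron T. Myers and Shreepadma Venugopalan on Hadoop security|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Jayant Shekar on machine data analytics|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Amandeep Khurana on Monsanto’s use case for Hadoop & HBase|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Philip Zeyliger on debugging distributed systems|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Greg Rahn on Impala performance tuning|
|Oct. 28-30||New York||Strata Conference + Hadoop World 2013||Jon Hsieh on HBase roadmap|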
|Oct. 28||New York||NYC HUG||Arvind Prabhakar on Apache Sentry (incubating)|
|Oct. 28||New York||Sqoop User Meetup||Abe Elmahrek on the Sqoop2 app for Hue|
|Oct. 29||New York||Impala + Parquet Meetup||Greg Rahn on Impala+Parquet performance tuning|
|Oct. 29||New York||Cloudera Manager Meetup||Aditya Acharya on Cloudera Manager success stories|
|Oct. 30||New York||Apache Sentry User Meetup||Arvind Prabhakar and Shreepadma Venugopalan with a Sentry overview|
|Oct. 30||Philadelphia||Chariot Data IO Conference||Lars George on HBase sizing as well as on Parquet|
|Nov. 6||Chantilly, Va.||Open Source Search Conference||Alex Moundalexis on Search+Hadoop|
|Nov. 6||Munich||JAX Munich||Lars George on HBase and Impala|
|Nov. 7||Tokyo||Cloudera World Tokyo||Kiyoshi Mizumaru on CDH|
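|Nov. 7||Tokyo||Cloudera World Tokyo||Sho Shimauchi on Cloudera Manager|
|Nov. 7||Tokyo||Cloudera World Tokyo||Tatsuo Kawasaki with a Hadoop 101|
|Nov. 7||Tokyo||Cloudera World Tokyo||Daisuke Kobayashi on Hadoop ops|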
|Nov. 11||London||UK HUG||Marcel Kornacker on Impala|
|Nov. 12-13||London||Strata London||Sean Owen on Scalable Big Learning; Tom White on Hadoop app development with CDK|
|Nov. 12||San Francisco||QCon SF||Josh Wills on machine learning|
|Nov. 13||Washington DC||LISA 2013||John Ridley on Hadoop 101 for sysadmins|
|Nov. 14||Seoul||Tech Planet Korea||Michael Stack on HBase roadmap|
|Nov. 14||Tokyo||Cloudera Manager Meetup||Sho Shimauchi, Kiyoshi Mizumaru: What is Cloudera Manager?|
|Nov. 14||Antwerp||Devoxx Belgium||Tom White on building Hadoop apps with CDK|
|Nov. 16||Los Angeles||Big Data Camp LA||Alex Behm on Impala|
|Nov. 20||Boulder, Colo.||Boulder/Denver Big Data Meetup||John Darrah on Hadoop 101|
|Dec. 2||Tokyo||Cloudera Manager Meetup||Sho Shimauchi, Kiyoshi Mizumaru: What is Cloudera Manager?|
History teaches us that ecosystem growth is fueled by enthusiasm, tools (including frameworks and APIs), and knowledge in roughly equal measures. To this point, the Apache Hadoop ecosystem has been blessed with the first two ingredients – thanks to the magic of open source – but in the third category, there is still plenty of work to be done.
Welcome to our third edition of “This Month in the Ecosystem,” a digest of highlights from September 2013 (never intended to be comprehensive; for completeness, see Hadoop Weekly).
Note: there were a few other interesting developments this week, but out of respect for the calendar, I’ll address them next month.
Strata Conference + Hadoop World 2013 (Oct. 28-30 in New York City) approaches (register here for an automatic 20% discount), and that means it’s time to get your meetup schedule sorted out!
There are a variety of them planned across the week (something for everyone!), onsite at the conference hotel as well as offsite. Use the links below to RSVP.
Welcome to our second edition of “This Month in the Ecosystem.” (See the inaugural edition here.) Although August was not as busy as July, there are some very notable highlights to report.
Today, I thought it would be helpful to highlight some features that will help you get the most out of this new service:
Strata Conference + Hadoop World 2013 is looming on the horizon and pacing to be the largest gathering of Big Data professionals on the globe. As co-hosts with O’Reilly, we have seen the conference thrive and grow, and we are excited about the upcoming Oct. 28 – 30 event!
The ecosystem is evolving at a rapid pace – so rapidly that important developments are often passing through the public attention zone too quickly. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)
Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.
Cloudera Impala has made huge progress since its initial announcement – and there’s even more good news on the roadmap. To learn more, plan to attend an Impala meetup hosted by Cloudera in its San Francisco offices on the evening of Aug. 20:
We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).
OSCON 2013 is already receding in the rear-view mirror, but we had a great time. Cloudera’s sessions were very well attended, with Tom Wheeler taking the prize (well over 200 attendees for his “Introduction to Apache Hadoop” tutorial), but best of all was the opportunity to meet and mingle with people in the broader open source community. If you visited us at Booth 420, we hope you will now download and install the QuickStart VM after seeing it in our demo, and that your questions were adequately answered (most popular question: “Can you tell me more about Cloudera Impala?”).
In my biased opinion, the crowning achievement was our ability not only to distribute a couple hundred “Data is the New Bacon” T-shirts within a 36-hour period, but also to clean ourselves out of the meat-free version shortly thereafter:
This is a great day for technical end-users – developers, admins, analysts, and data scientists alike. Starting now, Cloudera complements its traditional mailing lists with new, feature-rich community forums intended for users of Cloudera’s Platform for Big Data! (Log in using your existing credentials or click the link to register.)
Although mailing lists have long been a standard for user interaction, and will undoubtedly continue to be, they have flaws. For example, they lack structure or taxonomy, which makes consumption difficult. Search functionality is often less than stellar and users are unable to build reputations that span an appreciable period of time. For these reasons, although they’re easy to create and manage, mailing lists inherently limit access to knowledge and hence limit adoption.
Continuing the fine tradition of Clouderans contributing books to the Apache Hadoop ecosystem, Apache Sqoop Committers/PMC Members Kathleen Ting and Jarek Jarcec Cecho have officially joined the book author community: their Apache Sqoop Cookbook is now available from O’Reilly Media (with a pelican as the assigned cover beast).
The book arrives at an ideal time. Hadoop has quickly become the standard for processing and analyzing Big Data, and in order to integrate a new Hadoop deployment into your existing environment, you will very likely need to transfer data stored in legacy relational databases into your new cluster.
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for July through September 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.
As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!
|July 11||Boston||Boston HUG||Solr Committer Mark Miller on Solr+Hadoop|
|July 11||Santa Clara, Calif.||Big Data Gurus||Patrick Hunt on Solr+Hadoop|
|July 11||Palo Alto, Calif.||Cloudera Manager Meetup||Phil Zeyliger on Cloudera Manager internals|
|July 11||Kansas City, Mo.||KC Big Data||Matt Harris on Impala|
|July 17||Mountain View, Calif.||Bay Area Hadoop Meetups||Patrick Hunt on Solr+Hadoop|
|July 22||Chicago||Chicago Big Data||Hadoop and Lucene founder Doug Cutting on Solr+Hadoop|
|July 22||Portland, Ore.||OSCON 2013||Tom Wheeler on “Introduction to Apache Hadoop”|
|July 24||Portland, Ore.||OSCON 2013||Sqoop Committer Kate Ting on “Building an Impenetrable ZooKeeper”|
|July 24||Portland, Ore.||OSCON 2013||Jesse Anderson on “Doing Data Science On NFL Play by Play”|
|July 24||Portland, Ore.||OSCON 2013||Bigtop Committer Mark Grover on “Getting Hadoop, Hive and HBase up and running in less than 15 minutes”|
|July 24||Portland, Ore.||OSCON 2013||Hadoop Committer Colin McCabe on Locksmith|
|July 25||San Francisco||SF Data Engineering||Wolfgang Hoschek on Morphlines|
|July 25||Washington DC||Hadoop-DC||Joey Echeverria on Accumulo|
|Aug. 14||San Francisco||SF Hadoop Users||TBD, but we’re hosting!|
|Aug. 14||LA||LA HBase Users Meetup||HBase Committer/PMC Chair Michael Stack on HBase|
|Aug. 29||London||London Java Community||Hadoop Committer Tom White on CDK|
|Sept. 11||San Francisco||Cloudera Sessions (SOLD OUT)||Eric Sammer-led CDK lab|
|Sept. 12||New York||NYC Search, Discovery & Analytics Meetup||Solr Committer Mark Miller on Solr+Hadoop|
|Sept. 12||Cambridge, UK||Enterprise Search Cambridge UK||Tom White on Solr+Hadoop|
|Sept. 12||Los Angeles||LA Hadoop Users Group||Greg Chanan on Solr+Hadoop|
|Sept. 16||Sunnyvale, Calif.||Big Data Gurus||Eric Sammer on CDK|
|Sept. 17||Sunnyvale, Calif.||SF Large-Scale Production Engineering||Darren Lo on Hadoop Ops|
|Sept. 18||Mountain View, Calif.||Silicon Valley JUG||Wolfgang Hoschek on Morphlines|
|Sept. 19||El Dorado Hills, Calif.||NorCal Big Data||Apache Bigtop Committer Sean Mackrory on Bigtop & QuickStart VM|
|Sept. 24||Washington DC||Hadoop-DC||Doug Cutting on Apache Lucene|
In this installment of “Meet the Project Founder”, meet Apache Oozie PMC member (and ASF member) Alejandro Abdelnur, the Cloudera software engineer who founded what eventually became the Apache Oozie project in 2011. Alejandro is also on the PMC of Apache Hadoop.
What led you to your project idea(s)?
Hadoop Summit convenes next week, and even if you’re not attending, there are a host of meetup opportunities available to you during the week.
Here are just a few, and you can find a full list here.
For those of you who missed the show, session video and presentation slides (as well as photos) will be available via hbasecon.com in a few weeks. (To be notified, follow @cloudera or @ClouderaEng.) Although it’s not quite as good as being there with the rest of the community, you’ll still be able to benefit from the real-world experiences of Apache HBase users like Facebook, Box, Yahoo!, Salesforce.com, Pinterest, Twitter, Groupon, and more.
HBaseCon 2013 is this Thursday (June 13 in San Francisco), and we can hardly wait!
What do you do at Cloudera (and in which Apache project(s) are you involved)?
Unbelievably, HBaseCon 2013 is only one week away (June 13 in San Francisco)!
As we march toward HBaseCon 2013 (June 13 in San Francisco), it’s time to bring you a preview of the Internals track (see the Operations track preview here) — the track guaranteed to be of most interest to Apache HBase developers and other people tracking the progress of the code base.
Our thanks to Jordan Zimmerman, software engineer at Netflix, for the guest post below about the recently announced Apache Curator (incubating) project.
As you have probably learned by now, HBaseCon 2013 sessions are organized into four tracks: Operations, Internals, Ecosystem, and Case Studies. In combination, they offer a 360-degree view of Apache HBase that is invaluable for experts and aspiring experts alike. In the next few posts leading up to the conference (June 13 in San Francisco – register now while there’s still room), we’ll offer sneak previews of what each track has to offer.
Mark your calendars, all you data cyclists!
I’m visiting Paris, London, and Edinburgh this June. When I travel I like to talk to locals. And, wherever I am, I like to bicycle. So, I thought I might combine these interests and host “data rides” in these three cities.
This installment of “Meet the Project Founder” features Apache Bigtop founder and PMC Chair/VP Roman Shaposhnik.
What led you to your project idea(s)?
Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.
This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)
The schedule/agenda grid for HBaseCon 2013 (rapidly approaching: June 13 in San Francisco) is a thing of beauty.
HBaseCon 2013 is approaching fast – June 13 in San Francisco. If you’re on the fence about attending – or perhaps your manager is on the fence about approving your participation – here are a few things that you/they need to know (in no particular order):
- HBaseCon is the annual rallying point for the HBase community. If you’ve ever had a desire to learn how to get involved in the community as a contributor, or just want to ask a committer or PMC member why things are done (or not done) a certain way, this is your opportunity – because this is where those people are. Participating in a mailing list thread is never quite the same once you’ve met the people behind it.
- HBaseCon is a one-stop shop for learning about the HBase roadmap, as well as other projects across the ecosystem. Current HBase users should be particularly interested in learning about which JIRAs will have the most impact on the user experience – and once again, most of the committers working on those JIRAs will either be leading sessions or otherwise present. Plus, you can learn about how new complementary projects like Impala, Kiji, Phoenix, and Honeycomb are transforming the use cases for HBase and helping to expand its footprint across the enterprise.
- HBaseCon is a feast of real-world experiences and use cases. Sure, maybe you’ve read about the HBase-backed applications used by companies like Facebook, Salesforce.com, eBay, Pinterest, and Yahoo!. But wouldn’t it be helpful to hear technical details and best practices directly from the people who built and run them? I’ll bet it would. And you really can’t do that anywhere else — in the whole world. (Plus, you can take advantage of formal training right before the conference, at a discount.)
- HBaseCon is a pageant of engineer rock-stars. If your company is an HBase user and hungry for talent, there’s no better place to find it: HBaseCon is literally the world’s biggest gathering of HBase experts under one roof.
- HBaseCon is a heck of a blast. Come for the deep-dives and advice, stay for the after-event party. The libations will be extensive!
At Cloudera, there is a long and proud tradition of employees creating new open source projects intended to help fill gaps in platform functionality (in addition to hiring new employees who have done so in the past). In fact, more than a dozen ecosystem projects — including Apache Hadoop itself — were founded by Clouderans, more than can be attributed to employees of any other single company. Cloudera was also the first vendor to ship most of those projects as enterprise-ready bits inside its platform.
We thought you might be interested in meeting some of them over the next few months, in a new “Meet the Project Founder” series. It’s only appropriate that we begin with Doug Cutting himself – Cloudera’s chief architect and the quadruple-threat founder of Apache Lucene, Apache Nutch, Apache Hadoop, and Apache Avro.
Today Cloudera announced a new Cloudera Academic Partnership program, in which participating universities worldwide get access to curriculum, training, certification, and software.
As noted in the press release, the global demand for people with Apache Hadoop and data science skills is dwarfing all supply. We consider it an important mission to help accredited universities meet that demand, by equipping them with the content and training they need to educate students in the Hadoop arts.
It’s only Rock and Roll, but I like it!
– Mick Jagger
Copyright is having a tough time in the digital age. New copies of music, movies and software can be created at near zero cost. Some wonder whether it still makes sense to ever charge for content. Over the past century large industries have developed that sell content. These industries resist change. We consumers love our content, but don’t love paying for it. But would all the content we love still exist without payment for copyright?
It’s time for me to give you a quarterly update (here’s the one for Q1) about where to find tech talks by Cloudera employees in 2013. Committers, contributors, and other engineers will travel to meetups and conferences near and far to do their part in the community to make Apache Hadoop a household word!
(Remember, we’re always ready to assist your meetup by providing speakers, sponsorships, and schwag.)
Cloudera will be a proud exhibitor at O’Reilly OSCON 2013 (July 22-26 in Portland, OR), which in our opinion is a shining light in the open source community. So be sure to look for us at Booth #420!
In the technology business, building a thriving and progressive user ecosystem around a platform is about as Mom-and-apple-pie as you can get. We all intuitively acknowledge that it’s one of the metrics for success.