Cloudera Engineering Blog · Community Posts

Community Meetups at Strata + Hadoop World NYC 2015

Strata + Hadoop World 2015 NYC is more than a daytime conference; it’s also a nighttime meetup experience. (Plus, there are a bunch of book signings.)

It won’t be long before we’re all in NYC for Strata + Hadoop World (Sept. 29-Oct. 1; if you haven’t registered yet, a 20% discount is still available). So, consider for your evening agenda:

Apache Spark Comes to Apache HBase with HBase-Spark Module

The SparkOnHBase project in Cloudera Labs was recently merged into the Apache HBase trunk. In this post, learn the project’s history and what the future looks like for the new HBase-Spark module.

SparkOnHBase was first pushed to Github on July 2014, just six months after Spark Summit 2013 and five months after Apache Spark first shipped in CDH. That conference was a big turning point for me, because for the first time I realized that the MapReduce engine had a very strong competitor. Spark was about to enter an exciting new phase in its open source life cycle, and just one year later, it’s used at massive scale at 100s if not 1000s of companies (with 200+ of them doing so on Cloudera’s platform).

The New Wrangle Conference: Solving the Hardest Data Science Challenges from Startup to Enterprise

Wrangle, a new conference dedicated to the practice of data science from startup to enterprise, debuts in San Francisco on Oct. 22, 2015.

Even as Cloudera introduce new tools for analytics and machine learning into its platform (like the recently announced Ibis project, for example), we are mindful of the fact that many of the hardest problems in data science cannot be solved by technology alone. From the smallest startups to the largest enterprises, we see companies struggling with how to acquire and manage new data sources, recruit and train the next generation of data scientists, and create a data-driven culture that crosses every level of the organization.

Call for Demos: Developer Showcase at Strata + Hadoop World NYC 2015

Strata + Hadoop World New York 2015 needs your developer demos! The proposal period closes on Aug. 14.

As everyone knows, Apache Hadoop’s overwhelming success is partly premised on de-centralized innovation from all corners of the community—users, vendors, and academia—with everyone participating on a level playing field. And since 2011, Strata + Hadoop World has been a community and content hub of that ecosystem.

Strata + Hadoop World NYC 2015 Content Preview

The Strata + Hadoop World NYC 2015 (Sept. 29-Oct. 3) agenda was published in the last few days. Congratulations to all accepted presenters!

In this post, I just want to provide a concise digest of the tutorials and sessions that will involve Cloudera or Intel engineers and/or interesting use cases. There are many worthy sessions from which to choose, so we hope this list will influence your decisions about where to spend your time during the week! (Note that evening meetups are a work in progress; more on those later.)

Security, Hive-on-Spark, and Other Improvements in Apache Hive 1.2.0

Apache Hive 1.2.0, although not a major release, contains significant improvements.

Recently, the Apache Hive community moved to a more frequent, incremental release schedule. So, a little while ago, we covered the Apache Hive 1.0.0 release and explained how it was renamed from 0.14.1 with only minor feature additions since 0.14.0.

Impala Needs Your Contributions

Your contributions, and a vibrant developer community, are important for Impala’s users. Read below to learn how to get involved.

From the moment that Cloudera announced it at Strata New York in 2012, Impala has been an 100% Apache-licensed open source project. All of Impala’s source code is available on GitHub—where nearly 500 users have forked the project for their own use—and we follow the same model as every other platform project at Cloudera: code changes are committed “upstream” first, and are then selected and backported to our release branches for CDH releases.

Sneak Preview: HBaseCon 2015 Use Cases Track

This year’s HBaseCon Use Cases track includes war stories about some of the world’s best examples of running Apache HBase in production.

As a final sneak preview leading up to the show next week, in this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Use Cases track.

Sneak Preview: HBaseCon 2015 Ecosystem Track

This year’s HBaseCon Ecosystem track covers projects that are complementary to HBase (with a focus on SQL) such as Apache Phoenix, Apache Kylin, and Trafodion.

In this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Ecosystem track.

Sneak Preview: HBaseCon 2015 Development & Internals Track

This year’s HBaseCon Development & Internals track covers new features in HBase 1.0, what’s to come in 2.0, best practices for tuning, and more.

In this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Development & Internals track.

Sneak Preview: HBaseCon 2015 Operations Track

This year’s HBaseCon Operations track features some of the world’s largest and most impressive operators.

In this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Operations track.

Sneak Preview: HBaseCon 2015 General Session

As is its tradition, this year’s HBaseCon General Session includes keynotes about the world’s most awesome HBase deployments.

It’s Spring, which also means that it’s HBaseCon season—the time when the Apache HBase community gathers for its annual ritual.

This Month in the Ecosystem (February 2015)

Welcome to our 17th edition of “This Month in the Ecosystem,” a digest of highlights from February 2015 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Wow, a ton of news for such a short month:

Apache HBase 1.0 is Released

The Cloudera HBase Team are proud to be members of Apache HBase’s model community and are currently AWOL, busy celebrating the release of the milestone Apache HBase 1.0. The following, from release manager Enis Soztutar, was published today in the ASF’s blog.

 

Apache Hive 1.0.0 Has Been Released

The Apache Hive PMC has recently voted to release Hive 1.0.0 (formerly known as Hive 0.14.1).

This release is recognition of the work the Apache Hive community has done over the past nine years and is continuing to do. The Apache Hive 1.0.0 release is a codebase that was expected to be released as 0.14.1 but the community felt it was time to move to a 1.x.y release naming structure.

This Month in the Ecosystem (January 2015)

Welcome to our 16th edition of “This Month in the Ecosystem,” a digest of highlights from January 2015 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly). 

You may have noticed that this report went on hiatus for December 2014 due to a lack of critical news mass (plus, we realize that most of you are out of the loop until mid-January). It’s back with a vengeance, though:

Where to Find Cloudera Tech Talks (through March 2015)

Find Cloudera tech talks in Austin, London, Washington DC, Zurich, and other cities through March 2015.

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees—this time, through the first quarter of calendar year 2015. Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).

HBaseCon 2015: Call for Papers and Early Bird Registration

HBaseCon 2015 is ON, people! Book Thursday, May 7, in your calendars.

If you’re a developer in Silicon Valley, you probably already know that since its debut in 2012, HBaseCon has been one of the best developer community conferences out there. If you’re not, this is a great opportunity to learn that for yourself: HBaseCon 2015 will occur on Thurs., May 7, 2015, at the Westin St. Francis on Union Square in San Francisco.

The Top 10 Posts of 2014 from the Cloudera Engineering Blog

Our “Top 10″ list of blog posts published during a calendar year is a crowd favorite (see the 2013 version here), in particular because it serves as informal, crowdsourced research about popular interests. Page views don’t lie (although skew for publishing date—clearly, posts that publish earlier in the year have pole position—has to be taken into account). 

In 2014, a strong interest in various new components that bring real time or near-real time capabilities to the Apache Hadoop ecosystem is apparent. And we’re particularly proud that the most popular post was authored by a non-employee.

  1. How-to: Create a Simple Hadoop Cluster with VirtualBox
    by Christian Javet
    Explains how t set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager.
  2. Why Apache Spark is a Crossover Hit for Data Scientists
    by Sean Owen

    An explanation of why Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics. 
  3. How-to: Run a Simple Spark App in CDH 5
    by Sandy Ryza
    Helps you get started with Spark using a simple example.
  4. New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead
    by Justin Erickson, Marcel Kornacker & Dileep Kumar

    Open benchmark testing of Impala 1.3 demonstrates performance leadership compared to alternatives (by 950% or more), while providing greater query throughput and with a far smaller CPU footprint.
  5. Apache Kafka for Beginners
    by Gwen Shapira & Jeff Holoman
    When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.
  6. Apache Hadoop YARN: Avoiding 6 Time-Consuming “Gotchas”
    by Jeff Bean
    Understanding some key differences between MR1 and MR2/YARN will make your migration much easier.
  7. Impala Performance Update: Now Reaching DBMS-Class Speed
    by Justin Erickson, Greg Rahn, Marcel Kornacker & Yanpei Chen
    As of release 1.1.1, Impala’s speed beat the fastest SQL-on-Hadoop alternatives–including a popular analytic DBMS running on its own proprietary data store.
  8. The Truth About MapReduce Performance on SSDs
    by Karthik Kambatla & Yanpei Chen

    It turns out that cost-per-performance, not cost-per-capacity, is the better metric for evaluating the true value of SSDs. (See the session on this topic at Strata+Hadoop World San Jose in Feb. 2015!)
  9. How-to: Translate from MapReduce to Spark
    by Sean Owen

    The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
  10. How-to: Write and Run Apache Giraph Jobs on Hadoop
    by Mirko Kämpf
    Explains how to create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Hands-on Hive-on-Spark in the AWS Cloud

Interested in Hive-on-Spark progress? This new AMI gives you a hands-on experience.

Nearly one year ago, the Apache Hadoop community began to embrace Apache Spark as a powerful batch-processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this shift, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292 which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.

Progress Report: Community Contributions to Parquet

Community contributions to Parquet are increasing in parallel with its adoption. Here are some of the highlights.

Apache Parquet (incubating), the open source, general-purpose columnar storage format for Apache Hadoop, was co-founded only 18 months ago by Cloudera and Twitter. Since that time, its rapid adoption by multiple platform vendors and communities has made it a de facto standard for this purpose.

This Month in the Ecosystem (November 2014)

Welcome to our 15th edition of “This Month in the Ecosystem,” a digest of highlights from November 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

November was busy, even accounting for the US Thanksgiving holiday:

Apache Hadoop 2.6 is Released

The Apache Hadoop community has voted to release Hadoop 2.6. Congrats to all contributors!

This new release contains a variety of improvements, particularly in the storage layer and in YARN. We’re particularly excited about the encryption-at-rest feature in HDFS!

Apache Hive on Apache Spark: The First Demo

The community effort to make Apache Spark an execution engine for Apache Hive is making solid progress.

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

The Story of the Cloudera Engineering Hackathon (2014 Edition)

Cloudera’s culture is premised on innovation and teamwork, and there’s no better example of them in action than our internal hackathon.

Cloudera Engineering doubled-down on its “hackathon” tradition last week, with this year’s edition taking an around-the-clock approach thanks to the HQ building upgrade since the 2013 edition (just look at all that space!).

Where to Find Cloudera Tech Talks (Through End of 2014)

Find Cloudera tech talks in Seattle, Las Vegas, London, Madrid, Budapest, Barcelona, Washington DC, Toronto, and other cities through the end of 2014.

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees—this time, for the remaining dates of 2014. Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).

This Month in the Ecosystem (October 2014)

Welcome to our 14th edition of “This Month in the Ecosystem,” a digest of highlights from October 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D

Cloudera Labs contains ecosystem innovations that one day may bring developers more functionality or productivity in CDH.

Since its inception, one of the defining characteristics of Apache Hadoop has been its ability to evolve/reinvent and thrive at the same time. For example, two years ago, nobody could have predicted that the formative MapReduce engine, one of the cornerstones of “original” Hadoop, would be marginalized or even replaced. Yet today, that appears to be happening via Apache Spark, with Hadoop becoming the stronger for it. Similarly, we’ve seen other relatively new components, like Impala, Apache Parquet (incubating), and Apache Sentry (also incubating), become widely adopted in relatively short order.

This Month in the Ecosystem (September 2014)

Welcome to our 13th edition of “This Month in the Ecosystem,” a digest of highlights from September 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Community Meetups during Strata + Hadoop World 2014

The meetup opportunities during the conference week are more expansive than ever — spanning Impala, Spark, HBase, Kafka, and more.

Strata + Hadoop World 2014 is a kaleidoscope of experiences for attendees, and those experiences aren’t contained within the conference center’s walls. For example, the meetups that occur during the conf week (which is concurrent with NYC DataWeek) are a virtual track for developers — and with Strata + Hadoop World being bigger than ever, so is the scope of that track.

This Month in the Ecosystem (August 2014)

Welcome to our 12th (first annual!) edition of “This Month in the Ecosystem,” a digest of highlights from August 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Running CDH 5 on GlusterFS 3.3

The following post was written by Jay Vyas (@jayunit100) and originally published in the Gluster.org Community.

I have recently spent some time getting Cloudera’s CDH 5 distribution of Apache Hadoop to work on GlusterFS 3.3 using Distributed Replicated 2 Volumes. This is made possible by the fact that Apache Hadoop has a pluggable filesystem architecture that allows the computational components within the CDH 5 distribution to be configured to use alternative filesystems to HDFS. In this case, one can configure CDH 5 to use the Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop), which allows it to run on GlusterFS 3.3. I’ve provided a diagram below that illustrates the CDH 5 core processes and how they interact with GlusterFS.

Progress Report: Cloudera Community Forums After One Year

Cloudera Community forums are proving their value as an important contributor to a rich user experience.

It’s been almost exactly one year since the debut of the Cloudera Community forums. In addition to doing the birthday shout-out, I thought it would be interesting to bring you up to date about adoption and usage patterns.

This Month in the Ecosystem (June 2014)

Welcome to our 10th edition of “This Month in the Ecosystem,” a digest of highlights from June 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Pretty busy for early Summer:

Apache Hive on Apache Spark: Motivations and Design Principles

Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.

(Editor’s note [Feb. 25, 2015]: A Hive-on-Spark beta release is now available for download. Learn more here.)

Where to Find Cloudera Tech Talks (Through September 2014)

Find Cloudera tech talks in Texas, Oregon, Washington DC, Illinois, Georgia, Japan, and across the SF Bay Area during the next calendar quarter.

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees – this time, for the third calendar quarter of 2014 (July through September; traditionally, the least active quarter of the year). Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).

This Month in the Ecosystem (April 2014)

Welcome to our eighth edition of “This Month in the Ecosystem,” a digest of highlights from April 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

More good news!

HBaseCon 2014 is a Wrap!

HBaseCon 2014 is in the books. Thanks to all attendees, speakers, and sponsors!

HBaseCon 2014, much like a butterfly, lived for a short number of hours on Monday — but it certainly was beautiful while it lasted! (See photos here.)

Sneak Preview: "Case Studies" Track at HBaseCon 2014

The HBaseCon 2014 “Case Studies” track surfaces some of the most interesting (and diverse) use cases in the HBase ecosystem — and in the world of NoSQL overall — today.

The HBaseCon 2014 (May 5, 2014 in San Francisco) is not just about internals and best practices — it’s also a place to explore use cases that you not have even considered before.

Sneak Preview: "Ecosystem" Track at HBaseCon 2014

The HBaseCon 2014 “Ecosystem” track offers a cross-section view of the most interesting projects emerging on top of, or alongside, HBase.

The HBaseCon 2014 (May 5, 2014 in San Francisco) is not just a reflection of HBase itself — it’s also a celebration of the entire ecosystem. Thanks again, Program Committee!

This Month in the Ecosystem (March 2014)

Welcome to our seventh edition of “This Month in the Ecosystem,” a digest of highlights from March 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

More good news for the ecosystem!

Sneak Preview: "Features & Internals" Track at HBaseCon 2014

The HBaseCon 2014 “Features & Internals” track covers the newest developments in Apache HBase functionality.

The HBaseCon 2014 (May 5, 2014 in San Francisco) agenda has something for everyone – particularly, developers building apps on HBase. Thanks again, Program Committee!

Sneak Preview: HBaseCon 2014 "Operations" Track

HBaseCon 2014 “Operations” track reveals best practices used by some of the world’s largest production-cluster operators.

The HBaseCon 2014 (May 5, 2014 in San Francisco) agenda is particularly strong in the area of operations. Thanks again, Program Committee!

Where to Find Cloudera Tech Talks (Through June 2014)

Find Cloudera tech talks in Amsterdam, Boston, Berlin, Sao Paulo, Singapore, Zurich, and other cities across Europe and the US during the next calendar quarter.

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees – this time, for the second calendar quarter of 2014 (April through June). Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).

Sneak Preview: HBaseCon 2014 General Session

The HBaseCon 2014 General Session – with keynotes by Facebook, Google, and Salesforce.com engineers – is arguably the best ever.

HBaseCon 2014 (May 5, 2014 in San Francisco) is coming very, very soon. Over the next few weeks, as I did for last year’s conference, I’ll be bringing you sneak previews of session content (across Operations, Features & Internals, Ecosystem, and Case Studies tracks) accepted by the Program Committee.

HBaseCon 2014: Speakers, Keynotes, and Sessions Announced

Users of diverse, real-world HBase deployments around the world present at this year’s event.

This year’s agenda for HBaseCon, the conference for the Apache HBase community (developers, operators, contributors), looks “Stack-ed” with can’t-miss keynotes and breakouts. Program committee, you really came through (again).

This Month in the Ecosystem (February 2014)

Welcome to our sixth edition of “This Month in the Ecosystem,” a digest of highlights from February 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

February being a short month, the list is relatively short — but never confuse quantity with quality!

Apache Hadoop 2.3.0 is Released (HDFS Caching FTW!)

Hadoop 2.3.0 includes hundreds of new fixes and features, but none more important than HDFS caching.

The Apache Hadoop community has voted to release Hadoop 2.3.0, which includes (among many other things):

Pro Tips for Pitching an HBaseCon Talk

These suggestions from the Program Committee offer an inside track to getting your talk accepted!

With HBaseCon 2014 (in San Francisco on May 5) Call for Papers closing in just over three weeks (on Feb. 14 — sooner than you think), there’s no better time than “now” to start thinking about your proposal.

It’s a Three-peat! HBaseCon 2014 Call for Papers and Early Bird Registration Now Open

The third-annual HBaseCon is now open for business. Submit your paper or register today for early bird savings!

Seems like only yesterday that droves of Apache HBase developers, committers/contributors, operators, and other enthusiasts converged in San Francisco for HBaseCon 2013 — nearly 800 of them, in fact. 

Older Posts