We announced a leadership change at Cloudera today. Tom Reilly, formerly CEO at ArcSight, is joining us in my old role – CEO – and I am assuming two new posts: Chief Strategy Officer and Chairman of the Board of Directors.
When we started the company five years ago, almost no one had heard of Apache Hadoop. Big Data, to the extent the term was used at all, was strictly a consumer internet phenomenon. No other enterprise vendor believed the platform mattered.
We did, of course, and we set out to make that true. We’ve engaged closely with the open source community, worked hard to advance the state of the art in the platform and crafted a business strategy that allows us to grow quickly and to build a great company for the long term.
Data analysts and business intelligence specialists have been at the heart of new trends driving business growth over the past decade, including log file and social media analytics. However, Big Data heretofore has been beyond the reach of analysts because traditional tools like relational databases don’t scale, and scalable systems like Apache Hadoop have historically required Java expertise.
Today, the rise of new ecosystem tools is rapidly broadening the community using Hadoop and Big Data. Projects like Cloudera Impala, Apache Hive, and Apache Pig have for the first time made Big Data accessible to those with traditional analytics backgrounds. With the launch of Data Analyst Training, Cloudera is helping the world’s analysts prove there’s nothing traditional about data analytics and BI on Hadoop.
The Democratization of Big Data
Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly everything it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule non-MapReduce jobs, schedule based on fine-grained memory instead of slots, and support hierarchical queues. In this post, you’ll learn what the Fair Scheduler’s role is and how it fulfills it, what it means to be a YARN “scheduler,” and how to get the new features running on your cluster.
YARN/MR2 vs. MR1
YARN uses updated terminology to reflect that it no longer just manages resources for MapReduce. From YARN’s perspective, a MapReduce job is an application. YARN schedules containers for map and reduce tasks to live in. What were referred to as pools in the MR1 Fair Scheduler are now called queues, for consistency with the Capacity Scheduler. An excellent and deeper explanation is available here.
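To make the hierarchical-queue and memory-based configuration concrete, here is a sketch of what a Fair Scheduler allocation file might look like. The queue names, weights, and sizes below are made up for illustration, and the exact allocation-file syntax varies across Hadoop versions, so check the documentation for your release:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- A top-level queue with nested children: hierarchical queues
       are one of the new capabilities of the YARN Fair Scheduler. -->
  <queue name="production">
    <!-- Resources are expressed as memory, not MR1-style slots. -->
    <minResources>20000 mb</minResources>
    <weight>2.0</weight>
    <queue name="etl">
      <weight>3.0</weight>
    </queue>
    <queue name="reporting">
      <weight>1.0</weight>
    </queue>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
  </queue>
</allocations>
```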
How Does it Work?
How a Hadoop scheduler functions can often be confusing, so we’ll start with a short overview of what the Fair Scheduler does and how it works.
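As a starting intuition, the core of fair scheduling is splitting the cluster among queues in proportion to their weights. The sketch below is a deliberately simplified model (not the scheduler’s actual code): the real Fair Scheduler also caps each queue at its demand and honors configured minimum and maximum resources.

```python
def fair_shares(total_mb, queue_weights):
    """Split total cluster memory among queues in proportion to weight.

    Simplified model of weighted fair sharing: each queue's share is
    total_mb * (its weight / sum of all weights). The real scheduler
    additionally respects min/max resources and actual queue demand.
    """
    total_weight = sum(queue_weights.values())
    return {name: total_mb * weight / total_weight
            for name, weight in queue_weights.items()}

# A queue with twice the weight receives twice the memory share.
shares = fair_shares(120000, {"production": 2.0, "adhoc": 1.0})
```

In this toy run, `production` would be assigned 80,000 MB and `adhoc` 40,000 MB of a 120,000 MB cluster.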
For those of you who missed the show, session video and presentation slides (as well as photos) will be available via hbasecon.com in a few weeks. (To be notified, follow @cloudera or @ClouderaEng.) Although it’s not quite as good as being there with the rest of the community, you’ll still be able to learn from the real-world experiences of Apache HBase users like Facebook, Box, Yahoo!, Salesforce.com, Pinterest, Twitter, Groupon, and more.
While you’re waiting for that, allow me to bring you just this single photo to capture the HBaseCon experience:
This is the week of Apache HBase, with HBaseCon 2013 taking place Thursday, followed by WibiData’s KijiCon on Friday. In the many conversations I’ve had with Cloudera customers over the past 18 months, I’ve noticed a trend: Those that run HBase stand out. They tend to represent a group of very sophisticated Hadoop users that are accomplishing impressive things with Big Data. They deploy HBase because they require random, real-time read/write access to the data in Hadoop. Hadoop is a core component of their data management infrastructures, and these users rely on the latest and greatest components of the Hadoop stack to satisfy their mission-critical data needs.
Today I’d like to shine a spotlight on one innovative company that is putting top engineering talent (and HBase) to work, helping to save the planet — literally.
That company is Opower. Opower partners with 80+ utility providers to offer an integrated customer engagement platform using a software-as-a-service (SaaS) model. Its goal: to help people save energy and reduce utility bills by applying intensive Big Data analytics to deliver informative dashboards, alerts, incentives, similar-household comparisons, and other communications to customers across communication channels and via in-home devices. Opower combines utility data — such as that from smart meters — with weather information, geographic details, demographic data, and more, over decades of history, so it can offer valuable insights. (Hint: this is where the value of Hadoop and HBase comes in.)
HBaseCon 2013 is this Thursday (June 13 in San Francisco), and we can hardly wait!
To complete the “preview” cycle, today we bring you a snapshot of the Case Studies track, which offers a cross-section of the many real-world use cases for Apache HBase. You will learn how a range of companies across diverse industries use it at the heart of their IT infrastructure to run their business.
Michael Stack is the chair of the Apache HBase PMC and has been a committer and project “caretaker” since 2007. Stack is a Software Engineer at Cloudera.
Apache Hadoop and HBase have quickly become industry standards for storage and analysis of Big Data in the enterprise, yet as adoption spreads, new challenges and opportunities have emerged. Today, there is a large gap — a chasm, a gorge — between the clean application model your Big Data application builders design and the raw, byte-based APIs provided by HBase and Hadoop. Many Big Data players have invested a lot of time and energy in bridging this gap. Cloudera, where I work, is developing the Cloudera Development Kit (CDK). Kiji, an open source framework for building Big Data Applications, is another such thriving option. A lot of thought has gone into its design. More importantly, long experience building Big Data Applications on top of Hadoop and HBase has been baked into how it all works.
Kiji provides a model and a set of libraries that allow developers to get up and running quickly. Intuitive Java APIs and Kiji’s rich data model allow developers to build business logic and machine learning algorithms without having to worry about bytes, serialization, schema evolution, and lower-level aspects of the system. The Kiji framework is modularized into separate components to support a wide range of usage and encourage clean separation of functionality. Kiji’s main components include KijiSchema, KijiMR, KijiHive, KijiExpress, KijiREST, and KijiScoring. KijiSchema, for example, helps team members collaborate on long-lived Big Data management projects, does away with common incompatibility issues, and helps developers build more integrated systems across the board. All of these components are available in a single download called a BentoBox.
Cloudera Impala has many exciting features, but one of the most impressive is the ability to analyze data in multiple formats, with no ETL needed, in HDFS and Apache HBase. Furthermore, you can use multiple frameworks, such as MapReduce and Impala, to analyze that same data. Consequently, Impala will often run side-by-side with MapReduce on the same physical hardware, with both supporting business-critical workloads. For such multi-tenant clusters, Impala and MapReduce both need to perform well despite potentially conflicting demands for cluster resources.
In this post, we’ll share our experiences configuring Impala and MapReduce for optimal multi-tenant performance. Our goal is to help users understand how to tune their multi-tenant clusters to meet production service level objectives (SLOs), and to contribute to the community some test methods and performance models that can be helpful beyond Cloudera.
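As one example of the kind of simple performance model we mean, a back-of-envelope sketch (with made-up numbers, not measured results) can estimate whether a job still meets its SLO when it receives only a fraction of a shared cluster:

```python
def meets_slo(runtime_alone_s, resource_fraction, slo_s):
    """Crude multi-tenancy model: assume runtime scales inversely with
    the fraction of cluster resources a workload receives.

    Real contention (disk, network, memory pressure) makes this
    estimate optimistic, so treat it as a best case, not a guarantee.
    """
    estimated_runtime_s = runtime_alone_s / resource_fraction
    return estimated_runtime_s <= slo_s

# A MapReduce job that takes 600s with the whole cluster is estimated
# at 1200s when it gets 50% of resources alongside Impala.
ok = meets_slo(600, 0.5, slo_s=1500)
```

Models like this are only a starting point; the testing described below is what tells you how far reality diverges from the ideal.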
Defining Realistic Test Scenarios
Cloudera’s broad and diverse customer base makes testing against real-world scenarios a top concern. Realistic tests based on common use cases offer meaningful guidance, whereas guidance based on contrived, unrealistic testing often fails to translate to real-life deployments.
Earlier this week, we hosted The Cloudera Forum to reveal Cloudera’s “Unaccept the Status Quo” vision and to announce the public beta launch of Cloudera Search. The event featured a panel discussion between representatives from four companies that are embracing the latest big data innovations, moderated by our own CEO Mike Olson. Those are the companies I’d like to highlight in this week’s spotlight, for obvious reasons. The panelists were… (drumroll, please):
What do you do at Cloudera (and in which Apache project(s) are you involved)?
I’m a software engineer on the Search team. I’ve been involved in the Apache Lucene community since 2006 and Apache Solr since around 2009. I spend a lot of time adding features to Solr and fixing bugs, as well as working on improving Solr integration with the rest of the Hadoop ecosystem. I kind of think of myself as a “distributed search guy” at the moment.