Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
YCSB, the open standard for comparative performance evaluation of data stores, is now available to CDH users for their Apache HBase deployments via new packages from Cloudera Labs.
Many factors go into deciding which data store should be used for production applications, including basic features, data model, and the performance characteristics for a given type of workload. It’s critical to have the ability to compare multiple data stores intelligently and objectively so that you can make sound architectural decisions.
Learn about the new functionality coming aboard Cloudera Navigator, the trail-blazing solution for metadata management and lineage in Apache Hadoop.
More than two years ago, Cloudera introduced Cloudera Navigator 1.0, which was the first offering to unify auditing across enterprise Apache Hadoop deployments. About a year later, Cloudera released Cloudera Navigator 2.0, which introduced another first for Hadoop: comprehensive metadata management and lineage to Hadoop. Today, more than 200 customers across numerous industries use Cloudera Navigator in production to deliver trust and visibility to their Hadoop deployments.
Strata + Hadoop World 2015 NYC is more than a daytime conference; it’s also a nighttime meetup experience. (Plus, there are a bunch of book signings.)
It won’t be long before we’re all in NYC for Strata + Hadoop World (Sept. 29-Oct. 1; if you haven’t registered yet, a 20% discount is still available). So, consider for your evening agenda:
Thanks to Jeff Palmucci, Director of Machine Learning at TripAdvisor, for permission to republish the following (originally appeared in TripAdvisor’s Engineering/Operations blog).
Here at TripAdvisor we have a lot of reviews, several hundred million according to the last announcement. I work with machine learning, and one thing we love in machine learning is putting lots of data to use.
As is their custom, Cloudera Engineering’s interns made innovation, especially for Apache Spark, the theme of the Summer season.
Cloudera has a long-time tradition of searching far and wide for the smartest summer engineering interns that it can find. Alumni of the program have become start-up co-founders, faculty at top-tier CS departments, employees at other prominent technology companies (including Google, Databricks, Uber, LinkedIn), as well as many current employees at Cloudera. See some examples here.
Big Industries, Cloudera systems integration and reseller partner for Belgium and Luxembourg, has developed an integration of Apache Mesos and CDH that can be deployed and managed through Cloudera Manager. In this post, Big Industries’ Rob Gibbon explains the benefits of deploying Mesos on your cluster and walks you through the process of setting it up.
[Editor's Note: Mesos integration is not currently supported by Cloudera, thus the setup described below is not recommended for production use.]
Cloudera Director 1.5 introduces a new plugin architecture to enable support for additional cloud providers. If you want to implement a plugin to add integration with a cloud provider that is not supported out-of-the-box, or to extend one of the existing plugins, these details will get you started.
As discussed in our previous blog post, the Cloudera Director Service Provider Interface (Cloudera Director SPI) defines a Java interface and packaging standards for Cloudera Director plugins. Let’s take a look at what it takes to implement a plugin.
Before You Begin
The SparkOnHBase project in Cloudera Labs was recently merged into the Apache HBase trunk. In this post, learn the project’s history and what the future looks like for the new HBase-Spark module.
SparkOnHBase was first pushed to Github on July 2014, just six months after Spark Summit 2013 and five months after Apache Spark first shipped in CDH. That conference was a big turning point for me, because for the first time I realized that the MapReduce engine had a very strong competitor. Spark was about to enter an exciting new phase in its open source life cycle, and just one year later, it’s used at massive scale at 100s if not 1000s of companies (with 200+ of them doing so on Cloudera’s platform).
Cloudera Director 1.5 is now available; this post describes what’s inside, including a new open source plugin interface.
Cloudera Director is the manifestation of Cloudera’s commitment to providing a simple and reliable way to deploy, scale, and manage Apache Hadoop in the cloud of your choice. With Cloudera Director 1.5, we continue the story of enabling production-ready clusters and big data applications by focusing on the following themes.
Thanks to Barclays employees Sam Savage, VP Data Science, and Harry Powell, Head of Advanced Analytics, for the guest post below about the Barclays use case for Apache Spark and its Scala API.
At Barclays, our team recently built an application called Insights Engine to execute an arbitrary number N of near-arbitrary SQL-like queries and execute them in a way that can scale with increasing N. The queries were non-trivial, each constituting 200-300 lines of SQL, and they were running over a large dataset for hours as Apache Hive scripts. Yet, we need to execute 50 queries in less than an hour for our use case.
Learn how Cloudera Navigator Encrypt bring data security to YARN containers.
With the introduction of transparent data encryption in HDFS, we are now a big step closer toward a secure platform in the Apache Hadoop world. However, there are still gaps in the platform, including how YARN and its applications manage their cache. In this post, I’ll explain how Cloudera Navigator Encrypt fills that particular gap.
Learn about the near real-time data ingest architecture for transforming and enriching data streams using Apache Flume, Apache Kafka, and RocksDB at Santander UK.
Cloudera Professional Services has been working with Santander UK to build a near real-time (NRT) transactional analytics system on Apache Hadoop. The objective is to capture, transform, enrich, count, and store a transaction within a few seconds of a card purchase taking place. The system receives the bank’s retail customer card transactions and calculates the associated trend information aggregated by account holder and over a number of dimensions and taxonomies. This information is then served securely to Santander’s “Spendlytics” app (see below) to enable customers to analyze their latest spending patterns.
To design effective fraud-detection architecture, look no further than the human brain (with some help from Spark Streaming and Apache Kafka).
At its core, fraud detection is about detection whether people are behaving “as they should,” otherwise known as catching anomalies in a stream of events. This goal is reflected in diverse applications such as detecting credit-card fraud, flagging patients who are doctor shopping to obtain a supply of prescription drugs, or identifying bullies in online gaming communities.
Our thanks to Karthik Vadla and Abhi Basu, Big Data Solutions engineers at Intel, for permission to re-publish the following (which was originally available here).
Data science is not a new discipline. However, with the growth of big data and adoption of big data technologies, the request for better quality data has grown exponentially. Today data science is applied to every facet of life—product validation through fault prediction, genome sequence analysis, personalized medicine through population studies and Patient 360 view, credit card fraud-detection, improvement in customer experience through sentiment analysis and purchase patterns, weather forecast, detecting cyber or terrorist attacks, aircraft maintenance utilizing predictive analytics to repair critical parts before they fail, and many more. Every day, data scientists are detecting patterns in data and providing actionable insights to influence organizational changes.
Thrift client authentication and
doAs impersonation, introduced in HBase 1.0, provides more flexibility for your HBase installation.
In the two-part blog series “How-to: Use the HBase Thrift Interface” (Part 1 and Part 2), Jesse Anderson explained the Thrift interface in detail, and demonstrated how to use it. He didn’t cover running Thrift in a secure Apache HBase cluster, however, because there was no difference in the client configuration with the HBase releases available at that time.
You now have more deployment options for getting hands-on with Apache Hadoop.
Launched in September 2014, Cloudera Live has become a popular choice for getting hands-on with Apache Hadoop via the cloud and Cloudera Enterprise, the world’s most deployed commercial Hadoop-based platform (CDH + Cloudera Manager, Navigator, and Director). The popularity is credited to its ease of spin up and use: With step-by-step, ramp-up tutorials, Cloudera Live helps users get up and running in just a few hours.
Learn about the architecture of Ibis, the roadmaps for Ibis and Impala, and how to get started and contribute.
We created Ibis, a new Python data analysis framework now incubating in Cloudera Labs, with the goal of enabling data scientists and data engineers to be as productive working with big data as they are working with small and medium data today. In doing so, we will enable Python to become a true first-class language for Apache Hadoop, without compromises in functionality, usability, or performance. Having spent much of the last decade improving the usability of the single-node Python experience (with pandas and other projects), we are looking to achieve:
Wrangle, a new conference dedicated to the practice of data science from startup to enterprise, debuts in San Francisco on Oct. 22, 2015.
Even as Cloudera introduce new tools for analytics and machine learning into its platform (like the recently announced Ibis project, for example), we are mindful of the fact that many of the hardest problems in data science cannot be solved by technology alone. From the smallest startups to the largest enterprises, we see companies struggling with how to acquire and manage new data sources, recruit and train the next generation of data scientists, and create a data-driven culture that crosses every level of the organization.
This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale.
Across the user community, you will find general agreement that the Apache Hadoop stack has progressed dramatically in just the past few years. For example, Search and Impala have moved Hadoop beyond batch processing, while developers are seeing significant productivity gains and additional use cases by transitioning from MapReduce to Apache Spark.
Thanks to Wuheng Luo, a Hadoop and big data architect at Sears Holdings, for the guest post below about Pig job-level performance tuning
Many factors can affect Apache Pig job performance in Apache Hadoop, including hardware, network I/O, cluster settings, code logic, and algorithm. Although the sysadmin team is responsible for monitoring many of these factors, there are other issues that MapReduce job owners or data application developers can help diagnose, tune, and improve. One such example is a disproportionate Map-to-Reduce ratio—that is, using too many reducers or mappers in a Pig job.
Strata + Hadoop World New York 2015 needs your developer demos! The proposal period closes on Aug. 14.
As everyone knows, Apache Hadoop’s overwhelming success is partly premised on de-centralized innovation from all corners of the community—users, vendors, and academia—with everyone participating on a level playing field. And since 2011, Strata + Hadoop World has been a community and content hub of that ecosystem.
This year will close out with new features for reliability, usability, and nested types, and in 2016, performance-related enhancements promise >20x gains.
It’s been roughly a year since we provided an update about the Impala roadmap. During that time, a number of milestones have been reached:
Apache Spark’s ability to support data quality checks via DataFrames is progressing rapidly. This post explains the state of the art and future possibilities.
Apache Hadoop and Apache Spark make Big Data accessible and usable so we can easily find value, but that data has to be correct, first. This post will focus on this problem and how to solve it with Apache Spark 1.3 and Apache Spark 1.4 using DataFrames. (Note: although relatively new to Spark and thus not yet supported by Cloudera at the time of this writing, DataFrames are highly worthy of exploration and experimentation. Learn more about Cloudera’s support for Apache Spark here.)
The Strata + Hadoop World NYC 2015 (Sept. 29-Oct. 3) agenda was published in the last few days. Congratulations to all accepted presenters!
In this post, I just want to provide a concise digest of the tutorials and sessions that will involve Cloudera or Intel engineers and/or interesting use cases. There are many worthy sessions from which to choose, so we hope this list will influence your decisions about where to spend your time during the week! (Note that evening meetups are a work in progress; more on those later.)
This post contains answers to common questions about deploying and configuring Apache Kafka as part of a Cloudera-powered enterprise data hub.
Cloudera added support for Apache Kafka, the open standard for streaming data, in February 2015 after its brief incubation period in Cloudera Labs. Apache Kafka now is an integrated part of CDH, manageable via Cloudera Manager, and we are witnessing rapid adoption of Kafka across our customer base.
Thanks to Pengyu Wang, software developer at FINRA, for permission to republish this post.
Salted Apache HBase tables with pre-split is a proven effective HBase solution to provide uniform workload distribution across RegionServers and prevent hot spots during bulk writes. In this design, a row key is made with a logical key plus salt at the beginning. One way of generating salt is by calculating n (number of regions) modulo on the hash code of the logical row key (date, etc).
Salting Row Keys
Cloudera Navigator Encrypt is a key security feature in production-deployed enterprise data hubs. This post explains how it works.
Cloudera Navigator Encrypt, which is integrated with Cloudera Navigator (the native, end-to-end governance solution for Apache Hadoop-based systems), provides massively scalable, high-performance encryption for critical Hadoop data. It utilizes industry-standard AES-256 encryption and provides a transparent layer between the application and filesystem. Navigator Encrypt also includes process-based access controls, allowing authorized Hadoop processes to access encrypted data while simultaneously preventing admins or super-users like root from accessing data that they don’t need to see.
Learn about the design decisions behind HBase’s new support for MOBs.
Apache HBase is a distributed, scalable, performant, consistent key value database that can store a variety of binary data types. It excels at storing many relatively small values (<10K), and providing low-latency reads and writes.
The best data protection strategy is to remove sensitive information from everyplace it’s not needed.
Have you ever wondered what sort of “sensitive” information might wind up in Apache Hadoop log files? For example, if you’re storing credit card numbers inside HDFS, might they ever “leak” into a log file outside of HDFS? What about SQL queries? If you have a query like
select * from table where creditcard = '1234-5678-9012-3456', where is that query information ultimately stored?
Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.
Apache Hive 1.2.0, although not a major release, contains significant improvements.
Recently, the Apache Hive community moved to a more frequent, incremental release schedule. So, a little while ago, we covered the Apache Hive 1.0.0 release and explained how it was renamed from 0.14.1 with only minor feature additions since 0.14.0.
The following post about the new request throttling feature in HBase 1.1 (now shipping in CDH 5.4) originally published in the ASF blog. We re-publish it here for your convenience.
Running multiple workloads on HBase has always been challenging, especially when trying to execute real-time workloads while concurrently running analytical jobs. One possible way to address this issue is to throttle analytical MR jobs so that real-time workloads are less affected.
Your contributions, and a vibrant developer community, are important for Impala’s users. Read below to learn how to get involved.
From the moment that Cloudera announced it at Strata New York in 2012, Impala has been an 100% Apache-licensed open source project. All of Impala’s source code is available on GitHub—where nearly 500 users have forked the project for their own use—and we follow the same model as every other platform project at Cloudera: code changes are committed “upstream” first, and are then selected and backported to our release branches for CDH releases.
The following post from Julien Le Dem, a tech lead at Twitter, originally appeared in the Twitter Engineering Blog. We bring it to you here for your convenience.
ASF, the Apache Software Foundation, recently announced the graduation of Apache Parquet, a columnar storage format for the Apache Hadoop ecosystem. At Twitter, we’re excited to be a founding member of the project.
Learn how to read FIX message files directly with Hive, create a view to simplify user queries, and use a flattened Apache Parquet table to enable fast user queries with Impala.
The Financial Information eXchange (FIX) protocol is used widely by the financial services industry to communicate various trading-related activities. Each FIX message is a record that represents an action by a financial party, such as a new order or an execution report. As the raw point of truth for much of the trading activity of a financial firm, it makes sense that FIX messages are an obvious data source for analytics and reporting in Apache Hadoop.
The recent OpenStack Kilo release adds many features to the Sahara project, which provides a simple means of provisioning an Apache Hadoop (or Spark) cluster on top of OpenStack. This how-to, from Intel Software Engineer Wei Ting Chen, explains how to use the Sahara CDH plugin with this new release.
This how-to assumes that OpenStack is already installed. If not, we recommend using Devstack to build a test OpenStack environment in a short time. (Note: Devstack is not recommended for use in a production environment. For production deployments, refer to the OpenStack Installation Guide.)
The following post, from Cloudera intern Jonathan Lawlor, originally appeared in the Apache Software Foundation’s blog.
Over the past few months there have a been a variety of nice changes made to scanners in Apache HBase. This post focuses on two such changes, namely RPC chunking (HBASE-11544) and scanner heartbeat messages (HBASE-13090). Both of these changes address long standing issues in the client-server scan protocol. Specifically, RPC chunking deals with how a server handles the scanning of very large rows and scanner heartbeat messages allow scan operations to progress even when aggressive server-side filtering makes infrequent result returns.
Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from using Spark.
I started using Apache Spark in late 2014, learning it at the same time as I learned Scala, so I had to wrap my head around the various complexities of a new language as well as a new computational framework. This process was a great in-depth introduction to the world of Big Data (I previously worked as an electrical engineer for Boeing), and I very quickly found myself deep in the guts of Spark. The hands-on experience paid off; I now feel extremely comfortable with Spark as my go-to tool for a wide variety of data analytics tasks, but my journey here was no cakewalk.
This new feature gives Hadoop admins the commonplace ability to replace failed DataNode drives without unscheduled downtime.
Hot swapping—the process of replacing system components without shutting down the system—is a common and important operation in modern, production-ready systems. Because disk failures are common in data centers, the ability to hot-swap hard drives is a supported feature in hardware and server operating systems such as Linux and Windows Server, and sysadmins routinely upgrade servers or replace a faulty components without interrupting business-critical services.
We are happy to announce the inclusion of Apache Phoenix in Cloudera Labs.
This year’s HBaseCon Use Cases track includes war stories about some of the world’s best examples of running Apache HBase in production.
As a final sneak preview leading up to the show next week, in this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Use Cases track.
Installing Cloudera Navigator Encrypt on SUSE is a one-off process, but we have you covered with this how-to.
Cloudera Navigator Encrypt, which is integrated with Cloudera Navigator governance software, provides massively scalable, high-performance encryption for critical Apache Hadoop data. It leverages industry-standard AES-256 encryption and provides a transparent layer between the application and filesystem. Navigator Encrypt also includes process-based access controls, allowing authorized Hadoop processes to access encrypted data, while simultaneously preventing admins or super-users like root from accessing data that they don’t need to see.
The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization.
Apache Spark is rising in popularity as an alternative to MapReduce, in a large part due to its expressive API for complex data processing. A few months ago, my colleague, Sean Owen wrote a post describing how to translate functionality from MapReduce into Spark, and in this post, I’ll extend that conversation to cover additional functionality.
This year’s HBaseCon Ecosystem track covers projects that are complementary to HBase (with a focus on SQL) such as Apache Phoenix, Apache Kylin, and Trafodion.
In this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Ecosystem track.
Cloudera Search combines the speed of Apache Solr with the scalability of CDH. Our newest training course covers this exciting technology in depth, from indexing to user interfaces, and is ideal for developers, analysts, and engineers who want to learn how to effectively search both structured and unstructured data at scale.
Despite being nearly 10 years old, Apache Hadoop already has an interesting history. Some of you may know that it was inspired by the Google File System and MapReduce papers, which detailed how the search giant was able to store and process vast amounts of data. Search was the original Big Data application, and, in fact, Hadoop itself was a spinoff of a project designed to create a reliable, scalable system to index data using one of Doug Cutting’s other creations: Apache Lucene.
We’re pleased to announce the release of Cloudera Enterprise 5.4 (comprising CDH 5.4, Cloudera Manager 5.4, and Cloudera Navigator 2.3).
Cloudera Enterprise 5.4 (Release Notes) reflects critical investments in a production-ready customer experience through governance, security, performance and deployment flexibility in cloud environments. It also includes support for a significant number of updated open standard components–including Apache Spark 1.3, Impala 2.2, and Apache HBase 1.0 (as well as unsupported beta releases of Hive-on-Spark data processing and OpenStack deployments).
Thanks to Torsten Kilias and Alexander Löser of the Beuth University of Applied Sciences in Berlin for the following guest post about their INDREX project and its integration with Impala for integrated management of textual and relational data.
Textual data is a core source of information in the enterprise. Example demands arise from sales departments (monitor and identify leads), human resources (identify professionals with capabilities in ‘xyz’), market research (campaign monitoring from the social web), product development (incorporate feedback from customers), and the medical domain (anamnesis).
This year’s HBaseCon Development & Internals track covers new features in HBase 1.0, what’s to come in 2.0, best practices for tuning, and more.
In this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Development & Internals track.
Thanks to Chen Song, Data Team Lead at AppNexus, for allowing us to republish the following post about his company’s use case for Apache Parquet (incubating at this writing), the open standard for columnar storage across the Apache Hadoop ecosystem.
At AppNexus, over 2MM log events are ingested into our data pipeline every second. Log records are sent from upstream systems in the form of Protobuf messages. Raw logs are compressed in Snappy when stored on HDFS. That said, even with compression, this still leads to over 25TB of log data collected every day. On top of logs, we also have hundreds of MapReduce jobs that process and generate aggregated data. Collectively, we store petabytes of data in our primary Hadoop cluster.
Apache Hadoop ecosystem, time to celebrate! The much-anticipated, significantly updated 4th edition of Tom White’s classic O’Reilly Media book, Hadoop: The Definitive Guide, is now available.
The Hadoop ecosystem has changed a lot since the 3rd edition. How are those changes reflected in the new edition?