Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
Wow, a ton of news for such a short month:
Thanks to Matthew Dixon, principal consultant at Quiota LLC and Professor of Analytics at the University of San Francisco, and Mohammad Zubair, Professor of Computer Science at Old Dominion University, for this guest post that demonstrates how to easily deploy exposure calculations on Apache Spark for in-memory analytics on scenario data.
Since the 2007 global financial crisis, financial institutions now more accurately measure the risks of over-the-counter (OTC) products. It is now standard practice for institutions to adjust derivative prices for the risk of the counter-party’s, or one’s own, default by means of credit or debit valuation adjustments (CVA/DVA).
The Kite project recently released a stable 1.0!
Providing Hadoop-as-a-Service to your internal users can be a major operational advantage.
Cloudera Director (free to download and use) is designed for easy, on-demand provisioning of Apache Hadoop clusters in Amazon Web Services (AWS) environments, with support for other cloud environments in the works. It allows for provisioning clusters in accordance with the Cloudera AWS Reference Architecture.
Cloudera recently announced formal support for Apache Kafka. This simple use case illustrates how to make web log analysis, powered in part by Kafka, one of your first steps in a pervasive analytics journey.
If you are not looking at your company’s operational logs, then you are at a competitive disadvantage in your industry. Web server logs, application logs, and system logs are all valuable sources of operational intelligence, uncovering potential revenue opportunities and helping drive down the bottom line. Whether your firm is an advertising agency that analyzes clickstream logs for customer insight, or you are responsible for protecting the firm’s information assets by preventing cyber-security threats, you should strive to get the most value from your data as soon as possible.
A Hive-on-Spark beta is now available via CDH parcel. Give it a try!
The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.
The Cloudera HBase Team are proud to be members of Apache HBase’s model community and are currently AWOL, busy celebrating the release of the milestone Apache HBase 1.0. The following, from release manager Enis Soztutar, was published today in the ASF’s blog.
Cloudera Director 1.1 introduces new features and improvements that provide more options for creating and managing cloud deployments of Apache Hadoop. Here are details about how they work.
Cloudera Director, which was released in October of 2014, delivers production-ready, self-service interaction with Apache Hadoop clusters in cloud environments. You can find background information about Cloudera Director’s purpose and fundamental features in our earlier introductory blog post and technical overview blog post.
With Kafka now formally integrated with, and supported as part of, Cloudera Enterprise, what’s the best way to deploy and configure it?
Earlier today, Cloudera announced that, following an incubation period in Cloudera Labs, Apache Kafka is now fully integrated into Cloudera’s Big Data platform, Cloudera Enterprise (CDH + Cloudera Manager). Our customers have expressed strong interest in Kafka, and some are already running Kafka in production.
Having a good grasp of HDFS recovery processes is important when running or moving toward production-ready Apache Hadoop.
An important design requirement of HDFS is to ensure continuous and correct operations to support production deployments. One particularly complex area is ensuring correctness of writes to HDFS in the presence of network and node failures, where the lease recovery, block recovery, and pipeline recovery processes come into play. Understanding when and why these recovery processes are called, along with what they do, can help users as well as developers understand the machinations of their HDFS cluster.
Cloudera customers can now install, launch, and monitor CDAP directly from Cloudera Manager. This post from Nitin Motgi, Cask CTO, explains how.
Today, Cloudera and Cask are very happy to introduce the integration of Cloudera’s enterprise data hub (EDH) with the Cask Data Application Platform (CDAP). CDAP is an integrated platform for developers and organizations to build, deploy, and manage data applications on Apache Hadoop. This initial integration will enable CDAP to be installed, configured, and managed from within Cloudera Manager, a component of Cloudera Enterprise. Furthermore, it will simplify data ingestion for a variety of data sources, as well as enable interactive queries via Impala. Starting today, you can download and install CDAP directly from Cloudera’s downloads page.
An improved upgrade wizard in Cloudera Manager 5.3 makes it easy to upgrade CDH on your clusters.
Upgrades can be hard, and any downtime to mission-critical workloads can have a direct impact on revenue. Upgrading the software that powers these workloads can often be an overwhelming and uncertain task that can create unpredictable issues. Apache Hadoop can be especially complex as it consists of dozens of components running across multiple machines. That’s why an enterprise-grade administration tool is necessary for running Hadoop in production, and is especially important when taking the upgrade plunge.
Thanks to Călin-Andrei Burloiu, Big Data Engineer at antivirus company Avira, and Radu Pastia, Senior Software Developer in the Big Data Team at Orange, for the guest post below about the Couchdoop connector for bringing Couchbase data into Hadoop.
Couchdoop is a Couchbase connector for Apache Hadoop, developed by Avira on CDH, that allows for easy, parallel data transfer between Couchbase and Hadoop storage engines. It includes a command-line tool, for simple tasks and prototyping, as well as a MapReduce library, for those who want to use Couchdoop directly in MapReduce jobs. Couchdoop works natively with CDH 5.x.
Couchdoop can help you:
Many thanks to David Whiting of Spotify for allowing us to re-publish the following Spotify Labs post about its Apache Crunch use cases.
(Note: Since this post was originally published in November 2014, many of the library functions described have been added into crunch-core, so they’ll soon be available to all Crunch users by default.)
The Apache Hive PMC has recently voted to release Hive 1.0.0 (formerly known as Hive 0.14.1).
This release is recognition of the work the Apache Hive community has done over the past nine years and is continuing to do. The Apache Hive 1.0.0 release is a codebase that was expected to be released as 0.14.1 but the community felt it was time to move to a 1.x.y release naming structure.
Thanks to Jesus Centeno of Qlik for the post below about using Impala alongside Qlik Sense.
Cloudera and Qlik (which is part of the Impala Accelerator Program) have revolutionized the delivery of insights and value to every business stakeholder for “small data,” to something more powerful in the Big Data world—enabling users to combine Big Data and “small data” to yield actionable business insights.
Xplain.io is now part of Cloudera.
Fifteen months ago, Rituparna Agrawal and I incorporated Xplain.io in a small shed in my backyard. With intense focus on solving real customer problems, we built an eclectic and diverse team with skills across database internals, distributed systems, and customer-centric design.
You may have noticed that this report went on hiatus for December 2014 due to a lack of critical news mass (plus, we realize that most of you are out of the loop until mid-January). It’s back with a vengeance, though:
Thanks to Michael Williams, BIRT Product Evangelist & Forums Manager at analytics software specialist Actuate Corp. (now OpenText), for the guest post below. Actuate is the primary builder and supporter of BIRT, a top-level project of the Eclipse Foundation.
The Actuate (now OpenText) products BIRT Designer Professional and BIRT iHub allow you to connect to multiple data sources to create and deliver meaningful visualizations securely, with scalability reaching millions of users and devices. And now, with Impala emerging as a standard Big Data query engine for many of Actuate’s customers, solid BIRT integration with Impala has become critical.
Strata + Hadoop World San Jose 2015 (Feb. 17-20) is a focal point for learning about production-izing Hadoop.
Strata + Hadoop World sessions have always been indispensable for learning about Hadoop internals, use cases, and admin best practices. When deep learning is needed, however—and deep dives are a necessity if you’re running Hadoop in production, or aspire to—tutorials are your ticket.
Starting in CDH 5.3, Apache Sentry integration with HDFS saves admins a lot of work by centralizing access control permissions across components that utilize HDFS.
It’s been more than a year and a half since a couple of my colleagues here at Cloudera shipped the first version of Sentry (now Apache Sentry (incubating)). This project filled a huge security gap in the Apache Hadoop ecosystem by bringing truly secure and dependable fine grained authorization to the Hadoop ecosystem and provided out-of-the-box integration for Apache Hive. Since then the project has grown significantly–adding support for Impala and Search and the wonderful Hue App to name a few significant additions.
Authored by a substantial portion of Cloudera’s Data Science team (Sean Owen, Sandy Ryza, Uri Laserson, Josh Wills), Advanced Analytics with Spark (currently in Early Release from O’Reilly Media) is the newest addition to the pipeline of ecosystem books by Cloudera engineers. I talked to the authors recently.
Why did you decide to write this book?
Cloudera and Google are collaborating to bring Google Cloud Dataflow to Apache Spark users (and vice-versa). This new project is now incubating in Cloudera Labs!
“The future is already here—it’s just not evenly distributed.” —William Gibson
Learn how to set up a Hadoop cluster in a way that maximizes successful production-ization of Hadoop and minimizes ongoing, long-term adjustments.
Previously, we published some recommendations on selecting new hardware for Apache Hadoop deployments. That post covered some important ideas regarding cluster planning and deployment such as workload profiling and general recommendations for CPU, disk, and memory allocations. In this post, we’ll provide some best practices and guidelines for the next part of the implementation process: configuring the machines once they arrive. Between the two posts, you’ll have a great head start toward production-izing Hadoop.
Cloudera and Intel engineers are collaborating to make Spark’s shuffle process more scalable and reliable. Here are the details about the approach’s design.
What separates computation engines like MapReduce and Apache Spark (the next-generation data processing engine for Apache Hadoop) from embarrassingly parallel systems is their support for “all-to-all” operations. As distributed engines, MapReduce and Spark operate on sub-slices of a dataset partitioned across the cluster. Many operations process single data-points at a time and can be carried out fully within each partition. All-to-all operations must consider the dataset as a whole; the contents of each output record can depend on records that come from many different partitions. In Spark,
reduceByKey are popular examples of these types of operations.
Our thanks to Montrial Harrell, Enterprise Architect for the State of Indiana, for the guest post below.
Recently, the State of Indiana has begun to focus on how enterprise data management can help our state’s government operate more efficiently and improve the lives of our residents. With that goal in mind, I began this journey just like everyone else I know: with an interest in learning more about Apache Hadoop.
Find Cloudera tech talks in Austin, London, Washington DC, Zurich, and other cities through March 2015.
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees—this time, through the first quarter of calendar year 2015. Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).
A new Spark tutorial and Trifacta deployment option make Cloudera Live even more useful for getting started with Apache Hadoop.
When it comes to learning Hadoop and CDH (Cloudera’s open source platform including Hadoop), there is no better place to start than Cloudera Live (cloudera.com/live). With a quick, one-button deployment option, Cloudera Live launches a four-node Cloudera cluster that you can learn and experiment in free for two-weeks. To help plan and extend the capabilities of your cluster, we also offer various partner deployments. Building on the addition of interactive tutorials and Tableau and Zoomdata integration, we have added a new tutorial on Apache Spark and a new Trifacta partner deployment.
Support for transparent, end-to-end encryption in HDFS is now available and production-ready (and shipping inside CDH 5.3 and later). Here’s how it works.
Apache Hadoop 2.6 adds support for transparent encryption to HDFS. Once configured, data read from and written to specified HDFS directories will be transparently encrypted and decrypted, without requiring any changes to user application code. This encryption is also end-to-end, meaning that data can only be encrypted and decrypted by the client. HDFS itself never handles unencrypted data or data encryption keys. All these characteristics improve security, and HDFS encryption can be an important part of an organization-wide data protection story.
Thanks to Ben Harden of CapTech for allowing us to re-publish the post below.
Getting delimited flat file data ingested into Apache Hadoop and ready for use is a tedious task, especially when you want to take advantage of file compression, partitioning and performance gains you get from using the Avro and Parquet file formats.
We’re pleased to announce the release of Cloudera Enterprise 5.3 (comprising CDH 5.3, Cloudera Manager 5.3, and Cloudera Navigator 2.2).
This release continues the drumbeat for security functionality in particular, with HDFS encryption (jointly developed with Intel under Project Rhino) now recommended for production use. This feature alone should justify upgrades for security-minded users (and an improved CDH upgrade wizard makes that process easier).
HBaseCon 2015 is ON, people! Book Thursday, May 7, in your calendars.
If you’re a developer in Silicon Valley, you probably already know that since its debut in 2012, HBaseCon has been one of the best developer community conferences out there. If you’re not, this is a great opportunity to learn that for yourself: HBaseCon 2015 will occur on Thurs., May 7, 2015, at the Westin St. Francis on Union Square in San Francisco.
As we progressively move from MapReduce to Spark, we shouldn’t have to give up good HBase integration. Hence the newest Cloudera Labs project, SparkOnHBase!
Apache Spark is making a huge impact across our industry, changing the way we think about batch processing and stream processing. However, as we progressively migrate from MapReduce toward Spark, we shouldn’t have to “give up” anything. One of those capabilities we need to retain is the ability to interact with Apache HBase.
Our “Top 10″ list of blog posts published during a calendar year is a crowd favorite (see the 2013 version here), in particular because it serves as informal, crowdsourced research about popular interests. Page views don’t lie (although skew for publishing date—clearly, posts that publish earlier in the year have pole position—has to be taken into account).
In 2014, a strong interest in various new components that bring real time or near-real time capabilities to the Apache Hadoop ecosystem is apparent. And we’re particularly proud that the most popular post was authored by a non-employee.
- How-to: Create a Simple Hadoop Cluster with VirtualBox
by Christian Javet
Explains how t set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager.
- Why Apache Spark is a Crossover Hit for Data Scientists
by Sean Owen
An explanation of why Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.
- How-to: Run a Simple Spark App in CDH 5
by Sandy Ryza
Helps you get started with Spark using a simple example.
- New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead
by Justin Erickson, Marcel Kornacker & Dileep Kumar
Open benchmark testing of Impala 1.3 demonstrates performance leadership compared to alternatives (by 950% or more), while providing greater query throughput and with a far smaller CPU footprint.
- Apache Kafka for Beginners
by Gwen Shapira & Jeff Holoman
When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.
- Apache Hadoop YARN: Avoiding 6 Time-Consuming “Gotchas”
by Jeff Bean
Understanding some key differences between MR1 and MR2/YARN will make your migration much easier.
- Impala Performance Update: Now Reaching DBMS-Class Speed
by Justin Erickson, Greg Rahn, Marcel Kornacker & Yanpei Chen
As of release 1.1.1, Impala’s speed beat the fastest SQL-on-Hadoop alternatives–including a popular analytic DBMS running on its own proprietary data store.
- The Truth About MapReduce Performance on SSDs
by Karthik Kambatla & Yanpei Chen
It turns out that cost-per-performance, not cost-per-capacity, is the better metric for evaluating the true value of SSDs. (See the session on this topic at Strata+Hadoop World San Jose in Feb. 2015!)
- How-to: Translate from MapReduce to Spark
by Sean Owen
The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
- How-to: Write and Run Apache Giraph Jobs on Hadoop
by Mirko Kämpf
Explains how to create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.
Interested in Hive-on-Spark progress? This new AMI gives you a hands-on experience.
Nearly one year ago, the Apache Hadoop community began to embrace Apache Spark as a powerful batch-processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this shift, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292 which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.
Benchmarking Big Data systems is nontrivial. Avoid these traps!
Here at Cloudera, we know how hard it is to get reliable performance benchmarking results. Benchmarking matters because one of the defining characteristics of Big Data systems is the ability to process large datasets faster. “How large” and “how fast” drive technology choices, purchasing decisions, and cluster operations. Even with the best intentions, performance benchmarking is fraught with pitfalls—easy to get numbers, hard to tell if they are sound.
Bookmark this new living document to ensure use of current and proper configuration, sizing, management, and measurement practices.
Impala, the open source MPP analytic database for Apache Hadoop, is now firmly entrenched in the Big Data mainstream. How do we know this? For one, Impala is now the standard against which alternatives measure themselves, based on a proliferation of new benchmark testing. Furthermore, Impala has been adopted by multiple vendors as their solution for letting customers do exploratory analysis on Big Data, natively and in place (without the need for redundant architecture or ETL). Also significant, we’re seeing the emergence of best practices and patterns out of customer experiences.
Community contributions to Parquet are increasing in parallel with its adoption. Here are some of the highlights.
Apache Parquet (incubating), the open source, general-purpose columnar storage format for Apache Hadoop, was co-founded only 18 months ago by Cloudera and Twitter. Since that time, its rapid adoption by multiple platform vendors and communities has made it a de facto standard for this purpose.
These new Apache HBase features in CDH 5.2 make multi-tenant environments easier to manage.
Historically, Apache HBase treats all tables, users, and workloads with equal weight. This approach is sufficient for a single workload, but when multiple users and multiple workloads were applied on the same cluster or table, conflicts can arise. Fortunately, starting with HBase in CDH 5.2 (HBase 0.98 + backports), workloads and users can now be prioritized.
November was busy, even accounting for the US Thanksgiving holiday:
A significant vulnerability affecting the entire Apache Hadoop ecosystem has now been patched. What was involved?
By now, you may have heard about the POODLE (Padding Oracle On Downgraded Legacy Encryption) attack on TLS (Transport Layer Security). This attack combines a cryptographic flaw in the obsolete SSLv3 protocol with the ability of an attacker to downgrade TLS connections to use that protocol. The result is that an active attacker on the same network as the victim can potentially decrypt parts of an otherwise encrypted channel. The only immediately workable fix has been to disable the SSLv3 protocol entirely.
This guest post from Intel Java performance architect Eric Kaczmarek (originally published here) explores how to tune Java garbage collection (GC) for Apache HBase focusing on 100% YCSB reads.
Apache HBase is an Apache open source project offering NoSQL data storage. Often used together with HDFS, HBase is widely used across the world. Well-known users include Facebook, Twitter, Yahoo, and more. From the developer’s perspective, HBase is a “distributed, versioned, non-relational database modeled after Google’s Bigtable, a distributed storage system for structured data”. HBase can easily handle very high throughput by either scaling up (i.e., deployment on a larger server) or scaling out (i.e., deployment on more servers).
The Apache Hadoop community has voted to release Hadoop 2.6. Congrats to all contributors!
This new release contains a variety of improvements, particularly in the storage layer and in YARN. We’re particularly excited about the encryption-at-rest feature in HDFS!
Learn about BigBench, the new industrywide effort to create a sorely needed Big Data benchmark.
Benchmarking Big Data systems is an open problem. To address this concern, numerous hardware and software vendors are working together to create a comprehensive end-to-end big data benchmark suite called BigBench. BigBench builds upon and borrows elements from existing benchmarking efforts in the Big Data space (such as YCSB, TPC-xHS, GridMix, PigMix, HiBench, Big Data Benchmark, and TPC-DS). Intel and Cloudera, along with other industry partners, are working to define and implement extensions to BigBench 1.0. (A TPC proposal for BigBench 2.0 is in the works.)
The community effort to make Apache Spark an execution engine for Apache Hive is making solid progress.
Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.
Installing CDH on newer unsupported operating systems (such as Ubuntu 13.04 and later) can lead to conflicts. These guidelines will help you avoid them.
Some of the more recently released operating systems that bundle portions of the Apache Hadoop stack in their respective distro repositories can conflict with software from Cloudera repositories. Consequently, when you set up CDH for installation on such an OS, you may end up picking up packages with the same name from the OS’s distribution instead of Cloudera’s distribution. Package installation may succeed, but using the installed packages may lead to unforeseen errors.
Thanks to Guy Harrison of Dell Inc. for the guest post below about time-tested performance optimizations for connecting Oracle Database with Apache Hadoop that are now available in Apache Sqoop 1.4.5 and later.
Back in 2009, I attended a presentation by a Cloudera employee named Aaron Kimball at the MySQL User Conference in which he unveiled a new tool for moving data from relational databases into Hadoop. This tool was to become, of course, the now very widely known and beloved Sqoop!
Cloudera’s culture is premised on innovation and teamwork, and there’s no better example of them in action than our internal hackathon.
Cloudera Engineering doubled-down on its “hackathon” tradition last week, with this year’s edition taking an around-the-clock approach thanks to the HQ building upgrade since the 2013 edition (just look at all that space!).
Our thanks to Micah Whitacre, a senior software architect on Cerner Corp.’s Big Data Platforms team, for the post below about Cerner’s use case for CDH + Apache Kafka. (Kafka integration with CDH is currently incubating in Cloudera Labs.)
Over the years, Cerner Corp., a leading Healthcare IT provider, has utilized several of the core technologies available in CDH, Cloudera’s software platform containing Apache Hadoop and related projects—including HDFS, Apache HBase, Apache Crunch, Apache Hive, and Apache Oozie. Building upon those technologies, we have been able to architect solutions to handle our diverse ingestion and processing requirements.
Find Cloudera tech talks in Seattle, Las Vegas, London, Madrid, Budapest, Barcelona, Washington DC, Toronto, and other cities through the end of 2014.
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees—this time, for the remaining dates of 2014. Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).