Cloudera Engineering Blog · Guest Posts
Thanks to Michal Malohlava, Amy Wang, and Avni Wadhwa of H20.ai for providing the following guest post about building ML apps using Sparkling Water and Apache Spark on CDH.
The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark. This means that the project is heavily dependent on two of the fastest growing machine-learning open source projects out there. With every major release of Spark or H2O there are API changes and, less frequently, major data structure changes that affect Sparkling Water. Throw Cloudera releases into the mix, and you have a plethora of git commits dedicated to maintaining a few simple calls to move data between the different platforms.
Thanks to Holden Karau (@holdenkarau), Software Engineer at Alpine Data Labs (also a Spark contributor and book author), for providing the following post about her work on new base classes for testing Apache Spark programs.
Testing in the world of Apache Spark has often involved a lot of hand-rolled artisanal code, which frankly is a good way to ensure that developers write as few tests as possible. I’ve been doing some work with Spark Testing Base (also available on Spark Packages) to try and make testing Spark jobs as easy as “normal” software (and remove excuses for not writing tests).
Thanks to Jeff Palmucci, Director of Machine Learning at TripAdvisor, for permission to republish the following (originally appeared in TripAdvisor’s Engineering/Operations blog).
Here at TripAdvisor we have a lot of reviews, several hundred million according to the last announcement. I work with machine learning, and one thing we love in machine learning is putting lots of data to use.
Big Industries, Cloudera systems integration and reseller partner for Belgium and Luxembourg, has developed an integration of Apache Mesos and CDH that can be deployed and managed through Cloudera Manager. In this post, Big Industries’ Rob Gibbon explains the benefits of deploying Mesos on your cluster and walks you through the process of setting it up.
[Editor's Note: Mesos integration is not currently supported by Cloudera, thus the setup described below is not recommended for production use.]
Thanks to Barclays employees Sam Savage, VP Data Science, and Harry Powell, Head of Advanced Analytics, for the guest post below about the Barclays use case for Apache Spark and its Scala API.
At Barclays, our team recently built an application called Insights Engine to execute an arbitrary number N of near-arbitrary SQL-like queries and execute them in a way that can scale with increasing N. The queries were non-trivial, each constituting 200-300 lines of SQL, and they were running over a large dataset for hours as Apache Hive scripts. Yet, we need to execute 50 queries in less than an hour for our use case.
Our thanks to Karthik Vadla and Abhi Basu, Big Data Solutions engineers at Intel, for permission to re-publish the following (which was originally available here).
Data science is not a new discipline. However, with the growth of big data and adoption of big data technologies, the request for better quality data has grown exponentially. Today data science is applied to every facet of life—product validation through fault prediction, genome sequence analysis, personalized medicine through population studies and Patient 360 view, credit card fraud-detection, improvement in customer experience through sentiment analysis and purchase patterns, weather forecast, detecting cyber or terrorist attacks, aircraft maintenance utilizing predictive analytics to repair critical parts before they fail, and many more. Every day, data scientists are detecting patterns in data and providing actionable insights to influence organizational changes.
Thanks to Wuheng Luo, a Hadoop and big data architect at Sears Holdings, for the guest post below about Pig job-level performance tuning
Many factors can affect Apache Pig job performance in Apache Hadoop, including hardware, network I/O, cluster settings, code logic, and algorithm. Although the sysadmin team is responsible for monitoring many of these factors, there are other issues that MapReduce job owners or data application developers can help diagnose, tune, and improve. One such example is a disproportionate Map-to-Reduce ratio—that is, using too many reducers or mappers in a Pig job.
Thanks to Pengyu Wang, software developer at FINRA, for permission to republish this post.
Salted Apache HBase tables with pre-split is a proven effective HBase solution to provide uniform workload distribution across RegionServers and prevent hot spots during bulk writes. In this design, a row key is made with a logical key plus salt at the beginning. One way of generating salt is by calculating n (number of regions) modulo on the hash code of the logical row key (date, etc).
Salting Row Keys
The following post from Julien Le Dem, a tech lead at Twitter, originally appeared in the Twitter Engineering Blog. We bring it to you here for your convenience.
ASF, the Apache Software Foundation, recently announced the graduation of Apache Parquet, a columnar storage format for the Apache Hadoop ecosystem. At Twitter, we’re excited to be a founding member of the project.
The recent OpenStack Kilo release adds many features to the Sahara project, which provides a simple means of provisioning an Apache Hadoop (or Spark) cluster on top of OpenStack. This how-to, from Intel Software Engineer Wei Ting Chen, explains how to use the Sahara CDH plugin with this new release.
This how-to assumes that OpenStack is already installed. If not, we recommend using Devstack to build a test OpenStack environment in a short time. (Note: Devstack is not recommended for use in a production environment. For production deployments, refer to the OpenStack Installation Guide.)
The following post, from Cloudera intern Jonathan Lawlor, originally appeared in the Apache Software Foundation’s blog.
Over the past few months there have a been a variety of nice changes made to scanners in Apache HBase. This post focuses on two such changes, namely RPC chunking (HBASE-11544) and scanner heartbeat messages (HBASE-13090). Both of these changes address long standing issues in the client-server scan protocol. Specifically, RPC chunking deals with how a server handles the scanning of very large rows and scanner heartbeat messages allow scan operations to progress even when aggressive server-side filtering makes infrequent result returns.
Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from using Spark.
I started using Apache Spark in late 2014, learning it at the same time as I learned Scala, so I had to wrap my head around the various complexities of a new language as well as a new computational framework. This process was a great in-depth introduction to the world of Big Data (I previously worked as an electrical engineer for Boeing), and I very quickly found myself deep in the guts of Spark. The hands-on experience paid off; I now feel extremely comfortable with Spark as my go-to tool for a wide variety of data analytics tasks, but my journey here was no cakewalk.
Thanks to Torsten Kilias and Alexander Löser of the Beuth University of Applied Sciences in Berlin for the following guest post about their INDREX project and its integration with Impala for integrated management of textual and relational data.
Textual data is a core source of information in the enterprise. Example demands arise from sales departments (monitor and identify leads), human resources (identify professionals with capabilities in ‘xyz’), market research (campaign monitoring from the social web), product development (incorporate feedback from customers), and the medical domain (anamnesis).
Thanks to Chen Song, Data Team Lead at AppNexus, for allowing us to republish the following post about his company’s use case for Apache Parquet (incubating at this writing), the open standard for columnar storage across the Apache Hadoop ecosystem.
At AppNexus, over 2MM log events are ingested into our data pipeline every second. Log records are sent from upstream systems in the form of Protobuf messages. Raw logs are compressed in Snappy when stored on HDFS. That said, even with compression, this still leads to over 25TB of log data collected every day. On top of logs, we also have hundreds of MapReduce jobs that process and generate aggregated data. Collectively, we store petabytes of data in our primary Hadoop cluster.
Thanks to Sam Shuster, Software Engineer at Edmunds.com, for the guest post below about his company’s use case for Spark Streaming, SparkOnHBase, and Morphlines.
Every year, the Super Bowl brings parties, food and hopefully a great game to appease everyone’s football appetites until the fall. With any event that brings in around 114 million viewers with larger numbers each year, Americans have also grown accustomed to commercials with production budgets on par with television shows and with entertainment value that tries to rival even the game itself.
Thanks to Big Data Solutions Architect Matthieu Lieber for allowing us to republish the post below.
A customer of mine wants to take advantage of both worlds: work with his existing Apache Avro data, with all of the advantages that it confers, but take advantage of the predicate push-down features that Parquet provides. How to reconcile the two?
Thanks to Cody Koeninger, Senior Software Engineer at Kixer, for the guest post below about Apache Kafka integration points in Apache Spark 1.3. Spark 1.3 will ship in CDH 5.4.
The new release of Apache Spark, 1.3, includes new experimental RDD and DStream implementations for reading data from Apache Kafka. As the primary author of those features, I’d like to explain their implementation and usage. You may be interested if you would benefit from:
Thanks to Matthew Dixon, principal consultant at Quiota LLC and Professor of Analytics at the University of San Francisco, and Mohammad Zubair, Professor of Computer Science at Old Dominion University, for this guest post that demonstrates how to easily deploy exposure calculations on Apache Spark for in-memory analytics on scenario data.
Since the 2007 global financial crisis, financial institutions now more accurately measure the risks of over-the-counter (OTC) products. It is now standard practice for institutions to adjust derivative prices for the risk of the counter-party’s, or one’s own, default by means of credit or debit valuation adjustments (CVA/DVA).
Thanks to Călin-Andrei Burloiu, Big Data Engineer at antivirus company Avira, and Radu Pastia, Senior Software Developer in the Big Data Team at Orange, for the guest post below about the Couchdoop connector for bringing Couchbase data into Hadoop.
Couchdoop is a Couchbase connector for Apache Hadoop, developed by Avira on CDH, that allows for easy, parallel data transfer between Couchbase and Hadoop storage engines. It includes a command-line tool, for simple tasks and prototyping, as well as a MapReduce library, for those who want to use Couchdoop directly in MapReduce jobs. Couchdoop works natively with CDH 5.x.
Couchdoop can help you:
Many thanks to David Whiting of Spotify for allowing us to re-publish the following Spotify Labs post about its Apache Crunch use cases.
(Note: Since this post was originally published in November 2014, many of the library functions described have been added into crunch-core, so they’ll soon be available to all Crunch users by default.)
Thanks to Jesus Centeno of Qlik for the post below about using Impala alongside Qlik Sense.
Cloudera and Qlik (which is part of the Impala Accelerator Program) have revolutionized the delivery of insights and value to every business stakeholder for “small data,” to something more powerful in the Big Data world—enabling users to combine Big Data and “small data” to yield actionable business insights.
Thanks to Michael Williams, BIRT Product Evangelist & Forums Manager at analytics software specialist Actuate Corp. (now OpenText), for the guest post below. Actuate is the primary builder and supporter of BIRT, a top-level project of the Eclipse Foundation.
The Actuate (now OpenText) products BIRT Designer Professional and BIRT iHub allow you to connect to multiple data sources to create and deliver meaningful visualizations securely, with scalability reaching millions of users and devices. And now, with Impala emerging as a standard Big Data query engine for many of Actuate’s customers, solid BIRT integration with Impala has become critical.
Our thanks to Montrial Harrell, Enterprise Architect for the State of Indiana, for the guest post below.
Recently, the State of Indiana has begun to focus on how enterprise data management can help our state’s government operate more efficiently and improve the lives of our residents. With that goal in mind, I began this journey just like everyone else I know: with an interest in learning more about Apache Hadoop.
Thanks to Ben Harden of CapTech for allowing us to re-publish the post below.
Getting delimited flat file data ingested into Apache Hadoop and ready for use is a tedious task, especially when you want to take advantage of file compression, partitioning and performance gains you get from using the Avro and Parquet file formats.
This guest post from Intel Java performance architect Eric Kaczmarek (originally published here) explores how to tune Java garbage collection (GC) for Apache HBase focusing on 100% YCSB reads.
Apache HBase is an Apache open source project offering NoSQL data storage. Often used together with HDFS, HBase is widely used across the world. Well-known users include Facebook, Twitter, Yahoo, and more. From the developer’s perspective, HBase is a “distributed, versioned, non-relational database modeled after Google’s Bigtable, a distributed storage system for structured data”. HBase can easily handle very high throughput by either scaling up (i.e., deployment on a larger server) or scaling out (i.e., deployment on more servers).
Thanks to Guy Harrison of Dell Inc. for the guest post below about time-tested performance optimizations for connecting Oracle Database with Apache Hadoop that are now available in Apache Sqoop 1.4.5 and later.
Back in 2009, I attended a presentation by a Cloudera employee named Aaron Kimball at the MySQL User Conference in which he unveiled a new tool for moving data from relational databases into Hadoop. This tool was to become, of course, the now very widely known and beloved Sqoop!
Our thanks to Micah Whitacre, a senior software architect on Cerner Corp.’s Big Data Platforms team, for the post below about Cerner’s use case for CDH + Apache Kafka. (Kafka integration with CDH is currently incubating in Cloudera Labs.)
Over the years, Cerner Corp., a leading Healthcare IT provider, has utilized several of the core technologies available in CDH, Cloudera’s software platform containing Apache Hadoop and related projects—including HDFS, Apache HBase, Apache Crunch, Apache Hive, and Apache Oozie. Building upon those technologies, we have been able to architect solutions to handle our diverse ingestion and processing requirements.
Thanks to M. Asokan, Chief Architect at Syncsort, for the guest post below.
Apache Sqoop provides a framework to move data between HDFS and relational databases in a parallel fashion using Hadoop’s MR framework. As Hadoop becomes more popular in enterprises, there is a growing need to move data from non-relational sources like mainframe datasets to Hadoop. Following are possible reasons for this:
Our thanks to AWS Solutions Architect Rahul Bhartia for allowing us to republish his post below.
Apache Hadoop provides a great ecosystem of tools for extracting value from data in various formats and sizes. Originally focused on large-batch processing with tools like MapReduce, Apache Pig, and Apache Hive, Hadoop now provides many tools for running interactive queries on your data, such as Impala, Drill, and Presto. This post shows you how to use Amazon Elastic MapReduce (Amazon EMR) to analyze a data set available on Amazon Simple Storage Service (Amazon S3) and then use Tableau with Impala to visualize the data.
Our thanks to Melanie Imhof, Jonas Looser, Thierry Musy, and Kurt Stockinger of the Zurich University of Applied Science in Switzerland for the post below about their research into the query performance of Impala for mixed workloads.
Recently, we were approached by an industry partner to research and create a blueprint for a new Big Data, near real-time, query processing architecture that would replace its current architecture based on a popular open source database system.
Our thanks to Rakesh Rao of Quaero, for allowing us to re-publish the post below about Quaero’s experiences using partitioning in Apache Hive.
In this post, we will talk about how we can use the partitioning features available in Hive to improve performance of Hive queries.
The Transaction Processing Council (TPC), working with Cloudera, recently announced the new TPCx-HS benchmark, a good first step toward providing a Big Data benchmark.
In this interview by Roberto Zicari with Francois Raab, the original author of the TPC-C Benchmark, and Yanpei Chen, a Performance Engineer at Cloudera, the interviewees share their thoughts on the next step for benchmarks that reflect real-world use cases.
The following post was written by Jay Vyas (@jayunit100) and originally published in the Gluster.org Community.
I have recently spent some time getting Cloudera’s CDH 5 distribution of Apache Hadoop to work on GlusterFS 3.3 using Distributed Replicated 2 Volumes. This is made possible by the fact that Apache Hadoop has a pluggable filesystem architecture that allows the computational components within the CDH 5 distribution to be configured to use alternative filesystems to HDFS. In this case, one can configure CDH 5 to use the Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop), which allows it to run on GlusterFS 3.3. I’ve provided a diagram below that illustrates the CDH 5 core processes and how they interact with GlusterFS.
It’s been a while since we provided a how-to for this purpose. Thanks, Daan Debie (@DaanDebie), for allowing us to re-publish the instructions below (for CDH 5)!
I recently started as a Big Data Engineer at The New Motion. While researching our best options for running an Apache Hadoop cluster, I wanted to try out some of the features available in the newest version of Cloudera’s Hadoop distribution: CDH 5. Of course I could’ve downloaded the QuickStart VM, but I rather wanted to run a virtual cluster, making use of the 16GB of RAM my shiny new 15″ Retina Macbook Pro has ;)
Thanks to Bill Podell, VP Big Data and BI Practice, MBI Solutions, for the guest post below.
Capacity planning has long been a critical component of successful implementations for production systems. Today, Big Data calls for a particularly deep understanding of capacity management – because resource utilization explodes as business users, analysts, and data scientists jump onboard to analyze and use newly found data. The resource impact can escalate very quickly, causing poor loading and or response times. The result is throwing more hardware at the issue without any understanding of what impact the new hardware will have on the current issue. Better yet, be proactive and know about the problem before the problem even occurs!
Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working as a Principal Big Data Consultant at Allstate Insurance, for the guest post below about his experiences with Impala.
It started with a simple request from one of the managers in my group at Allstate to put together a demo of Tableau connecting to Cloudera Impala. I had previously worked on Impala with a large dataset about a year ago while it was still in beta, and was curious to see how Impala had improved since then in features and stability.
Thanks to Jonathan Natkins of WibiData for the post below about how his company extended Cloudera Manager to manage Kiji. Learn more about Kiji and the organizations using it to build real-time HBase applications at Kiji Sessions, happening on May 6, 2014, the day after HBaseCon.
As a partner of Cloudera, WibiData sees Cloudera Manager’s new extensibility framework as one of the most exciting parts of Cloudera Enterprise 5. Cloudera Manager 5.0.0 provides the single-pane view that Apache Hadoop administrators and operators want to effectively manage a cluster of machines. Additionally, Cloudera Manager now offers tight integration for partners to plug into the CDH ecosystem, which benefits Cloudera as well as WibiData.
Thanks to Alexander Rubin of Percona for allowing us to re-publish the post below!
Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data to Cloudera Impala (columnar format), and run reporting on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post.
Our thanks to Prashant Sharma and Matei Zaharia of Databricks for their permission to re-publish the post below about future Java 8 support in Apache Spark. Spark is now generally available inside CDH 5.
One of Apache Spark‘s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of lambda expressions in Java 8, we’ve updated Spark’s API to transparently support these expressions, while staying compatible with old versions of Java. This new support will be available in Spark 1.0.
A Few Examples
Our thanks to Amar Parkash, a Software Developer at Goibibo, a leading travel portal in India, for the enthusiastic support of Hue you’ll read below.
At Goibibo, we use Hue in our production environment. I came across Hue while looking for a near real-time log search tool and got to know about Cloudera Search and the interface provided by Hue. I tried it on my machine and was really impressed by the UI it provides for Apache Hive, Apache Pig, HDFS, job browser, and basically everything in the Big Data domain. We immediately deployed Hue in production, and that has been one of the best decisions we have ever made for our data platform at Goibibo.
The following post, by Sarah Cannon of Digital Reasoning, was originally published in that company’s blog. Digital Reasoning has graciously permitted us to re-publish here for your convenience.
At the beginning of each release cycle, engineers at Digital Reasoning are given time to explore the latest in Big Data technologies, examining how the frequently changing landscape might be best adapted to serve our mission. As we sat down in the early stages of planning for Synthesys 3.8 one of the biggest issues we faced involved reconciling the tradeoff between flexibility and performance. How can users quickly and easily retrieve knowledge from Synthesys without being tied to one strict data model?
Our thanks to Russell Cardullo and Michael Ruggiero, Data Infrastructure Engineers at Sharethrough, for the guest post below about its use case for Spark Streaming.
At Sharethrough, which offers an advertising exchange for delivering in-feed ads, we’ve been running on CDH for the past three years (after migrating from Amazon EMR), primarily for ETL. With the launch of our exchange platform in early 2013 and our desire to optimize content distribution in real time, our needs changed, yet CDH remains an important part of our infrastructure.
The guest post below was originally authored by Pinterest engineer Raghavendra Prabhu and published by the Pinterest Engineering blog. Being big ZooKeeper fans, we re-publish it here for your convenience. Thanks, Pinterest!
Apache ZooKeeper is an open source distributed coordination service that’s popular for use cases like service discovery, dynamic configuration management and distributed locking. While it’s versatile and useful, it has failure modes that can be hard to prepare for and recover from, and if used for site critical functionality, can have a significant impact on site availability.
Learn how to use Cloudera Search along with RBL-JE to search and index documents in multiple languages.
Our thanks to Basis Technology for providing the how-to below!
Set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager.
Thanks to Christian Javet for his permission to republish his blog post below!
Thanks to Marshall Bockrath-Vandegrift of advanced threat detection/malware company (and CDH user) Damballa for the following post about his Parkour project, which offers libraries for writing MapReduce jobs in Clojure. Parkour has been tested (but is not supported) on CDH 3 and CDH 4.
Clojure is Lisp-family functional programming language which targets the JVM. On the Damballa R&D team, Clojure has become the language of choice for implementing everything from web services to machine learning systems. One of Clojure’s key features for us is that it was designed from the start as an explicitly hosted language, building on rather than replacing the semantics of its underlying platform. Clojure’s mapping from language features to JVM implementation is frequently simpler and clearer even than Java’s.
Our thanks to Databricks, the company behind Apache Spark (incubating), for providing the guest post below. Cloudera and Databricks recently announced that Cloudera will distribute and support Spark in CDH. Look for more posts describing Spark internals and Spark + CDH use cases in the near future.
Our thanks to Telvis Calhoun, Zach Hanif, and Jason Trost of Endgame for the guest post below about their BinaryPig application for large-scale malware analysis on Apache Hadoop. Endgame uses data science to bring clarity to the digital domain, allowing its federal and commercial partners to sense, discover, and act in real time.
Our thanks to Concurrent Inc. for the how-to below about using Cascading Pattern with CDH. Cloudera recently tested CDH 4.4 with the Cascading Compatibility Test Suite verifying compatibility with Cascading 2.2.
Cascading Pattern is a machine-learning project within the Cascading development framework used to build enterprise data workflows. Cascading provides an abstraction layer on top of Apache Hadoop and other computing topologies that allows enterprises to leverage existing skills and resources to build data processing applications on Hadoop, without the need for specialized Hadoop skills.
Thanks to Victor Bittorf, a visiting graduate computer science student at Stanford University, for the guest post below about how to use the new prebuilt analytic functions for Cloudera Impala.
Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.