Cloudera Engineering Blog · Guest Posts
It’s been a while since we provided a how-to for this purpose. Thanks, Daan Debie (@DaanDebie), for allowing us to re-publish the instructions below (for CDH 5)!
I recently started as a Big Data Engineer at The New Motion. While researching our best options for running an Apache Hadoop cluster, I wanted to try out some of the features available in the newest version of Cloudera’s Hadoop distribution: CDH 5. Of course I could’ve downloaded the QuickStart VM, but I wanted to run a virtual cluster instead, making use of the 16 GB of RAM my shiny new 15″ Retina MacBook Pro has ;)
Thanks to Bill Podell, VP Big Data and BI Practice, MBI Solutions, for the guest post below.
Capacity planning has long been a critical component of successful implementations for production systems. Today, Big Data calls for a particularly deep understanding of capacity management – because resource utilization explodes as business users, analysts, and data scientists jump on board to analyze and use newly found data. The resource impact can escalate very quickly, causing poor load and/or response times. The usual reaction is to throw more hardware at the issue without understanding what impact the new hardware will have on it. Better yet, be proactive and know about the problem before it even occurs!
Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working as a Principal Big Data Consultant at Allstate Insurance, for the guest post below about his experiences with Impala.
It started with a simple request from one of the managers in my group at Allstate to put together a demo of Tableau connecting to Cloudera Impala. I had previously worked on Impala with a large dataset about a year ago while it was still in beta, and was curious to see how Impala had improved since then in features and stability.
Thanks to Jonathan Natkins of WibiData for the post below about how his company extended Cloudera Manager to manage Kiji. Learn more about Kiji and the organizations using it to build real-time HBase applications at Kiji Sessions, happening on May 6, 2014, the day after HBaseCon.
As a partner of Cloudera, WibiData sees Cloudera Manager’s new extensibility framework as one of the most exciting parts of Cloudera Enterprise 5. Cloudera Manager 5.0.0 provides the single-pane view that Apache Hadoop administrators and operators want to effectively manage a cluster of machines. Additionally, Cloudera Manager now offers tight integration for partners to plug into the CDH ecosystem, which benefits Cloudera as well as WibiData.
Thanks to Alexander Rubin of Percona for allowing us to re-publish the post below!
Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format), and run reporting on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post.
Our thanks to Prashant Sharma and Matei Zaharia of Databricks for their permission to re-publish the post below about future Java 8 support in Apache Spark. Spark is now generally available inside CDH 5.
One of Apache Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of lambda expressions in Java 8, we’ve updated Spark’s API to transparently support these expressions, while staying compatible with old versions of Java. This new support will be available in Spark 1.0.
A Few Examples
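The excerpt ends before the original post's own examples, so what follows is only an illustrative sketch of the verbosity difference it describes, not Spark code. It uses the JDK's `java.util.stream` and `java.util.function.Predicate` in place of Spark's API, and the class name `LambdaSketch`, the method names, and the sample log lines are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class LambdaSketch {

    // Anonymous-inner-class style: the only way to pass a function
    // object in Java before lambda expressions existed.
    static List<String> filterVerbose(List<String> lines) {
        Predicate<String> containsError = new Predicate<String>() {
            @Override
            public boolean test(String s) {
                return s.contains("error");
            }
        };
        return lines.stream().filter(containsError).collect(Collectors.toList());
    }

    // Java 8 lambda style: the same predicate written inline.
    static List<String> filterConcise(List<String> lines) {
        return lines.stream()
                .filter(s -> s.contains("error"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for lines of a log file.
        List<String> lines = Arrays.asList("ok", "disk error", "error: timeout");
        System.out.println(filterVerbose(lines));
        System.out.println(filterConcise(lines));
    }
}
```

Both methods return the same list; the only difference is how much ceremony surrounds the one-line predicate, which is the gap the post says Java 8 support closes for Spark's Java API.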
Our thanks to Amar Parkash, a Software Developer at Goibibo, a leading travel portal in India, for the enthusiastic support of Hue you’ll read below.
At Goibibo, we use Hue in our production environment. I came across Hue while looking for a near real-time log search tool and got to know about Cloudera Search and the interface provided by Hue. I tried it on my machine and was really impressed by the UI it provides for Apache Hive, Apache Pig, HDFS, job browser, and basically everything in the Big Data domain. We immediately deployed Hue in production, and that has been one of the best decisions we have ever made for our data platform at Goibibo.
The following post, by Sarah Cannon of Digital Reasoning, was originally published on that company’s blog. Digital Reasoning has graciously permitted us to re-publish it here for your convenience.
At the beginning of each release cycle, engineers at Digital Reasoning are given time to explore the latest in Big Data technologies, examining how the frequently changing landscape might be best adapted to serve our mission. As we sat down in the early stages of planning for Synthesys 3.8, one of the biggest issues we faced involved reconciling the tradeoff between flexibility and performance. How can users quickly and easily retrieve knowledge from Synthesys without being tied to one strict data model?
Our thanks to Russell Cardullo and Michael Ruggiero, Data Infrastructure Engineers at Sharethrough, for the guest post below about its use case for Spark Streaming.
At Sharethrough, which offers an advertising exchange for delivering in-feed ads, we’ve been running on CDH for the past three years (after migrating from Amazon EMR), primarily for ETL. With the launch of our exchange platform in early 2013 and our desire to optimize content distribution in real time, our needs changed, yet CDH remains an important part of our infrastructure.
The guest post below was originally authored by Pinterest engineer Raghavendra Prabhu and published by the Pinterest Engineering blog. Being big ZooKeeper fans, we re-publish it here for your convenience. Thanks, Pinterest!
Apache ZooKeeper is an open source distributed coordination service that’s popular for use cases like service discovery, dynamic configuration management, and distributed locking. While it’s versatile and useful, it has failure modes that can be hard to prepare for and recover from, and, if used for site-critical functionality, they can have a significant impact on site availability.