Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
Learn about BigBench, the new industrywide effort to create a sorely needed Big Data benchmark.
Benchmarking Big Data systems is an open problem. To address this concern, numerous hardware and software vendors are working together to create a comprehensive end-to-end big data benchmark suite called BigBench. BigBench builds upon and borrows elements from existing benchmarking efforts in the Big Data space (such as YCSB, TPC-xHS, GridMix, PigMix, HiBench, Big Data Benchmark, and TPC-DS). Intel and Cloudera, along with other industry partners, are working to define and implement extensions to BigBench 1.0. (A TPC proposal for BigBench 2.0 is in the works.)
The community effort to make Apache Spark an execution engine for Apache Hive is making solid progress.
Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.
Installing CDH on newer unsupported operating systems (such as Ubuntu 13.04 and later) can lead to conflicts. These guidelines will help you avoid them.
Some of the more recently released operating systems that bundle portions of the Apache Hadoop stack in their respective distro repositories can conflict with software from Cloudera repositories. Consequently, when you set up CDH for installation on such an OS, you may end up picking up packages with the same name from the OS’s distribution instead of Cloudera’s distribution. Package installation may succeed, but using the installed packages may lead to unforeseen errors.
Thanks to Guy Harrison of Dell Inc. for the guest post below about time-tested performance optimizations for connecting Oracle Database with Apache Hadoop that are now available in Apache Sqoop 1.4.5 and later.
Back in 2009, I attended a presentation by a Cloudera employee named Aaron Kimball at the MySQL User Conference in which he unveiled a new tool for moving data from relational databases into Hadoop. This tool was to become, of course, the now very widely known and beloved Sqoop!
Cloudera’s culture is premised on innovation and teamwork, and there’s no better example of them in action than our internal hackathon.
Cloudera Engineering doubled-down on its “hackathon” tradition last week, with this year’s edition taking an around-the-clock approach thanks to the HQ building upgrade since the 2013 edition (just look at all that space!).
Our thanks to Micah Whitacre, a senior software architect on Cerner Corp.’s Big Data Platforms team, for the post below about Cerner’s use case for CDH + Apache Kafka. (Kafka integration with CDH is currently incubating in Cloudera Labs.)
Over the years, Cerner Corp., a leading Healthcare IT provider, has utilized several of the core technologies available in CDH, Cloudera’s software platform containing Apache Hadoop and related projects—including HDFS, Apache HBase, Apache Crunch, Apache Hive, and Apache Oozie. Building upon those technologies, we have been able to architect solutions to handle our diverse ingestion and processing requirements.
Find Cloudera tech talks in Seattle, Las Vegas, London, Madrid, Budapest, Barcelona, Washington DC, Toronto, and other cities through the end of 2014.
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees—this time, for the remaining dates of 2014. Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).
The new integration between Flume and Kafka offers sub-second-latency event processing without the need for dedicated infrastructure.
In this previous post you learned some Apache Kafka basics and explored a scenario for using Kafka in an online application. This post takes you a step further and highlights the integration of Kafka with Apache Hadoop, demonstrating both a basic ingestion capability as well as how different open-source components can be easily combined to create a near-real time stream processing workflow using Kafka, Apache Flume, and Hadoop. (Kafka integration with CDH is currently incubating in Cloudera Labs.)
The Case for Flafka
The number of powerful data query tools in the Apache Hadoop ecosystem can be confusing, but understanding a few simple things about your needs usually makes the choice easy.
Ah, the good old days. I recall vividly that in 2007, I was faced to store 1 billion XML documents and make them accessible as well as searchable. I had few choices on a given shoestring budget: build something one my own (it was the rage back then—and still is), use an existing open source database like PostgreSQL or MySQL, or try this thing that Google built successfully and that was now implemented in open source under the Apache umbrella: Hadoop.