Our thanks to AWS Solutions Architect Rahul Bhartia for allowing us to republish his post below.
Apache Hadoop provides a great ecosystem of tools for extracting value from data in various formats and sizes. Originally focused on large-batch processing with tools like MapReduce, Apache Pig, and Apache Hive, Hadoop now provides many tools for running interactive queries on your data, such as Impala, Drill, and Presto. This post shows you how to use Amazon Elastic MapReduce (Amazon EMR) to analyze a data set available on Amazon Simple Storage Service (Amazon S3) and then use Tableau with Impala to visualize the data.
Impala’s speed now beats the fastest SQL-on-Hadoop alternatives. Test for yourself!
Since the initial beta release of Cloudera Impala more than one year ago (October 2012), we’ve been committed to regularly updating you about its evolution into the standard for running interactive SQL queries across data in Apache Hadoop and Hadoop-based enterprise data hubs. To briefly recap where we are today:
- Impala is being widely adopted.
Developers, rejoice: Impala is now available on EMR for testing and evaluation.
Amazon Web Services recently announced support for running Cloudera Impala queries on its Elastic MapReduce (EMR) service. This is very good news for EMR users, as well as for users of other platforms interested in kicking Impala’s tires in a friction-free way. It’s also yet another sign that Impala is rapidly being adopted across the ecosystem as the gold standard for interactive SQL and BI queries on Apache Hadoop.
The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!
I recently found an interesting dataset, the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness,
This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion’s Hadoop infrastructure to support Adconion’s next-generation ad optimization and reporting systems.
This is the first of a two-part series about moving away from Amazon’s EMR service to an in-house Apache Hadoop cluster.
When we first started using Hadoop, we went down the path of Amazon’s EMR service.