Our thanks to AWS Solutions Architect Rahul Bhartia for allowing us to republish his post below.
Apache Hadoop provides a great ecosystem of tools for extracting value from data in various formats and sizes. Originally focused on large-batch processing with tools like MapReduce, Apache Pig, and Apache Hive, Hadoop now provides many tools for running interactive queries on your data, such as Impala, Drill, and Presto. This post shows you how to use Amazon Elastic MapReduce (Amazon EMR) to analyze a data set available on Amazon Simple Storage Service (Amazon S3) and then use Tableau with Impala to visualize the data.
Amazon Elastic MapReduce
Amazon EMR is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Amazon EMR uses Apache Hadoop, an open source framework, to distribute and process your data across a resizable cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances.
Impala is an open source tool in the Hadoop ecosystem and is available on EMR for interactive, ad hoc querying using SQL syntax. Instead of using a MapReduce engine like Hive, Impala leverages a massively parallel processing (MPP) engine similar to what’s used in traditional relational database management systems (RDBMS), which allows it to achieve faster query response times.
While both Impala and Hive provide SQL-like capabilities and can share the same Metastore (a repository for tables and partitions metadata), they each play a distinct role in the Hadoop ecosystem. Compared to Impala, Hive typically has much higher query response times. This makes it inefficient to use with interactive data analysis tools like Tableau. However, Impala uses significant memory resources, and the cluster’s available memory places a constraint on how much data any query can consume. Hive is not limited in the same way and can process larger data sets with the same hardware, making it better for ETL workloads on large datasets.
Tableau Software is a business intelligence solution that integrates data analysis and reports into a continuous visual analysis process that is easy to learn and use. Tableau’s software delivers fast analytics, visualization and business intelligence and connects directly to AWS services and many other sources. The recent version of Tableau Desktop enables connection to Hive or Impala running on Amazon EMR via the ODBC driver for Amazon EMR. You can contact Tableau to find out how to activate Amazon EMR as a data source.
In this blog post, we’ll demonstrate Amazon EMR enabled as a data source in Tableau and connect to Impala for creating an interactive visualization
Using Amazon EMR to Analyze Google Books n-grams
The Google Books n-gram data set is freely available via the AWS Public Data Sets on Amazon S3. N-grams are fixed-size tuples of items. In this case, the items are words extracted from the Google Books corpus. The “n” specifies the number of elements in the tuple, so a 5-gram contains five words or characters.
Apache Hadoop traditionally works with HDFS, but it also supports the use of Amazon S3 as a file system. Impala currently requires data to be on HDFS while Hive supports direct querying of data on Amazon S3.
The Google Books n-gram data set is already in a Hadoop-friendly file format with sizes up to 2.2 TB. The data set files are stored in the SequenceFile format with block-level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable.
The SequenceFile format with block-level LZO compression requires further transformation because Impala cannot create them or insert data into them; it can only query LZO-compressed Text tables. Hive, which supports SequenceFile format with block-level LZO compression and querying for external data in Amazon S3, is an easy choice as the tool for transforming our data into an Impala-supported format onto HDFS.
Starting an Amazon EMR Cluster
First, we’ll launch an Amazon EMR cluster with Hive and Impala installed.
The statement below launches the EMR cluster using the AWS CLI and returns the unique identifier for your cluster.1aws emr create-cluster --name ImpalaCluster --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium --ec2-attributes KeyName=keyPairName,AvailabilityZone=availabilityZone --applications Name=Hive,Name=Impala --no-auto-terminate
Note: Replace the string keyPairName and availabilityZone with appropriate values before running the statement above. You’ll also need to replace the string j- XXXXXXXXXXXX in some of the next steps below with the unique identifier the statement above returns.
The cluster should be ready in 5-10 minutes (its state will become “Waiting.”) To check the status of the cluster as it is initializing, run the following command:1aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX--query 'Cluster.Status.State' --output text
Once your cluster enters the “WAITING” state, you can connect to the master node by using the command below.1aws emr ssh --cluster-id <em>j-XXXXXXXXXXXX</em> --key-pair-file keyFilePath
Note: Replace the string keyFilePath with the path to the private key file.
Creating the External Table from data in Amazon S3
Outside data sources are referenced in Amazon EMR by creating an EXTERNAL TABLE. This simply creates a reference to the data; no data is moved yet.
Once logged into the master node, start the Hive shell:1$ hive
Define the source using a CREATE TABLE statement. For this example, we will only use English 1-grams dataset.1hive> CREATE EXTERNAL TABLE eng_1M_1gram(token STRING, year INT, frequency INT, pages INT, books INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS SEQUENCEFILE LOCATION 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-1M/1gram';
Creating a Replica Table in HDFS
We’ll create a replica table to store the results on HDFS required for Impala. In the replica table we’ll use Parquet instead of Sequence File format. Parquet is a column-oriented binary file format intended to be highly efficient for running large-scale queries.
Create the replica table in Hive:1hive> CREATE TABLE eng_1M_1gram_paraquet(token STRING, year INT, frequency INT, pages INT, books INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS inputformat 'parquet.hive.DeprecatedParquetInputFormat' outputformat 'parquet.hive.DeprecatedParquetOutputFormat';
- Adjust the mapred.min.split.size settings because the data is stored in Amazon S3 in a single file.
1hive> set mapred.min.split.size=134217728;
This setting tells Hive to split the file into pieces of at least 128 MB for processing. This prevents you from using only one mapper when processing the data, as this wouldn’t take advantage of the distributed nature of MapReduce.
Insert data into this table using a select query. We’ll read from the raw data table and insert into this new table.1hive> INSERT OVERWRITE TABLE eng_1M_1gram_paraquet SELECT lower(token), year, frequency, pages, books FROM eng_1M_1gram WHERE year >= 1890 AND token REGEXP "^[A-Za-z+'-]+";
This query also illustrates a typical use of Hive to run transformation on your data to make it easier to query downstream using tools like Tableau. In the query, first we filter out data before 1890 because of lower number of publications and use regular expression to accept only n-grams composed of alphabets and commonly used punctuations. We also lowercase the n-gram using the built-in function to store data in unified format and simplify querying in later stages.
Once the step above completes, quit Hive.
Making the Replica Table Available in Impala
We’ll use the same Metastore here for both Impala and Hive. This requires the metadata to be updated in Impala before the table is available for Impala queries. The INVALIDATE METADATA statement marks the metadata as stale and makes Impala reload the associated metadata before a query proceeds.
Log into Impala.1$ impala-shell
Invalidate the metadata for the replica table in Impala.1impala> invalidate metadata;
Exit the Impala shell and close the SSH connection to the Amazon EMR cluster.
Using Tableau to Visualize the Data from Impala
For the next steps, you’ll need Tableau Desktop installed on a Windows or MacOSX machine. If you don’t have a Tableau installation, you can use Amazon EC2 and install Tableau for the purpose of this blog. You’ll also need the keys to enable Amazon EMR as a data connection in Tableau. These can be obtained by contacting Tableau.
Install the ODBC driver on your machine with Tableau Desktop, required for connecting Tableau Desktop to Impala on Amazon EMR.
- Download the drivers.
- Unzip the downloaded file. This should create a folder named "ImpalaODBC."
- Navigate to the required package for installing the driver.
MacOSX: ImpalaODBC/18.104.22.1687/MacOSX/ SimbaImpalaODBC.dmg
- Run the package above and follow the prompts to install the ODBC driver.
Modify the Amazon EMR cluster’s Master Security Group so Tableau can connect with the Impala server running on the master node of the Amazon EMR cluster.
- Click the Amazon EC2 tab in the AWS Management Console to open the Amazon EC2 console.
- In the navigation pane, select Security Groups under the Network and Security group.
- In the Security Groups list, select Elastic MapReduce-master.
- In the lower pane, click the Inbound tab.
- In the Port Range field type 21050. Leave the default value in the Source field.
- Click Add Rule, and then click Apply Rule Changes.
Follow the steps as directed by Tableau to enable Amazon EMR as a data connection option in Tableau. Clicking A" should show a connection page similar to the image below.
Fill in the DNS of master node in the Server field above and click Connect. You can find the DNS using the statement below:1aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX --query 'Cluster.MasterPublicDnsName' --output text
On the next screen, Select default from the schema drop-down, drag the table named "eng_1m_1g_paraquet" onto the upper left panel as in the image below, and click Go to Worksheet.
This opens a Tableau workbook with the Dimension and Measure automatically populated. Now we can use Tableau with Impala running on Amazon EMR.
Creating Interactive Visualizations
This video demonstrates the steps for creating some interactive visualizations using Tableau. First, we create a trend-line of books published over year.
Creating a Filter
The video demonstrates the steps for creating a filter that lets users choose a specific 1-Gram for the trend-line. The sudden increase around 1905 for the 1-gram ‘computer’ is quite interesting. If you have a hypothesis for the cause of this, please leave a comment below.
Terminating the Amazon EMR Cluster
Once you’ve completed the steps above, terminate the Amazon EMR cluster from the AWS console or via the CLI using the command below:
aws emr terminate-cluster --cluster-id j-XXXXXXXXXXXX
If you launched an Amazon EC2 instance for the purpose of this demo, don’t forget to terminate the instance.
In this post, we looked at a way to quickly visualize your data in Amazon S3 using Amazon EMR, Impala and Tableau. We also saw how Impala provides the response time needed while using Tableau’s data exploration feature so that you can quickly explore data and shift views on the fly to follow your train of thought. You can use the same pattern to quickly analyze any data on Amazon S3, including AWS CloudTrail logs or Amazon S3 access logs.