QuickStart VM: Now with Real-Time Big Data
- by Sean Mackrory
- June 20, 2013
For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.
We’re constantly updating and improving the QuickStart VM, and the latest release includes two of Cloudera’s new products that give you easier and faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue, an open source web-based interface for Hadoop and the easiest way to interact with your data.
Cloudera Search integrates Apache Solr with the rest of the platform to let you do full-text search of the data stored in your cluster, just like you would with an online search engine. Cloudera Impala, on the other hand, lets you execute SQL queries against that same data, on the same platform, and get results back fast enough to explore and analyze it interactively. Having both workloads available on the same cluster eliminates the pain of moving large volumes of data around.
To help you get a sense of how these could work for you, we’ve set up a couple of examples in the Cloudera QuickStart VM. You can download the VM here for VirtualBox, VMware, and other hypervisors.
Starting Services in Cloudera Manager
When the QuickStart VM boots, it configures all the services you might expect on a Cloudera cluster. Obviously, this simple single-node, “pseudo-distributed” setup does not represent the performance, scalability, and reliability of a fully distributed cluster, but it does give you a taste of how easy it is to perform powerful work with your data.
The core services are already running, but you’ll need to make sure that Impala, Solr, and any other additional services you’d like to try are started before you proceed.
Enabling Cloudera Search
From the welcome screen, select Cloudera Manager, or navigate to http://localhost:7180 in your browser. Log in with the username and password ‘admin’. When the Services page loads, look for the line with the solr1 service, and click Actions → Start…
ZooKeeper, HDFS, MapReduce, and Hue are started automatically and should already be running. Later in the tutorial, you may want to stop Solr and start Impala. If you go beyond the examples in this tutorial, you may need to allocate more memory for your VM. The default is 4GB, but starting more services in Cloudera Manager may require more, depending on your use case.
Batch Indexing with MapReduce
One way to make data searchable is to index it with a batch job. This is ideal for fixed data sets, like a collection of reports from last year. The script ~/datasets/batch-tweets.sh demonstrates how you would set up a “collection” in Solr and invoke MapReduce for this type of job. You can open the Terminal application from the taskbar at the top, and run this script to load and index some sample data:
$ ~/datasets/batch-tweets.sh
Note that this data does not consist of real tweets – it is just similarly structured data to demonstrate the process. (If you want to see some interesting, real data from Twitter, see the near-real-time example below!) Now that you have some data loaded, you can try out the interfaces for doing full-text search.
Searching With Hue and Solr
Click the Hue bookmark or navigate to http://localhost:8888/home in the browser, and open the Search app. You will be presented with a list of collections. Select the batch_tweets collection that we just created and import it. Once it is imported, open that collection, click the Search It! link, and you will see all the data from the fake tweets we just indexed. As you type a query, the list is filtered accordingly.
You can also click the Solr bookmark in Firefox or navigate to http://localhost:8983/solr to access the Solr admin web interface. This interface provides more advanced information about your data and the underlying search infrastructure.
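Besides the Hue and admin interfaces, Solr also answers queries over plain HTTP, which can be handy for a quick check from the Terminal. The following is a minimal sketch, assuming the batch_tweets collection has been indexed and that Solr's standard `/select` handler is reachable at the address given above. The helper names `solr_url` and `solr_search` are hypothetical and are not part of the QuickStart VM.

```shell
# Hypothetical helper: build the URL of a collection's standard /select handler.
# $1 = collection name, $2 = Solr base URL (defaults to the address used in this post)
solr_url() {
  printf '%s/%s/select\n' "${2:-http://localhost:8983/solr}" "$1"
}

# Hypothetical helper: run a query and return at most five matches as JSON.
# curl -G sends the --data-urlencode pairs as a URL-encoded GET query string.
solr_search() {
  curl -sG "$(solr_url "$1")" \
    --data-urlencode "q=$2" \
    --data-urlencode 'wt=json' \
    --data-urlencode 'rows=5'
}

# Example (run on the VM once Solr is started):
# solr_search batch_tweets 'hadoop'
```

The `q`, `wt`, and `rows` parameters are standard Solr query parameters. Which field a bare search term matches depends on the collection's configuration, so results may differ from what the Hue Search app shows.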
Hue also allows you to customize which fields in the data are shown, and how they are displayed. Check out this video that shows how to quickly make a professional-looking search page.
Near-real-time Indexing with Flume
You can configure Cloudera Search to stream live data from Twitter with Flume, index it with Solr as it comes in, and store it in HDFS for future searches. You’ll need to sign in to dev.twitter.com with your Twitter account, select My applications from the drop-down menu in the top-right corner, and Create a new application to represent your Search installation. (You don’t need a callback URL, and as you won’t be sharing this ‘app’ with others, you may fill in the other fields according to your own preference.) Once the application is created, click Create my access token at the bottom of the page. To configure your cluster to connect to Twitter as this application, you’ll need the consumer key, consumer secret, access token, and access token secret from this page. (You may have to refresh to see the access token you created.) Keep this information confidential just as you would your Twitter username and password. You can learn more about Twitter’s public data streams and their policies here.
The script ~/datasets/nrt-tweets.sh demonstrates how you would configure Flume for this type of job. By providing the Twitter credentials listed above, you can use this script to create a collection and start downloading data:
$ ~/datasets/nrt-tweets.sh start [CONSUMER_KEY] [CONSUMER_SECRET] \
    [ACCESS_TOKEN] [ACCESS_TOKEN_SECRET]
Tweets become searchable seconds later. In the Hue Search application, go to the Collection Manager and import the new collection. If you don’t see any tweets, look at the log file /var/log/flume-ng/flume.log for any errors returned by the Twitter API. If the credentials were wrong when you ran the setup script, you can edit them in /etc/flume-ng/conf/flume.conf and restart Flume. If the VM was not able to set the correct system time during the boot process (`date --utc` must return the current UTC time for authentication to succeed), correct it with `ntpdate pool.ntp.org`. You should restart the services in Cloudera Manager after making this change.
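To make the log check described above a little quicker, something like the following could be used. It is only a sketch: the helper name `flume_errors` is hypothetical, and the search pattern assumes that problems appear in the log on lines containing "error" or "exception", which may not catch every failure.

```shell
# Hypothetical helper: show the most recent log lines that mention an error.
# $1 = log file (defaults to the Flume log path named in this post)
flume_errors() {
  grep -iE 'error|exception' "${1:-/var/log/flume-ng/flume.log}" | tail -n 20
}

# Example (run on the VM after starting nrt-tweets.sh):
# flume_errors
# date --utc    # compare against the real UTC time before retrying
```

If the output mentions authentication failures, check the credentials and the system clock as described above before restarting Flume.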
When you want to stop ingesting data, run:
$ ~/datasets/nrt-tweets.sh stop
$ sudo service flume-ng-agent stop
Note that nrt-tweets.sh runs Flume independently of the service in Cloudera Manager, so it will not interfere with any other Flume configuration you might set up in the QuickStart VM using Cloudera Manager.
Interactive Querying with Impala
Another recent addition to our Big Data platform is Cloudera Impala. Impala allows you to execute SQL queries against your data in Hadoop or HBase, using the same tables and metastore you use with Apache Hive. Impala is designed for extremely low latency – it’s fast enough to interactively explore and analyze your data.
Again, a single-node demo doesn’t truly demonstrate the speed and scalability of Impala, but it will let you execute some sample queries and see the performance relative to Hive. As we did with Search, make sure the impala1 and hive1 services are started in Cloudera Manager.
Hue provides some sample data sets for Hive and Impala (and other components too). You can install them from within Hue: go to About (top-left icon) → Step 2: Examples, and click the service you want to try out. The QuickStart VM comes with two additional data sets you can install for Hive and Impala. The first lists the median income for each zip code in the United States from the 2000 Census. The second is a much larger data set from the Transaction Processing Performance Council that is used to benchmark databases against realistic business workloads. There are scripts to install and configure each, but be sure to execute the refresh command to make Impala aware of the new data sets:
$ ~/datasets/zipcode-setup.sh
$ ~/datasets/tpcds-setup.sh # requires Internet access
$ impala-shell -q 'refresh'
Once these data sets are installed, you can explore the tables in the Metastore Manager application in Hue, and you can execute SQL queries through the Hive and Impala applications (or from the command line using the impala-shell utility). Here are some example queries you can try. Compare them and see how much faster you can get results with Impala!
select * from zipcode_incomes where zip='59101';
select i_item_id, s_state,
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where cd_gender = 'M'
  and cd_marital_status = 'S'
  and cd_education_status = 'College'
  and d_year = 2002
  and s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD')
group by i_item_id, s_state
order by i_item_id, s_state
limit 100;
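One rough way to compare the two engines from the Terminal is to time the same statement in each. The sketch below assumes a bash shell, that the zipcode data set has been installed, and that both `impala-shell` and the `hive` command-line client are available on the VM. The helper name `zip_query` is hypothetical and only builds the query text shown earlier.

```shell
# Hypothetical helper: build the zip code lookup query for a given zip code.
zip_query() {
  printf "select * from zipcode_incomes where zip='%s';\n" "$1"
}

# Run on the VM to time the same statement in each engine.
# impala-shell -q and hive -e both execute a single quoted statement and exit.
# time impala-shell -q "$(zip_query 59101)"
# time hive -e "$(zip_query 59101)"
```

Timings on a single-node VM are only indicative, and they include client start-up time, so the gap between the two engines will look different on a real cluster.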
New products from Cloudera like Search and Impala are making it easy to extract valuable information from your data faster than ever before. To learn about how these tools can be used to help solve problems and answer bigger questions, please visit Cloudera’s website.
If you run into any problems or have questions, you can refer to our online documentation or contact us on the appropriate user groups, all of which are detailed here.
Sean Mackrory is a Software Engineer on the infrastructure team and an Apache Bigtop Committer.