Cloudera Blog · Hive Posts

Demo: Analyzing Data with Hue and Hive

In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).

The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!

Dataset Challenge with Hue

The demo below shows how the “business” and “review” datasets are cleaned and then converted to Hive tables before being queried with SQL.

How Persado Supports Persuasion Marketing Technology with Hive and Pig Training

This guest post comes from Alex Giamas, Senior Software Engineer on the data warehouse team at Persado, an ultra-hot persuasion marketing technology company with operations in Athens, Greece.

A World-Class EDW Requires a World-Class Hadoop Team

Persado is the global leader in persuasion marketing technology, a new category in digital marketing. Our revolutionary technology maps the genome of marketing language and generates the messages that work best for any customer and any product at any time. To assure the highest quality experience for both our clients and end-users, our engineering team collaborates with Ph.D. statisticians and data analysts to develop new ways to segment audiences, discover content, and deliver the most relevant and effective marketing messages in real time.

Given the challenge of creating a market based on ongoing data collection and massive query ability, the data warehouse organization ultimately plays the most important role in the persuasion marketing value chain, assuring a steady and unobstructed multidirectional flow of information. My team continuously ensures Persado’s infrastructure is aligned to the needs of our data scientists, including regularly generating KPI reports, managing data from heterogeneous sources, preparing customized analyses, and even implementing specific statistical algorithms in Java based on reference implementations in R.

Meet the Engineer: Mark Grover

In this installment, meet Cloudera Software Engineer Mark Grover (@mark_grover).

What do you do at Cloudera and in which Apache project are you involved?
I’m a Software Engineer at Cloudera, involved mostly with Apache Bigtop, an open source project aimed at building a community around packaging and interoperability testing of projects in the Apache Hadoop ecosystem. In addition, I contribute to Apache Hive, a data warehousing system built on top of Apache Hadoop that allows users to structure and query their Hadoop data using familiar SQL-like syntax. I have also written a section in O’Reilly’s book on Hive, Programming Hive.

How-to: Analyze Twitter Data with Hue

Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features applications such as an Apache Hive editor and an Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.

Collecting Data

The first step is to create the “flume” user and its home directory on HDFS, where the data will be stored. This can be done via the User Admin application.

One User’s Impala Experience at Data Hacking Day

The following guest post comes to you from Alan Gardner of remote database services and consulting company Pythian, who participated in Data Hacking Day (and was on the winning team!) at Cloudera’s offices in February.

Last Feb. 25, just prior to attending Strata, Alex Gorbachev (our CTO) and I had the chance to visit Cloudera’s Palo Alto offices for Data Hacking Day. The goal of the event was to produce something cool that leverages Cloudera Impala – the new open source, low-latency platform for querying data in Apache Hadoop.

Our hosts helpfully suggested some datasets, including the DEBS 2013 Grand Challenge data. This dataset contains the positions of all the players and the ball during a football match; our project was to plot the data for a given player and span of time onto a map of the field, creating a heatmap of how much time that player spent at different positions.
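As a rough illustration of the heatmap idea described above, the sketch below bins position samples into a grid and counts how many samples land in each cell. It is not the team's code: the function name, the field dimensions, the grid size, and the coordinate system (metres measured from one corner of the field) are all assumptions made for the example, and it further assumes that samples are evenly spaced in time, so that a cell's count is proportional to the time spent there.

```python
from collections import Counter

def build_heatmap(samples, field_length, field_width, cols, rows):
    """Count position samples per grid cell (hypothetical helper).

    samples: iterable of (x, y) positions, assumed to be in the same units
    as field_length and field_width, with the origin at one field corner.
    Returns a Counter keyed by (row, col).
    """
    counts = Counter()
    for x, y in samples:
        # Ignore samples that fall outside the field boundaries.
        if not (0 <= x <= field_length and 0 <= y <= field_width):
            continue
        # Scale the position to a cell index; clamp the far edge into the last cell.
        col = min(int(x / field_length * cols), cols - 1)
        row = min(int(y / field_width * rows), rows - 1)
        counts[(row, col)] += 1
    return counts

# Made-up samples for one player on an assumed 105 x 68 field.
samples = [(10.0, 5.0), (8.0, 6.0), (100.0, 60.0), (105.0, 68.0), (200.0, 5.0)]
heatmap = build_heatmap(samples, field_length=105.0, field_width=68.0, cols=10, rows=5)
```

In practice the filtering by player and time span would happen in the query itself, leaving only the binning and rendering to client code.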

The Data

Apache Hadoop Developer Training Helps Query Massive Telecom Data

This guest post is provided by Rohit Menon, Product Support and Development Specialist at Subex.

I am a software developer in Denver and have been working with C#, Java, and Ruby on Rails for the past six years. Writing code is a big part of my life, so I constantly keep an eye out for new advances, developments, and opportunities in the field, particularly those that promise to have a significant impact on software engineering and the industries that rely on it. 

In my current role working on revenue assurance products in the telecom space for Subex, I have regularly heard from customers that their data is growing at tremendous rates and becoming increasingly difficult to process, often forcing them to portion out data into smaller, more manageable subsets. The more I heard about this problem, the more I realized that the current approach is not a solution, but an opportunity, since companies could clearly benefit from more affordable and flexible ways to store data. Better query capability on larger data sets at any given time also seemed key to deriving the rich, valuable information that helps drive business. Ultimately, I was hoping to find a platform on which my customers could process all their data whenever they needed to. As I delved into this Big Data problem of managing and analyzing at mega-scale, it did not take long before I discovered Apache Hadoop.

Mission: Hands-On Hadoop

My initial reading about Hadoop on the various blogs and forums had me convinced that it is easily one of the best tools out there for handling and processing large volumes of data. At first, I thought I’d be able to learn Hadoop on my own by reading Hadoop: The Definitive Guide and the Hadoop Tutorial from Yahoo! However, after only a few days of reading, it became clear that I would benefit greatly from direct interaction with Hadoop experts, supervised experimentation, and interaction with practical examples of Hadoop challenges from the field. 

How-to: Set Up Cloudera Manager 4.5 for Apache Hive

Last week Cloudera released version 4.5 of Cloudera Manager, the leading framework for end-to-end management of Apache Hadoop clusters. (Download Cloudera Manager here, and see install instructions here.) Among many other features, Cloudera Manager 4.5 adds support for Apache Hive. In this post, I’ll explain how to set up a Hive server for use with Cloudera Manager 4.5.

For details about other new features in this release, please see the full release notes:

Inside Cloudera Impala: Runtime Code Generation

Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code generation eliminates and go over in more detail one of the queries in the TPC-H workload where code generation improves overall query speed by close to 3x.

Why Code Generation?

The baseline for “optimal” query engine performance is a native application that is written specifically for your data format, written only to support your query. For example, it would be ideal if a query engine could execute this query:

select count(*)
from tbl
where col like '%XYZ%'
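To make the contrast concrete, here is a minimal sketch in Python of the two styles of execution for the query above: a general-purpose path that interprets the column name and the LIKE pattern at runtime, and a hand-written function that supports only this one query. It illustrates the general idea only and says nothing about how Impala implements code generation; every function name here is hypothetical.

```python
def like_to_predicate(pattern):
    """Generic path: interpret a LIKE pattern of the form %text% at runtime."""
    if len(pattern) >= 2 and pattern.startswith("%") and pattern.endswith("%"):
        needle = pattern[1:-1]
        return lambda value: needle in value
    raise ValueError("only %text% patterns are handled in this sketch")

def generic_count(rows, column, pattern):
    """General-purpose evaluation: column and predicate are resolved per query."""
    predicate = like_to_predicate(pattern)
    count = 0
    for row in rows:
        # A dictionary lookup and an indirect call for every row: the kind of
        # overhead a flexible engine pays and a specialized program avoids.
        if predicate(row[column]):
            count += 1
    return count

def specialized_count(col_values):
    """Written only for: select count(*) from tbl where col like '%XYZ%'."""
    return sum(1 for value in col_values if "XYZ" in value)

# A made-up table with a single string column.
rows = [{"col": "aXYZb"}, {"col": "xyz"}, {"col": "XYZ"}, {"col": "none"}]
generic_result = generic_count(rows, "col", "%XYZ%")
specialized_result = specialized_count(r["col"] for r in rows)
```

Both functions return the same count; the difference is how much per-row work is spent on generality rather than on the query itself.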

Save 15% on Multi-Course Public Training Enrollments in January and February

Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production. We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.

But did you know Cloudera training can also help you plan for the advanced stages and progress of your Hadoop cluster? In addition to core training for Developers and Administrators, we also offer the best (and, in some cases, only) opportunity to get up to speed on lifecycle projects within the Hadoop ecosystem in a classroom setting. Cloudera University’s course offerings go beyond the basics to include Training for Apache HBase, Training for Apache Hive and Pig, and Introduction to Data Science: Building Recommender Systems. Depending on your Big Data agenda, Cloudera training can help you increase the accessibility and queryability of your data, push your data performance towards real-time, conduct business-critical analyses using familiar scripting languages, build new applications and customer-facing products, and conduct data experiments to improve your overall productivity and profitability.

For a limited time, Cloudera University is offering a 15% discount when you register for two or more Hadoop training courses to help you build out and realize your Big Data plan. Cover the basics with Developer or Administrator training, move beyond the HDFS and MapReduce core by pairing Developer and HBase training, work towards machine learning with Hive and Pig training and Introduction to Data Science, or customize your own learning path. Just use discount code 15off2 when you register for multiple public training classes from Cloudera University. This offer is only available for new enrollments and is only valid for classes delivered by Cloudera and scheduled to begin before March 1, 2013.

How-To: Schedule Recurring Hadoop Jobs with Apache Oozie

Our thanks to guest author Jon Natkins (@nattyice) of WibiData for the following post!

Today, many (if not most) companies have ETL or data enrichment jobs that are executed on a regular basis as data becomes available. In this scenario it is important to minimize the lag time between data being created and being ready for analysis.

CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects, includes a framework called Apache Oozie that can be used to design complex job workflows and coordinate them to occur at regular intervals. In this how-to, you’ll review a simple Oozie coordinator job, and learn how to schedule a recurring job in Hadoop. The example involves adding new data to a Hive table every hour, using Oozie to schedule the execution of recurring Hive scripts. (For the full context of the example, see the “Analyzing Twitter Data with Apache Hadoop” series.)
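To picture what "adding new data to a Hive table every hour" amounts to, the sketch below generates the kind of statement each hourly run might execute. It is plain Python rather than an Oozie coordinator definition, and the table name, the partition column `datehour`, and the HDFS directory layout are assumptions made for the example, not details taken from the post.

```python
from datetime import datetime, timedelta

def hourly_partition_statements(table, base_path, start, hours):
    """Build one ADD PARTITION statement per hour (hypothetical helper)."""
    statements = []
    for i in range(hours):
        ts = start + timedelta(hours=i)
        # Assumed partition key format: YYYYMMDDHH.
        datehour = ts.strftime("%Y%m%d%H")
        # Assumed directory layout: <base_path>/YYYY/MM/DD/HH.
        location = f"{base_path}/{ts:%Y/%m/%d/%H}"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (datehour = {datehour}) LOCATION '{location}';"
        )
    return statements

# Three consecutive hourly runs that cross a day boundary.
statements = hourly_partition_statements(
    "tweets", "/user/flume/tweets", datetime(2013, 1, 1, 22), 3
)
```

A scheduler such as Oozie would supply the timestamp for each run; the loop here only stands in for successive hourly executions.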

Adding Data to Hive Tables
