Author Archives: Jon Natkins

How-to: Use a SerDe in Apache Hive

Categories: Hive How-to

Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semistructured and unstructured data can be queried gracefully via Hive, due to two core features: The first is Hive’s support of complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases. The second feature is the SerDe.

Read More

Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive

Categories: CDH Hadoop Hive How-to Use Case

This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.

In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop.

Read More

Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume

Categories: CDH Flume Hadoop How-to Oozie Use Case

This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.

Every story has a beginning,

Read More

How-to: Analyze Twitter Data with Apache Hadoop

Categories: CDH Data Ingestion Flume General Hive How-to Oozie

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.

Read More

How-to: Develop CDH Applications with Maven and Eclipse

Categories: CDH How-to Tools

Learn how to configure a basic Maven project that will be able to build applications against CDH

Apache Maven is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Maven is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Maven project that will be able to build applications against CDH (Cloudera’s Distribution Including Apache Hadoop) binaries.

Read More