Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semistructured and unstructured data can be queried gracefully via Hive, due to two core features: The first is Hive’s support of complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases. The second feature is the SerDe.
This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.
In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop.
This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.
Every story has a beginning,
Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.
Learn how to configure a basic Maven project that will be able to build applications against CDH
Apache Maven is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Maven is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Maven project that will be able to build applications against CDH (Cloudera’s Distribution Including Apache Hadoop) binaries.