Category Archives: How-to

Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume

Categories: CDH Flume Hadoop How-to Oozie Use Case

This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.

Every story has a beginning,

Read more

How-to: Set Up an Apache Hadoop/Apache HBase Cluster on EC2 in (About) an Hour

Categories: CDH Cloud Cloudera Manager How-to

Note (added July 8, 2013): The information below is deprecated; we suggest that you refer to this post for current instructions.

Today we bring you one user’s experience using Apache Whirr to spin up a CDH cluster in the cloud. This post was originally published here by George London (@rogueleaderr) based on his personal experiences; he has graciously allowed us to bring it to you here as well in a condensed form.

Read more

How-to: Enable User Authentication and Authorization in Apache HBase

Categories: HBase How-to Platform Security & Cybersecurity

With the default Apache HBase configuration, everyone is allowed to read from and write to all tables available in the system. For many enterprise setups, this kind of policy is unacceptable. 

Administrators can set up firewalls that decide which machines are allowed to communicate with HBase. However, machines that can pass the firewall are still allowed to read from and write to all tables. This kind of mechanism is effective but insufficient because HBase still cannot differentiate between multiple users who use the same client machines,

Read more

How-to: Analyze Twitter Data with Apache Hadoop

Categories: CDH Data Ingestion Flume General Hive How-to Oozie

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing they may not be able to talk directly to everyone they want to target, marketing departments can be more efficient by being selective about whom they reach out to.

Read more

How-to: Automate Your Cluster with Cloudera Manager API

Categories: Cloudera Manager Hadoop How-to MapReduce Ops and DevOps Tools

API access was a new feature introduced in Cloudera Manager 4.0 (download the free edition here). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.
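To make "programmatic access" concrete, here is a minimal Python sketch that builds (but does not send) an authenticated request for a list of clusters. The excerpt does not give the endpoint details, so the port (7180), the `/api/v1/clusters` path, the use of HTTP Basic authentication, the host name, and the `admin`/`admin` credentials are all assumptions for illustration; `build_cm_request` is a hypothetical helper, not part of any Cloudera library.

```python
import base64
import urllib.request


def build_cm_request(host, path, user, password, port=7180, api_version="v1"):
    """Build (but do not send) a GET request for a Cloudera Manager API path.

    The port, the /api/<version>/ prefix, and Basic auth are assumptions
    about a typical CM deployment; adjust them to match your server.
    """
    url = f"http://{host}:{port}/api/{api_version}/{path.lstrip('/')}"
    # HTTP Basic auth: base64-encode "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


# Hypothetical host and credentials, for illustration only.
req = build_cm_request("cm-host.example.com", "clusters", "admin", "admin")
print(req.full_url)  # prints http://cm-host.example.com:7180/api/v1/clusters
```

Passing `req` to `urllib.request.urlopen` would send the request to a reachable server. Keeping request construction separate from sending makes the URL and headers easy to inspect before touching a live cluster.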

Cloudera Manager API Basics

The CM API is an HTTP REST API,

Read more