Category Archives: Pig

What’s New in CDH4.1 Pig

Categories: CDH Pig

Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.

Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.

Read More

Applying Parallel Prediction to Big Data

Categories: Guest Hadoop Mahout Pig Use Case

This guest post is provided by Dan McClary, Principal Product Manager for Big Data and Hadoop at Oracle.

One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them.

Read More

CDH4.1 Now Released!

Categories: CDH General Hadoop HBase HDFS Hive Oozie Pig Sqoop ZooKeeper

Update time!  As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually and then updates to CDH every three months.  Updates primarily comprise bug fixes but we will also add enhancements.  We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.

We’re pleased to announce the availability of CDH4.1. 

Read More

What Do Real-Life Apache Hadoop Workloads Look Like?

Categories: CDH Hadoop HBase HDFS Hive MapReduce Oozie Ops and DevOps Pig Testing Use Case

Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces.

Read More

Process a Million Songs with Apache Pig

Categories: CDH Community MapReduce Pig

The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness,

Read More