Category Archives: Pig

Save 15% on Multi-Course Public Training Enrollments in January and February

Categories: Data Science Hadoop HBase Hive Pig Training

Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production.  We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.

Read more

Apache Hadoop in 2013: The State of the Platform

Categories: Avro CDH Flume Hadoop HBase HDFS Hive Hue Impala Mahout MapReduce Oozie Pig Sqoop YARN ZooKeeper

For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.

In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala,

Read more

Applying Parallel Prediction to Big Data

Categories: Guest Hadoop Mahout Pig Use Case

This guest post is provided by Dan McClary, Principal Product Manager for Big Data and Hadoop at Oracle.

One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them.

Read more

What Do Real-Life Apache Hadoop Workloads Look Like?

Categories: CDH Hadoop HBase HDFS Hive MapReduce Oozie Ops and DevOps Pig Testing Use Case

Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces.

Read more

Process a Million Songs with Apache Pig

Categories: CDH Community MapReduce Pig

The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo,

Read more