This guide is an introduction to Crunch.
Crunch is a library for processing data. It builds on top of Apache Hadoop to give Java programmers a simpler interface for data processing. In Crunch you create pipelines, much like Unix pipelines such as the command below:
grep "ERROR" log | sort | uniq -c
Crunch pipelines consist of a series of functions that you apply to the input data. Let’s say you have raw Apache HTTPD server logs and you want to know the total amount of data downloaded per IP address.
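To make the goal concrete before any Crunch code appears, here is a minimal plain-Java sketch of the computation itself: parse each log line, extract the IP address and response size, and sum the sizes per address. It does not use the Crunch API and runs on a single machine. The class name DownloadTotals, the method totalBytesByIp, and the sample lines are hypothetical, and the sketch assumes the logs are in Apache's Common Log Format, which the text above does not specify.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Crunch: it only illustrates the
// "parse, extract, sum per key" steps on an in-memory list of lines.
class DownloadTotals {
    // Assumed Common Log Format:
    //   host ident user [timestamp] "request" status bytes
    // Group 1 is the client address, group 2 the response size ("-" if none).
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" \\d{3} (\\d+|-)");

    static Map<String, Long> totalBytesByIp(List<String> lines) {
        // TreeMap keeps the addresses sorted, so the output is stable.
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LOG_LINE.matcher(line);
            if (!m.find()) {
                continue; // skip lines that do not match the assumed format
            }
            String bytes = m.group(2);
            long size = bytes.equals("-") ? 0L : Long.parseLong(bytes);
            // Add this response's size to the running total for the address.
            totals.merge(m.group(1), size, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] \"GET /a.gif HTTP/1.0\" 200 2326",
            "10.0.0.2 - - [10/Oct/2000:13:55:40 -0700] \"GET /b.gif HTTP/1.0\" 200 100",
            "10.0.0.1 - - [10/Oct/2000:13:56:01 -0700] \"GET /c.gif HTTP/1.0\" 200 500",
            "10.0.0.1 - - [10/Oct/2000:13:56:09 -0700] \"GET /a.gif HTTP/1.0\" 304 -");
        System.out.println(totalBytesByIp(sample));
    }
}
```

In a Crunch pipeline the same steps would be written as functions applied to distributed collections and executed on Hadoop, rather than as a loop over an in-memory list.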