Thanks to Wuheng Luo, a Hadoop and big data architect at Sears Holdings, for the guest post below about Pig job-level performance tuning.
Many factors can affect Apache Pig job performance in Apache Hadoop, including hardware, network I/O, cluster settings, code logic, and algorithms. Although the sysadmin team is responsible for monitoring many of these factors, there are other issues that MapReduce job owners or data application developers can help diagnose,
Our thanks to Mayur Rustagi (@mayur_rustagi), CTO at Sigmoid Analytics, for allowing us to re-publish his post about the Spork (Pig-on-Spark) project below. (Related: Read about the ongoing upstream work to bring Spark-based data processing to Hive here.)
Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis.
The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing.
An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:
- Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
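The columnar idea in the bullet above can be illustrated with a small Python sketch. It is a toy in-memory model only, not how Parquet lays out bytes on disk, and the table, column names, and the `to_columnar` helper are all made up for illustration. It totals one column under a row layout and under a column layout, and counts how many values each approach has to touch.

```python
# Toy table with three columns; names and values are hypothetical.
rows = [
    {"user_id": 1, "country": "CA", "clicks": 10},
    {"user_id": 2, "country": "US", "clicks": 25},
    {"user_id": 3, "country": "CA", "clicks": 7},
]


def to_columnar(records):
    """Pivot a list of row dicts into a dict of per-column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)

# Row layout: every record is loaded in full just to total one field.
row_total = sum(record["clicks"] for record in rows)
row_values_touched = sum(len(record) for record in rows)

# Columnar layout: only the "clicks" column is read.
col_total = sum(columns["clicks"])
col_values_touched = len(columns["clicks"])

print(row_total, col_total)
print(row_values_touched, col_values_touched)
```

Both layouts give the same total, but the columnar version reads one third of the values here; the gap widens as a table gains columns that a query never references.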
Thanks to Xavier Clements of Wajam for allowing us to re-publish his blog post about Wajam’s Hadoop experiences below!
Wajam is a social search engine that gives you access to the knowledge of your friends. We gather your friends’ recommendations from Facebook, Twitter, and other social platforms and serve these back to you on supported sites like Google, eBay, TripAdvisor, and Wikipedia.
To do this,
This installment of the Hue demo series is about accessing the Hive Metastore from Hue, as well as using HCatalog with Hue. (Hue, of course, is the open source Web UI that makes Apache Hadoop easier to use.)
What is HCatalog?
HCatalog is a module in Apache Hive that enables non-Hive scripts to access Hive tables. You can then directly load tables with Apache Pig or MapReduce without having to worry about re-defining the input schemas,