Tag Archives: Pig

How-to: Do Real-Time Log Analytics with Apache Kafka, Cloudera Search, and Hue

Categories: Data Ingestion How-to Hue Kafka Search

Cloudera recently announced formal support for Apache Kafka. This simple use case illustrates how to make web log analysis, powered in part by Kafka, one of your first steps in a pervasive analytics journey.

If you are not looking at your company’s operational logs, then you are at a competitive disadvantage in your industry. Web server logs, application logs, and system logs are all valuable sources of operational intelligence,


Download the Hive-on-Spark Beta

Categories: Cloudera Labs Hive Spark

A Hive-on-Spark beta is now available via CDH parcel. Give it a try!

The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.

Many anxious users have inquired about its availability in the last few months.


New in CDH 5.3: Apache Sentry Integration with HDFS

Categories: Data Ingestion Platform Security & Cybersecurity Sentry Sqoop

Starting in CDH 5.3, Apache Sentry integration with HDFS saves admins a lot of work by centralizing access control permissions across components that use HDFS.

It’s been more than a year and a half since a couple of my colleagues here at Cloudera shipped the first version of Sentry (now Apache Sentry (incubating)). This project filled a huge security gap in the Apache Hadoop ecosystem by bringing truly secure and dependable fine-grained authorization to it, and it provided out-of-the-box integration for Apache Hive.


Using Impala, Amazon EMR, and Tableau to Analyze and Visualize Data

Categories: Cloud General Guest

Our thanks to AWS Solutions Architect Rahul Bhartia for allowing us to republish his post below.

Apache Hadoop provides a great ecosystem of tools for extracting value from data in various formats and sizes. Originally focused on large-batch processing with tools like MapReduce, Apache Pig, and Apache Hive, Hadoop now provides many tools for running interactive queries on your data, such as Impala, Drill, and Presto. This post shows you how to use Amazon Elastic MapReduce (Amazon EMR) to analyze a data set available on Amazon Simple Storage Service (Amazon S3) and then use Tableau with Impala to visualize the data.


Building Lambda Architecture with Spark Streaming

Categories: Kafka Spark

The versatility of Apache Spark’s API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the real world.

Few things help you concentrate like a last-minute change to a major project.

One time, after working with a customer for three weeks to design and implement a proof-of-concept data ingest pipeline, the customer’s chief architect told us:

You know, I really like the design –
