Tag Archives: logs

Apache Kafka for Beginners

Categories: Flume Kafka

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.

So now that the word is out, it seems the world wants to know: What does it do?

Read more

How-to: Count Events Like a Data Scientist

Categories: Data Science How-to Use Case

The ability to quickly and accurately count complex events is a legitimate business advantage.

In our work as data scientists, we spend most of our time counting things. It is the foundational skill that is used in data cleansing, reporting, feature engineering, and simple-but-effective machine learning models like Naive Bayes classifiers. Hilary Mason has a quote about the benefits of counting that I love:

Understand that what big data really means is to be able to count things in data sets of any size,

Read more

New in CDH 5.1: Document-level Security for Cloudera Search

Categories: CDH Platform Security & Cybersecurity Search

Cloudera Search now supports fine-grain access control via document-level security provided by Apache Sentry.

In my previous blog post, you learned about index-level security in Apache Sentry (incubating) and Cloudera Search. Although index-level security is effective when the access control requirements for documents in a collection are homogenous, often administrators want to restrict access to certain subsets of documents in a collection.

For example,

Read more

Apache Hive on Apache Spark: Motivations and Design Principles

Categories: Community Hive Spark

Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.

(Editor’s note [April 12, 2016]:┬áHive-on-Spark is now GA/ready for production as of CDH 5.7.)

Apache Hive is a popular SQL interface for batch processing and ETL using Apache Hadoop. Until recently, MapReduce was the only execution engine in the Hadoop ecosystem,

Read more

How-to: Configure JDBC Connections in Secure Apache Hadoop Environments

Categories: Hive How-to Impala Platform Security & Cybersecurity

Learn how HiveServer, Apache Sentry, and Impala help make Hadoop play nicely with BI tools when Kerberos is involved.

In 2010, I wrote a simple pair of blog entries outlining the general considerations behind using Apache Hadoop with BI tools. The Cloudera partner ecosystem has positively exploded since then, and the technology has matured as well. Today, if JDBC is involved, all the pieces needed to expose Hadoop data through familiar BI tools are available:

Read more