Category Archives: How-to

How-to: Build a Real-Time Search System using StreamSets, Apache Kafka, and Cloudera Search

Categories: Cloudera Manager Guest How-to Hue Kafka Search

Thanks to Jonathan Natkins, a field engineer from StreamSets, for the guest post below about using StreamSets Data Collector—open source, GUI-driven ingest technology for developing and operating data pipelines with a minimum of code—and Cloudera Search and HUE to build a real-time search environment.

As pressure mounts on data engineers to deliver more data from more sources in less time, StreamSets Data Collector can serve as a linchpin in the data management process,

Read More

How-to: Install Cloudera Enterprise on Microsoft Azure (Part 1)

Categories: Cloud Guest How-to

Recently, GoDataDriven installed a Cloudera Enterprise (CDH + Cloudera Manager) cluster on Microsoft Azure. This two-part series, written by Alexander Bij and Tünde Alkemade and republished with permission, includes information about use case, design, and installation.

Processing large amounts of unstructured data requires serious computing power and also maintenance effort. As load on computing power typically fluctuates due to time and seasonal influences and/or processes running on certain times,

Read More

How-to: Train Models in R and Python using Apache Spark MLlib and H2O

Categories: Data Science How-to Spark

Creating and training machine-learning models is more complex on distributed systems, but there are lots of frameworks for abstracting that complexity.

There are more options now than ever from proven open source projects for doing distributed analytics, with Python and R become increasingly popular. In this post, you’ll learn the options for setting up a simple read-eval-print (REPL) environment with Python and R within the Cloudera QuickStart VM using APIs for two of the most popular cluster computing frameworks: Apache Spark (with MLlib) and H2O (from the company with the same name).

Read More

How-to: Design An Analytic Database Schema on Apache Impala (Incubating) with Indyco

Categories: Guest How-to Impala

Our thanks to Manuel Spezzani, Indyco Technical Leader, and Edward William Gnudi, Indyco’s Chief of Customer Happiness, for the guest post below about using Indyco alongside Apache Impala.

In this post, you will learn how to automatically design a complete data warehouse solution on top of Impala using Indyco, a tool for designing, exploring, and understand your business model (recently named Cloudera Certificated Partner for the Impala platform).

Read More

How-to: Create and Use a Custom Formatter in the Apache HBase Shell

Categories: Avro HBase How-to Tools

Learn how improve Apache HBase usability by creating a custom formatter for viewing binary data types in the HBase shell.

Cloudera customers are looking to store complex data types in Apache HBase to provide fast retrieval of complex information such as banking transactions, web analytics records, and related metadata associated with those records. Serialization formats such as Apache Avro, Thrift, and Protocol Buffers greatly assist in meeting this goal,

Read More