Thanks to Jonathan Natkins, a field engineer from StreamSets, for the guest post below about using StreamSets Data Collector—open source, GUI-driven ingest technology for developing and operating data pipelines with a minimum of code—and Cloudera Search and HUE to build a real-time search environment.
As pressure mounts on data engineers to deliver more data from more sources in less time, StreamSets Data Collector can serve as a linchpin in the data management process,
Recently, GoDataDriven installed a Cloudera Enterprise (CDH + Cloudera Manager) cluster on Microsoft Azure. This two-part series, written by Alexander Bij and Tünde Alkemade and republished with permission, includes information about use case, design, and installation.
Processing large amounts of unstructured data requires serious computing power and also maintenance effort. As load on computing power typically fluctuates due to time and seasonal influences and/or processes running on certain times,
Creating and training machine-learning models is more complex on distributed systems, but there are lots of frameworks for abstracting that complexity.
There are more options now than ever from proven open source projects for doing distributed analytics, with Python and R become increasingly popular. In this post, you’ll learn the options for setting up a simple read-eval-print (REPL) environment with Python and R within the Cloudera QuickStart VM using APIs for two of the most popular cluster computing frameworks: Apache Spark (with MLlib) and H2O (from the company with the same name).
Our thanks to Manuel Spezzani, Indyco Technical Leader, and Edward William Gnudi, Indyco’s Chief of Customer Happiness, for the guest post below about using Indyco alongside Apache Impala.
In this post, you will learn how to automatically design a complete data warehouse solution on top of Impala using Indyco, a tool for designing, exploring, and understand your business model (recently named Cloudera Certificated Partner for the Impala platform).
Learn how improve Apache HBase usability by creating a custom formatter for viewing binary data types in the HBase shell.
Cloudera customers are looking to store complex data types in Apache HBase to provide fast retrieval of complex information such as banking transactions, web analytics records, and related metadata associated with those records. Serialization formats such as Apache Avro, Thrift, and Protocol Buffers greatly assist in meeting this goal,