Tag Archives: events

Using Apache Parquet at AppNexus

Categories: Guest, Impala, Parquet, Performance

Thanks to Chen Song, Data Team Lead at AppNexus, for allowing us to republish the following post about his company’s use case for Apache Parquet (incubating at this writing), the open standard for columnar storage across the Apache Hadoop ecosystem.

At AppNexus, our data pipeline ingests over 2 million log events every second. Upstream systems send log records as Protobuf messages. Raw logs are compressed with Snappy when stored on HDFS.

Read more

How-to: Tune Your Apache Spark Jobs (Part 1)

Categories: How-to, Spark

Learn techniques for tuning your Apache Spark jobs for optimal efficiency.

When you write Apache Spark code and page through the public APIs, you come across words like transformation, action, and RDD. Understanding Spark at this level is vital for writing Spark programs. Similarly, when things start to fail, or when you venture into the web UI to try to understand why your application is taking so long,

Read more

How-to: Do Real-Time Log Analytics with Apache Kafka, Cloudera Search, and Hue

Categories: Data Ingestion, How-to, Hue, Kafka, Search

Cloudera recently announced formal support for Apache Kafka. This simple use case illustrates how to make web log analysis, powered in part by Kafka, one of your first steps in a pervasive analytics journey.

If you are not looking at your company’s operational logs, then you are at a competitive disadvantage in your industry. Web server logs, application logs, and system logs are all valuable sources of operational intelligence,

Read more

How-to: Install and Use Cask Data Application Platform Alongside Impala

Categories: How-to, Impala, Tools

Cloudera customers can now install, launch, and monitor CDAP directly from Cloudera Manager. This post from Nitin Motgi, Cask CTO, explains how.

Today, Cloudera and Cask are very happy to introduce the integration of Cloudera’s enterprise data hub (EDH) with the Cask Data Application Platform (CDAP). CDAP is an integrated platform for developers and organizations to build, deploy, and manage data applications on Apache Hadoop. This initial integration will enable CDAP to be installed,

Read more

Couchdoop: Couchbase Meets Apache Hadoop

Categories: Guest, Hadoop

Thanks to Călin-Andrei Burloiu, Big Data Engineer at antivirus company Avira, and Radu Pastia, Senior Software Developer in the Big Data Team at Orange, for the guest post below about the Couchdoop connector for bringing Couchbase data into Hadoop.

Couchdoop is a Couchbase connector for Apache Hadoop, developed by Avira on CDH, that allows for easy, parallel data transfer between Couchbase and Hadoop storage engines. It includes a command-line tool,

Read more