The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive
Abstract: CBS Interactive successfully adopted Hadoop as the web analytics platform, processing one Billion weblogs daily from hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack—the Extraction, Transformation and Loading framework we built based on python and streaming, which is under review for Open-Source release—Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive achieved robustness, fault-tolerance and scalability, and significant reduction of processing time to reach SLA (over six hours reduction so far).
Gateway: Cluster Virtualization Framework
Konstantin Shvachko, eBay
Abstract: Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users workplace computers through corporate firewalls; the ability to failover to active clusters for scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop.
SHERPASURFING Open Source Cyber Security Solution
Wayne Wheeles, Novii Design
Abstract: Every day billions of packets, both benign and some malicious, flow in and out of networks. Every day it is an essential task for the modern Defensive Cyber Security Organization to be able to reliably survive the sheer volume of data, bring the NETFLOW data to rest, enrich it, correlate it and perform. SHERPASURFING is an open source platform built on the proven Clouderas Distribution including Apache Hadoop that enables organizations to perform the Cyber Security mission and at scale at an affordable price point. This session will include an overview of the solution and components, followed by a demonstration of analytics.
Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
Arvind Prabhakar, Cloudera
Guy Harrison, Quest Software
Abstract: As Hadoop graduates from pilot project to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative. Well look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, well deep dive into the Apache SQOOP project, which expedites data movement between Hadoop and any JDBC database, as well as providing an framework which allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza etc.
Next Generation Apache Hadoop MapReduce
Mahadev Konar, Hortonworks
Abstract: The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization. We will be presenting the architecture and design of the next generation of map reduce and will delve into the details of the architecture that makes it much easier to innovate. We will also be presenting large scale and small scale comparisons on some benchmarks with MRV1.