Cloudera Developer Blog · Distribution Posts
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually and then releases updates to CDH every three months. Updates primarily comprise bug fixes but we will also add enhancements. We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator as a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). Click here for details of the basic architecture of Flume. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG), which is currently on the trunk branch; techniques and components that can be used to route the data; configuration validation; and finally support for serializing events.
In the past several months, contributors have been busy adding several new sources, sinks and channels to Flume. Flume now supports Syslog as a source, with new sources that accept Syslog over both TCP and UDP.
Flume now has a high performance persistent channel: the File Channel. This means that if the agent fails for any reason after events have been committed by the source but before they have been removed and the transaction committed by the sink, the events will be reloaded from disk and can be taken when the agent starts up again. The events will only be removed from the channel when the transaction is committed by the sink. The File Channel uses a Write Ahead Log to save events.
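To make the commit semantics above concrete, here is a minimal sketch in Python of a channel backed by a write-ahead log. It is a toy model and not Flume's actual File Channel: the class ToyFileChannel, its methods, the JSON record layout and the demo function are all hypothetical names chosen for illustration. The point it shows is that an event taken by a sink stays in the channel until the take is committed, so it reappears after a restart.

```python
import json
import os
import tempfile


class ToyFileChannel:
    """Toy write-ahead-log channel (hypothetical; not Flume's File Channel)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pending = {}  # event id -> body, rebuilt from the log on startup
        self.next_id = 0
        self._replay()

    def _append(self, record):
        # Persist the record to disk before changing in-memory state.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def _replay(self):
        # Rebuild the set of uncommitted events by reading the whole log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                if record["op"] == "put":
                    self.pending[record["id"]] = record["body"]
                    self.next_id = max(self.next_id, record["id"] + 1)
                elif record["op"] == "commit_take":
                    self.pending.pop(record["id"], None)

    def put(self, body):
        """Called by the source side: log the event, then make it visible."""
        event_id = self.next_id
        self.next_id += 1
        self._append({"op": "put", "id": event_id, "body": body})
        self.pending[event_id] = body
        return event_id

    def take(self):
        """Return the oldest pending event without removing it."""
        if not self.pending:
            return None
        event_id = min(self.pending)
        return event_id, self.pending[event_id]

    def commit_take(self, event_id):
        """Called by the sink side: only now is the event removed."""
        self._append({"op": "commit_take", "id": event_id})
        self.pending.pop(event_id, None)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "channel.log")
    channel = ToyFileChannel(path)
    channel.put("event-a")
    channel.put("event-b")
    first_id, _ = channel.take()
    channel.commit_take(first_id)  # the sink committed event-a
    channel.take()                 # the sink took event-b, then the agent "failed"
    restarted = ToyFileChannel(path)  # a new agent replays the log
    return list(restarted.pending.values())
```

Calling demo() simulates an agent that fails after the sink has taken the second event but before it commits; the restarted channel still holds that event. A real implementation would also need checkpoints and log rotation, which this sketch leaves out.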
Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:
I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.
There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:
This blog was originally posted on the Apache Blog: https://blogs.apache.org/flume/entry/flume_ng_architecture
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/flume. Flume NG is work related to a new major revision of Flume and is the subject of this post.
Prior to entering the incubator, Flume saw incremental releases leading up to version 0.9.4. As Flume was adopted more widely, it became clear that certain design choices would need to be reworked in order to address problems reported in the field. The work necessary to make this change began a few months ago under the JIRA issue FLUME-728. This work currently resides on a separate branch by the name flume-728, and is informally referred to as Flume NG. At the time of writing this post, Flume NG had gone through two internal milestones, NG Alpha 1 and NG Alpha 2, and a formal incubator release of Flume NG is in the works.
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive
Pero works on research and development in new technologies for online advertising at Aol Advertising R&D in Palo Alto. Over the past 4 years he has been the Chief Architect of the R&D distributed ecosystem, which comprises more than a thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at AOL, Yahoo and Cadence Design Systems and published patents and research papers in these areas.
A critical premise for success of online advertising networks is to successfully collect, organize, analyze and use large volumes of data for decision making. Given the nature of their online orientation and dynamics, it is critical that these processes be automated to the largest extent possible.
Specifically, the success of advertising technology and its impact on revenue are directly proportional to its capability to use large amounts of data in order to compute proper impression value given the unique circumstances of ad serving events such as the characteristics of the impression, the ad, and the user as well as the content and context. As a general rule, more data results in more accurate predictions.
Philip Zeyliger is a software engineer at Cloudera and started the SCM
Two weeks ago, at Hadoop Summit, we released our Service and Configuration Manager (SCM) Express. It’s a dramatically simpler and faster way to get started with Cloudera’s Distribution including Apache Hadoop (CDH). In a previous blog post, we talked in some detail about SCM Express and what it can do for you.
The screencast included in this post demonstrates the simplicity of a CDH installation using SCM Express. The “Directors” conversing in the background are engineers Philip Langdale and Philip Zeyliger and VP of Products, Charles Zedlewski.
Phil Langdale is a software engineer at Cloudera and the technical lead for Cloudera’s SCM Express product.
What is SCM Express?
The Only Full Lifecycle Management for Apache Hadoop: Introducing Cloudera Enterprise 3.5 and SCM Express
Drew O’Brien is a product marketing manager at Cloudera
We’re excited to share the news about the immediate availability of Cloudera Enterprise 3.5 and SCM Express, which we announced this week in tandem with our presence at Hadoop Summit. These products represent a major advance in Cloudera’s mission to drive massive enterprise adoption of 100% open source Apache Hadoop. We now make it easier and more convenient than ever before for companies to run and manage Apache Hadoop clusters throughout their entire operational lifecycle.
Cloudera Enterprise 3.5 is a substantial update to our subscription service that delivers production support and management software for Apache Hadoop and the entire Apache Hadoop ecosystem. With new features like automated service and configuration tools, activity monitoring, and one-click security, we’ve streamlined the extremely complex processes and eliminated the uncertainties associated with ongoing management and maintenance of Hadoop clusters in production. Cloudera Enterprise 3.5 codifies and makes available best practices Cloudera has learned over many years of helping enterprise customers build and manage Apache Hadoop-based systems.