Cloudera Blog · General Posts
The post below was originally published via blogs.apache.org and is republished below for your reading pleasure.
This blog post is about Apache Flume’s File Channel. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
FileChannel is a persistent Flume channel that supports writing to multiple disks in parallel and encryption.
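As a minimal sketch of what that looks like in practice (the agent name and directory paths here are hypothetical, though `checkpointDir` and `dataDirs` are the documented FileChannel properties), a FileChannel is declared in a Flume properties file, and listing multiple data directories is what spreads writes across disks in parallel:

```properties
# Hypothetical agent named "agent" with one file-backed channel "fc"
agent.channels = fc
agent.channels.fc.type = file

# Checkpoint metadata lives on one disk...
agent.channels.fc.checkpointDir = /var/flume/checkpoint

# ...while event data is striped across several disks in parallel
agent.channels.fc.dataDirs = /disk1/flume/data,/disk2/flume/data
```

Putting `checkpointDir` and each entry of `dataDirs` on separate physical disks is what lets the channel sustain higher write throughput than a single spindle would allow.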
With Strata Conference + Hadoop World 2012 just over a month away, it’s time to start planning your schedule. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change, of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
If you’re interested in community meetups as well, refer to my post from last week on that subject – several are planned.
| Session | Presenter(s) | Date | Time |
|---|---|---|---|
| An Introduction to Hadoop | Mark Fei | Tues., Oct. 23 | 9am |
| Using HBase | Amandeep Khurana, Matteo Bertozzi | Tues., Oct. 23 | 9am |
| Testing Hadoop Applications | Tom Wheeler | Tues., Oct. 23 | 9am |
| Building a Large-scale Data Collection System Using Flume NG | Hari Shreedharan, Will McQueen, Arvind Prabhakar, Prasad Mujumdar, Mike Percy | Tues., Oct. 23 | 1:30pm |
| Given Enough Monkeys – Some Thoughts on Randomness | Jesse Anderson | Tues., Oct. 23 | 3:20pm |
| Keynote: Big Answers | Mike Olson | Weds., Oct. 24 | 8:55am |
| Large Scale ETL with Hadoop | Eric Sammer | Weds., Oct. 24 | 11:40am |
| HDFS – What is New and Future | Todd Lipcon (co-presenter) | Weds., Oct. 24 | 4:10pm |
| High Availability for the HDFS NameNode: Phase 2 | Aaron Myers, Todd Lipcon | Weds., Oct. 24 | 5pm |
| Plenary Session: Beyond Batch | Doug Cutting | Thurs., Oct. 25 | 9:20am |
| Upcoming Enterprise Features in Apache HBase 0.96 | Jon Hsieh | Thurs., Oct. 25 | 11:40am |
| Data Science on Hadoop: What’s There and What’s Missing | Justin Erickson | Thurs., Oct. 25 | 1:40pm |
| Taming the Elephant – Learn How Monsanto Manages Their Hadoop Cluster to Enable Genome/Sequence Processing | Bala Venkatrao, Aparna Ramani (with others) | Thurs., Oct. 25 | 4:10pm |
| Knitting Boar | Josh Patterson, Michael Katzenellenbogen | Thurs., Oct. 25 | 4:10pm |
What do you do at Cloudera, and in which Apache project are you involved?
For the last year and a half, I’ve been an engineer on the Enterprise team. We’re the guys who build Cloudera Manager, and all the goodies that make it easy to manage and administer Apache Hadoop clusters. Specifically, I’ve worked on a number of things across the product, like scale and performance for the databases underlying the various monitoring tools available in the Enterprise edition of Cloudera Manager. I’ve also worked extensively on our operational reporting and HDFS file search capabilities. While I don’t work full-time on any of the Apache projects, I have been known to contribute to Apache Hive and Hadoop on rainy days.
Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing they may not be able to talk directly to everyone they want to target, marketing departments can be more efficient by being selective about whom they reach out to.
In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline are available on the Cloudera Github.
Who is Influential?
To understand whom we should target, let’s take a step back and try to understand the mechanics of Twitter. A user – let’s call him Joe – follows a set of people, and has a set of followers. When Joe sends an update out, that update is seen by all of his followers. Joe can also retweet other users’ updates. A retweet is a repost of an update, much like you might forward an email. If Joe sees a tweet from Sue, and retweets it, all of Joe’s followers see Sue’s tweet, even if they don’t follow Sue. Through retweets, messages can get passed much further than just the followers of the person who sent the original tweet. Knowing that, we can try to engage users whose updates tend to generate lots of retweets. Since Twitter tracks retweet counts for all tweets, we can find the users we’re looking for by analyzing Twitter data.
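The core of that analysis can be sketched in a few lines, independent of the pipeline that will eventually run it. In this hypothetical Python sketch, the `top_influencers` helper is ours (not part of any library), though the `user` and `retweet_count` field names mirror what the Twitter API returns per tweet:

```python
from collections import defaultdict

def top_influencers(tweets, n=3):
    """Sum retweet counts per user and return the top n (user, total) pairs."""
    totals = defaultdict(int)
    for t in tweets:
        totals[t["user"]] += t["retweet_count"]
    # Highest total retweets first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy data standing in for the records the pipeline would collect
tweets = [
    {"user": "sue", "retweet_count": 12},
    {"user": "joe", "retweet_count": 3},
    {"user": "sue", "retweet_count": 7},
]
print(top_influencers(tweets, n=2))  # → [('sue', 19), ('joe', 3)]
```

At Twitter scale this aggregation is exactly the kind of group-and-sum that Hive expresses naturally over data Flume has landed in HDFS, which is what the rest of the series builds toward.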
Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.
Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)
As you can see, these meetups are highly parallel, so you will either have to make careful choices or have very quick feet. The good news is: there’s something for everybody.
This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).
Most western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.
Sometimes solving technical problems is similar to navigating a city without many street names: Once you arrive at the desired location, the path seems obvious, but on the way there are many detours and interesting sights to be seen.
What’s to love about Cloudera Enterprise? A lot! But rather than bury you in documentation today, we’d rather bring you a less-than-two-minute-long video:
For those new to it, Cloudera Manager is the first and market-leading management platform for CDH (Cloudera’s Distribution Including Apache Hadoop). Enterprise customers are coming to expect an end-to-end tool that manages the entire lifecycle of their Hadoop operations. In fact, in a recent Cloudera customer survey, an overwhelming 95% emphasized the need for this approach.
Cloudera Manager sets the standard for enterprise deployment by delivering granular visibility into and control over every part of CDH – empowering operators to improve cluster performance, enhance quality of service, increase compliance and reduce administrative costs. We also have a free edition to get started, so try it out today! (BTW, for more information on this subject, you can attend a free webinar on Wednesday, Sept. 19, on the topic “How CBS Interactive Uses Cloudera Manager to Effectively Manage Their Hadoop Cluster”.)
This is the second blog post about Apache HBase replication. The previous post, HBase Replication Overview, discussed use cases, architecture, and the different modes supported in HBase replication. This post takes an operational perspective and will touch upon HBase replication configuration, along with key concepts for using it — such as bootstrapping, schema change, and fault tolerance.
As mentioned in HBase Replication Overview, the master cluster sends shipment of WALEdits to one or more slave clusters. This section describes the steps needed to configure replication in a master-slave mode.
- All tables/column families that are to be replicated must exist on both clusters.
- Add the following property in $HBASE_HOME/conf/hbase-site.xml on all nodes of both clusters, and set it to true.
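In HBase releases of that era, the switch enabling cluster replication was the `hbase.replication` property (naming it here is an inference from contemporaneous HBase documentation). The hbase-site.xml entry would look like:

```xml
<!-- Enable replication; must be set on every node of both clusters -->
<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
```

After changing this property, the relevant HBase daemons need a restart for it to take effect.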
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following: