<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudera Blog</title>
	<atom:link href="http://blog.cloudera.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.cloudera.com</link>
	<description>Cloudera&#039;s Blog</description>
	<lastBuildDate>Tue, 21 May 2013 17:10:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>If It&#8217;s Tuesday, There Must Be a &quot;Data Ride&quot;</title>
		<link>http://blog.cloudera.com/blog/2013/05/if-its-tuesday-there-must-be-a-data-ride/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/if-its-tuesday-there-must-be-a-data-ride/#comments</comments>
		<pubDate>Tue, 21 May 2013 13:33:25 +0000</pubDate>
		<dc:creator>Doug Cutting (@cutting)</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21657</guid>
		<description><![CDATA[Mark your calendars, all you data cyclists! I’m visiting Paris, London, and Edinburgh this June. When I travel I like to talk to locals. And, wherever I am, I like to bicycle. So, I thought I might combine these interests and host “data rides” in these three cities.]]></description>
			<content:encoded><![CDATA[<p>Mark your calendars, all you data cyclists!</p>
<p>I’m visiting Paris, London, and Edinburgh this June. When I travel I like to talk to locals. And, wherever I am, I like to bicycle. So, I thought I might combine these interests and host “data rides” in these three cities.</p>
<table style="width: 160px;" border="0" align="right">
<tbody>
<tr>
<td><img src="http://blog.cloudera.com/wp-content/uploads/2013/05/IMAG0008_1.jpg" alt="" title="IMAG0008_1" width="150" height="228" class="size-full wp-image-21658" />
</td>
</tr>
</tbody>
</table>
<p>In each city I’ll name a time and a meeting point, and then ride the local roads for an hour or two with whoever shows up. Afterward, we might need some libations at a local pub. I might even get Cloudera to throw in some schwag.</p>
<p>Ride dates are as follows:</p>
<ul>
<li dir="ltr">
<p><strong>Paris:</strong> Tuesday, June 4</p>
</li>
<li dir="ltr">
<p><strong>London:</strong> Tuesday, June 11</p>
</li>
<li dir="ltr">
<p><strong>Edinburgh:</strong> Tuesday, June 18</p>
</li>
</ul>
<p>All rides will start at 5pm. I’ll pick a meeting point closer to the event and tweet it from <a href="https://twitter.com/cutting">@cutting</a>. Also, please tweet me if you have ideas of where we should ride.</p>
<p><em>Doug Cutting is Cloudera’s chief architect, a founder of the Apache Lucene and Apache Hadoop projects, and the current chair of the Apache Software Foundation.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/if-its-tuesday-there-must-be-a-data-ride/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Customer Spotlight: Gravity Creates Personalized Web Experience, 300-400% Higher Click-through</title>
		<link>http://blog.cloudera.com/blog/2013/05/customer-spotlight-gravity-creates-personalized-web-experience-300-400-higher-click-through/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/customer-spotlight-gravity-creates-personalized-web-experience-300-400-higher-click-through/#comments</comments>
		<pubDate>Mon, 20 May 2013 13:50:41 +0000</pubDate>
		<dc:creator>Karina Babcock (@karinababcock)</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Use Case]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21711</guid>
		<description><![CDATA[According to Jim Benedetto, Gravity’s co-founder and CTO, there have been two paradigm shifts that have transformed consumers’ web experience to date: Search — Google figured out how to index all of the content on the internet so you (internet user) can find what you’re looking for. Social — Facebook, Twitter, LinkedIn and other social [...]]]></description>
			<content:encoded><![CDATA[<p>According to <a href="http://www.gravity.com/team">Jim Benedetto</a>, <a href="http://www.gravity.com">Gravity</a>’s co-founder and CTO, there have been two paradigm shifts that have transformed consumers’ web experience to date:</p>
<ul>
<li><strong>Search</strong> — Google figured out how to index all of the content on the internet so you (internet user) can find what you’re looking for.</li>
<li><strong>Social</strong> — Facebook, Twitter, LinkedIn and other social sites give your friends and other social connections a mechanism to push content you’re interested in to you, so you don’t have to search for it yourself.</li>
</ul>
<p>So what does Gravity do? Its goal is to drive the third paradigm shift:</p>
<ul>
<li><strong>Personal</strong> &#8212; Creating a web experience that is totally optimized based on your individual interests, behaviors, and preferences. Or, as Jim puts it, “showing you today what you’re going to search for tomorrow.”</li>
</ul>
<p>Gravity collects and processes more than 10,000 data points every second. All of the data collected is loaded into HDFS, where two Apache Hadoop processes run. The first is a dynamic, real-time system that relies on “eventual consistency,” meaning it correctly processes as many data points as it can (covering both user activity across the web and content being published) in real time; 99.99% of that traffic is processed correctly. The second system runs every hour or two, catching the 0.01% of data points that were missed the first time around. Once the data is processed, it lands in Apache HBase, where it is serialized and can be accessed via Apache Hive.</p>
<p>With several Scala engineers in house, the Gravity team decided in 2011 to use the Scala programming language instead of Java. Scala doesn’t natively integrate with Hadoop or HBase, so the team wrote its own open source library, <a href="https://github.com/GravityLabs/HPaste">HPaste</a>, which allows Scala engineers to take advantage of Scala’s features on top of HBase.</p>
<p>The results of this system?</p>
<ul>
<li>Higher click-through rates (CTR) — Gravity has measured CTR of people who engage with their personalized content versus standard segmented or generic content, and they’ve proven that personalized content delivers 300-400% higher CTR.</li>
<li>Longer sessions — When personalized content is displayed on a web page, users stay on the page longer, which is a strong indication that they like the site more.</li>
<li>More repeat visitors — If a web visitor sees personalized content their first time visiting a site, the number of times they return to that site afterward is more than 10X higher than when they engage with static content shown to all visitors. Gravity has proven this at scale across some of its largest customers.</li>
</ul>
<p>Want to learn more?</p>
<ul>
<li><a href="http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Cloudera_Gravity_CaseStudy_Final.pdf">Read the full case study</a>.</li>
<li><a href="http://www.cloudera.com/content/cloudera/en/resources/library/video/gravity-creates-a-personalized-web-experience.html">Watch Gravity’s Jim Benedetto explain its use case on video</a>.</li>
<li><a href="https://github.com/GravityLabs/HPaste">Explore the HPaste project on GitHub</a>.</li>
<li><a href="http://www.gravity.com">Learn more about Gravity</a>.
</li>
</ul>
<p><em>Karina Babcock is Cloudera’s Customer Programs &amp; Marketing Manager.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/customer-spotlight-gravity-creates-personalized-web-experience-300-400-higher-click-through/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Meet the Project Founder: Roman Shaposhnik</title>
		<link>http://blog.cloudera.com/blog/2013/05/meet-the-project-founder-roman-shaposhnik/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/meet-the-project-founder-roman-shaposhnik/#comments</comments>
		<pubDate>Fri, 17 May 2013 16:30:09 +0000</pubDate>
		<dc:creator>Justin Kestelyn (@kestelyn)</dc:creator>
				<category><![CDATA[Bigtop]]></category>
		<category><![CDATA[Community]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Meet the Engineer]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21704</guid>
		<description><![CDATA[This installment of &#8220;Meet the Project Founder&#8221; features Apache Bigtop founder and PMC Chair/VP Roman Shaposhnik. What led you to your project idea(s)? Conceptually, Apache Bigtop can actually be traced as far back as me working at Sun Microsystems in 2007-2008. I was assisting the team responsible for coming up with a 100% community-driven, open [...]]]></description>
			<content:encoded><![CDATA[<p><img style="float: right; margin: 5px; padding: 0px 10px 0px 0px;" src="/wp-content/uploads/2013/05/headshot-2.jpg" alt="Roman Shaposhnik" /> <em>This installment of &#8220;Meet the Project Founder&#8221; features Apache Bigtop founder and PMC Chair/VP Roman Shaposhnik.</em></p>
<p><strong>What led you to your project idea(s)?</strong></p>
<p>Conceptually, <a href="http://bigtop.apache.org">Apache Bigtop</a> can actually be traced as far back as me working at Sun Microsystems in 2007-2008. I was assisting the team responsible for coming up with a 100% community-driven, open source Solaris distribution that could also be used as a basis for an enterprise-grade commercial product offering (which eventually became OpenSolaris). I then joined Yahoo! Inc. as a manager of a small team of extremely talented engineers tasked with integration efforts around Yahoo&#8217;s internal cloud offering based on Hadoop. Our project was called HIT (Hadoop Integration Testing) and we were known as &#8220;HIT-men&#8221;.</p>
<p><strong>Aside from doing the initial commit, what is your definition of the project founder’s role across the lifespan of the project? Benevolent dictator, referee, silent partner?</strong></p>
<p>Honestly, my role model is Linus Torvalds. He&#8217;s somebody who&#8217;s deeply passionate about the state of the community yet he still finds enough time to be involved with most technical aspects of the Linux kernel on a daily basis. But at the end of the day, he&#8217;s just plain fun to be around. Of course, the governance framework of Apache Software Foundation is quite different than the governance model of the Linux kernel. I use that excuse when I can&#8217;t quite measure up to Linus where influence is concerned.</p>
<p><strong>What has surprised you the most about how your project has evolved/matured? </strong></p>
<table style="width: 120px; margin: 6px; padding: 0px 0px 0px 0px;" align="right">
<tbody>
<tr>
<td>
<h3>I&#8217;m still amazed at how quickly the &#8216;Powered by Bigtop&#8217; list is growing.</h3>
</td>
</tr>
</tbody>
</table>
<p>The elevator pitch for Bigtop has always been: Bigtop is to Hadoop what Debian is to Linux. The most surprising development to me was how well that message resonates with the commercial vendors in the Big Data space. I&#8217;m still amazed at how quickly the <a href="https://cwiki.apache.org/BIGTOP/powered-by-bigtop.html">&#8220;Powered by Bigtop&#8221;</a> list is growing.</p>
<p><strong>What is the major work yet to be done, from your perspective as the project’s founder?</strong></p>
<p>Developers, developers, developers!</p>
<p>We have to grow the community by leaps and bounds if we want to be remembered as the Debian of Hadoop. This translates into investing in outreach activities, but also (and maybe primarily) into creating enough value in the project itself so that external developers get hooked. Just steer clear of trying to boil the ocean by yourself &#8212; make the project interesting enough, and the developer community will come to the party.</p>
<p><strong>What is your philosophy, if you have one, for balancing quality versus quantity with respect to contributions?</strong></p>
<p>That&#8217;s a tough one. In the ideally balanced community everybody keeps an eye on all the proposed changes and casts +/-1 as needed. Hence there&#8217;s a self-throttling process that also provides a learning opportunity for the newcomers. Fundamentally though, developers don&#8217;t like to review patches; they like to write code. Making the review duty appealing is like making the broccoli-eating process appealing to your toddler. The bottom line is: You have to get creative (and personally brace yourself for way more reviews than direct code contributions).</p>
<p>At the same time, accepting somebody&#8217;s patches is a great way of keeping that contributor around. Hence, personally, I try and err on the side of openness and community growth over polishing each patch ad infinitum. But my personal coding philosophy is: Commit early, commit often, and re-factor mercilessly.</p>
<p><strong>Any other advice for other potential project founders?</strong></p>
<p>Attachment is the root of all suffering; don&#8217;t get attached to your own code or ideas. The only thing that lasts is community. At ASF, we are reminded of the &#8220;community over code&#8221; mantra all the time, but it&#8217;s not just a phrase. It&#8217;s for real.</p>
<p>P.S.: Oh, and here&#8217;s one more crucial bit of advice: Before naming your project, make sure to check that the vanity license plate with its name is available in your state.</p>
<p><strong>Read other &#8220;Meet the Project Founders&#8221; installments:</strong></p>
<p>- <a title="Meet the Project Founder: Doug Cutting (First in a Series)" href="http://blog.cloudera.com/blog/2013/04/meet-the-project-founder-doug-cutting-first-in-a-series/">Doug Cutting</a> (Apache Hadoop, Apache Avro, Apache Lucene)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/meet-the-project-founder-roman-shaposhnik/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How-to: Configure Eclipse for Hadoop Contributions</title>
		<link>http://blog.cloudera.com/blog/2013/05/how-to-configure-eclipse-for-hadoop-contributions/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/how-to-configure-eclipse-for-hadoop-contributions/#comments</comments>
		<pubDate>Wed, 15 May 2013 16:33:47 +0000</pubDate>
		<dc:creator>Justin Kestelyn (@kestelyn)</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[How-to]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[contributing]]></category>
		<category><![CDATA[eclipse]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21369</guid>
		<description><![CDATA[Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins. [...]]]></description>
			<content:encoded><![CDATA[<p>Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. <a href="http://eclipse.org">Eclipse</a> is a popular choice thanks to its broad user base and multitude of available plugins.</p>
<p>This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in <a href="http://blog.cloudera.com/blog/2012/08/developing-cdh-applications-with-maven-and-eclipse/">a different post</a>.) Hadoop has changed a great deal since our <a href="http://blog.cloudera.com/blog/2009/04/configuring-eclipse-for-hadoop-development-a-screencast/">previous post</a> on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The <a href="http://wiki.apache.org/hadoop/EclipseEnvironment">EclipseEnvironment Hadoop wiki page</a> is a good starting point for development on trunk.)</p>
<p>This post covers the following main flavors:</p>
<ul>
<li>The traditional implementation of MapReduce based on the JobTracker/TaskTracker architecture (MR1) running on top of HDFS. Apache Hadoop 1.x and CDH3 releases, among others, capture this setup.</li>
<li>A highly-scalable MapReduce (<a href="http://blog.cloudera.com/blog/2012/10/mr2-and-yarn-briefly-explained/">MR2</a>) running over YARN and an improved HDFS 2.0 (Federation, HA, Transaction IDs), captured by Apache Hadoop 2.x and CDH4 releases.</li>
<li>Traditional MapReduce running on HDFS-2 &#8212; that is, the stability of MR1 running over critical improvements in HDFS-2. CDH4 MR1 ships this configuration.</li>
</ul>
<p>The table below lists the releases, the build tool each one uses, and the preferred version of that tool:</p>
<table width="468" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<div align="center"><strong>Release</strong></div>
</td>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center"><strong>Build Tool (preferred version)</strong></p>
</td>
</tr>
<tr>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center">CDH3 (Hadoop 1.x)</p>
</td>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center">Ant (1.8.2)</p>
</td>
</tr>
<tr>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center">CDH4 (Hadoop 2.x) HDFS</p>
</td>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center">Maven (3.0.2)</p>
</td>
</tr>
<tr>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center">CDH4 (Hadoop 2.x) MR2</p>
</td>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center">Maven (3.0.2)</p>
</td>
</tr>
<tr>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center">CDH4 MR1</p>
</td>
<td style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="center">Ant (1.8.2)</p>
</td>
</tr>
</tbody>
</table>
<p>Other Requirements:</p>
<ul>
<li>Oracle Java 1.6 or later</li>
<li>Eclipse (Indigo/Juno)</li>
</ul>
<h2>Setting Up Eclipse</h2>
<ol>
<li>First, we need to set a couple of classpath variables so Eclipse can find the dependencies.</li>
<ol>
<li>Go to Window -&gt; Preferences.</li>
<li>Go to Java -&gt; Build Path -&gt; Classpath Variables.</li>
<li>Add a new entry with name ANT_PATH and path set to the Ant home on your machine, typically /usr/share/ant.</li>
<li>Add another new entry with name M2_REPO and path set to your Maven repository, typically $HOME/.m2/repository (e.g. /home/user/.m2/repository).
<p align="center"><a href="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse1.png"><img title="eclipse1" src="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse1.png" alt="" width="600" height="506" /></a></p>
</li>
</ol>
<li>Hadoop requires tools.jar, which is under JDK_HOME/lib. Eclipse may not pick it up automatically, so add it manually:</li>
<ol>
<li>Go to Window -&gt; Preferences -&gt; Java -&gt; Installed JREs.</li>
<li>Select the right Java version from the list, and click &#8220;Edit&#8221;.</li>
<li>In the pop-up, click “Add External JARs”, navigate to &#8220;JDK_HOME/lib&#8221;, and add &#8220;tools.jar&#8221;.
<p align="center"><a href="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse2.png"><img title="eclipse2" src="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse2.png" alt="" width="600" height="505" /></a></p>
</li>
</ol>
<li>Hadoop uses a particular style of formatting. When contributing to the project, you are required to follow the style guidelines: standard Java formatting that uses spaces only, with both indentation and tab width set to 2 spaces. To do that:</li>
<ol>
<li>Go to Window -&gt; Preferences.</li>
<li>Go to Java -&gt; Code Style -&gt; Formatter.</li>
<li>Import this <a href="https://github.com/cloudera/blog-eclipse/blob/master/hadoop-format.xml">Formatter</a>.
<p align="center"><a href="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse3.png"><img title="eclipse3" src="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse3.png" alt="" width="600" height="456" /></a></p>
</li>
<li>It is a good practice to enable automatic formatting of the modified code when you save a file. To do that, go to Window-&gt;Preferences-&gt;Java-&gt;Editor-&gt;Save Actions and select “Perform the selected actions on save”, “Format source code”, “Format edited lines”. Also, de-select “Organize imports”.
<p align="center"><a href="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse4.png"><img title="eclipse4" src="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse4.png" alt="" width="600" height="406" /></a></p>
</li>
</ol>
<li>For Maven projects, the <a href="http://wiki.eclipse.org/Maven_Integration">m2e plugin</a> can be very useful. To install the plugin, go to Help -&gt; Install New Software. Enter &#8220;http://download.eclipse.org/technology/m2e/releases&#8221; into the “Work with” box, then select and install the m2e plugins.
<p align="center"><a href="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse5.png"><img src="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse5.png" alt="" width="600" height="557" /></a></p>
</li>
</ol>
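<p>The values entered in steps 1 and 2 above can be worked out ahead of time. The following is only a minimal sketch: it assumes the default locations named in this post, and it assumes that the JAVA_HOME environment variable points at the JDK (the post calls this JDK_HOME). The function names are made up for illustration.</p>
<pre>
```python
import os
import posixpath


def classpath_variables(home, ant_home="/usr/share/ant"):
    """Return the two Eclipse classpath variables described in step 1.

    ant_home defaults to the typical location named in the post; pass your
    own value if Ant is installed elsewhere.
    """
    return {
        "ANT_PATH": ant_home,
        "M2_REPO": posixpath.join(home, ".m2", "repository"),
    }


def tools_jar(jdk_home):
    """Return the expected location of tools.jar under a JDK install (step 2)."""
    return posixpath.join(jdk_home, "lib", "tools.jar")


if __name__ == "__main__":
    # Print the values to paste into the Eclipse preference dialogs.
    for name, path in sorted(classpath_variables(os.path.expanduser("~")).items()):
        print(name, "=", path)
    # Assumption: JAVA_HOME points at the JDK that Eclipse should use.
    jdk_home = os.environ.get("JAVA_HOME")
    if jdk_home:
        print("tools.jar =", tools_jar(jdk_home))
```
</pre>
<p>Running the script prints the paths to enter in the Classpath Variables and Installed JREs dialogs; it does not change any Eclipse setting itself.</p>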
<h2>Configuration for Hadoop 1.x / CDH3</h2>
<ol>
<li>Fetch the Hadoop source using <a href="http://wiki.apache.org/hadoop/HowToContribute">Subversion</a> or <a href="http://wiki.apache.org/hadoop/GitAndHadoop">Git</a> and check out branch-1 or the particular release branch. Alternatively, download a source tarball from the <a href="http://archive.cloudera.com/cdh/3/">CDH3 releases</a> or <a href="http://hadoop.apache.org/releases.html">Hadoop releases</a>.</li>
<li>Generate Eclipse project information using Ant via command line:</li>
<ol>
<li>For Hadoop (1.x or branch-1), run <code>ant eclipse</code>.</li>
<li>For CDH3 releases, run <code>ant eclipse-files</code>.</li>
</ol>
<li>Pull sources into Eclipse:</li>
<ol>
<li>Go to File -&gt; Import.</li>
<li>Select General -&gt; Existing Projects into Workspace.</li>
<li>For the root directory, navigate to the top directory of the source downloaded above.</li>
</ol>
</ol>
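<p>If you script your checkouts, the choice of Ant target in step 2 can be captured in a small lookup. This is a sketch only: the flavor labels used as dictionary keys are invented for illustration, while the Ant targets themselves are the ones named above.</p>
<pre>
```python
# Ant targets named in the post, keyed by hypothetical flavor labels.
ECLIPSE_TARGETS = {
    "hadoop-1.x": "eclipse",
    "branch-1": "eclipse",
    "cdh3": "eclipse-files",
}


def eclipse_command(flavor):
    """Return the Ant command (as an argument list) for a source flavor."""
    try:
        target = ECLIPSE_TARGETS[flavor]
    except KeyError:
        raise ValueError("unknown flavor: %r" % (flavor,))
    return ["ant", target]
```
</pre>
<p>The returned list can be passed to <code>subprocess.check_call</code> with <code>cwd</code> set to the top directory of the source tree, provided Ant is on your PATH.</p>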
<h2>Configuration for Hadoop 2.x / CDH4 MR2</h2>
<p>Apache Hadoop 2.x (branch-2/trunk based) and CDH4.x have the same directory structure and use Maven as the build tool.</p>
<ol>
<li>Again, fetch the sources using svn/git and check out the appropriate branch, or download release source tarballs (see <a href="https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads">CDH Downloads</a>).</li>
<li>Using the m2e plugin we installed earlier:</li>
<ol>
<li>Navigate to the top level and run <code>mvn generate-sources generate-test-sources</code>.</li>
<li>Import project into Eclipse:</li>
<ol>
<li>Go to File -&gt; Import.</li>
<li>Select Maven -&gt; Existing Maven Projects.</li>
<li>Navigate to the top directory of the downloaded source.
<p align="center"><a href="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse6.png"><img title="eclipse6" src="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse6.png" alt="" width="600" height="537" /></a></p>
</li>
</ol>
<li>The generated sources (e.g. *Proto.java files that are generated using protoc) might not be directly linked and can show up as errors. To fix them, select the project and configure the build path to include the Java files under target/generated-sources and target/generated-test-sources. For the inclusion pattern, select &#8220;**/*.java&#8221;.
<p align="center"><a href="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse7.png"><img src="http://blog.cloudera.com/wp-content/uploads/2013/04/eclipse7.png" alt="" width="600" height="466" /></a></p>
</li>
</ol>
<li>Without using the m2e plugin:</li>
<ol>
<li>Generate Eclipse project information using Maven: <code>mvn clean &amp;&amp; mvn install -DskipTests &amp;&amp; mvn eclipse:eclipse</code>. Note: <code>mvn eclipse:eclipse</code> generates a static .classpath file that Eclipse uses; this file isn&#8217;t automatically updated as the project or its dependencies change.</li>
<li>Pull sources into Eclipse:</li>
<ol>
<li>Go to File -&gt; Import.</li>
<li>Select General -&gt; Existing Projects into Workspace.</li>
<li>For the root directory, navigate to the top directory of the source downloaded above.</li>
</ol>
</ol>
</ol>
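<p>With many Maven modules, it can be tedious to find which ones actually have generated sources that need adding to the build path. The sketch below walks a source tree and lists the target/generated-sources and target/generated-test-sources folders that contain Java files. It assumes the Maven generate step has already been run; the function names are illustrative and not part of any Hadoop tooling.</p>
<pre>
```python
import os

GENERATED_DIRS = ("generated-sources", "generated-test-sources")


def generated_source_folders(source_root):
    """Return target/generated-* folders under source_root that hold .java files."""
    found = []
    for dirpath, dirnames, _ in os.walk(source_root):
        if os.path.basename(dirpath) != "target":
            continue
        for name in GENERATED_DIRS:
            candidate = os.path.join(dirpath, name)
            if name in dirnames and _contains_java(candidate):
                found.append(candidate)
        # Nothing below target/ needs a separate visit.
        dirnames[:] = []
    return sorted(found)


def _contains_java(folder):
    """True if any file below folder ends in .java (the **/*.java pattern)."""
    for _, _, filenames in os.walk(folder):
        if any(f.endswith(".java") for f in filenames):
            return True
    return False
```
</pre>
<p>Each folder returned is a candidate for the build path entry described above; folders without any Java files are skipped.</p>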
<h2>Configuration for CDH4 MR1</h2>
<p>CDH4 MR1 runs the stable version of MapReduce (MR1) on top of HDFS from Hadoop 2.x branches. So, we have to set up both HDFS and MapReduce separately.</p>
<ol>
<li>Follow Steps 1 and 2 of the previous section (Hadoop 2.x).</li>
<li>Download the MR1 source tarball from <a href="https://ccp.cloudera.com/display/SUPPORT/CDH4+Downloadable+Tarballs">CDH4 Downloads</a> and untar it into a folder different from the one used in Step 1.</li>
<li>Within the MR1 folder, generate Eclipse project information using Ant via command line (<code>ant eclipse-files</code>).</li>
<li>Configure .classpath using <a href="https://github.com/cloudera/blog-eclipse/blob/master/configure-classpath.pl">this Perl script</a> to make sure all classpath entries point to the local Maven repository:</li>
<ol>
<li>Copy the script to the top-level Hadoop directory.</li>
<li>Run <code>$ perl configure-classpath.pl</code></li>
</ol>
<li>Pull sources into Eclipse:</li>
<ol>
<li>Go to File -&gt; Import.</li>
<li>Select General -&gt; Existing Projects into Workspace.</li>
<li>For the root directory, navigate to the top directory of the sources downloaded above.</li>
</ol>
</ol>
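<p>After step 4, you may want to confirm that the library entries in .classpath really do point at the local Maven repository. The following sketch is not the Perl script linked above and makes no claim about how that script works; it only reads a standard Eclipse .classpath file and reports library entries that fall outside the repository. The function name is hypothetical.</p>
<pre>
```python
import xml.etree.ElementTree as ET


def entries_outside_maven_repo(classpath_file, m2_repo):
    """List lib/var entry paths in an Eclipse .classpath not under the Maven repo.

    Entries are treated as Maven-backed when their path starts with the
    M2_REPO classpath variable or with the literal repository directory.
    """
    prefixes = ("M2_REPO/", m2_repo.rstrip("/") + "/")
    outside = []
    for entry in ET.parse(classpath_file).getroot().iter("classpathentry"):
        if entry.get("kind") not in ("lib", "var"):
            continue  # skip source folders, containers, and output entries
        path = entry.get("path", "")
        if not path.startswith(prefixes):
            outside.append(path)
    return outside
```
</pre>
<p>An empty result suggests every library entry resolves through the Maven repository; anything listed is worth checking by hand before importing the project.</p>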
<p>Happy Hacking!</p>
<p><em>Karthik Kambatla is a Software Engineer at Cloudera in the scheduling and resource management team and works primarily on MapReduce and YARN.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/how-to-configure-eclipse-for-hadoop-contributions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fresh and Hot: HBaseCon 2013 Schedule Finalized!</title>
		<link>http://blog.cloudera.com/blog/2013/05/fresh-and-hot-hbasecon-2013-schedule-finalized/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/fresh-and-hot-hbasecon-2013-schedule-finalized/#comments</comments>
		<pubDate>Tue, 14 May 2013 17:53:42 +0000</pubDate>
		<dc:creator>Justin Kestelyn (@kestelyn)</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21679</guid>
		<description><![CDATA[The schedule/agenda grid for HBaseCon 2013 (rapidly approaching: June 13 in San Francisco) is a thing of beauty. If you lacked motivation to register up until this point, we think that this session line-up will convince you otherwise. We repeat: whether you&#8217;re an HBase committer or just getting started (or at any level in between), HBaseCon [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://hbasecon.com"><img style="float: right; margin: 0px; padding: 7px 7px 7px 7px;" title="HBaseCon2013-logo" src="http://blog.cloudera.com/wp-content/uploads/2013/02/HBaseCon2013-logo3.jpg" alt="" width="200" height="63" /></a></p>
<p>The <a href="http://www.hbasecon.com/schedule/">schedule/agenda grid</a> for HBaseCon 2013 (rapidly approaching: <strong>June 13</strong> in San Francisco) is a thing of beauty.</p>
<p>If you lacked motivation to register up until this point, we think that this session line-up will convince you otherwise. We repeat: whether you&#8217;re an HBase committer or just getting started (or at any level in between), HBaseCon is simply an event that you can&#8217;t afford to miss &#8211; and with an entry fee of just <strong>$350</strong>, it&#8217;s also one you can easily afford.</p>
<p><strong><a href="http://hbasecon13.eventbrite.com/">Register now</a></strong> while there&#8217;s still room!</p>
<p>See also:</p>
<p>- <a title="Top 5 Reasons to Attend HBaseCon 2013" href="http://blog.cloudera.com/blog/2013/05/top-5-reasons-to-attend-hbasecon-2013/">Top 5 Reasons to Attend HBaseCon 2013</a><br />- <a title="HBaseCon 2013 Speakers, Tracks, and Sessions Announced" href="http://blog.cloudera.com/blog/2013/04/hbasecon-2013-speakers-tracks-and-sessions-announced/">HBaseCon 2013 Speakers, Tracks, and Sessions Announced</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/fresh-and-hot-hbasecon-2013-schedule-finalized/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How-to: Automate Your Hadoop Cluster from Java</title>
		<link>http://blog.cloudera.com/blog/2013/05/how-to-automate-your-hadoop-cluster-from-java/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/how-to-automate-your-hadoop-cluster-from-java/#comments</comments>
		<pubDate>Mon, 13 May 2013 17:45:28 +0000</pubDate>
		<dc:creator>bc Wong</dc:creator>
				<category><![CDATA[Cloudera Manager]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[How-to]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21653</guid>
		<description><![CDATA[One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. In this post, I’ll [...]]]></description>
			<content:encoded><![CDATA[<p>One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. In this post, I’ll describe an approach to cluster automation that works for us, as well as many of our customers and partners.</p>
<h2>Taming Complexity</h2>
<p>At Cloudera engineering, we have a big support matrix. We work on many versions of CDH (multiple release trains, plus things like rolling upgrade testing). CDH runs on a wide variety of OS distros (RHEL 5 &amp; 6, Ubuntu Precise &amp; Lucid, Debian Squeeze, and SLES 11) and in complex configuration combinations: highly available HDFS or simple HDFS, Kerberized or non-secure, YARN or MR1 as the execution framework, and so on. Clearly, we need an easy way to spin up a new cluster with the desired setup, which we can then use for integration, testing, customer support, demos, and more.</p>
<p>This concept is not new; there are several other examples of Hadoop cluster automation solutions. For example, Yahoo! has its own infrastructure tools, and you can find publicly available Puppet recipes, with various degrees of completeness and maintenance. Furthermore, there are tools that work only with a particular virtualization environment. However, we needed a solution that is more powerful and easier to maintain.</p>
<p>Cloudera&#8217;s automation system for Hadoop cluster deployment provisions VMs on-demand in our internal cloud. As cool as that capability sounds, it’s actually not the most interesting part of the solution. More important is that we can install and configure Hadoop according to precise specifications using a powerful yet simple abstraction: Cloudera Manager’s open source <a href="http://cloudera.github.com/cm_api/">REST API</a>.</p>
<h2>Cloudera Manager API</h2>
<p>This is what our automation system does:</p>
<ol>
<li>Installs the Cloudera Manager (CM) packages on the cluster and starts the CM server.</li>
<li>Uses the API to add hosts, install CDH, and define the cluster and its services.</li>
<li>Uses the API to configure the cluster: tune heap sizes, set up HDFS HA, turn on Kerberos security and generate keytabs, customize service directories and ports, and so on. Every configuration available in Cloudera Manager is exposed in the API.</li>
<li>The API also gives access to management features, such as gathering logs and monitoring information, starting and stopping services, polling cluster events, and creating a DR replication schedule. We use these features extensively in our automated tests.</li>
</ol>
<p>The end result is a system that has become an indispensable part of our engineering process. It makes the Hadoop setup easy to maintain. For example, the same API call retrieves logs from HDFS, HBase, or any other service, without the user worrying about the different log locations. The same API call stops any service, without the user worrying about any additional steps. (HBase, for instance, needs to be shut down gracefully.) And when Cloudera Manager adds support for more services (e.g. Impala), their setup flows are the same as the existing ones.</p>
<h2>Use Cases from Partners and Customers</h2>
<p>Many of our customers and partners have also adopted the Cloudera Manager API for cluster automation:</p>
<ul>
<li>Some OEM and hardware partners, delivering Hadoop-in-a-box appliances, use the API to set up CDH and Cloudera Manager on bare metal in the factory.</li>
<li>Some of our high-growth customers are constantly deploying new clusters, and have automated that with a combination of Puppet and the Cloudera Manager API. Puppet does the OS-level provisioning, and the software installation. After that, the Cloudera Manager API sets up the Hadoop services and configures the cluster.</li>
<li>Others have found it useful to integrate the API with their reporting and alerting infrastructure. An external script can poll the API for health and metrics information, as well as the stream of events and alerts, to feed into a custom dashboard.</li>
</ul>
<h2>Code Samples</h2>
<p>A previous <a href="http://blog.cloudera.com/blog/2012/09/automating-your-cluster-with-cloudera-manager-api/">blog post</a> gave an example of setting up a CDH4 cluster using the <a href="https://github.com/cloudera/cm_api/tree/master/python">Python API client</a>. Instead of repeating that, let me introduce you to the Java API client. (Although our internal automation tool uses the Python client today, we plan to move to Java to better work with our other Java-based tools like jclouds.) To use the Java client, add this dependency to your project’s pom.xml:</p>
<pre class="code" style="padding-left: 10px;"> 
&lt;project&gt;
  &lt;repositories&gt;
    &lt;repository&gt;
      &lt;id&gt;cdh.repo&lt;/id&gt;
      &lt;url&gt;https://repository.cloudera.com/content/groups/cloudera-repos&lt;/url&gt;
      &lt;name&gt;Cloudera Repository&lt;/name&gt;
    &lt;/repository&gt;
    …
  &lt;/repositories&gt;
  &lt;dependencies&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;com.cloudera.api&lt;/groupId&gt;
      &lt;artifactId&gt;cloudera-manager-api&lt;/artifactId&gt;
      &lt;version&gt;4.5.2&lt;/version&gt;      &lt;!-- Or the CM version you work with --&gt;
    &lt;/dependency&gt;
    …
  &lt;/dependencies&gt;
  ...
&lt;/project&gt;
</pre>
<p>&nbsp;</p>
<p>The Java client works like a proxy. It hides from the caller any details about REST, HTTP, and JSON. The entry point is a handle to the root of the API:</p>
<pre class="code" style="padding-left: 10px;">RootResourceV3 apiRoot = new ClouderaManagerClientBuilder()
            .withHost("cm.cloudera.com")
            .withUsernamePassword("admin", "admin")
            .build()
            .getRootV3();
</pre>
<p>&nbsp;</p>
<p>From the root, you can traverse down to all other resources. (It’s called “v3” because the current Cloudera Manager API version is 3. The same builder can also return a v1 or v2 root.) Here is the tree view of some of the key resources and the interesting operations they support:</p>
<ul>
<li>RootResourceV3
<ul>
<li>ClustersResourceV3: hosts membership, start cluster
<ul>
<li>ServicesResourceV3: config, get metrics, HA, service commands
<ul>
<li>RolesResource: add roles, get metrics, logs</li>
<li>RoleConfigGroupsResource: config</li>
</ul>
</li>
<li>ParcelsResource: parcels management</li>
</ul>
</li>
<li>HostsResource: hosts management, get metrics</li>
<li>UsersResource: users management</li>
</ul>
</li>
</ul>
<p>Of course, these are all in the <a href="http://cloudera.github.com/cm_api/javadoc/4.5/index.html">Javadoc</a>, and the full <a href="http://cloudera.github.io/cm_api/apidocs/v3/index.html">API documentation</a>. To give a short concrete example, here is the code to list and start a cluster:</p>
<p>&nbsp;</p>
<pre class="code" style="padding-left: 10px;">// List of clusters
ApiClusterList clusters = apiRoot.getClustersResource()
                                 .readClusters(DataView.SUMMARY);
for (ApiCluster cluster : clusters) {
  LOG.info("{}: {}", cluster.getName(), cluster.getVersion());
}

// Start the first cluster
ApiCommand cmd = apiRoot.getClustersResource()
                        .startCommand(clusters.get(0).getName());
while (cmd.isActive()) {
   Thread.sleep(100);
   cmd = apiRoot.getCommandsResource().readCommand(cmd.getId());
}
LOG.info("Cluster start {}", cmd.getSuccess() ?
            "succeeded" : "failed " + cmd.getResultMessage());
</pre>
<p>&nbsp;</p>
<p>For a full example of cluster deployment using the Java client, see <a href="https://github.com/cloudera/whirr-cm">whirr-cm</a>. Specifically, jump straight to <a href="https://github.com/cloudera/whirr-cm/blob/ed0297b969dfaea8ef2863f8b0d31b5a5cca1ac0/src/main/java/com/cloudera/whirr/cm/server/impl/CmServerImpl.java#L471">CmServerImpl#configure</a> to see the core of the action.</p>
<p>You may find it interesting that the Java client is maintained with very little effort. Using Apache CXF, the client proxy comes free, quite magically. It figures out the right HTTP call to make by inspecting the JAX-RS annotations in the REST interface, which is the same interface used by the Cloudera Manager API server. Therefore, new API methods are available to the Java client automatically.</p>
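<p>To make the idea of an annotation-driven proxy concrete, here is a minimal, self-contained sketch. It is not Apache CXF and not the Cloudera Manager client: it swaps in the JDK class java.lang.reflect.Proxy, and the Path annotation, the ClustersApi interface, and the /api/v3 prefix are hypothetical stand-ins chosen for illustration. A real JAX-RS proxy would also send the HTTP request and unmarshal the JSON response; this sketch only derives the request line from the annotation on the invoked method.</p>
<pre class="code" style="padding-left: 10px;">import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class ProxySketch {
  // Hypothetical stand-in for a JAX-RS style path annotation.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  public @interface Path {
    String value();
  }

  // Hypothetical REST interface; nothing implements it by hand.
  public interface ClustersApi {
    @Path("/clusters")
    String readClusters();

    @Path("/clusters/{name}/commands/start")
    String startCommand(String name);
  }

  // Builds a request line from the annotation on the invoked method.
  static class RequestLineHandler implements InvocationHandler {
    @Override
    public Object invoke(Object proxy, Method method, Object[] args) {
      Path path = method.getAnnotation(Path.class);
      if (path == null) {
        throw new UnsupportedOperationException(method.getName());
      }
      String resolved = path.value();
      if (args != null) {
        // Fill the single {name} placeholder with the first argument.
        resolved = resolved.replace("{name}", String.valueOf(args[0]));
      }
      // Assumption for this sketch: read* methods map to GET, others to POST.
      String verb = method.getName().startsWith("read") ? "GET" : "POST";
      return verb + " /api/v3" + resolved;
    }
  }

  public static ClustersApi newClient() {
    return (ClustersApi) Proxy.newProxyInstance(
        ClustersApi.class.getClassLoader(),
        new Class[] { ClustersApi.class },
        new RequestLineHandler());
  }

  public static void main(String[] args) {
    ClustersApi api = newClient();
    System.out.println(api.readClusters());
    System.out.println(api.startCommand("cluster1"));
  }
}
</pre>
<p>In this sketch, adding a method to the interface is all it takes for the proxy to support it, which is the same property that keeps a generated client in step with the server-side interface.</p>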
<h2>What’s Your Plan?</h2>
<p>Overall, we are very happy with our automated deployment capability. I encourage you to try the Cloudera Manager API, and post your questions and feedback on the <a href="https://groups.google.com/a/cloudera.org/forum/?hl=en&amp;fromgroups#!forum/scm-users">mailing list</a>.</p>
<p><em>bc Wong is a Software Engineer on the Enterprise team.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/how-to-automate-your-hadoop-cluster-from-java/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tracking Hadoop Jobs from Your Mac: There&#8217;s an App for That</title>
		<link>http://blog.cloudera.com/blog/2013/05/tracking-hadoop-jobs-from-your-mac-theres-an-app-for-that/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/tracking-hadoop-jobs-from-your-mac-theres-an-app-for-that/#comments</comments>
		<pubDate>Fri, 10 May 2013 14:07:45 +0000</pubDate>
		<dc:creator>Justin Kestelyn (@kestelyn)</dc:creator>
				<category><![CDATA[Guest]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21634</guid>
		<description><![CDATA[Our thanks to Etsy developer Brad Greenlee (@bgreenlee) for the post below. We think his Mac OS app for JobTracker is great! JobTracker.app is a Mac menu bar app interface to the Hadoop JobTracker. It provides Growl/Notification Center notices of starting, completed, and failed jobs and gives easy access to the detail pages of those jobs. [...]]]></description>
			<content:encoded><![CDATA[<p><em>Our thanks to Etsy developer Brad Greenlee (@bgreenlee) for the post below. We think his Mac OS app for JobTracker is great!</em></p>
<p><a href="http://bgreenlee.github.io/JobTracker">JobTracker.app</a> is a Mac menu bar app interface to the Hadoop JobTracker. It provides Growl/Notification Center notices of starting, completed, and failed jobs and gives easy access to the detail pages of those jobs.</p>
<p>When I started writing Apache Hadoop jobs at <a href="https://www.etsy.com/">Etsy</a>, I found myself wasting a lot of time checking the JobTracker page to see how my job was progressing. The first thing we did to try to solve this problem was to write a <a href="https://github.com/twitter/scalding">Scalding</a> flow listener to announce completed and failed jobs to IRC, but that got a little noisy. So I wrote JobTracker.app.</p>
<h2>Installation and Usage</h2>
<p>You can download the binary from its <a href="https://github.com/bgreenlee/JobTracker">GitHub project page</a>. Just unzip it and drop it into your Applications folder. Running it will put a little pith helmet in your menu bar. Clicking that gets you this menu:</p>
<p align="center"><img src="https://a248.e.akamai.net/camo.github.com/dcc5fcd0259763bf7fb4d286329379a01f1fba4b/687474703a2f2f636c2e6c792f696d6167652f324a32363043314c334230512f6a742d6d61696e2d6d656e752e706e67" alt="" /></p>
<p>You&#8217;ll first need to go to Preferences and enter your JobTracker URL:</p>
<p align="center"><img src="https://a248.e.akamai.net/camo.github.com/5ecb6c764857c23f8e45d2c119a443afd7d36443/687474703a2f2f636c2e6c792f696d6167652f306c33643369326a324731792f6a742d707265666572656e6365732e706e67" alt="" /></p>
<p>By default it will track all jobs. You probably don&#8217;t want this, so put your username and any other usernames you want to track in the &#8220;Usernames to track&#8221; field, comma-separated.</p>
<p>Note that this has only been tested with the version of Hadoop that Etsy is running internally. Due to the somewhat horrifying way that the app gets the JobTracker data (by parsing the JobTracker HTML page, since there&#8217;s currently no API to JobTracker except via Java), it could well break on a different version of Hadoop/JobTracker. If you try it and it doesn&#8217;t work for you, <a href="https://github.com/bgreenlee/JobTracker/issues">file an issue</a> on GitHub and I’ll work with you on fixing it.</p>
<h2>Future Development</h2>
<p>Next on my list of features is allowing for <a href="https://github.com/bgreenlee/JobTracker/issues/2">tracking multiple clusters at once</a>. If you have any requests, please <a href="mailto:brad@etsy.com">let me know</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/tracking-hadoop-jobs-from-your-mac-theres-an-app-for-that/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Top 5 Reasons to Attend HBaseCon 2013</title>
		<link>http://blog.cloudera.com/blog/2013/05/top-5-reasons-to-attend-hbasecon-2013/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/top-5-reasons-to-attend-hbasecon-2013/#comments</comments>
		<pubDate>Thu, 09 May 2013 14:21:45 +0000</pubDate>
		<dc:creator>Justin Kestelyn (@kestelyn)</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21539</guid>
		<description><![CDATA[HBaseCon 2013 is approaching fast &#8211; June 13 in San Francisco. If you&#8217;re on the fence about attending &#8211; or perhaps your manager is on the fence about approving your participation &#8211; here are a few things that you/they need to know (in no particular order): HBaseCon is the annual rallying point for the HBase [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://hbasecon.com"><img style="float: right; margin: 0px; padding: 7px 7px 7px 7px;" title="HBaseCon2013-logo" src="http://blog.cloudera.com/wp-content/uploads/2013/02/HBaseCon2013-logo3.jpg" alt="" width="200" height="63" /></a></p>
<p><a href="http://hbasecon.com">HBaseCon 2013</a> is approaching fast &#8211; <strong>June 13</strong> in San Francisco. If you&#8217;re on the fence about attending &#8211; or perhaps your manager is on the fence about approving your participation &#8211; here are a few things that you/they need to know (in no particular order):</p>
<ol>
<li><strong>HBaseCon is the annual rallying point for the HBase community.</strong> If you&#8217;ve ever had a desire to learn how to get involved in the community as a contributor, or just want to ask a committer or PMC member why things are done (or not done) a certain way, this is your opportunity &#8211; because this is where those people are. Participating in a mailing list thread is never quite the same once you&#8217;ve met the people behind it. <br /> </li>
<li><strong>HBaseCon is a one-stop shop for learning about the HBase roadmap, as well as other projects across the ecosystem.</strong> Current HBase users should be particularly interested in learning about which JIRAs will have the most impact on the user experience &#8211; and once again, most of the committers working on those JIRAs will either be leading sessions or otherwise present. Plus, you can learn about how new complementary projects like Impala, Kiji, Phoenix, and Honeycomb are transforming the use cases for HBase and helping to expand its footprint across the enterprise.<br /> </li>
<li><strong>HBaseCon is a feast of real-world experiences and use cases.</strong> Sure, maybe you&#8217;ve read about the HBase-backed applications used by companies like Facebook, Salesforce.com, eBay, Pinterest, and Yahoo!. But wouldn&#8217;t it be helpful to hear technical details and best practices directly from the people who built and run them? I&#8217;ll bet it would. And you really can&#8217;t do that anywhere else &#8212; in the whole world. (Plus, you can take advantage of <a href="http://www.hbasecon.com/apache-hbase-training-discount-for-hbasecon-attendees/">formal training</a> right before the conference, at a discount.)<br /> </li>
<li><strong>HBaseCon is a pageant of engineer rock-stars.</strong> If your company is an HBase user and hungry for talent, there&#8217;s no better place to find it: HBaseCon is literally the world&#8217;s biggest gathering of HBase experts under one roof.<br /> </li>
<li><strong>HBaseCon is a heck of a blast.</strong> Come for the deep-dives and advice, stay for the after-event party. The libations will be extensive!</li>
</ol>
<p>If you have any interest in HBase whatsoever, whether as a user or prospective user, <em>missing HBaseCon is almost unthinkable</em>. </p>
<p><a href="http://hbasecon13.eventbrite.com/">Register early</a>, because space is limited and filling up fast. Don&#8217;t get left out!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/top-5-reasons-to-attend-hbasecon-2013/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Metrics2: The New Hotness for Apache HBase Metrics</title>
		<link>http://blog.cloudera.com/blog/2013/05/metrics2-the-new-hotness-for-apache-hbase-metrics/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/metrics2-the-new-hotness-for-apache-hbase-metrics/#comments</comments>
		<pubDate>Wed, 08 May 2013 21:40:52 +0000</pubDate>
		<dc:creator>Justin Kestelyn (@kestelyn)</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21615</guid>
		<description><![CDATA[The post below was originally published at blogs.apache.org/hbase. We re-publish it here for your convenience. Apache HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help  spot problems before they arise, debug on-going issues, evaluate new usage patterns, and [...]]]></description>
			<content:encoded><![CDATA[<p><em>The post below was originally published at <a href="https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics">blogs.apache.org/hbase</a>. We re-publish it here for your convenience.</em></p>
<p>Apache HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help spot problems before they arise, debug ongoing issues, evaluate new usage patterns, and provide insight into capacity planning.</p>
<p>Since October 2008, version 0.19.0 (<a href="https://issues.apache.org/jira/browse/HBASE-625">HBASE-625</a>), HBase has been using Apache Hadoop’s metrics system to export metrics to JMX, Ganglia, and other metrics sinks. As the code base grew, more and more metrics were added by different developers. New features got metrics. When users needed more data on issues, they added more metrics. These new metrics were not always consistently named, and some were not well documented.</p>
<p>While HBase’s metrics system grew organically, Hadoop developers were building a new version of the metrics system called Metrics2, starting with <a href="https://issues.apache.org/jira/browse/HADOOP-6728">HADOOP-6728</a> and continuing in subsequent JIRAs. This new subsystem has a new name space, different sinks, different sources, and more features, and it is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of this in mind, it was time to update HBase’s metrics system, so <a href="https://issues.apache.org/jira/browse/HBASE-4050">HBASE-4050</a> was started. I also wanted to clean up the implementation cruft that had accumulated.</p>
<h2>Definitions</h2>
<p>The implementation details are pretty dense on terminology, so let’s make sure everything is defined:</p>
<ul>
<li>
<p>Metric: A measurement of a property in the system.</p>
</li>
<li>
<p>Snapshot: A set of metrics at a given point in time.</p>
</li>
<li>
<p>Metrics1: The old Apache Hadoop metrics system.</p>
</li>
<li>
<p>Metrics2: The new, overhauled Apache Hadoop metrics system.</p>
</li>
<li>
<p>Source: A class that exposes metrics to the Hadoop metrics system.</p>
</li>
<li>
<p>Sink: A class that receives metrics snapshots from the Hadoop metrics system.</p>
</li>
<li>
<p>JMX: Java Management Extensions. A system built into Java that facilitates the management of Java processes over a network; it includes the ability to expose metrics.</p>
</li>
<li>
<p>Dynamic Metrics: Metrics that come and go. These metrics are not all known at compile time; instead they are discovered at runtime.</p>
</li>
</ul>
<h2>Implementation</h2>
<p>The Hadoop Metrics2 system implementations in branch-1 and branch-2 have diverged pretty drastically. This means that a single implementation of the code to move metrics from HBase to Metrics2 sinks would be neither performant nor easy to write. As a result, I created separate Hadoop compatibility shims and a system to load the right version at runtime. This led to using <a href="http://docs.oracle.com/javase/6/docs/api/java/util/ServiceLoader.html">ServiceLoader</a> to create an instance of any class that touched parts of Hadoop that had changed between branch-1 and branch-2.</p>
<p><a href="http://blog.cloudera.com/wp-content/uploads/2013/05/fig1.png">Here</a> is an example of how a region server could request a Hadoop 2 version of the shim for exposing metrics about the HRegionServer. (Hadoop 1’s compatibility jar is shown in dotted lines to indicate that it could be swapped in if Hadoop 1 was being used.)</p>
<p>This system allows HBase to support both Hadoop 1.x and Hadoop 2.x implementations without using reflection or other tricks to get around differences in API, usage, and naming.</p>
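<p>The runtime lookup can be sketched roughly as follows. This is an illustration only, not HBase&#8217;s actual shim code: the MetricsShim interface, the NoOpShim fallback, and the hadoopLine method are hypothetical names. The only real API used is java.util.ServiceLoader, which discovers implementations listed in META-INF/services entries of the jars on the classpath.</p>
<pre class="code" style="padding-left: 10px;">import java.util.ServiceLoader;

public class ShimLoaderSketch {
  // Hypothetical shim contract; each Hadoop compatibility jar would
  // ship its own implementation plus a META-INF/services entry for it.
  public interface MetricsShim {
    String hadoopLine();
  }

  // Fallback used by this sketch when no provider is registered.
  static class NoOpShim implements MetricsShim {
    public String hadoopLine() {
      return "none";
    }
  }

  public static MetricsShim load() {
    // Return the first implementation found on the classpath, if any.
    for (MetricsShim shim : ServiceLoader.load(MetricsShim.class)) {
      return shim;
    }
    return new NoOpShim();
  }

  public static void main(String[] args) {
    System.out.println("Loaded shim for Hadoop line: " + load().hadoopLine());
  }
}
</pre>
<p>Because the caller depends only on the interface, swapping the compatibility jar on the classpath changes which implementation is loaded without any change to the calling code.</p>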
<p>Once HBase could use either the Hadoop 1 or the Hadoop 2 version of the Metrics2 system, I set about cleaning up which metrics HBase exposes, how those metrics are exposed, how they are named, and how efficiently the data is gathered.</p>
<p>Metrics2 uses either annotations or sources to expose metrics. Since HBase can’t require any part of the metrics2 system in the core classes, I exposed all metrics from HBase by creating sources. For metrics that are known ahead of time, I created wrappers around classes in the core of HBase that the metrics2 shims could interrogate for values. <a href="http://blog.cloudera.com/wp-content/uploads/2013/05/fig2.png">Here</a> is an example of how HRegionServer’s metrics (the non-dynamic metrics) are exposed.</p>
<p>The above pattern can be repeated to expose a great many of the metrics that HBase has. However, metrics about specific regions, which are still very interesting, can’t be exposed following this pattern. A new solution was needed that would allow metrics about regions to be exposed by whichever HRegionServer is hosting that region. To complicate things further, Hadoop’s metrics2 system needs one MetricsSource to be responsible for all metrics that are going to be exposed through a JMX mbean. In order for metrics about regions to be well laid out, HBase needs a way to aggregate metrics from multiple regions into one source. This source is then responsible for knowing which regions are assigned to the regionserver. These requirements led me to have one aggregation source that contains pseudo-sources for each region. Each pseudo-source contains a wrapper around the region. This leads to something that looks like <a href="http://blog.cloudera.com/wp-content/uploads/2013/05/fig3.png">this</a>.</p>
<h2>Benefits</h2>
<p>That’s a lot of work to redo a previously working metrics system, so what was gained? The entire system is much easier to test in unit and system tests. The whole system has been made more regular; that is, everything follows the same patterns and naming conventions. Finally, everything has been rewritten to be faster.</p>
<p>Since the previous metrics had all been added as needed, they were not all named well. Some metrics were named following the pattern “metricNameCount”, others followed “numMetricName”, and still others were named like “metricName_Count”. This made parsing hard and gave a generally chaotic feel. After the overhaul, metrics that are counters use the camel-cased metric name followed by the suffix “Count”. The mbeans were also poorly laid out. Some metrics were spread out between two mbeans. Metrics about a region were under an mbean named Dynamic, not the most descriptive name. Now mbeans are much better organized and have better descriptions.</p>
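<p>As a small illustration of the counter naming convention, the sketch below maps the three legacy spellings quoted above onto the single “metricNameCount” form. It is not code from HBase; the CounterNames class and its normalize method are hypothetical, and the rules cover only the three patterns mentioned here.</p>
<pre class="code" style="padding-left: 10px;">public class CounterNames {
  // Maps a legacy counter name onto the convention of a camel-cased
  // metric name followed by the suffix "Count".
  public static String normalize(String legacy) {
    String base;
    if (legacy.matches("num[A-Z].*")) {
      // numMetricName: drop the prefix and lower-case the first letter.
      base = Character.toLowerCase(legacy.charAt(3)) + legacy.substring(4);
    } else if (legacy.endsWith("_Count")) {
      // metricName_Count: drop the underscore-separated suffix.
      base = legacy.substring(0, legacy.length() - 6);
    } else if (legacy.endsWith("Count")) {
      // metricNameCount: already in the target form.
      base = legacy.substring(0, legacy.length() - 5);
    } else {
      base = legacy;
    }
    return base + "Count";
  }

  public static void main(String[] args) {
    System.out.println(normalize("numMetricName"));
    System.out.println(normalize("metricName_Count"));
    System.out.println(normalize("metricNameCount"));
  }
}
</pre>
<p>All three calls in main print metricNameCount, which is what makes downstream parsing simpler once every counter follows one pattern.</p>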
<p>Tests have found that single-threaded scans run as much as 9% faster after HBase’s old metrics system has been replaced. The previous system used lots of <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentHashMap.html">ConcurrentHashMaps</a> to store dynamic metrics. All accesses to mutate these metrics required a lookup into these large hash maps. The new system minimizes the use of maps. Instead, every region or server exports metrics to one pseudo-source. The only changes to hash maps in the metrics system occur on region close or open.</p>
<h2>Conclusion</h2>
<p>Overall, the whole system is just better. The process was long and laborious, but worth it to make sure that HBase’s metrics system is in a good state. HBase 0.95, and later 0.96, will have the new metrics system. There’s still more work to be completed, but great strides have been made.</p>
<p><em>Elliott Clark is a Software Engineer at Cloudera and an HBase committer.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/metrics2-the-new-hotness-for-apache-hbase-metrics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudera Partners and Impala: Alteryx</title>
		<link>http://blog.cloudera.com/blog/2013/05/cloudera-partners-and-impala-alteryx/</link>
		<comments>http://blog.cloudera.com/blog/2013/05/cloudera-partners-and-impala-alteryx/#comments</comments>
		<pubDate>Wed, 08 May 2013 13:00:42 +0000</pubDate>
		<dc:creator>Justin Kestelyn (@kestelyn)</dc:creator>
				<category><![CDATA[Guest]]></category>
		<category><![CDATA[Impala]]></category>

		<guid isPermaLink="false">http://blog.cloudera.com/?p=21473</guid>
		<description><![CDATA[Our thanks to Brian Dirking, Director of Product Marketing for Alteryx, for the guest post below: At Alteryx we are excited about the release of Cloudera Impala. The impact on Big Data Analytics is that the ability to perform real-time queries on Apache Hadoop will provide faster access and results. This is applicable to our customers, [...]]]></description>
			<content:encoded><![CDATA[<p><em>Our thanks to Brian Dirking, Director of Product Marketing for <a href="http://www.alteryx.com">Alteryx</a>, for the guest post below: </em></p>
<p>At <a href="http://www.alteryx.com/">Alteryx</a> we are excited about the <a title="The Platform for Big Data is Here" href="http://blog.cloudera.com/blog/2013/04/platform-for-big-data-is-here/">release of Cloudera Impala</a>. The impact on Big Data Analytics is that the ability to perform real-time queries on Apache Hadoop will provide faster access and results. This is applicable to our customers, the business users who are running analytics to get access to data, perform analytics, and then follow up with new questions. Insight doesn’t happen all at once. The ability to query and refine quickly is ultimately what will lead business users to insight.</p>
<p>As business users need faster access to data, Alteryx provides a user-friendly way to access new solutions like Impala. With Impala support in <a href="http://www.alteryx.com/products/alteryx-8.5">Alteryx Strategic Analytics</a>, business users can get faster access, and can refine data queries and the corresponding analytics to get the answers they need. They can combine these results with other datasets to provide the context necessary to make the right decision, and they can do it without having to go through months of training to master programming and query languages.</p>
<p>A great example of where Impala can have a big impact is in churn analytics. When customers leave a company or service, there are usually a few interactions prior to leaving that are the cause. In the telecom world, these interactions can be dropped calls, support queries, and rate adjustments. The interactions that lead to churn can happen over the course of just a few hours. Being able to log those events and have them show up quickly in an analytics query, so that customers can be saved, can have a huge impact on an organization. Impala enables Alteryx to iteratively analyze the fast-moving data involved in churn analysis and prevention.</p>
<table style="width: 120px; margin: 6px; padding: 0px 0px 0px 0px;" align="right">
<tbody>
<tr>
<td>
<h3>Impala enables Alteryx to iteratively analyze the fast-moving data involved in churn analysis and prevention.</h3>
</td>
</tr>
</tbody>
</table>
<p>At Alteryx, we recognize that organizations not only need to scale technology to address Big Data, they need to scale human capabilities. That is why Alteryx Strategic Analytics provides an easy-to-use drag-and-drop interface for business users. Subject matter experts and business analysts can quickly build analytics workflows that gather, cleanse, and blend datasets; enrich them with third-party data; and then run sophisticated statistical, predictive, or geo-spatial analytics. By giving business users the ability to query, analyze, and refine quickly, the analysis takes place at the business user level, where the business impact is understood. And because business users run the analytics themselves, the organization can scale.</p>
<p>With Impala support, Alteryx enables business users to benefit from huge innovations in the Big Data market. As the market matures and more ways of accessing data become available, Alteryx provides an easy interface that enables users to benefit from the power of these innovations, while shielding them from the complexity. This makes users more productive, and able to focus on getting the answers they need to make better decisions faster.</p>
<p>For more information about the Alteryx integration with Cloudera, visit <a href="http://www.alteryx.com/cloudera">www.alteryx.com/cloudera</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cloudera.com/blog/2013/05/cloudera-partners-and-impala-alteryx/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
