Cloudera Developer Blog · Guest Posts
We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).
Today, we’re happy to tell you about a significant Parquet milestone: a 1.0 release, which includes major features and improvements made since the initial announcement. But first, we’ll revisit why columnar storage is so important for the Hadoop ecosystem.
What is Parquet and Columnar Storage?
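To make the idea concrete before diving into Parquet itself, here is a toy sketch of row versus columnar layout. This is not Parquet's actual on-disk format, and the data is made up for illustration:

```java
// Toy illustration of row vs. columnar layout -- not Parquet's real format.
public class ColumnarSketch {
    public static void main(String[] args) {
        // Row-oriented: each record carries all of its fields together.
        Object[][] rows = {
            {1L, "impala", 9.99},
            {2L, "parquet", 4.50},
            {3L, "hive", 7.25}
        };

        // Column-oriented: one contiguous array per column.
        long[] ids      = {1L, 2L, 3L};
        String[] names  = {"impala", "parquet", "hive"};
        double[] prices = {9.99, 4.50, 7.25};

        // Summing prices from the columnar layout touches only the
        // `prices` array; ids and names are never read. A row scan
        // would have to read every field of every record.
        double total = 0;
        for (double p : prices) total += p;
        System.out.println("total = " + total);
    }
}
```

Parquet applies this principle on disk, storing column chunks contiguously so that encoding and compression can be chosen per column and exploit the similarity of values within it.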
Our thanks to Etsy developer Brad Greenlee (@bgreenlee) for the post below. We think his Mac OS app for JobTracker is great!
JobTracker.app is a Mac menu bar app that provides an interface to the Hadoop JobTracker. It posts Growl/Notification Center notifications for started, completed, and failed jobs and gives easy access to the detail pages of those jobs.
When I started writing Apache Hadoop jobs at Etsy, I found myself wasting a lot of time checking the JobTracker page to see how my job was progressing. The first thing we did to try to solve this problem was to write a Scalding flow listener to announce completed and failed jobs to IRC, but that got a little noisy. So I wrote JobTracker.app.
Installation and Usage
Our thanks to Brian Dirking, Director of Product Marketing for Alteryx, for the guest post below:
At Alteryx we are excited about the release of Cloudera Impala. For Big Data analytics, the ability to perform real-time queries on Apache Hadoop means faster access to data and faster results. That matters to our customers: business users who access data, run analytics on it, and then follow up with new questions. Insight doesn’t happen all at once. The ability to query and refine quickly is ultimately what will lead business users to insight.
Because business users need faster access to data, Alteryx provides a user-friendly way to work with new solutions like Impala. With Impala support in Alteryx Strategic Analytics, business users get that faster access and can refine their data queries and the corresponding analytics to get the answers they need. They can combine those results with other datasets to provide the context necessary to make the right decision, and they can do it without months of training in programming and query languages.
Our thanks to Ted Wasserman, product manager for Tableau, for the guest post below:
Many of our customers are turning to Apache Hadoop as they grapple with their big data challenges. Hadoop offers many benefits, such as scalability, economics, and versatility. Even so, adoption to date has largely centered on applications with batch-oriented workloads because of the latency imposed by the MapReduce framework. To increase Hadoop’s usefulness and adoption in the business intelligence space, where users need fast, interactive response times when they ask a question, a new approach was needed.
Cloudera Impala technology moves the ball forward for doing ad hoc visual analytics on Hadoop. In particular, we like Impala for several reasons:
Our thanks to Yves de Montcheuil, Vice President of Marketing for Talend, for the guest post below:
According to Wikipedia, the impala is a medium-sized African antelope whose name comes from the Zulu word for “gazelle.” Like the elephant, it is found in savannas, which may be its link to Hadoop. Impala is also the name of Cloudera’s SQL-on-Apache-Hadoop project, launched in beta at Strata last October and just released as version 1.0.
SQL-on-Hadoop – wait a minute… isn’t that what Apache Hive is for? Well, yes and no. HiveQL certainly brings a set of SQL-like commands to Hadoop data. The big issue with Hive is that it’s slow; more precisely, it’s not interactive. Each query must be compiled into MapReduce jobs and distributed across the cluster, so response times can stretch into minutes, which is highly impractical for interactive use. That’s fine for batch work (response times actually vary little with dataset size), but when users want to mine Hadoop data, run interactive queries or drill-downs, profile data, and so on, they end up spending a lot of time staring at their screens (or fetching more coffee than they should).
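To give a sense of what interactive SQL-on-Hadoop looks like from code, here is a minimal sketch of querying Impala over JDBC. Impala speaks the HiveServer2 protocol, so the standard Hive JDBC driver can be pointed at Impala’s default JDBC port (21050); the host and table name below are placeholders, and `;auth=noSasl` assumes an unsecured test cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2-compatible driver; Impala listens on 21050 by default.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:21050/;auth=noSasl";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // "events" is a placeholder table name for illustration.
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM events")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}
```

The same statement could be sent to Hive’s JDBC endpoint unchanged; the difference is latency, since Impala executes it with its own long-running daemons rather than launching MapReduce jobs.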
A World-Class EDW Requires a World-Class Hadoop Team
Persado is the global leader in persuasion marketing technology, a new category in digital marketing. Our revolutionary technology maps the genome of marketing language and generates the messages that work best for any customer and any product at any time. To ensure the highest-quality experience for both our clients and end users, our engineering team collaborates with Ph.D. statisticians and data analysts to develop new ways to segment audiences, discover content, and deliver the most relevant and effective marketing messages in real time.
Given the challenge of creating a market based on ongoing data collection and massive query capability, the data warehouse organization ultimately plays the most important role in the persuasion marketing value chain, assuring a steady, unobstructed, multidirectional flow of information. My team continuously ensures that Persado’s infrastructure is aligned with the needs of our data scientists, which includes regularly generating KPI reports, managing data from heterogeneous sources, preparing customized analyses, and even implementing specific statistical algorithms in Java based on reference implementations in R.
This guest post comes to us from David Greco, CTO of Eligotech.
Vagrant is a very nice tool for programmatically managing many virtual machines (VMs) on a single physical machine. It natively supports VirtualBox and also provides plugins for VMware Fusion and Amazon EC2, supporting the management of VMs in those environments as well.
Vagrant provides a very easy-to-use, Ruby-based internal DSL that lets the user define one or more virtual machines together with their configuration parameters. It also offers several mechanisms for automatic provisioning: you can use Puppet, Chef, or shell scripts to automate software installation and configuration on the machines defined in the Vagrant configuration file.
As a follow-up to a previous post about the Impala demo he built during Data Hacking Day, Alan Gardner from Pythian has deployed the app for a limited time on Amazon EC2. We republish his original post below.
A little while ago I blogged about (and open sourced) a Cloudera Impala-powered soccer visualization demo, designed to demonstrate just how responsive Impala queries can be. Since not everyone has the time or resources to run the project themselves, we’ve decided to host it ourselves on an EC2 instance. [Note: instance live only for one week!] You can try the visualization; we’ve also opened up the Impala web interface, where you can see query profiles and performance numbers, and Hue (username and password are both ‘test’), where you can run your own queries on the dataset.
Deploying Impala on EC2
While there are many tools for deploying a Hadoop cluster on EC2 – like Apache Whirr, or even Cloudera Manager – I wanted to use only a single instance for the entire cluster. Starting from the base Ubuntu (Precise) image, I added Cloudera’s apt repos and installed the single-node configuration. Impala doesn’t support using Derby for the Hive metastore, so I installed MySQL and configured Hive to use it instead. Then I installed Impala using Cloudera’s instructions. Impala and all of the Hadoop daemons are running comfortably on one m3.2xlarge EC2 instance. Given our modest demands, this may actually be overkill; I over-spec’ed the server while trying to find a (now obvious) performance problem involving short-circuit reads.
The following FAQ is provided by James Taylor of Salesforce, which recently open-sourced its Phoenix client-embedded JDBC driver for low-latency queries over HBase. Thanks, James!
What is this new Phoenix thing I’ve been hearing about?
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.
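As a minimal sketch of what that looks like, assuming the Phoenix client jar is on the classpath and an HBase/ZooKeeper quorum is reachable at localhost (the table and column names are made up for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExample {
    public static void main(String[] args) throws Exception {
        // Phoenix driver class as of the initial open-source release.
        Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");

        // The JDBC URL names the ZooKeeper quorum fronting HBase.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:localhost")) {
            conn.setAutoCommit(true); // Phoenix buffers mutations until commit
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
                        + "host VARCHAR NOT NULL PRIMARY KEY, "
                        + "total BIGINT)");
                // Phoenix uses UPSERT rather than INSERT.
                stmt.execute("UPSERT INTO metrics VALUES ('web01', 42)");
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT host, total FROM metrics")) {
                    while (rs.next()) {
                        System.out.println(
                            rs.getString("host") + " = " + rs.getLong("total"));
                    }
                }
            }
        }
    }
}
```

Under the covers, Phoenix compiles these statements into native HBase scans and mutations rather than routing them through an intermediate query layer.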
Doesn’t putting an extra layer between my application and HBase just slow things down?
Actually, no. Phoenix achieves performance as good as, and likely better than, what you would get by hand-coding against the HBase client APIs yourself (not to mention with a heck of a lot less code) by:
The following guest post comes from Alejandro Caceres, president and CTO of Hyperion Gray LLC – a small research and development shop focusing on open-source software for cyber security.
Imagine this: You’re an informed citizen, active in local politics, and you decide you want to support your favorite local political candidate. You go to his or her new website and make a donation, providing your bank account information, name, address, and telephone number. Later, you find out that the website was hacked and your bank account and personal information stolen. You’re angry that your information wasn’t better protected — but at whom should your anger be directed?
Who is responsible for the generally weak state of website security today? It can’t be website operators: there’s no prerequisite to understand blind SQL injection attacks or validation filters before spinning up a website. It can’t be website developers either; we simply don’t equip them to evaluate website security for themselves. The community that focuses on both web development and web security is pretty small, and it’s pretty opaque.