Our thanks to Yves de Montcheuil, Vice President of Marketing for Talend, for the guest post below:
According to Wikipedia, the impala is a medium-sized African antelope; its name comes from the Zulu word for “gazelle”. Like elephants, it is found in savannas, and this may be the link with Hadoop. Impala is also the name of Cloudera’s SQL-on-Apache Hadoop project, launched in beta at Strata last October and just released in version 1.0.
SQL-on-Hadoop – wait a minute… isn’t that what Apache Hive is for? Well, yes and no. HiveQL certainly brings a set of SQL-like commands to Hadoop data. The big issue with Hive: it’s very slow. More precisely, it’s not interactive. Queries take a long time to be “parsed” and distributed across the cluster. Response times can run to a minute or more, which is highly impractical for interactive use. It works fine for batch use (response times actually don’t vary much based on the dataset size), but when users want to mine Hadoop data, perform interactive queries or drill-downs, profile data, etc. – they end up spending lots of time glaring at their screen (or fetching more coffee than they should).
As use cases for Hadoop evolve past batch requirements, often for “mundane” tasks such as ETL offload or online archiving, and enterprises discover the value of real-time data exploration, mining and analytics on big data, interactive performance becomes a must.
A great example of data exploration is data profiling. Since v5.2, Talend has been providing native data profiling on Hadoop. Based on Hive, profiling is performed “in place”, which means that data does not need to be extracted from Hadoop before being profiled. The issue here is the time it takes to instantiate the profiling job on MapReduce – not very practical for interactive profiling, even though it works well for batch. Impala has the potential to speed up this process and make it more efficient. And since Impala is native to Hadoop, it provides access to the same data sets that have already been loaded – no need to replicate/duplicate the data.
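For readers who have not seen a profiling query, here is a rough sketch of the kind of aggregate SQL that “in place” column profiling boils down to. It is an illustration only, not Talend’s implementation: the `profile_column` helper, the `customers` table and its columns are all hypothetical, and Python’s built-in `sqlite3` module stands in for a Hive or Impala connection so that the snippet runs anywhere. The aggregates used (`COUNT`, `COUNT(DISTINCT ...)`, `MIN`, `MAX`) are standard SQL.

```python
import sqlite3


def profile_column(conn, table, column):
    """Compute basic profiling metrics for one column with a single aggregate query.

    Hypothetical helper for illustration. The table and column names are
    interpolated directly into the SQL, so they must come from trusted input.
    """
    sql = (
        f"SELECT COUNT(*), "                # total number of rows
        f"COUNT({column}), "                # rows where the column is not NULL
        f"COUNT(DISTINCT {column}), "       # distinct non-NULL values
        f"MIN({column}), MAX({column}) "    # value range
        f"FROM {table}"
    )
    row = conn.execute(sql).fetchone()
    keys = ("row_count", "non_null_count", "distinct_count", "min_value", "max_value")
    return dict(zip(keys, row))


# Stand-in data set: an in-memory SQLite table instead of data stored in Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "FR"), (2, "US"), (3, None), (4, "FR")],
)

profile = profile_column(conn, "customers", "country")
print(profile)
```

The point of the sketch is that the work happens where the data lives: only a handful of summary values travel back to the client. How quickly those values come back is exactly where a batch engine and an interactive one differ.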
Talend and Cloudera have been partners for a long time. We started to support Hadoop in its infancy, well before all the hype started – and that puts us in a unique situation to leverage new and upcoming technologies. Developments such as Impala are clearly providing value and we look forward to more exciting news from the broad Hadoop ecosystem!
PS: Am I jumping to conclusions as to why the Impala name was picked, with the reference to antelopes and elephants? Comments from engineers/product managers working on the project are welcome!