Cloudera Blog · Connector Posts

Sqoop Graduation Meetup

This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/sqoop_graduation_meetup

Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the attendees had not used Sqoop before but were interested enough to take part in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption.

This was not only Sqoop’s second Meetup but also a celebration of Sqoop’s graduation from the Incubator, which cemented its status as a Top-Level Project at the Apache Software Foundation. Sqoop has come a long way since its beginnings three years ago as a contrib module for Apache Hadoop submitted by Aaron Kimball. It was therefore fitting that Aaron gave the first talk of the night, on the project’s history: “Sqoop: The Early Days.” From Aaron, we learned that Sqoop’s original name was “SQLImport” and that it was conceived out of his frustration at being unable to easily query unstructured and structured data at the same time.

Cloudera Connector for Tableau Has Been Released

Earlier today, Cloudera proudly released the Cloudera Connector for Tableau. The connector serves both Tableau users who want to expand the volume of data they work with and Hadoop users who want analysts, such as Tableau users, to make the data within Hadoop more meaningful. Enterprises can now extract the full value of big data and allow a new class of power users to interact with Hadoop data in ways they previously could not.

The Cloudera Connector for Tableau is a free ODBC Driver that enables Tableau Desktop 7.0 to connect to Apache Hive. Tableau users can thus leverage Hive, Hadoop’s data warehouse system, as a data source for all the maps, charts, dashboards and other artifacts typically generated within Tableau.

Hive itself is a powerful query engine optimized for analytic workloads, and that is where this Connector is sure to work best. However, Tableau also lets users ingest result sets from Hive into its in-memory analytical engine, so that results returned from Hadoop can be analyzed much more quickly.

Apache Sqoop: Highlights of Sqoop 2

This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop

Apache Sqoop (incubating) was created to efficiently transfer bulk data between Hadoop and external structured datastores, such as RDBMS and data warehouses, because databases are not easily accessible by Hadoop. Sqoop is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.

The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, to enhance its functionality, Sqoop needs to fulfill data integration use-cases as well as become easier to manage and operate.

What is Sqoop?

Hadoop World 2011: A Glimpse into Development

The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.

Preview of Development Track Sessions

Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive

Apache Sqoop – Overview

This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview

Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. Users must consider details such as ensuring data consistency, the consumption of production system resources, and preparing data for provisioning the downstream pipeline. Transferring data using scripts is inefficient and time-consuming. Directly accessing data residing on external systems from within MapReduce applications complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.

If 80% of data is unstructured, is it the exception or a new rule?

Ed Albanese leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.

This week’s announcement about the availability of the Cloudera Connector for IBM Netezza marks a major milestone, but not necessarily the one you might expect. It’s not just the delivery of a useful software component; it’s also the introduction of a new generation of data management architectures. For literally decades, data management architecture consisted of an RDBMS, a BI tool and an ETL engine. Those three components assembled together gave you a bona fide data management environment. That architecture has been relevant for long enough to withstand the onslaught of data driven by the introduction of ERP, the rise and fall of client/server and several versions of web architecture. But the machines are unrelenting. They keep generating data. And there’s not just more of it; there is more you can, and often need to, do with it.

The times they are a-changin’, and unstructured data is taking over

Companies of all sizes and in nearly every vertical are increasingly tasked with decoding the information being generated by the machines they rely on most. New data sources are creating new data types, including web data, clickstreams, location data, point of sale, social data, building sensors, vehicle and aircraft data, satellite images, medical images, log files, network data and weather data, just to name a few. These data sources were but a glimmer in the eyes of the forefathers of the RDBMS and were most certainly not accounted for in its design. And yet, the percentage of data that fits into this newer bucket is growing at astounding rates. While at Netezza’s Enzee event this week, I listened to Steve Mills, IBM Senior Vice President and Group Executive for the Software Group, state that more than 80% of the world’s data is unstructured.

So what to do with all of this data?