Cloudera Developer Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
Hue, the open source Web UI that makes Apache Hadoop easier to use, has a brand-new application that enables transferring data between relational databases and Hadoop. This new application is driven by Apache Sqoop 2 and has several user experience improvements, to boot.
Sqoop is a batch data migration tool for transferring data between traditional databases and Hadoop. The first version of Sqoop is a heavy client that drives and oversees data transfer via MapReduce. In Sqoop 2, the majority of the work moved to a server that a thin client communicates with, and any client can talk to the Sqoop 2 server over its JSON-REST protocol. Sqoop 2 was chosen over its predecessor because of this client-server design.
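To make the client-server point concrete, here is a minimal sketch that queries the Sqoop 2 server's REST interface directly from Java using only the standard library. The host, the default port 12000, and the /sqoop/version endpoint path are assumptions drawn from the 1.99.x setup and may differ on your cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SqoopRestProbe {
    public static void main(String[] args) throws Exception {
        // Assumed Sqoop 2 server location: port 12000 and the /sqoop context
        // are the 1.99.x defaults; change host/port/path for your install.
        URL url = new URL("http://localhost:12000/sqoop/version");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        // The server replies with a small JSON document describing its version.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}

Any language with an HTTP client can do the same, which is what lets a lightweight UI like Hue drive Sqoop 2 without embedding a heavy client.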
Importing from MySQL to HDFS
The following is the canonical import job example sourced from http://sqoop.apache.org/docs/1.99.2/Sqoop5MinutesDemo.html. In Hue, this can be done in three easy steps.
In the afterglow of Strata + Hadoop World 2013, speaker slides and videos have been posted. For your convenience, they are aggregated below:
In my previous post, you learned how to index email messages in batch mode and in near real time using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, which is maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)
Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:
In software development, there is no substitute for having choices. Furthermore, freedom of choice among frameworks, APIs, and languages is a major fuel source for platform adoption across any successful ecosystem.
In the case of development on CDH, the open source core of Cloudera’s Big Data platform containing Apache Hadoop and related ecosystem projects, the choices have expanded dramatically in the past three weeks:
Among Cloudera’s engineer-presenters at Strata + Hadoop World 2013 this week, Philip Zeyliger (“Tricks for Distributed System Debugging and Diagnosis”) was particularly fortunate to have been interviewed by O’Reilly Media editor Meghan Blanchette on camera.
In the following 8-minute interview, Philip offers an overview of common pain points and failures when debugging distributed systems:
For those of you attending virtually/in spirit, I thought it would be nice to bring you a selection of photos from the week so far. Credit goes to Alex Moundalexis (@technmsg) for the majority of these shots.
Kate Ting, Apache Sqoop cookbook co-chef.
Julien Le Dem and Nong Li present on Parquet and Impala.
Thanks to Victor Bittorf, a visiting graduate computer science student at Stanford University, for the guest post below about how to use the new prebuilt analytic functions for Cloudera Impala.
Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared-nothing parallel analytics with the new user-defined aggregation (UDA) framework in Impala 1.2 to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.
Having recently completed my master’s degree while working in the database systems group at the University of Wisconsin-Madison, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.
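To give a feel for how such aggregates are invoked once installed, here is a minimal sketch that queries Impala from Java over JDBC. The aggregate name logr and the table and column names are placeholders rather than the release's actual function names, and the connection string follows Impala's documented Hive-JDBC setup for an unsecured cluster; adjust both for your environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaUdaExample {
    public static void main(String[] args) throws Exception {
        // Impala speaks the HiveServer2 protocol; 21050 is its default JDBC port.
        // This assumes an unsecured cluster (no Kerberos) and the Hive JDBC
        // driver on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://impalad-host:21050/;auth=noSasl");

        // "logr", "training_data", "features", and "label" are placeholder names;
        // substitute whatever aggregate the analytics package registers and your
        // own table and columns.
        String sql = "SELECT logr(features, label) FROM training_data";

        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
        conn.close();
    }
}

Because the algorithm runs as a UDA inside Impala, the heavy lifting happens in parallel on the cluster and only the aggregated model comes back to the client.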
We are pleased to announce the beta release of Cloudera Enterprise 5 (CDH 5 and Cloudera Manager 5). This release has both Cloudera Impala and Cloudera Search integrated into CDH. It also brings many new features and updated component versions, including the ones below:
We are just a weekend away from the Biggest. Strata + Hadoop World. Ever.
Not to make non-attendees feel bad — the show sold out a couple weeks ago — but some great things are in store for technical end-users (as well as everyone else):
The rise of Big Data has been pushing search engines to handle ever-increasing amounts of data. While building Cloudera Search, one of the things we considered in Cloudera Engineering was how we would integrate Apache Solr with Apache Hadoop in a way that would enable near-real-time indexing and searching on really big data.
Eventually, we built Cloudera Search on Solr and Apache Lucene, both of which have been adding features at an ever-faster pace to aid in handling more and more data. However, there is no silver bullet for dealing with extremely large-scale data. A common answer in the world of search is “it depends,” and that answer applies in large-scale search as well. The right architecture for your use case depends on many things, and your choice will generally be guided by the requirements and resources for your particular project.
We wanted to make sure that one simple scaling strategy, commonly used in the past for large amounts of time-series data, would also be easy to set up with Cloudera Search. By “time-series data,” I mean logs, tweets, news articles, market data, and so on: data that is continuously generated and easily associated with a current timestamp.
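One common form of that strategy is to partition the index by time window, with each window living in its own collection. The SolrJ sketch below illustrates only the idea; the collection naming scheme, Solr URL, and field names are assumptions made for the example, not settings from Cloudera Search.

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TimePartitionedIndexer {
    public static void main(String[] args) throws Exception {
        Date eventTime = new Date();

        // Route each document to a collection named after its time window,
        // e.g. "logs_2013_10". Old windows can then be aged off by dropping
        // whole collections rather than deleting individual documents.
        String collection = "logs_" + new SimpleDateFormat("yyyy_MM").format(eventTime);
        SolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/" + collection);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "event-12345");
        doc.addField("timestamp", eventTime);
        doc.addField("message", "example log line");

        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}

The appeal of this pattern for continuously generated data is that queries over recent windows hit only small, hot collections, while retention becomes a cheap collection-level operation.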