Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
The conclusion to this series covers how to use scans, and considerations for choosing the Thrift or REST APIs.
In this series of how-tos, you have learned how to use Apache HBase’s Thrift interface. Part 1 covered the basics of the API, working with Thrift, and some boilerplate code for connecting to Thrift. Part 2 showed how to insert and to get multiple rows at a time. In this third and final post, you will learn how to use scans and some considerations when choosing between REST and Thrift.
Scanning with Thrift
The following post, by Sarah Cannon of Digital Reasoning, was originally published in that company’s blog. Digital Reasoning has graciously permitted us to re-publish here for your convenience.
At the beginning of each release cycle, engineers at Digital Reasoning are given time to explore the latest in Big Data technologies, examining how the frequently changing landscape might be best adapted to serve our mission. As we sat down in the early stages of planning for Synthesys 3.8 one of the biggest issues we faced involved reconciling the tradeoff between flexibility and performance. How can users quickly and easily retrieve knowledge from Synthesys without being tied to one strict data model?
The integration of Apache Sentry with Apache Solr helps Cloudera Search meet important security requirements.
As you have learned in previous blog posts, Cloudera Search brings the power of Apache Hadoop to a wide variety of business users via the ease and flexibility of full-text querying provided by Apache Solr. We have also done significant work to make Cloudera Search easy to add to an existing Hadoop cluster:
HBaseCon 2014 “Operations” track reveals best practices used by some of the world’s largest production-cluster operators.
Meet David F. McCoy, one of the first to have earned the title “CCP: Data Scientist” from Cloudera University.
Big Data success requires professionals who can prove their mastery with the tools and techniques of the Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address the Big Data talent gap with the Cloudera Certified Professional (CCP) program.
Find Cloudera tech talks in Amsterdam, Boston, Berlin, Sao Paulo, Singapore, Zurich, and other cities across Europe and the US during the next calendar quarter.
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees – this time, for the second calendar quarter of 2014 (April through June). Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).
Our thanks to Russell Cardullo and Michael Ruggiero, Data Infrastructure Engineers at Sharethrough, for the guest post below about its use case for Spark Streaming.
At Sharethrough, which offers an advertising exchange for delivering in-feed ads, we’ve been running on CDH for the past three years (after migrating from Amazon EMR), primarily for ETL. With the launch of our exchange platform in early 2013 and our desire to optimize content distribution in real time, our needs changed, yet CDH remains an important part of our infrastructure.
The CDH software stack lets you use your tool of choice with the Parquet file format – - offering the benefits of columnar storage at each phase of data processing.
An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:
The HBaseCon 2014 General Session – with keynotes by Facebook, Google, and Salesforce.com engineers – is arguably the best ever.
HBaseCon 2014 (May 5, 2014 in San Francisco) is coming very, very soon. Over the next few weeks, as I did for last year’s conference, I’ll be bringing you sneak previews of session content (across Operations, Features & Internals, Ecosystem, and Case Studies tracks) accepted by the Program Committee.
This quick demo illustrates how easy it is to implement role-based access and control in Impala using Sentry.
Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.
The guest post below was originally authored by Pinterest engineer Raghavendra Prabhu and published by the Pinterest Engineering blog. Being big ZooKeeper fans, we re-publish it here for your convenience. Thanks, Pinterest!
Apache ZooKeeper is an open source distributed coordination service that’s popular for use cases like service discovery, dynamic configuration management and distributed locking. While it’s versatile and useful, it has failure modes that can be hard to prepare for and recover from, and if used for site critical functionality, can have a significant impact on site availability.
Oozie’s new HA qualities help cluster operators sleep well at night. Here’s how it works.
One of the big new features in CDH 5 for Apache Oozie is High Availability (HA). In designing this feature, the Oozie team at Cloudera had two main goals: 1) Don’t change the API or usage patterns, and 2) the user shouldn’t even have to know that HA is enabled. In other words, we wanted Oozie HA to be as easy and transparent as possible.
Sure, Spark is fast, but it also gives developers a positive experience they won’t soon forget.
Apache Spark is well known today for its performance benefits over MapReduce, as well as its versatility. However, another important benefit – the elegance of the development experience – gets less mainstream attention.
Cost-per-performance, not cost-per-capacity, turns out to be the better metric for evaluating the true value of SSDs.
In the Big Data ecosystem, solid-state drives (SSDs) are increasingly considered a viable, higher-performance alternative to rotational hard-disk drives (HDDs). However, few results from actual testing are available to the public.
Users of diverse, real-world HBase deployments around the world present at this year’s event.
This year’s agenda for HBaseCon, the conference for the Apache HBase community (developers, operators, contributors), looks “Stack-ed” with can’t-miss keynotes and breakouts. Program committee, you really came through (again).
In this installment of “Meet the Instructor”, our interview subject is Bruce Martin.
What is your role at Cloudera?
February being a short month, the list is relatively short — but never confuse quantity with quality!
Understanding how checkpointing works in HDFS can make the difference between a healthy cluster or a failing one.
Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health. However, checkpointing can also be a source of confusion for operators of Apache Hadoop clusters.
Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.
Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)
Hue users can learn a lot about new features by following a steady stream of new demos.
Hue, the open source Web UI that makes Apache Hadoop easier to use, is now a standard across the ecosystem — shipping within multiple software distributions and sandboxes. One of the reasons for its success is an agile developer community behind it that is constantly rolling out new features to its users.
Cloudera’s own enterprise data hub is yielding great results for providing world-class customer support.
Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.
Hadoop 2.3.0 includes hundreds of new fixes and features, but none more important than HDFS caching.
The Apache Hadoop community has voted to release Hadoop 2.3.0, which includes (among many other things):
Learn how to use Cloudera Search along with RBL-JE to search and index documents in multiple languages.
Our thanks to Basis Technology for providing the how-to below!
Bringing Parquet support to Hive was a community effort that deserves congratulations!
Previously, this blog introduced Parquet, an efficient ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that blog post, Parquet encodes data extremely efficiently and as described in Google’s original Dremel paper. (For more technical details on the Parquet format read Dremel made simple with Parquet, or go directly to the open and community-driven Parquet Format specification.)
Integrating Hue with LDAP can help make your secure Hadoop apps as widely consumed as possible.
Hue, the open source Web UI that makes Apache Hadoop easier to use, easily integrates with your corporation’s existing identity management systems and provides authentication mechanisms for SSO providers. So, by changing a few configuration parameters, your employees can start analyzing Big Data in their own browsers under an existing security policy.
Thanks to the improvements described here, CDH 5 will ship with a version of MapReduce 2 that is just as fast (or faster) than MapReduce 1.
Performance fixes are tiny, easy, and boring, once you know what the problem is. The hard work is in putting your finger on that problem: narrowing, drilling down, and measuring, measuring, measuring.
This FAQ contains answers to the most frequently asked questions about the architecture and configuration choices involved.
In December 2013, Cloudera and Amazon Web Services (AWS) announced a partnership to support Cloudera Enterprise on AWS infrastructure. Along with this announcement, we released a Deployment Reference Architecture Whitepaper. In this post, you’ll get answers to the most frequently asked questions about the architecture and the configuration choices that have been highlighted in that whitepaper.
Cloudera has released the Beta 2 version of Cloudera Enterprise 5 (comprises CDH 5.0.0 and Cloudera Manager 5.0.0).
This release (download) contains a number of new features and component versions including the ones below:
Migrating from the Hive CLI to Beeline isn’t as simple as changing the executable name, but this post makes it easy nonetheless.
In its original form, Apache Hive was a heavyweight command-line tool that accepted queries and executed them utilizing MapReduce. Later, the tool split into a client-server model, in which HiveServer1 is the server (responsible for compiling and monitoring MapReduce jobs) and Hive CLI is the command-line interface (sends SQL to the server).
With the close of 2013, we also thought it appropriate to include some high points from across the year (not listed in any particular order):
Create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.
Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets. Giraph was inspired by Google’s Pregel framework and integrates well with Apache Accumulo, Apache HBase, Apache Hive, and Cloudera Impala.
Cloudera is announcing the general availability of support for Spark, bringing interactive machine learning and stream processing to enterprise data hubs.
Cloudera is pleased to announce the immediate availability of its first release of Apache Spark for Cloudera Enterprise (comprising CDH and Cloudera Manager).
Thanks to Xavier Clements of Wajam for allowing us to re-publish his blog post about Wajam’s Hadoop experiences below!
Wajam is a social search engine that gives you access to the knowledge of your friends. We gather your friends’ recommendations from Facebook, Twitter, and other social platforms and serve these back to you on supported sites like Google, eBay, TripAdvisor, and Wikipedia.
Set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager.
Thanks to Christian Javet for his permission to republish his blog post below!
These suggestions from the Program Committee offer an inside track to getting your talk accepted!
With HBaseCon 2014 (in San Francisco on May 5) Call for Papers closing in just over three weeks (on Feb. 14 — sooner than you think), there’s no better time than “now” to start thinking about your proposal.
Cloudera provides docs and a sample build environment to help you get easily started writing your own Impala UDFs.
User-defined functions (UDFs) let you code your own application logic for processing column values during a Cloudera Impala query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
In this installment of “Meet the Engineer” we speak with Romain Rigaux, a Software Engineer on the Hue team.
What do you do at Cloudera, and in which project are you involved?
Currently I work on Hue, the open source Web interface that lets users do Big Data analysis directly from their browser. Its goal is to make that process easier, so that more users can get more insights, more quickly.
The Cloudera QuickStart VM is an important platform for learning any Hadoop-related curriculum.
In the Fall 2013 semester, more than 30 NYU graduate students completed the Real-time and Big Data Analytics course at the NYU Courant Institute of Mathematical Sciences, for which I served as instructor.
The third-annual HBaseCon is now open for business. Submit your paper or register today for early bird savings!
Seems like only yesterday that droves of Apache HBase developers, committers/contributors, operators, and other enthusiasts converged in San Francisco for HBaseCon 2013 — nearly 800 of them, in fact.
Impala’s speed now beats the fastest SQL-on-Hadoop alternatives. Test for yourself!
Since the initial beta release of Cloudera Impala more than one year ago (October 2012), we’ve been committed to regularly updating you about its evolution into the standard for running interactive SQL queries across data in Apache Hadoop and Hadoop-based enterprise data hubs. To briefly recap where we are today:
With the close of 2013, we also thought it appropriate to include some high points from across the year (not listed in any particular order):
The new Cloudera Developer Newsletter makes its debut in January 2014.
Developers and data scientists, we’re realize you’re special – as are operators and analysts, in their own particular ways.
Find Cloudera tech talks in Berlin, Budapest, London, Stockholm, Tokyo, and across the US during this calendar quarter.
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees – this time, for the first calendar quarter of 2014 (January through March). Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).
Join us at Cloudera’s San Francisco office on Feb. 20 for tech talks, T-shirts, and adult refreshments!
As an extension of the DeveloperWeek Conf & Festival 2014 experience in San Francisco next month, join us at Cloudera’s San Francisco office for a Developer Happy Hour (beer + tech talks), focusing on Apache Hadoop 2 application development. Anyone (attendees or non) is free to attend, but RSVP now because seats (and “Data is the New Bacon” T-shirts) are limited!
Oracle DBAs, get answers to many of your most common questions about getting started with Hadoop.
As a former Oracle DBA, I get a lot of questions (most welcome!) from current DBAs in the Oracle ecosystem who are interested in Apache Hadoop. Here are few of the more frequently asked questions, along with my most common replies.
The team behind Hue, the open source Web UI that makes Apache Hadoop easier to use, strikes again with a new Spark app.
Editor’s note: This post was recently published on the Hue blog. We republish it here for your convenience.
From Python, to ZooKeeper, to Impala, to Parquet, blog readers in 2013 were interested in a variety of topics.
Clouderans and guest authors from across the ecosystem (LinkedIn, Netflix, Concurrent, Etsy, Stripe, Databricks, Oracle, Tableau, Alteryx, Talend, Twitter, Dell, Concurrent, SFDC, Endgame, MicroStrategy, Hazy Research, Wibidata, StackIQ, ZoomData, Damballa, Mu Sigma) published prolifically on the Cloudera Developer blog in 2013, with more than 250 new posts — basically, averaging one per business day.
Apache Accumulo is now generally available on CDH 4.
Cloudera is pleased to announce the immediate availability of its first release of Accumulo packaged to run under CDH, our open source distribution of Apache Hadoop and related projects and the foundational infrastructure for Enterprise Data Hubs.
More and more customers are using automation/configuration management frameworks alongside Cloudera Manager.
As Apache Hadoop clusters continue to grow in size, complexity, and business importance as the foundational infrastructure for an Enterprise Data Hub, the use cases for a robust and mature management console expand.
CDK has a new monicker, but the goals remain the same.
We are pleased to announce a new name for the Cloudera Development Kit (CDK): Kite. We’ve just released Kite version 0.10.0, which is purely a rename of CDK 0.9.0.