Cloudera Developer Blog · Security Posts
A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive
One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
Apache Sentry (incubating) is a highly modular system for providing fine-grained role-based authorization to both data and metadata stored on an Apache Hadoop cluster. It currently works out of the box with Apache Hive and Cloudera Impala. In this blog post, you will learn how to use Sentry with Hive.
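For a flavor of what that looks like, here is a minimal, hypothetical sketch of a policy file for Sentry's file-based provider with Hive; the group, role, server, and database names are placeholders:

    # sentry-provider.ini: a minimal, hypothetical policy
    [groups]
    # Members of the "analysts" group (from Hadoop's group mapping) get analyst_role
    analysts = analyst_role

    [roles]
    # analyst_role may only run SELECT against tables in the "analytics" database
    analyst_role = server=server1->db=analytics->table=*->action=select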
There’s good news for users of Hue, the open source web UI that makes Apache Hadoop easier to use: a new SAML 2.0-compliant backend, scheduled to ship in the next release of the Cloudera platform, will provide a better authentication experience for both users and IT.
With this new feature, single sign-on (SSO) authentication can be used instead of Hue credentials. User credentials can thus be managed centrally (a big benefit for IT), and users needn’t log in to Hue if they have already logged in to another web application sharing the SSO (a big benefit for users).
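As a rough illustration, the new backend is configured through Hue's hue.ini; the sketch below is hypothetical, and exact key names and paths may vary by Hue version:

    # hue.ini: hypothetical sketch of the SAML 2.0 backend settings
    [desktop]
      [[auth]]
      # Hand authentication off to the SAML backend instead of Hue credentials
      backend=libsaml.backend.SAML2Backend

    [libsaml]
      # xmlsec1 binary used to sign and validate SAML assertions
      xmlsec_binary=/usr/bin/xmlsec1
      # Metadata published by the identity provider (IdP)
      metadata_file=/etc/hue/saml/idp-metadata.xml
      # Key/certificate pair used to sign requests to the IdP
      key_file=/etc/hue/saml/key.pem
      cert_file=/etc/hue/saml/cert.pem
      # Create Hue users automatically on first SSO login
      create_users_on_login=true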
Every day, more data, users, and applications are accessing ever-larger Apache Hadoop clusters. Although this is good news for data-driven organizations, it leaves security administrators and compliance officers with lingering questions about how to enable end users on existing Hadoop infrastructure without compromising security or compliance requirements.
While Hadoop has strong security at the filesystem level, it lacks the granular support needed to adequately secure access to data by users and BI applications. Today, this problem forces organizations in industries for which security is paramount (such as financial services, healthcare, and government) to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.
Today, Cloudera is excited to launch Sentry, a new open source project that addresses these concerns. Sentry is an authorization module for Hadoop that provides the granular, role-based authorization required to give precise levels of access to the right users and applications. Its new support for role-based authorization, fine-grained authorization, and multi-tenant administration allows Hadoop operators to store more sensitive data in Hadoop, extend access to that data to more end users, enable new multi-user applications, and comply with regulations such as SOX, PCI, and HIPAA.
Apache Hive was one of the first projects to bring higher-level languages to Apache Hadoop. Specifically, Hive enables the legions of trained SQL users to use industry-standard SQL to process their Hadoop data.
However, as you probably have gathered from all the recent community activity in the SQL-over-Hadoop area, Hive has a few limitations for users in the enterprise space. Until recently, two in particular – concurrency and security – were largely unaddressed.
To address these gaps, Cloudera engineers built and contributed new infrastructure to Hive 0.11. In this post, you’ll learn why it’s needed and how it works.
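That infrastructure centers on HiveServer2, introduced around Hive 0.11, which lets many clients connect concurrently over JDBC with authentication. Below is a minimal, hypothetical sketch of a JDBC client; the hostname, port, database, and credentials are placeholders:

    // A minimal sketch of querying HiveServer2 over JDBC.
    // Host, port, database, and credentials below are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveServer2Demo {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver that ships with Hive 0.11+
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hs2-host.example.com:10000/default", "alice", "secret");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SHOW TABLES");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }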
Thanks to Steven Noels, SVP of Products for NGDATA, for the guest post below.
NGDATA builds and sells Lily, the next-generation Customer Intelligence Platform that helps enterprise marketing teams collect and store customer interaction data in order to profile, segment, and present better offers. We designed Lily from the ground up to run on Apache HBase and Apache Solr. By combining these technologies with our deep marketing-segmentation expertise and unique machine-learning techniques, we’re able to deliver interactive data management, real-time statistical calculations, and faceted search views of customers, offers, interactions, and the permutations they each inspire.
The team at NGDATA has been working since mid-2010 on HBase triggers (or update notifications, if you will), which we use in Lily to sync Solr with HBase, make HBase freely searchable, compute indexed views for data exploration, and feed our online machine-learning engine with customer behavior information. The foundation of our platform, the Lily Data Repository, based on the combination of HBase and Solr, is being used by large banks, media companies, and pharmaceutical firms who value combining Apache Hadoop’s data storage and parallel data-processing framework with ad hoc search and discovery through Solr.
The following guest post comes from Alejandro Caceres, president and CTO of Hyperion Gray LLC – a small research and development shop focusing on open-source software for cyber security.
Imagine this: You’re an informed citizen, active in local politics, and you decide you want to support your favorite local political candidate. You go to his or her new website and make a donation, providing your bank account information, name, address, and telephone number. Later, you find out that the website was hacked and your bank account and personal information stolen. You’re angry that your information wasn’t better protected — but at whom should your anger be directed?
Who is responsible for the generally weak state of website security today? It can’t be website operators: there’s no prerequisite to understand blind SQL injection attacks or validation filters before spinning up a website. It can’t be website developers either; we simply don’t equip them to evaluate website security for themselves. The community that focuses on both web development and web security is small, and it’s pretty opaque.
Hadoop network encryption is a feature introduced in Apache Hadoop 2.0.2-alpha and in CDH4.1.
In this blog post, we’ll first cover Hadoop’s pre-existing security capabilities, explain why network encryption may be required, and provide some details on how it has been implemented. At the end, you’ll find step-by-step instructions for setting up a Hadoop cluster with network encryption.
A Bit of History on Hadoop Security
Starting with Apache Hadoop 0.20.20x, and available in the Hadoop 1 and Hadoop 2 releases (as well as CDH3 and CDH4), Hadoop supports Kerberos-based authentication, commonly referred to as Hadoop Security. When Hadoop Security is enabled, users must authenticate (using Kerberos) in order to read and write data in HDFS or to submit and manage MapReduce jobs. In addition, all Hadoop services authenticate with each other using Kerberos.
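To make that concrete, here is a rough sketch of the relevant configuration properties; the values are illustrative, and your distribution’s documentation covers the full procedure (keytabs, principals, and so on):

    <!-- core-site.xml: turn on Kerberos authentication and service-level authorization -->
    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>
    <!-- Encrypt Hadoop RPC traffic; "privacy" adds encryption on top of
         authentication and integrity checking -->
    <property>
      <name>hadoop.rpc.protection</name>
      <value>privacy</value>
    </property>

    <!-- hdfs-site.xml: encrypt the HDFS data-transfer protocol (Hadoop 2.0.2-alpha/CDH4.1+) -->
    <property>
      <name>dfs.encrypt.data.transfer</name>
      <value>true</value>
    </property>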
With the default Apache HBase configuration, everyone is allowed to read from and write to all tables available in the system. For many enterprise setups, this kind of policy is unacceptable.
Administrators can set up firewalls that decide which machines are allowed to communicate with HBase. However, machines that can pass the firewall are still allowed to read from and write to all tables. This kind of mechanism is effective but insufficient because HBase still cannot differentiate between multiple users that use the same client machines, and there is still no granularity with regard to HBase table, column family, or column qualifier access.
In this post, we will discuss how Kerberos is used with Hadoop and HBase to provide User Authentication, and how HBase implements User Authorization to grant users permissions for particular actions on a specified set of data.
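As a preview, once Kerberos authentication and HBase’s AccessController coprocessor are enabled, permissions are granted and inspected from the HBase shell; the user, table, and column-family names below are hypothetical:

    hbase> grant 'alice', 'RW', 'sales', 'cf1'    # read/write on one column family
    hbase> grant 'bob', 'R', 'sales'              # read-only on the whole table
    hbase> user_permission 'sales'                # list the permissions on a table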
Secure HBase: Authentication & Authorization
One of the more confusing topics in Hadoop is how authorization and authentication work in the system. The first and most important thing to recognize is the subtle yet critical distinction between the two, so let’s define these terms first:
Authentication is the process of determining whether someone is who they claim to be.
Authorization is the function of specifying access rights to resources.
What Are Kerberos & SPNEGO?
Kerberos is an authentication protocol that provides mutual authentication and single sign-on capabilities.
SPNEGO (Simple and Protected GSS-API Negotiation Mechanism) is a mechanism for negotiating which authentication protocol two peers will use; one notable application is Kerberos authentication over HTTP.
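For example, a client that already holds a Kerberos ticket can authenticate to a SPNEGO-protected Hadoop HTTP endpoint such as WebHDFS; the principal and hostname below are placeholders:

    $ kinit alice@EXAMPLE.COM
    $ curl --negotiate -u : "http://namenode.example.com:50070/webhdfs/v1/tmp?op=LISTSTATUS"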