This new core security layer provides a unified data access path for all Hadoop ecosystem components, while improving performance.
We’re thrilled to announce the beta availability of RecordService, a distributed, scalable data access service for unified access control and enforcement in Apache Hadoop. RecordService is Apache-licensed open source software that we intend to transition to the Apache Software Foundation. In this post, we’ll explain the motivation, system architecture, performance characteristics, expected use cases, and future work that RecordService enables.
One of the key properties of the Hadoop ecosystem is the decoupling of storage managers (e.g. HDFS, Apache HBase) from compute frameworks (e.g. MapReduce, Impala, Apache Spark). While this decoupling allows for far greater flexibility (pick the framework that best solves your problem), it also makes it harder to ensure that everything works together seamlessly. Furthermore, as Hadoop becomes an increasingly critical infrastructure component for users, the expectations for compatibility, performance, and security also increase.
RecordService is a new core security layer for Hadoop that sits between the storage managers and compute frameworks in order to provide a unified data access path.
RecordService comprises two services, the RecordService Planner and the RecordService Worker. The Planner is responsible for generating tasks, which are the same abstraction as MapReduce’s InputSplit. Each task describes a work unit and its preferred locality. RecordService Workers execute tasks and return reconstructed, filtered records in a canonical wire format. A good way to think about the system is that the Planners provide a layer of abstraction over the metadata services (NameNode, Hive Metastore, Sentry server), and the Workers provide a layer of abstraction over the data stores (DataNode).
The Planner and Worker both contain only soft state and share only minimal state via Apache ZooKeeper. This approach ensures good scalability and fault tolerance.
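To make the split between planning and execution concrete, here is a minimal sketch in Python. Every name in it (`Task`, `Planner`, `Worker`, `plan`, `execute`) is hypothetical and chosen for illustration; it is not the RecordService client API, and a dictionary of bytes stands in for the data store.

```python
from dataclasses import dataclass
from typing import Iterator, List

# All names below are hypothetical; this is not the RecordService client API.


@dataclass(frozen=True)
class Task:
    """A unit of work plus the hosts where running it is preferred."""
    path: str
    offset: int
    length: int
    preferred_hosts: List[str]


class Planner:
    """Stands in for the metadata layer: maps a table name to tasks."""

    def __init__(self, blocks_by_table):
        # blocks_by_table: {table: [(path, offset, length, [hosts]), ...]}
        self._blocks_by_table = blocks_by_table

    def plan(self, table: str) -> List[Task]:
        return [Task(*block) for block in self._blocks_by_table[table]]


class Worker:
    """Stands in for the data layer: turns a task into records."""

    def __init__(self, files):
        # files: {path: bytes}, a toy replacement for a data store
        self._files = files

    def execute(self, task: Task) -> Iterator[str]:
        chunk = self._files[task.path][task.offset:task.offset + task.length]
        for line in chunk.decode("utf-8").splitlines():
            yield line


files = {"/warehouse/t/part-0": b"alice\nbob\ncarol\n"}
planner = Planner({"t": [("/warehouse/t/part-0", 0, 10, ["node1"]),
                         ("/warehouse/t/part-0", 10, 6, ["node2"])]})
worker = Worker(files)

# The client asks for a table by name and only ever sees records.
records = [r for task in planner.plan("t") for r in worker.execute(task)]
print(records)
```

The client in this sketch never handles paths or byte ranges itself: it names a table, receives tasks, and reads records back, which is the division of labor described above.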
RecordService provides these key benefits:
- Fine-grained security enforcement: RecordService enforces column-level permissions (projections), row-level permissions (filtering), and data masking across the Hadoop ecosystem, including frameworks such as MapReduce and Spark where this fine-grained control was not previously possible. RecordService runs as a set of daemons that are isolated from the client jobs; daemons do *not* run arbitrary user code.
- Performance: RecordService is designed to be on the main data access path, meaning it needs to process every byte of data. RecordService scales horizontally to run on the largest Hadoop clusters with high efficiency. It uses the Impala IO layer, which relies on low-level optimizations such as HDFS short-circuit reads and dynamic code generation to improve thread throughput and reduce CPU utilization. RecordService brings these performance benefits to the other components in Hadoop and accelerates their performance, despite adding a new layer in the stack.
- Simplicity: RecordService provides a higher-level, logical abstraction for data. Datasets can be specified as logical names (i.e. tables or views) and RecordService returns objects with a schema (in contrast to the storage APIs, which deal with paths and bytes). This means that applications built on top of the RecordService APIs don’t need to worry about differences in file formats, the underlying storage APIs, and other low-level details.
Integration with Hadoop
RecordService is designed to integrate well with existing Hadoop-based applications. Currently (without RecordService), a simple job execution has the following workflow:
- The client contacts the NameNode to get the blocks and their locations for a particular directory, which are represented as InputSplits.
- The compute framework (MapReduce/Spark) launches map tasks on the worker nodes.
- Each worker node talks to the (ideally local) DataNode to read the data and then performs the computation.
The RecordService architecture ties in closely to each of these steps.
With RecordService, clients talk to the RecordService Planner instead of talking directly to the metadata services, and tasks read their data from the Worker instead of going directly to the data stores. We expect the Planner to run on a few nodes (e.g. three) and the Workers to run on all the nodes with data to optimize for read locality.
To provide simple integration for existing applications, RecordService provides client libraries that implement the common Hadoop InputFormats. We expect many applications to be able to use these as drop-in replacements. For Spark, the InputFormats will work as well, but the client libraries also provide more direct integration with Spark SQL. We have provided a few examples to help you get started.
One of the motivations for RecordService is to support fine-grained (row- and column-level) access control to data, independent of how it is accessed. Until now, such controls existed only within a few frameworks such as Apache Hive and Impala, or at the application level. Permissions sometimes need to be set multiple times over the same data, and in many cases compute frameworks rely on the lowest common denominator of HDFS permissions, where access is all or nothing per file.
RecordService solves these problems by introducing a higher-level abstraction (a record set) on top of files, providing fine-grained access control enforcement across the Hadoop ecosystem. It leverages existing Apache Sentry (currently incubating at the ASF) permissions, including the ability to set permissions using Hive Metastore Views, and additional controls will be added in the future.
As an example, consider a dataset with this schema:
CREATE TABLE dataset(
We’d like to secure the dataset so that a particular set of users has access only to the name, balance, and masked account number (account) for users in a particular region. To do this we can create a view:
CREATE VIEW restricted_dataset AS
FROM dataset WHERE region = "Europe"
We’d then grant access to the view to the appropriate role using Sentry. With this single view and single Sentry grant, a set of users now has access to data secured by column- and row-controls, as well as data masking, and they may access this data natively through Hive, Impala, and—with the introduction of RecordService—MapReduce and Spark.
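As a rough illustration of what such a view does, the sketch below uses Python's built-in `sqlite3` module as a stand-in for the Hive Metastore. The column names, the sample rows, and the masking expression are all assumptions made for this example, and SQLite has no notion of a Sentry grant, so only the projection, filtering, and masking are modeled, not the authorization.

```python
import sqlite3

# Hypothetical schema and data: the column names and the masking expression
# are assumptions made for this sketch, not definitions from the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    name TEXT, account_no TEXT, balance REAL, region TEXT
);
INSERT INTO dataset VALUES
    ('Ana',  '1111222233334444', 120.0, 'Europe'),
    ('Bo',   '5555666677778888',  75.5, 'Asia'),
    ('Cleo', '9999000011112222',  10.0, 'Europe');

-- Projection (three columns), masking (account), and row filter (region).
CREATE VIEW restricted_dataset AS
SELECT name,
       'XXXX-' || substr(account_no, -4) AS account,
       balance
FROM dataset
WHERE region = 'Europe';
""")

rows = conn.execute(
    "SELECT name, account, balance FROM restricted_dataset ORDER BY name"
).fetchall()
for row in rows:
    print(row)
```

A reader of the view sees only rows for the one region, never sees the region column or the full account number, and needs no knowledge of how the underlying table is stored.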
RecordService enforces security on the read path. Unauthorized users cannot read the underlying files at all from the storage manager, regardless of the tool they use to try to access the data, guaranteeing security. On the write (ingest) path, fine-grained access control is less useful. We expect ingest to work as before, running as a user with permission to write files.
As RecordService is on the data access path, high performance is paramount. RecordService reuses some of the core components from Impala: IO management, efficient native code implementations of the file format parsing logic, and predicate evaluation. This approach, combined with our optimized wire format, means that existing applications see improved performance when run with RecordService. Not only do applications gain the benefits of a new layer of abstraction facilitating unified fine-grained access control, but performance can actually improve as a result.
To demonstrate, we’ve ported the Terasort Checksum benchmark to use RecordService. We’ve modified the data to be slightly more tabular (line break after every record) and the schema is just a table of STRINGs. In many ways, this benchmark represents the worst-case scenario. The dataset has the most minimal schema, there is no projection or predicate pruning, and the server runs a more general version of the normal TeraInputFormat. The experiments below were run on a 78-node cluster (each node with 12 cores [24 with Hyper-Threading] and 12 disks; 77 Workers and 1 Planner). This workload was run over 1 billion, 50 billion, and 1 trillion row (~100TB) datasets. In these results, the graphs show normalized job completion times.
In this benchmark, job time for the large jobs was reduced by 15-20% for Spark and by 50-75% for MapReduce. We attribute the performance improvements to effective IO scheduling and to leveraging more efficient lower-level HDFS APIs. The RecordService Planner also performs more intelligent task combining, and MapReduce benefits significantly more than Spark from this improved task generation. It’s also worth mentioning that RecordService can be valuable even in cases where the data has only a minimal schema.
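The post does not describe how the Planner combines tasks, so the following is only a sketch of the general idea under an assumed policy: small splits that share a preferred host are merged greedily until a target size is reached. The function name `combine_splits` and the `target_bytes` parameter are hypothetical.

```python
from collections import defaultdict


def combine_splits(splits, target_bytes):
    """Greedily merge splits that share a preferred host.

    splits: list of (host, length_in_bytes) tuples.
    Returns a list of (host, [lengths]) tasks. A task is closed as soon as
    adding the next split would push it past target_bytes.
    """
    by_host = defaultdict(list)
    for host, length in splits:
        by_host[host].append(length)

    tasks = []
    for host in sorted(by_host):
        current, current_size = [], 0
        for length in by_host[host]:
            if current and current_size + length > target_bytes:
                tasks.append((host, current))
                current, current_size = [], 0
            current.append(length)
            current_size += length
        if current:
            tasks.append((host, current))
    return tasks


# Six small splits (sizes are arbitrary illustration values).
splits = [("node1", 40), ("node2", 30), ("node1", 50),
          ("node1", 30), ("node2", 20), ("node2", 40)]
tasks = combine_splits(splits, target_bytes=100)
for task in tasks:
    print(task)
```

Fewer, larger tasks mean less per-task startup overhead, which is one plausible reason a framework with relatively expensive task launch would benefit more from combining.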
TPCDS with Spark SQL
As another example, we benchmarked the Spark SQL integration running queries over the TPCDS dataset (500GB scale factor in Parquet, 5-node cluster, link to benchmark here). In these cases, we expect RecordService to accelerate the scan portion of the query execution. The standard TPCDS queries tend to be join-heavy. The graph below represents the performance benefit you should expect to see if your workload looks like these queries:
RecordService reduces the geomean query time by approximately 15%.
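For readers unfamiliar with the metric, the geometric mean summarizes per-query times so that one long query does not dominate the result. The sketch below shows the calculation; the timings are made-up illustration values, not measurements from this benchmark.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed via logs for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up per-query times in seconds, for illustration only.
baseline = [10.0, 40.0, 5.0, 100.0]
with_service = [8.0, 36.0, 4.0, 90.0]

ratios = [w / b for w, b in zip(with_service, baseline)]

# The geomean of per-query ratios equals the ratio of the two geomeans,
# so either form gives the same relative reduction.
reduction = 1.0 - geomean(ratios)
print(f"geomean reduction: {reduction:.1%}")
```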
For the final example, we demonstrate a workload closer to the best-case scenario: a simple workload in which a larger portion of the computation can be accelerated with RecordService. Here we run a selective scan and a simple query that sums a single BIGINT column (using the same cluster and dataset as the TPCDS case, but in this case scanning from the store_sales table). As expected, the performance benefits are much larger.
As with all benchmarks, there is no substitute for running them on your own workload. This is particularly true for RecordService, as it only participates in a portion of the total computation. We expect workloads that are scan-bound to observe the most benefit, depending on the complexity of the rest of the job, the file format, and the amount of processing that can be pushed into RecordService.
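One way to reason about this before benchmarking is an Amdahl's-law style estimate: if only the scan portion of a job gets faster, the overall speedup is bounded by how much of the job is scan. The sketch below is a back-of-the-envelope model, not a prediction; `scan_fraction` and `scan_speedup` are inputs you would have to measure or guess for your own workload.

```python
def overall_speedup(scan_fraction, scan_speedup):
    """Amdahl's-law estimate of end-to-end speedup.

    scan_fraction: share of job time spent scanning, between 0 and 1.
    scan_speedup:  factor by which the scan portion alone gets faster.
    """
    if not 0.0 <= scan_fraction <= 1.0:
        raise ValueError("scan_fraction must be between 0 and 1")
    if scan_speedup <= 0.0:
        raise ValueError("scan_speedup must be positive")
    return 1.0 / ((1.0 - scan_fraction) + scan_fraction / scan_speedup)


# Hypothetical inputs: the same 2x faster scan helps a scan-bound job
# far more than a join-heavy one.
for fraction in (0.2, 0.5, 0.9):
    print(fraction, round(overall_speedup(fraction, 2.0), 3))
```

Under this model a job that spends half its time scanning gains at most about 1.33x from a scan that is twice as fast, which matches the intuition that join-heavy queries see smaller improvements than simple scans.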
RecordService provides a higher-level building block for the Hadoop ecosystem and can simultaneously provide more functionality in security, easier integration by exposing a more unified API, and improved performance by leveraging technology developed with Impala. We’re really excited about the future enhancements that can be enabled by having this new abstraction layer. You can download the beta release here.
The community of developers collaborating on RecordService consists of multiple Hadoop distributions, application vendors, and companies that rely on Hadoop as a core part of their data infrastructure. We welcome additional interest!
Resources for Getting Involved
- Mailing list: firstname.lastname@example.org
- Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta
- Contributions: http://github.com/cloudera/RecordServiceClient/
- Documentation: http://cloudera.github.io/RecordServiceClient/
- Bug Reporting: Open a GitHub issue
Learn more about RecordService at the Apache Sentry meetup in NYC on Monday Sept. 28 (tonight). Lenni Kuff and Nong Li will present a technical session about RecordService at Strata + Hadoop World NYC on Weds., Sept. 30.
Nong Li is a Software Engineer at Cloudera and tech lead of the RecordService project; Nong is also an active Impala developer. Before joining Cloudera he worked at Microsoft developing new APIs for the Windows graphics system (DirectX).
Lenni Kuff is an Engineering Manager at Cloudera leading the Hive, Pig, and Sentry teams; he also worked as a developer on the Impala project. Before joining Cloudera he worked at Microsoft on a number of projects including the SQL Server storage engine, SQL Azure, and Hadoop on Azure.