This blog post is part of a series on Cloudera’s Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP.
Cloudera’s OpDB provides a rich set of capabilities to store and access data. In this blog post, we’ll look at the accessibility capabilities of OpDB and how you can make use of these capabilities to access your data.
Distribution and sharding
Cloudera’s Operational Database (OpDB) is a scale-out Database Management System (DBMS) that is designed to scale linearly to petabytes of data. As in other scale-out DBMSs, scale-out is implemented through sharding. Two different sharding policies are supported:
- Auto-sharding
- Pre-defined sharding
Regardless of approach, there are APIs to enable sharding based on hash, range of values, and the combination of both.
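To make the combination of hash and range sharding more concrete, here is a minimal Python sketch. It is an illustration only and does not use OpDB APIs: the names `salted_key`, `shard_for`, `NUM_BUCKETS`, and `SPLITS` are hypothetical, and the bucket count is an arbitrary assumption. Each row key is prefixed with a hash-derived bucket so that sequential keys spread out, and the shard is then chosen by a range lookup over pre-defined split points.

```python
import bisect
import zlib

NUM_BUCKETS = 4  # hypothetical number of hash buckets (assumption, not an OpDB default)

def salted_key(row_key, buckets=NUM_BUCKETS):
    """Hash step: prefix the key with a stable bucket so sequential keys spread out."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % buckets
    return f"{bucket:02d}-{row_key}"

def shard_for(key, split_points):
    """Range step: shard i holds the keys between split point i-1 and split point i."""
    return bisect.bisect_right(split_points, key)

# Pre-defined split points: one boundary at the start of each bucket after the first.
SPLITS = [f"{b:02d}-" for b in range(1, NUM_BUCKETS)]

key = salted_key("user-42")
print(key, "-> shard", shard_for(key, SPLITS))
```

With these split points, every key lands in the shard that matches its hash bucket, while keys inside a bucket stay sorted and can still be scanned by range.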
When auto-sharding is enabled, tables are dynamically distributed across the cluster. When a shard's size exceeds the configurable limit, the shard is automatically split and moved between servers in the cluster.
A table segment is split in two at the middle key, creating two roughly equal halves that can be served by different servers.
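The middle-key split can be sketched in a few lines of Python. This is a simplified model, not the OpDB implementation: the functions `split_at_middle` and `maybe_split` are hypothetical, and the limit is expressed as a number of keys rather than a size in bytes to keep the example small.

```python
def split_at_middle(sorted_keys):
    """Split a sorted list of keys in two at its middle key."""
    mid = len(sorted_keys) // 2
    middle_key = sorted_keys[mid]
    # The middle key becomes the first key of the right-hand half.
    return sorted_keys[:mid], sorted_keys[mid:], middle_key

def maybe_split(shards, max_keys):
    """Split each shard that holds more than max_keys keys once; keep the rest as-is."""
    result = []
    for shard in shards:
        if len(shard) > max_keys:
            left, right, _ = split_at_middle(shard)
            result.extend([left, right])
        else:
            result.append(shard)
    return result

shards = [["a", "b", "c", "d"], ["x"]]
print(maybe_split(shards, max_keys=3))
```

After the split, each half is an independent shard, which is what allows the two halves to be assigned to different servers.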
Automated sharding is applied regardless of the network that is used with the OpDB (WAN or local). Clusters can be set up to span a WAN, in which case sharding and data movement occur across the WAN with zero data loss.
The system can be configured to be aware of which nodes are in which data centers, which provides additional resilience for shards as copies of the shards can be distributed across multiple data centers.
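As a rough illustration of why data center awareness helps resilience, the Python sketch below spreads copies of a shard across data centers before reusing any single one. It is a toy placement policy under our own assumptions, not how OpDB or HDFS chooses nodes: the function `place_replicas` and the node and data center names are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest

def place_replicas(node_to_dc, replicas):
    """Pick `replicas` nodes, visiting data centers round-robin so copies spread out."""
    by_dc = defaultdict(list)
    for node, dc in sorted(node_to_dc.items()):
        by_dc[dc].append(node)
    # Interleave nodes: the first node of each data center, then the second, and so on.
    ordered = [
        node
        for group in zip_longest(*(by_dc[dc] for dc in sorted(by_dc)))
        for node in group
        if node is not None
    ]
    if replicas > len(ordered):
        raise ValueError("not enough nodes for the requested replica count")
    return ordered[:replicas]

# Hypothetical topology: four nodes in three data centers.
topology = {"n1": "dc-a", "n2": "dc-a", "n3": "dc-b", "n4": "dc-c"}
print(place_replicas(topology, replicas=3))
```

With three copies and three data centers, each copy lands in a different data center, so losing one site still leaves two copies available.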
Shards can be limited to specific subsets of nodes in a cluster based on policy, typically in a tenant-specific manner. That enables the implementation of geographic-based policies. Tables can then be replicated between clusters, with policies ensuring that replication of tables and their associated shards is limited to the desired geographies.
Cloudera’s OpDB provides native support for data sovereignty. If a cluster spans multiple countries, region server groups can be used to anchor data in specific countries along with HDFS Rack isolation configuration.
Cloudera provides three query engines optimized for different types of use cases, both operational and analytical, along with NoSQL interfaces that deliver optimized performance across a broad range of operational and data warehouse workloads. These engines enable the execution of queries and joins of data across multiple shards.
Cloudera’s OpDB provides a native OLTP SQL engine that supports querying multiple data and object models, including querying and joining across them. Two of our OLAP query engines can be used to map external tables that reside within our OpDB (or in other locations) and can query or join across them for the more complex analytical queries typical of data warehousing.
Data integration tools
Cloudera provides multiple tools to enable integration with data warehousing and federated query processing.
- Bulk export to a data warehouse is provided by Flink, Spark, Hive, and MapReduce
- Streaming export to a data warehouse is provided by NiFi
- In-situ data query within our OpDB is provided by Phoenix, Impala, and Hive
- Federated query processing across our OpDB, data warehouse solution, and third-party data warehouse solutions is provided by Hive
External data support
Cloudera’s OpDB includes many Hadoop tools and integrates with most of the Hadoop ecosystem.
Our OpDB provides NoSQL and SQL interfaces. There are no restrictions on using these interfaces, and they are very well supported in the Hadoop community.
MiNiFi can be used on portable devices at the edge to provide data connectivity with the OpDB.
The Hue query editor can run on a mobile or portable device.
Cloudera provides both JDBC and ODBC drivers through our SQL engines, in addition to direct API access to our data stores and tools.
In this blog post, we looked at some of the OpDB accessibility capabilities such as data query, data integration, and connectivity. In the next article, we’ll cover how you can make use of the administration capabilities in OpDB.
For more information, please go to: Getting Started with Operational Database.