This blog post is part of a series on Cloudera’s Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP.
This blog post gives you an overview of the NoSQL, component integration, and object store support capabilities of OpDB. These details will help application architects understand the flexible, NoSQL (schema-free) capabilities of Cloudera’s Operational Database and whether they will meet the requirements for the applications they are building.
Cloudera’s Operational Database (OpDB) is multi-model in that it natively supports many different types of object models.
Users can choose key-value, wide-column, and relational or provide their own object model.
JSON, XML, and other models can also be converted and stored through, for example, NiFi or Hive, or stored natively as key-value pairs and queried using, for example, Hive. JSON and XML can also be supported through a custom implementation using JSONRest.
Cloudera’s OpDB offers direct support for consistent object stores such as Azure Data Lake Store and S3 (AWS native and implementations like Ceph).
Object stores can be used to store HBase’s store files, where the bulk of the data resides, or as a backup target.
Cloudera’s OpDB stores untyped data by default, meaning that any object can be stored natively as a key-value pair with few limits on the number or type of the stored values. The maximum size of an object is the memory size of the server.
Cloudera’s OpDB is a wide-column data store and natively provides table-style capabilities, such as row lookup and grouping millions of columns into column families.
Column families must be defined when the table is created. Columns do not have to be defined at table creation; they are created as required, which enables flexible schema evolution.
Data types within a column are flexible and user-defined. Users can decide if they want to leverage this flexibility or leverage Relational DBMS capabilities in exchange for reducing flexibility in data types.
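The relationship between fixed column families and on-demand columns can be sketched with a toy in-memory model. This is only an illustration of the data model described above, not how OpDB is implemented; the class `WideColumnTable` and the table, family, and column names are hypothetical.

```python
class WideColumnTable:
    """Toy in-memory model of a wide-column table (illustration only)."""

    def __init__(self, name, column_families):
        self.name = name
        # Column families are fixed when the table is created.
        self.column_families = frozenset(column_families)
        # row key -> {(family, qualifier): value bytes}
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        if not isinstance(value, bytes):
            raise TypeError("values are stored as untyped bytes")
        # The column (qualifier) is created implicitly on first write.
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value

    def get(self, row_key):
        # Return a copy so callers cannot mutate the stored row.
        return dict(self.rows.get(row_key, {}))


table = WideColumnTable("users", ["profile", "activity"])
table.put(b"user1", "profile", "name", b"Ada")
# The store sees only bytes; the application decides how to encode an integer.
table.put(b"user1", "activity", "last_login", (1700000000).to_bytes(8, "big"))
# A second row can carry a column the first row never used.
table.put(b"user2", "profile", "email", b"bob@example.com")
```

Writing to a family that was not declared at creation fails, while new columns inside a declared family need no schema change. Interpreting the stored bytes (string, integer, and so on) is left to the application, which is the flexibility the paragraph above refers to.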
Conflict-free replicated data types
Cloudera’s OpDB supports conflict-free replicated data types (CRDTs). This support is provided by default, and the replication subsystem provides either strong eventual or strong timeline consistency.
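For readers unfamiliar with CRDTs, the following is a minimal sketch of one well-known example, a grow-only counter (G-Counter). It illustrates the general idea that replicas can accept writes independently and still converge after merging; it is not a description of how OpDB's replication subsystem works, and the replica names are hypothetical.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> count contributed by that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Taking the max per replica makes merge commutative,
        # associative, and idempotent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())


a = GCounter("cluster-a")
b = GCounter("cluster-b")
a.increment(3)   # write accepted on one replica
b.increment(2)   # concurrent write accepted on the other
a.merge(b)
b.merge(a)
b.merge(a)       # merging the same state again changes nothing
```

After the merges, both replicas report the same value regardless of the order in which updates were exchanged, which is the convergence property that strong eventual consistency refers to.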
Cloudera provides tight integration across the Hadoop ecosystem, including HDFS, due to its strong presence in this space.
Data can be exported using Snapshots or Export from running systems or by directly copying the underlying files (HFiles on HDFS) offline.
Cloudera’s OpDB supports Spark. Multiple integrations with Spark exist, enabling Spark to access tables as external data sources or sinks. Users can work with Spark SQL on DataFrames or with Datasets.
With DataFrame and Dataset support, all optimization techniques in Spark's Catalyst optimizer are available. In this way, data locality, partition pruning, predicate pushdown, scanning, and BulkGet are achieved. Spark worker nodes can be co-located on the cluster, enabling data locality. Both reads from and writes to the OpDB are supported.
For each table, a catalog has to be provided. The catalog includes the row key and the columns with their data types and predefined column families, and it defines the mapping between the columns and the table schema. The catalog is defined by the user in JSON format.
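As a rough illustration of what such a catalog can look like, the sketch below builds one as a Python dictionary and serializes it to JSON. The layout (`table`, `rowkey`, `columns` with `cf`, `col`, and `type`) follows the catalog format documented for the Apache HBase Spark connector; the table, column-family, and column names are hypothetical, and the helper `column_families` is added here only for illustration.

```python
import json

# Hypothetical table, column-family, and column names. The overall layout
# follows the catalog format of the Apache HBase Spark connector.
catalog = {
    "table": {"namespace": "default", "name": "orders"},
    "rowkey": "key",
    "columns": {
        # The row key is mapped with the reserved column family "rowkey".
        "order_id": {"cf": "rowkey", "col": "key", "type": "string"},
        "customer": {"cf": "details", "col": "customer", "type": "string"},
        "amount": {"cf": "details", "col": "amount", "type": "double"},
        "shipped": {"cf": "status", "col": "shipped", "type": "boolean"},
    },
}

catalog_json = json.dumps(catalog, indent=2)


def column_families(cat):
    """Return the column families a catalog refers to (row key excluded)."""
    return sorted({c["cf"] for c in cat["columns"].values() if c["cf"] != "rowkey"})


print(catalog_json)
print(column_families(catalog))
```

The resulting JSON string would typically be handed to the connector as an option when reading or writing a DataFrame; the exact option name depends on the connector version, so check the connector documentation. The column families listed in the catalog must already exist on the table, since they are fixed at table creation.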
An HBase DataFrame is a standard Spark DataFrame and can interact with any other data source, such as Hive, ORC, Parquet, or JSON. Java primitive types are supported, and three internal serdes are available: Avro, Phoenix, and PrimitiveType.
Cloudera provides several streaming data processing frameworks and tools which are integrated with its OpDB offering.
Cloudera DataFlow (CDF)
Cloudera DataFlow is a scalable, real-time streaming data platform that collects, curates, and analyzes data so customers gain key insights for immediate actionable intelligence.
Cloudera Flow Management (CFM) is a no-code data ingestion and management solution powered by Apache NiFi. It delivers highly scalable data movement, transformation, and management capabilities to the enterprise. Put simply, NiFi was built to automate the flow of data between systems. For more information, see Cloudera Flow Management.
Cloudera Streaming Analytics (CSA), powered by Apache Flink, offers a framework for real-time stream processing and streaming analytics. CSA provides a flexible, low-latency streaming solution that can scale to large throughput and state. It offers connectors for the chosen sources and sinks, for example the HBase Streaming connector. For more information, see Cloudera Streaming Analytics.
Cloudera Stream Processing (CSP) provides advanced messaging, stream processing, and analytics capabilities powered by Apache Kafka as the core stream processing engine. It also provides streams management capabilities. For more information, see Cloudera Stream Processing.
Spark Streaming is a micro-batch stream processing framework built on top of Spark. HBase and Spark Streaming make great companions because HBase provides the following benefits alongside Spark Streaming:
- A place to grab reference data or profile data on the fly
- A place to store counts or aggregates in a way that supports Spark Streaming’s promise of only-once processing
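One common way to keep stored aggregates correct when a micro-batch is retried is to make the write idempotent, for example by keying each batch's contribution by its batch ID. The sketch below shows that idea with a plain dictionary standing in for a table; it is an assumption about how such a pattern could look, not a description of a specific OpDB or Spark API, and the function and key names are hypothetical.

```python
from collections import defaultdict

# Stand-in for a table: row key -> {column qualifier: value}.
store = defaultdict(dict)


def write_batch_counts(batch_id, counts):
    """Write one micro-batch's counts. Replaying the same batch overwrites
    the same cells instead of adding to them."""
    for key, n in counts.items():
        store[key][f"batch-{batch_id}"] = n


def total(key):
    """Sum the per-batch contributions stored for a key."""
    return sum(store[key].values())


write_batch_counts(1, {"page_a": 4, "page_b": 1})
write_batch_counts(2, {"page_a": 2})
write_batch_counts(2, {"page_a": 2})  # simulated retry after a failure
```

Because the retried batch lands in the same cell, the total for `page_a` counts batch 2 only once. A naive read-increment-write would have counted it twice.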
In this blog post, we took a look at the NoSQL capabilities of the OpDB. We also saw how the OpDB integrates with the other components in CDP.
This is the last blog post in a series on Cloudera’s Operational Database (OpDB) in CDP. You can start from the beginning of the series with Operational Database in CDP.