Operational Database Scalability

by Liliana Kadar, Gokul Kamaraj, and Krishna Maheshwari

Posted in Technical | July 17, 2020 | 3 min read

Cloudera’s Operational Database provides unparalleled scale and flexibility for applications, enabling enterprises to bring together and process data of all types from more sources while giving developers the freedom they need. In this blog post, we look at the capabilities that make Operational Database the right choice for hyperscale.

Scale-up architecture

Cloudera’s Operational Database (OpDB) supports a scale-up (SMP) environment. The caching layer is able to consume all memory in a large SMP environment. Memory has to be large enough to cover the RegionServers, the DataNodes and the operating system, with enough left over for the block cache to assist with reads. When HBase runs alongside other components, CPU and memory contention can become a problem, but it is easy to address with proper YARN tuning.

As a result of the scale-up architecture, multiple services and engines can run on a single node. With smaller nodes, the same services and engines have to be spread across a larger set of nodes.
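To make the memory requirement concrete, here is a minimal sketch of a per-node memory budget. The function name and all figures are hypothetical and chosen only for illustration; they are not Cloudera sizing guidance.

```python
# Toy per-node memory budget in GiB. All numbers are illustrative assumptions.
def block_cache_headroom_gib(total_gib, regionserver_gib, datanode_gib, os_gib):
    """Return the memory left for caching once the fixed consumers are covered."""
    committed = regionserver_gib + datanode_gib + os_gib
    headroom = total_gib - committed
    if headroom < 0:
        raise ValueError(f"node is over-committed by {-headroom} GiB")
    return headroom


# Hypothetical large SMP node with 256 GiB of RAM.
headroom = block_cache_headroom_gib(
    total_gib=256, regionserver_gib=32, datanode_gib=4, os_gib=8
)
print(f"{headroom} GiB left for the block cache")
```

If other components share the node, their memory would be subtracted in the same way before deciding how much the cache can use.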

Scale-out architecture

In addition to scale-up, Cloudera’s OpDB supports a scale-out (clustered) architecture by default.

When required, an additional node can easily be added to the cluster using Cloudera Manager. The process involves installing the same JDK version, the Cloudera Manager agent and the parcels on this new node. Once the agent is started, the host can be used to install an OpDB role or an OpDB service.

For example, you can add a RegionServer role to the new host to gain worker-node capacity. You can then run the balancer to spread the existing workload onto this new node. You could also add this new node as a master to enable high availability while growing capacity. The process is the same for on-premises and cloud deployments.
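The effect of running the balancer after adding a node can be pictured with a toy model. The sketch below only evens out region counts per server; it is not the balancer's actual algorithm, which weighs more factors than region count, and the server names and counts are hypothetical.

```python
def rebalance(region_counts):
    """Evenly redistribute region counts across servers (toy model only)."""
    total = sum(region_counts.values())
    servers = sorted(region_counts)
    base, extra = divmod(total, len(servers))
    # The first `extra` servers take one additional region each.
    return {s: base + (1 if i < extra else 0) for i, s in enumerate(servers)}


# Three existing RegionServers plus a newly added, empty one (rs4).
before = {"rs1": 40, "rs2": 41, "rs3": 39, "rs4": 0}
after = rebalance(before)
print(after)
```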

Limitations

Data Types

Cloudera’s OpDB offers multiple options for data types. It natively supports untyped data with no limit on the size of a datatype (bounded only by the memory available on a given node of the cluster).

Avro data types are supported:

Primitive data types
- null
- boolean
- int
- long
- float
- double
- bytes
- string

Complex data types
- Records
- Enums
- Arrays
- Maps
- Unions
- Fixed

For more information about Avro data types, see the Avro Schemas documentation.
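As a rough illustration of how the primitive and complex types listed above combine, the following sketch builds an Avro schema as a plain Python dictionary and serializes it to JSON. The record name `SensorReading` and all field names are hypothetical.

```python
import json

# Hypothetical record; the field names exist only for this example.
schema = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "ts", "type": "long"},
        {"name": "value", "type": "double"},
        {"name": "ok", "type": "boolean"},
        # Union: the note is either null or a string.
        {"name": "note", "type": ["null", "string"], "default": None},
        {"name": "status",
         "type": {"type": "enum", "name": "Status",
                  "symbols": ["OK", "WARN", "FAIL"]}},
        {"name": "samples", "type": {"type": "array", "items": "float"}},
        {"name": "tags", "type": {"type": "map", "values": "string"}},
        {"name": "checksum",
         "type": {"type": "fixed", "name": "MD5", "size": 16}},
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

The resulting JSON text is what would typically be saved as an `.avsc` schema file.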

Phoenix data types are also supported:

- INTEGER/UNSIGNED_INT: 4 bytes
- BIGINT/UNSIGNED_LONG/TIME/UNSIGNED_TIME/UNSIGNED_DATE/DATE: 8 bytes
- TINYINT/UNSIGNED_TINYINT: 1 byte
- SMALLINT/UNSIGNED_SMALLINT: 2 bytes
- FLOAT/UNSIGNED_FLOAT: 4 bytes
- DOUBLE/UNSIGNED_DOUBLE: 8 bytes
- DECIMAL: up to 38 digits with a variable-length binary representation
- BOOLEAN: 1 byte
- TIMESTAMP/UNSIGNED_TIMESTAMP: 12 bytes
- VARCHAR/CHAR/BINARY/VARBINARY/ARRAY: no limit
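The widths above can be used for a back-of-the-envelope estimate of how many bytes the fixed-width columns of a row take up. This is a sketch only: the table layout is hypothetical, variable-width types are skipped, and storage overhead beyond the raw values is ignored.

```python
# Fixed widths in bytes, taken from the list above.
FIXED_WIDTH_BYTES = {
    "TINYINT": 1, "UNSIGNED_TINYINT": 1, "BOOLEAN": 1,
    "SMALLINT": 2, "UNSIGNED_SMALLINT": 2,
    "INTEGER": 4, "UNSIGNED_INT": 4, "FLOAT": 4, "UNSIGNED_FLOAT": 4,
    "BIGINT": 8, "UNSIGNED_LONG": 8, "DOUBLE": 8, "UNSIGNED_DOUBLE": 8,
    "DATE": 8, "UNSIGNED_DATE": 8, "TIME": 8, "UNSIGNED_TIME": 8,
    "TIMESTAMP": 12, "UNSIGNED_TIMESTAMP": 12,
}


def fixed_width_total(column_types):
    """Sum the bytes of fixed-width columns; other types contribute 0."""
    return sum(FIXED_WIDTH_BYTES.get(t.upper(), 0) for t in column_types)


# Hypothetical table: id BIGINT, created TIMESTAMP, active BOOLEAN,
# score DOUBLE, name VARCHAR (variable width, so not counted).
columns = ["BIGINT", "TIMESTAMP", "BOOLEAN", "DOUBLE", "VARCHAR"]
total = fixed_width_total(columns)
print(f"{total} bytes of fixed-width values per row")
```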

In-Memory Database Size

The in-memory portion of the OpDB can span DRAM and persistent memory such as Intel Optane.

Moreover, the database can span multiple nodes in a cluster and is not bound by the memory of a single server, which enables terabyte-level scale. Alternatively, when the entire dataset fits in memory (on a single server or across multiple servers) and tables are configured to be memory resident (cached), OpDB can act as an in-memory database with the corresponding benefits of low latency and high throughput. In this scenario, writes are still persisted to disk.
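Whether a dataset can be fully memory resident comes down to simple arithmetic over the cluster's aggregate cache. The sketch below assumes a uniform cache size per node; the function names and numbers are hypothetical.

```python
import math


def fits_in_memory(dataset_gib, nodes, cache_gib_per_node):
    """True if the aggregate cache across all nodes can hold the dataset."""
    return dataset_gib <= nodes * cache_gib_per_node


def min_nodes(dataset_gib, cache_gib_per_node):
    """Smallest node count whose combined cache holds the dataset."""
    return math.ceil(dataset_gib / cache_gib_per_node)


# Hypothetical cluster: 20 nodes with 200 GiB of cache each (4000 GiB total).
print(fits_in_memory(3000, nodes=20, cache_gib_per_node=200))
print(fits_in_memory(5000, nodes=20, cache_gib_per_node=200))
print(min_nodes(5000, cache_gib_per_node=200))
```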

Very large databases

Cloudera’s OpDB can act as a very large database (VLDB) for OLTP applications, and no special management tools are required for this use case.

There is no specific limit on how large an OLTP database can get. The largest known implementations by Cloudera customers are greater than 2.5PB per instance.

Regional scalability

Cloudera’s OpDB supports scaling across regions. Asynchronous replication allows the database to span regions around the globe, with bi-directional and multi-directional replication links forming complex topologies with tunable consistency models.

Clusters can also be stretched across shorter distances, depending on the latency of the links between sites. While there is no predefined limit, customers typically prefer less than 20 ms of network latency between nodes (preferably less than 1 ms) so that performance stays within the envelope their applications require. In this scenario, stretching the cluster across three data centers, or across three availability zones when deploying in the cloud, is advised to ensure resiliency.
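The latency guidance above can be turned into a quick check of candidate sites. This is a minimal sketch: the thresholds come from the figures in the text, while the data center names and measured latencies are hypothetical.

```python
PREFERRED_MS = 1.0     # "preferably less than 1 ms"
TYPICAL_MAX_MS = 20.0  # customers typically prefer under 20 ms


def classify_latency(latency_ms):
    """Bucket a link latency against the preferences described in the text."""
    if latency_ms < PREFERRED_MS:
        return "preferred"
    if latency_ms < TYPICAL_MAX_MS:
        return "typical"
    return "above typical preference"


# Hypothetical measured latencies (ms) between three data centers.
links = {("dc1", "dc2"): 0.8, ("dc1", "dc3"): 6.5, ("dc2", "dc3"): 24.0}
report = {pair: classify_latency(ms) for pair, ms in links.items()}

for pair, verdict in report.items():
    print(pair, verdict)
```

Since there is no predefined limit, a link in the last bucket is a prompt to test against application requirements rather than a hard failure.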

Fast loader

Multiple mechanisms are provided for bulk loading data: through the API, or by using MapReduce or Spark.