We at Cloudera believe that all companies should have the power to leverage data for financial gain, to lower operational costs, and to avoid risk. We enable this by providing an enterprise-grade platform that allows customers to easily manage, store, process, and analyze all of their data, regardless of volume and variety.
Cloudera’s Enterprise Data Hub (EDH), a modern machine learning and analytics platform optimized for the cloud, offers the flexibility to deploy storage services, core compute services, and Cloudera’s Shared Data Experience (SDX) capabilities both on-premises and in the cloud, so you can choose the deployment model that makes the best financial and operational sense. Regardless of whether you deploy on-premises or in the cloud, careful consideration is needed to enable the best performance and functionality for your solution, team, and organization.
Infrastructure is at the core of any software solution. Without a properly planned and designed infrastructure, applications will likely fail to deliver the value they promise. The Enterprise Data Hub and the applications that run on it are no exception. The market is overflowing with hardware choices, and it is often difficult to understand which components and configurations will yield the best value for your technology projects and business objectives. In addition, the vast number of services and roles that make up the ecosystem opens up many possible deployment layouts, and use-case and workload considerations become prominent here. Finally, with the advent of the cloud, many other considerations must be accounted for when taking your use case there.
In this three-part series we will address the most critical decisions when implementing your Big Data solution with Cloudera across the many potential use cases, whether on-premises or in the cloud. Part 1 covers Infrastructure Considerations, Part 2 will cover Service and Role Layouts, and Part 3 will look at Cloud deployment considerations.
Node Classifications
Let’s start by categorizing nodes in a CDH cluster by their primary functions. Cloudera classifies nodes using the following nomenclature:
- Utility: Nodes running services that allow you to successfully manage, monitor, and govern your cluster. Such services may include Cloudera Manager (CM) and the associated Cloudera Management Services (CMS), the metastore RDBMS (if co-located on the cluster) that stores metadata on behalf of a variety of services, and perhaps your administrators’ custom scripts for backups, deploying custom binaries, and more.
- Masters: Nodes containing “master” roles that typically manage their corresponding services operating across the distributed cluster. The HDFS NameNode, YARN ResourceManager, Impala Catalog Server and StateStore, HBase Master, and Kudu Masters are all examples of master roles.
- Workers: Nodes containing roles that do most of the compute/IO work for their corresponding services. HDFS DataNodes, YARN NodeManagers, HBase RegionServers, Impala Daemons, Kudu Tablet Servers and Solr Servers are examples of worker roles.
- Edge: Nodes that contain configurations, binaries and services that enable them to act as a gateway between the rest of the corporate network and the EDH cluster. Many CDH services come with “gateway roles” that would reside here, along with endpoints for REST API calls and JDBC/ODBC type connections coming from the corporate network. Often it is simpler to set up perimeter security when you allow corporate network traffic to only flow to these nodes, as opposed to allowing access to Masters and Workers directly.
- Encryption: You can encrypt and secure your data at rest using HDFS Transparent Encryption (to encrypt data on HDFS) and Navigator Encrypt (to encrypt data on non-HDFS volumes). This capability requires two additional services, the Key Management Service (KMS) and the Key Trustee Server (KTS). The KMS nodes are part of each HDFS cluster, while the KTS nodes should be separated by a firewall and hosted on their own cluster. Note that encryption of data in transit can be achieved via Transport Layer Security (TLS) and by enabling the data transfer encryption features of HDFS, for example.
Part 1: Infrastructure Considerations
CDH is able to leverage as many resources as the underlying infrastructure provides. This is an important concept to understand when deploying your first cluster. If deploying on shared rack and network equipment, carefully ensure that power and network requirements are met. If deploying in the cloud, concepts such as keeping hosts close to each other and ensuring that certain roles are not placed on the same underlying physical machine must be considered. To ensure an adequate implementation, refer to the published reference architectures, including the Minimum Hardware Requirements and the Generic Bare Metal Reference Architecture, for hardware, virtualization software, and cloud vendors. In most cases the following infrastructure considerations are common.
Disk Layouts
Most enterprise software solutions rely on RAID configurations to overcome the potential for data loss due to disk failure. In CDH, however, RAID is not necessary for all roles and services. In fact, for some roles requiring low latency and high I/O, single dedicated drives presented as “just a bunch of disks” (JBOD) are encouraged.
Here is a proposed outline of disk setup for the various components for on-premises implementations:
- Operating System: The operating system can be placed on a RAID1 volume to protect against node failure in the event of disk loss. Typically, 2U server offerings come with a pair of disks in the back of the unit meant for the OS, which is ideal for configuring as a RAID1 pair.
- Database: Your metastore RDBMS data should be protected by a RAID volume, considering tradeoffs between performance and resilience in case of disk failures. In any case, at a minimum, if one drive fails, your RDBMS should still be able to access the data.
- HDFS JournalNode: Each JournalNode process should have its own dedicated disk (JBOD or RAID0 with a single disk). You should not use SAN or NAS storage for these directories. If the dedicated disk is SSD/NVMe SSD, ensure you are using enterprise-grade drives which can handle write-heavy sequential operations.
- ZooKeeper: ZooKeeper’s transaction log must be on a dedicated device (a dedicated partition is not enough). ZooKeeper writes logs sequentially, without seeking. Sharing your log device with other processes can cause seeks and contention, which in turn can cause multi-second delays. If the dedicated disk is SSD/NVMe SSD, ensure you are using enterprise-grade drives which can handle write-heavy sequential operations.
- HDFS NameNode: Each NameNode process should have its own dedicated disk (2 disks as JBOD or RAID1 configuration). You should not use SAN or NAS storage for these directories.
- HDFS DataNode/Kudu Tablet Server: A set of JBOD disks per worker hosting the HDFS DataNode is recommended. The same set of disks and file systems may be used for Kudu Tablet data, isolated only by the path on the filesystem (e.g., /dataX/dn for the DataNode and /dataX/kudu-data for Kudu); a directory-layout sketch follows this list. Furthermore, if data at rest must be encrypted, HDFS Transparent Encryption can be used for HDFS, and Navigator Encrypt can be used for Kudu if Kudu data is on its own dedicated devices.
- Kudu Write-Ahead Log (WAL): A dedicated disk is highly recommended for Kudu’s write-ahead log, which is required on both Master and Tablet Server nodes. Kudu’s write performance relies on the performance of the WAL, so drives offering high sequential write throughput and low latency are recommended. SSD/NVMe SSDs are a natural fit for performance, though use enterprise-grade SSDs that can handle these requirements.
- Kafka Brokers: JBOD or RAID10 is recommended for Kafka Brokers, each with its own set of tradeoffs.
- JBOD risks broker downtime if a disk fails, as well as skewed disk usage because some partitions may hold more data than others. However, it achieves excellent IO performance due to isolated writes across disks, without the need to coordinate through a RAID controller. Recent releases of Kafka have worked to address broker outages due to disk failures with KAFKA-4763.
- RAID configurations avoid downtime if a disk fails, always maintain even distribution of data across disks, and have strong performance as all spindles operate on your behalf for IO, albeit through a RAID controller. The downside of RAID, especially RAID10, is sacrificing half the capacity available for your topics.
- Both disk layout strategies benefit from the OS page cache and asynchronous IO: data that was recently written (by a producer) is typically read shortly after (by a consumer) while it is still in the cache.
- Flume: Edge nodes running Flume may warrant dedicated disks when leveraging Flume’s File Channel. If relying on the Memory Channel or Kafka Channel in your Flume deployments, there is no need to provision a dedicated disk.
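As a concrete illustration of the shared DataNode/Kudu JBOD layout described above, here is a minimal sketch that creates per-disk directories on hypothetical mount points /data1 through /data12; the disk count, mount points, and service users are assumptions and should be adapted to your hosts.

```bash
# Minimal sketch, assuming twelve JBOD disks already mounted at /data1 .. /data12
# and that the hdfs and kudu service users/groups exist on the host.
for i in $(seq 1 12); do
  mkdir -p "/data${i}/dn" "/data${i}/kudu-data"   # DataNode and Kudu data share the disk, isolated by path
  chown -R hdfs:hadoop "/data${i}/dn"             # HDFS DataNode directories
  chown -R kudu:kudu  "/data${i}/kudu-data"       # Kudu Tablet Server data directories
done
# These paths would then be referenced in dfs.datanode.data.dir (HDFS) and in
# Kudu's fs_data_dirs setting, typically configured through Cloudera Manager.
```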
It is also recommended that such disks be mounted by UUID (obtained with the blkid command) and tuned for Big Data workloads. Tuning can be achieved by formatting the disks with ext4 or xfs, mounting with the noatime option, and running tune2fs as necessary (see previous blog). Note that for cloud implementations on AWS, Azure, and GCP, specific disk requirements are outlined. For instance, on AWS, the use of ephemeral disks or EBS GP2 volumes should be considered; on Azure, Premium disks are strongly recommended; and on GCP, standard persistent disks are required for DataNodes. Consult the corresponding Reference Architecture for your implementation.
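For the on-premises JBOD disks described above, a minimal mounting and tuning sketch might look as follows; the device name, filesystem choice, and mount point are hypothetical and should be adapted to your hardware.

```bash
# Minimal sketch for one data disk; adjust the device, filesystem, and mount point.
mkfs -t ext4 /dev/sdb                 # or mkfs.xfs if you prefer xfs
blkid /dev/sdb                        # note the UUID reported for the new filesystem
mkdir -p /data1
# Add an /etc/fstab entry that mounts by UUID with the noatime option, e.g.:
#   UUID=<uuid-from-blkid>  /data1  ext4  defaults,noatime  0 0
mount /data1
tune2fs -m 1 /dev/sdb                 # ext4 only: shrink the root-reserved block percentage
```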
Networking
Networking in EDH requires careful planning and consideration to ensure maximum scalability for your workloads. You have the choice of deploying hardware co-located in racks with other software solutions’ hardware or building out isolated data center racks. These considerations change in the cloud, where networking is provided as part of the Infrastructure as a Service.
Network performance has improved drastically over the years, from 1Gb to 10Gb to 40Gb to 100Gb and beyond. Consider future-proofing your cluster’s network by leveraging 100Gb aggregate Top-of-Rack (ToR) switches and 10Gb/25Gb/40Gb NICs for your individual servers. If available, multiple network interfaces should be bonded to increase cluster throughput and availability.
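As a rough sketch of NIC bonding on a RHEL/CentOS-style host, the commands below bond two hypothetical interfaces with LACP (802.3ad); the interface names, bonding mode, and matching switch-side configuration are assumptions and must agree with what your network team provides.

```bash
# Minimal sketch, assuming interfaces eth0 and eth1 and an LACP-capable ToR switch.
nmcli con add type bond con-name bond0 ifname bond0 \
      bond.options "mode=802.3ad,miimon=100"            # LACP aggregation with link monitoring
nmcli con add type ethernet con-name bond0-eth0 ifname eth0 master bond0
nmcli con add type ethernet con-name bond0-eth1 ifname eth1 master bond0
nmcli con up bond0                                       # bring the bonded interface up
```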
When implementing on-premises clusters, many service master roles are spread across three servers. To ensure resiliency to rack failures, spreading these master roles across three racks is ideal. Typically, each rack is provisioned with dual ToR switches to tolerate switch failures and network maintenance, and traffic to systems outside the cluster (e.g., ingest sources, Hive or Hue access, etc.) is routed to the cluster via dedicated switches. When implementing on AWS, Availability Zones should be used to ensure low-latency links among hosts; on Azure, Availability Sets with Fault and Update Domains should be used; and on GCP, Regions and Zones should be used. Always refer to the vendor-specific and cloud Reference Architectures for more details on the corresponding hardware and cloud topologies.
It is also required that fully qualified domain names (FQDNs) are lowercase and registered in DNS with reverse lookup properly configured. All CDH-related traffic must flow through the NICs associated with the FQDNs of the cluster hosts used during CDH’s installation procedure, as multi-homing is not supported except on edge nodes. This is particularly important when deploying clusters on Azure, which at the time of this writing does not provide reverse DNS out of the box like AWS and GCP do.
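A quick way to verify these hostname and DNS prerequisites on each host is sketched below; the example hostname and IP address are hypothetical.

```bash
# Minimal sketch: confirm the FQDN is lowercase and forward/reverse DNS agree.
hostname -f                              # should print the lowercase FQDN, e.g. worker1.example.com
getent hosts "$(hostname -f)"            # forward lookup of the FQDN
dig +short -x 10.0.0.21                  # reverse lookup of the host's IP should return the same FQDN
```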
High Availability and Load Balancing
In general, implementing high availability and load balancing requires bonding network interfaces and using load balancing infrastructure (e.g., HAProxy, F5, others) to redirect requests upon service failures and/or balance the load of certain services.
Some of the services that benefit from a load balancer include Cloudera Manager, HiveServer2, Impala, Oozie, the HDFS HttpFS service, Hue, Solr, and more. Be sure to read the Cloudera documentation here for details of load balancer configurations for those services, as some (such as Impala) have special requirements with regard to session persistence and source IP tracking.
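As a rough illustration of such a configuration, the HAProxy fragment below balances two hypothetical HiveServer2 instances round-robin and pins Impala clients by source IP, in line with the session-persistence caveat above; the host names, ports, and balancing choices are assumptions, not a definitive configuration.

```
# Minimal HAProxy sketch; host names and ports are illustrative only.
listen hiveserver2
    bind 0.0.0.0:10000
    mode tcp
    balance roundrobin                  # spread HiveServer2 JDBC connections
    server hs2-1 edge1.example.com:10000 check
    server hs2-2 edge2.example.com:10000 check

listen impala-jdbc
    bind 0.0.0.0:21050
    mode tcp
    balance source                      # keep each client pinned to one Impala daemon
    server impalad-1 worker1.example.com:21050 check
    server impalad-2 worker2.example.com:21050 check
```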
Most enterprise customers require solutions that are resilient and that can scale well as use of such clusters grows over time. To accomplish high availability and load balancing, one requires the following infrastructure:
- Utility: A minimum of two hosts is recommended. Separating Cloudera Manager (CM) from the Cloudera Management Services (CMS) across two hosts each may further be needed if the total number of hosts managed is large (>100), and CMS services can be split up across additional utility nodes for very large (>500 node) clusters. In addition, if the database is co-located within the cluster (as opposed to located off the cluster), the utility nodes can be used to host the database and implement native replication between the two database instances. CM high availability also requires Network Attached Storage (NAS) for sharing certain directories between the CM hosts themselves. Remember to provision extra hard drive space on the CM server for service metrics and logging based on required retention policies.
- Masters: A minimum of three hosts per cluster is recommended for HA. Specifically, ZooKeeper, the HDFS JournalNodes, and the Kudu Masters require an odd number of instances for quorum consensus. Typically, three masters are recommended for clusters under 100 nodes, and five masters are recommended between 100 and 500 nodes. Note: a utility node may temporarily host the third master’s services until it is possible to add a third master node.
- Workers: A bare minimum of three distinct hosts is required for data durability, but reference architectures generally state five as a starting point to provide a reasonable level of data availability, redundancy, and storage capacity. Keep in mind that because the default HDFS replication factor is three, the effective HDFS capacity is roughly one-third of the raw storage (with three workers, only the space served by a single host). Worker nodes scale out as the demand for storage or compute on any given cluster increases. Additional space should be set aside for processing engines’ temporary space and for extra copies of the same data as required by your use cases (approximately 25% overhead is a good starting point); a simple sizing sketch follows this list.
- Edge: Edge nodes need to scale with the number of concurrent HiveServer2 connections, Hue users, HttpFS scale requirements, and more. At least two nodes are needed to load balance and enable HA. As a general rule of thumb, for every 20 worker nodes added, one additional edge node is recommended. The Cloudera Data Science Workbench (CDSW) and some third-party tools are also installed on edge nodes, and the number of edge nodes must be sized accordingly to deploy these capabilities.
- Encryption: The KMS and KTS services each require two nodes for high availability. Note that a KMS service is required per HDFS cluster, so anyone implementing more than one HDFS cluster would need two hosts for KTS plus two hosts for KMS per HDFS cluster. The KTS servers should be treated as their own cluster, protected behind firewall rules. The KTS nodes can be virtualized if necessary, as long as the risks associated with running this way are understood.
- Risks: Cluster unavailability if the VM environment goes offline or if the network between the CDH environment and the VM environment becomes unavailable.
- This should only be considered if data center space and cost constraints are preventing the addition of physical nodes.
- The KTS hosts should be backed up or snapshotted frequently and securely, both to prevent loss of data and to avoid exposing the keys that decrypt sensitive data.
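As a back-of-the-envelope illustration of the worker sizing math above, the sketch below assumes a hypothetical ten-worker cluster with twelve 4 TB data disks per node; the numbers are purely illustrative.

```bash
# Minimal sizing sketch with assumed numbers: 10 workers x 12 disks x 4 TB each.
raw=$((10 * 12 * 4))            # 480 TB raw capacity across the workers
usable=$((raw * 75 / 100))      # reserve ~25% for temporary/scratch space -> 360 TB
effective=$((usable / 3))       # divide by the default replication factor of 3 -> 120 TB
echo "Effective HDFS capacity: ~${effective} TB"
```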
Conclusion
In this Part 1: Infrastructure Considerations blog post, we covered fundamental concepts for a healthy cluster, including a brief overview of how nodes are classified, disk layout configurations, network topologies, and what to think about when pursuing high availability and load balancing. Watch for Part 2 of this series, where we will cover CDH service and role layouts in more detail based on use-case and workload considerations.