As a member of Cloudera’s Partner Engineering team, I evaluate hardware and cloud computing platforms offered by commercial partners who want to certify their products for use with Cloudera software. One of my primary goals is to make sure that these platforms provide a stable and well-performing base upon which our products will run, a state of operation that a wide variety of customers performing an even wider variety of tasks can appreciate. Sharing some of our evaluation methodology might help others with their own projects.
Platform, defined
What is a platform?
According to Merriam-Webster, a platform could be a plan or design, a declaration of principles, a raised horizontal flat surface, a place or opportunity for public discussion, a vehicle used for a particular purpose, a thick layer between the inner sole and outer sole of a shoe, or the computer architecture and equipment using a particular operating system.
That’s a lot of definitions, none of which quite fit our meaning, but the last one comes close.
At the time of writing, Cloudera aims to “deliver the modern platform for machine learning and analytics optimized for the cloud,” which suggests our product offering itself is the platform, enabling customers to do all sorts of neat things. Within Cloudera’s Engineering team, “platform” could mean the compute, storage, management, and security components that are part of CDH, upon which other components are built.
One thing is certain: platform is a loaded term used for many purposes by different groups.
In the context of this post, platform refers to a computing environment consisting of host, network, and storage resources. This environment forms the base upon which we install and run our software, CDH.
A platform is most commonly composed of:
- bare metal hosts in your corporate data center provisioned by a Kickstart script
- virtual machines in your data center’s VMware or OpenStack environment
- virtual machines on cloud providers such as Amazon EC2, Google Cloud Platform, or Microsoft Azure
No matter where your environment resides physically, it will integrate with network and storage infrastructure. Network switches and routers could be physical, software-defined, or within a host hypervisor. Storage could be onboard the host itself, virtualized but directly addressable, or part of a vast network block storage service.
The assembly of all these resources results in an environment whose capabilities must be put to the test before we can recommend it to our customers. Here’s some of what I look at.
Provisioning time
For on-demand hardware provisioning, whether on-prem or on a cloud provider, we want to get a sense of how long it takes to spin up new machines so that we can set proper expectations.
Waiting 10-15 minutes for a platform to make virtual machines available may be acceptable for a cluster of nodes that’s expected to run for several months, but for an ad hoc job that’s only expected to run for a few minutes it probably won’t be.
If numerous machines are requested in a batch, does the time to completion suggest that the provisioning is occurring in serial or parallel fashion? It’s vastly preferred that provisioning be performed in parallel.
The faster machines can be provisioned and made available for us, the faster we can begin installing and using our applications.
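There’s no single tool for this; a rough sketch is simply to wrap the provisioning call in a timer and compare batch sizes (provision_nodes below is a hypothetical stand-in for whatever CLI or automation you actually use):

# Sketch: time an on-demand provisioning request end to end.
# "provision_nodes" is a hypothetical wrapper around your provisioning tool
# (cloud CLI, Terraform, Kickstart automation, and so on).
start=$(date +%s)
provision_nodes --count 10    # hypothetical command and flag
end=$(date +%s)
echo "Provisioned 10 nodes in $((end - start)) seconds"

Repeating the run with a batch of 1 and a batch of 10 gives a quick read on whether provisioning is happening serially or in parallel.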
Name resolution
Cloudera recommends using DNS for hostname resolution. Managing /etc/hosts becomes cumbersome quickly and is routinely the source of hard-to-diagnose problems. /etc/hosts should contain only an entry for 127.0.0.1, and localhost should be the only name that resolves to it. The machine name must not resolve to the 127.0.0.1 address. Forward and reverse lookups must be the inverse of each other for every host in the cluster for Apache Hadoop to function properly. An easy test to perform on the hosts to ensure proper DNS resolution is to execute:
$ dig <hostname>
$ dig -x <ip_address_returned_from_hostname_lookup>
For example:
$ dig themis.apache.org

themis.apache.org.    1758    IN    A    140.211.11.105

$ dig -x 140.211.11.105

105.11.211.140.in-addr.arpa.    3513    IN    PTR    themis.apache.org.
This is the behavior we should see for every host in the cluster, including those whose hostnames are automatically assigned during provisioning. It might take a long time to test every permutation on every host, but it’ll prevent headaches later. If you’re using Cloudera Manager, the Host Inspector will perform these checks for you.
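As a rough sketch, a small shell loop can run the same checks across an entire host list (hosts.txt here is an assumed file containing one fully qualified hostname per line):

# Sketch: verify that forward and reverse DNS lookups agree for each host.
# Assumes hosts.txt contains one fully qualified hostname per line.
while read -r host; do
  ip=$(dig +short "$host" | head -n 1)
  rev=$(dig +short -x "$ip" | sed 's/\.$//')
  if [ "$rev" = "$host" ]; then
    echo "OK   $host -> $ip -> $rev"
  else
    echo "FAIL $host -> $ip -> $rev"
  fi
done < hosts.txt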
Disk throughput
How fast can the operating system read and write to a disk? We’re primarily concerned with the disk throughput of those disks being used for DFS blocks, commonly referred to as “data disks,” but any disk can be tested. Make sure the system is otherwise idle while running these tests.
To measure read performance, use hdparm to specify a disk device:
$ sudo hdparm -tT /dev/sdb

/dev/sdb:
 Timing cached reads:   12422 MB in  2.00 seconds = 6216.73 MB/sec
 Timing buffered disk reads: 308 MB in  3.00 seconds = 102.65 MB/sec
To measure write performance, use dd to specify an output file:
$ sudo dd bs=8k count=256k if=/dev/zero of=/data1/output.img conv=fdatasync
2147483648 bytes (2.1 GB) copied, 25.3819 s, 84.6 MB/s
Record and compare the performance for each test across multiple runs. For magnetic disks running on bare metal, 120-150 MB/s is a good goal for reads and writes; on virtual machines, 60-80 MB/s is more common for ephemeral disk (such as the dd example above).
Throughput of SSD and NVMe disks will be considerably faster than their HDD counterparts. It’s also common for the read and write speeds to differ significantly on these types of disks. When configuring data volumes, do not use LVM or RAID; JBOD is still the preferred disk configuration for CDH components.
The throughput of network-attached storage solutions will generally be slower than local HDD, and such storage generally isn’t supported, although network block storage services offered by cloud providers can usually attain HDD-like speeds when sized appropriately. These tests can be useful to verify that you’re getting the throughput you expect from provisioned storage.
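To survey every data disk on a host in one pass, a quick loop like the following works as a sketch (the device names and mount points are assumptions; substitute your own):

# Sketch: survey read and write throughput across a host's data disks.
# Assumes data devices /dev/sdb through /dev/sde mounted at /data1 through /data4.
for dev in /dev/sd{b..e}; do
  echo "== Read test: $dev =="
  sudo hdparm -tT "$dev"
done
for mnt in /data{1..4}; do
  echo "== Write test: $mnt =="
  sudo dd bs=8k count=256k if=/dev/zero of="$mnt/throughput-test.img" conv=fdatasync
  sudo rm -f "$mnt/throughput-test.img"
done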
Network throughput
The network utilization of distributed computing environments is considerably higher than that of most other applications, so it’s important to check that the network infrastructure can handle the data we’re going to throw at it without any bottlenecks.
Pick two nodes: one will be the client and one will be the server. A CentOS installation is shown here, but the iperf3 utility is available for most Linux distributions.
$ sudo yum install -y epel-release
$ sudo yum -y --enablerepo=epel install iperf3 iftop
On the server, run:
[alex@server ~]$ sudo iperf3 -s
On the client, run:
[alex@client ~]$ sudo iperf3 -c <FQDN or IP of server>
In certain SDN and cloud environments, the default single-thread throughput can be significantly lower than what is achievable with higher parallelism. You can simulate multi-stream clients by setting -P <number> (up to 128 parallel streams) when running the test, but it’s better to start with a single stream to obtain a baseline before increasing parallelism.
To run with 4 parallel threads, run:
[alex@client ~]$ sudo iperf3 -P 4 -c <FQDN or IP of server>
The following example measured the throughput between the client and server using a single thread.
[alex@client ~]$ sudo iperf3 -c server
Looking at the excerpts from three different runs between two large virtual machines in the same rack, we can see that the bandwidth is consistent between runs (and likely capped by the hypervisor):
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  6.00 GBytes  5.16 Gbits/sec  733    sender
[  4]   0.00-10.00  sec  6.00 GBytes  5.15 Gbits/sec         receiver

[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  6.01 GBytes  5.16 Gbits/sec  592    sender
[  4]   0.00-10.00  sec  6.00 GBytes  5.16 Gbits/sec         receiver

[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  5.99 GBytes  5.15 Gbits/sec  936    sender
[  4]   0.00-10.00  sec  5.99 GBytes  5.15 Gbits/sec         receiver
If I run the same test on a pair of smaller VMs, we see lower but still consistent bandwidth.
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.55 GBytes  2.19 Gbits/sec  319    sender
[  4]   0.00-10.00  sec  2.55 GBytes  2.19 Gbits/sec         receiver

[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.56 GBytes  2.20 Gbits/sec  229    sender
[  4]   0.00-10.00  sec  2.55 GBytes  2.19 Gbits/sec         receiver

[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.55 GBytes  2.19 Gbits/sec  328    sender
[  4]   0.00-10.00  sec  2.55 GBytes  2.19 Gbits/sec         receiver
If you have network kernel changes to make, run a set of tests first and record the results to establish a baseline. Then apply kernel changes one at a time, re-running the tests in between to determine which tweaks help (or don’t).
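A minimal sketch of that workflow, using net.core.rmem_max purely as an example parameter (evaluate whichever tweak you actually have in mind):

# Sketch: capture a baseline, apply a single kernel change, then re-test.
# net.core.rmem_max is only an example; apply your own changes one at a time.
iperf3 -c server | tee baseline.txt
sudo sysctl -w net.core.rmem_max=16777216
iperf3 -c server | tee after-rmem-max.txt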
For bare metal clusters we should observe near full line rate between cluster nodes in the same rack, on different racks, and even between nodes on two of the farthest racks in the cluster.
On virtualized platforms, network throughput is often capped by the hypervisor. With cloud providers we rarely have control over how much network throughput is allocated, so these tests can help set proper expectations. When testing network throughput between two virtual machines, make sure both are of the same family and size.
Network latency
Cluster operations can be sensitive to network latency, so we check that too. We can use a simple tool like ping to get an idea of the network latency; in production you’d want to monitor and record latency over a longer period of time to better establish what’s normal during operation.
Packet loss should be non-existent on modern networks. If you consistently see lost packets, you’ll need to work with your network team to troubleshoot.
Example: Between two hosts in the same rack, an average latency of 0.230ms.
[alex@client ~]$ ping -q -c 10 server
PING server (10.1.0.30) 56(84) bytes of data.

--- server ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9198ms
rtt min/avg/max/mdev = 0.186/0.230/0.282/0.036 ms
Example: Between two hosts in different data centers in close proximity, an average latency of 0.747ms.
[alex@client ~]$ ping -q -c 10 notsofarawayserver
PING notsofarawayserver (10.2.0.20) 56(84) bytes of data.

--- notsofarawayserver ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9012ms
rtt min/avg/max/mdev = 0.723/0.747/0.800/0.038 ms
Example: Between two hosts in different data centers connected by different ISPs, latency increases to 1.328ms.
[alex@client ~]$ ping -q -c 10 example.org
PING example.org (93.184.216.34) 56(84) bytes of data.

--- example.org ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9011ms
rtt min/avg/max/mdev = 1.254/1.328/1.420/0.062 ms
Example: Between two hosts, further away still… 7.034ms.
[alex@client ~]$ ping -q -c 10 example.org
PING example.org (93.184.216.34): 56 data bytes

--- example.org ping statistics ---
10 packets transmitted, 10 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 6.272/7.034/7.860/0.512 ms
Within the LAN we would expect consistent sub-millisecond latencies between hosts, even across multiple racks.
Between different LANs or spanning data centers, network traffic will be subject to increased latency and variability, especially if you’ll be sharing WAN links with other users. If you’re planning to span data centers via dark fiber or a carrier-grade service, it’s important to survey your latency on a continual basis, both before and after deploying.
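A simple sketch of that kind of ongoing survey, again assuming a hosts.txt file listing the peer hosts to sample:

# Sketch: periodically record average latency and packet loss to each peer host.
# Assumes Linux ping output and a hosts.txt file with one hostname per line.
while true; do
  ts=$(date --iso-8601=seconds)
  while read -r host; do
    out=$(ping -q -c 10 "$host")
    rtt=$(echo "$out" | awk -F'/' '/^rtt/ {print $5}')
    loss=$(echo "$out" | grep -o '[0-9.]*% packet loss' | awk '{print $1}')
    echo "$ts $host avg_rtt_ms=$rtt loss=$loss" >> latency.log
  done < hosts.txt
  sleep 300
done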
Storage throughput
Whether you’re running on bare metal, a vendor-supplied appliance, or one of the many cloud providers, storage throughput tests are a great way to exercise the entire system; in addition to exercising storage, the tests also put processing and networking components through their paces.
The teragen and terasort benchmarking tools are part of the standard Apache Hadoop distribution and are included with the Cloudera distribution. In the course of a cluster installation or certification, Cloudera recommends running several teragen and terasort jobs to obtain a performance baseline for the cluster. The intention is not to demonstrate the maximum performance possible for the hardware or to compare with externally published results, as tuning the cluster for this may be at odds with actual customer operational workloads. Rather, the intention is to run a real workload through YARN to functionally test the cluster as well as obtain baseline numbers that can be used for future comparison, such as in evaluating the performance overhead of enabling encryption features or in evaluating whether operational workloads are performance limited by the I/O hardware.

Running benchmarks provides an indication of cluster performance and may also identify and help diagnose hardware or software configuration problems by isolating hardware components, such as disks and network, and subjecting them to a higher-than-normal load.
The teragen job will generate an arbitrary amount of data, formatted as 100-byte records of random data, and store the result in HDFS. Each record has a random key and value. The terasort job will sort the data generated by teragen and write the output to HDFS.
During the first iteration of the teragen job, the goal is to obtain a performance baseline on the disk I/O subsystem. The HDFS replication factor should be overridden from the default value 3 and set to 1 so that the data generated by teragen is not replicated to additional data nodes. Replicating the data over the network would obscure the raw disk performance with potential network bandwidth constraints.
Once the first teragen job has been run, a second iteration should be run with the HDFS replication factor set to the default value. This will apply a high load to the network, and deltas between the first run and second run can provide an indication of network bottlenecks in the cluster.
While the teragen application can generate any amount of data, 1 TB is standard. For larger clusters, it may be useful to also run 10 TB or even 100 TB, as the time to write 1 TB may be negligible compared to the startup overhead of the YARN job. Another teragen job should be run to generate a dataset that is 3 times the RAM size of the entire cluster. This will ensure we are not seeing page cache effects and are truly exercising the disk I/O subsystem.
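As a sketch of the sizing arithmetic for that run (teragen takes a row count as its argument, and each row is 100 bytes), the calculation might look like the following; the node count and per-node RAM are assumptions, so substitute your own figures:

# Sketch: teragen row count for a dataset 3x the cluster's aggregate RAM.
# Assumes 36 worker nodes with 256 GiB of RAM each; each teragen row is 100 bytes.
NODES=36
RAM_GIB_PER_NODE=256
TOTAL_RAM_BYTES=$((NODES * RAM_GIB_PER_NODE * 1024 * 1024 * 1024))
ROWS=$((TOTAL_RAM_BYTES * 3 / 100))
echo "teragen row count for 3x cluster RAM: $ROWS"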
The number of mappers for the teragen and terasort jobs should be set to the maximum number of disks in the cluster. This will likely be less than the total number of YARN vcores available, so it is advisable to temporarily lower the vcores available per YARN worker node to the number of disk spindles to ensure an even distribution of the workload. An additional vcore will be needed for the YARN ApplicationMaster.
The terasort job should also be run with the HDFS replication factor set to 1 as well as with the default replication factor. The terasort job hardcodes a DFS replication factor of 1, but it can be overridden or set explicitly by specifying the mapreduce.terasort.output.replication parameter as shown below.
Teragen and Terasort Command Examples
In the examples below, the cluster has 36 DataNodes with 10 data disks each.
Teragen Command to Generate 1 TB of Data With HDFS Replication Set to 1
$ EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
$ yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar teragen -Ddfs.replication=1 -Dmapreduce.job.maps=360 10000000000 TS_input1
The above command generates 1 TB of data with an HDFS replication factor of 1, using 360 mappers.
Teragen Command to Generate 1 TB of Data With HDFS Default Replication
$ EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
$ yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar teragen -Dmapreduce.job.maps=360 10000000000 TS_input2
The above command generates 1 TB of data with the default HDFS replication factor (usually 3), using 360 mappers.
Terasort Command to Sort Data With HDFS Replication Set to 1
$ EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
$ yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar terasort -Dmapreduce.terasort.output.replication=1 -Dmapreduce.job.maps=360 TS_input1 TS_output1
The above command sorts the data generated by teragen using 360 mappers and writes the sorted output to HDFS with a replication factor of 1.
Terasort Command to Sort Data With HDFS Replication Set to 3
$ EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
$ yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar terasort -Dmapreduce.job.maps=360 -Dmapreduce.terasort.output.replication=3 TS_input2 TS_output2
The above command sorts the data generated by teragen using 360 mappers and writes the sorted output to HDFS with a replication factor of 3 (a typical default).
Calculating Throughput per Disk
To calculate throughput per disk, use the following formula:
throughput = bytes processed / seconds / 1024 bytes per KB / 1024 KB per MB / disk count
For example, here’s a 1 TB teragen test that took 60 seconds to complete on 360 disks:
throughput = 1000000000000 bytes / 60 seconds / 1024 / 1024 / 360 disks
throughput = 44.15 MB/s per disk
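For repeated use, the same arithmetic is easy to wrap in a small shell helper (a sketch; the function name is our own):

# Sketch: per-disk throughput in MB/s from bytes processed, run time, and disk count.
throughput_per_disk() {
  local bytes=$1 seconds=$2 disks=$3
  awk -v b="$bytes" -v s="$seconds" -v d="$disks" \
      'BEGIN { printf "%.2f MB/s per disk\n", b / s / 1024 / 1024 / d }'
}

# Example from above: 1 TB written in 60 seconds across 360 disks.
throughput_per_disk 1000000000000 60 360    # prints 44.15 MB/s per disk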
Given the slow average disk speed, I’d look to see whether there were other network or storage layer bottlenecks. If this were a virtualized platform, perhaps our storage volumes are sized too small or are attached to a VM with limited I/O capacity. Given the short run time, it’s also possible the mappers weren’t given enough data to write. We could repeat the test with a 2 or 3 TB dataset to see if the mappers have more of an opportunity to stress the disks.
Keep detailed records of your test runs for later comparison.
Example Teragen and Terasort Results
| Command | HDFS Replication | Mapper Count | Run Time | Disk Count | Throughput per Disk |
| --- | --- | --- | --- | --- | --- |
| Teragen for 1 TB data set | 1 | 360 | 60s | 360 | 44.15 MB/s |
| Teragen for 3x cluster RAM data set | 1 | 360 | | 360 | |
| Terasort for 1 TB data set | 1 | 360 | | 360 | |
| Terasort for 3x cluster RAM data set | 1 | 360 | | 360 | |
| Teragen for 1 TB data set | 3 | 360 | | 360 | |
| Teragen for 3x cluster RAM data set | 3 | 360 | | 360 | |
| Terasort for 1 TB data set | 3 | 360 | | 360 | |
| Terasort for 3x cluster RAM data set | 3 | 360 | | 360 | |
Here’s a blank template that can be used to record test results.
Template Teragen and Terasort Results
| Command | HDFS Replication | Mapper Count | Run Time | Disk Count | Throughput per Disk |
| --- | --- | --- | --- | --- | --- |
| Teragen for 1 TB data set | 1 | | | | |
| Teragen for 3x cluster RAM data set | 1 | | | | |
| Terasort for 1 TB data set | 1 | | | | |
| Terasort for 3x cluster RAM data set | 1 | | | | |
| Teragen for 1 TB data set | 3 | | | | |
| Teragen for 3x cluster RAM data set | 3 | | | | |
| Terasort for 1 TB data set | 3 | | | | |
| Terasort for 3x cluster RAM data set | 3 | | | | |
Wrap up
When evaluating platforms we want to see an easy and quick provisioning process; we’d expect on-demand offerings to have shorter provisioning times. Proper name resolution is vital to cluster operation, so it’s always preferred that forward and reverse records are preconfigured for all hosts.
Disk throughput is often a bottleneck for distributed jobs, so we want to see ample disk throughput. For magnetic disks running on bare metal 120-150 MB/s is a good goal for reads and writes; on virtual machines, 60-80 MB/s is more common for ephemeral disk. Network block storage will have to be sized appropriately to attain these throughput rates.
The network has to be solid in terms of available throughput, reliability, and latency. We ought to see near line rates for bare metal hosts, and we don’t want to see dropped packets on any network interface, no matter how much data we throw at the cluster.
Using common Linux utilities, we can do a quick survey of a platform’s capabilities and catch problems early. The benchmarking tools included with Hadoop let us stress processing, networking, and storage systems at the same time, and they give us the means to improve performance iteratively.
Over time and after running benchmarking tools on numerous clusters, you’ll start to get a sense of what’s normal for a particular hardware environment. You’ll also have a good set of records by which to gauge the performance of new environments.
Alex Moundalexis is a software engineer at Cloudera.