Azure Data Lake Store (ADLS) is a highly scalable cloud-based data store that is designed for collecting, storing and analyzing large amounts of data, and is ideal for enterprise-grade applications. Data can originate from almost any source, such as Internet applications and mobile devices; it is stored securely and durably, while being highly available in any geographic region. ADLS is performance-tuned for big data analytics and can be easily accessed from many components of the Apache Hadoop ecosystem, such as MapReduce, Apache Spark, Apache Impala (incubating) and Apache Hive.
Since cloud data storage is a relatively new feature for several users of Hadoop and CDH, many of the questions they frequently have pertain to the performance of such stores compared to direct-attached HDFS storage. For instance, is there a performance tradeoff to be made in return for a lower TCO? Cloud stores from various providers are implemented differently, resulting in performance differences associated with file-system operations such as renames, listing, and replication. Cloud stores have also been designed with different consistency models. This article describes the performance characteristics of ADLS relative to Azure disk storage when used with CDH.
We used the following methods and workloads to evaluate the performance of ADLS and Azure Premium disk storage:
- The Azure CLI
- The Azure native SDK, via Python scripts
- Hadoop distcp
- MapReduce jobs
Runs were carried out both to upload data to, and download data from ADLS and disk storage. For each test, the resource bottlenecks limiting ADLS connector performance were identified and parameters tuned, so as to obtain the best possible performance.
Single Node Test
To obtain a basic assessment of ADLS performance, a simple technique would be to use a Hadoop cluster setup on Azure with a single Data Node. The Data Node is configured to have both Azure attached Premium Storage disks as well as ADLS access for comparative purposes. The goal here is to test for the best single-node throughput possible to and from ADLS. The I/O profile for this test consists of large sequential writes and reads exclusively. Cloud stores are designed to scale well for high throughput I/O. Since there are limits to how much data a single processing thread can move, the onus is on the user to use concurrency to perform as many I/O transfers in parallel as possible at a time.
We used the Azure native SDK for these single node tests as we learned it delivers better performance as compared to the Azure CLI. Doing this helps establish a baseline for comparisons with the Hadoop application-level results in the next section. Python scripts permit easy access to the SDK.
It is straightforward to copy files to and from ADLS from the Hadoop command shell. The Hadoop file system shell permits this to be done using the copyFromLocal option. Files were initially created on the attached Azure Premium Storage disks. These were then copied to ADLS to estimate outbound throughput and then copied back to measure inbound throughput.
The following commands were used for this purpose; they were executed concurrently in multiple threads, sufficient in number to utilize available ADLS capacity.
To copy to ADLS:
hadoop fs -copyFromLocal <local-path-to-file> adl://account-name/path-to-file
To copy back from ADLS:
hadoop fs -copyToLocal adl://account-name/path-to-file <local-path-to-file>
It became clear while tuning performance that Azure disk storage has throughput limits that can come into play prior to hitting ADLS limits. The disk limits were circumvented by reading from /dev/zero for the ADLS write test, and writing to /dev/null for the ADLS read test.
The test results below are obtained from a single Data Node of the Azure DS14_v2 instance type, which has 112 GB memory, 16 cores and features a 768 MB/s disk throughput limit. There are 15 P20 Premium Storage disks attached to it. This number is sufficient to ensure, that when copying files from and to the disks, their total throughput is not a limiting factor.
Azure Premium Storage delivers high-performance, low-latency disk support for virtual machines with I/O-intensive workloads. For durability, Premium Storage features Locally Redundant Storage (LRS) — there are three copies made of any data stored on this medium.
The table below provides a summary of the results. For each test, it indicates the type of bottleneck that would likely restrict any higher ADLS throughput performance.
|ADLS write||746||Azure Python SDK||Limited by source disk throughput|
|ADLS read||799||Azure Python SDK||Limited by destination disk throughput|
|ADLS write||689||copyFromLocal||Limited by source disk throughput|
|ADLS write||1020||/dev/zero -> ADLS||Likely limited by network throughput|
|ADLS read||1156||ADLS -> /dev/null||Likely limited by network throughput|
Now that we know the read and write throughput characteristics of a single Data Node, we would like to see how per-node performance scales when the number of Data Nodes in a cluster is increased.
The tool we use for scale testing is the Tera* suite that comes packaged with Hadoop. This is a benchmark that combines performance testing of the HDFS and MapReduce layers of a Hadoop cluster. The suite is comprised of three tools that are typically executed in sequence:
- TeraGen, that tool that generates the input data. We use it to test the write performance of HDFS and ADLS.
- TeraSort, which sorts the input data in a distributed fashion. This test is CPU bound and we don’t really use it to characterize the I/O performance or HDFS and ADLS, but it is included for completeness.
- TeraValidate, the test that reads and validates the sorted data from the previous stage. We use it to test the read performance of HDFS and ADLS.
The scale tests are performed on clusters of various sizes, all with the DS14_v2 instance type. The results below are obtained on a cluster of 6 Data Nodes. The throughput numbers in the table are calculated by dividing the total dataset size by the test duration, and then averaging them on a per-node basis.
HDFS/Premium Storage (3x replication)
HDFS/Premium Storage (1x replication)
Here are configuration details for this test:
- The total dataset size is 6 TB, that is, 1 TB per Data Node.
- Number of containers per node is 48, i.e., 288 containers in total.
- Number of maps and reduces are 287, so as to match the number of containers we configure.
- Each Data Node is equipped with 15 Premium Storage disks of P20 type.
As mentioned above, Azure Premium Storage features built-in redundancy, such that any data written to Azure is replicated 3 times within the same Azure geographic region. HDFS itself replicates data as well, with a default replication factor of 3. Thus, when running with default settings, there would be 9 copies made of any blocks written. On the other hand, ADLS manages replication internally, creating 3 copies of all data written.
It is, therefore, instructive to obtain performance numbers with a HDFS replication factor of 1 (the replication factor of Premium Storage is not user-configurable). This gives a more equivalent comparative assessment of ADLS, purely from an I/O perspective. (It should be noted, that, in this case, although the overall replication factor is similar, HDFS behavior is not precisely the same, since while reading a HDFS block, there is only one replica the namenode can target. We strongly recommend against running just 1 HDFS replica due to its lack of data availability.)
It can be seen that TeraGen throughput is much higher on ADLS than on Premium Storage in the tests carried out with 3X HDFS replication. Evidently, the additional two replicas consume two-thirds of the I/O throughput available.
TeraValidate results are roughly the same for both ADLS and Premium Storage, except that this test shows higher CPU usage in the ADLS case. TeraValidate throughput on Premium Storage is bottled by the disk throughput limit of the DS14_v2 instance type.
For purposes of scaling out, the same Tera* suite tests were carried out on clusters with 16 and 32 Data Nodes. The dataset size was chosen so that each Data Node processes 1 TB data. That is, 16 TB for the 16 Data Node cluster, and 32 TB for the 32 Data Node cluster.
Here is a chart that show how TeraGen throughput scales when the number of Data Nodes increases in a cluster:
From the chart, it can be seen that TeraGen throughput on ADLS is higher than Premium Storage with the default HDFS replication factor of 3. On the other hand, TeraGen throughput on ADLS is somewhat lower than that with Premium Storage using a HDFS replication factor of 1. There is some degradation in TeraGen throughput with higher scale factors for both Premium Storage and ADLS.
Similarly, the following chart shows how the TeraValidate throughput scales when the number of Data Nodes increases:
As can be seen, TeraValidate throughput on ADLS is comparable with that achieved with Premium Storage. Again, there is some degradation in overall TeraValidate throughput at higher scale factors here as well for both Premium Storage and ADLS.
Latency details were not evaluated in these runs.
During our scale testing, we ran into a throughput limit that ADLS imposes. This limit is on a per-account basis and intended for ensuring equitable performance across customers utilizing ADLS concurrently within a region.
The maximum sustained ADLS write throughput that we observed from the 6-node test is approximately 7 GiB/s per account. The maximum read throughput from the 6-node test is around 11-12 GiB/s per account. Microsoft informed us that at this scale, the read test did not reach the default per-account throughput limit. The 16-node and 32-node throughput results in the “Scaling Out” section were higher because they were tested after Microsoft explicitly raised the account limit for the test account.
This default account limit set by ADLS is a “soft limit”. That is, most users will be able to run a wide range of workloads without encountering this limit on throughput. Those who do can contact Microsoft support to have their limit raised; this default limit may also be increased in the future.
Summary and Future Work
Cloud data storage using ADLS is a feature that customers will want to consider seriously for their enterprise big-data applications. It can scale to sustained throughput levels of 1 GB/sec per node for both read and write workloads. These numbers are for best-case scenarios, which involve large sequential I/O accesses; real-life workloads will display somewhat lower throughput levels. From a throughput perspective, ADLS performs favorably compared to network-attached Azure disk storage, especially when cost is taken into consideration.
ADLS latency was not specifically evaluated in this work and this will be addressed in a future blog post. Another area of focus for performance analysis will be the inclusion of additional workloads that utilize other CDH components such as Spark, Hive and HBase.