This FAQ contains answers to the most frequently asked questions about the architecture and configuration choices involved.
In December 2013, Cloudera and Amazon Web Services (AWS) announced a partnership to support Cloudera Enterprise on AWS infrastructure. Along with this announcement, we released a Deployment Reference Architecture Whitepaper. In this post, you’ll get answers to the most frequently asked questions about the architecture and the configuration choices that have been highlighted in that whitepaper.
For what workloads is this deployment model designed?
This reference architecture is intended for long-running Cloudera Enterprise Data Hub Edition (EDH) clusters, where the source of data for your workloads is HDFS with S3 as a secondary storage system (used for backups). Different kinds of workloads can run in this environment, including batch processing (MapReduce), fast in-memory analytics (Apache Spark), interactive SQL (Impala), search, and low-latency serving using HBase.
This deployment model is not designed for transient workloads such as spinning up a cluster, running a MapReduce job to process some data, and spinning it down; that model involves different considerations and design. Clusters with workflow-defined lifetimes (transient clusters) will be addressed in a future publication of the reference architecture.
Why support only VPC deployments?
Amazon Virtual Private Cloud (VPC) is the standard deployment model for AWS resources and the default for all new accounts that are being created now. Cloudera recommends deploying in VPC for the following reasons:
- The easiest way to deploy in AWS, where the AWS resources appear as an extension to the corporate network, is to do so inside a VPC, with a VPN/Direct Connect link to the particular AZ in which you are deploying.
- VPC has more advanced security options that you can use to comply with security policies.
- More advanced features, and better network performance for new instance types, are available.
Should I have one VPC per cluster? Or should I have one subnet per cluster in a single VPC? What about multiple clusters in a single subnet?
Some customers consider having one VPC per environment (dev, QA, prod). Within a single VPC, you can have independent subnets for different clusters—and in some cases, multiple subnets for each cluster, where each subnet is for instances playing a particular role (such as Flume nodes, cluster nodes, and so on).
The easiest way to deploy a cluster is to deploy all nodes to a single subnet and use security groups to control ingress and egress in a single VPC. Keep in mind that it’s nontrivial to get instances in different VPCs to interact.
What about different subnets for different roles versus controlling access using security groups?
You have two models of deployment to consider, depending on your security requirements and policies:
- Entire cluster within a single subnet — this means that all the different role types that make up a cluster (slaves, masters, Flume nodes, edge nodes) will be deployed within a single subnet. In most cases, the network access rules for these nodes differ. For example, users will be allowed to login to the edge nodes but not the slave or the master nodes. When deploying in a single subnet, the network rules can be modeled using security groups.
- Subnet per role per cluster — in this model, each of the different roles will have its own subnet in which to deploy. This is a more complex network topology and allows for finer-grained control over the network rules. In this case, you can use a combination of subnet route tables, security groups, and network ACLs to define your networking rules. However, just using security groups and defining the route tables appropriately is sufficient from a functionality standpoint.
Both models are equally valid, but Model #1 is easier to manage.
I don’t want my instances to be accessible from the Internet. Do I HAVE to deploy them in a public subnet?
Currently, there are two ways an instance can get outbound access to the internet, which is required for it to access other AWS services like S3 (excluding RDS) or external repositories for software updates (find detailed documentation here):
- By having a public IP address — this allows the instance to initiate outgoing requests. You can block all incoming traffic using Network ACLs or Security Groups. In this case, you have to set up the routing within your VPC to permit traffic between the subnet hosting your instances and the Internet gateway.
- By having a private IP address only but having a NAT instance in a different subnet through which to route traffic — this allows for all traffic to be routed through the NAT instance. Similar to on-premise configurations, a NAT instance is typically a Linux EC2 instance configured to run as a NAT residing in a subnet that has access to the Internet. You can direct public Internet traffic from subnets that can’t directly access the Internet to the NAT instance.
If you just transfer any sizable amount of data to the public Internet domain (including S3), the recommended method is deployment Model 1. With Model 2, you will bottleneck on the NAT instance.
Why only choose cc2.8xlarge and hs1.8xlarge instances as the supported ones?
Cloudera Enterprise Data Hub Edition deployments have multiple kinds of workloads running in a long running cluster. To support these different workloads, the individual instances need to provide enough horsepower. The cc2.8xlarge and hs1.8xlarge instances make for the best choices amongst all EC2 instances for such deployments for the following reasons:
- Individual instance performance does not suffer from the problem of chatty neighboring applications on the same physical host.
- These instances are on a flat 10G network.
- They have a good amount of CPU and RAM available.
For relatively low storage density requirements, the cc2.8xlarge are the recommended option, and where the storage requirement is high, the hs1.8xlarge are a better choice.
Other instance types are reasonable options for specialized workloads and use cases. For example, a memcached deployment would likely benefit from the high-memory instances, and a transient cluster with only batch-processing requirements could probably leverage the m1 family instances (while having a higher number of them). However, as previously explained, those workloads are not addressed by this reference architecture, which is rather intended for long-running EDH deployments where the primary storage is HDFS on the instance stores, supporting multiple different kinds of workloads on the same cluster.
Why not EBS-backed HDFS?
There are multiple reasons why some people consider Amazon Elastic Block Storage (EBS). They include:
Increasing the storage density per node but using smaller instance types
You can certainly increase the storage density per node by mounting EBS volumes. Having said that, there are a few reasons why doing so doesn’t help:
- Not many of the instance types are good candidates for running an EDH that can sustain different kinds of workloads predictably. Adding a bunch of network-attached storage does theoretically increase the storage capacity, but the other resources like CPU, memory, and network bandwidth don’t change. Therefore, it’s undesirable to use small instance types with EBS volumes attached to them.
- The bandwidth between the EC2 instances and EBS volumes is limited so you’ll likely be bottlenecked on that.
- EBS shines with random I/O. Sequential I/O, which is the predominant access pattern for Hadoop, is not EBS’s forte.
- You pay per IOP on EBS, and for workloads that require large amounts of I/O, that can get expensive to a point that having more instances might be more reasonable than adding EBS volumes and keeping the instance footprint small.
Allowing expansion of storage on existing instances in an existing cluster, thereby not having to add more instances if the storage requirements increase.
The justifications for this requirement are similar to those above. Furthermore, adding storage to a cluster that is predominantly backed by the instance-stores would mean that you have heterogeneous storage options in the same cluster, with different performance and operational characteristics.
More EBS volumes means more spindles, and hence better performance.
Adding EBS volumes does not necessarily mean better I/O performance. For example, EBS volumes are network attached — therefore, the performance is limited by the network bandwidth between the EC2 instances and EBS. Furthermore, as highlighted previously, EBS shines with random I/O in contrast to sequential I/O, which is the predominant access pattern for Hadoop.
Storing the actual files in EBS will enable pausing the cluster and bringing it back on at a later stage.
Today, this is a complex requirement from an operational perspective. The only EC2 instances that can be stopped and later restarted are the EBS-backed ones; the others can only be terminated.
If you mount a bunch of EBS volumes to the EBS-backed instances and use them as data directories, they remain there when the instances are started up again and the data in them stays intact. From that perspective, you’ll have all your data directories mounted just the way you left them prior to the restart, and HDFS should be able to resume operations.
If you mount EBS volumes onto instance-store backed instances, restarting HDFS would mean un-mounting all the EBS volumes when you stop a cluster and then re-mounting them onto a new cluster later. This approach is operationally challenging as well as error-prone.
Although both these options are plausible in theory, they are also not very well tested, and HDFS is not designed to leverage these features regardless.
EBS has higher durability than instance stores and we can reduce the HDFS replication if we use EBS.
This is an interesting proposition and the arguments for and against it are the same as if you were to use NAS to back your HDFS on bare-metal deployments. While certainly doable, there are downsides:
- By reducing replication of HDFS, you are not only giving up on fault tolerance and fast recoverability but also performance. Because fewer copies of the blocks would be available with which to work, more data will move over the network.
- All your data will be going over the network between the EC2 instance and EBS volumes, thereby affecting performance.
Using EBS to back HDFS certainly looks like an attractive option, but as you look at all the factors mentioned above, it should become clear that it has too many performance, cost, and operational drawbacks.
Can I pause my cluster in this deployment model? (Or, Can I stop a cluster when I’m not using it and save money?)
Clusters in which HDFS is backed by instance-stores cannot be paused. Pausing a cluster entails stopping the instances, and when you stop instances, the data on the instance-stores is lost. You can find more information about instance lifecycle here.
What if I don’t want connectivity back to my data center via VPN or Direct Connect?
You don’t have to have connectivity back to your data center if you don’t have to move data between your Hadoop cluster in AWS and other components that may be hosted in your data center.
What are “placement groups”, and why should I care?
As formally defined by AWS documentation:
A placement group is a logical grouping of instances within a single Availability Zone. Using placement groups enables applications to get the full-bisection bandwidth and low-latency network performance required for tightly coupled, node-to-node communication typical of HPC applications.
By deploying your cluster in a placement group, you are guaranteeing predictable network performance across your instances. Anecdotally, network performance between instances within a single Availability Zone that are not in a single placement group can be lower than if they were within the same placement group. As of today, our recommendation is to spin up your cluster (at least the slave nodes) within a placement group. Having said that, placement groups are more restrictive in terms of the capacity pool that you can use to provision your cluster, which can make expanding a cluster challenging.
The root volume is too small for the logs and parcels. What are my choices?
You can resize the root volume on instantiation. However, doing so is more challenging with some AMIs than others. The only reason to resize the root volume is to be able to have enough space to store the logs that Hadoop generates, as well as parcels. For those two purposes, our recommendation is to mount an additional EBS volume and use that instead. You can use the additional EBS volume by sym-linking the /var/logs and /opt/cloudera directories to that. You can also configure Cloudera Manager to use a different path than /var/logs for logs and /opt/cloudera for parcels.
Amandeep Khurana is a Principal Solutions Architect at Cloudera and has been heavily involved in Cloudera’s cloud efforts.