Multi-node Clusters with Cloudera QuickStart for Docker

Getting hands-on with a multi-node cluster for self-learning or testing is now even easier.

Last December, we introduced the Cloudera QuickStart Docker image to make it easier than ever before to explore Cloudera’s distributed data processing platform, including tools such as Apache Impala (incubating), Apache Spark, and Apache Solr. While the single-node getting-started image was well received, we noted a large number of requests from the community for a multi-node CDH deployment via Docker. Today, we are excited to announce the new-and-improved Cloudera QuickStart for Docker.

To enable a multi-node cluster deployment on the same Docker host, we created a CDH topology for Apache HBase’s clusterdock framework. As detailed in HBASE-12721, clusterdock is a simple, Python-based library designed to orchestrate multi-node cluster deployments on a single host. Unlike existing tools such as Docker Compose, which are great at managing microservice architectures, clusterdock orchestrates multiple containers to act more like traditional hosts. In this paradigm, a four-node Apache Hadoop cluster uses four containers. We’ve found it to be a great tool for testing and prototyping.

Getting Started

To begin, install Docker on your host. Older versions of Docker lack the embedded DNS server and correct reverse hostname lookup required by Cloudera Manager, so ensure you’re running Docker 1.11.0 or newer. Also, keep in mind that the host you use to run your CDH cluster must meet the same resource requirements as a normal multi-node deployment. Therefore, we recommend at least 16GB of free RAM for a two-node cluster and at least 24GB of free RAM for a four-node cluster.
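As a quick preflight, the version requirement above can be checked from the shell. The following is a minimal sketch, not part of clusterdock: version_ge is a hypothetical helper name, and it assumes a sort command that supports -V (GNU coreutils or a recent BSD/macOS sort).

```shell
# Hypothetical helper (not part of clusterdock): succeeds when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="1.11.0"
if command -v docker >/dev/null 2>&1; then
    # Ask the Docker daemon for its version; empty if the daemon is unreachable.
    installed="$(docker version --format '{{.Server.Version}}' 2>/dev/null)"
    if [ -n "$installed" ] && version_ge "$installed" "$required"; then
        echo "Docker $installed is new enough (>= $required)."
    else
        echo "Docker ${installed:-(unknown version)} may be too old; $required or newer is required."
    fi
else
    echo "Docker does not appear to be installed."
fi
```

The check only covers the Docker version; free RAM still has to be confirmed separately against the 16GB and 24GB recommendations.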

For ease of use and portability, clusterdock itself is packaged in a Docker image, and its binaries are executed by running containers from this image and specifying an action. This can be done by sourcing the clusterdock.sh helper script and then calling the script of interest with the clusterdock_run command. As is always a good idea when executing code from the internet, examine the helper script to convince yourself of its safety before sourcing it.

With everything ready to go, let’s get started!

Basic Usage

Starting a cluster with clusterdock takes advantage of an abstraction known as a topology; in short, a basic set of steps needed to coordinate pre-built Docker images into a functioning multi-container cluster. If all you’d like is a two-node cluster (with default options being used for everything else), simply invoke the start_cluster script with the cdh topology and no further options.
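A sketch of the default invocation, assuming the clusterdock.sh helper script has already been sourced so that clusterdock_run is defined. The command form follows the one quoted in the comments below; the first line is only a dry-run stand-in added for illustration, so the snippet prints the command instead of failing when clusterdock is not loaded.

```shell
# Illustration-only stand-in: if clusterdock.sh has not been sourced,
# print the command instead of running it.
command -v clusterdock_run >/dev/null 2>&1 || clusterdock_run() { echo "clusterdock_run $*"; }

# Start a cluster from the cdh topology with all default options.
clusterdock_run ./bin/start_cluster cdh
```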

When this is run, clusterdock will start two containers from images stored on Docker Hub. As they contain a full Cloudera Manager/CDH deployment, downloading the images the first time may take upwards of five minutes, but this is a one-time cost as the images are then cached locally by Docker. As the cluster starts, clusterdock manages communication between containers through Docker’s bridge networking driver and also updates your host’s /etc/hosts file to make it easier to connect to your container cluster.

Once the cluster is running and the health of your CDH services is validated, you can access the cluster through the Cloudera Manager UI (the address and port number are shown at the end of the startup process). You can also SSH directly to nodes of your cluster using the clusterdock_ssh function, which takes the fully qualified domain name of the node as its argument. Doing so drops you into a shell on that node without having to deal with setting up SSH keys on the host.
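A sketch of such a call, assuming the default node name node-1.cluster that appears in the startup logs quoted in the comments below. The first line is a dry-run stand-in for illustration only, used when clusterdock.sh has not been sourced.

```shell
# Illustration-only stand-in: print the command if clusterdock_ssh is not defined.
command -v clusterdock_ssh >/dev/null 2>&1 || clusterdock_ssh() { echo "clusterdock_ssh $*"; }

# Open a shell on a node, addressed by its fully qualified domain name.
clusterdock_ssh node-1.cluster
```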

Advanced Usage

clusterdock supports a number of options that can provide for a more interesting testing environment. We provide a few examples in the sections below, but full usage instructions can be seen by including --help in the invocation of the start_cluster script or of the cdh topology itself.
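The two help invocations might look like the following sketch, assuming --help is accepted both before and after the topology name as the text describes. The first line is a dry-run stand-in for illustration only.

```shell
# Illustration-only stand-in: print the command if clusterdock_run is not defined.
command -v clusterdock_run >/dev/null 2>&1 || clusterdock_run() { echo "clusterdock_run $*"; }

# Usage for the start_cluster script itself.
clusterdock_run ./bin/start_cluster --help

# Usage for the options specific to the cdh topology.
clusterdock_run ./bin/start_cluster cdh --help
```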

Larger Cluster Deployments

If your machine has the available resources, clusterdock allows you to start n-node clusters where one node acts as the CM server (and has the majority of CDH service roles assigned to it) and the remaining n-1 nodes act as secondaries with most CDH slave services assigned to them. As an example, consider a four-node CDH cluster in which the containers are named node-1.testing, node-2.testing, node-3.testing, and node-4.testing.
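A sketch of a four-node start. The --primary-node and --secondary-nodes options appear in the invocation quoted in the comments below; the -n option used here to set the network name (and thus the .testing suffix) is an assumption and should be confirmed against the --help output. The first line is a dry-run stand-in for illustration only.

```shell
# Illustration-only stand-in: print the command if clusterdock_run is not defined.
command -v clusterdock_run >/dev/null 2>&1 || clusterdock_run() { echo "clusterdock_run $*"; }

# One primary node plus three secondaries. The brace pattern is quoted so the
# local shell passes it through unexpanded.
# ASSUMPTION: "-n testing" names the network; verify with --help.
clusterdock_run ./bin/start_cluster -n testing cdh \
    --primary-node=node-1 --secondary-nodes='node-{2..4}'
```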

In this case, the clusterdock CDH topology takes advantage of Cloudera Manager’s host template functionality to distribute the roles on node-2 to node-3 and node-4. That is, with only two images, clusterdock allows for arbitrarily-sized cluster deployments. (Again, this is a full cluster running on a single host with a single host’s worth of resources. Be careful!)

Specifying Services to Include (or Exclude)

The clusterdock CDH topology allows you to provide a list of the service types to include in your cluster. This functionality uses the --include-service-types option and removes any service type from Cloudera Manager not included in the list. For example, a two-node cluster can be limited to HDFS, Apache ZooKeeper, Apache HBase, and YARN by listing only those service types.
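A sketch of that invocation. The option name comes from the text above; the comma-separated list format and the uppercase service type identifiers are assumptions based on Cloudera Manager's naming and should be checked against the --help output. The first line is a dry-run stand-in for illustration only.

```shell
# Illustration-only stand-in: print the command if clusterdock_run is not defined.
command -v clusterdock_run >/dev/null 2>&1 || clusterdock_run() { echo "clusterdock_run $*"; }

# Keep only the listed service types in the default two-node cluster.
# ASSUMPTION: comma-separated, uppercase Cloudera Manager service type names.
clusterdock_run ./bin/start_cluster cdh --include-service-types=HDFS,ZOOKEEPER,HBASE,YARN
```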

Similarly, an --exclude-service-types option can be used to explicitly leave out services. For example, a four-node cluster (machine-1.mycluster, machine-2.mycluster, machine-3.mycluster, machine-4.mycluster) can be created without Impala by excluding its service type.
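A sketch combining custom node names with an exclusion. As before, the -n option for the network name and the IMPALA identifier are assumptions to be confirmed against the --help output and the Cloudera Manager documentation; the first line is a dry-run stand-in for illustration only.

```shell
# Illustration-only stand-in: print the command if clusterdock_run is not defined.
command -v clusterdock_run >/dev/null 2>&1 || clusterdock_run() { echo "clusterdock_run $*"; }

# Four nodes named machine-1 through machine-4, with Impala left out.
# ASSUMPTIONS: "-n mycluster" names the network; IMPALA is the service type name.
clusterdock_run ./bin/start_cluster -n mycluster cdh \
    --primary-node=machine-1 --secondary-nodes='machine-{2..4}' \
    --exclude-service-types=IMPALA
```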

For a full list of service types, refer to the Cloudera Manager documentation.

Conclusion

While the single-node approach was good for learning and ramp-up, the new Cloudera QuickStart for Docker is also excellent for test and development. It provides an easy way to prototype new ideas and use cases, as well as try out new functionality and the latest Cloudera releases. (Just remember, it is neither intended nor supported for production use.)

Lastly, we’d love to know what you think. Please post any and all feedback in our Community Forum; we’d like to hear both positive and constructive suggestions for future improvements. Why not take this opportunity to try it out right away?

Dima Spivak is a Software Engineer at Cloudera.

23 responses on “Multi-node Clusters with Cloudera QuickStart for Docker”

  1. Alper Akture

    I know this is new, just tried it out, and get (mac osx v 10.11.5 w/ 16GB ram):
    INFO:clusterdock.topologies.cdh.actions:Waiting for Cloudera Manager server to come online…
    Traceback (most recent call last):
      File "./bin/start_cluster", line 70, in <module>
        main()
      File "./bin/start_cluster", line 63, in main
        actions.start(args)
      File "/root/clusterdock/clusterdock/topologies/cdh/actions.py", line 109, in start
        CM_SERVER_PORT, timeout_sec=180)
      File "/root/clusterdock/clusterdock/utils.py", line 52, in wait_for_port_open
        timeout_sec, address, port
    Exception: Timed out after 180 seconds waiting for 192.168.123.10:7180 to be open.

  2. Halil Duygulu

    Hi, thanks for your work on Docker. Would you consider an article about a multi-host Docker setup for test purposes, such as Spark executor and driver nodes on different computers?

    1. Dima Spivak

      Hi Halil,

      Please open a thread in “Hadoop 101” area at community.cloudera.com letting us know why the solution outlined here would be insufficient for your purposes.

      1. George U

        Hi. How can I connect to the primary node of the cluster? Port 8020 is not exposed, so I am unable to connect from the edge node. Can you add options to expose other ports? For example, I need port 50010 for writing from the edge node to HDFS.
        Thanks.

  3. Patrick Pearson

    This worked great. I ran it on my new Ubuntu server. The only issue I had was that I could not get to Hue or to any of the edge nodes because only port 7180 was forwarded. Did I miss something, or was this just to show that it could be done?

  4. Lance

    Has anyone had luck stopping and starting the docker containers and preserving the cluster as it was running before?
    Starting the node containers didn’t work. Starting the “manager” container started 2 new node containers but also didn’t work.

  5. Lay Hua

    Can I run the above with just 1 server with 256GB RAM? what is the recommended HDD configuration for 4TB X 12 HDD? JBOD or RAID 5?
    Thanks.

  6. mariro

    Hi, any help on this?
    This is on CentOS 7, where I cleaned /etc/hosts, removed all the containers, and ran init 6 too…

    TypeError: sequence item 0: expected string, ApiHost found

    details

    [tb@localhost ~]$ clusterdock_run ./bin/start_cluster cdh --dont-start-cluster --primary-node=node-1 --secondary-nodes='node-{2..4}'
    INFO:clusterdock.cluster:Successfully started node-2.cluster (IP address: 192.168.123.3).
    INFO:clusterdock.cluster:Successfully started node-3.cluster (IP address: 192.168.123.4).
    INFO:clusterdock.cluster:Successfully started node-1.cluster (IP address: 192.168.123.2).
    INFO:clusterdock.cluster:Successfully started node-4.cluster (IP address: 192.168.123.5).
    INFO:clusterdock.cluster:Started cluster in 8.46 seconds.
    INFO:clusterdock.topologies.cdh.actions:Changing server_host to node-1.cluster in /etc/cloudera-scm-agent/config.ini…
    INFO:clusterdock.topologies.cdh.actions:Removing files (/var/lib/cloudera-scm-agent/uuid, /dfs*/dn/current/*) from hosts (node-3.cluster, node-4.cluster)…
    INFO:clusterdock.topologies.cdh.actions:Restarting CM agents…
    cloudera-scm-agent is already stopped
    cloudera-scm-agent is already stopped
    cloudera-scm-agent is already stopped
    cloudera-scm-agent is already stopped
    Starting cloudera-scm-agent: [ OK ]
    Starting cloudera-scm-agent: [ OK ]
    Starting cloudera-scm-agent: [ OK ]
    Starting cloudera-scm-agent: [ OK ]
    INFO:clusterdock.topologies.cdh.actions:Waiting for Cloudera Manager server to come online…
    INFO:clusterdock.topologies.cdh.actions:Detected Cloudera Manager server after 63.06 seconds.
    INFO:clusterdock.topologies.cdh.actions:CM server is now accessible at http://localhost.localdomain:32769
    INFO:clusterdock.topologies.cdh.cm:Detected CM API v13.
    Traceback (most recent call last):
      File "./bin/start_cluster", line 70, in <module>
        main()
      File "./bin/start_cluster", line 63, in main
        actions.start(args)
      File "/root/clusterdock/clusterdock/topologies/cdh/actions.py", line 121, in start
        all_fqdns=[node.fqdn for node in cluster])
      File "/root/clusterdock/clusterdock/topologies/cdh/cm.py", line 98, in add_hosts_to_cluster
        all_fqdns=all_fqdns)
      File "/root/clusterdock/clusterdock/topologies/cdh/cm_utils.py", line 39, in add_hosts_to_cluster
        ', '.join(all_hosts)
    TypeError: sequence item 0: expected string, ApiHost found

  7. Pawan

    Is there an image or Compose file for a CDH deployment across multiple hosts? The configuration above creates multiple CDH nodes on one single host only, so that host is a single point of failure here.

  8. Dima

    It has some serious resource requirements and is definitely not going to fit on a standard developer laptop.
    Any chance (or plans) of providing a script to run it in AWS (free tier, for trying it out)? I think it would only require creating the task JSON file.

  9. Manav

    Hey, I want to give a common name to all the ‘n’ containers that get created when I run for ‘n’ nodes.
    The docker run command has an option to give names to running containers:
    docker run --name=bdm_nodes -i -t /bin/bash

  10. Abhi Basu

    Folks:

    I have used this documentation (https://blog.cloudera.com/blog/2016/08/multi-node-clusters-with-cloudera-quickstart-for-docker/) and have successfully installed a 2 node CDH cluster as docker containers. All hadoop services are working and I can ssh to the 2 nodes using the clusterdock_ssh command.

    Although I can ping the two IPs (hadoop nodes in containers) from the host OS (Ubuntu 16.04), I cannot seem to SFTP into them to move some files so I can put them on HDFS.

    I am new to Docker and wanted to post here to see if there is any config I missed to be able to SFTP from host OS to the two container nodes.

    Thanks,

    Abhi

  11. jonathan

    Having errors with HDFS: Failed to find datanode (scope="" excludedScope="/default"), Failed to place enough replicas, still in need of 2 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.
    Running a three-node cluster on a CentOS VM, on macOS. Please help!
