How-to: Install CDH on Mac OSX 10.9 Mavericks

Categories: CDH General

This overview will cover the basic tarball setup for your Mac.

If you’re an engineer building applications on CDH and becoming familiar with all the rich features for designing the next big solution, it becomes essential to have a native Mac OSX install. Sure, you may argue that your MBP with its four-core, hyper-threaded i7, SSD, and 16GB of DDR3 memory is sufficient for spinning up a VM, and in most instances — such as using a VM for a quick demo — you’re right. However, when experimenting with a slightly heavier, more resource-intensive workload, you’ll want to explore a native install.

In this post, I will cover setup of a few basic dependencies and the necessities to run HDFS, MapReduce with YARN, Apache ZooKeeper, and Apache HBase. Use it as a guideline for getting your local CDH box set up, with the objective of enabling you to build and run applications on the Apache Hadoop stack.

Note: This process is not supported, so you should be comfortable operating as a self-supporting sysadmin. With that in mind, the configurations throughout this guideline are suggested for your default bash shell environment and can be set in your ~/.profile.

Dependencies

Install the Java version that is supported for the CDH version you are installing. In my case for CDH 5.1, I’ve installed JDK 1.7 u67. Historically the JDK for Mac OSX was only available from Apple, but since JDK 1.7, it’s available directly through Oracle’s Java downloads. Download the .dmg (in the example below, jdk-7u67-macosx-x64.dmg) and install it.

Verify and configure the installation:

Old Java path: /System/Library/Frameworks/JavaVM.framework/Home
New Java path: /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home

export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home"
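
To verify, check the version and ask OSX for the active JDK home; the output should resemble the following (exact update level will vary):

java -version
# java version "1.7.0_67"
/usr/libexec/java_home -v 1.7
# /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home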

Note: You’ll notice that after installing the Oracle JDK, the original path used to manage versioning, /System/Library/Frameworks/JavaVM.framework/Versions, is not updated; you now have control to manage your versions independently.

Enable ssh on your Mac by turning on Remote Login. You can find this option under the Apple menu > System Preferences > Sharing.

  1. Check the box for Remote Login to enable the service. 
  2. Allow access for: “Only these users: Administrators”

    Note: In this same window, you can modify your computer’s hostname.

Enable password-less ssh login to localhost for MRv1 and HBase. 

  1. Open your terminal.
  2. Generate an rsa or dsa key.
    1. ssh-keygen -t rsa -P ""
    2. Continue through the key generator prompts (use default options).
  3. Test: ssh localhost
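
Note that step 2 only generates the key pair; the public key must also be authorized before the test will succeed. A minimal sequence that also avoids clobbering an existing key (a sketch, assuming an rsa key at the default path):

# generate a key only if one doesn't already exist
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost   # should log in without prompting for a password
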
Homebrew

Another toolkit I admire is Homebrew, a package manager for OSX. While Xcode developer command-line tools are great, the savvy naming conventions and ease of use of Homebrew get the job done in a fun way. 

I haven’t needed Homebrew for much besides installing the dependencies required to build native Snappy libraries for Mac OSX, and for an easy MySQL install for Hive. Snappy is commonly used within HBase, HDFS, and MapReduce for compression and decompression.
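
For example, with Homebrew installed (see brew.sh for the current install one-liner), the packages used later in this post can be pulled in as follows (formula names are an assumption; check brew search if they have changed):

brew install snappy   # native Snappy library, for (re)building Snappy support
brew install mysql    # optional: MySQL for the Hive metastore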

CDH

Finally, the easy part: The CDH tarballs are very nicely packaged and easily downloadable from Cloudera’s repository. I’ve downloaded tarballs for CDH 5.1.0.

Download and explode the tarballs in a lib directory where you can manage the latest versions with simple symlinks, as in the layout below. Do not use Mac OSX’s Finder “Make Alias” feature: Finder aliases are not symlinks and won’t be resolved from the command line. Instead, use the command-line ln -s command, as in ln -s source_file target_file.

  • /Users/jordanh/cloudera/
    • cdh5.1/
      • hadoop -> /Users/jordanh/cloudera/lib/hadoop-2.3.0-cdh5.1.0
      • hbase -> /Users/jordanh/cloudera/lib/hbase-0.98.1-cdh5.1.0
      • hive -> /Users/jordanh/cloudera/lib/hive-0.12.0-cdh5.1.0
      • zookeeper -> /Users/jordanh/cloudera/lib/zookeeper-3.4.5-cdh5.1.0
    • lib/ (the exploded tarballs)
    • ops/
      • dn/
      • logs/hadoop, logs/hbase, logs/yarn
      • nn/
      • pids/
      • tmp/
      • zk/
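
For reference, a sketch of the commands that produce this layout (tarball names assume the CDH 5.1.0 downloads mentioned above; adjust them to your versions):

mkdir -p ~/cloudera/{cdh5.1,lib,ops/{dn,nn,pids,tmp,zk,logs/{hadoop,hbase,yarn}}}
tar -xzf hadoop-2.3.0-cdh5.1.0.tar.gz -C ~/cloudera/lib    # repeat for hbase, hive, zookeeper
ln -s ~/cloudera/lib/hadoop-2.3.0-cdh5.1.0 ~/cloudera/cdh5.1/hadoop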

You’ll notice above that you’ve created a handful of directories under a folder named ops. You’ll use them later to customize the configuration of the essential components for running Hadoop. Set your environment properties according to the paths where you’ve exploded your tarballs. 
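
For example, in ~/.profile (the CDH_ROOT variable is just a convenience I’m assuming here; point everything at your own layout):

export CDH_ROOT="/Users/jordanh/cloudera"
export HADOOP_HOME="$CDH_ROOT/cdh5.1/hadoop"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HBASE_HOME="$CDH_ROOT/cdh5.1/hbase"
export HIVE_HOME="$CDH_ROOT/cdh5.1/hive"
export HCAT_HOME="$HIVE_HOME/hcatalog"
export ZOOKEEPER_HOME="$CDH_ROOT/cdh5.1/zookeeper"
export PATH="$HADOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin:$PATH"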

Update your main Hadoop configuration files, as shown in the sample files below. You can also download all files referenced in this post directly from here.
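
A minimal sketch of the two core files in $HADOOP_CONF_DIR, pointing HDFS at the ops directories created above (the values are illustrative, not the exact files from this post):

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/jordanh/cloudera/ops/tmp</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///Users/jordanh/cloudera/ops/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///Users/jordanh/cloudera/ops/dn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>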


I adapted the YARN and MRv2 configuration and setup from the CDH 5 installation docs. I won’t digress into the specifics of each property or the orchestration and details of how YARN and MRv2 operate, but there’s some great information that my colleague Sandy has already shared for developers and admins.

Be sure to make the necessary adjustments per your system’s memory and CPU constraints; these parameters directly affect your machine’s performance when you execute jobs.

Next, edit the following files as shown.
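
A minimal sketch of both files (memory/vcore values are illustrative; size them to your machine):

yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/Users/jordanh/cloudera/ops/logs/yarn</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>localhost:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>localhost:19888</value>
  </property>
</configuration>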



You can configure HBase to run without separately downloading Apache ZooKeeper: it bundles ZooKeeper and can run it as a separate instance or in standalone mode within a single JVM. For ease of use, configuration, and management, I recommend either of those modes over a separately downloaded ZooKeeper tarball on your machine.

The primary configuration difference between running HBase in distributed and standalone mode is the hbase.cluster.distributed property in hbase-site.xml. Set the property to false to launch HBase in standalone mode, or to true to spin up separate instances for services such as HBase’s ZooKeeper and RegionServer. Update the following HBase configurations accordingly.

Note regarding hbase-site.xml: hbase.cluster.distributed is set to false by default, which launches HBase in standalone mode. Also, hbase.zookeeper.quorum defaults to localhost and does not need to be overridden in our scenario.

Note regarding $HBASE_HOME/conf/hbase-env.sh: By default, HBASE_MANAGES_ZK is set to true; it is shown below only to make the setting explicit.
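
A minimal sketch of both files for the distributed-mode variant described above (rootdir and dataDir values are illustrative):

hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/Users/jordanh/cloudera/ops/zk</value>
  </property>
</configuration>

$HBASE_HOME/conf/hbase-env.sh:

export HBASE_MANAGES_ZK=true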

Pulling it All Together

By now, you should have accomplished setting up HDFS, YARN, and HBase. Hadoop setup and configuration is quite tedious, to say nothing of managing it over time (hence Cloudera Manager, which is unfortunately not available for Macs).

These are the bare essentials for getting your local machine ready for running MapReduce jobs and building applications on HBase. In the next few steps, we will start/stop the services and provide examples to ensure each service is operating correctly. The steps are listed in the order required for initialization, to respect service dependencies; reverse the order when halting the services.

Service HDFS

NameNode

format:  hdfs namenode -format

start:  hdfs namenode

stop:  Ctrl-C

url:  http://localhost:50070/dfshealth.html

DataNode

start:  hdfs datanode

stop:  Ctrl-C

url:  http://localhost:50075/browseDirectory.jsp?dir=%2F&nnaddr=127.0.0.1:8020

Test

hadoop fs -mkdir /tmp

hadoop fs -put /path/to/local/file.txt /tmp/

hadoop fs -cat /tmp/file.txt

Service YARN

ResourceManager

start:  yarn resourcemanager

stop:  Ctrl-C

url:  http://localhost:8088/cluster

NodeManager

start:  yarn nodemanager

stop:  Ctrl-C

url:  http://localhost:8042/node

MapReduce Job History Server

start:  mapred historyserver (or: mr-jobhistory-daemon.sh start historyserver)

stop:  Ctrl-C (or: mr-jobhistory-daemon.sh stop historyserver)

url:  http://localhost:19888/jobhistory/app

Test Vanilla YARN Application
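
For example, the DistributedShell application that ships with Hadoop can run an arbitrary command in a YARN container (the jar version string is assumed to match your tarball; the ps payload is just an illustration):

yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.0.jar \
  -shell_command "ps aux" -num_containers 1

Note that only MapReduce job logs are viewable through the JobHistory web UI; view YARN application output from the command line with yarn logs -applicationId <application_id>.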

Test MRv2 YARN TestDFSIO
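
A typical invocation of the TestDFSIO benchmark from the jobclient tests jar (file count and size are illustrative; the jar version string is assumed to match your tarball):

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.3.0-cdh5.1.0-tests.jar TestDFSIO -write -nrFiles 4 -fileSize 128MB
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.3.0-cdh5.1.0-tests.jar TestDFSIO -read -nrFiles 4 -fileSize 128MB
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.3.0-cdh5.1.0-tests.jar TestDFSIO -clean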

Test MRv2 YARN Terasort/Teragen
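
For example, generate rows with teragen and sort them with terasort (row count and paths are illustrative):

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar teragen 1000000 /tmp/teragen
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar terasort /tmp/teragen /tmp/terasort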

Test MRv2 YARN Pi
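
For example, estimate pi with 10 maps of 100 samples each:

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar pi 10 100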

Service HBase

HBase Master/RegionServer/ZooKeeper

start:  start-hbase.sh

stop:  stop-hbase.sh

logs:  /Users/jordanh/cloudera/ops/logs/hbase/

url:  http://localhost:60010/master-status

Test
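
A quick smoke test from the HBase shell (table and column family names are illustrative):

hbase shell
hbase> create 't1', 'cf1'
hbase> put 't1', 'row1', 'cf1:a', 'hello'
hbase> scan 't1'
hbase> disable 't1'
hbase> drop 't1'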

Kite SDK Test

Get familiar with the Kite SDK by trying out this example, which loads data to HDFS and then HBase. Note that a few common issues may surface on OS X when running through the Kite SDK example. They can easily be resolved with additional setup/config, as specified below.

Problem:  NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/NoSuchObjectException

Resolution:  Fix your classpath by making sure to set HIVE_HOME and HCAT_HOME in your environment.

Problem:  InvocationTargetException Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path

Resolution:  Snappy libraries are not compiled for Mac OSX out of the box. A Snappy Java port was introduced in CDH 5 and will likely need to be recompiled on your machine.

Landing Page

Creating a landing page will help consolidate all the HTTP addresses of the services that you’re running. Please note that localhost can be replaced with your local hostname (such as jakuza-mbp.local).

Service Apache HTTPD

start: sudo -s launchctl load -w /System/Library/LaunchDaemons/org.apache.httpd.plist

stop: sudo -s launchctl unload -w /System/Library/LaunchDaemons/org.apache.httpd.plist

logs: /var/log/apache2/

url: http://localhost/index.html

Create index.html (edit /Library/WebServer/Documents/index.html, which you can download here).
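
A minimal sketch of such a page, linking the service URLs listed earlier in this post:

<html>
  <body>
    <h1>Local CDH Services</h1>
    <ul>
      <li><a href="http://localhost:50070/dfshealth.html">HDFS NameNode</a></li>
      <li><a href="http://localhost:8088/cluster">YARN ResourceManager</a></li>
      <li><a href="http://localhost:8042/node">YARN NodeManager</a></li>
      <li><a href="http://localhost:19888/jobhistory/app">MapReduce Job History</a></li>
      <li><a href="http://localhost:60010/master-status">HBase Master</a></li>
    </ul>
  </body>
</html>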

[Screenshot: the rendered landing page, a simple list of links to each running service’s web UI.]

Conclusion

With this guide, you should have a locally running Hadoop cluster with HDFS, MapReduce, and HBase. These are the core components of Hadoop, and a good initial foundation for building and prototyping your applications locally.

I hope this will be a good starting point on your dev box to try out more ways to build your products, whether they are data pipelines, analytics, machine learning, search and exploration, or more, on the Hadoop stack. 

Jordan Hambleton is a Solutions Architect at Cloudera.


14 responses on “How-to: Install CDH on Mac OSX 10.9 Mavericks”

  1. Stephen Boesch

    Hi, please add instructions for hive with mysql as the metastore. Hive is an essential ingredient of a Hadoop ecosystem – more so than HBase. In any case thanks for putting together what is there so far (including HBase). thanks.

  2. Jordan Hambleton

    Hi Stephen,

    Appreciate the comment. If you’ve followed the steps above, hive will work out of the box using its embedded Derby metastore. Be sure to launch the hive shell from the same local directory each time, so that you use the same metastore you created.

    In addition to the local metastore, you can install mysql via brew. Follow the config & setup from the link below. I’ve listed a few tips below.

    https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/5.0/CDH5-Installation-Guide/cdh5ig_hive_metastore_configure.html

    1. Install Mysql & Setup (mac conversions)
    1.1. brew install mysql
    1.2. follow instructions for mysql config from above CDH5 install link.
    1.3. get mysql connector (ie. mysql-connector-java-5.1.16-bin.jar) and copy it to $HIVE_HOME/lib/

    quick tips
    * start mysql: mysql.server start
    * stop mysql: mysql.server stop

    Lastly, update hive-site.xml as specified on CDH5 install link above or find a copy on my github: https://github.com/joropolis/misc-data/blob/master/blog-2014-09-01/hive-site.xml

    Launch hive shell & if you have any issues, try running in debug mode.

    hive -hiveconf hive.root.logger=DEBUG,console

    If you see an error like the following, ensure the mysql connector jar is in $HIVE_HOME/lib/.
    * The specified datastore driver (“com.mysql.jdbc.Driver”) was not found in the CLASSPATH

  3. Debasish

    Are you sure it works?

    I am getting this error:

    org.apache.hadoop.util.Shell$ExitCodeException:

    I saw on other posts that yarn.application.classpath has to be set to fix this error but I could not make that work yet as well…

  4. Jordan Hambleton

    Thanks for the note Debasish. Yes, this is working without additional configuration. Did you check your yarn logs for the DistributedShell command you executed (see snip below)? In my example, it will print out the top cpu hogs on your mac!

    Also, note that only mapreduce job logs are viewable through the mapred historyserver web. Use the cmd line to view your yarn logs based on the application id per example below.

    $ yarn logs -applicationId application_1412661426311_0001
    Container: container_1412661426311_0001_01_000002 on jakuza-mbp_54669
    =======================================================================
    LogType: stderr
    LogLength: 0
    Log Contents:

    LogType: stdout
    LogLength: 6724
    Log Contents:
    PID STAT %CPU TIME COMMAND
    2703 S+ 25.0 0:01.76 /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/…
    159 Ss 14.4 1:42.13 /Library/StartupItems/SymAutoProtect/…
    122 Ss 5.1 6:30.38 /System/Library/Frameworks/ApplicationServices…
    2516 S+ 4.1 0:07.47 /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/…

  5. Somnath

    HUE install fails with the message

    /Users/somnathchoudhuri/software/cloudera/hue-3.6.0-cdh5.1.3> make apps
    /Users/somnathchoudhuri/software/cloudera/hue-3.6.0-cdh5.1.3/Makefile.vars:42: *** “Error: must have python development packages for 2.6 or 2.7. Could not find Python.h. Please install python2.6-devel or python2.7-devel”. Stop.

    We have tried uninstalling and installing Python using brew and also installing gcc. Nothing seems to work. Other than Hue, everything else starts up ok (datanode, namenode, resourcemanager, nodemanager, proxyserver and history server).

  6. MIKE B

    @Somnath:
    I had the same problem, and I took a look in Makefile.vars, and it is looking for the python libs in /usr/include/python2.7

    I ran the following, and it seems to work:
    sudo mkdir /usr/include
    sudo ln -s /usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 /usr/include/python2.7

    (You might have to adjust these slightly if you’re using a different version of Python.)

    Hope this helps.

    1. SteveL

      Note that this doesn’t work for El Capitan.
      In El Capitan, its System Integrity Protection feature means you can’t create a /usr/include directory.

      The workaround is to disable the check before the build:
      export SKIP_PYTHONDEV_CHECK=1

  7. Is it possible to install cloudera manager agent on mac

    Awesome post. I successfully configured it on my mac. One question is whether it’s possible to install cloudera manager agent on mac. I have some old linux machines and I’ve configured them through cloudera manager. I want to add my mac to the cluster managed by cm. Thanks a lot.

  8. Soon

    Thanks, this was really helpful. I was able to install it successfully on my mac with no problems. Is it possible to run HiveServer2?

  9. Jordan Hambleton

    Soon, HiveServer2 requires configuring your hive client config property hive.metastore.uris in $HIVE_HOME/conf/hive-site.xml as below (a copy can be found from mentioned links).

    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://localhost:9083</value>
      <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
    </property>

    To start the metastore & hiveserver2, use the following commands:

    > hive --service metastore
    > hive --service hiveserver2

    Connect using beeline and query your tables. Be sure hdfs, yarn, and mysql (if in use) are running prior to any queries that result in MapReduce jobs.

    > beeline -u jdbc:hive2://localhost:10000

    If you see the below expected error, you can log in as a different user that has access to the HDFS data you’re querying, using -n in the beeline command.
    * RuntimeException org.apache.hadoop.security.AccessControlException: Permission denied: user=anonymous …

  10. a

    For passwordless

    ssh-keygen -t rsa -P ""

    may be practical if it was not there before, but it’s not good practice to overwrite people’s existing configs.

    Use this if you generated the keys before:

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    Or wrap it with a few more lines that check whether keys were generated earlier.

  11. Ratul

    Hi
    Thanks for the detailed instructions about installing CDH on a local machine. I have followed all the instructions.
    I am getting the below error while running the “Test Vanilla YARN Application” on my local mac machine.

    16/06/08 15:40:50 INFO localizer.ResourceLocalizationService: Localizer failed
    java.lang.NullPointerException
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:345)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
    at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:437)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1121)
    16/06/08 15:40:50 ERROR nodemanager.DeletionService: Exception during execution of task in DeletionService
    java.lang.NullPointerException
    at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
    at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
    at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:272)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

    Ratul