Apache Spot (incubating) and Cloudera on AWS in 60 Minutes


For the Apache Spot novice or for quick evaluation of a Cybersecurity solution on Cloudera Enterprise Data Hub (EDH) without the arduous tasks of manual installation, we’ve created a rapid deployment of Apache Spot on Amazon Web Services (AWS) using Cloudera Director.

You will immediately see how you can isolate and identify suspicious activities from the Apache Spot UI using the sample data provided in the deployment at cloud scale. We’ll explain the simple data pipeline that leverages Apache Spark to execute machine learning models against the massive amounts of network data you’ll ingest. We’ll also introduce the Open Data Model (ODM) and how it can be used to construct a topology of network events.

There are many moving parts to Apache Spot, and the cluster you build here can serve as a reference when installing and troubleshooting Apache Spot on your own cluster.

Apache Spot Introduction

Apache Spot is an open source solution for consuming network data and isolating and identifying suspicious activities, threats and attacks within a network. Apache Spot consumes three types of network telemetry data: DNS (pcap), Proxy and Netflow (nfcapd). DNS logs contain DNS query events serialized in the pcap binary format. Proxy logs contain access logs and follow the bluecoat text format. Netflow logs contain source and destination traffic and are serialized in the nfdump binary format.

These data sources are passed through a Latent Dirichlet allocation (LDA) machine learning model to find suspicious events for users to explore in the Apache Spot UI.


You must have an AWS account and Cloudera Director Client installed. To fulfill this requirement, follow these instructions for installing Cloudera Director. The Cloudera Director configuration file will also install Java 8 and add a “spot” account on all of the nodes.

Ready Set Go

On Cloudera Director Host

  • ssh on to the Cloudera Director node using your pem file as ec2-user
    • $ ssh -i key.pem ec2-user@cloudera-director-host
  • $ git clone https://github.com/hdulay/apache-spot-60-min
  • $ cd apache-spot-60-min
  • Export these environment variables (better to create an env.sh script to set these values and then source the script):
    • $ export AWS_ACCESS_KEY_ID=[your aws access key]
    • $ export AWS_SECRET_ACCESS_KEY=[aws secret access key]
    • $ export aws_region=[aws region]
    • $ export aws_subnet=[aws subnet]
    • $ export aws_security_group=[aws security group]
    • $ export path_to_private_key=[path to pem file (/home/ec2-user/.ssh/id_rsa)]
    • $ export aws_owner=[for example hdulay]
    • $ export aws_ami=[an AMI for RHEL-7.2, e.g. ami-2051294a for us-east]
  • Execute Cloudera Director. Takes about 20-30 minutes.
    • $ cloudera-director bootstrap spot-director.conf
  • Wait for the cluster to finish installing. If you run into issues, check the Cloudera Director log file. You may need to rebuild the cluster if you run into AWS related issues.
    • $ less ~/.cloudera-director/logs/application.log
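  • On AWS, go to the instances page, search for tag key “group” and select “cm” for Cloudera Manager.
  • Save this instance’s public IP (shown under the field “IPv4 Public IP”) and its private IP. The public IP is used in the browser; the private IP is passed to spot-install.sh on the gateway node.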
  • In the browser, go to Cloudera Manager http://[public-ip-for-CM]:7180 to verify the cluster installation.
  • Again, on AWS, go to the instances page and search for tag key “group” and select “gateway”
  • ssh to that gateway using the same pem file and user (ec2-user) you used to log into the Cloudera Director host.
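A missing environment variable is an easy way to lose one of those 20-30 minute bootstrap runs. The following is a minimal sketch, not part of the apache-spot-60-min repository, that checks the variables exported above before you call cloudera-director. The function name `missing_vars` is an assumption made for illustration; only the variable names come from the steps above.

```python
import os

# Variable names exported in the Cloudera Director host steps above.
REQUIRED_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "aws_region",
    "aws_subnet",
    "aws_security_group",
    "path_to_private_key",
    "aws_owner",
    "aws_ami",
]


def missing_vars(environ=None):
    """Return the required variables that are unset or empty.

    Hypothetical helper: pass a dict to check something other than the
    current process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_vars()` after sourcing your env.sh should return an empty list; anything else names the variables still to be set.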

On the Gateway Node

  • $ su - spot
    • Password “spot”
  • $ git clone https://github.com/hdulay/apache-spot-60-min
  • Copy the gateway artifacts to the home directory
    • $ cp apache-spot-60-min/gateway/* .
  • $ chmod +x spot-install.sh
  • $ ./spot-install.sh [CLOUDERA_MANAGER_PRIVATE_IP]
    • This will install spot. It should take about 30 min to install all the components.
    • If you get this error: urllib2.URLError: <urlopen error [Errno 110] Connection timed out>, make sure you have the correct IP address for Cloudera Manager and that your Security Group configuration allows access to that IP.
  • Open your browser to http://GATEWAY_NODE_IP:8889. At this point Apache Spot is installed and running. You will not see any data until you start feeding it with sample data.
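If the install script fails with the connection timeout mentioned above, it can help to test reachability on its own before rerunning it. This is a rough sketch using only the Python standard library; `can_reach` is a hypothetical helper, and port 7180 is the Cloudera Manager port used in the browser step earlier. It only checks that a TCP connection opens, not that Cloudera Manager is healthy.

```python
import socket


def can_reach(host, port=7180, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unresolvable hosts.
        return False
```

From the gateway node, `can_reach("[CLOUDERA_MANAGER_PRIVATE_IP]")` returning False points at the IP address or the Security Group rather than at the install script.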

Done! (hopefully under 60 min) You have now installed Apache Spot but you don’t have any data to view.

Sample Data

Sample Netflow, DNS and Proxy data was downloaded as part of the installation (spot-install.sh). We ingest this data using the Spot ingestion process, which was started by the install script. The scripts below feed the sample data to Spot-Ingest and wait 5 minutes before running Spot-ML and Spot-OA, which are both batch processes.

Start Netflow Feed

$ chmod +x start.flow.sh && ./start.flow.sh

Start DNS Feed

$ chmod +x start.dns.sh && ./start.dns.sh

Start Proxy Feed

$ chmod +x start.proxy.sh && ./start.proxy.sh

Spot UI

The sample data downloaded is for past dates. Copy and paste the URLs into your browser, replacing <gateway-ip> with your cluster’s gateway IP. To discover other dates available to view in Spot UI, log into Hue using spot as the username and password, then either query with Impala or browse the spot home directory for the Apache Hive partitions.

For Netflow data go here:


For DNS data go here:

http://<gateway-ip>:8889/files/ui/dns/suspicious.html#date=2016-07-07

For Proxy data go here:

http://<gateway-ip>:8889/files/ui/proxy/suspicious.html#date=2005-04-05
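Once you find other dates in Hue, the same URL pattern applies. As a small illustration, the sketch below builds the link for a given data source and date. The page paths are copied from the DNS and Proxy URLs above; the function name `suspicious_url` is an assumption.

```python
from datetime import date

# Page paths copied from the DNS and Proxy URLs above.
SUSPICIOUS_PAGES = {
    "dns": "/files/ui/dns/suspicious.html",
    "proxy": "/files/ui/proxy/suspicious.html",
}


def suspicious_url(gateway_ip, source, day, port=8889):
    """Build the Spot UI 'suspicious' page URL for one data source and date."""
    return "http://{}:{}{}#date={}".format(
        gateway_ip, port, SUSPICIOUS_PAGES[source], day.isoformat()
    )
```

For example, `suspicious_url("<gateway-ip>", "dns", date(2016, 7, 7))` reproduces the DNS link shown above.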

Apache Spot Basics

Apache Spot Version

The instructions provided here use the SPOT-181_ODM branch of Apache Spot. This branch utilizes the Open Data Model (ODM). See below for more details on ODM.

Spot-Ingest Summary

All Apache Spot ingestion is based on the Python Watchdog library, which listens for changes in the file system on the gateway node. The diagram below is from (http://spot.incubator.apache.org/) and is slightly incorrect. The parsing of the data does not occur at the Master. All parsing is done at the worker.

The diagram below shows a more accurate description of the ingestion. For netflow and dns, the file names are placed into the Kafka Topic by the Master (Master Collector) and the file is placed into HDFS under the directory named “binary” under each data source folder.

The collector and worker processes for netflow and dns are not YARN managed applications. The collectors are Python scripts that use watchdog to listen for changes in the file system, write the file names into Kafka and then upload the binary files into HDFS. The workers read the binary file names from Kafka, download the binary files from HDFS and parse them locally on the gateway node via the command line (nfdump for netflow, tshark for dns). The output of both nfdump and tshark is a csv file, which is uploaded to the “stage” directory under the corresponding data source folder.
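The hand-off between collector and worker can be sketched in a few lines. This is a simplified simulation, not Spot's code: a `queue.Queue` stands in for the Kafka topic, a plain dict stands in for HDFS, and the `parse` callable stands in for the nfdump or tshark command line. The function names and the ".csv" naming of the staged file are assumptions made for illustration.

```python
import queue


def collect(file_name, payload, topic, hdfs):
    """Collector side: publish the file name, then store the binary file."""
    topic.put(file_name)                    # stand-in for the Kafka write
    hdfs["binary/" + file_name] = payload   # stand-in for the HDFS upload


def work(topic, hdfs, parse):
    """Worker side: read one file name, fetch the binary, parse it, stage the csv."""
    file_name = topic.get_nowait()
    payload = hdfs["binary/" + file_name]   # stand-in for the HDFS download
    staged = "stage/" + file_name + ".csv"  # hypothetical naming of the output
    hdfs[staged] = parse(payload)           # nfdump or tshark would run here
    return staged
```

The point of the sketch is that only file names travel through the topic; the binary data itself moves through HDFS and is parsed on the gateway node.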

The proxy worker is the only Spark Streaming / YARN managed process that is part of Spot-Ingest. The collector is a Python script that uses watchdog to listen for changes in the file system and writes the text data into Kafka (proxy logs are in text format, not binary). The worker is a Spark Streaming application that reads proxy data from the Kafka topic and parses it following the bluecoat log format. The bluecoat format is configurable and is typically implemented a little differently at each customer, so the parser will likely require some changes. The bluecoat proxy parser provided with this deployment was changed to parse the proxy sample data provided.
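To see why the parser usually needs adjusting per customer, consider a minimal sketch of parsing one space-delimited proxy log line. This is not the parser shipped with Spot. The field list below is a hypothetical, shortened one; a real bluecoat log declares its own fields, and that list is what changes between deployments.

```python
import shlex

# Hypothetical, shortened field list for illustration only.
FIELDS = ["date", "time", "c-ip", "cs-method", "cs-host", "cs(User-Agent)", "sc-status"]


def parse_proxy_line(line, fields=FIELDS):
    """Split one space-delimited log line (quoted values allowed) into a dict."""
    values = shlex.split(line)
    if len(values) != len(fields):
        # A mismatch usually means the configured field list does not match the log.
        raise ValueError("expected %d fields, got %d" % (len(fields), len(values)))
    return dict(zip(fields, values))
```

If the field list does not match the log layout, the mismatch surfaces immediately as a ValueError rather than as silently shifted columns.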

Batch Pipeline

Below is the second half of the Spot ETL process, which illustrates the data moving to Spot-ML, Spot-OA and Spot-UI, each discussed further below.


Spot-ML is the machine learning portion of Apache Spot. It runs on Apache Spark and utilizes Spark-ML. It uses an LDA (Latent Dirichlet allocation) model, an unsupervised learning algorithm that clusters documents based on topics. This technique is used in data mining and Natural Language Processing (NLP). In Spot, the algorithm identifies outlying records whose scores fall outside the probability threshold for their topic. The threshold value separates suspicious activities from normal network events. LDA specifics are outside the scope of this post, but you can watch this video for more information. The output of Spot-ML is the set of network events that are considered suspicious.
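The thresholding step can be illustrated without any of the LDA machinery. The sketch below is not Spot-ML's implementation; it assumes each event already carries a probability-like score in which lower means less expected, and it keeps the events that fall below the threshold. The record layout and the function name are hypothetical.

```python
def suspicious_events(scored, threshold, limit=None):
    """Keep events whose score is below threshold, least probable first.

    Assumes a lower score means a less expected (more suspicious) event.
    """
    flagged = sorted(
        (event for event in scored if event["score"] < threshold),
        key=lambda event: event["score"],
    )
    return flagged if limit is None else flagged[:limit]
```

Raising the threshold flags more events for review, and lowering it flags fewer.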

The LDA parameters can be changed to adjust the results and can be found on the gateway nodes at these locations:

  • /etc/spot.conf
  • /home/spot/start.flow.sh
  • /home/spot/start.dns.sh
  • /home/spot/start.proxy.sh

Spot-OA and Spot-UI

OA stands for Operational Analytics and UI stands for User Interface which runs on a Jupyter Notebook. Spot-OA is a python based application and is NOT a YARN managed application. It uses the output of Spot-ML to populate the Apache Spot tables and builds the required artifacts for the Jupyter Notebook that feeds Spot-UI. Spot-OA and Spot-UI are closely integrated and need to exist and run on the same host.

Scaling Ingestion

Apache Spot uses the command line tools nfdump and tshark to parse Netflow and DNS data respectively. It does this by reading the file names from the Kafka topic, downloading the corresponding binary files from HDFS and parsing them with nfdump or tshark on a gateway node and NOT in Hadoop. The illustration below shows this part of the process.

To scale this architecture you will need to create multiple instances of each worker or even multiple gateway nodes. Kafka is the service that allows for scaling at this level by distributing across multiple consumers within the same consumer group.
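To make the consumer group idea concrete, here is a simplified simulation of how a topic's partitions could be spread over several workers in one group. It is not Kafka's actual assignor and talks to no broker; it only illustrates that each partition has a single owner within the group. The function and worker names are hypothetical.

```python
def assign_partitions(partitions, workers):
    """Round-robin partitions over workers so each partition has one owner."""
    assignment = {worker: [] for worker in workers}
    for index, partition in enumerate(sorted(partitions)):
        assignment[workers[index % len(workers)]].append(partition)
    return assignment
```

One consequence worth noting: with more workers than partitions, the extra workers receive nothing, so the topic needs at least as many partitions as the number of workers you plan to run.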

Alternatively, you can parse at the collection point and send the csv data to Kafka. Then you can utilize Spark Streaming and write to the Spot tables directly without saving the binary files and staging the csv data.

Feedback To Machine Learning

For every data source in Spot-UI there is a section for scoring. This section allows users to mark individual logged events as high, medium or low risk that were identified as suspicious by the LDA model. Once you save your score, a feedback.csv file is updated on HDFS which is consumed by Spot-ML the next time it runs. See more here.

Open Data Model (ODM) Summary

The open data model (ODM) provides a common taxonomy for describing security telemetry data used to detect threats. (ref: Apache Spot ODM) It supports a broader set of cybersecurity use cases than Apache Spot itself, which only supports Netflow, DNS and Proxy. Other use cases include, but are not limited to:

  • Web server
  • Operating system
  • Firewall
  • Intrusion Prevention/Detection
  • Data Loss Prevention
  • Active Directory / Identity Management
  • User/Entity Behavior Analysis
  • Endpoint/Asset Management
  • Network Meta/Session and PCAP files
  • Kerberos requests
  • etc

For example, other events such as user logins (Active Directory) or entities such as devices (Endpoint/Asset Management) can be saved in ODM. This enables you to make connections and follow the path of a suspected threat from a network event to the infected device, in theory building a topology of the network for a more intuitive approach to threat hunting.
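The kind of connection described above can be sketched with plain dictionaries. This is only an illustration of joining events to entities; the field names (`src_ip`, `device_id`, `user`) and the function name are hypothetical and are not ODM column names.

```python
def trace_event(event, endpoints, logins):
    """Link a network event to the device at its source IP and that device's users.

    endpoints maps an IP address to a device record; logins is a list of
    login records. Both layouts are assumptions for this sketch.
    """
    device = endpoints.get(event["src_ip"])
    users = []
    if device is not None:
        users = [l["user"] for l in logins if l["device_id"] == device["device_id"]]
    return {"event": event, "device": device, "users": users}
```

With telemetry from several sources stored in one model, a lookup like this becomes a join across tables rather than a manual search through separate tools.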

Apache Spot is strongly tied to the nfdump, tshark and bluecoat formats of Netflow, DNS and Proxy data respectively. Alternative network log parsers like Bro can extract more information than what the Apache Spot tables provide. ODM has most (if not all) of the additional fields Bro provides and complex columns for additional fields it did not provide. These additional fields could help improve the machine learning models when searching for suspicious events.

The transformation and field definitions of ODM are outside of the scope of this document. Please read here for more information.

To install the ODM tables into the EDH cluster:

  • On the gateway node.
    • $ cd incubator-spot/spot-setup/odm/
  • Follow instructions here to execute odm_setup.sh

Installing Spot Yourself

When installing Apache Spot yourself, use this instance as a reference to troubleshoot your installation. There are many small issues that could each take a whole day to resolve. Study the installation scripts (especially spot-install.sh and spot.conf.py), but keep in mind that they were written for AWS and for the sample data they download. Try the installation instructions on the Apache Spot site first before using the commands found in the install scripts.


Hopefully this post helps with getting you started with Apache Spot and provides a good introduction and reference to using and installing Apache Spot on Cloudera EDH. For viewing ODM tables, we recommend using Arcadia, a free downloadable tool for big data visual analytics. For a fully streaming Spot-Ingestion, we recommend Streamsets. Cloudera bundles these products in a single cybersecurity solution providing a faster path to getting to production. This blog post will help you get started with it.