Edge2AI Autonomous Car: Building an Edge to AI data pipeline (2 of 3)


In our previous blog post, we collected data from sensors mounted on our smart vehicle and described a ROS embedded application that prepares the data for training a machine learning (ML) model. This blog showcases the flow of data streaming from the edge to a data lake in the cloud. The data consists of images collected by our self-driving car and the metadata associated with each image (e.g., IMU information, steering angle, location). We direct the data flow to a CDH (Cloudera's Distribution including Apache Hadoop) cluster, where the data is stored and curated in order to train the model.

Closer Look at Cloudera DataFlow

Cloudera Edge Management

The variety of edge devices generating data in today’s industry continues to grow, and with it the need to author flows across many kinds of devices. It is also necessary to monitor these flows across all devices in an enterprise without writing a custom application for each device. Cloudera Edge Management (CEM) provides an interface to author flows and monitor them with ease. CEM’s main components are Edge Flow Manager (EFM) and Apache NiFi MiNiFi (MiNiFi). MiNiFi, an edge agent, can be deployed onto millions of edge devices to collect data. The EFM UI manages, controls, and monitors MiNiFi agents, and it allows us to granularly deploy a variety of models to thousands of different edge devices.

Edge Flow Deployment

Cloudera Flow Management

Cloudera Flow Management (CFM) is a no-code data ingestion and data flow management tool, powered by Apache NiFi and used for building enterprise data flows. With NiFi’s graphical user interface and over 300 processors, CFM allows you to build highly scalable data flow solutions. NiFi allows developers to stream data from nearly any data source (in our case, a ROS application that gathered data from sensors), enrich and filter that data, and load the processed data into nearly any data store, stream processing system, or distributed storage system.

Building a Simple Cloud Data Pipeline

The data pipeline for this application was built on EC2 instances in the cloud. The MiNiFi C++ agent pushes data to NiFi on Cloudera DataFlow (CDF), and NiFi then writes the data to the Hadoop Distributed File System (HDFS) on CDH.

NiFi Flow

CFM handled ingestion on the cloud side. The NiFi flow was built with two input ports (1): one for ingesting the CSV data, and the other for ingesting image data from the left, center, and right cameras. Each input port feeds its own PutHDFS processor: one loads the CSV file into HDFS (2), and the other loads all the image files into HDFS (3).
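The two-port layout can be pictured as a small routing table. The following is only a minimal Python sketch of that idea, not the actual NiFi configuration: the port names, file extensions, and HDFS directories are hypothetical, since the post does not state them.

```python
from pathlib import PurePosixPath
from typing import Tuple

# Hypothetical port names, extensions, and HDFS directories (assumptions for
# illustration; the post does not give the real values).
ROUTES = {
    "csv_port": {"extensions": {".csv"}, "hdfs_dir": "/data/car/csv"},
    "image_port": {"extensions": {".jpg", ".jpeg", ".png"}, "hdfs_dir": "/data/car/images"},
}


def route(filename: str) -> Tuple[str, str]:
    """Return (input port, HDFS destination path) for a file name."""
    path = PurePosixPath(filename)
    suffix = path.suffix.lower()  # match extensions case-insensitively
    for port, rule in ROUTES.items():
        if suffix in rule["extensions"]:
            return port, str(PurePosixPath(rule["hdfs_dir"]) / path.name)
    raise ValueError(f"no route for {filename!r}")
```

In the real flow this decision is made by how the MiNiFi flow is wired to the NiFi input ports; the sketch only shows why keeping one port per data type makes the destination of each file unambiguous.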

NiFi Input Port definition

EFM’s graphical user interface allowed us to deploy the flow we had created by simply clicking the Publish button.

Once the flow is published onto the MiNiFi agent and the input ports of NiFi have been started, the data begins to flow and can be saved on CDH. We can confirm that the data is there by inspecting the files using HUE.
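The same check can also be scripted rather than done by hand in HUE. The sketch below uses the WebHDFS REST API's LISTSTATUS operation and assumes an unsecured cluster with WebHDFS enabled; the NameNode address and the HDFS directory are placeholders, not values from this pipeline.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def liststatus_url(namenode: str, hdfs_dir: str) -> str:
    """Build a WebHDFS LISTSTATUS URL; `namenode` is 'host:port' (assumed)."""
    return f"http://{namenode}/webhdfs/v1{quote(hdfs_dir)}?op=LISTSTATUS"


def summarize(listing: dict) -> dict:
    """Map file name -> size in bytes for plain files in a LISTSTATUS response."""
    statuses = listing["FileStatuses"]["FileStatus"]
    return {s["pathSuffix"]: s["length"] for s in statuses if s["type"] == "FILE"}


def list_hdfs_dir(namenode: str, hdfs_dir: str) -> dict:
    """Fetch a directory listing over HTTP and summarize it."""
    with urlopen(liststatus_url(namenode, hdfs_dir)) as resp:
        return summarize(json.load(resp))
```

A call such as list_hdfs_dir("namenode.example.com:9870", "/data/car/images") would return the image files and their sizes; a file with a size of zero bytes is a hint that a transfer did not complete. A Kerberos-secured cluster would need authentication that this sketch does not handle.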

HDFS Files in HUE

Once we’ve verified that the data has flowed from the MiNiFi agent to the cloud data lake, the focus can be shifted to transforming this data into actionable intelligence.
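One further sanity check before training is to confirm that every image referenced in the CSV metadata actually arrived. The sketch below assumes a CSV with one column per camera holding the image file name; the column names center, left, and right are hypothetical, as the post does not describe the CSV schema.

```python
import csv
import io
from typing import Iterable, List

# Hypothetical column names for the three cameras (an assumption, not the
# documented schema of the car's CSV file).
IMAGE_COLUMNS = ("center", "left", "right")


def missing_images(csv_text: str, stored_files: Iterable[str]) -> List[str]:
    """Return image names referenced in the CSV but absent from storage."""
    stored = set(stored_files)
    missing = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in IMAGE_COLUMNS:
            name = row[column].strip()
            if name not in stored:
                missing.append(name)
    return missing
```

An empty result means the metadata and the image files are consistent; any names returned point to images that were recorded by the car but never reached the data lake.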


This blog explains what Cloudera DataFlow is, and how its components can be indispensable tools when building a bridge from the edge to AI. In the final blog of this series, we will review the benefits of Cloudera Data Science Workbench (CDSW) and use it to build a model that can be deployed back to our car using Cloudera DataFlow (CDF). Learn more about the Cloudera self-driving car and how to build your own in a simulation by completing the Edge2AI autonomous car tutorial.
