Data lineage is an important aspect of establishing trust, and not just for compliance purposes.
In this new series of blog posts, we’ll take a look at some of the newest features we’ve shipped over the past few releases of Cloudera Navigator. In this initial post, we’ll focus on useful enhancements for interactively exploring data lineage.
When Cloudera first shipped lineage in Cloudera Navigator more than two years ago, the initial focus was on two important design principles:
- Lineage must be collected automatically for 100% of Apache Hadoop activity. Opt-in for lineage—or any other governance artifact—simply doesn’t make sense as it can leave blind spots during a breach. Lineage, metadata, and audit logs have to be there when you need them, and you shouldn’t have to require users to do anything to collect this information. We worked hard to ensure that Cloudera Navigator 1.0 collected all its governance artifacts automatically.
- Column-level lineage must be collected whenever possible. We’ve seen Hadoop tables that have tens of thousands of columns, sometimes even more. Customers have told us time and again that without column-level lineage, you don’t have a useful lineage solution. So, we made sure that Cloudera Navigator 1.0 captured column-level lineage for all Apache Hive, Apache Pig, and Apache Impala (incubating) transformations, and the Navigator SDK lets customers and partners augment other transformations with column-level lineage.
These capabilities have helped make Cloudera Enterprise the first big data platform to pass an independent PCI audit, and to pass subsequent compliance audits across our customer base.
However, while security teams require all this detail, sometimes other users may just want a high-level glimpse of lineage, to answer questions like:
- Where did this data come from?
- Can I trust this data for the analysis I’m about to do?
- How is this data being used by other users?
Starting in Cloudera Enterprise 5.7, Cloudera Navigator displays lineage differently so that it’s easier for you to answer these questions. Specifically, lineage diagrams are now much more interactive, so you can select exactly the level of detail you’d like to see.
For example, the Lineage Options box lets you filter out classes of entities and links from the lineage diagram. The following are the default selections:
- The Only Upstream/Downstream filter allows you to filter out entities and links that are input (upstream) to and output (downstream) from another entity.
- Use the Latest Partition and Operation filter to reduce rendering time when you have similar partitions created and operations performed periodically. For example, if Hive partitions are created daily, the filter allows you to display only today’s partition.
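To make the idea behind the Latest Partition and Operation filter concrete, here is a minimal Python sketch. It is not Navigator code: the `latest_partitions` function and the (table, date) pairs are hypothetical, and the sketch only illustrates keeping the most recent partition per table so that fewer entities need to be rendered.

```python
from datetime import date


def latest_partitions(partitions):
    """Return {table: most recent partition date} from (table, date) pairs.

    Hypothetical helper for illustration; not part of any Navigator API.
    """
    latest = {}
    for table, day in partitions:
        # Keep the partition only if it is newer than the one seen so far.
        if table not in latest or day > latest[table]:
            latest[table] = day
    return latest


# Daily Hive-style partitions for two made-up tables.
partitions = [
    ("web_logs", date(2016, 4, 1)),
    ("web_logs", date(2016, 4, 2)),
    ("web_logs", date(2016, 4, 3)),
    ("orders", date(2016, 4, 2)),
    ("orders", date(2016, 4, 1)),
]

latest = latest_partitions(partitions)
```

With daily partitions, the diagram would then need to draw one partition per table instead of one per day.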
Let’s take a look at how these lineage options work. We’ll start with the following diagram, which only filters out deleted entities.
Control flow links capture lineage for columns that are part of the WHERE clause in a SQL statement, and data flow links capture lineage for columns that are part of the SELECT clause. Once you filter “Control Flow Relations,” control flow links are hidden.
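The difference between the two link types is easier to see with a small example. The Python sketch below is an illustration only, not how Navigator stores or computes lineage: the `Link` class, the `build_links` and `hide_control_flow` functions, and the table and column names are all hypothetical, and the SELECT and WHERE columns are listed by hand rather than parsed from SQL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str  # upstream column
    target: str  # downstream column or table
    kind: str    # "data" (SELECT clause) or "control" (WHERE clause)


def build_links(select_map, where_columns, target_table):
    """Build lineage links for one statement.

    select_map maps each output column to the input columns it is derived from.
    where_columns lists input columns that only appear in the WHERE clause.
    """
    links = []
    for out_col, in_cols in select_map.items():
        for in_col in in_cols:
            links.append(Link(in_col, f"{target_table}.{out_col}", "data"))
    for in_col in where_columns:
        links.append(Link(in_col, target_table, "control"))
    return links


def hide_control_flow(links):
    """Mimic filtering out control flow relations: keep data flow links only."""
    return [link for link in links if link.kind != "control"]


# Hand-annotated lineage for a made-up statement:
#   INSERT INTO daily_totals
#   SELECT o.customer_id, o.amount FROM orders o WHERE o.status = 'PAID'
links = build_links(
    select_map={
        "customer_id": ["orders.customer_id"],
        "amount": ["orders.amount"],
    },
    where_columns=["orders.status"],
    target_table="daily_totals",
)
visible = hide_control_flow(links)
```

Here `orders.status` influences which rows reach `daily_totals` but never becomes a value in it, so it shows up only as a control flow link and disappears once those links are filtered.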
If you further select “Show Upstream,” you’ll see only upstream entities and links. This is useful for viewing the provenance of a particular data set to determine where it came from and whether you can trust it.
Alternatively, if you further select “Show Downstream,” you’ll see only downstream entities and links. This is useful when you want to perform impact analysis, and understand the data sets that might be impacted by modifications to the selected data set.
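Conceptually, the upstream and downstream views are two traversals of the same lineage graph in opposite directions. The following Python sketch shows that idea under simplifying assumptions; the `reachable` function, the edge list, and the data set names are hypothetical and do not reflect Navigator's internals.

```python
from collections import defaultdict, deque


def reachable(edges, start, direction):
    """Return every entity upstream or downstream of `start`.

    edges is a list of (source, target) pairs, each meaning data flows
    from source to target. direction is "upstream" or "downstream".
    """
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")

    adjacency = defaultdict(set)
    for src, dst in edges:
        if direction == "downstream":
            adjacency[src].add(dst)
        else:
            # Reverse each edge to walk back toward the data's origins.
            adjacency[dst].add(src)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)  # guard against cycles that lead back to start
    return seen


# A made-up lineage graph.
edges = [
    ("raw_events", "clean_events"),
    ("clean_events", "daily_summary"),
    ("customers", "daily_summary"),
    ("daily_summary", "exec_report"),
]

provenance = reachable(edges, "daily_summary", "upstream")
impact = reachable(edges, "daily_summary", "downstream")
```

In this toy graph, `provenance` answers "where did this data come from?" and `impact` answers "what might break if I change it?"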
When you hide “Operations,” the lineage diagram displays the relationships between data sets but hides the specific operations. Clicking on any of the arrows will display the operation details, including the SQL text. As always, sensitive data can optionally be redacted from the query text automatically.
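One way to picture hiding operations is as collapsing each "data set, operation, data set" path into a single arrow that still remembers which operation produced it. The Python sketch below illustrates that under stated assumptions: the `hide_operations` function and all names are hypothetical, and operations are assumed never to link directly to other operations.

```python
def hide_operations(edges, operations):
    """Collapse operation nodes into direct data-set-to-data-set arrows.

    edges is a list of (source, target) pairs.
    operations maps an operation name to its details (here, SQL text).
    Returns {(source_dataset, target_dataset): [operation names]}.
    """
    inputs = {}   # operation -> data sets it reads
    outputs = {}  # operation -> data sets it writes
    collapsed = {}
    for src, dst in edges:
        if dst in operations:
            inputs.setdefault(dst, set()).add(src)
        elif src in operations:
            outputs.setdefault(src, set()).add(dst)
        else:
            # A direct data-set-to-data-set edge is kept as-is.
            collapsed.setdefault((src, dst), [])

    for op in operations:
        for src in inputs.get(op, ()):
            for dst in outputs.get(op, ()):
                # The arrow keeps the operation so its details can be shown on click.
                collapsed.setdefault((src, dst), []).append(op)
    return collapsed


# A made-up operation and its edges.
operations = {
    "query_1": "INSERT INTO daily_totals SELECT ... FROM orders JOIN customers ...",
}
edges = [
    ("orders", "query_1"),
    ("customers", "query_1"),
    ("query_1", "daily_totals"),
]

arrows = hide_operations(edges, operations)
```

Looking up an arrow such as `arrows[("orders", "daily_totals")]` gives the operation names behind it, which is roughly what clicking an arrow in the diagram surfaces.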
After reading this post, you should have a good understanding of how data lineage can be explored in Cloudera Navigator 5.7 and later. In the next post, we’ll look at other new features we’ve shipped in Cloudera Navigator recently, including managed metadata.
Mark Donsky is a Director of Products at Cloudera.