This post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be valid.
Over the last couple of years, big data fabric technology has emerged as a strategic way for companies to get the most value from their data investments. As data lakes proliferate, they also become more difficult to manage. A big data fabric might be the answer for many enterprises struggling to manage their vast stores of big data.
So what, exactly, is a data fabric? And how does it relate to the traditional data lake? Here’s a closer look at the function of a data fabric and what its advantages are, so you can decide if the technology could help your business.
Moving to Multiple Data Lakes
As big data environments mature, organizations increasingly run multiple data lakes instead of a single one. These additional data lakes are built for a number of reasons: they may serve as backup or disaster-recovery copies of an existing production data lake, or they may replicate the contents of one data lake to another geographic location.
Regardless of why data lake proliferation happens, it presents a challenge to any organization: How do you ensure consistent security, governance, and data management across all those data lakes?
You likely spent a lot of time building policies for governing and securing your first data lake. When you built another lake, you most likely wanted to ensure you had a way to consistently apply those original policies to that one, too. But the more your big data environment grows, the more difficult it becomes to govern, secure, and manage all those data lakes. It may even become necessary to build out brand-new policies that take the new size of your environment into account.
Another complicating factor is the emergence of the cloud. Many use cases relating to data science, artificial intelligence, machine learning, deep learning, and the like are well-suited to operating in the cloud, and many companies are moving their data out of on-premises data centers to save on operating costs and improve availability. While the cloud offers clear benefits, it further expands the big data environment and may introduce new management complications.
All of these scenarios require a management and abstraction layer that spans all of your data sources or lakes, whether they are in the cloud or on premises. The abstraction layer’s role is to ensure consistent security and data management across data lakes. A big data fabric can serve as that abstraction layer.
Uniting Them All With a Big Data Fabric
A data fabric weaves together and surrounds all of your data sources. It’s aware of all that exists now and automatically registers new data sources as they are added.
There are several characteristics that a good, enterprise-class data fabric should have:
- The fabric should be aware of all of your data sources and know where all of your data clusters are. It should give a system administrator an easy visualization of where your data resides, what kinds of services are running in those places, and what the statuses of those services are. This is a basic requirement.
- The data fabric should lend itself to building a number of applications on top of it. For example, if you want to move data from one cluster to another, there should be an application in the fabric that lets you do that. If you need to know where sensitive data is located in various clusters, there should be an application available that tells you that (for example, that cluster A, column B, contains Social Security numbers, telephone numbers, or similar personal information). With that knowledge, you can tag the data as personally identifiable or sensitive, and therefore not available for business analysis.
- The data fabric should also help you define a security policy. If you have identified sensitive data in your various clusters, you should be able to restrict access to that data. For instance, you should be able to designate that only the human resources organization has access, or perhaps that the data can only be accessed from a certain geography if, for example, it pertains to European citizens. Or you might want to be able to designate that a security event be logged if data is accessed outside of normal business hours.
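To make the tagging and policy ideas in this list more concrete, here is a minimal Python sketch. It is illustrative only and does not reflect any particular product's API: the names `tag_column`, `AccessRequest`, `Decision`, and `evaluate`, the `PII` and `EU_CITIZEN` tags, the `hr` group, and the 9:00 to 17:00 business hours are all assumptions made up for this example. The detection step is deliberately naive (a single pattern for US Social Security numbers); a real fabric would likely use much richer classification.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, time

# Hypothetical pattern: matches values shaped like a US Social Security number.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

# Assumed business hours for the audit rule below.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def tag_column(values):
    """Return {"PII"} if every non-empty sample value looks like an SSN."""
    samples = [v for v in values if v]
    if samples and all(SSN_PATTERN.match(v) for v in samples):
        return frozenset({"PII"})
    return frozenset()


@dataclass(frozen=True)
class AccessRequest:
    user_group: str         # e.g. "hr"
    user_region: str        # e.g. "EU"
    column_tags: frozenset  # tags attached to the requested column
    when: datetime          # time of the access attempt


@dataclass
class Decision:
    allowed: bool
    events: list = field(default_factory=list)


def evaluate(request):
    """Apply the three example rules from the list above to one request."""
    allowed = True
    events = []
    # Rule 1: only the human resources group may read PII-tagged data.
    if "PII" in request.column_tags and request.user_group != "hr":
        allowed = False
        events.append("denied: PII restricted to hr")
    # Rule 2: data about European citizens is readable only from the EU.
    if "EU_CITIZEN" in request.column_tags and request.user_region != "EU":
        allowed = False
        events.append("denied: EU data accessed from outside EU")
    # Rule 3: log a security event for tagged data touched after hours.
    in_hours = BUSINESS_START <= request.when.time() < BUSINESS_END
    if request.column_tags and not in_hours:
        events.append("audit: access outside business hours")
    return Decision(allowed, events)
```

Because the rules key off tags and not off a specific cluster, the same `evaluate` function could in principle be applied to requests against any data lake the fabric knows about, which is the consistency the fabric is meant to provide.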
Keeping Your Big Data Environment Organized
Ultimately, a data fabric helps your organization move to a multi-lake environment in an organized and secure way. Data fabrics help you achieve consistency, security, and high availability while providing a seamless management layer that is aware of all your data all the time.
For more on data fabrics, read The Forrester Wave: Big Data Fabric report.