This post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be valid.
According to Gartner, “data management architectures and technologies are rapidly shifting to become highly distributed.” This is because of the uptick in the types of data—and the amount of data—that organizations have to manage. Fortunately, big data technologies are now built to support modern big data strategies. Here’s how data has evolved over the years, and how today’s data architecture and technology offerings are keeping up with it.
Data Has Evolved
Data has evolved over the last 20 years or so. A major turning point was the advent of big data, technologies like Hadoop, and the various projects that operate on top of it. Using Hadoop and other open-source solutions, organizations could store a virtually unlimited amount of data. This included not only traditionally structured data in rows and columns, but also unstructured and semi-structured data. Log files, images, sensor data, and other sorts of real-time, unstructured information could all be stored this way.
This explosion in data made data lakes a primary component of data architecture. Yet it also created the challenge of scaling those data lakes as storage costs kept rising. The prevailing wisdom was that businesses would develop massive data lakes where virtually all data would live. But then the cloud revolution began, and prices for all cloud services, including storage, began to fall. Customers ended up with multiple data centers, and data ended up living in different clusters. Spreading data across multiple locations has cost and performance implications, and the best practices differ by use case.
Data Lives in Multiple Locations—and Moves Often, Too
Ultimately, today’s reality is that data is distributed: it lives in multiple locations, and companies are constantly moving it to and from data centers and the cloud.
Each data cluster a company uses is likely to have different types of data in it. Pricing information might live in a data center in Bangkok, and customer information might live in a data center in North Carolina. Data protection and disaster recovery might be in different locations, too. There is also new data from sensors or IoT devices that is easier to capture in a cloud service like Amazon Web Services or Microsoft Azure. That type of data might eventually make its way on premises, but for near-real-time and real-time analysis, it’s much easier to capture that information and analyze it in the cloud.
Most data architectures include storage that is distributed across multiple locations and services and used for multiple purposes. So, how can you wrap your arms around all of your data? Key questions include:
- Are you aware of all of your data sources?
- If you bring a new data source online, how do your users see that data?
- How do you go about building security and governance policies for that data?
- When that data moves around, do you have security and governance policies that are applied to the data when it arrives at a new location?
The Need for a Dataplane
Ultimately, you need a data fabric or a dataplane: an abstraction layer that fits across all these data centers and sources in your data architecture. This layer is constantly aware of your data’s location—knowing you have data in Azure, in Amazon Web Services, in five private data centers around the world, and so on—and the types of data that you have. This fabric will also provide value-added services in terms of data protection, data relocation, and security-policy consistency across locations.
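To make the idea of an abstraction layer a little more concrete, here is a minimal Python sketch of a catalog that knows where each dataset lives and keeps its security policies attached when the data moves. Everything in it is hypothetical and invented for illustration (the `Dataset` and `Catalog` classes, the location labels, and the policy names); it is not the API of any particular dataplane product.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    kind: str        # e.g. "structured", "sensor", "log" (illustrative labels)
    location: str    # e.g. "aws", "azure", "dc-bangkok" (illustrative labels)
    policies: set[str] = field(default_factory=set)


class Catalog:
    """Hypothetical single view over datasets that live in many locations."""

    def __init__(self) -> None:
        self._datasets: dict[str, Dataset] = {}

    def register(self, dataset: Dataset) -> None:
        self._datasets[dataset.name] = dataset

    def locations(self) -> set[str]:
        # Every location that currently holds at least one dataset.
        return {d.location for d in self._datasets.values()}

    def by_location(self, location: str) -> list[str]:
        # Names of the datasets stored in one location, sorted for stable output.
        return sorted(
            d.name for d in self._datasets.values() if d.location == location
        )

    def move(self, name: str, new_location: str) -> Dataset:
        # Only the location changes; the policies stay attached to the dataset.
        dataset = self._datasets[name]
        dataset.location = new_location
        return dataset


catalog = Catalog()
catalog.register(Dataset("pricing", "structured", "dc-bangkok", {"mask-cost-columns"}))
catalog.register(Dataset("turbine-sensors", "sensor", "aws", {"encrypt-at-rest"}))

moved = catalog.move("turbine-sensors", "dc-north-carolina")
print(moved.location, sorted(moved.policies))
```

The design choice worth noticing is that policies belong to the dataset record rather than to a location, so relocating data cannot silently drop them. A real data fabric would also have to enforce those policies at the destination, which this sketch does not attempt.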
Data has to be available to business analysts, data scientists, and other internal audiences in your organization, and these stakeholders need to have a trusted source of data. Data stewards must be able to determine which data sources are trusted, as well as important information about each source (who created it, who modified it, where it’s been, when it might expire, and more). These data stewards should be able to make these determinations across all data repositories, and investing in a dataplane will make that job significantly easier.
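As a rough illustration of the stewardship metadata described above, the Python sketch below records who created and last modified a source, where it has been, and when it expires, then lists the sources that are both trusted and unexpired. The `SourceRecord` fields, the `usable_sources` helper, and all sample values are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SourceRecord:
    name: str
    created_by: str
    last_modified_by: str
    lineage: list[str]           # locations the data has passed through
    expires_on: Optional[date]   # None means no known expiry
    trusted: bool = False        # set by a data steward


def usable_sources(records: list[SourceRecord], today: date) -> list[str]:
    """Return the names of sources that are trusted and have not expired."""
    return [
        r.name
        for r in records
        if r.trusted and (r.expires_on is None or r.expires_on >= today)
    ]


records = [
    SourceRecord("customers", "alice", "bob", ["dc-north-carolina"], None, trusted=True),
    SourceRecord("old-pricing", "carol", "carol", ["dc-bangkok", "azure"],
                 date(2020, 1, 1), trusted=True),
    SourceRecord("scratch-logs", "dave", "dave", ["aws"], None),
]

print(usable_sources(records, today=date(2024, 6, 1)))
```

Keeping this metadata in one place, rather than per cluster, is what would let a steward answer the trust question across every repository at once.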
Learn more about the modern dataplane by reading this article.