Next Generation Data Warehousing at Santander UK

by Toby Ferguson

Posted in Business | September 27, 2018 | 4 min read

Timely data is crucial to businesses in the Big Data age. This blog post outlines how Santander UK utilises the latest Cloudera technologies and superior software development capability to create the next generation of data warehousing and streaming analytics, supporting intelligence that can improve relationships with customers and follow the mantra of ‘we want to help people grow and prosper.’

Santander UK’s big data journey started around four years ago. They were early adopters of new data streaming technology like Apache Kafka and had ambitions to revolutionise the customer experience with the use of real-time data and in-app analytics for mobile users.

Since then, Santander UK have rapidly expanded both their footprint and their capability to innovate with big data technology. The need for large-scale streaming analytics has grown, and the capability to deliver it has become a reality. Today, at Santander UK, Cloudera’s Big Data, Machine Learning, and Analytics platform is complemented by integrated, high-quality, and scalable Platform-as-a-Service (PaaS) event delivery through Apache Kafka.

Another technology component that is central to Santander UK’s next generation Data Warehouse is Apache Kudu, which enables fast analytics on fast data. When combined with aspects of the Data Vault 2.0 design methodology, it facilitates rapid ingest from hundreds of Apache Kafka data streams, both offloading workload from existing legacy systems and providing the ability to ask ‘right here, right now’ questions regarding customer behaviour and the current state of the Bank.

Speed to Market

Fast data streams can be moved online with minimal effort due to an innovative new platform at Santander UK, which integrates legacy systems with a new Data Vault via Apache Kafka. Due to the clean structure of the data being integrated, a new event stream feed to populate the Apache Kudu Data Vault is largely configuration driven: data events are conformed to the Hub, Satellite, and Link structure of the Data Vault 2.0 methodology. This allows the schema to react to changes in the business or to new understanding of how the data should be conformed.
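As a rough illustration of what a configuration-driven feed could look like, the sketch below conforms one event to Hub, Link, and Satellite rows. It is not Santander UK's implementation: the `FEED_CONFIG` layout, the table and field names, and the choice of MD5 for hash keys are all assumptions made for the example.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical feed configuration; every table and field name is illustrative.
FEED_CONFIG = {
    "record_source": "legacy.payments",
    "hubs": {
        "hub_customer": ["customer_id"],
        "hub_account": ["account_id"],
    },
    "links": {
        "link_customer_account": ["hub_customer", "hub_account"],
    },
    "satellites": {
        "sat_account_balance": {
            "hub": "hub_account",
            "attributes": ["balance", "currency"],
        },
    },
}


def hash_key(*parts):
    """Deterministic hash over normalised key parts (MD5 is an assumption)."""
    joined = "|".join(str(p).strip().upper() for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def conform(event, config, load_ts=None):
    """Map one event dict to Hub, Link, and Satellite rows using only config."""
    load_ts = load_ts or datetime.now(timezone.utc).isoformat()
    meta = {"load_ts": load_ts, "record_source": config["record_source"]}
    rows, hub_keys = {}, {}

    # Hubs: one row per business key, identified by its hash.
    for hub, key_fields in config["hubs"].items():
        hub_keys[hub] = hash_key(*(event[f] for f in key_fields))
        rows[hub] = {
            "hash_key": hub_keys[hub],
            **{f: event[f] for f in key_fields},
            **meta,
        }

    # Links: relate hubs through their hash keys.
    for link, hubs in config["links"].items():
        rows[link] = {
            "hash_key": hash_key(*(hub_keys[h] for h in hubs)),
            **{f"{h}_hash_key": hub_keys[h] for h in hubs},
            **meta,
        }

    # Satellites: descriptive attributes plus a hash for change detection.
    for sat, spec in config["satellites"].items():
        attrs = {a: event.get(a) for a in spec["attributes"]}
        rows[sat] = {
            "hub_hash_key": hub_keys[spec["hub"]],
            "hash_diff": hash_key(*attrs.values()),
            **attrs,
            **meta,
        }
    return rows
```

Under these assumptions, onboarding a new stream means adding entries to the configuration rather than writing new mapping code, which is the property the paragraph above describes.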

Santander UK can effect data transformations by scaling the elastic event delivery platform, which is built with Scala, Akka, and Apache Kafka, allowing rapid and scalable data enrichment in real time. Because the platform and architecture are reusable, this delivers more timely data, faster decisions, and a shorter time to market for new use cases.

Data Science and Rapid Prototyping of Data Products

Ultimately, there are many potential consumers of this streaming data source; however, interesting insight has already been gleaned through the integration of Cloudera Data Science Workbench with the Data Vault. This integration provides a comprehensive Data Science experience for the growing Data Science team and, in typically innovative Santander UK fashion, the potential to prototype ideas rapidly and create new data products before addressing heavy engineering and architectural challenges. Build a fast prototype, and then, if it delivers value, develop it into a first-class product.

Fast Integration: The Contribution Model

In keeping with the innovation and agility that the Santander UK Data Innovation team have made a reality, they created the notion of the Contribution Model. The cluster is multi-tenant, with differing business units sourcing, cleansing, and engineering new datasets. If a business unit's dataset is deemed useful to the rest of the business, Data Vault style link tables can be used to integrate it into the core of the Data Vault schema, where it is shared according to governance principles. In this manner, the team can increase the value of data products through the rapid generation of new combinations of datasets, with traceable lineage provided by Cloudera Navigator for governance, and security provided by Apache Sentry for access control.

The Contribution Model allows us to leverage pure datasets that are created independently by different business units and product teams. If this data is valuable to the rest of the business, we have the capability to bring it into the Data Vault as a first-class citizen through the utilisation of link tables. We wanted to replicate the Apache community approach to open source software for data systems in our organization to improve innovation through collaboration.

– Nicolette Bullivant – Head of Data Engineering, Santander UK
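To make the link-table idea concrete, here is a minimal sketch of how a contributed dataset might be tied to a core hub. The function name `build_link_rows`, the column names, the survey example, and the use of SHA-1 are hypothetical; the post does not describe the actual schema.

```python
import hashlib


def hash_key(*parts):
    """Deterministic hash over normalised key parts (SHA-1 is an assumption)."""
    joined = "|".join(str(p).strip().upper() for p in parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def build_link_rows(core_hub_keys, contributed_rows, core_field, local_field, record_source):
    """Build link rows joining a business unit's dataset to a core hub.

    core_hub_keys maps a core business key to its hub hash key.
    Rows that reference an unknown core key are returned separately
    instead of being linked.
    """
    links, unmatched = [], []
    for row in contributed_rows:
        core_hash = core_hub_keys.get(row[core_field])
        if core_hash is None:
            unmatched.append(row)
            continue
        local_hash = hash_key(row[local_field])
        links.append({
            "link_hash_key": hash_key(core_hash, local_hash),
            "core_hash_key": core_hash,
            "contributed_hash_key": local_hash,
            "record_source": record_source,
        })
    return links, unmatched


# Usage: a hypothetical survey dataset linked to the core customer hub.
core_customers = {"c-1": hash_key("c-1")}
surveys = [
    {"customer_id": "c-1", "survey_id": "s-7"},
    {"customer_id": "c-404", "survey_id": "s-8"},
]
links, unmatched = build_link_rows(
    core_customers, surveys, "customer_id", "survey_id", "marketing.surveys"
)
```

In this sketch the contributed dataset itself is left untouched; only the new link rows connect it to the core, so a contribution can be added or withdrawn without altering existing hub tables.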

Multi-Destination: One Stream to Rule Them All

The raw event streams that are generated from legacy systems are considered canonical, and are generally required by other stakeholders that use the cluster. The Santander UK Data Innovation Team have adopted the principle of ensuring that these event streams are available to differing use cases and technologies. A canonical event stream can therefore be redistributed to differing destinations: HDFS, Apache HBase, or Apache Kudu. This helps establish a single version of the truth for all stakeholders whilst avoiding back pressure on legacy systems.
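The principle of reading the source once and delivering to many destinations can be sketched as follows. The `ListSink` class is an in-memory stand-in for real HDFS, HBase, or Kudu writers, and `fan_out` is a hypothetical helper; neither reflects the actual platform, which the post says is built on Akka and Apache Kafka.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]
Sink = Callable[[Event], None]


class ListSink:
    """In-memory stand-in for a real destination writer (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.received: List[Event] = []

    def __call__(self, event: Event) -> None:
        self.received.append(event)


def fan_out(stream: Iterable[Event], sinks: List[Sink]) -> int:
    """Consume each canonical event once and hand a copy to every sink."""
    count = 0
    for event in stream:
        for sink in sinks:
            # Copy so that one destination cannot mutate another's view.
            sink(dict(event))
        count += 1
    return count


# Usage: one canonical stream, three destinations.
hdfs, hbase, kudu = ListSink("hdfs"), ListSink("hbase"), ListSink("kudu")
canonical = [{"id": 1, "amount": 20}, {"id": 2, "amount": 35}]
delivered = fan_out(canonical, [hdfs, hbase, kudu])
```

The source is iterated exactly once regardless of how many sinks are attached, which is the property that keeps additional consumers from adding load to the legacy system.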

Conclusion

In short, Santander UK is innovating directly on the Cloudera stack, coupling streaming data, advanced software engineering principles and frameworks, and modern data warehouse design principles to generate real-time insight that improves customer experience and customer financial wellbeing. This innovation was recently recognized when a third-party panel of judges voted Santander a Data Impact Award finalist.

Nicolette Bullivant is Head of Data Engineering at Santander UK.
Rob Siwicki is a Senior Solutions Architect for Cloudera’s Professional Services, EMEA.
