Distributed Pricing Engine using Dockerized Spark on YARN w/ HDP 3.0 [Part 1/4]


This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate.

This is the first post in a 4-part series that looks at an architectural approach to implementing a distributed compute engine for pricing financial derivatives using Hortonworks Data Platform (HDP) 3.0.

In this blog, we will discuss the problem domain and set the context before we zoom in on the functional and technical aspects.

Modern financial trading and risk platforms employ compute engines for pricing and risk analytics across asset classes to drive real-time trading decisions and quantitative risk management. Pricing financial instruments involves a range of algorithms: simple cashflow discounting, closed-form analytical methods built on stochastic models such as Black-Scholes, and computationally intensive numerical methods such as finite differences, Monte Carlo and Quasi-Monte Carlo techniques. The choice depends on the instrument being priced (bonds, stocks, or derivatives such as options and swaps) and on the pricing metrics (NPV, rates etc.) and risk metrics (DV01, PV01, higher-order Greeks such as gamma and vega) being calculated. Quantitative finance libraries, typically written in low-level programming languages such as C and C++, leverage efficient data structures and parallel programming constructs to exploit modern multi-core CPU and GPU architectures, and even specialized hardware in the form of FPGAs and ASICs, for high-performance computation of pricing and risk metrics.
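To make the range of methods concrete, here is a minimal Python sketch (not the engine discussed in this series) that prices the same European call option twice: once with the closed-form Black-Scholes formula and once with a plain Monte Carlo simulation, alongside a simple cashflow discounting helper. The function names, the parameter values and the flat continuously compounded rate are illustrative assumptions.

```python
import math
import random


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def present_value(cashflows, rate):
    """Discount (time_in_years, amount) cashflows at a flat continuous rate (assumption)."""
    return sum(amount * math.exp(-rate * t) for t, amount in cashflows)


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Closed-form Black-Scholes price of a European call (no dividends)."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def monte_carlo_call(spot, strike, rate, vol, maturity, n_paths=100_000, seed=42):
    """Monte Carlo estimate of the same price under geometric Brownian motion."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol ** 2) * maturity
    diffusion = vol * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_paths):
        # Simulate the terminal spot in one step and accumulate the call payoff.
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        payoff_sum += max(terminal - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths


if __name__ == "__main__":
    # Illustrative parameters only.
    params = dict(spot=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
    print("Black-Scholes:", black_scholes_call(**params))
    print("Monte Carlo:  ", monte_carlo_call(**params))
```

The analytical price costs a handful of floating point operations, while the Monte Carlo estimate needs on the order of a hundred thousand simulated paths to agree to within a few cents. That difference in cost per instrument is what makes the numerical methods computationally intensive at portfolio scale.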

Quantitative and regulatory risk management and reporting imperatives, such as valuation adjustment calculations (XVA: CVA, DVA, FVA etc.), BCBS 239 and FRTB, CCAR and DFAST in the US, or MiFID in Europe, necessitate valuing portfolios of millions of trades across tens of thousands of scenario simulations and aggregating the computed metrics across a vast number and combination of dimensions. This is a data-intensive distributed computing problem that can benefit from:

  • Distributed compute and data-parallel frameworks such as Apache Spark and Hadoop, which offer scale-out, shared-nothing and fault-tolerant architectures. These are more portable and have more palatable APIs, and they focus on leveraging data locality on commodity hardware rather than relying on high-speed interconnects between compute and storage on high-end hardware, as HPC frameworks such as MPI and OpenMP do.
  • The elasticity and operational efficiencies of cloud computing, especially burst compute semantics for these use cases, augmented by OS virtualization through containers and lean DevOps practices.
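The shape of this workload can be sketched in a few lines of plain Python: build the trade-by-scenario grid, value independent partitions of it, then merge the partial aggregates by dimension. This is only an illustration of the data-parallel pattern, not Spark code. The trades, scenarios, the toy `value_trade` function and the `(desk, scenario)` aggregation key are all hypothetical.

```python
from collections import defaultdict
from functools import reduce
from itertools import product

# Hypothetical portfolio: (trade_id, desk, notional).
TRADES = [
    ("T1", "rates", 1_000_000.0),
    ("T2", "rates", 2_000_000.0),
    ("T3", "equity", 500_000.0),
]
# Hypothetical scenarios: (scenario_id, relative shock).
SCENARIOS = [("base", 0.0), ("up", 0.01), ("down", -0.01)]


def value_trade(trade, scenario):
    """Toy valuation; a real engine would call a pricing library here."""
    _trade_id, desk, notional = trade
    scenario_id, shock = scenario
    return (desk, scenario_id), notional * (1.0 + shock)


def partition(items, n_parts):
    """Split the task list into n_parts independent slices."""
    return [items[i::n_parts] for i in range(n_parts)]


def value_partition(tasks):
    """Map step: value one partition and pre-aggregate by (desk, scenario)."""
    totals = defaultdict(float)
    for trade, scenario in tasks:
        key, value = value_trade(trade, scenario)
        totals[key] += value
    return dict(totals)


def merge(left, right):
    """Reduce step: combine two partial aggregates."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0.0) + value
    return merged


def run(n_parts=4):
    tasks = list(product(TRADES, SCENARIOS))
    # Each partition shares no state, so it could run on a separate worker.
    partials = [value_partition(part) for part in partition(tasks, n_parts)]
    return reduce(merge, partials, {})


if __name__ == "__main__":
    for key, total in sorted(run().items()):
        print(key, total)
```

Because each partition is valued without shared state and the merge is associative, the partitions could be handed to separate workers. A framework such as Spark applies the same partition, map and aggregate structure across a cluster.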

In part 2 of this 4-part series, we will look at representative pricing semantics and the technical architecture. To capture the essence of this problem space, we will walk through a trivial implementation of the compute engine that combines parallel programming using QuantLib, an open source library for quantitative finance, with the distributed computing framework Apache Spark. Spark runs in an OS-virtualized environment through Docker containers on Apache Hadoop YARN, which serves as the resource scheduler and the distributed data operating system. The whole stack is provisioned, orchestrated and managed in an OpenStack private cloud through Hortonworks Cloudbreak, all through a single platform in the form of HDP 3.0.


Amol Thacker
Solutions Engineer