Bikas Saha, Author at Cloudera Blog

April 16, 2019 | Technical

Demystifying Spark Jobs to Optimize for Cost and Performance

Apache Spark is one of the most popular engines for distributed data processing on Big Data clusters. Spark jobs come in all shapes, sizes, and cluster form factors, ranging from tens to thousands of nodes and executors, seconds to hours or even days of job duration, megabytes to petabytes of data, and simple data scans […]

by Bikas Saha, Mridul Murlidharan 8 min read

Apache Spark Performance

December 11, 2018 | Technical

Data Science & Engineering Platform: Data Lineage and Provenance for Apache Spark

This blog post was published on Hortonworks.com before the merger with Cloudera. Some links, resources, or references may no longer be accurate. This is the third in a series of data engineering blogs that we plan to publish. The first blog outlined the data science and data engineering capabilities of Hortonworks Data Platform. Motivation Apache […]

by Bikas Saha 3 min read

Apache Spark
