Peregrine

Database administrators (DBAs) have traditionally been responsible for optimizing on-premises database workloads. However, with the rise of cloud data services, where cloud providers offer fully managed data processing capabilities, the role of the DBA is missing entirely. At the same time, workload optimization has become even more important for reducing the total cost of operation and making data processing economically viable in the cloud. This project revisits workload optimization in the context of these emerging cloud-based data services. We observe that the missing DBA in these newer data services has affected both end users and system developers: users find workload optimization a major pain point, while system developers are now tasked with supporting a large base of cloud users.

Peregrine is a workload optimization platform for cloud query engines that we have been developing for the big data analytics infrastructure at Microsoft. Peregrine makes three major contributions: (i) a novel way of representing query workloads that is agnostic to the query engine and is general enough to describe a large variety of workloads, (ii) a categorization of the typical workload patterns, derived from production workloads at Microsoft, and the corresponding workload optimizations possible in each category, and (iii) a prescription for adding workload-awareness to a query engine, via the notion of query annotations that are served to the query engine at compile time.
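To make the idea of compile-time query annotations concrete, here is a minimal sketch, not Peregrine's actual design. It assumes that a plan is a tree of engine-agnostic operators, that each subexpression is identified by a hash signature, and that hints from offline workload analysis are stored by signature and looked up during compilation. The names `PlanNode`, `AnnotationStore`, and `annotate`, and the `{"action": "materialize"}` hint format, are hypothetical and chosen only for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlanNode:
    """Hypothetical engine-agnostic operator: name, parameters, child operators."""
    op: str
    params: tuple = ()
    children: tuple = ()

    def signature(self) -> str:
        """Hash of the operator, its parameters, and its children's signatures."""
        h = hashlib.sha256()
        h.update(self.op.encode())
        h.update(repr(self.params).encode())
        for child in self.children:
            h.update(child.signature().encode())
        return h.hexdigest()


@dataclass
class AnnotationStore:
    """Hypothetical store mapping subexpression signatures to hints."""
    hints: dict = field(default_factory=dict)

    def put(self, node: PlanNode, hint: dict) -> None:
        self.hints[node.signature()] = hint

    def lookup(self, node: PlanNode) -> dict:
        return self.hints.get(node.signature(), {})


def annotate(node: PlanNode, store: AnnotationStore) -> list:
    """Walk the plan top-down and collect (op, hint) pairs for annotated nodes."""
    found = []
    hint = store.lookup(node)
    if hint:
        found.append((node.op, hint))
    for child in node.children:
        found.extend(annotate(child, store))
    return found


# A small plan: Aggregate <- Filter <- Scan (table and predicate are made up).
scan = PlanNode("Scan", params=("clicks",))
filt = PlanNode("Filter", params=("date = '2019-01-01'",), children=(scan,))
agg = PlanNode("Aggregate", params=("count(*)",), children=(filt,))

# Offline analysis (assumed) decided the Filter subexpression is worth reusing.
store = AnnotationStore()
store.put(filt, {"action": "materialize"})

# At compile time the engine would ask which annotations apply to this plan.
result = annotate(agg, store)
print(result)
```

Because the signature depends only on the operator tree, any query that contains the same subexpression picks up the same annotation, whichever engine produced the plan. A production system would need a more careful normalization of plans than `repr` on the parameters.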

Topics

CloudViews for multi-query optimization in cloud query engines.

Learning optimizer for improving the components of a cloud query optimizer.

Resource optimization for reducing the cloud resource consumption.

Publications

Rathijit Sen, Abhishek Roy, Alekh Jindal
Predictive Price-Performance Optimization for Serverless Query Processing
EDBT 2023, Ioannina, Greece.

Alekh Jindal, Jyoti Leeka
Query Optimizer as a Service: An Idea Whose Time Has Come
SIGMOD Record, September 2022

Sunny Gakhar, Joyce Cahoon, Wangchao Le, Xiangnan Li, Kaushik Ravichandran, Hiren Patel, Marc Friedman, Brandon Haynes, Shi Qiao, Alekh Jindal, Jyoti Leeka
Pipemizer: An Optimizer for Analytics Data Pipelines
VLDB 2022, Sydney, Australia. (Demo paper)

Wangda Zhang, Matteo Interlandi, Paul Mineiro, Shi Qiao, Nasim Ghazanfari, Karlen Lie, Marc Friedman, Rafah Hosn, Hiren Patel, Alekh Jindal
Deploying a Steered Query Optimizer in Production at Microsoft
SIGMOD 2022 (Industry), Philadelphia, USA.

Anish Pimpley, Shuo Li, Rathijit Sen, Soundararajan Srinivasan, Alekh Jindal
Towards Optimal Resource Allocation for Serverless Queries
EDBT 2022, Edinburgh, UK.

Rathijit Sen, Abhishek Roy, Alekh Jindal
Predictive Price-Performance Optimization for Serverless Query Processing
arXiv:2112.08572 [cs.DB], Dec 2021

Remmelt Ammerlaan, Gilbert Antonius, Marc Friedman, H M Sajjad Hossain, Alekh Jindal, Peter Orenberg, Hiren Patel, Shi Qiao, Vijay Ramani, Lucas Rosenblatt, Abhishek Roy, Irene Shaffer, Soundararajan Srinivasan, Markus Weimer
PerfGuard: Deploying ML-for-Systems without Performance Regressions, Almost!
VLDB 2022, Sydney, Australia.

Anish Pimpley, Shuo Li, Anubha Srivastava, Vishal Rohra, Yi Zhu, Soundararajan Srinivasan, Alekh Jindal, Hiren Patel, Shi Qiao, Rathijit Sen
Optimal Resource Allocation for Serverless Queries
arXiv:2107.08594 [cs.DB], July 2021

Yiwen Zhu, Matteo Interlandi, Abhishek Roy, Krishnadhan Das, Hiren Patel, Malay Bag, Hitesh Sharma, Alekh Jindal
Phoebe: A Learning-based Checkpoint Optimizer
VLDB 2021

Alekh Jindal, Matteo Interlandi
Machine Learning for Cloud Data Systems: the Promise, the Progress, and the Path Forward
VLDB 2021 (Tutorial)

Rathijit Sen, Abhishek Roy, Alekh Jindal, Rui Fang, Jeff Zheng, Xiaolei Liu, Ruiping Li
AutoExecutor: Predictive Parallelism for Spark SQL Queries
VLDB 2021 (Demo)

Abhishek Roy, Alekh Jindal, Priyanka Gomatam, Xiating Ouyang, Ashit Gosalia, Nishkam Ravi, Swinky Mann, Prakhar Jain
SparkCruise: Workload Optimization in Managed Spark Clusters at Microsoft
VLDB 2021 (Industry)

Parimarjan Negi, Matteo Interlandi, Ryan Marcus, Mohammad Alizadeh, Tim Kraska, Marc Friedman, Alekh Jindal
Steering Query Optimizers: A Practical Take on Big Data Workloads
SIGMOD 2021 (Industry)
Industry Honorable Mention (SIGMOD Announcement)

Alekh Jindal, Shi Qiao, Rathijit Sen, Hiren Patel
Microlearner: A fine-grained Learning Optimizer for Big Data Workloads at Microsoft
ICDE 2021 (Industry)

Alekh Jindal, Shi Qiao, Hiren Patel, Abhishek Roy, Jyoti Leeka, Brandon Haynes
Production Experiences from Computation Reuse at Microsoft
EDBT 2021 (Industry)

Alekh Jindal
Applied Research Lessons from CloudViews Project
SIGMOD Record, September 2020

Rathijit Sen, Alekh Jindal, Hiren Patel, Shi Qiao
AutoToken: Predicting Peak Parallelism for Big Data Analytics at Microsoft
VLDB 2020, Tokyo, Japan.

Malay Bag, Alekh Jindal, Hiren Patel
Towards Plan-aware Resource Allocation in Serverless Query Processing
HotCloud 2020, Boston, USA.

Tarique Siddiqui, Alekh Jindal, Shi Qiao, Hiren Patel, Wangchao Le
Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings
SIGMOD 2020, Portland, USA.

H M Sajjad Hossain, Lucas Rosenblatt, Gilbert Antonius, Irene Shaffer, Remmelt Ammerlaan, Abhishek Roy, Markus Weimer, Hiren Patel, Marc Friedman, Shi Qiao, Peter Orenberg, Soundararajan Srinivasan, Vijay Ramani, Alekh Jindal
PerfGuard: Deploying ML-for-Systems without Performance Regressions
MLOps Systems 2020, Austin, USA.

Alekh Jindal, Hiren Patel, Abhishek Roy, Shi Qiao, Jarod Yin, Rathijit Sen, Subru Krishnan
Peregrine: Workload Optimization for Cloud Query Engines
SoCC 2019, Santa Cruz, USA.

Hiren Patel, Alekh Jindal, Clemens Szyperski
Big Data Processing at Microsoft: Hyper Scale, Massive Complexity, and Minimal Cost
SoCC 2019, Santa Cruz, USA. (Poster)

Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, Sriram Rao
Towards a Learning Optimizer for Shared Clouds
VLDB 2019/PVLDB, Los Angeles, USA.

Abhishek Roy, Alekh Jindal, Hiren Patel, Ashit Gosalia, Subru Krishnan, Carlo Curino
SparkCruise: Handsfree Computation Reuse in Spark
VLDB 2019/PVLDB, Los Angeles, USA. (Demo paper)

Alekh Jindal, Lalitha Viswanathan, Konstantinos Karanasos
Query and Resource Optimizations: A Case for Breaking the Wall in Big Data Systems
arXiv:1906.06590 [cs.DB], June 2019

Alekh Jindal, Konstantinos Karanasos, Sriram Rao, Hiren Patel
Selecting Subexpressions to Materialize at Datacenter Scale
VLDB 2018/PVLDB, Rio de Janeiro, Brazil.

Alekh Jindal, Shi Qiao, Hiren Patel, Jarod Yin, Jieming Di, Malay Bag, Marc Friedman, Yifung Lin, Konstantinos Karanasos, Sriram Rao
Computation Reuse in Analytics Job Service at Microsoft
SIGMOD 2018, Houston, USA.

Lalitha Viswanathan, Alekh Jindal, Konstantinos Karanasos
Query and Resource Optimization: Bridging the Gap
ICDE 2018, Paris, France (Short paper).

Talks

Practical Aspects of Systems that Learn: Ambition vs Reality
Presenter: Alekh Jindal
LADSIOS @ VLDB 2021.

The Cosmos Big Data Platform at Microsoft: Over a Decade of Progress and a Decade to Look Forward
Presenters: Hiren Patel, Alekh Jindal
VLDB 2021.

Microlearner: A fine-grained Learning Optimizer for Big Data Workloads at Microsoft
Presenter: Alekh Jindal
ICDE 2021.

Production Experiences from Computation Reuse at Microsoft
Presenter: Alekh Jindal
EDBT 2021, Virtual. (Talk, Teaser, Poster)

Optimizing Cloud Query Engines at Microsoft
Presenter: Alekh Jindal
TU Berlin, Germany, 2020.

Peregrine: Workload Optimization for Cloud Query Engines
Presenter: Alekh Jindal
North West Database Society 2020, Seattle, USA.

Peregrine: Workload Optimization for Cloud Query Engines
Presenter: Alekh Jindal
SoCC 2019, Santa Cruz, USA.

Building a Learning Optimizer at Microsoft
Presenter: Alekh Jindal
MIT, Cambridge, USA, 2019.

Building a Learning Optimizer at Microsoft
Presenter: Alekh Jindal
University of Wisconsin, Madison, USA, 2019.

Towards a Learning Optimizer for Shared Clouds
Presenter: Alekh Jindal
North West Database Society 2019, Redmond, USA.
YouTube Video

Computation Reuse in Analytics Job Service at Microsoft
Presenter: Alekh Jindal
SIGMOD 2018, Houston, USA.

CloudViews Poster
Presenters: CloudViews Team
SIGMOD 2018, Houston, USA.

Patents

HS Patel, S Qiao, A Jindal, MK Bag, R Sen, CA Curino
Resource optimization for serverless query processing (US Patent 11,455,192)
US Patent App. 17/894,628

R Sen, A Jindal, AY Pimpley, S Li, A Srivastava, VL Rohra, Y Zhu, HS Patel, S Qiao, MT Friedman, CA Szyperski
Optimizing job runtimes via prediction-based token allocation (US20220100763A1)
US Patent App. 17/060,053

Y Zhu, A Jindal, MK Bag, HS Patel
Data-driven checkpoint selector
US Patent 11,416,487

IR Shaffer, RHL Ammerlaan, G Antonius, MT Friedman, A Roy, L Rosenblatt, VK Ramani, S Qiao, A Jindal, P Orenberg, HM Sajjad Hossain, S Srinivasan, HS Patel, M Weimer
System and method for machine learning for system deployments without performance regressions
US Patent App. 16/840,205

HS Patel, R Sen, Z Yin, S Qiao, A Roy, A Jindal, SV Krishnan, CA Curino
Cloud based query workload optimization (US20210089532A1)
US Patent App. 16/581,905

TA Siddiqui, A Jindal, S Qiao, HS Patel
Learned resource consumption model for optimizing big data queries
US Patent App. 16/511,966

A Jindal, H Patel, S Amizadeh, C Wu
Learning Optimizer for Shared Cloud
US Patent 11,074,256

A Jindal, K Karanasos, HS Patel, S Rao
Selection of Subexpressions to Materialize for Datacenter Scale
US Patent 10,726,014

A Jindal, H Patel, S Qiao, J Di, MK Bag, Z Yin
Computation Reuse in Analytics Job Service
US Patent 11,068,482