We are witnessing the rise of declarative big data systems. Examples include Hive, Spark, and Flink among open source systems, and BigQuery, Big SQL, and SCOPE among proprietary ones. These systems take declarative user queries as input and use a (typically cost-based) query optimizer to pick physical execution plans for them. While query optimization has been challenging even in traditional databases, big data systems make the problem harder due to: (i) massive volumes of data that are very expensive to analyze and collect statistics on, (ii) the presence of unstructured data that has schema imposed at runtime and hence cannot be analyzed a priori, and (iii) the pervasive use of custom user code (e.g., UDFs) that embeds arbitrary application logic, resulting in unpredictable performance characteristics.

We are also witnessing the growing popularity of cloud infrastructures with managed data services, which put the onus of tuning the services on the cloud provider. Fortunately, the scale of these cloud services provides a unique opportunity to observe and analyze massive query workloads, which can be harnessed to continuously improve the performance of these services. The goal of this project is to leverage such cloud workloads to train machine learning models and provide feedback to the query optimizer for future optimizations.
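To make the feedback loop concrete, the following is a minimal sketch of one possible instantiation, in which observed cardinalities from past query executions are used to correct the optimizer's estimates for recurring subexpressions. All names, signatures, and the median-ratio correction here are illustrative assumptions, not the project's actual design.

```python
from collections import defaultdict

class CardinalityFeedback:
    """Illustrative sketch of workload-driven optimizer feedback:
    record actual vs. estimated row counts from completed queries,
    then correct future estimates for recurring subexpressions."""

    def __init__(self):
        # per-subexpression-signature list of actual/estimated ratios
        self.ratios = defaultdict(list)

    def observe(self, signature, estimated_rows, actual_rows):
        # record runtime feedback from a finished query
        if estimated_rows > 0:
            self.ratios[signature].append(actual_rows / estimated_rows)

    def corrected_estimate(self, signature, estimated_rows):
        # scale the optimizer's estimate by the learned median ratio;
        # fall back to the original estimate for unseen subexpressions
        samples = sorted(self.ratios.get(signature, []))
        if not samples:
            return estimated_rows
        median = samples[len(samples) // 2]
        return estimated_rows * median

fb = CardinalityFeedback()
# Hypothetical recurring subexpression that the optimizer underestimates
fb.observe("scan(T)|filter(col>10)", estimated_rows=100, actual_rows=1000)
fb.observe("scan(T)|filter(col>10)", estimated_rows=200, actual_rows=2000)
print(fb.corrected_estimate("scan(T)|filter(col>10)", 150))  # 1500.0
print(fb.corrected_estimate("unseen(U)", 150))               # 150
```

In practice, a learned model (rather than a simple per-signature statistic) would generalize across subexpressions, which is precisely where the massive observed workloads become valuable training data.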