Data preparation, and database design more generally, is a crucial step that precedes data analysis. Increasingly, it is also a precursor to many big data applications. Such preparation may involve logical transformations, e.g., data cleaning, data integration, and sampling, as well as physical transformations, e.g., partitioning, indexing, and compression, at different data granularities. Unfortunately, current large-scale data management platforms provide little or no support for such data preparation.
The goal of Cartilage is to give developers full control over data preparation in a distributed file system, such as HDFS. Cartilage provides a dataset abstraction between the logical user dataset and the actual physical HDFS files. We introduce data plans, analogous to query plans, which let users declaratively specify their transformations from the logical dataset down to the physical HDFS files. The Cartilage execution engine maintains the lineage of data transformations, and developers can exploit this lineage to make fine-grained transformation decisions. Cartilage sits on top of HDFS and can be used with any data processing system that uses HDFS as its underlying storage. We used Cartilage to build a scalable data cleaning system for violation detection and repair. This system compiles user-provided data quality rules into a series of transformations and runs them in a distributed fashion, outperforming baseline systems by more than two orders of magnitude in some cases, across a variety of data cleaning tasks.
Talks and Posters