MIT Database Group


Cartilage


Data preparation, and database design more generally, is a significant step before performing data analysis. Increasingly, it is also a precursor to many big data applications. Such preparation can involve logical transformations, e.g., data cleaning, data integration, and sampling, as well as physical transformations, e.g., partitioning, indexing, and compression, at different data granularities. Unfortunately, current large-scale data management platforms provide little or no support for such data preparation.




The goal of Cartilage is to give developers full control over data preparation in a distributed file system such as HDFS. Cartilage provides a dataset abstraction between the logical user dataset and the actual physical HDFS files. We introduced data plans, analogous to query plans, that let users declaratively specify their transformations from the logical dataset to the physical HDFS files. The Cartilage execution engine maintains the lineage of data transformations, and developers can exploit this lineage to make fine-grained transformation decisions. Cartilage sits on top of HDFS and can be used with several data processing systems that use HDFS as their underlying storage. We used Cartilage to build a scalable data cleaning system for violation detection and repair. This system compiles user-provided data quality rules into a series of transformations and runs them in a distributed fashion, outperforming baseline systems across a variety of data cleaning tasks, in some cases by more than two orders of magnitude. A minimal sketch of the data plan idea appears below.
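
To make the data plan idea concrete, the following minimal Python sketch models a plan as an ordered list of transformations whose lineage is recorded as it executes. The class and method names (DataPlan, add, execute) are illustrative assumptions, not the actual Cartilage API, and the "physical" layout is simulated in memory rather than written to HDFS.

```python
# Minimal sketch of a Cartilage-style data plan (illustrative only; the names
# DataPlan, add, and execute are assumptions, not the actual Cartilage API).

class DataPlan:
    """An ordered list of transformations from a logical dataset to a physical layout."""

    def __init__(self):
        self.steps = []      # (description, callable) pairs
        self.lineage = []    # record of transformations actually applied

    def add(self, description, transform):
        self.steps.append((description, transform))
        return self          # allow chaining: plan.add(...).add(...)

    def execute(self, dataset):
        for description, transform in self.steps:
            dataset = transform(dataset)        # each step maps dataset -> dataset
            self.lineage.append(description)    # keep lineage for later fine-grained decisions
        return dataset


if __name__ == "__main__":
    # Logical dataset: (key, value) rows; partitioning is simulated in memory.
    rows = [("k1", "v1"), (None, "v2"), ("k3", "v3")]

    plan = (
        DataPlan()
        .add("clean: drop rows with a null key",
             lambda d: [r for r in d if r[0] is not None])
        .add("partition: split rows by hash(key) % 2",
             lambda d: {p: [r for r in d if hash(r[0]) % 2 == p] for p in (0, 1)})
    )

    partitions = plan.execute(rows)
    print(plan.lineage)                                    # recorded transformation lineage
    print({p: len(rs) for p, rs in partitions.items()})    # rows per simulated partition
```

In the real system the transformations would read and write HDFS files rather than in-memory lists; the recorded lineage is what allows later steps, such as a distributed data cleaning job, to reason about how the physical files were derived from the logical dataset.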

