Magpie
Python has become overwhelmingly popular for ad-hoc data analysis,
and Pandas dataframes have quickly become the de facto
standard API for data science. However, performance and scaling to
large datasets remain significant challenges. This is in stark contrast
with the world of databases, where decades of investment have led
to both sub-millisecond latencies for small queries and many orders
of magnitude better scalability for large analytical queries. Furthermore,
databases offer enterprise-grade features (e.g., transactions,
fine-grained access control, tamper-proof logging, encryption) as
well as a mature ecosystem of tools in modern clouds.
In this project, we bring together the ease of use and versatility of
Python environments with the enterprise-grade, high-performance
query processing of cloud database systems. We describe Magpie, a
system we are building that exposes the popular Pandas API while
lazily pushing large chunks of computation into scalable, efficient,
and secure database engines. Magpie assists the data
scientist by automatically selecting the most efficient engine (e.g.,
SQL DW, SCOPE, Spark) in cloud environments that offer multiple
engines atop a data lake. Magpie's common data layer virtually
eliminates data transfer costs across potentially many such engines.
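
To make the idea of lazily pushing dataframe operations into a database engine concrete, the following is a minimal sketch and not Magpie's actual implementation or API. It assumes a hypothetical `LazyFrame` class that records filter and projection calls and compiles them into a single SQL query only when results are requested. The standard-library `sqlite3` module stands in for a cloud engine, and the `trips` table and its rows are made-up illustration data.

```python
import sqlite3


class LazyFrame:
    """Hypothetical lazily evaluated dataframe (illustrative; not Magpie's API)."""

    def __init__(self, conn, table, columns=None, predicates=None):
        self.conn = conn
        self.table = table
        self.columns = columns            # None means "all columns"
        self.predicates = predicates or []

    def select(self, *columns):
        # Record the projection; nothing is executed yet.
        return LazyFrame(self.conn, self.table, list(columns), self.predicates)

    def filter(self, column, op, value):
        # Record the predicate; nothing is executed yet.
        if op not in {"=", "<", ">", "<=", ">=", "!="}:
            raise ValueError(f"unsupported operator: {op}")
        return LazyFrame(self.conn, self.table, self.columns,
                         self.predicates + [(column, op, value)])

    def to_sql(self):
        # Compile all recorded operations into one parameterized query.
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM {self.table}"
        params = []
        if self.predicates:
            clauses = [f"{c} {op} ?" for c, op, _ in self.predicates]
            sql += " WHERE " + " AND ".join(clauses)
            params = [v for _, _, v in self.predicates]
        return sql, params

    def collect(self):
        # Only here does work reach the engine, as a single query.
        sql, params = self.to_sql()
        return self.conn.execute(sql, params).fetchall()


# Made-up example data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [("NYC", 12.5), ("SEA", 30.0), ("NYC", 45.0)])

query = LazyFrame(conn, "trips").filter("fare", ">", 20).select("city", "fare")
sql, params = query.to_sql()
rows = query.collect()
```

In this sketch, chaining `filter` and `select` touches no data. The engine sees one query at `collect()`, so it can apply its own optimizations to the whole expression, and only the matching rows return to the Python process.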