Scaling and explaining machine learning powered database applications
Authors
An, Shuai
Abstract
For decades, database systems have been the backbone of applications across a wide range
of domains, e.g., finance, web services, business intelligence, social analytics, and healthcare.
Meanwhile, the resurgence of machine learning, in particular the large-scale deep learning
services offered by major providers in recent years, has been rapidly reshaping
these applications, enabling them to easily take advantage of model prediction services
and become significantly more adaptive, intelligent, and capable. However, this movement
makes database applications less transparent, rendering database vendors unable
to offer reliable insights and explanations to their application customers. In
addition, the rising popularity of machine learning features rapidly increases the scale
of applications, while the underlying legacy database systems struggle to scale out
and keep pace, creating tension between the application load and the database system.
This thesis aims to address these two challenges. The first part of the thesis presents a
new concept, referred to as on-database contextual explanation, and a suite of associated
techniques that empower database applications to explain their learning-powered decisions
to end customers, even when the decisions are generated by third-party cloud-based prediction
services that opt not to offer explainability. The key intuition is that the data exchange
between the databases and the remote machine learning models already gives the applications
a dynamic context containing vital information from which reliable explanations can be deduced,
independent of the explainability of the remote models. To elaborate on and exploit this,
we develop algorithms and systems to efficiently compute contextual feature explanations
and counterfactual explanations by examining and monitoring databases at runtime,
faithfully conforming to the "right to explanation" policy mandated by the GDPR. We also
evaluate their effectiveness via extensive experiments and real-world case studies.
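At a high level, a counterfactual explanation of this flavor can be pictured as a search over feature perturbations drawn from values already present in the database context; the sketch below is illustrative only (the function names and the single-feature search are assumptions, not the thesis's actual algorithms or API).

```python
# Minimal sketch of a contextual counterfactual search (illustrative):
# given a black-box prediction function and a record, find the smallest
# single-feature change, drawn from candidate values observed in the
# database context, that flips the model's prediction.

def counterfactual(record, predict, context_values):
    """record: dict of feature -> value
    predict: black-box model prediction service, returns a label
    context_values: feature -> candidate values seen in the database context
    Returns (feature, new_value, change_magnitude), or None if no flip found.
    """
    original = predict(record)
    best = None
    for feature, candidates in context_values.items():
        for value in candidates:
            if value == record[feature]:
                continue
            changed = dict(record, **{feature: value})
            if predict(changed) != original:
                delta = abs(value - record[feature])
                if best is None or delta < best[2]:
                    best = (feature, value, delta)
    return best

# Toy loan example: the remote model approves iff income >= 50.
predict = lambda r: "approve" if r["income"] >= 50 else "deny"
record = {"income": 40, "age": 30}
context = {"income": [45, 50, 60], "age": [25, 35]}
print(counterfactual(record, predict, context))  # ('income', 50, 10)
```

The point of the sketch is that only the inputs and outputs exchanged with the model are needed, not the model's internals, which is what makes the approach independent of the remote service's explainability.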
The second part of the thesis develops a means to scale out legacy databases without
migrating them to the cloud or re-deploying them on added hardware. Our method
augments legacy database systems at runtime with external caches, allowing us to offload
database load to a look-aside cache on the fly. However, a caveat of extending
database systems with lightweight caches is that the augmented system as a whole
loses the correctness guarantees that a typical database system offers, especially for transactional
workloads, requiring developers to re-design their applications. To this end,
we present transactional caching, a scheme that maintains application invariants over
the augmented system. It works with any in-memory key-value cache, e.g., Redis
or Memcached, and empowers them to assure that applications always see a monotonically
increasing snapshot of the databases. Critical to the performance of such
cache-augmented databases is the design of transactional cache replacement policies,
which we prove to be intractable, in contrast to conventional caching, where replacement
is decidable in linear time. Nonetheless, we develop efficient learning-augmented transactional cache policies
with provable guarantees. Over real-life traces and benchmarks, they prove
effective in improving transaction throughput while guaranteeing application invariants.
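One way to picture the monotonicity guarantee is a look-aside cache that tags every entry with the database commit version at which it was read, and serves an entry only if it is at least as new as the latest version the caller has already observed; the class and method names below are illustrative assumptions, not the system's real interface.

```python
# Illustrative look-aside cache sketch: each cached value carries the
# commit version it was read at. A cached entry is served only if its
# version is >= the newest version the caller has observed, so a client
# never reads backwards in time (a monotonically increasing snapshot).

class MonotonicCache:
    def __init__(self, db_read):
        self._db_read = db_read   # fallback to the database: key -> (value, commit_version)
        self._store = {}          # in-memory cache: key -> (value, version)

    def get(self, key, seen_version):
        entry = self._store.get(key)
        if entry is not None and entry[1] >= seen_version:
            return entry          # cache hit, not stale for this client
        value, version = self._db_read(key)   # miss or stale: go to the database
        self._store[key] = (value, version)
        return value, version

# Toy database: the version bumps on every write.
db = {"x": ("v1", 1)}
cache = MonotonicCache(lambda k: db[k])
print(cache.get("x", seen_version=0))   # ('v1', 1): populated from the database
db["x"] = ("v2", 2)                     # the database moves ahead
print(cache.get("x", seen_version=2))   # ('v2', 2): stale entry bypassed, refetched
```

A production scheme layered over Redis or Memcached must also handle concurrent writers and eviction, which is exactly where the transactional replacement policies discussed above come in; this sketch only shows the staleness check behind the monotonic-snapshot guarantee.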
Together, these results give us a suite of concepts and techniques that allow database applications
to benefit from powerful machine learning models without compromising the transparency,
reliability, and correctness that database systems have offered for decades.