FluiDB: adaptive storage layout using reversible relational operators
Item statusRestricted Access
Embargo end date20/01/2024
It is a popular practice to use materialized intermediate results to improve the perfor- mance of RDBMSes. Work in this area has focused either optimisers matching existing materialized results in the cache and selecting intermediate results from a plan to survive the plan execution. To our knowledge, few attempts have been made to create plans with cached intermediate results in mind, and none that make any attempt to deduplicate the stored data to alleviate the storage cost of maintaining possibly large queries. We built FluiDB to explore a novel approach to integrating the selection of materialized results with the planner to optimize the logical representation of data in memory. FluiDB materializes common intermediate results and deduplucates data to alleviate the cost of maintaining them. This is achieved by introducing reversible operations: versions of normal relational operators that may produce complementary tables alongside the normal output, that allow the reconstruction of the input relations. A planner aware of such operations can build query plans that dynamically adapt the data layout to the plan under a constrained memory budget. This thesis revolves around four main chapters, each of which describes in detail a different part of FluiDB and a final one that goes into evaluation of the system. The first chapter focuses on query processing and the relational algebra semantics that FluiDB operates under. FluiDB parses queries into DAGs of sub-queries connected by reversible operators. Each such graph of the workload is merged into a global query graph that is used to infer properties of each relation like cardinality and extent. The next chapter is dedicated to the planner and a novel monad for weighted backtracking that the planer is based on. The planner attempts to generate a plan by traversing the query graph so that, besides solving the query at hand, it leaves in memory a curated set of queries aiming to maximize the amortized performance of the workload being run. In this chapter, the garbage collector is also discussed, which is the part of the planner responsible for inserting plan fragments that delete nodes when required such that the available storage budget is respected while no information is lost from the database. After that, we go into Antisthenis, a framework we built for defining incremental com- putation systems. Antisthenis is used to build modules of the planner that efficiently determine whether a relation materializable, and estimate the cost of materializing a relation, given a set of materialized relations. Besides computation reuse, Antisthenis is able to prune the computation taking advantage of properties of the operators involved like absorbing group elements and bounded partial results. These techniques are also used to allow evaluation of some classes of self-referential computations. The final chapter about the FluiDB architecture describes the transpilation of plans gen- erated by the planner to C++, as well as the supporting libraries that enable the tran- spilation of queries to highly specialized C++ code, and the low level data organization of the database. The thesis closes with a chapter that describes our methods for benchmarking and some experimental results.