Probabilistic type inference for the construction of data dictionaries
Ceritli, Taha Yusuf
The data understanding stage plays a central role in the entire process of data analytics, as it allows the analyst to gain familiarity with the data, identify data quality issues, and discover initial insights into the data before further analysis (Chapman et al., 2000). These tasks become easier in the presence of well-documented background information such as a data dictionary, which is defined as “a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format” (McDaniel, 1994). However, data dictionaries are often missing or incomplete. In this thesis we focus on the inference of data types (both syntactic and semantic), and develop probabilistic approaches that enable the automatic construction of a data dictionary for a given dataset. Unlike existing rule-based methods, our proposed methods allow us to express uncertainty in a principled way and can provide accurate type predictions even for messy datasets with missing and anomalous values. The thesis makes the following contributions. First, we present ptype, a probabilistic generative model that uses Probabilistic Finite-State Machines (PFSMs) to represent data types. By detecting missing and anomalous data, ptype infers syntactic data types accurately and improves on the performance of existing approaches to type inference. Moreover, in contrast to more familiar finite-state machines (e.g., regular expressions), it can generate weighted predictions when a column of messy data is consistent with more than one type assignment. Second, we propose ptype-cat, an extension of ptype for better detection of the categorical type. ptype itself treats non-Boolean categorical variables as either integers or strings. 
By combining the output of ptype with additional features that can indicate whether a column represents a categorical variable, ptype-cat can correctly detect the general categorical type (including non-Boolean variables). In addition, we adapt ptype to the task of identifying the values associated with the corresponding categorical variable. Finally, we present ptype-semantics to demonstrate how ptype can be enriched with semantic information. Here we focus on dimension inference and unit inference, which are respectively the task of identifying the dimension of a data column and the task of identifying the units of its entries. Syntactic type inference methods, including ptype, do not address these tasks. In contrast, ptype-semantics can extract additional semantic information (such as dimension and unit) about data columns and treat such columns as either floats or integers rather than strings.
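
The idea of weighted type predictions for a messy column can be illustrated with a minimal sketch. It is not the ptype implementation: the PFSMs are replaced by crude stand-ins (a full-match regular expression plus an alphabet size), and the type set, missing-data tokens, mixture weights, and function names are all assumptions chosen for illustration. Each entry is scored under a mixture of "valid for the type", "missing", and "anomalous", and the per-type column scores are normalised into a posterior under a uniform prior.

```python
import math
import re

# Crude stand-ins for PFSMs (assumption): a full-match regex plus an alphabet
# size, so a matching string of length n gets probability (1 / size) ** n.
TYPES = {
    "integer": (re.compile(r"\d+"), 10),
    "float": (re.compile(r"\d+(\.\d+)?"), 11),
    "string": (re.compile(r".+"), 95),
}
MISSING = {"", "NA", "N/A", "null", "-"}  # hypothetical missing-data tokens
ANOMALY_ALPHABET = 128  # the anomaly model accepts any string, with low probability
W_TYPE, W_MISSING, W_ANOMALY = 0.90, 0.08, 0.02  # assumed mixture weights


def type_likelihood(value, type_name):
    """Probability of `value` under the stand-in model for `type_name`."""
    pattern, size = TYPES[type_name]
    if pattern.fullmatch(value) is None:
        return 0.0
    return (1.0 / size) ** len(value)


def value_likelihood(value, type_name):
    """Mixture over 'valid for the type', 'missing', and 'anomalous'."""
    p_type = type_likelihood(value, type_name)
    p_missing = 1.0 if value in MISSING else 0.0
    p_anomaly = (1.0 / ANOMALY_ALPHABET) ** len(value)
    return W_TYPE * p_type + W_MISSING * p_missing + W_ANOMALY * p_anomaly


def column_posterior(values):
    """Posterior over column types under a uniform prior."""
    log_scores = {
        # The floor of 1e-300 guards against underflow on very long entries.
        t: sum(math.log(max(value_likelihood(v, t), 1e-300)) for v in values)
        for t in TYPES
    }
    top = max(log_scores.values())
    unnorm = {t: math.exp(s - top) for t, s in log_scores.items()}
    total = sum(unnorm.values())
    return {t: u / total for t, u in unnorm.items()}


posterior = column_posterior(["1", "2", "NA", "42", "x7"])
for name, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {prob:.3f}")
```

In this toy column the missing token and the anomalous entry are absorbed by the mixture rather than forcing the column to the string type, so integer ranks first while float keeps a substantial share of the probability mass. A rule-based check that requires every entry to match would instead fall back to string.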