Analysing system behaviour by automatic benchmarking of system-level provenance
Chan, Sheung Chi
Provenance is a term originating from the work of art. It aims to provide a chain of information of a piece of arts from its creation to the current status. It records all the historic information relating to this piece of art, including the storage locations, ownership, buying prices, etc. until the current status. It has a very similar definition in data processing and computer science. It is used as the lineage of data in computer science to provide either reproducibility or tracing of activities happening in runtime for a different purpose. Similar to the provenance used in art, provenance used in computer science and data processing field describes how a piece of data was created, passed around, modified, and reached the current state. Also, it provides information on who is responsible for certain activities and other related information. It acts as metadata on components in a computer environment. As the concept of provenance is to record all related information of some data, the size of provenance itself is generally proportional to the amount of data processing that took place. It generally tends to be a large set of data and is hard to analyse. Also, in the provenance collecting process, not all information is useful for all purposes. For example, if we just want to trace all previous owners of a file, then all the storage location information may be ignored. To capture useful information and without needing to handle a large amount of information, researchers and developers develop different provenance recording tools that only record information needed by particular applications with different means and mechanisms throughout the systems. This action allows a lighter set of information for analysis but it results in non-standard provenance information and general users may not have a clear view on which tools are better for some purposes. For example, if we want to identify if certain action sequences have been performed in a process and who is accountable for these actions for security analysis, we have no idea which tools should be trusted to provide the correct set of information. Also, it is hard to compare the tools as there is not much common standard around. With the above need in mind, this thesis concentrate on providing an automated system ProvMark to benchmark the tools. This helps to show the strengths and weaknesses of their provenance results in different scenarios. It also allows tool developers to verify their tools and allows end-users to compare the tools at the same level to choose a suitable one for the purpose. As a whole, the benchmarking based on the expressiveness of the tools on different scenarios shows us the right choice of provenance tools on specific usage.