Building the Knowledge Graph for UK Health Data Science
Extracting patient phenotypes from routinely collected health data (such as Electronic Health Records) requires translating clinically sound phenotype definitions into queries or computations that clinical researchers can execute on the underlying data sources. This translation requires significant knowledge and skill in dealing with heterogeneous and often imperfect data. Such translations are time-consuming, error-prone and, most importantly, hard to share and reproduce across different settings. This paper proposes a knowledge-driven framework that (1) decouples the specification of phenotype semantics from the underlying data sources and (2) automatically populates and conducts phenotype computations on heterogeneous data spaces. We report preliminary results of deploying this framework on five Scottish health datasets.
Big data analytics in healthcare has great potential to reveal deep insights from health data, which would extend the boundaries of medical knowledge and improve the quality of health services. However, making sense of distributed and heterogeneous health data is very challenging. In practice, most data are held by different local communities: they are maintained locally and stored in inconsistent formats and languages. A key technical challenge facing almost all data-driven clinical studies is to extract or compute accurate patient phenotypes (traits such as symptoms, diseases, medications or biochemistry test results) from such a fragmented data space. Figure 1 (the Current Practice section on the left-hand side) illustrates a typical procedure for computing phenotypes from heterogeneous data sources.
The main aim of this study is to realise a clinical data science framework that makes the underlying data sources transparent to phenotype computations. Researchers only need to specify the “meanings” of their phenotypes using terminologies familiar to them; the actual computations are then automatically populated for, and executed on, the data sources.
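To make the idea of decoupling concrete, the following is a minimal sketch in Python, not the framework's actual implementation. It assumes that each data source is a relational table and that a per-source mapping from concepts to local codes is available. The names Phenotype, SourceMapping, populate_query and compute_phenotype, as well as every table name, column name and code value, are hypothetical and chosen only for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Phenotype:
    """Source-independent phenotype: a name plus a concept identifier."""
    name: str
    concept: str


@dataclass
class SourceMapping:
    """How one data source encodes concepts (all values are hypothetical)."""
    table: str
    patient_column: str
    code_column: str
    local_codes: dict  # concept identifier -> list of source-specific codes


def populate_query(phenotype, mapping):
    """Translate a phenotype into a parameterised SQL query for one source."""
    codes = mapping.local_codes.get(phenotype.concept)
    if not codes:
        raise KeyError(f"no mapping for concept {phenotype.concept!r}")
    placeholders = ", ".join("?" for _ in codes)
    sql = (
        f"SELECT DISTINCT {mapping.patient_column} FROM {mapping.table} "
        f"WHERE {mapping.code_column} IN ({placeholders})"
    )
    return sql, tuple(codes)


def compute_phenotype(conn, phenotype, mapping):
    """Run the populated query and return the matching patient identifiers."""
    sql, params = populate_query(phenotype, mapping)
    return sorted(row[0] for row in conn.execute(sql, params))


def demo():
    """Compute one phenotype on two toy sources with different layouts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hospital_episodes (patient_id TEXT, diagnosis TEXT)")
    conn.execute("CREATE TABLE gp_events (person_id TEXT, clinical_code TEXT)")
    conn.executemany(
        "INSERT INTO hospital_episodes VALUES (?, ?)",
        [("p1", "HOSP-DM2"), ("p2", "HOSP-OTHER")],
    )
    conn.executemany(
        "INSERT INTO gp_events VALUES (?, ?)",
        [("p3", "GP-T2D-A"), ("p3", "GP-T2D-B"), ("p4", "GP-OTHER")],
    )
    # The researcher specifies the phenotype once, without naming any source.
    phenotype = Phenotype("type 2 diabetes", "concept:type2_diabetes")
    sources = {
        "hospital": SourceMapping(
            "hospital_episodes", "patient_id", "diagnosis",
            {"concept:type2_diabetes": ["HOSP-DM2"]},
        ),
        "gp": SourceMapping(
            "gp_events", "person_id", "clinical_code",
            {"concept:type2_diabetes": ["GP-T2D-A", "GP-T2D-B"]},
        ),
    }
    return {
        name: compute_phenotype(conn, phenotype, mapping)
        for name, mapping in sources.items()
    }
```

In this sketch the phenotype definition never mentions a table or a local code; only the mappings do, so adding a new data source would mean adding a mapping rather than rewriting the phenotype.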
Given a phenotype, the computer has to understand its semantics (i.e., a computer-understandable meaning) so that the right computations and queries can be populated and executed on the underlying data sources. The formalisation framework has the following three components.