Edinburgh Research Archive

Exploring epigenetic gene regulation using applied machine learning

Authors

Chhatbar, Kashyap

Abstract

Computational biology plays a pivotal role in scientific discovery, particularly in molecular biology and biomedical research. Advances in high-throughput sequencing have enabled large-scale profiling of epigenomic marks across diverse cellular contexts, providing unprecedented opportunities for data-driven insights. Leveraging these vast datasets, machine learning and artificial intelligence (AI) approaches have become powerful tools for predicting gene expression and revealing molecular interactions. However, the correlative nature of these models often requires rigorous validation and integration with experimental data. This thesis applies computational techniques, including linear and non-linear modelling, deep learning, and explainable AI (XAI), to investigate epigenetic mechanisms of gene regulation. By leveraging high-quality transcriptomic and epigenomic datasets, it reassesses longstanding hypotheses and quantifies the influence of chromatin-associated proteins on transcription. One such well-characterized regulator, MeCP2, binds methylated CpG (mCG) di-nucleotides and mCAC tri-nucleotides, recruiting the NCoR complex to regulate gene expression in the brain. While this mechanism is well-established, alternative roles for MeCP2 have been proposed, including its involvement in RNA splicing and specific binding to tandemly repeated CA sequences. To explore MeCP2’s role in RNA splicing, I analysed datasets spanning varying levels of MeCP2 and DNA methylation. Using bioinformatic approaches combined with linear and non-linear machine learning models, my findings revealed minimal splicing changes, with MeCP2 and DNA methylation accounting for only 5-10% of observed variation. Similarly, the proposed preferential targeting of CA repeats by MeCP2 was examined. I found that CA repeats undergo CAC methylation at levels typical of the genome, with no evidence supporting preferential binding.
These results suggest that MeCP2-mediated regulation remains consistent regardless of CA repeat presence and is unlikely to play a broad role in splicing. This work reinforces MeCP2’s primary function in methylation-dependent transcriptional regulation, while addressing critical gaps in the understanding of its proposed alternative functions. While MeCP2 regulates transcription by invoking DNA methylation-dependent binding mechanisms, other chromatin-associated factors may employ different strategies. Mammalian genomes contain extensive regions of homogeneous AT- or GC-rich DNA, conserved across species but with unclear functional significance. SALL4, previously shown to co-localise with AT-rich heterochromatin in cultured cells, was found to target a wide range of AT-rich motifs based on analyses of in vitro and in vivo datasets from mouse embryonic stem cells (mESCs). Using data from SALL4 DNA-binding deficient mutants and SALL4 overexpression models in mESCs, I applied linear models to quantify the relationship between gene body AT content and the extent of gene expression changes. These results demonstrate that SALL4 represses genes in proportion to their AT content, underscoring the potential of DNA base composition as a biological signal for regulating gene expression. My previous work focused on investigating the role of MeCP2 and SALL4 individually in transcriptional regulation. In the next phase, I aim to explore how chromatin-associated proteins function together to coordinate gene expression. Many modify histones post-translationally, generating epigenetic marks that downstream pathways interpret to coordinate transcriptional programs. Understanding this complex context-dependent interplay remains an open challenge.
After training deep learning models to predict RNA Pol-II occupancy from chromatin profiles in unperturbed mESCs, I applied Shapley Additive Explanations (SHAP), a widely used explainable AI (XAI) approach, to infer functional regulatory mechanisms. Genes ranked by SHAP importance accurately predicted direct perturbation targets, even from unperturbed data, reducing the reliance on costly experimental interventions. SHAP analysis further revealed cooperative roles of SET1A and ZC3H4 at promoters and distinct contributions of ZC3H4 at gene bodies in transcriptional regulation. Cross-dataset validation revealed a striking similarity in the direct targets of ZC3H4 and INTS11, suggesting potential regulatory convergence mediated through H3K4me3 and the SET1/COMPASS complex. This highlights SHAP’s ability to uncover context-dependent transcriptional patterns and reveal previously unrecognized connections between regulatory pathways. In summary, this thesis seeks to demonstrate the power of computational biology in dissecting transcriptional regulation by challenging assumptions and uncovering potential new regulatory paradigms. It refines our understanding of how chromatin-associated proteins interpret DNA sequence and epigenetic marks to control transcription. Finally, I propose that predictive modelling combined with XAI approaches like SHAP offers a scalable framework for predicting functional targets and inferring regulatory mechanisms.
