Exploring epigenetic gene regulation using applied machine learning
dc.contributor.advisor
Sanguinetti, Guido
dc.contributor.advisor
Bird, Adrian
dc.contributor.author
Chhatbar, Kashyap
dc.contributor.sponsor
College of Science and Engineering, University of Edinburgh: Research Award
en
dc.date.accessioned
2025-10-09T10:25:05Z
dc.date.available
2025-10-09T10:25:05Z
dc.date.issued
2025-10-09
dc.description.abstract
Computational biology plays a pivotal role in scientific discovery, particularly in molecular biology and biomedical research. Advances in high-throughput sequencing have enabled large-scale profiling of epigenomic marks across diverse cellular contexts, providing unprecedented opportunities for data-driven insights. Leveraging these vast datasets, machine learning and artificial intelligence (AI) approaches have become powerful tools for predicting gene expression and revealing molecular interactions. However, the correlative nature of these models often requires rigorous validation and integration with experimental data. This thesis applies computational techniques, including linear and non-linear modelling, deep learning, and explainable AI (XAI), to investigate epigenetic mechanisms of gene regulation. By leveraging high-quality transcriptomic and epigenomic datasets, it reassesses longstanding hypotheses and quantifies the influence of chromatin-associated proteins on transcription.
One such well-characterized regulator, MeCP2, binds methylated CpG (mCG) dinucleotides and mCAC trinucleotides, recruiting the NCoR complex to regulate gene expression in the brain. While this mechanism is well-established, alternative roles for MeCP2 have been proposed, including its involvement in RNA splicing and specific binding to tandemly repeated CA sequences. To explore MeCP2’s role in RNA splicing, I analysed datasets spanning varying levels of MeCP2 and DNA methylation. Using bioinformatic approaches combined with linear and non-linear machine learning models, I found minimal splicing changes, with MeCP2 and DNA methylation accounting for only 5-10% of observed variation. Similarly, I examined the proposed preferential targeting of CA repeats by MeCP2. I found that CA repeats undergo CAC methylation at levels typical of the genome, with no evidence supporting preferential binding. These results suggest that MeCP2-mediated regulation remains consistent regardless of CA repeat presence and that MeCP2 is unlikely to play a broad role in splicing. This work reinforces MeCP2’s primary function in methylation-dependent transcriptional regulation, while addressing critical gaps in the understanding of its proposed alternative functions.
While MeCP2 regulates transcription by invoking DNA methylation-dependent binding mechanisms, other chromatin-associated factors may employ different strategies. Mammalian genomes contain extensive regions of homogeneous AT- or GC-rich DNA, conserved across species but with unclear functional significance. SALL4, previously shown to co-localise with AT-rich heterochromatin in cultured cells, was found to target a wide range of AT-rich motifs based on analyses of in vitro and in vivo datasets from mouse embryonic stem cells (mESCs). Using data from SALL4 DNA-binding deficient mutants and SALL4 overexpression models in mESCs, I applied linear models to quantify the relationship between gene body AT content and the extent of gene expression changes. These results demonstrate that SALL4 represses genes in proportion to their AT content, underscoring the potential of DNA base composition as a biological signal for regulating gene expression.
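As a minimal sketch of the kind of linear model described above, the following Python fits an ordinary least-squares line relating gene-body AT content to a log2 fold change in expression. The sequences, fold-change values, gene names, and function names are hypothetical and invented for illustration; they are not taken from the thesis, whose actual models and data may differ.

```python
def at_content(seq):
    """Fraction of A/T bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "AT" for base in seq) / len(seq)


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: gene-body sequences and log2 fold changes
# (mutant vs wild type). All values are invented for illustration only.
genes = {
    "geneA": ("GCGCGCGCAT", 0.1),
    "geneB": ("ATGCATGCAT", 0.6),
    "geneC": ("ATATATGCAT", 0.9),
    "geneD": ("ATATATATAT", 1.4),
}
at = [at_content(seq) for seq, _ in genes.values()]
lfc = [change for _, change in genes.values()]
slope, intercept = fit_line(at, lfc)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

In this toy setting, a positive slope would mean that genes with higher AT content show larger expression increases when the repressor is disrupted, which is the pattern expected if repression scales with AT content.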
My previous work focused on investigating the role of MeCP2 and SALL4 individually in transcriptional regulation. In the next phase, I aim to explore how chromatin-associated proteins function together to coordinate gene expression. Many modify histones post-translationally, generating epigenetic marks that downstream pathways interpret to coordinate transcriptional programs. Understanding this complex context-dependent interplay remains an open challenge. I generated deep learning models to predict RNA Pol-II occupancy from chromatin profiles in unperturbed mESC conditions and applied Shapley Additive Explanations (SHAP), a widely used explainable AI (XAI) approach, to infer functional regulatory mechanisms. Genes ranked by SHAP importance accurately predicted direct perturbation targets, even from unperturbed data, reducing the reliance on costly experimental interventions. SHAP analysis further revealed cooperative roles of SET1A and ZC3H4 at promoters and distinct contributions of ZC3H4 at gene bodies in transcriptional regulation. Cross-dataset validation revealed a striking similarity in the direct targets of ZC3H4 and INTS11, suggesting potential regulatory convergence mediated through H3K4me3 and the SET1/COMPASS complex. This highlights SHAP’s ability to uncover context-dependent transcriptional patterns and reveal previously unrecognized connections between regulatory pathways.
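To illustrate the idea behind Shapley-based attribution, the sketch below computes exact Shapley values for a single input by averaging each feature's marginal contribution over all feature orderings. It does not use the SHAP library or its approximations, and it is not the thesis's model: `toy_model`, its coefficients, the feature values, and the choice of H3K27ac and H3K36me3 as example features are all hypothetical assumptions.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input.

    Each feature's marginal contribution is averaged over all feature
    orderings; features not yet added are held at their baseline value.
    The cost grows factorially with the number of features, so this is
    only practical for toy examples.
    """
    features = list(x)
    totals = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(signal):
    """Hypothetical stand-in for a trained predictor of Pol II occupancy."""
    return (2.0 * signal["H3K4me3"]
            + 1.0 * signal["H3K27ac"]
            + 0.5 * signal["H3K4me3"] * signal["H3K36me3"])


# Invented chromatin signal for one gene and an all-zero reference input.
gene = {"H3K4me3": 1.0, "H3K27ac": 0.5, "H3K36me3": 2.0}
background = {"H3K4me3": 0.0, "H3K27ac": 0.0, "H3K36me3": 0.0}
phi = shapley_values(toy_model, gene, background)
for name, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(name, round(value, 3))
```

The attributions sum to the difference between the model output for the gene and for the baseline, and the interaction term is shared equally between the two features involved. Ranking genes by such per-feature attributions is the general idea behind the SHAP-based target ranking described above.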
In summary, this thesis seeks to demonstrate the power of computational biology in dissecting transcriptional regulation by challenging assumptions and uncovering potential new regulatory paradigms. It refines our understanding of how chromatin-associated proteins interpret DNA sequence and epigenetic marks to control transcription. Finally, I propose that predictive modelling combined with XAI approaches like SHAP offers a scalable framework for predicting functional targets and inferring regulatory mechanisms.
en
dc.identifier.uri
https://hdl.handle.net/1842/44037
dc.identifier.uri
http://dx.doi.org/10.7488/era/6563
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Chhatbar, Kashyap, Adrian Bird, and Guido Sanguinetti (July 15, 2025). “Modeling transcription with explainable AI uncovers context-specific epigenetic gene regulation at promoters and gene bodies”. doi: 10.1101/2025.01.30.635704
en
dc.relation.hasversion
Chhatbar, Kashyap, Justyna Cholewa-Waclaw, Ruth Shah, et al. (Oct. 13, 2020). “Quantitative analysis questions the role of MeCP2 as a global regulator of alternative splicing”. In: PLOS Genetics 16.10. Ed. by Dirk Schübeler, e1009087. doi: 10.1371/journal.pgen.1009087
en
dc.relation.hasversion
Chhatbar, Kashyap, John Connelly, Shaun Webb, et al. (Dec. 2022). “A critique of the hypothesis that CA repeats are primary targets of neuronal MeCP2”. In: Life Science Alliance 5.12, e202201522. doi: 10.26508/lsa.202201522
en
dc.relation.hasversion
Pantier, Raphaël, Kashyap Chhatbar, Timo Quante, et al. (Feb. 2021). “SALL4 controls cell fate in response to DNA base composition”. In: Molecular Cell 81.4, 845–858.e8. doi: 10.1016/j.molcel.2020.11.046
en
dc.relation.hasversion
Tillotson, Rebekah, Justyna Cholewa-Waclaw, Kashyap Chhatbar, et al. (Mar. 2021). “Neuronal non-CG methylation is an essential target for MeCP2 function”. In: Molecular Cell 81.6, 1260–1275.e12. doi: 10.1016/j.molcel.2021.01.011
en
dc.rights.license
Creative Commons: Attribution 4.0 International (CC-BY 4.0)
en
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
en
dc.subject
epigenetics
en
dc.subject
transcription
en
dc.subject
machine learning
en
dc.subject
explainable AI
en
dc.subject
XAI
en
dc.subject
chromatin
en
dc.subject
Mecp2
en
dc.subject
Sall4
en
dc.subject
DNA methylation
en
dc.subject
deep learning
en
dc.subject
Shapley Additive Explanations
en
dc.subject
transcriptomics
en
dc.subject
multiomics
en
dc.subject
RNA splicing
en
dc.subject
interpretable modelling
en
dc.subject
regulatory genomics
en
dc.title
Exploring epigenetic gene regulation using applied machine learning
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Chhatbar2025.pdf
- Size: 29.45 MB
- Format: Adobe Portable Document Format

