Higher-order interactions in single-cell gene expression: towards a cybergenetic semantics of cell state
Finding and understanding patterns in gene expression guides our understanding of living organisms, their development, and their diseases, but it is a challenging, high-dimensional problem because many molecules are involved. One way to learn about the structure of a gene regulatory network is to study the interdependencies among its constituents in transcriptomic data sets. These interdependencies could be arbitrarily complex, but almost all current models of gene regulation contain pairwise interactions only, despite experimental evidence for higher-order regulation that cannot be decomposed into pairwise mechanisms. I set out to capture these higher-order dependencies in single-cell RNA-seq (scRNA-seq) data using two different approaches. First, I fitted maximum entropy (or Ising) models to expression data by training restricted Boltzmann machines (RBMs). On simulated data, RBMs faithfully reproduced both pairwise and third-order interactions. I then trained RBMs on 37 genes from an scRNA-seq data set of 70k astrocytes from an embryonic mouse. While pairwise and third-order interactions were revealed, the estimates contained a strong omitted-variable bias, and there was no statistically sound and tractable way to quantify the uncertainty in the estimates. As a result, I next adopted a model-free approach. Estimating model-free interactions (MFIs) in single-cell gene expression data required a quasi-causal graph of conditional dependencies among the genes, which I inferred with an MCMC graph-optimisation algorithm on an initial estimate found by the Peter-Clark algorithm. As the estimates are model-free, MFIs can be interpreted either as mechanistic relationships between the genes, or as substructures in the cell population. On simulated data, MFIs revealed synergy and higher-order mechanisms in various logical and causal dynamics more accurately than any correlation- or information-based quantities.
I then estimated MFIs among 1,000 genes, up to seventh order, in 20k neurons and 20k astrocytes from two different mouse brain scRNA-seq data sets: one developmental, and one adolescent. I found strong evidence for interactions up to fifth order, and the MFIs mostly disambiguated direct from indirect regulation by preferentially coupling causally connected genes, whereas correlations persisted across causal chains. Validating the predicted interactions against the Pathway Commons database, gene ontology annotations, and semantic similarity, I found that pairwise MFIs contained a similar amount of mechanistic information to correlation-based networks, but that the information itself differed. Furthermore, third-order interactions provided evidence of combinatorial regulation by transcription factors and immediate early genes. I then switched focus from mechanism to population structure. Each significant MFI can be assigned a set of single cells that most influence its value. Hierarchical clustering of the MFIs by cell assignment revealed substructures in the cell population corresponding to diverse cell states. This offered a new, purely data-driven view on cell states because the inferred states are not required to localise in gene expression space. Across the four data sets, I found 69 significant and biologically interpretable cell states, of which only 9 could be obtained by standard approaches. I identified immature neurons among developing astrocytes and radial glial cells, D1 and D2 medium spiny neurons (MSNs), D1 MSN subtypes, and cell-cycle-related states present across all four data sets. I further found evidence for states defined by genes associated with neuropeptide signalling, neuronal activity, myelin metabolism, and genomic imprinting.
MFIs thus provide a new, statistically sound method to detect substructure in single-cell gene expression data, identifying cell types, subtypes, or states that can be delocalised in gene expression space and whose hierarchical structure provides a new view on the semantics of cell state. Estimation of the quasi-causal graph and the MFIs, together with inference of the associated states, is implemented as a publicly available Nextflow pipeline called Stator.