Large-scale analysis of microarray data to identify molecular signatures of mouse pluripotent stem cells
McGlinchey, Aidan James
Publicly available microarray data constitutes a huge resource for researchers in biological science, and a wealth of it is available for the mouse as a model organism. Pluripotent embryonic stem (ES) cells are able to give rise to all of the adult tissues of the organism and, as such, are much studied for their myriad applications in regenerative medicine. Fully differentiated somatic cells can also be reprogrammed to pluripotency to give induced pluripotent stem cells (iPSCs). ES cells progress through a range of cellular states, from ground-state pluripotency, through the primed state ready for differentiation, to differentiation itself. Microarray data in public online repositories is annotated with several important fields, although this annotation often contains issues that limit its usefulness for human and/or programmatic interpretation in downstream analysis. This thesis assembles and makes available to the research community the largest pluripotent mouse ES cell (mESC) microarray dataset to date, and details the manual annotation of those samples for several key fields to allow further investigation of the pluripotent state in mESCs. Microarray samples from a given laboratory or experiment are known to be similar to each other due to batch effects, and the same has been postulated for samples that use the same cell line. Before investigating transcriptional events in mESCs, this work therefore examines whether a sample's cell line or its source laboratory contributes more to the similarity between samples in the collected pluripotent mESC dataset, using a method based on Random Submatrix Total Variability and accordingly named RaSToVa.
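The abstract does not spell out how RaSToVa is computed, so the following is only an illustrative sketch of the general idea of comparing annotation groups through the variability of random submatrices. It assumes, purely for illustration, that "total variability" means the sum of per-gene variances across the chosen samples, and that a group of samples sharing a label is compared with an equally sized random set of samples on the same randomly drawn genes. The function names, the toy matrix and the scoring rule are all hypothetical and are not taken from the thesis.

```python
import random
from statistics import pvariance


def total_variability(matrix, sample_idx, gene_idx):
    """Sum of per-gene population variances over the chosen samples.

    `matrix` is a list of samples, each a list of gene expression values.
    This definition of total variability is an assumption of the sketch.
    """
    return sum(
        pvariance([matrix[s][g] for s in sample_idx])
        for g in gene_idx
    )


def label_vs_random(matrix, labels, label, n_genes, n_draws, seed=0):
    """Fraction of draws in which the samples sharing `label` vary less
    than an equally sized random set of samples on the same random genes."""
    rng = random.Random(seed)
    group = [i for i, lab in enumerate(labels) if lab == label]
    all_samples = range(len(matrix))
    all_genes = range(len(matrix[0]))
    wins = 0
    for _ in range(n_draws):
        genes = rng.sample(all_genes, n_genes)
        background = rng.sample(all_samples, len(group))
        group_tv = total_variability(matrix, group, genes)
        background_tv = total_variability(matrix, background, genes)
        if group_tv < background_tv:
            wins += 1
    return wins / n_draws


# Toy data (hypothetical): four samples, four genes, two source laboratories.
expr = [
    [1.0, 1.1, 0.9, 1.0],  # lab A
    [1.1, 1.0, 1.0, 0.9],  # lab A
    [5.0, 5.2, 4.9, 5.1],  # lab B
    [5.1, 5.0, 5.0, 5.2],  # lab B
]
labs = ["A", "A", "B", "B"]
score = label_vs_random(expr, labs, "A", n_genes=2, n_draws=200)
```

Under these assumptions, a score close to 1 for one annotation field (for example source laboratory) and a lower score for another (for example cell line) would suggest that the first field is the stronger contributor to sample similarity.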
Further, an extension of the same permutation and analysis method is developed to enable Discovery of Annotation-Linked Gene Expression Signatures (DALGES). This is applied to the gathered data to provide the first large-scale analysis of the transcriptional profiles and biological pathway activity of three commonly used mESC cell lines and a selection of iPSC samples, seeking insight into potential biological differences among them. The work then re-orders the pluripotent mESC data by markers of known pluripotency states, from ground-state pluripotency through primed pluripotency to earliest differentiation, and analyses changes in gene expression and biological pathway activity across this spectrum using differential expression and a window-scanning approach. This analysis seeks to recapitulate transcriptional patterns known to occur in mESCs and reveals the existence of putative “early” and “late” naïve pluripotent states, thereby identifying several lines of enquiry for in-laboratory investigation.
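The re-ordering and window-scanning step described above can be illustrated with a minimal sketch. It assumes a hypothetical per-sample marker score that places each sample along the pluripotency spectrum, and it summarises one gene by its mean expression in each sliding window of consecutive ordered samples. The function names, the window width and the toy values are assumptions made for illustration, not the procedure used in the thesis.

```python
from statistics import mean


def order_samples(marker_scores):
    """Return sample indices sorted from lowest to highest marker score."""
    return sorted(range(len(marker_scores)), key=lambda i: marker_scores[i])


def window_scan(values, order, width, step=1):
    """Mean of `values` in each window of `width` consecutive ordered samples."""
    ordered = [values[i] for i in order]
    return [
        mean(ordered[start:start + width])
        for start in range(0, len(ordered) - width + 1, step)
    ]


# Toy data (hypothetical): one marker score and one gene across five samples.
marker_scores = [0.9, 0.1, 0.5, 0.3, 0.7]
gene_expression = [9.0, 1.0, 5.0, 3.0, 7.0]

order = order_samples(marker_scores)
profile = window_scan(gene_expression, order, width=3)
```

A profile that rises or falls steadily along the ordering points to a gene whose expression tracks the marker score; an abrupt change between adjacent windows is the kind of feature that could hint at a boundary between states.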