Identifying long non-coding RNA in the chicken transcriptome
Kuo, Richard Izen
The transcriptome remains a vast, underexplored space in genomics. Unlike the genome, which is linear in nature, the transcriptome is shaped by the use of alternative transcription start, end, and splice sites in eukaryotes, which creates the possibility of a near-infinite variety of differentially expressed RNA. While many expressed messenger RNA have been identified through the proteins that they produce, very little is known about long non-coding RNA (lncRNA), which represent one of the largest frontiers of transcriptomics. Although little is known about this class of RNA as a whole, specific lncRNA have been found to be crucial components of biological development. Given the characteristics of lncRNA, there may also be a sub-class that is involved in cell differentiation and speciation. In order to explore lncRNA and generate high throughput predictions of their functions, I used the chicken as a model and applied comparative genomics using newly assembled genomes from other avian species.
Long non-coding RNA present an almost perfect scenario for evading detection by previous RNA discovery methods. They have been shown to be poorly conserved across species, with generally low expression levels and no immediately identifiable downstream product. Given these factors, previous RNA detection methods such as expressed sequence tags and RNA sequencing cannot provide reliable evidence for the mass identification of lncRNA.
In the first chapter, I explore the characteristics of Iso-Seq (Pacific Biosciences long read RNA sequencing technology) and methods for processing the data to improve long non-coding RNA identification. I also explore the use of non-traditional cDNA library preparation methods, including cDNA normalization and 5’ cap selection. I found that the ability of long read RNA sequencing to provide full length transcript sequences allows for more robust methods of lncRNA prediction.
In the second chapter, I explore the data processing of long reads. I use a dataset generated by Pacific Biosciences from the Universal Human Reference RNA as an example of ideal long read data. By using data based on the human transcriptome, I was able to compare my results with information from one of the most thoroughly annotated and studied transcriptomes. I demonstrate the Transcriptome Annotation by Modular Algorithms (TAMA) software that I developed and show how it can be used to explore the non-coding RNA within the transcriptome.
In the third chapter, I explore the transcriptome constructed from Iso-Seq data on different chicken tissue samples. I used the TAMA software along with other tools to build pipelines optimized for lncRNA discovery and to perform functional annotation. Using these methodologies, I identified over 300,000 putative transcript models corresponding to over 50,000 genes. Of these, over 100,000 transcript models appear to be lncRNA, corresponding to over 38,000 gene loci. The majority of these are predicted to be sense exonic and mono-exonic lncRNA. While further investigation is required to produce sufficient evidence that these RNA are not the result of transcriptional noise, I have identified a subset that appears to have functional importance given its co-expression with known genes. I demonstrate that while lncRNA appear to be expressed at generally low levels, they are often expressed in a tissue-specific manner, which suggests a possible role in tissue differentiation.
From these investigations, I have found that there are potentially thousands of unannotated lncRNA within the chicken transcriptome, with characteristics that require new technologies such as long read sequencing to identify. These novel lncRNA include a subset that could have functional roles in the regulation of cell differentiation.