Statistical inference in population genetics using microsatellites
Statistical inference from molecular population genetic data is currently a very active area of research for two main reasons. First, in the past two decades an enormous amount of molecular genetic data have been produced and the amount of data is expected to grow even more in the future. Second, drawing inferences about complex population genetics problems, for example understanding the demographic and genetic factors that shaped modern populations, poses a serious statistical challenge. Amongst the many different kinds of genetic data that have appeared in the past two decades, the highly polymorphic microsatellites have played an important role. Microsatellites revolutionized the population genetics of natural populations, and were the initial tool for linkage mapping in humans and other model organisms. Despite their important role, and extensive use, the evolutionary dynamics of microsatellites are still not fully understood, and their statistical methods are often underdeveloped and do not adequately model microsatellite evolution. In this thesis, I address some aspects of this problem by assessing the performance of existing statistical tools, and developing some new ones. My work encompasses a range of statistical methods from simple hypothesis testing to more recent, complex computational statistical tools. This thesis consists of four main topics. First, I review the statistical methods that have been developed for microsatellites in population genetics applications. I review the different models of the microsatellite mutation process, and ask which models are the most supported by data, and how models were incorporated into statistical methods. I also present estimates of mutation parameters for several species based on published data. Second, I evaluate the performance of estimators of genetic relatedness using real data from five vertebrate populations. I demonstrate that the overall performance of marker-based pairwise relatedness estimators mainly depends on the population relatedness composition and may only be improved by the marker data quality within the limits of the population relatedness composition. Third, I investigate the different null hypotheses that may be used to test for independence between loci. Using simulations I show that testing for statistical independence (i.e. zero linkage disequilibrium, LD) is difficult to interpret in most cases, and instead a null hypothesis should be tested, which accounts for the “background LD” due to finite population size. I investigate the utility of a novel approximate testing procedure to circumvent this problem, and illustrate its use on a real data set from red deer. Fourth, I explore the utility of Approximate Bayesian Computation, inference based on summary statistics, to estimate demographic parameters from admixed populations. Assuming a simple demographic model, I show that the choice of summary statistics greatly influences the quality of the estimation, and that different parameters are better estimated with different summary statistics. Most importantly, I show how the estimation of most admixture parameters can be considerably improved via the use of linkage disequilibrium statistics from microsatellite data.