Detecting patterns of positive selection using time-series data

Research output: Book/Report › Ph.D. thesis › Research

Population genetics seeks to understand how evolutionary processes have shaped genetic variation. Recent advances in ancient DNA have allowed us to directly observe temporal changes in the genetic composition of populations, giving us insights into how species adapted to various environmental conditions. However, many approaches that study adaptation are designed for present-day genomes. In this thesis we set out to develop new methods for studying natural selection on temporally spread data, and to apply them to both ancient and present-day human genome sequences. First, we developed a method (Chapter 2) to infer the spatio-temporal allele frequency dynamics of an advantageous mutation. The method finds the parameters associated with the allele, such as the selection coefficient and the rates of diffusion and advection (the movement of the allele frequency cluster), that best explain the observed data. By applying our model to simulated data, we could accurately recover the underlying allele frequency surface as well as the selection coefficient; however, we found that different sets of diffusion and advection parameters can produce similar allele frequency trajectories. We demonstrated our method on two variants previously hypothesised to have undergone positive selection: one located in the MCM6/LCT region and associated with the lactase persistence phenotype, and the other located within the TYR region and correlated with skin pigmentation. We further applied the aforementioned method to a variant within the CCR5 chemokine receptor gene, CCR5delta32 (Chapter 3), which is known for its protective properties against HIV-1 and other immune-related infections. The evolutionary history of this mutation and its plausible selective advantage are the subject of ongoing debate. While investigating the inferred allele frequency trajectory of CCR5delta32, we found evidence that this variant experienced positive selection in the period prior to 3000 years before present.
Our findings suggest that the Medieval Plague, contrary to what some previous studies proposed, could not have accounted for the frequencies observed today, and that the selective pressure may thus be attributed to another infectious agent. Finally, we developed another method to find patterns of selective sweeps in whole-genome sequence data (Chapter 4). For this we applied convolutional neural networks, a powerful tool that has been increasingly used in population genetics to tackle problems such as admixture, discrimination between types of sweeps, and inference of population size history. We trained our neural network on simulated genomic regions undergoing selection. Once it was able to distinguish between selection and neutrality, we applied it to ancient European genomic data. With our method we were able to recover sites previously reported to be undergoing a selective sweep, including the LCT and SLC45A2 genes. We also inferred new candidate sites for natural selection. The method can additionally estimate the age of a sweep and the strength of selection. With this work I demonstrate that combining powerful computational tools with time-serial genomic data makes it possible to infer past events, confirm previous hypotheses about known selected sites, and identify new ones.
Original language: English
Publisher: GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen
Number of pages: 170
Publication status: Published - 2023

ID: 343301431