Biological Data Science: Ancient genomics, anesthesiology, epidemiology, and a bit in between

Research output: Book/Report › Ph.D. thesis › Research

In recent years, methods such as next-generation sequencing in genomics and the use of electronic records in the health care sector have dramatically increased the amount of data in the life sciences. In the field of ancient genomics, newer lab protocols, combined with strict precautions, now allow for the sequencing of ancient environmental DNA that is millions of years old. In health care, the growing volume of data collected through electronic records has made it possible to apply modern machine learning models. This has created a need for new methods and tools to analyze and interpret this vast amount of information, which is expected to keep growing in the coming years. This thesis focuses on the use cases of, and potential issues with, applying modern statistical and data science methods to biological data.
The work of this thesis is split into four parts, each supported by a dedicated paper. The first paper introduces a novel statistical method that we developed for analysing ancient metagenomic DNA damage. To our knowledge, no prior methods are designed to cover this specific use case in genomics. We show that the resulting software, metaDMG, is both faster at ancient DNA damage estimation than existing methods and more accurate in its damage estimates, even at taxonomic levels with as few as 100 reads. As such, metaDMG is state-of-the-art for ancient DNA damage estimation for both simple and complex ancient genomic datasets.
The second paper presents a machine learning approach to predicting medical complications after surgery, in particular knee and hip operations. The use of machine learning in anaesthesiology is still in its infancy, and this work is a first step towards its use in this field. We show that modern machine learning models can predict complications after surgery with higher accuracy than the classical statistical methods commonly used in the field. Concretely, we find a 9.7% increase in precision and a 1.6 percentage point increase in the area-under-ROC-curve metric when using a boosted decision tree compared to logistic regression. We further show how explainability methods can be used not only to better understand the “black box” of machine learning models, and thus the risk predictions themselves, but also to support doctors in their decision-making process.
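To make the two reported metrics concrete, the sketch below computes precision and the area under the ROC curve from scratch, and shows the difference between a relative increase (in percent) and an absolute increase (in percentage points). All labels, scores, the 0.5 threshold, and the helper names are hypothetical illustrations and are not taken from the thesis.

```python
def precision(y_true, y_pred):
    """Fraction of predicted positives that are true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case is scored above a random negative case (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_increase(new, old):
    """Relative change, e.g. 0.10 means a 10% increase."""
    return (new - old) / old


def percentage_point_increase(new, old):
    """Absolute change between two fractions, in percentage points."""
    return (new - old) * 100.0


# Hypothetical toy data: 1 = complication, 0 = no complication.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

toy_precision = precision(y_true, y_pred)
toy_auc = roc_auc(y_true, scores)
```

A relative increase depends on the baseline value, whereas a percentage point increase is a plain difference between two rates, which is why the two figures are not directly comparable.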
The third paper describes how spatial heterogeneities affect the fitted predictions of an epidemic curve in the early phase. In collaboration with Statens Serum Institut, the Danish Center for Disease Control, we developed an agent-based model which extends the classical SIR models often used in epidemiology. This allowed us to model the spread of disease in the Danish population and to introduce complex interaction patterns between the agents in the form of heterogeneities based on geographical density. We found that classical SEIR models, when fitted only to an early-stage epidemic, overestimate both the peak number of infected and the total number of infected by a factor of two.
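For reference, the kind of classical, well-mixed compartmental model mentioned above can be sketched as follows. This is a minimal deterministic SEIR integration using forward Euler steps, not the agent-based model of the thesis; the function name and all parameter values (transmission rate, incubation rate, recovery rate, population size) are illustrative assumptions.

```python
def simulate_seir(beta, sigma, gamma, n, e0, days, dt=0.1):
    """Integrate a well-mixed SEIR model with forward Euler steps.

    beta:  transmission rate per day
    sigma: rate of leaving the exposed compartment per day
    gamma: recovery rate per day
    n:     total population size
    e0:    initial number of exposed individuals
    Returns a list of (S, E, I, R) tuples, one per time step.
    """
    s, e, i, r = float(n - e0), float(e0), 0.0, 0.0
    trajectory = [(s, e, i, r)]
    for _ in range(int(round(days / dt))):
        new_exposed = beta * s * i / n * dt      # S -> E
        new_infectious = sigma * e * dt          # E -> I
        new_recovered = gamma * i * dt           # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        trajectory.append((s, e, i, r))
    return trajectory


# Hypothetical parameters, chosen only for illustration.
population = 5_800_000
trajectory = simulate_seir(beta=0.5, sigma=0.25, gamma=0.2,
                           n=population, e0=100, days=300)

peak_infectious = max(state[2] for state in trajectory)  # peak number of infected
final_size = trajectory[-1][3]                           # total number recovered
```

In such a model every individual is equally likely to meet every other, which is the assumption that spatial heterogeneity breaks; `peak_infectious` and `final_size` correspond to the two quantities whose estimates are discussed above.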
All cells in an organism share the same DNA, yet the expression of genes differs wildly between cells. The mechanisms regulating gene expression and the silencing of specific genes are not yet fully understood. It is known, however, that the heterogeneous environment in the cell nucleus is a key factor, and that the silencing and repair foci in particular play an important role. The fourth paper presents an analysis of these foci based on their single-molecule dynamics, using Bayesian inference with diffusion models. This allows us to extract and quantify the diffusion coefficients of the foci, which describe the physical mechanisms behind their formation.
Original language: English
Publisher: Niels Bohr Institutet
Number of pages: 247
Publication status: Published - 2023

ID: 347422793