Machine learning for big biomedical data: fast detection of disease agents in samples

Potential supervisors: 

Large-scale DNA sequencing allows the monitoring of environmental and patient samples for infectious disease agents such as fungi, bacteria and viruses, which are leading causes of morbidity and mortality worldwide. Identification of these disease agents is important for effective treatment and for the detection of potential pandemic-causing agents, such as SARS-CoV-2. Monitoring is also becoming more important because infectious diseases are reaching new geographic regions as disease vectors such as mosquitoes spread due to global warming.

The experimental steps of the monitoring process produce hundreds of gigabytes of DNA sequencing data, which must be analyzed computationally to detect disease agents and to identify known pathogens, antibiotic resistance genes and possible emerging pathogens.

The group of Peter Norberg (Department of Infectious Diseases at the Institute of Biomedicine, Gothenburg University) has developed Genomic Signatures to analyze such data. Genomic Signatures are based on machine learning models, more specifically Variable Length Markov Chains (VLMCs). These probabilistic models capture the characteristics of DNA sequences from pathogens and other organisms and allow an analysis without resorting to comparisons against terabyte-scale collections of known DNA sequences. Genomic Signatures are the topic of an ongoing cooperation that includes jointly supervised students.
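To illustrate how a VLMC-style model can score a DNA sequence without any database comparison, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the Genomic Signatures implementation: the names train_vlmc and log_likelihood are hypothetical, contexts are pruned with a plain count threshold (min_count) where VLMC implementations typically use a statistical pruning criterion, and probabilities are smoothed with a pseudocount.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_vlmc(sequence, max_depth=3, min_count=2):
    """Count next-base occurrences for every context of length 0..max_depth.

    Assumption: contexts seen fewer than min_count times are dropped, a
    simplified stand-in for the statistical pruning used in real VLMCs.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, base in enumerate(sequence):
        for depth in range(0, min(max_depth, i) + 1):
            context = sequence[i - depth:i]
            counts[context][base] += 1
    # The empty context is always kept so that a fallback exists.
    return {ctx: dict(nxt) for ctx, nxt in counts.items()
            if ctx == "" or sum(nxt.values()) >= min_count}

def log_likelihood(model, sequence, max_depth=3, pseudo=1.0):
    """Score a sequence using the longest retained context at each position."""
    total = 0.0
    for i, base in enumerate(sequence):
        context = ""
        for depth in range(min(max_depth, i), -1, -1):
            candidate = sequence[i - depth:i]
            if candidate in model:
                context = candidate
                break
        nxt = model.get(context, {})
        numerator = nxt.get(base, 0) + pseudo
        denominator = sum(nxt.values()) + pseudo * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total

# Toy "signatures" trained on two artificial genomes.
signature_a = train_vlmc("ACGT" * 50)
signature_b = train_vlmc("AATT" * 50)
read = "ACGTACGTACGT"
best = max(("a", signature_a), ("b", signature_b),
           key=lambda item: log_likelihood(item[1], read))[0]
```

Classifying a read then amounts to picking the signature with the highest log-likelihood, which only requires the compact models rather than the underlying sequence collections.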

The goal is the development of a machine learning model that quickly and accurately predicts the content of DNA sequencing data containing multiple pathogens. An additional use case is the identification of recombination in pathogens, which is a source of variability and thus of increased disease potential. While VLMCs can be used to classify each DNA sequence as a whole, modeling the jump from one VLMC to another within a sequence is not straightforward. Instead, the VLMCs need to be combined with a framework that explicitly considers the switch from one VLMC to another. Possible approaches include combining the VLMCs with Hidden Markov Models (HMMs) or with topic discovery models such as Latent Dirichlet Allocation (LDA).
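One way to picture the HMM option is sketched below in Python: each hidden state corresponds to one signature model, windows of the sequence are scored under every model, and Viterbi decoding with a switching penalty yields a segmentation. This is only a sketch under stated assumptions and not the project's method: a fixed-order Markov model stands in for a VLMC, and the names train_order_k, window_loglik and segment, as well as the window size and the switching probability p_switch, are hypothetical choices.

```python
import math
from collections import Counter

def train_order_k(sequence, k=2):
    """Fixed-order Markov model; a simplified stand-in for a VLMC."""
    ctx_counts, trans_counts = Counter(), Counter()
    for i in range(k, len(sequence)):
        ctx_counts[sequence[i - k:i]] += 1
        trans_counts[sequence[i - k:i + 1]] += 1
    return ctx_counts, trans_counts

def window_loglik(model, window, k=2, pseudo=1.0):
    """Log-likelihood of one window under one model (pseudocount smoothing)."""
    ctx_counts, trans_counts = model
    total = 0.0
    for i in range(k, len(window)):
        numerator = trans_counts[window[i - k:i + 1]] + pseudo
        denominator = ctx_counts[window[i - k:i]] + 4 * pseudo
        total += math.log(numerator / denominator)
    return total

def segment(sequence, models, window=20, p_switch=0.01, k=2):
    """Viterbi decoding: one hidden state per model, one emission per window."""
    names = list(models)
    windows = [sequence[s:s + window] for s in range(0, len(sequence), window)]
    if not windows:
        return []
    stay = math.log(1.0 - p_switch)
    switch = math.log(p_switch / max(len(names) - 1, 1))
    # Uniform prior over the starting state.
    score = {n: window_loglik(models[n], windows[0], k) - math.log(len(names))
             for n in names}
    back = []
    for w in windows[1:]:
        new_score, pointers = {}, {}
        for n in names:
            def via(prev, target=n):
                return score[prev] + (stay if prev == target else switch)
            best_prev = max(names, key=via)
            new_score[n] = via(best_prev) + window_loglik(models[n], w, k)
            pointers[n] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back the best path from the best final state.
    state = max(names, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy example: a "recombinant" sequence whose halves come from two sources.
models = {"a": train_order_k("ACGT" * 100), "b": train_order_k("AATT" * 100)}
path = segment("ACGT" * 10 + "AATT" * 10, models)
```

In this toy setting the decoded path changes state where the source of the sequence changes, which is the kind of breakpoint a recombination analysis would look for. Decoding per window rather than per base is a design shortcut that keeps the example short; it limits breakpoint resolution to the window size.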

Student profile:
Ideal candidates have an interest in machine learning for big data, some background in probabilistic methods in machine learning, good knowledge of algorithms and data structures, and experience coding computationally intensive applications. Biomedical knowledge is not a prerequisite.

There is the possibility of forming an interdisciplinary team with a medical student. Supervision will be joint between CSE and Biomedicine.

Further information:
- Alexander Schliep's Group:
- Peter Norberg:



Date range: 
October, 2021 to October, 2024