Bioinformatics and the Curse of Dimensionality

Published on 3 December 2024

Our Liverpool Virtual Seminar Series on Data Intensive Science will continue on Tuesday 10th December at 15:00 GMT. The seminar will be given by Euan McDonnell of the University of Liverpool, who will present “Bioinformatics and the Curse of Dimensionality.”

Seminars in this series cover R&D outside the data intensive science CDT’s core research areas and give an insight into cutting-edge research in this area. At the end of the talk there will be a Q&A session with the speaker.

About the talk

Bioinformatics as a field has expanded rapidly over the past 25 years. Much of this has been driven by the increasing scale and frequency of large datasets, predominantly from global biological profiling approaches, the so-called “omics” technologies. These encompass a wide range of applications that quantify the abundance, activity, or presence of various biological entities in a top-down and unbiased manner, resulting in datasets with hundreds to millions of features. Much of bioinformatics is concerned with ranking and selecting such features with regard to their relationships to external factors, or their co-relationships within or between data types. However, processing and acquiring biological samples, and generating data from them, is complex, time-consuming, and expensive, so the number of data points is frequently far smaller than the number of features. This is termed the “large p, small n” or “p>>n” problem, and it is a critical and ubiquitous issue in bioinformatics and health data science. Such high dimensionality, in tandem with low degrees of freedom, poses a major analytical and computational challenge because the size of the sampling domain grows explosively as features are added: the so-called “curse of dimensionality”. This results in overfitting (high variance) in statistical and machine learning models, and compounds the problems caused by the inherently high variability between biological samples. Bioinformatics has thus seen the application of a suite of methodologies that aim to tackle this issue, commonly including empirical Bayes pooling of information, dimensionality reduction, and regularisation/sparsification procedures. While these approaches have allowed the field to mostly keep up with the increasing scale of the datasets being generated, further developments will be required.
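As a rough illustration of why “p>>n” invites overfitting when features are ranked and selected, the sketch below correlates a random outcome with pure-noise features and reports the strongest correlation found as the number of features grows. It is not taken from the talk; the function names, sample size, and feature counts are assumptions chosen only for this example, and it uses nothing beyond the Python standard library.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_spurious_correlation(n_samples, feature_counts, seed=0):
    """For each p in feature_counts, return the largest |r| between a random
    outcome and the first p pure-noise features (no real signal exists).

    Hypothetical helper for illustration: feature sets are nested, so the
    reported maximum can only stay the same or grow as p increases.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    results = {}
    best = 0.0
    generated = 0
    for p in sorted(feature_counts):
        # Generate only the additional features needed to reach p.
        while generated < p:
            feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
            best = max(best, abs(pearson(feature, outcome)))
            generated += 1
        results[p] = best
    return results


if __name__ == "__main__":
    # Ten samples, as in a small pilot study; feature counts are illustrative.
    for p, r in max_spurious_correlation(10, [10, 100, 1000, 10000]).items():
        print(f"p = {p:>6}: largest |r| against noise = {r:.2f}")
```

With only ten samples, the best-looking feature among thousands of noise features will typically show a strong correlation with the outcome purely by chance, which is the kind of false discovery that pooling of information, dimensionality reduction, and regularisation are meant to guard against.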

About the speaker

Euan started his academic career with an integrated Master’s in microbiology from the University of Leeds, where in his final-year project he delved into bioinformatics, using Bash and Python to analyse transcriptome-wide cleavage sites of a bacterial endoribonuclease. He subsequently undertook a DiMeN MRC-funded bioinformatics PhD working on transcriptomic networks and their dysregulation by the oncogenic herpesvirus Kaposi’s sarcoma-associated herpesvirus. Since January 2023, he has worked as a bioinformatics data scientist with the Computational Biology Facility at the University of Liverpool. His research spans a range of projects, including transcriptome-wide analysis to determine the benefit of arginine intervention in premature neonatal patients, predicting and comparing genotype-epigenome relationships in foetal and osteoarthritic/osteoporotic tissue, and predicting discriminatory protein biomarkers for the diagnosis of non-bacterial osteomyelitis. More generally, he is interested in network approaches to biology, primarily Gaussian graphical models, and in how prior information and multiple ’omics data types can be integrated into network models.

How to attend

Participation is free, but you need to register to attend this and other webinars in the series. For more information and to register, please follow this link. Once registered, you will receive the Zoom connection details on the morning of the online seminar.

The seminar details

Speaker: Euan McDonnell (University of Liverpool)

Seminar title: “Bioinformatics and the Curse of Dimensionality”

Date/Time: Tuesday 10th December at 15:00 GMT