The HEAP interview – Piotr Bala talks metagenomics

We talk to Piotr Bala, leader of HEAP’s Metagenomic Analysis Work Package. He is Professor of parallel and grid computing, algorithms and programming at the University of Warsaw’s Interdisciplinary Centre for Mathematical and Computational Modelling (ICM).

First of all, what does metagenomics involve, and how is it being used in the Human Exposome Assessment Platform (HEAP) project?
Metagenomics is the study of genetic material in biological samples. In the case of HEAP, these are biospecimens from the Swedish Cervical Screening cohort, based at Karolinska Institutet in Stockholm. The aim of the Metagenomics Work Package is to help the biologists and bioinformaticians at KI to quickly and efficiently generate metagenomic sequencing datasets from their biospecimens, and to analyse them to identify bacterial and viral sequences. My team at ICM uses its computing expertise to test bioinformatics pipelines that perform the analysis several hundred times faster than previously possible. The result of this work will be a better understanding of how bacteria and viruses impact our health. In the longer term, these datasets and insights will be available for researchers to access via the HEAP platform.

Who are the ICM team working on HEAP?
We are a multi-disciplinary team. My background is in physics and computer science, and I work on modelling enzymatic reactions and on the parallelization of molecular dynamics software. We also have two computer science faculty members, both with PhDs, providing technical expertise: Marek Nowiński, who has already worked on the parallelization of BLAST code, and Łukasz Górski, who has experience in the parallelization of Artificial Intelligence (AI) tools. Łukasz also has extensive knowledge of software used in the HEAP project, such as Hadoop and Spark. We also have a bioinformatician, Szymon Szyszkowski, who installed and tested the bioinformatics pipelines. The newest member of our team is Magdalena Mroczek, who specialises in genetics and completed her medical studies at Warsaw Medical University. Since graduating she has been working in research laboratories, and she is currently extending her computer science skills through an ICM Master's degree programme in computational engineering. Finally, the ICM team works closely with Roxana Merino, Ville Pimenoff and Sara Arroyo from Karolinska Institutet, who have provided us with data from biosamples. As the end users of the bioinformatics pipeline, they provide us with ideas and challenges. This cooperation is really valuable to us, and helps us fully understand their requirements.

How did your team at ICM come to be working on HEAP?
The team has been working with Karolinska Institutet for the past six years on BLAST, a widely used bioinformatics tool. BLAST compares a sequence of interest (the query sequence) to sequences in a large database and reports the matches it finds. KI used BLAST to search for virus sequences in human DNA. This required a lot of computational time, so we developed a parallel version of BLAST that runs on supercomputers and shortened the analysis time from months to days or hours. For us, the HEAP project is an opportunity to continue our work with KI, looking at ways to shorten the time taken to analyse DNA using parallel computing. We plan to parallelize additional software packages to enable more extensive analysis of DNA.

What stage is your work at right now?
In December 2020, the ICM team completed tests of bioinformatics software pipelines on the ICM cluster computer. We began our latest round of work by testing six potential bioinformatics pipelines, based on a variety of different algorithms. During the tests we identified bottlenecks and corrected them, so that the software can run in parallel on cluster computers. Having run the tests using sample data from Karolinska Institutet, we narrowed the candidates down to two or three pipelines. The final decision will be taken soon.

What will your team be working on in 2021?
Our aim in 2021 is to build on the pipelines we have tested and selected, to make them more advanced. The next stage of work will focus on workflows and machine learning algorithms to identify viral DNA sequences in biological samples. As a result of this work, by spring 2022, we aim to have developed agnostic taxonomic classification methods for virus genomes.

In the longer term, what do you hope will be the legacy of the HEAP project?
The project will advance knowledge on how the environment influences human health by developing the HEAP integrated platform to analyse health-related data, including genetic data. Such an ambitious aim requires diverse, multi-disciplinary experience and expertise, and this is offered by the HEAP consortium partners. For ICM, it is important that we work closely with medical professionals, in order to provide them with an efficient solution to speed up genetic analysis, and therefore to improve the health of patients and of the whole population. This process is challenging, as it requires interaction between software developers and users, and can only be performed with interdisciplinary and cross-institutional collaboration.