The HEAP interview – Ville Pimenoff on data cohorts

We talk to Ville Pimenoff, Senior Research Fellow at Karolinska Institute in Sweden, who is the PI for the Finnish population-wide cohort data that will be included in the HEAP informatics platform as part of its testing and pilot phase.

Q: Tell us about the cohorts that you are working with.

A: The Finnish HPV vaccination cohort and the Finnish Maternity Cohort (FMC) are among the largest cohorts in the world at present. The FMC began recruiting women in 1983 and now includes data from about 1 million women and over 2 million biological samples. The HPV vaccination cohort was started in 2007 and is one of the world’s largest randomized clinical trials, including 32,000 participants who have been followed up and sampled at regular intervals for more than 12 years.

Q: Why have these cohorts been selected as part of the proof of concept for the HEAP informatics platform?

A: These two cohorts were obvious choices because they are well-established, large follow-up datasets that we have been working on for years. Our data management plan in HEAP will be GDPR-compliant, so the data will be well documented and useful for studying population-level exposures.

Indeed, the healthcare system in Finland enables cohort data to be linked to other healthcare registries, such as the national cancer registry, to develop retrospective, long-term follow-up studies for a variety of diseases.

For our HEAP-related research, we are analysing the biological samples from the cohorts to investigate microbial exposures in time and place, at population level. This data will provide insights into the kinds of bacteria and viruses that the participants have been exposed to. As the data includes follow-up samples from the same individuals, it will allow longitudinal analysis of exposure patterns in Finland.

Q: What new insights could HEAP offer by analysing cohort data?

A: Data cohorts offer an excellent baseline to study long-term, population-level exposure among initially healthy individuals at a molecular and microbial RNA/DNA level. Potentially, by combining this data with other data from disease registries, and comparing it with data from other countries, we could find out if similar exposure patterns exist on a national or continental scale. This is a relatively new area for exposure study, and has never yet been attempted at a population scale.

Q: What have been the main challenges you have faced so far?

A: Collecting the samples, transporting them, and preparing them for the high-throughput experiments is always challenging, and especially so during situations like the COVID-19 pandemic. Fortunately, the situation is now improving.

The next task is to deploy all the samples for molecular analysis and manage the high-throughput data. In addition to the original analyses of the samples, we are now performing metagenome analyses. After publishing the initial research, we aim to make the data available to other researchers in a way that complies with ethical approval and GDPR. In fact, this is one of the main long-term objectives of HEAP.

Q: What is the most exciting thing about the HEAP project, from your perspective?

A: From my perspective as a scientist, the most exciting thing about HEAP is the opportunity to work with experts from many different fields – from ICT front-end developers to clinical researchers, from genomic research to machine learning and artificial intelligence.

And from a cohort perspective, one of the brilliant ideas in HEAP is that it offers a combination of cohorts and datasets that researchers can browse with their research question in mind. For example, if they want to combine information from a certain age group with specific lifestyle factors, they will be able to explore what data can be harmonized and combined for a particular question.

Of course, there are also challenges. Each country or cohort will potentially have different data access restrictions and requirements. In HEAP we are determined to find a systematic solution to these challenges, and to make the datasets from different sources more accessible for merged analysis. By cataloguing the data, we hope it will be possible to rapidly assess what information is available for comparison across different datasets and studies.

Q: What do you hope will be the long-term impact of the HEAP informatics platform, once it is developed?

A: I hope the HEAP platform will offer new ways to document, catalogue and share data. We have already seen during the COVID-19 pandemic that science advances faster when data is shared.

There is a real need for platforms that are easily accessible, user-friendly, and controlled by the researchers who are responsible for the original datasets, so that people feel comfortable sharing their data. These are challenging goals, but we hope that HEAP will provide many of the answers and practical solutions to meet them.

Also, because HEAP is part of the European Human Exposome Network (EHEN) of Horizon 2020-funded projects, we will be able to share ambitious goals and lessons learned. This will help us to find solutions to the ethical and practical challenges in studying health-related lifelong exposures.