Selecting Synthetic Data for Successful Simulation-Based Transfer Learning in Dynamical Biological Systems

Abstract

Background

Accurate prediction of the temporal dynamics of biological systems is crucial for informing timely and effective interventions, e.g., in ecological or epidemiological contexts, or for treatment adjustments in therapy. While machine learning has proven capable of generalizing the underlying non-linear dynamics of such systems, its predictive power is often constrained by the limited availability of large, curated datasets. To supplement real-world data, transfer learning with synthetic data derived from simulations of ordinary differential equations has emerged as a promising way to inform machine learning. However, the success of this approach depends strongly on the designed characteristics of the synthetic data.

Results

We suggest scrutinizing the characteristics, such as size, diversity, and noise, of ordinary differential equation-based synthetic time series datasets. Here, we demonstrate how to systematically evaluate the influence of such design choices on transfer learning performance. We conduct a proof-of-concept study on three simple but widely used systems and four real-world datasets. We find a strong interdependency between the effects of synthetic dataset size and diversity. Good transfer learning settings depend heavily on the characteristics of the real-world data as well as on the data's coherence with the dynamics of the model underlying the synthetic data. We achieve a performance improvement of up to 92% in mean absolute error for simulation-based transfer learning compared to non-informed deep learning.
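To make the three design choices concrete, the following is a minimal sketch of how a synthetic time series dataset with adjustable size, diversity, and noise could be generated from an ordinary differential equation. Everything specific in it is an assumption for illustration and is not taken from the study: the logistic growth model, the parameter midpoints, the definition of diversity as a relative sampling range, the Gaussian noise model, and the function names `simulate_logistic`, `make_synthetic_dataset`, and `mean_absolute_error`.

```python
import random


def simulate_logistic(r, k, x0, n_steps, dt=0.1):
    """Integrate dx/dt = r * x * (1 - x / k) with the classical RK4 scheme.

    The logistic model is an illustrative choice, not necessarily one of
    the systems used in the study.
    """
    def f(x):
        return r * x * (1.0 - x / k)

    x = x0
    series = [x]
    for _ in range(n_steps - 1):
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        series.append(x)
    return series


def make_synthetic_dataset(size, diversity, noise_sd, n_steps=50, seed=0):
    """Build `size` noisy trajectories.

    size      : number of simulated time series
    diversity : relative half-width (0 <= diversity < 1) of the range from
                which ODE parameters are drawn (hypothetical definition)
    noise_sd  : standard deviation of additive Gaussian observation noise
    """
    rng = random.Random(seed)
    r_mid, k_mid = 1.0, 10.0  # hypothetical parameter midpoints
    dataset = []
    for _ in range(size):
        r = r_mid * (1.0 + diversity * rng.uniform(-1.0, 1.0))
        k = k_mid * (1.0 + diversity * rng.uniform(-1.0, 1.0))
        clean = simulate_logistic(r, k, x0=0.5, n_steps=n_steps)
        dataset.append([x + rng.gauss(0.0, noise_sd) for x in clean])
    return dataset


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Enumerate a small grid of design choices; in a real evaluation each
# dataset would be used to pre-train a model that is then fine-tuned and
# scored (e.g. by MAE) on real-world data.
design_grid = [
    (size, diversity, noise_sd)
    for size in (10, 100)
    for diversity in (0.1, 0.5)
    for noise_sd in (0.0, 0.2)
]
datasets = {cfg: make_synthetic_dataset(*cfg) for cfg in design_grid}
```

Sweeping such a grid jointly rather than one factor at a time is what would expose an interaction between size and diversity of the kind reported above.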

Conclusions

Our work emphasizes the importance of carefully selecting the properties of synthetic data in order to leverage the valuable domain knowledge contained in ordinary differential equation models for machine learning-based predictions. The code is available at https://github.com/DILiS-lab/opt-synthdata-4tl.
