Evaluating the applicability of replication success metrics in animal-to-human translation: A simulation study

Carolyne Jie Huang
Samuel Pawel
Kimberley Elaine Wever
Benjamin Victor Ineichen
Rachel Heyard

Curated by eLife

eLife Assessment

This is a detailed and well-designed simulation study of the utility of replication metrics for animal-to-human translation, a critical step in bridging the gap between laboratory discoveries and health practice. The study approaches are solid, and the findings are important, as they offer insights into clinical research translation that can advance health decision-making. The strength and applicability of the presented evidence could be improved upon revision.

Abstract

Translation failure, in which promising animal study results cannot be reproduced in human trials, is a challenge in biomedical research. Metrics for replication success are widely used to evaluate reproducibility, i.e., the extent to which the results of a study agree with those of replication studies. The relevance of these metrics in assessing animal-to-human translation success (or failure) is unclear. We conducted a simulation study to examine whether these metrics can quantify translation success and how their performance varies under different conditions. Using parameters from a meta-analysis on prenatal amino acid supplementation and maternal blood pressure, we simulated animal and human studies under 648 scenarios, varying effect sizes, heterogeneity, animal sample sizes and the number of pooled animal studies. Nine metrics were assessed, namely the two-trials rule, meta-analysis, replication Bayes factor, unweighted and weighted Edgington’s methods, golden sceptical p-value and three versions of controlled sceptical p-value. Most metrics, except meta-analysis and replication Bayes factor, controlled false positive rates under no heterogeneity, but became liberal as heterogeneity increased, particularly between human studies. Translation power (i.e., the probability of true positive translation success) was constrained by the weaker evidence of the two findings; e.g., small sample size in the animal studies resulted in lower translation power. The metric based on meta-analysis frequently indicated success when either of the species found strong evidence, while sceptical p-values were more conservative. The sceptical p-value that controls overall type-one error and the weighted version of Edgington’s method performed relatively consistently across scenarios. No metric was uniformly optimal. Metrics developed for replication studies can inform assessments of translation, but their utility depends on the underlying evidence and assumptions. Using multiple metrics in combination, with attention to their strengths and limitations, is recommended for evaluating the translation of animal findings to human outcomes.
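
To make three of the metrics named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes one pooled animal estimate and one human estimate, each with a standard error, a normal approximation, and a positive effect as the hypothesized direction. The one-sided level of 0.025, the use of its square as the threshold for the combined p-values, and all input numbers are illustrative assumptions rather than values taken from the study.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.025  # assumed one-sided significance level (not taken from the study)


def one_sided_p(estimate, se):
    """One-sided p-value for H1: effect > 0, using a normal approximation."""
    return 1.0 - NormalDist().cdf(estimate / se)


def two_trials_rule(p_animal, p_human, alpha=ALPHA):
    """Success only if both p-values are individually significant."""
    return p_animal < alpha and p_human < alpha


def fixed_effect_meta_p(est_a, se_a, est_h, se_h):
    """One-sided p-value of the inverse-variance (fixed-effect) pooled estimate."""
    w_a, w_h = 1.0 / se_a**2, 1.0 / se_h**2
    pooled = (w_a * est_a + w_h * est_h) / (w_a + w_h)
    pooled_se = sqrt(1.0 / (w_a + w_h))
    return one_sided_p(pooled, pooled_se)


def edgington_p(p1, p2):
    """Unweighted Edgington combination: P(U1 + U2 <= p1 + p2) for uniform U1, U2."""
    s = p1 + p2
    return s**2 / 2 if s <= 1 else 1 - (2 - s) ** 2 / 2


# Hypothetical inputs: a noisy animal estimate and a precise human estimate.
p_animal = one_sided_p(0.8, 0.5)
p_human = one_sided_p(0.3, 0.1)

print("two-trials rule success:", two_trials_rule(p_animal, p_human))
print("meta-analysis success:", fixed_effect_meta_p(0.8, 0.5, 0.3, 0.1) < ALPHA**2)
print("Edgington success:", edgington_p(p_animal, p_human) < ALPHA**2)
```

With these hypothetical inputs, the pooled analysis crosses the assumed threshold because the human estimate dominates the weights, while the two-trials rule and Edgington's method do not. This is in line with the abstract's observation that the meta-analysis metric frequently indicated success when either species found strong evidence.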

eLife
Mar 30, 2026

Reviewer #1 (Public review):

A well-designed and preregistered simulation study investigating whether replication-success metrics can be applied to assess animal-to-human translation. The study is comprehensive, uses realistic parameter settings, and provides valuable insights into how different metrics behave under varied conditions.

Strengths:

(1) Methodologically rigorous and transparently preregistered.

(2) Comprehensive simulation design covering a wide range of plausible scenarios.

(3) Clear description of metrics and decision rules.

(4) Valuable contribution to understanding the limitations of applying replication metrics to translation questions.

Weaknesses:

(1) The conceptual distinction between replication and translation could be more clearly emphasized.

(2) Interpretation of results is dense and can be challenging to follow without a clear summary.

(3) Some simulation parameters (effect sizes, heterogeneity, and number of animal studies) require more substantial justification.

(4) Practical recommendations could be more explicit to guide applied researchers.

Reviewer #2 (Public review):

Summary:

The authors attempt to address the high rate of translation failure reported in the literature, where promising results in animal studies are not reproduced in human clinical trials. Using parameters from a previous meta-analysis on prenatal amino acid supplementation and its effects on maternal blood pressure, the authors assessed the performance of replication success metrics and whether they can quantify translation success. In a simulation study, the authors compared nine translation success metrics and found that no single method was uniformly optimal. The authors list several limitations of the study, such as the comparability of effect sizes between animal and human studies, the different goals of animal studies versus human studies, and the focus on one aspect (the statistics of translation) that is only part of a broader, more complex decision-making process before proceeding to human trials. The authors recommend using multiple metrics in combination, taking into consideration their strengths and weaknesses, to assess the translation of animal studies to human outcomes. The paper achieves the aim of providing a model with several metrics to evaluate translation success from animal studies to humans.

Strengths:

(1) Utilizing 9 different translation success metrics in combination provides strong flexibility in evaluating whether results in animal studies can translate to humans. This would allow researchers to evaluate translation success using multiple different metrics according to the context of the study.

(2) The authors accounted for the limited sample size in animal studies, which are typically underpowered, and also cautioned that special attention should be given to heterogeneity when interpreting translation results.

(3) Overall, this approach has the potential to be applied to other biomedical studies, provided the limitations for each of the metrics are considered. It would provide a useful tool in assessing translation from animals to humans, in addition to other factors such as safety, pharmacokinetics, etc.

Weaknesses:

While the study has several strengths, there are some limitations.

(1) Preclinical animal study sizes tend to be much smaller than human studies, which results in underpowered results. The authors adjusted for this by pooling animal study data. However, high heterogeneity in the animal studies can affect translation results.

(2) The study focuses only on evaluating the statistical component of translation, which is only one aspect of the decision-making process to move on to human trials. The study does not take into account safety and toxicological profiles, pharmacokinetics, or genetics, which are important considerations that influence the overall effect in humans.
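
The heterogeneity concern in point (1) can be illustrated with a small simulation. This is a hedged sketch, not the authors' simulation design: it assumes the average true effect is zero, draws study-specific true effects from a normal distribution with standard deviation `tau`, and applies the two-trials rule at an assumed one-sided level of 0.025. The standard errors, `tau` values, seed, and the function name `simulate_fpr` are all hypothetical.

```python
import random

Z_CRIT = 1.959964  # critical value for an assumed one-sided level of 0.025


def simulate_fpr(tau_animal, tau_human, se_animal=0.5, se_human=0.2,
                 n_sim=200_000, seed=1):
    """Share of simulated animal/human pairs declared a success by the
    two-trials rule when the average true effect is zero."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # Study-specific true effects scatter around zero with SD tau.
        theta_a = rng.gauss(0.0, tau_animal)
        theta_h = rng.gauss(0.0, tau_human)
        # Observed estimates add sampling error to the true effects.
        est_a = rng.gauss(theta_a, se_animal)
        est_h = rng.gauss(theta_h, se_human)
        if est_a / se_animal > Z_CRIT and est_h / se_human > Z_CRIT:
            hits += 1
    return hits / n_sim


fpr_no_het = simulate_fpr(tau_animal=0.0, tau_human=0.0)
fpr_human_het = simulate_fpr(tau_animal=0.0, tau_human=0.2)
print("no heterogeneity:", fpr_no_het)
print("heterogeneity between human studies:", fpr_human_het)
```

Under these assumptions, heterogeneity widens the spread of the test statistics without being reflected in the standard errors, so the share of false successes rises above the rate seen without heterogeneity.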

Reviewer #3 (Public review):

Summary:

This paper focused on how to navigate the complex decision-making process of whether to go into human trials. This is a critical topic considering the well-documented challenges in replicating and translating findings. While these are two distinct topics (i.e., replication and translation), they are related, and the authors simulated many conditions to assess the utility of replication assessment metrics.

Strengths:

A major strength of the study is the detailed approach to identifying relevant conditions and metrics, and to providing rich results that outline the strengths and weaknesses of each metric. Any simulation study is challenged by trying to identify the most relevant variables of interest, and this study provided sound justification for its chosen variables of interest. While this study does not make a strong recommendation (which I see as a strength), it does provide a comprehensive overview of the various metrics and conditions that were investigated.

Weaknesses:

The weaknesses of the study are the limited focus on specific metrics, the assumptions made, particularly the limited number of human study variables, and a summary of findings that is not easily approachable for a non-technical audience.

Conclusion:

This paper provides a much-needed investigation and discussion of how decisions are made when assessing whether to go into human trials. This is an important topic that productively challenges the status quo, considering documented challenges in replication and translation in biomedical research.

Version published to 10.7554/elife.109853.1 on eLife
Mar 30, 2026
Version published to 10.7554/elife.109853 on eLife
Mar 30, 2026
Version published to 10.1101/2025.11.07.25339757 on medRxiv
Nov 9, 2025
