Accounting for Uncertainty in the Null Benchmark in Two-Stage Phase II Trials

Rebecca Irlmeier
Zhuoli Jin
Fei Ye

Abstract

Background

Simon two-stage designs for binary endpoints and their time-to-event analogues, including the Kwak and Jung method, rely on a fixed null benchmark. Their Type I error control is valid only when that benchmark is correctly specified. In practice, historical benchmarks are often inconsistent due to small samples, population heterogeneity, changing eligibility criteria, and evolving standards of care. Even modest misspecifications can substantially inflate the Type I error rate, leading to costly advancement of ineffective treatments.

Methods

We propose the Interval-Null Robust (INR) two-stage design framework, which accounts for uncertainty in the historical null benchmark. We define the null hypothesis as a plausible range of clinically uninteresting values: p ∈ [p_{0L}, p_{0U}] for binary endpoints and λ ∈ [λ_{0L}, λ_{0U}] (or equivalent survival probabilities) for time-to-event endpoints. Type I error is controlled uniformly over the full null interval: the supremum of the Go probability over the interval does not exceed the nominal Type I error rate. Under the monotonicity of the Go probability, the supremum occurs at the least favorable null configuration, p_{0U} for binary endpoints and λ_{0L} for time-to-event endpoints, but the design is not reduced to a point-null formulation. The interval defines the uncertainty set for error control and is used in selecting among feasible designs through robust criteria such as worst-case regret or minimal average expected sample size.

Results

Across representative planning scenarios for both endpoint types, classic designs calibrated to a single benchmark exhibit substantial Type I error inflation when the true null parameter exceeds the assumed planning value. INR designs maintain the nominal Type I error rate across the full null interval, directly addressing this vulnerability to benchmark misspecification. The robustness-efficiency trade-off can be managed through design constraints and robust optimization criteria while preserving uniform Type I error control.
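
The inflation described above can be illustrated numerically for a binary endpoint. The following is a minimal sketch, not the INRDesign implementation: it evaluates the Go probability of the commonly tabulated Simon optimal design for p0 = 0.10 and p1 = 0.30 (r1/n1 = 1/10, r/n = 5/29) across a hypothetical null interval [0.10, 0.15], then raises the final cutoff until the error at the upper endpoint is back under the nominal level. The interval, the function names, and the recalibration rule are assumptions made for illustration only.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(k, n, p):
    """P(X > k) for X ~ Binomial(n, p); equals 1 when k < 0."""
    return sum(binom_pmf(j, n, p) for j in range(max(k + 1, 0), n + 1))


def go_probability(p, r1, n1, r, n):
    """P(Go | p): pass stage 1 (X1 > r1) and exceed r responses overall."""
    n2 = n - n1
    return sum(
        binom_pmf(x1, n1, p) * binom_tail(r - x1, n2, p)
        for x1 in range(r1 + 1, n1 + 1)
    )


# Simon optimal design for p0 = 0.10, p1 = 0.30, alpha = 0.05, power = 0.80.
R1, N1, R, N = 1, 10, 5, 29
ALPHA = 0.05
# Hypothetical null interval, chosen only for illustration.
P0L, P0U = 0.10, 0.15

# Type I error of the point-null design across the null interval.
grid = [P0L + i * (P0U - P0L) / 10 for i in range(11)]
type1 = [go_probability(p, R1, N1, R, N) for p in grid]
for p, g in zip(grid, type1):
    print(f"p = {p:.3f}  P(Go) = {g:.4f}")

# Smallest final cutoff that restores control at the upper endpoint,
# holding r1, n1, and n fixed (a crude recalibration, not the INR search).
r_robust = next(
    r for r in range(R, N + 1)
    if go_probability(P0U, R1, N1, r, N) <= ALPHA
)
print("recalibrated final cutoff r:", r_robust)
```

Because the Go probability is nondecreasing in p, checking the single point p_{0U} is enough to control the error over the whole interval. Raising the cutoff alone also lowers power at the alternative, which is the robustness-efficiency trade-off that a full design search would address by also varying the sample sizes and stage-1 bound.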

Conclusions

INR two-stage designs offer a transparent framework for addressing historical control uncertainty in single-arm Phase II trials. By replacing a single fixed benchmark with a more realistic interval of clinically plausible null values, the INR design reduces the risk of false-positive Go decisions caused by benchmark misspecification. The framework applies to both binary and time-to-event endpoints and is implemented in the open-source INRDesign R package and an accompanying interactive Shiny app.

Version published to 10.64898/2026.05.14.26353210 on medRxiv
May 18, 2026
