Complete Simulation of timsTOF PASEF Raw Datasets with Timsim Enables Precise Evaluation of False Discovery and Phosphosite Localization Error Rates
Abstract
Accurate control of false discovery rates (FDR) and false localization rates (FLR) is central to quantitative proteomics and phosphoproteomics, yet rigorous validation is limited by the absence of high-complexity ground-truth data. Here we introduce timsim, a simulation framework that uses machine-learning and first-principles prediction of peptide properties to generate native Bruker-format timsTOF dda-PASEF and dia-PASEF acquisition data with complete ground-truth annotation. Using timsim benchmarks, we show that several dia-PASEF workflows control FDR near the nominal 1% threshold at the stripped-sequence level but exhibit an inflated true FDR (3–5%) when modified peptidoforms are considered, driven by systematic misassignment of common modifications. In dda-PASEF analyses, match-between-runs produced peak-matching error rates of up to 30% under high-density conditions. Simulated phosphoproteomics datasets enabled calibration of site localization scores, identifying a site-probability cutoff of 0.65 as an optimal tradeoff between sensitivity and false localization. Timsim provides a scalable resource for rigorous benchmarking and development of proteomics software.
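To illustrate why the true FDR can differ between the stripped-sequence and peptidoform levels, the following is a minimal sketch and not the timsim implementation. It assumes that identifications are plain strings with modifications written in square brackets (a hypothetical notation chosen for this example), and the helper names `strip_modifications` and `empirical_fdr` are likewise illustrative. A reported peptide counts as false when it does not appear in the simulated ground truth at the chosen level.

```python
import re


def strip_modifications(peptidoform: str) -> str:
    """Drop bracketed modification tags (assumed notation), e.g. 'AC[+57.02]DK' -> 'ACDK'."""
    return re.sub(r"\[[^\]]*\]", "", peptidoform)


def empirical_fdr(reported, ground_truth, key=lambda p: p):
    """Fraction of distinct reported identifications that are absent from the ground truth.

    `key` maps each identification to the level at which it is compared
    (identity for peptidoforms, strip_modifications for stripped sequences).
    """
    truth = {key(p) for p in ground_truth}
    calls = {key(p) for p in reported}
    if not calls:
        return 0.0
    false_calls = sum(1 for c in calls if c not in truth)
    return false_calls / len(calls)


# Toy data (invented for illustration): the phosphopeptide is reported
# with the correct sequence but without its modification.
ground_truth = ["PEPTIDEK", "AC[+57.02]DEFK", "S[+79.97]AMPLER", "LM[+15.99]NQR"]
reported = ["PEPTIDEK", "AC[+57.02]DEFK", "SAMPLER", "LM[+15.99]NQR"]

fdr_stripped = empirical_fdr(reported, ground_truth, key=strip_modifications)
fdr_peptidoform = empirical_fdr(reported, ground_truth)

print(f"stripped-sequence FDR: {fdr_stripped:.2f}")
print(f"peptidoform FDR:       {fdr_peptidoform:.2f}")
```

In this toy case every reported sequence is correct once modifications are stripped, but one of the four peptidoforms carries a wrong modification assignment, so the error only becomes visible at the peptidoform level.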