How Does Sampling Affect the AI Prediction Accuracy of Peptides’ Physicochemical Properties?
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (PREreview)
Abstract
Accurate AI prediction of peptide physicochemical properties is essential for advancing peptide-based biomedicine, biotechnology, and bioengineering. However, the performance of predictive AI models is significantly affected by the representativeness of the training data, which depends on the sample size and sampling method employed. This study addresses the challenge of determining the optimal sample size and sampling method to enhance the predictive accuracy and generalization capacity of AI models for estimating the aggregation propensity, hydrophilicity, and isoelectric point of tetrapeptides. Four sampling methods were evaluated across sample sizes ranging from 100 to 20,000: Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS). A sample size of approximately 12,000 (7.5% of the total tetrapeptide dataset) marks a key threshold for stable and consistent model performance. This study provides valuable insights into the interplay between sample size, sampling strategy, and model performance, offering a foundational framework for optimizing data collection and AI model training for the prediction of peptides' physicochemical properties, especially for prediction across the complete sequence space of longer peptides of more than four amino acids.
Article activity feed
This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/14934537.
Peer Review of "How Does Sampling Affect the AI Prediction Accuracy of Peptides' Physicochemical Properties?"
Short Summary of the Research's Main Findings
This paper explores how different sampling strategies and sample sizes influence the predictive accuracy of AI models for peptide physicochemical properties. The study evaluates four sampling methods—Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS)—across sample sizes ranging from 100 to 20,000. A key takeaway is that a sample size of approximately 12,000 (7.5% of the full tetrapeptide dataset) marks the threshold for stable and reliable predictions. The findings provide useful guidance for designing AI models that require efficient and representative sampling, particularly in peptide-based drug discovery and bioengineering.
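To make the contrast between the methods concrete, the two simplest strategies can be sketched over the tetrapeptide sequence space. This is an illustrative sketch only, not the authors' pipeline: the mapping from unit-hypercube strata to amino-acid indices is my assumption.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
N_POS = 4                              # tetrapeptide length
rng = np.random.default_rng(0)

def srs_sample(n):
    """Simple Random Sampling: i.i.d. uniform draws over the sequence space."""
    idx = rng.integers(0, 20, size=(n, N_POS))
    return ["".join(AMINO_ACIDS[i] for i in row) for row in idx]

def lhs_sample(n):
    """Latin Hypercube Sampling: one stratum per draw in each position,
    independently permuted per position, then mapped onto the 20 residues."""
    strata = np.stack([rng.permutation(n) for _ in range(N_POS)], axis=1)
    u = (strata + rng.random((n, N_POS))) / n   # stratified points in [0, 1)
    idx = np.minimum((u * 20).astype(int), 19)
    return ["".join(AMINO_ACIDS[i] for i in row) for row in idx]
```

Unlike SRS, the LHS draw guarantees that when n is a multiple of 20, each residue appears equally often at every position, which is the kind of coverage property the paper's comparison turns on.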
Major Issues
The choice of sampling methods is not fully justified. The paper does not explain why these four specific methods were chosen. Would alternative approaches, like stratified sampling or active learning, offer better results? A brief comparison with other techniques would strengthen the argument.
While UDS ensures an even distribution of sequences, it does not necessarily create a balanced dataset in terms of physicochemical properties. The study could explore ways to adjust for property distribution bias, such as weighted sampling or a hybrid approach.
The study suggests that increasing the sample size improves model performance, but it does not quantify computational trade-offs. Would the accuracy gains from increasing the dataset from 8,000 to 12,000 samples justify the extra computational cost? A cost-benefit analysis would help clarify this.
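One way to frame that cost-benefit question is through a fitted learning curve: model prediction error as a power law in sample size and compare the marginal error reduction against the marginal cost. The functional form and coefficients below are invented for illustration, not fitted to the paper's results.

```python
# Hypothetical power-law learning curve: err(n) = a * n**(-b).
# a and b would come from fitting the paper's observed errors.
a, b = 2.0, 0.3
cost_per_sample = 1.0   # relative labeling/compute cost per extra sample

def err(n):
    return a * n ** (-b)

gain = err(8_000) - err(12_000)          # error reduction from growing the set
cost = (12_000 - 8_000) * cost_per_sample
print(f"error reduction: {gain:.4f} for {cost:.0f} extra cost units")
```

Reporting the ratio gain/cost at each step of the sample-size sweep would directly answer whether the jump from 8,000 to 12,000 samples pays for itself.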
The AI model predicts key peptide properties such as aggregation propensity, hydrophilicity, and isoelectric point, but none of these predictions are validated experimentally. Even verifying a small set of AI predictions through lab experiments would add credibility to the results.
The study focuses on tetrapeptides but does not discuss whether the findings apply to longer peptide sequences. Would a pentapeptide or decapeptide dataset require exponentially larger samples for similar accuracy? A discussion on this would be valuable.
Minor Issues
The UDS method maintains sequence diversity but is inconsistent in capturing the full range of property distributions. Would combining UDS with PPS help correct this issue? A short discussion on potential improvements would be useful.
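The hybrid suggested above could be prototyped in a few lines: start from an evenly spaced (UDS-like) candidate set, then draw from it with PPS-style probability weights. Both the even grid and the weight definition here are placeholder assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
candidates = np.arange(0, 160_000, 16)        # evenly spaced "UDS-like" indices
weights = 1.0 + rng.random(candidates.size)   # stand-in for a size/property measure
p = weights / weights.sum()

# PPS-style weighted draw without replacement from the uniform candidate grid
draw = rng.choice(candidates, size=500, replace=False, p=p)
```

The even grid preserves sequence-space coverage while the weights tilt the draw toward under-represented property regions, which is exactly the bias correction the hybrid is meant to provide.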
Some figures, particularly those showing property distribution errors, could be better labeled. Highlighting key trends directly in the captions would improve readability.
The paper includes an effect size analysis, but the real-world impact of these values isn't fully explained. How do the reported effect sizes translate into meaningful improvements for peptide property prediction?
A few sentences could be clearer. For example, the sentence "No significant differences in AI prediction accuracy are observed among all four sampling methods" could be rewritten as "The study found no major differences in prediction accuracy across the four sampling methods." Small refinements in wording would improve overall readability.
Final Recommendation
Accept with minor revisions. The study provides practical insights into sampling strategies for AI-driven peptide property prediction. With some refinements—particularly around sampling method justification, property bias, computational trade-offs, and validation—this could be a very strong contribution to the field.
Competing interests
The author declares that they have no competing interests.