Human limits in Machine Learning: Prediction of plant phenotypes using soil microbiome data

Rosa Aghdam
Xudong Tang
Shan Shan
Richard Lankau
Claudia Solis-Lemus

Abstract

The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We investigate an integrative framework that performs accurate machine learning-based prediction of plant phenotypes from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction improves when environmental features such as soil physicochemical properties and microbial population density are incorporated into the models in addition to the microbiome information. Exploring various data preprocessing strategies confirms the significant impact of human decisions on predictive performance. We show that the naive total sum scaling normalization commonly used in microbiome research is not the optimal strategy for maximizing predictive power. We also find that accurately defined labels are more important than normalization, taxonomic level, or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists with a full model selection decision tree that identifies the human choices that optimize model prediction power.
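The abstract names total sum scaling without defining it. As the term is commonly used in microbiome work, it divides each sample's taxon counts by that sample's total count, yielding relative abundances. The following is a minimal sketch under that assumption; the function name `total_sum_scaling` and the toy count table are hypothetical illustrations, not the authors' code or data.

```python
def total_sum_scaling(counts):
    """Convert each sample's raw taxon counts to relative abundances.

    `counts` is a list of samples, each a list of non-negative counts
    (one entry per taxon). Hypothetical helper, not from the preprint.
    """
    scaled = []
    for sample in counts:
        total = sum(sample)
        # An all-zero sample has no defined proportions; keep it as zeros.
        scaled.append([c / total if total else 0.0 for c in sample])
    return scaled


# Toy table (made up for illustration): rows are samples, columns are taxa.
toy_counts = [[10, 30, 60], [5, 5, 0], [0, 0, 0]]
tss = total_sum_scaling(toy_counts)
```

Each non-empty row of `tss` sums to 1, so samples with different sequencing depths become comparable. The abstract reports that this normalization is not the best choice for predictive power, but does not say here which alternatives performed better.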

Version published to 10.21203/rs.3.rs-3957562/v1 on Research Square
Mar 1, 2024

Related articles

Machine Learning-Based Assessment of the Healthy Human Gut Mycobiota Landscape Using ITS1 DNA Metabarcoding Data
Authors: Giuseppe Defazio, Erika Lorusso, Mariangela De Robertis, Tommaso Mello, Andrea Galli, Graziano Pesole, Bruno Fosso
Latest version: Feb 19, 2026

Predicting Pest and Disease Occurrence Using Synthetic Data and Explainable Machine Learning Methods
Authors: Priyanka Balley, Prof. Kanchan K. Doke
Latest version: Mar 3, 2026

Systematic benchmarking of foundation models and classical baselines for microbiome-based disease prediction
Authors: Jin Mu, Zheng-Zheng Tang, Guanhua Chen
Latest version: Feb 25, 2026
