Uncovering genetic mechanisms underlying trait variation in switchgrass using explainable artificial intelligence

Curation statements for this article:
  • Curated by eLife

    eLife Assessment

    The study by Izquierdo and colleagues provides important insights into the field of genomic and transcriptomic prediction of traits across multiple environments. The rationale and analyses conducted to integrate the two types of omics datasets across two environments are solid. However, the presentation of the results would benefit from some clarification, and adding statistical controls to clarify how the predictors were selected, or assessing their importance using the SHAP framework, would further consolidate the findings.

Abstract

Uncovering the genetic architecture of quantitative traits is challenging because polygenic control yields small individual gene effects and because gene–gene and genotype-by-environment interactions add further complexity. To understand the genetic basis of polygenic traits and their plasticity across environments, we integrated genome-wide SNPs and RNA-seq transcript data with interpretable statistical and machine learning models in a switchgrass (Panicum virgatum) diversity panel grown at contrasting field sites in Michigan and Texas. Notably, in addition to predicting traits within single environments, our models were able to predict phenotypic differences across environments, i.e., plasticity. By interpreting trait prediction models with explainable artificial intelligence methods, we identified important features, namely the genes that are most predictive of flowering time and annual biomass production across environments, based on their associated gene expression levels and nearby SNPs. This approach recovered canonical flowering regulators and revealed novel, environment-specific candidate flowering genes. Further, transcriptome models consistently recovered more switchgrass genes homologous to experimentally validated genes in Arabidopsis and rice than SNP-based models. Feature interaction scores from the models also allowed the identification of trait- and environment-dependent gene–gene interactions, with flowering time showing stronger and more abundant interactions than biomass. While some of the interactions identified are consistent with the link between flowering time and yield, most are novel predictors that need to be further evaluated. Together, these results demonstrate that interpretable genomic prediction with explainable artificial intelligence approaches can convert trait prediction models into mechanistic hypotheses about putative causal genes and interactions controlling traits within and across environments. These results will help to prioritize target genes for validation and inform germplasm selection for cultivar improvement.

Article activity feed

  1. eLife Assessment

    The study by Izquierdo and colleagues provides important insights into the field of genomic and transcriptomic prediction of traits across multiple environments. The rationale and analyses conducted to integrate the two types of omics datasets across two environments are solid. However, the presentation of the results would benefit from some clarification, and adding statistical controls to clarify how the predictors were selected, or assessing their importance using the SHAP framework, would further consolidate the findings.

  2. Reviewer #1 (Public review):

    Summary:

    P. Izquierdo et al. investigated the genetic basis of various traits of interest in switchgrass using large-scale genomic and transcriptomic data. More specifically, they worked on a diversity panel comprising 426 genotypes evaluated in common-garden experiments at two locations (Michigan and Texas). The phenotypic and genomic data had already been published. In this work, they produced transcriptomic data for each of the 426 genotypes at each site and carried out phenotype predictions using genomic and transcriptomic data separately or together. Although the two data types were moderately correlated at each location, they appeared to provide complementary information for phenotype prediction. To further exploit the availability of data from two locations, they computed between-location differences for phenotypes and transcripts as indicators of trait and transcript plasticity, respectively. They built predictive models of trait plasticity using genomic information and transcript plasticity, which proved to be quite accurate for traits affected by GxE. Finally, they used SHAP values from predictive models of flowering time and biomass at each location, as well as of their plasticity, to gain insight into the genetic basis of these traits. These SHAP values quantify the importance of the predictive features (SNPs and/or transcripts) for trait prediction. This allowed them to confirm some candidate genes and to propose new candidates for both traits.
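
    As a rough illustration of the plasticity indicators described in this summary, the sketch below computes per-genotype differences between sites for one trait and for an expression matrix. All values are simulated, the TX - MI direction follows the subtraction mentioned in the second review, and the variable names are hypothetical; it is not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genotypes = 426  # panel size reported in the review

# Hypothetical flowering-time values (days), one per genotype and site,
# aligned so that index i refers to the same genotype in both arrays.
genetic_value = rng.normal(0.0, 5.0, n_genotypes)
trait_mi = 200.0 + genetic_value + rng.normal(0.0, 3.0, n_genotypes)
trait_tx = 150.0 + 0.5 * genetic_value + rng.normal(0.0, 3.0, n_genotypes)

# Trait plasticity as a per-genotype difference between sites (TX - MI).
trait_plasticity = trait_tx - trait_mi

# The same subtraction applied gene by gene to expression matrices
# (genotypes x genes) gives a transcript-plasticity feature matrix.
n_genes = 50
expr_mi = rng.normal(5.0, 1.0, (n_genotypes, n_genes))
expr_tx = rng.normal(5.0, 1.0, (n_genotypes, n_genes))
expr_plasticity = expr_tx - expr_mi

# Cross-environment correlation of the trait: a low value means genotype
# rankings change between sites, i.e. the trait is affected by GxE.
cross_env_r = float(np.corrcoef(trait_mi, trait_tx)[0, 1])
print(f"cross-environment r = {cross_env_r:.2f}")
```

    The difference vectors can then serve as the response (trait plasticity) and as predictors (transcript plasticity) in any regression model.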

    Strengths:

    I found this study interesting and rich. I think the sample size (426 genotypes) is large enough to support the findings. The use of a modern machine-learning approach (XGBoost) together with SHAP indices to find interesting features and get insights into the biological mechanisms underlying flowering time and biomass production is quite original. The methodology employed is globally sound. I also like the fact that the authors accounted implicitly for the population structure by providing a baseline prediction using the first 5 PCs.
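
    A population-structure baseline of the kind mentioned here could look roughly like the sketch below: the first 5 principal components of the SNP matrix are used as the only predictors, and their cross-validated accuracy sets the bar that SNP or transcript models must exceed. The simulated two-population data, the plain least-squares model, and the function names are assumptions made for illustration; the manuscript's exact procedure may differ.

```python
import numpy as np

def top_pcs(genotypes, k=5):
    """Scores of the first k principal components of a column-centered matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

def cv_pearson_r(features, y, n_folds=5, seed=0):
    """Pearson r between y and out-of-fold least-squares predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y))
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        x_train = np.column_stack([np.ones(len(train)), features[train]])
        x_test = np.column_stack([np.ones(len(test)), features[test]])
        beta, *_ = np.linalg.lstsq(x_train, y[train], rcond=None)
        preds[test] = x_test @ beta
    return float(np.corrcoef(preds, y)[0, 1])

# Simulated panel: two subpopulations with diverged allele frequencies,
# and a trait that differs between subpopulations (pure structure signal).
rng = np.random.default_rng(1)
n, p = 200, 500
pop = rng.integers(0, 2, n)
freq0 = rng.uniform(0.2, 0.8, p)
freq1 = np.clip(freq0 + rng.normal(0.0, 0.15, p), 0.05, 0.95)
freq = np.where(pop[:, None] == 1, freq1, freq0)
genotypes = rng.binomial(2, freq).astype(float)
trait = 2.0 * pop + rng.normal(0.0, 1.0, n)

# PCs are computed on all genotypes (unsupervised), then evaluated by CV.
pcs = top_pcs(genotypes, k=5)
baseline_r = cv_pearson_r(pcs, trait)
print(f"PC-only baseline r = {baseline_r:.2f}")
```

    In this toy setting the PCs alone predict the trait well, which is exactly the situation in which a full-feature model should be judged by its gain over the baseline rather than by its raw accuracy.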

    Weaknesses:

    While the methodology is globally sound, I sometimes had difficulties following exactly what was done. This is partly due to the fact that the authors used 2 omics (SNPs and transcripts) to predict phenotypes, and sometimes, in the results, it is not clear which of the 2 is the focus. This was especially the case for the importance of the features and the interpretability of the models, where I found it sometimes hard to tell whether the analysis was done on SNPs or transcripts.

    Also, regarding the methodology, I did not understand why the authors needed to perform a feature selection approach. Maybe it was required to perform the interaction analysis, which could not be deployed on all the features? But regarding the importance of the features, I do not get the added value of the selection over the direct use of SHAP indices when using all features. Maybe this is because I am not a specialist in this kind of approach, but maybe the authors could add more details to explain the rationale behind the feature selection.

  3. Reviewer #2 (Public review):

    Summary:

    The authors aimed to evaluate whether integrating genomic (SNP) and transcriptomic information with machine learning can improve phenotypic prediction of polygenic traits across environments. The manuscript explored not only the predictability across models and predictor feature sets, but also attempted to identify meaningful genes and interactions underlying trait variation.

    Strengths:

    The main strength of the manuscript is its integration of SNP, transcriptomic, and phenotype datasets for 426 switchgrass genotypes grown in Texas and Michigan. It provides a systematic comparison of predictor types (SNP versus transcript abundance) and of model strategies to integrate them.

    Weaknesses:

    (1) Experimental Design

    The experimental design raises several concerns that should be clarified before strong biological conclusions are drawn from the transcriptomic analyses.

    First, the transcriptomic sampling is not well aligned with the developmental stages most relevant to the phenotypes being modeled. Leaf tissue was collected at a single time point in each environment, whereas traits such as flowering time, biomass, tiller count, and panicle height arise from developmental processes occurring over extended and potentially distinct temporal windows. Consequently, the measured expression profiles are likely to reflect physiological states specific to the sampling dates (May 5-6 in Texas and June 22-24 in Michigan) rather than the regulatory processes underlying the target phenotypes.

    Second, the phrase "haphazardly randomized" is questionable for a field experiment. It is unclear whether the design included formal randomization, blocking, row/column structure, or spatial correction. Without explicit accounting for spatial field heterogeneity, environmental variation within sites may confound genotype and transcriptomic effects.

    Third, the Methods do not clearly describe biological replication for RNA-seq. If each genotype-by-environment combination were represented by a single transcriptomic sample, then within-genotype expression variance cannot be estimated. This is important because transcript abundance is highly sensitive to microenvironment, sampling time, tissue status, developmental stage, and technical variation. The absence of replication significantly weakens confidence in gene-level feature importance and gene-gene interaction claims.

    Fourth, the analysis of expression differences across environments is based on a simple subtraction (TX - MI) followed by correlation with genetic similarity. This approach is not standard in transcriptomic analysis and does not account for variability, replication, or statistical uncertainty. Conventional methods for assessing differential expression and genotype-by-environment interactions rely on model-based frameworks that explicitly estimate variance components and test for interaction effects. Without such modeling, the observed expression differences may reflect noise or confounding factors rather than genotype-driven responses.

    (2) SHAP contribution values

    Although SHAP is a well-established framework for decomposing model predictions into feature-level contributions, its use in this manuscript raises several concerns regarding interpretation, statistical validity, and biological inference.

    First, SHAP values quantify the contribution of features within the fitted model, conditional on the joint distribution of inputs and the model structure. They do not represent causal effects or direct biological importance. There is also a difference in scale: SHAP values are often expressed in log-odds, whereas the regression models here use absolute units. Without a fair evaluation of model fit, SHAP values should be interpreted cautiously, because a feature can show very high SHAP values even when the model fits poorly.

    In genomic data, where features are highly correlated due to linkage disequilibrium and co-expression, SHAP values can distribute contribution values across correlated variables in ways that are not uniquely identifiable. As a result, features highlighted as "important" may reflect correlation structure rather than true functional relevance.
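
    The non-identifiability described here can be shown with a deliberately extreme toy case: two perfectly correlated features and two linear models that make identical predictions but split the attribution differently. The sketch uses the closed-form attribution w_j * (x_j - mean(x_j)), which coincides with the SHAP value of a linear model only under the feature-independence assumption; it is a swapped-in simplification, not the tree-based SHAP computation used in the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1.copy()  # perfectly correlated copy (extreme LD or co-expression)
X = np.column_stack([x1, x2])

def linear_attributions(weights, X):
    """Per-sample attributions w_j * (x_j - mean(x_j)) for a linear model.

    Equal to SHAP values only when features are treated as independent.
    """
    return weights * (X - X.mean(axis=0))

# Two models that are indistinguishable by their predictions.
w_a = np.array([1.0, 0.0])   # all weight on feature 1
w_b = np.array([0.5, 0.5])   # weight shared between the two copies
pred_a = X @ w_a
pred_b = X @ w_b

# Global importance as the mean absolute attribution per feature.
imp_a = np.abs(linear_attributions(w_a, X)).mean(axis=0)
imp_b = np.abs(linear_attributions(w_b, X)).mean(axis=0)

print("same predictions:", np.allclose(pred_a, pred_b))
print("importance, model A:", imp_a)
print("importance, model B:", imp_b)
```

    Both models fit the data equally well, yet feature 2 is "unimportant" in one and as important as feature 1 in the other, so a ranking built on such attributions reflects modeling choices as much as biology.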

    This correlation structure can be exacerbated in this manuscript by the use of TPM-normalized transcript abundances as predictor variables without biological replicates. Even assuming that the transcript abundance estimates are robust, TPM values are compositional: the constant-sum constraint creates dependencies among all genes and induces negative correlations. This issue is particularly relevant for the interpretation of gene importance and interaction effects, where correlated predictors can lead to unstable and non-unique attributions. The biological interpretation of transcript-based features therefore remains uncertain.
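
    The constant-sum effect can be checked numerically. In the sketch below, gene abundances are simulated as independent, then rescaled so each sample sums to one million, which mimics only the closure step of TPM (gene-length normalization is omitted). The gene count is kept tiny on purpose: for exchangeable parts the induced correlation is about -1/(n_genes - 1), so the toy exaggerates what a transcriptome with thousands of genes would show on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 2000, 5

# Independent abundances: no gene is related to any other by construction.
abundance = rng.lognormal(mean=0.0, sigma=0.5, size=(n_samples, n_genes))

# Closure: rescale every sample to a constant total, as TPM does.
tpm_like = abundance / abundance.sum(axis=1, keepdims=True) * 1e6

def mean_offdiag_corr(matrix):
    """Average pairwise Pearson correlation between columns (genes)."""
    corr = np.corrcoef(matrix, rowvar=False)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return float(corr[off_diagonal].mean())

raw_r = mean_offdiag_corr(abundance)
closed_r = mean_offdiag_corr(tpm_like)

print(f"mean gene-gene r before closure: {raw_r:.3f}")
print(f"mean gene-gene r after closure:  {closed_r:.3f}")
```

    The negative correlations after closure come from the normalization alone, which is why log-ratio transformations are often preferred before compositional data are used as predictors.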

    (3) Result interpretation

    For example, on page 11, the statement that "plasticity SNP- and transcriptomic-based models generally outperformed single-environment models for traits with low cross-environment correlation, such as green-up (Fig. 2c, r = -0.13, p < 8.3 × 10⁻³) and tiller count (Fig. 2f, r = -0.08, p = 0.1) (Supplementary Fig. S1)" is too broad. For green-up, the Diff model appears much better than MI, but not clearly better than TX.

    Similarly, on page 11, the statement that "...Diffexp was more predictive than SNPs for trait plasticity in biomass, flowering time, and tiller count..." holds only for biomass, not for flowering time or tiller count.

    The claim of "complementary information" between SNP and transcriptomic models on page 12 is stronger than what Figure 2 supports. Figure 2 shows different predictive performance, but it does not by itself demonstrate complementarity. Establishing complementarity requires evidence that combining SNP+T improves prediction consistently or captures distinct, non-overlapping signals. Yet the preceding section states that SNP+T outperformed either single data type in only 15% of cases, with modest gains, which is confusing. Also, Figure 2 contains no G+T model; the label used there is SNP+T.
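
    One way to test complementarity along these lines is a paired comparison: evaluate SNP-only, transcript-only, and combined models on the same cross-validation splits, and check whether the combined model beats the better single data type consistently. The sketch below does this with ridge regression on simulated data in which the two feature sets carry distinct signals by construction. Ridge is used here as a stand-in for the manuscript's models, and the feature counts, penalty, and names are assumptions.

```python
import numpy as np

def ridge_cv_r(X, y, folds, lam=1.0):
    """Pearson r between y and out-of-fold ridge predictions."""
    preds = np.empty(len(y))
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        xc = X[train] - mu_x
        beta = np.linalg.solve(xc.T @ xc + lam * np.eye(X.shape[1]),
                               xc.T @ (y[train] - mu_y))
        preds[test] = (X[test] - mu_x) @ beta + mu_y
    return float(np.corrcoef(preds, y)[0, 1])

# Simulated data: SNPs and transcripts contribute independent signals.
rng = np.random.default_rng(0)
n = 300
snps = rng.binomial(2, 0.3, (n, 20)).astype(float)
expr = rng.normal(size=(n, 20))
y = snps @ rng.normal(size=20) + expr @ rng.normal(size=20) + rng.normal(size=n)

# Repeat 5-fold CV with different splits; all models share each split.
n_repeats = 10
gains = []
for rep in range(n_repeats):
    folds = np.array_split(np.random.default_rng(rep).permutation(n), 5)
    r_snp = ridge_cv_r(snps, y, folds)
    r_t = ridge_cv_r(expr, y, folds)
    r_both = ridge_cv_r(np.hstack([snps, expr]), y, folds)
    gains.append(r_both - max(r_snp, r_t))

gains = np.array(gains)
wins = int((gains > 0).sum())
mean_gain = float(gains.mean())
print(f"SNP+T beat the best single model in {wins}/{n_repeats} repeats")
print(f"mean gain in r: {mean_gain:.3f}")
```

    A consistent positive gain across splits would support complementarity; a gain in only a small share of cases, as reported for 15% of cases in the manuscript, would point to largely overlapping signal.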