16S rRNA sequence captures microbial functional potential
Abstract
16S rRNA amplicon sequencing is widely used for microbiome profiling, but most function prediction methods rely on reference databases of characterized organisms, which limits their accuracy in underrepresented environments. We discovered that 16S rRNA k-mer composition carries substantial functional signal: (i) whole-genome k-mer profiles predict genome-encoded functions, and (ii) 16S rRNA k-mer profiles reflect their source genome’s composition. Building on these relationships, we developed embeRNA, a neural network framework that predicts functions directly from 16S rRNA k-mer embeddings without requiring taxonomy assignment or phylogenetic placement. embeRNA outputs per-function probability scores, enabling users to tune decision thresholds to balance precision and recall or to account for community novelty. In a stringent “novel microbes” benchmark, where all test sequences shared <97% identity with training data, embeRNA outperformed reference-based methods, particularly for hard-to-label functions. Applied to soil metagenomes with paired 16S and whole metagenome shotgun sequencing (WMS) data, embeRNA recovered most WMS-inferred functions and produced abundance profiles strongly correlated with WMS results, outperforming a reference-based approach. Our findings demonstrate that 16S rRNA directly captures functional potential and that 16S amplicon sequencing data can complement WMS-based inference to broaden functional characterization of microbiomes, especially in understudied environments.
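As an illustration of two ideas in the abstract, the sketch below computes a normalized k-mer frequency profile from a sequence and applies a user-chosen threshold to per-function probability scores. It is not embeRNA's implementation: the function names `kmer_profile` and `call_functions`, the choice of k, and the example function labels and scores are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3):
    """Return normalized k-mer frequencies in lexicographic ACGT order.

    Hypothetical helper: the value of k is an assumption, not taken
    from the preprint. Windows containing non-ACGT characters are ignored.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def call_functions(probs, threshold=0.5):
    """Return the function labels whose probability meets the threshold.

    Raising the threshold favors precision; lowering it favors recall.
    """
    return sorted(name for name, p in probs.items() if p >= threshold)


# Made-up scores for made-up labels, standing in for model output.
example_probs = {
    "nitrogen_fixation": 0.8,
    "sulfate_reduction": 0.55,
    "methanogenesis": 0.3,
}
profile = kmer_profile("ACGUACGU", k=2)
lenient_calls = call_functions(example_probs, threshold=0.5)
strict_calls = call_functions(example_probs, threshold=0.7)
```

In a real pipeline the profile would be fed to a trained model that produces the probabilities. Here the two steps are shown independently so that the effect of moving the threshold is visible on its own.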