nf-core/proteinfamilies: A scalable pipeline for the generation of protein families
This article has been Reviewed by the following groups
- Evaluated articles (GigaScience)
Abstract
The growth of metagenomics-derived amino acid sequence data has transformed our understanding of protein function, microbial diversity and evolutionary relationships. However, the vast majority of these proteins remain functionally uncharacterised. Grouping the millions of such uncharacterised sequences with the few experimentally characterised ones allows the transfer of annotations, while inspecting conserved residues in multiple sequence alignments can provide clues to function, even in the absence of existing functional information. To address the challenges associated with this data surge and the need to group sequences, we present a scalable, open-source, parametrizable Nextflow pipeline (nf-core/proteinfamilies) that generates nascent protein families or assigns new proteins to existing families. The computational benchmarks demonstrated that resource usage can scale approximately linearly with input size, while the biological benchmarks showed that the generated protein families closely resemble manually curated families found in widely used databases.
Article activity feed
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag009), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 2: Castrense Savojardo
This manuscript presents a Nextflow pipeline (nf-core/proteinfamilies) for large-scale protein-family generation. Overall, I think the paper is well written and clear. The pipeline appears very useful, and the reported results show good performance in both family reproducibility and computational efficiency.
I have a few minor comments requesting additional details:
- Does the quality-check step only compute statistics, or is it also used to filter/clean the input set? If so, please specify the criteria and whether filtered sequences are excluded downstream.
- Which MMseqs2 clustering mode is used (set cover, connected components, or greedy)? Can this be changed within the pipeline? If configurable, please indicate the relevant parameters.
- In the reproducibility benchmark, you use DIAMOND BLASTp to assess similarity between the initial sequence set for the selected families and additional Swiss-Prot sequences. Which sequence identity and alignment coverage (if any) thresholds were applied?
- Counts and coverage (p. 6): You state that "These 709 families captured 96.66% of the original unique sequence identifiers (103,385 out of 106,959)". However, a few lines above, the final input set is reported as 169,605 unique protein sequences. Could you please clarify the initial number of sequences and the actual coverage after family generation and redundancy reduction?
- Figures S1 and S2 are difficult to read due to low resolution.
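The discrepancy raised in the counts-and-coverage comment can be made concrete with a short calculation. The sketch below only re-does the arithmetic on the three figures quoted in that comment; the variable names are illustrative, and which denominator is the appropriate one is exactly the question left to the authors.

```python
# Figures quoted in the reviewer's comment; nothing here is produced by the pipeline.
members_in_families = 103_385   # sequences captured by the 709 families
quoted_denominator = 106_959    # "original unique sequence identifiers"
final_input_set = 169_605       # "unique protein sequences" reported a few lines earlier


def coverage_percent(covered: int, total: int) -> float:
    """Return covered/total as a percentage rounded to two decimals."""
    return round(100 * covered / total, 2)


# Coverage as stated in the manuscript.
quoted_coverage = coverage_percent(members_in_families, quoted_denominator)

# Coverage if the 169,605-sequence input set were the denominator instead.
input_set_coverage = coverage_percent(members_in_families, final_input_set)

print(f"against {quoted_denominator:,}: {quoted_coverage}%")
print(f"against {final_input_set:,}: {input_set_coverage}%")
```

The first value reproduces the 96.66% in the manuscript, while the second is considerably lower, which is why the choice of denominator needs to be stated.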
Reviewer 1: Vikram Alva
The authors present nf-core/proteinfamilies, a standardized Nextflow workflow that constructs protein families de novo or classifies sequences against existing families. Using a curated 200-family benchmark and a UniRef90-scale run, the authors show that the pipeline attains high recall with efficient runtimes. Given the ever-increasing size of sequence databases, this work is timely and fills a practical gap in reproducible, at-scale family curation; I expect it to be adopted widely by many research groups.
I have several comments and suggestions below:
- In my view, this workflow will, by construction, yield a mixture of families: some anchored on a single conserved domain/segment, others centered on recurrent multi-domain cores, and some that capture the full-length sequence. This differs from widely used family databases: Pfam is largely domain-level, whereas HAMAP and NCBIFAM are mostly full-length/isofunctional (with PANTHER sitting in between). The resulting granularity is largely determined by MMseqs2 settings (sequence identity, query/target coverage, coverage mode) and by any alignment trimming, which biases toward conserved cores. Please add a brief discussion making this explicit, with practical guidance for tuning toward full-length versus domain-centric generation of families.
I also recommend a parameter-sensitivity analysis on the 200-family set: sequence identity (30-70%), coverage thresholds (50-95%), and coverage mode (query/target/both), with and without trimming. For each setting, report (i) total families and split/merge rates per curated family, and (ii) a simple granularity readout, the proportion classified as domain-anchored, multi-domain, or full-length. This would clarify how parameter choices drive family counts and domain/full-length centricity, and help readers select defaults aligned with their use case.
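As a rough illustration of what such a sensitivity sweep could look like, the sketch below enumerates a grid of clustering settings and builds one MMseqs2 command line per setting. The grid points, file names and the `trim` field are assumptions made for this example (the review only gives the ranges), and the commands call MMseqs2 directly through its `--min-seq-id`, `-c` and `--cov-mode` options rather than through the pipeline's own parameters, which are not described here.

```python
from dataclasses import dataclass
from itertools import product

# Assumed grid points inside the ranges suggested in the review.
IDENTITIES = [0.30, 0.40, 0.50, 0.60, 0.70]
COVERAGES = [0.50, 0.65, 0.80, 0.95]
# MMseqs2 --cov-mode values: 0 = query and target, 1 = target, 2 = query.
COV_MODES = {"both": 0, "target": 1, "query": 2}
TRIMMING = [False, True]


@dataclass(frozen=True)
class Setting:
    identity: float
    coverage: float
    cov_mode: str
    trim: bool  # alignment trimming is a downstream step, recorded here only as a label

    def mmseqs_command(self, fasta: str, prefix: str, tmp: str) -> list[str]:
        """Build the clustering command; trimming does not affect this step."""
        return [
            "mmseqs", "easy-cluster", fasta, prefix, tmp,
            "--min-seq-id", f"{self.identity:.2f}",
            "-c", f"{self.coverage:.2f}",
            "--cov-mode", str(COV_MODES[self.cov_mode]),
        ]


settings = [
    Setting(identity, coverage, mode, trim)
    for identity, coverage, mode, trim in product(
        IDENTITIES, COVERAGES, COV_MODES, TRIMMING
    )
]

print(len(settings), "settings in the grid")
# "benchmark.fasta", "run_0" and "tmp" are hypothetical paths.
print(" ".join(settings[0].mmseqs_command("benchmark.fasta", "run_0", "tmp")))
```

Each setting would then be scored against the curated families (family count, split and merge rates, granularity class); that scoring is not shown because it depends on details of the benchmark that the review does not specify.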
In the results, the splits/misses are concentrated in Pfam/PANTHER, while HAMAP/NCBIFAM are much closer to one-to-one (HAMAP 50/50). This suggests the inflated family count is driven, in part, by the domain-centric portion of the benchmark rather than the method itself. Please add a brief note in the Discussion to make this explicit.
Since AFDB has models for most UniProt entries, could these models be used as an orthogonal purity check of the generated families; e.g., map members to AFDB and ask whether they cluster to the same fold by TM-score/Foldseek (allowing full-length differences when the family is domain-anchored)?
HHsearch-based merging of divergent splits. In my view, and the authors note this, several curated families split simply because sequences are very divergent. An optional HHsearch (HMM-HMM) pass could merge these back: merge only at high probability (≈≥95%) with reciprocal coverage of the shorter model (≥0.6). It would be useful to include this as an optional stage in the pipeline.
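One possible reading of this merge rule is sketched below. It assumes that HMM-HMM hits have already been parsed into plain records (the field names `query`, `target`, `prob` and `aligned_cols` are hypothetical, not HHsearch output columns), that "reciprocal" means the hit must pass in both directions, and that coverage is the number of aligned columns divided by the length of the shorter model. Families linked by passing hits are grouped with a small union-find.

```python
from collections import defaultdict

MIN_PROB = 95.0   # probability threshold suggested in the review (percent)
MIN_COV = 0.6     # minimum coverage of the shorter model


def passes(hit: dict, model_len: dict) -> bool:
    """True if a single directed hit meets both thresholds."""
    shorter = min(model_len[hit["query"]], model_len[hit["target"]])
    return hit["prob"] >= MIN_PROB and hit["aligned_cols"] / shorter >= MIN_COV


def merge_families(hits: list, model_len: dict) -> list:
    """Group families connected by hits that pass in both directions."""
    good = {(h["query"], h["target"]) for h in hits if passes(h, model_len)}
    parent = {family: family for family in model_len}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, target in good:
        if query != target and (target, query) in good:
            parent[find(query)] = find(target)

    groups = defaultdict(set)
    for family in model_len:
        groups[find(family)].add(family)
    return sorted(sorted(group) for group in groups.values())


# Toy data: famA and famB pass reciprocally; famA and famC fail on coverage.
model_len = {"famA": 200, "famB": 120, "famC": 300}
hits = [
    {"query": "famA", "target": "famB", "prob": 98.2, "aligned_cols": 90},
    {"query": "famB", "target": "famA", "prob": 97.5, "aligned_cols": 85},
    {"query": "famA", "target": "famC", "prob": 99.0, "aligned_cols": 100},
    {"query": "famC", "target": "famA", "prob": 99.0, "aligned_cols": 100},
]
merged = merge_families(hits, model_len)
print(merged)
```

Requiring both directions to pass is the stricter choice; a one-directional rule would merge more aggressively and is a reasonable alternative if the reciprocal requirement was meant to apply to coverage only.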
Optional annotation of de novo families. I think it would be useful to add an annotation step that compares each de novo family (family HMM or MSA) against curated resources (Pfam, NCBIFAM, PANTHER/HAMAP).
Could you briefly outline your expectations for how the pipeline handles transmembrane segments, coiled-coils, repeats, and IDRs, classes prone to over-splitting under MMseqs2 seeding and trimming due to short-motif signal, low complexity, and variable lengths?
