Seqrutinator: Non-Functional Homologue Sequence Scrutiny for the Generation of Large Datasets for Protein Superfamily Analysis

Agustín Amalfitano
Nicolás Stocchi
Hugo Marcelo Atencio
Fernando Villarreal
Arjen ten Have

Abstract

Background

In recent years, protein bioinformatics has produced many good algorithms for multiple sequence alignment (MSA) and phylogeny. Little attention has been paid to sequence selection, even though complete proteomes, notably recently published ones, often contain many sequences that are partial or derive from pseudogenes. These sequences not only add noise to the MSA, the phylogeny and other downstream computational analyses; they also cause many errors in the processing of the MSAs and in downstream analyses, including the phylogeny.

Objective

This work aims to provide and test an objective, automated but flexible pipeline for the scrutiny of sequence sets from large, complex eukaryotic protein superfamilies. The pipeline should classify sequences as either functional or non-functional with high precision and recall. Specifically, it should classify no or only a few SwissProt sequences as non-functional (high precision), classify sequences from other related superfamilies as non-functional (high recall), and result in a demonstrably improved MSA (high performance).
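The two criteria above can be illustrated with a minimal Python sketch. It assumes that "non-functional" is treated as the positive class, that SwissProt sequences serve as the functional control, and that sequences from other superfamilies serve as the non-functional control. The function name `precision_recall` and all identifiers in the toy data are hypothetical and are not taken from Seqrutinator.

```python
def precision_recall(removed, functional_control, unrelated_control):
    """Precision and recall with 'non-functional' as the positive class.

    removed            -- ids the pipeline classified as non-functional
    functional_control -- ids assumed functional (e.g. SwissProt entries)
    unrelated_control  -- ids assumed non-functional (other superfamilies)
    Sequences outside both control sets are ignored.
    """
    removed = set(removed)
    functional_control = set(functional_control)
    unrelated_control = set(unrelated_control)

    true_pos = len(removed & unrelated_control)    # correctly removed
    false_pos = len(removed & functional_control)  # wrongly removed
    false_neg = len(unrelated_control - removed)   # wrongly kept

    flagged = true_pos + false_pos
    positives = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / positives if positives else 0.0
    return precision, recall


# Hypothetical toy identifiers, for illustration only.
functional = {"sp1", "sp2", "sp3", "sp4"}
unrelated = {"out1", "out2"}
removed = {"out1", "out2", "sp4", "frag7"}

precision, recall = precision_recall(removed, functional, unrelated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one functional control sequence is wrongly removed, which lowers precision, while both unrelated sequences are removed, which gives full recall.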

Results

Seqrutinator is a pipeline of five modules, written in Python3, that identify and remove sequences that are likely Non-Functional Homologues (NFH). We tested the pipeline on three complex plant superfamilies (BAHD, CYP and UGT) that act in specialized metabolism, using the complete proteomes of 16 plant species as input and SwissProt as a control. Only 1.94% of SwissProt sequences with wet-lab evidence were identified as NFH, and all sequences from other related superfamilies were removed. Most NFH sequences are partial but, interestingly, their removal results in highly improved MSAs. A few, but significant, sequences that instigate large gaps were found. The five modules show similar behaviour when applied to the 16 sequence sets of the three analysed superfamilies. Pipelines with different module orders result in similar classifications and, moreover, show that different modules often detect the same sequences.
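The comparison of module orders can be sketched as follows. This is an illustration only: the two filters, `too_short` and `has_unknown_residues`, are hypothetical placeholders and are not Seqrutinator's actual modules, the `min_fraction` cutoff is an arbitrary assumption, and the sequences are toy data. The sketch shows sequential filtering, where each module sees only the sequences that survived the previous one, and then compares the removed sets of two orders.

```python
from statistics import median


def too_short(seqs, min_fraction=0.65):
    """Hypothetical filter: flag sequences far shorter than the median."""
    cutoff = min_fraction * median(len(s) for s in seqs.values())
    return {name for name, s in seqs.items() if len(s) < cutoff}


def has_unknown_residues(seqs):
    """Hypothetical filter: flag sequences containing 'X'."""
    return {name for name, s in seqs.items() if "X" in s}


def run_pipeline(seqs, modules):
    """Apply modules in order; each one sees only the surviving sequences."""
    kept = dict(seqs)
    removed_by = {}  # sequence id -> name of the module that removed it
    for module in modules:
        if not kept:
            break
        for name in module(kept):
            removed_by[name] = module.__name__
            del kept[name]
    return kept, removed_by


def jaccard(a, b):
    """Overlap of two sets, 1.0 meaning identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy sequences, for illustration only.
toy = {
    "a": "M" * 100,
    "b": "M" * 98,
    "c": "M" * 40,             # short
    "d": "M" * 50 + "X" * 5,   # short and contains X
    "e": "M" * 95 + "X",       # contains X
}

_, removed_1 = run_pipeline(toy, [too_short, has_unknown_residues])
_, removed_2 = run_pipeline(toy, [has_unknown_residues, too_short])

print(jaccard(set(removed_1), set(removed_2)))
print(removed_1["d"], removed_2["d"])
```

In this toy example both orders remove the same sequences, but sequence `d` is attributed to a different module depending on which filter runs first, which is one way that several modules can detect the same sequence.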

Conclusion and perspective

Seqrutinator forms a consistent pipeline for sequence scrutiny that results in sequence sets that generate high-fidelity MSAs. Recovery analyses show that the method has high precision and recall.

Version published to 10.1101/2022.03.22.485366 on bioRxiv
Mar 25, 2022