Proteoform Search from Protein Database with Top-Down Mass Spectra: Algorithms and Evaluation

Abstract

In this paper, we propose a new search algorithm for proteoform identification that computes the largest-size error-correction alignments between a protein mass graph (PMG) and a spectrum mass graph (SMG). We also design a filtering algorithm. The combined method uses the filtering algorithm to select candidates and then applies the search algorithm to report the final results. Our exact search method is 3.9 to 9.0 times faster than popular methods such as TopMG and TopPIC. The combined method speeds up our method sTopMG by a further factor of 6.2 without affecting search accuracy. Many search methods have been developed in the past decade, and the results they report are known to differ significantly. In the literature, there is no top-down mass spectra dataset with known corresponding protein sequences, nor is there a tool for generating simulated top-down mass spectra from input protein sequences. Although many published papers compare and evaluate search methods, they all rely on indirect measures because no real or simulated dataset of top-down spectra with known true protein sequences exists. The accuracy of existing search methods is therefore uncertain. Here we develop a pipeline for generating simulated top-down spectra from input protein sequences with modifications. To our knowledge, this is the first tool to generate simulated top-down mass spectra in a reasonable way, as indicated by a measure, the match gap distribution. Experiments on simulated datasets show that our combined method has 95% accuracy, while the best existing methods have accuracy far below this. To further evaluate the performance of existing methods, we generate a set of 55 real top-down spectra from 3 domains of a known antibody. On this real dataset, our new method has 94.2% accuracy using the deconvolution method FLASHDeconv, which is consistent with its accuracy on the simulated data.
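
The filter-then-search structure described in the abstract can be illustrated with a toy Python sketch. This is not the paper's algorithm: the mass-graph alignment is replaced by a simple count of prefix masses that match spectrum peaks, and the function names (`coarse_score`, `fine_score`, `filter_then_search`), the `keep` and `tol` parameters, and the three-protein database are all hypothetical. Only the two-stage shape (cheap filter to pick candidates, then a costlier score on the survivors) reflects the text.

```python
from bisect import bisect_left

# Monoisotopic residue masses in Da for a handful of amino acids.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "L": 113.08406}

def prefix_masses(seq):
    """Cumulative residue masses of every prefix of the sequence."""
    total, out = 0.0, []
    for aa in seq:
        total += RESIDUE_MASS[aa]
        out.append(total)
    return out

def coarse_score(seq, peaks):
    """Cheap filter (assumed): prefix masses sharing an integer bin with a peak."""
    bins = {round(p) for p in peaks}
    return sum(1 for m in prefix_masses(seq) if round(m) in bins)

def fine_score(seq, peaks, tol=0.01):
    """Stand-in for the expensive search: prefix masses within tol of a peak."""
    peaks = sorted(peaks)
    hits = 0
    for m in prefix_masses(seq):
        i = bisect_left(peaks, m - tol)  # first peak >= m - tol
        if i < len(peaks) and peaks[i] <= m + tol:
            hits += 1
    return hits

def filter_then_search(database, peaks, keep=2):
    """Keep the top `keep` proteins by coarse score, then rank them by fine score."""
    candidates = sorted(database,
                        key=lambda name: coarse_score(database[name], peaks),
                        reverse=True)[:keep]
    return max(candidates, key=lambda name: fine_score(database[name], peaks))

# Toy usage: the spectrum is built from P1, so P1 should be reported.
database = {"P1": "GASVL", "P2": "LVSAG", "P3": "GGGGG"}
peaks = prefix_masses("GASVL")
best = filter_then_search(database, peaks)
```

The design point is that the costlier score only runs on the `keep` survivors, which is where a combined method of this kind gains speed; accuracy is preserved only as long as the filter does not discard the true protein.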