ANARCII: A Generalised Language Model for Antigen Receptor Numbering
Abstract
Antigen receptor numbering allows the rapid delineation of the antigen-binding regions of antibody and T cell receptor (TCR) sequences from sequence alone. It also allows the vast diversity of antigen receptors to be compared in a consistent frame of reference. Numbering of antigen receptors is currently achieved by aligning sequences to a reference set. This approach may yield different numbering depending on the reference set used, or may fail to number query sequences derived from new species or rare sequence types. To address this problem, we have built a new numbering method (ANARCII) which requires no alignment step and is based on a Seq2Seq language model.
Our results show that ANARCII can deal with the complexity that arises in experimentally collected sequencing data and generalise to sequences which are highly dissimilar to those in training. In test sets designed to contain challenging and ambiguous sequence patterns, ANARCII numbering was identical to existing methods for over 99.99% of conserved residues and over 99.94% of complete CDR regions. The lightweight architecture allows numbering of over 90,000 sequences per minute on a single A100 GPU. Furthermore, the ANARCII package can be conditioned to fit rare sequence types and provide new training data for fine-tuning. We demonstrate that fine-tuned versions of ANARCII can correctly number other immunoglobulin domains such as TCRs and VNARs. Our model is freely available as a web tool (https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/anarcii/), as well as a package for high-throughput numbering of next-generation sequencing data (https://github.com/oxpig/ANARCII).
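To make the agreement figures above concrete, the following is a minimal sketch of how agreement at conserved positions between two numbering methods could be scored. It does not use the ANARCII package or reproduce the authors' evaluation code. The label representation, the helper names `index_of` and `conserved_agreement`, and the choice of IMGT positions 23, 41, 104 and 118 as the "conserved" set are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical representation: one label per residue of the input sequence.
# A label is (position number, insertion code); None marks an unnumbered residue.
Label = Optional[Tuple[int, str]]

# Illustrative choice of conserved IMGT positions (assumption, not the paper's definition).
CONSERVED = (23, 41, 104, 118)


def index_of(labels: Sequence[Label], position: int) -> Optional[int]:
    """Return the sequence index assigned to `position` (no insertion code), if any."""
    for i, label in enumerate(labels):
        if label is not None and label == (position, ""):
            return i
    return None


def conserved_agreement(
    a: Sequence[Label],
    b: Sequence[Label],
    positions: Sequence[int] = CONSERVED,
) -> float:
    """Fraction of conserved positions that both numberings place on the same residue."""
    matches = 0
    for p in positions:
        ia, ib = index_of(a, p), index_of(b, p)
        if ia is not None and ia == ib:
            matches += 1
    return matches / len(positions)


# Toy example: the two methods disagree only on which residue is position 23.
method_a = [(22, ""), (23, ""), (41, ""), (104, ""), (118, "")]
method_b = [(23, ""), (24, ""), (41, ""), (104, ""), (118, "")]
print(conserved_agreement(method_a, method_b))
```

Averaging this per-sequence score over a test set gives a percentage of the kind quoted above. A stricter variant for CDRs would require every residue in a region to carry the same label under both methods.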