ANARCII: A Generalised Language Model for Antigen Receptor Numbering


Abstract

Antigen receptor numbering allows the rapid delineation of the antigen-binding regions of antibody and T cell receptor (TCR) sequences from sequence alone. It also allows the vast diversity of antigen receptors to be compared in a consistent frame of reference. Numbering of antigen receptors is currently achieved by aligning sequences to a reference set. This approach may yield different numbering depending on the reference set used, or may fail to number query sequences derived from new species or rare sequence types. To address this problem, we have built a new numbering method (ANARCII) which requires no alignment step and is based on a Seq2Seq language model.

Our results show that ANARCII can handle the complexity that arises in experimentally collected sequencing data and can generalise to sequences that are highly dissimilar to those seen in training. In test sets designed to contain challenging and ambiguous sequence patterns, ANARCII numbering was identical to existing methods for over 99.99% of conserved residues and over 99.94% of complete CDR regions. The lightweight architecture allows numbering of over 90,000 sequences per minute on a single A100 GPU. Furthermore, the ANARCII package can be conditioned to fit rare sequence types and provide new training data for fine-tuning. We demonstrate that fine-tuned versions of ANARCII can correctly number other immunoglobulin domains such as TCRs and VNARs. Our model is freely available as a web tool ( https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/anarcii/ ), as well as a package for high-throughput numbering of next-generation sequencing data ( https://github.com/oxpig/ANARCII ).
