ANARCII: A Generalised Language Model for Antigen Receptor Numbering

Alexander Greenshields-Watson
Parth Agarwal
Sarah A. Robinson
Benjamin Heathcote Williams
Gemma L. Gordon
Henriette L. Capel
Yushi Li
Fabian C. Spoendlin
Broncio Aguilar-Sanjuan
Fergus Boyles
Charlotte M. Deane


Abstract

Antigen receptor numbering allows the rapid delineation of the antigen-binding regions of antibody and T cell receptor (TCR) sequences from sequence alone. It also allows the vast diversity of antigen receptors to be compared in a consistent frame of reference. Numbering of antigen receptors is currently achieved by aligning sequences to a reference set. This approach may produce different numbering depending on the reference set used, or may fail to number query sequences derived from new species or rare sequence types. To address this problem, we have built a new numbering method (ANARCII), which requires no alignment step and is based on a Seq2Seq language model.

Our results show that ANARCII can deal with the complexity that arises in experimentally collected sequencing data and can generalise to sequences that are highly dissimilar to those seen in training. In test sets designed to contain challenging and ambiguous sequence patterns, ANARCII numbering was identical to that of existing methods for over 99.99% of conserved residues and over 99.94% of complete CDR regions. The lightweight architecture allows numbering of over 90,000 sequences per minute on a single A100 GPU. Furthermore, the ANARCII package can be conditioned to fit rare sequence types and provide new training data for fine-tuning. We demonstrate that fine-tuned versions of ANARCII can correctly number other immunoglobulin domains such as TCRs and VNARs. Our model is freely available as a web tool (https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/anarcii/), as well as a package for high-throughput numbering of next-generation sequencing data (https://github.com/oxpig/ANARCII).

Version published to 10.1101/2025.04.16.648720 on bioRxiv
Apr 21, 2025
