Discovery of Expression-Governing Residues in Proteins

Fan Jiang
Mingchen Li
Banghao Wu
Liang Zhang
Bozitao Zhong
Yuanxi Yu
Liang Hong

This article has been reviewed by the following groups


Listed in

Evaluated articles (Arcadia Science)

Abstract

Understanding how amino acids influence protein expression is crucial for advances in biotechnology and synthetic biology. In this study, we introduce Venus-TIGER, a deep learning model designed to accurately identify amino acids critical for expression. By constructing a two-dimensional matrix that links model representations to experimental fitness, Venus-TIGER achieves improved predictive accuracy and enhanced extrapolation capability. We validated our approach on both public deep mutational scanning datasets and low-throughput experimental datasets, where it performed notably well relative to traditional methods. Venus-TIGER exhibits robust transferability in zero-shot prediction scenarios and enhanced predictive performance in few-shot learning, even with limited experimental data. This capability is particularly valuable for protein design aimed at enhancing expression, where generating large datasets can be costly and time-consuming. Additionally, we conducted a statistical analysis to identify expression-associated features, such as sequence and structural preferences, distinguishing between those linked to high and low expression. Our investigation also revealed a correlation among stability, activity and expression, providing insight into their interconnected roles and underlying mechanisms.

Arcadia Science
Feb 7, 2025

Very interesting work! I’m curious about the effects of using training data from multiple expression systems (bacteria, fungi, mammalian cells), particularly since expression requirements can vary slightly between organisms. Have you explored whether expression system-specific models perform better when predicting expression within a given system? Or, is the training data biased toward one particular expression system, potentially leading to worse predictions for others? Or has the model really learned general features of expression across these organisms? Great work!

Version published to 10.1101/2025.01.06.631498 on bioRxiv
Jan 7, 2025

