Unannotated Genes in Genomics: Challenges, Opportunities, and AI Solutions

Adeel Farooq
Asma Rafique
Eunyoung Han

Abstract

The rapid proliferation of next-generation sequencing (NGS) technologies has generated an unprecedented volume of genomic data, yet a substantial fraction of these sequenced genomes remains functionally uncharacterized, a phenomenon collectively termed "genomic dark matter." Unannotated genes, including hypothetical proteins (HPs), orphan and de novo genes, small open reading frames (smORFs), and non-canonical ORFs (ncORFs), constitute 40–60% of bacterial genomes, approximately 30–35% of the human proteome, and up to 43% of metagenomic protein clusters. These uncharacterized sequences represent a critical bottleneck in translating genomic data into biological insight and biotechnological innovation. This review provides a comprehensive examination of the categories of unannotated genes, the systemic challenges that perpetuate the annotation gap, and the diverse biotechnological opportunities these sequences harbor across plant, animal, microbial, medical, and industrial domains. Critically, we evaluate the transformative role of artificial intelligence (AI) in bridging this gap, encompassing protein structure prediction tools such as AlphaFold2 and ESMFold, protein and genome language models including ESM2 and DNABERT-2, deep learning-based functional inference frameworks, and high-throughput experimental validation platforms such as CRISPR perturbomics and transposon-insertion sequencing (TIS). We argue that an integrative, AI-driven approach to functional genomics is not merely advantageous but essential for realizing the full potential of the genomic revolution.

Version published to 10.20944/preprints202604.0792.v1
Apr 13, 2026

Related articles

Metagenomic-scale analysis of the predicted protein structure universe

This article has 11 authors:
1. Martin Steinegger
2. Jingi Yeo
3. Yewon Han
4. Nicola Bordin
5. Andy Lau
6. Shaun Kandathil
7. Hyunbin Kim
8. Eli Levy Karin
9. Milot Mirdita
10. David Jones
11. Christine Orengo
This article has no evaluations. Latest version: Mar 31, 2026

Horizontal Gene Transfer Between Fungi and Myxozoa: An Evolutionary Perspective

This article has 2 authors:
1. Amr G. A. Ibrahim
2. Edson A. Adriano
This article has no evaluations. Latest version: Mar 17, 2026

Biological Memory of the Genome: An Extension of the Gene Latency Framework

This article has 1 author:
1. Abdulmohsen H. Alrohaimi
This article has no evaluations. Latest version: Mar 12, 2026
