qcMol: a large-scale dataset of 1.2 million molecules with high-quality quantum chemical annotations for molecular representation learning

Haoyu Wang
Ziyan Zhang
Haipeng Gong

Abstract

Recent advancements in deep learning have greatly promoted the de novo design of drugs and materials. Previous studies have shown that a well-designed molecular representation is critical for improving the accuracy of deep-learning-based molecular property prediction methods. However, the lack of large-scale data enriched with detailed physicochemical information hinders effective learning of an informative molecular representation. To fill this data gap, we introduce qcMol, a dataset consisting of 1.2 million molecules from 95 datasets with high-quality quantum chemical annotations, to facilitate molecular representation learning as well as downstream molecular property prediction. Chemicals in this dataset include drug-like compounds, metabolites and molecules with matched experimental data, covering 247,448 distinct scaffolds and a broad spectrum of molecular sizes. Each compound in qcMol is annotated with detailed quantum chemical information, obtained through reliable quantum chemical calculations based on B3LYP-D3/def2-SV(P)//GFN2-xTB as well as follow-up wave function post-analysis. These features are organized into multiple formats, allowing for flexible integration into diverse molecular representation learning frameworks. The broad data distribution, comprehensive quantum chemical annotations and flexible data formats jointly enable qcMol to serve as a pre-training resource as well as a benchmark test set for deep learning models, benefiting practical in silico drug discovery.

qcMol is freely accessible at https://structpred.life.tsinghua.edu.cn/qcmol/

Version published to 10.1101/2025.09.07.674462 on bioRxiv
Sep 12, 2025

This article has no evaluations.
Latest version: Dec 29, 2025

Related articles

LinkerMind: An Interpretable, Mechanism-Informed Deep Learning Framework for the De Novo Design of Antibody Drug Conjugate Linkers
Martins Otun
No evaluations. Latest version: Dec 19, 2025

Drug discovery guided by maximum drug likeness
Hao-Yu Zhu, Lu Xu, Wei Shi
No evaluations. Latest version: Dec 31, 2025
