CellBinDB: a large-scale multimodal annotated dataset for cell segmentation with benchmarking of universal models

Can Shi
Jinghong Fan
Zhonghan Deng
Huanlin Liu
Qiang Kang
Yumei Li
Jing Guo
Jingwen Wang
Jinjiang Gong
Sha Liao
Ao Chen
Ying Zhang
Mei Li

Listed in

Evaluated articles (GigaScience)

Abstract

In recent years, cell segmentation techniques have played a critical role in the analysis of biological images, especially for quantitative studies. Deep learning–based cell segmentation models have demonstrated remarkable performance in segmenting cell and nucleus boundaries, but they are typically tailored to specific modalities or require manual tuning of hyperparameters, limiting their generalizability to unseen data. Comprehensive datasets that support both the training of universal models and the evaluation of various segmentation techniques are essential for overcoming these limitations and promoting the development of more versatile cell segmentation solutions. Here, we present CellBinDB, a large-scale multimodal annotated dataset established for these purposes. CellBinDB contains more than 1,000 annotated images, each labeled to identify the boundaries of cells or nuclei, including 4′,6-diamidino-2-phenylindole, single-stranded DNA, hematoxylin and eosin, and multiplex immunofluorescence staining, covering over 30 normal and diseased tissue types from human and mouse samples. Based on CellBinDB, we benchmarked 8 state-of-the-art and widely used cell segmentation technologies/methods, and our further analysis reveals that complex cell shapes reduce segmentation accuracy while higher image gradients improve boundary detection, offering insights for refining segmentation strategies across diverse imaging scenarios.

GigaScience
Jul 8, 2025

In recent years, cell segmentation techniques have played a critical role in the analysis of biological images, especially for quantitative studies. Deep learning-based cell segmentation models have demonstrated remarkable performance in segmenting cell and nucleus boundaries; however, they are typically tailored to specific modalities or require manual tuning of hyperparameters, limiting their generalizability to unseen data. Comprehensive datasets that support both the training of universal models and the evaluation of various segmentation techniques are essential for overcoming these limitations and promoting the development of more versatile cell segmentation solutions. Here, we present CellBinDB, a large-scale multimodal annotated dataset established for these purposes. CellBinDB contains more than 1,000 annotated images, each labeled to identify the boundaries of cells or nuclei, including 4’,6-Diamidino-2-Phenylindole (DAPI), Single-stranded DNA (ssDNA), Hematoxylin and Eosin (H&E), and Multiplex Immunofluorescence (mIF) staining, covering over 30 normal and diseased tissue types from human and mouse samples. Based on CellBinDB, we benchmarked seven state-of-the-art and widely used cell segmentation technologies/methods, and further analyzed the effects of four cell morphology indicators and image gradient on the segmentation results.

This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf069 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer: Shan Raza

The paper presents a multimodal dataset for cell segmentation and benchmarking. The major strengths of the dataset are its multimodal nature and its inclusion of both mouse and human tissue. The paper analyses existing datasets and the performance of state-of-the-art methods. However, the authors missed one of the largest datasets for cell segmentation and classification, which includes more than 500,000 annotated nuclei in H&E images: https://www.sciencedirect.com/science/article/pii/S1361841523003079.

The CoNIC challenge paper also analyses state-of-the-art nuclei segmentation and classification methods. The authors should add one of its best-performing models to their analysis. I would also suggest that the authors include PQ and FROC among the metrics used to analyse the results, as these are commonly used for comparison in this domain. I would further suggest comparing the results with HoVerNet or HoVerNext (https://github.com/digitalpathologybern/hover_next_train), which are state-of-the-art algorithms for nuclei instance segmentation. The code for these algorithms is publicly available.
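For readers unfamiliar with PQ (panoptic quality), the following is a minimal sketch of how it is commonly computed for instance masks. It is not taken from the manuscript or the CoNIC code: the function name `panoptic_quality` and the 0.5 IoU matching threshold are assumptions chosen for illustration, and masks are assumed to be integer label images with 0 as background.

```python
import numpy as np

def panoptic_quality(gt, pred, iou_threshold=0.5):
    """Return (PQ, TP, FP, FN) for two integer label masks (0 = background).

    Illustrative helper, not the manuscript's implementation.
    PQ = sum of IoU over matched pairs / (TP + 0.5 * FP + 0.5 * FN).
    """
    gt_ids = [int(i) for i in np.unique(gt) if i != 0]
    pred_ids = [int(i) for i in np.unique(pred) if i != 0]
    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for g in gt_ids:
        g_mask = gt == g
        # Only predictions that overlap this ground-truth object can match it.
        for p in np.unique(pred[g_mask]):
            p = int(p)
            if p == 0 or p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union
            # With a threshold of at least 0.5, each object has at most one match.
            if iou > iou_threshold:
                matched_pred.add(p)
                iou_sum += iou
                tp += 1
                break
    fp = len(pred_ids) - tp  # predicted objects with no match
    fn = len(gt_ids) - tp    # ground-truth objects with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    pq = iou_sum / denom if denom > 0 else 0.0
    return pq, tp, fp, fn
```

Because PQ multiplies detection quality by segmentation quality, it penalises both missed or spurious nuclei and poorly delineated boundaries in a single number.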

Reviewer: Jeff Rhoades

General comments:

Dataset Innovation: CellBinDB offers a significant improvement over existing datasets with its diversity of staining types (DAPI, ssDNA, H&E, mIF) and broad tissue coverage, including normal and diseased samples.

Benchmarking of Models: The evaluation of seven state-of-the-art segmentation algorithms provides valuable insights for researchers selecting tools for various imaging modalities.

Analysis of Influencing Factors: The manuscript thoroughly examines biological (e.g., cell morphology) and technical (e.g., image gradient) factors affecting model performance, providing practical recommendations for improving segmentation outcomes.

Preprocessing Impact: Demonstrating the effectiveness of preprocessing (e.g., grayscale conversion for H&E images) is an immediately actionable takeaway for practitioners. However, the authors should apply preprocessing uniformly to all segmentation approaches, not just to those that initially performed poorly.

Major Areas for Improvement:

Preprocessing Uniformity:

Apply preprocessing steps uniformly across all segmentation approaches to ensure fair comparisons and avoid bias.
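As an illustration of what uniform preprocessing could look like in a benchmarking loop, here is a minimal sketch. It is an assumption-laden example rather than the authors' pipeline: the helper names `to_grayscale` and `run_benchmark` are hypothetical, the grayscale conversion uses the standard ITU-R BT.601 luminance weights (the manuscript's exact conversion is not stated here), and each segmentation method is represented as a plain callable.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to a 2D luminance image.

    Uses ITU-R BT.601 weights; this choice is an assumption for illustration.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(float) @ weights

def run_benchmark(images, methods, preprocess=None):
    """Run every method on every image with the same optional preprocessing.

    `methods` maps a method name to a callable taking an image and returning
    a mask (hypothetical interface).
    """
    results = {}
    for name, segment in methods.items():
        results[name] = [
            segment(preprocess(img) if preprocess is not None else img)
            for img in images
        ]
    return results
```

Running the loop twice, once with `preprocess=None` and once with `preprocess=to_grayscale`, yields a before/after comparison for every method rather than only for those that performed poorly on the raw images.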

Inclusion of Cellpose3 Training Dataset:

The manuscript should include the dataset used for training Cellpose3 in its comparisons. Cellpose3's superior generalist model performance is emphasized, yet the absence of its training dataset from the comparisons raises questions about the robustness of the benchmarking.

Evidence of Dataset Utility:

While the dataset's benchmarking is well-done, the manuscript does not provide evidence that models trained on CellBinDB outperform those trained on other datasets. Addressing this, though potentially out of scope, would strengthen the manuscript's impact.

Figure Panels:

Labeling in figure panels should be clearer to enhance interpretability. For instance, indicate whether the instance or semantic masks are being shown and consider making instance segmentation masks colorful to highlight unique IDs.

Semantic masks could be omitted if space is constrained, as they are largely redundant with instance masks.

Ensure figures are spaced more evenly throughout the text, ideally located near their first references, to improve readability.

Abstract Clarity:

The abstract should better reflect the intellectual contributions of the analysis of segmentation performance factors (i.e., cell morphology and image gradients).

Normalization Methods:

Provide details on how cell morphology indicators are normalized in the methods section to ensure reproducibility and clarity.

Explanation of Image Gradient:

The discussion of gradient magnitude and its calculation using the Sobel operator requires more accessible language. Not all readers will be familiar with this concept, so additional context is essential.
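To give some of that context, the sketch below shows one common way to compute a Sobel gradient magnitude for a 2D grayscale image. It is a generic illustration, not the manuscript's code: the function names are hypothetical, edge padding at the image border is an assumption, and summarising an image by its mean gradient magnitude is only one plausible choice of statistic.

```python
import numpy as np

# Sobel kernels: SOBEL_X responds to horizontal intensity changes
# (vertical edges), SOBEL_Y to vertical changes (horizontal edges).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel, using edge padding."""
    padded = np.pad(image.astype(float), 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def gradient_magnitude(image):
    """Per-pixel gradient magnitude sqrt(gx**2 + gy**2)."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy)

def mean_gradient(image):
    """Summarise an image by its average gradient magnitude (assumed statistic)."""
    return float(gradient_magnitude(image).mean())
```

Intuitively, the magnitude is large where intensity changes sharply between neighbouring pixels, such as at a crisp cell boundary, and close to zero in flat regions, so a higher value indicates stronger edge contrast.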

Tissue Classification:

Group related tissues, such as "brain," "half brain," and "cerebellum," under a common "neural tissue" category for easier interpretation and analysis.

Additional Suggestions:

Address grammatical errors and improve clarity in some sections, such as the benchmarking pipeline description.

Replace vague terms like "ML-based" when referring to CellProfiler with specific algorithmic descriptions.

Incorporating public datasets, such as the Cellpose dataset, to create a unified, all-inclusive CellBinDB might significantly enhance the resource's utility for machine learning practitioners.
Version published to 10.1093/gigascience/giaf069
Jan 1, 2025
Version published to 10.1101/2024.11.20.619750 on bioRxiv
Nov 21, 2024
