CellBinDB: A Large-Scale Multimodal Annotated Dataset for Cell Segmentation with Benchmarking of Universal Models
Abstract
In recent years, cell segmentation techniques have played a critical role in the analysis of biological images, especially for quantitative studies. Deep learning-based cell segmentation models have demonstrated remarkable performance in segmenting cell and nucleus boundaries; however, they are typically tailored to specific modalities or require manual tuning of hyperparameters, limiting their generalizability to unseen data. Comprehensive datasets that support both the training of universal models and the evaluation of various segmentation techniques are essential for overcoming these limitations and promoting the development of more versatile cell segmentation solutions. Here, we present CellBinDB, a large-scale multimodal annotated dataset established for these purposes. CellBinDB contains more than 1,000 annotated images, each labeled to identify the boundaries of cells or nuclei. The images span four staining types, 4’,6-Diamidino-2-Phenylindole (DAPI), Single-stranded DNA (ssDNA), Hematoxylin and Eosin (H&E), and Multiplex Immunofluorescence (mIF), and cover over 30 normal and diseased tissue types from human and mouse samples. Based on CellBinDB, we benchmarked seven state-of-the-art and widely used cell segmentation technologies/methods, and further analyzed the effects of four cell morphology indicators and image gradient on the segmentation results.
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf069 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer: Shan Raza
The paper presents a multimodal dataset for cell segmentation and benchmarking. The major strength of the dataset is its multimodal nature and its inclusion of both mouse and human tissue. The paper analyses existing datasets and the performance of state-of-the-art methods. However, the authors missed one of the largest datasets for cell segmentation and classification, which includes more than 500,000 annotated nuclei in H&E images: https://www.sciencedirect.com/science/article/pii/S1361841523003079.
The CoNIC challenge paper also analyses state-of-the-art nuclei segmentation and classification methods. The authors should add one of the best-performing models to their analysis. I would also suggest that the authors include PQ (panoptic quality) and FROC among the metrics used to analyse the results, as these are commonly used in this domain for comparison. I would also suggest comparing the results with HoVerNet or HoVerNext (https://github.com/digitalpathologybern/hover_next_train), which are state-of-the-art algorithms for nuclei instance segmentation. The code for these algorithms is publicly available.
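For readers unfamiliar with the PQ metric requested above, the following is a minimal sketch of panoptic quality for a single class of instances, using the standard definition: the summed IoU of matched pairs divided by TP + 0.5 FP + 0.5 FN, where a ground-truth and a predicted instance match when their IoU exceeds 0.5. The function name, the nested-list mask format, and the toy masks are illustrative assumptions, not part of the manuscript or of any benchmark code.

```python
from collections import Counter


def panoptic_quality(gt, pred, iou_threshold=0.5):
    """PQ for two instance label maps given as nested lists (0 = background)."""
    gt_area, pred_area, overlap = Counter(), Counter(), Counter()
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g:
                gt_area[g] += 1
            if p:
                pred_area[p] += 1
            if g and p:
                overlap[(g, p)] += 1

    iou_sum, tp = 0.0, 0
    for (g, p), inter in overlap.items():
        iou = inter / (gt_area[g] + pred_area[p] - inter)
        # With a threshold of at least 0.5, each instance can match at most once.
        if iou > iou_threshold:
            iou_sum += iou
            tp += 1

    fp = len(pred_area) - tp  # predicted instances without a match
    fn = len(gt_area) - tp    # ground-truth instances without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Toy example: one well-matched nucleus, one missed, one spurious.
gt_mask = [[1, 1, 0, 0],
           [1, 1, 0, 2],
           [0, 0, 0, 2]]
pred_mask = [[1, 1, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 3, 3]]
print(panoptic_quality(gt_mask, pred_mask))
```

In this toy case instance 1 is matched with IoU 0.75, while ground-truth instance 2 and predicted instance 3 overlap too little to match, so the score is penalised for one false negative and one false positive.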
Reviewer: Jeff Rhoades
General comments:
Dataset Innovation: CellBinDB offers a significant improvement over existing datasets with its diversity of staining types (DAPI, ssDNA, H&E, mIF) and broad tissue coverage, including normal and diseased samples.
Benchmarking of Models: The evaluation of seven state-of-the-art segmentation algorithms provides valuable insights for researchers selecting tools for various imaging modalities.
Analysis of Influencing Factors: The manuscript thoroughly examines biological (e.g., cell morphology) and technical (e.g., image gradient) factors affecting model performance, providing practical recommendations for improving segmentation outcomes.
Preprocessing Impact: Demonstrating the effectiveness of preprocessing (e.g., grayscale conversion for H&E images) is an immediately actionable takeaway for practitioners. However, the authors should apply preprocessing uniformly to all segmentation approaches, not just those that initially performed poorly.
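The uniform-preprocessing point can be illustrated with a short sketch. The review does not state which grayscale conversion the authors used, so the code below assumes the common ITU-R BT.601 luma weights; `rgb_to_gray`, `run_benchmark`, and the `models` mapping of names to segmentation callables are hypothetical names introduced only for illustration.

```python
def rgb_to_gray(image):
    """Convert an RGB image (rows of (r, g, b) tuples, 0-255) to 8-bit grayscale.

    Assumes ITU-R BT.601 luma weights; the manuscript's conversion may differ.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def run_benchmark(images, models, preprocess=rgb_to_gray):
    """Apply the same preprocessing to every image before every model.

    `models` maps a model name to a callable that takes a grayscale image
    and returns a segmentation result (hypothetical interface).
    """
    results = {}
    for name, segment in models.items():
        results[name] = [segment(preprocess(img)) for img in images]
    return results


# Usage with two stand-in "models" on a one-row image.
image = [[(255, 255, 255), (0, 0, 0)]]
models = {
    "identity": lambda gray: gray,
    "threshold": lambda gray: [[v > 128 for v in row] for row in gray],
}
print(run_benchmark([image], models))
```

Routing every model through the same `preprocess` argument makes it harder to apply a conversion to only the models that struggled on raw H&E input.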
Major Areas for Improvement:
- Preprocessing Uniformity:
- Apply preprocessing steps uniformly across all segmentation approaches to ensure fair comparisons and avoid bias.
- Inclusion of Cellpose3 Training Dataset:
- The manuscript should include the dataset used for training Cellpose3 in its comparisons. Cellpose3's superior generalist model performance is emphasized, yet the absence of its training dataset from the comparisons raises questions about the robustness of the benchmarking.
- Evidence of Dataset Utility:
- While the dataset's benchmarking is well-done, the manuscript does not provide evidence that models trained on CellBinDB outperform those trained on other datasets. Addressing this, though potentially out of scope, would strengthen the manuscript's impact.
- Figure Panels:
- Labeling in figure panels should be clearer to enhance interpretability. For instance, indicate whether the instance or semantic masks are being shown and consider making instance segmentation masks colorful to highlight unique IDs.
- Semantic masks could be omitted if space is constrained, as they are largely redundant with instance masks.
- Ensure figures are spaced more evenly throughout the text, ideally located near their first references, to improve readability.
- Abstract Clarity:
- The abstract should better reflect the intellectual contributions of the analysis of segmentation performance factors (i.e. cell morphology and image gradients).
- Normalization Methods:
- Provide details on how cell morphology indicators are normalized in the methods section to ensure reproducibility and clarity.
- Explanation of Image Gradient:
- The discussion of gradient magnitude and its calculation using the Sobel operator requires more accessible language. Not all readers will be familiar with this concept, so additional context is essential.
- Tissue Classification:
- Group related tissues, such as "brain," "half brain," and "cerebellum," under a common "neural tissue" category for easier interpretation and analysis.
Additional Suggestions:
- Address grammatical errors and improve clarity in some sections, such as the benchmarking pipeline description.
- Replace vague terms like "ML-based" when referring to CellProfiler with specific algorithmic descriptions.
- Including public datasets, such as the Cellpose dataset, to create a unified, all-inclusive CellBinDB dataset might significantly enhance the resource's utility for machine learning practitioners.
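As context for the "Explanation of Image Gradient" comment above, the gradient magnitude at a pixel measures how sharply intensity changes around it: the Sobel operator estimates the horizontal and vertical intensity differences with two 3x3 kernels, and the magnitude is the Euclidean norm of the two estimates. The following is a minimal sketch in plain Python. Summarising an image by the mean magnitude over interior pixels is an assumption made here for illustration; the review does not state how the authors aggregate the gradient, and the function names are hypothetical.

```python
import math

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) change.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_gradient_magnitude(image):
    """Per-pixel gradient magnitude of a grayscale image (nested lists).

    Border pixels have no full 3x3 neighbourhood and are left at 0.0.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    v = image[y + dy][x + dx]
                    gx += SOBEL_X[dy + 1][dx + 1] * v
                    gy += SOBEL_Y[dy + 1][dx + 1] * v
            out[y][x] = math.hypot(gx, gy)
    return out


def mean_gradient(image):
    """Mean gradient magnitude over interior pixels (assumed summary statistic)."""
    mag = sobel_gradient_magnitude(image)
    interior = [v for row in mag[1:-1] for v in row[1:-1]]
    return sum(interior) / len(interior) if interior else 0.0


# A sharp vertical edge gives a high score; a flat image gives zero.
edge = [[0, 0, 10, 10] for _ in range(4)]
flat = [[5, 5, 5, 5] for _ in range(4)]
print(mean_gradient(edge), mean_gradient(flat))
```

Images with crisp, high-contrast boundaries score high on such a measure, while blurred or low-contrast images score low.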