CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning

Curation statements for this article:
  • Curated by eLife


    Evaluation Summary:

    This manuscript describes the curation of a training dataset that will be an important resource for developers of new segmentation and deep-learning algorithms for electron microscopy data. The small size of the dataset makes it easy to use, and its broad range of image modalities ensures that the model will be applicable in many situations, making it very useful for the community.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their names with the authors.)


Abstract

Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training images; however, pre-training on currently available EM datasets is computationally expensive and shows little value for unseen biological contexts, as these datasets are large and homogeneous. To address this issue, we present CEM500K, a nimble 25 GB dataset of 0.5 × 10⁶ unique 2D cellular EM images curated from nearly 600 three-dimensional (3D) and 10,000 two-dimensional (2D) images from >100 unrelated imaging projects. We show that models pre-trained on CEM500K learn features that are biologically relevant and resilient to meaningful image augmentations. Critically, we evaluate transfer learning from these pre-trained models on six publicly available and one newly derived benchmark segmentation tasks and report state-of-the-art results on each. We release the CEM500K dataset, pre-trained models and curation pipeline for model building and further expansion by the EM community. Data and code are available at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/ and https://git.io/JLLTz.
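
The released pre-trained weights are intended to be fine-tuned on downstream segmentation tasks. The sketch below shows one plausible way to initialize a segmentation encoder from MoCoV2-style weights; the checkpoint filename, key prefixes, and ResNet-50 backbone are assumptions made for illustration, not the authors' documented interface, so consult the repository linked above for the actual loading scripts.

```python
# Minimal sketch: initializing an encoder from MoCoV2-style pre-trained weights.
# The checkpoint path, key prefixes, and ResNet-50 backbone are assumptions for
# illustration; see the CEM500K repository (https://git.io/JLLTz) for the real scripts.
import torch
import torchvision

encoder = torchvision.models.resnet50(weights=None)  # randomly initialized backbone

checkpoint = torch.load("cem500k_mocov2.pth", map_location="cpu")  # hypothetical filename
state_dict = checkpoint.get("state_dict", checkpoint)

# MoCo-style checkpoints often prefix encoder weights (e.g. "module.encoder_q.");
# strip any such prefix before loading into a plain ResNet-50.
cleaned = {}
for key, value in state_dict.items():
    for prefix in ("module.encoder_q.", "encoder_q.", "module."):
        if key.startswith(prefix):
            key = key[len(prefix):]
            break
    cleaned[key] = value

missing, unexpected = encoder.load_state_dict(cleaned, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

# The initialized ResNet-50 can then serve as the encoder of a U-Net-style
# segmentation network and be fine-tuned on a labeled EM benchmark.
```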

Article activity feed

  1. Reviewer #3 (Public Review):

    This paper presents an extensive study providing CEM500K, a large dataset, and pre-trained models for electron microscopy data. The authors provide the dataset unlabeled to support generalization problems such as transfer learning.

    Strengths:

    — The motivating problem is well defined: the lack of large and, importantly, diverse training datasets for supervised DL segmentation models for cellular EM data.

    — To overcome this issue, the authors designed a large and comprehensive dataset, CEM500K, curated from both 2D and 3D images.

    — The experimental results demonstrate the efficiency and the prominent role of this dataset in training DL models.

    Concerns:

    — Some of the claims are not well supported by proofs/references/examples. For example, the claim "The homogeneity of such datasets often means that they are ineffective for training DL models to accurately segment images from unseen experiments" would be more valuable if the authors provided supporting examples.

  2. Reviewer #2 (Public Review):

    In their manuscript "CEM500K - A large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning", the authors describe how they established and evaluated CEM500K, a new dataset and evaluation framework for unsupervised pre-training of 2D deep-learning-based pixel classification in electron microscopy (EM) images.

    The authors argue that unsupervised pre-training on large and representative image datasets using contrastive learning and other methods has been demonstrated to benefit many deep learning applications. The most commonly used dataset for this purpose is the well-established ImageNet dataset. ImageNet, however, is not representative of the structural biases observed in EM of cells and biological tissues.

    The authors demonstrate that their CEM500K dataset leads to improved downstream pixel classification results and reduced training time on a number of existing benchmark datasets and a new combination thereof, compared to no pre-training and pre-training with ImageNet.

    The data is available on EMPIAR under a permissive CC0 license, and the code on GitHub under a similarly permissive BSD 3 license.

    This is an excellent manuscript. The authors established an incredibly useful dataset, and designed and conducted a strict and sound evaluation study. The paper is well written, easy to follow and overall well balanced in how it discusses technical details and the wider impact of this study.

  3. Reviewer #1 (Public Review):

    This manuscript describes the curation of a training dataset, CEM500K, of cellular electron microscopy (EM) data including STEM, TEM of sections, electron tomography, serial section and array tomography SEM, block-face and focused-ion beam SEM. Using CEM500K to train an unsupervised deep learning algorithm, MoCoV2, the authors present segmentation results on a number of publicly available benchmark datasets. They show that the standard Intersection-over-Union (IoU) scores obtained with the CEM500K-trained MoCoV2 model, referred to as CEM500K-moco, equal or exceed the scores of benchmark segmentation results. They also demonstrate the robustness of CEM500K-moco's performance with respect to input image transformations, including rotation, Gaussian blur and noise, brightness, contrast and scale. The authors make the remarkable discovery that MoCoV2 spontaneously learned to use organelles as "landmarks" to identify important features in images, simulating human behavior to some degree.
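
As an illustration of the metric and robustness checks described in this review, the sketch below computes an IoU score between binary masks and compares predictions on an image with predictions on a rotated copy. The `model` callable, the probability threshold, and the binary-mask setup are illustrative assumptions, not the authors' evaluation code.

```python
# Minimal sketch of the Intersection-over-Union (IoU) score used to compare
# segmentations, plus a rotation-consistency check in the spirit of the
# robustness analysis described above. The model interface and threshold
# are illustrative assumptions.
import numpy as np


def iou_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """IoU between two binary masks: |A ∩ B| / |A ∪ B|."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float((intersection + eps) / (union + eps))


def rotation_consistency(model, image: np.ndarray, threshold: float = 0.5) -> float:
    """Compare the prediction on a rotated image with the rotated prediction.

    `model` is assumed to map a 2D image to a per-pixel probability map.
    """
    original = model(image) > threshold
    rotated = model(np.rot90(image)) > threshold
    return iou_score(np.rot90(original), rotated)
```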
