Subcellular Region Morphology Reflects Cellular Identity


Abstract

In multicellular organisms, various cells perform distinct physiological and structural roles. Traditionally, cell identity has been defined through morphological features and molecular markers, but these methods have limitations. Our study explores the potential of subcellular morphology to define cellular identity and predict molecular differences. We developed workflows to identify subcellular regions in different cell lines, using convolutional neural networks (CNNs) to classify these regions and, finally, to quantify morphological distances between cell types. First, we demonstrated that subcellular regions could accurately distinguish between isolated cell lines and predict cell types in mixed cultures. We extended this approach to predict molecular differences by training networks to identify human dermal fibroblast subtypes and correlating morphological features with gene expression profiles. Further, we applied pharmacological treatments to induce controlled morphological changes, validating the ability of our approach to detect these changes. Our results showed that subcellular morphology can be a robust indicator of cellular identity and molecular characteristics. We observed that features learned by networks to distinguish specific cell types could be generalized to quantify distances between other cell types. Networks focusing on different subcellular regions (nucleus, cytosol, membrane) revealed distinct morphological features correlating with specific molecular changes. This study underscores the potential of combining imaging and AI-based methodologies to enhance cell classification without relying on markers or destructive sampling. By quantifying morphological distances, we provide a quantitative characterization of cell subtypes and states, offering valuable insights for regenerative medicine and other biomedical fields.

Article activity feed

  1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


    Reply to the reviewers

    Reviewer #1 comments

    Major comments:

    - Neither data nor code was made available for review. There's only a mention of them being in Figshare with no link. As a consequence and a matter of principle, this study is not publishable without both public data and code. I would recommend using adequate repositories for data and code. Image data can be deposited in a public image data repository such as the BioImage Archive which would ensure that minimal metadata are provided and code could go to a public code repository (e.g. GitLab...) so that it is discoverable and eventual changes can be tracked and visible (for example should any bug be fixed after publication). Also consider depositing the models into the BioImage Model Zoo (https://bioimage.io).

    We will upload all the code used in the article to GitHub, and the image data will be deposited in the BioImage Archive, as suggested by the referee. The Methods section will also be rewritten.

    - The use of the term morphology is misleading. Like I expect most readers would, I understand morphology in this context as being related to shape. However, there is no indication that any specific type of information (like shape, texture, size/scale...) is used or learned by the described method. To understand what information the classifiers rely on, it would be interesting to compare with human engineered features extracted from the same ROIs.

    All references to morphology in the text must be removed unless indication can be provided as to what type of information is used by the models.

    We understand the concern regarding the use of "morphology" and will revise the manuscript to be more precise. Instead of referring broadly to "morphology," we will specify "image-derived features" or "texture and structural features" where applicable.


    Additionally, to address this concern directly, we have performed an analysis comparing our learned features to classical human-engineered features (such as texture and shape descriptors) to better understand what type of information is utilized by the model. These results will be incorporated into the revised manuscript.
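
    To make the planned comparison concrete, here is a minimal sketch (not the authors' actual pipeline) of how classical texture descriptors could be computed from the same brightfield ROIs and compared with the CNN-learned features. It assumes scikit-image >= 0.19 (where the GLCM functions are named graycomatrix/graycoprops); the ROI and the chosen descriptors are illustrative.

    ```python
    import numpy as np
    from skimage.feature import graycomatrix, graycoprops

    def texture_features(roi_uint8):
        """Classical GLCM texture descriptors for one grayscale ROI (uint8)."""
        glcm = graycomatrix(roi_uint8, distances=[1], angles=[0, np.pi / 2],
                            levels=256, symmetric=True, normed=True)
        return {prop: float(graycoprops(glcm, prop).mean())
                for prop in ("contrast", "homogeneity", "energy", "correlation")}

    # Placeholder ROI standing in for one subcellular brightfield patch
    roi = (np.random.rand(64, 64) * 255).astype(np.uint8)
    print(texture_features(roi))
    ```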

    - The method should be described with more details:

    - How are the window sizes to use determined? Are the two sizes listed in the methods section used simultaneously? What is the effect of this parameter on the performance?

    - How are the ROIs determined? In a grid pattern? Do they overlap? i.e. how does the windowing function work?

    - Predictions seem to be made at the ROI level but it isn't clear if this is always the case. Can inference be made at the level of individual cells?

    Window Sizes: We will clarify that the two window sizes were chosen based on empirical performance assessments. We will include a specific figure evaluating the impact of window size on classification performance by expanding the analysis to multiple window sizes and numbers of training regions.

    ROI Determination: We will describe the ROI selection thoroughly in the Methods section. We will include a comparison between overlapping and non-overlapping grid selection.
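
    As an illustration of the windowing described above, the sketch below extracts square ROIs on a grid with and without overlap; the window size, overlap, and image are placeholders rather than the values used in the study.

    ```python
    import numpy as np

    def extract_rois(image, window=64, overlap=0.0):
        """Slide a square window over the image and return the resulting ROI patches.

        overlap=0.0 gives a non-overlapping grid; overlap=0.5 shifts the window
        by half its width at each step (values are illustrative).
        """
        step = max(1, int(window * (1.0 - overlap)))
        rois = []
        for y in range(0, image.shape[0] - window + 1, step):
            for x in range(0, image.shape[1] - window + 1, step):
                rois.append(image[y:y + window, x:x + window])
        return np.stack(rois)

    img = np.random.rand(512, 512)                     # placeholder brightfield image
    grid = extract_rois(img, window=64, overlap=0.0)   # non-overlapping grid: (64, 64, 64)
    dense = extract_rois(img, window=64, overlap=0.5)  # 50% overlap: (225, 64, 64)
    ```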

    Inference at the Cell Level: While predictions are made at the ROI level (we will clarify the text), we will discuss an additional approach that aggregates ROI-level predictions into a final cell-level classification, which we will add as an optional post-processing step.
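
    As a sketch of that optional post-processing step, ROI-level predictions could be pooled into one call per cell by majority vote, assuming each ROI can be assigned to a cell (for example via a nuclear or cell segmentation mask); the function and data below are hypothetical.

    ```python
    from collections import Counter

    def cell_level_prediction(roi_predictions, roi_to_cell):
        """Aggregate ROI-level class labels into one label per cell by majority vote.

        roi_predictions: predicted class label for each ROI.
        roi_to_cell: cell ID for each ROI (e.g., from a segmentation mask).
        """
        votes = {}
        for label, cell_id in zip(roi_predictions, roi_to_cell):
            votes.setdefault(cell_id, Counter())[label] += 1
        return {cell_id: counts.most_common(1)[0][0] for cell_id, counts in votes.items()}

    # Three ROIs belong to cell 1 and two to cell 2
    print(cell_level_prediction(["A", "A", "B", "B", "B"], [1, 1, 1, 2, 2]))  # {1: 'A', 2: 'B'}
    ```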

    - What would be the advantages of the proposed subcellular approach compared to learning to classify whole images?

    We will detail a comparison between subcellular and whole-image classification; the main advantage of the subcellular technique (which will be highlighted in the text) is the reduction in the number of images required to learn to classify cell types. Other advantages are robustness to confluency variations (whole-image classification can be biased by confluency differences, while subcellular regions focus on individual cell features) and fine-grained feature learning.

    - When fluorescent markers are used, the text isn't clear on what measures have been taken to prevent these markers from bleeding through into the brightfield image. To rule out the possibility that the models learn from bleed-through of the marker into the brightfield image, the staining should be performed after the brightfield image acquisition. Without this, conclusions of the related experiments are fatally flawed.

    We appreciate this important point and confirm that all fluorescent staining was performed after brightfield image acquisition, ensuring that no fluorescence contamination influenced model training. We will explicitly state this in the Methods section.

    - How robust are the models e.g. with respect to culture age and batch effects? Use of a different microscope is mentioned in the methods section. This should be shown, i.e. can a model trained on one microscope accurately predict on data acquired from a different microscope? Does mixing images from different sources for training improve robustness?

    We have used different cellular batches without any effect on accuracy. We will also include the experiment using another microscope, and we will add new data with and without mixing of images from different sources. In summary, we will include a new supplementary figure that addresses the use of distinct and mixed cellular batches and microscopes in terms of accuracy and trained models.

    - Why not use the Mahalanobis distance in feature space? This would be the natural choice given that PCA has been selected for visualization and would allow to show uncertainty regions in the PCA plots. Could other dimensionality reduction methods show better separation of the groups? Why not train the network for further dimensionality reduction if the goal is to learn a useful feature space?

    We appreciate this suggestion and will include a comparison of Mahalanobis distance-based classification with our existing approach. Regarding dimensionality reduction, we will test additional methods, including t-SNE and UMAP, as supplementary figures. Finally, while training a network specifically for dimensionality reduction is an interesting alternative, our current pipeline was focused on simplicity and on the broad range of techniques it allows us to address. However, we will include a discussion of potential future directions where such an approach could be explored.
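
    For reference, a minimal sketch of the Mahalanobis distance the reviewer suggests, computed between a query point and a class distribution in PCA-reduced feature space; the feature values are synthetic placeholders.

    ```python
    import numpy as np
    from scipy.spatial.distance import mahalanobis

    rng = np.random.default_rng(0)
    class_a = rng.normal(loc=[0, 0], scale=[1.0, 0.5], size=(200, 2))  # placeholder PCA scores
    query = np.array([1.5, 0.5])                                       # placeholder query point

    mean = class_a.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(class_a, rowvar=False))
    print(f"Mahalanobis distance to class A: {mahalanobis(query, mean, cov_inv):.2f}")
    ```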

    Minor comments:

    - Make sure the language used is clear, e.g. The text describes the method as involving a transformation to black and white followed by thresholding. This doesn't make sense. What is meant by "the set of 300 genes was subjected to Gene Ontology"? Use percent instead of permille in the text for easier reading.

    These minor changes will be addressed in the text, including reporting percentages instead of permille, as this was a common point raised by the referees.

    - To provide more context, cite previous work that indicates that brightfield images contain exploitable information, e.g.

    - Cross-Zamirski, J.O., Mouchet, E., Williams, G. et al. Label-free prediction of cell painting from brightfield images. Sci Rep 12, 10001 (2022). https://doi.org/10.1038/s41598-022-12914-x

    - Harrison PJ, Gupta A, Rietdijk J, Wieslander H, Carreras-Puigvert J, et al. (2023) Evaluating the utility of brightfield image data for mechanism of action prediction. PLOS Computational Biology 19(7): e1011323. https://doi.org/10.1371/journal.pcbi.1011323

    We will cite these references in the introduction of the paper.


    Reviewer #2

    Major comments:

      • Place this study in context of previous studies that classify cell types. Here are two relevant recent papers, which could provide a good start for properly crediting previous work and placing your contribution in context: PMID: 39819559 (note the "Nucleocentric" approach) and PMID: 38837346. Please seek for papers that use label free for similar applications (which is the main contribution of the current manuscript).*

    We appreciate this suggestion (shared by reviewer #1) and will include references to these and other relevant studies on label-free cell classification. We will specifically discuss how our approach differs from the "nucleocentric" method in PMID: 39819559 and how our method complements existing work in label-free imaging. We will update both the Introduction and Discussion sections to reflect this improved contextualization.

    • Many experiments were performed, but we found it hard to follow them and the logic behind each experiment. Please include a Table summarizing the experiments and their full statistics (see below) and also please provide more comprehensive justifications for designing these specific experiments and regarding the experimental details. This will make the reading more fluent.*

    We will include a summary table in the Methods section that provides an overview of all experiments, detailing:


    -The purpose of each experiment

    -The dataset used

    -The number of images/cells

    -Objective used

    -Cellular confluence

    -Reference to BioImage Archive

    -Model used (reference to Github)

    -Technical / Biological replicates

    -The main conclusions drawn

    -Figure that presents the data


    Additionally, we will revise the Results section to provide clearer justifications for each experiment, improving the logical flow of the manuscript.

    • The experiments, data acquisition and data reporting details are lacking. 10x objective is reported in the Results and 20x in the Methods. Please explain how the co-culturing (mixed) experiments were performed including co-culturing experiments with varying fractions of each cell type and on what data were the models trained on (Fig. 2F). Differential confluency experiments are not described in the Methods (and not on what confluency levels were the models trained on), this is also true for the detachment experiment. How many cells were acquired in each experiment (it says "20 and 40 images per cell line" but this is a wide range + it is not clear how many cells appear in each image)? How many biological/technical replicates were performed for each experiment? Please report these for each experiment in the corresponding figure legend and show the results on replicates (can be included as Supplementary). "Using a different microscope with the same objective produced similar results (data not shown)" (lines #370-371), please report these results (including what is the "different microscope") in the SI.*

    We will carefully review and expand the Methods section to provide complete details, together with the summary table we will prepare to address the previous comment and this one. In addition, we will explicitly describe how cell fractions were varied in the co-culturing experiments and how the training data for Fig. 2F were generated. The differential confluency and detachment experiments will be fully described, including the confluency levels used during model training. The data from the second microscope will be added as part of the new figure mentioned in the response to reviewer #1.

    • The machine learning details are lacking. The train-validation-test strategy is not described, which could be critical in excluding concerns for data leakage (e.g., batch effects) which could be a major concern in this study. It is not always clear what network architecture was used. What were the parameters used for training? Accuracy is reported in % (and sometimes in an awkward representation, 990‰). Proper evaluation will use measurements that are not sensitive to unbalanced data (e.g., ROC-AUC). What are the controls (i.e., could the accuracy reported be by chance?). Reporting accuracy at the pixel/patch level and not at the cell level is a weakness. Estimation of cell numbers (in methods) is helpful but I did not see when it was used in the Results - a better alternative is using fluorescent nuclear markers to move to a cell level (not necessary to implement if it was not imaged).*

    We will significantly expand the machine learning Methods and Results sections, providing:


    -A detailed description of the train-validation-test split strategy (which explicitly rules out batch effects as a confounding factor), and a clarification of the network architecture used for the different tasks and its parameters (the same architecture throughout).

    -We will expand the evaluation metrics, including ROC-AUC scores to account for class imbalance, and add baseline models as controls in a new supplementary figure, ensuring that model performance is not due to chance (see the sketch after this list).

    - Accuracy will be reported as a percentage instead of permille, as suggested by the other referees.

    - We will clarify the use of cell number estimation in the specific figures in which we use it, including new data in the first figure for the generalization of patch-to-cell estimation.
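
    As a sketch of the planned evaluation additions (with placeholder labels and scores, not the manuscript's data), ROC-AUC can be reported alongside accuracy, and a shuffled-label control gives a chance-level reference.

    ```python
    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    rng = np.random.default_rng(0)
    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                   # placeholder ROI labels
    y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2])  # placeholder predicted scores

    print("Accuracy:", accuracy_score(y_true, (y_score > 0.5).astype(int)))
    print("ROC-AUC:", roc_auc_score(y_true, y_score))             # insensitive to class imbalance

    # Chance-level control: shuffle the labels and recompute the score
    print("Shuffled-label ROC-AUC:", roc_auc_score(rng.permutation(y_true), y_score))
    ```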

    • Downstream analyses lacking sufficient information to enable us to follow and interpret the results, please provide more information.*

    • The PCA ellipses visualizations reference to previous papers. Please explain what was done, how the ellipses were calculated and from how much data? If they are computed from a small number of data points - please show the actual data. It would also be useful to briefly include the information regarding the representation and dimensionality reduction in the Results and not only in the Methods. No biologically-meaningful interpretation is provided - perhaps providing cell images along the PCs projections can help interpret what are the features that distinguish between different experimental conditions.*

    We will include a clearer explanation of PCA and dimensionality reduction in both the Methods and Results sections, add the Mahalanobis distance as an additional metric, provide an improved visualization for interpretation, and add a supplementary figure with a t-SNE reduction. We will update the figure to include real subcellular images that help with the biological interpretation of the results.
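
    As an illustration of the planned dimensionality-reduction comparison, the sketch below reports the variance explained by the first two principal components and computes a t-SNE embedding with scikit-learn; the feature matrix is a random placeholder standing in for the CNN feature vectors.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    features = np.random.rand(500, 256)         # placeholder CNN features (ROIs x dimensions)

    pca = PCA(n_components=2).fit(features)
    print("Variance explained by PC1/PC2:", pca.explained_variance_ratio_)
    pca_embedding = pca.transform(features)     # coordinates used for the ellipse plots

    tsne_embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    print(tsne_embedding.shape)                 # (500, 2)
    ```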

      • How were the pairwise accuracies calculated? How did the authors avoid potential batch effects driving classification.*

    We have used different cellular batches without any effect on accuracy. In the revised manuscript, we will clarify the batch normalization techniques used in training and include additional control analyses ensuring that batch effects are not driving classification results (the new figure suggested by reviewer #1, with mixed and separate cellular batches).

      • "suggesting that the current workflow can handle four cell lines simultaneously" (lines #126-127) - how were the cell lines determined for each analysis? We assume that the performance will depend on the cell types (e.g., two similar morphology cell types will be hard to distinguish). Fig. 2F is not clear: the legend should report a mixture of four cell types, and this should be translated to clear visualization in the figure panel itself: what do the data points mean? Where are the different cell types?*

    We will include additional experiments with other cell lines, and we will explicitly describe the rationale for cell line pairings, considering morphological similarities. Fig. 2F will be redesigned for clarity, ensuring data points are clearly labeled by cell type.

      • Lines 232 and onwards use #pixels as a subcellular size measurement when referring to cell nucleus, cytoplasm and membrane, please report the actual physical size and show specific examples of these patches. This visualization and analysis of patch sizes should appear much earlier in the manuscript because it relates to the method's robustness and interpretability.*

    We will explicitly report patch sizes in microns and include a supplementary figure illustrating different subcellular regions to enhance interpretability.

      • Analysis of co-cultured (mixed) experiments is not clear. Was the fluorescent marker used to define ground truth? Was the model trained and evaluated on co-cultures or trained on cultures of a single cell type and evaluated on mixed cultures? We assume that the models were still evaluated on the label-free data? "...obtain subcellular ROIs only from regions positive in the red channel. Using these labeled ROIs,.." (138-139) - shouldn't both positive and negative ROIs be used to have both cell types? What are the two quantifications in the bottom of Fig. 1E? Did the "labeled cells" trained another classifier for the fluorescent labels?*

    We will clarify both the Methods and Results sections regarding the co-culture experiment from the first figure. In that specific case, the model learned from positive ROIs only, in order to demonstrate that this approach can also be applied to a mixed culture. For clarity, we will move this experiment to a supplementary figure.

      • Please interpret the results from Fig. 3C-D - should we expect to see passage-related changes in cells (that lead to deterioration in classification) or is it a limitation of the current study?*

    We will explicitly discuss whether passage-related changes affect cell morphology. In addition, we will include new RNA-seq data comparing passage and batch effects, in order to correlate them with the image-based deterioration, as part of the figure.

      • In general, as we mentioned a couple of times. It would be useful to visualize different predictions (or use explainability methods such as GradCam) to try to interpret what the model has learned.*

    We will perform a Grad-CAM analysis highlighting which subcellular regions contribute most to classification, improving interpretability.
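
    For illustration only, a minimal Grad-CAM sketch assuming a Keras/TensorFlow classifier; the model object and the name of its last convolutional layer are assumptions, and the actual analysis may use a different framework or implementation.

    ```python
    import tensorflow as tf

    def grad_cam(model, image, conv_layer_name):
        """Grad-CAM heatmap for the top predicted class of one image of shape (H, W, C)."""
        grad_model = tf.keras.Model(model.inputs,
                                    [model.get_layer(conv_layer_name).output, model.output])
        with tf.GradientTape() as tape:
            conv_out, preds = grad_model(image[None, ...])
            top_class_score = tf.reduce_max(preds[0])      # score of the predicted class
        grads = tape.gradient(top_class_score, conv_out)
        weights = tf.reduce_mean(grads, axis=(1, 2))       # global-average-pooled gradients
        cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)[0]
        return tf.nn.relu(cam).numpy()                     # upsample to ROI size for overlay
    ```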

      • The correlation analysis between transcriptional profiles and morphological profiles is not clear. There are not sufficient details to follow the genetic algorithm (and its justification). What was the control for this analysis? Would shuffling the cells' labels (identities) and repeating the analysis will not yield a correlation?*

    We agree with the reviewer's concern. We will expand the Methods section to clarify how the correlation was calculated, as well as the genetic algorithm and its justification. We will perform a control analysis using shuffled cell identities to demonstrate that the correlations do not arise by chance.
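
    As a sketch of the shuffled-identity control being proposed, a simple permutation test estimates how often a correlation at least as strong as the observed one arises by chance; the distance vectors below are placeholders.

    ```python
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    morph_dist = np.array([0.2, 0.5, 0.9, 0.4, 0.7, 0.3])  # placeholder morphological distances
    expr_dist = np.array([0.1, 0.6, 0.8, 0.5, 0.6, 0.2])   # placeholder transcriptional distances

    observed, _ = pearsonr(morph_dist, expr_dist)
    null = [pearsonr(rng.permutation(morph_dist), expr_dist)[0] for _ in range(10000)]
    p_value = np.mean(np.abs(null) >= abs(observed))
    print(f"observed r = {observed:.2f}, permutation p = {p_value:.3f}")
    ```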

    • Please use proper scientific terms. For example, "white-light microscopy" and "live cell red marker".*

    We will change the text accordingly and perform a global review of the manuscript.

    • This is a "Methods" manuscript and thus should open the source code and data, along with some examples on how to use it in order to enable others to replicate the results and to enable others to use it.*

    We acknowledge that our manuscript is closer to a "Methods" manuscript than to the general article we originally conceived; this probably explains most of the critical points raised by the referees. We will deposit the image data in the BioImage Archive with proper metadata, and we will publish our code and models on GitHub.

    • Please improve the figures. Fonts are tiny and in some places even clipped (e.g., Fig. 1D,E, Fig.2 E, E', and many more), some labels are missing (e.g., units of the color bar in Fig. 1B).*

    Figures will be redesigned accordingly.

    • Discussion. Please place this work in context of other studies that tackled a similar challenge of classifying cell types and discuss cons and pros of the different measurements. For example, there are clear benefits of using label-free data to reduce the number of fluorescent labels and enable long-term live cell imaging following a process without photobleaching and phototoxicity (Fig. 2G) but it is more difficult to interpret these differences in label-free image patches rather than fluorescently labeled single cells. One solution to bridge this gap that could be discussed is using silico labeling (PMID: 38838549).*

    The Discussion will be significantly expanded to compare our work with other methods, including in silico labeling (PMID: 38838549).

    • The idea of using the pairwise correlation distance of different cell types to model unseen cell types is interesting and promising. Why did these specific pairwise networks were used? How robust is this representation to inclusion of other/additional models?*

    As the referees are very interested in pairwise correlation distance, we will include a sensitivity analysis, testing alternative model selections to assess robustness.



    Reviewer #3

    General

    - It is often unclear if a sample in the particular experiment is a patch or a pixel. Please be more specific on this in the text.

    The manuscript will be rewritten to clarify whether each sample is a pixel or a patch.

    - It is unclear which patch size was used and if it was consistent throughout the experiments. Please add this information.

    We will include a new figure comparing different patch sizes, as suggested by reviewer #1.

    - It is often unclear which data was used for training/validation and final readout. Did you do train/val splits? Did you predict on the same data or new samples? This should be stated more specifically.

    We will clarify in the Methods section the training/testing strategy (a 90%/10% split of the same dataset), with new samples used for the final readout. All reported classification results come from the held-out set, ensuring that the model was evaluated on unseen data.
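
    For clarity, a minimal sketch of the 90%/10% split described above, using scikit-learn; the ROI stack and labels are placeholders.

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split

    rois = np.random.rand(1000, 64, 64)            # placeholder subcellular ROIs
    labels = np.random.randint(0, 2, size=1000)    # placeholder cell-type labels

    # 90% of ROIs for training, 10% held out for testing; stratify keeps class balance
    x_train, x_test, y_train, y_test = train_test_split(
        rois, labels, test_size=0.1, stratify=labels, random_state=0)
    print(x_train.shape, x_test.shape)             # (900, 64, 64) (100, 64, 64)
    ```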

    - Also, it is a little bit unclear what you mean by patch or by ROI or by region, please be more consistent and explain what you mean by adding definitions.

    We will standardize the use of these terms, keeping only "ROI".

    - Please compare your method to other approaches and to baselines (see also our comment above).

    We will compare our approach with whole-image classification, showing that our subcellular approach provides better generalization. A new supplementary analysis will explore the feasibility of alternative feature extraction techniques and their relative performance. Several baselines will be incorporated in order to assess chance-level accuracy (following the suggestions of the other reviewers).

    - In general, if possible, please add more concrete examples of how you envision your method to be used in practice. There are general ideas presented in the discussion section, but we feel those could be substantiated by more concrete implementation suggestions.

    We will provide three specific case studies in the Discussion section, demonstrating how our approach can be applied in real-world scenarios:


    -Drug Screening: Identifying cellular responses to drug treatments in high-throughput screening pipelines.

    -Stem Cell Differentiation Monitoring: Tracking changes in subcellular morphology during differentiation to assess developmental stages.

    -Cancer Cell Classification: Distinguishing between different subtypes of cancer cells in heterogeneous populations.

    Minor comments (grouped and summarized for clarification):

    General Clarifications & Wording Improvements

    Line 18: Clarify if the study is based on morphological features and specify the novelty (e.g., subcellular features).

    Lines 25 & 29: The wording suggests that the workflow was extended before being validated. Improve clarity.

    Line 92: Add a brief explanation of "subcellular region."

    We will clarify in the Introduction that our study is based on morphological features but specifically focuses on subcellular features, which distinguishes it from whole-cell analysis. We will rephrase the relevant sentences to make it clear that the workflow was first validated and then extended. We will provide a brief definition of "subcellular region" and ensure consistency throughout the manuscript.

    Experimental Setup & Methodological Details

    Lines 100-141: Clarify the use of validation and test sets, and discuss potential batch effects.

    Line 113: Missing training details (loss function, data volume, epochs).

    Line 117: Clarify if "pairwise classification" is meant.

    Line 119: Accuracy should be reported in percent instead of permille.

    Lines 136-141: Justify why two cell lines were mixed but only one was analyzed.

    We will add a clear explanation of the train-validation-test split, ensuring reproducibility and ruling out batch effects. Additional batch effect control experiments will be performed and included in Supplementary Figures as suggested by other reviewers.

    We will include training details (e.g., loss function, number of epochs, data volume) in the Methods section and reference them in the Results section for clarity. The terminology will be updated to "pairwise classification" where appropriate. We will report accuracy in percent (%) as suggested by the other reviewers. The rationale for mixing two cell lines but analyzing only one is now explicitly stated: we used a mixed culture to simulate realistic conditions but focused on one cell type to test classification specificity. Nevertheless, following another reviewer's suggestion, this experiment will be moved to a supplementary figure for clarity.

    Technical & Experimental Design Clarifications

    Line 105: Replace "white light microscopy" with "brightfield microscopy."

    Line 107: Be specific about "transformation to black and white" and "contrast thresholding algorithm."

    Line 125: Explain why performance dropped—did you try a larger network?

    Line 133: Clarify how confluency was estimated.

    "White light microscopy" will be replaced with "brightfield microscopy." The thresholding method will be explicitly described, with a reference to the Methods section where details are provided. We will discuss the possible reasons for performance drop. Confluency estimation will be described, explaining that it was calculated using automated image segmentation and validated manually.

    Data Representation & Interpretation

    Line 143-158: Clarify the ground truth—was it based on dye labeling, thresholding, or human annotation?

    Line 156: What is meant by "magnification"? Higher resolution? Different microscope? Crops?

    Lines 163-166: Sudden switch to pixels instead of ROIs—explain why.

    Line 191 & 192: If a strong correlation is claimed, include a statistical test.

    Lines 211-214: If differences are claimed, add a quantitative analysis.

    Lines 396-404: Clarify how the test set was chosen and what "in situ prediction" means.

    Lines 407-409: What do you mean by "binarizing the image"? What threshold was used?

    We will clearly explain terms such as "ground truth", "magnification", "in situ prediction", and "binarization". Consistent terminology regarding ROIs will be ensured throughout the text. Statistical analyses will be added to the correlation results and morphological feature comparisons to support these claims.

    Biological Interpretation & Feature Space Analysis

    Line 226-228: You show classification in feature space but not whether distances in feature space correlate with real-world differences between cell types.

    Line 234-236: What do you mean by "detect potentially more informative subcellular regions"?

    Line 302-303: The claimed application (estimating cell types in an unseen culture) was not shown—please add an experiment.

    We will include an experiment comparing three cell types, where two are closely related and one is more distinct, to test whether feature-space distance corresponds to real-world differences between cell types. The concept of "informative subcellular regions" will be rephrased. We will also add an experiment demonstrating the ability of our model to estimate the number of cell types in an unseen culture, as suggested.

    Figure & Visualization Improvements

    • Improve figure readability (tiny fonts, clipped text).*

    Line 653-655: Show actual data points in PCA ellipses, not just ellipses.

    Line 672-677: Add a quantification of performance differences between different categories.

    All figures will be revised for better readability, ensuring that text is legible, axes are labeled, and color bars are clear. We will overlay data points onto PCA ellipses for better visualization of feature distribution, as suggested by other reviewers. Performance differences between different experimental conditions will be quantified, with statistical comparisons provided.

    Model Training & Data Reproducibility

    Lines 386-392: Add exact details on model architecture, loss function, number of images used per experiment.

    A complete breakdown of model architecture, loss function, training set size, and validation details will be included in the Methods section, ensuring full reproducibility.


    Dimensionality Reduction & Feature Space Interpretation

    Line 438-439: Consider using UMAP or t-SNE in addition to PCA. Report variance explained by PCA components.

    Line 439-440: Provide more details on how eigenvectors were used to calculate ellipses.

    Line 442-443: Clarify which correlation method was used.

    We will include t-SNE visualizations in Supplementary Figures and report the variance explained by PCA components, as well as Mahalanobis distance, as suggested by other reviewers. The eigenvector-based ellipse calculation will be described in more detail in the Methods section, and the specific correlation metric used will be explicitly stated.
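
    In case it helps readers follow the planned Methods description, here is a minimal sketch of one common eigenvector-based ellipse construction over a 2-D point cloud; the data are synthetic and the exact procedure in the manuscript may differ.

    ```python
    import numpy as np

    def pca_ellipse(points, n_std=2.0, n_segments=100):
        """Ellipse outlining a 2-D point cloud: axes along the covariance eigenvectors,
        radii proportional to sqrt(eigenvalues) * n_std."""
        mean = points.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov(points, rowvar=False))
        theta = np.linspace(0, 2 * np.pi, n_segments)
        circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
        return mean + (circle * (n_std * np.sqrt(eigvals))) @ eigvecs.T

    rng = np.random.default_rng(0)
    cloud = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=500)
    outline = pca_ellipse(cloud)  # (100, 2) array of x, y coordinates to plot over the points
    ```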

    Code & Data Accessibility

    Line 491: Provide a direct URL to the code and data. Consider using GitHub for code and BioImage Archive for data.

    We will upload the code to GitHub and the image data to the BioImage Archive, following the reviewers' recommendation, and provide direct URLs.

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #3

    Evidence, reproducibility and clarity

    Summary

    The authors present a computational workflow that automatically classifies patches of transmission microscopy images of cultured cells into different cell types.

    Comments to the Manuscript

    General

    • It is often unclear if a sample in the particular experiment is a patch or a pixel. Please be more specific on this in the text.
    • It is unclear which patch size was used and if it was consistent throughout the experiments. Please add this information.
    • It is often unclear which data was used for training/validation and final readout. Did you do train/val splits? Did you predict on the same data or new samples? This should be stated more specifically.
    • Also, it is a little bit unclear what you mean by patch or by ROI or by region, please be more consistent and explain what you mean by adding definitions.
    • Please compare your method to other approaches and to baselines (see also our comment above).
    • In general, if possible, please add more concrete examples of how you envision your method to be used in practice. There are general ideas presented in the discussion section, but we feel those could be substantiated by more concrete implementation suggestions.

    Specific

    Line 18

    • Isn't this study also based on morphological features? Eventually, you could be more specific what the novelty is, it might be the fact that your features are subcellular?

    Lines 25 & 29

    • In general one would expect a workflow to be validated first and extended afterwards. You could improve the wording here to make this clear for the reader.

    Line 92

    • Please add a short explanation of what is meant by "subcellular region".

    Lines 100-141

    • Did you validate the classification results with a validation and test set? Maybe with cross validation? Please add more details on how this was done.
    • It could be that the model exploits batch effects of different imaging runs (e.g. different overall intensity in patches). It would be nice if this could be checked by an additional experiment.

    Line 105

    • "white light microscopy" is an unusual term, can you be more specific, e.g. bright-field?

    Line 107

    • It is unclear what a "transformation to black and white" and a "contrast thresholding algorithm" are, please be more specific (and potentially point the reader to a corresponding Methods section).

    Line 113

    • How does training work? Which loss is used? How much data? How many epochs? ... All of this information is missing which makes the study non-reproducible. Please add this here or point to an appropriate method section.

    Line 117

    • Do you mean pairwise classification?

    Line 119

    • It is unusual to use permille as a unit to report, percent is more common
    • Also, it is unclear if accuracy is the correct read-out here, are all the data sets balanced?
    • more information about the data sets could be added in a methods section and the decision to use accuracy as a measure could be explained

    Line 121

    • this hypothesis was never stated before, please explain this to the reader first and then check your hypothesis by experiments

    Line 125

    • do you have a hypothesis why the performance dropped? Did you for example try a larger network?

    Line 133

    • how is the confluence estimated?

    Lines 136-141

    • It is unclear why two cell lines were mixed when only data of one of them is used for analysis afterwards. Could you explain this in more detail or specify why this approach is used?

    Lines 143-158

    • We think you are trying to establish a ground truth here. Unfortunately, there are two things mixed here, the labeling with an additional dye combined with thresholding and human annotation. It is unclear which is considered the ground truth or if both are considered true. Could you explain this in more detail or be more specific?

    Line 156

    • What do you mean by magnification? Images with a higher resolution (different microscope with higher magnification)? Crops of the same data? Something else? Could you explain this in more detail or be more specific?

    Lines 163-166

    • Suddenly you talk about pixels instead of ROIs, where are they coming from? Maybe point the reader to a method section and explain the switch here.
    • Also, why is the pixel size cell line dependent, didn't you use the same microscope for all of them? Could you define what you mean by pixel size?
    • You say you compared different cell lines, how is this summarized in one plot? Please explain in more detail.

    Lines 177-214

    • Again, it is unclear which data was used for training, validation and analysis in the end. Please add this.

    Lines 191 & 192

    • If you claim a strong correlation please add a statistical analysis that shows this.

    Lines 211-214

    • If you claim these differences you should add a quantitative read-out with a statistical analysis. You could use distances in your representation space as a basis for this.

    Lines 226-228

    • What is shown here is that the morphological features can be used to classify cell types. You show that these classes are distant in feature space. But you don't show any correlation between the distance in feature space and the distance in real space (a.k.a how different the cell types are). It would be nice to have an experiment with at least three classes where 2 are closer to each other than to the third one. This would be a stronger claim that your features actually capture meaningful distances/differences.

    Lines 234-236

    • What do you mean by "detect potentially more informative subcellular regions within the cell"? Please describe in more detail what the training task was for the model and how you interpret the results.

    Lines 296-298

    • It is a little bit confusing what you mean here since you do train a network for each pair of cell lines. What you are describing is a foundation model. Please explain in more detail what you mean.

    Lines 302-303

    • The application you are claiming here was never shown in the experiments. Could you please add this experiment where a model estimates the number of cell types in an unseen culture.

    Line 323

    • Could you please elaborate how you would identify "specific cellular compartments"?

    Lines 323-326

    • Are there other studies that suggest that such malignant cells show features that are recognizable by your approach?

    Lines 342-365

    • Did you use biological replicates? This would be interesting and also a nice way to validate your models.

    Lines 369-373

    • Why do you claim that a similar microscope produces similar images? Can you give more details why this is relevant. And if that is the case it would be nice to show them. Maybe in some supplementary material.
    • How big is one image? How many cells can you see in one image? What is the resolution? What is the pixel size? ... Also, for which experiment did you use how many images? Please add all these details.
    • Also, please show some example images to make it clear for the reader what the data looks like. Could be done in supplementary material.

    Lines 386-392

    • Again, please add details. As it is right now the study is not reproducible. How many images were used for each experiment? How many for training, validation, analysis? Give the exact architecture of the model used. Which loss was used for training?

    Lines 396-404

    • Please add more details and clarify. How was the test set chosen? What do you mean by "in situ prediction"? What do you mean by "running ROIs"? What do you mean by "if the cell type was predicted to be more than 50 % of the times"? Was the human annotation or the live-cell marker used for the final accuracy? Humans are never unbiased.

    Lines 404-407

    • This sounds like the ground truth for a segmentation task - is this what you mean? Since you are solving a classification task this is confusing. Please clarify.

    Lines 407-409

    • This sentence is confusing and it is unclear what was done. Please clarify. Do you mean the image was binarized? If yes, which threshold was used? What do you mean by "accuracy was estimated as with the prediction"? The accuracy should be estimated by comparing the prediction to the ground truth.

    Lines 413-422

    • Please give more details. What are these specific numbers? What do you mean by "pixel size of each cell type"? The pixel size is metadata given by the microscope/image and should not be cell type specific. We also did not understand what is meant by "fitting the percentages" and what the aim of this is. Please consider rewriting this to make it more clear.

    Lines 426-430

    • Please provide the oligo sequence.

    Line 435

    • Please consider rephrasing to: "the output of the last max pooling layer"

    Lines 438-439

    • It would be interesting to visualize the data based on a different dimensionality reduction algorithm that is non-linear like UMAP or t-SNE. If you use PCA, could you give a measure on how much of the variance is captured in the first two PCs.

    Lines 439-440

    • Please give some more details on how you use eigenvectors to calculate ellipses.

    Lines 442-443

    • Please give more details on which correlation you calculated.

    Lines 447-457

    • It would be nice if you could rephrase this a little bit to make clear that the preprocessing itself stays the same but you basically establish different data sets by separating ROIs based on their distance to the closest nucleus.

    Lines 455-457

    • Please be more precise here. The networks still learn to classify patches and are not aware of the fact that these ROIs fall in a certain category. You exploit this fact afterwards for your analysis.

    Lines 464-474

    • Please add more details why this experiment is done. Why is a genetic algorithm needed? Could not the same analysis be done on the original transcriptomics data?

    Line 486

    • Do you mean technical or biological replicates? If that is the case, could you please clearly state that you report mean values and also give the standard deviation.
    • "test" should be experiment

    Line 491

    • Could you please provide a URL to the code and the data.
    • Also, it is common practice to upload code to GitHub and image data to the Bioimage Archive. Please consider doing this.

    Lines 627-633

    • Panel A could be improved by making the ROIs larger since it is hard to see them.
    • Also, please make sure that it is clear that one ROI at a time is given to the model.

    Line 638

    • What does "magnification" mean here - see above.
    • Why do you not show the same region?

    Line 640-642

    • This basically shows that your approach is as good as simple thresholding. What do you want to show with this?

    Lines 643-644

    • Please clarify. It is unclear what percentage you present here.

    Lines 652-653 (Fig. 3C)

    • Please clarify. It is unclear what statistical analysis was performed here and to what end.

    Lines 653-655

    • It would be interesting to see not only the ellipses but also the actual data points plotted.

    Lines 658-661

    • Please add a statistical analysis of what you want to show here.
    • It is clear that the correlation is not as clear for higher values on the x-axis, why is this?

    Lines 661-662

    • Please clarify. It is unclear what statistical analysis was performed here.

    Lines 662-664

    • Please add a statistical analysis of what you want to show here.

    Lines 672-677

    • Please also plot the actual data points
    • Also, if possible it would be nice to quantify the differences in performance between the different categories.

    Code and data availability

    We could not see how to access example image data. To our best knowledge it is current best practice to upload image data to the Bioimage Archive: https://www.ebi.ac.uk/bioimage-archive/

    Specifically for this kind of study the reader should have access to the training and test data that was used to train the classifier.

    We also could not see how to reproduce the analysis. To our best knowledge it is current best practice to make all code publicly accessible, e.g. in a GitHub repository.

    Please see https://www.nature.com/articles/s41592-023-01987-9 for general guidelines of publishing bioimage data and analysis.

    Significance

    The ability to use label free microscopy for extracting biologically meaningful information is very valuable and it is very interesting to learn that simple transmission microscopy contains enough information to reveal cell types. In this study the authors trained a neural network for this task and demonstrated that it works with rather high accuracy.

    In its current form, we could not access the data nor the code. We could thus not fully judge the quality of the presented work. For a future revision, access to data and code will be essential.

    We also found it difficult to judge how difficult the classification task is, because the size of the cells in the current figures does not allow one to see texture detail in the images. Since we did not manage to access the image data, we could not assess whether the classification task is very hard (and indeed requires an AI approach) or whether the differences are rather obvious and could be quantified with classical image analysis. To enable the interested reader to better assess this important information we would like to recommend to (a) add figures that allow one to better see the cells and their texture, at least for some of the cell types, and (b) provide easy download access to the raw image data.

    Along those lines, we think it would be very interesting to actually test whether training a neural network is required or whether other methods would yield similar results. For instance, we would recommend to simply compute the mean and variance of the intensities in each patch and check whether this information also can perform some of the classification tasks. Depending on the outcome of this analysis this could be either added to some of the main figures of the article or to the supplemental material.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    Summary:

    Automatic classification of single cell types and cell states in heterogeneous mixed cell populations has many applications in cell biology and screening. The authors present a machine learning workflow to distinguish between different cell types or cell states from label-free microscopy image patches of subcellular size. The authors evaluate their ability to identify different cell types and molecular profiles on many applications.

    Major comments:

    The application of classifying cell type and states from label-free data is promising and useful, but this manuscript requires major rewriting to enable us comprehensive assessment. Specifically, provide all technical details necessary for its evaluation, improve clarity and justification for the methodology used and the results obtained, and to better place this study in context of other studies in the field. Two crucial points are excluding the concern of the possibility that batch effects are contributing to the classification results and providing stronger evidence for a link between transcriptional and morphological profiles. Some efforts to interpret the classification decision making could help understand what morphological information was used for classification and reduce the concerns for the model using non-biologically meaningful information for the classification (e.g., illumination changes due to batch effects). Finally, making the source code and data publicly available would be important to enable others to apply the method (code) and to benchmark other methods (data).

    1. Place this study in context of previous studies that classify cell types. Here are two relevant recent papers, which could provide a good start for properly crediting previous work and placing your contribution in context: PMID: 39819559 (note the "Nucleocentric" approach) and PMID: 38837346. Please seek for papers that use label free for similar applications (which is the main contribution of the current manuscript).
    2. Many experiments were performed, but we found it hard to follow them and the logic behind each experiment. Please include a Table summarizing the experiments and their full statistics (see below) and also please provide more comprehensive justifications for designing these specific experiments and regarding the experimental details. This will make the reading more fluent.
    3. The experiments, data acquisition and data reporting details are lacking. 10x objective is reported in the Results and 20x in the Methods. Please explain how the co-culturing (mixed) experiments were performed including co-culturing experiments with varying fractions of each cell type and on what data were the models trained on (Fig. 2F). Differential confluency experiments are not described in the Methods (and not on what confluency levels were the models trained on), this is also true for the detachment experiment. How many cells were acquired in each experiment (it says "20 and 40 images per cell line" but this is a wide range + it is not clear how many cells appear in each image)? How many biological/technical replicates were performed for each experiment? Please report these for each experiment in the corresponding figure legend and show the results on replicates (can be included as Supplementary). "Using a different microscope with the same objective produced similar results (data not shown)" (lines #370-371), please report these results (including what is the "different microscope") in the SI.
    4. The machine learning details are lacking. The train-validation-test strategy is not described, which could be critical in excluding concerns for data leakage (e.g., batch effects) which could be a major concern in this study. It is not always clear what network architecture was used. What were the parameters used for training? Accuracy is reported in % (and sometimes in an awkward representation, 990‰). Proper evaluation will use measurements that are not sensitive to unbalanced data (e.g., ROC-AUC). What are the controls (i.e., could the accuracy reported be by chance?). Reporting accuracy at the pixel/patch level and not at the cell level is a weakness. Estimation of cell numbers (in methods) is helpful but I did not see when it was used in the Results - a better alternative is using fluorescent nuclear markers to move to a cell level (not necessary to implement if it was not imaged).
    5. Downstream analyses lacking sufficient information to enable us to follow and interpret the results, please provide more information.

    a. The PCA ellipses visualizations reference to previous papers. Please explain what was done, how the ellipses were calculated and from how much data? If they are computed from a small number of data points - please show the actual data. It would also be useful to briefly include the information regarding the representation and dimensionality reduction in the Results and not only in the Methods. No biologically-meaningful interpretation is provided - perhaps providing cell images along the PCs projections can help interpret what are the features that distinguish between different experimental conditions.

    b. How were the pairwise accuracies calculated? How did the authors avoid potential batch effects driving classification.

    c. "suggesting that the current workflow can handle four cell lines simultaneously" (lines #126-127) - how were the cell lines determined for each analysis? We assume that the performance will depend on the cell types (e.g., two similar morphology cell types will be hard to distinguish). Fig. 2F is not clear: the legend should report a mixture of four cell types, and this should be translated to clear visualization in the figure panel itself: what do the data points mean? Where are the different cell types?

    d. Lines 232 and onwards use #pixels as a subcellular size measurement when referring to cell nucleus, cytoplasm and membrane, please report the actual physical size and show specific examples of these patches. This visualization and analysis of patch sizes should appear much earlier in the manuscript because it relates to the method's robustness and interpretability.

    e. Analysis of co-cultured (mixed) experiments is not clear. Was the fluorescent marker used to define ground truth? Was the model trained and evaluated on co-cultures or trained on cultures of a single cell type and evaluated on mixed cultures? We assume that the models were still evaluated on the label-free data? "...obtain subcellular ROIs only from regions positive in the red channel. Using these labeled ROIs,.." (138-139) - shouldn't both positive and negative ROIs be used to have both cell types? What are the two quantifications in the bottom of Fig. 1E? Did the "labeled cells" trained another classifier for the fluorescent labels?

    f. Please interpret the results from Fig. 3C-D - should we expect to see passage-related changes in cells (that lead to deterioration in classification) or is it a limitation of the current study?

    g. In general, as we mentioned a couple of times. It would be useful to visualize different predictions (or use explainability methods such as GradCam) to try to interpret what the model has learned.

    h. The correlation analysis between transcriptional profiles and morphological profiles is not clear. There are not sufficient details to follow the genetic algorithm (and its justification). What was the control for this analysis? Would shuffling the cells' labels (identities) and repeating the analysis will not yield a correlation?

    1. Please use proper scientific terms. For example, "white-light microscopy" and "live cell red marker".
    2. This is a "Methods" manuscript and thus should open the source code and data, along with some examples on how to use it in order to enable others to replicate the results and to enable others to use it.
    3. Please improve the figures. Fonts are tiny and in some places even clipped (e.g., Fig. 1D,E, Fig.2 E, E', and many more), some labels are missing (e.g., units of the color bar in Fig. 1B).
    4. Discussion. Please place this work in context of other studies that tackled a similar challenge of classifying cell types and discuss cons and pros of the different measurements. For example, there are clear benefits of using label-free data to reduce the number of fluorescent labels and enable long-term live cell imaging following a process without photobleaching and phototoxicity (Fig. 2G) but it is more difficult to interpret these differences in label-free image patches rather than fluorescently labeled single cells. One solution to bridge this gap that could be discussed is using silico labeling (PMID: 38838549).
    5. The idea of using the pairwise correlation distance of different cell types to model unseen cell types is interesting and promising. Why did these specific pairwise networks were used? How robust is this representation to inclusion of other/additional models?

    Significance

    Automated classification of cell types and cell states in mixed cell populations using label-free images has important applications in academic research and in industry (e.g., cell profiling). This paper applies standard machine learning toward this technical goal, and demonstrates it on many different experimental systems, exceeding the common standard in terms of quantity and variability, and with the potential of being a nice contribution to the field. However, we were not able to properly evaluate these results due to lacking experimental and methodological details as detailed above and thus can not make a strong point regarding validity and significance before a major revision. Our expertise is in computational biology, and specifically applications of machine learning in microscopy. We are not familiar with the specific cell types, states and perturbations used in this manuscript.

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #1

    Evidence, reproducibility and clarity

    Summary:

    This paper presents a method to classify cells in brightfield images using information from subcellular regions. The approach consists in first thresholding a brightfield image then splitting the resulting binary image into small ROIs which are then fed to a CNN-based classifier. The authors demonstrate application to the identification of cell types in pure cultures and in cultures with mixed types. They then show that features learned by the classifier correlate with expression of cell type-specific genes and explore what information can be learned from networks trained on subcellular regions selected based on distance from the nucleus. The authors conclude that subcellular ROIs extracted from brightfield images contain useful information about the identity and state of the cells in the image.

    Major comments:

    • Neither data nor code was made available for review. There's only a mention of them being in Figshare with no link. As a consequence and a matter of principle, this study is not publishable without both public data and code.
    • I would recommend using adequate repositories for data and code. Image data can be deposited in a public image data repository such as the BioImage Archive which would ensure that minimal metadata are provided and code could go to a public code repository (e.g. GitLab...) so that it is discoverable and eventual changes can be tracked and visible (for example should any bug be fixed after publication). Also consider depositing the models into the BioImage Model Zoo (https://bioimage.io).
    • The use of the term morphology is misleading. Like I expect most readers would, I understand morphology in this context as being related to shape. However, there is no indication that any specific type of information (like shape, texture, size/scale...) is used or learned by the described method. To understand what information the classifiers rely on, it would be interesting to compare with human engineered features extracted from the same ROIs. All references to morphology in the text must be removed unless indication can be provided as to what type of information is used by the models.
    • The method should be described with more details:
      • How are the window sizes to use determined? Are the two sizes listed in the methods section used simultaneously? What is the effect of this parameter on the performance?
      • How are the ROIs determined? In a grid pattern? Do they overlap? i.e. how does the windowing function work?
      • Predictions seem to be made at the ROI level but it isn't clear if this is always the case. Can inference be made at the level of individual cells?
    • What would be the advantages of the proposed subcellular approach compared to learning to classify whole images?
    • When fluorescent markers are used, the text isn't clear on what measures have been taken to prevent these markers from bleeding through into the brightfield image. To rule out the possibility that the models learn from bleed-through of the marker into the brightfield image, the staining should be performed after the brightfield image acquisition. Without this, conclusions of the related experiments are fatally flawed.
    • How robust are the models e.g. with respect to culture age and batch effects? Use of a different microscope is mentioned in the methods section. This should be shown, i.e. can a model trained on one microscope accurately predict on data acquired from a different microscope? Does mixing images from different sources for training improve robustness?
    • Why not use the Mahalanobis distance in feature space? This would be the natural choice given that PCA has been selected for visualization and would allow to show uncertainty regions in the PCA plots. Could other dimensionality reduction methods show better separation of the groups? Why not train the network for further dimensionality reduction if the goal is to learn a useful feature space?

    Minor comments:

    • Make sure the language used is clear, e.g.
      • The text describes the method as involving a transformation to black and white followed by thresholding. This doesn't make sense.
      • What is meant by "the set of 300 genes was subjected to Gene Ontology"?
    • Use percent instead of permille in the text for easier reading.
    • To provide more context, cite previous work that indicates that brightfield images contain exploitable information, e.g.
      • Cross-Zamirski, J.O., Mouchet, E., Williams, G. et al. Label-free prediction of cell painting from brightfield images. Sci Rep 12, 10001 (2022). https://doi.org/10.1038/s41598-022-12914-x
      • Harrison PJ, Gupta A, Rietdijk J, Wieslander H, Carreras-Puigvert J, et al. (2023) Evaluating the utility of brightfield image data for mechanism of action prediction. PLOS Computational Biology 19(7): e1011323. https://doi.org/10.1371/journal.pcbi.1011323

    Referees cross-commenting

    I support comments from reviewers 2 and 3 around the lack of sufficient details for interpretability and reproducibility. Some of the necessary information could be communicated through well documented re-usable code and computational workflows as well as properly documented data sets.

    Jean-Karim Hériché (heriche@embl.de)

    Significance

    This is an interesting study that adds to a growing body of evidence showing that information contained in brightfield images can be usefully exploited, potentially replacing the expensive and time-consuming use of fluorescent markers and is therefore of interest to a broad audience of cell biologists.