DynaMorph: self-supervised learning of morphodynamic states of live cells

Abstract

A cell’s function depends on its dynamic architecture. There is growing interest in modern cell biology for methods enabling the automated analysis of morphodynamics of live cells. Here we demonstrate the power of label-free live cell imaging and self-supervised deep learning for automated analysis of complex dynamics of human microglia.

  1. Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.

    Reply to the reviewers

    General Statements

    We thank the reviewers for their thoughtful, constructive, and highly actionable critique. The reviewers mentioned that “the experiments presented are well-designed, the methods well-implemented, and communication of the authors' findings is clear and concise”. We are happy to hear that “figure presentation and manuscript layout are top notch and... these data are easy to read and interpret”.

    We appreciate the reviewers’ suggestions for improving the interpretability of the morphodynamic representation and address each of the reviewers’ comments (typeset in blue) in the document below.

    Description of the planned revisions

    Reviewer # 1 (major points)

    * The Trajectory Feature Vectors (TFVs) are averaged over time - this seems to lose a lot of the salient information in the trajectories themselves, resulting in the low(ish) accuracy of the GMM. Could a Hidden Markov Model trained on the trajectories in state space help to identify/classify those trajectories that change their morphology/motion over time?

    Thanks for the suggestion. We recognize that averaging smooths the dynamics in each cell trajectory and reduces the diversity of phenotypes. On the other hand, temporal smoothing reduces noise, especially when the cells have reached steady-state dynamics after being stimulated with pro- or anti-inflammatory cytokines. Our experiments were designed to probe steady-state dynamics, and therefore we opted to use temporal smoothing.

    It is possible to identify rare transitions even with some temporal smoothing.

    In our analysis of rare transitions (Fig. 4C), we extracted long trajectories and split them into segments (10 to 15 frames, 1.5 to 2 hours). By applying a Gaussian Mixture Model (GMM) to each segment, we obtained a sequence of states along the full trajectory, from which state transitions were identified.
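
    A minimal sketch of this segment-wise procedure follows. It assumes that each segment is summarized by the mean of its per-frame features and that a GMM has already been fitted; the helper names `segment_states` and `find_transitions`, the fixed segment length, and the synthetic two-state data are illustrative assumptions, not the pipeline's actual code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def segment_states(trajectory, gmm, seg_len=10):
    """Average features within consecutive segments, then assign a GMM state to each."""
    n_seg = len(trajectory) // seg_len
    # Drop trailing frames that do not fill a whole segment.
    trimmed = trajectory[: n_seg * seg_len]
    seg_means = trimmed.reshape(n_seg, seg_len, -1).mean(axis=1)
    return gmm.predict(seg_means)

def find_transitions(states):
    """Return (segment_index, from_state, to_state) for every change of state."""
    return [(i + 1, int(a), int(b))
            for i, (a, b) in enumerate(zip(states[:-1], states[1:])) if a != b]

# Synthetic stand-in data (hypothetical): 2-D features, two well-separated states.
rng = np.random.default_rng(0)
state_a = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
state_b = rng.normal(loc=5.0, scale=0.3, size=(200, 2))
gmm = GaussianMixture(n_components=2, random_state=0).fit(np.vstack([state_a, state_b]))

# One long trajectory: 30 frames in the first state, then 30 in the second.
trajectory = np.vstack([state_a[:30], state_b[:30]])
states = segment_states(trajectory, gmm, seg_len=10)
transitions = find_transitions(states)
print(states, transitions)
```

    Because states are assigned per segment, a transition is only localized to within one segment length, which is the time-resolution limit that a frame-level model would relax.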

    During the revision, we will employ a Hidden Markov Model (HMM) to model state transitions in the latent shape space, as suggested by the reviewer, to detect rare transitions. Our expectation is that the HMM will identify more transition events due to its higher time resolution (per frame rather than per segment), though it may also be affected by unexpected imaging artifacts and noise.

    Reviewer # 1 (minor points)

    Could the authors provide some example images showing interpolation of each PC using the generative decoder?

    Thanks for the suggestion. However, the discrete nature of the latent codebook of the VQ-VAE makes it challenging to use interpolation as a proxy for the utility of the learned representation. A possible link between interpolation abilities and the usefulness of representations learned by autoencoders has been explored by Berthelot et al. As Berthelot et al. note, “We perform interpolation in the VQ-VAE by interpolating continuous latents, mapping them to their nearest codebook entries, and decoding the result. Assuming a sufficiently large codebook, a semantically “smooth” interpolation may be possible. On the lines task, we found that this procedure produced poor interpolations. Ultimately, many entries of the codebook were mapped to unrealistic datapoints, and the interpolations resembled those of the baseline autoencoder.”

    Reviewer # 2 (major points)

    -It's unclear what the effect of speed is on the final state determination. TFVs were composed of auto-encoder-based features (PCs from latent space) and speed of the cells. Would the states be very different without speed as part of the TFVs or with TFVs consisting only of speed features? Please quantify and discuss.

    Thanks for your comment. We agree that the speed of the cell is a main factor that contributes to the clustering, though shape features (from the VQ-VAE) also contribute to the discrimination of cell states (Fig. 3B, histograms). In the revision, we will perform the clustering analysis with only shape features and compare the outcome with the current results of Fig. 4.
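
    One way such a comparison could be set up is sketched below: cluster the same trajectories with the full feature vector, with shape features only, and with speed only, then measure agreement between the resulting labels. The adjusted Rand index is an assumed choice of agreement score, and the data are synthetic, built so that speed carries all the group structure; the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

def cluster_labels(features, n_states=2, seed=0):
    """Fit a GMM to the feature vectors and return the hard state assignments."""
    gmm = GaussianMixture(n_components=n_states, random_state=seed)
    return gmm.fit_predict(features)

# Synthetic stand-in for trajectory feature vectors (hypothetical values):
# three shape PCs carrying no group structure, plus a speed column that does.
rng = np.random.default_rng(0)
n = 200
shape_pcs = rng.normal(size=(2 * n, 3))
speed = np.concatenate([rng.normal(1.0, 0.5, n), rng.normal(9.0, 0.5, n)])
full_tfv = np.column_stack([shape_pcs, speed])

labels_full = cluster_labels(full_tfv)
labels_shape = cluster_labels(shape_pcs)
labels_speed = cluster_labels(speed.reshape(-1, 1))

# An adjusted Rand index near 1 means two clusterings agree; near 0 means they are unrelated.
ari_shape = adjusted_rand_score(labels_full, labels_shape)
ari_speed = adjusted_rand_score(labels_full, labels_speed)
print(f"full vs shape-only: {ari_shape:.2f}, full vs speed-only: {ari_speed:.2f}")
```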

    Reviewer # 3 (major points)

    1. Temporal consistency regularization

    In the authors' framework, models are regularized to minimize the l2 norm between embeddings of adjacent timepoints.

    This approach is conceptually well-motivated, but could have some unintended effects.

    For instance, some cells may make a rapid state transition such that state(t-1) = A, state(t) = B, state(t+1) = A'.

    In these cases, a regularized model may best minimize the joint loss by returning an embedding at time t that interpolates between state A and A', rather than returning an embedding that reflects the true distinct state B.

    The work would be strengthened if the authors analyzed the impact of this regularization term on the detection of rapid state transitions that occur for only a few frames (e.g. when cells that exhibit filopodial motility "jump" in an actin/myosin contraction).

    This might be accomplished through experiments scanning different regularization hyperparameters on some of the authors' real data, fitting models on temporally downsampled versions of the real data where "slow" multi-timestep transitions now occur in a few timesteps, or perhaps using simulations where rapid state transitions are known to occur.

    Even if the regularization does have some negative impacts, it does not argue against the utility of the general approach, but it is important for users to understand the constraints on downstream applications.

    In our revision, we will evaluate the optimal matching loss for our dataset by training the model with a series of temporal matching loss weights. With this computational experiment, we will illustrate the trade-offs introduced by the relative strengths of matching and reconstruction losses.

    Our expectation is that with a very high matching loss weight, the embeddings (latent vectors) of the frames of the same trajectory will collapse regardless of morphology. For a relatively wide range of matching loss weights, rank relations between transition pairs ([A->B] + [B->A'] >> [A->A']) should be preserved, from which the rare transitions can be robustly identified. In our experiments, most cells had reached steady-state morphodynamics when imaged, i.e., the matching loss between two adjacent frames arises primarily from variations in background/noise. Fast transitions are “rare” in our data. Numerically, fast transitions contribute less to the matching loss during training, and therefore the distances between their latent representations are not minimized away. In other words, if B is a morphologically different state from A/A', the model is driven more by the reconstruction loss due to the morphological difference than by temporal smoothness across three consecutive frames.
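
    The trade-off described above can be written down in a few lines. The sketch below assumes the joint objective is a reconstruction error plus a weighted squared L2 distance between embeddings of adjacent frames; it omits the codebook and commitment terms of a full VQ-VAE loss, and the function names and embedding values are hypothetical.

```python
import numpy as np

def joint_loss(x, x_hat, z, weight):
    """Reconstruction MSE plus a weighted temporal matching term.

    x, x_hat: (T, ...) frames of one trajectory and their reconstructions.
    z:        (T, D) embeddings of the same frames, in time order.
    weight:   matching loss weight (the hyperparameter to be scanned).
    """
    recon = float(np.mean((x - x_hat) ** 2))
    # Squared L2 distance between embeddings of adjacent timepoints.
    matching = float(np.mean(np.sum((z[1:] - z[:-1]) ** 2, axis=1)))
    return recon + weight * matching, recon, matching

def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return float(np.sum((u - v) ** 2))

# Hypothetical embeddings for three consecutive frames: A, B, A'.
z = np.array([[0.0, 0.0], [4.0, 0.0], [0.2, 0.0]])
d_ab, d_ba, d_aa = sq_dist(z[0], z[1]), sq_dist(z[1], z[2]), sq_dist(z[0], z[2])

# Perfect reconstruction isolates the matching term.
x = np.zeros((3, 8, 8))
for w in (0.0, 0.1, 1.0):
    total, recon, matching = joint_loss(x, x, z, w)
    print(f"weight={w}: total={total:.3f} (recon={recon:.3f}, matching={matching:.3f})")
```

    In this toy case the rank relation [A->B] + [B->A'] >> [A->A'] holds by construction; the planned scan over weights would test how far it survives training on real data.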

    Baseline comparisons

    The authors evaluate their method by assessing the correlation of embedding PCs with heuristic features (Fig. 2C,D + supp.), variation of embedding PCs across cell treatment groups (Fig. 3), and qualitative interpretation of embedding trajectories.

    In the supplement, the authors compare their VQ-VAE approach to VAEs and AAEs and chose to use a VQ-VAE based on lower reconstruction error and higher PC/heuristic feature correlation.

    However, the authors do not compare their method to much simpler baseline approaches to this problem.

    Existing literature suggests that heuristic features of cell shape and motion (similar to those the authors use to evaluate the relevance of their embeddings) are sufficient to perform many of the same tasks a VQ-VAE is used for in this work.

    For instance, in Fig. 3 it appears that a simple analysis of cell centroid speed recovers much of the same information as the complex VQ-VAE embeddings.

    In Fig. 2 - Supp. 6, it appears that after regressing out many heuristic features of cell geometry, the latent space largely explains cell non-autonomous information about the background environment, suggesting the heuristic features are largely sufficient.

    To demonstrate the usefulness of their deep modeling approach relative to simple baselines, the authors should compare against existing heuristics and embeddings of heuristics (e.g. PCA) using some of the tasks shown for the VQ-VAE (recovery of perturbation state, state transition detection, qualitative trajectory analysis, discrimination of cell types).

    Heuristics might include those already calculated here, or a more comprehensive set as cited in the Introduction.

    The authors may also consider comparing against baselines that don't include time information for some of their tasks (e.g. recovery of perturbation state could arguably be achieved with CNNs that are either ignorant of the timestep or use simple temporal conditioning, without including trajectory information).

    If these features are sufficient for many of the same tasks performed in this work, the authors should provide a clear argument for readers as to why the unsupervised VQ-VAE approach may be preferable (e.g. ability to recover potentially unknown cell changes, for which no heuristic exists).

    The VQ-VAE doesn't need to be superior along every axis to hold merit, but the work would be strengthened if the authors could show clear superiority along some dimension.

    Thanks for your comments. We agree that through our exploration, specific heuristic features are found to be correlated with latent shape features. We did not start with heuristic features, but instead identified them after observing how cell morphology changes along the principal components of the latent shape space. Discovering the heuristic shape features that describe the variation in shape space, in our view, reinforces the value of self-supervised learning of complex cellular morphologies.

    We’d argue that the dynamorph pipeline complements heuristic approaches: it enables discovery of cell states through unbiased encoding and clustering, and the correlation of learned features with heuristic features enables interpretation of the cell state/data distribution more quantitatively than using either approach in isolation. Our argument is further reinforced by the related work (e.g., Zaritsky et al. and others mentioned in the introduction) on self-supervised learning of cell shape and interpretation of its latent space.

    More specifically, self-supervised learning with temporal matching generates unbiased and smooth encodings for cell morphologies, from which we identified the rank correlations between top PCs and certain geometric properties. However, this does not indicate that the set of heuristics chosen a priori will be equally descriptive of the shape distribution. For example, optical density of cells (phase) is a heuristic feature that has not been used in previous studies, which we recognized after sampling the PCs of shape space. Further identification of such correlations is by itself an interesting discovery enabled by self-supervised learning.

    In the current manuscript, we compared learned latent features (PCA on VQ-VAE latent embeddings) against a simple baseline (top PCs of raw images) and showed superior performance, which already illustrates the advantage of self-supervised learning in denoising data and extracting key diversities. In the revision, we will compare PCs of multiple heuristic features (e.g., cell size) with latent features to further strengthen the above point.

    Reviewer # 3 (minor points)

    For Fig. 4 - supp 1 -- isn't it expected that the GMM cluster of a vector can be predicted from the vector? The GMM clusters were derived from the vectors to begin with, so this seems like a bit of a circular analysis. If I'm missing something, this figure might benefit from more exposition.

    Thanks for your question. The original purpose of this confusion matrix is to parallel Fig. 3 - supp 2, showing that the GMM generated distinct cell states that describe the population better than perturbation conditions do. The confusion matrix itself is trivial, so we will evaluate how to make this point more precisely during the revision.

    For Fig. 4 - Supp 3, the authors should consider changing the "state" and "cluster" colors on the embedding projections so that they do not match. As presented, it appears as if the states and clusters were co-assayed and linked by some experimental label, when in fact the State 1::Cluster1, State 2::Cluster 2 relationship is just inferred.

    Thanks for your comment. We will change the color scheme for Fig. 4 - supp 3 in the revision to avoid confusion.

    Description of the revisions that have already been incorporated in the transferred manuscript

    Reviewer # 1 (major points)

    * The temporal matching to enforce a smooth latent space representation is interesting. The authors mention that they mask out surrounding cells with a median pixel value. Have the authors considered using a pixel weighting in the reconstruction/matching loss to differentiate foreground/background? Also, does this affect detection of any fast (or indeed rare) transitions in the trajectories?

    Thanks for your comment and question. Yes, we incorporated a pixel weighting strategy during training. In addition to masking out surrounding cells, we used a smoothed and enlarged version of each cell's segmentation mask to emphasize accurate reconstruction of the center cell in each patch and to reduce the influence of surrounding cells, artifacts, and background fluctuations. The matching loss is computed from latent vectors, which are indirectly affected by the pixel weighting as well.

    A more detailed description of the weighting strategy will be added to the methods section. The code for our weighting strategy can be found at: https://github.com/czbiohub/dynamorph/blob/b3321f4368002707fbe39d727bc5c23bd5e7e199/HiddenStateExtractor/vq_vae_supp.py#L287
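
    The linked file holds the actual implementation. As an independent illustration of the idea only, the sketch below builds a weight map by dilating and Gaussian-smoothing a binary mask and uses it in a weighted mean squared error. The function names, the dilation count, the smoothing width, and the background floor are all assumptions and are not taken from the repository.

```python
import numpy as np
from scipy.ndimage import binary_dilation, gaussian_filter

def soft_weight_map(mask, grow=3, sigma=2.0, floor=0.1):
    """Enlarge and smooth a binary cell mask into per-pixel loss weights in [floor, 1]."""
    enlarged = binary_dilation(mask, iterations=grow)
    smooth = gaussian_filter(enlarged.astype(float), sigma=sigma)
    smooth /= smooth.max()
    # The floor keeps a small, nonzero penalty on background pixels.
    return floor + (1.0 - floor) * smooth

def weighted_mse(x, x_hat, weights):
    """Mean squared error in which each pixel is scaled by its weight."""
    return float(np.sum(weights * (x - x_hat) ** 2) / np.sum(weights))

# Hypothetical 32x32 patch with the cell of interest in the center.
mask = np.zeros((32, 32), dtype=bool)
mask[12:20, 12:20] = True
weights = soft_weight_map(mask)

# The same unit error costs more on the center cell than in a background corner.
target = np.zeros((32, 32))
center_err = target.copy()
center_err[16, 16] = 1.0
corner_err = target.copy()
corner_err[0, 0] = 1.0
loss_center = weighted_mse(target, center_err, weights)
loss_corner = weighted_mse(target, corner_err, weights)
print(loss_center, loss_corner)
```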

    Reviewer # 1 (minor points)

    I was a little confused by the labels given to the PCs, as they seem to vary between figures. For example, in Fig. 2, PC1 and PC2 are Size and Peak Retardance, but in Fig. 3 they are referred to as Size and Cell Density (which could be interpreted as the number of cells per unit area). Could the authors clarify these in the captions?

    We have clarified the text to distinguish between cell density (population) and optical density (phase).

    The authors note that single-cell tracking is of vital importance. This should be elaborated upon. Also - could the VQ-VAE encodings be used to help track linking in cases of high density?

    We added a clearer reference to the methods section containing details of the tracking procedure. Additionally, we clarified in the discussion that the methods used for segmenting and tracking cells can be refined for high-density cultures. Since we rely on the tracks to compute the temporal matching loss and regularize the VQ-VAE encodings (shape space) during training, the encodings are not usable for refining tracking in high-density populations.

    Reviewer # 2 (major points)

    -'Cell state' in the field of cell biology has been operationally defined in so many different ways and with so many different types of measurement data, that 'cell state' is becoming a somewhat vacuous term. This is not only a problem of this paper but a challenge for the field. In this case, cells are clustered using a Gaussian mixture model that uses the first few principal components of the latent space coefficients as well as speed, both averaged across the frames of cell tracks. This is fine and descriptive, but it's unclear whether this definition of 'cell state' is easily applied to other datasets and how this definition can be operationalized for hypothesis generation and experimentation. For other datasets, e.g. other cell types and other processes, such as differentiation, where e.g. tracking and segmentation may be more difficult and images would look quite different, can one still apply the same approach towards describing cell states? One could state that this definition of cell state is very specific to the dataset and therefore not generally useful. How would the authors respond to such a statement?

    This is an excellent point. We agree that the meaning of a “cell state” or a “cell type” can depend on the context. Cell state can be rigorously described in terms of measurements of the cells, and recent developments of new cell probing techniques, including imaging modalities and single-cell genomics, keep adding to the growing list of features that can be measured. Time-lapse imaging is high dimensional and therefore admits multiple definitions of cell state. Our use of the terms ‘latent shape space’ and ‘trajectory feature vectors’ clarifies how we define the cell state. Given the increasingly wide use of live cell imaging for biological studies and drug discovery, both of these descriptors of cell state are valuable. In the current manuscript, we focus on a combination of morphodynamic features, including but not limited to the cell shape, size, and speed. We use these features to cluster cells in an unbiased manner to detect morphodynamic “states” unique to this particular culture system. Our approach can be generalized to other cell culture systems, such as cell differentiation, where cell architecture evolves substantially.

    To clarify this point, we add the following text in the manuscript:

    Line 85: “The meaning of a "cell state" can vary with the physiological and methodological context. In this work, we refer to "morphodynamic states" as a combination of morphological and temporal features. From the trajectory of cells in the latent shape space, we identified transitions among morphodynamic states of single cells. The same approach enabled detection of transitions in the morphodynamic states of cells as a result of immunogenic perturbations.”

    In the discussion:

    Line 333: “Our work formalizes an analytical approach for data-driven discovery of morphodynamic cell states based on the quantitative shape and motion descriptors. A cell state can be rigorously described in terms of measurements of the cells, and recent developments in measurement techniques, including imaging modalities and single-cell genomics, keep adding to the growing list of the features that can be measured. Time-lapse imaging is high dimensional and therefore admits multiple definitions of a cell state.”

    -It's unclear to the reviewer whether the training data (unperturbed microglia) are close enough to the test data (perturbed microglia) such that application of the trained model to the test data makes sense. The authors provide reconstruction loss numbers, but they are difficult to interpret. Can the authors create plots of the unperturbed microglia cells and perturbed microglia cells in the latent space and show overlap, or in other ways show that training data and test data are close enough for this application?

    Thank you for pointing out the lack of clarity in generalizability of the model. We trained the model on control, untreated microglia acquired during one experiment, and then applied it to a separate dataset acquired during another experiment that included perturbed and control microglia. The reconstructions shown in Fig. 2 are from the test dataset that was not used during training. The quality of reconstructions supports that the shape space of the training set is representative of the shape space of the larger test set. We will add a density plot in the supplementary figures showing the overlapping latent space distribution of unperturbed (training dataset) and perturbed (test dataset) microglia.

    We now include the revised sentence in the manuscript to clarify the results:

    Line 132: “Comparison of reconstructed shapes from the test set and training set along with the analysis of the shape space described in the next section show that our self-supervised model trained on the training dataset generalized well between independent experiments and can be used to compare cell state changes between control microglia and cells treated with multiple perturbations”.

    -Only a small amount of intensity variation is explained; 17% using the first 4 PC components which are mainly used in the analyses. This seems like a very low number. There is a lot of variation in the intensity images that is not explained by the autoencoder. The autoencoder seems to be doing a bad job. At the same time, the downstream analyses using the latent space are insightful and sensible. Can the authors provide more explanation?

    Thanks for your question. We would like to first clarify that the autoencoder (VQ-VAE) used in this work follows the design of the original reference, which does not compress the input very strongly. Given the latent space size (16x16x16), it is understandable that the top 4 PCs capture a relatively small portion of the variance. The fact that cell shape cannot be described with a few principal components is likely due to: a) the diversity of microglia morphology, b) the diversity of modalities used to train the model.

    We now include the following text in the manuscript: Line 158: “The high variance of the shape space of microglia can be due to more complex shapes of microglia, such as diversity of protrusions, sub-cellular structures and variations in cell optical density, location of nuclei in migrating cells, etc. As we mentioned above, the inclusion of several imaging channels (brightfield, phase, and retardance) increases the performance of the model, possibly by increasing the diversity of morphological information encoded in our input data.”

    As you note, the downstream analyses from the learned latent space are insightful, e.g., we do detect substantial changes in top PCs upon perturbations. This supports our view that the shape space of microglia as encoded by our data is intrinsically high dimensional and the transients in the shape space are informative.
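
    For readers who want to reproduce the kind of number discussed here, the explained-variance share of the top PCs can be computed directly from flattened latent vectors. The sketch below uses random stand-in data with the 16x16x16 latent shape mentioned above; the helper name and the data are assumptions, and the printed value is unrelated to the 17% reported for the real embeddings.

```python
import numpy as np

def explained_variance_ratio(latents):
    """Fraction of total variance carried by each principal component, largest first."""
    flat = latents.reshape(len(latents), -1)
    centered = flat - flat.mean(axis=0)
    # Singular values of the centered data give the PC variances up to a constant factor.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Random stand-in for encoder outputs (hypothetical): 200 patches, 16x16x16 latents.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 16, 16, 16))
ratio = explained_variance_ratio(latents)
top4 = float(ratio[:4].sum())
print(f"top 4 PCs explain {top4:.1%} of the variance")
```

    With 4096 latent dimensions, a small share for the first four components is unsurprising unless the data are strongly low-rank, which is consistent with the explanation given above.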

    Reviewer # 2 (minor points)

    -The motivation for GMMs over k-means is unclear. K-means clustering leads to spatial separation between clusters (states) since all cells/tracks that are closest to their cluster mean are per definition further away from the means of other clusters. This is not the case with the more flexible GMMs; e.g. they allow one to have a smaller cluster (with small variance components) inside of a larger cluster (with large variance). The latter scenario seems undesirable for interpretation in terms of states.

    Thanks for your comments. The major reason for choosing GMMs over K-means clustering is that GMM allows different prior distributions for different perturbations. In practice, K-means would be capable of generating clusters regardless of perturbation conditions, while GMM enables a finer separation of states which are very likely correlated with perturbations. We agree that GMM has certain caveats as you mentioned in the comment. In our analyses, we did not notice issues such as the ‘nesting of components’ that you described.

    -Related to the previous point, 'self-supervised' sounds nice, but it's still optimizing towards something, in this case explaining the variation in input intensity images. A lot of the variation in the intensity images may not be of interest for the biological investigation of shape and dynamics. Did the authors uncover that indeed some of the latent dimensions are encoding other aspects of the images which may be less related to the biology and more to image properties/artifacts/biases?

    We agree with your assessment. Precisely for the reasons you point out, we counter the dependence of the learned representation on non-biological variations in the data using temporal regularization. This point is also recognized by Reviewer #3. We have clarified this concept in the manuscript, noting that not all the latent features represent the biology of the cells and that some represent features of the instrument and the experiment. We report this for the top few PCs of the latent representation and provide the code for the interested reader to discover what other PCs report.

    -The original images are 3D (5 z-planes). The analyzed images were 2D. The reviewer missed how the authors went from 3D to 2D. And since cells are 3D, can the authors describe what they gained by going to 2D and what they potentially lost?

    We added text to the methods subsection describing the Dynamorph Pipeline (line 590):

    “The input data for both semantic segmentation and VQ-VAE models are 2D-images of computed phase and retardance that measure integrated optical density and anisotropy across the depth of the cell. The raw collected data is 3-dimensional (5 z-slices acquired in multiple polarization channels). The 2D phase is computed from the full stack of brightfield images via deconvolution. The retardance is computed from an average of the intensities across the 5 z-slices. Subsequent model training is more tractable with 2D data instead of 3D, while capturing the cell architecture across the depth.”

    Reviewer # 3 (major points)

    Cell state transition interpretation

    In line 278, the authors propose that the unbalanced nature of transitions such that p(1 -> 2) >> p(2 -> 1) must represent some difference in timescales across the transitions because "cell states should have reached equilibrium after several days in culture at the time of the imaging experiments".

    This logic is unclear to me for two reasons.

    * If the population obeys detailed balance (e.g. transitions have equal frequency), then observed transitions should be balanced on a reasonably long time window, even if individual transitions occur on different timescales.

    * The assumption that cell states are balanced after a few days in culture is at odds with a few different aspects of the biology. Cell density and nutrient availability are continually changing in the dish, so culture conditions are non-stationary. Imaging apparatuses also commonly impact the cell biology of imaged samples due to imperfect incubation, etc. (2 or 3)

    It seems likelier that these data represent an unbalanced transition due to the non-stationary nature of the culture system.

    Given the authors' emphasis on the value of measuring these transitions, the work would be strengthened by a more careful interpretation of these results, additional analysis details (e.g. how large are most state transitions? are these mostly small shifts "over the border" in state space, or large jumps?), and an attempt at biological interpretation of the observed phenomenon.

    The authors' RNA-seq data may be helpful in this latter regard.

    This is an excellent point. We agree that the cell culture conditions, including nutrient availability, accumulating presence of metabolites and imaging-induced changes constantly introduce new variations to the system. In an attempt to mitigate these dynamic changes to the system, we maintained cells in culture for six days before starting the experiment. To avoid cell stimulation due to freshly added nutrients and growth factors from the culture media, we consistently exchanged the media and performed cytokine treatments 24 hours before each imaging experiment. Each imaging round was started after the cells were allowed to equilibrate to the environmental chamber for at least one hour before imaging. Despite these efforts, we agree with the reviewer that the conditions cannot be considered fully stationary. We removed the sentence “Given that cell states should have reached equilibrium after several days in culture at the time of the imaging experiments, these results suggest that the transitions from state 2 to state 1 occur at a different time scale (i.e., much slower)” and changed the text to reflect this point:

    Line 294:

    “In our analysis, transition events are very rare among cells treated with IFN beta, while the most frequent cell transitions were observed among cells treated with GBM supernatant. One possible explanation for this imbalance is that IFN-treated cells represent a single polarization axis, while a heterogeneous cell signaling milieu derived from cancer cells provides conflicting pro- and anti-inflammatory signals, instructing cells to transition between the states. While both directions of transitions were observed within the imaging period, cells in state-1 are more likely to transition to state-2 than vice versa within the chosen time frame. This imbalance between the rates of state transitions correlates with the higher state-2/state-1 ratio in GBM and control environment and may explain the longitudinal accumulation of cells in a more activated state under these culture conditions.”
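
    The imbalance between transition directions can be quantified from per-trajectory state sequences by counting changes in each direction and normalizing by the time spent in the source state. The sketch below does this on made-up sequences; the function names and the data are hypothetical, and each step stands for one segment of a trajectory.

```python
from collections import Counter

def transition_stats(state_sequences):
    """Count state changes and the number of steps that start in each state."""
    changes, occupancy = Counter(), Counter()
    for seq in state_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            occupancy[a] += 1  # one observed step starting in state a
            if a != b:
                changes[(a, b)] += 1
    return changes, occupancy

def transition_rate(changes, occupancy, src, dst):
    """Fraction of steps starting in `src` that end in `dst`."""
    return changes[(src, dst)] / occupancy[src] if occupancy[src] else 0.0

# Made-up state sequences for four trajectories (states labeled 1 and 2).
sequences = [
    [1, 1, 2, 2, 2],
    [1, 2, 2, 2, 2],
    [2, 2, 2, 1, 1],
    [1, 1, 1, 2, 2],
]
changes, occupancy = transition_stats(sequences)
rate_12 = transition_rate(changes, occupancy, 1, 2)
rate_21 = transition_rate(changes, occupancy, 2, 1)
print(changes[(1, 2)], changes[(2, 1)], round(rate_12, 3), round(rate_21, 3))
```

    Under detailed balance the raw counts in the two directions would be roughly equal over a long window, so comparing `changes[(1, 2)]` with `changes[(2, 1)]` is one simple check of the reviewer's point.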

    1. Single cell RNA-seq analysis

    The authors performed a very interesting experiment where they profiled the same cell population using both timelapse imaging and single cell RNA-seq.

    The authors argue that the global structure of the state space resolved by each modality is analogous, but this seems a bit of a stretch to me.

    The behavior state space is unimodal (bifurcated into two states by GMM clustering), while the mRNA-seq space has several distinct clusters.

    The argument that these states are analogous would be significantly strengthened by biological interpretation of the RNA-seq data.

    Do the mRNA profiles exhibit differentially expressed genes that might explain differences in behavior in the cell behavior states?

    The analyses in Fig. 4 - Supp 4 are suggestive that "State 1" contains interferon-responsive cells and not control cells, but broader conclusions don't appear well supported by current analyses.

    We agree with the reviewer’s comment that the analogy between molecular cell states defined with scRNAseq analysis and morphodynamic cell states defined with dynamorph needs to be clarified. In our current work, the correlative measurement of morphodynamics and transcriptome was exploratory and relied on population statistics measured with each modality. More detailed studies linking morphodynamic states to the single cell transcriptomics, such as Patch-Seq or laser microdissection, are needed to decisively link morphodynamics and molecular programs underlying these phenotypes.

    Single cell transcriptomics simultaneously measures thousands of mRNA species in individual cells. Therefore, it can provide a nuanced interpretation for the molecular states of each population, as can be seen in the more granular separation of sub-states in scRNAseq clustering. For example, Cluster 1-2 was defined by high expression of interferon response genes, and predictably, this cluster was primarily derived from the cells treated with IFNb. Interferon exposure induces morphological changes associated with increased cell perimeter, which reports ramification of microglia plasma membrane (Aw et al., PMID: 33183319). It was also shown that infections with neurotropic viruses, leading to interferon response, also lead to decreased velocity and distance traveled for cultured microglia cells (Fekete et al., PMID: 30027450). These observations are in direct agreement with our morphodynamic analysis demonstrating a higher proportion of cells in State 1, characterized by lower cell velocity. Interestingly, scRNAseq analysis also identified a population of cells with high expression of cell cycle genes (Cluster 1-3), which would also be predicted to have a slower speed and potentially larger cell body. These results point to the fact that different molecular states may be underlying very similar morphodynamic states.

    We now provide a revised statement to reflect the above.

    Line 290: “We further compared the detected morphodynamic states with scRNA measurements of the same cell populations. Interestingly, the separation of cells in state-1 and state-2 from control and IFN group parallels the clusters identified with cell transcriptome, suggesting that correlative analysis of gene expression and morphodynamics can reveal molecular programs underlying these phenotypes. In our preliminary analysis, scRNAseq revealed a greater degree of granularity in each of the cell populations, such as cluster 1 of the scRNAseq separating into three additional subclusters. Cluster 1-2 was defined by high expression of interferon response genes, and predictably, this cluster was primarily derived from the cells treated with IFNb. Interferon exposure induces morphological changes associated with increased cell perimeter, which reports ramification of microglia membrane (Aw et al., 2020). It was also shown that infections with neurotropic viruses, leading to interferon response, also leads to decreased velocity and distance traveled for cultured microglia cells (Fekete et al., 2018). These observations are in direct agreement with the higher proportion of cells in State 1, characterized by lower cell velocity. Interestingly, scRNAseq analysis also identified a population of cells with high expression of cell cycle genes (Cluster 1-3), which would also be predicted to have a slower speed and potentially larger cell body. These results point to the fact that different molecular states may be underlying very similar morphodynamic states. Correlative single-cell measurements of morphodynamic states and single cell transcriptomics, such as Patch-Seq or laser microdissection, are needed to decisively link morphodynamics and molecular programs underlying these phenotypes.”

    Reviewer # 3 (minor points)

    1. Check grammar. Some articles are missing and some subject-verb agreements are mismatched. e.g. line 624 "we regularized [the] latent space", line 713 "after both loss[es] achieved".

    Thanks for pointing this out. We have thoroughly checked the grammar and corrected typos in this submission.

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Referee #3

    Evidence, reproducibility and clarity

    Summary

    Here, the authors present Dynamorph, an unsupervised learning framework for timelapse cell microscopy data built on VQ-VAEs.

    The authors apply this method to the analysis of microglial cell behavior under a series of perturbation conditions.

    Methodologically, the primary contributions of this work are the introduction of a temporal consistency regularization penalty on the latent space of a VQ-VAE model for application to timeseries data and the introduction of a "temporal feature vector"-ization procedure to summarize complex temporal trajectories in a single low-dimensional vector for analysis. Biologically, the primary contributions are the demonstration that microglial responses to different perturbogens and dynamics state transitions can be resolved by transmitted light microscopy.

    Overall, the experiments presented are well-designed, the methods well-implemented, and communication of the authors' findings is clear and concise.

    However, there are unaddressed potential caveats to the proposed framework and the manuscript fails to compare the proposed method to any existing baselines, such that the particular strengths and weaknesses of the method are unclear to readers.

    Major Points

    1. Temporal consistency regularization

    In the authors' framework, models are regularized to minimize the l2 norm between embeddings of adjacent timepoints. This approach is conceptually well-motivated, but could have some unintended effects.

    For instance, some cells may make a rapid state transition such that state(t-1) = A, state(t) = B, state(t+1) = A'. In these cases, a regularized model may best minimize the joint loss by returning an embedding at time t that interpolates between state A and A', rather than returning an embedding that reflects the true distinct state B.

    The work would be strengthened if the authors analyzed the impact of this regularization term on the detection of rapid state transitions that occur for only a few frames (e.g. when cells that exhibit filopodial motility "jump" in an actin/myosin contraction). This might be accomplished through experiments scanning different regularization hyperparameters on some of the authors' real data, fitting models on temporally downsampled versions of the real data where "slow" multi-timestep transitions now occur in a few timesteps, or perhaps using simulations where rapid state transitions are known to occur.

    Even if the regularization does have some negative impacts, it does not argue against the utility of the general approach, but it is important for users to understand the constraints on downstream applications.
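
    The interpolation concern raised in this point can be made concrete with a one-dimensional toy calculation. The sketch below is an illustration under stated assumptions, not the authors' VQ-VAE objective: it replaces the reconstruction term with a squared distance to a hypothetical unregularized embedding `b`, holds the neighboring embeddings `a_prev` and `a_next` fixed, and weights the L2 temporal penalty by a hypothetical coefficient `lam`.

```python
def regularized_middle_embedding(b, a_prev, a_next, lam):
    """Minimize (z - b)**2 + lam * ((z - a_prev)**2 + (z - a_next)**2) over z.

    Setting the derivative with respect to z to zero gives a weighted average.
    All names are illustrative; this is a toy stand-in for a joint loss.
    """
    return (b + lam * (a_prev + a_next)) / (1.0 + 2.0 * lam)


# States A and A' sit at 0.0; the brief state B sits at 1.0.
for lam in (0.0, 0.5, 2.0):
    z = regularized_middle_embedding(b=1.0, a_prev=0.0, a_next=0.0, lam=lam)
    print(f"lam={lam}: embedding at time t = {z:.3f}")
```

    In this toy setting a weight of 2.0 pulls the single-frame state from 1.0 to 0.2, which is the kind of shrinkage that a scan over regularization hyperparameters on real data would be expected to expose.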

    2. Baseline comparisons

    The authors evaluate their method by assessing the correlation of embedding PCs with heuristic features (Fig. 2C,D + supp.), variation of embedding PCs across cell treatment groups (Fig. 3), and qualitative interpretation of embedding trajectories. In the supplement, the authors compare their VQ-VAE approach to VAEs and AAEs and chose to use a VQ-VAE based on lower reconstruction error and higher PC/heuristic feature correlation.

    However, the authors do not compare their method to much simpler baseline approaches to this problem. Existing literature suggests that heuristic features of cell shape and motion (similar to those the authors use to evaluate the relevance of their embeddings) are sufficient to perform many of the same tasks a VQ-VAE is used for in this work. For instance, in Fig. 3 it appears that a simple analysis of cell centroid speed recovers much of the same information as the complex VQ-VAE embeddings. In Fig. 2 - Supp. 6, it appears that after regressing out many heuristic features of cell geometry, the latent space largely explains cell non-autonomous information about the background environment, suggesting the heuristic features are largely sufficient.

    To demonstrate the usefulness of their deep modeling approach relative to simple baselines, the authors should compare against existing heuristics and embeddings of heuristics (e.g. PCA) using some of the tasks shown for the VQ-VAE (recovery of perturbation state, state transition detection, qualitative trajectory analysis, discrimination of cell types). Heuristics might include those already calculated here, or a more comprehensive set as cited in the Introduction. The authors may also consider comparing against baselines that don't include time information for some of their tasks (e.g. recovery of perturbation state could arguably be achieved with CNNs that are either ignorant of the timestep or given simple temporal conditioning, without trajectory information).

    If these features are sufficient for many of the same tasks performed in this work, the authors should provide a clear argument for readers as to why the unsupervised VQ-VAE approach may be preferable (e.g. ability to recover potentially unknown cell changes, for which no heuristic exists). The VQ-VAE doesn't need to be superior along every axis to hold merit, but the work would be strengthened if the authors could show clear superiority along some dimension.
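
    As an indication of how small such a baseline can be, the sketch below recovers a two-group label from mean centroid speed alone. It is a hypothetical example: the tracks are made up, the function names are illustrative, and the midpoint threshold is only one of many possible decision rules. A real comparison would score the rule on held-out tracks rather than on the tracks used to set the threshold.

```python
import math


def mean_speed(track):
    """Average centroid displacement per frame for a list of (x, y) positions."""
    steps = [math.dist(p, q) for p, q in zip(track, track[1:])]
    return sum(steps) / len(steps)


def fit_speed_threshold(speeds_a, speeds_b):
    """Midpoint between the two group means; a deliberately simple decision rule."""
    mean_a = sum(speeds_a) / len(speeds_a)
    mean_b = sum(speeds_b) / len(speeds_b)
    return (mean_a + mean_b) / 2.0


def accuracy(speeds_a, speeds_b, threshold):
    """Fraction of tracks on the correct side, assuming group A is the slower one."""
    correct = sum(s < threshold for s in speeds_a) + sum(s >= threshold for s in speeds_b)
    return correct / (len(speeds_a) + len(speeds_b))


# Made-up centroid tracks (arbitrary units), purely for illustration.
slow_tracks = [[(0, 0), (1, 0), (2, 0)], [(0, 0), (0, 1), (0, 3)]]
fast_tracks = [[(0, 0), (3, 4), (6, 8)], [(0, 0), (4, 0), (8, 0)]]

slow = [mean_speed(t) for t in slow_tracks]
fast = [mean_speed(t) for t in fast_tracks]
threshold = fit_speed_threshold(slow, fast)
print(threshold, accuracy(slow, fast, threshold))
```

    Reporting the same accuracy metric for this rule and for the VQ-VAE embeddings would show directly how much the learned representation adds beyond speed.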

    3. Cell state transition interpretation

    In line 278, the authors propose that the unbalanced nature of transitions such that p(1 -> 2) >> p(2 -> 1) must represent some difference in timescales across the transitions because "cell states should have reached equilibrium after several days in culture at the time of the imaging experiments". This logic is unclear to me for two reasons.

    • If the population obeys detailed balance (i.e. forward and reverse transitions have equal frequency), then observed transitions should be balanced on a reasonably long time window, even if individual transitions occur on different timescales.
    • The assumption that cell states are balanced after a few days in culture is at odds with a few different aspects of the biology. Cell density and nutrient availability are continually changing in the dish, so culture conditions are non-stationary. Imaging apparatuses also commonly impact the cell biology of imaged samples due to imperfect incubation, etc.

    It seems likelier that these data represent an unbalanced transition due to the non-stationary nature of the culture system. Given the authors' emphasis on the value of measuring these transitions, the work would be strengthened by a more careful interpretation of these results, additional analysis details (e.g. how large are most state transitions? are these mostly small shifts "over the border" in state space, or large jumps?), and an attempt at biological interpretation of the observed phenomenon. The authors' RNA-seq data may be helpful in this latter regard.
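
    One way to quantify the imbalance discussed in this point is to count state changes directly from the per-frame state labels of each track. The following is a minimal sketch with hypothetical state sequences and illustrative function names; it assumes each tracked cell has already been assigned a discrete state at every frame.

```python
from collections import Counter


def transition_counts(state_sequences):
    """Count state changes (i -> j with i != j) across per-cell state sequences."""
    counts = Counter()
    for seq in state_sequences:
        for prev, curr in zip(seq, seq[1:]):
            if prev != curr:
                counts[(prev, curr)] += 1
    return counts


def net_flux(counts, i, j):
    """Observed i -> j transitions minus j -> i transitions."""
    return counts[(i, j)] - counts[(j, i)]


# Hypothetical per-frame state labels for three tracked cells.
sequences = [
    [1, 1, 2, 2, 2],
    [1, 2, 2, 1, 2],
    [2, 2, 2, 2, 2],
]
counts = transition_counts(sequences)
print(counts[(1, 2)], counts[(2, 1)], net_flux(counts, 1, 2))
```

    In a two-state system the transitions within a single track must alternate, so the two counts for one cell can differ by at most one. A large net flux therefore has to come from many cells that start in state 1 and end in state 2, which is consistent with a population that is still drifting rather than at equilibrium.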

    4. Single cell RNA-seq analysis

    The authors performed a very interesting experiment where they profiled the same cell population using both timelapse imaging and single cell RNA-seq. The authors argue that the global structure of the state space resolved by each modality is analogous, but this seems a bit of a stretch to me. The behavior state space is unimodal (bifurcated into two states by GMM clustering), while the mRNA-seq space has several distinct clusters.

    The argument that these states are analogous would be significantly strengthened by biological interpretation of the RNA-seq data. Do the mRNA profiles exhibit differentially expressed genes that might explain differences in behavior in the cell behavior states? The analyses in Fig. 4 - Supp 4 are suggestive that "State 1" contains interferon-responsive cells and not control cells, but broader conclusions don't appear well supported by current analyses.

    Minor Points

    1. Check grammar. Some articles are missing and some subject-verb agreements are mismatched. e.g. line 624 "we regularized [the] latent space", line 713 "after both loss[es] achieved".
    2. For Fig. 4 - supp 1 -- isn't it expected that the GMM cluster of a vector can be predicted from the vector? The GMM clusters were derived from the vectors to begin with, so this seems like a bit of a circular analysis. If I'm missing something, this figure might benefit from more exposition.
    3. For Fig. 4 - Supp 3, the authors should consider changing the "state" and "cluster" colors on the embedding projections so that they do not match. As presented, it appears as if the states and clusters were co-assayed and linked by some experimental label, when in fact the State 1::Cluster1, State 2::Cluster 2 relationship is just inferred.

    Positive comments

    1. Figure presentation and manuscript layout are top notch. Thanks to the authors for making these data easy to read and interpret.

    Significance

    See above.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Referee #2

    Evidence, reproducibility and clarity

    Provide a short summary of the findings and key conclusions (including methodology and model system(s) where appropriate). Please place your comments about significance in section 2.

    -The authors describe Dynamorph, a deep-learning-based autoencoder that represents, in an interpretable latent space, live cell microscopy image data of motile microglia in unperturbed and perturbed situations. Using Dynamorph, the authors identify and describe 'morphodynamic' states of the microglia.

    Major comments:

    Are the key conclusions convincing?

    -Yes, the methodology, observations and conclusions are clearly explained and convincing.

    Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?

    -'Cell state' in the field of cell biology has been operationally defined in so many different ways and with so many different types of measurement data, that 'cell state' is becoming a somewhat vacuous term. This is not only a problem of this paper but a challenge for the field. In this case, cells are clustered using a Gaussian mixture model that uses the first few principal components of the latent space coefficients as well as speed, both averaged across the frames of cell tracks. This is fine and descriptive, but it's unclear whether this definition of 'cell state' is easily applied to other datasets and how this definition can be operationalized for hypothesis generation and experimentation. For other datasets, e.g. other cell types and other processes, such as differentiation, where e.g. tracking and segmentation may be more difficult and images would look quite different, can one still apply the same approach towards describing cell states? One could state that this definition of cell state is very specific to the dataset and therefore not generally useful. How would the authors respond to such a statement?

    Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary to evaluate the paper as it is, and do not ask authors to open new lines of experimentation.

    -It's unclear to the reviewer whether the training data (unperturbed microglia) are close enough to the test data (perturbed microglia) such that application of the trained model to the test data makes sense. The authors provide reconstruction loss numbers, but they are difficult to interpret. Can the authors create plots of the unperturbed microglia cells and perturbed microglia cells in the latent space and show overlap, or in other ways show that training data and test data are close enough for this application?

    -It's unclear what the effect of speed is on the final state determination. TFVs were composed of auto-encoder-based features (PCs from latent space) and speed of the cells. Would the states be very different without speed as part of the TFVs or with TFVs consisting only of speed features? Please quantify and discuss.

    -Only a small amount of intensity variation is explained; 17% using the first 4 PC components which are mainly used in the analyses. This seems like a very low number. There is a lot of variation in the intensity images that is not explained by the autoencoder. The autoencoder seems to be doing a bad job. At the same time, the downstream analyses using the latent space are insightful and sensible. Can the authors provide more explanation?

    -Related to the previous point, 'self-supervised' sounds nice, but it's still optimizing towards something, in this case explaining the variation in input intensity images. A lot of the variation in the intensity images may not be of interest for the biological investigation of shape and dynamics. Did the authors uncover that indeed some of the latent dimensions are encoding other aspects of the images which may be less related to the biology and more to image properties/artifacts/biases?

    Are the suggested experiments realistic for the authors? It would help if you could add an estimated cost and time investment for substantial experiments.

    -These are computational experiments based on already existing data/results/code. It should be relatively straightforward to do these additional computational experiments. Careful analysis and interpretation require time.

    Are the data and the methods presented in such a way that they can be reproduced?

    -The methods are described with sufficient detail. The complicated experimental and computational processes seem reproducible to a decent extent. The code is captured in Github repos. The reviewer did not attempt to reproduce computational results. The reviewer did not check whether the available data meets FAIR requirements.

    Are the experiments adequately replicated and statistical analysis adequate?

    -Yes, and there is lots of useful supplementary material which helps with interpretation of the results.

    Minor comments:

    Specific experimental issues that are easily addressable.

    -The motivation for GMMs over k-means is unclear. K-means clustering leads to spatial separation between clusters (states), since all cells/tracks that are closest to their cluster mean are by definition further away from the means of other clusters. This is not the case with the more flexible GMMs; e.g. they allow one to have a smaller cluster (with small variance components) inside of a larger cluster (with large variance). The latter scenario seems undesirable for interpretation in terms of states.
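
    The nested-cluster scenario described above can be illustrated in one dimension. The sketch below uses hand-set, hypothetical mixture parameters and k-means centers rather than anything fitted to the authors' data, and the function names are illustrative. It compares the highest-posterior GMM assignment with the nearest-center rule used by k-means.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a one-dimensional normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gmm_assign(x, weights, means, sds):
    """Index of the mixture component with the highest posterior at x."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return scores.index(max(scores))


def kmeans_assign(x, centers):
    """Index of the nearest center, which is the k-means assignment rule."""
    distances = [abs(x - c) for c in centers]
    return distances.index(min(distances))


points = (-4.0, 0.5, 4.0)

# Hypothetical fitted parameters: a narrow component nested inside a broad one.
weights, means, sds = [0.5, 0.5], [0.0, 0.0], [0.5, 3.0]
gmm_labels = [gmm_assign(x, weights, means, sds) for x in points]

# Hypothetical k-means centers: each cluster owns one contiguous interval.
kmeans_labels = [kmeans_assign(x, [-2.0, 2.0]) for x in points]
print(gmm_labels, kmeans_labels)
```

    With these parameters the GMM labels are `[1, 0, 1]`: the broad component claims points on both sides of the narrow one, so a single "state" occupies two disconnected regions. The nearest-center rule gives `[0, 1, 1]`, with each label covering one contiguous interval.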

    -The original images are 3D (5 z-planes). The analyzed images were 2D. The reviewer missed how the authors went from 3D to 2D. And since cells are 3D, can the authors describe what they gained by going to 2D and what they potentially lost?

    Are prior studies referenced appropriately?

    -Yes, citations are ample and relevant.

    Are the text and figures clear and accurate?

    -Yes, the figures are informative.

    Do you have suggestions that would help the authors improve the presentation of their data and conclusions?

    -No specific suggestions

    Significance

    Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field.

    -This is a technological/computational advance using a large integrative (experimental+computational) approach.

    Place the work in the context of the existing literature (provide references, where appropriate).

    -The authors have done an excellent job at this.

    State what audience might be interested in and influenced by the reported findings.

    -Cell biologists, brain researchers, computer vision computational biologists

    Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate.

    -Cell biology, cancer biology, systems biology, machine learning, statistics, data integration

    -Brain biology aspects (biological significance of the findings on morphodynamic microglial states) are difficult to assess for the reviewer

    Referee Cross-commenting

    Comments by Reviewer #1 look great and useful. I think they are in line with my comments. I think this manuscript would benefit from a reviewer that could comment on the biological significance. The review reports are skewed towards questions and remarks about the computational approach.

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

    Referee #1

    Evidence, reproducibility and clarity

    Summary:

    The authors use a combination of quantitative phase microscopy and machine learning to determine the state space of microglia cells. The key conclusions are that a VQ-VAE is able to capture a compact latent representation of the cell morphology and, combined with motion features, can predict state changes in single cell trajectories and discriminate between perturbations.

    Major comments:

    Overall - I very much enjoyed reading the manuscript. The work has been carefully performed and the results are interesting.

    • The temporal matching to enforce a smooth latent space representation is interesting. The authors mention that they mask out surrounding cells with a median pixel value. Have the authors considered using a pixel weighting in the reconstruction/matching loss to differentiate foreground/background? Also, does this affect detection of any fast (or indeed rare) transitions in the trajectories?
    • The Trajectory Feature Vectors (TFVs) are averaged over time - this seems to lose a lot of the salient information in the trajectories themselves, resulting in the low(ish) accuracy of the GMM. Could a Hidden Markov Model trained on the trajectories in state space help to identify/classify those trajectories that change their morphology/motion over time?
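
    To make the Hidden Markov Model suggestion concrete, the sketch below decodes a per-frame state path with the Viterbi algorithm. It is a minimal illustration, not a trained model: the two-state transition and emission probabilities are hand-set and hypothetical, the observations are assumed to be discretized (0 = slow, 1 = fast), and all probabilities must be non-zero because the scores are computed in log space. In practice the parameters would be fitted to the trajectories, for example with the Baum-Welch algorithm.

```python
import math


def viterbi(observations, start, trans, emit):
    """Most likely hidden-state path for a discrete HMM, computed in log space."""
    n_states = len(start)
    log = math.log
    score = [log(start[s]) + log(emit[s][observations[0]]) for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            candidates = [score[p] + log(trans[p][s]) for p in range(n_states)]
            best_prev = candidates.index(max(candidates))
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + log(emit[s][obs]))
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the highest-scoring final state.
    state = score.index(max(score))
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters: two "sticky" states, observations 0 = slow, 1 = fast.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]

sustained = viterbi([0, 0, 0, 1, 1, 1], start, trans, emit)
blip = viterbi([0, 0, 1, 0, 0], start, trans, emit)
print(sustained, blip)
```

    With these settings the sustained change is decoded as a switch from state 0 to state 1 at the fourth frame, while the single-frame blip is absorbed into state 0. The second result shows that strongly persistent transition probabilities would themselves smooth over fast transitions, so the same caveat raised for the temporal matching loss would apply to an HMM.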

    Minor comments:

    • Could the authors provide some example images showing interpolation of each PC using the generative decoder?
    • I was a little confused by the labels given to the PCs, as they seem to vary between figures. For example, in Fig. 2, PC1 and PC2 are Size and Peak Retardance, but in Fig. 3 they are referred to as Size and Cell Density (which could be interpreted as the number of cells per unit area). Could the authors clarify these in the captions?
    • The authors note that single-cell tracking is of vital importance. This should be elaborated upon. Also - could the VQ-VAE encodings be used to help track linking in cases of high density?
    • I was pleased to see the full source code available!

    Significance

    Nature and significance:

    This is a significant, mostly technical piece of work, that explores a complex new area of science -- using ML and large datasets to gain insight into biological systems. There are significant challenges, not least that interpreting ML models can be challenging.

    Existing literature/context:

    There have been relatively few examples of using self-supervised learning to gain insight into these complex datasets. Much of the work has concentrated on learning morphological descriptors. The present work starts to introduce the time dimension more explicitly.

    Target Audience:

    Broadly applicable to those studying cell biology, microscopy and machine learning.

    My expertise:

    ML applied to microscopy data. Single cell tracking.