A Pooled Cell Painting CRISPR Screening Platform Enables de novo Inference of Gene Function by Self-supervised Deep Learning

Abstract

Pooled CRISPR screening has emerged as a powerful method of mapping gene function thanks to its scalability, affordability, and robustness against the well- or plate-specific confounders present in array-based screening 1–6. Most pooled CRISPR screens assay low-dimensional phenotypes (e.g. fitness, fluorescent markers). Higher-dimensional assays such as Perturb-seq are available but costly and only applicable to transcriptomic readouts 7–11. Recently, pooled optical screening, which combines pooled CRISPR screening with microscopy-based assays, has been demonstrated in studies of the NFkB pathway, essential human genes, cytoskeletal organization, and the antiviral response 12–15. While the pooled optical screening methodology is scalable and information-rich, the applications thus far have employed hypothesis-specific assays. Here, we enable hypothesis-free reverse genetic screening for generic morphological phenotypes by re-engineering the Cell Painting 16 technique to be compatible with pooled optical screening. We validated this technique using well-defined morphological gene sets (124 genes), compared classical image analysis and self-supervised learning methods using a mechanism-of-action (MoA) library (300 genes), and performed discovery screening with a druggable genome library (1640 genes) 17. Across these three experiments we show that the combination of rich morphological data and deep learning allows gene networks to emerge without the need for target-specific biomarkers, leading to better discovery of gene function.

Article activity feed

  1. We trained a ViT-small model with patch size = 8, number of global crops = 2, and number of local crops = 8 on 4 nodes x 8 NVIDIA V100 GPUs per node (32 GPUs) for 100 epochs

    would it be possible (and meaningful) to mention how many GPU-hours this required? Also, some more details would be helpful for non-ML experts; e.g., why the choice of 100 epochs, whether a stopping criterion was used, and which epoch was used for the final analysis/results.
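
    For concreteness, the GPU-hour accounting being asked about is simple arithmetic once the wall-clock time is known; a back-of-envelope sketch (the wall-clock duration is not reported, so it is left as a free variable here):

    ```python
    # Back-of-envelope accounting for the run described above:
    # 4 nodes x 8 NVIDIA V100s = 32 GPUs, trained for 100 epochs.
    N_GPUS = 4 * 8

    def gpu_hours(wall_clock_hours: float) -> float:
        """Total GPU-hours = number of GPUs * wall-clock duration."""
        return N_GPUS * wall_clock_hours

    # Illustrative only: if one epoch took ~0.5 h of wall-clock time,
    # the full run would be gpu_hours(100 * 0.5) = 1600 GPU-hours.
    ```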

  2. we re-parameterized the first layer of the model as:

    This equation is a bit opaque; it would be helpful to explain what the superscripts and subscripts of theta mean.
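
    For readers unfamiliar with this step, one common form such a re-parameterization takes (a sketch under assumptions, not necessarily the authors' exact scheme) is adapting a patch-embedding layer pretrained on 3-channel images to multi-channel microscopy input:

    ```python
    import torch
    import torch.nn as nn

    def adapt_patch_embed(conv_rgb: nn.Conv2d, in_channels: int) -> nn.Conv2d:
        """Re-parameterize a 3-channel ViT patch-embedding conv for
        `in_channels`-channel input by tiling the mean RGB filter and
        rescaling by 3/in_channels to roughly preserve activation scale."""
        conv_new = nn.Conv2d(in_channels, conv_rgb.out_channels,
                             kernel_size=conv_rgb.kernel_size,
                             stride=conv_rgb.stride,
                             bias=conv_rgb.bias is not None)
        with torch.no_grad():
            mean_w = conv_rgb.weight.mean(dim=1, keepdim=True)  # (out, 1, k, k)
            conv_new.weight.copy_(mean_w.repeat(1, in_channels, 1, 1)
                                  * (3.0 / in_channels))
            if conv_rgb.bias is not None:
                conv_new.bias.copy_(conv_rgb.bias)
        return conv_new
    ```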

  3. (both ~1–1.5 million cell tile images)

    Does the 1–1.5 million figure refer to single-cell images, or FOVs? It would also be super helpful to comment on how this dataset size was chosen. Was it the minimum amount of data required for this level of performance? More generally, did you do any experiments varying the quantity or diversity of the training data?

  4. The superior performance of CP-DINO 1640 is unlikely a result of trivial memorization, as the 1640-gene druggable genome library and the 300-gene MoA library share similar numbers of overlapping genes with the 124 PoC library (30 and 26 genes, respectively).

    I think to make this claim more convincing, it would be important to show how many genes in the 1640 library are very similar to (rather than merely identical to) genes in the 124 PoC library ("very similar" is obviously subjective but I'm thinking of homologs/paralogs or genes that are components of the same complex or pathway)
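
    One way to run this check would be to count library genes that are high-confidence StringDB neighbours of any PoC gene, not just exact matches. A sketch with hypothetical file names (the column layout and score cutoff are assumptions):

    ```python
    import pandas as pd

    # Hypothetical inputs: one-column gene lists and a StringDB edge
    # table with columns gene_a, gene_b, score (0-1000 scale).
    poc = set(pd.read_csv("poc_124_genes.csv")["gene"])
    lib = set(pd.read_csv("druggable_1640_genes.csv")["gene"])
    edges = pd.read_csv("string_links.tsv", sep="\t")
    strong = edges[edges["score"] >= 700]  # high-confidence interactions only

    exact = lib & poc
    near = {g for g in lib - poc
            if (((strong["gene_a"] == g) & strong["gene_b"].isin(poc)) |
                ((strong["gene_b"] == g) & strong["gene_a"].isin(poc))).any()}
    print(f"exact overlap: {len(exact)}; close StringDB neighbours: {len(near)}")
    ```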

  5. Anti-phospho-S6 (pS6) antibody with AlexaFluor 750-conjugated secondary antibody was used in the 6th channel as an established biomarker

    it would be helpful to mention here what cellular structures or features the pS6 antibody labels, and also (for the non-biologists among us) what mTORC1 is

  6. Nevertheless, CP-DINO 300 trained on bioimaging data yielded a more informative embedding that has higher median prediction accuracy than the other two models (Fig. S4a-b), and correctly classified more perturbations with better accuracy (Fig. 4c). CP-DINO 300 also recovered more known biological relationships from StringDB as measured by cosine similarity of the aggregate gene KO embeddings (Methods) than the other two models (Fig. 4d)

    It's awesome to see such an explicit and direct comparison of classic feature engineering with modern unsupervised ML models!

    If possible it would be great to quantify how much better the DINO-based approach is; Figures 4a-d are a bit hard to understand at first and obscure the relative differences; Fig 4d in particular doesn't give the impression that DINO is that much better than the CellStats approach (even though the 0.12 of DINO vs the 0.09 of CellStats is actually a ~33% improvement!). Also, some measure of statistical significance would be helpful; in particular, how likely is it that the 0.09 vs 0.12 in Fig 4d is reproducible? See the bootstrap sketch below for one way to check this.
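
    A simple reproducibility check would be to bootstrap over the gene set and put a confidence interval on the DINO-vs-CellStats difference. A sketch; `recall_fn` is a hypothetical stand-in for the paper's StringDB-recall computation:

    ```python
    import numpy as np

    def bootstrap_diff(genes, recall_fn, n_boot=1000, seed=0):
        """Bootstrap CI for the difference in a recall metric between two
        embedding methods; the same resampled gene set is scored by both
        methods in each replicate."""
        rng = np.random.default_rng(seed)
        genes = np.asarray(genes)
        diffs = []
        for _ in range(n_boot):
            sample = rng.choice(genes, size=len(genes), replace=True)
            diffs.append(recall_fn(sample, "CP-DINO 300")
                         - recall_fn(sample, "CellStats"))
        lo, hi = np.percentile(diffs, [2.5, 97.5])
        return float(np.mean(diffs)), (float(lo), float(hi))  # mean, 95% CI
    ```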

  7. phenotypic clustering of genes by their annotated mechanism of action,

    It feels like there's a typo here somewhere, since genes don't really have a "mechanism of action" and the screen here does not involve compounds but rather gene KOs. Is the idea to use the phenotype of the KOs to cluster genes by the MoA of the compounds that target them? In any case, the reference to MoAs here is doubly confusing because the clustering shown in Fig 4E appears to capture cellular localization (and also pathway membership?), but I couldn't see any discussion of the clustering relative to the MoAs of the compounds used to select the 300 genes

  8. a-b. Comparison of feature embedding methodologies based on median AUC of binary classification of KO from WT for each genetic perturbation.

    It looks like the majority of the genes have AUC ~= 0.5; what is the interpretation of that? Does that mean that most gene KOs tested do not exhibit a phenotype distinguishable from wild-type?
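
    For reference, the readout in question is presumably of this form (a minimal sketch assuming per-cell embeddings; the paper's exact classifier and cross-validation may differ). AUC ≈ 0.5 is chance level, i.e. the classifier cannot separate that KO's cells from wild-type under this readout:

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def ko_vs_wt_auc(ko_emb: np.ndarray, wt_emb: np.ndarray) -> float:
        """Cross-validated AUC for separating KO cell embeddings from WT
        ones; 0.5 = indistinguishable from chance, 1.0 = fully separable."""
        X = np.vstack([ko_emb, wt_emb])
        y = np.concatenate([np.ones(len(ko_emb)), np.zeros(len(wt_emb))])
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    ```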

  9. The field of view images are then cropped around the centroids of each of the segmented nuclei and masked by the corresponding cell segmentation mask to create tiles with a single cell in context

    are pixels that are outside the mask painted with zeros on all channels?
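
    For concreteness, the masking step presumably looks something like the following sketch; whether out-of-mask pixels are zeroed is exactly the question above, so the zeroing shown here is an assumption, not the authors' confirmed behaviour:

    ```python
    import numpy as np

    def single_cell_tile(fov: np.ndarray, cell_mask: np.ndarray,
                         centroid: tuple, size: int = 128) -> np.ndarray:
        """Crop a (C, H, W) field of view around a nucleus centroid and
        zero every pixel outside that cell's segmentation mask on all
        channels. Border handling (padding) is omitted for brevity."""
        cy, cx = (int(round(c)) for c in centroid)
        half = size // 2
        y0, x0 = max(cy - half, 0), max(cx - half, 0)
        crop = fov[:, y0:y0 + size, x0:x0 + size]
        mask = cell_mask[None, y0:y0 + size, x0:x0 + size]
        return crop * mask  # zero outside the cell on all channels
    ```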