Major data analysis errors invalidate cancer microbiome findings

Abstract

We re-analyzed the data from a recent large-scale study that reported strong correlations between DNA signatures of microbial organisms and 33 different cancer types and that created machine-learning predictors with near-perfect accuracy at distinguishing among cancers. We found at least two fundamental flaws in the reported data and in the methods: (i) errors in the genome database and the associated computational methods led to millions of false-positive findings of bacterial reads across all samples, largely because most of the sequences identified as bacteria were instead human; and (ii) errors in the transformation of the raw data created an artificial signature, even for microbes with no reads detected, tagging each tumor type with a distinct signal that the machine-learning programs then used to create an apparently accurate classifier. Each of these problems invalidates the results, leading to the conclusion that the microbiome-based classifiers for identifying cancer presented in the study are entirely wrong. These flaws have subsequently affected more than a dozen additional published studies that used the same data and whose results are likely invalid as well.
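To make flaw (ii) concrete, the following toy simulation (a minimal sketch, not the actual Voom-SNM pipeline used in the original study; all data and parameters here are invented for illustration) shows how a label-dependent transformation stamps each class with its own signature, even in features with zero raw reads, letting a classifier look near-perfect on pure noise:

```python
# Toy illustration of flaw (ii): a per-class transformation leaks the
# label into every feature, including microbes with zero reads.
# This is NOT the Voom-SNM pipeline from the original study; it is a
# minimal invented simulation of the failure mode.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_class, n_features, n_classes = 50, 200, 3

# Raw counts: pure noise with no class signal; half of the features are
# zero in every sample (microbes never detected).
X_raw = rng.poisson(5, size=(n_per_class * n_classes, n_features)).astype(float)
X_raw[:, n_features // 2:] = 0.0
y = np.repeat(np.arange(n_classes), n_per_class)

# On the raw counts, cross-validated accuracy sits near chance (~0.33).
print(cross_val_score(RandomForestClassifier(random_state=0), X_raw, y).mean())

# Faulty "normalization": each class gets its own offset applied to every
# feature, including the all-zero ones, tagging samples with their label.
offsets = rng.normal(size=(n_classes, n_features))
X_norm = X_raw + offsets[y]

# The same classifier now looks near-perfect, purely from the artifact.
print(cross_val_score(RandomForestClassifier(random_state=0), X_norm, y).mean())
```

By construction there is no biological signal here; after the per-class offset, the formerly all-zero features carry the label exactly, which is what the classifier exploits.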

IMPORTANCE

Recent reports showing that human cancers have a distinctive microbiome have led to a flurry of papers describing microbial signatures of different cancer types. Many of these reports are based on flawed data that, upon re-analysis, completely overturn the original findings. The re-analysis conducted here shows that most of the microbes originally reported as associated with cancer were not present at all in the samples. The original report of a cancer microbiome and more than a dozen follow-up studies are, therefore, likely to be invalid.

Article activity feed

  1. Luo et al. (20), Zhu et al. (21), F. Chen et al. (22), Narunsky-Haziza et al. (23), C. Chen et al. (24), Lim et al. (25), Bentham et al. (26), Y. Kim et al. (27), Y. Xu et al. (28), and Y. Li et al. (29)

    Were you time-limited in examining these, or is it harder in these cases than in those above to tell whether the results represent true biology?

  2. Note that we do not know precisely where Poore et al. went wrong in applying the normalization code

    It would be helpful to know whether:

    1. Were all of the data and code available to try to exactly repeat what Poore et al. did? If not, what ingredients are missing?
    2. If you are able to exactly repeat what Poore et al. ran, do you get the exact same results? If not, is it because they didn't report, e.g., a random seed value, or does it seem like the reported code isn't what was actually used? (A sketch of the seed issue follows below.)
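    For instance, here is a minimal sketch (assuming scikit-learn; the classifier and data are illustrative, not the actual pipeline) of why an unreported random seed alone can block exact replication:

    ```python
    # Minimal sketch: an unreported random seed prevents bit-for-bit
    # replication of an ML result. Illustrative data and model only.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.default_rng(0).normal(size=(100, 20))
    y = (X[:, 0] > 0).astype(int)

    # Without a fixed random_state, each fit draws its own bootstrap
    # samples and feature subsets, so feature importances need not
    # match across runs.
    imp_a = RandomForestClassifier().fit(X, y).feature_importances_
    imp_b = RandomForestClassifier().fit(X, y).feature_importances_
    print(np.allclose(imp_a, imp_b))  # typically False

    # With the seed reported and fixed, the fit is exactly repeatable.
    imp_c = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_
    imp_d = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_
    print(np.allclose(imp_c, imp_d))  # True
    ```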
  3. Note that even with two rounds of alignment against the human genome, many of the reads in each sample were still classified as human by the Kraken program using our database.

    Given that human contamination is a leading issue here, I think it could be interesting to show what mapping against a human pangenome accomplishes in terms of reducing the number of human sequences among the unmapped reads. Similarly, a program like BBMap's bbduk.sh could be used to remove human sequences; like Kraken, it detects matches based on k-mers. While neither of these steps is necessary to prove the points in this preprint, the preprint may receive a lot of attention, and it could be a gift to the community to demonstrate the most effective ways to deplete human reads from a sample when that is the goal (a rough sketch of one such approach follows below). This would also have applications in the metagenomics field for host-associated samples, where human reads should be removed from, e.g., gut microbiome data before deposition.
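    As a rough sketch of that two-stage depletion idea, one possible wiring (file names, the database name, and parameter choices are illustrative assumptions, not the preprint's actual pipeline) is:

    ```python
    # Rough sketch of a two-stage human-read depletion step, per the
    # suggestion above: k-mer filtering with BBDuk, then a Kraken2 pass
    # on the survivors. File names, the database name, and parameters
    # are illustrative assumptions, not the preprint's actual pipeline.
    import subprocess

    # Stage 1: BBDuk drops reads sharing 31-mers with the human reference.
    subprocess.run(
        ["bbduk.sh", "in=reads.fq", "out=nonhuman.fq", "outm=human.fq",
         "ref=GRCh38.fa", "k=31"],
        check=True,
    )

    # Stage 2: Kraken2, run against a database that includes the human
    # genome, flags any human reads BBDuk missed; keep what it leaves
    # unclassified (plus reads assigned to genuinely microbial taxa).
    subprocess.run(
        ["kraken2", "--db", "k2_db_with_human", "--report", "report.txt",
         "--unclassified-out", "unclassified.fq", "nonhuman.fq"],
        check=True,
    )
    ```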