A crowd of BashTheBug volunteers reproducibly and accurately measure the minimum inhibitory concentrations of 13 antitubercular drugs from photographs of 96-well broth microdilution plates

Curation statements for this article:
  • Curated by eLife


This article has been reviewed by the following groups


Abstract

Tuberculosis is a respiratory disease that is treatable with antibiotics. An increasing prevalence of resistance means that to ensure a good treatment outcome it is desirable to test the susceptibility of each infection to different antibiotics. Conventionally, this is done by culturing a clinical sample and then exposing aliquots to a panel of antibiotics, each being present at a pre-determined concentration, thereby determining if the sample is resistant or susceptible to each drug. The minimum inhibitory concentration (MIC) of a drug is the lowest concentration that inhibits growth and is a more useful quantity, but requires each sample to be tested at a range of concentrations for each drug. Using 96-well broth microdilution plates, with each well containing a lyophilised pre-determined amount of an antibiotic, is a convenient and cost-effective way to measure the MICs of several drugs at once for a clinical sample. Although accurate, this is still an expensive and slow process that requires highly skilled and experienced laboratory scientists. Here we show that, through the BashTheBug project hosted on the Zooniverse citizen science platform, a crowd of volunteers can reproducibly and accurately determine the MICs for 13 drugs and that simply taking the median or mode of 11–17 independent classifications is sufficient. There is therefore a potential role for crowds to support (but not supplant) the role of experts in antibiotic susceptibility testing.
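
The consensus rule described in the abstract (take the median or mode of 11–17 independent readings per drug) is simple to illustrate. The sketch below uses an invented doubling-dilution series and invented volunteer readings; it is not the authors' pipeline, only a minimal example of how such a consensus MIC might be computed.

```python
from statistics import median_low, mode

# Hypothetical example: 11 volunteers each report how many wells in a
# doubling-dilution series show growth for one drug; the MIC is then read
# as the concentration of the first well without growth.
dilutions = [0.03, 0.06, 0.12, 0.25, 0.5, 1.0, 2.0]  # mg/L, doubling series
readings = [3, 3, 4, 3, 2, 3, 3, 4, 3, 3, 5]         # wells with growth, per volunteer

consensus_median = median_low(readings)   # 3
consensus_mode = mode(readings)           # 3

print("median consensus MIC:", dilutions[consensus_median], "mg/L")  # 0.25 mg/L
print("mode consensus MIC:  ", dilutions[consensus_mode], "mg/L")    # 0.25 mg/L
```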

Article activity feed

  1. Evaluation Summary:

    The authors evaluate a novel crowd-sourcing method to interpret minimum inhibitory concentrations of Mycobacterium tuberculosis, the causative agent of tuberculosis. To provide valuable test results without the need for available expert mycobacteriologists, the authors demonstrate that when presented appropriately, 11-17 interpretations by lay interpreters can provide reproducible results for most tuberculosis drugs. This analysis demonstrates that among those samples that can be reliably interpreted by automated detection software, lay interpretation provides a potential alternative means to provide a timely confirmatory read. The work will be of interest to bacteriologists and those with an interest in antimicrobial resistance.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

  2. Reviewer #1 (Public Review):

    This work presents a significant amount of data and analyses (openly available via a GitHub repository) explaining and rigorously assessing how a crowd of citizens can estimate the growth of Mycobacterium tuberculosis (Mtb) strains and thereby measure the minimum inhibitory concentration (MIC) needed to choose which antibiotics can be used to treat patients infected by Mtb. Measuring MICs can rapidly become time-consuming and requires highly skilled scientists and/or dedicated software (such as AMyGDA), but the latter is sometimes prone to errors such as misclassifying artifacts. Thus, the use of a crowd will definitely help to overcome these limitations.

    This manuscript has several undeniable strengths and a few minor weaknesses:

    Strengths:
    - First, Fowler and co-authors have constructed different datasets against which to assess the users' classifications. Interestingly, they also discuss the potential biases and limitations of each dataset. These datasets will also be useful for developing other programs to measure MICs for Mtb strains or for training ML models.
    - Then, Fowler and co-authors performed a careful analysis to validate the consensus, reproducibility, and accuracy of the classifications made by the citizens.
    - Overall, the development of the citizen science project BashTheBug is of particular interest for rapidly classifying the huge number of MIC images generated by the CRyPTIC project. It may also pave the way for a new approach to quickly assessing MIC data for Mtb growth in countries that may not have access to state-of-the-art facilities or highly skilled scientists.

    Weaknesses:
    - While the authors explain how they tried to engage people to "play" this serious game and describe the number of classifications, there is no real discussion of the number of users playing every day. Reaching a minimum number of regular players is essential for the sustainability of such a project.
    - In the discussion the authors mention that this approach may help train laboratory scientists; unfortunately, this claim was not really explored in the manuscript. It may have been interesting to analyze, for the most engaged volunteers, the improvement in a user's accuracy after 10, 100, or 1,000 classifications. It may also be interesting to reweight the accuracy results as a function of the users' experience to see if this improves the classification scores. It would also have been of interest to know whether the way of presenting the data, or playing the game, may help experts (i.e. laboratory scientists) improve their skills at quickly assessing MICs, using the methodology designed to assess the citizens (such as the time spent on a classification presented in Fig. S5).
    - 13 drugs were tested on 19 different strains of Mtb. It would have been of broad interest to see how each plate can be reconstructed from the different classifications and to briefly present the practical outputs of these classifications, i.e. the resistance of each strain to the different antibiotics. Furthermore, except for H37Rv, the other strains are not named; only a vial code is presented in Table S1.

  3. Reviewer #2 (Public Review):

    Thank you for the opportunity to review this extremely interesting, transdisciplinary paper, which reports a new method to analyse large scale data about the pathogen M. tuberculosis. The authors have the long-term goal of shifting the paradigm of antibiotic susceptibility testing (AST) from culture-based to genetics-based, where the treatability of a given pathogen is inferred from its genome rather than ascertained from laboratory testing, which can take several weeks and be demanding of expertise and resources. This study aims to establish that a crowd of volunteers with no clinical training can measure the growth rates of microbial samples and thus provide a large dataset suitable to train machine learning models, which would have required a great deal more time and resources if produced by experts. M. tuberculosis is a particularly significant pathogen to study because of its long incubation period, which means that testing for it requires more resources than most pathogens, and its prevalence in the world, having caused the most fatalities of any pathogen in 2019 (only surpassed by SARS-CoV-2 in 2020).

    This paper will be of significance in many fields: that of microbiology and other health-related fields, and also of citizen science. I do not have the experience to judge the former in detail, but in the latter, the authors use a well-established platform (the Zooniverse) to carry out their experiment, and thoughtfully consider many sources of bias and compare similar Zooniverse projects with their own. The process of recruiting and training volunteers is well described and a "wisdom of crowds" approach is used (though not named as such or referenced) to ensure that each data point is examined by at least 17 different people. The data quality obtained from a smaller number of classifications, e.g. 9, is considered, but it is not explained why 17 classifications in particular was chosen as a suitable number. (It is also stated that some samples were classified by far more than 17 people, but not why this was or what effect it had.) The authors also showed themselves to be thoughtful, flexible and empathetic to volunteers in the early stages of the experiment, noting that the M. tuberculosis samples were set out in trays of 96 wells and that classifying all 96 was too lengthy a task, and so they personally watched the process undertaken by a trained expert and noted that this expert worked by comparing each well to a control one in which no antibiotic was used. This effort to understand how participants think and work will have contributed to the quality of the Zooniverse project.

    The authors demonstrate clearly the varying appropriateness of the mode, median and mean from these classifications to determine the minimum inhibitory concentration (MIC) of each antibiotic, which essentially determines whether or not it is an effective antibiotic with which to treat that culture - and, in the long term, a pathogen with that genome. There is a very good explanation of the different sources of bias incurred from expert analysis and also from analysis by software known as Automated Mycobacterial Growth Detection Algorithm (AMyGDA) - although the authors point out that this makes comparisons and analysis difficult, they do not state clearly how they addressed this difficulty.

    The authors do not seem to have done any tests for even a very small crowd of experts, to see if there is a significant difference between this and a crowd of volunteers, although they do compare volunteers' classifications to experts' classifications. Some new metrics are presented for data comparison, which are named the exact agreement and the essential agreement, and which I found difficult to understand. They seem to be derived from a reference method that comes from a set of standards from 2007 that have since been withdrawn, and it is not explained why these are superior to more standard statistical methods. Nonetheless, overall the authors have thoroughly compared their datasets in a great many ways and their approach seems rigorous.
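
    The exact and essential agreement metrics mentioned above are commonly defined on the doubling-dilution scale: exact agreement counts paired MIC readings that fall in the same doubling dilution, while essential agreement also accepts a difference of one dilution. A minimal sketch, using invented MIC values and these conventional definitions rather than the paper's exact implementation:

    ```python
    import math

    def agreement(reference_mics, test_mics):
        """Return (exact, essential) agreement fractions for paired MICs (mg/L).

        MICs are compared on a log2 (doubling-dilution) scale: exact agreement
        requires the same dilution, essential agreement allows +/- one dilution.
        """
        exact = essential = 0
        for ref, test in zip(reference_mics, test_mics):
            diff = abs(math.log2(test) - math.log2(ref))
            exact += diff < 0.5       # same doubling dilution
            essential += diff < 1.5   # within one doubling dilution
        n = len(reference_mics)
        return exact / n, essential / n

    # Invented example: five paired MIC readings (mg/L)
    reference = [0.25, 0.5, 1.0, 2.0, 0.12]
    test = [0.25, 1.0, 4.0, 2.0, 0.12]
    print(agreement(reference, test))   # (0.6, 0.8): 3/5 exact, 4/5 essential
    ```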

    The conclusion - that a crowd of volunteers can indeed analyse the growth rates of microbial samples - is broadly supported by the data presented, though there are a few inconsistencies. In several places the paper highlights a need for 95% reproducibility and 90% accuracy; however, when later reporting reproducibilities of 94.2% and 94.1% (for the mode and median) and accuracies of 90.2% and 91.0% (again for the mode and median), which almost but not quite meet the 95% benchmark, the authors suddenly claim that the 95% is not needed. There is probably a good reason for this, but it was not clearly explained.

    The authors provide very few details about the origin of the 20,000 samples of M. tuberculosis collected by one of their organisations - presumably these are clinical samples taken from patients, but it is not stated whether patient consent was required or obtained, or what the geographical distribution was of this collection. Furthermore, it is remarked that vials of samples were sent to seven laboratories but only tested by two members of staff, the logistics of which were rather confusing to picture, and it was hinted that there was a more detailed process which was "described previously", but only in another paper.

    There is some very interesting discussion of participation inequality (measured by the Gini coefficient) and the fact that this seemed, on average, greater for this project than for other biomedical Zooniverse projects, but there are few suggestions as to why this might be or what the implications are for similar projects, of which there are likely to be many in the near future given that health citizen science is a rapidly expanding field. Nevertheless, some exciting recommendations are made for the field, such as their support for efforts to create a set of standards for Mycobacterial antibiotic susceptibility testing, and a remark that the Zooniverse platform does not yet allow images to be withdrawn from classification once enough agreement has been established by volunteer classifications, even if the number of classifications is less than the recommended number (in this case 17). This means that, as well as having implications for the testing and treatment of cases of tuberculosis worldwide, this paper has possible implications for microbial citizen science methodology.
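
    The Gini coefficient mentioned above is computed over the number of classifications each volunteer contributed. A rough sketch with invented counts (not the project's data) shows how such a figure is typically obtained:

    ```python
    def gini(counts):
        """Gini coefficient of non-negative contribution counts.

        0 means every volunteer contributed equally; values near 1 mean a few
        volunteers did nearly all of the work.
        """
        xs = sorted(counts)
        n, total = len(xs), sum(xs)
        # Standard formula based on the rank-weighted cumulative sum
        weighted = sum((i + 1) * x for i, x in enumerate(xs))
        return (2 * weighted) / (n * total) - (n + 1) / n

    # Invented example: classifications contributed by ten volunteers
    print(round(gini([1, 1, 2, 2, 3, 5, 8, 20, 150, 800]), 2))   # 0.84, highly unequal
    ```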

  4. Reviewer #3 (Public Review):

    In this article the authors present a creative approach to address the limited availability of expert mycobacteriologists while providing accurate MIC testing: presenting MIC data to lay interpreters who provide reproducible, accurate results for a large number of samples. While overall the work is compelling, this creative, comprehensive analysis has a few important limitations that should be addressed.

    The authors report that the combination of the single expert classification and the AMyGDA classification represents a dataset with two complementary, and therefore negligible, sources of error. For many similar studies MIC reading would be performed by two independent reviewers; the authors should discuss their previous data supporting the assumption that this combination represents a valid reference against which the user-level comparisons can be compared. The use of this as the reference for lay interpretation may in part explain the high rate of image exclusion (50.3%), which may significantly change the results of the study. An alternative consideration could be that a high error rate for the AMyGDA software (implied in the text) restricts the validity of the crowdsourcing model to only those samples for which AMyGDA is also consistent with expert consensus, and as a result supports the use of the crowdsourcing model for MIC reads but limits the benefit of crowdsourcing beyond that which AMyGDA supplies on its own.

    In addition, the authors refer to an absence of reference standards for Mtb antimicrobial susceptibility testing (Results section, subheading "How to compare?" lines 3-5 and Discussion section 1st paragraph line 3). The 2018 "Technical report on critical concentrations for drug susceptibility testing of medicines used in the treatment of drug-resistant tuberculosis" published by WHO with FIND should be considered to serve this purpose. It is likely that the authors intended to refer to the absence of consensus methods for MIC determination, which would be better supported by their following paragraph. This should be clarified.

    Given that MIC testing currently serves primarily as a reference level test performed at sites with a greater availability of mycobacterial expertise, the idea described in the final paragraph of the discussion in which the authors suggest that crowdsourcing could replace the second reviewer may be of less benefit. While it is possible that such an approach could be effective, since the presented analysis represents neither a comparison between the crowdsourcing model and the full dataset, nor between the crowdsourcing model and a two expert reader dataset, such a consideration remains hypothetical. The presented conclusions would be strengthened were the authors to confirm similar benefit among those samples for which AMyGDA was ineffective - indicating that among samples for which automated software could not replace an expert read, crowdsourcing might. Alternatively, if crowdsourcing and AMyGDA were adequate on their own, the expert may not be required at all.