Citation needed? Wikipedia bibliometrics during the first wave of the COVID-19 pandemic

Abstract

Background

With the COVID-19 pandemic’s outbreak, millions flocked to Wikipedia for updated information. Amid growing concerns regarding an “infodemic,” ensuring the quality of information is a crucial vector of public health. Investigating whether and how Wikipedia remained up to date and in line with science is key to formulating strategies to counter misinformation. Using citation analyses, we asked which sources informed Wikipedia’s COVID-19–related articles before and during the pandemic’s first wave (January–May 2020).

Results

We found that coronavirus-related articles referenced trusted media outlets and high-quality academic sources. Regarding academic sources, Wikipedia was found to be highly selective in terms of what science was cited. Moreover, despite a surge in COVID-19 preprints, Wikipedia had a clear preference for open-access studies published in respected journals and made little use of preprints. Building a timeline of English-language COVID-19 articles from 2001–2020 revealed a nuanced trade-off between quality and timeliness. It further showed how pre-existing articles on key topics related to the virus created a framework for integrating new knowledge. Supported by a rigid sourcing policy, this “scientific infrastructure” facilitated contextualization and regulated the influx of new information. Last, we constructed a network of DOI-Wikipedia articles, which showed the landscape of pandemic-related knowledge on Wikipedia and how academic citations create a web of shared knowledge supporting topics like COVID-19 drug development.

Conclusions

Understanding how scientific research interacts with the digital knowledge-sphere during the pandemic provides insight into how Wikipedia can facilitate access to science. It also reveals how, aided by what we term its “citizen encyclopedists,” it successfully fended off COVID-19 disinformation and how this unique model may be deployed in other contexts.

Article activity feed

  1. COVID-19 pandemic

    **Reviewer 3. Daniel Mietchen**

    This review includes supplemental files, videos and hypothes.is annotations of the preprint: https://zenodo.org/record/4909923

    The videos of the review process are also available on YouTube:

    Part 1 (Screen Recording 2021-06-05 at 10.02.02.mov): https://youtu.be/_UnDdE3Oi-4
    Part 2 (Screen Recording 2021-06-05 at 10.52.51.mov): https://youtu.be/z5xRK0lg3b4
    Part 3 (Screen Recording 2021-06-05 at 11.27.01.mov): https://youtu.be/VnztlEqFW2A
    Part 4 (Screen Recording 2021-06-07 at 02.51.59.mov): https://youtu.be/IYtLfMcLTvA
    Part 5 (Screen Recording 2021-06-07 at 06.11.52.mov): https://youtu.be/Jv_AUHCASQw
    Part 6 (Screen Recording 2021-06-07 at 18.07.45.mov): https://youtu.be/6Y-yA9oahzM
    Part 7 (Screen Recording 2021-06-07 at 19.07.02.mov): https://youtu.be/LV5whFhfmEU

    First round of review:

    Summary: The present manuscript provides an overview of how the English Wikipedia incorporated COVID-19-related information during the first months of the ongoing COVID-19 pandemic.

    It focuses on information supported by academic sources and considers how specific properties of the sources (namely their status with respect to open access and preprints) correlate with their incorporation into Wikipedia, as well as the role of existing content and policies in mediating that incorporation.

    No aspect of the manuscript would justify a rejection, but there are many opportunities for improvement, so "Major revision" appears to be the most appropriate recommendation at this point.

    General comments: The main points that need to be addressed better are: (1) documentation of the computational workflows; (2) adaptability of the Wikipedia approach to other contexts; (3) descriptions of or references to Wikipedia workflows; (4) linguistic presentation.

    Ad 1: While the code used for the analyses and for the visualizations seems to be shared rather comprehensively, it lacks sufficient documentation as to what was done in what order and what manual steps were involved. This makes it hard to replicate the findings presented here or to extend the analysis beyond the time frame considered by the authors.

    Ad 2: The authors allude to how pre-existing Wikipedia content and policies - which they nicely frame as Wikipedia's "scientific infrastructure" or "scientific backbone" - "may provide insight into how its unique model may be deployed in other contexts", but that potentially most transferable part of the manuscript - which would presumably be of interest to many of its readers - is not very well developed, even though that backbone is well described for Wikipedia itself.

    Ad 3: There is a good number of cases where the Wikipedia workflows are misrepresented (sometimes ever so slightly), and while many of these do not affect the conclusions, some actually do, and overall comprehension is hampered. I highlighted some of these cases, and others have been pointed out in community discussions, notably at https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_COVID-19&oldid=1028476999#Review_of_Wikipedia's_coverage_of_COVID and http://bluerasberry.com/2021/06/review-of-paper-on-wikipedia-and-covid/ . Some resources particularly relevant to these parts of the manuscript have not been mentioned, be it scholarly ones like https://arxiv.org/abs/2006.08899 and https://doi.org/10.1371/journal.pone.0228786 or Wikimedia ones like https://en.wikipedia.org/wiki/Wikipedia_coverage_of_the_COVID-19_pandemic and https://commons.wikimedia.org/wiki/File:Wikimedia_Policy_Brief_-COVID-19-_How_Wikipedia_helps_us_through_uncertain_times.pdf .

    Likewise essentially missing - although this is a common feature in academic articles about Wikipedia - is a discussion of how valid the observations made for the English Wikipedia are in the context of other language versions (e.g. Hebrew). On that basis, it is understandable that no attempt is made to look beyond Wikipedia to see how coverage of the pandemic was handled in other parts of the Wikimedia ecosystem (e.g. Wikinews, Wikisource, Wikivoyage, Wikimedia Commons and Wikidata), but doing so might actually strengthen the above case for deployability of the Wikipedia approach in other contexts. Disclosure: I am closely involved with WikiProject COVID-19 on Wikidata too, e.g. as per https://doi.org/10.5281/zenodo.4028482 .

    Ad 4: The relatively high number of linguistic errors - e.g. typos, grammar, phrasing and also things like internal references or figure legends - needlessly distracts from the value of the paper. The inclusion of figures - both via the text body and via the supplement - into the narrative is also sometimes confusing and would benefit from streamlining.

    While GigaScience has technically asked me to review version 3 of the preprint (available via https://www.biorxiv.org/content/10.1101/2021.03.01.433379v3 and also via GigaScience's editorial system), that version was licensed incompatibly with publication in GigaScience, so I pinged the authors on this (via https://twitter.com/EvoMRI/status/1393114202349391872 ), which resulted (with some small additional changes) in the creation of version 4 (available via https://www.biorxiv.org/content/10.1101/2021.03.01.433379v4 ) that I concentrated on in my review.

    Production of that version 4 - of which I eventually used both the PDF and the HTML, which became available to me at different times - took a while, during which I had a first full read of the manuscript in version 3.

    In an effort to explore how to make the peer review process more transparent than simply sharing the correspondence, I recorded myself while reading the manuscript for the second time, commenting on it live. These recordings are available via https://doi.org/10.5281/zenodo.4909923 .

    In terms of specific comments, I annotated version 4 directly using Hypothes.is, and these annotations are available via https://via.hypothes.is/https://www.biorxiv.org/content/10.1101/2021.03.01.433379v4.full .

    Re-review: I welcome the changes the authors have made - both to the manuscript itself (of which I read the bioRxiv version 5) and to the WikiCitationHistoRy repo - in response to the reviewer comments. I also noticed comments they chose not to address, but as stated before, none of these would be grounds for rejection. What irritates me is the question of whether proofreading actually took place before the current version 5 was posted. For instance, reference 44 seems to be missing (some others are missing in the bioRxiv version, but I suspect that's not the authors' fault), while lots of linguistic issues in phrases like "to provide a comprehensive bibliometric analyses of english Wikipedia's COVID-19 articles" would still benefit from being addressed. At this point, I thus recommend that the authors (a) update the existing Zenodo repository such that there is some more structure in the way the files are shared, and (b) archive a release of WikiCitationHistoRy on Zenodo.

  2. Background

    **Reviewer 2. Dean Giustini** This is a well-written manuscript. The methods are well described. I've confined my comments to improving the reporting of your methods, some comments about the paper's structure, and a few about the readability of the figures and tables (which I think are generally too small and difficult to read). Here are my main comments for your consideration as you work to improve your paper:

    1. Title of manuscript - the title of your paper seems inadequate to me, and doesn't really convey its content. A more descriptive title that includes the idea of the "first wave" might be useful from my point of view as a reader who scans titles to see if I am interested. I'd recommend including words in the title that refer to your methods. What type of research is this - a quantitative analysis of citations? Title words say a lot about the robust nature of your methods. As you consider whether to keep your title as is, keep in mind that title words will aid readers in understanding your research at a glance, and provide impetus to read your abstract (and one hopes the entire manuscript). These words will also help researchers find the paper later via the Internet's many search engines (e.g., Google Scholar).

    2. Abstract - The abstract is well written. Could the aims of your research be more obvious and more clearly articulated? How about using a statement such as "This research aims to" or similar? I also don't understand the sentence that begins with "Using references as a readout". What is meant by a "readout" in this context? Do you mean to read a print-out of references later? Lower down, you introduce the concept of Wikipedia's references as a "scientific infrastructure", and place it in quotations. Why is it in quotations? I wondered what the concept was on first reading it. A recurring web of papers in Wikipedia constitutes a set of core references - but would I call them a scientific infrastructure? Not sure; they are a mere sliver of the scientific corpus. I am not sure I have any suggestions to clarify the use of this phrase.

    3. Introduction - This is an excellent introduction to your paper, and it provides a lot of useful context and background. You make a case for positioning Wikipedia as a trusted source of information based on the highly selective literature cited by the entries. However, I would only caution that some COVID-19 entries cite excellent research but the content is contested, and vice versa. One suggestion I had for this section was the possibility of tying citizen science (part of open science) to the rise of Wikipedia's medwiki volunteers. Wikipedia provides all kinds of ways for citizens to get involved in science. As an open science researcher, I appreciated all of the open aspects you mention. Clearly, open access to Wikipedia in all languages is a driving force in combatting misinformation generally, and the COVID "infodemic" specifically. I admit I struggled to understand the point of the section that begins, "Here, we asked what role does scientific literature, as opposed to general media, play in supporting the encyclopedia's coverage of the COVID-19 as the pandemic spread." The opening sentence articulates your a priori research question, always welcome for readers. Would some of the information that follows in this section around your methods be better placed in the following section under the "Material and Methods"? I found it jarring to read that "....after the pandemic broke out we observed a drop in the overall percentage of academic references in a given coronavirus article, used here as a metric for gauging scientificness in what we term an article's Scientific Score." These two ideas are introduced again later, but I had no idea on reading them here what they signified or whether they were related to research you were building on. You might consider adding a parenthetical statement that they will be described later, and that the idea of a score is your own.

    4. Material and methods - Your methods section might benefit from writing a preamble to prepare your readers. As already mentioned, consider taking some of the previous section and recasting it as an introduction to your methods. Consider adding some information to orient readers, and elaborating in a sentence or two about why identifying COVID-19 citations / information sources is an important activity.

    By the way, what is meant by this: "To delimit the corpus of Wikipedia articles containing DOIs"? Do you mean "identify" Wikipedia articles with DOIs in their references? As I mentioned (apologies in advance for the repetition), it strikes me as odd that you don't refer to this research as a form of citation analysis (isn't that what it is?). Instead you characterize it as "citation counting". If your use of words has been intentional, is there a distinction you are making that I simply do not understand? Also: bibliometricians and/or scientometricians might wonder why you avoid the phrase citation analysis. Further to your methods which are primarily quantitative and statistical - what are the qualitative methods used throughout the paper to analyze the data? How did you carry out this qualitative work? (On page 10, you state "we set out to examine in a temporal, qualitative and quantitative manner, the role of references in articles linked directly to the pandemic as it broke.") That part of your methods seems to be a bit under-developed, and may be worth reconsidering as you work to improve your reporting in the manuscript.

    1. Table 1. I am not sure what this table adds to the methods given it leads off your visuals. Do you really need it? It doesn't reveal anything to me and could be in a supplemental file. I also have difficulties in properly seeing table 1; perhaps you could make it larger and more readable?

    2. Figure 1. This is the most informative visual in the paper but it is hard to read and crowded. It deserves more space or the information it provides is not fully understood.

    3. Figure 3. This is very bulky as a figure, although informative. Again, I'm not sure all of it needs inclusion. Perhaps select part of it, and include other parts in a supplement.

    4. Limitations - The paper does not adequately address its limitations. A fuller evaluation of limitations would be beneficial to me as a reader, as it would place your work in a larger context. For example, consider asking whether the results are indicative of Wikipedia's other medical or scientific entries. Or are the results not generalizable at all? In other words, are they indicative of something very limited based on the timeframe that you examined? I found myself disagreeing with: "....the mainstream output of scientific work on the virus predated the pandemic's outbreak to a great extent". Is this still true? And what might its significance be now that we are in 2021? Would it be helpful to say that most of the foundational research on the family of coronaviruses was published pre-2020, but that entries about COVID-19 disease and treatment are now distinctly different in terms of papers cited, especially going forward? Wiki editors identify relevant papers over time but are not adept at identifying emerging evidence in my experience, or at incorporating important papers early; this is strange given that recency is one of Wikipedia's true calling cards. For me, the most confounding aspect of the infodemic is the constant shifts of evidence, and how to respond in a way that is prudent and evidence-based. As you point out, Wikipedia has an 8.7-year latency in citing highly relevant papers - and it seems likely that many important COVID-19 papers were neglected in Wikipedia in the first wave, especially those about the disease. As you point out, this will form part of future research, which I hope you and your team will pursue.

    5. Reference 31 lacks a source: Amit Arjun Verma and S. Iyengar. Tracing the factoids: the anatomy of information reorganization in wikipedia articles. 2021.

    Good luck with the next stages in improving your manuscript for publication. I believe it adds to our understanding of Wikipedia's role in promoting sources of information.

  3. Abstract

    This paper has been published in GigaScience under a CC-BY 4.0 license (see: https://doi.org/10.1093/gigascience/giab095). As the journal carries out open peer review, the reviews have also been published under the same license.

    **Reviewer 1. Dariusz Jemielniak** This is a very solid article on a timely topic. I also commend you for the thorough and meticulous methodology.

    One thing that I believe you could expand on is what your proposed solution to the "trade off between timeliness and scientificness" would be. After all, Wikipedia relies on sources that are reliable, verifiable, but above all... available. At a time when no academic journal articles have been published (yet), the chosen modus operandi does not appear to be a trade-off; it is basically the only logical solution. A trade-off would occur if the less valuable sources were not replaced when more academic ones appear, and this is not the case. I believe you should mention the fact that Wikipedia has an agreement with the Cochrane database, which likely affects the popularity of this source.

    Additionally, I think that the literature review needs to be expanded. There are already some publications about Wikipedia and COVID-19, as well as about medical coverage on Wikipedia (some non-exhaustive references added below). Moreover, Wikipedia has been a topic covered in GigaScience and it would be reasonable to reflect on the previous conversations in the journal in your publication.

    Chrzanowski, J., Sołek, J., & Jemielniak, D. (2021). Assessing Public Interest Based on Wikipedia's Most Visited Medical Articles During the SARS-CoV-2 Outbreak: Search Trends Analysis. Journal of Medical Internet Research, 23(4), e26331.

    Colavizza, G. (2020). COVID-19 research in Wikipedia. Quantitative Science Studies, 1-32.

    Jemielniak, D. (2019). Wikipedia: Why is the common knowledge resource still neglected by academics?. GigaScience, 8(12), giz139.

    Jemielniak, D., Masukume, G., & Wilamowski, M. (2019). The most influential medical journals according to Wikipedia: quantitative analysis. Journal of Medical Internet Research, 21(1), e11429.

    Kagan, D., Moran-Gilad, J., & Fire, M. (2020). Scientometric trends for coronaviruses and other emerging viral infections. GigaScience, 9(8), giaa085.

  4. SciScore for 10.1101/2021.03.01.433379:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: not detected.

    Table 2: Resources

    Software and Algorithms

    Sentence: Subsequently, several statistics were computed for each Wikipedia article and information for each of their DOI were retrieved using Altmetrics, CrossRef and EuroPMC R packages.
    Resources:
    Altmetrics - suggested: None
    CrossRef - suggested: (CrossRef, RRID:SCR_003217)

    Sentence: This metric, called Sci Score, is defined by the ratio of academic as opposed to non-academic references any Wikipedia article includes, as such:

    Our investigation, as noted, also included an analysis into the latency (8) of any given DOI citation on Wikipedia.
    Resources:
    Wikipedia - suggested: (Wikipedia, RRID:SCR_004897)
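
As a rough illustration of the two metrics named in the sentences above, the following Python sketch computes a Sci Score and a citation latency. The formula itself is not reproduced in this report, so the definition used here (academic references divided by all references) is an assumption, as are the `Reference` structure, the function names, and the rule that a reference counts as academic when it carries a DOI or a PMID. The authors' own analysis used R packages; this is not their implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Reference:
    """One reference in a Wikipedia article (hypothetical structure)."""
    doi: Optional[str] = None
    pmid: Optional[str] = None


def is_academic(ref: Reference) -> bool:
    # Assumption: a reference is "academic" if it has a DOI or a PMID.
    return bool(ref.doi or ref.pmid)


def sci_score(refs: list[Reference]) -> float:
    """Share of academic references among all references (assumed definition)."""
    if not refs:
        return 0.0
    return sum(is_academic(r) for r in refs) / len(refs)


def latency_years(published: date, first_cited: date) -> float:
    """Years between a paper's publication and its first citation on Wikipedia."""
    return (first_cited - published).days / 365.25
```

Under these assumptions, an article with two academic references out of four would get a score of 0.5, and an article without references is scored 0.0 rather than raising a division error.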

    Results from OddPub: Thank you for sharing your code and data.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    We tried to address these limitations using technical solutions, such as regular expressions for extracting URLs, hyperlinks, DOIs and PMIDs. In this study, we retrieved most of our scientific literature metadata using Altmetrics (30, 31), EuroPMC (32) and CrossRef (33) R APIs. However, this method was not without limitations and we could not, for example, retrieve all of the extracted DOIs metadata. Moreover, information regarding open access (among others) varied with quality between the APIs (34). In addition, our preprint analysis was mainly focused on MedRxiv and BioRxiv which have the benefit of having a distinct DOI prefix. Unfortunately, no better solution could be found to annotate preprints from the extracted DOIs. Preprint servers do not necessarily use the DOI system (35) (e.g. ArXiv) and others share DOI prefixes with published paper (for instance the preprint server used by The Lancet). Moreover, we developed a parser for general citations (news outlets, websites, publishers), and we could not properly clean redundant entries (e.g. “WHO”, “World Health Organisation”). Finally, as Wikipedia is constantly changing, some of our conclusions are bound to change. Therefore, our study is focused on the pandemic’s first wave and its history, crucial to examine the dynamics of knowledge online at a pivotal timeframe. In summary, our findings reveal a trade off between quality and scientificness in regards to scientific literature: most of Wikipedia’s COVID-19 content was...
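
The regular-expression and DOI-prefix approach described in these sentences can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the pattern is a commonly used approximation for DOIs rather than a complete grammar, the function names are hypothetical, and the use of `10.1101` as the shared bioRxiv/medRxiv prefix is an assumption stated here (that prefix is also used by other Cold Spring Harbor Laboratory publications, so the check can over-match).

```python
import re

# Approximate DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# that stops at whitespace or common wikitext delimiters. Not exhaustive.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s|}\]<>"]+')

# Assumption: bioRxiv and medRxiv DOIs share this prefix.
PREPRINT_PREFIXES = ("10.1101/",)


def extract_dois(wikitext: str) -> list[str]:
    """Return unique DOIs found in the text, in order of first appearance."""
    seen: dict[str, None] = {}
    for match in DOI_PATTERN.findall(wikitext):
        # Drop trailing punctuation picked up from surrounding prose.
        doi = match.rstrip(".,;)").lower()
        seen.setdefault(doi, None)
    return list(seen)


def is_preprint(doi: str) -> bool:
    """Flag a DOI as a likely bioRxiv/medRxiv preprint by its prefix alone."""
    return doi.lower().startswith(PREPRINT_PREFIXES)
```

A prefix check of this kind cannot see preprints that have no DOI or that share a prefix with published papers, which is the limitation the quoted sentences describe.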

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.