Bioinformatics: Integrate negative controls to get the good data

Rob van Nues


Abstract

High-throughput datasets, like any experimental output, can be full of noise. Negative controls, i.e. mock experiments that provide no information about the biological system under study, make this background visible. Overlooking this ‘training set’ of wrong examples in publicly available datasets can seriously undermine the validity of bioinformatics analyses. We present a program, COALISPR, for explicit and transparent application of negative control data in the comparison of high-throughput sequencing results. This yields mapping coordinates that guide fast counting of reads, bypassing the need for a reference file, and is especially relevant when small RNA sequencing libraries contaminated with breakdown products are analysed for poorly annotated organisms.

We have re-analysed small RNA datasets for mouse and for the fungus Cryptococcus neoformans, leading to consistent identification of miRNAs and of fungal transcripts targeted by siRNAs. Cryptococcal Argonautes are directed to spliced transcripts, indicating that RNAi must be triggered by events downstream of intron removal. Negative control datasets contain large amounts of ribosomal RNA (rRNA) fragments (rRFs). These differ from small RNAs associated with RNAi, making a biological role for rRFs in association with Argonautes unlikely. Background signals enabled identification of cryptococcal genes for RNase P, U1 snRNA, 37 H/ACA and 63 Box C/D snoRNAs, including U3 and U14, which are essential for pre-rRNA processing. To gain meaning, high-throughput RNA-Seq analyses need to incorporate negative data.
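The idea of using negative control data to separate signal from background can be illustrated with a minimal sketch. This is not COALISPR's actual algorithm or API: the function name `classify_regions`, the region layout, the fold threshold, and the pseudocount are all hypothetical choices made for illustration. It assumes read counts per mapped region are already available for a sample and for its negative control.

```python
from typing import Dict, Tuple

# Hypothetical region layout: (chromosome, start, end)
Region = Tuple[str, int, int]


def classify_regions(
    sample: Dict[Region, int],
    control: Dict[Region, int],
    min_fold: float = 10.0,    # assumed threshold, not taken from the article
    pseudocount: float = 1.0,  # avoids division by zero when the control has no reads
) -> Dict[Region, str]:
    """Label each region covered by sample reads as 'specific' or 'unspecific'."""
    labels: Dict[Region, str] = {}
    for region, n_sample in sample.items():
        n_control = control.get(region, 0)
        fold = (n_sample + pseudocount) / (n_control + pseudocount)
        labels[region] = "specific" if fold >= min_fold else "unspecific"
    return labels


# Toy counts (invented for the example, not real data)
sample_counts = {
    ("chr1", 100, 122): 500,  # absent from the control
    ("chr1", 900, 940): 300,  # similarly abundant in the control
    ("chr2", 50, 71): 40,     # absent from the control
}
control_counts = {("chr1", 900, 940): 280}

labels = classify_regions(sample_counts, control_counts)
specific = sorted(r for r, lab in labels.items() if lab == "specific")
```

In a sketch like this, the coordinates in `specific` could then guide counting of reads in further samples, while regions labelled 'unspecific' remain available for inspection rather than being discarded.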

GRAPHICAL ABSTRACT

Version published to 10.1101/2024.10.08.617225 on bioRxiv
Oct 9, 2024
