A field guide for the compositional analysis of any-omics data

Abstract

Background

Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared.
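The point that count magnitude reflects sequencing depth rather than input material can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the abundances are made up, `sequence` is a hypothetical helper that models idealised sequencing (reads spread exactly in proportion to abundance, with no sampling noise), and none of it comes from the article itself.

```python
# Hypothetical absolute abundances (molecules) for four features in two samples.
# Only feature "A" truly changes (10-fold increase); B, C and D stay constant.
absolute_1 = {"A": 100, "B": 300, "C": 300, "D": 300}
absolute_2 = {"A": 1000, "B": 300, "C": 300, "D": 300}


def sequence(absolute, depth):
    """Idealised sequencing (assumption): expected counts when `depth` reads
    are distributed in proportion to absolute abundance."""
    total = sum(absolute.values())
    return {name: depth * value / total for name, value in absolute.items()}


def proportions(counts):
    """Divide each count by the library size (total-sum scaling)."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


counts_1 = sequence(absolute_1, depth=1_000_000)
counts_2 = sequence(absolute_2, depth=1_000_000)
counts_1_deep = sequence(absolute_1, depth=5_000_000)

# Same sample, different depth: raw counts differ, proportions do not.
print(counts_1["B"], counts_1_deep["B"])
print(proportions(counts_1)["B"], proportions(counts_1_deep)["B"])

# Different samples, same depth: B looks lower in sample 2 even though its
# absolute abundance never changed, because A takes a larger share of reads.
print(counts_1["B"], counts_2["B"])

# The ratio between two unchanged features is preserved in both samples.
print(counts_1["B"] / counts_1["C"], counts_2["B"] / counts_2["C"])
```

In this toy case the apparent drop in B is produced entirely by the increase in A, which is the kind of situation where the assumption that most features are unchanged matters.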

Results

Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data.

Conclusions

In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”
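As a rough illustration of what a log-ratio transformation does, the sketch below implements two common variants in plain Python: the centered log-ratio (each part relative to the geometric mean of the sample) and the additive log-ratio (each part relative to one chosen reference part). The function names `clr` and `alr` and the example counts are assumptions made for this sketch, not the authors' code. It also assumes every count is non-zero, since the logarithm of zero is undefined and zeros would need to be handled beforehand.

```python
import math


def clr(counts):
    """Centered log-ratio: log of each part relative to the geometric mean
    of all parts in the sample. Assumes all counts are > 0."""
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


def alr(counts, ref_index):
    """Additive log-ratio: log of each part relative to the part at
    `ref_index`; the reference itself is dropped from the output."""
    ref_log = math.log(counts[ref_index])
    return [math.log(c) - ref_log for i, c in enumerate(counts) if i != ref_index]


sample = [120, 30, 450, 60]        # hypothetical non-zero counts
deeper = [c * 7 for c in sample]   # same composition at 7x the depth

# Both transformations give the same values regardless of sequencing depth,
# up to floating-point error.
print(clr(sample))
print(clr(deeper))
print(alr(sample, ref_index=0))
print(alr(deeper, ref_index=0))
```

The choice of reference determines how the result is read: with `alr`, each value describes change relative to the chosen reference feature, whereas with `clr` it describes change relative to the sample's geometric mean.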

Article activity feed

  1. Now published in GigaScience doi: 10.1093/gigascience/giz107

    Authors: Thomas P. Quinn (1, 2), Ionas Erb (3), Greg Gloor (4), Cedric Notredame (3), Mark F. Richardson (1, 5, 6), Tamsyn M. Crowley (7)

    Affiliations: (1) Bioinformatics Core Research Group, Deakin University, 3220, Geelong, Australia; (2) Centre for Molecular and Medical Research, Deakin University, 3220, Geelong, Australia; (3) Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr Aiguader 88, 08003, Barcelona, Spain; (4) Department of Biochemistry, University of Western Ontario, London, Ontario, Canada; (5) Genomics Centre, School of Life and Environmental Sciences, Deakin University, 3220, Geelong, Australia; (6) Centre for Integrative Ecology, School of Life and Environmental Sciences, Deakin University, 3220, Geelong, Australia; (7) Poultry Hub Australia, University of New England, 2351, Armidale, Australia

    A version of this preprint has been published in the Open Access journal GigaScience (see https://doi.org/10.1093/gigascience/giz107), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    These peer reviews were as follows:

    Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101917

    Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101918