Generating Correlated Data for Omics Simulation

Jianing Yang
Gregory R. Grant
Thomas G. Brooks

Abstract

Simulation of realistic omics data is a key input for benchmarking studies that help users choose optimal computational pipelines. Omics data involve large numbers of measured features on each sample, and these measures are generally correlated with each other. However, simulation too often ignores these correlations, perhaps because of the inconvenience and computational hurdles of including them. To alleviate this, we describe in detail three approaches for quickly generating omics-scale data with correlated measures that mimic real data sets. All three approaches are based on a Gaussian copula with a covariance matrix that decomposes into a diagonal part and a low-rank part. We use these approaches to demonstrate the importance of including correlation in two benchmarking applications. First, we show that the variance of results from the popular DESeq2 method increases when dependence is included. Second, we demonstrate that CYCLOPS, a method for inferring circadian time of collection from transcriptomics, improves in performance when given gene-gene dependencies in some circumstances. We provide an R package, dependentsimr, that has efficient implementations of these methods and can generate dependent data with arbitrary marginal distributions, including discrete (binary, ordered categorical, Poisson, negative binomial), continuous (normal), and empirical distributions.
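
To make the construction in the abstract concrete, the following is a minimal Python sketch of a Gaussian copula whose latent covariance has the form D + U Uᵀ (diagonal plus low-rank), using empirical marginals. It is an illustration of the general idea only, not the dependentsimr API and not any of the paper's three specific approaches; the function names (`sample_latent`, `empirical_quantile`, `simulate`) and the inputs `diag` and `loadings` are hypothetical and assumed to be supplied by the user rather than fitted from data.

```python
import math
import random
from statistics import NormalDist


def sample_latent(diag, loadings, rng):
    """Draw one z ~ N(0, D + U U^T) without forming the p x p covariance.

    diag:     length-p list of positive variances (diagonal part D)
    loadings: p x k nested list (low-rank factor U); both are assumed inputs
    """
    k = len(loadings[0])
    shared = [rng.gauss(0.0, 1.0) for _ in range(k)]  # k shared factors
    return [
        math.sqrt(d) * rng.gauss(0.0, 1.0)            # independent part
        + sum(u * w for u, w in zip(row, shared))     # correlated part
        for d, row in zip(diag, loadings)
    ]


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: smallest observed value with ECDF >= u."""
    n = len(sorted_values)
    idx = min(n - 1, max(0, math.ceil(u * n) - 1))
    return sorted_values[idx]


def simulate(reference, diag, loadings, n_samples, seed=0):
    """Simulate n_samples rows of p correlated features.

    reference: p lists of observed values, one per feature; each simulated
               feature is drawn from its own empirical marginal.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    sorted_ref = [sorted(col) for col in reference]
    # Marginal standard deviation of each latent coordinate:
    # sqrt(D_ii + sum_j U_ij^2), used to rescale z to unit variance.
    sds = [
        math.sqrt(d + sum(u * u for u in row))
        for d, row in zip(diag, loadings)
    ]
    rows = []
    for _ in range(n_samples):
        z = sample_latent(diag, loadings, rng)
        # Normal CDF maps each standardized latent value to (0, 1).
        unif = [std_normal.cdf(zi / s) for zi, s in zip(z, sds)]
        rows.append(
            [empirical_quantile(col, ui) for col, ui in zip(sorted_ref, unif)]
        )
    return rows


if __name__ == "__main__":
    ref = [list(range(10)), list(range(0, 100, 10)), [0, 1]]
    # Features 0 and 1 share one latent factor; feature 2 is independent.
    sim = simulate(ref, [0.1, 0.1, 1.0], [[1.0], [1.0], [0.0]], n_samples=5)
    for row in sim:
        print(row)
```

The point of the diagonal-plus-low-rank form in this sketch is cost: each draw needs only p + k standard normal variates and O(pk) arithmetic, so the full p × p covariance matrix is never built or factorized, which matters when p is in the tens of thousands.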

Version published to 10.1101/2025.01.31.634335v1 on bioRxiv
Feb 6, 2025

Related articles

vcfsim: flexible simulation of all-sites VCFs with missing data

This article has 2 authors:
1. Paimon Goulart
2. Kieran Samuk
This article has no evaluations. Latest version Feb 2, 2025

SHINE: Deterministic Many-to-Many clustering of Molecular Pathways

This article has 5 authors:
1. Lexin Chen
2. Jeremy M. G. Leung
3. Krisztina Zsigmond
4. Lillian T. Chong
5. Ramón Alain Miranda-Quintana
This article has no evaluations. Latest version Feb 8, 2025

Optimal Inference of Asynchronous Boolean Network Models

This article has 1 author:
1. Guy Karlebach
This article has no evaluations. Latest version Mar 17, 2025
