sampleclusteR: A lightweight R package for automated clustering of transcriptomics samples using metadata
Abstract
Background
As technologies for genome-wide gene-expression analysis continue to develop, the databases storing the resulting data have expanded accordingly. The Gene Expression Omnibus (GEO) now holds over 250,000 data series across more than 25,000 omics platforms. Likewise, ArrayExpress comprises over 70,000 transcriptome and methylome datasets. Meta-analyses of data from these databases can be challenging, because they typically require extensive manual grouping of samples to identify the experimental groups for comparison.
Results
Here we present sampleclusteR, a lightweight R package that automates the clustering of gene-expression study samples based on their metadata. To demonstrate the utility of the approach for large-scale analysis of GEO data, we analysed 275 GEO data series with the package. sampleclusteR correctly clustered 4694 of the 5081 samples (92.4%) across the 275 data series in an unsupervised manner. In addition, 250 datasets from ArrayExpress were analysed, with 8547 of the 9154 samples (93.4%) automatically clustered into the correct groups. We show how sampleclusteR can be used to automate the analysis of gene-expression datasets by conducting a meta-analysis of multiple GEO data series related to the Wnt signaling pathway. sampleclusteR assigned all samples to the correct experimental groups and identified sets of differentially expressed genes for downstream analysis.
Conclusions
sampleclusteR enables large-scale analysis of data from GEO and ArrayExpress by automatically clustering the samples in each dataset through text mining of their metadata tables.
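To make the idea of clustering samples by text mining their metadata concrete, the following is a minimal sketch in Python. It is not the sampleclusteR implementation or its API: the function names, the replicate-marker pattern, and the sample identifiers and titles are all hypothetical, and the grouping rule (strip replicate markers from each sample title, then group identical remainders) is only one simple way such clustering might be approached.

```python
import re
from collections import defaultdict

# Hypothetical pattern for replicate markers such as "rep1" or
# "biological replicate 2"; an assumption, not taken from sampleclusteR.
REPLICATE = re.compile(r"\b(?:biological\s+)?(?:replicate|rep)\s*\d+\b")


def normalise(title: str) -> str:
    """Reduce a sample title to a group label by removing replicate markers."""
    # Lowercase and turn common separators into spaces so that
    # "WT_rep1" and "WT rep1" are treated alike.
    text = re.sub(r"[_\-,;]+", " ", title.lower())
    text = REPLICATE.sub(" ", text)
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


def cluster_samples(samples: dict[str, str]) -> dict[str, list[str]]:
    """Group sample IDs whose normalised titles are identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for sample_id, title in samples.items():
        groups[normalise(title)].append(sample_id)
    return dict(groups)


# Hypothetical sample titles, loosely styled on a GEO series.
example = {
    "GSM1": "WT_rep1",
    "GSM2": "WT_rep2",
    "GSM3": "Wnt3a-treated rep1",
    "GSM4": "Wnt3a treated, replicate 2",
}

for label, ids in cluster_samples(example).items():
    print(label, ids)
```

Exact matching after normalisation keeps the sketch short; real metadata is noisier, so a practical tool would likely need fuzzier comparison across several metadata fields rather than the title alone.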