Scalable analysis of multi-modal biomedical data
This article has been reviewed by the following groups
Listed in
- Evaluated articles (GigaScience)
Abstract
Background
Targeted diagnosis and treatment options depend on insights drawn from multi-modal analysis of large-scale biomedical datasets. Advances in genomic sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough look at the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex datatypes.
Solution
To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficult parts of designing distributed analyses over complex biomedical data types.
Performance
We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system is capable of outperforming the common alternative, based on “flattening” complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
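To make the "flattening" alternative mentioned above concrete, the following is a minimal sketch in plain Python. It is not TraNCE code and does not use Spark; the record layout, field names (`sample_id`, `mutations`, `candidates`, `impact`) and values are all hypothetical and chosen only for illustration. It shows how flattening a nested collection copies parent attributes onto every innermost row and, in this simple inner-join style, loses parents whose inner collection is empty.

```python
# Hypothetical nested records: each sample holds a list of mutations,
# and each mutation holds a list of candidate gene annotations.
samples = [
    {
        "sample_id": "S1",
        "tumor_site": "breast",
        "mutations": [
            {"mutation_id": "M1", "candidates": [
                {"gene": "TP53", "impact": 0.9},
                {"gene": "BRCA1", "impact": 0.4},
            ]},
            {"mutation_id": "M2", "candidates": [
                {"gene": "TP53", "impact": 0.7},
            ]},
        ],
    },
    {
        "sample_id": "S2",
        "tumor_site": "lung",
        "mutations": [
            {"mutation_id": "M3", "candidates": []},  # empty inner collection
        ],
    },
]


def flatten(nested):
    """Return one flat row per innermost candidate, copying parent fields."""
    rows = []
    for sample in nested:
        for mutation in sample["mutations"]:
            for cand in mutation["candidates"]:
                rows.append({
                    # Parent attributes are repeated on every row.
                    "sample_id": sample["sample_id"],
                    "tumor_site": sample["tumor_site"],
                    "mutation_id": mutation["mutation_id"],
                    "gene": cand["gene"],
                    "impact": cand["impact"],
                })
    return rows


flat_rows = flatten(samples)
for row in flat_rows:
    print(row)
```

In this toy input, the sample-level fields are repeated on each of the three resulting rows, and sample `S2` disappears because its only mutation has no candidates. At scale, this kind of duplication and the extra handling needed to preserve empty collections are among the costs usually associated with flattening nested data.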
This article is a preprint and has not been certified by peer review.
Jaclyn Smith (University of Oxford; correspondence: jaclyn.smith@cs.ox.ac.uk), Yao Shi (University of Oxford), Michael Benedikt (University of Oxford), Milos Nikolic (University of Edinburgh)
This work has been peer reviewed in GigaScience, which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
**Reviewer 1: JianJiong Gao**
In this manuscript, the authors introduced a tool named TraNCE for distributed processing and multimodal data analysis. While the topic and tool are interesting, the writing can be improved. The current manuscript reads more like a technical manual than a scientific paper.
For example, in the background, the discussion on data modeling in the contexts of multi-omics analysis and distributed systems is extensive, but the writing can be better organized. The examples are helpful, but they are very technical and can be hard to follow. It would be good if the main challenges can be summarized on a high level. It might also be useful to have an example analysis use case to lead the technical discussion on data modeling.
It is also unclear who the targeted users of the tool are and why distributed computing is needed. For example, in applications 1 and 2, it is unclear why distributed computing is necessary.
**Reviewer 2: Umberto Ferraro Petrillo** First review:
The authors propose a new framework, called TraNCE, for automating the design of distributed analysis pipelines over complex biomedical data types. They focus on the problem of unrolling references between different datasets (which can be very large), assuming that these datasets contain complex data types consisting of structured objects containing collections of other objects. By using TraNCE, it is possible to formulate queries over collections of nested data using a very high-level declarative language. These queries are then translated by TraNCE into Apache Spark applications able to implement them in an efficient and scalable way. Apart from a quick description of the TraNCE framework and of the declarative language it supports, the paper also includes a vast collection of examples of multi-omics analyses conducted using TraNCE on real-world data.
I found the contribution proposed by this paper to be very timely. Indeed, there is a flourishing of public multi-omics databases, but their huge volumes make their analysis difficult and very expensive if not approached with the right methodologies. Distributed analysis frameworks like Spark can be of help, but they are often not easy to master, especially for those without deep distributed programming skills. So, TraNCE looks like a much-needed contribution on this topic.
However, I have some remarks. The high-level querying language supported by TraNCE is not original because, as far as I understand, it has been presented in a previous paper [1] (which was written by almost the same authors and has been correctly referenced in this submission). Even the TraNCE framework is not completely original, because its name appears as the name of the project containing the code presented in [1]. Finally, at least one of the experiments presented in [1] seems to have been run on the same Hadoop installation used for the experiments presented in the current submission, and involved the same datasets from the International Cancer Genome Consortium. So, I am a bit confused about what is original in this new submission and what has been borrowed from [1]. My advice is to definitely clarify this point.
Another issue that I think should be addressed concerns the scalability of the proposed framework. The authors state that the framework supports scalable processing of complex datatypes; however, no evidence is provided for this claim. The several different experiments that are reported seem to focus more on the expressiveness of the proposed language, while no experiment is provided on the scalability of the generated code when run on a computing architecture of increasing size. I think we may agree that using Spark does not mean that your code is scalable, nor do I think it is enough to say that the scalability of TraNCE has been proved in [1]. So, I would suggest elaborating on this as well.
To be honest, I am a bit skeptical about the practical performance of the standard compilation route. I think that when applied to very large datasets it is likely to return huge RDDs that could require very long processing times. Instead, the shredded compilation route looks much smarter to me. Could you elaborate further on this difference, especially according to the results of your experiments?
I also disagree with your idea of not describing how data skewness is dealt with in your framework. It is indeed one of the main causes of bad performance in many distributed applications, so it would be interesting to know how you managed this problem in your particular case.
On the bright side, I really appreciated the flexibility of the proposed framework, as witnessed by the vast number of examples provided, as well as its positive implications for the analysis of multi-omics databases.
Finally, the English of the manuscript is very good and I have not been able to find any typos so far.
[1] Jaclyn Smith, Michael Benedikt, Milos Nikolic, and Amir Shaikhha. 2020. Scalable querying of nested data. Proc. VLDB Endow. 14, 3 (November 2020), 445-457.
Re-review: I appreciated the robust revision done by the authors and think the paper is now ready to be published.