Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit
Abstract
Background
Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered “data hierarchy”. We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources.
Findings
In our pipeline model, an “interestingness function” assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a “policy” guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools for adopting this approach. We evaluate the toolkit with two microscopy imaging case studies. The first is a high-content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope.
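The relationship between the interestingness function and the policy can be illustrated with a minimal Python sketch. This is not the HASTE Toolkit API: the names `DataObject`, `interestingness`, `tier_policy`, and `prioritize`, the scoring rule (fraction of non-zero bytes), and the tier thresholds are all hypothetical, chosen only to show how a score induces a hierarchy and how a separate policy acts on it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class DataObject:
    """A hypothetical item in the stream, e.g. one microscopy image."""
    name: str
    payload: bytes


def interestingness(obj: DataObject) -> float:
    """Hypothetical score in [0, 1]: the fraction of non-zero bytes.

    A real function would encode a scientific notion of priority,
    such as image quality or the presence of a feature of interest.
    """
    if not obj.payload:
        return 0.0
    return sum(1 for b in obj.payload if b != 0) / len(obj.payload)


def tier_policy(score: float) -> str:
    """Hypothetical policy: map a score to a resource decision."""
    if score >= 0.75:
        return "fast-storage"
    if score >= 0.25:
        return "standard-storage"
    return "discard"


def prioritize(
    stream: Iterable[DataObject],
    score_fn: Callable[[DataObject], float],
    policy: Callable[[float], str],
) -> Iterator[Tuple[str, float, str]]:
    """Score each object as it arrives, then let the policy decide."""
    for obj in stream:
        score = score_fn(obj)
        yield obj.name, score, policy(score)


if __name__ == "__main__":
    demo_stream = [
        DataObject("img-a", bytes([1, 1, 1, 1])),
        DataObject("img-b", bytes([0, 0, 0, 1])),
    ]
    for name, score, decision in prioritize(demo_stream, interestingness, tier_policy):
        print(name, score, decision)
```

Keeping the scoring function and the policy as separate callables means the same score could drive a different decision (storage tier, upload order, compute allocation) in another deployment without changing how interestingness is computed.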
Conclusions
Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be “bolted on” to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios.
Now published in GigaScience doi: 10.1093/gigascience/giab018
Authors
- Ben Blamey (1), for correspondence: ben.blamey@it.uu.se
- Salman Toor (1)
- Martin Dahlö (2, 3)
- Håkan Wieslander (1)
- Philip J Harrison (2, 3)
- Ida-Maria Sintorn (1, 3, 4)
- Alan Sabirsh (5)
- Carolina Wählby (1, 3)
- Ola Spjuth (2, 3)
- Andreas Hellander (1)
Affiliations
- 1 Department of Information Technology, Uppsala University, Sweden
- 2 Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Sweden
- 3 Science for Life Laboratory, Uppsala University, Stockholm, Sweden
- 4 Vironova AB, Stockholm, Sweden
- 5 Advanced Drug Delivery, Pharmaceutical Sciences, R&D, AstraZeneca, Gothenburg, Sweden
A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab018), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
These peer reviews were as follows:
- Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102684
- Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102685
- Reviewer 3: http://dx.doi.org/10.5524/REVIEW.102686