polars-bio – fast, scalable and out-of-core operations on large genomic interval datasets

Abstract

Motivation

Genomic studies very often rely on computationally intensive analyses of relationships between features, which are typically represented as intervals along a one-dimensional coordinate system (such as positions on a chromosome). In this context, the Python programming language is extensively used for manipulating and analyzing data stored in a tabular form of rows and columns called a dataframe. Pandas is the most widely used Python dataframe package, but it has been criticized for inefficiency and scalability problems that its newer alternative, Polars, tries to address with a vectorized backend written in the Rust programming language.

Results

polars-bio is a Python library that enables fast, parallel and out-of-core operations on large genomic interval datasets. Its main components are implemented in Rust, using the Apache DataFusion query engine and Apache Arrow for efficient data representation. It is compatible with Polars and Pandas DataFrame formats. Single-thread benchmarking results confirm that polars-bio is 38x faster for the count overlaps operation and 15x faster for the coverage operation than the state-of-the-art bioframe library. Our implementation of the overlap operation consumes 90x less memory in streaming mode. Multi-thread benchmarks show good scalability and up to 282x speedup for the count overlaps operation when executed with 8 CPU cores. To the best of our knowledge, polars-bio is the fastest single-node library for genomic interval dataframes in Python.
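
For readers unfamiliar with the count overlaps operation benchmarked above, the following is a minimal pure-Python sketch of what it computes. It is not the polars-bio implementation or API: the function name `count_overlaps`, the tuple layout `(chrom, start, end)` and the half-open `[start, end)` convention are assumptions made for illustration only. It uses the common sorted-endpoints technique, in which the count for a query equals the number of reference intervals that start before the query ends minus the number that end at or before the query starts.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def count_overlaps(queries, references):
    """For each query (chrom, start, end), count overlapping reference intervals.

    Assumption of this sketch: intervals are non-empty and half-open,
    i.e. [start, end), so intervals that merely touch do not overlap.
    """
    # Collect reference starts and ends separately for each chromosome.
    starts = defaultdict(list)
    ends = defaultdict(list)
    for chrom, start, end in references:
        starts[chrom].append(start)
        ends[chrom].append(end)
    for chrom in starts:
        starts[chrom].sort()
        ends[chrom].sort()

    counts = []
    for chrom, start, end in queries:
        # Reference intervals that begin strictly before the query ends.
        begun = bisect_left(starts.get(chrom, []), end)
        # Reference intervals that end at or before the query start.
        finished = bisect_right(ends.get(chrom, []), start)
        counts.append(begun - finished)
    return counts


references = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 20)]
queries = [("chr1", 180, 190), ("chr1", 200, 250), ("chr1", 300, 400),
           ("chr2", 0, 10), ("chr3", 1, 2)]
result = count_overlaps(queries, references)
print(result)
```

After sorting, each query costs two binary searches, which is why this operation lends itself to the vectorized and parallel execution described above.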

Availability and implementation

polars-bio is an open-source Python package distributed under the Apache License and available in the PyPI registry for all main platforms, such as Linux, macOS and Windows. The web page is https://biodatageeks.org/polars-bio/ and the source code is available on GitHub: https://github.com/biodatageeks/polars-bio
