ParTIpy: A Scalable Framework for Archetypal Analysis and Pareto Task Inference
This article has been reviewed by the following groups:
- Evaluated articles (Review Commons)
Abstract
Motivation
Trade-offs between different functions or tasks are pervasive across scales in biological systems. For example, individual cells cannot perform all possible functions simultaneously; instead they allocate limited resources to specialize in subsets of tasks by activating specific gene expression programs. Pareto Task Inference (ParTI) is a framework for analyzing biological trade-offs grounded in the theory of multi-objective optimality. However, existing software implementations of ParTI lack scalability to large datasets and do not integrate well with standard biological data analysis workflows, especially in the context of single-cell transcriptomics, limiting broader adoption.
Results
We have developed ParTIpy (Pareto Task Inference in Python), an open-source Python package that combines advances in optimization and coreset methods to scale archetypal analysis, the primary algorithm underlying ParTI, to large-scale datasets. By providing additional tools to characterize archetypes, comprehensive documentation, and adopting standard scverse data structures, ParTIpy facilitates seamless integration into existing analysis workflows and broadens accessibility, particularly within the single-cell community. We demonstrate how ParTIpy can be used to study intra-cell-type gene expression variability through the lens of task allocation, offering a principled alternative to methods that impose discrete cell state classifications on inherently continuous variation.
Availability and implementation
ParTIpy’s open-source code is available on GitHub (https://github.com/saezlab/ParTIpy) and PyPI (https://pypi.org/project/partipy). Documentation is available at https://partipy.readthedocs.io. The code to reproduce the results of this paper is on GitHub (https://github.com/saezlab/ParTIpy_paper).
Article activity feed
Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.
Learn more at Review Commons
Reply to the reviewers
1. General Statements
Thank you for providing an assessment of our manuscript. We suggest here a revision plan to address the points raised by the reviewers regarding code documentation, benchmarking, and biological applications.
As part of the revisions implemented we have:
- Clarified the management of dependencies of our package
- Fixed the slow download times of the test data
- Clarified the parameters of the normalization and optimization functions

We plan to:
- Extend our manuscript to include a section on cross-condition analysis that builds on our tutorials, where we will illustrate how ParTIpy can quantify shifts in the distribution of fibroblasts across the functional space defined by archetypal analysis between healthy and failing hearts.
- Extend our benchmarks of the scalability of coresets by reporting wall-clock time and peak memory usage across distinct data sizes.
- Extend our benchmarks of the stability of coresets by reporting the similarity of the estimated archetypes based on the original versus the sampled data.
- Include the original enrichment analysis of ParTI to provide users with distinct options to work with the archetypes, and provide a larger discussion of the distinct strategies.

We believe these revisions will strengthen our software manuscript and will help us provide a robust and practical tool to analyze functional trade-offs from biological data.
2. Description of the planned revisions
Reviewer #1
Summary
The paper "ParTIpy: A Scalable Framework for Archetypal Analysis and Pareto Task Inference" presents ParTIpy, an open-source Python package that modernizes and scales the Pareto Task Inference (ParTI) framework for analyzing biological trade-offs and functional specialization. Unlike the earlier MATLAB implementation, which required a commercial license and was limited in scalability, ParTIpy leverages Python's open ecosystem and integration with tools such as scverse to make archetypal analysis more accessible, flexible, and compatible with modern biological data workflows. Through advanced optimization and coreset algorithms, it efficiently handles large scale single cell and spatial transcriptomics datasets. ParTIpy identifies "archetypes", or optimal phenotypic extremes, to reveal how cells balance competing functional programs. The paper demonstrates its application in modeling hepatocyte specialization across the liver lobule, highlighting spatial patterns of metabolic division of labor.
Overall, ParTIpy represents a modern, accessible, and scalable Python-based solution for exploring biological trade-offs and resource allocation in high-dimensional data. The paper is clearly written and addresses an important methodological gap. However, the enrichment analysis differs from the original ParTI framework and should be discussed more explicitly, and the documentation and tutorials, while helpful, could be refined to improve usability and reproducibility.
Major Comments
- The archetype enrichment analysis used in this paper differs from the original enrichment analysis implemented in ParTI. This is acceptable, but:
a) The authors should explicitly state and discuss the differences between the two approaches.
b) The enrichment analysis should be made more systematic. For each tested feature (e.g. gene or pathway), the analysis should report a p-value for the hypothesis that the feature is enriched near an archetype - that is, its expression (or value) is high close to the archetype and decreases with distance. Appropriate multiple-hypothesis correction should also be applied.
We thank the reviewer for this valuable comment and agree that the differences between our enrichment analysis and the original ParTI implementation should be stated more explicitly. We will incorporate the original enrichment algorithm into ParTIpy, enabling users to select their preferred method. In the revised manuscript, we will note that two enrichment algorithms are available and describe both in greater detail in the supplementary methods section. We also note that the current enrichment analysis already reports p-values adjusted for multiple hypothesis testing.
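For illustration, a minimal sketch of the kind of distance-based test the reviewer describes, not the ParTIpy implementation itself (the names X, dist_to_archetype, and enrichment_pvalues are hypothetical): each feature is tested for a negative rank correlation between its expression and the distance to an archetype, and the resulting one-sided p-values are adjusted with the Benjamini-Hochberg procedure.
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def enrichment_pvalues(X, dist_to_archetype):
    # X: cells x features expression matrix; dist_to_archetype: distance of each cell to one archetype.
    # One-sided test that expression is high near the archetype and decreases with distance.
    pvals = []
    for j in range(X.shape[1]):
        rho, p_two_sided = spearmanr(dist_to_archetype, X[:, j])
        pvals.append(p_two_sided / 2 if rho < 0 else 1 - p_two_sided / 2)
    _, pvals_adj, _, _ = multipletests(pvals, method="fdr_bh")  # Benjamini-Hochberg correction
    return np.asarray(pvals_adj)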
Reviewer #2
Summary
This paper introduces the software ParTIpy, a scalable Python implementation of Pareto Task Inference (ParTI), designed to infer functional trade-offs in biological systems through archetypal analysis. The framework modernizes the previous toolbox with efficient optimization, memory-saving coreset construction, and integration with the scverse ecosystem for single-cell transcriptomic data.
Using hepatocyte scRNA-seq data as a test case, the authors identify archetypes corresponding to distinct gene expression patterns. These archetypes align with known liver domains in spatial transcriptomics data, validating both the method's interpretability and its biological relevance.
Major comments
(1) Conclusions
The core computational and biological claims are well supported. ParTIpy clearly scales better than earlier implementations and reproduces known biological structure. However, claims about "scalability to large datasets" should be further qualified (see below).
We will implement further performance benchmarks as discussed below.
(2) Claims
Archetypal analysis based on the current matrix computation formulation is non-parametric, and new data require recomputation of the archetypes. Therefore, the method cannot generalize to unseen data in the way deep learning approaches can, which could be further acknowledged and clarified.
We thank the reviewer for this insightful comment. We agree that deep learning frameworks are typically amortized, allowing them to generalize to unseen data without retraining, and we will clarify this distinction in the discussion of the revised manuscript. However, we note that mapping new cells into an existing archetypal space is computationally inexpensive, as it only requires solving a single convex optimization problem.
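As a minimal sketch of why this mapping is cheap (assumed variable names, not the ParTIpy API): given fixed archetypes Z, each new cell x only requires solving a small constrained least-squares problem over the simplex.
import numpy as np
from scipy.optimize import minimize

def map_to_archetypes(x, Z):
    # Find simplex weights a minimizing ||x - a @ Z||^2 for one cell x,
    # given fixed archetypes Z (n_archetypes x n_features).
    k = Z.shape[0]
    a0 = np.full(k, 1.0 / k)  # start at the simplex center
    res = minimize(
        lambda a: np.sum((x - a @ Z) ** 2),
        a0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,
        constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0},
    )
    return res.x

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 10))       # 4 archetypes, 10 features
x = 0.6 * Z[0] + 0.4 * Z[2]        # a synthetic "new cell"
weights = map_to_archetypes(x, Z)  # approximately [0.6, 0, 0.4, 0]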
(3) Additional suggested analyses or experiments
1) Absolute performance benchmarks: it is suggested to report wall-clock time and memory usage for a few dataset sizes (10k, 100k, and 1M cells).
We thank the reviewer for this helpful suggestion. We will extend the coreset benchmark to quantify how coreset size affects both archetype positions and biological interpretation. Specifically, we will match archetypes across coreset sizes by solving the linear sum assignment problem, as we currently do when comparing bootstrap samples. We will then compare the distances between archetypes inferred from the full dataset and those obtained from different coreset sizes. In addition to measuring displacement, we will assess biological stability by comparing the gene expression vectors of corresponding archetypes as well as their enriched pathways (using metrics such as cosine similarity and Jaccard index).
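As a sketch of the planned matching step (assumed variable names; the exact ParTIpy code may differ), archetypes inferred from the full data and from a coreset can be matched by minimizing the total pairwise distance via linear sum assignment, after which per-archetype displacement and gene-expression similarity can be reported.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_archetypes(Z_full, Z_coreset):
    # Z_full, Z_coreset: n_archetypes x n_features archetype matrices.
    cost = cdist(Z_full, Z_coreset)                 # pairwise Euclidean distances
    row_idx, col_idx = linear_sum_assignment(cost)  # optimal one-to-one matching
    displacement = cost[row_idx, col_idx]           # per-archetype displacement
    cosine = np.array([
        Z_full[i] @ Z_coreset[j] / (np.linalg.norm(Z_full[i]) * np.linalg.norm(Z_coreset[j]))
        for i, j in zip(row_idx, col_idx)
    ])                                              # gene-expression similarity of matched pairs
    return col_idx, displacement, cosine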
Referee cross-commenting
I agree with the other reviewer's suggestion to check consistency and reproducibility with previous implementation, and enhance the tutorial of the software for users from a biological background. Combined with my comments to further improve the biological application showcase, the revised manuscript could be an impactful contribution to the field, if these comments could be properly addressed.
(1) Advance
This paper is primarily a technical contribution. It modernizes the Pareto Task Inference framework into a scalable and user-friendly Python implementation, which is valuable. However, to further improve its significance, especially for the broader biological audience, more detailed analysis could be performed (see below).
(2) Biological scope and applications [optional]
The current biological validation in hepatocytes is technically fine but limited in breadth and impact. It demonstrates that ParTIpy works but falls short of showing what new insights it can reveal. Several promising applications could be further explored:
- Cross-condition comparisons: could ParTIpy quantify how the Pareto front shifts between conditions (e.g., normal vs. tumor, treated vs. control)?
We thank the reviewer for this valuable suggestion. We have shown ParTIpy's applicability to cross-condition settings in our online tutorials (https://partipy.readthedocs.io/en/latest/notebooks/cross_condition_lupus.html). However, we agree that a more explicit mention in the manuscript is needed. Thus, we will include a cross-condition analysis as a second application in the revised manuscript, focusing on fibroblasts from heart failure patients described in Amrute et al. (2023)¹. This will illustrate how ParTIpy can quantify shifts in the distribution of cells across the functional space defined by archetypal analysis.
Because the manuscript does not explore these scenarios, the biological impact remains narrow, and the framework's broader interpretive power is somewhat underrepresented.
We hope that the additional application included in the revised manuscript helps better illustrate the framework's strength. We would also like to note that the online tutorials provide a comprehensive overview of ParTIpy's functionality, as we expect these will serve as a primary entry point for many researchers interested in archetypal analysis and Pareto Task Inference.
(3) Audience and impact
The paper will interest computational biologists, systems biologists, and bioinformaticians focused on single-cell analysis, and its impact will grow substantially if the authors demonstrate more biological applications.
(4) Reviewer expertise
Computational biology, single-cell transcriptomics, machine learning, computational math
3. Description of the revisions that have already been incorporated in the transferred manuscript
Reviewer #1
2. The package documentation on GitHub and ReadTheDocs is a major strength, but the tutorials can be improved for clarity and accessibility:
We thank the reviewer for this positive feedback. Indeed, providing comprehensive documentation to facilitate ease of adoption was a major motivation behind this project. In response to the reviewer's suggestions, we have revised the tutorials to further improve their clarity, structure, and accessibility, as detailed below.
a) The documentation should list external dependencies that need to be installed separately, e.g. pybiomart.
We thank the reviewer for pointing this out. We had added all dependencies under the optional-dependencies.extra header, which allows users to run pip install partipy[extra] to be able to run all tutorial notebooks. However, we had not explained this in the tutorials or the README, which we have now corrected. The README now reads: "Install the latest stable full release from PyPI with the extra dependencies (e.g., pybiomart, squidpy, liana) that are required to run every tutorial: pip install partipy[extra]". Additionally, we include a clarification in every tutorial notebook that uses additional dependencies: "To run this notebook, install ParTIpy with the tutorial extras: pip install partipy[extra]".
b) The dataset used in the Quickstart demo appears to be inaccessible or extremely slow to download (the function load_hepatocyte_data_2() did not complete even after 30 minutes, at least in my experience). The authors should verify data availability on Zenodo and consider providing a smaller or cached version to make the demo more reliable and reproducible.
We thank the reviewer for this helpful comment. We agree that the previous implementation of load_hepatocyte_data_2() was not reliable due to slow download speeds from Zenodo. To address this, we now host the required AnnData object on figshare (https://figshare.com/articles/dataset/scRNA-seq_hepatocyte_data_from_Ben-Moshe_et_al_2022_/30588713?file=59459459), ensuring faster and more stable access for the Quickstart tutorial via scanpy.read:
adata = sc.read("data/hepatocyte_processed.h5ad", backup_url="https://figshare.com/ndownloader/files/59459459")
c) The tutorial order could be more intuitive - for instance, "archetype crosstalk network" appears before "archetypal analysis". Consider starting with the simulated dataset and presenting the full pipeline before moving to more complex real-world examples.
We thank the reviewer for this helpful suggestion and agree that the previous ordering was not intuitive. We have reordered the tutorials such that the notebook introducing archetypal analysis now appears first, followed by the Quickstart tutorial and the subsequent applied examples.
Minor comments
- In the Python function, the parameter "optim" could use more descriptive option names - for example, renaming "projected_gradients" to "PCHA" would make it clearer and more consistent with terminology used in the paper.
We thank the reviewer for this helpful suggestion. We agree that the previous naming could be misleading. While PCHA does not precisely describe the underlying algorithm, it is the term most users are familiar with from the literature. We have therefore updated the function to accept both "PCHA" and "projected_gradients", which now map to the same underlying optimization routine.
In the Quickstart preprocessing, the authors use the following code:
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
However, they do not specify the target sum in the normalize_total function. The authors should ensure that the data values before the logarithmic transformation span several orders of magnitude (e.g., 0-10,000); if normalization is performed to a sum of 1, the log transformation becomes ineffective.
We thank the reviewer for this helpful comment. By default, sc.pp.normalize_total scales the counts in each cell to the median total counts across all cells, which preserves the typical range of expression values prior to logarithmic transformation. We therefore consider this default behavior appropriate for the Quickstart example. Nonetheless, we will clarify this explicitly in the tutorial to avoid confusion.
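For reference, the Quickstart preprocessing with the default behavior, and the optional explicit target (standard scanpy calls; adata is assumed to be the AnnData object loaded earlier in the tutorial):
import scanpy as sc

# Default: each cell is scaled to the median of total counts across cells,
# so values typically span several orders of magnitude before log1p.
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# Alternative: make the scaling target explicit, e.g. counts per 10,000.
# sc.pp.normalize_total(adata, target_sum=1e4)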
Referee cross-commenting
I agree with Reviewer #2's observation that the paper's contribution is primarily technical; however, I consider this technical advance to be an important and timely one that will enable many biologists to apply archetypal analysis more effectively in their own work.
We thank the reviewer for this positive and encouraging assessment.
Reviewer #1 (Significance (Required)):
This study presents ParTIpy, a Python-based implementation of Pareto Task Inference (ParTI) that makes archetypal analysis more accessible, scalable, and compatible with modern single-cell and spatial transcriptomics workflows. Its main strength lies in translating a conceptually powerful but technically limited MATLAB framework into an open-source, efficient Python package, enabling wider use in computational biology. The package is well-documented, which further enhances its accessibility and adoption potential, though documentation could be improved to enhance reproducibility and ease of use. It will be of interest to computational systems biologists, particularly those working with omics data, and those interested in studying functional trade-offs and resource allocation.
We appreciate the reviewer's positive evaluation and are encouraged by their recognition of ParTIpy's relevance and potential impact in computational biology.
4. Description of analyses that authors prefer not to carry out
Reviewer #2
The current biological validation in hepatocytes is technically fine but limited in breadth and impact. It demonstrates that ParTIpy works but falls short of showing what new insights it can reveal. Several promising applications could be further explored:
- Transient or plastic states: Cells with mixed archetype weights or high mixture entropy can be interpreted as transient, functionally flexible states. ParTIpy can quantify such transience geometrically, even in static data, providing a competitive counterpart to models like CellRank or CellSimplex (https://doi.org/10.1093/bioinformatics/btaf119).
We thank the reviewer for this interesting suggestion. While we agree that quantifying transient or plastic states based on archetype mixtures is an intriguing idea, validating whether cells with mixed archetype weights ("generalists") truly represent transient states would require additional data modalities such as temporal or lineage-tracing measurements. Although we find this direction highly interesting, given that the manuscript is intended as a software paper, we prefer to focus on more directly supported applications of cross-condition data, where labeled data is available.
However, we will expand our discussion to relate ParTIpy to CellSimplex, since we believe this is an interesting angle that future users could explore.
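For readers who wish to explore this direction on their own, a minimal sketch of how mixture entropy could be computed from a cell-by-archetype weight matrix A (hypothetical names; not part of the planned revision):
import numpy as np

def mixture_entropy(A):
    # A: cells x archetypes weight matrix; rows are non-negative and sum to 1.
    # Returns normalized Shannon entropy per cell: 0 for pure "specialists",
    # 1 for maximally mixed "generalists".
    P = np.clip(A, 1e-12, None)
    P = P / P.sum(axis=1, keepdims=True)
    H = -(P * np.log(P)).sum(axis=1)
    return H / np.log(A.shape[1])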
5. References
1. Amrute, J. M. et al. Defining cardiac functional recovery in end-stage heart failure at single-cell resolution. Nat. Cardiovasc. Res. 2, 399-416 (2023).
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #2
Evidence, reproducibility and clarity
Summary
This paper introduces the software ParTIpy, a scalable Python implementation of Pareto Task Inference (ParTI), designed to infer functional trade-offs in biological systems through archetypal analysis. The framework modernizes the previous toolbox with efficient optimization, memory-saving coreset construction, and integration with the scverse ecosystem for single-cell transcriptomic data.
Using hepatocyte scRNA-seq data as a test case, the authors identify archetypes corresponding to distinct gene expression patterns. These archetypes align with known liver domains in spatial transcriptomics data, validating both the method's interpretability and its biological relevance.
Major comments
(1) Conclusions
The core computational and biological claims are well supported. ParTIpy clearly scales better than earlier implementations and reproduces known biological structure. However, claims about "scalability to large datasets" should be further qualified (see below).
(2) Claims
Archetypal analysis based on the current matrix computation formulation is non-parametric, and new data require recomputation of the archetypes. Therefore, the method cannot generalize to unseen data in the way deep learning approaches can, which could be further acknowledged and clarified.
(3) Additional suggested analyses or experiments
- Absolute performance benchmarks: it is suggested to report wall-clock time and memory usage for a few dataset sizes (10k, 100k, and 1M cells).
- Coreset sensitivity analysis: Could the authors show how coreset size affects archetype positions and biological interpretation?
Referee cross-commenting
I agree with the other reviewer's suggestion to check consistency and reproducibility with previous implementation, and enhance the tutorial of the software for users from a biological background. Combined with my comments to further improve the biological application showcase, the revised manuscript could be an impactful contribution to the field, if these comments could be properly addressed.
Significance
(1) Advance
This paper is primarily a technical contribution. It modernizes the Pareto Task Inference framework into a scalable and user-friendly Python implementation, which is valuable. However, to further improve its significance, especially for the broader biological audience, more detailed analysis could be performed (see below).
(2) Biological scope and applications [optional]
The current biological validation in hepatocytes is technically fine but limited in breadth and impact. It demonstrates that ParTIpy works but falls short of showing what new insights it can reveal. Several promising applications could be further explored:
- Cross-condition comparisons: could ParTIpy quantify how the Pareto front shifts between conditions (e.g., normal vs. tumor, treated vs. control)?
- Transient or plastic states: Cells with mixed archetype weights or high mixture entropy can be interpreted as transient, functionally flexible states. ParTIpy can quantify such transience geometrically, even in static data, providing a competitive counterpart to models like CellRank or CellSimplex (https://doi.org/10.1093/bioinformatics/btaf119).
Because the manuscript does not explore these scenarios, the biological impact remains narrow, and the framework's broader interpretive power is somewhat underrepresented.
(3) Audience and impact
The paper will interest computational biologists, systems biologists, and bioinformaticians focused on single-cell analysis, and its impact will grow substantially if the authors demonstrate more biological applications.
(4) Reviewer expertise
Computational biology, single-cell transcriptomics, machine learning, computational math
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #1
Evidence, reproducibility and clarity
Summary
The paper "ParTIpy: A Scalable Framework for Archetypal Analysis and Pareto Task Inference" presents ParTIpy, an open-source Python package that modernizes and scales the Pareto Task Inference (ParTI) framework for analyzing biological trade-offs and functional specialization. Unlike the earlier MATLAB implementation, which required a commercial license and was limited in scalability, ParTIpy leverages Python's open ecosystem and integration with tools such as scverse to make archetypal analysis more accessible, flexible, and compatible with modern biological data workflows. Through advanced optimization and coreset algorithms, it efficiently handles large scale single cell and spatial transcriptomics datasets. ParTIpy identifies "archetypes", or optimal phenotypic extremes, to reveal how cells balance competing functional programs. The paper demonstrates its application in modeling hepatocyte specialization across the liver lobule, highlighting spatial patterns of metabolic division of labor. Overall, ParTIpy represents a modern, accessible, and scalable Python-based solution for exploring biological trade-offs and resource allocation in high-dimensional data. The paper is clearly written and addresses an important methodological gap. However, the enrichment analysis differs from the original ParTI framework and should be discussed more explicitly, and the documentation and tutorials, while helpful, could be refined to improve usability and reproducibility.
Major Comments
- The archetype enrichment analysis used in this paper differs from the original enrichment analysis implemented in ParTI. This is acceptable, but:
a. The authors should explicitly state and discuss the differences between the two approaches.
b. The enrichment analysis should be made more systematic. For each tested feature (e.g. gene or pathway), the analysis should report a p-value for the hypothesis that the feature is enriched near an archetype - that is, its expression (or value) is high close to the archetype and decreases with distance. Appropriate multiple-hypothesis correction should also be applied.
- The package documentation on GitHub and ReadTheDocs is a major strength, but the tutorials can be improved for clarity and accessibility:
a. The documentation should list external dependencies that need to be installed separately, e.g. pybiomart.
b. The dataset used in the Quickstart demo appears to be inaccessible or extremely slow to download (the function load_hepatocyte_data_2() did not complete even after 30 minutes, at least in my experience). The authors should verify data availability on Zenodo and consider providing a smaller or cached version to make the demo more reliable and reproducible.
c. The tutorial order could be more intuitive - for instance, "archetype crosstalk network" appears before "archetypal analysis". Consider starting with the simulated dataset and presenting the full pipeline before moving to more complex real-world examples.
Minor comments
- In the Python function, the parameter "optim" could use more descriptive option names - for example, renaming "projected_gradients" to "PCHA" would make it clearer and more consistent with terminology used in the paper.
- In the Quickstart preprocessing, the authors use the following code:
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
However, they do not specify the target sum in the normalize_total function. The authors should ensure that the data values before the logarithmic transformation span several orders of magnitude (e.g., 0-10,000); if normalization is performed to a sum of 1, the log transformation becomes ineffective.
Referee cross-commenting
I agree with Reviewer #2's observation that the paper's contribution is primarily technical; however, I consider this technical advance to be an important and timely one that will enable many biologists to apply archetypal analysis more effectively in their own work.
Significance
This study presents ParTIpy, a Python-based implementation of Pareto Task Inference (ParTI) that makes archetypal analysis more accessible, scalable, and compatible with modern single-cell and spatial transcriptomics workflows. Its main strength lies in translating a conceptually powerful but technically limited MATLAB framework into an open-source, efficient Python package, enabling wider use in computational biology. The package is well-documented, which further enhances its accessibility and adoption potential, though documentation could be improved to enhance reproducibility and ease of use. It will be of interest to computational systems biologists, particularly those working with omics data, and those interested in studying functional trade-offs and resource allocation.