Efficient and reproducible pipelines for spike sorting large-scale electrophysiology data

Curation statements for this article:
  • Curated by eLife

    eLife Assessment

    This study presents a valuable and well-documented computational pipeline for the scalable analysis and spike sorting of large extracellular electrophysiology datasets, with particular relevance for high-density recordings such as Neuropixels. The authors demonstrate the pipeline's utility for benchmarking spike sorter performance and evaluating the effects of data compression, supported by thorough testing, clear figures, and openly available code. The workflow is reproducible, portable, and practical, providing concrete guidance on computational cost and runtime. Overall, the evidence supporting the pipeline's performance and output quality is compelling, and this work will be of broad interest to the systems neuroscience community.


Abstract

The scale of in vivo electrophysiology has expanded in recent years, with simultaneous recordings across thousands of electrodes now becoming routine. These advances have enabled a wide range of discoveries, but they also impose substantial computational demands. Spike sorting, the procedure that extracts spikes from extracellular voltage measurements, remains a major bottleneck: a dataset collected in a few hours can take days to spike sort on a single machine, and the field lacks rigorous validation of the many spike sorting algorithms and preprocessing steps that are in use. Advancing the speed and accuracy of spike sorting is essential to fully realize the potential of large-scale electrophysiology. Here, we present an end-to-end spike sorting pipeline that leverages parallelization to scale to large datasets. The same workflow can run reproducibly on individual workstations, high-performance computing clusters, or cloud environments, with computing resources tailored to each processing step to reduce costs and execution times. In addition, we introduce a benchmarking pipeline, also optimized for parallel processing, that enables systematic comparison of multiple sorting pipelines. Using this framework, we show that <monospace>Kilosort4</monospace>, a widely used spike sorting algorithm, outperforms <monospace>Kilosort2.5</monospace> (Pachitariu et al. 2024). We also show that 7× lossy compression, which substantially reduces the cost of data storage, has minimal impact on spike sorting performance. Together, these pipelines address the urgent need for scalable and transparent spike sorting of electrophysiology data, preparing the field for the coming flood of multi-thousand-channel experiments.

Article activity feed

  2. Reviewer #1 (Public review):

    Summary:

    Extracellular electrophysiology datasets are growing in both number and size, and recordings with thousands of sites per animal are now commonplace. Analyzing these datasets to extract the activity of single neurons (spike sorting) is challenging: the signal-to-noise ratio is low, the analysis is computationally expensive, and small changes in analysis parameters and code can alter the output. The authors address the problem of data volume by packaging the well-characterized SpikeInterface pipeline in a framework that can distribute individual sorting jobs across many workers in a compute cluster or cloud environment. Reproducibility is ensured by running containerized versions of the processing components.

    The authors apply the pipeline in two important examples. The first is a thorough study comparing the performance of two widely used spike-sorting algorithms (Kilosort 2.5 and Kilosort 4). They use hybrid datasets created by injecting measured spike waveforms (templates) into existing recordings, adjusting those waveforms according to the measured drift in the recording. These hybrid ground truth datasets preserve the complex noise and background of the original recording. Similar to the original Kilosort 4 paper, which uses a different method for creating ground truth datasets that include drift, the authors find Kilosort 4 significantly outperforms Kilosort 2.5. The second example measures the impact of compression of raw data on spike sorting with Kilosort 4, showing that accuracy, precision, and recall of the ground truth units are not significantly impacted even by lossy compression. As important as the individual results, these studies provide good models for measuring the impact of particular processing steps on the output of spike sorting.
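    The hybrid ground-truth approach described above can be illustrated with a toy sketch. This is a simplified stand-in using synthetic data, not the authors' actual implementation: a known template is added to an existing noise background at known times, with a one-channel shift standing in for drift, so the injected spike times can later be used to score a sorter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "recording": Gaussian noise, 4 channels x 1 s at 30 kHz.
fs = 30_000
n_channels, n_samples = 4, fs
recording = rng.normal(0.0, 10.0, size=(n_channels, n_samples))

# Toy spike template: a biphasic waveform, strongest on channel 1.
t = np.arange(60)
waveform = (-80.0 * np.exp(-((t - 20) ** 2) / 30.0)
            + 25.0 * np.exp(-((t - 35) ** 2) / 60.0))
template = np.zeros((n_channels, t.size))
template[1] = waveform        # peak channel
template[0] = 0.4 * waveform  # attenuated on neighboring channels
template[2] = 0.4 * waveform

# Known ("ground-truth") spike times to inject.
spike_times = np.sort(rng.integers(0, n_samples - t.size, size=50))

hybrid = recording.copy()
for st in spike_times:
    # Crude stand-in for drift: shift the template down one channel
    # for spikes in the second half of the recording.
    shift = 1 if st > n_samples // 2 else 0
    hybrid[:, st:st + t.size] += np.roll(template, shift, axis=0)

# `hybrid` keeps the original noise/background; `spike_times` is the
# ground truth against which a sorter's output can be scored.
```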

    Strengths:

    The pipeline uses the Nextflow framework, which makes it adaptable to different job schedulers and environments. The high-level documentation is useful, and the GitHub code is well organized. The two example studies are thorough and well-designed, and address important questions in the analysis of extracellular electrophysiology data.

    Weaknesses:

    The pipeline is very complete, but also complex. Optimal workflows (the best artifact removal, or the best curation for data from a particular brain area or species) will vary by experiment. A discussion of the pipeline's adaptability in the "Limitations" section would therefore be helpful for readers.

  3. Reviewer #2 (Public review):

    Summary:

    This work presents a reproducible, scalable workflow for spike sorting that leverages parallelization to handle large neural recording datasets. The authors introduce both a processing pipeline and a benchmarking framework that can run across different computing environments (workstations, HPC clusters, cloud). Key findings include demonstrating that Kilosort4 outperforms Kilosort2.5 and that 7× lossy compression has minimal impact on spike sorting performance while substantially reducing storage costs.
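    As a concrete illustration of what benchmarking against ground truth means here, a minimal sketch of spike-time matching follows. The function name and the 0.4 ms tolerance are illustrative assumptions, not the pipeline's actual comparison code: sorted spike times are matched to injected ground-truth times within a small tolerance, and accuracy, precision, and recall are computed from the resulting true/false positives.

```python
import numpy as np

def match_metrics(gt_times, sorted_times, tol=0.4e-3):
    """Greedy two-pointer matching of sorted spike times (seconds) to
    ground-truth spike times within +/- tol; returns (accuracy,
    precision, recall)."""
    gt = np.sort(np.asarray(gt_times, dtype=float))
    st = np.sort(np.asarray(sorted_times, dtype=float))
    i = j = tp = 0
    while i < gt.size and j < st.size:
        if st[j] < gt[i] - tol:
            j += 1                    # spurious spike: false positive
        elif st[j] > gt[i] + tol:
            i += 1                    # missed spike: false negative
        else:
            tp += 1                   # matched within tolerance
            i += 1
            j += 1
    fp = st.size - tp
    fn = gt.size - tp
    precision = tp / st.size if st.size else 0.0
    recall = tp / gt.size if gt.size else 0.0
    accuracy = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return accuracy, precision, recall

# 4 injected spikes; the sorter found 2 of them plus 1 spurious event.
gt = [0.010, 0.020, 0.030, 0.040]
found = [0.0101, 0.0202, 0.050]
print(match_metrics(gt, found))  # (0.4, 0.666..., 0.5)
```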

    Strengths:

    (1) Extremely high-quality figures with clear captions that effectively communicate complex workflow information.

    (2) Very detailed, well-written methods section providing thorough documentation.

    (3) Strong focus on reproducibility, scalability, modularity, and portability using established technologies (Nextflow, SpikeInterface, Code Ocean).

    (4) Pipeline publicly available on GitHub with documentation.

    (5) Clear cost analysis showing ~$5/hour for AWS processing, with a transparent breakdown.

    (6) Good overview of previous spike sorting benchmarking attempts in the introduction.

    (7) Practical value for the community by lowering barriers to processing large datasets.

    Weaknesses:

    No significant weaknesses were identified, although it is noted that the limitations section of the discussion could be expanded.

  4. Reviewer #3 (Public review):

    Summary:

    The authors provide a highly valuable and thoroughly documented pipeline to accelerate the processing and spike sorting of high-density electrophysiology data, particularly from Neuropixels probes. The scale of data collection is increasing across the field, and processing times and data storage are growing concerns. This pipeline provides parallelization and benchmarking of performance after data compression that helps address these concerns. The authors also use their pipeline to benchmark different spike sorting algorithms, providing useful evidence that Kilosort4 performs the best out of the tested options. This work, and the ability to implement this pipeline with minimal effort to standardize and speed up data processing across the field, will be of great interest to many researchers in systems neuroscience.

    Strengths:

    The paper is very well written and clear in most places. The accompanying GitHub and ReadTheDocs are well organized and thorough. The authors provide many benchmarking metrics to support their claims, and it is clear that the pipeline has been very thoroughly tested and optimized by users at the Allen Institute for Neural Dynamics. The pipeline incorporates existing software and platforms that have also been thoroughly tested (such as SpikeInterface), so the authors are not reinventing the wheel, but rather putting together the best of many worlds. This is a great contribution to the field, and it is clear that the authors have put a lot of thought into making the pipeline as accessible as possible.

    Weaknesses:

    There are no major weaknesses. I have only a handful of very minor questions and suggestions that could clarify/generalize aspects of the pipeline or make the text more understandable to non-specialists.

    (1) Could the authors please expand on the statement on line 274, that processing their test dataset serially "on a single GPU-capable cloud workstation... would take approximately 75 hours and cost over 90 USD." How were these values calculated? I was a bit surprised that this is a >4-fold slow-down relative to their pipeline but increases the cost by only ~1.35×, if I understood correctly. More context on why this is, and perhaps on what a g4dn.4xlarge instance is compared to the other instances, might help readers who are less familiar with AWS and cloud computing.

    (2) One of the most commonly used preprocessing pipelines for Neuropixels data is the CatGT/ecephys pipeline from the developers of SpikeGLX at Janelia. It may be worth commenting briefly, either in the preprocessing section or in the discussion, on how the preprocessing steps available in this pipeline compare to those available in CatGT. For example, is "destriping" similar to the "-gfix" option in CatGT for removing high-amplitude artifacts?

    (3) Why are there duplicate units (line 194), and how often is this an issue? I understand that this is likely more of a spike sorter issue than an issue with this pipeline, but 1-2 sentences elaborating why might be helpful for readers.

    (4) It seems from the parameter files on GitHub that the cluster curation parameters are customizable - correct? If so, it may be worth saying so explicitly in the curation section of the text, as the presented recipe will not always be appropriate. A presence ratio of >0.8 could be particularly problematic for some recordings. For example, if a cell is only active during a specific part of the behavior, that may be a feature of the experiment; or the animal could be transitioning between sleep and wake states, in which case different units become active at different times.

    (5) The axis labels in Figures 3d-e are too small to see, and Figure 3d would benefit from a brief description of what is shown.

    (6) What is the difference between "neural" and "passing QC" in Figure 4?

    (7) I understand the current paper is focused on spike data, so there may not be an answer to this, but I am curious about NP2.0 probes, which save data in a single wide band. Does the lossy compression negatively affect the LFP data? Is software filtering for the spike band applied before or after compression?
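    On the presence-ratio threshold raised in point (4) above: presence ratio is commonly computed as the fraction of equal-width time bins that contain at least one spike, which makes the concern concrete. The following sketch is illustrative (the function and bin count are assumptions, not the pipeline's code): a genuine unit that is active only during one behavioral epoch can score well below a >0.8 cutoff.

```python
import numpy as np

def presence_ratio(spike_times, duration, n_bins=100):
    """Fraction of equal-width time bins with at least one spike."""
    counts, _ = np.histogram(spike_times, bins=n_bins, range=(0.0, duration))
    return float(np.mean(counts > 0))

duration = 600.0  # 10-minute session, in seconds

steady = np.linspace(1.0, 599.0, 3000)      # unit active throughout
epoch_only = np.linspace(1.0, 199.0, 3000)  # active only in one epoch

print(presence_ratio(steady, duration))      # 1.0 -> passes a >0.8 cutoff
print(presence_ratio(epoch_only, duration))  # 0.34 -> fails, though the unit is real
```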