Activin A marks a novel progenitor cell population during fracture healing and reveals a therapeutic strategy
Curation statements for this article:
Curated by eLife
eLife assessment
This work is a valuable presentation of sharpwaveripple reactivation of hippocampal neural ensemble activity recorded as animals explored two different environments. It attempts to use the fact that the ensemble code remaps between the two mazes to identify the best replaydetection procedures for analyzing this type of data. The reviewers found the evidence for a prescriptive conclusion inadequate, while still appreciating the concept of comparing mazeidentity discrimination with replay.
This article has been Reviewed by the following groups
Listed in
 Evaluated articles (eLife)
Abstract
Insufficient bone fracture repair represents a major clinical and societal burden and novel strategies are needed to address it. Our data reveal that the transforming growth factorβ superfamily member Activin A became very abundant during mouse and human bone fracture healing but was minimally detectable in intact bones. Singlecell RNAsequencing revealed that the Activin Aencoding gene Inhba was highly expressed in a unique, highly proliferative progenitor cell (PPC) population with a myofibroblast character that quickly emerged after fracture and represented the center of a developmental trajectory bifurcation producing cartilage and bone cells within callus. Systemic administration of neutralizing Activin A antibody inhibited bone healing. In contrast, a single recombinant Activin A implantation at fracture site in young and aged mice boosted: PPC numbers; phosphorylated SMAD2 signaling levels; and bone repair and mechanical properties in endochondral and intramembranous healing models. Activin A directly stimulated myofibroblastic differentiation, chondrogenesis and osteogenesis in periosteal mesenchymal progenitor culture. Our data identify a distinct population of Activin Aexpressing PPCs central to fracture healing and establish Activin A as a potential new therapeutic tool.
Article activity feed


eLife assessment
This work is a valuable presentation of sharpwaveripple reactivation of hippocampal neural ensemble activity recorded as animals explored two different environments. It attempts to use the fact that the ensemble code remaps between the two mazes to identify the best replaydetection procedures for analyzing this type of data. The reviewers found the evidence for a prescriptive conclusion inadequate, while still appreciating the concept of comparing mazeidentity discrimination with replay.

Reviewer #1 (Public Review):
This work introduces a novel framework for evaluating the performance of statistical methods that identify replay events. This is challenging because hippocampal replay is a latent cognitive process, where the ground truth is inaccessible, so methods cannot be evaluated against a known answer. The framework consists of two elements:
1. A replay sequence pvalue, evaluated against shuffled permutations of the data, such as radon line fitting, rankorder correlation, or weighted correlation. This element determines how trajectorylike the spiking representation is. The pvalue threshold for all accepted replay events is adjusted based on an empirical shuffled distribution to control for the false discovery rate.
2. A trajectory discriminability score, also evaluated against shuffled permutations of the data. …Reviewer #1 (Public Review):
This work introduces a novel framework for evaluating the performance of statistical methods that identify replay events. This is challenging because hippocampal replay is a latent cognitive process, where the ground truth is inaccessible, so methods cannot be evaluated against a known answer. The framework consists of two elements:
1. A replay sequence pvalue, evaluated against shuffled permutations of the data, such as radon line fitting, rankorder correlation, or weighted correlation. This element determines how trajectorylike the spiking representation is. The pvalue threshold for all accepted replay events is adjusted based on an empirical shuffled distribution to control for the false discovery rate.
2. A trajectory discriminability score, also evaluated against shuffled permutations of the data. In this case, there are two different possible spatial environments that can be replayed, so the method compares the log odds of track 1 vs. track 2.The authors then use this framework (accepted number of replay events and trajectory discriminability) to study the performance of replay identification methods. They conclude that sharp wave ripple power is not a necessary criterion for identifying replay event candidates during awake run behavior if you have high multiunit activity, a higher number of permutations is better for identifying replay events, linear Bayesian decoding methods outperform rankorder correlation, and there is no evidence for preplay.
The authors tackle a difficult and important problem for those studying hippocampal replay (and indeed all latent cognitive processes in the brain) with spiking data: how do we understand how well our methods are doing when the ground truth is inaccessible? Additionally, systematically studying how the variety of methods for identifying replay perform, is important for understanding the sometimes contradictory conclusions from replay papers. It helps consolidate the field around particular methods, leading to better reproducibility in the future. The authors' framework is also simple to implement and understand and the code has been provided, making it accessible to other neuroscientists. Testing for track discriminability, as well as the sequentiality of the replay event, is a sensible additional data point to eliminate "spurious" replay events.
However, there are some concerns with the framework as well. The novelty of the framework is questionable as it consists of a log odds measure previously used in two prior papers (Carey et al. 2019 and the authors' own Tirole & Huelin Gorriz, et al., 2022) and a multiple comparisons correction, albeit a unique empirical multiple comparisons correction based on shuffled data.
With respect to the log odds measure itself, as presented, it is reliant on having only two options to test between, limiting its general applicability. Even in the data used for the paper, there are sometimes three tracks, which could influence the conclusions of the paper about the validity of replay methods. This also highlights a weakness of the method in that it assumes that the true model (spatial track environment) is present in the set of options being tested. Furthermore, the log odds measure itself is sensitive to the defined ripple or multiunit start and end times, because it marginalizes over both position and time, so any inclusion of place cells that fire for the animal's stationary position could influence the discriminability of the track. Multiple track representations during a candidate replay event would also limit track discriminability. Finally, the authors call this measure "trajectory discriminability", which seems a misnomer as the time and position information are integrated out, so there is no notion of trajectory.
The authors also fail to make the connection with the control of the false discovery rate via false positives on empirical shuffles with existing multiple comparison corrections that control for false discovery rates (such as the Benjamini and Hochberg procedure or Storey's qvalue). Additionally, the particular type of shuffle used will influence the empirically determined pvalue, making the procedure dependent on the defined null distribution. Shuffling the data is also considerably more computationally intensive than the existing multiple comparison corrections.
Overall, the authors make interesting conclusions with respect to hippocampal replay methods, but the utility of the method is limited in scope because of its reliance on having exactly two comparisons and having to specify the null distribution to control for the false discovery rate. This work will be of interest to electrophysiologists studying hippocampal replay in spiking data.

Reviewer #2 (Public Review):
This study proposes to evaluate and compare different replay methods in the absence of "ground truth" using data from hippocampal recordings of rodents that were exposed to two different tracks on the same day. The study proposes to leverage the potential of Bayesian methods to decode replay and reactivation in the same events. They find that events that pass a higher threshold for replay typically yield a higher measure of reactivation. On the other hand, events from the shuffled data that pass thresholds for replay typically don't show any reactivation. While wellintentioned, I think the result is highly problematic and poorly conceived.
The work presents a lot of confusion about the nature of null hypothesis testing and the meaning of pvalues. The prescription arrived at, to correct pvalues by putting …
Reviewer #2 (Public Review):
This study proposes to evaluate and compare different replay methods in the absence of "ground truth" using data from hippocampal recordings of rodents that were exposed to two different tracks on the same day. The study proposes to leverage the potential of Bayesian methods to decode replay and reactivation in the same events. They find that events that pass a higher threshold for replay typically yield a higher measure of reactivation. On the other hand, events from the shuffled data that pass thresholds for replay typically don't show any reactivation. While wellintentioned, I think the result is highly problematic and poorly conceived.
The work presents a lot of confusion about the nature of null hypothesis testing and the meaning of pvalues. The prescription arrived at, to correct pvalues by putting animals on two separate tracks and calculating a "sequenceless" measure of reactivation are impractical from an experimental point of view, and unsupportable from a statistical point of view. Much of the observations are presented as solutions for the field, but are in fact highly dependent on distinct features of the dataset at hand. The most interesting observation is that despite the existence of apparent sequences in the PRERUN data, no reactivation is detectable in those events, suggesting that in fact they represent spurious events. I would recommend the authors focus on this important observation and abandon the rest of the work, as it has the potential to further befuddle and promote poor statistical practices in the field.
The major issue is that the manuscript conveys much confusion about the nature of hypothesis testing and the meaning of pvalues. It's worth stating here the definition of a pvalue: the conditional probability of rejecting the null hypothesis given that the null hypothesis is true. Unfortunately, in places, this study appears to confound the meaning of the pvalue with the probability of rejecting the null hypothesis given that the null hypothesis is NOT truei.e. in their recordings from awake replay on different mazes. Most of their analysis is based on the observation that events that have higher reactivation scores, as reflected in the mean log odds differences, have lower pvalues resulting from their replay analyses. Shuffled data, in contrast, does not show any reactivation but can still show spurious replays depending on the shuffle procedure used to create the surrogate dataset. The authors suggest using this to test different practices in replay detection. However, another important point that seems lost in this study is that the surrogate dataset that is contrasted with the actual data depends very specifically on the null hypothesis that is being tested. That is to say, each different shuffle procedure is in fact testing a different null hypothesis. Unfortunately, most studies, including this one, are not very explicit about which null hypothesis is being tested with a given resampling method, but the pvalue obtained is only meaningful insofar as the null that is being tested and related assumptions are clearly understood. From a statistical point of view, it makes no sense to adjust the pvalue obtained by one shuffle procedure according to the pvalue obtained by a different shuffle procedure, which is what this study inappropriately proposes. Other prescriptions offered by the study are highly dataset and method dependent and discuss minutiae of event detection, such as whether or not to require power in the ripple frequency band.

Reviewer #3 (Public Review):
This study tackles a major problem with replay detection, which is that different methods can produce vastly different results. It provides compelling evidence that the source of this inconsistency is that biological data often violates assumptions of independent samples. This results in false positive rates that can vary greatly with the precise statistical assumptions of the chosen replay measure, the detection parameters, and the dataset itself. To address this issue, the authors propose to empirically estimate the false positive rate and control for it by adjusting the significance threshold. Remarkably, this reconciles the differences in replay detection methods, as the results of all the replay methods tested converge quite well (see Figure 6B). This suggests that by controlling for the false positive …
Reviewer #3 (Public Review):
This study tackles a major problem with replay detection, which is that different methods can produce vastly different results. It provides compelling evidence that the source of this inconsistency is that biological data often violates assumptions of independent samples. This results in false positive rates that can vary greatly with the precise statistical assumptions of the chosen replay measure, the detection parameters, and the dataset itself. To address this issue, the authors propose to empirically estimate the false positive rate and control for it by adjusting the significance threshold. Remarkably, this reconciles the differences in replay detection methods, as the results of all the replay methods tested converge quite well (see Figure 6B). This suggests that by controlling for the false positive rate, one can get an accurate estimate of replay with any of the standard methods.
When comparing different replay detection methods, the authors use a sequenceindependent logodds difference score as a validation tool and an indirect measure of replay quality. This takes advantage of the twotrack design of the experimental data, and its use here relies on the assumption that a true replay event would be associated with good (discriminable) reactivation of the environment that is being replayed. The other way replay "quality" is estimated is by the number of replay events detected once the false positive rate is taken into account. In this scheme, "better" replay is in the top right corner of Figure 6B: many detected events associated with congruent reactivation.
There are two possible ways the results from this study can be integrated into future replay research. The first, simpler, way is to take note of the empirically estimated false positive rates reported here and simply avoid the methods that result in high false positive rates (weighted correlation with a place bin shuffle or allspike Spearman correlation with a spikeid shuffle). The second, perhaps more desirable, way is to integrate the practice of estimating the false positive rate when scoring replay and to take it into account. This is very powerful as it can be applied to any replay method with any choice of parameters and get an accurate estimate of replay.
How does one estimate the false positive rate in their dataset? The authors propose to use a cellID shuffle, which preserves all the firing statistics of replay events (bursts of spikes by the same cell, multiunit fluctuations, etc.) but randomly swaps the cells' place fields, and to repeat the replay detection on this surrogate randomized dataset. Of course, there is no perfect shuffle, and it is possible that a surrogate dataset based on this particular shuffle may result in one underestimating the true false positive rate if different cell types are present (e.g. place field statistics may differ between CA1 and CA3 cells, or deep vs. superficial CA1 cells, or place cells vs. nonplace cells if inclusion criteria are not strict). Moreover, it is crucial that this validation shuffle be independent of any shuffling procedure used to determine replay itself (which may not always be the case, particularly for the predecoding place field circular shuffle used by some of the methods here) lest the true falsepositive rate be underestimated. Once the false positive rate is estimated, there are different ways one may choose to control for it: adjusting the significance threshold as the current study proposes, or directly comparing the number of events detected in the original vs surrogate data. Either way, with these caveats in mind, controlling for the false positive rate to the best of our ability is a powerful approach that the field should integrate.
Which replay detection method performed the best? If one does not control for varying false positive rates, there are two methods that resulted in strikingly high (>15%) false positive rates: these were weighted correlation with a place bin shuffle and Spearman correlation (using all spikes) with a spikeid shuffle. However, after controlling for the false positive rate (Figure 6B) all methods largely agree, including those with initially high false positive rates. There is no clear "winner" method, because there is a lot of overlap in the confidence intervals, and there also are some additional reasons for not overly interpreting small differences in the observed results between methods. The confidence intervals are likely to underestimate the true variance in the data because the resampling procedure does not involve hierarchical statistics and thus fails to account for statistical dependencies on the session and animal level. Moreover, it is possible that methods that involve shuffles similar to the crossvalidation shuffle ("wcorr 2 shuffles", "wcorr 3 shuffles" both use a predecoding place field circular shuffle, which is very similar to the predecoding place field swap used in the crossvalidation procedure to estimate the false positive rate) may underestimate the false positive rate and therefore inflate adjusted pvalue and the proportion of significant events. We should therefore not interpret small differences in the measured values between methods, and the only clear winner and the best way to score replay is using any method after taking the empirically estimated false positive rate into account.
The authors recommend excluding lowripple power events in sleep, because no replay was observed in events with low (03 zunits) ripple power specifically in sleep, but that no ripple restriction is necessary for awake events. There are problems with this conclusion. First, ripple power is not the only way to detect sharpwave ripples (the sharp wave is very informative in detecting awake events). Second, when talking about sequence quality in awake nonripple data, it is imperative for one to exclude theta sequences. The authors' speed threshold of 5 cm/s is not sufficient to guarantee that no theta cycles contaminate the awake replay events. Third, a direct comparison of the results with and without exclusion is lacking (selecting for the lower ripple power events is not the same as not having a threshold), so it is unclear how crucial it is to exclude the minority of the sleep events outside of ripples. The decision of whether or not to select for ripples should depend on the particular study and experimental conditions that can affect this measure (electrode placement, brain state prevalence, noise levels, etc.).
Finally, the authors address a controversial topic of denovo preplay. With replay detection corrected for the false positive rate, none of the detection methods produce evidence of preplay sequences nor sequenceless reactivation in the tested dataset. This presents compelling evidence in favour of the view that the sequence of place fields formed on a novel track cannot be predicted by the sequential structure found in pretask sleep.
