Neural synchronization is strongest to the spectral flux of slow music and depends on familiarity and beat salience
Curated by eLife
Evaluation Summary: This study investigated the neural tracking of music using novel methodology. The core finding was stronger neuronal entrainment to "spectral flux" than to other, more commonly tested features such as amplitude envelope. As such the study is methodologically sophisticated and provides novel insight on the neuronal mechanisms of music perception. (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #3 agreed to share their names with the authors.)
Abstract
Neural activity in the auditory system synchronizes to sound rhythms, and brain–environment synchronization is thought to be fundamental to successful auditory perception. Sound rhythms are often operationalized in terms of the sound’s amplitude envelope. We hypothesized that – especially for music – the envelope might not best capture the complex spectro-temporal fluctuations that give rise to beat perception and synchronized neural activity. This study investigated (1) neural synchronization to different musical features, (2) tempo-dependence of neural synchronization, and (3) dependence of synchronization on familiarity, enjoyment, and ease of beat perception. In this electroencephalography study, 37 human participants listened to tempo-modulated music (1–4 Hz). Independent of whether the analysis approach was based on temporal response functions (TRFs) or reliable components analysis (RCA), the spectral flux of music – as opposed to the amplitude envelope – evoked strongest neural synchronization. Moreover, music with slower beat rates, high familiarity, and easy-to-perceive beats elicited the strongest neural response. Our results demonstrate the importance of spectro-temporal fluctuations in music for driving neural synchronization, and highlight its sensitivity to musical tempo, familiarity, and beat salience.
Author Response
Reviewer #1 (Public Review):
This paper examines EEG responses time-locked to (or "entrained" by) musical features and how these depend on tempo and feature identity. Results revealed stronger entrainment to "spectral flux" than to other, more commonly tested features such as amplitude envelope. Entrainment was also strongest for lowest rates tested (1-2 Hz). The paper is well written, its structure is easy to follow and the research topic is explained in a way that makes it accessible to readers outside of the field. Results will advance the scientific field and give us further insights into neural processes underlying auditory and music perception. Nevertheless, there are a few points that I believe need to be clarified or discussed to rule out alternative explanations or to better understand the acquired data.
We thank the Reviewer for taking the time to evaluate our manuscript and for the positive response.
We have now conducted further analyses to strengthen our conclusion that neural synchronization was strongest at slower musical tempi and to rule out an alternative explanation that neural synchronization was strongest for music presented near its own original or “natural” tempo. We also added some points to the Discussion in response to your comments; revised text is reproduced as part of our point-by-point responses below for your convenience. The page and line numbers correspond to the manuscript file without track changes.
- Results reveal spectral flux as the musical feature producing strongest entrainment. However, entrainment can only be compared across features in an unbiased way if these features are all equally present in the stimulus. I wonder whether entrainment to spectral flux is only most pronounced because the latter is the most prominent feature in music. Can the authors rule out such an explanation?
Respectfully, it is not fully clear to us based on the literature that entrainment can only be compared across features fairly when those features are equally present in the stimulus. Previous work in the speech domain has compared entrainment to amplitude envelope vs. spectrogram vs. a symbolic representation of the time of occurrence of different phonemes (Di Liberto et al., 2015). Work in the music domain has compared entrainment to amplitude envelope (and its derivative) vs. features quantifying melodic expectation (surprise and entropy, quantified using a hidden Markov model trained on a corpus of Western music; Di Liberto et al., 2020). In these papers, there was no quantification of the degree to which each feature was present in the stimulus material, and when comparing such qualitatively different features, it is not clear to us how one would do so. Nonetheless, these studies used the resulting TRF-based dependent measures to evaluate which feature best predicted the neural response. Here, although we do not know what acoustic feature might be most present / strongest in music, we believe that we can investigate the degree to which each feature predicts the neural response. In fact, we might argue the reverse of the logic in your comment – that the TRF results actually tell us which feature is perceptually or psychologically the most important in terms of driving brain responses, which may not be fully predictable from the acoustics of those features. From a data analysis perspective, we have independently normalized (z-scored) each feature as well as the neural data, as prescribed in Crosse et al., 2021, to try to level the playing field for the musical features we are comparing. Moreover, we made changes in the discussion to acknowledge your concern. The text is reproduced here for your convenience. p. 26, l. 
489-497: “One hurdle to performing any analysis of the coupling between neural activity and a stimulus time course is knowing ahead of time the feature or set of features that will well characterize the stimulus on a particular time scale given the nature of the research question. Indeed, there is no necessity that the feature that best drives neural synchronization will be the most obvious or prominent stimulus feature. Here, we treated feature comparison as an empirical question (Di Liberto et al., 2015), and found that spectral flux is a better predictor of neural activity than the amplitude envelope of music. Beyond this comparison though, the issue of feature selection also has important implications for comparisons of neural synchronization across, for example, different modalities.”
- Spectral analyses of neural data often yield the strongest power at lowest frequencies. Measures of entrainment can be biased by the amount of power present, where entrainment increases with power. Can the authors rule out that the advantage for lower frequencies is a reflection of such an effect?
 Thank you for this insightful comment. In response to your comment and the comments of Reviewer 3, we normalized the TRF correlations, stimulus–response correlations, and stimulus–response coherences by surrogate distributions that were calculated separately for each musical feature and – importantly – for every tempo condition. Following Zuk et al., 2021, we formed surrogate distributions by shifting the relevant neural data time course relative to the stimulus-feature time courses by a random amount. We did this 50 times, and for each shift re-calculated all dependent measures. We then normalized our dependent measures calculated from the intact time series relative to these surrogate distributions by subtracting the mean and dividing by the standard deviation of the surrogate distribution (“z-scoring”). Since the approach of shifting the neural data leaves the neural time series intact, the power spectrum of the data is preserved, but only its relationship to the stimulus is destroyed. After normalization, the plots obviously look a little different, but the main results – a higher level of neural synchronization to slower stimulation tempi and in response to the spectral flux – remain. The changes can be found throughout the manuscript, but especially on p. 11, l. 210-218, Figures 2-3 and a more detailed explanation in the Methods section. p. 39, l. 821-829: “In order to control for any frequency-specific differences in the overall power of the neural data that could have led to artificially inflated observed neural synchronization at lower frequencies, the SRCorr and SRCoh values were z-scored based on a surrogate distribution (Zuk et al., 2021). Each surrogate distribution was generated by shifting the neural time course by a random amount relative to the musical feature time courses, keeping the time courses of the neural data and musical features intact. 
For each of 50 iterations, a surrogate distribution was created for each stimulation subgroup and tempo condition. The z-scoring was calculated by subtracting the mean and dividing by the standard deviation of the surrogate distribution.”
A related point, what was the dominant rate of spectral flux in the original set of stimuli, before tempo was manipulated? Could it be that the slow tempo was preferred because in this case participants listened to a most "natural" stimulus?
This is a good point, thank you. We did two things to attempt to address this (see also the comment from Reviewer 3). First, the original tempo for each song can be found in Supplementary Table 1. To make the table more readable and more comparable with the main manuscript, we have updated the table and now state the original tempi in BPM and Hz. Second, we added histograms of the original tempi across all songs as well as the maximum amount by which all songs were tempo-shifted (i.e., the maximum tempo difference between the slowest (or fastest) version of each song segment compared to the original tempo). These histograms have been added to Figure 1 – figure supplement 2, and are paraphrased here for your convenience (p. 13 l. 265-273): The original tempo of the set of musical stimuli ranges between 1-2.75 Hz. This indeed overlaps with the tempo range that revealed strongest neural synchronization. When songs were tempo-shifted to be played at a slower tempo than the original, they were shifted by ~0.25-1.25 Hz. In contrast, shifting a song to have a faster tempo typically involved a larger shift of ~1-2.25 Hz. Thus, it is definitely possible that tempo, degree of tempo shift, and proximity to “natural” tempo were not completely independent values. For that reason, to investigate the effects of the amount of tempo manipulation on neural synchronization, we conducted an additional analysis. 
We compared TRF correlations for a) songs that were shifted very little relative to their original tempi to b) songs that were shifted a lot relative to their original tempi. We did not have enough song stimuli to do this for every stimulation tempo, but we were able to do the TRF correlation comparison for two illustrative stimulation tempo conditions (at 2.25 Hz and 1.5 Hz). In those tempo conditions, we took the TRF correlations for up to three trials per participant when the original tempo was around the manipulated tempo (1.25-1.6 Hz for 1.5 Hz or 2.01-2.35 Hz for 2.25 Hz) and compared them to those trials where the original tempo was around 0.75–1 Hz faster or slower than the manipulated tempo at which the participants heard the songs (Figure 3 – figure supplement 2). This analysis revealed that there was no significant effect of the original music tempi on the neural response (please see Material and Methods, p. 40, l. 855-861 and Results p. 13, l. 265-273). In response to your and Reviewer 3’s comments, we also added this additional point to the discussion. p. 23-24 l. 427-436: “The tempo range within which we observed strongest synchronization partially coincides with the original tempo range of the music stimuli (Figure 1 – figure supplement 2). A control analysis revealed that the amount of tempo manipulation (difference between original music tempo and tempo the music segment was presented to the participant) did not affect TRF correlations. Thus, we interpret our data as reflecting a neural preference for specific musical tempi rather than an effect of naturalness or the amount that we had to tempo shift the stimuli. 
However, since our experiment was not designed to answer this question, we were only able to conduct this analysis for two tempi, 2.25 Hz and 1.5 Hz (Figure 3 – figure supplement 3), and thus are not able to rule out the influence of the magnitude of tempo manipulation on other tempo conditions.”
- The authors have a clear hypothesis about the frequency of the entrained EEG response: The one that corresponds to the musical tempo (or harmonics). It seemed to me that analyses do not sufficiently take that hypothesis into account and often include all possible frequencies. Restricting the analysis pipeline to frequencies that are expected to be involved might reduce the number of comparisons needed and therefore increase statistical power.
 Although we manipulated tempo, and so had an a priori hypothesis about the frequency at which the beat would be felt, natural music is a complex stimulus composed of different instruments playing different lines at different time scales, many or most of which are nonisochronous. Thus, we analyzed the data in two different ways – 1) based on TRFs and 2) based on stimulus–response correlation and coherence. Stimulus–response coherence is a frequency-domain measure, and so it was possible to do exactly as you suggest here and consider coherence only at the stimulation tempo and first harmonic, which we did (Figure 2E-J). However, for the TRF analyses, we followed previous literature (e.g., Ding et al., 2014; Di Liberto et al., 2020; Teng et al., 2021), and considered broader-band EEG activity (bandpass filtered at 0.5-30 Hz). Previous work has shown that the beat in music evokes a neural response at harmonics up to at least 4 times the beat rate (Kaneshiro et al., 2020), so we wanted to leave a broad frequency range intact in the neural data. Despite being based on differently filtered data, we found that the dependent measures from the two analysis approaches were correlated, which suggests to us that neural tracking at the stimulation tempo itself was probably the largest contributor to the results we observed here. Related to your comment, we added two points to our discussion, which we reproduce here for your convenience. p. 24-25, l. 453-461: “Regardless of the reason, since frequency-domain analyses separate the neural response into individual frequency-specific peaks, it is easy to interpret neural synchronization (SRCoh) or stimulus spectral amplitude at the beat rate and the note rate – or at the beat rate and its harmonics – as independent (Keitel et al., 2021). However, music is characterized by a nested, hierarchical rhythmic structure, and it is unlikely that neural synchronization at different metrical levels goes on independently and in parallel. 
One potential advantage of TRF-based analyses is that they operate on relatively wide-band data compared to Fourier-based approaches, and as such are more likely to preserve nested neural activity and perhaps less likely to lead to over- or misinterpretation of frequency-specific effects.” p. 29 l. 564-577: “Despite their differences, we found strong correspondence between the dependent variables from the two types of analyses. Specifically, TRF correlations were strongly correlated with stimulation-tempo SRCoh, and this correlation was higher than for SRCoh at the first harmonic of the stimulation tempo for the amplitude envelope, derivative and beat onsets (Figure 4 - figure supplement 1). Thus, despite being computed on a relatively broad range of frequencies, the TRF seems to be correlated with frequency-specific measures at the stimulation tempo. The strong correspondence between the two analysis approaches has implications for how users interpret their results. Although certainly not universally true, we have noticed a tendency for TRF users to interpret their results in terms of a convolution of an impulse response with a stimulus, whereas users of stimulus–response correlation or coherence tend to speak of entrainment of ongoing neural oscillations. The current results demonstrate that the two approaches produce similar results, even though the logic behind the techniques differs. Thus, whatever the underlying neural mechanism, using one or the other does not necessarily allow us privileged access to a specific mechanism.”
Reviewer #2 (Public Review):
Kristin Weineck and coauthors investigated the neural entrainment to different features of music, specifically the amplitude envelope, its derivative, the beats and the spectral flux (which describes how fast spectral changes are) and its dependence on the tempo of the music and self-reports of enjoyment, familiarity and ease of beat perception. 
They use and compare analysis approaches typically used when working with naturalistic stimuli: temporal response functions (TRFs) or reliable components analysis (RCA) to correlate the stimulus with its neural response (in this case, the EEG). The spectral flux seems the best music descriptor among the tested ones with both analyses. They find a stronger neural response to stimuli with slower beat rates and predictable stimuli, namely familiar music with an easy-to-perceive beat. Interestingly, the analysis does not show a statistically significant difference between musicians and non-musicians. The authors provide an extensive analysis of the data, but some aspects need to be clarified and extended.
We thank the Reviewer for taking the time to evaluate and summarize our manuscript and for the great comments. We addressed the concerns and made changes throughout the manuscript, but especially in the introduction and discussion sections about the terminology (neural entrainment and neural measures), musical features of the stimuli, and musical experience of the participants. Below you can find the alterations described in more detail. The page and line numbers correspond to the manuscript file without track changes.
- It would be helpful to clarify better the concepts of neural entrainment, synchronization and neural tracking and their meaning in this specific context. Those terms are often used interchangeably, and it can be hard for the reader to follow the rest of the paper if they are not explicitly defined and related to each other in the introduction. Note that this is fundamental to understanding the primary goal of the paper. The authors clarify this point only at the end of the discussion (lines 570-576). I suggest moving this part to the introduction. Still, it is unclear why the authors use the TRF model and then say they want to be agnostic about the physiological mechanisms underlying entrainment. 
The choice of the TRF (as well as the stimulus representation) automatically implies a hypothesis about a physiological mechanism, i.e., the EEG reflects convolution of the stimulus properties with an impulse response. Please could you clarify this point? I might have missed it.
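To make the convolution assumption in this comment concrete, the following is a minimal Python sketch, not the authors' pipeline (they used the mTRF Matlab Toolbox). It simulates a response as a stimulus convolved with a made-up impulse response plus noise, then recovers that impulse response by ridge regression on time-lagged copies of the stimulus. The kernel values, noise level, ridge parameter, and the function name `fit_trf` are all illustrative assumptions.

```python
import numpy as np


def fit_trf(stimulus, response, n_lags, ridge=1.0):
    """Estimate a forward TRF (one weight per lag) by ridge regression."""
    n_times = stimulus.size
    lagged = np.zeros((n_times, n_lags))
    for lag in range(n_lags):
        # Column `lag` holds the stimulus delayed by `lag` samples.
        lagged[lag:, lag] = stimulus[:n_times - lag]
    gram = lagged.T @ lagged + ridge * np.eye(n_lags)
    return np.linalg.solve(gram, lagged.T @ response)


# Hypothetical example data: a white-noise "stimulus feature" and a toy kernel.
rng = np.random.default_rng(0)
stimulus = rng.standard_normal(5000)
true_kernel = np.array([0.0, 0.5, 1.0, 0.6, 0.2, -0.3, -0.1, 0.0])

# Convolution model: response[t] = sum_k kernel[k] * stimulus[t - k] + noise
response = np.convolve(stimulus, true_kernel)[:stimulus.size]
response = response + 0.5 * rng.standard_normal(stimulus.size)

estimated_kernel = fit_trf(stimulus, response, n_lags=true_kernel.size)
print(np.round(estimated_kernel, 2))
```

The fit recovers the kernel because the data were generated by convolution in the first place; on real EEG, a good fit of this model does not by itself show whether the activity is an evoked response or an entrained oscillation.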
 Thank you for this valuable comment. We agree that it is fundamental to define and uniformly use terminology, and have made changes throughout the manuscript along these lines. First of all, we have changed all instances of “neural entrainment” or “neural tracking” to “neural synchronization”, as we think this term avoids evoking a specific theoretical background or strong mechanistic assumptions. Second, we have moved the Discussion paragraph you mention to the Introduction and expanded it. Specifically, we take the opportunity to address the association between specific analysis approaches (TRFs vs. stimulus–response correlation or coherence) and specific mechanistic assumptions (convolution of stimulus properties with an impulse response vs. entrainment of an ongoing oscillation, respectively). This allowed us to clarify what we mean when we say we prefer to stay agnostic to specific mechanistic interpretations. We are happy to have had the chance to strengthen this discussion, and think it benefits the manuscript a lot. We reproduce the new Introduction paragraph here for your convenience. p. 5-6, l. 101-123: “The current study investigated neural synchronization to natural music by using two different analysis approaches: Reliable Components Analysis (RCA) (Kaneshiro et al., 2020) and temporal response functions (TRFs) (Di Liberto et al., 2020). A theoretically important distinction here is whether neural synchronization observed using these techniques reflects phase-locked, unidirectional coupling between a stimulus rhythm and activity generated by a neural oscillator (Lakatos et al., 2019) versus the convolution of a stimulus with the neural activity evoked by that stimulus (Zuk et al., 2021). 
TRF analyses involve modeling neural activity as a linear convolution between a stimulus and relatively broad-band neural activity (e.g., 1–15 Hz or 1–30 Hz; (Crosse et al., 2016, Crosse et al., 2021); as such, there is a natural tendency for papers applying TRFs to interpret neural synchronization through the lens of convolution (though there are plenty of exceptions to this e.g., (Crosse et al., 2015, Di Liberto et al., 2015)). RCA-based analyses usually calculate correlation or coherence between a stimulus and relatively narrow-band activity, and in turn interpret neural synchronization as reflecting entrainment of a narrow-band neural oscillation to a stimulus rhythm (Doelling and Poeppel, 2015, Assaneo et al., 2019). Ultimately, understanding under what circumstances and using what techniques the neural synchronization we observe arises from either of these physiological mechanisms is an important scientific question (Doelling et al., 2019, Doelling and Assaneo, 2021, van Bree et al., 2022). However, doing so is not within the scope of the present study, and we prefer to remain agnostic to the potential generator of synchronized neural activity. Here, we refer to and discuss “entrainment in the broad sense” (Obleser and Kayser, 2019) without making assumptions about how neural synchronization arises, and we will moreover show that these two classes of analyses techniques strongly agree with each other.”
- Interestingly, the neural response to music seems stronger for familiar music. Can the authors clarify how this is not in contrast with previous works that show that violated expectations evoke stronger neural responses ([Di Liberto et al., 2020] using TRFs and [Kaneshiro et al., 2020] using RCA)? [Di Liberto et al., 2020] showed that the neural response of musicians is stronger than non-musicians as they have a stronger expectation (see point 2). 
However, in the present manuscript, the analysis does not show a statistically significant difference between musicians and non-musicians. The authors state that they had different degrees of musical training in their dataset, and therefore it is hard to see a clear difference. Still, in the "Materials and Methods" section, they divided the participants into these two groups, confusing the reader.
Our findings are consistent with previous studies showing stronger inter-subject correlation in response to music in a familiar style vs. music in an unfamiliar style (Madsen et al., 2019) and stronger phase coherence in response to familiar relative to unfamiliar sung utterances (Vanden Bosch der Nederlanden et al., 2022). We actually don’t think our results (stronger neural synchronization for familiar music) or these previous results are incompatible with work showing that violations of expectations evoke stronger neural responses. This work either manipulated music so it violated expectations (Kaneshiro et al., 2020) or explicitly modeled “surprisal” as a feature (Di Liberto et al., 2020). Thus, we could think of those stronger neural responses to expectancy violations as reflecting something like “prediction error”. Our music stimuli did not contain any violations, and we were unable to model responses to surprisal given the nature of our music stimuli, as we better explain below (p. 27 l. 514-529). Thus, neural synchronization was stronger to familiar music, and we would argue that listeners were able to form stronger expectations about music they already knew. We would predict that expectancy violations in familiar music would evoke stronger neural responses than those in unfamiliar music, though we did not test that here. We now include a paragraph in the Discussion reconciling our findings with the papers you have cited. p. 27 l. 514-529: “We found that the strength of neural synchronization depended on the familiarity of music and the ease with which a beat could be perceived (Figure 5). This is in line with previous studies showing stronger neural synchronization to familiar music (Madsen et al., 2019) and familiar sung utterances (Vanden Bosch der Nederlanden et al., 2022). Moreover, stronger synchronization for musicians than for nonmusicians has been interpreted as reflecting musicians’ stronger expectations about musical structure. 
On the surface, these findings might appear to contradict work showing stronger responses to music that violated expectations in some way (Kaneshiro et al., 2020, Di Liberto et al., 2020). However, we believe these findings are compatible: familiar music would give rise to stronger expectations and stronger neural synchronization, and stronger expectations would give rise to stronger “prediction error” when violated. In the current study, the musical stimuli never contained violations of any expectations, and so we observed stronger neural synchronization to familiar compared to unfamiliar music. There was also higher neural synchronization to music with subjectively “easy-to-tap-to” beats. Overall, we interpret our results as indicating that stronger neural synchronization is evoked in response to music that is more predictable: familiar music and with easy-to-track beat structure.” Your other question was why we did not see effects of musical training / sophistication on neural synchronization to music, when other studies have. There are a few possible reasons for this. One is that previous studies aiming to explicitly test the effects of musical training recruited either professional musicians or individuals with a high degree of musical training for their “musician” sample. In contrast, we did not target individuals with any degree of musical training, but attempted this analysis in a post-hoc way. For this reason, our musicians and nonmusicians were not as different from each other in terms of musical training as in previous work. Given this, we have opted to remove the artificial split into musician and nonmusician groups, and now only include a correlation with musical sophistication (as you suggest in your next comment), which was also nonsignificant (Figure 5 – figure supplement 2). 
- Musical expertise was also assessed using the Goldsmiths Musical Sophistication Index, which could be an alternative to the two-group comparison between musicians and non-musicians. Does this mean that in Figure 5, we should see a regression line (the higher the Gold-MSI, the higher should be the TRF correlation)? Since we do not see any significant effect, might this be due to the choice of the audio descriptor? The spectral flux is not a high-level descriptor; maybe it is worth testing some high-level descriptors such as entropy and surprise. The choice of the stimulus features defines linear models such as the TRF as they determine the hierarchical level of auditory processing, and for testing the musical expertise, we might need more than acoustic features. The authors should elaborate more on this point.
It is true that the Goldsmiths Musical Sophistication Index serves as an alternative way of investigating the effects of musical expertise on neural synchronization to natural music, and we now include this approach exclusively instead of dividing our sample (see response to the previous comment). Indeed, if musical sophistication had an effect on the TRF correlations in this study, we would see a regression line in Figure 5 – figure supplement 2. Based on our experiment it is difficult to assess whether the lack of a correlation between neural measures and musical expertise is based on our choice of stimulus features. That is because our experiment was designed to investigate the effects of fundamental acoustic features of music, and it was not possible to calculate high-level descriptors, such as the entropy or surprisal, for the music stimuli we chose to work with – the stimuli were polyphonic, and moreover were purchased in a .wav format, so we do not have access to the individual MIDI versions or sheet music of each song that would have been necessary to apply, for example, the IDyOM (Information Dynamics of Music) model. As we cannot rule out that the (lack of) effects of varying levels of musical expertise on TRF correlations is due to our choice of stimulus features, we added this to the discussion. p. 28 l. 541-546: “Another potential reason for the lack of difference between musicians and non-musicians in the current study could originate from the choice of utilizing pure acoustic audio-descriptors as opposed to “higher order” musical features. However, “higher order” features such as surprise or entropy that have been shown to be influenced by musical expertise (Di Liberto et al., 2020), are difficult to compute for natural, polyphonic music.”
- Regarding the stimulus representation, I have a few points. The authors say that the amplitude envelope is too limited a representation for music stimuli. 
However, before testing the spectral flux, why not test the spectrogram as in previous studies? Moreover, the authors tested a TRF combining all features, but it was not clear how they combined the features.
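As a rough illustration of the two points raised here, the Python sketch below reduces a two-dimensional magnitude spectrogram to a one-dimensional spectral-flux curve, and then stacks several one-dimensional features, each at several time lags, into one design matrix of the kind a multivariate TRF takes as input. The flux definition (half-wave rectified frame-to-frame difference, summed over frequency) is one common choice and is an assumption here, as are the function names and the random toy data; the study itself used the mTRF Matlab Toolbox.

```python
import numpy as np


def spectral_flux(spectrogram):
    """Reduce a magnitude spectrogram (n_freqs, n_frames) to a 1-D curve."""
    spectrogram = np.asarray(spectrogram, dtype=float)
    change = np.diff(spectrogram, axis=1)        # frame-to-frame change per frequency bin
    flux = np.maximum(change, 0.0).sum(axis=0)   # keep increases only, sum over frequency
    return np.concatenate(([0.0], flux))         # pad so the output has n_frames samples


def lagged_design_matrix(features, lags):
    """Stack lagged copies of all features into a (T, M * n_lags) matrix.

    features: array of shape (T, M), one column per stimulus feature.
    lags: non-negative integer sample lags, each smaller than T.
    """
    features = np.asarray(features, dtype=float)
    n_times, n_feats = features.shape
    lags = list(lags)
    design = np.zeros((n_times, n_feats * len(lags)))
    for j, lag in enumerate(lags):
        cols = slice(j * n_feats, (j + 1) * n_feats)
        # Row t receives the feature values from `lag` samples earlier.
        design[lag:, cols] = features[:n_times - lag]
    return design


# Toy data standing in for a real spectrogram (hypothetical values).
rng = np.random.default_rng(1)
spec = np.abs(rng.standard_normal((64, 500)))
envelope = spec.sum(axis=0)              # crude stand-in for an amplitude envelope
flux = spectral_flux(spec)
features = np.column_stack([envelope, flux])
design = lagged_design_matrix(features, lags=range(10))
print(design.shape)  # (500, 20)
```

A full spectrogram would enter the same design matrix as one column per frequency band, which is why it fits a TRF but not a measure that expects a single stimulus time course.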
One of the main reasons that we did not use the spectrogram as a feature was that it wouldn’t be possible to use a two-dimensional representation for the RCA-based measures, SRCorr and SRCoh, so we would not have been able to compare across analysis approaches. However, spectral flux is calculated directly from the spectrogram, and so is a useful one-dimensional measure that captures the spectro-temporal fluctuations present in the spectrogram (https://musicinformationretrieval.com/novelty_functions.html). Thank you for making this important point; we added this explanation to the Materials and Methods section (p. 35 l. 726-727).
Sorry for not explaining the multivariate TRF approach better. Instead of using only one stimulus feature, e.g., the amplitude envelope, several stimulus features can be concatenated into a matrix (with the dimensions: time T × 4 musical features M at different time lags), which is then used as an input for the mTRFcrossval, mTRFtrain and mTRFpredict functions of the mTRF Matlab Toolbox (Crosse et al., 2016) – actually this is exactly how using a 2D feature like the spectrogram would work. The multivariate TRF is calculated by extending the stimulus lag matrix (time course of one musical feature at different time lags, T × τwindow) by an additional dimension (time course of several musical features at different time lags, T × M × τwindow). We added an explanation to the Methods section of the manuscript and hope that it is now easier to understand: p. 39 l. 840-842: “For the multivariate TRF approach, the stimulus features were combined by replacing the single time-lag vector by several time-lag vectors for every musical feature (Time x 4 musical features at different time lags).”
Reviewer #3 (Public Review):
Subjects listened to various excerpts from music recordings that were designed to cover musical tempi ranging from 1-4 Hz, and EEG was recorded as subjects listened to these excerpts. 
The main and novel findings of the study were: 1) spectral flux, measuring sudden changes in frequency, was tracked better in the EEG than other measures of fluctuations in amplitude, 2) neural tracking seemed to be best for the slowest tempi, 3) measures of neural tracking were higher when subjects rated an excerpt as high for ease-of-tapping and familiarity, and 4) their measure of the mapping between stimulus feature and response could predict whether a subject tapped at the expected tempo or at 2x the expected tempo after listening to the musical excerpt.

One of the key strengths of this study is the use of novel methodologies. The authors in this study used natural and digitally manipulated music covering a wide range of tempi, which is unique to studies of musical beat tracking. They also included both measures of stimulus-response correlation and phase coherence along with a method of linear modeling (the temporal response function, or TRF) in order to quantify the strength of tracking, showing that they produce correlated results. Lastly, and perhaps most importantly, they also had subjects tap along with the music after listening to the full excerpt. While having a measure of tapping rate itself is not new, combined with their other measures they were able to demonstrate that neural data predicted the hierarchical level of tapping rate, opening up opportunities to study the relationship between neural tracking, musical features, and a subject's inferred metrical level of the musical beat. Additionally, the finding that spectral flux produced the best correlations with the EEG data is an important one. Many studies have focused primarily on the envelope (amplitude fluctuations) when quantifying neural tracking of continuous sounds, but this study shows that, for music at least, spectral flux may add information that is tracked by the EEG. 
However, given that it is also highly correlated with the envelope, what additional features spectral flux contributes to measuring EEG tracking is not clear from the current results and worth further study. All four of their main findings are important for research into the neural coding of musical rhythm. I have some concerns, however, that two of these findings could be a consequence of the methods used, and one could be explained by related correlations to acoustic features:

We thank the Reviewer for the very helpful review, the summary, and the great suggestions. We addressed the comments and performed additional analysis. We made changes throughout the manuscript, but especially 1) concerning the potential advantage of the neural response to slower music, 2) the effects of the amount of tempo manipulation on neural synchronization, 3) the SVM-related analysis and 4) the relation between stimulus features and behavioral ratings. The implemented modifications can be found below in more detail. The page and line numbers correspond to the manuscript file without track changes.

The authors found that their measures of neural tracking were highest for the lowest musical tempos. This is interesting, but it is also possible that this is a consequence of lower frequencies producing a large spread of correlations. Imagine two signals that are fluctuating in time with a similar pattern of fluctuation. When they are correctly-aligned they are correlated with each other, but if you shift one of the signals in time those fluctuations are mismatched and you can end up with zero or negative correlations. Now imagine making those fluctuations much slower. If you use the same time shifts as before, the signals will still be fairly correlated, because the rates of signal change are much longer. As a result, the span of null correlations also increases. This can be corrected by normalizing the true correlations and prediction accuracies with a null distribution at each tempo. 
But with this in mind, it is hard to conclude if the greater correlations found for lower musical tempos in their current form are a true effect.

Thank you for this great suggestion. We followed your lead (Zuk et al., 2021), and normalized all measures of neural synchronization (TRF correlation, SRCorr, SRCoh) relative to a surrogate distribution. The surrogate distribution was calculated by randomly and circularly shifting the neural data relative to the musical features for each of 50 iterations. This was done separately for every musical feature and stimulation tempo condition (Figures 2 and 3). After normalization, the results look qualitatively similar and the main results – spectral flux and slow stimulation tempi resulting in highest levels of neural synchronization – persist. The changes in the manuscript based on your comment (and the comment of Reviewer 1) can be found throughout the manuscript, but especially on p. 11, l. 210-218, Figures 2-3 and a more detailed explanation in the Methods section: p. 39, l. 821-829: “In order to control for any frequency-specific differences in the overall power of the neural data that could have led to artificially inflated observed neural synchronization at lower frequencies, the SRCorr and SRCoh values were z-scored based on a surrogate distribution (Zuk et al., 2021). Each surrogate distribution was generated by shifting the neural time course by a random amount relative to the musical feature time courses, keeping the time courses of the neural data and musical features intact. For each of 50 iterations, a surrogate distribution was created for each stimulation subgroup and tempo condition. 
The z-scoring was calculated by subtracting the mean and dividing by the standard deviation of the surrogate distribution.”

If the strength of neural tracking at low tempos is a true effect, it is worth noting that the original tempi for the music clips span 1 - 2.5 Hz (Supplementary Table 1), roughly the range of tempi exhibiting the largest prediction accuracies and correlations. All tempos above this range are produced by digitally manipulating the music. It is possible that the neural tracking measures are higher for music without any digital manipulations rather than reflecting the strength of tracking at various tempi. This could also be related to the authors' finding that neural tracking was better for more familiar excerpts. This alternative interpretation should be acknowledged and mentioned in the discussion.

Thank you for these important suggestions (see also comment #2 (part 2) from Reviewer 1). First, it is important to say that all music stimuli were tempo manipulated: even if the tempo of an original music segment was e.g. 2 Hz and the same song was presented at 2 Hz, it was still converted via the MAX patch to 2 Hz again (to make it comparable to the other musical stimuli). Second, it is true that we cannot fully exclude the possibility that the amount of tempo manipulation could have an effect on neural synchronization to music, meaning that less tempo-manipulated music segments (i.e. a stimulation tempo close to the original tempo) could result in higher neural synchronization. However, we have now conducted an additional analysis to address this as best we could. We compared TRF correlations for a) songs that were shifted very little relative to their original tempi to b) songs that were shifted a lot relative to their original tempi. We did not have enough song stimuli to do this for every stimulation tempo, but we were able to do the TRF correlation comparison for two illustrative stimulation tempo conditions (at 2.25 Hz and 1.5 Hz). 
In those tempo conditions, we took the TRF correlations for up to three trials per participant when the original tempo was around the manipulation tempo (1.25-1.6 Hz for 1.5 Hz or 2.01-2.35 Hz for 2.25 Hz) and compared them to those trials where the original tempo was around 0.75–1 Hz faster or slower than the manipulated tempo at which the participants heard the songs (Figure 3 – figure supplement 2). This analysis revealed that there was no significant effect of the original music tempi on the neural response (please see Materials and Methods, p. 40, l. 855-861 and Results p. 13, l. 265-273). In response to your and Reviewer 1’s comments, we also added it to the discussion. p. 23-24 l. 427-436: “The tempo range within which we observed strongest synchronization partially coincides with the original tempi of the music stimuli (Figure 1 – figure supplement 2). A control analysis revealed that the amount of tempo manipulation (difference between original music tempo and tempo the music segment was presented to the participant) did not affect TRF correlations. Thus, we interpret our data as reflecting a neural preference for specific musical tempi rather than an effect of naturalness or the amount that we had to tempo shift the stimuli. However, since our experiment was not designed to answer this question, we were only able to conduct this analysis for two tempi, 2.25 Hz and 1.5 Hz (Figure 3 – figure supplement 3), and thus are not able to rule out the influence of tempo manipulation on other tempo conditions.” We also provide more information to the reader about the amount of tempo shift that each stimulus underwent. We added two plots to the manuscript that show 1) the distribution of original tempi of the music stimuli and 2) the distribution of the amount of tempo manipulation across all stimuli (Figure 1 – figure supplement 2).

Their last finding regarding predicting tapping rates is novel and important, and the model they use to make those predictions does well. 
But I am concerned by how well it performs (Figure 6), since it is not clear what features of the TRF are being used to produce this discrimination. Are the effects producing discriminable tapping rates and stimulation tempi apparent in the TRF? I noticed, though, that these results came from two stages of modeling: TRFs were first fit to groups of excerpts with different tapping rates or stimulation tempo separately, then a support vector machine (SVM) was used to discriminate between the two groups. So, another way to think about this pipeline is that two response models (TRFs) were generated for the separate groups, and the SVM finds a way of differentiating between them. There is no indication about what features of the TRFs the SVM is using, and it is possible this is overfitting. Firstly, I think it needs to be clearer how the TRFs are being computed from individual trials. Secondly, the authors construct surrogate data by shuffling labels (before training) but it is not clear at which training stage this is performed. They can correct for possible issues of overfitting by comparing to surrogate data where shuffling happens before the TRF computation, if this wasn't done already.

Thank you for noticing this important point. You are absolutely right: when re-analyzing that part of the results based on your comment, we noticed that we had an error in our understanding of the analysis pipeline. Indeed, we first calculated two TRF models for the separate groups (e.g. stimulation tempo = tapping tempo vs. stimulation tempo = 2 × tapping tempo) based on all trials of each group apart from the left-out trial. Next, the resulting TRFs were fed into the SVM, which was used to predict the group. The shuffling of the surrogate data occurred at the SVM training step. Based on your comment, we tried several approaches to solve this problem. 
First, we calculated TRFs on a single-trial basis (instead of using the two-group TRFs as before, only one trial was used to calculate the TRFs) and submitted the resulting TRFs to the SVM. The resulting SVM accuracy was compared to a “surrogate SVM accuracy” which was calculated based on shuffling the labels when training the SVM classifier. Second, we shuffled, as you suggest, the labels not at the SVM training step, but instead prior to the TRF calculation. This way we could compare our “original” SVM accuracies (based on the two-group TRFs) to a fairer surrogate dataset. However, in both cases the resulting SVM accuracies did not perform better than the surrogate data. Therefore, we felt that it was fairest to remove this part from the manuscript. We are aware that this was one of the main results of the paper and we are sorry that we had to remove it. However, we feel that our paper is still strong and offers a variety of different results that are important for the auditory neuroscience community.

Lastly, they show that their measures of neural tracking are larger for music with high familiarity and high ease-of-tapping. I expect these qualitative ratings could be a consequence of acoustic features that produce better EEG correlations and prediction accuracies, especially ease-of-tapping. For example, music with acoustically-salient events is probably easier to tap to and would produce better EEG correlations and prediction accuracies, hence why ease-of-tapping is correlated with the measures of neural tracking. To understand this better, it would be useful to see how the stimulus features correlate with each of these behavioral ratings.

We agree that our rating-based results could be influenced by acoustic stimulus features (at least for ease of tapping; it is actually not clear to us why familiarity would be related to acoustics). 
As it is difficult to correlate stimulus features (time-domain, and one time course per song) with behavioral ratings (one single value per song per participant), we conducted frequency-domain analysis on the musical features to arrive at a single value quantifying the strength of spectral flux at the stimulation frequency and its first harmonic. We calculated single-trial FFTs on the spectral flux (which was used for the main Figure 5) for the 15 highest- and 15 lowest-rated trials per behavioral category (enjoyment, familiarity, ease to tap the beat) and participant. We compared the z-scored FFT peaks at the stimulation tempo and first harmonic for the top- and bottom-rated stimuli. We did observe significant acoustic differences between top- and bottom-rated stimuli in each category, but the differences were not in the direction that would be expected based on acoustically more salient events leading to better TRF correlations, with the exception of ease of tapping. Easy-to-tap music did indeed have stronger spectral flux than difficult-to-tap music, which is intuitive. However, spectral flux was stronger for more enjoyed music (we did not see any significant differences between TRF correlations of more vs. less enjoyed music; Figure 5C) and for less familiar music (this is the opposite of what we saw for the TRF measures). Overall, given the inconsistent relationship between acoustics, behavioral ratings, and TRF measures, we would argue that acoustic features alone cannot explain our results (Figure 5 – figure supplement 1, p. 21 l. 381 – 387). 
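
The frequency-domain summary described in this response (one value per trial for the strength of spectral flux at the stimulation tempo and at its first harmonic) could be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the function name `zscored_peaks`, the Hann window, the 0.5-10 Hz reference band used for z-scoring, and the synthetic "flux" signals are all assumptions made here, since the response does not specify how the FFT peaks were z-scored.

```python
import numpy as np

def zscored_peaks(flux, fs, stim_hz, band=(0.5, 10.0)):
    """Z-scored FFT amplitude at stim_hz and at its first harmonic (2 * stim_hz).

    Assumption: amplitudes are z-scored against all frequency bins inside
    `band`; the original analysis may have normalized differently.
    """
    flux = np.asarray(flux, dtype=float)
    flux = flux - flux.mean()                      # remove the DC offset
    amp = np.abs(np.fft.rfft(flux * np.hanning(flux.size)))
    freqs = np.fft.rfftfreq(flux.size, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    z = (amp - amp[in_band].mean()) / amp[in_band].std()

    def at(f):
        # value at the frequency bin closest to f
        return z[np.argmin(np.abs(freqs - f))]

    return at(stim_hz), at(2.0 * stim_hz)

# Synthetic stand-ins for spectral flux time courses (hypothetical data):
# a 2 Hz rhythm plus its 4 Hz harmonic, embedded in noise.
fs = 100.0
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(0)
noise = rng.normal(0, 1, t.size)
strong = 1.0 * np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 4 * t) + noise
weak = 0.2 * np.sin(2 * np.pi * 2 * t) + 0.1 * np.sin(2 * np.pi * 4 * t) + noise

strong_f0, strong_h1 = zscored_peaks(strong, fs, stim_hz=2.0)
weak_f0, weak_h1 = zscored_peaks(weak, fs, stim_hz=2.0)
print(strong_f0, strong_h1, weak_f0, weak_h1)
```

In an analysis like the one described, such per-trial values would then be compared between the highest- and lowest-rated trials of each behavioral category.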
- 
Reviewer #1 (Public Review): This paper examines EEG responses time-locked to (or "entrained" by) musical features and how these depend on tempo and feature identity. Results revealed stronger entrainment to "spectral flux" than to other, more commonly tested features such as amplitude envelope. Entrainment was also strongest for lowest rates tested (1-2 Hz). The paper is well written, its structure is easy to follow and the research topic is explained in a way that makes it accessible to readers outside of the field. Results will advance the scientific field and give us further insights into neural processes underlying auditory and music perception. Nevertheless, there are a few points that I believe need to be clarified or discussed to rule out alternative explanations or to better understand the acquired data.
- Results reveal spectral flux as the musical feature producing strongest entrainment. However, entrainment can only be compared across features in an unbiased way if these features are all equally present in the stimulus. I wonder whether entrainment to spectral flux is only most pronounced because the latter is the most prominent feature in music. Can the authors rule out such an explanation?
- Spectral analyses of neural data often yield the strongest power at lowest frequencies. Measures of entrainment can be biased by the amount of power present, where entrainment increases with power. Can the authors rule out that the advantage for lower frequencies is a reflection of such an effect? A related point, what was the dominant rate of spectral flux in the original set of stimuli, before tempo was manipulated? Could it be that the slow tempo was preferred because in this case participants listened to a most "natural" stimulus?
- The authors have a clear hypothesis about the frequency of the entrained EEG response: The one that corresponds to the musical tempo (or harmonics). It seemed to me that analyses do not sufficiently take that hypothesis into account and often include all possible frequencies. Restricting the analysis pipeline to frequencies that are expected to be involved might reduce the number of comparisons needed and therefore increase statistical power.
 
- 
Reviewer #2 (Public Review): Kristin Weineck and coauthors investigated the neural entrainment to different features of music, specifically the amplitude envelope, its derivative, the beats and the spectral flux (which describes how fast spectral changes are), and its dependence on the tempo of the music and self-reports of enjoyment, familiarity and ease of beat perception. They use and compare analysis approaches typically used when working with naturalistic stimuli: temporal response functions (TRFs) or reliable components analysis (RCA) to correlate the stimulus with its neural response (in this case, the EEG). The spectral flux seems the best music descriptor among the tested ones with both analyses. They find a stronger neural response to stimuli with slower beat rates and predictable stimuli, namely familiar music with an easy-to-perceive beat. Interestingly, the analysis does not show a statistically significant difference between musicians and non-musicians. The authors provide an extensive analysis of the data, but some aspects need to be clarified and extended.
1. It would be helpful to clarify better the concepts of neural entrainment, synchronization and neural tracking and their meaning in this specific context. Those terms are often used interchangeably, and it can be hard for the reader to follow the rest of the paper if they are not explicitly defined and related to each other in the introduction. Note that this is fundamental to understanding the primary goal of the paper. The authors clarify this point only at the end of the discussion (lines 570-576). I suggest moving this part to the introduction. Still, it is unclear why the authors use the TRF model and then say they want to be agnostic about the physiological mechanisms underlying entrainment. The choice of the TRF (as well as the stimulus representation) automatically implies a hypothesis about a physiological mechanism, i.e., the EEG reflects convolution of the stimulus properties with an impulse response. Please could you clarify this point? I might have missed it.
2. Interestingly, the neural response to music seems stronger for familiar music. Can the authors clarify how this is not in contrast with previous works that show that violated expectations evoke stronger neural responses ([Di Liberto et al., 2020] using TRFs and [Kaneshiro et al., 2020] using RCA)? [Di Liberto et al., 2020] showed that the neural response of musicians is stronger than non-musicians as they have a stronger expectation (see point 2). However, in the present manuscript, the analysis does not show a statistically significant difference between musicians and non-musicians. The authors state that they had different degrees of musical training in their dataset, and therefore it is hard to see a clear difference. Still, in the "Materials and Methods" section, they divided the participants into these two groups, confusing the reader.
3. Musical expertise was also assessed using the Goldsmiths Musical Sophistication Index, which could be an alternative to the two-group comparison between musicians and non-musicians. Does this mean that in Figure 5, we should see a regression line (the higher the Gold-MSI, the higher should be the TRF correlation)? Since we do not see any significant effect, might this be due to the choice of the audio descriptor? The spectral flux is not a high-level descriptor; maybe it is worth testing some high-level descriptors such as entropy and surprise. The choice of the stimulus features defines linear models such as the TRF as they determine the hierarchical level of auditory processing, and for testing the musical expertise, we might need more than acoustic features. The authors should elaborate more on this point.
4. Regarding the stimulus representation, I have a few points. The authors say that the amplitude envelope is too limited a representation for music stimuli. However, before testing the spectral flux, why not test the spectrogram as in previous studies? Moreover, the authors tested the TRF on combining all features, but it was not clear how they combined the features.
- 
Reviewer #3 (Public Review): This study uses novel methodologies to study the neural tracking of music, and the results highlight the importance of accounting for spectral changes when quantifying neural tracking to music. However, more work needs to be done to validate that the results are not a consequence of their analyses or their choice of music before tempo manipulation. One of the key strengths of this study is the use of novel methodologies. The authors in this study used natural and digitally manipulated music covering a wide range of tempi, which is unique to studies of musical beat tracking. They also included both measures of stimulus-response correlation and phase coherence along with a method of linear modeling (the temporal response function, or TRF) in order to quantify the strength of tracking, showing that they produce correlated results. Lastly, and perhaps most importantly, they also had subjects tap along with the music after listening to the full excerpt. 
While having a measure of tapping rate itself is not new, combined with their other measures they were able to demonstrate that neural data predicted the hierarchical level of tapping rate, opening up opportunities to study the relationship between neural tracking, musical features, and a subject's inferred metrical level of the musical beat. Additionally, the finding that spectral flux produced the best correlations with the EEG data is an important one. Many studies have focused primarily on the envelope (amplitude fluctuations) when quantifying neural tracking of continuous sounds, but this study shows that, for music at least, spectral flux may add information that is tracked by the EEG. However, given that it is also highly correlated with the envelope, what additional features spectral flux contributes to measuring EEG tracking is not clear from the current results and worth further study. All four of their main findings are important for research into the neural coding of musical rhythm. I have some concerns, however, that two of these findings could be a consequence of the methods used, and one could be explained by related correlations to acoustic features: The authors found that their measures of neural tracking were highest for the lowest musical tempos. This is interesting, but it is also possible that this is a consequence of lower frequencies producing a large spread of correlations. Imagine two signals that are fluctuating in time with a similar pattern of fluctuation. When they are correctly-aligned they are correlated with each other, but if you shift one of the signals in time those fluctuations are mismatched and you can end up with zero or negative correlations. Now imagine making those fluctuations much slower. If you use the same time shifts as before, the signals will still be fairly correlated, because the rates of signal change are much longer. As a result, the span of null correlations also increases. 
This can be corrected by normalizing the true correlations and prediction accuracies with a null distribution at each tempo. But with this in mind, it is hard to conclude if the greater correlations found for lower musical tempos in their current form are a true effect. If the strength of neural tracking at low tempos is a true effect, it is worth noting that the original tempi for the music clips span 1 - 2.5 Hz (Supplementary Table 1), roughly the range of tempi exhibiting the largest prediction accuracies and correlations. All tempos above this range are produced by digitally manipulating the music. It is possible that the neural tracking measures are higher for music without any digital manipulations rather than reflecting the strength of tracking at various tempi. This could also be related to the authors' finding that neural tracking was better for more familiar excerpts. This alternative interpretation should be acknowledged and mentioned in the discussion. Their last finding regarding predicting tapping rates is novel and important, and the model they use to make those predictions does well. But I am concerned by how well it performs (Figure 6), since it is not clear what features of the TRF are being used to produce this discrimination. Are the effects producing discriminable tapping rates and stimulation tempi apparent in the TRF? I noticed, though, that these results came from two stages of modeling: TRFs were first fit to groups of excerpts with different tapping rates or stimulation tempo separately, then a support vector machine (SVM) was used to discriminate between the two groups. So, another way to think about this pipeline is that two response models (TRFs) were generated for the separate groups, and the SVM finds a way of differentiating between them. There is no indication about what features of the TRFs the SVM is using, and it is possible this is overfitting. 
Firstly, I think it needs to be clearer how the TRFs are being computed from individual trials. Secondly, the authors construct surrogate data by shuffling labels (before training) but it is not clear at which training stage this is performed. They can correct for possible issues of overfitting by comparing to surrogate data where shuffling happens before the TRF computation, if this wasn't done already. Lastly, they show that their measures of neural tracking are larger for music with high familiarity and high ease-of-tapping. I expect these qualitative ratings could be a consequence of acoustic features that produce better EEG correlations and prediction accuracies, especially ease-of-tapping. For example, music with acoustically-salient events is probably easier to tap to and would produce better EEG correlations and prediction accuracies, hence why ease-of-tapping is correlated with the measures of neural tracking. To understand this better, it would be useful to see how the stimulus features correlate with each of these behavioral ratings. 