Where is the melody? Spontaneous attention orchestrates melody formation during polyphonic music listening
Curation statements for this article:
Curated by eLife
eLife Assessment
This valuable work potentially advances our understanding of melody extraction in polyphonic music listening by identifying spontaneous attentional focus in uninstructed listening contexts. However, the evidence supporting the main conclusions is incomplete. The work will be of interest to psychologists and neuroscientists working on music listening, attention, and perception in ecological settings.
This article has been reviewed by the following groups:
Listed in:
- Evaluated articles (eLife)
Abstract
Humans seamlessly process multi-voice music into a coherent perceptual whole. Yet the neural strategies supporting this experience remain unclear. One fundamental component of this process is the formation of melody, a core structural element of music. Previous work on monophonic listening has provided strong evidence for the neurophysiological basis of melody processing, for example indicating predictive processing as a foundational mechanism underlying melody encoding. However, considerable uncertainty remains about how melodies are formed during polyphonic music listening, as existing theories (e.g., divided attention, figure–ground model, stream integration) fail to unify the full range of empirical findings. Here, we combined behavioral measures with non-invasive electroencephalography (EEG) to probe spontaneous attentional bias and melodic expectation while participants listened to two-voice classical excerpts. Our uninstructed listening paradigm eliminated a major experimental constraint, creating a more ecologically valid setting. We found that attention bias was significantly influenced by both the high-voice superiority effect and intrinsic melodic statistics. We then employed transformer-based models to generate next-note expectation profiles and test competing theories of polyphonic perception. Drawing on our findings, we propose a weighted-integration framework in which attentional bias dynamically calibrates the degree of integration of the competing streams. In doing so, the proposed framework reconciles previous divergent accounts by showing that, even under free-listening conditions, melodies emerge through an attention-guided statistical integration mechanism.
Article activity feed
Reviewer #1 (Public review):
Summary:
This manuscript investigates the interplay between spontaneous attention and melody formation during polyphonic music listening. The authors use EEG recordings during uninstructed listening to examine how attention bias influences melody processing, employing both behavioural measures and computational modelling with music transformers. The study introduces a very clever pitch-inversion manipulation design to dissociate high-voice superiority from melodic salience, and proposes a "weighted integration" model where attention dynamically modulates how multiple voices are combined into perceived melody.
Strengths:
(1) The attention bias findings (Figure 2) are compelling and methodologically sound, with convergent evidence from both behavioral and neural measures.
(2) The pitch-inversion manipulation elegantly dissociates the two competing factors (high-voice superiority vs. melodic salience); moreover, the authors claim that the chosen music lends itself particularly well to the PolyInv condition, a claim I cannot fully evaluate, but one that would make the design even neater.
(3) Nice bridge between hypotheses and operationalisations.
Weaknesses:
The results in Figure 3 are very striking, but I have a number of questions before I can consider myself convinced.
(1) Conceptual questions about surprisal analysis:
The pattern of results seems backwards to me. Since the music is inherently polyphonic in PolyOrig, I'd expect the polyphonic model to fit the brain data better - after all, that's what the music actually is. These voices were composed to interact harmonically, so modeling them as independent monophonic streams seems like a misspecification. Why would the brain match this misspecified model better?
Conversely, it would seem to me that the pitch inversion in PolyInv disrupts the harmonic coherence (at least to some extent), so if anywhere, I'd a priori expect that in this condition listeners would be processing the streams separately, making the monophonic model fit better (or less poorly) there, not in PolyOrig. The current pattern is exactly opposite to what seems logical to me.
(2) Missing computational analyses:
If the transformer is properly trained, it should "understand" (i.e., predict/compress) the polyphonic music better, right? Can the authors demonstrate this via perplexity scores, bits-per-byte, or other prediction metrics, comparing how well each model (polyphonic vs monophonic) handles the music in both conditions? Similarly, if PolyInv truly maintains musical integrity as claimed, the polyphonic model should handle it as well as PolyOrig. But if the inversion does disrupt the music, we should see this reflected in degraded prediction scores. These metrics would validate whether the experimental manipulation works as intended. Also, how strongly are the surprisal streams correlated? There are many non-trivial modelling steps that should be reported in more detail. A sketch of the kind of metrics I have in mind is given after this list.
(3) Methodological inconsistencies:
Why are the two main questions (Figures 2 and 3) answered with completely different analytical approaches? The switch from TRF to CCA with match-vs-mismatch classification seems unmotivated. I think it's very important to provide a simpler model comparison (just TRF with acoustic features plus either polyphonic or monophonic surprisal), evaluated on relevant electrodes or the full scalp. This would make the results more comparable and interpretable; a sketch of such a comparison is given after this list.
(4) Presentation and methods:
a) Coming from outside music/music theory, I found the paper somewhat abstract and hard to parse initially. The experimental logic becomes clearer with reflection, but you're doing yourselves a disservice with the jargon-heavy presentation. It would be useful to include example stimuli.
b) The methods section is extremely brief - no details whatsoever are provided regarding the modelling: What specific music transformer architecture? Which implementation of this "anticipatory music transformer"? Pre-trained on what corpus - monophonic, polyphonic, Western classical only? What constituted "technical issues" for the 9 excluded participants? What were the channel rejection criteria?
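Regarding point (2): a minimal sketch of the prediction metrics and redundancy check requested above, assuming per-note surprisal values (negative log2-probabilities) can be exported from each model; the random placeholder data below simply stand in for those values and are not taken from the study.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Placeholder surprisal streams (bits per note); substitute each model's
# actual per-note negative log2-probabilities for a given excerpt/condition.
s_poly = rng.gamma(shape=2.0, scale=1.5, size=200)      # hypothetical polyphonic-model surprisal
s_mono = 0.7 * s_poly + rng.gamma(2.0, 1.5, size=200)   # hypothetical monophonic-model surprisal

def bits_per_note(surprisal_bits):
    """Mean surprisal in bits; perplexity is 2 ** bits_per_note."""
    return float(np.mean(surprisal_bits))

for name, s in [("polyphonic model", s_poly), ("monophonic model", s_mono)]:
    bpn = bits_per_note(s)
    print(f"{name}: {bpn:.2f} bits/note, perplexity {2 ** bpn:.1f}")

# How redundant are the two surprisal regressors fed to the EEG analysis?
r, p = pearsonr(s_poly, s_mono)
print(f"surprisal correlation: r = {r:.2f}, p = {p:.3g}")
```

Computing these per condition (PolyOrig vs PolyInv) would show both whether the polyphonic model "understands" the music better and whether the inversion degrades predictability.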
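Regarding point (3): a sketch of the simpler forward-TRF comparison suggested above, using a plain lagged ridge regression rather than any particular toolbox; all arrays below are simulated placeholders for the authors' envelope, surprisal, and EEG time series.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def lagged(X, n_lags):
    """Stack time-lagged copies of a (n_times, n_features) matrix."""
    n_t, n_f = X.shape
    out = np.zeros((n_t, n_f * n_lags))
    for lag in range(n_lags):
        out[lag:, lag * n_f:(lag + 1) * n_f] = X[:n_t - lag]
    return out

def trf_score(features, eeg, n_lags=32, alpha=1e3):
    """Cross-validated forward-TRF prediction correlation, one value per electrode."""
    X = lagged(features, n_lags)
    fold_scores = []
    for train, test in KFold(n_splits=5).split(X):
        pred = Ridge(alpha=alpha).fit(X[train], eeg[train]).predict(X[test])
        fold_scores.append([np.corrcoef(pred[:, ch], eeg[test, ch])[0, 1]
                            for ch in range(eeg.shape[1])])
    return np.mean(fold_scores, axis=0)

# Placeholder data at 64 Hz (substitute the real envelope, surprisal and EEG).
rng = np.random.default_rng(1)
n_times, n_channels = 64 * 120, 64
envelope = rng.random((n_times, 1))
surprisal_mono = rng.random((n_times, 1))
surprisal_poly = rng.random((n_times, 1))
eeg = rng.standard_normal((n_times, n_channels))

r_mono = trf_score(np.hstack([envelope, surprisal_mono]), eeg)
r_poly = trf_score(np.hstack([envelope, surprisal_poly]), eeg)
print("mean r (acoustics + monophonic surprisal):", r_mono.mean())
print("mean r (acoustics + polyphonic surprisal):", r_poly.mean())
```

The per-electrode correlations from the two feature sets could then be compared within participants, which would make the surprisal results directly comparable to the Figure 2 analysis.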
Reviewer #2 (Public review):
Summary:
The authors sought to understand the drivers of spontaneous attentional bias and melodic expectation generation during listening to short two-part classical pieces. They measured scalp EEG data in a monophonic condition and trained a model to reconstruct the audio envelope from the EEG. They then used this model to probe which of the two voices was best reflected in the neural signal during two polyphonic conditions. In one condition, the original piece was presented, in the other, the voices were switched in an attempt to distinguish between effects of (a) the pitch range of one voice compared to the other and (b) intrinsic melodic features. They also collected a behavioural measure of attentional bias for a subset of the stimuli in a separate study. Further modelling assessed whether expectations of how the melody would unfold were formed based on an integrated percept of melody across the two voices, or based on a single voice. The authors sought to relate the findings to different theories of how musical/auditory scene analysis occurs, based on divided attention, figure-ground perception, and stream integration.
Strengths:
(1) A clever but simple manipulation - transposing the voices such that the higher one became the lower one - allowed an assessment of different factors that might affect the allocation of attention.
(2) State-of-the-art analytic techniques were applied to (a) build a music attention decoder (these are more commonly encountered for speech) and (b) relate the neural data to features of the stimulus at the level of acoustics and expectation.
(3) The effects appeared robust across the group, not driven by a handful of participants.
Weaknesses:
(1) A key goal of the work is to establish the relative importance for the listener's attention of a voice's (a) mean pitch in the context of the two voices (high-voice superiority) and (b) intrinsic melodic statistics/motif attractiveness. The rationale of the experimental manipulation is that switching the relative height of the lines allows these to be dissociated by imparting the same high-voice benefit to the new high-voice and the same preferred intrinsic melodic statistics to the new low voice. However, previous work suggests that the high-voice superiority effect is not all-or-nothing. Electrophysiology supported by auditory nerve modelling found it to depend on the degree of voice separation in a non-monotonic way (see https://doi.org/10.1016/j.heares.2013.07.014 at p. 68). Although the authors keep the overall pitch of the lower (and upper) line fixed across conditions, systematically different contour patterns across the voices could give rise to a sub-optimal distribution of separations in the PolyInv versus PolyOrig condition. This could weaken the high-voice superiority effect in PolyInv and explain the pattern of results. One could argue that such contour differences are examples of the "intrinsic melodic statistics" put forward as the effect working in opposition to high-voice superiority, but it is their interaction across voices that matters here.
(2) Although melody statistics are mentioned throughout, none have been calculated. It would be helpful to see the features that presumably lead to "motif attractiveness" quantified, as well as how they differ across lines. The work of David Huron, such as at https://dl.acm.org/doi/abs/10.1145/3469013.3469016, provides examples that could be calculated with ease and compared across the two lines: "the tendency for small over large pitch movements, for large leaps to ascend, for musical phrases to fall in pitch, and for phrases to begin with an initial pitch rise". The authors also mention differences in ornamentation. Such comparisons would make it more tangible for the reader what differs between the original "melody" and "support" lines. In particular, as the authors themselves note, lines in double-counterpoint pieces can, to a degree, operate interchangeably. Bach's inventions in particular use a lot of direct repetition (up to octave invariance), which one would expect to minimise differences in the statistics mentioned. The references purporting to relate to melodic statistics (11-14 in original numbering) seem rather to relate to high-voice superiority. A sketch of how such contour statistics could be computed from the note sequences is given after this list.
(3) The exact nature of the transposition manipulation is obscured by a confusing Figure 1B, which shows an example in which the transposed line does not keep the same note-to-note interval structure as the original line.
(4) The transformer model is barely described in the main text. Even readers who are familiar with the variable-order Markov models (e.g., in IDyOM) previously used by some of the authors to model melodic surprise and entropy would benefit from at least a brief description in the main text of how transformer models differ. The Methods section goes a little further but does not mention what the training set was, nor the relative weight given to long- and short-term memory models.
(5) The match-mismatch procedure should be explained in enough detail for readers to at least understand what value represents chance performance and why performance would be measured as an average over participants. Relatedly, there is no description at all of CCA or the match-mismatch procedure in the Methods. A sketch of a generic match-vs-mismatch test, illustrating why chance is 50%, is given after this list.
(6) Details of how the integration model was implemented will be critical to interpreting the results relating to melodic expectations. It is not clear how "a single melody combining the two streams" was modelled, given that at least some notes presumably overlapped in time. One possible merging rule is sketched after this list to make the ambiguity concrete.
(7) The authors propose a weighted integration model, referring in the Discussion to dynamics and an integration rate. They do show that in the PolyOrig case, the top stream bias is highest and the monophonic model gives the best prediction, while in the PolyInv case, the top stream bias is weaker and the polyphonic model provides the best prediction. However, that doesn't seem to say anything about the temporal rate of integration, just the degree, which could be fixed over the whole stimulus. Relatedly, the terms "strong attention bias" and "weak attention bias" in Highlight 4 might give the impression of different attention modes for a given listener, or perhaps different types of listeners, but this seems to be shorthand for how attention is allocated for different types of stimuli (namely those that have or have not had their voices reversed).
(8) Another aspect of the presentation relating to temporal dynamics is that in places (e.g., Highlight 1), the authors suggest they are tracking attention dynamically. However, as acknowledged in the Discussion, neither the behavioural nor the neural measure of attentional bias is temporally resolved. The measures indicate that, on average, participants attend more to the higher line (less so when it formed the lower line in the original composition).
(9) It is not clear whether the sung-back data were analysed (and if not why participants were asked to sing the melody back rather than just listen to the two components and report which they thought was the melody). It is also not stated whether the order in which the high and low voices were played back was randomised. If not, response biases or memory capacity might have affected the behavioural attention data.
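Regarding point (2): the contour statistics quoted above could be computed along these lines from each voice's MIDI pitch sequence. The "leap" threshold, the toy pitch sequences, and the phrase boundaries below are illustrative assumptions, not values taken from the stimuli.

```python
import numpy as np

def melodic_stats(pitches, phrase_bounds):
    """Simple Huron-style contour statistics for one line (MIDI pitch numbers)."""
    p = np.asarray(pitches, dtype=float)
    intervals = np.diff(p)                        # note-to-note intervals in semitones
    sizes = np.abs(intervals)
    leaps = intervals[sizes > 2]                  # 'leap' = larger than a major second (assumption)
    phrases = [p[a:b] for a, b in phrase_bounds]  # phrase boundaries as (start, end) note indices
    return {
        "prop_small_intervals": float(np.mean(sizes <= 2)),
        "prop_leaps_ascending": float(np.mean(leaps > 0)) if leaps.size else np.nan,
        "prop_phrases_falling": float(np.mean([ph[-1] < ph[0] for ph in phrases])),
        "prop_phrases_initial_rise": float(np.mean([ph[1] > ph[0] for ph in phrases])),
    }

# Toy pitch sequences standing in for the original 'melody' and 'support' lines.
melody = [67, 69, 71, 72, 71, 69, 67, 64, 65, 67, 65, 64, 62]
support = [48, 55, 52, 48, 43, 48, 52, 55, 48, 41, 43, 45, 47]
bounds = [(0, 7), (7, 13)]
print("melody :", melodic_stats(melody, bounds))
print("support:", melodic_stats(support, bounds))
```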
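Regarding point (5): a generic two-alternative match-vs-mismatch test pairs each EEG segment with its time-aligned stimulus feature and with a feature segment drawn from elsewhere; each trial is a binary choice, so chance accuracy is 0.5. The sketch below uses simulated signals and is not the authors' CCA pipeline.

```python
import numpy as np

def match_mismatch_accuracy(eeg_proj, feature, seg_len, rng):
    """
    Two-alternative match-vs-mismatch test on equal-length segments.
    eeg_proj, feature: 1-D, time-aligned (e.g., after projection onto a CCA component).
    Chance accuracy is 0.5 because each segment yields a binary decision.
    """
    n_seg = len(feature) // seg_len
    correct = 0
    for i in range(n_seg):
        seg = slice(i * seg_len, (i + 1) * seg_len)
        j = (i + rng.integers(1, n_seg)) % n_seg           # a different, 'mismatched' segment
        mis = slice(j * seg_len, (j + 1) * seg_len)
        r_match = np.corrcoef(eeg_proj[seg], feature[seg])[0, 1]
        r_mismatch = np.corrcoef(eeg_proj[seg], feature[mis])[0, 1]
        correct += r_match > r_mismatch
    return correct / n_seg

rng = np.random.default_rng(3)
feature = rng.standard_normal(64 * 300)                        # e.g., a surprisal time course
eeg_proj = 0.3 * feature + rng.standard_normal(feature.size)   # weakly related 'EEG' component
print("accuracy:", match_mismatch_accuracy(eeg_proj, feature, seg_len=64 * 5, rng=rng))
```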
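Regarding point (6): one plausible, purely illustrative way to form "a single melody combining the two streams" is to interleave the voices' notes by onset and resolve simultaneous onsets with a fixed rule. Whether the authors did anything like this is exactly what needs to be specified; the sketch below is an assumption for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Note:
    onset: float      # seconds
    pitch: int        # MIDI number
    duration: float

def merge_streams(high, low, overlap_rule="keep_both"):
    """
    Build a single 'integrated melody' from two voices by interleaving notes by onset.
    With overlap_rule="collapse_simultaneous", only the first (higher) note of a chord is kept.
    This is one possible rule, not the authors' implementation.
    """
    merged = sorted(high + low, key=lambda n: (n.onset, -n.pitch))
    if overlap_rule == "collapse_simultaneous":
        out = []
        for n in merged:
            if out and abs(n.onset - out[-1].onset) < 1e-3:
                continue
            out.append(n)
        return out
    return merged

high = [Note(0.0, 72, 0.5), Note(0.5, 74, 0.5), Note(1.0, 76, 1.0)]
low = [Note(0.0, 60, 1.0), Note(1.0, 64, 1.0)]
print([n.pitch for n in merge_streams(high, low, "collapse_simultaneous")])
```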
Reviewer #3 (Public review):
Summary:
In this paper, Winchester and colleagues investigated melodic perception in natural music listening. They highlight the central role of attentional processes in identifying one particular stream in polyphonic material, and propose to compare several theoretical accounts, namely (1) divided attention, (2) figure-ground separation, and (3) stream integration. In parallel, the authors compare the relative strength of exogenous attentional effects (i.e., salience) produced by two common traits of melodies: high pitch (compared to other voices) and attractive statistics. To ensure the generalisability of their results to real-life listening contexts, they developed a new uninstructed listening paradigm in which participants can freely attend to any part of a musical stimulus.
Major strengths and weaknesses of the methods and results:
(1) Winchester and colleagues capitalized on previous attention decoding techniques and proposed an uninstructed listening paradigm. This is an important innovation for the study of music perception in ecological settings, and it is used here to investigate the spontaneous attentional focus during listening. The EEG decoding results obtained are consistent with the behavioral data, suggesting that the paradigm is robust and relevant.
(2) The authors first evaluate the relative importance of high-pitch and statistics in producing an attentional bias (Figure 2). Behavioral results show a clear pattern, in which both effects are present, with a dominance of the high-pitch one. The only weakness inherent to this protocol is that behavioral responses are measured based on a second presentation of short samples, which may induce a different attentional focus than in the first uninstructed listening.
(3) Then, the analyses of EEG data compare the decoding results of each melody (the high or low voice, and with "richer" or "poorer" statistics), and show a similar pattern of results. However, this result leaves open the possibility of a confounding factor. In this analysis, a TRF decoding model is first trained based on the presentation of monophonic samples, and it is later used to decode the envelope of the corresponding melodies in the polyphonic scenario. The fitting scores of the training phase are not reported. If the high-pitched or richer melodies were to produce higher decoding scores during monophonic listening (due to properties of the physiological response, or to perceptual processes), a similar difference could be expected during polyphonic listening. To capture attentional biases specifically, the decoding scores in the polyphonic conditions should be compared to the scores in the monophonic conditions, and attention could be expected to increase the decoding of the attended stream or decrease the unattended one. A sketch of such a baseline-corrected comparison is given after this list.
(4) Then, Winchester and colleagues investigate the processing of melodic information by evaluating the encoding of melodic surprise and uncertainty (Figure 3). They compare the surprise and uncertainty estimated from a monophonic or a polyphonic model (Anticipatory Music Transformer), and analyse the data with CCA. The results show a double dissociation, where the processing of melodies with a strong attentional bias (high pitch, rich statistics) is better approximated with a monophonic model, while a polyphonic model better classifies the other melodies. While this global result is compelling, it remains a preliminary and intriguing finding, and the manuscript does not further investigate it. As it stands, the result appears more like a starting point for further exploration than a definitive finding that can support strong theoretical claims. First, it could be complemented by a comparison of the encoding of individual melodies (e.g., AMmono high-voice vs AMmono low-voice, in PolyOrig and PolyInv conditions) to highlight a more direct correspondence with the previous results (Figure 2) and allow a more precise interpretation. Second, additional analyses or experiments would be needed to unpack this result and provide greater explanatory power. Additionally, the CCA analysis is not described in the Methods. The statistical testing conducted on this analysis seems to be performed across the 250 repetitions of the evaluation rather than across the 40 participants, which may bias the resulting p-values (a schematic participant-level alternative is sketched after this list). Moreover, the choice and working principle of the Anticipatory Music Transformer are not described in the Methods. Overall, these results seem solid at first glance, but the missing methodological details do not allow them to be fully evaluated or replicated.
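Regarding point (3): the baseline-corrected comparison could take a form like the following, where monophonic decoding scores serve as each voice's intrinsic decodability baseline. The correlation values below are simulated placeholders, not data from the study.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-participant reconstruction correlations (substitute real values):
#   r_mono_*  -> decoding of each voice when heard alone (baseline decodability)
#   r_poly_*  -> decoding of the same voice within the polyphonic mixture
rng = np.random.default_rng(4)
n_subj = 40
r_mono_high, r_mono_low = rng.normal(0.10, 0.03, n_subj), rng.normal(0.08, 0.03, n_subj)
r_poly_high, r_poly_low = rng.normal(0.09, 0.03, n_subj), rng.normal(0.05, 0.03, n_subj)

# Baseline-corrected attention bias: how much each stream loses (or gains)
# in the mixture relative to its own monophonic decodability.
bias = (r_poly_high - r_mono_high) - (r_poly_low - r_mono_low)
stat, p = wilcoxon(bias)
print(f"median baseline-corrected bias = {np.median(bias):.3f}, p = {p:.3g}")
```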
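Regarding the statistical concern in point (4): aggregating the match-mismatch accuracies over repetitions before testing across the 40 participants would avoid the potential bias. Schematically, with simulated accuracies:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(5)
# Hypothetical accuracies with shape (participants, repetitions) for each model.
acc_mono = rng.normal(0.58, 0.05, size=(40, 250))
acc_poly = rng.normal(0.55, 0.05, size=(40, 250))

# Average over repetitions first, so each participant contributes one value,
# then test the model difference across the 40 participants.
per_subj_mono = acc_mono.mean(axis=1)
per_subj_poly = acc_poly.mean(axis=1)
stat, p = wilcoxon(per_subj_mono, per_subj_poly)
print(f"participant-level p = {p:.3g} "
      f"(mono {per_subj_mono.mean():.3f} vs poly {per_subj_poly.mean():.3f})")
```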
An appraisal of whether the authors achieved their aims, and whether the results support their conclusions:
(1) Winchester and colleagues aimed to identify the melodic stream that attracts attention during listening to natural polyphonic music, and the underlying attentional processes. Their behavioral results confirm that both high pitch and attractive statistics increase melodic salience, with a greater effect size for the former, as stated in the discussion. The TRF analyses of EEG data seem to show a similar pattern, but could also be explained by confounding factors. Next, the authors interpret the CCA results as reflecting stream segregation when melodic salience is high, and stream integration when attentional biases are weaker. These interpretations seem to be supported by the data, but unfortunately, no additional analyses or experiments have been conducted to further evaluate this hypothesis. The authors also acknowledge that their results do not show whether stream segregation occurs via divided attention or figure-ground separation. However, the lack of information about the music model used (the Anticipatory Music Transformer) and the way it was set up raises some questions about its relevance and limits as a model of cognition (e.g., is this transformer a "better" model of the listeners' expectations than the well-established IDyOM model, and why?), and about the validity of those results.
(2) Overall, the authors achieved most of the aims presented in the introduction, although they couldn't give a more precise account of the attentional processes at stake. The interpretations are sound and not overstated, with the exception of potential confounding factors that could compromise the conclusions on the neural tracking of salient melodies (EEG results, Figure 2).
Impact of the work on the field, and the utility of the methods and data to the community:
The new uninstructed listening paradigm introduced in this paper will likely have an important impact on psychologists and neuroscientists working on music perception and auditory attention, enabling them to conduct experiments in more ecological settings. While the attentional biases towards melodies with high pitch and attractive statistics are already known, showing their relative effect is an important step in building precise models of auditory attention, and allows future paradigms to explore more fine-grained effects. Finally, the stream segregation and integration shown with this paradigm could be important for researchers working on music perception. Future work may be necessary to identify the models (Markov chains, deep learning) and setup (data analysis, stimuli, control variables) that do or do not replicate these results.