Learning visual-to-auditory sensory substitution reveals flexibility in image-to-sound mapping
Abstract
Visual-to-auditory sensory substitution devices (SSDs) translate images to sounds, enabling a degree of visual perceptual access for visually impaired and blind individuals. One SSD, The vOICe, converts images into auditory soundscapes using spectral-temporal mappings. Specifically, a pixel’s vertical position translates into pitch and its horizontal position into time. This mapping is, a priori, based primarily on technical considerations for preserving image content in human-audible sounds, without making claims about intuitiveness, although some literature also invokes crossmodal correspondences in perception, such as pitch for elevation. This presupposition remains to be empirically validated with human subjects. We therefore investigated the efficacy of learning this mapping versus an inverted mapping as well as a single-tone control mapping. Sixty sighted adult participants were randomly assigned to one of three groups (Traditional, Reversed, or Control) and completed brief learning and evaluation sessions using simplified black-and-white visual stimuli. Both the Traditional and Reversed groups learned their mappings within 30 minutes and demonstrated successful recognition of novel stimuli, outperforming the Control group, suggesting that structured mappings facilitate SSD learning. However, there was no evidence of differences in performance or processing time between the Traditional and Reversed groups. Mapping pixel position onto spectral-temporal acoustic axes therefore appears flexible. On the one hand, these findings open new possibilities in how SSDs may be rendered bespoke to individual users, specific categories of stimuli, or functionalities (e.g., object recognition, reading, or navigation). On the other hand, indistinguishable performance with different algorithms demonstrates the flexibility in mapping between visual and auditory features, which does not appear anchored to specific notions of crossmodal correspondences.
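The spectral-temporal mapping described in the abstract (vertical position to pitch, horizontal position to time) can be illustrated with a minimal Python sketch. This is not The vOICe's implementation or the study's stimulus code: the function name `image_to_soundscape`, the frequency range, the sampling rate, the scan duration, and the log-spaced frequency scale are all assumptions chosen for illustration. In particular, the abstract does not specify how the single-tone control was built; here it is assumed that every row is rendered at the same frequency, so vertical position is not encoded in pitch.

```python
import numpy as np

def image_to_soundscape(image, mapping="traditional", duration=1.0,
                        sample_rate=8000, f_low=500.0, f_high=3000.0,
                        control_freq=1000.0):
    """Render a 2-D brightness array as a mono soundscape (illustrative only).

    Columns are scanned left to right over `duration` seconds; each row
    contributes a sinusoid whose amplitude is the pixel brightness.
    All numeric defaults are assumptions, not values from the study.
    """
    image = np.asarray(image, dtype=float)
    n_rows, n_cols = image.shape

    # One frequency per row, log-spaced from low to high (assumed scale).
    freqs = np.geomspace(f_low, f_high, n_rows)
    if mapping == "traditional":
        row_freqs = freqs[::-1]   # top row (index 0) gets the highest pitch
    elif mapping == "reversed":
        row_freqs = freqs         # top row gets the lowest pitch
    elif mapping == "control":
        # Assumed control: a single tone, so row position carries no pitch cue.
        row_freqs = np.full(n_rows, control_freq)
    else:
        raise ValueError(f"unknown mapping: {mapping!r}")

    samples_per_col = int(round(duration * sample_rate / n_cols))
    t = np.arange(samples_per_col) / sample_rate
    # Sinusoid for every row: shape (n_rows, samples_per_col).
    tones = np.sin(2 * np.pi * row_freqs[:, None] * t[None, :])

    # Each column becomes a time slice: brightness-weighted sum of row tones.
    slices = [image[:, col] @ tones for col in range(n_cols)]
    signal = np.concatenate(slices)

    # Normalize to [-1, 1] unless the image is entirely black.
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal
```

For example, an image with a single bright pixel in its top-left corner yields a high tone at the start of the scan under the "traditional" mapping and a low tone at the same moment under the "reversed" mapping, which is the contrast the Traditional and Reversed groups were trained on.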