Learning visual-to-auditory sensory substitution reveals flexibility in image-to-sound mapping

Asa Kucinkas
Chrysa Retsa
Mark Wallace
Monica Gori
Micah M. Murray

Abstract

Visual-to-auditory sensory substitution devices (SSDs) translate images to sounds, enabling a degree of visual perceptual access for visually impaired and blind individuals. One SSD, The vOICe, converts images into auditory soundscapes using spectral-temporal mappings. Specifically, a pixel’s vertical position translates into pitch and its horizontal position into time. This mapping is, a priori, primarily based on technical considerations for preserving image content in human-audible sounds, without making claims about intuitiveness, although some literature also invokes crossmodal correspondences in perception, such as pitch for elevation. This presupposition remains to be empirically validated with human subjects. We therefore investigated the efficacy of learning this mapping versus an inverted mapping as well as a single-tone control mapping. Sixty sighted adult participants were randomly assigned to one of three groups (Traditional, Reversed, or Control) and completed brief learning and evaluation sessions using simplified black-and-white visual stimuli. Both the Traditional and Reversed groups learned their mappings within 30 minutes and demonstrated successful recognition of novel stimuli, outperforming the Control group, suggesting that structured mappings facilitate SSD learning. However, there was no evidence of differences in performance or processing time between the Traditional and Reversed groups. The mapping of pixel position onto spectral-temporal acoustic axes therefore appears flexible. On the one hand, these findings open new possibilities for how SSDs may be rendered bespoke to individual users, specific categories of stimuli, or functionalities (e.g., object recognition, reading, or navigation). On the other hand, indistinguishable performance with different algorithms demonstrates the flexibility in mapping between visual and auditory features, which does not appear anchored to specific notions of crossmodal correspondences.
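The image-to-sound mapping described in the abstract can be illustrated with a minimal Python sketch. It is not The vOICe's implementation or the authors' stimulus code: the function names, the 500 to 4000 Hz range, the logarithmic frequency spacing, and the 16 kHz sample rate are all assumptions chosen for illustration. The sketch also assumes that the "reversed" mapping inverts the pitch axis and that the "control" mapping gives every row the same tone; the abstract does not specify either detail.

```python
import math


def row_frequencies(n_rows, mapping="traditional", f_min=500.0, f_max=4000.0):
    """Return one frequency in Hz per image row; index 0 is the top row."""
    if mapping == "control":
        # Assumed single-tone control: every row shares one frequency,
        # so vertical position carries no pitch information.
        return [math.sqrt(f_min * f_max)] * n_rows
    if n_rows == 1:
        low_to_high = [f_min]
    else:
        # Assumed logarithmic spacing between f_min and f_max.
        ratio = f_max / f_min
        low_to_high = [f_min * ratio ** (i / (n_rows - 1)) for i in range(n_rows)]
    if mapping == "traditional":
        return low_to_high[::-1]  # top row -> highest pitch
    if mapping == "reversed":
        return low_to_high  # top row -> lowest pitch (assumed inversion)
    raise ValueError(f"unknown mapping: {mapping!r}")


def image_to_soundscape(image, mapping="traditional", duration=1.0, sample_rate=16000):
    """Render a binary image (list of rows, 1 = sounding pixel) as audio samples.

    Columns are played left to right, so horizontal position maps to time;
    each sounding pixel contributes a sine wave at its row's frequency.
    """
    n_rows, n_cols = len(image), len(image[0])
    freqs = row_frequencies(n_rows, mapping)
    samples_per_col = int(duration * sample_rate / n_cols)
    samples = []
    for col in range(n_cols):
        active = [freqs[row] for row in range(n_rows) if image[row][col]]
        for k in range(samples_per_col):
            t = (col * samples_per_col + k) / sample_rate
            value = sum(math.sin(2 * math.pi * f * t) for f in active)
            samples.append(value / n_rows)  # normalise to stay within [-1, 1]
    return samples


if __name__ == "__main__":
    # A 3x3 diagonal line running from top-left to bottom-right.
    diagonal = [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
    ]
    for name in ("traditional", "reversed", "control"):
        audio = image_to_soundscape(diagonal, mapping=name, duration=0.3)
        print(name, [round(f) for f in row_frequencies(3, name)], len(audio))
```

Under these assumptions, the same diagonal line is rendered as a falling sweep in the traditional mapping, a rising sweep in the reversed mapping, and a constant tone in the control mapping, which is why only the two structured mappings preserve vertical information.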

Version published to 10.31234/osf.io/6w278_v2 on OSF Preprints
Oct 31, 2025
Version published to 10.31234/osf.io/6w278_v1 on OSF Preprints
Sep 3, 2025