Fine-Tuning Whisper for American English Air Traffic Control Speech Recognition: A Data-Efficient Pipeline
Abstract
Automatic speech recognition (ASR) for air traffic control (ATC) presents a persistent domain adaptation challenge: VHF radio channel degradation, specialized vocabulary, and a near-total absence of American English training corpora in the published literature. The state-of-the-art open ATC ASR model—WhisperATC by van Doorn et al. [20], fine-tuned on European ATCO2 and ATCOSIM corpora—achieves 3.88% word error rate (WER) on ATCOSIM (speaker-split evaluation) but degrades to 30.3% on American ATC transmissions, exposing a systematic accent and phraseology mismatch. We present a data-efficient fine-tuning pipeline that adapts Whisper Large v3 [17] to American English ATC using only 55 manually transcribed clips recorded from three major US airports (KIAH, KJFK, KSFO) via LiveATC.net. Domain-matched audio preprocessing—a 300–3400 Hz Butterworth bandpass filter with EBUR128 loudness normalization—combined with five-fold stochastic data augmentation addresses the limited corpus size. Full fine-tuning with conservative hyperparameters achieves 13.7% WER, a 54.8% relative reduction from the European-trained baseline, using 370× fewer training clips than the most comparable prior study. A secondary contribution is the characterization of a structural incompatibility between the HuggingFace PEFT LoRA implementation and Whisper’s log-mel spectrogram encoder that prevents parameter-efficient fine-tuning without modification of library internals. All code, the fine-tuned model, and training notebooks are publicly available.
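The domain-matched preprocessing described above (a 300–3400 Hz Butterworth bandpass to match the VHF voice channel, followed by loudness normalization) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name `atc_preprocess` and the parameter choices other than the 300–3400 Hz band are assumptions, and a simple RMS normalization stands in for the EBU R128 loudness normalization used in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def atc_preprocess(audio, fs=16000, low=300.0, high=3400.0, target_rms=0.1):
    """Band-limit audio to the VHF voice band and normalize its level.

    A 4th-order Butterworth bandpass (300-3400 Hz) approximates the ATC
    radio channel. The RMS normalization here is a simple stand-in for
    the EBU R128 loudness normalization described in the abstract.
    """
    # Design the bandpass in second-order sections for numerical stability.
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    # Zero-phase filtering avoids introducing group delay into the clip.
    filtered = sosfiltfilt(sos, audio)
    # Scale to a fixed RMS level (stand-in for loudness normalization).
    rms = np.sqrt(np.mean(filtered ** 2))
    if rms > 0:
        filtered = filtered * (target_rms / rms)
    return filtered

# Demo: a 100 Hz hum (out of band) plus a 1 kHz tone (in the voice band).
fs = 16000
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
out = atc_preprocess(audio, fs)

# With 1 Hz FFT resolution, bin k corresponds to k Hz.
spec = np.abs(np.fft.rfft(out))
print(spec[100] < 0.01 * spec[1000])  # out-of-band hum strongly attenuated
```

Resampling LiveATC recordings to Whisper's expected 16 kHz before this step would keep the passband edges well below the Nyquist frequency.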