k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

Yifan Yang
Jianheng Zhuo
Zengrui Jin
Xie Chen


Abstract

Self-supervised learning (SSL) has achieved great success in speech-related tasks, driven by advancements in speech encoder architectures and the expansion of datasets. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR), remain unexplored in SSL. Concurrently, inefficiencies in data processing within existing SSL training frameworks, such as fairseq, pose challenges in managing the growing volumes of training data. To address these issues, we propose k2SSL, an open-source framework that offers faster, more memory-efficient, and better-performing self-supervised speech representation learning, with a focus on downstream ASR tasks. The optimized HuBERT and proposed Zipformer-based SSL systems exhibit substantial reductions in both training time and memory usage during SSL training. Experiments on LibriSpeech and Libri-Light demonstrate that Zipformer-based SSL systems significantly outperform comparable HuBERT and WavLM systems, achieving a relative word error rate (WER) reduction on dev-other/test-other of up to 34.8%/32.4% compared to HuBERT Base after supervised fine-tuning, along with a 3.5x pre-training speedup in total GPU hours.
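
The two headline figures in the abstract are ratios against a baseline, and a minimal sketch of the arithmetic may help when reading them. The helper names (`relative_reduction`, `speedup`) and every number in the example are hypothetical, chosen only for illustration; they are not taken from the paper or from the k2SSL code.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of `new` versus `baseline`, in percent.

    For WER, `baseline` would be the reference system's error rate and
    `new` the error rate of the system being compared against it.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


def speedup(baseline_hours: float, new_hours: float) -> float:
    """Speedup factor: how many times fewer GPU hours the new system needs."""
    if new_hours <= 0:
        raise ValueError("new_hours must be positive")
    return baseline_hours / new_hours


if __name__ == "__main__":
    # Hypothetical values for illustration only (not results from the paper).
    baseline_wer, new_wer = 10.0, 6.5          # WER in percent
    baseline_gpu_h, new_gpu_h = 700.0, 200.0   # total GPU hours

    print(f"relative WER reduction: {relative_reduction(baseline_wer, new_wer):.1f}%")
    print(f"pre-training speedup:   {speedup(baseline_gpu_h, new_gpu_h):.1f}x")
```

Note that a relative reduction differs from an absolute one: in the hypothetical case above the WER drops by 3.5 percentage points, which is 35% of the baseline.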

Version published to 10.32388/2c9tpu
Dec 2, 2024

Related articles

Self-Supervised Audio Representation Learning Model Based on Time-Frequency Decoupling and Masked Reconstruction
Authors: Jie Xu, Yuhao Dai, Zhifeng Wang
Latest version: Dec 31, 2025

Addressing Challenges in Multimodal Large Language Model Development
Authors: Feidlimid Shyama, Lucas Pereira, João Souza, Ana Costa
Latest version: Dec 22, 2025

SAREC: A Semantic-Aware Retrieval-Augmented Conformer for Multilingual Low-Resource Speech Recognition
Authors: B. Arukiran Reddy, S. Udaya Bhaskar, J. Sunil Kumar, P. Raghunadh
Latest version: Dec 29, 2025
