SimWhisper-Codec Demo

Speaking Clearly: A Simplified Whisper-based Codec for Low-Bitrate Speech Coding

Abstract

Speech codecs serve as bridges between continuous speech signals and large language models, yet they face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically supervised codecs such as Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of our semantic-first approach.

HiFiGAN Reconstruction Samples

Comparison of HiFiGAN vocoders trained on features from different encoders

Methods compared (four audio samples each, Sample 1 through Sample 4):

Ground Truth

Whisper encoder

Whisper encoder (WO_APE)

Whisper encoder (WO_GELU)

Whisper encoder (WO_GELU_APE)

HuBERT

WavLM

Speaker Clustering Experiments

t-SNE visualization of final-layer embeddings from four encoders on LibriSpeech test-clean

Experiment Setup: For each of the four encoders, final-layer embeddings are extracted on LibriSpeech test-clean with 5 utterances per speaker and projected with t-SNE. Points are colored by ground-truth speaker identity. The Adjusted Rand Index (ARI) measures clustering quality by comparing k-means cluster assignments (k=40) on the same embeddings against the ground-truth speaker labels.

Results: Our Simplified Whisper achieves the highest ARI score (0.533), indicating the clearest speaker separation and the most speaker-discriminative representations among the four encoders compared.
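
To make the ARI metric in the setup above concrete, here is a minimal sketch of how an Adjusted Rand Index can be computed from ground-truth speaker labels and cluster assignments. It uses only the Python standard library and is not the authors' evaluation code. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions. The k-means step that would produce the cluster assignments is not shown.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration: labels_true would be the
    speaker identities, labels_pred the k-means cluster assignments.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: how many items share (speaker, cluster).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Count item pairs that fall together in each partition.
    pairs_both = sum(comb(c, 2) for c in cells.values())
    pairs_true = sum(comb(c, 2) for c in rows.values())
    pairs_pred = sum(comb(c, 2) for c in cols.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (pairs_both - expected) / (maximum - expected)


# Toy example (made-up labels): two speakers, two utterances each.
speakers = ["spk_a", "spk_a", "spk_b", "spk_b"]
perfect_clusters = [1, 1, 0, 0]   # cluster ids differ, grouping matches
mixed_clusters = [0, 1, 0, 1]     # every cluster mixes both speakers
print(adjusted_rand_index(speakers, perfect_clusters))
print(adjusted_rand_index(speakers, mixed_clusters))
```

An ARI of 1 means the clusters match the speaker labels exactly (up to renaming cluster ids), values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance. In practice, `sklearn.metrics.adjusted_rand_score` from scikit-learn computes the same quantity.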