Comparison of HiFiGAN vocoder trained on different feature encoders
| Method | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
|---|---|---|---|---|
| Ground Truth | | | | |
| Whisper encoder | | | | |
| Whisper encoder (WO_APE) | | | | |
| Whisper encoder (WO_GELU) | | | | |
| Whisper encoder (WO_GELU_APE) | | | | |
| HuBERT | | | | |
| WavLM | | | | |

t-SNE visualization of final-layer embeddings from four encoders on LibriSpeech test-clean
Experiment Setup: We extract final-layer embeddings from four encoders on LibriSpeech test-clean, using 5 utterances per speaker, and project them to 2D with t-SNE. Points are colored by ground-truth speaker identity. The Adjusted Rand Index (ARI) measures clustering quality by comparing k-means clusters (k=40) computed on the same embeddings against the speaker labels.
Results: Our Simplified Whisper achieves the highest ARI (0.533) among the four encoders, indicating the clearest speaker separation and the most speaker-discriminative embeddings in this comparison.
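
For readers unfamiliar with the metric, the following is a minimal sketch of how ARI can be computed from ground-truth speaker labels and cluster assignments. It is not the evaluation code used for the numbers above: the function name `adjusted_rand_index` and the toy labels are illustrative assumptions, and the k-means step that produces the cluster assignments is omitted. In practice, `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration only; not the paper's code.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")

    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as trivially perfect

    # Contingency counts: items sharing both a speaker and a cluster.
    joint = Counter(zip(true_labels, cluster_labels))
    by_speaker = Counter(true_labels)
    by_cluster = Counter(cluster_labels)

    # Number of item pairs grouped together under each view.
    pairs_joint = sum(comb(c, 2) for c in joint.values())
    pairs_speaker = sum(comb(c, 2) for c in by_speaker.values())
    pairs_cluster = sum(comb(c, 2) for c in by_cluster.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = pairs_speaker * pairs_cluster / total_pairs
    maximum = 0.5 * (pairs_speaker + pairs_cluster)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything together

    return (pairs_joint - expected) / (maximum - expected)


# Toy example (made-up labels, not LibriSpeech data).
speakers = [0, 0, 1, 1]
perfect = adjusted_rand_index(speakers, [1, 1, 0, 0])  # same grouping, renamed clusters
mixed = adjusted_rand_index(speakers, [0, 1, 0, 1])    # every cluster splits both speakers
print(perfect, mixed)
```

An ARI of 1 means the clusters match the speaker labels exactly up to renaming, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance. Because the score is corrected for chance, it can be compared across encoders that are clustered with the same k.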
Comparison of different speech codecs on reconstruction quality
| Method | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
|---|---|---|---|---|
| Ground Truth | | | | |
| SimWhisper-Codec (Ours) | | | | |
| Xcodec2.0 | | | | |
| SpeechTokenizer | | | | |
| Mimi | | | | |
| DAC | | | | |
| EnCodec | | | | |