Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech
Audio Samples
Resynthesis Samples (LibriTTS-R)
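| Ground Truth | Resynthesized |
|---|---|
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |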
Resynthesis Samples (VCTK)
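| Ground Truth | Resynthesized |
|---|---|
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |
| [audio] | [audio] |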
Resynthesis Samples (Multilingual)
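| Language | Ground Truth | Resynthesized (English-Only-Trained) | Resynthesized (Fine-Tuned) |
|---|---|---|---|
| German | [audio] | [audio] | [audio] |
| Dutch | [audio] | [audio] | [audio] |
| Portuguese | [audio] | [audio] | [audio] |
| Italian | [audio] | [audio] | [audio] |
| Polish | [audio] | [audio] | [audio] |
| Spanish | [audio] | [audio] | [audio] |
| French | [audio] | [audio] | [audio] |
| Korean | [audio] | [audio] | [audio] |
| Japanese | [audio] | [audio] | [audio] |
| Chinese | [audio] | [audio] | [audio] |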
Controllability Demo 1: Interpolating Tongue Traces to Manipulate Place of Articulation
Interpolation Samples ("lock-rock")

| Mixing Ratio | Transcription | Vocal Tract Visualization |
|---|---|---|
| 100% lock + 0% rock | lock | [visualization] |
| 80% lock + 20% rock | lock | [visualization] |
| 60% lock + 40% rock | lock | [visualization] |
| 40% lock + 60% rock | lock | [visualization] |
| 20% lock + 80% rock | rock | [visualization] |
| 0% lock + 100% rock | rock | [visualization] |
| -20% lock + 120% rock | rock | [visualization] |
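The mixing ratios above can be read as a linear blend of two articulatory trajectories, with the last row extrapolating past the "rock" endpoint. The following is only a minimal sketch of that idea, not the authors' implementation. It assumes the two utterances are already time-aligned and stored as arrays of shape (frames, channels), and that non-tongue channels are kept from "lock". The function name `mix_traces`, the array layout, and the `tongue_channels` indices are hypothetical.

```python
import numpy as np

def mix_traces(lock, rock, rock_weight, tongue_channels):
    """Blend the tongue channels of two time-aligned articulatory trajectories.

    lock, rock: arrays of shape (frames, channels); assumed to share a shape.
    rock_weight: 0.0 gives pure "lock", 1.0 gives pure "rock"; values outside
        [0, 1] extrapolate (1.2 corresponds to "-20% lock + 120% rock").
    tongue_channels: hypothetical indices of the channels to interpolate.
    """
    lock = np.asarray(lock, dtype=float)
    rock = np.asarray(rock, dtype=float)
    if lock.shape != rock.shape:
        raise ValueError("trajectories must be time-aligned with equal shapes")
    # Assumption: every channel that is not mixed is taken from "lock".
    mixed = lock.copy()
    mixed[:, tongue_channels] = (
        (1.0 - rock_weight) * lock[:, tongue_channels]
        + rock_weight * rock[:, tongue_channels]
    )
    return mixed
```

Under these assumptions, sweeping `rock_weight` over 0.0, 0.2, ..., 1.2 would produce one trajectory per table row, each of which would then be passed to the synthesizer.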
Controllability Demo 2: Translating "Loudness" Trace to Manipulate Voice Onset Time ("may-bay-pay")
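| Shift in Loudness (ms) | Transcription | Simulated Sample |
|---|---|---|
| -100 | may | [audio] |
| -80 | may | [audio] |
| -60 | may | [audio] |
| -40 | may | [audio] |
| -20 | may | [audio] |
| 0 | bay | [audio] |
| 20 | bay | [audio] |
| 40 | bay | [audio] |
| 60 | pay | [audio] |
| 80 | pay | [audio] |
| 100 | pay | [audio] |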
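The shifts in this demo can be pictured as sliding the loudness trace in time while the other articulatory channels stay fixed. The following is a minimal sketch under stated assumptions, not the authors' code. It assumes the loudness trace is a 1-D array sampled at a fixed frame rate, that a positive shift delays the trace (consistent with the longer voice onset time of "pay"), and that vacated frames are filled with the nearest edge value. The function name `shift_trace` and the `frame_rate_hz` parameter are hypothetical.

```python
import numpy as np

def shift_trace(trace, shift_ms, frame_rate_hz):
    """Shift a 1-D trace in time, padding with the nearest edge value.

    shift_ms > 0 delays the trace; shift_ms < 0 advances it (assumed sign
    convention). The shift is rounded to a whole number of frames.
    """
    trace = np.asarray(trace, dtype=float)
    n = int(round(shift_ms * frame_rate_hz / 1000.0))
    if n == 0:
        return trace.copy()
    if abs(n) >= len(trace):
        # The whole trace is shifted out; only the edge value remains.
        return np.full_like(trace, trace[0] if n > 0 else trace[-1])
    shifted = np.empty_like(trace)
    if n > 0:
        shifted[:n] = trace[0]      # pad the start
        shifted[n:] = trace[:-n]    # delayed copy
    else:
        shifted[n:] = trace[-1]     # pad the end
        shifted[:n] = trace[-n:]    # advanced copy
    return shifted
```

With a hypothetical frame rate of 50 Hz, for example, `shift_trace(loudness, 40, 50)` would delay the trace by two frames. The table rows would correspond to calls with `shift_ms` from -100 to 100 in steps of 20.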