Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

Audio Samples

Resynthesis Samples (LibriTTS-R)

Ground Truth	Resynthesized

Resynthesis Samples (VCTK)

Ground Truth	Resynthesized

Resynthesis Samples (Multilingual)

Language	Ground Truth	Resynthesized (English-Only-Trained)	Resynthesized (Fine-Tuned)
German
Dutch
Portuguese
Italian
Polish
Spanish
French
Korean
Japanese
Chinese

Voice Conversion Samples

Source	Target	Converted

Controllability Demo 1: Interpolating Tongue Traces to Manipulate Place of Articulation ("lock-rock")

Mixing Ratio	Transcription	Vocal Tract Visualization
100% lock + 0% rock	lock
80% lock + 20% rock	lock
60% lock + 40% rock	lock
40% lock + 60% rock	lock
20% lock + 80% rock	rock
0% lock + 100% rock	rock
-20% lock + 120% rock	rock
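
The mixing ratios above read as a linear blend of two articulatory trajectories, with the last row extrapolating past "rock". The following is a minimal sketch of that arithmetic, not the authors' implementation. It assumes both utterances are already time-aligned arrays of shape (frames, channels). The function name `mix_traces`, the `channels` argument for restricting the blend to the tongue channels, and the channel layout are all hypothetical.

```python
import numpy as np

def mix_traces(trace_a, trace_b, weight_b, channels=None):
    """Linearly mix two time-aligned articulatory traces.

    trace_a, trace_b: arrays of shape (frames, channels); assumed layout.
    weight_b: weight on trace_b; trace_a gets 1 - weight_b.
              Values outside [0, 1] extrapolate (e.g. 1.2 -> -20% a + 120% b).
    channels: optional list of column indices to mix (hypothetical way of
              selecting tongue channels); other columns are copied from trace_a.
    """
    a = np.asarray(trace_a, dtype=float)
    b = np.asarray(trace_b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("traces must be 2-D, time-aligned, and the same shape")
    mixed = a.copy()
    idx = slice(None) if channels is None else list(channels)
    mixed[:, idx] = (1.0 - weight_b) * a[:, idx] + weight_b * b[:, idx]
    return mixed

# Weights on "rock" corresponding to the seven rows of the table.
ROCK_WEIGHTS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
```

Each mixed trace would then be passed to the synthesizer to produce one row of the table. Alignment of the two source utterances is taken for granted here and would need to be handled separately.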

Controllability Demo 2: Translating "Loudness" Trace to Manipulate Voice Onset Time ("may-bay-pay")

Shift in Loudness (ms)	Transcription	Simulated Sample
-100	may
-80	may
-60	may
-40	may
-20	may
0	bay
20	bay
40	bay
60	pay
80	pay
100	pay
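
The shifts above amount to sliding the loudness trace in time while the other articulatory channels stay fixed. The following is a minimal sketch of such a shift, not the authors' implementation. It assumes a 1-D loudness trace, a frame rate of 50 Hz (a placeholder; substitute the actual feature rate), that positive shifts delay the trace, and that the vacated frames are filled by holding the edge value. The name `shift_trace` is hypothetical.

```python
import numpy as np

def shift_trace(trace, shift_ms, frame_rate_hz=50.0):
    """Translate a 1-D trace in time by shift_ms milliseconds.

    Positive shift_ms delays the trace; negative advances it (assumed sign
    convention). Vacated frames are filled by repeating the edge value.
    frame_rate_hz is an assumed default, not taken from the source.
    """
    x = np.asarray(trace, dtype=float)
    if x.ndim != 1 or x.size == 0:
        raise ValueError("trace must be a non-empty 1-D array")
    n = int(round(shift_ms * frame_rate_hz / 1000.0))
    n = max(-x.size, min(x.size, n))  # clamp to the trace length
    if n == 0:
        return x.copy()
    if n > 0:
        # Delay: pad the start with the first value, drop the last n frames.
        return np.concatenate([np.full(n, x[0]), x[:-n]])
    # Advance: drop the first |n| frames, pad the end with the last value.
    return np.concatenate([x[-n:], np.full(-n, x[-1])])

# Shifts (ms) corresponding to the eleven rows of the table.
SHIFTS_MS = list(range(-100, 101, 20))
```

At the assumed 50 Hz rate, each 20 ms step in the table corresponds to exactly one frame. A different feature rate would need interpolation for shifts that do not land on whole frames.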