Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

Audio Samples

Resynthesis Samples (LibriTTS-R)

Ground Truth | Resynthesized

Resynthesis Samples (VCTK)

Ground Truth | Resynthesized

Resynthesis Samples (Multilingual)

Language | Ground Truth | Resynthesized (English-Only-Trained) | Resynthesized (Fine-Tuned)
German
Dutch
Portuguese
Italian
Polish
Spanish
French
Korean
Japanese
Chinese

Voice Conversion Samples

Source | Target | Converted




Controllability Demo 1: Interpolating Tongue Traces to Manipulate Place of Articulation

Interpolation Samples ("lock-rock")

Mixing Ratio | Transcription | Vocal Tract Visualization
100% lock + 0% rock | lock
80% lock + 20% rock | lock
60% lock + 40% rock | lock
40% lock + 60% rock | lock
20% lock + 80% rock | rock
0% lock + 100% rock | rock
-20% lock + 120% rock | rock
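
The mixing ratios in the table can be read as a linear blend of two time-aligned tongue traces, where a weight above 100% (or below 0%) extrapolates past one endpoint. The following is a minimal sketch of that arithmetic only, not the authors' implementation: the function `mix_traces`, the toy arrays, and the assumption that the two traces are already time-aligned and share a shape are all hypothetical.

```python
import numpy as np

def mix_traces(trace_a, trace_b, weight_b):
    """Linearly mix two traces: (1 - weight_b) * a + weight_b * b.

    A weight_b outside [0, 1] extrapolates beyond an endpoint,
    as in the "-20% lock + 120% rock" row.
    """
    trace_a = np.asarray(trace_a, dtype=float)
    trace_b = np.asarray(trace_b, dtype=float)
    if trace_a.shape != trace_b.shape:
        raise ValueError("traces must be time-aligned and share a shape")
    return (1.0 - weight_b) * trace_a + weight_b * trace_b

# Hypothetical toy data: 5 frames x 2 coordinates (x, y) of one tongue sensor.
lock_trace = np.zeros((5, 2))
rock_trace = np.ones((5, 2))

# Weight on "rock" for each row of the table, from 0% to 120%.
rock_weights = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
mixed = [mix_traces(lock_trace, rock_trace, w) for w in rock_weights]
```

In this sketch only the tongue channels would be mixed; every other articulatory channel would be left as in the source utterance before resynthesis.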




Controllability Demo 2: Translating "Loudness" Trace to Manipulate Voice Onset Time ("may-bay-pay")

Shift in Loudness (ms) | Transcription | Simulated Sample
-100 | may
-80 | may
-60 | may
-40 | may
-20 | may
0 | bay
20 | bay
40 | bay
60 | pay
80 | pay
100 | pay
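
Translating a trace in time, as in the table above, amounts to sliding it by a whole number of frames while leaving the other channels untouched. The sketch below illustrates this under stated assumptions and is not the authors' code: `shift_trace` is a hypothetical helper, the 20 ms frame period is assumed (the page does not state the frame rate), positive shifts are taken to mean a delay, and the edges are padded by repeating the boundary value.

```python
import numpy as np

def shift_trace(trace, shift_ms, frame_ms=20.0):
    """Shift a 1-D trace in time, keeping its length.

    Positive shift_ms delays the trace (assumed sign convention).
    frame_ms is an assumed frame period, not a value from the page.
    Vacated frames are filled by repeating the nearest boundary value.
    """
    trace = np.asarray(trace, dtype=float)
    n = int(round(shift_ms / frame_ms))
    n = max(-len(trace), min(len(trace), n))  # clamp to the trace length
    if n > 0:
        return np.concatenate([np.full(n, trace[0]), trace[:len(trace) - n]])
    if n < 0:
        return np.concatenate([trace[-n:], np.full(-n, trace[-1])])
    return trace.copy()

# Hypothetical toy loudness trace: silence, then a voiced segment.
loudness = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Same shifts as the table: -100 ms to +100 ms in 20 ms steps.
shifted = {s: shift_trace(loudness, s) for s in range(-100, 101, 20)}
```

Under the assumed 20 ms frame period, each row of the table corresponds to moving the loudness onset by exactly one frame relative to the other articulators.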