Neural Voice Puppetry:
Audio-driven Facial Reenactment

Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, Matthias Nießner




We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability, while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor, or even synthetic voices that can be generated utilizing standard text-to-speech approaches. Neural Voice Puppetry has a variety of use cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples. Our method is not only more general than existing works, since it is agnostic to the input person, but it also shows superior visual and lip-sync quality compared to photo-realistic audio- and video-driven reenactment techniques.

Video | Paper | Project Page







Please note that the demo below is intended for academic purposes only!

The authors are not responsible for the content!

Due to misuse, synthesis requests are disabled at the moment!









Audio Generator:

Please note that audio generation is not part of NeuralVoicePuppetry. Our goal is to synthesize the video for a given audio stream; we therefore rely on state-of-the-art methods for audio synthesis from text inputs. Specifically, the demo is driven by audio generated by Tacotron 2 and Real-Time-Voice-Cloning. Note that these methods perform best on medium-length sentences (around 10 words); very short inputs such as "Hello!" lead to strong artefacts in the audio synthesis step. Please also note that the text-to-speech methods are trained on English audio only.

When requesting a new video, you can choose between the voice generators using the green "Target Voice" select button. We provide different voices for both techniques. In the case of Tacotron 2, we use the pretrained model (female voice) and fine-tuned models (with a fixed encoder). The voices generated by Real-Time-Voice-Cloning are all based on a pretrained model.

We use the following publicly available implementations of the text-to-speech approaches (a minimal usage sketch follows the list):
 • https://github.com/NVIDIA/tacotron2
 • https://github.com/CorentinJ/Real-Time-Voice-Cloning
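
For reference, the following is a minimal voice-cloning sketch along the lines of the demo scripts in the Real-Time-Voice-Cloning repository (speaker encoder, synthesizer, vocoder). It is illustrative only: the checkpoint paths and the reference recording are placeholders, and the module layout may differ between repository versions.

    # Voice-cloning sketch following Real-Time-Voice-Cloning's demo scripts.
    # Checkpoint paths and the reference recording are placeholders.
    from pathlib import Path

    import soundfile as sf

    from encoder import inference as encoder
    from synthesizer.inference import Synthesizer
    from vocoder import inference as vocoder

    # Load the three pretrained stages: speaker encoder, Tacotron-based
    # synthesizer, and neural vocoder.
    encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
    synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained.pt"))
    vocoder.load_model(Path("vocoder/saved_models/pretrained.pt"))

    # Embed a short reference recording of the target voice.
    wav = encoder.preprocess_wav(Path("reference_voice.wav"))
    embed = encoder.embed_utterance(wav)

    # Predict a mel spectrogram for the text, conditioned on the embedding.
    text = "This is a medium-length sentence of roughly ten words."
    specs = synthesizer.synthesize_spectrograms([text], [embed])

    # Decode the spectrogram to a waveform and save it as the audio input
    # for the video synthesis stage.
    generated_wav = vocoder.infer_waveform(specs[0])
    sf.write("generated.wav", generated_wav.astype("float32"),
             synthesizer.sample_rate)

The resulting waveform is the audio stream that NeuralVoicePuppetry then turns into a talking-head video.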


For our online demo, we recommend the female Tacotron 2 voice (the default target voice), which performed best in our experiments.
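
For completeness, the Tacotron 2 side of the demo roughly follows the inference notebook of the NVIDIA repository. The sketch below is illustrative only; the checkpoint file names are placeholders, and a WaveGlow vocoder is assumed for waveform generation.

    # Tacotron 2 + WaveGlow inference sketch following the NVIDIA tacotron2
    # repository's inference notebook (checkpoint names are placeholders).
    import numpy as np
    import torch

    from hparams import create_hparams
    from train import load_model
    from text import text_to_sequence

    # Load a Tacotron 2 checkpoint, e.g. the pretrained female voice or
    # one of the fine-tuned models.
    hparams = create_hparams()
    model = load_model(hparams)
    model.load_state_dict(torch.load("tacotron2_statedict.pt")["state_dict"])
    model.cuda().eval()

    # Load the WaveGlow vocoder that converts mel spectrograms to audio.
    waveglow = torch.load("waveglow_256channels.pt")["model"]
    waveglow.cuda().eval()

    # Encode the input text as a character sequence and synthesize.
    text = "A medium-length sentence of roughly ten words works best."
    sequence = np.array(text_to_sequence(text, ["english_cleaners"]))[None, :]
    sequence = torch.from_numpy(sequence).cuda().long()

    with torch.no_grad():
        _, mel_postnet, _, _ = model.inference(sequence)
        audio = waveglow.infer(mel_postnet, sigma=0.666)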




Note:

The generated videos are based on user input. The authors are not responsible for the content. We have implemented profanity checks and will regularly screen the created content. The input text of each user, as well as the selected voice model, is stored on our server in order to provide the generated videos, including the text, to other users. In case of misuse, we will shut down the online demo.

Feel free to contact the authors via the links below the title. We are happy to answer questions about the method and the demo, as well as to receive notes about any misuse of the demo.