New method allows you to change facial expressions in deepfake videos


Wav2Lip-Emotion copies emotion-related facial expressions from one part of a video and substitutes them at other points.


Researchers have developed a new machine learning technique that can arbitrarily change the emotional expression of faces in video. It adapts recently emerging technology for synchronizing lip movements with foreign-language dubbing.

The study, Invertable Frowns: Video-to-Video Facial Emotion Translation, is a collaboration between Northeastern University in Boston and MIT's Media Lab. The researchers acknowledge that the initial quality of the results needs to be improved in further research, but they argue that their Wav2Lip-Emotion method is the first of its kind to directly change facial expressions in video using a neural network.

The project's codebase has been published on GitHub, and the researchers have promised to add the model checkpoints to the open source repository at a later date.

In theory, such manipulations are already possible by fully training models with traditional deepfake repositories such as DeepFaceLab and FaceSwap. However, the standard workflow involves a second identity in addition to the target: for example, an actor impersonates the target person, and the actor's facial expressions, along with the rest of the performance, are transferred onto the target's face. Deepfake voice cloning would also be required to make the video credible.

Moreover, altering a person's facial expression within a single original video using these popular repositories would require changing the face alignment vectors in ways that their architectures do not currently facilitate.

Wav2Lip-Emotion effectively copies emotion-related facial expressions from one part of a video and substitutes them at other points while preserving the original data, which ultimately provides a simple and convenient method for manipulating facial expressions.

Offline models could later be developed that are trained on other footage of the speaker, eliminating the need for any single video to contain the entire palette of facial expressions.