VLOGGER: Google's new AI brings people to life in photos

Google has introduced a system for generating videos of talking people.

Google introduced VLOGGER, an AI model that generates videos of talking people from a single image of a person and an audio file.

VLOGGER is built on generative diffusion models, which sets it apart from earlier approaches. It does not require training for each individual person and works without detecting and cropping faces: it generates complete images, including the face and torso, in a variety of scenarios.

The VLOGGER system works in two stages:
  • the first stage takes an audio signal as input and produces intermediate body motion controls, which are responsible for gaze, facial expressions, and posture;
  • the second stage is a temporal image-to-image translation model that takes the predicted body controls and generates the corresponding frames. To tie the output to a specific person, VLOGGER also uses a reference image of that person.
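To make the two-stage structure concrete, here is a toy Python sketch. It is not Google's implementation: the functions `audio_to_controls` and `render_frames`, the `BodyControls` fields, and the amplitude-based "motion" are hypothetical placeholders for the two learned diffusion networks. It only illustrates the data flow: audio becomes per-frame controls, and every frame is conditioned on the same reference image so the identity stays fixed while the motion follows the audio.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BodyControls:
    # Hypothetical per-frame control signals (gaze, expression, posture).
    # VLOGGER's real control representation is not reproduced here.
    gaze: float
    expression: float
    pose: float


@dataclass
class Frame:
    identity: str           # stands in for the appearance taken from the reference image
    controls: BodyControls  # the motion this frame should show


def audio_to_controls(audio: Sequence[float], samples_per_frame: int) -> List[BodyControls]:
    """Stage 1 placeholder: map each audio window to one frame's controls.

    A toy stand-in for the learned audio-to-motion network: it uses the mean
    absolute amplitude of each window. Trailing samples that do not fill a
    whole window are dropped.
    """
    controls = []
    for start in range(0, len(audio) - samples_per_frame + 1, samples_per_frame):
        window = audio[start:start + samples_per_frame]
        energy = sum(abs(x) for x in window) / samples_per_frame
        controls.append(BodyControls(gaze=0.0, expression=energy, pose=0.5 * energy))
    return controls


def render_frames(reference_image: str, controls: List[BodyControls]) -> List[Frame]:
    """Stage 2 placeholder: condition every frame on the same reference image,
    so identity stays constant while motion follows the stage-1 controls."""
    return [Frame(identity=reference_image, controls=c) for c in controls]


def generate_video(reference_image: str, audio: Sequence[float],
                   samples_per_frame: int = 4) -> List[Frame]:
    # Chain the two hypothetical stages: audio -> controls -> frames.
    return render_frames(reference_image, audio_to_controls(audio, samples_per_frame))
```

For example, `generate_video("person.png", [0.0, 0.5, -0.5, 1.0, 0.2, -0.2, 0.2, -0.2])` yields two frames that share the identity `"person.png"` but carry different expression values. The point of the split is that the second stage never sees the audio directly; it only needs the motion controls and the reference image.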

Special attention is paid to the diversity and realism of the generated videos. VLOGGER can create videos with a lot of movement and a high level of detail while preserving the person's identity and temporal consistency. The model was trained on MENTOR, a new large-scale dataset that includes 2,200 hours of video and 800,000 identities, roughly ten times more than previous datasets.

VLOGGER has several applications, including video editing and generating videos of talking people from a single input image and an audio track. The model can edit existing videos by changing the subject's facial expression, for example closing their mouth or eyes, or adapt a video to a new audio track in a different language so that the lip and facial movements match the new audio.
 