Speech to speech: creating a neural network that fakes a voice
In this article:
- Generating voice
- Text to speech
- Sounds to speech
- Speech to speech
- Making a fake voice
- How the voice simulator works
- Method testing
- Conclusions
Generating voice
The human voice is produced by the movement of the vocal cords, tongue, and lips. A computer has only numbers representing the wave recorded by a microphone. So how does a computer create sound that we can hear from speakers or headphones?
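To make this concrete, here is a minimal sketch (plain Python standard library, no audio hardware needed) of how a computer stores a sound: as a list of amplitude numbers sampled thousands of times per second, quantized to integers, and written to a WAV file. The file name and tone are arbitrary choices for the example.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000   # samples per second
DURATION = 0.5         # seconds
FREQ = 440.0           # Hz, the note A4

# A digital "voice" is just a list of numbers: the wave's amplitude,
# measured SAMPLE_RATE times per second.
samples = [
    math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE)
    for n in range(int(SAMPLE_RATE * DURATION))
]

# Quantize each amplitude to a 16-bit integer and save as a WAV file
# that any player can turn back into audible sound.
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)           # mono
    f.setsampwidth(2)           # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(b"".join(
        struct.pack("<h", int(s * 32767)) for s in samples
    ))
```

Playing those 8,000 numbers back through a speaker is all it takes to hear half a second of a pure tone.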
Text to speech
One of the most popular and best-researched methods of generating speech is converting the text to be spoken directly into sound. The earliest programs of this kind glued individual letters into words and words into sentences.
As these synthesizers developed, the set of pre-recorded phonemes (letters) grew into a set of syllables, and then into whole words.
The advantages of such programs are obvious: they are easy to write, use, and maintain; they can reproduce every word in the language; and they are predictable - all of which at one time made them commercially viable. But the quality of the voice produced this way leaves much to be desired. We all remember the telltale signs of such generators: flat, emotionless speech, incorrect stress, and words and letters abruptly torn from one another.
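The "gluing" described above can be sketched in a few lines. Everything here is a stand-in: a real system would use studio recordings of phonemes, while this toy uses plain tones - but the hard, clicky joins it produces are exactly the artifact described above.

```python
import numpy as np

SR = 16_000  # sample rate, Hz

def fake_unit(freq, ms=120):
    """Hypothetical 'pre-recorded phoneme': a short placeholder tone."""
    t = np.arange(int(SR * ms / 1000)) / SR
    return np.sin(2 * np.pi * freq * t)

# A tiny stand-in library of recorded units, one per letter.
units = {"h": fake_unit(300), "e": fake_unit(500),
         "l": fake_unit(350), "o": fake_unit(450)}

# Naive concatenative synthesis: glue the units end to end.
# The hard cuts at each join are audible as clicks.
word = np.concatenate([units[p] for p in "hello"])
```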
Sounds to speech
This method of generating speech replaced the first one relatively quickly, since it imitates human speech better: we pronounce sounds, not letters. That is why systems based on the International Phonetic Alphabet (IPA) sound higher-quality and are more pleasant to listen to.
This method is based on individual sounds pre-recorded in a studio and glued together into words. Compared to the first approach there is a noticeable qualitative improvement: instead of simply splicing audio tracks, the sounds are blended using both mathematical rules and neural networks.
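One simple mathematical blending rule is a linear crossfade: instead of a hard cut between two units, the end of one fades out while the start of the next fades in. A rough sketch (the `crossfade` helper and the tone "units" are illustrative, not taken from any real synthesizer):

```python
import numpy as np

def crossfade(a, b, overlap):
    """Join two sound units, blending `overlap` samples with a
    linear fade-out/fade-in instead of a hard cut."""
    fade = np.linspace(0.0, 1.0, overlap)
    mixed = a[-overlap:] * (1 - fade) + b[:overlap] * fade
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

sr = 16_000
t = np.arange(sr // 10) / sr              # two 100 ms units
u1 = np.sin(2 * np.pi * 400 * t)
u2 = np.sin(2 * np.pi * 600 * t)

word = crossfade(u1, u2, overlap=160)     # blend over 10 ms
```

The overlapping region removes the click a hard cut would leave, at the cost of slightly shortening the result.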
Speech to speech
A relatively new approach is built entirely on neural networks. The WaveNet architecture, developed by researchers at DeepMind, converts text or sound into another sound directly, without relying on pre-recorded building blocks (research paper).
The key to this technology is its autoregressive design: WaveNet is built from stacked dilated causal convolutions, so every generated audio sample is conditioned on a long window of the samples that came before it. (Many related text-to-speech systems additionally use Long Short-Term Memory recurrent layers, which retain state across the whole sequence, to model text and prosody.)
In general, this architecture works with any kind of sound wave, regardless of whether it is music or a human voice.
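For a sense of why such a network can model long stretches of any sound wave: in the published WaveNet design, dilated causal convolutions are stacked with the dilation doubling layer by layer, so the receptive field - the number of past samples that influence the next one - grows exponentially with depth. A small calculation sketch:

```python
def receptive_field(kernel_size, dilations):
    """How many past samples a stack of dilated causal
    convolutions can 'see' when predicting the next sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# One WaveNet-style block: kernel size 2, dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
rf = receptive_field(2, dilations)
print(rf)  # 1024 samples from just ten layers
```

With only ten layers, the model already conditions on over a thousand past samples; stacking several such blocks extends this to tens of milliseconds of audio context.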
To recreate speech, such systems pair a generator of sound notation from text with an intonation generator (stress, pauses) to produce a natural-sounding voice.
This is the most advanced speech-generation technology: instead of splicing or mixing sounds the machine does not understand, it creates the transitions between them on its own, inserts pauses between words, and changes the pitch, strength, and timbre of the voice for the sake of correct pronunciation - or for any other purpose.
Making a fake voice
For the simplest identification systems, almost any method will do - a particularly lucky attacker might get by with just five seconds of raw recorded voice. But to bypass a more serious system - one built, for example, on neural networks - we need a real, high-quality voice generator.
How the voice simulator works
Creating a plausible voice-to-voice model based on WaveNet takes a lot of effort: you would have to record a large amount of text spoken by two different people, aligned second by second - and that is hard to do. However, there is another way.
Based on the same principles as the speech-synthesis technology above, you can reproduce all the parameters of a voice just as realistically. This is how a program was created that clones a voice from a short speech recording - and this is what we will use.
The program consists of several important parts that run in sequence, so let's go through them step by step.
Voice encoding
Every person's voice has a number of characteristics - not always recognizable by ear, but important. To reliably tell one speaker from another, the right approach is to train a dedicated neural network that builds its own set of features for each person.
This encoder makes it possible not only to transfer the voice later on, but also to compare the results against the target.
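A common way to compare such feature sets (embeddings) is cosine similarity. The sketch below uses random vectors as hypothetical stand-ins for encoder output - real values would come from the trained network - purely to show the comparison:

```python
import numpy as np

def cosine_similarity(a, b):
    """Closeness of two speaker embeddings (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical 256-dimensional embeddings for three utterances.
speaker_a_1 = rng.normal(size=256)
speaker_a_2 = speaker_a_1 + rng.normal(scale=0.1, size=256)  # same person
speaker_b   = rng.normal(size=256)                           # other person

same = cosine_similarity(speaker_a_1, speaker_a_2)  # close to 1
diff = cosine_similarity(speaker_a_1, speaker_b)    # close to 0
```

Two recordings of the same person land close together in the embedding space; recordings of different people point in nearly unrelated directions.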
Creating a spectrogram
From these characteristics, a mel-spectrogram of the sound can be generated from the text. This is done by a synthesizer based on Tacotron 2, which uses WaveNet.
The generated spectrogram contains all the information about pauses, sounds, and pronunciation, with the pre-computed characteristics of the voice already built in.
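As a rough sketch of what lies behind a mel-spectrogram: the signal is cut into overlapping frames, each frame is Fourier-transformed, and the resulting frequency bins are then pooled on the mel scale, which spaces frequencies the way the ear hears them. The `spectrogram` helper below is a toy illustration, not the synthesizer's actual code:

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: fine resolution at low frequencies, coarse at high."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def spectrogram(signal, frame=512, hop=128):
    """Toy STFT: windowed frames -> magnitude of the Fourier transform."""
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16_000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))  # one second of a 440 Hz tone
# A mel-spectrogram would further pool these 257 linear-frequency
# bins into a few dozen mel bands using a triangular filter bank.
```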
Sound synthesis
Now another neural network - based on WaveRNN - gradually builds a sound wave from the mel-spectrogram. This sound wave is played back as the finished audio.
All the characteristics of the source voice are preserved in the synthesized sound, which - though not without difficulty - recreates the original human voice on any text.
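The essential idea of such a vocoder is autoregression: the wave is produced one sample at a time, each new sample conditioned on the previous ones. The sketch below replaces the trained network with a fixed linear predictor purely to show the shape of the loop - it is not WaveRNN itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for learned network parameters: a fixed linear predictor
# that acts like a damped oscillator.
weights = np.array([1.8, -0.81])

wave_out = [0.0, 0.1]  # seed samples
for _ in range(1000):
    context = np.array(wave_out[-2:])        # condition on the past
    nxt = float(weights @ context[::-1])     # "network" predicts next sample
    nxt += rng.normal(scale=0.001)           # sampling adds slight randomness
    wave_out.append(nxt)
```

A real vocoder conditions each step on the mel-spectrogram frame as well, and predicts a full probability distribution over the next sample instead of a single value.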
Method testing
Now that we know how to create a believable voice imitation, let's put it into practice. Earlier I described two simple but working methods for identifying a person by voice: analysis of mel-cepstral coefficients, and neural networks specially trained to recognize one person. Let's find out how well our fake recordings can fool these systems.
Let's take a five-second recording of a man's voice and create two recordings with our tool. The original and the resulting recordings can be downloaded or listened to.
Let's compare these recordings using the mel-cepstral coefficients.
The difference in coefficients is also visible in the numbers:
Synthesis_1 - original: 0.38612951111628727
Synthesis_2 - original: 0.3594987201660116
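For illustration, a distance of this kind can be computed as the mean difference between two coefficient matrices. The matrices below are random stand-ins (the real coefficients come from the audio files), so only the shape of the computation matches:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical mel-cepstral coefficients: 13 coefficients x 100 frames.
mfcc_original = rng.normal(size=(13, 100))
# A "synthesized" version: the original plus some distortion.
mfcc_synth = mfcc_original + rng.normal(scale=0.4, size=(13, 100))

# Mean absolute difference: smaller means the voices are more alike.
distance = float(np.mean(np.abs(mfcc_synth - mfcc_original)))
```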
How will the neural network react to such a good fake?
Synthesis_1 - original: 89.3%
Synthesis_2 - original: 86.9%
We managed to convince the neural network, though not completely. Serious security systems - the kind installed in banks, for example - would most likely detect the fake, but a human, especially over the phone, would be hard pressed to tell a real interlocutor from a computer imitation.
Conclusions
Faking a voice is no longer as difficult as it used to be, and that opens up great opportunities not only for attackers but also for content creators: indie game developers can get high-quality, inexpensive voice acting, animators can voice their characters, and film directors can shoot convincing documentaries.
And although high-quality speech synthesis is still a developing technology, its potential is already breathtaking. Soon every voice assistant will have a personal voice of its own - not metallic and cold, but filled with emotion; tech-support chats will stop being annoying; and you'll be able to have your phone answer unwanted calls for you.