Machine hearing. How identifying a person by their voice works.

Mutt



You may have already come across voice identification. Banks use it to identify customers over the phone, checkpoints use it to verify identity, and home voice assistants use it to recognize their owners. How does it work? Let's dig into the details and build our own implementation.

The content of the article
  • Voice characteristics
  • Sound preprocessing
  • Identification using MFCC
  • Method testing
  • Voice identification using neural networks
  • Method testing
  • Conclusions

Voice characteristics
First of all, a voice is characterized by its pitch: the fundamental frequency around which all the movements of the vocal cords are built. This frequency is easy to hear: some people have a higher, brighter voice, while others have a lower, deeper one.
Another important parameter is the strength of the voice, the amount of energy a person puts into speaking. Loudness and richness depend on it.
Another characteristic is how the voice transitions from one sound to another. This parameter is the hardest to describe and to pick out by ear, yet it is the most accurate, like a fingerprint.

Sound preprocessing
The human voice is not a single wave; it is the sum of many individual frequencies produced by the vocal cords, plus their harmonics. This makes it difficult to find voice patterns in the raw waveform.

The Fourier transform comes to our aid: a mathematical way to describe a complex sound wave as a spectrum, that is, a set of frequencies and their amplitudes. The spectrum contains all the key information about the sound: it tells us which frequencies make up the original voice.

But the Fourier transform is a mathematical tool that assumes an ideal, unchanging signal, so it needs practical adaptation. Instead of extracting frequencies from the entire recording at once, we divide the recording into short segments over which the sound barely changes, and apply the transform to each piece.
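
This windowed version of the transform is known as the short-time Fourier transform, and librosa implements it directly. A minimal sketch (the file name voice.wav is just a placeholder for any mono recording):
Code:
import librosa as lr
import numpy as np

audio, sr = lr.load("voice.wav", sr=16000)  # placeholder file name, any mono recording

# Split the signal into overlapping 2048-sample blocks (~128 ms at 16 kHz)
# and apply the Fourier transform to each block
spec = lr.stft(audio, n_fft=2048)

# One column of amplitudes per block, 1 + 2048/2 = 1025 frequency bins per column
print(np.abs(spec).shape)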

The next step is to compute the second-order spectrum, that is, the spectrum of the spectrum. We need this because the spectrum contains, besides the fundamental frequencies, harmonics that are inconvenient for analysis: they duplicate information. The harmonics sit at equal distances from each other and differ only in their gradually decreasing amplitude.

Let's see what the spectrum of a monotone sound looks like. We'll start with the simplest wave, a sinusoid, like the tone a landline telephone emits when you dial a number.

[Figure: spectrum of the sine wave]


You can see that besides the main peak, which represents the signal itself, there are smaller peaks: harmonics that carry no useful information. That is why, before computing the second-order spectrum, we take the logarithm of the first one, which brings the peaks to a similar size.

[Figure: the same spectrum after taking the logarithm]


Now, if we compute the second-order spectrum, called the "cepstrum" (an anagram of the word "spectrum"), we get a much cleaner picture that captures our original monotone wave with a single peak.
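
This chain is easy to reproduce in a few lines of numpy. A minimal sketch on a synthetic tone (the 200 Hz fundamental and its handful of harmonics are chosen purely for illustration):
Code:
import numpy as np

SR = 16000                                     # sampling rate
t = np.arange(SR) / SR                         # one second of samples

# A "voiced" test tone: a 200 Hz fundamental plus a few weaker harmonics
block = sum(0.5 ** k * np.sin(2 * np.pi * 200 * (k + 1) * t) for k in range(5))

spectrum = np.abs(np.fft.rfft(block))          # first-order spectrum
log_spectrum = np.log(spectrum + 1e-10)        # the logarithm evens out the peak heights
cepstrum = np.abs(np.fft.irfft(log_spectrum))  # "spectrum of the spectrum"

# The evenly spaced harmonics of the log-spectrum fold into peaks around
# multiples of SR / 200 = 80 samples, the period of the fundamental
print(cepstrum.shape)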

[Figure: cepstrum of the sine wave]


One of the most useful properties of our hearing is that it perceives frequencies non-linearly. Through long experiments, scientists found that this relationship is not only easy to measure but also easy to put to use.

This new unit is called the mel, and it reflects how well a person distinguishes different frequencies: the higher the frequency of a sound, the harder it is to tell it apart from its neighbors.
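
The most common form of this hertz-to-mel mapping is the formula below; as far as I know, librosa.hz_to_mel with htk=True uses the same expression. A minimal sketch:
Code:
import numpy as np

def hz_to_mel(f):
  # HTK-style formula: equal steps in mels correspond to ever larger steps in hertz
  return 2595.0 * np.log10(1.0 + f / 700.0)

for f in (100, 500, 1000, 4000, 8000):
  print(f, 'Hz ->', round(hz_to_mel(f), 1), 'mel')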

[Figure: hertz-to-mel conversion chart]
Now let's try to put all this into practice.

Identification using MFCC
We could take a long recording of a person's voice, compute the cepstrum for each short block, and get a unique imprint of the voice at every moment in time. But this imprint is too large to store and analyze: depending on the chosen block length, it can run to a couple of thousand numbers per 100 ms. So we need to distill a manageable number of features from it, and the mel scale will help us do that.

We pick a set of frequency bands ("areas of audibility") and sum the spectrum within each of them. The number of bands equals the number of features we need, and the widths and boundaries of the bands follow the mel scale.

So we have arrived at the mel-frequency cepstral coefficients (MFCC). The number of coefficients can be arbitrary, but most often it is between 20 and 40.
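
For reference, librosa can build such a bank of mel-spaced bands directly; a minimal sketch (the 26 bands and the 2048-sample FFT are just example values):
Code:
import librosa as lr

# Triangular mel filters: each row is one band, narrow at low frequencies and wide at high ones
mel_bank = lr.filters.mel(sr=16000, n_fft=2048, n_mels=26)
print(mel_bank.shape)  # (26, 1025): 26 bands over 1025 FFT frequency bins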

These coefficients describe each "frequency block" of the voice at every moment in time, which means that if we collapse the time axis by summing the coefficients over all blocks, we get a person's voiceprint.

Method testing
Let's download a few YouTube videos from which to extract voices for our experiments. We want clean sound without noise, so I chose the TED Talks channel.

We will download several videos in any convenient way, for example with the youtube-dl utility. It is available through pip or from the official Ubuntu and Debian repositories. I downloaded three talks: two by women and one by a man.
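
For example, downloading a talk by its URL could look like this (VIDEO_ID is a placeholder, substitute the links to the talks you picked):
Code:
$ youtube-dl "https://www.youtube.com/watch?v=VIDEO_ID"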

Then we convert the video to audio and cut out several pieces of different lengths that contain no music or applause.
Code:
$ ffmpeg -ss 00:00:27.0 -i man1.webm -t 200 -vn man1.1.wav

Now let's look at the Python 3 program. We need the numpy and librosa libraries for computation and sound processing; both can be installed with pip. Conveniently, all the heavy lifting of computing the coefficients is packed into a single function, librosa.feature.mfcc. Let's load the audio track and extract the voice characteristics.
Code:
import librosa as lr
import numpy as np

SR = 16000 # Sampling rate

def process_audio(aname):
  audio, _ = lr.load(aname, sr=SR)  # Load the track into memory

  # Extract the coefficients
  afs = lr.feature.mfcc(y=audio,    # from our sound
                        sr=SR,      # with a sampling rate of 16 kHz
                        n_mfcc=34,  # extract 34 coefficients
                        n_fft=2048) # use blocks of 2048 samples (~128 ms)

  # Sum all the coefficients over time
  # We discard the first two, since they are inaudible to humans and mostly carry noise
  afss = np.sum(afs[2:], axis=-1)

  # Normalize them
  afss = afss / np.max(np.abs(afss))

  return afss

def confidence(x, y):
  return np.sum((x - y) ** 2)  # Squared Euclidean distance: less is better

## Load several audio tracks
woman21 = process_audio("woman2.1.wav")
woman22 = process_audio("woman2.2.wav")
woman11 = process_audio("woman1.1.wav")
woman12 = process_audio("woman1.2.wav")

## Compare the coefficients for proximity
print('same', confidence(woman11, woman12))
print('same', confidence(woman21, woman22))
print('diff', confidence(woman11, woman21))
print('diff', confidence(woman11, woman22))
print('diff', confidence(woman12, woman21))
print('diff', confidence(woman12, woman22))

Result:
Code:
same 0.08918786797751492
same 0.04016324022920391
diff 0.8353932676024817
diff 0.5290006939899561
diff 0.5996234966734799
diff 0.9143384850090941

Identification works correctly. But we can improve the algorithm by adding a filter that removes silence and pauses between words and sentences.
Code:
def filter_audio(audio):
  # Compute the signal energy (in dB) for each 2048-sample block (~128 ms)
  apower = lr.amplitude_to_db(np.abs(lr.stft(audio, n_fft=2048)), ref=np.max)

  # Sum the energy over all frequencies for each block and normalize
  apsums = np.sum(apower, axis=0)
  apsums -= np.min(apsums)
  apsums /= np.max(apsums)

  # Smooth the curve to keep short gaps and pauses, remove sharp spikes
  apsums = np.convolve(apsums, np.ones(9), 'same')
  # Normalize again
  apsums -= np.min(apsums)
  apsums /= np.max(apsums)

  # Keep only blocks whose smoothed energy exceeds 35% of the maximum
  apsums = np.array(apsums > 0.35, dtype=bool)

  # Stretch the per-block flags back to individual samples
  apsums = np.repeat(apsums, int(np.ceil(len(audio) / len(apsums))))[:len(audio)]

  return audio[apsums]  # Filtering!
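
For the filter to take effect, process_audio needs one extra line right after loading the track. A minimal sketch of the assumed change:
Code:
def process_audio(aname):
  audio, _ = lr.load(aname, sr=SR)  # Load the track into memory
  audio = filter_audio(audio)       # New: drop silence and pauses before extracting the MFCC

  afs = lr.feature.mfcc(y=audio, sr=SR, n_mfcc=34, n_fft=2048)
  afss = np.sum(afs[2:], axis=-1)
  return afss / np.max(np.abs(afss))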

Let's test the new program.
Code:
same 0.07287868313339689
same 0.07599075249316399
diff 1.1107063027198296
diff 0.9556985491806391
diff 0.9212706723328299
diff 1.019240307344966

We also computed and plotted the values of the individual features.

[Figure: MFCC feature values for two recordings of each of the two women]


These graphs show how our program compares the values of the different features. Red and green mark the coefficients obtained from the voices of the two women, two recordings each. Lines of the same color lie close together: they are the same person's voice. Lines of different colors lie farther apart, since they belong to different people.

Now let's compare the male and female voices.
Code:
same 0.07287868313339689
same 0.1312549383658766
diff 1.4336642787341562
diff 1.5398833283440216
diff 1.9443562070029585
diff 1.6660100959317368

[Figure: MFCC feature values for the male and the female voices]


Here the differences are more pronounced, as the graph shows. The man's voice is lower: it has more peaks at the beginning of the graph and fewer at the end.
This algorithm really does work, and it works well. Its main drawback is that accuracy depends on noise and on the length of the recording: if the recording is shorter than about ten seconds, accuracy drops quickly.

Voice identification using neural networks
We can improve on this algorithm with neural networks, which are remarkably effective on problems like this. We will use the Keras library to build the model.
Code:
import librosa as lr
import numpy as np

from keras.layers import Dense, LSTM, Activation
from keras.models import Sequential
from keras.optimizers import Adam

SR = 16000 # Sampling rate
LENGTH = 16 # Number of blocks in one pass of the neural network
OVERLAP = 8 # Step in the number of blocks between training samples
FFT = 1024 # Block length (64ms)

def prepare_audio(aname, target=False):
  # Load and prepare the data
  print('loading %s' % aname)
  audio, _ = lr.load(aname, sr=SR)
  audio = filter_audio(audio)                      # Remove silence and gaps between words
  data = lr.stft(audio, n_fft=FFT).swapaxes(0, 1)  # Extract the spectrogram
  samples = []

  for i in range(0, len(data) - LENGTH, OVERLAP):
    samples.append(np.abs(data[i:i + LENGTH]))     # Create a training sample

  results_shape = (len(samples), 1)
  results = np.ones(results_shape) if target else np.zeros(results_shape)
  return np.array(samples), results

## List of all recordings
voices = [("woman2.wav", True),
          ("woman2.1.wav", True),
          ("woman2.2.wav", True),
          ("woman2.3.wav", True),
          ("woman1.wav", False),
          ("woman1.1.wav", False),
          ("woman1.2.wav", False),
          ("woman1.3.wav", False),
          ("man1.1.wav", False),
          ("man1.2.wav", False),
          ("man1.3.wav", False)]

## Combine the training samples
X, Y = prepare_audio(*voices[0])
for voice in voices[1:]:
  dx, dy = prepare_audio(*voice)
  X = np.concatenate((X, dx), axis=0)
  Y = np.concatenate((Y, dy), axis=0)
  del dx, dy

## Randomly shuffle all the blocks
perm = np.random.permutation(len(X))
X = X[perm]
Y = Y[perm]

## Create the model
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=X.shape[1:]))
model.add(LSTM(64))
model.add(Dense(64))
model.add(Activation('tanh'))
model.add(Dense(16))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.add(Activation('hard_sigmoid'))

## Compile and train the model
model.compile(Adam(lr=0.005), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, Y, epochs=15, batch_size=32, validation_split=0.2)

## Test the resulting model
print(model.evaluate(X, Y))
## Save the model for future use
model.save('model.hdf5')

This model uses two Long Short-Term Memory (LSTM) layers, which let the network analyze not only the voice itself, its pitch and strength, but also its dynamics, for example the transitions between sounds.

Method testing
Let's train the model and see the results.
Code:
Epoch 1/20
5177/5177 [====================] - loss: 0.4099 - acc: 0.8134 - val_loss: 0.2545 - val_acc: 0.8973
...
Epoch 20/20
5177/5177 [====================] - loss: 0.0360 - acc: 0.9944 - val_loss: 0.2077 - val_acc: 0.9807
[0.18412712604838924, 0.9819283065512979]

Excellent! An accuracy of 98% is a good result. Let's look at the accuracy for each individual speaker.
Code:
woman1: 98.4%
woman2: 99.0% target
man1: 98.4%

The neural network does an excellent job and copes with most of the interference: noise and limits on recording length (it analyzes only about a second of audio at a time). Of the methods considered here, this is the most promising and effective one.
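
The script above stops at saving the model, but applying it to a new recording is just as simple. A minimal sketch (the file name new_voice.wav and the 0.5 decision threshold are assumptions for illustration):
Code:
from keras.models import load_model
import numpy as np

model = load_model('model.hdf5')

# Cut the new recording into blocks exactly as we did for training
samples, _ = prepare_audio('new_voice.wav')

# Average the per-block predictions into a single score for the whole recording
score = float(np.mean(model.predict(samples)))
print('target speaker' if score > 0.5 else 'someone else', score)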

Conclusions
Technology for recognizing a person by voice is still at the research and development stage, so there are few good, widely used solutions in the public domain. In the commercial sector, however, such products are already shipping and are making life easier for call-center operators and smart-home developers. Now you can try this technique at work or in your own projects.
 

HibrikBilik

It's good if the problem is with the game's voice acting and not with your hearing. I had a similar experience: I couldn't hear the characters' dialogue in an online game and thought it was because of the poor quality of the voice acting. Then it turned out the problem was my ears. I went to a hearing clinic near me, and it turned out that my love of listening to loud music through headphones had given me hearing problems. I was prescribed a course of treatment, and now things are much better and I pay much more attention to my health.
 