How facial recognition actually works

Over the past decade, facial recognition technology has made great strides - and at the same time has become the subject of much controversy and discussion. On the Internet you can find an enormous number of posts and articles about how facial recognition works, why it is being deployed, and how well or poorly it performs. And, as always, it is very difficult to digest this volume of information and separate truth from idle speculation, especially without the appropriate background. One author claims that today's neural networks can identify the right person in a large crowd with perfect accuracy, another gives examples of curious artificial intelligence blunders, and a third reveals secret ways to deceive the recognition algorithm with a guarantee.

But how do things work in practice? What should we believe?

We, the NtechLab team, will try to explain in plain language what the most modern facial recognition algorithms - the ones each of us encounters in everyday life - actually consist of, discuss what they can and cannot yet do, and answer the questions of when the technology works well, when it works poorly, and what that depends on.

How everything works​

What's in, what's out​

We suggest diving into the face recognition system gradually. To begin with, it is easiest to imagine it as a black box that takes an image as input (a frame from a video or a photograph) and returns a certain set of real numbers that “encodes” the face. This set is often called a “feature vector” (and it is indeed a vector) or a “biometric template”.

[Figure: a black box takes a face image as input and returns a feature vector]


Each system can have its own dimension of this vector; usually it is some power of two: 128, 256 or 512. Whatever the dimension, the norm of the vector is equal to one:

for a vector a = (a_1, a_2, \dots, a_N) it holds that

\sqrt{a_1^2+a_2^2 + \dots + a_N^2}=1


This means that all vectors returned by the system lie on an N-dimensional hypersphere, where N is the dimension of the vector. It is quite difficult to imagine such a hypersphere, so - again, for simplicity - we will resort to Geoffrey Hinton’s famous advice on visualizing multidimensional space:

To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say “fourteen” to yourself very loudly. Everyone does it.

So, let us have the familiar three-dimensional sphere and three-dimensional vectors on it, which, as we remember, are encoded faces:

[Figure: face vectors on a three-dimensional sphere]


These vectors have the following property: if we encode the same face image twice, we get two identical vectors - the angle between them is zero; and the more two faces differ, the farther apart their vectors lie on the sphere and the larger the angle between them. This means that to determine the “similarity” of two faces, we only need to measure the angle between their vectors. It is most convenient to use the cosine of the angle as the similarity measure rather than the angle itself (remember that all vectors have a norm equal to 1):

similarity = \cos{\Theta} = a_1b_1+a_2b_2+ \dots + a_Nb_N


And for even greater convenience, let’s rescale the similarity measure so that it takes values in the interval [0; 1]:

similarity = \frac{1 + \cos{\Theta}}{2}


The facial recognition system cannot tell us that a certain photograph shows the fictitious Ivanov I.I. (or, conversely, that the photo is not Ivanov at all) - it works differently. We can take a real photo of Ivanov and use the system to build a feature vector for it. Later, this vector can be compared with the vector of the image under study to obtain a measure of their similarity.
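To make the comparison concrete, here is a minimal sketch (in Python with NumPy, not NtechLab’s actual code) of how two such templates could be compared using the rescaled cosine similarity defined above; the 256-dimensional random vectors stand in for real templates.

```python
import numpy as np

def similarity(a, b):
    """Similarity of two biometric templates, rescaled to the interval [0; 1]."""
    a = a / np.linalg.norm(a)          # templates are unit vectors by construction;
    b = b / np.linalg.norm(b)          # re-normalizing here is only a safeguard
    cos_theta = float(np.dot(a, b))    # cosine of the angle between the vectors
    return (1 + cos_theta) / 2

# Hypothetical 256-dimensional templates: one built from an enrolled photo
# of "Ivanov", one from the image under study.
enrolled = np.random.randn(256)
probe = np.random.randn(256)
print(similarity(enrolled, probe))     # ~0.5 for unrelated random vectors
print(similarity(enrolled, enrolled))  # 1.0 for identical templates
```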

Detector​

Now we will gradually open the black box. First of all, having received a picture as input, the algorithm needs to find people’s faces in it. This is done by a component called a detector, and its job is to highlight areas that contain something that resembles a face.

[Figure: detector output - regions of the frame containing faces]


For a long time this problem was solved by the Viola-Jones method or HOG-based detectors, but today neural networks have replaced them almost everywhere: they are more accurate, less sensitive to the shooting angle (tilts, rotations, etc.) and much more stable in their predictions than classical methods. Even speed, traditionally considered the key advantage of the “classics”, has ceased to be a problem for neural networks: with the huge amount of training data available and today’s computing resources, you can easily pick a network size that satisfies your needs.
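For illustration only, here is what the classical approach looks like in code: a minimal sketch using OpenCV’s bundled Haar cascade (the Viola-Jones detector mentioned above). The file name "photo.jpg" is a hypothetical input, and a production system would use a neural-network detector instead.

```python
import cv2  # opencv-python
import numpy as np

# The bundled Haar cascade implements the classical Viola-Jones detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")                   # hypothetical input frame
if img is None:                                 # fall back to a blank frame so the
    img = np.zeros((480, 640, 3), np.uint8)     # sketch runs even without the file

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # the cascade works on grayscale
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                      # one bounding box per detected face
    face_crop = img[y:y + h, x:x + w]           # this crop goes on to the next stage
```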

Normalizer​

The faces that the detector returns are still in their natural position: somehow rotated, somehow tilted - and they are all, of course, different sizes. To make it easier for us to process and compare them at the next stages, it would be worth bringing them to some universal form. This problem is solved by a system component called a normalizer.

Ideally, we would like to work only with frontal images of the face, which means we need to be able to convert any image the system receives to a frontal view - and to do so simply and quickly, without resorting to 3D reconstruction and other “rocket science”. Of course, there is no magic, and it is impossible to perfectly frontalize an arbitrary face, but we can still try to get an image as close to frontal as the existing picture allows. We have three tools at our disposal:
  • scale: we can “zoom in” or “zoom out” of the face;
  • rotation: we can rotate the face to any angle in the image plane;
  • shift: we can move the face a few pixels to the left or right, up or down.
Each of these transformations is described by a 3x3 matrix. Multiplying all three matrices together gives a single 3x3 matrix for the combined transformation, which is then applied to the face to bring it to the form we need:

[Figure: example of face normalization]


One way to determine the transformation is the following: find the key points of the face (the centers of the eyes, the tip of the nose) and compute a matrix such that the tip of the nose ends up in the center of the image and the eyes are aligned at the same horizontal level. The method is quite simple; however, firstly, it depends heavily on the quality of key-point detection, and secondly, there is no guarantee that the heuristic described above is optimal for recognition.
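As a sketch of this key-point heuristic (with plausible but entirely made-up numbers, not the parameters of any real normalizer), the snippet below builds the scale, rotation and shift matrices from the eye coordinates, multiplies them into a single 3x3 transformation and applies it with OpenCV; the 112-pixel canvas and target eye positions are arbitrary assumptions.

```python
import numpy as np
import cv2  # opencv-python

def alignment_matrix(left_eye, right_eye, out_size=112, eye_dist=40.0, eye_y=45.0):
    """3x3 matrix that rotates the eye line to horizontal, scales the inter-eye
    distance to `eye_dist` pixels and shifts the eye midpoint to a fixed spot."""
    left_eye = np.asarray(left_eye, dtype=float)
    right_eye = np.asarray(right_eye, dtype=float)
    dx, dy = right_eye - left_eye
    angle = np.arctan2(dy, dx)               # current tilt of the eye line
    scale = eye_dist / np.hypot(dx, dy)      # zoom so the eyes end up eye_dist apart

    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])         # rotation
    S = np.array([[scale, 0, 0], [0, scale, 0], [0, 0, 1]])  # scale
    M = S @ R                                                # rotate, then scale

    mid = np.append((left_eye + right_eye) / 2, 1.0)         # eye midpoint (homogeneous)
    target = np.array([out_size / 2, eye_y, 1.0])            # where it should land
    t = target - M @ mid
    T = np.array([[1, 0, t[0]], [0, 1, t[1]], [0, 0, 1]])    # shift
    return T @ M                                             # total 3x3 transform

# Toy usage: a blank "photo" and made-up eye coordinates.
img = np.zeros((240, 240, 3), dtype=np.uint8)
M = alignment_matrix(left_eye=(100, 130), right_eye=(160, 120))
aligned = cv2.warpAffine(img, M[:2], (112, 112))  # apply the affine part of M
```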

The alternative, as you probably already guessed, is again a neural network. With it, we can predict the final transformation matrix directly, without searching for key points or making any assumptions about the location of the nose and eyes. An example of the operation of a neural network normalizer is shown in the figure above (we have drawn the points just for illustration).

Extractor​

Now that we have a normalized face, it's time to build the vector - this is done by a component called the extractor, the main element of the entire system. It accepts images of a fixed resolution as input, usually 90-130 pixels on a side; this size maintains a balance between the accuracy of the algorithm and its speed (a higher-resolution picture could contain more information useful for recognition, but it would also take longer to process).

Vector extraction is the final stage of the face processing pipeline: detector, then normalizer, then extractor.

Extractor training​

The main thing we expect from a good extractor is that it builds vectors that are as “close” as possible for similar faces and as “distant” as possible for dissimilar ones. To do this, the extractor needs to be trained, and for training, first of all, we need a dataset - a set of labeled data. It might look something like this:

[Figure: example of a labeled dataset - several photographs per person]

That is, we have a certain set of unique people - “persons” (Person k and Person m are different people if m ≠ k), and for each of them there is a certain set of pictures. At the same time, we know exactly which person is in which picture.

How many people do we need for training? And how many photos of each person? The obvious answer: the more, the better. The best systems are trained on datasets of millions, or even tens of millions, of people; five to ten photographs of each will be enough for us (and even a person with a single photo can be useful for training), but again, more is better. Nowadays a large number of public datasets can be found on the Internet, and researchers often collect photographs of celebrities for training.

When forming a training set, keep in mind that the extractor (in fact, this is true of any neural network) will always work better on data similar to what it was trained on. If our dataset contains only people of European appearance, there is a high risk that the results for people of Asian or African appearance will disappoint us, which means we should try to make the training sample as diverse as possible.
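As a purely illustrative sketch of what “pulling same-person vectors together and pushing different-person vectors apart” can look like in code, here is a toy PyTorch training step with a tiny stand-in network and a cosine-distance triplet loss. Real extractors are large CNNs trained with more elaborate losses (ArcFace-style classification heads and the like), so every name and number here is an assumption for illustration, not NtechLab’s recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExtractor(nn.Module):
    """Toy stand-in for a real extractor (which would be a deep CNN)."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        # L2-normalize so every feature vector lies on the unit hypersphere
        return F.normalize(self.backbone(x), dim=1)

model = TinyExtractor()
# Pull same-person vectors together, push different-person vectors apart,
# using 1 - cosine similarity as the distance on the hypersphere.
loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1 - F.cosine_similarity(a, b), margin=0.3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# anchor/positive: two photos of the same person; negative: a different person.
# Random tensors stand in for batches of 112x112 normalized face crops.
anchor, positive, negative = (torch.randn(8, 3, 112, 112) for _ in range(3))
loss = loss_fn(model(anchor), model(positive), model(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```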

FAQ​

What do the features the neural network detects look like? What guides the algorithm when constructing a vector? Can the vector itself tell us something about a person's appearance? And, the icing on the cake: which parts of the face does the algorithm pay attention to? Perhaps no questions are put to us more often than these. In fact, they are all very good questions, and they are of great interest not only to laypeople but also to the researchers who develop neural networks.

If you ask an ordinary person to describe the features of a certain face, they will probably name the shape and color of the eyes, the hairstyle and facial hair, the length of the nose, the arch of the eyebrows... A trained physiognomist (for example, a border guard who checks your passport at the airport, or a criminologist specializing in portrait examination) will evaluate the location of anthropometric points and the key distances between them. The same goes for the neural network: it certainly “pays attention” to the characteristic features of the face, but you need to understand that no individual number in the feature vector is responsible for any specific point or facial feature. Looking at the numerical representation of the vector, we cannot say that this part of it describes the eyes and that part describes the shape of the nose. (Later in this article we will describe a number of experiments showing how difficult it is for a neural network to recognize a person without seeing, for example, their eyes or mouth.)

So the vector is unsuitable for manual analysis. At the same time, there is a considerable amount of research in which the authors try to reconstruct a person’s appearance from a feature vector. Among this work we will mention Vec2Face, by researchers from Canada and the USA. The essence of their solution, put as simply as possible, is this: to generate an image from a feature vector, a specially designed neural network is used, and the dataset for training it is obtained by running a large number of photographs through an extractor and saving the resulting vectors.

The authors managed to obtain a very acceptable result:

[Figure: real photographs (bottom row) and faces reconstructed from their feature vectors]


The bottom row of the illustration shows real photos, and above them are pictures that were synthesized from their vectors using different restoration methods. See for yourself: the images in the third and fourth rows are pretty close to reality!

However, you need to understand that applying the described approach in practice would be very difficult. First, developing a reconstruction algorithm is quite complex and requires serious computing resources and a significant investment of time. Second, to create such an algorithm researchers need unrestricted access to the extractor - and this is not so difficult to protect against. Third, for each new extractor the reconstruction model has to be trained anew; it is impossible to create a universal algorithm able to “invert” any extractor. Fourth (and probably most important), it is by no means guaranteed that a good-quality reconstruction algorithm can be built for any particular extractor at all.

How it works in practice​

In theory, each element of the pipeline is well trained for its task. However, when we start combining them into one system and applying them to real data, some unforeseen problems may arise.

Assessing image quality​

Let's start with the fact that our detector can mistakenly find a face where there is none. Or a face may be detected correctly, but the image turns out to be of poor quality: very small, rotated at a large angle, very blurry, very dark, or all of this at once. Nevertheless, once a face is found, it will go through the entire pipeline and receive a vector - but that vector will be of little use. For a low-quality face the vector turns out to be very noisy and inaccurate, and for a false detection it carries no useful information at all, so there is no point in using such vectors for comparison.

We need some mechanism that passes only high-quality images to the extractor. At NtechLab we use an additional lightweight neural network - a quality detector - which returns an integral score in the range [0; 1] indicating how “good” the image produced by the detector is. If the measured quality score is below a certain threshold, the face is rejected, and by moving this threshold we can fine-tune the system to the actual shooting conditions.

In addition to filtering out junk images, the quality detector has another application. When we recognize faces in video, for each person in the camera's field of view we receive a sequence of frames - a track, which is longer the longer the person stays in front of the lens. To optimize the system, we do not build a vector for every frame, running the rather “heavy” extractor each time. Instead, using our lightweight network, we select the face with the highest quality score from the entire track and extract a single template from it.
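A minimal sketch of that track logic, with `quality_net` and `extractor` as hypothetical callables standing in for the lightweight quality network and the heavy extractor; the 0.5 threshold is an arbitrary assumption.

```python
def best_template(track_frames, quality_net, extractor, threshold=0.5):
    """Pick the highest-quality face crop from a track and build one template."""
    scored = [(quality_net(face), face) for face in track_frames]
    quality, best_face = max(scored, key=lambda pair: pair[0])
    if quality < threshold:        # the whole track is too poor to trust
        return None
    return extractor(best_face)    # run the expensive extractor only once per track
```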

Here we will not dwell on how to train a good quality detector - this topic deserves a separate large article. To illustrate, here are a few examples of faces and the quality score predicted by the detector for each of them:

[Figure: example faces with the quality score predicted for each]


Working in difficult conditions​

Now it's time for experiments: let's test how the face recognition system behaves under different conditions. For this we have a small test dataset of about 2,000 people; for each person it contains pictures taken with the head turned or tilted, under different lighting, wearing glasses, wearing a medical mask, and so on. We will compare two algorithms: the one we developed and one of the best publicly available implementations, Insightface.

Head position​

Let's evaluate how head rotation affects the quality of recognition. For each person in the dataset we calculate the similarity of two photographs: a frontal one and one in which the head position is changed in one plane or another.

First, let's look at turning left and right:

[Figure: compared image pairs, with similarity and quality histograms for left-right head turns]


The first row shows the pairs of pictures being compared. The top histogram is the distribution of similarities for the two algorithms; ideally we would like the peak of the distribution to be close to one, because every pair compared is the same person. The lower histogram is the distribution of image quality produced by the quality detector for the frontal and rotated faces.

Let's build the same histograms for the head tilted back:

[Figure: similarity and quality histograms for the head tilted back]


...and forward:

[Figure: similarity and quality histograms for the head tilted forward]


The first thing that catches your eye is that the similarity distributions of the two algorithms, proprietary and public, are very different. (We note, however, that from these histograms alone it is impossible to say with certainty which algorithm is better and by how much - we will talk about this a little later).

The second conclusion: scenarios with forward and backward tilts turned out to be harder for both algorithms than left-right turns. This may be partly because in the test sample the side-turn angles are small and the face remains almost fully visible.

We also note that the picture quality in all the cases considered here turned out to be acceptable (the predicted figure was more than 0.5).

Lighting​

What about darkness? Let's see how the algorithms behave in low light conditions:

[Figure: similarity and quality histograms for low-light images]

This example clearly shows how differently a person and a neural network perceive a picture: such darkening greatly interferes with recognition for a human, but for a neural network it is not as critical. The image quality scores also dropped only slightly.

Elements of camouflage​

How will medical masks, sunglasses and hats affect recognition?

[Figure: similarity and quality histograms]


[Figure: similarity and quality histograms]


[Figure: similarity and quality histograms]


You may notice that a headdress that does not cover the face has practically no effect on the operation of the algorithm. This is because the upper part of the head simply never reaches the extractor: it is cropped away by the detector and normalizer, since it is not needed for recognition. This means you can even dye your hair any color - it will have little effect on recognition by the neural network.

But it is clear that medical masks pose a serious challenge - and this is logical: the mask covers most of the face. However, as the histogram shows, the part of the face that remains visible is quite sufficient for modern recognition algorithms.

All at once​

What happens if you put it all on at once?

[Figure: similarity and quality histograms for mask, sunglasses and hat worn together]


This is where the real difficulties begin! Even a strong algorithm is no longer so confident. And if we look at the distribution of image quality, we notice that most of the images are rated unsuitable for recognition.

But let's not stop there - let's complicate the task even more by also degrading the lighting and turning the head:

[Figure: similarity and quality histograms for mask, sunglasses and hat plus low light and head rotation]


For humans, recognition under such conditions is simply impossible, but what about a neural network?

Things are pretty bleak for the neural network too. Firstly, in more than 10% of the images no face could be found at all, so recognition ended before it began. And the pictures in which a face was detected are of such low quality (lower histogram) that adequate operation of the algorithm on them is hardly possible.

Quality control​

Well, within the limits of our dataset we have defeated modern artificial intelligence! There is only one detail left to figure out: how big is the difference between the proprietary and the public algorithm? To get an objective answer, we first need to understand how to measure the “quality” of an algorithm’s work.

(Everything that will be stated later in this chapter is essentially a brief retelling of our article from four years ago.)

In principle, all possible scenarios for biometric matching can be reduced to two - verification and identification:
  1. verification (aka 1:1 comparison) is the comparison of two samples to determine whether they belong to the same person. Verification is performed, for example, when you try to unlock a smartphone with a face image - here the biometric system answers the question of whether it is confident enough that the presented image belongs to the owner of the device;
  2. identification (also known as search, or 1:N matching) involves selecting, from a certain set of candidate samples, those that presumably belong to the same person as the query sample presented to the system. An example is an access control system that releases a magnetic lock when it “sees” a familiar face on the camera.
At the very beginning we already said that a biometric system does not return answers like “yes, it's definitely him” or “no, it's definitely not him”. The result of comparing two vectors is a similarity score measured on the interval [0; 1], and to reduce it to a binary “yes/no” answer we need to introduce a threshold value. If the similarity score in some comparison turns out to be greater than or equal to the threshold, we treat the system's answer as “yes”; if it is lower, as “no”.
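A minimal sketch of the two scenarios, assuming unit-norm templates as described earlier; the threshold value is whatever the operator of a particular system has chosen, not a universal constant.

```python
import numpy as np

def similarity(a, b):
    """Rescaled cosine similarity of two unit-norm templates (as defined above)."""
    return (1 + float(np.dot(a, b))) / 2

def verify(enrolled_template, probe_template, threshold):
    """1:1 comparison: is the probe the same person as the enrolled template?"""
    return similarity(enrolled_template, probe_template) >= threshold

def identify(gallery_templates, probe_template, threshold):
    """1:N search: return (index, score) for gallery candidates whose similarity
    exceeds the threshold, best matches first."""
    scores = np.array([similarity(t, probe_template) for t in gallery_templates])
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]
```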

Several important caveats need to be made about the similarity score. First, you should not treat the similarity score returned by the system as the “quality” of the match. We have had to deal with situations where, comparing several systems, customers said: “System A rated the match between two images as 0.78 and system B as 0.91, and since this is the same person, system B worked better.” Second, the similarity score by itself says nothing at all: one system might return 0.78 to indicate very high confidence in a match, while another might return 0.85 when comparing two different people. Third, similarity cannot be interpreted as a degree of “match” between faces - a result of 0.78 does not mean “this is 78% the same person”. It is simply a number that must be interpreted one way or another depending on the chosen threshold.

The most obvious and most naive approach to assessing the quality of an algorithm would be to measure matching accuracy—the ratio of cases where the system worked correctly to the total number of matching attempts.

Accuracy = (Cases when the system worked correctly) / (Total cases)

What exactly makes it naive, and why is it not good enough? Let's reason. First: it is easy to see that, whatever threshold we set, only four outcomes of a comparison are possible:
  • true match: the calculated similarity score is above a set threshold, and both samples compared actually belong to the same person;
  • true non-match: the calculated similarity score is below a specified threshold, and both samples compared actually belong to different people;
  • false negative (type I error, false non-match): the calculated similarity score is below the established threshold, while both compared samples actually belong to the same person;
  • false positive (type II error, false match): the calculated similarity score is above the established threshold, while both samples being compared actually belong to different people.
Second: with accuracy as the only or main metric, it is difficult to tell what kinds of errors the system made. Suppose I was able to unlock my own smartphone with my face every time (a subjective accuracy of 100%) - but how many impostors and attackers managed to do the same with my smartphone?

We reason further: if we lower the threshold very close to zero, then almost any candidate will produce some kind of match, but most of those matches will be false (there will be many type II errors); if we raise it almost to one, we will keep only the most confident matches while making a large number of type I errors. So by raising or lowering the threshold we can make the system more or less strict. In the case of a smartphone, stricter is better: even if I have to make several attempts, it will be very difficult for an impostor to unlock the device. It is a different matter when we are trying to find a missing person in a huge metropolis: even if there are more false matches, we definitely will not miss the person we are looking for.

But how can we compare two algorithms if one has a more strictly tuned threshold and the other a less strict one? A direct comparison of accuracy in this case is completely meaningless! The solution is simple: instead of measuring the rate of correct comparisons, we measure the algorithm's true-match rate at a fixed rate of type II errors. This metric is denoted TMR@FMR=\alpha, where:
  • TMR - true match rate, the ratio of the number of true matches to the total number of “positive” comparisons performed (those in which both compared samples actually belong to the same person);
  • FMR - false match rate, the ratio of the number of type II errors to the total number of “negative” comparisons performed (those in which the compared samples actually belong to different people);
  • \alpha - the fixed value of FMR at which TMR is measured.
To calculate the metric we again need test data in the same format as the training data. It is important to understand: in the same format, but not the same data - systems need to be tested on data obtained under the conditions in which they will be operated (or as close to them as possible).

We compare all possible pairs of images from the dataset with each other and store the returned similarity scores; since the dataset is labeled, for every comparison performed we know exactly what the result should actually be. Then, for each algorithm under study, we choose the similarity threshold t_{\alpha} at which FMR=\alpha and calculate the frequency of true matches at that threshold, that is, how often “positive” comparisons have similarity greater than t_{\alpha}. The higher it is, the better the algorithm.
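A minimal sketch of this measurement on synthetic scores (the score distributions below are made up purely to show the mechanics, not taken from any real benchmark):

```python
import numpy as np

def tmr_at_fmr(genuine_scores, impostor_scores, alpha):
    """TMR@FMR=alpha: the share of 'positive' (same-person) comparisons accepted
    at the threshold t_alpha where 'negative' comparisons pass with rate alpha."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    t_alpha = np.quantile(impostor, 1 - alpha)   # threshold giving FMR ~= alpha
    return float(np.mean(genuine >= t_alpha)), float(t_alpha)

# Synthetic similarity scores for illustration only.
rng = np.random.default_rng(0)
genuine = np.clip(rng.normal(0.85, 0.05, 20_000), 0, 1)    # same-person pairs
impostor = np.clip(rng.normal(0.50, 0.08, 200_000), 0, 1)  # different-person pairs
tmr, t = tmr_at_fmr(genuine, impostor, alpha=1e-4)
print(f"TMR@FMR=1e-4: {tmr:.4f} at threshold t_alpha = {t:.3f}")
```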

Here are the results (TMR, in percent) we got for one of the NtechLab algorithms and the already mentioned Insightface in different comparison scenarios; each row shows what was compared with what and at which \alpha:

Comparison scenario                               \alpha    NtechLab (private)    Insightface (public)
(turns and lighting) vs (turns and lighting)      3e-6      98.03                 13.7
(mask) vs (mask, rotations and lighting)          3e-5      92.54                 6.68
(frontal face) vs (mask, rotations and lighting)  3e-5      92.24                 0.05

Conclusion​

We have tried to explain in simple terms what facial recognition technology is at this stage of its development, and we hope that everyone can now answer the questions posed at the very beginning on their own: whether ideal systems exist, why curious mistakes happen, and whether it is possible to fool the algorithm with a guarantee.

We will be very glad if our article is useful to you!
 