Performance indicators of biometric algorithms

Tomcat

Our statistical beauties have finished setting up the show and plunged into the sweet, magical, poetic world of summaries, figures, reports, plans and estimates.
(From the film "Office Romance", directed by E. Ryazanov)

In the previous article, "Biometrics in Payments," I reviewed the main technologies used for authentication and identification of a person by face (face recognition). I described the principles of operation of algorithms for finding a face in a photo, recognizing facial features and creating biometric templates. In this article I will dwell in more detail on assessing the quality of solutions for identifying and authenticating a user by face.

Formation of a user database
Let's say a new user wants to register in the system. First, he enters preliminary data on a personal (mobile phone, tablet) or stationary (biometric kiosk) device: possibly full name, a fictitious nickname (which can, for example, be used in loyalty programs), phone number, etc. Then a biometric sample is formed; in the case of facial biometrics, from a photograph or a captured video frame.

Let me briefly describe this process again: first, the area of the photograph containing the person's face is located, then the image of the face is converted into a biometric sample vector.

A sample vector is a point in a multidimensional space, the dimension of which depends on the number of facial characteristics used by the algorithm. After this, the sample vector is sent to the recognition server, where records of user samples are stored. First, an initial comparison of the new sample with the existing ones takes place to check that the user has not yet been registered. If the user is not found, a dossier with a unique identifier is created for him. The new sample is placed in the dossier and represents the first, title sample for this user.

In the future, with each successful identification, the user’s dossier can be replenished with new samples for subsequent use in comparisons.

Every person is subject to changes: he can change his hairstyle, gain or lose weight, he can be tired or, conversely, cheerful. The more user samples in the dossier, the more reliable the results of identification and authentication of the system user will be.
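
As a rough sketch of this enrollment flow (all names here and the duplicate threshold of 0.75 are illustrative assumptions, not taken from any particular product; the distance-to-similarity mapping is discussed further below):

```python
import uuid

import numpy as np


def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Map the Euclidean distance between sample vectors to a (0, 1] score."""
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))


DUPLICATE_THRESHOLD = 0.75  # illustrative value; a real system would tune this


def enroll(new_sample: np.ndarray, dossiers: dict) -> str:
    """Register a sample vector, creating a new dossier only for unknown users."""
    # Initial 1:N check: is this face already registered?
    for user_id, samples in dossiers.items():
        if max(similarity(new_sample, s) for s in samples) >= DUPLICATE_THRESHOLD:
            # Known user: replenish the existing dossier with the new sample.
            samples.append(new_sample)
            return user_id
    # Unknown user: open a dossier; the sample becomes its first, title sample.
    user_id = str(uuid.uuid4())
    dossiers[user_id] = [new_sample]
    return user_id
```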

Identification and authentication problems, similarity indicator
Let us recall the definitions we gave in the previous article:
  • The task of identification is to find an object based on a certain characteristic among many that are similar to each other. This task is also called a 1:N (one-to-many) comparison, where N is the total number of objects being searched.
  • The task of authentication is to confirm that an object is sufficiently similar to its previously recorded image. This task is also called a 1:1 (one-to-one) comparison.
So, both tasks, when applied to facial recognition, involve comparing biometric samples.

The usual result of a single comparison, that is, a comparison of two biometric samples, is a similarity indicator: a value ranging from 0 to 1.

In most cases it is not defined by the solution explicitly; rather, it is the result of calculating the distance (in one of the metrics, for example the usual Euclidean distance) between biometric sample vectors, which in turn are produced by neural networks. The output of a neural network cannot always be easily interpreted, as we saw in the eigenfaces example, so the network itself can be considered a "black box" whose output is a vector of characteristics of the biometric sample.

To put it loosely, you can think of a similarity score as a kind of “probability of similarity,” which is greater the more similar the biometric samples (faces) being compared are.
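
For illustration, here are two common ways a score in [0, 1] can be derived from a distance between embedding vectors; both mappings are assumptions on my part, since the actual formula is specific to each solution:

```python
import numpy as np


def euclidean_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Identical vectors give 1.0; the score decays toward 0 with distance.
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between the vectors, rescaled from [-1, 1] to [0, 1].
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return (cos + 1.0) / 2.0
```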

However, I will note a fundamental difference between the answers produced by identification and authentication algorithms. In the first case, the system outputs a sequence of similarity scores (usually ordered in descending order), also called a vector of similarity indicators. This sequence determines the set of biometric samples that are, from the algorithm's point of view, most similar to the input sample. In other words, it is a list of identifiers of the faces closest to the person being searched for, annotated with similarity scores and ordered by them. Typically it includes about 10 values, for both large and small user bases.

In the second case, the result is a single similarity score. The identification task is thus more complex than the authentication task: in fact, searching the user base during identification means comparing the input sample with every dossier, while authentication compares it with a single dossier.
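
The difference between the two outputs can be sketched as follows (a naive linear scan over the dossiers; similarity() is the hypothetical scoring function from the sketches above, and top_k=10 mirrors the typical list length):

```python
def identify(probe, dossiers, top_k=10):
    """1:N search: return the top_k (user_id, score) pairs, best match first."""
    scores = [
        (user_id, max(similarity(probe, s) for s in samples))
        for user_id, samples in dossiers.items()
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


def authenticate(probe, dossier_samples):
    """1:1 comparison: a single similarity score against one dossier."""
    return max(similarity(probe, s) for s in dossier_samples)
```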

General terminology
Let me remind the reader of the definitions for the main concepts that are used in evaluating decisions:
  • The confidence threshold T is the limit set for the decision algorithm. When the level of similarity of the compared biometric data with the reference data is above T, a decision is made on a match. Otherwise, a decision is made that there is no match.
  • Account creation error rate is the ratio of the number of system failures when trying to create an account to the number of all attempts (Failure-to-enroll rate, FTE).
  • The probability of a false non-match is the ratio of the number of decisions falsely made with a similarity value below T to the number of all decisions made (False Non-Match Rate, FNMR).
  • The probability of a false match is the ratio of the number of decisions falsely made with a similarity value above T to the number of all decisions made (False Match Rate, FMR). The probability of a false match should only be calculated for cases of unintentional similarity, where no effort is made to resemble another person.
  • The probability of false negative identification is the proportion of identification attempts by registered users in which, at a given threshold T, the algorithm returns a list of identifiers that does not include the identifier of the person being identified (False Negative Identification Rate, FNIR).
  • The probability of false positive identification is the proportion of identification attempts by users not registered in the system in which the algorithm nonetheless returns identifiers above the threshold (False Positive Identification Rate, FPIR).
  • A zero-effort attack is a verification attempt in which a person presents their own biometric data, but it is compared against another person's account, without any effort to imitate that person (zero-effort attack).
  • The error trade-off curve is a parametric curve (with parameter T) of false positive versus false negative errors (DET curve).
Evaluating Authentication Algorithms
The main criterion for the quality of authentication algorithms is the FNMR and FMR indicators, that is, the relative amounts of errors of the first and second types. In other words, the system should not refuse real users too often, but it should also not allow malicious attempts and zero-effort attacks.

For testing in a system submitted for certification, a large number of users are registered from a pre-prepared database of samples (persons), and then test images are sent to the input. The algorithm must give a conclusion (similarity parameter) for each presented test image.

The system failure rate is checked, that is, it is checked how often a genuine user will be denied authentication. The decision threshold T should ideally be such that it balances FMR and FNMR as required by the organization planning to use the solution under test. The FNMR indicator is calculated as follows:

FNMR(T)=1 - \frac{1}{N}\sum_{i=1}^NH(u_i - T)

u_i are the components of an N-dimensional vector consisting of the similarity scores demonstrated by the system when presented with images of a real, or genuine, user.

T is the threshold value of the similarity parameter indicator selected by the certification system.

H(x) is the Heaviside unit function. For all negative arguments it takes the value 0, for non-negative arguments it takes the value 1.

Thus, non-zero components of the sum come only from those values of u_i that reach or exceed the threshold value T.

The FMR indicator is calculated in a similar way:

FMR(T)=\frac{1}{N}\sum_{i=1}^NH(v_i - T)

v_i are the components of an N-dimensional vector consisting of the similarity scores demonstrated by the system when presented with an image of a fake user or, in other words, an attacker.
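
Both formulas translate directly into code; a minimal NumPy sketch, with u and v the score vectors defined above:

```python
import numpy as np


def fnmr(u: np.ndarray, T: float) -> float:
    """Share of genuine comparisons whose score falls below the threshold."""
    # np.heaviside(x, 1.0) returns 1.0 at x == 0, matching H as defined above.
    return 1.0 - float(np.mean(np.heaviside(u - T, 1.0)))


def fmr(v: np.ndarray, T: float) -> float:
    """Share of impostor comparisons whose score reaches the threshold."""
    return float(np.mean(np.heaviside(v - T, 1.0)))
```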

The T indicator can vary and take any value in the interval [0;1]. A set of thresholds T is calculated from the quantiles of the observed similarity scores v_i of the attackers' images as follows:
  • To begin with, we select the range of values we are interested in, [FMR_L; FMR_U]
  • We divide this interval with K points:
FMR_k, k \in [1;K]:FMR_k=FMR_L \cdot \Big(\frac{FMR_U}{FMR_L}\Big)^\frac{k}{K}

  • The partition gives a uniform logarithmic scale; the number of points K can vary
  • The threshold T_k is calculated as follows:
T_k=Q_v(1-FMR_k)=\min \{T \in [0;1]:F(T)\geq 1- FMR_k \}

In other words, having performed, for example, 100 tests presenting images of attackers and obtained 100 results (similarity parameters), we have a distribution function for the value v. We choose a value of K, for example 10, build the partition FMR_k on its basis, and for each k calculate the value (1 - FMR_k).

Then, to obtain the T_k values, we need to calculate the quantile function

Q_v: Q_v(1-FMR_k)= \min \{T\in [0;1]:F(T) \geq 1-FMR_k\}

The function F(T) shows the probability that v ≤ T. For example, if from the results of our 100 tests we received a vector of similarity values

\vec{v}=\{0.5, 0.5, \dots, 0.5, 0.51, 0.9\}

(98 values of 0.5, one 0.51 and one 0.9), then F(0.5) = 0.98 and F(0.51) = 0.99, so Q_v(0.98) = {the minimum T such that the probability of v being no higher than T is at least 0.98} = 0.5, and Q_v(0.99) = 0.51.

That is, for each 1 - FMR_k we need to find a threshold T_k such that the number of cases in which v ≤ T_k is at least M(1 - FMR_k), where M is the number of experiments presenting portraits of attackers.
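
Putting the grid construction together (a sketch; fmr_low, fmr_up and K are whatever range and resolution the test lab chooses):

```python
import numpy as np


def threshold_grid(v: np.ndarray, fmr_low: float, fmr_up: float, K: int):
    """Build K thresholds T_k from the quantiles of the impostor scores v."""
    ks = np.arange(1, K + 1)
    # Log-uniform grid of target FMR values between FMR_L and FMR_U.
    fmr_k = fmr_low * (fmr_up / fmr_low) ** (ks / K)
    # T_k = min{T : F(T) >= 1 - FMR_k}; the "inverted_cdf" method
    # (NumPy >= 1.22) evaluates the empirical quantile function exactly.
    t_k = np.quantile(v, 1.0 - fmr_k, method="inverted_cdf")
    return fmr_k, t_k
```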

Next, we build a DET graph in the FNMR(T) and FMR(T) axes. FMR_U tends to 1, and FMR_L decreases as far as the size of the database of attacker portraits allows.

The DET chart allows you to visually assess the stability of the solution as T changes, and also, for example, to determine the threshold value at which errors of the first and second types become equal.
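
A minimal plotting sketch using matplotlib, reusing the fnmr()/fmr() helpers and the threshold grid from above:

```python
import matplotlib.pyplot as plt


def plot_det(u, v, thresholds):
    """Plot the authentication DET curve from genuine (u) and impostor (v) scores."""
    xs = [fmr(v, t) for t in thresholds]   # FMR along the X axis
    ys = [fnmr(u, t) for t in thresholds]  # FNMR along the Y axis
    plt.plot(xs, ys)
    plt.xscale("log")  # the log-uniform FMR grid reads naturally on a log axis
    plt.xlabel("FMR")
    plt.ylabel("FNMR")
    plt.title("DET curve")
    plt.show()
```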

It should be noted separately that a simplified verification procedure is possible: instead of analyzing the whole vector v_i, you can analyze only its first component, in other words, take for evaluation only one similarity score of those issued by the solution, the maximum one. In this case, the threshold value plays the role of a primary filter that discards the worst images; for example, you can set it so that only the highest-quality images are analyzed.

Evaluation of identification algorithms
The main criteria for the quality of identification algorithms are the FPIR and FNIR indicators: a registered user should be recognized by the system easily, without requiring many attempts. On the other hand, the system must generate biometric samples such that different users have vectors of biometric characteristics that are as different as possible.

As in the previous case, for testing in the system submitted for certification, a large number of users are registered from a pre-prepared database of samples (persons), and then test images are sent to the input.

However, the difference is that, as you remember, the output is not a single similarity score but several: essentially a list of identifiers whose owners "most likely" match the presented image.

As with authentication, the choice of the decision threshold (which user matches the presented photo) is important: the algorithm only produces similarity scores, and it is the threshold that determines the decision. Recall that a single experiment consists of comparing a set of images; the ideal result is one where every sample exceeding the similarity threshold belongs to the user being identified, samples not exceeding the threshold belong to other users, and no unregistered user exceeds the similarity threshold.

Here are the formulas for calculating FPIR and FNIR errors:

FPIR = \frac{F_1}{G}; \hspace{1cm}FNIR=\frac{F_2}{G}

F_1 is the number of test instances of unregistered users for which identifiers were nevertheless included in the list of those exceeding the similarity threshold;

F_2 is the number of test instances of registered users whose identifiers were not included in the list of those exceeding the similarity threshold;

G is the number of test instances.

To calculate the FPIR value at a known threshold value, you need to:
  1. Prepare test instances of users who are not registered in the system;
  2. Count the number of erroneous decisions made by the tested algorithm, that is, cases where it returned identifiers with a similarity parameter above the threshold;
  3. Divide by the total number of test instances.
To calculate the FNIR value at a known threshold value, you need to:
  1. Select test instances of users registered in the system;
  2. Count the number of erroneous decisions, that is, cases where the identifier of the presented user was not among those with a similarity parameter above the threshold;
  3. Divide by the total number of test instances (both procedures are sketched in code below).
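A sketch of both computations, reusing the hypothetical identify() helper from earlier; registered_probes are (image, true identifier) pairs, and all names are illustrative assumptions:

```python
def candidates(probe, dossiers, T):
    """Identifiers whose similarity to the probe reaches the threshold T."""
    return {uid for uid, score in identify(probe, dossiers) if score >= T}


def fpir(unregistered_probes, dossiers, T):
    """Share of unregistered probes for which any identifier passed T."""
    errors = sum(1 for p in unregistered_probes if candidates(p, dossiers, T))
    return errors / len(unregistered_probes)


def fnir(registered_probes, dossiers, T):
    """Share of registered probes whose own identifier is missing above T."""
    errors = sum(
        1 for p, true_id in registered_probes
        if true_id not in candidates(p, dossiers, T)
    )
    return errors / len(registered_probes)
```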
The DET curve is constructed by calculating the FPIR and FNIR values at various threshold values in the range from 0 to 1:
  1. First, set T equal to 0;
  2. Calculate the FNIR value at the threshold value T;
  3. Calculate the FPIR value at the threshold value T;
  4. Plot the resulting pair (FPIR; FNIR) as a point on the graph. Note: FPIR is plotted along the X axis, FNIR along the Y axis;
  5. Increase T by 0.005;
  6. If T < 1, return to step 2; otherwise the construction is complete.
The DET curve shows the stability of the solution relative to changes in the threshold value T and allows you to find the optimal values of FPIR and FNIR; a sketch of this sweep follows below.
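
The sweep itself is a straightforward loop (a sketch reusing the fpir()/fnir() helpers above):

```python
import numpy as np


def det_points(registered_probes, unregistered_probes, dossiers, step=0.005):
    """Sweep T over [0, 1) and collect (FPIR, FNIR) points for the DET curve."""
    points = []
    for T in np.arange(0.0, 1.0, step):
        points.append((fpir(unregistered_probes, dossiers, T),
                       fnir(registered_probes, dossiers, T)))
    return points  # X = FPIR, Y = FNIR
```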
Selecting conditions and data sets to test the solution
Now you know the main ways to solve the problem of evaluating biometric solutions for facial recognition, but there are a few additional points worth mentioning.

Firstly, the size of the base of test samples matters. A solution that claims to perform well for small user bases should be tested on small test samples: the number of users in the system in this case may be limited to 1,000 dossiers. Such solutions can be used provided that not many people are registered in the system, for example in a closed environment such as an office.

In cases where attacks by intruders are practically excluded, tests can be carried out using registered users only. In other cases, however, it is necessary to collect and process, in other words organize into dossiers, a significant number (about 100,000 dossiers) of photographs (portraits). It is also necessary to select a database of portraits to be used as unregistered users.

It is desirable that the dossiers (the people who own the portraits) in the first and second cases do not overlap.

These problems were solved at the NSPK: a large number of dossiers were collected, and with the help of Generative Adversarial Networks a set of more than 10,000,000 "non-existent" faces, that is, completely synthetic portraits of people, was created.

Secondly, it is important to note that both open (public) and closed (non-public) datasets should be used for testing. There are several large (on the order of hundreds of thousands of dossiers) data sets that have been in the public domain for some time. They are composed mainly of photographic portraits of celebrities obtained from open sources (video recordings, interviews, newspaper photographs, etc.). Such data could have been used by biometric solution providers to train their neural networks, so solutions trained on it may show better results than they would on biometric data from ordinary users.

Requirements for the quality of solutions
  1. For the authentication algorithm: FMR must not exceed 10^{-4} at FNMR ≤ 10^{-3}
  2. For the identification algorithm on databases of up to 100 thousand persons: FPIR + FNIR ≤ 10^{-4}
  3. When conducting Presentation Attack Detection in class B (video fragments, masks, high-quality photos, device screens), the algorithm must provide FMR ≤ 5·10^{-4} at FNMR ≤ 10^{-3}
Note: FMR, FNMR, FPIR and FNIR values are measured with a relative error of 10% at a confidence level of 80%

Speed requirements:
  1. For the biometric authentication algorithm, the verification time does not exceed 0.2 seconds.
  2. For a biometric identification algorithm with a database of up to 100 thousand individuals, the verification time does not exceed 1 second.
  3. The operating time of the live presence verification algorithms does not exceed 2 seconds.
Links:
[1] https://pages.nist.gov/frvt/reports/11/frvt_11_report.pdf
[2] https://pages.nist.gov/frvt/reports/1N/frvt_1N_report.pdf
[3] ISO/IEC 19795-1
 