VoIP wiretapping. How PRISM and BULLRUN extract information from the voice stream

Hacker · Jul 26, 2021

Contents

What an attacker can extract from an encrypted audio stream
Attack on VoIP through side channels
A few words about the DTW algorithm
How HMM machines work
How PHMM machines work
From theory to practice: recognizing the language in which a conversation is taking place
Listening to an encrypted Skype audio stream
And if you disable VBR mode?
Conclusion

To understand how an encrypted conversation can leak information, you first need to understand how voice traffic is transmitted in VoIP. The voice channel in VoIP systems is, as a rule, implemented over UDP and most often uses SRTP (Secure Real-time Transport Protocol), which carries the audio stream compressed by an audio codec and encrypts it. The encryption preserves length: the encrypted stream that comes out is the same size as the compressed audio that goes in. As will be shown below, this seemingly insignificant leak can be used to eavesdrop on "encrypted" VoIP conversations.
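
The length-preserving property can be illustrated with a toy sketch. The cipher below is NOT SRTP: it is a hypothetical keystream cipher built from SHA-256 in counter mode, used only because it behaves like any stream-style cipher with respect to length. The frame contents, key and nonces are made-up values.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def toy_encrypt(key: bytes, nonce: bytes, payload: bytes) -> bytes:
    """XOR the payload with the keystream; the same call also decrypts."""
    stream = keystream(key, nonce, len(payload))
    return bytes(p ^ s for p, s in zip(payload, stream))

# Hypothetical compressed audio frames of different sizes (VBR output).
frames = [b"\x01" * 6, b"\x02" * 20, b"\x03" * 62]
key = b"k" * 16  # assumed demo key, not a real key schedule
encrypted = [toy_encrypt(key, i.to_bytes(4, "big"), f) for i, f in enumerate(frames)]

print("plaintext lengths: ", [len(f) for f in frames])
print("ciphertext lengths:", [len(c) for c in encrypted])
```

The bytes are scrambled, but an observer on the wire still sees exactly the same sequence of lengths that the codec produced.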

What an attacker can extract from an encrypted audio stream
Most audio codecs used in VoIP systems are based on the CELP algorithm (Code-Excited Linear Prediction), whose functional blocks are shown in the figure below. To achieve higher sound quality without increasing the load on the data channel, VoIP software usually runs audio codecs in VBR mode (variable bit rate). This is how the Speex audio codec works, for example.

Functional blocks of the CELP algorithm
What does this mean for privacy? A simple example: Speex in VBR mode encodes sibilant consonants at a lower bit rate than vowels, and moreover it encodes even individual vowel and consonant sounds at their own characteristic bit rates. The graph in the figure below shows the distribution of packet lengths for a phrase rich in sibilants: "Speed skaters sprint to the finish". The deep valleys of the chart fall precisely on the hissing fragments of this phrase. The figure (source) shows the dynamics of the input audio stream, the bit rate and the size of the output (encrypted) packets on a common time scale; the striking similarity between the second and third graphs is visible to the naked eye.
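
The link between bit rate and packet size is simple arithmetic, sketched below. The 20 ms frame duration and the listed bit rates are assumptions chosen for illustration; real codecs add their own framing details.

```python
def frame_bytes(bitrate_bps: int, frame_ms: int = 20) -> int:
    """Bytes needed to hold one codec frame at the given bit rate, rounded up."""
    bits = bitrate_bps * frame_ms // 1000
    return (bits + 7) // 8

# Illustrative bit rates in bits per second (assumed values for this sketch).
for bitrate in (2150, 8000, 15000, 24600):
    print(bitrate, "bps ->", frame_bytes(bitrate), "bytes per frame")
```

Each bit rate maps to its own frame size, so anyone who can see packet sizes can infer which bit rate the codec chose for each fragment of speech.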

How hissing sounds affect packet size
Moreover, if you look at this picture through the prism of the mathematical apparatus of digital signal processing used in speech recognition, such as the PHMM automaton (Profile Hidden Markov Model, an extended version of the hidden Markov model), you can see much more than just the difference between vowels and consonants. This includes the speaker's gender, age, language and emotions.

VoIP side-channel attack
The PHMM automaton is very good at processing numeric sequences, comparing them with each other and finding patterns in them. That is why PHMMs are widely used in speech recognition.

In addition, the PHMM automaton turns out to be useful for listening to an encrypted audio stream, although not directly but through side channels. In other words, a PHMM machine cannot directly answer the question "What phrase is contained in this chain of encrypted audio packets?", but it can answer the question "Is this particular phrase contained at this place in this encrypted audio stream?"

Thus, the PHMM machine can recognize only those phrases on which it was originally trained. However, modern deep learning techniques are powerful enough to train a PHMM machine to the point where the line between the two questions above is virtually erased. To appreciate the full power of this approach, we need to dive a little into the theory.

A few words about the DTW algorithm
Until recently, the DTW (Dynamic Time Warping) algorithm was widely used for speaker identification and speech recognition. It can find similarities between two chains of numbers generated by the same law, even when the chains are generated at different speeds and sit at different places on the timeline. This is exactly what happens when an audio stream is digitized. For example, a speaker might say the same phrase with the same accent, but faster or slower, or with different background noise. This does not prevent the DTW algorithm from finding similarities between the first and second versions. To illustrate, consider two integer chains:

0 0 0 4 7 14 26 23 8 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 6 13 25 24 9 4 2 0 0 0 0 0

If we compare these two chains "head-on", element by element, they look very different. However, if we compare their characteristics, we see clear similarities: each contains a burst of eight non-zero integers, and both have a similar peak value (25-26). A head-on comparison starting from the first element ignores these important characteristics, whereas the DTW algorithm takes them, and other indicators, into account when comparing the two chains. We will not dwell on DTW, however, since today there is a more effective alternative: PHMM machines.
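
A minimal sketch of the textbook DTW recurrence shows the difference between a head-on comparison and a warped one. The two chains are assumed to be 30 elements each, and the absolute difference is used as the local cost; both choices are assumptions of this sketch, not details from the research.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] repeats a match with b
                                     cost[i][j - 1],      # b[j-1] repeats a match with a
                                     cost[i - 1][j - 1])  # both chains advance
    return cost[n][m]

def pointwise_distance(a, b):
    """Head-on comparison: sum of differences at identical positions."""
    return sum(abs(x - y) for x, y in zip(a, b))

chain_a = [0, 0, 0, 4, 7, 14, 26, 23, 8, 3, 2] + [0] * 19
chain_b = [0] * 17 + [5, 6, 13, 25, 24, 9, 4, 2] + [0] * 5

print("head-on:", pointwise_distance(chain_a, chain_b))
print("DTW:    ", dtw_distance(chain_a, chain_b))
```

The head-on distance is large because the two bursts never overlap in time, while the DTW distance stays small because the warping path lines the bursts up with each other.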

Experiments have shown that PHMM machines "recognize" phrases in an encrypted audio stream with 90% accuracy, while the DTW algorithm reaches only 80%. The DTW algorithm, which in its heyday was a popular speech recognition tool, is therefore mentioned here only to show how much better PHMM machines perform by comparison, in particular on an encrypted audio stream. Of course, the DTW algorithm learns much faster than PHMM automata, and this advantage is undeniable. With modern computing power, however, it is not decisive.

The principle of operation of HMM machines
An HMM (just HMM, not PHMM) is a statistical modeling tool that generates chains of numbers by following a finite-state automaton whose transitions are probabilistic, that is, they form a so-called Markov process. The automaton always starts in state B (begin) and finishes in state E (end). The next state is chosen according to the transition function of the current state. As it moves between states, the HMM machine outputs one number at each step, and these numbers form the output chain. When the HMM reaches state E, chain generation ends.
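
The generation process can be sketched as follows. The states S1 and S2, all probabilities and the emitted numbers (standing in for packet lengths) are hypothetical values invented for this illustration.

```python
import random

# Hypothetical model: B and E are the silent begin and end states.
TRANSITIONS = {
    "B": [("S1", 1.0)],
    "S1": [("S1", 0.5), ("S2", 0.5)],
    "S2": [("S2", 0.4), ("E", 0.6)],
}
# Numbers each emitting state can output, with their probabilities.
EMISSIONS = {
    "S1": [(20, 0.7), (28, 0.3)],
    "S2": [(46, 0.6), (62, 0.4)],
}

def pick(weighted, rng):
    """Draw one value from a list of (value, probability) pairs."""
    values = [value for value, _ in weighted]
    weights = [weight for _, weight in weighted]
    return rng.choices(values, weights=weights, k=1)[0]

def generate_chain(rng):
    """Walk from B to E, emitting one number per visited state."""
    state = pick(TRANSITIONS["B"], rng)
    chain = []
    while state != "E":
        chain.append(pick(EMISSIONS[state], rng))
        state = pick(TRANSITIONS[state], rng)
    return chain

rng = random.Random(0)  # fixed seed so that runs are repeatable
for _ in range(3):
    print(generate_chain(rng))
```

An observer sees only the emitted numbers; the sequence of states that produced them stays hidden, which is where the model gets its name.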

An HMM automaton can find patterns in chains that look random from the outside. For example, one study uses this property to find a pattern between a chain of packet lengths and a target phrase whose presence in an encrypted VoIP stream is being checked.

An example of an HMM machine
Although an HMM machine can take a large number of possible paths from point B to point E (in our case, when encoding a single audio fragment), for each specific case, even one as random as a Markov process, there is still a single best path, a single best chain. It is also the most likely candidate to be chosen by the audio codec when encoding the corresponding audio fragment (its uniqueness is expressed, among other things, in the fact that it lends itself to encoding better than the others). Such "best chains" can be found using the Viterbi algorithm.
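
The Viterbi search for the single best path can be sketched on a toy model. The two hidden states ("voiced", "sibilant"), the three packet size classes and every probability below are hypothetical; plain probabilities are multiplied for readability, whereas a practical implementation would work with logarithms to avoid underflow on long chains.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path and its probability."""
    # Each column maps state -> (best probability so far, previous state).
    columns = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, columns[-1][p][0] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score * emit_p[s][obs], prev)
        columns.append(column)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    probability = columns[-1][state][0]
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path, probability

STATES = ["voiced", "sibilant"]
START = {"voiced": 0.6, "sibilant": 0.4}
TRANS = {"voiced": {"voiced": 0.7, "sibilant": 0.3},
         "sibilant": {"voiced": 0.4, "sibilant": 0.6}}
EMIT = {"voiced": {"large": 0.5, "medium": 0.4, "small": 0.1},
        "sibilant": {"large": 0.1, "medium": 0.3, "small": 0.6}}

best_path, best_prob = viterbi(["large", "medium", "small"], STATES, START, TRANS, EMIT)
print(best_path, best_prob)
```

For this toy input the best explanation is two voiced fragments followed by a sibilant one, which matches the intuition that small packets point to hissing sounds.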

In addition, in speech recognition tasks (including recognition from an encrypted data stream, as in our case), it is also useful to be able to calculate how likely it is that a given chain was generated by an HMM machine. A concise solution to this problem has been published; it is based on the forward-backward algorithm and the Baum-Welch algorithm.
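
The forward pass, the first half of the forward-backward algorithm, computes exactly this likelihood. The sketch below covers the forward pass only and reuses a hypothetical two-state model; the state names, size classes and probabilities are invented for illustration.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Probability that the model generates the observed chain (forward pass)."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

STATES = ["voiced", "sibilant"]
START = {"voiced": 0.6, "sibilant": 0.4}
TRANS = {"voiced": {"voiced": 0.7, "sibilant": 0.3},
         "sibilant": {"voiced": 0.4, "sibilant": 0.6}}
EMIT = {"voiced": {"large": 0.5, "medium": 0.4, "small": 0.1},
        "sibilant": {"large": 0.1, "medium": 0.3, "small": 0.6}}

likelihood = forward_likelihood(["large", "medium", "small"], STATES, START, TRANS, EMIT)
print(likelihood)
```

Unlike Viterbi, which keeps only the best path, the forward pass sums over all paths. Baum-Welch training repeatedly uses such sums to adjust the model parameters so that the likelihood of the training chains grows.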

One study used an HMM machine to build a method that identifies the language of a conversation with 66% accuracy. Such low accuracy is not very impressive, so there is a more advanced modification of the HMM machine, the PHMM, which extracts many more patterns from the encrypted audio stream. For example, another study describes in detail how a PHMM machine identifies words and phrases in encrypted traffic with 90% accuracy, a harder task than simply identifying the language of the conversation.

The principle of operation of PHMM machines
A PHMM is an improved modification of the HMM automaton in which, in addition to "match" states (squares with the letter M), there are also "insert" states (diamonds with the letter I) and "delete" states (circles with the letter D). Thanks to these two new kinds of state, PHMM automata, unlike HMM automata, can recognize a hypothetical chain ABCD even if part of it is missing (for example, ABD) or an extra element has been inserted into it (for example, ABXCD). These two innovations are especially useful for recognizing an encrypted audio stream, because the output of the audio codec rarely matches exactly, even when the audio inputs are very similar (for example, when the same person utters the same phrase). Thus, the simplest model of a PHMM automaton consists of three interconnected chains of states ("match", "insert" and "delete").
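
How match, insert and delete states let a profile tolerate missing and extra elements can be sketched with a small Viterbi-style scorer. This is a simplified, hypothetical model: every transition into a match, insert or delete state has the same fixed probability regardless of the source state, match states strongly prefer their profile symbol, and insert states emit uniformly. A trained PHMM would learn all of these values from data.

```python
import math

NEG_INF = float("-inf")
ALPHABET = "ABCDWXYZ"  # assumed symbol set for the toy example

def phmm_viterbi_score(profile, seq, alphabet=ALPHABET,
                       p_hit=0.9, t_match=0.8, t_insert=0.1, t_delete=0.1):
    """Best log-probability of seq under a toy profile built from `profile`."""
    n, k_max = len(seq), len(profile)
    log_hit = math.log(p_hit)
    log_miss = math.log((1 - p_hit) / (len(alphabet) - 1))
    log_ins_emit = math.log(1 / len(alphabet))
    log_tm, log_ti, log_td = math.log(t_match), math.log(t_insert), math.log(t_delete)
    # M/I/D[i][k]: best score after consuming i symbols, ending at profile position k
    M = [[NEG_INF] * (k_max + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (k_max + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (k_max + 1) for _ in range(n + 1)]
    M[0][0] = 0.0  # begin state
    for i in range(n + 1):
        for k in range(k_max + 1):
            if i > 0 and k > 0:   # match: consume a symbol and a profile position
                emit = log_hit if seq[i - 1] == profile[k - 1] else log_miss
                M[i][k] = emit + log_tm + max(M[i-1][k-1], I[i-1][k-1], D[i-1][k-1])
            if i > 0:             # insert: consume a symbol, stay at the same position
                I[i][k] = log_ins_emit + log_ti + max(M[i-1][k], I[i-1][k], D[i-1][k])
            if k > 0:             # delete: skip a profile position silently
                D[i][k] = log_td + max(M[i][k-1], I[i][k-1], D[i][k-1])
    return max(M[n][k_max], I[n][k_max], D[n][k_max])

for candidate in ("ABCD", "ABD", "ABXCD", "WXYZ"):
    print(candidate, round(phmm_viterbi_score("ABCD", candidate), 3))
```

The exact chain scores highest, the chains with one deletion or one insertion score somewhat lower, and an unrelated chain scores far lower than any of them.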

An example of a PHMM machine
However, since in an encrypted audio stream the network packets carrying the target phrase are usually surrounded by other network packets (the rest of the conversation), we need an even more advanced PHMM machine, one that can isolate the target phrase from the surrounding sounds. To this end, five new states are added to the original PHMM machine. The most important of them is "random" (a diamond with the word Random). After training, the PHMM automaton enters this state whenever it receives packets that are not part of the phrase of interest. The states PS (Profile Start) and PE (Profile End) provide the transition between the random state and the profile part of the model. This improved modification of the PHMM automaton can recognize even phrases that the automaton “did not hear” at the training stage.

PHMM machine solves the problem of recognizing encrypted audio stream

From theory to practice: recognizing the language of the conversation
One study presents an experimental setup based on a PHMM automaton that was used to analyze encrypted audio streams containing the speech of 2000 native speakers from 20 different language groups. After training, the PHMM automaton identified the spoken language with an accuracy of 60 to 90%: for 14 of the 20 languages the accuracy exceeded 90%, for the rest, 60%.

The experimental setup shown in the figure below includes two Linux PCs with open source VoIP software. One of the machines acts as a server and listens for SIP calls on the network. After receiving a call, the server automatically answers the subscriber and sets up the voice channel in Speex-over-RTP mode. It should be mentioned here that the control channel in VoIP systems is, as a rule, implemented over TCP and either uses one of the publicly available open protocols (SIP, XMPP, H.323) or has a closed architecture specific to a particular application (as in Skype, for example).

Experimental setup for working with a PHMM automaton
Once the voice channel is set up, the server plays a file to the caller and then terminates the SIP connection. The subscriber, which is the other machine on our local network, makes a SIP call to the server and then uses a sniffer to "listen" to the file played by the server: it captures the chain of network packets with encrypted audio traffic coming from the server. Next, the subscriber either trains the PHMM machine to identify the language of the conversation (using the mathematical apparatus described in the previous sections) or "asks" the PHMM machine what language the conversation is in. As already mentioned, this experimental setup provides a language identification accuracy of up to 90%.
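
The "train, then ask" workflow can be sketched with a much simpler stand-in for the PHMM: a smoothed bigram model over packet lengths, one per language, with the answer chosen by maximum likelihood. This is deliberately not the researchers' model; the language labels and the training chains are invented for illustration.

```python
import math
from collections import Counter

def train_bigram_model(chains, smoothing=1.0):
    """Count transitions between consecutive packet lengths in the training chains."""
    pair_counts, context_counts, vocabulary = Counter(), Counter(), set()
    for chain in chains:
        vocabulary.update(chain)
        for prev, cur in zip(chain, chain[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {"pairs": pair_counts, "contexts": context_counts,
            "vocab_size": len(vocabulary) + 1,  # one extra slot for unseen lengths
            "smoothing": smoothing}

def log_likelihood(model, chain):
    """Log-probability of a chain under a model, with add-one style smoothing."""
    s, v = model["smoothing"], model["vocab_size"]
    return sum(
        math.log((model["pairs"][(prev, cur)] + s) / (model["contexts"][prev] + s * v))
        for prev, cur in zip(chain, chain[1:])
    )

def identify(models, chain):
    """'Ask' step: return the label whose model explains the chain best."""
    return max(models, key=lambda label: log_likelihood(models[label], chain))

# 'Train' step on hypothetical packet length chains for two made-up languages.
models = {
    "language_a": train_bigram_model([[20, 40, 20, 40, 20, 40]] * 3),
    "language_b": train_bigram_model([[30, 30, 50, 50, 30, 30, 50, 50]] * 3),
}
print(identify(models, [20, 40, 20, 40]))
print(identify(models, [30, 30, 50, 50]))
```

The structure is the same whatever the underlying model: one model per language is fitted to captured packet length chains, and an unknown chain is attributed to the model that assigns it the highest likelihood.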

Listening to Skype encrypted audio stream
Another study demonstrates how to use a PHMM machine for an even more difficult task: recognizing an encrypted audio stream generated by Skype (which uses the Opus / NGC audio codec in VBR mode and 256-bit AES encryption). That work uses an experimental setup like the one shown in the picture above, but with Skype's Opus codec.

To train their PHMM automaton, the researchers used the following sequence of steps:

First, they put together a set of soundtracks containing all the phrases they were interested in.
Then they installed a network packet sniffer and initiated a voice conversation between two Skype accounts (this generated encrypted UDP traffic between the two machines, in P2P mode).
Then each of the collected soundtracks was played into the Skype session using a media player, with five-second intervals of silence between tracks.
Meanwhile, the packet sniffer was configured to log all traffic arriving at the second machine in the experimental setup.

After all the training data had been collected, the chains of UDP packet lengths were extracted with an automatic parser for PCAP files. The resulting chains of payload lengths were then used to train the PHMM model with the Baum-Welch algorithm.
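
The extraction step can be sketched with the standard library alone. The parser below assumes the classic PCAP file format with microsecond timestamps, Ethernet framing and IPv4; it is not the researchers' tool. The helper build_demo_capture is a hypothetical function that fabricates a tiny capture in memory so that the sketch can be run without a real capture file.

```python
import struct

def udp_payload_lengths(pcap_bytes):
    """Return the UDP payload length of every IPv4/UDP packet in a classic PCAP capture."""
    magic = struct.unpack_from("<I", pcap_bytes, 0)[0]
    if magic == 0xA1B2C3D4:
        endian = "<"
    elif magic == 0xD4C3B2A1:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond PCAP file")
    if struct.unpack_from(endian + "I", pcap_bytes, 20)[0] != 1:
        raise ValueError("only Ethernet captures are handled in this sketch")
    lengths, offset = [], 24                      # 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        incl_len = struct.unpack_from(endian + "IIII", pcap_bytes, offset)[2]
        frame = pcap_bytes[offset + 16: offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 14 + 20 + 8:
            continue                              # too short for Ethernet + IPv4 + UDP
        if struct.unpack_from("!H", frame, 12)[0] != 0x0800:
            continue                              # not IPv4
        ip_header_len = (frame[14] & 0x0F) * 4
        if frame[14 + 9] != 17:
            continue                              # not UDP
        udp_offset = 14 + ip_header_len
        if len(frame) < udp_offset + 8:
            continue
        udp_len = struct.unpack_from("!H", frame, udp_offset + 4)[0]
        lengths.append(udp_len - 8)               # subtract the 8-byte UDP header
    return lengths

def build_demo_capture(packets):
    """Hypothetical helper: fabricate a capture from (ip_protocol, payload_len) pairs."""
    data = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    for protocol, payload_len in packets:
        ethernet = b"\x00" * 12 + struct.pack("!H", 0x0800)
        ip = (bytes([0x45, 0]) + struct.pack("!H", 28 + payload_len) + b"\x00" * 4
              + bytes([64, protocol]) + b"\x00" * 10)
        transport = struct.pack("!HHHH", 5004, 5004, 8 + payload_len, 0)
        frame = ethernet + ip + transport + b"\xaa" * payload_len
        data += struct.pack("<IIII", 0, 0, len(frame), len(frame)) + frame
    return data

demo = build_demo_capture([(17, 20), (6, 40), (17, 62)])  # UDP, TCP, UDP
print(udp_payload_lengths(demo))
```

With a real capture, the same function would be called on the bytes read from the file, and the resulting list of lengths is the chain that is fed to training.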

And if you turn off VBR mode?
It would seem that such leaks can be prevented by switching audio codecs to constant bit rate mode (although that is hardly a solution, since bandwidth efficiency drops sharply), but even then the security of the encrypted audio stream leaves much to be desired. After all, exploiting the packet lengths of VBR traffic is just one example of a side-channel attack. There are other attacks, such as tracking pauses between words.

The task is, of course, not trivial, but it is quite solvable. Why is it not trivial? Because in Skype, for example, network packets keep flowing even during pauses in the conversation, both to keep UDP working through NAT (network address translation) and to improve the quality of the transmitted voice. This makes pauses in speech harder to identify.

However, an adaptive threshold algorithm has been developed that distinguishes silence from speech with an accuracy of more than 80%. The method is based on the fact that speech activity is highly correlated with the size of the encrypted packets: a voice packet encodes more information while the user is speaking than while the user is silent. In another work (Lella and Bettati, with an emphasis on Google Talk), the speaker is identified even when nothing leaks through packet size, that is, even when VBR mode is disabled. There, the researchers rely on measuring the time intervals between packet arrivals. The method uses silence phases, which are encoded into smaller packets sent at longer intervals, to separate words from each other.
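
The idea of an adaptive threshold can be sketched as follows. This is not the published algorithm: the update rule (two exponential moving averages with the threshold halfway between them), the warm-up window and all packet sizes are assumptions made for illustration.

```python
def detect_speech(packet_sizes, warmup=20, alpha=0.2):
    """Label each packet as speech (True) or silence (False) by its size.

    The threshold adapts: it sits halfway between running estimates of the
    typical silence size and the typical speech size.
    """
    if not packet_sizes:
        return []
    head = packet_sizes[:warmup]
    low, high = float(min(head)), float(max(head))  # initial level estimates
    flags = []
    for size in packet_sizes:
        threshold = (low + high) / 2
        is_speech = size >= threshold
        flags.append(is_speech)
        # Move the matching level estimate towards the observed size.
        if is_speech:
            high = (1 - alpha) * high + alpha * size
        else:
            low = (1 - alpha) * low + alpha * size
    return flags

# Hypothetical encrypted packet sizes: silence, speech, silence, speech.
sizes = [20] * 10 + [60] * 10 + [22] * 10 + [58] * 10
flags = detect_speech(sizes)
print("".join("S" if flag else "." for flag in flags))
```

Runs of silence found this way mark likely word or phrase boundaries. Note that the sketch needs both silence and speech inside the warm-up window; otherwise the two level estimates start out equal.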

Conclusion
As practice shows, even the most modern cryptography cannot protect encrypted VoIP communications from eavesdropping, even when that cryptography is implemented properly, which in itself is unlikely. It is also worth noting that this article analyzes in detail only one mathematical model from digital signal processing (PHMM machines) that proves useful for recognizing an encrypted audio stream (in government intelligence spy software such as PRISM and BULLRUN). But there are dozens and hundreds of such mathematical models. So if you want to keep up with the times, look at the world through the prism of higher mathematics.

INFO
Don't forget about critical thinking. Try talking to Google voice typing, but don't try to pronounce phrases slowly and clearly: speak as usual. Turn on automatic subtitles on YouTube. How do you like the recognition quality for voice data that is available in its original, unaltered form?

The logical conclusion: for all the algorithms listed above, speech recognition results are much worse.