How and why American researchers tried to extract sound from a video signal that contains no audio track.
Researchers from two US universities recently published a paper examining the Side Eye attack, a practical way to extract audio data from video shot on a smartphone. One rather non-obvious point is worth clarifying right away. When you record video on your phone, sound is naturally captured along with the image. The authors instead asked whether it is possible to extract sound from the image alone when, for some reason, the source has no audio track. Imagine a video recording of a conversation between two businessmen published on the Internet, with the sound cut out in advance to keep the negotiations private. It turns out that, with some caveats, speech can be reconstructed even from such a silent recording, thanks to an unexpected property of the optical image stabilization system built into most modern smartphones.
Optical stabilization and side-channel attack
An optical stabilizer improves image quality when shooting video and photos. It smooths out hand shake, camera movement while walking, and other unwanted vibrations. To make this work, the camera's image sensor is mounted so that it can move relative to the lens; in some designs, lens elements move as well. The general idea is shown in the image below: when the motion sensors in a smartphone or camera detect movement, the sensor or the lens elements shift so that the final image stays still. As a result, up to a certain limit, small vibrations do not affect the final video recording.
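That "up to a certain limit" can be made concrete with a toy model of the compensation loop. This is a minimal sketch only; the actuator travel limit and shake amplitudes below are invented numbers, not parameters of any real stabilizer.

```python
# Toy model of optical stabilization: the actuator shifts the sensor by
# the opposite of the measured camera motion, but only within its
# mechanical travel limit. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
shake_um = np.cumsum(rng.normal(0, 5, 100))    # simulated hand shake, micrometers
TRAVEL_LIMIT_UM = 100.0                        # how far the actuator can move

correction = np.clip(-shake_um, -TRAVEL_LIMIT_UM, TRAVEL_LIMIT_UM)
residual = shake_um + correction               # motion that still reaches the image

print(f"raw shake, peak displacement: {np.abs(shake_um).max():6.1f} um")
print(f"residual after stabilization: {np.abs(residual).max():6.1f} um")
```

While the shake stays inside the actuator's travel, the residual is zero and the image does not move; anything beyond the limit leaks through.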
Schematic diagram of the optical image stabilization system in modern cameras.
For our purposes, we don't need to understand in detail how such stabilization works. The only important thing is that the camera's elements are movable relative to each other: they can be moved deliberately by miniature actuators, but they can also move on their own as a result of external vibrations, including loud sounds. Imagine that a smartphone is lying on a table next to a speaker, recording video (without sound!). If the speaker plays speech loudly enough, the table vibrates, the phone vibrates with it, and so do the movable elements of the optical stabilizer.
On the recorded video, these vibrations turn into microscopic shaking of the objects in the frame. To the naked eye such shaking is completely invisible, but it can be detected by careful analysis of the video data. Here, however, a problem arises: a typical smartphone shoots video at 30, 60 or, at best, 120 frames per second, which gives us only that many opportunities per second to measure the small displacements of objects in the frame, and that is very little. According to the Kotelnikov theorem (better known in English as the Nyquist-Shannon sampling theorem), an analog signal such as sound can only be reconstructed up to half the frequency at which it is sampled. Measuring the "shaking" of the picture 60 times per second therefore lets us reconstruct sound vibrations of at most 30 hertz, while human speech occupies roughly the 300 to 3400 hertz band. Nothing will work!
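To put numbers on that argument, here is a quick check in Python; a minimal sketch using only the frame rates and the 300-3400 Hz speech band mentioned above:

```python
# Nyquist limit: a sampled signal can only represent frequencies up to
# half the sampling rate, so the frame rate alone is hopeless for speech.
SPEECH_BAND_HZ = (300, 3400)  # approximate frequency range of human speech

for fps in (30, 60, 120):
    nyquist_hz = fps / 2
    status = "reachable" if nyquist_hz >= SPEECH_BAND_HZ[0] else "out of reach"
    print(f"{fps:3d} fps -> frequencies up to {nyquist_hz:5.1f} Hz "
          f"(speech starts at {SPEECH_BAND_HZ[0]} Hz: {status})")
```

Even at 120 frames per second, the limit is 60 hertz, a factor of five below the bottom of the speech band.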
Another feature of digital cameras comes to the rescue: the so-called rolling shutter, or time parallax. Each frame of video is read from the light-sensitive sensor not all at once, but line by line, from top to bottom. As a result, by the time the last line of the image is digitized, fast-moving objects in the frame may already have shifted. The effect is clearly visible if, for example, you record video from the window of a fast-moving train or car: roadside posts appear tilted, although in reality they are perpendicular to the ground. Another typical example is a photo or video of a rapidly spinning airplane propeller.

The relatively slow data readout from the camera's sensor means that the blades have time to move before the frame is completed.
We have already shown such a picture in the post about an interesting way to attack smart card readers. What does this time parallax give us when analyzing microvibrations in video? It greatly increases the number of "samples", that is, the rate at which we can measure the image. If the video is shot with a vertical resolution of 1080 lines, that number gets multiplied by the frame rate (30, 60 or 120): at 60 frames per second, that is almost 65,000 line readouts per second. In other words, we can measure the camera's vibrations tens of thousands of times per second, which is generally enough to reconstruct sound from the video. This is another example of a side-channel attack: exploiting a non-obvious physical property of the system under study that leaks secrets; in this case, the sound that the creators of the video tried to hide from us.
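The arithmetic behind "tens of thousands of times per second" is easy to reproduce; a minimal sketch, where the 1080-line resolution and frame rates come from the text above:

```python
# Rolling shutter as a sampler: if every sensor row is a separate
# measurement, the effective sampling rate is rows-per-frame times
# frames-per-second, and the Nyquist limit rises accordingly.
ROWS = 1080  # vertical resolution of the video

for fps in (30, 60, 120):
    effective_rate = ROWS * fps
    nyquist_hz = effective_rate / 2
    print(f"{fps:3d} fps x {ROWS} rows = {effective_rate:6d} samples/s "
          f"-> frequencies up to {nyquist_hz:8,.0f} Hz")
```

Even at 30 frames per second the effective rate gives a Nyquist limit above 16,000 hertz, comfortably covering the 3400 hertz upper edge of the speech band.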
Difficulties in practical implementation
But don't think that the authors of the study managed to restore clear, intelligible human speech as a result of this complex video signal processing. The graph on the left shows the spectrogram of the original audio recording, in which a person names the digits "zero", "seven" and "nine" in sequence. On the right is the spectrogram of the sound restored from the video. Even here it is clear that the information was recovered with heavy losses. On the project's website, the authors provide actual recordings of the original and reconstructed speech, where you can fully appreciate the shortcomings of this elaborate eavesdropping method: what comes out of the video sounds more like rattling, and it is very hard to guess which digit was spoken. But even such heavily corrupted data can be processed successfully with machine learning: the algorithm is first trained on known pairs of original and reconstructed audio, and then, by analogy, classifies previously unseen reconstructions.
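That training setup can be sketched in a few lines. Everything below is illustrative and assumed, not the authors' pipeline: synthetic noisy tones stand in for the audio reconstructed from video, and a plain logistic regression stands in for their actual models.

```python
# Sketch of learning from (degraded audio, known label) pairs, then
# classifying new degraded recordings. Synthetic data only.
import numpy as np
from scipy.signal import spectrogram
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
FS = 2000  # Hz, assumed sample rate of the reconstructed "audio"

def fake_reconstruction(digit: int) -> np.ndarray:
    """Stand-in for audio recovered from video: a noisy tone whose
    frequency depends on the spoken digit (purely illustrative)."""
    t = np.arange(FS) / FS                       # one second of signal
    tone = np.sin(2 * np.pi * (200 + 50 * digit) * t)
    return tone + 2.0 * rng.standard_normal(FS)  # heavy distortion

def features(x: np.ndarray) -> np.ndarray:
    """Average the spectrogram over time -> a coarse spectral fingerprint."""
    _, _, sxx = spectrogram(x, fs=FS, nperseg=256)
    return sxx.mean(axis=1)

digits = rng.integers(0, 10, size=400)
X = np.array([features(fake_reconstruction(d)) for d in digits])
X_tr, X_te, y_tr, y_te = train_test_split(X, digits, random_state=0)

clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"digit accuracy on held-out 'recordings': {clf.score(X_te, y_te):.0%}")
```

The point of the exercise is only that a classifier trained on labeled degraded examples can label new degraded examples far better than a human listener can.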
Restoring audio from video using the rolling shutter effect.
The algorithm's success was tested on fairly simple tasks rather than on free-form human speech. The results are as follows: in almost 100% of cases the person's gender was determined correctly; in 86% of cases one speaker could be distinguished from another; and in 67% of cases the digit the person named was recognized correctly. And that is under the most favorable conditions, with the phone recording the video lying 10 centimeters from the speaker on a glass tabletop. Swap the glass tabletop for a wooden one and the accuracy begins to drop. Move the phone farther away and it gets worse still. Reduce the volume to that of an ordinary conversation and the accuracy falls to critical levels.
Now let's set the theory aside and try to apply the proposed scenario to reality. We have to rule out all the live "eavesdropping" options right away: if a hypothetical spy can get close to people holding a secret conversation and has a phone in hand, he can simply record the sound with the microphone. We could imagine filming the speakers from afar with a surveillance camera whose microphone cannot pick up the speech, but we would not be able to restore anything from that video either: the researchers tested distances of at most three meters from the speaker, and even at that range the system barely worked (digits were recognized correctly in only about 30% of cases).
So the beauty of this study lies precisely in finding a new "side channel" of information leakage, and future work may well improve the proposed scheme. The authors' main finding is that the image stabilization system in smartphones, which in theory should remove vibrations from the video, sometimes faithfully records them in the final footage. Moreover, the trick works on many modern smartphones: it is enough to train the algorithm on one device, and in most cases it will be able to recognize speech from video recorded on another.
But if we imagine that the proposed "attack" can be improved, then the fact that it works on already-recorded video comes to the fore. One can fantasize about a future in which we download various soundless videos from the Internet and learn what the people near the camera were talking about. Here, however, two more problems await. It was no accident that the authors played the speech through a speaker standing on the same table as the phone: live human speech is much harder to analyze with this kind of "video eavesdropping". And finally, phone videos are usually shot handheld, which introduces additional vibrations. Still, you must agree, it is a beautiful attack. It shows once again how complex modern devices are, and that we should not make assumptions when it comes to privacy. If you are being filmed, do not count on "them replacing the soundtrack later". After all, besides machine learning algorithms, there is also the ancient art of reading words from lip movements.
Source