WebRTC for everyone


Contents of this part​

  • What is WebRTC?
    • Why learn WebRTC?
    • The WebRTC protocol is a collection of other technologies
    • Signaling: how peers find each other
    • Connecting and NAT mapping using STUN/TURN
    • Securing the Data Link Using DTLS and SRTP
    • Communication between peers via RTP and SCTP
    • WebRTC - a collection of protocols
    • How does WebRTC (API) work?
  • Signaling
    • How does it work?
    • What is Session Description Protocol (SDP)?
    • Studying SDP
    • SDP keys used in WebRTC
    • Media descriptions in the session description
    • Full example
    • How SDP and WebRTC work together
    • What are offers and answers?
    • Transceivers
    • SDP values used in WebRTC
    • Example of a WebRTC session description
  • Connection
    • How does it work?
    • Real World Constraints
    • Different networks
    • Protocol Limitations
    • Firewall Rules
    • NAT Mapping
    • Creating a NAT Mapping
    • Options for creating a NAT mapping
    • NAT Mapping Filtering Options
    • Updating the NAT Mapping
    • STUN
    • Protocol structure
    • Creating a NAT Mapping
    • NAT Type Determination
    • TURN
    • Lifecycle of TURN
    • Using TURN
    • ICE
    • Creating an ICE Agent
    • Candidate Gathering
    • Connection checks
    • Selection of candidates
    • Restart

What is WebRTC?​


WebRTC (Web Real-Time Communication) is both an API (Application Programming Interface) and a protocol. The WebRTC protocol is a set of rules that allows two WebRTC agents (for example, browsers) to negotiate bidirectional, secure communication in real time. The WebRTC API lets developers use the WebRTC protocol; it is currently defined only for JavaScript.

You may already know a similar protocol/API pair: HTTP and the Fetch API. In that analogy, the WebRTC protocol is HTTP, and the WebRTC API is the Fetch API.

The WebRTC protocol is maintained by the IETF rtcweb working group. The WebRTC API is documented by the W3C as webrtc.

Why learn WebRTC?​


Briefly, WebRTC offers the following features. The list is not exhaustive; these are just examples of interesting characteristics you will encounter while studying WebRTC. Don't worry if some terms are unfamiliar; they are all covered below:
  • Open standard.
  • Different implementations.
  • Availability in browsers.
  • Mandatory encryption.
  • NAT mapping (NAT Traversal).
  • Repurposing existing technologies.
  • Congestion control.
  • Low latency (at the level of fractions of a second, sub-second latency).

The WebRTC protocol is a collection of other technologies​


Establishing a connection with WebRTC involves 4 stages:
  • Signaling.
  • Connection (connection establishment).
  • Security.
  • Communication (interaction).

Transitions between stages occur sequentially. A prerequisite for starting the next stage is the successful completion of the previous one.

An interesting fact about WebRTC is that each stage is made up of many other protocols. Many existing technologies were combined to create WebRTC. In this sense, WebRTC is a combination and configuration of well-known technologies that date back to the early 2000s.

Each of the listed 4 stages is devoted to a separate section, but first let's look at them at the highest level.
Since the stages depend on each other, this will make it easier to explain and understand the purpose of each of them later.

Signaling: how peers find each other​


When a WebRTC agent starts, it does not know with whom it will communicate or what the communication will be about. Signaling solves this. It is the first stage, and its purpose is to prepare the call so that two WebRTC agents can begin communicating.

Signaling uses an existing protocol, SDP (Session Description Protocol). SDP is a plain-text protocol. Each SDP message consists of key/value pairs and contains a list of "media sections". The SDP exchanged between two WebRTC agents contains the following information:
  • the IPs and ports the agent is reachable on (candidates);
  • how many audio and video tracks the agent wants to send;
  • which audio and video codecs each agent supports;
  • values used during the connection process (uFrag/uPwd);
  • values used for security (certificate fingerprint).

Note that signaling usually happens outside of WebRTC: WebRTC itself is not used to transmit signaling messages. Technologies such as REST endpoints, WebSockets, or authentication proxies can be used to exchange the SDP between the peers.

Connecting and NAT mapping using STUN/TURN​


During signaling, the WebRTC agents receive enough information to attempt a connection. For this, another technology called ICE is used.

ICE (Interactive Connectivity Establishment) is another protocol that predates WebRTC. ICE allows a connection to be established between two agents. The agents may be on the same local network or on opposite sides of the world. ICE is the solution for establishing a direct connection without a central server.

The real magic here is NAT mapping and STUN/TURN servers. That is all ICE needs to communicate with an agent on a different subnet.

Once the ICE connection succeeds, WebRTC begins establishing an encrypted transport over it. This transport is used to carry audio, video, and other data.

Securing the Data Link Using DTLS and SRTP​


Once we have a bidirectional communication channel (established via ICE), we need to make it secure. This is done with two protocols, both developed long before WebRTC. The first is DTLS (Datagram Transport Layer Security), which is essentially TLS over UDP; TLS is the cryptographic protocol used for secure communication over HTTPS. The second is SRTP (Secure Real-time Transport Protocol), used for secure real-time media transfer.

First, WebRTC performs a DTLS handshake over the ICE connection. Unlike HTTPS, WebRTC does not use a central authority to verify certificates. Instead, it verifies that the certificate exchanged over DTLS matches the fingerprint (covered in detail in the security section) exchanged during signaling. The DTLS connection is later used to carry DataChannel messages.

For audio/video, another protocol called RTP (Real-time Transport Protocol) is used. Packets transmitted over RTP are protected using SRTP. The SRTP session starts by extracting its keys from the established DTLS session.

Congratulations! If the previous steps completed successfully, we have a bidirectional, secure channel. If the connection between the WebRTC agents is stable, they can begin exchanging data. Unfortunately, in the real world we constantly face packet loss and limited bandwidth, which we will discuss in a later section.

Communication between peers via RTP and SCTP​


So, we have established a secure, bidirectional connection between two WebRTC agents, and we can finally start communicating! Two protocols are used for this: RTP (Real-time Transport Protocol) and SCTP (Stream Control Transmission Protocol). RTP is used to transmit media encrypted with SRTP, and SCTP is used to send and receive DataChannel messages encrypted with DTLS.

RTP is quite minimal, but it provides everything needed for real-time streaming. Its flexibility allows developers to deal with latency, data loss, and congestion in a variety of ways.

The last protocol in the stack is SCTP. It offers many options related to message delivery. For example, we may sacrifice reliability and in-order delivery in favor of low latency, which is critical for real-time communication.

WebRTC - a collection of protocols​


At first glance, WebRTC may seem over-engineered. But it can be forgiven for this, since it solves a large number of problems. The genius of WebRTC lies in its modesty: it does not try to solve everything itself. Instead, it combines many existing, specialized technologies into a single whole.

This allows each part to be explored and studied separately. A fitting way to think of WebRTC is as an orchestrator of many other protocols.



How does WebRTC (API) work?​


In this section we will talk about how the WebRTC protocol is exposed through the JavaScript API. This is not meant to be exhaustive, merely an attempt to paint a general picture of what happens during real-time communication.

new RTCPeerConnection

RTCPeerConnection is the top-level "WebRTC session". It brings together all of the protocols mentioned above. At this point the necessary subsystems are set up, but nothing is happening yet.

addTrack

addTrack creates a new RTP stream. A random Synchronization Source (SSRC) is generated for this stream. The stream appears inside the session description generated by createOffer, in a media section. Every call to addTrack creates a new SSRC and a new media section.

Once the SRTP session is established, these media packets are encrypted and transmitted over ICE.

createDataChannel

createDataChannel creates a new SCTP stream if no SCTP association exists yet. SCTP is disabled by default and is only enabled when one of the sides requests a data channel.

Once the DTLS session is established, these data packets are encrypted and transmitted over ICE.
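
As a quick illustration, here is a minimal sketch of creating and using a data channel (the channel label is arbitrary):

Code:
// Minimal data channel sketch.
const pc = new RTCPeerConnection();
const dc = pc.createDataChannel('chat');

dc.onopen = () => {
  // Fires once the SCTP association over DTLS is ready.
  dc.send('hello from the data channel');
};

dc.onmessage = (event) => {
  console.log('received:', event.data);
};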

createOffer

createOffer generates a session description of the local state, to be transmitted to the remote peer.

Calling createOffer does not change anything for the local peer.

setLocalDescription

setLocalDescription commits the requested changes. addTrack, createDataChannel, and similar calls are only provisional until setLocalDescription is called. setLocalDescription is called with the value generated by createOffer.

setRemoteDescription

setRemoteDescription is how we inform the local agent about the remote peer's state. This is how the act of signaling is performed with the JavaScript API.

Once both sides have called setRemoteDescription, the WebRTC agents have enough information to begin Peer-to-Peer (P2P) communication.

addIceCandidate

addIceCandidate allows a WebRTC agent to add remote ICE candidates at any time. This API passes the candidate directly to the ICE subsystem and has no other effect on the overall connection.

ontrack

ontrack is a callback that fires when an RTP packet is received from the remote peer. The incoming packets will have been declared in the session description that was passed to setRemoteDescription.

oniceconnectionstatechange

oniceconnectionstatechange is a callback that fires when the state of the ICE agent changes. This is how we are notified when ICE connectivity is established or lost.

onconnectionstatechange

onconnectionstatechange reflects the combined state of ICE and DTLS. We can use this callback to be notified when both ICE and DTLS have completed successfully.
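
To tie the calls above together, here is a minimal sketch of the caller's side. The sendToRemotePeer helper is a hypothetical signaling function (WebSocket, REST, etc.), not part of WebRTC itself:

Code:
// Caller-side sketch: sendToRemotePeer() is an assumed signaling helper.
async function startCall(sendToRemotePeer) {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: ['stun:stun.l.google.com:19302'] }],
  });

  // Add local media; each track creates an SSRC and a media section.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // Callbacks described above.
  pc.ontrack = (event) => console.log('remote track:', event.track.kind);
  pc.onicecandidate = (event) => {
    if (event.candidate) sendToRemotePeer({ candidate: event.candidate });
  };
  pc.onconnectionstatechange = () => console.log('state:', pc.connectionState);

  // createOffer + setLocalDescription, then signal the offer.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToRemotePeer({ offer });

  // Later, when the answer and remote candidates arrive over signaling:
  //   await pc.setRemoteDescription(answer);
  //   await pc.addIceCandidate(candidate);
  return pc;
}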

Signaling​


When a WebRTC agent is created, it knows nothing about the other peer. It has no idea who it will connect to or what they will exchange. Signaling is the preparation for making the call. Once the necessary information has been exchanged, the agents can communicate with each other directly.

The messages exchanged during signaling are just text. The WebRTC agents do not care how they are transported. They are typically sent over WebSockets, but that is not a requirement.
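
For example, a minimal signaling channel over WebSockets might look like the sketch below; the server URL and the message shape are assumptions, since WebRTC does not prescribe them:

Code:
// Hypothetical signaling transport: WebRTC does not care what is used here.
const signaling = new WebSocket('wss://signaling.example.com'); // assumed URL

function sendSignal(message) {
  signaling.send(JSON.stringify(message)); // e.g. { offer }, { answer }, { candidate }
}

signaling.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.offer) {
    // hand to RTCPeerConnection.setRemoteDescription(...)
  } else if (msg.candidate) {
    // hand to RTCPeerConnection.addIceCandidate(...)
  }
};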

How does it work?​


WebRTC uses the SDP protocol. Through it, the two agents exchange the state needed to establish a connection. The protocol itself is easy to read; the difficulty comes from understanding the values that WebRTC puts into it.

This protocol is not specific to WebRTC. First we will look at SDP in isolation, and then at how it is used in WebRTC.

What is Session Description Protocol (SDP)?​


The Session Description Protocol is defined in RFC 8866. It consists of key/value pairs. Each pair is on a separate line. It is similar to an INI file. The session description consists of 0 or more media descriptions. You can think of a session description as an array of media descriptions.

A media description usually refers to a specific stream of media data. So if we want to describe a call containing 3 video streams and 2 audio streams, we will have 5 media descriptions.

Studying SDP​


Each line in a session description begins with a single character: the key. Then comes an equals sign. Everything after the equals sign, up to the end of the line, is the value.

SDP defines which keys are valid. Only letters of the Latin alphabet may be used as keys, and each key has a specific meaning.

Let's look at a small piece of the session description:

Code:
a=first-value
a=second-value

We have 2 lines and they both start with the key a. The value of the first line is first-value, and the second is second-value.
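
Since the format is just one key=value pair per line, a toy parser takes only a few lines (a sketch for illustration, not a complete SDP parser):

Code:
// Toy SDP parser: split each line into a one-character key and a value.
function parseSdp(sdp) {
  return sdp
    .split(/\r?\n/)
    .filter((line) => line.includes('='))
    .map((line) => ({ key: line[0], value: line.slice(2) }));
}

console.log(parseSdp('a=first-value\na=second-value'));
// [ { key: 'a', value: 'first-value' }, { key: 'a', value: 'second-value' } ]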

SDP keys used in WebRTC​


Not all keys defined by SDP are used in WebRTC. Only the keys used by the JavaScript Session Establishment Protocol (JSEP), defined in RFC 8829, matter. For now it is enough to understand the following 7 keys:
  • v— version (version) 0;
  • o— origin, a unique identifier useful for re-establishing a connection;
  • s— session name ( -);
  • t— timing ( 0 0);
  • m— media description ( m=<media> <port> <proto> <fmt> ...);
  • a— attribute, free text field. The most common key;
  • c— connection information ( IN IP4 0.0.0.0).

Media descriptions in the session description​


A session description can consist of an unlimited number of media descriptions.

A media description contains a list of formats. These formats map to RTP payload types. The actual codec is defined by an rtpmap attribute in the media description. Each media description can contain an unlimited number of attributes.

Let's look at one more piece of the session description:

Code:
v=0
m=audio 4000 RTP/AVP 111
a=rtpmap:111 OPUS/48000/2
m=video 4000 RTP/AVP 96
a=rtpmap:96 VP8/90000
a=my-sdp-value

We have two media descriptions: one of type audio with format 111, and one of type video with format 96. The first media description has a single attribute, which maps payload type 111 to Opus. The second has two attributes: the first maps payload type 96 to VP8, and the second carries a custom value, my-sdp-value.

Full example​


The following example contains all of the SDP keys used by WebRTC:

Code:
v=0
o=- 0 0 IN IP4 127.0.0.1
s=-
c=IN IP4 127.0.0.1
t=0 0
m=audio 4000 RTP/AVP 111
a=rtpmap:111 OPUS/48000/2
m=video 4002 RTP/AVP 96
a=rtpmap:96 VP8/90000

  • v, o, s, c, and t are defined, but they do not affect the WebRTC session;
  • there are two media descriptions: one of type audio and one of type video;
  • each of them has a single attribute, which configures details of the RTP pipeline.

How SDP and WebRTC work together​


The next piece of the puzzle is understanding how WebRTC uses SDP.

What are offers and answers?​


WebRTC uses the offer/answer model: one agent makes an "offer" to start communicating, and the other agent "answers" with whether it is willing to accept what has been offered.

This allows the responder to reject unsupported codecs specified in the media descriptions. In this way, peers determine what data formats they will exchange.
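
On the answering side, the same API is used in mirror image. A minimal sketch, again with a hypothetical signaling helper:

Code:
// Answerer-side sketch: sendToRemotePeer() is an assumed signaling helper.
async function acceptOffer(pc, remoteOffer, sendToRemotePeer) {
  await pc.setRemoteDescription(remoteOffer); // apply the received offer
  const answer = await pc.createAnswer();     // describe what we are willing to accept
  await pc.setLocalDescription(answer);       // commit it locally
  sendToRemotePeer({ answer });               // send the answer back over signaling
}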

Transceivers​


Transceivers are a concept specific to the WebRTC API. Their main job is to expose the "media description" through the JavaScript API. Every media description becomes a transceiver, and every time a transceiver is created, a new media description is added to the local session description.

Each media description in WebRTC has a direction attribute. It allows an agent to declare things like "I'm going to send you this codec, but I don't want to receive anything back." The valid direction values are listed below, followed by a short sketch of setting one via the API:
  • sendonly (only sending);
  • recvonly (only receiving);
  • sendrecv (sending and receiving);
  • inactive (neither sending nor receiving).
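
A minimal sketch of declaring a direction through the API:

Code:
const pc = new RTCPeerConnection();

// Add a video transceiver that only sends; the generated media description
// will carry a=sendonly.
const transceiver = pc.addTransceiver('video', { direction: 'sendonly' });

// The direction can be changed later, e.g. to stop sending entirely:
transceiver.direction = 'inactive';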

SDP values used in WebRTC​


Below is a list of some common session description attributes used by agents. Many of these values are controlled by subsystems that we haven't discussed yet (but will discuss soon).

group:BUNDLE

Bundling means running several kinds of traffic over a single connection. Some WebRTC implementations allocate a separate connection for each media stream; bundling is preferred.

fingerprint:sha-256

This is the hash of the certificate the peer uses for DTLS. After the DTLS handshake completes, we compare this fingerprint against the actual certificate to confirm that we are talking to the peer we expect.
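
As an illustration, the fingerprint can be pulled out of a session description with a little string matching (a sketch; pass it something like pc.localDescription once it is set):

Code:
// Extract the a=fingerprint line from a session description (illustrative sketch).
function getFingerprint(sessionDescription) {
  const match = sessionDescription.sdp.match(/^a=fingerprint:(\S+)\s+(\S+)/m);
  return match ? { algorithm: match[1], value: match[2] } : null;
}

// e.g. getFingerprint(pc.localDescription)
// -> { algorithm: 'sha-256', value: '0F:74:31:...' }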

setup:

Controls the DTLS agent's behavior: it determines whether the agent acts as a client or as a server once ICE has connected. Possible values:
  • setup:active - run as the DTLS client;
  • setup:passive - run as the DTLS server;
  • setup:actpass - ask the other WebRTC agent to choose.

ice-ufrag

The user fragment value for the ICE agent. It is used to authenticate ICE traffic.

ice-pwd

The password for the ICE agent. It is used to authenticate ICE traffic.

rtpmap

Maps a specific codec to an RTP payload type. Payload types are dynamic, so for each call the offerer chooses a payload type for each codec.

fmtp

Defines additional payload type values. Can be used to configure video profile or encoding.

candidate

An ICE candidate produced by the ICE agent: one of the addresses at which the WebRTC agent is reachable.

ssrc

Defines a specific media stream track.

label is the identifier of an individual stream (track). mslabel is the identifier of a container that can hold multiple streams.

Example session description​


A full session description generated by a WebRTC client:
Code:
v=0
o=- 3546004397921447048 1596742744 IN IP4 0.0.0.0
s=-
t=0 0
a=fingerprint:sha-256 0F:74:31:25:CB:A2:13:EC:28:6F:6D:2C:61:FF:5D:C2:BC:B9:DB:3D:98:14:8D:1A:BB:EA:33:0C:A4:60:A8:8E
a=group:BUNDLE 0 1
m=audio 9 UDP/TLS/RTP/SAVPF 111
c=IN IP4 0.0.0.0
a=setup:active
a=mid:0
a=ice-ufrag:CsxzEWmoKpJyscFj
a=ice-pwd:mktpbhgREmjEwUFSIJyPINPUhgDqJlSd
a=rtcp-mux
a=rtcp-rsize
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1
a=ssrc:350842737 cname:yvKPspsHcYcwGFTw
a=ssrc:350842737 msid:yvKPspsHcYcwGFTw DfQnKjQQuwceLFdV
a=ssrc:350842737 mslabel:yvKPspsHcYcwGFTw
a=ssrc:350842737 label:DfQnKjQQuwceLFdV
a=msid:yvKPspsHcYcwGFTw DfQnKjQQuwceLFdV
a=sendrecv
a=candidate:foundation 1 udp 2130706431 192.168.1.1 53165 typ host generation 0
a=candidate:foundation 2 udp 2130706431 192.168.1.1 53165 typ host generation 0
a=candidate:foundation 1 udp 1694498815 1.2.3.4 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0
a=candidate:foundation 2 udp 1694498815 1.2.3.4 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0
a=end-of-candidates
m=video 9 UDP/TLS/RTP/SAVPF 96
c=IN IP4 0.0.0.0
a=setup:active
a=mid:1
a=ice-ufrag:CsxzEWmoKpJyscFj
a=ice-pwd:mktpbhgREmjEwUFSIJyPINPUhgDqJlSd
a=rtcp-mux
a=rtcp-rsize
a=rtpmap:96 VP8/90000
a=ssrc:2180035812 cname:XHbOTNRFnLtesHwJ
a=ssrc:2180035812 msid:XHbOTNRFnLtesHwJ JgtwEhBWNEiOnhuW
a=ssrc:2180035812 mslabel:XHbOTNRFnLtesHwJ
a=ssrc:2180035812 label:JgtwEhBWNEiOnhuW
a=msid:XHbOTNRFnLtesHwJ JgtwEhBWNEiOnhuW
a=sendrecv

Here's what we should understand from this message:
  • we have two media sections: one for audio and one for video;
  • both are sendrecv transceivers: we can receive two streams and send two streams in return;
  • we have ICE candidates and authentication details, so we can attempt to establish a connection;
  • we have a certificate fingerprint, which allows the call to be secured.

Connection​


Most applications developed today use a client-server architecture. That architecture assumes a server with a known, stable transport address (IP and port). The client sends a request, and the server responds to it.

WebRTC uses a different architecture: peer-to-peer (P2P). In a P2P network, the task of establishing a connection is shared between the peers, because transport addresses cannot be known in advance and may change during a session. WebRTC gathers all the information it can and goes to great lengths to achieve bidirectional communication between the agents.

Establishing such a connection is not easy. The agents may be on different networks with no direct connectivity, and even when direct connectivity exists there can be other problems: the clients may use different protocols (UDP <-> TCP) or different IP versions (IPv4 <-> IPv6).

Despite these difficulties, WebRTC offers some advantages over a client-server architecture.

Reducing the size of transferred data

Since data is exchanged directly between agents, we do not need to “pay” for a server dedicated to relaying (redirecting) this data.

Reduced latency

Direct communication is faster. When the user is forced to send data through the server, relaying increases latency.

Increased security

Direct communication is more secure. Since the server is not involved in the data transfer, users can be sure that the data will not be decrypted until it reaches the destination.

How does it work?​


The process described above is called Interactive Connectivity Establishment (ICE).

ICE is a protocol that tries to find the best way to connect two agents. Each agent publishes the paths by which it can be reached; these paths are called candidates. A candidate is essentially a transport address at which the agent believes the other agent can reach it. ICE then determines the best pair of candidates.

To appreciate why ICE is needed, we have to understand the difficulties that stand in the way of establishing a connection.

Real World Constraints​


The main purpose of ICE is to overcome the constraints that the real world imposes on a connection. Let's briefly go over them.

Different networks​


In most cases the other agent will not even be on the same network; typically, a connection has to be established between agents on different networks.

Below is a diagram of two independent networks connected via the public Internet, with two hosts on each network.

[Diagram: two private networks, each behind its own router (Router A and Router B), connected via the public Internet]


For hosts on the same network, establishing a connection is simple: communication from 192.168.0.1 to 192.168.0.2 is easy to set up, and such hosts can talk to each other without any outside help.

However, a host behind Router B has no way to reach a host behind Router A directly. How would you tell the difference between 192.168.0.1 behind Router A and the same IP behind Router B? These IPs are private! A host behind Router B could send data to Router A, but the request would go nowhere: how would Router A know which of its hosts the message is meant for?

Protocol Limitations​


Some networks forbid UDP traffic entirely; others forbid TCP. Some networks have a very low Maximum Transmission Unit (MTU). There are many network settings that can make communication difficult, to say the least.

Firewall Rules​


Another problem is "Deep Packet Inspection" and other filtering of network traffic. Some network administrators apply such software on a per-packet basis. In many cases, this software does not understand WebRTCand considers packets WebRTCto be suspicious packets UDPtransmitted over a port that is not included in the whitelist .

NAT Mapping​


NAT (Network Address Translation) mapping is the magic that makes WebRTC connectivity possible. It is what allows two peers on different networks to communicate. Let's look at how NAT mapping works.

A NAT mapping involves no relay, proxy, or server. We have Agent 1 and Agent 2 on different networks, and yet they can exchange data. This is what it looks like:

[Diagram: Agent 1 (192.168.0.1:7000, mapped to 5.0.0.1:7000 by its NAT) exchanging packets directly with Agent 2]


To make this communication possible, we rely on a NAT mapping. Agent 1 uses port 7000 to establish a WebRTC connection with Agent 2; 192.168.0.1:7000 is mapped to 5.0.0.1:7000. This allows Agent 2 to reach Agent 1 by sending packets to 5.0.0.1:7000. Creating a NAT mapping is like an automatic version of port forwarding on a router.

The downside is that there is no single form of NAT mapping (unlike, say, static port forwarding), and the behavior varies between networks. ISPs and hardware manufacturers implement it in different ways, and in some cases network administrators may disable it altogether.

The good news is that the full range of NAT behaviors is well understood and observable, which lets an ICE agent confirm that it has created a NAT mapping and learn the mapping's attributes.

The document describing this process is RFC 4787.

Creating a NAT Mapping​


Creating a mapping is the easy part. A mapping is created whenever we send a packet to an address outside our network. A NAT mapping is just a temporary public IP and port allocated by our NAT. Outgoing messages are rewritten so that the mapping becomes their source address, and messages sent to the mapping are automatically forwarded to the host behind the NAT that created it. Things get complicated when it comes to the details of how mappings are created.

Options for creating a NAT mapping​


Mapping creation falls into three categories.

Endpoint-independent mapping

One mapping is created inside the NAT for each local sender. If we send two packets to two different remote addresses, the same mapping is used for both. Both remote hosts see the same source IP and port, and their responses are delivered to the same local listener.

This is the best case. For a call to work peer-to-peer, at least one side must have this type of NAT.

Address-dependent mapping

A new mapping is created every time we send a packet to a new address. If we send two packets to two different hosts, two mappings are created. If we send two packets to the same remote host but on different ports, only ONE mapping is created.

Address- and port-dependent mapping

A new mapping is created whenever the remote IP or port differs. If we send two packets to the same remote host on different ports, two mappings are created.

NAT Mapping Filtering Options​


Mapping filtering describes the rules for who is allowed to use a mapping. There are three options:

Endpoint-independent filtering

Anyone can use the mapping. We can share it with other peers, and they can all send traffic to it.

Address-dependent filtering

The mapping can only be used by hosts we have already sent traffic to. If we send a packet to host A, it can send back as many packets as it likes; if host B sends a packet to the mapping, it is ignored.

Address- and port-dependent filtering

The mapping can only be used by the exact host and port we sent traffic to. If we send a packet to A:5000, it can respond with any number of packets, but a packet sent from A:5001 is ignored.

Updating the NAT Mapping​


The recommendation is to destroy a mapping that has gone unused for five minutes, but this is up to ISPs and hardware manufacturers.

STUN​


Session Traversal Utilities for NAT (STUN) is a protocol created for working with NATs. It is defined in RFC 8489, which also defines the STUN packet structure. STUN is also used by ICE and TURN.

STUN lets us work with NAT mappings programmatically. Before STUN, we could create a mapping but had no way of learning the IP and port it was given. STUN not only lets us create a mapping, it also reports its details, so we can share them with others who can then send traffic to us through it.

Let's start with a basic description of STUN; later we will look at how it is used by ICE and TURN. We will walk through the request/response process for creating a mapping and how to learn its IP and port. This process happens when we have a stun: server in the ICE urls of a WebRTC PeerConnection (for example, new RTCPeerConnection({ iceServers: [{ urls: ['stun:stun.l.google.com:19302'] }] })). STUN helps an endpoint behind a NAT learn the details of the mapping that was created, by asking the STUN server to report what it observes from the outside.
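
A small sketch of watching the candidates that result (browser API; the STUN URL is the public Google server used above). Server reflexive ("srflx") candidates are the ones built from the STUN server's Binding Response:

Code:
// Gather ICE candidates and log them.
const pc = new RTCPeerConnection({
  iceServers: [{ urls: ['stun:stun.l.google.com:19302'] }],
});

pc.onicecandidate = (event) => {
  if (event.candidate) {
    // "host" candidates are local addresses; "srflx" candidates are our NAT mapping
    // as seen by the STUN server.
    console.log(event.candidate.type, event.candidate.candidate);
  }
};

// Candidate gathering only starts once a local description is set,
// so create a data channel and an offer to kick it off.
pc.createDataChannel('probe');
pc.createOffer().then((offer) => pc.setLocalDescription(offer));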

Protocol structure​


Each STUN packet has the following structure:

Code:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0 0|     STUN Message Type     |         Message Length        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Magic Cookie                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                     Transaction ID (96 bits)                  |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              Data                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

STUN Message Type

The type of the packet. For now we are interested in:
  • Binding Request - 0x0001;
  • Binding Response - 0x0101.

To create a NAT mapping we send a Binding Request, and the server replies with a Binding Response.

Message Length

The length of the Data section. The Data section carries the content defined by the Message Type.

Magic Cookie

The fixed value 0x2112A442 in network byte order. It helps distinguish STUN traffic from other protocols.

Transaction ID

A 96-bit identifier that uniquely identifies a transaction. It lets the client match each response to its request.

Data

The Data section contains a list of STUN attributes. A STUN attribute has the following structure:

Code:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Type              |            Length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Value (variable)                  ....
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A STUN Binding Request uses no attributes; the request consists only of a header.

A STUN Binding Response uses XOR-MAPPED-ADDRESS (0x0020). This attribute contains the IP and port of the NAT mapping.
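
To make the layout concrete, here is a sketch that builds the 20-byte header of a Binding Request using only the fields described above (sending it over UDP is left out):

Code:
// Build a 20-byte STUN Binding Request header (sketch).
function buildBindingRequest() {
  const packet = new Uint8Array(20);
  const view = new DataView(packet.buffer);

  view.setUint16(0, 0x0001);      // STUN Message Type: Binding Request
  view.setUint16(2, 0x0000);      // Message Length: no attributes
  view.setUint32(4, 0x2112a442);  // Magic Cookie, network byte order
  crypto.getRandomValues(packet.subarray(8, 20)); // 96-bit Transaction ID

  return packet;
}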

Creating a NAT Mapping​


Creating a NAT mapping with STUN costs just a single request. We send a Binding Request to the STUN server, and it returns a Binding Response containing a Mapped Address. The Mapped Address is how the STUN server sees our NAT mapping; it is what another party can use to send data to us.

The Mapped Address is our public IP, or in WebRTC terminology, our Server Reflexive Candidate.

NAT Type Determination​


Unfortunately, the Mapped Address cannot be used in every case. If the NAT is address dependent, only the STUN server can send traffic back to us; anything another party sends to the Mapped Address is dropped, which makes it useless. That problem is solved by having the server relay packets to the peer on our behalf. This solution is called TURN.

RFC 5780 describes a method for determining the type of NAT in use. This lets you know in advance whether a direct connection is possible.

TURN​


Traversal Using Relays around NAT (TURN), defined in RFC 8656, is the solution when direct connectivity is not available. This can happen when the NAT types are incompatible or when the agents speak different protocols. TURN can also be used for privacy: by routing communication through TURN, we hide the clients' real addresses.

TURN uses a dedicated server that acts as a proxy for the client. The client connects to the TURN server and creates an Allocation. The allocation gives it a temporary IP/port/protocol that can receive data; this address is called the Relayed Transport Address. It is a kind of forwarding address that we give to the other party so it can send us traffic via TURN. For each peer we want to communicate with, a Permission must be granted.

When traffic is sent via TURN, it goes through the Relayed Transport Address. The remote peer that receives it sees the traffic as coming from the TURN server.

Lifecycle of TURN​


Let's talk about what a client has to do to create an allocation on a TURN server. Communicating with someone who uses TURN requires no extra steps on the other side: the other peer simply gets an IP and port it can send data to.

Allocations

Allocations are at the core of TURN; an allocation is essentially a "TURN session". To create one, we contact the TURN server's Server Transport Address (the default port is 3478).

When creating an allocation, we must provide the following information:
  • username/password - creating an allocation requires authentication;
  • allocation transport - the transport protocol between the server (Server Transport Address) and the peers: UDP or TCP;
  • Even-Port - we can request sequential ports for multiple allocations; this is not relevant for WebRTC.

If the request succeeds, the TURN server responds with the following attributes in the Data section:
  • XOR-MAPPED-ADDRESS - the Mapped Address of the TURN client. When someone sends data to the Relayed Transport Address, it is forwarded to this address;
  • RELAYED-ADDRESS - the address we give out to other clients. When a packet is sent to this address, it is relayed to the TURN client;
  • LIFETIME - how long until this allocation is destroyed. The lifetime can be extended by sending a Refresh request.

Permissions

A remote host cannot send data to our Relayed Transport Address until we grant it permission. When we create a permission, we tell the TURN server that incoming traffic from that IP and port is allowed.

The remote host needs to give us its IP and port as they appear to the TURN server, which means it should send a STUN Binding Request to the TURN server itself. A common mistake is for the remote host to send the Binding Request to a different server and then ask us to create a permission for that IP.

Say we want to create a permission for a host behind an address-dependent mapping. If the Mapped Address was obtained from any server other than the TURN server, all of its inbound traffic will be dropped, because a new mapping is created every time it talks to a different host. Permissions expire after 5 minutes.

Send Indication/ChannelData

These are the messages a TURN client uses to send data to a remote peer.

A Send Indication is a self-contained message: it carries the data to be sent as well as the recipient. This is wasteful when sending many messages to the same peer, because sending 1,000 messages repeats the remote peer's IP address 1,000 times!

ChannelData lets us send data without repeating the IP address. We create a channel bound to an IP and port; when sending a message we only include the ChannelId, and the server fills in the IP and port. This reduces the overhead when sending a large number of messages.

Refresh

Allocations are destroyed automatically. To keep an allocation alive, its LIFETIME must be refreshed periodically.

Using TURN​


TURN can be used in two configurations. Usually one peer acts as the "TURN client" and the other peer communicates with it directly. In some cases TURN is needed on both sides, for example because both peers are on networks that block UDP, so each connects to its own TURN server over TCP.

[Diagram: one TURN allocation for communication]

[Diagram: two TURN allocations for communication]
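
In the WebRTC API, using TURN is just configuration: the browser's ICE agent creates the allocation and permissions for us. A minimal sketch, where the TURN URL and credentials are placeholders:

Code:
// Placeholder TURN server and credentials - replace with real values.
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: ['stun:stun.l.google.com:19302'] },
    {
      urls: ['turn:turn.example.com:3478?transport=udp'],
      username: 'user',
      credential: 'secret',
    },
  ],
  // Optional: only use relayed candidates, i.e. force all traffic through TURN.
  iceTransportPolicy: 'relay',
});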


ICE​


ICE is how WebRTC connects two agents. Defined in RFC 8445, ICE is a protocol for establishing connectivity: it determines all the possible routes between the two peers and then keeps the connection alive.

These routes are called candidate pairs: a pairing of a local and a remote transport address. This is where STUN and TURN come into play. The addresses can be a local IP and port, a NAT mapping, or a Relayed Transport Address. Each side gathers the addresses it wants to use, exchanges them with the other side, and then attempts to connect.

Two ICE agents communicate using ICE ping packets (formally called connectivity checks) to establish connectivity. Once connectivity is established, they can exchange whatever data they want, much like over a regular WebSocket. The checks themselves use the STUN protocol.

Creating an ICE Agent​


An ICE agent is either controlling or controlled. The controlling agent is the one that selects the final candidate pair. Typically, the peer that sends the offer is the controlling side.

Each side must have a user fragment and a password, and these values must be exchanged before connectivity checks can begin. The user fragment is sent in plain text and is useful for demultiplexing multiple ICE sessions. The password is used to generate the MESSAGE-INTEGRITY attribute: at the end of each STUN packet there is an attribute containing a hash of the packet, keyed with the password. This authenticates the packet and ensures it has not been replaced or modified in transit.

In WebRTC, these values are exchanged via the session description, as discussed in the previous section.

Candidate Gathering​


Now we need to collect all the addresses at which agents are reachable. These addresses are called candidates.

Host

A host candidate comes from a local interface. It can be UDP or TCP.

mDNS

An mDNS candidate is similar to a host candidate, but the IP address is hidden. Instead of the IP address, the other party is given a UUID as the hostname. We then set up a multicast listener that responds to queries for the published UUID.

If we are on the same network as the other agent, we can find each other via multicast. If we are on different networks, this will not work, at least not until a network administrator explicitly allows multicast traffic to cross networks.

This provides privacy. With a host candidate, a website could learn our local IP address via WebRTC without even attempting a connection; with an mDNS candidate it only gets a random UUID.

Server Reflexive

A server reflexive candidate is generated by a STUN server in response to a STUN Binding Request.

When the STUN Binding Response arrives, the XOR-MAPPED-ADDRESS it contains is our server reflexive candidate.

Peer Reflexive

A peer reflexive candidate appears when we receive an inbound request from an address we didn't know about. Since ICE is an authenticated protocol, we know the traffic is valid; it simply means the remote peer is reaching us from an address it hadn't announced before.

This commonly happens when a host candidate communicates with a server reflexive candidate. Because we are communicating outside our own subnet, a new NAT mapping is created. Remember that connectivity checks are actually STUN packets? The STUN response format lets the peer report the address it observed (the peer-reflexive address).

Relay

A relay candidate is generated by using a TURN server.

After the initial handshake with the TURN server, we receive a RELAYED-ADDRESS: this is our relay candidate.

Connection checks​


Now we know the remote agent's candidates, user fragment, and password, and we can try to connect! Every local candidate is paired with every remote candidate, so if each side has three candidates, we end up with nine candidate pairs.

This is what it looks like:

[Diagram: 3 local candidates paired with 3 remote candidates, forming 9 candidate pairs]


Selection of candidates​


The controlling and controlled agents both start sending traffic over every pair. This is needed when one agent is behind an address-dependent mapping, in which case it causes a peer reflexive candidate to be created.

Each candidate pair that successfully carries traffic in both directions becomes a valid pair. The controlling agent then takes one valid pair and nominates it; the nominated candidates form the nominated pair. The agents attempt one more round of bidirectional communication over it, and if that succeeds, the nominated pair becomes the selected candidate pair, which is used for the rest of the session.

Restart​


If the selected candidate pair stops working for any reason (the NAT mapping expires, the TURN server crashes), the ICE agent moves to the Failed state. Both agents can perform an ICE restart, and the whole process starts over.

This completes the first part of the translation.

Thank you for your attention and happy coding!

(c) https://webrtcforthecurious.com
 

Contents of this part​

  • Security
    • How does it work?
    • Security 101
    • DTLS
    • Packet format
    • Handshake
    • Key generation
    • Data exchange
    • SRTP
    • Creating a session
    • Media sharing
  • Real-time communication
    • What attributes make networks complex?
    • Congestion
    • Dynamism
    • Solving the problem of data loss
    • Acknowledgments
    • Selective Acknowledgments
    • Negative Acknowledgments
    • Forward Error Correction
    • Solving the Jitter Problem
    • Detecting congestion
    • Solving the congestion problem
    • Reduced data transfer rate
    • Reducing the amount of data transferred
  • Media sharing
    • How does it work?
    • Delay vs quality
    • Real World Constraints
    • The video is complex
    • Video 101
    • Lossy and lossless compression
    • Intra and inter-frame compression
    • Types of interframe compression
    • The video is sensitive
    • RTP
    • Packet format
    • RTCP
    • Packet format
    • Full INTRA-frame request (FIR) and Picture Loss Indication (PLI)
    • Negative Acknowledgments
    • Sender and Recipient Reports
    • How do RTP/RTCP solve problems together?
    • Forward Error Correction
    • Adaptive bitrate and bandwidth estimation
    • Identification and transmission of network state
    • Sender/Recipient Reports
    • TMMBR, TMMBN, REMB and TWCC in combination with GCC
      • Google Congestion Control (GCC)
    • TMMBR, TMMBN and REMB
    • Transport Wide Congestion Control
    • Other Ways to Estimate Bandwidth

Security​


Every WebRTC connection is authenticated and encrypted. This guarantees that a third party cannot see what we are sending and cannot insert bogus messages. We can also be sure that the agent which generated the session description is really the one we want to communicate with.

It is very important that nobody can tamper with those messages. It is okay if a third party reads the session description in transit, but WebRTC provides no protection against it being modified. An attacker could mount a man-in-the-middle attack by swapping the ICE candidates and replacing the certificate fingerprint.

How does it work?​


To secure the connection, WebRTC uses two existing protocols: DTLS and SRTP.

DTLS lets two peers negotiate a session and then exchange data securely. It is similar to the TLS used for HTTPS, but DTLS runs over UDP instead of TCP, which means it does not guarantee reliable delivery of messages. SRTP was designed specifically for secure media exchange and offers several optimizations compared to using DTLS for media.

DTLS is used first. It performs its handshake over the ICE connection. DTLS is a client/server protocol, so one side has to initiate the handshake; the client/server roles are decided during signaling. During the handshake each side offers a certificate, and once the handshake completes, each certificate is compared against the fingerprint in the session description. This confirms that the handshake happened with the WebRTC agent we expected. The DTLS connection is then available for DataChannel communication.

To create the SRTP session, we initialize it with the keys generated by DTLS. SRTP has no handshake mechanism of its own, so it has to be bootstrapped with externally provided keys. Once the SRTP session is established, the parties can exchange encrypted media.

Security 101​


Let's get acquainted with the terms used in cryptography.

Plaintext and ciphertext

Plaintext is the input to a cipher. Ciphertext is the output of a cipher.

Encryption

A cipher is a series of operations that turns plaintext into ciphertext. A cipher can also be reversed, turning the ciphertext back into plaintext. A cipher usually has a key that changes its behavior. These processes are also called encrypting and decrypting.

A simple example of a cipher is ROT13. Each letter of the plaintext is shifted 13 characters forward; to decrypt, each letter is shifted 13 characters back. The plaintext HELLO becomes the ciphertext URYYB. In this case, the cipher is ROT and the key is 13.
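
A sketch of ROT13 in a few lines, just to make the plaintext/ciphertext/key relationship concrete:

Code:
// ROT13: shift each Latin letter 13 places; applying it twice restores the input.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= 'Z' ? 65 : 97; // char code of 'A' or 'a'
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

console.log(rot13('HELLO')); // URYYB
console.log(rot13('URYYB')); // HELLO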

Hash functions

A hash function is a one-way (irreversible) process that turns input data into a digest. The same input always yields the same output, and it must not be possible to recover the input from the output. Hashing is useful for confirming that a message has not been tampered with.

A toy hash function might keep only every other letter: HELLO becomes HLO. We cannot get back to HELLO from HLO, but we can check that HELLO matches the hash.
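
For a real hash function, browsers expose SHA-256 through the Web Crypto API. A small sketch of computing a digest:

Code:
// Compute a SHA-256 digest of a message using the Web Crypto API.
async function sha256Hex(message) {
  const data = new TextEncoder().encode(message);
  const digest = await crypto.subtle.digest('SHA-256', data);
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}

// The same input always produces the same digest, so the digest can be used
// to verify that a message was not altered.
sha256Hex('HELLO').then(console.log);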

Public and private keys

Public/private key cryptography describes the kind of encryption used by DTLS and SRTP. In this system there are two keys: a public key and a private key. The public key is used for encryption and is safe to share; the private key is used for decryption and must never be shared. A message encrypted with the public key can only be decrypted with the matching private key.

Diffie-Hellman protocol

The Diffie-Hellman protocol allows two users who have never met to create a shared secret over an untrusted network. User A can establish a secret with user B without fear of eavesdropping, relying on the difficulty of the discrete logarithm problem. This is what makes the DTLS handshake possible.

Pseudorandom function

Pseudorandom Function (PRF) is a predefined function that generates values that appear random at first glance. It can take multiple input values and produce one result.

Key derivation function

Key derivation is a use of a pseudorandom function: it makes an existing key stronger. One common pattern is key stretching.

Say we have a key of 8 bytes. We can use a KDF to make it stronger (longer).

Nonce

A nonce is an additional input for encryption. It is used to produce different results when encrypting the same message. Each message uses a unique nonce.

Message authentication code

A message authentication code (MAC) is a hash appended to the end of a message. It proves that the message comes from the sender you expected.

Key Rotation

Key rotation is the practice of replacing keys periodically. It limits the damage if a key is stolen or leaked, since a compromised key can only be used to decrypt a limited amount of data.

DTLS​


DTLS allows two peers to establish a secure connection with no pre-existing configuration. Even if someone eavesdrops on the connection, they cannot decrypt the messages.

For a DTLS client and server to communicate, they must agree on a cipher and keys. These values are determined during the handshake. Handshake messages are sent in plaintext. Once the client and server have exchanged enough information to start encrypting, a Change Cipher Spec message is sent; after that, every message is encrypted.

Packet format​


Every DTLS packet starts with a header.

Content type

The content type can be one of:
  • 20 - ChangeCipherSpec;
  • 22 - Handshake;
  • 23 - ApplicationData.

Handshake is used to exchange the information needed to start the session. ChangeCipherSpec notifies the other side that message encryption is about to begin. ApplicationData messages are the encrypted messages themselves.

Version

The version can be 0x0000feff (DTLS v1.0) or 0x0000fefd (DTLS v1.2). DTLS v1.1 does not exist.

Epoch

The epoch starts at 0 and becomes 1 after the ChangeCipherSpec. Any message with a non-zero epoch is encrypted.

Sequence number

The sequence number is used to maintain the order of messages. Each time a message is sent, this number increases. When the epoch increases, this number is reset to zero.

Message length and payload

The payload depends on the content type. An ApplicationData payload is encrypted data; a Handshake payload depends on the particular handshake message. The length field gives the size of the payload.

Handshake​


During the handshake, the client and server exchange a series of messages. These messages are grouped into flights. A flight may contain several messages or just one, and it is complete only when all of its messages have been received. We will look at the purpose of each message below.

[Diagram: DTLS handshake message flights exchanged between client and server]


ClientHello

ClientHello is the initial message from the client. It contains a list of attributes that tell the server which ciphers and features the client supports (for WebRTC, this is also how the SRTP cipher is chosen). ClientHello also contains random data that will later be used to generate the session keys.

HelloVerifyRequest

HelloVerifyRequest is sent by the server to the client. It verifies that the client really intended to send the ClientHello. The client then resends the ClientHello, this time including a token from the HelloVerifyRequest.

ServerHello

ServerHello is the server's response with the settings for the session. It contains the chosen cipher and the server's random data.

Certificate

Certificate contains the certificate of the client or the server. The certificate is used to identify the other side of the communication. Once the handshake is complete, we verify that the hash of the certificate matches the fingerprint in the SessionDescription.

ServerKeyExchange/ClientKeyExchange

These messages are used to transmit public keys. On startup, the client and server each generate a key pair. After the handshake, these values are used to generate the Pre-Master Secret.

CertificateRequest

CertificateRequest is sent by the server to tell the client it wants a certificate. The server can either request or require a certificate.

ServerHelloDone

ServerHelloDone notifies the client that the server is done with the handshake.

CertificateVerify

CertificateVerify is how the sender proves that it actually holds the private key for the certificate it sent.

ChangeCipherSpec

ChangeCipherSpec informs the recipient that everything sent from now on will be encrypted.

Finished

Finished is encrypted and contains a hash of all the previous messages. This confirms that the handshake was not tampered with.

Key generation​


Once the handshake is complete, we can send encrypted data. The cipher was chosen by the server and communicated in its ServerHello. But how is the key generated?

First, the Pre-Master Secret is generated. It is obtained by applying Diffie-Hellman to the keys exchanged in ServerKeyExchange and ClientKeyExchange; the details depend on the chosen cipher.

Next, the Master Secret is generated. Each version of DTLS defines a specific pseudorandom function. For DTLS v1.2 the function takes the Pre-Master Secret and the random values from ClientHello and ServerHello; its output is the Master Secret, the value that is used by the cipher.

Data exchange​


The workhorse of DTLS is ApplicationData. Once the cipher is initialized, we can encrypt and transmit whatever we want.

ApplicationData messages use the DTLS header described earlier, with the payload filled with ciphertext. We now have a working DTLS session and can communicate securely.

SRTP​


SRTP is a protocol designed specifically for encrypting RTP packets. To start an SRTP session you need keys and a cipher. Unlike DTLS, SRTP has no handshake mechanism: all of its settings and keys come from the DTLS handshake.

DTLS provides a dedicated API for exporting keys for use by another process. This API is defined in RFC 5705.

Creating a session​


SRTP defines a key derivation function that is applied to the exported input. When an SRTP session is created, the input is passed through this function to generate the keys for the SRTP cipher. After that, SRTP can get on with processing media.

Media sharing​


Each RTP packet has a 16-bit sequence number, which is used to keep packets in the correct order (think of sequence numbers as primary keys). The sequence number increases with every packet and eventually wraps around, so a rollover counter is also kept to track how many times it has wrapped.

When encrypting a packet, SRTP uses the rollover counter and the sequence number as a nonce. This ensures that even if the same data is sent twice, the ciphertext will differ, which protects against an attacker spotting patterns and against replay attacks.

Real-time communication​


Networks are the limiting factor for real-time communication. In an ideal world, the size of transmitted data is unlimited and packets are delivered instantly. In the real world this is not the case. Network capabilities are limited and conditions may change at any time. Measuring and monitoring networks is also challenging. We get different behavior depending on the hardware, software and their settings.

Real-time communication also poses a problem that does not exist in most other settings. For a web developer, a site that loads slowly on some networks is not fatal: as long as all the data eventually arrives, users are happy. With WebRTC, however, "old" data is useless; nobody cares what was said in a conference call five seconds ago. So when designing a real-time system, you constantly trade off latency against the amount of data you send.

What attributes make networks complex?​


Code that works effectively across all networks is complex. It is necessary to take into account many different factors that may influence each other in unobvious ways. Below is a list of the most common problems that developers face.

Bandwidth

Bandwidth is the maximum rate of data that can be transferred across a path. It is important to understand that this value is not constant: bandwidth changes with load, i.e. with how many others are using the same route.

Transmission time and round-trip time

Transmission time is how long it takes a packet to reach its destination. Like bandwidth, it is not constant and can fluctuate during a session.

Code:
transmission_time = receive_time - send_time

To calculate the transmission time directly, we would need clocks on the sender and receiver that are synchronized to millisecond accuracy. Even a small drift makes the measurement unreliable, and because WebRTC operates in highly heterogeneous environments, near-perfect clock synchronization between hosts is practically impossible.

Round-trip time measurement is a way to work around this clock synchronization problem.

Instead of relying on synchronized clocks, a WebRTC peer sends a special packet containing its own timestamp, sendertime1. The other peer receives the packet and echoes it back. When the original sender receives the reply, it subtracts sendertime1 from the current time, sendertime2. The difference is the round-trip time.

Code:
rtt = sendertime2 - sendertime1

Half of the round-trip time is considered a good approximation of the transmission time. This has drawbacks: it assumes that the forward and return paths take the same amount of time. On cellular networks, however, sending and receiving can be asymmetric; you may have noticed that your phone's upload speed is almost always lower than its download speed.

Code:
transmission_time = rtt/2

The technical details of round trip time measurement are described in detail in the next section.

Jitter

Jitter is the variation in packet transmission times. Packets can be delayed and then arrive in bursts.

Packet Loss

Packet loss is when packets are lost in transit and never reach their destination. Loss can be steady or sporadic. It can be caused by the type of network, such as satellite or Wi-Fi, or by software along the path.

Maximum transmission unit

Maximum transmission unit is the maximum size of a single packet. Networks don't allow you to send one giant message. At the protocol level, messages are split into smaller packets.

The MTU also varies depending on the route taken. To determine it, you can use Path MTU Discovery.

Congestion​


Congestion occurs when network limits are reached. Typically, this is due to reaching the maximum throughput of the current route. Or it may be due to limits imposed by your Internet Service Provider.

Congestion manifests itself in different ways; there is no standard behavior. In most cases excess packets are simply dropped. In other cases they are buffered, which increases transmission time. Increased jitter can also be observed on congested networks. This is a rapidly evolving field, and new congestion-detection algorithms continue to emerge.

Dynamism​


Networks are incredibly dynamic and their conditions change very quickly. During a call we may send and receive hundreds of thousands of packets. These packets pass through several hops (intermediate routers and relays), and those hops are shared with millions of other users. Even on a local network, someone may start downloading an HD movie, or a device may fetch an update in the middle of a call.

Maintaining a stable call is not as simple as measuring the network once at startup. Conditions must be re-evaluated constantly, taking into account all the factors introduced by network hardware and software.

Solving the problem of data loss​


Packet loss is the first problem that needs to be addressed when designing a real-time communication system. There are many ways to do this, each with its own pros and cons. The right choice depends on what we are sending and how critical latency is. Note also that not every loss is fatal: losing a small amount of video data will be invisible to the human eye, while losing part of a text message is unacceptable.

Let's say we sent 10 packets and packets 5 and 6 were lost. Let's look at how we can solve this problem.

Acknowledgments​


Acknowledgments are the way the recipient informs the sender that each packet has been received. The sender becomes aware of a loss when it receives a repeated acknowledgment for a packet that is not the latest one. When the sender receives an ACK for packet 4 twice, it realizes that packet 5 has not arrived.

Selective Acknowledgments​


Selective acknowledgments are an improvement on regular acknowledgments. The receiver can send a SACK acknowledging multiple packets at once, thereby notifying the sender of gaps. The sender receives a SACK for packets 4 and 7, realizes that packets 5 and 6 were lost, and resends them.

Negative Acknowledgments​


Negative acknowledgments solve the problem the other way around. Instead of reporting what was received, the recipient reports what was lost. In our case, a NACK is sent for packets 5 and 6. The sender only learns about the packets that need to be resent.
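
To illustrate the bookkeeping behind negative acknowledgments, here is a small sketch of how a receiver might derive which packets to request again from the sequence numbers it has seen. This is not a wire format, just the idea; the function name is illustrative, and real RTP sequence numbers are 16-bit and wrap around, which is ignored here.

Code:
// Sketch: given the highest sequence number received so far and the set of
// sequence numbers actually seen, list the packets that should be NACKed.
// Sequence numbers are treated as plain integers; a production
// implementation must handle 16-bit wraparound.
function missingPackets(received: Set<number>, highestSeq: number): number[] {
  const lost: number[] = [];
  for (let seq = 1; seq <= highestSeq; seq++) {
    if (!received.has(seq)) {
      lost.push(seq);
    }
  }
  return lost;
}

// The example from the text: packets 1..10 sent, 5 and 6 never arrived.
const seen = new Set([1, 2, 3, 4, 7, 8, 9, 10]);
console.log(missingPackets(seen, 10)); // [5, 6] -> send NACKs for these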

Forward Error Correction​


Forward Error Correction attempts to solve the problem of data loss in advance. The sender sends additional data so that packet loss does not affect the result. One example of FEC is the Reed-Solomon code.

This avoids the delay and complexity of sending and waiting for acknowledgments, but FEC wastes bandwidth when packet loss is close to zero.

Solving the Jitter Problem​


Jitter is present in most networks. Even inside a LAN there are many devices sending data at different rates. You can easily see jitter by pinging another device with the ping command and watching how the round-trip time (rtt) fluctuates.

To overcome jitter, clients use a JitterBuffer. It ensures that packets are delivered at an even rate. The downside is the delay the JitterBuffer adds to packets that arrive early; the advantage is that late packets no longer cause jitter. Imagine that during a call you see the following packet arrival times:

* time=1.46 ms
* time=1.93 ms
* time=1.57 ms
* time=1.55 ms
* time=1.54 ms
* time=1.72 ms
* time=1.45 ms
* time=1.73 ms
* time=1.80 ms

In this case, a good choice would be about 1.8 ms. Packets that arrive late can use this latency window, while packets that arrive early are held back slightly and released together with the late ones. The result is that jitter is smoothed out and packets are delivered to the client at an even rate.
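
As a simplified sketch of that choice: take recent transit-time samples and use the slowest one as the delay window (the text rounds this to roughly 1.8 ms). Real jitter buffers adapt continuously and work on frames, not raw samples; the function name here is illustrative.

Code:
// Sketch: pick a playout delay that covers the slowest recently observed
// packet, so that early packets are held back and late packets still land
// inside the window. Real jitter buffers adapt this value over time.
function choosePlayoutDelay(transitTimesMs: number[]): number {
  // Use the worst (largest) observed transit time as the delay window.
  return Math.max(...transitTimesMs);
}

const samples = [1.46, 1.93, 1.57, 1.55, 1.54, 1.72, 1.45, 1.73, 1.80];
console.log(choosePlayoutDelay(samples)); // 1.93 (ms)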

JitterBuffer operation

[diagram: packet flow through the JitterBuffer]


Each packet is buffered upon receipt. When there are enough packets to reconstruct a frame, the frame is released from the buffer and sent for decoding. The decoder, in turn, decodes and renders the video frame on the user's screen. Because the buffer has limited capacity, packets that sit in it too long are discarded.

The jitterBufferDelay statistic gives a good view of network performance and its impact on playback smoothness. It is part of the WebRTC statistics API and is reported for the receiver's incoming stream. The delay measures how long frames spend in the buffer before being handed to the decoder; a high value means the network is heavily congested.
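
A minimal sketch of reading this value through the standard statistics API. It assumes an already-connected RTCPeerConnection named pc (illustrative); dividing the cumulative jitterBufferDelay by jitterBufferEmittedCount gives the average delay per emitted sample.

Code:
// Sketch: compute the average jitter buffer delay (in seconds) for each
// incoming RTP stream. jitterBufferDelay is cumulative, so dividing by
// jitterBufferEmittedCount gives the average per-sample delay.
async function logJitterBufferDelay(pc: RTCPeerConnection): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report) => {
    if (report.type === "inbound-rtp" && report.jitterBufferEmittedCount > 0) {
      const avgDelaySec = report.jitterBufferDelay / report.jitterBufferEmittedCount;
      console.log(
        `${report.kind} stream ${report.ssrc}: avg jitter buffer delay ` +
        `${(avgDelaySec * 1000).toFixed(1)} ms`
      );
    }
  });
}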

Congestion Detection​


Before we can respond to network congestion, we first have to detect it. This is the job of the congestion controller, a complex component that is still under active development; new algorithms are published and tested all the time. At the highest level they all do the same thing: estimate the available capacity based on inputs such as:
  • packet loss - packet loss increases on congested networks;
  • jitter - as the load increases, packet delivery times become increasingly erratic;
  • round-trip time - round-trip time increases under high load;
  • explicit congestion notification - newer networks can mark packets as being at risk of being dropped, to reduce load.

During the call, these values must be measured at all times. Network usage may go up and down, so the throughput will vary.

Solving the congestion problem​


After measuring throughput, we need to adjust what we send. The adjustment depends on the type of data being transferred.

Reducing the data transfer rate​


Limiting the data rate is the first solution to the congestion problem. The congestion controller provides the estimate, and the sender limits its rate accordingly.

In most cases this is the method used. In the case of TCP, the operating system takes care of it, and the process is completely transparent to users and developers.

Reducing the amount of data transferred​


In some cases we can send less information to stay within the limits. In WebRTC we cannot simply slow the transfer down, because the data is needed in real time.

If we do not have enough bandwidth, we can, for example, reduce the quality of the transmitted video. This requires close coupling between the video encoder and the congestion controller.

Media sharing​


WebRTC allows you to send and receive an unlimited number of audio and video streams. Streams can be added and removed at any time during a session, and they can be independent or bundled together: for example, we can share a screen capture and add audio and video from a webcam to it.

The WebRTC protocol is codec-agnostic. The underlying transport can carry anything, even codecs that don't exist yet. However, the agent we are talking to may not have what it needs to decode what we send.

WebRTC is designed to operate in dynamic network environments. During a call, bandwidth may increase or decrease, and we may suddenly experience severe packet loss. The protocol provides tools to deal with this: WebRTC reacts to changing network conditions and tries to provide the best possible experience with the available resources.

How does it work?​


WebRTC uses two protocols, RTP and RTCP, originally defined in RFC 1889 (obsoleted by RFC 3550).

RTP is the protocol that carries the media data. It was designed for real-time delivery. It does not mandate any particular latency or reliability, but gives you the tools to implement them. RTP allows multiple streams to be multiplexed over a single connection, and it carries the timing and ordering information the media pipeline needs.

RTCP is the protocol that carries metadata about the call. Its format is very flexible and lets you attach arbitrary metadata. It is used to collect call statistics, and it can also be used to handle packet loss and implement congestion control. It provides the bidirectional feedback needed to adapt to constantly changing network conditions.

Delay vs quality​


Sharing media in real time always involves a trade-off between latency and quality. The more latency we can afford, the higher the quality of the media will be.

Real World Constraints​


This is due to real-world constraints: network characteristics that must be taken into account.

Video is complex​


Transferring video is a complex process. Storing 30 minutes of uncompressed 720p 8-bit video requires about 110 GB. Under such conditions, a conference with four participants is impossible. Therefore, we need some way to reduce the size of the video, and the answer is compression. However, compression has its own disadvantages.

Video 101​


We won't look at video compression in detail, only enough to understand why RTP is designed the way it is. Compressing video means encoding it: converting it into another format that needs fewer bits to represent roughly the same picture.

Lossy and lossless compression​


Video can be encoded losslessly (lossless compression) or with quality loss (lossy compression). Because lossless encoding requires far more data to be sent to the peer, which increases latency and the chance of packet loss, RTP is typically used with lossy encoding even though this lowers video quality.

Intra and inter-frame compression​


Video compression falls into two types. The first is intra-frame compression, which reduces the number of bits needed to describe a single video frame. The same technique is used to compress still images, e.g. JPEG.

The second type is inter-frame compression. Since video consists of a large number of images (frames), we are looking for ways to avoid sending the same information twice.

Types of interframe compression​


Inter-frame compression uses 3 frame types:
  • I-Frame - a complete picture that can be decoded without any other frames;
  • P-Frame - a partial picture containing only the changes relative to a previous picture;
  • B-Frame - a partial picture built from both previous and future pictures.

Visualization of these types:

[diagram: I-, P-, and B-frames]


Video is fragile​


Video compression is highly stateful, which makes it difficult to transmit over a network. What happens if part of an I-Frame is lost? How does a P-Frame know what to modify? As compression gets more sophisticated, more such questions arise. Fortunately, RTP and RTCP provide solutions to these problems.

RTP​


Packet format​


Each RTP packet has the following structure:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |        Sequence Number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            Contributing Source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Payload                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Version (V)

The version always has the value 2.

Padding (P)

Padding is a boolean value that indicates whether the payload has padding.

The last byte of the payload contains the number of padding bytes.

Extension (X)

If set, the RTP header will contain an extension. This will be discussed later.

CSRC count (CC)

The number of CSRC identifiers that come after the SSRC but before the payload.

Marker (M)

The marker bit has no predefined meaning and can be used in any way.

In some cases it is set when the user is speaking. It is also often used to mark a keyframe.

Payload Type (PT)

Payload Type is the unique identifier of the codec used by this packet.

In WebRTC, the Payload Type is dynamic: the value for VP8 in one call may differ from the value in another call. The offerer defines the mapping between Payload Types and codecs in the Session Description.

Sequence Number

The Sequence Number is used to order packets in a stream. Each time a packet is sent, the Sequence Number increases by one.

RTP is designed so that the recipient can detect packet loss promptly.

Timestamp

The sampling instant for this packet. This is not a global clock but the time elapsed in the media stream. Multiple RTP packets can have the same Timestamp if, for example, they are part of the same video frame.

Synchronization Source (SSRC)

The SSRC (synchronization source) is the stream's unique identifier. It allows multiple media streams to be carried over a single RTP connection.

Contributing Source (CSRC)

CSRC (contributing source) is most often used as a "who is talking" indicator. Say we combined several audio streams into a single RTP stream on the server. We can then use this field to say that "incoming streams A and C are talking right now."

Payload

The Payload is the actual data being transferred.
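
To make the layout above concrete, here is a hedged sketch of parsing the fixed 12-byte RTP header from a raw packet. Extension and padding handling are omitted, and the interface and function names are illustrative, not part of any library.

Code:
// Sketch: parse the fixed RTP header fields described above from a raw
// packet in network (big-endian) byte order.
interface RtpHeader {
  version: number;        // V, 2 bits (always 2)
  padding: boolean;       // P, 1 bit
  extension: boolean;     // X, 1 bit
  csrcCount: number;      // CC, 4 bits
  marker: boolean;        // M, 1 bit
  payloadType: number;    // PT, 7 bits
  sequenceNumber: number; // 16 bits
  timestamp: number;      // 32 bits
  ssrc: number;           // 32 bits
  csrc: number[];         // CC * 32 bits
}

function parseRtpHeader(packet: Uint8Array): RtpHeader {
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  const firstByte = view.getUint8(0);
  const secondByte = view.getUint8(1);
  const csrcCount = firstByte & 0x0f;

  const csrc: number[] = [];
  for (let i = 0; i < csrcCount; i++) {
    csrc.push(view.getUint32(12 + i * 4));
  }

  return {
    version: firstByte >> 6,
    padding: (firstByte & 0x20) !== 0,
    extension: (firstByte & 0x10) !== 0,
    csrcCount,
    marker: (secondByte & 0x80) !== 0,
    payloadType: secondByte & 0x7f,
    sequenceNumber: view.getUint16(2),
    timestamp: view.getUint32(4),
    ssrc: view.getUint32(8),
    csrc,
  };
}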

RTCP​


Packet format​


Each RTCP packet has the following structure:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|    RC   |       PT      |            length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Payload                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Version (V)

The version always has the value 2.

Padding (P)

Same as in RTP.

Reception Report Count (RC)

The number of reports in this packet. A single RTCP packet can contain multiple reports.

Packet Type (PT)

The unique identifier of the RTCP packet type. A WebRTC agent is not required to support all types, and support may vary between agents. In practice you will commonly see the following types:
  • 192 - Full INTRA-frame Request (FIR);
  • 193 - Negative Acknowledgments (NACK);
  • 200 - Sender Report;
  • 201 - Receiver Report;
  • 205 - Generic RTP Feedback;
  • 206 - Payload Specific Feedback.

The significance of each of these packet types will be described below.

Full INTRA-frame request (FIR) and Picture Loss Indication (PLI)​


FIR and PLI (Picture Loss Indication) messages serve a similar purpose: they ask the sender for a full keyframe. PLI is used when the decoder receives partial frames and cannot decode them. This can happen because of data loss or a decoder error.

According to RFC 5104, FIR should not be used when packets or frames are lost; that is PLI's job. FIR requests a keyframe for other reasons, for example when a new participant joins a session. A keyframe is needed to start decoding the video; the decoder will discard frames until it receives one.

The recipient requests the full keyframe immediately after connecting, minimizing the delay between the connection and the appearance of the image on the user's screen.

PLI packets are part of the Payload Specific Feedback messages.

In practice, software that can handle both PLI and FIR packets treats them the same way: it signals the encoder to produce a new full keyframe.

Negative Acknowledgments​


NACK "asks" the sender to resend the packet RTP. This is usually due to packet loss, but can also be due to packet delay.

A NACK uses much less bandwidth than requesting a whole frame again. Because RTP splits frames into small packets, only the small missing piece is re-requested. The receiver builds an RTCP message containing the SSRC and the sequence number. If the sender does not have the requested packet available, it simply ignores the message.
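
For illustration, here is a sketch of how lost sequence numbers might be packed into the Generic NACK format from RFC 4585, where each feedback item carries a packet ID (PID) plus a bitmask (BLP) covering the 16 packets that follow it. The function name is illustrative, and 16-bit sequence wraparound is ignored.

Code:
// Sketch: pack lost sequence numbers into Generic NACK items (RFC 4585).
// Each item is a 16-bit packet ID (PID) plus a 16-bit bitmask (BLP) whose
// bit i marks packet PID + i + 1 as also lost.
interface NackItem {
  pid: number; // first lost sequence number in this item
  blp: number; // bitmask for the following 16 sequence numbers
}

function buildNackItems(lostSeqs: number[]): NackItem[] {
  const sorted = [...lostSeqs].sort((a, b) => a - b);
  const items: NackItem[] = [];

  for (const seq of sorted) {
    const current = items[items.length - 1];
    if (current !== undefined && seq > current.pid && seq <= current.pid + 16) {
      // Fits in the bitmask of the current item.
      current.blp |= 1 << (seq - current.pid - 1);
    } else {
      items.push({ pid: seq, blp: 0 });
    }
  }
  return items;
}

// Packets 5 and 6 lost: one item with PID 5 and the lowest BLP bit set.
console.log(buildNackItems([5, 6])); // [{ pid: 5, blp: 1 }]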

Sender and Receiver Reports​


These reports are used to share statistics between agents. Statistics contain information about the number of received packets and jitter.

Reports can be used to diagnose and monitor congestion.

How do RTP/RTCP solve problems together?​


RTP and RTCP work together to address the challenges caused by the dynamic network conditions described above. These techniques are still being actively developed.

Forward Error Correction​


Forward Error Correction (FEC) is another way of dealing with packet loss: the sender transmits data redundantly, several times, without waiting for a retransmission request. This can happen at the RTP level or even at the codec level.

If packet loss is steady, FEC is a lower-latency solution than NACK: with NACK, the round trip (rtt) of requesting a missing packet and waiting for its retransmission can be significant.

Adaptive bitrate and bandwidth estimation​


As discussed in the previous section, networks are unpredictable and unreliable. Bandwidth may change several times during one session. It is not uncommon for the available bandwidth to change dramatically (by orders of magnitude) within one second.

The basic idea is to adjust the encoding bitrate based on the predicted, current, and future network throughput. This ensures the best possible quality of the transmitted audio or video and avoids disconnection due to congestion. Heuristic techniques that model network behavior and attempt to predict it are known as bandwidth estimation.

There are many nuances here, so let's look at some of them in more detail.

Identifying and communicating network state​


RTP/RTCP run over all kinds of networks, and as a result some packets inevitably get dropped along the way. Being built on top of UDP, these protocols have no built-in mechanism for packet retransmission, let alone congestion control.

To provide the best user experience, WebRTC must estimate the quality of the network path and adapt as it changes. The key properties to monitor are: available bandwidth (in each direction; it may be asymmetric), round-trip time, and jitter (fluctuations in round-trip time). It must account for packet loss and communicate changes in these properties as network conditions evolve.

The protocols under consideration have two main goals:
  1. Estimate the available bandwidth (in each direction) supported by the network.
  2. Communicate network characteristics between sender and receiver.

RTP/RTCP offer three approaches to these problems. Each has its pros and cons; which one is used depends on the software stack available to the clients and the libraries used by the application.

Sender/Receiver Reports​


These RTCP messages are defined in RFC 3550 and are used to exchange network characteristics between endpoints. Receiver Reports focus on network quality (packet loss, round-trip time, and jitter) and feed the algorithms that estimate available bandwidth.

Sender Reports and Receiver Reports (SR/RR) together paint a picture of network quality. They are sent per SSRC and are the input to the bandwidth estimate. The sender performs the estimate after receiving an RR containing the following fields:
  • Fraction Lost - what percentage of packets have been lost since the last RR;
  • Cumulative Number of Packets Lost - how many packets have been lost during the entire session;
  • Extended Highest Sequence Number Received - the last sequence number received and how many times the sequence number has rolled over;
  • Interarrival Jitter - the jitter for the entire call;
  • Last Sender Report Timestamp - the sender's last known time, used to calculate round-trip times.

SR and RR are used together to calculate round-trip times.

The sender includes its local time, sendertime1, in the SR. When the receiver gets the SR, it sends back an RR that, among other things, echoes the sendertime1 it received. Since there is a delay between receiving the SR and sending the RR, the RR also includes the "delay since last sender report" (DLSR), which is used to correct the round-trip calculation. Once the sender receives the RR, it subtracts sendertime1 and DLSR from the current time, sendertime2. This difference is the round-trip time (rtt).

Code:
rtt = sendertime2 - sendertime1 - DLSR

RTT in simple words:
  • I am sending you a message, my watch shows 16 hours 20 minutes 42 seconds and 420 milliseconds;
  • you send me the same time in response;
  • you also send the time elapsed between receiving my message and sending your message, say 5 ms;
  • after receiving your message, I look at my watch again;
  • they now show 16 hours 20 minutes 42 seconds 690 ms;
  • this means that it took 265 ms (690 - 420 - 5) to send the message back and forth;
  • thus the round trip time is 265 ms.
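
The same arithmetic as a minimal sketch, using plain milliseconds. A real RTCP implementation works with NTP-format timestamps and a DLSR field expressed in 1/65536-second units, but the calculation is the same; the function name is illustrative.

Code:
// Sketch of the RTT calculation from the walkthrough above.
function roundTripTime(sendertime1: number, sendertime2: number, dlsr: number): number {
  return sendertime2 - sendertime1 - dlsr;
}

// Values from the example: SR sent at 420 ms past the minute, RR received
// at 690 ms, receiver reported a 5 ms processing delay (DLSR).
console.log(roundTripTime(420, 690, 5)); // 265 (ms)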

[diagram: round-trip time calculation using SR/RR and DLSR]


TMMBR, TMMBN, REMB and TWCC in combination with GCC​


Google Congestion Control (GCC)​


Google's Congestion Control (GCC) algorithm, outlined in draft-ietf-rmcat-gcc-02, attempts to solve the bandwidth estimation problem. It pairs with a variety of feedback protocols and works well both on the receiving side (with TMMBR/TMMBN or REMB) and on the sending side (with TWCC).

GCC focuses on packet loss and fluctuations in frame arrival time as its two main inputs for bandwidth estimation. It runs these measurements through two linked controllers: a loss-based controller and a delay-based controller.

The first GCC component, the loss-based controller, is very simple:
  • if packet loss exceeds 10%, the bandwidth estimate is decreased;
  • if loss is in the 2-10% range, the estimate stays the same;
  • if loss is below 2%, the estimate is increased.

Measurements are taken roughly once per second. Depending on the companion protocol, packet loss is either reported explicitly (TWCC) or inferred (TMMBR/TMMBN and REMB).
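
A sketch of that loss-based rule is below. The thresholds (2% and 10%) come from the text; the specific adjustment factors (a 5% increase and a decrease proportional to half the loss fraction) are the ones cited in the GCC draft and are used here for illustration only.

Code:
// Sketch of the loss-based part of GCC: adjust the bandwidth estimate once
// per feedback interval based on the observed packet loss fraction.
// Real implementations combine this with the delay-based controller.
function lossBasedEstimate(previousBps: number, lossFraction: number): number {
  if (lossFraction > 0.10) {
    // Heavy loss: back off proportionally to the loss fraction.
    return previousBps * (1 - 0.5 * lossFraction);
  }
  if (lossFraction < 0.02) {
    // Negligible loss: probe upward.
    return previousBps * 1.05;
  }
  // Loss between 2% and 10%: hold the estimate steady.
  return previousBps;
}

console.log(lossBasedEstimate(1_000_000, 0.15)); // 925000 - congested, back off
console.log(lossBasedEstimate(1_000_000, 0.01)); // 1050000 - room to grow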

The second, delay-based controller works alongside the loss-based controller and monitors variations in packet arrival times. It detects when network links become congested and lowers the bandwidth estimate before loss even occurs. The idea is that a congested interface queues packets until its buffer is exhausted; if it receives more traffic than it can forward, it is forced to drop whatever does not fit in its buffer. This kind of packet loss is especially harmful for low-latency, real-time traffic and can degrade all communication on the network, so it should be avoided whenever possible. GCC therefore tries to detect growing queue depth before actual packet loss begins, and reduces its estimate if queuing delays increase.

To do this, GCC looks for growing queue depth by measuring small increases in round-trip time. It records the inter-arrival time of frames, t(i) - t(i-1): the difference between the arrival times of two groups of packets (typically consecutive video frames). These groups are usually sent at regular intervals (for example, every 1/24 of a second for 24 fps video). As a result, measuring inter-arrival time boils down to measuring the time between the start of one group of packets (i.e., one frame) and the start of the next.
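
A minimal sketch of that measurement: compare how far apart two groups arrived with how far apart they were sent. A consistently positive value suggests a queue is building up. The function name and example timestamps are illustrative.

Code:
// Sketch: inter-group delay variation used by the delay-based controller.
// arrival* are receive times and send* are send times of successive packet
// groups (e.g. video frames), all in milliseconds.
function delayVariation(
  arrivalPrev: number, arrivalCur: number,
  sendPrev: number, sendCur: number,
): number {
  return (arrivalCur - arrivalPrev) - (sendCur - sendPrev);
}

// Frames sent 42 ms apart (roughly 1/24 s) but received 62 ms apart:
console.log(delayVariation(100, 162, 0, 42)); // 20 (ms) -> a growing queue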

In the chart below, the average increase in inter-packet delay is +20 ms, a clear indicator of network congestion.

[chart: inter-packet delay increasing by about 20 ms under congestion]


An increase in the time between arrivals of packet groups indicates growing queue depth in the network interfaces along the path and, as a result, congestion. (Note: GCC is smart enough to account for fluctuations in frame byte sizes.) GCC refines its delay measurements with a Kalman filter, taking many measurements of round-trip time (and its variation) before concluding that the network is congested. The Kalman filter can be thought of as a replacement for linear regression: it produces accurate predictions while accounting for the jitter that adds noise to the measurements. When congestion is detected, GCC lowers the available bitrate. Under stable network conditions, it slowly raises its estimate to probe for more headroom.

TMMBR, TMMBN and REMB​


With TMMBR, TMMBN, and REMB, the receiver first calculates the available incoming bandwidth (using GCC, for example) and then communicates it to the sender. The parties do not need to exchange packet loss or other network details, because the receiver-side calculation already measures inter-arrival time and packet loss directly. Instead, TMMBR, TMMBN, and REMB exchange the bandwidth estimates themselves:
  • Temporary Maximum Media Stream Bit Rate Request - the mantissa/exponent of a requested bitrate for a single SSRC;
  • Temporary Maximum Media Stream Bit Rate Notification - a notification that a TMMBR has been received;
  • Receiver Estimated Maximum Bitrate - the mantissa/exponent of a requested bitrate for the entire session.

TMMBR and TMMBN came first and are defined in RFC 5104. REMB came later; it was proposed in draft-alvestrand-rmcat-remb, which was never standardized.

Illustration of a session using REMB:

[diagram: a session using REMB]


This method works well on paper: the sender receives the estimate from the receiver and sets the encoder bitrate to that value. Voila! We have adapted to the network conditions.

In practice, however, REMB has some drawbacks.

The first is encoder inefficiency: when we ask the encoder for a given bitrate, the output does not necessarily match it. It may produce more or fewer bits, depending on the encoder settings and the frame being encoded.

For example, the output of the x264 encoder with tune=zerolatency may deviate significantly from the target bitrate. One possible scenario:
  • assume our initial bitrate is 1000 kbps;
  • the encoder outputs only 700 kbps, because there is not enough high-frequency detail to encode;
  • the receiver gets the 700 kbps video with zero packet loss, and applies the REMB rule of increasing the incoming bitrate by 8%;
  • the receiver sends the sender a REMB suggesting a bitrate of 756 kbps (700 kbps * 1.08);
  • the sender sets the encoder bitrate to 756 kbps;
  • the encoder outputs an even lower bitrate;
  • the process repeats over and over, driving the bitrate down to its absolute minimum.

Ultimately, this will result in unwatchable video even if you have a great connection.

Transport Wide Congestion Control​


TWCC (transport-wide congestion control) is one of the most recent developments in assessing network conditions. It is defined in draft-holmer-rmcat-transport-wide-cc-extensions-01.
TWCC is based on a simple principle:

[diagram: TWCC feedback principle]


With REMB, the receiver tells the sender the available download bitrate, relying on precise measurements of packet loss and packet inter-arrival times that only it has.

TWCC is something of a hybrid of the SR/RR and REMB approaches. Bandwidth estimation is done by the sender (as with SR/RR), but the estimation technique is closer to the one used with REMB.

With TWCC, the receiver tells the sender the arrival time of every packet. That is enough for the sender to measure inter-arrival delay variation and to identify lost or late packets. With this information exchanged frequently, the sender can adapt quickly to changing network conditions and adjust its sending rate using algorithms such as GCC.

The sender keeps track of the packets it sends, including their sequence numbers, sizes, and send times. When it receives an RTCP feedback message from the receiver, it compares the gaps between send times with the gaps between receive times. A growing receive-side delay indicates congestion, and the sender reacts to reduce it.
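
A sketch of that sender-side comparison: match the feedback's receive times against the recorded send times, flag packets with no receive time as lost, and look at how the one-way delay evolves. The interface and function names are illustrative, not the actual TWCC wire format.

Code:
// Sketch: sender-side processing of TWCC-style feedback. The change in
// (receive - send) across packets is the delay trend fed into the
// congestion controller; absolute values are meaningless because the two
// clocks are not synchronized.
interface PacketFeedback {
  seq: number;
  sendTimeMs: number;
  receiveTimeMs?: number; // undefined -> the packet was lost
}

function analyzeFeedback(packets: PacketFeedback[]): { lost: number[]; delayTrendMs: number } {
  const lost = packets.filter(p => p.receiveTimeMs === undefined).map(p => p.seq);
  const delays = packets
    .filter(p => p.receiveTimeMs !== undefined)
    .map(p => (p.receiveTimeMs as number) - p.sendTimeMs);

  // Crude trend: last observed one-way delay minus the first one.
  const delayTrendMs = delays.length >= 2 ? delays[delays.length - 1] - delays[0] : 0;
  return { lost, delayTrendMs };
}

const report = analyzeFeedback([
  { seq: 1, sendTimeMs: 0, receiveTimeMs: 30 },
  { seq: 2, sendTimeMs: 20, receiveTimeMs: 55 },
  { seq: 3, sendTimeMs: 40 },                    // lost
  { seq: 4, sendTimeMs: 60, receiveTimeMs: 105 },
]);
console.log(report); // { lost: [3], delayTrendMs: 15 } -> delays are growing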

By giving the sender the raw data, TWCC provides an accurate picture of the actual network conditions:
  • almost instantaneous information about packet loss, down to individual packets;
  • exact sending bitrate;
  • exact receiving bitrate;
  • jitter measurement;
  • the difference between sending and receiving delays;
  • a description of how the network copes with fluctuating or stable throughput.

One of the most significant benefits of TWCC is the flexibility it gives WebRTC developers. Because the congestion control algorithm runs on the sending side, client code can stay simple. Complex congestion control logic can be iterated on quickly on hardware the developers directly control (for example, a Selective Forwarding Unit, discussed in a later section). For browsers and mobile devices, this means clients benefit from algorithm improvements without waiting for browser standardization or updates (which can take a long time).

Other Ways to Estimate Bandwidth​


The most widely deployed implementation is "A Google Congestion Control Algorithm for Real-Time Communication", defined in draft-alvestrand-rmcat-congestion.

There are several alternatives to GCC, for example NADA: A Unified Congestion Control Scheme for Real-Time Media and SCReAM - Self-Clocked Rate Adaptation for Multimedia.

This completes the second part of the translation.

Thank you for your attention and happy coding!

(c) https://webrtcforthecurious.com
 