How are users tracked online?

Hacker

Professional
Messages
1,044
Reaction score
834
Points
113
This article was written for educational purposes only. We do not call anyone to anything, only for information purposes! The author is not responsible for your actions
It always bothered me how obsessively Google AdSense would pop in contextual ads based on my old search queries. It seems that a lot of time has passed since the search, and the cookies and browser cache were cleared more than once, but the ads remained. How did they keep tracking me? It turns out that there are plenty of ways to do this.

A small preface
Identification, user tracking or simply web tracking means calculating and setting a unique identifier for each browser that visits a particular site. In general, initially it was not conceived by some kind of universal evil and, like everything, has a downside, that is, it is intended to be useful. For example, allow site owners to distinguish regular users from bots, or provide the ability to store user preferences and apply them on a subsequent visit. But at the same time, the advertising industry really liked this opportunity. As you well know, cookies are one of the most popular ways to identify users. And they began to be actively used in the advertising industry since the mid-nineties.

Since then, a lot has changed, technology has gone far ahead, and nowadays, tracking users is not limited to cookies alone. In fact, users can be identified in different ways. The most obvious option is to set some sort of identifier, like cookies. The next option is to use data about the PC used by the user, which can be gleaned from the HTTP headers of the sent requests: address, type of OS used, time, and the like. And finally, you can distinguish the user by his behavior and habits (cursor movements, favorite sections of the site, etc.).

Explicit identifiers
This approach is quite obvious, all that is required is to save on the user's side some kind of long-lived identifier that can be requested during a subsequent visit to the resource. Modern browsers provide ample ways to do this transparently to the user. First of all, these are good old cookies. Then the features of some plugins that are close in functionality to cookies, for example, Local Shared Objects in flash or Isolated Storage in silverlight. HTML5 also includes several client-side storage engines, including localStorage, File and IndexedDB API. In addition to these places, unique tokens can also be stored in cached resources of the local machine or cache metadata ( Last-Modified, ETag). In addition, it is possible to identify a user by fingerprints obtained from Origin Bound certificates generated by the browser for SSL connections, data contained in SDCH dictionaries, and the metadata of these dictionaries. In a word, there are plenty of opportunities.

Cookies
When it comes to storing some small amount of data on the client side, cookies are usually the first thing that comes to mind. The web server sets a unique identifier for the new user, storing it in cookies, and for all subsequent requests, the client will send it to the server. And although all popular browsers have long been equipped with a convenient interface for managing cookies, and the web is full of third-party utilities for managing and blocking them, cookies are still actively used to track users. The fact is that very few people look through and clean them (remember the last time you did this). Perhaps the main reason for this is that everyone is afraid to accidentally delete the necessary "cookie", which, for example, can be used for authorization. Although some browsers allow you to restrict the installation of third-party cookies, the problem persists, since very often browsers consider cookies received via HTTP redirects or other means during the loading of the page content as "native". Unlike most mechanisms, which we will discuss next, the use of cookies is transparent to the end user. In order to "tag" a user, it is not even necessary to store a unique identifier in a separate cookie - it can be collected from the values of several cookies or stored in metadata such as Expiration Time. Therefore, at this stage it is rather difficult to figure out whether a particular cookie is used for tracking or not. the use of cookies is transparent to the end user. In order to "tag" a user, it is not even necessary to store a unique identifier in a separate cookie - it can be collected from the values of several cookies or stored in metadata such as Expiration Time.

Local Shared Objects
To store data on the client side, Adobe Flash uses the LSO mechanism. It is analogous to cookies in HTTP, but unlike the latter, it can store not only short fragments of text data, which, in turn, complicates the analysis and verification of such objects. Prior to version 10.3, the behavior of flash cookies was configured separately from the browser settings: you had to visit the Flash settings manager located on the site macromedia.com (by the way, it is still available). Today this can be done directly from the control panel. In addition, most modern browsers provide a fairly tight integration with a flash player: for example, when cookies and other site data are deleted, LSOs will also be deleted. On the other hand, the interaction of browsers with the player is not yet that close, so setting a browser policy for third-party cookies will not always affect Flash cookies (you can see how to manually disable them on the Adobe website).

Silverlight Isolated Storage
The Silverlight software platform has quite a lot in common with Adobe Flash. So, the analogue of the flash Local Shared Objects is a mechanism called Isolated Storage. True, unlike flash, privacy settings are not tied to the browser in any way, so even if the cookies and browser cache are completely cleared, the data saved in Isolated Storagewill still remain. But even more interesting is that the storage is common for all browser windows (except those opened in the "Incognito" mode) and all profiles installed on the same machine. As with the LSO, from a technical point of view, there is no obstacle to storing session IDs. Nevertheless, given that it is not yet possible to reach this mechanism through the browser settings, it has not become so widespread as a repository for unique identifiers.

IsolatedStorage1.png

Where to Find Silverlight Isolated Storage

HTML5 and client-side data storage
HTML5 introduces a set of mechanisms for storing structured data on the client. These include localStorage, File API, and IndexedDB. Despite the differences, they are all designed to provide persistent storage of arbitrary portions of binary data tied to a specific resource. Plus, unlike HTTP and Flash cookies, there are no significant restrictions on the size of the stored data. In modern browsers, HTML5 storage sits alongside other site data. However, it is very difficult to guess how to manage the storage through the browser settings. For example, to remove data from localStorage in Firefox, the user will have to select offline website data or site preferences and set the time span to everything. Another extraordinary feature inherent only in IE is that the data exists only for the lifetime of the tabs opened at the time of their saving. Plus, the above mechanisms don't really try to follow the restrictions that apply to HTTP cookies. For example, you can write to localStorage and read from it via cross-domain frames even with third-party cookies disabled.

Cached objects
Everyone wants the browser to work quickly and without brakes. Therefore, he has to add the resources of the visited sites to the local cache (so as not to request them on a subsequent visit). Although this mechanism was clearly not intended to be used as a random access store, it can be turned into such. For example, the server can return a JavaScript document with a unique identifier inside its body to the user and set the headers to the Expires / max-age= far future. Thus, the script, and with it the unique identifier, will be written in the browser cache. After that, it will be possible to access it from any page on the Web by simply requesting the download of the script from a known URL. Of course, the browser will periodically ask with a header If-Modified-Sinceif a new version of the script has appeared. But if the server returns 304 ( Not modified), then the cached copy will be used forever. What else is interesting about the cache? There is no concept of "third party" objects, as is the case with HTTP cookies. At the same time, disabling caching can seriously affect performance. And the automatic detection of cunning resources that store some kind of identifiers / tags is difficult due to the large volume and complexity of JavaScript documents found on the Web. Of course, all browsers allow the user to manually clear the cache. But as practice shows (even our own example), this is done not so often, if at all.

ETag and Last-Modified
For caching to work properly, the server needs to somehow inform the browser that a newer version of the document is available. The HTTP / 1.1 standard offers two ways to accomplish this. The first is based on the date the document was last modified, and the second is based on an abstract identifier known as ETag. In the case of the ETag server, it initially returns the so-called version tag in the response header along with the document itself. On subsequent requests to the given URL, the client informs the server through a header If-None-Match this value associated with its local copy. If the version specified in this header is current, then the server responds with HTTP code 304 (Not Modified) and the client can safely use the cached version. Otherwise, the server sends a new version of the document with a new one ETag. This approach is somewhat similar to HTTP cookies - the server stores an arbitrary value on the client only to be read later. The other way Last-Modifiedaround using a header is to store at least 32 bits of data in a date string, which is then sent by the client to the server in the header If-Modified-Since. Interestingly, most browsers don't even require this string to be a date in the correct format. As with the user identification through the cached objects to ETag and Last-Modified does not affect the cookie deletion and site data, to get rid of them, you can only clear the cache.

ETag.png

Server returns ETag to client

HTML5 AppCache
Application Cache allows you to specify how much of the site should be stored on disk and be available even if the user is offline. Everything is controlled with the help of manifests, which specify the rules for storing and retrieving cache items. Like the traditional caching mechanism, AppCache also allows you to store unique user-specific data - both inside the manifest itself and inside resources, which are stored indefinitely (unlike a regular cache, from which resources are deleted after a certain period of time). AppCache is intermediate between HTML5 storage engines and regular browser cache. In some browsers it is cleared when you delete cookies and site data, in others only when you delete your browsing history and all cached documents.

SDCH dictionaries
SDCH is a Google-developed compression algorithm that relies on server-supplied dictionaries to achieve a higher compression level than Gzip or deflate. The fact is that in everyday life, a web server gives too much repetitive information - page headers / footers, embedded JavaScript / CSS, and so on. In this approach, the client receives from the server a dictionary file containing lines that may appear in subsequent responses (the same headers / footers / JS / CSS). After that, the server can simply refer to these elements inside the dictionary, and the client will independently collect the page based on them. As you can imagine, these dictionaries can be easily used to store unique identifiers, which can be placed as in the IDs of the dictionaries returned by the client to the server in the header Avail-Dictionaryand directly into the content itself. And then use it in the same way as in the case of a regular browser cache.

Other storage mechanisms
But there are more options. With the help of JavaScript and its peers, a unique identifier can be stored and requested in such a way that it stays alive even after deleting all browsing history and site data. As one of the options, it can be used for storage window.name or sessionStorage... Even if the user clears all cookies and site data, but does not close the tab in which the tracking site was opened, then on the next visit, the identifying token will be received by the server and the user will be again tied to the data already collected about it. The same behavior is observed with JS, any open JavaScript context retains its state, even if the user deletes the site data. Moreover, such JavaScript can not only belong to the displayed site, but also hide in iframes, web workers, and so on. For example, an ad loaded in an iframe will ignore the deletion of the browsing history and site data at all and will continue to use the identifier stored in a local variable in JS.

Protocols
In addition to mechanisms related to caching, the use of JS and various plugins, modern browsers have several other network features that allow you to store and retrieve unique identifiers.
  1. Origin Bound Certificates (aka ChannelID) are persistent self-signed certificates that identify a client to an HTTPS server. For each new domain, a separate certificate is created, which is used for connections initiated later. Sites can use OBC to track users without taking any action that would be visible to the customer. As a unique identifier, you can take the cryptographic hash of the certificate provided by the client as part of a legitimate SSL handshake.
  2. Similarly, TLS also has two mechanisms - session identifiers and session tickets, which allow clients to resume interrupted HTTPS connections without performing a full handshake. This is achieved by using cached data. These two mechanisms allow servers to identify requests originating from one client within a short period of time.
  3. Almost all modern browsers implement their own internal DNS cache to speed up the name resolution process (and in some cases reduce the risk of DNS rebinding attacks). Such a cache can easily be used to store small amounts of information. For example, if you have 16 available IP addresses, about 8-9 cached names would be enough to identify every computer on the network. However, this approach is limited by the size of the browser's internal DNS cache and can potentially lead to name resolution conflicts with the provider's DNS.

Machine characteristics
All the methods considered before were based on the fact that a unique identifier was set for the user, which was sent to the server on subsequent requests. There is another, less obvious approach to user tracking that relies on querying or measuring the characteristics of the client machine. Individually, each characteristic obtained represents only a few bits of information, but if several are combined, they can uniquely identify any computer on the Internet. In addition to the fact that such surveillance is much more difficult to recognize and prevent, this technique will allow you to identify the user who is using different browsers or using private mode.

Browser fingerprints
The simplest approach to tracking is to construct identifiers by combining a set of parameters available in the browser environment, each of which individually is not of any interest, but together they form a value that is unique for each machine:
  • User-Agent. Gives the browser version, OS version and some of the installed addons. In cases where the User-Agent is missing or you want to check its "veracity", you can determine the browser version by checking for the presence of certain features, implemented or changed between releases.
  • Clock run. If the system does not synchronize its clock with a third-party time server, then sooner or later they will start to lag behind or rush, which will generate a unique difference between real and system time, which can be measured with microsecond precision using JavaScript ... In fact, even when synchronizing with an NTP server, there will still be small deviations that can also be measured.
  • Information about CPU and GPU. It can be obtained both directly (through GL_RENDERER) and through benchmarks and tests implemented using JavaScript.
  • Monitor resolution and browser window size (including second monitor settings in case of a multi-monitor system).
  • List of fonts installed on the system, obtained, for example, using the getComputedStyle API.
  • List of all installed plugins, ActiveX controls, Browser Helper Objects, including their versions. You can get it by brute force navigator.plugins[] (some plugins give out their presence in HTTP headers).
  • Information about installed extensions and other software. Extensions such as ad blockers make certain changes to the pages you view, by which you can determine what the extension is and its settings.

Network fingerprints
A number of other signs lie in the architecture of the local network and the configuration of network protocols. Such signs will be typical for all browsers installed on the client machine, and they cannot be simply hidden using privacy settings or some kind of security utilities. They include:
  • External IP address. For IPv6 addresses, this vector is especially interesting, since the last octets in some cases can be obtained from the MAC address of the device and therefore persist even when connected to different networks.
  • Port numbers for outgoing TCP / IP connections (usually selected sequentially for most operating systems).
  • Local IP address for users behind NAT or HTTP proxy. Together with an external IP address, it allows you to uniquely identify the majority of clients.
  • Information about the proxy servers used by the client, obtained from the HTTP header ( X-Forwarded-For). In combination with the real address of the client, obtained through several possible proxy bypass methods, it also allows the user to be identified.

Behavioral Analysis and Habits
Another option is to look towards characteristics that are not tied to the PC, but rather to the end user, such as locale and behavior. This method will again allow you to identify clients between different browser sessions, profiles, and in the case of private browsing. Conclusions can be made based on the following data, which are always available for study:
  • Preferred language, default encoding and timezone (this all lives in HTTP headers and is accessible from JavaScript).
  • Data in the client's cache and its browsing history. Cache entries can be detected using timing attacks - a snooper can detect long-lived cache entries from popular resources simply by measuring the time from the download (and canceling the transition if the time exceeds the expected load time from the local cache). It is also possible to extract URLs stored in the browser's browsing history, although such an attack in modern browsers would require little user interaction.
  • Mouse gestures, the frequency and duration of keystrokes, data from the accelerometer - all these parameters are unique for each user.
  • Any changes to the standard site fonts and their sizes, zoom level, use of special features, such as text color, size.
  • The state of certain browser features configured by the client: third-party cookie blocking, DNS prefetching, pop-up blocking, Flash security settings, and so on (ironically, users who change the default settings actually make their browser much easier to identify) ...
And these are just the obvious options that lie on the surface. If you dig deeper, you can come up with more.

Let's summarize
As you can see, in practice there are many different ways to track a user. Some of them are the fruit of implementation errors or omissions and theoretically can be corrected. Others are almost impossible to eradicate without completely changing the way computer networks, web applications, and browsers work. Some techniques can be counteracted - clearing the cache, cookies and other places where unique identifiers can be stored. Others work completely invisible to the user, and it is unlikely that it will be possible to protect against them. Therefore, the most important thing is to remember when surfing the Web even in private browsing mode that your movements can still be tracked.

xakep.ru
 
Top