Tools for cleaning up your digital history on the Internet

Carder

Every step on the Internet leaves a mark. Over the years, a long trail of personal data accumulates. It is available to outsiders, who will probably try to get the most out of your information.

As the recent leak of data on all Citymobil cars showed, people can be tracked effectively even in anonymized data sets. If you combine several anonymized databases, you can reliably establish the identity of a particular person.

This is difficult to deal with, but it is possible. For example, we will try to delete data sets that have accumulated in various Internet services. Let's clean up your Internet history in full.

Total surveillance​

h5F8ZCgte6g.jpg

Google's advanced data collection and Analytics system

Google's data collection and analytics system is considered one of the most advanced in the world. Google's video, mail, and map services each have more than 1 billion users. The company uses the ubiquity of its products to track user behavior online and in real life, and then target users with paid ads. Google's revenue directly depends on the accuracy of targeting and the breadth of data collected.

Experts from the organization Digital Content Next and Vanderbilt University published the results of the Google Data Collection study with some facts that speak about the total surveillance of people by Google:
  • An Android smartphone with the active Chrome browser in the background transmits location information to Google 340 times over a 24-hour period, which means an average of 14 data transfers per hour. In fact, location information accounts for 35% of all sample data sent to Google.
  • Google may associate anonymous data collected by passive means with the user's personal information. Google establishes this connection mainly through advertising systems, many of which it controls. Advertising IDs that correspond to "anonymous users" collect data about activity in applications and visits to third-party web pages. They can be linked to real Google users by transmitting identification information to Google servers at the Android device level.
  • The DoubleClick cookie, which tracks user activity on third-party web pages, is another example of an "anonymous" identifier that Google can associate with a Google account. A link is established if the user accesses a Google app in the same browser that previously opened the third-party web page.
  • Most of Google's data collection takes place at a time when the user is not interacting directly with any of Google's products. The scale of the collection is very significant. At the same time, the Android smartphone is probably the most popular personal gadget in the world. It is carried around the clock by 2 billion people.
Nowadays, many people are outraged by social networks, including Facebook, where people voluntarily upload huge amounts of private information, including personal photos and personal correspondence in unencrypted form. But in reality, Google has no fewer opportunities for total surveillance.

One day in the life of a typical Google user​

Here's how Google tracks people's activity across the various services of its Internet Empire (from the Google Data Collection report):

ROHKhePTkSw.jpg


Deleting a digital history​

So how do we deal with the invisible enemy that is sucking up our data from thousands of sources? If not websites, then pharmacies. If not location data, then store cards. If not likes, then bank accounts and bank statements. Our data is everywhere. Welcome to the age of privacy nihilism. Researchers claim that it is almost impossible to hide from Google's surveillance. But we'll try.

Mail​

Nowadays, email services offer a large amount of cloud storage completely free of charge. Of course, they do this not out of generosity, but to accumulate as much user data as possible for data mining, analysis, and profiling. In the end, this lets them exploit the service's users more effectively as an advertising audience, which generates the main profit of Internet companies.

Google and other services expect that you will not delete old messages, which will then remain in their possession almost forever. If you really need this multi-year archive, then you can keep it. Otherwise, it is better to delete the old messages. This frees up storage space, speeds up searching the archive, and is simply good digital hygiene.

The specific procedure for clearing the archive depends on the client and the service. In the case of Gmail, there is no automatic way to erase old emails, so you need to perform such cleaning manually on a regular basis. This is done with the search operator older_than:, which specifies the desired time period. For example, the query older_than:1y shows all emails older than a year, and older_than:6m shows all emails older than 6 months.

When the search results appear on the screen, you can select all messages (check the box in the upper-left corner) and delete them.

To avoid deleting everything, you can combine the query with other search terms. For example, the query

Code:
older_than:1y -is:important

displays all emails older than one year that Gmail has not marked as important. For the full list of Gmail search operators, see the Gmail help pages.
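As a rough illustration of how such queries are composed, here is a minimal Python sketch. The function name gmail_cleanup_query and its parameters are hypothetical; only the operators older_than:, is:important, has:attachment and the leading minus sign for negation are Gmail's own. The function only builds the search string, which you then paste into the Gmail search box.

```python
def gmail_cleanup_query(age: str, keep_important: bool = True, extra: str = "") -> str:
    """Build a Gmail search string that selects old mail for deletion.

    `age` uses Gmail's own suffixes: d (days), m (months), y (years).
    """
    if not age[:-1].isdigit() or age[-1] not in "dmy":
        raise ValueError(f"unsupported age: {age!r}")
    parts = [f"older_than:{age}"]
    if keep_important:
        # A leading minus negates an operator: skip mail marked as important.
        parts.append("-is:important")
    if extra:
        parts.append(extra)
    return " ".join(parts)


print(gmail_cleanup_query("1y"))
print(gmail_cleanup_query("6m", keep_important=False, extra="has:attachment"))
```

Keeping the "important" exclusion on by default is a deliberate choice here: a mass deletion is hard to undo, so the safer query is the default one.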

Other email clients may not have advanced search operators like Gmail, so it's harder to select and delete messages. But in any case, the function of sorting emails by date must be present to see the oldest messages in the archive.

For maximum security on the Internet, it is better to store the message archive not on the server, but locally on a personal computer. Any local mail client can do this, for example The Bat!, which downloads all emails and immediately deletes them from the mail server, so that they are not stored there at all:

sve6Zg10M3M.jpg

Automatically deleting all received emails from the Gmail mail server in The Bat! mail client

Social media​

Which social networks do you write the most messages on? This can be Facebook, Vkontakte, or Twitter. In any case, it makes no sense to archive old messages that are unlikely to be useful for you, but can (and will) be used against you for sure.

In some social networks, you can even download and save an archive of your messages just in case — and store it in an encrypted personal storage. And then remove it from public access.

To download an archive of your Twitter messages, go to Settings > Your account > Download an archive of your data.

Then we start deleting old records.

The two best tools for automatic deletion are TweetDelete and Tweet Deleter, which are similar not only in name, but also in how they work. They automatically delete tweets as soon as the specified time limit passes after they are posted. Tweet Deleter gives you a little more control over which tweets to delete, but TweetDelete has more features available for free.

You can delete tweets once, or keep the service running permanently, like a daemon in Linux (for example, cleaning out outdated tweets once a week).

FV5D4qImbhI.jpg

Also worth mentioning is the Jumbo app (Android and iOS versions), which deletes old messages on Twitter and Facebook as soon as they reach a certain age, saving them inside the app in local storage. This is definitely an easy-to-use option for hiding your social media footprints. Certain functionality for Facebook and Twitter is included in the free Jumbo account, so you don't have to buy the paid version.

Facebook, Instagram, and Vkontakte don't have built-in functions for quickly deleting all old messages, so you'll have to use third-party tools like Jumbo or delete messages manually one at a time. Alternatively, you can publish Facebook and Instagram stories from the start, which disappear automatically after 24 hours.

But for the sake of digital hygiene, it is better to reduce the use of social networks to the most necessary minimum, and transfer communication to instant messengers with end-to-end encryption. They store your private messages only in a securely encrypted form, or they don't store them at all (depending on the messenger).

Files​

Deleting old files in the cloud is not so much about protecting against information leaks or some kind of espionage, but rather maintaining order and saving money when using paid cloud storage. Such actions cause direct losses to the cloud service. Therefore, it is not surprising that some services do not have a standard function for automatically deleting old (or unnecessary) files. Although there are a few tricks you can try.

In Dropbox, you can click next to the column header and select the "modified date" sorting option to see the oldest files that you haven't edited in a long time. This applies to files only in a specific folder. If there are folders with temporary and less important files, you can quickly view and delete the oldest files by sorting them by date of modification.

On Google Drive, enable list view and click on the "Last modified" column header. The up and down arrows switch between viewing the latest or oldest changes.

Google Drive also supports search queries like before:2011-01-01 in the main search box to find files that were last modified before a certain date. Use Ctrl+Click to select several files on your Google Drive, then click the trash icon to delete them.

OneDrive and iCloud also have similar options with sorting by last modified date. These manual operations are not as convenient as automatic tools, but even if you just run them once every couple of months, you can delete a lot of files that are no longer needed.
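If the cloud folder is also synced to your computer by the service's desktop client (an assumption, this is not covered above), the same "oldest first" review can be sketched locally in Python. The function name files_older_than is hypothetical. The sketch only lists candidates sorted by modification time and deletes nothing, so the final decision stays manual.

```python
import time
from pathlib import Path
from typing import List, Optional


def files_older_than(folder: Path, days: int, now: Optional[float] = None) -> List[Path]:
    """Return files under `folder` not modified for more than `days` days, oldest first."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds in a day
    old = [p for p in folder.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff]
    return sorted(old, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Hypothetical location of a locally synced cloud folder.
    synced = Path.home() / "Dropbox"
    if synced.is_dir():
        for path in files_older_than(synced, days=365):
            print(path)
```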

Online activity​

When it comes to automatically deleting the personal data that an Internet company has collected about you while you browse the web, Google offers the most advanced options. Although it is fair to say that it also leads in collecting this data…

Companies like Apple and Microsoft don't really need tools as advanced as Google's, because they simply don't collect such huge amounts of information about users for ad targeting.

Log in to your Google account, where the "Privacy and personalization" button displays a page with information about what data Google collects about your online activity, search history, and location - both to personalize your work with apps and for targeted advertising. In all categories, you can choose the option of automatic deletion after 3, 18, or 36 months.

Individual pieces of data can be viewed (and deleted) from the main activity dashboard. For example, here you can erase the record of everything you said to your smart speaker over the past week.

Apart from Google, only one company collects data on such a gigantic scale — this is Facebook. Go to "General account settings", there is a section "Your Facebook information". You can view and delete some of this data, though without sorting by date.

In Russia, Yandex and Vkontakte can be added to the list of dangerous services that organize total surveillance of users. They carry specifically Russian risks associated with compliance with local legislation, so in this sense they are riskier to use than foreign services. For example, in Russia there were cases when users were prosecuted for reposts on the Vkontakte social network (Mail.ru), which actively shares user information with law enforcement agencies; earlier, Mail.ru handed over data even in circumvention of the procedures established by law.

The person did not delete their digital history, and it was available to investigators. The result was a criminal case for extremism.
 

Article: How do they track users? Your footprints on the web.​


I've always been bothered by the way addons compulsively served contextual ads based on my old search queries. It seems that quite a lot of time has passed since the search, and cookies and browser cache were cleared more than once, but ads remained. How did they keep tracking me? It turns out that there are plenty of ways to do this.

Short introduction
User identification, tracking, or simply web tracking involves calculating and setting a unique identifier for each browser that visits a particular site.
Since cookies were first introduced, a lot has changed: technology has moved far ahead, and user tracking is no longer limited to cookies alone.

Explicit identifiers
This approach is quite obvious: all that is required is to store some long-lived identifier on the user's side, which can be requested during a subsequent visit to the resource. There are plenty of places to keep it: HTTP cookies, Local Shared Objects in Flash, Isolated Storage in Silverlight, and the HTML5 localStorage, File, and IndexedDB APIs. In addition to these locations, unique tokens can also be stored in cached resources on the local machine or in cache metadata (Last-Modified, ETag).

Cookies
When it comes to storing some small amount of data on the client side, cookies are the first thing that usually comes to mind. The web server sets a unique identifier for the new user, storing it in a cookie, and the client sends it back to the server with all subsequent requests. Although all popular browsers have long been equipped with a user-friendly interface for managing cookies, and the Web is full of third-party utilities for managing and blocking them, cookies are still actively used for tracking users.
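The server side of this exchange can be sketched in a few lines of Python using the standard http.cookies module. This is only an illustration under assumptions: the cookie name uid, the ten-year lifetime, and the function name identify_visitor are all hypothetical, and a real tracker would sit inside a web framework.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional, Tuple

TEN_YEARS = 10 * 365 * 24 * 3600  # cookie lifetime in seconds (arbitrary choice)


def identify_visitor(cookie_header: str) -> Tuple[str, Optional[str]]:
    """Return (visitor_id, Set-Cookie value or None) for one incoming request.

    `cookie_header` is the raw value of the request's Cookie header ("" if absent).
    """
    jar = SimpleCookie(cookie_header)
    if "uid" in jar:
        # Returning visitor: the browser sent the identifier back by itself.
        return jar["uid"].value, None
    # New visitor: mint an identifier and ask the browser to keep it for years.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out["uid"] = visitor_id
    out["uid"]["max-age"] = TEN_YEARS
    out["uid"]["path"] = "/"
    return visitor_id, out["uid"].OutputString()


first_id, set_cookie = identify_visitor("")
second_id, _ = identify_visitor(f"uid={first_id}")
print(first_id == second_id)
```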

Local Shared Objects
Adobe Flash uses the LSO mechanism to store data on the client side. It is an analog of HTTP cookies, but unlike them, it can store more than just short fragments of text data, which in turn complicates the analysis and verification of such objects. Before version 10.3, the behavior of Flash cookies was configured separately from the browser settings: you had to visit the Flash Settings Manager located on the macromedia.com site. Today, this can be done directly from the control panel. In addition, most modern browsers provide fairly tight integration with the Flash player: for example, when you delete cookies and other site data, LSOs are deleted as well. On the other hand, the interaction of browsers with the player is still not that close, so the browser policy for third-party cookies will not always affect Flash cookies (the Adobe website explains how to disable them manually).
Deleting data from localstorage in Firefox

Isolated Silverlight storage
The Silverlight software platform has quite a lot in common with Adobe Flash. Its analog of Local Shared Objects is a mechanism called Isolated Storage. However, unlike Flash, the privacy settings here are not tied to the browser in any way, so even if the cookies and browser cache are completely cleared, the data stored in Isolated Storage will still remain. Even more interesting is that the storage is shared by all browser windows (except those opened in Incognito mode) and all profiles installed on the same machine. As with LSO, there are no technical barriers to storing session IDs. However, given that this mechanism cannot yet be reached through the browser settings, it has not become as widely used as a repository for unique identifiers.

IsolatedStorage1.png

Where to look for isolated Silverlight storage

HTML5 and data storage on the client
HTML5 provides a set of mechanisms for storing structured data on the client. These include localStorage, the File API, and IndexedDB. Despite their differences, they are all designed to provide permanent storage of arbitrary chunks of binary data tied to a specific resource. Plus, unlike HTTP and Flash cookies, there are no significant restrictions on the size of stored data. In modern browsers, the HTML5 storage is located along with other site data. However, it is very hard to guess how to manage this storage via the browser settings. For example, to delete data from localStorage in Firefox, the user has to select "offline website data" or "site preferences" and set the time range to "everything". Another unusual feature, unique to IE, is that the data exists only for the lifetime of the tabs that were open at the time it was saved. Plus, the above mechanisms don't really try to follow the restrictions that apply to HTTP cookies. For example, you can write to localStorage and read from it via cross-domain frames, even if third-party cookies are disabled.

Cached objects
Everyone wants the browser to work fast and without lag. Therefore, it has to store the resources of visited sites in a local cache (so as not to request them again during a subsequent session). Although this mechanism was clearly not intended to be used as random access storage, it can be turned into one. For example, the server can return a JavaScript document with a unique identifier inside its body and set the Expires / max-age headers to a point in the distant future. This way, the script and its unique identifier will be stored in the browser cache. After that, it can be accessed from any page on the Web, simply by requesting the script from a known URL. Of course, the browser will periodically use the If-Modified-Since header to ask whether a new version of the script is available. But if the server returns the 304 code (Not Modified), then the cached copy will be used forever. What else is interesting about the cache? There is no concept of "third-party" objects, as there is, for example, with HTTP cookies. At the same time, disabling caching can seriously affect performance. And automatic detection of tricky resources that store identifiers or tags is difficult due to the large volume and complexity of JavaScript documents found on the Web. Of course, all browsers allow the user to manually clear the cache.
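The server logic described here fits in a few lines. Below is a minimal Python sketch under assumptions: the function name serve_tracking_script, the variable window.visitorId, and the fixed Last-Modified date are all hypothetical, and the function only returns a (status, headers, body) tuple instead of running a real HTTP server.

```python
import uuid
from typing import Dict, Tuple

FAR_FUTURE = 10 * 365 * 24 * 3600  # cache lifetime in seconds (about ten years)


def serve_tracking_script(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Answer a request for the tracking script with (status, headers, body)."""
    if "If-Modified-Since" in request_headers:
        # The browser is revalidating its cached copy. Claim nothing changed,
        # so the old script (and the identifier inside it) stays in the cache.
        return 304, {}, b""
    # First request from this browser: bake a fresh identifier into the script body.
    visitor_id = uuid.uuid4().hex
    body = f'window.visitorId = "{visitor_id}";'.encode("ascii")
    headers = {
        "Content-Type": "application/javascript",
        "Cache-Control": f"max-age={FAR_FUTURE}",
        "Last-Modified": "Thu, 01 Jan 2015 00:00:00 GMT",
    }
    return 200, headers, body


status, headers, body = serve_tracking_script({})
print(status, headers["Cache-Control"])
```

Any page that includes this script from the same URL then reads the identifier from the cached copy without the server having to set a cookie.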

ETag and Last-Modified
In order for caching to work correctly, the server must somehow inform the browser that a newer version of the document is available. The HTTP/1.1 standard offers two ways to solve this problem. The first is based on the date when the document was last modified, and the second is based on an abstract identifier known as ETag. In the case of ETag, the server initially returns the so-called version tag in the response header along with the document itself. For subsequent requests to the specified URL, the client tells the server, via the If-None-Match header, the value associated with its local copy. If the version specified in this header is up-to-date, the server responds with the 304 (Not Modified) HTTP code, and the client can safely use the cached version. Otherwise, the server sends a new version of the document with a new ETag. This approach is somewhat similar to HTTP cookies: the server stores an arbitrary value on the client only to read it later. The other method, using the Last-Modified header, allows you to store at least 32 bits of data in a date string, which the client then sends to the server in the If-Modified-Since header. Interestingly, most browsers don't even require this string to represent a date in the correct format. Just as with user identification via cached objects, deleting cookies and site data does not affect ETag and Last-Modified in any way. You can only get rid of them by clearing the cache.

ETag.png

The server returns an ETag to the client
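Both variants can be sketched in Python. This is an illustration under assumptions, not a real tracker: the function names etag_response, id_to_last_modified and last_modified_to_id are hypothetical, and request headers are passed in as a plain dictionary. The first function abuses ETag as an identifier; the other two show how a 32-bit number survives a round trip through a date string.

```python
import uuid
from email.utils import formatdate, parsedate_to_datetime
from typing import Dict, Tuple


def etag_response(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], str]:
    """Return (status, response headers, visitor id) for one request."""
    tag = request_headers.get("If-None-Match")
    if tag:
        # The browser echoed the tag back: that tag *is* the visitor identifier.
        return 304, {"ETag": tag}, tag.strip('"')
    visitor_id = uuid.uuid4().hex
    return 200, {"ETag": f'"{visitor_id}"'}, visitor_id


def id_to_last_modified(visitor_id: int) -> str:
    """Encode a 32-bit identifier as an HTTP date (seconds since 1970)."""
    return formatdate(visitor_id, usegmt=True)


def last_modified_to_id(header_value: str) -> int:
    """Decode the identifier from the If-Modified-Since value sent by the client."""
    return int(parsedate_to_datetime(header_value).timestamp())


status, headers, vid = etag_response({})
status2, _, vid2 = etag_response({"If-None-Match": headers["ETag"]})
print(status, status2, vid == vid2)
print(last_modified_to_id(id_to_last_modified(305441741)))
```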

HTML5 AppCache
Application Cache allows you to specify which part of the site should be saved to disk and remain accessible even if the user is offline. Everything is managed using manifests, which set rules for storing and retrieving cache elements. Similar to the traditional caching mechanism, AppCache also allows you to store unique, user-specific data, both inside the manifest itself and inside resources that are stored indefinitely (unlike a regular cache, from which resources are deleted after some time).

SDCH dictionaries
SDCH is a compression algorithm developed by Google that uses dictionaries provided by the server and achieves a higher level of compression than gzip or deflate. The point is that in everyday use a web server returns a lot of repetitive information: page headers and footers, embedded JavaScript and CSS, and so on. In this approach, the client receives a dictionary file from the server containing strings that may appear in subsequent responses (the same headers/footers/JS/CSS). A unique identifier can be placed both in the dictionary ID, which the client sends back in the Avail-Dictionary header, and directly into the content itself. It can then be used in the same way as in the case of a regular browser cache.

Other storage mechanisms
But these are not all the options. With the help of JavaScript and its companions, you can save and request a unique identifier so that it stays alive even after the entire browsing history and site data are deleted. One option is to use window.name or sessionStorage for storage. Even if the user clears all cookies and site data but does not close the tab where the tracking site was opened, the server will receive the identification token on the next visit and the user will again be linked to the data already collected about them. The same applies to JavaScript in general: any open JavaScript context retains its state, even if the user deletes the site data. Moreover, such JavaScript can not only belong to the displayed site, but also hide in iframes, web workers, and so on.

Protocols
In addition to the mechanisms associated with caching, the use of JS and various plugins, modern browsers have several other network features that allow you to store and retrieve unique identifiers.
  1. Origin Bound Certificates (aka ChannelID) are persistent self-signed certificates that identify the client to the HTTPS server. A separate certificate is created for each new domain and is used for connections initiated later. Sites can use OBC to track users without taking any actions that would be visible to the client. As a unique identifier, you can use the cryptographic hash of the certificate provided by the client as part of a legitimate SSL handshake.
  2. Similarly, TLS also has two mechanisms, session identifiers and session tickets, which allow clients to resume interrupted HTTPS connections without performing a full handshake. This is achieved by using cached data. These two mechanisms allow servers to identify requests originating from a single client over a short period of time.
  3. Almost all modern browsers implement their own internal DNS cache to speed up the name resolution process (and in some cases reduce the risk of DNS rebinding attacks). This cache can easily be used to store small amounts of information. For example, if you have 16 available IP addresses, about 8-9 cached names will be enough to identify each computer on the Web. However, this approach is limited by the size of the browser's internal DNS cache and can potentially lead to name resolution conflicts with the provider's DNS.
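The arithmetic behind the DNS cache example can be checked with a short Python sketch. With 16 possible addresses, each cached name carries log2(16) = 4 bits, so 8 names give 32 bits (about 4.3 billion distinct values) and 9 names give 36 bits. The function names encode_id and decode_id are hypothetical; the sketch only models the encoding, not real DNS traffic.

```python
import math
from typing import List


def encode_id(visitor_id: int, num_names: int, num_addresses: int = 16) -> List[int]:
    """Split an identifier into one address index per cached host name."""
    digits = []
    for _ in range(num_names):
        visitor_id, digit = divmod(visitor_id, num_addresses)
        digits.append(digit)  # which of the addresses this name resolves to
    if visitor_id:
        raise ValueError("identifier does not fit into the given number of names")
    return digits


def decode_id(digits: List[int], num_addresses: int = 16) -> int:
    """Rebuild the identifier from the addresses the browser still has cached."""
    value = 0
    for digit in reversed(digits):
        value = value * num_addresses + digit
    return value


bits_per_name = math.log2(16)
print(bits_per_name * 8, bits_per_name * 9)  # capacity in bits for 8 and 9 names
indexes = encode_id(0xDEADBEEF, num_names=8)
print(indexes, hex(decode_id(indexes)))
```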

Machine specifications
All the methods considered so far relied on assigning the user a unique identifier, which was then sent to the server with subsequent requests. There is another, less obvious approach to tracking users that relies on querying or measuring the characteristics of the client machine. Individually, each characteristic represents only a few bits of information, but if you combine several of them, they can uniquely identify any computer on the Internet.

Browser's "fingerprints"
The simplest approach to tracking is to build identifiers by combining a set of parameters available in the browser environment, each of which individually is not of any interest, but together they form a unique value for each machine:
  • User-Agent. Returns the browser version, OS version, and some of the installed add-ons. In cases where the User-Agent is missing or you want to check its "veracity", you can determine the browser version by checking for certain features implemented or changed between releases.
  • Clock skew. If the system does not synchronize its clock with a third-party time server, then sooner or later it will start to run slow or fast, which creates a unique difference between real and system time that can be measured with microsecond accuracy using JavaScript. In fact, even when syncing with an NTP server, there will still be small deviations that can also be measured.
  • CPU and GPU information. You can get it either directly (via GL_RENDERER), or through benchmarks and tests implemented using JavaScript.
  • Monitor resolution and browser window size (including parameters of the second monitor in the case of a multi-monitor system).
  • A list of fonts installed in the system, obtained, for example, using the getComputedStyle API.
  • A list of all installed plugins, ActiveX controls, and Browser Helper Objects, including their versions. You can get it by enumerating navigator.plugins[] (some plugins show their presence in HTTP headers).
  • Information about installed extensions and other software. Extensions such as ad blockers make certain changes to the pages viewed, which can be used to determine what kind of extension it is and its settings.
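
How the individual parameters from the list above are combined into one identifier can be illustrated with a minimal Python sketch. It assumes the attributes have already been collected on the client and sent to the server; the attribute names and values, the function name fingerprint, and the choice of SHA-256 truncated to 16 hex characters are all illustrative assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(attributes: Dict[str, Any]) -> str:
    """Hash a set of browser attributes into one short, stable identifier."""
    # Serialize with sorted keys so the same attributes always give the same hash.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical values for one visitor.
visitor = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0",
    "screen": [1920, 1080],
    "gl_renderer": "Mesa Intel(R) UHD Graphics 620",
    "fonts": ["DejaVu Sans", "Liberation Serif", "Noto Color Emoji"],
    "plugins": [],
}
print(fingerprint(visitor))
```

None of the values is unique on its own, but a change in any single one (for example, one extra font) produces a different hash, which is why the combination works as an identifier.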

Network "fingerprints"
A number of other features are found in the architecture of the local network and the configuration of network protocols. Such signs will be common for all browsers installed on the client machine, and they can't just be hidden using privacy settings or some security utilities. These include:
  • External IP address. For IPv6 addresses, this vector is particularly interesting, since in some cases the last octets can be obtained from the device's MAC address and therefore be preserved even when connected to different networks.
  • Port numbers for outgoing TCP/IP connections (usually selected sequentially for most operating systems).
  • Local IP address for users who are behind a NAT or HTTP proxy. Combined with an external IP address, it allows you to uniquely identify most clients.
  • Information about the proxy servers used by the client, obtained from the HTTP header (X-Forwarded-For). In combination with the real client address obtained through several possible proxy bypass methods, it also allows user identification.
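
The IPv6 case from the list above can be made concrete. When a device builds its address from the MAC address using the modified EUI-64 scheme, the lower 64 bits contain the MAC with the bytes ff:fe inserted in the middle and one bit flipped, so the MAC can be recovered from the address. A minimal Python sketch follows; the function name mac_from_ipv6 and the sample addresses from the 2001:db8::/32 documentation range are illustrative, and addresses generated with privacy extensions do not follow this pattern.

```python
import ipaddress
from typing import Optional


def mac_from_ipv6(address: str) -> Optional[str]:
    """Recover the MAC address from an EUI-64 based IPv6 address, if there is one."""
    interface_id = ipaddress.IPv6Address(address).packed[8:]  # lower 64 bits
    if interface_id[3:5] != b"\xff\xfe":
        return None  # not built from a MAC (e.g. random privacy address)
    # Drop the inserted ff:fe and flip the universal/local bit back.
    mac = bytes([interface_id[0] ^ 0x02]) + interface_id[1:3] + interface_id[5:8]
    return ":".join(f"{byte:02x}" for byte in mac)


# The same device on two different networks: the prefix changes, the tail does not.
print(mac_from_ipv6("2001:db8:1::21a:2bff:fe3c:4d5e"))
print(mac_from_ipv6("2001:db8:2::21a:2bff:fe3c:4d5e"))
```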

Behavioral analysis and habits
Another option is to look in the direction of characteristics that are not tied to the PC, but rather to the end user, such as regional settings and behavior. This method again allows you to identify clients between different browser sessions, profiles, and in the case of private browsing.

You can draw conclusions based on the following data, which is always available for study:
  • Preferred language, default encoding, and time zone (all of this lives in HTTP headers and is accessible from JavaScript).
  • Data in the client's cache and its browsing history. Cache elements can be detected using timing attacks: the tracker can detect long-lived cache elements related to popular resources by simply measuring the load time (and canceling the request if the time exceeds the expected load time from the local cache). You can also extract URLs stored in the browser's browsing history, although in modern browsers such an attack requires a little user interaction.
  • Mouse gestures, the frequency and duration of keystrokes, and data from the accelerometer: all these parameters are unique for each user.
  • Any changes to the site's standard fonts and their sizes, zoom level, and use of special features such as text color and size.
  • The state of certain browser features configured by the client: blocking third-party cookies, DNS prefetching, blocking pop-ups, Flash security settings, and so on (ironically, users who change the default settings actually make their browser much easier to identify).

And these are just the obvious options that lie on the surface. If you dig deeper you can come up with more.

To summarize
As you can see, in practice, there are a large number of different ways to track a user. Some of them are the result of implementation errors or omissions and can theoretically be corrected. Others are almost impossible to eradicate without completely changing the principles of computer networks, web applications, and browsers. You can counteract some techniques by clearing the cache, cookies, and other places where unique identifiers can be stored. Others work completely unnoticed by the user, and you are unlikely to be able to protect yourself from them.

xakep.ru
 