We looked at 1 billion URLs and found that some 150 million of them include sensitive content related to Health, Political Beliefs, Sexual Orientation, etc.

… and of course everything is tracked 🙂

The European General Data Protection Regulation (GDPR) includes specific clauses that restrict the collection and processing of sensitive personal data, defined as any data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, as well as genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, or data concerning a natural person's sex life or sexual orientation.

The above sets the tone for the treatment of sensitive personal data and provides a legal framework for filing complaints, conducting investigations, and even pursuing cases in court. Such measures are rather reactive, i.e., they take effect long after an incident has occurred. To further increase the protection of sensitive personal data, proactive measures should also be put in place. For example, a user's browser, or an add-on program, can inform the user whenever they visit URLs pointing to sensitive content. When on such sites, trackers can be blocked and complaints can be filed automatically. Implementing such services, of course, hinges on being able to automatically classify arbitrary URLs as sensitive or not.
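To make the proactive idea concrete, here is a minimal Python sketch of what such a browser-side check could look like. It is an illustration under assumptions, not an existing add-on: the function handle_page_visit, the tracker host list, and the is_sensitive callback are all hypothetical, and the classifier is deliberately left as a pluggable function because that is the hard part.

```python
from typing import Callable, Dict, Iterable, List, Union
from urllib.parse import urlparse

# Hypothetical tracker hosts, used for illustration only.
KNOWN_TRACKER_HOSTS = {"tracker.example", "ads.example"}


def handle_page_visit(
    page_url: str,
    third_party_requests: Iterable[str],
    is_sensitive: Callable[[str], bool],
) -> Dict[str, Union[bool, List[str]]]:
    """Decide how to react when the user opens page_url.

    is_sensitive stands in for a real URL classifier (an assumption here).
    """
    if not is_sensitive(page_url):
        # Non-sensitive page: no warning, nothing blocked.
        return {"warn_user": False, "blocked": []}

    # Sensitive page: warn the user and block requests to known trackers.
    blocked = [
        request
        for request in third_party_requests
        if urlparse(request).hostname in KNOWN_TRACKER_HOSTS
    ]
    return {"warn_user": True, "blocked": blocked}


# Toy stand-in classifier, purely for demonstration.
def toy_classifier(url: str) -> bool:
    return "clinic" in url


decision = handle_page_visit(
    "https://clinic.example/hiv-testing",
    ["https://tracker.example/pixel.gif", "https://cdn.example/app.js"],
    toy_classifier,
)
```

Here decision asks for a user warning and blocks only the request to the tracker host. Everything interesting is hidden behind is_sensitive, which is exactly the classification problem discussed next.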

However, determining what is truly sensitive is easier said than done. As discussed earlier, legal documents merely provide a list of sensitive categories, without any description or guidance on how to judge what content falls within each of them. This can lead to a fair amount of ambiguity since, for example, the word “Health” appears on web pages about chronic diseases, sexually transmitted diseases, and cancer, but also on pages about healthy eating, sports, and organic food. For humans it is easy to disambiguate and recognise that the former are sites about sensitive content, whereas the latter are not. The problem is further exacerbated by the fact that, within a web domain, different sections and individual pages may touch upon very diverse topics.
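The ambiguity is easy to reproduce with a toy example. The Python sketch below is a deliberately naive keyword matcher, not the approach from our paper; the function name and the two sample page texts are made up for illustration. It flags both a page about chronic disease and a page about healthy eating, simply because both mention "health".

```python
import re
from typing import Iterable


def naive_is_sensitive(page_text: str, keywords: Iterable[str] = ("health",)) -> bool:
    """Flag a page as sensitive if any keyword appears as a whole word."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return any(keyword in words for keyword in keywords)


# Made-up page texts, for illustration only.
pages = {
    "chronic": "Living with a chronic disease: health advice for patients",
    "recipes": "Healthy eating: organic food recipes for better health",
}

# Both pages are flagged, although only the first one is truly sensitive.
flags = {name: naive_is_sensitive(text) for name, text in pages.items()}
```

Both entries in flags come out as True, which is precisely the false positive a human reader would avoid without effort.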

In our recent paper (to appear in ACM IMC’20) we show how to train a series of machine learning classifiers that can differentiate truly sensitive websites from non-sensitive ones. We applied our classifier to over 1 billion URLs and found that more than 150 million include sensitive content. Furthermore, we checked whether such sensitive URLs are being tracked and found that, unfortunately, they are tracked just as much as the rest of the web…
