Privacy – Personal homepage

Podcast: The Future of Privacy with ProperData PIs

Episode 04: The Future of Privacy with ProperData PIs description

This week, we have Johanna, Umar and Hieu as hosts. It is a special episode, recording live, with a panel of guests in front of an audience. We’ll be unpacking the current state of privacy and exploring paths forward.
Our panelists are four professors from the ProperData Frontier project, which is the research center we’re a part of. ProperData spans 6 institutions, 11 professors, and more than 50 researchers. With us today are Athina Markopoulou (UC Irvine), Zubair Shafiq (UC Davis), David Choffnes (Northeastern University), and Nikolaos Laoutaris (IMDEA Networks).

Podcast available here.

Let’s stop tracking and instead build a personal data inter-network on top of the Web and the Internet. Here’s how …

What if instead of having to implement controversial user tracking techniques, Internet advertising & marketing companies asked explicitly to be granted access to user data by name and category, such as Alice–>Mobility–>05-11-2020? The technology for implementing this, already exists, and is none other than the Information Centric Networks (ICN) developed for over a decade in the framework of Next Generation Internet (NGI) initiatives.

Beyond named access to personal data, ICN’s in-network storage capability can be used as a substrate for retrieving aggregated, anonymized data, or even for executing complex analytics within the network, with no personal data leaking outside. In this opinion article, we discuss how ICNs combined with trusted execution environments and digital watermarking can be combined to build a personal data overlay inter-network in which users will be able to control who gets access to their personal data, know where each copy of said data is, negotiate payments in exchange for data, and even claim ownership, and establish accountability for data leakages due to malfunctions or malice.

Of course, coming up with concrete designs about how to achieve all the above will require a huge effort from a dedicated community willing to change how personal data are handled on the Internet. Our hope is that this opinion article can plant some initial seeds towards this direction. For more details check out our opinion column in ACM CCR.

We looked at 1 Billion URLs and found that some 150 Millions of them include sensitive content related to Health, Political Beliefs, Sexual Orientation, etc.

… and of course everything is tracked 🙂

The European General Data Protection Regulation (GDPR) includes specific clauses that put restrictions on the collection and processing of sensitive personal data, defined as any data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, also genetic data, biometric data for
the purpose of uniquely identifying a natural person, data concerning
health or data concerning a natural persons sex life or sexual orientation.

The above is setting the tone regarding the treatment of
sensitive personal data, and provides a legal framework for filing
complaints, conducting investigations, and even pursuing cases in
court. Such measures are rather reactive, i.e., they take effect long
after an incident has occurred. To increase further the protection
of sensitive personal data, proactive measures should also be put in
place. For example, the browser of a user, or an add-on program, can
inform the user whenever he visits URLs pointing to sensitive content. When on such sites, trackers can be blocked, and complaints
can be automatically filed. Implementing such services, of course,
hinges on being able to classify automatically arbitrary URLs as
sensitive or not.

However, determining what is truly sensitive is easier
said than done. As discussed earlier, legal documents merely provide a list of sensitive categories, but without any description, or
guidance about how to judge what content falls within each one of
them. This can lead to a fair amount of ambiguity since, for example,
the word “Health” appears both on web pages about chronic diseases, sexually transmitted diseases, and cancer, but also on pages
about healthy eating, sports, and organic food. For humans it is
easy to disambiguate and recognise that the former are sites about
sensitive content, whereas the latter, not so much. The problem
becomes further exacerbated by the fact that within a web domain,
different sections and individual pages may touch upon very diverse
topics.

In our recent paper (to appear in ACM IMC’20) we’ve shown how to train a series of machine learning classifiers that can differentiate truly sensitive web-sites from non-sensitive ones. We’ve applied our classifier on over 1 Billion URLs and found that more than 150 Millions include sensitive content. Furthermore, we checked to see if such sensitive URLs are being tracked and found out that, unfortunately, such URLs are as as tracked as the rest of the web…

Why online services should pay You for Your data? The arguments for a Human-Centric Data Economy

UPDATE: An extended version of this post appears as an opinion column in IEEE Internet Computing. Link to it here.

Data, and the economy around it, are said to be driving the fourth industrial revolution. Interestingly, the people — whose data are what moves the new economy — have a rather passive role in this economy, as they are left outside the direct value flow that transforms raw data into huge monetary benefits. This is a consequence of a de facto understanding (or, one may say, misunderstanding) between people and companies that the former receive unpaid access to online services in exchange for unpaid access to their personal data. This situation is increasingly being challenged by various voices calling for the establishment of a new and renegotiated relationship between users and services.

For example, technologist, philosopher, and writer Jaron Lanier argues in his 2012 book “Who Owns the Future?” that online services should return some of their revenue back to the people that feed with data their business models and AI algorithms. Lanier’s arguments include the following:

— Sustainability: The current economic model imposes serious privacy risks on individuals and the society at large, has lead to market failures in the form of large data monopolies and oligopolies, and may, in fact, even be a threat to employment in the future due to job loss from data-driven automation. Paying people for their data could, therefore, be an alternative to labor-based compensation in a future in which most work will be done by machines. Indeed, it was estimated recently that, if fair remuneration algorithms are set in place, a family of four could earn up to $20,000 per year from their data (Posner, 2018). The above figure may seem too small to be a full alternative to labor-based compensation, but can only increase as more and more sectors are being catalyzed by automation.

— Fairness: Paying people for their data, or the related idea of (universal) guaranteed minimum income are potential remedies for modern societal ailments such as increased income disparity, increased unemployment, and other labor-related challenges emerging in the context of machine learning automation, robots, 3D printing, self-driving cars, and other employment-threatening technologies. Critics of guaranteed minimum income are rejecting it as a non-sustainable form of charity. Instead, Lanier has argued that paying people for their data is an altruism-free idea compatible with modern capitalism for achieving the positive objectives of guaranteed minimum income without harming but instead benefiting the market, innovation, and investment in technology. The fundamental argument behind this position is a simple one: business models and machine learning algorithms have zero value without data, and, therefore, paying for those data is not charity but rather neoclassical economics. As Lanier notes “We have invented only half of the data driven industrial revolution — the part that compensates users in kind (i.e., service); we need to invent the other half that will provide explicit (monetary) benefits.”

But the arguments in favor of what I will henceforth refer to as a Human-Centric Data Economy, go beyond the above mentioned.

First, paying for data opens up the pathway to getting more data and data of higher quality, thereby increasing the size of the data economy. In other words, the data economy is not a zero-sum game between users and data collectors; paying the former does not have to harm the latter. It is not surprising, therefore, that the idea of paying or being taxed for data has been positively received by many, including industry leaders such as Bill Gates (Delaney, 2017), Elon Musk (Thomas, 2017), and Mark Zuckerberg (Gillespie, 2017).

Second, paying for data creates economic pressure on online services for applying data minimization principles. Currently, collecting and processing data costs close to zero and, therefore, services greedily collect all the data that they can, even when the actual information that they need is much less. The resulting privacy-related tragedy of the commons of the web can be avoided if online services a) collect the minimum amount of data that they need, and b) do so, only when the benefit that they create for society outweighs, and can therefore re-compensate, for the risks it creates. In other words, in the same way that factories and private cars pay for the amount of pollution that they impose on the environment, online services should pay for the privacy risks they impose on people. Currently, data minimization is just a principle quoted by data protection laws. To realize it in practice, it will require establishing the currently missing economic signals that would push the market in the right directions. Paying people for data is a direct way of achieving this. Two important related questions are, therefore:

What part of a service’s revenue should be returned to its users?
How should the total returned payoff be split among different users?

To answer the first question we are modeling the data value chain using tools like Nash Bargaining used before for modeling the inter-connection value chain on the Internet. To answer the second, we are using the Shapley Value to compute the relative importance of different users’ data on the decisions taken by machine learning algorithms for tasks like movie recommendation.

Stay tuned for more thoughts, ideas, and papers around Human-Centric Data Economies and please do contact me if interested to work or do a PhD in the area.

References

Delaney, K. J. (2017). The robot that takes your job should pay taxes, says Bill Gates. Retrieved from https://qz.com/911968/bill-gates-the-robot-that-takes-your-job-should-pay-taxes/

Gillespie, P. (2017). Mark Zuckerberg supports universal basic income. What is it? Retrieved from https://money.cnn.com/2017/05/26/news/economy/mark-zuckerberg-universal-basic-income/index.html

Posner, E. A. (2018). Radical Markets: Uprooting Capitalism and Democracy for a Just Society. Princeton University Press.

Thomas, L. (2017). Universal basic income debate sharpens as observers grasp for solutions to inequality. Retrieved from https://www.cnbc.com/2017/03/25/universal-basic-income-debate-sharpens.html

Data Transparency: Concerns and Prospects

[What follows is a preview of an opinion note to appear in the Nov. 2018 issue of Proceedings of IEEE (original article here). Many thanks to all the people that read early drafts and provided feedback and corrections.]

Introduction

The question of “how far?” should technologies and business models of the web go into collecting personal data of unassuming, or at best moderately informed citizens, appears to be one of the timeliest questions of our times. Indeed, whenever we read a news article, “like” a page on a social network, or “check in” to a popular spot, our digital trace collected, processed, fused, and traded among myriads of tracking, analytics, advertising, and marketing companies becomes an ever more accurate descriptor of our lives, our beliefs, our desires, our likes and dislikes. The resulting revenue from marketing & advertising activities driven by the digital traces of millions of people is what funds the free online services we have come to rely upon.

In this opinion note, I will lay down my thoughts around Data Transparency and its role in ongoing Data Protection & Privacy debates. The material draws upon my experiences from conducting research in the area over the last 6+ years, running the Data Transparency Lab’s Grant Program in 2015, 2016, and 2017, and attending several computer science, policy, marketing & advertising events. The objective of the note is to discuss the possibility and the likelihood of data transparency acting as an important positive catalyser of data protection problems, as well as to point towards concerns and challenges to be addressed in order for this to materialize. Most of the discussion applies to the use of personal data by marketers on the fixed and mobile web, but some parts may also be relevant to other online and offline use-cases and/or types of data (e.g., off-web health and financial data).

For years, the practice of collecting data on individuals at unprecedented scale was a non-issue for most people, for the simple reason that the public, and even governments, were just unaware of its magnitude, precision, and detail. The last few years, however, attitudes have started to change and the topic of privacy is increasingly appearing in the media and public discussions. This has stirred a huge public debate about who should be the rightful owner of personal data, and where to draw the red line of what is socially acceptable to track and monetize, and what is not. Indeed, the same tracking technology of cookies and tracking pixels used to detect one’s intention for buying new running sneakers, thus prompting an interesting discount at a nearby shop, can also be used to infer one’s medical condition, political affiliation, or sexual preference, thus delivering offers at the wrong time and place, or even worse, releasing the information to third parties that may use it in a discriminatory, excluding, or generally unfair manner. The former use of data has the potential to increase value for individuals, retailers, technology providers and the society, whereas the latter can be detrimental to the trust put by individuals and institutions in technology.

Tragedy of the Commons and the Web

What is particularly alarming is that the economics and incentives of exploiting personal data for advertising and marketing have all the characteristics of a “Tragedy of the Commons” (see Hardin [1]), in which consumer privacy and trust in the web and its business models are a shared commons that can be over-harvested to the point of destruction. The essence of the problem is that even if most technology companies manage to agree on a set of principles, there will always be sufficient temptation for some to push the boundaries in pursuit of greater gains, while inflicting a heavy cost to society. Indeed, from the narrow perspective of some companies, all it takes to pursue a business that involves intrusive and unethical collection of very sensitive personal data, is a paying customer. The above seems to be verified by examples appearing in the press of trackers that compile lists of anything from suspected alcoholics and HIV positive individuals, to active police-officers [13].

In essence, the narrow self-interest of a subset of data collection companies is eroding a valuable commons — the trust put by people in the web, or inversely their hope that nothing bad will happen to them by being carefree online. If, in the minds of citizens, ordering a drug online is associated with the risk of leaking medical info to health insurance companies, then they may very well abandon the web and just walk to a pharmacy. This means that the web, as big and successful as it is presently, is not invincible. It too can fall from grace like newspapers and broadcast-TV have in the past, albeit for other reasons. Loss of public trust appears to be the Achilles’ heel of the web and is being fuelled by questionable practices from large and small companies alike.

The Role of Transparency in Data Protection Debates

Transparency is often heard in debates about governance, business, science, and matters of public life in general. According to Wikipedia, transparency is about: “operating in such a way that it is easy for others to see what actions are performed. Transparency implies openness, communication, and accountability”. Transparency, in its different applications and contexts, largely embodies the famous quote of American Supreme Court justice Louis Brandeis that “Sunlight is said to be the best of disinfectants”.

In the context of data protection and privacy of online services, transparency can be understood as the ability to credibly answer questions such as:

What information is being collected (stored, and processed) about individuals online?
Who is collecting it?
How is it being collected?
How is it being used?
Is it leaking to other unintended recipients?
What are the consequences of such online leakage of private information?

Information leakage is a natural phenomenon in both the offline and online life. In the offline world, whenever we walk on a street or are seen at a public place, we are effectively giving up on our so called “location privacy”. Our clothes, hobbies, the car we may drive or the house where we live convey information about our financial status, employment, and taste. Similarly, in the online world, networks need to know where we are in order to deliver our calls, emails, or chat requests. Social networks need to display our real names so that our offline friends can also befriend us online. The above realizations give rise to a simple alternative to trying to unknot the “utility-vs.-privacy” tradeoff. Since we cannot stop all online leakage under the current technological paradigm, nor prescribe a generic, context unaware solution to the tradeoff, we can instead try to reduce information leakage, while keeping an eye open for controversial practices driven by collecting personal data that go against public sentiment or the letter of the law. Such an objective can be achieved on top of existing web technologies and business models without requiring some radical redesign. Transparency is the guiding light pointing to problematic technologies and business practices that will require revision, if we are to keep a safe distance from a tragedy of the commons on the web.

Transparency has already proved its worth in what is probably the greatest techno-policy debate preceding data protection — the Network Neutrality debate. Network neutrality is the simple principle, now turned into regulation and telecommunications law, that a network operator cannot delay or drop one type of traffic from a certain application in order to protect or expedite the rest. Almost a decade ago, the Network Neutrality debate was ignited by reports that some telecom companies were using Deep Packet Inspection (DPI) equipment to delay or block certain types of traffic, such as peer-to-peer (P2P) traffic from BitTorrent and other protocols. Unnoticed initially among scores of public statements and discussions, a group of computer scientists from Germany developed Glasnost [2]– a set of tools for checking whether a broadband connection was being subjected to P2P blocking. All a user had to do to check whether their ISP was blocking BitTorrent was to visit a web-page and click on a button that launched a series of simple tests — basically streaming two flows of data towards the user, one appearing to be P2P and one not. By comparing the corresponding reception data rates for the two flows the tool could say if P2P was throttled… voila anyone could check for themselves if their provider blocked P2P traffic or delivered the advertised speed of their plan. The existence of such easy-to-use tools created the right incentives that eventually obliged telecom companies to give up blocking or be open about it, and to indeed deliver the promised data rates.

The above network speed measurement tools are early examples of what I will henceforth call Transparency Software, to refer to any software purposefully built for providing transparency and for “shedding light” into public debates involving technology. An important idea, and frankly raison d’être for transparency software, is that complex technology can only be tamed by other, equally advanced, technology. Indeed, investigating complex data protection debates without specialized software for collecting evidence and testing hypotheses is like conducting pre-flight tests or periodic car inspections without specialized equipment that can test all the complex components of an airplane or modern car. In all these domains, there exist entire fields dedicated to developing formal methods for testing and verification. In the same way that air transportation, the car industry, health, and other domains have benefited from purpose-built testing tools, online data protection needs to develop its transparency methods and software.

Although the present article is not meant to be a complete survey, in the remainder I will discuss several examples of transparency software for things like: revealing in real time instances of Personally Identifiable Information (PII) leakage, detecting online price discrimination, detecting online targeted advertising, detecting advanced online tracking through fingerprinting, and others.

Transparency for Whom?

Figure 1: The role of transparency in the context of the greater data protection debate.

Figure 1 illustrates at the topmost level how and where Transparency plugs into the general data protection debate. Assuming that some use of personal data for online marketing in exchange for free service is deemed acceptable, and that the Tragedy of the Commons is not something we should leave to luck, the figure suggests that transparency can be beneficial for all three main stakeholders of the data protection discussion. Transparency should not be seen as an alternative to existing efforts of these stakeholders, but as an extra tool at their disposal.

For the online tracking & advertising industry, Transparency is essential to its efforts to convince government and citizens that it can effectively self-police and self-regulate the sector, making sure that individual companies do not perform actions that go against the public sentiment, or worse, data protection laws. The various Codes of Conduct, and Best Practices documents issued by sector representative bodies and organizations make lots of use of the term transparency, but they often get criticized as being mere intentions without any real means for enforcement and actual demonstration of commitment and application. This is where Transparency Software can play a key role, by allowing anyone to independently check that companies make good on their promises. A smartphone app that commits to not communicate PII back to its servers or other third parties can be checked by software such as ReCon [3], Lumen [4], and AntMonitor [5]. A web-site and its advertising partners that commit to not targeting minors can point users to Aditaur [6], a tool which anyone can use to verify the claim. In essence, the existence of such software allows the sector to make more credible and verifiable promises regarding its ability to self-regulate its data treatment practices.

For individual citizens, transparency is all about empowerment and freedom of choice. For every basic online service there is typically a magnitude of alternative service providers offering it, each one with a potentially different approach and sensitivity towards data collection. Users are accustomed to rating online services in terms of performance, simplicity, feature richness, but gauging the quality of data management practices has for the most part been out of reach. Being able to evaluate the quality over privacy-risk ratio of different services empowers users to select the one providing the right balance. For example, PrivacyMeter [7] can display in real time a risk score for every web-site visited by a user. If a user deems the risk of visiting, say, a news portal to be too high, they can opt for an alternative one with similar content but better performance in terms of privacy. By doing so, users emit clear signals to the industry and contribute through market pressure to pushing it towards the right direction.

Last but not least, government agencies, especially Data Protection Authorities (DPA), need transparency for both their proactive monitoring activities, as well as their investigative activities following complaints from citizens, watchdog groups, or other companies. Transparency software can help DPAs scale up their investigations and even pro-actively monitor for offending practices, something that does not appear to be possible via ad hoc manual investigations. For example, with AppCensus [8], entire marketplaces of mobile apps can be checked for leakage of PII information, by automatically analysing their binary executable distributions. Similarly, WebCensus [9] allows monitoring millions of domains every month to catalogue and rank their tracking practices. With Aditaur, DPAs can proactively check at scale thousands of popular domains for targeted advertising towards sensitive groups like children, or driven by sensitive personal data about health, political, or sexual orientation that are protected under EU’s GDPR law.

Transparency of What?

As mentioned before, some perceive data protection as the challenge of understanding and limiting the amount of personal information that can be leaked and collected online. Although this has an importance of its own, for others the main motivation and driver for discussing data protection matters is understanding and limiting the consequences of personal data leakage. Transparency is important and can help at both levels of the debate. Data transparency and corresponding tools like ReCon, Lumen, and AntMonitor are about revealing what data is leaking and where it is going. Algorithmic transparency on the other hand is looking at how personal data can end up fuelling biased, discriminatory, and generally unfair automated decision making that impacts the life of real people. For example, the Price $heriff [10] reveals whether a consumer searching for a product online is being subjected to online price discrimination by algorithms that decide on the spot a dynamic price for each customer based on their perceived willingness to pay, extracted from information about them. FDVT [11] measures in real time the different economic valuation that Facebook advertisers have about different users, by tallying up their advertisement bids for product placements. Algorithmic transparency, of course, goes beyond the online services mentioned above, and can be applied to a range of offline contexts from health, to finance, to justice, but such areas and applications go beyond the scope of the current note.

Challenges

Next, I discuss some of the hard challenges that need to be addressed if transparency is to make a positive dent upon privacy problems on the web.

Crowdsourcing Privacy Related Data

Several of the tools mentioned above are crowdsourced in nature, i.e., they rely on real users sharing their observations, in order to reverse engineer some aspect of an online service. eyeWnder [12], for example, relies on users reporting the advertisements seen on different pages in order to construct a crowdsourced database through which one can identify active advertising campaigns and the demographics of users targeted by them. Similarly, the Price $heriff relies on users reporting the price they see for the same product at the same site to detect instances of online price discrimination. Both tools use specialized encryption and anonymization techniques to protect the privacy of users that report back to the database the ads or the prices they’ve seen. Crowdsourcing is a powerful tool for detecting different types of discrimination and bias, but requires having in place a solution for protecting the privacy of users that contribute to the crowdsourced corpus of data. The above two tools use ad hoc techniques tailored to their specific function, but there is a clear need for developing more generic privacy-preserving crowdsourcing techniques and platforms that will make it easier to develop additional transparency tools for other problems.

Evaluation criteria / Reproducibility / Correctness

The appearance of several transparency tools in the last 2-3 years (of which I have touched only upon a small subset) is testament to the very significant amount of work that has been done in the area during a rather short amount of time. Still, the area is only in its infancy and thereby important requisites for growth and eventual maturity are yet to be fulfilled. One of them is establishing common criteria and metrics upon which different transparency tools looking into the same task will be compared. Having the ability to directly compare different approaches is fundamental for the evolution of the area, the validity, and the correctness of the findings. In the same spirit, the findings of a tool need to be reproducible. This is difficult to achieve when the tool operates in the wild. eyeWnder, for example, can label an ad banner as targeted today but it might be impossible to reproduce the result after a week since the underlying advertising campaign may no longer exist, or may have changed its target audience. Reproducibility goes hand in hand with the ability to compile extensive crowdsourced datasets on privacy matters upon which the reproducibility of a tool can be checked or its performance compared with alternative approaches on the same problem.

Bootstrapping / UX challenges / Outreach

Both privacy-preserving crowdsourcing, as well as the establishment of common evaluation criteria, are technical problems and, as such, something that a technical community of computer scientists and engineers knows how to handle. A different type of challenge for the area is finding enough users for these tools. For tools that are crowdsourced in nature, establishing an initial user base is a fundamental prerequisite for allowing them to derive credible and useful findings. Even for tools that can work by collecting data without user input (e.g., through crawling and scraping of information), having a user base outside research is largely a measure of true impact and success. To get there, we need to work on improving the usability aspects of such tools and adapting them to the needs and capacities of non-expert users. We also need to work on disseminating them and putting them in front of end users.

Conclusions

With the above, I hope I have convinced you that transparency can have an important role and contribution in contemporary data protection debates. In my mind, a very important first milestone is making sure that the online world is at least as transparent as the offline world. This may seem uninspiring on the surface, but it is actually a very difficult objective in practice. The scale of data collection online is at a totally different level from that offline. Rules and regulations are established for many offline activities, from credit rating, to equality of access to public services, whereas the online equivalents are left to chance. Finally, many transparency aspects that we take for granted in the offline world, are hard to achieve online. Take price discrimination as an example. Two customers walking to a coffee shop see the same price for “cafe latte” written on the wall. If the clerk charges one of them a different price, questions will immediately follow. In the online world, the two customers can order the same coffee at the same time and pay a totally different price without any of them ever realising. This is because in the online realm, the “world” around us is dynamically generated, thereby, we do not even have the benefit of a common reference. Checking for being “followed” or discriminated is more difficult online than offline. Of course, this is just under the current state of affairs. The same technology used for surveilling or discriminating at scale, can be flipped on its head, and used instead for shedding light and providing transparency at scale. This means that an online world that is safer and fairer than the offline one, is an open possibility that we should consider and pursue.

References

[1] G. Hardin, “The Tragedy of the Commons,” Science, Vol. 162, Issue 3859, pp.1243-1248, Dec. 1968.

[2] Glasnost: http://broadband.mpi-sws.org/transparency/

[3] Recon: https://recon.meddle.mobi/

[4] Lumen: https://www.haystack.mobi/

[5] AntMonitor: http://antmonitor.calit2.uci.edu/

[6] Aditaur: https://www.lstech.io/aditaur

[7] PrivacyMeter: https://chrome.google.com/webstore/detail/privacymeter/anejpkgakoflmgebgnombfjiokjdhmhg

[8] AppCensus: https://appcensus.mobi/

[9] WebCensus: https://webtransparency.cs.princeton.edu/webcensus/

[10] Price $heriff: http://www.sheriff-v2.dynu.net/views/home

[11] FDVT: https://fdvt.org/

[12] eyeWnder: http://www.eyewnder.com/

[13] https://money.cnn.com/2013/12/18/pf/data-broker-lists/

Myth-busting: Most tracking flows on European citizens DO NOT terminate outside EU28 GDPR borders

In our latest measurement study we use data from 350 volunteers combined with aggregate network logs from 60M ISP subscribers to show that around 90% of tracking flows originating within Europe also terminate within Europe.

This of course does not preclude that the collected data is not subsequently moved elsewhere but at least we know that the tracking end points are more frequently than not within reach of European Data Protection Authorities.

An optimistic result contrasting prior belief that tracking flows went straight out of Europe. For more details check:

C. Iordanou, G. Smaragdakis, I. Poese, N. Laoutaris, “Tracing Cross Border Web Tracking,” ACM IMC’18. [pdf]

Online advertising, data protection, and privacy concerns of users, industry, and regulators (Video)

A video of our panel at CPDP’17

10 thoughts that stuck with me after attending a data protection event almost every month for the last two years

1. Privacy is not hype.

The uncontrolled erosion of privacy is not a “victimless crime”. The cost is just shifted to the future. Could be paid tomorrow — an offending ad — or in a few years — a record of your assumed health status leaking to an insurance company.

2. People don’t currently care much about it but this can change fast.

Indeed people don’t seem to care that much right now, certainly not enough to give up any of the conveniences of the web. But nothing about it is written on stone. Some other things that people didn’t use to care about: smoking, car safety, airport security, dangerous toys, racial or sexual discrimination. Societies evolve … privacy discussions and debates have started reaching the wider public.

3. Privacy problems will only get worse.

Privacy vs. web business models is a textbook example of a Tragedy of the Commons. The financial temptation is just too great to be ignored, especially by companies that have nothing to risk or loose. Just find a niche for data that somebody would be willing to pay good money and go for it. Even if all the big companies play totally by the book, there’s still a long tail of thousands of medium to small trackers/data aggregators that can destroy consumer and regulator trust.

4. The web can actually break due to (lack of) privacy.

The web as big and successful as it is, is not indestructible. It too can fall from grace. Other media that once were king are no longer. News papers and TV are nowhere near their prior glory. Loss of trust is the Achilles’ heel of the web.

5. Privacy extremism or wishful thinking are not doing anybody any good.

Extremists at both sides of the spectrum are not doing anybody any good. Stopping all leakage is both impossible and unnecessary. Similarly, believing that the market will magically find its way without any special effort or care is wishful thinking. There are complex tradeoffs in the area to be confronted. That’s fine and nothing new really. Our societies have dealt with similar situations again and again in the past. From financial systems, to transportation, and medicine, there are always practical solutions for maximizing the societal benefits while minimising the risks for individuals. They just take time and effort before they can be reached with lots of trial and error along the way.

6. Complex technology can only be tamed by other, equally advanced, technology.

Regulation and self-regulation have a critical role in the area but are effectively helpless without specialised technology for auditing and testing for compliance, whether pro-actively or reactively. Have you lately taken your car to service? What did you see? A mechanic nowadays is merely connecting a computer to another that checks it by running a barrage of tests. Then he analyses and interpretes the results. A doctor is doing a similar thing but for humans. If the modern mechanic and doctor depend on technology for their daily job, why should a lawyer or a judge be left alone to make sense of privacy and data protection on the internet only with paper and a briefcase at hand?

7. Transparency software is the catalyst for trusting again the web.

Transparency software is the catalyser that can empower regulators and DPAs while creating the right incentives and market pressures to expedite the market convergence to a win-win state for all. But hold on a second … What is this “Transparency software”? Well it’s just what its name suggest. Simple to use software for checking (aha “transparency”) for information uses that users or regulators dont like. You know things like targeting minors online, targeting ads to patients, making arbitrary assumptions about one’s political, religious beliefs, or sexual preference.

A simple but fundamental idea here is that since it is virtually impossible to stop all information leakage (this would break the web faster than privacy), we can try to reduce it and then keep an open eye for controversial practices. A second important idea is to resist the temptation of finding holistic solutions and instead start working on specific data protection problems in given contexts. Context can improve many of our discussions and lead to tangible results faster and easier. If such tangible results don’t start showing up in the foreseeable future its only natural to expect that everyone will eventually be exhausted and give up the whole privacy and data protection matter altogether. Therefore why dont we start interleaving in our abstract discussions some more grounded ones. Pick up one application/service at a time, see what (if anything) is annoying people about it, and fix it. Solving specific issues in specific contexts is not as glamorous as magic general solutions but guess what — we can solve PII leakage issues in a specific website in a matter of hours and we can come up with tools to detect PII leakages in six months to a year, whereas coming up with a general purpose solution for all matters of privacy may take too long.

8. Transparency works. Ask the telcos about Network Neutrality.

Transparency has in the past proved to be quite effective. Indeed, almost a decade ago the Network Neutrality debate was ignited by reports that some Telcos were using Deep Packet Inspection (DPI) equipment to delay or block certain types of traffic, such as peer-to-peer (P2P) traffic from BitTorrent and other protocols. Unnoticed among scores of public statements and discussions, groups of computer scientists started building simple to use tools to check whether a broadband connection was being subjected to P2P blocking. Similarly, tools were built to test whether a broadband connection matched the advertised speed. All a user had to do to check whether his ISP was blocking BitTorrent was to visit a site and click on a button that launches a series of test and … voila. Verifying actual broadband speeds was made equally simple. The existence of such easy to use tools seems to have created the right incentives for Telcos to avoid blocking while making sure they deliver on speed promises.

9. Market, self-regulation, and regulation, in that order.

Most of the work for fixing data protection problems should be undertaken by the market. Regulators should alway be there to raise the bottom line and scare the bad guys. Independent audit makes sure self regulation is effective. It gives it more credibility since it can be checked by independent parties that it delivers on its promises.

10. The tools are not going to build themselves. Get busy!

Building the tools is not easy. Are we prepared? Do we have enough people with the necessary skills to build such tools? Questionable. Our $heriff tool for detecting online price discrimination took more than 2 years and very hard work from some very talented and committed PhD students and researchers. Similarly for our new eyeWnder tool for detecting behavioural targeting. Luckily the Data Transparency Lab community of builders is growing fast. Keep an eye for our forthcoming call for proposals and submit your ideas.

DTL Award Grants’16 summary

Few slides from my talk at Columbia University during DTL Conf’16. Summary of the review process and some advice on what to do and what to avoid in future calls. Slides here.

“Oh … but people don’t care about privacy”

If only I had a penny for every time I’ve heard this aphorism!

True, most typology studies out there as well as our own experiences verify that currently most of us act like the kids that rush to the table and grab the candy in the classic delayed gratification marshmallow experiment: convenience rules over our privacy concerns.

But nothing is written in stone about this. Given enough information and some time to digest it, even greedy kids learn. Just take a look at some other things we didn’t use to care about:

Airport security

Never had the pleasure of walking directly into a plane without a security check but from what I hear there was a time that this was how it worked. You would show up at the airport with ticket at hand. The check-in assistant would verify that your name is on the list and check your id. Then you would just walk past the security officer and go directly to the boarding gate. Simple as that.

Then came hijackers and ruined everything. Between 1968 and 1972, hijackers took over a commercial aircraft every other week, on average. So long with speedy boarding and farewell to smoking on planes 20 years later. If you want to get nostalgic, here you go:

Smoking

Since we are in the topic of smoking and given that lots of privacy concerns are caused by personal data collection practices in online advertising I cannot avoid thinking of Betty and Don Draper with cigarettes at hand at work, in the car, or even at home with the kids.

bettysmoking

To be honest I don’t have to go as far as the Mad Men heroes to draw examples. I am pretty, pretty, pretty sure I’ve seen some of this in real life.

Dangerous toys

Where do I start here? I could list some of my own but they are nowhere near as fun as some that I discovered with a quick search around the web. Things like:

Glass blowing kit
Lead casting kit
Working electric power tools for kids
The kerosine train
Magic killer guns that impress, burn, or knock down your friends.

Pictures are louder than words. Just take a look at The 8 Most Wildly Irresponsible Vintage Toys. Last in this list is the “Atomic Energy Lab” which brings us to:

Recreational uses of radio active materials

I love micro-mechanics and there’s nothing more lovable about it than mechanical watches. There is a magic in listening to the ticking sound of a mechanical movement while observing the seconds hand sweep smoothly above the dial. You can even do it the dark because modern watches use super luminova to illuminate watch dial markings and hands.

But it was not always like that. Before super luminova watches used Tritium and before that … Radium.

Swiss Military Watch Commander model with tritium-illuminated face

Radium watch hands under ultraviolet light

I am stretching dangerously beyond my field here but from what I gather, Tritium, a radio-active material, needs to be handled very carefully. Radium is downright dangerous. I mean “you are going to die” dangerous. Just read a bit about what happened to the “Radium Girls” who used to apply radium on watch dials in an assembly line in the ’20s.

But we are not done yet. Remember the title of the section is “Recreational uses of radio active materials”. Watch dials are just the tip of the iceberg. It’s more of a useful than a recreational thing to be able to read the time in the dark (with some exceptions). Could society stomach the dangers for workers? Who knows? It doesn’t really matter because there are these other uses, that were truly recreational (in the beginning at least) for which I hope the answer is pretty clear. Here goes the list:

Radium chocolate
Radium water
Radium toothpaste
Radium spa

Details and imagery at 9 Ways People Used Radium Before We Understood the Risks.

Anyhow, I can go on for hours on this, talk about car safety belts, car seat headrests, balconies, furniture design etc but I think where I am getting at is clear: Societies evolve.

It takes some time and some pain but they evolve. Especially in our time with the ease at which information spreads, they evolve fast. Mark my words, it wont be long before we look back and laugh at the way we approached privacy in the happy days of the web.

happy-days