Podcast: The Future of Privacy with ProperData PIs

Episode 04: The Future of Privacy with ProperData PIs

This week, Johanna, Umar, and Hieu are our hosts. It is a special episode, recorded live with a panel of guests in front of an audience. We’ll be unpacking the current state of privacy and exploring paths forward.
Our panelists are four professors from the ProperData Frontier project, which is the research center we’re a part of. ProperData spans 6 institutions, 11 professors, and more than 50 researchers. With us today are Athina Markopoulou (UC Irvine), Zubair Shafiq (UC Davis), David Choffnes (Northeastern University), and Nikolaos Laoutaris (IMDEA Networks).

Podcast available here.

A measurement study of B2B Data Marketplaces, their data products and pricing. A first look at what awaits ahead in the era of European Data Spaces and Gaia-X.

A large number of data marketplaces (DMs) have appeared in the last few years to help data owners monetize their data, and to help data buyers optimize their marketing campaigns, train their ML models, and perform other data-driven decision processes. Even in Europe, which took the lead in the data protection race with the GDPR, several new initiatives like the European Data Spaces, Gaia-X, and the Data Governance Act show that we have entered the post-privacy era, in which data may be sold as a product, so long as certain rules are obeyed. While such rules are being discussed, the market is moving fast to come up with new business models, technologies, and actual data products offered for purchase as we speak.

In our recent report, we present a first-of-its-kind measurement study of the growing DM ecosystem and shed light on several previously unknown facts about it. For example, we show that the median price of live data products sold under a subscription model is around US $1,400 per month. For one-off purchases of static data, the median price is around US $2,200. At the extreme of the pricing spectrum we see data products that reach up to a million US dollars. We analyze the prices of different categories of data and show that products about telecommunications, manufacturing, automotive, and gaming command the highest prices. We also develop classifiers for comparing prices across different DMs, as well as a regression analysis for revealing features that correlate with data product prices.
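
To make the regression idea concrete, here is a minimal sketch (in Python) of the kind of price model one could fit on marketplace listings. The column names, toy listings, and model choice below are illustrative assumptions, not the actual dataset, features, or methodology of the paper.

```python
# Hypothetical sketch of a data-product price regression; toy data only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

listings = pd.DataFrame([
    # category,     delivery,       volume,  price (US$)
    ("telecom",     "subscription", 500_000, 4200),
    ("gaming",      "subscription", 200_000, 3100),
    ("retail",      "one-off",       50_000, 1900),
    ("geospatial",  "subscription", 800_000, 2600),
    ("retail",      "subscription",  30_000,  900),
    ("automotive",  "one-off",      100_000, 3800),
], columns=["category", "delivery", "volume", "price"])

features = ["category", "delivery", "volume"]
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["category", "delivery"])],
    remainder="passthrough",   # keep the numeric volume column as-is
)
model = Pipeline([("pre", pre),
                  ("reg", RandomForestRegressor(n_estimators=100, random_state=0))])
model.fit(listings[features], listings["price"])

# Price estimate for a new, unseen listing
new = pd.DataFrame([("telecom", "subscription", 300_000)], columns=features)
print(model.predict(new))
```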

Read more in: S. Andres Azcoitia, C. Iordanou, N. Laoutaris, “What Is the Price of Data? A Measurement Study of Commercial Data Marketplaces,” [arXiv:2111.04427].

Let’s stop tracking and instead build a personal data inter-network on top of the Web and the Internet. Here’s how …

What if, instead of having to implement controversial user tracking techniques, Internet advertising & marketing companies explicitly asked to be granted access to user data by name and category, such as Alice–>Mobility–>05-11-2020? The technology for implementing this already exists, and is none other than Information Centric Networking (ICN), developed for over a decade in the framework of Next Generation Internet (NGI) initiatives.
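
As a toy illustration of what name-based, user-granted access could look like, consider the following sketch. The hierarchical names, the grant store, and the may_fetch helper are all hypothetical; in a real ICN deployment this logic would live in the network’s interest/data exchange and in signed, name-based access policies rather than in application code.

```python
# Illustrative only: expressing access to personal data by hierarchical name,
# in the spirit of names such as /alice/mobility/2020-05-11.
from fnmatch import fnmatch

# Grants the user has explicitly given: (requester, name pattern). Hypothetical store.
grants = {
    ("acme-ads", "/alice/mobility/2020-05-*"),
    ("acme-ads", "/alice/interests/sports"),
}

def may_fetch(requester: str, name: str) -> bool:
    """Return True iff some explicit user grant covers the requested data name."""
    return any(r == requester and fnmatch(name, pattern) for r, pattern in grants)

print(may_fetch("acme-ads", "/alice/mobility/2020-05-11"))   # True: covered by a grant
print(may_fetch("acme-ads", "/alice/health/2020-05-11"))     # False: never granted
```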

Beyond named access to personal data, ICN’s in-network storage capability can be used as a substrate for retrieving aggregated, anonymized data, or even for executing complex analytics within the network, with no personal data leaking outside. In this opinion article, we discuss how ICN, trusted execution environments, and digital watermarking can be combined to build a personal data overlay inter-network in which users will be able to control who gets access to their personal data, know where each copy of said data is, negotiate payments in exchange for data, claim ownership, and establish accountability for data leakages due to malfunction or malice.

Of course, coming up with concrete designs for how to achieve all of the above will require a huge effort from a dedicated community willing to change how personal data are handled on the Internet. Our hope is that this opinion article can plant some initial seeds in this direction. For more details check out our opinion column in ACM CCR.

We looked at 1 billion URLs and found that some 150 million of them include sensitive content related to Health, Political Beliefs, Sexual Orientation, etc.

… and of course everything is tracked 🙂

The European General Data Protection Regulation (GDPR) includes specific clauses that put restrictions on the collection and processing of sensitive personal data, defined as any data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, as well as genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, or data concerning a natural person’s sex life or sexual orientation.

The above sets the tone regarding the treatment of sensitive personal data, and provides a legal framework for filing complaints, conducting investigations, and even pursuing cases in court. Such measures are rather reactive, i.e., they take effect long after an incident has occurred. To further increase the protection of sensitive personal data, proactive measures should also be put in place. For example, a user’s browser, or an add-on program, can inform the user whenever they visit URLs pointing to sensitive content. When on such sites, trackers can be blocked, and complaints can be automatically filed. Implementing such services, of course, hinges on being able to automatically classify arbitrary URLs as sensitive or not.

However, determining what is truly sensitive is easier said than done. As discussed earlier, legal documents merely provide a list of sensitive categories, without any description or guidance about how to judge what content falls within each one of them. This can lead to a fair amount of ambiguity since, for example, the word “Health” appears both on web pages about chronic diseases, sexually transmitted diseases, and cancer, and on pages about healthy eating, sports, and organic food. For humans it is easy to disambiguate and recognise that the former are sites about sensitive content, whereas the latter are not. The problem is further exacerbated by the fact that, within a web domain, different sections and individual pages may touch upon very diverse topics.

In our recent paper (to appear in ACM IMC’20) we’ve shown how to train a series of machine learning classifiers that can differentiate truly sensitive web-sites from non-sensitive ones. We’ve applied our classifier to over 1 billion URLs and found that more than 150 million include sensitive content. Furthermore, we checked whether such sensitive URLs are being tracked and found out that, unfortunately, they are just as tracked as the rest of the web…
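
For readers curious about what such a classifier looks like in its simplest form, here is a minimal sketch: TF-IDF features over page text feeding a linear model, trained on a handful of toy examples. The actual classifiers, features, and training data in the IMC’20 paper are considerably richer; everything below is illustrative only.

```python
# Toy sensitive-vs-non-sensitive page classifier; not the paper's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (text, label) pairs; label 1 = sensitive, label 0 = non-sensitive. Toy examples only.
pages = [
    ("treatment options for chronic hepatitis and HIV", 1),
    ("symptoms and therapy for clinical depression", 1),
    ("ten easy recipes for a healthy organic breakfast", 0),
    ("best running shoes for marathon training", 0),
]
texts, labels = zip(*pages)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

# Expect label 1 (sensitive) since the text overlaps with the sensitive examples
print(clf.predict(["therapy options for chronic hepatitis patients"]))
```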

Why should online services pay You for Your data? The arguments for a Human-Centric Data Economy

UPDATE: An extended version of this post appears as an opinion column in IEEE Internet Computing. Link to it here.

Data, and the economy around it, are said to be driving the fourth industrial revolution. Interestingly, the people — whose data are what moves the new economy — have a rather passive role in this economy, as they are left outside the direct value flow that transforms raw data into huge monetary benefits. This is a consequence of a de facto understanding (or, one may say, misunderstanding) between people and companies that the former receive unpaid access to online services in exchange for unpaid access to their personal data. This situation is increasingly being challenged by various voices calling for the establishment of a new and renegotiated relationship between users and services.

For example, technologist, philosopher, and writer Jaron Lanier argues in his 2013 book “Who Owns the Future?” that online services should return some of their revenue back to the people that feed their business models and AI algorithms with data. Lanier’s arguments include the following:

Sustainability: The current economic model imposes serious privacy risks on individuals and society at large, has led to market failures in the form of large data monopolies and oligopolies, and may, in fact, even be a threat to employment in the future due to job loss from data-driven automation. Paying people for their data could, therefore, be an alternative to labor-based compensation in a future in which most work will be done by machines. Indeed, it was estimated recently that, if fair remuneration algorithms are set in place, a family of four could earn up to $20,000 per year from their data (Posner, 2018). The above figure may seem too small to be a full alternative to labor-based compensation, but it can only increase as more and more sectors are catalyzed by automation.

Fairness: Paying people for their data, or the related idea of (universal) guaranteed minimum income, are potential remedies for modern societal ailments such as increased income disparity, increased unemployment, and other labor-related challenges emerging in the context of machine learning automation, robots, 3D printing, self-driving cars, and other employment-threatening technologies. Critics of guaranteed minimum income reject it as an unsustainable form of charity. Instead, Lanier has argued that paying people for their data is an altruism-free idea, compatible with modern capitalism, for achieving the positive objectives of guaranteed minimum income without harming but instead benefiting the market, innovation, and investment in technology. The fundamental argument behind this position is a simple one: business models and machine learning algorithms have zero value without data, and, therefore, paying for those data is not charity but rather neoclassical economics. As Lanier notes, “We have invented only half of the data driven industrial revolution — the part that compensates users in kind (i.e., service); we need to invent the other half that will provide explicit (monetary) benefits.”

But the arguments in favor of what I will henceforth refer to as a Human-Centric Data Economy go beyond those mentioned above.

First, paying for data opens up the pathway to getting more data and data of higher quality, thereby increasing the size of the data economy. In other words, the data economy is not a zero-sum game between users and data collectors; paying the former does not have to harm the latter. It is not surprising, therefore, that the idea of paying or being taxed for data has been positively received by many, including industry leaders such as Bill Gates (Delaney, 2017), Elon Musk (Thomas, 2017), and Mark Zuckerberg (Gillespie, 2017).

Second, paying for data creates economic pressure on online services to apply data minimization principles. Currently, collecting and processing data costs close to zero and, therefore, services greedily collect all the data that they can, even when the actual information that they need is much less. The resulting privacy-related tragedy of the commons of the web can be avoided if online services a) collect the minimum amount of data that they need, and b) do so only when the benefit that they create for society outweighs, and can therefore compensate for, the risks they create. In other words, in the same way that factories and private cars pay for the amount of pollution that they impose on the environment, online services should pay for the privacy risks they impose on people. Currently, data minimization is just a principle quoted by data protection laws. Realizing it in practice will require establishing the currently missing economic signals that push the market in the right direction. Paying people for data is a direct way of achieving this. Two important related questions are, therefore:

  1. What part of a service’s revenue should be returned to its users?
  2. How should the total returned payoff be split among different users?

To answer the first question, we are modeling the data value chain using tools like Nash bargaining, used before for modeling the interconnection value chain of the Internet. To answer the second, we are using the Shapley value to compute the relative importance of different users’ data to the decisions taken by machine learning algorithms for tasks like movie recommendation.
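
To give a flavour of the second idea, here is a toy sketch of a Shapley-value split over a handful of users. The value function v(S) below is a made-up stand-in; in the actual work it would be something like the recommendation accuracy obtained when training only on the data of the users in coalition S.

```python
# Toy Shapley-value split of a payoff among users whose data fed a model.
from itertools import permutations

users = ["alice", "bob", "carol"]

def v(coalition: frozenset) -> float:
    # Hypothetical utility of training only on this coalition's data
    # (diminishing returns); a stand-in for, e.g., recommendation accuracy.
    base = {"alice": 0.5, "bob": 0.3, "carol": 0.2}
    return sum(base[u] for u in coalition) ** 0.5

def shapley(users, v):
    """Average marginal contribution of each user over all orderings."""
    phi = {u: 0.0 for u in users}
    orders = list(permutations(users))
    for order in orders:
        seen = frozenset()
        for u in order:
            phi[u] += v(seen | {u}) - v(seen)
            seen = seen | {u}
    return {u: phi[u] / len(orders) for u in users}

print(shapley(users, v))  # the shares sum to v(all users)
```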

Stay tuned for more thoughts, ideas, and papers around Human-Centric Data Economies, and please do contact me if you are interested in working or doing a PhD in the area.

References

Delaney, K. J. (2017). The robot that takes your job should pay taxes, says Bill Gates. Retrieved from https://qz.com/911968/bill-gates-the-robot-that-takes-your-job-should-pay-taxes/

Gillespie, P. (2017). Mark Zuckerberg supports universal basic income. What is it? Retrieved from https://money.cnn.com/2017/05/26/news/economy/mark-zuckerberg-universal-basic-income/index.html

Posner, E. A. (2018). Radical Markets: Uprooting Capitalism and Democracy for a Just Society. Princeton University Press.

Thomas, L. (2017). Universal basic income debate sharpens as observers grasp for solutions to inequality. Retrieved from https://www.cnbc.com/2017/03/25/universal-basic-income-debate-sharpens.html

Networking Research: Present, Future and Beyond

Last week I had the honor and privilege of organizing the 11th Annual Workshop of IMDEA Networks. The event gathered an excellent set of keynote speakers and panelists who, together with IMDEA researchers and other colleagues from Madrid and beyond, sat to review the status of networking research and discuss its future.

You can find an Executive Summary of the event here: https://www.networks.imdea.org/whats-new/news/2019/finding-true-north-networking-research

Workshop program and presentations are here: http://workshop2019.networks.imdea.org/workshop-program

Many thanks to all the participants and the IMDEA staff that made the event possible.

Data Transparency: Concerns and Prospects

[What follows is a preview of an opinion note to appear in the Nov. 2018 issue of Proceedings of IEEE (original article here). Many thanks to all the people that read early drafts and provided feedback and corrections.]

Introduction

The question of how far the technologies and business models of the web should go in collecting the personal data of unassuming, or at best moderately informed, citizens appears to be one of the timeliest questions of our times. Indeed, whenever we read a news article, “like” a page on a social network, or “check in” to a popular spot, our digital trace, collected, processed, fused, and traded among myriads of tracking, analytics, advertising, and marketing companies, becomes an ever more accurate descriptor of our lives, our beliefs, our desires, our likes and dislikes. The resulting revenue from marketing & advertising activities driven by the digital traces of millions of people is what funds the free online services we have come to rely upon.

In this opinion note, I will lay down my thoughts around Data Transparency and its role in ongoing Data Protection & Privacy debates. The material draws upon my experiences from conducting research in the area over the last 6+ years, running the Data Transparency Lab’s Grant Program in 2015, 2016, and 2017, and attending several computer science, policy, marketing & advertising events. The objective of the note is to discuss the possibility and the likelihood of data transparency acting as an important positive catalyst for addressing data protection problems, as well as to point towards concerns and challenges to be addressed in order for this to materialize. Most of the discussion applies to the use of personal data by marketers on the fixed and mobile web, but some parts may also be relevant to other online and offline use-cases and/or types of data (e.g., off-web health and financial data).

For years, the practice of collecting data on individuals at unprecedented scale was a non-issue for most people, for the simple reason that the public, and even governments, were just unaware of its magnitude, precision, and detail. In the last few years, however, attitudes have started to change and the topic of privacy is increasingly appearing in the media and public discussions. This has stirred a huge public debate about who should be the rightful owner of personal data, and where to draw the red line between what is socially acceptable to track and monetize, and what is not. Indeed, the same tracking technology of cookies and tracking pixels used to detect one’s intention to buy new running sneakers, thus prompting an interesting discount at a nearby shop, can also be used to infer one’s medical condition, political affiliation, or sexual preference, thus delivering offers at the wrong time and place, or even worse, releasing the information to third parties that may use it in a discriminatory, exclusionary, or generally unfair manner. The former use of data has the potential to increase value for individuals, retailers, technology providers, and society, whereas the latter can be detrimental to the trust put by individuals and institutions in technology.

Tragedy of the Commons and the Web

What is particularly alarming is that the economics and incentives of exploiting personal data for advertising and marketing have all the characteristics of a “Tragedy of the Commons” (see Hardin [1]), in which consumer privacy and trust in the web and its business models are a shared commons that can be over-harvested to the point of destruction. The essence of the problem is that even if most technology companies manage to agree on a set of principles, there will always be sufficient temptation for some to push the boundaries in pursuit of greater gains, while inflicting a heavy cost on society. Indeed, from the narrow perspective of some companies, all it takes to pursue a business that involves intrusive and unethical collection of very sensitive personal data is a paying customer. The above seems to be verified by examples appearing in the press of trackers that compile lists of anything from suspected alcoholics and HIV-positive individuals, to active police officers [13].

In essence, the narrow self-interest of a subset of data collection companies is eroding a valuable commons — the trust put by people in the web, or inversely their hope that nothing bad will happen to them by being carefree online. If, in the minds of citizens, ordering a drug online is associated with the risk of leaking medical info to health insurance companies, then they may very well abandon the web and just walk to a pharmacy. This means that the web, as big and successful as it is presently, is not invincible. It too can fall from grace like newspapers and broadcast-TV have in the past, albeit for other reasons. Loss of public trust appears to be the Achilles’ heel of the web and is being fuelled by questionable practices from large and small companies alike.

The Role of Transparency in Data Protection Debates

Transparency is a term often heard in debates about governance, business, science, and matters of public life in general. According to Wikipedia, transparency is about “operating in such a way that it is easy for others to see what actions are performed. Transparency implies openness, communication, and accountability”. Transparency, in its different applications and contexts, largely embodies the famous quote of American Supreme Court justice Louis Brandeis that “Sunlight is said to be the best of disinfectants”.

In the context of data protection and privacy of online services, transparency can be understood as the ability to credibly answer questions such as:

  • What information is being collected (stored, and processed) about individuals online?
  • Who is collecting it?
  • How is it being collected?
  • How is it being used?
  • Is it leaking to other unintended recipients?
  • What are the consequences of such online leakage of private information?  

Information leakage is a natural phenomenon in both offline and online life. In the offline world, whenever we walk on a street or are seen at a public place, we are effectively giving up our so-called “location privacy”. Our clothes, hobbies, the car we may drive, or the house where we live convey information about our financial status, employment, and taste. Similarly, in the online world, networks need to know where we are in order to deliver our calls, emails, or chat requests. Social networks need to display our real names so that our offline friends can also befriend us online. The above realizations give rise to a simple alternative to trying to unknot the “utility-vs.-privacy” tradeoff. Since we cannot stop all online leakage under the current technological paradigm, nor prescribe a generic, context-unaware solution to the tradeoff, we can instead try to reduce information leakage, while keeping an eye open for controversial practices driven by collecting personal data that go against public sentiment or the letter of the law. Such an objective can be achieved on top of existing web technologies and business models without requiring a radical redesign. Transparency is the guiding light pointing to problematic technologies and business practices that will require revision, if we are to keep a safe distance from a tragedy of the commons on the web.

Transparency has already proved its worth in what is probably the greatest techno-policy debate preceding data protection — the Network Neutrality debate. Network neutrality is the simple principle, now turned into regulation and telecommunications law, that a network operator cannot delay or drop one type of traffic from a certain application in order to protect or expedite the rest. Almost a decade ago, the Network Neutrality debate was ignited by reports that some telecom companies were using Deep Packet Inspection (DPI) equipment to delay or block certain types of traffic, such as peer-to-peer (P2P) traffic from BitTorrent and other protocols. Unnoticed initially among scores of public statements and discussions, a group of computer scientists from Germany developed Glasnost [2], a set of tools for checking whether a broadband connection was being subjected to P2P blocking. All a user had to do to check whether their ISP was blocking BitTorrent was to visit a web page and click on a button that launched a series of simple tests — basically streaming two flows of data towards the user, one appearing to be P2P and one not. By comparing the corresponding reception data rates for the two flows, the tool could say whether P2P was throttled… voilà, anyone could check for themselves if their provider blocked P2P traffic or delivered the advertised speed of their plan. The existence of such easy-to-use tools created the right incentives that eventually obliged telecom companies to give up blocking or be open about it, and to indeed deliver the promised data rates.
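
The comparison at the heart of such a test boils down to a few lines. The sketch below only illustrates the idea; the measurement procedure and the 80% threshold are assumptions made for the example, not Glasnost’s actual code.

```python
# Sketch of a Glasnost-style throttling check: compare the achieved rates of a
# flow that looks like P2P against a neutral control flow of the same size.
def throttled(p2p_rates_mbps, control_rates_mbps, threshold=0.8):
    """Flag throttling if the P2P-looking flow is consistently much slower."""
    p2p = sorted(p2p_rates_mbps)[len(p2p_rates_mbps) // 2]          # median run
    control = sorted(control_rates_mbps)[len(control_rates_mbps) // 2]
    return p2p < threshold * control

# Example: repeated runs of the two flows (Mbps)
print(throttled([2.1, 2.3, 1.9], [9.8, 10.1, 9.7]))  # True: P2P flow much slower
print(throttled([9.5, 9.9, 9.6], [9.8, 10.1, 9.7]))  # False: comparable rates
```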

The above network speed measurement tools are early examples of what I will henceforth call Transparency Software, to refer to any software purposefully built for providing transparency and for “shedding light” on public debates involving technology. An important idea, and frankly the raison d’être for transparency software, is that complex technology can only be tamed by other, equally advanced, technology. Indeed, investigating complex data protection debates without specialized software for collecting evidence and testing hypotheses is like conducting pre-flight tests or periodic car inspections without specialized equipment that can test all the complex components of an airplane or modern car. In all these domains, there exist entire fields dedicated to developing formal methods for testing and verification. In the same way that air transportation, the car industry, health, and other domains have benefited from purpose-built testing tools, online data protection needs to develop its own transparency methods and software.

Although the present article is not meant to be a complete survey, in the remainder I will discuss several examples of transparency software for things like: revealing in real time instances of Personally Identifiable Information (PII) leakage, detecting online price discrimination, detecting online targeted advertising, detecting advanced online tracking through fingerprinting, and others.   

Transparency for Whom?

Figure 1: The role of transparency in the context of the greater data protection debate.

Figure 1 illustrates at the topmost level how and where Transparency plugs into the general data protection debate. Assuming that some use of personal data for online marketing in exchange for free service is deemed acceptable, and that the Tragedy of the Commons is not something we should leave to luck, the figure suggests that transparency can be beneficial for all three main stakeholders of the data protection discussion. Transparency should not be seen as an alternative to existing efforts of these stakeholders, but as an extra tool at their disposal.

For the online tracking & advertising industry, Transparency is essential to its efforts to convince government and citizens that it can effectively self-police and self-regulate the sector, making sure that individual companies do not perform actions that go against the public sentiment or, worse, data protection laws. The various Codes of Conduct and Best Practices documents issued by sector representative bodies and organizations make lots of use of the term transparency, but they often get criticized as being mere statements of intent, without any real means for enforcement or actual demonstration of commitment and application. This is where Transparency Software can play a key role, by allowing anyone to independently check that companies make good on their promises. A smartphone app that commits to not communicating PII back to its servers or other third parties can be checked by software such as ReCon [3], Lumen [4], and AntMonitor [5]. A web-site and its advertising partners that commit to not targeting minors can point users to Aditaur [6], a tool which anyone can use to verify the claim. In essence, the existence of such software allows the sector to make more credible and verifiable promises regarding its ability to self-regulate its data treatment practices.

For individual citizens, transparency is all about empowerment and freedom of choice. For every basic online service there is typically a multitude of alternative service providers offering it, each one with a potentially different approach and sensitivity towards data collection. Users are accustomed to rating online services in terms of performance, simplicity, and feature richness, but gauging the quality of data management practices has for the most part been out of reach. Being able to evaluate the quality-to-privacy-risk ratio of different services empowers users to select the one providing the right balance. For example, PrivacyMeter [7] can display in real time a risk score for every web-site visited by a user. If a user deems the risk of visiting, say, a news portal to be too high, they can opt for an alternative one with similar content but better performance in terms of privacy. By doing so, users emit clear signals to the industry and contribute through market pressure to pushing it in the right direction.

Last but not least, government agencies, especially Data Protection Authorities (DPAs), need transparency for both their proactive monitoring activities and their investigative activities following complaints from citizens, watchdog groups, or other companies. Transparency software can help DPAs scale up their investigations and even proactively monitor for offending practices, something that does not appear to be possible via ad hoc manual investigations. For example, with AppCensus [8], entire marketplaces of mobile apps can be checked for leakage of PII by automatically analysing their binary executable distributions. Similarly, WebCensus [9] allows monitoring millions of domains every month to catalogue and rank their tracking practices. With Aditaur, DPAs can proactively check at scale thousands of popular domains for targeted advertising towards sensitive groups like children, or driven by sensitive personal data about health, political beliefs, or sexual orientation that are protected under the EU’s GDPR.

Transparency of What?

As mentioned before, some perceive data protection as the challenge of understanding and limiting the amount of personal information that can be leaked and collected online. Although this has an importance of its own, for others the main motivation and driver for discussing data protection matters is understanding and limiting the consequences of personal data leakage. Transparency is important and can help at both levels of the debate. Data transparency and corresponding tools like ReCon, Lumen, and AntMonitor are about revealing what data is leaking and where it is going. Algorithmic transparency, on the other hand, looks at how personal data can end up fuelling biased, discriminatory, and generally unfair automated decision making that impacts the life of real people. For example, the Price $heriff [10] reveals whether a consumer searching for a product online is being subjected to online price discrimination by algorithms that decide on the spot a dynamic price for each customer based on their perceived willingness to pay, extracted from information about them. FDVT [11] measures in real time the different economic valuations that Facebook advertisers have of different users, by tallying up their advertisement bids for product placements. Algorithmic transparency, of course, goes beyond the online services mentioned above, and can be applied to a range of offline contexts from health, to finance, to justice, but such areas and applications go beyond the scope of the current note.
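
To make the price-discrimination case concrete, here is a minimal sketch of the kind of comparison a Price $heriff-style tool performs over crowdsourced observations. The field names, the 5% divergence threshold, and the sample observations are assumptions for illustration, not the tool’s actual logic.

```python
# Toy check: group observations of the same product on the same site (within a
# short time window, omitted here) and flag significant price divergence.
from collections import defaultdict

observations = [
    {"site": "shop.example", "product": "sku-42", "country": "DE", "price": 99.0},
    {"site": "shop.example", "product": "sku-42", "country": "ES", "price": 99.0},
    {"site": "shop.example", "product": "sku-42", "country": "US", "price": 124.0},
]

by_item = defaultdict(list)
for o in observations:
    by_item[(o["site"], o["product"])].append(o)

for key, obs in by_item.items():
    prices = [o["price"] for o in obs]
    spread = (max(prices) - min(prices)) / min(prices)
    if spread > 0.05:  # more than 5% divergence among simultaneous observations
        print(f"possible price discrimination on {key}: {prices}")
```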

Challenges

Next, I discuss some of the hard challenges that need to be addressed if transparency is to make a positive dent in privacy problems on the web.

Crowdsourcing Privacy Related Data

Several of the tools mentioned above are crowdsourced in nature, i.e., they rely on real users sharing their observations, in order to reverse engineer some aspect of an online service. eyeWnder [12], for example, relies on users reporting the advertisements seen on different pages in order to construct a crowdsourced database through which one can identify active advertising campaigns and the demographics of users targeted by them. Similarly, the Price $heriff relies on users reporting the price they see for the same product at the same site to detect instances of online price discrimination. Both tools use specialized encryption and anonymization techniques to protect the privacy of users that report back to the database the ads or the prices they’ve seen. Crowdsourcing is a powerful tool for detecting different types of discrimination and bias, but requires having in place a solution for protecting the privacy of users that contribute to the crowdsourced corpus of data. The above two tools use ad hoc techniques tailored to their specific function, but there is a clear need for developing more generic privacy-preserving crowdsourcing techniques and platforms that will make it easier to develop additional transparency tools for other problems.    
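
As a toy illustration of the client-side step of such reporting, the sketch below drops direct identifiers and keeps only the fields the crowdsourced analysis needs before anything leaves the user’s machine. Real tools like eyeWnder and the Price $heriff use far more elaborate encryption and anonymization; the field names and the salted-hash pseudonym below are illustrative assumptions only.

```python
# Toy client-side sanitizer for crowdsourced privacy observations.
import hashlib

def sanitize(observation: dict, salt: str) -> dict:
    """Strip direct identifiers and keep only coarse fields before reporting."""
    return {
        # pseudonymous reporter id for deduplication (a salted hash;
        # a real deployment would need much stronger guarantees)
        "reporter": hashlib.sha256((salt + observation["user_id"]).encode()).hexdigest()[:16],
        "site": observation["site"],
        "product": observation["product"],
        "price": observation["price"],
        "country": observation["country"],   # coarse location only, no IP address
    }

raw = {"user_id": "alice@example.com", "ip": "203.0.113.7",
       "site": "shop.example", "product": "sku-42", "price": 124.0, "country": "US"}
print(sanitize(raw, salt="per-deployment-secret"))
```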

Evaluation criteria / Reproducibility / Correctness

The appearance of several transparency tools in the last 2-3 years (of which I have touched upon only a small subset) is testament to the very significant amount of work that has been done in the area during a rather short amount of time. Still, the area is only in its infancy, and thus important prerequisites for growth and eventual maturity are yet to be fulfilled. One of them is establishing common criteria and metrics upon which different transparency tools looking into the same task can be compared. Having the ability to directly compare different approaches is fundamental for the evolution of the area and for the validity and correctness of its findings. In the same spirit, the findings of a tool need to be reproducible. This is difficult to achieve when the tool operates in the wild. eyeWnder, for example, can label an ad banner as targeted today, but it might be impossible to reproduce the result after a week since the underlying advertising campaign may no longer exist, or may have changed its target audience. Reproducibility goes hand in hand with the ability to compile extensive crowdsourced datasets on privacy matters upon which the reproducibility of a tool can be checked or its performance compared with alternative approaches to the same problem.

Bootstrapping / UX challenges / Outreach

Both privacy-preserving crowdsourcing and the establishment of common evaluation criteria are technical problems and, as such, something that a technical community of computer scientists and engineers knows how to handle. A different type of challenge for the area is finding enough users for these tools. For tools that are crowdsourced in nature, establishing an initial user base is a fundamental prerequisite for allowing them to derive credible and useful findings. Even for tools that can work by collecting data without user input (e.g., through crawling and scraping of information), having a user base outside research is largely a measure of true impact and success. To get there, we need to work on improving the usability aspects of such tools and adapting them to the needs and capacities of non-expert users. We also need to work on disseminating them and putting them in front of end users.

Conclusions

With the above, I hope I have convinced you that transparency can have an important role and contribution in contemporary data protection debates. In my mind, a very important first milestone is making sure that the online world is at least as transparent as the offline world. This may seem uninspiring on the surface, but it is actually a very difficult objective in practice. The scale of data collection online is at a totally different level from that offline. Rules and regulations are established for many offline activities, from credit rating to equality of access to public services, whereas the online equivalents are left to chance. Finally, many transparency aspects that we take for granted in the offline world are hard to achieve online. Take price discrimination as an example. Two customers walking into a coffee shop see the same price for “cafe latte” written on the wall. If the clerk charges one of them a different price, questions will immediately follow. In the online world, the two customers can order the same coffee at the same time and pay a totally different price without either of them ever realising it. This is because in the online realm the “world” around us is dynamically generated; we therefore do not even have the benefit of a common reference. Checking for being “followed” or discriminated against is more difficult online than offline. Of course, this is just under the current state of affairs. The same technology used for surveilling or discriminating at scale can be flipped on its head and used instead for shedding light and providing transparency at scale. This means that an online world that is safer and fairer than the offline one is an open possibility that we should consider and pursue.

References

[1] G. Hardin, “The Tragedy of the Commons,” Science, Vol. 162, Issue 3859, pp.1243-1248, Dec. 1968.

[2] Glasnost: http://broadband.mpi-sws.org/transparency/

[3] ReCon: https://recon.meddle.mobi/

[4] Lumen: https://www.haystack.mobi/

[5] AntMonitor: http://antmonitor.calit2.uci.edu/

[6] Aditaur: https://www.lstech.io/aditaur

[7] PrivacyMeter: https://chrome.google.com/webstore/detail/privacymeter/anejpkgakoflmgebgnombfjiokjdhmhg

[8] AppCensus: https://appcensus.mobi/

[9] WebCensus: https://webtransparency.cs.princeton.edu/webcensus/

[10] Price $heriff: http://www.sheriff-v2.dynu.net/views/home

[11] FDVT: https://fdvt.org/

[12] eyeWnder: http://www.eyewnder.com/

[13] https://money.cnn.com/2013/12/18/pf/data-broker-lists/

Myth-busting: Most tracking flows on European citizens DO NOT terminate outside EU28 GDPR borders

In our latest measurement study we use data from 350 volunteers combined with aggregate network logs from 60M ISP subscribers to show that around 90% of tracking flows originating within Europe also terminate within Europe.

This, of course, does not preclude the collected data from being subsequently moved elsewhere, but at least we know that the tracking endpoints are more often than not within reach of European Data Protection Authorities.

An optimistic result that contrasts with the prior belief that tracking flows went straight out of Europe. For more details check:

C. Iordanou, G. Smaragdakis, I. Poese, N. Laoutaris, “Tracing Cross Border Web Tracking,” ACM IMC’18. [pdf]
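
For intuition on the accounting behind the 90% figure, here is a hypothetical sketch: map each tracking flow’s destination to a country (e.g., via an IP geolocation database) and compute the share of flows terminating inside the EU28. The flow format, the stubbed geolocation lookup, and the eu_share helper are placeholders, not the measurement pipeline of the paper.

```python
# Toy accounting of tracking flow endpoints by country (EU28 as of 2018).
EU28 = {"AT","BE","BG","HR","CY","CZ","DK","EE","FI","FR","DE","GR","HU","IE",
        "IT","LV","LT","LU","MT","NL","PL","PT","RO","SK","SI","ES","SE","GB"}

def eu_share(flow_destinations, geolocate):
    """flow_destinations: iterable of destination IPs; geolocate: IP -> ISO country code."""
    countries = [geolocate(ip) for ip in flow_destinations]
    inside = sum(1 for c in countries if c in EU28)
    return inside / len(countries) if countries else 0.0

# Toy usage with a stubbed geolocation lookup
lookup = {"203.0.113.7": "DE", "198.51.100.9": "US", "192.0.2.3": "FR"}
print(eu_share(lookup.keys(), lookup.get))  # 2/3 of these flows end inside the EU28
```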

The three types of research papers and how I learned to recognise them

After 15+ years of reading, writing, presenting, reviewing, selecting, discussing, stapling, and doodling on the margins of papers I have concluded that there exist three large families of research papers:

About-papers

About-papers are usually written, read, discussed, championed, and sent as attachments by people that care about an area. They love the area so much that everything that has anything to do with it immediately becomes an interesting read. You can write an about-paper about a dataset that you managed to get hold of, about a trial you ran with real users, about the latest research infrastructure that you are developing, or about your favourite new technology. I love reading well written about-papers. I just prefer reading them in magazines, newspapers, blogs, newsletters, etc. I definitely don’t like waking up in the middle of the night to review them, especially on weekends and during holidays. The hallmark of an about-paper is its general interest in the area and its relative disinterest in specific contributions and questions within that area.

Concept-papers

A concept is a magic lens through which complex things become simple, helping us to finally understand them. Think of price of anarchy, differential privacy, betweenness centrality, power-usage efficiency. Great concept-papers can have a profound positive impact on our understanding of the world. They cut across areas and problems and reveal underlying hidden truths and structures. Unfortunately, most concept-papers are not of the great type. It’s really tempting to think that you’ve come across the silver bullet that will pierce through any type of steel and concrete. Bad concept-papers confuse and distract. Instead of being a means, they become an end in themselves. In the process they distract our attention from real problems and waste huge amounts of time. The easiest way to write the wrong concept-paper is to believe too much in genius and divine intervention. Despite being more than welcome, neither of the two is a strict prerequisite for a concept-paper. Experience and domain expertise are often all it takes to come up with a great concept, after having observed a common structure across different fields and problems. A special case of concept-paper craziness is the technology-concept-paper: using BitTorrent to send people to Mars, Bitcoin to cure cancer, and TCP to alleviate traffic jams in Beijing.

Question-papers

  • Is location-based price discrimination happening in e-commerce?
  • Which advertisers place targeted ads driven by sensitive personal data?
  • How much cross-subsidization exists between heavy and light consumers of residential broadband?
  • What percentage of online advertising revenues go to fake clicks?
  • Who starts fake news campaigns in social media?
  • Can we build sub-10ms delay networks?

Question-papers are all about answering a clear and easy-to-understand question about something that is important and hard to guess without doing some work first. Surely you can find questions in both about- and concept-papers. The difference is that in question-papers it is the question that leads the entire effort, as opposed to taking the back seat as in the other two. A clear and important question is an infallible compass for finding your way among the myriads of alternatives arising during any research effort. Putting the question in the driver’s seat makes everything else fall easily into place: the dataset that you need, the expertise required for answering it, the right definition, the right algorithm, the right system, the results to show.

Over the years I have written papers of all three types, but I must admit that lately I only care about question-papers. I would love to write a good concept-paper in the area where I currently work, but I am afraid I still have some questions to ask and answer before being ready to do so.

There I said it: The Net Neutrality “debate” is the Climate Change “debate” of the Internet

i.e., it’s no debate at all, not a serious one at least.

It’s more of a huge mess created by taking the good intentions of well-meaning people and twisting them to fit economic and other interests.

I could stop here and feel relieved that I finally expressed that which has been haunting me after having spent 3 years on a single paper on the economics of interconnection.

It was actually my frustration at failing to explain the data and economic concepts used in this paper that made me give up on the area and turn to privacy and transparency (in this regard, “thank you, network neutrality camp!”).

I just couldn’t explain the rather obvious economic fact that a supermarket can’t function efficiently if the cashier just weighs how much stuff you get and charges you the same regardless of whether you are taking 5 kilos of potatoes or 5 kilos of white truffle.

I am not planning to write a full essay on this (yet), so here go, in almost random order, my takeaways from these 3 years:

1. Connectivity is a two-sided market

When you are campaigning, writing, lobbying, and yelling about network neutrality, you are actually fighting for the right of people to keep paying 100% of interconnection (aka last-mile) costs. You might as well fight for your right to pay a higher price for MWC passes so that exhibitors can get free booths, or for your right to pay a higher price for magazines so that advertisers can place ads for free.

2. “What about the little guy?”

Many seem to worry about the small startup without the “deep pockets” that won’t be able to afford paying for the “fast lane”. Well, the little guy has little traffic and therefore doesn’t need deep pockets in the first place. If ISPs want to deconstruct this rather superficial argument, they should just offer the fast lane for free to small companies and only charge if traffic (aka business) ramps up. This would give the little guy a competitive advantage over established service monopolies/oligopolies that, frankly, would be the ones challenged the most by a change in the current status quo.

3. “Throttling”

A heavy word, implying that certain traffic will be delayed so that other traffic can take priority over it. This suggests a “zero-sum” Internet in which, if I have more legroom (like in business class), you are not able to open up your laptop or reach for the fork. The Internet is not zero-sum. If companies want to push HD video or gaming traffic, they can go through different lines and ports so that you don’t have to hang up the call with your grandma.

4. Internet Democracy, Freedom of Speech, etc

This is where it all began (the “good intentions” I mentioned earlier), but it’s not about this anymore. Any attempt to exploit non-neutrality for something in this area would be suicidal for the brand image of any commercial organisation. Despotic regimes and tyrants don’t care about half measures like non-neutrality – they go straight to good old blocking.

5. “Harm innovation” and other “arguments”

When someone resorts to “harm innovation” it means that they are already losing the “debate”, so they have to take out the H-bomb. Usually this comes after “Internet Democracy, Freedom of Speech”, i.e., when 4. fails to convince. The “harm innovation” “argument” is so fuzzy and ethereal as an “argument” that it is indeed difficult to deconstruct. It’s like trying to shoot at a ghost. You’ll never get it. I’ll just say that “business innovation on the web” is not the only innovation in and around technology.

I’ll stop here for now.