These days, companies are using more and more of our data to improve their products and services. And it makes a lot of sense if you think about it. It’s better to measure what your users like than to guess and build products that no one wants to use.
However, this is also very dangerous. It undermines our privacy because the collected data can be quite sensitive, causing harm if it leaks. So companies love data to improve their products, but we as users want to protect our privacy. These conflicting needs can be satisfied with a technique called differential privacy. It allows companies to collect information about their users without compromising the privacy of an individual. But let’s first take a look at why we would go through all this trouble.
Companies could also just take our data, remove our names and call it a day. Right? Not quite! There are two problems with this approach. First of all, this anonymization process usually happens on the servers of the companies that collect your data, so you have to trust them to really remove the identifying details. And secondly, how anonymous is “anonymized” data really? In 2006 Netflix started a competition called the “Netflix Prize”. Competing teams had to create an algorithm that could predict how someone would rate a movie. To help with this challenge, Netflix provided a dataset that contained over 100 million ratings submitted by over 480,000 users for more than 17,000 movies.
Netflix of course anonymized this dataset by removing the names of users and by replacing some ratings with fake and random ratings. Even though that sounds pretty anonymous, it wasn’t. Two computer scientists from the University of Texas published a paper in 2008 that said that they had successfully identified people from this dataset by combining it with data from IMDb. These types of attacks are called linkage attacks, and they happen when pieces of seemingly anonymous data can be combined to reveal real identities.
Another, creepier example would be the case of the governor of Massachusetts. In the mid-1990s the state’s Group Insurance Commission decided to publish the hospital visits of state employees. They anonymized the data by removing names, addresses and other fields that could identify people. However, computer scientist Latanya Sweeney decided to show how easy it was to reverse this. She combined the published health records with voter registration records and simply narrowed down the list. There was only one person in the medical data who lived in the same ZIP code, had the same gender and the same date of birth as the governor, thus exposing parts of his medical records. In a later paper she noted that 87% of all Americans can be identified with only three pieces of information: ZIP code, date of birth and gender. So much for anonymity.
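To see how little it takes, here is a rough Python sketch of a linkage attack. Every record in it is invented for this example, and the function name `link` is just a placeholder; it only illustrates the idea of joining two datasets on ZIP code, date of birth and gender, not how the real study was carried out.

```python
# Toy records, invented purely for illustration.
medical = [  # "anonymized": no names, but quasi-identifiers remain
    {"zip": "00001", "dob": "1950-01-01", "gender": "M", "diagnosis": "condition A"},
    {"zip": "00002", "dob": "1960-02-02", "gender": "F", "diagnosis": "condition B"},
    {"zip": "00003", "dob": "1970-03-03", "gender": "F", "diagnosis": "condition C"},
]
voters = [  # public list that does contain names
    {"name": "Person One", "zip": "00001", "dob": "1950-01-01", "gender": "M"},
    {"name": "Person Two", "zip": "00002", "dob": "1960-02-02", "gender": "F"},
    {"name": "Person Three", "zip": "00002", "dob": "1960-02-02", "gender": "F"},
    {"name": "Person Four", "zip": "00004", "dob": "1980-04-04", "gender": "M"},
]

KEYS = ("zip", "dob", "gender")

def link(medical, voters):
    """Return (name, diagnosis) pairs where exactly one voter matches a record."""
    by_key = {}
    for voter in voters:
        key = tuple(voter[k] for k in KEYS)
        by_key.setdefault(key, []).append(voter["name"])

    matches = []
    for record in medical:
        names = by_key.get(tuple(record[k] for k in KEYS), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches.append((names[0], record["diagnosis"]))
    return matches

print(link(medical, voters))
```

In this toy data only the first medical record is re-identified: the second one matches two voters, so it stays ambiguous, and the third one matches nobody.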
Clearly this technique isn’t enough to protect our privacy. Differential privacy, on the other hand, neutralizes these types of attacks! To explain how it works, let’s assume that we want to get a view on how many people do something embarrassing like picking their nose. To do that, we set up a survey with the question “Do you pick your nose?” and with YES and NO buttons below it. We collect all these answers on a server somewhere, but instead of sending the real answer, we’re going to introduce some noise. Let’s say that Bob is a nose picker and he clicks on the YES button. Before we send his response to the server, our differential privacy algorithm will flip a coin. If it’s heads, the algorithm sends Bob’s real answer to our server.
If it’s tails, the algorithm flips a second coin and sends YES if it’s tails or NO if it’s heads. Back on our server we see the data coming in, but because of the added noise we can’t trust individual records. Our record for Bob might say that he’s a nose picker, but someone who isn’t a nose picker still ends up with a YES 1 time in 4, purely because of the coin tosses that the algorithm performed. This is plausible deniability. You can’t be sure of any one person’s answer, so you can’t judge them on it. This is particularly interesting if you’re collecting data about illegal behavior such as drug use, for instance. Now, because you know how the noise is distributed (about a quarter of all responses are a random YES and a quarter are a random NO), you can compensate for it and end up with a fairly accurate view on how many people are actually nose pickers. Now of course the coin toss algorithm is just an example and a bit too simple.
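To make the coin toss scheme concrete, here is a minimal Python sketch of it. The function names, the 30% share of nose pickers and the 100,000 simulated respondents are all made up for illustration. The correction step relies on one observation: if the true share of nose pickers is p, the share of YES answers that reaches the server is 0.5 × p + 0.25, so p can be recovered as 2 × (observed share − 0.25).

```python
import random

def randomized_response(truth, rng):
    """Report a yes/no answer using the two-coin scheme."""
    if rng.random() < 0.5:       # first coin is heads: send the real answer
        return truth
    return rng.random() < 0.5    # first coin is tails: second coin picks YES or NO

def estimate_true_rate(reports):
    """Undo the known noise: P(YES reported) = 0.5 * p + 0.25."""
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)

rng = random.Random(42)          # fixed seed so the run is repeatable
true_rate = 0.30                 # made-up share of nose pickers
answers = [rng.random() < true_rate for _ in range(100_000)]
reports = [randomized_response(answer, rng) for answer in answers]

print(f"real share:      {sum(answers) / len(answers):.3f}")
print(f"estimated share: {estimate_true_rate(reports):.3f}")
```

The server only ever sees `reports`, never `answers`, yet the estimate should land close to the real share. With only a few hundred respondents the estimate tends to wobble a lot more, because the noise has less data to average out over.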
The real algorithms
Real-world algorithms use the Laplace distribution to spread the data over a larger range and increase the level of anonymity. In the paper “The Algorithmic Foundations of Differential Privacy” it is noted that differential privacy promises that the outcome of a survey will stay practically the same, whether or not you participate in it. Therefore you have no reason not to participate in the survey. You don’t have to fear that your data (in this case your nose picking habits) will be exposed. Alright, so now that we know what differential privacy is and how it works, let’s look at who is already using it. Apple and Google are two of the biggest companies that are using it. Apple started rolling out differential privacy in iOS 10 and macOS Sierra.
They use it to collect data on which websites are using a lot of power, which emoji are most used in a certain context and which words people are typing that aren’t in the keyboard’s dictionary. Apple’s implementation of differential privacy is documented but not open source. Google, on the other hand, has been developing an open source library for this. They use it in Chrome to do studies on browser malware and in Maps to collect data about the traffic in large cities. But overall there aren’t many companies that have adopted differential privacy, and those that have only use it for a small percentage of their data collection.
So why is that? Well, for starters, differential privacy is only usable for large datasets because of the injected noise. Using it on a tiny dataset will likely result in inaccurate data. And then there is also the complexity of implementing it. It’s a lot more difficult to implement differential privacy than to just report the real data of users and “anonymize” it the old-fashioned way. So the bottom line is that differential privacy can help companies learn more about a group of users without compromising the privacy of an individual within that group. Adoption is still limited, but it’s clear that there is an increasing need for ways to collect data about people without compromising their privacy. So that was it for this video. If you’re still procrastinating, head over to the Simply Explained playlist to watch more videos. And as always: thank you very much for watching!