Differential Privacy

These days, companies are using more and more of our data to improve their products and services. And that makes sense if you think about it: it's better to measure what your users like than to guess and build products that no one wants to use.

However, this is also dangerous. It undermines our privacy, because the collected data can be quite sensitive and could cause harm if it were to leak. So companies want data to improve their products, while we as users want to protect our privacy. These conflicting needs can be reconciled with a technique called differential privacy. It allows companies to collect information about their users without compromising the privacy of any individual. But first, let's look at why we would go through all this trouble.

Couldn't companies just take our data, remove our names, and call it a day? Not quite! There are two problems with this approach. First, the anonymization usually happens on the servers of the companies that collect your data, so you have to trust them to really remove the identifying records. Second, how anonymous is "anonymized" data really? In 2006, Netflix started a competition called the Netflix Prize. Competing teams had to create an algorithm that could predict how someone would rate a movie. To help with this challenge, Netflix provided a dataset containing over 100 million ratings submitted by over 480,000 users for more than 17,000 movies.

Netflix of course anonymized this dataset by removing the names of users and by replacing some ratings with fake, random ones. Even though that sounds pretty anonymous, it wasn't. In 2008, two computer scientists from the University of Texas published a paper showing that they had successfully identified people in this dataset by combining it with data from IMDb. These are called linkage attacks: pieces of seemingly anonymous data are combined to reveal real identities.

Another, creepier example is the case of the governor of Massachusetts. In the mid-1990s, the state's Group Insurance Commission decided to publish the hospital visits of state employees. They anonymized the data by removing names, addresses, and other fields that could identify people. However, computer scientist Latanya Sweeney showed how easy it was to reverse this. She combined the published health records with voter registration records and simply narrowed down the list. Only one person in the medical data lived in the same ZIP code and had the same gender and date of birth as the governor, thus exposing parts of his medical records. In a later paper she noted that 87% of all Americans can be identified with only three pieces of information: ZIP code, birthday, and gender. So much for anonymity.

Clearly, this technique isn't enough to protect our privacy. Differential privacy, on the other hand, neutralizes these kinds of attacks! To explain how it works, let's say we want to find out how many people do something embarrassing, like picking their nose. We set up a survey with the question "Do you pick your nose?" and YES and NO buttons below it. We collect all the answers on a server somewhere, but instead of sending the real answer, we introduce some noise. Say Bob is a nose picker and he clicks the YES button. Before sending his response to the server, our differential privacy algorithm flips a coin. If it's heads, the algorithm sends Bob's real answer to the server.

If it's tails, the algorithm flips a second coin and sends YES if that one is tails or NO if it's heads. Back on our server, we see the data coming in, but because of the added noise we can't trust individual records. Our record for Bob might say that he's a nose picker, but there is at least a 1-in-4 chance that he isn't and that the answer was simply the result of the coin tosses the algorithm performed. This gives everyone plausible deniability: since you can't be sure of anyone's real answer, you can't judge them on it. That's particularly useful when you're collecting data about illegal behavior, such as drug use. And because we know exactly how the noise is distributed, we can compensate for it and end up with a fairly accurate estimate of how many people actually pick their nose. Of course, the coin-toss algorithm is just an example and a bit too simple.
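The coin-toss scheme described above is a classic mechanism known as randomized response, and it can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function names are my own.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """First coin: heads -> report the truth.
    Tails -> flip a second coin and report YES on tails, NO on heads."""
    if random.random() < 0.5:        # first coin came up heads
        return true_answer
    return random.random() < 0.5     # second coin decides the answer

def estimate_yes_fraction(responses: list[bool]) -> float:
    """Undo the noise: P(reported YES) = 0.5 * p + 0.25,
    where p is the true fraction, so p = 2 * (observed - 0.25)."""
    observed = sum(responses) / len(responses)
    return 2 * (observed - 0.25)
```

With enough responses, the estimate converges on the true fraction even though no individual record can be trusted.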

The real algorithms

Real-world algorithms use the Laplace distribution to spread the noise over a larger range and increase the level of anonymity. The paper "The Algorithmic Foundations of Differential Privacy" notes that differential privacy promises that the outcome of a survey will stay (almost) the same, whether or not you participate in it. Therefore, you have no reason not to participate: you don't have to fear that your data, in this case your nose-picking habits, will be exposed. Now that we know what differential privacy is and how it works, let's look at who is already using it. Apple and Google are two of the biggest companies using it. Apple started rolling out differential privacy in iOS 10 and macOS Sierra.
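To make the Laplace idea mentioned above concrete, here is a hedged sketch of the Laplace mechanism applied to a counting query. The epsilon parameter controls the privacy level; the helper names are my own, not from any particular library.

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two
    independent exponential random variables."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(answers: list[bool], epsilon: float) -> float:
    """Laplace mechanism for a count. Adding or removing one person
    changes a count by at most 1 (sensitivity 1), so noise with
    scale 1/epsilon yields epsilon-differential privacy."""
    return sum(answers) + laplace_noise(1 / epsilon)
```

Smaller epsilon means more noise and stronger privacy; the noisy count stays useful because the noise averages out over large datasets.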

They use it to collect data on which websites use a lot of power, which emojis are most used in a certain context, and which words people type that aren't in the keyboard's dictionary. Apple's implementation of differential privacy is documented but not open source. Google, on the other hand, has been developing an open-source library for this. They use it in Chrome for studies on browser malware and in Maps to collect data about traffic in large cities. But overall, not many companies have adopted differential privacy, and those that have only use it for a small percentage of their data collection.

So why is that? Well, for starters, differential privacy is only usable on large datasets because of the injected noise; using it on a tiny dataset will likely produce inaccurate results. And then there is the complexity of implementing it: it's a lot more difficult than just reporting users' real data and "anonymizing" it the old-fashioned way. The bottom line is that differential privacy lets companies learn about a group of users without compromising the privacy of any individual within that group. Adoption is still limited, but there is clearly a growing need for ways to collect data about people without compromising their privacy. So that was it for this video. If you're still procrastinating, head over to the Simply Explained playlist to watch more videos. And as always: thank you very much for watching!
