Privacy: A Natural Resource to Be Preserved

Cynthia Dwork, Microsoft

The Promise of Data
Parking spot near favorite restaurant; parking/driving violations
Maintenance
Power management, climate control
safer flying, oil refining
Locating the defibrillator
My car adapting to your erratic braking technique
Advertising based on click stream analysis
New professional contacts
Medical Applications
Saving your life in the ER by getting your medical information
Learning medical facts, e.g., genotype/phenotype correlations
Allocation of resources
Utilization of spare parts / employees
Usage of public funds, representation in Congress

The Threat of Data
Parking spot near favorite restaurant; parking/driving violations
Maintenance
Power management, climate control
safer flying, oil refining
Locating the defibrillator
My car adapting to your erratic braking technique
Advertising based on click stream analysis
New professional contacts
Medical Applications
Saving your life in the ER by getting your medical information
Learning medical facts, e.g., genotype/phenotype correlations
Allocation of resources
Utilization of spare parts / employees
Usage of public funds, representation in Congress
Andreas Weigend's entire talk

Our Focus: Trusted (and Trustworthy) Curator
Privacy-Preserving Analysis of Confidential Data
Mathematical Definition of Privacy
Finding Statistical Correlations
Analyzing medical data
Correlating cough outbreak with chemical plant malfunction
Can't be done with HIPAA safe-harbor sanitized data
Noticing Events
Detecting spike in ER admissions for asthma
Datamining Tasks
Clustering; learning association rules, decision trees, separators;
principal component analysis
Official Statistics
Contingency Table Release

Hasn't This Been Done Before?
Yes.

Two Models
Non-Interactive: Data are sanitized and released
Interactive: Multiple Queries, Adaptively Chosen
[Figure: Database; Sanitized Database]

Outline
Broken Privacy
Wrong Privacy Promises
A Right Privacy Promise: Differential Privacy
Achieving Differential Privacy
Limitations
Summary and Open Questions

Linkage Attacks: A Special Case of Auxiliary Data
Using innocuous data in one dataset to identify a record in a different dataset containing both innocuous and sensitive data
At the heart of the voluminous research on hiding small cell counts in tabular data
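
A linkage attack of the kind described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two toy datasets, the field names, and the helper `link_records` are hypothetical and do not come from the talk. The sketch joins a public, named dataset to a "de-identified" sensitive one on shared innocuous attributes and reports a person only when exactly one sensitive record matches.

```python
from collections import defaultdict


def link_records(public_rows, sensitive_rows, keys):
    """Map each public name to the sensitive row it matches uniquely on `keys`.

    Hypothetical helper: a name is reported only if exactly one
    sensitive row shares all of the given innocuous attributes.
    """
    # Index the sensitive release by its innocuous attributes.
    index = defaultdict(list)
    for row in sensitive_rows:
        index[tuple(row[k] for k in keys)].append(row)

    linked = {}
    for row in public_rows:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:  # unique match: the record is re-identified
            linked[row["name"]] = matches[0]
    return linked


# Hypothetical toy data: a named public list and a release with names removed.
public_list = [
    {"name": "Alice", "zip": "02139", "birthdate": "1970-01-02", "gender": "F"},
    {"name": "Bob", "zip": "02139", "birthdate": "1980-03-04", "gender": "M"},
    {"name": "Carol", "zip": "02140", "birthdate": "1990-05-06", "gender": "F"},
]
sensitive_release = [
    {"zip": "02139", "birthdate": "1970-01-02", "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birthdate": "1980-03-04", "gender": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birthdate": "1980-03-04", "gender": "M", "diagnosis": "flu"},
]

linked = link_records(public_list, sensitive_release, ("zip", "birthdate", "gender"))
for name, record in linked.items():
    print(name, "->", record["diagnosis"])
```

In this toy run only Alice is linked: Bob's attributes match two sensitive rows and Carol's match none. Removing names did nothing for Alice, because the remaining innocuous attributes were already unique to her.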
The Netflix Prize
Netflix Recommends Movies to its Subscribers
Seeks improved recommendation system
Offers $1,000,000 for 10% improvement
Not concerned here with how this is measured
Publishes training data

From the Netflix Prize Rules Page
The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles.
The ratings are on a scale from 1 to 5 (integral) stars.
To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids.
The date of each rating and the title and year of release for each movie are provided.

A Source of Auxiliary Information
Internet Movie Database (IMDb)
Individuals may register for an account and rate movies
Need not be anonymous
Visible material includes ratings, dates, comments

A Linkage Attack on the Netflix Prize Dataset
Narayanan & Shmatikov 2006
With 8 movie ratings (of which we allow 2 to be
completely wrong) and dates that may have a 3-day error,
96% of Netflix subscribers whose records have been
released can be uniquely identified in the dataset.
For 89%, 2 ratings and dates are enough to reduce the
set of plausible records to 8 out of almost 500,000, which
can then be inspected by a human for further
deanonymization.
Watch what you say at the water cooler!
Attack prosecuted successfully using the IMDb.
NS draw conclusions about user.
May be wrong, may be right. User harmed either way.
Gavison: Protection from being brought to the attention of others

Other Linkage Data Attacks in the Literature
HMO removes names, releases data (ZIP, Bdate, Gender)
However, (Z,B,G) enough to uniquely ID most voters.
[S] observes, and responds!
(k-anonymity)
Sweeney circa 1998
Unfortunately, can still make inferences about secrets.
[MGK] observes, and responds!
(l-diversity)
Machanavajjhala, Gehrke, and Kifer, ICDE 2006
Unfortunately, multiple releases can compromise all.
[XT] observes, and responds!
(m-invariance)
Xiao and Tao, SIGMOD 2007
Next?

Analysis of Social Network Graphs
Friendship Graph
Nodes correspond to users
Users may list others as friend, creating an edge
Edges are annotated with directional information
Hypothetical Research Question
How frequently is the friend designation reciprocated?

Anonymization of Social Networks
Replace node names/labels with random identifiers
Permits analysis of the structure of the graph
Privacy hope: randomized identifiers make it hard/impossible to identify nodes with specific individuals, thereby hiding who is connected to whom
Disastrous!
Vulnerable to active and passive attacks
Backstrom, Dwork, Kleinberg 2007

Flavor of Active Attack
Prior to release, create subgraph of special structure
Very small: circa 12 nodes
Highly internally connected
Lightly connected to the rest of the graph
Connections:
Victims: Steve and Jerry
Attack Contacts: A and B
Finding A and B allows finding Steve and Jerry
[Figure: nodes S, J, A, B]
Magic Step
Isolate lightly linked-in subgraphs from rest of graph
Special structure of subgraph permits finding A, B

What is Going Wrong?
Guarantees are Syntactic, not Semantic
k, l, m
name replaced with random string
Ad Hoc!
Privacy compromise defined to be a certain set of
undesirable outcomes
No argument that this set is exhaustive or completely
captures privacy
Failure to account for auxiliary information
In vitro vs in vivo

Getting it Right in Cryptography: Semantic Security Against an Eavesdropper [GM82]
Vocabulary
Plaintext: the message to be transmitted
Ciphertext: the encryption of the plaintext
Auxiliary information: anything else known to attacker
The ciphertext leaks no information about the plaintext.
Formalization
Compare the ability of someone seeing aux and ciphertext to guess (anything about) the plaintext, to the ability of someone seeing only aux to do the same thing.
Difference should be tiny.
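
One way to write this comparison down, offered as a sketch and not as the exact statement of [GM82]: the symbols below are assumed notation. Here m is the plaintext, Enc(m) its ciphertext, aux the auxiliary information, f any function of the plaintext the attacker tries to guess, A an attacker who sees both aux and the ciphertext, A' an attacker who sees only aux, and ε the "tiny" difference.

```latex
% Assumed notation: for every attacker A there is an attacker A' such that,
% for every function f of the plaintext,
\[
  \Bigl|\,
    \Pr\bigl[\mathcal{A}(\mathrm{aux}, \mathrm{Enc}(m)) = f(m)\bigr]
    -
    \Pr\bigl[\mathcal{A}'(\mathrm{aux}) = f(m)\bigr]
  \,\Bigr| \;\le\; \varepsilon .
\]
```

Read this way, the ciphertext helps the attacker guess f(m) by at most ε beyond what the auxiliary information alone already allowed.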
Statistical Databases
Dalenius, 1977
Anything that can be learned about a respondent from the
statistical database can be learned without access to the
database
An ad omnia guarantee
Happily, Formalizes to Semantic Security
Recall: Anything about the plaintext that can be learned from
the ciphertext can be learned without the ciphertext
Popular Intuition: prior and posterior views about an individual
shouldn't change too much.
Clearly Silly
My (incorrect) prior is that everyone has 2 left feet.
Very popular in literature nevertheless
Definitional aw