Tragedy of the Data Commons


TABLE OF CONTENTS

I. INTRODUCTION
II. FRUITS OF THE DATA COMMONS
 A. Research Data
 B. The Value of the Data Commons
 C. Ex Ante Valuation Problems
 D. The Importance of Broad Accessibility
 E. Freedom of Information Act Requests: Privacy as an Evasion Technique
III. DOOMSDAY DETECTION: THE COMPUTER SCIENCE APPROACH
 A. How Attack Algorithms Work
 B. Erroneous Assertions
  1. Not Every Piece of Information Can Be an Indirect Identifier
  2. Group-Based Inferences Are Not Disclosures
  3. A Data Release Can Be Useful and Safe at the Same Time
  4. Re-Identifying Subjects in Anonymized Data Is Not Easy
  5. De-Anonymized Public Data Is Not Valuable to Adversaries
IV. THE SKY IS NOT FALLING: THE REALISTIC RISKS OF PUBLIC DATA
 A. Defective Anonymization
 B. The Probability that Adversaries Exist
 C. Scale of the Risk of Re-Identification in Comparison to Other Tolerated Risks
V. A PROPOSAL IN THE STATE OF HIGHLY UNLIKELY RISK
 A. Anonymizing Data
 B. Safe Harbor for Anonymized Data
 C. Criminal Penalties for Data Abuse
 D. Objections
 E. Improving the Status Quo
VI. CONCLUSION: THE TRAGEDY OF THE DATA COMMONS
 A. Problems with the Property Model
 B. The Data Subject as the Honorable Public Servant

I. INTRODUCTION

Over the past ten years, the debate over welfare reform has been transformed by Jeffrey Grogger and his coauthors. Grogger's data-driven research shows, among other things, that work requirements and time limits may have no effect on marriage or fertility rates. (1) In other words, welfare does not produce "welfare queens." More recently, Roland Fryer and Steven Levitt have discredited Herrnstein's theory that the test score gap between Caucasians and African Americans is the result of biological differences. Fryer and Levitt used longitudinal data to document for the first time that there are no differences in the cognitive skills of white and black nine-month-old babies, and that the gap that develops by elementary school is explained almost entirely by socio-economic and environmental factors. (2) And in 2001, John J. Donohue and Steven D. Levitt presented shocking evidence that the decline in crime rates during the 1990s, which had defied explanation for many years, was caused in large measure by the introduction of legalized abortion a generation earlier. (3)

These studies and many others have made invaluable contributions to public discourse and policy debates, and they would not have been possible without anonymized research data--what I call the "data commons." The data commons is comprised of the disparate and diffuse collections of data made broadly available to researchers with only minimal barriers to entry. We are all in the data commons; information from our tax returns, medical records, and standardized tests seed the pastures. We are protected from embarrassment and misuse by anonymization. But a confluence of events has motivated privacy experts to abandon their faith in data anonymization.

In his recent article, Paul Ohm brought the concerns of the computer science community to a wide audience of lawyers and policymakers. Ohm's argument is simple and superficially sound: As the amount of publicly available information on the Internet grows, so too does the chance that a malfeasor can reverse engineer a dataset that was once anonymized and expose sensitive information about one of the data subjects. (4) Privacy advocates, the media, and the Federal Trade Commission ("FTC") have accepted uncritically the notion that anonymization is impossible, and they advocate for the wholesale dismantling of the concept of anonymization. (5) In its place, privacy advocates recommend that research data should be regulated under the strong property and autonomy models of privacy favored by Lawrence Lessig, Jerry Kang, Paul Schwartz, and other scholars. (6)

Today, data privacy practices are shaped by some combination of ambiguous statutory directives, inconsistent case law, industry best practices, whim, and self-serving discretionary preferences. The time is ripe for the creation of uniform data privacy policies, and there is much to fix. (7) But proposals that inhibit the dissemination of research data dispose of an important public resource without reducing the privacy risks that actually put us in peril. This Article argues that it is in fact the research data that is now in great need of protection. People have begun to defensively guard anonymized information about themselves. We are witnessing a modern example of a tragedy of the commons. (8) Each individual has an incentive to remove her data from the commons to avoid remote risks of re-identification. This way she gets the best of both worlds: her data is safe, and she also receives the indirect benefits of helpful health and policy research performed on the rest of the data left in the commons. However, the collective benefits derived from the data commons will rapidly degenerate if data subjects opt out to protect themselves. (9)

This Article challenges the dominant perception about the risks of research data by making three core claims. First, the social utility of the data commons is misunderstood and greatly undervalued by most privacy scholars. Public research data produces rich contributions to our collective pursuit of knowledge and justice. Second, the influential legal scholarship by Ohm and others misinterprets the computer science literature, and as a result, oversells the futility of anonymization, even with respect to theoretical risk. And third, the realistic risks posed by the data commons are negligible. So far, there have been no known occurrences of improper re-identification of a research dataset. Even the hypothetical risks are smaller than other information-based risks (e.g., from data spills or hacking) that we routinely tolerate for convenience.

The Article proceeds as follows: Parts II, III, and IV perform a risk-utility calculus on the data commons, finding that the public data commons is tremendously valuable (Part II), that the theoretical risks of research data are exaggerated (Part III), and that the true risks posed by research data are nonexistent (Part IV). Together, Parts II through IV show that concerns over anonymized data have all the characteristics of a moral panic and are out of proportion to the actual threat posed by data dissemination. (10) In Part V, I put forward a bold proposal to redesign privacy policy such that public research data would be easier to disseminate. While data users who intentionally re-identify a subject in an anonymized dataset should be sanctioned heavily, agencies and firms that compile and produce the data in the commons should receive immunity from statutory or common law privacy claims so long as they undergo basic anonymization techniques. Part V also provides clear guidance for data producers operating under the current statutory regime. Part VI concludes with an appeal to the legal community to think and talk about research data differently. The bulk of privacy scholarship has had the deleterious effect of exacerbating public distrust in research data. Rather than encouraging the public to fervently guard their self-interest, scholars should build a sense of civic responsibility to pay their "information taxes" and participate in research datasets.

II. FRUITS OF THE DATA COMMONS

The benefits flowing from the data commons are indirect but bountiful. Thus far, the nascent technical literature on de-anonymization has virtually ignored the opportunity costs that would result from a drastic reduction in data sharing. (11) Legal scholars who write on the topic acknowledge the public interest in information, but they give that interest short shrift and describe it in abstract terms. (12) To strike the right balance between the public's interest in privacy and its interest in the data commons, we must have a more concrete understanding of the value gleaned from broadly accessible research data. In this Part, I define the data commons and explore its utility. I also discuss government agencies' pretextual use of privacy law to evade Freedom of Information Act ("FOIA") requests when disclosures could reveal something embarrassing to the government.

A. Research Data

This Article addresses datasets that are compiled and shared broadly for "research," by which I mean a methodical study designed to contribute to human knowledge by reaching verifiable and generalizable conclusions. (13) Although this is an expansive definition of "research," it importantly excludes analytic studies on the subject pool for the purpose of understanding the particular individuals in the pool, as opposed to understanding a general population. (14)

Public-use research datasets are usually subject to legal constraints that guard the privacy of the data subjects, and the largest producers of research data (including the U.S. Census Bureau and other federal agencies) use sophisticated anonymization techniques that go well beyond the minimum legal requirements. (15) Privacy laws in their various forms usually prohibit the release of personally identifiable information ("PII"). (16) Information is personally identifiable if it can be traced to a specific individual. (17) Obviously, information that is tied to a direct identifier, such as name, address, or social security number, is personally identifiable. For example:

Jane Yakowitz is actually a giant cockroach.

However, PII is not limited to information that directly identifies a subject. Included in its ambit are pieces of information that can be used in combination to indirectly link sensitive information to a particular person.

 A 31-year-old white female who works at Brooklyn Law School and lives in ZIP code 11215 is actually a giant cockroach.

Or:

 All 31-year-old females that live in ZIP code 11215 are actually giant cockroaches.

I will use the term "indirect identifiers" to mean pieces of information that can lead to the identity of a person through cross-reference to other public sources or through general knowledge. (18) "Nonidentifiers," in contrast, cannot be traced to individuals without having special non-public information.

Paul Ohm has criticized U.S. privacy law for using static definitions of what constitutes PII, (19) but his description of the law is inaccurate. Privacy statutes list categories of information that necessarily must be classified as indirect identifiers (such as sex and ZIP code), but the statutes also obligate data producers to guard against other unspecified indirect identifiers that, in context, could be used to re-identify a subject. For example, the Confidential Information Protection and Statistical Efficiency Act ("CIPSEA") disallows the disclosure of statistical data or information that is in "identifiable form," defined as "any representation of information that permits the identity of the respondent to whom the information applies to be reasonably inferred by either direct or indirect means." (20) The Family Educational Rights and Privacy Act ("FERPA") and the regulations implemented under the Health Insurance Portability and Accountability Act ("HIPAA") define PII similarly, with savings clauses that prohibit releases that might be reverse engineered through indirect means. (21)

The PII standard has a significant impact on the data commons. Large, information-rich datasets will inevitably contain PII because the combinations of indirect identifiers are likely to make some of the subjects unique, or close to it. Thus, even the legal minimum anonymization requires some of the utility of a dataset to be lost through redaction and blurring in order to ensure that no subject has a unique combination of indirect identifiers.
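
To make the uniqueness problem concrete, here is a minimal sketch in Python (the records, the chosen set of indirect identifiers, and the threshold K are all invented for illustration) of the kind of check a data producer might run before release: count how many subjects share each combination of indirect identifiers, and flag any combination that is unique or nearly so as a candidate for redaction or blurring.

  # Minimal sketch: flag combinations of indirect identifiers shared by too few subjects.
  # The records, quasi-identifier list, and threshold are invented for illustration.
  from collections import Counter

  records = [
      {"sex": "F", "zip": "11215", "age": 31, "favorite_food": "ramen"},
      {"sex": "F", "zip": "11215", "age": 34, "favorite_food": "pizza"},
      {"sex": "M", "zip": "11215", "age": 31, "favorite_food": "ramen"},
      {"sex": "M", "zip": "11215", "age": 31, "favorite_food": "tacos"},
  ]
  quasi_identifiers = ("sex", "zip", "age")   # assumed set of indirect identifiers
  K = 2                                       # assumed minimum acceptable group size

  counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
  risky = [combo for combo, n in counts.items() if n < K]
  print(risky)  # combinations needing coarsening (e.g., age bands or 3-digit ZIP codes)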

B. The Value of the Data Commons

In 1997, policy researchers at the RAND Corporation warned that the Sentencing Reform Act of 1984 and the plethora of state statutes setting minimum sentencing requirements for drug convictions are a less cost-effective means to reduce the consumption of cocaine than the previous system. (22) Moreover, both enforcement regimes were less effective per dollar spent on enforcement than on treatment programs. (23) While the change in policy could be defended on the basis of retributive goals, the promised deterrent effects were illusory. (24) Now that states are facing gaping budget holes, the tune has changed. The severity and consistency of drug convictions are no longer political imperatives, and the costs of maintaining prisons are causing consternation. (25) Voters in Arizona and California passed legislation to reduce sentencing for low-level drug offenders. (26) This may seem like sound policy, given the tenuous relationship between sentencing time and deterrence, but a new study produced by RAND shows that this policy move might be ill advised, too. (27) During the last twenty years, prosecutors have altered their behavior to adapt to the minimum sentencing laws by using them as bargaining power to secure plea bargains. (28) As a result, offenders serving prison time today for low-level drug offenses usually have much more serious criminal histories than their records suggest. (29) Both of the RAND studies have made important contributions to the complex debate on crime and drug policy, and both were made possible by the data commons. (30)

If data anonymity is presumed not to exist, the future of public-use datasets and all of the social utility flowing from them will be thrown into question. Nearly every recent public policy debate has benefited from mass dissemination of anonymized data. Public use data released by the Federal Financial Institutions Examination Council provides a means of detecting housing discrimination and informs policy debates over the home mortgage crisis. (31) Research performed by health economists and epidemiologists using Medicare and Medicaid data is now central to the debates about health care reform. (32) Census microdata has been used to detect racial segregation trends in housing. (33) Public-use birth data has led to great advances in our understanding of the effects of smoking on fetuses. (34) Public crime data has been used to reveal the inequitable allocation of police resources based on the socio-economic status of neighborhoods. (35) And the data commons is repeatedly used to expose fraud and discrimination that would not be discoverable or provable based on the experience of a single person. (36)

None of this data would be available to the broad research community under a conception of privacy that abandons hope in anonymization. These datasets are critical to what George T. Duncan calls "Information Justice," which is the fairness that accessible information offers to the general public in the form of knowledge, and offers to individuals in the form of a discoverable and verifiable grievance. (37)

C. Ex Ante Valuation Problems

The value of a research database is very difficult to discern in the abstract, before researchers have had a chance to analyze it. The uncertain value makes it difficult to know when privacy interests ought to succumb to the public interest in data sharing. Paul Schwartz demonstrates the problem when he argues that some types of information do not implicate data privacy: "[S]ome kinds of aggregate information involve pools that are large enough to be viewed, at the end of the day, as purely statistical and thus, as raising scant privacy risks as a functional matter." (38) He cites flu trends as an illustration of this sort of aggregate non-problematic data. (39) But Google's Flu Trends--the fastest and most geographically accurate way to monitor national flu symptoms (40)--only works by collecting all Google search queries by IP address. (41) This practice runs afoul of Schwartz's admonition against collecting information without a specific and limited purpose. (42)

Google Flu Trends exemplifies why it is not possible to come to an objective, prospective agreement on when data collection is sufficiently in the public's interest and when it is not. (43) Flu Trends is an innovative use of data that was not originally intended to serve an epidemiological purpose. The program uses data that, in other contexts, privacy advocates believe violates Fair Information Practices. (44) This illustrates a concept understood by social scientists that is frequently discounted by the legal academy and policy-makers: some of the most useful, illuminating data was originally collected for a completely unrelated purpose. Policymakers will not be able to determine in advance which data resources will support the best research and make the greatest contributions to society. To assess the value of research data, we cannot cherry-pick between "good" and "bad" data collection. (45)

Take another example, recently reproduced in the Freakonomics blog. The online dating website OkCupid analyzes all of the information entered by its members to reveal interesting truths about the dating public. (46) In one fascinating study, the OkCupid researchers found that men of all races responded to the initial contacts of black females at significantly lower rates, despite the fact that the profiles of black females are as compatible as the females of every other race. (47)

One of the most remarkable aspects of the OkCupid study is that it did not draw the ire of privacy advocates. (49) Contrast Freakonomics's coverage of the OkCupid study with the L.A. Times's coverage of a Facebook study that came to the unsurprising conclusion that Facebook statuses are cheery on holidays and dreary when celebrities die: "If you put something on Facebook, no matter how tight your privacy settings are, Facebook Inc. can still hang onto it, analyze it, remix it and repackage it. Despite its silly name, the Gross National Happiness indicator is creepy. We're in there." (50)

How is it that Facebook's study attracted criticism of its privacy policies while the data used in the OkCupid study went unnoticed? The difference is likely explained by the value of the OkCupid study. The OkCupid study's contribution to our understanding of human relations distracts commentators from thinking about the source of the data. The utility of the research overshadows our collective anxiety about research data. The trouble is that the public and the press undervalue the beneficial uses of research data when the attention turns to data privacy.

The OkCupid study illustrates another important quality of research microdata: that collectively, our data reveals more than any of us could know on our own. The message-writing decisions of each individual OkCupid member could not have revealed the patterns of preferences, but when aggregated, the data supports a hypothesis about human nature and implicit bias. Research data describes everybody without describing anybody. If the data from the OkCupid profiles was thought to be the property of the members, subject to their exclusive determination on the uses to which it is put, society at large, and OkCupid members in particular, would be deprived of the discovery of this quiet pattern.

D. The Importance of Broad Accessibility

The value of data is not completely lost on privacy law scholars, but the need for broad access generally is. When data can be shared freely, it creates a research dialog that cannot be imitated through restricted data and license agreements. In contrast to legal scholars, technology journalists recognize the unmatchable virtues that come from crowdsourcing when all interested people have unfettered access to data. (51) General access ensures the best chance that a novel or creative use of a dataset will not be missed.

Privacy laws that constrain the dissemination of the most useful data through discretionary licensing agreements (such as HIPAA and FERPA) are designed without sufficient appreciation as to how research works. Ironically, they operate on a model that gives researchers too much credit, and has too much faith that data supports just one unassailable version of the truth. In practice, transparency and data sharing are integral to a researcher's credibility. The data commons protects the public discourse from two common research hazards: (1) the failure to catch innocent mistakes, which are legion, and (2) the restriction of access to highly useful data based on ideological considerations or self-interest.

Replication is indispensable to the process of achieving credible, long-lasting results. (52) Just as mistakes and even fabrications occur in the hard sciences, (53) they also occur in the social sciences. The gatekeepers at peer-reviewed science and economics journals have proven to be significantly less effective than the motivated monitoring of peers and foes in the field. (54) For example, a study published in England's preeminent health research journal claimed to have found statistical proof that women can increase the chance of conceiving a male fetus if they eat breakfast cereal. (55) The findings were covered by the New York Times and National Public Radio. (56) When the data was made available to other researchers, the results quickly fell apart and have become something of a cautionary tale against researchers that torture a dataset into producing statistically significant results. (57) Simple coding errors are even more common and can distort and completely invert results. Because of the frequency and inevitability of these sorts of errors, the most respected journals make data sharing a prerequisite for publication (and even article submission). (58)

Consider the debate on the deterrent effects of the death penalty. In 1972, the U.S. Supreme Court determined that existing death penalty statutes and practices violated convicts' Eighth Amendment right to be free from cruel and unusual punishment. (59) But three years later, an explosive empirical study by Isaac Ehrlich concluded that each execution had the effect of saving up to eight lives by deterring would-be criminals from killing. (60) Robert Bork, then the Solicitor General, cited to Ehrlich's study in his brief for Gregg v. Georgia (61) a year later and, lo and behold, the Supreme Court was persuaded to end the moratorium on death sentences. (62) The trouble is, Ehrlich's persuasive study has not stood the test of time and replication. Since then, the capital punishment debate has attracted the attention of many prized economists. (63) John J. Donohue and Justin Wolfers have shown that the empirical studies finding a deterrent effect are highly sensitive to the choice of sampling periods and other discretionary decisions made by the studies' authors. (64) The deterrent effects found by Ehrlich are in doubt, now that economists have had the opportunity to test the robustness of the findings and explore the idiosyncratic series of methodological decisions that led to them. (65) Had Ehrlich alone had access to the crime data supporting his research, and had his study been left to circulate in the media unchallenged, we might not have seen the wane in public and political support for capital punishment that we do today. (66)

Data, just like any other valuable resource, can and often does fall into the control of people or organizations that are politically entrenched. (67) Because the legitimacy of discretionary access decisions is not independently scrutinized, restricted access policies allow data producers to withhold information for politically or financially motivated reasons. (68) A thriving public data commons serves the primary purpose of facilitating research, but it also serves a secondary purpose of setting a data-sharing norm so that politically motivated access restrictions will stick out and appear suspect. Thus, if an entity shared data with researchers under a restricted license to support a study that yielded results that happened to harmonize with the entity's self-interest (as was the case when a pharmaceutical company withheld the raw data from its clinical trials even though the results were used to support an application for FDA approval (69)), the lack of transparency would be a signal that the research may have been tainted by significant pressure to come out a particular way.

Today we get the worst of both worlds. Data can be shared through licensing agreements to whomever the data producer chooses, and privacy provides the agency with an excuse beyond reproach when the data producer prefers secrecy to transparency. This is precisely what happened in Fish v. Dallas Independent School District. (70) The Dallas School District denied a request from the Dallas chapter of the NAACP for longitudinal data on Iowa Test scores that would have tracked Dallas schoolchildren over an eleven-year period. (71) Based on expert testimony that a malfeasor could "trace a student's identification with the information requested by [the NAACP] using a school directory," the requested data was found to violate FERPA. (72)

The Fish opinion interprets and enforces the FERPA regulations properly. (73) The outcome is consistent with FERPA's statutory goals. However, it also exposes the troubling, draconian results of modern data privacy policy. The data requested by the NAACP might have exposed evidence of discrimination or disparate resource allocation. The school district had the option to cooperate with the NAACP's request by using FERPA's research exemption and providing the data under a restrictive license. (74) Alternatively, the district could have provided a randomized sample of the data so that class sizes could not be used to trace identities. But they had little incentive to do either, and perhaps even an incentive not to do so. Privacy law provided the school district with a shield from public scrutiny, and allowed the school district to flout the objectives of public records laws.

We will never know what the Fish data might have revealed. Perhaps theories of disparate treatment across class or race lines would have been borne out. Perhaps the research would have facilitated some other, unanticipated finding. Even the confirmation of a null hypothesis can have significant implications, particularly where a portion of the population suspects it may be receiving inequitable treatment. Since privacy law allowed the data producer to avoid disclosure, the value of the withheld data will be forever obscured, and any systemic patterns will be known only to the Dallas school district--if they are known at all. The Fish case nicely illustrates the dangers of assigning too little value to research data in the abstract.

E. Freedom of Information Act Requests: Privacy as an Evasion Technique

We would expect public agencies, which are subject to strong public access obligations from FOIA and state public records statutes, (75) to have fewer opportunities to make improperly motivated access decisions. After all, one of the primary goals of public access statutes is to take decisions about who does and does not get to access information out of the hands of the agency. (76) But increased anxieties over the theoretical risk of re-identification arm government agencies with a pretext for denying records requests. As Douglas Sylvester and Sharon Lohr have noted, "the strengthening of individual rights-based privacy has allowed some agencies to use privacy as a 'shield' to prevent otherwise appropriate disclosures." (77) The moral hazard reached its apex under the Bush Administration, which shielded the records of current and past presidents from FOIA requests through executive order. (78) The exemption was voluntarily repealed in 2009. (79)

This is not to say that every denial of a public records request is made in bad faith. A number of structural problems plague the process and encumber disclosure. First, the lack of comprehensible standards for privacy protocols (discussed at length in Part V) will tend to drive state agencies to withhold data from researchers if disclosure exposes the agency to liability or sanction. Moreover, the penalties and public criticism for releasing ineffectively anonymized information are much harsher than the consequences of improperly denying a public records request. (80) The imbalanced structural incentives obscure and exacerbate the potential for self-serving behavior. Freedom of information advocates and professional journalism associations allege that privacy exemptions, like national security exemptions, are abused when the requested information is embarrassing for the agency. (81) Thus, as the Society of Professional Journalists puts it, rich data is disclosed about tomato farming and transportation, while data that could be used to vet a government program or expose agency wrongdoing is redacted into oblivion--if it is released at all. (82)

Numerous examples from the FOIA case law support these observations. The Department of Agriculture used the privacy exemption of FOIA to deny a request for the identity of a corporation that compensated or bribed a member of the Dietary Guidelines Advisory Committee. (83) The State Department refused to release documents about forcibly repatriated Haitian refugees to human rights groups--purportedly to protect their privacy. (84) Privacy was "feebly" held up as a justification for declining to collect information about the religious exercise of Navy personnel, in an attempt to rebut a group of Navy chaplains' allegations that nonliturgical Christians were disfavored and underrepresented in the Navy's decisions about hiring, promotion, and retention. (85) In each of these examples, the government's privacy argument eventually failed. But sometimes this argument prevails. (86) And a great majority of denials of public records requests are not litigated at all. (87)

In 2008, UCLA denied a public records request that a faculty member on the undergraduate admissions committee submitted for the University's admissions data. (88) UCLA concluded that the request posed "serious privacy concerns" and could not be fulfilled without violating FERPA. (89) Astonishingly, the same rationale did not impede UCLA from sharing similar admissions data under a restricted license agreement to a different UCLA professor. (90) The only appreciable difference between the two requests was the divergent attitudes each professor maintained toward UCLA's admissions process. The denied requester openly questioned whether the school was using applicant race information in an impermissible way. (91)

The University of Arkansas Little Rock ("UALR") School of Law denied a similar request for admissions data from a faculty member on its admissions committee. The professor regularly reviewed the original, raw admissions files, but the school denied his request for data, claiming that FERPA prohibited the release of even de-identified statistical data. (92) When a UALR Law School alumna requested access to similar application data in an independent request, the University (perhaps inadvertently) disclosed a memorandum of notes documenting advice from their legal counsel: "We say FERPA, they can challenge if they want." (93) A cogent interpretation is that the federal privacy law is being used as a tactical device to greatly increase the transaction costs for public records requests. Since requests for anonymized university and law school admissions data have already passed judicial scrutiny assessing FERPA compliance, (94) the general counsel's offices at UCLA and UALR ought to have known that, with minimal effort, a sufficiently safe admissions dataset could be produced.

The distribution of access to data is a problem worthy of national attention and concerted effort. The data commons is a powerful, natural antidote to information abuses. It is critical for information justice, since our pooled data can reveal the patterns of human experience that no single anecdote can. Since the value of a dataset cannot be determined ex ante, any rule that significantly impedes the release of research data imposes a social cost of uncertain magnitude.

III. DOOMSDAY DETECTION: THE COMPUTER SCIENCE APPROACH

A large body of computer science literature explores the theoretical risk that a subject in an anonymized dataset can be re-identified. De-anonymization scientists study privacy from an orientation that emphasizes any harm that is theoretically possible. They are in the habit of looking for worst-case scenario risks. (95) This orientation grows out of a natural inclination to believe that, if there is value to abusing anonymized data, and if re-identification is not too difficult, then such re-identification will happen. In other words, where there is motive and opportunity, a de-anonymization attack is a foregone conclusion. The de-anonymization scientists' perspective has some intuitive appeal, and the legal literature has embraced the findings and predictions of the computer science literature without much skepticism. (96) The de-anonymization literature taps into privacy advocates' natural unease any time information is distributed without the consent of the data subjects.

In this Part, I briefly explain how de-anonymization attacks work. (97) Next, I explore the lessons growing out of the computer science literature and find that they greatly exaggerate the opportunities and motivations of the hypothetical adversary. The computer science literature (and the policymakers who borrow from it) makes five inaccurate assertions: (1) every variable in a dataset is an indirect identifier; (2) data supporting inferences about a population of data subjects violates privacy; (3) useful data is necessarily privacy-violating; (4) re-identification techniques are easy; and (5) public datasets have value to an adversary over and above the information he already has. I will address each of these in turn.

A. How Attack Algorithms Work

All de-anonymization attack algorithms are variants of one basic model. An adversary attempts to link subjects in a de-identified database to identifiable data on the entire relevant population ("population records"). The adversary links the two databases using indirect identifier variables that the two datasets have in common. To visualize the attack, suppose the two circles in this diagram represent the indirect identifiers in the de-identified database and the population records, respectively. Initially, these databases have no linkages:

[ILLUSTRATION OMITTED]

The adversary identifies subjects in the de-identified data that have a unique combination of values among the indirect identifiers. He does the same to the population records:

[ILLUSTRATION OMITTED]

Finally, the adversary links all the sample uniques he can to the population uniques:

[ILLUSTRATION OMITTED]

Only a subset of the sample uniques and population uniques will be linkable because some of the sample uniques might not actually be unique in the population, and some of the population uniques might not be present in the sample of the de-identified data. (98)

Latanya Sweeney provided the classic example of a successful matching attack when she combined de-identified Massachusetts hospital data with identifiable voter registration records in order to re-identify Governor William Weld's medical records. (99) Because the hospital data at that time--before the passage of HIPAA--included granular detail on the patients (5-digit ZIP code, full birth date, and gender), many patients were unique in the hospital data and the voter records.
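
A rough sketch of this kind of matching attack, using invented data and the same indirect identifiers at issue in the Weld re-identification (ZIP code, birth date, and sex), might look like the following; only records whose combination of indirect identifiers is unique in both sources can be linked with any confidence.

  # Illustrative linkage attack on invented data: join a de-identified table to an
  # identified "population" table on shared indirect identifiers, keeping only
  # combinations that are unique in BOTH sources.
  from collections import Counter

  deidentified = [  # e.g., hospital discharge records with names removed
      {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "..."},
      {"zip": "02139", "dob": "1962-01-15", "sex": "F", "diagnosis": "..."},
  ]
  population = [  # e.g., a purchased voter registration roster
      {"name": "W. Weld", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
      {"name": "J. Doe",  "zip": "02139", "dob": "1962-01-15", "sex": "F"},
      {"name": "J. Roe",  "zip": "02139", "dob": "1962-01-15", "sex": "F"},
  ]

  key = lambda r: (r["zip"], r["dob"], r["sex"])
  sample_counts = Counter(key(r) for r in deidentified)
  population_counts = Counter(key(r) for r in population)

  links = []
  for record in deidentified:
      k = key(record)
      if sample_counts[k] == 1 and population_counts[k] == 1:  # unique on both sides
          match = next(p for p in population if key(p) == k)
          links.append((match["name"], record["diagnosis"]))

  print(links)  # only the first record links; the second is ambiguous in the roster

The second, ambiguous record illustrates the caveat above: a sample unique that is not unique in the population records cannot be linked with confidence.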

Today, there is little disagreement that this sort of "trivial de-identification" of records--the removal of only direct identifiers like names, social security numbers, and addresses--is insufficient on its own. Subjects can too easily be identified through a combination of indirect identifiers. Thus, like other federal privacy statutes, HIPAA requires data producers to remove not only the obvious direct identifiers, but also any information known by the disclosing agency that can be used alone or in combination with other information to identify an individual subject. (100)

While there is broad agreement on the rejection of trivial de-identification, privacy experts disagree on the efficacy of current best practices. Legal scholars and advocacy groups limit their focus to the computer science studies falling on one side of the debate--those making the common erroneous assertions explored below--while ignoring the disclosure-risk research coming out of the statistical and public health disciplines. This has had the unfortunate consequence of leading the legal and policy discourse astray.

B. Erroneous Assertions

The mounting literature on privacy risks associated with anonymized research data propagates five myths about re-identification risk. In combination, these inaccurate assertions lead lay audiences to believe that anonymized data cannot be safe.

1. Not Every Piece of Information Can Be an Indirect Identifier

Disclosure risk analysis has traditionally looked for categories of information previously disclosed to the public in order to distinguish "indirect identifiers" from "non-identifiers." For example, data subjects' names and addresses are available in voter registration rosters (which are public records); therefore ZIP codes and other geographic codes must be classified as indirect identifiers. (101) On the other hand, food preferences are not systematically collected and re-released publicly, so a variable describing the subject's favorite food would traditionally be considered a non-identifier.

De-anonymization scientists do not limit the theoretical adversary to public sources of information. The most influential de-anonymization study, by Arvind Narayanan and Vitaly Shmatikov, describes the re-identification of subjects in the Netflix Prize Dataset. (102) In 2006, Netflix released an anonymized dataset to the public consisting of movie reviews of 500,000 of its members. (103) Narayanan and Shmatikov used information from user ratings on the Internet Movie Database (IMDb) to re-identify subjects in the Netflix Prize dataset. (104) This study is regarded as proof that publicly accessible datasets can be reverse-engineered to expose personal information even when state-of-the-art anonymization techniques are used. (105) The study energized the press because the auxiliary information Narayanan and Shmatikov used was collected from the Internet. But before diving into how the algorithm works, it is helpful to note a chasm between Narayanan and Shmatikov's conception of privacy risk and that enshrined in U.S. privacy statutes.

Narayanan and Shmatikov examine how auxiliary information learned through any means at all, even at the water cooler, could be used to identify a target. (106) They ask, "if the adversary knows a few of the [target] individual's purchases, can he learn all of her purchases?" and "if the adversary knows a few movies that the individual watched, can he learn all movies she watched?" (107) The implicit directive from these questions is that public datasets must be immune from targeted attacks using special information. The belief that privacy policy is expected to protect data even from snooping friends and coworkers is adopted reflexively by Paul Ohm without acknowledging that it introduces a significant departure from the design of current law: "To summarize, the next time your dinner party host asks you to list your six favorite obscure movies, unless you want everybody at the table to know every movie you have ever rated on Netflix, say nothing at all." (108) If public policy had embraced this expansive definition of privacy--that privacy is breached if somebody in the database could be re-identified by anybody else using special non-public information--dissemination of data would never have been possible. Instead, U.S. privacy law in its various forms requires data producers to beware of indirect identifiers that are, or foreseeably could be, in the public domain. (109)

However, Narayanan and Shmatikov's study has sway because the Internet gives a malfeasor access to more information than he ever had before. Narayanan and Shmatikov were able to use the IMDb movie reviews of two strangers to re-identify them in the Netflix data. (110) Their study illustrates how the Internet is a (relatively) new public information resource that blurs the distinction between non-identifiers and indirect identifiers. (111) The Internet affects data anonymization by archiving and aggregating large quantities of information and by making information gathering practically costless. (112) It also provides a platform for self-revelation and self-publication, making the available range of information about any one person unpredictable and practically limitless.

Current privacy policy does not anticipate how we should deal with this shift. On one hand, if anybody can access information on the Internet, it seems unquestionable that the information is "public." Thus, this information might best be described as an indirect identifier. On the other hand, data sharing will be severely constrained if the status of a category of information is shifted from non-identifier to indirect identifier simply because members of a small minority of data subjects choose to reveal information about themselves. If I blog about a hospital visit, should my action render an entire public hospital admissions database (relied on by epidemiologists and health policy advocates) in violation of privacy law? Are the bounds of information flow really to be determined by the behavior of the most extroverted among us? (113) This looks like a quagmire from which no reasonable normative position can emerge. (114) The approach that I endorse in Part V sidesteps this question because the issue does not become relevant until we reach the apocalyptic scenario in which re-identification is a plausible risk, and adversaries painstakingly troll through our blogs to put together complete dossiers. For reasons that will soon become evident, such adversaries are unlikely to materialize.

The Netflix study makes an excellent contribution to our knowledge base, but it is a theoretical contribution. The Narayanan-Shmatikov de-anonymization algorithm is limited to a set of anonymized datasets with particular characteristics. For the algorithm to work, the dataset must be large (in the sense of having a large number of variables or attributes), and it must be sparse (which is a technical term roughly meaning that most of the dataset is empty, and that the data subjects are readily distinguishable from each other). (115) Moreover, because the attack algorithm infers population uniqueness from sample uniqueness, the research dataset must have accurate and complete information about the data subjects in the sample in order to avoid false positives and negatives (116)--a condition that does not even hold for the Netflix data and is certainly not characteristic of most large commercial datasets, such as consumer data from Amazon. And, importantly, the adversary must understand entropic de-anonymization in order to test the confidence level of his algorithm's match.
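
The flavor of the technique can be conveyed with a toy sketch (all data invented; this is not Narayanan and Shmatikov's code): auxiliary attributes are weighted by their rarity in the released data, every record is scored against the adversary's auxiliary information, and a match is accepted only if the best-scoring record stands out sharply from the runner-up--a check against false positives.

  # Toy sketch of rarity-weighted matching with a separation ("eccentricity") check.
  # The released records, auxiliary information, and threshold are all invented.
  import math
  from statistics import pstdev

  released = {  # anonymized record id -> set of rated items
      "rec1": {"A", "B", "C", "X"},
      "rec2": {"A", "B", "D"},
      "rec3": {"C", "E", "X", "Y"},
  }
  aux = {"C", "X", "Y"}  # what the adversary knows about one target (e.g., public reviews)

  def weight(item):
      # Items rated by fewer records carry more identifying power.
      support = sum(item in items for items in released.values())
      return 1.0 / math.log(1 + support)

  def score(candidate_items):
      return sum(weight(i) for i in aux if i in candidate_items)

  scores = {rid: score(items) for rid, items in released.items()}
  ranked = sorted(scores.values(), reverse=True)
  best_id = max(scores, key=scores.get)

  # Accept the match only if the top score is well separated from the second best,
  # relative to the spread of all scores.
  sigma = pstdev(ranked) or 1.0
  eccentricity = (ranked[0] - ranked[1]) / sigma
  PHI = 1.5  # assumed threshold; choosing it sensibly requires judgment about false positives
  if eccentricity >= PHI:
      print(f"Tentative match: {best_id} (eccentricity {eccentricity:.2f})")
  else:
      print("No confident match: risk of a false positive is too high")

In this toy run the separation test actually rejects the best-scoring candidate, which underscores the conditions just described: without enough rare, accurately known attributes, the method cannot distinguish a true match from a coincidence.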

These limitations are sizeable, yet they are entirely ignored by the legal scholars, privacy advocates, civil litigants, and now, the FTC, relying on the study to conclude that anonymization is dead. (117) The Narayanan-Shmatikov study has provided the first ping in an echo chamber that has distorted the conversation about public research data. Consider, for example, this report prepared by the preeminent privacy scholar Paul Schwartz:

 Regarding the question of PII versus non-PII, recent work in computer science has shown how easy it can be to trace non-PII to identifiable individuals.... [A] study involving Netflix movie rentals was able to identify eighty percent of people in a supposedly anonymous database of 500,000 Netflix users; the identification was triggered by their ratings in the Netflix database of at least three films. (118)

The Electronic Privacy Information Center ("EPIC") has gone further, claiming that the study authors re-identified 99 percent of the Netflix users. (119) These statements bear scant relation to reality. In fact, Narayanan and Shmatikov performed a proof of concept study on a small sample of IMDb users. They successfully re-identified two of the IMDb users in the Netflix database. (120) There is a real risk that the echo chamber will continue to distort the reasoned judgment of lawmakers and regulators if such misconceptions are not corrected now.

Of the studies conducted in the last decade, only one was conducted under the conditions that replicate what a real adversary would face while also verifying the re-identifications. The Federal Department of Health and Human Services Office of the National Coordinator for Health Information Technology ("ONC") put together a team of statistical experts to assess whether data properly de-identified under HIPAA can be combined with readily available outside data to re-identify patients. (121) The team began with a set of approximately 15,000 patient records that had been de-identified in accordance with HIPAA. (122) Next, they sought to match the de-identified records with identifiable records in a commercially available data repository and conducted manual searches through external sources (e.g., InfoUSA) to determine whether any of the records in the identified commercial data would align with anyone in the de-identified dataset. (123) The team determined that it was able to accurately re-identify two of the 15,000 individuals, for a match rate of 0.013%. (124) In other words, the risk--even after significant effort--was very small. (125)

Other, less attention-grabbing studies from the field of statistical disclosure risk have similarly differed from the conclusions drawn by the Narayanan-Shmatikov study: in realistic settings, datasets can rarely be matched to one another because both sets of data usually contain substantial amounts of measurement error that decimate the opportunity to link with confidence. (126) This is not the sort of difficulty that can be overcome with technology or shrewd new attack techniques; rather, it is a natural protection afforded by the inherently messy nature of data and of people. (127)

2. Group-Based Inferences Are Not Disclosures

Computer scientists have an expansive definition of privacy. They count as privacy breaches even mere inferences that might be applied to an individual based on subgroup statistics. Justin Brickell and Vitaly Shmatikov, computer scientists at the University of Texas whose work has greatly influenced Paul Ohm's scholarship, define privacy breach to include the release of any information where the distribution of a sensitive variable for a subgroup of data subjects differs from that variable's distribution over the entire sample. (128) Similarly, Cynthia Dwork has crafted her definition of "differential privacy" to cover group privacy. (129)

This conception of a privacy right--one that protects against the disclosure of any sensitive information that differs by demographic subgroup--avoids two potential harms that can result from group inference disclosure. First, facts about a group can be used to make a determination about an individual. For example, a health care provider might deny coverage to a member of a particular subgroup based on the health profiles of the entire subgroup. Second, group differences in a sensitive characteristic can lead the public to adopt inappropriate stereotypes that mischaracterize individuals and lead to prejudices. James Nehf describes the problem as so: "Since the information used to form [a] judgment is not the complete set of relevant facts about us, we can be harmed (or helped) by the stereotyping or mischaracterization." (130)

These criticisms are shortsighted. They are, in fact, attacks on the very nature of statistical research. Federal statistical agencies have responded to concerns about subgroup inference disclosure with two persuasive retorts. "First[,] a major purpose of statistical data is to enable users to infer and understand relationships between variables. If statistical agencies equated disclosure with inference, very little data would be released." (131) Indeed, the definition of privacy breach used by Brickell and Shmatikov is a measure of the data's utility; if there are group differences between the values of the sensitive variables, such as a heightened risk of cancer for a discernable demographic or geographic group, then the data is likely to be useful for exploring and understanding the causes of those differences. (132)

"Second, inferences are designed to predict aggregatebehavior, not individual attributes, and thus are often poor predictorsof individual data values." (133) That is to say, the use ofa*ggregate statistics to judge or make a determination on an individualis often inappropriate. Though stereotyping might happen anyway, it hasnever been a goal of privacy law to prevent all forms of ignorantspeculation. Stereotyping will not go away by suppressing data. To thecontrary, data can be very useful in debunking stereotypes. (134)

3. A Data Release Can Be Useful and Safe at the Same Time

Paul Ohm argues that if data is useful to researchers, it must create a serious risk of re-identification. (135) This claim has been repeated in the national media. (136) But the assertion is erroneous. A database with just one indirect-identifying variable (such as gender) tied to non-public information (such as pharmaceutical purchases) can be tremendously valuable for a specific research question--such as: "Do women purchase drugs in proportion to the national rates of diagnosis?"--without any risk of re-identification. Ohm and the media outlets were thrown off because the technical studies they cite use a definition of data-mining utility that encompasses all possible research questions that could be probed by the original database. (137) So, for example, if race and geographic indicators are removed from the database, the utility of that database for all possible research questions plummets, even though the utility of that database for this specific research question stays intact. For specific research questions, utility and anonymity can and often do coexist.
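
The point can be illustrated with a trivial sketch (all numbers invented): a release that retains only gender alongside the non-public purchase records still answers the specific question posed above, while offering essentially nothing with which to link any row back to a person.

  # Invented numbers: a release keeping only one indirect identifier (gender) still
  # answers a specific research question about purchase patterns.
  purchases = [("F", "drug_x"), ("F", "drug_x"), ("M", "drug_x"), ("F", "drug_y")]
  national_diagnosis_share = {"F": 0.60, "M": 0.40}  # assumed external benchmark

  drug_x_buyers = [sex for sex, drug in purchases if drug == "drug_x"]
  female_share = drug_x_buyers.count("F") / len(drug_x_buyers)
  print(f"Female share of drug_x purchases: {female_share:.0%} "
        f"vs. female share of diagnoses: {national_diagnosis_share['F']:.0%}")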

4. Re-Identifying Subjects in Anonymized Data Is Not Easy

Computer scientists concerned about data privacy face the challenge of convincing the public that an adversary of low-to-moderate skill is capable of performing the same sort of attacks that they can. De-anonymization scientists often refer to the fact that their attacks can be performed on home computers using popular programs. (138) Paul Ohm makes the same rhetorical move in order to argue that we are living in the era of "easy reidentification." (139)

 The Netflix study reveals that it is startlingly easy to reidentify people in anonymized data. Although the average computer user cannot perform an inner join, most people who have taken a course in database management or worked in IT can probably replicate this research using a fast computer and widely available software like Microsoft Excel or Access. (140)

While the Netflix attack algorithm could be performed using Excel, an adversary would have to understand the theory behind the algorithm in order to know whether the dataset is a good candidate and whether matches should be rejected as potential false positives. (141) The suggestion that anybody with an IT background and a copy of Excel can do this is implausible.

The myth of easy re-identification was tested and rejected in the case of Southern Illinoisan v. Illinois Department of Public Health. (142) In that case, the plaintiff newspaper submitted a public records request to the Illinois Department of Public Health for a table containing the ZIP codes, dates of diagnosis, and types of cancer for hospital patients in the department's database. (143) The plaintiff newspaper's goal was to test whether certain forms of cancer were clustered in distinct geographic areas, (144) which would have suggested that their incidence was created or greatly exacerbated by environmental factors. (145) The government relied on the testimony of Dr. Latanya Sweeney to support its argument that granting the request would violate cancer patient privacy because the data could be de-anonymized. (146)

Dr. Sweeney's testimony about the process she used to re-identify subjects is under seal out of a fear that the opinion would create an instruction book for a true malfeasor, (147) but the description in the Illinois Supreme Court's opinion suggests that she did the following (148): She began by researching the disease of neuroblastoma--the rare form of cancer of interest to the plaintiff newspaper--in order to familiarize herself with the symptoms and treatment. (149) Next, she purchased two thousand dollars' worth of public and "semi-public" datasets, some of which required her to fill out forms and wait for processing. (150) Some of these purchased datasets (probably voter registration data) identified their subjects by name and address. (151) If Dr. Sweeney employed the same processes that she had previously used to re-identify health records, it is very likely that she linked the identifiable data to pre-HIPAA hospital discharge data that had not been anonymized (only the names had been removed) by using granular detail about the hospital patients' dates of birth, sex, and ZIP codes. (152) Since the passage of HIPAA, such information is no longer publicly available. (153) Next, Dr. Sweeney used what she learned about neuroblastoma to identify possible neuroblastoma patients in the combined purchased databases. (154) The purchased data contained some information--secondary diagnoses or prescription drug treatments perhaps--that allowed her to infer which people in the consumer databases suffered from neuroblastoma. (155) Since the purchased public data was linked to identities, she was able to use what she learned from the purchased resources to produce accurate names for most of the entries in the requested cancer registry dataset. (156)

Dr. Sweeney testified that it would be very easy for anyone to identify people in the cancer registry dataset:

 It is very easy in the following sense, all I used was commonly available PC technology ... [a]nd readily available software ... and all that was required were the simple programs of using [spreadsheets].... They come almost on every machine now days [sic] ... so they don't require you have [sic] any programming or require you to take a computer class, but they do require you to know the basics of how to use the machine and how to use those simple packages. (157)

The Illinois Supreme Court was not convinced. The court reasoned that it was Dr. Sweeney's "'knowledge, education and experience in this area' that made it possible for her to identify the Registry patients" and not merely her access to Microsoft Excel. (158) Because Dr. Sweeney used her well-honed discretion to make matches between two data sources that did not map easily onto each other, Dr. Sweeney's methods took advantage of her efforts and talents. Southern Illinoisan and the Netflix example illustrate that designing an attack algorithm that sufficiently matches multiple indirect identifiers across disparate sources of information, and assesses the chance of a false match, may require a good deal of sophistication.

5. De-Anonymized Public Data Is Not Valuable to Adversaries

The plaintiffs in Southern Illinoisan had a second objection to Dr. Sweeney's testimony: Dr. Sweeney identified neuroblastoma patients using the purchased data resources, not the dataset requested by the plaintiffs. (159) She used the requested table "only to verify her work" (160): she checked to see if the ZIP codes and diagnosis dates of her neuroblastoma candidate guesses matched the anonymous cancer registry. (161)

The requested table undoubtedly provided some value by allowing her to have more confidence in the attack algorithm. However, the added utility to an adversary in this situation, as compared to what the adversary could have done without the requested table, was very small. (162) Whether the anticipated abuse is direct marketing or mindless harassment, the identification of likely neuroblastoma patients who are adduced from the purchased datasets will do the trick. Whether the hypothetical adversary is a pharmaceutical company or an Erin Brockovich-style environmental torts firm, the adversary could direct its solicitations to the set of likely candidates derived from the purchased, non-anonymized datasets. Dr. Sweeney testified that the requested cancer registry data was the "gold standard" that allowed her to re-identify the patients with confidence, (163) but this overstates the importance of the registry data tables since, without the government's verification, an attacker could still identify the likely candidates with enough confidence for her purposes.

Similarly, Narayanan and Shmatikov overstate the harm that can flow from re-identifying subjects in the Netflix database. Narayanan and Shmatikov explain that their algorithm works best when the movies reviewed on IMDb are less popular films. (164) The authors go into vivid detail in describing the movies that their two re-identified subjects rated in the Netflix database and draw absurd conclusions from them. (165) But they provide no information about the movies that the targets had freely chosen to rate publicly on IMDb using their real names--that is, the information that Narayanan and Shmatikov already knew before re-identifying them in the Netflix data. This information is crucial for understanding the marginal utility to putative adversaries. The inferences that are being drawn from the Netflix ratings--that they reveal political affiliation, sexual orientation, or, as the complaint for a recent lawsuit against Netflix alleges, "personal struggles with issues such as domestic violence, adultery, alcoholism, or substance abuse" (166)--can be drawn just as easily from the set of movies that the target had publicly rated in the first place. If the adversary already knows five or six movies that the target has watched, that knowledge can go a long way toward pigeonholing and making assumptions about the target. (167)

Of course, it is possible that a public data release could provide a great deal of extra information that would be valuable to a malfeasor. (168) But too often the marginal value is assumed to be very high without any effort to compare the privacy risks after data release to the risks that exist irrespective of the data release. (169) More to the point, the accretion problem described by Paul Ohm--the prediction that increasing quantities of anonymized data will make re-identification of a rich data profile of us all the more possible (170)--is likely to be overshadowed by the accretion of identified data. Given the data mining opportunities available on identifiable information from companies like LexisNexis and Acxiom, which aggregate identified information from private insurance and credit companies as well as public records, (171) it is highly unlikely that an adversary will find it worth his time to learn the Shannon entropy formula so that he can apply the Netflix algorithm.

IV. THE SKY IS NOT FALLING: THE REALISTIC RISKS OF PUBLIC DATA

The previous Part provided evidence that the focus of influential computer science literature is preternaturally consumed by hypothetical risks. (172) Unfortunately, legal scholars have taken up the refrain and have come to equally alarmist conclusions about the current state of data sharing.

In considering a public-use dataset's disclosure risk, data archivists focus on marginal risks--that is, the increase in the risk of disclosure of identifiable information compared to the pre-existing risks independent of the data release. (173) Just as the disclosure risk of a data release is never zero, the pre-existing risk to data subjects irrespective of the data release is also never zero. There are always other possible means for the protected information to become public unintentionally. How much marginal risk does a public research database create in comparison to the background risks we already endure? (174)
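To fix ideas, the archivists' framing can be written as a simple difference in probabilities (the notation is my own shorthand for the point in the text, not a formula drawn from the cited literature):

\[
R_{\text{marginal}} = \Pr(\text{disclosure} \mid \text{release}) - \Pr(\text{disclosure} \mid \text{no release}),
\]

where the second term is the background risk that the same information escapes through breaches, public records, or gossip whether or not the research dataset is ever published. The question posed above is whether this difference is substantial or trivially small.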

This Part assesses the realistic risks posed by the data commons. It lays out the frequency of improper anonymization and analyzes the likelihood that adversaries would choose re-identification as their means to access private information. The unavoidable conclusion is that contemporary privacy risks have little to do with anonymized research data.

A. Defective Anonymization

How often are public datasets released without proper anonymization? In other words, how often do data producers remove direct identifiers only, without taking the additional step of checking for subgroup sizes among indirect identifiers or without considering the discoverability of the sampling frame?

Paul Ohm discusses two high-profile examples: Massachusetts hospital data that failed to sufficiently cluster the indirect identifiers, and the AOL search query data that failed to remove last names. (175) The AOL lapse led two journalists at the New York Times to re-identify Thelma Arnold, who shared the spotlight with her search phrase "dog that urinates on everything." (176) Ohm argues that vulnerable public datasets with weak anonymization must be legion. (177) If sophisticated organizations like the Massachusetts Group Insurance Commission and AOL are not getting it right, what could we expect from a local agency? (178)

This concern has merit. A systematic study of disclosures made pursuant to the federal No Child Left Behind Act supports Ohm's intuition. The authors, Krish Muralidhar and Rathindra Sarathy, audited publicly available accountability data from several states to see whether the tabulations allow data users to glean PII. (179) While all of the states attempted to implement anonymization protocols, they all got it wrong one way or another. (180) Large repeat players in the data commons like the University of Michigan's Inter-university Consortium for Political and Social Research ("ICPSR") or the U.S. Census Bureau do not make these rookie mistakes, and often use data-swapping and noise-adding techniques for an additional level of security. (181) But the data commons no doubt contains some inadequately anonymized datasets that have not undergone best practices. This is almost certainly due to the abysmal state of the guidance provided by regulatory agencies and decisional law. There has not yet been a clear and theoretically sound pronouncement about the steps a data producer should take to reduce the risk of re-identification. I address this problem in Part V. For reasons I will elaborate on now, the risks imposed on data subjects by datasets that do go through adequate anonymization procedures are trivially small.

B. The Probability that Adversaries Exist

The "adversary" or "intruder" from the computer science literature is a mythical creature, the chimera of privacy policy. There is only a single known instance of de-anonymization for a purpose other than the demonstration of privacy risk, (182) and no known instances of a re-identification for the purpose of exploiting or humiliating the data subject. The Census Bureau has not had any known instances of data abuse, nor has the National Center for Education Statistics. (183)

This is not surprising, because the marginal value of the information in a public dataset is usually too low to justify the effort for an intruder. The quantity of information available in the data commons is outpaced by the growth in information self-publicized on the Internet or collected for commercially available consumer data. Consumer data catalogs boast that businesses can "choose [an] audience by their ailments & medications." (184)

Unfortunately, privacy advocates routinely fail to report the dearth of known re-identification attacks. (185) Instead, scenarios of re-identification and public humiliation are held up like Desdemona's handkerchief, inspiring suspicion and fear for which we have, as yet, no evidence. As Paul Ohm says,

 Almost every person in the developed world can be linked to at least one fact in a computer database that an adversary could use for blackmail, discrimination, harassment, or financial or identity theft. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm. Perhaps it is a fact about past conduct, health, or family shame. For almost every one of us, then, we can assume a hypothetical database of ruin, the one containing this fact but until now splintered across dozens of databases on computers around the world, and thus disconnected from our identity. Reidentification has formed the database of ruin and given our worst enemies access to it. (186)

Ohm speaks in the present tense; he suggests the database of ruin has arrived.

It is possible that intruders are keeping their operations clandestine, reverse-engineering our public datasets without detection. But this conviction should not be embraced too quickly. Other forms of data-privacy abuse that ought to be difficult to detect have nevertheless come to light due to whistleblowing and sleuthing. (187) Paul Syverson suggests that we could test the hypothesis of covert re-identification by comparing the incidence of identity theft to behaviors or characteristics in accessible datasets to see if there is a correlation that might suggest these data subjects were re-identified at some point. (188) This experiment is worthwhile, but the available aggregate data suggests there is no such relationship. Identity theft plateaued between 2003 and 2009 and dropped to its lowest recorded level in 2010. (189) Moreover, the largest category of identity fraud schemes involves "friendly fraud"--fraudulent impersonation committed by people who know the victim personally (such as a roommate or relative)--and this category has grown in proportion while the other categories declined. (190) These statistics contradict the position that we are inching ever closer to our digital ruination.

As with any default hypothesis, the best starting point for privacy policy is to assume that re-identification does not happen until we have evidence that it does. Because there is lower-hanging fruit for the identity thief and the behavioral marketer--blog posts to be scraped and consumer databases to be purchased--the thought that these personae non gratae are running sophisticated de-anonymization algorithms is implausible.

C. Scale of the Risk of Re-Identification in Comparison to Other Tolerated Risks

Privacy risks are difficult to measure and understand--to feel at a gut level. (191) One useful heuristic for comprehending the privacy risks of public anonymized data is to compare those risks to other privacy risks that we know and tolerate.

Our trash is a rich and highly accessible source of private information about us--indeed, it continues to have the distinction of being a tremendously valuable resource for private investigators and identity thieves. (192) Data presents no more risk (and often less risk) than our garbage. Thomas Lenard and Paul Rubin have noted that breach notification requirements and other warnings about the privacy hazards of conducting business online could lead consumers to conduct business offline and demand paper statements. Ironically, this result would greatly increase the likelihood of identity theft. (193)

Moreover, consider the large quantity of sensitive personally identifiable information available in public records. Income information, thought to be among the most sensitive categories of information, (194) is available for most public employees. (195) The names and salaries of the highest-paid employees in California are tracked on the Sacramento Bee's website. (196) Litigants and witnesses in lawsuits are often forced to divulge personal information and face embarrassing accusations, and juror identities and questionnaire responses are usually within the public domain. (197) We accept these types of exposures because the countervailing interests--ensuring transparency and accountability in state action--warrant it. The Constitution protects these types of disclosures through a robust set of First Amendment precedents, and the tradeoffs in terms of privacy invasions have proven to be bearable to society. (198)

The closest cousin to the malicious de-anonymizer is the hacker. This type of adversary certainly exists. If we are to imagine a skilled computer programmer determined to find out a target's secrets, is it not easier to imagine him just hacking into the target's personal computer? This, after all, was HBGary Federal's modus operandi when it offered to do the dirty work for Bank of America, corporate law firms, and their clients. (199) HBGary Federal planned to create extensive dossiers of rivals or critics for the purpose of forming smear campaigns. (200) When HBGary Federal proposed to make a dossier on members of U.S. Chamber Watch, a consumer watchdog organization, its plans included identifying vulnerabilities in the targets' computer networks that could be exploited. (201) HBGary Federal responded to the incentives to engage in unethical and illegal behavior to garner the favor of its clients. Yet, it is difficult to imagine that HBGary's agenda would ever include re-identifying its targets in public-use anonymized datasets. The alternative approaches are so much easier.

A malfeasor with no specific target in mind is still better off using hacking techniques rather than de-anonymization algorithms. That is what hackers did to expose 236,000 mammography patient records at the University of North Carolina School of Medicine, (202) 160,000 health records for University of California students, (203) and 8,000,000 records in the Virginia Prescription Monitoring Program (for which the hackers sought a $10 million ransom). (204) These sorts of hacks require significantly less skill than the de-anonymization of a research dataset because malware capable of exploiting bugs in popular programs and operating systems is sold on the black market to whoever is unethical enough to use it. (205) The programs require little to no customization because they apply malicious code to popular programs that all suffer from identical vulnerabilities. (206) De-anonymization algorithms, in contrast, require a theoretical understanding of the algorithm in order to tailor the attack to a particular dataset. (207)

Data spills--the mishandling of unencrypted data--provide another point of comparison for the risk of re-identification. These spills typically expose the personally identifiable information of customers or patients. In the last couple of years the medical records of 7.8 million people have been exposed in various sorts of security breaches. (208) The spills are often the result of improper handling by employees who were authorized to access the information. For example, Massachusetts General Hospital recently agreed to pay a one-million-dollar fine after one of its employees lost the records of 192 patients on the subway, many of whom had HIV/AIDS. (209) So the question for our purposes is this: if we are to fear users of public anonymized datasets, why do we tolerate the handling of our personal information by minimally paid, unskilled data processors? (210) (Indeed, some companies have used prison labor to perform data entry. (211))

The intuitive answer is that data has become the lifeblood of our economy. It is more rational to spread risk among all the consumers and modify data handling behavior through fines and sanctions than it is to expect consumers to forgo the convenience and customized service of the information economy. (212) It is puzzling, then, why privacy advocates have chosen to target anonymized research data--data that poses relatively low risk to the citizenry and offers valuable public-interest-motivated research in return--as a cause worthy of preemptive strike. (213)

V. A PROPOSAL IN THE STATE OF HIGHLY UNLIKELY RISK

The fractured set of privacy statutes and rules in the United States generally requires data producers to refrain from releasing data that can be used to re-identify a data subject. (214) A great limitation of current U.S. privacy law--a limitation that runs against the interests of the data subjects and researchers alike--is that privacy law regulates the release of data rather than its use. (215) Privacy law does not prohibit an end-user from re-identifying somebody in a public-use dataset. Rather, the laws and statutory schemes act exclusively on the releaser. (216) In many respects, the current approach to data privacy is dissatisfying to the full range of affected parties, and we are beginning to see an influx of new proposals.

The most popular suggestions for altering data privacy laws differ in their particulars, but they invariably impose large transaction costs on research, if they do not preclude it altogether. The FTC's recently unveiled framework for consumer data advises companies not to distinguish between anonymized and personally identifiable data, which means that anonymized research data must be subjected to the exact same limitations imposed on the collection and use of identifiable data. (217) This vision bars private companies from participating in the data commons, since a public release of research data would be treated the same as a security breach or a spill of identifiable data. The FTC's framework borrows from the European Data Protection Directive, which requires the unambiguous consent of data subjects before personal data can be processed into statistical research data. (218) If the FTC framework is a harbinger of what is to come, the data commons is in real trouble. (219)

Paul Ohm and Daniel Solove propose "contextual" privacy regulations to bring legal liability in line with the risk that the data producer has created. (220) Ohm suggests that a data releaser should consider all the determinants of re-identification risk and assess whether a threat to the data subject exists. (221) While this solution has natural appeal as a levelheaded approach, a loose case-by-case standard will provide little guidance and assurance for data producers. In fact, existing statutes already implement the bulk of the suggestions Ohm puts forward. HIPAA regulations, for example, instruct data producers to remove any information that, in context, might lead to the re-identification of a data subject, and they scrutinize public releases much more severely, while giving agencies and firms wide latitude when drawing up licenses with business associates. (222) But for the reasons detailed in Part II, these standards are encouraging over-protectionism and providing agencies with an evasion tactic. Moreover, licensing processes impose transaction costs on researchers that are not justified by the speculative risks of re-identification.

I propose something altogether different: simple, easy-to-apply rules. (223) My policy has three aspects to its design: (1) it clarifies what a data producer is expected to do in order to anonymize a dataset and avoid the dissemination of legally cognizable PII; (2) it immunizes the data producer from privacy-related liability if the anonymization protocols are properly implemented; and (3) it punishes with harsh criminal penalties any recipient of anonymized data who re-identifies a subject in the dataset for an improper purpose. I will describe each of these aspects in more detail and explain why the proposed approach offers an improvement over current laws and regulations.

A. Anonymizing Data

Under my approach, a data producer is required to do just two things in order to convert personally identifiable data into anonymized (non-PII) data: (1) strip all direct identifiers, and (2) either check for minimum subgroup sizes on a preset list of common indirect identifiers--such as race, sex, geographic indicators, and other indirect identifiers commonly found in public records--or use an effective random sampling frame.

(1) Stripping Direct Identifiers. The removal of direct identifiers (name, telephone number, address, social security number, IP addresses, biometric identifiers like fingerprints, and any other unique identifying descriptor) is an obvious first step, but one that should not go without comment. After all, this critical oversight led to the re-identification of a data subject in the AOL search term database. (224) Remarkably, the privacy community and even the FTC have held this up as a key exemplar for the proposition that there is no viable way to adequately anonymize data anymore. (225) In fact, the AOL story is an example of a lack of anonymization.

(2) Basic Risk Assessment. My next step requires the data producer either to count the minimum subgroup sizes or to confirm that the dataset has an unknown sampling frame. Neither of these is conceptually difficult.

Minimum Subgroup Count--This ensures that no combination of indirect identifiers yields fewer than a certain threshold number of observations (usually between three and ten). For the purpose of this Article I will use five. (226) This is known as "k-anonymity" in the computer science literature. (227) Suppose a college wishes to release a public-use version of its grades database. If there are only two Asian female chemistry majors in the cohort of students that entered in 2010, then the school should not release a dataset that includes race, gender, major, and cohort year unless it first blurs together some of these categories. The college might choose to lump several majors together into clusters, or lump cohort years into bands spanning five years. There are a number of ways to blur the categories such that minimum subgroup counts stay above the required threshold (see the sketch following this discussion). Indirect identifiers are limited to categories of information that are publicly available for all or most of the data subjects--e.g., age, gender, race, and geographic location. They do not include information that is not systematically compiled and distributed by third parties. (228)

Unknown Sampling Frame--If a public data user has no basis for knowing whether an individual is in the universe of people described in the dataset, then the dataset does not--and cannot--disclose PII. An unknown sampling frame is a powerful tool for anonymizing data, and large statistical bureaus (such as the U.S. Census Bureau) often employ it when they collect information on a random sample of Americans. (229) Thus, if the Bureau of Labor Statistics produces a dataset that includes only one veterinarian in Delaware, we need not be concerned unless there is some way to know which of the many veterinarians in Delaware the dataset is describing. If the sampling frame is unknown, then the minimum subgroup count and extremity-coding rules need not apply. (230) But precautions must be taken to ensure that an outsider really cannot discern whether the sample includes a particular individual. (231)
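For readers who want to see what the two required steps look like in practice, the following sketch, written in Python with the pandas library, strips the direct identifiers and then checks the minimum subgroup count, blurring the cohort year into five-year bands if any cell falls short. The column names and the list of identifiers are illustrative assumptions of mine, not statutory requirements; the threshold of five follows the example used in this Article.

import pandas as pd

DIRECT_IDENTIFIERS = ["name", "ssn", "address", "phone", "ip_address"]   # illustrative
INDIRECT_IDENTIFIERS = ["race", "sex", "major", "cohort_year"]           # illustrative
K = 5  # minimum subgroup size used in this Article's examples

def strip_direct_identifiers(df: pd.DataFrame) -> pd.DataFrame:
    # Step (1): drop every column that names or uniquely tags a data subject.
    return df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])

def smallest_subgroup(df: pd.DataFrame) -> int:
    # Size of the smallest cell formed by crossing all indirect identifiers.
    return int(df.groupby(INDIRECT_IDENTIFIERS).size().min())

def blur_cohort_years(df: pd.DataFrame, width: int = 5) -> pd.DataFrame:
    # One way to enlarge small cells: collapse cohort years into five-year bands.
    out = df.copy()
    out["cohort_year"] = (out["cohort_year"] // width) * width
    return out

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    released = strip_direct_identifiers(df)
    if smallest_subgroup(released) < K:
        released = blur_cohort_years(released)  # or blur whichever identifier helps most
    # If blurring one identifier is not enough, further clustering or suppression is needed.
    assert smallest_subgroup(released) >= K, "further blurring required before release"
    return released

A release that passes the final check satisfies the five-observation floor; as the text explains, a dataset drawn through a genuinely unknown sampling frame can skip the subgroup check entirely.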

If either of these protocols is properly implemented, the dataset would be legally recognized as anonymized non-PII data. To be clear, this standard is less onerous than current state and federal laws like HIPAA. This is by design. While my proposal diverges sharply from others', it flows naturally from the assertion, supported earlier in this Article, that the risk of re-identification is not significant. Nevertheless, agencies and organizations that work with data frequently enough to have Institutional Review Boards should continue to use heightened standards determined by current best practices. (232) The procedures described above set an appropriate floor, and need not be interpreted as a ceiling.

Freeing up the flow of data will enrich the proverbial marketplace of ideas. In the past, the simplified process of stripping obvious identifiers was legally sufficient to protect an individual's privacy. (233) We have drifted into protecting against more and more intricate attacks without having experienced any of them. Moreover, some of the more complex disclosure-risk avoidance techniques (such as data-swapping or noise-adding) have gone awry. The U.S. Census Bureau's public-use microdata samples ("PUMS files") from the 2000 census contain substantial errors in the reporting of age and gender that have affected analyses for a decade's worth of research. (234)

B. Safe Harbor for Anonymized Data

If a data producer follows the anonymization protocols, it will be shielded from liability based on privacy torts, certain types of contractual liability, and federal statutory penalties defined by privacy statutes like HIPAA. The anonymization protocols would also take the data out of the ambit of privacy exemptions in public records statutes (meaning that government agencies legally obligated to disclose information through public records laws could not make use of the privacy exemption if a useful dataset could be produced using the anonymization procedures described above). With the exception of contractual liability, on which I elaborate below, the scope of this safe harbor provision is fairly predictable.

The safe harbor provision protects data producers from liability based on confidentiality agreements unless the confidentiality agreement explicitly prohibits the dissemination of all information, whether or not it is in identifiable form, to any unnamed third parties. To be clear, if the firm collecting data reserves the right to share information with a third party in the private agreement, anonymized data will not violate the confidentiality agreement. The reason for structuring the safe harbor provision this way is to prevent the very likely scenario in which a company wishes to profit from the information it collects by sharing it with marketers or business partners, while simultaneously having a consumer-friendly-sounding excuse for shielding anonymized data from researchers who might use the data to uncover fraud or discrimination. Of course, nothing in this scheme obligates an organization to share anonymized research data, but it does remove the fig leaf--the pretense of sensitivity--when data is shared for marketing and business purposes.

Immunity is bold, but it is not unusual for the law to go to great lengths to bolster the public's interest in information. Courts have been especially protective of the First Amendment right to disseminate truthful information of public concern. (235) In the context of undercover journalism, scholars and lawmakers have concluded that the public interest in unearthing information justifies immunity from tort liability, even when journalists employ deceptive newsgathering practices. (236) C. Thomas Dienes notes that "[i]n the private sector, when the government fails in its responsibility to protect the public against fraudulent and unethical business and professional practices, whether because of lack of resources or unwillingness, media exposure of such practices can and often does provide the spur forcing government action." (237) Likewise, Erwin Chemerinsky defends paparazzi-style journalism by reminding the academy:

 Speech is protected because it matters in people's lives, and aggressive newsgathering is often crucial to obtaining the information. The very notion of a marketplace of ideas rests on the availability of information. ... People on their own cannot expose unhealthy practices in supermarkets or fraud by telemarketers or unnecessary surgery by doctors. But the media can expose this, if it is allowed the tools to do so, and the public directly benefits from the reporting. (238)

Undeniably, the data commons is one of these tools. It provides invaluable probative power that cannot be matched by anecdote or concentrated theorizing, and the risk of re-identification is relatively small compared to the informational value.

C. Criminal Penalties for Data Abuse

Finally, the safe harbor must be buttressed by a statute that criminalizes and stiffly punishes the improper re-identification of subjects within a properly anonymized dataset. (239) Criminal liability attaches the instant an adversary discloses the identity and a piece of nonpublic information to one other person who is not the data producer. (240) First, this design avoids unintentionally criminalizing disclosure-risk research--research that can usefully identify vulnerabilities in anonymized datasets. This sort of information will be invaluable to data producers and regulators if an attack seems likely to be replicated by a true malfeasor. De-anonymization scientists will be able to continue publishing their work with impunity. Second, this design avoids the possibility of innocent technical violations by requiring an overt, malicious act--disclosing a non-public piece of information to one other person. (241)

Current privacy statutes leave a blatant gap in coverage: they do not restrain an adversary from re-identifying a subject. To address this, the Institute of Medicine of the National Academies has proposed legal sanctions for re-identification, (242) and Robert Gellman has proposed a system of data sharing through uniform licensing agreements that protect against the re-identification of data subjects using criminal and civil sanctions. (243) In fact, much of the public research data available to researchers today requires the execution of data license agreements prohibiting re-identification and requiring the research staff to ensure the security of the data. (244) A federal criminal statute would provide uniform protection for all data subjects, and would reduce transaction costs between data users and data producers by making contractual promises of this sort unnecessary.

The criminal penalty is particularly important when a dataset has been properly anonymized, but an adversary decides to target a specific data subject about whom the adversary has special information. Take the following example, which comes from the Department of Education's commentary on the 2009 revisions of the FERPA regulations:

 [I]f it is generally known in the school community that a particular student is HIV-positive ... then the school could not reveal that the only HIV-positive student in the school was suspended. However, if it is not generally known or obvious that there is an HIV-positive student in school, then the same information could be released, even though someone with special knowledge of the student's status as HIV-positive would be able to identify the student and learn that he or she had been suspended. (245)

Likewise, someone with special knowledge about the circumstances of a particular student's suspension could use that information to discern that he or she is HIV-positive. While the student might have civil recourse if the adversary publicizes this fact and causes sufficient harm, (246) nothing in FERPA's design outlaws the adversary's acts in re-identifying the student in the first place. The heavy hand of the prosecutor is an appropriate means for enforcing the ethics of the data commons.

Though detection and enforcement of this provision would no doubt be very difficult, this does not mean that retributive disincentives have no effect. People and firms often overreact to improbable but unknown risks of criminal sanction. (247) Moreover, one major motivation for my proposal is the understanding that re-identification is unlikely to happen. Thus, the criminal element of this data privacy scheme is, by design, expressive, and likely to operate more as a disincentive than as a penalty actually imposed by courts.

D. Objections

The objection to my framework is simple: What if I am wrong? By the time we realize that anonymization can be undone, it is too late! Ohm's contention is that data that cannot re-identify us today will be capable of doing so tomorrow. (248) We need urgent action because we are laying the groundwork for the "database of ruin." (249) This argument bears a remarkable resemblance to fears about the introduction of computers into the federal government in the 1960s. The statement of Representative Cornelius E. Gallagher of New Jersey before the Committee on Government Operations is typical of these fears:

 Nor do we wish to see a composite picture of an individual recorded in a single informational warehouse, where the touch of a button would assemble all the governmental information about the person since his birth.... Although the personal data bank apparently has not been proposed as yet, many people view this proposal as a first step toward its creation. ... We cannot be certain that such dossiers would always be used by benevolent people for benevolent purposes. (250)

Anxieties over potential abuse of new information technologies are a hardy perennial. (251) Today, the threatening technology is the Internet. While the Internet certainly increases the risk of re-identification, and while producers of anonymized data should be cognizant of new and rich collections of auxiliary information available to a malicious intruder, the additional risk is not as great as it might seem. Remember that, in order to re-identify a subject in a dataset, an adversary must be confident that a unique data subject matches a unique member of the general population. (252) Suppose an anonymized prescription dataset described a fifty-year-old woman in central Vermont who is taking pharmaceutical drugs to treat depression and high cholesterol. An adversary comes across a LiveJournal blog post by a woman who identifies herself, reveals that she is fifty years old and living in Montpelier, and describes her experience on Lipitor. (253) The adversary has stumbled upon a likely candidate to match up to the anonymized data subject, and if he is right, he will have learned that the blogger is also clinically depressed. But in order to be confident in the match, he must have some reason to believe that this is the only fifty-year-old woman in central Vermont using a cholesterol-lowering drug. The Internet provides a lot of information about a lot of people, but it is not a source of comprehensive and systematic information, so it is a flawed tool for the malicious intruder. At best, the adversary might be able to use some statistical source of medical treatment rates to estimate the likelihood that the Montpelier woman is unique.
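A rough calculation shows why such confidence is hard to come by. Using purely hypothetical figures, chosen only for illustration and not drawn from any census or prescribing data, suppose central Vermont is home to about 1,000 fifty-year-old women and that roughly one in ten takes a cholesterol-lowering drug:

\[
\mathbb{E}[\text{candidates}] \approx 1{,}000 \times 0.10 = 100,
\]

so the blogger would be one of roughly one hundred plausible matches, and, absent further information, the adversary's confidence in the link is on the order of one in one hundred.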

Ohm and other critics of anonymization believe that once adversaries are able to sync up one anonymized database to identities, they will be able to match the combined database to a third anonymous database, and then a fourth, et cetera, until a complete profile is built. (254) This threat is premised on perfect matching attacks that contain no false matching error. If a re-identification attack is assumed to have error (which it most certainly will in the absence of a complete population registry of some sort), then the quality of the dossier will be so poor as to undermine its threat. Even in the unlikely scenario where each re-identification attack contains only a ten percent false match rate, twenty-seven percent of the observations in the combined dataset will likely contain errors. (255)
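That figure is consistent with compounding the assumed ten percent false-match rate across the three successive, independent linkages needed to chain four databases together (identities to the first database, then to a third, then to a fourth):

\[
1 - (1 - 0.10)^3 = 1 - 0.729 = 0.271 \approx 27\%.
\]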

Even ignoring the snowballing error rates, the value to an adversary of anonymized data erodes over time. If adversaries are able and willing to make entropic re-identification attacks in the future, anonymized data from today will have vanishing value as time trots on, for two reasons. First, people's attributes change, so making matches will be increasingly hard and subject to false positives and false negatives. Studies on databases that are known to cover the same population are, in fact, frequently difficult to match up because the subjects' contemporaneous responses to the same or similar questions are often incompatible. (256) And since the profiles used to make the match will likely be riddled with error, matching to old data will often fail. (257) Second, even if a successful match is made and is verifiable, there will be less intrinsic value to knowing old attributes. No matter what the adversary's bad motives are, the value of old data (again, its marginal utility) decreases with time.

Privacy advocates tend to take on the role of doom prophets--their predictions of troubles are ahead of their time. (258) Convinced of the inevitability of the harms, privacy scholars are dissatisfied with reactive or adaptive regulation and insist on taking prospective, preemptive action. (259) Dull as it is, reactive legislation is the most appropriate course for anonymized research data. Legislation inhibiting the dissemination of research data would have guaranteed drawbacks today for the research community and for society at large. We should find out whether re-identification risk materializes before taking such drastic measures.

E. Improving the Status Quo

In the meantime, we would do well to clean up the muddled state of the PII-based privacy system currently in place. Right now, case law and regulatory guidance are so reluctant to commit to a protocol that data producers cannot be sure what is expected of them.

The regulatory goal of a PII-based privacy statute is quite straightforward: a data user should not be able to learn something new about a data subject using publicly available auxiliary information. Direct identifiers are removed, of course, and some additional precautions are often required. The mandates of current privacy statutes can be met using what I will refer to as the "Four Key Principles" of PII-based anonymization. These principles are not beyond the capabilities of a FOIA officer at a public agency:

(1) Unknown Sampling Frame--If the data producer is confident that data users cannot use public information to determine whether somebody is in the dataset or not, the other precautions described in this section need not be taken. (260)

(2) Minimum Subgroup Count--This concept is incorporated into my proposal above: the data producer ensures that no combination of indirect identifiers yields fewer than a certain threshold number of observations. The data producer must use good judgment in categorizing the variables as indirect identifiers or non-identifiers. (261)

(3) Extremity-Redacting--Data producers can redact the highest or lowest value of sensitive continuous variables (e.g., income or test scores) within each subgroup if they are concerned that an adversary would be able to draw conclusions about the maximum (or minimum) value for a whole subgroup. To understand the risk this approach averts, suppose a school wishes to release a dataset containing the race, gender, and grade point average ("GPA") of its students. Suppose also that all white females at the school earned GPAs lower than 3.0. An adversary could use the database to learn that a particular white female (indeed, any white female) had a GPA below 3.0. Thus, even though the adversary cannot re-identify a particular line of data, he has learned something new and sensitive about each individual white female. If the school had redacted the highest GPA within each race-gender subgroup and replaced it with a random alphanumeric symbol, the adversary would no longer know the upper bound of the white females' (or any other group's) GPAs (see the sketch following this list). (262)

(4) Monitoring Future Overlapping Data Releases--Finally, a data producer must ensure that it will not disclose two datasets covering the same population that can be linked through non-identifiers. Building on the race, gender, and high school GPA database example in the last paragraph, suppose the same school released a second dataset providing high school GPA and ZIP code. On its own, the second dataset seems perfectly innocuous. But any observation with a unique GPA (most likely at the bottom or top of the GPA distribution) could be linked to the first database. By doing so, an adversary can learn the race, gender, ZIP code, and GPA for those observations. This greatly increases the chance of re-identification (the sketch below also illustrates the linkage). (263)
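The following sketch, again in Python with pandas and with column names invented for illustration, shows principles (3) and (4) in miniature: the first function blanks the top value of a sensitive variable within each subgroup, and the second flags the rows of two separate releases that can be joined through a value, such as an unusual GPA, that is unique in both.

import pandas as pd

def redact_subgroup_maximum(df: pd.DataFrame, group_cols, value_col) -> pd.DataFrame:
    # Principle (3): blank the highest value of the sensitive variable in each
    # subgroup so the release no longer reveals that subgroup's upper bound.
    out = df.copy()
    top_rows = out.groupby(group_cols)[value_col].idxmax()
    out.loc[top_rows, value_col] = float("nan")  # the text suggests a random symbol instead
    return out

def linkable_rows(release_a: pd.DataFrame, release_b: pd.DataFrame, key: str) -> pd.DataFrame:
    # Principle (4): a value of `key` (e.g., GPA) that appears exactly once in each
    # release lets an adversary join the two tables and combine their columns.
    unique_a = release_a[~release_a.duplicated(subset=key, keep=False)]
    unique_b = release_b[~release_b.duplicated(subset=key, keep=False)]
    return unique_a.merge(unique_b, on=key)  # any rows returned here are re-linkable

# Example: if release_1 holds race, gender, and GPA while release_2 holds GPA and ZIP code,
# linkable_rows(release_1, release_2, "gpa") reveals race, gender, GPA, and ZIP together.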

The theoretical concepts required to create a low-risk public dataset are not difficult when they are explained clearly and deliberately. But to this point, the judiciary has had great difficulty reasoning through and applying anonymization concepts in a principled, replicable way. The case law often contradicts itself and establishes ad hoc rules that are under- or over-protective. Even when a case reaches the correct outcome, the analysis is often incomplete or inarticulate in its reasoning.

Consider the opinion from Fish v. Dallas Independent School District, discussed at length in Part II. Though the opinion applies the PII framework and properly finds that the requested dataset would run afoul of FERPA, the opinion uses flawed reasoning. The court focuses on the fact that one expert witness was able to use publicly available information to trace the identities of 550 of the Dallas students at one of the elementary schools in "less than one minute." (264) Processing speeds bear no relation to the relative ease or difficulty of re-identifying a person in a dataset. It is the discretionary decision making that comes before the computation--the skill and special information (if any) known by the human writing the attack code--that determines whether a dataset is at risk of re-identification or not.

Other cases do worse by mechanically applying statistical rules in inappropriate circumstances. (265) Consider the case of Long v. IRS. (266) At the trial level, the plaintiff succeeded in enforcing an old consent decree that required the Internal Revenue Service ("IRS") to release statistical reports to the plaintiff and to the public at large. (267) The issue in the case was whether one particular table that reported the number of hours spent auditing tax returns and the additional tax dollars collected through those audits violated the privacy rights of the audited taxpayers. (268) The statistics were broken down according to type of tax return, industry, and the income level of the audited taxpayer. (269)

The IRS argued that the table violated taxpayer privacy because it contained "cells of one"--cells that described a single audited taxpayer. (270) In other words, the IRS argued that the table would violate the principle of minimum subgroup size. The plaintiff countered by arguing that "a reader would not be able to identify the taxpayer unless he already knew that the taxpayer had been audited in the relevant time period." (271) That is to say, the plaintiff was arguing that the table had an unknown sampling frame, so that, in the absence of special information, an adversary would not know who was audited, and thus could not know who was being described in the table. So, even if the table reported the audit outcome for just one medical doctor, an adversary would not be able to determine which of the country's many medical doctors had been audited. The IRS responded that publicly available information, such as press releases or public Securities and Exchange Commission ("SEC") filings, could be used to determine the identities of some taxpayers in the sampling frame. (272) The trial court found that the IRS's position was "speculative at best," and noted that the government had provided no evidence to support its claim that a cell of one could be combined with public information to identify a taxpayer. (273) The district court properly focused on whether the sampling frame was sufficiently unknown and made a factual determination in the plaintiff's favor.

The Ninth Circuit reversed and left an illogical and unsound precedent in its wake. First, the appellate court mischaracterized the district court's opinion, claiming that the lower court had considered the table to be effectively anonymized once direct identifiers had been removed. (274) Having constructed this straw man, the appellate court went too far in knocking it down: "[W]e hold that tax data that starts out as confidential return information associated with a particular taxpayer maintains that status when it appears unaltered in a tabulation with only the identifying information removed." (275) The court determined that cells of two, on the other hand, do not implicate privacy concerns. (276) The Ninth Circuit has created a test (no cells of one) that will be over- and under-inclusive in targeting re-identification risk. The court applies a threshold that is too low for minimum subgroup size (two, as compared to the standard thresholds of three or more) without any regard for the protective power of the unknown sampling frame.

The unknown sampling frame principle is at the root of much confusion in U.S. privacy policy. (277) Government agencies tasked with providing guidance to data producers have bungled their efforts in this regard. For example, in discussing cell size limitations, Working Paper No. 22--a guideline for federal data disclosures--provides the following as an illustration of an aggregated statistical table with disclosure risk:

The highlighted cells are supposedly problematic because they contain fewer than five respondents. (279) But the Federal Committee on Statistical Methodology ("FCSM") mindlessly applied the minimum subgroup count rule without grounding it in a principled theory. It is true that only one of the delinquent children lives in Alpha County with a medium-educated head of household. But that delinquent child is not in danger of being re-identified. An adversary has no way of knowing who is in this sample of delinquent children unless the adversary already knows the child is delinquent. Knowing that some child lives in Alpha County with a medium-educated head of household also tells the adversary nothing about whether that child is delinquent, because he cannot determine whether that child is in the sample. If the adversary did know that some particular target is in the sample, he would already know the most potentially harmful information about the target: that the target is a delinquent child. (280) When the conditions of an unknown sampling frame are met, the cell sizes have no relation to the hypothetical abuses that could flow from tabular data. (281)

In another brief, FCSM suggests that the problem with small cells in a simple frequency table like this is that anyone privy to the information about one of the data subjects is more likely to be able to identify the other people described in the same, small cell. (282) This suggestion may sound reasonable, but it does not logically follow. Consider the parents of a delinquent child, who know without ambiguity where their child falls in the frequency table. Even if their child were one of the two delinquent children from Gamma County with a head of household with very high education, those parents could not learn anything about the identity of the other delinquent child unless they already knew the county and education level associated with that child (in which case, they would know all there is to know).

My criticism of this exemplar table is not meant to imply that tables of aggregated information cannot breach privacy. They can and they have. The following table reports pass rates for the No Child Left Behind Exit Exam for a single high school in California. This table shows how the results were reported in public documents by the California Department of Education.

This table violates privacy by revealing the math test results with certainty for female and white students, despite the school district's effort to redact results for cells smaller than ten by replacing the number with "n/a." (284)
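One mechanism by which suppression of this kind can fail, offered here as a hypothetical illustration with invented numbers rather than a reconstruction of the California table, is that a redacted cell can be recovered by subtraction whenever the overall totals and the complementary cells are published:

# Hypothetical reported counts for a single exam administration (invented numbers).
reported = {
    "all_tested": 180, "all_passed": 162,
    "male_tested": 172, "male_passed": 154,
    "female_tested": None, "female_passed": None,  # suppressed as "n/a": cell under ten
}

# The suppressed female cells fall out of simple subtraction.
female_tested = reported["all_tested"] - reported["male_tested"]   # 8
female_passed = reported["all_passed"] - reported["male_passed"]   # 8
print(female_tested, female_passed)  # every tested female student is revealed to have passed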

Blame for deficient anonymization does not reside with the data-producing agencies alone. Regulators charged with the task of setting out standards for data sharing seem to go out of their way to avoid clarity. (285) Working Paper No. 22 runs through a menu of options for data producers, including random sampling, top-coding, adding random noise, and blurring or clustering the indirect identifier variables. (286) But the paper does not provide a uniform guideline, admitting that "there are no accepted measures of disclosure risk for a microdata file, so there is no 'standard' that can be applied to assure that protection is adequate." (287)

This guidance is stunningly inadequate for a small firm or public agency charged with the task of producing a public-use dataset. It is understandable that statistical agencies would not want to commit themselves to a list of indirect identifiers or to a specific fixed set of protocols. Identifying which variables are indirect identifiers requires some working knowledge of the dataset and the publicly available resources that can be matched to the dataset. But the privacy regulators fail even to elucidate workable principles. (288) The regulatory body that administers HIPAA, for example, has failed to provide clear guidance on "specific conditions that must be met in order for privacy risks to be minim[ized]," leaving the details to be sorted out by individual privacy boards and Institutional Review Boards. (289)

The result is complete chaos. Simply put, there are no standard privacy practices. Richard Sander, a law professor at the University of California, Los Angeles, recently requested anonymized admissions data from 100 public colleges and 70 public law schools. (290) The requests were submitted pursuant to an effort to conduct a systematic examination of admissions practices, but the data collection process serves as its own meta-experiment on public records compliance. Since Sander sent identical requests to every school, their responses provide a unique opportunity to observe the variance in interpretations of education privacy laws. The meta-experiment produced two important insights. First, the schools had widely divergent interpretations of their obligations under FERPA. Some of the schools complied with the FOIA requests right away and without redactions, but the majority provided data only after protracted negotiations lasting as long as two years. One-fifth of the schools refused even dramatically scaled-back requests that presented no appreciable risk of re-identification. Second, the diversity among state FOIA statutes and privacy laws had little bearing on a school's likelihood to provide data. Noncompliant schools shared their state borders with compliant ones. Some of the refusing schools sent letters denying the request on the basis of privacy exemptions to the state's public records laws. Other schools became nonresponsive in the course of negotiations. (291) And a few schools effectively denied the request by sending data that redacted race information or by charging excessive fees. (292)

The void in standard practices naturally heightens the fears of members of the public, who view inconsistency as evidence that their confidentiality may not be sufficiently protected. (293) The discrediting of anonymization and the growing perception that current privacy protocols are a fragile facade have already taken a toll on the data commons. Some public-use datasets require researchers to sign notarized affidavits and cut through a good deal of red tape before and during their use of the data. (294) And some agencies have pulled public datasets into on-site research enclaves. (295) These trends increase the costs of doing research. Some policymakers are interfering with agencies' ability to release research data at all: the Department of Transportation and Related Agencies Appropriations Act was the first federal law prohibiting access to records in the absence of individual opt-in consent, even though the records were previously open to the public and had not been the subject of any known abuses. (296) Conditioning the collection of certain categories of information on the consent of the consumer is fatal to the collection of any reasonably useful data. (297)

The stakes for data privacy have reached a new high-water mark, but the consequences are not what they seem. We are at great risk not of privacy threats, but of information obstruction.

VI. CONCLUSION: THE TRAGEDY OF THE DATA COMMONS

The contours of the right to privacy are in the grips of an existential crisis. Social networking, history-sniffing cookies, and costless digital archiving have forced us to grapple with new and difficult problems. There are many worthy targets for the worries of privacy scholars. Research data is not one of them.

Parts II-IV of this Article analyzed the risk and the utility of public research data. With high benefit and low risk, the inescapable conclusion is that current privacy risks have little to do with anonymized research data, and the sharing of such data should be aided by the law rather than discouraged by it. But the proposals in Part V will no doubt be controversial. Now that researchers, legal scholars, and major policymakers have converged on an alarmist interpretation of the current state of data sharing, cool-headed balancing between risks and benefits is extraordinarily difficult. Our collective focus has been set on detriment alone.

Paul Ohm refers to the "inchoate harm[s]" of datasets that are released without airtight protections against re-identification. (298) Conceived of this way, the right not to be re-identified is one that need not bend to any considerations of the public interest in reliable research data. Ohm's approach to privacy policy is the same as my own--he advocates a balancing of the interests in privacy against the interests in data release. (299) Ohm and I arrive at very different policy proposals because we have divergent estimations of re-identification risks and the value of public data releases. However, other scholars have encouraged privacy law to drift into a property-based enforcement regime. (300) Proponents of property entitlement would say, "It is my data, and I want it out of the data commons." To conclude this Article, I highlight the features that make a property regime in anonymized data unworkable and unwise. Because risk is borne by individuals while utility is spread across the entire community, circumstances are ripe for a tragedy of the commons. The tort liability model for enforcement of privacy rights is much more sensible, since tort liability rules are tailored to the risks and costs at a higher level of generality--the societal level.

A. Problems with the Property Model

There is no Pareto-optimal way to share data. This, unfortunately, is irrefutable. Though we are collectively better off with public research data, sharing data imposes risk on the data subjects. This risk can be greatly reduced by taking certain precautions, but it can never reach zero. Who, then, is to decide how much risk is too much?

Many people want (and probably believe they have) a property interest in information that describes them. (301) The practical significance of enforcing privacy rights through the property model is that the data subject retains the right to hold out. Thus, recent class action lawsuits over the release of research data demanded injunctions against sharing data in the future and brought claims for trespass to chattels. (302) Additionally, Lawrence Lessig, Jerry Kang, and Paul Schwartz have argued that Americans should have control over their information that is at least as strong as a property regime would permit, and preferably stronger. (303)

In the case of research data, the property model is the wrong choice, not only for efficiency reasons, but also because it fails to meet the distributional goals required for justice. (304) Americans are naturally distrustful of data collection. Significant segments of the population continue to evade U.S. Census reporting, despite both the legal mandate to do so (305) and the Bureau's clean confidentiality record during the last six decades. (306) If data subjects refuse to consent to even small amounts of risk, which a rational actor model would predict they would do, then the data commons will dwindle as property is claimed. (307)

This problem is analogous to the modern vaccine controversy. Children under the age of vaccination are often at the greatest risk of death from virulent diseases like whooping cough. (308) The best protection is for everyone else (of eligible age) to get the vaccine, even though the vaccine itself poses dubious but popularly accepted risks. (309) Parents who choose not to vaccinate their children expect to have it both ways: since everyone else is vaccinated, their child is unlikely to be exposed to the disease. But they also avoid the small chance that their child could have an adverse reaction to the vaccination. The trouble is, once enough parents opt out of the vaccination pool, the communal protection falls apart. Thus, we are now witnessing a resurgence in infant mortality from whooping cough because the disease is spreading among adults and older children, who historically had been vaccinated but no longer are. (310)

Like the communal vaccination shield, the data commons is especially vulnerable to opt-outs. As people opt out, the value of the overall data diminishes precipitously rather than linearly: even a small number of holdouts will produce selection bias effects that compromise the utility of the remaining data. Khaled El Emam, Elizabeth Jonker, and Anita Fineberg have recently compiled and analyzed the evidence of selection bias caused by consent requirements for research on observational health data--data that was already collected in the course of treatment, such that research requires no additional interaction with the patients. (311) Consent is denied more frequently by patients who are younger, African American, unmarried, less educated, of lower socio-economic status, or--importantly--healthy. (312) These patterns are very difficult to control for, and they cause distortions in health research. (313) Put bluntly, property rights that follow the information into the data commons (and allow the data to be clawed back out) would allow holdouts to wreak disproportionate havoc on research. (314)
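A toy simulation makes the point concrete (the baseline health rate and the consent rates below are invented for illustration and are not taken from the El Emam study):

import random

random.seed(0)
population = [{"healthy": random.random() < 0.80} for _ in range(100_000)]

def consents(person) -> bool:
    # Assume healthier patients consent more often: 90% versus 70% (invented rates).
    return random.random() < (0.90 if person["healthy"] else 0.70)

consented = [p for p in population if consents(p)]
true_rate = sum(p["healthy"] for p in population) / len(population)
observed_rate = sum(p["healthy"] for p in consented) / len(consented)
print(f"true share healthy: {true_rate:.3f}, share healthy among consenters: {observed_rate:.3f}")
# Even this modest, non-random opt-out inflates the apparent health of the research cohort.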

The impulse to enforce research data privacy rights through property rules should be jettisoned and a tort approach restored. (315) On this issue, Paul Ohm and I agree that the public interest is best served by asking whether the utility of a public dataset significantly outweighs the risk of harm. (316) This would mark a return to the rational balancing anticipated by Samuel Warren and Louis Brandeis, who recognized that privacy rights should not interfere with information flow when that information is socially valuable. (317) This balancing of risks and benefits will also realign the policy discourse with the anonymization practices that are already widely in use and embraced by privacy experts in the statistics and social science fields. Anonymization was never believed to be a "privacy-providing panacea." (318) As Douglas Sylvester and Sharon Lohr correctly assert, "[t]he law, in fact, does not require that there be absolutely no risk that an individual could be identified from released data." (319) Rather, the law was assumed to reflect a conservative position in the risk-utility analysis--and it still does.

Radical as they may sound, this Article's proposals are formally reconcilable with the privacy scholarship that demands inalienable rights in the control of information. De-identified (anonymized) data need not be considered as relating to the underlying data subject at all--unless and until their data has been re-identified. The theoretical foundations for establishing a distinct regime for anonymized data are already in existence. Jerry Kang has noted that privacy is in some tension with intellectual property since there is no available copyright ownership interest in facts. (320) Once data has been unlinked from an identifiable person, perhaps it is best understood as a fact in the public domain. Better still, Ted Janger and Paul Schwartz have proposed a move to "constitutive privacy" rights, where access to information and limits on it should be modeled with an eye toward the nature of our society and the way we like to live. (321) Here the "democratic community" (322) is much better served by relinquishing an individual's control over anonymized research data.

Detaching privacy rights from anonymized data presents the best option available because it prevents what Anita Allen calls the maldistribution of privacy. (323) Consider the following scenario: a school district wishes to test a theory that implicit biases cause its teachers to depress grades of minority students when students are evaluated on subjective criteria. To test the hypothesis, the school district uses the objective scores received by its students on validated exams as controls to see if minority students receive significantly lower grades when grading is left to the teacher's subjective judgment. A small set of parents, after catching wind of the study, object to the use of their (Caucasian) children's data because the secondary use of their children's information does not suit their interests. Should we consider the data, in anonymized form, to be their data? Individuals' control over research data would result in a maldistribution of knowledge.

B. The Data Subject as the Honorable Public Servant

The data commons is the tax we pay into our public information reserves. Danielle Citron and Paul Schwartz have persuasively argued that privacy is a critical ingredient of a healthy social discourse. (324) In many respects this is true, but if taken to the extreme, data privacy can also make discourse anemic and shallow by removing from it relevant and readily attainable facts.

In time, technological solutions are likely to pare down the existing tension between data utility and disclosure risk. (325) Statistical software that allows a dataset to remain on a secure server while researchers submit statistical queries has been developed, and many data producers are slowly beginning to implement it. (326) In the meantime, anonymization continues to be an excellent compromise. Rather than sounding alarms and feeding into preexisting paranoia, the voices of reason from the legal academy should invoke a civic duty to participate in the public data commons and to proudly contribute to the digital fields that describe none of us and all of us at the same time.
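
One way such software can work is sketched below (a minimal Python illustration of the general architecture, not of any particular product; the class, records, and cell-size threshold are hypothetical): the microdata stays on the server, researchers may submit only aggregate queries, and any cell smaller than a minimum size is suppressed.

MIN_CELL_SIZE = 5

class RestrictedQueryServer:
    """Holds the microdata privately and answers only aggregate count queries."""

    def __init__(self, records):
        self._records = records  # never returned to callers

    def count(self, **criteria):
        # Return the number of matching records, or None if the cell is too
        # small to release safely.
        n = sum(
            all(rec.get(field) == value for field, value in criteria.items())
            for rec in self._records
        )
        return n if n >= MIN_CELL_SIZE else None

server = RestrictedQueryServer([
    {"age_group": "30-39", "county": "A", "diagnosis": "flu"},
    {"age_group": "40-49", "county": "A", "diagnosis": "flu"},
])
print(server.count(county="A", diagnosis="flu"))  # None: only 2 matches, below the threshold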

(1.) JEFFREY GROGGER & LYNN A. KAROLY, WELFARE REFORM: EFFECTSOF A DECADE OF CHANGE 196-97 (2005). Grogger has also produced empiricalevidence that welfare-to-work reforms did lead to increased wages andincreased rates of non-dependence among the welfare recipients, but alsohad a negative impact on the academic performance of their adolescentchildren. Jeff Grogger & Charles Michalopoulos, Welfare DynamicsUnder Term Limits (Nat'l Bureau of Econ. Research, Working PaperNo. 7353, 1999); Jeffrey Grogger, Lynn A. Karoly & Jacob AlexKlerman, Conflicting Benefits Trade-Offs in Welfare Reform, RAND.ORG(2002), http://www.rand.org/publications/randreview/issues/rr-12-02/benefits.html.

(2.) Roland G. Fryer, Jr. & Steven D. Levitt, Understanding theBlack-White Test Score Gap in the First Two Years of School, 86 REV.ECON. & STAT. 447, 447 (2004); Roland G. Fryer, Jr. & Steven D.Levitt, Testing for Racial Differences in the Mental Ability of YoungChildren (Nat'l Bureau of Econ. Research, Working Paper No. 12066,2006).

(3.) John J. Donohue III & Steven D. Levitt, The Impact ofLegalized Abortion on Crime, 116 Q.J. ECON. 379 (2001).

(4.) See Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, 57 UCLA L. REV. 1701 (2010).

(5.) See id. See generally FTC, PROTECTING CONSUMER PRIVACY IN ANERA OF RAPID CHANGE: A PROPOSED FRAMEWORK FOR BUSINESSES ANDPOLICYMAKERS (2010) [hereinafter FTC PRIVACY REPORT], available athttp://www.ftc.gov/os/2010/12/ 101201privacyreport.pdf; Ryan Singel,Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims, WIREDTHREAT LEVEL (Dec. 17, 2009, 4:29 PM),http://www.wired.com/threatlevel/2009/12/netflix-privacy-lawsuit; SethSchoen, What Information is "Personally Identifiable"?,ELECTRONIC FRONTIER FOUND. DEEPLINKS (Sept. 11, 2009, 10:43 PM),http://www.eff.org/deeplinks/2009/09/what-informationpersonally-identifiable; Re-identification, ELECTRONIC PRIVACY INFO. CENTER,http://epic.org/privacy/reidentification/ (last visited Dec. 21, 2011).Parties in several recent lawsuits have argued that there is no longer atenable difference between anonymized information and personallyidentifiable information. See, e.g., Complaint at 20, Gaos v. GoogleInc., No. 10-CV-04809 (N.D. Cal. May 2, 2011); Complaint at 15, Doe v.Netflix, No. C09 05903 (N.D. Cal. Dec. 17, 2009) [hereinafter DoeComplaint]; Elinor Mills, AOL Sued over Web Search Data Release, CNETNews Blogs (Sept. 25, 2006, 12:17 PM),http://news.cnet.com/8301-10784_3-6119218-7.html.

(6.) See, e.g., LAWRENCE LESSIG, CODE AND OTHER LAWS OF CYBERSPACE142-63 (1999); Jerry Kang & Benedikt Buchner, Privacy in Atlantis,18 HARV. J.L. & TECH. 229, 255 (2004); Paul M. Schwartz, Property,Privacy, and Personal Data, 117 HARV. L. REV. 2055, 2076, 2088-113(2004).

(7.) Privacy law is on the mind of politicians and regulators andhas entered what John Kingdon calls the proverbial "policywindow." JOHN KINGDON, AGENDAS, ALTERNATIVES, AND PUBLIC POLICIES165 (2d ed. 2002).

(8.) The tragedy of the commons model I explore here is not perfectly analogous to the "grazing commons" concept popularized by Garrett Hardin. Garrett Hardin, The Tragedy of the Commons, 162 SCIENCE 1243 (1968). In the grazing model, self-interested actors convert the communal benefits of the commons into private benefits for themselves. The gain from adding one more cow of their own is internalized, while the losses in the form of overgrazing are externalized and borne by the entire population. Id. In the data commons, the data subject depletes the commons by removing his data. The marginal detriment of his decision is externalized and shared across the entire population. Meanwhile, he enjoys the full value of the avoided risk of re-identification. Unlike the traditional commons examples, each actor is constrained in how much of the commons he is capable of depleting since he has but one line of data to remove. (The grazing and pollution examples that Hardin discusses anticipate actors who deposit multiple cows, or increasing amounts of pollution, into the commons.) But the key point is intact: communal benefits are lost due to actions motivated by self-interest. Vaccination makes an even better comparison. See infra Part VI.

(9.) Fred Cate makes a similar argument in the context of consumerdata used for credit reports. See Fred H. Cate, Data and Democracy,Herman B Wells Distinguished Lecture of the Institute and Society forAdvanced Study (Sept. 21, 2001), in IND. UNIV., INST. FOR ADVANCED STUDYAND SOC'Y FOR ADVANCED STUDY, HERMAN B WELLS DISTINGUISHED LECTURESERIES 1 (2001), available athttps://scholarworks.iu.edu/dspace/bitstream/handle/2022/8508/IAS-WDLS-01.pdf.

(10.) For a discussion of "moral panics," see STANLEYCOHEN, FOLK DEVILS AND MORAL PANICS (1972). Here, advocacy groups'demand for political action is driven by fears that privacy andanonymity as we know them are on the brink of ruin.

(11.) For example, the Netflix de-anonymization study, on which Ohmrelies heavily, makes no effort to compare the risk of re-identificationto the utility of the dataset. Arvind Narayanan & Vitaly Shmatikov,Robust De-anonymization of Large Sparse Datasets, 2008 Proc. 29th IEEESYMP. ON SECURITY & PRIVACY 111. The early work of Latanya Sweeneyacknowledged a tradeoff between a dataset's utility and itstheoretical reidentification risk, but the discussion of utility wasabstract and very brief. Moreover, Sweeney's recent work pays noregard to the countervailing interests in data utility at all. CompareLatanya Sweeney, Computational Disclosure Control: A Primer on DataPrivacy Protection (May 2001) (unpublished Ph.D. thesis, MassachusettsInstitute of Technology), available athttp://dspace.mit.edu/bitstream/handle/172L1/8589/49279409.pdf, withLatanya Sweeney, Patient Identifiability in Pharmaceutical MarketingData (Data Privacy Lab Working Paper 1015, 2011), available athttp://dataprivacylab.org/projects/identifiability/ pharma1.pdf. Thestatistical literature on disclosure risk generally recognizes thetension between the utility of data sharing and its concomitant risksbut struggles to define best practices that can persist with increasingamounts of data accumulation. For a review of the state of the currentcomputer science literature on the subject, see GEORGE T. DUNCAN ET AL.,STATISTICAL CONFIDENTIALITY: PRINCIPLES AND PRACTICE (2011).

(12.) See, e.g., PAUL M. SCHWARTZ, THE CTR. FOR INFO. POLICY LEADERSHIP, DATA PROTECTION LAW AND THE ETHICAL USE OF ANALYTICS 8 (2010), available at http://www.huntonfiles.com/files/webupload/CIPL_Ethical_Undperinnings_of_Analytics_Paper.pdf; Ohm, supra note 4, at 1708, 1714. But see, e.g., Douglas J. Sylvester & Sharon Lohr, The Security of Our Secrets: A History of Privacy and Confidentiality in Law and Statistical Practice, 83 DENV. U. L. REV. 147, 196-99 (2005); Eugene Volokh, Freedom of Speech and Information Privacy: The Troubling Implications of a Right to Stop People From Speaking About You, 52 STAN. L. REV. 1049, 1122-24 (2000).

(13.) 45 C.F.R. [section] 164.501 (2010) (defining research as"a systematic investigation, including research development,testing, and evaluation, designed to develop or contribute togeneralizable knowledge").

(14.) A business entity might be very interested in what the particular individuals in its, or a competitor's, databases are like and inclined to purchase, regardless of whether its analytics can be generalized to describe human phenomena. Data researchers are naturally indifferent to information about any particular person because information about that person cannot be generalized to any class of persons. "Statistical data are unconcerned with individual identities. They are collected to answer questions such as 'how many?' or 'what proportion?', not 'who?'. The identities and records of co-operating (or non-cooperating [sic]) subjects should therefore be kept confidential, whether or not confidentiality has been explicitly pledged." ISI Declaration on Professional Ethics, INT'L STAT. INST. (Aug. 1985), http://isi-web.org/about/ethics1985; see also Sylvester & Lohr, supra note 12, at 185.

(15.) See, e.g., Confidentiality Statement, U.S. CENSUS BUREAU,http://factfinder.census.gov/jsp/saff/SAFFInfo.jsp?_pageId=su5_confidentiality (last updated Mar. 17, 2009).These techniques include top-coding, data swapping, and the addition ofrandom noise. See Jerome P. Reiter, Estimating Risks of IdentificationDisclosure in Microdata, 100 J. AM. STAT. ASS'N 1103, 1103 (2005).While these techniques increase privacy, they come at a cost to theutility of the data since the fuzzied data affects the results ofstatistical analyses. See, e.g., A. F. Karr et al., A Framework forEvaluating the Utility of Data Altered to Protect Confidentiality, 60AM. STATISTICIAN 224, 224 (2006). Data archivists and social scientistsconceive of privacy obligations differently from lawmakers and, notsurprisingly, their approach is more nuanced.

(16.) See discussion of the Family Education Rights and Privacy Act("FERPA"), Health Insurance Portability and Accountability Act("HIPAA"), and the Confidential Information Protection andStatistical Efficiency Act infra text accompanying notes 20-21.

(17.) For example, the HIPAA Standards for Privacy of IndividuallyIdentifiable Health Information (the "HIPAA Privacy Rule")define individually identifiable information as information that"identifies the individual" or information "[w]ithrespect to which there is a reasonable basis to believe the informationcan be used to identify the individual." 45 C.F.R. [section]160.103 (2010).

(18.) I borrow this term from the Department of Education's commentary on the final ruling of the 2008 revisions to the FERPA regulations. Family Educational Rights and Privacy, 73 Fed. Reg. 74,806, 74,831 (Dec. 9, 2008). Although some use other terminology such as "high risk variables," I prefer the term "indirect identifier" because it connotes that the information might be usable for tracing an identity without implying that it always and necessarily heightens the risk of re-identification to an unacceptable level. Latanya Sweeney, the computer scientist at Carnegie Mellon University who popularized the k-anonymity model for de-identifying data, uses the term "quasi-identifiers." Latanya Sweeney, k-Anonymity: A Model for Protecting Privacy, 10 INT'L J. UNCERTAINTY, FUZZINESS AND KNOWLEDGE-BASED SYSTEMS 557, 563 (2002).

(19.) Ohm, supra note 4, at 1740-41. Paul Ohm suggests modifying the rhetoric used in information privacy to connote that common privacy techniques merely "try to achieve anonymity," and do not actually achieve it. Id. at 1744. I like his recommendation to use the term "scrub," id., but Ohm's linguistic analysis reveals something about his assumptions. To Ohm, there never was a difference between trying to achieve anonymity and anonymity; anonymization techniques were never believed to be completely without risk.

(20.) E-Government Act of 2002, Pub. L. No. 107-347, [section]502(4), 116 Stat. 2962, 2962 (codified at 44 U.S.C. [section] 3501note).

(21.) See FERPA, 20 U.S.C.A. [section] 1232g (West 2010 & Supp. 2011); HIPAA Standards for Privacy of Individually Identifiable Health Information, 45 C.F.R. [section] 160.103 (2010). Usually, multiple indirect identifiers have to be combined in order to ascertain the identity of a specific individual. Privacy law is mindful of this potential route to re-identification and explicitly guards against it--any combination of publicly knowable information that can be used to trace to an identity is PII. The FERPA regulations prohibit the disclosure of "[o]ther information that, alone or in combination, is linked or linkable to a specific student that would allow a reasonable person in the school community, who does not have personal knowledge of the relevant circumstances, to identify the student with reasonable certainty." 34 C.F.R. [section] 99.3 (2010). The HIPAA Privacy Rule prohibits the disclosure of "protected health information," 45 C.F.R. [section] 164.502 (2010), including information "(i) [t]hat identifies the individual; or (ii) [w]ith respect to which there is a reasonable basis to believe the information can be used to identify the individual." Id. [section] 160.103.

(22.) JONATHAN P. CAULKINS ET AL., RAND, MANDATORY MINIMUM DRUGSENTENCES: THROWING AWAY THE KEY OR THE TAXPAYERS' MONEY? 62(1997), available athttp://www.rand.org/pubs/monograph_reports/MR827.html.

(23.) Id.

(24.) U.S. SENTENCING COMM'N, SPECIAL REPORT TO THE CONGRESS:MANDATORY MINIMUM PENALTIES IN THE FEDERAL CRIMINAL JUSTICE SYSTEM iii(1991) ("Deterrence, a primary goal of the Sentencing Reform Actand the Comprehensive Crime Control Act, is dependent on certainty andappropriate severity.").

(25.) K. JACK RILEY ET AL., RAND, JUST CAUSE OR JUST BECAUSE?:PROSECUTION AND PLEA-BARGAINING RESULTING IN PRISON SENTENCES ONLOW-LEVEL DRUG CHARGES IN CALIFORNIA AND ARIZONA xiii (2005), availableat http://www.rand.org/pubs/monographs/MG288.html.

(26.) Substance Abuse and Crime Prevention Act of 2000, Cal. Prop.36 (codified at CAL. PENAL CODE [section] 1210 (West 2006)); ActRelating to Laws on Controlled Substances and those Convicted ofPersonal Use or Possession of Controlled Substances, Prop. 200, (Ariz.1996) (codified as amended at Ariz. Rev. Stat. Ann. [section] 41-1404.16(2011).

(27.) RILEY ET AL., supra note 25, at 76.

(28.) Id. at 62.

(29.) Id. at 76.

(30.) The 1997 study used data from the U.S. Drug EnforcementAgency's System to Retrieve Information from Drug Evidence("STRIDE") and from the National Household Survey on DrugAbuse. Caulkins, supra note 22, at 85. The 2005 study used data from theCalifornia and Arizona Departments of Corrections. RILEY ET AL., supranote 25, at 20, 24.

(31.) Press Release, Federal Financial Institutions ExaminationCouncil (Sept. 8, 2006), available athttp://www.ffiec.gov/hmcrpr/hm090806.htm; Janneke Ratcliffe & KevinPark, Written Comments and Supplement to Oral Testimony Provided byJanneke Ratcliffe at the Hearing on Community Reinvestment ActRegulations (Aug. 31, 2010), available athttp://www.ccc.unc.edu/documents/CRA_written_8.6.2010.v2.pdf.

(32.) See, e.g., Jacob S. Hacker, Inst. for America's Future,Public Plan Choice in Congressional Health Plans, CAMPAIGN FORAMERICA'S FUTURE (Aug. 20, 2009),http://www.ourfuture.org/files/Hacker_Public_Plan_August_2009.pdf.

(33.) Casey J. Dawkins, Recent Evidence on the Continuing Causes ofBlack-White Residential Segregation, 26 J. URB. AFF. 379, 379 (2004).

(34.) Allen J. Wilcox, Birth Weight and Perinatal Mortality: TheEffect of Maternal Smoking, 137 AM. J. EPIDEMIOLOGY 1098, 1098 (1993).

(35.) Cate, supra note 9, at 14.

(36.) For example, the data routinely collected by the Equal Employment Opportunity Commission is used to check for statistically significant disparities between racial and gender groups. See, e.g., Paul Meier, Jerome Sacks & Sandy L. Zabell, What Happened in Hazelwood: Statistics, Employment Discrimination, and the 80% Rule, 1984 AM. B. FOUND. RES. J. 139.

(37.) George T. Duncan, Exploring the Tension Between Privacy and the Social Benefits of Governmental Databases, in A LITTLE KNOWLEDGE: PRIVACY, SECURITY AND PUBLIC INFORMATION AFTER SEPTEMBER 11, at 71, 82 (Peter M. Shane, John Podesta & Richard C. Leone eds., Century Foundation 2004).

(38.) SCHWARTZ, supra note 12, at 8.

(39.) Id. at 8, 15.

(40.) Miguel Helft, Aches, a Sneeze, a Google Search, N.Y. TIMES,Nov. 12, 2008, at A1.

(41.) Miguel Helft, Is There a Privacy Risk in Google Flu Trends?,N.Y. TIMES BITS (Nov. 13, 2008, 8:20 PM),http://bits.blogs.nytimes.com/2008/11/13/does-google-flu-trendsraises-new- privacy-risks.

(42.) SCHWARTZ, supra note 12, at 24.

(43.) The problem of valuing information is as old as privacy. Samuel Warren and Louis Brandeis believed that the press in their day was overstepping "the obvious bounds of propriety and of decency" by photographing the private lives of public and elite figures for the gossip pages. Samuel D. Warren & Louis D. Brandeis, The Right to Privacy, 4 HARV. L. REV. 193, 196 (1890). But today gossip journalism is embedded in mainstream culture and is often the spearhead for the uncovering of important news items. See David Perel, How the Enquirer Exposed the John Edwards Affair, WALL ST. J., Jan. 23, 2010, at A15.

(44.) Google Watches as You Type in Search Words and Displays "Live" Results in Real Time. Creeped Out, So Are We., TECHALOUD (Aug. 23, 2010), http://www.techaloud.com/2010/08/google-tests-search-results-that-update-as-you-type (expressing displeasure with Google's use of private information in generating search terms); Chris Jay Hoofnagle, Beyond Google and Evil: How Policy Makers, Journalists and Consumers Should Talk Differently About Google and Privacy, FIRST MONDAY (Apr. 6, 2009), http://www.firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2326/2156.

(45.) But see Roger Clarke, Computer Matching by GovernmentAgencies: The Failure of Cost/Benefit Analysis as a Control Mechanism, 4INFO. INFRASTRUCTURE & POL'Y 29 (1995).

(46.) See Ian Ayres, Race and Romance: An Uneven Playing Field forBlack Women, FREAKONOMICS, (Mar. 3, 2010, 2:00 PM),http://www.freakonomics.com/2010/03/03/raceand-romance-an-uneven-playing- field-for-black-women.

(47.) Id.

(48.) Id.

(49.) Its own privacy assurances seemed to have deflected criticismwell enough. See Jason Del Rey, In Love with Numbers: Getting the Mostout of Your Company Data, INC. MAGAZINE, Oct. 2010, at 105,106.

(50.) Mark Milian, Facebook Digs Through User Data and Graphs U.S.Happiness, L.A. TIMES TECH. (Oct. 6, 2009, 3:50 PM),http://latimesblogs.latimes.com/technology/2009/10/facebook-happiness.html.

(51.) See, e.g., Of Governments and Geeks, ECONOMIST, Feb. 6, 2010,at 65; Chris Soghoian, AOL, Netflix and the End of Open Access toResearch Data, CNET SURVEILLANCE STATE (Nov. 30, 2007, 8:30 AM),http://news.cnet.com/8301-13739_3-9826608-46.html.

(52.) See Gary King, Replication, Replication, 28 PS: POL. SCI.& POLITICS 444, 444 (1995).

(53.) See Spectacular Fraud Shakes Stem Cell Field, MSNBC (Dec. 23,2005), http://www.msnbc.msn.com/id/10589085/ns/technology_and_science-science.

(54.) The National Institutes of Health found that only one out of every twenty claims flowing from observational studies ends up being reproducible in controlled studies. S. Stanley Young, Everything Is Dangerous: A Controversy, AM. SCIENTIST (Apr. 22, 2009), http://www.americanscientist.org/science/pub/everything-is-dangerous-a-controversy.

(55.) Fiona Mathews, et al., You Are What Your Mother Eats:Evidence for Maternal Preconception Diet Influencing Foetal Sex inHumans, 275 Proc. ROYAL SOC'Y B 1661, 1665 (2008).

(56.) Tara Parker-Pope, Boy or Girl? The Answer May Depend onMom's Eating Habits, N.Y. TIMES WELL (April 23, 2008, 12:59 PM),http://well.blogs.nytimes.com/2008/04/23/boy-or-girl-the-answer-may-depend-on-moms-eating-habits; Allison Aubrey,Can a Pregnant Woman's Diet Affect Baby's Sex?, (NPR radiobroadcast Jan. 15, 2009), available athttp://www.npr.org/templates/story/story.php?storyId=99346281.

(57.) See Young, supra note 54.

(58.) NATIONAL RESEARCH COUNCIL OF THE NATIONAL ACADEMIES, SHARINGPUBLICATION-RELATED DATA AND MATERIALS: RESPONSIBILITIES OF AUTHORSHIPIN THE LIFE SCIENCES 3 (2003). Science, an academic journal, changed itsreview policy in 2006 to require all authors to post the raw datasupporting their findings online after the discovery that one of themost important stem cell research findings at that time was a completefabrication. See Barry R. Masters, Book Review, 12 J. BIOMEDICAL OPTICS039901-1, 039901-1 (2007) (reviewing ADIL E. SHAMOO & DAVID B.RESNIK, RESPONSIBLE CONDUCT OF RESEARCH (2003)).

(59.) Furman v. Georgia, 408 U.S. 238, 240 (1972).

(60.) Isaac Ehrlich, The Deterrent Effect of Capital Punishment: AQuestion of Life and Death, 65 AM. ECON. REV. 397, 398 (1975).

(61.) 428 U.S. 153 (1976).

(62.) Id. at 233-34.

(63.) See John J. Donohue & Justin Wolfers, Uses and Abuses ofEmpirical Evidence in the Death Penalty Debate, 58 STAN. L. REV. 791,793 (2005) (noting that Lawrence Katz, Steven Levitt, EllenShustorovich, Hashem Dezhbakhsh, Paul H. Rubin, Joanna M. Shepherd, H.Naci Mocan, R. Kaj Gittings, and Paul R. Zimmerman have written on theissue).

(64.) Id. at 794. Moreover, with so few capital sentences per yearthe deterrence effects of each capital sentence cannot be disentangledfrom the year and state controls. Id.

(65.) The Donohue and Wolfers study has been praised by independentreviewers for its use of sensitivity analysis, and for testing findingsagainst alternative specifications and controls. Joshua D. Angrist &Jorn-Steffen Pischke, The Credibility Revolution in Empirical Economics:How Better Research Design is Taking the Con out of Econometrics 15(Nat'l Bureau of Econ. Research, Working Paper No. 15794, 2010),available at http://ssrn.com/abstract=1565896.

(66.) Steve Chapman, The Decline of the Death Penalty, CHI. TRIB.,Dec. 26, 2010, at C29; Andrew Kohut, The Declining Support forExecutions, N.Y. TIMES, May 10, 2001, at A33. The empirical researchcommunity has seen a similar debate play out in the context of the guncontrol debate. See Ian Ayres & John J. Donohue III, Shooting Downthe "More Guns, Less Crime" Hypothesis, 55 STAN. L. REV. 1193,1202 (2003).

(67.) This phenomenon is, in fact, what motivates George T. Duncan's concept of "information injustice." Duncan, supra note 37, at 71, 82.

(68.) See Lawrence O. Gostin, Health Services Research: PublicBenefits, Personal Privacy, and Proprietary Interests, 129 ANNALS OFINTERNAL MED. 833 (1998).

(69.) Pub. Citizen Health Research Grp. v. FDA, No. Civ.A.99-0177(JR), 2000 WL 34262802, at *1 (D.D.C. Jan. 19, 2000) (G.D. Searle& Co. intervened to support the government's decision towithhold clinical trial data based on the privacy exemption in the FOIAstatute).

(70.) 170 S.W.3d 226 (Tex. App. 2005).

(71.) Id. at 227.

(72.) Id. at 230.

(73.) The requested dataset would have included the sex, age, ethnicity, random teacher code, random school code, test scores, and a few other variables for each student. The request would have revealed PII because the random school and teacher codes, though they sound like non-identifiers, are actually indirect identifiers. First, the school codes in the Dallas dataset could be cracked using publicly available school enrollment statistics. For example, if Preston Hollow Elementary School was the only school that enrolled 750 students in the year 1995, then its school code could easily be identified by finding the school in the dataset with 750 subjects for the year 1995. Even if two schools happened to have identical enrollment figures for one particular year, the enrollment patterns over time were unique for every school. (The plaintiffs asked for several consecutive years of test scores.) Once the school codes were reverse-engineered, most of the teacher codes could be re-identified using the same methods. Once the school and teacher codes were cracked, Dallas schoolchildren could be organized into small class clusters. A class of thirty schoolchildren cannot be diced into racial groups and gender categories without dissolving into unique cases. Cf. infra Part III. This protocol, checking to see whether subgroups of individuals in a dataset could be re-identified using combinations of publicly documented characteristics, is consistent with the directives promulgated by the Family Policy Compliance Office ("FPCO"), the federal agency charged with enforcing FERPA. In providing guidance on the scope of "personally identifiable information," the FPCO opined that under certain circumstances "the aggregation of anonymous or de-identified data into various categories could render personal identity 'easily traceable.' In those cases, FERPA prohibits disclosure of the information without consent." See Letter from LeRoy S. Rooker, Director, Family Policy Compliance Office, to Corlis P. Cummings, Senior Vice Chancellor for Support Services, Bd. of Regents of the Univ. Sys. of Ga. (Sept. 25, 2003), available at http://www2.ed.gov/policy/gen/guid/fpco/ferpa/library/georgialtr.html.

(74.) 20 U.S.C. [section] 1232g(b)(1)(F) (2006).

(75.) See, e.g., Freedom of Information Act, 5 U.S.C. [section] 552(2006); California Public Records Act, CAL. GOV'T CODE[section][section] 6250 et seq. (West 2008); Freedom of Information Law,N.Y. PUB. OFFICERS LAW [section][section] 84 et seq. (Consol. 2011).

(76.) See, e.g., CAL. GOV'T CODE [section] 6250 (West 2011)("[A]ccess to information concerning the conduct of thepeople's business is a fundamental and necessary right of everyperson in this state.").

(77.) Sylvester & Lohr, supra note 12, at 190; see also Cate,supra note 9, at 13-15.

(78.) Further Implementation of the Presidential Records Act, Exec.Order No. 13233, 66 Fed. Reg. 56,025 (Nov. 5, 2001).

(79.) Presidential Records, Exec. Order No. 13489, 74 Fed. Reg.4669 (Jan. 21, 2009).

(80.) For example, in Arizona, improper disclosure of private factsis a felony, while improper denial of a legitimate public recordsrequest is a misdemeanor. See Air Talk: The "Open GovernmentPlan" (Southern California Public Radio broadcast Dec. 14, 2009),available at http://www.scpr.org/programs/airtalk/2009/12/14/the-open-government-plan.

(81.) Id.

(82.) Id.

(83.) Physicians Comm. for Responsible Med. v. Glickman, 117 F.Supp. 2d 1, 5-6 (D.D.C. 2000).

(84.) U.S. Dep't of State v. Ray, 502 U.S. 164, 166 (1991).

(85.) Adair v. England, 183 F. Supp. 2d 31, 56 (D.D.C. 2002).

(86.) See Fish v. Dallas Indep. Sch. Dist., 170 S.W.3d 226 (Tex.App. 2005).

(87.) COALITION OF JOURNALISTS FOR OPEN GOV'T, FOIA LITIGATIONDECISIONS, 1999-2004 1 (2004), available athttp://www.cjog.net/documents/Litigation_Report_9904.pdf.

(88.) TIMOTHY GROSECLOSE, CUARS RESIGNATION REPORT (2008),available at http://images.ocregister.com/newsimages/news/2008/08/CUARSGrosecloseResignationRep ort.pdf; see also Seema Mehta, UCLAAccused of Illegal Admitting Practices, L.A. TIMES, Aug. 30, 2008, atB1.

(89.) See GROSECLOSE, supra note 88.

(90.) Id.

(91.) Id.

(92.) Robert Steinbuch, What They Don't Want Me (and You) toKnow About Non-Merit Preferences in Law School Admissions: An Analysisof Failing Students, Affirmative Action, and Legitimate EducationalInterests 3 (unpublished manuscript) (on file with author).

(93.) Richard J. Peltz, From the Ivory Tower to the Glass House:Access to "De-Identified" Public University Admission Recordsto Study Affirmative Action, 25 HARV. J. ON RACIAL & ETHNIC JUSTICE181, 185-87 (2009).

(94.) See, e.g., Osborn v. Bd. of Regents of Univ. of Wis., 647 N.W.2d 158, 171 (Wis. 2002) ("[B]y redacting or deleting the name of the high school or undergraduate institution, the University no longer faces a situation where only one minority student from a named high school applies to one of the University's campuses and therefore, even though the student's name is not disclosed, the data could be personally identifiable.").

(95.) Mark Elliot, DIS: A New Approach to the Measurement ofStatistical Disclosure Risk, 2 RISK MGMT. 39 (2000) (putting forward anew method of measuring the "worst-case risk"); Jordi Nin etal., Rethinking Rank Swapping to Decrease Disclosure Risk, 64 DATA &KNOWLEDGE ENGINEERING 346 (2008). But note that many computer scientistsalso incorporate assessments of data utility and information loss intotheir work. See, e.g., Duncan, supra note 11; Josep Domingo-Ferrer etal., Comparing SDC Methods for Microdata on the Basis of InformationLoss and Disclosure Risk, EUROPEAN COMMISSION (2001),http://epp.eurostat.ec.europa.eu/portal/page/portal/research_methodology/ documents/81.pdf.

(96.) See infra notes 117-119, 135 and accompanying text.

(97.) For a concise overview on how de-anonymization attacks work,see JANE YAKOWITZ & DANIEL BARTH-JONES, TECH. POLICY INST., THEILLUSORY PRIVACY PROBLEM IN SORRELL V. IMS HEALTH 1-5 (2011),http://www.techpolicyinstitute.org/files/the%20illusory%20privacy%20problem%20in%20sorrell1.pdf.

(98.) More sophisticated techniques make matches based not on strong exact linkages but on the similarity of the matching variables and on a sufficiently large gap between the best match and the second-best match. This allows an attack algorithm to make matches under more realistic conditions in which databases contain measurement error, but it nevertheless requires that the adversary have access to more-or-less complete information on the general population from which the de-identified data was sampled. These methods are described more thoroughly by Josep Domingo-Ferrer et al., supra note 95, at 813-14.
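
A bare-bones illustration of the idea (in Python, with made-up records and an arbitrary acceptance ratio): a candidate record is accepted only when it is much closer to the target than the runner-up.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def link(target, candidates, max_ratio=0.5):
    # Rank candidate records by distance to the target's matching variables.
    ranked = sorted(candidates.items(), key=lambda kv: distance(target, kv[1]))
    (best_id, best), (_, second) = ranked[0], ranked[1]
    d1, d2 = distance(target, best), distance(target, second)
    # Accept the best match only when it stands out clearly from the second best.
    return best_id if d2 > 0 and d1 / d2 <= max_ratio else None

candidates = {"rec1": (34, 60.1), "rec2": (35, 72.4), "rec3": (51, 65.0)}
print(link((34, 61.0), candidates))  # 'rec1': clearly closest to the target
print(link((42, 66.0), candidates))  # None: no candidate stands out from the rest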

(99.) See Sweeney, supra note 11, at 52.

(100.) HIPAA Standards for Privacy of Individually IdentifiableHealth Information, 45 C.F.R. [section] 164.514(b)(2)(ii) (2010).Alternatively, the disclosing entity must use "generally acceptedstatistical and scientific principles and methods" to ensure thatthe risks of reidentification are "very small." [section]164.514(b)(1).

(101.) What is a Quasi-identifier?, ELECTRONIC HEALTH INFO.LABORATORY (Oct. 18, 2009),http://www.ehealthinformation.ca/knowledgebase/article/AA-00120. Notethat indirect identifiers are also known as"quasi-identifiers."

(102.) Narayanan & Shmatikov, supra note 11.

(103.) Id.

(104.) Id. at 122-23. The authors first mapped the five-point scalefrom Netflix movie ratings onto the ten-point scale used by IMDb, andthen attempted to identify matches based on strings of movies that werereviewed similarly on both websites. Id.

(105.) Brief of Amicus Curiae Electronic Frontier Foundation inSupport of Petitioners at 9-10, Sorrell v. IMS Health Inc., 131 S.Ct.2653 (2011) (No. 10-779), 2011 WL 757416, at *9-10; see also supra note5 (discussing various privacy lawsuits). Netflix had added random noiseto the dataset. Narayanan & Shmatikov, supra note 11, at 119.

(106.) See Narayanan & Shmatikov, supra note 11, at 122("A water-cooler conversation with an office colleague about hercinematographic likes and dislikes may yield enough information [tode-anonymize her subscriber record]....").

(107.) Id. at 112; see also Cynthia Dwork, Differential Privacy,2006 PROC. 33rd INT'L COLLOQUIUM ON AUTOMATA, LANGUAGES &PROGRAMMING, available athttp://research.microsoft.com/pubs/64346/dwork.pdf.

(108.) Ohm, supra note 4, at 1721.

(109.) For example, regulations issued under FERPA define PII to include "information that, alone or in combination, is linked or linkable to a specific student that would allow a reasonable person in the school community, who does not have personal knowledge of the relevant circumstances, to identify the student with reasonable certainty." 34 C.F.R. [section] 99.3 (2011) (emphasis added). Likewise, "[a]t a minimum, each statistical agency must assure that the risk of disclosure from the released data when combined with other relevant publicly available data is very low." Report on Statistical Disclosure Limitation Methodology 3 (Fed. Comm. on Statistical Methodology, Statistical Working Paper No. 22, 2d version, 2005) [hereinafter Working Paper No. 22] (emphasis added), available at http://www.fcsm.gov/working-papers/SPWP22_rev.pdf.

(110.) Narayanan & Shmatikov, supra note 11, at 123.

(111.) See generally id. Narayanan and Shmatikov make similarbreakthroughs using graphs of network connections of anonymized Twitteraccounts by matching them to sufficiently unique networked accounts onFlickr. Arvind Narayanan & Vitaly Shmatikov, Deanonymizing SocialNetworks, 2009 Proc. 30TH IEEE SYMP. ON SECURITY & PRIVACY 173.

(112.) Schwartz, supra note 6; Daniel J. Solove, Access andAggregation: Public Records, Privacy and the Constitution, 86 MINN. L.REV. 1137, 1185 (2002) ("The aggregation problem arises from thefact that the digital revolution has enabled information to be easilyamassed and combined.").

(113.) As Andrew Serwin puts it, "[i]ndeed, in today'sWeb 2.0 world, where many people instantly share very private aspects oftheir lives, one can hardly imagine a privacy concept more foreign thanthe right to be let alone." Andrew Serwin, Privacy 3.0--ThePrinciple of Proportionality, 42 U. MICH. J. L. REFORM 869, 872 (2009).

(114.) Indeed, "lifelogging" on the Internet presents anumber of challenges for privacy scholars even on their own. Anita Allenhas written about the problems of the Internet's "perniciousmemory" recalling information that puts the lifelogger in the worstlight. Anita L. Allen, Dredging Up the Past: Lifelogging, Memory, andSurveillance, 75 U. CHI. L. REV. 47, 56-63 (2008).

(115.) See Narayanan & Shmatikov, supra note 11, at 111.

(116.) The Narayanan-Shmatikov algorithm utilizes the dataset's sparseness to test for false positive matches. If a set of movies leads to a unique match in the Netflix data, and if the movies don't share a common fan base, then the algorithm will be confident that the match is accurate. Id. at 112. But the Netflix data is missing a lot of information about the movie-viewing of its own data subjects, so the algorithm is susceptible to false positives and false negatives when it attempts to match against auxiliary information. Take this simplified but illustrative hypothetical: Albert, Bart, and Carl have all seen Doctor Zhivago, Evil Dead II, and Dude, Where's My Car?. Albert and Bart are in the Netflix database; Carl is not. Albert rates all three movies, but Bart rates only Doctor Zhivago, and, thus, Netflix has no record of his having seen Evil Dead II and Dude, Where's My Car?. Because Albert is the only person in the Netflix dataset who rated all three movies, he looks highly unique among the Netflix data subjects, even though we know, in fact, that these three movies are not unique to him even within the Netflix sample. Carl comments on all three movies on IMDb. The attack algorithm matches Carl's IMDb profile to Albert's Netflix data and reports back with a high degree of statistical confidence that the match is not a false positive.
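
The hypothetical can be reduced to a few lines of Python (same titles and names as above; the matching rule is deliberately simplified): a unique subset match looks convincing even though the true owner of the auxiliary profile is not in the sample at all.

netflix = {
    "Albert": {"Doctor Zhivago", "Evil Dead II", "Dude, Where's My Car?"},
    "Bart": {"Doctor Zhivago"},  # Bart saw all three movies but rated only one
}
imdb_profile = {"Doctor Zhivago", "Evil Dead II", "Dude, Where's My Car?"}  # Carl's reviews

# Find subscribers whose rated titles contain everything in the IMDb profile.
matches = [name for name, rated in netflix.items() if imdb_profile <= rated]
if len(matches) == 1:
    # The match is unique, so it looks statistically sound -- but Carl is not
    # a Netflix subscriber at all; missing data produced a false positive.
    print("Matched the IMDb reviewer to Netflix subscriber:", matches[0])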

(117.) In January 2010, a panel of privacy law experts and computerscientists advised the FTC that, in promulgating new regulations, itshould abandon faith in anonymization and clamp down on broad datasharing to the extent possible. The Narayanan-Shmatikov study was heldup as evidence that anonymization protocols offer no security againstreidentification. Remarks at the FTC Second Roundtable on ExploringPrivacy 15, 56 (Jan. 28, 2010) (transcript available athttp://www.ftc.gov/bcp/workshops/privacyroundtables/PrivacyRoundtable_Jan2010_Transcript.pdf). Narayanan, however, cognizantof the importance of research data, has worked with entities toanonymize public release datasets sufficiently to reduce risks. SeeSteve Lohr, The Privacy Challenge in Online Prize Contests, N.Y. TIMESBITS (May 21, 2011, 5:25 PM), http://bits.blogs.nytimes.com/2011/05/21/the-privacy-challenge-in-online-prize-contests.

(118.) SCHWARTZ, supra note 12, at 7 (emphasis added).

(119.) Brief of Amici Curiae Electronic Privacy Information Center(EPIC) et al. in Support of the Petitioners at 33, Sorrell v. IMSHealth, Inc., 131 S. Ct. 2653 (2011) (No. 10779), available athttp://www.scotusblog.com/case-files/cases/sorrell-v-ims-health-inc.

(120.) Narayanan & Shmatikov, supra note 11, at 122-23.

(121.) Deborah Lafky, Dep't of Health and Human Servs. Officeof the Nat'l Coordinator for Health Info. Tech., The Safe HarborMethod of De-Identification: An Empirical Test 15-19 (2009),http://www.ehcca.com/presentations/HIPAAWest4/lafky_2.pdf.

(122.) Id. at 16.

(123.) Id. at 17-18.

(124.) Id. at 19.

(125.) These findings are consistent with an earlier study thatexamined re-identification attacks under realistic conditions. See U.Blien et al., Disclosure Risk for Microdata Stemming from OfficialStatistics, 46 STATISTICA NEERLANDICA 69 (1992).

(126.) See id. at 80-81.

(127.) Even under conditions that are considered risky,re-identification of anonymized datasets is difficult to pull off due tothe "natural unreliability of measurement," which serves as anatural barrier. Walter Muller, et al., Identification Risks ofMicrodata, 24 SOC. METHODS & RES. 131, 151 (1995).

(128.) Justin Brickell & Vitaly Shmatikov, The Cost of Privacy:Destruction of Data-Mining Utility in Anonymized Data Publishing, 2008Proc. 14th ACM SIGKDD INT'L CONF. ON KNOWLEDGE DISCOVERY & DATAMINING (KDD) 70, 72; see also Narayanan & Shmatikov, supra note 11,at 114.

(129.) Dwork, supra note 107, at 9.

(130.) James P. Nehf, Recognizing the Societal Value in InformationPrivacy, 78 Wash. L. Rev. 1, 24 (2003). Similar arguments have arisen inresponse to the disclosure of information about Tay-Sachs disease in theJewish community and sickle-cell anemia in the African-Americanpopulation. Lawrence O. Gostin & Jack Hadley, Health ServicesResearch: Public Benefits, Personal Privacy, and Proprietary Interests,129 ANNALS OF INTERNAL MED. 833, 834 (1998).

(131.) Working Paper No. 22, supra note 109, at 11.

(132.) I discuss in Part IV how the Brickell and Shmatikovdefinition of privacy has misled legal scholars to believe that there isa forced choice between privacy and data utility.

(133.) Working Paper No. 22, supra note 109, at 11.

(134.) To the very limited extent group inference privacy has beentested in the courts, judges have been unwilling to recognize an impliedcontract or privacy challenge to releases of de-identified data, evenwhen the de-identified data could be used to make group inferences formarketing purposes. See London v. New Albertson's, Inc., No.08-CV-1173 H(CAB), 2008 WL 4492642, at *5-6 (S.D. Cal. Sept. 30, 2008)(holding that the disclosure of anonymous individual-level pharmacypatient data to a marketing firm did not contravene assurances from apharmacy that it "collects your personal information andprescription information only for the fulfillment of your prescriptionorder and to enable you to receive individualized customer servicebeyond what we can provide to anonymous users").

(135.) Ohm, supra note 4, at 1755.

(136.) Singel, supra note 5.

(137.) See Brickell & Shmatikov, supra note 128, at 74. Thestudy does helpfully prove that small increases in privacy protectioncause disproportionately large destruction of overall utility. Id. at78. But if privacy protocols are designed to preserve the utility of adataset for a particular research question, nothing in the studysuggests that this would not be possible.

(138.) See, e.g., S. Illinoisan v. Ill. Dep't of Pub. Health,844 N.E.2d 1, 7 (Ill. 2006).

(139.) Ohm, supra note 4, at 1716.

(140.) Id. at 1730 (footnote omitted).

(141.) See supra text accompanying notes 115-116.

(142.) 844 N.E.2d 1.

(143.) Id. at 3.

(144.) Id.

(145.) See id. at 7.

(146.) Id. at 4. The privacy standard for this case was heightened from PII to information that "tends to lead to the identity." Id. at 18 (emphasis added). Nevertheless the court found that the government failed to demonstrate that the requested data would tend to lead to the identities of the subjects. Id. at 21. Before she took the witness stand in this case, Dr. Sweeney had demonstrated that re-identification of allegedly anonymized data was possible by reverse-engineering Massachusetts medical data. See Sweeney, supra note 11.

(147.) S. Illinoisan, 844 N.E.2d at 7-8.

(148.) Id. at 8.

(149.) Id.

(150.) Id.

(151.) Id. at 4.

(152.) See Sweeney, supra note 11.

(153.) 45 C.F.R. [section] 164.514(b)(2)(i) (2010).

(154.) S. Illinoisan, 844 N.E.2d at 8.

(155.) Id.

(156.) Id.

(157.) Id. at 9 (alterations in original).

(158.) Id. at 20 (quoting S. Illinoisan v. Dep't of Pub.Health, 812 N.E.2d 27, 29 (Ill. App. Ct. 2004)).

(159.) Id. at 13.

(160.) Id.

(161.) Id. at 8.

(162.) More generally, the National Research Council has noted thatin cases where "the same data are available elsewhere, even if notin the same form or variable combination, the added risk of releasing aresearch data file may be comparatively small." COMM. ON NAT'LSTATISTICS, NAT'L RESEARCH COUNCIL, IMPROVING ACCESS TO ANDCONFIDENTIALITY OF RESEARCH DATA 12 (Christopher Mackie & NormanBradburn eds., 2000), available athttp://www.geron.uga.edu/pdfs/BooksOnAging/ConfRes.pdf.

(163.) S. Illinoisan, 844 N.E.2d at 8.

(164.) See Narayanan & Shmatikov, supra note 11, at 116.

(165.) Id. at 123 ("[H]is political orientation may berevealed by his strong opinions about 'Power and Terror: NoamChomsky in Our Times' and 'Fahrenheit 9/11,' and hisreligious views by his ratings on 'Jesus of Nazareth' and'The Gospel of John.'").

(166.) Doe Complaint, supra note 5, at 18.

(167.) Privacy policy should not aspire to regulate thesewrong-headed inferences; plenty of heterosexuals enjoyed BrokebackMountain, and plenty of liberals dislike Michael Moore. But even ifmovie reviews are windows to the soul, the marginal information gainedby re-identifying somebody in the Netflix dataset is likely to be small.

(168.) Education datasets often tie non-identifying but highlysensitive information (such as GPA or test scores) to indirectidentifiers like age, race, and geography. If individuals in thesedatabases were re-identified using the indirect identifiers, theadversary could learn something significant about the data subjects.See, e.g., Krish Muralidhar & Rathindra Sarathy, Privacy Violationsin Accountability Data Released to the Public by State EducationalAgencies, FED. COMM. ON STAT. METHODOLOGY RES. CONF. 1 (Nov. 2009),http://www.fcsm.gov/09papers/Muralidhar_VI-A.pdf.

(169.) Jeremy Albright, a researcher at the InteruniversityConsortium for Political and Social Research ("ICPSR"), notesthat the statistical disclosure control literature has considered thisapproach but has generally not put it into practice, in part becausenobody agrees on how much information the putative adversary should bepresumed to have ahead of time. Jeremy Albright, Privacy Protection inSocial Science Research: Possibilities and Impossibilities 11-12 (June1, 2010) (unpublished manuscript) (on file with author).

(170.) Ohm, supra note 4, at 1746.

(171.) See Acxiom Corp., Understanding Acxiom's MarketingProducts 1 (2010), available athttp://www.acxiom.com/uploadedFiles/Content/About_Acxiom/Privacy/AC1255-10%20Acxiom%20Marketing%20Products.pdf; Risk Solutions Product Index,LexisNexis, http://www.lexisnexis.com/risk/solutions/product-index.aspx(last visited Dec. 21, 2011).

(172.) See, e.g., Brickell & Shmatikov, supra note 128, at 70(claiming that "[r]eidentification is a major privacy threat topublic datasets containing individual records").

(173.) Nat'l Research Council, supra note 162, at 12. ThomasLouis of the University of Minnesota explains that disclosure risksassociated with a particular data release should not be compared to aprobability of zero, but that one should "consider how theprobability of disclosure changes as a result of a specific datarelease." Id. Changes to the marginal risks caused by adding ormasking certain fields in the dataset can be assessed as well. Id.

(174.) Releases of data by sophisticated data producers are expected, at a minimum, to "assure that the risk of disclosure from the released data when combined with other relevant publicly available data is very low." Working Paper No. 22, supra note 109, at 3. Of course, that begs the question of what it means for disclosure risk to be "very low." Similarly, "[t]here can be no absolute safeguards against breaches of confidentiality.... Many methods exist for lessening the likelihood of such breaches, the most common and potentially secure of which is anonymity." INT'L STAT. INST., supra note 14, at 10. Likewise, the FPCO's commentary on the newly passed FERPA regulations anticipates low risk, not the absence of risk altogether: "The regulations recognize that the risk of avoiding the disclosure of PII cannot be completely eliminated and is always a matter of analyzing and balancing risk so that the risk of disclosure is very low." FAMILY POLICY COMPLIANCE OFFICE, FAMILY EDUCATIONAL RIGHTS AND PRIVACY ACT, FINAL RULE, 34 CFR PART 99: SECTION-BY-SECTION ANALYSIS 11 (2008), available at http://www.ed.gov/policy/gen/guid/fpco/pdf/ht12-17-08-att.pdf.

(175.) Ohm, supra note 4, at 1717-20; see also Nate Anderson,"Anonymized" Data Really Isn't--And Here's Why Not,ARS TECHNICA (Sept. 8, 2009, 5:30 AM),http://arstechnica.com/tech-policy/news/2009/09/your-secrets-live-online-in-databases-ofruin.ars.

(176.) Michael Barbaro & Tom Zeller Jr., A Face is Exposed forAOL Searcher No. 4417749, N.Y. TIMES, Aug. 9, 2006, at A1.

(177.) Ohm, supra note 4, at 1729.

(178.) Id. at 1728.

(179.) Muralidhar & Sarathy, supra note 168, at 1.

(180.) Id. at 20.

(181.) RICHARD A. MOORE, JR., U.S. BUREAU OF THE CENSUS, CONTROLLEDDATA-SWAPPING TECHNIQUES FOR MASKING PUBLIC USE MICRODATA SETS 25-26,available at http://www.census.gov/srd/papers/pdf/rr96-4.pdf.

(182.) Duff Wilson, Database on Doctor Discipline is Restored, withRestrictions, N.Y. TIMES, Nov. 10, 2011, at B2 (News organizationslinked identifiable court filings to a national databank of doctordisciplinary actions in order to criticize the disciplinary boards. Thejournalists re-identified doctors who had a known, long history ofmalpractice actions against them to the "de-identified" dataon disciplinary actions. The public-use data employed trivialanonymization--the removal of names only.).

(183.) See NAT'L RESEARCH COUNCIL, supra note 162, at 48;Hermann Habermann, Ethics, Confidentiality, and Data Dissemination, 22J. of Official Stat. 599, 603 (2006).

(184.) SPECIALISTS MKTG. SERVS., INC., MAILING LIST CATALOG,available at http://directdatamailinglists.com/SMS-catalog.pdf.

(185.) See, e.g., DANIEL SOLOVE, THE DIGITAL PERSON 82-83, 173-74(2008), available athttp://docs.law.gwu.edu/facweb/dsolove/Digital-Person/text.htm; Ohm,supra note 4, at 1729; Brickell & Shmatikov, supra note 128, at 70(claiming that "[r]e-identification is a major privacy threat topublic datasets containing individual records"). Thomas M. Lenardand Paul H. Rubin notice this phenomenon, observing that whileSolove's study "lists harms associated with information use,he does not quantify how frequent or serious they are." THOMAS M.LENARD & PAUL H. RUBIN, TECH. POLICY INST., IN DEFENSE OF DATA:INFORMATION AND THE COSTS OF PRIVACY 43 (2009),http://www.techpolicyinstitute.org/files/in%20defense%20of%20data.pdf.

(186.) Ohm, supra note 4, at 1748.

(187.) Pharmatrak, Inc. collected personally identifiable data on web visitors to its pharmaceutical industry clients using clear GIFs (or "cookies") in direct contravention of the Electronic Communications Privacy Act. This practice was exposed and resulted in a class action lawsuit. In re Pharmatrak, Inc., 329 F.3d 9, 12 (1st Cir. 2003). HBGary Federal considered hacking into the networks of its clients' foes in order to gather evidence for smear campaigns, but these practices were uncovered, ironically enough, during a hack into their own servers. See Eric Lipton & Charlie Savage, Hackers' Clash with Security Firm Spotlights Inquiries to Discredit Rivals, N.Y. TIMES, Feb. 11, 2011, at A15.

(188.) Paul Syverson, The Paradoxical Value of Privacy, 2d ANN.WORKSHOP ON ECON. & INFO. SECURITY 2 (2003),http://www.cpppe.umd.edu/rhsmith3/papers/Final_session3_syverson.pdf.

(189.) The Notable Decline of Identity Fraud, HELP NET SECURITY(Feb. 8, 2011), www.net-security.org/secworld.php?id=10551; see alsoLenard & Rubin, supra note 185, at 34-35. The aggregate data cannotdirectly answer the question about the relationship between public dataand identity theft. Ironically, microdata is required to reliably testthis theory of covert re-identification.

(190.) See The Notable Decline of Identity Fraud, supra note 189.

(191.) This is at the heart of Peter Swire's criticism ofscholars like me who attempt to compare the costs and benefits ofprivacy. See Peter Swire, Privacy and the Use of Cost/Benefit Analysis4, 10 (June 18, 2003) (unpublished manuscript), available athttp://www.ftc.gov/bcp/workshops/infoflows/present/swire.pdf.

(192.) Frank Abagnale stresses the importance of eliminating thegarbage and paper trail to reduce the risk of identity fraud. SeeAbagnale Recommends Fraud Protection Strategy: Audio, BLOOMBERG (Nov.15, 2010), http://www.bloomberg.com/news/2010-11-15/abagnale-recommends-fraud-protection-strategy-audio.html.

(193.) LENARD & RUBIN, supra note 185, at 38-39.

(194.) See Bernardo A. Huberman, Eytan Adar & Leslie R. Fine,Valuating Privacy, IEEE SECURITY & PRIVACY, Sept.-Oct. 2005, at 22,22-24.

(195.) For example, see the salary information available online forUniversity of Michigan employees athttp://www.umsalary.info/deptsearch.php.

(196.) State Worker Salary Search: Top Salaries Earned in 2010, THESACRAMENTO BEE, http://www.sacbee.com/statepay (last visited Dec. 21,2011); see also Comm'n on Peace Officer Standards and Training v.Superior Court, 165 P.3d 462, 465 (Cal. 2007).

(197.) See, e.g., Pantos v. City & Cnty. of S.F., 198 Cal.Rptr. 489, 491 (Cal. Ct. App. 1984); Forum Commc'ns Co. v. Paulson,752 N.W.2d 177, 185 (N.D. 2008).

(198.) See Cox Broad. Corp. v. Cohn, 420 U.S. 469, 496 (1975); Fla.Star v. B.J.F., 491 U.S. 524, 538 (1989).

(199.) Lipton & Savage, supra note 187.

(200.) Id.

(201.) Id. Ironically, it was HBGary Federal's ownnetworks' vulnerabilities that it should have been focusing on, asthe hacker group Anonymous hacked into HBGary Federal's servers andreleased several emails and PowerPoint presentations on Wikileaks. Id.

(202.) Hackers Attack UNC-Based Mammography Database, UNC HEALTHCARE (Sept. 25, 2009),http://news.unchealthcare.org/som-vital-signs/archives/vital-signs-sept-25- 2009/hackers-attack-unc-based-mammography-database.

(203.) Hackers Get Into U.C. Berkeley Health-Records Database,FOXNEWS.COM (May 8, 2009),http://www.foxnews.com/story/0,2933,519550,00.html.

(204.) Brian Krebs, Hackers Break into Virginia Health ProfessionsDatabase, Demand Ransom, WASH. POST SECURITY FIX (May 4, 2009, 6:39 PM),http://voices.washingtonpost.com/securityfix/2009/05/hackers_break_into_virginia_he.html.

(205.) See Derek E. Bambauer & Oliver Day, The Hacker'sAegis, 60 EMORY L.J. 1051, 1101 (2011); see also Larry Barrett, DataTheft Trojans, Black Market Cybercrime Tools on the Rise, ESECURITYPLANET (Mar. 31, 2010),http://www.esecurityplanet.com/trends/article.php/3873891/Data-Theft-Trojans- Black-Market-Cybercrime-Tools-on-theRise.htm.

(206.) See Bambauer & Day, supra note 205, at 1060-62; JaziarRadianti & Jose J. Gonzalez, Toward a Dynamic Modeling of theVulnerability Black Market 4-7 (Oct. 23-24, 2006) (unpublishedmanuscript), available athttp://wesii.econinfosec.org/draft.php?paper_id=44.

(207.) See supra Part III.

(208.) Milt Freudenheim, A New Push to Protect Health Data, N.Y.TIMES, May 31, 2011, at B1.

(209.) See Morning Edition: MGH Settles for $1M over Lost HIV/AIDSRecords, NAT'L PUB. RADIO (Feb. 25, 2011),http://www.wbur.org/2011/02/25/mgh-privacy.

(210.) In Massachusetts General Hospital's case, the employeewas a billing manager, and not a low skilled employee. Id. But records,particularly consumer records, are often in the hands of low skill dataprocessors or outsourced to third parties that process the dataoffshore. See, e.g., Outsourcing Data Entry Privacy Policy, DATA ENTRYSERVICES INDIA, http://www.dataentryservices.co.in/privacy_policy.htm(last visited Dec. 21, 2011). Not everybody is comfortable with the riskthat accompanies routine data-handling. Parents of a student whoparticipated in a research survey at their child's school attemptedto mount a legal challenge based on the potential privacy risks that anadministrator might divulge their child's informationinadvertently, but the suit was dismissed. C.N. v. Ridgewood Bd. ofEduc., 430 F.3d 159, 161 (3d Cir. 2005).

(211.) Sandra T.M. Chong, Data Privacy: The Use of Prisoners forProcessing Personal Information, 32 U.C. DAVIS L. REV. 201, 204 (1998).

(212.) See Cate, supra note 9, at 12-16. Poor encryption practicesare an excellent target for effective privacy regulation. There is noreason for a business or agency to fail to encrypt its files thatcontain personally identifiable information. See Derek E. Bambauer,Rules, Standards, and Geeks, 5 BROOK. J. CORP. FIN. & COM. L. 49,56-57 (2010).

(213.) A wiser target is the law surrounding data securitybreaches. See, e.g., Paul M. Schwartz and Edward J. Janger, Notificationof Data Security Breaches, 105 MICH. L. REV. 913 (2007); see alsoBambauer, supra note 212, at 49.

(214.) See supra text accompanying notes 20-21.

(215.) Id.

(216.) Id.

(217.) FTC PRIVACY REPORT, supra note 5, at 43, 51-52.

(218.) See Directive 95/46/EC, of the European Parliament and ofthe Council of 24 October 1995 on the Protection of Individuals withRegard to the Processing of Personal Data and on the Free Movement ofSuch Data, 2001 O.J. (L 281) 31, 34, 40. Note that under Recital 29,processing for statistical purposes is, at least, not a use inconsistentwith any other use for which the data may be processed. Id. at 34.Social science researchers often have to perform their analyses at thephysical location of the data enclaves. See, e.g., Stefan Bender et al.,Improvement of Access to Data Set from the Official Statistics 4-5(German Council for Soc. and Econ. Data, Working Paper No. 118, 2009),available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1462086.

(219.) Legislation seeking to limit the storage of data has alreadybeen proposed. See, e.g., Eliminate Warehousing of Consumer InternetData Act of 2006, H.R. 4731, 109th Cong. (2006).

(220.) Ohm, supra note 4, at 1762; Daniel J. Solove,Conceptualizing Privacy, 90 CALIF. L. REV. 1087, 1091-93 (2002).

(221.) Ohm, supra note 4, at 1764.

(222.) For instance, under HIPAA, the public release of healthinformation requires the covered entity to prepare the data such that"there is no reasonable basis to believe that the information canbe used to identify an individual." 45 C.F.R. [section] 164.514(a)(2010). Releases of identifiable health information to a businessassociate, on the other hand, are permitted so long as the businessassociate makes assurances that it will guard and handle the health datain a manner consistent with the covered entity's responsibilitiesunder HIPAA. Id. [section] 164.502(e)(1)(i) (2010). Any additionalrestrictions the covered entity might wish to impose are left to theoriginal data-holder's discretion.

(223.) Derek Bambauer argues that rules are more helpful than standards in contexts where three conditions are met: (1) when the specified minimum standard for behavior will suffice most or all of the time, (2) when the standard degrades slowly, and (3) when monitoring for harm is low-cost and accurate. Bambauer, supra note 212, at 50. Here, the first condition is met because, as I argued earlier, re-identification attacks performed on anonymized data are difficult, and anonymization has sufficed to prevent re-identification attacks. See supra Parts III, IV. The second and third conditions are developed in this Part.

(224.) AOL failed to strip the dataset of last names. Thisoversight, in combination with multiple searches for a particularneighborhood, led to the re-identification of Thelma Arnold. Barbaro& Zeller, supra note 176.

(225.) FTC PRIVACY REPORT, supra note 5, at 36, 38. The AOL story,along with the Netflix study, was the support for the FTC'sbroad-reaching conclusion that "businesses combine disparate bitsof 'anonymous' consumer data from numerous different onlineand offline sources into profiles that can be linked to a specificperson." Id.

(226.) The Centers for Disease Control and Prevention anticipatesaggregated tables using a threshold value of three. CTRS. FOR DISEASECONTROL AND PREVENTION & HEALTH RES. SERVS. ADMIN., INTEGRATEDGUIDELINES FOR DEVELOPING EPIDEMIOLOGIC PROFILES 126 (2004), availableat http://www.cdc.gov/hiv/topics/surveillance/resources/guidelines/epiguideline/pdf/epi_guidelines.pdf.

(227.) See Sweeney, supra note 18, at 557.

(228.) For example consumer preferences and information containedon a Facebook "wall" are not indirect identifiers in myscheme.

(229.) For example, the Public-Use Microdata Samples ("PUMSfiles") report data on a sample of U.S. households. See Public-UseMicrodata Samples (PUMS), U.S. CENSUS BUREAU,http://www.census.gov/main/www/pums.html (last updated May 28, 2010).

(230.) This assumption can fail in circumstances where a potential data subject is unusual. If the indirect identifiers included in the dataset uniquely describe a person in the broad population of people that could potentially be included in the sample, an adversary will be able to check whether that person actually is included in the sample (and identify him if he is). For example, suppose only one veterinarian in Delaware identifies himself as a Native American; a dataset that included profession, state, and detailed race information cannot rely on an unknown sampling frame to ensure anonymity because any dataset including these indirect identifiers would immediately identify the individual in question as being a member of the dataset.
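To make the uniqueness concern concrete, the following is a minimal sketch (my own illustration, not part of the original analysis) of how a data producer might screen for combinations of indirect identifiers that are unique in the broader population before relying on an unknown sampling frame. The file name and column names are hypothetical, and the sketch assumes a population-level reference file is available.

 import pandas as pd

 # Hypothetical population-level reference file (e.g., licensure rolls
 # joined to census counts); the columns are illustrative only.
 population = pd.read_csv("population_reference.csv")

 quasi_identifiers = ["profession", "state", "race"]

 # Count how many people in the broad population share each combination
 # of indirect identifiers.
 combo_counts = population.groupby(quasi_identifiers).size()

 # Combinations that describe exactly one person (the lone Native American
 # veterinarian in Delaware, in the footnote's example) cannot be protected
 # by an unknown sampling frame; they should be generalized or suppressed.
 print(combo_counts[combo_counts == 1])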

(231.) See, e.g., Khaled El Emam & Fida Kamal Dankar, Protecting Privacy Using k-Anonymity, 15 J. AM. MED. INFO. ASS'N 627, 634-35 (2008).

(232.) See George T. Duncan, Confidentiality and Data Access Issues for Institutional Review Boards, in PROTECTING PARTICIPANTS AND FACILITATING SOCIAL AND BEHAVIORAL SCIENCES RESEARCH 235, 235 (Constance F. Citro et al. eds., 2003).

(233.) See Nat'l Cable Television Ass'n v. FCC, 479 F.2d 183, 195 (D.C. Cir. 1973); Tax Analysts and Advocates v. IRS, 362 F. Supp. 1298, 1307 (D.D.C. 1973) (quoting Nat'l Cable Television Ass'n, 479 F.2d at 195).

(234.) See J. Trent Alexander, Michael Davern & Betsey Stevenson, Inaccurate Age and Sex Data in Census PUMS Files 1-3 (CESifo Working Paper No. 2929, 2010), available at http://ssrn.com/abstract=1546969; Steven Levitt, Can You Trust Census Data?, FREAKONOMICS (Feb. 2, 2010, 11:09 AM), http://www.freakonomics.com/2010/02/02/canyou-trust-census-data.

(235.) See Bartnicki v. Vopper, 532 U.S. 514, 515, 518 (2001); Fla. Star v. B.J.F., 491 U.S. 524, 525 (1989); Sidis v. F-R Publ'g Corp., 113 F.2d 806, 807-09 (2d Cir. 1940).

(236.) See Desnick v. ABC, 44 F.3d 1345, 1354-55 (7th Cir. 1995); Erwin Chemerinsky, Protect the Press: A First Amendment Standard for Safeguarding Aggressive Newsgathering, 33 U. RICH. L. REV. 1143, 1160 (2000). But see Food Lion, Inc. v. ABC, 194 F.3d 505, 521 (4th Cir. 1999).

(237.) C. Thomas Dienes, Protecting Investigative Journalism, 67 GEO. WASH. L. REV. 1139, 1143 (1999).

(238.) Chemerinsky, supra note 236, at 1160.

(239.) In order to trigger criminal protection against re-identification, the dataset must be properly anonymized in accordance with the requirements affording safe harbor protection. This prevents users of a poorly anonymized dataset from incurring criminal liability.

(240.) If the sampling frame is unknown, non-public information can consist of information reported about the subject in the dataset or even the mere fact that the subject is in the dataset.

(241.) I do not believe this precaution is necessary to avoid "thought crimes." After all, tort law has made actionable some forms of observation in public. For example, even mere public surveillance can be actionable under the tort of intrusion. See Summers v. Bailey, 55 F.3d 1564, 1566 (11th Cir. 1995); Nader v. Gen. Motors Corp., 255 N.E.2d 765, 769-71 (N.Y. 1970). The reverse-engineering of an anonymized dataset is at least as intrusive and requires just as much actus reus.

(242.) INST. OF MED. OF THE NAT'L ACADS., BEYOND THE HIPAA PRIVACY RULE: ENHANCING PRIVACY, IMPROVING HEALTH THROUGH RESEARCH 265 (Sharyl J. Nass et al. eds., 2009), available at http://www.ncbi.nlm.nih.gov/books/NBK9578/pdf/TOC.pdf.

(243.) Robert Gellman, The Deidentification Dilemma: A Legislative and Contractual Proposal, 21 FORDHAM INTELL. PROP. MEDIA & ENT. L.J. 33, 51-52 (2010). Paul Ohm also suggests that regulators should consider prescribing "new sanctions--possibly even criminal punishment--for those who reidentify." Ohm, supra note 4, at 1770. Both Gellman's and the Institute of Medicine's proposals restrict researchers from sharing the de-identified data outside their research teams. Gellman, supra, at 51-52; INST. OF MED. OF THE NAT'L ACADS., supra note 242, at 49-50. My proposal does not prohibit re-disclosure of anonymized data.

(244.) See, e.g., Restricted Data Use Agreement, ICPSR, http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/restricted/agreement.jsp (last visited Dec. 21, 2011).

(245.) Family Educational Rights and Privacy, 73 Fed. Reg. 74,806, 74,832 (Dec. 9, 2008).

(246.) The student might be able to bring a claim based on the tort of public disclosure of private facts. See RESTATEMENT (SECOND) OF TORTS [section] 652D (1977).

(247.) See John E. Calfee & Richard Craswell, Some Effects of Uncertainty on Compliance with Legal Standards, 70 VA. L. REV. 965, 966 (1984). This reaction is also reflected in the high prices firms pay for criminal liability insurance. See Miriam H. Baer, Insuring Corporate Crime, 83 IND. L.J. 1035, 1036 (2008).

(248.) Ohm, supra note 4, at 1748, 1757.

(249.) Id. at 1757.

(250.) The Computer and Invasion of Privacy: Hearings Before a Subcomm. of the Comm. on Gov't Operations, 89th Cong. 3 (July 26-28, 1966).

(251.) The congressional hearings in the late 1960s led to the passage of the Privacy Act of 1974. The Privacy Act of 1974, EPIC, http://epic.org/privacy/1974act (last visited Dec. 21, 2011) [hereinafter EPIC Privacy Act Report]. This law bars government agencies from collecting, sharing, and retaining information that is not necessary for carrying out official duties. 5 U.S.C. [section] 552a(b) (2006). But the Privacy Act is the result of an odd collection of compromises, EPIC Privacy Act Report, supra, so its ability to protect against the creation of data profiles is limited. It contains a number of exceptions, including the routine use exemption (which is arguably the exception that swallows the rule), id. [section] 552a(b)(3), and exceptions for law enforcement investigations, id. [section] 552a(b)(7). For a criticism of the routine use exemption, see Paul M. Schwartz, Privacy and Participation: Personal Information and Public Sector Regulation in the United States, 80 IOWA L. REV. 553, 584-87 (1995), and Robert Gellman, Does Privacy Law Work?, in TECHNOLOGY AND PRIVACY: THE NEW LANDSCAPE 193, 198 (Philip E. Agre & Marc Rotenberg eds., 1997).

(252.) See supra Part III.

(253.) This example comes from a dissenting opinion in a recent medical privacy lawsuit. See IMS Health Inc. v. Sorrell, 630 F.3d 263, 283 (2d Cir. 2010) (Livingston, J., dissenting), aff'd, 131 S. Ct. 2653 (2011).

(254.) See Ohm, supra note 4, at 1725-27. Likewise, EPIC has the same conviction, claiming that the harms caused by the release of the (non-)anonymized AOL search query data will increase over time since re-identifying more AOL subjects will be easier as more and more data enters the public domain. See Re-identification, supra note 5.

(255.) (0.9)^3 = 0.73. And, of course, the adversary will not know which twenty-seven percent of entries contain the expected errors.
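Stated generally (my gloss, not the article's): if each of k fields used to match a record is independently accurate with probability p, the record matches on all k fields with probability p^k. For the figures in this note,

 \Pr(\text{all three fields correct}) = 0.9^{3} = 0.729 \approx 0.73,

so roughly twenty-seven percent of entries will fail to match on at least one field.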

(256.) Muller, et al., supra note 127.

(257.) Even commercial data aggregation, which has the luxury of linking identified information, is riddled with error. Joel Stein documented the false information in his own commercial profiles in a recent Time article. Joel Stein, Your Data, Yourself, TIME, Mar. 21, 2011, at 40. Though the profiles are useful for advertising purposes, they suggest that a "database of ruin" is a fantasy well out of reach. One of the commercial databases believed that Joel Stein was an eighteen- to nineteen-year-old woman. Id.

(258.) Occasionally this kind of prediction is accomplished by reminiscing about simpler times. Jeffrey Rosen, for example, believes the Internet compares unfavorably to the villages described in the Babylonian Talmud. Jeffrey Rosen, The End of Forgetting, N.Y. TIMES, July 25, 2010, [section] MM (Magazine), at 30.

(259.) William McGeveran was quoted as making this critique in a recent New York Times article. Natasha Singer, Technology Outpaces Privacy (Yet Again), N.Y. TIMES, Dec. 11, 2010, at BU3.

(260.) See supra text accompanying notes 229-234 for a description of sampling frames and how they can be used to strengthen the anonymization of data.

(261.) See supra text accompanying notes 226-228. The toughest choices will involve information that is frequently the subject of self-revelation on the Internet (e.g., preferences or movie ratings). Id. Also, replacing indirect identifiers with random codes does not automatically convert an indirect identifier into a non-identifier. See supra note 73.

(262.) Top-coding is frequently used on income data, for a slightly different purpose than I discuss here. Income is a variable that can be used as an indirect identifier when the value is extremely high. While most people are not identifiable by their income, the very richest members of a community might be. Top-coding income to prevent this re-identification risk preserves k-anonymity and is a form of subgroup cell size control. Thus, income top-coding recodes more than just the highest income. The Checklist on Disclosure Potential of Proposed Data Releases, prepared by the Federal Committee on Statistical Methodology, suggests top-coding the upper limit of income distributions. Working Paper No. 22, supra note 109, at 103. Additional measures may be taken if a subgroup is too homogeneous with respect to a sensitive attribute.
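As an illustration of the recoding described in this note, the sketch below is my own and is not drawn from the Checklist: it top-codes an income column at a chosen percentile so that the very richest records no longer stand out as indirect identifiers. The 99th-percentile cutoff and the column name are assumptions.

 import pandas as pd

 def top_code_income(df: pd.DataFrame, column: str = "income",
                     quantile: float = 0.99) -> pd.DataFrame:
     # Replace incomes above the chosen percentile with the cutoff value,
     # so extreme values can no longer single out a data subject.
     cutoff = df[column].quantile(quantile)
     out = df.copy()
     out[column] = out[column].clip(upper=cutoff)
     return out

 # Example usage: released = top_code_income(microdata, "income", 0.99)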

(263.) Databases rarely cover the same populations since data producers have noted the high risk of overlapping disclosures on the same sample population. See Working Paper No. 22, supra note 109, at 82.

(264.) Fish v. Dallas Indep. Sch. Dist., 170 S.W.3d 226, 231 (Tex. App. 2005).

(265.) The California Supreme Court recently came to the preposterous holding that ZIP codes, alone, constitute "personal identification information." Pineda v. Williams-Sonoma Stores, Inc., 246 P.3d 612, 615, 618 (Cal. 2011). The defendant had used ZIP codes in conjunction with names in order to find the addresses of customers (and then used the data for marketing purposes). Id. at 615. The court could have solved this consumer privacy problem by ruling that ZIP codes, when combined with names, constituted PII. Instead, the court expanded the definition of PII to absurd proportions by finding that ZIP codes alone are PII. Id. at 620.

(266.) 395 F. App'x 472 (9th Cir. 2010).

(267.) Long v. IRS, No. C74-724P, 2006 WL 1041818, at *6 (W.D. Wash. Apr. 3, 2006).

(268.) Id. at *3.

(269.) Id.

(270.) Id.

(271.) Id.

(272.) Id.

(273.) Id. at *4.

(274.) Long v. IRS, 395 F. App'x 472, 475 (9th Cir. 2010).

(275.) Id.

(276.) Id. at 475-76.

(277.) Some courts have gotten it right. See, e.g., Conn. Dep't of Admin. Servs. v. Freedom of Info. Comm'n, No. CV95550049, 1996 WL 88490 (Conn. Super. Ct. Feb. 9, 1996) (finding that a table showing the percentage of job applicants for a librarian position that identified themselves as having a physical handicap was not privacy-violating because the pool of applicants could not be identified).

(278.) Working Paper No. 22, supra note 109, at 16.

(279.) Id.

(280.) The table could pose problems if there are very few highly educated parents in a given county. Suppose, for example, that Alpha County had only one head of household with very high education. Then members of the community might be able to discern that the head of household in question has a delinquent child. The definition of "unknown sampling frame" provided earlier in this section guards against these scenarios.

(281.) In the discussion of Southern Illinoisan in Part III, I discuss how an aggregated table can be used to slightly increase the chance of re-identification when used by a sophisticated adversary (of dubitable existence), but small cell sizes are no more vulnerable than large ones to these tactics.

(282.) CONFIDENTIALITY AND DATA ACCESS COMM. & FED. COMM. ON STATISTICAL METHODOLOGY, CONFIDENTIALITY AND DATA ACCESS ISSUES AMONG FEDERAL AGENCIES 4 (2001), available at http://fcsm.gov/committees/cdac/brochur10.pdf ("For example, a two-dimensional frequency count table may have rows corresponding to employment sectors (industry, academia, nonprofit, government, military) and columns corresponding to income categories (in increments of $10,000).... Using this example, such a tabulation could result in a disclosure of confidential information if ... only 2 cases of any sector fell into the same income category (permitting the conclusion on the part of anyone privy to the information about one of the cases, to know the income of the other).").
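The check described in the quoted example can be mechanized. The sketch below is my own illustration, not the Committee's procedure: it cross-tabulates two categorical variables and flags cells whose counts are positive but fall below a minimum threshold (three, following the CDC guideline cited in note 226), marking them as candidates for suppression or collapsing before release. The column names are hypothetical.

 import pandas as pd

 def flag_small_cells(df: pd.DataFrame, row_var: str, col_var: str,
                      threshold: int = 3) -> pd.DataFrame:
     # Build a two-way frequency table; cast to object so suppressed cells
     # can hold a text marker alongside the remaining counts.
     table = pd.crosstab(df[row_var], df[col_var]).astype(object)
     # A cell is risky when it is nonzero but smaller than the threshold.
     risky = (table > 0) & (table < threshold)
     # Keep safe cells, replace risky ones with a suppression marker.
     return table.where(~risky, other="suppressed")

 # Example usage: flag_small_cells(records, "sector", "income_band", threshold=3)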

(283.) Muralidhar & Sarathy, supra note 168, at 9 (table reformatted by author).

(284.) Id.

(285.) This is how the FPCO responded to requests for better guidance on the application of education privacy law to de-identified data:

 In response to requests for guidance on what specific steps and methods should be used to de-identify information ... it is not possible to prescribe or identify a single method to minimize the risk of disclosing personally identifiable information in redacted records or statistical information that will apply in every circumstance.... This is because determining whether a particular set of methods for deidentifying data and limiting disclosure risk is adequate cannot be made without examining the underlying data sets, other data that have been released, publicly available directories, and other data that are linked or linkable to the information in question.

Family Educational Rights and Privacy, 73 Fed. Reg. 74,806, 74,835 (Dec. 9, 2008). The FPCO is abandoning its responsibility to provide guidance on anonymization practices because it cannot provide a fool-proof, step-by-step instruction manual applicable to every scenario.

(286.) Working Paper No. 22, supra note 109, at 24-33.

(287.) Id. at 24.

(288.) The Checklist on Disclosure Potential of Proposed Data Releases succeeds in providing some guidance on the sort of issues that must be considered when preparing a public-use microdata file. See INTERAGENCY CONFIDENTIALITY AND DATA ACCESS GROUP, FED. COMM. ON STATISTICAL METHODOLOGY, CHECKLIST ON DISCLOSURE POTENTIAL OF PROPOSED DATA RELEASES 6-17 (1999), available at http://fcsm.gov/committees/cdac/checklist_799.doc. But the guidance goes over the heads of the average government administrator, unfamiliar with "sampling frame[s]," "matching," and "nesting variables." Id. Like the other resources, the Checklist increases concern without providing clear principles.

(289.) Barbara J. Evans, Congress' New Infrastructural Model of Medical Privacy, 84 NOTRE DAME L. REV. 585, 626 (2009).

(290.) Professor Sander's raw data and other study materials are on file with the author and are being used with permission from Professor Sander.

(291.) These schools received several inquiries in a variety of formats, including, at the very least, two mailed letters, two e-mails, and two phone calls. The project's logs for schools that were unresponsive read like parodies of bureaucratic inefficiency. Here is an example (names and contact information redacted):

 [10/17] VW said she never got [the request], and to speak to ES. [Phone number]. Spoke to ES, told her we would resend request. ASW 6/5: Letter mailed and emailed to ES. ASW 6/13: Recd email from ES acknowledging request and advising that it would be more than $150; they will advise us of the cost soon. TP 8/15/8 spoke with ES who said she does not remember our request but will check on it and get back to us. TP send a follow-up e-mail to [email address] // 11/19/08 ES assistant said she is out of the office for the week. Lft msg on voice mail.//12/09/08 TP got a hold of ES who connected me to vice Chancelor JP. JP asked that I e-mail him the requests. I did on same day.//1/9/9 TP left phone message for JP.

(292.) The University of Maryland Law School invoiced Professor Sander $3,700 for the data--an amount thirty-seven times the average cost estimates.

(293.) CHARLES J. SYKES, THE END OF PRIVACY 135 (1999) (noting that, in the context of medical privacy, "[i]n an age where ... medical datawebs cover the country from coast to coast, only uniform standards have any reasonable prospect of assuring patient confidentiality"); Andrew B. Serwin, Privacy 3.0--The Principle of Proportionality, 42 U. MICH. J.L. REFORM 869, 875 (2009) (finding that inconsistent legal standards cannot meet society's need for privacy).

(294.) The instruction manual for applying to use a dataset held by the National Center for Education Statistics is fifty-six pages long. See INST. OF EDUC. SCIS., U.S. DEP'T OF EDUC., RESTRICTED-USE DATA PROCEDURES MANUAL (2011), available at http://nces.ed.gov/pubs96/96860rev.pdf.

(295.) The National Center for Health Statistics ("NCHS") changed its data access policies in 2005 and pulled some previously public data files into a research enclave that requires pre-approval and the payment of a fee. See NCHS Data Release and Access Policy for Micro-data and Compressed Vital Statistics Files, CENTERS FOR DISEASE CONTROL AND PREVENTION, http://www.cdc.gov/nchs/nvss/dvs_data_release.htm (last updated Apr. 26, 2011). For a description of the process to apply for access to the research enclave, see NCHS Research Data Center, CENTERS FOR DISEASE CONTROL AND PREVENTION, http://www.cdc.gov/rdc (last updated Nov. 3, 2009).

(296.) Department of Transportation and Related Agencies Appropriations Act, Pub. L. No. 106-69, [section] 350, 113 Stat. 986, 1025-26 (1999); Cate, supra note 9, at 12.

(297.) Cate, supra note 9, at 15. Opt-in requirements produce insurmountable selection bias problems because the people who opt into the study (or those who do not) often share characteristics. Researchers cannot assume that the subjects who have chosen to opt in are typical or representative of the general population. Bas Jacobs, Joop Hartog & Wim Vijverberg, Self-Selection Bias in Estimated Wage Premiums for Earnings Risk, 37 EMPIRICAL ECON. 271, 272 (2009).

(298.) Ohm, supra note 4, at 1749. The term "inchoate harm" is inappropriate in the context of research data. It evokes images of a loaded gun--something nefarious and unnecessarily dangerous. Privacy harms can be described as "inchoate" when a sensitive piece of information has been exposed to public view, and it is unclear whether or when it will be harmfully linked to a data subject. See id. at 1749-50. This is an excellent approach for data spills (the accidental release of identifiable data). But in anonymized form, research data is no different from the data banks sitting on a server or even a personal computer. While it is susceptible to an intervening wrong, its existence is not, in itself, wrongful.

(299.) Ohm, supra note 4, at 1736.

(300.) See supra note 6.

(301.) For example, in the complaint in a lawsuit against Apple for the disclosure of data (which Apple claims was anonymized), the data was described as "confidential information and personal property that [the data subjects] do not expect to be available to an unaffiliated company." Complaint at 5, Lalo v. Apple Inc., 2010 WL 5393496 (N.D. Cal. Dec. 23, 2010) (No. 5:10-cv-05878-PSG) [hereinafter Apple Complaint].

(302.) See, e.g., In re Pharmatrak, Inc., 329 F.3d 9, 16 (1st Cir. 2003); Apple Complaint, supra note 301.

(303.) See supra note 6. Paul Schwartz challenges a simple property model for information privacy by noting that consumers will foreseeably sell their alienable information for too little compensation. Schwartz, supra note 6, at 2091. Schwartz embraces many of the aspects of a property model, but also proposes that government regulation should provide a right of exit (or claw-back) and a realm of inalienability. Id. at 2094-116.

(304.) The sound choice between liability and property rules will look to both efficiency and distributional goals. Guido Calabresi & A. Douglas Melamed, Property Rules, Liability Rules, and Inalienability: One View of the Cathedral, 85 HARV. L. REV. 1089, 1110 (1972) (explaining how liability rules facilitate the combination of efficiency and distributive results which would be difficult to achieve under property rules).

(305.) Eleanor Singer, Nancy A. Mathiowetz & Mick P. Couper, The Impact of Privacy and Confidentiality Concerns on Survey Participation: The Case of the 1990 U.S. Census, 57 PUB. OPINION Q. 465, 479 (1993).

(306.) While the U.S. Census Bureau has had no recent (known) confidentiality breaches, the Bureau did transfer confidential records to the U.S. Department of Justice during World War II to facilitate identifying and rounding up Japanese-Americans and placing them into internment camps. See JR Minkel, Confirmed: The U.S. Census Bureau Gave Up Names of Japanese-Americans in WW II, SCI. AM. (Mar. 30, 2007), http://www.scientificamerican.com/article.cfm?id=confirmed-the-us-census-b.

(307.) See Hardin, supra note 8, at 1244. Each data subject will view their decision to take their own data out of the commons as the optimal choice: the data commons is rich enough to allow for research, but their own data is not exposed to risk of re-identification. If many people arrived at this same choice in the course of their own independent evaluations, there would be no commons left. I discuss the differences between the data commons and the traditional tragedy of the commons in Part I. See supra note 8.

(308.) See generally Pertussis (Whooping Cough), CENTERS FOR DISEASE CONTROL & PREVENTION, http://www.cdc.gov/pertussis/index.html (last updated Aug. 22, 2011).

(309.) See Chris Mooney, Why Does the Vaccine/Autism Controversy Live On?, DISCOVER MAG., June 2009, at 58, 58-59.

(310.) Ijeoma Ejigiri, The Resurgence of Pertussis: Is Lack of Adult Vaccination to Blame?, CLINICAL CORRELATIONS (Feb. 23, 2011), http://www.clinicalcorrelations.org/?p=3951.

(311.) Khaled El Emam et al., The Case for De-identifying Personal Health Information 21-29 (Jan. 18, 2011) (unpublished manuscript), available at http://ssrn.com/abstract=1744038.

(312.) Id. at 27.

(313.) Id. at 25-28.

(314.) If a property rule is crafted to avoid over-protection, then it will likely end up in a form that is under-protective. If we were to determine that the data subject had alienated his right to the information as soon as he gave it to the data producer (say, a retailer or his doctor), then a property regime would constrain the state from interfering with the data producer's use, no matter how badly the original data subject was under-compensated. This is not sound policy in the majority of contexts in which data is collected--where the information is given for a purpose without concrete attention to the additional uses (in identifiable form or not) to which the data will be put.

(315.) Fred Cate has argued that democratic values would benefit from a shift away from property rights, though he sees value, often overlooked by the legal academy, in allowing private entities to use data for secondary uses. See Cate, supra note 9, at 12.

(316.) Ohm advises regulators to compare the risks of unfettered information flow to its likely costs in privacy. Ohm, supra note 4, at 1768.

(317.) Warren & Brandeis, supra note 43, at 214.

(318.) Ohm, supra note 4, at 1716.

(319.) Douglas J. Sylvester & Sharon Lohr, Counting on Confidentiality: Legal and Statistical Approaches to Federal Privacy Law after the USA Patriot Act, 2005 WIS. L. REV. 1033, 1113 (2005).

(320.) Kang & Buchner, supra note 6, at 233.

(321.) Edward J. Janger & Paul M. Schwartz, The Gramm-Leach-Bliley Act, Information Privacy, and the Limits of Default Rules, 86 MINN. L. REV. 1219, 1250-51 (2002).

(322.) Id. at 1251.

(323.) Anita L. Allen, Coercing Privacy, 40 WM. & MARY L. REV. 723, 725 (1999) ("Neither privacy nor private choice, however, is an absolute, unqualified good. There can be too much privacy, and it can be maldistributed.").

(324.) Danielle Keats Citron, Fulfilling Government 2.0's Promise with Robust Privacy Protections, 78 GEO. WASH. L. REV. 822, 841-43 (2010); Schwartz, supra note 251, at 593 ("The boundless collection, processing, and dissemination of personal data can have a deleterious effect on the ability of individuals to join in social discourse.").

(325.) John M. Abowd & Julia Lane, New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers, in 3050 PRIVACY IN STATISTICAL DATABASES: LECTURE NOTES IN COMPUTER SCIENCE 282, 283 (2004), available at http://www.springerlink.com/content/27nud7qx09qurg3p/fulltext.pdf.

(326.) For example, the U.S. Census Bureau's American FactFinder service allows users to submit queries for the creation of customized tables. American FactFinder, U.S. CENSUS BUREAU, http://factfinder.census.gov/servlet/DatasetMainPageServlet (last visited Dec. 21, 2011).

Jane Yakowitz, Visiting Assistant Professor of Law, Brooklyn Law School; Yale Law School, J.D.; Yale College, B.S. The author is grateful for invaluable feedback from Jeremy Albright, Jonathan Askin, Miriam Baer, Derek Bambauer, Daniel Barth-Jones, Anita Bernstein, Frederic Bloom, Ryan Calo, Deven Desai, Robin Effron, Khaled El Emam, Marsha Garrison, Robert Gellman, Eric Goldman, Dan Hunter, Ted Janger, Margo Kaplan, Claire Kelly, Bailey Kuklin, Rebecca Kysar, Brian Lee, David S. Levine, Andrea M. Matwyshyn, Bill McGeveran, Helen Nissenbaum, Paul Ohm, Richard Sander, Liz Schneider, Paul Schwartz, Christopher Soghoian, Larry Solan, Berin Szoka, Nelson Tebbe, Adam Thierer, Marketa Trimble, Felix Wu, and Peter Yu. This article was generously supported by the Brooklyn Law School Dean's Summer Research Stipend Program.

Table 1: Number of Delinquent Children by County and Education Level of Household Head (278)

                         Education Level of Household Head
 County        Low    Medium    High    Very High    Total
 Alpha          15        1*      3*           1*       20
 Beta           20        10      10           15       55
 Gamma           3*       10      10            2*      25
 Delta          12        14       7            2*      35
 Total          50        35      30           20      135

Table 2: California High School Exit Exam (CAHSEE) Results for Mathematics and English Language Arts (ELA) by Gender and Ethnic Designation (Combined 2008) for Grade 11, [Name of School Redacted] (283)

                               MATH              ELA
                         Took    Passed    Took    Passed
 All Students              27         3      23         3
 Female                     4       n/a       3       n/a
 Male                      23         3      20         2
 Hispanic or Latino        20         3      18         3
 White                      7       n/a       5       n/a

Figure 1: OkCupid Analysis of Member Messaging Behavior (48)

 Reply Rates By Race (rows are female senders; columns are male recipients; the far-right column and bottom row are averages)

 Female sender        Asian  Black  Hispanic/  Indian  Middle   Native    Other  Pacific   White   Avg.
                                    Latin              Eastern  American         Islander
 Asian                   48     55         49      50       53        49     50        46      41   43.7
 Black                   31     37         36      37       40        41     41        32      32   34.3
 Hispanic/Latin          51     46         48      45       50        45     48        48      40   42.5
 Indian                  51     51         43      55       51        45     36        44      40   42.7
 Middle Eastern          51     55         54      63       56        63     52        48      47   49.5
 Native American         45     50         47      47       47        44     47        52      40   42.3
 Other                   52     52         43      54       52        51     47        50      42   44.4
 Pacific Islander        51     57         49      35       60        53     50        46      44   46
 White                   48     51         47      48       49        48     48        47      41   42.1
 Avg.                  47.3   46.9       46.4    48.2     49.7      47.3   47.5      46.2    40.5   42.0


