HIPAA Privacy Regulations: Other Requirements Relating to Uses and Disclosures of Protected Health Information: Requirements for De-Identification of Protected Health Information - § 164.514(b)
As Contained in the HHS HIPAA Privacy Rules
HHS Regulations as Amended August 2002
(b) Implementation specifications: Requirements for de-identification of protected health information. A covered entity may determine that health information is not individually identifiable health information only if:
(1) A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
(i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
(ii) Documents the methods and results of the analysis that justify such determination; or
(2)(i) The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:
(A) Names;
(B) All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:
(1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and
(2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
(C) All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
(D) Telephone numbers;
(E) Fax numbers;
(F) Electronic mail addresses;
(G) Social security numbers;
(H) Medical record numbers;
(I) Health plan beneficiary numbers;
(J) Account numbers;
(K) Certificate/license numbers;
(L) Vehicle identifiers and serial numbers, including license plate numbers;
(M) Device identifiers and serial numbers;
(N) Web Universal Resource Locators (URLs);
(O) Internet Protocol (IP) address numbers;
(P) Biometric identifiers, including finger and voice prints;
(Q) Full face photographic images and any comparable images; and
(R) Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section; and
(ii) The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
HHS Description of and Commentary on August 2002 Revisions Other Requirements Relating to Uses and Disclosures of Protected Health Information: De-Identification of Protected Health Information
See also Limited Data Set at § 164.514(e).
December 2000 Privacy Rule. At § 164.514(a)-(c), the Privacy Rule permits a covered entity to de-identify protected health information so that such information may be used and disclosed freely, without being subject to the Privacy Rule's protections. Health information is de-identified, or not individually identifiable, under the Privacy Rule, if it does not identify an individual and if the covered entity has no reasonable basis to believe that the information can be used to identify an individual. In order to meet this standard, the Privacy Rule provides two alternative methods for covered entities to de-identify protected health information.
First, a covered entity may demonstrate that it has met the standard if a person with appropriate knowledge and experience applying generally acceptable statistical and scientific principles and methods for rendering information not individually identifiable makes and documents a determination that there is a very small risk that the information could be used by others to identify a subject of the information. The preamble to the Privacy Rule refers to two government reports that provide guidance for applying these principles and methods, including describing types of techniques intended to reduce the risk of disclosure that should be considered by a professional when de-identifying health information. These techniques include removing all direct identifiers, reducing the number of variables on which a match might be made, and limiting the distribution of records through a “data use agreement” or “restricted access agreement” in which the recipient agrees to limits on who can use or receive the data.
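One of the techniques mentioned above, reducing the number of variables on which a match might be made, can be illustrated with a simple count of records that are unique on a chosen set of variables. This is only a sketch of one input an expert might look at, not a method the Privacy Rule prescribes; the function names and the field names `zip3`, `year`, and `sex` are hypothetical.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of the given variables."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose combination of values appears exactly once.

    A high fraction suggests many records could be matched against outside
    data on these variables; dropping or coarsening variables lowers it.
    """
    if not records:
        return 0.0
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    uniques = sum(1 for n in sizes.values() if n == 1)
    return uniques / len(records)
```

Comparing `unique_fraction` for a full variable list against a reduced one shows how removing a variable changes the number of records that stand alone. Whether the remaining risk is "very small" is a judgment for the qualified person the rule describes.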
Alternatively, covered entities may choose to use the Privacy Rule's safe harbor method for de-identification. Under the safe harbor method, covered entities must remove all of a list of 18 enumerated identifiers and have no actual knowledge that the information remaining could be used, alone or in combination, to identify a subject of the information. The identifiers that must be removed include direct identifiers, such as name, street address, social security number, as well as other identifiers, such as birth date, admission and discharge dates, and five-digit zip code. The safe harbor requires removal of geographic subdivisions smaller than a State, except for the initial three digits of a zip code if the geographic unit formed by combining all zip codes with the same initial three digits contains more than 20,000 people. In addition, age, if less than 90, gender, ethnicity, and other demographic information not listed may remain in the information. The safe harbor is intended to provide covered entities with a simple, definitive method that does not require much judgment by the covered entity to determine if the information is adequately de-identified.
The Privacy Rule also allows for the covered entity to assign a code or other means of record identification to allow de-identified information to be re-identified by the covered entity, if the code is not derived from, or related to, information about the subject of the information. For example, the code cannot be a derivation of the individual's social security number, nor can it be otherwise capable of being translated so as to identify the individual. The covered entity also may not use or disclose the code for any other purpose, and may not disclose the mechanism (e.g., algorithm or other tool) for re-identification.
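A re-identification code of the kind described above could be sketched as a random token with a lookup table that stays with the covered entity. The class name `ReidentificationCodebook` and the 16-byte token length are assumptions made for illustration; the point is only that the code is drawn at random and is not computed from anything about the individual.

```python
import secrets


class ReidentificationCodebook:
    """Hypothetical lookup table, retained by the covered entity and not disclosed."""

    def __init__(self):
        self._code_to_record = {}
        self._record_to_code = {}

    def assign(self, record_id: str) -> str:
        """Return the code for record_id, drawing a new random one on first use."""
        if record_id in self._record_to_code:
            return self._record_to_code[record_id]
        code = secrets.token_hex(16)  # random; not derived from record_id
        while code in self._code_to_record:  # collision is very unlikely; retry anyway
            code = secrets.token_hex(16)
        self._code_to_record[code] = record_id
        self._record_to_code[record_id] = code
        return code

    def reidentify(self, code: str) -> str:
        """Map a code back to the internal record; for the covered entity's use only."""
        return self._code_to_record[code]
```

Only the output of `assign` would travel with the de-identified record. The table itself plays the role of the re-identification mechanism that may not be disclosed.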
The Department is cognizant of the increasing capabilities and sophistication of electronic data matching used to link data elements from various sources and from which, therefore, individuals may be identified. Given this increasing risk to individuals' privacy, the Department included in the Privacy Rule the above stringent standards for determining when information may flow unprotected. The Department also wanted the standards to be flexible enough so the Privacy Rule would not be a disincentive for covered entities to use or disclose de-identified information wherever possible. The Privacy Rule, therefore, strives to balance the need to protect individuals’ identities with the need to allow de-identified databases to be useful.
March 2002 NPRM. The Department heard a number of concerns regarding the de-identification standard in the Privacy Rule. These concerns generally were raised in the context of using and disclosing information for research, public health purposes, or for certain health care operations. In particular, concerns were expressed that the safe harbor method for de-identifying protected health information was so stringent that it required removal of many of the data elements that were essential to analyses for research and these other purposes. The comments, however, demonstrated little consensus as to which data elements were needed for such analyses and were largely silent regarding the feasibility of using the Privacy Rule's alternative statistical method to de-identify information.
Based on the comments received, the Department was not convinced of the need to modify the safe harbor standard for de-identified information. However, the Department was aware that a number of entities were confused by potentially conflicting provisions within the de-identification standard. These entities argued that, on the one hand, the Privacy Rule treats information as de-identified if all listed identifiers on the information are stripped, including any unique, identifying number, characteristic, or code. Yet, the Privacy Rule permits a covered entity to assign a code or other record identification to the information so that it may be re-identified by the covered entity at some later date.
The Department did not intend such a re-identification code to be considered one of the unique, identifying numbers or codes that prevented the information from being de-identified. Therefore, the Department proposed a technical modification to the safe harbor provisions explicitly to except the re-identification code or other means of record identification permitted by § 164.514(c) from the listed identifiers (§ 164.514(b)(2)(i)(R)).
Overview of Public Comments. The following provides an overview of the public comment received on this proposal. Additional comments received on this issue are discussed below in the section entitled, “Response to Other Public Comments.” All commenters on our clarification of the safe harbor re-identification code not being an enumerated identifier supported our proposed regulatory clarification.
Final Modifications. Based on the Department’s intent that the re-identification code not be considered one of the enumerated identifiers that must be excluded under the safe harbor for de-identification, and the public comment supporting this clarification, the Department adopts the provision as proposed. The re-identification code or other means of record identification permitted by § 164.514(c) is expressly excepted from the listed safe harbor identifiers at § 164.514(b)(2)(i)(R).
Response to Other Public Comments.
Comment: One commenter asked if data can be linked inside the covered entity and a dummy identifier substituted for the actual identifier when the data is disclosed to the external researcher, with control of the dummy identifier remaining with the covered entity.
Response: The Privacy Rule does not restrict linkage of protected health information inside a covered entity. The model that the commenter describes for the dummy identifier is consistent with the re-identification code allowed under the Rule’s safe harbor so long as the covered entity does not generate the dummy identifier using any individually identifiable information. For example, the dummy identifier cannot be derived from the individual’s social security number, birth date, or hospital record number.
Comment: Several commenters who supported the creation of de-identified data for research based on removal of facial identifiers asked if a keyed-hash message authentication code (HMAC) can be used as a re-identification code even though it is derived from patient information, because it is not intended to re-identify the patient and it is not possible to identify the patient from the code. The commenters stated that use of the keyed-hash message authentication code would be valuable for research, public health and bio-terrorism detection purposes where there is a need to link clinical events on the same person occurring in different health care settings (e.g. to avoid double counting of cases or to observe long-term outcomes).
These commenters referenced Federal Information Processing Standard (FIPS) 198: “The Keyed-Hash Message Authentication Code.” This standard describes a keyed-hash message authentication code (HMAC) as a mechanism for message authentication using cryptographic hash functions. The HMAC can be used with any iterative approved cryptographic hash function, in combination with a shared secret key. A hash function is an approved mathematical function that maps a string of arbitrary length (up to a pre-determined maximum size) to a fixed length string. It may be used to produce a checksum, called a hash value or message digest, for a potentially long string or message.
According to the commenters, the HMAC can only be breached when the key and the identifier from which the HMAC is derived and the de-identified information attached to this code are known to the public. It is common practice that the key is limited in time and scope (e.g. only for the purpose of a single research query) and that data not be accumulated with such codes (with the code needed for joining records being discarded after the de-identified data has been joined).
Response: The HMAC does not meet the conditions for use as a re-identification code for de-identified information. It is derived from individually identified information and it appears the key is shared with or provided by the recipient of the data in order for that recipient to be able to link information about the individual from multiple entities or over time. Since the HMAC allows identification of individuals by the recipient, disclosure of the HMAC violates the Rule. It is not solely the public’s access to the key that matters for these purposes; the covered entity may not share the key to the re-identification code with anyone, including the recipient of the data, regardless of whether the intent is to facilitate re-identification or not.
The HMAC methodology, however, may be used in the context of the limited data set, discussed below. The limited data set contains individually identifiable health information and is not a de-identified data set. Creation of a limited data set for research with a data use agreement, as specified in § 164.514(e), would not preclude inclusion of the keyed-hash message authentication code in the limited data set. The Department encourages inclusion of the additional safeguards mentioned by the commenters as part of the data use agreement whenever the HMAC is used.
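The keyed-hash construction the commenters describe can be sketched with Python's standard `hmac` module. The function name `linkage_code` and the choice of SHA-256 as the underlying hash are assumptions, not taken from the comments or from FIPS 198. As the response above explains, a code produced this way is derived from identifying information, so it would not qualify as a re-identification code for de-identified data; the sketch is relevant only to the limited data set context.

```python
import hashlib
import hmac


def linkage_code(secret_key: bytes, identifier: str) -> str:
    """Keyed hash of an identifier; same key and identifier give the same code."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
```

Two entities holding the same key would compute the same code for the same person, which is what allows clinical events to be linked across settings. That same property is why sharing the key lets a recipient test candidate identifiers against the codes.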
Comment: One commenter requested that HHS update the safe harbor de-identification standard with prohibited 3-digit zip codes based on 2000 Census data.
Response: The Department stated in the preamble to the December 2000 Privacy Rule that it would monitor such data and the associated re-identification risks and adjust the safe harbor as necessary. Accordingly, the Department provides such updated information in response to the above comment. The Department notes that these three-digit zip codes are based on the five-digit Zip Code Tabulation Areas created by the Census Bureau for the 2000 Census. This new methodology also is briefly described below, as it will likely be of interest to all users of data tabulated by zip code.
The Census Bureau will not be producing data files containing U.S. Postal Service zip codes either as part of the Census 2000 product series or as a post Census 2000 product. However, due to the public's interest in having statistics tabulated by zip code, the Census Bureau has created a new statistical area called the Zip Code Tabulation Area (ZCTA) for Census 2000. The ZCTAs were designed to overcome the operational difficulties of creating a well-defined zip code area by using Census blocks (and the addresses found in them) as the basis for the ZCTAs. In the past, there has been no correlation between zip codes and Census Bureau geography. Zip codes can cross State, place, county, census tract, block group and census block boundaries. The geographic entities the Census Bureau uses to tabulate data are relatively stable over time. For instance, census tracts are only defined every ten years. In contrast, zip codes can change more frequently. Because of the ill-defined nature of zip code boundaries, the Census Bureau has no file (crosswalk) showing the relationship between US Census Bureau geography and US Postal Service zip codes.
ZCTAs are generalized area representations of U.S. Postal Service (USPS) zip code service areas. Simply put, each one is built by aggregating the Census 2000 blocks, whose addresses use a given zip code, into a ZCTA which gets that zip code assigned as its ZCTA code. They represent the majority USPS five-digit zip code found in a given area. For those areas where it is difficult to determine the prevailing five-digit zip code, the higher-level three-digit zip code is used for the ZCTA code. For further information, go to: http://www.census.gov/geo/www/gazetteer/places2k.html.
Utilizing 2000 Census data, the following three-digit ZCTAs have a population of 20,000 or fewer persons. To produce a de-identified data set utilizing the safe harbor method, all records with three-digit zip codes corresponding to these three-digit ZCTAs must have the zip code changed to 000. The 17 restricted zip codes are: 036, 059, 063, 102, 203, 556, 692, 790, 821, 823, 830, 831, 878, 879, 884, 890, and 893.
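The zip code rule can be sketched as a small function that keeps only the initial three digits and substitutes 000 for the 17 restricted prefixes listed above. The function name `safe_harbor_zip` is hypothetical, and mapping malformed input to 000 is a conservative assumption rather than anything the rule states. The list reflects 2000 Census data as given here and would need updating against current Census figures.

```python
# The 17 three-digit prefixes listed above (2000 Census data).
RESTRICTED_ZIP3 = frozenset({
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
})


def safe_harbor_zip(zip_code: str) -> str:
    """Reduce a zip code to its first three digits, or 000 if restricted."""
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 3:
        return "000"  # assumption: treat malformed input conservatively
    prefix = digits[:3]
    return "000" if prefix in RESTRICTED_ZIP3 else prefix
```

For example, `safe_harbor_zip("10001")` keeps the prefix `100`, while `safe_harbor_zip("03601")` falls under the restricted prefix 036 and becomes `000`.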
HHS Description from Original Rulemaking Other Requirements Relating to Uses and Disclosures of Protected Health Information: Requirements for De-Identification of Protected Health Information
Note: HHS Description is the same as for § 164.514(a)
In § 164.506(d) of the NPRM, we proposed that the privacy standards would apply to “individually identifiable health information,” and not to information that does not identify the subject individual. The statute defines individually identifiable health information as certain health information:
(i) Which identifies the individual, or
(ii) With respect to which there is a reasonable basis to believe that the information can be used to identify the individual.
As we pointed out in the NPRM, difficulties arise because, even after removing obvious identifiers (e.g., name, social security number, address), there is always some probability or risk that any information about an individual can be attributed to that individual.
The NPRM proposed two alternative methods for determining when sufficient identifying information has been removed from a record to render the information de-identified and thus not subject to the rule. First, the NPRM proposed the establishment of a “safe harbor”: if all of a list of 19 specified items of information had been removed, and the covered entity had no reason to believe that the remaining information could be used to identify the subject of the information (alone or in combination with other information), the covered entity would have been presumed to have created de-identified information. Second, the NPRM proposed an alternative method so that covered entities with sufficient statistical experience and expertise could remove or encrypt a combination of information different from the enumerated list, using commonly accepted scientific and statistical standards for disclosure avoidance. Such covered entities would have been able to include information from the enumerated list of 19 items if they (1) believed that the probability of re-identification was very low, and (2) removed additional information if they had a reasonable basis to believe that the resulting information could be used to re-identify someone.
We proposed that covered entities and their business partners be permitted to use protected health information to create de-identified health information using either of these two methods. Covered entities would have been permitted to further use and disclose such de-identified information in any way, provided that they did not disclose the key or other mechanism that would have enabled the information to be re-identified, and provided that they reasonably believed that such use or disclosure of de-identified information would not have resulted in the use or disclosure of protected health information.
A number of examples were provided of how valuable such de-identified information would be for various purposes. We expressed the hope that covered entities, their business partners, and others would make greater use of de-identified health information than they do today, when it is sufficient for the purpose, and that such practice would reduce the burden and the confidentiality concerns that result from the use of individually identifiable health information for some of these purposes.
In §§ 164.514(a)-(c) of this final rule, we make several modifications to the provisions for de-identification. First, we explicitly adopt the statutory standard as the basic regulatory standard for whether health information is individually identifiable health information under this rule. Information is not individually identifiable under this rule if it does not identify the individual, or if the covered entity has no reasonable basis to believe it can be used to identify the individual. Second, in the implementation specifications we reformulate the two ways in which a covered entity can demonstrate that it has met the standard.
One way a covered entity may demonstrate that it has met the standard is if a person with appropriate knowledge and experience applying generally accepted statistical and scientific principles and methods for rendering information not individually identifiable makes a determination that the risk is very small that the information could be used, either by itself or in combination with other available information, by anticipated recipients to identify a subject of the information. The covered entity must also document the analysis and results that justify the determination. We provide guidance regarding this standard in our responses to the comments we received on this provision.
We also include an alternate, safe harbor, method by which covered entities can demonstrate compliance with the standard. Under the safe harbor, a covered entity is considered to have met the standard if it has removed all of a list of enumerated identifiers, and if the covered entity has no actual knowledge that the information could be used alone or in combination to identify a subject of the information. We note that in the NPRM, we had proposed that to meet the safe harbor, a covered entity must have “no reason to believe” that the information remained identifiable after the enumerated identifiers were removed. In the final rule, we have changed the standard to one of actual knowledge in order to provide greater certainty to covered entities using the safe harbor approach.
In the safe harbor, we explicitly allow age and some geographic location information to be included in the de-identified information, but all dates directly related to the subject of the information must be removed or limited to the year, and zip codes must be removed or aggregated (in the form of most 3-digit zip codes) to include at least 20,000 people. Extreme ages of 90 and over must be aggregated to a category of 90+ to avoid identification of very old individuals. Other demographic information, such as gender, race, ethnicity, and marital status are not included in the list of identifiers that must be removed.
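The date and age rules described above can be sketched as follows. The function names are hypothetical, and the reference date used to compute age is an assumption a covered entity would have to choose for itself. Because the regulation also removes date elements, including year, that indicate an age over 89, the sketch suppresses the birth year whenever the computed age falls into the 90+ category.

```python
from datetime import date


def age_on(birth: date, as_of: date) -> int:
    """Age in completed years on the reference date."""
    before_birthday = (as_of.month, as_of.day) < (birth.month, birth.day)
    return as_of.year - birth.year - before_birthday


def age_category(age: int) -> str:
    """Exact age below 90; a single aggregated category for 90 and over."""
    return "90+" if age >= 90 else str(age)


def generalize_birth_date(birth: date, as_of: date):
    """Return (birth year or None, age category) for a birth date."""
    age = age_on(birth, as_of)
    if age >= 90:
        return None, "90+"  # the year itself would reveal the extreme age
    return birth.year, str(age)


def generalize_event_date(d: date) -> int:
    """Keep only the year of an admission, discharge, or similar date."""
    return d.year
```

An event date tied to a person in the 90+ category can still reveal that age in combination with other fields, so a fuller implementation would need to review such records rather than rely on `generalize_event_date` alone.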
The intent of the safe harbor is to provide a means to produce some de-identified information that could be used for many purposes with a very small risk of privacy violation. The safe harbor is intended to involve a minimum of burden and convey a maximum of certainty that the rules have been met by interpreting the statutory "reasonable basis to believe that the information can be used to identify the individual" to produce an easily followed, cook book approach.
Covered entities may use codes and similar means of marking records so that they may be linked or later re-identified, if the code does not contain information about the subject of the information (for example, the code may not be a derivative of the individual’s social security number), and if the covered entity does not use or disclose the code for any other purpose. The covered entity is also prohibited from disclosing the mechanism for re-identification, such as tables, algorithms, or other tools that could be used to link the code with the subject of the information.
Language to clarify that covered entities may contract with business associates to perform the de-identification has been added to the section on business associates.
HHS Response to Comments Received from Original Rulemaking Other Requirements Relating to Uses and Disclosures of Protected Health Information: Requirements for De-Identification of Protected Health Information
Note: HHS Response to Comments Received is the same as for § 164.514(a)
General Approach
Comments: The comments on this topic almost unanimously supported the concept of de-identification and efforts to expand its use. Although a few comments suggested deleting one of the proposed methods or the other, most appeared to support the two method approach for entities with differing levels of statistical expertise.
Many of the comments argued that the standard for creation of de-identified information should be whether there is a "reasonable basis to believe" that the information has been de-identified. Others suggested that the “reasonable basis” standard was too vague.
A few commenters suggested that we consider information to be de-identified if all personal identifiers that directly reveal the identity of the individual or provide a direct means of identifying individuals have been removed, encrypted or replaced with a code. Essentially, this recommendation would require only removal of “direct” identifiers (e.g., name, address, and ID numbers) and allow retention of all "indirect" identifiers (e.g., zip code and birth date) in “de-identified” information. These comments did not suggest a list or further definition of what identifiers should be considered "direct" identifiers.
Some commenters suggested that the standard be modified to reflect a single standard that applies to all covered entities in the interest of reducing uncertainty and complexity. According to these comments, the standard for covered entities to meet for de-identification of protected health information should be generally accepted standards in the scientific and statistical community, rather than focusing on a specified list of identifiers that must be removed.
A few commenters believed that no record of information about an individual can be truly de-identified and that all such information should be treated and protected as identifiable because more and more information about individuals is being made available to the public, such as voter registration lists and motor vehicle and driver's license lists, that would enable someone to match (and identify) records that otherwise appear to be not identifiable.
Response: In the final rule, we reformulate the method for de-identification to more explicitly use the statutory standard of "a reasonable basis to believe that the information can be used to identify the individual." Just as information is “individually identifiable” if there is a reasonable basis to believe that it can be used to identify the individual, it is “de-identified” if there is no reasonable basis to believe it can be so used. We also define more precisely how the standard should be applied.
We did not accept comments that suggested that we allow only one method of de-identifying information. We find support for both methods in the comments but find no compelling logic for how the competing interests could be met cost-effectively with only one method.
We also disagree with the comments that advocated using a standard which required removing only the direct identifiers. Although such an approach may be more convenient for covered entities, we judged that the resulting information would often remain identifiable, and its dissemination could result in significant violations of privacy. While we encourage covered entities to remove direct identifiers whenever possible as a method of enhancing privacy, we do not believe that the resulting information is sufficiently blinded as to permit its general dissemination without the protections provided by this rule.
We agree with the comments that said that records of information about individuals cannot be truly de-identified, if that means that the probability of attribution to an individual must be absolutely zero. However, the statutory standard does not allow us to take such a position, but envisions a reasonable balance between risk of identification and usefulness of the information.
We disagree with those comments that advocated releasing only truly anonymous information (which has been changed sufficiently so that it no longer represents actual information about real individuals) and those that supported using only sophisticated statistical analysis before allowing uncontrolled disclosures. Although these approaches would provide a marginally higher level of privacy protection, they would preclude many of the laudable and valuable uses discussed in the NPRM (in § 164.506(d)) and would impose too great a burden on less sophisticated covered entities to be justified by the small decrease in an already small risk of identification.
We conclude that compared to the alternatives advanced by the comments, the approach proposed in the NPRM, as refined and modified below in response to the comments, most closely meets the intent of the statute.
Comments: A few comments complained that the proposed standards were so strict that they would expose covered entities to liability because arguably no information could ever be de-identified.
Response: In the final rule we have modified the mechanisms by which a covered entity may demonstrate that it has complied with the standard in ways that provide greater certainty. In the standard method for de-identification, we have clarified the professional standard to be used, and anticipate issuing further guidance for covered entities to use in applying the standard. In the safe harbor method, we reduced the amount of judgment that a covered entity must apply. We believe that these mechanisms for de-identification are sufficiently well-defined to protect covered entities that follow them from undue liability.
Comments: Several comments suggested that the rule prohibit any linking of de-identified data, regardless of the probability of identification.
Response: Since our methods of de-identification include consideration of how the information might be used in combination with other information, we believe that linking de-identified information does not pose a significantly increased risk of privacy violations. In addition, since our authority extends only to the regulation of individually identifiable health information, we cannot regulate de-identified information because it no longer meets the definition of individually identifiable health information. We also have no authority to regulate entities that might receive and desire to link such information yet that are not covered entities; thus such a prohibition would have little protective effect.
Comments: Several commenters suggested that we create incentives for covered entities to use de-identified information. One commenter suggested that we mandate an assessment to see if de-identified information could be used before the use or disclosure of identified information would be allowed.
Response: We believe that this final rule establishes a reasonable mechanism for the creation of de-identified information and the fact that this de-identified information can be used without having to follow the policies, procedures, and documentation required to use individually identifiable health information should provide an incentive to encourage its use where appropriate. We disagree with the comment suggesting that we require an assessment of whether de-identified information could be used for each use or disclosure. We believe that such a requirement would be too burdensome on covered entities, particularly with respect to internal uses, where entire records are often used by medical and other personnel. For disclosures, we believe that such an assessment would add little to the protection provided by the minimum necessary requirements in this final rule.
Comments: One commenter asked if de-identification was equivalent to destruction of the protected health information (as required under several of the provisions of this final rule).
Response: The process of de-identification creates a new dataset in addition to the source dataset containing the protected health information. This process does not substitute for actual destruction of the source data.
Modifications to the Proposed Standard for De-identification
Comments: Several commenters called for clarification of proposed language in the NPRM that would have permitted a covered entity to treat information as de-identified, even if specified identifiers were retained, as long as the probability of identifying subject individuals would be very low. Commenters expressed concern that the “very low” standard was vague. These comments expressed concern that covered entities would not have a clear and easy way to know when information meets this part of the standard.
Response: We agree with the comments that covered entities may need additional guidance on the types of analyses that they should perform in determining when the probability of re-identification of information is very low. We note that in the final rule, we reformulate the standard somewhat to require that a person with appropriate knowledge and experience apply generally accepted statistical and scientific methods relevant to the task to make a determination that the risk of re-identification is very small. In this context, we do not view the difference between a very low probability and a very small risk to be substantive. After consulting representatives of the federal agencies that routinely de-identify and anonymize information for public release, we attempt here to provide some guidance for the method of de-identification.
As requested by some commenters, we include in the final rule a requirement that covered entities (not following the safe harbor approach) apply generally accepted statistical and scientific principles and methods for rendering information not individually identifiable when determining if information is de-identified. Although such guidance will change over time to keep up with technology and the current availability of public information from other sources, as a starting point the Secretary approves the use of the following as guidance to such generally accepted statistical and scientific principles and methods:
[Note that the following two links are no longer active].
(1) Statistical Policy Working Paper 22 - Report on Statistical Disclosure Limitation Methodology (http://www.fcsm.gov/working-papers/wp22.html) (prepared by the Subcommittee on Disclosure Limitation Methodology, Federal Committee on Statistical Methodology, Office of Management and Budget) and
(2) the Checklist on Disclosure Potential of Proposed Data Releases (http://www.fcsm.gov/docs/checklist_799.doc) (prepared by the Confidentiality and Data Access Committee, Federal Committee on Statistical Methodology, Office of Management and Budget).
We agree with commenters that such guidance will need to be updated over time and we will provide such guidance in the future.
According to the Statistical Policy Working Paper 22, the two main sources of disclosure risk for de-identified records about individuals are the existence of records with very unique characteristics (e.g., unusual occupation or very high salary or age) and the existence of external sources of records with matching data elements which can be used to link with the de-identified information and identify individuals (e.g., voter registration records or driver's license records). The risk of disclosure increases as the number of variables common to both types of records increases, as the accuracy or resolution of the data increases, and as the number of external sources increases. As outlined in Statistical Policy Working Paper 22, an expert disclosure analysis would also consider the probability that an individual who is the target of an attempt at re-identification is represented on both files, the probability that the matching variables are recorded identically on the two types of records, the probability that the target individual is unique in the population for the matching variables, and the degree of confidence that a match would correctly identify a unique person.
Statistical Policy Working Paper 22 also describes many techniques that can be used to reduce the risk of disclosure that should be considered by an expert when de-identifying health information. In addition to removing all direct identifiers, these include the obvious choices based on the above causes of the risk; namely, reducing the number of variables on which a match might be made and limiting the distribution of the records through a "data use agreement" or "restricted access agreement" in which the recipient agrees to limits on who can use/receive the data. The techniques also include more sophisticated manipulations: recoding variables into fewer categories to provide less precise detail (including rounding of continuous variables); setting top-codes and bottom-codes to limit details for extreme values; disturbing the data by adding noise by swapping certain variables between records, replacing some variables in random records with mathematically imputed values or averages across small random groups of records, or randomly deleting or duplicating a small sample of records; and replacing actual records with synthetic records that preserve certain statistical properties of the original data.
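As an illustration only, the simpler recoding techniques described above (top-coding, bottom-coding, rounding, and collapsing into fewer categories) might be sketched as follows. The function names, thresholds, and band widths are hypothetical choices made for this example; they are not taken from the rule or from Working Paper 22, and an expert would select them based on the data at hand.

```python
def top_code(value, ceiling):
    """Cap extreme high values at a ceiling (hypothetical threshold)."""
    return min(value, ceiling)


def bottom_code(value, floor):
    """Raise extreme low values to a floor (hypothetical threshold)."""
    return max(value, floor)


def round_to(value, step):
    """Round a continuous value to the nearest multiple of step."""
    return step * round(value / step)


def recode_age_band(age, width=10):
    """Collapse an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Example: a very high salary is top-coded, then rounded to the nearest 1,000.
salary = round_to(top_code(1_234_567, 150_000), 1_000)
```

Each function reduces the precision of one variable. Whether the resulting risk is "very small" remains a determination for the expert, not something these helpers can establish.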
Modifications to the “Safe Harbor”
Comments: Many commenters argued that stripping all 19 identifiers is unnecessary for purposes of de-identification. They felt that such items as zip code, city (or county), and birth date, for example, do not identify the individual and only such identifiers as name, street address, phone numbers, fax numbers, email, Social Security number, driver’s license number, voter registration number, motor vehicle registration, identifiable photographs, finger prints, voice prints, web universal resource locator, and Internet protocol address number need to be removed to reasonably believe that data has been de-identified.
Other commenters felt that removing the full list of identifiers would significantly reduce the usefulness of the data. Many of these comments focused on research and, to a lesser extent, marketing and undefined “statistical analysis.” Commenters who represented various industries and research institutions expressed concern that they would not be able to continue current activities such as development of service provider networks, conducting “analysis” on behalf of the plan, studying use of medication and medical devices, community studies, marketing and strategic planning, childhood immunization initiatives, patient satisfaction surveys, and solicitation of contributions. The requirements in the NPRM to strip off zip code and date of birth were of particular concern. These commenters stated that their ability to do research and quality analysis with this data would be compromised without access to some level of information about patient age and/or geographic location.
Response: While we understand that removing the specified identifiers may reduce the usefulness of the resulting data to third parties, we remain convinced by the evidence found in the MIT study that we referred to in the preamble to the proposed rule and the analyses discussed below that there remains a significant risk of identification of the subjects of health information from the inclusion of indirect identifiers such as birth date and zip code and that in many cases there will be a reasonable basis to believe that such information remains identifiable. We note that a covered entity not relying on the safe harbor may determine that information from which sufficient other identifiers have been removed but which retains birth date or zip code is not reasonably identifiable. As discussed above, such a determination must be made by a person with appropriate knowledge and expertise applying generally accepted statistical and scientific methods for rendering information not identifiable.
Although we have determined that all of the specified identifiers must be removed before a covered entity meets the safe harbor requirements, we made modifications in the final rule to the specified identifiers on the list to permit some information about age and geographic area to be retained in de-identified information.
For age, we specify that, in most cases, year of birth may be retained, which can be combined with the age of the subject to provide sufficient information about age for most uses. After considering current and evolving practices and consulting with federal experts on this topic, including members of the Confidentiality and Data Access Committee of the Federal Committee on Statistical Methodology, Office of Management and Budget, we concluded that in general, age is sufficiently broad to be allowed in de-identified information, although all dates that might be directly related to the subject of the information must be removed or aggregated to the level of year to prevent deduction of birth dates. Extreme ages -- 90 and over -- must be aggregated further (to a category of 90+, for example) to avoid identification of very old individuals (because they are relatively rare). This reflects the minimum requirement of the current recommendations of the Bureau of the Census. For research or other studies relating to young children or infants, we note that the rule would not prohibit age of an individual from being expressed as an age in months, days, or hours.
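A minimal sketch of the age and date treatment described above, assuming ages are held in whole years and dates as Python `date` objects. The function names and the "90+" label are illustrative assumptions, not terms defined by the rule.

```python
from datetime import date


def generalize_age(age_years: int) -> str:
    """Report age in years, aggregating ages of 90 and over into one category."""
    if age_years < 0:
        raise ValueError("age cannot be negative")
    return "90+" if age_years >= 90 else str(age_years)


def generalize_date(d: date) -> int:
    """Reduce a date related to the individual to its year only."""
    return d.year


def generalize_birth_year(birth: date, as_of: date) -> str:
    """Keep year of birth unless it would reveal an age of 90 or over."""
    # Subtract one if the birthday has not yet occurred in the as_of year.
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    age = as_of.year - birth.year - (0 if had_birthday else 1)
    return "90+" if age >= 90 else str(birth.year)
```

The third function reflects the point that a year of birth is itself indicative of age, so it is suppressed into the same aggregate category when the individual is 90 or older.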
For geographic area, we specify that the initial three digits of zip codes may be retained for any three-digit zip code that contains more than 20,000 people as determined by the Bureau of the Census. As discussed more below, there are currently only 18 three-digit zip codes containing fewer than 20,000 people. We note that this number may change when information from the 2000 Decennial Census is analyzed.
In response to concerns expressed in the comments about the need for information on geographic area, we investigated the potential of allowing 5-digit zip codes or 3-digit zip codes to remain in the de-identified information. According to 1990 Census data, the populations in geographical areas delineated by 3-digit zip codes vary a great deal, from a low of 394 to a high of 3,006,997, with an average size of 282,304. There are two 3-digit zip codes containing fewer than 500 people and six 3-digit zip codes containing fewer than 10,000 people each. Of the total of 881 3-digit zip codes, there are 18 with fewer than 20,000 people, 71 with fewer than 50,000 people, and 215 with fewer than 100,000 people. We also looked at two-digit zip codes (the first 2 digits of the 5-digit zip code) and found that the smallest of the 98 2-digit zip codes contains 188,638 people.
We also investigated the practices of several other federal agencies which are mandated by Congress to release data from national surveys while preserving confidentiality and which have been dealing with these issues for decades. The problems and solutions being used by these agencies are laid out in detail in the Statistical Policy Working Paper 22 cited earlier.
To protect the privacy of individuals providing information to the Bureau of the Census, the Bureau has determined that a geographical region must contain at least 100,000 people. This standard has been used by the Bureau of the Census for many years and is supported by simulation studies using Census data. These studies showed that after a certain point, increasing the size of a geographic area does not significantly decrease the percentage of unique records (i.e., those that could be identified if sampled), but that the point of diminishing returns is dependent on the number and type of demographic variables on which matching might occur. For a small number of demographic variables (6), this point was quite low (about 20,000 population), but it rose quickly to about 50,000 for 10 variables and to about 80,000 for 15 variables. The Bureau of the Census releases sets of data to the public that it considers safe from re-identification because it limits geographical areas to those containing at least 100,000 people and limits the number and detail of the demographic variables in the data. At the point of approximately 100,000 population, 7.3% of records were unique (and therefore potentially identifiable) on 6 demographic variables from the 1990 Census Short Form: age in years (90 categories), race (up to 180 categories), sex (2 categories), relationship to householder (14 categories), Hispanic (2 categories), and tenure (owner vs. renter in 5 categories). Using 6 variables derived from the Long Form data, age (10 categories), race (6 categories), sex (2 categories), marital status (5 categories), occupation (54 categories), and personal income (10 categories), raised the percentage to 9.8%.
We also examined the results of an NCHS simulation study using national survey data to see if some scientific support could be found for a compromise. The study took random samples from populations of different sizes and then compared the samples to the whole population to see how many records were identifiable, that is, matched uniquely to a unique person in the whole population on the basis of 9 demographic variables: age (85 categories), race (4 categories), gender (2 categories), ethnicity (2 categories), marital status (3 categories), income (3 categories), employment status (2 categories), working class (4 categories), and occupation (42 categories). Even when some of the variables are aggregated or coded, from the perspective of a large statistical agency desiring to release data to the public, the study concluded that a population size of 500,000 was not sufficient to provide a reasonable guarantee that certain individuals could not be identified. About 2.5% of the sample from the population of 500,000 was uniquely identifiable, regardless of sample size. This percentage rose as the size of the population decreased, to about 14% for a population of 100,000 and to about 25% for a population of 25,000. Eliminating the occupation variable (which is less likely to be found in health data) reduced this percentage significantly to about 0.4%, 3%, and 10% respectively. These percentages of unique records (and thus the potentials for re-identification) are highly dependent on the number of variables (which must also be available in other databases which are identified to be considered in a disclosure risk analysis), the categorical breakdowns of those variables, and the level of geographic detail included.
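The "percentage of unique records" measure used in the studies above can be illustrated with a short sketch. This is a simplified stand-in, not the method used in the Census or NCHS studies: it counts uniqueness within a single file rather than matching a sample against a full population, and the function name, variable names, and sample records are invented for the example.

```python
from collections import Counter


def percent_unique(records, variables):
    """Percentage of records whose combination of values on the given
    variables appears exactly once in the file."""
    if not records:
        raise ValueError("no records supplied")
    keys = [tuple(record[v] for v in variables) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return 100.0 * unique / len(keys)


# Hypothetical records, used only to show the effect of dropping a variable.
records = [
    {"age": 34, "sex": "F", "occupation": "nurse"},
    {"age": 34, "sex": "F", "occupation": "pilot"},
    {"age": 51, "sex": "M", "occupation": "clerk"},
    {"age": 51, "sex": "M", "occupation": "clerk"},
]

with_occupation = percent_unique(records, ["age", "sex", "occupation"])
without_occupation = percent_unique(records, ["age", "sex"])
```

In this toy file, removing the occupation variable lowers the share of unique records, which is the same direction of effect the NCHS study reported.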
With respect to how we might clarify the requirement to achieve a "low probability" that information could be identified, the Statistical Policy Working Paper 22 referenced above discusses the attempts of several researchers to define mathematical measures of disclosure risk only to conclude that "more research into defining a computable measure of risk is necessary." When we considered whether we could specify a maximum level of risk of disclosure with some precision (such as a probability or risk of identification of <0.01), we concluded that it is premature to assign mathematical precision to the "art" of de-identification.
After evaluating current practices and recognizing the expressed need for some geographic indicators in otherwise de-identified databases, we concluded that permitting geographic identifiers that define populations of greater than 20,000 individuals is an appropriate standard that balances privacy interests against desirable uses of de-identified data. In making this determination, we focused on the studies by the Bureau of the Census cited above which seemed to indicate that a population size of 20,000 was an appropriate cutoff if there were relatively few (6) demographic variables in the database. Our belief is that, after removing the required identifiers to meet the safe harbor standards, the number of demographic variables retained in the databases will be relatively small, so that it is appropriate to accept a relatively low number as a minimum geographic size.
In applying this provision, covered entities must replace the (currently 18) forbidden 3-digit zip codes with zeros and thus treat them as a single geographic area (with > 20,000 population). The list of the forbidden 3-digit zip codes will be maintained as part of the updated Secretarial guidance referred to above. Currently, they are: 022, 036, 059, 102, 203, 555, 556, 692, 821, 823, 830, 831, 878, 879, 884, 893, 987, and 994. This will result in an average 3-digit zip code area population of 287,858 which should result in an average of about 4% unique records using the 6 variables described above from the Census Short Form. Although this level of unique records will be much higher in the smaller geographic areas, the actual risk of identification will be much lower because of the limited availability of comparable data in publicly available, identified databases, and will be further reduced by the low probability that someone will expend the resources to try to identify records when the chance of success is so small and uncertain. We think this compromise will meet the current need for an easy method to identify geographic area while providing adequate protection from re-identification. If a greater level of geographical detail is required for a particular use, the information will have to be obtained through another permitted mechanism or be subjected to a specific de-identification determination as described above. We will monitor the availability of identified public data and the concomitant re-identification risks, both theoretical and actual, and adjust this safe harbor in the future as necessary.
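The zip code treatment described above might be sketched as follows. The list of restricted prefixes is copied from the text above and reflects 1990 Census data, so it should be treated as a snapshot that may be superseded by later guidance. The function name and input handling are assumptions made for this example.

```python
# The 18 three-digit prefixes listed in the text above (1990 Census data).
# Assumption: a current list would be substituted in any real use.
RESTRICTED_ZIP3 = frozenset({
    "022", "036", "059", "102", "203", "555", "556", "692", "821",
    "823", "830", "831", "878", "879", "884", "893", "987", "994",
})


def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits of a zip code, replacing any
    restricted prefix with '000'."""
    cleaned = zip_code.strip()
    prefix = cleaned[:3]
    if len(prefix) < 3 or not prefix.isdigit():
        raise ValueError(f"not a usable zip code: {zip_code!r}")
    return "000" if prefix in RESTRICTED_ZIP3 else prefix
```

For example, `generalize_zip("03601")` yields `"000"` because 036 is on the list, while a zip code beginning with 021 keeps its three-digit prefix.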
As we stated above, we understand that many commenters would prefer a looser standard for determining when information is de-identified, both generally and with respect to the standards for identifying geographic area. However, because public databases (such as voter records or driver’s license records) that include demographic information about a geographically defined population are available, a surprisingly large percentage of records of health information that contain similar demographic information can be identified. Although the number of these databases seems to be increasing, the number of demographic variables within them still appears to be fairly limited. The number of cases of privacy violation from health records which have been identified in this way is small to date. However, the risk of identification increases with decreasing population size, with increasing amounts of demographic information (both in level of detail and number of variables), and with the uniqueness of the combination of such information in the population. That is, an 18 year old single white male student is not at risk of identification in a database from a large city such as New York. However, if the database were about a small town where most of the inhabitants were older, retired people of a specific minority race or ethnic group, that same person might be unique in that community and easily identified. We believe that the policy that we have articulated reaches the appropriate balance between reasonably protecting privacy and providing a sufficient level of information to make de-identified databases useful.
Comments: Some comments noted that identifiers that accompany photographic images are often needed to interpret the image and that it would be difficult to use the image alone to identify the individual.
Response: We agree that our proposed requirement to remove all photographic images was more than necessary. Many photographs of lesions, for example, which cannot usually be used alone to identify an individual, are included in health records. In this final rule, the only absolute requirement is the removal of full-face photographs, and we depend on the “catch-all” of “any other unique ... characteristic ...” to pick up the unusual case where another type of photographic image might be used to identify an individual.
Comments: A number of commenters felt that the proposed bar for removal had been set too high; that the removal of these 19 identifiers created a difficult standard, since some identifiers may be buried in lengthy text fields.
Response: We understand that some of the identifiers on our list for removal may be buried in text fields, but we see no alternative that protects privacy. In addition, we believe that such unstructured text fields have little or no value in a de-identified information set and would be removed in any case. With time, we expect that such identifiers will be kept out of places where they are hard to locate and expunge.
Comments: Some commenters asserted that this requirement creates a disincentive for covered entities to de-identify data and would compromise the Secretary’s desire to see de-identified data used for a multitude of purposes. Others stated that the ‘no reason to believe’ test creates an unreasonable burden on covered entities, and would actually chill the release of de-identified information, and set an impossible standard.
Response: We recognize that the proposed standards might have imposed a burden that could have prevented the widespread use of de-identified information. We believe that our modifications to the final rule discussed above will make the process less burdensome and remove some of the disincentive. However, we could not loosen the standards as far as many commenters wanted without seriously jeopardizing the privacy of the subjects of the information. As discussed above, we modify the “no reason to know” standard that was part of the safe harbor provision and replace it in the final rule with an “actual knowledge” standard. We believe that this change provides additional certainty to covered entities using the safe harbor and should eliminate any chilling effect.
Comments: Although most commenters wanted to see data elements taken off the list, there were a small number of commenters that wanted to see data items added to the list. They believed that it is also necessary to remove clinical trial record numbers, device model serial numbers, and all proper nouns from the records.
Response: In response to these requests, we have slightly revised the list of identifiers that must be removed under the safe harbor provision. Clinical trial record numbers are included in the general category of "any other unique identifying number, characteristic, or code." These record numbers cannot be included with de-identified information because, although the availability of clinical trial numbers may be limited, they are used for other purposes besides de-identification/re-identification, such as identifying clinical trial records, and may be disclosed under certain circumstances. Thus, they do not meet the criteria in the rule for use as a unique record identifier for de-identified records. Device model serial numbers are included in "any device identifier or serial number" and must be removed. We considered the request to remove all proper nouns to be very burdensome to implement for very little increase in privacy and likely to be arbitrary in operation, and so it is not included in the final rule.
Re-identification
Comments: One commenter wanted to know if the rule requires that covered entities retain the ability to re-identify de-identified information.
Response: The rule does not require covered entities to retain the ability to re-identify de-identified information, but it does allow them to retain this ability.
Comments: A few commenters asked us to prohibit anyone from re-identifying de-identified health information.
Response: We do not have the authority to regulate persons other than covered entities, so we cannot affect attempts by entities outside of this rule to re-identify information. Under the rule, we permit the covered entity that created the de-identified information to re-identify it. However, we include a requirement that, when a unique record identifier is included in the de-identified information, such identifier must not be such that someone other than the covered entity could use it to identify the individual (such as when a derivative of the individual’s name is used as the unique record identifier).
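One way to satisfy the constraint on unique record identifiers, offered only as a sketch, is to assign each record a random code that carries no information about the individual and to keep the mapping back to the source record with the covered entity. The function name and the internal identifiers below are hypothetical, and the code length is an arbitrary choice.

```python
import secrets


def assign_record_codes(internal_ids):
    """Map a random 16-character hex code to each internal identifier.

    The codes are drawn at random rather than derived from names or other
    details about the individual. The returned mapping is the covered
    entity's re-identification key and would not be released.
    """
    crosswalk = {}
    for internal_id in internal_ids:
        code = secrets.token_hex(8)
        while code in crosswalk:  # regenerate on the rare collision
            code = secrets.token_hex(8)
        crosswalk[code] = internal_id
    return crosswalk


# Hypothetical internal identifiers, for illustration only.
crosswalk = assign_record_codes(["alice-smith", "bob-jones"])
```

Only the codes would accompany the de-identified records. By contrast, an identifier built from a derivative of the individual's name is the kind the response above says must not be used.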