A group of researchers from Purdue University has been awarded $1.5 million from the National Science Foundation to help fund an ongoing project that’s investigating how well current techniques for anonymizing data are working and whether there’s a need for better methods.
The grant will support work by computer scientists and linguists who are studying how people can still be identified through textual clues even after explicitly identifying data has been removed. The Purdue anonymization project has been under way for some time and also includes researchers from other institutions, including Indiana University and the Kinsey Institute.
The question of how well data anonymization works has become an important one in recent years as the volume of data collected by advertisers, merchants, Web sites, health care organizations and other companies has grown sharply. That data is the lifeblood of many of these organizations, and they mine and analyze it constantly for new insights into customer behavior, buying patterns and potential marketing opportunities.
Consumers in many cases know little about how their data is collected, analyzed and sold to other companies, and privacy advocates have been pressing a variety of organizations to improve their disclosures, as well as their efforts to keep user data private. By way of compromise, some organizations have taken to anonymizing certain kinds of data by removing identifying fields, such as names, birth dates and Social Security numbers. And many data-protection laws have carved out exemptions for data breaches that involve anonymized data.
But there are questions about how well those techniques work, as well as whether the subsequent analysis of anonymized data has any validity.
“Textual data, even when explicit identifiers are removed (names, dates, locations), can contain highly identifiable information. For example, a sample of chief complaint fields from the Indiana Network for Patient Care (INPC) found several instances of ‘phantom limb pain’. Amputees can be visually identifiable, but the HIPAA Safe Harbor rules do not list this as ‘identifying information’. Any policy explicitly listing all types of identifying data is likely to fail. Through a joint effort with computer science and linguistics, the project is developing new methods to remove specific details from text while preserving meaning, eliminating such highly identifiable information without a priori knowledge of what would be identifying,” the Purdue team’s project page explains.
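The gap the researchers describe can be illustrated with a minimal Python sketch. It is not the Purdue team's method: the patterns, the function name redact_explicit_identifiers and the sample record are all hypothetical, and the scrubber only strips a fixed list of explicit identifiers, the kind of approach the project page says is likely to fail.

```python
import re

# Hypothetical patterns for two explicit identifier types; real
# de-identification tools handle many more formats than these.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def redact_explicit_identifiers(text, known_names):
    """Replace SSNs, slash-formatted dates and listed names with placeholders."""
    text = SSN_PATTERN.sub("[SSN]", text)
    text = DATE_PATTERN.sub("[DATE]", text)
    for name in known_names:
        # Names are matched literally and case-insensitively.
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


# Invented record for illustration only; not real patient data.
record = "Jane Doe, born 04/12/1961, SSN 123-45-6789, reports phantom limb pain."
scrubbed = redact_explicit_identifiers(record, ["Jane Doe"])
print(scrubbed)
```

The name, date and Social Security number are replaced, but the phrase “phantom limb pain” passes through untouched, because nothing in the fixed list anticipates it.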
The project, led by Chris Clifton, Victor Raskin, Chyi-Kong Chang, and Luo Si at Purdue, is a long-term effort that encompasses not just computer science approaches, but also linguistic analysis.