November 2, 2010, 10:33AM

National Science Foundation Funds Purdue Data-Anonymization Project

A group of researchers from Purdue University has been awarded $1.5 million from the National Science Foundation to help fund an ongoing project that's investigating how well current techniques for anonymizing data are working and whether there's a need for better methods.

The grant will help the researchers advance their work, which brings together computer scientists and linguists who are looking at ways in which people can still be identified through textual clues even after explicitly identifying data has been removed. The Purdue anonymization project has been under way for some time and also includes researchers from other institutions, including Indiana University and the Kinsey Institute.

The question of how well data anonymization works has become an important one in recent years as the volume of data collected by advertisers, merchants, Web sites, health care organizations and other organizations has grown rapidly. That data is the lifeblood of many of these organizations, and they mine and analyze it constantly for new insights into customer behavior, buying patterns and potential marketing opportunities.

Consumers in many cases know little about how their data is collected, analyzed and sold to other companies, and privacy advocates have been putting pressure on a variety of organizations to improve their disclosures, as well as their efforts to keep user data private. By way of compromise, some organizations have taken to anonymizing certain kinds of data by removing identifiable portions, such as names, birth dates and Social Security numbers. And many data-protection laws have carved out exemptions for data breaches that involve anonymized data.
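
As a rough illustration of the field-removal approach described above, the following Python sketch drops a fixed list of explicit identifiers from a record. It is not taken from any organization's actual process: the field names, the sample record and the function `strip_explicit_identifiers` are all hypothetical, chosen only to show the idea.

```python
# Hypothetical illustration only: the field names and the sample record
# below are invented and do not come from any real system.
EXPLICIT_IDENTIFIERS = {"name", "birth_date", "ssn"}


def strip_explicit_identifiers(record: dict) -> dict:
    """Return a copy of the record without the explicitly identifying fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in EXPLICIT_IDENTIFIERS
    }


record = {
    "name": "Jane Doe",
    "birth_date": "1970-01-01",
    "ssn": "000-00-0000",
    "chief_complaint": "phantom limb pain",  # free text, not on the list
}

anonymized = strip_explicit_identifiers(record)
print(anonymized)
```

Note that a filter like this only removes the fields it has been told about. Any free-text field that is not on the list passes through unchanged, whatever it happens to contain.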

But there are questions about how well those techniques work, as well as whether the subsequent analysis of anonymized data has any validity.

"Textual data, even when explicit identifiers are removed (names, dates, locations), can contain highly identifiable information. For example, a sample of chief complaint fields from the Indiana Network for Patient Care (INPC) found several instances of "phantom limb pain". Amputees can be visually identifiable, but the HIPAA Safe Harbor rules do not list this as "identifying information". Any policy explicitly listing all types of identifying data is likely to fail. Through a joint effort with computer science and linguistics, the project is developing new methods to remove specific details from text while preserving meaning, eliminating such highly identifiable information without a priori knowledge of what would be identifying," the Purdue team's project page explains.

The project, led by Chris Clifton, Victor Raskin, Chyi-Kong Chang, and Luo Si at Purdue, is a long-term effort that encompasses not just computer science approaches, but also linguistic analysis.


Copyright © 2012 threatpost.com