The Internet may make many promises, but anonymity isn’t always one of them. Users who prize their privacy, for example, often turn to Tor and similar services to shield their web activity from prying eyes, yet that hasn’t stopped the FBI and researchers from trying to uncloak people on that network.
On the open Internet, users leave behind breadcrumbs that hint at their interests and locations on the sites they visit, data that advertisers and other services track in order to deliver targeted advertising in the browser.
A team of academics from Princeton and Stanford universities has gone a step further and figured out how to reveal a user’s identity from links clicked on in their Twitter feed. The researchers built a desktop Google Chrome extension called Footprints as a proof of concept that combs a user’s browser history for links clicked on from Twitter.
The extension gathers all Twitter links from the last 30 days that are still in a user’s browsing history and, after giving the user the opportunity to review them, sends them to the tool. The tool then returns, in less than a minute, a list of 15 Twitter profiles that are a likely match; the extension then deletes itself, the researchers said.
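The collection step can be pictured with a short Python sketch. It assumes the browsing history is available as (URL, visit time) pairs and that a link clicked from Twitter can be recognized by the t.co shortener host. Both are illustrative assumptions, and `recent_twitter_links` is a hypothetical helper, not code from Footprints.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Assumption: links clicked from Twitter are identified by this shortener host.
TWITTER_LINK_HOSTS = {"t.co"}


def recent_twitter_links(history, now, days=30):
    """Return unique Twitter-shortened URLs visited within the last `days` days.

    `history` is an iterable of (url, visited_at) pairs. This layout is
    hypothetical and stands in for whatever the browser actually exposes.
    """
    cutoff = now - timedelta(days=days)
    seen = set()
    links = []
    for url, visited_at in history:
        host = urlparse(url).hostname or ""
        if visited_at >= cutoff and host in TWITTER_LINK_HOSTS and url not in seen:
            seen.add(url)
            links.append(url)  # keep first-seen order, drop repeat visits
    return links
```

Older visits, non-Twitter links and repeat visits are dropped, leaving the short list a user would review before submitting it.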
“We were interested in how much information leak there is when browsing the Web,” said Sharad Goel, assistant professor at Stanford in the Department of Management Science and Engineering. Goel, along with Stanford students Ansh Shukla and Jessica Su and Princeton professor Arvind Narayanan, developed Footprints.
“We want to raise awareness and inform policy,” Goel said. “This is more of an academic demonstration. We’re not trying to make the tool available to other people, it’s mostly about raising awareness.”
A tool like this would allow a business already tracking a user’s information to correlate it with Twitter traffic to make a best guess as to the user’s identity. It would do so, Goel said, by analyzing the anonymized browsing history and running a similarity match against Twitter traffic to rank the overlaps and arrive at a conclusion.
In a post published to the Freedom to Tinker website, Su wrote that people’s social networks are distinct and made up of family, friends and colleagues, resulting in a distinctive set of links in one’s Twitter feed.
“Given only the set of web pages an individual has visited, we determine which social media feeds are most similar to it, yielding a list of candidate users who likely generated that web browsing history,” Su wrote. “In this manner, we can tie a person’s real-world identity to the near complete set of links they have visited, including links that were never posted on any social media site. This method requires only that one click on the links appearing in their social media feeds, not that they post any content.”
The researchers said there were two challenges to be worked out. The first was quantifying how similar a social media feed is to a web browsing history. That seems simple, but a straightforward comparison does not account for users with an excessively large number of followers, which could also include bots. Goel said those feeds were penalized in this exercise because their size, and the number of links they may contain, could skew results.
“We posit a stylized, probabilistic model of web browsing behavior, and then compute the likelihood a user with that social media feed generated the observed browsing history,” Su wrote. “It turns out that this method is approximately equivalent to scaling the fraction of history links that appear in the feed by the log of the feed size.”
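Su’s description can be read in more than one way. The Python sketch below takes one plausible reading: the fraction of history links found in a feed, weighted by the log of how small that feed is relative to the whole pool of links, so that large feeds are penalized. The exact weighting, the `universe_size` parameter and the function name are assumptions for illustration, not the researchers’ published formula.

```python
import math


def feed_similarity(history_links, feed_links, universe_size):
    """Score how well a feed explains a browsing history.

    Assumed form: (fraction of history links found in the feed)
    * log(universe_size / feed_size). `universe_size` is a hypothetical
    count of all distinct links across every feed being considered.
    """
    history = set(history_links)
    feed = set(feed_links)
    if not history or not feed:
        return 0.0
    if universe_size < len(feed):
        raise ValueError("universe_size must be at least the feed size")
    fraction = len(history & feed) / len(history)
    # A larger feed yields a smaller log weight, which penalizes it.
    return fraction * math.log(universe_size / len(feed))
```

Under this reading, a 10-link feed containing three of four history links outscores a 1,000-link feed containing all four when the pool holds a million links, because the larger feed would match many histories by chance.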
The demonstration uses Twitter feeds because they are for the most part public. The researchers heuristically narrowed the number of feeds to be searched and then applied their similarity measure to arrive at the final result, Su said.
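The article does not say which heuristic the researchers used to narrow the search. As a rough Python sketch, assume candidates are limited to feeds that share at least one link with the history, found through an inverted index, and that the survivors are then ranked by a similarity score. The `rank_candidates` and `overlap_fraction` names, the `min_shared` threshold and the placeholder score are all hypothetical.

```python
import heapq
from collections import defaultdict


def overlap_fraction(history, feed):
    """Placeholder score: share of history links that appear in the feed."""
    return len(history & feed) / len(history) if history else 0.0


def rank_candidates(history_links, feeds, score=overlap_fraction,
                    top_k=15, min_shared=1):
    """Return up to `top_k` account names whose feeds best match the history.

    `feeds` maps an account name to the set of links in that account's feed.
    """
    history = set(history_links)

    # Inverted index: link -> accounts whose feed contains it.
    # (A real system would build this once, not on every query.)
    index = defaultdict(set)
    for account, links in feeds.items():
        for link in links:
            index[link].add(account)

    # Assumed narrowing heuristic: keep feeds sharing >= min_shared links.
    shared = defaultdict(int)
    for link in history:
        for account in index.get(link, ()):
            shared[account] += 1
    candidates = [a for a, n in shared.items() if n >= min_shared]

    # Apply the similarity measure only to the surviving candidates.
    return heapq.nlargest(top_k, candidates,
                          key=lambda a: score(history, feeds[a]))
```

Feeds with no links in common with the history are never scored, which is what keeps the expensive comparison limited to a small candidate set.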
Goel said he expects the tool to remain available for the time being as the researchers continue to collect data and refine the demo. A paper is expected to follow in the next few weeks, he said.