Data-Enriched Profiles on 1.2B People Exposed in Gigantic Leak

Although the data was legitimately scraped by legally operating firms, the security and privacy implications are numerous.

An open Elasticsearch server has exposed the rich profiles of more than 1.2 billion people to the open internet.

First found on October 16 by researchers Bob Diachenko and Vinny Troia, the database contains more than 4 terabytes of data. It consists of scraped information from social media sources like Facebook and LinkedIn, combined with names, personal and work email addresses, phone numbers, Twitter and GitHub URLs, and other data commonly available from data brokers – i.e., companies that specialize in supporting targeted advertising, marketing and messaging services.

Taken together, the profiles provide a 360-degree view of individuals, including their employment and education histories. All of the information was unprotected, with no login needed to access it.

“It is a comprehensive dataset collected from B2B [business-to-business] lead-generation companies’ lists,” Diachenko told Threatpost via Twitter.

If accessed by cybercriminals, the data, which includes scores of related accounts tied to each individual, could be used for highly effective, targeted phishing attacks, business email compromises and identity theft, among other things.

“Information like this is extremely useful to criminals as a starting point in hacking a number of related accounts and also lends itself the potential for increased credential stuffing attacks,” Carl Wearn, head of e-crime at Mimecast, said via email. “This information obviously also provides a fantastic treasure trove of information for the means of industrial, political and state-related espionage and there are multiple malicious uses for the data leaked from this breach.”

For affected consumers, remediation is no picnic, either.

“Data breaches that expose information such as phone numbers to personal accounts like email or social accounts are just as serious as ones that expose payment information,” Zack Allen, director of threat operations at ZeroFOX, told Threatpost. “Luckily for payment information, you can change your credit card, or your password to your accounts. But what can victims of this breach do when their phone number and Facebook profile is leaked? Changing your phone number can cost money with your carrier, you also have to update all of your contacts with your new phone number, plus all of your two-factor accounts.”

Diachenko and Troia’s investigation uncovered that the data sets came from two separate lead-generation companies, whose business it is to assemble highly detailed profiles of individuals: People Data Labs (PDL) and OxyData[.]io.

“The majority of the data spanned four separate data indexes, labeled ‘PDL’ and ‘OXY,’ with information on roughly 1 billion people per index,” the researchers wrote in a writeup on Friday. “Each user record within the databases was labeled with a ‘source’ field that matched either PDL or Oxy, respectively.”

When notified, both companies said the server in question did not belong to them. However, the data certainly appeared to.

“In order to test whether or not the data belonged to PDL, we created a free account on their website which provides users with 1,000 free people lookups per month,” the researchers explained. “The data discovered on the open Elasticsearch server was almost a complete match to the data being returned by the People Data Labs API. To confirm, we randomly tested 50 other users and the results were always consistent.”
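The spot check the researchers describe, sampling records and comparing them field by field against a reference source, can be sketched roughly as follows. This is an illustrative sketch only, not the researchers' actual tooling: the function names, the record layout and the `reference_lookup` callable are all assumptions made for the example, and no real API is called.

```python
import random


def field_match_ratio(leaked: dict, reference: dict) -> float:
    """Return the fraction of reference fields whose values match the leaked record."""
    if not reference:
        return 0.0
    matches = sum(1 for key, value in reference.items() if leaked.get(key) == value)
    return matches / len(reference)


def sample_and_compare(leaked_records, reference_lookup, sample_size=50, seed=0):
    """Compare a random sample of leaked records against a reference source.

    leaked_records: dict mapping a record ID to a dict of fields (hypothetical layout).
    reference_lookup: callable returning the reference record for an ID (hypothetical).
    """
    rng = random.Random(seed)
    ids = sorted(leaked_records)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    return {
        record_id: field_match_ratio(leaked_records[record_id], reference_lookup(record_id))
        for record_id in sample
    }
```

A ratio close to 1.0 across the whole sample would correspond to the "almost a complete match" the researchers reported.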

OxyData meanwhile sent Diachenko a copy of his profile, and the data fields also matched.

The researchers said they were unsure how the data came to be collected in the now-closed database. Could it belong to a customer of both PDL and OxyData, they wondered? Or had the data been stolen and placed in the storage bucket by hackers? The only clues as to the owner of the server were the IP address and the fact that it was hosted with Google Cloud.

While the incident is not a data breach per se (but rather a story of yet another misconfigured server), it brings up two different concerns. First, what liability do the data originators (PDL and OxyData) have to the people whose profiles were exposed? Second, even though the information is aggregated from allegedly public sources, what does this kind of “data enrichment” mean from a privacy perspective?

To the first concern, Kelly White, CEO at RiskRecon, believes that the lead-generation companies are on the hook for the exposure.

“Data…is easily and perfectly replicable,” she said via email. “Every location where the asset exists must be known and protected. This requires that purveyors of sensitive data know their customers well and for what purposes they will use the data. Regulators are increasingly holding the original aggregators of sensitive data responsible for the protection of sensitive information, regardless of where it is stored or to whom they share it with. As such, while the originator of this data may not have been breached, they will likely suffer blowback.”

Diachenko took a similar view: “One could argue that because PDL’s data was mis-used, it is up to them to notify their customers.”

To the second concern, the privacy implications around rich personal profiles continue to be a source of discussion. “Collected information on a single person can include information such as household sizes, finances and income, political and religious preferences, and even a person’s preferred social activities,” noted Diachenko and Troia, in their posting.

Worryingly, some of that information can come from sources that are decidedly not public. For instance, one of the phone numbers returned for Diachenko’s profile was an old landline that came as part of an AT&T TV bundle. “The landline was never used and never given to anyone – I never actually owned a phone, yet somehow this information appears in my profile,” he said.

The most famous example of the misuse of such profiling is the Cambridge Analytica scandal, in which Facebook allowed a third-party application to hand over the data of up to 50 million platform users to the company. That data was then combined with other data to create highly detailed profiles that the Trump campaign used to micro-target population segments with 2016 election messaging.

This latest revelation of the breadth of such data enrichment underscores that even after Cambridge Analytica, privacy practices have not moved forward, Diachenko noted.

“Due to the sheer amount of personal information included, combined with the complexities identifying the data owner, this has the potential raise questions on the effectiveness of our current privacy and breach notification laws,” he said.

Mimecast’s Wearn agreed: “This particular breach highlights the trade in personal details which takes place and the inherent risks to this normalized and relatively uncontrolled practice,” he said. “Due to its scale, it will undoubtably add to calls for better regulation and security in relation to the storage of personal data.”
