Black Hat 2018: Voice Authentication is Broken, Researchers Say

Researchers crack voice authentication systems by recreating any voice using under ten minutes of sample audio.

LAS VEGAS – We live in a world increasingly dominated by voice-enabled smart digital assistants. More and more we rely on Amazon’s Alexa to tell us if we have any new messages. We ask Google Home smart speakers to remind us of calendar appointments. Some banks even allow users to use their voice as a password to access account data.

However, according to researchers John Seymour and Azeem Aqil, both with Salesforce’s research team, voice authentication for account access is extremely insecure. At a Black Hat session Thursday, the two showed how easy it is to spoof someone’s voice well enough to access protected accounts.

Voice synthesis, a technology that creates lifelike synthesized voices, can be used to recreate any person’s voice. The results are astounding, thanks to artificial intelligence systems such as Google’s WaveNet and Adobe’s Project VoCo.

“Recent advances in machine learning have shown that text-to-speech systems can generate synthetic, high-quality audio of subjects using audio recordings of their speech,” researchers said. “Current techniques for audio generation are good enough to spoof voice authentication algorithms.”

Hypothetical attacks discussed by the researchers include those against financial institutions that use voice authentication for limited account access.

Not discussed, but also vulnerable, are the myriad home automation applications that can be tied to specific user voice identities. For example, Google has introduced Voice Match into its fleet of smart home products, which links an individual’s voice with their Google account. This allows Google Home to tailor commands based on an identified voice.

The researchers demonstrated a proof of concept on stage using Siri and Microsoft’s Speaker Recognition API. They used the service Lyrebird to synthesize the target’s voice and successfully used the fake voice to authenticate with a Windows service.
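The attack targets the verification step of a speaker recognition service: the victim’s voice is already enrolled, and the attacker simply submits synthesized audio in its place. The sketch below illustrates that step in generic terms only; the endpoint URL, profile ID, header names and response fields are hypothetical placeholders, not the actual Microsoft Speaker Recognition API surface.

```python
# Illustrative sketch of the verification step the attack abuses.
# All endpoint details below are hypothetical placeholders.
import requests

VERIFY_URL = "https://example-speech-service/api/verify"  # hypothetical endpoint
PROFILE_ID = "victim-profile-0000"                        # hypothetical enrolled profile
API_KEY = "YOUR_SERVICE_KEY"                              # service credential

def verify_speaker(wav_path: str) -> bool:
    """Submit a WAV clip (here, synthesized audio) for speaker verification."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            VERIFY_URL,
            params={"profileId": PROFILE_ID},
            headers={"Api-Key": API_KEY, "Content-Type": "audio/wav"},
            data=f.read(),
        )
    resp.raise_for_status()
    # A typical verification response carries an accept/reject decision and a
    # confidence score; these field names are assumptions.
    result = resp.json()
    return result.get("result") == "Accept"

if __name__ == "__main__":
    print("Authenticated:", verify_speaker("synthesized_target_voice.wav"))
```

The point of the demonstration is that the service only checks whether the submitted audio matches the enrolled voiceprint, not whether a live human produced it.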

The hurdle for attackers attempting to spoof a voice well enough to bypass voice authentication is that the sample set of voice data needs to be huge. Some systems require up to 24 hours of high-quality voice samples before machine learning programs can process and recreate a voice.

But the researchers found that voice quality didn’t need to be perfect. It only needed to be good enough to trick a voice-protected feature, service or account.

“In our case we only focused on using text-to-speech to bypass voice authentication. So, we really do not care about the quality of our audio. It could sound like garbage to a human as long as it bypasses the speech APIs,” Seymour said.

Using a technique they developed, Seymour and Aqil were able to synthesize a target’s voice from a sample set of just 10 minutes of audio and spoof it using text-to-speech. That was enough, in many cases, to fool voice authentication systems and access a protected account.

The researchers’ proof-of-concept attack consisted of identifying a target and gathering about ten minutes of high-quality audio samples of the victim from public sources such as YouTube. Because the audio sample was so small, the researchers had to artificially lengthen it. They did that by, among other things, slowing audio samples down and changing their pitch.
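That kind of augmentation is straightforward to sketch. The snippet below, which assumes the librosa and soundfile Python libraries and uses illustrative stretch and pitch values rather than the researchers’ actual parameters, turns one clip into several lengthened and pitch-shifted variants.

```python
# Minimal sketch of the augmentation described above: stretch a short clip into
# more training material by slowing it down and shifting its pitch.
import librosa
import soundfile as sf

def augment_clip(path: str, out_prefix: str, sr: int = 22050) -> list[str]:
    y, _ = librosa.load(path, sr=sr)
    outputs = []

    # Slow the clip down slightly (a rate below 1.0 lengthens the audio).
    for rate in (0.9, 0.8):
        stretched = librosa.effects.time_stretch(y, rate=rate)
        out = f"{out_prefix}_stretch{rate}.wav"
        sf.write(out, stretched, sr)
        outputs.append(out)

    # Shift the pitch up and down by a couple of semitones.
    for steps in (-2, 2):
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        out = f"{out_prefix}_pitch{steps}.wav"
        sf.write(out, shifted, sr)
        outputs.append(out)

    return outputs
```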

“Even with these techniques that stretched ten minutes of audio into 30, that’s still nowhere near the 24 hours that text-to-speech algorithms need to train,” Seymour said.

To create a larger sample set, the researchers simply re-fed the clean 10-minute audio sample along with the 30-minute tweaked audio sample into Google’s Tacotron voice machine learning program.
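How the combined clips are actually fed into training depends on the implementation. As a rough sketch, assuming a common open-source Tacotron-style setup that reads a pipe-separated “path|transcript” metadata file (an assumption, not something described in the talk), assembling the original and tweaked clips might look like this; the directory layout and transcript lookup are also illustrative.

```python
# Sketch of combining original and augmented clips into one training manifest.
# The "path|transcript" format mirrors common open-source Tacotron scripts.
from pathlib import Path

def build_manifest(clip_dirs: list[str], transcripts: dict[str, str],
                   out_file: str = "metadata.csv") -> None:
    lines = []
    for d in clip_dirs:
        for wav in sorted(Path(d).glob("*.wav")):
            # Augmented clips reuse the transcript of the clip they came from;
            # keying on the original file stem is an assumption.
            stem = wav.stem.split("_")[0]
            text = transcripts.get(stem, "")
            if text:
                lines.append(f"{wav}|{text}")
    Path(out_file).write_text("\n".join(lines), encoding="utf-8")

# Example: original YouTube-derived clips plus stretched/pitch-shifted copies.
# build_manifest(["clips/original", "clips/augmented"],
#                transcripts={"clip001": "transcript of clip001"})
```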

The result was a raspy-sounding voice, but one with all the key vocal traits needed to fool voice authentication, the researchers said.

“Overall this is kind of scary. We just grabbed audio off YouTube, trained the model and generated new speech,” Seymour said.

The implications of spoofing a voice are considerable. Beyond using a fake voice to authenticate with a service, both researchers warned that voice phishing is also a feasible threat: an employee gets a call from someone who sounds like their boss asking them to wire money or approve payment of a fake invoice.

“There is a lot of cool research going on in this area,” Seymour said. “We’re only going to see more research into how machine learning can be used and abused.”

Their advice: It’s too easy to spoof someone’s voice, so don’t use voice as an authentication option.
