Adversary Simulation with Voice Cloning in Real Time, Part 1

Every day, blog posts and news articles warn us about the danger of artificial intelligence (AI) and how the technology behind it can be used by criminals to perform sophisticated attacks.

Our clients often ask, “Should we be worried?” Emerging technologies such as ChatGPT and other generative AI systems give adversaries new tools to enhance their attack capabilities. In our professional experience, persistent social engineering attempts remain the most effective way of breaching organizations with mature security programs. Combining generative AI with traditional social engineering attack vectors can greatly enhance the effectiveness of these attacks.

To prepare, organizations should assess themselves against these new attacks in a controlled but realistic simulation. In this post, we will demonstrate a proof of concept that combines emerging technologies and classic techniques to create a dynamic new method of social engineering.

When social engineering a target, the caller needs to sound believable to meet their objectives: soliciting sensitive information or eliciting a sensitive action. Believability is important, not just so the target believes the caller, but so the whole conversation is perceived as legitimate. A successful social engineering campaign isn’t just one where a target hands over credentials or resets a password, but one where the target feels safe and does not report the call as suspicious after the engagement.

Existing Roadblocks for Impersonation

Until now, it has been difficult to replicate a voice of the opposite gender for social engineering engagements, even with the most sophisticated voice changing software and hardware. Solutions without AI sound unrealistic, are error-prone, and depend heavily on the speaker’s vocal range. Even though existing software such as Clownfish or MorphVOX has built-in pitch correction, it still requires the speaker to speak in a higher or lower voice and maintain a consistent pitch. This issue persists even with popular voice changing hardware such as the GoXLR or Roland VT-4. An attacker attempting to pass themselves off as a different gender using these tools would be easily laughed off by most human listeners, since the output sounds robotic and generally unrealistic.
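To illustrate why pitch-based tools lean so heavily on the speaker, here is a minimal sketch of a fixed-ratio pitch shift. This is an assumption made for illustration only; it is not taken from Clownfish, MorphVOX, or any hardware named above, and those products may work differently internally. The function names, the eight-semitone shift, and the sample pitch values are all hypothetical.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` (12 semitones = one octave)."""
    return 2.0 ** (semitones / 12.0)


def shift_pitch(freq_hz: float, semitones: float) -> float:
    """Apply a fixed-ratio pitch shift to a single fundamental frequency."""
    return freq_hz * semitone_ratio(semitones)


# Hypothetical speaker whose pitch drifts over a sentence (values in Hz).
speaker_pitch_hz = [110.0, 120.0, 135.0, 105.0]

# Shift every value up by the same number of semitones (hypothetical setting).
shifted = [shift_pitch(f, semitones=8) for f in speaker_pitch_hz]

for before, after in zip(speaker_pitch_hz, shifted):
    print(f"{before:6.1f} Hz -> {after:6.1f} Hz")
```

Under this assumption, every drift in the speaker's pitch shows up in the output, scaled by the same ratio, while cadence and accent pass through unchanged. That is consistent with the observation above that the speaker must hold a steady pitch to get usable results.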

Speech patterns, intonation, and flow are also difficult for individuals to impersonate. For example, differences in age are often easy to detect due to cadence and inflection. If the impersonated speaker (or “target speaker”) has an accent, emulation becomes even more difficult. As a result, the success of current social engineering campaigns often depends on luck. In smaller companies or in-person workspaces, employees are usually familiar with one another, so a social engineer has a hard time impersonating a colleague: a listener who knows the real voice is likely to notice the mismatch and raise a red flag. Voice cloning addresses all of these issues and more, but until now, finding the right technology has proven difficult.

AI Hype Train

Generative AI offers a fundamental change in voice changing technology, enabling red team operators to switch scenarios and caller voices on the fly, with a much broader spectrum of options than a single team could otherwise emulate. Tevora experimented with several free and open source AI-based options, including open source projects such as Real-Time-Voice-Cloning and services such as Resemble.AI, but found them insufficient to mount a convincing attack.

That’s when we discovered Respeecher, which offers a compelling and powerful voice changing system suitable for use in real-time social engineering attacks. Partnering with Respeecher, we were able to use their RTC demo instance to execute proof of concept attacks.

The setup needed to execute this attack is relatively simple, but it does rely on a few third-party tools such as Google Voice, SpoofCard, BlackHole, and Audio Hijack. For detailed instructions on setting up your own instance of Respeecher to perform real-time voice cloning, see Part 2 of this blog series.

Voice Cloning Demonstration

Respeecher is a company that researches voice cloning through generative AI. The demonstration instance they spun up for us included several preset voices that work in real time. The Respeecher instance itself requires no user setup, and with the provided pre-trained models, it takes only a few seconds to switch between voices. Critically, the target voice model accounts for the intonation, accent, and other nuances of the cloned voice. As an American male, it was fascinating to hear my voice as an Eastern European female. The sample set provided in the Respeecher demo has several models from different regions.

We captured audio showing how powerful these tools are when combined, which you can hear below:

These results are fairly convincing, but believability could be further enhanced using a target speaker from the organization that is being tested. In an AI-assisted social engineering campaign, a red team engineer could, for example, take audio of an organization’s CEO speaking at a TED talk and use it to train an AI model. Voice cloning only requires about five seconds of audio to create a convincing voice model. Using Respeecher and the other AI-assisted tools, the attacker could then realistically render the CEO’s voice in real time on social engineering calls, increasing the likelihood that targeted individuals would give up sensitive information or take sensitive actions.

The Future

Real-time voice cloning combined with spoofed numbers has a serious impact, not just in the workplace but in other aspects of life. Some financial institutions, such as Capital One and HSBC, have started using voice recognition for banking. This could become problematic, as we have demonstrated how simple it is to clone a voice convincingly. In general, being able to execute this attack chain will greatly increase the success rate of phone pretexts and social engineering attacks. An attacker can assume any identity, gender, age, and even nationality on a call and can transition seamlessly to a completely different identity while the call is live. This adds realism and can be used to exploit the way people recognize one another by voice. The cloning technology that exists today also requires only a few seconds of speech. For an adversary, this attack chain is low risk and high reward. It also enables persistent attempts against the same target, because switching to a different identity takes only a split second.

Our answer to the question, “Should we be worried?” is, “To a healthy degree.” As these technologies become public and more accessible, new defenses and a healthy dose of skepticism will be needed. When executed properly, true adversary simulation engagements often end with questions like, “How do we protect against this?” Our standard response is security awareness training and technical security controls, such as MFA, that can compensate for the risk, but the better answer is to run consistent attack simulations that highlight the deficiencies in your organization. Tevora’s adversary simulation team is always conducting research and weaponizing advanced techniques to identify attack chains before adversaries use them.