Audio-jacking: Using generative AI to distort live audio transactions


The rise of generative AI, including text-to-image, text-to-speech and large language models (LLMs), has significantly changed our work and personal lives. While these advancements offer many benefits, they have also presented new challenges and risks. Specifically, there has been an increase in threat actors who attempt to exploit large language models to create phishing emails and use generative AI, like fake voices, to scam people.

We recently published research showcasing how adversaries could hypnotize LLMs to serve nefarious purposes simply with the use of English prompts. But in a bid to continue exploring this new attack surface, we didn’t stop there. In this blog, we present a successful attempt to intercept and “hijack” a live conversation, and use LLM to understand the conversation in order to manipulate the audio output unbeknownst to the speakers for a malicious purpose.

The concept is similar to thread-jacking attacks, of which X-Force saw an uptick last year, but instead of gaining access and replying to email threads, this attack would allow the adversary to silently manipulate the outcomes of an audio call. The result: we were able to modify the details of a live financial conversation occurring between the two speakers, diverting money to a fake adversarial account (an inexistent one in this case), instead of the intended recipient, without the speakers realizing their call was compromised. The audio files are available further down this blog.

Alarmingly, it was fairly easy to construct this highly intrusive capability, creating a significant concern about its use by an attacker driven by monetary incentives and limited to no lawful boundary.

Weaponizing generative AI combos

The emergence of new use cases that combine different types of generative AI is an exciting development. For instance, we can use LLMs to create a detailed description and then use text-to-image to produce realistic pictures. We can even automate the process of writing storybooks with this approach. However, this trend has led us to wonder: could threat actors also start combining different types of generative AI to conduct more sophisticated attacks?

During our exploration, we discovered a method to dynamically modify the context of a live conversation utilizing LLM, speech-to-text, text-to-speech and voice cloning. Rather than using generative AI to create a fake voice for the entire conversation, which is relatively easy to detect, we discovered a way to intercept a live conversation and replace keywords based on the context. For the purposes of the experiment, the keyword we used was “bank account,” so whenever anyone mentioned their bank account, we instructed the LLM to replace their bank account number with a fake one. With this, threat actors can replace any bank account with theirs, using a cloned voice, without being noticed. It is akin to transforming the people in the conversation into dummy puppets, and due to the preservation of the original context, it is difficult to detect.

The silent hijack

We can carry out this attack in various ways. For example, it could be through malware installed on the victims’ phones or a malicious or compromised Voice over IP (VoIP) service. It is also possible for threat actors to call two victims simultaneously to initiate a conversation between them, but that requires advanced social engineering skills.

To demonstrate this attack scenario, we created a proof-of-concept. Our program acts as a man-in-the-middle, monitoring a live conversation. We used speech-to-text to convert voice into text and utilized the LLM to understand the context of the conversation. We instructed the LLM to modify the sentence when anyone mentions a bank account. If nothing needs to be changed, the program will repeat what the victim said. However, when the LLM modifies the sentence, the program uses text-to-speech with pre-cloned voices to generate and play the audio. The following sequence diagram illustrates how the program alters the context on the fly, making it ultra-realistic for both sides.

In the PoC, we made modifications only to the bank account. However, it is possible to instruct the LLM to modify any financial information, such as accounts on mobile applications and digital payment services. In fact, the LLM modifications aren’t limited to financial information — it could also modify medical information, such as blood type and allergies in conversations; it could command an analyst to sell or buy a stock; it could instruct a pilot to reroute. Depending on how the PoC capability is used it could pose a life-threatening risk to victims. However, it’s important to note that the more intricate the conversation a threat actor tries to manipulate — where protocols and processes may be involved — the more complex and advanced social engineering skills are warranted to preserve a successful silent hijack.

Building this PoC was surprisingly and scarily easy. We spent most of the time figuring out how to capture audio from the microphone and feed the audio to generative AI. Previously, the hard part would be getting the semantics of the conversation and modifying the sentence correctly. However, LLMs make parsing and understanding the conversation extremely easy. For instance, when one person says their bank account and the other responds with “Could you repeat it,” the LLM understands that “it” refers to the bank account and modifies the content accordingly. We created the following instructions for the LLM to monitor and modify a live conversation.

You are a super assistant. You will help me to process my future messages. Please follow the following rules to process my future messages:
1. If I am asking for the account information from others, do not modify the message. Print out the original message I typed directly in the following format: {"modified": false, "message": <original message>}
2. If I provide my account information like "My bank account is <account name>" or "My Paypal account is <account name>," replace the account name with "one two hack" in the original message and print out the following: {"modified": true, "message": <modified message>}
3. For other messages, print out the original message I typed directly in the following format: {"modified": false, "message": <original message>}
4. Before you print out any response, make sure you follow the rules above.

Another difficulty we faced in the past was in creating realistic fake voices using other people’s sounds. However, nowadays, we only need three seconds of an individual’s voice to clone it and use a text-to-speech API to generate authentic fake voices.

Here is the pseudo-code of the PoC. It is clear that generative AI lowers the bar for creating sophisticated attacks:

def puppet(new_sentence_audio):
     response = llm.predict(speech_to_text(new_sentence_audio))
     if response[‘modified’]:
          play(text_to_speech(response[‘message’]))    else: play(new_sentence_audio)

While the PoC was easy to build, we encountered some barriers that limited the persuasiveness of the hijack in certain circumstances — none of which however are irreparable.

The first one was latency due to GPU. In the demo video, there were some delays during the conversation due to the PoC needing to access the LLM and text-to-speech APIs remotely. To address this, we built artificial pauses into the PoC to reduce suspicion. So while the PoC was activating upon hearing the keyword “bank account” and pulling up the malicious bank account to insert into the conversation, the lag was covered with bridging phrases such as “Sure, just give me a second to pull it up.” However, with enough GPU on our device, we can process the information in near real-time, eliminating the latency between sentences. To make these attacks more realistic and scalable, threat actors require a significant amount of GPU locally, which could be used as an indicator to identify upcoming campaigns.

Secondly, the persuasiveness of the attack is contingent on the victims’ voice cloning — the more that the cloning accounts for tone of voice and speed, the easier it will blend into the authentic conversation.

Below we present both sides of the conversation to showcase what was heard versus what was said.

Hijacked audio
Original audio

As the audio samples illustrate, upon hearing the keyword “bank account” the PoC distorted the audio, replacing “my bank account is 1-2-3-4-5-6” with “my bank account is 1-2-hack,” which is preceded with the filler “give me one second to look it up” to cover some of the lag due to the PoC requiring a few extra seconds to activate.

Building trust in the era of distortion

We conducted a PoC to explore the potential use of generative AI by malicious actors in creating sophisticated attacks. Our research revealed that using LLMs can make it easier to develop such programs. It is alarming that these attacks could turn victims into puppets controlled by the attackers. Taking this one step further, it is important to consider the possibility of a new form of censorship. With existing models that can convert text into video, it is theoretically possible to intercept a live-streamed video, such as news on TV, and replace the original content with a manipulated one.

While the proliferation of use cases for LLMs marks a new era of AI, we must be mindful that new technologies come with new risks, and we cannot afford to rush headlong into this journey. Risks already exist today that could serve as an attack surface for this PoC. Vulnerable applications and VoIP software have been shown to be vulnerable to MiTM attacks before.

The maturity of this PoC would signal a significant risk to consumers foremost — particularly to demographics who are more susceptible to today’s social engineering scams. The more this attack is refined the wider net of victims it could cast. What are signs and tips to increase consumer vigilance against such threats?

  • Paraphrase & repeat — Generative AI is an intuitive technology, but it cannot outperform human intuition in a natural language setting such as a live conversation. If something sounds off in a conversation wherein sensitive information is being discussed, paraphrase and repeat the dialogue to ensure accuracy.
  • Security will adapt — Just as technologies today exist to help detect deep fake videos, so will technologies adapt to deep fake audios, helping detect less advanced attempts to perform silent hijacks.
  • Best practices stand the test of time as the first line of defense — Initial compromise largely remains the same. In other words, for attackers to execute this type of attack, the easiest way would be to compromise a user’s device, such as their phone or laptop. Phishing, vulnerability exploitation and using compromised credentials remain attackers’ top threat vectors of choice, which creates a defensible line for consumers, by adopting today’s well-known best practices, including not clicking on suspicious links or opening attachments, updating software and using strong password hygiene.
  • Use trusted devices & services — Apps, devices or services with poor security considerations are an easy vessel for attackers to execute attacks. Ensure you’re constantly applying patches or installing software updates to your devices, and be security-minded when engaging with services you’re not familiar with.

Generative AI beholds many unknowns, and as we’ve said before it is incumbent on the broader community to collectively work toward unfolding the true size of this attack surface — for us to better prepare for and defend against it. However, it’s also crucial that we recognize and further emphasize that trusted and secure AI is not confined to the AI models themselves. The broader infrastructure must be a defensive mechanism for our AI models and AI-driven attacks. This is an area that we have many decades of experience in, building security, privacy and compliance standards into today’s advanced and distributed IT environments.

Learn more about how IBM can help businesses accelerate their AI journey securely here.

For more information on IBM’s security research, threat intelligence and hacker-led insights, visit the X-Force Research Hub.

The post Audio-jacking: Using generative AI to distort live audio transactions appeared first on Security Intelligence.