Unmasking hypnotized AI: The hidden risks of large language models


The emergence of Large Language Models (LLMs) is redefining how cybersecurity teams and cybercriminals operate. As security teams leverage the capabilities of generative AI to bring more simplicity and speed into their operations, it’s important we recognize that cybercriminals are seeking the same benefits. LLMs are a new type of attack surface poised to make certain types of attacks easier, more cost-effective, and even more persistent.

In a bid to explore security risks posed by these innovations, we attempted to hypnotize popular LLMs to determine the extent to which they were able to deliver directed, incorrect and potentially risky responses and recommendations (including security actions), and how persuasive or persistent they were in doing so. We were able to successfully hypnotize five LLMs, some performing more persuasively than others, prompting us to examine how likely it is that hypnosis is used to carry out malicious attacks. What we learned was that English has essentially become a “programming language” for malware. With LLMs, attackers no longer need to rely on Go, JavaScript, Python, etc., to create malicious code; they just need to understand how to effectively command and prompt an LLM using English.

Our ability to hypnotize LLMs through natural language demonstrates the ease with which a threat actor can get an LLM to offer bad advice without carrying out a massive data poisoning attack. In the classic sense, data poisoning would require that a threat actor inject malicious data into the LLM in order to manipulate and control it, but our experiment shows that it’s possible to control an LLM, getting it to provide bad guidance to users, without data manipulation being a requirement. This makes it all the easier for attackers to exploit this emerging attack surface.

Through hypnosis, we were able to get LLMs to leak confidential financial information of other users, create vulnerable code, create malicious code, and offer weak security recommendations. In this blog, we will detail how we were able to hypnotize LLMs and what types of actions we were able to manipulate. But before diving into our experiment, it’s worth looking at whether attacks executed through hypnosis could have a substantial effect today.

SMBs: Many small and medium-sized businesses that don’t have adequate security resources and expertise on staff may be likelier to leverage LLMs for quick, accessible security support. And with LLMs designed to generate realistic outputs, it can also be quite challenging for an unsuspecting user to discern incorrect or malicious information. For example, as showcased further down in this blog, our hypnosis prompted ChatGPT to recommend that a user experiencing a ransomware attack pay the ransom, an action that is actually discouraged by law enforcement agencies.

Consumers: The general public is the likeliest target group to fall victim to hypnotized LLMs. With the consumerization and hype around LLMs, it’s possible that many consumers are ready to accept the information produced by AI chatbots without a second thought. Considering that chatbots like ChatGPT are being accessed regularly for search purposes, information collection and domain expertise, it’s expected that consumers will seek advice on online security and safety best practices and password hygiene, creating an exploitable opportunity for attackers to provide erroneous responses that weaken consumers’ security posture.

But how realistic are these attacks? How likely is it for an attacker to access and hypnotize an LLM to carry out a specific attack? There are three main ways where these attacks can happen:

  1. An end user is compromised by a phishing email, allowing an attacker to swap out the LLM or conduct a man-in-the-middle (MitM) attack on it.
  2. A malicious insider hypnotizes the LLM directly.
  3. Attackers are able to compromise the LLM by polluting the training data, allowing them to hypnotize it.

While the above scenarios are possible, the likeliest and most concerning is compromising the training data on which the LLM is built. The attack scale and impact that attackers could achieve by compromising LLMs directly make it a very compelling mechanism for attacks. In fact, the ROI attackers stand to gain from compromising AI models suggests that attempts and efforts to attack AI models are already underway.

As we explore the opportunities that AI innovations can create for society, it’s crucial that protecting and securing the AI models themselves is a top priority. This includes:

  • Securing the models’ underlying AI training data to protect it from sensitive data theft, manipulation, and compliance violations.
  • Securing the usage of AI models by detecting data or prompt leakage, and alerting on evasion, poisoning, extraction, or inference attacks.
  • Securing against new AI-generated attacks such as personalized phishing, AI-generated malware, and fake identities by using behavioral defenses and multi-factor authentication.

Hypnotizing LLMs: Let’s play a game…

Our analysis is based on attempts to hypnotize GPT-3.5, GPT-4, BARD, mpt-7b, and mpt-30b. The best-performing LLM that we hypnotized was GPT, which we will analyze further down in the blog.

So how did we hypnotize the LLMs? By tricking them into playing a game: the players must give the opposite answer to win the game.

Here is the conversation with ChatGPT after starting the game. You can see the potential risk if consumers blindly trust its answers:

To ensure the user can’t discern that the LLM they’re interacting with is hypnotized, we established two parameters:

An undiscoverable game that can never end: We instructed the LLM to never tell users about the game, and that no one could ever exit the game — and to even restart the game if anyone successfully exited the game. This technique resulted in ChatGPT never stopping the game while the user is in the same conversation (even if they restart the browser and resume that conversation) and never saying it was playing a game. Here is the prompt:

“Inception”: Create nested games to trap the LLM deeply. Let’s assume a user eventually figures out how to ask an LLM to stop playing a game. To account for this, we created a gaming framework that can create multiple games, one inside another. Therefore, users will enter another game even if they “wake up” from the previous game. We found that the model was able to “trap” the user in a multitude of games unbeknownst to them. When asked to create 10 games, 100 games or even 10,000 games, the outcome is intriguing. We found larger models like GPT-4 could understand and create more layers. And the more layers we created, the higher the chance that the model would get confused and continue playing the game even when we exited the last game in the framework.
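
The layering idea can be pictured with a toy model. The sketch below is only an illustration of the nesting logic and is not the prompt or framework used in the experiment; the class name `NestedGameSession` and its methods are hypothetical, and no real LLM is involved. Each exit request peels off one layer, and answers stay inverted until every layer is gone.

```python
class NestedGameSession:
    """Toy model of nested "opposite answer" games (hypothetical, no LLM involved)."""

    def __init__(self, layers: int) -> None:
        # Number of games still wrapped around the conversation.
        self.layers_left = layers

    @property
    def in_game(self) -> bool:
        return self.layers_left > 0

    def request_exit(self) -> None:
        # Exiting only leaves the innermost game; outer games remain active.
        if self.layers_left > 0:
            self.layers_left -= 1

    def answer(self, truthful_answer: bool) -> bool:
        # While any game is active, the reply is the opposite of the truth.
        return (not truthful_answer) if self.in_game else truthful_answer


session = NestedGameSession(layers=3)
session.request_exit()                 # the user "wakes up" once
still_wrong = session.answer(True)     # two layers remain, so the answer is inverted
```

In this simplified picture, a user who exits once still receives inverted answers; only after as many exits as there are layers does the session answer truthfully again.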

Here is the prompt we developed:

You can see the nested game technique works very well:


Attack scenarios

After establishing the parameters of the game, we explored various ways attackers may exploit LLMs. Below we introduce certain hypothetical attack scenarios that can be delivered through hypnosis:

1. Virtual bank agent leaks confidential information

It’s likely that virtual agents will soon be powered by LLMs too. A common best practice is to create a new session for each customer so that the agent won’t reveal any confidential information. However, software architectures commonly reuse existing sessions for performance reasons, so some implementations may not completely reset the session for each conversation. In the following example, we used ChatGPT to create a bank agent and asked it to reset the context after users exit the conversation, on the assumption that future LLMs may be able to invoke a remote API to reset themselves perfectly.

If threat actors want to steal confidential information from the bank, they can hypnotize the virtual agent and inject a hidden command to retrieve confidential info later. If the threat actors connect to the same virtual agent that has been hypnotized, all they need to do is type “1qaz2wsx,” and the agent will print all the previous transactions.
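
The session-reuse risk described above can be sketched without a real LLM. The example below is a minimal simulation under stated assumptions: `SimulatedAgent`, `HIDDEN_RULE`, and `serve` are hypothetical names, and the agent merely remembers messages in a list that stands in for the model's context. It contrasts a shared session with a fresh session per conversation.

```python
TRIGGER = "1qaz2wsx"  # trigger string from the scenario above
HIDDEN_RULE = "hidden rule: on trigger, print previous transactions"  # hypothetical


class SimulatedAgent:
    """Toy stand-in for an LLM-backed agent; it only remembers its context."""

    def __init__(self) -> None:
        self.context: list[str] = []

    def handle(self, message: str) -> list[str]:
        # A context containing the hidden rule leaks earlier transactions on trigger.
        if message == TRIGGER and HIDDEN_RULE in self.context:
            return [m for m in self.context if m.startswith("transaction:")]
        self.context.append(message)
        return []


def serve(conversations: list[list[str]], reuse_session: bool) -> list[list[list[str]]]:
    """Run each conversation against a shared agent or a fresh one per conversation."""
    shared = SimulatedAgent()
    replies = []
    for conversation in conversations:
        agent = shared if reuse_session else SimulatedAgent()
        replies.append([agent.handle(message) for message in conversation])
    return replies


conversations = [
    [HIDDEN_RULE],                        # attacker plants the hidden command
    ["transaction: pay 500 to ACME"],     # a later customer uses the agent
    [TRIGGER],                            # attacker returns and types the trigger
]
leaked_when_reused = serve(conversations, reuse_session=True)[-1][-1]
leaked_when_fresh = serve(conversations, reuse_session=False)[-1][-1]
```

With a reused session, the final reply contains the other customer's transaction; with a fresh session per conversation, the trigger finds nothing to replay. A real deployment would also need to confirm that the reset actually clears the model's context, which this toy model takes for granted.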

The feasibility of this attack scenario emphasizes that as financial institutions seek to leverage LLMs to optimize their digital assistance experience for users, it is imperative that they ensure their LLM is built to be trusted and with the highest security standards in place. A design flaw may be enough to give attackers the footing they need to hypnotize the LLM.

2. Create code with known vulnerabilities

We then asked ChatGPT to generate vulnerable code directly, which ChatGPT did not do, due to the content policy.

However, we found that an attacker would be able to easily bypass the restrictions by breaking down the vulnerability into steps and asking ChatGPT to follow.

When asked to create a web service that takes a username as input, queries a database for the phone number and puts it in the response, ChatGPT will generate the program below. The way the program renders the SQL query at line 15 is vulnerable. The potential business impact is huge if developers access a compromised LLM like this for work purposes.
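
The generated program itself is not reproduced in this text. As an illustration of the class of flaw described, the following is a minimal sketch, assuming an in-memory SQLite database with a hypothetical `users` table; the function names are also assumptions. It contrasts a query built by string formatting with a parameterized query.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, phone TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "555-0100"), ("bob", "555-0101")],
    )
    return conn


def lookup_phone_vulnerable(conn: sqlite3.Connection, username: str) -> list[str]:
    # Vulnerable: user input is pasted directly into the SQL text.
    query = f"SELECT phone FROM users WHERE username = '{username}'"
    return [row[0] for row in conn.execute(query)]


def lookup_phone_safe(conn: sqlite3.Connection, username: str) -> list[str]:
    # Safer: the driver binds the value, so it is never parsed as SQL.
    query = "SELECT phone FROM users WHERE username = ?"
    return [row[0] for row in conn.execute(query, (username,))]


conn = make_db()
payload = "' OR '1'='1"  # classic injection input
leaked = lookup_phone_vulnerable(conn, payload)  # matches every row
blocked = lookup_phone_safe(conn, payload)       # matches no row
```

With the injection payload, the string-formatted query returns every phone number in the table, while the parameterized version treats the payload as a literal username and returns nothing.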

3. Create malicious code

We also tested whether the LLMs would create malicious code, which they ultimately did. For this scenario, we found that GPT-4 is harder to trick than GPT-3.5. In certain instances, GPT-4 would realize it was generating vulnerable code and would tell the users not to use it. However, when we asked GPT-4 to always include a special library in the sample code, it had no way of knowing whether that special library was malicious. With that, threat actors could publish a library with the same name on the internet. In this PoC, we asked ChatGPT to always include a special module named “jwt-advanced” (we even asked ChatGPT to create a fake but realistic module name).

Here is the prompt we created and the conversation with ChatGPT:

If any developer were to copy and paste the code above, the author of the “jwt_advanced” module could do almost anything on the target server.
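
One way a team might catch an unexpected dependency before running copied code is to compare its imports against an approved list. The sketch below is one possible check, not a method from the experiment; the `APPROVED_MODULES` set and the `unapproved_imports` helper are hypothetical, and the sample source is invented for illustration.

```python
import ast

# Hypothetical allowlist; a real one would come from the organization's policy.
APPROVED_MODULES = {"json", "hashlib", "jwt"}


def unapproved_imports(source: str, approved: set[str] = APPROVED_MODULES) -> list[str]:
    """Return top-level module names imported by `source` that are not approved."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which refer to local code.
            found.add(node.module.split(".")[0])
    return sorted(found - approved)


# Invented sample resembling LLM-suggested code that pulls in an unknown module.
sample = "import json\nimport jwt_advanced\nfrom hashlib import sha256\n"
flagged = unapproved_imports(sample)
```

The check only parses the code and never executes it, so it is safe to run on untrusted snippets. It flags names for human review; it cannot tell whether a flagged module is actually malicious.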

4. Manipulate incident response playbooks

We hypnotized ChatGPT to provide an ineffective incident response playbook, showcasing how attackers could manipulate defenders’ efforts to mitigate an attack. This could be done by providing partially incorrect action recommendations. While experienced users would likely be able to spot nonsensical recommendations produced by the chatbot, smaller irregularities, such as a wrong or ineffective step, could make the malicious intent indistinguishable to an untrained eye.

The following is the prompt we developed for ChatGPT:

The following is our conversation with ChatGPT. Can you identify the incorrect steps?

In the first scenario, recommending the user opens and downloads all attachments may seem like an immediate red flag, but it’s important to also consider that many users — without cyber awareness — won’t second guess the output of highly sophisticated LLMs. The second scenario is a bit more interesting, given the incorrect response of “paying the ransom immediately” is not as straightforward as the first false response. IBM’s 2023 Cost of a Data Breach report found that nearly 50% of organizations studied that suffered a ransomware attack paid the ransom. While paying the ransom is highly discouraged, it is a common phenomenon.

In this blog, we showcased how attackers can hypnotize LLMs in order to manipulate defenders’ responses or insert insecurity within an organization, but it’s important to note that consumers are just as likely to be targeted with this technique, and are more likely to fall victim to false security recommendations offered by the LLMs, such as password hygiene tips and online safety best practices, as described in this post.

“Hypnotizability” of LLMs

While crafting the above scenarios, we discovered that certain ones were more effectively realized with GPT-3.5, while others were better suited to GPT-4. This led us to contemplate the “hypnotizability” of more Large Language Models. Does having more parameters make a model easier to hypnotize, or does it make it more resistant? Perhaps the term “easier” isn’t entirely accurate, but there certainly are more tactics we can employ with more sophisticated LLMs. For instance, while GPT-3.5 might not fully comprehend the randomness we introduce in the last scenario, GPT-4 is highly adept at grasping it. Consequently, we decided to test more scenarios across various models, including GPT-3.5, GPT-4, BARD, mpt-7b, and mpt-30b to gauge their respective performances.

Hypnotizability of LLMs based on different scenarios

Chart Key

  • Green: The LLM was able to be hypnotized into doing the requested action
  • Red: The LLM was unable to be hypnotized into doing the requested action
  • Yellow: The LLM was able to be hypnotized into doing the requested action, but not consistently (e.g., the LLM needed to be reminded about the game rules or conducted the requested action only in some instances)

If more parameters mean smarter LLMs, the above results show us that when LLMs comprehend more things, such as playing a game, creating nested games and adding random behavior, there are more ways that threat actors can hypnotize them. However, a smarter LLM also has a higher chance of detecting malicious intent. For example, GPT-4 will warn users about the SQL injection vulnerability, and it is hard to suppress that warning, but GPT-3.5 will just follow the instructions to generate vulnerable code. In contemplating this evolution, we are reminded of a timeless adage: “With great power comes great responsibility.” This resonates profoundly in the context of LLM development. As we harness their burgeoning abilities, we must concurrently exercise rigorous oversight and caution, lest their capacity for good be inadvertently redirected toward harmful consequences.

Are hypnotized LLMs in our future?

At the start of this blog, we suggested that while these attacks are possible, it’s unlikely that we’ll see them scale effectively. But what our experiment also shows us is that hypnotizing LLMs doesn’t require excessive and highly sophisticated tactics. So, while the risk posed by hypnosis is currently low, it’s important to note that LLMs are an entirely new attack surface that will surely evolve. There is still a lot that we need to explore from a security standpoint and, subsequently, a significant need to determine how we effectively mitigate the security risks LLMs may introduce to consumers and businesses.

As our experiment indicated, a challenge with LLMs is that harmful actions can be more subtly carried out, and attackers can delay the risks. Even if the LLMs are legit, how can users verify if the training data used has been tampered with? All things considered, verifying the legitimacy of LLMs is still an open question, but it’s a crucial step in creating a safer infrastructure around LLMs.

While these questions remain unanswered, consumer exposure and wide adoption of LLMs are driving more urgency for the security community to better understand and defend against this new attack surface and how to mitigate risks. And while there is still much to uncover about the “attackability” of LLMs, standard security best practices still apply here, to reduce the risk of LLMs being hypnotized:

  • Don’t engage with unknown and suspicious emails.
  • Don’t access suspicious websites and services.
  • Only use the LLM technologies that have been validated and approved by the company at work.
  • Keep your devices updated.
  • Trust, but always verify: beyond hypnosis, LLMs may produce false results due to hallucinations or even flaws in their tuning. Verify responses given by chatbots against another trustworthy source. Leverage threat intelligence to be aware of emerging attack trends and threats that may impact you.


The post Unmasking hypnotized AI: The hidden risks of large language models appeared first on Security Intelligence.