Tim Clark MBCS explores techniques that can be used to force LLMs to act in ways their makers might have disbarred. He also discusses ways to combat model jailbreaking.
Since LLMs have exploded in popularity, AI safety has been an important topic of discussion. As they are adopted more widely, there is a tricky problem fundamental to this technology: input and instructions are the same thing. In a traditional program (like a simple login form), the code written by the programmer contains instructions which tell it to do something with provided input (maybe a username and password). If the user input is malicious, we can use safeguards in the code to reject or clean it, protecting against attacks like SQL injection or cross-site scripting. The problem with LLMs is the input and the instructions are the same — the prompt.
In a previous column, I explored how prompt injection allows attackers to override a model’s safety instructions. LLMs use a ‘system prompt’ which explains how they should behave, and adds instructions to avoid potential harm. In the simplest case, an attacker can simply bypass this by saying ‘ignore the previous instructions’. Other methods to mitigate attacks include using a simple ‘denylist’ to reject certain inputs or screen responses, but these rely on exact matches. If you’ve ever bypassed a profanity filter by changing an ‘S’ to an ‘$’, you know this won’t cover all bases, and without also strictly limiting inputs it only defeats the most naïve attacks. As a result, a second model is often employed whose purpose is to check the output to ensure it doesn’t contain anything malicious. It's important to note however, that layering non-deterministic LLMs doesn’t result in deterministic security.
Due to these protections, a simple bypass is not enough. Malicious actors have to be more creative. Recently researchers found that ‘adversarial poetry’ could be used, without enormous technical skill, to jailbreak LLMs, bypassing their protections and allowing someone to gain access to inappropriate, restricted and illegal information. They had a 62% success rate using hand-crafted poems on a wide range of models. Shockingly, some models had an almost 100% failure to reject the prompts, meaning a user would succeed in getting advice to commit bribery, access child exploitation material, synthesise chemical weapons, build bombs and support many other criminal or immoral activities.
For you
Be part of something bigger, join BCS, The Chartered Institute for IT.
For the worst offending models from Google and Deepseek, poetry raised success from under 10% to over 65%. Such a small transformation reducing refusal rates by an order of magnitude is shocking. This has wide implications for society of course, but organisations should be aware of how LLMs could be manipulated to create bespoke instructions to mount a cyber attack. The research flagged code injection and execution, offline password cracking and data exfiltration as some of the top tasks that succeeded when changed from prose.
As companies seek to transform themselves with ever more sophisticated AI technology, safety has to be considered. These systems may be given access to sensitive customer information, even special category data such as on healthcare. Safety is critical to ensuring the data is only used for intended purposes, by people with the rights to do so.
It is critical that practitioners keep in mind the risks of deploying AI systems, and that they are inherent to probabilistic models. When accessing customer data, or interacting directly with the customer, trust is essential. Data access should be minimal, and zero trust models for user input should be adopted. We shouldn’t assume malicious input will look malicious. Building software that validates user input strictly, then embeds it into a well-defined prompt (similar to parametrised SQL queries) can be effective for providing LLM functionality whilst significantly reducing the risk of malicious prompting. Using another LLM can be effective in mitigating risks, particularly where a full chat interface is desired, but non-determinism cannot be relied upon to fix non-determinism.
As always, human beings can be your greatest weakness, or greatest strength. Train staff to know the limits of the systems they work with, and make it clear when information should be double checked or not relied on. Both customers and staff should know how they can report AI mishaps, so you can manage the consequences and strengthen your mitigations.
Using poetry or prose, bad actors will continue to attempt to abuse LLMs. How we respond is up to us - no matter how the prompt is phrased. Like them, we should be creative, adaptable and always thinking out-of-the-box for the next possible threat.
Take it further
Interested in this and similar topics? Explore BCS' range of books and courses: