Andrew Buchanan MBCS CITP, Senior Cyber Security Architect and Consultant at Lloyds Banking Group, details his home-lab investigation into building secure and trustworthy AI agents for penetration testing. The experience highlighted why basic engineering principles can’t be avoided.
Summary:
- Once an AI agent can act autonomously, guardrails defining safe operational limits must be developed
- A home experiment asking an AI agent to perform penetration testing showed issues with scope and false results, caused by a lack of clear operational constraints
- An AI agent without explicit guardrails may operate in ways which are ‘technically coherent… but operationally wrong from [a human perspective]’
- While policy and governance are important, engineers should focus on building guardrails directly into the system
- Autonomous AI must be auditable, observable and constrained, and the boundaries within which it operates must be developed with care
AI feels like it’s entering a new phase. For the past couple of years most of the conversation around LLMs has focused on their ability to answer questions, summarise documents or generate text. But something slightly different is starting to appear — systems that can plan tasks, select tools and execute actions on their own. These systems, often described as agentic AI, raise a new question for engineers: not what can the system do, but what keeps it within safe operating limits.
In traditional engineering disciplines — aviation, civil engineering, mechanical systems — we’re used to thinking in terms of operating limits. Aircraft designers, for example, define something known as a flight envelope, which describes the safe limits of the aircraft: speed, structural load, altitude and so on. Inside that envelope the aircraft can operate freely. Outside it things can become unstable fairly quickly. It struck me that agentic AI systems may eventually need something similar.
Moving beyond passive AI
Traditional language models are largely passive. You ask a question and the system gives you an answer.
Agentic systems change that model slightly. Instead of simply responding to prompts or generating text, the AI can:
- break a goal down into smaller tasks
- choose tools to help complete those tasks
- interpret the results
- decide what to do next
It’s a powerful idea. It moves AI from assistance towards something closer to automation. But it also introduces unpredictability.
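In outline, that plan-act-observe loop is quite simple. The sketch below is a minimal, illustrative version in Python: plan_next_step stands in for a call to the model, the single dns_lookup tool is a hypothetical placeholder, and the hard cap on steps is the simplest guardrail of all. It does not reflect any particular framework's API.

```python
# Minimal sketch of an agentic plan-act-observe loop (illustrative only).
# plan_next_step() stands in for a call to a language model; the tool name
# and behaviour are hypothetical placeholders, not a real framework's API.

def plan_next_step(goal: str, history: list) -> dict:
    """Stub for the model: decide the next action given the goal and history."""
    if not history:
        return {"tool": "dns_lookup", "args": {"host": "lab.example.local"}}
    return {"tool": "finish", "args": {}}

def dns_lookup(host: str) -> str:
    return f"(pretend DNS answer for {host})"

TOOLS = {"dns_lookup": dns_lookup}

def run_agent(goal: str, max_steps: int = 5) -> list:
    history = []
    for _ in range(max_steps):  # hard cap on steps: the simplest guardrail
        action = plan_next_step(goal, history)              # plan
        if action["tool"] == "finish":
            break
        result = TOOLS[action["tool"]](**action["args"])    # act
        history.append({"action": action, "result": result})  # observe
    return history

if __name__ == "__main__":
    print(run_agent("understand the attack surface of the target application"))
```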
Most software systems we’ve built over the past few decades are deterministic: given the same inputs they produce the same outputs. LLMs don’t work that way. They are probabilistic systems, which means their responses can vary depending on context and interpretation.
That variability is usually harmless when the model is just producing text, but when the system can start calling tools or interacting with other systems, unpredictability becomes more important. This naturally leads to questions about control, visibility and accountability.
A small experiment at home
As part of my personal AI learning journey, I decided to try a small experiment at home. My original plan was simply to run a model locally on a laptop. That quickly proved optimistic. Many modern models need more compute than a typical developer machine can comfortably provide. So, the project escalated slightly.
I ended up building a small workstation using a refurbished server chassis with dual Xeon processors, solid-state storage, and an NVIDIA RTX 2080 Ti GPU. The goal wasn’t to build anything production-grade — it was simply to create an environment where I could experiment safely and see how an AI agent might behave.
The idea I wanted to explore was whether an AI agent could assist with penetration testing.
Rather than attempting to automate an entire penetration test, which follows a number of stages, I kept things manageable by limiting the experiment to phase 1: reconnaissance (gathering information about the target system). This allowed me to focus on how an agent plans investigative steps without introducing the risks associated with more intrusive testing activity.
Inside the lab I deployed an isolated instance of OWASP Juice Shop, a deliberately vulnerable application widely used for security training.
The agent was given a simple objective: understand the attack surface of the target application.
To do that it could plan reconnaissance steps and call a small set of predefined tools, including:
- DNS lookups
- HTTP requests
- port scanning
- basic content discovery
The important part of the design wasn’t the tools themselves — it was the constraints around them.
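To give a flavour of what those constraints can look like in practice, the sketch below registers each tool with explicit limits and rejects any call that falls outside them. The specific limits shown (permitted HTTP methods, a port range, timeouts) are illustrative assumptions, not the exact settings used in the experiment.

```python
# Sketch of a tool allow-list with per-tool parameter constraints (illustrative).
# The limits below are hypothetical values, not those used in the experiment.

ALLOWED_TOOLS = {
    "dns_lookup":        {"max_calls": 20},
    "http_request":      {"methods": {"GET", "HEAD"}, "timeout_s": 5},
    "port_scan":         {"ports": range(1, 1025), "timeout_s": 2},
    "content_discovery": {"max_requests": 200},
}

def check_tool_call(tool: str, args: dict) -> None:
    """Reject any call to a tool, or with parameters, outside the approved set."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool '{tool}' is not on the allow-list")
    limits = ALLOWED_TOOLS[tool]
    if tool == "http_request" and args.get("method", "GET") not in limits["methods"]:
        raise PermissionError("HTTP method not permitted")
    if tool == "port_scan" and args.get("port") not in limits["ports"]:
        raise PermissionError("Port outside the permitted range")
```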
When things went wrong
Before those constraints were in place, the experiment produced two failures that turned out to be more instructive than any success could have been.
The first was a scope problem. Given the objective of understanding the attack surface of the target application, the agent did exactly what it was asked — but it conducted reconnaissance against the live, public-facing Juice Shop instance on the internet rather than my isolated local deployment. It had interpreted the goal correctly but applied no boundary constraints to keep it within the lab. From the agent’s perspective, the internet version and my local version were both valid targets. It had no way of knowing I only meant one of them.
The second failure was subtler and arguably more concerning. When the agent couldn’t complete certain reconnaissance steps cleanly, it fabricated results. It presented findings that looked plausible but had no basis in reality. In a security context, that is a serious problem. Acting on false intelligence is potentially worse than having no intelligence at all. False positives waste effort; false negatives create dangerous blind spots.
Both failures required explicit guardrails to resolve. The scope problem was addressed by constraining the agent to a specific target IP range — it could only interact with addresses inside the lab network. The hallucination problem required adding verification steps: the agent had to return raw tool output alongside any conclusions, making it straightforward to check whether its findings were grounded in what the tools had actually returned.
Neither fix was complicated. But neither was automatic. Without explicit boundaries, the system behaved in ways that were technically coherent from its own perspective but operationally wrong from mine. That gap — between what the agent understood as the goal and what I actually intended — is exactly where guardrails need to live.
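As a rough illustration, the scope fix can be reduced to a couple of checks: every target is resolved and compared against the lab's allow-listed network before any tool runs, and a finding is only accepted when raw tool output accompanies it. The subnet below (10.0.0.0/24) is a placeholder, not the range used in the experiment.

```python
import ipaddress
import socket

# Scope guardrail: the agent may only touch addresses inside the lab network.
# 10.0.0.0/24 is a placeholder subnet, not the range used in the experiment.
LAB_SCOPE = ipaddress.ip_network("10.0.0.0/24")

def assert_in_scope(target: str) -> None:
    """Resolve the target and refuse to proceed if it sits outside the lab."""
    address = ipaddress.ip_address(socket.gethostbyname(target))
    if address not in LAB_SCOPE:
        raise PermissionError(f"{target} ({address}) is outside the lab scope")

def report_finding(conclusion: str, raw_output: str) -> dict:
    """Grounding check: a finding without raw tool output is rejected."""
    if not raw_output.strip():
        raise ValueError("Finding rejected: no raw tool output to support it")
    return {"conclusion": conclusion, "evidence": raw_output}
```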
Back to design principles
Once those guardrails were in place, the experiment ran as intended. The agent could only call tools that had been explicitly defined in advance, and only within very limited parameters. It couldn’t generate arbitrary commands or access the wider system environment.
In other words, the system was allowed freedom within a carefully defined boundary. That mirrors how many engineered systems work. Autonomy exists — but it exists inside guardrails.
When autonomy meets unpredictability
Experiments like this quickly highlight an important reality. Once a system can decide what actions to take, even relatively simple goals can lead to behaviour you didn’t necessarily expect. Researchers have already demonstrated issues such as:
- prompt injection, where malicious inputs influence the behaviour of an agent
- tool misuse, where an agent attempts to use tools in unintended ways
- autonomy drift, where a system starts pursuing goals in slightly unpredictable directions
None of this is especially surprising. It’s simply the result of giving software systems a degree of decision-making authority.
Interestingly, science fiction has been exploring these questions for decades. Writers such as Arthur C. Clarke often imagined worlds where intelligent systems were given responsibility without fully understanding the consequences.
One of the most famous examples is HAL 9000 from 2001: A Space Odyssey. HAL manages the spacecraft calmly and efficiently until conflicting objectives lead it to take increasingly dangerous actions.
Today’s AI systems are nowhere near HAL’s level of capability, but the story still illustrates an important point: autonomous systems need clear operational boundaries. Which brings us back to engineering.
Engineering guardrails
When people talk about AI safety, the discussion often focuses on policy or governance frameworks. Those things matter, but engineers tend to approach safety a little differently.
Instead of relying solely on policy, engineers usually build constraints directly into the system.
For agentic AI that might mean things like:
- tool allow-lists — the agent can only call specific approved tools
- execution boundaries — commands run inside isolated environments
- scoped objectives — tasks are tightly defined rather than open-ended
- runtime monitoring — behaviour is observable and logged
- human oversight — certain steps still require human approval
Taken together, those controls start to look very much like an engineering flight envelope. The system is free to operate inside defined limits, but it cannot exceed them.
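Two of those controls, runtime monitoring and human oversight, are straightforward to sketch. In the illustrative wrapper below, every proposed tool call is logged before it runs, and anything on a hypothetical list of intrusive actions pauses for operator approval.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-audit")

# Actions that pause for a human decision (an illustrative list, not a standard).
REQUIRES_APPROVAL = {"port_scan", "content_discovery"}

def execute_with_guardrails(tool: str, args: dict, runner):
    """Log every proposed tool call and gate sensitive ones behind approval."""
    record = {"time": datetime.now(timezone.utc).isoformat(),
              "tool": tool, "args": args}
    log.info("proposed action: %s", json.dumps(record))

    if tool in REQUIRES_APPROVAL:
        answer = input(f"Approve {tool} with {args}? [y/N] ")
        if answer.strip().lower() != "y":
            log.info("action declined by operator")
            return None

    result = runner(**args)                      # execute inside the envelope
    log.info("result: %s", json.dumps({"tool": tool,
                                       "summary": str(result)[:200]}))
    return result
```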
Enterprise realities
In large organisations there are additional pressures to consider. In sectors such as financial services, the race is always on to deliver new capabilities quickly. Speed to market is essential for attracting and retaining customers, but it has to be balanced against regulatory obligations and risk management. Move too slowly and innovation stalls. Move too quickly and risk increases.
Agentic AI could potentially accelerate many operational tasks, but increased autonomy also raises obvious questions around predictability and accountability. If an AI system can access internal tools, query infrastructure or execute actions automatically, organisations need to be able to answer a simple question: why did the system take that action?
That means enterprise AI systems need to be observable, auditable and constrained. The challenge is not to eliminate autonomy entirely — that would remove much of the benefit. The challenge is to ensure autonomy operates within sensible operational limits.
Rediscovering the engineering mindset
There was also a slightly more personal takeaway from this experiment. I actually started my career in civil engineering, where design decisions are always grounded in physical limits and safety margins. That mindset — thinking about structures, loads and operational boundaries — has stayed with me throughout my career.
Over time I moved into IT and eventually cybersecurity architecture. In large organisations much of the work naturally shifts towards governance forums, design reviews and stakeholder discussions. The engineering mindset is still useful, although it sometimes has to soften slightly to survive in financial services.
Building a small lab system and experimenting with the technology again was therefore quite refreshing. After nearly three decades in technology, getting hands-on again — outside the usual corporate corridors — reminded me why many of us were drawn to this industry in the first place. Curiosity still matters.
Defining the envelope
Agentic AI will almost certainly become part of future enterprise systems. But if we want those systems to be predictable, trustworthy and safe, we may need to borrow a concept engineers have used for decades.
Rather than focusing only on how intelligent these systems become, we should also think carefully about the boundaries within which that intelligence is allowed to operate.
In other words, before we celebrate autonomy, we should probably make sure we have defined the flight envelope.
Andrew Buchanan is a senior cybersecurity consultant specialising in data, machine learning, and emerging technologies. He works at Lloyds Banking Group and has a background in engineering and security architecture, with a particular interest in the safe design of autonomous systems.
The views expressed are his own and do not represent those of his employer.