Building trustworthy AI health agents: lessons from 1.6 million messages

Arthur Landim Costa FBCS reflects on the technical and design lessons learned while building an AI agent designed to be trusted by patients looking to understand and manage their health.

Summary

Previous technological advances haven't brought the healthcare revolution they promised, but AI agents fill important gaps
Healthcare AI agents must be designed with the user's needs and intentions in mind to be effective and foster trust
Retrieval-augmented generation (RAG) ensures AI agents assert only what a patient's data actually shows and state explicitly when they can't answer
Clear boundaries that define which situations an AI agent should never handle and must escalate instead, such as mental health crises, prevent harm
A system that remembers important information such as health goals and clinical decisions across sessions improves patient experience

People love to say that healthcare AI will revolutionise medicine. They also said that about electronic health records in the 2000s, and we spent the next two decades watching doctors type into forms while patients waited. The technology was real. The execution was hard.

I have spent the last year building an AI-powered health agent. It now serves 143,000 users, who have sent over 1.6 million messages in the past 90 days, all drawing on millions of underlying lab results. What I learned is not that healthcare AI is impossible. It is that most teams misjudge what kind of problem they are solving. Not capability. Trust. And trust has to be engineered deliberately, or it does not exist at all.

Here are four things that only became clear at scale.

Treat health data like it belongs to a person, not a pipeline

In preventative care, results often arrive before clinical follow-up does. The AI agent fills that gap. A user asking whether their ferritin level is within range is not submitting a query. They are asking the first informed voice available to them, and most AI systems are not designed with that in mind.

Health data is not like browsing history or purchase behaviour. It is biology. It carries results people have not yet processed and concerns they have not shared with anyone. A pattern emerged early: the question typed was rarely the question meant. ‘Is a TSH of 4.2 normal?’ is sometimes a question about months of fatigue, not lab ranges. Designing for the literal question while ignoring the real one is a betrayal of the user's trust.

This shaped every privacy decision made. The system speaks only from each user's own data, never generalises across users, and never infers more than the record supports. The user shared their biology with us, and we owe them the same care a clinician would. The gap in most healthcare AI sits here — not in the technology, but in the failure to reckon with what health data actually represents.

The dangerous agent is the one that sounds certain

There is a specific failure mode in healthcare AI worth naming. Not the agent that says ‘I don't know’ — that one is safe. The dangerous one is the one that confidently gives the wrong answer.

Large language models are fluent by design. They do not naturally hedge or express uncertainty, and in healthcare, that fluency causes real harm. A fabricated reference range, a misremembered drug interaction, a lab result interpreted against the wrong population norm: these are predictable failure modes of systems not built to know the limits of their own knowledge. In preventative care, someone making decisions about their diet or supplements from a confident but wrong interpretation of their labs is being misled at exactly the moment they came looking for clarity.

A 2025 study in Communications Medicine found that models elaborated on a single fabricated lab value in up to 83% of cases. More troubling are omissions: unlike fabrications, a missing piece of context leaves no visible signal that something is wrong, and research shows omission rates outpace hallucination rates in clinical tasks.

That is where retrieval-augmented generation (RAG) earns its place: it grounds every response in the user's actual record, so the agent can assert only what the data shows and says so explicitly when it cannot answer.

But grounding the model is only half the problem: the evaluation pipeline carries the same risk. Pre-launch testing catches what you anticipate. What you do not anticipate arrives with real users. Automated judges surface outliers at volume, but the failures that matter most are rarely the ones they flag with confidence. Human review has to be a permanent part of the system, not a launch gate.

The most important failures we caught were not the ones the automated system flagged. They were the ones it was most confident about.
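
To make the grounding rule concrete, here is a minimal Python sketch of an agent that asserts only what a user's own record contains and says so when it has nothing to go on. It is not the production system described in this article. The LabResult schema, the function names and the numbers are all hypothetical, retrieval is reduced to a plain lookup, and the language-model generation step is replaced by a fixed template so the behaviour is easy to inspect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LabResult:
    """One row from a single user's own record (hypothetical schema)."""
    test: str
    value: float
    unit: str
    ref_low: float   # lower bound printed on the user's lab report
    ref_high: float  # upper bound printed on the user's lab report


def retrieve(record: list[LabResult], test: str) -> Optional[LabResult]:
    """Return the last stored result for `test`, or None if there is none.

    Assumes `record` is ordered oldest first.
    """
    matches = [r for r in record if r.test.lower() == test.lower()]
    return matches[-1] if matches else None


def grounded_answer(record: list[LabResult], test: str) -> str:
    """Assert only what the record contains; otherwise say so explicitly."""
    result = retrieve(record, test)
    if result is None:
        return f"I can't answer that: there is no {test} result in your record."
    in_range = result.ref_low <= result.value <= result.ref_high
    status = "within" if in_range else "outside"
    return (
        f"Your {result.test} is {result.value} {result.unit}, {status} the "
        f"reference range on your report "
        f"({result.ref_low}-{result.ref_high} {result.unit})."
    )


# Illustrative numbers only, not clinical reference ranges.
record = [LabResult("ferritin", 85.0, "ng/mL", 30.0, 400.0)]
print(grounded_answer(record, "ferritin"))
print(grounded_answer(record, "TSH"))
```

The point of the sketch is the second call: with no TSH result on file, the agent declines instead of supplying a reference range from memory.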

Design the off-ramp before you design the agent

Most teams treat escalation as an afterthought, drawing lines only when something goes wrong. In healthcare, that sequence causes harm. Having medical doctors on the team from the start changed how we thought about boundaries. The question was never how much the agent could handle; it was what it should never handle: mental health crises, symptoms needing urgent triage, results that carry implications only a clinician can properly contextualise.

In consumer health AI, escalation does not mean transferring to a human agent. There is no doctor at the other end of the queue at 11pm. The off-ramp has to point somewhere useful: a physician appointment, a support community, or emergency contact information depending on what the conversation signals. A cold refusal, ‘I can't help with that’, leaves the user stranded. But the worse outcome is an agent that keeps going, offering generic responses on topics that require clinical judgement, giving the user the impression of being helped while the real need goes unaddressed. The goal is not to hand off less. It is to hand off better.
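
One way to picture an off-ramp that points somewhere useful is a small routing table, sketched below in Python. This is an illustration under stated assumptions, not the design used in the system described here. The signal names, the mapping from signal to destination and the priority order are all hypothetical, and in practice they would be defined and reviewed by clinicians.

```python
from enum import Enum
from typing import Optional


class OffRamp(Enum):
    EMERGENCY_CONTACT = "emergency contact information"
    PHYSICIAN_APPOINTMENT = "a physician appointment"
    SUPPORT_COMMUNITY = "a support community"


# Hypothetical conversation signals and where each one should lead.
ROUTES = {
    "mental_health_crisis": OffRamp.EMERGENCY_CONTACT,
    "urgent_symptom": OffRamp.EMERGENCY_CONTACT,
    "needs_clinical_context": OffRamp.PHYSICIAN_APPOINTMENT,
    "wants_peer_support": OffRamp.SUPPORT_COMMUNITY,
}

# If several signals fire, the most urgent destination wins (assumed order).
PRIORITY = [
    OffRamp.EMERGENCY_CONTACT,
    OffRamp.PHYSICIAN_APPOINTMENT,
    OffRamp.SUPPORT_COMMUNITY,
]


def choose_off_ramp(signals: set[str]) -> Optional[OffRamp]:
    """Return the most urgent off-ramp triggered by `signals`, if any."""
    triggered = {ROUTES[s] for s in signals if s in ROUTES}
    for ramp in PRIORITY:
        if ramp in triggered:
            return ramp
    return None


def respond(signals: set[str], draft_reply: str) -> str:
    """Replace the agent's draft with a hand-off whenever an off-ramp applies."""
    ramp = choose_off_ramp(signals)
    if ramp is None:
        return draft_reply
    return (
        "This needs more than I should handle on my own. "
        f"The most useful next step is {ramp.value}."
    )


print(respond({"wants_peer_support", "urgent_symptom"}, "draft text"))
```

The draft reply is discarded whenever an off-ramp applies, so the agent neither refuses coldly nor keeps going with generic advice.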

Memory is not a feature; it is continuity of care

Most conversational AI systems treat every session as a fresh start. In healthcare, that is a design flaw that compounds over time.

Consider a user who mentions that their doctor prescribed 10g of creatine daily instead of the standard 5g. A system without memory treats that as an anomaly the next time supplements come up. A system with memory reasons from that context correctly, which matters when that user is building a nutrition routine months later.

Not everything a user says belongs in long-term context. But clinical decisions their doctors have made, health goals they have stated, and deviations from standard protocols they have explained deserve to persist. A system that remembers what matters earns a different kind of trust than one that makes the user start from zero every time.
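
The selective persistence described above can be sketched in a few lines of Python. This is a minimal illustration, not the memory architecture of the system in this article. The Kind categories, the Fact and UserMemory classes and the assumption that each statement arrives already labelled with its kind are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    CLINICAL_DECISION = "clinical_decision"
    HEALTH_GOAL = "health_goal"
    PROTOCOL_DEVIATION = "protocol_deviation"
    SMALL_TALK = "small_talk"


# Only these kinds survive the end of a session (assumed policy).
PERSISTENT = {Kind.CLINICAL_DECISION, Kind.HEALTH_GOAL, Kind.PROTOCOL_DEVIATION}


@dataclass(frozen=True)
class Fact:
    kind: Kind
    text: str


@dataclass
class UserMemory:
    """Long-term context for one user; nothing is shared across users."""
    facts: list[Fact] = field(default_factory=list)

    def end_session(self, session_facts: list[Fact]) -> None:
        """Keep only the facts whose kind is marked persistent."""
        self.facts.extend(f for f in session_facts if f.kind in PERSISTENT)

    def context(self) -> list[str]:
        """Text to place in front of the model at the start of a session."""
        return [f.text for f in self.facts]


memory = UserMemory()
memory.end_session([
    Fact(Kind.PROTOCOL_DEVIATION, "Doctor prescribed 10g of creatine daily."),
    Fact(Kind.SMALL_TALK, "Had a busy week at work."),
])
print(memory.context())
```

In a later session about supplements, the creatine dose is already in context, so the agent can reason from the doctor's decision instead of treating it as an anomaly.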

Building this system changed how I think about AI in healthcare. The technology is hard, but that is solvable. What is harder is the discipline required to build something that earns the right to be trusted. It means treating health data with the respect it deserves. It means designing the moments where the agent stops talking with as much care as the moments where it speaks. Trust in production is not something you declare. It is something you maintain.

The teams that get this right treat all of the above not as product features, but as engineering obligations.

Arthur Landim Carvalho FBCS leads development of an AI-powered conversational health agent serving over 143,000 users.