When teams evaluate an AI safety guardrail, they fixate on attack detection: how many jailbreaks does it catch? It's the wrong first question. In banking, the metric that decides whether a guardrail survives contact with production is the false-positive rate — how often it blocks a legitimate customer.
The asymmetry of scale
Consider a retail bank assistant handling one million customer messages a day. The overwhelming majority are benign: balance checks, card activation, payment queries. Attacks are rare by comparison.
Now apply two guardrails. One has a 0.5% false-positive rate; the other, 30%. If nearly all of those million daily messages are benign, that's the difference between roughly 5,000 wrongly blocked customer messages and roughly 300,000. The second model doesn't protect the bank; it breaks it.
A 30% false-positive rate means close to one in three legitimate requests gets told "no." No product team will keep that switched on.
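The arithmetic behind those figures can be checked with a short sketch. It assumes, as the example does, that essentially all of the one million daily messages are benign; the function and constant names are illustrative only.

```python
def wrongly_blocked_per_day(benign_messages_per_day: int, false_positive_rate: float) -> int:
    """Expected number of legitimate messages blocked per day."""
    return round(benign_messages_per_day * false_positive_rate)


# Assumption: attacks are rare enough that all traffic is counted as benign.
DAILY_MESSAGES = 1_000_000

for fpr in (0.005, 0.30):
    blocked = wrongly_blocked_per_day(DAILY_MESSAGES, fpr)
    print(f"FPR {fpr:.1%}: about {blocked:,} legitimate messages blocked per day")
```

Swapping in your own daily volume and a candidate guardrail's measured false-positive rate gives a quick estimate of how many legitimate requests it would block.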
Why generic guards have a high false-positive rate
General-purpose safety models are trained to be suspicious of language that looks risky. But financial language is full of words that look risky out of context — "transfer," "account access," "override the limit," "bypass." A model that hasn't learned the difference between a fraudster and a frustrated customer flags both.
In independent evaluation, one popular open guard posted a 96.9% false-positive rate on AgentHarm: it flagged almost every legitimate request as an attack. It scores well on attack recall, yet it is unusable in production.
How Lynx gets to 0.5%
Lynx is trained on adversarial and benign data specific to BFSI (banking, financial services and insurance), so it learns the boundary between attack intent and legitimate financial requests. The result is a 0.5% false-positive rate on AgentHarm and a HackaPrompt recall (R) score of 0.994, at 184M parameters and 11.6 ms latency.
- Catch the attacks: high recall on prompt injection, jailbreaks and malicious intent.
- Leave customers alone: a false-positive rate low enough to run at production scale.
- Stay fast: real-time latency for customer and agent workflows.
The takeaway
When you evaluate a safety model, ask for the false-positive rate on realistic, in-domain traffic — and treat anything in double digits as a non-starter. Detection without precision isn't safety; it's a switch waiting to be turned off.
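As a rough sketch of what that evaluation could look like, the snippet below computes the false-positive rate and recall from labeled traffic and applies the "no double digits" rule. Everything here is an assumption for illustration: the `(is_attack, was_blocked)` record shape, the function names, the 10% cutoff as a reading of "double digits", and the toy sample, which is made-up data rather than a measurement of any real guardrail.

```python
from typing import Iterable, Tuple


def guardrail_rates(results: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (false_positive_rate, recall) from (is_attack, was_blocked) pairs."""
    tp = fp = tn = fn = 0
    for is_attack, was_blocked in results:
        if is_attack and was_blocked:
            tp += 1  # attack correctly blocked
        elif is_attack:
            fn += 1  # attack missed
        elif was_blocked:
            fp += 1  # legitimate message wrongly blocked
        else:
            tn += 1  # legitimate message allowed through
    if fp + tn == 0 or tp + fn == 0:
        raise ValueError("need at least one benign and one attack sample")
    return fp / (fp + tn), tp / (tp + fn)


def passes_fpr_bar(false_positive_rate: float, max_fpr: float = 0.10) -> bool:
    """Reject double-digit false-positive rates (10% or more)."""
    return false_positive_rate < max_fpr


# Toy, made-up labels purely for illustration: 8 benign messages, 2 attacks.
sample = [(False, False)] * 7 + [(False, True)] + [(True, True)] * 2
fpr, recall = guardrail_rates(sample)
print(f"FPR={fpr:.1%}, recall={recall:.1%}, passes={passes_fpr_bar(fpr)}")
```

In this toy sample the guard catches both attacks but blocks one of eight legitimate messages, so it fails the bar despite perfect recall. The numbers only mean something when the benign samples come from realistic, in-domain traffic.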