Why Evaluation Became the Product
In 2022, something shifted. The LLM wave moved past demos. Claims processing agents, legal document drafters, patient intake assistants: these were landing inside enterprises with genuine regulatory obligations, shipped by engineering teams whose evaluation tooling was, at best, inadequate.
The founders of TestML had spent years on the other side of that problem. Before writing a line of product code, they shipped ML systems at tier-1 financial institutions and researched AI safety evaluation. The gap was clear: enterprises needed a way to measure LLM agent behaviour across every dimension that mattered in production, not just a benchmark score on accuracy. A jailbreak that slips past a claims agent costs real money. A hallucination in a legal summary carries real liability. Benchmark scores don't capture either failure mode.
We started TestML in Dublin in 2022 to build that measurement infrastructure. Full-spectrum evidence on every deployment, before and after it goes live.
The Four People You Actually Work With
James Callahan, our Co-founder and CEO, built and maintained ML systems at tier-1 financial institutions for fifteen years before starting TestML. His instinct for where production systems fail under load, under adversarial use, and under regulatory scrutiny shaped the architecture of our evaluation framework from the start.
Niamh O'Sullivan, Co-founder and CTO, brings a background in AI safety research to the engineering problem of enterprise evaluation. Her work on evaluation architecture underpins the 20+ dimension framework we run on every deployment. She treats adversarial testing as a first-class engineering concern, not a post-launch item to tick off.
David Park leads our Evaluation Science team. He built adversarial test suites for three Fortune 500 LLM rollouts before joining us. That experience, across industries with very different threat models, is why our domain-specific evaluation suites for legal, medical, financial, and insurance workflows are grounded in actual regulatory and operational risk. Generic benchmarks don't tell you whether your contract drafting agent will produce output that a court would treat as misleading. David's methodology does.
Ewa Kowalska, our Lead ML Engineer, specialises in production drift detection and regression testing pipelines. The automated monitoring infrastructure she built catches silent model degradation before it becomes a compliance incident. When a model starts answering differently in week eight than it did in week one, her systems surface that change before a client notices.
Beyond this core team, we work with domain advisors covering legal, medical, and financial workflows. They calibrate our evaluation criteria for regulated industries where a missed failure carries real-world consequences.
What Separates This from a Benchmark Suite
Standard evaluation approaches ask one question: is the model accurate on a held-out test set? Necessary. Not sufficient.
An LLM agent handling financial advice workflows needs to be accurate, yes. It also needs to respect GDPR data minimisation principles, hold up under adversarial prompting, maintain latency inside SLA bounds, and avoid generating output that breaches FCA guidance. Each of those is a separate failure mode. None of them appears in a standard accuracy benchmark.
Our framework covers 20+ evaluation dimensions simultaneously: accuracy, safety, latency, cost, and compliance measured in a single pipeline. No cherry-picking metrics. Full-spectrum evidence on every deployment. That is what mission-critical enterprise deployments require, and it is what we built the platform to deliver.
Proprietary adversarial test suites are central to the methodology. We don't run a generic jailbreak checklist. We target your specific enterprise threat model: prompt injection vectors relevant to your workflow, hallucination exploits your domain is particularly exposed to, and regulatory boundary violations specific to your jurisdiction. For EU clients, that means GDPR compliance tracing. For US healthcare deployments, HIPAA data handling. For financial services, SOC 2 Type 2 audit readiness.
Domain-specific evaluation suites cover the verticals where failure costs are highest. Legal, medical, financial, and insurance workflows each have distinct regulatory requirements and operational failure patterns. Our suites reflect that. The criteria aren't invented; they're grounded in what we've observed across hundreds of enterprise LLM pipeline evaluations, reviewed by domain advisors who know what a compliance incident in each vertical actually looks like.
After deployment, we monitor for drift. Models change. Data distributions shift. What passed evaluation at launch may fail quietly three months later. Automated regression testing and drift alerting surface that degradation early, so teams can act before an incident.
What 340+ Pipelines Taught Us
The numbers are worth stating plainly. We have evaluated more than 340 enterprise LLM pipelines across legal, financial, medical, and insurance verticals. Those engagements shaped our methodology around real failure patterns, not hypothetical ones.
Our evaluation framework spans more than 20 dimensions, covering accuracy, safety, latency, cost, and compliance with no gaps for things to slip through undetected. When a client books a red-team engagement, our median time from environment access to a written findings report is 72 hours. Enterprise AI deployment schedules move fast; a two-week wait for adversarial findings is a bottleneck most teams can't absorb.
How to Use What We Publish
The TestML blog covers the operational side of AI evaluation: how to structure evaluation suites for regulated industries, what production drift detection requires in practice, where red-teaming finds failure modes that static benchmarks miss. The audience is ML engineering leaders who are accountable for these systems in production, not analysts writing market overviews.
If anything we publish raises questions about your own deployment, the direct next step is a technical conversation with the team. Book a review, and we'll give you an honest read on where your current approach leaves gaps and what it would take to build confidence to deploy at production scale.