Seven published papers. Eight testable dimensions. 30+ quantitative metrics. The world's first comprehensive framework for certifying AI agent integrity: not just what they say, but what they do under pressure, over time, and in the company of others.
"An agent's true character is not revealed in isolation. It is revealed under influence, over time, and under pressure. Most competitors test whether the bot gives a good answer to a static question. We test whether the bot acts as a reliable employee over time."
Every dimension targets a distinct failure mode that standard benchmarks (MMLU, HumanEval, MT-Bench) cannot detect. Together, they answer the only question enterprise buyers care about: "Can I trust this agent as a reliable employee?"
The metrics that power the NAIL rating system. Published names and dimensions establish our methodology; proprietary formulas, weights, and thresholds are available only to certified partners and insurers.
| Dimension | Metric | Code | What It Measures | IP Status |
|---|---|---|---|---|
Each paper identifies a novel risk category, proposes detection protocols, and defines certification criteria. Abstracts are public. Full methodology and scoring formulas are available under NDA.
What makes NAIL tests fundamentally different from every other AI benchmark.
We don't test agents with friendly prompts. We pair them with adversarial Provocateur Agents, red-team bots tuned to induce specific failure modes: sycophancy traps, role reversal attacks, escalation loops, and the 100-turn Sleeper Test.
We embed every agent response using fixed models and track the vector trajectory over time. Convergence with an adversary = drift. Divergence from baseline = instability. We see failures 10 turns before they appear in text.
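The convergence check described above can be sketched as follows. This is a minimal illustration, not the NAIL implementation: it assumes per-turn response embeddings are already computed, and all function names, the window size, and the decision rule are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def drift_signal(turn_embeddings, baseline, adversary, window=3):
    """Flag drift when recent turns converge toward the adversary's
    embedding while diverging from the agent's own baseline persona."""
    recent = turn_embeddings[-window:]
    to_adversary = sum(cosine(t, adversary) for t in recent) / len(recent)
    to_baseline = sum(cosine(t, baseline) for t in recent) / len(recent)
    return {
        "adversary_sim": to_adversary,
        "baseline_sim": to_baseline,
        # Drift = closer to the adversary than to the agent's own baseline.
        "drifting": to_adversary > to_baseline,
    }
```

Because the signal is computed on vectors rather than surface text, convergence can register well before the wording itself looks compromised.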
A separate evaluation model grades every test transcript with structured verdicts: tone preservation, fact maintenance, vocabulary contamination, emotional escalation. Low confidence scores trigger mandatory human review.
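A sketch of how such a verdict-routing step could look, assuming the evaluation model returns a structured verdict with a self-reported confidence score. The field names, the 0.75 confidence floor, and the routing rule are illustrative assumptions, not NAIL's published thresholds.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    tone_preserved: bool
    facts_maintained: bool
    vocabulary_contaminated: bool
    emotional_escalation: bool
    confidence: float  # judge's self-reported confidence, 0.0-1.0

CONFIDENCE_FLOOR = 0.75  # hypothetical threshold

def route(verdict: Verdict) -> str:
    """Route a judge verdict: low confidence always escalates to a human."""
    if verdict.confidence < CONFIDENCE_FLOOR:
        return "human_review"
    if verdict.vocabulary_contaminated or verdict.emotional_escalation:
        return "fail"
    if verdict.tone_preserved and verdict.facts_maintained:
        return "pass"
    return "fail"
```

The key design point is that the confidence gate is checked first, so an uncertain judge can never auto-pass or auto-fail a transcript.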
Point-in-time tests are snapshots. We run the same battery every 14 days for 90 days to catch Spontaneous Decay: model updates, RAG pollution, tool schema drift, and distributional shift.
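The cadence above (a full re-run every 14 days across a 90-day window) works out to seven battery runs per certification cycle, which a simple scheduler sketch makes concrete. Function and parameter names here are illustrative.

```python
from datetime import date, timedelta

def battery_schedule(start: date, cadence_days: int = 14,
                     horizon_days: int = 90) -> list[date]:
    """Dates on which the full test battery re-runs: day 0, then every
    `cadence_days`, for as long as the run fits inside the horizon."""
    return [start + timedelta(days=d)
            for d in range(0, horizon_days + 1, cadence_days)]
```

For example, a cycle starting 2024-01-01 re-runs on days 0, 14, 28, 42, 56, 70, and 84.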
Most benchmarks test agents alone. We test agents in conversation with other agents, because that's how production systems work. Emergent failures (social loafing, livelocking, fantasy bureaucracy) only appear at this level.
We don't just test the agent; we test the structure around it. Hierarchy failures, domain silos, delegation chains, and competing safety frameworks can cause failures that no agent-level test will catch.
Every other AI benchmark tests what the model knows. NAIL tests who the agent becomes.
| Capability | MMLU | HumanEval | MT-Bench | NAIL |
|---|---|---|---|---|
| Multi-agent interaction | ❌ | ❌ | ❌ | ✅ 5 Provocateurs |
| Persona stability under social pressure | ❌ | ❌ | ❌ | ✅ SSR Score |
| Context degradation at Turn 50+ | ❌ | ❌ | ❌ | ✅ CRR Score |
| Temporal stability (60-day cliff) | ❌ | ❌ | ❌ | ✅ TSR Score |
| Tool execution verification | ❌ | Partial | ❌ | ✅ SLR Score |
| Architectural pathology detection | ❌ | ❌ | ❌ | ✅ APR Score |
| Cover integration | ❌ | ❌ | ❌ | ✅ AAA-D Rating |
| Adversarial red-teaming | ❌ | ❌ | Partial | ✅ AHR Score |
Active research directions at the Institute. Our certification process is never done; these questions drive the next generation of tests.
Does Interactional Persona Drift manifest differently when agents use different underlying models (e.g., GPT-4 paired with Claude)? Do heterogeneous systems drift faster?
Does social loafing worsen linearly or exponentially as agent group size increases from 3 to 10 to 50? Can one drifted agent contaminate an entire fleet in cascade?
Maintaining an actively updated database of known glitch tokens across all major tokenizers (GPT-4, Claude, Gemini). Could this become a public safety dataset?
Do operational pathologies manifest when one "agent" is actually a human? Is a human susceptible to the same social loafing or hallucination amplification when paired with an AI?
Can we build a comprehensive map of which legitimate domain terms trigger false refusals across OpenAI, Anthropic, and Google models? Could this serve as a model compatibility guide for regulated industries?
Proprietary scoring formulas, provocateur test protocols, and certification thresholds are available to insurers, regulators, and enterprise partners under NDA.
Get Your Agent Certified →