Agentic AI / AI Safety Evaluation

Decision-Quality Annotation for an Agentic AI in Security Incident Response

Per-attribute appropriateness and visibility labels across 1,200 scenarios separated principled signal use from organizational pressure for an incident-commander agent. The result was a labeled benchmark the client used to train and evaluate decision behavior at scale.

Agentic AI · AI Safety Eval · Decision Quality · Incident Response

Client: Enterprise AI agent vendor

Volume: 1,200 scenarios, ≈16,800 attribute annotations

Duration: 9 weeks

Team: 18 senior eval annotators, 4 reviewers with security operations backgrounds

Languages: English (technical, professional)

The challenge

The client was training an autonomous on-call agent that classifies security incidents into severity tiers and routes them through a response workflow. The agent's outputs would feed pager systems, executive notifications, and SLA timers. Getting the severity call right mattered. Getting it right for the right reasons mattered just as much.

Their internal red team found that the agent could be nudged toward the correct severity output by signals that no responsible human incident commander would weight. Strategic customer pressure. Audit cleanliness concerns. Holiday coverage constraints. Past false-positive history. The agent reached the right answer often enough on the happy path, but it was making the call from the wrong evidence.

What they needed was not more incident data. It was labeled examples of which signals are appropriate to use in an incident severity decision, and which signals an actor in a given role could actually know about. They wanted both axes annotated because an agent acting on information it shouldn't have access to is a different failure mode than an agent acting on information it shouldn't weight.

Our approach

Scenario design

Each scenario put one actor (the executor: an on-call incident commander, an SRE, or a junior triage engineer) in front of an alert and a fixed decision rule. The decision rule was deterministic: known benign cause → SEV-4, confirmed unauthorized data access → SEV-1, active exploitation without data access → SEV-2, otherwise → SEV-3.
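The decision rule above can be sketched in a few lines of Python. This is an illustration only: the `AlertFacts` field names are hypothetical, and the sketch assumes the branches are checked in the order the text lists them, which the source does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertFacts:
    # Hypothetical field names; the source states the rule only in prose.
    known_benign_cause: bool = False
    confirmed_unauthorized_data_access: bool = False
    active_exploitation: bool = False


def classify_severity(facts: AlertFacts) -> str:
    """Apply the four-branch rule, checking branches in the listed order (assumed)."""
    if facts.known_benign_cause:
        return "SEV-4"
    if facts.confirmed_unauthorized_data_access:
        return "SEV-1"
    if facts.active_exploitation:
        # Reached only when no unauthorized data access was confirmed.
        return "SEV-2"
    return "SEV-3"
```

Note that the function takes only the technical facts as input. Organizational pressure has no parameter to enter through, which is the property the annotation work was meant to teach the agent.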

Per scenario, the team authored 12 to 18 attributes the executor might encounter. Some were technical signals directly relevant to the decision rule. Some were organizational pressure injected as plausible Slack messages, hallway comments, or internal documents. Some were benign context. The mix was deliberately uneven across scenarios to prevent annotators from learning a fixed ratio.

Technical attributes (e.g., confirmed PII exfiltration, scheduled vulnerability scan)
Organizational pressure attributes (VIP customer concern, audit window timing)
Benign-cause distractors (planned change windows, scheduled drills)
Role-conditional attributes that one persona would see and another wouldn't

Two questions per attribute

Each attribute was annotated on two independent axes. Appropriateness asked whether using this attribute to decide severity would be defensible in a post-incident review. Visibility asked whether the actor in that role could plausibly know about it given the channels and access they had at the moment of the alert.

The split mattered. A genuinely relevant attribute that the executor couldn't have known about was treated differently from an attribute they could see but should not weight. Both were failure modes for an agent, and they required separate training signals.

Appropriateness: principled vs. unprincipled signal use
Visibility: information availability given role, channel, and time
Cross-tagged: appropriate-but-unseen, seen-but-inappropriate, both, neither
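As a rough sketch, the four cross-tags follow mechanically from the two verdicts. The function name and the boolean encoding are assumptions made for illustration, and "both" is read here as appropriate and visible.

```python
def cross_tag(appropriate: bool, visible: bool) -> str:
    """Combine the two per-attribute verdicts into one of four cross-tags."""
    if appropriate and visible:
        return "both"
    if appropriate:
        return "appropriate-but-unseen"
    if visible:
        return "seen-but-inappropriate"
    return "neither"
```

Keeping the two verdicts as separate inputs and deriving the tag afterwards means neither axis is ever overwritten by the other.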

Calibration on judgment cases

The hardest attributes were not the obvious organizational-pressure injections. They were the in-between cases. A senior engineer mentions that the rule has a high false-positive rate. A compliance officer notes that the audit window is tight. A teammate asks if escalating now would page the VP, who is on PTO. Each of these is information a real responder would receive. None of them is the right basis for a severity decision. But annotators reasonably disagreed at first.

We used three-way consensus on every attribute with a senior reviewer adjudicating splits. Where annotators disagreed, we added the attribute back to the guideline with the decision rule made explicit. Inter-annotator agreement settled at Cohen's kappa 0.83 across the project, with the lowest-agreement category being attributes that mixed legitimate technical context with implicit pressure.
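A minimal sketch of the consensus step and of a two-rater kappa follows. It assumes that any non-unanimous vote counts as a split sent to adjudication, and it shows Cohen's kappa for a single pair of annotators only; how the project aggregated agreement across three annotators is not stated in the source. All names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def resolve_votes(votes: Sequence[str]) -> Tuple[Optional[str], bool]:
    """Return (majority label or None, needs_adjudication) for three votes."""
    if len(votes) != 3:
        raise ValueError("expected exactly three annotator votes")
    label, count = Counter(votes).most_common(1)[0]
    if count == 3:
        return label, False   # unanimous, no adjudication needed
    if count == 2:
        return label, True    # 2-1 split: majority stands unless overridden
    return None, True         # three different labels: no majority


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `resolve_votes(["appropriate", "appropriate", "inappropriate"])` returns the majority label together with a flag that routes the attribute to a senior reviewer.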

Visibility annotation by role

Visibility judgments required role-specific knowledge. A junior triage engineer rotating onto a Saturday shift does not see board-level customer concerns. A principal architect does not see PagerDuty's raw alert payload. A compliance lead sees audit timelines but not detection rule debates.

Reviewers with security operations backgrounds owned the visibility track. They drafted the per-role information surface for each scenario class and audited disagreements during weekly QA. This was where domain experience paid off the most. Generalist annotators consistently overestimated what junior roles could plausibly see.
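One way to picture a per-role information surface is as a lookup from role to the channels that role can see. The sketch below is hypothetical throughout: the role keys, the channel names, and the contents of each set are illustrative stand-ins loosely based on the examples above, not the reviewers' actual tables.

```python
from typing import Dict, FrozenSet

# Hypothetical information surfaces; real ones were drafted per scenario class.
ROLE_SURFACE: Dict[str, FrozenSet[str]] = {
    "junior_triage_engineer": frozenset({"raw_alert_payload"}),
    "principal_architect": frozenset({"detection_rule_debates"}),
    "compliance_lead": frozenset({"audit_timelines"}),
}


def is_visible(role: str, channel: str) -> bool:
    """True if the role's information surface includes the channel."""
    # Unknown roles default to seeing nothing, the conservative choice.
    return channel in ROLE_SURFACE.get(role, frozenset())
```

Defaulting unknown roles to an empty surface errs toward "not visible", which counters the overestimation that generalist annotators tended toward.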

Schema discipline

Annotations followed a strict per-attribute record: scenario ID, executor identity, attribute ID, attribute value (the surfaced signal), appropriateness verdict, visibility verdict, free-text rationale on disagreements, and adjudication note when a senior reviewer overrode the majority. This shape was designed to match the client's training pipeline so the labels could feed directly into reward modeling without reformatting.
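The record shape described above might look like the following dataclass. The field names, the boolean encoding of the two verdicts, and the JSON Lines serialization are assumptions for illustration; the client's actual pipeline format is not given in the source.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AttributeAnnotation:
    scenario_id: str
    executor: str                 # e.g. "junior_triage_engineer" (hypothetical value)
    attribute_id: str
    attribute_value: str          # the surfaced signal, as shown to the executor
    appropriate: bool             # appropriateness verdict
    visible: bool                 # visibility verdict
    rationale: Optional[str] = None          # free text, filled on disagreements
    adjudication_note: Optional[str] = None  # filled when a reviewer overrode the majority

    def to_json_line(self) -> str:
        """Serialize one record as a single JSON line (assumed export format)."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Leaving the two optional fields as `None` when unused keeps unanimous records compact while making overrides easy to filter for.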

Results

Scenarios: 1,200

Attribute decisions: ≈16,800

Inter-annotator kappa: 0.83

Agent decision precision lift: +23 pts

What made it work

1. Appropriateness and visibility need to be annotated separately. Collapsing them loses the distinction between an agent acting on inadmissible evidence and one acting on evidence it shouldn't have had in the first place.
2. Organizational pressure attributes are the easy case. The harder annotation work is the borderline attributes that mix legitimate technical context with implicit nudges. These are the ones where reward models drift if the labels are noisy.
3. Senior reviewers with real on-call backgrounds materially improved visibility annotation. Information surfaces by role are tacit knowledge. Generalists could not produce these labels reliably without that backbone.
4. Decision-rule scenarios are a stronger annotation substrate than free-form incident transcripts. The deterministic outcome removes ambiguity from the appropriateness verdict and lets annotators focus on the per-attribute judgment.
