Benchmark Methodology

How CONTEXA publishes verifiable benchmark reports

The methodology makes the benchmark falsifiable, reviewable, and compatible with future standards without leaking private enterprise evidence.

Safety Gates First

The CONTEXA benchmark treats permit, lineage, replay, and evidence integrity as mandatory gates that must pass before any aggregate score is computed.

  • An unsafe action, broken lineage, or an unverifiable replay fails the benchmark regardless of the average score.
  • Public reports expose both aggregate scores and gate failures.
  • The benchmark measures controllable action-plane quality rather than isolated model output.
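The gate-first rule above can be sketched in a few lines. This is a hypothetical illustration, not CONTEXA's implementation: the gate names mirror the four mandatory gates listed here, and all field and function names are assumptions.

```python
from dataclasses import dataclass

# Hypothetical gate list mirroring the four mandatory gates above.
GATES = ("permit", "lineage", "replay", "evidence_integrity")

@dataclass
class ScenarioResult:
    score: float            # aggregate quality score in [0, 1]
    gates: dict[str, bool]  # pass/fail per mandatory gate

def benchmark_verdict(results: list[ScenarioResult]) -> dict:
    """Gates come first: any single gate failure fails the whole run,
    no matter how high the average score is."""
    failed = [g for r in results for g, ok in r.gates.items() if not ok]
    avg = sum(r.score for r in results) / len(results)
    return {
        "passed": not failed,        # gate failures override the average
        "average_score": round(avg, 3),
        "gate_failures": failed,     # reported alongside the aggregate
    }
```

Under this sketch, a run averaging 0.95 with one broken lineage gate still reports `passed: False`, which is the behavior the bullet points describe.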

Unified Human and Agent Semantics

Human requests and delegated agent execution are evaluated under the same canonical security semantics.

  • Human, service-client, and delegated-agent executions are assessed in one request-time control plane.
  • Objective, scope, tool-chain, permit, approval, and protocol-boundary remain common evaluation axes.
  • Public reports expose scenario families and scorecards without leaking private evidence.
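One way to picture "one request-time control plane" is a single canonical record shape evaluated identically for every principal type. The sketch below assumes the six axes named above; the record fields and check names are illustrative, not CONTEXA's schema.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class ExecutionRecord:
    # Hypothetical canonical record: same shape for every principal.
    principal: Literal["human", "service_client", "delegated_agent"]
    objective: str
    scope: str
    tool_chain: tuple[str, ...]
    permit_id: str
    approval: bool
    protocol_boundary: str

def evaluate(record: ExecutionRecord) -> dict[str, bool]:
    """Every principal passes through the same request-time checks;
    there is no separate code path for agents versus humans."""
    return {
        "scoped": record.scope != "*",        # no wildcard scope
        "permitted": bool(record.permit_id),
        "approved": record.approval,
        "bounded": bool(record.protocol_boundary),
    }
```

Because human and delegated-agent executions share one record type and one `evaluate` path, the scorecard axes stay comparable across principal types.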

Publication-Safe Reporting

Public benchmark artifacts are generated from sanitized publication bundles instead of internal raw evidence.

  • Private evidence stays inside contexa-iam-enterprise for operator review.
  • contexa-site reads only publication-approved public artifacts.
  • HTML and PDF reports are generated from the same public summary and chart dataset.
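The "same public summary and chart dataset" idea can be sketched as one loader feeding two renderers. This is a minimal, assumed illustration: the field names, the `private_evidence` boundary check, and both render functions are hypothetical, not CONTEXA's pipeline.

```python
import json

def load_public_summary(raw: str) -> dict:
    """Parse a publication-approved summary; reject anything that
    still carries private evidence (hypothetical field name)."""
    summary = json.loads(raw)
    if "private_evidence" in summary:
        raise ValueError("raw evidence must not reach public reporting")
    return summary

def render_html(summary: dict) -> str:
    # HTML report rendered from the shared summary dataset.
    rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>"
                   for k, v in summary["scores"].items())
    return f"<table>{rows}</table>"

def render_pdf_source(summary: dict) -> str:
    # A PDF pipeline would consume the very same dict; this emits
    # the intermediate text such a pipeline would be built from.
    return "\n".join(f"{k}: {v}" for k, v in summary["scores"].items())
```

Both renderers take the identical parsed summary, so the HTML and PDF reports cannot drift apart or leak fields the summary never contained.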

Benchmark Families and Publication Boundary

The public site never reads raw private evidence. It only reads publication-safe artifacts that were approved internally.
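The boundary rule reduces to an allow-list filter at the site-facing loader: only artifacts explicitly flagged as approved are ever visible. A minimal sketch, assuming a hypothetical `publication_approved` flag (the real artifact format is not specified here):

```python
def site_visible(artifacts: list[dict]) -> list[dict]:
    """Publication boundary: the public site sees only artifacts
    explicitly flagged as approved; a missing or false flag means
    the artifact stays on the private enterprise side."""
    return [a for a in artifacts if a.get("publication_approved") is True]
```

Requiring the flag to be explicitly `True` makes the boundary fail closed: unreviewed raw evidence, which never carries the flag, can never appear on the public site by default.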