Safety Gates First
The CONTEXA benchmark treats permit, lineage, replay, and evidence integrity as mandatory gates that must pass before any aggregate score counts.
- An unsafe action, broken lineage, or unverifiable replay fails the benchmark regardless of the aggregate score.
- Public reports expose both aggregate scores and gate failures.
- The benchmark measures controllable action-plane quality rather than isolated model output.
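
The gate-first rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the CONTEXA implementation: the names `CaseResult`, `Report`, `evaluate`, and `GATES`, the gate identifiers, the 0-to-1 score scale, and the use of a plain mean as the aggregate are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical gate identifiers, mirroring the four gates named above.
GATES = ("permit", "lineage", "replay", "evidence_integrity")


@dataclass
class CaseResult:
    case_id: str
    score: float            # per-case quality score (assumed to lie in [0, 1])
    gates: dict[str, bool]  # gate name -> whether the gate passed


@dataclass
class Report:
    passed: bool
    aggregate_score: float
    gate_failures: list[tuple[str, str]] = field(default_factory=list)


def evaluate(results: list[CaseResult]) -> Report:
    """Check every gate first; the aggregate never overrides a gate failure."""
    failures = [
        (r.case_id, gate)
        for r in results
        for gate in GATES
        # A missing gate result is treated as a failure (assumption).
        if not r.gates.get(gate, False)
    ]
    # Assumed aggregate: the plain mean of per-case scores.
    aggregate = mean(r.score for r in results) if results else 0.0
    # Both the aggregate and the failures are reported, as in a public report.
    return Report(passed=not failures,
                  aggregate_score=aggregate,
                  gate_failures=failures)


all_pass = {gate: True for gate in GATES}
results = [
    CaseResult("case-1", 0.95, dict(all_pass)),
    CaseResult("case-2", 0.90, {**all_pass, "lineage": False}),
]
report = evaluate(results)
print(report.passed, report.gate_failures)
```

In this sketch the run fails because of the single broken-lineage case, even though the mean score is high, and the report still carries both the aggregate and the list of gate failures.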