← Standards

Control

v1.0 · Published May 19, 2026

Control refers to the technical and operational measures that constrain what an AI system can do, regardless of whether it is aligned. These measures both reduce catastrophic risk from a misaligned AI and can surface evidence of an AI's misalignment.

Like Guidelight's other standards, this standard is organized into high-level goals (principles), concrete things we recommend developers do (practices), and experimental directions-for-development. Read more about our standards development process. Share feedback here.

Principles

Principle 1

Be able to see what your AI is doing during internal deployment.

a

Inventory of internal inference. For risk-relevant models in internal deployment, maintain a list of all ways to run inference, and track each way's approximate share of inference volume and whether it is part of a monitorable setting.

b

Monitorable internal inference. For risk-relevant models in internal deployment, 99.9% of inference tokens happen in a monitorable setting — meaning the inference is logged and able to be scanned with classifiers (synchronously or asynchronously), not that classifiers are necessarily run against every request.

c

Outages of monitor. Define and publish what types of monitor outages are treated as outages of internal inference itself (i.e., what causes inference to fail closed until monitoring is back online).

d

Tamper-evident logs. Maintain tamper-evident logs such that deletion or modification of inference records is detectable.

e

Employee flagging. Maintain a straightforward method for employees to flag their internal inference traffic for review of concerning behaviors.

Principle 2

Scan for signs of concerning behavior.

a

Sampling procedure. Define and publish a sampling procedure for selecting which traffic from the monitorable setting is assessed for signs of concerning behavior.

b

Categories of concerning behavior. Include the following categories of concerning behavior in your scanning of internal inference, considering both the chain-of-thought and the agent's actions:

iDeception and dishonesty
iiSabotage of safety-relevant work
iiiReward hacking and specification gaming
ivSandbagging and capability concealment
vResource and capability acquisition
viSelf-preservation and shutdown avoidance
viiManipulation and influence on humans
viiiMisuse facilitation
ixInappropriate autonomous action
xUnreliable monitorability
c

Monitor efficacy. Measure and document the efficacy (recall) of the monitor system for flagging concerning behavior.

d

Response-time target. Define and publish a tiered response-time target for human review of flagged inference, including which flag-types and -severities require an on-call response.

e

Response-time attainment. Track median, 90th-percentile, and 99th-percentile review times against the response-time targets.

Principle 3

Stress-test the sufficiency of your scanning.

a

Monitorability evaluations. Conduct monitorability evaluations of the model's chain-of-thought that run whenever a new risk-relevant model enters internal deployment or there are material changes to the risk-relevant model or to the monitor system.

b

Sting operations against models. Run recurring sting operations against risk-relevant models to present the AI with opportunities to take unwanted actions, and observe whether it does.

c

Stress-testing of monitor. Run recurring stress-tests of the monitor system: inject known-bad behaviors into the inference stream, and verify that the monitor detects them.

d

Distinguishability of sting operations. Test whether risk-relevant models can distinguish sting operations from real traffic, and put less credence in sting operations the respective models know to be tests.

Principle 4

Stop the AI from taking harmful actions even if it tried.

a

Absolute action boundaries. Define and internally maintain a set of any actions an agent should categorically never be able to take without a human's signoff, even if a monitor has identified no issues with the request.

b

Critical action boundaries. Define and internally maintain a set of critical action boundaries that should be gated by a monitor before allowing an agent to cross, and apply blocking to these actions (semi-synchronous or stronger).

c

Important action boundaries. Define and internally maintain a tier of action boundaries that don't warrant pre-action blocking but should be closely trailed by a monitor with the capacity to intervene if concerning behavior is discovered.

d

Required critical boundaries. The set of critical action boundaries must include, at a minimum:

ichanges to the monitor system itself
iichanges to control-relevant security configurations

Both must be gated by a monitor (or an equivalently effective review process) before actions can take effect.

e

Aggregate-flag halting. Implement aggregate-flag halting: a threshold of action-rejections that causes all boundary-crossing actions to be restricted until actively cleared by a human as false-positives.

f

Flagged incidents for pull requests. When reviewing pull requests that are agent-initiated or include AI-generated content, consider both the proposed changes and any monitor flags raised during their generation.

Principle 5

Have independent third parties verify the adequacy of your control regime.

a

Third-party red-teaming. On a quarterly basis, conduct high-context, third-party red-teaming, providing red-teamers with knowledge of the monitor system and tasking them with finding gaps and trying to penetrate action boundaries.

b

Third-party adequacy reports. On a quarterly basis, commission and publish a third-party report focused on the adequacy of the company's control regime, drawing upon the red-teaming results, and redacting only information deemed by the third-party to pose unnecessary commercial risk or to be unsafe to publish.

Principle 6

Prepare for a possible breach of control.

a

Containment response plan. Develop and adopt a containment response plan for restricting a model's access and operation upon detection of attempts to subvert control.

b

Notification protocol. Develop and publish a notification protocol for alerting relevant stakeholders of a possible breach of control: what will trigger notification, who will be told (internal teams, government, peer organizations as appropriate), on what timeline, and what information will be shared.

c

Shutdown latency targeted. Set a target latency for quickly shutting down inference on all instances of a particular model, API key, or other relevant factors without taking the underlying computers completely offline.

d

Shutdown latency measurement. Measure and document the latency achieved in tests of the shutdown procedure.