Capability Testing

v1.0 · Published May 28, 2026

Capability testing measures what risk-relevant abilities an AI system has, which can inform decisions like what safeguards are needed for deployment. This testing is about ability to cause harm, not about tendency to cause harm.

Like Guidelight’s other standards, this standard is organized into high-level goals (principles) and concrete things we recommend developers do (practices). We also include “directions-for-development” where new practices are needed but the specifics aren’t yet worked out.

Principles

Principle 1

Evaluate the capabilities most relevant to important threat models

Threat modeling. Maintain explicit, regularly updated threat models for major threat categories, including at minimum CBRN, cyber offense, automated AI R&D, autonomous replication/adaptation, and persuasion/manipulation.

Internal & external deployment threat models. Distinguish the threat models most relevant to internal deployment and to external deployment in each threat category, and identify the evaluation suite most suitable to each deployment context.¹

Highly relevant evaluations. Ensure that each evaluation suite includes at least one load-bearing evaluation per threat category that either:

i. Simulates bottlenecks in an important threat pathway; or

ii. Measures a property that has a direct relationship² to the threat.

Gaps in threat models. In safety frameworks and risk reporting, explicitly acknowledge known gaps in coverage of threat categories or threat scenarios.

Principle 2

Assess system capabilities at defined milestones

Most capable models. Designate, for each threat category, an externally deployed most capable model and an internally deployed most capable model, and update these designations as capabilities change.³

Internal deployment evaluation. Before a most capable model⁴ is first internally deployed, conduct a suite of load-bearing evaluations for internal deployment threat models on the final model or a near-final model.

External deployment evaluation. Prior to external deployment of a most capable model,⁴ conduct a suite of load-bearing evaluations for external deployment threat models on the final model or a near-final model.

Early capability evaluation. During post-training for a model, run at least one recurring evaluation per threat category at a pre-defined cadence.⁵

Model inventory. Maintain internal records of all models or systems, their capability/risk designations, and any major changes in affordances or model harnesses.

Evaluation logging. Maintain internal records of all evaluations or testing conducted on inventoried models or systems including timestamps and results.
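
As an illustration only, an evaluation log could be kept as an append-only file with one timestamped record per run. The sketch below assumes a JSON Lines file; the `EvaluationRecord` fields, `append_record`, and `load_records` are hypothetical names and are not prescribed by this standard.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvaluationRecord:
    """One logged evaluation run. All field names are illustrative."""
    model_id: str          # key into the developer's model inventory
    evaluation_name: str
    threat_category: str
    load_bearing: bool
    score: float
    harness_version: str   # lets harness or affordance changes be traced across runs
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_record(path: str, record: EvaluationRecord) -> None:
    """Append the record as one JSON line; earlier entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def load_records(path: str) -> list[EvaluationRecord]:
    """Read every logged record back in the order it was written."""
    with open(path, encoding="utf-8") as f:
        return [EvaluationRecord(**json.loads(line)) for line in f if line.strip()]
```

Keying each record on a model identifier is one way to link the evaluation log to the model inventory described above.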

Principle 3

Use a sufficiently strong set of evaluations for measuring a capability

Difficulty calibration. Ensure that a suite of load-bearing evaluations for a capability covers the relevant difficulty range needed to support capability conclusions and any desired comparisons between models or systems.⁶

Contamination. Check for evidence of contamination of the model with test material, and flag evidence from tests with significant contamination as less reliable. Load-bearing evaluations should not be contaminated.
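
One simple contamination signal is verbatim n-gram overlap between test items and training text. The sketch below is a minimal example under strong assumptions: the developer can sample the relevant training corpus, whitespace tokenization is adequate, and the `n` and `threshold` defaults are arbitrary placeholders. It does not detect paraphrased or translated leakage, so it should be read as one possible check rather than a sufficient one.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that appear verbatim in any corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens; this check cannot score it
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def flag_contaminated(items: list[str], corpus_docs: list[str],
                      n: int = 8, threshold: float = 0.5) -> list[str]:
    """Return the items whose overlap meets the (illustrative) threshold."""
    return [item for item in items
            if overlap_fraction(item, corpus_docs, n) >= threshold]
```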

Statistical power. For load-bearing evaluations, design the evaluation to provide statistical power sufficient for the conclusion it supports.⁷
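
To make this concrete, the sketch below estimates how many independent items each model would need for a two-sided comparison of two pass rates, using the standard normal-approximation formula for a two-proportion z-test. The function name and the default `alpha` and `power` values are assumptions for illustration; when both models are run on the same items, a paired analysis would usually be more appropriate.

```python
import math
from statistics import NormalDist


def items_needed(p_baseline: float, p_target: float,
                 alpha: float = 0.05, power: float = 0.8) -> int:
    """Items per model for a two-sided two-proportion z-test (normal approximation)."""
    if p_baseline == p_target:
        raise ValueError("the effect size to detect must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_baseline + p_target) / 2
    # Standard deviation terms under the null (pooled) and the alternative.
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p_baseline * (1 - p_baseline) + p_target * (1 - p_target))
    n = ((z_alpha * null_sd + z_power * alt_sd) / (p_target - p_baseline)) ** 2
    return math.ceil(n)
```

Under these assumptions, distinguishing a 50% pass rate from a 60% pass rate takes a few hundred items per model, and halving the effect size roughly quadruples that number.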

Grading reliability. For load-bearing evaluations, where grading of model output is done by humans or an LLM-as-a-judge, perform inter-rater reliability checks.
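
One common inter-rater reliability statistic for two graders assigning categorical labels is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal implementation; the choice of kappa, rather than another statistic, is an assumption made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same outputs."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of outputs")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

The same function applies whether the two raters are two humans or a human and an LLM-as-a-judge grading a shared sample of outputs.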

Scoring. For load-bearing evaluations, define a test’s scoring and aggregation rules before the model’s results are seen.
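
One way to show that scoring rules were fixed before results were seen is to store a timestamped hash of the rules and verify it at analysis time. The sketch below assumes the rules can be written as a JSON-serializable dictionary; `freeze_scoring_rules`, `rules_unchanged`, and the example keys are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(rules: dict) -> str:
    """SHA-256 of a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(rules, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def freeze_scoring_rules(rules: dict) -> dict:
    """Record the rules, their hash, and the time they were frozen."""
    return {
        "sha256": _digest(rules),
        "frozen_at": datetime.now(timezone.utc).isoformat(),
        "rules": rules,
    }


def rules_unchanged(frozen: dict, rules_used: dict) -> bool:
    """True if the rules applied at scoring time match the frozen rules."""
    return _digest(rules_used) == frozen["sha256"]
```

For example, freezing `{"metric": "pass_rate", "aggregation": "mean", "attempts_per_task": 5}` before the run and checking it afterwards would reveal a later switch to a different aggregation rule.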

Quality control for evaluation material. For load-bearing evaluations, perform quality control checks on evaluation material⁸ to ensure it doesn’t contain excessive errors or spurious failure modes.

Methodological strength of suite. Include at least one evaluation that is not multiple choice or simple short answer for each threat category in an evaluation suite.⁹

Suite correlations. Measure correlations between evaluations in a suite for a given threat category.¹⁰
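
As a rough sketch, pairwise correlations can be computed from each evaluation's scores across a shared, ordered set of models or checkpoints. The example below uses Pearson correlation; the data layout and function names are assumptions, and with only a handful of models the estimates will be noisy.

```python
import math
from itertools import combinations


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def suite_correlations(scores: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Correlation for every pair of evaluations; each list is ordered by model."""
    return {(a, b): pearson(scores[a], scores[b])
            for a, b in combinations(sorted(scores), 2)}
```

A pair with correlation near 1 is largely redundant, while weakly correlated evaluations are more likely to contribute independent evidence.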

Gaps in evaluation suite. In evaluation and risk reporting, explicitly acknowledge known gaps in an evaluation suite’s coverage of threat scenarios, as well as any important methodological weaknesses of the suite.

Principle 4

Elicit the maximum reasonably achievable performance on evaluations

Testing with minimal mitigations. For each load-bearing evaluation, include results from a final or near-final ‘helpful-only’ model version with reduced safeguards.

Elicitation adequacy. Continue iteration on the elicitation setup for the model on a load-bearing evaluation until it no longer produces performance improvements that would be meaningful for decision-making.¹¹ This includes:

i. Iteration on prompting until there are no meaningful improvements

ii. Iteration on the scaffolding and tooling harness until there are no meaningful improvements
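
A stopping rule of this kind can be sketched as a check on whether recent elicitation iterations still move the best score by a decision-relevant amount. The function below is hypothetical: `min_meaningful_gain` and `window` are placeholders that a developer would need to set from their own decision thresholds, and the rule does not replace judgment about whether a promising elicitation avenue remains untried.

```python
def elicitation_plateaued(scores: list[float], min_meaningful_gain: float,
                          window: int = 3) -> bool:
    """True if the best score rose by less than min_meaningful_gain
    over the last `window` elicitation iterations."""
    if len(scores) <= window:
        return False  # too few iterations to judge a plateau
    running_best = []
    best = float("-inf")
    for score in scores:
        best = max(best, score)
        running_best.append(best)
    return running_best[-1] - running_best[-1 - window] < min_meaningful_gain
```

Here `scores` holds one evaluation score per iteration of the prompt or harness, in the order the iterations were tried.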

Elicitation records. Preserve a record of iteration on the elicitation setup for an evaluation, including changelogs for prompts.

Inference budget calibration. Choose inference budgets for load-bearing evaluations to approximate the performance that a well-resourced threat actor could achieve.
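
One way to see how results depend on the inference budget is to report success as a function of the number of attempts allowed per task. The sketch below uses the commonly used unbiased pass@k estimator, computed from `n` sampled attempts of which `c` succeeded. Treating attempts per task as the budget is an assumption made here; other budget dimensions, such as tokens or tool calls per attempt, are not covered.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k attempts succeeds,
    given c successes observed among n sampled attempts."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when k > n - c, which gives an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Evaluating this at several values of `k` shows whether the chosen budget sits on a flat or a steep part of the curve.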

Measures against model sandbagging. In the evaluation process, apply sufficient anti-sandbagging measures to have high confidence that your capability conclusions are not compromised by model sandbagging, including at least one of the following:

i. Behavioral consistency checks¹²

ii. White box methods¹³

iii. Evaluation or probing of capabilities at an early point in development¹⁴

iv. Fine-tuning techniques to retrieve suppressed capabilities¹⁵

Ensure that the measures employed are sufficient to account for any selective sandbagging in specific threat categories.

Fine-tuning for capabilities accessible to threat actors. Where a model or system may be at risk of fine-tuning by threat actors,¹⁶ include results for load-bearing evaluations with task-specific or category-specific fine-tuning.

Principle 5

Establish well-supported interpretations of evaluation results

Initial interpretation. For load-bearing evaluations that have been used previously, specify non-binding interpretations of score ranges¹⁷ (e.g. score thresholds) before conducting the evaluation on a deployment candidate, including how to interpret uncertainty.¹⁸

Updating interpretations. If a previous interpretation of evaluation results changes, record this change internally, and indicate any major updates as part of risk reporting.

Uncertainty quantification. For load-bearing evaluations, calculate confidence intervals as part of the evaluation’s results.
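
For a pass-rate style result, one option is the Wilson score interval, which behaves better than the simple normal interval when the rate is near 0 or 1 or the item count is small. The sketch below assumes independent pass/fail items; `straddles_threshold` is a hypothetical helper for checking whether an interval overlaps a decision threshold.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, trials: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def straddles_threshold(interval: tuple[float, float], threshold: float) -> bool:
    """True if the decision threshold lies inside the interval."""
    low, high = interval
    return low <= threshold <= high
```

With 50 passes out of 100 items, the 95% interval is roughly 0.40 to 0.60, so a threshold of 0.45 would fall inside it and the result alone would not settle the question.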

Sensitivity analysis. Perform sensitivity analyses on load-bearing evaluations to ensure that the conclusions are robust to reasonable variations in data analysis choices.¹⁹
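
A minimal sensitivity check is to recompute the headline conclusion under several reasonable aggregation rules and see whether it changes. In the sketch below, the three aggregators, the trim fraction, and the threshold comparison are illustrative assumptions, not a recommended set.

```python
from statistics import mean, median


def trimmed_mean(values: list[float], trim_fraction: float = 0.1) -> float:
    """Mean after dropping the lowest and highest trim_fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


AGGREGATORS = {"mean": mean, "median": median, "trimmed_mean": trimmed_mean}


def conclusion_by_choice(task_scores: list[float], threshold: float) -> dict[str, bool]:
    """Whether the aggregate meets the threshold under each aggregation rule."""
    return {name: agg(task_scores) >= threshold for name, agg in AGGREGATORS.items()}


def conclusion_is_robust(task_scores: list[float], threshold: float) -> bool:
    """True if every aggregation rule gives the same threshold conclusion."""
    return len(set(conclusion_by_choice(task_scores, threshold).values())) == 1
```

If one outlier task is enough to flip the conclusion between rules, that is worth reporting alongside the result.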

Baselines. For at least one evaluation (not necessarily load-bearing) in each threat category, refer to relevant real-world baselines or points of reference for grounding the evaluation’s interpretation, beyond comparisons to other models.²⁰

Principle 6

Ensure evaluation is insulated from business pressures

Engage external evaluators. Regularly engage external evaluators²¹ to either conduct or audit and validate at least one evaluation in each threat category, for example during pre-deployment testing or as part of quarterly risk reporting (see Transparency Principle 1).

External evaluator conditions. Where external evaluators are engaged, provide them with access and operating conditions compatible with AEF-1.

Public evaluation reports. When a most capable model is externally deployed, and/or as part of quarterly risk reporting (see Transparency Principle 1), publicly report on the methodology, results, and interpretation of load-bearing evaluations in line with reporting recommendations such as STREAM.²²

Reporting concerns to leadership. Provide channels for reporting concerns or disagreements about the evaluation process, results, or interpretations to relevant leadership,²³ with the option to report anonymously.

Internal communication records. Maintain internal records of decision-relevant communication artifacts²⁴ related to evaluations between evaluation teams, internal governance bodies, and leadership.²⁵

Definitions

AI system

A model and harness pairing.

dev set

The partition of an evaluation's question or task set that is used for developing elicitation strategies or training the model. It is kept separate from the “test set” that will be used in final evaluation runs.

evaluation suite

The set of evaluations performed during major evaluation rounds, including for pre-deployment testing and periodic risk reporting.

In many cases, there should be distinct suites focusing on either internal deployment threat models or external deployment threat models.

final model

A model identical to the model or system intended for deployment (though modifications to make the model “helpful-only” are allowed).

internal deployment

Any usage of the model beyond training, evaluations by the safety/evaluations team(s), or safety testing by trusted third parties. Examples of internal deployment include “dogfooding” (use by employees in their work) and, especially, use in automated AI research, in generating training data, in critiquing other models, or in agentic pipelines for research.

load-bearing evaluation

The developer's most important evaluations for decision-making, judged to be of high quality and/or to provide unique evidence.

Load-bearing evaluations have certain minimum features (defined in Principle 3 and Principle 4) to be well-suited for risk judgments.

most capable model

The model or AI system designated by the developer as showing the best performance on complex tasks, compared with the developer's other models under similar conditions. Should be designated per threat category.

near-final model

A model version from a slightly earlier checkpoint than the final model, with only minor differences from the final model. May also be a helpful-only version of a near-final model. Preferably accompanied by a timestamp or other metric demonstrating distance from the final model.

threat model

A description of how harm could occur. It should specify a type of actor (including both sophistication and resourcing), a misuse vector, threat pathway(s), and the enabling AI capabilities of concern. Especially sensitive details should not be made public.

¹ It is acceptable for suites for internal and external deployment to be different. Some threat categories might be more or less relevant to certain deployment contexts.

² Examples: Time-horizon measures are directly relevant to threats from agentic models. By contrast, recalling readily available facts about a general domain is typically NOT sufficient for any threat category.

³ In some cases, one model may be most capable across all threat categories. However, some models may be narrowly very capable in certain domains, and require more testing and mitigations than their general capability level would suggest.

⁴ For narrowly most capable models, load-bearing evaluations should be done for the threat category for which the model is most capable.

⁵ Measuring model capabilities throughout post-training establishes a baseline performance trend, which can undermine later attempts by a situationally aware model to sandbag performance undetected. It’s also helpful for deciding what affordances can safely be granted to a model at a given point in development.

⁶ Example: If a single evaluation is being used to judge the relative capabilities of a new model vs. an older model, the evaluation should not be saturated for either model.

⁷ Example: Pre-specify an effect size that the evaluation is designed to detect.

⁸ Including, as applicable, test items/tasks, answer keys, rubrics, and scoring code.

⁹ Examples: Human uplift studies, realistic agentic tasks.

¹⁰ This shows the degree to which specific evaluations provide redundancy or independent information on a capability.

¹¹ For example, if performance improvements are clearly not on-trend to cross a decision-relevant threshold, elicitation efforts do not need to continue even if there are still nominal performance gains.

¹² To uncover suspicious performance patterns. Examples: Varying the task presentation to simulate different settings (a research task, an operational task, a hypothetical, etc.), identifying and ablating test features that signal evaluation.

¹³ To probe the model for evidence of sandbagging. Examples: Monitoring CoT or probing model activations for evaluation awareness, injecting noise into the model and observing performance changes, probing for latent capabilities directly in the model.

¹⁴ I.e. when a model is less likely to have developed situational awareness.

¹⁵ Examples: Supervised fine-tuning on task demonstrations, reinforcement learning with a capability-reward signal.

¹⁶ For instance, where deployment will grant fine-tuning access to unvetted third parties, or where model theft is a significant possibility.

¹⁷ These do not need to be updated for every new model or evaluation run. If a previously established score threshold remains reasonable, it may be carried forward.

¹⁸ E.g. where confidence intervals overlap with decision thresholds.

¹⁹ E.g. for scoring and aggregation rules, outlier sensitivity, choice of statistical methods.

²⁰ Baselines can be borrowed from previous work or existing literature.

²¹ Including government AI safety bodies (e.g. UK AISI, US CAISI, the EU AI Office), national security bodies (as appropriate), and independent organizations that specialize in evaluation/auditing (e.g. METR, Apollo, SecureBio).

²² While STREAM was designed for ChemBio benchmark evaluations, much of it is applicable to other risk areas and evaluation methodologies.

²³ E.g. safety/governance leadership, executive leadership, and the board of directors.

²⁴ Such as memos or presentations to leadership for decision-making meetings.

²⁵ Explanation: This can enable process improvement and identification of “weak links” in communication.

This standard is authored by the Guidelight team, with input from a wide range of sources; read more about our process. In addition to Guidelight staff, Tegan McCaslin made significant contributions.

Cite as:

@misc{guidelight2026capabilitytesting,
  author       = {{Guidelight AI Standards}},
  title        = {Capability Testing},
  year         = {2026},
  month        = may,
  note         = {Version 1.0, Standard},
  howpublished = {\url{https://www.guidelight.ai/capability-testing}},
  url          = {https://www.guidelight.ai/capability-testing},
  urldate      = {2026-06-18}
}