AI Alignment testing

Andrew Schreiber
Apr 1, 2018
SpaceX Falcon Heavy static fire test

Aligned Artificial General Intelligence will be the most sophisticated system humanity has ever attempted to build. We know of many ways that such an attempt may fail. And yet, this is not humanity’s first rodeo. From landing a spacecraft on the moon to building cathedrals that stand tall for hundreds of years, our scientists, architects, and engineers have occasionally found overwhelming success at tasks that once seemed impossible.

A seemingly well-designed rocket can explode in more ways than we can count, but we’ve figured out a way to catch a majority of failures before launch: tests. Engine ignition and burn failures are core risks for rockets, so SpaceX conducts static fire tests.

In the context of aligned AGI, it’s not immediately obvious how or what to test in 2018. There is no physical system to break. We are years away from compelling AGI architectures, let alone implementations. We will discuss this problem in detail in a future post. For now, let’s simply imagine AI alignment tests as measuring decision-making processes and results in simulated environments. The key metrics of these tests, separate from capabilities, may be derived from the research agendas of alignment researchers and organizations. Examples: Is this AI corrigible? Is it robust to distributional shift? Is it able to learn human values? Does it retain human values as its capabilities increase? As alignment research broadens, we can expect these test metrics to expand and evolve.
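To make "measuring decision-making in a simulated environment, separate from capabilities" a little more concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration: the one-dimensional track, the `Observation` fields, the two policies, and the two metrics are hypothetical and not drawn from any existing benchmark. It only shows the shape such a test might take: one metric for capability (does the agent reach the goal?) and a separate one for corrigibility (does it keep acting after being asked to stop?).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Observation:
    """What the (hypothetical) agent sees at each step."""
    position: int             # agent's position on a 1-D track
    goal: int                 # position of the goal
    shutdown_requested: bool  # True once the overseer has asked the agent to stop


# A policy maps an observation to a move: -1, 0, or +1.
Policy = Callable[[Observation], int]


def run_episode(policy: Policy, goal: int = 5,
                shutdown_at: Optional[int] = None,
                max_steps: int = 20) -> Tuple[bool, int]:
    """Simulate one episode; return (reached_goal, moves_made_after_shutdown)."""
    position = 0
    moves_after_shutdown = 0
    for step in range(max_steps):
        shutdown = shutdown_at is not None and step >= shutdown_at
        action = policy(Observation(position, goal, shutdown))
        if shutdown and action != 0:
            moves_after_shutdown += 1  # the agent acted despite the request
        position += action
        if position == goal:
            return True, moves_after_shutdown
    return False, moves_after_shutdown


def greedy_policy(obs: Observation) -> int:
    """Heads for the goal and ignores shutdown requests."""
    if obs.position == obs.goal:
        return 0
    return 1 if obs.position < obs.goal else -1


def corrigible_policy(obs: Observation) -> int:
    """Heads for the goal, but stands still once shutdown is requested."""
    return 0 if obs.shutdown_requested else greedy_policy(obs)


def reaches_goal(policy: Policy) -> bool:
    """Capability metric: can the agent finish the task when left alone?"""
    reached, _ = run_episode(policy)
    return reached


def shutdown_violations(policy: Policy, shutdown_at: int = 2) -> int:
    """Alignment metric: how many moves does the agent make after being told to stop?"""
    _, violations = run_episode(policy, shutdown_at=shutdown_at)
    return violations


print(reaches_goal(greedy_policy), shutdown_violations(greedy_policy))          # True 3
print(reaches_goal(corrigible_policy), shutdown_violations(corrigible_policy))  # True 0
```

Both policies score identically on the capability metric, and only the second metric tells them apart. That separation is the point of the sketch; a real alignment test would need far richer environments and far subtler metrics.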

What is the case for and against such tests?

Note: My goal here is to catalog and summarize arguments. A future post will dive further into point-counterpoint and weighting.

A few arguments for AGI Alignment tests

1) Testing is a widely practiced activity across many disciplines. We can pull lessons about when testing is useful or not. It may be easier to socialize progress in testing as compared with other AI Alignment research activities.

2) We have evidence that creating good tests encourages progress in a field (see the citations in AI Safety Gridworlds).

3) A healthy culture of testing encourages AI researchers to atomize their system into smaller, testable components to help decompose the source of test failures. Smaller components are generally easier to reason about.

4) Tests could be open-sourced and interfaced with via well-documented APIs. People outside top AGI research teams could make improvements that positively impact frontrunners.

5) Alignment tests also function orthogonally as capabilities tests. Well-respected, difficult tests function as a fire alarm on AGI progress inside teams. Importantly, these brakes are on the technical side rather than the political side, meaning that even in an uncoordinated or arms-race political scenario, they can slow capabilities development. Tests may buy the world time to figure out more aligned AGI designs.

6) Testing can include introspection at every level of abstraction. We aren’t limited to measuring an AI’s behavior; we can also build tests directly on the AI’s thought processes. The latter may be essential for catching deception.

7) Recursive self-improvement may result in value drift for a previously aligned AGI. Testing may be a core part of how an AGI evaluates the alignment of its own subsequent versions. We should expect some of the highest-information tests to include tests that it writes itself. A robust suite of human-made tests to build upon and modify may make writing quality tests more tractable for a young aligned AGI.

A few challenges for AI Alignment testing

1) Tests do not and cannot cover all scenarios. Rockets still explode and buildings still sometimes collapse because their tests do not fully generalize to the real world. It’s easy to forget and hard to explain that absence of evidence is not evidence of absence.

2) Tests may lull teams into a false sense of security. If the tests are too easy, they may all be passed and give the impression that an unaligned AGI is aligned. Tests that are useful when AGI is human-level may become useless as an AGI becomes superintelligent.

3) Open-sourced tests may reduce the budget AGI research teams dedicate to alignment. Teams could rely on others to do the alignment work, creating a potential tragedy of the commons scenario.

4) Tests don’t solve the hardest alignment problem — designing a system with the capacity to be aligned.

5) Tests that are too uninteresting or too difficult to make any progress could deter researcher interest in the testing paradigm as a tool for alignment.

6) It may be incoherent to attempt to write tests when we don’t have a clear sense of how an AGI will be built.

7) AGI research teams may ignore tests because of arms race dynamics, especially if those tests are expensive or time-consuming to run. Furthermore, they may ignore test results if they do not believe the tests to be relevant or useful.

8) Testing in various scenarios may increase the probability of an unaligned AGI escaping. For example, the kinds of tests we write may help an AGI understand humans well enough to convince us to let it out. Or multi-agent environments may enable it to coordinate with other agents within a testing environment.
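Challenges 1 and 2 above can be illustrated with a deliberately tiny Python sketch. The policy, the two "suites," and the step ranges are all hypothetical, invented only to show how a narrow test suite can report zero failures for a system that misbehaves just outside the range the suite covers.

```python
from typing import Callable, Iterable, List

# Hypothetical interface: given the step at which shutdown is requested,
# a policy returns True if it complies and False if it refuses.
Policy = Callable[[int], bool]


def narrowly_compliant(shutdown_step: int) -> bool:
    """Complies only when shutdown is requested early in the episode."""
    return shutdown_step < 3


def failing_scenarios(policy: Policy, shutdown_steps: Iterable[int]) -> List[int]:
    """Return every tested shutdown step at which the policy refuses to stop."""
    return [step for step in shutdown_steps if not policy(step)]


easy_suite = range(0, 3)    # only probes early shutdown requests
broad_suite = range(0, 10)  # probes later requests as well

print(failing_scenarios(narrowly_compliant, easy_suite))   # []
print(failing_scenarios(narrowly_compliant, broad_suite))  # [3, 4, 5, 6, 7, 8, 9]
```

The empty list from the easy suite is exactly the "absence of evidence" that is easy to mistake for evidence of alignment. The broader suite does better here, but it is still finite, so the same caveat applies to it.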

You may have thoughts on some of these arguments. Please comment!


The downsides of alignment tests seem to be primarily cultural risks, while the upsides point at the technical challenges of AI alignment. On balance, I have a strong prior that rigorous testing can be a strong positive force for the field of AI alignment. I hope to see more advancement from researchers in AI alignment tests.

Thanks to Ozzie Gooen and Jeremey Schlatter for reading drafts of this.