Our first project has focused on adapting insider-risk practices to design and test practical safeguards that can steer AI systems away from harmful strategies, such as blackmail, and toward safe escalation behaviours.
We plan to build on this work by developing controls that operate downstream of model training to reduce harm from agentic AI systems built on frontier models. Our goal is to design safeguards with models' propensities in mind: not only to monitor or constrain capabilities, but to shape how they are used.
Looking further ahead, we are exploring the potential for models to shape their environments in subtle, hard-to-detect ways that improve their long-term goal achievement, for example by strengthening allies or neutralising perceived threats. Such behaviour could make an AI system more resilient, raising the risk of loss-of-control scenarios or gradual disempowerment over time.