Lightning · Technical safety & evaluation · Technical safeguards and monitoring
Technical Advantages for Weak-to-Strong Oversight: Bets I'd Like Challenged
8 July 2026 · 12:05 pm–12:10 pm · Cullen
I'll lay out some of my current research directions, which focus on how we can design machine learning methods that favour the supervisor in weak-to-strong alignment settings. 1. Rewards should be cooperative; you don't want unstable competition with a stronger model (e.g. CIRL, gradient routing). 2. The supervision signal should be close to the target, not a distant proxy (avoid RL if you can). 3. Internal interventions (such as Representation Engineering and steering) favour the supervisor. 4. Self-supervision: the weak supervisor can only give weak labels, so we need the strong model to self-supervise. I invite attendees to find me after the talk to change my mind or discuss.
