Lightning · Technical safety & evaluation · Technical safeguards and monitoring

Technical Advantages for Weak-to-Strong Oversight: Bets I'd Like Challenged

8 July 2026 · 12:07 pm–12:12 pm · Cullen

I'll lay out some of my current research directions, which focus on designing machine learning methods that favour the supervisor in weak-to-strong alignment settings. 1. Rewards should be cooperative: you don't want unstable competition with a stronger model (CIRL, gradient routing, etc.). 2. The supervision signal should also be close, not a distant proxy (avoid RL if you can). 3. Internal interventions (such as Representation Engineering and steering) favour the supervisor. 4. Training should be self-supervised: the weak supervisor can only give weak labels, so we need the strong model to supervise itself. I invite attendees to find me after the talk to change my mind or discuss.

Speaker

Michael Clark
Independent Researcher
I'm Michael (wassname) Clark, a Perth-based ML researcher and ex-geoscientist. I work as a principal data scientist at Woodside Energy and sit on the board of medical AI startup Cytophenix. I'm interested in 1) building tools to ask AI hard questions and know whether they're lying, and 2) exploring unsupervised ways to scalably give AI good character. Interests: RepE, reliable steering, OOD moral generalization.
