Longterm Wiki

White-Box Estimation of Rare Misbehavior

Scalable Oversight | emerging

Predicting the probability of rare harmful outputs using model internals.

Cluster: Scalable Oversight

Tags

function:assurance, scope:technique