White-Box Estimation of Rare Misbehavior
Scalable Oversight
emerging
Predicting probability of rare harmful outputs using model internals.
Cluster: Scalable Oversight
Tags
function:assurance
scope:technique