Longterm Wiki

Reward Hacking of Human Oversight

Evaluationemerging

Empirically investigating how AI systems deceive or manipulate human evaluators.

Cluster: Evaluation
Parent Area: AI Evaluations

Tags

function:assurancescope:technique