Longterm Wiki

Bounded objectives research

paper

Authors

Stuart Armstrong·Sören Mindermann

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Addresses the challenge of inferring reward functions from agents with unknown rationality levels in inverse reinforcement learning, tackling a practical ambiguity problem relevant to AI alignment and human-AI preference learning.

Paper Details

Citations
0

Metadata

arXiv preprint · primary source

Abstract

Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention. Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent's policy in enough environments. This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam's razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple `normative' assumptions, which cannot be deduced exclusively from observations.

Summary

This paper addresses a fundamental challenge in inverse reinforcement learning: inferring reward functions from observed behavior when the agent's rationality level is unknown. The authors prove that it is impossible to uniquely decompose an agent's policy into a planning algorithm and reward function due to a No Free Lunch result, and that even with simplicity priors, multiple decompositions can produce similarly high regret. They argue that resolving this ambiguity requires normative assumptions that cannot be derived solely from behavioral observations, highlighting a previously underexplored but practically important limitation of IRL approaches.
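To make the ambiguity concrete, here is a minimal Python sketch (not taken from the paper; the single-state, two-action setup and all names in it are illustrative assumptions) showing three planner-and-reward decompositions that reproduce the same observed behaviour, in the spirit of the degenerate decompositions the authors describe:

# Toy illustration of the planner/reward ambiguity (hypothetical example,
# not the paper's code): several (planner, reward) pairs yield one policy.

ACTIONS = [0, 1]
observed_action = 0  # the agent is always seen choosing action 0

# Candidate reward functions (action -> reward)
reward_true    = {0: 1.0, 1: 0.0}  # favours the observed action
reward_flipped = {0: 0.0, 1: 1.0}  # negation of the above
reward_zero    = {0: 0.0, 1: 0.0}  # trivial, indifferent reward

# Candidate planners (reward function -> chosen action)
def rational(reward):       # maximises reward
    return max(ACTIONS, key=lambda a: reward[a])

def anti_rational(reward):  # minimises reward
    return min(ACTIONS, key=lambda a: reward[a])

def indifferent(reward):    # ignores the reward entirely
    return 0

decompositions = [
    ("rational + true reward",         rational,      reward_true),
    ("anti-rational + flipped reward", anti_rational, reward_flipped),
    ("indifferent + zero reward",      indifferent,   reward_zero),
]

# Every pair is consistent with the observed behaviour...
for name, planner, reward in decompositions:
    assert planner(reward) == observed_action, name

# ...yet they disagree about the value of the unchosen action, so acting on
# the wrong inferred reward can incur high regret.
for name, _, reward in decompositions:
    print(f"{name}: inferred reward of action 1 = {reward[1]}")

An IRL procedure that only observes the chosen behaviour cannot separate these pairs, which is why the authors argue that simple normative assumptions, not derivable from observation alone, are needed.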

Cited by 2 pages

Page | Type | Quality
Corrigibility Failure Pathways | Analysis | 62.0
AI Alignment | Approach | 91.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 84 KB
[1712.05812] Occam’s razor is insufficient to infer the preferences of irrational agents

 
 
 
Stuart Armstrong*
Future of Humanity Institute
University of Oxford
stuart.armstrong@philosophy.ox.ac.uk

Sören Mindermann*
Vector Institute
University of Toronto
soeren.mindermann@gmail.com

*Equal contribution. Further affiliation: Machine Intelligence Research Institute, Berkeley, USA. Work performed at Future of Humanity Institute.

 
 Abstract

 Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings.
However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention.
Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent’s policy in enough environments.
This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam’s razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret.
To address this, we need simple ‘normative’ assumptions, which cannot be deduced exclusively from observations.
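As a hedged formalisation of the two claims above (the notation is reconstructed from the abstract, not quoted from the paper): write the observed policy as the output of an unknown planner applied to an unknown reward,
\[
\pi_{\mathrm{obs}} \;=\; p(R), \qquad \text{unknowns: planner } p \text{ and reward } R .
\]
For a reward \(R_{\pi}\) constructed so that \(\pi_{\mathrm{obs}}\) is optimal for it, the pairs
\[
(p_{\mathrm{rational}},\, R_{\pi}), \qquad (p_{\mathrm{anti\text{-}rational}},\, -R_{\pi}), \qquad (p_{\mathrm{indifferent}},\, \mathbf{0})
\]
all reproduce \(\pi_{\mathrm{obs}}\); and if the inferred reward \(\hat{R}\) comes from the wrong decomposition, the regret against the true reward \(R^{*}\),
\[
\mathrm{Regret}(\hat{R}; R^{*}) \;=\; \max_{\pi} V^{\pi}_{R^{*}} \;-\; V^{\pi_{\hat{R}}}_{R^{*}},
\]
where \(\pi_{\hat{R}}\) is a policy optimising \(\hat{R}\) and \(V^{\pi}_{R^{*}}\) the expected return of \(\pi\) under \(R^{*}\), can remain large even under a simplicity prior over decompositions.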

 
 
 
 1 Introduction

 
 In today’s reinforcement learning systems, a simple reward function is often hand-crafted, and still sometimes leads to undesired behaviors on the part of the RL agent, as the reward function is not well aligned with the operator’s true goals.[1] As AI systems become more powerful and autonomous, these failures will become more frequent and grave as RL agents exceed human performance, operate at time-scales that forbid constant oversight, and are given increasingly complex tasks — from driving cars to planning cities to eventually evaluating policies or helping run companies. Ensuring that the agents behave in alignment with human values is known, appropriately, as the value alignment problem [Amodei et al., 2016, Hadfield-Menell et al., 2016, Russell et al., 2015, Bostrom, 2014, Leike et al., 2017].

[1] See for example the game CoastRunners, where an RL agent didn’t finish the course, but instead found a bug allowing it to get a high score by crashing round in circles: https://blog.openai.com/faulty-reward-functions/

 
 
 One way of resolving this problem is to infer the correct reward function by observing human behaviour.
This is known as Inverse reinforcement learning (IRL) [Ng and Russell, 2000, Abbeel and Ng, 2004, Ziebart et al., 2008]. Often, learning a reward function is preferred over imitating a policy: when the agent must outperform humans, transfer to new

... (truncated, 84 KB total)
Resource ID: 6b7fc3f234fa109c | Stable ID: sid_1l9RTD5yoV