Denison et al. (2024)
blog
Authors
Kei Nishimura-Gasparian·Isaac Dunn·Henry Sleight·Miles Turpin·evhub·Carson Denison·Ethan Perez
Credibility Rating
3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Alignment Forum
Data Status
Not fetched
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| The Case For AI Existential Risk | Argument | 66.0 |
| RLHF | Capability | 63.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 26, 2026 · 78 KB
[Reward hacking behavior can generalize across tasks](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#)
25 min read
• [Abstract](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Abstract)
• [Introduction](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Introduction)
• [How do we define “reward hacking”?](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#How_do_we_define__reward_hacking__)
• [Experimental Setup](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Experimental_Setup)
• [Settings](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Settings)
• [Hidden scratchpad](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Hidden_scratchpad)
• [Datasets](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Datasets)
• [Experimental Results](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Experimental_Results)
• [Organic generalization through expert iteration](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Organic_generalization_through_expert_iteration)
• [Reward hacking generalization across datasets using synthetic data](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Reward_hacking_generalization_across_datasets_using_synthetic_data)
• [Generalization from sycophancy to other reward hacks](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Generalization_from_sycophancy_to_other_reward_hacks)
• [Limitations](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Limitations)
• [Suggested Future Work](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Suggested_Future_Work)
• [Author Contributions](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Author_Contributions)
• [Acknowledgements](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Acknowledgements)
• [Appendix](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Appendix)
• [Dataset example prompts](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Dataset_example_prompts)
• [Dataset sources](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFw
... (truncated, 78 KB total)
Resource ID: 81e4c51313794a1b | Stable ID: OGQ4NzQwYm