Longterm Wiki
Authors

Kei Nishimura-Gasparian · Isaac Dunn · Henry Sleight · Miles Turpin · evhub · Carson Denison · Ethan Perez

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

Data Status

Not fetched

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| The Case For AI Existential Risk | Argument | 66.0 |
| RLHF | Capability | 63.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 78 KB
[Reward hacking behavior can generalize across tasks](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#)

25 min read

• [Abstract](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Abstract)
• [Introduction](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Introduction)
• [How do we define “reward hacking”?](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#How_do_we_define__reward_hacking__)
• [Experimental Setup](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Experimental_Setup)
• [Settings](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Settings)
• [Hidden scratchpad](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Hidden_scratchpad)
• [Datasets](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Datasets)
• [Experimental Results](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Experimental_Results)
• [Organic generalization through expert iteration](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Organic_generalization_through_expert_iteration)
• [Reward hacking generalization across datasets using synthetic data](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Reward_hacking_generalization_across_datasets_using_synthetic_data)
• [Generalization from sycophancy to other reward hacks](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Generalization_from_sycophancy_to_other_reward_hacks)
• [Limitations](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Limitations)
• [Suggested Future Work](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Suggested_Future_Work)
• [Author Contributions](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Author_Contributions)
• [Acknowledgements](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Acknowledgements)
• [Appendix](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Appendix)
• [Dataset example prompts](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks#Dataset_example_prompts)
• [Dataset sources](https://www.alignmentforum.org/posts/Ge55vxEmKXunFFw

... (truncated, 78 KB total)
Resource ID: 81e4c51313794a1b | Stable ID: OGQ4NzQwYm