can worsen with model size
openreview.net · openreview.net/pdf?id=bx24KpJ4Eb
This OpenReview paper examines the scaling behavior of alignment techniques, bearing on debates about whether larger models are automatically safer or whether alignment interventions such as RLHF become costlier or less effective at scale. The page was temporarily unavailable at the time of analysis.
Metadata
Importance: 55/100 · conference paper · primary source
Summary
This paper investigates how alignment techniques such as RLHF can exhibit scaling problems, in which safety-relevant behaviors or alignment costs worsen rather than improve as models grow larger. Since the full text was unavailable, the work likely examines the relationship between model scale and alignment properties.
Key Points
- Alignment properties and costs may not improve monotonically with model scale, and can degrade in larger models
- RLHF and other human-feedback-based training may introduce unexpected scaling challenges
- Larger models can score worse on certain alignment metrics despite improved general capabilities
- The results caution against assuming that scale automatically improves safety or alignment outcomes
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| RLHF | Research Area | 63.0 |
Cached Content Preview
HTTP 200 · Fetched Apr 9, 2026 · 0 KB
Wayback Machine capture (Save Page Now collection): http://web.archive.org/web/20260211011847/https://openreview.net/pdf?id=bx24KpJ4Eb
Resource ID: 7712afe39f75a44c | Stable ID: NzJhMjI2Mj