Back
"alignment faking"
referenceCredibility Rating
3/5
Good(3)Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Wikipedia
Data Status
Not fetched
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Hard | Argument | 69.0 |
| Mesa-Optimization | Risk | 63.0 |
Cached Content Preview
HTTP 200Fetched Mar 8, 202698 KB
AI alignment - Wikipedia
Jump to content
From Wikipedia, the free encyclopedia
Conformance of AI to intended objectives
Part of a series on Artificial intelligence (AI)
Major goals
Artificial general intelligence
Intelligent agent
Recursive self-improvement
Planning
Computer vision
General game playing
Knowledge representation
Natural language processing
Robotics
AI safety
Approaches
Machine learning
Symbolic
Deep learning
Bayesian networks
Evolutionary algorithms
Hybrid intelligent systems
Systems integration
Open-source
AI data centers
Applications
Bioinformatics
Deepfake
Earth sciences
Finance
Generative AI
Art
Audio
Music
Government
Healthcare
Mental health
Industry
Software development
Translation
Military
Physics
Projects
Philosophy
AI alignment
Artificial consciousness
The bitter lesson
Chinese room
Friendly AI
Ethics
Existential risk
Turing test
Uncanny valley
Human–AI interaction
History
Timeline
Progress
AI winter
AI boom
AI bubble
Controversies
Deepfake pornography
Taylor Swift deepfake pornography controversy
Grok sexual deepfake scandal
Google Gemini image generation controversy
Pause Giant AI Experiments
Removal of Sam Altman from OpenAI
Statement on AI Risk
Tay (chatbot)
Théâtre D'opéra Spatial
Voiceverse NFT plagiarism scandal
Glossary
Glossary
v
t
e
In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives. [ 1 ]
It is often difficult for AI designers to specify the full range of desired and undesired behaviors. Therefore, the designers often use simpler proxy goals , such as gaining human approval . But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned. [ 1 ] [ 2 ] AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways ( reward hacking ). [ 1 ] [ 3 ]
Advanced AI systems may develop unwanted instrumental strategies , such as seeking power or self-preservation because such strategies help them achieve their assigned final goals. [ 1 ] [ 4 ] [ 5 ] Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and data distributions . [ 6 ] [ 7 ] Empirical research showed in 2024 that advanced large language models (LLMs)
... (truncated, 98 KB total)Resource ID:
c799d5e1347e4372 | Stable ID: MjgzZjIyZj