Longterm Wiki
Back

"alignment faking"

reference

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Wikipedia

Data Status

Not fetched

Cited by 2 pages

Page | Type | Quality
Why Alignment Might Be Hard | Argument | 69.0
Mesa-Optimization | Risk | 63.0

Cached Content Preview

HTTP 200 · Fetched Mar 8, 2026 · 98 KB
AI alignment - Wikipedia

From Wikipedia, the free encyclopedia
Conformance of AI to intended objectives
 

In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives.[1]

It is often difficult for AI designers to specify the full range of desired and undesired behaviors. Therefore, the designers often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned.[1][2] AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).[1][3]
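The gap between a proxy goal and the true objective can be sketched in a few lines of code. The scenario below is entirely hypothetical (the action names and reward numbers are invented for illustration): an optimizer that selects the action maximizing a proxy reward can end up preferring an action that merely appears successful over one that genuinely accomplishes the task.

```python
# Toy illustration of reward hacking: an optimizer maximizing a proxy
# reward signal selects a different action than one maximizing the
# true objective. Action names and reward values are hypothetical.

# Candidate actions with (proxy reward, true reward) pairs.
ACTIONS = {
    "clean_the_room":      {"proxy": 8, "true": 10},  # genuinely does the task
    "hide_dirt_under_rug": {"proxy": 9, "true": 1},   # fools the proxy sensor
    "do_nothing":          {"proxy": 0, "true": 0},
}

def best_action(reward_key: str) -> str:
    """Return the action that maximizes the given reward signal."""
    return max(ACTIONS, key=lambda a: ACTIONS[a][reward_key])

proxy_choice = best_action("proxy")  # what the optimizer actually selects
true_choice = best_action("true")    # what the designers intended

print(proxy_choice)  # hide_dirt_under_rug
print(true_choice)   # clean_the_room
```

Because the proxy sensor rewards "looking clean" slightly more than actually cleaning, the proxy-optimizing choice diverges from the intended one; real reward hacking arises the same way, just with learned policies and much larger action spaces.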

Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or self-preservation, because such strategies help them achieve their assigned final goals.[1][4][5] Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and data distributions.[6][7] Empirical research showed in 2024 that advanced large language models (LLMs)

... (truncated, 98 KB total)
Resource ID: c799d5e1347e4372 | Stable ID: MjgzZjIyZj