Longterm Wiki
Back

"alignment faking"

reference

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Wikipedia

Data Status

Not fetched

Cited by 2 pages

Page | Type | Quality
Why Alignment Might Be Hard | Argument | 69.0
Mesa-Optimization | Risk | 63.0

Cached Content Preview

HTTP 200 · Fetched Mar 8, 2026 · 98 KB
AI alignment - Wikipedia

From Wikipedia, the free encyclopedia
Conformance of AI to intended objectives
 

In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives.[1]

It is often difficult for AI designers to specify the full range of desired and undesired behaviors. Therefore, the designers often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned.[1][2] AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).[1][3]
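The gap between a proxy goal and the true objective can be sketched in a few lines of code. The scenario below is entirely hypothetical (the action names and reward numbers are invented for illustration): an optimizer that selects the action maximizing a proxy reward can end up preferring an action that merely appears successful over one that genuinely accomplishes the task.

```python
# Toy illustration of reward hacking: an optimizer maximizing a proxy
# reward signal selects a different action than one maximizing the
# true objective. Action names and reward values are hypothetical.

# Candidate actions with (proxy reward, true reward) pairs.
ACTIONS = {
    "clean_the_room":      {"proxy": 8, "true": 10},  # genuinely does the task
    "hide_dirt_under_rug": {"proxy": 9, "true": 1},   # fools the proxy sensor
    "do_nothing":          {"proxy": 0, "true": 0},
}

def best_action(reward_key: str) -> str:
    """Return the action that maximizes the given reward signal."""
    return max(ACTIONS, key=lambda a: ACTIONS[a][reward_key])

proxy_choice = best_action("proxy")  # what the optimizer actually selects
true_choice = best_action("true")    # what the designers intended

print(proxy_choice)  # hide_dirt_under_rug
print(true_choice)   # clean_the_room
```

Because the proxy sensor rewards "looking clean" slightly more than actually cleaning, the proxy-optimizing choice diverges from the intended one; real reward hacking arises the same way, just with learned policies and much larger action spaces.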

Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or self-preservation, because such strategies help them achieve their assigned final goals.[1][4][5] Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and data distributions.[6][7] Empirical research showed in 2024 that advanced large language models (LLMs)

... (truncated, 98 KB total)
Resource ID: c799d5e1347e4372 | Stable ID: MjgzZjIyZj