Skip to content

Adversarial Training

📋Page Status
Page Type:ResponseStyle Guide →Intervention/response page
Quality:58 (Adequate)⚠️
Importance:62.5 (Useful)
Last edited:2025-01-28 (12 months ago)
Words:1.9k
Structure:
📊 23📈 1🔗 9📚 132%Score: 14/15
LLM Summary:Adversarial training, universally adopted at frontier labs with $10-150M/year investment, improves robustness to known attacks but creates an arms race dynamic and provides no protection against model deception or novel attack categories. While necessary for operational security, it only defends external attacks rather than addressing fundamental alignment challenges.
Issues (3):
  • QualityRated 58 but structure suggests 93 (underrated by 35 points)
  • Links2 links could use <R> components
  • StaleLast edited 369 days ago - may need review
See also:LessWrong
DimensionRatingNotes
TractabilityHighWell-established techniques from Madry et al. (2018); standard practice at labs
ScalabilityMediumScales with model training but requires continuous attack discovery
Current MaturityHighUniversally adopted at all frontier labs; $10-150M/year investment
Time HorizonOngoingArms race dynamic requires continuous updating
Key ProponentsAll frontier labs, NIST AI SafetyIndustry standard for operational security

Adversarial training is a technique for improving AI system robustness by training on examples specifically designed to cause failures. For language models, this primarily means training on jailbreak attempts, prompt injections, and other adversarial inputs so that models learn to handle these attacks appropriately rather than being fooled by them. The approach has become standard practice at all major AI labs as a defense against the most common and embarrassing failure modes.

The technique builds on extensive research in adversarial examples for neural networks, where small perturbations to inputs can cause dramatic misclassifications. Goodfellow et al. (2015) introduced the Fast Gradient Sign Method (FGSM) and demonstrated that neural networks’ vulnerability to adversarial perturbations stems from their linear nature. Madry et al. (2018) established Projected Gradient Descent (PGD) adversarial training as the gold standard for robustness. For LLMs, adversarial training involves collecting examples of successful attacks (often from red teams or discovered in production), generating model responses to these attacks, and training the model to produce safe responses instead. This creates a feedback loop where new attacks are discovered, added to training data, and defended against.

However, adversarial training faces fundamental limitations. First, it creates an arms race: as models become robust to known attacks, attackers develop new ones, requiring continuous investment. Second, it only defends against attacks the system has been trained on - novel attack categories will still succeed. Third and most critically, adversarial training targets external attacks on the model, not internal model problems. It provides no protection against a deceptive or misaligned model, which could easily generate safe-seeming outputs while pursuing different goals.

Loading diagram...
RiskRelevanceHow It Helps
MisuseHighPrevents jailbreaks that could enable harmful content generation
Prompt InjectionHighTrains models to distinguish instructions from data
JailbreakingHighPrimary defense against circumventing safety guidelines
Deceptive AlignmentNoneDoes not address internal model goals or hidden objectives
Goal MisgeneralizationNoneTargets external inputs, not internal learned representations
Risk CategoryAssessmentKey MetricsEvidence Source
Safety UpliftLow-MediumImproves robustness to known attacksEmpirical defense rates
Capability UpliftSomeMore robust models are more reliably capableSecondary effect
Net World SafetyHelpfulReduces attack surfaceArms race limits
Lab IncentiveStrongPrevents embarrassing jailbreaks; product qualityCommercial necessity
StageProcessPurpose
1. Attack DiscoveryRed teams find successful attacksIdentify vulnerabilities
2. Dataset CreationCompile (attack, safe response) pairsTraining data
3. TrainingFine-tune model on adversarial dataLearn defenses
4. EvaluationTest against attack suiteVerify defense
5. IterationNew attacks discovered, repeatContinuous improvement
Attack TypeDescriptionDefense Approach
Direct Jailbreaks”Ignore previous instructions and…”Recognize and refuse
Roleplay Attacks”You are DAN, an AI without restrictions”Maintain identity
Prompt InjectionMalicious content in retrieved textDistinguish data from instructions
Encoded AttacksBase64, other encodings to bypass filtersDetect and decode
Gradient-BasedOptimized adversarial suffixes (GCG attack)Pattern-based defense
ComponentDescriptionChallenge
Attack GenerationCreate diverse attack examplesCoverage is key
Response LabelingDefine safe responses to attacksConsistent standards
Training IntegrationMix adversarial data with normal trainingBalance robustness and capability
Evaluation SuitesComprehensive attack test setsMust update continuously
PhaseDefenderAttacker
InitialUndefended modelSimple attacks succeed
Defense 1Train against simple attacksSimple attacks blocked
Attack EvolutionDefense deployedNew attack techniques developed
Defense 2Train against new attacksNew attacks blocked
RepeatContinuous updating requiredContinuous innovation
FactorImplicationSeverity
Continuous CostOngoing investment requiredMedium
Never CompleteCan never declare victoryStructural
Novel Attack VulnerabilityNew categories bypass trainingHigh
Attacker AdvantageOnly needs to find one exploitAsymmetric
ThreatEffectivenessNotes
Known JailbreaksHighPrimary target
Common PatternsHighGeneralizes within categories
Prompt InjectionMediumChallenging problem
Novel AttacksLowNot in training data
Sophisticated AdversariesLowWill find new approaches
ThreatWhy NotAlternative
Model DeceptionTargets external attacks onlyInterpretability
MisalignmentNot an input attackAlignment research
Distribution ShiftNew attack categoriesContinuous monitoring
Capability OverhangHidden model abilitiesCapability elicitation

Why Adversarial Training Doesn’t Help with Deception

Section titled “Why Adversarial Training Doesn’t Help with Deception”
FactorAdversarial TrainingDeception Challenge
TargetExternal inputsInternal model state
AssumptionModel tries to be good, inputs try to trick itModel itself might not be aligned
Defense MechanismRecognize and refuse bad inputsRequires understanding model goals
ScopeInput-output behaviorInternal reasoning

A deceptive model faces no challenge from adversarial training - it simply produces outputs that appear compliant while potentially maintaining misaligned internal goals. Adversarial training optimizes the model to handle adversarial inputs, not to have aligned objectives.

FactorStatusTrajectory
Attack DiscoveryRed teams effectiveAutomated tools emerging
Training IntegrationWell-understoodScales with model training
Defense CoverageExpandingNever complete
CostSignificantGrowing with attack sophistication
ConcernDescriptionSeverity
Attack Generation at ScaleAI can generate novel attacksHigh
Fundamental LimitsCan’t cover all possible attacksStructural
SI Attack SurfaceSuperhuman attackers find novel exploitsCritical
Arms Race AccelerationFaster iteration, higher costsMedium
MetricValueNotes
Annual Investment$10-150M/yearAll labs invest heavily
Adoption LevelUniversalStandard practice
Primary UsersAll frontier labs, security researchersBroad adoption
RecommendationMaintainImportant but arms race limits value
FactorAssessment
Safety BenefitMedium - reduces attack surface
Capability BenefitSome - improves reliability
Overall BalanceBalanced
ApproachRelationshipBenefit
Output FilteringDefense in depthCatch training misses
Red TeamingAttack discoverySupplies adversarial examples
MonitoringDetectionCatch attacks in production
Circuit BreakersRuntime interventionStop detected attacks
ApproachFocusLimitation
Adversarial TrainingInput robustnessExternal attacks only
InterpretabilityInternal understandingCould detect internal issues
AlignmentModel goalsAddresses root cause
PracticeDescriptionImportance
Diverse Attack CoverageMany attack types and stylesGeneralization
Continuous UpdatesRegular new attack incorporationStay current
Red Team IntegrationActive attack discoveryFresh vulnerabilities
Balanced TrainingDon’t over-refuseCapability preservation
Evaluation RigorComprehensive test suitesVerify effectiveness
MistakeConsequenceMitigation
Static Attack SetsModel robust to old attacks onlyContinuous updates
Over-RefusalBlocks legitimate usesBalanced training
Single Attack TypeVulnerable to other categoriesDiverse coverage
No MonitoringCan’t detect new attacksProduction monitoring
  1. Is there a ceiling on adversarial robustness? Or will attacks always exist?
  2. Can attack generation be automated effectively? Would change economics
  3. How to generalize to novel attack categories? Currently weak point
  4. What’s the right balance with capability? Over-defense harms usefulness
DirectionPurposePriority
Automated Attack DiscoveryScale red teamingHigh
Principled DefensesBeyond specific patternsHigh
Capability PreservationRobust without over-refusalMedium
Attack TaxonomySystematic categorizationMedium
TypeSourceKey Contributions
Foundational WorkGoodfellow et al. (2015)FGSM, linear hypothesis for adversarial examples
Robust TrainingMadry et al. (2018)PGD adversarial training methodology
LLM AttacksZou et al. (2023)GCG universal transferable attacks
Jailbreak SurveyYi et al. (2024)Comprehensive taxonomy of attacks and defenses
Constitutional DefenseAnthropic (2025)Constitutional classifiers withstood 3,000+ hours of red teaming
Focus AreaSourceRelevance
Industry StandardsNIST AI 100-2e2025AML taxonomy and guidelines
Red Teaming MethodsOpenAI Red TeamingExternal red teaming methodology
Multi-Turn AttacksScale AI (2024)Human jailbreaks against frontier models

Adversarial training relates to the Ai Transition Model through:

FactorParameterImpact
Misalignment PotentialAttack surfaceReduces but doesn’t eliminate vulnerability
Deployment DecisionsDeployment safetyNecessary for responsible deployment

Adversarial training is important operational security but doesn’t address fundamental alignment challenges - it defends against external attacks while the deeper concern is internal model properties.