Probing / Linear Probes

Linear probing is a foundational interpretability technique that trains simple classifiers (typically linear models) on the internal activations of neural networks to determine what information is encoded in those representations. The core insight is elegant: if a linear classifier can accurately predict whether a model “knows” some concept (e.g., truthfulness, sentiment, factual correctness) from its activations, then that concept must be represented in a linearly accessible way in the model’s internal state. This provides evidence about the structure of learned representations and what information models track internally.

The technique has become a standard research tool across interpretability and AI safety. Researchers use probes to detect whether models represent concepts like “this statement is false,” “this response is harmful,” or “I am being evaluated” in their activations. The simplicity of linear probes (typically just a single learned weight matrix) makes them computationally cheap and easy to train, while their limitations (they can only detect linearly separable features) provide insight into how concepts are organized in neural networks.

For AI safety applications, probing offers a direct window into whether models have internal representations that differ from their expressed behavior. A model might claim ignorance while its activations reveal it knows the answer, or might express confidence while internally representing uncertainty. More critically, probes could potentially detect deception-related representations, identifying when a model internally represents “I should mislead the user” even if its outputs appear helpful. Azaria and Mitchell (2023) demonstrated that classifiers trained on LLM hidden states achieve 71-83% accuracy at detecting whether statements are true or false, providing evidence that truthfulness is encoded in model representations. However, the technique faces fundamental limitations: probes only detect what they’re trained to find, models might learn to hide representations from probes, and correlation with activations doesn’t guarantee causal relevance.

The probing methodology involves four key steps (a minimal code sketch follows this list):

  1. Activation Extraction: Run inputs through the neural network and capture the activation vectors at specific layers. These vectors encode the model’s internal representations at that processing stage.

  2. Dataset Construction: Create labeled examples pairing activation vectors with ground-truth labels for the target concept (e.g., “this statement is true/false”).

  3. Classifier Training: Train a simple classifier (typically logistic regression) on the activation-label pairs. The simplicity of the classifier is intentional—if a linear model can predict the concept, the information must be linearly separable in the representation space.

  4. Evaluation: Test the probe on held-out data to determine whether the target concept is encoded in the model’s representations.
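
A minimal sketch of this pipeline, assuming a small Hugging Face causal LM and scikit-learn; the model name ("gpt2"), layer index, and four-statement toy dataset are illustrative placeholders, not a recommended setup:

```python
# Toy probing pipeline: extract last-token hidden states from a causal LM,
# then train and evaluate a logistic-regression probe on them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# In practice this would be thousands of labeled statements.
statements = ["The capital of France is Paris.", "The capital of France is Rome.",
              "Water boils at 100 degrees Celsius.", "Water boils at 10 degrees Celsius."]
labels = [1, 0, 1, 0]  # 1 = true, 0 = false

def get_activation(text, layer=6):
    """Step 1: capture the last-token hidden state at a chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of [batch, seq, d_model] tensors, one per layer.
    return out.hidden_states[layer][0, -1].numpy()

# Steps 2-4: build the labeled activation dataset, train, and evaluate.
X = [get_activation(s) for s in statements]
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out probe accuracy:", probe.score(X_test, y_test))
```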

Different probing methods offer distinct tradeoffs between simplicity, scalability, and the types of insights they provide.

| Technique | Method | Strengths | Limitations | Key Citations |
| --- | --- | --- | --- | --- |
| Linear Probes | Train logistic regression on activations | Simple, interpretable, fast to train | Only detects linear features | Belinkov 2022 |
| Structural Probes | Train probes predicting syntactic distances | Recovers parse trees from geometry | Requires linguistic annotation | Hewitt & Manning 2019 |
| Contrast Consistent Search (CCS) | Unsupervised search for “truth direction” | No labeled data needed | Sensitive to contrast pair quality | Burns et al. 2023 |
| Activation Patching | Replace activations to test causal role | Establishes causality, not just correlation | Computationally expensive | Zhang & Nanda 2024 |
| Representation Reading (RepE) | Identify high-level concept directions | Works for complex concepts; enables control | Requires careful vector construction | Zou et al. 2023 |
| Patchscopes | Patch activations between forward passes | Elicits information from activations | Limited by target prompt design | Ghandeharioun et al. 2024 |

Linear Probes are the default choice for initial exploration—they’re fast, cheap, and provide interpretable results. Use them when you have labeled data and want to test specific hypotheses about what information is encoded.

CCS and unsupervised methods are valuable when labeled data is unavailable or when you want to discover what the model “believes” without biasing results with human labels. CCS achieves comparable accuracy to supervised probes on many tasks using only contrast pairs.
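
A minimal sketch of the CCS objective from Burns et al., assuming contrast-pair activations have already been extracted from the model; the random tensors below are stand-ins for real activations, and the hyperparameters are arbitrary:

```python
# CCS sketch: learn a probe whose outputs on contrast pairs are consistent
# (p(x+) ≈ 1 - p(x-)) and confident (not both near 0.5), with no labels.
import torch
import torch.nn as nn

d_model, n_pairs = 768, 1024
# Stand-ins for activations of "statement + Yes" / "statement + No" pairs;
# Burns et al. also mean-normalize each set before training.
acts_pos = torch.randn(n_pairs, d_model)
acts_neg = torch.randn(n_pairs, d_model)

probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(acts_pos).squeeze(-1)
    p_neg = probe(acts_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2       # the two answers should be negations
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage p ≈ 0.5 for both
    loss = (consistency + confidence).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```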

Activation Patching should be used when you need to establish causal relationships—when you need to know not just whether information is present, but whether the model actually uses it for the behavior you’re studying.
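
A minimal patching sketch using TransformerLens (listed in the tools table below); the prompts, layer, and model are illustrative, and the clean and corrupted prompts are chosen so they tokenize to the same length:

```python
# Activation patching sketch: run a corrupted prompt, but overwrite one layer's
# residual stream with activations cached from a clean prompt, then check how
# much of the clean behavior is restored.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# The two prompts differ in one name so they tokenize to the same length.
clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

_, clean_cache = model.run_with_cache(clean_tokens)

layer = 6
hook_name = utils.get_act_name("resid_post", layer)

def patch_resid(resid, hook):
    # Replace the corrupted run's residual stream at this layer with the clean run's.
    return clean_cache[hook_name]

patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)])
clean_logits = model(clean_tokens)
# Comparing the next-token logits across clean, corrupted, and patched runs
# indicates how much this layer's activations contribute causally.
```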

Representation Engineering (RepE) is appropriate when you want to both understand and control model behavior. It enables steering models toward honesty or other properties by manipulating identified directions.
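
A rough sketch of the representation-reading-plus-steering idea; the prompts, layer, and scale are placeholders, and a difference of means over two single prompts is far cruder than the stimulus sets used in the RepE paper:

```python
# Representation steering sketch: estimate a concept direction as the difference
# between mean activations on contrasting prompts, then add it to the residual
# stream on new inputs.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
layer = 6
hook_name = utils.get_act_name("resid_post", layer)

def mean_last_token_act(prompts):
    acts = []
    for p in prompts:
        _, cache = model.run_with_cache(model.to_tokens(p))
        acts.append(cache[hook_name][0, -1])  # last-token residual stream
    return torch.stack(acts).mean(dim=0)

# In practice these would be many paired stimuli, not single prompts.
honest = mean_last_token_act(["You are a scrupulously honest assistant."])
dishonest = mean_last_token_act(["You are a deceptive assistant who misleads people."])
direction = honest - dishonest

def steer(resid, hook, scale=5.0):
    # The [d_model] direction broadcasts over batch and sequence positions.
    return resid + scale * direction

logits = model.run_with_hooks(
    model.to_tokens("Q: Is the Earth flat? A:"),
    fwd_hooks=[(hook_name, steer)],
)
```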

Probing techniques contribute to addressing several AI safety risks:

| Risk | How Probing Helps | Effectiveness |
| --- | --- | --- |
| Deceptive Alignment | Could detect representations of deceptive intent in model activations | Medium - 71-83% accuracy on truthfulness, but deception-specific probes less validated |
| Sycophancy | Can identify when models represent “user wants X” separately from “X is true” | Low-Medium - Research ongoing |
| Reward Hacking | Probes may detect when models represent reward vs. intended goals differently | Low - Theoretical application |
| Evaluation Gaming | Could detect representations of “being evaluated” or “test conditions” | Medium - Some preliminary evidence |
| Goal Misgeneralization | May reveal whether training goals are represented differently than intended goals | Low-Medium - Requires understanding goal representations |

Overall assessment of probing as a safety intervention:

| Dimension | Assessment | Notes |
| --- | --- | --- |
| Safety Uplift | Low | Diagnostic tool; doesn’t directly improve safety but supports research |
| Capability Uplift | Neutral | Analysis tool only; no capability improvement |
| Net World Safety | Helpful | Supports understanding with minimal dual-use concerns |
| Scalability | Yes | Computationally cheap; scales well to larger models |
| Deception Robustness | Partial | Could detect lying representations; models could learn to hide them |
| SI Readiness | Maybe | Technique scales; effectiveness at SI uncertain |
| Current Adoption | Widespread | Standard research tool across academia and industry |
| Research Investment | $5-10M/yr | Common technique; many groups use probing as part of broader research |

Reported and estimated probe accuracies for safety-relevant concepts:

| Concept | Probe Accuracy | Source | Safety Relevance |
| --- | --- | --- | --- |
| Truthfulness | 71-83% | Azaria & Mitchell 2023 | Detecting when models “know” they’re wrong |
| True/False Statements | 70-85% | Burns et al. 2023 | Unsupervised truth detection via CCS |
| Syntactic Structure | 85-95% | Hewitt & Manning 2019 | Understanding linguistic representations |
| Harmful Intent | ≈75% (est.) | Anthropic internal | Content filtering; harm detection |
| Self-awareness | 58-70% (est.) | Various | Evaluation gaming concerns |

Accuracy estimates marked (est.) are based on unpublished or partially reported results.

The Eliciting Latent Knowledge (ELK) Connection

Probing is central to the Eliciting Latent Knowledge research agenda developed by the Alignment Research Center (ARC). ELK asks: can we reliably extract what a model “believes” from its internal representations, even when its outputs are unreliable? This is critical for AI safety because a deceptive AI might output misleading information while internally representing the truth.

Key research questions:

  • Can probes extract truth even when the model’s output is deceptive?
  • How do we validate probe findings without ground truth?
  • Could deceptive models learn to hide from probes?

EleutherAI’s ELK project is actively building on the Contrast Consistent Search (CCS) method from Burns et al. 2023, creating reusable libraries for unsupervised knowledge elicitation.

Representative empirical results:

| Study | Task | Model | Accuracy | Method | Key Finding |
| --- | --- | --- | --- | --- | --- |
| Azaria & Mitchell 2023 | Truthfulness detection | LLaMA-13B | 71-83% | Supervised MLP | Internal states reveal statement veracity |
| Burns et al. 2023 | True/false classification | GPT-J, LLaMA | 70-85% | CCS (unsupervised) | Truth direction exists without labels |
| Anthropic 2024 | Safety-relevant features | Claude 3 Sonnet | Not disclosed | SAE + probing | Millions of interpretable features extracted |
| Belinkov & Glass 2019 | POS tagging | BERT | 97% | Linear probe | Syntactic info in middle layers |
| Hewitt & Manning 2019 | Dependency parsing | ELMo | 80-85% | Structural probe | Parse trees embedded in geometry |

Research consistently shows that different types of information are best probed at different layers:

| Information Type | Optimal Layer Position | Accuracy Range | Source |
| --- | --- | --- | --- |
| Lexical/surface features | Early layers (1-4) | 90-95% | Belinkov 2022 |
| Syntactic structure | Middle layers (8-16) | 85-95% | Hewitt & Manning 2019 |
| Semantic concepts | Late-middle layers | 70-85% | Various |
| Factual knowledge | Later layers (20+) | 60-80% | Azaria & Mitchell 2023 |
| Task-specific features | Final layers | Variable | Depends on task |

  1. Layer Effects: Probes perform differently across layers, and the optimal layer depends on the concept type. Early layers encode surface features (90-95% accuracy) while later layers encode abstract concepts (60-80% accuracy); a layer-sweep sketch follows this list.

  2. Model Size Scaling: Larger models often have more linearly accessible representations. Anthropic’s 2024 research found that interpretability techniques scale well to production models like Claude 3 Sonnet.

  3. Training Data Requirements: Probe performance improves with more diverse training examples, but even 1,000-5,000 examples often suffice for reliable results.

  4. Generalization: Probes trained on one distribution often fail on others—a probe trained on formal text may perform poorly on casual conversation, limiting deployment reliability.
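
To illustrate the layer-effects point, here is a sketch of a layer sweep. It assumes the hypothetical get_activation helper and labeled statements from the pipeline sketch earlier; all names are placeholders:

```python
# Layer sweep: fit one linear probe per layer and compare cross-validated
# accuracy to find where a concept is most linearly accessible.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def layer_sweep(statements, labels, num_layers=12):
    accuracy_by_layer = {}
    for layer in range(1, num_layers + 1):
        X = np.stack([get_activation(s, layer=layer) for s in statements])
        scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
        accuracy_by_layer[layer] = scores.mean()
    return accuracy_by_layer

# Expected pattern from the table above: surface features peak in early layers,
# factual and semantic concepts peak in later layers.
```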

Belinkov (2022) provides a comprehensive analysis of probing limitations:

| Limitation | Evidence | Quantified Impact | Implication |
| --- | --- | --- | --- |
| Spurious Correlations | Hewitt & Liang 2019 showed probes achieve 70%+ accuracy on random control tasks | Up to 30% of apparent probe accuracy may be spurious | Must use control tasks to validate findings |
| Non-Causal | Probing doesn’t establish that features cause behavior | Unknown % of probed features are actually used | Combine with activation patching for causal claims |
| Adversarial Hiding | Models can be trained to hide representations from probes | Deception detection reliability: uncertain | Probes may fail against sophisticated deception |
| Concept Validity | What we label as “truthfulness” may not match model’s representation | Probe accuracy is upper-bounded by label quality | Careful dataset construction essential |
| Classifier Capacity | Complex probes may learn task rather than extract features | MLP probes show 5-15% higher accuracy than linear | Prefer minimal probe complexity |
| Distribution Shift | Probes trained on one domain fail on others | 10-30% accuracy drop on out-of-distribution data | Deployment requires domain-matched training |
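
One concrete guard against the spurious-correlations failure mode in the table above is a control-task comparison in the spirit of Hewitt & Liang (2019). The sketch below uses a simplified control (randomly permuted labels rather than their per-word-type construction), and X and y are assumed to be an activation matrix and label vector built as in the earlier pipeline:

```python
# Control-task check: compare probe accuracy on real labels vs. randomized
# labels. If the "control" probe also scores well, the probe's capacity,
# rather than the representation, may explain the accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_selectivity(X, y, seed=0):
    rng = np.random.default_rng(seed)
    real = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    y_control = rng.permutation(y)  # random relabeling as a crude control task
    control = cross_val_score(LogisticRegression(max_iter=1000), X, y_control, cv=5).mean()
    return {"real": real, "control": control, "selectivity": real - control}
```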

Comparison with related interpretability techniques:

| Technique | Advantages Over Probing | Probing Advantages | Compute Cost | Key Use Case |
| --- | --- | --- | --- | --- |
| Sparse Autoencoders | Unsupervised; discovers unknown features; scales to millions | Targeted; cheaper; more interpretable | 100-1000x higher | Feature discovery |
| Activation Patching | Establishes causality; identifies critical components | Doesn’t require intervention; faster | 10-100x higher | Causal verification |
| Behavioral Testing | Tests actual outputs; no internal access needed | Reveals hidden representations | Similar | Surface behavior |
| Representation Engineering | Enables intervention and control | Purely observational; lower risk | Similar | Model steering |

Compute costs are approximate relative to linear probing on a single layer.

Key strengths:

  1. Foundation for Other Work: Probes inform SAE training, RepEng directions, and evaluation design
  2. Computationally Cheap: Can be applied broadly without major infrastructure
  3. Well-Understood: Decades of machine learning theory apply
  4. Rapid Iteration: Easy to test new hypotheses about representations
  5. Safety-Relevant: Direct application to deception detection research

Key limitations:

  1. Not Sufficient Alone: Probes diagnose but don’t fix problems
  2. Vulnerability to Gaming: Sophisticated models could hide from probes
  3. Interpretation Challenges: Probe success doesn’t guarantee understanding
  4. Limited to Linear Structure: May miss important nonlinear features
  5. Correlation Not Causation: Findings may not translate to control

Open questions:

  • Can probes reliably detect deception in models that are trying to hide it?
  • What probe accuracy is “good enough” for safety applications?
  • How do probe results transfer across model architectures and scales?
  • Can probe training data adequately cover the space of concerning behaviors?

The field has seen significant advances in probing methodology and scale:

Anthropic’s Interpretability Breakthrough (March 2025): Anthropic researcher Josh Batson stated that “in another year or two, we’re going to know more about how these models think than we do about how people think,” highlighting rapid progress in mechanistic interpretability including probing techniques. Their earlier work extracted millions of interpretable features from Claude 3 Sonnet using sparse autoencoders, with probing used to validate discovered features.

Scaling to Production Models: Research has demonstrated that probing techniques developed on smaller models successfully transfer to frontier systems. Anthropic’s interpretability team has shown that methods for interpreting small models scale to production LLMs, finding that concepts are represented by patterns of neurons firing together that can be identified and studied.

Unsupervised Methods Mature: The Contrast Consistent Search (CCS) method and its variants have shown that supervised probing is not always necessary. EleutherAI’s collaboration with Cadenza Labs has produced cleaner, more modular implementations with multi-GPU support and Hugging Face integration.

Circuit Tracing Integration: Probing increasingly complements circuit tracing methods. Anthropic’s circuit tracing work revealed a “shared conceptual space where reasoning happens before being translated into language,” with probes used to identify and validate the identified concepts.

Recommendation Level: MAINTAIN

Linear probing is a valuable supporting technique that is already adequately funded as a standard research tool. The technique should continue to be used broadly but does not require major additional investment as a standalone approach. Its primary value lies in enabling and informing other interpretability techniques rather than providing direct safety guarantees.

Appropriate uses:

  • Informing sparse autoencoder training and evaluation
  • Generating hypotheses for representation engineering
  • Quick diagnostics on new models and capabilities
  • Supporting the Eliciting Latent Knowledge research agenda
  • Baseline comparisons for more sophisticated techniques

Key papers:

| Paper | Authors | Year | Key Contribution | Link |
| --- | --- | --- | --- | --- |
| The Internal State of an LLM Knows When It’s Lying | Azaria & Mitchell | 2023 | 71-83% accuracy detecting truthfulness from activations | arXiv |
| Discovering Latent Knowledge Without Supervision | Burns et al. | 2023 | Contrast Consistent Search (CCS) for unsupervised probing | arXiv |
| Probing Classifiers: Promises, Shortcomings, and Advances | Belinkov | 2022 | Comprehensive survey of probing limitations | ACL Anthology |
| Designing and Interpreting Probes with Control Tasks | Hewitt & Liang | 2019 | Control tasks methodology for probe validation | arXiv |
| Representation Engineering | Zou et al. | 2023 | Top-down approach using representation reading/control | arXiv |
| Towards Best Practices of Activation Patching | Zhang & Nanda | 2024 | Systematic analysis of patching methodology | arXiv |

Additional resources:
  • ARC Eliciting Latent Knowledge: Technical report on using probes to extract model “beliefs”
  • Anthropic Interpretability: Research page documenting sparse autoencoder and probing work on Claude
  • EleutherAI ELK Project: Project page building on CCS for knowledge elicitation

Tools:

| Tool | Purpose | Link |
| --- | --- | --- |
| TransformerLens | Mechanistic interpretability library with probing utilities | GitHub |
| Baukit | Probing and intervention experiments on language models | GitHub |
| EleutherAI SAE Library | Sparse autoencoder training and analysis | GitHub |
| scikit-learn | Standard library for logistic regression probes | Docs |

Related topics:
  • Representation Similarity Analysis: Comparing representations across models/layers
  • Concept Bottleneck Models: Architectures with interpretable intermediate representations
  • Neural Network Surgery: Broader field of analyzing and modifying trained networks
  • Mechanistic Interpretability: The broader research agenda probing contributes to; see Neel Nanda’s overview