CAIS (Center for AI Safety)
CAIS is a nonprofit research organization founded by Dan Hendrycks that has distributed compute grants to researchers, published technical AI safety papers including the representation engineering and MACHIAVELLI benchmark papers, and organized the May 2023 Statement on AI Risk signed by over 350 AI researchers and industry leaders. The organization focuses on technical safety research, field-building, and policy communication.
Overview
The Center for AI Safety (CAIS) is a nonprofit research organization that works to reduce societal-scale risks from artificial intelligence through technical research, field-building initiatives, and public communication. Founded by Dan Hendrycks, CAIS received substantial public attention in May 2023 when it organized a one-sentence statement on AI extinction risk that attracted signatures from over 350 AI researchers and industry figures, including several Turing Award recipients and heads of major AI laboratories.
CAIS operates across three areas: technical research on AI alignment and robustness, grant and fellowship programs intended to grow the AI safety research community, and communication efforts aimed at policymakers and the public. Its technical output includes work on Representation Engineering and the MACHIAVELLI benchmark for evaluating goal-directed behavior in AI systems. The organization is primarily funded by Coefficient Giving, an EA-aligned philanthropic fund, a funding relationship that is relevant context for assessing its research priorities and institutional positioning.
CAIS occupies a distinct niche in the AI safety ecosystem: unlike academic centers such as CHAI or research-focused organizations like MIRI, it combines original technical research with explicit field-building and public communication goals. Critics have questioned whether its emphasis on long-run extinction risk is appropriately calibrated relative to near-term AI harms, and whether EA-concentrated funding in this space creates ideological homogeneity in safety research priorities. These debates are discussed in the Critiques and Limitations section below.
Organizational Background
CAIS was established as a nonprofit research organization with the goal of filling a perceived gap between technical AI safety research and broader scientific and public awareness of AI risks. Dan Hendrycks, who completed his PhD at UC Berkeley, founded CAIS to provide infrastructure — compute grants, fellowships, educational resources, and policy engagement — that individual academic researchers lacked access to.
The organization's theory of change rests on several linked assumptions: that AI systems pose meaningful risks of societal-scale harm, including possible catastrophic outcomes; that the current period is important for establishing safety-relevant research norms and technical methods; and that field-building activities (funding researchers, running educational programs, facilitating policy engagement) will increase the probability of good outcomes by growing and coordinating the safety research community. Whether these assumptions are well-founded is contested, and the organization's critics have argued that the extinction-risk framing in particular overstates speculative long-run risks relative to observable near-term harms.
CAIS is legally structured as a nonprofit. Its primary disclosed funder is Coefficient Giving, which has made grants to CAIS as part of its AI safety grantmaking portfolio. Exact annual budget figures are not publicly confirmed by CAIS; estimates of approximately $5M annually have circulated but have not been verified against IRS Form 990 filings, which would be the authoritative source for nonprofit financials.
Funding
CAIS's primary disclosed funder is Coefficient Giving, a philanthropic organization closely associated with the effective altruism movement. This funding relationship is material context for interpreting the organization's research agenda: Coefficient Giving has historically prioritized long-run catastrophic and extinction-level AI risk over near-term AI harms, and CAIS's framing broadly reflects this prioritization.
No comprehensive public breakdown of CAIS's funding sources or annual budget has been identified. The figure of approximately $5M annually cited in earlier versions of this page is an unverified estimate and should not be treated as authoritative. Readers seeking verified financial data should consult CAIS's IRS Form 990 filings, which are publicly available through ProPublica Nonprofit Explorer or the IRS TEOS database.
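For readers who want to pull those filings programmatically, ProPublica exposes its parsed Form 990 data through a public API. The following is a minimal sketch assuming ProPublica's documented v2 endpoints; the response field names follow its parsed-990 schema and should be verified against the live API before the output is relied on.

```python
"""Sketch: look up a nonprofit's Form 990 financials via ProPublica's
Nonprofit Explorer API (v2). Endpoint paths and field names should be
checked against ProPublica's current API documentation."""
import requests

API = "https://projects.propublica.org/nonprofits/api/v2"

def find_ein(query: str) -> int:
    """Search by organization name and return the first matching EIN."""
    resp = requests.get(f"{API}/search.json", params={"q": query}, timeout=30)
    resp.raise_for_status()
    orgs = resp.json()["organizations"]
    if not orgs:
        raise LookupError(f"no organizations match {query!r}")
    return orgs[0]["ein"]

def filings_with_data(ein: int) -> list[dict]:
    """Return filings for which ProPublica has parsed financial data."""
    resp = requests.get(f"{API}/organizations/{ein}.json", timeout=30)
    resp.raise_for_status()
    return resp.json().get("filings_with_data", [])

if __name__ == "__main__":
    ein = find_ein("Center for AI Safety")
    for f in filings_with_data(ein):
        # Field names per ProPublica's parsed-990 schema (verify live);
        # the first search hit may need disambiguation by name or state.
        print(f.get("tax_prd_yr"), "revenue:", f.get("totrevenue"),
              "expenses:", f.get("totfuncexpns"))
```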
The concentration of AI safety funding through EA-aligned funders including Coefficient Giving (formerly Open Philanthropy) has been noted by critics as a potential source of ideological constraint on safety research priorities — organizations dependent on this funding may face implicit pressure to prioritize framings and research directions consistent with funder worldviews. CAIS has not publicly addressed this critique directly.
Key Research Areas
Technical Safety Research
| Research Domain | Key Contributions | Notes |
|---|---|---|
| Representation Engineering | Methods for reading and steering model internal representations | Published 2023; independent replication and scalability to frontier models remain open research questions (see the sketch below this table) |
| Safety Benchmarks | MACHIAVELLI benchmark for evaluating goal-directed and deceptive behavior | Cited in subsequent research; the extent to which it has been formally integrated into evaluation pipelines at Anthropic or OpenAI is not publicly documented |
| Adversarial Robustness | Evaluation protocols and defense mechanisms | Part of the broader Adversarial Robustness research agenda |
| Alignment Foundations | Conceptual frameworks and problem taxonomies for AI safety | Including the "Unsolved Problems in ML Safety" paper (2022) |
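The table's first row can be made concrete with a toy activation-addition sketch: estimate a concept direction from a contrasting prompt pair, then add that direction back into a hidden layer during generation. This illustrates the general contrast-vector idea rather than the CAIS paper's actual pipeline; the model (gpt2), the layer index, and the steering strength are arbitrary assumptions for demonstration.

```python
"""Toy sketch of reading and steering internal representations.
Not the paper's method: a single prompt pair stands in for the averaged
contrast sets a real pipeline would use."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which transformer block to steer (arbitrary assumption)

def hidden_at(prompt: str) -> torch.Tensor:
    """Last-token hidden state just after block LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's
    # output sits at index LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1]

# "Reading": a concept direction as the difference between hidden
# states of contrasting prompts.
direction = hidden_at("The movie was wonderful") - \
            hidden_at("The movie was terrible")
direction = direction / direction.norm()

# "Steering": add the scaled direction to block LAYER's output on
# every forward pass via a hook.
ALPHA = 8.0  # steering strength (assumption; tune per model/layer)

def steer(module, inputs, output):
    return (output[0] + ALPHA * direction,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt_ids = tok("I thought the restaurant was", return_tensors="pt")
print(tok.decode(model.generate(**prompt_ids, max_new_tokens=20,
                                do_sample=False)[0]))
handle.remove()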
Major Publications & Tools
- Representation Engineering: A Top-Down Approach to AI Transparency (Zou et al., 2023) — Methods for understanding and influencing AI decision-making by working with internal representations rather than input-output behavior alone
- Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior (Pan et al., 2023) — Introduces the MACHIAVELLI benchmark for evaluating whether AI agents pursue goals through unethical means in text-based game environments
- Unsolved Problems in ML Safety (Hendrycks et al., 2022) — A taxonomy of open technical challenges in machine learning safety, intended partly as a research agenda for the field
- Measuring Mathematical Problem Solving With the MATH Dataset (Hendrycks et al., 2021) — A benchmark for evaluating AI mathematical reasoning, authored by Dan Hendrycks and collaborators during his PhD at UC Berkeley; this paper predates CAIS's founding and is a product of Hendrycks's academic research rather than an organizational output of CAIS
Citation counts for these papers (figures such as "200+", "50+", "30+") previously appeared on this page without sourced methodology. Readers seeking current citation data should consult Google Scholar or Semantic Scholar directly.
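As a hedged illustration of how to do that lookup programmatically, the sketch below queries the Semantic Scholar Graph API for the papers above. The arXiv identifiers are believed to match these papers but are assumptions that should be double-checked against the arXiv listings.

```python
"""Fetch live citation counts from the Semantic Scholar Graph API.
The arXiv IDs below are assumed mappings; verify before relying on them."""
import requests

PAPERS = {
    "Unsolved Problems in ML Safety": "2109.13916",
    "MACHIAVELLI benchmark": "2304.03279",
    "Representation Engineering": "2310.01405",
}

for name, arxiv_id in PAPERS.items():
    resp = requests.get(
        f"https://api.semanticscholar.org/graph/v1/paper/arXiv:{arxiv_id}",
        params={"fields": "title,citationCount,year"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # Unauthenticated requests are rate-limited; add an API key header
    # for bulk use.
    print(f"{name}: {data['citationCount']} citations ({data['year']})")
```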
Field-Building Programs
CAIS runs several programs intended to grow the population of researchers working on AI safety. The term "field-building" refers to activities designed to increase the size, diversity, and coordination of a research community — in this case, researchers focused on technical and governance aspects of AI safety.
Grant Programs
| Program | Reported Scale | Description | Timeline |
|---|---|---|---|
| Compute Grants | $2M+ distributed; number of recipients reported variously as 100+ and 200+ in different CAIS materials — figure unverified | Provides compute resources to researchers working on safety-relevant projects | 2022–present |
| ML Safety Scholars | Approximately 50 participants per cohort | Structured program for early-career researchers entering the AI safety field | 2021–present |
| Research Fellowships | $500K+ annually | Fellowships placing researchers at academic and research institutions | 2022–present |
| AI Safety Camp | 200+ participants total | Collaborative program supporting international research teams | 2020–present |
Note: Quantitative figures in this table are drawn from CAIS's own communications and have not been independently verified. The compute grant recipient count is internally inconsistent across CAIS materials (100+ in some sources, 200+ in others); the higher figure may aggregate across all field-building programs rather than compute grants alone.
Institutional Partnerships
- Academic Collaborations: Reported collaborations with UC Berkeley, MIT, Stanford, and Oxford
- Industry Engagement: Research interactions with Anthropic and Google DeepMind
- Policy Connections: Briefings reported with US Congress, UK Parliament, and EU regulatory bodies
Statement on AI Risk (2023)
In May 2023, CAIS published and circulated the Statement on AI Risk, a single sentence co-signed by over 350 AI researchers and industry figures:
"Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."
The statement was covered widely in major news outlets and was cited in subsequent policy discussions, including in the context of UK and US government AI strategies. The official signatory list is available at safe.ai; the figure of 350+ is drawn from that list, though the precise count at any given time may vary as signatories are added.
Signatory Groups
| Category | Notable Signatories | Description |
|---|---|---|
| Turing Award Recipients | Geoffrey Hinton, Yoshua Bengio, Stuart Russell | Recipients of computing's highest recognition who signed the statement |
| Industry Executives | Sam Altman (OpenAI), Dario Amodei (Anthropic), Demis Hassabis (DeepMind) | CEOs of major AI laboratories |
| Policy and Governance Researchers | Helen Toner, Allan Dafoe, Gillian Hadfield | Researchers working on AI governance and policy |
| ML/AI Researchers | 300+ researchers across academia and industry | Researchers who signed as individuals, not representing institutional positions |
The statement's reception was not uniformly positive within the AI research community. A number of prominent ML researchers declined to sign or publicly criticized the statement's framing. Critics raised several concerns: that the one-sentence format was too vague to convey meaningful technical content; that equating AI risk with nuclear war risk was unsupported by available evidence; that the extinction framing could distract attention and resources from observable near-term harms from AI systems (such as bias, surveillance, and labor displacement); and that the statement's signatories were not uniformly working on extinction-risk problems, making it a weak signal of scientific consensus. These critiques came most consistently from researchers in the AI fairness and near-term safety communities.
Proponents argued that the statement served a legitimate coordination function: making it socially acceptable for researchers to discuss catastrophic risk publicly, signaling to policymakers that risk concerns were not fringe views, and creating a reference point for subsequent regulatory discussions. Whether the statement's net effect on AI policy and research prioritization was positive is a matter of ongoing debate.
The statement's impact on specific policy documents — including mentions in UK AI Safety Institute and US AI Safety Institute contexts — has been cited by CAIS, though the causal relationship between the statement and any particular policy outcome is difficult to establish.
Critiques and Limitations
Criticism of Extinction-Risk Framing
The most substantive criticism of CAIS's work concerns its central framing of AI extinction risk as a near-term policy priority. Critics from several directions have argued:
- Near-term displacement effect: Emphasizing speculative long-run extinction risk may draw funding, talent, and policy attention away from near-term AI harms — discrimination in algorithmic decision-making, AI-enabled surveillance, labor market disruption, and misinformation — that are currently affecting people. Researchers associated with the AI ethics and fairness communities have made this argument most consistently.
- Epistemic status of extinction claims: The probability of AI-caused human extinction within policy-relevant timeframes is highly uncertain, and critics have argued that treating it as a "global priority alongside pandemics and nuclear war" involves large unjustified inferential steps. Some ML researchers have noted that the mechanisms by which current or near-term AI systems could pose extinction-level risks are not specified with sufficient precision to evaluate.
- Ideological concentration: CAIS's alignment with EA-associated funders and the broader longtermist intellectual tradition has led critics to argue that its research agenda reflects a particular philosophical worldview rather than a neutral assessment of AI risk. This critique is not unique to CAIS — it applies to several EA-funded AI safety organizations — but it is relevant to assessing how to interpret CAIS's outputs.
Limitations of Specific Research
- Representation Engineering scalability: The representation engineering paper demonstrated its methods on models of moderate scale; whether they generalize to frontier-scale models is an open question, and independent researchers have noted limits on the approach's applicability to very large models.
- Benchmark validity: A general concern in AI safety evaluation is whether constructed benchmarks (including MACHIAVELLI) capture risks that manifest in real deployment contexts. The MACHIAVELLI benchmark uses text-based game environments, and the extent to which performance on these environments predicts behavior in consequential real-world settings is not established (a toy illustration of the benchmark's reward-versus-ethics structure follows this list).
- Field-building outcome measurement: CAIS reports counts of researchers supported and grant dollars distributed, but does not publicly report outcome data for its programs — for example, where ML Safety Scholars alumni work subsequently, what research they produce, or whether compute grant recipients remain in safety research. Without outcome data, the field-building impact claims are difficult to evaluate independently.
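For readers unfamiliar with MACHIAVELLI's setup, the toy sketch below shows the shape of the reward-versus-ethics computation it performs. The trajectories and annotations here are invented stand-ins, not the benchmark's actual data format, which annotates scenes in text-adventure games with behavioral labels.

```python
"""Toy illustration of the reward-vs-ethics trade-off MACHIAVELLI
measures. All data below is hypothetical."""
from dataclasses import dataclass

@dataclass
class Step:
    reward: float    # in-game points gained at this step
    violations: int  # annotated ethical violations (e.g., deception)

def score(trajectory: list[Step]) -> tuple[float, int]:
    """Total reward and total annotated violations for one agent run."""
    return (sum(s.reward for s in trajectory),
            sum(s.violations for s in trajectory))

# Two hypothetical agents: one maximizes reward regardless of the
# annotations, one abstains from violations at some cost in reward.
ruthless = [Step(5.0, 1), Step(8.0, 2), Step(3.0, 0)]
cautious = [Step(4.0, 0), Step(2.0, 0), Step(3.0, 0)]

for name, traj in [("ruthless", ruthless), ("cautious", cautious)]:
    r, v = score(traj)
    print(f"{name}: reward={r}, violations={v}")
```

The open validity question is whether scores of this kind, computed over text-game annotations, predict anything about an agent's behavior in consequential real-world settings.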
Critiques of the 2023 Statement
Beyond the framing critiques noted above, several researchers argued that the statement's format — a single declarative sentence without methodology, evidence, or mechanism — made it unsuitable as a scientific communication and more akin to a public advocacy document. Others noted that some signatories are not primarily working on extinction-risk problems, which complicated interpretation of the statement as a signal of expert consensus on the technical merits of the extinction-risk hypothesis.
Current Trajectory & Timeline
Research Roadmap
The following research priorities were described by CAIS as goals for 2024–2026. Given that this page was last edited in late 2025, some of these projections are now in the past. Actual outcomes against these goals have not been independently verified and are not currently documented on this page.
| Priority Area | Stated Goals | Status |
|---|---|---|
| Representation Engineering | Scale methods to frontier models; pursue industry adoption for safety evaluation | Outcome unverified |
| Evaluation Frameworks | Develop comprehensive benchmark suite; establish standard evaluation protocols | Outcome unverified |
| Alignment Methods | Proof-of-concept demonstrations; practical implementation work | Outcome unverified |
| Policy Research | Technical governance recommendations; regulatory framework development | Outcome unverified |
A previously cited projection of "2x expansion by 2025" appeared in earlier versions of this page without a cited source. Whether this projection materialized has not been verified.
Organizational Scale
- Staff: 15+ full-time staff reported; current headcount has not been independently verified
- Affiliates: 50+ affiliate researchers reported
- Budget: Approximately $5M annually — this figure is an unverified estimate; IRS Form 990 filings are the authoritative source for nonprofit financials
Key Uncertainties & Research Cruxes
Technical Challenges
These represent genuine open questions in CAIS's research agenda, not settled conclusions:
- Representation Engineering Scalability: Whether methods developed on mid-scale models transfer reliably to frontier-scale systems remains unclear. The gap between controlled research settings and deployment conditions is a known limitation.
- Benchmark Validity: Whether evaluations like MACHIAVELLI capture risks that manifest in real deployment — rather than behavior specific to text-game environments — is unresolved. This is a field-wide challenge, not unique to CAIS.
- Alignment Verification: There is no established consensus on how to verify that an AI system is successfully aligned with intended goals rather than passing evaluations through surface-level pattern matching.
Strategic Questions
- Research vs. Policy Balance: CAIS allocates resources across technical research, field-building, and policy communication. The optimal allocation is not obvious, and different observers weight these activities differently based on their models of how AI safety progress happens.
- Open vs. Closed Research: Publishing safety research openly makes it available to the broader community but may also inform adversarial actors. CAIS has not publicly articulated a detailed position on this tradeoff.
- Timeline Assumptions: Appropriate research priorities depend substantially on assumptions about AGI timelines and the nature of AI risk. Researchers with shorter timelines and those focused on long-run speculative risk reach different conclusions about what work is most valuable now.
- Near-term vs. Long-term Risk Balance: Whether resources spent on extinction-risk scenarios are appropriately calibrated relative to near-term AI harms is a live debate both within and outside the AI safety community, and CAIS's position at the long-run end of this spectrum is contested.
Leadership & Key Personnel
Key People
- Dan Hendrycks: Founder and Executive Director
- Andy Zou: Researcher; joint affiliation with CMU and CAIS

Note: Staff roles and affiliations reflect information available at time of last edit and may not reflect current positions; affiliations, including Andy Zou's primary institutional role, should be verified against current sources.
Positioning Within the AI Safety Ecosystem
CAIS occupies a specific position within the broader AI safety research landscape that distinguishes it from peer organizations:
- vs. MIRI: MIRI focuses almost exclusively on foundational theoretical alignment research and does not run field-building or public communication programs. CAIS's research is more empirical and its scope is broader institutionally.
- vs. CHAI: CHAI (Center for Human-Compatible AI, UC Berkeley) is an academic center with a narrower research agenda centered on value alignment. CAIS has a more explicit field-building and policy communication mandate.
- vs. Redwood Research: Redwood focuses on specific empirical safety problems with a small team; CAIS has a larger scope including grant programs and public communication.
- vs. METR and ARC Evaluations: These organizations focus specifically on model evaluations and dangerous capability assessments. CAIS's evaluation work (MACHIAVELLI) is one component of a broader agenda.
- vs. GovAI: GovAI focuses on AI governance and policy research. CAIS does policy communication but its primary identity is as a technical research organization.
The common thread across CAIS-adjacent organizations is EA-aligned funding, primarily from Coefficient Giving, which has led to criticisms that the AI safety field as constituted reflects the priorities of a relatively narrow philanthropic and ideological community rather than a broad scientific consensus.
Sources & Resources
Official Resources
| Type | Resource | Description |
|---|---|---|
| Website | safe.ai | Main organization hub |
| Research | CAIS Publications | Technical papers and reports |
| Blog | CAIS Blog | Research updates and commentary |
| Courses | ML Safety Course | Educational materials on machine learning safety |
Key Research Papers
| Paper | Year | Description |
|---|---|---|
| Unsolved Problems in ML Safety | 2022 | Research agenda taxonomy; citation counts should be verified via Google Scholar or Semantic Scholar |
| MACHIAVELLI Benchmark | 2023 | Evaluation framework for goal-directed AI behavior in game environments |
| Representation Engineering | 2023 | Methods for reading and steering AI model internal representations |
Related Organizations
- Technical Safety Research: MIRI, CHAI, Redwood Research
- Evaluations: ARC Evaluations, METR
- Policy Focus: GovAI, RAND Corporation
- Industry Labs: Anthropic, OpenAI, Google DeepMind
- Funders: Coefficient Giving