Skip to content
Longterm Wiki
Back

Compute Funding for SERI MATS LLM Alignment Research - Manifund

web

This Manifund project page documents a compute grant supporting SERI MATS-affiliated LLM alignment research; useful as a reference for understanding how alignment research is funded and what topics SERI MATS scholars pursued circa 2023.

Metadata

Importance: 38/100otherhomepage

Summary

A Manifund grant project led by Ethan Perez funding compute for four LLM alignment research projects through SERI MATS and university collaborators. Projects focus on RLHF failure modes (sycophancy, inverse scaling), debate-based alignment, and related technical safety work. The project exceeded its $200K funding goal, raising over $400K.

Key Points

  • Funds compute for 11 researchers across SERI MATS, Oxford, UCL, and NYU working on LLM alignment under Ethan Perez's supervision
  • Project 1 investigates RLHF-driven failures including sycophancy, where LLMs reinforce user biases or misinformation as models scale
  • Project 2 tests debate as a scalable oversight technique, examining whether LLM-generated arguments improve human evaluator accuracy
  • Grant administered by FAR AI; raised $400K against a $200K goal, indicating strong community support
  • Research targets near-term alignment failures with implications for future scaled models and successor methods to RLHF

Cited by 1 page

PageTypeQuality
Model Organisms of MisalignmentAnalysis65.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202615 KB
Compute and other expenses for LLM alignment research | Manifund

 

 
 
 
 

 Jan
 FEB
 Mar
 

 
 

 
 09
 
 

 
 

 2025
 2026
 2027
 

 
 
 

 

 

 
 
success

 
fail

 
 
 
 
 
 
 
 
 
 
 

 

 
 
 
 
 
 
 
 
 

 

 About this capture
 

 

 

 

 

 

 
COLLECTED BY

 

 

 
 
Collection: Common Crawl

 

 

 Web crawl data from Common Crawl.
 

 

 

 

 

 
TIMESTAMPS

 

 

 

 

 

 

The Wayback Machine - https://web.archive.org/web/20260209082828/https://manifund.org/projects/compute-funding-for-seri-mats-llm-alignment-research

 

Manifund

Home

Login

About

People

Categories

Newsletter

Home

About

People

Categories

Login

Create

6

Compute and other expenses for LLM alignment research

Technical AI safety

Ethan Josean Perez

Active

Grant

$400,100raised

$200,000funding goal

Fully funded and not currently accepting donations.

p]:prose-li:my-0 text-gray-900 prose-blockquote:text-gray-600 prose-a:font-light prose-blockquote:font-light font-light break-anywhere empty:prose-p:after:content-["\00a0"]">
Ethan Perez is currently supervising 4 independent or academia-based research projects on aligning LLMs, which would significantly benefit from additional funding for compute. These projects are led by 11 full-time research collaborators, 3 from universities (Oxford, University of College London, and New York University) and 8 from SERI MATS.

The funding would be administered by FAR; below an outline of the projects. 

1 Finding Failures Driven by Human Feedback

Reinforcement Learning from Human Feedback (RLHF) has grown increasingly widespread as a technique for aligning LLMs with human preferences. Finding fundamental failures in RLHF is important for under- standing how to address future issues that will come up and still persist, after organizations have improved their RLHF training setups and scaled up models. As a result, we aim to find cases of “inverse scaling” [1] driven by RLHF; cases where model behavior gets worse as models grow larger and more effective at op- timizing human preference judgments. By discovering failures driven by RLHF, and showing that human feedback is at fault, we aim to drive the development of successor methods to RLHF, including techniques for using LLMs to aid humans in providing preference judgments [2].

Thus far, we have found a number of egregious cases of LLM sycophancy, where LLMs repeat back user views, including cases where the LLM repeats misinformation or blatantly flatters users. We are currently working to generate datasets (e.g., using LLMs) to test these egregious failures more robustly. We are also running experiments to more carefully determine the extent to which these failures are driven by flaws in human feedback vs. remnants of behaviors learned by LLMs during pretraining. With additional funding for compute, we would be able to generate more datasets for testing for RLHF-driven failures in LLMs, as well as test LLMs like GPT4 on more datasets. Moreover, we use LLMs to 

... (truncated, 15 KB total)
Resource ID: 60649beb13903c7c | Stable ID: sid_tqorUZmPSP