Samuel Marks — Cambridge Boston Alignment Initiative
web · cbai.ai/samuel-marks
Profile of Samuel Marks, who leads Anthropic's cognitive oversight subteam focused on detecting deceptive AI behavior through interpretability and black-box interrogation techniques, relevant to scalable oversight and AI deception research.
Metadata
Importance: 42/100 · homepage · reference
Summary
Samuel Marks leads Anthropic's cognitive oversight subteam, focusing on overseeing AI systems by examining their internal cognitive processes rather than just input/output behavior. His research targets detecting AI deception, including cases where lying cannot be identified from outputs alone. He is interested in both white-box (interpretability) and black-box (model interrogation) techniques.
Key Points
- Leads Anthropic's cognitive oversight subteam within the alignment science team.
- Focus on detecting AI deception even when lying is undetectable from input/output behavior alone.
- Interested in both interpretability-based (white-box) and interrogation-based (black-box) oversight techniques.
- Research interests include overseeing models without reliable ground-truth supervision signals and downstream interpretability applications.
- Mentors empirical alignment research projects through the Cambridge Boston Alignment Initiative (CBAI).
Cached Content Preview
HTTP 200 · Fetched Apr 14, 2026 · 2 KB
Samuel Marks

Biography
Samuel leads the cognitive oversight subteam of Anthropic's alignment science team. Their goal is to be able to oversee AI systems not based on whether they have good input/output behavior, but based on whether there's anything suspicious about the cognitive processes underlying those behaviors. For example, one in-scope problem is "detecting when language models are lying, including in cases where it's impossible to tell based solely on input/output" (such as when a model knows a piece of private information which it is lying about). His team is interested in both white-box techniques (e.g. interpretability-based techniques) and black-box techniques (e.g. finding good ways to interrogate models about their thought processes and motivations).

Mentor topics
I will mentor an empirical alignment research project. My main research interests are in (1) overseeing models on tasks where we don't have access to a reliable ground-truth supervision signal and (2) downstream applications of interpretability; this blog post can give a sense of my flavor of research.

Desired fellow qualifications
I'm looking for candidates who:
- Are comfortable driving a research project with only ~30 mins/week of direct supervision
- Have strong coding skills
- Ideally, experience with empirical ML research

Member of Technical Staff, Anthropic
Resource ID: 977dc0f8f626f514