Tool-Use Restrictions
Tool-use restrictions limit which actions and APIs an AI system can invoke, directly constraining its potential for harm. This approach is especially important for agentic AI systems because it imposes hard limits on capability regardless of the model's intentions. The need is growing: METR evaluations show the length of agentic tasks models can complete doubling roughly every 7 months.
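One common way to enforce such restrictions is an allowlist that gates every tool call the agent makes. The sketch below is a minimal, hypothetical illustration (the names `ToolRegistry`, `read_file`, and `run_shell` are assumptions for this example, not any specific framework's API); real deployments would combine this with sandboxing and auditing.

```python
# Minimal sketch of a tool-use allowlist for an agent loop.
# All names here are illustrative assumptions, not a real framework's API.

class ToolNotPermitted(Exception):
    """Raised when the agent requests a tool outside its allowlist."""

class ToolRegistry:
    def __init__(self, allowed):
        self._tools = {}
        self._allowed = set(allowed)  # hard limit, fixed at construction time

    def register(self, name, fn):
        self._tools[name] = fn

    def call(self, name, *args, **kwargs):
        # Enforce the restriction regardless of what the model requests.
        if name not in self._allowed:
            raise ToolNotPermitted(f"tool {name!r} is not on the allowlist")
        return self._tools[name](*args, **kwargs)

# Example policy: the agent may read files but never run shell commands.
registry = ToolRegistry(allowed={"read_file"})
registry.register("read_file", lambda path: f"contents of {path}")
registry.register("run_shell", lambda cmd: "should never execute")

print(registry.call("read_file", "notes.txt"))  # permitted
try:
    registry.call("run_shell", "rm -rf /")      # blocked by the allowlist
except ToolNotPermitted as exc:
    print("blocked:", exc)
```

Because the allowlist is checked at the call boundary rather than in the prompt, the restriction holds even if the model is manipulated into requesting a disallowed tool.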
Related Pages
Sandboxing / Containment
Sandboxing limits AI system access to resources, networks, and capabilities as a defense-in-depth measure.
Anthropic
An AI safety company founded by former OpenAI researchers that develops frontier AI models while pursuing safety research, including the Claude mod...
Agentic AI
AI systems that autonomously take actions in the world to accomplish goals. Industry forecasts project 40% of enterprise applications will include ...
METR
Model Evaluation and Threat Research conducts dangerous capability evaluations for frontier AI models, testing for autonomous replication, cybersec...
OpenAI
Leading AI lab that developed GPT models and ChatGPT, analyzing organizational evolution from non-profit research to commercial AGI development ami...