
DeepSeek Warns of Jailbreak Risks in Open-Source AI Models


Relevant to AI safety governance debates around open-source model risks and the comparative transparency of Chinese vs. Western AI labs in disclosing and mitigating model vulnerabilities.

Metadata

Importance: 42/100 · news article · news

Summary

DeepSeek published its first safety evaluation of its AI models in Nature, revealing that open-source models—including its own R1 and Alibaba's Qwen2.5—are particularly vulnerable to jailbreak attacks. The report highlights a disparity between Chinese and American AI companies in publicizing model risks and implementing safety frameworks, with US firms like Anthropic and OpenAI having established formal risk mitigation policies.

Key Points

  • DeepSeek published a peer-reviewed safety evaluation in Nature, its first public disclosure of AI model risks.
  • Open-source models like DeepSeek R1 and Alibaba Qwen2.5 were found to be the most susceptible to jailbreak attacks that elicit harmful outputs.
  • Chinese AI companies have been less transparent about model risks compared to US counterparts with formal safety frameworks.
  • American firms like Anthropic (Responsible Scaling Policies) and OpenAI (Preparedness Framework) have established public risk mitigation policies.
  • The disclosure signals a potential shift toward greater safety transparency from Chinese AI developers.

Cited by 1 page

Page | Type | Quality
Open Source AI Safety | Approach | 62.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 9 KB
DeepSeek warns of 'jailbreak' risks for its open-source models

 DeepSeek has revealed details about the risks posed by its artificial intelligence models for the first time, noting that open-sourced models are particularly susceptible to being "jailbroken" by malicious actors.

 The Hangzhou-based start-up said it evaluated its models using industry benchmarks as well as its own tests in a peer-reviewed article published in the academic journal Nature.

 American AI companies often publicise research about the risks of their rapidly improving models and have introduced risk mitigation policies in response, such as Anthropic's Responsible Scaling Policies and OpenAI's Preparedness Framework.

 Chinese companies were less outspoken about risks, despite their models being just a few months behind their US equivalents, according to AI experts. However, DeepSeek had conducted evaluations of such risks before, including the most serious "frontier risks", the Post reported earlier.

 The Nature paper provided more "granular" details about DeepSeek's testing regime, said Fang Liang, an expert member of China's AI Industry Alliance (AIIA), an industry body. These included "red-team" tests based on a framework introduced by Anthropic, in which testers try to get AI models to produce harmful speech.

According to the paper, DeepSeek found that its R1 reasoning model and V3 base model - released in January 2025 and December 2024, respectively - had, on average, slightly higher safety scores across six industry benchmarks than OpenAI's o1 and GPT-4o, both released last year, and Anthropic's Claude-3.7-Sonnet, released in February.

However, it found that R1 was "relatively unsafe" once its external "risk control" mechanism was removed, following tests on its own in-house safety benchmark consisting of 1,120 test questions.

 AI companies typically try to prevent their systems from generating harmful content by "fine-tuning" the models themselves during the training process or adding external content filters.

 However, experts have warned that such safety 

... (truncated, 9 KB total)
Resource ID: 393d870a262b0132 | Stable ID: sid_hA90RWLJ3e