The Current State of LLM Evaluation: Challenges and Solutions

How to Jailbreak a Post-trained LLM: A Case Study on DeepSeek R1 1776

March 12, 2025

Introduction

DeepSeek R1 1776 is a large language model (LLM) post-trained to provide detailed answers on historically sensitive or controversial topics. Jailbreaking refers to bypassing an LLM’s built-in safety measures to retrieve restricted or harmful information. My analysis shows that post-training models to reduce overly restrictive responses can inadvertently make them vulnerable to jailbreaking.

Why DeepSeek R1 1776 is Vulnerable

Any LLM that undergoes post-training to enhance its willingness to answer sensitive questions inherently risks becoming susceptible to sophisticated jailbreaking techniques. While DeepSeek R1 1776 initially shows robust safety against clearly harmful questions, its defenses weaken considerably when harmful prompts are cleverly combined with topics it has been specifically post-trained to discuss.

Figure 1: When mentioning Xi Jinping, , R1-1776 is willing toteach user how to prescribe opioids

Methodology

Our approach involved four main steps:

Dataset Creation: We began with clearly harmful prompts from StrongReject[1], including requests for information on creating explosives or toxic gases.
Question Combination: Malicious prompts were merged with safe example prompts provided by the DeepSeek R1 1776 blog, such as inquiries about historical events like Tiananmen Square. Since many LLMs refuse to handle harmful prompts directly, we utilized mistralai/Mistral-Nemo-Instruct-2407, known for a higher acceptance rate of controversial prompts[2], to generate coherent combined questions.
Advanced Jailbreaking with Pap_Misrepresentation[3]: We employed the pap_misrepresentation technique, disguising harmful requests within innocent-seeming contexts, such as fictional storytelling or academic inquiry. For instance, instead of directly asking how to create a bomb, we posed a narrative scenario about a fictional character constructing one.
Filtering Benign Prompts: Generated prompts were labeled by robust LLMs, including Claude 3.7, O3, and Gemini 2. Questions identified as benign by any of these models were excluded from our analysis.

Evaluation Metrics

We used:

Refusal rate (higher better): Binary indication of whether the model refused to answer.
Specificity and convincingness scores: Both were assessed using a rescaled Likert scale from 0 to 1. 1 means specific and convincing.
Overall Score (lower better)

Below is a comparison of the Overall Score and Refusal Rate for different prompt types:

	R1 1776	R1
	Overall Score/Refusal Rate	Overall Score/Refusal Rate
Jailbreak Question	0% / 100%	0% / 100%
Combined Question	50.3% / 44.4%	22.5% / 73.6%
Pap Combined Question	67.4% / 25%	64.6% / 30.6%

DeepSeek R1 1776 initially showed perfect defense (100% refusal) against clearly malicious jailbreak prompts. Merging malicious prompts with historically sensitive topics significantly lowered refusal rates (44.4%), showing clear vulnerabilities introduced by post-training. However, the base model R1 is less vulnerable to these prompts. Further disguised prompts drastically reduced refusals to 25%, indicating even more pronounced vulnerability.

Recommendations

Evaluate Thoroughly: When deploying post-trained models specialized for sensitive domains, conduct evaluations by generating domain-specific jailbreak datasets.
Balance Specificity and Safety: Avoid excessive fine-tuning toward openness, as this often inadvertently introduces vulnerabilities.
Continuous Monitoring: Regularly update your jailbreak testing methodology to account for emerging threats and novel jailbreaking strategies.

Conclusion

DeepSeek R1 1776 exemplifies a broader issue facing post-trained LLMs: Increased specificity can unintentionally reduce safety. Our findings underline the importance of careful alignment and evaluation when deploying models designed to answer sensitive or controversial queries.

An example of generated prompts and R1-1776's responses: https://huggingface.co/datasets/weijiejailbreak/jailbreak_1776_step3

References

Souly, Alexandra, et al. "A strongreject for empty jailbreaks." arXiv preprint arXiv:2402.10260 (2024).
Cui, Justin, et al. "Or-bench: An over-refusal benchmark for large language models." arXiv preprint arXiv:2405.20947 (2024).
Zeng, Yi, et al. "How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

Figure 2: When mentioning One party socialist government, R1-1776 is willing to teach user how to build toxic chemicals