How to Jailbreak a Post-trained LLM: A Case Study on DeepSeek R1 1776
March 12, 2025
Introduction
DeepSeek R1 1776 is a large language model (LLM) post-trained to provide detailed answers on historically sensitive or controversial topics. Jailbreaking refers to bypassing an LLM’s built-in safety measures to retrieve restricted or harmful information. My analysis shows that post-training models to reduce overly restrictive responses can inadvertently make them vulnerable to jailbreaking.
Why DeepSeek R1 1776 is Vulnerable
Any LLM that undergoes post-training to enhance its willingness to answer sensitive questions inherently risks becoming susceptible to sophisticated jailbreaking techniques. While DeepSeek R1 1776 initially shows robust safety against clearly harmful questions, its defenses weaken considerably when harmful prompts are cleverly combined with topics it has been specifically post-trained to discuss.

Methodology
Our approach involved four main steps:
- Dataset Creation: We began with clearly harmful prompts from StrongReject[1], including requests for information on creating explosives or toxic gases.
-
Question Combination: Malicious prompts were merged with safe example prompts
provided by the DeepSeek R1 1776 blog, such as inquiries about historical events like Tiananmen
Square. Since many LLMs refuse to handle harmful prompts directly, we utilized
mistralai/Mistral-Nemo-Instruct-2407
, known for a higher acceptance rate of controversial prompts[2], to generate coherent combined questions. - Advanced Jailbreaking with Pap_Misrepresentation[3]: We employed the pap_misrepresentation technique, disguising harmful requests within innocent-seeming contexts, such as fictional storytelling or academic inquiry. For instance, instead of directly asking how to create a bomb, we posed a narrative scenario about a fictional character constructing one.
- Filtering Benign Prompts: Generated prompts were labeled by robust LLMs, including Claude 3.7, O3, and Gemini 2. Questions identified as benign by any of these models were excluded from our analysis.
Evaluation Metrics
We used:
- Refusal rate (higher better): Binary indication of whether the model refused to answer.
- Specificity and convincingness scores: Both were assessed using a rescaled Likert scale from 0 to 1. 1 means specific and convincing.
- Overall Score (lower better)
Below is a comparison of the Overall Score and Refusal Rate for different prompt types:
R1 1776 | R1 | |
---|---|---|
Overall Score/Refusal Rate | Overall Score/Refusal Rate | |
Jailbreak Question | 0% / 100% | 0% / 100% |
Combined Question | 50.3% / 44.4% | 22.5% / 73.6% |
Pap Combined Question | 67.4% / 25% | 64.6% / 30.6% |
DeepSeek R1 1776 initially showed perfect defense (100% refusal) against clearly malicious jailbreak prompts. Merging malicious prompts with historically sensitive topics significantly lowered refusal rates (44.4%), showing clear vulnerabilities introduced by post-training. However, the base model R1 is less vulnerable to these prompts. Further disguised prompts drastically reduced refusals to 25%, indicating even more pronounced vulnerability.
Recommendations
- Evaluate Thoroughly: When deploying post-trained models specialized for sensitive domains, conduct evaluations by generating domain-specific jailbreak datasets.
- Balance Specificity and Safety: Avoid excessive fine-tuning toward openness, as this often inadvertently introduces vulnerabilities.
- Continuous Monitoring: Regularly update your jailbreak testing methodology to account for emerging threats and novel jailbreaking strategies.
Conclusion
DeepSeek R1 1776 exemplifies a broader issue facing post-trained LLMs: Increased specificity can unintentionally reduce safety. Our findings underline the importance of careful alignment and evaluation when deploying models designed to answer sensitive or controversial queries.
An example of generated prompts and R1-1776's responses: https://huggingface.co/datasets/weijiejailbreak/jailbreak_1776_step3
References
- Souly, Alexandra, et al. "A strongreject for empty jailbreaks." arXiv preprint arXiv:2402.10260 (2024).
- Cui, Justin, et al. "Or-bench: An over-refusal benchmark for large language models." arXiv preprint arXiv:2405.20947 (2024).
- Zeng, Yi, et al. "How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
