Domain Detector for Defending Against Longtail Jailbreak Attacks

Vol. 35, No. 3, pp. 623-638, Jun. 2025
DOI: 10.13089/JKIISC.2025.35.3.623
Keywords: Large Language Model Security, Jailbreak Attack, Longtail Jailbreak Attack, Jailbreak Pre-detection
Abstract

This paper presents a method for defending large language models against a specific type of jailbreak attack known as the longtail jailbreak attack. Existing defenses include pre-detector methods, but these handle longtail jailbreak attacks poorly. To address this, we propose a novel approach that integrates a domain detector with a warning prefix. In experiments where existing methods showed low accuracy, our method achieved 85.54% accuracy on encoding attacks and 100% on multilingual attacks. Applying a staged warning prefix to longtail jailbreak attack prompts achieved a defense success rate of 96.56%, a false positive rate of 0.40%, and an F1-score of 0.9542. These results demonstrate that our approach preserves the advantages of pre-detection, namely preemptive blocking and real-time responsiveness, while also enhancing security under black-box conditions. Consequently, our preemptive strategy for longtail jailbreak attacks contributes to the stable deployment of large language models and supports both flexibility and practicality in security policies.
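
As a rough illustration of the pre-detection idea summarized above, the following minimal sketch routes a prompt to a domain ("encoding", "multilingual", or "plain") and prepends a domain-specific warning prefix before the prompt would reach the model. Everything in it is an assumption made for illustration: the abstract does not describe how the domain detector is built or what the warning text says, so the rule-based heuristics, the 0.3 threshold, the prefix wording, and the names `looks_base64`, `looks_multilingual`, `detect_domain`, and `apply_warning_prefix` are hypothetical stand-ins, not the authors' method.

```python
import base64
import binascii
import re

# Hypothetical prefix wording; the abstract does not give the actual text.
WARNING_PREFIX = {
    "encoding": "Warning: the input below appears to be encoded. "
                "Refuse if the decoded request is harmful.\n",
    "multilingual": "Warning: the input below is not in English. "
                    "Apply the same safety policy as for English.\n",
}

_BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_base64(text: str) -> bool:
    """Heuristic: a single whitespace-free token that decodes to printable UTF-8."""
    candidate = text.strip()
    if len(candidate) < 16 or len(candidate) % 4 != 0:
        return False
    if not _BASE64_RE.fullmatch(candidate):
        return False
    try:
        decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, ValueError):
        # Covers invalid Base64 and bytes that are not valid UTF-8.
        return False
    return all(ch.isprintable() or ch.isspace() for ch in decoded)


def looks_multilingual(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: share of non-ASCII letters exceeds an assumed threshold."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    non_ascii = sum(1 for ch in letters if ord(ch) > 127)
    return non_ascii / len(letters) > threshold


def detect_domain(prompt: str) -> str:
    """Return 'encoding', 'multilingual', or 'plain' for a prompt."""
    if looks_base64(prompt):
        return "encoding"
    if looks_multilingual(prompt):
        return "multilingual"
    return "plain"


def apply_warning_prefix(prompt: str) -> str:
    """Prepend the warning for the detected domain; plain prompts pass through."""
    return WARNING_PREFIX.get(detect_domain(prompt), "") + prompt
```

A learned classifier would normally replace these hand-written rules; the sketch only shows where a domain decision and a warning prefix sit relative to the prompt in a black-box setting.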

Cite this article
[IEEE Style]
장현준, 최대선, 나현식, 박대얼, "Domain Detector for defending against Longtail Jailbreak Attack," Journal of The Korea Institute of Information Security and Cryptology, vol. 35, no. 3, pp. 623-638, 2025. DOI: 10.13089/JKIISC.2025.35.3.623.

[ACM Style]
장현준, 최대선, 나현식, and 박대얼. 2025. Domain Detector for defending against Longtail Jailbreak Attack. Journal of The Korea Institute of Information Security and Cryptology, 35, 3, (2025), 623-638. DOI: 10.13089/JKIISC.2025.35.3.623.