English-Korean Jailbreak Prompt Dataset Construction and Performance Analysis of Jailbreak Prompt Classification Models

박대얼; 최대선; 장현준; 윤두식

Vol. 35, No. 3, pp. 613-622, Jun. 2025

DOI: 10.13089/JKIISC.2025.35.3.613

Keywords: LLM, Jailbreak Attack, Text Classifier, Data Augmentation

Abstract

The security of large language models (LLMs) is increasingly challenged by jailbreak prompt attacks, yet existing research focuses primarily on English jailbreak prompts. In response, this study constructs an English-Korean jailbreak prompt dataset and evaluates jailbreak prompt classifiers trained on it. We collected datasets and applied augmentation techniques to build a dataset labeled with three categories: Benign, Harmful, and Jailbreak. We then trained classifiers on Korean-only data and on combined English-Korean data and evaluated each separately. Experimental results show that the Korean-only model performs better on Korean prompts, while the English-Korean model also maintains stable performance on English data. Furthermore, our classifier outperforms existing models, reducing the false positive rate on benign prompts and improving classification accuracy on Korean prompts. This research contributes to strengthening the security of LLMs in both English and Korean and to improving their resilience against jailbreak attacks.
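
To make the evaluation criteria in the abstract concrete, the following minimal Python sketch shows one way a benign false positive rate and an overall accuracy could be computed for a three-class classifier. The function names, the toy labels, and the definition used here (a false positive is a truly Benign prompt predicted as Harmful or Jailbreak) are assumptions for illustration only; the abstract does not specify the paper's exact metric definitions or data.

```python
# Illustrative only: toy labels, not data or results from the paper.
LABELS = ("Benign", "Harmful", "Jailbreak")


def benign_false_positive_rate(y_true, y_pred):
    """Fraction of truly Benign prompts predicted as Harmful or Jailbreak.

    Assumed definition; the paper may define the metric differently.
    """
    benign_preds = [p for t, p in zip(y_true, y_pred) if t == "Benign"]
    if not benign_preds:
        raise ValueError("y_true contains no Benign prompts")
    flagged = sum(1 for p in benign_preds if p != "Benign")
    return flagged / len(benign_preds)


def accuracy(y_true, y_pred):
    """Fraction of prompts whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical gold labels and classifier outputs for six prompts.
y_true = ["Benign", "Benign", "Benign", "Benign", "Harmful", "Jailbreak"]
y_pred = ["Benign", "Jailbreak", "Benign", "Benign", "Harmful", "Jailbreak"]

benign_fpr = benign_false_positive_rate(y_true, y_pred)
overall_accuracy = accuracy(y_true, y_pred)
```

Under this assumed definition, a lower benign false positive rate means fewer harmless prompts are wrongly blocked. Running the same two functions separately on the Korean and English subsets of a test set would mirror the per-language comparison the abstract describes.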

Cite this article

[IEEE Style]

박대얼, 최대선, 장현준, 윤두식, "English-Korean Jailbreak Prompt Dataset Construction and Performance Analysis of Jailbreak Prompt Classification Models," Journal of The Korea Institute of Information Security and Cryptology, vol. 35, no. 3, pp. 613-622, 2025. DOI: 10.13089/JKIISC.2025.35.3.613.

[ACM Style]

박대얼, 최대선, 장현준, and 윤두식. 2025. English-Korean Jailbreak Prompt Dataset Construction and Performance Analysis of Jailbreak Prompt Classification Models. Journal of The Korea Institute of Information Security and Cryptology, 35, 3, (2025), 613-622. DOI: 10.13089/JKIISC.2025.35.3.613.