Voice Synthesis Detection Using Language Model-Based Speech Feature Extraction

김승민; 최대선; 박소희

Vol. 34, No. 3, pp. 439-449, Jun. 2024

DOI: 10.13089/JKIISC.2024.34.3.439

Keywords: BERT, Audio codec, Voice Features Extraction, Speech Synthesis, Generated voice detection

Abstract

Recent rapid advances in voice generation technology have made it possible to synthesize natural-sounding voices from text alone. However, this progress has also led to an increase in malicious activity such as voice phishing (vishing), in which generated voices are exploited for criminal purposes. Numerous models have been developed to detect synthesized voices, typically by extracting features from the audio and using those features to estimate the likelihood that the voice was generated. This paper proposes a new voice feature extraction model to address the misuse of generated voices. It combines a deep learning-based audio codec model with the pre-trained natural language processing model BERT to extract novel voice features. To assess how well the proposed features suit generated voice detection, four generated voice detection models were built on the extracted features and their performance was evaluated. For comparison, three voice detection models based on Deepfeature, proposed in previous studies, were evaluated alongside them in terms of accuracy and equal error rate (EER). The model proposed in this paper achieved an accuracy of 88.08% and an EER of 11.79%, outperforming the existing models. These results confirm that the voice feature extraction method introduced in this paper can be an effective tool for distinguishing between generated and real voices.
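The abstract reports results as accuracy and EER. As a reading aid, the following minimal sketch shows how these two metrics are commonly computed from detector scores. It is not the authors' evaluation code. The function names, the toy scores, and the convention that label 1 means "generated" and a higher score means "more likely generated" are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a binary detector.

    Assumed convention (not from the paper): label 1 = generated voice,
    label 0 = real voice, and a higher score means "more likely generated".
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    best = None
    for t in sorted(set(scores)):
        # A sample is predicted "generated" when its score is >= t.
        far = sum(s >= t for s in neg) / len(neg)  # real voices flagged as generated
        frr = sum(s < t for s in pos) / len(pos)   # generated voices that were missed
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def accuracy_at(scores, labels, threshold):
    """Fraction of samples classified correctly at the given threshold."""
    correct = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


# Toy scores for illustration only; these are not results from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]

eer, thr = equal_error_rate(toy_scores, toy_labels)
acc = accuracy_at(toy_scores, toy_labels, thr)
print(f"EER={eer:.2%} at threshold {thr}, accuracy={acc:.2%}")
```

The EER is the operating point at which the false acceptance rate and the false rejection rate are equal, so a lower value indicates a better detector. With a finite test set the two rates rarely coincide exactly, which is why the sketch takes the threshold where they are closest and averages them.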

Cite this article

[IEEE Style]

김승민, 최대선, 박소희, "Voice Synthesis Detection Using Language Model-Based Speech Feature Extraction," Journal of The Korea Institute of Information Security and Cryptology, vol. 34, no. 3, pp. 439-449, 2024. DOI: 10.13089/JKIISC.2024.34.3.439.

[ACM Style]

김승민, 최대선, and 박소희. 2024. Voice Synthesis Detection Using Language Model-Based Speech Feature Extraction. Journal of The Korea Institute of Information Security and Cryptology, 34, 3, (2024), 439-449. DOI: 10.13089/JKIISC.2024.34.3.439.