
Speech Emotion Recognition Using Deep Learning Transfer Learning Models and Explainable Techniques

Author(s)
Kim, Tae-Wan
Issued Date
2024
Abstract
Speech Emotion Analysis and Recognition Using Deep Learning and Explainable Methods
Kim, Tae-Wan
Advisor: Prof. Kwak, Keun Chang, Ph.D.
Dept. of Electronic Engineering, Graduate School of Chosun University

Technologies that use speech, such as user recognition, IoT applications, and emotion classification, have recently attracted increasing attention. Effective use of speech data in these technologies requires appropriate pre-processing, feature extraction methods, and models. In this paper, we propose speech emotion analysis and recognition using deep learning and explainable methods. The proposed method proceeds in five steps. First, to obtain more information about emotion, the speech data are converted into spectrogram images in the time-frequency domain by the STFT (Short-Time Fourier Transform). Second, a GDS (Gaussian Data Selection) mechanism, which uses a Gaussian distribution and correlation coefficients, is applied to reduce the data volume. Third, to achieve higher performance and adapt to more diverse speech, we construct a late fusion of VGGish and YAMNet, in which each model learns and extracts features independently. Fourth, we apply three explainable methods (Grad-CAM, LIME, and Occlusion Sensitivity) to visually confirm which areas the trained model focused on while classifying the data and which time-frequency characteristics drive its decisions. Finally, we map the focused areas obtained by the Grad-CAM method back onto the speech signal.

After the speech signal is converted into a spectrogram image, the image is divided along the time axis to match the model's input size. In conventional division, unnecessary segments without emotional features, such as silence, stuttering, and noise, are included, which wastes computational resources during training. Moreover, from the model output alone, users cannot tell whether the result sufficiently supports their judgment, nor can they analyze whether a particular area is closely related to the corresponding emotion.

The data used in the experiments were acquired by Chosun University for emotion classification in 2021 and 2022 (CSU2021, CSU2022), together with public emotion-classification data from AI-Hub. In this paper, we select heterogeneous data above a threshold by applying correlation coefficients to the Gaussian distribution of the segmented spectrogram sections. By excluding such data from training, we effectively reduce the data size and thereby reduce computational resource consumption and training time. The two transfer-learning models are fused to design a model with high classification accuracy that can be applied to more diverse speech data. For the classification results, we apply the three explainable methods (Grad-CAM, LIME, and Occlusion Sensitivity) to show the crucial areas of the spectrogram in a variety of ways. These areas explain which frequency bands carry features closely related to each emotion and, through analysis in the time domain, which words and phrases influence the classification. The concentration area obtained by Grad-CAM is restored to the original signal size so that the speech signal of the corresponding section can be analyzed further. These methods are expected to enable effective learning and classification in speech research, and to be used in multimodal research that combines the concentrated words obtained through the explainable techniques with situational analysis through emotional words.
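To make the pipeline described above more concrete, the following sketches illustrate three of the five steps. They are minimal illustrations under stated assumptions, not the implementation used in the thesis; file names, parameters (sampling rate, FFT size, hop length, threshold), and the classifier head are assumptions introduced here for illustration only.

A minimal sketch of step 1 (converting speech into an STFT spectrogram), assuming librosa; the 16 kHz sampling rate, 1024-point FFT, and 256-sample hop are illustrative, not the thesis settings.

```python
# Step 1 sketch: speech waveform -> log-magnitude STFT spectrogram.
# Parameters are illustrative assumptions, not the thesis settings.
import numpy as np
import librosa

def speech_to_spectrogram(path, sr=16000, n_fft=1024, hop_length=256):
    """Load a speech file and return a log-magnitude STFT spectrogram in dB."""
    y, _ = librosa.load(path, sr=sr)                      # resample to a fixed rate
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    return librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Hypothetical usage: the dB matrix would be rendered as the time-frequency
# image that is divided along the time axis to fit the CNN input size.
# spec = speech_to_spectrogram("utterance.wav")   # "utterance.wav" is a placeholder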
Alternative Title
Speech Emotion Recognition Using Deep Learning and Explainable Techniques
Alternative Author(s)
Tae-Wan Kim
Affiliation
General Graduate School, Chosun University
Department
Department of Electronic Engineering, General Graduate School
Advisor
Kwak, Keun Chang
Awarded Date
2024-02
Table Of Contents
Chapter 1. Introduction 1
Section 1. Research Background and Objectives 1
Section 2. Research Content and Organization 3
Chapter 2. Related Work 5
Section 1. Studies Applying Deep Learning to Speech Spectrogram Images 5
Section 2. Studies Applying Deep Learning and Explainable Techniques to Spectrograms 8
Chapter 3. Speech Emotion Recognition Using Deep Learning Transfer Learning Models and Explainable Techniques 10
Section 1. Data Pre-processing 12
1. Conversion of Speech Data into Spectrogram Images 12
2. Gaussian Data Selection Mechanism 14
Section 2. Design of a Late-Fusion Model Using VGGish and YAMNet 15
1. Characteristics of the VGGish and YAMNet Models 15
2. Introduction to the Late-Fusion Approach 17
Section 3. Explainable Models (XAI) 19
1. Explainable Techniques for Visualizing the Model's Focus 19
a. The Grad-CAM Technique 20
b. The LIME Technique 21
c. The Occlusion Sensitivity Technique 22
Chapter 4. Experiments and Results 23
Section 1. Datasets 23
1. CSU 2021 Speech Emotion Dataset of General Speakers 23
2. CSU 2022 Speech Emotion Dataset of General Speakers 25
3. AI-HUB Emotion Classification Dataset 27
4. Data Pre-processing and Data Selection 29
Section 2. Experiments and Results 30
1. Training Environment and Parameters 30
2. Per-Class Accuracy Analysis 31
3. Analysis of Focus Areas Using Explainable Models 34
4. Audio Based on the Focus Areas of the Explainable Models 38
Chapter 5. Conclusion 40
References 42
Degree
Master
Publisher
Graduate School, Chosun University
Citation
Kim, Tae-Wan. (2024). Speech Emotion Recognition Using Deep Learning Transfer Learning Models and Explainable Techniques.
Type
Dissertation
URI
https://oak.chosun.ac.kr/handle/2020.oak/18006
http://chosun.dcollection.net/common/orgView/200000720223
Appears in Collections:
General Graduate School > 3. Theses(Master)
Authorize & License
  • Authorize: Open
  • Embargo: 2024-02-23