A Study on an Improved WSD Method for Intelligent Document Analysis

Natural language is a communication system that emerged through human evolution as a means of recording and sharing information with others. Starting from simple methods such as gestures and facial expressions, it has evolved into highly scientific and intelligent language systems. With the advent of paper, people began to produce many kinds of texts; the internet and smart devices, however, have changed the entire world. Owing to the simplicity and mobility of smart devices, the number of documents on the web has increased dramatically over the past few years. The problem is that handling such a huge volume of web documents is costly. Computer scientists therefore began to analyze human-written text with statistical approaches so that machines, rather than humans, could process large numbers of documents.

However, analyzing human language takes more than statistics; it goes far beyond the mathematics one might expect. Researchers have struggled to overcome this issue. One possible idea is to build a machine-readable knowledge system modeled on the human brain. It rests on the hypothesis that humans understand the meaning of words by using concepts in memory that they have learned before. If a machine had such a knowledge system, it might be able to analyze documents more intelligently and precisely than before.

The biggest obstacle to computer understanding of human language is that a word can have multiple meanings, which makes Word Sense Disambiguation (WSD) the most challenging issue to be solved. Researchers have therefore focused on the WSD problem. One of the most popular algorithms based on machine-readable knowledge is the Structural Semantic Interconnections (SSI) algorithm, which applies the hypothesis that the concept of a given word can be disambiguated by comparing the interconnections between the concepts of co-occurring words. Although the SSI algorithm is a powerful method, it still has a weakness to overcome.

The degree of ambiguity differs from word to word. Some words have only a single meaning (monosemous words), while others have multiple meanings (polysemous words). Words with low ambiguity should be disambiguated before words with high ambiguity. Moreover, a word is likely to be semantically related to its adjacent words. If the centroid word and the target word to be analyzed are adjacent in a given sentence, they are more likely to share semantic relations than words that are far apart. To apply these two hypotheses, this research introduces the Low Ambiguity First (LAF) algorithm. Word ambiguity is measured using the number of possible concepts of a target word and the frequencies of those concepts, as defined in WordNet.
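The low-ambiguity-first ordering described above can be illustrated with a minimal sketch. The abstract does not give the thesis's actual ambiguity formula, so this example assumes one plausible measure: the entropy of a word's sense-frequency distribution, which grows with both the number of senses and how evenly their frequencies are spread. The sense inventory `TOY_SENSES`, its counts, and the function names `ambiguity` and `low_ambiguity_first` are hypothetical and are not taken from WordNet or from the thesis.

```python
import math

# Hypothetical sense inventory: word -> list of (sense id, frequency count).
# The ids and counts are invented for illustration; they are not WordNet data.
TOY_SENSES = {
    "bank": [("bank#1", 20), ("bank#2", 25), ("bank#3", 5)],
    "money": [("money#1", 40), ("money#2", 2)],
    "deposit": [("deposit#1", 10), ("deposit#2", 10),
                ("deposit#3", 10), ("deposit#4", 10)],
    "teller": [("teller#1", 7)],
}


def ambiguity(senses):
    """Assumed ambiguity measure: entropy (in bits) of the sense frequencies.

    A monosemous word scores 0.0; more senses and flatter frequency
    distributions give higher scores.
    """
    total = sum(count for _, count in senses)
    if total == 0:
        # No frequency information: fall back to a uniform distribution.
        return math.log2(len(senses)) if senses else 0.0
    probs = [count / total for _, count in senses if count > 0]
    return sum(-p * math.log2(p) for p in probs)


def low_ambiguity_first(words, inventory):
    """Return the words known to the inventory, least ambiguous first."""
    known = [w for w in words if w in inventory]
    return sorted(known, key=lambda w: ambiguity(inventory[w]))


if __name__ == "__main__":
    sentence_nouns = ["bank", "deposit", "money", "teller"]
    for word in low_ambiguity_first(sentence_nouns, TOY_SENSES):
        print(f"{word}: {ambiguity(TOY_SENSES[word]):.3f} bits")
```

Under this assumed measure, the monosemous word is processed first and the word with four equally frequent senses last. The adjacency hypothesis (preferring target words close to the centroid word) is not modeled here; it could be added as a secondary sort key.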

To demonstrate the superiority of the proposed algorithm, nouns in the SemCor 2.1 corpus were disambiguated using the baseline, the SSI algorithm, and the LAF algorithm. The experimental results clearly show that the proposed algorithm disambiguates nouns more accurately than the other algorithms (Brown1: 9.546% improvement; Brown2: 10.324% improvement).

As a result, the proposed LAF algorithm disambiguates nouns with the highest precision among the algorithms compared. Its weakness, however, is that it depends on WordNet: if a target word is not defined in WordNet, there is no way to disambiguate it. Future extensions of the LAF algorithm will therefore include additional work to handle words such as proper nouns and technical terms by using web resources such as Wikipedia.
Alternative Title
An Improved WSD Method for Intelligent Document Analysis
Alternative Author(s)
Choi, Dongjin
Chosun University Graduate School
Graduate School, Department of Computer Engineering
Table Of Contents

Ⅰ. Introduction
1. Research Background
2. Research Content and Scope
Ⅱ. Related Work
1. Overview of Word Sense Disambiguation Research
2. Corpus-based WSD Methods
3. Dictionary-based WSD Methods
4. Knowledge-based WSD Methods
1) Measuring Semantic Similarity between Concepts Using WordNet
2) Knowledge-based Word Sense Disambiguation Algorithms
Ⅲ. Proposed Background Theory for the LAF Algorithm
1. Semantic Relatedness of Words
1) Adjacent Words Are More Semantically Related
2. Measuring Word Sense Ambiguity
1) The Sense Ambiguity of Polysemous Words Is Measurable
Ⅳ. Word Sense Disambiguation Based on the LAF Algorithm
1. Preprocessing
2. Initial Settings for Running the Algorithm
3. Applying the LAF Algorithm to Word Sense Disambiguation
Ⅴ. Experiments and Results
1. Experimental Data
2. Baseline Experiment
3. SSI Algorithm-based Experiment
4. LAF Algorithm-based Experiment
Ⅵ. Conclusion and Future Work
References
Choi, Dongjin. (2015). A Study on an Improved WSD Method for Intelligent Document Analysis.
Appears in Collections:
General Graduate School > 4. Theses(Ph.D)
Authorize & License
  • Authorize: Open
  • Embargo: 2015-08-25