CHOSUN

An Improved Lexical-Chain-Based Automatic Document Summarization Method Considering the Semantic Relatedness of Words

Author(s)
택미얏린
Issued Date
2016
Keyword
Lexical Chain, Automatic Text Summarization, Transition Probability Distribution Generator, Markov Chain
Abstract
Summarization is a challenging task that requires understanding the content of a document in order to determine which information in the text is important. Automatic summarization is the process of condensing a text document with an intelligent system to form a short summary that retains the most important information of the text. Lexical cohesion is a way of identifying connected portions of a text according to the relations between the words in it. Lexical cohesive relations between words in a document can be described using lexical chains. Lexical chains are applied in various Natural Language Processing (NLP) and Information Retrieval (IR) applications. The fundamental task in constructing lexical chains is to extract the most appropriate candidate keywords from the text, because the chains that contain the salient portions of the document depend on them.
Thus, we extend our research to automatic keyword extraction, which extracts candidate terms from the given text before summarization. Keywords are a set of words or keyphrases that capture the primary information or topic discussed in the text. Keywords are widely used to define queries in IR systems because they are easy to define, revise, remember, and share. Furthermore, keywords are essential Search Engine Optimization (SEO) elements for every search engine, since they are matched against the keywords in users' search queries. They can support document browsing by providing a short summary, improve information retrieval, and be used to generate indexes for a large text corpus.
In this thesis, we propose a new approach to automatic text summarization using lexical chains built from semantically related keywords. A new keyword extraction method is also implemented in order to construct efficient lexical chains. Our experimental results show that building lexical chains from the extracted candidate keywords, instead of from every noun phrase in the text, delivers a better summary with better performance. Thus, we have built a system to extract promising keywords from the text based on the characteristics of manually assigned keywords. The system consists of a generator that produces the probability distributions of three distinct features of assigned keywords: the occurrence of a term in the title, the occurrence of a term in the first sentence, and a high average TF-IDF score of the term. Those distributions are then applied to a three-stage Markov chain process to assign a label to each N-gram term in the text. The extracted unigram candidate keywords are selected to build the lexical chains of the text. Three distinct relation criteria from WordNet, namely hypernymy, hyponymy, and synonymy, are used to connect each term to the other candidate keywords. We then apply a method to score the chains and extract the salient portions of the document. Our experimental results show that efficient summaries can be extracted for the above tasks.
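The abstract does not spell out how the three feature distributions feed the Markov chain, so the following is only a minimal sketch of one plausible reading. It assumes each term is described by three binary features (in title, in first sentence, high average TF-IDF), fits one three-stage chain over these features for manually assigned keywords and one for all other terms, and labels a term by comparing the two chain likelihoods. The function names, the Laplace smoothing, and the toy training rows are all hypothetical and are not taken from the thesis.

```python
def normalize(counts):
    """Turn a list of non-negative counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def fit_chain(rows, alpha=1.0):
    """Estimate a three-stage chain over binary features (f1 -> f2 -> f3).

    rows  : iterable of (in_title, in_first_sentence, high_tfidf) tuples of 0/1
    alpha : Laplace smoothing constant (an assumption of this sketch)
    Returns (start, trans) where trans[stage][prev][next] is a probability.
    """
    start = [alpha, alpha]
    trans = [[[alpha, alpha], [alpha, alpha]] for _ in range(2)]
    for f1, f2, f3 in rows:
        start[f1] += 1
        trans[0][f1][f2] += 1
        trans[1][f2][f3] += 1
    return normalize(start), [[normalize(r) for r in stage] for stage in trans]


def chain_likelihood(model, features):
    """Probability of a feature sequence under one fitted chain."""
    start, trans = model
    f1, f2, f3 = features
    return start[f1] * trans[0][f1][f2] * trans[1][f2][f3]


def keyword_posterior(features, kw_model, other_model, prior_kw):
    """Posterior probability that a term with these features is a keyword."""
    p_kw = prior_kw * chain_likelihood(kw_model, features)
    p_other = (1 - prior_kw) * chain_likelihood(other_model, features)
    return p_kw / (p_kw + p_other)


# Hypothetical training rows: features of manually assigned keywords
# and of terms that were not assigned as keywords.
KEYWORD_ROWS = [(1, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
OTHER_ROWS = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 0, 0), (1, 0, 0), (0, 0, 0)]

kw_model = fit_chain(KEYWORD_ROWS)
other_model = fit_chain(OTHER_ROWS)
prior_kw = len(KEYWORD_ROWS) / (len(KEYWORD_ROWS) + len(OTHER_ROWS))

if __name__ == "__main__":
    for feats in [(1, 1, 1), (0, 0, 0)]:
        print(feats, round(keyword_posterior(feats, kw_model, other_model, prior_kw), 3))
```

A term would be labeled a candidate keyword when its posterior exceeds a chosen threshold such as 0.5; the threshold is likewise an assumption of this sketch.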
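Summarizing a document means removing redundancy while maintaining the coherence of the document and producing condensed information. Automatic document summarization is a technique that uses a computer to retain the important parts of a document and remove redundant content, so that large volumes of documents can be processed automatically and efficiently.
Lexical cohesion is a method of analyzing the relations between the words that appear in a document, based on lexical relations such as hypernym-hyponym relations and synonym relations. These lexical cohesion relations can be represented using lexical chains. Lexical chains are widely used in Natural Language Processing and Information Retrieval, and because the extraction of appropriate candidate keywords determines the performance of the program, extracting appropriate candidate keywords is the most important task in constructing lexical chains.
This study concerns an improved automatic document summarization method based on lexical chains of candidate keywords constructed by considering the semantic relatedness of words, and it proposes a new keyword extraction method for constructing efficient lexical chains. To extract keywords from a document, this thesis defines three keyword features: words that appear in the title, words that appear in the first sentence of the document, and words with a high TF-IDF weight. Transition matrices are generated from the conditional probability values of the keywords and applied to a Markov chain to extract candidate keywords.
The extracted candidate keywords are linked to one another by considering the hypernym-hyponym and synonym relations defined in WordNet to form lexical chains, which are then used to perform automatic document summarization. According to the experimental results of this thesis, constructing lexical chains from the candidate keywords extracted by the proposed method showed better performance than constructing lexical chains from all noun phrases in the document, and the results demonstrate that documents can be summarized more efficiently.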
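The chain-building and sentence-selection steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis implementation: a small hand-written relation table (`RELATED`, hypothetical) stands in for WordNet lookups so that the example runs without external data, chains are grown greedily, and the chain score uses the commonly cited length-times-homogeneity heuristic, which the abstract does not confirm as the method used.

```python
import re
from collections import Counter

# Hypothetical stand-in for WordNet: unordered word pairs and their relation.
RELATED = {
    frozenset(("car", "automobile")): "synonym",
    frozenset(("car", "vehicle")): "hypernym/hyponym",
    frozenset(("truck", "vehicle")): "hypernym/hyponym",
}


def related(a, b):
    """True if two words are identical or linked in the relation table."""
    return a == b or frozenset((a, b)) in RELATED


def build_chains(words):
    """Greedily attach each word occurrence to the first chain that
    contains a related member; otherwise start a new chain."""
    chains = []
    for word in words:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    return chains


def chain_score(chain):
    """Assumed scoring: length * homogeneity, where
    homogeneity = 1 - (distinct members / length)."""
    length = len(chain)
    return length * (1 - len(set(chain)) / length)


def summarize(sentences, candidates):
    """Return the first sentence containing the most frequent member
    of the highest-scoring chain, or an empty string if none exists."""
    occurrences = [
        (token, index)
        for index, sentence in enumerate(sentences)
        for token in re.findall(r"[a-z]+", sentence.lower())
        if token in candidates
    ]
    if not occurrences:
        return ""
    best = max(build_chains([w for w, _ in occurrences]), key=chain_score)
    representative = Counter(best).most_common(1)[0][0]
    return next(sentences[i] for w, i in occurrences if w == representative)


SENTENCES = [
    "The car stopped near the market.",
    "An automobile is a vehicle.",
    "She bought an apple.",
    "The truck and the car left.",
]
CANDIDATES = {"car", "automobile", "vehicle", "apple", "truck"}

if __name__ == "__main__":
    print(summarize(SENTENCES, CANDIDATES))
```

In a fuller setup the relation table could be replaced by real WordNet lookups, for example through the WordNet interface in NLTK, which the thesis lists among its preprocessing tools.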
Alternative Title
An Improved Lexical Chain Method for Automatic Text Summarization using Semantic Related Terms
Alternative Author(s)
Htet Myet Lynn
Affiliation
Department of Computer Engineering
Department
Graduate School, Department of Computer Engineering
Advisor
김판구
Awarded Date
2016-08
Table Of Contents
TABLE OF CONTENTS i
LIST OF FIGURES iii
LIST OF TABLES iv
ABSTRACT v
ABSTRACT (KOREAN) vii
Ⅰ. INTRODUCTION 1
A. Motivation 1
B. Outline 2

Ⅱ. BACKGROUND CONCEPTS 3
A. Summarization 3
1. Existing Approaches 5
(i) Naive Bayes Methods 5
(ii) Hidden Markov Models 6
(iii) Neural Networks and Third Party Features 7
B. Lexical Chain 8
1. WordNet 3.1 10
C. Automatic Keyword Extraction 14
1. Motivation 15
2. Existing Approaches 16
(i) Linguistics Approaches 17
(ii) Machine Learning Approaches 18
(iii) Mixed Approaches 19

Ⅲ. PROPOSED METHOD & SYSTEM IMPLEMENTATION 20
A. Data Collection 22
B. Document Preprocessing and Methods 24
1. Natural Language Toolkit (NLTK) 24
2. Sentence Segmentation 26
3. Stop Word Removal 27
4. Word Tokenization 27
5. Part-of-Speech (POS) Tagging 28
6. Noun Phrase Extraction 29
C. Feature Extraction 31
D. Constructing Transition Probability Distribution Generator (TPDG) 32
E. Assigning Keywords with Markov Chain using TPDG 35
F. Building Lexical Chains 37
G. Scoring Chains 40
H. Strong Chain Selection 40
I. Sentence Selection 41

Ⅳ. EXPERIMENTAL EVALUATION 42
A. Evaluation Measures for Keyword Extraction 42
1. Experimental Results 42
B. Evaluation Measures for Summarization 46
1. ROUGE 47
2. Experimental Results 47

Ⅴ. CONCLUSION AND FUTURE WORK 52

REFERENCES 54
Degree
Master
Publisher
Chosun University
Citation
택미얏린. (2016). 단어 의미적 연관성을 고려한 개선된 어휘체인기반의 자동 문서요약 방법.
Type
Dissertation
URI
https://oak.chosun.ac.kr/handle/2020.oak/12891
http://chosun.dcollection.net/common/orgView/200000265662
Appears in Collections:
General Graduate School > 3. Theses(Master)
Authorize & License
  • Authorize: Open
  • Embargo: 2016-08-25
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.