Method of Related Document Recommendation considering Semantic Relation between Words (단어 의미 관계를 고려한 연관 문서 추천 방법)
- Issued Date
- Information retrieval is a representative technology of the information society, and the ranking of search results has been shown to influence people. Various studies have addressed related-document recommendation in the field of information retrieval. Approaches based on knowledge resources, such as term dictionaries and ontologies, require human intervention and incur construction and maintenance costs. Simple frequency-based approaches, such as TF-IDF, do not take the context of words and sentences into account and cannot interpret new words, which reduces search efficiency.
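The limitation of frequency-based ranking noted above can be illustrated with a minimal sketch. The toy corpus, the `tf_idf` and `score` helpers, and the particular TF-IDF variant (relative term frequency times the log of inverse document frequency) are assumptions made for illustration; they are not taken from the thesis.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def score(query, doc_weights):
    """Sum the weights of query terms; a term absent from the document adds 0."""
    return sum(doc_weights.get(term, 0.0) for term in query)


# Hypothetical toy corpus in which "car" and "automobile" are synonyms.
docs = [
    ["car", "engine", "repair"],
    ["automobile", "engine", "recall"],
    ["election", "vote", "result"],
]
weights = tf_idf(docs)
scores = [score(["car"], w) for w in weights]
print(scores)
```

Because the lookup is an exact string match, the second document scores zero for the query "car" even though it is about the same subject, which is the kind of semantic gap the abstract attributes to frequency-based methods.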
News is reported by many media companies and covers a wide range of information. Because it deals with many different incidents, and a single topic is composed of many different keywords, providing customized information is difficult.
Therefore, in this paper, we use Korean news articles to extract the topic distributions in documents and the word distribution vectors in topics through LDA-based topic modeling. We then vectorize words with Word2vec and generate a weight matrix to derive a relevance SCORE that considers the semantic relationships between words. We propose recommending documents in descending order of this score.
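The abstract does not state the SCORE formula, so the following is only a sketch of one way the described pieces could be combined. It assumes that a document's score is the sum, over topics and topic words, of P(topic | document) × P(word | topic) × cosine similarity between the query keyword and the word. The vectors, the distributions, and the `relevance_score` function are all hypothetical stand-ins for trained LDA and Word2vec outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# ASSUMPTION: hand-set toy values standing in for trained model outputs.
word_vectors = {  # would come from Word2vec
    "election": [0.9, 0.1],
    "vote": [0.8, 0.2],
    "stock": [0.1, 0.9],
    "market": [0.2, 0.8],
}
topic_word = [  # P(word | topic), would come from LDA
    {"election": 0.6, "vote": 0.4},
    {"stock": 0.5, "market": 0.5},
]
doc_topic = {  # P(topic | document), would come from LDA
    "doc_politics": [0.9, 0.1],
    "doc_finance": [0.2, 0.8],
}


def relevance_score(query, topic_dist):
    """ASSUMED formula: weight each topic word by its similarity to the query."""
    total = 0.0
    for p_topic, words in zip(topic_dist, topic_word):
        for word, p_word in words.items():
            sim = cosine(word_vectors[query], word_vectors[word])
            total += p_topic * p_word * sim
    return total


# Recommend documents in descending order of score for the query "vote".
ranking = sorted(
    doc_topic,
    key=lambda name: relevance_score("vote", doc_topic[name]),
    reverse=True,
)
print(ranking)
```

In this sketch the similarity values play the role of the weight matrix: words that are semantically close to the query raise a document's score even when the query word itself carries little probability in the document's dominant topic.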
Recommending documents by relevance SCORE confirmed that documents in the higher SCORE range consist of words that are highly related to the query keyword. We evaluated recommendation performance with the TextRank algorithm, which measures the importance of words within a document: the higher the relevance SCORE, the higher the importance of the query keyword. This verified the suitability of the proposed method. A comparative experiment shows that semantic-based document recommendation is more effective than the existing document ranking methods TF-IDF and LDA.
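For readers unfamiliar with TextRank, a simplified sketch of word-importance scoring is given here. It runs PageRank-style updates over an unweighted co-occurrence graph; the window size, damping factor, iteration count, and token list are illustrative assumptions, and the thesis may configure the algorithm differently (for example with part-of-speech filtering).

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Score words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the distinct words within the next `window` positions.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's current score.
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return scores


# Hypothetical tokenized document.
tokens = ["election", "vote", "candidate", "election", "poll", "election", "result"]
scores = textrank(tokens)
top = max(scores, key=scores.get)
print(top, round(scores[top], 3))
```

An evaluation along the lines described above would compare the TextRank score of the query keyword across documents drawn from different relevance SCORE ranges.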
The proposed method can improve document search performance by resolving semantic ambiguity, provide customized information by recommending the documents most relevant to a user's keywords, and make it easier to identify the events associated with each keyword within the same topic.
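Information retrieval is a technology that represents the information society, and the ranking of search results has been shown to influence people. Various studies on related-document recommendation in information retrieval are under way. Existing approaches based on knowledge resources, such as term dictionaries and ontologies, require human intervention, construction costs, and maintenance. Simple frequency-based approaches such as TF-IDF reduce search efficiency because they do not account for polysemous words, which have several meanings, or for synonyms, which differ in form but share the same meaning.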
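News is reported through many media outlets and is characterized by a broad and vast range of information. Because it covers diverse incidents, it encompasses many topics, and a single topic consists of a great variety of keywords, which limits the ability to provide user-customized information.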
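Therefore, to resolve the semantic ambiguity of words and enable efficient information retrieval, this paper proposes the following method: using Korean news articles, extract the topic distribution within documents and the word distribution vectors within topics through topic modeling based on LDA (Latent Dirichlet Allocation); vectorize words with Word2vec; generate a weight matrix to expand the keywords and derive a relevance SCORE; and recommend documents in descending order of score.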
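Document recommendation by relevance SCORE confirmed that the higher a document's relevance SCORE range, the more it consists of words highly related to the query keyword. In a performance evaluation using the TextRank algorithm, which indicates the importance of words within a document, the query keyword had higher importance the larger the relevance SCORE value. Keywords with high importance in the upper relevance SCORE range had low importance in the lower range, and vice versa, which confirmed the suitability of the proposed method. A comparative experiment further showed that the method enables more effective semantic-based document recommendation than the existing document ranking methods TF-IDF and LDA.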
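The proposed method can improve document search performance by resolving semantic ambiguity, provide user-customized information by recommending the documents most relevant to the keywords a user wants, and make it easier to identify the events related to each keyword within the same topic.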
- Alternative Title
- Method of Related Document Recommendation considering Semantic Relation between Words
- Alternative Author(s)
- Kim, Seon Mi
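- Department of Software Convergence Engineering, Graduate School of Industrial Technology Convergence (산업기술융합대학원 소프트웨어융합공학과)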
- Awarded Date
- 2019. 2
- Table Of Contents
I. Introduction
A. Research Background and Purpose
B. Research Content and Organization
II. Related Work
A. TF-IDF
B. Topic Modeling
1. LDA
C. Word2vec
D. PageRank and TextRank
1. PageRank
2. TextRank
III. Related Document Recommendation Considering Semantic Relations Between Words
A. System Architecture
B. LDA-Based Topic Modeling with TF-IDF Weighting
1. Data Collection and Preprocessing
2. Applying TF-IDF Weights
3. LDA-Based Topic Modeling
C. Extracting Similarity Between Words
1. Word2vec
D. Deriving the Relevance SCORE
IV. Experiments and Evaluation
A. Experimental Results and Performance Evaluation
B. Relevance Comparison Experiment
V. Conclusion and Suggestions
- Graduate School of Industrial Technology Convergence, Chosun University (조선대학교 산업기술융합대학원)
- Kim, Seon Mi. (2018). Method of Related Document Recommendation considering Semantic Relation between Words (단어 의미 관계를 고려한 연관 문서 추천 방법)
Appears in Collections:
- Engineering > Theses(Master)(산업기술창업대학원)