Method of Measuring Paper Plagiarism Similarity Using Synonym Weighting (유의어 가중치를 적용한 논문 표절 유사도 측정 방법)

Author(s)
김민범
Issued Date
2018
Keyword
synonym, plagiarism, similarity, weight
Abstract
With the spread of the Internet, a wide range of information can be found easily, without restrictions of time or place. However, this information is sometimes abused to plagiarize other people's theses. Plagiarism mostly occurs when the content of another person's paper is used without citation. Common forms include copying text, appropriating ideas, rewriting other people's work, and interweaving material from several sources. CopyKiller, a free service commonly used for plagiarism checking, by default judges a passage to be plagiarized when six word-phrases are identical. This criterion appears insufficient, however, because it can be evaded by rewording.
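The weakness of an exact-match rule can be illustrated with a minimal sketch. It assumes that a "word-phrase" can be approximated by a whitespace-separated token and that the six matching tokens must be consecutive; neither assumption is confirmed by the abstract, the function name `shared_ngrams` is hypothetical, and this is not CopyKiller's actual implementation.

```python
def shared_ngrams(doc_a: str, doc_b: str, n: int = 6) -> set:
    """Return the runs of n consecutive tokens that occur in both documents."""
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return ngrams(doc_a) & ngrams(doc_b)

# Illustrative sentences, not taken from the thesis.
original = "the proposed method improves the accuracy of document similarity checks"
copied = "we show the proposed method improves the accuracy of document retrieval"
reworded = "the suggested approach raises the precision of document similarity tests"

copied_hits = shared_ngrams(original, copied)      # verbatim copy shares six-token runs
reworded_hits = shared_ngrams(original, reworded)  # synonyms break every six-token run
print(len(copied_hits), len(reworded_hits))
```

The reworded sentence conveys the same content as the original but shares no six-token run with it, which is the gap that synonym handling is meant to close.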

An earlier study found that in one document a word A, the name of an algorithm, appeared frequently in the introduction and related studies, whereas another word B, a modification of algorithm A, appeared in the main body. Because the test measured the two words as similar, TF-IDF is modified to apply a weight for each paragraph (introduction, related studies, main body, and conclusion), improving the accuracy of the similarity test for words that look alike but have different meanings.
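One way to read "TF-IDF with a weight by paragraph" is to scale each term occurrence by the weight of the paragraph in which it appears. The sketch below follows that reading. The weight values, the IDF variant `log(N / df) + 1`, and the names `PARAGRAPH_WEIGHTS`, `weighted_tf`, and `weighted_tfidf` are all assumptions made for illustration; the abstract does not give the actual formula.

```python
import math
from collections import Counter

# Hypothetical weights per paragraph type; the abstract states no values.
PARAGRAPH_WEIGHTS = {"introduction": 0.5, "related": 0.5, "body": 1.0, "conclusion": 0.8}

def weighted_tf(doc: dict) -> Counter:
    """doc maps a paragraph type to its tokens; each occurrence counts as that paragraph's weight."""
    tf = Counter()
    for paragraph, tokens in doc.items():
        for token in tokens:
            tf[token] += PARAGRAPH_WEIGHTS[paragraph]
    return tf

def weighted_tfidf(docs: list) -> list:
    """Return one {token: score} dict per document."""
    n_docs = len(docs)
    tfs = [weighted_tf(doc) for doc in docs]
    df = Counter(token for tf in tfs for token in tf)  # document frequency
    return [
        {token: w * (math.log(n_docs / df[token]) + 1.0) for token, w in tf.items()}
        for tf in tfs
    ]

# "algo_a" is mentioned twice in the introduction, "algo_b" once in the main body.
docs = [
    {"introduction": ["algo_a", "algo_a"], "body": ["algo_b"]},
    {"introduction": ["survey"], "body": ["algo_a"]},
]
scores = weighted_tfidf(docs)
print(scores[0])
```

With these assumed weights, the term used once in the main body of the first document outranks the term mentioned twice in its introduction, which is the kind of separation between A and B that the paragraph describes.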

Hence, this paper proposes a method of measuring similarity that finds synonyms among words and applies weights by paragraph, in order to improve the accuracy of document similarity checks. After the data are preprocessed, word vectors are extracted with Doc2Vec, and the words, weighted by paragraph, are clustered with DBSCAN (Density-Based Spatial Clustering of Applications with Noise). The main word in each cluster is set as the representative word, and the other words in the same cluster are set as its synonyms. The distance between the representative word and a synonym is used as the synonym weight. Because the distance value alone is too small to be applied as a weight, an L-R table is created through L-R syntactic parsing; a noun score is calculated from the scores of the candidate nouns and applied to the synonym weight.
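The clustering step can be sketched as follows, under several stated assumptions: hand-written 2-D vectors stand in for Doc2Vec output, the "main word" of a cluster is taken to be the member nearest the cluster centroid (the abstract does not define it), and `eps` and `min_pts` are arbitrary. The DBSCAN routine is a minimal textbook version written out so that the example is self-contained, and the L-R noun score is left out.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(reachable)
    return labels

def synonym_weights(words, vectors, eps=0.5, min_pts=2):
    """Map each representative word to {synonym: distance from the representative}."""
    labels = dbscan(vectors, eps, min_pts)
    result = {}
    for cluster in sorted(set(labels) - {-1}):
        members = [i for i, label in enumerate(labels) if label == cluster]
        dims = range(len(vectors[0]))
        centroid = [sum(vectors[i][d] for i in members) / len(members) for d in dims]
        # Assumption: the representative is the member closest to the centroid.
        rep = min(members, key=lambda i: math.dist(vectors[i], centroid))
        result[words[rep]] = {
            words[i]: math.dist(vectors[rep], vectors[i]) for i in members if i != rep
        }
    return result

# Hypothetical 2-D vectors standing in for Doc2Vec word vectors.
words = ["method", "approach", "technique", "banana"]
vectors = [(0.0, 0.0), (0.3, 0.0), (-0.4, 0.0), (9.0, 0.0)]
weights = synonym_weights(words, vectors)
print(weights)
```

In this toy input the three nearby words form one cluster with "method" as its representative, and the isolated word is labelled as noise, so it receives no synonym entry.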

To evaluate the similarity between documents, similarity is first measured sentence by sentence, and sentences with a similarity of 0.7 or higher are identified as plagiarized. The performance of the proposed method will be verified through a comparative evaluation against existing studies.
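The sentence-level decision can be sketched as below. Only the 0.7 threshold comes from the abstract. The use of cosine similarity over bag-of-words counts is an assumption, and replacing each synonym outright with its representative word is a simplification of the weighted scheme described above. The names `cosine` and `flag_plagiarized` and the synonym table are hypothetical.

```python
import math
from collections import Counter

THRESHOLD = 0.7  # sentence-level cut-off stated in the abstract

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def flag_plagiarized(suspect_sentences, source_sentences, synonyms=None):
    """Return (sentence, best score) for suspect sentences scoring at or above THRESHOLD."""
    synonyms = synonyms or {}

    def bag(sentence: str) -> Counter:
        # Simplification: a synonym is replaced by its representative word.
        return Counter(synonyms.get(t, t) for t in sentence.lower().split())

    flagged = []
    for sentence in suspect_sentences:
        best = max((cosine(bag(sentence), bag(src)) for src in source_sentences), default=0.0)
        if best >= THRESHOLD:
            flagged.append((sentence, best))
    return flagged

# Illustrative data, not taken from the thesis.
source = ["the method improves accuracy"]
suspect = ["the approach raises accuracy", "bananas are yellow"]
synonym_table = {"approach": "method", "raises": "improves"}  # hypothetical synonym pairs

without_synonyms = flag_plagiarized(suspect, source)
with_synonyms = flag_plagiarized(suspect, source, synonym_table)
print(without_synonyms, with_synonyms)
```

Without the synonym table the reworded sentence scores 0.5 and passes unnoticed; with it, the sentence is mapped onto the source wording and crosses the threshold.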
Alternative Title
Method of measuring paper plagiarism similarity using synonym weighting
Alternative Author(s)
Minbeom Kim
Affiliation
Department of Software Convergence Engineering, Graduate School of Industrial Technology Convergence
Department
Department of Software Convergence Engineering, Graduate School of Industrial Technology Convergence
Advisor
신주현
Awarded Date
2018-08
Table Of Contents
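ABSTRACT

Ⅰ. Introduction
A. Research Background and Purpose
B. Research Content and Organization

Ⅱ. Related Studies
A. TF-IDF
B. Word2Vec and Doc2Vec
1. Word2Vec
2. Doc2Vec
C. DBSCAN
D. Similarity Measurement

Ⅲ. Similarity Measurement Method Using Synonym Weighting
A. System Architecture
B. Synonym Extraction
1. Preprocessing
2. Doc2Vec
3. DBSCAN Clustering
4. Weights by Paragraph
C. L-R Syntactic Parsing

Ⅳ. Experiments and Results
A. Similarity Measurement
B. Similarity Evaluation

Ⅴ. Conclusion and Future Work

References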
Degree
Master
Publisher
Chosun University
Citation
Kim, Minbeom. (2018). Method of measuring paper plagiarism similarity using synonym weighting (유의어 가중치를 적용한 논문 표절 유사도 측정 방법).
Type
Dissertation
URI
https://oak.chosun.ac.kr/handle/2020.oak/2004
http://chosun.dcollection.net/common/orgView/200000266968
Appears in Collections:
Engineering > Theses (Master) (Graduate School of Industrial Technology and Entrepreneurship)
Authorize & License
  • Authorize: Open