CHOSUN

An Unsupervised Part-of-Speech Tagging Method for Nepali (네팔어를 위한 비감독 품사 태깅 방법)

Author(s)
가우탐 디페쉬
Issued Date
2008
Abstract
Part-of-speech (POS) tagging is also known as morphosyntactic categorization or syntactic word-class tagging [1]. Most tagging approaches rely on a large pre-tagged training corpus, and the availability of such a corpus is the major constraint in tagging natural language text. Various corpora have been developed, such as the American National Corpus [30], the Bank of English [31], the British National Corpus [32], and the Helsinki Corpus [33]. However, although many of the world's literary languages, Nepali among them, have large lexical dictionaries and encyclopedias, electronic corpora are still unavailable for many of these rich languages. A method that supports tagging from scratch is therefore needed.
In this thesis we propose a method for POS tagging of Nepali text. In the first step, we manually construct pronoun and particle lexicons, since these classes are small in Nepali. We then use these lexicons to pre-classify pronouns, particles, and the last-occurring verb of each sentence, as every sentence ends with either a verb or a particle. In the next step, we use co-occurring word statistics as features for word clustering, since several studies [14] [15] [16] on natural languages suggest that co-occurring words convey important information for language processing. For feature generation, several of the most frequent words in a large collection of newspaper articles are selected as dimensions of a vector space. The components of each word's feature vector are the numbers of times each dimension word occurs to the left and to the right of that word. These vectors are clustered to group the words in the collection into several syntactic categories, and POS tags from NELRALEC [19] are assigned to each cluster. During tagging, tags that NELRALEC distinguishes by honorific level, gender, or other agreement features are treated as a single syntactic category for the purposes of this thesis, which reduces the number of POS tags. Finally, performance is evaluated by the precision and recall of the result. Despite moderate performance caused by several sources of error, this research is novel in POS tagging of Nepali text, as little research has been done in this area.
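The feature-generation step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `context_vectors` and `cosine`, the parameter `num_dims`, and the placeholder tokens in the usage note are all hypothetical, and cosine similarity is only one possible proximity measure, assumed here for illustration because the abstract does not name the clustering algorithm.

```python
import math
from collections import Counter


def context_vectors(sentences, num_dims):
    """Build left/right co-occurrence count vectors for every word.

    sentences: list of token lists (tokenization is assumed to be done).
    num_dims:  number of most frequent words used as dimensions
               (hypothetical parameter; the abstract gives no value).
    """
    # Dimension words: the num_dims most frequent words in the collection.
    freq = Counter(w for s in sentences for w in s)
    dims = [w for w, _ in freq.most_common(num_dims)]
    index = {w: i for i, w in enumerate(dims)}

    # First num_dims components count dimension words seen immediately to
    # the left; the last num_dims count those seen immediately to the right.
    vectors = {w: [0] * (2 * num_dims) for w in freq}
    for s in sentences:
        for left, right in zip(s, s[1:]):
            if left in index:
                vectors[right][index[left]] += 1
            if right in index:
                vectors[left][num_dims + index[right]] += 1
    return dims, vectors


def cosine(u, v):
    """Cosine similarity, an assumed stand-in for the proximity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

With the placeholder input `[["a", "x", "b"], ["a", "y", "b"], ["a", "x", "b"]]` and `num_dims=2`, the words `x` and `y` occur in the same left and right contexts, so their vectors point in the same direction and a clustering step would be expected to group them together.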
Alternative Title
The Method of Unsupervised POS Tagging for Nepali Language Text
Affiliation
Department of Computer Engineering
Department
General Graduate School, Department of Computer Engineering
Advisor
김판구
Awarded Date
2008-08
Table Of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
I. INTRODUCTION
II. BACKGROUND CONCEPTS
A. TEXT CORPUS
B. LEXICON
C. PARTS OF SPEECH TAGGING
D. WORD SENSE DISAMBIGUATION
E. PROXIMITY MEASURE AND CLUSTERS
1. Similarity Measure
2. Dissimilarity Measure
III. RESEARCHES ON NEPALI LANGUAGE
A. CURRENT STATE OF NEPALI LEXICON
B. NELRALEC PROJECT
C. THE NELRALEC TAGSET
D. CONTEMPORARY NEPALI DICTIONARY
E. DOBHASE (TRANSLATOR)
IV. TAGGING NEPALI LANGUAGE TEXT
A. OVERVIEW
B. TEXT MANIPULATION TOOLS
C. TEXT COLLECTION
D. WORD DICTIONARY
E. BIGRAM AND BIGRAM DICTIONARY
F. FEATURE GENERATION AND CLUSTERING
1. Feature Generation
2. Pre-Classification of Pronouns, Particles and Verbs
3. Clustering Feature Vectors
V. EXPERIMENTAL RESULTS
VI. CONCLUSION AND FUTURE WORKS
REFERENCES
Degree
Master
Publisher
Chosun University Graduate School
Citation
가우탐 디페쉬. (2008). 네팔어를 위한 비감독 품사 태깅 방법 [An Unsupervised Part-of-Speech Tagging Method for Nepali].
Type
Dissertation
URI
https://oak.chosun.ac.kr/handle/2020.oak/7299
http://chosun.dcollection.net/common/orgView/200000236543
Appears in Collections:
General Graduate School > 3. Theses(Master)
Authorize & License
  • Authorize: Open
  • Embargo: 2008-07-18