An Efficient Rule-Based System for Morphological Parsing of Tamil Language
STUDENTS: Karthik S 106106029, Praveen Kumar 106106045, Venkataraman GB 106106073
GUIDE: Dr. V. Gopalakrishnan
Final Semester Project, Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli
May 2010

Tamil Morphological Analysis


Page 1: Tamil Morphological Analysis

An Efficient Rule-Based System for Morphological Parsing of Tamil Language

தமிழ் உருபனியல் ஆய்வு (Tamil Morphological Analysis)

STUDENTS:
Karthik S 106106029
Praveen Kumar 106106045
Venkataraman GB 106106073

GUIDE:
Dr. V. Gopalakrishnan

Final Semester Project
Department of Computer Science and Engineering
National Institute of Technology, Tiruchirappalli

May 2010

Page 2: Tamil Morphological Analysis

Agenda

Overview of the Project
NLP Applications – The Stakeholders
The problem at hand
The proposed solution
◦ Rule-Based Morphological Analysis
◦ Machine Learning
Where does it all fit in?
Need for Tamil Morphological Analysis
Resources Obtained
Implementation Details
Demonstration
Future Scope

12/04/23 National Institute of Technology, Tiruchirappalli


Page 3: Tamil Morphological Analysis

Overview of the Project

Natural Language Processing
Morphological Analysis
Tamil Language


Morphing …

… And in Tamil:

நடந்தான் (he walked)
நடந்தனர் (they walked)
நடக்கின்றாள் (she walks)
நடப்பான் (he will walk)
நடக்கின்றான் (he walks)

Page 4: Tamil Morphological Analysis

NLP Applications – The Stakeholders

WHO ARE THE STAKEHOLDERS?
Natural Language Processing applications like:
◦ Stemming
◦ Machine Translation
◦ Speech Recognition
◦ Information Retrieval


WHY ARE THESE APPLICATIONS THE STAKEHOLDERS?

Page 5: Tamil Morphological Analysis

The problem at hand

Morphological analysis of Tamil involves understanding the structure of a word and its inflections.

AGGLUTINATION IN TAMIL
Agglutination is the morphological process of adding affixes to the base of a word. A typical Tamil verb form has a number of suffixes showing person, number, mood, tense and voice.

INFLECTIONS IN TAMIL


பால் - Gender
எண் - Number
திணை - Class
காலம் - Tense
இடம் - Person

Page 6: Tamil Morphological Analysis

The problem at hand

INFLECTIONS IN TAMIL
Example: vAlntukkontiruntēṉ [வாழ்ந்துகொண்டிருந்தேன்]


vAl (வாழ்): root, "live"
intu (ந்து): voice marker, past tense object voice
kontu (கொண்டு): tense marker, "during"
irunta (இருந்த): aspect marker, past progressive
ēn (ஏன்): person marker, first person, singular
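The breakdown of the example word can be held in a small record type, one record per morpheme. The following is a minimal Python sketch (the project itself was written in Java); the names `Morpheme`, `ANALYSIS`, `lexical_form` and `roles` are illustrative assumptions, and the romanized forms and labels are copied from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str   # romanized form as given on the slide
    role: str   # grammatical role of the morpheme
    gloss: str  # meaning or feature value


# Morphemes of the example word, in the order they occur in the word.
ANALYSIS = [
    Morpheme("vAl", "root", "live"),
    Morpheme("intu", "voice marker", "past tense object voice"),
    Morpheme("kontu", "tense marker", "during"),
    Morpheme("irunta", "aspect marker", "past progressive"),
    Morpheme("ēn", "person marker", "first person, singular"),
]


def lexical_form(morphemes):
    """Join the morpheme forms with hyphens to show the word's components."""
    return "-".join(m.form for m in morphemes)


def roles(morphemes):
    """Return the grammatical roles in word order."""
    return [m.role for m in morphemes]
```

Note that the hyphenated string is only a display of the components; it is not the surface spelling, because the morphemes fuse when the word is written.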

Page 7: Tamil Morphological Analysis

The proposed solution

There are two levels, the lexical level and the surface level. At the surface level, a word is represented in its original orthographic form. At the lexical level, a word is represented by denoting all of its functional components.

RULE-BASED MORPHOLOGICAL ANALYSIS
Analyzing word inflections using the rules specified in Tamil grammar

அன் ஆன் அள் ஆள் அர் ஆர் பம்மார்
அ ஆ குடுதுறு என் ஏன் அல் அன்
அம் ஆம் எம் ஏம் ஓமொ டும்மூர்
கடதற ஐ ஆய் இம்மின் இர் ஈர்
ஈயர் கயவு மென்பவும் பிறவும்
வினையின் விகுதி பெயரினும் சிலவே

(Verse from the grammar text listing the verb suffixes, விகுதி.)


SURFACE LEVEL / LEXICAL LEVEL

நன்னூல் (Nannool)
தொல்காப்பியம் (Tholkaappiyam)

Page 8: Tamil Morphological Analysis

The proposed solution

MACHINE LEARNING APPROACH


1. When the rules are followed strictly, more than one suffix may match a given word, but only one of them is semantically possible.

விகுதி (suffix): படித்து – “உ”; படித்தது – “து” or “உ” ???

The M/L approach helps the system "learn" the correct parse for a word, so that when the same word is processed again, the wrong possibilities are eliminated automatically.

2. Two words might share the same inflectional part.

நடக்கின்றான் படிக்கின்றான்

The system learns the inflectional part of every word. This removes the need to analyse the second word from scratch.
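Both points amount to remembering a decision once it has been made. The slides do not say how the learning is implemented, so the following Python sketch is only a plain lookup cache standing in for it: the class `InflectionMemory`, the `choose` callback, and the romanized strings are all hypothetical.

```python
class InflectionMemory:
    """Remembers which candidate analysis was accepted for an inflectional part."""

    def __init__(self):
        self._chosen = {}  # inflectional part -> accepted analysis

    def resolve(self, inflection, candidates, choose):
        """Return the remembered analysis, or call `choose` once and remember it."""
        if inflection not in self._chosen:
            self._chosen[inflection] = choose(candidates)
        return self._chosen[inflection]


# Usage: two words that share one inflectional part trigger only one decision.
calls = []


def pick_first(candidates):
    """Stand-in for the real conflict resolution step (an assumption)."""
    calls.append(candidates)
    return candidates[0]


memory = InflectionMemory()
options = ["kkinr+aan", "kkin+raan"]  # hypothetical competing analyses
first = memory.resolve("kkinraan", options, pick_first)   # e.g. from the first word
second = memory.resolve("kkinraan", options, pick_first)  # reused for the second word
```

After both calls, `pick_first` has run only once, which is the saving the second point describes.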

Page 9: Tamil Morphological Analysis

Where does it all fit in?


Characters: ப டி த் தா ன்
Word – Tokenization: படித்தான்
Morphological Analysis: படி - த்த் - ஆன்
Sentence Syntax Analysis: அவன் புத்தகத்தைப் படித்தான்
Semantic Analysis: ??? Meaning of the sentence

Page 10: Tamil Morphological Analysis

Need for Tamil Morphological Analysis

ENGLISH vs. TAMIL
I came: நான் வந்தேன்
You came: நீ வந்தாய்
He came: அவன் வந்தான்
She came: அவள் வந்தாள்
They came: அவர்கள் வந்தனர்

TRANSLATION AND SEMANTIC ANALYSIS
அவன் மதுரைக்கு வந்தாள் ("He came to Madurai", but with the feminine verb ending) -- Semantically wrong

To check the semantic correctness of a sentence, morphological analysis is needed.

How should the above sentence be translated?
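The agreement between subject and verb ending can be checked mechanically once the verb's ending has been identified. The Python sketch below covers only the five pronouns and verb forms listed on this slide; the feature tags other than 3SM, the table names, and the function names are assumptions made for illustration.

```python
# Person-number-gender features of the five pronouns on the slide.
PRONOUN_FEATURES = {
    "நான்": "1S",
    "நீ": "2S",
    "அவன்": "3SM",
    "அவள்": "3SF",
    "அவர்கள்": "3P",
}

# Endings of the slide's five verb forms, written as code points so that the
# leading vowel signs are unambiguous.
VERB_ENDINGS = [
    ("\u0bc7\u0ba9\u0bcd", "1S"),   # vowel sign EE + ன்
    ("\u0bbe\u0baf\u0bcd", "2S"),   # vowel sign AA + ய்
    ("\u0bbe\u0ba9\u0bcd", "3SM"),  # vowel sign AA + ன்
    ("\u0bbe\u0bb3\u0bcd", "3SF"),  # vowel sign AA + ள்
    ("\u0ba9\u0bb0\u0bcd", "3P"),   # ன + ர்
]


def verb_features(verb):
    """Return the feature tag of the verb's ending, or None if it is unknown."""
    for ending, features in VERB_ENDINGS:
        if verb.endswith(ending):
            return features
    return None


def agrees(subject, verb):
    """True when the subject pronoun and the verb ending carry the same features."""
    expected = PRONOUN_FEATURES.get(subject)
    return expected is not None and expected == verb_features(verb)
```

With these tables, `agrees("அவன்", "வந்தாள்")` is false, which flags the slide's semantically wrong sentence, while `agrees("அவன்", "வந்தான்")` is true.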

Page 11: Tamil Morphological Analysis

Resources Obtained

EMILLE – CIIL TAMIL MONOLINGUAL CORPUS
Enabling Minority Language Engineering
Collaborative venture of
◦ Lancaster University, UK
◦ Central Institute of Indian Languages (CIIL), Mysore, India
Distributed by the European Language Resources Association [ELRA]

TAMIL WORDNET
The database is a semantic dictionary designed as a lexical network.
Developed by
◦ Department of Linguistics of Tamil University
◦ AU-KBC Research Centre, Chennai
Tamil Wordnet resembles a traditional dictionary. It also contains valuable information about morphologically related words.


Page 12: Tamil Morphological Analysis

Implementation Details - 1


[Flowchart: Input Tamil Word → Check in DB → C-V Segmentation → Root verb? → Backward Scanning of inflections → Classify and Remove Inflection → Output, with a Conflict Resolution (Machine Learning) step and Yes/No branches at the decision boxes]
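The backward-scanning step of the flowchart can be illustrated with a small suffix stripper. This is a Python sketch under stated assumptions, not the project's Java code: the function name, the romanized spellings ("padiththaan", "thth", "aan"), and the fixed order "terminal suffix, then medial marker" are all hypothetical, and conflict resolution is left out.

```python
def strip_inflections(word, roots, idainilai, vigudhi):
    """Scan `word` from the end: remove a terminal suffix, then a medial marker.

    Returns a list of (morpheme, label) pairs, or None if no known root remains.
    """
    if word in roots:  # the input is already a bare root verb
        return [(word, "VERB_ROOT")]
    rest, found = word, []
    for table in (vigudhi, idainilai):  # backward order: last suffix first
        for suffix in sorted(table, key=len, reverse=True):  # prefer the longest match
            if rest.endswith(suffix) and len(rest) > len(suffix):
                found.append((suffix, table[suffix]))
                rest = rest[: -len(suffix)]
                break
    if rest in roots:
        return [(rest, "VERB_ROOT")] + found[::-1]
    return None


# Usage with hypothetical romanized tables.
ROOTS = {"padi"}
IDAINILAI = {"thth": "PAST TENSE"}
VIGUDHI = {"aan": "3SM"}
result = strip_inflections("padiththaan", ROOTS, IDAINILAI, VIGUDHI)
```

Taking the longest matching suffix is one simple way to choose between overlapping candidates; it is a design choice of this sketch rather than something the slides specify.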

Page 13: Tamil Morphological Analysis

Implementation Details - 2


படித்தான்

ப டி த் தா ன்

ப் - அ, ட் - இ, த், த் - ஆ, ன்

ப் அ ட் இ த் த் ஆ ன்

படி < VERB_ROOT >
த்த் < PAST TENSE >
ஆன் < 3SM >
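The consonant-vowel (C-V) segmentation shown in this worked example can be reproduced directly on Unicode code points. The Python sketch below is an illustration only: the function and table names are assumptions, and it expects precomposed (NFC) two-part vowel signs.

```python
PULLI = "\u0bcd"       # the dot that marks a consonant with no vowel
INHERENT_A = "\u0b85"  # அ, the vowel implied by a bare consonant letter

# Dependent vowel sign -> independent vowel letter.
SIGN_TO_VOWEL = {
    "\u0bbe": "\u0b86",  # ா -> ஆ
    "\u0bbf": "\u0b87",  # ி -> இ
    "\u0bc0": "\u0b88",  # ீ -> ஈ
    "\u0bc1": "\u0b89",  # ு -> உ
    "\u0bc2": "\u0b8a",  # ூ -> ஊ
    "\u0bc6": "\u0b8e",  # ெ -> எ
    "\u0bc7": "\u0b8f",  # ே -> ஏ
    "\u0bc8": "\u0b90",  # ை -> ஐ
    "\u0bca": "\u0b92",  # ொ -> ஒ
    "\u0bcb": "\u0b93",  # ோ -> ஓ
    "\u0bcc": "\u0b94",  # ௌ -> ஔ
}


def is_consonant(ch):
    """True for code points in the Tamil consonant range (க through ஹ)."""
    return "\u0b95" <= ch <= "\u0bb9"


def cv_segment(word):
    """Split a Tamil word into pure consonants and independent vowels."""
    units, i = [], 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if is_consonant(ch):
            units.append(ch + PULLI)  # the consonant on its own
            if nxt == PULLI:          # already a pure consonant
                i += 2
            elif nxt in SIGN_TO_VOWEL:  # consonant followed by a vowel sign
                units.append(SIGN_TO_VOWEL[nxt])
                i += 2
            else:                     # bare letter: inherent vowel அ
                units.append(INHERENT_A)
                i += 1
        else:
            units.append(ch)          # independent vowel or other character
            i += 1
    return units
```

Applied to படித்தான், the function yields the eight units of the slide's fourth line, which is what lets the suffix ஆன் be matched even though it is written fused with the preceding consonant.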

Page 14: Tamil Morphological Analysis

Implementation Details - 3

UNICODE SUPPORT FOR TAMIL
U+0B80 – U+0BFF

GOOGLE TAMIL TRANSLITERATOR IME (Input Method)
Google Transliteration IME is an input method editor that lets users enter Tamil text using a Roman keyboard.

PROGRAMMING LANGUAGE
Java

DATABASES
MySQL databases, with JDBC to access the database
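The Unicode range above makes it easy to validate input before analysis. A minimal Python sketch follows (the project itself used Java); the function names are illustrative assumptions.

```python
TAMIL_BLOCK = range(0x0B80, 0x0C00)  # U+0B80 through U+0BFF inclusive


def is_tamil_char(ch):
    """True when the character's code point lies in the Tamil block."""
    return ord(ch) in TAMIL_BLOCK


def is_tamil_word(word):
    """True for a non-empty string made only of Tamil-block characters."""
    return len(word) > 0 and all(is_tamil_char(ch) for ch in word)
```

A check like this only tests the block membership of each character; it does not verify that the sequence is a well-formed Tamil word.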


Page 15: Tamil Morphological Analysis

Implementation Details - 3

TRANSLITERATION MODULE
A simple transliterator module, to enable conversion from Tamil to English and vice versa.
Example:

◦ அ - a

◦ ஆ - aa

◦ க - ka
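A transliterator of this kind can be sketched as a table lookup over letters and vowel signs. The Python sketch below handles only the Tamil-to-Roman direction and a handful of letters; apart from a, aa and ka, which come from the slide, the Roman spellings and all names are assumptions.

```python
PULLI = "\u0bcd"  # marks a consonant that carries no vowel

VOWELS = {"\u0b85": "a", "\u0b86": "aa", "\u0b87": "i"}  # அ, ஆ, இ
SIGNS = {"\u0bbe": "aa", "\u0bbf": "i"}                  # ா, ி
CONSONANTS = {
    "\u0b95": "k",   # க
    "\u0b9f": "d",   # ட (assumed spelling)
    "\u0ba4": "th",  # த (assumed spelling)
    "\u0ba9": "n",   # ன
    "\u0baa": "p",   # ப
}


def to_roman(text):
    """Transliterate Tamil text to Roman letters using the small tables above."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt == PULLI:    # pure consonant: no vowel to add
                i += 2
            elif nxt in SIGNS:  # consonant plus vowel sign
                out.append(SIGNS[nxt])
                i += 2
            else:               # bare consonant letter carries the vowel "a"
                out.append("a")
                i += 1
        else:
            out.append(VOWELS.get(ch, ch))  # unknown characters pass through
            i += 1
    return "".join(out)
```

The reverse direction would need a longest-match scan over the Roman string, since "aa" must be tried before "a"; that part is not shown here.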

HASH TABLE GENERATOR
The application uses two data files, containing the lists of vigudhi (terminal suffixes) and idainilai (medial markers). The Java hash generator code loads the data from the workbooks, adds it to a hash table, serializes the table, and writes it to an external data file that can be loaded whenever the application needs it.
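The build-once, load-many pattern described here can be sketched as follows. This is a Python analogue using `pickle` in place of Java serialization; reading the workbooks is not shown, and the function names and the shape of the rows are assumptions.

```python
import pickle


def build_table(rows):
    """Build a suffix -> label hash table from (suffix, label) rows."""
    return {suffix: label for suffix, label in rows}


def save_table(table, path):
    """Serialize the table to an external data file."""
    with open(path, "wb") as fh:
        pickle.dump(table, fh)


def load_table(path):
    """Load a previously serialized table back into memory."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Serializing the finished table means the application pays the cost of parsing the source lists once, and later runs only deserialize a ready-made lookup structure.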


Page 16: Tamil Morphological Analysis

Future Scope

The algorithm can be extended to cover nouns and noun forms too.

The algorithm can be improved to incorporate stricter rules so as to reduce conflicts that arise in the output generated by the current system.

The algorithm can be extended for other agglutinative languages.

The various resources obtained as a part of this project, including the EMILLE-CIIL ELRA Corpus, the Tamil Wordnet Database and other tools can be used for further study, research and development in the field of Natural Language Processing at our college in the years to come.


Page 17: Tamil Morphological Analysis

References

A Novel Approach to Morphological Analysis for Tamil Language
◦ Anand Kumar M, Dhanalakshmi V, Rajendran S, Soman K P

Nannool and Tholkaapiyam
◦ Tamil grammar texts

The Morphological Generator and Parsing Engine for Tamil Verb Forms
◦ Ultimate Software Solution, Dindigul

Morphological Analyzer for Tamil
◦ Anandan P, Ranjani Parthasarathy, Geetha T.V. [2002]
◦ ICON 2002, RCILTS-Tamil, Anna University, India

Morphology. A Handbook on Inflection and Word Formation
◦ Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.) [2004]

Tamil Part-of-Speech tagger based on SVMTool
◦ Dhanalakshmi V, Anandkumar M, Vijaya M.S, Loganathan R, Soman K.P, Rajendran S [2008]
◦ Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP)

Unsupervised Learning of the Morphology of a Natural Language
◦ John Goldsmith [2001]
◦ Computational Linguistics, 27(2):153–198

Computational morphology of verbal complex
◦ Rajendran, S., Arulmozi, S., Ramesh Kumar, Viswanathan, S. [2001]
◦ Paper read at a conference at Dravidian University, Kuppam, December 26-29, 2001

Page 18: Tamil Morphological Analysis

Thank you
