Virach Sornlertlamvanich
Information R&D Division (iTech)
National Electronics and Computer Technology Center (NECTEC)
THAILAND
19 January 2001
Symposium on Language Resources in Asia
Thai Linguistic Resources
How Important!
Language Processing
[Diagram labels: Top-Down (Linguistic Knowledge, Defining Rules); Bottom-Up (Linguistic Knowledge, Training Resources, Statistical Modeling); Models; Evaluation with Evaluation Resources; Adjust (on both sides)]
• Linguistic resources are necessary in both top-down and bottom-up design
• They are exploitable in both modeling and evaluation
What Do We Need?
Linguistic Resources
• Lexicon / Dictionary (30k)
• Tagged Text (2MB) / Speech Corpora
• Language Model
Fundamental Linguistic Tools
• Word Extraction (ML; p=85%; r=56%)
• Word Segmentation / POS tagger (ML; 96-97%)
• Sentence Segmentation (ML; 85-89%)
• Grapheme-to-Phoneme Conversion (PGLR; 73-90%)
• Word Sense Disambiguation
• Corpus / UNL / UW (concept) Editor
Applications
• MT (ParSit; http://come.to/parsit) / UNL
• Text Summarization
• Speech Recognition / Synthesis
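The word extraction figures above can be folded into a single score. The sketch below assumes that "p" and "r" on the slide stand for precision and recall, and combines them with the balanced F-measure; the function name `f_measure` is hypothetical and the slide itself does not report an F-score.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted on the slide for word extraction (assumed to be
# precision = 85% and recall = 56%).
word_extraction_f1 = f_measure(0.85, 0.56)
print(f"{word_extraction_f1:.3f}")  # prints 0.675
```

Under that reading, the recall of 56% pulls the combined score well below the 85% precision, which suggests that missed words, rather than false positives, are the main limitation of the extractor.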
Our Workbench …
[Diagram: workbench built on Raw Text]
Linguistic Resources: Lexicon, Corpus-based Dictionary, Interlingual Concept, Language Model, XML Tagged Corpus (prosody coverage, phonetically balanced, vocabulary coverage)
Linguistic Tools: Word Extraction, Corpus Editor, Word Segmentation, POS Tagging, Sentence Extraction, Grapheme-to-Phoneme, Word Disambiguation
Applications: UNL, Machine Translation, Text Summarization, Speech Recognition, Speech Synthesis
Open Linguistic Resources
• LEXiTRON v 1.1 (a corpus-based T-E dictionary, 1994)
- About 11,000 Thai entries; 9,000 English entries
- http://www.links.nectec.or.th/lexit
• ORCHID POS-Tagged Corpus (supported by CRL, 1997)
- 160 documents; 2MB text; 400K words
- XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
- http://www.links.nectec.or.th/orchid
• Thai Royal Institute Dictionary (T-T dictionary)
- Basic term 32,000 entries
- Technical term 15,339 entries
- http://www.royin.go.th/
• ParSit (http://come.to/parsit, 2000)
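As an illustration of how a corpus tagged for paragraph, sentence, word, and part-of-speech might be consumed, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names (`document`, `paragraph`, `sentence`, `word`), the `pos` attribute, the tag values, and the placeholder words are all assumptions made for the example; they are not the actual ORCHID schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element and attribute names are illustrative only.
SAMPLE = """
<document>
  <paragraph>
    <sentence>
      <word pos="NCMN">w1</word>
      <word pos="VACT">w2</word>
    </sentence>
    <sentence>
      <word pos="NCMN">w3</word>
    </sentence>
  </paragraph>
</document>
"""


def corpus_stats(xml_text: str) -> dict:
    """Count paragraphs, sentences, words, and POS tags in one document."""
    root = ET.fromstring(xml_text.strip())
    words = root.findall(".//word")
    return {
        "paragraphs": len(root.findall(".//paragraph")),
        "sentences": len(root.findall(".//sentence")),
        "words": len(words),
        "pos_counts": dict(Counter(w.get("pos") for w in words)),
    }


print(corpus_stats(SAMPLE))
```

Counts of this kind are what figures such as "160 documents; 400K words" summarize at corpus scale.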
Ongoing : Thai Speech Corpus #1
Scope (2001)
• Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
- Phonetically-balanced sentences
- 5K vocabulary coverage sentences
• Corpus for Text-to-Speech Synthesis
- 400 phonetically and prosodically balanced sentences
- For probabilistic prosody generation
• Dialog speech corpus (collaboration with ATR)
- 50 conversations, 2,099 sentences
- 5,000 words, 866 phonetically-balanced sentences
- 40 speakers (males and females)
Ongoing : Thai Speech Corpus #2
Procedure
[Diagram: Raw Text is processed by Word Segmentation, Sentence Extraction, POS Tagging, and Grapheme-to-Phoneme conversion, together with the Corpus Editor, into an XML Tagged Corpus. A Sentence Selection Process (phonetically balanced, vocabulary coverage, prosody-balanced) is followed by Speech Recording and Tagging, giving the Tagged Speech Corpus.]
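The slides do not say how the sentence selection step is carried out. One common way to pick a small sentence set that covers many phonetic units or vocabulary items is greedy coverage selection, sketched below under that assumption. The function name `select_sentences`, the sentence identifiers, and the unit labels are all hypothetical.

```python
def select_sentences(candidates: dict, max_sentences: int):
    """Greedy coverage: repeatedly pick the sentence that adds the most
    not-yet-covered units (e.g. phonemes or vocabulary items).

    candidates maps a sentence id to the set of units it contains.
    Returns the selected ids in pick order and the units left uncovered.
    """
    uncovered = set().union(*candidates.values())
    remaining = dict(candidates)
    selected = []
    while uncovered and remaining and len(selected) < max_sentences:
        # Sentence with the largest gain; ties go to the earliest candidate.
        best = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        gain = remaining[best] & uncovered
        if not gain:
            break
        selected.append(best)
        uncovered -= gain
        del remaining[best]
    return selected, uncovered


# Toy input: letters stand in for phonetic units.
toy = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"a", "b"},
    "s4": {"d", "e"},
}
print(select_sentences(toy, max_sentences=3))  # (['s1', 's4'], set())
```

The same routine can serve the vocabulary-coverage criterion by using words instead of phonetic units as the covered items; balancing unit frequencies, as opposed to merely covering them, would need a different scoring function.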
Ongoing : Thai Speech Corpus #3
Tools
Plain Text
Corpus Editor
XML Corpus
Ongoing : Thai Speech Corpus #4
Text Sources
• Technology Promotion Association (Thailand-Japan)
• Amarin Printing Co., Ltd.
• Matichon Public Co., Ltd.
Project Collaboration
• Kasetsart University
• Thammasat University
• King Mongkut's University of Technology Thonburi
• Prince of Songkla University
Ongoing : Thai Speech Corpus #5
Corpus | JNAS | TIMIT | WSJCAM0 | NECTEC (2001-2006)
Vocab size | 5K, 20K | - | 20K, 64K | 20K
# sent (PB) | 503 | 450 | < 1,500 | < 866
# sent (Vocab) | < 15,000 | 1,890 | < 14,000 | < 10,000
# speaker | 306 | 630 | 140 | 200
# sent/speaker | 150 (100 Vocab + 50 PB) | 10 | 100 (Vocab + PB) | 100 (80 Vocab + 20 PB)
Record time | 60 hrs. (16 CDROM) | 1 CDROM | - | 1GB
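To compare the corpora by overall size, the speaker counts and sentences-per-speaker figures can be multiplied out. This is a back-of-the-envelope sketch that assumes every speaker reads exactly the listed number of sentences and that the per-speaker figures are assigned to the corpora in the order JNAS, TIMIT, WSJCAM0, NECTEC; the totals are derived here and do not appear on the slide.

```python
# (speakers, sentences per speaker), read from the comparison above.
corpora = {
    "JNAS": (306, 150),
    "TIMIT": (630, 10),
    "WSJCAM0": (140, 100),
    "NECTEC (planned)": (200, 100),
}


def total_utterances(speakers: int, per_speaker: int) -> int:
    """Total recorded utterances if each speaker reads per_speaker sentences."""
    return speakers * per_speaker


totals = {name: total_utterances(*spec) for name, spec in corpora.items()}
for name, total in totals.items():
    print(f"{name}: {total:,} utterances")
```

On these assumptions the planned NECTEC corpus (20,000 utterances) would sit between WSJCAM0 (14,000) and JNAS (45,900).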
Ongoing : LEXiTRON v 2.0 #1
Scope (2001)
• Entries
- 25,000 Thai - English
- 25,000 English - Thai
• Fields
- Translation
- Phonetics
- Root of vocabulary
- Part-of-speech
- Synonym
- Antonym
- Sentence sample
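The field list above maps naturally onto a record type. The following is a minimal sketch of such a record, not the actual LEXiTRON database schema: the class name `DictionaryEntry`, the attribute names, and the sample entry are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DictionaryEntry:
    """One bilingual entry carrying the fields listed on the slide."""
    headword: str
    translations: List[str]
    phonetics: str = ""
    root: str = ""
    part_of_speech: str = ""
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    sample_sentences: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of fields still empty, useful during dictionary editing."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative English - Thai entry (not taken from LEXiTRON).
entry = DictionaryEntry(
    headword="big",
    translations=["ใหญ่"],
    part_of_speech="adjective",
    antonyms=["small"],
)
print(entry.missing_fields())
# ['phonetics', 'root', 'synonyms', 'sample_sentences']
```

A completeness check like `missing_fields` is one way an editing tool could flag entries that still need phonetics or corpus-based sample sentences.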
Procedure
[Diagram: Word Extraction from Raw Text and an Existing Dictionary feed Vocabulary Selection; Dictionary Editing then draws on an Existing Dictionary and Corpus-based Sentence Samples to produce LEXiTRON v 2.0.]
Ongoing : LEXiTRON v 2.0 #2
Tools
• Dictionary DB
• Phonetic Symbols
• WordNet
• Corpus-based Sample Sentences
Discussion
• Language difficulties; 13 Tai-family languages
• Text sources
• Common tagset
• Resource center
• Institutional collaboration