Where do we stand? MT development, research, and deployment in Asia
Key-Sun Choi (KAIST), AAMT
http://www.asianlp.org/
http://www.afnlp.org/
http://korterm.org/
3
MT in China – 1980-1990’s
Goal: to translate scientific documents from Russian and Western languages
Supported by the government; no private companies in the early stage
TRANS-STAR: 30,000 words/hour on a 386 PC.
Basic dictionary: 40,000 entries
10 specialized technical dictionaries: 350,000 entries
Subject fields: computer, economics, telecommunication, ceramics, thermal power industry, printing machine industry, automobile/tractor industry, petroleum prospecting, geology, chemical industry
4
MT in China – Present: English-to-Chinese
GAOLI: developed jointly by Beijing GAOLI Computer Co. Ltd. and the Linguistics Institute of CASS.
Basic lexical dictionary: 60,000 entries, in which the usage and grammatical function of every word are described in detail
Translation accuracy: 80%
Readability of translated text: 80%-90%
863-IMT/EC: by the Institute of Computer Technology, Academia Sinica. Commercialized, with very good economic returns.
5
MT in China – Present: Chinese-to-English
SINO-TRANS: by CS&S (China National Software & Technology Service Co.) in 1993.
Basic dictionary: 40,000 entries
Two special-subject technical dictionaries: naval ships and boats (9,312 entries), rocket-gun (33,773 entries)
Linguistic rules: 1,000 rules
6
MT in China – Present: English-to-Chinese + terminology
TONGYI system: by the Tianjin DATONG computer software company
WINDOWS platform
Different special-subject dictionaries:
a. commonly used scientific terms: 200,000 entries
b. terms covering 22 different subjects (e.g. machine building, telecommunication, aviation, medicine, etc.): 3,000,000 entries
Good market strategy and service
Cooperation with enterprises
7
MT in China – Present: English-to-Chinese + internet browsing + more user interface
YIWANG: by SUNSHINE company of Shenzhen.
Highest translation speed: 100 sentences per second
Internet browsing
YIBA: by YAXINCHENG software technical company.
Three translation modes: on-line, automatic, interface
Open to users: they can revise the dictionary and rules
Rich special-subject dictionaries: 30 subjects (e.g. computer, telecommunication, medicine)
8
MT in China – Present: English-to-Japanese
E-to-J: by JEC company in Beijing.
Technique: transformation from phrase tree (P-tree) to dependency tree (D-tree)
Closely integrated with a word processor
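The slide does not say how the E-to-J system performs its P-tree to D-tree transformation. As a rough illustration only, the usual textbook approach picks a head child for each phrase and attaches the heads of the other children to it. The sketch below assumes a tiny hypothetical head-rule table (`HEAD_RULES`) and invented node labels; none of it comes from the JEC system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PNode:
    """A node of a phrase tree (P-tree); leaves carry a word."""
    label: str
    children: List["PNode"] = field(default_factory=list)
    word: str = ""


# Hypothetical head rules: which child label supplies the head of a phrase.
HEAD_RULES = {"S": ["VP"], "VP": ["V"], "NP": ["N"], "PP": ["P"]}


def head_leaf(node: PNode) -> PNode:
    """Follow head children down to the lexical head of a phrase."""
    if not node.children:
        return node
    for wanted in HEAD_RULES.get(node.label, []):
        for child in node.children:
            if child.label == wanted:
                return head_leaf(child)
    return head_leaf(node.children[-1])  # fallback: rightmost child


def to_dependencies(node: PNode,
                    deps: Optional[List[Tuple[str, str]]] = None
                    ) -> List[Tuple[str, str]]:
    """Collect (head word, dependent word) pairs, i.e. the D-tree edges."""
    if deps is None:
        deps = []
    if not node.children:
        return deps
    head = head_leaf(node)
    for child in node.children:
        child_head = head_leaf(child)
        if child_head is not head:
            deps.append((head.word, child_head.word))
        to_dependencies(child, deps)
    return deps


# P-tree for "dogs chase cats": S(NP(N), VP(V, NP(N)))
tree = PNode("S", [
    PNode("NP", [PNode("N", word="dogs")]),
    PNode("VP", [PNode("V", word="chase"),
                 PNode("NP", [PNode("N", word="cats")])]),
])
edges = to_dependencies(tree)
```

Here the verb becomes the root of the D-tree and both nouns depend on it. A real system would need far richer head rules and dependency labels.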
9
MT in China – Present: Example-based MT: experimental systems
Japanese-Chinese EBMT: computer department of Qinghua University, 1996.
Corpus of aligned Japanese and Chinese sentences
The example unit is the sentence
Similarity rate is calculated on the basis of words
DAYA EBMT: Harbin Polytechnic University.
Machine-aided translation system; the human factor is very important
Corpus is aligned at sentence level
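The slide only states that the example unit is the sentence and that similarity is computed on words. A minimal sketch of that idea follows, assuming a Dice coefficient over word sets as the similarity measure (the actual measure used by the Qinghua system is not given) and pre-segmented toy sentences invented for illustration.

```python
def word_similarity(a, b):
    """Dice coefficient over the word sets of two tokenized sentences."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def best_example(query, examples):
    """Return (score, source, target) for the stored example closest to query."""
    scored = ((word_similarity(query, src), src, tgt) for src, tgt in examples)
    return max(scored, key=lambda item: item[0])


# Toy sentence-aligned corpus (romanized Japanese source, Chinese target);
# the data is made up for this sketch.
examples = [
    (["kore", "wa", "hon", "desu"], ["这", "是", "书"]),
    (["kore", "wa", "pen", "desu"], ["这", "是", "笔"]),
]
query = ["sore", "wa", "pen", "desu"]
score, source, target = best_example(query, examples)
```

The retrieved target sentence would then be adapted, by the system or by a human translator, to cover the words that differ from the query.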
10
MT in China – Government Funding: 1990’s
Hi-Tech 863 funding:
863-IMT/EC system (English-Chinese)
SUNSHINE YIWANG system
905 Chinese Language Processing Project: completed in 1998.
11
MT in China
Users’ English Level
English level of TONGYI MT software users:
Higher level: 16.5%
Middle level: 49.5%
Lower level: 34.1%
So MT software must be oriented to ordinary users
12
MT in China
Potential Users
Enterprise users of TONGYI MT software:
Small enterprises: 31.3%
Medium-scale & large-scale enterprises: 68.7%
So MT software must be oriented to large-scale & medium-scale enterprises, without ignoring the small enterprises, which also have translation demand.
13
MT in China
Regional Distribution
Regional distribution of MT software users: translation demand is concentrated in the big cities and developing regions.
Beijing: 18.7%
Liaoning: 7.9%, Jiangsu: 7.5%
Zhejiang: 6.5%, Hubei: 6.5%, Shanghai: 6.1%
Sichuan: 4.7%, Guangdong: 4.7%
Henan: 3.3%, Heilongjiang: 3.3%
Hebei: 2.8%, Shanxi: 2.3%, Jilin: 2.3%
Yunnan: 1.9%, Neimeng (Inner Mongolia): 1.5%, Gansu: 1.4%
Guizhou: 0.5%, Anhui: 0.5%
14
MT in China - Future and Strategies (1): Terminology Data Bank
MT software combined with a terminology data bank
1990: the sub-committee on computer-aided terminology of China was set up. This sub-committee is attached to the State Language Commission (SLC) of China
A series of national standards for terminology databanks
Terminology databank creation
Chinese-English: since 1995, by ISTIC (Institute of Scientific and Technical Information of China)
Remarkable databanks…
15
MT in China - Future and Strategies (2): Language Corpus Processing
Corpus construction: scale of 25 million Chinese characters (1999)
Automatic segmentation of Chinese written text in the corpus (97.68%, closed test)
Automatic phrase bracketing and syntactic annotation for the Chinese corpus
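The slide does not name the segmentation method behind the 97.68% figure. Purely as an illustration of what automatic segmentation of Chinese text involves, the sketch below uses forward maximum matching, a classic dictionary-based baseline; the lexicon and the `max_len` limit are assumptions made for the example.

```python
def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    lexicon word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Toy lexicon, invented for this sketch.
lexicon = {"机器", "翻译", "机器翻译", "系统"}
segmented = fmm_segment("机器翻译系统", lexicon)
with_unknown = fmm_segment("机器很好", lexicon)
```

Greedy longest match keeps "机器翻译" (machine translation) together as one word, and characters not covered by the lexicon come out as single-character tokens.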
16
MT in China - Future and Strategies (3): Speech-to-speech translation
Chinese speech into Chinese text.
The "SIDA-863A" system can recognize 398 basic Chinese syllables; the recognition rate reaches 93%, response time is less than 0.1 second, and input speed reaches 80 Chinese characters per minute
17
MT in China - Future and Strategies (4): Combined with OCR and Internet
Internet MT: SUNSHINE YIWANG, YAXIN YIBA, TONGYI, etc.
The advantages of MT software on the Internet are:
Higher translation speed, real-time translation
Low price
Large machine dictionary
Possibility to add new words
18
MT in China: New National Project
973 project: supported by the Chinese government from 2001. For creative research in natural language processing, including machine translation.
Automatic speech-to-speech translation system (English-Chinese): under development at the Institute of Automation of Academia Sinica.
19
MT in China – Survey Source
Prof. Feng, Zhiwei: Secretary-general and deputy chairman of the sub-committee on computer-aided terminology of China, under the State Language Commission (SLC) of China.
Invited professor, KAIST (Sep/2001 – Aug/2002)
Dr. Liu, Qun: Institute of Computer Technology, Academia Sinica, Beijing
20
MT in Japan - 1
More than 10 companies
For English, Chinese, Korean
Waiting for the new breakthrough
Internet
eLearning
Co-work with special-domain related companies
Technology transfer
Collaboration tools are ready for the market
For translators’ collaboration workbench through the network
User interface: well-organized.
21
MT in Japan - 2
Leading Systems
Cross-lingual patent retrieval
Prime
NTT/ALT: Japanese-to-English, Japanese-to-Malay, Japanese-to-Chinese
Speech Translation
ATR: C-Star
22
UNL in UN University
Through Universal Networking Language
With Hindi, Japanese, Persian, Indonesia-Malay, Thai, Chinese, Mongolian, Korean in the Asian region
Other regions: major European languages and English
Possible users: ITU mail translation
23
MT in Malaysia
No commercial product yet, but work is under way in the academic sector
For application to: Internet, eLearning, eCommerce
Universiti Sains Malaysia, Computer Aided Translation Unit
Prof. Tang Enya Kong and Prof. Yusoff Zaharin
24
MT in India
18 constitutional languages with 10 different scripts:
their script grammars and language grammars are quite similar
they have 40 to 80 percent of their vocabulary in common
Less than 5 percent of people can work in English
25
MT in India: 1990-2001 – Government effort for IT
TDIL (Technology Development of Indian Languages): 1990-1991
Development of corpora, OCR, text-to-speech, machine translation; standards for keyboard and internal code for information interchange
2000-2001: seven major initiatives:
Knowledge Resources, Knowledge Tools, Translation Support Systems, Human Machine Interface Systems, Localisation, Standardization, and Language Technology Human Resource Development.
Thirteen Resource Centres for Indian Language Technology Solutions (RC-ILTS) were supported, covering all 18 Indian languages.
26
MT in India: Future – Digital Unite and Knowledge for All
Indian Language Technology Vision 2010 has been prepared with the vision statement “Digital Unite and Knowledge for All”.
Growing popularity of the Internet
Content creation, localisation, on-line gisting and summarisation, e-learning, and cross-lingual information retrieval are being promoted to ensure information access in cyberspace in Indian languages
Source: Dr. Om Vikas, Senior Director and Head, Computer Development Division, Ministry of Information Technology
27
MT in Thailand – Government 1996
IT-2000:
To build a national information infrastructure (NII)
To invest in people, concentrating on transferring IT knowledge to children
To build a Government Information Network (GINET)
Internet users in Thailand (2000): 2.3M/66M
Age      <10   10-14  15-19  20-29  30-39  40-49  50-59  60-69  70+   Total
Freq     18    124    261    1,238  572    187    32     27     2     2,461
Percent  0.7   5      10.6   50.3   23.2   7.6    1.3    1.1    0.1   100
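As a quick consistency check on the age table, the percentages can be recomputed from the frequencies. The sketch below only re-derives numbers already on the slide; reading "2.3M/66M" as Internet users over total population is an assumption.

```python
# Frequencies by age group, copied from the table above.
freq = {
    "<10": 18, "10-14": 124, "15-19": 261, "20-29": 1238, "30-39": 572,
    "40-49": 187, "50-59": 32, "60-69": 27, "70+": 2,
}

total = sum(freq.values())

# Share of each age group among respondents, rounded to one decimal place.
percent = {age: round(100 * n / total, 1) for age, n in freq.items()}

# Assumed reading of "2.3M/66M": Internet users as a share of the population.
penetration = round(100 * 2.3 / 66, 1)
```

The recomputed shares agree with the slide, and the 20-29 group alone accounts for about half of the surveyed users.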
Most of the Thai Internet users know English and other Internet
languages at a basic or low intermediate level
28
MT in Thailand – PARSIT
Web-based Thai-English machine translation, since 1998, in cooperation with NEC (Japan).
Very popular among Thai users for translating English to Thai, with an accuracy of 60%.
20 percent of mistranslations might be due to differences in expressions, slang, and sentence structures
http://www.suparsit.com/
300,000 hits/month
25,000 users/month
29
MT in Thailand: Dictionary
a web-based dictionary: Lexitron Thai-English and English-Thai dictionary
30
MT in Thailand: Future
To develop the PARSIT translation system for Thai-to-English and for other target languages.
Other language programs, such as OCR research, speech research, and language research
Thai full-text search engine
31
MT in Thailand: eASEAN
eASEAN Plan: Multilingual Machine Translation Proposal
Thailand, Cambodia, Laos, Vietnam, Japan, Korea, English
Source:
Dr. Virach Sornlertlamvanich [virach@nectec.or.th]
Dr. Prayong THITITHANANON (Rajabhat Institute Ubon Ratchathani, Thailand)
33
MT in Korea – Commercial Product
English-to-Korean (Korean-to-English):
Enguide, LNI Soft
E-Tran2001, NLP Lab (Seoul National University)
EZ Reader, Language and Computer
ClickWorld, ClickQ
Transmate, IBM Korea
…
Japanese-to/from-Korean: Unisoft, Changmyung, …
Translation memory: localization companies develop it for their own use: ITI, …
34
MT in Korea – Test suite for E-to-K
KAIST (http://korterm.kaist.ac.kr/ksurimal)
Supported by the Ministry of Science and Technology
Exhaustive evaluation
A variety of sentences (5,000 from high school textbooks, 10,000 from Internet e-business sites)
To identify the R&D direction
35
Problematic Part of System A
[Figure: problem areas of System A, divided into a structural part and a semantic part, with each item marked "serious" or "average". Labels recoverable from the chart:]
Part of speech: article, pronoun, noun, adverb, preposition, adjective, verb, relatives, conjunction, mark
Partial structure: infinitive, tense, gerund, participle, number, idioms
Sentence structure: sentence type, negation, speech, ellipsis, lists, insertion, inversion, comparative, subjunctive mood, special construction
Phrase: NP, VP, PP, AP (adjective phrase), sentence, idioms
Collocation: V+N, V+Prep., N+V, N+N, N+Prep., Adv.+N, Adv.+Prep.
Ambiguous word: N, V, etc.
Other labels: multiple part of speech; relation and scope of modification; natural expression; different meaning between singular and plural
36
MT in Korea
Caption/EK and KE: ETRI
Real-time translation of captions in TV news
CNN for English-Korean
KBS for Korean-English
Chinese-Korean MT
Pohang University of Science & Tech.
KAIST
ETRI (Korean-to-Chinese)
Companies: Konan tech.
Japanese-Korean MT (technology transfer)
Pohang University of Science & Tech.
37
Online language populations (2001 June)
English 45%, Japanese 9.8%, Chinese 8.4%
German 6.2%, Korean 4.7%, Spanish 4.5%
Italian 3.6%, French 3.4%, Portuguese 2.5%
Dutch 2%, Russian 1.9%
GlobalReach. Global Internet Statistics (by Language). http://www.glreach.com/globstats/index.php3
38
Organizations in Asia
AAMT
AFNLP (Asia Federation of NLP Associations)
http://asianlp.org/
http://afnlp.org/
Eafterm (East Asia Terminology Forum)
http://eafterm.org/
Language Resource Sharing and Management
Jan/2001 – workshop in Tokyo, invited by Japan
Prof. Tanaka, Hozumi (Chair; GSK)
Nov/2001 – workshop in NLPRS-2001, Tokyo
ISO TC37/SC4 (Language Resource Management): under organization