Where do we stand? MT development, research, and deployment in Asia

Key-Sun Choi (KAIST)

AAMT

http://www.asianlp.org/ http://www.afnlp.org/ http://korterm.org/


2

Contents

China
Japan
India
Malaysia
Thailand
Taiwan
Korea
UNL
Associations related to MT

3

MT in China – 1980s-1990s

Goal: to translate scientific documents from Russian and Western languages.

Supported by the government; no private companies in the early stage.

TRANS-STAR: 30,000 words/hour on a 386 PC.
Basic dictionary of 40,000 entries, plus 10 specialized technical dictionaries including 350,000 entries.
Subject fields: computers, economics, telecommunications, ceramics, thermal power industry, printing machine industry, automobile/tractor industry, petroleum prospecting, geology, chemical industry.

4

MT in China – Present: English-to-Chinese

GAOLI: developed jointly by Beijing GAOLI Computer Co. Ltd. and the Linguistics Institute of CASS.
Basic lexical dictionary: 60,000 entries, in which the usage and grammatical function of every word are described in detail.
Translation accuracy: 80%
Readability of translated text: 80%-90%

863-IMT/EC: by the Institute of Computer Technology, Academia Sinica.
Commercialized, with very good economic returns.

5

MT in China – Present: Chinese-to-English

SINO-TRANS: by CS&S (China National Software & Technology Service Co.) in 1993.

Basic dictionary: 40,000 entries
Two special-subject technical dictionaries: naval ships and boats (9,312 entries), rocket-gun (33,773 entries)

Linguistic rules: 1,000 rules

6

MT in China – Present: English-to-Chinese + terminology

TONGYI system: by the Tianjin DATONG computer software company.
Windows platform.
Different special-subject dictionaries:
a. commonly used scientific terms: 200,000 entries
b. terms covering 22 different subjects (e.g. machine building, telecommunication, aviation, medicine, etc.): 3,000,000 entries

Good market strategy and service.
Cooperation with enterprises.

7

MT in China – Present: English-to-Chinese + internet browsing + more user interface

YIWANG: by SUNSHINE company of Shenzhen.
Highest translation speed: 100 sentences per second.
Internet browsing.

YIBA: by YAXINCHENG software technical company.
Three translation modes: online, automatic, interface.
Open to users, who can revise the dictionary and rules.
Rich special-subject dictionaries: 30 subjects (e.g. computer, telecommunication, medicine)

8

MT in China – Present: English-to-Japanese

E-to-J: by JEC company in Beijing.
Technique of transformation from phrase tree (P-tree) to dependency tree (D-tree).
Closely integrated with a word processor.

9

MT in China – Present: Example-based MT, experimental systems

Japanese-Chinese EBMT: computer department of Qinghua University, 1996.
Corpus of aligned Japanese and Chinese sentences.
The example unit is the sentence.
The similarity rate is calculated on the basis of words.

DAYA EBMT: Harbin Polytechnic University.
Machine-aided translation system; the human factor is very important.
The corpus is aligned at sentence level.
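As a rough illustration of sentence-level example retrieval with a word-based similarity rate, the sketch below picks the closest stored example for an input sentence. The slides do not say which similarity measure either system used; the Dice coefficient over word sets is an assumption chosen for simplicity, and the romanized toy sentences, `dice_similarity`, and `best_example` are hypothetical names and data, not taken from the systems described.

```python
# Minimal sketch of example retrieval for EBMT.
# Assumption: similarity is the Dice coefficient over word sets; the slides
# only state that the similarity rate is word-based.

def dice_similarity(a, b):
    """Dice coefficient between the word sets of two tokenized sentences."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def best_example(source_tokens, examples):
    """Return (similarity, source, target) for the closest aligned example."""
    scored = (
        (dice_similarity(source_tokens, src), src, tgt)
        for src, tgt in examples
    )
    return max(scored, key=lambda item: item[0])


# Hypothetical sentence-aligned corpus (romanized Japanese -> romanized Chinese).
EXAMPLES = [
    (["watashi", "wa", "hon", "o", "yomu"], ["wo", "du", "shu"]),
    (["kare", "wa", "gakkou", "ni", "iku"], ["ta", "qu", "xuexiao"]),
]

query = ["watashi", "wa", "shinbun", "o", "yomu"]
score, matched_source, matched_target = best_example(query, EXAMPLES)
print(score, matched_source, matched_target)
```

In a machine-aided setting, the retrieved target sentence would be offered to a human translator for adaptation rather than used verbatim.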

10

MT in China: Government Funding, 1990s

Hi-Tech 863 funding:
863-IMT/EC system (English-Chinese)
SUNSHINE YIWANG system

905 Chinese Language Processing Project: completed in 1998.

11

MT in China

User’s English Level

English proficiency of TONGYI MT software users:
Higher level: 16.5%
Middle level: 49.5%
Lower level: 34.1%

So MT software must be oriented toward ordinary users.

12

MT in China

Potential Users

Enterprise users of TONGYI MT software:
Small enterprises: 31.3%
Medium-scale and large-scale enterprises: 68.7%

So MT software must be oriented toward large-scale and medium-scale enterprises, without ignoring the small enterprises that also have translation demand.

13

MT in China

Regional Distribution

Regional distribution of MT software users: translation demand is concentrated in the big cities and developing regions.
Beijing: 18.7%
Liaoning: 7.9%, Jiangsu: 7.5%
Zhejiang: 6.5%, Hubei: 6.5%, Shanghai: 6.1%
Sichuan: 4.7%, Guangdong: 4.7%
Henan: 3.3%, Heilongjiang: 3.3%
Hebei: 2.8%, Shanxi: 2.3%, Jilin: 2.3%
Yunnan: 1.9%, Neimeng: 1.5%, Gansu: 1.4%
Guizhou: 0.5%, Anhui: 0.5%

14

MT in China - Future and Strategies (1): Terminology Data Bank

MT software combined with a terminology data bank.
1990: the sub-committee of computer-aided in terminology of China was set up. This sub-committee is attached to the State Language Commission (SLC) of China.
A series of national standards for terminology databanks.

Terminology databank creation
Chinese-English: since 1995, by ISTIC (Institute of Scientific and Technical Information of China)
Remarkable databanks…

15

MT in China - Future and Strategies (2): Language Corpus Processing

Corpus construction: on the scale of 25 million Chinese characters (1999).
Automatic segmentation of written Chinese text in the corpus (97.68%, closed test).
Automatic phrase bracketing and syntactic annotation for the Chinese corpus.

16

MT in China - Future and Strategies (3): speech-to-speech translation

Chinese speech into Chinese text.
The "SIDA-863A" system can recognize 398 basic Chinese syllables; the recognition rate reaches 93%, the response time is less than 0.1 second, and the input speed reaches 80 Chinese characters per minute.

17

MT in China - Future and Strategies (4): combined with OCR and Internet

Internet MT: SUNSHINE YIWANG, YAXIN YIBA, TONGYI, etc.

The advantages of MT software on the Internet are:
Higher translation speed, real-time translation
Low price
Large machine dictionary
Possibility of adding new words

18

MT in China: New National Project

973 project: supported by the Chinese government from 2001, for creative research in natural language processing, including machine translation.

Automatic speech-to-speech translation system (English-Chinese): under development at the Institute of Automation of Academia Sinica.

19

MT in China – Survey Source

Prof. Feng, Zhiwei:
Secretary-general and deputy chairman of the sub-committee of computer-aided in terminology of China under the State Language Commission (SLC) of China.
Invited professor, KAIST (Sep/2001 – Aug/2002)

Dr. Liu, Qun:
Institute of Computer Technology, Academia Sinica, Beijing

20

MT in Japan - 1

More than 10 companies
For English, Chinese, Korean

Waiting for the new breakthrough
Internet
eLearning
Co-work with special-domain related companies

Technology transfer
Collaboration tools are ready for the market
For translators' collaboration workbench through the network
User interface: well organized.

21

MT in Japan - 2

Leading Systems

Cross-lingual patent retrieval

Prime NTT/ALT

Japanese-to-English Japanese-to-Malay Japanese-to-Chinese

Speech Translation ATR: C-Star

22

UNL in UN University

Through Universal Networking Language
With Hindi, Japanese, Persian, Indonesia-Malay, Thai, Chinese, Mongolian, Korean in the Asian region

Other region: Major European languages and English

Possible Users: ITU mail translation

23

MT in Malaysia

No commercial product yet; work is under way in the academic sector.

For application to: Internet, eLearning, eCommerce

Universiti Sains Malaysia
Computer Aided Translation Unit
Prof. Tang Enya Kong and Prof. Yusoff Zaharin

24

MT in India

18 constitutional languages with 10 different scripts:
their script grammars and language grammars are quite similar
they have 40 to 80 percent of their vocabulary in common

Less than 5 percent of people can work in English.

25

MT in India: 1990-2001, government effort for IT

TDIL (Technology Development for Indian Languages): 1990-1991
Development of corpora, OCR, text-to-speech, machine translation; standards for keyboard and internal code for information interchange.

2000-2001: seven major initiatives:
Knowledge Resources, Knowledge Tools, Translation Support Systems, Human Machine Interface Systems, Localisation, Standardization, and Language Technology Human Resource Development.

Thirteen Resource Centres for Indian Language Technology Solutions (RC-ILTS) were supported, covering all 18 Indian languages.

26

MT in India: Future, "Digital Unite and Knowledge for All"

Indian Language Technology Vision 2010 has been prepared with the vision statement "Digital Unite and Knowledge for All".

Growing popularity of the Internet:
content creation, localisation, on-line gisting and summarisation, e-learning, and cross-lingual information retrieval are being promoted to ensure information access in cyberspace in Indian languages.

Source: Dr. Om Vikas, Senior Director and Head, Computer Development Division, Ministry of Information Technology

27

MT in Thailand: Government 1996

IT-2000
To build a national information infrastructure (NII)
To invest in people, concentrating on transferring IT knowledge to their children
To build a Government Information Network (GINET)

Internet Users in Thailand (2000): 2.3M/66M

Age      Freq     Percent
<10        18       0.7
10-14     124       5
15-19     261      10.6
20-29   1,238      50.3
30-39     572      23.2
40-49     187       7.6
50-59      32       1.3
60-69      27       1.1
70+         2       0.1
Total   2,461     100

Most Thai Internet users know English and other Internet languages at a basic or low intermediate level.
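The percent column of the age table can be recomputed from the frequency column. The short sketch below does that; the variable and function names are illustrative, and the only data used are the frequencies listed on the slide.

```python
# Recompute the percent column of the Thai Internet user age table
# from the frequencies given on the slide.

AGE_FREQ = {
    "<10": 18,
    "10-14": 124,
    "15-19": 261,
    "20-29": 1238,
    "30-39": 572,
    "40-49": 187,
    "50-59": 32,
    "60-69": 27,
    "70+": 2,
}


def percent_shares(freq):
    """Return the total count and each group's share in percent (1 decimal)."""
    total = sum(freq.values())
    shares = {group: round(100 * n / total, 1) for group, n in freq.items()}
    return total, shares


total, shares = percent_shares(AGE_FREQ)
for group, share in shares.items():
    print(f"{group:>6} {AGE_FREQ[group]:>6} {share:>6}")
print(f"{'Total':>6} {total:>6}")
```

The recomputed shares agree with the slide's figures, and users aged 20-39 account for roughly three quarters of the respondents.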

28

MT in Thailand: PARSIT

Web-based Thai-English machine translation, since 1998, in cooperation with NEC (Japan).
Very popular among Thai users for translating English to Thai, with an accuracy of 60%.

20 percent of mistranslations might be due to differences in expressions, slang, and sentence structures.

http://www.suparsit.com/
300,000 hits/month
25,000 users/month

29

MT in Thailand: Dictionary

a web-based dictionary: Lexitron Thai-English and English-Thai dictionary

30

MT in Thailand: Future

To develop the PARSIT translation system for Thai-to-English and for other target languages.

Other language programs, such as OCR research, speech research, and language research

Thai full-text search engine

31

MT in Thailand: eASEAN

eASEAN Plan: Multilingual Machine Translation Proposal

Thailand, Cambodia, Laos, Vietnam, Japan, Korea, English

Source:
Dr. Virach Sornlertlamvanich [virach@nectec.or.th]
Dr. Prayong THITITHANANON (Rajabhat Institute Ubon Ratchathani, Thailand)

32

MT in Taiwan

Prof. Su, Keh-Ih
Machine translation localization

33

MT in Korea: Commercial Product

English-to-Korean (Korean-to-English)
Enguide, LNI Soft
E-Tran2001, NLP Lab (Seoul National University)
EZ Reader, Language and Computer
ClickWorld, ClickQ
Transmate, IBM Korea
…

Japanese-to/from-Korean
Unisoft
Changmyung
…

Translation Memory
Localization companies develop for their own use: ITI …

34

MT in Korea: Test suite for E-to-K

KAIST (http://korterm.kaist.ac.kr/ksurimal)
Supported by the Ministry of Science and Technology

Exhaustive evaluation
A variety of sentences (5,000 from high school textbooks, 10,000 from Internet e-business sites)

To identify the R&D direction

35

Problematic Part of System A

[Figure: chart of the problematic parts of System A, divided into a structural part and a semantic part, with a legend distinguishing "serious" and "average". Labels recovered from the chart:
Part of Speech: article, pronoun, noun, adverb, preposition, adjective, verb, relatives, conjunction, mark
Partial Structure: infinitive, tense, gerund, participle, number, idioms
Sentence Structure: sentence type, negation, speech, ellipsis, lists, insertion, inversion, comparative, subjunctive mood, special construction
Phrase: NP, VP, PP, AP (adjective phrase), sentence, idioms
Collocation: V+N, V+Prep., N+V, N+N, N+Prep., Adv.+N, Adv.+Prep.
Ambiguous word: N, V, etc.
Other labels: multiple part of speech, relation and scope of modification, natural expression, different meaning between singular and plural]

36

MT in Korea

Caption/EK and KE: ETRI
Real-time translation of captions in TV news
CNN for English-Korean
KBS for Korean-English

Chinese-Korean MT
Pohang University of Science & Tech.
KAIST
ETRI (Korean-to-Chinese)
Companies: Konan tech.

Japanese-Korean MT (technology transfer)
Pohang University of Science & Tech.

37

Online language populations (2001 June)

English 45%, Japanese 9.8%, Chinese 8.4%

German 6.2%, Korean 4.7%, Spanish 4.5%

Italian 3.6%, French 3.4%, Portuguese 2.5%

Dutch 2%, Russian 1.9%

GlobalReach. Global Internet Statistics (by Language). http://www.glreach.com/globstats/index.php3

38

Organizations in Asia

AAMT

AFNLP (Asia Federation of NLP Associations) http://asianlp.org/ http://afnlp.org/

Eafterm (East Asia Terminology Forum) http://eafterm.org/

Language Resource Sharing and Management
Jan/2001 – workshop in Tokyo, invited by Japan
Prof. Tanaka, Hozumi (Chair; GSK)
Nov/2001 – workshop in NLPRS-2001, Tokyo
ISO TC37/SC4 (Language Resource Management): under organization

MT Status in Asia

Thank you.