New Development in MT Technology and Services, by Anthony Wong, CCID TransTech


The Combination of Statistical and Knowledge Models:

A Rule-based Oriented Hybrid MT System

Executive Forum (Beijing)

April 2016

Anthony Wong

China Center for Information Industry Development (CCID TransTech)

25th Apr 2016

Content

1 Background

2 Related Work

3 The Combination of Statistical and Knowledge Models: A Rule-based Oriented Hybrid MT System

4 System Development, Performance Evaluation and Major Applications

5 Future Plan

Introduction to CCID Group

Testing & Evaluation Group

Media & Training

CIPOL


MT is taking on a new cross-paradigm development characterized by the hybridization of data-driven statistical approaches and linguistic knowledge models.

This presentation introduces the structural design of CCID’s English-Chinese bi-directional MT system combining statistical and knowledge models, including

• extracting term and lexicon entries from parallel corpora

• extracting native English MWEs from 3-tuple comparable corpora

• developing statistical post-editing of RBMT for domain adaptation.

Finally, it provides a comprehensive performance evaluation of the practical hybrid MT system, examples of typical applications of the system, and the work plan for the next step.

Introduction


1 Background

● Statistical MT excels at translating phrases and short collocations but performs less satisfactorily on long collocations, because the N-gram-based language model often overlooks long collocations (the ‘unbearable’ long-distance reordering problem of the SMT model).

● RBMT and SMT are complementary in application: SMT has advantages in social media translation (forums, UGC, etc.), while RBMT has advantages in the translation of technical documents, reports, online help, user interfaces, etc.


At the EACL 2014 Third Workshop on Hybrid Approaches to Translation (HyTra), held in Gothenburg in April 2014, R. Rapp et al. indicated that “The complementarity between statistical and rule-based systems has narrowed the gap between them. The rising cross-paradigm perspective in the field of MT aims at fostering a creative combination between the two main MT paradigms: statistical and rule-based paradigms, which will bring major breakthroughs to modern machine translation technologies”.


Two hybridization development trends:

1. SMT as Core: To integrate morphological, syntactic or semantic information into the statistical MT system.

2. RBMT as Core: To integrate the data-driven statistical approaches with the existing rule-based system: using parallel and comparable corpora to improve results by enriching their lexicons and grammars, and by applying new methods for disambiguation.

2 Related Work

The representative R&D program for hybrid machine translation combining statistics and knowledge is the high-quality HyghTra program, jointly developed by the University of Leeds and Lingenio under the EU's 7th Framework Programme (2010-2014).

The program attempts to extract parallel and comparable bilingual resources from corpora through advanced statistical approaches, with the aim of developing dictionaries and a syntax rule base to improve the performance of the hybrid machine translation system guided by RBMT.

A number of hybrid translation systems have attempted to put some analytical abstraction based on linguistic knowledge on top of an SMT core.

Kurt Eberle et al. argue that this is not the best choice: by its underlying philosophy, SMT starts out linguistically ignorant and learns all linguistic rules automatically from corpora. The extracted information, however, is typically represented in huge data sets that humans cannot read in a natural way. As a result, this type of architecture cannot easily provide interfaces for incorporating linguistic knowledge effectively.

HyghTra has hence adopted the opposite approach: integrating the information obtained from corpora through statistical approaches into a rule-based translation system that serves as the CORE.

If the rule-based translation system that serves as the framework of the program is highly modularized and successful in the linguistics aspect, the hybrid MT system will have great potential in delivering high quality results.

The HyghTra program integrates statistical models in the following respects:

1. Build and use comparable corpora, and then extract parallel resources from the corpora;

2. Extend the dictionaries using GIZA++ and parallel corpora;

3. Use corpus-statistical approaches to develop and maintain the ‘rules’ of the rule-based MT;

4. Use automatic evaluation technology to assess the most commonly used syntactic structures and the quality of multiword expressions (MWEs);

5. Mine correct translations from the parallel corpus in order to analyze the key syntactic structures that may influence system quality.


More background on: SPE

In recent years, there has been a great deal of research on domain adaptation for rule-based translation systems using the SPE (Statistical Post-Editing) approach, as well as reports of this technology being used to improve the performance of commercial MT systems.

In 2012, R. Rubino et al. put forward the theoretical basis for using the SPE approach to achieve domain adaptation for machine translation systems, arguing that

“…. Most human activities involve certain language or the language of a specific domain, while a certain domain may have its unique terminologies, syntax and text structural characterization. It is unreasonable to build a unique translation system for each sub-domain, but we believe that using SPE technology to achieve the domain adaptation for out-of-domain translation system may be a solution to diversify specific domains”.

Symantec's post-editing-specific MT evaluation criteria have been used to evaluate the translation quality of the Systran SPE system. The results show that the acceptability of Japanese-English translation, the pair with the best SPE effect, increased by 28.17%.

Ki-Young Lee et al. suggested using SPE to improve the fluency of English-Korean translation, aiming to use the knowledge obtained from SPE to bring the RBMT translation closer to the human reference translation. Their test results show that SPE can effectively improve the translation of sentences for which the RBMT system made no morphological or syntactic analysis errors.


3. The Combination of Statistical and Knowledge Models: A Rule-based oriented Hybrid MT System by CCID

The core system of CCID’s “Rule-based oriented hybrid machine translation system” is based on the practical rule-based translation system that we developed in the first decade of the 21st century. The core system has made a good integration of micro engine technologies such as shallow parser, translation template matching, etc.

Word segmentation processing, named entity recognition and shallow syntactic parsing technologies combining rules and statistics have also been adopted to ensure that the translation generated by the system delivers excellent syntactic structures and semantic expressions, which are crucial requirements for the R&D of the hybrid machine translation system.

Rule-based backbone built on a well-tested ‘core’ (863 expert group evaluation result)

[Chart: Adequacy and Fluency scores; one entry labeled CCID and five entries shown as *** in the source]


Figure 1 shows the schematic overview of the architecture of the hybrid MT system that combines statistical and knowledge models.

Three major sections:

• The core translation system based on linguistic knowledge models (upper left);

• The bilingual resource extraction module, which uses statistical models to extract bilingual resources from parallel and comparable corpora and then integrates them into the RBMT core system (lower left);

• The phrase-based (PBMT) SPE system, which has been integrated into the HMT (right).

Figure 1 Schematic overview of the architecture of the hybrid MT system combining statistical and knowledge models

The three key technologies of CCID’s HMT:

1. develop the technology to extract terms and lexicon entries from parallel corpora;

2. develop the technology to extract multiword expressions in native English from the three-tuple comparable corpora;

3. develop the technology that enables the SPE system to achieve domain adaptation for MT.


3.1 Extract terms and lexicon entries from parallel corpora

As shown in Figure 2, this method includes three steps:

• Establish a phrase table from the parallel corpora;

• Filter candidate terms to select those that conform to grammatical norms;

• Manually post-edit the resulting terminology table to guarantee the correctness of the POS and attribute annotation of the terms.

Figure 2 Procedures to extract terminology from parallel corpora

Parallel corpora

→ Create phrase tables: pre-processing (tokenization, lower-case conversion, Chinese word segmentation); use GIZA++ to obtain the alignment model; create phrase tables with Moses

→ Filter term candidates: frequency filter; linguistic filter; lexicon filter

→ Post-edit: confirm the correctness of the lemma and annotation

→ Domain-specific glossary


3.1.1 Creating phrase table

The procedures include:

• Pre-processing. English data was tokenized and converted to lower case, and the Chinese text was segmented into words using CCID segmentation tools.

• Word alignment. GIZA++ was used for word alignment.

• Phrase table extraction. Moses was used to establish a phrase table on the basis of the alignment model.


3.1.2 Filter candidates

Three filters are applied:

• Frequency filter: only phrases with a frequency and translation probability above a given threshold are considered as term candidates.

• Linguistic filter: only phrases matching certain linguistic patterns are acceptable; variants of those patterns can also be candidates.

• Lexicon filter: remove candidates that already exist in the system, candidates that are not domain-specific, and those no longer in use.
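The three filters can be sketched as a single pass over phrase-table entries. This is only an illustration under stated assumptions: the `PhrasePair` record, the POS patterns in `ALLOWED_PATTERNS`, and the threshold values are hypothetical and are not taken from CCID's system.

```python
from dataclasses import dataclass


@dataclass
class PhrasePair:
    source: str      # source-language phrase
    target: str      # target-language phrase
    count: int       # how often the pair was extracted from the corpus
    prob: float      # translation probability p(target | source)
    pos_tags: tuple  # POS tags of the source phrase (hypothetical tagset)


# Hypothetical POS patterns accepted by the linguistic filter.
ALLOWED_PATTERNS = {("NOUN",), ("NOUN", "NOUN"), ("ADJ", "NOUN")}


def filter_candidates(pairs, lexicon, min_count=5, min_prob=0.5):
    """Apply the frequency, linguistic and lexicon filters in turn."""
    kept = []
    for pair in pairs:
        # Frequency filter: drop rare or low-probability pairs.
        if pair.count < min_count or pair.prob < min_prob:
            continue
        # Linguistic filter: keep only phrases with an accepted POS pattern.
        if pair.pos_tags not in ALLOWED_PATTERNS:
            continue
        # Lexicon filter: drop entries the system dictionary already has.
        if pair.source in lexicon:
            continue
        kept.append(pair)
    return kept
```

The surviving candidates would then go to the manual post-editing step, where the lemma and annotation are confirmed before entries are added to the glossary.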



3.2 Extract multiword expressions in native English from the three-tuple comparable corpora

The translated side of a parallel corpus usually distorts its native language (translationese). This issue is even more obvious in Chinese-English parallel corpora. For example (see Table 2):

• “电子政务建设”: e-government construction

• “二手资料”: second-hand data

• “重要意义”: important significance


Our approach is to set up three-tuple comparable corpora, on the scale of a million sentence pairs, consisting of standard Chinese, native English and Chinglish.

Statistical approaches are then used to analyze the overuse and underuse of key word clusters at the word level.

The log-likelihood (LL) is calculated to quantify the significance of the differences between key word clusters. Based on the variation of LL (see Table 1), the distinctive features of Chinglish and native English multiword expressions can be detected.

Native English MWEs are then extracted to improve the performance of the machine translation system.
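The LL comparison can be illustrated with the log-likelihood keyness statistic commonly used in corpus linguistics. The slides do not state the exact formula, so this is an assumption; with both corpora normalized to the same size, however, it reproduces the LL values listed in Table 2 (for example 497.82 for "network bubble"). The function names are illustrative.

```python
import math


def log_likelihood(freq_a, freq_b, size_a=1.0, size_b=1.0):
    """Log-likelihood keyness of a word cluster across two corpora.

    freq_a / freq_b are the observed frequencies in corpus A and B;
    size_a / size_b are the corpus sizes (equal by default, which fits
    frequencies that are already normalized).
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2.0 * ll


def usage_sign(freq_a, freq_b, size_a=1.0, size_b=1.0):
    """Return '+' if the cluster is overused in corpus A, '-' otherwise."""
    return "+" if freq_a / size_a > freq_b / size_b else "-"


if __name__ == "__main__":
    # Chinglish vs. native English normalized frequencies from Table 2.
    print(round(log_likelihood(515, 36), 2), usage_sign(515, 36))
    print(round(log_likelihood(16, 372), 2), usage_sign(16, 372))
```

Ranking clusters by this score and reading the sign gives the overused Chinglish expressions and their underused native English counterparts.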



As shown in Table 2, the key word cluster with the largest log-likelihood is at the top of the table, indicating a considerable difference in the frequency distribution of this word cluster between native English and Chinglish.

So far, over 56,000 multiword expressions in native English have been extracted, and the accuracy and fluency of the system have improved significantly.

Table 2 Analysis on the Significance of Difference of MWE in Chinglish and Native English

Frequencies are normalized frequencies per corpus; + marks overused, - marks underused.

Key words                   Chinglish   Native English   +/-   LL       Chinese expression
network bubble              515         36               +     497.82   网络泡沫
Dot-com bubble              16          372              -     404.52   网络泡沫
e-government construction   126         3                +     150.34   电子政务建设
e-government development    9           120              -     113.55   电子政务建设
second-hand data            62          0                +     85.95    二手数据
indirect data               2           58               -     65.64    二手数据
Olympic five rings          20          0                +     27.72    奥运五环
The Olympic rings           0           24               -     33.27    奥运五环
middle-sized                35          1                +     40.77    中等大小
medium-sized                4           30               -     22.50    中等大小
important significance      35          3                +     31.69    重要意义
great significance          6           42               -     30.37    重要意义


3.3 SPE System

Using SPE technology to adapt an out-of-domain translation system to a new domain is one of the major solutions for adapting MT to domain diversity.

In manual post-editing (PE), human translators edit the output of the machine translation system.

In SPE, the texts edited by translators, namely the reference translations, are used to train the system so that it can correct the output of the original MT system automatically.

During this process, SPE learns to fix the mistakes that the MT system makes repeatedly, as well as the inherent lexical choice defects of RBMT.


Using the SPE system to achieve domain adaptation for machine translation includes the following four steps:

• A parallel corpus is constructed using the RBMT output translations as source text and the post-edited reference translations as target text. This corpus is used for SMT model training, which in turn helps build the SPE system.

• SMT toolkits are used to build the translation model and language model for the SPE system.

• The output of the RBMT system is fed to the SPE decoder.

• The SPE system delivers the corrected (post-edited) text.

*Two translation steps are performed: first, the source text is translated by the RBMT system into an intermediate target-language text, which is in turn translated by the SPE module into the post-edited target-language text (see Figure 3).

Figure 3 Statistical post-editing architecture
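The data flow of the steps above can be sketched as follows. This is a minimal sketch, not CCID's implementation: `rbmt` and `spe` are stand-ins for the real RBMT engine and the trained SMT decoder, and the function names are hypothetical.

```python
from typing import Callable, List, Tuple

# Any function mapping an input string to an output string.
Translator = Callable[[str], str]


def build_spe_training_corpus(
    sources: List[str], post_edited: List[str], rbmt: Translator
) -> List[Tuple[str, str]]:
    """Pair each RBMT output (SPE source side) with its human
    post-edited reference translation (SPE target side)."""
    if len(sources) != len(post_edited):
        raise ValueError("sources and post-edited references must align")
    return [(rbmt(src), ref) for src, ref in zip(sources, post_edited)]


def translate_with_spe(source: str, rbmt: Translator, spe: Translator) -> str:
    """Two-step translation: source -> RBMT draft -> post-edited text."""
    intermediate = rbmt(source)  # step 1: rule-based translation
    return spe(intermediate)     # step 2: statistical post-editing
```

Both sides of the SPE training corpus are in the target language, so the SMT toolkit learns a monolingual "translation" from raw RBMT output to post-edited text.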



The above process can be used to customize RBMT system in a convenient way, and to enable the RBMT system to adapt to the specific domain.

Currently, we are improving the SPE system so that it takes as input not only the baseline RBMT output but also the source-language text.

There are many ways to bring in source-language information, for example by introducing into the log-linear model feature functions that relate the target-language phrase to the source-language text.
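As a rough illustration of this idea, a log-linear model scores each candidate as a weighted sum of feature values, so source-language information can enter as one more feature. The feature names below ("tm", "lm", "src") and all numbers are hypothetical and are not taken from the actual system.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values (log-domain scores)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Pick the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


if __name__ == "__main__":
    candidates = [
        # "tm": log p(candidate | RBMT output), "lm": log p(candidate),
        # "src": hypothetical feature tying the candidate to the source text.
        {"text": "A", "features": {"tm": -1.0, "lm": -2.0, "src": -3.0}},
        {"text": "B", "features": {"tm": -1.5, "lm": -2.0, "src": -0.5}},
    ]
    without_source = {"tm": 1.0, "lm": 0.5, "src": 0.0}
    with_source = {"tm": 1.0, "lm": 0.5, "src": 1.0}
    print(best_candidate(candidates, without_source)["text"])
    print(best_candidate(candidates, with_source)["text"])
```

With the source-aware feature switched off the model prefers the candidate closest to the RBMT output; switching it on lets evidence from the source sentence change the choice.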

4. System Development, Performance Evaluation and Major Applications

4.1 System Development

Before implementation, CCID had already independently developed a Chinese segmentation tool and an English tokenization tool to process the Chinese and English corpora.

Giza-pp v.1.0.1 was used in the prototype stage for word alignment.

Moses v0.91 was used as the SPE component, with a phrase-based translation model; the distortion limit for reordering was set to 6, and the model weights were tuned with MERT.

The SRILM v1.7.0 toolkit was used to build a standard 5-gram language model (LM) with Kneser-Ney smoothing.

The training corpora were on the scale of a million sentence pairs in the ICT domain.

4.2 System performance evaluation

Table 3 shows the SPE experimental results for the Chinese-English and English-Chinese RBMT systems.

Table 4 and Table 5 provide examples of SPE output for Chinese-English RBMT and English-Chinese RBMT respectively.

Language             System       TER    BLEU   Acceptability
Chinese-to-English   RBMT         0.71   0.25   0.63
Chinese-to-English   RBMT + SPE   0.57   0.32   0.71
English-to-Chinese   RBMT         0.67   0.32   0.61
English-to-Chinese   RBMT + SPE   0.46   0.50   0.80

Table 3: Experimental results. For TER, lower (error) is better, while for BLEU and Acceptability, higher (score) is better.
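To make the direction of the TER column concrete, the sketch below computes a simplified TER: the word-level edit distance between a hypothesis and a reference, divided by the reference length. Full TER also allows block shifts, which are omitted here, so this is an approximation and not the tool used for Table 3. It assumes whitespace-separated tokens, so Chinese text would need to be segmented first.

```python
def simplified_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance / reference length (no shift operation)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # prev[j] = edits needed to turn the hypothesis prefix into ref[:j]
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete hypothesis word
                            curr[j - 1] + 1,      # insert reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)
```

A score of 0 means the output already matches the reference; the fewer edits a post-editor would need, the lower the score.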

Table 4 Example of SPE output for Chinese-to-English RBMT

Source: 培养一大批知识产权的专业人才和管理人才，以及一大批既了解专利有关知识，又懂得法律的志愿者。

Reference translation: Cultivate a large number of professionals and management staff for intellectual property, and a large number of volunteers well-versed in both patent knowledge and legal knowledge.

RBMT: Cultivate large quantities of the professional and the management staff intellectual properties, and large quantities of volunteer both understood the patent relevant knowledge, understand the law.

With SPE: Cultivate a large number of professionals and management staff for intellectual properties, and a large number of volunteer both understood the patent relevant knowledge, understand the law.


Table 5 Example of SPE output for English-to-Chinese RBMT

Source: To remove a Blank Back Plate, pull the back plate pin to release the Blank Back Plate, and remove the Blank Back Plate from the rear panel slot.

Reference translation: 要卸下填充板背板，请拉出背板引脚以松开填充板背板，然后从后面板插槽中卸下该背板。

RBMT: 要删除一个空白的信号板，请拉进信号板别针发布空白的信号板，并从后面板插槽中删除空白的信号板。

With SPE: 要卸下填充板背板，请拉出背板引脚松开填充板背板，然后从后面板插槽中卸下填充板背板。



The following example compares CCID’s English-Chinese MT system with a popular online translation system, to further verify that the hybrid machine translation system combining linguistic knowledge models and data-driven statistical approaches is superior to a statistical machine translation system in handling long-distance reordering and in generating authentic syntactic and semantic structures.

Source sentence input to the system:

GOES-P will be launched on board a United Launch Alliance Delta IV (4, 2) launch vehicle under a FAA commercial license.

Translation delivered by the hybrid machine translation system (test conducted in July 2014):

在美国联邦航空局商业许可下，联合发射同盟的一枚德尔塔 -4 （ 4 ， 2 ）运载火箭将搭载 GOES-P 卫星发射升空。

Translation delivered by XXX online translation (test conducted in July, 2014):

GOES-P 将在船上联合发射联盟德尔塔 IV （ 4 ， 2 ）根据美国联邦航空局商业许可运载火箭发射。

4.3 Major applications

Due to its performance advantage, CCID’s hybrid machine translation system guided by RBMT has been applied to many key Public-Sector projects:

• Beijing Summer Olympic Games

• China-EU Information Society Program

• Skynet Program of the Ministry of Public Security

In addition, CA, Symantec and other multinational corporations have also embedded the system into their corporate Localization Workflow.


5. Future Plan

For HMT R&D based on SPE, we will consider how to use linguistic models to select the sentences and phrases that can be improved through SPE. Such selection of post-editing targets can minimize the degradation of translation quality.

We will continue to work with Lancaster University, conducting research on how to use MAM (multiple association measures) to enhance the accuracy and recall of MWEs, and how to use the USAS semantic analysis system to improve the WSD ability of the MT system.

Other Projects:

• German <-> Chinese MT engine (Hybrid MT with Lingenio, PONS)

• Text Mining of the Strategy Contents of Corporate Annual Reports

Thank You