The Combination of Statistical and Knowledge Models:
A Rule-based Oriented Hybrid MT System
Executive Forum (Beijing)
April 2016
Anthony Wong
China Center for Information Industry Development (CCID TransTech)
25th Apr 2016
Content
1 Background
2 Related Work
3 The Combination of Statistical and Knowledge Models: A Rule-based Oriented Hybrid MT System
4 System Development, Performance Evaluation and Major Applications
5 Future Plan
Introduction to CCID Group
Testing & Evaluation Group
Media & Training
CIPOL
Introduction
MT is taking on a new cross-paradigm development characterized by the hybridization of data-driven statistical approaches and linguistic knowledge models.
This presentation introduces the structural design of CCID’s English-Chinese bi-directional MT system, which combines statistical and knowledge models, including:
• extracting term and lexicon entries from parallel corpora
• extracting native English MWEs from 3-tuple comparable corpora
• developing statistical post-editing (SPE) of RBMT for domain adaptation
Finally, it provides a comprehensive performance evaluation of the practical hybrid MT system, examples of typical applications of the system, and the work plan for the next step.
1 Background
● Statistical MT excels at translating phrases and short collocations but performs less satisfactorily on long collocations, because a language model based on N-grams often overlooks long-distance dependencies. (The ‘unbearable’ issue of long-distance reordering in the SMT model.)
● RBMT and SMT are complementary in application: SMT has advantages in translating social media content such as forums and user-generated content (UGC), while RBMT has advantages in translating technical documents, reports, online help, user interfaces, etc.
At the Third Workshop on Hybrid Approaches to Translation (HyTra) at EACL 2014, held in Gothenburg in April 2014, R. Rapp et al. indicated that “The complementarity between statistical and rule-based systems has narrowed the gap between them. The rising cross-paradigm perspective in the field of MT aims at fostering a creative combination between the two main MT paradigms: statistical and rule-based paradigms, which will bring major breakthroughs to modern machine translation technologies”.
Two hybridization development trends:
1. SMT as Core: To integrate morphological, syntactic or semantic information into the statistical MT system.
2. RBMT as Core: To integrate the data-driven statistical approaches with the existing rule-based system: using parallel and comparable corpora to improve results by enriching their lexicons and grammars, and by applying new methods for disambiguation.
2 Related Work
A representative R&D program in hybrid machine translation combining statistics and knowledge is the high-quality HyghTra program, jointly developed by the University of Leeds and Lingenio under the EU's 7th Framework Programme (2010-2014).
The program uses advanced statistical approaches to extract bilingual resources from parallel and comparable corpora, with the aim of developing dictionaries and syntax rule bases that improve the performance of a hybrid machine translation system guided by RBMT.
A number of hybrid translation systems have attempted to put some analytical abstraction based on linguistic knowledge on top of an SMT core.
Kurt Eberle et al. believe that this is not the best choice: in accordance with its underlying philosophy, SMT is linguistically ignorant at the outset and learns all linguistic rules automatically from corpora. However, the extracted information is typically represented in huge data sets that are not naturally readable by humans. As a result, this type of architecture cannot easily provide interfaces for incorporating linguistic knowledge effectively.
HyghTra has hence adopted the opposite approach:
Integrating the information obtained from corpora through statistical approaches with a rule-based translation system as the CORE.
If the rule-based translation system that serves as the framework of the program is highly modularized and linguistically sound, the hybrid MT system has great potential to deliver high-quality results.
The HyghTra program has integrated statistical models in the following aspects:
1. Build and use comparable corpora, and then extract parallel resources from the corpora;
2. Extend the dictionaries through GIZA++ and parallel corpora;
3. Use corpus-based statistical approaches to develop and maintain the ‘rules’ of the rule-based MT;
4. Use automatic evaluation technology to assess the most commonly used syntactic structures and the quality of multiword expressions (MWEs);
5. Explore correct translations from the parallel corpus, to analyze the key syntactic structures that may influence system quality.
More background on: SPE
In recent years, there has been a great deal of research on domain adaptation for rule-based translation systems using the SPE (Statistical Post-Editing) approach, as well as reports of this technology being used to improve the performance of commercial MT systems.
In 2012, R. Rubino et al. put forward the theoretical basis for using the SPE approach to achieve domain adaptation for machine translation systems, believing that
“…. Most human activities involve certain language or the language of a specific domain, while a certain domain may have its unique terminologies, syntax and text structural characterization. It is unreasonable to build a unique translation system for each sub-domain, but we believe that using SPE technology to achieve the domain adaptation for out-of-domain translation system may be a solution to diversify specific domains”.
The post-editing-specific MT evaluation criteria used by Symantec have been applied to evaluate the translation quality of the Systran SPE system. The results show that the acceptability of Japanese-English translation, which benefits most from SPE, increased by 28.17%.
Ki-Young Lee et al. suggested using SPE to improve the fluency of English-Korean translation, aiming to use the knowledge obtained from SPE to bring RBMT translations closer to human reference translations. Their test results show that SPE can effectively improve the translation of sentences for which the RBMT system produces no morphological or syntactic analysis errors.
3. The Combination of Statistical and Knowledge Models: A Rule-based oriented Hybrid MT System by CCID
The core of CCID’s “rule-based oriented hybrid machine translation system” is the practical rule-based translation system that we developed in the first decade of the 21st century. The core system integrates micro-engine technologies such as a shallow parser and translation template matching.
Word segmentation, named entity recognition and shallow syntactic parsing technologies that combine rules and statistics have also been adopted, so that the translations generated by the system have sound syntactic structures and semantic expressions, which are crucial requirements for the R&D of a hybrid machine translation system.
Rule-based backbone based on a well-tested ‘core’ (863 expert group evaluation result)
[Chart: Adequacy and Fluency scores; entries are CCID and five masked (***) entries; values not recoverable]
Figure 1 shows the schematic overview of the architecture of the hybrid MT system that combines statistical and knowledge models.
Three major sections:
• The core translation system based on linguistic knowledge models (upper left);
• The bilingual resource extraction module, which uses statistical models to extract bilingual resources from parallel and comparable corpora and then integrates them into the RBMT core system (lower left);
• The phrase-based (PBMT) SPE system integrated into the HMT (right).
Figure 1 Schematic overview of the architecture of the hybrid MT system combining statistical models and knowledge models
The three key technologies of CCID’s HMT:
1. extracting terms and lexicon entries from parallel corpora;
2. extracting native English multiword expressions from the three-tuple comparable corpora;
3. enabling the SPE system to achieve domain adaptation for MT.
3.1 Extract terms and lexicon entries from parallel corpora
As shown in Figure 2, this method includes three steps:
• Establish a phrase table from the parallel corpora;
• Filter candidate terms to select those that conform to grammatical norms;
• Manually post-edit the resulting terminology table to guarantee the correctness of the POS and attribute annotations of the terms.
[Flowchart: Parallel corpora -> Create phrase tables (pre-processing: tokenization, lower-case conversion, Chinese word segmentation; use GIZA++ to obtain the alignment model; create phrase tables with Moses) -> Filter term candidates (frequency filter, linguistic filter, lexicon filter) -> Post-edit (confirm the correctness of the lemma and annotation) -> Domain-specific glossary]
Figure 2 Procedures to extract terminology from parallel corpora
3.1.1 Creating the phrase table
The procedures include:
• Pre-processing. English data was tokenized and converted to lower case, and the Chinese text was segmented into words using CCID segmentation tools.
• Word alignment. GIZA++ was used for word alignment.
• Phrase table extraction. The Moses toolkit was used to build a phrase table on the basis of the alignment model.
3.1.2 Filter candidates
Three filters are applied:
• Frequency filter: only phrases with a frequency and translation probability above a given threshold are considered as term candidates.
• Linguistic filter: only phrases with certain linguistic properties are accepted. Variants of the accepted patterns can also be candidates.
• Lexicon filter: remove candidates that already exist in the system, candidates that are not domain-specific, and those no longer in use.
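The frequency and lexicon filters can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CCID's implementation: it assumes a Moses-style text phrase table with fields separated by `|||` (source, target, scores, alignment, counts), that the third score is the direct translation probability, and that the last count is the joint count. These field positions should be checked against the Moses version in use. The names `parse_phrase_table_line`, `filter_candidates` and `is_well_formed` are hypothetical, and the linguistic filter is left as a pluggable predicate because the slides do not list the accepted patterns.

```python
def parse_phrase_table_line(line):
    """Parse one line of an assumed Moses-style text phrase table."""
    fields = [field.strip() for field in line.split("|||")]
    scores = [float(value) for value in fields[2].split()]
    counts = []
    if len(fields) > 4:
        counts = [float(value) for value in fields[4].split()]
    return {
        "source": fields[0],
        "target": fields[1],
        # Assumption: the third score is p(target | source).
        "direct_prob": scores[2],
        # Assumption: the last count is the joint phrase-pair count.
        "joint_count": counts[-1] if counts else 0.0,
    }


def filter_candidates(entries, min_count, min_prob, known_terms,
                      is_well_formed=lambda entry: True):
    """Apply the frequency, linguistic and lexicon filters in that order."""
    kept = []
    for entry in entries:
        # Frequency filter: both the count and the probability must pass.
        if entry["joint_count"] < min_count or entry["direct_prob"] < min_prob:
            continue
        # Linguistic filter: hypothetical predicate supplied by the caller.
        if not is_well_formed(entry):
            continue
        # Lexicon filter: drop terms the system lexicon already contains.
        if entry["source"] in known_terms:
            continue
        kept.append(entry)
    return kept
```

The thresholds `min_count` and `min_prob` are free parameters here; the slides state only that "a given threshold" is used.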
3.2 Extract multiword expressions in native English from the three-tuple comparable corpora
The translated side of a parallel corpus usually distorts the native language (translationese). This issue is even more obvious in Chinese-English parallel corpora. For example (see Table 2):
“电子政务建设”: e-government construction
“二手资料”: second-hand data
“重要意义”: important significance
Our approach is to set up three-tuple comparable corpora, on the scale of a million sentence pairs, consisting of standard Chinese, native English and Chinglish.
Statistical approaches are then used to analyze the overuse and underuse of key word clusters at the word level.
The log-likelihood (LL) statistic is calculated to quantify the significance of the difference for each key word cluster. Based on the variation of LL (see Table 1), the distinctive features of Chinglish and native English multiword expressions can be detected.
Native English MWEs are then extracted to improve the performance of the machine translation system.
As shown in Table 2, the key word cluster with the largest log-likelihood is at the top of the table, indicating a considerable difference in the frequency distribution of this word cluster between native English and Chinglish.
So far, over 56,000 native English multiword expressions have been extracted, and the accuracy and fluency of the system have improved significantly.
Table 2 Analysis on the Significance of Difference of MWE in Chinglish and Native English
(Frequencies are normalized frequencies in each corpus; each Chinese expression is shared by a pair of rows.)
Key words                   Chinglish   Native English   Overused (+) / Underused (-)   LL       Chinese expression
network bubble              515         36               +                              497.82   网络泡沫
Dot-com bubble              16          372              -                              404.52   网络泡沫
e-government construction   126         3                +                              150.34   电子政务建设
e-government development    9           120              -                              113.55   电子政务建设
second-hand data            62          0                +                              85.95    二手数据
indirect data               2           58               -                              65.64    二手数据
Olympic five rings          20          0                +                              27.72    奥运五环
The Olympic rings           0           24               -                              33.27    奥运五环
middle-sized                35          1                +                              40.77    中等大小
medium-sized                4           30               -                              22.50    中等大小
important significance      35          3                +                              31.69    重要意义
great significance          6           42               -                              30.37    重要意义
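The LL column can be reproduced with the standard two-corpus log-likelihood statistic. The sketch below assumes that the two normalized frequencies are expressed per the same corpus size, so both size arguments default to 1.0; the slides do not state the formula, so treat it as a plausible reconstruction. Under that assumption it reproduces the first row of Table 2 (515 versus 36 gives about 497.82). The function names are hypothetical, and the sign is assumed to be relative to the first (Chinglish) column, which is consistent with the rows shown.

```python
import math


def log_likelihood(freq_a, freq_b, size_a=1.0, size_b=1.0):
    """Two-corpus log-likelihood for one word cluster.

    freq_a, freq_b: observed frequencies in corpus A and corpus B.
    size_a, size_b: corpus sizes; equal sizes are assumed by default
    because the table reports normalized frequencies.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        # A zero observation contributes nothing (limit of x * log x).
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2.0 * ll


def usage_sign(freq_a, freq_b, size_a=1.0, size_b=1.0):
    """Return '+' if corpus A overuses the cluster relative to B, else '-'."""
    return "+" if freq_a / size_a > freq_b / size_b else "-"
```

For example, `log_likelihood(62, 0)` handles the zero count in the "second-hand data" row without a division or logarithm error.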
3.3 SPE System
Using SPE technology to achieve domain adaptation for an out-of-domain translation system is one of the major solutions for adapting MT to diverse domains.
In manual post-editing (PE), human translators edit the output of the machine translation system.
In SPE, the texts edited by translators, namely the reference translations, are used to train the system, enabling it to correct the output of the original MT system automatically.
During this process, the SPE system learns to fix the mistakes that the MT system makes repeatedly, as well as the inherent lexical choice defects of RBMT.
Using the SPE system to achieve domain adaptation for machine translation includes the following four steps:
• Construct a parallel corpus using the RBMT output translations as source text and the post-edited reference translations as target text. This corpus is used to train the SMT models that make up the SPE system.
• Use SMT toolkits to build the translation model and language model for the SPE system.
• Apply the output of the RBMT system to the SPE decoder.
• The SPE system delivers the corrected (post-edited) text.
* Two translation steps are performed: first, the source text is translated by the RBMT system into an intermediate target-language text, which is in turn translated by the SPE module into the post-edited target-language text (see Figure 3).
Figure 3 Statistical post-editing architecture
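The data flow of the steps above can be sketched as follows. This is a minimal illustration, not the CCID system: `build_spe_training_corpus` and `two_step_translate` are hypothetical names, and the `rbmt` and `spe` arguments stand in for the real engines as plain callables. Training the SPE models themselves (word alignment, phrase extraction, language modeling) is left to the SMT toolkits and is not shown.

```python
def build_spe_training_corpus(rbmt_outputs, post_edited):
    """Pair each RBMT output (SPE source side) with its post-edited
    reference (SPE target side), skipping pairs with an empty side."""
    if len(rbmt_outputs) != len(post_edited):
        raise ValueError("RBMT outputs and post-edited references must align 1:1")
    pairs = []
    for machine_text, edited_text in zip(rbmt_outputs, post_edited):
        machine_text = machine_text.strip()
        edited_text = edited_text.strip()
        if machine_text and edited_text:
            pairs.append((machine_text, edited_text))
    return pairs


def two_step_translate(source_text, rbmt, spe):
    """Run the two translation steps: source -> RBMT draft -> SPE output.

    rbmt and spe are assumed to be callables mapping a string to a string.
    """
    intermediate = rbmt(source_text)   # step 1: rule-based translation
    return spe(intermediate)           # step 2: statistical post-editing
```

Keeping the two engines behind simple callables means the RBMT core stays unchanged while the SPE stage is retrained for each new domain.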
The above process can be used to customize an RBMT system conveniently and to adapt it to a specific domain.
Currently, we are improving the SPE system so that it takes as input not only the baseline RBMT output but also the source-language text.
There are many ways to feed in source-language information, for example by introducing into the log-linear model feature functions that relate the target-language phrase to the source-language text.
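One way to write this down, as a sketch in our own notation rather than the formulation used in the system: let f be the source sentence, e' the RBMT output, and e a candidate post-edited translation. A standard phrase-based SPE decoder scores e against e' only; the extension adds feature functions that also look at f. The symbols h_i, g_j, lambda_i and mu_j are assumptions introduced for illustration.

```latex
% Baseline SPE: features h_i depend on the RBMT output e' and the candidate e
\hat{e} = \arg\max_{e} \sum_{i=1}^{n} \lambda_i \, h_i(e, e')

% Extended SPE: additional features g_j relate the candidate e to the source f
\hat{e} = \arg\max_{e} \left[ \sum_{i=1}^{n} \lambda_i \, h_i(e, e')
          + \sum_{j=1}^{m} \mu_j \, g_j(e, f) \right]
```

In both cases the weights would be tuned on held-out data, for example with MERT as in the system setup described in section 4.1.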
4. System Development, Performance Evaluation and Major Applications
4.1 System Development
• Before implementation, CCID had independently developed a Chinese segmentation tool and an English tokenization tool to process the Chinese and English corpora.
• Giza-pp v.1.0.1 was used in the prototype stage for word alignment.
• Moses v0.91 was used as the SPE component with a phrase-based translation model; the distortion limit for reordering was set to 6, and the model weights were tuned with MERT.
• The SRILM v1.7.0 toolkit was used to build a standard 5-gram language model (LM) with Kneser-Ney smoothing.
• The training corpora were on the scale of a million sentence pairs in the ICT domain.
4.2 System performance evaluation
Table 3 shows the SPE experimental results for the Chinese-English and English-Chinese RBMT systems.
Table 4 and Table 5 provide examples of SPE output for Chinese-English RBMT and English-Chinese RBMT respectively.
Language             System       TER    BLEU   Acceptability
Chinese-to-English   RBMT         0.71   0.25   0.63
Chinese-to-English   RBMT + SPE   0.57   0.32   0.71
English-to-Chinese   RBMT         0.67   0.32   0.61
English-to-Chinese   RBMT + SPE   0.46   0.50   0.80
Table 3: Experimental Results. For TER, lower (error) is better, while for BLEU and Acceptability, higher (score) is better.
Table 4 Example of SPE output for Chinese-to-English RBMT
Source: 培养一大批知识产权的专业人才和管理人才,以及一大批既了解专利有关知识,又懂得法律的志愿者。
Reference translation: Cultivate a large number of professionals and management staff for intellectual property, and a large number of volunteers well-versed in both patent knowledge and legal knowledge.
RBMT: Cultivate large quantities of the professional and the management staff intellectual properties, and large quantities of volunteer both understood the patent relevant knowledge, understand the law.
With SPE: Cultivate a large number of professionals and management staff for intellectual properties, and a large number of volunteer both understood the patent relevant knowledge, understand the law.
Table 5 Example of SPE output for English-to-Chinese RBMT
Source: To remove a Blank Back Plate , pull the back plate pin to release the Blank Back Plate , and remove the Blank Back Plate from the rear panel slot.
Reference translation: 要卸下填充板背板,请拉出背板引脚以松开填充板背板,然后从后面板插槽中卸下该背板。
RBMT: 要删除一个空白的信号板,请拉进信号板别针发布空白的信号板,并从后面板插槽中删除空白的信号板。
With SPE: 要卸下填充板背板,请拉出背板引脚松开填充板背板,然后从后面板插槽中卸下填充板背板。
The following example compares CCID’s English-Chinese machine translation system with a popular online translation system. It further illustrates that a hybrid machine translation system combining linguistic knowledge models and data-driven statistical approaches is far superior to a statistical machine translation system in handling long-distance reordering and in generating authentic syntactic and semantic structures.
Source sentence input to the systems:
GOES-P will be launched on board a United Launch Alliance Delta IV (4, 2) launch vehicle under a FAA commercial license.
Translation delivered by the hybrid machine translation system (test conducted in July 2014):
在美国联邦航空局商业许可下,联合发射同盟的一枚德尔塔 -4 ( 4 , 2 )运载火箭将搭载 GOES-P 卫星发射升空。
Translation delivered by XXX online translation (test conducted in July 2014):
GOES-P 将在船上联合发射联盟德尔塔 IV ( 4 , 2 )根据美国联邦航空局商业许可运载火箭发射。
4.3 Major applications
Due to its performance advantage, CCID’s RBMT-guided hybrid machine translation system has been applied in many key public-sector projects:
• Beijing Summer Olympic Games
• China-EU Information Society Program
• Skynet Program of the Ministry of Public Security
In addition, CA, Symantec and other multinational corporations have embedded the system into their corporate localization workflows.
5. Future Plan
For SPE-based HMT R&D, consider how to use linguistic models to select the sentences and phrases that can be improved through SPE. Selecting post-editing targets in this way can minimize the degradation of translation quality.
We will continue to work with Lancaster University, researching how to use MAM (多重相关性度量, multiple association measures) to enhance the accuracy and recall of MWEs, and how to use the USAS semantic analysis system to improve the WSD ability of the MT system.
Other Projects:
• German <-> Chinese MT engine (Hybrid MT with Lingenio, PONS)
• Text Mining of the Strategy Contents of Corporate Annual Reports
Thank You