Bilingual Terminology Extraction from TMX: A State-of-the-Art Overview
Chelo Vargas-Sierra, PhD
University of Alicante, Spain
INDEX
Main points of this presentation
1. Key words: overview of terms involved in the process
2. Terminology and extractors: terminology management, its timeline, BATE (approaches, state of the art)
3. Evaluation: BATE under evaluation, measures for accuracy, quality in use model and tasks
4. Results: precision & recall, parameters & questionnaire
KEY WORDS
Terms involved in the process
- Parallel corpus: TMX
- Alignment levels: paragraph, sentence and word level
- ATE & BATE
- Precision/Recall: getting only terms and all terms
- Gold standard: exhaustive, manually created bilingual glossary
- Validation: term validation facility; which term candidates (TCs) are real terms?
- Usability: software used to achieve user’s objectives with effectiveness, efficiency, and satisfaction
- Quality in use model: ISO standard
2. Terminology & Extractors
IMPORTANCE OF TERMINOLOGY
Translators were the first professionals to be aware of term-related issues
- IDENTIFY: identify and interpret the terminology in the source text adequately
- FIND: find and use proper documentation and information resources
- RETRIEVE: retrieve and store terminological data
TERMINOLOGY MANAGEMENT
- +40%: time spent to solve terminological problems in specialized translation (Arntz 1993, Walker 1993).
- Managing terminology (extracting, validating, importing, adding, editing, deleting, revising, updating, exporting, publishing) is a time-consuming process.
- Terminology work happens “backstage”, and customers or employers may not be fully aware of its benefits for QA.
- 90%: Return on Investment (ROI) on terminology management reported by some corporate studies (Childress, 2007; Popiolek, 2015).
TERMINOLOGY MANAGEMENT
General model for project terminology creation (Popiolek, 2015: 351)

Extraction (monolingual extraction & validation)
• List of terms extracted from ST
• List of terms to validate (accept or reject)

Translation (importing & looking for equivalents)
• List is added to a termbase
• List is translated and additional data added

Approval
• List approved by a person in charge of terminology
• When the client has requested it, there is an additional step for client approval
TIMELINE in Terminology Management with bilingual extraction

Preparation: TMX import
- Preparing the files and importing them into the BATE

Bilingual extraction
- List of candidate term pairs extracted from TMX

Validation (& data entry)
- List of term pairs to validate (accept or reject terms and suggested equivalents)
- Term by term, the pairs and additional data are added to a term base (Synchroterm)

Export/Import
- Export bilingual terms and additional data in an available file format (.xls, .txt, .TBX, …)
- Import output file to a TDB system (to be integrated into an MT system)

Approval
- Person in charge of terminology or client

Finish
- Ready to use
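To make the preparation step more concrete, here is a minimal sketch of what reading a TMX file involves before any extraction can run. It is not how any of the tools named in this presentation import TMX. The function names `read_tmx_pairs` and `_pick` and the sample file are hypothetical, and the sample omits the header attributes a full TMX file would carry.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _pick(segs, lang):
    """Match a short code such as 'en' against 'en' or 'en-gb'."""
    lang = lang.lower()
    for code, text in segs.items():
        if code == lang or code.startswith(lang + "-"):
            return text
    return None


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per aligned segment
        segs = {}
        for tuv in tu.findall("tuv"):
            # Newer TMX files use xml:lang; older ones may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text found inside inline markup elements.
                segs[lang] = "".join(seg.itertext()).strip()
        src, tgt = _pick(segs, src_lang), _pick(segs, tgt_lang)
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Simplified, hypothetical sample (header attributes omitted for brevity).
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Check the hydraulic pump.</seg></tuv>
      <tuv xml:lang="es"><seg>Compruebe la bomba hidráulica.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-GB"><seg>Replace the <bpt i="1"/>oil filter<ept i="1"/>.</seg></tuv>
      <tuv xml:lang="es-ES"><seg>Sustituya el filtro de aceite.</seg></tuv>
    </tu>
  </body>
</tmx>"""

for source, target in read_tmx_pairs(SAMPLE_TMX, "en", "es"):
    print(source, "|", target)
```

The output of this step is the sentence-aligned material from which candidate term pairs are then extracted.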
Bilingual Automatic Term Extractors
Two approaches (Foo, 2012)

EXTRACT-ALIGN
1st step: monolingual terminology extraction in both languages.
2nd step: cross-linguistic matching using word-alignment or co-occurrence statistics to find equivalents.
Commercial systems follow this approach.

ALIGN-FILTER
1st step: word-alignment on the parallel texts.
2nd step: rank the aligned units to finally select the most likely pair of candidates (statistics).
TExSIS (Macken et al., 2013)
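The second step of the extract-align approach can be illustrated with a toy example based on co-occurrence statistics. This sketch assumes the monolingual candidate lists already exist (step 1 is not shown) and uses the Dice coefficient as the co-occurrence measure. That measure is chosen here only for illustration and is not attributed to any of the systems mentioned. The function name `dice_alignment`, the substring matching and the sample data are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_alignment(pairs, src_terms, tgt_terms):
    """Pair each source term with the target term it co-occurs with most (Dice)."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_seg, tgt_seg in pairs:
        # Naive substring matching; a real system would tokenize and lemmatize.
        s_found = {t for t in src_terms if t in src_seg.lower()}
        t_found = {t for t in tgt_terms if t in tgt_seg.lower()}
        src_freq.update(s_found)
        tgt_freq.update(t_found)
        joint.update(product(s_found, t_found))  # terms seen in the same aligned pair

    result = {}
    for s in src_terms:
        # Dice = 2 * joint count / (source count + target count)
        scored = [
            (2 * joint[(s, t)] / (src_freq[s] + tgt_freq[t]), t)
            for t in tgt_terms
            if joint[(s, t)]
        ]
        if scored:
            score, best = max(scored)
            result[s] = (best, round(score, 2))
    return result


# Hypothetical aligned segments and monolingual candidate lists.
segments = [
    ("Check the hydraulic pump.", "Compruebe la bomba hidráulica."),
    ("The hydraulic pump feeds the oil filter.",
     "La bomba hidráulica alimenta el filtro de aceite."),
    ("Replace the oil filter.", "Sustituya el filtro de aceite."),
]
english = ["hydraulic pump", "oil filter"]
spanish = ["bomba hidráulica", "filtro de aceite"]

print(dice_alignment(segments, english, spanish))
```

An align-filter system would reverse the order: align words across the parallel text first, then rank and filter the aligned units.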
Bilingual Automatic Term Extractors
Academic / In-house

90s
- English-French, TERMIGHT (Dagan & Church, 1994)
- English-French (Kupiec, 1993)
- English-Dutch (Eijk, 1993)
- English-French (Gaussier, 1995)
- English and Swedish (Ahrenberg et al., 1998)

2000-2009
- French-German (Blank, 2000)
- Japanese-English, MNH (Nakagawa & Mori, 2003)
- Spanish-Basque, Elexbi (Hernaiz et al., 2006), from a TMX
- Spanish-German, Autoterm (Haller, 2008)
- English-Spanish, Mutual Bilingual Term Extractor (Ha et al., 2008)
- French-English, French-Italian and French-Dutch (Lefever et al., 2009)

2010-2016
- French-Japanese (Morin et al., 2010, from ACABIT, Daille, 2003): not bilingual, but multilingual
- Slovene and English, Luiz (Vintar, 2010)
- English and Swedish, ITools suite (Foo & Merkel, 2010)
- English and German (Gojun et al., 2012)
- English, French, German, Spanish, TTC TermSuite (Daille, 2012)
- English-Spanish, TBXTools (Oliver & Vázquez, 2015) (under development)
- Chinese, Czech, Dutch, English, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish: Sketch Engine (Baisa et al., 2015; Kovář et al., 2016)
Bilingual Automatic Term Extractors
Other BATE (free / commercial), 90s to 2010-2016

MONOLINGUAL ATE
- TermExtractor (Shimohata et al., 2001)
- MemoQ's built-in term extractor
- Déjà Vu - Lexicon
- TermoStat Web: http://termostat.ling.umontreal.ca/
- Yate (IULA)
- Okapi
- TerMine: http://www.nactem.ac.uk/software/termine/
- TerminologyExtractor: https://goo.gl/yA2Cuf
- PRoMT
- FiveFilters (web-based): http://fivefilters.org/term-extraction/
- Concordance programs: WordSmith Tools, AntConc (free), …

BILINGUAL
- Xerox Terminology Suite (2001)
- SDL Multiterm Extract
- Synchroterm
- CrossMining (Across)
- MultiTrans Term Extractor
- Similis™ (by Lingua et Machina™)
- Anchovy (by Swordfish)
- Araya Term Extractor
- Analysis software: Sketch Engine (terminology extraction from TMX)
3. Evaluation
BATE UNDER EVALUATION
- Sketch Engine
- SIMILIS
- Multiterm Extract
- Synchroterm
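MAIN FEATURES
Tools compared: Multiterm Extract, SynchroTerm, Similis, SkE, Araya
Features compared:
- Import TMX
- Extraction config.
- Extraction scores
- Validation facility
- Term base indexation
- Export to TBX (xls, txt…)
Cell entries surviving from the comparison table: "Trados TMX", "Others", "Others".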
MEASURES FOR ACCURACY

             EXTRACTED   NON-EXTRACTED
TERMS            A             B
NO TERMS         C             D

$$\text{RECALL} = \frac{A}{A+B} \qquad \text{PRECISION} = \frac{A}{A+C}$$
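The two measures can be checked with a few lines of code. This is a minimal sketch in which the function name `precision_recall` is hypothetical. The counts in the usage lines are the Sketch Engine and Similis rows reported in the Results section of this presentation (A = extracted terms, B = non-extracted terms, C = extracted non-terms).

```python
def precision_recall(a, b, c):
    """Return (precision, recall) in percent.

    a: extracted candidates that are real terms
    b: real terms that were not extracted
    c: extracted candidates that are not terms
    """
    precision = 100 * a / (a + c) if a + c else 0.0
    recall = 100 * a / (a + b) if a + b else 0.0
    return round(precision, 2), round(recall, 2)


# Sketch Engine row of the results table: A = 283, B = 370, C = 717
print(precision_recall(283, 370, 717))  # (28.3, 43.34)

# Similis row of the results table: A = 337, B = 316, C = 405
print(precision_recall(337, 316, 405))  # (45.42, 51.61)
```

D (non-extracted non-terms) does not enter either formula, which is why it can be left unmeasured.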
QUALITY IN USE MODEL
Characteristics (ISO/IEC 25010:2011)
- Effectiveness: accuracy and completeness with which user achieves objectives
- Efficiency: resources expended in relation to the accuracy and completeness
- Satisfaction: degree to which user needs are satisfied when a software is used in a specified context of use
- Freedom from risk: no risk for the security of users, software, context or the environment
- Context coverage: degree to which the product understands the complete context of its usage; flexibility
6 TASKS TO EVALUATE
when performing bilingual extraction
1. CONFIGURATION: setting up the extraction project
2. TMX IMPORT: importing the source file
3. EXTRACTION: performing the extraction to get a bilingual list
4. VALIDATION: selecting the real terms
5. RECORD CREATION: creating and managing term entries
6. EXPORTATION: exporting the final result for later use in CAT systems
4. Results
PRECISION & RECALL IN %
(Bar chart: precision and recall for Sketch, MTE, Synchr and Similis; scale 0 to 70.)

            EXTRACTED            NON-EXTRACTED
            TERMS    NO TERMS    TERMS    NO TERMS    GOLD STANDARD    TCs     PRECISION    RECALL
            (A)      (C)         (B)      (D)
Sketch      283      717         370      -           653              1000    28,30        43,34
MTE         97       813         556      -           653              910     10,66        14,85
SynchroT.   407      1505        246      -           653              1912    21,29        62,33
Similis     337      405         316      -           653              742     45,42        51,61
PARAMETERS
Characteristics and sub-characteristics to be measured, with their metrics

EFFECTIVENESS. Value between 0 (minimum) and 5 (maximum): (EFE1+EFE2+EFE3)/3
- EFE1. Degree of accuracy, precision of tasks & results: (P1+P7+P13+P19+P25+P31)/6
- EFE2. Degree of completeness (tasks are accomplished and results are not missing): (P2+P8+P14+P20+P26+P32)/6
- EFE3. Frequency of errors: (P3+P9+P15+P21+P27+P33)/6

EFFICIENCY. Value between 0 (minimum) and 5 (maximum): (EFI2+EFI3+EFI4)/3
- EFI1. Time spent in the accomplishment of the task: (TM1+TM2+TM3+TM4+TM5+TM6)
- EFI2. Need to use additional sources (material, software, etc.) for the task: (P4+P10+P16+P22+P28+P34)/6
- EFI3. Productivity, effort exerted by the user to carry out the task: (P5+P11+P17+P23+P29+P35)/6
- EFI4. Need to consult the software Help to perform the task: (P6+P12+P18+P24+P30+P36)/6

SATISFACTION. Value between 0 (minimum) and 5 (maximum): (P37+P38+P39)/3
- SAT1. Usefulness
- SAT2. Trust
- SAT3. Pleasure

CONTEXT COVERAGE. Value between 0 (minimum) and 5 (maximum): (P40+P41+P42)/3
- COB1. Context of use
- COB2. Flexibility
QUESTIONNAIRE
42 questions grouped by tasks
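The metrics defined under PARAMETERS can be turned into a small scoring routine. This is a sketch under stated assumptions: questions P1 to P36 are taken to run in blocks of six per task, as the formulas suggest, and the total is assumed to be a plain sum of the four characteristic scores, which is consistent with the totals shown in the final results. The function names `mean_of` and `quality_in_use` and the sample answers are hypothetical, and EFI1 (time) is left out because the efficiency formula does not include it.

```python
TASKS = 6  # configuration, TMX import, extraction, validation, record creation, export


def mean_of(answers, offset):
    """Average P(offset), P(offset+6), ... over the six tasks."""
    return sum(answers[offset + 6 * t] for t in range(TASKS)) / TASKS


def quality_in_use(answers):
    """Score the four characteristics (0 to 5 each) from answers P1..P42."""
    efe = [mean_of(answers, k) for k in (1, 2, 3)]  # EFE1, EFE2, EFE3
    efi = [mean_of(answers, k) for k in (4, 5, 6)]  # EFI2, EFI3, EFI4
    scores = {
        "effectiveness": sum(efe) / 3,
        "efficiency": sum(efi) / 3,
        "satisfaction": sum(answers[p] for p in (37, 38, 39)) / 3,
        "context_coverage": sum(answers[p] for p in (40, 41, 42)) / 3,
    }
    # Assumption: the total is the sum of the four characteristic scores.
    scores["total_qiu"] = sum(scores.values())
    return scores


# Hypothetical questionnaire: every question answered with 4, except P37.
sample_answers = {p: 4 for p in range(1, 43)}
sample_answers[37] = 1

for name, value in quality_in_use(sample_answers).items():
    print(f"{name}: {value:.2f}")
```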
RESULTS FOR EXTRACTION & VALIDATION
(Bar chart; series: Sketch, MTE, Synchr, Similis; categories: EXTRACTION, VALIDATION; scale 0 to 30. Data labels in extraction order: 16, 13, 14, 25, 20, 26, 21, 24.)
FINAL RESULTS FOR QUALITY IN USE
(Bar chart; scale 0 to 18.)

           EFFECTIVENESS   EFFICIENCY   SATISFACTION   CONTEXT COVERAGE   TOTAL QIU
Sketch     3,33            3,00         4,00           3,50               13,83
MTE        4,06            4,44         3,00           1,50               13,00
Synchr     4,11            4,22         4,33           3,00               15,67
Similis    3,72            3,11         3,00           3,00               12,83
CONCLUSIONS
• Managing terminology still takes a lot of time and effort, even in this increasingly computerized profession.
• Research on automatic terminology extraction has been around for more than 20 years and significant enhancements concerning bilingual extraction and bilingual corpora exploitation have been introduced.
• I briefly described the BATE under evaluation and illustrated some results obtained for accuracy and with the QIU model.
• Results make it clear that much more work has to be done for BATE to be considered of real help to translators and terminologists, mainly due to poor accuracy results.
Some references
• Baisa, Vít, Barbora Ulipová, and Michal Cukr. 2015. “Bilingual Terminology Extraction in Sketch Engine.” In 9th Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2015), 61–67.
• Childress, Mark D. 2007. “Terminology Work Saves More Time than It Cost.” Multilingual, no. April/May: 43–46.
• Foo, Jody. 2012. Computational Terminology: Exploring Bilingual and Monolingual Term Extraction.
• Foo, Jody, and Magnus Merkel. 2010. “Computer Aided Term Bank Creation and Standardization: Building Standardized Term Banks through Automated Term Extraction and Advanced Editing Tools.” In Terminology in Everyday Life, edited by Marcel Thelen and Frieda Steurs, 163–80. John Benjamins Publishing Company. doi:10.1075/tlrp.13.12foo.
• Kovář, Vojtěch, Vít Baisa, and Miloš Jakubíček. 2016. “Sketch Engine for Bilingual Lexicography.” International Journal of Lexicography 29 (3): 339–52. doi:10.1093/ijl/ecw029.
• Macken, Lieve, Els Lefever, and Veronique Hoste. 2013. “TExSIS: Bilingual Terminology Extraction from Parallel Corpora Using Chunk-Based Alignment.” Terminology 19 (2013): 1–30. doi:10.1075/term.19.1.01mac.
• Oliver, Antoni, and M. Vazquez. 2015. “TBXTools: A Free, Fast and Flexible Tool for Automatic Terminology Extraction.” In Proceedings of Recent Advances in Natural Language Processing, 473–79.
• Popiolek, Monika. 2015. “Terminology Management within a Translation Quality Assurance Process.” In Handbook of Terminology (Volume 1), edited by Hendrik J Kockaert and Frieda Steurs, 341–59. John Benjamins Publishing Company. doi:10.1075/hot.1.ter6.
• Sauron, Véronique. 2002. “Tearing out the Terms: Evaluating Terms Extractors.” In Translating and the Computer 24: Proceedings from the Aslib Conference, 21-22 November 2002.
• Vintar, Špela. 2010. “Bilingual Term Recognition Revisited: The Bag-of-Equivalents Term Alignment Approach and Its Evaluation.” Terminology 16 (2010): 141–58. doi:10.1075/term.16.2.01vin.
University of Alicante
IULMA
Campus de San Vicente
Apdo. 99
03080 Alicante

Phone & Fax
Direct Line: +34 965903438
Fax: +34 [email protected]

Social Media
@chelovargas

Many thanks for your attention
Chelo Vargas-Sierra