Thinking Strategically About Content Destined for Machine Translation

DESCRIPTION

It's a fact. More and more organizations are exploring the benefits of machine translation. Some people believe automated translation will be ubiquitous, part of every system we use to interact with content: tablets, smartphones, mobile devices, computer kiosks, bank machines, consumer electronics, appliances, automobiles, trains, buses, and planes. But is our content ready for machines to process? Can automated translation systems produce the quality we desire? And if so, what do we need to do to prepare our content? Attend this session to learn how thinking strategically about your content can help you make smart decisions up front that provide big benefits downstream. Learn how creating structured, semantically rich content today can help you feed automated translation systems tomorrow.

Page 1: Thinking Strategically About Content Destined for Machine Translation

Thinking Strategically About Content Destined for Machine Translation

Val Swisher, Founder & CEO

© 2013. Content Rules, Inc. All rights reserved.

@contentrulesinc

Page 2: Thinking Strategically About Content Destined for Machine Translation

Who Am I?

Founder and CEO of Content Rules
25+ years in content arena
Specialty areas:
• Global content strategy
• Terminology management
• Content quality
• Single-sourcing / XML / DITA

Finishing third book, “Global Content Strategy,” due out in 2014

Page 3: Thinking Strategically About Content Destined for Machine Translation

What is Content Rules?

Professional services firm specializing in:
• Content strategy / Global content strategy
• Content creation
• Content quality / Global readiness

Based in Silicon Valley
Founded in 1994
Acrolinx Authorized Services Provider
Authorized provider of The Rockley Strategic Method™

Page 4: Thinking Strategically About Content Destined for Machine Translation

Global Readiness

Ensure content is translatable
• Readability
• Grammar and style
• Reuse

Evaluate and improve content quality using state-of-the-art tools
• Reports
• Metrics
• Recommendations
• Fixes

Save money on translation

Page 6: Thinking Strategically About Content Destined for Machine Translation

Today’s Presentation

• Importance of content
• Historic background
• Types of machine translation
• Content quality affects machine translation results
• Bleu scores
• Pre-editing instead of post-editing

Page 7: Thinking Strategically About Content Destined for Machine Translation

Content Is Important

87% of respondents to a recent CMO Council survey said that content had a moderate to major impact on their buying decisions

Page 8: Thinking Strategically About Content Destined for Machine Translation

Content Is A Strategic Asset

Page 9: Thinking Strategically About Content Destined for Machine Translation

What Does It Mean to be Strategic?

stra·te·gic

  [struh-tee-jik]  

adjective

1. pertaining to, characterized by, or of the nature of strategy: strategic movements.

2. important in or essential to strategy.

3. forming an integral part of a stratagem: a strategic move in a game of chess.

Page 10: Thinking Strategically About Content Destined for Machine Translation

Content Creation In the Past

Content wasn't so easy to create and distribute

Created by trained professionals

Only they had access to the content

Page 11: Thinking Strategically About Content Destined for Machine Translation

Content Creation Today

Everyone creates content

Very easy to distribute

Now, we have loads and loads of content

• Some of it good

• Some of it mediocre

• Some of it downright awful

Page 12: Thinking Strategically About Content Destined for Machine Translation

Translation In The Past

Content wasn't so easy to translate.

Trained professionals

Only they understood multiple languages well enough to translate content

Page 13: Thinking Strategically About Content Destined for Machine Translation

Translation Today

It is easy and free to translate content

We have loads and loads of translated content

• Some of it good

• Some of it mediocre

• Some of it downright awful

Page 14: Thinking Strategically About Content Destined for Machine Translation

More Machine Translation All The Time

Machine Translation (MT) is increasingly relied upon as a way to get fast, cost-effective translations

18.05% year-over-year growth of MT expected over the next 3 years*

We must pay more attention to the source content that goes into it

A machine cannot figure out what we meant to say based on what we actually wrote

Garbage In – Garbage Out

*http://www.researchandmarkets.com/research/2gpj3p/global_machine

Page 15: Thinking Strategically About Content Destined for Machine Translation

Source Content And Machine Translation

Types of MT engines and the effect of source content on them

What are Bleu scores

How quality of content affects MT output

Page 16: Thinking Strategically About Content Destined for Machine Translation

MT Engine Types

There are three types of MT engines:
1. Rule-based
2. Statistical
3. Hybrid

Page 17: Thinking Strategically About Content Destined for Machine Translation

Rule-Based MT (RBMT)

• Uses linguistic rules
• Extensive use of bilingual dictionaries
• Transfers structure of source language into target language
• Results are literal translations based on rules
• Does not handle ambiguity well (word or phrase having more than one meaning)

Page 18: Thinking Strategically About Content Destined for Machine Translation

Statistical MT (SMT)

• Based on analysis of content
• Engine trained over time
• More content = better results
• Need at least 2,000,000 words per domain
• Better quality content = better results
• Results are more natural translations, based on previous source | destination pairs
• Example: Google Translate

Page 19: Thinking Strategically About Content Destined for Machine Translation

Hybrid

• Combines rule-based and statistical
• Provides predictability and consistency of RBMT
• Provides fluency and flexibility of SMT
• Reduces the amount of data needed to train the engine

Page 20: Thinking Strategically About Content Destined for Machine Translation

Training The SMT Beast

Training SMT software is extremely important

Poor quality source = poor quality translations

Some companies have such poorly trained MT engines that fixing the content first is actually not an option

The engine has been trained to translate poor quality source

Page 21: Thinking Strategically About Content Destined for Machine Translation

The Effect Of Poor Content On SMT And Hybrid MT

• Poor or unpredictable translations
• Increased time to retrain the system with correct information
• Increased post-editing, per language
• Wasted money

Page 22: Thinking Strategically About Content Destined for Machine Translation

Evaluating MT Precision - Bleu Scores

Introduced in 2002 by the IBM T. J. Watson Research Center

Automatic evaluation metric used to compare MT output with reference human translation

“The closer a machine translation is to a professional human translation, the better it is.” *

Metric widely used throughout the industry

*http://acl.ldc.upenn.edu/P/P02/P02-1040.pdf
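
To make the idea of comparing MT output with a reference translation concrete, here is a minimal sketch of a Bleu-style score. It is a simplification, not the full metric from the cited paper: it assumes a single reference, whitespace tokenization, n-grams up to length 2, and no smoothing. The function names are illustrative, not taken from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu_sketch(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions for n = 1..max_n, multiplied by
    a brevity penalty. Assumes one reference and no smoothing."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    # Candidates shorter than the reference are penalized.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
machine_output = "the cat sat on the mat".split()
score = bleu_sketch(machine_output, reference)
```

With these two sentences the sketch scores about 0.71: five of six words and three of five word pairs match the reference. An identical translation scores 1.0.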

Page 23: Thinking Strategically About Content Destined for Machine Translation

Bleu Scores – Helpful Or Hype?

According to Callison-Burch, Osborne, and Koehn of the School of Informatics, University of Edinburgh, Bleu scores have many issues*:

• Synonyms and paraphrases difficult to score
• All words are weighted equally
• Difficult to calculate

*http://homepages.inf.ed.ac.uk/pkoehn/publications/bleu2006.pdf
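
The first two issues can be illustrated with a small sketch. It uses only clipped unigram precision, which is one ingredient of Bleu rather than the full metric, and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Share of candidate words that also appear in the reference,
    crediting each word at most as often as the reference contains it."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / total


reference = "do not delete the large file"
paraphrase = "do not delete the big file"    # same meaning, one synonym
reversal = "do now delete the large file"    # negation lost, one wrong word

synonym_score = unigram_precision(paraphrase, reference)
reversal_score = unigram_precision(reversal, reference)
```

Both candidates score 5/6. The harmless synonym is penalized exactly as much as the error that drops the negation, because every word carries the same weight and only exact matches count.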

Page 24: Thinking Strategically About Content Destined for Machine Translation

That’s Okay. We Can Post Edit.

Post-Edited Translations

Original Source Content

Page 25: Thinking Strategically About Content Destined for Machine Translation

Why Not Pre-Edit Instead?

• Fewer issues = less post-editing
• Save time
• Save money
• Improve quality

Page 26: Thinking Strategically About Content Destined for Machine Translation

Create Global-Ready Content

• Reduce word count
• Standardize terminology
• Enforce correct grammar
• Eliminate jargon and colloquialisms
• Increase reuse
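
One of these steps, standardizing terminology, lends itself to a simple automated check. The sketch below is only an illustration: the term base, its entries, and the function name are hypothetical, and a real terminology tool would do far more (inflections, context, part of speech).

```python
import re

# Hypothetical term base: preferred term -> discouraged variants.
TERM_BASE = {
    "sign in": ["log in", "log on", "login"],
    "select": ["click on", "hit"],
}


def find_variants(text, term_base=TERM_BASE):
    """Return (position, variant found, preferred term) for every
    discouraged variant in the text, in order of appearance."""
    findings = []
    for preferred, variants in term_base.items():
        for variant in variants:
            # Whole-word, case-insensitive match of the variant.
            pattern = r"\b" + re.escape(variant) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                findings.append((match.start(), match.group(0), preferred))
    return sorted(findings)


sample = "Log in to the portal, then click on Save."
issues = find_variants(sample)
```

For the sample sentence, the check flags "Log in" (preferred: "sign in") and "click on" (preferred: "select"). Fixing these once in the source avoids fixing them again in every target language.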

Page 27: Thinking Strategically About Content Destined for Machine Translation

Results of Pre-Editing

• Save money
• Improve quality
• Faster time to market
• Fewer in-country iterations
• Better translation consistency

Page 28: Thinking Strategically About Content Destined for Machine Translation

Summary

• Content is a strategic asset
• Machine translation is becoming more popular
• Poor quality content incorrectly trains MT engines
• Poor quality content results in increased post-editing
• Pre-editing saves money and time, and improves translation quality

Page 29: Thinking Strategically About Content Destined for Machine Translation

Val Swisher
[email protected]
@contentrulesinc

Page 30: Thinking Strategically About Content Destined for Machine Translation

Val Swisher CEO & [email protected]
@contentrulesinc

Page 32: Thinking Strategically About Content Destined for Machine Translation

Reduce word count

We recommend a maximum of 24 words per sentence for machine translation.

Long sentences are hard for people to understand. Imagine software having to parse all of those commas (half of which are probably missing or misplaced).
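
A check for the 24-word limit is easy to sketch. The version below assumes a naive sentence splitter (a break after ".", "!", or "?" followed by whitespace), which will misfire on abbreviations such as "e.g."; the function name is illustrative.

```python
import re

MAX_WORDS = 24  # the limit recommended above


def long_sentences(text, max_words=MAX_WORDS):
    """Return (word count, sentence) for each sentence over the limit."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged


short = "Restart the device."
too_long = " ".join(["word"] * 25) + "."
flagged = long_sentences(short + " " + too_long)
```

Only the 25-word sentence is flagged. A sentence of exactly 24 words passes.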

Page 33: Thinking Strategically About Content Destined for Machine Translation

Let's say we have 100,000 words of source content. We are going to translate the content into 14 languages. We will end up with 1.4 million words of content.

Let's say the 100,000 words contain all types of errors. We will have to post-edit and fix 1.4 million words on the other side.

Let's say we have to pay someone $.xx per word to post-edit the content. That's $.xx * 1,400,000 words.

If we paid $0.07 per word to pre-edit the content, we would have spent $7,000 for preparation.
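
The arithmetic on this slide can be sketched as follows. The post-editing rate is not given on the slide, so the sketch leaves it as a parameter and instead derives a break-even figure from the numbers that are given. The function names are illustrative.

```python
from decimal import Decimal


def translated_words(source_words, languages):
    """Total words produced when every source word goes into each language."""
    return source_words * languages


def pre_edit_cost(source_words, rate_per_word):
    """Cost of fixing the source once, before translation."""
    return source_words * Decimal(rate_per_word)


def post_edit_cost(source_words, languages, rate_per_word):
    """Cost of fixing the output afterwards, in every target language."""
    return translated_words(source_words, languages) * Decimal(rate_per_word)


SOURCE_WORDS = 100_000
LANGUAGES = 14

total_words = translated_words(SOURCE_WORDS, LANGUAGES)   # 1,400,000
preparation = pre_edit_cost(SOURCE_WORDS, "0.07")         # $7,000

# Derived, not from the slide: the saving per translated word at which
# pre-editing pays for itself.
break_even = preparation / total_words                    # $0.005
```

Under these numbers, pre-editing pays for itself if it reduces post-editing cost by as little as half a cent per translated word.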