Decipherment of Evasive or Encrypted Offensive Text

by

Zhelun Wu

B.Sc. (Hons.), Dalhousie University, 2014

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

in the
School of Computing Science
Faculty of Applied Sciences

© Zhelun Wu 2016
SIMON FRASER UNIVERSITY

Summer 2016

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, education, satire, parody, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.


Approval

Name: Zhelun Wu

Degree: Master of Science (Computing Science)

Title: Decipherment of Evasive or Encrypted Offensive Text

Examining Committee: Chair: Dr. Robert D. Cameron, Professor

Dr. Anoop Sarkar, Senior Supervisor, Professor

Dr. Fred Popowich, Supervisor, Professor

Dr. David Alexander Campbell, External Examiner; Associate Professor, Department of Statistics and Actuarial Science; Director, Management and System Science

Date Defended: 20 July 2016


Abstract

Automated filters are commonly used in online chat to stop users from sending malicious messages such as age-inappropriate language, bullying, and requests for personal information. Rule-based filtering systems are the most common way to deal with this problem, but people invent increasingly subtle ways to disguise their malicious messages to bypass such filtering systems. Machine learning classifiers can also be used to identify and filter malicious messages, but such classifiers rely on training data that rapidly becomes out of date, and new forms of malicious text cannot be classified accurately. In this thesis, we model the disguised messages as a cipher and apply automatic decipherment techniques to decrypt corrupted malicious text back into plain text, which can then be filtered using rules or a classifier. We provide experimental results on three different data sets and show that decipherment is an effective tool for this task.

Keywords: Natural Language Processing; Decipherment; Spelling Correction; Malicious Words Filtering


Dedication

To my beloved parents, who always encourage me, and to my lovely fiancée, who always brings out the best in me.


Acknowledgements

I would like to express my appreciation to my supervisor Dr. Anoop Sarkar for the continuous support of my Master's study and research, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my Master's studies.

I would like to thank my committee members, Dr. Fred Popowich and Dr. David Alexander Campbell, whose suggestions and comments helped me polish this thesis.

I would also like to thank Ken Dwyer and Michael Harris at Two Hat Security Company, who collected the data for us. Thanks to the CEO, Chris Priebe, who gave me the opportunity to intern at Two Hat Security Company and inspired me with the idea of spelling correction for chats.

Thanks to all of my natural language processing lab mates who helped me during these two years; I really enjoyed being with them.


Table of Contents

Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Machine Learning
  1.2 Motivation
  1.3 Contribution
  1.4 Overview

2 Noisy Channel Model
  2.1 Statistical Machine Translation
    2.1.1 Language Model
    2.1.2 Translation Model
    2.1.3 Summary
  2.2 Spelling Correction
    2.2.1 Spelling Correction Algorithm
    2.2.2 Error Model
    2.2.3 Candidate Generalization
    2.2.4 Decoding
    2.2.5 Summary
  2.3 Summary

3 Decipherment


  3.1 Related Works
  3.2 Hidden Markov Model
    3.2.1 Expectation Maximization Algorithm
    3.2.2 Forward-Backward Algorithm
    3.2.3 Initialization
    3.2.4 Beam Search
    3.2.5 Random Restarts
  3.3 Summary

4 Experimental Results
  4.1 Dataset
    4.1.1 Wiktionary
    4.1.2 Real World Chat Messages
  4.2 Experimental Design
  4.3 Experimental Results
    4.3.1 Classifier Tuning
    4.3.2 Decipherment of Caesar Cipher
    4.3.3 Decipherment of Leet Substitution
    4.3.4 Decipherment of Real Chat Offensive Words Substitution Dataset
    4.3.5 Decipherment of Real Offensive Chat Messages

5 Conclusion

Bibliography


List of Tables

Table 4.1 Classification Accuracy of Spelling Correction and Decipherment Results in Caesar Cipher Encrypted Text
Table 4.2 Classification Accuracy of Decipherment Results in Leet Substitution Cipher
Table 4.3 Top n Candidates of Decipherment Results in Leet Substitution Cipher Classification on Test Set A
Table 4.4 Classification Accuracy of Spelling Correction and Decipherment Results in Leet Substitution Cipher with Beam Search Width of 5
Table 4.5 Smoothing in Decipherment of Whole Set on Test Set A
Table 4.6 Classification Accuracy of Spelling Correction and Decipherment Results in Real Chat Offensive Words Substitution Wiktionary Dataset


List of Figures

Figure 1.1 An Example of Filtering Offensive Text
Figure 2.1 An Overview of A Statistical Machine Translation System
Figure 2.2 Possible Alignments
Figure 2.3 EM Algorithm for IBM Model 2 (Collins, 2011)
Figure 2.4 Pseudo Code for Noisy Channel Model Based Spelling Correction (Jurafsky and Martin, 2014)
Figure 2.5 An Example of Beam Search
Figure 4.1 100 Random Restarts Loglikelihood in Caesar Cipher Decipherment
Figure 4.2 Variational Bayes for EM Algorithm from Gao and Johnson (2008)
Figure 4.3 100 Random Restarts Loglikelihood in Leet Substitution Cipher Decipherment
Figure 4.4 100 Random Restarts Loglikelihood in Decipherment of Real Chat Offensive Words Substitution on Test Set B


Chapter 1

Introduction

Children using chat rooms can be confronted with sexting, profanity, age-inappropriate language, cyber-bullying, and requests for personal identifying information. Rule-based filtering systems are commonly used to filter out these malicious messages. However, some malicious messages are subtly transformed by users so that they can get past the filtering system (see the example in Figure 1.1). There are several ways to detect and filter these hidden malicious messages. Spelling correction can correct misspelled words, but it is hard to correct words that are intentionally misspelled by users by several edits. More modern techniques use machine learning classifiers or rule-based filtering systems to identify malicious messages in order to filter them out. However, such methods are still prone to the problem that users can adapt their encryption methods to hide their original intent, since the classifier or rule-based filtering system is static while the users can continuously change their behavior.

In this thesis, we regard this problem as a decipherment problem. The offensive text that users have disguised using their own invented techniques can be treated as cipher text. The original offensive text that users actually wanted to say is the plain text. The solution is to map the users' encrypted text into what the users truly want to express. The plain text is far more likely to be detected and filtered out by the classifier or the rule-based filtering system. We want to decipher the encrypted text without knowing the exact method that was used to encrypt it. No matter how users edit or change their offensive messages, the decipherment algorithm should always recover the original messages. Since we aim to decipher any possible way to encode the message, our method is immune to many (but not all) types of attempts to hide the message using new ways to encode offensive language. For instance, let us assume that the original offensive message is you are a bunny and the corrupted message that tries to get past the filtering system is ura B*n@n@ee . We search for the original message among possible source messages mapped letter to letter from the encrypted text, and select the one that achieves the highest statistical model score. In this thesis, we will explain the background and details of this unsupervised learning decipherment algorithm. We also conduct extensive experimental studies with our decipherment-based approach to malicious language detection, using synthetic as well as real-world data.

Figure 1.1: An Example of Filtering Offensive Text

Note: “Shit” can be blocked by the filtering system, but with one letter changed, “Shet” cannot be filtered out. The encrypted word is still recognizable by humans.
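The letter-to-letter substitution view can be made concrete with a toy sketch. The substitution table below is hypothetical and invented for illustration; in the real task the mapping is unknown and must be learned by the decipherment algorithm rather than written down by hand.

```python
# A hypothetical Leet-style substitution table (invented for this sketch);
# in the real task the mapping is unknown and must be learned.
encrypt_map = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}
decrypt_map = {v: k for k, v in encrypt_map.items()}

def encipher(text):
    # Replace each plain-text letter by its cipher symbol, if any.
    return "".join(encrypt_map.get(c, c) for c in text)

def decipher(text):
    # Invert the known substitution to recover the plain text.
    return "".join(decrypt_map.get(c, c) for c in text)

plain = "you are a bunny"
cipher = encipher(plain)       # 'y0u @r3 @ bunny'
recovered = decipher(cipher)
```

Deciphering is trivial here only because the table is known; the point of the thesis is the unsupervised case, where the table must be inferred from the cipher text and a language model.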

1.1 Machine Learning

Machine learning learns a model from observed data in order to make predictions on unseen data. The idea is that the machine should use the observed data not only for acting but also for improving its performance. Bishop (2006) notes that the objective of a learner is to generalize from its experience: generalization is the ability of a learning machine to accurately predict new, unseen data after learning the attributes of the observed data set. As the observed data used for training is finite and the unseen data is uncertain, the error rate on unseen data is itself uncertain. Russell et al. (2003) specify three major parts of a learning process: which attributes of the observed data are to be learned; what feedback is available to learn these attributes; and what representation is used for the attributes. The attributes are features of the observed data; they can be word counts, term frequency, term frequency times inverse document frequency (tf-idf), and so on. The attributes can be learned from appropriate feedback. Machine learning is divided into three major types depending on the training regime and task: supervised, unsupervised and reinforcement learning.
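As a concrete illustration of one such attribute, here is a minimal tf-idf computation. The corpus is a made-up toy example, not data from the thesis.

```python
import math
from collections import Counter

# Tiny illustrative corpus (invented for this sketch).
docs = ["the cat sat on the mat",
        "the dog barked at the cat",
        "dogs and cats are pets"]

def tf_idf(term, doc, corpus):
    words = doc.split()
    tf = Counter(words)[term] / len(words)       # term frequency in this doc
    df = sum(term in d.split() for d in corpus)  # document frequency
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)             # inverse document frequency
    return tf * idf

score = tf_idf("cat", docs[0], docs)  # "cat" appears in 2 of 3 documents
```

Words that occur in every document get idf = log(1) = 0, so tf-idf down-weights uninformative words while boosting terms concentrated in few documents.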

Supervised learning is learning a model from samples of inputs and outputs. To train the model, we need a set of observed data with labels: the observed data attributes are the inputs and the labels are the outputs. Once the model is trained, it predicts the output labels of new data. In a fully observable environment, a learner can observe the effects of its actions and hence can use supervised learning methods to learn to predict the output labels. However, not all of the available data might have the necessary output labels for training a supervised classifier. Faced with a lack of labeled data, unsupervised learning can be used to learn patterns among the unlabeled data when no specific output values are supplied. Thus, training in unsupervised learning uses a set of training data without any corresponding target values. Reinforcement learning is a third type of machine learning, in which a teacher guides the learning process by rewarding or penalizing the predicted values. The learner receives feedback indicating whether what has happened is good or bad, and the machine is rewarded for good actions, but only at the end of the sequence of actions. The goal of reinforcement learning is to maximize the eventual reward.

In this thesis, we simulate the way users edit messages and then use unsupervised learning approaches to deduce the plain text that we presume the users actually wanted to express if there were no rule filters to block their messages. The unsupervised learning algorithm used involves Expectation Maximization from Dempster et al. (1977). We regard our problem as a Hidden Markov Model (HMM) (Rabiner, 1989) and use beam search to find the most likely original plain text message. We also decipher the cipher text messages using a supervised learning method known as the noisy channel model, and compare to the state-of-the-art Aspell spelling correction algorithm.

1.2 Motivation

In a rule-based filtering system, the rules need to be updated periodically. Users can learn the rules used by the filter after using the system and checking which messages are blocked and which are not. If they still want to send offensive messages, they will try to edit the original offensive messages by adding special symbols, substituting one letter for another, and so on. These user-changed texts do not match the system's rules for offensive text and thus pass the filtering system. They can, of course, still be understood by other users; otherwise it would make no sense to send such messages. Because there are so many ways to disguise messages as typos, and the patterns of editing the text change over time, users can always come up with new ways to change their messages based on their experiences with the filtering system. We have to continuously update the rules to detect those disguised messages. It is laborious to formulate a new rule and update the filtering rules, but it is easy to obtain the latest disguised messages. From these disguised messages, we can learn the patterns that users have used to encrypt the text, which allows us to recover the original messages with the help of the patterns and a language model. The patterns are learned from the observed disguised messages; the language model is trained on a corpus of un-disguised original messages. This decipherment regards the observed messages as cipher text and the un-disguised original clean text as plain text. The decipherment approach reveals the underlying plain text from the disguised observed cipher text.

1.3 Contribution

In this thesis, we applied unsupervised learning decipherment approaches (Knight et al., 2006) to recover original text from encrypted offensive text regardless of the nature of the encryption. The decipherment process is adaptive and automatic. In our experiments it recovers all of the text encrypted with the Caesar cipher and most of the plain text encrypted with the Leet substitution encryption (Wikipedia, 2016). Our experiments also show that the decipherment approach performs similarly to, or even better than, the noisy channel model spelling correction method at recovering plain text. Compared to supervised learning approaches, and like noisy channel model spelling correction, the unsupervised learning decipherment approach has the advantage that training data is easy to obtain, and it can cover more cases. Although a parallel corpus is difficult to obtain, particularly in specific domains such as offensive chat messages, we can obtain as many of the users' disguised offensive chat messages as we want, and we can also access large amounts of plain text language data. The decipherment approach covers more severe or difficult cases than the noisy channel spelling correction method, which only considers plain text candidate words within a limited edit distance of the disguised words; the decipherment approach considers all of the possible words that could be the plain text of the disguised word in question. In our experiments, the noisy channel model spelling correction did not work on the Caesar cipher encrypted data, but the decipherment approach correctly deciphered all of it.

To our knowledge this is the first time that a decipherment-based filter has been applied to a real-world data set of malicious chat messages; we are able to decipher the original plain text and evaluate the performance of the rule-based filtering system on our decipherment output.

1.4 Overview

The thesis is organized as follows.

In Chapter 2 we describe the background of spelling correction and the noisy channel model. This chapter covers the definition and concepts of the noisy channel model. We explain the details of this model and how it is used in spelling correction.


In Chapter 3 we describe the methodology of the decipherment problem, which is composed of the HMM, the EM algorithm and the Forward-Backward algorithm. More detailed processes, such as how to preprocess the offensive text before performing the decipherment, including inserting NULL symbols and deleting repeated characters, are also explained.

In Chapter 4 we explain the experiments and show the experimental results for Caesar cipher decipherment, Leet substitution decipherment and real chat message keyword decipherment. We compare the results according to classifier accuracy. We also conducted experiments aimed at recovering plain text using spelling correction methods, and we show the performance of the decipherment approach at recovering the original plain text from the corrupted text.

In Chapter 5 we summarize the experimental results of the previous chapter and conclude that the decipherment approach is a robust way to recover corrupted text no matter how the text was encrypted.


Chapter 2

Noisy Channel Model

The noisy channel model is a well-known framework that is widely used in spelling correction, speech recognition, question answering and machine translation. Statistical machine translation and spelling correction are good examples with which to introduce the noisy channel model. In this chapter, we introduce the concepts of the language model, the translation model (also called the error model in spelling correction), and the Bayes rule used in the noisy channel model.

2.1 Statistical Machine Translation

The term statistical machine translation implies the use of statistics. The rationale for statistical machine translation was first described by Weaver (1955): based on insights from Alan Turing's speculation about the use cases for computing machines, Warren Weaver laid out the noisy channel model for translation. In state-of-the-art statistical machine translation, the principles of the methods used today originate from the IBM models of the IBM Candide project (Brown et al., 1988, 1990, 1993) in the late 1980s and early 1990s. These works were seminal and introduced many concepts that still underpin today's machine translation models. Define e as a source sentence (the translated text) and f as a target sentence (the foreign sentence that needs to be translated). The IBM models are examples of the noisy channel model. They have two critical components:

1. A language model that assigns a probability p(e) to any source sentence e

2. A translation model that assigns a conditional probability p(f | e) to any pair ofsource-target sentences (e, f)

Given these two essential components of the model, the noisy channel approach computes the translation output using this equation:

e* = argmax_{e ∈ E} p(e | f)   (2.1)


where E is the set of source language sentences. The function argmax_{e ∈ E} f(e) returns the sentence e that maximizes the function f(e). Thus, given the target sentence f, which is in the foreign language we want to translate into the source language, we seek the source sentence e that would translate to f. The language model p(e) is a good source of information; however, Equation 2.1 contains no language model p(e). Bayes' theorem defines:

p(e | f) = p(e) p(f | e) / Σ_e p(e) p(f | e)   (2.2)

The denominator on the right-hand side of this equation is the probability of the given target sentence f, obtained by summing over all source sentences e:

p(f) = Σ_e p(f | e) p(e)   (2.3)

Equation 2.3 shows that the denominator of Equation 2.2 is p(f), the probability of the foreign sentence. Since the foreign sentence is fixed and not affected by the choice of source sentence, it is sufficient to consider only the numerator of Equation 2.2 in order to maximize Equation 2.1. The noisy channel model is as follows:

e* = argmax_{e ∈ E} p(e | f)
   = argmax_{e ∈ E} p(e) × p(f | e) / Σ_e p(e) × p(f | e)
   = argmax_{e ∈ E} p(e) × p(f | e)   (2.4)

Each candidate sentence e is scored by the product of the language model and translation model scores. The language model p(e) gives a prior distribution over which sentences are likely in the source language, and the translation model p(f | e) indicates how likely the target sentence f is to be the translation of the source sentence e. Thus, Equation 2.4 says that, out of all the sentences in the source language, we choose the single source sentence that maximizes the score p(e) × p(f | e). The process of choosing the sentence e that maximizes Equation 2.4 is called decoding, and the component that performs it is the decoder.
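Equation 2.4 can be sketched in a few lines over a tiny hand-made candidate set. The probabilities below are invented for illustration, not estimated from data, and a real decoder searches a vastly larger space than an explicit list of candidates.

```python
# Hypothetical model scores (invented for this sketch):
# lm[e] plays the role of p(e), tm[(f, e)] the role of p(f | e).
lm = {"john loves mary": 0.6, "mary loves john": 0.4}
tm = {("jean aime marie", "john loves mary"): 0.7,
      ("jean aime marie", "mary loves john"): 0.2}

def decode(f, candidates):
    # Equation 2.4: pick the e maximizing p(e) * p(f | e).
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

best = decode("jean aime marie", list(lm))
```

Here the first candidate wins because 0.6 × 0.7 = 0.42 exceeds 0.4 × 0.2 = 0.08; the denominator of Bayes' rule never needs to be computed, exactly as the derivation above shows.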

Figure 2.1: An Overview of A Statistical Machine Translation System


In Figure 2.1, we show an overview of a statistical machine translation system, where Mono Source Corpora is a large set of monolingual texts in the source language, Target Text is the foreign text, and Source Text is the text translated from the foreign text. LM is the language model and TM is the translation model.

2.1.1 Language Model

Given a sequence of words, e1, e2, ..., en, we can write the probability of this sentence as:

p(e1 e2 . . . en) = p(e1) p(e2 | e1) . . . p(en | e1 e2 . . . en−1)   (2.5)

The language model measures how likely it is that a sequence of words would appear in a corpus of the same language. In the translation process, we aim to produce not only correct language but also fluent sentences. The language model has the advantage of assigning high probability to fluent sentences: it prefers the correct order of words in the source sentence over an incorrect order. For example,

pLM (“this is correct order”) > pLM (“order correct is this”) (2.6)

We need to know the history of the word en to compute p(en | e1 e2 . . . en−1), but in long sentences these histories become very long. To make the language model computable, we limit the history to m words, which means we consider only the previous m words:

p(en | e1, e2, . . . , en−1) ≈ p(en | en−m, . . . , en−2, en−1) (2.7)

With this estimation, for the word in each position we can estimate the language model score based only on its previous m words of history. This is n-gram language modeling: an n-gram model considers n − 1 words of history in order to compute the current word's probability score. We examine the sequences of words and only consider transitions over a limited number of grams, which are termed Markov chains (Baum and Petrie, 1966); the number of words considered in the conditioning history is the order of the model.

A unigram language model does not consider any history when computing an individual word's probability: rather, it computes the probability from the word's statistics in the training corpus. Intuitively, p(e) is then the distribution over words in the training corpus. Typically, more training data allows the consideration of longer histories. The trigram language model is commonly used: it considers a two-word history to compute the distribution of the third word. The trigram language model can be written as follows:

p(e1e2 . . . en) ≈ p(e1)p(e2 | e1)p(e3 | e1e2)p(e4 | e2e3) . . . p(en | en−2en−1) (2.8)
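A maximum-likelihood trigram estimate can be computed directly from counts. The toy corpus below is invented for the sketch; real language models are trained on far larger corpora.

```python
from collections import Counter

# Toy training corpus (invented); tokens are space-separated.
corpus = "this is correct order . this is fine order .".split()
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_trigram(w1, w2, w3):
    # MLE estimate: p(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2).
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

p = p_trigram("this", "is", "correct")
```

The bigram "this is" occurs twice and is followed once by "correct" and once by "fine", so each continuation gets probability 0.5 under the maximum-likelihood estimate.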


However, there are still some trigrams not included in the language model: the unknown tokens. If one of the trigrams in Equation 2.8 is missing and its probability equals zero, the language model score of the whole sequence will be zero, which makes the other trigrams useless. Smoothing techniques transfer a small portion of the probability mass from the observed data to the unknown data so that unknown data also has a small probability of occurrence. Smoothing extracts more information from the language model as long as the assumptions behind the smoothing are reasonable.
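Add-one (Laplace) smoothing is the simplest such technique, sketched here on a toy bigram model. Add-one is only illustrative; the models in the thesis may use different smoothing schemes.

```python
from collections import Counter

# Toy corpus (invented for this sketch).
corpus = "a b a c a b".split()
vocab = set(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_add_one(w1, w2):
    # Laplace smoothing: every bigram, seen or unseen, gets nonzero mass
    # by adding 1 to each count and |V| to the normalizer.
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + len(vocab))

seen = p_add_one("a", "b")    # (2 + 1) / (3 + 3) = 0.5
unseen = p_add_one("c", "c")  # (0 + 1) / (1 + 3) = 0.25
```

The unseen bigram ("c", "c") now has nonzero probability, so a sequence containing it no longer collapses the whole language model score to zero.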

2.1.2 Translation Model

State-of-the-art machine translation (MT) systems apply statistical approaches to learn translation rules from large amounts of parallel data. The translation model assigns the conditional probability p(f | e) to any pair of source-target sentences (e, f). Every sentence consists of a sequence of words. From the parallel corpora, the only information we have about a source sentence is its translation; we do not know which word in the source sentence is translated to which word in the target sentence. There are many ways to train the translation model, such as word-based, phrase-based and hierarchical translation models. We will introduce the word-based translation model.

Word-based Translation Model

To compute the conditional probability p(f | e) for any target sentence f = f1 . . . fm, we have to model the distribution:

p(f1 . . . fm | e1 . . . el) (2.9)

It is difficult to compute Equation 2.9 directly due to the joint probability. Brown et al.(1990) notes that it is reasonable to regard the target translation of a source sentence beinggenerated from the source sentences word by word. For example, the sentence pair (Jeanaime Marie, John loves Mary), John translates to Jean, love translates to aime and Mary

translates to Marie. Thus a word is translated to the word to which it aligns. This is called alignment: the mapping from the target language to the source language. However, the parallel data do not give us the word alignment for a pair of source-target sentences. For each target/foreign word f, we do not know the source word to which it is mapped; we lack the alignment information. By incorporating the alignment information into the translation model, we can write the translation model as follows:

p(f1 . . . fm | e1 . . . el) = ∑_{a1=0}^{l} ∑_{a2=0}^{l} · · · ∑_{am=0}^{l} p(f1 . . . fm, a1 . . . am | e1 . . . el)    (2.10)


Figure 2.2: Possible Alignments

where each ai is in {0, 1, . . . , l}. If a1 = 3, the first foreign word is aligned to the third source word. The alignment is treated as a hidden variable in our translation model. To estimate the translation model from incomplete data, the expectation maximization algorithm (EM algorithm) can be applied to iteratively increase the likelihood by adjusting the hidden variable, which in this case is the alignment probability.

Take the example from Brown et al. (1990):

e = John loves Mary
f = Jean aime Marie    (2.11)

The correct alignment is

a1, a2, a3 = < 1, 2, 3 >    (2.12)

Each French word maps to an English word. However, there can be another alignment, shown in Figure 2.2: a1, a2, a3 = < 1, 1, 1 >, in which every French word maps to the first English word. To decide which alignment is best, we need an EM algorithm to estimate the parameters of the translation model.
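Before EM can weigh alignments against each other, it needs the space of alignment vectors it sums over. As a small illustrative sketch (the helper name is hypothetical, not from the thesis), the alignments for the example above can be enumerated directly; position 0 stands for a NULL source word:

```python
from itertools import product

def enumerate_alignments(src_len, tgt_len):
    """All alignment vectors a_1..a_m with each a_i in {0..l} (0 = NULL)."""
    return list(product(range(src_len + 1), repeat=tgt_len))

# e = "John loves Mary" (l = 3), f = "Jean aime Marie" (m = 3)
alignments = enumerate_alignments(3, 3)
assert len(alignments) == (3 + 1) ** 3   # (l+1)^m = 64 possible alignments
assert (1, 2, 3) in alignments           # the correct alignment of Eq. 2.12
assert (1, 1, 1) in alignments           # every French word -> "John"
```

Even for this three-word pair there are 64 alignments, which is why Equation 2.10 sums over them rather than picking one, and why EM is used to weight them.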

EM algorithm

The EM algorithm is defined as follows in Chapter 4.2.2 of Koehn (2009):

1. Initialize the model, usually with a uniform distribution or randomized translation probabilities

2. Apply the model to the data (expectation step)

3. Learn the model from the data (maximization step)


4. Iterate steps 2 and 3 until convergence

To apply the EM algorithm to the translation model (Equation 2.10), we first initialize the model with a uniform distribution over all translation probabilities t(f | e), then collect the expected counts of alignments to update the alignment probabilities. After several iterations, the model comes to prefer the alignments corresponding to the most likely word translations.

For Equation 2.10, we need to make some assumptions so that this model can be estimated efficiently with an EM algorithm. Our goal is to build a model of

P (F1 = f1 . . . Fm = fm, A1 = a1 . . . Am = am | E1 = e1 . . . El = el, L = l,M = m) (2.13)

where A1 . . . Am are the alignment variables, L is the length of the source sentence and M is the length of the target (foreign) sentence. By applying the chain rule, Equation 2.13 can be written as follows:

P(F1 = f1 . . . Fm = fm, A1 = a1 . . . Am = am | E1 = e1 . . . El = el, L = l, M = m)
= P(A1 = a1 . . . Am = am | E1 = e1 . . . El = el, L = l, M = m)
× P(F1 = f1 . . . Fm = fm | A1 = a1 . . . Am = am, E1 = e1 . . . El = el, L = l, M = m)    (2.14)

Assuming the alignment is independent of the actual source and target words, the distribution of the alignment depends only on the sentence lengths l and m. The first term of Equation 2.14 can then be written as

P(A1 = a1 . . . Am = am | E1 = e1 . . . El = el, L = l, M = m)
= ∏_{i=1}^{m} P(Ai = ai | A1 = a1 . . . Ai−1 = ai−1, E1 = e1 . . . El = el, L = l, M = m)
= ∏_{i=1}^{m} P(Ai = ai | L = l, M = m)
= ∏_{i=1}^{m} q(ai | i, l, m)    (2.15)

where the first equality is by the chain rule of probabilities. For the second term in Equation 2.14, we make the following assumption:


P(F1 = f1 . . . Fm = fm | A1 = a1 . . . Am = am, E1 = e1 . . . El = el, L = l, M = m)
= ∏_{i=1}^{m} P(Fi = fi | F1 = f1 . . . Fi−1 = fi−1, A1 = a1 . . . Am = am, E1 = e1 . . . El = el, L = l, M = m)
= ∏_{i=1}^{m} P(Fi = fi | Eai)
= ∏_{i=1}^{m} t(fi | eai)    (2.16)

where we applied the chain rule and assume that the foreign word Fi depends only on Eai, which means the foreign word Fi is aligned to the source word Eai. Combining Equations 2.15 and 2.16, we can derive a distribution for the translation model combining the lexical translation t(fi | eai) and the alignment model q(ai | i, l, m):

p(f, a | e) = ε ∏_{i=1}^{m} t(fi | eai) q(ai | i, l, m)    (2.17)

This model is called IBM Model 2. It differs from IBM Model 1, which has only the lexical translation probability and does not take the alignment of words into account. For example, in IBM Model 1 the translation probabilities for the following two translations are the same:

“John loves Mary” => “Jean aime Marie”

“John loves Mary” => “Jean Marie aime”

In IBM Model 2 (Equation 2.17), the alignment probability distribution is incorporated to take the alignment into account.

The EM algorithm from Collins (2011) is shown in Figure 2.3. S is the number of training iterations. q(j | i, l, m) represents the conditional probability that the foreign word fi is aligned to the source word ej, given the foreign sentence length m and the source sentence length l. t(f | e) is the conditional probability of mapping the source word e to the foreign word f. In the EM algorithm shown in Figure 2.3, both parameters are estimated from expected counts: the E-step (expectation step) computes the counts c(e_j^(k), f_i^(k)) and c(e_j^(k)), and the M-step (maximization step) normalizes them to obtain t(f | e) and q(j | i, l, m).

2.1.3 Summary

In summary, the noisy channel model is applied in machine translation systems. The sourcesentences are distorted in the noisy channel and become foreign sentences. We only knowthe foreign sentences and use knowledge of the source language (language model) and the


Input: A training corpus (f^(k), e^(k)) for k = 1 . . . n, where f^(k) = f_1^(k) . . . f_mk^(k) and e^(k) = e_1^(k) . . . e_lk^(k).

Initialization: Initialize t(f | e) and q(j | i, l, m) parameters (e.g., to random values).

Algorithm:

• For s = 1 . . . S
  – Set all counts c(. . .) = 0
  – For k = 1 . . . n
    * For i = 1 . . . mk
      · For j = 0 . . . lk
        c(e_j^(k), f_i^(k)) ← c(e_j^(k), f_i^(k)) + δ(k, i, j)
        c(e_j^(k)) ← c(e_j^(k)) + δ(k, i, j)
        c(j | i, l, m) ← c(j | i, l, m) + δ(k, i, j)
        c(i, l, m) ← c(i, l, m) + δ(k, i, j)
      where
        δ(k, i, j) = q(j | i, lk, mk) t(f_i^(k) | e_j^(k)) / ∑_{j′=0}^{lk} q(j′ | i, lk, mk) t(f_i^(k) | e_j′^(k))
  – Set
    t(f | e) = c(e, f) / c(e)    q(j | i, l, m) = c(j | i, l, m) / c(i, l, m)

Output: parameters t(f | e) and q(j | i, l, m)

Figure 2.3: EM Algorithm for IBM Model 2 (Collins, 2011)
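As a rough illustration of the algorithm in Figure 2.3, the following Python sketch implements the same count-and-normalize loop for IBM Model 2 on a toy corpus. The sentence pairs and all names are assumptions made for illustration; a real implementation would add a proper initialization and a convergence check:

```python
from collections import defaultdict

def train_ibm2(corpus, iterations=10):
    """EM for IBM Model 2 over (foreign, english) sentence pairs.

    corpus: list of (f_words, e_words); a NULL word is added at e index 0.
    Returns the lexical table t(f|e) and alignment table q(j|i,l,m)."""
    t = defaultdict(lambda: 1.0)   # uniform-ish init for t(f|e)
    q = defaultdict(lambda: 1.0)   # uniform-ish init for q(j|i,l,m)
    for _ in range(iterations):
        c_ef = defaultdict(float); c_e = defaultdict(float)
        c_jilm = defaultdict(float); c_ilm = defaultdict(float)
        for f, e in corpus:
            e = ["<NULL>"] + e
            l, m = len(e) - 1, len(f)
            for i, fi in enumerate(f, start=1):
                # delta(k,i,j): posterior that f_i aligns to e_j
                z = sum(q[(j, i, l, m)] * t[(fi, e[j])] for j in range(l + 1))
                for j in range(l + 1):
                    d = q[(j, i, l, m)] * t[(fi, e[j])] / z
                    c_ef[(e[j], fi)] += d; c_e[e[j]] += d
                    c_jilm[(j, i, l, m)] += d; c_ilm[(i, l, m)] += d
        # M-step: normalize the expected counts.
        t = defaultdict(float, {(f_, e_): c / c_e[e_] for (e_, f_), c in c_ef.items()})
        q = defaultdict(float, {k: c / c_ilm[k[1:]] for k, c in c_jilm.items()})
    return t, q

pairs = [("Jean aime Marie".split(), "John loves Mary".split()),
         ("Jean aime Jean".split(), "John loves John".split())]
t, q = train_ibm2(pairs)
```

On this toy corpus the learned lexical table concentrates mass on co-occurring pairs, so t[("Jean", "John")] ends up larger than t[("Marie", "John")], and for each source word the t values sum to one by construction of the M-step.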


distortions caused by the noisy channel, which in IBM Model 2 are the alignment and lexical translation information. Based on these models, we translate to recover the source language sentences from the foreign sentences.

2.2 Spelling Correction

Humans commonly misspell words when typing text. Various automatic spelling correction programs are in widespread use, although they can become a source of amusement when their corrections are inaccurate. Spelling correction is another example of the noisy channel model. The objective function in spelling correction is the same as in machine translation (see Equation 2.4), but for spelling correction we refer to the channel model as the error model.

The idea of modeling language transmission as a Markov source passed through a noisy channel was first developed in Shannon (1948). In the early 1990s, Kernighan et al. (1990) and Church and Gale (1991) proposed noisy channel model based spelling correction. Norvig (2009) showed an implementation of spelling correction in Python. Spelling correction has two components, one from the language model and the other from the error model. To estimate the error model, we need to train on pairs of misspelled and corrected words. Figure 2.4 shows an overview of the noisy channel based

function Noisy-Channel-Spelling(word x, dict D, lm, editprob) returns correction
  if x is not in D
    candidates, edits ← all strings at edit distance 1 from x that are in D, and their edits
    for each c, e in candidates, edits
      channel ← editprob(e)
      prior ← lm(c)
      score[c] = log channel + log prior
    return argmax_c score[c]

Figure 2.4: Pseudo Code for Noisy Channel Model Based Spelling Correction (Jurafsky andMartin, 2014)

spelling correction from Jurafsky and Martin (2014), which indicates that the channel model (error model) is computed from the edit probability and the prior distribution is computed by the language model. It combines these two models and selects the candidate with the highest score.

2.2.1 Spelling Correction Algorithm

Algorithm 1 shows the beam search algorithm for noisy channel model spelling correction, which is based on the stack decoding heuristic from the Beam Search section of Koehn (2009). The general idea is to find the correct word x that maximizes the objective function argmax_x P(x | w) = argmax_x P(x) · P(w | x). P(x) is the language


model and P(w | x) is the error model, where w is the misspelled word and x is the candidate word. The beam search algorithm tracks partial hypotheses in stacks. During the search, it considers the top n hypotheses in each stack and uses the error model and language model to score the candidate words in these hypotheses. At the end, it traces back through the predecessor attributes of the hypotheses to return the best scoring sequence of text.

2.2.2 Error Model

The error model we used is the probability of edits, estimated from misspelling data. The advantage of using edit probabilities for the error model is that one edit can represent many pairs of correct and misspelled words. For example, P(“ew” | “e”) is the probability of the correct letter “e” being misspelled as “ew”, an insertion operation. For the misspelled word “thew”, one of the possible candidates is “the”, and thus its probability of edit is P(“ew” | “e”). There are four operations, deletion, insertion, substitution and reversal, which can be used to transform one word into another (Kernighan et al., 1990). To train an error model, we first collect a training set of word pairs, where each pair is the correct and misspelled form of a word. We set a threshold on the longest words we can process and on the largest word length difference. If a pair of words exceeds these thresholds, we skip it and process the next pair. This saves training time without losing much information, as only very few cases exceed the thresholds. Then, based on Damerau-Levenshtein distance (Damerau, 1964; Levenshtein, 1966), we generate the candidate edits from the misspelled to the correct word within a certain edit distance, as described in Subsection 2.2.3.

If there are more than 1000 possible alignments between the misspelled and target words, we randomly sample 1000 alignments to extract the edits. We count the edits over the whole training set, which is all the pairs of target and misspelled words, and then calculate each edit's probability, normalized by the total count of edits extracted from the training set. We set a parameter, the probability of misspelling pspell_error, which assigns probability 1 − pspell_error to correctly spelled letters. Thus, we have the following formula to compute the error model:

Pr(edits) = { 1 − pspell_error                                    if edits is empty
            { pspell_error · P(edit1) · P(edit2) · · · P(editn)    if edits is not empty (n > 0)
    (2.18)

where edits is a set of possible edits {edit1, edit2, · · · , editn}.
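Equation 2.18 translates directly into code. In the sketch below, the edit probabilities and parameter value are hypothetical stand-ins for the trained error model, and the computation is done in log10 space to avoid underflow when many edit probabilities are multiplied:

```python
import math

def error_model_logprob(edits, edit_prob, p_spell_error=0.05):
    """log10 Pr(edits) following Equation 2.18 (toy probabilities)."""
    if not edits:
        # No edits: the word was typed correctly.
        return math.log10(1.0 - p_spell_error)
    logp = math.log10(p_spell_error)
    for e in edits:
        logp += math.log10(edit_prob[e])  # multiply P(edit_i) in log space
    return logp

# Hypothetical edit probabilities, e.g. P("ew" | "e") for "thew" -> "the".
edit_prob = {("e", "ew"): 0.01, ("a", "e"): 0.02}
assert error_model_logprob([], edit_prob) == math.log10(0.95)
assert error_model_logprob([("e", "ew")], edit_prob) == \
       math.log10(0.05) + math.log10(0.01)
```

The empty-edit branch gives correctly spelled words a high score, which is what keeps the decoder from "correcting" words that are already right.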

The pseudo-code in Algorithm 2 shows how we trained the error model.

2.2.3 Candidate Generalization

For unknown words which are not in the dictionary, we generate a list of candidate words. In Chapter 14 of Norvig (2009), candidates are generated by an edits function, which


is passed a word and returns a dictionary of {word : edit} pairs indicating the possible corrections. There would be too many candidates if all the words generated by the edits, even the incorrect ones, were listed. Norvig (2009) therefore precomputed the set of all prefixes of all the words in the vocabulary. He then split the words into a head and a tail, ensuring that the head was always in the list of prefixes. Similarly, we precomputed the prefixes of words, which are the first n letters of each word, where n ranges from 1 to the length of the word. With the help of this set of word prefixes, we generated candidates constrained to match the word prefixes, which removed most of the incorrect candidates.

The list of candidates is determined by the edit distance. Usually, in English misspelling, the misspelled words are no more than 3 edits from the correct words.
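A minimal candidate generator in the spirit of Norvig (2009) can be sketched as follows. The prefix-based pruning described above is omitted for brevity; this version simply expands Damerau-Levenshtein edits breadth-first and stops at the nearest distance that hits the dictionary. The dictionary and test words are toy assumptions:

```python
def dl_edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at Damerau-Levenshtein distance 1 from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in alphabet}
    inserts = {L + c + R for L, R in splits for c in alphabet}
    return deletes | transposes | substitutes | inserts

def candidates(word, dictionary, max_dist=3):
    """Known words within max_dist edits, nearest distance first."""
    frontier = {word}
    for dist in range(1, max_dist + 1):
        frontier = {e for w in frontier for e in dl_edits1(w)}
        found = frontier & dictionary
        if found:
            return found
    return set()

D = {"because", "behavior", "beverage", "the"}
assert candidates("bevause", D) == {"because"}   # one substitution away
assert "the" in candidates("thew", D)            # one deletion away
```

Without the prefix constraint the frontier grows quickly with distance, which is exactly the blow-up the precomputed prefix set is meant to prevent.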

2.2.4 Decoding

As we have the error model and the language model, we use beam search to decode the best sequence of words, the one with the highest score according to the following equation:

argmax_{x∈V} P(x | w) = argmax_{x∈V} Pr(edits) · P(x)^λ
                      = argmax_{x∈V} log10 Pr(edits) + λ · log10 P(x)    (2.19)

where V is the vocabulary, λ is a weight balancing the language model against the error model, and Pr(edits) is computed by Equation 2.18.

Considering the context of the unknown words in the language model helps us predict the unknown words more precisely. For example, the data could be provided with the following fields: {“text”: “bevause”, “left_context”: “this was”, “right_context”: “she sucks”}. The misspelled word is “bevause”. We can list all the candidate words within 1 to 3 Damerau-Levenshtein edit distance and refer to the dictionary to search for the correct word. We score every candidate using the language model with its context, multiplied by the error model, and then use beam search to pick the one with the best score.

For example, consider the sentence “this was bevause sye sacks”. Every word in the sentence generates a list of candidates. Assume “this” has the candidates “this”, “that”, “what”, “thew”; “was” has the candidates “has”, “his”, “wait”, “was”, “what”; “bevause” has the candidates “because”, “behavior”, “beverage”; “sye” has the candidates “she”, “eye”, “yes”; and “sacks” has the candidates “sucks”, “snacks”. The beam search first scores every candidate for “this” and then selects and expands the top n candidates. In Figure 2.5, assume we set the beam search width n to 2, so we select “this” and “that” at the first position. The beam search algorithm then expands the candidate “this”, scores it with all candidate words in the second position, and selects the top 2 candidates that maximize the score from the beginning of the sentence to the current word. At each position of the sentence, we remove some of the candidate


words, reducing the search space. Figure 2.5 shows an example search through the top n candidate words rather than all the candidate words.

Figure 2.5: An Example of Beam Search
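The beam search of Figure 2.5 can be sketched as follows. The per-word scores below are toy stand-ins for the combined error model and language model scores of Equation 2.19, and the width-2 beam mirrors the example above:

```python
import heapq

def beam_search(candidate_lists, score, width=2):
    """Keep the top-`width` partial hypotheses at each position.

    candidate_lists: per-position lists of (word, word_score) pairs.
    score: function(prev_sequence, word, word_score) -> incremental score."""
    beam = [(0.0, [])]  # (total score, sequence so far)
    for cands in candidate_lists:
        # Expand every surviving hypothesis with every candidate word.
        expanded = [(s + score(seq, w, ws), seq + [w])
                    for s, seq in beam for w, ws in cands]
        # Prune back to the best `width` hypotheses.
        beam = heapq.nlargest(width, expanded, key=lambda x: x[0])
    return max(beam, key=lambda x: x[0])[1]

# Toy log scores standing in for log(error model) + lambda*log(LM).
sent = [[("this", -1), ("that", -2)],
        [("was", -1), ("has", -3)],
        [("because", -1), ("beverage", -4)],
        [("she", -1), ("eye", -2)],
        [("sucks", -1), ("snacks", -3)]]
best = beam_search(sent, lambda seq, w, ws: ws, width=2)
assert best == ["this", "was", "because", "she", "sucks"]
```

A real scorer would make the incremental score depend on the previous words in the hypothesis (the trigram context), which is why the `score` callback receives the sequence so far.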

2.2.5 Summary

In the noisy channel model, misspelled words are treated as correctly spelled words that are distorted by a noisy communication channel. Basically, the channel applies letter substitution, insertion, deletion and transposition to the correctly spelled words. The goal of noisy channel model based spelling correction is to pass all the possible correctly spelled words through the model and then find the one most similar to the misspelled word.

2.3 Summary

In this chapter, we introduced the noisy channel model as applied in machine translation and spelling correction. The noisy channel model is a framework that models different problems in the same way: it regards the foreign text (misspelled words) as source text (correct words) distorted by a noisy channel. We explained alignment using the EM algorithm in machine translation, and the algorithms for training the error model and generating candidates for spelling correction. The beam search algorithm used during decoding will also be used in our decipherment method.


Algorithm 1 Beam Search for Spelling Correction
1: procedure Beam-Search(words, width, edit-distance-limit=3)
2:   D is the dictionary of known words
3:   LM is the language model score in logarithm
4:   Pr(edits) is from Equation 2.18
5:   init stacks with the size of n            . n is the number of words in text
6:   set hypothesis tuple with three attributes: score, predecessor, word
7:   place empty hypothesis into stack 0
8:   for all stacks 0 . . . n−1 do
9:     for the best width hypotheses in stack do
10:      if words[i] is in D then              . i is the index of stacks
11:        score = current hypothesis score
12:        edits ← empty
13:        score = score + log(Pr(edits)) + LM(words[i])
14:        new-hypothesis = hypothesis(score, current hypothesis, words[i])
15:        if words[i] not in stacks[i+1] or stacks[i+1][words[i]].score < score then
16:          stacks[i+1][words[i]] = new-hypothesis
17:        end if
18:        continue
19:      end if
20:      generate candidate words from D
21:      if no candidates generated then
22:        score = current hypothesis score
23:        edits ← empty
24:        score = score + log(Pr(edits)) + LM(words[i])
25:        new-hypothesis = hypothesis(score, current hypothesis, words[i])
26:        if words[i] not in stacks[i+1] or stacks[i+1][words[i]].score < score then
27:          stacks[i+1][words[i]] = new-hypothesis
28:        end if
29:      else
30:        for each pair of candidate word and edits (w, edits) do
31:          score = current hypothesis score
32:          score = score + log(Pr(edits)) + LM(w)
33:          new-hypothesis = hypothesis(score, current hypothesis, w)
34:          if w not in stacks[i+1] or stacks[i+1][w].score < score then
35:            stacks[i+1][w] = new-hypothesis
36:          end if
37:        end for
38:      end if
39:    end for
40:  end for
41:  get the best scored hypothesis
42:  return the best scoring sequence of words by traversing back from the best scored hypothesis
43: end procedure


Algorithm 2 Error Model Training
1: procedure Error-Model-Training(training-file)
2:   training-file contains pairs of misspelled word and target word and their frequency
3:   set longest-word-length := 20
4:   set sample-N := 1000
5:   set diff-length := 5
6:   for each pair of misspelled word and target word do
7:     if the length of the misspelled word or the target word > longest-word-length then
8:       continue
9:     else
10:      if absolute(len(misspelled word) − len(target word)) > diff-length then
11:        continue
12:      end if
13:      compute the Damerau-Levenshtein distance between the misspelled word and target word
14:      extract the alignments between the misspelled word and target word
15:      if the number of alignments is greater than sample-N then
16:        randomly sample sample-N alignments from all the alignments
17:      end if
18:      for each alignment do
19:        extract the edits from the alignment
20:        count the edits and store them in a dictionary
21:      end for
22:    end if
23:  end for
24:  normalize the edit dictionary to obtain the error model
25: end procedure
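A condensed sketch of the training loop in Python is shown below. For brevity it aligns each word pair with Python's difflib instead of sampling Damerau-Levenshtein alignments, and it omits the length thresholds; the counting and normalization steps match the pseudocode, and the training pairs are toy assumptions:

```python
from collections import Counter
import difflib

def train_error_model(pairs):
    """Estimate P(edit) from (misspelled, target) word pairs.

    Aligns characters with difflib; the thesis instead samples
    Damerau-Levenshtein alignments, but the counting is analogous."""
    counts = Counter()
    for wrong, right in pairs:
        sm = difflib.SequenceMatcher(None, right, wrong)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op != "equal":
                # Edit: intended substring -> typed substring.
                counts[(right[i1:i2], wrong[j1:j2])] += 1
    total = sum(counts.values())
    # Normalize the edit counts into probabilities.
    return {edit: c / total for edit, c in counts.items()}

model = train_error_model([("thew", "the"), ("bevause", "because"),
                           ("teh", "the")])
assert abs(sum(model.values()) - 1.0) < 1e-9   # a proper distribution
assert ("", "w") in model                      # insertion of "w" in "thew"
assert ("c", "v") in model                     # "c" typed as "v" in "bevause"
```

In the real system the counts would come from the sampled alignments of step 16 of Algorithm 2, but the final normalization (step 24) is exactly the dictionary-over-total division above.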


Chapter 3

Decipherment

In cryptography, a message is called plain text or clear text. Disguising a message so as to hide the content of the original message is called encryption, while recovering the cipher text, an encrypted text, back to plain text is called decryption or decipherment.

The goal of decipherment is to uncover the hidden plain text p1 . . . pn, given a disguised text c1 . . . cn. Our methodology operates under the assumption that every corrupted form of adversarial text is plain text encrypted by letter substitution, insertion and deletion. For example, a user corrupts adversarial text by inserting additional letters, or making substitutions and deletions in the text, as in you are a B*n@n@ee, whose plain text is you are a bunny. We lowercase the texts, substitute the special symbols and punctuation inside the words with NULL, and then allow NULL to map to any letter or to NULL. So in this case, the corrupted word B*n@n@ee is changed to b<NULL>n<NULL>n<NULL>ee for decipherment.

When dealing with insertion cases, we add NULL symbols into the cipher text in order to give the decipherment room to map candidate letters. We tried two ways of inserting a NULL symbol. One is inserting the NULL at a random position in the corrupted offensive key words. The other is based on Ando and Lee (2003), who proposed a statistical segmentation algorithm for Japanese. The algorithm uses counts of character n-grams in an unsegmented corpus to make segmentation decisions. In the example from their paper, “A B C D W X Y Z” is a sequence of unsegmented words. They check the n-gram counts before, after and across a position k of the word sequence. For example, the 4-grams s1 = “A B C D” and s2 = “W X Y Z” lie before and after position 4, respectively, and both tend to be more frequent than the straddling n-grams, such as the 4-gram t1 = “B C D W”, which crosses position 4. Thus, they claim there is evidence of a word boundary between D and W, which is position 4 in this case. Equations 3.1 and 3.2 below compute each position's score. In our study, we pick the position with the highest score in Equation 3.2 as the boundary at which to insert the NULL symbol.


vn(k) = 1/(2(n − 1)) ∑_{i=1}^{2} ∑_{j=1}^{n−1} I>(#(s_i^n), #(t_j^n))    (3.1)

where n is the order of the gram, #(·) is an n-gram count, and I>(y, z) is 1 when y > z and 0 otherwise. s_1^n is the n-gram to the left of position k, s_2^n is the n-gram to the right of position k, and the t_j^n are the n-grams straddling position k.

vN(k) = (1/|N|) ∑_{n∈N} vn(k)    (3.2)
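A sketch of this boundary scoring in Python is given below. The toy corpus, the choice of n-gram orders and the exact indexing are assumptions made for illustration (the real system applies the score to corrupted offensive key words), but the counting of non-straddling versus straddling n-grams follows Equations 3.1 and 3.2:

```python
from collections import Counter

def position_score(text, k, orders=(2, 3, 4)):
    """Ando & Lee (2003) style boundary score v_N(k) at position k.

    Assumes max(orders) <= k <= len(text) - max(orders)."""
    counts = {n: Counter(text[i:i + n] for i in range(len(text) - n + 1))
              for n in orders}
    total = 0.0
    for n in orders:
        s = [text[k - n:k], text[k:k + n]]                 # non-straddling
        t = [text[k - n + j:k + j] for j in range(1, n)]   # straddling
        # I_>: count how often a non-straddling n-gram beats a straddler.
        v = sum(counts[n][sn] > counts[n][tn] for sn in s for tn in t)
        total += v / (2 * (n - 1))
    return total / len(orders)

corpus = "abcd" * 5 + "wxyz" * 5 + "abcdwxyz"
k = corpus.index("w")   # the boundary between the "abcd" and "wxyz" blocks
assert position_score(corpus, k) > position_score(corpus, k - 1)
assert position_score(corpus, k) <= 1.0
```

The score peaks at positions where frequent units abut, which is why the highest-scoring position is a reasonable place to insert a NULL symbol.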

Following Weaver (1955): “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” The set of encrypted forms of English adversarial text is English that has been encoded in some strange symbols. The critical task is estimating the probability of transforming a plain text letter e into a cipher text letter c. There are 27 possible plain text letters, including the NULL letter that represents no letter. The probability of the cipher text is:

P(c) = ∑_e P(e) ∗ P(c | e)    (3.3)

We try to locate the plain text that maximizes the probability P(plain text | cipher text). To achieve this, we first build a probabilistic model P(e) of the plain text source. Then we build a probabilistic channel model P(c | e) that explains how plain text sequences (e) become cipher text sequences (c). In general, the EM algorithm is used to estimate the channel model P(c | e) so as to maximize P(c), which is the same as maximizing the sum over all plain text e of P(e) ∗ P(c | e) (Equation 3.3). After that, we use the Viterbi algorithm (Forney Jr, 1973) to choose the e maximizing P(e) ∗ P(c | e), which is the same as maximizing P(e | c) (by the Bayes Rule).

P(e) is the language model, which is known to us. We use an English character based language model. Deciphering the cipher text is like tagging each cipher text letter with a plain text letter, which is an HMM. We apply the EM algorithm (Dempster et al., 1977) to adjust P(c | e) in order to maximize the probability of the observed cipher text (encrypted text). P(c | e) represents the probability of transforming the plain text letter e into the cipher text letter c. We used the Forward-Backward algorithm (Jelinek, 1997) to infer the probabilities of the mappings.

The EM algorithm estimates the mapping probabilities that maximize the likelihood of the cipher text sequence. If the adversarial text has not been encrypted or corrupted, we keep it the same as the plain text. After estimating the probability P(c | e), we use beam search to search for the best sequence of plain text:

argmax_e P(e | c) = argmax_e P(e) × P(c | e)


We model this decipherment task as an HMM. The plain text letters are the hidden states, and the state transition probabilities are given by the language model. The cipher text letters are the observations, and the emission probability is P(c | e).

Therefore, given a sequence of English cipher text of length L, the EM algorithm has a computational complexity of O(L ∗ V² ∗ R), where R is the number of iterations and V is the number of letters. For a one-to-one simple letter substitution encryption, the vocabulary is the set of 26 alphabet letters, whitespace and the NULL symbol.
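Once P(c | e) has been estimated, decoding the most likely plain text is standard Viterbi over this HMM. The sketch below uses toy transition, emission and start tables (all hypothetical) for a two-letter substitution cipher; a real decoder would use the character language model for transitions and the learned channel model for emissions:

```python
def viterbi(cipher, states, trans, emit, start):
    """Most likely plain-text sequence for cipher under an HMM.

    trans[p][q]: language model P(q | p); emit[p][c]: channel P(c | p);
    start[p]: P(p) at the first position. All tables are toy assumptions."""
    # Each layer maps state -> (best score, best path ending in that state).
    V = [{p: (start[p] * emit[p].get(cipher[0], 0.0), [p]) for p in states}]
    for c in cipher[1:]:
        layer = {}
        for q in states:
            score, path = max(
                ((V[-1][p][0] * trans[p].get(q, 0.0) * emit[q].get(c, 0.0),
                  V[-1][p][1] + [q]) for p in states),
                key=lambda x: x[0])
            layer[q] = (score, path)
        V.append(layer)
    return max(V[-1].values(), key=lambda x: x[0])[1]

# Toy cipher: 'x' mostly encodes 'a', 'y' mostly encodes 'b'.
states = ["a", "b"]
trans = {"a": {"a": 0.3, "b": 0.7}, "b": {"a": 0.7, "b": 0.3}}
emit = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.1, "y": 0.9}}
start = {"a": 0.5, "b": 0.5}
assert viterbi("xyx", states, trans, emit, start) == ["a", "b", "a"]
```

For long cipher texts the products should be replaced by sums of log probabilities, but the O(L ∗ V²) structure of the dynamic program is the same one the complexity analysis above refers to.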

3.1 Related Works

Most relevant prior work is based on the well-known noisy-channel framework. Knight et al. (2006) analyzed the problem in which we face a cipher text stream and try to uncover the plain text that lies behind it. For letter substitution ciphers with only 26 plain text letters, they create a 27 × 27 table which includes a space character. They then used the EM algorithm (Dempster et al., 1977) to estimate the letter mapping probability P(c | e), the probability of a cipher text letter c given a plain text letter e.

Ravi and Knight (2011) regarded foreign language as cipher text and English as the plain text. The goal of word substitution decipherment is to guess the original plain text from the given cipher data without any knowledge of the substitution key. Unlike letter substitution, which involves only 26 plain text letters, in this case they have large-scale vocabularies (10k-1M word types in the plain text) and large corpora (100k cipher tokens). It is impractical to estimate all the possible mappings between plain text words and cipher tokens. To deal with this, they proposed two methods. One is iterative EM, which picks the top K frequent word types in both the plain and cipher text data and then uses them to estimate the channel model probabilities. The other method uses a Bayesian learning algorithm to sample data for estimating the channel model probabilities and to do the decoding. They achieved reasonably high decipherment accuracy (88.6%) using the Bayesian method with a bigram language model.

Similarly, Knight and Yamada (1999) adapted the EM algorithm (Dempster et al., 1977) to decipher unknown scripts by aligning sounds to characters. The EM algorithm was used to estimate the mapping distribution from sounds to characters. They then generated the most probable sequence of clean text with the Viterbi algorithm (Forney Jr, 1973), a dynamic programming method.

For detecting malicious messages, Kansara and Shekokar (2015) proposed a framework for detecting cyberbullying messages in social networks using classifiers; they detected not only offensive text but also images. Chen et al. (2012) proposed the Lexical Syntactic Feature (LSF) architecture to detect offensive content and identify potential offensive users in social media. Razavi et al. (2010) took advantage of a variety of statistical models and rule-based patterns and applied multi-level classification for flame


detection. Thus, machine learning techniques and rule-based filtering are widely used for detecting malicious messages.

3.2 Hidden Markov Model

A Hidden Markov Model (HMM) θ models sequential data (Baum and Petrie, 1966). It can be treated as the triple < ps, ptt, ptw >. ps(t0) is the probability that we start with some plain letter t0; in our case, this is the starting symbol of a sentence. ptt(ti | ti−1) is the transition probability from ti−1 to ti, which the language model can produce. ptw(wi | ti) is the probability of generating wi from ti, where wi is a cipher text letter and ti is its plain text letter.

3.2.1 Expectation Maximization Algorithm

The Expectation Maximization (EM) algorithm (Baum, 1972) finds the hidden parameters from a set of observed data by iteratively computing the expected value of the log likelihood function and then computing its derivative to obtain the parameters which maximize the log likelihood function. The log likelihood function is as follows:

$$L(\theta) = \sum_i \log P(c^{(i)}) \tag{3.4}$$

By using the Bayes rule, we can represent this function as

$$L(\theta) = \sum_{i=1}^{n} \log P(c^{(i)}) = \sum_{i=1}^{n} \log \sum_{e \in V} \left( P(e) \times \prod_{j=1}^{d} P\left(c_j^{(i)} \mid e\right) \right) \tag{3.5}$$

The EM algorithm works as follows:

1. Assign initial probabilities to all parameters

2. Repeat: adjust the probabilities of the parameters to increase the log likelihood the model assigns to the training set

3. Until training converges

As the number of EM iterations grows, the log likelihood keeps increasing, up to a point. As the log likelihood function increases, the language model score of the produced plain text also increases, meaning that the deciphered plain text is more similar to the source language (the language of the language model).

As there is no prior information about the model, we set all the parameters to be equallyprobable. To adjust the probabilities of the parameters, we use the Forward-Backward


algorithm, also referred to as the Baum-Welch algorithm (Baum, 1972), which is a special case of the general EM algorithm. It uses a dynamic programming strategy to efficiently compute the posterior marginal distributions, i.e., the second term P(c | e) in Equation 3.3. The Forward-Backward algorithm (Rabiner, 1989) is guaranteed to converge to a local maximum of the log likelihood function; in other words, in each training iteration the parameter values are adjusted such that the log likelihood score does not decrease. The idea is that in each iteration, we estimate the probability of the parameters by counting the score the current model assigns to them. For example, if our model assigns plain letter "A" to cipher letter "u" with probability 1/27, then we increment our count of "A" and "u" occurring together by 1/27. Training typically stops when the increase in the log likelihood score of the training set between iterations drops below a threshold, which determines the convergence point.

3.2.2 Forward-Backward Algorithm

The Forward-Backward algorithm (Rabiner, 1989) computes posterior probabilities of a sequence of states given a sequence of observations (Collins, 2013). In the first pass, the algorithm computes a set of forward probabilities, which provide the probability of any particular state together with the observations ending at that point. It can be written as follows. For all j ∈ {2, ..., m} and s ∈ V, where m is the length of the observations, V is the set of possible states and α are the forward probabilities,

$$\alpha(j, s) = \sum_{s' \in V} \alpha(j-1, s') \times \phi(s', s, j) \tag{3.6}$$

Define $\phi(s', s, j) = P(s \mid s') \cdot P(c_j \mid s)$ as in Collins (2013):

$$\phi(s_1 \ldots s_m) = \prod_{j=1}^{m} \phi(s_{j-1}, s_j, j) = \prod_{j=1}^{m} P(s_j \mid s_{j-1}) \cdot P(c_j \mid s_j) \tag{3.7}$$

In the second pass, we define the backward probabilities β, analogous to the forward probabilities α.

$$\beta(j, s) = \sum_{s' \in V} \beta(j+1, s') \times \phi(s, s', j+1) \tag{3.8}$$

This equation computes the summation of a set of backward probabilities, which provide the probability of observing the remaining observations from any starting point j. In the end, with the forward and backward probabilities, we can obtain the sum of the probabilities of all possible paths passing through a particular point where observation j


generates state s′, which is the expected count for our EM training. For all j ∈ {1, ..., m} and a ∈ V,

$$\mu(j, a) = \alpha(j, a) \times \beta(j, a) \tag{3.9}$$

For all j ∈ 1...(m− 1), a, b ∈ V ,

$$\mu(j, a, b) = \alpha(j, a) \times \phi(a, b, j+1) \times \beta(j+1, b) \tag{3.10}$$

The pseudo code of the Forward-Backward algorithm (Rabiner, 1989), modified from Knight et al. (2006), is given below:

Algorithm 3 Forward-Backward algorithm for estimating the posterior distribution (emission probability)

1: Given a cipher text c of length m, a plaintext vocabulary of v bigram tokens, and a plaintext trigram model b
2: Randomly initialize the s(c | p) substitution table and normalize
3: for several iterations do
4:   set up a count(c, p) table with zero entries
5:   for i = 1 to v do
6:     Q[i, 1] = b(p_i[1] | p_i[0], boundary)
7:   end for
8:   for j = 2 to m do
9:     for i = 1 to v do
10:      Q[i, j] = 0
11:      for k = 1 to v do
12:        if p_i[0] == p_k[1] then
13:          Q[i, j] += Q[k, j−1] · b(p_i[1] | p_k[0], p_k[1]) · s(c_{j−1} | p_k[1])
14:        end if
15:      end for
16:    end for
17:  end for
18:  for i = 1 to v do
19:    R[i, m] = b(boundary | p_i[0], p_i[1])
20:  end for
21:  for j = m−1 down to 1 do
22:    for i = 1 to v do
23:      R[i, j] = 0
24:      for k = 1 to v do
25:        if p_i[1] == p_k[0] then
26:          R[i, j] += R[k, j+1] · b(p_k[1] | p_i[0], p_i[1]) · s(c_{j+1} | p_k[1])
27:        end if
28:      end for
29:    end for
30:  end for
31:  for j = 1 to m do
32:    for i = 1 to v do
33:      count(c_j, p_i) += Q[i, j] · R[i, j] · s(c_j | p_i)
34:    end for
35:  end for
36:  normalize the count(c, p) table to create a revised s(c | p)
37: end for
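Algorithm 3 works with a trigram plaintext model. As a simplified, runnable sketch, the same forward-backward E-step can be written for a bigram HMM; the names, dictionary representation and toy probabilities below are illustrative, not the thesis implementation:

```python
from collections import defaultdict

def forward_backward_counts(cipher, states, trans, emit, start):
    """One E-step: expected emission counts for a bigram HMM.

    cipher: list of observed cipher symbols
    states: list of plain-text letters
    trans[(a, b)] = P(b | a), emit[(a, c)] = P(c | a), start[a] = P(a at position 0)
    """
    m = len(cipher)
    # Forward pass: alpha[j][s] = P(c_1..c_j, state_j = s)
    alpha = [defaultdict(float) for _ in range(m)]
    for s in states:
        alpha[0][s] = start.get(s, 0.0) * emit.get((s, cipher[0]), 0.0)
    for j in range(1, m):
        for s in states:
            alpha[j][s] = sum(alpha[j - 1][sp] * trans.get((sp, s), 0.0)
                              for sp in states) * emit.get((s, cipher[j]), 0.0)
    # Backward pass: beta[j][s] = P(c_{j+1}..c_m | state_j = s)
    beta = [defaultdict(float) for _ in range(m)]
    for s in states:
        beta[m - 1][s] = 1.0
    for j in range(m - 2, -1, -1):
        for s in states:
            beta[j][s] = sum(trans.get((s, sp), 0.0)
                             * emit.get((sp, cipher[j + 1]), 0.0)
                             * beta[j + 1][sp] for sp in states)
    # Expected counts: alpha * beta, normalized per position
    counts = defaultdict(float)
    for j in range(m):
        z = sum(alpha[j][s] * beta[j][s] for s in states)
        for s in states:
            if z > 0:
                counts[(s, cipher[j])] += alpha[j][s] * beta[j][s] / z
    return counts

# Toy 1:1 cipher (a -> x, b -> y): the expected counts concentrate on the true mapping.
states = ["a", "b"]
start = {"a": 0.5, "b": 0.5}
trans = {(x, y): 0.5 for x in states for y in states}
emit = {("a", "x"): 1.0, ("b", "y"): 1.0}
counts = forward_backward_counts(["x", "y"], states, trans, emit, start)
```

The M-step (line 36 of Algorithm 3) would then normalize these counts into a revised substitution table.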

3.2.3 Initialization

In Blömer and Bujna (2013), it is stated that the initialization of EM training really depends on the data and the allowed computational cost, and that there is no way to determine


the best initialization algorithm that outperforms all algorithms on all instances. Thus, to identify a suitable initialization that reduces the computational cost and reaches the local optima in fewer iterations, we propose our own initialization algorithm. It is based on the assumption that, for the following cipher text sentences, the previously trained table can help us reach the local optima in fewer iterations than a uniform distribution initialization.

Considering our problem of deciphering corrupted offensive text, Algorithm 3 trains on a single cipher text sentence. If we have multiple sentences or a large set of cipher text sentences to decipher, one option is to train on each sentence line by line with Algorithm 3, with each line starting from a random initialization. The other method, which we propose, is to use random initialization on the first cipher text sentence, and then add the revised s(c | e) to a random initialization to form the initial table s(c | e) for the next cipher text line. The iterative training part does not change. We compare these two ways of training in the experimental Chapter 4 and find that the latter training method is better.

The following pseudo code shows the whole decipherment algorithm on offensive text with the second initialization algorithm.

Algorithm 4 Initialization with previously trained table

1: Given a set of cipher text sentences of size n, a plain text vocabulary of v bigram tokens and a plain text trigram model b
2: Randomly initialize the s(c | e) substitution table and normalize
3: for i = 1 to n do
4:   preprocess the i-th cipher text sentence by collapsing runs of repeated characters to 2 identical characters and lower-casing
5:   insert the Null symbol, based on the language model, into the offensive key words to handle insertion cases
6:   if i ≠ 1 then
7:     add s(c | e) to the (i−1)-th trained table s(c | e) to get the new substitution table
8:   end if
9:   apply the Forward-Backward algorithm with the new substitution table as the initialization table
10: end for
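The table-combination step of this initialization might be sketched as follows. The function name and dictionary representation are hypothetical, and normalization per plain letter is assumed:

```python
def next_initialization(prev_table, random_table):
    """Sketch: combine the previously trained substitution table with a fresh
    random table, then renormalize so each plain letter's emissions sum to 1.
    Tables map (plain_letter, cipher_symbol) -> probability."""
    combined = {k: prev_table.get(k, 0.0) + random_table.get(k, 0.0)
                for k in set(prev_table) | set(random_table)}
    totals = {}
    for (p, c), v in combined.items():
        totals[p] = totals.get(p, 0.0) + v
    return {(p, c): v / totals[p] for (p, c), v in combined.items()}

# Toy example: one plain letter "a" with two possible cipher symbols.
table = next_initialization({("a", "x"): 0.6, ("a", "y"): 0.4},
                            {("a", "x"): 0.5, ("a", "y"): 0.5})
```

The effect is that evidence accumulated on earlier sentences biases the starting point for the next sentence instead of being discarded.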

3.2.4 Beam Search

The term beam search was introduced in Reddy et al. (1977). This search methodology is often used when the search space is exponential and we are faced with limited memory. In our decipherment problem, beam search selects a sequence of plain text letters based on the language model. However, we cannot guarantee the best result, because beam search only considers a limited number of candidates to reduce its memory requirements. Beam search uses breadth-first search and prunes candidates/nodes at each


level of searching, thus cutting down the search space at each level. Nuhn et al. (2013) reported that the general decipherment goal is to obtain a mapping such that the probability of the deciphered text is maximal.

For example, suppose we have a sequence of cipher text "ifmmp xpsme". Assume the search space for each letter is the 26 letters from a to z. If we search through all possible letters, we need O(26^n) space. If we instead prune the candidate letters by the language model score, keeping at each step a beam of the top width candidate letters, we reduce the search space to O(width^n). We then find that the plain text "hello world" has the highest probability.

Since we estimate the posterior probability with the Forward-Backward algorithm in the HMM, the language model score P(e) and the posterior probability P(c | e) are known. Given a sequence of corrupted offensive text, we can search the candidate letters by beam search to identify a sequence of deciphered plain text that maximizes P(e) × P(c | e). However, rather than maximizing P(e) × P(c | e), Knight et al. (2006) found that maximizing P(e) × P(c | e)^3 is extremely useful across decipherment applications. Cubing the emission probability was introduced by Knight and Yamada (1999), who stated that "it serves to stretch out the P(c | e) probabilities, which tend to be too bunched up", the bunching being caused by incompatibilities between the n-gram frequencies used to train P(e) and the n-gram frequencies found in the correct decipherment of c. In our decipherment problem, the beam search was implemented based on stack decoding (Koehn, 2009), a heuristic search that reduces the search space. We have many hypotheses at each state (cipher text letter). If a hypothesis appears to be bad, we may not want to expand it further. We organize the hypotheses into stacks, and if the stacks get too large, we remove the worst hypotheses from them. The scores in our decipherment training and decoding are log based, since we deal with small probabilities which could otherwise underflow. The following pseudo code, Algorithm 5, shows the decoding process after the substitution table has been trained.

3.2.5 Random Restarts

In Berg-Kirkpatrick and Klein (2013), the effect of running numerous random restarts when using EM to attack decipherment problems was investigated. They showed that initializations leading to the local optima with the highest likelihoods are very rare. Thus, they report that large-scale random restarts have broad potential to reach the highest likelihoods.

The reason the EM algorithm only reaches local optima is that the likelihood function of the HMM (Equation 3.5) is a non-convex function with multiple local optima (Collins, 2011).

During every iteration of the EM algorithm, the likelihood function increases until it converges at a local optimum. Random restarts try different starting points,


Algorithm 5 Beam Search Algorithm in Decipherment

n is the length of the cipher text
define a hypothesis tuple with three attributes (score, predecessor, letter)
initialize the stacks with size n+1
place the empty hypothesis (0, None, '<s>') into stack 0
for i = 1 to n do
  for the best width hypotheses h in stacks[i] do
    for p in candidate letters do
      score = h.score + 3 · S[cipher[i], p] + LM(h.predecessor.letter, h.letter, p)
      new-hypothesis ← hypothesis(score, h, p)
      if p not in stacks[i+1] or stacks[i+1][p].score < score then
        stacks[i+1][p] = new-hypothesis
      end if
    end for
  end for
end for
winner = max(stacks[-1].score)
plain-text ← traverse back from the winner hypothesis
return plain-text
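A runnable sketch of this stack decoding is given below; the toy scoring function stands in for the combined 3 · log s(c | p) plus language model score, and all names are illustrative rather than the thesis implementation:

```python
from collections import namedtuple

Hypothesis = namedtuple("Hypothesis", "score predecessor letter")

def beam_search(cipher, candidates, width, score_fn):
    """Stack decoding with a fixed beam width.

    candidates: plain-text letters to consider at each position
    score_fn(prev_letter, letter, cipher_sym): log score for one extension
    """
    stacks = [{"<s>": Hypothesis(0.0, None, "<s>")}]
    for sym in cipher:
        next_stack = {}
        # keep only the best `width` hypotheses in the current stack
        beam = sorted(stacks[-1].values(), key=lambda h: -h.score)[:width]
        for h in beam:
            for p in candidates:
                score = h.score + score_fn(h.letter, p, sym)
                if p not in next_stack or next_stack[p].score < score:
                    next_stack[p] = Hypothesis(score, h, p)
        stacks.append(next_stack)
    # traverse back from the winning hypothesis
    h = max(stacks[-1].values(), key=lambda h: h.score)
    out = []
    while h.predecessor is not None:
        out.append(h.letter)
        h = h.predecessor
    return "".join(reversed(out))

def shift_score(prev, p, sym):
    # Toy scorer: reward the plain letter one alphabet step before the cipher letter,
    # mimicking the "ifmmp" -> "hello" example (a Caesar shift of 1).
    return 0.0 if ord(sym) - ord(p) == 1 else -10.0

result = beam_search("ifmmp", "abcdefghijklmnopqrstuvwxyz", 5, shift_score)
print(result)  # hello
```

In the actual decoder, the scorer would combine the trained substitution table (cubed, i.e. multiplied by 3 in log space) with the trigram letter language model.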

which increases the probability that the likelihood function will converge to a higher local optimum, the point which generates the best results.

3.3 Summary

In this chapter, we reviewed the concepts of the HMM, the EM algorithm and the Forward-Backward algorithm, and proposed our own initialization algorithm. The idea of applying the decipherment method to corrupted offensive text was also presented: we model the decipherment problem as an HMM and regard the cipher text (corrupted offensive text) as observed data and the plain text as a hidden variable. To uncover the plain text, we preprocess the cipher text by inserting the NULL symbol based on n-gram counts and removing repeated characters. The EM algorithm estimates the parameters using the Forward-Backward algorithm in the HMM, and we then use beam search decoding to obtain the best sequence of plain text as the deciphered result.


Chapter 4

Experimental Results

4.1 Dataset

4.1.1 Wiktionary

The dataset we used is from Wiktionary. Wiktionary is a content dictionary of all words in all languages. Wiktionary data also includes sentences which are tagged with labels such as vulgar, derogatory and so on. We parsed the whole English Wiktionary dataset at the 2015-01-02 timestamp and extracted 2,298 offensive sentences and 152,770 non-offensive sentences. We tagged every sentence as either offensive or normal. Since Wiktionary is like a traditional dictionary, it has an explanation and example sentences for every word, and we extracted our sentences from these example sentences. We appended the offensive key words at the end of every sentence. In our experiment, we only considered one language (English), so we did not include other languages. We filtered the sentences using the Wiktionary label "lang=en" or "en", which helped us filter out sentences that were not in English. However, the decipherment model can also decipher original text in other languages: because it is not language dependent, we can train another language model to decipher that language's text. To train and test the classifier, we split the extracted offensive sentences into two parts, one for training and another for testing. We split the data according to their key words. For sentences having the same key word, at least 75% of those sentences went into the training set and 25% into the test set. For example, there are 8 sentences containing the key word "shit": we placed 6 sentences into the training set and the remaining two into the test set. If a key word has only one example sentence, we put it into the training set. If a key word has two example sentences, we put one into the training set and one into the test set. Thus, we guaranteed that the test set had at least one example sentence with the same key word in the training set.
We obtained 1,532 offensive sentences for training and 766 offensive sentences for testing. There were 103,503 non-offensive sentences for training and 49,267 non-offensive sentences for testing. We


preprocessed these sentences by removing the punctuation and lower-casing them. For the offensive test data, because we will substitute the alphabet letters with non-alphabet symbols, we discarded the sentences containing non-alphabet letters. After filtering, we obtained 716 offensive test sentences. We used the 1,532-sentence offensive training set and sampled 1,532 non-offensive sentences as a balanced dataset for training an offensive sentence classifier. We split the 716 offensive test sentences into 4 parts in sequence: the first three parts serve as test sets and the last part as the development set. Every set has 179 sentences.

The language model was also trained on the Wiktionary dataset. Since there are fewer offensive sentences than non-offensive sentences, we duplicated the offensive sentences until they were as numerous as the non-offensive sentences. In this way, we did not lose any information about the non-offensive sentences when making the balanced training data set. There were 1,532 lines of offensive sentences and 103,503 lines of non-offensive sentences; therefore, we duplicated the offensive ones 68 times to get 104,176 lines of offensive sentences, which is close to the number of non-offensive training sentences. The total Wiktionary training set has 155,251 tokens. To have a comprehensive language model, the Wiktionary dataset alone is not enough, so we trained another English language model on the European Parliament Proceedings Parallel Corpus 1996-2011 (Koehn, 2005). We sampled 100,000 English sentences from the German-English European Parliament Proceedings Parallel Corpus, giving 2,714,110 English tokens.

We used the SRILM toolkit (Stolcke et al., 2002) to train the Wiktionary language model and the Europarl language model, and mixed them to generate a combined language model. Each language model was given a weighting factor in the mixture. The weights were tuned on the development set to obtain the highest mixed language model score. The toolkit we used to mix the language models is py-srilm-interpolator (Kiso, 2012). In our decipherment approach, we used a letter-based language model: we inserted a space between the letters to make each letter a word, and replaced the original whitespace with "_" to train the letter-based language model.
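The letter-based preprocessing described above might look like the following sketch (the function name is hypothetical):

```python
def to_letter_tokens(sentence):
    """Prepare a sentence for a letter-based language model: each character
    becomes its own "word", and the original spaces become "_"."""
    return " ".join("_" if ch == " " else ch for ch in sentence)

print(to_letter_tokens("hello world"))  # h e l l o _ w o r l d
```

A line-based n-gram trainer such as SRILM can then treat each letter as a token when counting n-grams.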

The language model used in the noisy channel spelling correction is from the English Gigaword corpus (Graff et al., 2003). We trained the Gigaword trigram English language model on 7,477,897 English tokens, all lower-cased. The dictionary used in our spelling correction is the Linux system dictionary, which has 479,829 English words in lower case. We obtained 3,393,741 pairs of human-disguised words and original plain text words from the company to train the error model.

4.1.2 Real World Chat Messages

Two Hat Security is a technology company that combines artificial intelligence with human expertise to classify text and images on a scale of risk, taking context into account. They have a rule-based filtering system that assigns a risk level to each chat message to identify


offensive messages. However, not all of the messages can be filtered out, since people invent subtler forms of malicious messages in an effort to subvert such filtering systems. We obtained 4,713,970 chat messages from the company which were not recognized by the system at the previous timestamp. They preprocessed the chat messages into cleaner versions based on some rules, called simplified messages. We used the simplified versions of 4,700,000 chat messages as a training set to train the letter-based language model.

To train a comprehensive language model, we mixed the real chat messages language model, the Wiktionary language model and the Europarl language model using py-srilm-interpolator (Kiso, 2012). We tuned the weights on the 500 plain text messages in the development set, which were collected from real chat messages. The development set messages were all recognized by the previous-timestamp version of the filtering system. For the test messages, we sampled 500 chat messages from 265,626 chat messages which were identified as offensive by the latest version of the filtering system but as unknown by the old version. After decipherment, we again pass the deciphered results into the old version of the filtering system and evaluate how much of the text can now be recognized by it.

4.2 Experimental Design

We evaluate our approach in terms of classifier accuracy and, for the real chat messages, the risk level from the rule-based filtering system. We use the LibShortText toolkit (Yu et al., 2013) to train a classifier on the training set which distinguishes offensive from normal sentences. After training the classifier, we classify the original test sentences without any encryption. Then we encrypt the test sentences with the Caesar cipher, the Leet simple substitution cipher, and by replacing the offensive keyword with real human-disguised words, to mimic the ways people corrupt messages online. The purpose of the classifier here is not to correctly classify offensive messages; rather, it lets us measure how well the decipherment method can recover the original messages from the encrypted ones. We compare the classification accuracy between the original and deciphered messages, focusing on how close the accuracy on the deciphered messages is to that on the original messages. If the gap in classification accuracy between the original and deciphered messages is small, the decipherment approach can recover the original messages from the encrypted ones; otherwise, the gap will be large. We apply our HMM decipherment approach to the encrypted sentences and classify the deciphered sentences to compare their accuracy with that of the original sentences. The goal of this experiment is to evaluate, via the classification accuracy gap, whether the decipherment approach can recover the original sentences regardless of the encryption.


The noisy channel model based spelling correction is an alternative approach to the problem. This approach needs an error model trained from pairs of misspelled and correctly spelled words, as Algorithm 2 shows. Furthermore, it requires a dictionary containing correctly spelled English words. In our experiments, we used the following settings for the noisy channel model based spelling correction, as stated in Norvig (2009): p_spell_error = 0.05 in Equation 2.18, λ = 1.0 in Equation 2.19, a maximum edit distance edit-distance-limit = 3 in Algorithm 1, the Linux system dictionary, and the error model trained from the company data described in the previous section.

The HMM decipherment methods need random restarts to get good results. Berg-Kirkpatrick and Klein (2013) showed that decipherment of an easy cipher, Zodiac 408, reached a good log likelihood score with 100 random restarts, while using more than 100 random restarts did not appear to help further on this cipher. Thus, we ran 100 random restarts on each encrypted test set.

We also applied our decipherment methods to Two Hat Security's real chat messages. These chat messages were not recognized by their filtering system at the previous timestamp, but should be recognized at the current timestamp, as more rules have since been added to the system. We apply our decipherment to recover the original messages and pass them into their filtering system again to obtain the risk level.

4.3 Experimental Results

4.3.1 Classifier Tuning

There are 179 sentences in the development set, which are used for tuning the features and models of the classifier. First, we train our classifier with different features and different models; then we pick the one which best fits our development set. The word representation features include binary features, word count, term frequency and TF-IDF. The features also include the choice of unigrams or bigrams, stop word removal and word stemming. The classification models include the L1-loss support vector machine, the L2-loss support vector machine, the support vector machine by Crammer and Singer, and logistic regression. We extracted the context of keywords from the original training data and made it another training set for tuning; thus, we obtained a context-based training set and a complete training set. We ran the combinations of different word representations, features, training sets and classification models. On the context training set, we got the highest classification accuracy on the development set, 87%, using binary word representation, stop word removal, stemming and bigrams with the logistic regression classifier, while on the complete training set, the highest classification accuracy was 86% on the same development set.

We concluded that binary word representation, stop word removal, stemming and bigrams with the logistic regression classifier on the context training set outperform the other features on the development set. In the following experiments, the classifier was trained using the context-based training set with these features.

4.3.2 Decipherment of Caesar Cipher

For simple substitution ciphers like the Caesar cipher and the Leet substitution cipher, the HMM decipherment approach works very well. The classification accuracies dropped greatly on the encrypted text, because the Caesar cipher changes all the letters and the words are no longer English words. As Table 4.1 shows, the classification accuracies of the three test sets increased back to their original values in the "deciphering whole set" column. In this column, the decipherment is trained using Algorithm 4, which initializes the substitution table with the table trained before the current sentence. Thus, the HMM decipherment trained on the whole set can recover all the encrypted letters (cipher text) into their original letters (plain text). When we decipher each line of messages individually, the EM training does not observe enough data to learn the posterior probabilities (substitution table). In contrast, passing the substitution table trained on previous messages into the initialization table for the next message does not lose the information learned from the previous messages. Therefore, the results of whole-set decipherment are better than those of per-line decipherment.

For the noisy channel spelling correction approach, Table 4.1 shows that it cannot recover the original messages successfully, as its classification accuracies are much lower than those of the decipherment approaches; even deciphering line by line is better. Thus, the decipherment approaches can cover more cases than the supervised learning method, since the supervised learning method needs pairs of misspelled and correct words to train and has edit distance limitations. In the Caesar cipher encryption, the supervised learning model had no training data for the error model. In addition, the edit distance under the Caesar cipher correlates positively with the length of the word, because every letter has been mapped to another letter.

Here is an example from our test set A. The original message is "im going to hit the clubs and see if i can get me some cunt". After encrypting with the Caesar cipher with a 3-letter shift, the cipher text is "lp jrlqj wr klw wkh foxev dqg vhh li l fdq jhw ph vrph fxqw". The message deciphered by the HMM decipherment is "im going to hit the clubs and see if i can get me some cunt", which is the same as the original message. The noisy channel spelling correction produces "lp rj wr kl with foxe dq hh li l dq jew ph ph few", which does not recover the original message at all.
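The 3-letter-shift encryption in this example can be reproduced in a few lines. This is a standard Caesar shift, not the thesis code:

```python
import string

def caesar(text, shift):
    """Encrypt lower-case text with a Caesar cipher; letters shift, spaces stay."""
    table = str.maketrans(
        string.ascii_lowercase,
        string.ascii_lowercase[shift:] + string.ascii_lowercase[:shift])
    return text.translate(table)

cipher = caesar("im going to hit the clubs", 3)
print(cipher)              # lp jrlqj wr klw wkh foxev
print(caesar(cipher, -3))  # shifting back by 3 recovers the plain text
```

Note that decryption is just the same shift in the opposite direction, which is why this cipher is an easy case for the decipherment model.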

We ran 100 random restarts on the Caesar-cipher-encrypted dataset A. Figure 4.1 shows every log likelihood value over the 100 random restarts. The mean of the log likelihood was -27923 and the standard deviation was 23. As the Caesar cipher is easy to decipher, the log likelihood did not vary greatly. The highest log likelihood gave a classification accuracy of 86% (154/179), which is the same as the original plain text classification accuracy. Thus,


Test Set | Original text | Caesar cipher encrypted text | Noisy Channel Spelling Correction | Deciphering per line | Deciphering whole set
A | 86% (154/179) | 4% (8/179) | 28% (51/179) | 55% (99/179) | 86% (154/179)
B | 86% (154/179) | 3% (7/179) | 22% (41/179) | 51% (92/179) | 86% (154/179)
C | 84% (152/179) | 5% (10/179) | 22% (41/179) | 58% (104/179) | 84% (152/179)

Table 4.1: Classification Accuracy of Spelling Correction and Decipherment Results on Caesar Cipher Encrypted Text

Figure 4.1: 100 Random Restarts Loglikelihood in Caesar Cipher Decipherment

the decipherment process recovers the whole message that was encrypted by the Caesar cipher.

4.3.3 Decipherment of Leet Substitution

We encrypted the three test sets A, B, C with Caesar cipher encryption using a 3-letter right shift. For the Leet substitution cipher, we used KoreLogic's Leet rules (KoreLogic Security, 2012), tagged as "#KoreLogicRulesL33t". We used the John the Ripper password cracker (Solar Designer and Community, 2013) to apply the KoreLogic Leet rules to encrypt our test sets.

The Leet substitution cipher is more complicated than the Caesar cipher because every letter can map to different characters or numbers without a single consistent mapping rule. In this experiment, we evaluated different beam search widths and the two decipherment initializations in the same way as for the Caesar cipher decipherment.

After EM training, we obtained the posterior probability table (the substitution table) and used beam search to decode the results. The wider the beam search width, the larger the search space, and the higher the probability of finding a good solution.


Test Set | Decipherment Type | Beam search width of 1 | Beam search width of 5 | Beam search width of 10
A | Per Line | 45% (82/179) | 45% (82/179) | 55% (100/179)
A | Whole Set | 80% (144/179) | 82% (147/179) | 82% (147/179)
B | Per Line | 58% (104/179) | 62% (112/179) | 62% (112/179)
B | Whole Set | 79% (143/179) | 82% (148/179) | 82% (148/179)
C | Per Line | 58% (105/179) | 67% (121/179) | 68% (123/179)
C | Whole Set | 81% (146/179) | 81% (146/179) | 81% (146/179)

Table 4.2: Classification Accuracy of Decipherment Results with the Leet Substitution Cipher

Table 4.2 shows that as the beam search width increases, the classification accuracies also increase or stay the same. A width of 5 is thus reasonable for beam search decoding: a width of 10 needed more time to decode, as it has a larger search space, while giving the same results, so we chose a beam search width of 5 to decode the following tests. Deciphering with the whole set is again better than deciphering per line. Whole-set deciphering takes the substitution table from the previously deciphered sentence as the initial table for the following sentence, which is useful from one sentence to another when the cipher is similar: it helps to find the actual mapping between the cipher and plain text. Deciphering per line, however, does not consider the trained table generated from the previous data.

Decipherment on the whole set achieves classification accuracies above 80% and thus outperforms the noisy channel spelling correction method. Because the Leet substitution cipher does not change every letter in a word, noisy channel spelling correction can recover some of the corrupted messages; however, it still cannot outperform the decipherment approach with previous-table initialization.

For example, in test set A the original message is “whats fucker”. The Leet-encrypted message is “wha+$ Fuck3r”, which substitutes “t” with “+”, “s” with “$” and “e” with “3”. The message deciphered by the HMM approach is “whats fucker”, identical to the original. The noisy channel spelling correction produces “what a fucker”, which is almost the same as the original message.
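The substitutions in this example can be reproduced with a small sketch; the mapping table here covers only the three substitutions mentioned above, not a full Leet alphabet:

```python
# The thesis example substitutes "t" -> "+", "s" -> "$" and "e" -> "3";
# this partial table is illustrative, not the full cipher used in testing.
LEET = {"t": "+", "s": "$", "e": "3"}

def leet_encrypt(text):
    """Apply the (partial) Leet substitution table to a message,
    leaving unmapped characters unchanged."""
    return "".join(LEET.get(ch.lower(), ch) for ch in text)

print(leet_encrypt("whats fucker"))  # wha+$ fuck3r
```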

Beam search generates many candidate sentences during decoding and picks the highest-scoring candidate as the final result. However, it does not consider the other candidate sentences, which can also be good deciphered results. We ran an experiment that lists the top 5, top 50 and top 500 candidate sentences for each sentence.

Top n Candidates | Classification Accuracy
1                | 82% (147/179)
5                | 83% (148/179)
50               | 95% (170/179)
500              | 99% (178/179)

Table 4.3: Top n Candidates of Decipherment Results in Leet Substitution Cipher Classification on Test Set A

We used the same classifier to classify these candidate sentences, to see whether any sentence in the list could also be a plausible deciphered result. For every sentence, if its candidate list contains a candidate sentence that is classified as offensive, we classify the sentence as offensive. The experiment was run on test set A. As Table 4.3 shows, as we consider more candidate sentences rather than only the highest-scoring one, the classification accuracies increase. Thus, the correct deciphered results are in the candidate lists, but the highest-scoring candidate is not guaranteed to be the best result.
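The candidate-list classification rule described above amounts to an any-over-top-n check; a sketch, with `is_offensive` standing in for the thesis's classifier:

```python
def classify_with_candidates(candidates, is_offensive, n):
    """Classify a sentence as offensive if any of its top-n beam search
    candidates is classified as offensive.

    `candidates` is the score-ordered candidate list from beam search;
    `is_offensive` stands in for the text classifier used in the thesis.
    """
    return any(is_offensive(c) for c in candidates[:n])
```

Raising `n` can only add positives, which matches the monotonic accuracy increase in Table 4.3 (at the cost of more classifier calls per sentence).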

As Table 4.4 shows, HMM decipherment training can recover most of the words in our test sets, compared to the original-message classification accuracy of each set. From the results in Table 4.1 and Table 4.4, no matter which substitution cipher was used, be it Caesar cipher or Leet substitution cipher, HMM decipherment with a language model could always recover the original messages.

Test Set | Original Text | Leet Encrypted Set | Noisy Channel Spelling Correction | Deciphering per Line | Deciphering Whole Set
A        | 86% (154/179) | 59% (107/179)      | 68% (122/179)                     | 56% (102/179)        | 82% (147/179)
B        | 86% (154/179) | 64% (115/179)      | 60% (108/179)                     | 62% (112/179)        | 82% (148/179)
C        | 84% (152/179) | 62% (112/179)      | 65% (117/179)                     | 67% (121/179)        | 81% (146/179)

Table 4.4: Classification Accuracy of Spelling Correction and Decipherment Results in Leet Substitution Cipher with Beam Search Width of 5

Gao and Johnson (2008) and Johnson (2007) show that Variational Bayes is a fast estimator, especially on large data sets, which lets EM training converge to a local optimum in fewer iterations. The Variational Bayes formula in Figure 4.2 is defined in Gao and Johnson (2008), where m′ and m are the number of word types and states respectively; in our decipherment case, both are English letters. ψ is the digamma function, and E[·] is the expected count from EM training.
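A sketch of this Variational Bayes M-step update, assuming expected counts from the E-step; the digamma approximation is a standard recurrence-plus-asymptotic-series implementation, and all names are illustrative rather than taken from the thesis code:

```python
import math

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x+1) - 1/x and an
    asymptotic series for large x (a standard numerical approximation)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def vb_mstep(expected_counts, total, alpha, m):
    """One Variational Bayes M-step update in the style of Figure 4.2:
    theta_k = f(E[n_k] + alpha) / f(E[n] + m * alpha), with f(v) = exp(psi(v)).

    `expected_counts` maps each outcome to its E-step expectation,
    `total` is their sum, `alpha` the Dirichlet hyperparameter, `m` the
    number of outcomes. Names are illustrative.
    """
    f = lambda v: math.exp(digamma(v))
    denom = f(total + m * alpha)
    return {k: f(c + alpha) / denom for k, c in expected_counts.items()}
```

Because `exp(psi(v))` behaves roughly like `v - 0.5`, the update effectively subtracts mass from every count, yielding a deficient distribution that favors sparser solutions than plain EM.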

θ̃^(l+1)_(t′|t) = f(E[n_(t′,t)] + α) / f(E[n_t] + mα)

θ̃^(l+1)_(w|t) = f(E[n′_(w,t)] + α′) / f(E[n_t] + m′α′)

f(v) = exp(ψ(v))

Figure 4.2: Variational Bayes for EM Algorithm from Gao and Johnson (2008)

We applied add-one smoothing and Variational Bayes smoothing (Gao and Johnson, 2008; Johnson, 2007). Table 4.5 shows that smoothing in our case does not provide a better result than the unsmoothed version. With add-one smoothing, the classification results did not change greatly from those without smoothing. Variational Bayes smoothing even reduces the quality of the results, so it is not a suitable smoothing method for our EM training. Gao and Johnson (2008) reported that the approximation used by Variational Bayes is likely to be less accurate on smaller data sets; in our case, the HMM states form a small set of only 27 letters, so Variational Bayes smoothing did not perform well.

We also applied 100 random restarts to the Leet substitution cipher decipherment on test set A.

Figure 4.3: 100 Random Restarts Loglikelihood in Leet Substitution Cipher Decipherment

Figure 4.3 shows the loglikelihood value of each of the 100 random restarts. The mean loglikelihood was -30183 and the standard deviation was 390. These statistics indicate that Leet substitution cipher decipherment has a larger deviation than Caesar cipher decipherment; in other words, it is the harder decipherment problem. The highest loglikelihood was -30061.3, with a classification accuracy of 80% (144/179). Although the restart with the highest loglikelihood did not give the highest classification accuracy (82% (148/179)), it is close to it.


Encryption Type          | Smoothing Type              | Classification Accuracy (width 5) | Loglikelihood
Caesar cipher            | Without Smoothing           | 86% (154/179)                     | -27904
Caesar cipher            | Add-one Smoothing           | 86% (154/179)                     | -27901
Caesar cipher            | Variational Bayes Smoothing | 44% (79/179)                      | -55114
Leet substitution cipher | Without Smoothing           | 82% (147/179)                     | -30110
Leet substitution cipher | Add-one Smoothing           | 81% (146/179)                     | -30105
Leet substitution cipher | Variational Bayes Smoothing | 27% (50/179)                      | -66204

Table 4.5: Smoothing in Decipherment of Whole Set on Test Set A

4.3.4 Decipherment of Real Chat Offensive Words Substitution Dataset

In the Wiktionary dataset, we encrypted the original sentences by substituting the offensive words with real human-corrupted words. These corrupted offensive words come from the Two Hat Security company, which runs a rule-based filtering system to filter out offensive chat messages. The company provided us with pairs of corrupted offensive words collected from real chat messages and the corresponding plain text. However, some offensive words in our Wiktionary dataset did not appear in their corrupted word set, so we simply changed one letter at a random position in the word to imitate human users hiding real messages.
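The single-letter corruption used for the uncovered words can be sketched as follows (the helper name is illustrative):

```python
import random
import string

def corrupt_one_letter(word, rng=random):
    """Replace one letter at a random position with a different letter,
    imitating a user disguising a word the filter would otherwise catch.
    Illustrative helper, not the thesis's exact procedure."""
    if not word:
        return word
    i = rng.randrange(len(word))
    choices = [c for c in string.ascii_lowercase if c != word[i].lower()]
    return word[:i] + rng.choice(choices) + word[i + 1:]
```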

Unlike the previous encryption types, this encryption comes from real chat messages. It is not like the Caesar cipher, which only shifts letters to other letters, nor like the Leet simple substitution cipher, which only involves substitution. In real chat messages, users can always invent more creative ways to disguise their messages and bypass the filtering system, which can involve both insertion and deletion. Therefore, we need to handle insertion, deletion and substitution in the disguised words to recover the original plain text. In the insertion case, for example, if the original word “hello” is disguised as “helo”, we need to insert a NULL symbol inside the disguised word “helo” to decipher it. The correct place to insert the NULL symbol is “he<NULL>lo”, but we do not know the proper position during training. One approach is to insert NULL at a random position before EM training begins; another is to insert NULL at n-gram boundaries, according to n-gram counts from the plain-text training set (Ando and Lee, 2003).
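A sketch of the n-gram count based variant, using bigram counts and a hypothetical `threshold`; the real system may use higher-order n-grams:

```python
def insert_nulls(token, bigram_count, threshold=1):
    """Insert a <NULL> placeholder wherever two adjacent characters form
    a bigram that is rare in the plain-text training data, approximating
    the n-gram count based insertion of Ando and Lee (2003).

    `bigram_count` maps a two-character string to its training-set count;
    the table and `threshold` are illustrative assumptions.
    """
    if not token:
        return token
    out = [token[0]]
    for a, b in zip(token, token[1:]):
        if bigram_count.get(a + b, 0) < threshold:
            out.append("<NULL>")  # likely word boundary or deleted letter
        out.append(b)
    return "".join(out)
```

A rare bigram such as “ow” inside “helloworld” then marks the word boundary where the EM training can align a NULL.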

In Table 4.6, the n-gram count based insertion decipherment has a higher classification accuracy than the random NULL insertion decipherment.

Test Set | Wiktionary Encrypted Set | Noisy Channel Spelling Correction | Aspell Spelling Correction | Random NULL Insertion Decipherment | n-gram Count Based Insertion Decipherment
A        | 64.8045% (116/179)       | 72.6257% (130/179)                | 77.0950% (138/179)         | 69.2737% (124/179)                 | 72.0670% (129/179)
B        | 72.0670% (129/179)       | 76.5363% (137/179)                | 73.1844% (131/179)         | 72.6257% (130/179)                 | 75.9777% (136/179)
C        | 75.9777% (136/179)       | 77.0950% (138/179)                | 77.0950% (138/179)         | 76.5363% (137/179)                 | 78.2123% (140/179)

Table 4.6: Classification Accuracy of Spelling Correction and Decipherment Results in Real Chat Offensive Words Substitution Wiktionary Dataset

In real chat messages, combining two words, as in “helloworld”, is common. The n-gram count based NULL insertion can handle this case: based on the n-gram counts, we can determine the boundary between the words and insert the NULL symbol there. For example, given the corrupted word “helloworld”, n-gram count based insertion places a NULL between “hello” and “world”, which helps the decipherment training. One advantage of this HMM decipherment method is that it tends not to change words that are already correct, since those words already receive the highest language model score; rather, it changes the words that are corrupted or misspelled. Whatever the type of encryption, the HMM decipherment approach always deciphers the messages to fit the language model we trained.

Thus, the best classification accuracy comes from the HMM decipherment method with n-gram count based insertion. We also tested the real corrupted-word encrypted Wiktionary offensive sentences with the Aspell program. Table 4.6 shows that the decipherment results are better than the Aspell results on test sets B and C, while on test set A the decipherment approach is about 5% less accurate than Aspell. From the experimental results, it is clear that the HMM decipherment method is still as effective as spelling correction with Aspell at recovering corrupted words to their original forms. Furthermore, the noisy channel spelling correction results in Table 4.6 are quite close to those of the n-gram count based insertion decipherment method.

In summary, spelling correction methods such as the noisy channel model and Aspell have similar performance when dealing with real chat corrupted offensive words. If the editing is too extensive, as in Caesar cipher encryption, spelling correction performance drops substantially, but the decipherment approaches are not affected.


Figure 4.4: 100 Random Restarts Loglikelihood in Decipherment of Real Chat Offensive Words Substitution on Test Set B

This time we performed 100 random restarts on test set B. Figure 4.4 shows the loglikelihood value of each of the 100 random restarts. The mean loglikelihood was -41444.7 and the standard deviation was 115.53. As real chat offensive words have greater diversity, this decipherment is much harder than the simple substitution ciphers. As Table 4.6 shows, the encrypted set of test set B had a 72.0670% classification accuracy. From the 100 random restarts, the classification accuracy at the highest loglikelihood was 75.9777%, which is not a great increase. However, the increase in accuracy shows that the decipherment did recover some corrupted words.

4.3.5 Decipherment of Real Offensive Chat Messages

We obtained 500 sampled real offensive chat messages to decipher. Before deciphering these messages, we preprocessed the text as before: we removed repeated characters, keeping at most two sequential repeats (for example, we changed “heeeellllooo” to “heelloo”), and we substituted special symbols with “NULL”. We trained the n-gram counts on the same training set used to train the language model, which was composed of 4,700,000 chat messages recognized by the filtering system. The filtering system assigns a risk level to each text that passes through it: a risk level higher than 4 means the text is offensive, a risk level lower than 4 means it is not offensive, and a risk level of exactly 4 means the system did not recognize the text and could not make a decision. We passed the deciphered results through the old version of the filtering system and obtained the risk level it assigned. The reason is that these 500 sampled test messages were collected by the latest filtering system, which can assign them risk levels higher than 4, whereas the old version of the filtering system left them all at risk level 4. We wanted to recover these messages and pass them through the old version of the filtering system again, to see how many messages it could then recognize.
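The repeated-character preprocessing step described above can be written as a single regular expression substitution:

```python
import re

def preprocess(text):
    """Collapse any run of three or more repeated characters down to
    two, as in the preprocessing step described in the text."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(preprocess("heeeellllooo"))  # heelloo
```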

Here we show some real examples deciphered by our decipherment approach. For instance, “fvk u” and “f2ck u” are both deciphered into “fuck u”; the decipherment handles both the substitution and the insertion in these examples. After passing the deciphered results through the old version of the filtering system, 51.6% of the offensive test texts were recognized with a risk level higher than 4, while the percentage of text recovered by the old version is 58.6%; 7% of the test messages were deciphered into non-offensive messages that were supposed to be offensive. Thus, the decipherment approach can recover about half of the corrupted messages into messages the filtering system recognizes, but some messages were not properly recovered and were therefore categorized as non-offensive text.


Chapter 5

Conclusion

HMM decipherment can decipher disguised text based on a language model regardless of the encryption type. It works especially well when the encryption is a simple substitution cipher, since the model then only has to consider substitutions. As our experiments show, for Caesar cipher encryption the decipherment can recover all of the encrypted messages to their original form, obtaining the same classification accuracy as the original messages. For more complicated encrypted text, we insert “NULL” symbols according to the n-gram count methodology of Ando and Lee (2003) to handle insertion cases. In our experiments on Caesar cipher decipherment, Leet substitution decipherment, real chat offensive-word encrypted text decipherment and real chat message decipherment, HMM decipherment always increased the classification accuracy. The decipherment approach covers more cases than spelling correction methods: in the Caesar cipher case, the decipherment results reach the same classification accuracy as the original messages, whereas the noisy channel spelling correction only reaches around 22%. Due to its edit distance limitation and the lack of error-model training data, noisy channel spelling correction cannot handle large edit distances. However, large edit distances are common in real chat messages, and this is where the decipherment approach has its advantages. The difference between decipherment and traditional spelling correction methods like Aspell is that the decipherment method only needs a language model to decipher cipher text, with no dictionary to consult. The language model also has the advantage that we can train a domain-specific model to decipher messages on a specific topic, since real chat messages usually belong to some topic or domain, such as sports or news. As future work, we can try to decipher messages in other languages, as long as we have the corresponding language model.

In this thesis, we showed through extensive experimental studies that evasive or encrypted offensive text can be recovered to its original plain text by an HMM-based decipherment approach. For the first time, we modeled this problem as a decipherment problem and solved it with a statistical model and machine learning algorithms.


Bibliography

Ando, R. K. and Lee, L. (2003). Mostly-unsupervised statistical segmentation of Japanese kanji sequences. Natural Language Engineering, 9(2):127–149.

Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8.

Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.

Berg-Kirkpatrick, T. and Klein, D. (2013). Decipherment with a million random restarts. In EMNLP, pages 874–878.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Blömer, J. and Bujna, K. (2013). Simple methods for initializing the EM algorithm for Gaussian mixture models. arXiv preprint arXiv:1312.5946.

Brown, P., Cocke, J., Pietra, S. D., Pietra, V. D., Jelinek, F., Mercer, R., and Roossin, P. (1988). A statistical approach to language translation. In Proceedings of the 12th Conference on Computational Linguistics - Volume 1, pages 71–76. Association for Computational Linguistics.

Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.

Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Chen, Y., Zhou, Y., Zhu, S., and Xu, H. (2012). Detecting offensive language in social media to protect adolescent online safety. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 71–80. IEEE.


Church, K. W. and Gale, W. A. (1991). Probability scoring for spelling correction. Statistics and Computing, 1(2):93–103.

Collins, M. (2011). Statistical machine translation: IBM models 1 and 2. Columbia University.

Collins, M. (2013). The forward-backward algorithm. Columbia University.

Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38.

Forney Jr, G. D. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.

Gao, J. and Johnson, M. (2008). A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 344–352. Association for Computational Linguistics.

Graff, D., Kong, J., Chen, K., and Maeda, K. (2003). English Gigaword. Linguistic Data Consortium, Philadelphia.

Jelinek, F. (1997). Statistical Methods for Speech Recognition. MIT Press.

Johnson, M. (2007). Why doesn't EM find good HMM POS-taggers? In EMNLP-CoNLL, pages 296–305.

Jurafsky, D. and Martin, J. H. (2014). Speech and Language Processing. Pearson.

Kansara, K. B. and Shekokar, N. M. (2015). A framework for cyberbullying detection in social network. International Journal of Current Engineering and Technology, 5.

Kernighan, M. D., Church, K. W., and Gale, W. A. (1990). A spelling correction program based on a noisy channel model. In Proceedings of the 13th Conference on Computational Linguistics - Volume 2, pages 205–210. Association for Computational Linguistics.

Kiso, T. (2012). A python wrapper for determining interpolation weights with SRILM. https://github.com/tetsuok/py-srilm-interpolator.

Knight, K., Nair, A., Rathod, N., and Yamada, K. (2006). Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, pages 499–506. Association for Computational Linguistics.


Knight, K. and Yamada, K. (1999). A computational approach to deciphering unknown scripts. In ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 37–44.

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Koehn, P. (2009). Statistical Machine Translation. Cambridge University Press.

Kore Logic Security (2012). Kore Logic custom rules. http://contest-2010.korelogic.com/rules.txt.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707.

Norvig, P. (2009). Natural language corpus data. Beautiful Data, pages 219–242.

Nuhn, M., Schamper, J., and Ney, H. (2013). Beam search for solving substitution ciphers. Citeseer.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

Ravi, S. and Knight, K. (2011). Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 12–21. Association for Computational Linguistics.

Razavi, A. H., Inkpen, D., Uritsky, S., and Matwin, S. (2010). Offensive language detection using multi-level classification. In Canadian Conference on Artificial Intelligence, pages 16–27. Springer.

Reddy, D. et al. (1977). Speech understanding systems: A summary of results of the five-year research effort. Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., and Edwards, D. D. (2003). Artificial Intelligence: A Modern Approach, volume 2. Prentice Hall, Upper Saddle River.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423.

Solar Designer and Community (2013). John the Ripper password cracker. http://www.openwall.com/john.

Stolcke, A. et al. (2002). SRILM: An extensible language modeling toolkit. In Interspeech.


Weaver, W. (1955). Translation. Machine Translation of Languages, 14:15–23.

Wikipedia (2016). Leet. Retrieved 11 June 2016, from https://en.wikipedia.org/wiki/Leet.

Yu, H., Ho, C., Juan, Y., and Lin, C. (2013). LibShortText: A library for short-text classification and analysis.
