Outlier detection for multivariate time series
Jorge Luís Fuzeta Serras
Master’s Thesis
Electrical and Computer Engineering
Supervisor(s): Prof. Alexandra Sofia Martins de Carvalho, Prof. Susana de Almeida Mendes Vinga Martins
Examination Committee
Chairperson: Prof. António Manuel Raminhos Cordeiro Grilo
Supervisor: Prof. Alexandra Sofia Martins de Carvalho
Member of the Committee: Prof. Sara Alexandra Cordeiro Madeira
November 2018
Dedicated to my parents.
Declaration
I declare that this document is an original work of my own authorship and that it fulfills all the requirements of
the Code of Conduct and Good Practices of the Universidade de Lisboa.
Acknowledgments
I express my deepest gratitude to those who made this journey possible. I wish to thank my supervisors, Alexandra Carvalho and Susana Vinga, for their continuous support, knowledge and availability throughout the project.
Moreover, I also wish to thank all my family and friends, whose endorsement motivated me along the way.
The work of this dissertation was performed under a research grant assigned by Instituto de Telecomunicações within the framework of the project PERSEIDS, personalizing cancer therapy through integrated modeling and decision, financed by the Portuguese Foundation for Science and Technology (Fundação para a Ciência e a Tecnologia - FCT) under contract PTDC/EMS-SIS/0642/2014.
Resumo
Outliers podem ser definidos como observações suspeitas de não terem sido geradas pelos processos subjacentes aos dados. Muitas aplicações exigem uma forma de identificar padrões interessantes ou incomuns em séries temporais multivariadas (STM); no entanto, a maioria dos métodos de deteção concentram-se exclusivamente em séries univariadas, fornecendo aos analistas soluções extenuantes.
Propomos um sistema completo de deteção de outliers, abrangendo problemas desde o pré-processamento, que adota um algoritmo de modelação de redes de Bayes dinâmicas. O mesmo codifica as conectividades inter e intra-temporais ótimas de redes de transição, capazes de identificar dependências condicionais em STM. Um mecanismo de janela deslizante é empregado para capturar gradualmente o score de cada transição de uma STM dado o modelo. Simultaneamente, STM inteiras são avaliadas. Duas estratégias de análise de scores são estudadas para assegurar uma classificação automática de dados anómalos.
A abordagem proposta é primeiramente validada através de dados simulados, demonstrando o desempenho do sistema. Uma comparação com um método de árvore probabilística de sufixos está disponível, exibindo a vantagem da abordagem multivariada proposta. Experiências adicionais são feitas em dados reais, revelando anomalias em cenários distintos, como séries de eletrocardiogramas, dados de taxas de mortalidade e escrita de dígitos.
O sistema desenvolvido mostrou-se benéfico na captura de outliers resultantes de contextos temporais, sendo adequado para qualquer cenário que emprega STM. Uma aplicação web de livre acesso, empregando o sistema completo, é disponibilizada em conjunto com um tutorial.
Palavras-chave: séries temporais multivariadas, deteção de outliers, redes de Bayes dinâmicas, algoritmo de janela deslizante, análise de resultados, aplicação web
Abstract
Outliers can be defined as observations suspected of not having been generated by the data's underlying processes. Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS); however, most outlier detection methods focus solely on univariate series, leaving analysts with strenuous solutions.
We propose a complete outlier detection system, covering all stages from pre-processing onward, that adopts a dynamic Bayesian network modeling algorithm. The latter encodes the optimal inter and intra time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. Two score-analysis strategies are studied to provide an automatic boundary for classifying anomalous data.
The proposed approach is first validated on simulated data, demonstrating the system's performance. A comparison with an assembled probabilistic suffix tree method is provided, displaying the advantage of the proposed multivariate approach. Further experimentation is carried out on real data, uncovering anomalies in distinct scenarios such as electrocardiogram series, mortality rate data and written pen digits.
The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application employing the complete system is made available together with a tutorial.
Keywords: multivariate time series, outlier detection, dynamic Bayesian networks, sliding window algo-
rithm, score analysis, web application
Contents
Acknowledgments
Resumo
Abstract
List of Figures
Glossary
1 Introduction
1.1 Overview and motivation
1.2 Contributions
1.3 Organization
2 Outlier detection
2.1 Overview
2.1.1 Outlier and Data Types
2.1.2 Outlier detection strategies
2.1.3 Challenges in Outlier Detection
2.2 Outlier Detection Methods
3 Statistical Modeling
3.1 Data
3.2 Markovian Models
3.2.1 Variable and sparse Markov models
3.2.2 Hidden Markov models
3.3 Bayesian models
3.3.1 Bayesian network outlier detection
3.3.2 Dynamic Bayesian networks
3.3.3 Dynamic Bayesian network outlier detection
3.3.4 Model training
4 Proposed method
4.1 Time series pre-processing
4.2 Assumptions
4.3 Dynamic outlier detection technique
4.3.1 Outlierness score
4.3.2 Sliding window
4.4 Score analysis
4.4.1 Tukey’s strategy
4.4.2 Gaussian mixture model strategy
4.5 Web application and software
5 Experimental results
5.1 Simulated data
5.1.1 Proposed approach
5.1.2 Comparison with probabilistic suffix trees
5.2 Real Multivariate Time Series
5.2.1 Electrocardiogram dataset
5.2.2 Mortality Dataset
5.2.3 Pen digits dataset
6 Discussion
6.1 Conclusions
6.2 Future Work
Bibliography
List of Tables
5.1 Subject outlier results of the proposed approach using Tukey’s score-analysis on simulated data
5.2 Subject outlier results of the proposed approach using GMM score-analysis on simulated data
5.3 Subject outlier results of the PST approach using Tukey’s score-analysis on simulated data
5.4 Subject outlier results of the PST approach using GMM score-analysis on simulated data
5.5 Subject outlier results of the proposed approach on pen digits data
List of Figures
1.1 Evolution of anomaly detection systems in security data science.
1.2 Main phases of the proposed outlier detection technique.
3.1 Example of a probabilistic suffix tree
3.2 Three-state hidden Markov model.
3.3 Example of a first-order DBN model
4.1 Scheme of the proposed outlier detection approach
4.2 Time series pre-processing example
4.3 Sliding window example
4.4 Distribution skewness types
4.5 Two Gaussian distribution classes
4.6 Web application front page.
4.7 Transition score analysis tab.
5.1 Transition network for the generated normal subjects
5.2 Transition networks used for abnormal subject generation
5.3 Histogram for an experiment comprising simulated data
5.4 Histogram of the control experiment
5.5 Score analysis method comparison
5.6 Experiment comparing both outlier detection systems
5.7 Mean and standard deviation of normalized variables along time.
5.8 ECG transition outlier results
5.9 Flipped ECG transition outliers
5.10 ECG transition outlierness considering two alphabet sizes
5.11 Normalized male mortality rates of 6 selected variables.
5.12 Mortality data sets transition results
5.13 Year score histogram of the Mortality data set
5.14 Transition outlierness of a pen digits dataset
List of Algorithms
1 Data pre-processing
2 Transition outlier detection
Glossary
AR – Auto-Regressive model
ARMA – Auto-Regressive Moving-Average model
BIC – Bayesian Information Criterion: also known as MDL
BN – Bayesian Network
DBN – Dynamic Bayesian Network
EM – Expectation Maximization
FN – False Negative
FP – False Positive
FSA – Finite-State Automaton
GMM – Gaussian Mixture Model
HMM – Hidden Markov Model
LL – Log-Likelihood
KLD – Kullback-Leibler Divergence
MLE – Maximum Likelihood Estimator
MDL – Minimum Description Length: also known as BIC
MTS – Multivariate Time Series
PAA – Piecewise Aggregate Approximation
PST – Probabilistic Suffix Tree
PDF – Probability Density Function
SAX – Symbolic Aggregate Approximation
TS – Time Series
TN – True Negative
TP – True Positive
Chapter 1
Introduction
The world is changing, and machines are increasingly taking over tasks once performed by humans. In recent times, the machine learning community has boomed due to the ever-expanding desire to acquire maximum benefit from collected data. Such demand is apparent in sectors like biomedicine, socio-economics and industry, which benefit the most given the crucial nature of their data.
Outlier detection has proved to induce decisive progress in multiple scenarios. Examples include the determination of long- or short-time survivors in survival data [1] and the revelation of uncommon ECG patterns representative of a specific disease [2], showing that outliers are not always associated with erroneous observations.
A complete outlier detection design is assembled from scratch with the aim of providing an intuitive and effective alternative for discovering abnormal entities in real-world datasets. In the present thesis, a complete anomaly detection system is developed, extending from data pre-processing to the recognition of unusual patterns in multivariate time series (MTS) [3].
The present study reviews existing outlier detection algorithms, with special focus on temporal and sequential data, and gives a rundown of statistical modeling. A new outlier detection approach based on dynamic Bayesian networks (DBN) [4] is thus proposed and validated on synthetic and real datasets. Furthermore, a probabilistic suffix tree (PST) technique, which does not consider relationships among variables, is built and contrasted with the developed system. The PST approach, like most of the existing literature, focuses solely on the detection of univariate temporal anomalies. Even with a sophisticated univariate system, many contextual outliers cannot be disclosed without contemplating inter-variable relations, thus limiting univariate strategies. The proposed approach relies on sliding windows over a DBN to assure abnormality disclosure. A complete system, advantageous in capturing complex MTS patterns by encoding temporal and inter-variable dependencies simultaneously, is made available, where none previously existed.
1.1 Overview and motivation
An overview of the topics and main objectives of the thesis is given, along with the rationale behind design decisions in light of real-world applications. An in-depth analysis of existing outlier detection techniques is undertaken with the intention of providing the necessary background on the realm of anomaly detection. An alternative outlier detection approach is proposed, capable of uncovering complex contextual anomalies in longitudinal data, contrary to most existing algorithms.
Data-related applications, ranging from primary sectors like industry to consumer-focused areas, share a desire to acquire maximum performance and information: not only through the detection of flaws, but also through the discovery of patterns or novelties that could be crucial to a better understanding of the data's underlying nature, leading to the development of more efficient methods. Without deviation from the norm, progress is not possible [5].
An outlier can simply be described as an element or segment that stands out without apparent explanation. Outliers have the capability to mislead analysts toward altogether different insights regarding the data, distorting any analysis results. Since strategic decisions are taken based on such results, anomaly detection is of great importance. As an example, applications in security data science owe much of their progress to anomaly detection mechanisms [6, 7]. This leads to an endless search for better and more efficient mechanisms, as seen in Fig. 1.1.
Figure 1.1: Evolution of anomaly detection systems in security data science. Adapted from [8].
Considering the extent and complexity of outlier detection, MTS data is the spotlight of this study. A time series (TS) is defined as a set of observations measured along time at an equal pace. Such series are considered MTS when composed of multiple attributes, each represented by a TS. The proposed approach handles MTS datasets by employing Bayesian statistics capable of expressing the degree of belief in observed data.
Taking advantage of Bayesian modeling, a complete anomaly detection system is studied, capable of fitting into a range of data-mining and cleansing applications. It tackles known issues from pre-processing to post-score analysis, with the main phases described in Fig. 1.2. The proposed approach requires multivariate data to be discretized into finite-domain variables. To this end, a discretization and dimensionality reduction approach known as SAX [9] is employed prior to modeling. Data is thus pre-processed to avoid issues such as high dimensionality. The whole system focuses on tackling temporal data, mainly MTS.
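As an illustration of this pre-processing step, the following minimal sketch applies SAX to a single series. Python is used here purely for compactness (the thesis software itself is written in R and Java), and the function name is illustrative; the breakpoints are the standard Gaussian quantiles SAX assumes for a 4-symbol alphabet.

```python
import statistics

def sax(series, n_segments, alphabet="abcd"):
    """Symbolic Aggregate approXimation sketch: z-normalize the series,
    reduce it with PAA (segment means), then map each segment mean to a
    symbol via equiprobable standard-Gaussian breakpoints."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0
    z = [(x - mu) / sigma for x in series]
    # Piecewise Aggregate Approximation: mean of each equal-width segment
    seg = len(z) // n_segments
    paa = [statistics.fmean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]
    breakpoints = [-0.674, 0.0, 0.674]  # split N(0,1) into 4 equal-mass regions
    def symbol(v):
        for bp, s in zip(breakpoints, alphabet):
            if v < bp:
                return s
        return alphabet[-1]
    return "".join(symbol(v) for v in paa)

print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4))  # → "abcd"
```

In a multivariate setting each attribute series would be discretized independently in this way, yielding one symbolic TS per attribute.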
In order to address multivariate temporal data, DBNs [4] are employed in the proposed approach. These are graphical statistical models capable of encoding conditional relationships not only among different variables but also through time. Such approaches have already been used in the literature, engaging a wide range of applications
Figure 1.2: Main phases of the proposed outlier detection technique.
[10–13]. The aforesaid techniques can model complex MTS structures, providing a normality standard. An outlier detection approach is thus developed to measure the level of discrepancy between each test subject and the obtained model.
In order to model the normal behavior of a dataset, an existing strategy [14] is adapted to an outlier detection scenario. It learns an optimal tree-augmented DBN (tDBN) structure. The learning algorithm provides a network possessing optimal inter/intra-slice connectivity for each transition network, verified to outperform the existing literature. Both stationary and non-stationary DBNs are studied.
The assembled anomaly detection mechanism adopts a sliding window approach to uncover transition outliers in multivariate temporal datasets. A dataset is seen as a set of subjects, where each subject is an MTS. A transition is regarded as a subset of a subject comprising the observations responsible for the explanation of a certain time-stamp; in other words, a window of an MTS holding the observations of a given time slice and its preceding time frames. Windows have an associated transition network depending on the parameters of the trained model.
In the proposed approach, an outlier is defined as a transition suspected of not having been generated by the model, thus resulting from unexpected phenomena. Such transitions are uncovered by analyzing anomaly scores for each transition, obtained by computing the conditional probabilities for each window and representing the likelihood of each time slice. Likewise, whole subjects are scored to expose anomalous MTS; a subject's score is obtained by averaging all of its transition scores, which, with prudence, are used to classify each series. The developed approach is thus adaptable to several types of anomalies according to the analyst's needs.
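The sliding-window scoring just described can be sketched as follows. This is an illustrative Python reduction, not the thesis implementation: the DBN factorization over each node's parents is abstracted into a single `cond_prob` callable, and a hypothetical toy model stands in for a learned tDBN.

```python
import math

def transition_scores(mts, cond_prob):
    """Slide a first-order window over a symbolic MTS and score each
    transition by its negative log-likelihood under the model.
    `mts` is a list of time slices, each a tuple of attribute symbols;
    `cond_prob(curr, prev)` plays the role of the DBN's product of
    conditional probabilities for one transition window."""
    scores = []
    for t in range(1, len(mts)):
        p = cond_prob(mts[t], mts[t - 1])
        scores.append(-math.log(p) if p > 0 else float("inf"))
    return scores

def subject_score(scores):
    """A whole subject is scored by averaging its transition scores."""
    return sum(scores) / len(scores)

# Hypothetical toy model: repeating the previous slice is likely.
toy = lambda curr, prev: 0.9 if curr == prev else 0.1
mts = [("a",), ("a",), ("b",), ("b",)]
scores = transition_scores(mts, toy)
print(scores)  # the middle transition gets the highest (most outlying) score
```

Higher scores correspond to less likely transitions, so the score stream feeds directly into the score-analysis stage described next.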
Mechanisms are then employed to classify each score into one of two classes, abnormal or normal. Two main strategies are studied: Tukey's method [15, 16] and Gaussian mixture models (GMM) [17]. A threshold is automatically selected to determine the disclosure boundary, in addition to allowing manual selection.
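As a sketch of the first strategy, Tukey's fences can be applied directly to transition or subject scores. This minimal Python version (function names are illustrative, not from the thesis software) flags only the upper tail, assuming higher scores mean lower likelihood under the model.

```python
def tukey_outliers(scores, k=1.5):
    """Return indices of scores above the upper Tukey fence Q3 + k*IQR."""
    s = sorted(scores)
    def quantile(q):
        # simple linear-interpolation quantile over the sorted scores
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    fence = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(scores) if v > fence]

print(tukey_outliers([0.1, 0.2, 0.15, 0.12, 5.0]))  # → [4]
```

The constant `k` plays the role of the manual sensitivity adjustment: larger values widen the fence and disclose fewer outliers.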
The examined concepts are validated on simulated data generated by DBN mixtures with known anomalies, assuring a correct performance measurement. The results are compared with a designed approach that adapts PSTs to multivariate data. Real-world datasets available in the existing literature are then tested and discussed, justifying the employment of the system in common temporal multivariate problems.
The current system is built in response to the increasing demand for data science tools that aspire not only to promptness but also to generalization and easy adaptability. The main goal is to grant a completely free and accessible algorithm tackling mainly MTS problems. Such algorithms are typically not found in the existing literature, which is concerned mostly with univariate time series [18, 19] and demands additional examination and background unsought by analysts. With this in mind, the proposed approach considers the data's peculiarities from its origin to the final classification phase, bringing great burden relief. Data formatting and pre-processing, as well as upfront model training, are made available. The software is freely accessible online [3], along with source code that can easily be adapted to each application's needs. A web application is additionally available for use without requiring any download or effort from the analyst, along with a tutorial to ensure immediate usage. A complete analysis of the whole procedure with user-friendly graphical support is available. All experiments undertaken in the current work can be reproduced in the built software. The software uses the R and Java programming languages, together with the Shiny [20] and rJava [21] packages.
1.2 Contributions
Every step taken throughout the thesis is simulated and verified, the final goal being the validation of the outlier detection method in typical real-world scenarios. The main aims are to:
• Provide insight into existing anomaly detection techniques while reviewing existing knowledge regarding dynamic Bayesian networks and their modeling.
• Apply discretization and dimensionality reduction to TS datasets prior to performing anomaly detection.
• Model a DBN representing a normality standard for MTS datasets.
• Design a subject and transition scoring system based on a DBN sliding window approach in conjunction
with score analysis to differentiate normality from abnormality.
• Measure the performance of the proposed approach through simulated and real experimentation while contrasting with existing techniques.
• Design and publish a free web application, available to any analyst, granting a complete MTS anomaly detection system together with a tutorial. Future work suggestions as well as advice for each specific application are offered.
• Prepare a paper based on the current thesis for submission in a journal.
1.3 Organization
A revision of the state of the art and background knowledge is given in Chapter 2, which offers an extensive overview of outlier detection systems as well as expertise closely related to the present work.
In Chapter 3, background on Bayesian networks as well as Markovian models is given. The modeling of a DBN is analyzed, along with its assumptions and a comparison with the existing literature.
The proposed outlier detection mechanism is discussed in Chapter 4, which addresses topics ranging from data pre-processing to post-score analysis. The supporting software is also described.
In Chapter 5, validation and discussion of results on synthetic as well as real-world datasets are presented. Furthermore, a comparison between the proposed approach and a designed multivariate PST mechanism is undertaken.
Conclusions and future work suggestions are subject of discussion in Chapter 6.
Chapter 2
Outlier detection
Many researchers around the globe are currently enriching the field of data science. Outlier detection and machine learning techniques have become a playground for testing and simulation across a wide range of fields, from industry to computer science. This chapter offers an overview of the current existing knowledge concerning outlier detection in its entirety. Methods analogous to the problems addressed here are given a more in-depth analysis. Nevertheless, insight is provided on how to perform anomaly detection in any research field. In the context of the studied and developed methods, a group of fully and partially developed techniques is mentioned, as well as research studies. A list of documents and projects was selected from a large consulted bibliography. By assimilating existing knowledge, new ideas for the development of the present study are taken into consideration.
2.1 Overview
Defining an outlier has been a challenge for many years, being highly dependent on the context, area, and type of data in each situation. In the existing literature, authors label outliers as novelties, anomalies, deviations or exceptions according to their circumstances. Outliers are sometimes referred to as data instances that can be considered either abnormal or noisy, where anomalies, for example, are seen as a special kind of outlier carrying interesting information for an analyst [22]. In the present work, the term outlier is used to agglomerate the characteristics of the several interpretations in the literature; other terms, such as those previously mentioned, are used when deemed appropriate. When considering a statistical, mathematical or even an informal point of view, the idea of an outlier can change abruptly. Outliers have been defined by Grubbs [23] as “outlying observations, or outliers, that appear to deviate markedly from other members of the sample in which they occur”, while Hawkins [24] described them as “observations that deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism”. Although considerable subjectivity arises when interpreting outliers, the importance of detecting them is always praised and sought after. Outliers are thus considered to be subsets of data that are of interest to the context in question, hence requiring additional examination. Often outliers are not necessarily incorrect observations, but rather observations hiding valuable information. These frequently possess intelligence regarding the characteristics of the systems and entities responsible for the data generation
process. In medical diagnosis applications, for example, data collected from patient scanning and monitoring reveals unusual patterns that typically reflect undesirable medical conditions, but also interesting clinical cases worth exploring. An example is the determination of long- or short-time survivors in survival data [1].
2.1.1 Outlier and Data Types
Before discussing the several categories of outliers, data types are reviewed with the goal of highlighting their importance when performing anomaly detection. For nearest-neighbor-based techniques, for example, the nature of the input data determines the distance measure to be applied. Such techniques are detailed further on.
2.1.1.1 Data Types
The nature of the input data is of extreme importance. Data instances can be objects, observations, samples or patterns, among others. Data can consist of only one attribute (univariate) or multiple attributes (multivariate). Attributes can be of different types, such as binary, categorical or continuous, and can also be a mixture of several types. Input data can further be categorized by the relationships among data instances. Spatial and graph data illustrate the often needed relationship between data instances, in this case neighborhood dependencies. In network or graph data, data values are represented by nodes and relationships among them by edges in the network. Outliers in network data are modeled through unusual relationships or an abnormal structure in the graph. Other kinds of data, such as multi-dimensional streams or discrete series, require dedicated methods for their analysis. The relative placement between data instances is of great importance when considering discrete sequences, and it does not need to be temporal: in genome string representations, for instance, the order of nucleotides within DNA molecules is crucial although not temporal. Prediction-based techniques can be used to estimate the values for specific positions in the sequence, and deviations from the forecasted data are seen as contextual outliers.
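A minimal prediction-based scheme of this kind can be sketched with a first-order Markov model trained by counting (all names and the toy sequences below are illustrative only; a PST generalizes this idea to variable-length contexts):

```python
from collections import defaultdict

def train_markov(sequences):
    """Estimate P(next | current) from symbol sequences by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def surprise(model, seq, floor=1e-6):
    """Pair each position with the probability the model gave the symbol
    observed there; low-probability positions are contextual-outlier
    candidates. Unseen transitions get a small floor probability."""
    return [(i + 1, model.get(a, {}).get(b, floor))
            for i, (a, b) in enumerate(zip(seq, seq[1:]))]

model = train_markov(["ACGTACGT", "ACGTACGA"])
for pos, p in surprise(model, "ACGTTCGT"):
    if p < 0.5:
        print(pos, p)  # flags the positions around the unexpected T
```

The symbols themselves ("T" here) are perfectly ordinary; only their placement relative to the preceding symbol makes them suspicious.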
Focusing on time-series data, which is normally formed by continuous measurements over time, such data often contains contextual and collective anomalies over related time-stamps. Temporal context is thus of great importance when handling time-series data. Anomalies are usually associated with abrupt changes in values or trends, meaning that data instances are classified as anomalies not for their value but rather for the change they induce in the data considering previous time-stamps. Looking at a temperature measurement time-series, for example, if a measurement of 35 °C comes right after a measurement of 5 °C, it will probably be detected as an anomaly due to the abrupt change and not because of the values themselves. Data trends that slowly change over time require a much more careful analysis over a longer period of time.
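The temperature example can be captured by the simplest possible change detector, flagging jumps between consecutive measurements rather than the values themselves. This is a hedged illustration of the idea, not a method proposed later in this work, and the threshold is an arbitrary assumption.

```python
def abrupt_changes(series, threshold):
    """Flag time stamps whose jump from the previous value exceeds the
    threshold: the value itself may be unremarkable, and it is the change
    in temporal context that is anomalous."""
    return [t for t in range(1, len(series))
            if abs(series[t] - series[t - 1]) > threshold]

temps = [4.0, 5.0, 5.5, 35.0, 34.0]  # the 35 °C reading follows 5.5 °C
print(abrupt_changes(temps, threshold=10.0))  # → [3]
```

Note that the slow-trend case mentioned above defeats this detector by design, since every individual step stays below the threshold.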
2.1.1.2 Outlier Types
The employed anomaly detection strategies adjust depending on the variety of outliers to be detected and on their contextual dependencies. The objective can be to discover unusual data points with abnormal values or unusual patterns of change. The latter may be, for example, an uncommon ECG pattern representative of a specific disease [2]. The various types of outliers [25–27] are thus:
• Point – being the simplest and most intuitive type of anomaly, point outliers are data instances that are
considered to be anomalous with respect to the rest of the data due to their value.
• Collective – when a group of related data instances are considered anomalous with respect to the rest of
the data, it is considered a collective anomaly. Note that each individual data instance of that collection
may be valid, yet their agglomeration is not. In clustering scenarios, for example, an unexpected cluster
may be classified as anomalous even if each individual instance is valid, since those specific observations
are not expected to be grouped. Collective anomalies can also be considered as contextual outliers if their
classification depends on specific contexts.
• Contextual – considering a certain context specified in the problem's formulation, data instances can be classified as contextual outliers when present in specific scenarios. Point outliers, on the other hand, are always classified as such, independent of context. For a simple example, consider the average temperature of a country: a data instance indicating a negative temperature in the middle of winter is probably not classified as unusual, but the same value appearing during summer would be. Detecting such contextual anomalies requires more complex problem formulations than simple point anomalies.
• Pattern frequency – sequences which contain patterns with an anomalous frequency of appearance are
also considered outliers. The data instances themselves may not be abnormal, but the frequency or pattern in
which they occur can be labeled as such. A simple example, common in medical applications, is the detection
of unusual patterns in a patient’s measurements which may indicate the presence of specific illnesses.
Distinguishing normal from abnormal data is far from being a straightforward puzzle. Sometimes, unusual
data does not respect the categorization described earlier, making the procedure of outlier detection even more
challenging. There are several phenomena which can occur, especially in datasets with a considerable amount of
outliers. The first is named Masking, which happens when an outlier masks a second outlier, the latter thus being
invisible to the anomaly detection procedure. This means that the second anomaly can only be detected in the
absence of the first one. Another interesting effect is the so-called Swamping. An outlier is said to be swamping
a second data instance if the latter can only be classified as an anomaly when in the presence of the first, and
not by itself. Such phenomena have been the target of research and are of great importance in order to avoid incorrect
labeling caused by the outlier detection system itself. Stamping specific observations as outliers may lead to other data
instances being classified as such in forthcoming iterations, resulting in the “self-destruction” of the algorithm.
2.1.2 Outlier detection strategies
Most outlier detection techniques build some kind of model to represent the data’s normal patterns. The choice of the
appropriate data model is one of the most crucial phases in outlier detection. An incorrect choice may lead to
poor results. For example, if the underlying data is not clustered by nature, a cluster-based model will probably
work poorly. In practice, it is the analyst that dictates the model relevant to an application, requiring a good
understanding of the data itself.
The type of strategy to employ in outlier detection should conform to the kind of data and domain
in question. Some important data characteristics are, for example, its dimensionality and prior knowledge. If, in a
specific scenario, labeled data is already available, or outliers are known to be subsequences of data instead of single
observation points, then the selection of an anomaly detection technique is greatly influenced.
The interpretability of the employed model is also significant. Analysts are concerned not only with the detec-
tion of anomalies but also the reason why they are considered as such. This provides crucial information for the
understanding and hopefully eradication of the phenomena causing the anomalies. Methods that require complex
data transformations, although frequently effective in isolating anomalies, are rather painful to interpret. There is
then a trade-off between interpretability and complexity which is many times overlooked.
The outlier detection problem is analogous to the classification problem with two class labels, “normal” and
“abnormal”. Much of the knowledge from classification methods generalize to anomaly detection. The three main
approaches for the problem of outlier detection [25, 28] are thus:
• Type 1: Unsupervised – analogous to unsupervised clustering, outliers are determined with no prior knowl-
edge of the data. It is assumed that outliers can be separated from the rest of the data. These can be
highlighted by some diagnostic approach or incorporated in a robust model for later classification. The data is
assumed to follow a static distribution.
• Type 2: Supervised – analogous to supervised classification, outlier determination requires pre-labeled data
already classified as normal or abnormal. Such techniques model both normality and abnormality. New
data should be classified correctly if an adequate generalization was obtained in the training phase. Pre-
labeled data should cover as many situations as possible. If the classifier did not observe some anomalous
patterns in the training phase, these could be misclassified when testing. Classifiers are normally adapted
to static data, meaning that distribution shifts enforce a reset on the training phase, although techniques like
evolutionary neural networks can endure dynamism.
• Type 3: Semi-supervised – analogous to semi-supervised recognition, outlier classification requires only
one of the classes (normal or abnormal) to be taught. Nevertheless the opposite class is learned by the
algorithm, even when the pre-labeled data only marked one of the classes. These techniques are more
adequate for dynamic data, typically modeling normality. This means that in the case of online algorithms,
the complete state or prior information of anomalous instances is many times non-existent, making it difficult
to adapt a supervised system in real time. For semi-supervised methods, the full domain of normality is
required for good generalization, abnormal data not being required, in opposition to type 2 methods.
In many cases, unsupervised methods are used for noise removal in order to clean data sets for the later use of
supervised methods. The latter are designed specifically for the attributes and relationships in the underlying data.
Most anomaly detection techniques have either a score or a label output. Scoring methods assign to each
occurrence an anomaly score quantifying its degree of abnormality, also known as outlierness level. Most of these
approaches output a ranked list of instances sorted by their anomaly score. Instances with a higher score normally
indicate a higher probability of being outliers. The list can then be cut off with a specific threshold to select
only the most interesting observations. Techniques that rely on labels assign occurrences to one of two categories,
normal or anomalous, lacking a degree of irregularity for comparison between test instances. Labels are normally
binary containing less information than a scoring output. However, in the end, most of the scoring techniques
convert their results to binary labels allowing the reach of a final decision regarding abnormality.
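The score-to-label conversion described above can be sketched in a few lines. The helpers below are illustrative (names and the example threshold are assumptions, not the thesis's code): one cuts a score list at a threshold, the other returns the top-k ranked instances.

```python
def scores_to_labels(scores, threshold):
    """Convert anomaly scores to binary labels: True marks an outlier.

    Sketch of the usual final step: cut the scored list at a threshold.
    """
    return [s >= threshold for s in scores]

def top_k_outliers(scores, k):
    """Indices of the k highest-scoring instances, most anomalous first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.1, 0.9, 0.4, 0.95, 0.2]
print(scores_to_labels(scores, 0.5))   # binary labels lose the ranking
print(top_k_outliers(scores, 2))       # the ranked output keeps it
```

The ranked form carries more information, as noted above; the binary form is what a final decision ultimately needs.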
The proposed approach is unsupervised and offers score and label outputs simultaneously, the latter being
derived from score analysis.
2.1.3 Challenges in Outlier Detection
At first glance outlier detection may seem simple, since it can be formulated as the determination of a region
representing the normal behavior of the data in question. Any observations that are not present inside that region
are labeled as anomalies. We quickly realize, however, the breadth of problems associated with such a task initially
perceived as trivial [25–29]. These regions are normally chosen on an ad hoc basis according to the application’s specific
characteristics. With the aim of detecting hidden and interesting information amidst the available data, the main
challenges encountered when detecting outliers are:
• Wide variety of techniques and models – it is difficult to choose and adapt an already existing method to a
specific problem, this being one of the steps researchers normally spend the most time on. Some models
are even infeasible for certain problem formulations.
• Data scale – data related problems tend to increase in size not only in the number of observations but
also on their complexity. This often leads to processing and resource limitations. Applications that handle
stream data are the most prone to such limitations, since these require fast processing as new data arrives
continuously.
• Boundaries – boundaries between normal and abnormal behaviours are many times not precise, being very
difficult to correctly classify data which lies near them.
• Malicious actions – outliers are often blended with normal data. This worsens if the anomalies are inten-
tional, caused by malicious agents [30]. Such anomalies are normally very similar to normal instances,
making them very challenging to distinguish.
• Continuous evolution – the notion of normal behavior continuously changes in many domains, meaning
that data classified as abnormal today may not be labeled as such tomorrow.
• Notion of anomaly – different research fields have different notions of anomalies. Applying the same
technique in one domain does not necessarily mean it should be applied in other domains.
• Availability of labeled data – one of the major issues in outlier detection is the lack of labeled or training
data, mainly in supervised methods. The validation of the results obtained is thus difficult and requires
in-depth knowledge of the data and research field in question.
• Data pre-processing – before applying outlier detection techniques, data requires normalization and/or
filtering in order to remove complications that deteriorate the performance of the applied methods. Time-
series data, for example, is often desired to have constant spacing between observations with specific time-
stamps.
• Reproducibility – some techniques require initialization which, when random, can significantly change the
results obtained. Models which are very sensitive to parameter changes are also prone to greatly influence
the results obtained even when the data and methods are the same.
• Validation – the performance and quality of the obtained anomaly classifications greatly depend on the do-
main and data in hand, being often not advisable to compare techniques in general but only on the performed
tests. A method performing better than another for specific data-sets does not mean this paradigm is valid
for every situation.
• Overfitting – sometimes the simplest approaches are the most fortunate, since these do not adapt exces-
sively to data peculiarities. For supervised techniques, an overfit model will most probably classify normal
instances as abnormal if these did not appear regularly in the training data. A good approach for reducing
overfitting is to repeatedly partition the data into training and testing sets in a random way.
It is to note that in many cases the separation between noise and anomalies is not well defined, resulting in
many data points created by noise being considered interesting due to their high deviance from normal behavior.
Although these instances are desired to be detected and eliminated, they turn out to carry uninteresting information,
wasting time and effort when examining them.
2.2 Outlier Detection Methods
Without further ado, a set from a wide range of outlier detection methods [26, 27] is considered with the aim of
providing an insight into the problem in hand and the world of outlier detection as a whole. These are organized
according to their specifications. A more in-depth focus is placed on temporal data [25] and sequence [31]
anomaly detection, allowing the assimilation of the required basics before proceeding. A possible way to classify
outlier detection techniques [25, 27] is:
• Regression-based – Regression methods are extensively used in TS data. They fit equations or models
to data sets, considering outliers as instances with high residuals when compared to normal data instances. An
example are AutoRegressive (AR) models [22, 32], which state that each data point linearly depends on previous
values and on a stochastic term representing a random process. AR models can be enhanced and made
more robust by combining them with other techniques. By additionally modeling a moving average of past
error terms, autoregressive moving-average (ARMA) [33] models are built. To handle non-stationary data,
where trends make older observations less representative for the estimation of new data, Autoregressive
Integrated Moving-Average (ARIMA) models [34, 35] add a differencing step.
• Distance-based – In distance-based techniques, the distance or similarity between data instances is computed.
Outliers are believed to be isolated from the rest of the data. A k-th nearest neighbor (KNN) technique
has the objective of determining, for each test instance, the distance to its k-th nearest neighbor in the train-
ing set, these distances being used as anomaly scores [27]. The same reasoning can be applied to discrete sequences
[26, 36, 37] using the length of the longest common subsequence as a measuring distance. Some of these
methods also determine the density of the neighborhood of each data instance. One example is the Local
Outlier Factor (LOF) algorithm [38], which compares the local density of the test sample with that of its k nearest
neighbors. Its use in an outlier detection scenario is discussed in [39].
• Clustering/Classification-based – Classification and clustering methods [40] provide additional insight on
outlier detection problems. Note that these types of methods are not equal, even though they are associated
by nature. Classification techniques are supervised, assuming that training data is already labeled. A model is
learned with the aim of classifying test instances into one of the already known classes. Clustering methods,
on the other hand, have the goal of designing the classes themselves. These are unsupervised. Classification
methods can be multi-class or one-class when employed in an outlier detection environment. Multi-class
algorithms try to classify test instances into one of the already present normal classes. If no class yields
sufficient confidence for a certain test instance, it is classified as anomalous. One-class methods
normally learn a boundary or region where all normal data is contained. If a certain test instance does
not fit inside that region, it is classified as anomalous. An example are one-class support vector machines
(SVM) [41], which learn a region containing the training data. A boundary is computed around the region of
normal behavior, and each data instance is evaluated according to its position regarding the boundary. The
aforementioned have proven effective at detecting anomalies in several applications [42, 43]. Note that SVMs only
handle vectors, meaning that to apply the same procedures to TS data a conversion is needed [41].
One type of clustering technique is based on Expectation-Maximization (EM) [44] methods, capable of
finding parameter estimates which determine the distribution of the data. The latter is commonly used when
considering Gaussian Mixture Models (GMM) [17]. Clusters created using training data can then be used to
classify test instances, anomalous samples being expected to deviate from the clusters. A robust
EM algorithm used in outlier detection is established in [45]. GMMs are considered in the current study in
Sec. 4.4.2 and in existing literature [46].
• Pattern-based – By employing pattern-mining techniques, patterns in a sequence whose frequency of occur-
rence is anomalous are detected. Time series can be viewed as sequences, allowing the use of mechanisms
which, for example, measure the discrepancy between the frequency of occurrence of a context in a sequence
and that expected given the observed data. An example uses suffix trees [47], containing all the suffixes
of a given sequence, which allow the detection of abnormal subsequences within each time series. Pattern
frequency-based anomaly detection is beneficial in detecting interesting subsequences, with applications in
numerous domains [48].
• Probabilistic/Statistical-based – There is usually confusion when comparing probability theory and statis-
tics. These universes are inverses of one another: while probability theory tries to interpret problems consid-
ering randomness or uncertainty modeled by random variables, statistics observes something that happened
and tries to figure out what process explains those observations. With the aim of providing a clear un-
derstanding of the main statistical and probability methods in outlier detection, a brief overview of such
techniques is provided, mainly using TS data. As Persi Diaconis once said, “The really unusual day would
be one where nothing unusual happens” [49]. As a rule of thumb, statistical methods fit a statistical model
to data in order to shape normal behavior. Tests are then performed to determine if the probability of an in-
stance is high enough for it to be considered normal. A more in-depth analysis of such techniques is considered in
Chapter 3, the proposed approach being based on dynamic Bayesian networks, which have already proven to
benefit anomaly detection [10–13]. A technique based on Markov processes is likewise studied in Sec. 3.2.
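As a concrete sketch of the distance-based family listed above, the illustrative function below (not the thesis's implementation; names are assumptions) scores each test instance by the Euclidean distance to its k-th nearest neighbor in a training set, larger distances suggesting isolation.

```python
import math

def knn_scores(train, test, k):
    """Anomaly score of each test point: distance to its k-th nearest
    neighbour in the training set (a larger distance means more isolated)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scores = []
    for q in test:
        distances = sorted(dist(q, p) for p in train)
        scores.append(distances[k - 1])
    return scores

# A point inside a tight cluster scores low; a far-away point scores high.
train = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(knn_scores(train, [(0.5, 0.5), (5, 5)], k=2))
```

The naive loop above is quadratic in the number of points; practical implementations use spatial indexes, but the scoring idea is the same.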
The main approach studied in the current thesis resides in the probabilistic/statistical paradigm. It builds a
statistical model representing conditional dependencies between variables as well as across time.
Chapter 3
Statistical Modeling
Statistical methods attempt to explain a population, defined as a set of observations or subjects which are of
interest for some application. These methods fit a statistical model to data in order to shape normal behavior.
Instances are then tested against the model to measure their outlierness levels.
3.1 Data
The implemented application is designed to handle MTS. The latter must be discretized before being introduced
to the outlier detection apparatus, as considered in Section 4.1. Other types of data, like symbolic sequences,
can be loaded as well if in the right format. Missing values should be imputed or considered as an additional
symbol beforehand. This is discussed later, as are precautions regarding the sampling rates of each variable.
A univariate TS $X_i[t]$ is seen as a set of observations along a consistent time rate, represented as
$$X_i[t] = x_i[1], \ldots, x_i[T], \qquad i \in \mathbb{N}, \tag{3.1}$$
expressing variable i along T time stamps. For the case of MTS, each row of a data-set represents a subject
identified by its row index. Subjects are MTS with a common number of variables n and width T . Each column
depicts a certain variable at a certain time stamp. In other words, a subject Sj of length T and n variables is
represented as
$$S_j = x_1[1], \ldots, x_n[1], \ldots, x_1[T], \ldots, x_n[T], \qquad j \in \mathbb{N}, \tag{3.2}$$
where $x_i[t]$ depicts the discrete value of variable $i \in \{1, \ldots, n\}$ at time stamp $t \in \{1, \ldots, T\}$ of subject $S_j$ with row
index j. A subject is thus a combination of univariate TS, each represented by Eq. (3.1). Columns are sorted
according to time steps rather than variable indexes. Many applications require anomaly detection to be engaged on
TS originating from sensor devices. These are normally not discretized and can also reveal unwanted offsets. To
tackle such issues, pre-processing is employed prior to the outlier detection mechanism.
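The row layout of Eq. (3.2) can be produced with a small helper. The sketch below is illustrative (the function name is an assumption): it flattens n equal-length univariate series into a single subject row ordered by time stamp first and variable index second.

```python
def subject_row(series_by_variable):
    """Flatten n univariate series of common length T into one subject row
    ordered as in Eq. (3.2): x1[1], ..., xn[1], ..., x1[T], ..., xn[T]."""
    n = len(series_by_variable)
    T = len(series_by_variable[0])
    assert all(len(s) == T for s in series_by_variable), "equal length required"
    return [series_by_variable[i][t] for t in range(T) for i in range(n)]

# Two discretized variables observed over three time stamps.
x1 = ['a', 'b', 'a']
x2 = ['c', 'c', 'd']
print(subject_row([x1, x2]))  # time stamp varies slowest, variable fastest
```

Sorting columns by time step first keeps all variables of the same time stamp adjacent, which matches the dataset layout described above.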
3.2 Markovian Models
Markovian models are commonly used to predict randomly changing systems, making use of the Markov property.
In such a scenario, a system’s future states only depend on its current state and not on past events. Given the
present state, a system’s past and future are thus conditionally independent. The Markov property is often preferred
when performing probabilistic forecasting.
In outlier detection, Markov models are typically used for detecting sequence-based and contextual anomalies.
These use the conditional probability for each symbol in a data sequence conditioned on the symbols preceding it
[31]. The probability of observing each symbol is thus predicted using the probabilities of its preceding symbols. If the
expected behavior is significantly different from the observed, an anomaly is detected. Note that these methods are mostly
employed for discrete data sequences and TS data. However, any type of data including continuous sequences can
be discretized.
A set of training sequences is used for the computation of the conditional probabilities that will represent the
data-generating distribution. After learning the model’s parameters, testing involves determining the likelihood
of test sequences given the parameters. The first and most basic model is referred to as the Markov chain. A Markov
chain is a sequence of discrete-time random variables X1, . . . , Xn that respect the Markov property, in other
words, a “memory-less” stochastic process. Considering the possible states i1, . . . , in of the random
variables, the Markov property dictates that
$$P(X_n = i_n \mid X_{n-1} = i_{n-1}) = P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}), \tag{3.3}$$
meaning that the system only requires knowledge regarding the previous state to determine the probability distri-
bution of the current state. The probability of a sequence of states i1, . . . , in is computed as
$$P(i_1, \ldots, i_n) = P(X_1 = i_1) \prod_{j=2}^{n} P(X_j = i_j \mid X_{j-1} = i_{j-1}), \tag{3.4}$$
which corresponds to a “chain” computing the probability of each state ij given its previous state ij−1. The
probability of the initial state i1 is determined using a prior probability. A transition matrix can thus store a
Markov chain’s evolution through time, comprising each transition probability between states. The product of these
probabilities describes the system’s transitions along time.
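Eq. (3.4) translates directly into code. The sketch below is illustrative (prior and transition probabilities are stored in plain dictionaries, and the example numbers are made up): it multiplies the prior of the initial state by the chained transition probabilities, an unseen transition yielding probability zero.

```python
def chain_probability(states, prior, transition):
    """Probability of a state sequence under a first-order Markov chain,
    per Eq. (3.4): P(i1) multiplied by each P(i_j | i_{j-1}).

    prior: dict mapping state -> P(X1 = state)
    transition: dict mapping (prev, cur) -> P(cur | prev)
    """
    p = prior[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition.get((prev, cur), 0.0)  # unseen transition -> 0
    return p

prior = {'A': 0.6, 'B': 0.4}
trans = {('A', 'A'): 0.7, ('A', 'B'): 0.3, ('B', 'A'): 0.5, ('B', 'B'): 0.5}
# P(A) * P(B|A) * P(A|B) = 0.6 * 0.3 * 0.5
print(chain_probability(['A', 'B', 'A'], prior, trans))
```

In practice, low sequence probabilities can be flagged as anomalous, and smoothing is applied so that a single unseen transition does not annihilate the whole sequence's probability.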
A useful characteristic of discrete sequences is the short-memory property [50]. The latter dictates that the con-
ditional probabilities in Eq. (3.4) can be accurately approximated by a window of preceding symbols. For the case of a
discrete sequence a1 . . . an,
$$P(a_n \mid a_1 \ldots a_{n-1}) \approx P(a_n \mid a_{n-k} \ldots a_{n-1}), \tag{3.5}$$
for some value k > 1. A discrete sequence a1 . . . an can be encoded into a set of states according to the different
possible configurations it can assume. Each position ai, i ∈ {1, . . . , n}, contains a symbol σ ∈ Σ from an alphabet Σ. In
a fixed k-th order Markov model, each state corresponds to the final k symbols an−k . . . an−1 in the sequence being
modeled [22]. A transition is thus the addition of the element an at the end of the sequence. The system transitions
from state an−k . . . an−1 to state an−k+1 . . . an, the transition probability being P(an|an−k . . . an−1). Hence,
the probability of a sequence can be computed using Eq. (3.4), where windows of size (k+1) are retrieved from the
discrete sequence to determine the sequence of states it represents. A Markov model of order k declares that each
state corresponds to the subsequence of the final k symbols of the sequence being modeled, meaning that higher
values of k correspond to more stored memory. Note that this memory refers to the windows used to compute each
state and not to the memory between states. Hence, the model holds the Markov property.
In the present discussion, models are assumed to be fixed. Fixed Markovian techniques use windows of fixed
length k to estimate the necessary conditional probabilities. The conditional probability of occurrence of a given
symbol ai is estimated as
$$P(a_i \mid a_{i-k} \ldots a_{i-1}) = \frac{f(a_{i-k} \ldots a_i)}{f(a_{i-k} \ldots a_{i-1})} = \frac{N_{ij}}{N_i}, \tag{3.6}$$
where f(ai−k...ai) denotes the occurrence frequency of subsequence ai−k...ai in the set of extracted sequences,
Nij the number of observations in which the system was at state i at time t and at state j at time t + 1, and Ni
the total number of occurrences where the system was at state i at time t.
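The count-ratio estimation of Eq. (3.6) can be sketched as follows (an illustrative helper, not the thesis's code): windows of length k + 1 are counted against their length-k contexts over a set of training sequences.

```python
from collections import Counter

def estimate_conditionals(sequences, k):
    """Estimate P(a_i | a_{i-k}...a_{i-1}) by the count ratio of Eq. (3.6):
    occurrences of length-(k+1) windows over those of their length-k context."""
    num = Counter()  # counts of (context, next symbol) pairs
    den = Counter()  # counts of the context alone
    for seq in sequences:
        for i in range(k, len(seq)):
            ctx = tuple(seq[i - k:i])
            num[(ctx, seq[i])] += 1
            den[ctx] += 1
    return {key: num[key] / den[key[0]] for key in num}

# Over these training sequences, 'a' is always followed by 'b'.
probs = estimate_conditionals(["ababab", "abba"], k=1)
print(probs[(('a',), 'b')], probs[(('b',), 'a')])
```

Contexts with few occurrences yield unreliable ratios, which motivates the frequency thresholds and variable-order models discussed later in this chapter.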
A Markovian model can be visualized as a set of nodes representing each possible state and a set of edges
between nodes. Edges represent connections between nodes which have associated conditional probabilities. The
system transitions between two states through the corresponding edge. Markovian models are capable of encoding
TS data, granting a basis for normal behavior. The probability of occurrence of each transition can be examined as
an anomaly score.
A similar approach is used in [51] to develop an anomaly detection technique which represents the normal behavior of
sequence-based data in network and computer intrusions using a simple first-order Markov chain. The probabilities
of sequences of states are computed in order to detect anomalous actions. The likelihood of a sequence is seen as
an anomaly score according to its occurrences in the training set.
A Markov chain can be viewed as a Finite State Automaton (FSA), which represents the system as a machine
that can be in exactly one of a finite number of states at a given time. The system requires a list of states, its initial
state and the conditions for the transitions. A finite state machine is said to accept an input if it ends at an acceptance
state. FSAs are commonly used for predicting future events and detecting system call intrusions [6, 52].
3.2.1 Variable and sparse Markov models
Fixed Markovian techniques have been discussed. These force each symbol to be conditioned by a fixed number
of previous symbols. Sometimes it is not possible to reliably estimate the conditional probabilities when consider-
ing subsequences with a low frequency of occurrence. As mentioned in [26], consider a situation where a certain
subsequence S1 of size n only occurs once in the training data and is followed by the symbol j. If we employ
an nth-order model, the conditional probability P (j|S1) has a value of 1 even though the subsequence only occurred
once in the training dataset. Although legitimate, the estimated conditional probability relies solely on one observation,
which is probably insufficient to generalize to the whole dataset. To solve such issues, the size of the subsequence
S1 should be reduced. The smaller subsequence will probably appear more frequently in the training data. The
frequency of the subsequences is thus subject to a threshold in order to determine if the computed probabilities
are reliable.
Variable-order models can, for instance, adapt to the frequency of certain states in the data. States with very
low frequency of occurrence should be pruned and replaced with low-order generalizations. Such characteristics
are present in Probabilistic Suffix Trees (PST) [50]. Generalizations allow the reduction of the number of states
in the model and increase matching in the training phase. Sequential outliers in protein sequences are addressed
in [53] using PSTs. In the present thesis a PST approach is developed to handle MTS datasets, which is then
compared with the proposed approach through experimentation.
PSTs are an example of variable length Markovian techniques. A discrete TS is perceived as a sequence, also
called string, s = s1...sl, with maximal length l. Each position can assume a symbol σ ∈ Σ from
the TS’s alphabet Σ. Like in fixed Markovian models, a state si, 1 ≤ i ≤ l, is conditioned by its preceding
subsequence s1...si−1, denoted as context c = c1, ..., ck. In a variable length Markov chain, the size of the window
L comprising a state’s preceding symbols is particular to each context, rather than being fixed. The core concept
behind a PST is the short memory property [50]. For each context c = c1, ..., ck preceding a symbol σ ∈ Σ, there
exists a length L, 0 ≤ L ≤ k, such that
$$P(\sigma \mid c_1, \ldots, c_k) \approx P(\sigma \mid c_{k-L}, \ldots, c_k), \tag{3.7}$$
where 1 ≤ k ≤ l − 1 [54]. Thus, the length of the Markov chain is not fixed, with symbols being conditioned
on previous windows of different lengths depending on the context c. A PST training algorithm estimates
the length L for each context appearing in the training data, capturing the contexts c = c1, ..., ck of maximum
length. Conditional probabilities P (σ|c) using these contexts are compared with the ones obtained using their
longest suffix, P (σ|suf(c)), where suf(c) = c2, ..., ck. By using Eq. (3.7), conditional probabilities are compared
according to P (σ|c) ≈ P (σ|suf(c)). If the latter probabilities are similar, the PST solely stores P (σ|suf(c)),
thus reducing the number of parameters without losing too much information.
A PST is proficient in determining the probability P (s) associated with any sequence s, provided the latter is
formed by symbols σ ∈ Σ. A sequence’s probability is computed as
$$P(s_1 \ldots s_l) = P(s_1)\, P(s_2 \mid s_1) \cdots P(s_l \mid s_1 \ldots s_{l-1}), \tag{3.8}$$
where each state in the sequence is conditioned on the past observed states. Conditional probabilities are retrieved
efficiently from a tree structure. It is worth noting that unseen patterns can disrupt Eq. (3.8), since a null value of
any conditional probability causes the whole sequence to be impossible. To tackle such a matter, a procedure known
as smoothing is applied, removing null values from every probability. The latter is likewise applied in the proposed
approach, discussed in Sec. 4.3.1.
States are organized in a suffix tree, where each node has its longest suffix as a parent. Each leaf represents a
memory of the associated Markov chain [55]. A PST is a suffix tree with the addition that conditional probabilities
are stored in each node. Nodes are labeled by a sequence representing a path from that node to the tree’s root.
Each node contains the count of its associated sequence s as observed in a training dataset and the conditional
probabilities P (σ|c) of observing each symbol σ right after s, denoted next-symbol conditional probabilities. Edges
Figure 3.1: An example of a PST with maximum depth 3, highlighting consulted nodes when computing the probability of sequence abaab. Each node is represented as a container incorporating the next-symbol conditional probabilities for a (green) and b (grey) and the number of occurrences of their corresponding sequence in the dataset. Node e is the root of the PST. Edges are represented by a symbol σ ∈ {a, b}. Nodes that are marked with a red line are pruned. Adapted from [54].
represent symbols σ ∈ Σ, denoting the next character of the sequence. This means that each node has at most |Σ|
children. The root node possesses the prior probabilities P (σ) of each symbol.
With a dataset comprising multiple sequences, training consists in counting the occurrences of each pattern.
The number of occurrences decreases with depth, meaning that the computation of P (s3|s1s2) uses fewer observa-
tions than that of P (s2|s1), since the context of the latter is a suffix of that of the former. If the number of occurrences of a pattern is low, the
probability of a sequence containing it is affected. However, longer sequences are more accurately modeled since
these use more information, according to Eq. (3.8). The initial positions of any sequence are thus generally badly
estimated, due to lack of memory. There should thus be a trade-off in the complexity of a PST.
The size of the tree grows exponentially with memory length. The tree’s maximum depth and the alphabet
size |Σ| thus determine its size. An overly-complex PST suffers from overfitting and is
inappropriate when considering rarely observed data, meaning that for rare patterns the probability of observing a
certain symbol after their corresponding suffix is deceptively high. A procedure known as pruning is capable of
controlling the size of a tree, avoiding the aforementioned. Pruning can be done according to the frequency of the
nodes: if certain paths in the tree did not appear at least nmin times in the training data, the PST is allowed to
prune them, nmin being a threshold on the number of occurrences. Nevertheless, the main reasoning behind pruning
is to remove nodes which do not grant additional information with respect to their parent, meaning that their
conditional probabilities are similar for every symbol σ ∈ Σ. A gain function G1(c) [54] is capable of determining
the information gain between related nodes. A node represents an information gain for some symbol σ if
$$G_1(c) = \sum_{\sigma \in \Sigma} \left[ \frac{P_T(\sigma \mid c)}{P_T(\sigma \mid \mathrm{suf}(c))} \geq C \ \text{ or } \ \frac{P_T(\sigma \mid c)}{P_T(\sigma \mid \mathrm{suf}(c))} \leq \frac{1}{C} \right] \geq 1, \tag{3.9}$$
where [·] denotes the indicator function and C is a threshold on the ratio between the conditional distributions of the node and its parent.
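The spirit of the pruning criterion in Eq. (3.9) can be sketched as a simple check (illustrative only; distributions are stored as dictionaries and the function name is an assumption): a node is kept when, for at least one symbol, its conditional probability deviates from its parent's by a ratio beyond C or below 1/C.

```python
def keep_node(p_node, p_parent, C):
    """Decide whether a PST node adds information over its parent (the node
    of its longest suffix), in the spirit of the gain criterion of Eq. (3.9):
    keep the node if some symbol's probability ratio lies outside [1/C, C]."""
    for sym, p in p_node.items():
        q = p_parent.get(sym)
        if not q:
            continue  # skip symbols the parent never predicts
        ratio = p / q
        if ratio >= C or ratio <= 1.0 / C:
            return True
    return False

parent = {'a': 0.5, 'b': 0.5}
redundant = {'a': 0.52, 'b': 0.48}    # nearly identical to parent: prune
informative = {'a': 0.9, 'b': 0.1}    # strong deviation: keep
print(keep_node(redundant, parent, C=1.5), keep_node(informative, parent, C=1.5))
```

Nodes failing the check are pruned, since their longest suffix already predicts the next symbol almost as well.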
Consider a PST whose structure is shown in Fig. 3.1, modeling sequences from an alphabet Σ = {a, b}.
Numbers inside each node represent the number of occurrences of the associated sequence. The sequence of a node
is determined by the inverse of the path starting from the root node and finishing in that node, travelling along each
edge represented by a symbol σ ∈ {a, b}. Observing Fig. 3.1, the node highlighted in orange describes sequence ba,
storing the conditional probabilities P(σ|ba). Suppose the probability of sequence s = abaab is requested.
According to Eq. (3.8), P(abaab) = P(a)P(b|a)P(a|ab)P(a|aba)P(b|abaa), with each conditional probability
retrieved from the nodes highlighted in blue and orange in Fig. 3.1. Note that the value of P(a|aba) is fetched from
node ba instead of node aba, since the latter was pruned; the node of its longest suffix is thus used.
In a PST, outliers are defined as patterns which appear rarely in comparison with the frequent patterns the tree has learned.
A sequence s is scored as

$$\mathrm{logloss}(s) = \frac{1}{l} \sum_{i=1}^{l} \log_2 P(s_i \mid s_1 \ldots s_{i-1}) = \frac{1}{l} \log_2 P(s), \qquad (3.10)$$

where l is the length of s, representing the average log-loss of the sequence [54]. The latter depicts the residual between the prediction of the
tree and a perfect prediction.
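As a minimal sketch of this scoring step (toy numbers, not the thesis model), a PST can be stored as a dictionary of contexts, with the longest-suffix fallback used for pruned contexts as in the P(a|aba) example above:

```python
import math

# Sketch of PST scoring per Eq. (3.10), with longest-suffix fallback for
# contexts whose nodes were pruned. Tree and probabilities are illustrative.

tree = {
    "": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.3, "b": 0.7},
    "b": {"a": 0.8, "b": 0.2},
    "ba": {"a": 0.6, "b": 0.4},
}

def cond_prob(tree, context, symbol):
    # Fall back to the longest suffix of the context present in the tree.
    while context not in tree:
        context = context[1:]
    return tree[context][symbol]

def avg_logloss(tree, s):
    # Eq. (3.10): more negative = more anomalous under this model.
    total = 0.0
    for i, sym in enumerate(s):
        total += math.log2(cond_prob(tree, s[:i], sym))
    return total / len(s)

print(round(avg_logloss(tree, "abaab"), 3))
```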
PSTs are thus capable of capturing high-order dependencies in sequences while remaining simple to estimate,
and have an established literature in pattern detection [47, 56]. However, they only handle univariate
TS. Due to the lack of freely available MTS outlier detection systems, in the present work a multivariate PST
approach is built with the help of an existing modeling algorithm [54] and compared with the proposed approach.
This baseline, detailed in Sec. 5.1.2, is based on the modeling of multiple PSTs, one for each variable. The
objective is to demonstrate the importance of inter-variable relations in outlier detection and establish a basis for
comparison.
Methods like Interpolated Markov Models (IMM) can also be used to compute the variable-length conditional
probabilities of a symbol. In biology, such models are used to find genes in DNA sequences [57].
IMMs can additionally be used to detect frequency-based anomalies [48].
Variable-order Markovian techniques, although allowing different symbols to be conditioned on previous windows
of different lengths, still require those windows to be contiguous and immediately preceding the symbols. In other words, the
conditioning sequences cannot be sparse. A further variety of Markovian techniques addresses sparse histories in their
conditional probability estimations: particular positions in the window may be filled by any symbol, turning
the windows sparse.
One sparse Markovian technique, known as Sparse Markov Trees (SMT) [58], estimates the probability distribution
conditioned on input sequences. SMTs are very similar to PSTs except that they allow the introduction
of wild cards, being known as a generalization of PSTs. A sparse Markovian method worth mentioning is RIPPER
[59], which can also be seen as a classification algorithm. It builds sparse models from the extraction of
k-length windows at a training stage, the objective being to predict the k-th term according to the previous (k − 1) sym-
bols. This strategy is used in [60] for detecting intrusions in Unix systems: regions of the test sequences
are classified as abnormal if a certain percentage of failed predictions is reached in those particular regions. As
evaluated in [31], RIPPER is more flexible than FSAs when conditioning the probability of events on
preceding symbols, as it smooths the likelihood score of rarely seen patterns. Sparse and variable-order mod-
els can be greatly effective in some applications, offering better flexibility in terms of the size of context used.
Sparse Markov models are often used when handling DNA sequences and protein data [58, 61] due to their
performance in such environments. We recall, however, that different studies arrive at different conclusions:
there is no perfect model, and fixed-order Markovian techniques often outperform their variants.
3.2.2 Hidden Markov models
Hidden Markov models (HMM) are similar to the previously described Markovian techniques: they generate
sequences of transitions among states in a Markov chain. The difference is that in a Markov chain the transition
sequences are directly visible, whereas in an HMM the precise state sequence through which the model passes is unknown.
States can be defined in the modeling stage by an analyst with a clear understanding of the context specific to
the system, according to observations considered relevant. Although the states can
be formulated, there is no direct knowledge about the sequence of transitions required to reach them, and they are thus
denoted as hidden. A sequence of transitions corresponds to an observed data sequence. Each state carries a set
of emission probabilities, one for each symbol; when a state is reached, one of those symbols is emitted. This symbol
generation depends on previous history, meaning that successive states are not independent of each
other. HMMs are commonly used to capture temporal correlations in data sequences.
Figure 3.2: Three-state hidden Markov model.
Observing the example in Fig. 3.2, the model has three states A, B and C, each of which can emit
one of two symbols, S1 or S2. The emission probabilities are shown next to each state, and there are also probabilities
for each transition between states. When the system is in state A, it has a 20% probability of emitting symbol S1
and 80% of emitting symbol S2, while at the same time having a certain probability of changing state. Probabilities are
estimated from the observations of a training set; although these are known, the exact state of the system
at a given instant is not.
Large training sets are required in order to avoid overfitting. Without prior knowledge about the system's
generating process, a model with n states possesses n2 transitions, as every state has a transition to every other state.
The analyst is advised to understand the system's behavior in order to eliminate transitions; domain knowledge is
hence valuable.
HMMs are composed of a hidden transition matrix and an observation matrix, with a symbol emission distri-
bution for each state. Three main procedures exist. The training phase estimates the model's initial, transition and state
emission probabilities using training sequences; an Expectation Maximization (EM) approach can be employed
[62]. In the testing phase, the fit probability of each test sequence under the HMM is determined, and anomaly scores are
associated to each sequence. In the final phase, explanation, the most probable sequence of states that generated
each test sequence is uncovered, giving an understanding and explanation for every classified anomaly. For ad-
ditional details consider [62]. An HMM is thus capable of not only learning a model that best describes the system's
behavior, but also of determining the probability of each test sequence and additionally explaining it.
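As a toy numeric illustration of the testing phase, the sketch below computes the fit probability of a short symbol sequence with the forward algorithm. Only state A's emission probabilities (20%/80%) come from the example above; the initial, transition and remaining emission probabilities are invented for illustration.

```python
# Minimal forward-algorithm sketch for scoring a test sequence under an HMM
# like the one in Fig. 3.2. All numbers except A's emissions are assumed.

states = ["A", "B", "C"]
init = {"A": 1.0, "B": 0.0, "C": 0.0}            # assumed: start in A
trans = {"A": {"A": 0.5, "B": 0.3, "C": 0.2},    # assumed transition matrix
         "B": {"A": 0.2, "B": 0.6, "C": 0.2},
         "C": {"A": 0.1, "B": 0.3, "C": 0.6}}
emit = {"A": {"S1": 0.2, "S2": 0.8},             # from the text
        "B": {"S1": 0.7, "S2": 0.3},             # assumed
        "C": {"S1": 0.5, "S2": 0.5}}             # assumed

def forward(obs):
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())   # total sequence likelihood

print(forward(["S2", "S1", "S1"]))
```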
Note that the methods in the evaluation and explanation stages require a considerable amount of time due to the
recursive nature of the algorithms, so complex models are avoided when possible. Although the computational
complexity and the sensitivity to parameter selection are seen as disadvantages of HMMs, they are widely
used in the existing literature with favorable results, typically in the detection of malicious activity [6, 7]. A PST
approach for protein sequences is compared with existing HMM literature in [63].
3.3 Bayesian models
A Bayesian network (BN) is a probabilistic graphical model which encodes conditional relationships among
variables. It is composed of two components: a directed acyclic graph (DAG) encoding dependencies between
random variables, and a set of parameters defining the local conditional distributions of the attributes.
Random variables are assumed to be discrete with a finite domain. The main idea behind employing these
techniques in outlier detection is the posterior probability estimation of observing a class label given
evidence. Inference computes the probability of variables given the knowledge encoded in the observed vari-
ables. Models are obtained from learning algorithms together with expert knowledge particular to each problem.
The class label with the highest posterior probability is assigned to each instance; if an anomalous class is
selected instead of any of the normal classes, the instance is classified as abnormal. BNs are also commonly used
for clustering and classification purposes.
Learning the structure of a BN [64] can be summarized as finding the DAG which may have generated the data.
A training set allows the network to estimate the likelihood of an instance belonging to each class as well as each
class prior probability. Consider random variables X = (X1, . . . , Xn), assumed discrete with a finite domain.
The DAG represents the set of variables (nodes) and their conditional dependencies (edges). Nodes that are not
connected represent variables that are conditionally independent: two random variables X and Y are conditionally
independent given a third random variable Z, written X ⊥ Y | Z, if and only if P(X, Y |Z) = P(X|Z)P(Y |Z). In other
words, knowing the value of Z, the value of X provides no information regarding the value of Y, and
vice versa. Edges linking nodes indicate that one variable directly influences the
other. A node Xp is said to be a parent of node Xs if there is an edge directed from Xp to Xs. Each node
is associated with a function which receives the node's parent variables and outputs the probability of the variable
expressed in that node. This local probability distribution stores a set of parameters encoding the probability of each
possible configuration of a node Xi given its parents pa(Xi),
P (Xi = xik|pa(Xi) = mij), (3.11)
where xik is the k-th possible value from the domain of Xi and mij the j-th possible configuration of the set of
variables pa(Xi). The network's global probability distribution is the joint probability distribution com-
posed of the several local probability distributions associated with each variable. By invoking the Markov property,
the global probability distribution becomes

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i)), \qquad (3.12)$$
where n is the total number of variables. The joint probability depicted in Eq. (3.12) can be used to compute
the probability of an evidence set. The set of conditional probabilities associated with each node constitutes the
network's parameters. The local distributions provide the prior probabilities for attributes with no parents and
conditional probability tables for nodes with parents. A variable Xi is independent of all its non-descendant nodes
given its parents pa(Xi). Since BNs are probabilistic models, the laws of probability can be invoked, notably Bayes'
theorem

$$P(Q \mid e) = \frac{P(e \mid Q) \, P(Q)}{P(e)}, \qquad (3.13)$$

where P(Q|e) represents the posterior probability of a distribution Q, P(e|Q) the likelihood of Q given known
evidence e, and P(Q) the prior probability of Q. The probability of the evidence, P(e), normalizes
the result so that it sums to 1. Bayes' theorem in Eq. (3.13) allows probabilities to be updated according to new evidence.
For the present work, the outlier detection scenario can be seen as a classification problem between two classes,
normal and anomalous. Each attribute is thus evaluated according to its context using a trained BN; if a cer-
tain criterion is not met, the attribute is considered anomalous. A typical practice is the employment of scoring
functions.
Multivariate categorical data can be used in such networks by aggregating the posterior probabilities of each
test instance in order to assign a class label.
3.3.1 Bayesian network outlier detection
Several Bayesian network-based techniques have been considered for outlier detection in a wide range of applica-
tions and have instigated numerous research works [65, 66]. A disease outbreak detection system [67] used
a BN to model a baseline distribution over health-care data; monitored data was then compared with the baseline
for anomaly prospecting. Data collected from the previous 24 hours allowed for an adaptable BN, which was remod-
eled daily. The joint probability of the data was used, conditioning on attributes such as season, day of the week
and current flu level. A high flu level may not indicate an outbreak, since it is typically normal in rainy seasons,
hence the relevance of adopting a Bayesian approach. Rules engaging a scoring function determined the degree of
similarity of a subset of data with respect to the baseline. The highest score of the day had its p-value computed
and then compared to a threshold for the final decision.
3.3.2 Dynamic Bayesian networks
One of the main challenges encountered within the scope of this thesis is the data's temporal component. Time
series are perceived as vectors of stochastic variables observed at regular intervals of time. MTS contain not only
inter-variable dependencies but also temporal dependencies, so relations within series are also
present. Dynamic Bayesian Networks (DBN) [4] are BNs which relate variables over adjacent time steps. These
can model probability distributions over time, which is of interest when considering complex temporal data. Unlike
standard BNs, DBNs are composed of multiple networks known as transition networks, together with a prior BN
denoting the distribution of the initial states; the latter is a standard BN. A transition network possesses two types of con-
nectivity among variables, denoted inter- and intra-slice connectivity. Intra-slice connectivity refers to variable dependencies at
the same time frame, just like standard BNs, while inter-slice connectivity is responsible for the temporal aspect, which
relates attributes of different time slices.
Data is seen as a collection of subjects, each formed by an MTS. This means that a test subject possesses
multiple variables along time, a single variable thus being seen as a single TS. A DBN is modeled according to the evolution
of each variable TS along discrete time as well as their inter-connections, thus extending the domain
of BNs with time.
A transition network B_{t-1}^{t} of a stationary first-order DBN can be observed in Fig. 3.3, where nodes have an
associated time-stamp. These time-stamps influence the distributions stored in each node, in the sense that variables possess
parameters which depend on time, something absent in standard BNs. It is worth noting that data must be
based on discrete time, meaning that time frames should be well defined prior to the employment of such net-
works. Temporal associations among entities are obtained from consecutive data, since these are typically highly
correlated in time. In the example shown, the DBN is of first order, meaning that nodes at time frame t may be
correlated with attributes at the previous time instant t − 1. The connection between attributes X3[t] and X1[t] rep-
resents the intra-slice connectivity of the transition network B_{t-1}^{t}, which correlates attributes in the same time
frame similarly to a standard BN, while all the others represent the inter-slice connectivity. It is worth noting that in the
example shown, a connection between X3[t − 1] and X1[t − 1] exists although it is not drawn: the
connection belongs to the previous transition network B_{t-2}^{t-1}, which possesses the exact same structure
shown, since we are referring to a stationary DBN. Recall that, a DBN being an acyclic graph, it cannot
contain cycles.
Figure 3.3: Transition network of a stationary first-order DBN.
Considering Fig. 3.3, node X1[t] is conditionally independent of X2[t], leading to the joint probability of the
attributes at time t,
P (X3[t], X2[t], X1[t]) = P (X3[t]|X3[t− 1]) · P (X2[t]|X2[t− 1]) · P (X1[t]|X1[t− 1], X3[t]). (3.14)
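As a toy sketch of evaluating this transition joint (binary variables and made-up CPTs, not from the thesis), each factor of Eq. (3.14) is looked up in its node's table:

```python
# Sketch of Eq. (3.14) for the structure of Fig. 3.3: X3[t] and X2[t] depend
# on their own previous values, while X1[t] depends on X1[t-1] and X3[t].
# All conditional probability tables below are illustrative.

p_x3 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(X3[t] | X3[t-1])
p_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # P(X2[t] | X2[t-1])
p_x1 = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.1, 1: 0.9}}
                                                     # P(X1[t] | X1[t-1], X3[t])

def transition_prob(prev, curr):
    """prev, curr: dicts {'X1': v, 'X2': v, 'X3': v} for slices t-1 and t."""
    return (p_x3[prev["X3"]][curr["X3"]]
            * p_x2[prev["X2"]][curr["X2"]]
            * p_x1[(prev["X1"], curr["X3"])][curr["X1"]])

print(transition_prob({"X1": 0, "X2": 1, "X3": 0},
                      {"X1": 1, "X2": 1, "X3": 1}))   # 0.3 * 0.6 * 0.5
```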
There is flexibility regarding the structural traits of a DBN. It can be stationary, meaning that the transition
network does not vary across the transitions t − 1 → t of the MTS; in other words, the structure in Fig. 3.3
is invariant over time. Data is thus treated equally along time, the transition network B_{t-1}^{t} adapting to the
data's transitions as a whole. Non-stationarity implies that a DBN is formed by a set of transition networks, one
particular to each transition. The latter can adapt more easily to time signatures; however, the model becomes
more complex, which without caution can provoke overfitting. A dataset which possesses anomalies at particular
time-stamps could unintentionally mask them, preventing their detection; to prevent such unwanted behaviors,
prudence is advised from the analyst. MTS data can be divided into two classes: series whose
statistical properties are invariant over time and those whose properties change over time. Identifying such aspects
greatly improves the decisions made when choosing the appropriate model.
The lag associated with the network influences its structure. An L-th-order DBN means that an
attribute Xi[t] at time frame t can possess parent nodes from t − L to t. It is worth noting that, besides the increase
in complexity of the transition networks B_{t-L}^{t} as well as the risks of modeling unwanted signatures or overfitting,
data is also partitioned further: the transition networks B_{t-L}^{t} require
information from L + 1 time frames when considering a lag of L. A transition t − L → t can thus be seen as a
window D_{t-L}^{t} comprising the observations of the attributes belonging to the respective transition
network. Before modeling such networks, one should consider whether there are sufficient measurements to acquire a
credible result.
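As a minimal sketch of this windowing (illustrative names, not the thesis code), slicing an MTS into lag-L transition windows shows why longer lags leave fewer windows per series:

```python
# Each lag-L transition window D needs observations from L+1 consecutive
# time frames, so higher lags partition the data into fewer windows.

def transition_windows(series, lag):
    """series: list of per-time-frame observation dicts; returns the list
    of (lag+1)-frame windows available for fitting transition networks."""
    return [series[t - lag:t + 1] for t in range(lag, len(series))]

mts = [{"X1": 0, "X2": 1}, {"X1": 1, "X2": 1},
       {"X1": 1, "X2": 0}, {"X1": 0, "X2": 0}]
print(len(transition_windows(mts, 1)))   # 3 windows for a first-order model
print(len(transition_windows(mts, 2)))   # only 2 windows for lag 2
```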
Dynamic BNs can also perform inference and discover the most probable sequence, somewhat similarly to the HMMs
discussed in Sec. 3.2.2; these tasks are also known as conditional probability queries. Since the model expresses the
data's normal behavior, an outlier can be characterized by how unlikely it is to have been generated by the model.
Hence, a scoring procedure provides the aforementioned likelihood. The most common choice
is the Log-Likelihood (LL) function, which is based on the logarithm of the probability density function of the
Bayesian network: the more anomalous the evidence, the lower its LL score. A variation of the LL
score known as the Minimum Description Length (MDL) can also be employed, penalizing complex structures through
the number of parameters.
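As a toy sketch of the LL idea (made-up factor probabilities, not a trained DBN), the log-likelihood of an observed transition is the sum of the log conditional probabilities of its factors, and anomalous evidence scores lower:

```python
import math

# LL scoring sketch: each factor probability would come from a trained DBN's
# conditional tables; a single very unlikely factor drags the score down.

def log_likelihood(factor_probs):
    return sum(math.log(p) for p in factor_probs)

normal = log_likelihood([0.8, 0.9, 0.7])      # plausible transition
anomalous = log_likelihood([0.8, 0.9, 0.01])  # one very unlikely factor
print(normal > anomalous)                     # True: anomalies score lower
```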
Learning a DBN is essentially disclosing the set of transition networks which best match a training set. This
is explored in the current work in Section 3.3.4.
3.3.3 Dynamic Bayesian network outlier detection
Bayesian networks are well suited to outlier detection, capturing otherwise hard-to-find correlations among attributes
in high-dimensional data. Time series data can easily be adapted to cope with BNs. The graph nature of BNs allows
analysts to quickly identify relations as well as to trace the root cause of anomalies, such as determining which
attributes contributed the most to the problem. Prediction is likewise a virtue of BNs when handling online
data, allowing anomalies to be detected beforehand, usually signaled by a decrease of the LL score. An anomaly
detection approach using DBNs was developed to handle the detection of aircraft pilot errors taking advantage of
such virtues [10]. One compelling aspect was the use of hidden nodes, similar to HMMs, to represent pilot actions,
which cannot be measured like regular variables from the aircraft sensors. Maximum likelihood estimation of
the parameters was performed given the data and model with an EM algorithm, due to the partial observability
nature of the problem. The goal is not only to identify the occurrence of an anomaly but to track its point of origin
as well. It is assumed that an anomaly appearing at a time instant will spread to related states and succeeding time
slices. The procedure reverts the usual flow of the network and backtracks a detected anomaly until the state which
provoked it is identified. The algorithm can thus report not only the manifestation of an anomaly but also the
past pilot action which probably incited it.
In medical scenarios, the present methods are capable of providing valuable insights into disease processes.
Patients' underlying clinical details change in the course of medical interventions in addition to affliction devel-
opments. DBNs are thus fitting for such scenarios, receiving increasing focus in the literature [11]. They have also
proven to be efficient when modeling gene expression data [13].
Furthermore, Kalman filters [68] are DBNs which estimate and track system states recursively over time.
Probability density functions are approximated using incoming measurements, assuming states are hidden Markov
processes similar to HMMs. Kalman filters can be employed in several domains within the scope of the present
work; an example using real-time streaming data allowed the engagement of DBN approaches in environmental
engineering [12].
Another example of the versatility of DBNs is their application in areas such as image and video processing [69].
The frames of a captured video are contemplated as feature vectors; an MTS is thus formulated by features such as
pixel color through the different frames. Anomalies are classified as sequences of consecutive frames that differ
excessively from normal ones; the LL score of a sequence can thus be used. Alongside other techniques, DBNs
are employed in the developed apparatus by adapting an existing DBN modeling algorithm [14] that supplies an
optimal network when certain criteria are met. Details are discussed further in Chapter 4 with the incorporation of
a sliding-window approach. The final result is a complete and adaptable anomaly detection scheme which tackles
common issues from pre-processing to post-processing, such as discretization and score analysis, respectively.
3.3.4 Model training
Learning a BN can be summed up as finding the DAG and the parameters which maximize the fit to a
training set. This is achieved by comparing each possible structure through a scoring function.
The developed outlier detection technique assumes that sane data originates from stochastic processes which
can be revealed by a modeling framework. With the intention of detecting outliers originating from unusual and
abnormal transitions in MTS, a DBN structure learning algorithm is considered. The prevailing aspiration is the
identification of conditional dependencies amid distinct variables in addition to temporal dependencies. Thus, the
present section presents the reasoning behind an optimal tree-augmented DBN (tDBN) structure learning algorithm
[14], which provides the desired model possessing optimal inter/intra-slice connectivities for each transition net-
work. The modeling algorithm described next is employed in the modeling phase of the proposed outlier detection
approach discussed in Chapter 4.
3.3.4.1 Tree-augmented structure learning
The reason for choosing the tDBN [14] technique as opposed to available DBN learning software [70] is that
the latter typically neglects intra-slice connectivity or structures it in a detached way, meaning that the two
connectivities are modeled separately. Typically, the intra-slice connectivity is then modeled as a standard
Bayesian network, which does not consider temporal relationships. In contrast, tDBN models an optimal network
by determining both connectivities simultaneously, providing a solid representation of regular data. It is assumed
that each variable TS is discrete with a finite number of states. The presence of hidden variables or missing values
is not supported, meaning that such issues must be handled prior to modeling; these matters are discussed further
in Sec. 4.1.
Before proceeding to the description of the learning method's underlying procedures, several concepts
need to be acknowledged. When both connectivities are considered in model training, attaining the globally
optimal network becomes NP-hard [71], contrary to learning solely the inter-slice connectivity, which may not be [72].
NP-hard problems contrast with P problems, which can be solved in a polynomial number of steps: if the solution to our
task were known, it could be checked in polynomial time, but acquiring the solution is part of the problem.
NP-hard problems typically require considerable computational power to check every possible solution, and
the time to find one grows exponentially with the problem's size.
To tackle this dilemma, tDBN [14] limits the search space to tree-augmented networks: an attribute
node at a certain time slice can possess at most one parent in that same slice. This constraint alone governs
the intra-slice connectivity. Furthermore, for each node, the maximum number of parents from preceding time slices
is bounded by a parameter p, typically lower than 5. Such structures have already proven to be
effective and superior to their counterparts, one example being the tree-augmented naive Bayes (TAN),
which augments the base structure of the naive BN while capturing correlations among variables [73].
The tDBN algorithm [14] builds on the Chow and Liu approach [74], which approximates a joint probability
distribution by a product of terms each introducing just one variable, making it possible to represent the compu-
tation as a first-order dependency tree. The Chow–Liu algorithm outputs the conditional probabilities to be used.
The approximation offers the minimum Kullback–Leibler Divergence (KLD) [75] between the estimate and the
real distribution, where the KLD measures how much information is lost when the approximated distribution is used instead
of the real one. A more in-depth notion is given in Sec. 4.3.1.
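As a quick numeric illustration of the KLD (toy distributions, not from the thesis), the divergence is zero when the approximation equals the true distribution and grows as information is lost:

```python
import math

# KLD sketch: sum of p_i * log2(p_i / q_i) over the support of p.

def kld(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.5, 0.25, 0.25]
print(kld(true_dist, true_dist))             # 0.0: nothing lost
print(kld(true_dist, [0.4, 0.4, 0.2]) > 0)   # True: approximation loses info
```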
Each transition t → t + 1 of the training set requires inspection in order to determine the most suitable network
for the model; in the non-stationary case, each transition is examined separately. For the sake of simplicity,
the steps explained below assume a Markov lag of one, but the arguments hold for larger lags. With this
in mind, the main phases of the structure learning method, applied for each transition t → t + 1, are
as follows:
• Complete directed graph: The algorithm requires a scoring function φ. This can be any
decomposable function; here the LL score is chosen, although its penalized version, MDL, can also be
employed. For each attribute Xi[t + 1], the set of parents with maximum score is found. By allowing each
node to have one parent from its current time slice, the optimization problem for a node Xi[t + 1] can be
formulated as

$$\begin{aligned} \underset{s_{ij},\, Par[t]}{\text{maximize}} \quad & s_{ij} = \phi_k\big(Par[t] \cup \{X_j[t+1]\},\, D_t^{t+1}\big) \\ \text{subject to} \quad & Par[t] \in S(X[t]),\ i \neq j, \\ & |Par[t]| \leq p. \end{aligned} \qquad (3.15)$$
In Eq. (3.15), φk refers to a term of the decomposable scoring function φ, D_t^{t+1} refers to a window
possessing the set of observations regarding the transition t → t + 1, Par[t] specifies the set of parents at
slice t, S(X[t]) represents all the subsets of the nodes X[t] from the preceding time slice t, and Xj[t + 1] is a node
from the same time slice t + 1 with i ≠ j. The constraint declares that the set of parents must
be among the subsets of preceding nodes S(X[t]) containing at most p nodes, meaning that a node Xi[t + 1] can
possess at most p parents from time slice t. The latter restriction is not mandatory, although it facilitates the search
while diminishing overfitting. A complete directed graph is then built over X[t + 1], from which the Chow–Liu
approach [74] extracts a maximum-weight spanning tree, as computed, e.g., by Kruskal's algorithm. Each edge
Xj[t + 1] → Xi[t + 1] of the graph has a weight eij = sij − si, where si represents the score
from Eq. (3.15) for node Xi[t + 1] without considering Xj[t + 1] as a parent; the weight thus
represents the score gain of the network when including Xj[t + 1] as a parent. The resulting tree is the one
closest to the distribution underlying the data as measured by the KLD, hence a maximum-LL-score tree.
• Maximum branching: It is worth noting that the computed edge weights are direction dependent, eij ≠ eji.
A maximum branching algorithm must be employed to determine the directed spanning tree, in other words,
an acyclic and connected graph emanating from the root node with a single path from that node to
every other node. The tree has the largest total weight, equal to the sum of all its edge weights. To achieve
such a structure, Edmonds' algorithm [76] is employed; the resulting nodes have an
in-degree of at most 1. The intra-slice connectivity is retrieved immediately from the output of Edmonds'
algorithm, while the inter-slice connectivity is derived from the solutions of the optimization problems in
Eq. (3.15) for each node Xj[t + 1], with the exception of Edmonds' root node, which only requires the
optimal set of parents from the preceding slice.
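As a toy illustration of the two steps above (all score values invented), the sketch below forms the edge weights eij = sij − si for three intra-slice nodes and brute-forces the maximum-weight directed spanning tree that Edmonds' algorithm would return; brute force is only feasible because the example is tiny.

```python
from itertools import product

# s[i]   : score of node i with no intra-slice parent
# sij    : score of node i with node j as its intra-slice parent
# weight : e_{j->i} = s_ij - s_i, the gain of adding edge j -> i

s = {0: -5.0, 1: -4.0, 2: -6.0}
sij = {(0, 1): -4.2, (0, 2): -4.8, (1, 0): -3.9,
       (1, 2): -3.5, (2, 0): -5.1, (2, 1): -5.9}
weight = {(j, i): sij[(i, j)] - s[i] for (i, j) in sij}

def best_branching(nodes):
    best, best_w = None, float("-inf")
    # Each node picks one intra-slice parent or none (in-degree <= 1).
    for parents in product([None] + nodes, repeat=len(nodes)):
        if any(p == i for i, p in enumerate(parents)):
            continue                       # no self-loops
        ok = True                          # reject cyclic parent assignments
        for i in nodes:
            seen, p = set(), parents[i]
            while p is not None:
                if p in seen:
                    ok = False
                    break
                seen.add(p)
                p = parents[p]
            if not ok:
                break
        if not ok:
            continue
        w = sum(weight[(p, i)] for i, p in enumerate(parents) if p is not None)
        if w > best_w:
            best, best_w = parents, w
    return best, best_w

print(best_branching([0, 1, 2]))   # parent of each node, total weight
```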
A tree-augmented DBN is thus obtained, with intra/inter-slice connectivity resolved simultaneously and with a
polynomial worst-case time complexity. It is worth noting that even upon finding an optimal structure, it is
normally not unique. The processes described above consider the modeling of a non-stationary first-
order Markov lag DBN and should be applied to every transition available; the same reasoning is adapted for
stationary and non-unitary Markov lag DBNs, with further details and proofs available in [14].
It is worth noting that the DBN's prior network is obtained separately with conventional procedures and is not
significant for the developed outlier detection approach.
Chapter 4
Proposed method
In the present chapter, a gradual but in-depth analysis of the developed outlier detection mechanism is given.
The mechanism focuses on the detection of unusual values, in conjunction with irregular sequences, present in MTS through
DBN modeling. Firstly, a general and intuitive view is laid out together with a step-by-step reasoning of the
underlying procedures.
The proposed approach comprises four main phases, depicted in
Fig. 4.1. Datasets containing MTS are pre-processed prior to modeling if necessary. The pre-processing phase
studied uses a discretization and dimensionality reduction technique discussed in Sec. 4.1. Discrete
MTS datasets are then fed to the tDBN modeling algorithm, described in Sec. 3.3.4, which generates a DBN
according to the provided dataset and the chosen parameters: the Markov lag L, the number of parents
p from preceding frames each node can possess, and a flag s deciding the stationarity of the model. Afterwards,
the MTS dataset together with the trained model are delivered to a scoring phase, which capitalizes
on the structure and parameters of the model to analyze each MTS transition using a sliding-window algorithm;
entire series are likewise scored. Subsequently, scores are delivered to a score-analysis strategy whose main
goal is to create a boundary differentiating abnormal from normal scores. Two possible strategies are discussed
in the present work and later compared. A score-analysis strategy outputs a threshold used for the final binary
classification: original data associated with scores below the threshold are classified as outliers.
Each phase is analyzed in the present chapter, the modeling algorithm having been discussed in Sec. 3.3.4.
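The four phases can be sketched as the skeleton below. All function names are hypothetical stand-ins for the actual components (SAX pre-processing, tDBN modeling, sliding-window scoring, score analysis), not the thesis implementation:

```python
# Skeleton of the four-phase pipeline: pre-process -> model -> score ->
# threshold. Components are injected as stand-in callables.

def detect_outliers(dataset, p, L, s, preprocess, train_tdbn,
                    score_transitions, choose_threshold):
    data = preprocess(dataset)                 # discretization + reduction
    model = train_tdbn(data, p=p, lag=L, stationary=s)
    scores = score_transitions(model, data)    # sliding-window LL scores
    threshold = choose_threshold(scores)       # score-analysis strategy
    return [i for i, sc in enumerate(scores) if sc < threshold]

# Tiny smoke test with dummy components: the lowest-scoring transition
# falls below the threshold and is flagged.
result = detect_outliers(
    dataset=[[0, 1, 1, 0]], p=1, L=1, s=True,
    preprocess=lambda d: d,
    train_tdbn=lambda d, p, lag, stationary: None,
    score_transitions=lambda m, d: [-0.5, -0.6, -9.0],
    choose_threshold=lambda sc: -5.0)
print(result)   # [2]
```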
Figure 4.1: Scheme of the proposed outlier detection approach comprised of 4 phases. Data sets formed by MTS data can be directly applied to the modeling phase when discrete and in the right format; otherwise, the pre-processing phase is applied prior to modeling. Missing values must be imputed beforehand, not being considered in the current system. Discrete data is delivered to the modeling phase along with the parameters p, L and s of the DBN to be modeled. The score-analysis phase considers two distinct strategies, providing thus two possible routes for outlier disclosure.
4.1 Time series pre-processing
One of the challenges when handling TS data is their tendency to be continuous and of high dimension. Since real-world databases are vulnerable to inconsistency and noisy sources, data pre-processing is considered. To tackle such issues, a procedure providing discretization and dimensionality reduction is applied to data sets prior to the modeling phase. Noise cleaning can also be endorsed; however, since the aim of the developed program is to detect anomalous and unusual behavior in data, noise reduction is not mandatory. For a better understanding of data pre-processing and its importance consider [77].
Since the trained model is a DBN, undiscretized variables foster a heavy and overfitted model which ends up behaving poorly. The same can be said of over-sampled data. These aspects also incur high computational burdens. Emphasis should be placed on the choice of the representation parameters, which can highly influence the performance of the outlier detection process.
The representation enforced on the input TS in the proposed approach is known as Symbolic Aggregate approXimation (SAX) [9]. Other techniques can be used prior to the system, with the resulting discretized dataset applied to the approach. SAX has already been applied in anomaly detection scenarios [2], being thus the elected strategy. The next statements consider a univariate TS for the sake of simplicity, the generalization being deferred to later. The pseudo-code for the pre-processing mechanism is available in Algorithm 1 and specified next:
• Normalization: Before proceeding to the actual dimensionality reduction and discretization, TS Xi[t] = xi[1], ..., xi[T] are normalized to present zero mean and a standard deviation of one. Comparing TS with different offsets and/or amplitudes is futile; such is tackled employing Z-normalization [78]. The mean of the TS is subtracted from every data point, and the result is then divided by the TS' standard deviation. The process is thus

x′i[t] = (xi[t] − x̄i) / σi,   i ∈ N,   (4.1)

where xi[t] denotes a data point from variable i at time stamp t, while x̄i is the mean value for that variable and σi its standard deviation. The Z-score normalization from Eq. (4.1) adjusts the value of every data point xi[t] of the TS into its new representation x′i[t], which denotes the number of standard deviations from the mean the original data point is. Let us recall that, for the case of MTS, the procedure must be applied to each variable TS, Xi[t], separately. Other types of normalization and variations of the default Z-normalization can be contemplated [77].
• Dimensionality reduction: Each TS of length T can be converted into an equivalent sequence of size m ≪ T. This reduced data representation is known as compression. Notorious compression techniques include wavelet transforms, among others. In the current program, compression is assured by employing a technique known as Piecewise Aggregate Approximation (PAA) [79], which subdivides a normalized TS into m equally sized windows. These segments will become the new TS' values, hopefully preserving its properties. The mean value of each window is hence computed, replacing all values from that window. In other words, the m means of the windows serve as the new TS. Mathematically, a time series X[t] = x[1], ..., x[T] of length T is remodeled as another sequence X̄[t] = x̄[1], ..., x̄[m] of length m ≪ T, being each value transformed using

x̄[i] = (m/T) · Σ_{j = (T/m)(i−1)+1}^{(T/m)·i} x[j],   1 ≤ i ≤ m,   (4.2)

where new values are computed as an average of the original ones in their region of size T/m.
It is worth noting that the aforementioned technique is only possible for numeric TS, meaning that reducing dimensionality in symbolic sequences should consider a different approach. An additional concern is the modification of the sampling rate of the data. When exercising PAA, the spacing between values becomes T/m, considering that the original spacing was 1. If a particular sampling rate is needed for a specific application, the analyst should pay attention to the dimensionality reduction process and the chosen window size. At the end of the present phase, the original TS is approximated by a combination of steps as seen in Fig. 4.2. The aforementioned procedure can be seen as a non-parametric model for numerosity reduction.
For the multivariate scenario, it is worth noting that PAA reduction must be applied equally to every variable belonging to a common TS. If for each attribute TS Xj[t] a different length mj were chosen, the final MTS would have a length equal to max(mj), causing the variable sequences not to fit together. In other words, distinct variables would have different sampling rates, forcing an undesirable amount of missing values. It is thus advisable to keep the PAA parameter m equal for every attribute.
• Symbolic Discretization: After having a normalized and reduced representation of a TS, the final and crucial phase concerns the actual discretization. It is worth noting that the PAA phase is not mandatory and can therefore be skipped; if the analyst is satisfied with the number of samples of a TS, he can choose to discretize it right after normalization. Normalized time series typically have Gaussian distributions [80]. With such in mind, the domain of a TS can be divided into Σ = σ1, σ2, ..., σ|Σ| equiprobable regions according to a Gaussian distribution N(0, 1), where |Σ| denotes the size of the alphabet Σ chosen to represent each new sector. Each region is identified by its boundaries, known as breakpoints B = β0, β1, ..., β|Σ|, the leftmost and rightmost breakpoints β0 and β|Σ| being −∞ and ∞, respectively. A symbol σi is thus associated with the interval [βi−1, βi[. The objective is to verify which of the alphabet regions each time series point falls into. As seen in Fig. 4.2, if a certain element is within the boundaries of symbol σi, for example, it is replaced by that symbol. The procedure is performed for every time stamp using a sliding window. The final result is a sequence of symbols which will now represent the TS for the upcoming computations.
Figure 4.2: Example of SAX discretization considering an 8-point TS. The 3-point PAA transform is converted to the string “acb” using the R package jMotif [81].
A simplistic version of the SAX algorithm can be reviewed in Algorithm 1. The pre-processing procedure must be applied to every variable, preferably with the same choice of parameters, when handling MTS. The selection of such values can have a considerable impact on what follows. While the PAA parameter determines the level of fidelity and memory spent in describing a TS, the alphabet size represents the granularity of expressing each element. With this in mind, it can be shown that for most applications an alphabet size in the range of 5 to 8 normally yields good results [9].
Algorithm 1 Data pre-processing
Input: Multiple continuous MTS Sj with a fixed length T and format. These are considered not to have missing values.
Output: The corresponding input MTS discretized with a fixed alphabet size |Σi|, 1 ≤ i ≤ n for each variable Xi and reduced to a length m ≪ T.
1: procedure SAX
2:   for each MTS Sj do // For each subject
3:     for each variable Xi of Sj do
4:       Normalizedi ← z_Norm(Xi) // Normalize each variable time series
5:       function PAA(Normalizedi, m)
6:         for each section of size T/m from Normalizedi do
7:           x̄ ← (m/T) · Σ_{j ∈ section} Normalizedi[j] // Mean of each section
8:           Normalizedi ← Normalizedi.Append(x̄) // Time series of size m
9:       function DISCRETIZATION(Normalizedi, |Σi|)
10:        β ← SegmentGaussianDistrib(|Σi|) // y axis divided in |Σi| equiprobable regions
11:        for each value v in Normalizedi do
12:          Discreteji ← ToSymbolic(v, β) // Breakpoints β used to classify each value v
13: return(Discreteji) // Return all discretized variables of all subjects
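As an illustration of Algorithm 1, the pipeline might be sketched in Python as follows. This is a minimal sketch for a single univariate TS, assuming T is a multiple of m; the function names are illustrative, and a full implementation such as the jMotif package [81] offers more options:

```python
from statistics import NormalDist

def z_normalize(ts):
    """Z-normalize a time series to zero mean and unit standard deviation."""
    n = len(ts)
    mean = sum(ts) / n
    std = (sum((x - mean) ** 2 for x in ts) / n) ** 0.5
    return [(x - mean) / std for x in ts]

def paa(ts, m):
    """Piecewise Aggregate Approximation: reduce ts to m window means."""
    T = len(ts)
    assert T % m == 0, "for simplicity, T must be a multiple of m"
    w = T // m
    return [sum(ts[i * w:(i + 1) * w]) / w for i in range(m)]

def sax(ts, m, alphabet="abc"):
    """Discretize via equiprobable N(0, 1) breakpoints (SAX)."""
    k = len(alphabet)
    # Interior breakpoints beta_1 .. beta_{k-1}; beta_0 and beta_k are -inf/+inf.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    symbols = []
    for x in paa(z_normalize(ts), m):
        idx = sum(1 for b in breakpoints if x >= b)  # index of the region x falls into
        symbols.append(alphabet[idx])
    return "".join(symbols)
```

For instance, `sax([1, 2, 3, 4, 5, 6, 7, 8], m=4)` normalizes the ramp, averages pairs of points, and maps the four window means onto the three-symbol alphabet.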
Although SAX discretization is said to guarantee equiprobability among symbols, the former statement is not entirely true. Research [82] has shown that when the PAA phase is applied prior to SAX discretization, the standard deviation of the normalized data tends to shrink. While the mean remains unaltered and equal to zero, the standard deviation no longer remains unitary. This shrinkage is proportional to the size of the PAA windows considered and the degree of auto-correlation within the TS. If the PAA reduction is exaggerated, the resulting values tend to converge to the original TS' mean; consequently, the equal distribution of symbols becomes harder and harder to achieve.
Even though SAX is considered to perform TS discretization, mechanisms like Principal Component Analysis (PCA) [83] can further enhance outlier detection. The latter selects only the most influential attributes when considering multivariate data, being thus advised to be applied prior to SAX. This is absent in the current approach.
4.2 Assumptions
A list of assumptions is specified with the aim of providing additional information for those wishing to apply the proposed unsupervised outlier detection system to their own application. If the analyst observes the referred assumptions and precautions, normal behavior is guaranteed. The assumptions are thus:
• Discretization & Completeness: A MTS data set must possess all its variables discretized with a finite
domain. Missing values must be treated beforehand. These can be imputed or replaced by an additional
symbol. Univariate TS representing each variable must be compatible, otherwise transitions would have
omitted information corrupting the procedure.
• Training set: Data used for model training should not hold an excessive amount of anomalies, otherwise the model may fit the anomalous behavior. The typical system behavior should thus be captured, without forcing anomalous scenarios.
• Correlation: Multivariate data employed in the algorithm should present correlation among variables as
well as temporal relationships, which are the main focus when modeling DBNs.
• Complexity: It is advisable to apply data sets with a reduced number of variables, due to the poor scaling when modeling DBNs. Over-complexity promotes overfitting as well as excessive time consumption. For data sets comprising 20 or more variables, a smaller subset should be selected.
4.3 Dynamic outlier detection technique
The proposed scoring phase is now described. It is noted that all the information in the following sections applies to both stationary and non-stationary DBNs. In the studied algorithm, an outlier is defined as a single observation, or set of observations, belonging to a subject, which is suspected of not being generated by the data's main underlying processes. Also referred to as statistical outliers, these are typically caused by qualitatively distinct mechanisms. A measure of outlierness is thus associated with each subset according to its resemblance to a trained DBN, which represents the data's normal behavior along time.
A window is defined as a sample of a MTS at a specified time interval and always has a size equal to n · (L+1), where n is the number of variables and L the DBN's lag. A window is described as

D_{t−L}^{t} = x1[t−L], ..., xn[t−L], ..., x1[t], ..., xn[t],   L ≤ t ≤ T,   (4.3)

where t identifies the last time frame present in the window. With this in mind, Eq. (4.3) shows that the size of a sequence depends on the Markov lag L of the trained model. Such occurs due to the fact that variables of a certain time slice t are conditioned by nodes no more than L prior time frames, according to the DBN structure. Each window possesses a transition network associated with it, which can be specific to that transition in the case of a non-stationary DBN.
A dataset containing multiple MTS is employed to train a DBN, as discussed in Sec. 3.3.4. The conditional probabilities from Eq. (3.11) for each attribute are kept in matrices within the algorithm. This means that a MTS, depicted in Eq. (3.2), with n variables through T time instants holds n · T attributes. Hence, DBN nodes represent attributes which are associated not only through variables but also through time.
The trained model is composed of a prior network B0 and, in the case of a first-order DBN, a transition network B_{t−1}^{t} for each transition t−1 → t. The initial BN is built upon the Markov lag of the model, meaning that a Markov lag of L demands an initial BN spanning L time instances. A transition network B_{t−L}^{t} for frame t requires variables defined since t−L, which means that transition networks only begin to act after time slice L−1. Each transition is associated with a window composed of the observed attributes required to compute its corresponding anomaly score.
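Under the illustrative assumption that an MTS is stored as a list of per-instant observation vectors (with 0-based time indices), the windows of Eq. (4.3) might be gathered as:

```python
def windows(mts, L):
    """Collect every window D[t-L .. t] from an MTS.

    mts: list of T observation vectors, each with n attribute values.
    L: Markov lag of the trained DBN.
    Returns one (t, flattened window) pair per transition; each window
    holds n * (L + 1) values, as described in Eq. (4.3).
    """
    T = len(mts)
    out = []
    for t in range(L, T):  # the first L slices lack enough predecessors
        flat = [v for frame in mts[t - L:t + 1] for v in frame]
        out.append((t, flat))
    return out
```

For a second-order DBN (L = 2) and a 4-instant bivariate series, this yields two windows of 6 values each, matching the n · (L+1) window size above.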
4.3.1 Outlierness score
To acquire a measure of outlierness, windows representing each transition of a subject are graded. The cor-
responding transition networks are used to check the probability of the values of time frame t according to the
window’s attribute values from Eq. (4.3). The probability of sequence x1[t], . . . , xn[t] is computed according to
the DBN structure, which stores the parent nodes of every attribute xi[t] and the parameters of every possible configuration. An attribute can possess parents in the same time frame, xj[t], j ≠ i, and at the previous L time instances, xi[t′], t − L ≤ t′ < t.
Before determining the score of each MTS transition, let us first understand the meaning of likelihood. Since the learned model represents normal behavior, evidence should be set on the DBN in order to compute how likely it is that certain data was generated by the model. The natural logarithm of the likelihood function is employed with the aim of preventing the underflow problem typically associated with such functions. Since the logarithm of a function reaches its maximum at the same point as the original function, the desired outcome is not disturbed. The more anomalous observations are, the lower the LL values they are associated with. Conversely, a perfect fit has a score of zero, being thus the maximum possible value for a score. The score of a window D_{t−L}^{t}, also known as transition score, is computed as

Score_{t−L}^{t} = Σ_{i=1}^{n} log(P(xi[t] | pa(xi[t]))),   (4.4)

where P(xi[t] | pa(xi[t])) is the probability of attribute xi[t] conditioned on its parent nodes' values, represented by pa(xi[t]). These can be attributes from the same time frame or prior ones, according to the transition network. Each score from Eq. (4.4) is thus a sum of the logarithms of the probabilities of every attribute at time slice t assuming their observed values. The outlierness of each transition t−L → t is thereby acquired.
The LL score is preferred due to its connection with the notions of entropy and KLD from information theory [84]. Entropy can be seen as the minimum number of bits necessary to specify the value of a certain discrete random variable. The entropy of a random variable with possible values σ ∈ Σ and known symbol probabilities is computed as

H = − Σ_{i=1}^{|Σ|} P(σi) · log2(P(σi)),   (4.5)

where |Σ| is the size of the alphabet. In other words, it is the amount of information needed to identify the system's state. Note that information is employed under Shannon's definition [85]. It can be shown [84] that the mean negative LL of a discrete random variable converges to the sum of the entropy, considering the variable's true distribution, and the KLD between the true distribution and the distribution considered. The KLD is the number of additional bits, on average, needed to encode an outcome distributed by an approximating (model) distribution when compared with the use of the optimal encoding for the true distribution. It is worth noting that the mutual information I between two random variables X and Y can be expressed as a KLD between the joint distribution of both variables and the product of the marginal distributions, I(X, Y) = KLD(P(X, Y) || P(X) · P(Y)) = H(X) − H(X|Y) [84]. The latter can be seen as a reduction in the uncertainty of X given Y. Note that the KLD cannot be used to measure the distance between two distributions, the main reason being its asymmetry. With this in mind, maximizing the likelihood of data under the trained model is equivalent to minimizing the KLD, reinforcing the notion that anomalous data associated with low likelihoods are less likely to have been generated by the true generative model.
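Eq. (4.5) can be checked with a small sketch; for instance, a uniform distribution over a four-symbol alphabet requires exactly 2 bits, while any non-uniform distribution over the same alphabet requires fewer:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution, Eq. (4.5)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))      # below 2 bits
```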
Although practical, the score in Eq. (4.4) has a substantial flaw. If the evidence possesses an unseen pattern, the probability of at least one attribute is zero. This means that the anomaly score for the whole transition is −∞, since the model assumes the evidence is impossible. This becomes easily understandable by inspecting Eq. (4.4): the logarithm of zero is not defined, and its limit is −∞ when the probability approaches zero from the positive side. To tackle this issue, a technique known as probability smoothing is employed, with the aim of preventing score disruption for unseen evidence. The smoothing applied, known in the existing literature [54], is
Pi = (1− |Σi|ymin)pi + ymin, (4.6)
where pi represents a conditional probability P(xi[t] | pa(xi[t])), ymin a parameter expressing the degree of probability uncertainty, and |Σi| the granularity of the variable in question. Every probability pi is thus remodeled into Pi, not only to resolve the problem of unseen patterns but also to ensure a more realistic and fair probability distribution. This means that when pi is zero, the new probability will be equal to ymin, which normally is 0.001, instead of the unseen pattern being assumed impossible. When a sequence has a probability of 1, every time a given attribute's parent node configuration occurs, that attribute's value is constant. Equation (4.6) ensures that such a probability is decreased according not only to ymin but also to the size of the alphabet related to that attribute. Consider a small example where an attribute at time frame t can assume one value from {a, b, c, d, e, f}. The training set revealed that, given its parents' values, the attribute should always adopt symbol c. However, taking this at face value encourages overfitting; the algorithm should not assume impossibilities or certainties. Doubt is not a pleasant condition, but certainty is absurd [86]. In this example, the variable can assume 1 out of 6 possible values. This degree of uncertainty is thus encoded in Eq. (4.6), denoting that variables with higher granularity should have a higher uncertainty even when the trained model says otherwise. A probability of 1 is hence converted into a smaller one, the reduction being proportional to the granularity of the variable. Consequently, the LL score is computed using the smoothed probabilities.
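Eq. (4.6) translates directly into code; a minimal sketch with ymin = 0.001, as suggested in the text:

```python
def smooth(p, alphabet_size, y_min=0.001):
    """Probability smoothing of Eq. (4.6): P = (1 - |Sigma| * y_min) * p + y_min."""
    return (1 - alphabet_size * y_min) * p + y_min

# An unseen pattern (p = 0) is no longer impossible:
print(round(smooth(0.0, alphabet_size=6), 6))   # 0.001

# A certainty (p = 1) is pulled below 1, more so for larger alphabets:
print(round(smooth(1.0, alphabet_size=6), 6))   # 0.995
print(round(smooth(1.0, alphabet_size=20), 6))  # 0.981
```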
4.3.2 Sliding window
The next phase is to acquire the outlierness of every subject's transition, in other words, the outlierness of every time frame from each TS given the model. The developed method resorts to a sliding window approach to create an outlierness plot along time for each subject. A sliding window can be defined as a sub-network of a DBN over consecutive time slices [87]. The mechanism is based on the gradual capture of all the windows D_{t−L}^{t}, which allow the computation of transition scores Score_{t−L}^{t}, as seen in Eq. (4.4), for each transition. Similar concepts have been studied for DBNs, although for different purposes such as approximate inference [87]. A sliding window has n · (L+1) values, depending on the order of the DBN.
Consider a discretized univariate TS of length T, modeled with a second-order DBN. The first 3 windows to score are shown in Fig. 4.3, where each time stamp ti, L+1 ≤ i ≤ T, represents a time slice of the TS. In this case, windows are comprised of L+1 observations. Initial attributes at time frames t1 and t2 are not scored by the mechanism, since these do not possess sufficient preceding values to be evaluated with the second-order transition network. Nevertheless, the latter are modeled by the prior network B0 and influence the scores of the upcoming windows which include them. This means that an unusual value at t1, for example, can still corrupt transition t1 → t3, triggering an anomaly. In Fig. 4.3, contiguous transitions always possess two shared frames, a window D_{ti−2}^{ti} being thus responsible for describing the likelihood of the observations at slice ti given the corresponding transition network.
For a discrete MTS with n · T values, formatted according to Eq. (3.2), each cell displayed in Fig. 4.3 represents a set of n observations instead of a single value. However, the reasoning is exactly the same, the number of captured values being multiplied by the number of variables of the MTS. Entire windows of size n · (L+1) are used to compute the conditional probabilities according to the respective transition network.
Figure 4.3: Sliding window example of a subject modeled with a second-order DBN. At each iteration, window D_{ti−2}^{ti} is captured to compute the likelihood of attributes at time slice ti, 3 ≤ i ≤ T, given the respective transition network B_{ti−2}^{ti}.
All transition windows are gathered for every subject and scored according to Eq. (4.4). The whole procedure
is depicted in Algorithm 2. Score analysis must now be employed to discern the final decision boundary.
Algorithm 2 Transition outlier detection
Input: A DBN model with stored conditional probabilities for each transition network B_{t−L}^{t}. A data set containing each subject Sj = x1[1], ..., xn[1], ..., x1[T], ..., xn[T] as a MTS in the same format as those used for training the DBN. A threshold thresh to discern abnormal and normal scores.
Output: The LL scores for every transition t−L → t which are below the specified threshold. Outlierness is measured by multiplying the conditional probabilities of attributes X1[t], ..., Xn[t] from the corresponding window.
1: procedure OUTLIERNESS
2:   for each time slice t do // For each transition t−L → t
3:     for each subject Sj do
4:       D_{t−L}^{t} ← x1[t−L], ..., xn[t−L], ..., x1[t], ..., xn[t]
5:       Xt ← x1[t], ..., xn[t]
6:       function SCORING(D_{t−L}^{t}, Xt, t)
7:         TransNet ← DBN.getTransNet(t) // Retrieve the transition network B_{t−L}^{t}
8:         for xi in Xt do // For each attribute of Xt
9:           pai ← TransNet.getParents(xi) // Parents of xi[t]
10:          pi ← TransNet.getProbability(xi, pai) // Retrieve conditional probability
11:          Pi ← (1 − |Σi| · ymin) · pi + ymin // Probability smoothing
12:        Score_{t−L}^{t} ← Σi log Pi // Score for window D_{t−L}^{t} of subject Sj
13:        scorej ← scorej.Append(Score_{t−L}^{t}) // Every transition score of subject Sj
14:        if Score_{t−L}^{t} < thresh then
15:          outliers ← outliers.Append(Score_{t−L}^{t}) // Transitions classified as outliers
16: return(outliers)
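The core of the scoring function in Algorithm 2 might be sketched as follows, assuming the conditional probabilities of one window have already been read from the trained DBN (function and parameter names are illustrative stand-ins for the model's probability tables):

```python
import math

Y_MIN = 0.001  # smoothing parameter of Eq. (4.6)

def transition_score(probs, alphabet_sizes):
    """LL score of one window, Eq. (4.4), using the smoothed
    probabilities of Eq. (4.6).

    probs: conditional probabilities P(x_i[t] | pa(x_i[t])) of each
    attribute at slice t, as read from the trained DBN's tables.
    alphabet_sizes: |Sigma_i| for each attribute.
    """
    score = 0.0
    for p, k in zip(probs, alphabet_sizes):
        smoothed = (1 - k * Y_MIN) * p + Y_MIN  # Eq. (4.6)
        score += math.log(smoothed)
    return score

# Two attributes with alphabet size 4: a well-explained transition...
normal = transition_score([0.9, 0.8], [4, 4])
# ...versus one containing an unseen pattern (probability 0).
unseen = transition_score([0.9, 0.0], [4, 4])
assert unseen < normal      # lower score = more anomalous
assert math.isfinite(unseen)  # smoothing keeps the score finite
```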
For the final output, a user-defined threshold can be enforced: all transitions from every test subject which obtained a score below the specified value are yielded. It is worth noting that even if an outlier is marked at time slice t, this represents the outlierness of the transition and not necessarily of the instant. The analyst should observe the attributes preceding the selected frame according to the corresponding transition network, since it may happen that outliers detected at time frame t are caused by previous slices. With such in mind, one can consider an entire window as abnormal rather than specific frames. Furthermore, since the outlier score of a transition corresponds to the conjunction of the scores of every attribute generated in that time slice, one can inquire which of the attributes contributed the most to the registered abnormality. It may happen that a single variable assumed a value so unlikely, given its parents, that it induced a high outlierness in the whole transition.
In addition to transition outlierness, the algorithm offers the classification of subjects as well. In many applications, analysts are concerned with removing anomalous series from their models rather than detecting unusual sections within subjects. Such concern is addressed by computing an additional score for entire MTS. A straightforward approach is considered, a subject's outlierness being equal to the mean of the scores of every transition of that subject. The same principle can be applied to any window size, ranging from a single transition to the whole subject. A subject score is computed as

SubjectScorej = (1 / (T − L)) · Σ_{t=L+1}^{T} Score_{t−L}^{t},   (4.7)

using smoothed probabilities, where Score_{t−L}^{t} represents the transition scores belonging to subject j. The mentioned mechanisms are applicable to any type of DBN, the difference between stationary and non-stationary models being the transition network considered in each step for the computation of the conditional probabilities.
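Eq. (4.7) reduces to a simple mean over a subject's T − L transition scores; a trivial sketch:

```python
def subject_score(transition_scores):
    """Subject outlierness of Eq. (4.7): the mean of the subject's
    transition scores (already computed with smoothed probabilities)."""
    return sum(transition_scores) / len(transition_scores)

# A subject with one very unlikely transition scores lower overall:
assert subject_score([-0.3, -0.4, -0.5]) > subject_score([-0.3, -9.0, -0.5])
```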
By providing the option of scoring data sets in multiple ways, the proposed approach adapts more easily to the analysts' needs and fits a wide range of scenarios. The next phase is to cleverly determine a boundary separating abnormal scores from normal ones.
4.4 Score analysis
A valued trait of the present algorithm is its automation. The apparatus being unsupervised, the idea is to lower human intervention as much as possible while still offering quality results. Two possible strategies are discussed for the score-analysis phase, whose aim is to compute the optimal threshold separating abnormal from normal scores. The chosen strategies make distributional assumptions on the scores; therefore, one should note that specific scenarios may call for particular approaches. The two strategies studied can be seen as counterparts of each other, meaning that, generally, when one performs poorly the other outputs good results. Such is discussed and supported by simulated results in Sec. 5.1. By offering the option of choosing one of these strategies, no scenario is left out of the system's scope.
4.4.1 Tukey’s strategy
The current strategy acknowledges the skewness of the scores' probability distribution. With the concern of finding a proper way to separate anomalous scores, the nature of the original data is temporarily ignored. One can define abnormal scores as values that are too far away from the norm, presuming the existence of some kind of cluster comprising normal scores. The current technique is inspired by John Tukey's method [15, 16], which determines the interquartile range (IQR), also known as H-spread, a measure of statistical dispersion. It can be seen as the width of a box-and-whisker plot, and it assumes that the scores' probability distribution is symmetric around a given point. The method ignores the data's mean and standard deviation, nullifying the impact of extreme scores. The IQR is hence a measure of variability robust to the presence of outliers, which is a desired trait when performing anomaly detection.
The before-mentioned cluster is assumed to be centered at the scores' median. Tukey's method uses the range between the first and third quartiles, representing the spread of the middle 50% of scores, denoted as IQR. Thus, 50% of all scores are within ±0.5 · IQR of the median. A quartile is a division of data into four intervals. The first quartile (Q1) is defined as the point below which the lowest 25% of values fall. The second quartile (Q2) is the data's median, cutting the data set in half, while the third quartile (Q3) is the point above which the highest 25% of values lie. With this in mind, the IQR is computed as
as
IQR = Q3−Q1. (4.8)
Tukey's method uses the notion of fences [16], frontiers which separate outliers from normal data. Upper and lower fences are computed as Q3 + (1.5 · IQR) and Q1 − (1.5 · IQR), respectively. The reason behind choosing 1.5 · IQR is that, for most cases, a value of 1 · IQR labels too many outliers (too exclusive), while 2 · IQR begins to classify extreme values as normal (too inclusive). An intermediate value is thus chosen and typically performs well for most situations. In a standard Gaussian distribution, a level of 1.5 · IQR flags about 1% of points as outliers, which is a reasonable value. The analyst's definition of outlier can come into play depending on the sensitivity desired. Thus, there is not a statistically driven reason for that value, it being the fruit of conducted experiments [16].
Figure 4.4: Distribution skewness types. Adapted from [88].
It is worth noting that Tukey's method should not be confused with the IQR method, as seen in [89]. The latter does not use fences when employed in outlier detection; it instead considers anomalies as values which deviate at least λ · IQR from the mean, where λ is a sensitivity parameter. John W. Tukey combined this concept with the notion of fences, representing deviations from percentiles rather than from the mean. The IQR method is not sensitive to asymmetry in the sampling distribution, since it only considers absolute deviations from the center. Tukey's mechanism improves on this aspect by introducing fences. However, it still presents an issue for many asymmetric distributions.
Transition and subject scores are classified as abnormal if their value lies below their respective lower fence or, equivalently, above their corresponding upper fence. Such can be visually observed in box plots [15]. However, the developed algorithm is only concerned with the recognition of anomalous scores below the lower fence, since outliers are defined as low-likelihood entities. Distributions are thus typically negatively skewed, as seen in Fig. 4.4. Thus, scores si holding the inequality

si ≤ Q1 − (1.5 · IQR)   (4.9)

are considered abnormal, Q1 − (1.5 · IQR) being the computed threshold.
Score distributions generated by the developed algorithm are often “negatively skewed”, with the right side of the bell curve compressed and with a long tail on the left side. Tukey's procedure typically prefers symmetric score distributions with a low ratio of outliers, having a breakdown point at about 25% [90]. In scenarios with an absence of anomalies, this mechanism is capable of reducing or even completely eliminating false positive occurrences, since fences are not forced to be within the data's observed domain.
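The threshold of Eq. (4.9) can be sketched with the Python standard library (an illustration only; quartile interpolation conventions differ between implementations, so fence values may vary slightly):

```python
import statistics

def tukey_threshold(scores, k=1.5):
    """Lower Tukey fence Q1 - k * IQR, Eq. (4.9); scores at or below it
    are classified as outliers."""
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartiles Q1, Q2, Q3
    return q1 - k * (q3 - q1)

# Five ordinary transition scores and one low-likelihood one.
scores = [-1.0, -1.2, -0.9, -1.1, -1.0, -9.5]
thr = tukey_threshold(scores)
outliers = [s for s in scores if s <= thr]
print(outliers)  # only the -9.5 score falls below the fence
```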
4.4.2 Gaussian mixture model strategy
One of the disadvantages of employing Tukey's method [15, 16] is that it implies a symmetric distribution of scores. In some applications, outliers are generated by specific phenomena which may cause the appearance of a Gaussian-like curve in the scores' histogram, rather than dispersed scores. Such situations are visible in the experiments performed in Sec. 5.1. This is typically noticeable in the presence of a larger number of anomalies. In such scenarios, Tukey's method may not provide the optimal solution: if more than one curve exists, the threshold does not separate the curves properly.
To handle such distributions, an alternate method is engaged to specifically tackle normal and abnormal curves. With inspiration from classification and clustering problems, a Gaussian Mixture Model (GMM) [17] is considered. A GMM is a probabilistic model that assumes data is generated from a finite mixture of Gaussian distributions with unknown parameters. The latter assumption is tolerable since most real-world phenomena have Gaussian-like distributions [80].
For the implementation of such a mechanism in the present work, a score array is seen as a data set whose nature can be explained by a mixture of two Gaussian curves. In this case, the threshold discovery task turns into a classification problem between two classes, normality and abnormality. The desired solution is the boundary between both curves, where there is maximum uncertainty about which class a score value belongs to.
On account of having two classes C1 and C2, an example is shown in Fig. 4.5. Let us consider classes C1 and C2 as anomalous and normal, respectively. The aim is to compute the degree of belief of assigning each computed
Figure 4.5: Gaussian distribution of two classes C1 and C2. Threshold y0 represents the point separating both curves, determined by the BCR. Adapted from [91].
score to one of the classes. The problem is defined as computing P(Ci|y), for i ∈ {1, 2}, for each score value y. This can be obtained by employing Bayes' rule:

P(Ci|y) = P(y|Ci) · P(Ci) / P(y),   i ∈ {1, 2}, (4.10)
where P (y|Ci) is the likelihood of score y belonging to class Ci, P (Ci) the priors for each class and P (y) the
evidence. Probabilities P (y|Ci) assume Gaussian distributions N(y|µi, σi) while priors should be estimated.
Evidence P (y) for each score is calculated according to
P (y) = P (y|C1)P (C1) + P (y|C2)P (C2). (4.11)
Combining Eq. (4.10) and Eq. (4.11) leads to the conclusion that for a score y to be classified as anomalous, P(C1|y) > P(C2|y) must hold, which is equivalent to P(y|C1)P(C1) > P(y|C2)P(C2). In the opposite case, scores are classified as normal. This is known as the Bayes classification rule (BCR). The BCR is the key to uncovering the boundary between both classes, thus yielding the desired threshold y0, as depicted in Fig. 4.5. Although the boundary minimizes classification error, misclassifications are bound to occur due to the overlap between both distributions.
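The boundary y0 can be located numerically as the point between the two class means where the weighted densities coincide; a sketch under the assumption of known component parameters (the helper bcr_boundary is ours, not part of the thesis software):

```python
from statistics import NormalDist

def bcr_boundary(w1, mu1, s1, w2, mu2, s2, tol=1e-9):
    """Bisection for the score y0 between mu1 < mu2 at which the weighted
    densities w_i * N(y | mu_i, s_i) are equal (maximum class uncertainty)."""
    f = lambda y: w1 * NormalDist(mu1, s1).pdf(y) - w2 * NormalDist(mu2, s2).pdf(y)
    lo, hi = mu1, mu2  # assumes exactly one sign change between the means
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# anomalous class C1 centered at -9, normal class C2 centered at -5
y0 = bcr_boundary(0.2, -9.0, 1.0, 0.8, -5.0, 1.0)
```

Scores below y0 would then be classified as anomalous by the BCR.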
The dilemma is now to discover the parameters of each Gaussian distribution. The GMM is defined as the weighted sum of the two Gaussian distributions, α1·N(Y | µ1, σ1²) + α2·N(Y | µ2, σ2²). For this, an Expectation-Maximization (EM) algorithm [92] is used to determine the values of the parameters αi, µi and σi²:
1. Initialization: Determine the initial component parameters for each curve. This commonly relies on a clustering algorithm [93], such as k-means, to assign hard labels to each data point. These labels are then fed to an MLE step [44] to obtain the initial parameters.
2. Expectation & Maximization: Assign a label to every score based on which class most likely generated it, according to the BCR. Then use MLE [44] to re-estimate αi, µi and σi² of each Gaussian using the assigned labels.
3. Repeat: Reiterate Step 2 until convergence. A log-likelihood (LL) cost function measuring the fit of the GMM to the score data set is used to determine the convergence point.
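The steps above can be sketched as follows. Note this is a soft-assignment EM variant rather than the hard-label procedure described in Step 2, and a crude sorted-half split stands in for the k-means initialization; all names are our own:

```python
import math, random

def norm_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm2(scores, iters=200):
    """EM for a two-component univariate Gaussian mixture."""
    s = sorted(scores)
    half = len(s) // 2
    # crude initialization: split the sorted scores in half (stand-in for k-means)
    mu = [sum(s[:half]) / half, sum(s[half:]) / (len(s) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        resp = []
        for y in scores:
            p = [w[i] * norm_pdf(y, mu[i], var[i]) for i in range(2)]
            tot = p[0] + p[1]
            resp.append([p[0] / tot, p[1] / tot])
        # M-step: re-estimate weights, means and variances (MLE)
        for i in range(2):
            n_i = sum(r[i] for r in resp)
            w[i] = n_i / len(scores)
            mu[i] = sum(r[i] * y for r, y in zip(resp, scores)) / n_i
            var[i] = max(sum(r[i] * (y - mu[i]) ** 2
                             for r, y in zip(resp, scores)) / n_i, 1e-6)
    return w, mu, var

random.seed(0)
scores = [random.gauss(-5, 0.4) for _ in range(180)] + \
         [random.gauss(-9, 0.4) for _ in range(20)]
w, mu, var = em_gmm2(scores)
```

On this synthetic mixture the two fitted means recover the normal (around -5) and abnormal (around -9) score clusters.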
The current strategy performs particularly well when the score distribution is discontinuous and comprises two distinct curves. Thus, as opposed to Tukey's method, the GMM excels in environments with a high proportion of outliers, where such curves are more pronounced, while having difficulty discerning false positives from anomalies in outlier-free datasets and single-curve distributions. The latter is easily explained by the fact that the implemented GMM always tries to fit an abnormal class.
A single Gaussian distribution could be fitted to tackle low outlier ratios; however, this would require the definition of a deviation measure. An example would be to classify as outliers the values which lie more than λ standard deviations from the Gaussian mean. It is worth recalling that the analyst can always enforce any threshold value or simply consider a certain percentile of scores as outliers.
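The single-Gaussian alternative described above could look like this (a hypothetical helper; λ is the analyst-chosen deviation factor):

```python
import statistics

def gaussian_threshold(scores, lam=1.5):
    """Flag scores lying more than lam standard deviations below the mean."""
    return statistics.fmean(scores) - lam * statistics.stdev(scores)

scores = [-5.0, -5.2, -4.9, -5.1, -5.3, -9.5]
flagged = [s for s in scores if s < gaussian_threshold(scores)]
```

Here only the clearly deviant -9.5 score falls below the λ-sigma threshold.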
Since the current thesis focuses on the detection of anomalies, two-component Gaussian Mixture Models are considered. Nevertheless, GMMs with a higher number of components can be employed, requiring a more complex definition of outlierness from the analyst due to the several classes encountered. In a more general case, however, the analyst typically does not know how many components yield the best fit. Information criteria are thus used, more specifically BIC maximization [94], which penalizes the number of estimated parameters by subtracting a term from the log-likelihood.
In the current thesis, GMM is employed with the aid of the freely available R package mclust [94, 95]. For
additional information regarding EM, consider [96].
4.5 Web application and software
Having the intention of providing a fully available and adaptable outlier detection mechanism, the developed
implementation is freely reachable online [3]. Most figures from experiments undertaken for the current study (see
Chapter 5) derive from the built web application and can be easily reproduced.
Figure 4.6: Front page screenshot of the developed web application reachable through [3].
The application offers support from data-formatting to score analysis together with a tutorial video showing
how to maneuver inside the application, as seen in Fig. 4.6. Results can be downloaded and parameters changed in
the midst of each procedure. Sample datasets are also accessible for immediate usage.
Figure 4.7: Screenshot from a transition score-analysis tab using the developed web application [3].
In Fig. 4.7, the score-analysis tab regarding transition outlierness is displayed. Users are capable of adjusting
visualization parameters to their liking as well as values which influence the outputted results. Score-analysis
offers automatic thresholds considering the studied strategies as well as manual readjustment. The user-interface
adapts in real time to the user’s actions.
Furthermore, since the source code is available, users can run the application in their own setups and adapt each phase to their particular endeavours, enhancing the proposed outlier detection system.
Chapter 5
Experimental results
A vital stage when developing any algorithm is defining its performance. Several experiments are conducted
using synthetic data as well as datasets available from multiple sources and applications. The goal is to first
demonstrate the method’s outcome in controlled conditions before proceeding to more complex and real-world
scenarios. Results are discussed according to their accuracy, feasibility and other criteria. Nevertheless, it is worth noting that each situation calls for different means: favorable practices in one specific domain do not necessarily imply satisfying results in every application.
The proposed approach is tested against simulated data. Experiments are divided into two groups, corresponding to the two possible strategies in the score-analysis phase, as seen in Fig. 4.1. This means that both strategies are used to validate the approach. Simulated experiments considering a PST method are carried out afterwards and contrasted with the proposed system. Furthermore, real datasets are addressed using the developed approach, displaying its versatility.
5.1 Simulated data
The first step consists of verifying the approach's performance and consistency in controlled scenarios. More specifically, the present experiments consist of training two separate DBNs, one for generating normal data, denominated DBNN, and another to produce outliers, DBNO. All data is then mixed together into a single dataset with the intention of identifying the intruders inserted by DBNO. The produced dataset is used to model a DBN embodying standard data behavior. Under perfect conditions, the latter is identical to the DBN employed to generate normal instances, DBNN.
Let us first understand the assumptions of such operations. The obtained performance will not only depend
on the ratio of outliers compared to normal data but also on how divergent these are. Several difficulty levels are
analyzed in order to capture the method’s limits. The higher the ratio of outliers, the more the attained model
admits them. This results in a scoring procedure which attenuates the outlierness of the deviants, possibly increasing the number of false negatives (FN) and false positives (FP). The proportion of abnormality is thus a factor to consider; in such situations the trained model is said to be overfitted with respect to the outliers. It is worth noting that, in general, the more data supplied to the algorithm the better; however, the system's ordinary behavior must be captured with its corresponding proportion. Few subjects can lead to overfitting of the transition networks, especially when considering a non-stationary DBN, yielding a precarious model which does not excel in practice. Hence, measurement should aim at capturing a system's typical behavior instead of focusing solely on catching anomalies.
Performance measures: The final stage of the algorithm can be perceived as a binary classification task.
Hence, performance measures should be defined to evaluate each experiment. For the actual detection of inserted
anomalies, the precision of an experiment, also known as Positive Predictive Value (PPV), is measured as

PPV = TP / (TP + FP), (5.1)

where TP is the number of true positives detected and FP the number of false positives, that is, normal observations misclassified as anomalous. The result provides the probability of randomly selecting a true outlier from all observations classified as abnormal. Another performance measure is recall, also known as True Positive Rate (TPR),

TPR = TP / (TP + FN), (5.2)

where FN is the number of false negatives, that is, outlying observations misclassified as normal. TPR provides the probability of a true outlier being correctly detected. To jointly consider both indicators in
the upcoming experiments, a measure denoted as the F1 score is computed,

F1 = 2 · (PPV · TPR) / (PPV + TPR). (5.3)

Eq. (5.3) is the harmonic mean of PPV and TPR, with both evenly weighted. Values of the F1 measure range from 0 to 1, with 1 corresponding to perfect precision and recall. An accuracy measure may also be calculated as

ACC = (TP + TN) / (TP + TN + FP + FN), (5.4)
which represents the probability of correctly classifying an observation, with TN being the number of normal instances correctly classified.
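Eqs. (5.1)-(5.4) translate directly into code; a minimal sketch with illustrative counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """PPV, TPR, F1 and ACC from the confusion-matrix counts of one experiment."""
    ppv = tp / (tp + fp)                   # Eq. (5.1), precision
    tpr = tp / (tp + fn)                   # Eq. (5.2), recall
    f1 = 2 * ppv * tpr / (ppv + tpr)       # Eq. (5.3), harmonic mean
    acc = (tp + tn) / (tp + tn + fp + fn)  # Eq. (5.4), accuracy
    return ppv, tpr, f1, acc

# e.g. 45 of 50 generated outliers found, 5 normal subjects wrongly flagged
ppv, tpr, f1, acc = classification_metrics(tp=45, fp=5, fn=5, tn=945)
```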
Experiments are conducted to scrutinize subject outliers arising from transition anomalies. All the DBNs in the current chapter are modeled using the LL score.
Data generation: To produce a dataset generated by a DBN, normal subjects are created by the transition network described in Fig. 5.1, representing DBNN. The majority of subjects belong to DBNN, while the rest are created by a different DBN whose degree of divergence from DBNN reflects the difficulty level being tested. More similar networks produce better-concealed outliers. Note that the model used by the outlier detection algorithm is trained on the mixed dataset, and is thus expected to differ from the normal model DBNN.
For simplicity, first-order stationary DBNs are considered, noting that the choice of model parameters in the training phase further influences the obtained model and consequently the final outcome. With this in mind, parameters do not differ between experiments. All variables have a domain Σ of 3 symbols, also known as the alphabet size, while subjects are MTS comprising T = 10 time instants. Every modeling undertaken in the synthetic
data experiments considered a stationary DBN with unitary lag and one possible attribute from preceding slices. The aim is to test the impact of the proportion of outlierness and not the choice of modeling parameters, leaving such discussions for the real data sets.
Figure 5.1: Transition network of the stationary first-order DBN used for generating normal subjects, DBNN .
All attributes from the non-outlier model are conditioned by their preceding values, as seen in Fig. 5.1. A correlation between variables X5 and X2 persists. For generating anomalous subjects, two different DBNs are considered, shown in Fig. 5.2. These differ from DBNN: dashed arrows represent removed connections and red edges relations that have been added with respect to DBNN. Black non-dashed links illustrate connections common to the abnormal and normal networks. Such dissimilarities influence the difficulty of the generated outliers. It is worth noting that, the models being stationary, the transition networks shown are replicated along every time window. All DBNs possess the same initial network, so that the latter does not influence the obtained results.
By observing both anomalous models displayed in Fig. 5.2, network DBNC is structurally more dissimilar from DBNN. DBNC retains only two of the original connections while increasing the number of relationships among different variables. It is thus expected to output more dissimilar outliers when compared to DBNB.
Figure 5.2: Transition networks of the stationary first-order models DBNB (a) and DBNC (b) used for generating anomaly subjects. Dashed connections represent links which are removed with respect to the normal network, displayed in Fig. 5.1, while red links symbolize added dependencies. Non-dashed black edges are connections which are also present in the normal network.
5.1.1 Proposed approach
The proposed approach is tested against synthetic data. Experiments are divided into two groups, corresponding to the two possible strategies in the score-analysis phase.
5.1.1.1 Tukey’s strategy
As a first contact, every test is examined using Tukey’s method (see Sec. 4.4.1 in Page 39) in the score-analysis
phase. Advantages and disadvantages are discussed and later compared with a GMM strategy.
Results are shown in Table 5.1, where each experiment identifier DBNO PO Ns indicates which anomalous network, DBNB or DBNC, is employed, the percentage PO of subjects belonging to said anomalous network, and the total number of subjects in the dataset Ns. Every value is rounded to two decimal places and represents an average over 5 trials, since every experiment is executed 5 times. It is worth recalling that these experiments have the objective of uncovering subject outlierness. However, since a subject score is determined by an average of all the transition scores of that subject, it is possible to observe which transitions contributed to the classification of each subject, thus validating both disclosures simultaneously.
Observing the results of outlier detection using either anomaly network with Tukey's method, available in Table 5.1, it is apparent that datasets with only 100 subjects perform generally poorly. This is explained by the fact that they do not possess enough information about the data's underlying processes. Replicating the results of an experiment with only 100 subjects is difficult, since these are very sensitive to each observation. Recall that data is generated randomly according to each network's parameters. Accuracy (ACC) as well as F1 scores tend to decline with the increase of outlier ratios, which is expected due to less normal data being available for a correct modeling phase. This is observed in the decrease of TPR measurements. The computed thresholds converge to more stable
Table 5.1: Results for subject outlier detection with simulated data sets while employing Tukey's score analysis. An experiment DBNO PO Ns possesses a percentage PO of anomaly subjects generated using DBNO, which can be DBNB or DBNC, from a total number of subjects in the dataset Ns. Every performance measure represents an average of 5 separate tests. The threshold Thresh computed for each experiment using Tukey's method is an average over the 5 trials. Column NO indicates the number of outlier subjects in each experiment.
Experiment    NO   Thresh  PPV  TPR  ACC  F1      Experiment    NO   Thresh  PPV  TPR  ACC  F1

B 5 100       5    -8.72   0.88 0.70 0.98 0.78    C 5 100       5    -8.61   0.89 0.73 0.98 0.80
B 5 1000      50   -5.56   0.93 0.96 0.99 0.94    C 5 1000      50   -5.46   0.91 0.98 0.99 0.94
B 5 10000     500  -5.47   0.95 0.98 0.99 0.96    C 5 10000     500  -5.32   0.94 1.00 0.99 0.97
B 10 100      10   -9.85   0.96 0.38 0.94 0.54    C 10 100      10   -8.72   0.89 0.73 0.97 0.80
B 10 1000     100  -5.61   0.99 0.87 0.99 0.93    C 10 1000     100  -5.57   0.97 0.87 0.98 0.92
B 10 10000    1000 -5.56   0.99 0.91 0.99 0.95    C 10 10000    1000 -5.55   0.99 0.87 0.98 0.93
B 20 100      20   -10.01  1.00 0.19 0.83 0.32    C 20 100      20   -8.71   0.90 0.22 0.84 0.35
B 20 1000     200  -6.08   1.00 0.20 0.84 0.33    C 20 1000     200  -6.17   1.00 0.37 0.87 0.54
B 20 10000    2000 -6.09   1.00 0.16 0.83 0.28    C 20 10000    2000 -6.01   1.00 0.29 0.86 0.45
values with the increase of data, causing oscillations to tend to zero and hence yielding more reliable values for every performance measure.
Figure 5.3: Subject outlierness histogram for an experiment C 20 10000 showing thresholds for both score-analysis strategies. Scores below the threshold are classified as abnormal (in red) while scores above the threshold are classified as normal (in green). Scores around the threshold are displayed in yellow. The color representation considers Tukey's threshold; the same reasoning applies to the GMM threshold.
Discussing the impact of outlier ratios, Tukey's method is recognized to be more effective in the presence of lower anomaly percentages. This arises from the fact that the score distribution becomes increasingly asymmetric with the appearance of more extreme scores, as confirmed in existing literature [89]. Moreover, when the ratio is high enough and the majority of outliers are generated by a common process, the score distribution of abnormal data becomes visible, as observed in Fig. 5.3. The latter is not handled by Tukey's method. Such phenomena explain why, for the same ratio, F1 scores may decrease with the increase of subjects when employing Tukey's threshold in large datasets. Furthermore, the breakdown point of Tukey's method [90] prevents favorable results in the presence of abundant outlierness. On the other hand, false positives tend to disappear, reflecting high precision measurements. This is due to the fact that Tukey's thresholds avoid the main distribution; however, it also allows the number of false negatives to rise.
Comparing experiments of both networks DBNB and DBNC, accuracy is in general higher in experiments with DBNC. This is expected, since the latter has fewer connections in common with DBNN, resulting in a more dissimilar structure. However, this does not always hold: score distribution asymmetry may disturb Tukey's analysis in extreme cases.
A control test to verify the system's behavior in the absence of outliers is performed. The aim is to check the number of false positives, if any, and compare the trained model with the presumed one. Results showed that all the connections from DBNN are present; however, additional links are trained. It is worth recalling that the modeling algorithm may have multiple solutions and, furthermore, in the case of modeling with the LL score, a higher number of relations tends to be formed when compared to MDL. The control test scores are seen in Fig. 5.4. Tukey's strategy outputs a threshold of -5.57, which is tolerable given the distribution. However, the GMM mechanism yields a value of -5.06, which bears an excessive number of false positives. This results from the fact that a second, outlying distribution is expected, thus proving the convenience of employing Tukey's approach in scenarios with reduced abnormality.
Figure 5.4: Subject outlierness histogram for a control experiment showing thresholds for both score-analysis strategies. Scores below the threshold are classified as abnormal (in red) while scores above the threshold are classified as normal (in green). Scores around the threshold are displayed in yellow. The color representation considers Tukey's threshold; the same reasoning applies to the GMM threshold.
It is worth noting that, despite false positives being unwanted, their appearance is expected in the present experiments. The non-outlier DBN, seen in Fig. 5.1, generates certain sequences with a low probability due to its parameters. When creating a large number of transitions, these are bound to occur. The scoring phase assigns a low likelihood to such instances, causing misclassifications.
5.1.1.2 GMM strategy
Inspecting Tukey's analysis results, the experiments with larger anomaly ratios bear low F1 values due to high counts of false negatives. This is coherent with the idea that higher outlier ratios disturb the training phase of the algorithm and thus the final results. However, the main reason for this outcome is the presence of an outlier curve in the score distribution, as seen in Fig. 5.3. The latter occurs due to the high proportion of outliers formed by a specific mechanism, in this case an abnormal DBN. Tukey's method does not perform well in such scenarios.
A GMM strategy is thus employed. The datasets used in each experiment are exactly the same as those tested with Tukey's method. Hence, the score-analysis phase is the only difference among the tests, affecting the computed threshold. The results are available in Table 5.2, where a considerable increase in recall becomes noticeable for experiments with higher proportions of outliers when compared with the results from Table 5.1. Such behavior has been confirmed in existing literature [89].
Figure 5.5: Comparison between GMM and Tukey's score-analysis F1 scores for multiple outlier ratios. Each value is an average of all 15 trials performed for each outlier ratio.
Table 5.2: Results for subject outlier detection with simulated data sets while employing the GMM strategy in score analysis. An experiment DBNO PO Ns possesses a percentage PO of anomaly subjects generated using DBNO, which can be DBNB or DBNC, from a total number of subjects in the dataset Ns. Every performance measure represents an average of 5 separate tests. The threshold Thresh computed for each experiment using the GMM strategy is an average over the 5 trials. Column NO indicates the number of outlier subjects in each experiment.
Experiment    NO   Thresh  PPV  TPR  ACC  F1      Experiment    NO   Thresh  PPV  TPR  ACC  F1

B 5 100       5    -8.78   0.82 0.70 0.98 0.76    C 5 100       5    -7.51   0.64 1.00 0.96 0.78
B 5 1000      50   -5.71   0.91 0.97 0.99 0.94    C 5 1000      50   -5.40   0.86 0.99 0.99 0.92
B 5 10000     500  -5.45   0.95 0.98 0.99 0.96    C 5 10000     500  -5.49   0.98 1.00 0.99 0.99
B 10 100      10   -8.55   0.77 0.68 0.93 0.72    C 10 100      10   -8.64   0.92 0.78 0.97 0.84
B 10 1000     100  -5.43   0.94 0.96 0.99 0.95    C 10 1000     100  -5.24   0.89 0.97 0.98 0.93
B 10 10000    1000 -5.27   0.91 0.98 0.99 0.94    C 10 10000    1000 -5.28   0.93 0.96 0.99 0.95
B 20 100      20   -8.18   0.66 0.49 0.85 0.56    C 20 100      20   -7.17   0.75 0.58 0.88 0.65
B 20 1000     200  -5.18   0.86 0.89 0.94 0.87    C 20 1000     200  -5.24   0.91 0.92 0.96 0.92
B 20 10000    2000 -5.13   0.86 0.94 0.96 0.90    C 20 10000    2000 -5.11   0.93 0.94 0.97 0.94
In general, the number of false positives is higher when employing the GMM strategy. This is due to the fact that the GMM always assumes the existence of an abnormal model, even in its absence. Due to similarities between the DBNs, especially when considering DBNB, scores from both curves tend to mix together around the threshold, making them difficult to discern. An example is the experiment from Fig. 5.3, where the GMM threshold conveniently separates both classes. With this in mind, for the current scenarios, the GMM approach typically has higher recall but lower precision, with thresholds smaller in absolute value. This is more visible at higher ratios of outlierness, since in the presence of fewer anomalies Tukey's method displays higher F1 scores. In Fig. 5.5, the average F1 score of every experiment using a specific method and outlier ratio is shown. Tukey's method performs very poorly on datasets with 20% of anomalies while outperforming the GMM strategy on datasets with 5% of anomalies as well as in control experiments.
It is worth recalling that the GMM strategy assumes that two classes with Gaussian distributions exist; in real-world scenarios it may happen that multiple curves appear in the score distribution even if there is only one normal class. In such cases, the current approach would likely misclassify excessively. The algorithm is easily adapted to several classes by employing a GMM with more than two components. However, this requires a more in-depth intervention and reasoning by the analyst.
Although the present synthetic experiments seem to endorse the use of a GMM strategy, one should note that both normal and abnormal data are generated according to two defined models which can, to some degree, be separated. Such is a favorable scenario for the GMM. In other scenarios, Tukey's method is not so susceptible to the presence of well-defined curves, being thus always a strategy to consider. With that said, the analyst is advised to apply as much knowledge as possible through the experimentation of both strategies.
5.1.2 Comparison with probabilistic suffix trees
To benchmark the developed system, an additional multivariate outlier detection system is built. The latter adopts PSTs with the aim of mining abnormal values within each TS. This alternative approach modifies the modeling and scoring phases of the outlier detection system, meaning that the pre-processing and score-analysis phases are the same as in the proposed approach.
As discussed in Sec. 3.2.1, PSTs are only capable of modeling univariate time series. To tackle MTS, various considerations are undertaken. First, each dataset is divided into multiple groups, each one containing data concerning solely one variable i ∈ {1, ..., n}. Every group is used to model a PST P^i. A subject j is used to form sequences X_i^j, which are univariate TS for each variable i. Thus, each sequence X_i^j comprises the observations of every time frame t ∈ {1, ..., T} concerning variable i. A dataset comprising 5 variables models 5 trees, which do not interchange any information. Each tree computes a univariate score for all TS according to Eq. (3.10) in Page 20. Scores from all the separate trees belonging to a common subject j of the original dataset are gathered.
The mean of each score array is computed as

SubjectScore_j = (1/n) · Σ_{i=1}^{n} logloss(X_i^j), (5.5)

which constitutes the multivariate score for each MTS. Note that the present algorithm does not consider relationships among variables, solely within univariate TS. In Eq. (5.5), n represents the total number of variables, while logloss(X_i^j) denotes the univariate score of sequence X_i^j, computed with PST P^i, concerning variable i and subject j.
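Eq. (5.5) amounts to averaging the per-variable PST scores of each subject; a minimal sketch with hypothetical per-variable log-loss values (the helper name is ours):

```python
def subject_scores(scores_by_variable):
    """Eq. (5.5): the multivariate score of subject j is the mean of its
    n univariate PST scores logloss(X_i^j), one per variable i."""
    n = len(scores_by_variable)
    n_subjects = len(next(iter(scores_by_variable.values())))
    return [sum(scores_by_variable[i][j] for i in scores_by_variable) / n
            for j in range(n_subjects)]

# hypothetical log-loss scores for 3 subjects over n = 2 variables
scores_by_variable = {1: [-1.2, -1.4, -3.0], 2: [-1.0, -1.6, -2.8]}
multi = subject_scores(scores_by_variable)
```

The third subject, with the lowest average score, would be the prime outlier candidate.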
An existing PST modeling package [97] is used. Each experiment performed is compared with the proposed approach. Results are shown in Tables 5.3 and 5.4. Likewise, the PST approach employs score analysis posterior to scoring, selecting one of the two considered strategies. Every PST is modeled with the maximum possible depth and an nmin of 30. Pruning using the G1 function with a C of 1.2 is applied (see Sec. 3.2.1).
Experiments performed with only 100 subjects did not provide satisfactory results, since they do not grant sufficient information; tree modeling and pruning would have to be adapted to such scenarios. Tests are thus focused on datasets with sufficient material. The datasets used in the present mechanism are common to the ones employed with the DBN approach. Control experiments using solely normal data with the multivariate PST system displayed a result identical to the one shown in Fig. 5.4. Results demonstrate the low performance of the PST
Table 5.3: Results of subject outlier detection using the PST approach with Tukey's score analysis. An experiment DBNO PO Ns possesses a percentage PO of anomaly subjects generated using DBNO, which can be DBNB or DBNC, from a total number of subjects in the dataset Ns. Every performance measure represents an average of 5 separate tests. The threshold Thresh computed for each experiment using Tukey's strategy is an average over the 5 trials. Column NO indicates the number of outlier subjects in each experiment.
Experiment NO Thresh PPV TPR ACC F1
B 5 10000 500 -1.55 0.96 0.73 0.98 0.83
C 5 10000 500 -1.61 0.96 0.94 0.99 0.95
B 10 10000 1000 -1.65 0.70 0.02 0.90 0.04
C 10 10000 1000 -1.63 0.98 0.39 0.94 0.56
B 20 10000 2000 -1.74 0.42 0.00 0.80 0.00
C 20 10000 2000 -1.73 1.00 0.03 0.81 0.06
Table 5.4: Results of subject outlier detection using the PST approach with GMM score analysis. An experiment DBNO PO Ns possesses a percentage PO of anomaly subjects generated using DBNO, which can be DBNB or DBNC, from a total number of subjects in the dataset Ns. Every performance measure represents an average of 5 separate tests. The threshold Thresh computed for each experiment using the GMM strategy is an average over the 5 trials. Column NO indicates the number of outlier subjects in each experiment.
Experiment NO Thresh PPV TPR ACC F1
B 5 10000 500 -1.47 0.86 0.88 0.99 0.87
C 5 10000 500 -1.59 0.94 0.95 0.99 0.94
B 10 10000 1000 -1.27 0.20 0.87 0.65 0.33
C 10 10000 1000 -1.55 0.88 0.68 0.96 0.77
B 20 10000 2000 -1.30 0.25 0.67 0.53 0.36
C 20 10000 2000 -1.45 0.76 0.88 0.92 0.82
approach when discerning anomalies generated by DBNB. This is explained by the fact that this network is much more similar to DBNN than DBNC is. Furthermore, since inter-variable relations are not considered, subjects become identical when seen by the PSTs. Hence, the resulting score distributions display a single curve blending both classes. One exception is the set of experiments considering only 5% of anomalies, which indicates that, with the increase of outlier ratios, the few dissimilarities among classes are modeled, causing outliers to fit the PSTs. Regarding experiments with DBNC, although outputting worse results than the proposed approach, both classes are more easily discerned. This is mainly due to the differences present in the intra-variable relationships when compared with the normal network.
Figure 5.6: Subject outlierness histograms using the proposed approach (a) and the PST approach (b) for the same experiment C 20 10000. Both histograms display thresholds using both score-analysis strategies. Scores below the threshold are classified as abnormal (in red) while scores above the threshold are classified as normal (in green). Scores around the threshold are displayed in yellow. The color representation considers Tukey's thresholds; the same reasoning applies to the GMM thresholds.
In Fig. 5.6, a comparison between both approaches is shown for an experiment with 20% of outliers from DBNC. The PST system cannot separate both classes as well as the DBN approach does, demonstrating the importance of the inter-variable relationships present in DBNC and the robustness of the DBN approach. In general, the PST approach scales poorly with the increase of outlier ratios. It is worth recalling that the datasets are generated using DBNs, which gives an advantage to the proposed approach. However, the outlier networks possess intra-variable dependencies which are captured by PSTs, making them a suitable target of comparison.
5.2 Real Multivariate Time Series
With validation completed in Sec. 5.1, real-world scenarios from numerous applications are tackled. The objective is to demonstrate the employment of the developed approach across different usages.
5.2.1 Electrocardiogram dataset
A common application of anomaly detection in medical scenarios is in Electrocardiogram (ECG) alert systems [2]. These have the capability of detecting unusual patterns in electrical signals measured from patients. Data is usually undiscretized and presents expected patterns under normal circumstances. The experiments shown demonstrate that the present approach is helpful in detecting suspect regions. A complete study from pre-processing to score analysis is thus made available.
The ECG dataset is available at [98] and is composed of 200 MTS. Tests are performed mainly using non-stationary DBNs, since specific phenomena occur at particular time instants, a peculiarity of ECG signals. The peaks of the ventricular contraction typically occur at time frames 3 and 10, as shown in Fig. 5.7, where a representation of the whole MTS dataset is given. It is worth noting that the current experiments have the objective of testing the system's reaction to noisy and unstable data.
Figure 5.7: Mean and standard deviation of normalized ECG variables along time using a SAX alphabet of 8.
The test seen in Fig. 5.8 shows that the current approach has difficulty evaluating time slices with higher variance. This is due to the absence of a predominant pattern in the location of the peaks, which varies considerably from subject to subject, unlike later slices where the data behaves more predictably. Transitions must not be regarded as time slices, since they are windows over the original TS and thus possess a time delay proportional to the Markov lag of the model.
As shown in Fig. 5.7, SAX discretization offers low definition at the ends of the peaks; this is due to the signal's variability, which is not uniform across frames. Excessively increasing the alphabet size makes each observation unreasonably sensitive to noise. Moreover, overfitting may occur. PAA dimensionality reduction is not employed in the current tests due to the low number of frames.
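As an illustration of the discretization step, a minimal SAX sketch (without the PAA stage, which the ECG tests skip) is given below. The breakpoints are the standard N(0,1) quantiles for an alphabet of size 5; the function name and the synthetic signal are illustrative only:

```python
import numpy as np

# Standard SAX breakpoints for an alphabet of size 5: quantiles of N(0, 1)
# that split the real line into five equiprobable regions.
BREAKPOINTS_5 = np.array([-0.8416, -0.2533, 0.2533, 0.8416])

def sax_discretize(series, breakpoints=BREAKPOINTS_5):
    """Z-normalize a 1-D series and map each value to a symbol index."""
    z = (series - series.mean()) / series.std()
    return np.searchsorted(breakpoints, z)  # symbols in 0..len(breakpoints)

# one ECG-like variable over 50 frames
x = np.sin(np.linspace(0, 6, 50))
symbols = sax_discretize(x)
```

A larger alphabet simply adds breakpoints, which, as noted above, makes the symbols more sensitive to noise.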
The system proved its ability to detect unusually behaved sections in ECGs, which coincide with the high-variance portions in Fig. 5.8, showing that it can be employed to detect such peaks. The latter series are discretized with a SAX alphabet of 5. The score distribution is negatively skewed, as seen in Fig. 5.10(a), and thus appropriate for Tukey's threshold. The modeling phase used a second-order DBN, which demonstrated favorable results.
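The Tukey score-analysis strategy can be sketched as follows: since low log-likelihood scores indicate abnormality, only the lower fence Q1 − 1.5·IQR matters. The function name and synthetic scores below are illustrative, not the thesis's implementation:

```python
import numpy as np

def tukey_lower_threshold(scores, k=1.5):
    """Tukey's lower fence Q1 - k*IQR; scores below it are flagged."""
    q1, q3 = np.percentile(scores, [25, 75])
    return q1 - k * (q3 - q1)

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(-1.0, 0.2, 200),   # normal transitions
                         rng.normal(-4.0, 0.3, 10)])   # anomalous tail
thr = tukey_lower_threshold(scores)
outliers = scores < thr
```

The fence adapts to the bulk of the distribution, which is why a negatively skewed, single-cluster score distribution suits it well.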
Furthermore, the occurrence of ventricular peaks in unusual slices is likewise uncovered. To test this, 10% of the subjects are flipped and mixed into the original set. The objective is to detect the now abnormal transitions. Results in Fig. 5.9 demonstrate the detection of such transitions, which are present on
Figure 5.8: ECG transitions arranged by subject using a non-stationary second-order DBN with Tukey's score analysis. Data is discretized using SAX with an alphabet of size 5. Transitions displayed in red are classified as abnormal while green ones are classified as normal.
Figure 5.9: ECG dataset transitions arranged by subject. A non-stationary second-order DBN model is used together with Tukey's score analysis. Flipped subjects are associated to the highest subject ids. Data is discretized using SAX with an alphabet of 5 symbols. Transitions displayed in red are classified as abnormal while green ones are classified as normal.
subjects with higher id.
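The contamination step above can be reproduced schematically, assuming that "flipping" a subject means reversing its time axis; the array shapes and counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy discretized MTS set: 50 subjects x 5 variables x 13 time frames
data = rng.integers(0, 5, size=(50, 5, 13))

n_flip = len(data) // 10              # roughly 10% of the subjects
contaminated = data.copy()
# reverse the time axis of the last n_flip subjects (highest subject ids)
contaminated[-n_flip:] = data[-n_flip:, :, ::-1]
```

The flipped subjects keep the same symbol statistics per variable, so only a model sensitive to temporal order, such as a non-stationary DBN, can expose them.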
To test the impact of the discretization phase, the exact same experiment is conducted, with the difference that all MTS are discretized with 10 symbols instead of 5. The transition score distribution in Fig. 5.10(b) starts to reflect multiple clusters characteristic of different types of patterns. The second major group, formed by the transitions of the peaks, becomes larger; it lies between the normal cluster and an anomalous cluster which appeared on the far left. With the increase in granularity, the system begins to better differentiate the data's different contexts, an example being the type of ventricular peak each subject possesses. Transition peaks with common characteristics start to join together.
It is worth noting that in non-stationary scenarios, transition scores should be compared carefully. This comes from the fact that scores from different transitions are evaluated by distinct networks: a normal value for one network can be anomalous for another. With this in mind, a separate score analysis could be performed for each transition score distribution. This is not evaluated in the current approach. Moreover, the number of FP may increase, since scores depicted as normal globally can be seen as anomalies when considering solely their own transition.
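A per-transition variant of the score analysis, which is left unevaluated here, could be sketched as below: Tukey's lower fence is applied column by column instead of to the pooled scores. All names and numbers are illustrative:

```python
import numpy as np

def tukey_lower(scores, k=1.5):
    """Tukey's lower fence Q1 - k*IQR for a 1-D array of scores."""
    q1, q3 = np.percentile(scores, [25, 75])
    return q1 - k * (q3 - q1)

rng = np.random.default_rng(2)
# scores[s, t]: score of subject s at transition t; each transition is
# evaluated by a different network, so the columns have different baselines
scores = rng.normal(loc=[-1.0, -2.5, -1.2], scale=0.2, size=(100, 3))

# a global analysis pools the three transition distributions together...
global_flags = scores < tukey_lower(scores.ravel())
# ...whereas a per-transition analysis thresholds each column separately
per_transition_flags = np.column_stack(
    [scores[:, t] < tukey_lower(scores[:, t]) for t in range(scores.shape[1])]
)
```

With the pooled threshold, the lower-baseline column dominates the fence; the per-column variant judges each transition against its own network's distribution, at the cost of one analysis per transition.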
Figure 5.10: Comparison of ECG transition score distributions using a SAX alphabet Σ of size (a) 5 and (b) 10 symbols in the pre-processing phase. Transitions are modeled by a non-stationary second-order DBN. Both histograms display a Tukey threshold. Scores represented in red are classified as anomalous while those in green are classified as normal. Scores around the threshold are displayed in yellow.
5.2.2 Mortality Dataset
To further test the proposed approach, consider a cell-wise outlier detection scenario where a multivariate dataset is confined to a data matrix. Such a scenario is studied in [99], where the data is assumed to have been generated from a multivariate Gaussian distribution. After column standardization, a univariate outlier detection system is applied to each variable, forming a new matrix. Correlations between pairs of variables are quantified for the computation of predictions for all cells. The proposed approach, DetectDeviatingCells [99], classifies cell-wise as well as row-wise anomalies. One of the examples tested in [99] refers to a dataset comprising male mortality in France from the 19th century onward, deriving from [100]. The intention is to discover outlying years. Therefore, the present mission is to detect the main iconic events which occurred in 19th- and 20th-century France, providing an explanation.
In order to adapt the dataset to the developed approach, a number of transformations have to be enforced. Besides not being discretized, the data is structured as a matrix where the only apparent variables are the mortality rates for each age group. To remodel it into the desired longitudinal format, several conversions are performed:
• Variables: the ages, ranging from 1 to 91 years old, are regarded as variables Xi, meaning that correlations among mortality rates of different ages can be modeled.
• Time: the rows of the original matrix, which correspond to the years in which measurements were obtained, are now seen as time instants t for each variable Xi.
• Reduction: due to the excessive number of variables, only a subset is selected.
Experiments conducted with the total number of variables revealed poor results due to excessive ambiguity. It is worth noting that, with all the transformations performed, the dataset is reduced to a single subject which portrays one MTS. Attributes Xi[t] are thus mortality rates of specific age groups at particular years. After the mentioned procedures, the resulting series, formatted according to Eq. (3.2), is pre-processed as discussed in Sec. 4.1.
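The conversions above amount to a column selection and a transpose; a minimal sketch with a synthetic stand-in matrix (the real rates come from [100]) could be:

```python
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1841, 1988)                   # rows of the original matrix
n_ages = 91                                     # columns: ages 1..91
mortality = rng.random((len(years), n_ages))    # stand-in for the real rates

ages_kept = [20, 30, 40, 60, 80]                # the reduced variable subset
# X[i, t]: variable X_i (an age group) at time instant t (a year),
# i.e. a single MTS subject in the format of Eq. (3.2)
X = mortality[:, [a - 1 for a in ages_kept]].T
```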
Since the data comprises only a single subject, non-stationary models cannot be employed: they would defeat the purpose due to overfitting, as only one observation exists for each transition. Hence, a stationary DBN is modeled to capture abnormal male mortality rates throughout the years, represented by transitions.
Each variable of the MTS is discretized with an alphabet size of 5, perceptible in Fig. 5.11. This size is selected because larger alphabets would encode the variables' small oscillations. Reinforcing the fact that the available data is minimal, comprising only one subject, excessive sensitivity proved ineffective. All tests employ Tukey's method in the score analysis phase.
Two separate experiments are presented in Fig. 5.12. A dataset comprising France's male mortality rates from 1841 to 1987 is used. In the first experiment, 5 variables are selected, namely ages 20, 30, 40, 60 and 80, which correspond respectively to variables V2–V6 in Fig. 5.11. The objective is to determine unusual events such as wars and epidemics, hence the choice of ages mainly belonging to the adult category. The trained model is a stationary third-order DBN. Nodes are allowed to have at most 1 parent from previous slices. The reasoning behind the parameter choice is purely experimental. The problem exhibits a preference for attributes establishing connections with previous nodes which are not consecutive with them; that is, explaining the current value typically requires knowledge of values from 2 or 3 slices before. It is worth recalling that having a lag of 3 does not mean that every, or even any, relation has such a lag; it merely offers that possibility.
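To make the lag remark concrete: with a Markov lag of 3 and at most one parent from previous slices, structure learning merely chooses one parent per node from the pool of all variables at lags 1 to 3, and is free to prefer lag 2 or 3. A sketch of that candidate pool:

```python
from itertools import product

variables = ["V2", "V3", "V4", "V5", "V6"]  # the 5 selected age groups
lags = [1, 2, 3]                            # a third-order DBN looks back 3 slices

# every (variable, lag) pair is a candidate parent of a node X_i[t];
# the structure search then keeps at most one of them per node
candidate_parents = [(v, lag) for v, lag in product(variables, lags)]
```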
Figure 5.11: Normalized values of variables Vi, i ∈ {1, . . . , 6}, representing France's mortality rates of males with ages 10, 20, 30, 40, 60 and 80, respectively, from 1841 to 1987. Each time stamp represents a year. Data is discretized with a SAX alphabet Σ of 5 symbols.
The results confirm major events which shook France's history. These are displayed in Fig. 5.12, representing both World Wars, the influenza pandemic, the Franco-Prussian War and the European revolutionary wave of 1848. France was a belligerent in several conflicts, as well as colonization wars, in the 1850s.
In the second experiment, variable V1 is added to the first set. The new age represents the male mortality rate of children aged 10 years old. The goal is to capture the impact of youth mortality on the outputted years. The results are similar, with the differences observed in the 1860s and around the Spanish flu. Youth is more susceptible to epidemics, which is perceived by the algorithm. Due to industrial development as well as overcrowding, diseases like cholera made infant mortality peak around the detected years [101]. Note the increase of outliers around the Spanish flu in Fig. 5.12, reinforcing previous statements. With the introduction of public healthcare as well as asepsis for preventing infections, infant mortality rates stabilized through the years, explaining the lack of anomalies in later transitions.
Figure 5.12: Transition outlierness for mortality datasets of (a) 5 and (b) 6 variables using a third-order DBN. The first dataset is comprised of 5 variables representing mortality rates of males with ages 20, 30, 40, 60 and 80. The second dataset includes the same variables as the first, with the addition of a variable representing the mortality rate of males aged 10. Transitions are arranged by year and classified as anomalous (red) and normal (green). Major wars and epidemics which affected France in the selected years are exhibited.
Recalling that each transition is represented by a window dependent on the model's structure, a year disclosed as an anomaly does not always imply that the respective year is abnormal in itself, but rather that its appearance is abnormal given the window's observations. Hence, the approach captures the starting and ending points of wars with more ease. An example is WW1, which is uncovered at 1914; later war years are not flagged since the mechanism already expected high mortality rates given the observed attributes.
A compelling result arises when modeling the reduced dataset of 5 variables with a fourth-order DBN, depicted in Fig. 5.13. By overfitting the model, two major score areas appear. Further investigation reveals that transitions classified as normal prior to WW2 belong to one score group, while regular ones from after WW2 reside in another. Such is marked in Fig. 5.13, the results being explained by medical and societal shifts as well as the introduction of penicillin during WW2 [102]. The distinct patterns pre and post WW2, roughly around time stamp 100, are also observable in Fig. 5.11.
Results obtained in [99] are thus confirmed in the current study. Experiments using the current approach in data
Figure 5.13: Transition score distribution for the mortality dataset using Tukey's threshold. Data is modeled using a fourth-order DBN. Transitions are classified as anomalous (red) and as normal (green). Due to overfitting, scores are concentrated in two major clusters representing the expected behavior of France's male mortality rates before and after WW2, respectively.
sets including years after the 1980s showed minimal differences when compared with the presented ones. This is explained by the absence of sudden changes in mortality rates.
5.2.3 Pen digits dataset
A distinct application is the recognition of drawn digits. With measurements taken along time during the drawing of each character, longitudinal data is built. The goal is to model the system on a given digit while unwanted characters are present amidst the training data; these are the instances to be uncovered. Data is available at [103] and studied in [104]. Handwriting samples are captured using a sensitive tablet which outputs the x and y coordinates of the pen at fixed time intervals.
A dataset comprising 1143 MTS along 8 time frames representing digit 1 is assembled from 44 different writers. The original MTS are discretized with an alphabet size of 8. The dataset is contaminated with 130 subjects belonging to a different digit, roughly 10%. The aim is to detect these and subsequently understand similarities between digits.
Results for multiple tests are presented in Table 5.5, where Di represents the introduced anomaly digit i. All experiments model a non-stationary first-order DBN, which demonstrates that a specific pair of coordinates is more easily explained by its immediate predecessor. Every attribute can possess at most one parent from its preceding slice. Thresholds are selected manually. The objective is not only to capture the performance of the outlier detection mechanism but also to understand which digits most commonly resemble digit 1. Results show that distinguishing digit 7 from 1 is difficult due to their similarity, as evidenced by the low F1 score obtained, which reflects the blending of both class distributions. Digits 8 and 9 proved to be more easily discerned from 1, with favorable performance measures.
Further experimentation employing the Tukey and GMM strategies showed that Tukey's thresholds assume most of the score distribution to be a single cluster, outputting a low number of TP. On the other hand, GMM proved able to separate both distributions, detecting practically all anomalies in each test. However, the latter outputs a high number of FP, since both Gaussians are mixed together. For the case of experiment D9, GMM acquired 125
TP and 171 FP with a threshold of -2.37.
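The GMM score-analysis strategy can be sketched with scikit-learn's `GaussianMixture` (assumed available, not the thesis's implementation): fit two components to the one-dimensional scores and place the threshold where the posterior of the lower-mean (anomalous) component drops to 0.5. The synthetic scores mimic the 1143-vs-130 setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
scores = np.concatenate([rng.normal(-1.5, 0.3, 1000),   # digit-1 subjects
                         rng.normal(-3.5, 0.5, 130)])   # contaminating digit

gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
lo, hi = np.sort(gmm.means_.ravel())
anomalous = int(np.argmin(gmm.means_.ravel()))          # lower-mean component

# scan between the two means for the point where the posterior flips
grid = np.linspace(lo, hi, 1000).reshape(-1, 1)
posterior = gmm.predict_proba(grid)[:, anomalous]
threshold = float(grid[np.argmin(np.abs(posterior - 0.5)), 0])
flagged = scores < threshold
```

When the two Gaussians overlap heavily, as with digit 7, this boundary inevitably trades TP for FP, matching the behavior reported above.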
Table 5.5: Results of the pen digits outlier detection experiments. All datasets are composed of 1143 MTS along 8 time frames representing digit 1 and NO subjects belonging to a different digit Di. Thresholds Tresh are chosen manually.
Experiment NO TP FP TN FN Tresh PPV TPR ACC F1
D5 130 55 38 1105 75 -3.0 0.59 0.423 0.91 0.49
D7 130 24 41 1102 106 -3.0 0.37 0.18 0.88 0.25
D8 130 98 45 1098 32 -2.5 0.69 0.75 0.94 0.72
D9 130 90 42 1101 40 -3.0 0.68 0.69 0.94 0.69
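The derived measures in Table 5.5 follow directly from the confusion-matrix counts; a small helper reproduces, for example, the D5 row:

```python
def detection_metrics(tp, fp, tn, fn):
    """Precision (PPV), recall (TPR), accuracy and F1 from raw counts."""
    ppv = tp / (tp + fp)
    tpr = tp / (tp + fn)
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return ppv, tpr, acc, f1

# row D5 of Table 5.5: 130 injected outliers, threshold -3.0
ppv, tpr, acc, f1 = detection_metrics(tp=55, fp=38, tn=1105, fn=75)
# rounded: PPV 0.59, TPR 0.423, ACC 0.91, F1 0.49, matching the table
```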
It is worth noting the difficulty associated with the current task. The data requires discretization before being modeled and conceals different types of features even for the same digit, since 44 writers are present. The size of the datasets, along with the roughly 10% of anomalies, turns outlier detection into a challenge. Beyond the aforementioned, there is no pre-processing specific to the scenario, contrary to the existing literature: raw data from the sensors is directly fed to the system.
Figure 5.14: Transition outlierness of an experiment comprised of 1143 MTS representing digit 1 and 130 subjects representing digit 0. The latter are associated to the highest subject ids. The modeling phase employed a first-order DBN. Transitions displayed in red are classified as abnormal, while those in green are considered normal.
An advantage of the developed approach is the capture of features which help explain each detection. A similar experiment is depicted where the unusual digit is 0; however, unlike previously, a stationary DBN is modeled. Transition anomalies seen in Fig. 5.14 demonstrate the presence of digits 0 at the top. Outliers detected in the final transitions of digit 1 exhibit the different types of calligraphy, which complete the drawing of digit 1 in distinct ways, contrary to 0.
The versatility of the proposed approach is thus shown. The application of the system in distinct scenarios demonstrated the importance of classifying abnormality using inter-variable as well as temporal dependencies. Furthermore, by considering two different score-analysis strategies, the system can adapt to each scenario more easily.
Chapter 6
Discussion
The present chapter reviews the thesis's achievements and provides possible future directions to enhance the undertaken research. Conclusions are drawn from the results detailed in Chapter 5, discussing the advantages and disadvantages of the developed system.
6.1 Conclusions
The proposed approach offers the opportunity to tackle portions of, as well as entire, MTS through an adaptable outlier detection mechanism. A diverse set of applications has proved to benefit from it. The system ranges from pre-processing to score analysis, including a DBN modeling phase versatile for longitudinal data. A widely available web application [3] is deployed to assist any analyst in their specific endeavour, along with a user-friendly interface and a straightforward tutorial. A paper based on the current thesis is under preparation for submission to a journal.
With this in mind, the DBN sliding-window outlier detection mechanism offers a previously nonexistent tool for longitudinal data. Anomalies are conveniently analyzed and the mechanism is easily adapted. Furthermore, the application can be applied to a wide range of scenarios and exercised recursively or partially. An example is performing anomaly detection in order to cleanse unwanted subjects; the resulting dataset can then be remodeled without the influence of unusual patterns. Furthermore, by tweaking the chosen parameters, the system is capable of changing its perception of outlierness, meaning that different contextual patterns can be captured.
The developed approach shows particularly good results when employed on discretized data with inter-variable dependencies as well as temporal relations, contrary to existing TS outlier detection methods which solely consider intra-variable dependencies. Stationary models allow confronting data whose behavior does not vary over time, in opposition to non-stationary ones. Although not built for MTS with a high number of variables, most scenarios are analyzed successfully by considering a smaller subset of variables, being thus easily adapted to such cases. Other types of temporal data have proved to accommodate the current system, with the pre-processing phase being of special importance.
6.2 Future Work
Given the availability provided and the modular nature of the developed system, as seen in Fig. 4.1, analysts can adapt specific stages to their needs. The phases which could benefit the most from applying domain-specific knowledge, as well as more complex mechanisms, are pre-processing and score analysis. Scoring can also be enhanced. Improvements should keep in mind the assumptions required by the modeling phase, which can be seen as the primary stage of the system.
Different discretization and dimensionality reduction mechanisms should be studied and compared with the aim of better capturing the main features of the data before modeling. In the current approach only SAX is considered, which could be further optimized due to equiprobability issues [82].
In the score analysis phase, more complex and domain-specific methods can improve results significantly, such as considering GMMs with more than two components or emphasizing specific scores, thereby regarding several classes of anomalies instead of just one.
A possible adjustment to the existing strategy would be an iterative training phase capable of detecting outliers in a first pass and removing them temporarily for further modeling. In each iteration, modeling would consider only the instances deemed normal in the previous iteration. Without outliers influencing the acquired DBN, the converged model would be used for the final outlier detection.
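The iterative scheme could take the following shape. The DBN fit and score steps are replaced here by a trivial mean-profile stand-in, and all names are illustrative, not part of the current system:

```python
import numpy as np

def fit(data):
    """Stand-in for DBN structure/parameter learning: a mean profile."""
    return data.mean(axis=0)

def score(model, data):
    """Stand-in scorer: higher (less negative) means more normal."""
    return -((data - model) ** 2).sum(axis=1)

def tukey_lower(s, k=1.5):
    q1, q3 = np.percentile(s, [25, 75])
    return q1 - k * (q3 - q1)

def iterative_detection(data, max_iter=10):
    keep = np.ones(len(data), dtype=bool)
    for _ in range(max_iter):
        model = fit(data[keep])                 # model normal instances only
        s = score(model, data[keep])
        flagged = s < tukey_lower(s)
        if not flagged.any():
            break                               # converged: nothing new flagged
        idx = np.flatnonzero(keep)
        keep[idx[flagged]] = False              # remove before the next refit
    return score(model, data), ~keep            # final model scores everyone

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 1, (95, 4)), rng.normal(6, 1, (5, 4))])
final_scores, outliers = iterative_detection(data)
```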
The modeling phase can be further enhanced with the employment of change-point mechanisms [105] in the case of non-stationarity. These can model changes in the underlying processes through time instead of considering each transition separately, amplifying the number of possible experiments.
In a supervised scenario, score analysis can benefit from labeled elements. For instance, SVMs [41] can be used to create a boundary between the normal and abnormal classes, which is then used to classify test instances. Furthermore, when labels are not available, different clustering models can be applied to separate the classes.
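In the supervised setting described above, a plain scikit-learn SVM (assumed available) trained on the one-dimensional scores would look like this; the scores and labels are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
normal_scores = rng.normal(-1.0, 0.3, 300)
abnormal_scores = rng.normal(-3.0, 0.4, 30)

X = np.concatenate([normal_scores, abnormal_scores]).reshape(-1, 1)
y = np.concatenate([np.zeros(300), np.ones(30)])   # 1 = labeled anomaly

# class_weight="balanced" compensates for the 10:1 class imbalance
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
predictions = clf.predict(np.array([[-0.9], [-3.2]]))
```

Unlike a fixed Tukey or GMM threshold, the learned boundary adapts to whatever labeled evidence is available.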
Scoring can be further improved to emphasize certain transition networks in the case of non-stationarity. Moreover, score analysis can be performed separately for each transition without mixing scores from different networks. This, however, increases the computational burden.
The enhancement of the multivariate PST mechanism used for comparison is additionally encouraged. By providing an algorithm capable of modeling a tree encoding multivariate configurations, an efficient MTS variable-length Markov model can emerge.
Bibliography
[1] J. D. Pinto, A. M. Carvalho, and S. Vinga. Outlier detection in cox proportional hazards models based on
the concordance c-index. In LOD, pages 252–256. Springer, 2015.
[2] E. Keogh, J. Lin, and A. Fu. HOT SAX: Finding the most unusual time series subsequence: Algorithms and
applications. In ICDM, pages 440–449. Citeseer, 2004.
[3] J. L. F. Serras. Dynamic Bayesian outlier detection. https://jorgeserras.shinyapps.io/outlierdetection/, October 2018.
[4] N. Friedman, K. P. Murphy, and S. J. Russell. Learning the structure of dynamic probabilistic networks. In
UAI, pages 139–147. Morgan Kaufmann, 1998.
[5] F. Zappa and P. Occhiogrosso. Real Frank Zappa Book. Simon and Schuster, 1989.
[6] C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls: Alternative data
models. In S&P, pages 133–145. IEEE, 1999.
[7] T. Lane. Hidden Markov models for human/computer interface modeling. In IJCAI, pages 35–44. Citeseer,
1999.
[8] Evolution of anomaly detection systems in security data science. https://www.analyticsvidhya.com/blog/2018/02/demystifying-security-data-science/. Accessed: 03 September 2018.
[9] J. Lin, E. J. Keogh, S. Lonardi, and B. Y. Chiu. A symbolic representation of time series, with implications
for streaming algorithms. In DMKD, pages 2–11. ACM, 2003.
[10] M. Saada, Q. Meng, and T. Huang. A novel approach for pilot error detection using Dynamic Bayesian
Networks. Cognitive neurodynamics, 8(3):227–238, 2014.
[11] M. L. Bueno, A. Hommersom, P. J. Lucas, M. Lappenschaar, and J. G. Janzing. Understanding disease
processes by partitioned dynamic Bayesian networks. BI, 61:283–297, 2016.
[12] D. J. Hill, B. S. Minsker, and E. Amir. Real-time Bayesian anomaly detection in streaming environmental
data. Water Resour. Res., 45(4), 2009.
[13] K. Murphy, S. Mian, et al. Modelling gene expression data using dynamic Bayesian networks. Technical
report, Computer Science Division, University of California, Berkeley, CA, 1999.
[14] J. L. Monteiro, S. Vinga, and A. M. Carvalho. Polynomial-time algorithm for learning optimal tree-
augmented dynamic Bayesian networks. In UAI, pages 622–631, 2015.
[15] J. W. Tukey. Exploratory data analysis, volume 2. Reading, Mass., 1977.
[16] D. C. Hoaglin. John W. Tukey and data analysis. Statistical Science, pages 311–318, 2003.
[17] G. McLachlan. Finite mixture models. Annual Review of Statistics and Its Application, 5(1), 2018.
[18] J. López-de-Lacalle. tsoutliers: Detection of outliers in time series. R package version 0.6-5. https://CRAN.R-project.org/package=tsoutliers, 2016.
[19] M. Dancho and D. Vaughan. anomalize: Tidy anomaly detection. R package version 0.1.1. https://CRAN.R-project.org/package=anomalize, 2018.
[20] RStudio. Shiny: A web application framework for R (2015). http://shiny.rstudio.com.
[21] S. Urbanek. rJava: Low-level R to Java interface. http://CRAN.R-project.org/package=rJava.
[22] C. C. Aggarwal. Outlier analysis. Springer International Publishing Switzerland, 2017.
[23] F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):1–21, 1969.
[24] D. M. Hawkins. Identification of outliers, volume 11. Springer, 1980.
[25] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han. Outlier detection for temporal data: A survey. IEEE Trans.
Knowl. Data Eng., 26(9):2250–2267, 2014.
[26] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Trans.
Knowl. Data Eng., 24(5):823–839, 2012.
[27] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR),
41(3):15, 2009.
[28] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial intelligence review, 22(2):
85–126, 2004.
[29] R. Domingues, M. Filippone, P. Michiardi, and J. Zouaoui. A comparative evaluation of outlier detection
algorithms: Experiments and analyses. Pattern Recognition, 74:406–421, 2018.
[30] A. Valdes and K. Skinner. Adaptive, model-based monitoring for cyber attack detection. In International
Workshop on Recent Advances in Intrusion Detection, pages 80–93. Springer, 2000.
[31] V. Chandola, V. Mithal, and V. Kumar. Comparative evaluation of anomaly detection techniques for se-
quence data. In ICDM, pages 743–748. IEEE, 2008.
[32] P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection, volume 589. John wiley & sons,
2005.
[33] P. Galeano, D. Peña, and R. S. Tsay. Outlier detection in multivariate time series by projection pursuit. Journal of the American Statistical Association, 101(474):654–669, 2006.
[34] R. S. Tsay, D. Peña, and A. E. Pankratz. Outliers in multivariate time series. Biometrika, 87(4):789–804, 2000.
[35] A. M. Bianco, M. Garcia Ben, E. Martinez, and V. J. Yohai. Outlier detection in regression models with
arima errors using robust estimates. Journal of Forecasting, 20(8):565–579, 2001.
[36] N. Kumar, V. N. Lolla, E. Keogh, S. Lonardi, C. A. Ratanamahatana, and L. Wei. Time-series bitmaps: a
practical visualization tool for working with large time series databases. In ICDM, pages 531–535. SIAM,
2005.
[37] S. Budalakoti, A. N. Srivastava, R. Akella, and E. Turkov. Anomaly detection in large sets of high-
dimensional symbol sequences. 2006.
[38] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Lof: identifying density-based local outliers. In
SIGMOD, volume 29, pages 93–104. ACM, 2000.
[39] I. Kärkkäinen et al. Outlier detection using k-nearest neighbour graph. In ICPR, pages 430–433. IEEE, 2004.
[40] A. K. Jain and R. C. Dubes. Algorithms for clustering data. 1988.
[41] J. Ma and S. Perkins. Time-series novelty detection using one-class support vector machines. In IJCNN,
volume 3, pages 1741–1745. IEEE, 2003.
[42] M. Hauskrecht, I. Batal, M. Valko, S. Visweswaran, G. F. Cooper, and G. Clermont. Outlier detection for
patient monitoring and alerting. BI, 46(1):47–55, 2013.
[43] M. Davy and S. Godsill. Detection of abrupt spectral changes using support vector machines an application
to audio signal segmentation. In ICASSP, volume 2, pages II–1313. IEEE, 2002.
[44] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em
algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
[45] K. R. Koch. Robust estimation by expectation maximization algorithm. Journal of Geodesy, 87(2):107–116,
2013.
[46] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady. Novelty detection for the identification of masses in
mammograms. In ICANN, pages 442–447. IET, 1995.
[47] E. Keogh, S. Lonardi, and B.-c. Chiu. Finding surprising patterns in a time series database in linear time
and space. In SIGKDD, pages 550–556. ACM, 2002.
[48] R. Gwadera, M. J. Atallah, and W. Szpankowski. Reliable detection of episodes in event sequences. Knowl-
edge and Information Systems, 7(4):415–437, 2005.
[49] E. Hancock. Ideas into words: Mastering the craft of science writing. JHU Press, 2003.
[50] D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable
memory length. Mach. Learn., 25(2-3):117–149, 1996.
[51] N. Ye et al. A Markov chain model of temporal behavior for anomaly detection. In Proceedings of the 2000
IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, volume 166, page
169. West Point, NY, 2000.
[52] R. Sekar, A. Gupta, J. Frullo, T. Shanbhag, A. Tiwari, H. Yang, and S. Zhou. Specification-based anomaly
detection: a new approach for detecting network intrusions. In CCS, pages 265–274. ACM, 2002.
[53] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In ICDM, pages 94–105.
SIAM, 2006.
[54] A. Gabadinho and G. Ritschard. Analyzing state sequences with probabilistic suffix trees: the pst r package.
JSS, 72(3):1–39, 2016.
[55] C. Low-Kam, A. Laurent, and M. Teisseire. Detection of sequential outliers using a variable length Markov
model. In ICMLA, pages 571–576. IEEE, 2008.
[56] F. Rasheed and R. Alhajj. A framework for periodic outlier pattern detection in time-series sequences. IEEE
transactions on cybernetics, 44(5):569–582, 2014.
[57] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White. Microbial gene identification using interpolated
Markov models. Nucleic acids research, 26(2):544–548, 1998.
[58] E. Eskin, W. S. Noble, and Y. Singer. Protein family classification using sparse Markov transducers. Journal
of Computational Biology, 10(2):187–213, 2003.
[59] W. W. Cohen. Fast effective rule induction. In Machine Learning Proceedings 1995, pages 115–123.
Elsevier, 1995.
[60] W. Lee, S. J. Stolfo, and P. K. Chan. Learning patterns from unix process execution traces for intrusion
detection. In AAAI Workshop on AI Approaches to Fraud Detection and Risk Management, pages 50–56,
1997.
[61] V. Jääskinen, J. Xiong, J. Corander, and T. Koski. Sparse Markov chains for sequence data. Scandinavian Journal of Statistics, 41(3):639–655, 2014.
[62] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc.
IEEE, 77(2):257–286, 1989.
[63] G. Bejerano and G. Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of
protein families. BI, 17(1):23–43, 2001.
[64] N. Friedman. The Bayesian structural EM algorithm. In UAI, pages 129–138. Morgan Kaufmann, 1998.
[65] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In SIGKDD, pages 220–229.
ACM, 2007.
[66] S. Babbar and S. Chawla. On Bayesian network and outlier detection. In COMAD, page 125, 2010.
[67] W.-K. Wong, A. W. Moore, G. F. Cooper, and M. M. Wagner. Bayesian network anomaly pattern detection
for disease outbreaks. In Proceedings of the 20th International Conference on Machine Learning (ICML-
03), pages 808–815, 2003.
[68] R. E. Kalman. A new approach to linear filtering and prediction problems. J. Basic Eng., 82(1):35–45,
1960.
[69] A. Ogbechie, J. Díaz-Rozo, P. Larrañaga, and C. Bielza. Dynamic Bayesian network-based anomaly detection for in-process visual inspection of laser surface heat treatment. In Machine Learning for Cyber Physical Systems, pages 17–24. Springer, 2017.
[70] A. Hartemink et al. Banjo: Bayesian network inference with Java objects. http://www.cs.duke.edu/amink/software/banjo/, 2005.
[71] D. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: Search methods and experi-
mental results. In AISTATS, pages 112–128, 1995.
[72] N. Dojer. Learning Bayesian networks does not have to be NP-hard. In MFCS, pages 305–314. Springer,
2006.
[73] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Mach. Learn., 29(2-3):131–163,
1997.
[74] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. Inf. Theory,
14(3):462–467, 1968.
[75] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stat., 22(1):79–86, 1951.
[76] J. Edmonds. Optimum branchings. J. Res. Nat. Bur. Standards, 71:233–240, 1967.
[77] J. Han, J. Pei, and M. Kamber. Data mining: concepts and techniques. Elsevier, 2011.
[78] D. Q. Goldin and P. C. Kanellakis. On similarity queries for time-series data: constraint specification and
implementation. In CP, pages 137–153. Springer, 1995.
[79] E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for datamining applications. In KDD,
pages 285–289. ACM, 2000.
[80] R. J. Larsen, M. L. Marx, et al. An introduction to mathematical statistics and its applications, volume 2.
Prentice-Hall Englewood Cliffs, NJ, 1986.
[81] P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein. GrammarViz
3.0: Interactive discovery of variable-length time series patterns. ACM Trans. Knowl. Discov. Data, 12(1):
10:1–10:28, Feb. 2018. ISSN 1556-4681. URL http://doi.acm.org/10.1145/3051126.
[82] M. Butler and D. Kazakov. SAX discretization does not guarantee equiprobable symbols. Trans. Knowl.
Data Eng., (1):1–1, 2015.
[83] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemom. Intell. Lab. Syst., 2(1-3):
37–52, 1987.
[84] C. R. Shalizi and A. Kontorovich. Shannon entropy and Kullback-Leibler divergence. Almost None of the
Theory of Stochastic Processes, pages 229–236, 2010.
[85] C. E. Shannon. A mathematical theory of communication. SIGMOBILE, 5(1):3–55, 2001.
[86] Voltaire. Letter to Frederick II of Prussia, 6 April. https://en.wikiquote.org/wiki/Voltaire,
1767. Accessed: 23 September 2018.
[87] X.-G. Gao, J.-F. Mei, H.-Y. Chen, and D.-Q. Chen. Approximate inference for dynamic Bayesian networks:
sliding window approach. Applied intelligence, 40(4):575–591, 2014.
[88] Distribution skewness types. http://www.assetinsights.net/Glossary/G_Probability_Distribution.html. Accessed: 20 September 2018.
[89] P. R. Jones. A note on detecting statistical outliers in psychophysical data. bioRxiv, page 074591, 2016.
[90] P. J. Rousseeuw and C. Croux. Alternatives to the median absolute deviation. J. Am. Stat. Assoc., 88(424):
1273–1283, 1993.
[91] Gaussian distributions. http://tinyheero.github.io/2015/10/13/mixture-model.html. Accessed: 20 September 2018.
[92] G. McLachlan and T. Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
[93] J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages
803–821, 1993.
[94] L. Scrucca, M. Fop, T. B. Murphy, and A. E. Raftery. mclust 5: Clustering, classification and density
estimation using Gaussian finite mixture models. The R Journal, 8(1):289, 2016.
[95] C. Fraley, A. Raftery, L. Scrucca, T. B. Murphy, M. Fop, and M. L. Scrucca. Package ‘mclust’. 2017.
[96] M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Trans. Pattern
Anal. Mach. Intell., 24(3):381–396, 2002.
[97] A. Gabadinho. PST: Probabilistic suffix trees and variable length Markov chains. https://CRAN.R-project.org/package=PST, 2016.
[98] H. A. Dau, E. Keogh, K. Kamgar, C.-C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. Ratanamahatana, Yanping,
B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista. The UCR time series classification archive, October
2018. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.
[99] P. J. Rousseeuw and W. V. D. Bossche. Detecting deviating data cells. Technometrics, 60(2):135–145, 2018.
[100] Human Mortality Database. University of California, Berkeley (USA), and Max Planck Institute for
Demographic Research (Germany). Available at www.mortality.org or www.humanmortality.de
(data downloaded on 18 September 2018).
[101] Institut National d'Études Démographiques (INED). Infant mortality in France. https://www.ined.fr/en/everything_about_population/demographic-facts-sheets/focus-on/infant_mortality_france/. Accessed: 27 September 2018.
[102] E. Torok, E. Moran, and F. Cooke. Oxford handbook of infectious diseases and microbiology. Oxford
University Press, 2016.
[103] D. Dheeru and E. Karra Taniskidou. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.
[104] F. Alimoglu and E. Alpaydin. Methods of combining multiple classifiers based on different representations
for pen-based handwritten digit recognition. In TAINN, 1996.
[105] X. Xuan and K. Murphy. Modeling changing dependency structure in multivariate time series. In ICML,
pages 1055–1062. ACM, 2007.