Outlier detection for multivariate time series
Jorge Luís Fuzeta Serras
Master’s Thesis
Electrical and Computer Engineering
Supervisor(s): Prof. Alexandra Sofia Martins de Carvalho, Prof. Susana de Almeida Mendes Vinga Martins
Examination Committee
Chairperson: Prof. António Manuel Raminhos Cordeiro Grilo
Supervisor: Prof. Alexandra Sofia Martins de Carvalho
Member of the Committee: Prof. Sara Alexandra Cordeiro Madeira
November 2018
Dedicated to my parents.
Declaration
I declare that this document is an original work of my own authorship and that it fulfills all the requirements of
the Code of Conduct and Good Practices of the Universidade de Lisboa.
Acknowledgments
I express my deepest gratitude to those who made this journey possible. I wish to thank my supervisors, Alexandra Carvalho and Susana Vinga, for their continuous support, knowledge and availability throughout the project.
Moreover, I also wish to thank all my family and friends, whose endorsement motivated me along the way.
The work of this dissertation was performed under a research grant assigned by Instituto de Telecomunicações within the framework of the project PERSEIDS, personalizing cancer therapy through integrated modeling and decision, financed by the Portuguese Foundation for Science and Technology (Fundação para a Ciência e a Tecnologia - FCT) under contract PTDC/EMS-SIS/0642/2014.
Resumo
Outliers podem ser definidos como observações suspeitas de não terem sido geradas pelos processos subjacentes aos dados. Muitas aplicações exigem uma forma de identificar padrões interessantes ou incomuns em séries temporais multivariadas (STM); no entanto, a maioria dos métodos de deteção concentram-se exclusivamente em séries univariadas, fornecendo aos analistas soluções extenuantes.
Propomos um sistema completo de deteção de outliers, abrangendo problemas desde o pré-processamento, que adota um algoritmo de modelação de redes de Bayes dinâmicas. O mesmo codifica as conectividades inter e intra-temporais ótimas de redes de transição, capazes de identificar dependências condicionais em STM. Um mecanismo de janela deslizante é empregado para capturar gradualmente o score de cada transição de uma STM dado o modelo. Simultaneamente, STM inteiras são avaliadas. Duas estratégias de análise de scores são estudadas para assegurar uma classificação automática de dados anómalos.
A abordagem proposta é primeiramente validada através de dados simulados, demonstrando o desempenho do sistema. Uma comparação com um método de árvore probabilística de sufixos está disponível, exibindo a vantagem da abordagem multivariada proposta. Experiências adicionais são feitas em dados reais, revelando anomalias em cenários distintos, como séries de eletrocardiogramas, dados de taxas de mortalidade e escrita de dígitos.
O sistema desenvolvido mostrou-se benéfico na captura de outliers resultantes de contextos temporais, sendo adequado para qualquer cenário que emprega STM. Uma aplicação web de livre acesso, empregando o sistema completo, é disponibilizada em conjunto com um tutorial.
Palavras-chave: séries temporais multivariadas, deteção de outliers, redes de Bayes dinâmicas, algoritmo de janela deslizante, análise de resultados, aplicação web
Abstract
Outliers can be defined as observations suspected of not having been generated by the data's underlying processes. Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS); however, most outlier detection methods focus solely on univariate series, leaving analysts with strenuous solutions.
We propose a complete outlier detection system, covering all stages from pre-processing onward, that adopts a dynamic Bayesian network modeling algorithm. The latter encodes the optimal inter and intra time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. Two score-analysis strategies are studied to provide an automatic boundary for classifying anomalous data.
The proposed approach is first validated on simulated data, demonstrating the system's performance. A comparison with an assembled probabilistic suffix tree method is provided, displaying the advantage of the proposed multivariate approach. Further experimentation is carried out on real data, uncovering anomalies in distinct scenarios such as electrocardiogram series, mortality rate data and written pen digits.
The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application employing the complete system is made available together with a tutorial.
Keywords: multivariate time series, outlier detection, dynamic Bayesian networks, sliding window algo-
rithm, score analysis, web application
Contents
Acknowledgments
Resumo
Abstract
List of Figures
Glossary
1 Introduction
1.1 Overview and motivation
1.2 Contributions
1.3 Organization
2 Outlier detection
2.1 Overview
2.1.1 Outlier and Data Types
2.1.2 Outlier detection strategies
2.1.3 Challenges in Outlier Detection
2.2 Outlier Detection Methods
3 Statistical Modeling
3.1 Data
3.2 Markovian Models
3.2.1 Variable and sparse Markov models
3.2.2 Hidden Markov models
3.3 Bayesian models
3.3.1 Bayesian network outlier detection
3.3.2 Dynamic Bayesian networks
3.3.3 Dynamic Bayesian network outlier detection
3.3.4 Model training
4 Proposed method
4.1 Time series pre-processing
4.2 Assumptions
4.3 Dynamic outlier detection technique
4.3.1 Outlierness score
4.3.2 Sliding window
4.4 Score analysis
4.4.1 Tukey’s strategy
4.4.2 Gaussian mixture model strategy
4.5 Web application and software
5 Experimental results
5.1 Simulated data
5.1.1 Proposed approach
5.1.2 Comparison with probabilistic suffix trees
5.2 Real Multivariate Time Series
5.2.1 Electrocardiogram dataset
5.2.2 Mortality Dataset
5.2.3 Pen digits dataset
6 Discussion
6.1 Conclusions
6.2 Future Work
Bibliography
List of Tables
5.1 Subject outlier results of the proposed approach using Tukey’s score-analysis on simulated data
5.2 Subject outlier results of the proposed approach using GMM score-analysis on simulated data
5.3 Subject outlier results of the PST approach using Tukey’s score-analysis on simulated data
5.4 Subject outlier results of the PST approach using GMM score-analysis on simulated data
5.5 Subject outlier results of the proposed approach on pen digits data
List of Figures
1.1 Evolution of anomaly detection systems in security data science.
1.2 Main phases of the proposed outlier detection technique.
3.1 Example of a probabilistic suffix tree
3.2 Three-state hidden Markov model.
3.3 Example of a first-order DBN model
4.1 Scheme of the proposed outlier detection approach
4.2 Time series pre-processing example
4.3 Sliding window example
4.4 Distribution skewness types
4.5 Two Gaussian distribution classes
4.6 Web application front page.
4.7 Transition score analysis tab.
5.1 Transition network for the generated normal subjects
5.2 Transition networks used for abnormal subject generation
5.3 Histogram for an experiment comprising simulated data
5.4 Histogram of the control experiment
5.5 Score analysis method comparison
5.6 Experiment comparing both outlier detection systems
5.7 Mean and standard deviation of normalized variables along time.
5.8 ECG transition outlier results
5.9 Flipped ECG transition outliers
5.10 ECG transition outlierness considering two alphabet sizes
5.11 Normalized male mortality rates of 6 selected variables.
5.12 Mortality data sets transition results
5.13 Year score histogram of the Mortality data set
5.14 Transition outlierness of a pen digits dataset
List of Algorithms
1 Data pre-processing
2 Transition outlier detection
Glossary
AR – Auto-Regressive model
ARMA – Auto-Regressive Moving-Average model
BIC – Bayesian Information Criterion: also known as MDL
BN – Bayesian Network
DBN – Dynamic Bayesian Network
EM – Expectation Maximization
FN – False Negative
FP – False Positive
FSA – Finite-State Automaton
GMM – Gaussian Mixture Model
HMM – Hidden Markov Model
LL – Log-Likelihood
KLD – Kullback-Leibler Divergence
MLE – Maximum Likelihood Estimator
MDL – Minimum Description Length: also known as BIC
MTS – Multivariate Time Series
PAA – Piecewise Aggregate Approximation
PST – Probabilistic Suffix Tree
PDF – Probability Density Function
SAX – Symbolic Aggregate Approximation
TS – Time Series
TN – True Negative
TP – True Positive
Chapter 1
Introduction
The world is changing, and machines are increasingly taking over tasks once performed by humans. In recent times, the machine learning community has boomed due to the ever-expanding desire to acquire maximum benefit from collected data. Such demand is apparent in sectors like biomedicine, socio-economics and industry, which benefit the most given the crucial nature of their data.
Outlier detection has proved to induce decisive progress in multiple scenarios. Examples include the determination of long- or short-time survivors in survival data [1] and the revelation of uncommon ECG patterns representative of a specific disease [2], showing that outliers are not always associated with erroneous observations.
A complete outlier detection design is assembled from scratch with the aim of providing an intuitive and effective alternative for discovering abnormal entities in real-world datasets. In the present thesis, a complete anomaly detection system is developed, extending from data pre-processing to the recognition of unusual patterns in multivariate time series (MTS) [3].
The present study reviews existing outlier detection algorithms, with special focus on temporal and sequential data, and gives a rundown of statistical modeling. A new outlier detection approach based on dynamic Bayesian networks (DBN) [4] is thus proposed and validated on synthetic and real datasets. Furthermore, a probabilistic suffix tree (PST) technique, which does not consider relationships among variables, is built and contrasted with the developed system. The PST approach, like most of the existing literature, focuses solely on the detection of univariate temporal anomalies. Even with a sophisticated univariate system, many contextual outliers cannot be disclosed without contemplating inter-variable relations, thus limiting univariate strategies. The proposed approach relies on sliding windows over a DBN to assure abnormality disclosure. A complete system, advantageous in capturing complex MTS patterns by encoding temporal and inter-variable dependencies simultaneously, is made available, where none previously existed.
1.1 Overview and motivation
An overview of the topics and main objectives of the thesis is given, along with the rationale behind design decisions in light of real-world applications. An in-depth analysis of existing outlier detection techniques is undertaken with the intention of providing the necessary background on the realm of anomaly detection. An alternative outlier detection approach is proposed, capable of uncovering complex contextual anomalies in longitudinal data, contrary to most existing algorithms.
Data-related applications, ranging from primary sectors like industry to consumer-focused areas, share a desire to acquire maximum performance and information: not only through the detection of flaws, but also through the discovery of patterns or novelties that could be crucial to a better understanding of the data's underlying nature, leading to the development of more efficient methods. Without deviation from the norm, progress is not possible [5].
An outlier can simply be described as an element or segment that stands out without apparent explanation. Outliers have the capability to mislead analysts toward altogether different insights regarding the data, distorting any analysis results. Since strategic decisions are taken based on such results, anomaly detection is of great importance. As an example, applications in security data science owe much of their progress to anomaly detection mechanisms [6, 7]. This leads to an endless search for better and more efficient mechanisms, as seen in Fig. 1.1.
Figure 1.1: Evolution of anomaly detection systems in security data science. Adapted from [8].
Considering the extent and complexity of outlier detection, MTS data is the spotlight of this study. A time series (TS) is defined as a set of observations measured along time at an equal pace. Such series are considered MTS when composed of multiple attributes, each represented by a TS. The proposed approach handles MTS datasets by employing Bayesian statistics capable of expressing the degree of belief in observed data.
Taking advantage of Bayesian modeling, a complete anomaly detection system is studied, capable of fitting into a range of data-mining and cleansing applications. It tackles known issues from pre-processing to post-score analysis, with the main phases described in Fig. 1.2. The proposed approach requires multivariate data to be discretized into finite-domain variables. To this end, a discretization and dimensionality reduction approach known as SAX [9] is employed prior to modeling. Data is thus pre-processed to avoid issues such as high dimensionality. The whole system focuses on tackling temporal data, mainly MTS.
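As an illustration of this pre-processing step, the following minimal sketch applies SAX to a single series. Python is used here purely for compactness (the thesis software itself is written in R and Java), and the function name is illustrative; the breakpoints are the standard Gaussian quantiles SAX assumes for a 4-symbol alphabet.

```python
import statistics

def sax(series, n_segments, alphabet="abcd"):
    """Symbolic Aggregate approXimation sketch: z-normalize the series,
    reduce it with PAA (segment means), then map each segment mean to a
    symbol via equiprobable standard-Gaussian breakpoints."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0
    z = [(x - mu) / sigma for x in series]
    # Piecewise Aggregate Approximation: mean of each equal-width segment
    seg = len(z) // n_segments
    paa = [statistics.fmean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]
    breakpoints = [-0.674, 0.0, 0.674]  # split N(0,1) into 4 equal-mass regions
    def symbol(v):
        for bp, s in zip(breakpoints, alphabet):
            if v < bp:
                return s
        return alphabet[-1]
    return "".join(symbol(v) for v in paa)

print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4))  # → "abcd"
```

In a multivariate setting each attribute series would be discretized independently in this way, yielding one symbolic TS per attribute.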
In order to address multivariate temporal data, DBNs [4] are employed in the proposed approach. These are graphical statistical models capable of encoding conditional relationships not only among different variables but also through time. Such approaches have already been used in the literature, engaging a wide range of applications
Figure 1.2: Main phases of the proposed outlier detection technique.
[10–13]. The aforesaid techniques can model complex MTS structures, providing a normality standard. An outlier detection approach is thus developed to measure the level of discrepancy between each test subject and the obtained model.
In order to model the normal behavior of a dataset, an existing strategy [14] is adapted to an outlier detection scenario. It learns an optimal tree-augmented DBN (tDBN) structure. The learning algorithm provides a network possessing optimal inter/intra-slice connectivity for each transition network, verified to outperform the existing literature. Both stationary and non-stationary DBNs are studied.
The assembled anomaly detection mechanism adopts a sliding window approach to uncover transition outliers in multivariate temporal datasets. A dataset is seen as a set of subjects, where each subject is an MTS. A transition is regarded as a subset of a subject comprising the observations responsible for the explanation of a certain time-stamp; in other words, a window of an MTS holding the observations of a given time slice and its preceding time frames. Windows have an associated transition network depending on the parameters of the trained model.
In the proposed approach, an outlier is defined as a transition suspected of not having been generated by the model, thus resulting from unexpected phenomena. Such transitions are uncovered by analyzing anomaly scores for each transition, obtained by computing the conditional probabilities for each window and representing the likelihood of each time slice. Likewise, whole subjects are scored to expose anomalous MTS; a subject's score is obtained by averaging all of its transition scores, which, with prudence, are used to classify each series. The developed approach is thus adaptable to several types of anomalies according to the analyst's needs.
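The sliding-window scoring just described can be sketched as follows. This is an illustrative Python reduction, not the thesis implementation: the DBN factorization over each node's parents is abstracted into a single `cond_prob` callable, and a hypothetical toy model stands in for a learned tDBN.

```python
import math

def transition_scores(mts, cond_prob):
    """Slide a first-order window over a symbolic MTS and score each
    transition by its negative log-likelihood under the model.
    `mts` is a list of time slices, each a tuple of attribute symbols;
    `cond_prob(curr, prev)` plays the role of the DBN's product of
    conditional probabilities for one transition window."""
    scores = []
    for t in range(1, len(mts)):
        p = cond_prob(mts[t], mts[t - 1])
        scores.append(-math.log(p) if p > 0 else float("inf"))
    return scores

def subject_score(scores):
    """A whole subject is scored by averaging its transition scores."""
    return sum(scores) / len(scores)

# Hypothetical toy model: repeating the previous slice is likely.
toy = lambda curr, prev: 0.9 if curr == prev else 0.1
mts = [("a",), ("a",), ("b",), ("b",)]
scores = transition_scores(mts, toy)
print(scores)  # the middle transition gets the highest (most outlying) score
```

Higher scores correspond to less likely transitions, so the score stream feeds directly into the score-analysis stage described next.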
Mechanisms are then employed to classify each score into one of two classes, abnormal or normal. Two main strategies are studied: Tukey's method [15, 16] and Gaussian mixture models (GMM) [17]. A threshold is automatically selected to determine the disclosure boundary, in addition to allowing manual selection.
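As a sketch of the first strategy, Tukey's fences can be applied directly to transition or subject scores. This minimal Python version (function names are illustrative, not from the thesis software) flags only the upper tail, assuming higher scores mean lower likelihood under the model.

```python
def tukey_outliers(scores, k=1.5):
    """Return indices of scores above the upper Tukey fence Q3 + k*IQR."""
    s = sorted(scores)
    def quantile(q):
        # simple linear-interpolation quantile over the sorted scores
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    fence = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(scores) if v > fence]

print(tukey_outliers([0.1, 0.2, 0.15, 0.12, 5.0]))  # → [4]
```

The constant `k` plays the role of the manual sensitivity adjustment: larger values widen the fence and disclose fewer outliers.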
The examined concepts are validated on simulated data generated by DBN mixtures with known anomalies, assuring a correct performance measurement. The results are compared with a designed approach that adapts PSTs to multivariate data. Real-world datasets available in the existing literature are then tested and discussed, justifying the employment of the system in common temporal multivariate problems.
The current system is built in response to the increasing demand for data science tools that aspire not only to promptness but also to generalization and easy adaptability. The main goal is to grant a completely free and accessible algorithm tackling mainly MTS problems. Such algorithms are typically not found in the existing literature, which is concerned mostly with univariate time series [18, 19] and demands additional examination and background unsought by analysts. With this in mind, the proposed approach considers the data's peculiarities from its origin to the final classification phase, bringing great burden relief. Data formatting and pre-processing, as well as upfront model training, are made available. The software is freely accessible online [3], along with source code that can easily be adapted to each application's needs. A web application is additionally available for use without requiring any download or effort from the analyst, along with a tutorial to ensure immediate usage. A complete analysis of the whole procedure with user-friendly graphical support is available. All experiments undertaken in the current work can be reproduced in the built software. The software uses the R and Java programming languages, together with the Shiny [20] and rJava [21] packages.
1.2 Contributions
Every step taken throughout the thesis is simulated and verified, the final goal being the validation of the outlier detection method in typical real-world scenarios. The main aims are to:
• Provide insight into existing anomaly detection techniques while reviewing existing knowledge regarding dynamic Bayesian networks and their modeling.
• Apply discretization and dimensionality reduction to TS datasets prior to performing anomaly detection.
• Model a DBN representing a normality standard for MTS datasets.
• Design a subject and transition scoring system based on a DBN sliding window approach in conjunction
with score analysis to differentiate normality from abnormality.
• Measure the performance of the proposed approach through simulated and real experimentation while contrasting with existing techniques.
• Design and publish a free web application, available to any analyst, granting a complete MTS anomaly detection system together with a tutorial. Future work suggestions as well as advice for each specific application are offered.
• Prepare a paper based on the current thesis for submission in a journal.
1.3 Organization
A revision of the state of the art and background knowledge is given in Chapter 2, which offers an extensive overview of outlier detection systems as well as expertise closely related to the present work.
In Chapter 3, background on Bayesian networks as well as Markovian models is given. The modeling of a DBN is analyzed, along with its assumptions and a comparison with the existing literature.
The proposed outlier detection mechanism is discussed in Chapter 4, which addresses topics ranging from data pre-processing to post-score analysis. The supporting software is also described.
In Chapter 5, validation and discussion of results on synthetic as well as real-world datasets are presented. Furthermore, a comparison between the proposed approach and a designed multivariate PST mechanism is undertaken.
Conclusions and future work suggestions are subject of discussion in Chapter 6.
Chapter 2
Outlier detection
Many researchers around the globe are currently enriching the field of data science. Outlier detection and machine learning techniques have become a playground for testing and simulation across a wide range of fields, from industry to computer science. This chapter offers an overview of the current existing knowledge concerning outlier detection in its entirety. Methods analogous to the problems addressed here are given a more in-depth analysis. Nevertheless, insight is provided on how to perform anomaly detection in any research field. In the context of the studied and developed methods, a group of fully and partially developed techniques is mentioned, as well as research studies. A list of documents and projects was selected from a large consulted bibliography. By assimilating existing knowledge, new ideas for the development of the present study are taken into consideration.
2.1 Overview
Defining an outlier has been a challenge for many years, being highly dependent on the context, area, and type of data in each situation. In the existing literature, authors label outliers as novelties, anomalies, deviations or exceptions according to their circumstances. Outliers are sometimes referred to as data instances that can be considered either abnormal or noisy, where anomalies, for example, are seen as a special kind of outlier carrying interesting information for an analyst [22]. In the present work, the term outlier is used to agglomerate the characteristics of the several interpretations in the literature; other terms, such as those previously mentioned, are used when deemed appropriate. When considering a statistical, mathematical or even an informal point of view, the idea of an outlier can change abruptly. Outliers have been defined by Grubbs [23] as “outlying observations, or outliers, that appear to deviate markedly from other members of the sample in which they occur”, while Hawkins [24] described them as “observations that deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism”. Although considerable subjectivity arises when interpreting outliers, the importance of detecting them is always praised and sought after. Outliers are thus considered to be subsets of data that are of interest to the context in question, hence requiring additional examination. Often outliers are not necessarily incorrect observations, but rather observations hiding valuable information. These frequently possess intelligence regarding the characteristics of the systems and entities responsible for the data generation
process. In medical diagnosis applications, for example, data collected from patient scanning and monitoring reveals unusual patterns that typically reflect undesirable medical conditions, but also interesting clinical cases worth exploring. An example is the determination of long- or short-time survivors in survival data [1].
2.1.1 Outlier and Data Types
Before discussing the several categories of outliers, data types are reviewed with the goal of highlighting their importance when performing anomaly detection. For nearest-neighbor-based techniques, for example, the nature of the input data determines the distance measure to be applied. Such techniques are detailed further on.
2.1.1.1 Data Types
The nature of the input data is of extreme importance. Data instances can be objects, observations, samples or patterns, among others. Data can consist of only one attribute (univariate) or multiple attributes (multivariate). Attributes can be of different types, such as binary, categorical or continuous, and can also be a mixture of several types. Input data can further be categorized by the relationships among data instances. Spatial and graph data illustrate the often needed relationship between data instances, in this case neighborhood dependencies. In network or graph data, data values are represented by nodes and relationships among them by edges in the network. Outliers in network data are modeled through unusual relationships or an abnormal structure in the graph. Other kinds of data, such as multi-dimensional streams or discrete series, require dedicated methods for their analysis. The relative placement between data instances is of great importance when considering discrete sequences, and it does not need to be temporal: in genome string representations, for instance, the order of nucleotides within DNA molecules is crucial although not temporal. Prediction-based techniques can be used to estimate the values for specific positions in the sequence, and deviations from the forecasted data are seen as contextual outliers.
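A minimal prediction-based scheme of this kind can be sketched with a first-order Markov model trained by counting (all names and the toy sequences below are illustrative only; a PST generalizes this idea to variable-length contexts):

```python
from collections import defaultdict

def train_markov(sequences):
    """Estimate P(next | current) from symbol sequences by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def surprise(model, seq, floor=1e-6):
    """Pair each position with the probability the model gave the symbol
    observed there; low-probability positions are contextual-outlier
    candidates. Unseen transitions get a small floor probability."""
    return [(i + 1, model.get(a, {}).get(b, floor))
            for i, (a, b) in enumerate(zip(seq, seq[1:]))]

model = train_markov(["ACGTACGT", "ACGTACGA"])
for pos, p in surprise(model, "ACGTTCGT"):
    if p < 0.5:
        print(pos, p)  # flags the positions around the unexpected T
```

The symbols themselves ("T" here) are perfectly ordinary; only their placement relative to the preceding symbol makes them suspicious.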
Focusing on time-series data, which is normally formed by continuous measurements over time, such data often contains contextual and collective anomalies over related time-stamps. Temporal context is thus of great importance when handling time-series data. Anomalies are usually associated with abrupt changes in values or trends, meaning that data instances are classified as anomalies not for their value but rather for the change they induce in the data considering previous time-stamps. Looking at a temperature measurement time-series, for example, if a measurement of 35 °C comes right after a measurement of 5 °C, it will probably be detected as an anomaly due to the abrupt change and not because of the values themselves. Data trends that slowly change over time require a much more careful analysis over a longer period of time.
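The temperature example can be captured by the simplest possible change detector, flagging jumps between consecutive measurements rather than the values themselves. This is a hedged illustration of the idea, not a method proposed later in this work, and the threshold is an arbitrary assumption.

```python
def abrupt_changes(series, threshold):
    """Flag time stamps whose jump from the previous value exceeds the
    threshold: the value itself may be unremarkable, and it is the change
    in temporal context that is anomalous."""
    return [t for t in range(1, len(series))
            if abs(series[t] - series[t - 1]) > threshold]

temps = [4.0, 5.0, 5.5, 35.0, 34.0]  # the 35 °C reading follows 5.5 °C
print(abrupt_changes(temps, threshold=10.0))  # → [3]
```

Note that the slow-trend case mentioned above defeats this detector by design, since every individual step stays below the threshold.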
2.1.1.2 Outlier Types
The employed anomaly detection strategies adjust depending on the variety of outliers to be detected and on their contextual dependencies. The objective can be to discover unusual data points with abnormal values or unusual patterns of change. The latter may be, for example, an uncommon ECG pattern representative of a specific disease [2]. The various types of outliers [25–27] are thus:
• Point – being the simplest and most intuitive type of anomaly, point outliers are data instances that are
considered to be anomalous with respect to the rest of the data due to their value.
• Collective – when a group of related data instances are considered anomalous with respect to the rest of
the data, it is considered a collective anomaly. Note that each individual data instance of that collection
may be valid, yet their agglomeration is not. In clustering scenarios, for example, an unexpected cluster
may be classified as anomalous even if each individual instance is valid, since those specific observations
are not expected to be grouped. Collective anomalies can also be considered as contextual outliers if their
classification depends on specific contexts.
• Contextual – considering a certain context specified in the problem's formulation, data instances can be classified as contextual outliers when present in specific scenarios. Point outliers, on the other hand, are always classified as such, independent of context. For a simple example, consider the average temperature of a country: a data instance indicating a negative temperature in the middle of winter is probably not classified as unusual, but the same value appearing during summer would be. Detecting such contextual anomalies requires more complex problem formulations than simple point anomalies.
• Pattern frequency – sequences which contain patterns with an anomalous frequency of appearance are
also considered outliers. The data instances themselves may not be abnormal, but the frequency or pattern in
which they occur can be labeled as such. A simple example, common in medical applications, is the detection
of unusual patterns in a patient’s measurements which may indicate the presence of specific illnesses.
Distinguishing normal from abnormal data is far from being a straightforward puzzle. Sometimes, unusual
data does not respect the categorization described earlier, making the procedure of outlier detection even more
challenging. There are several phenomena which can occur, especially in datasets with a considerable amount of
outliers. The first is named Masking, which happens when an outlier masks a second outlier, the latter thus being
invisible to the anomaly detection procedure. This means that the second anomaly can only be detected in the
absence of the first one. Another interesting effect is the so-called Swamping. An outlier is said to be swamping
a second data instance if the latter can only be classified as an anomaly when in the presence of the first, and
not by itself. Such phenomena have been the target of research and are of great importance in order to avoid incorrect
labeling caused by the outlier detection system itself. Stamping specific observations as outliers may lead to other data
instances being classified as such in forthcoming iterations, resulting in the “self-destruction” of the algorithm.
2.1.2 Outlier detection strategies
Most outlier detection techniques build some kind of model to represent the data’s normal patterns. The choice of the
appropriate data model is one of the most crucial phases in outlier detection. An incorrect choice may lead to
poor results. For example, if the underlying data is not clustered by nature, a cluster-based model will probably
work poorly. In practice, it is the analyst that dictates the model relevant to an application, requiring a good
understanding of the data itself.
The type of strategy to employ in outlier detection should conform to the kind of data and domain
in question. Some important data characteristics are, for example, its dimensionality and prior knowledge. If, in a
specific scenario, labeled data is already available, or outliers are known to be subsequences of data instead of single
observation points, then the selection of an anomaly detection technique is greatly influenced.
The interpretability of the employed model is also significant. Analysts are concerned not only with the detec-
tion of anomalies but also the reason why they are considered as such. This provides crucial information for the
understanding and hopefully eradication of the phenomena causing the anomalies. Methods that require complex
data transformations, although frequently effective in isolating anomalies, are rather painful to interpret. There is
then a trade-off between interpretability and complexity which is many times overlooked.
The outlier detection problem is analogous to the classification problem with two class labels, “normal” and
“abnormal”. Much of the knowledge from classification methods generalize to anomaly detection. The three main
approaches for the problem of outlier detection [25, 28] are thus:
• Type 1: Unsupervised – analogous to unsupervised clustering, outliers are determined with no prior knowl-
edge of the data. It is assumed that outliers can be separated from the rest of the data. These can be
highlighted by some diagnostic approach or incorporated in a robust model for later classification. The data is
assumed to follow a static distribution.
• Type 2: Supervised – analogous to supervised classification, outlier determination requires pre-labeled data
already classified as normal or abnormal. Such techniques model both normality and abnormality. New
data should be classified correctly if an adequate generalization was obtained in the training phase. Pre-
labeled data should cover as many situations as possible. If the classifier did not observe some anomalous
patterns in the training phase, these could be misclassified when testing. Classifiers are normally adapted
to static data, meaning that distribution shifts enforce a reset on the training phase, although techniques like
evolutionary neural networks can endure dynamism.
• Type 3: Semi-supervised – analogous to semi-supervised recognition, outlier classification requires only
one of the classes (normal or abnormal) to be taught. Nevertheless the opposite class is learned by the
algorithm, even when the pre-labeled data only marked one of the classes. These techniques are more
adequate for dynamic data, typically modeling normality. This means that in the case of online algorithms,
the complete state or prior information of anomalous instances is many times non-existent, making it difficult
to adapt a supervised system in real time. For semi-supervised methods, the full domain of normality is
required for good generalization, abnormal data not being required, in opposition to type 2 methods.
In many cases, unsupervised methods are used for noise removal in order to clean data sets for the later use of
supervised methods. The latter are designed specifically for the attributes and relationships in the underlying data.
Most anomaly detection techniques have either a score or a label output. Scoring methods assign to each
occurrence an anomaly score quantifying its degree of abnormality, also known as outlierness level. Most of these
approaches output a ranked list of instances sorted by their anomaly score. Instances with a higher score normally
indicate a higher probability of being outliers. The list can then be cut off with a specific threshold to select
only the most interesting observations. Techniques that rely on labels assign occurrences to one of two categories,
normal or anomalous, lacking a degree of irregularity for comparison between test instances. Labels are normally
binary containing less information than a scoring output. However, in the end, most of the scoring techniques
convert their results to binary labels allowing the reach of a final decision regarding abnormality.
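The score-to-label conversion described above can be sketched in a few lines. The helpers below are illustrative (names and the example threshold are assumptions, not the thesis's code): one cuts a score list at a threshold, the other returns the top-k ranked instances.

```python
def scores_to_labels(scores, threshold):
    """Convert anomaly scores to binary labels: True marks an outlier.

    Sketch of the usual final step: cut the scored list at a threshold.
    """
    return [s >= threshold for s in scores]

def top_k_outliers(scores, k):
    """Indices of the k highest-scoring instances, most anomalous first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.1, 0.9, 0.4, 0.95, 0.2]
print(scores_to_labels(scores, 0.5))   # binary labels lose the ranking
print(top_k_outliers(scores, 2))       # the ranked output keeps it
```

The ranked form carries more information, as noted above; the binary form is what a final decision ultimately needs.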
The proposed approach is unsupervised and offers score and label outputs simultaneously, the latter being
derived from score analysis.
2.1.3 Challenges in Outlier Detection
At first glance outlier detection may seem simple, since it can be formulated as the determination of a region
representing the normal behavior of the data in question. Any observations that are not present inside that region
are labeled as anomalies. We quickly realize, however, the breadth of problems associated with such a task initially
perceived as trivial [25–29]. These regions are normally chosen on an ad hoc basis according to the application’s specific
characteristics. With the aim of detecting hidden and interesting information amidst the available data, the main
challenges encountered when detecting outliers are:
• Wide variety of techniques and models – it is difficult to choose and adapt an already existing method to a
specific problem, this being one of the steps researchers normally spend the most time on. Some models
are even infeasible for certain problem formulations.
• Data scale – data related problems tend to increase in size not only in the number of observations but
also on their complexity. This often leads to processing and resource limitations. Applications that handle
stream data are the most prone to such limitations, since these require fast processing as new data arrives
continuously.
• Boundaries – boundaries between normal and abnormal behaviours are many times not precise, being very
difficult to correctly classify data which lies near them.
• Malicious actions – outliers are often blended with normal data. This worsens if the anomalies are inten-
tional, caused by malicious agents [30]. Such anomalies are normally very similar to normal instances,
making them very challenging to distinguish.
• Continuous evolution – the notion of normal behavior continuously changes in many domains, meaning
that data classified as abnormal today may not be labeled as such tomorrow.
• Notion of anomaly – different research fields have different notions of anomalies. Applying the same
technique in one domain does not necessarily mean it should be applied in other domains.
• Availability of labeled data – one of the major issues in outlier detection is the lack of labeled or training
data, mainly in supervised methods. The validation of the results obtained is thus difficult and requires
in-depth knowledge of the data and research field in question.
• Data pre-processing – before applying outlier detection techniques, data requires normalization and/or
filtering in order to remove complications that deteriorate the performance of the applied methods. Time-
series data, for example, is often desired to have constant spacing between observations with specific time-
stamps.
• Reproducibility – some techniques require initialization which, when random, can significantly change the
results obtained. Models which are very sensitive to parameter changes are also prone to greatly influence
the results obtained even when the data and methods are the same.
• Validation – the performance and quality of the obtained anomaly classifications greatly depend on the do-
main and data in hand, being often not advisable to compare techniques in general but only on the performed
tests. A method performing better than another for specific data-sets does not mean this paradigm is valid
for every situation.
• Overfitting – sometimes the simplest approaches are the most fortunate, since these do not adapt exces-
sively to data peculiarities. For supervised techniques, an overfit model will most probably classify normal
instances as abnormal if these did not appear regularly in the training data. A good approach for reducing
overfitting is to repeatedly partition the data into training and testing sets in a random way.
It is to note that in many cases the separation between noise and anomalies is not well defined, resulting in
many data points created by noise being considered interesting due to their high deviance from normal behavior.
Although these instances are desired to be detected and eliminated, they turn out to carry uninteresting information,
wasting time and effort when examining them.
2.2 Outlier Detection Methods
Without further ado, a set from a wide range of outlier detection methods [26, 27] is considered with the aim of
providing an insight into the problem in hand and the world of outlier detection as a whole. These are organized
according to their specifications. A more in-depth focus is placed on temporal data [25] and sequence [31]
anomaly detection, allowing the assimilation of the required basics before proceeding. A possible way to classify
outlier detection techniques [25, 27] is:
• Regression-based – Regression methods are extensively used in TS data. They fit equations or models
to data sets, considering outliers as instances with high residuals when compared to normal data instances. An
example are AutoRegressive (AR) models [22, 32], which state that each data point linearly depends on previous
values and on a stochastic term representing a random process. AR models can be enhanced and made
more robust by combining them with other techniques. By additionally modeling a moving average of past
error terms, autoregressive moving-average (ARMA) [33] models are built. To handle non-stationary data,
where trends make older observations less representative for the estimation of new data, Autoregressive
Integrated Moving-Average (ARIMA) models [34, 35] add a differencing step.
• Distance-based – In distance-based techniques, the distance or similarity between data instances is computed.
Outliers are believed to be isolated from the rest of the data. A k-th nearest neighbor (KNN) technique
has the objective of determining, for each test instance, the distance to its k-th nearest neighbor in the train-
ing set, these distances being used as anomaly scores [27]. The same reasoning can be applied to discrete sequences
[26, 36, 37] using the length of the longest common subsequence as a measuring distance. Some of these
methods also determine the density of the neighborhood of each data instance. One example is the Local
Outlier Factor (LOF) algorithm [38], which compares the local density of the test sample with that of its k nearest
neighbors. Its use in an outlier detection scenario is discussed in [39].
• Clustering/Classification-based – Classification and clustering methods [40] provide additional insight on
outlier detection problems. Note that these types of methods are not equal, even though they are associated
by nature. Classification techniques are supervised, assuming that training data is already labeled. A model is
learned with the aim of classifying test instances into one of the already known classes. Clustering methods,
on the other hand, have the goal of designing the classes themselves. These are unsupervised. Classification
methods can be multi-class or one-class when employed in an outlier detection environment. Multi-class
algorithms try to classify test instances into one of the already present normal classes. If no class yields
sufficient confidence for a certain test instance, it is classified as anomalous. One-class methods
normally learn a boundary or region where all normal data is contained. If a certain test instance does
not fit inside that region, it is classified as anomalous. An example are one-class support vector machines
(SVM) [41], which learn a region containing the training data. A boundary is computed around the region of
normal behavior, and each data instance is evaluated according to its position regarding the boundary. The
aforementioned have proven effective at detecting anomalies in several applications [42, 43]. Note that SVMs only
handle vectors, meaning that to apply the same procedures to TS data a conversion is needed [41].
One type of clustering technique is based on Expectation-Maximization (EM) [44] methods, capable of
finding parameter estimates which determine the distribution of the data. The latter is commonly used when
considering Gaussian Mixture Models (GMM) [17]. Clusters created using training data can then be used to
classify test instances, anomalous samples being expected to deviate from the clusters. A robust
EM algorithm used in outlier detection is established in [45]. GMMs are considered in the current study in
Sec. 4.4.2 and in existing literature [46].
• Pattern-based – By employing pattern-mining techniques, patterns in a sequence whose frequency of occur-
rence is anomalous are detected. Time series can be viewed as sequences, allowing the use of mechanisms
which, for example, measure the discrepancy between the frequency of occurrence of a context in a sequence
and that expected given the observed data. An example uses suffix trees [47], containing all the suffixes
of a given sequence, which allow the detection of abnormal subsequences within each time series. Pattern
frequency-based anomaly detection is beneficial in detecting interesting subsequences, with applications in
numerous domains [48].
• Probabilistic/Statistical-based – There is usually confusion when comparing probability theory and statis-
tics. These universes are inverses of one another: while probability theory tries to interpret problems consid-
ering randomness or uncertainty modeled by random variables, statistics observes something that happened
and tries to figure out what process explains those observations. With the aim of providing a clear un-
derstanding of the main statistical and probability methods in outlier detection, a brief overview of such
techniques is provided, mainly using TS data. As Persi Diaconis once said, “The really unusual day would
be one where nothing unusual happens” [49]. As a rule of thumb, statistical methods fit a statistical model
to data in order to shape normal behavior. Tests are then performed to determine if the probability of an in-
stance is high enough for it to be considered normal. A more in-depth analysis of such techniques is considered in
Chapter 3, the proposed approach being based on dynamic Bayesian networks, which have already proven to
benefit anomaly detection [10–13]. A technique based on Markov processes is likewise studied in Sec. 3.2.
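As a concrete sketch of the distance-based family listed above, the illustrative function below (not the thesis's implementation; names are assumptions) scores each test instance by the Euclidean distance to its k-th nearest neighbor in a training set, larger distances suggesting isolation.

```python
import math

def knn_scores(train, test, k):
    """Anomaly score of each test point: distance to its k-th nearest
    neighbour in the training set (a larger distance means more isolated)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scores = []
    for q in test:
        distances = sorted(dist(q, p) for p in train)
        scores.append(distances[k - 1])
    return scores

# A point inside a tight cluster scores low; a far-away point scores high.
train = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(knn_scores(train, [(0.5, 0.5), (5, 5)], k=2))
```

The naive loop above is quadratic in the number of points; practical implementations use spatial indexes, but the scoring idea is the same.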
The main approach studied in the current thesis resides in the probabilistic/statistical paradigm. It builds a
statistical model representing conditional dependencies between variables as well as across time.
Chapter 3
Statistical Modeling
Statistical methods attempt to explain a population, defined as a set of observations or subjects which are of
interest for some application. These methods fit a statistical model to data in order to shape normal behavior.
Instances are then tested against the model to measure their outlierness levels.
3.1 Data
The implemented application is designed to handle MTS. The latter must be discretized before being introduced
to the outlier detection apparatus, as considered in Section 4.1. Other types of data, like symbolic sequences,
can be loaded as well if in the right format. Missing values should be imputed or considered as an additional
symbol beforehand. This is discussed later, as are precautions regarding the sampling rates of each variable.
A univariate TS $X_i[t]$ is seen as a set of observations along a consistent time rate, represented as
$$X_i[t] = x_i[1], \ldots, x_i[T], \qquad i \in \mathbb{N}, \tag{3.1}$$
expressing variable i along T time stamps. For the case of MTS, each row of a data-set represents a subject
identified by its row index. Subjects are MTS with a common number of variables n and width T . Each column
depicts a certain variable at a certain time stamp. In other words, a subject Sj of length T and n variables is
represented as
$$S_j = x_1[1], \ldots, x_n[1], \ldots, x_1[T], \ldots, x_n[T], \qquad j \in \mathbb{N}, \tag{3.2}$$
where $x_i[t]$ depicts the discrete value of variable $i \in \{1, \ldots, n\}$ at time stamp $t \in \{1, \ldots, T\}$ of subject $S_j$ with row
index j. A subject is thus a combination of univariate TS, each represented by Eq. (3.1). Columns are sorted
according to time steps rather than variable indexes. Many applications require anomaly detection to be engaged on
TS originating from sensor devices. These are normally not discretized and can also reveal unwanted offsets. To
tackle such issues, pre-processing is employed prior to the outlier detection mechanism.
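The row layout of Eq. (3.2) can be produced with a small helper. The sketch below is illustrative (the function name is an assumption): it flattens n equal-length univariate series into a single subject row ordered by time stamp first and variable index second.

```python
def subject_row(series_by_variable):
    """Flatten n univariate series of common length T into one subject row
    ordered as in Eq. (3.2): x1[1], ..., xn[1], ..., x1[T], ..., xn[T]."""
    n = len(series_by_variable)
    T = len(series_by_variable[0])
    assert all(len(s) == T for s in series_by_variable), "equal length required"
    return [series_by_variable[i][t] for t in range(T) for i in range(n)]

# Two discretized variables observed over three time stamps.
x1 = ['a', 'b', 'a']
x2 = ['c', 'c', 'd']
print(subject_row([x1, x2]))  # time stamp varies slowest, variable fastest
```

Sorting columns by time step first keeps all variables of the same time stamp adjacent, which matches the dataset layout described above.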
3.2 Markovian Models
Markovian models are commonly used to predict randomly changing systems, making use of the Markov property.
In such a scenario, a system’s future states only depend on its current state and not on past events. Given the
present state, a system’s past and future are thus conditionally independent. The Markov property is often preferred
when performing probabilistic forecasting.
In outlier detection, Markov models are typically used for detecting sequence-based and contextual anomalies.
These use the conditional probability for each symbol in a data sequence conditioned on the symbols preceding it
[31]. The probability of observing each symbol is thus predicted using the probabilities of its preceding symbols. If the
expected behavior is significantly different from the observed, an anomaly is detected. Note that these methods are mostly
employed for discrete data sequences and TS data. However, any type of data including continuous sequences can
be discretized.
A set of training sequences is used for the computation of the conditional probabilities that will represent the
data-generating distribution. After learning the model’s parameters, testing involves determining the likelihood
of test sequences given the parameters. The first and most basic model is referred to as the Markov chain. A Markov
chain is a sequence of discrete-time random variables X1, . . . , Xn that respect the Markov property, in other
words, a “memory-less” stochastic process. Considering the possible states i1, . . . , in of the random
variables, the Markov property dictates that
$$P(X_n = i_n \mid X_{n-1} = i_{n-1}) = P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}), \tag{3.3}$$
meaning that the system only requires knowledge regarding the previous state to determine the probability distri-
bution of the current state. The probability of a sequence of states i1, . . . , in is computed as
$$P(i_1, \ldots, i_n) = P(X_1 = i_1) \prod_{j=2}^{n} P(X_j = i_j \mid X_{j-1} = i_{j-1}), \tag{3.4}$$
which corresponds to a “chain” computing the probability of each state ij given its previous state ij−1. The
probability of the initial state i1 is determined using a prior probability. A transition matrix can thus store a
Markov chain’s evolution through time, comprising each transition probability between states. The product of these
probabilities describes the system’s transitions along time.
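Eq. (3.4) translates directly into code. The sketch below is illustrative (prior and transition probabilities are stored in plain dictionaries, and the example numbers are made up): it multiplies the prior of the initial state by the chained transition probabilities, an unseen transition yielding probability zero.

```python
def chain_probability(states, prior, transition):
    """Probability of a state sequence under a first-order Markov chain,
    per Eq. (3.4): P(i1) multiplied by each P(i_j | i_{j-1}).

    prior: dict mapping state -> P(X1 = state)
    transition: dict mapping (prev, cur) -> P(cur | prev)
    """
    p = prior[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition.get((prev, cur), 0.0)  # unseen transition -> 0
    return p

prior = {'A': 0.6, 'B': 0.4}
trans = {('A', 'A'): 0.7, ('A', 'B'): 0.3, ('B', 'A'): 0.5, ('B', 'B'): 0.5}
# P(A) * P(B|A) * P(A|B) = 0.6 * 0.3 * 0.5
print(chain_probability(['A', 'B', 'A'], prior, trans))
```

In practice, low sequence probabilities can be flagged as anomalous, and smoothing is applied so that a single unseen transition does not annihilate the whole sequence's probability.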
A useful characteristic of discrete sequences is the short-memory property [50]. The latter dictates that the con-
ditional probabilities in Eq. (3.4) can be accurately approximated by a window of preceding symbols. For the case of a
discrete sequence a1 . . . an,
$$P(a_n \mid a_1 \ldots a_{n-1}) \approx P(a_n \mid a_{n-k} \ldots a_{n-1}), \tag{3.5}$$
for some value k > 1. A discrete sequence a1 . . . an can be encoded into a set of states according to the different
possible configurations it can assume. Each position ai, i ∈ {1, . . . , n}, contains a symbol σ ∈ Σ from an alphabet Σ. In
a fixed k-th order Markov model, each state corresponds to the final k symbols an−k . . . an−1 in the sequence being
modeled [22]. A transition is thus the addition of the element an at the end of the sequence. The system transitions
from state an−k . . . an−1 to state an−k+1 . . . an, the transition probability being P(an|an−k . . . an−1). Hence,
the probability of a sequence can be computed using Eq. (3.4), where windows of size (k+1) are retrieved from the
discrete sequence to determine the sequence of states it represents. A Markov model of order k declares that each
state corresponds to the subsequence of the final k symbols of the sequence being modeled, meaning that higher
values of k correspond to more stored memory. Note that this memory refers to the windows used to compute each
state and not to the memory between states. Hence, the model holds the Markov property.
In the present discussion, models are assumed to be fixed. Fixed Markovian techniques use windows of fixed
length k to estimate the necessary conditional probabilities. The conditional probability of occurrence of a given
symbol ai is estimated as
$$P(a_i \mid a_{i-k} \ldots a_{i-1}) = \frac{f(a_{i-k} \ldots a_i)}{f(a_{i-k} \ldots a_{i-1})} = \frac{N_{ij}}{N_i}, \tag{3.6}$$
where f(ai−k...ai) denotes the occurrence frequency of subsequence ai−k...ai in the set of extracted sequences,
Nij the number of observations in which the system was at state i at time t and at state j at time t + 1, and Ni
the total number of occurrences where the system was at state i at time t.
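The count-ratio estimation of Eq. (3.6) can be sketched as follows (an illustrative helper, not the thesis's code): windows of length k + 1 are counted against their length-k contexts over a set of training sequences.

```python
from collections import Counter

def estimate_conditionals(sequences, k):
    """Estimate P(a_i | a_{i-k}...a_{i-1}) by the count ratio of Eq. (3.6):
    occurrences of length-(k+1) windows over those of their length-k context."""
    num = Counter()  # counts of (context, next symbol) pairs
    den = Counter()  # counts of the context alone
    for seq in sequences:
        for i in range(k, len(seq)):
            ctx = tuple(seq[i - k:i])
            num[(ctx, seq[i])] += 1
            den[ctx] += 1
    return {key: num[key] / den[key[0]] for key in num}

# Over these training sequences, 'a' is always followed by 'b'.
probs = estimate_conditionals(["ababab", "abba"], k=1)
print(probs[(('a',), 'b')], probs[(('b',), 'a')])
```

Contexts with few occurrences yield unreliable ratios, which motivates the frequency thresholds and variable-order models discussed later in this chapter.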
A Markovian model can be visualized as a set of nodes representing each possible state and a set of edges
between nodes. Edges represent connections between nodes which have associated conditional probabilities. The
system transitions between two states through the corresponding edge. Markovian models are capable of encoding
TS data, granting a basis for normal behavior. The probability of occurrence of each transition can be examined as
an anomaly score.
A similar approach is used in [51] to develop an anomaly detection technique which represents the normal behavior of
sequence-based data in network and computer intrusions using a simple first-order Markov chain. The probabilities
of sequences of states are computed in order to detect anomalous actions. The likelihood of a sequence is seen as
an anomaly score according to its occurrences in the training set.
A Markov chain can be viewed as a Finite State Automaton (FSA), which represents the system as a machine
that can be in exactly one of a finite number of states at a given time. The system requires a list of states, its initial
state and the conditions for the transitions. A finite state machine is said to accept an input if it ends at an acceptance
state. FSAs are commonly used for predicting future events and detecting system call intrusions [6, 52].
3.2.1 Variable and sparse Markov models
Fixed Markovian techniques have been discussed. These force each symbol to be conditioned by a fixed number
of previous symbols. Sometimes it is not possible to reliably estimate the conditional probabilities when consider-
ing subsequences with a low frequency of occurrence. As mentioned in [26], consider a situation where a certain
subsequence S1 of size n only occurs once in the training data and is followed by the symbol j. If we employ
an nth-order model, the conditional probability P (j|S1) has a value of 1 even though the subsequence only occurred
once in the training dataset. Although legitimate, the estimated conditional probability relies solely on one observation,
which is probably insufficient to generalize to the whole dataset. To solve such issues, the size of the subsequence
S1 should be reduced. The smaller subsequence will probably appear more frequently in the training data. The
frequency of the subsequences is thus subject to a threshold in order to determine if the computed probabilities
are reliable.
Variable-order models can, for instance, adapt to the frequency of certain states in the data. States with very
low frequency of occurrence should be pruned and replaced with low-order generalizations. Such characteristics
are present in Probabilistic Suffix Trees (PST) [50]. Generalizations allow the reduction of the number of states
in the model and increase matching in the training phase. Sequential outliers in protein sequences are addressed
in [53] using PSTs. In the present thesis a PST approach is developed to handle MTS datasets, which is then
compared with the proposed approach through experimentation.
PSTs are an example of variable length Markovian techniques. A discrete TS is perceived as a sequence, also
called string, s = s1...sl, with maximal length l. Each position can assume a symbol σ ∈ Σ from
the TS’s alphabet Σ. Like in fixed Markovian models, a state si, 1 ≤ i ≤ l, is conditioned by its preceding
subsequence s1...si−1, denoted as context c = c1, ..., ck. In a variable length Markov chain, the size of the window
L comprising a state’s preceding symbols is particular to each context, rather than being fixed. The core concept
behind a PST is the short memory property [50]. For each context c = c1, ..., ck preceding a symbol σ ∈ Σ, there
exists a length L, 0 ≤ L ≤ k, such that
$$P(\sigma \mid c_1, \ldots, c_k) \approx P(\sigma \mid c_{k-L}, \ldots, c_k), \tag{3.7}$$
where 1 ≤ k ≤ l − 1 [54]. Thus, the length of the Markov chain is not fixed, with symbols being conditioned
on previous windows of different lengths depending on the context c. A PST training algorithm estimates
the length L for each context appearing in the training data, capturing the contexts c = c1, ..., ck of maximum
length. Conditional probabilities P (σ|c) using these contexts are compared with the ones obtained using their
longest suffix, P (σ|suf(c)), where suf(c) = c2, ..., ck. By using Eq. (3.7), conditional probabilities are compared
according to P (σ|c) ≈ P (σ|suf(c)). If the latter probabilities are similar, the PST solely stores P (σ|suf(c)),
thus reducing the number of parameters without losing too much information.
A PST is proficient in determining the probability P (s) associated with any sequence s, provided the latter is
formed by symbols σ ∈ Σ. A sequence’s probability is computed as
$$P(s_1 \ldots s_l) = P(s_1)\, P(s_2 \mid s_1) \cdots P(s_l \mid s_1 \ldots s_{l-1}), \tag{3.8}$$
where each state in the sequence is conditioned on the past observed states. Conditional probabilities are retrieved
efficiently from a tree structure. It is worth noting that unseen patterns can disrupt Eq. (3.8), since a null value of
any conditional probability causes the whole sequence to be impossible. To tackle such a matter, a procedure known
as smoothing is applied, removing null values from every probability. The latter is likewise applied in the proposed
approach, discussed in Sec. 4.3.1.
States are organized in a suffix tree, where each node has its longest suffix as a parent. Each leaf represents a
memory of the associated Markov chain [55]. A PST is a suffix tree with the addition that conditional probabilities
are stored in each node. Nodes are labeled by a sequence representing a path from that node to the tree’s root.
Each node contains the count of its associated sequence s as observed in a training dataset and the conditional
probabilities P (σ|c) of observing each symbol σ right after s, denoted next-symbol conditional probabilities. Edges
Figure 3.1: An example of a PST with maximum depth 3, highlighting consulted nodes when computing the probability of sequence abaab. Each node is represented as a container incorporating the next-symbol conditional probabilities for a (green) and b (grey) and the number of occurrences of their corresponding sequence in the dataset. Node e is the root of the PST. Edges are represented by a symbol σ ∈ {a, b}. Nodes that are marked with a red line are pruned. Adapted from [54].
represent symbols σ ∈ Σ, denoting the next character of the sequence. This means that each node has at most |Σ|
children. The root node possesses the prior probabilities P (σ) of each symbol.
With a dataset comprising multiple sequences, training consists in counting the occurrences of each pattern.
The number of occurrences decreases with depth, meaning that the computation of P (s3|s1s2) uses fewer observa-
tions than that of P (s2|s1), since the context of the latter is a suffix of that of the former. If the number of occurrences of a pattern is low, the
probability of a sequence containing it is affected. However, longer sequences are more accurately modeled since
these use more information, according to Eq. (3.8). The initial positions of any sequence are thus generally badly
estimated, due to lack of memory. There should thus be a trade-off in the complexity of a PST.
The size of the tree grows exponentially with memory length. The tree’s maximum depth and the alphabet
size |Σ| thus determine its size. An overly-complex PST suffers from overfitting and is
inappropriate when considering rarely observed data, meaning that for rare patterns the probability of observing a
certain symbol after their corresponding suffix is deceptively high. A procedure known as pruning is capable of
controlling the size of a tree, avoiding the aforementioned. Pruning can be done according to the frequency of the
nodes: if certain paths in the tree did not appear at least nmin times in the training data, the PST is allowed to
prune them, nmin being a threshold on the number of occurrences. Nevertheless, the main reasoning behind pruning
is to remove nodes which do not grant additional information with respect to their parent, meaning that their
conditional probabilities are similar for every symbol σ ∈ Σ. A gain function G1(c) [54] is capable of determining
the information gain between related nodes. A node represents an information gain for some symbol σ if
$$G_1(c) = \sum_{\sigma \in \Sigma} \left[ \frac{P_T(\sigma \mid c)}{P_T(\sigma \mid \mathrm{suf}(c))} \geq C \ \text{ or } \ \frac{P_T(\sigma \mid c)}{P_T(\sigma \mid \mathrm{suf}(c))} \leq \frac{1}{C} \right] \geq 1, \tag{3.9}$$
where [·] denotes the indicator function and C is a threshold on the ratio between the conditional distributions of the node and its parent.
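The spirit of the pruning criterion in Eq. (3.9) can be sketched as a simple check (illustrative only; distributions are stored as dictionaries and the function name is an assumption): a node is kept when, for at least one symbol, its conditional probability deviates from its parent's by a ratio beyond C or below 1/C.

```python
def keep_node(p_node, p_parent, C):
    """Decide whether a PST node adds information over its parent (the node
    of its longest suffix), in the spirit of the gain criterion of Eq. (3.9):
    keep the node if some symbol's probability ratio lies outside [1/C, C]."""
    for sym, p in p_node.items():
        q = p_parent.get(sym)
        if not q:
            continue  # skip symbols the parent never predicts
        ratio = p / q
        if ratio >= C or ratio <= 1.0 / C:
            return True
    return False

parent = {'a': 0.5, 'b': 0.5}
redundant = {'a': 0.52, 'b': 0.48}    # nearly identical to parent: prune
informative = {'a': 0.9, 'b': 0.1}    # strong deviation: keep
print(keep_node(redundant, parent, C=1.5), keep_node(informative, parent, C=1.5))
```

Nodes failing the check are pruned, since their longest suffix already predicts the next symbol almost as well.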
Consider a PST whose structure is shown in Fig. 3.1, modeling sequences from an alphabet Σ = {a, b}.
Numbers inside each node represent the number of occurrences of the associated sequence. The sequence of a node
is determined by the inverse of the path starting from the root node and finishing in that node, travelling along each
edge represented by a symbol σ ∈ {a, b}. Observing Fig. 3.1, the node highlighted in orange describes sequence ba,
storing the conditional probabilities P(σ|ba). Suppose the probability of sequence s = abaab is requested.
According to Eq. (3.8), P(abaab) = P(a)P(b|a)P(a|ab)P(a|aba)P(b|abaa), with each conditional probability
retrieved from the nodes highlighted in blue and orange in Fig. 3.1. Note that the value of P(a|aba) is fetched from
node ba instead of node aba, since the latter was pruned; the node of its longest suffix is thus used.
In a PST, outliers are defined as patterns which appear rarely in comparison with the frequent patterns the tree has learned.
A sequence s is scored as

$$\mathrm{logloss}(s) = \frac{1}{l} \sum_{i=1}^{l} \log_2 P(s_i \mid s_1 \ldots s_{i-1}) = \frac{1}{l} \log_2 P(s), \qquad (3.10)$$

where l is the length of s, representing the average log-loss of the sequence [54]. The latter depicts the residual between the prediction of the
tree and a perfect prediction.
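As a minimal sketch of this scoring step (toy numbers, not the thesis model), a PST can be stored as a dictionary of contexts, with the longest-suffix fallback used for pruned contexts as in the P(a|aba) example above:

```python
import math

# Sketch of PST scoring per Eq. (3.10), with longest-suffix fallback for
# contexts whose nodes were pruned. Tree and probabilities are illustrative.

tree = {
    "": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.3, "b": 0.7},
    "b": {"a": 0.8, "b": 0.2},
    "ba": {"a": 0.6, "b": 0.4},
}

def cond_prob(tree, context, symbol):
    # Fall back to the longest suffix of the context present in the tree.
    while context not in tree:
        context = context[1:]
    return tree[context][symbol]

def avg_logloss(tree, s):
    # Eq. (3.10): more negative = more anomalous under this model.
    total = 0.0
    for i, sym in enumerate(s):
        total += math.log2(cond_prob(tree, s[:i], sym))
    return total / len(s)

print(round(avg_logloss(tree, "abaab"), 3))
```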
PSTs are thus capable of capturing high-order dependencies in sequences while remaining simple to estimate,
and have an established literature in pattern detection [47, 56]. However, they only handle univariate
TS. Due to the lack of freely available MTS outlier detection systems, in the present work a multivariate PST
approach is built with the help of an existing modeling algorithm [54] and compared with the proposed approach.
This baseline, detailed in Sec. 5.1.2, is based on the modeling of multiple PSTs, one for each variable. The
objective is to demonstrate the importance of inter-variable relations in outlier detection and establish a basis for
comparison.
Methods like Interpolated Markov Models (IMM) can also be used to compute the variable-length conditional
probabilities of a symbol. In biology, such models are used to find genes in DNA sequences [57].
IMMs can additionally be used to detect frequency-based anomalies [48].
Variable-order Markovian techniques, although allowing different symbols to be conditioned on previous windows
of different lengths, still require those windows to be contiguous and immediately preceding the symbols. In other words, the
conditioning sequences cannot be sparse. A further variety of Markovian techniques addresses sparse histories in their
conditional probability estimations: particular positions in the window may be filled by any symbol, turning
the windows sparse.
One sparse Markovian technique, known as Sparse Markov Trees (SMT) [58], estimates the probability distribution
conditioned on input sequences. SMTs are very similar to PSTs except that they allow the introduction
of wild cards, being known as a generalization of PSTs. A sparse Markovian method worth mentioning is RIPPER
[59], which can also be seen as a classification algorithm. It builds sparse models from the extraction of
k-length windows at a training stage, the objective being to predict the k-th term according to the previous (k − 1) sym-
bols. This strategy is used in [60] for detecting intrusions in Unix systems: regions of the test sequences
are classified as abnormal if a certain percentage of failed predictions is reached in those particular regions. As
evaluated in [31], RIPPER is more flexible than FSAs when conditioning the probability of events on
preceding symbols, as it smooths the likelihood score of rarely seen patterns. Sparse and variable-order mod-
els can be greatly effective in some applications, offering better flexibility in terms of the size of context used.
Sparse Markov models are often used when handling DNA sequences and protein data [58, 61] due to their
performance in such environments. We recall, however, that different studies arrive at different conclusions:
there is no perfect model, and fixed-order Markovian techniques often outperform their variants.
3.2.2 Hidden Markov models
Hidden Markov models (HMM) are similar to the previously described Markovian techniques: they generate
sequences of transitions among states in a Markov chain. The difference is that in a Markov chain the transition
sequences are directly visible, whereas in an HMM the precise state sequence through which the model passes is unknown.
States can be defined in the modeling stage by an analyst with a clear understanding of the context specific to
the system, according to observations considered relevant. Although the states can
be formulated, there is no direct knowledge about the sequence of transitions required to reach them, and they are thus
denoted as hidden. A sequence of transitions corresponds to an observed data sequence. Each state carries a set
of emission probabilities, one for each symbol; when a state is reached, one of those symbols is emitted. This symbol
generation depends on previous history, meaning that successive states are not independent of each
other. HMMs are commonly used to capture temporal correlations in data sequences.
Figure 3.2: Three-state hidden Markov model.
Observing the example in Fig. 3.2, the model has three states A, B and C, each of which can emit
one of two symbols, S1 or S2. The emission probabilities are shown next to each state, and there are also probabilities
for each transition between states. When the system is in state A, it has a 20% probability of emitting symbol S1
and 80% of emitting symbol S2, while at the same time having a certain probability of changing state. Probabilities are
estimated from the observations of a training set; although these are known, the exact state of the system
at a given instant is not.
Large training sets are required in order to avoid overfitting. Without prior knowledge about the system's
generating process, a model with n states possesses n2 transitions, as every state has a transition to every other state.
The analyst is advised to understand the system's behavior in order to eliminate transitions; domain knowledge is
hence valuable.
HMMs are composed of a hidden transition matrix and an observation matrix, with a symbol emission distri-
bution for each state. Three main procedures exist. The training phase estimates the model's initial, transition and state
emission probabilities using training sequences; an Expectation Maximization (EM) approach can be employed
[62]. In the testing phase, the fit probability of each test sequence under the HMM is determined, and anomaly scores are
associated to each sequence. In the final phase, explanation, the most probable sequence of states that generated
each test sequence is uncovered, giving an understanding and explanation for every classified anomaly. For ad-
ditional details consider [62]. An HMM is thus capable of not only learning a model that best describes the system's
behavior, but also of determining the probability of each test sequence and additionally explaining it.
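As a toy numeric illustration of the testing phase, the sketch below computes the fit probability of a short symbol sequence with the forward algorithm. Only state A's emission probabilities (20%/80%) come from the example above; the initial, transition and remaining emission probabilities are invented for illustration.

```python
# Minimal forward-algorithm sketch for scoring a test sequence under an HMM
# like the one in Fig. 3.2. All numbers except A's emissions are assumed.

states = ["A", "B", "C"]
init = {"A": 1.0, "B": 0.0, "C": 0.0}            # assumed: start in A
trans = {"A": {"A": 0.5, "B": 0.3, "C": 0.2},    # assumed transition matrix
         "B": {"A": 0.2, "B": 0.6, "C": 0.2},
         "C": {"A": 0.1, "B": 0.3, "C": 0.6}}
emit = {"A": {"S1": 0.2, "S2": 0.8},             # from the text
        "B": {"S1": 0.7, "S2": 0.3},             # assumed
        "C": {"S1": 0.5, "S2": 0.5}}             # assumed

def forward(obs):
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())   # total sequence likelihood

print(forward(["S2", "S1", "S1"]))
```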
Note that the methods in the evaluation and explanation stages require a considerable amount of time due to the
recursive nature of the algorithms, so complex models are avoided when possible. Although the computational
complexity and the sensitivity to parameter selection are seen as disadvantages of HMMs, they are widely
used in the existing literature with favorable results, typically in the detection of malicious activity [6, 7]. A PST
approach for protein sequences is compared with existing HMM literature in [63].
3.3 Bayesian models
A Bayesian network (BN) is a probabilistic graphical model which encodes conditional relationships among
variables. It is composed of two components: a directed acyclic graph (DAG) encoding dependencies between
random variables, and a set of parameters defining the local conditional distributions of the attributes.
Random variables are assumed to be discrete with a finite domain. The main idea behind employing these
techniques in outlier detection is the posterior probability estimation of observing a class label given
evidence. Inference computes the probability of variables given the knowledge encoded in the observed vari-
ables. Models are obtained from learning algorithms together with expert knowledge particular to each problem.
The class label with the highest posterior probability is assigned to each instance; if an anomalous class is
selected instead of any of the normal classes, the instance is classified as abnormal. BNs are also commonly used
for clustering and classification purposes.
Learning the structure of a BN [64] can be summarized as finding the DAG which may have generated the data.
A training set allows the network to estimate the likelihood of an instance belonging to each class as well as each
class prior probability. Consider random variables X = (X1, . . . , Xn), assumed discrete with a finite domain.
The DAG represents the set of variables (nodes) and their conditional dependencies (edges). Nodes that are not
connected represent variables that are conditionally independent: two random variables X and Y are conditionally
independent given a third random variable Z, written X ⊥ Y | Z, if and only if P(X, Y |Z) = P(X|Z)P(Y |Z). In other
words, knowing the value of Z, the value of X provides no information regarding the value of Y, and
vice versa. Edges linking nodes indicate that one variable directly influences the
other. A node Xp is said to be a parent of node Xs if there is an edge directed from Xp to Xs. Each node
is associated with a function which receives the node's parent variables and outputs the probability of the variable
expressed in that node. This local probability distribution stores a set of parameters encoding the probability of each
possible configuration of a node Xi given its parents pa(Xi),
P (Xi = xik|pa(Xi) = mij), (3.11)
where xik is the k-th possible value from the domain of Xi and mij the j-th possible configuration of the set of
variables pa(Xi). The network's global probability distribution is the joint probability distribution com-
posed of the several local probability distributions associated with each variable. By invoking the Markov property,
the global probability distribution becomes

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i)), \qquad (3.12)$$
where n is the total number of variables. The joint probability depicted in Eq. (3.12) can be used to compute
the probability of an evidence set. The set of conditional probabilities associated with each node constitutes the
network's parameters. The local distributions provide the prior probabilities for attributes with no parents and
conditional probability tables for nodes with parents. A variable Xi is independent of all its non-descendant nodes
given its parents pa(Xi). Since BNs are probabilistic models, the laws of probability can be invoked, notably Bayes'
theorem

$$P(Q \mid e) = \frac{P(e \mid Q) \, P(Q)}{P(e)}, \qquad (3.13)$$

where P(Q|e) represents the posterior probability of a distribution Q, P(e|Q) the likelihood of Q given known
evidence e, and P(Q) the prior probability of Q. The probability of the evidence, P(e), normalizes
the result so that it sums to 1. Bayes' theorem in Eq. (3.13) allows probabilities to be updated according to new evidence.
For the present work, the outlier detection scenario can be seen as a classification problem between two classes,
normal and anomalous. Each attribute is thus evaluated according to its context using a trained BN; if a cer-
tain criterion is not met, the attribute is considered anomalous. A typical practice is the employment of scoring
functions.
Multivariate categorical data can be used in such networks by aggregating the posterior probabilities of each
test instance in order to assign a class label.
3.3.1 Bayesian network outlier detection
Several Bayesian network-based techniques have been considered for outlier detection in a wide range of applica-
tions and have instigated numerous research works [65, 66]. A disease outbreak detection system [67] used
a BN to model a baseline distribution over health-care data; monitored data was then compared with the baseline
for anomaly prospecting. Data collected from the previous 24 hours allowed for an adaptable BN, which was remod-
eled daily. The joint probability of the data was used, conditioning on attributes such as season, day of the week
and current flu level. A high flu level may not indicate an outbreak, since it is typically normal in rainy seasons,
hence the relevance of adopting a Bayesian approach. Rules engaging a scoring function determined the degree of
similarity of a subset of data with respect to the baseline. The highest score of the day had its p-value computed
and then compared to a threshold for the final decision.
3.3.2 Dynamic Bayesian networks
One of the main challenges encountered within the scope of this thesis is the data's temporal component. Time
series are perceived as vectors of stochastic variables observed at regular intervals of time. MTS contain not only
inter-variable dependencies but also temporal dependencies, so relations within series are also
present. Dynamic Bayesian Networks (DBN) [4] are BNs which relate variables over adjacent time steps. These
can model probability distributions over time, which is of interest when considering complex temporal data. Unlike
standard BNs, DBNs are composed of multiple networks known as transition networks, together with a prior BN
denoting the distribution of the initial states; the latter is a standard BN. A transition network possesses two types of con-
nectivity among variables, denoted inter- and intra-slice connectivity. Intra-slice connectivity refers to variable dependencies at
the same time frame, just like standard BNs, while inter-slice connectivity is responsible for the temporal aspect, which
relates attributes of different time slices.
Data is seen as a collection of subjects, each formed by an MTS. This means that a test subject possesses
multiple variables along time, a single variable thus being seen as a single TS. A DBN is modeled according to the evolution
of each variable TS along discrete time as well as their inter-connections, thus extending the domain
of BNs with time.
A transition network B_{t-1}^{t} of a stationary first-order DBN can be observed in Fig. 3.3, where nodes have an
associated time-stamp. These time-stamps influence the distributions stored in each node, in the sense that variables possess
parameters which depend on time, something absent in standard BNs. It is worth noting that data must be
based on discrete time, meaning that time frames should be well defined prior to the employment of such net-
works. Temporal associations among entities are obtained from consecutive data, since these are typically highly
correlated in time. In the example shown, the DBN is of first order, meaning that nodes at time frame t may be
correlated with attributes at the previous time instant t − 1. The connection between attributes X3[t] and X1[t] rep-
resents the intra-slice connectivity of the transition network B_{t-1}^{t}, which correlates attributes in the same time
frame similarly to a standard BN, while all the others represent the inter-slice connectivity. It is worth noting that in the
example shown, a connection between X3[t − 1] and X1[t − 1] exists although it is not drawn: the
connection belongs to the previous transition network B_{t-2}^{t-1}, which possesses the exact same structure
shown, since we are referring to a stationary DBN. Recall that, a DBN being an acyclic graph, it cannot
contain cycles.
Figure 3.3: Transition network of a stationary first-order DBN.
Considering Fig. 3.3, node X1[t] is conditionally independent of X2[t], leading to the joint probability of the
attributes at time t,
P (X3[t], X2[t], X1[t]) = P (X3[t]|X3[t− 1]) · P (X2[t]|X2[t− 1]) · P (X1[t]|X1[t− 1], X3[t]). (3.14)
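As a toy sketch of evaluating this transition joint (binary variables and made-up CPTs, not from the thesis), each factor of Eq. (3.14) is looked up in its node's table:

```python
# Sketch of Eq. (3.14) for the structure of Fig. 3.3: X3[t] and X2[t] depend
# on their own previous values, while X1[t] depends on X1[t-1] and X3[t].
# All conditional probability tables below are illustrative.

p_x3 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(X3[t] | X3[t-1])
p_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # P(X2[t] | X2[t-1])
p_x1 = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.1, 1: 0.9}}
                                                     # P(X1[t] | X1[t-1], X3[t])

def transition_prob(prev, curr):
    """prev, curr: dicts {'X1': v, 'X2': v, 'X3': v} for slices t-1 and t."""
    return (p_x3[prev["X3"]][curr["X3"]]
            * p_x2[prev["X2"]][curr["X2"]]
            * p_x1[(prev["X1"], curr["X3"])][curr["X1"]])

print(transition_prob({"X1": 0, "X2": 1, "X3": 0},
                      {"X1": 1, "X2": 1, "X3": 1}))   # 0.3 * 0.6 * 0.5
```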
There is flexibility regarding the structural traits of a DBN. It can be stationary, meaning that the transition
network does not vary across the transitions t − 1 → t of the MTS; in other words, the structure in Fig. 3.3
is invariant over time. Data is thus treated equally along time, the transition network B_{t-1}^{t} adapting to the
data's transitions as a whole. Non-stationarity implies that a DBN is formed by a set of transition networks, one
particular to each transition. The latter can adapt more easily to time signatures; however, the model becomes
more complex, which without caution can provoke overfitting. A dataset which possesses anomalies at particular
time-stamps could unintentionally mask them, preventing their detection; to prevent such unwanted behaviors,
prudence is advised from the analyst. MTS data can be divided into two classes: series whose
statistical properties are invariant over time and those whose properties change over time. Identifying such aspects
greatly improves the decisions made when choosing the appropriate model.
The lag associated with the network influences its structure. An L-th-order DBN means that an
attribute Xi[t] at time frame t can possess parent nodes from t − L to t. It is worth noting that, besides the increase
in complexity of the transition networks B_{t-L}^{t} as well as the risks of modeling unwanted signatures or overfitting,
data is also partitioned further: the transition networks B_{t-L}^{t} require
information from L + 1 time frames when considering a lag of L. A transition t − L → t can thus be seen as a
window D_{t-L}^{t} comprising the observations of the attributes belonging to the respective transition
network. Before modeling such networks, one should consider whether there are sufficient measurements to acquire a
credible result.
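As a minimal sketch of this windowing (illustrative names, not the thesis code), slicing an MTS into lag-L transition windows shows why longer lags leave fewer windows per series:

```python
# Each lag-L transition window D needs observations from L+1 consecutive
# time frames, so higher lags partition the data into fewer windows.

def transition_windows(series, lag):
    """series: list of per-time-frame observation dicts; returns the list
    of (lag+1)-frame windows available for fitting transition networks."""
    return [series[t - lag:t + 1] for t in range(lag, len(series))]

mts = [{"X1": 0, "X2": 1}, {"X1": 1, "X2": 1},
       {"X1": 1, "X2": 0}, {"X1": 0, "X2": 0}]
print(len(transition_windows(mts, 1)))   # 3 windows for a first-order model
print(len(transition_windows(mts, 2)))   # only 2 windows for lag 2
```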
Dynamic BNs can also perform inference and discover the most probable sequence, somewhat similarly to the HMMs
discussed in Sec. 3.2.2; these tasks are also known as conditional probability queries. Since the model expresses the
data's normal behavior, an outlier can be characterized by how unlikely it is to have been generated by the model.
Hence, a scoring procedure provides the aforementioned likelihood. The most common choice
is the Log-Likelihood (LL) function, which is based on the logarithm of the probability density function of the
Bayesian network: the more anomalous the evidence, the lower its LL score. A variation of the LL
score known as the Minimum Description Length (MDL) can also be employed, penalizing complex structures through
the number of parameters.
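As a toy sketch of the LL idea (made-up factor probabilities, not a trained DBN), the log-likelihood of an observed transition is the sum of the log conditional probabilities of its factors, and anomalous evidence scores lower:

```python
import math

# LL scoring sketch: each factor probability would come from a trained DBN's
# conditional tables; a single very unlikely factor drags the score down.

def log_likelihood(factor_probs):
    return sum(math.log(p) for p in factor_probs)

normal = log_likelihood([0.8, 0.9, 0.7])      # plausible transition
anomalous = log_likelihood([0.8, 0.9, 0.01])  # one very unlikely factor
print(normal > anomalous)                     # True: anomalies score lower
```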
Learning a DBN is essentially disclosing the set of transition networks which best match a training set. This
is explored in the current work in Section 3.3.4.
3.3.3 Dynamic Bayesian network outlier detection
Bayesian networks are well suited to outlier detection, capturing otherwise hard-to-find correlations among attributes
in high-dimensional data. Time series data can easily be adapted to cope with BNs. The graph nature of BNs allows
analysts to quickly identify relations as well as to trace the root cause of anomalies, such as determining which
attributes contributed the most to the problem. Prediction is likewise a virtue of BNs when handling online
data, allowing anomalies to be detected beforehand, usually signaled by a decrease of the LL score. An anomaly
detection approach using DBNs was developed to handle the detection of aircraft pilot errors taking advantage of
such virtues [10]. One compelling aspect was the use of hidden nodes, similar to HMMs, to represent pilot actions,
which cannot be measured like regular variables from the aircraft sensors. Maximum likelihood estimation of
the parameters was performed given the data and model with an EM algorithm, due to the partial observability
nature of the problem. The goal is not only to identify the occurrence of an anomaly but to track its point of origin
as well. It is assumed that an anomaly appearing at a time instant will spread to related states and succeeding time
slices. The procedure reverts the usual flow of the network and backtracks a detected anomaly until the state which
provoked it is identified. The algorithm can thus report not only the manifestation of an anomaly but also the
past pilot action which probably incited it.
In medical scenarios, the present methods are capable of providing valuable insights into disease processes.
Patients' underlying clinical details change in the course of medical interventions in addition to affliction devel-
opments. DBNs are thus fitting for such scenarios, receiving increasing focus in the literature [11]. They have also
proven to be efficient when modeling gene expression data [13].
Furthermore, Kalman filters [68] are DBNs which estimate and track system states recursively over time.
Probability density functions are approximated using incoming measurements, assuming states are hidden Markov
processes similar to HMMs. Kalman filters can be employed in several domains within the scope of the present
work; an example using real-time streaming data allowed the engagement of DBN approaches in environmental
engineering [12].
Another example of the versatility of DBNs is their application in areas such as image and video processing [69].
The frames of a captured video are contemplated as feature vectors; an MTS is thus formulated by features such as
pixel color through the different frames. Anomalies are classified as sequences of consecutive frames that differ
excessively from normal ones; the LL score of a sequence can thus be used. Alongside other techniques, DBNs
are employed in the developed apparatus by adapting an existing DBN modeling algorithm [14] that supplies an
optimal network when certain criteria are met. Details are discussed further in Chapter 4 with the incorporation of
a sliding-window approach. The final result is a complete and adaptable anomaly detection scheme which tackles
common issues from pre-processing to post-processing, such as discretization and score analysis, respectively.
3.3.4 Model training
Learning a BN can be summed up as finding the DAG and the parameters which maximize the fit to a
training set. This is achieved by comparing each possible structure through a scoring function.
The developed outlier detection technique assumes that sane data originates from stochastic processes which
can be revealed by a modeling framework. With the intention of detecting outliers originating from unusual and
abnormal transitions in MTS, a DBN structure learning algorithm is considered. The prevailing aspiration is the
identification of conditional dependencies amid distinct variables in addition to temporal dependencies. Thus, the
present section presents the reasoning behind an optimal tree-augmented DBN (tDBN) structure learning algorithm
[14], which provides the desired model possessing optimal inter/intra-slice connectivities for each transition net-
work. The modeling algorithm described next is employed in the modeling phase of the proposed outlier detection
approach discussed in Chapter 4.
3.3.4.1 Tree-augmented structure learning
The reason for choosing the tDBN [14] technique as opposed to available DBN learning software [70] is that
the latter typically neglects intra-slice connectivity or structures it in a detached way, meaning that the two
connectivities are modeled separately. Typically, the intra-slice connectivity is then modeled as a standard
Bayesian network, which does not consider temporal relationships. In contrast, tDBN models an optimal network
by determining both connectivities simultaneously, providing a solid representation of regular data. It is assumed
that each variable TS is discrete with a finite number of states. The presence of hidden variables or missing values
is not supported, meaning that such issues must be handled prior to modeling; these matters are discussed further
in Sec. 4.1.
Before proceeding to the description of the learning method's underlying procedures, several concepts
need to be acknowledged. When both connectivities are considered in model training, attaining the globally
optimal network becomes NP-hard [71], contrary to learning solely the inter-slice connectivity, which may not be [72].
NP-hard problems contrast with P problems, which can be solved in a polynomial number of steps: if the solution to our
task were known, it could be checked in polynomial time, but acquiring the solution is part of the problem.
NP-hard problems typically require considerable computational power to check every possible solution, and
the time to find one grows exponentially with the problem's size.
To tackle this dilemma, tDBN [14] limits the search space to tree-augmented networks: an attribute
node at a certain time slice can possess at most one parent in that same slice. This constraint alone governs
the intra-slice connectivity. Furthermore, for each node, the maximum number of parents from preceding time slices
is bounded by a parameter p, typically lower than 5. Such structures have already proven to be
effective and superior to their counterparts, one example being the tree-augmented naive Bayes (TAN),
which augments the base structure of the naive BN while capturing correlations among variables [73].
The tDBN algorithm [14] builds on the Chow and Liu approach [74], which approximates a joint probability
distribution by a product of terms each introducing just one variable, making it possible to represent the compu-
tation as a first-order dependency tree. The Chow–Liu algorithm outputs the conditional probabilities to be used.
The approximation offers the minimum Kullback–Leibler Divergence (KLD) [75] between the estimate and the
real distribution, where the KLD measures how much information is lost when the approximated distribution is used instead
of the real one. A more in-depth notion is given in Sec. 4.3.1.
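As a quick numeric illustration of the KLD (toy distributions, not from the thesis), the divergence is zero when the approximation equals the true distribution and grows as information is lost:

```python
import math

# KLD sketch: sum of p_i * log2(p_i / q_i) over the support of p.

def kld(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.5, 0.25, 0.25]
print(kld(true_dist, true_dist))             # 0.0: nothing lost
print(kld(true_dist, [0.4, 0.4, 0.2]) > 0)   # True: approximation loses info
```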
Each transition t → t + 1 of the training set requires inspection in order to determine the most suitable network
for the model; in the non-stationary case, each transition is examined separately. For the sake of simplicity,
the steps explained below assume a Markov lag of one, but the arguments hold for larger lags. With this
in mind, the main phases of the structure learning method, applied for each transition t → t + 1, are
as follows:
• Complete directed graph: The algorithm requires a scoring function φ. This can be any
decomposable function; here the LL score is chosen, although its penalized version, MDL, can also be
employed. For each attribute Xi[t + 1], the set of parents with maximum score is found. By allowing each
node to have one parent from its current time slice, the optimization problem for a node Xi[t + 1] can be
formulated as

$$\begin{aligned} \underset{s_{ij},\, Par[t]}{\text{maximize}} \quad & s_{ij} = \phi_k\big(Par[t] \cup \{X_j[t+1]\},\, D_t^{t+1}\big) \\ \text{subject to} \quad & Par[t] \in S(X[t]),\ i \neq j, \\ & |Par[t]| \leq p. \end{aligned} \qquad (3.15)$$
In Eq. (3.15), φk refers to a term of the decomposable scoring function φ, D_t^{t+1} refers to a window
possessing the set of observations regarding the transition t → t + 1, Par[t] specifies the set of parents at
slice t, S(X[t]) represents all the subsets of the nodes X[t] from the preceding time slice t, and Xj[t + 1] is a node
from the same time slice t + 1 with i ≠ j. The constraint declares that the set of parents must
be among the subsets of preceding nodes S(X[t]) containing at most p nodes, meaning that a node Xi[t + 1] can
possess at most p parents from time slice t. The latter restriction is not mandatory, although it facilitates the search
while diminishing overfitting. A complete directed graph is then built over X[t + 1], from which the Chow–Liu
approach [74] extracts a maximum-weight spanning tree, as computed, e.g., by Kruskal's algorithm. Each edge
Xj[t + 1] → Xi[t + 1] of the graph has a weight eij = sij − si, where si represents the score
from Eq. (3.15) for node Xi[t + 1] without considering Xj[t + 1] as a parent; the weight thus
represents the score gain of the network when including Xj[t + 1] as a parent. The resulting tree is the one
closest to the distribution underlying the data as measured by the KLD, hence a maximum-LL-score tree.
• Maximum branching: It is worth noting that the computed edge weights are direction dependent, eij ≠ eji.
A maximum branching algorithm must be employed to determine the directed spanning tree, in other words,
an acyclic and connected graph emanating from the root node with a single path from that node to
every other node. The tree has the largest total weight, equal to the sum of all its edge weights. To achieve
such a structure, Edmonds' algorithm [76] is employed; the resulting nodes have an
in-degree of at most 1. The intra-slice connectivity is retrieved immediately from the output of Edmonds'
algorithm, while the inter-slice connectivity is derived from the solutions of the optimization problems in
Eq. (3.15) for each node Xj[t + 1], with the exception of Edmonds' root node, which only requires the
optimal set of parents from the preceding slice.
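As a toy illustration of the two steps above (all score values invented), the sketch below forms the edge weights eij = sij − si for three intra-slice nodes and brute-forces the maximum-weight directed spanning tree that Edmonds' algorithm would return; brute force is only feasible because the example is tiny.

```python
from itertools import product

# s[i]   : score of node i with no intra-slice parent
# sij    : score of node i with node j as its intra-slice parent
# weight : e_{j->i} = s_ij - s_i, the gain of adding edge j -> i

s = {0: -5.0, 1: -4.0, 2: -6.0}
sij = {(0, 1): -4.2, (0, 2): -4.8, (1, 0): -3.9,
       (1, 2): -3.5, (2, 0): -5.1, (2, 1): -5.9}
weight = {(j, i): sij[(i, j)] - s[i] for (i, j) in sij}

def best_branching(nodes):
    best, best_w = None, float("-inf")
    # Each node picks one intra-slice parent or none (in-degree <= 1).
    for parents in product([None] + nodes, repeat=len(nodes)):
        if any(p == i for i, p in enumerate(parents)):
            continue                       # no self-loops
        ok = True                          # reject cyclic parent assignments
        for i in nodes:
            seen, p = set(), parents[i]
            while p is not None:
                if p in seen:
                    ok = False
                    break
                seen.add(p)
                p = parents[p]
            if not ok:
                break
        if not ok:
            continue
        w = sum(weight[(p, i)] for i, p in enumerate(parents) if p is not None)
        if w > best_w:
            best, best_w = parents, w
    return best, best_w

print(best_branching([0, 1, 2]))   # parent of each node, total weight
```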
A tree-augmented DBN is thus obtained, with intra/inter-slice connectivity resolved simultaneously and with a
polynomial worst-case time complexity. It is worth noting that even upon finding an optimal structure, it is
normally not unique. The processes described above consider the modeling of a non-stationary first-
order Markov lag DBN and should be applied to every transition available; the same reasoning is adapted for
stationary and non-unitary Markov lag DBNs, with further details and proofs available in [14].
It is worth noting that the DBN's prior network is obtained separately with conventional procedures and is not
significant for the developed outlier detection approach.
Chapter 4
Proposed method
In the present chapter, a gradual but in-depth analysis of the developed outlier detection mechanism is given.
The mechanism focuses on the detection of unusual values, in conjunction with irregular sequences, present in MTS through
DBN modeling. Firstly, a general and intuitive view is laid out together with a step-by-step reasoning of the
underlying procedures.
The proposed approach comprises four main phases, depicted in
Fig. 4.1. Datasets containing MTS are pre-processed prior to modeling if necessary. The pre-processing phase
studied uses a discretization and dimensionality reduction technique discussed in Sec. 4.1. Discrete
MTS datasets are then fed to the tDBN modeling algorithm, described in Sec. 3.3.4, which generates a DBN
according to the provided dataset and the chosen parameters: the Markov lag L, the number of parents
p from preceding frames each node can possess, and a flag s deciding the stationarity of the model. Afterwards,
the MTS dataset together with the trained model are delivered to a scoring phase, which capitalizes
on the structure and parameters of the model to analyze each MTS transition using a sliding-window algorithm;
entire series are likewise scored. Subsequently, scores are delivered to a score-analysis strategy whose main
goal is to create a boundary differentiating abnormal from normal scores. Two possible strategies are discussed
in the present work and later compared. A score-analysis strategy outputs a threshold used for the final binary
classification: original data associated with scores below the threshold are classified as outliers.
Each phase is analyzed in the present chapter, the modeling algorithm having been discussed in Sec. 3.3.4.
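The four phases can be sketched as the skeleton below. All function names are hypothetical stand-ins for the actual components (SAX pre-processing, tDBN modeling, sliding-window scoring, score analysis), not the thesis implementation:

```python
# Skeleton of the four-phase pipeline: pre-process -> model -> score ->
# threshold. Components are injected as stand-in callables.

def detect_outliers(dataset, p, L, s, preprocess, train_tdbn,
                    score_transitions, choose_threshold):
    data = preprocess(dataset)                 # discretization + reduction
    model = train_tdbn(data, p=p, lag=L, stationary=s)
    scores = score_transitions(model, data)    # sliding-window LL scores
    threshold = choose_threshold(scores)       # score-analysis strategy
    return [i for i, sc in enumerate(scores) if sc < threshold]

# Tiny smoke test with dummy components: the lowest-scoring transition
# falls below the threshold and is flagged.
result = detect_outliers(
    dataset=[[0, 1, 1, 0]], p=1, L=1, s=True,
    preprocess=lambda d: d,
    train_tdbn=lambda d, p, lag, stationary: None,
    score_transitions=lambda m, d: [-0.5, -0.6, -9.0],
    choose_threshold=lambda sc: -5.0)
print(result)   # [2]
```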
Figure 4.1: Scheme of the proposed outlier detection approach comprised of 4 phases. Data sets formed by MTS data can be directly applied to the modeling phase when discrete and in the right format; otherwise, the pre-processing phase is applied prior to modeling. Missing values must be imputed beforehand, not being considered in the current system. Discrete data is delivered to the modeling phase along with the parameters p, L and s of the DBN to be modeled. The score-analysis phase considers two distinct strategies, providing thus two possible routes for outlier disclosure.
4.1 Time series pre-processing
One of the challenges when handling TS data is their tendency to be continuous and of high dimension. Since real-world databases are vulnerable to inconsistency and noisy sources, data pre-processing is considered. To tackle such issues, a procedure providing discretization and dimensionality reduction is applied to data sets prior to the modeling phase. Noise cleaning can also be endorsed; however, since the aim of the developed program is to detect anomalous and unusual behavior in data, noise reduction is not mandatory. For a better understanding of data pre-processing and its importance consider [77].
Since the trained model is a DBN, undiscretized variables foster a heavy and overfitted model which ends up behaving poorly. The same can be said of over-sampled data. These aspects also incur high computational burdens. Emphasis should be placed on the choice of the representation parameters, which can highly influence the performance of the outlier detection process.
The representation enforced on the input TS in the proposed approach is known as Symbolic Aggregate approXimation (SAX) [9]. Other techniques can be used prior to the system, with the resulting discretized dataset applied to the approach. SAX has already been applied in anomaly detection scenarios [2], being thus the elected strategy. The next statements consider a univariate TS for the sake of simplicity, the generalization being deferred to later. The pseudo-code for the pre-processing mechanism is available in Algorithm 1 and specified next:
• Normalization: Before proceeding to the actual dimensionality reduction and discretization, TS Xi[t] = xi[1], ..., xi[T] are normalized to present zero mean and a standard deviation of one. Comparing TS with different offsets and/or amplitudes is futile; such is tackled employing Z-normalization [78]. The mean of the TS is subtracted from every data point, and the result is then divided by the TS' standard deviation. The process is thus

x′i[t] = (xi[t] − x̄i) / σi,   i ∈ N,   (4.1)

where xi[t] denotes a data point from variable i at time stamp t, while x̄i is the mean value for that variable and σi its standard deviation. The Z-score normalization from Eq. (4.1) adjusts the value of every data point xi[t] of the TS into its new representation x′i[t], which denotes the number of standard deviations from the mean the original data point is. Let us recall that, for the case of MTS, the procedure must be applied to each variable TS, Xi[t], separately. Other types of normalization and variations of the default Z-normalization can be contemplated [77].
• Dimensionality reduction: Each TS of length T can be converted into an equivalent sequence of size m ≪ T. This reduced data representation is known as compression. Notorious compression techniques include wavelet transforms, among others. In the current program, compression is assured by employing a technique known as Piecewise Aggregate Approximation (PAA) [79], which subdivides a normalized TS into m equally sized windows. These segments will become the new TS' values, hopefully preserving its properties. The mean value of each window is hence computed, replacing all values from that window. In other words, the m means of the windows serve as the new TS. Mathematically, a time series X[t] = x[1], ..., x[T] of length T is remodeled as another sequence X̄[t] = x̄[1], ..., x̄[m] of length m ≪ T, being each value transformed using

x̄[i] = (m/T) · Σ_{j = (T/m)(i−1)+1}^{(T/m)·i} x[j],   1 ≤ i ≤ m,   (4.2)

where new values are computed as an average of the original ones in their region of size T/m.
It is worth noting that the aforementioned technique is only possible for numeric TS, meaning that reducing dimensionality in symbolic sequences should consider a different approach. An additional concern is the modification of the sampling rate of the data. When exercising PAA, the spacing between values becomes T/m, considering that the original spacing was 1. If a particular sampling rate is needed for a specific application, the analyst should pay attention to the dimensionality reduction process and the chosen window size. At the end of the present phase, the original TS is approximated by a combination of steps as seen in Fig. 4.2. The aforementioned procedure can be seen as a non-parametric model for numerosity reduction.
For the multivariate scenario, it is worth noting that PAA reduction must be applied equally to every variable belonging to a common TS. If for each attribute TS Xj[t] a different length mj were chosen, the final MTS would have a length equal to max(mj), causing the variable sequences not to fit together. In other words, distinct variables would have different sampling rates, forcing an undesirable amount of missing values. It is thus advisable to keep the PAA parameter m equal for every attribute.
• Symbolic Discretization: After having a normalized and reduced representation of a TS, the final and crucial phase concerns the actual discretization. It is worth noting that the PAA phase is not mandatory and can therefore be skipped; if the analyst is satisfied with the number of samples of a TS, he can choose to discretize it right after normalization. Normalized time series typically have Gaussian distributions [80]. With such in mind, the domain of a TS can be divided into Σ = σ1, σ2, ..., σ|Σ| equiprobable regions according to a Gaussian distribution N(0, 1), where |Σ| denotes the size of the alphabet Σ chosen to represent each new sector. Each region is identified by its boundaries, known as breakpoints B = β0, β1, ..., β|Σ|, the leftmost and rightmost breakpoints β0 and β|Σ| being −∞ and ∞, respectively. A symbol σi is thus associated with the interval [βi−1, βi[. The objective is to verify which of the alphabet regions each time series point falls into. As seen in Fig. 4.2, if a certain element is within the boundaries of symbol σi, for example, it is replaced by that symbol. The procedure is performed for every time stamp using a sliding window. The final result is a sequence of symbols which will now represent the TS for the upcoming computations.
Figure 4.2: Example of SAX discretization considering an 8-point TS. The 3-point PAA transform is converted to the string “acb” using the R package jMotif [81].
A simplistic version of the SAX algorithm can be reviewed in Algorithm 1. The pre-processing procedure must be applied to every variable, preferably with the same choice of parameters, when handling MTS. The selection of such values can have a considerable impact on what follows. While the PAA parameter determines the level of fidelity and memory spent in describing a TS, the alphabet size represents the granularity of expressing each element. With this in mind, it can be shown that for most applications an alphabet size in the range of 5 to 8 normally yields good results [9].
Algorithm 1 Data pre-processing
Input: Multiple continuous MTS Sj with a fixed length T and format. These are considered not to have missing values.
Output: The corresponding input MTS discretized with a fixed alphabet size |Σi|, 1 ≤ i ≤ n for each variable Xi and reduced to a length m ≪ T.
1: procedure SAX
2:   for each MTS Sj do // For each subject
3:     for each variable Xi of Sj do
4:       Normalizedi ← z_Norm(Xi) // Normalize each variable time series
5:       function PAA(Normalizedi, m)
6:         for each section of size T/m from Normalizedi do
7:           x̄ ← (m/T) · Σ_{j ∈ section} Normalizedi[j] // Mean of each section
8:           Normalizedi ← Normalizedi.Append(x̄) // Time series of size m
9:       function DISCRETIZATION(Normalizedi, |Σi|)
10:        β ← SegmentGaussianDistrib(|Σi|) // y axis divided in |Σi| equiprobable regions
11:        for each value v in Normalizedi do
12:          Discreteji ← ToSymbolic(v, β) // Breakpoints β used to classify each value v
13: return(Discreteji) // Return all discretized variables of all subjects
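As an illustration of Algorithm 1, the pipeline might be sketched in Python as follows. This is a minimal sketch for a single univariate TS, assuming T is a multiple of m; the function names are illustrative, and a full implementation such as the jMotif package [81] offers more options:

```python
from statistics import NormalDist

def z_normalize(ts):
    """Z-normalize a time series to zero mean and unit standard deviation."""
    n = len(ts)
    mean = sum(ts) / n
    std = (sum((x - mean) ** 2 for x in ts) / n) ** 0.5
    return [(x - mean) / std for x in ts]

def paa(ts, m):
    """Piecewise Aggregate Approximation: reduce ts to m window means."""
    T = len(ts)
    assert T % m == 0, "for simplicity, T must be a multiple of m"
    w = T // m
    return [sum(ts[i * w:(i + 1) * w]) / w for i in range(m)]

def sax(ts, m, alphabet="abc"):
    """Discretize via equiprobable N(0, 1) breakpoints (SAX)."""
    k = len(alphabet)
    # Interior breakpoints beta_1 .. beta_{k-1}; beta_0 and beta_k are -inf/+inf.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    symbols = []
    for x in paa(z_normalize(ts), m):
        idx = sum(1 for b in breakpoints if x >= b)  # index of the region x falls into
        symbols.append(alphabet[idx])
    return "".join(symbols)
```

For instance, `sax([1, 2, 3, 4, 5, 6, 7, 8], m=4)` normalizes the ramp, averages pairs of points, and maps the four window means onto the three-symbol alphabet.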
Although SAX discretization is said to guarantee equiprobability among symbols, the former statement is not entirely true. Research [82] has shown that when the PAA phase is applied prior to SAX discretization, the standard deviation of the normalized data tends to shrink. While the mean remains unaltered and equal to zero, the standard deviation no longer remains unitary. This shrinkage is proportional to the size of the PAA windows considered and the degree of auto-correlation within the TS. If the PAA reduction is exaggerated, the resulting values tend to converge to the original TS' mean; consequently, the equal distribution of symbols becomes harder and harder to achieve.
Even though SAX is considered to perform TS discretization, mechanisms like Principal Component Analysis (PCA) [83] can further enhance outlier detection. The latter selects only the most influential attributes when considering multivariate data, being thus advised to be applied prior to SAX. This is absent in the current approach.
4.2 Assumptions
A list of assumptions is specified with the aim of providing additional information for those wishing to apply the proposed unsupervised outlier detection system to their own application. If the analyst observes the referred assumptions and precautions, normal behavior is guaranteed. The assumptions are thus:
• Discretization & Completeness: A MTS data set must possess all its variables discretized with a finite
domain. Missing values must be treated beforehand. These can be imputed or replaced by an additional
symbol. Univariate TS representing each variable must be compatible, otherwise transitions would have
omitted information corrupting the procedure.
• Training set: Data used for model training should not hold an excessive amount of anomalies, otherwise the model may fit the anomalous behavior. The typical system behavior should thus be captured, without forcing anomalous scenarios.
• Correlation: Multivariate data employed in the algorithm should present correlation among variables as
well as temporal relationships, which are the main focus when modeling DBNs.
• Complexity: It is advisable to apply data sets with a reduced number of variables, due to the poor scaling when modeling DBNs. Over-complexity promotes overfitting as well as excessive time consumption. For data sets comprising 20 or more variables, a smaller subset should be selected.
4.3 Dynamic outlier detection technique
The proposed scoring phase is now described. It is noted that all the information in the following sections applies to both stationary and non-stationary DBNs. In the studied algorithm, an outlier is defined as a single observation, or set of observations, belonging to a subject, which is suspected of not being generated by the data's main underlying processes. Also referred to as statistical outliers, these are typically caused by qualitatively distinct mechanisms. A measure of outlierness is thus associated with each subset according to its resemblance to a trained DBN, which represents the data's normal behavior along time.
A window is defined as a sample of a MTS at a specified time interval and always has a size equal to n · (L+1), where n is the number of variables and L the DBN's lag. A window is described as

D_{t−L}^{t} = x1[t−L], ..., xn[t−L], ..., x1[t], ..., xn[t],   L ≤ t ≤ T,   (4.3)

where t identifies the last time frame present in the window. With this in mind, Eq. (4.3) shows that the size of a sequence depends on the Markov lag L of the trained model. Such occurs due to the fact that variables of a certain time slice t are conditioned by nodes no more than L prior time frames, according to the DBN structure. Each window possesses a transition network associated with it, which can be specific to that transition in the case of a non-stationary DBN.
A dataset containing multiple MTS is employed to train a DBN, as discussed in Sec. 3.3.4. The conditional probabilities from Eq. (3.11) for each attribute are kept in matrices within the algorithm. This means that a MTS, depicted in Eq. (3.2), with n variables through T time instants holds n · T attributes. Hence, DBN nodes represent attributes which are associated not only through variables but also through time.
The trained model is composed of a prior network B0 and, in the case of a first-order DBN, a transition network B_{t−1}^{t} for each transition t−1 → t. The initial BN is built upon the Markov lag of the model, meaning that a Markov lag of L demands an initial BN spanning L time instances. A transition network B_{t−L}^{t} for frame t requires variables defined since t−L, which means that transition networks only begin to act after time slice L−1. Each transition is associated with a window composed of the observed attributes required to compute its corresponding anomaly score.
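Under the illustrative assumption that an MTS is stored as a list of per-instant observation vectors (with 0-based time indices), the windows of Eq. (4.3) might be gathered as:

```python
def windows(mts, L):
    """Collect every window D[t-L .. t] from an MTS.

    mts: list of T observation vectors, each with n attribute values.
    L: Markov lag of the trained DBN.
    Returns one (t, flattened window) pair per transition; each window
    holds n * (L + 1) values, as described in Eq. (4.3).
    """
    T = len(mts)
    out = []
    for t in range(L, T):  # the first L slices lack enough predecessors
        flat = [v for frame in mts[t - L:t + 1] for v in frame]
        out.append((t, flat))
    return out
```

For a second-order DBN (L = 2) and a 4-instant bivariate series, this yields two windows of 6 values each, matching the n · (L+1) window size above.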
4.3.1 Outlierness score
To acquire a measure of outlierness, windows representing each transition of a subject are graded. The cor-
responding transition networks are used to check the probability of the values of time frame t according to the
window’s attribute values from Eq. (4.3). The probability of sequence x1[t], . . . , xn[t] is computed according to
the DBN structure, which stores the parent nodes of every attribute xi[t] and the parameters of every possible configuration. An attribute can possess parents in the same time frame, xj[t], j ≠ i, and at the previous L time instances, xi[t′], t − L ≤ t′ < t.
Before determining the score of each MTS transition, let us first understand the meaning of likelihood. Since the learned model represents normal behavior, evidence should be set on the DBN in order to compute how likely it is that certain data was generated by the model. The natural logarithm of the likelihood function is employed with the aim of preventing the underflow problem typically associated with such functions. Since the logarithm of a function reaches its maximum at the same point as the original function, the desired outcome is not disturbed. The more anomalous observations are, the lower the LL values they are associated with. Conversely, a perfect fit has a score of zero, being thus the maximum possible value for a score. The score of a window D_{t−L}^{t}, also known as transition score, is computed as

Score_{t−L}^{t} = Σ_{i=1}^{n} log(P(xi[t] | pa(xi[t]))),   (4.4)

where P(xi[t] | pa(xi[t])) is the probability of attribute xi[t] conditioned on its parent nodes' values, represented by pa(xi[t]). These can be attributes from the same time frame or prior ones, according to the transition network. Each score from Eq. (4.4) is thus a sum of the logarithms of the probabilities of every attribute at time slice t assuming their observed values. The outlierness of each transition t−L → t is thereby acquired.
The LL score is preferred due to its connection with the notions of entropy and KLD from information theory [84]. Entropy can be seen as the minimum number of bits necessary to specify the value of a certain discrete random variable. The entropy of a random variable with possible values σ ∈ Σ and known symbol probabilities is computed as

H = − Σ_{i=1}^{|Σ|} P(σi) · log2(P(σi)),   (4.5)

where |Σ| is the size of the alphabet. In other words, it is the amount of information needed to identify the system's state. Note that information is employed under Shannon's definition [85]. It can be shown [84] that the mean negative LL of a discrete random variable converges to the sum of the entropy, considering the variable's true distribution, and the KLD between the true distribution and the distribution considered. The KLD is the number of additional bits, on average, needed to encode an outcome distributed by an approximating (model) distribution when compared with the use of the optimal encoding for the true distribution. It is worth noting that the mutual information I between two random variables X and Y can be expressed as a KLD between the joint distribution of both variables and the product of the marginal distributions, I(X, Y) = KLD(P(X, Y) || P(X) · P(Y)) = H(X) − H(X|Y) [84]. The latter can be seen as a reduction in the uncertainty of X given Y. Note that the KLD cannot be used to measure the distance between two distributions, the main reason being its asymmetry. With this in mind, maximizing the likelihood of data under the trained model is equivalent to minimizing the KLD, reinforcing the notion that anomalous data associated with low likelihoods are less likely to have been generated by the true generative model.
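Eq. (4.5) can be checked with a small sketch; for instance, a uniform distribution over a four-symbol alphabet requires exactly 2 bits, while any non-uniform distribution over the same alphabet requires fewer:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution, Eq. (4.5)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))      # below 2 bits
```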
Although practical, the score in Eq. (4.4) has a substantial flaw. If the evidence possesses an unseen pattern, the probability of at least one attribute is zero. This means that the anomaly score for the whole transition is −∞, since the model assumes the evidence is impossible. This becomes easily understandable by inspecting Eq. (4.4): the logarithm of zero is not defined, and its limit is −∞ when the probability approaches zero from the positive side. To tackle this issue, a technique known as probability smoothing is employed, with the aim of preventing score disruption for unseen evidence. The smoothing applied, known in the existing literature [54], is
Pi = (1− |Σi|ymin)pi + ymin, (4.6)
where pi represents a conditional probability P(xi[t] | pa(xi[t])), ymin a parameter expressing the degree of probability uncertainty, and |Σi| the granularity of the variable in question. Every probability pi is thus remodeled into Pi, not only to resolve the problem of unseen patterns but also to ensure a more realistic and fair probability distribution. This means that when pi is zero, the new probability will be equal to ymin, which normally is 0.001, instead of the unseen pattern being assumed impossible. When a sequence has a probability of 1, every time a given attribute's parent node configuration occurs, that attribute's value is constant. Equation (4.6) ensures that such a probability is decreased according not only to ymin but also to the size of the alphabet related to that attribute. Consider a small example where an attribute at time frame t can assume one value from {a, b, c, d, e, f}. The training set revealed that, given its parents' values, the attribute should always adopt symbol c. However, taking this at face value encourages overfitting; the algorithm should not assume impossibilities or certainties. Doubt is not a pleasant condition, but certainty is absurd [86]. In this example, the variable can assume 1 out of 6 possible values. This degree of uncertainty is thus encoded in Eq. (4.6), denoting that variables with higher granularity should have a higher uncertainty even when the trained model says otherwise. A probability of 1 is hence converted into a smaller one, the reduction being proportional to the granularity of the variable. Consequently, the LL score is computed using the smoothed probabilities.
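Eq. (4.6) translates directly into code; a minimal sketch with ymin = 0.001, as suggested in the text:

```python
def smooth(p, alphabet_size, y_min=0.001):
    """Probability smoothing of Eq. (4.6): P = (1 - |Sigma| * y_min) * p + y_min."""
    return (1 - alphabet_size * y_min) * p + y_min

# An unseen pattern (p = 0) is no longer impossible:
print(round(smooth(0.0, alphabet_size=6), 6))   # 0.001

# A certainty (p = 1) is pulled below 1, more so for larger alphabets:
print(round(smooth(1.0, alphabet_size=6), 6))   # 0.995
print(round(smooth(1.0, alphabet_size=20), 6))  # 0.981
```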
4.3.2 Sliding window
The next phase is to acquire the outlierness of every subject's transition, in other words, the outlierness of every time frame from each TS given the model. The developed method resorts to a sliding window approach to create an outlierness plot along time for each subject. A sliding window can be defined as a sub-network of a DBN over consecutive time slices [87]. The mechanism is based on the gradual capture of all the windows D_{t−L}^{t}, which allow the computation of transition scores Score_{t−L}^{t}, as seen in Eq. (4.4), for each transition. Similar concepts have been studied for DBNs, although for different purposes such as approximate inference [87]. A sliding window has n · (L+1) values, depending on the order of the DBN.
Consider a discretized univariate TS of length T, modeled with a second-order DBN. The first 3 windows to score are shown in Fig. 4.3, where each time stamp ti, L+1 ≤ i ≤ T, represents a time slice of the TS. In this case, windows are comprised of L+1 observations. Initial attributes at time frames t1 and t2 are not scored by the mechanism, since these do not possess sufficient preceding values to be evaluated with the second-order transition network. Nevertheless, the latter are modeled by the prior network B0 and influence the scores of the upcoming windows which include them. This means that an unusual value at t1, for example, can still corrupt transition t1 → t3, triggering an anomaly. In Fig. 4.3, contiguous transitions always possess two shared frames, a window D_{ti−2}^{ti} being thus responsible for describing the likelihood of the observations at slice ti given the corresponding transition network.
For a discrete MTS with n · T values, formatted according to Eq. (3.2), each cell displayed in Fig. 4.3 represents a set of n observations instead of a single value. However, the reasoning is exactly the same, the number of captured values being multiplied by the number of variables of the MTS. Entire windows of size n · (L+1) are used to compute the conditional probabilities according to the respective transition network.
Figure 4.3: Sliding window example of a subject modeled with a second-order DBN. At each iteration, window D_{ti−2}^{ti} is captured to compute the likelihood of attributes at time slice ti, 3 ≤ i ≤ T, given the respective transition network B_{ti−2}^{ti}.
All transition windows are gathered for every subject and scored according to Eq. (4.4). The whole procedure
is depicted in Algorithm 2. Score analysis must now be employed to discern the final decision boundary.
Algorithm 2 Transition outlier detection
Input: A DBN model with stored conditional probabilities for each transition network B_{t−L}^{t}. A data set containing each subject Sj = x1[1], ..., xn[1], ..., x1[T], ..., xn[T] as a MTS in the same format as those used for training the DBN. A threshold thresh to discern abnormal and normal scores.
Output: The LL scores for every transition t−L → t which are below the specified threshold. Outlierness is measured by multiplying the conditional probabilities of attributes X1[t], ..., Xn[t] from the corresponding window.
1: procedure OUTLIERNESS
2:   for each time slice t do // For each transition t−L → t
3:     for each subject Sj do
4:       D_{t−L}^{t} ← x1[t−L], ..., xn[t−L], ..., x1[t], ..., xn[t]
5:       Xt ← x1[t], ..., xn[t]
6:       function SCORING(D_{t−L}^{t}, Xt, t)
7:         TransNet ← DBN.getTransNet(t) // Retrieve the transition network B_{t−L}^{t}
8:         for xi in Xt do // For each attribute of Xt
9:           pai ← TransNet.getParents(xi) // Parents of xi[t]
10:          pi ← TransNet.getProbability(xi, pai) // Retrieve conditional probability
11:          Pi ← (1 − |Σi| · ymin) · pi + ymin // Probability smoothing
12:        Score_{t−L}^{t} ← Σi log Pi // Score for window D_{t−L}^{t} of subject Sj
13:        scorej ← scorej.Append(Score_{t−L}^{t}) // Every transition score of subject Sj
14:        if Score_{t−L}^{t} < thresh then
15:          outliers ← outliers.Append(Score_{t−L}^{t}) // Transitions classified as outliers
16: return(outliers)
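The core of the scoring function in Algorithm 2 might be sketched as follows, assuming the conditional probabilities of one window have already been read from the trained DBN (function and parameter names are illustrative stand-ins for the model's probability tables):

```python
import math

Y_MIN = 0.001  # smoothing parameter of Eq. (4.6)

def transition_score(probs, alphabet_sizes):
    """LL score of one window, Eq. (4.4), using the smoothed
    probabilities of Eq. (4.6).

    probs: conditional probabilities P(x_i[t] | pa(x_i[t])) of each
    attribute at slice t, as read from the trained DBN's tables.
    alphabet_sizes: |Sigma_i| for each attribute.
    """
    score = 0.0
    for p, k in zip(probs, alphabet_sizes):
        smoothed = (1 - k * Y_MIN) * p + Y_MIN  # Eq. (4.6)
        score += math.log(smoothed)
    return score

# Two attributes with alphabet size 4: a well-explained transition...
normal = transition_score([0.9, 0.8], [4, 4])
# ...versus one containing an unseen pattern (probability 0).
unseen = transition_score([0.9, 0.0], [4, 4])
assert unseen < normal      # lower score = more anomalous
assert math.isfinite(unseen)  # smoothing keeps the score finite
```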
For the final output, a user-defined threshold can be enforced: all transitions from every test subject which obtained a score below the specified value are yielded. It is worth noting that even if an outlier is marked at time slice t, this represents the outlierness of the transition and not necessarily of the instant. The analyst should observe the attributes preceding the selected frame according to the corresponding transition network, since it may happen that outliers detected at time frame t are caused by previous slices. With such in mind, one can consider an entire window as abnormal rather than specific frames. Furthermore, since the outlier score of a transition corresponds to the conjunction of the scores of every attribute generated in that time slice, one can inquire which of the attributes contributed the most to the registered abnormality. It may happen that a single variable assumed a value so unlikely, given its parents, that it induced a high outlierness in the whole transition.
In addition to transition outlierness, the algorithm offers the classification of subjects as well. In many applications, analysts are concerned with removing anomalous series from their models rather than detecting unusual sections within subjects. Such concern is addressed by computing an additional score for entire MTS. A straightforward approach is considered, a subject's outlierness being equal to the mean of the scores of every transition of that subject. The same principle can be applied to any window size, ranging from a single transition to the whole subject. A subject score is computed as

SubjectScorej = (1 / (T − L)) · Σ_{t=L+1}^{T} Score_{t−L}^{t},   (4.7)

using smoothed probabilities, where Score_{t−L}^{t} represents the transition scores belonging to subject j. The mentioned mechanisms are applicable to any type of DBN, the difference between stationary and non-stationary models being the transition network considered in each step for the computation of the conditional probabilities.
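Eq. (4.7) reduces to a simple mean over a subject's T − L transition scores; a trivial sketch:

```python
def subject_score(transition_scores):
    """Subject outlierness of Eq. (4.7): the mean of the subject's
    transition scores (already computed with smoothed probabilities)."""
    return sum(transition_scores) / len(transition_scores)

# A subject with one very unlikely transition scores lower overall:
assert subject_score([-0.3, -0.4, -0.5]) > subject_score([-0.3, -9.0, -0.5])
```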
By providing the option of scoring data sets in multiple ways, the proposed approach adapts more easily to the analysts' needs and fits a wide range of scenarios. The next phase is to cleverly determine a boundary separating abnormal scores from normal ones.
4.4 Score analysis
A valued trait of the present algorithm is its automation. The apparatus being unsupervised, the idea is to lower human intervention as much as possible while still offering quality results. Two possible strategies are discussed for the score-analysis phase, whose aim is to compute the optimal threshold separating abnormal from normal scores. The chosen strategies make distributional assumptions on the scores; therefore, one should note that specific scenarios may call for particular approaches. The two strategies studied can be seen as counterparts of each other, meaning that, generally, when one performs poorly the other outputs good results. Such is discussed and supported by simulated results in Sec. 5.1. By offering the option of choosing one of these strategies, no scenario is left out of the system's scope.
4.4.1 Tukey’s strategy
The current strategy acknowledges the skewness of the scores' probability distribution. With the concern of finding a proper way to separate anomalous scores, the nature of the original data is temporarily ignored. One can define abnormal scores as values that are too far away from the norm, presuming the existence of some kind of cluster comprising normal scores. The current technique is inspired by John Tukey's method [15, 16], which determines the interquartile range (IQR), also known as H-spread, a measure of statistical dispersion. It can be seen as the width of a box-and-whisker plot, and it assumes that the scores' probability distribution is symmetric around a given point. The method ignores the data's mean and standard deviation, nullifying the impact of extreme scores. The IQR is hence a measure of variability robust to the presence of outliers, which is a desired trait when performing anomaly detection.
The before-mentioned cluster is assumed to be centered at the scores' median. Tukey's method uses the range between the first and third quartiles, representing the spread of the middle 50% of scores, denoted as IQR. Thus, 50% of all scores are within ±0.5 · IQR of the median. A quartile is a division of data into four intervals. The first quartile (Q1) is defined as the point below which the lowest 25% of values fall. The second quartile (Q2) is the data's median, cutting the data set in half, while the third quartile (Q3) is the point above which the highest 25% of values lie. With this in mind, the IQR is computed as
as
IQR = Q3−Q1. (4.8)
Tukey's method uses the notion of fences [16], frontiers which separate outliers from normal data. Upper and lower fences are computed as Q3 + (1.5 · IQR) and Q1 − (1.5 · IQR), respectively. The reason behind choosing 1.5 · IQR is that, for most cases, a value of 1 · IQR labels too many outliers (too exclusive), while 2 · IQR begins to classify extreme values as normal (too inclusive). An intermediate value is thus chosen and typically performs well for most situations. In a standard Gaussian distribution, a level of 1.5 · IQR flags about 1% of points as outliers, which is a reasonable value. The analyst's definition of outlier can come into play depending on the sensitivity desired. Thus, there is not a statistically driven reason for that value, it being the fruit of conducted experiments [16].
Figure 4.4: Distribution skewness types. Adapted from [88].
It is worth noting that Tukey's method should not be confused with the IQR method, as seen in [89]. The latter does not use fences when employed in outlier detection; it instead considers anomalies as values which deviate at least λ · IQR from the mean, where λ is a sensitivity parameter. John W. Tukey combined this concept with the notion of fences, representing deviations from percentiles rather than from the mean. The IQR method is not sensitive to asymmetry in the sampling distribution, since it only considers absolute deviations from the center. Tukey's mechanism improves on this aspect by introducing fences. However, it still presents an issue for many asymmetric distributions.
Transition and subject scores are classified as abnormal if their value lies below their respective lower fence or, equivalently, above their corresponding upper fence. Such can be visually observed in box plots [15]. However, the developed algorithm is only concerned with the recognition of anomalous scores below the lower fence, since outliers are defined as low-likelihood entities. Distributions are thus typically negatively skewed, as seen in Fig. 4.4. Thus, scores si holding the inequality

si ≤ Q1 − (1.5 · IQR)   (4.9)

are considered abnormal, Q1 − (1.5 · IQR) being the computed threshold.
Score distributions generated by the developed algorithm are often “negatively skewed”, with the right side of the bell curve compressed and with a long tail on the left side. Tukey's procedure typically prefers symmetric score distributions with a low ratio of outliers, having a breakdown point at about 25% [90]. In scenarios with an absence of anomalies, this mechanism is capable of reducing or even completely eliminating false positive occurrences, since fences are not forced to be within the data's observed domain.
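The threshold of Eq. (4.9) can be sketched with the Python standard library (an illustration only; quartile interpolation conventions differ between implementations, so fence values may vary slightly):

```python
import statistics

def tukey_threshold(scores, k=1.5):
    """Lower Tukey fence Q1 - k * IQR, Eq. (4.9); scores at or below it
    are classified as outliers."""
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartiles Q1, Q2, Q3
    return q1 - k * (q3 - q1)

# Five ordinary transition scores and one low-likelihood one.
scores = [-1.0, -1.2, -0.9, -1.1, -1.0, -9.5]
thr = tukey_threshold(scores)
outliers = [s for s in scores if s <= thr]
print(outliers)  # only the -9.5 score falls below the fence
```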
4.4.2 Gaussian mixture model strategy
One of the disadvantages of employing Tukey's method [15, 16] is that it implies a symmetric distribution of scores. In some applications, outliers are generated by specific phenomena which may cause the appearance of a Gaussian-like curve in the scores' histogram, rather than dispersed scores. Such situations are visible in the experiments performed in Sec. 5.1. This is typically noticeable in the presence of a larger number of anomalies. In such scenarios, Tukey's method may not provide the optimal solution: if more than one curve exists, the threshold does not separate the curves properly.
To handle such distributions, an alternate method is engaged to specifically tackle normal and abnormal curves. With inspiration from classification and clustering problems, a Gaussian Mixture Model (GMM) [17] is considered. A GMM is a probabilistic model that assumes data is generated from a finite mixture of Gaussian distributions with unknown parameters. The latter assumption is tolerable since most real-world phenomena have Gaussian-like distributions [80].
For the implementation of such a mechanism in the present work, a score array is seen as a data set whose nature can be explained by a mixture of two Gaussian curves. In this case, the threshold discovery task turns into a classification problem between two classes, normality and abnormality. The desired solution is the boundary between both curves, where there is maximum uncertainty about which class a score value belongs to.
On account of having two classes C1 and C2, an example is shown in Fig. 4.5. Let us consider classes C1 and C2 as anomalous and normal, respectively. The aim is to compute the degree of belief of assigning each computed
Figure 4.5: Gaussian distribution of two classes C1 and C2. Threshold y0 represents the point separating both curves, determined by the BCR. Adapted from [91].
score to one of the classes. The problem is defined as computing P(Ci|y), for i ∈ {1, 2}, for each score value y. This can be obtained by employing Bayes' rule:

P(Ci|y) = P(y|Ci) · P(Ci) / P(y),   i ∈ {1, 2}, (4.10)
where P (y|Ci) is the likelihood of score y belonging to class Ci, P (Ci) the priors for each class and P (y) the
evidence. Probabilities P (y|Ci) assume Gaussian distributions N(y|µi, σi) while priors should be estimated.
Evidence P (y) for each score is calculated according to
P (y) = P (y|C1)P (C1) + P (y|C2)P (C2). (4.11)
Combining Eq. (4.10) and Eq. (4.11) leads to the conclusion that for a score y to be classified as anomalous, P(C1|y) > P(C2|y) must hold, which is equivalent to P(y|C1)P(C1) > P(y|C2)P(C2). In the opposite case, scores are classified as normal. This is known as the Bayes classification rule (BCR). The BCR is the key to uncovering the boundary between both classes, thus yielding the desired threshold y0, as depicted in Fig. 4.5. Although the boundary minimizes classification error, misclassifications are bound to occur due to the overlap between both distributions.
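The boundary y0 can be located numerically as the point between the two class means where the weighted densities coincide; a sketch under the assumption of known component parameters (the helper bcr_boundary is ours, not part of the thesis software):

```python
from statistics import NormalDist

def bcr_boundary(w1, mu1, s1, w2, mu2, s2, tol=1e-9):
    """Bisection for the score y0 between mu1 < mu2 at which the weighted
    densities w_i * N(y | mu_i, s_i) are equal (maximum class uncertainty)."""
    f = lambda y: w1 * NormalDist(mu1, s1).pdf(y) - w2 * NormalDist(mu2, s2).pdf(y)
    lo, hi = mu1, mu2  # assumes exactly one sign change between the means
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# anomalous class C1 centered at -9, normal class C2 centered at -5
y0 = bcr_boundary(0.2, -9.0, 1.0, 0.8, -5.0, 1.0)
```

Scores below y0 would then be classified as anomalous by the BCR.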
The dilemma is now to discover the parameters of each Gaussian distribution. The GMM is defined as the weighted sum of the two Gaussian distributions, α1·N(Y | µ1, σ1²) + α2·N(Y | µ2, σ2²). For this, an Expectation-Maximization (EM) algorithm [92] is used to determine the values of the parameters αi, µi and σi²:
1. Initialization: Determine the initial component parameters for each curve. This commonly relies on a clustering algorithm [93], such as k-means, to assign hard labels to each data point. These labels are then fed to an MLE step [44] to obtain the initial parameters.
2. Expectation & Maximization: Assign a label to every score based on which class most likely generated it, according to the BCR. Then use MLE [44] to re-estimate αi, µi and σi² of each Gaussian using the assigned labels.
3. Repeat: Reiterate Step 2 until convergence. A log-likelihood (LL) cost function measuring the fit of the GMM to the score data set is used to determine the convergence point.
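The steps above can be sketched as follows. Note this is a soft-assignment EM variant rather than the hard-label procedure described in Step 2, and a crude sorted-half split stands in for the k-means initialization; all names are our own:

```python
import math, random

def norm_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm2(scores, iters=200):
    """EM for a two-component univariate Gaussian mixture."""
    s = sorted(scores)
    half = len(s) // 2
    # crude initialization: split the sorted scores in half (stand-in for k-means)
    mu = [sum(s[:half]) / half, sum(s[half:]) / (len(s) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        resp = []
        for y in scores:
            p = [w[i] * norm_pdf(y, mu[i], var[i]) for i in range(2)]
            tot = p[0] + p[1]
            resp.append([p[0] / tot, p[1] / tot])
        # M-step: re-estimate weights, means and variances (MLE)
        for i in range(2):
            n_i = sum(r[i] for r in resp)
            w[i] = n_i / len(scores)
            mu[i] = sum(r[i] * y for r, y in zip(resp, scores)) / n_i
            var[i] = max(sum(r[i] * (y - mu[i]) ** 2
                             for r, y in zip(resp, scores)) / n_i, 1e-6)
    return w, mu, var

random.seed(0)
scores = [random.gauss(-5, 0.4) for _ in range(180)] + \
         [random.gauss(-9, 0.4) for _ in range(20)]
w, mu, var = em_gmm2(scores)
```

On this synthetic mixture the two fitted means recover the normal (around -5) and abnormal (around -9) score clusters.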
The current strategy performs particularly well when the score distribution is discontinuous and comprises two distinct curves. Thus, as opposed to Tukey's method, the GMM excels in environments with a high proportion of outliers, where such curves are more pronounced, while having difficulty discerning false positives from anomalies in outlier-free datasets and single-curve distributions. The latter is easily explained by the fact that the implemented GMM always tries to fit an abnormal class.
A single Gaussian distribution could be fitted to tackle low outlier ratios; however, this would require the definition of a deviation measure. An example would be to classify as outliers the values which lie more than λ standard deviations from the Gaussian mean. It is worth recalling that the analyst can always enforce any threshold value or simply consider a certain percentile of scores as outliers.
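The single-Gaussian alternative described above could look like this (a hypothetical helper; λ is the analyst-chosen deviation factor):

```python
import statistics

def gaussian_threshold(scores, lam=1.5):
    """Flag scores lying more than lam standard deviations below the mean."""
    return statistics.fmean(scores) - lam * statistics.stdev(scores)

scores = [-5.0, -5.2, -4.9, -5.1, -5.3, -9.5]
flagged = [s for s in scores if s < gaussian_threshold(scores)]
```

Here only the clearly deviant -9.5 score falls below the λ-sigma threshold.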
Since the current thesis focuses on the detection of anomalies, two-component Gaussian Mixture Models are considered. Nevertheless, GMMs with a higher number of components can be employed, requiring a more complex definition of outlierness from the analyst due to the several classes encountered. In a more general case, however, the analyst typically does not know how many components yield the best fit. Information criteria are thus used, more specifically BIC maximization [94], which penalizes the number of estimated parameters by subtracting a term from the log-likelihood.
In the current thesis, GMM is employed with the aid of the freely available R package mclust [94, 95]. For
additional information regarding EM, consider [96].
4.5 Web application and software
Having the intention of providing a fully available and adaptable outlier detection mechanism, the developed
implementation is freely reachable online [3]. Most figures from experiments undertaken for the current study (see
Chapter 5) derive from the built web application and can be easily reproduced.
Figure 4.6: Front page screenshot of the developed web application reachable through [3].
The application offers support from data-formatting to score analysis together with a tutorial video showing
how to maneuver inside the application, as seen in Fig. 4.6. Results can be downloaded and parameters changed in
the midst of each procedure. Sample datasets are also accessible for immediate usage.
Figure 4.7: Screenshot from a transition score-analysis tab using the developed web application [3].
In Fig. 4.7, the score-analysis tab regarding transition outlierness is displayed. Users are capable of adjusting
visualization parameters to their liking as well as values which influence the outputted results. Score-analysis
offers automatic thresholds considering the studied strategies as well as manual readjustment. The user-interface
adapts in real time to the user’s actions.
Furthermore, since the source code is available, users can run the application in their own setups and adapt each phase to their particular endeavours, enhancing the proposed outlier detection system.
Chapter 5
Experimental results
A vital stage when developing any algorithm is defining its performance. Several experiments are conducted
using synthetic data as well as datasets available from multiple sources and applications. The goal is to first
demonstrate the method’s outcome in controlled conditions before proceeding to more complex and real-world
scenarios. Results are discussed according to their accuracy, feasibility and other criteria. Nevertheless, it is worth noting that each situation calls for different means: favorable practices in one specific domain do not necessarily imply satisfying results in every application.
The proposed approach is tested against simulated data. Experiments are divided into two groups, corresponding to the two possible strategies in the score-analysis phase, as seen in Fig. 4.1. This means that both strategies are used to validate the approach. Simulated experiments considering a PST method are carried out afterwards and contrasted with the proposed system. Furthermore, real datasets are addressed using the developed approach, displaying its versatility.
5.1 Simulated data
The first step consists of verifying the approach's performance and consistency in controlled scenarios. More specifically, the present experiments consist of training two separate DBNs, one for generating normal data, denominated DBNN, and another to produce outliers, DBNO. All data is then mixed together into a single dataset with the intention of identifying the intruders inserted by DBNO. The produced dataset is used to model a DBN embodying standard data behavior. Under perfect conditions, the latter is identical to the DBN employed to generate normal instances, DBNN.
Let us first understand the assumptions of such operations. The obtained performance will not only depend
on the ratio of outliers compared to normal data but also on how divergent these are. Several difficulty levels are
analyzed in order to capture the method’s limits. The higher the ratio of outliers, the more the attained model
admits them. This results in a scoring procedure which attenuates the outlierness of the deviants, possibly increasing the number of false negatives (FN) and false positives (FP). The proportion of abnormality is thus a factor to consider; in such situations the trained model is said to be overfitted with respect to the outliers. It is worth noting that, in general, the more data supplied to the algorithm the better; however, the system's ordinary behavior must be captured with its corresponding proportion. Few subjects can lead to overfitting of the transition networks, especially when considering a non-stationary DBN, yielding a precarious model which does not excel in practice. Hence, measurement should aim at capturing a system's typical behavior instead of focusing solely on catching anomalies.
Performance measures: The final stage of the algorithm can be perceived as a binary classification task.
Hence, performance measures should be defined to evaluate each experiment. For the actual detection of inserted
anomalies, the precision of an experiment, also known as Positive Predictive Value (PPV), is measured as

PPV = TP / (TP + FP), (5.1)

where TP is the number of true positives detected and FP the number of false positives, that is, normal observations misclassified as anomalous. The result provides the probability of randomly selecting a true outlier from all observations classified as abnormal. Another performance measure is recall, also known as True Positive Rate (TPR),

TPR = TP / (TP + FN), (5.2)

where FN is the number of false negatives, that is, outlying observations misclassified as normal. TPR provides the probability of a true outlier being correctly detected. To jointly consider both indicators in
the upcoming experiments, a measure denoted as the F1 score is computed,

F1 = 2 · (PPV · TPR) / (PPV + TPR). (5.3)

Eq. (5.3) is the harmonic mean of PPV and TPR, with both evenly weighted. Values of the F1 measure range from 0 to 1, with 1 corresponding to perfect precision and recall. An accuracy measure may also be calculated as

ACC = (TP + TN) / (TP + TN + FP + FN), (5.4)
which represents the probability of correctly classifying an observation, with TN being the number of normal instances correctly classified.
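Eqs. (5.1)-(5.4) translate directly into code; a minimal sketch with illustrative counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """PPV, TPR, F1 and ACC from the confusion-matrix counts of one experiment."""
    ppv = tp / (tp + fp)                   # Eq. (5.1), precision
    tpr = tp / (tp + fn)                   # Eq. (5.2), recall
    f1 = 2 * ppv * tpr / (ppv + tpr)       # Eq. (5.3), harmonic mean
    acc = (tp + tn) / (tp + tn + fp + fn)  # Eq. (5.4), accuracy
    return ppv, tpr, f1, acc

# e.g. 45 of 50 generated outliers found, 5 normal subjects wrongly flagged
ppv, tpr, f1, acc = classification_metrics(tp=45, fp=5, fn=5, tn=945)
```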
Experiments are conducted to scrutinize subject outliers arising from transition anomalies. All the DBNs in the current chapter are modeled using the LL score.
Data generation: To produce a dataset generated by a DBN, normal subjects are created by the transition network described in Fig. 5.1, representing DBNN. The majority of subjects belong to DBNN, while the rest are created by a different DBN whose degree of divergence from DBNN reflects the difficulty level being tested. More similar networks produce better-concealed outliers. Note that the model used by the outlier detection algorithm is trained on the mixed dataset, and is thus expected to differ from the normal model DBNN.
For simplicity, first-order stationary DBNs are considered, noting that the choice of model parameters in the training phase further influences the obtained model and consequently the final outcome. With this in mind, parameters do not differ between experiments. All variables have a domain Σ of 3 symbols, also known as the alphabet size, while subjects are MTS comprising T = 10 time instants. Every modeling undertaken in the synthetic
data experiments considered a stationary DBN with unitary lag and one possible attribute from preceding slices. The aim is to test the impact of the proportion of outlierness and not the choice of modeling parameters, leaving such discussions for the real data sets.
Figure 5.1: Transition network of the stationary first-order DBN used for generating normal subjects, DBNN .
All attributes from the non-outlier model are conditioned by their preceding values, as seen in Fig. 5.1. A correlation between variables X5 and X2 persists. For generating anomalous subjects, two different DBNs are considered, shown in Fig. 5.2. These differ from DBNN: dashed arrows represent removed connections and red edges relations that have been added with respect to DBNN. Black non-dashed links illustrate connections common to the abnormal and normal networks. Such dissimilarities influence the difficulty of the generated outliers. It is worth noting that, the models being stationary, the transition networks shown are replicated along every time window. All DBNs possess the same initial network, so that the latter does not influence the obtained results.
By observing both anomalous models displayed in Fig. 5.2, network DBNC is structurally more dissimilar from DBNN. DBNC retains only two of the original connections while increasing the number of relationships among different variables. It is thus expected to output more dissimilar outliers when compared to DBNB.
Figure 5.2: Transition networks of the stationary first-order models DBNB (a) and DBNC (b) used for generating anomaly subjects. Dashed connections represent links which are removed with respect to the normal network, displayed in Fig. 5.1, while red links symbolize added dependencies. Non-dashed black edges are connections which are also present in the normal network.
5.1.1 Proposed approach
The proposed approach is tested against synthetic data. Experiments are divided into two groups, corresponding to the two possible strategies in the score-analysis phase.
5.1.1.1 Tukey’s strategy
As a first contact, every test is examined using Tukey’s method (see Sec. 4.4.1 in Page 39) in the score-analysis
phase. Advantages and disadvantages are discussed and later compared with a GMM strategy.
Results are shown in Table 5.1, where each experiment identifier DBNO PO Ns indicates which anomalous network, DBNB or DBNC, is employed, the percentage PO of subjects belonging to said anomalous network, and the total number of subjects in the dataset Ns. Every value is rounded to two decimal places and represents an average over 5 trials, since every experiment is executed 5 times. It is worth recalling that these experiments have the objective of uncovering subject outlierness. However, since a subject score is determined by an average of all the transition scores of that subject, it is possible to observe which transitions contributed to the classification of each subject, thus validating both disclosures simultaneously.
Observing the results of outlier detection using either anomaly network with Tukey's method, available in Table 5.1, it is apparent that datasets with only 100 subjects perform generally poorly. This is explained by the fact that they do not possess enough information about the data's underlying processes. Replicating the results of an experiment with only 100 subjects is difficult, since these are very sensitive to each observation. Recall that data is generated randomly according to each network's parameters. Accuracy (ACC) as well as F1 scores tend to decline with the increase of outlier ratios, which is expected due to less normal data being available for a correct modeling phase. This is observed in the decrease of TPR measurements. The computed thresholds converge to more stable
Table 5.1: Results for subject outlier detection with simulated data sets while employing Tukey's score analysis. An experiment DBNO PO Ns possesses a percentage PO of anomaly subjects generated using DBNO, which can be DBNB or DBNC, from a total number of subjects in the dataset Ns. Every performance measure represents an average of 5 separate tests. The threshold Thresh computed for each experiment using Tukey's method is an average over the 5 trials. Column NO indicates the number of outlier subjects in each experiment.
Experiment    NO   Thresh  PPV  TPR  ACC  F1      Experiment    NO   Thresh  PPV  TPR  ACC  F1

B 5 100       5    -8.72   0.88 0.70 0.98 0.78    C 5 100       5    -8.61   0.89 0.73 0.98 0.80
B 5 1000      50   -5.56   0.93 0.96 0.99 0.94    C 5 1000      50   -5.46   0.91 0.98 0.99 0.94
B 5 10000     500  -5.47   0.95 0.98 0.99 0.96    C 5 10000     500  -5.32   0.94 1.00 0.99 0.97
B 10 100      10   -9.85   0.96 0.38 0.94 0.54    C 10 100      10   -8.72   0.89 0.73 0.97 0.80
B 10 1000     100  -5.61   0.99 0.87 0.99 0.93    C 10 1000     100  -5.57   0.97 0.87 0.98 0.92
B 10 10000    1000 -5.56   0.99 0.91 0.99 0.95    C 10 10000    1000 -5.55   0.99 0.87 0.98 0.93
B 20 100      20   -10.01  1.00 0.19 0.83 0.32    C 20 100      20   -8.71   0.90 0.22 0.84 0.35
B 20 1000     200  -6.08   1.00 0.20 0.84 0.33    C 20 1000     200  -6.17   1.00 0.37 0.87 0.54
B 20 10000    2000 -6.09   1.00 0.16 0.83 0.28    C 20 10000    2000 -6.01   1.00 0.29 0.86 0.45
values with the increase of data, causing oscillations to tend to zero and hence yielding more reliable values for every performance measure.
Figure 5.3: Subject outlierness histogram for an experiment C 20 10000 showing thresholds for both score-analysis strategies. Scores below the threshold are classified as abnormal (in red) while scores above the threshold are classified as normal (in green). Scores around the threshold are displayed in yellow. The color representation considers Tukey's threshold; the same reasoning applies to the GMM threshold.
Discussing the impact of outlier ratios, Tukey's method is recognized to be more effective in the presence of lower anomaly percentages. This arises from the fact that the score distribution becomes increasingly asymmetric with the appearance of more extreme scores, as confirmed in existing literature [89]. Moreover, when the ratio is high enough and the majority of outliers are generated by a common process, the score distribution of abnormal data becomes visible, as observed in Fig. 5.3. The latter is not handled by Tukey's method. Such phenomena explain why, for the same ratio, F1 scores may decrease with the increase of subjects when employing Tukey's threshold in large datasets. Furthermore, the breakdown point of Tukey's method [90] prevents favorable results in the presence of abundant outlierness. On the other hand, false positives tend to disappear, reflecting high precision measurements. This is due to the fact that Tukey's thresholds avoid the main distribution; however, it also allows the number of false negatives to rise.
Comparing experiments of both networks DBNB and DBNC, accuracy is in general higher in experiments with DBNC. This is expected, since the latter has fewer connections in common with DBNN, resulting in a more dissimilar structure. However, this does not always hold: score distribution asymmetry may disturb Tukey's analysis in extreme cases.
A control test to verify the system's behavior in the absence of outliers is performed. The aim is to check the number of false positives, if any, and compare the trained model with the presumed one. Results showed that all the connections from DBNN are present; however, additional links are trained. It is worth recalling that the modeling algorithm may have multiple solutions and, furthermore, in the case of modeling with the LL score, a higher number of relations tends to be formed when compared to MDL. The control test scores are seen in Fig. 5.4. Tukey's strategy outputs a threshold of -5.57, which is tolerable given the distribution. However, the GMM mechanism yields a value of -5.06, which bears an excessive number of false positives. This results from the fact that a second, outlying distribution is expected, thus proving the convenience of employing Tukey's approach in scenarios with reduced abnormality.
Figure 5.4: Subject outlierness histogram for a control experiment showing thresholds for both score-analysis strategies. Scores below the threshold are classified as abnormal (in red) while scores above the threshold are classified as normal (in green). Scores around the threshold are displayed in yellow. The color representation considers Tukey's threshold; the same reasoning applies to the GMM threshold.
It is worth noting that, despite false positives being unwanted, their appearance is expected in the present experiments. The non-outlier DBN, seen in Fig. 5.1, generates certain sequences with a low probability due to its parameters. When creating a large number of transitions, these are bound to occur. The scoring phase assigns a low likelihood to such instances, causing misclassifications.
5.1.1.2 GMM strategy
Inspecting Tukey's analysis results, the experiments with larger anomaly ratios bear low F1 values due to high counts of false negatives. This is coherent with the idea that higher outlier ratios disturb the training phase of the algorithm and thus the final results. However, the main reason for this outcome is the presence of an outlier curve in the score distribution, as seen in Fig. 5.3. The latter occurs due to the high proportion of outliers formed by a specific mechanism, in this case an abnormal DBN. Tukey's method does not perform well in such scenarios.
A GMM strategy is thus employed. The datasets used in each experiment are exactly the same as those tested with Tukey's method. Hence, the score-analysis phase is the only difference among the tests, affecting the computed threshold. The results are available in Table 5.2, where a considerable increase in recall becomes noticeable for experiments with higher proportions of outliers when compared with the results from Table 5.1. Such behavior has been confirmed in existing literature [89].
Figure 5.5: Comparison between GMM and Tukey's score-analysis F1 scores for multiple outlier ratios. Each value is an average of all 15 trials performed for each outlier ratio.
Table 5.2: Results for subject outlier detection with simulated data sets while employing the GMM strategy in score analysis. An experiment DBNO PO Ns possesses a percentage PO of anomaly subjects generated using DBNO, which can be DBNB or DBNC, from a total number of subjects in the dataset Ns. Every performance measure represents an average of 5 separate tests. The threshold Thresh computed for each experiment using the GMM strategy is an average over the 5 trials. Column NO indicates the number of outlier subjects in each experiment.
Experiment    NO   Thresh  PPV  TPR  ACC  F1      Experiment    NO   Thresh  PPV  TPR  ACC  F1

B 5 100       5    -8.78   0.82 0.70 0.98 0.76    C 5 100       5    -7.51   0.64 1.00 0.96 0.78
B 5 1000      50   -5.71   0.91 0.97 0.99 0.94    C 5 1000      50   -5.40   0.86 0.99 0.99 0.92
B 5 10000     500  -5.45   0.95 0.98 0.99 0.96    C 5 10000     500  -5.49   0.98 1.00 0.99 0.99
B 10 100      10   -8.55   0.77 0.68 0.93 0.72    C 10 100      10   -8.64   0.92 0.78 0.97 0.84
B 10 1000     100  -5.43   0.94 0.96 0.99 0.95    C 10 1000     100  -5.24   0.89 0.97 0.98 0.93
B 10 10000    1000 -5.27   0.91 0.98 0.99 0.94    C 10 10000    1000 -5.28   0.93 0.96 0.99 0.95
B 20 100      20   -8.18   0.66 0.49 0.85 0.56    C 20 100      20   -7.17   0.75 0.58 0.88 0.65
B 20 1000     200  -5.18   0.86 0.89 0.94 0.87    C 20 1000     200  -5.24   0.91 0.92 0.96 0.92
B 20 10000    2000 -5.13   0.86 0.94 0.96 0.90    C 20 10000    2000 -5.11   0.93 0.94 0.97 0.94
In general, the number of false positives is higher when employing the GMM strategy. This is due to the fact that the GMM always assumes the existence of an abnormal model, even in its absence. Due to similarities between the DBNs, especially when considering DBNB, scores from both curves tend to mix together around the threshold, making them difficult to discern. An example is the experiment from Fig. 5.3, where the GMM threshold conveniently separates both classes. With this in mind, for the current scenarios, the GMM approach typically has higher recall but lower precision, with thresholds smaller in absolute value. This is more visible at higher ratios of outlierness, since in the presence of fewer anomalies Tukey's method displays higher F1 scores. In Fig. 5.5, the average F1 score of every experiment using a specific method and outlier ratio is shown. Tukey's method performs very poorly on datasets with 20% of anomalies while outperforming the GMM strategy on datasets with 5% of anomalies as well as in control experiments.
It is worth recalling that the GMM strategy assumes that two classes with Gaussian distributions exist; in real-world scenarios it may happen that multiple curves appear in the score distribution even if there is only one normal class. In such cases, the current approach would likely misclassify excessively. The algorithm is easily adapted to several classes by employing a GMM with more than two components. However, this requires a more in-depth intervention and reasoning by the analyst.
Although the present synthetic experiments seem to endorse the use of a GMM strategy, one should note that both normal and abnormal data are generated according to two defined models which can, to some degree, be separated. Such is a favorable scenario for the GMM. In other scenarios, Tukey's method is not so susceptible to the presence of well-defined curves, being thus always a strategy to consider. With that said, the analyst is advised to apply as much knowledge as possible through the experimentation of both strategies.
5.1.2 Comparison with probabilistic suffix trees
To benchmark the developed system, an additional multivariate outlier detection system is built. The latter adopts PSTs with the aim of mining abnormal values within each TS. This alternative approach modifies the modeling and scoring phases of the outlier detection system, meaning that the pre-processing and score-analysis phases are the same as in the proposed approach.
As discussed in Sec. 3.2.1, PSTs are only capable of modeling univariate time series. To tackle MTS, various considerations are undertaken. First, each dataset is divided into multiple groups, each one containing data concerning solely one variable i ∈ {1, ..., n}. Every group is used to model a PST P^i. A subject j is used to form sequences X_i^j, which are univariate TS for each variable i. Thus, each sequence X_i^j comprises the observations of every time frame t ∈ {1, ..., T} concerning variable i. A dataset comprising 5 variables models 5 trees, which do not interchange any information. Each tree computes a univariate score for all TS according to Eq. (3.10) in Page 20. Scores from all the separate trees belonging to a common subject j of the original dataset are gathered.
The mean of each score array is computed as

SubjectScore_j = (1/n) · Σ_{i=1}^{n} logloss(X_i^j), (5.5)

which constitutes the multivariate score for each MTS. Note that the present algorithm does not consider relationships among variables, solely within univariate TS. In Eq. (5.5), n represents the total number of variables, while logloss(X_i^j) denotes the univariate score of sequence X_i^j, computed with PST P^i, concerning variable i and subject j.
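Eq. (5.5) amounts to averaging the per-variable PST scores of each subject; a minimal sketch with hypothetical per-variable log-loss values (the helper name is ours):

```python
def subject_scores(scores_by_variable):
    """Eq. (5.5): the multivariate score of subject j is the mean of its
    n univariate PST scores logloss(X_i^j), one per variable i."""
    n = len(scores_by_variable)
    n_subjects = len(next(iter(scores_by_variable.values())))
    return [sum(scores_by_variable[i][j] for i in scores_by_variable) / n
            for j in range(n_subjects)]

# hypothetical log-loss scores for 3 subjects over n = 2 variables
scores_by_variable = {1: [-1.2, -1.4, -3.0], 2: [-1.0, -1.6, -2.8]}
multi = subject_scores(scores_by_variable)
```

The third subject, with the lowest average score, would be the prime outlier candidate.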
An existing PST modeling package [97] is used. Each experiment performed is compared with the proposed approach. Results are shown in Tables 5.3 and 5.4. Likewise, the PST approach employs score analysis posterior to scoring, selecting one of the two considered strategies. Every PST is modeled with the maximum possible depth and an nmin of 30. Pruning using the G1 function with a C of 1.2 is applied (see Sec. 3.2.1).
Experiments performed with only 100 subjects did not provide satisfactory results, since they do not grant sufficient information; tree modeling and pruning would have to be adapted to such scenarios. Tests are thus focused on datasets with sufficient material. The datasets used in the present mechanism are common to the ones employed with the DBN approach. Control experiments using solely normal data with the multivariate PST system displayed a result identical to the one shown in Fig. 5.4. Results demonstrate the low performance of the PST
Table 5.3: Results of subject outlier detection using the PST approach with Tukey's score analysis. An experiment DBNO PO Ns possesses a percentage PO of anomaly subjects generated using DBNO, which can be DBNB or DBNC, from a total number of subjects in the dataset Ns. Every performance measure represents an average of 5 separate tests. The threshold Thresh computed for each experiment using Tukey's strategy is an average over the 5 trials. Column NO indicates the number of outlier subjects in each experiment.
Experiment NO Thresh PPV TPR ACC F1
B 5 10000 500 -1.55 0.96 0.73 0.98 0.83
C 5 10000 500 -1.61 0.96 0.94 0.99 0.95
B 10 10000 1000 -1.65 0.70 0.02 0.90 0.04
C 10 10000 1000 -1.63 0.98 0.39 0.94 0.56
B 20 10000 2000 -1.74 0.42 0.00 0.80 0.00
C 20 10000 2000 -1.73 1.00 0.03 0.81 0.06
Table 5.4: Results of subject outlier detection using the PST approach with GMM score analysis. An experiment DBNO PO Ns possesses a percentage PO of anomaly subjects generated using DBNO, which can be DBNB or DBNC, from a total number of subjects in the dataset Ns. Every performance measure represents an average of 5 separate tests. The threshold Thresh computed for each experiment using the GMM strategy is an average over the 5 trials. Column NO indicates the number of outlier subjects in each experiment.
Experiment NO Thresh PPV TPR ACC F1
B 5 10000 500 -1.47 0.86 0.88 0.99 0.87
C 5 10000 500 -1.59 0.94 0.95 0.99 0.94
B 10 10000 1000 -1.27 0.20 0.87 0.65 0.33
C 10 10000 1000 -1.55 0.88 0.68 0.96 0.77
B 20 10000 2000 -1.30 0.25 0.67 0.53 0.36
C 20 10000 2000 -1.45 0.76 0.88 0.92 0.82
approach when discerning anomalies generated by DBNB. This is explained by the fact that this network is much more similar to DBNN than DBNC is. Furthermore, since inter-variable relations are not considered, subjects become identical when seen by the PSTs. Hence, the resulting score distributions display a single curve blending both classes. One exception is the set of experiments considering only 5% of anomalies, which indicates that, with the increase of outlier ratios, the few dissimilarities among classes are modeled, causing outliers to fit the PSTs. Regarding experiments with DBNC, although outputting worse results than the proposed approach, both classes are more easily discerned. This is mainly due to the differences present in the intra-variable relationships when compared with the normal network.
Figure 5.6: Subject outlierness histograms using the proposed approach (a) and the PST approach (b) for the same experiment C 20 10000. Both histograms display thresholds using both score-analysis strategies. Scores below the threshold are classified as abnormal (in red) while scores above the threshold are classified as normal (in green). Scores around the threshold are displayed in yellow. The color representation considers Tukey's thresholds; the same reasoning applies to the GMM thresholds.
In Fig. 5.6, a comparison between both approaches is shown for an experiment with 20% of outliers from DBNC. The PST system cannot separate both classes as well as the DBN approach does, demonstrating the importance of the inter-variable relationships present in DBNC and the robustness of the DBN approach. In general, the PST approach scales poorly with the increase of outlier ratios. It is worth recalling that the datasets are generated using DBNs, which gives an advantage to the proposed approach. However, the outlier networks possess intra-variable dependencies which are captured by PSTs, making them a suitable target of comparison.
5.2 Real Multivariate Time Series
With validation completed in Sec. 5.1, real-world scenarios from numerous applications are tackled. The objective is to demonstrate the employment of the developed approach across different usages.
5.2.1 Electrocardiogram dataset
A common application of anomaly detection in medical scenarios is in Electrocardiogram (ECG) alert systems [2]. These have the capability of detecting unusual patterns in electrical signals measured from patients. Data is usually undiscretized and presents expected patterns under normal circumstances. The experiments shown demonstrate that the present approach is helpful in detecting suspect regions. A complete study from pre-processing to score analysis is thus made available.
The ECG dataset is available at [98] and is composed of 200 MTS. Tests are performed mainly using non-stationary DBNs, since specific phenomena occur at particular time instants, a peculiarity of ECG signals. The peaks of the ventricular contraction typically occur at time frames 3 and 10, as shown in Fig. 5.7, where a representation of the whole MTS dataset is given. It is worth noting that the current experiments have the objective of testing the system's reaction to noisy and unstable data.
Figure 5.7: Mean and standard deviation of normalized ECG variables along time using a SAX alphabet of 8.
The test seen in Fig. 5.8 shows that the current approach has difficulty evaluating time slices with higher variance. This is due to the absence of a predominant pattern in the location of the peaks, which varies considerably from subject to subject, unlike later slices where the data behaves more predictably. Transitions must not be regarded as time slices, since they are windows over the original TS and thus possess a time delay proportional to the Markov lag of the model.
As shown in Fig. 5.7, SAX discretization offers low definition at the ends of the peaks; this is due to the signal's variability, which is not uniform across frames. Excessively increasing the alphabet size makes each observation unreasonably sensitive to noise. Moreover, overfitting may occur. PAA dimensionality reduction is not employed in the current tests due to the low number of frames.
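As an illustration of the discretization step, a minimal SAX sketch (without the PAA stage, which the ECG tests skip) is given below. The breakpoints are the standard N(0,1) quantiles for an alphabet of size 5; the function name and the synthetic signal are illustrative only:

```python
import numpy as np

# Standard SAX breakpoints for an alphabet of size 5: quantiles of N(0, 1)
# that split the real line into five equiprobable regions.
BREAKPOINTS_5 = np.array([-0.8416, -0.2533, 0.2533, 0.8416])

def sax_discretize(series, breakpoints=BREAKPOINTS_5):
    """Z-normalize a 1-D series and map each value to a symbol index."""
    z = (series - series.mean()) / series.std()
    return np.searchsorted(breakpoints, z)  # symbols in 0..len(breakpoints)

# one ECG-like variable over 50 frames
x = np.sin(np.linspace(0, 6, 50))
symbols = sax_discretize(x)
```

A larger alphabet simply adds breakpoints, which, as noted above, makes the symbols more sensitive to noise.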
The system proved its ability to detect unusually behaved sections in ECGs, which coincide with the high-variance portions in Fig. 5.8, showing that it can be employed to detect such peaks. The latter series are discretized with a SAX alphabet of 5. The score distribution is negatively skewed, as seen in Fig. 5.10(a), and thus appropriate for Tukey's threshold. The modeling phase used a second-order DBN, which demonstrated favorable results.
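The Tukey score-analysis strategy can be sketched as follows: since low log-likelihood scores indicate abnormality, only the lower fence Q1 − 1.5·IQR matters. The function name and synthetic scores below are illustrative, not the thesis's implementation:

```python
import numpy as np

def tukey_lower_threshold(scores, k=1.5):
    """Tukey's lower fence Q1 - k*IQR; scores below it are flagged."""
    q1, q3 = np.percentile(scores, [25, 75])
    return q1 - k * (q3 - q1)

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(-1.0, 0.2, 200),   # normal transitions
                         rng.normal(-4.0, 0.3, 10)])   # anomalous tail
thr = tukey_lower_threshold(scores)
outliers = scores < thr
```

The fence adapts to the bulk of the distribution, which is why a negatively skewed, single-cluster score distribution suits it well.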
Furthermore, the occurrence of ventricular peaks in unusual slices is likewise uncovered. To test this, 10% of the subjects are flipped and mixed into the original set. The objective is to detect the now abnormal transitions. Results in Fig. 5.9 demonstrate the detection of such transitions, which are present on
Figure 5.8: ECG transitions arranged by subject using a non-stationary second-order DBN with Tukey's score analysis. Data is discretized using SAX with an alphabet of size 5. Transitions displayed in red are classified as abnormal while green ones are classified as normal.
Figure 5.9: ECG dataset transitions arranged by subject. A non-stationary second-order DBN model is used together with Tukey's score analysis. Flipped subjects are associated to the highest subject ids. Data is discretized using SAX with an alphabet of 5 symbols. Transitions displayed in red are classified as abnormal while green ones are classified as normal.
subjects with higher id.
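The contamination step above can be reproduced schematically, assuming that "flipping" a subject means reversing its time axis; the array shapes and counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy discretized MTS set: 50 subjects x 5 variables x 13 time frames
data = rng.integers(0, 5, size=(50, 5, 13))

n_flip = len(data) // 10              # roughly 10% of the subjects
contaminated = data.copy()
# reverse the time axis of the last n_flip subjects (highest subject ids)
contaminated[-n_flip:] = data[-n_flip:, :, ::-1]
```

The flipped subjects keep the same symbol statistics per variable, so only a model sensitive to temporal order, such as a non-stationary DBN, can expose them.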
To test the impact of the discretization phase, the exact same experiment is conducted, with the difference that all MTS are discretized with 10 symbols instead of 5. The transition score distribution in Fig. 5.10(b) starts to reflect multiple clusters characteristic of different types of patterns. The second major group, formed by the transitions of the peaks, becomes larger; it lies between the normal cluster and an anomalous cluster which appeared on the far left. With the increase in granularity, the system begins to better differentiate the data's different contexts, an example being the type of ventricular peak each subject possesses. Transition peaks with common characteristics start to join together.
It is worth noting that in non-stationary scenarios, transition scores should be compared carefully. This comes from the fact that scores from different transitions are evaluated by distinct networks: a normal value for one network can be anomalous for another. With this in mind, a separate score analysis could be performed for each transition score distribution. This is not evaluated in the current approach. Moreover, the number of FP may increase, since scores depicted as normal globally can be seen as anomalies when considering solely their own transition.
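A per-transition variant of the score analysis, which is left unevaluated here, could be sketched as below: Tukey's lower fence is applied column by column instead of to the pooled scores. All names and numbers are illustrative:

```python
import numpy as np

def tukey_lower(scores, k=1.5):
    """Tukey's lower fence Q1 - k*IQR for a 1-D array of scores."""
    q1, q3 = np.percentile(scores, [25, 75])
    return q1 - k * (q3 - q1)

rng = np.random.default_rng(2)
# scores[s, t]: score of subject s at transition t; each transition is
# evaluated by a different network, so the columns have different baselines
scores = rng.normal(loc=[-1.0, -2.5, -1.2], scale=0.2, size=(100, 3))

# a global analysis pools the three transition distributions together...
global_flags = scores < tukey_lower(scores.ravel())
# ...whereas a per-transition analysis thresholds each column separately
per_transition_flags = np.column_stack(
    [scores[:, t] < tukey_lower(scores[:, t]) for t in range(scores.shape[1])]
)
```

With the pooled threshold, the lower-baseline column dominates the fence; the per-column variant judges each transition against its own network's distribution, at the cost of one analysis per transition.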
Figure 5.10: Comparison of ECG transition score distributions using a SAX alphabet Σ of size (a) 5 and (b) 10 symbols in the pre-processing phase. Transitions are modeled by a non-stationary second-order DBN. Both histograms display a Tukey threshold. Scores represented in red are classified as anomalous while those in green are classified as normal. Scores around the threshold are displayed in yellow.
5.2.2 Mortality Dataset
To further test the proposed approach, consider a cell-wise outlier detection scenario where a multivariate dataset is confined to a data matrix. Such a scenario is studied in [99], where the data is assumed to have been generated from a multivariate Gaussian distribution. After column standardization, a univariate outlier detection system is applied to each variable, forming a new matrix. Correlations between pairs of variables are quantified for the computation of predictions for all cells. The proposed approach, DetectDeviatingCells [99], classifies cell-wise as well as row-wise anomalies. One of the examples tested in [99] refers to a dataset comprising male mortality in France from the 19th century onward, deriving from [100]. The intention is to discover outlying years. Therefore, the present mission is to detect the main iconic events which occurred in 19th- and 20th-century France, providing an explanation.
In order to adapt the dataset to the developed approach, a number of transformations have to be enforced. Besides not being discretized, the data is structured as a matrix where the only apparent variables are the mortality rates for each age group. To remodel it into the desired longitudinal format, several conversions are performed:
• Variables: the ages, ranging from 1 to 91 years old, are regarded as variables Xi, meaning that correlations among mortality rates of different ages can be modeled.
• Time: the rows of the original matrix, which correspond to the years in which measurements were obtained, are now seen as time instants t for each variable Xi.
• Reduction: due to the excessive number of variables, only a subset is selected.
Experiments conducted with the total number of variables revealed poor results due to excessive ambiguity. It is worth noting that, with all the transformations performed, the dataset is reduced to a single subject which portrays one MTS. Attributes Xi[t] are thus mortality rates of specific age groups at particular years. After the mentioned procedures, the resulting series, formatted according to Eq. (3.2), is pre-processed as discussed in Sec. 4.1.
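The conversions above amount to a column selection and a transpose; a minimal sketch with a synthetic stand-in matrix (the real rates come from [100]) could be:

```python
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1841, 1988)                   # rows of the original matrix
n_ages = 91                                     # columns: ages 1..91
mortality = rng.random((len(years), n_ages))    # stand-in for the real rates

ages_kept = [20, 30, 40, 60, 80]                # the reduced variable subset
# X[i, t]: variable X_i (an age group) at time instant t (a year),
# i.e. a single MTS subject in the format of Eq. (3.2)
X = mortality[:, [a - 1 for a in ages_kept]].T
```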
Since the data comprises only a single subject, non-stationary models cannot be employed: they would defeat the purpose due to overfitting, as only one observation exists for each transition. Hence, a stationary DBN is modeled to capture abnormal male mortality rates throughout the years, represented by transitions.
Each variable of the MTS is discretized with an alphabet size of 5, perceptible in Fig. 5.11. This size is selected because larger alphabets would encode the variables' small oscillations. Reinforcing the fact that the available data is minimal, comprising only one subject, excessive sensitivity proved ineffective. All tests employ Tukey's method in the score analysis phase.
Two separate experiments are presented in Fig. 5.12. A dataset comprising France's male mortality rates from 1841 to 1987 is used. In the first experiment, 5 variables are selected, namely ages 20, 30, 40, 60 and 80, which correspond respectively to variables V2–V6 in Fig. 5.11. The objective is to determine unusual events such as wars and epidemics, hence the choice of ages mainly belonging to the adult category. The trained model is a stationary third-order DBN. Nodes are allowed to have at most 1 parent from previous slices. The reasoning behind the parameter choice is purely experimental. The problem exhibits a preference for attributes establishing connections with previous nodes which are not consecutive with them; that is, explaining the current value typically requires knowledge of values from 2 or 3 slices before. It is worth recalling that having a lag of 3 does not mean that every, or even any, relation has such a lag; it merely offers that possibility.
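To make the lag remark concrete: with a Markov lag of 3 and at most one parent from previous slices, structure learning merely chooses one parent per node from the pool of all variables at lags 1 to 3, and is free to prefer lag 2 or 3. A sketch of that candidate pool:

```python
from itertools import product

variables = ["V2", "V3", "V4", "V5", "V6"]  # the 5 selected age groups
lags = [1, 2, 3]                            # a third-order DBN looks back 3 slices

# every (variable, lag) pair is a candidate parent of a node X_i[t];
# the structure search then keeps at most one of them per node
candidate_parents = [(v, lag) for v, lag in product(variables, lags)]
```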
Figure 5.11: Normalized values of variables Vi, i ∈ {1, . . . , 6}, representing France's mortality rates of males with ages 10, 20, 30, 40, 60 and 80, respectively, from 1841 to 1987. Each time stamp represents a year. Data is discretized with a SAX alphabet Σ of 5 symbols.
The results confirm major events which shook France's history. These are displayed in Fig. 5.12, representing both World Wars, the influenza pandemic, the Franco-Prussian War and the European revolutionary wave of 1848. France was a belligerent in several conflicts, as well as colonization wars, in the 1850s.
In the second experiment, variable V1 is added to the first set. The new age represents the male mortality rate of children aged 10 years old. The goal is to capture the impact of youth mortality on the outputted years. The results are similar, with the differences observed in the 1860s and around the Spanish flu. Youth is more susceptible to epidemics, which is perceived by the algorithm. Due to industrial development as well as overcrowding, diseases like cholera made infant mortality peak around the detected years [101]. Note the increase of outliers around the Spanish flu in Fig. 5.12, reinforcing previous statements. With the introduction of public healthcare as well as asepsis for preventing infections, infant mortality rates stabilized through the years, explaining the lack of anomalies in later transitions.
Figure 5.12: Transition outlierness for mortality datasets of (a) 5 and (b) 6 variables using a third-order DBN. The first dataset is comprised of 5 variables representing mortality rates of males with ages 20, 30, 40, 60 and 80. The second dataset includes the same variables as the first, with the addition of a variable representing the mortality rate of males aged 10. Transitions are arranged by year and classified as anomalous (red) and normal (green). Major wars and epidemics which affected France in the selected years are exhibited.
Recalling that each transition is represented by a window dependent on the model's structure, a year disclosed as an anomaly does not always imply that the respective year is abnormal in itself, but rather that its appearance is abnormal given the window's observations. Hence, the approach captures the starting and ending points of wars with more ease. An example is WW1, which is uncovered at 1914; later war years are not flagged since the mechanism already expected high mortality rates given the observed attributes.
A compelling result arises when modeling the reduced dataset of 5 variables with a fourth-order DBN, depicted in Fig. 5.13. By overfitting the model, two major score areas appear. Further investigation reveals that transitions classified as normal prior to WW2 belong to one score group, while regular ones from after WW2 reside in another. Such is marked in Fig. 5.13, the results being explained by medical and societal shifts as well as the introduction of penicillin during WW2 [102]. The distinct patterns pre and post WW2, roughly around time stamp 100, are also observable in Fig. 5.11.
Results obtained in [99] are thus confirmed in the current study. Experiments using the current approach in data
Figure 5.13: Transition score distribution for the mortality dataset using Tukey's threshold. Data is modeled using a fourth-order DBN. Transitions are classified as anomalous (red) and as normal (green). Due to overfitting, scores are concentrated in two major clusters representing the expected behavior of France's male mortality rates before and after WW2, respectively.
sets including years after the 1980s showed minimal differences when compared with the presented ones. This is explained by the absence of sudden changes in mortality rates.
5.2.3 Pen digits dataset
A distinct application is the recognition of drawn digits. With measurements taken along time during the drawing of each character, longitudinal data is built. The goal is to model the system on a given digit while unwanted characters are present amidst the training data; these are the instances to be uncovered. Data is available at [103] and studied in [104]. Handwriting samples are captured using a sensitive tablet which outputs the x and y coordinates of the pen at fixed time intervals.
A dataset comprising 1143 MTS along 8 time frames representing digit 1 is assembled from 44 different writers. The original MTS are discretized with an alphabet size of 8. The dataset is contaminated with 130 subjects belonging to a different digit, roughly 10%. The aim is to detect these and subsequently understand similarities between digits.
Results for multiple tests are presented in Table 5.5, where Di represents the introduced anomaly digit i. All experiments model a non-stationary first-order DBN, which demonstrates that a specific pair of coordinates is more easily explained by its immediate predecessor. Every attribute can possess at most one parent from its preceding slice. Thresholds are selected manually. The objective is not only to capture the performance of the outlier detection mechanism but also to understand which digits most commonly resemble digit 1. Results show that distinguishing digit 7 from 1 is difficult due to their similarity, as evidenced by the low F1 score obtained, which reflects the blending of both class distributions. Digits 8 and 9 proved to be more easily discerned from 1, with favorable performance measures.
Further experimentation employing the Tukey and GMM strategies showed that Tukey's thresholds assume most of the score distribution to be a single cluster, outputting a low number of TP. On the other hand, GMM proved able to separate both distributions, detecting practically all anomalies in each test. However, the latter outputs a high number of FP, since both Gaussians are mixed together. For the case of experiment D9, GMM acquired 125
TP and 171 FP with a threshold of -2.37.
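The GMM score-analysis strategy can be sketched with scikit-learn's `GaussianMixture` (assumed available, not the thesis's implementation): fit two components to the one-dimensional scores and place the threshold where the posterior of the lower-mean (anomalous) component drops to 0.5. The synthetic scores mimic the 1143-vs-130 setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
scores = np.concatenate([rng.normal(-1.5, 0.3, 1000),   # digit-1 subjects
                         rng.normal(-3.5, 0.5, 130)])   # contaminating digit

gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
lo, hi = np.sort(gmm.means_.ravel())
anomalous = int(np.argmin(gmm.means_.ravel()))          # lower-mean component

# scan between the two means for the point where the posterior flips
grid = np.linspace(lo, hi, 1000).reshape(-1, 1)
posterior = gmm.predict_proba(grid)[:, anomalous]
threshold = float(grid[np.argmin(np.abs(posterior - 0.5)), 0])
flagged = scores < threshold
```

When the two Gaussians overlap heavily, as with digit 7, this boundary inevitably trades TP for FP, matching the behavior reported above.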
Table 5.5: Results of the pen digits outlier detection experiments. All datasets are composed of 1143 MTS along 8 time frames representing digit 1 and NO subjects belonging to a different digit Di. Thresholds Tresh are chosen manually.
Experiment NO TP FP TN FN Tresh PPV TPR ACC F1
D5 130 55 38 1105 75 -3.0 0.59 0.423 0.91 0.49
D7 130 24 41 1102 106 -3.0 0.37 0.18 0.88 0.25
D8 130 98 45 1098 32 -2.5 0.69 0.75 0.94 0.72
D9 130 90 42 1101 40 -3.0 0.68 0.69 0.94 0.69
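The derived measures in Table 5.5 follow directly from the confusion-matrix counts; a small helper reproduces, for example, the D5 row:

```python
def detection_metrics(tp, fp, tn, fn):
    """Precision (PPV), recall (TPR), accuracy and F1 from raw counts."""
    ppv = tp / (tp + fp)
    tpr = tp / (tp + fn)
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return ppv, tpr, acc, f1

# row D5 of Table 5.5: 130 injected outliers, threshold -3.0
ppv, tpr, acc, f1 = detection_metrics(tp=55, fp=38, tn=1105, fn=75)
# rounded: PPV 0.59, TPR 0.423, ACC 0.91, F1 0.49, matching the table
```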
It is worth noting the difficulty associated with the current task. The data requires discretization before being modeled and conceals different types of features even for the same digit, since 44 writers are present. The size of the datasets, along with the roughly 10% of anomalies, turns outlier detection into a challenge. Beyond the aforementioned, there is no pre-processing specific to the scenario, contrary to the existing literature: raw data from the sensors is directly fed to the system.
Figure 5.14: Transition outlierness of an experiment comprised of 1143 MTS representing digit 1 and 130 subjects representing digit 0. The latter are associated to the highest subject ids. The modeling phase employed a first-order DBN. Transitions displayed in red are classified as abnormal, while those in green are considered normal.
An advantage of the developed approach is the capture of features which help explain each detection. A similar experiment is depicted where the unusual digit is 0; however, unlike previously, a stationary DBN is modeled. Transition anomalies seen in Fig. 5.14 demonstrate the presence of digits 0 at the top. Outliers detected in the final transitions of digit 1 exhibit the different types of calligraphy, which complete the drawing of digit 1 in distinct ways, contrary to 0.
The versatility of the proposed approach is thus shown. The application of the system in distinct scenarios demonstrated the importance of classifying abnormality using inter-variable as well as temporal dependencies. Furthermore, by considering two different score-analysis strategies, the system can adapt to each scenario more easily.
Chapter 6
Discussion
The present chapter reviews the thesis's achievements and provides possible future directions to enhance the undertaken research. Conclusions are drawn from the results detailed in Chapter 5, discussing the advantages and disadvantages of the developed system.
6.1 Conclusions
The proposed approach offers the opportunity to tackle portions of, as well as entire, MTS through an adaptable outlier detection mechanism. A diverse set of applications has proved to benefit from it. The system ranges from pre-processing to score analysis, including a DBN modeling phase versatile for longitudinal data. A widely available web application [3] is deployed to assist any analyst in their specific endeavour, along with a user-friendly interface and a straightforward tutorial. A paper based on the current thesis is under preparation for submission to a journal.
With this in mind, the DBN sliding-window outlier detection mechanism offers a previously nonexistent tool for longitudinal data. Anomalies are conveniently analyzed and the mechanism is easily adapted. Furthermore, the application can be applied to a wide range of scenarios and exercised recursively or partially. An example is performing anomaly detection in order to cleanse unwanted subjects; the resulting dataset can then be remodeled without the influence of unusual patterns. Furthermore, by tweaking the chosen parameters, the system is capable of changing its perception of outlierness, meaning that different contextual patterns can be captured.
The developed approach shows particularly good results when employed on discretized data with inter-variable dependencies as well as temporal relations, contrary to existing TS outlier detection methods which solely consider intra-variable dependencies. Stationary models allow confronting data whose behavior does not vary over time, in opposition to non-stationary ones. Although not built for MTS with a high number of variables, most scenarios are analyzed successfully by considering a smaller subset of variables, being thus easily adapted to such cases. Other types of temporal data have proved to accommodate the current system, with the pre-processing phase being of special importance.
6.2 Future Work
Given the availability provided and the modular nature of the developed system, as seen in Fig. 4.1, analysts can adapt specific stages to their needs. The phases which could benefit the most from applying domain-specific knowledge, as well as more complex mechanisms, are pre-processing and score analysis. Scoring can also be enhanced. Improvements should keep in mind the assumptions required by the modeling phase, which can be seen as the primary stage of the system.
Different discretization and dimensionality reduction mechanisms should be studied and compared with the aim of better capturing the main features of the data before modeling. In the current approach only SAX is considered, which could be further optimized due to equiprobability issues [82].
In the score analysis phase, more complex and domain-specific methods can improve results significantly, such as considering GMMs with more than two components or emphasizing specific scores, thereby regarding several classes of anomalies instead of just one.
A possible adjustment to the existing strategy would be an iterative training phase capable of detecting outliers in a first pass and removing them temporarily for further modeling. In each iteration, modeling would consider only the instances deemed normal in the previous iteration. Without outliers influencing the acquired DBN, the converged model would be used for the final outlier detection.
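The iterative scheme could take the following shape. The DBN fit and score steps are replaced here by a trivial mean-profile stand-in, and all names are illustrative, not part of the current system:

```python
import numpy as np

def fit(data):
    """Stand-in for DBN structure/parameter learning: a mean profile."""
    return data.mean(axis=0)

def score(model, data):
    """Stand-in scorer: higher (less negative) means more normal."""
    return -((data - model) ** 2).sum(axis=1)

def tukey_lower(s, k=1.5):
    q1, q3 = np.percentile(s, [25, 75])
    return q1 - k * (q3 - q1)

def iterative_detection(data, max_iter=10):
    keep = np.ones(len(data), dtype=bool)
    for _ in range(max_iter):
        model = fit(data[keep])                 # model normal instances only
        s = score(model, data[keep])
        flagged = s < tukey_lower(s)
        if not flagged.any():
            break                               # converged: nothing new flagged
        idx = np.flatnonzero(keep)
        keep[idx[flagged]] = False              # remove before the next refit
    return score(model, data), ~keep            # final model scores everyone

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 1, (95, 4)), rng.normal(6, 1, (5, 4))])
final_scores, outliers = iterative_detection(data)
```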
The modeling phase can be further enhanced with the employment of change-point mechanisms [105] in the case of non-stationarity. These can model changes in the underlying processes through time instead of considering each transition separately, amplifying the number of possible experiments.
In a supervised scenario, score analysis can benefit from labeled elements. For instance, SVMs [41] can be used to create a boundary between the normal and abnormal classes, which is then used to classify test instances. Furthermore, when labels are not available, different clustering models can be applied to separate the classes.
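In the supervised setting described above, a plain scikit-learn SVM (assumed available) trained on the one-dimensional scores would look like this; the scores and labels are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
normal_scores = rng.normal(-1.0, 0.3, 300)
abnormal_scores = rng.normal(-3.0, 0.4, 30)

X = np.concatenate([normal_scores, abnormal_scores]).reshape(-1, 1)
y = np.concatenate([np.zeros(300), np.ones(30)])   # 1 = labeled anomaly

# class_weight="balanced" compensates for the 10:1 class imbalance
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
predictions = clf.predict(np.array([[-0.9], [-3.2]]))
```

Unlike a fixed Tukey or GMM threshold, the learned boundary adapts to whatever labeled evidence is available.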
Scoring can be further improved to emphasize certain transition networks in the case of non-stationarity. Moreover, score analysis can be performed separately for each transition without mixing scores from different networks. This, however, increases the computational burden.
The enhancement of the multivariate PST mechanism used for comparison is additionally encouraged. By providing an algorithm capable of modeling a tree encoding multivariate configurations, an efficient MTS variable-length Markov model can emerge.
Bibliography
[1] J. D. Pinto, A. M. Carvalho, and S. Vinga. Outlier detection in cox proportional hazards models based on
the concordance c-index. In LOD, pages 252–256. Springer, 2015.
[2] E. Keogh, J. Lin, and A. Fu. HOT SAX: Finding the most unusual time series subsequence: Algorithms and
applications. In ICDM, pages 440–449. Citeseer, 2004.
[3] J. L. F. Serras. Dynamic Bayesian outlier detection. https://jorgeserras.shinyapps.io/outlierdetection/, October 2018.
[4] N. Friedman, K. P. Murphy, and S. J. Russell. Learning the structure of dynamic probabilistic networks. In
UAI, pages 139–147. Morgan Kaufmann, 1998.
[5] F. Zappa and P. Occhiogrosso. Real Frank Zappa Book. Simon and Schuster, 1989.
[6] C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls: Alternative data
models. In S&P, pages 133–145. IEEE, 1999.
[7] T. Lane. Hidden Markov models for human/computer interface modeling. In IJCAI, pages 35–44. Citeseer,
1999.
[8] Evolution of anomaly detection systems in security data science. https://www.analyticsvidhya.com/blog/2018/02/demystifying-security-data-science/. Accessed: 03 September 2018.
[9] J. Lin, E. J. Keogh, S. Lonardi, and B. Y. Chiu. A symbolic representation of time series, with implications
for streaming algorithms. In DMKD, pages 2–11. ACM, 2003.
[10] M. Saada, Q. Meng, and T. Huang. A novel approach for pilot error detection using Dynamic Bayesian
Networks. Cognitive neurodynamics, 8(3):227–238, 2014.
[11] M. L. Bueno, A. Hommersom, P. J. Lucas, M. Lappenschaar, and J. G. Janzing. Understanding disease
processes by partitioned dynamic Bayesian networks. BI, 61:283–297, 2016.
[12] D. J. Hill, B. S. Minsker, and E. Amir. Real-time Bayesian anomaly detection in streaming environmental
data. Water Resour. Res., 45(4), 2009.
[13] K. Murphy, S. Mian, et al. Modelling gene expression data using dynamic Bayesian networks. Technical
report, Computer Science Division, University of California, Berkeley, CA, 1999.
[14] J. L. Monteiro, S. Vinga, and A. M. Carvalho. Polynomial-time algorithm for learning optimal tree-
augmented dynamic Bayesian networks. In UAI, pages 622–631, 2015.
[15] J. W. Tukey. Exploratory data analysis, volume 2. Reading, Mass., 1977.
[16] D. C. Hoaglin. John W. Tukey and data analysis. Statistical Science, pages 311–318, 2003.
[17] G. McLachlan. Finite mixture models. Annual Review of Statistics and Its Application, 5(1), 2018.
[18] J. López-de-Lacalle. tsoutliers: Detection of outliers in time series. R package version 0.6-5. https://CRAN.R-project.org/package=tsoutliers, 2016.
[19] M. Dancho and D. Vaughan. anomalize: Tidy anomaly detection. R package version 0.1.1. https://CRAN.R-project.org/package=anomalize, 2018.
[20] RStudio. Shiny: A web application framework for R (2015). http://shiny.rstudio.com.
[21] S. Urbanek. rJava: Low-level R to Java interface. http://CRAN.R-project.org/package=rJava.
[22] C. C. Aggarwal. Outlier analysis. Springer International Publishing Switzerland, 2017.
[23] F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):1–21, 1969.
[24] D. M. Hawkins. Identification of outliers, volume 11. Springer, 1980.
[25] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han. Outlier detection for temporal data: A survey. IEEE Trans.
Knowl. Data Eng., 26(9):2250–2267, 2014.
[26] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Trans.
Knowl. Data Eng., 24(5):823–839, 2012.
[27] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR),
41(3):15, 2009.
[28] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial intelligence review, 22(2):
85–126, 2004.
[29] R. Domingues, M. Filippone, P. Michiardi, and J. Zouaoui. A comparative evaluation of outlier detection
algorithms: Experiments and analyses. Pattern Recognition, 74:406–421, 2018.
[30] A. Valdes and K. Skinner. Adaptive, model-based monitoring for cyber attack detection. In International
Workshop on Recent Advances in Intrusion Detection, pages 80–93. Springer, 2000.
[31] V. Chandola, V. Mithal, and V. Kumar. Comparative evaluation of anomaly detection techniques for se-
quence data. In ICDM, pages 743–748. IEEE, 2008.
[32] P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection, volume 589. John wiley & sons,
2005.
[33] P. Galeano, D. Peña, and R. S. Tsay. Outlier detection in multivariate time series by projection pursuit. Journal of the American Statistical Association, 101(474):654–669, 2006.
[34] R. S. Tsay, D. Peña, and A. E. Pankratz. Outliers in multivariate time series. Biometrika, 87(4):789–804, 2000.
[35] A. M. Bianco, M. Garcia Ben, E. Martinez, and V. J. Yohai. Outlier detection in regression models with
arima errors using robust estimates. Journal of Forecasting, 20(8):565–579, 2001.
[36] N. Kumar, V. N. Lolla, E. Keogh, S. Lonardi, C. A. Ratanamahatana, and L. Wei. Time-series bitmaps: a
practical visualization tool for working with large time series databases. In ICDM, pages 531–535. SIAM,
2005.
[37] S. Budalakoti, A. N. Srivastava, R. Akella, and E. Turkov. Anomaly detection in large sets of high-
dimensional symbol sequences. 2006.
[38] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Lof: identifying density-based local outliers. In
SIGMOD, volume 29, pages 93–104. ACM, 2000.
[39] I. Kärkkäinen et al. Outlier detection using k-nearest neighbour graph. In ICPR, pages 430–433. IEEE, 2004.
[40] A. K. Jain and R. C. Dubes. Algorithms for clustering data. 1988.
[41] J. Ma and S. Perkins. Time-series novelty detection using one-class support vector machines. In IJCNN,
volume 3, pages 1741–1745. IEEE, 2003.
[42] M. Hauskrecht, I. Batal, M. Valko, S. Visweswaran, G. F. Cooper, and G. Clermont. Outlier detection for
patient monitoring and alerting. BI, 46(1):47–55, 2013.
[43] M. Davy and S. Godsill. Detection of abrupt spectral changes using support vector machines an application
to audio signal segmentation. In ICASSP, volume 2, pages II–1313. IEEE, 2002.
[44] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em
algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
[45] K. R. Koch. Robust estimation by expectation maximization algorithm. Journal of Geodesy, 87(2):107–116,
2013.
[46] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady. Novelty detection for the identification of masses in
mammograms. In ICANN, pages 442–447. IET, 1995.
[47] E. Keogh, S. Lonardi, and B.-c. Chiu. Finding surprising patterns in a time series database in linear time
and space. In SIGKDD, pages 550–556. ACM, 2002.
[48] R. Gwadera, M. J. Atallah, and W. Szpankowski. Reliable detection of episodes in event sequences. Knowl-
edge and Information Systems, 7(4):415–437, 2005.
[49] E. Hancock. Ideas into words: Mastering the craft of science writing. JHU Press, 2003.
[50] D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable
memory length. Mach. Learn., 25(2-3):117–149, 1996.
[51] N. Ye et al. A Markov chain model of temporal behavior for anomaly detection. In Proceedings of the 2000
IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, volume 166, page
169. West Point, NY, 2000.
[52] R. Sekar, A. Gupta, J. Frullo, T. Shanbhag, A. Tiwari, H. Yang, and S. Zhou. Specification-based anomaly
detection: a new approach for detecting network intrusions. In CCS, pages 265–274. ACM, 2002.
[53] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In ICDM, pages 94–105.
SIAM, 2006.
[54] A. Gabadinho and G. Ritschard. Analyzing state sequences with probabilistic suffix trees: the pst r package.
JSS, 72(3):1–39, 2016.
[55] C. Low-Kam, A. Laurent, and M. Teisseire. Detection of sequential outliers using a variable length Markov
model. In ICMLA, pages 571–576. IEEE, 2008.
[56] F. Rasheed and R. Alhajj. A framework for periodic outlier pattern detection in time-series sequences. IEEE
transactions on cybernetics, 44(5):569–582, 2014.
[57] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White. Microbial gene identification using interpolated
Markov models. Nucleic acids research, 26(2):544–548, 1998.
[58] E. Eskin, W. S. Noble, and Y. Singer. Protein family classification using sparse Markov transducers. Journal
of Computational Biology, 10(2):187–213, 2003.
[59] W. W. Cohen. Fast effective rule induction. In Machine Learning Proceedings 1995, pages 115–123.
Elsevier, 1995.
[60] W. Lee, S. J. Stolfo, and P. K. Chan. Learning patterns from unix process execution traces for intrusion
detection. In AAAI Workshop on AI Approaches to Fraud Detection and Risk Management, pages 50–56,
1997.
[61] V. Jääskinen, J. Xiong, J. Corander, and T. Koski. Sparse Markov chains for sequence data. Scandinavian Journal of Statistics, 41(3):639–655, 2014.
[62] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc.
IEEE, 77(2):257–286, 1989.
[63] G. Bejerano and G. Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of
protein families. BI, 17(1):23–43, 2001.
[64] N. Friedman. The Bayesian structural EM algorithm. In UAI, pages 129–138. Morgan Kaufmann, 1998.
[65] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In SIGKDD, pages 220–229.
ACM, 2007.
[66] S. Babbar and S. Chawla. On Bayesian network and outlier detection. In COMAD, page 125, 2010.
[67] W.-K. Wong, A. W. Moore, G. F. Cooper, and M. M. Wagner. Bayesian network anomaly pattern detection
for disease outbreaks. In Proceedings of the 20th International Conference on Machine Learning (ICML-
03), pages 808–815, 2003.
[68] R. E. Kalman. A new approach to linear filtering and prediction problems. J. Basic Eng., 82(1):35–45,
1960.
[69] A. Ogbechie, J. Díaz-Rozo, P. Larrañaga, and C. Bielza. Dynamic Bayesian network-based anomaly detection for in-process visual inspection of laser surface heat treatment. In Machine Learning for Cyber Physical Systems, pages 17–24. Springer, 2017.
[70] A. Hartemink et al. Banjo: Bayesian network inference with Java objects. http://www.cs.duke.edu/amink/software/banjo/, 2005.
[71] D. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: Search methods and experi-
mental results. In AISTATS, pages 112–128, 1995.
[72] N. Dojer. Learning Bayesian networks does not have to be NP-hard. In MFCS, pages 305–314. Springer,
2006.
[73] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Mach. Learn., 29(2-3):131–163,
1997.
[74] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. Inf. Theory,
14(3):462–467, 1968.
[75] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stat., 22(1):79–86, 1951.
[76] J. Edmonds. Optimum branchings. J. Res. Nat. Bur. Standards, 71:233–240, 1967.
[77] J. Han, J. Pei, and M. Kamber. Data mining: concepts and techniques. Elsevier, 2011.
[78] D. Q. Goldin and P. C. Kanellakis. On similarity queries for time-series data: constraint specification and
implementation. In CP, pages 137–153. Springer, 1995.
[79] E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for datamining applications. In KDD,
pages 285–289. ACM, 2000.
[80] R. J. Larsen, M. L. Marx, et al. An introduction to mathematical statistics and its applications, volume 2.
Prentice-Hall Englewood Cliffs, NJ, 1986.
[81] P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein. GrammarViz
3.0: Interactive discovery of variable-length time series patterns. ACM Trans. Knowl. Discov. Data, 12(1):
10:1–10:28, Feb. 2018. ISSN 1556-4681. URL http://doi.acm.org/10.1145/3051126.
[82] M. Butler and D. Kazakov. SAX discretization does not guarantee equiprobable symbols. Trans. Knowl.
Data Eng., (1):1–1, 2015.
[83] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemom. Intell. Lab. Syst., 2(1-3):
37–52, 1987.
[84] C. R. Shalizi and A. Kontorovich. Shannon entropy and Kullback-Leibler divergence. Almost None of the
Theory of Stochastic Processes, pages 229–236, 2010.
[85] C. E. Shannon. A mathematical theory of communication. SIGMOBILE, 5(1):3–55, 2001.
[86] Voltaire. Letter to Frederick II of Prussia, 6 April. https://en.wikiquote.org/wiki/Voltaire,
1767. Accessed: 23 September 2018.
[87] X.-G. Gao, J.-F. Mei, H.-Y. Chen, and D.-Q. Chen. Approximate inference for dynamic Bayesian networks:
sliding window approach. Applied intelligence, 40(4):575–591, 2014.
[88] Distribution skewness types. http://www.assetinsights.net/Glossary/G_Probability_Distribution.html. Accessed: 20 September 2018.
[89] P. R. Jones. A note on detecting statistical outliers in psychophysical data. bioRxiv, page 074591, 2016.
[90] P. J. Rousseeuw and C. Croux. Alternatives to the median absolute deviation. J. Am. Stat. Assoc., 88(424):
1273–1283, 1993.
[91] Gaussian distributions. http://tinyheero.github.io/2015/10/13/mixture-model.html. Accessed: 20 September 2018.
[92] G. McLachlan and T. Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
[93] J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages
803–821, 1993.
[94] L. Scrucca, M. Fop, T. B. Murphy, and A. E. Raftery. mclust 5: Clustering, classification and density
estimation using Gaussian finite mixture models. The R Journal, 8(1):289, 2016.
[95] C. Fraley, A. Raftery, L. Scrucca, T. B. Murphy, M. Fop, and M. L. Scrucca. Package ‘mclust’. 2017.
[96] M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Trans. Pattern
Anal. Mach. Intell., 24(3):381–396, 2002.
[97] A. Gabadinho. PST: Probabilistic suffix trees and variable length Markov chains. https://CRAN.R-project.org/package=PST, 2016.
[98] H. A. Dau, E. Keogh, K. Kamgar, C.-C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. Ratanamahatana, Yanping,
B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista. The UCR time series classification archive, October
2018. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.
[99] P. J. Rousseeuw and W. V. D. Bossche. Detecting deviating data cells. Technometrics, 60(2):135–145, 2018.
[100] Human Mortality Database. University of California, Berkeley (USA), and Max Planck Institute for
Demographic Research (Germany). Available at www.mortality.org or www.humanmortality.de
(data downloaded on 18 September 2018).
[101] Institut National d'Études Démographiques (INED). Infant mortality in France. https://www.ined.fr/en/everything_about_population/demographic-facts-sheets/focus-on/infant_mortality_france/. Accessed: 27 September 2018.
[102] E. Torok, E. Moran, and F. Cooke. Oxford handbook of infectious diseases and microbiology. Oxford
University Press, 2016.
[103] D. Dheeru and E. Karra Taniskidou. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.
[104] F. Alimoglu and E. Alpaydin. Methods of combining multiple classifiers based on different representations
for pen-based handwritten digit recognition. In TAINN, 1996.
[105] X. Xuan and K. Murphy. Modeling changing dependency structure in multivariate time series. In ICML,
pages 1055–1062. ACM, 2007.