Similarity based Retrieval from Sequence Databases using Automata as Queries

Authors: A. Prasad Sistla, Tao Hu, Vikas Chowdhry

Source: CIKM 2002, ACM

Advisor: Prof. 郭煌政; Student: 林奕森

Outline

Introduction
Related work
Definitions and Examples
Algorithms for Infinite Norm Distance
Algorithms for Average Block Distance
Experimental Results
Conclusion and Discussion

Introduction (1/4)

Sequence databases occur in many areas of research in Database Management Systems. Temporal databases, time-series databases and video databases are some examples of sequence databases.

In this paper we consider similarity based retrieval from sequence databases.

Introduction (2/4)

Similarity based retrieval consists of retrieving those subsequences that closely satisfy the query based on a similarity measure.

In this paper, we consider a language based on finite state automata for specifying queries on sequences, and develop similarity based methods for retrieval.

Introduction (3/4)

We consider the following problems for a given database sequence d and a specification automaton A:

(i) retrieval of the k closest subsequences of d with respect to the automaton A (called a “nearest neighbor query”);

(ii) retrieval of all subsequences of d within a given distance from A (called a “range query”).

Introduction (4/4)

We have implemented the proposed methods on top of Sequel Server.

We also consider a restricted class of automata, called cycle-restricted automata. We present more efficient algorithms for these automata.

Related work (1/3)

There has been much work done on querying from time-series and other sequence databases.

For example, methods for similarity based retrieval from such databases have been proposed in [11, 2, 3, 5, 15, 14].

Related work (2/3)

The paper [1] presents a language called SDL. There, retrieval is based on exact match; it is not similarity based retrieval using a global distance measure, as ours is.

There has also been much work done on data mining over time-series data [4, 12, 6] and other databases. Among these works, [6] uses automata.

Related work (3/3)

These works mostly consider the discovery of patterns that have a given minimum level of support. They do not consider similarity based retrieval.

A temporal query language and efficient algorithms for similarity based retrieval have been presented in [18].

Definitions and Examples Basic Automata and Similarity values

Automata

1. An automaton A is a 5-tuple (Q, Σ, δ, I, F), where Q is a finite set of states, Σ is a finite set of symbols called the input alphabet, δ is the set of transitions, and I, F ⊆ Q are the sets of initial and final states, respectively.

2. Each input symbol represents an atomic predicate (also called an atomic query in some places) on a single database state.

Definitions and Examples Automata example

1. Each transition of A, i.e. each member of δ, is a triple of the form (q, a, q’) where q, q’ ∈ Q and a ∈ Σ; this triple denotes that the automaton makes a transition from state q to q’ on input a. We also represent such a transition as q →a q’.

2. For example, in a stock market database, price(ibm) = 100 represents an atomic predicate.

Definitions and Examples Automata example

Consider the automaton B defined as follows. It has three states 1, 2, 3. Its input symbols are the atomic queries time = 10AM, time = 4PM and price(IBM) < 100. States 1 and 3 are the start and final states, respectively. The automaton has the following transitions: from state 1 to 2 on the input symbol time = 10AM, from state 2 back to 2 on the symbol price(IBM) < 100, and from state 2 to 3 on the symbol time = 4PM.
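The automaton B can be written down directly as a 5-tuple. The following is a minimal sketch, assuming a plain Python representation; the names `Automaton` and `accepts` are illustrative and do not come from the paper.

```python
from typing import NamedTuple


class Automaton(NamedTuple):
    """A 5-tuple (Q, Sigma, delta, I, F); the field names are illustrative."""
    states: frozenset
    alphabet: frozenset
    transitions: frozenset  # set of (q, a, q_next) triples
    initial: frozenset
    final: frozenset


def accepts(aut, symbols):
    """Return True if `symbols` labels a path from an initial to a final state."""
    current = set(aut.initial)
    for a in symbols:
        # Follow every transition labelled `a` that leaves a currently reachable state.
        current = {q2 for (q1, sym, q2) in aut.transitions
                   if q1 in current and sym == a}
    return bool(current & aut.final)


# The automaton B from the slide: states 1, 2, 3 with start state 1 and final state 3.
B = Automaton(
    states=frozenset({1, 2, 3}),
    alphabet=frozenset({"time = 10AM", "price(IBM) < 100", "time = 4PM"}),
    transitions=frozenset({
        (1, "time = 10AM", 2),
        (2, "price(IBM) < 100", 2),
        (2, "time = 4PM", 3),
    }),
    initial=frozenset({1}),
    final=frozenset({3}),
)
```

Here `accepts` only checks exact membership of a symbol string in the language of the automaton; it does not involve similarity values.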

Definitions and Examples Similarity Measure

A database sequence d is a finite sequence of database states.

A database state may represent an image (in the case of video databases) or a document (in the case of textual databases).

For a database state c and an atomic query c’, we let sim(c’, c) denote the similarity value with which c satisfies the query c’.

Definitions and Examples Similarity Measure

We let dist(c, c’) = 1 - sim(c’, c) represent the distance between c and c’.

To define the similarity of a database sequence d = (d0, ..., dn-1) with respect to an automaton A, we first define a distance measure dist(d, a) between d and an input sequence a = (a0, ..., an-1) of equal length.

Definitions and Examples Similarity Measure

Let sim_vec(d, a) be the sequence (s0, ..., sn-1) where, for each i = 0, ..., n-1, si = sim(ai, di). We assume that all similarity values and distances are normalized, i.e. they lie in the interval [0, 1].

Let F be a vector distance function which, given two vectors x, y as arguments, returns a real number lying in the interval [0, 1].

Definitions and Examples Similarity Measure

We define dist(d, a) = F(sim_vec(d, a), 1), where 1 denotes the vector whose components are all 1. Now, we define a distance measure dist(d, C) between the database sequence d and a set C ⊆ Σ*. dist(d, C) is the minimum of dist(d, α), where the minimum is taken over all α ∈ C such that |α| = |d|; if there is no sequence α ∈ C such that |α| = |d|, then we take dist(d, C) to be equal to 1.

Definitions and Examples Similarity Measure

We define the distance of d with respect to A, denoted by dist(d, A), to be dist(d, L(A)), where L(A) is the language accepted by A. We define the similarity of d with respect to the automaton A, denoted by sim(d, A), to be 1 - dist(d, L(A)).
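The definition of dist(d, A) can be evaluated without enumerating L(A). The sketch below assumes one particular choice of F, namely the largest componentwise difference, and uses a simple dynamic program over the states of the automaton. It is an illustration of the definition, not the algorithm from the paper; the function name and the `sim` callback are hypothetical.

```python
def inf_norm_dist(d, transitions, initial, final, sim):
    """dist(d, A) when F is the maximum componentwise difference (an assumption).

    d           -- list of database states
    transitions -- iterable of (q, a, q_next) triples
    initial     -- set of initial states
    final       -- set of final states
    sim         -- hypothetical callback sim(a, state) returning a value in [0, 1]
    """
    # best[q]: smallest achievable maximum distance over all paths that read
    # the prefix of d processed so far and end in state q.
    best = {q: 0.0 for q in initial}
    for state in d:
        nxt = {}
        for (q, a, q2) in transitions:
            if q in best:
                cost = max(best[q], 1.0 - sim(a, state))
                if cost < nxt.get(q2, float("inf")):
                    nxt[q2] = cost
        best = nxt
    candidates = [v for q, v in best.items() if q in final]
    # By definition the distance is 1 when no accepted string has length |d|.
    return min(candidates) if candidates else 1.0


# Toy data: each database state maps an atomic query to a similarity value.
toy_transitions = {(1, "a", 2), (2, "b", 2), (2, "c", 3)}
toy_d = [
    {"a": 0.9, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.3},
    {"a": 0.0, "b": 0.4, "c": 1.0},
]
toy_sim = lambda a, state: state[a]
result = inf_norm_dist(toy_d, toy_transitions, {1}, {3}, toy_sim)
```

For the toy data the only accepted string of length 3 is (a, b, c), so `result` is the largest of the three per-position distances.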

Definitions and Examples Similarity Measure

Note that F1 is the average block distance function, F2 is the mean square distance function, and so on. We call F∞ the infinite norm distance function.
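The defining formula for this family of functions did not survive on the slide. A minimal sketch, assuming the usual normalized p-norm form that matches the names given above (average block distance for p = 1, mean square distance for p = 2, maximum for p = ∞), could look as follows; the function names are illustrative.

```python
def f_p(x, y, p):
    """Assumed form: ((sum_i |x_i - y_i|^p) / n) ** (1/p) for vectors of length n."""
    n = len(x)
    return (sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) / n) ** (1.0 / p)


def f_inf(x, y):
    """Assumed form of the infinite norm distance: max_i |x_i - y_i|."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))


# Distance of a similarity vector from the all-ones vector.
sim_vec = [1.0, 0.5]
ones = [1.0, 1.0]
avg_block = f_p(sim_vec, ones, 1)    # average of the componentwise differences
mean_square = f_p(sim_vec, ones, 2)
inf_norm = f_inf(sim_vec, ones)
```

With components in [0, 1], all three values stay in [0, 1], consistent with the normalization assumed for F.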

Definitions and Examples Wild Card Symbol

We assume that there is a special input symbol φ which denotes a wild card symbol, i.e. it denotes an atomic query which is always satisfied.

Cycle-Restricted Automata

Let A = (Q, Σ, δ, I, F) be an automaton. A path of the automaton is a sequence of transitions of the following form: q0 →a0 q1, q1 →a1 q2, ..., qn-1 →an-1 qn. We call such a sequence a path from q0 to qn.

Definitions and Examples Cycle-Restricted Automata

We call the path a φ-path if all input symbols appearing in it are wild cards, i.e., for each i = 0, ..., n-1, ai = φ. The above path is called a cycle if qn = q0 and q0, q1, ..., qn-1 are all distinct. A φ-path which is also a cycle is called a φ-cycle. We say that an automaton is cycle-restricted if it has no φ-cycles of length greater than 1.
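The cycle-restricted condition can be checked mechanically. The sketch below is one possible check, assuming transitions are given as (q, a, q’) triples and the wild card is the string "φ": it keeps only wild card transitions between different states and looks for a directed cycle by depth-first search. The function name is illustrative.

```python
def is_cycle_restricted(transitions, wildcard="φ"):
    """Return True if there is no wild card cycle of length greater than 1."""
    # Keep wild card edges only; self-loops are cycles of length 1 and are allowed.
    graph = {}
    for (q, a, q2) in transitions:
        if a == wildcard and q != q2:
            graph.setdefault(q, set()).add(q2)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def has_cycle(q):
        color[q] = GRAY
        for nxt in graph.get(q, ()):
            c = color.get(nxt, WHITE)
            if c == GRAY:
                return True  # back edge: a wild card cycle through at least 2 states
            if c == WHITE and has_cycle(nxt):
                return True
        color[q] = BLACK
        return False

    return not any(color.get(q, WHITE) == WHITE and has_cycle(q)
                   for q in list(graph))
```

Removing self-loops first works because any remaining directed cycle in the wild card graph contains a cycle through at least two distinct states.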

Definitions and Examples Nearest Neighbor and Range Queries

In this paper, we consider the evaluation of two types of queries, assuming that we are given a query automaton A and a database sequence d.

Definitions and Examples Nearest Neighbor and Range Queries

The first type of queries are called nearest neighbor queries. Here we have to retrieve the k subsequences of d having the lowest distances with respect to A, where k is an additional input which is a positive integer.

Definitions and Examples Nearest Neighbor and Range Queries

The second type of queries are called range queries. Here we have to retrieve all subsequences of d whose distance with respect to A is less than or equal to ε, where ε is an additional input which is a positive fraction.
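The two query types can be stated as a brute-force baseline that scores every contiguous subsequence of d. This is only a sketch of what the queries return, not an efficient algorithm; `dist_fn` is a hypothetical callback standing in for the distance of a subsequence with respect to A.

```python
import heapq


def all_subsequences(d):
    """Yield (i, j) for every non-empty contiguous subsequence d[i:j]."""
    n = len(d)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield (i, j)


def nearest_neighbors(d, dist_fn, k):
    """Return the k entries (distance, i, j) with the lowest distance."""
    scored = ((dist_fn(d[i:j]), i, j) for i, j in all_subsequences(d))
    return heapq.nsmallest(k, scored)


def range_query(d, dist_fn, eps):
    """Return every (i, j) whose subsequence has distance at most eps."""
    return [(i, j) for i, j in all_subsequences(d) if dist_fn(d[i:j]) <= eps]


# Toy stand-in for dist(., A): one minus the mean of per-state similarity values.
toy_dist = lambda sub: 1.0 - sum(sub) / len(sub)
toy_d = [0.2, 1.0, 0.9]
top2 = nearest_neighbors(toy_d, toy_dist, 2)
within = range_query(toy_d, toy_dist, 0.1)
```

The baseline evaluates a number of subsequences quadratic in the length of d, which is the cost that dedicated algorithms and indices aim to avoid.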

ALGORITHMS FOR INFINITE NORM DISTANCE

Definitions and lemma

Lemma 4.1: Let q be any state in Q and i be an integer such that 1 ≤ i ≤ n. Further, let q1, ..., qm be the successor states of q on input symbols a1, ..., am respectively

ALGORITHMS FOR INFINITE NORM DISTANCE

Employing indices for fast retrieval

For each i = 1, ..., m, we can retrieve a list Li of entries of the form (I, val), where I is an interval of the form [u,v] such that 1 ≤ u ≤ v ≤ n and 0 ≤ val < 1. The entry ([u,v], val) on the list Li denotes that the distance, with respect to ai, of all database states whose indices fall within the range [u,v] is val; that is, for all j such that u ≤ j ≤ v, dist(dj, ai) = val.
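A list Li of this kind can be turned into a per-position lookup. The sketch below rests on two assumptions that the slide does not state: the intervals on one list do not overlap, and a database state covered by no interval has distance 1 with respect to ai. The function name is illustrative.

```python
import bisect


def build_lookup(entries):
    """Build dist_at(j) from a list of ((u, v), val) entries for one symbol.

    Assumes non-overlapping intervals; positions covered by no interval are
    treated as having distance 1.0 (an assumption, not stated in the source).
    """
    entries = sorted(entries)               # order by interval start
    starts = [u for (u, v), val in entries]

    def dist_at(j):
        # Rightmost interval whose start is <= j, if any.
        pos = bisect.bisect_right(starts, j) - 1
        if pos >= 0:
            (u, v), val = entries[pos]
            if u <= j <= v:
                return val
        return 1.0

    return dist_at


# Example list for one input symbol, using 1-based positions as on the slide.
dist_at = build_lookup([((5, 7), 0.2), ((1, 3), 0.0)])
```

Storing runs of equal distance as intervals keeps the list short when long stretches of the database sequence behave the same way for a given atomic query.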

Algorithms for Average Block Distance

For any subsequence σ = (d_i, ..., d_{i+l-1}) of d and any string a = (a_1, ..., a_l) ∈ Σ* of the same length, let bd(σ, a) be the sum Σ_{j=0}^{l-1} dist(d_{i+j}, a_{j+1}); it denotes the block distance between σ and a.
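As a small worked instance of the block distance, the following sketch sums the per-position distances of a subsequence against a string of the same length and also reports the average. The `dist` callback and the toy data are hypothetical.

```python
def block_distance(sigma, a, dist):
    """bd(sigma, a): sum of dist(state, symbol) over aligned positions."""
    if len(sigma) != len(a):
        raise ValueError("sigma and a must have the same length")
    return sum(dist(state, symbol) for state, symbol in zip(sigma, a))


def average_block_distance(sigma, a, dist):
    """Block distance divided by the common length."""
    return block_distance(sigma, a, dist) / len(sigma)


# Toy data: each database state maps an atomic query to its distance.
toy_sigma = [{"a": 0.0, "b": 1.0}, {"a": 0.5, "b": 0.25}, {"a": 1.0, "b": 0.25}]
toy_dist = lambda state, symbol: state[symbol]
bd = block_distance(toy_sigma, ["a", "b", "b"], toy_dist)
avg = average_block_distance(toy_sigma, ["a", "b", "b"], toy_dist)
```

Dividing by the length keeps the value in [0, 1] when every per-position distance is in [0, 1].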

Algorithms for Average Block Distance

Let val(q, i, r) = min{ bd(σ, a) : σ is a subsequence of d starting from di, and a is a string in T(q) of the same length as σ whose pseudo length is r }.

Here T(q) is the set of strings accepted by A starting from the state q.

Algorithms for Average Block Distance

AVG-DIST: computes the minimum of the distances of all the subsequences of the database sequence with respect to the automaton A.

AVG-DIST-RESTR-AUT: the corresponding algorithm for cycle-restricted automata.

Experimental Results

We have implemented all the algorithms: INF-NORM, INFNORM-INDX, AVG-DIST, INF-NORM-RESTR-AUT and AVG-DIST-RESTR-AUT.

The algorithms are run, using SQL, on a stock market database.

Experimental Results

The database stores the end-of-day Dow-Jones Industrial averages over the last 98 years, giving a database sequence of length 26,716 (the length is the total number of trading days during that period).

The query is specified by an automaton that accepts the language given by the regular expression ab*c.

Conclusion and Discussion

We introduced a powerful formalism based on automata for expressing queries on sequence databases.

We have also given efficient algorithms for similarity based retrieval that employ indices.

We implemented the algorithms for time-series databases on a PC using Sequel Server.

Conclusion and Discussion

Experimental results showing the effectiveness of our methods are presented.

It will also be interesting to see if and how the techniques of the paper can be extended for data mining over sequences.