Incremental Mining of Information Interest for Personalized Web Scanning

Rey-Long Liu (劉瑞瓏 )

Dept. of Medical Informatics

Tzu Chi University

Problem Definition

• Personalized web scanning
  • An environmental scanning routine for users and businesses
  • A resource-consuming job (e.g. in network bandwidth)
• Key issues
  • Seed finding
  • Information crawling
  • Information monitoring
• Scanning should be guided by proper information interest, which is both
  • Implicit: the user is unable and/or unwilling to express the interest, and
  • Evolving: the interest may change, although it is relatively long-term

[Figure: system overview. Labeled components: User, Personalized Folder, Interest Miner, Scanner (Seed Finding, Gathering & Monitoring), and The Web. Labeled flows: Interest designation, New Info, Info Scanned, and Spec. for user's interest, e.g. {<k1 OR k2> AND <k3 OR k4> …}.]

Our goal: Incremental mining of information interest to guide web scanning

Related Fields

• Information gathering
  • Aimed at “one-shot” information needs, rather than relatively long-term needs
• Information monitoring
  • Aimed at the “dynamics” of the information of interest (IOI), rather than the location of the IOI
• Profile building for folders (categories)
  • Aimed at information analysis (e.g. information classification and similarity measurement), rather than the derivation of comprehensible specifications

Major Challenges

• Interest specifications should be both
  • Precise: to direct the scanner to suitable information subspaces
  • Comprehensible: to allow the user to refine the specifications, and to allow search engines to find proper seeds for scanning
• The specifications should be derived under the common condition that the user’s interest is often
  • Implicit,
  • Evolving, and
  • Collectively defined by a hierarchy of folders in which each folder’s context of discussion (COD) is implicitly expressed
• Example: the same folder name under two different parents
  • Root → Systems Development → Decision Support Systems
  • Root → Manufacturing → Decision Support Systems
• A folder’s COD is actually indicated by the profiles of its ancestors.

IMind

• Main contributions
  • Incrementally mining interest specifications that are more
    • Precise (by specifying each folder’s COD), and
    • Comprehensible (in conjunctive normal form)
  • No predefined feature sets

• Input
  • A hierarchy T of folders,
  • A set G of folders designated as the goals of web scanning, and
  • A set X of documents added to a folder f.
• Output
  • Update the profile of each related folder of f in T.
  • For each folder g in G, if the interest specification of g has changed, send the new specification to the scanner.
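The input above can be pictured with a small data model. This is only a sketch: the class `Folder`, the helper `related_folders`, and the reading of "related folder" as f, its ancestors below the root, and their siblings are assumptions made for illustration, based on the steps of the algorithm.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)  # identity-based equality; folders form a cyclic graph
class Folder:
    """One node of the hierarchy T (hypothetical structure)."""
    name: str
    parent: Optional["Folder"] = None
    children: List["Folder"] = field(default_factory=list)
    # term -> (r, d): representativeness and discriminative power
    profile: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def add_child(self, name: str) -> "Folder":
        child = Folder(name, parent=self)
        self.children.append(child)
        return child

    def siblings(self) -> List["Folder"]:
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]


def related_folders(f: Folder) -> List[Folder]:
    """Folders whose profiles may change when documents are added to f."""
    related = []
    while f.parent is not None:        # stop at the root
        related.append(f)              # f or one of its ancestors
        related.extend(f.siblings())   # siblings at the same level
        f = f.parent
    return related
```

With such a model, the set G is simply a list of `Folder` objects, and X is a list of document texts passed along with the folder f they were filed under.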

Example output of IMind

Folder hierarchy and profile terms:

Root
  Computer & Internet: file, information, window, system, site, server, …
    Hardware: CPU, bit, instruction, register, processor, chip, …
      Desktop Computers: card, machine, PC, sound, printer, …
    …
  …

The interest specification for Desktop Computers:
(file OR information OR window OR system OR site OR server OR …) AND
(CPU OR bit OR instruction OR register OR processor OR chip OR …) AND
(card OR machine OR PC OR sound OR printer OR …).
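The specification above is a conjunction of one disjunctive clause per level of the hierarchy. A minimal sketch of how such a string could be assembled is shown here; the function name `cnf_specification` and the exact query syntax are assumptions, not the format IMind actually sends to the scanner.

```python
def cnf_specification(term_groups):
    """Join term groups into '(a OR b) AND (c OR d)' form.

    term_groups: one list of terms per hierarchy level, outermost first.
    Empty groups are skipped so they do not produce '()' clauses.
    """
    clauses = ["(" + " OR ".join(terms) + ")" for terms in term_groups if terms]
    return " AND ".join(clauses)


spec = cnf_specification([
    ["file", "information", "window"],   # Computer & Internet
    ["CPU", "bit", "instruction"],       # Hardware
    ["card", "machine", "PC"],           # Desktop Computers
])
print(spec)
```

Each clause narrows the context set by its ancestors, which is what lets a search engine distinguish folders that share a name but sit under different parents.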

The algorithm

Incremental update of folder profiles:
(1) W ← {w | w is a word in X, and w is not a stop word};
(2) While (f is not the root of T) do
    (2.1) Construct or update each 3-tuple <w, $r_{w,f}$, $d_{w,f}$> in the profile of f;
    (2.2) For each sibling b of f, update $d_{w,b}$;
    (2.3) f ← parent of f;

Derivation of interest specifications:
(3) For each goal folder g in G, do
    (3.1) $I_g$ ← disjunction of the profile terms having higher $r_{w,g} \times d_{w,g}$ values (a number of profile terms in g are selected);
    (3.2) a ← parent of g;
    (3.3) While (a is not the root of T) do
        (3.3.1) $I_g$ ← conjunction of $I_g$ and the disjunction of the terms having higher $r_{w,a} \times d_{w,a}$ values (a number of profile terms in both a and g are selected);
        (3.3.2) a ← parent of a;
    (3.4) If $I_g$ ≠ specification of g, send $I_g$ to the scanner to update the specification of g;
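The two phases above can be sketched in Python as follows. Several things here are assumptions rather than details from the slides: support is taken to be the fraction of a folder's documents that contain the term, counts are stored and the r- and d-values are computed on demand (so sibling d-values change automatically), terms are ranked by the product r × d, ancestor clauses are restricted to terms that also occur in the goal folder, and the stop-word list and all names are placeholders.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}  # toy list, an assumption


class Folder:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.n_docs = 0       # documents filed in this folder or below it
        self.df = Counter()   # term -> number of those documents containing it
        if parent is not None:
            parent.children.append(self)

    def r(self, w):
        """Representativeness: assumed document-level support of w."""
        return self.df[w] / self.n_docs if self.n_docs else 0.0

    def d(self, w):
        """Discriminative power: support relative to the sibling average."""
        group = self.parent.children          # this folder and its siblings
        avg = sum(b.r(w) for b in group) / len(group)
        return self.r(w) / avg if avg else 0.0


def add_documents(f, docs):
    """Steps 1-2: update counts for f and each ancestor below the root."""
    doc_terms = [set(doc.lower().split()) - STOP_WORDS for doc in docs]
    while f.parent is not None:
        f.n_docs += len(doc_terms)
        for terms in doc_terms:
            f.df.update(terms)
        f = f.parent


def top_terms(folder, k, allowed=None):
    """The k terms with the highest r * d (ties broken alphabetically)."""
    candidates = [w for w in folder.df if allowed is None or w in allowed]
    candidates.sort(key=lambda w: (-folder.r(w) * folder.d(w), w))
    return candidates[:k]


def derive_specification(g, k=2):
    """Step 3.1-3.3: one disjunctive clause per level, for a non-root g."""
    clauses = [top_terms(g, k)]
    a = g.parent
    while a.parent is not None:
        clauses.append(top_terms(a, k, allowed=set(g.df)))
        a = a.parent
    return clauses


def refresh_goals(goals, current_specs, k=2):
    """Step 3.4: return only the specifications that have changed."""
    changed = {}
    for g in goals:
        spec = derive_specification(g, k)
        if current_specs.get(g.name) != spec:
            current_specs[g.name] = spec
            changed[g.name] = spec
    return changed
```

A caller would invoke `add_documents(f, X)` whenever documents are filed and then pass the goal folders to `refresh_goals`, forwarding whatever it returns to the scanner.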

• Measuring how representative and discriminative a term w is in a folder f:

$r_{w,f} = \mathrm{Support}(w, f) = P(w \mid f)$

$d_{w,f} = \mathrm{Support}(w, f) \,/\, \mathrm{Avg}_i\, \mathrm{Support}(w, f_i)$, where $f_i \in \{f\} \cup \{\text{siblings of } f\}$

Example (terms marked O or X per folder):

…
  Systems Development: System, Computer, Analysis, … (O)
    Decision Support Systems: Decision, simulation, … (O); System, Computer, … (X)
    Transaction Processing Systems: Accounting, Sales, … (O); System, Computer, … (X)
  Manufacturing: Product, factory, … (O)
    Decision Support Systems: Decision, simulation, … (O)
  …
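To make the two measures concrete, here is a small worked example. The support values are invented for illustration and do not come from the slides; DSS and TPS abbreviate the sibling folders Decision Support Systems and Transaction Processing Systems.

```latex
% Assumed supports (illustrative only):
%   Support(system, DSS)   = 0.9,  Support(system, TPS)   = 0.9
%   Support(decision, DSS) = 0.8,  Support(decision, TPS) = 0
\begin{align*}
d_{\text{system},\,\text{DSS}}   &= \frac{0.9}{(0.9 + 0.9)/2} = 1, \\
d_{\text{decision},\,\text{DSS}} &= \frac{0.8}{(0.8 + 0)/2}   = 2.
\end{align*}
```

Under these assumed numbers, a term shared evenly by the siblings gets d = 1 however frequent it is, while a term concentrated in one folder gets d > 1. That is one plausible reading of why System and Computer are marked (X) in the two child folders even though they are frequent there.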

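• Incremental update of profile terms

[Figure: a folder hierarchy in which X, the set of documents, is added to folder f. Along the path from f up to the root, both the r-values and the d-values of the profile terms are updated; for the siblings of those folders, only the d-values of the terms are updated.]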

Complexity of Incremental Mining

• Space complexity
  • O(Nt), where N is the total number of different terms accumulated, and t is the number of folders in the hierarchy
• Time complexity
  • Profile mining (step 2)
    • The maximum number of updates is $\sum_i B_i N$, where $B_i$ is the number of siblings of the level-i ancestor of f (i.e. the ancestor whose level is i) plus one (i.e. including the level-i ancestor itself)
  • Specification derivation (step 3)
    • The maximum number of operations required to update interest specifications is $\sum_i \sum_j G_{i,j} N$, where $G_{i,j}$ is the number of descendant goal folders of the jth sibling of the level-i ancestor of f
• Note: the above numbers should be much smaller in practice, since each folder is quite unlikely to contain all N terms
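The two worst-case bounds above are simple sums, which the following sketch evaluates for made-up numbers. The function names and the sample hierarchy sizes are assumptions for illustration; whether the level-i ancestor itself is counted among the "siblings" in the second bound is not stated, so the inner lists are simply whatever counts the caller supplies.

```python
def max_profile_updates(sibling_group_sizes, n_terms):
    """Upper bound for step 2: sum over levels of (siblings + 1) * N.

    sibling_group_sizes[i] is the number of siblings of the level-i
    ancestor of f, plus one for the ancestor itself.
    """
    return sum(sibling_group_sizes) * n_terms


def max_spec_operations(goal_descendants, n_terms):
    """Upper bound for step 3: double sum of goal-folder counts times N.

    goal_descendants[i][j] is the number of descendant goal folders of
    the j-th folder considered at level i.
    """
    return sum(sum(level) for level in goal_descendants) * n_terms


# f three levels below the root, with 4, 3 and 5 folders per sibling group
print(max_profile_updates([4, 3, 5], n_terms=1000))              # 12000
print(max_spec_operations([[2, 0, 1, 3], [1, 1, 0], [1, 0, 0, 0, 0]],
                          n_terms=1000))                         # 9000
```

Both bounds grow with the width of the sibling groups along the path from f to the root, not with the size of the whole hierarchy.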

Empirical Evaluation

• Experimental data
  • Source: Yahoo! (http://www.yahoo.com)
  • Coverage: Computers & Internet, Society and Culture, and Science
  • The larger hierarchy:
    • 261 folders, of which 174 were leaf folders; 142 of the leaf folders were not duplicates and were set as goal folders
    • 2844 documents
  • The smaller hierarchy:
    • 169 folders, of which 119 were leaf folders; 109 of the leaf folders were not duplicates and were set as goal folders
    • 3615 documents

• Evaluation method
  • Sending the specifications to Yahoo!
    • Other search engines were tried as well: Google (http://www.google.com), Lycos (http://www.lycos.com), Open Directory Project (ODP, http://www.dmoz.org), AltaVista (http://www.altavista.com), and Netscape (http://www.netscape.com). However, they limited the number of terms in a query and/or did not return the categories of the web sites.
    • Yahoo! returns web sites and their categories.
  • The top 200 web sites are considered
    • In practice, the web scanner may process only a limited number of seeds.
    • Yahoo! claims to sort the web sites by relevance using its own complicated and proprietary algorithm.

• Evaluation criteria
  • Completeness: average sites found per folder
  • Reliability: percentage of folders with sites retrieved
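Both criteria above reduce to simple statistics over a per-folder count of sites found. A minimal sketch follows; the function names, the input format, and the sample counts are assumptions, and how a returned site is judged to belong to a goal folder is left outside the sketch.

```python
def completeness(sites_per_folder):
    """Average number of sites found per goal folder."""
    if not sites_per_folder:
        return 0.0
    return sum(sites_per_folder.values()) / len(sites_per_folder)


def reliability(sites_per_folder):
    """Percentage of goal folders for which at least one site was found."""
    if not sites_per_folder:
        return 0.0
    hit = sum(1 for n in sites_per_folder.values() if n > 0)
    return 100.0 * hit / len(sites_per_folder)


# Hypothetical counts for four goal folders
counts = {"A": 3, "B": 0, "C": 5, "D": 0}
print(completeness(counts))  # 2.0
print(reliability(counts))   # 50.0
```

The two measures can diverge: a method that finds many sites for a few folders scores well on completeness but poorly on reliability.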

• Systems evaluated
  • IMind (with the parameter set to 10 and 20, giving IMind-10 and IMind-20)
  • Baselines (with the same number of terms as IMind)
    • Vector-based approaches
      • Norm-of-the-folder (NOF): the profile of the folder was a vector constructed by averaging the document vectors in the folder
      • Rocchio’s method (RO): the profile was a vector constructed by computing a weighted sum of the positive document vectors and the negative document vectors
    • Probability-based approach
      • Naive Bayes (NB): the profile was constructed by estimating the conditional probabilities of the terms in the folder
    • Hierarchical approach
      • Hierarchical Shrinkage (HS): the profile was constructed by employing the hierarchical relationships (e.g. sibling) among folders to refine the estimates of the conditional probabilities produced by NB
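The two vector-based baselines described above can be sketched with plain dictionaries as sparse term vectors. The weights `beta` and `gamma`, the term-weighting of the input vectors, and the step of keeping the top-k terms as a query are assumptions; the slides do not give the settings used in the experiments.

```python
def centroid(vectors):
    """NOF-style profile: the average of the document vectors."""
    if not vectors:
        return {}
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: w / len(vectors) for term, w in total.items()}


def rocchio(positive, negative, beta=1.0, gamma=0.5):
    """Rocchio-style profile: weighted positive centroid minus negative one."""
    pos, neg = centroid(positive), centroid(negative)
    terms = set(pos) | set(neg)
    return {t: beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0) for t in terms}


def top_terms(profile, k):
    """The k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical term-frequency vectors
pos_docs = [{"cpu": 2.0, "system": 1.0}, {"cpu": 1.0, "chip": 1.0}]
neg_docs = [{"system": 2.0, "sales": 1.0}]
print(top_terms(centroid(pos_docs), 2))
print(top_terms(rocchio(pos_docs, neg_docs), 2))
```

Unlike the conjunctive specifications derived by IMind, profiles of this kind yield a single flat list of terms per folder, with no separate clause for each ancestor's context.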

Results

[Figure: Average sites found per folder (the larger hierarchy). Two panels: IMind-10 vs. NOF-10, RO-10, NB-10, and HS-10; IMind-20 vs. NOF-20, RO-20, NB-20, and HS-20. x-axis: feature set size (10000, 20000, 40000); y-axis: average sites found per folder.]

[Figure: Average sites found per folder (the smaller hierarchy). Two panels: IMind-10 vs. NOF-10, RO-10, NB-10, and HS-10; IMind-20 vs. NOF-20, RO-20, NB-20, and HS-20. x-axis: feature set size (10000, 20000, 40000); y-axis: average sites found per folder.]

[Figure: Percentage of folders with sites retrieved (the larger hierarchy). Two panels: IMind-10 vs. NOF-10, RO-10, NB-10, and HS-10; IMind-20 vs. NOF-20, RO-20, NB-20, and HS-20. x-axis: feature set size (10000, 20000, 40000); y-axis: folders with sites found (%).]

[Figure: Percentage of folders with sites retrieved (the smaller hierarchy). Two panels: IMind-10 vs. NOF-10, RO-10, NB-10, and HS-10; IMind-20 vs. NOF-20, RO-20, NB-20, and HS-20. x-axis: feature set size (10000, 20000, 40000); y-axis: folders with sites found (%).]

More specifically, the results showed that

• IMind derived more precise specifications
  • Making seed finding both more complete and more reliable
  • Some specifications derived by the baselines were too vague for Yahoo! to process: Yahoo! did not respond to 2, 3, 19, and 78 queries generated by RO-20, NOF-20, NB-20, and HS-20, respectively
• IMind derived more comprehensible specifications
  • Specifying each level of the COD of each folder
• IMind improved more when more training data was given
  • Contributing more significant improvements on the smaller hierarchy, which has more training documents
• IMind does not require feature set tuning
  • Demonstrating more stable performance

• IMind successfully controlled the time spent to process each document
  • The time mainly depends on the number of terms in the related folders, and this number should converge to a certain limit

[Figure: Time spent for individual documents sequentially added into the larger hierarchy (running on a PC with a 2.6 GHz CPU and 2 GB of RAM). x-axis: document ID (0 to 2500); y-axis: time spent (sec., 0 to 20).]


Conclusion

Personalized web scanning needs to be guided by the user’s information interest, which is both implicit and evolving

IMind is an incremental text mining system to derive precise and comprehensible interest specifications

Extension

• How can the user refine the specifications mined?
  • An intelligent interface to guide the refinement
• How can the length of the specifications be determined more intelligently?
  • Automatic thresholding
  • Manual setting

More related extensions

• Information Scanning: Autonomous scanning, Adaptive discovery, Adaptive monitoring, & Adaptive elicitation
• Information Analysis: Exception management, Trend detection, Association detection, Event tracking, & Novelty detection
• Information/Knowledge Classification & Filtering: Semantic context recognition, Integrated filtering and classification, & Incremental context mining
• Environmental Information: Partners, Customers, Competitors, Government, & News providers
• Internal Information: Transaction data, Knowledge shared, & Information shared
• Information/Knowledge Delivery: Intelligent information retrieval, Adaptive online guidance, Adaptive dissemination, People finding, Knowledge finding, Knowledge map, & Computer-Assisted Instruction

Thanks