GRIDSD

Pattern extraction from very large databases

Centro de Ciências e Tecnologia da Computação
Departamento de Informática
Escola de Engenharia
Universidade do Minho
PORTUGAL

Departamento de Informática, Universidade do Minho

Braga, 27 March 2005

Ronnie Alves ([email protected])

http://alfa.di.uminho.pt/~ronnie



GRID - Grupo de Investigação e Desenvolvimento em Sistemas de Dados

Outline

• DBMS: the largest and heaviest
• Exploiting the DBMS
• Database mining
  – Large transactional tables
  – Loosely coupled
  – Tightly coupled
• Knowledge extraction via procedural schema
  – Itemset mining
• Related work


DBMS: the largest and heaviest

• Database Size, Decision Support Systems – All Environments & UNIX Only: Top honors went to France Telecom for the largest database in the All Environments and UNIX Only categories. At 29.2TB, the database was three times as big as that of the 2001 winner. France Telecom runs the Oracle Database on HP Superdome servers and HP RAID storage systems.

• Rows/Records, Decision Support Systems – All Environments & UNIX Only: AT&T received two additional Grand Prizes for most rows, All Environments and UNIX Only categories. The 496 billion-row system represents a doubling of the highest figure for this category in 2001.

  » DM Review Magazine, March 2004 issue
  » http://www.dmreview.com/article_sub.cfm?articleId=8182


Exploiting the DBMS

• Retrieving and summarizing data
  – SQL
• SQL + aggregate functions (count, min, sum, …)
    » SELECT Empresa, Produto, SUM(Total)
    » FROM Vendas
    » GROUP BY Empresa, Produto
    » HAVING SUM(Total) > 10000
• Extracting patterns from the data
  – Goes beyond the capabilities of ANSI SQL.
    » This is where mining algorithms and tools come into play
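As a minimal sketch, the aggregation query on this slide can be run against an in-memory SQLite database through Python's standard `sqlite3` module. Only the table and column names (`Vendas`, `Empresa`, `Produto`, `Total`) come from the slide; the sample rows and the helper name `summarize_sales` are made up for illustration.

```python
import sqlite3

def summarize_sales(rows, threshold=10000):
    """Return (Empresa, Produto, SUM(Total)) groups whose total exceeds threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Vendas (Empresa TEXT, Produto TEXT, Total REAL)")
    conn.executemany("INSERT INTO Vendas VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT Empresa, Produto, SUM(Total)
        FROM Vendas
        GROUP BY Empresa, Produto
        HAVING SUM(Total) > ?
        ORDER BY Empresa, Produto
        """,
        (threshold,),
    ).fetchall()
    conn.close()
    return result

# Hypothetical sample data, not taken from the presentation.
SAMPLE_ROWS = [
    ("Acme", "Widget", 6000),
    ("Acme", "Widget", 5000),
    ("Acme", "Gadget", 3000),
    ("Globex", "Widget", 12000),
]

if __name__ == "__main__":
    for row in summarize_sales(SAMPLE_ROWS):
        print(row)
```

The query summarizes the data but says nothing about which products tend to be sold together, which is the kind of question the mining algorithms address.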


Database mining

• Basic activity
  – Finding frequent patterns (itemset mining)


Database mining

• Loosely coupled
  – The mining process runs outside the DBMS
  – The DBMS is used only to fetch the data
• Tightly coupled
  – All the resources of the DBMS are used for the mining process
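The difference between the two styles can be sketched with item counting, the simplest mining step. This is an illustration only: the `trans(tid, item)` schema and the function names are assumptions, and SQLite stands in for the DBMS.

```python
import sqlite3
from collections import Counter

def make_db(transactions):
    """Load (tid, item) pairs into an in-memory table; the schema is an assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trans (tid INTEGER, item TEXT)")
    conn.executemany(
        "INSERT INTO trans VALUES (?, ?)",
        [(tid, item) for tid, items in transactions.items() for item in items],
    )
    return conn

def item_counts_loose(conn):
    """Loosely coupled: the DBMS only ships rows; counting happens in the client."""
    counts = Counter()
    for (item,) in conn.execute("SELECT item FROM trans"):
        counts[item] += 1
    return dict(counts)

def item_counts_tight(conn):
    """Tightly coupled: the counting itself is pushed into the DBMS."""
    return dict(conn.execute("SELECT item, COUNT(*) FROM trans GROUP BY item"))
```

Both functions return the same counts; what changes is where the work is done and how much data leaves the database.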




Itemset Mining

• The main and most costly task in generating association rules
• Rules of the form: “A ⇒ B [support s, confidence c]”.
• Examples:
  – buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
  – age(x, “30-34”) ^ income(x, “42K-48K”) ⇒ buys(x, “high resolution TV”) [2%, 60%]
  – major(x, “CS”) ^ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]
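The two numbers attached to a rule can be computed directly from a list of transactions. The sketch below uses the usual definitions, support as the fraction of transactions containing all items of the rule and confidence as support(A ∪ B) / support(A); the basket data is hypothetical and much smaller than anything the percentages on the slide would come from.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(A ∪ B) / support(A) for the rule A => B."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical baskets for illustration only.
BASKETS = [
    {"diapers", "beers"},
    {"diapers", "beers", "milk"},
    {"diapers", "milk"},
    {"milk"},
]

if __name__ == "__main__":
    print(support(BASKETS, {"diapers", "beers"}))
    print(confidence(BASKETS, {"diapers"}, {"beers"}))
```

Computing the support of every candidate itemset is the expensive part, which is why itemset mining dominates the cost of rule generation.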


Itemset mining algorithms

• Apriori

• FP-Growth


Apriori: pseudocode

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

PROBLEM: SQL-based implementations of Apriori generate several costly joins as k grows!
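The join problem can be made concrete with a simplified sketch. It counts k-itemsets with a k-way self-join of an assumed `trans(tid, item)` table in SQLite; published SQL-based Apriori variants also join against a candidate table, which is left out here, so this shows only how the number of joins grows with k. All names are hypothetical.

```python
import sqlite3

# Hypothetical transaction data: tid -> items.
EXAMPLE_D = {100: [1, 3, 4], 200: [2, 3, 5], 300: [1, 2, 3, 5], 400: [2, 5]}

def load_transactions(transactions):
    """Store one (tid, item) row per item; each pair is assumed to be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trans (tid INTEGER, item INTEGER)")
    conn.executemany(
        "INSERT INTO trans VALUES (?, ?)",
        [(tid, item) for tid, items in transactions.items() for item in items],
    )
    return conn

def count_itemsets_sql(conn, k, min_count):
    """Count all k-itemsets with a k-way self-join of trans(tid, item)."""
    aliases = [f"t{i}" for i in range(1, k + 1)]
    select = ", ".join(f"{a}.item" for a in aliases)
    joins = "trans t1"
    for prev, cur in zip(aliases, aliases[1:]):
        # Each extra item in the itemset costs one more self-join.
        joins += (
            f" JOIN trans {cur} ON {cur}.tid = t1.tid"
            f" AND {cur}.item > {prev}.item"
        )
    sql = (
        f"SELECT {select}, COUNT(*) FROM {joins} "
        f"GROUP BY {select} HAVING COUNT(*) >= ? ORDER BY {select}"
    )
    return conn.execute(sql, (min_count,)).fetchall()

if __name__ == "__main__":
    conn = load_transactions(EXAMPLE_D)
    for k in (1, 2, 3):
        print(k, count_itemsets_sql(conn, k, 2))
```

For k = 3 the generated statement already joins the transaction table with itself twice, and every further level adds another join over the full table.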


Apriori example

Database D
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D → C1
  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {4}       1
  {5}       3

L1
  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {5}       3

C2 (generated from L1)
  itemset
  {1 2}
  {1 3}
  {1 5}
  {2 3}
  {2 5}
  {3 5}

Scan D → C2 with counts
  itemset   sup
  {1 2}     1
  {1 3}     2
  {1 5}     1
  {2 3}     2
  {2 5}     3
  {3 5}     2

L2
  itemset   sup
  {1 3}     2
  {2 3}     2
  {2 5}     3
  {3 5}     2

C3 (generated from L2)
  itemset
  {2 3 5}

Scan D → L3
  itemset   sup
  {2 3 5}   2
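The trace above can be reproduced with a short in-memory sketch of the Apriori loop. The minimum support count of 2 is inferred from the example (the itemset {4}, with support 1, is the only one dropped from C1); the function name and data layout are illustrative choices, not the author's implementation.

```python
from itertools import combinations

# Database D from the example.
DATABASE_D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # Candidate generation: join L_k with itself and keep the (k+1)-sets
        # whose k-subsets are all frequent (the Apriori pruning step).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Support counting: one scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

if __name__ == "__main__":
    for itemset, count in sorted(
        apriori(DATABASE_D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
    ):
        print(sorted(itemset), count)
```

The pruning step is what keeps {1 2 3} and {1 3 5} out of C3: each has a 2-subset that is not in L2.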




FP-Growth

FP-tree

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Header Table
  Item   Frequency
  f      4
  c      4
  a      3
  b      3
  m      3
  p      3

min_support = 0.5

  TID   Items bought               (Ordered) frequent items
  100   {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}      {f, c, a, b, m}
  300   {b, f, h, j, o}            {f, b}
  400   {b, c, k, s, p}            {c, b, p}
  500   {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:

1. Find the frequent items (itemsets of size 1)

2. Sort the items in descending order of frequency

3. Build the corresponding FP-tree
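The three steps can be sketched in Python as follows. Two assumptions are made: min_support = 0.5 over five transactions is read as a minimum count of 3, and the item order f, c, a, b, m, p is taken from the header table as given, since the ties (f and c at 4, the rest at 3) could be broken differently. `Node` and `build_fp_tree` are illustrative names.

```python
from collections import Counter

TRANSACTIONS = [
    list("facdgimp"),
    list("abcflmo"),
    list("bfhjo"),
    list("bcksp"),
    list("afcelpmn"),
]

class Node:
    """One FP-tree node: an item, its count, and its children by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count, order=None):
    """Return (root, order) for the FP-tree of the given transactions."""
    # Step 1: count items and keep the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i for i in freq if freq[i] >= min_count}
    # Step 2: order by descending frequency unless an explicit order is given.
    if order is None:
        order = sorted(frequent, key=lambda i: (-freq[i], i))
    order = [i for i in order if i in frequent]
    rank = {item: pos for pos, item in enumerate(order)}
    # Step 3: insert each ordered transaction as a path from the root.
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, order

def dump(node, depth=0):
    """Print the tree, one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

if __name__ == "__main__":
    root, _ = build_fp_tree(TRANSACTIONS, 3, order=list("fcabmp"))
    dump(root)
```

Transactions that share a prefix share a path, which is how the tree compresses the database: the five transactions collapse into eleven nodes.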




FP-Growth Conditional Pattern Base

Conditional pattern bases

  item   cond. pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1
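The table above can be checked with a small sketch. Instead of walking the tree, it counts prefixes directly over the ordered frequent-item lists; this is a shortcut, not the tree traversal itself, but it yields the same bases because transactions with a common prefix share that path in the FP-tree. The function name is hypothetical.

```python
from collections import Counter, defaultdict

# Ordered frequent items of the five example transactions.
ORDERED = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_bases(ordered_transactions):
    """Map each item to a Counter of the prefix paths that lead to it."""
    bases = defaultdict(Counter)
    for t in ordered_transactions:
        for pos, item in enumerate(t):
            if pos > 0:  # an item directly under the root has no prefix path
                bases[item][tuple(t[:pos])] += 1
    return bases

if __name__ == "__main__":
    bases = conditional_pattern_bases(ORDERED)
    for item in "cabmp":
        parts = [f"{''.join(path)}:{n}" for path, n in bases[item].items()]
        print(item, ", ".join(parts))
```

Note that f has no conditional pattern base, because it is always the first item of its path.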



FP-Growth Construct Conditional FP-tree

m-conditional pattern base:

  fca:2, fcab:1

All frequent patterns concerning m:

  m,
  fm, cm, am,
  fcm, fam, cam,
  fcam


• PROBLEM: the size of the FP-tree versus the available memory. The recursion that generates the conditional FP-trees implies rebuilding tables (CONFP) many times!
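The patterns listed for m can be derived from its conditional pattern base with the sketch below. It assumes a minimum count of 3 and that the conditional FP-tree is a single path, which holds for m (b drops out with count 1, leaving f, c, a); a branching conditional tree would need the recursive step of FP-growth, which is exactly where the repeated rebuilding comes from. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def patterns_from_base(item, base, min_count):
    """List the frequent patterns that contain item, given its pattern base.

    base maps a prefix path (a string of items) to its count. Only valid
    when the resulting conditional FP-tree is a single path.
    """
    counts = Counter()
    for path, n in base.items():
        for other in path:
            counts[other] += n
    # Items that stay frequent in the conditional database.
    kept = [i for i in counts if counts[i] >= min_count]
    patterns = []
    # On a single path, every combination of the kept items is frequent.
    for size in range(len(kept) + 1):
        for combo in combinations(kept, size):
            patterns.append("".join(combo) + item)
    return patterns

if __name__ == "__main__":
    print(patterns_from_base("m", {"fca": 2, "fcab": 1}, 3))
```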




Pattern Growth Mining on DBMS

• Premises:
  – Do not generate joins
  – Do not allow the CONFP tables to be rebuilt
• Programming the DBMS for itemset mining
  – Stored procedures
  – User-defined functions (UDFs)
  – Cursors (with care!)


PGS steps

• Make a pattern tree available

• Extract patterns via pattern growth


PGS pattern tree

1. Given a minimum support, generate the frequent itemsets of size 1

2. Generate a new transaction table from the result of step 1

3. Create the pattern tree

CREATE TABLE #save_time (start_time datetime NOT NULL)
INSERT #save_time VALUES (GETDATE())
exec genfitems '3'   -- removes infrequent 1-L items
exec gentransfi      -- generates only transactions with frequent 1-L items
exec genefp          -- generates pattern-tree (fp)
exec genpb '3'       -- generates the (pb) and (confp)
exec up_confp        -- maintains the position control of items
GO
SELECT 'Structure, msec' = DATEDIFF(millisecond, start_time, GETDATE())
FROM #save_time
DROP TABLE #save_time
GO




Pattern tree

PROCEDURE EFP
DO with (EXISTS TRANSFI)
  CREATE TABLE EFP (item, cnt, path)
  CREATE TABLE FP (item, cnt, path)
DECLARE
BEGIN
  count = 1
  curpath = null
  c_transfi CURSOR for TRANSFI
  FOR each row in c_transfi
  BEGIN
    curpath = curpath + ':' + c_transfi.item
    INSERT INTO EFP
      values(c_transfi.item, count, curpath)
  END
  SELECT item, sum(cnt) as cnt, path
  INTO FP
  FROM EFP
  GROUP BY item, path
END
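One possible reading of the procedure above, sketched in Python with SQLite standing in for the DBMS. Two things are assumptions rather than facts from the slide: the path is restarted at the beginning of each transaction (the pseudocode does not show a reset), and the input is given as lists of ordered frequent items rather than as a TRANSFI cursor. The table names EFP and FP and the columns (item, cnt, path) follow the pseudocode.

```python
import sqlite3

# Ordered frequent items per transaction (the FP-Growth example data).
ORDERED = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def build_fp_table(ordered_transactions):
    """Return the rows (item, cnt, path) of a table-encoded pattern tree."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE EFP (item TEXT, cnt INTEGER, path TEXT)")
    for items in ordered_transactions:
        curpath = ""  # assumption: the path restarts with every transaction
        for item in items:
            curpath += ":" + item
            # One row per item occurrence, each with count 1.
            conn.execute("INSERT INTO EFP VALUES (?, 1, ?)", (item, curpath))
    # Collapsing identical (item, path) rows yields one row per tree node.
    conn.execute(
        "CREATE TABLE FP AS "
        "SELECT item, SUM(cnt) AS cnt, path FROM EFP GROUP BY item, path"
    )
    rows = conn.execute("SELECT item, cnt, path FROM FP ORDER BY path").fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    for row in build_fp_table(ORDERED):
        print(row)
```

Under these assumptions each row of FP corresponds to one node of the FP-tree, with the path column encoding the route from the root, so the tree never has to be held in memory as a pointer structure.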


PGS mining

1. Extraction of single patterns

2. Extraction of non-single patterns

CREATE TABLE #save_time (start_time datetime NOT NULL)
INSERT #save_time VALUES (GETDATE())
exec genspg '3'   -- generates single patterns in table (patterns)
exec genpg '3'    -- adds non-single patterns to table (patterns)
GO
SELECT 'Mining, msec' = DATEDIFF(millisecond, start_time, GETDATE())
FROM #save_time
DROP TABLE #save_time
GO




Fragment Growth

DECLARE pg_subPath CURSOR for
  SELECT *
  FROM getTable_pb(@v_prefix, @v_item) order by ord
SELECT list_pg_item = pg_subPath.item
FOR each row in pg_subPath
BEGIN
  SELECT node_path = pg_subPath.item + '%' + c_confp.item
  SELECT node_supp =
    getNodeSupp(pg_subPath.prefix, node_path)
  SELECT pat_item = pg_subPath.item
  SELECT pat_fp = node_path + '%' + pg_subPath.prefix
  SELECT pat_cnt = node_supp
  SELECT exist_pat = (
    SELECT count(*) FROM PATTERNS
    WHERE item = pat_item and fp = pat_fp)
  INSERT INTO PATTERNS (item, fp, cnt)
    VALUES (pat_item, pat_fp, pat_cnt)
  SELECT list_pg_item = list_pg_item
    + '%' + pg_subPath.item
END


Dataset generator


Comparison


PGS open questions

• Portability of the algorithm is a problem
  – Another DBMS means another PGS (e.g. Oracle PL/SQL)

• Take the utmost care with cursors!




Sites

• Jiawei Han
  – http://www-faculty.cs.uiuc.edu/~hanj/

• Bart Goethals
  – http://www.adrem.ua.ac.be/~goethals/

• Frequent Itemset Mining Implementations Repository
  – http://fimi.cs.helsinki.fi

• Weka 3: Data Mining Software in Java
  – http://www.cs.waikato.ac.nz/ml/weka/


Related work

• Pattern Growth Tightly Coupled on RDBMS (PKDD’05 submitted)

• Integrating Pattern Growth Mining on SQL Server (technical report)

• A Hybrid Method to Discover Inter-Transactional Rules (JISBD’05 accepted)


Related work

• Starting point for mining over cubes (ROLAP)
  – Itemset mining integrated into the RDBMS
  – Mining of precomputed cubes
  – Mining integrated with cubes built on the fly
  – Application to webhouses

• First iterations
  – When the Hunter becomes the prey... (DataGadgets’04)
  – Clickstreams, The basis to Establish User Navigation Patterns (DataMining’04)
  – Mining Clickstream based Data Cubes (ICEIS’04)



GRIDSD

Pattern extraction from very large databases

Centro de Ciências e Tecnologia da Computação
Departamento de Informática
Escola de Engenharia
Universidade do Minho
PORTUGAL

Ronnie Alves ([email protected])

http://alfa.di.uminho.pt/~ronnie


Belém, 10 January 2005
