Data Warehouse - How to Design One


DESCRIPTION

Reference material on Data Warehouse.


    [email protected] - [email protected]

    Data Warehouse and how to design a Data Warehouse

    Lu c Thng

    BI Team, Research and Development Department, Viettel Telecom Software Center

    [email protected]

    [email protected]

    Contents

    Part 1: What is a Data Warehouse
    1.1 Introduction to databases
    1.2 Data Warehouse
    1.3 The multidimensional data model
        1.3.1 Hierarchies
        1.3.2 Aggregated data
        1.3.3 OLAP operations
        1.3.4 Data warehouse design models
    1.4 Ralph Kimball vs Bill Inmon
    1.5 Data warehouse system architecture
        1.5.1 ETL tier
        1.5.2 Data warehouse tier
        1.5.3 Data exploitation tier

    Part 2: Building a Data Warehouse
    2.1 Building dimension tables
        2.1.1 Dimension table structure
        2.1.2 Flat design and snowflake design
        2.1.3 The time dimension
        2.1.4 Giant dimension tables
        2.1.5 Tiny dimension tables
        2.1.6 Nested dimensions
        2.1.7 One dimension table, or split it in two?
        2.1.8 Updating dimension values
        2.1.9 Late-arriving dimension records and correcting bad dimension data
        2.1.10 Summary
    2.2 Building fact tables
        2.2.1 Fact table structure
        2.2.2 Entity integrity
        2.2.3 Types of fact tables
        2.2.4 Writing data to fact tables
        2.2.5 Factless fact tables
        2.2.6 Aggregate tables
        2.2.7 Summary

    Part 3: Building ETL flows
    3.1 About ETL
    3.2 Extracting data
    3.3 Cleaning and standardizing data

    Part 4: A practical application in telecom: building a data warehouse for prepaid mobile
    4.1 Gathering requirements
    4.2 Designing the data warehouse
    4.3 Building ETL flows
    4.4 Building the OLAP cube

    Part 1: What is a Data Warehouse

    1.1 Introduction to databases

    Databases play a central role in every information system: they are the core subsystem of operational systems that process transactional data. A database consists of a collection of structured data together with a description of that data (metadata), designed to meet the information storage and processing needs of an organization or business. A database is usually deployed with a database management system (DBMS), that is, software that lets users define, create, control, and administer a database.

    In an operational system, a task that needs to access and process data is called a transaction. To guarantee that transactions are processed correctly, the database must satisfy four properties, known together as ACID: Atomicity, Consistency, Isolation, and Durability.

    - Atomicity requires that when a transaction contains several data operations, either all of them succeed or none of them does. A system is atomic when it meets this condition in every failure scenario, including hardware and software failures. From the user's point of view, a successful transaction appears as a single operation, and a failed transaction has no effect on the database at all.

    - Consistency requires that the data in the database be valid before and after every transaction. Every write to the database must satisfy the predefined rules, including but not limited to constraints, cascades, and triggers. This property does not guarantee that the data is correct from a business point of view (that is the programmer's job; if the database did everything, what would coders do?). It only helps limit errors, if any, introduced during software development.

    - Isolation requires that executing several transactions concurrently produce the same result as executing them one after another.

    - Durability requires that once a transaction succeeds, it is recorded permanently, even if a hardware or software failure follows.
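The atomicity and consistency properties above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `account` table, the `transfer` function, and the balances are assumptions made up for this illustration; they do not come from the original document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # a consistency rule
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit is undone too.
        return False


def balances(conn):
    """Return {account id: balance} for inspection."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
```

When the debit would make a balance negative, the constraint fails and the earlier credit in the same transaction is rolled back, so the database looks as if the transfer never started.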

    In a relational database system, data is usually stored as entities. An entity represents a set of real-world objects that the operational system has to manage and store. From an application point of view, entities of the same type have exactly the same attributes.

    // TODO: distinguish strong entities, weak entities, and relationships between entities

    To help keep transactional data consistent, a database design is usually normalized to third normal form (3NF), which requires every entity in the database to satisfy the following conditions:

    - The value of an attribute must be atomic, that is, not a list of values or a composite value (first normal form)

    - Non-key attributes depend on the whole primary key, not on any subset of it (second normal form)

    - Non-key attributes are independent of one another (the value of one attribute cannot be derived from other non-key attributes) (third normal form)

    For example, a Sales invoice object contains the following main information:

    Sales invoice: Order ID (PK), Sale date, Salesperson ID, Salesperson name, Customer name, Customer phone number, Order lines (Item ID, Item name, Unit price, Quantity, Line total)

    The Order lines attribute of this entity violates the atomic-value rule of first normal form. To reach first normal form, we add an Order line entity to the design:

    Sales invoice: Order ID (PK), Sale date, Salesperson ID, Salesperson name, Customer name, Customer phone number
    Order line: Order ID (PK, FK1), Item ID (PK), Item name, Unit price, Quantity, Line total

    This design violates second normal form: in the Order line entity, the Item name attribute depends only on Item ID, which is a subset of the primary key. The fix is to add an Item entity that stores this information.

    Sales invoice: Order ID (PK), Sale date, Salesperson ID, Salesperson name, Customer name, Customer phone number
    Order line: Order ID (PK, FK1), Item ID (PK, FK2), Unit price, Quantity, Line total
    Item: Item ID (PK), Item name, Unit price

    Note: the Unit price attribute also depends on Item ID, but the business requires the price on an invoice to be immutable while the actual item price can change, so this value must also be stored in the Order line entity.

    This design violates third normal form:

    - The Sales invoice table has a Salesperson name field that depends not on the primary key but on the Salesperson ID attribute. The fix is to add a Salesperson table.

    - The Sales invoice table has a Customer phone number field that depends not on the primary key but on the Customer name field. The fix is to add a Customer table.

    - The Order line table has a Line total field that can be derived from the Unit price and Quantity fields. The fix is to drop the Line total field.

    Sales invoice: Order ID (PK), Sale date, Salesperson ID (FK2), Customer ID (FK1)
    Order line: Order ID (PK, FK1), Item ID (PK, FK2), Unit price, Quantity
    Customer: Customer ID (PK), Customer name, Customer phone number
    Salesperson: Salesperson ID (PK), Salesperson name
    Item: Item ID (PK), Item name, Unit price, Item category ID (FK1)
    Item category: Item category ID (PK), Item category name
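As an illustration of the final 3NF design, the sketch below creates the six tables in an in-memory SQLite database and derives the line total at query time instead of storing it. The English table and column names and all sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, unit_price INTEGER,
                   category_id INTEGER REFERENCES item_category(category_id));
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
CREATE TABLE salesperson (salesperson_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE invoice (order_id INTEGER PRIMARY KEY, sale_date TEXT,
                      salesperson_id INTEGER REFERENCES salesperson(salesperson_id),
                      customer_id INTEGER REFERENCES customer(customer_id));
CREATE TABLE order_line (order_id INTEGER REFERENCES invoice(order_id),
                         item_id INTEGER REFERENCES item(item_id),
                         unit_price INTEGER, quantity INTEGER,
                         PRIMARY KEY (order_id, item_id));
-- Made-up sample data
INSERT INTO item_category VALUES (1, 'Drinks');
INSERT INTO item VALUES (10, 'Tea', 5, 1), (11, 'Coffee', 8, 1);
INSERT INTO customer VALUES (1, 'An', '0900');
INSERT INTO salesperson VALUES (1, 'Binh');
INSERT INTO invoice VALUES (100, '2013-01-15', 1, 1);
-- Coffee was sold at 7 although its current list price is 8.
INSERT INTO order_line VALUES (100, 10, 5, 3), (100, 11, 7, 2);
""")


def invoice_lines(conn, order_id):
    """Return (item name, unit price, quantity, line total) for one order.

    The line total is computed from unit price and quantity, not stored.
    """
    return conn.execute(
        "SELECT i.name, l.unit_price, l.quantity, l.unit_price * l.quantity"
        " FROM order_line l JOIN item i ON i.item_id = l.item_id"
        " WHERE l.order_id = ? ORDER BY i.item_id",
        (order_id,)).fetchall()
```

The unit price kept on the order line is the price at the time of sale, which is why it can differ from the current price in the item table.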

    This section gave a brief overview of databases and of how a database is designed in an operational system. The next section covers the data warehouse, its components, and its common designs.

    1.2 Data Warehouse

    As the economy gets tougher and competitors multiply, data analysis becomes ever more important for businesses that want to support decision making and gain a competitive edge. An ordinary database, however, does not meet the needs of data analysis. It serves day-to-day operations well because its main strengths are data integrity, transaction processing, and concurrent access. Such a database is called an operational database or an online transaction processing (OLTP) system. Operational databases usually store only detailed, current data and keep no history. Their data is highly normalized, so performance tends to be poor for complex queries (joining many tables) or large data volumes. In addition, querying data from many different sources is nearly impossible with operational databases alone.

    Where there is demand, supply follows: as early as the 1970s many companies were selling database systems for analysis and reporting, such as Teradata and MAPPER. The term data warehouse, however, was only used in 1988, in an IBM technical paper titled "An architecture for a business and information system" (http://altaplana.com/ibmsj2701G.pdf). This section is devoted to the data warehouse concept.

    According to Wikipedia (http://en.wikipedia.org/wiki/Data_warehouse), a data warehouse is a database dedicated to reporting and data analysis. It supports complex queries and is also the place where data from many different sources is brought together to give the most complete information for analysis. Accordingly, a data warehouse is a collection of data that is subject-oriented, integrated, nonvolatile, and time-varying. In detail:

    - Subject-oriented means that a data warehouse focuses on the analytical needs of managers at different levels of the decision-making process. These needs are usually very specific and revolve around the company's line of business. For example, a distribution company cares about sales, and a telecom company cares about service traffic. A company usually cares about several subjects, however: a distribution company also has to look at warehousing and the supply chain.

    - Integrated means resolving the difficulties of combining data from many different sources: differences in field names (the same data under different names), in meaning (the same name for different data), and in format (the same name and meaning but different data types).

    - Nonvolatile means that data must stay consistent over time (by keeping modification and deletion to a minimum). As a result, a data warehouse holds far more data than an operational system (5 to 10 years, compared with 2 to 6 months in an ordinary database).

    - Time-varying refers to the ability to retrieve the different values a piece of information has had, along with the time each change occurred. For example, a customer's address, email, or phone number may change, but the change must not affect reports and analyses produced before it happened.

    Property | Operational database | Data warehouse
    Users | Operations staff | Managers, data analysts
    Usage | Predictable, repetitive | Ad hoc queries, not known in advance
    Data | Current, detailed | Historical, summarized
    Data organization | By business requirement | By subject of analysis
    Data structure | Optimized for small transactions | Optimized for complex queries over large volumes
    Access frequency | High | Medium to low
    Access type | Read, write, update, delete | Read, write
    Records per access session | Few | Very many
    Access time | Short | Relatively long (up to minutes or hours)
    Concurrency | High; concurrent operations on the same record are frequent | Low
    Locking | Required | Not required
    Update frequency | Frequent | Not updated
    Data redundancy | Low (normalized tables) | High (tables are usually denormalized)
    Data model | Entity-relationship model | Multidimensional model
    Deployment model | Whole system at once | Incremental, one data mart at a time

    Table 1.1: Comparison of operational databases and data warehouses

    A data warehouse lets managers and decision makers analyze data interactively through an online analytical processing (OLAP) system. It is also used for reporting, data mining, and statistical analysis. Database and data warehouse therefore differ only conceptually: a database used exclusively for these purposes is also considered a data warehouse.

    So if a database is like a personal bookshelf, where books are constantly looked up, updated, revised, annotated in the margins, added, or moved away, then a data warehouse is like a national library, where classic works keep arriving to be archived and consulted, and nobody edits them or moves them elsewhere.

    1.3 The multidimensional data model

    Data warehouses and OLAP systems are built on the multidimensional data model. This model performs well on complex queries and lets users look at data from many different angles. It presents data as an n-dimensional space called a data cube or hypercube.

    Figure 1.1: A three-dimensional cube of sales data with the dimensions Store, Time, and Product and the measure Sales amount

    A data cube is defined by its dimensions and measures. A dimension is a perspective used to analyze the data. For example, the data cube in Figure 1.1 analyzes sales figures and has three dimensions: Store, Time, and Product. The values of a dimension are called dimension members. For example, Paris, Nice, Rome, and Milan are members of the Store dimension. Dimensions usually carry attributes that describe them further. For example, the Product dimension may contain attributes such as Product ID, Product name, Description, and Size, although these attributes are not shown in the figure.

    Besides dimensions, the cells of a cube contain numeric values called measures. The multidimensional model requires that arithmetic operations (addition, subtraction, multiplication, division) can be applied to these measures while the figures keep their meaning. The cube in Figure 1.1 has one measure, Sales amount. A cube usually has several measures. Although it is not shown, the cube in Figure 1.1 could also have a Quantity measure (number of products sold).

    1.3.1 Hierarchies

    The level of detail at which measures are presented to the user is called the data granularity. It is determined by combining the granularities of the individual dimensions. In Figure 1.1, for example, the granularity is city for the Store dimension, quarter for the Time dimension, and product category for the Product dimension.

    To extract knowledge from data, users need to look at a cube at several levels of detail. In the example of Figure 1.1, a user may want sales measures at a finer level, such as store, or at a coarser level, such as country. OLAP hierarchies make this possible by defining a tree of levels of detail for a dimension. For two adjacent levels in the tree, the lower one is called the child level and the higher one the parent level. Figure 1.2 below shows the levels of the Store dimension: each store is assigned to a city, each city to a province, and each province to a country. The top of the hierarchy is the All level, which represents the whole hierarchy. It has a single member, also called All, used to obtain the measure aggregated to the highest level (in this example, total sales across all countries).

    All level:      All
    Country level:  France; Italy
    Province level: Île-de-France, Provence-Alpes-Côte d'Azur (France); Lazio, Lombardy (Italy)
    City level:     Paris (Île-de-France); Nice (Provence-Alpes-Côte d'Azur); Rome (Lazio); Milan (Lombardy)
    Store level:    Store 1, Store 2, ... under each city

    Figure 1.2: Members of the Store hierarchy

    //TODO: consider writing about unbalanced hierarchies (hierarchies in which some branches lack levels that other branches have).

    1.3.2 Aggregated data

    Aggregation happens when the user changes the level of detail of the data taken from the cube by moving through a dimension's hierarchy. In Figure 1.1, for example, if the Store dimension uses the province level instead of the city level, the sales of all cities in the same province are added together. Likewise, data at the All level is aggregated from the values of all countries.
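The aggregation described above can be sketched as follows. The parent mappings follow the members named in Figure 1.2; the sales figures and the `roll_up` helper are assumptions invented for this illustration.

```python
from collections import defaultdict

# Parent of each member, following Figure 1.2 (store level omitted).
CITY_TO_PROVINCE = {
    "Paris": "Île-de-France",
    "Nice": "Provence-Alpes-Côte d'Azur",
    "Rome": "Lazio",
    "Milan": "Lombardy",
}
PROVINCE_TO_COUNTRY = {
    "Île-de-France": "France",
    "Provence-Alpes-Côte d'Azur": "France",
    "Lazio": "Italy",
    "Lombardy": "Italy",
}


def roll_up(values, parent_of):
    """Sum an additive measure from a child level up to its parent level."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)


sales_by_city = {"Paris": 21, "Nice": 12, "Rome": 18, "Milan": 14}  # made up
sales_by_province = roll_up(sales_by_city, CITY_TO_PROVINCE)
sales_by_country = roll_up(sales_by_province, PROVINCE_TO_COUNTRY)
sales_all = sum(sales_by_country.values())  # the single member "All"
```

Each step only needs the child-to-parent mapping of one pair of adjacent levels, which is how a hierarchy lets the same measure be shown at any level.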

    To guarantee correct aggregation, a few aggregation rules have been proposed. The main ones are:

    - Disjointness of instances: the sets of members under different parents must not overlap. In Figure 1.2, for example, a city must not belong to two different provinces.

    - Completeness: every member must appear in the hierarchy, and every member must have a parent member at the level above. In Figure 1.2, for example, every store is assigned to a city.

    - Correct use of aggregation functions: measures differ in nature, and the nature of a measure determines which aggregation functions may be applied to it.
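The first two rules can be checked mechanically. The sketch below is one possible check, assuming the hierarchy is given as a list of (child, parent) pairs; the function name and the sample data are hypothetical.

```python
def check_hierarchy(children, links):
    """Check disjointness and completeness between two adjacent levels.

    children: all members of the child level.
    links: (child, parent) pairs.
    Returns (members with more than one parent, members with no parent).
    """
    parents = {}
    for child, parent in links:
        parents.setdefault(child, set()).add(parent)
    not_disjoint = sorted(c for c, ps in parents.items() if len(ps) > 1)
    incomplete = sorted(c for c in children if c not in parents)
    return not_disjoint, incomplete
```

A hierarchy satisfies both rules for this pair of levels when both returned lists are empty.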

    With respect to aggregation, the main kinds of measures are:

    - Additive measures are the most common. They can be summed and still keep their meaning. For example, the Sales amount measure in Figure 1.1 is additive: adding the sales of the child members gives the sales of the parent member.

    - Semiadditive measures can be summed, but the sum loses its meaning along certain dimensions. For example, a Stock quantity measure tells how many goods remain in a warehouse, and it can be summed to find how many goods remain in a city or a province. Summed along the Time dimension, however, it becomes meaningless, because it is a point-in-time value.

    - Nonadditive (value-per-unit) measures cannot be added or subtracted. Examples are ratios and averages.

    When a measure is defined for the data warehouse, the aggregation functions allowed for each dimension must be stated explicitly. This is especially important for semiadditive and nonadditive measures. For example, the Stock quantity measure above is semiadditive: it cannot be added or subtracted along the Time dimension, but average, median, max, and min are still valid there.
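A minimal sketch of the semiadditive case, assuming a made-up month-end stock table: summing across stores is meaningful, while along time an average is used instead of a sum. All names and numbers are hypothetical.

```python
from statistics import mean

# stock[(city, month)] = units in the warehouse at month end (made up)
stock = {
    ("Paris", "Jan"): 100, ("Paris", "Feb"): 80,
    ("Nice", "Jan"): 40, ("Nice", "Feb"): 60,
}


def stock_across_cities(month):
    """Summing over the Store dimension is valid: total stock that month."""
    return sum(v for (city, m), v in stock.items() if m == month)


def average_stock_over_time(city):
    """Along Time, aggregate with an average rather than a sum."""
    return mean(v for (c, month), v in stock.items() if c == city)
```

Adding Paris's January and February stock would give 180, a number that describes no real situation, which is why the sum is not offered for the Time dimension here.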

    1.3.3 OLAP operations

    As noted above, the basic property of the multidimensional model is that it lets users look at data from different angles and at different levels of detail. OLAP provides a number of operations for this purpose:

    Figure 1.3: OLAP operations

    - Roll-up moves a measure from a detailed level to a more aggregated level for display. It is performed by going from a lower level to a higher level in a hierarchy, or by reducing the number of dimensions. Figure 1.3b shows a roll-up in which the Store dimension moves from the city level to the country level: the values of all cities of a country are summed into the result.

    - Drill-down is the opposite of roll-up: it goes from a more aggregated level to a more detailed one. In Figure 1.3c, for example, the Time dimension goes from the quarter level down to the months of each quarter.

    - Pivot (or rotate) turns rows into columns and columns into rows, giving users a different presentation of the data. This operation is shown in Figure 1.3d.

    - Slice extracts the data for one specific member of a dimension. In Figure 1.3e, for example, only the data for the city of Paris is shown.

    - Dice selects values on at least two dimensions. Figure 1.3f, for example, is a cube showing data for the city of Paris in quarters 1 and 2.

    Besides these five basic operations, OLAP tools on the market offer a range of supporting features such as arithmetic, statistical, and financial functions.
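The operations above can be sketched on a tiny cube held in a Python dictionary. This is only an illustration under assumed data: the cube contents, the `COUNTRY` mapping, and the function names are made up, and real OLAP tools implement these operations very differently.

```python
from collections import defaultdict

# cube[(city, quarter, product)] = sales amount (made-up numbers)
cube = {
    ("Paris", "Q1", "Books"): 10, ("Paris", "Q2", "Books"): 12,
    ("Paris", "Q1", "Toys"): 5, ("Nice", "Q1", "Books"): 4,
    ("Rome", "Q1", "Books"): 7, ("Rome", "Q2", "Toys"): 9,
}
COUNTRY = {"Paris": "France", "Nice": "France", "Rome": "Italy"}


def roll_up_to_country(cube):
    """Roll-up: move the Store dimension from city level to country level."""
    out = defaultdict(int)
    for (city, quarter, product), amount in cube.items():
        out[(COUNTRY[city], quarter, product)] += amount
    return dict(out)


def slice_city(cube, city):
    """Slice: fix one member of Store; the result has two dimensions."""
    return {(q, p): a for (c, q, p), a in cube.items() if c == city}


def dice(cube, cities, quarters):
    """Dice: keep cells whose Store and Time members are in the given sets."""
    return {k: a for k, a in cube.items()
            if k[0] in cities and k[1] in quarters}


def pivot(table):
    """Pivot: swap the two axes of a two-dimensional table."""
    return {(col, row): a for (row, col), a in table.items()}
```

Drill-down has no function here because it is not a computation on the aggregated result: it means going back to the more detailed data, such as `cube` itself after a roll-up.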

    1.3.4 Data warehouse design models

    Depending on how the data is stored, the multidimensional model is usually implemented in one of three ways:

    - Relational OLAP (ROLAP) stores data in a relational database and uses SQL statements to carry out the OLAP operations.

    - Multidimensional OLAP (MOLAP) stores data in files with a specific structure (for example, arrays) and carries out the OLAP operations on that structure. Although it can store and process less data than ROLAP, MOLAP usually performs better on queries and aggregations, because the data is laid out for OLAP queries whereas ROLAP has to go through the database.

    - Hybrid OLAP (HOLAP) combines the two technologies, using the storage capacity of ROLAP and the processing capability of MOLAP. For example, a HOLAP system keeps detailed data in a relational database and keeps the more aggregated data that users query in MOLAP storage.

    In a ROLAP system, multidimensional data is stored in relational tables organized in one of the following special structures: star schema, snowflake schema, starflake schema, and constellation schema.

    - A star schema consists of exactly one fact table and several dimension tables (one table per dimension). The entities in a star schema are not normalized as in an operational database: entities that form a hierarchy are merged into one. For example, the Item entity, which the operational database splits into two entities (Item and Item category), is stored as a single dimension table.

    Order line (fact table): Order ID, Sale date, Item ID (FK3), Salesperson ID, Customer ID (FK1), Unit price, Quantity, Line total, Employee ID (FK2)
    Customer: Customer ID (PK), Customer name, Address, Phone number
    Employee: Employee ID (PK), Employee name, Phone number
    Item: Item ID (PK), Item name, Unit price, Item category

    Figure 1.4: Star schema

    - A snowflake schema reduces the data redundancy of the star schema by normalizing the dimension tables. A dimension with a hierarchy is therefore represented by several tables, one per level. Figure 1.5 is a snowflake schema in which the Item dimension is represented by two tables, Item and Item category.

    Order line (fact table): Order ID, Sale date, Item ID (FK3), Salesperson ID, Customer ID (FK1), Unit price, Quantity, Line total, Employee ID (FK2)
    Customer: Customer ID (PK), Customer name, Address, Phone number
    Employee: Employee ID (PK), Employee name, Phone number
    Item: Item ID (PK), Item name, Unit price, Item category ID (FK1)
    Item category: Item category ID (PK), Item category name

    Figure 1.5: Snowflake schema

    - A starflake schema is a combination of the star schema and the snowflake schema, in which some dimensions are normalized and others are not.

    - A constellation schema is the most common schema in data warehouse design. In it, several fact tables share dimensions. In Figure 1.6 below, for example, the two fact tables Order line and Stock on hand share the Item dimension.

    Order line (fact table): Order ID, Sale date, Item ID (FK3), Salesperson ID, Customer ID (FK1), Unit price, Quantity, Line total, Employee ID (FK2)
    Stock on hand (fact table): Stock-check date, Item ID (FK1), Quantity, Line total
    Customer: Customer ID (PK), Customer name, Address, Phone number
    Employee: Employee ID (PK), Employee name, Phone number
    Item: Item ID (PK), Item name, Unit price, Item category

    Figure 1.6: Constellation schema
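To make the ROLAP idea concrete, the sketch below builds a reduced star schema in SQLite and answers a roll-up question with a single join between the fact table and one dimension table. The table names, column names, and rows are assumptions chosen for this example, not the author's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flat (denormalized) dimension: the category lives in the item row.
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY,
                       item_name TEXT, category_name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                           customer_name TEXT);
CREATE TABLE fact_order_line (
    order_id INTEGER, sale_date TEXT,
    item_key INTEGER REFERENCES dim_item(item_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity INTEGER, amount INTEGER);
-- Made-up sample data
INSERT INTO dim_item VALUES (1, 'Tea', 'Drinks'), (2, 'Coffee', 'Drinks'),
                            (3, 'Bread', 'Food');
INSERT INTO dim_customer VALUES (1, 'An'), (2, 'Binh');
INSERT INTO fact_order_line VALUES
    (100, '2013-01-15', 1, 1, 3, 15),
    (100, '2013-01-15', 3, 1, 1, 4),
    (101, '2013-01-16', 2, 2, 2, 16);
""")


def amount_by_category(conn):
    """Roll the amount measure up from item level to category level."""
    return conn.execute(
        "SELECT d.category_name, SUM(f.amount)"
        " FROM fact_order_line f"
        " JOIN dim_item d ON d.item_key = f.item_key"
        " GROUP BY d.category_name ORDER BY d.category_name").fetchall()
```

In a snowflake schema the same question would need one more join, from the item table to a separate category table, which is the trade-off between redundancy and query simplicity described above.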

    1.4 Ralph Kimball vs Bill Inmon

    //TODO: compare the approaches of Kimball and Inmon

    1.5 Data warehouse system architecture

    A data warehouse system consists of three main components:

    - A set of tools that extract data from the operational systems, transform it into the multidimensional format, and load it into the data warehouse (Extract-Transform-Load, ETL).

    - A database that serves as the data warehouse and stores the data.

    - A set of tools that exploit the data in the data warehouse, such as OLAP systems, static reporting systems, and data mining systems.

    Figure 1.7: Data warehouse system architecture

    1.5.1 ETL tier

    The ETL (Extract, Transform, Load) tier is the lowest tier and is hidden from end users. It consists of three steps:

    - The extract step gathers data from many different sources. These sources can be operational databases (MS SQL, MySQL, Oracle, DB2), files in various formats (CSV, fixed-length, Excel, XML), and data from inside or outside the company. A good ETL system must be compatible with these common data sources.

    - The transform step converts data from the source format to the data warehouse format (the multidimensional format described earlier). It includes the following sub-steps:

    o Cleaning removes erroneous records and converts data to a common standard format.

    o Integration reshapes data with the same meaning from different sources into a single structure.

    o Aggregation summarizes data according to the granularity of the data warehouse.

    - The load step writes the transformed data into the data warehouse. This step also propagates changes from the operational systems to the data warehouse so that reports stay up to date. Depending on company policy, updates may have to run in real time, hourly, daily, or even monthly.
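The three steps can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: the CSV text stands in for a hypothetical source export, a dictionary stands in for the data warehouse, and the function names are invented.

```python
import csv
import io

# Hypothetical source export: inconsistent city names, one bad amount.
RAW = """emp_id,city,amount
E1, paris ,10
E2,Rome,abc
E3,PARIS,5
"""


def extract(text):
    """Extract: read source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean the rows, then aggregate to city granularity."""
    clean = []
    for row in rows:
        try:
            amount = int(row["amount"])
        except ValueError:
            continue  # cleaning: drop malformed records
        # cleaning: bring city names to one common format
        clean.append((row["city"].strip().title(), amount))
    totals = {}
    for city, amount in clean:  # aggregation
        totals[city] = totals.get(city, 0) + amount
    return totals


def load(totals, warehouse):
    """Load: write the transformed data into the target store."""
    warehouse.update(totals)


warehouse = {}
load(transform(extract(RAW)), warehouse)
```

Keeping the three steps as separate functions mirrors the tier's structure: a new source format only changes `extract`, and a new cleaning rule only changes `transform`.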

    1.5.2 Data warehouse tier

    The data warehouse tier sits at the center of a data warehouse system. It stores data covering all business activities and departments of the company. A data warehouse usually comprises one or more data marts, where a data mart is a small data warehouse focused on one particular business area (for example, a sales data mart, a warehousing data mart, or a human resources data mart).

    Besides storing data, the data warehouse tier has another very important component: metadata. Metadata falls into two groups, technical metadata and business metadata. Business metadata describes the meaning of the data and the rules and constraints that apply to it. Technical metadata describes how data is organized, stored, and manipulated in the computer system.

    Within a data warehouse, technical metadata describes the data warehouse itself, the source data, and the ETL processes. Specifically:

    - Metadata describing the structure of the data warehouse and the data marts, at the logical level (fact tables, dimension tables, hierarchies, data lineage) and at the physical level (table structures, indexes, partitions). It also holds data security information (authentication, user authorization) and monitoring information (usage and performance statistics, error reports).

    - Metadata describing the source data, also at the logical level (connection methods and parameters, update frequency, meaning of the data) and the physical level (data structures).

    - Metadata describing the ETL processes, including data lineage (tracing data in the data warehouse back to its origin in the operational systems) and the rules for extracting, cleaning, and transforming data.

    1.5.3 Data exploitation tier

    The data exploitation tier contains the tools that end users use to work with the data in the data warehouse. The main tools are:

    - OLAP tools produce dynamic reports in which users apply the OLAP operations (described in section 1.3.3). These queries are called ad hoc queries because the system is not prepared in advance for what the user will do. OLAP reports are used when users want sliced, in-depth, or big-picture information before making a decision.

    - Reporting tools produce static reports with a fixed structure and format, based on predefined queries and sometimes including charts. Static reports are used when users want to see assessment and operational information.

    - Data mining tools let users analyze data to find valuable information that is still hidden, such as trends and common patterns.

    Part 2: Building a Data Warehouse

    Part 1 of this document described the data warehouse, its purpose, its significance, and its architecture. Part 2 introduces the concepts used in data warehouse design.

    2.1. Building dimension tables

    Dimension tables provide the information and context for the fact tables, and therefore for every figure shown in the data warehouse. Although they are many times smaller than the fact tables, dimension tables are the heart and brain of the data warehouse, because all access to its figures goes through them. Some say that the quality of a data warehouse design is the quality of the design of its dimension tables.

    2.1.1 Dimension table structure

    Figure 2.1: Basic structure of a dimension table

    Dimension tables basically all have the physical structure shown in Figure 2.1. The primary key of a dimension table is a field (usually numeric) that holds unique, meaningless values, called a surrogate key. The surrogate key is generated inside the data warehouse system by the ETL flows that process the data. Key values are created only within the data warehouse, and any change from outside is forbidden.

    In the past, generating surrogate key values was usually left to the database used as the data warehouse, specifically to database triggers. It has become increasingly clear that database triggers slow down the whole ETL process, so their use is now limited. Using the database to create primary keys is also no longer recommended, because the data warehouse then depends on another, external process, which easily leads to inconsistencies (for example, when the database used as the data warehouse differs between the build phase and the deployment phase). The safest approach is therefore to have the ETL flow itself generate the surrogate key values.

    Another component of a dimension table is the primary key the data has in the operational system, called the natural key. Natural key values are usually not meaningless. For example, the Employee dimension table has an EMP_ID field that holds the employee ID taken from the operational system. Although EMP_ID could serve as the primary key of the Employee dimension table, the design must still give the table a surrogate key (for the case where data is loaded from two different systems whose natural keys may collide, or where dimension values change, as discussed later).
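One way an ETL flow could assign surrogate keys itself is sketched below. The class, its method, and the source-system names "hr" and "crm" are hypothetical; the point is only that two systems may reuse the same natural key, while the surrogate key stays unique and carries no meaning.

```python
class SurrogateKeyGenerator:
    """Assigns meaningless integer keys to (source system, natural key) pairs."""

    def __init__(self, start=1):
        self._next = start
        self._keys = {}

    def key_for(self, source, natural_key):
        """Return the existing surrogate key, or allocate the next one."""
        ident = (source, natural_key)
        if ident not in self._keys:
            self._keys[ident] = self._next
            self._next += 1
        return self._keys[ident]


gen = SurrogateKeyGenerator()
hr_key = gen.key_for("hr", "E1")        # EMP_ID "E1" from one system
crm_key = gen.key_for("crm", "E1")      # same EMP_ID from another system
hr_key_again = gen.key_for("hr", "E1")  # repeated lookups are stable
```

A production ETL flow would persist this mapping between runs, for example in a lookup table, rather than keep it in memory.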

    Some argue that a dimension which already has a natural key can use it as the primary key, and that a meaningless numeric surrogate key is unnecessary: a meaningful value such as the change timestamp can be used instead, so that the attribute set {natural key, change timestamp} becomes the primary key of the dimension table. This approach may pay off in a few cases, but it runs into trouble in the following situations:

    - Contradicts the definition: by definition, a surrogate key has no meaning of its own. If meaning is deliberately attached to it, the ETL designer has to add processing to manage that meaning, which makes the ETL flow more complex and therefore slower.

    - Reduces performance: giving the surrogate key a meaning forces end-user queries to add extra conditions, which makes the statements more complex and costs more resources and time than a plain comparison of two numeric values.

    The last component of a dimension table, besides the primary key and the natural key, is a set of descriptive attributes. Descriptive attributes can have many different data types, and there can be a great many of them (especially for dimensions such as customer, employee, or product). In general there is no need to be afraid of a dimension table with more than 100 descriptive attributes; that is not a design mistake. Just make sure the data in those attributes is cleaned carefully.

    One small caveat: pay attention to descriptive attributes with a numeric data type because, depending on their meaning, they may actually be measurable quantities that belong in the fact table. Descriptive information is used to describe, not to be summed up. There is no need to worry too much, though: in 99% of cases it is immediately clear whether a value is a fact or a dimension attribute. The remaining 1% can go either way, because where an attribute is placed does not change its meaning, only the way it is processed.

    //TODO: example of a dimension attribute that could also be used as a fact

    2.1.2 Flat design and snowflake design

    Dimension tables are denormalized, flat tables. All hierarchical data and normalized structures of the operational system are redesigned into a flat form. The data in a dimension table is therefore redundant, and roughly corresponds to second normal form in operational database design.

    A dimension table can contain more than one hierarchy. For example, the Store dimension in Figure 1.1 can have one tree following the geographic hierarchy of administrative units and another tree following the company's internal organization. Both hierarchies can live in the same dimension table, the only constraint being that, whichever hierarchy is used, the values of the hierarchy attributes are always unique.

    If a dimension table is kept in normalized form, the hierarchy is represented as a snowflake schema (Figure 2.2a). In terms of data content the two representations are identical, but each has its own advantages and drawbacks for the people using it. The drawback of the flat, denormalized table is data redundancy, which easily causes inconsistencies between records (the same parent level carrying different parent information). The drawbacks of the normalized table are lower query performance (because several tables must be joined) and confusion for non-technical users (sales staff, data analysts, senior managers).


    a) Normalized dimension table: Product (PK product code; product name; unit price; FK1 product category code) and Product category (PK product category code; product category name)

    b) Denormalized dimension table: Product (PK product code; product name; unit price; product category)

    Figure 2.2: Representations of a dimension table

    Every time a new record is added to a dimension table, the system must assign it a surrogate key, which acts as the primary key of the dimension table (Figure 2.1). In a data warehouse environment an ETL process has to manage these surrogate key values for each dimension table. The simplest way is to look up the highest key currently in the table, add 1, and assign the result to the new record, although this approach has performance limitations.
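    The "highest key plus one" idea can be sketched as follows. This is only an illustration under stated assumptions: SQLite stands in for the warehouse database, and the table dim_employee, its columns, and the class SurrogateKeyGenerator are hypothetical names. The sketch reads MAX() once per ETL run and then counts in memory, which avoids one lookup per inserted row; it assumes a single ETL process writes to the table.

```python
import sqlite3


class SurrogateKeyGenerator:
    """Hands out surrogate keys from memory after one initial MAX() lookup."""

    def __init__(self, conn, table, key_column):
        # One query per ETL run instead of one MAX() query per inserted row.
        row = conn.execute(
            f"SELECT COALESCE(MAX({key_column}), 0) FROM {table}"
        ).fetchone()
        self._last = row[0]

    def next_key(self):
        self._last += 1
        return self._last


# Hypothetical dimension table with two existing rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_employee (emp_key INTEGER PRIMARY KEY, emp_id TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO dim_employee VALUES (?, ?, ?)",
    [(1, "E001", "An"), (2, "E002", "Binh")],
)

keys = SurrogateKeyGenerator(conn, "dim_employee", "emp_key")
new_rows = [(keys.next_key(), "E003", "Chi"), (keys.next_key(), "E004", "Dung")]
conn.executemany("INSERT INTO dim_employee VALUES (?, ?, ?)", new_rows)
assigned = [row[0] for row in new_rows]
```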

    2.1.3 The time dimension

    Almost every fact table has at least one dimension value that is a time. Every measurement is taken at some point in time and is repeated after a certain period.


    Figure 2.3: The time dimension

    The most widely used time dimension is the calendar date, where the smallest unit, one record of the dimension table, is a single day. Somewhat surprisingly, this dimension has quite a lot of attributes, as Figure 2.3 shows. Only a few of them can be generated directly by SQL (day of week, day, month, quarter, year). The remaining fields (working day, holiday, fiscal year) differ according to national and company policy (for example, Vietnam treats the Hung Kings' commemoration day as a public holiday; some companies work 5 days a week, others 6). The time dimension is both the most common and the most special dimension in a data warehouse project. It is normally created once at the start of the project and then left almost untouched throughout building, operating, and upgrading the project (unless company policy changes, say the working week drops from 6 days to 5, or the state adds a new public holiday). The best way to create the time dimension table is to spend an afternoon with a calendar, write the data into Excel, and load the Excel data into the data warehouse. Even nine years of calendar days, the span celebrated in the well-known verse about Dien Bien, come to fewer than 4,000 records (9 × 365 ≈ 3,285), which is not a lot.

    As written above, the primary key of a dimension table only identifies a record and has no value of its own. For the time dimension, however, many data warehouse projects do give the primary key a meaningful value, specifically a YYYYMMDD value (for example 20130523 for 23 May 2013). Whether or not to give this surrogate key a meaning is still much debated, and each side can point to cases where its choice is more or less convenient. There is no case in which one solution is stuck while the other works; one is merely more or less convenient than the other, so the details are not pursued here. The Viettel BI project team chose the YYYYMMDD format because it is easier for the operations staff to work with.
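    As a rough alternative to filling in a spreadsheet by hand, the calendar-derived attributes can be generated and only the policy-dependent ones supplied manually. The sketch below is an assumption-laden illustration: the function build_date_dimension, its column names, and the holiday list are hypothetical, and the real attribute set is whatever Figure 2.3 prescribes.

```python
from datetime import date, timedelta


def build_date_dimension(start, end, holidays=frozenset(), workdays_per_week=5):
    """Return one dict per calendar day, keyed by a YYYYMMDD integer."""
    rows = []
    day = start
    while day <= end:
        # weekday(): Monday = 0 ... Sunday = 6. With a 5-day week, Saturday and
        # Sunday are days off; with a 6-day week only Sunday is.
        is_day_off = day.weekday() >= workdays_per_week
        rows.append({
            "date_key": day.year * 10000 + day.month * 100 + day.day,
            "full_date": day.isoformat(),
            "day_of_week": day.isoweekday(),      # Monday = 1 ... Sunday = 7
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            # Policy-dependent attribute: needs the caller's holiday list.
            "is_working_day": not is_day_off and day not in holidays,
        })
        day += timedelta(days=1)
    return rows


# The holiday set is supplied by hand; 1 January is used here as an example.
dim_date = build_date_dimension(
    date(2013, 1, 1), date(2013, 12, 31), holidays={date(2013, 1, 1)}
)
by_key = {row["date_key"]: row for row in dim_date}
```

    Attributes such as fiscal year would be added the same way, from rules that only the business can provide.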

    In many cases the time in a fact table has to be tracked below the day level, at the hour or minute level. The reflex is to create a matching dimension table at hour or minute grain. Be careful: a year has 8,760 hours, 525,600 minutes, and 31,536,000 seconds, and a careless query against a table that large can bring the whole data warehouse down. In this situation we can create a 24-hour dimension, a 60-minute dimension, and a 60-second dimension and combine them. Alternatively, the time value (the database's datetime type) can be written directly into the fact table and treated as a special value.
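    The size difference between a single second-grain dimension and the combination of small dimensions can be checked with a little arithmetic. The helper time_keys and its key names below are hypothetical; the sketch only shows how one timestamp might be split into keys for a day dimension plus hour, minute, and second dimensions.

```python
from datetime import datetime


def time_keys(ts):
    """Split a timestamp into one key per small time dimension."""
    return {
        "date_key": ts.year * 10000 + ts.month * 100 + ts.day,
        "hour_key": ts.hour,        # 24 possible values
        "minute_key": ts.minute,    # 60 possible values
        "second_key": ts.second,    # 60 possible values
    }


event_keys = time_keys(datetime(2013, 5, 23, 14, 7, 30))

# One year at second grain in a single dimension table ...
single_dimension_rows = 365 * 24 * 60 * 60
# ... versus a day dimension plus three tiny dimensions.
split_dimension_rows = 365 + 24 + 60 + 60
```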

    2.1.4 Huge dimension tables

    When building a data warehouse we sometimes meet dimension tables with a huge number of records, comparable to or even larger than the fact table. At Viettel, for example, the Subscriber dimension holds hundreds of millions of records.

    On top of that, these huge dimensions usually gather attributes from many different sources, producing tables with a very large number of columns.

    Unfortunately for us, these huge dimension tables usually hold information about the outside world that is not under the company's control, so records are updated and added very frequently. This issue is covered in the later section "Updating dimension values".


    2.1.5 Tiny dimension tables

    A data warehouse also contains tiny dimension tables with only one or two columns and a handful of records. These dimensions usually have no source of their own; they are extracted from some attribute of the source data. For example, the Gender dimension is extracted from the Subscriber information table and has three values: Male, Female, and Not provided. In telecommunications, the Call type dimension is extracted from the voice call detail records, which contain a code distinguishing regular calls, video calls, and forwarded calls.

    Every dimension of this kind, however tiny, still needs to be handled with great care. They usually have no dedicated ETL flow to load them; the loading is embedded in the flow that processes the source data they are extracted from.

    2.1.6 Nested dimensions

    A nested dimension, that is, a dimension inside another dimension, is a common technique when building a data warehouse. For example, the Subscriber dimension usually carries an activation date (itself a dimension) and an activation district (another dimension). When this comes up, remember that it happens all the time and is not a flaw in the data warehouse design.

    2.1.7 One dimension table, or split it in two?

    A common view in data warehouse design is that dimension tables are independent of one another. This view is not entirely accurate, although it holds in 90% of cases. Take the relationship between salespeople and stores. Some companies require each salesperson to belong to one particular store and to take goods only from that store. The Employee to Store relationship is then one-to-many, a tree structure, and must sit in a single dimension. In other companies the relationship is looser: a salesperson, although assigned to one store, may take goods from several sources. In that case it is neither necessary nor advisable to create a single dimension; it should be split into two separate dimension entities.

    In general, dimension tables should be designed to be completely independent of one another (or nested, as described in section 2.1.6), and dimensions with a many-to-many relationship should be separated out through a fact table.

    2.1.8 Updating dimension values

    When the data warehouse detects that a value in a dimension record has changed, it must be programmed to respond to that change. There are three main responses, numbered Type 1, Type 2, and Type 3.

    Type 1 (Type 1 Slowly Changing Dimension) simply overwrites the changed data in the dimension table. The designer chooses Type 1 when the change in the source data is a correction of an error, or when the updated part is unimportant and does not alter the meaning of the fact table. Type 1 always applies an UPDATE to the changed data in the data warehouse, so keep in mind that when dimension data is updated, the aggregated fact tables that use the changed dimension fields change with it.

    Type 2 (Type 2 Slowly Changing Dimension) makes it possible to track the changes that occur in a dimension table and to link each fact record to the dimension record that was in effect at the time of the fact. The idea is very simple: when the data warehouse detects that the source data has been updated, instead of overwriting, the system updates the status of the old record and adds a new record to the dimension table. The new record receives a brand-new surrogate key (unrelated to that of the old record), and from then on the data warehouse uses the new record to link to newly created fact records. Fact records created earlier remain linked to the old dimension record.

    Type 2 shows most clearly how data changes over time, because every change to the entity in the source data, however small, is recorded in the data warehouse. Figure 2.4 is an example of a Type 2 change. Over a period of time the employee Jane Doe holds different positions; each change of position is recorded by the data warehouse as a new record, and each record receives a new surrogate key.


    Figure 2.4: Example of a Type 2 dimension change

    For a small dimension table with around ten fields and a few thousand records, changes can be detected by comparing every field of every record. When the number of fields and records is too large, this field-by-field comparison takes too long and is no longer feasible. It can be replaced by comparing a checksum (for example a cyclic redundancy check, CRC) of the data warehouse record with that of the operational record: if the checksums differ, something has changed. The input to the checksum can be the concatenation of all the fields to be checked, after converting every data type to a string. The common checksum and hash functions in use today (CRC32, MD5, SHA1, SHA3, SHA1-Base32, SHA256, SHA384, SHA512) can all be used for this purpose; the longer hashes make an undetected change through a collision far less likely than a 32-bit CRC does.
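    A minimal sketch of the checksum comparison, assuming SHA-256 from Python's hashlib and hypothetical column names. One detail not spelled out in the text is added here as an assumption: the fields are joined with a separator character, because plain concatenation would give ("ab", "c") and ("a", "bc") the same checksum.

```python
import hashlib


def row_checksum(row, columns):
    """Hash the tracked columns of one record, cast to text."""
    # "\x00" marks NULLs; "\x1f" separates fields so that neighbouring
    # values cannot run into each other.
    parts = ["\x00" if row[c] is None else str(row[c]) for c in columns]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


TRACKED = ["name", "position", "department"]   # hypothetical attributes

warehouse_row = {"emp_id": "E001", "name": "Jane Doe",
                 "position": "Analyst", "department": "BI"}
source_row = {"emp_id": "E001", "name": "Jane Doe",
              "position": "Manager", "department": "BI"}

changed = row_checksum(warehouse_row, TRACKED) != row_checksum(source_row, TRACKED)
```

    In practice the checksum of the warehouse record would be stored alongside it, so that only the source side has to be hashed on each run.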

    When generating fact data, however, the data warehouse has to find the record with the latest surrogate key. This is not always easy, and it often costs resources and system performance. Instead, the dimension table can carry an is_active field for convenient querying. In that case the data warehouse does not simply add a new record to the dimension table; beforehand it must also set is_active on the old record to the inactive state. Besides is_active, it is sometimes necessary to record additional information about the update event. Commonly stored items include:

    - Effective start date

    - Effective end date (filled in on the dimension record only once it has expired)

    - Reason for the update
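    The Type 2 procedure with an is_active flag and validity dates can be sketched as below. SQLite is used only so the example runs on its own; the table dim_employee, its columns, and the function apply_type2_change are hypothetical, and key generation is simplified to MAX() + 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_employee (
        emp_key       INTEGER PRIMARY KEY,  -- surrogate key
        emp_id        TEXT,                 -- natural key
        position      TEXT,
        valid_from    TEXT,
        valid_to      TEXT,                 -- NULL while the version is current
        is_active     INTEGER,
        change_reason TEXT
    )""")
conn.execute(
    "INSERT INTO dim_employee VALUES "
    "(1, 'E001', 'Analyst', '2010-01-01', NULL, 1, 'initial load')"
)


def apply_type2_change(conn, emp_id, new_position, change_date, reason):
    # Step 1: retire the version that is currently active.
    conn.execute(
        "UPDATE dim_employee SET is_active = 0, valid_to = ? "
        "WHERE emp_id = ? AND is_active = 1",
        (change_date, emp_id),
    )
    # Step 2: add the new version under a brand-new surrogate key.
    next_key = conn.execute(
        "SELECT COALESCE(MAX(emp_key), 0) + 1 FROM dim_employee"
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO dim_employee VALUES (?, ?, ?, ?, NULL, 1, ?)",
        (next_key, emp_id, new_position, change_date, reason),
    )
    return next_key


new_key = apply_type2_change(conn, "E001", "Manager", "2013-05-23", "promotion")
history = conn.execute(
    "SELECT emp_key, position, is_active FROM dim_employee ORDER BY emp_key"
).fetchall()
```

    New fact records would then look up the key with is_active = 1, while older fact records keep pointing at key 1.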

    Type 3 (Type 3 Slowly Changing Dimension) is used when a dimension value changes but users of the data warehouse should be able to choose between the new value and the old one. In general Type 3 is rarely used (and it can be replaced by a fact table).

    Which type to apply to a dimension entity depends on the business requirements for that dimension. In many cases, however, a data warehouse design combines two types within one dimension, most commonly Type 1 and Type 2. The attributes in the dimension are then given priorities. If the changed attribute has low priority, Type 1 is used and the value is overwritten in all historical records. If the changed attribute has higher priority, Type 2 is used: the old record is deactivated and a new record is created.

    2.1.9 Late-arriving dimension records and correcting wrong dimension data

    Late-arriving data is a real headache in building and operating a data warehouse: tiresome to fix, yet frequent. Data usually arrives late because the data warehouse and the operational system are out of sync, or because a fault in the operational system prevents it from pushing data to the data warehouse in time. When late data is discovered, the data warehouse operators should ask the operators of the operational system to resolve the problem so that it does not happen again. The causes of the synchronization failure are not discussed here; the focus is on the remedy.

    For example, in telecommunications, under a new policy a batch of mobile subscribers is moved to a new tariff package, but the change does not reach the data warehouse in time and is recorded only several days later. Meanwhile many aggregated fact tables keep using the old package information, and the reports are wrong. The following steps should then be carried out:

    - Step 1: Back up the records that need to change

    - Step 2: Apply the changes to the backed-up data

    - Step 3: Delete the old records from the data warehouse

    - Step 4: Write the updated data into the data warehouse

    Updating records in place in a data warehouse fact table is not recommended because the number of records is too large (and the data warehouse is not optimized for updates).
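    The four steps can be sketched as follows, under the assumption of a hypothetical fact table fact_usage in which subscriber 101 was loaded with the old package key. SQLite is used purely to keep the example runnable; a production warehouse would use its own staging and bulk-load facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_usage "
    "(date_key INTEGER, subscriber_key INTEGER, package_key INTEGER, revenue REAL)"
)
conn.executemany("INSERT INTO fact_usage VALUES (?, ?, ?, ?)", [
    (20130520, 101, 1, 5.0),   # loaded with the old package (key 1)
    (20130521, 101, 1, 7.0),
    (20130521, 202, 1, 3.0),   # subscriber not affected by the change
])

affected = "subscriber_key = 101 AND date_key >= 20130520"
with conn:  # commit all four steps together, or none of them
    # Step 1: back up the records that need to change.
    conn.execute(
        f"CREATE TABLE fact_usage_fix AS SELECT * FROM fact_usage WHERE {affected}"
    )
    # Step 2: apply the correction to the backup copy (new package key 2).
    conn.execute("UPDATE fact_usage_fix SET package_key = 2")
    # Step 3: delete the old records from the fact table.
    conn.execute(f"DELETE FROM fact_usage WHERE {affected}")
    # Step 4: write the corrected records back.
    conn.execute("INSERT INTO fact_usage SELECT * FROM fact_usage_fix")

result = conn.execute(
    "SELECT subscriber_key, package_key, revenue FROM fact_usage "
    "ORDER BY subscriber_key, date_key"
).fetchall()
```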

    The next example is the Business Center launching a new tariff package while the information about the package has not yet been loaded into the data warehouse. This is the most typical example


    2.1.10 Summary

    This section covered the concepts and techniques for building dimension tables in a data warehouse. Although a dimension table has far fewer records than a fact table, it is extremely important, because it gives meaning to the figures in the fact table.

    All the techniques in this section are used in real designs. In particular, the three types of dimension change have become basic knowledge that every data warehouse designer must have.

    The next section discusses fact tables and the techniques used to build them.

    2.2 Building fact tables

    Fact tables store the measures and indicators of the company's business operations. The relationship between a measurement and the fact table is simple: a measurement is a fact record. A measurement is defined as an observed quantity expressed in a consistent unit of measure.

    The multidimensional data model is built around these measurements. Fact tables hold the measurements, and dimension tables hold their context. This seemingly simple relationship gives end users a very intuitive, complete, and easy-to-use view of the data in the data warehouse.

    Section 2.1 covered the techniques for designing dimension tables; this section deals with designing fact tables. The figures (facts) are what users actually want, the soul of the data warehouse, but what is a soul without a body around it? The material on dimensions in the previous section makes this section much easier to follow.

    2.2.1 Structure of a fact table

    Every fact table is characterized by its grain, and the grain is in turn defined by a measurement event in the real world. The designer must always state the grain of a fact table explicitly, that is, how the measures in the table are taken in the real world. Figure 2.5 shows a fact table at the finest grain, in which each record represents one order line (invoice line).

    A fact table normally has no primary key of its own; it holds a set of foreign keys pointing to dimension tables (which provide the context for the information in the fact table). Most fact tables also have fields storing figures, which are the business measures. Figure 2.5 also contains a few special dimensions called degenerate dimensions: dimensions that exist in the fact table but do not need a dimension table of their own. Degenerate dimension is abbreviated DD.

    In practice every fact table has at least 3 dimensions, usually more. The finer the grain of the figures, the more dimension tables the data warehouse needs. Sadly, demand for data warehouse data is becoming ever harder to predict, the grain ever finer, and the number of dimensions keeps growing.

    Figure 2.5: Sales transaction fact table at the finest grain

    Unlike a dimension, a fact table does not use a surrogate key; it uses a combination of its existing key fields as its primary key. In Figure 2.5 the primary key is the set {Ticket Number, Line Number}. These two fields identify a unique record across the whole payment process of the operational system, and therefore also a unique record in the data warehouse. A sales transaction fact table at a higher level of aggregation would use other fields as its primary key.

    In the example above the primary key could also be the set {Cash Register, Time of day, Line Number}. In that case, if care is not taken, a different event with exactly the same values can occur, and there is then no way to tell whether a record is a duplicate or a separate event. To be stricter, fact records are given an additional sequence field while they are loaded into the data warehouse.

    2.2.2 Referential integrity

    Referential integrity in the multidimensional model means that every record in a fact table must be assigned valid dimension values; in other words, every fact record must be able to find its corresponding dimension records.

    There are two main kinds of integrity violation:

    - Writing data into the fact table for which no corresponding dimension value exists

    - Deleting a dimension record whose value is already in use

    Tables in a data warehouse usually do not carry key constraints as strict as those in operational systems, so either of the two violations (or both) happens very easily, and each occurrence seriously damages the data. A referential integrity violation in a data warehouse is not a minor error; it is a dangerous one. Queries on fact data that violates integrity return incomplete or wrong results, which affects the accuracy of decision making.

    Figure 2.6: Points at which referential integrity can be checked


    Figure 2.6 shows the three sensible points at which to check referential integrity in the ETL flows that write fact data into the data warehouse:

    - Check the fact data carefully before writing it into the data warehouse, and before deleting data from a dimension table

    - Put integrity constraints in the data warehouse itself so that invalid data cannot be written and valid data cannot be deleted

    - Detect and repair integrity errors after the data has been written, by periodically scanning the fact table for bad foreign key values.

    In practice the best point to check integrity is before writing into the data warehouse. All it takes is an extra step that finds dimension values not yet present in the dimension table and inserts them; their details are filled in later (this is described below). If this step is done carefully, the fact table will satisfy the integrity constraint. Similarly, before deleting records from a dimension table, first join the records to be deleted with the fact table, and delete only once it is certain that the join returns no rows.

    The second point is usually not used, because it slows down writes into the data warehouse considerably. Some data warehouse engines write very quickly even with integrity constraints in place (such as IBM's Red Brick), but these are special cases; most data warehouses still slow down when integrity constraints are enabled.

    Checking integrity after the data has been written is also rarely done in practice, because the volume to check is too large. If it were done, the statement would look like this:

    -- fact rows whose product_key has no matching row in the product dimension
    select f.product_key

    from fact_table f

    where f.product_key not in (select p.product_key from product_dimension p)

    In general this statement cannot be run on a real system. The check, if performed at all, is limited to a certain time window (today, this week, this month), and integrity violations can still exist in the data before that window.
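    A check restricted to a time window might look like the sketch below. It assumes a date_key column on the fact table (not shown in the statement above) and uses NOT EXISTS, which behaves like the NOT IN form as long as the dimension key is never NULL. SQLite is used only to make the example runnable, and the function orphan_product_keys is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_dimension (product_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fact_table (date_key INTEGER, product_key INTEGER, amount REAL)")
conn.executemany("INSERT INTO product_dimension VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.executemany("INSERT INTO fact_table VALUES (?, ?, ?)", [
    (20130401, 9, 1.0),   # orphan, but outside the window checked below
    (20130523, 1, 2.0),   # valid row
    (20130523, 7, 3.0),   # orphan inside the window
])


def orphan_product_keys(conn, first_day, last_day):
    """Product keys used by facts in the window but missing from the dimension."""
    sql = """
        SELECT DISTINCT f.product_key
        FROM fact_table f
        WHERE f.date_key BETWEEN ? AND ?
          AND NOT EXISTS (SELECT 1 FROM product_dimension p
                          WHERE p.product_key = f.product_key)
        ORDER BY f.product_key"""
    return [row[0] for row in conn.execute(sql, (first_day, last_day))]


orphans_this_month = orphan_product_keys(conn, 20130501, 20130531)
```

    The row from April illustrates the limitation described above: it stays undetected unless the window is widened.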

    All in all, the most balanced approach, one that protects both integrity and performance, is to check fact data before writing it and never to delete data from dimension tables. Dimension data should be deleted only to correct an error; otherwise it should be left alone. A record (or several records) that is unused today may well be used later, and a dimension table is in any case not large enough to hurt system performance seriously.


    Occasionally a brand-new dimension value is discovered in the fact data (late-arriving dimension data). In that case we write a new record into the dimension table (thereby creating a new surrogate key) whose dimension attributes are simply set to Unknown.
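    A minimal sketch of this lookup-or-create step, assuming a hypothetical dim_package table keyed by a package code; SQLite and the MAX() + 1 key assignment are simplifications to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_package "
    "(package_key INTEGER PRIMARY KEY, package_code TEXT UNIQUE, package_name TEXT)"
)
conn.execute("INSERT INTO dim_package VALUES (1, 'ECO', 'Economy')")


def lookup_or_create(conn, package_code):
    """Return the surrogate key for a natural key, adding a placeholder if missing."""
    row = conn.execute(
        "SELECT package_key FROM dim_package WHERE package_code = ?",
        (package_code,),
    ).fetchone()
    if row:
        return row[0]
    # Unknown natural key: create a placeholder row so the fact row can be
    # loaded now; the descriptive attributes are filled in when they arrive.
    next_key = conn.execute(
        "SELECT COALESCE(MAX(package_key), 0) + 1 FROM dim_package"
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO dim_package VALUES (?, ?, 'Unknown')", (next_key, package_code)
    )
    return next_key


known_key = lookup_or_create(conn, "ECO")
late_key = lookup_or_create(conn, "TOM")     # seen in facts before the dimension
same_key = lookup_or_create(conn, "TOM")     # second lookup reuses the placeholder
```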

    2.2.3 Types of fact table

    Fact tables hold all the measures of the company's business operations. Numerous as they are, fact tables fall into just 3 main types:

    A transaction grain fact table usually describes an event in the real world that is recorded in the data warehouse. The measures and dimensions in this kind of table therefore do not describe a process; they only record the values at the moment the event occurred.

    The transaction grain fact table is the largest (in number of records) and most detailed of the 3 main types. Its data usually holds the most complete and detailed information about the event, including the exact time it occurred. Because the table is so large, end users rarely query it directly; it usually serves as the input from which fact tables at higher levels of aggregation are built.

    A periodic snapshot fact table represents a fixed time span and repeats every period. The most common forms are daily, monthly, and yearly summaries.

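    There are 2 ways to build the data for a periodic snapshot fact table. One is simply to wait until the end of the month and compute everything in one go, quick and tidy. The other is to maintain and update a month-to-date record, accumulating the daily results until the first day of the following month, at which point the old month is closed and a new record is started.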
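    The two ways of building a monthly snapshot can be compared with a small sketch. The data, the function month_end_totals, and the class RunningMonthlySnapshot are hypothetical illustrations; both approaches should end up with the same monthly totals.

```python
from collections import defaultdict

daily_sales = [(20130501, 10.0), (20130502, 5.0), (20130601, 2.5)]  # (date_key, amount)


def month_end_totals(rows):
    """Approach 1: wait until the period is over, then aggregate in one pass."""
    totals = defaultdict(float)
    for date_key, amount in rows:
        totals[date_key // 100] += amount      # 20130523 -> 201305
    return dict(totals)


class RunningMonthlySnapshot:
    """Approach 2: keep a month-to-date record and close it when a new month starts."""

    def __init__(self):
        self.closed = {}     # finished months
        self.month = None    # month currently being accumulated
        self.total = 0.0

    def add_day(self, date_key, amount):
        month = date_key // 100
        if month != self.month:
            self.close_month()                 # close the old month first
            self.month, self.total = month, 0.0
        self.total += amount

    def close_month(self):
        if self.month is not None:
            self.closed[self.month] = self.total


snapshot = RunningMonthlySnapshot()
for date_key, amount in daily_sales:
    snapshot.add_day(date_key, amount)
snapshot.close_month()   # close the last open month for the comparison
```

    The running approach makes month-to-date figures available every day, at the cost of one update per day instead of one computation per month.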

    An accumulating snapshot fact table stores figures over a time span that is not fixed in advance, such as the sales of a product from its launch up to the present (and it continues to be updated in the future).

    2.2.4 Loading data into fact tables

    Fact tables, especially transaction grain fact tables, hold the largest number of records in the data warehouse. Loading data into them is therefore rarely simple and can go badly wrong if not handled carefully. This section covers the issues ETL operators commonly face during that loading.


    Handling indexes

    Indexes are great for queries but extremely harmful when writing data into the database. Tables with many indexes slow down writes so much that the whole process can appear to have stopped. The remedy is to drop all indexes before the load into the data warehouse starts, and to rebuild them from scratch once the load has finished.
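    The drop, load, rebuild sequence could be sketched like this. SQLite's sqlite_master catalog is used here to remember the index definitions; other databases expose this through their own catalog views, and the table fact_sales, the index name, and the function bulk_load are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_fact_sales_date ON fact_sales (date_key)")


def bulk_load(conn, table, rows):
    # Remember the CREATE INDEX statements of the table's explicit indexes.
    index_defs = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
        (table,),
    ).fetchall()
    # 1. Drop the indexes so that the load does not have to maintain them.
    for name, _ in index_defs:
        conn.execute(f"DROP INDEX {name}")
    # 2. Load the data.
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    # 3. Rebuild each index once, over the complete data.
    for _, create_sql in index_defs:
        conn.execute(create_sql)
    conn.commit()


bulk_load(conn, "fact_sales", [(20130523, i % 10, 1.0) for i in range(1000)])
index_names = [
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```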

    Handling partitions

    Partitioning lets a table (and its indexes) be split physically into smaller tables. This split allows a query to go straight to the partition holding the data it needs instead of searching the whole table. Handled correctly, partitioning improves query performance on large fact tables considerably. Partitioning is usually transparent to users and is operated jointly by the DBA and ETL teams.

    The most common partitioning technique for fact tables is to partition by a time field. The advantage of a time field is that it is fixed and known in advance, so we always know which key values are coming next. A common mistake is for the designer to add a time field to the fact record (possibly just the time the record was written into the warehouse) and partition on that field. If that time field never appears in end-user queries, the partitioning is pointless. The lesson: partition only on a time field that users actually care about and use.

    Fact tables partitioned by time are usually partitioned by year, quarter, or month, sometimes by week or day. Normally the data warehouse design team works with the DBA team to determine the best partitioning scheme for each fact table. It is also common that the DBA team does not touch the fact tables directly and that new partitions have to be added by an ETL flow. In most cases the DBA team does not intervene in the operation of data warehouse tables, so partition management falls to the ETL operations team, whose job is to run a partition maintenance flow for each fact table. With partitioned tables the operations team will occasionally hit the following error:

    ORA-14400: inserted partition key is beyond highest legal partition key

    When that happens, the ETL flow that adds partitions has certainly failed, and the thing to do is to fix that flow or contact the DBA team immediately.


    To avoid this error, the ETL flow can check proactively before writing data into the data warehouse by comparing the largest time value in the data about to be written with the highest partition bound of the fact table. Compare

    select max(date_key) from FACT_TABLE

    with

    select high_value from all_tab_partitions

    where table_name = 'FACT_TABLE'

    and partition_position = (select max(partition_position)

    from all_tab_partitions where table_name= 'FACT_TABLE')

    If the value from script 1 is not lower than the value from script 2 (the partition bound is an exclusive upper limit), a partition can be added before writing the data with this script:

    ALTER TABLE FACT_TABLE

    ADD PARTITION year_2013 VALUES LESS THAN (20140101)

    The whole process of monitoring and extending partitions described above can be written as a stored procedure in the data warehouse and called by the ETL flow before every load.
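    The decision logic behind such a procedure can be sketched outside the database. The function partitions_to_add below is hypothetical; it assumes yearly partitions named year_YYYY, integer YYYYMMDD keys, and bounds that always fall on 1 January, and it only builds the ALTER TABLE statements rather than executing them.

```python
def partitions_to_add(table, max_date_key, highest_bound):
    """Build ADD PARTITION statements until max_date_key fits.

    highest_bound is the exclusive upper bound (VALUES LESS THAN) of the
    last existing partition, as a YYYYMMDD integer on 1 January.
    """
    statements = []
    bound = highest_bound
    while max_date_key >= bound:              # key would not fit in any partition
        year = bound // 10000                 # year the new partition will hold
        bound = (year + 1) * 10000 + 101      # 1 January of the following year
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION year_{year} "
            f"VALUES LESS THAN ({bound})"
        )
    return statements


# Last partition ends before 2013-01-01; incoming data reaches 2013-05-23.
needed = partitions_to_add("FACT_TABLE", 20130523, 20130101)
# Data already covered by the existing partitions needs nothing.
nothing = partitions_to_add("FACT_TABLE", 20121231, 20130101)
```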

    Disabling the rollback log

    By default all relational databases support recovery when a transaction fails. The system returns the records to their state before the transaction by keeping a log of every data-changing operation. When an error occurs, the database reads this log back and undoes every update that has not been committed. Committing a transaction means telling the database that the transaction succeeded and that its effects must be made permanent in the database.

    The rollback log, sometimes also referred to as the redo log, is therefore very valuable in operational systems. In a data warehouse, however, where every transaction is managed by an ETL flow, the redo log is a nuisance that hampers data loading and should be disabled to improve processing performance. To sum up, the reasons a data warehouse does not need the redo log are:

    - Data is written into the data warehouse by a closely supervised process, the ETL flow

    - Data is written into the data warehouse in bulk

    - If something does go wrong, the operator can easily fix the problem and rerun the process


    A small note: database vendors manage logs in different ways and sometimes offer logging features that other vendors lack. When building ETL flows, pay attention to these differences in order to get the best performance.

    Writing data

    Writing a large volume of data into a data warehouse is more than just issuing insert statements. The main techniques include:

    - Separate inserts from updates: some ETL tools (and databases) offer an "update or insert" feature. At first it sounds very convenient and makes the ETL flow look much simpler, but in practice it is far too slow (because for every incoming record the table has to be checked to see whether the record already exists). Do not use this feature; instead create 2 separate flows: update first, then insert the new data.

    - Use the database's bulk loader: each database usually ships a tool for loading data into it that applies techniques specific to that database, so it writes much faster than insert statements.

    - Split the load into parallel flows: when a large number of records has to be written into the data warehouse, split the data into several parts and hand them to several ETL flows running in parallel.

    - Limit modifications: running UPDATE in a data warehouse is usually very slow and hard to monitor, so when data really has to be modified other techniques are normally used. A common and simple one is to delete the old records that need updating and rewrite those same records with the corrected information.

    - Build aggregates outside the database: sorts, joins, and aggregations should be performed outside the data warehouse for better efficiency. The reason is that the resources allotted to the data warehouse are usually limited and hard to upgrade in hardware terms (whether vertically, by adding RAM or CPU, or horizontally, by adding servers), whereas the ETL engine can easily be deployed across several servers to share the load.
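    The first technique, splitting an incoming batch into an update stream and an insert stream, can be sketched as follows. The table dim_customer and the batch contents are hypothetical, and SQLite is used only so the example runs; the point is the single set-based lookup of existing keys instead of an existence check per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id TEXT PRIMARY KEY, city TEXT)")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)", [("C1", "Hanoi"), ("C2", "Hue")]
)

incoming = [("C2", "Da Nang"), ("C3", "Can Tho"), ("C4", "Vinh")]  # (id, city)

# One lookup of all existing keys for the whole batch.
existing = {row[0] for row in conn.execute("SELECT customer_id FROM dim_customer")}
to_update = [(city, cid) for cid, city in incoming if cid in existing]
to_insert = [(cid, city) for cid, city in incoming if cid not in existing]

# Stream 1: updates first.
conn.executemany("UPDATE dim_customer SET city = ? WHERE customer_id = ?", to_update)
# Stream 2: then plain inserts, which need no existence check.
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", to_insert)
conn.commit()

final = conn.execute(
    "SELECT customer_id, city FROM dim_customer ORDER BY customer_id"
).fetchall()
```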

    2.2.5 Factless fact tables

    Fact tables hold events with measurable figures. In some cases, however, the events have no measurable figures at all. Examples are a fact table of customers checking in at a hotel or, in telecommunications, a fact table of customers switching tariff package from Economy to Tomato. These fact tables record only the time of the event and its attributes, with no figures whatsoever.


    These are called factless fact tables. The operations mainly used on them are counting (count) and counting distinct values (count distinct).
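    A small sketch of a factless fact table and the two typical operations on it. The table fact_package_change and its rows are hypothetical; SQLite is used only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_package_change "
    "(date_key INTEGER, subscriber_key INTEGER, from_package TEXT, to_package TEXT)"
)
conn.executemany("INSERT INTO fact_package_change VALUES (?, ?, ?, ?)", [
    (20130520, 101, "Economy", "Tomato"),
    (20130521, 102, "Economy", "Tomato"),
    (20130522, 101, "Tomato", "Economy"),   # subscriber 101 switches back
])

# No measure column exists: the analysis counts events and distinct subscribers.
events, subscribers = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT subscriber_key) FROM fact_package_change"
).fetchone()
```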

    2.2.6 Aggregate tables

    Fact tables have a very large number of records, which slows down end-user queries. This is a perennial problem on which DBA experts, ETL experts, and BI users all agree. The most impressive way to speed up queries, and the one usually recommended by data warehouse experts, is to use aggregate tables.

    An aggregate table is a fact table whose data is summarized at a higher level of the dimension hierarchy than the base fact table. At query time, depending on the hierarchy level requested, either the base fact table or the appropriate aggregate table is used (in fact the aggregate tables should be transparent to users, and choosing which table to use should be the job of the query tool). Without spending anything on hardware or software upgrades, query performance improves markedly just by using aggregate tables.

    Figure 2.7: Example of a multidimensional model with dimensions at the day, product, and store level

    Aggregate tables are nothing new; the topic has been discussed at length on data warehouse forums and is covered in many books. A few key points:

    - In a data warehouse design, each fact table has a set of aggregate tables representing the most common groupings. According to this view, switching between aggregate tables is a technique of multidimensional databases: it is supported only there, and relational databases have no equivalent mechanism of their own.

    - The aggregate selection module sits between the module that handles user queries and the data warehouse.

    - The aggregate selection module processes queries from end users and, where possible, rewrites the SQL statement to read from an aggregate table instead of the base fact table.

    - The aggregate selection module knows when to use an aggregate table and, if so, which one, because this information must be configured in advance in the metadata describing the data warehouse.

    Figure 2.8: Structure of the aggregate selection module

    A good aggregate table capability in a data warehouse should meet the following criteria:

    - It significantly improves performance for the majority of user queries (a solution that is optimal for everyone does not exist and, even if it did, would not be practical)

    - It does not increase the stored data volume too much. How much is "too much" is hard to say and depends on the situation, but in general the total size of the aggregate tables should not exceed the size of the base table.

    - It is completely transparent to end users. Sales staff and senior managers neither need nor want to know the details of technical matters that do not concern them.

    - It does not disrupt the operation of the ETL system. Although every build of an aggregate table runs a series of sorting and grouping operations, building the aggregate tables should be automated as part of the ETL process.

    - It has little impact on the DBA's work. Deciding which aggregate tables to create should be automated by monitoring end users' query habits.


    Well designed, aggregate tables satisfy all 5 criteria above; badly designed, they meet none of them. Below are some design guidelines for meeting these criteria:

    Design guideline 1

    Aggregate tables must have their own storage, separate from the base fact table. Each level of aggregation must have its own fact table.

    Splitting the aggregates into separate fact tables is very important and has several benefits.
    First, the structure of both the fact table and the aggregate tables is simpler, so a SQL query
    only needs a different table name and no additional complex conditions. Second, because no extra
    conditions are needed, end users do not risk getting wrong results by accidentally adding up
    figures from different aggregation levels. Third, storing aggregates in a separate table speeds
    up queries at the database level because it reduces the number of records that must be scanned.
    Finally, a separate fact table makes the aggregates easier to manage: if one aggregate table is
    faulty, it can be repaired within that table without putting other data at risk.
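
    The first benefit can be illustrated with a small in-memory SQLite database. This is a sketch
    with hypothetical table and column names (sales_fact, sales_fact_by_group, product_dim), not
    the schema of the figures in this document. The same query runs against either table with only
    the table name changed, and both return the same totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, product_name TEXT, group_key INTEGER);
CREATE TABLE sales_fact (
    day_key INTEGER, store_key INTEGER, product_key INTEGER, amount REAL);
INSERT INTO product_dim VALUES (1, 'Tea', 10), (2, 'Coffee', 10), (3, 'Soap', 20);
INSERT INTO sales_fact VALUES
    (20240101, 1, 1, 5.0), (20240101, 1, 2, 7.0),
    (20240101, 1, 3, 2.0), (20240102, 1, 1, 4.0);
-- The aggregate lives in its own table: one table per aggregation level.
CREATE TABLE sales_fact_by_group AS
SELECT f.day_key, f.store_key, p.group_key, SUM(f.amount) AS amount
FROM sales_fact f JOIN product_dim p ON p.product_key = f.product_key
GROUP BY f.day_key, f.store_key, p.group_key;
""")


def total_by_day(table):
    # Only the table name differs between the two queries.
    sql = f"SELECT day_key, SUM(amount) FROM {table} GROUP BY day_key ORDER BY day_key"
    return conn.execute(sql).fetchall()


base_totals = total_by_day("sales_fact")
agg_totals = total_by_day("sales_fact_by_group")
agg_rows = conn.execute("SELECT COUNT(*) FROM sales_fact_by_group").fetchone()[0]
```

    In this toy data the aggregate table holds three rows against four in the base table; on real
    data the reduction is what makes the redirected query faster.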

    Design guideline 2

    The dimension tables attached to an aggregate table must be rolled up to the same level as the
    data in that aggregate table. In other words, if the aggregate table is a shrunken version of
    the fact table that drops some levels of the hierarchy, then the dimension tables for the
    aggregate table must drop the corresponding hierarchy levels as well.

    Figure 2.9: Aggregate table model after rolling up to the Product Group level


    Figure 2.9 shows the aggregate table model for the example in Figure 2.7. This model covers the
    case where end users want to query sales data by day, by store, and by product group (and are
    no longer interested in each individual product, as in Figure 2.7).

    Not every dimension appears at the higher aggregation levels. Some dimensions and measures in
    the fact table are meaningful only at the grain of the base fact table, and they distort the
    meaning of a report when combined with a higher level of another dimension.

    These shrunken dimension tables must follow the same rules as any other dimension table:
    surrogate keys, natural keys, attributes, and so on. Consequently, the dimension tables for the
    aggregate tables need to be built before the dimension tables for the base fact table, since the
    surrogate key of a shrunken dimension table must also be stored in the more detailed dimension
    table.
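
    The build order described above might look like the following sketch, again using an in-memory
    SQLite database. The table names, column names, and sample rows are hypothetical. The shrunken
    dimension (product_group_dim) is loaded first and receives its own surrogate key, which the
    detailed dimension (product_dim) then stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shrunken dimension: built first, with its own surrogate and natural keys.
CREATE TABLE product_group_dim (
    group_key  INTEGER PRIMARY KEY,  -- surrogate key
    group_code TEXT UNIQUE,          -- natural key
    group_name TEXT
);
-- Detailed dimension: stores the surrogate key of the shrunken dimension.
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_code TEXT UNIQUE,
    product_name TEXT,
    group_key    INTEGER REFERENCES product_group_dim(group_key)
);
""")

# Hypothetical source rows: (product code, product name, group code, group name).
source_rows = [
    ("P1", "Tea", "G1", "Drinks"),
    ("P2", "Coffee", "G1", "Drinks"),
    ("P3", "Soap", "G2", "Household"),
]

for product_code, product_name, group_code, group_name in source_rows:
    # Step 1: make sure the group exists in the shrunken dimension.
    conn.execute(
        "INSERT OR IGNORE INTO product_group_dim (group_code, group_name) VALUES (?, ?)",
        (group_code, group_name),
    )
    (group_key,) = conn.execute(
        "SELECT group_key FROM product_group_dim WHERE group_code = ?", (group_code,)
    ).fetchone()
    # Step 2: load the detailed row, carrying the group's surrogate key.
    conn.execute(
        "INSERT INTO product_dim (product_code, product_name, group_key) VALUES (?, ?, ?)",
        (product_code, product_name, group_key),
    )

group_count = conn.execute("SELECT COUNT(*) FROM product_group_dim").fetchone()[0]
orphan_count = conn.execute(
    "SELECT COUNT(*) FROM product_dim p "
    "LEFT JOIN product_group_dim g ON g.group_key = p.group_key "
    "WHERE g.group_key IS NULL"
).fetchone()[0]
```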

    Design guideline 3

    A fact table and all of its aggregate tables must be linked together into a family of fact
    tables, so that the selection module knows which table to use to run a query. Each member of
    the family consists of one fact table (either the base table or an aggregate table) surrounded
    by its corresponding dimension tables. Exactly one schema in the family contains the base fact
    table; all the others are aggregate schemas. Figure 2.7 is the base schema, and Figure 2.9 is
    one of the many possible aggregate schemas derived from it.

    The relationships between the base fact table and its aggregate tables (and between their
    corresponding dimension tables) must be configured so that the query system can make its
    choice. This information is usually configured in the OLAP toolset of the data warehouse
    system.

    Design guideline 4

    In general, end users should not be aware that aggregate tables exist. Queries that end users
    send directly to the data warehouse, without going through the analysis tools, should stop at
    the base fact table and go no further to the aggregate tables.

    2.2.7 Summary

    In this section we established that fact tables are where all of the enterprise's business and
    production figures are stored. Every fact table has two main parts: the measures, and the
    foreign keys that link to dimension tables and give the measures their context.


    We showed that entity integrity is extremely important in the multidimensional data model, and
    proposed three approaches for checking integrity constraints.

    We then moved on to common techniques for improving fact table performance, including
    techniques for optimizing the data loading process and the technique of building aggregate
    tables to speed up queries against the fact table.


    Part 3: Building the ETL flow

    3.1 About ETL

    3.2 Collecting data

    3.3 Cleaning and standardizing data


    Part 4: Practical application in telecommunications: building a data warehouse system for
    prepaid mobile

    4.1 Gathering requirements

    4.2 Designing the data warehouse

    4.3 Building the ETL flow

    4.4 Building the OLAP cube