[email protected] - [email protected]
Data Warehouse
and How to Design a Data Warehouse
Lu c Thng
BI Team, Research and Development Department, Viettel Telecom Software Center
Contents
Part 1: What Is a Data Warehouse
1.1 Introduction to databases
1.2 Data Warehouse
1.3 The multidimensional data model
1.3.1 Hierarchies
1.3.2 Aggregation
1.3.3 OLAP operations
1.3.4 Data warehouse design models
1.4 Data warehouse system architecture
1.4.1 ETL tier
1.4.2 Data warehouse tier
1.4.3 Data access tier
Part 2: Building the Data Warehouse
2.1 Building dimension tables
2.1.1 Dimension table structure
2.1.2 Flat design and snowflake design
2.1.3 Time dimension
2.1.4 Giant dimension tables
2.1.5 Tiny dimension tables
2.1.6 Nested dimensions
2.1.7 One dimension table, or split it in two?
2.1.8 Updating dimension values
2.1.9 Late-arriving dimension records and correcting wrong dimension data
2.1.10 Summary
2.2 Building fact tables
2.2.1 Fact table structure
2.2.2 Entity integrity
2.2.3 Types of fact tables
2.2.4 Writing data to fact tables
2.2.5 Factless fact tables
2.2.6 Aggregate tables
2.2.7 Summary
Part 3: Building ETL flows
3.1 About ETL
3.2 Extracting data
3.3 Cleaning and standardizing data
Part 4: A practical application in telecom: building a data warehouse for prepaid mobile
4.1 Gathering requirements
4.2 Designing the data warehouse
4.3 Building the ETL flows
4.4 Building the OLAP cube
Part 1: What Is a Data Warehouse
1.1 Introduction to databases
Databases play a central role in every information system: they are the core subsystem of the operational systems that process transactional data. A database consists of a collection of structured data together with a description of that data (metadata), and is designed to meet the information storage and processing needs of an organization or business. A database is usually deployed with a database management system (DBMS), that is, software that lets users define, create, control, and administer a database.
In an operational system, a task that needs to access and process data is called a transaction. To guarantee that transactions are processed correctly, the database must satisfy four properties, known together as ACID: Atomicity, Consistency, Isolation, and Durability.
- Atomicity requires that if a transaction consists of several data operations, then either all of them succeed or none of them does. A system is atomic when it satisfies this condition under any failure, including hardware and software faults. From the user's point of view, a successful transaction appears as a single operation, while a failed transaction has no effect on the database at all.
- Consistency requires that the data in the database be valid before and after every transaction. Every write must satisfy the predefined rules, including but not limited to constraints, cascades, and triggers. This property does not guarantee that the data is correct from a business point of view (that remains the programmer's job; if the database did everything, what would be left for the coders?). It only helps limit errors, if any, during software development.
- Isolation requires that executing several transactions concurrently produce the same result as executing them one after another.
- Durability requires that once a transaction succeeds it is recorded permanently, even in the event of a hardware or software failure.
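As a minimal sketch of atomicity and consistency, the following uses Python's built-in sqlite3 module. The account table, the transfer helper, and the amounts are hypothetical, introduced only for this illustration. A CHECK constraint plays the role of a predefined rule, and a transfer that would break it is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK constraint is the "predefined rule".
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from account src to account dst as one transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK rule, so the credit was undone as well.
        return False


ok = transfer(conn, 1, 2, 30)       # both updates are applied
failed = transfer(conn, 1, 2, 500)  # the debit fails, nothing is applied
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

The credit is deliberately executed before the debit, so the failed transfer shows that an operation that already succeeded is undone together with the one that failed.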
In a relational database system, data is usually stored as entities. An entity represents a set of real-world objects that the operational system needs to manage and store. From an application point of view, entities of the same type have exactly the same set of attributes.
// TODO: distinguish strong entities, weak entities, and relationships between entities
To keep data consistent and avoid update anomalies, a database design is usually brought to third normal form (3NF), which requires every entity in the database to satisfy the following conditions:
- The value of an attribute must be atomic, that is, not a list of values or a composite value (first normal form, 1NF)
- Non-key attributes depend on the whole primary key, not on any proper subset of it (second normal form, 2NF)
- Non-key attributes are independent of each other: the value of a non-key attribute cannot be derived from other non-key attributes (third normal form, 3NF)
For example, a Sales Invoice object contains the following main information:
Sales Invoice
  PK  Order ID
      Sale date
      Salesperson ID
      Salesperson name
      Customer name
      Customer phone number
      Order lines
        - Item ID
        - Item name
        - Unit price
        - Quantity
        - Line total
The Order lines attribute of this entity violates the atomic-value rule of 1NF. To reach 1NF, we add an Order Line entity to the design as follows:
Sales Invoice
  PK  Order ID
      Sale date
      Salesperson ID
      Salesperson name
      Customer name
      Customer phone number
Order Line
  PK,FK1  Order ID
  PK      Item ID
          Item name
          Unit price
          Quantity
          Line total
This design violates 2NF because the Order Line entity has the attribute Item name, which depends only on Item ID (a subset of the primary key). The fix is to add an Item entity that holds this information.
Sales Invoice
  PK  Order ID
      Sale date
      Salesperson ID
      Salesperson name
      Customer name
      Customer phone number
Order Line
  PK,FK1  Order ID
  PK,FK2  Item ID
          Unit price
          Quantity
          Line total
Item
  PK  Item ID
      Item name
      Unit price
Note: the Unit price attribute also depends on Item ID, but because of a business requirement (the price on an invoice is immutable, while the actual price of an item changes over time) it must also be stored in the Order Line entity.
This design violates 3NF:
- The Sales Invoice table has the field Salesperson name, which depends not on the primary key but on the attribute Salesperson ID => fix by creating a Salesperson table
- The Sales Invoice table has the field Customer phone number, which depends not on the primary key but on the field Customer name => fix by adding a Customer table
- The Order Line table has the field Line total, which can be derived from the two fields Unit price and Quantity => fix by dropping the Line total field
Sales Invoice
  PK   Order ID
       Sale date
  FK2  Salesperson ID
  FK1  Customer ID
Order Line
  PK,FK1  Order ID
  PK,FK2  Item ID
          Unit price
          Quantity
Customer
  PK  Customer ID
      Customer name
      Customer phone number
Salesperson
  PK  Salesperson ID
      Salesperson name
Item
  PK   Item ID
       Item name
       Unit price
  FK1  Item category ID
Item Category
  PK  Item category ID
      Item category name
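The final 3NF design can be sketched with Python's built-in sqlite3 module as follows. The English table and column names and the sample rows are assumptions made for this example. Because Line total is no longer stored, it is computed at query time from Unit price and Quantity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer      (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
CREATE TABLE salesperson   (salesperson_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item (
    item_id INTEGER PRIMARY KEY, name TEXT, unit_price INTEGER,
    category_id INTEGER REFERENCES item_category(category_id));
CREATE TABLE invoice (
    order_id INTEGER PRIMARY KEY, sale_date TEXT,
    salesperson_id INTEGER REFERENCES salesperson(salesperson_id),
    customer_id INTEGER REFERENCES customer(customer_id));
CREATE TABLE order_line (
    order_id INTEGER REFERENCES invoice(order_id),
    item_id INTEGER REFERENCES item(item_id),
    unit_price INTEGER,  -- price at the time of sale, kept on purpose
    quantity INTEGER,
    PRIMARY KEY (order_id, item_id));

-- Made-up sample data
INSERT INTO customer VALUES (1, 'Customer A', '0900000000');
INSERT INTO salesperson VALUES (1, 'Salesperson B');
INSERT INTO item_category VALUES (1, 'Stationery');
INSERT INTO item VALUES (10, 'Pen', 2, 1), (11, 'Notebook', 5, 1);
INSERT INTO invoice VALUES (1, '2013-05-23', 1, 1);
INSERT INTO order_line VALUES (1, 10, 2, 3), (1, 11, 5, 2);
""")

# Line total is derived, not stored: unit price * quantity.
lines = conn.execute("""
    SELECT it.name, ol.unit_price * ol.quantity AS line_total
    FROM order_line AS ol
    JOIN item AS it ON it.item_id = ol.item_id
    WHERE ol.order_id = 1
    ORDER BY it.name
""").fetchall()
print(lines)
```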
This section gave a brief overview of databases and of how a database is designed in an operational system. The next section discusses the data warehouse, its components, and its typical design.
1.2 Data Warehouse
As the economy gets tougher and competitors multiply, data analysis becomes ever more important for businesses, supporting decision making and strengthening competitive advantage. Ordinary databases, however, do not meet the requirements of data analysis. They serve day-to-day operations well, and their greatest strengths are data integrity, transaction processing, and concurrent access. Such databases are called operational databases or online transaction processing (OLTP) systems. Operational databases typically store only detailed, current data and no history. Their data is highly normalized, so they often perform poorly on complex queries (joining many tables) or on large data volumes. In addition, querying data from many different sources is practically impossible with operational databases alone.
Where there is demand there is supply: as early as the 1970s, many companies were selling database systems that supported analysis and reporting, such as Teradata and MAPPER. The term "data warehouse", however, was first used only in 1988, in an IBM technical paper titled "An architecture for a business and information system" (http://altaplana.com/ibmsj2701G.pdf). This section is devoted to the data warehouse concept.
According to Wikipedia (http://en.wikipedia.org/wiki/Data_warehouse), a data warehouse is a database dedicated to reporting and data analysis. It supports complex queries and is the central point where data from many different sources is gathered to give the most complete analytical picture. Accordingly, a data warehouse is a collection of data that is subject-oriented, integrated, nonvolatile, and time-varying. These properties are as follows:
- Subject-oriented means that the data warehouse focuses on the analytical needs of managers at different levels of the decision-making process. These needs are usually very specific and revolve around the company's line of business: distributors care about sales performance, telecom operators care about service traffic, and so on. A business usually cares about several subjects, though; a distributor must also keep track of warehousing and the supply chain.
- Integrated means resolving the difficulties of combining data from many different sources: differences in field names (the same data under different names), in meaning (the same name for different data), and in format (the same name and meaning but different data types).
- Nonvolatile means that data must remain stable over time (by keeping modification and deletion of data to a minimum), which makes the data volume considerably larger than in an operational system (5-10 years of data, compared with the 2 to 6 months typical of an ordinary database).
- Time-varying refers to the ability to retrieve the different values of the same piece of information together with the time at which each change occurred. For example, a customer's address, email, or phone number may change, but the change must not affect the results of reports and analyses produced before it happened.
Property             Operational database                           Data warehouse
Users                Operations staff                               Managers, data analysts
Usage pattern        Predictable, repetitive                        Ad hoc, not known in advance
Data                 Current, detailed                              Historical, summarized
Data organization    By operational requirements                    By subject of analysis
Data structure       Optimized for small transactions               Optimized for complex queries on large data volumes
Access frequency     High                                           Medium to low
Access type          Read, insert, update, delete                   Read, insert
Records per session  Few                                            Very many
Access time          Short                                          Fairly long (minutes or even hours)
Concurrency level    High; concurrent work on one record is common  Low
Locking              Required                                       Not required
Update frequency     Frequent                                       None
Data redundancy      Low (normalized tables)                        High (tables are usually denormalized)
Data model           Entity-relationship (ER) model                 Multidimensional model
Deployment model     Whole system                                   Incremental, one data mart at a time
Table 1.1: Comparison of operational databases and data warehouses
A data warehouse lets managers and decision makers analyze data interactively through an online analytical processing (OLAP) system. It is also used for reporting, data mining, and statistical analysis. Database and data warehouse therefore differ only as concepts: a database used exclusively for these purposes is also considered a data warehouse.
So if a database is like a personal bookshelf, where the owner constantly looks things up, updates, revises, writes notes in the margins, and adds or moves books, then a data warehouse is like a national library, where classic works keep arriving to be stored and consulted, and nobody edits them or moves them elsewhere.
1.3 The multidimensional data model
Data warehouses and OLAP systems are built on the multidimensional model. This model performs well on complex queries and lets users look at data from many different angles. It presents data in an n-dimensional space called a data cube or hypercube.
Figure 1.1: A three-dimensional cube showing sales data with the dimensions Store, Time, and Product and the measure Amount
A data cube is defined by its dimensions and measures. Dimensions are the perspectives used to analyze the data. For example, the data cube in Figure 1.1 analyzes sales figures and has three dimensions: Store, Time, and Product. The values of a dimension are called members. For example, Paris, Nice, Rome, and Milan are members of the Store dimension. Dimensions usually carry attributes that describe them further. For example, the Product dimension may have attributes such as Product ID, Product name, Description, and Size, although these attributes are not shown in the figure.
Along with the dimensions, the cells of a cube hold numeric values called measures. The multidimensional model expects that arithmetic operations (addition, subtraction, multiplication, division) can be applied to these measures without distorting the meaning of the figures. The cube in Figure 1.1, for example, has one measure, Amount. A cube usually has several measures: although it is not displayed, the cube in Figure 1.1 could also have a Quantity measure (the number of products sold).
1.3.1 Hierarchies
The level of detail at which measures are presented to the user is called the data granularity. It is determined by combining the granularity of each dimension. In Figure 1.1, for example, the granularity is the city level for the Store dimension, the quarter level for the Time dimension, and the category level for the Product dimension.
To extract knowledge from the data, users need to look at the cube at several levels of detail. Continuing with Figure 1.1, a user may want the sales measures at a finer level such as store, or at a coarser level such as country. OLAP hierarchies make this possible by defining a tree of levels of detail for a dimension. For two adjacent levels of a hierarchy, the lower one is called the child level and the higher one the parent level. Figure 1.2 below shows the levels of the Store dimension, in which each store is assigned to a city, each city to a province, and each province to a country. The top of the hierarchy is the All level, which represents the whole hierarchy. It has a single member, also called All, used to obtain the measure aggregated to the highest level (in this example, the total sales of all countries).
[Figure 1.2: tree diagram. Levels from bottom to top: Store, City, Province, Country, All. Branches shown: stores -> Paris -> Île-de-France -> France; stores -> Nice -> Provence-Alpes-Côte d'Azur -> France; stores -> Roma -> Lazio -> Italy; stores -> Milan -> Lombardy -> Italy; France and Italy -> All.]
Figure 1.2: Members of the Store hierarchy
//TODO: consider writing about unbalanced hierarchies (hierarchies in which some branches lack levels that other branches have).
1.3.2 Aggregation
Aggregation takes place when the user changes the level of detail of the data retrieved from the cube by traversing the hierarchy of a dimension. In Figure 1.1, for example, if the Store dimension uses the province level instead of the city level, the sales of all cities in the same province are aggregated by summation. Likewise, data at the All level is aggregated from the values of all countries.
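A minimal sketch of this aggregation in plain Python is given below. The city names follow Figure 1.2, while the sales amounts and the roll_up helper are made up for the illustration.

```python
from collections import defaultdict

# Hypothetical sales amounts at the city level.
sales_by_city = {"Paris": 120, "Nice": 45, "Roma": 80, "Milan": 95}

# Child -> parent mappings for two adjacent level pairs of the Store hierarchy.
city_to_province = {
    "Paris": "Île-de-France",
    "Nice": "Provence-Alpes-Côte d'Azur",
    "Roma": "Lazio",
    "Milan": "Lombardy",
}
province_to_country = {
    "Île-de-France": "France",
    "Provence-Alpes-Côte d'Azur": "France",
    "Lazio": "Italy",
    "Lombardy": "Italy",
}


def roll_up(values, child_to_parent):
    """Sum the values of all child members that share the same parent."""
    totals = defaultdict(int)
    for child, amount in values.items():
        totals[child_to_parent[child]] += amount
    return dict(totals)


by_province = roll_up(sales_by_city, city_to_province)
by_country = roll_up(by_province, province_to_country)
all_level = sum(by_country.values())  # the single member of the All level
print(by_country, all_level)
```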
To guarantee correct aggregation, a few aggregation rules have been proposed. The main ones are:
- Disjointness of instances: the sets of members belonging to different parent members must not intersect. In Figure 1.2, for example, a city cannot belong to two different provinces.
- Completeness: every member must appear in the hierarchy, and every member must have a parent member in the level above. In Figure 1.2, for example, every store is assigned to a city.
- Correct use of aggregation functions: measures differ in nature, and it is their nature that determines which aggregation functions may be applied to each of them.
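The first two rules can be checked mechanically. The sketch below is one possible way to do it, not a standard routine: check_hierarchy is a hypothetical helper, and the assignment data is deliberately broken (Roma appears under two provinces and Milan has no parent) so that both violations show up.

```python
def check_hierarchy(children, assignments):
    """Return (members with several parents, members with no parent).

    children:    set of all members of the child level
    assignments: list of (child, parent) pairs
    """
    parents = {}
    for child, parent in assignments:
        parents.setdefault(child, set()).add(parent)
    not_disjoint = sorted(c for c, ps in parents.items() if len(ps) > 1)
    incomplete = sorted(c for c in children if c not in parents)
    return not_disjoint, incomplete


cities = {"Paris", "Nice", "Roma", "Milan"}
# Deliberately inconsistent sample data.
assignments = [
    ("Paris", "Île-de-France"),
    ("Nice", "Provence-Alpes-Côte d'Azur"),
    ("Roma", "Lazio"),
    ("Roma", "Lombardy"),
]

not_disjoint, incomplete = check_hierarchy(cities, assignments)
print(not_disjoint, incomplete)
```

Running a check like this before aggregating helps avoid counting a city twice or silently dropping it from the totals.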
With respect to aggregation, measures fall into the following main categories:
- Additive measures are the most common. They can be summed without losing their meaning. For example, the Amount measure in Figure 1.1 is additive: adding up the amounts of the child members gives the amount of the parent member.
- Semiadditive measures are additive in general, but their meaning is lost when they are summed along certain dimensions. For example, a Stock quantity measure tells how many goods remain in a warehouse and can be summed to find how much stock a city or a province holds. Summed along the Time dimension, however, it becomes meaningless, because it is a point-in-time value.
- Nonadditive (value-per-unit) measures cannot be meaningfully added or subtracted. Typical examples are ratios and averages.
When defining a measure, it is necessary to state which aggregation functions may be used along each dimension of the data warehouse. This is especially important for semiadditive and nonadditive measures. For example, the Stock quantity measure above is semiadditive: although it cannot be summed along the Time dimension, it can still be averaged, or its median, maximum, or minimum can be taken.
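The difference can be illustrated with a small sketch for a semiadditive Stock quantity measure. The month labels, cities, and quantities are made up. Summing across cities within one month is meaningful, whereas along Time only functions such as average, minimum, and maximum are applied.

```python
from statistics import mean

# stock[(month, city)] = units in stock at the end of the month (made-up data)
stock = {
    ("2013-01", "Paris"): 100,
    ("2013-01", "Nice"): 40,
    ("2013-02", "Paris"): 80,
    ("2013-02", "Nice"): 30,
}


def stock_by_month(stock):
    """Sum across the Store dimension: valid for a semiadditive measure."""
    totals = {}
    for (month, _city), units in stock.items():
        totals[month] = totals.get(month, 0) + units
    return totals


monthly = stock_by_month(stock)

# Along Time, do not sum: 140 + 110 = 250 units never existed at any moment.
average_stock = mean(monthly.values())
lowest_stock = min(monthly.values())
highest_stock = max(monthly.values())
print(monthly, average_stock, lowest_stock, highest_stock)
```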
1.3.3 OLAP operations
As noted above, the fundamental feature of the multidimensional model is that it lets users look at data from different perspectives and at different levels of detail. OLAP provides a number of operations for this purpose:
Figure 1.3: OLAP operations
- Roll-up turns measures from a detailed level into a more aggregated level for display. It is performed by moving from a lower level to a higher level of a hierarchy or by reducing the number of dimensions. Figure 1.3b is an example of roll-up in which the Store dimension moves from the city level to the country level: the values of the cities of each country are added up to give the result.
- Drill-down is the opposite of roll-up: it moves from a more aggregated level to a more detailed one. In Figure 1.3c, for example, the Time dimension goes from the quarter level down to the months of each quarter.
- Pivot (or rotate) turns rows into columns and columns into rows, giving the user a different presentation of the data. This operation is shown in Figure 1.3d.
- Slice cuts out the data of one specific member of a dimension. In Figure 1.3e, for example, only the data for the city of Paris is shown.
- Dice selects a subcube by restricting the values of two or more dimensions. For example, Figure 1.3f shows the cube for the city of Paris in quarters 1 and 2.
Besides these five basic operations, the OLAP tools on the market also provide a range of supporting features such as arithmetic, statistical, and financial functions.
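The operations above can be sketched on a tiny cube held in a Python dictionary. This is only an illustration under simplifying assumptions: the cube contents are made up, and the function names (roll_up, slice_city, dice, pivot) are chosen for this example rather than taken from any OLAP product.

```python
# cube[(city, quarter, category)] = sales amount (made-up data)
cube = {
    ("Paris", "Q1", "Books"): 10,
    ("Paris", "Q2", "Books"): 12,
    ("Paris", "Q1", "Food"): 7,
    ("Nice", "Q1", "Books"): 4,
    ("Roma", "Q1", "Books"): 6,
    ("Roma", "Q2", "Food"): 9,
    ("Milan", "Q2", "Books"): 5,
}
city_to_country = {"Paris": "France", "Nice": "France",
                   "Roma": "Italy", "Milan": "Italy"}


def roll_up(cube, mapping):
    """Replace each city by its parent member and sum the amounts."""
    out = {}
    for (city, quarter, category), amount in cube.items():
        key = (mapping[city], quarter, category)
        out[key] = out.get(key, 0) + amount
    return out


def slice_city(cube, city):
    """Keep one member of the Store dimension; the dimension disappears."""
    return {(q, c): a for (s, q, c), a in cube.items() if s == city}


def dice(cube, cities, quarters):
    """Restrict two dimensions at once and keep a smaller cube."""
    return {k: a for k, a in cube.items()
            if k[0] in cities and k[1] in quarters}


def pivot(table):
    """Swap the two axes of a 2-D table keyed by (row, column)."""
    return {(col, row): a for (row, col), a in table.items()}


by_country = roll_up(cube, city_to_country)
paris = slice_city(cube, "Paris")
paris_pivoted = pivot(paris)
sub_cube = dice(cube, {"Paris", "Roma"}, {"Q1"})
print(by_country[("France", "Q1", "Books")], paris_pivoted, sub_cube)
```

Drill-down needs no function of its own here: it amounts to going back from by_country to the finer-grained cube.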
1.3.4 Data warehouse design models
Depending on how the data is stored, the multidimensional model is usually implemented in one of three ways:
- Relational OLAP (ROLAP) stores data in a relational database and uses SQL statements to implement the OLAP operations.
- Multidimensional OLAP (MOLAP) stores data in files with a specialized structure (for example arrays) and performs the OLAP operations on that structure. Although it is more limited than ROLAP in the volume of data it can store and process, MOLAP usually performs better on queries and aggregations (because the data is laid out specifically for OLAP queries, whereas ROLAP has to go through the database).
- Hybrid OLAP (HOLAP) combines the two technologies above, taking advantage of the storage capacity of ROLAP and the processing power of MOLAP. For example, a HOLAP system may keep detailed data in a relational database while the aggregated data that users query is kept in MOLAP storage.
In a ROLAP system, multidimensional data is stored in relational tables organized in special structures: the star schema, the snowflake schema, the starflake schema, and the constellation schema, described below:
- A star schema consists of a single fact table and several dimension tables (one table per dimension). The entities of a star schema are not normalized as they are in an operational database (entities with a hierarchical structure are merged into one). For example, the Item entity below would be split into two entities, Item and Item Category, in an operational database.
Order Line
       Order ID
       Sale date
  FK3  Item ID
       Salesperson ID
  FK1  Customer ID
       Unit price
       Quantity
       Line total
  FK2  Employee ID
Customer
  PK  Customer ID
      Customer name
      Address
      Phone number
Employee
  PK  Employee ID
      Employee name
      Phone number
Item
  PK  Item ID
      Item name
      Unit price
      Item category
Figure 1.4: Star schema
- A snowflake schema reduces the data redundancy of the star schema by normalizing the dimension tables. A dimension with a hierarchy is therefore represented by several tables, one per level. Figure 1.5 shows a snowflake schema in which the Item dimension is represented by two tables, Item and Item Category.
Order Line
       Order ID
       Sale date
  FK3  Item ID
       Salesperson ID
  FK1  Customer ID
       Unit price
       Quantity
       Line total
  FK2  Employee ID
Customer
  PK  Customer ID
      Customer name
      Address
      Phone number
Employee
  PK  Employee ID
      Employee name
      Phone number
Item
  PK   Item ID
       Item name
       Unit price
  FK1  Item category ID
Item Category
  PK  Item category ID
      Item category name
Figure 1.5: Snowflake schema
- A starflake schema is a combination of the star schema and the snowflake schema, in which some dimensions are normalized and others are not.
- A constellation schema is the most common schema in data warehouse design. It is a schema in which several fact tables share dimensions. In Figure 1.6 below, for example, the two fact tables Order Line and Stock share the Item dimension.
Order Line
       Order ID
       Sale date
  FK3  Item ID
       Salesperson ID
  FK1  Customer ID
       Unit price
       Quantity
       Line total
  FK2  Employee ID
Customer
  PK  Customer ID
      Customer name
      Address
      Phone number
Employee
  PK  Employee ID
      Employee name
      Phone number
Item
  PK  Item ID
      Item name
      Unit price
      Item category
Stock
       Stock count date
  FK1  Item ID
       Quantity
       Amount
Figure 1.6: Constellation schema
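To show how such a schema is queried in a ROLAP setting, here is a reduced sketch using Python's built-in sqlite3 module. The table and column names (dim_item, fact_order_line, fact_stock) and all rows are assumptions made for the example. Two fact tables share the denormalized Item dimension, as in a constellation schema, and each roll-up to the category level is a join plus GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension: the category is stored inside the item row.
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, category TEXT);

-- Two fact tables sharing the same dimension.
CREATE TABLE fact_order_line (
    item_key INTEGER REFERENCES dim_item(item_key),
    quantity INTEGER, amount INTEGER);
CREATE TABLE fact_stock (
    item_key INTEGER REFERENCES dim_item(item_key),
    count_date TEXT, quantity INTEGER);

-- Made-up sample data
INSERT INTO dim_item VALUES
    (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Coffee', 'Food');
INSERT INTO fact_order_line VALUES (1, 3, 6), (2, 2, 10), (3, 1, 4), (1, 5, 10);
INSERT INTO fact_stock VALUES (1, '2013-05-23', 40), (3, '2013-05-23', 12);
""")

sales_by_category = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_order_line AS f
    JOIN dim_item AS d ON d.item_key = f.item_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

stock_by_category = conn.execute("""
    SELECT d.category, SUM(s.quantity)
    FROM fact_stock AS s
    JOIN dim_item AS d ON d.item_key = s.item_key
    WHERE s.count_date = '2013-05-23'
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(sales_by_category, stock_by_category)
```

With a snowflake design the same queries would need one more join, from the item table to a separate category table.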
1.4 Ralph Kimball vs Bill Inmon
//TODO: compare the approaches of Kimball and Inmon
1.5 Data warehouse system architecture
A data warehouse system consists of three main components:
- A set of tools that extract data from the operational systems, transform it into the multidimensional format, and load it into the data warehouse (Extract-Transform-Load, ETL).
- A database that serves as the data warehouse and stores the data
- A range of tools that exploit the data in the data warehouse, such as OLAP systems, static reporting systems, and data mining systems
Figure 1.7: Data warehouse system architecture
1.5.1 ETL tier
The ETL (Extract-Transform-Load) tier is the lowest tier, hidden from end users. It consists of three steps:
- The extract step gathers data from many different sources. These sources may be operational databases (MS SQL, MySQL, Oracle, DB2) or files in various formats (CSV, fixed-length, Excel, XML), and the data may be internal to the company or come from outside. A good ETL system must be compatible with these common data sources.
- The transform step converts data from the source format into the format of the data warehouse (the multidimensional format described above). It includes steps such as:
o Cleaning removes erroneous records and converts data to a common standard format.
o Integration reconciles data that has the same meaning but comes from different sources into a single structure.
o Aggregation summarizes data to the level of detail of the data warehouse.
- The load step writes the transformed data into the data warehouse. It also covers propagating changes from the operational systems into the data warehouse, so that reported figures stay up to date. Depending on company policy, this refresh may run in real time, hourly, daily, or even monthly.
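The three steps can be sketched end to end with the Python standard library. Everything here is a simplified assumption for illustration: the CSV content, the column names, the sales_by_city target table, and the cleaning rules (drop rows whose amount is not a number, normalize the spelling of city names).

```python
import csv
import io
import sqlite3

# Made-up source extract; a real flow would read files or source databases.
RAW = """order_id,city,amount
1,paris,10
2,Paris ,5
3,Nice,abc
4,Nice,7
"""


def extract(text):
    """Extract: read the source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean the rows, then aggregate to the city level."""
    totals = {}
    for row in rows:
        try:
            amount = int(row["amount"])
        except ValueError:
            continue  # cleaning: drop records with an invalid amount
        city = row["city"].strip().title()  # cleaning: one standard spelling
        totals[city] = totals.get(city, 0) + amount  # aggregation
    return sorted(totals.items())


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_city "
        "(city TEXT PRIMARY KEY, amount INTEGER)"
    )
    with conn:
        conn.executemany("INSERT INTO sales_by_city VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
loaded = conn.execute(
    "SELECT city, amount FROM sales_by_city ORDER BY city"
).fetchall()
print(loaded)
```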
1.5.2 Data warehouse tier
The data warehouse tier sits at the center of a data warehouse system and stores data covering all the business activities and departments of the enterprise. A data warehouse usually comprises one or more data marts, a data mart being a small data warehouse focused on one particular business area of the enterprise (for example a sales data mart, a warehousing data mart, or an HR data mart).
Besides storing data, the data warehouse tier has another very important component called metadata. Metadata is divided into two groups, technical metadata and business metadata. Business metadata describes the meaning of the data and the rules and constraints that apply to it. Technical metadata describes how data is organized, stored, and manipulated in the computer system.
Within a data warehouse, technical metadata describes the data warehouse itself, the source data, and the ETL processes. Specifically:
- Metadata that describes the structure of the data warehouse and the data marts at the logical level (fact tables, dimension tables, hierarchies, data lineage) and at the physical level (table structure, indexes, partitions). It also holds data security information (authentication, user authorization) and monitoring information (usage and performance statistics, error reports)
- Metadata that describes the source data, again at the logical level (connection method and parameters, refresh frequency, meaning of the data) and at the physical level (data structure)
- Metadata that describes the ETL processes, including data lineage (tracing data in the data warehouse back to its origin in the operational system) and the rules for extracting, cleaning, and transforming data.
1.5.3 Data access tier
The data access tier contains the tools that let end users exploit the data in the data warehouse. The main tools are:
- OLAP tools produce dynamic reports: users build reports with the OLAP operations (described in section 1.3.3). Such unplanned queries are called ad hoc queries, because the system has not been prepared in advance for what the user will do. OLAP reports are used when users want dimensional, in-depth, or big-picture information before making a decision.
- Reporting tools produce static reports with a fixed structure and format, based on predefined queries and sometimes including charts. Static reports are used when users want information for evaluation and day-to-day management.
- Data mining tools let users analyze data to find valuable information that is still hidden, such as trends and common patterns.
Part 2: Building the Data Warehouse
Part 1 of this document described the data warehouse, its purpose, its meaning, and its architecture. Part 2 introduces the concepts used in data warehouse design.
2.1. Building dimension tables
Dimension tables provide the information and context for the fact table, and therefore for every figure presented in the data warehouse. Although they are many times smaller than fact tables, dimension tables are the heart and brain of the data warehouse, because any access to its figures has to go through them. It has been said that the quality of a data warehouse design is the quality of the design of its dimension tables.
2.1.1 Dimension table structure
Figure 2.1: Basic structure of a dimension table
Basically, every dimension table has the physical structure shown in Figure 2.1. The primary key of a dimension table is a field (usually numeric) that holds unique, meaningless values, called a surrogate key. Surrogate keys are generated inside the data warehouse system by the ETL flows that process the data. Key values are created only within the data warehouse, and any change from outside is forbidden.
In the past, generating surrogate key values was often left to the database used as the data warehouse, specifically to database triggers. It has become increasingly clear that triggers slow down the whole ETL process, and their use is now limited. Using the database to create primary keys is no longer recommended either, because the data warehouse then depends on another, external process, which can easily lead to inconsistencies (for instance when the database used while building the system differs from the one used in deployment). The safest approach is therefore to let the ETL flow itself generate the surrogate key values.
Another component of a dimension table is the primary key of the data in the operational system, called the natural key. Natural key values usually carry meaning. For example, the Employee dimension table has a field EMP_ID that holds the employee ID taken from the operational system. Although EMP_ID could serve as the primary key of the Employee dimension table, the design must still give the table a surrogate key (to cover the case where data is loaded from two different systems whose natural keys may collide, and the case where dimension values change, which is discussed later).
Some argue that a dimension with a natural key can use it as its primary key, so a meaningless numeric surrogate key is unnecessary. Instead, a meaningful value such as the time of change can be added, so that the attribute set {natural key, time of change} forms the primary key of the dimension table. This approach may be useful in a few cases but runs into trouble in the following situations:
- It contradicts the definition: a surrogate key, by definition, has no meaning of its own. If meaning is deliberately attached to it, the ETL designer has to add processing to manage the meaning of these values, which makes the ETL flow more complex and therefore slower.
- It reduces performance: adding meaning to the key forces end-user queries to include extra conditions, which makes the statements more complex and the queries more expensive in resources and time than a plain comparison of two numeric values.
The last component of a dimension table, besides the primary key and the natural key, is a set of descriptive attributes. Descriptive attributes can be of many different data types, and there can be a great many of them (especially for dimensions such as customer, employee, and product). In general there is no need to be afraid of designing a dimension table with more than 100 descriptive attributes; there is nothing wrong with that. Just take care to clean the data for these attributes thoroughly.
One small caveat concerns descriptive attributes of a numeric type: depending on its meaning, such an attribute may actually be a measure that belongs in the fact table. Descriptive information is used only to describe, not to be summed. There is no need to worry too much, because in 99% of cases it is immediately clear whether a value is a fact or a dimension attribute. The remaining 1% can go either way, because the placement does not change the meaning of the attribute, only the way it is processed.
//TODO: example of a dimension attribute that can be used as a fact
2.1.2 Flat design and snowflake design
Dimension tables are denormalized, flat tables. All hierarchical data and normalized structures of the operational system are redesigned into a flat form. The data in a dimension table is therefore redundant, and corresponds to second normal form in operational database design.
A dimension table may contain more than one hierarchy. For example, the Store dimension in Figure 1.1 may have one hierarchy that follows administrative geography and another that follows the internal organization of the company. Both hierarchies can live in the same dimension table, the only constraint being that, whichever hierarchy is used, the values of the hierarchy attributes are always unique.
If a dimension table is normalized, the hierarchy is represented as a snowflake schema (Figure 2.2a). Note that in terms of data content the two representations do not differ, but each has advantages and drawbacks for the users who work with it. The drawback of the flat, denormalized table is redundant data, which easily leads to inconsistencies between records (the same parent member but different information about that parent). The drawback of the normalized tables is lower query performance (because several tables have to be joined) and a design that is harder to understand for non-technical users (sales staff, data analysts, senior managers)
a) Normalized dimension tables
Item
  PK   Item ID
       Item name
       Unit price
  FK1  Item category ID
Item Category
  PK  Item category ID
      Item category name
b) Denormalized dimension table
Item
  PK  Item ID
      Item name
      Unit price
      Item category
Figure 2.2: Representations of a dimension table
Every time a new record is added to a dimension table, the system must assign it a surrogate key, which serves as the primary key of the dimension table (figure 2.1). In a data warehouse environment, an ETL process needs to manage the surrogate key values for each dimension table. The simplest method is to find the highest key value already in the table and add 1 for the new key, although this approach has limited performance.
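The two ways of handing out surrogate keys can be sketched as follows. This is a minimal illustration, not the project's ETL code: the in-memory list `store_dim` and the names `next_key_by_max` and `SurrogateKeyGenerator` are hypothetical, chosen only for the example.

```python
import itertools


def next_key_by_max(dimension_rows):
    """Naive approach from the text: scan for the highest key and add 1."""
    return max((row["key"] for row in dimension_rows), default=0) + 1


class SurrogateKeyGenerator:
    """Scans the table once, then hands out keys from an in-memory counter."""

    def __init__(self, dimension_rows):
        self._counter = itertools.count(next_key_by_max(dimension_rows))

    def next_key(self):
        return next(self._counter)


# Hypothetical Store dimension already holding two rows.
store_dim = [{"key": 1, "name": "Store A"}, {"key": 2, "name": "Store B"}]
generator = SurrogateKeyGenerator(store_dim)
for name in ("Store C", "Store D"):
    store_dim.append({"key": generator.next_key(), "name": name})
```

The counter avoids rescanning the dimension for every new row, which is where the max-plus-one method loses performance on large loads.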
2.1.3 The time dimension
Almost every fact table has at least one time dimension. Every measurement is taken at some point in time and is repeated after a regular interval.
Figure 2.3: The time dimension
The most widely used time dimension is the calendar date, where the smallest unit, one record of the dimension table, is a single day. Somewhat surprisingly, this dimension has quite a lot of attributes, as figure 2.3 shows. Only a few of them can be generated by an SQL statement (day of week, day, month, quarter, year). The remaining fields (working day, holiday, fiscal year) differ by national and company policy: in Vietnam, for example, the Hùng Kings' Commemoration Day is a public holiday, and some companies work 5 days a week while others work 6. The time dimension is both the most commonly used and the most unusual dimension in a data warehouse project. It is normally created once at the start of the project and then left almost untouched throughout development, operation and upgrades (unless company policy changes, say the working week is cut from 6 days to 5, or the state adds a new public holiday). The best way to build it is to spend an afternoon with a calendar, write the days out in Excel, and load the Excel data into the data warehouse. Even nine years of dates come to fewer than 4,000 records, which is not many.
As noted earlier, the primary key of a dimension table is only an identifier and carries no meaning of its own. For the time dimension, however, many data warehouse projects do assign a meaningful value to the primary key, specifically a value of the form YYYYMMDD (for example 20130523 for 23 May 2013). Whether a surrogate key should carry meaning like this is still debated, and each side can point to cases where its choice is more or less convenient. There is no case in which one option is a dead end while the other works, only relative advantages and disadvantages, so the details are not pursued here. The Viettel BI project team chose the YYYYMMDD form because it makes the work of the operations staff easier.
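A calendar dimension keyed by YYYYMMDD can also be generated by a short script instead of a spreadsheet. The sketch below is only an illustration: the attribute names, the `HOLIDAYS` set and the default 5-day working week are assumptions, and a real table would take its holidays and fiscal calendar from national and company policy.

```python
from datetime import date, timedelta

# Hypothetical holiday list; real entries come from national and company policy.
HOLIDAYS = {date(2013, 1, 1), date(2013, 9, 2)}


def build_date_dimension(start, end, working_weekdays=(0, 1, 2, 3, 4)):
    """Return one row per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        is_holiday = day in HOLIDAYS
        rows.append({
            # Meaningful key in YYYYMMDD form, e.g. 20130523.
            "date_key": day.year * 10000 + day.month * 100 + day.day,
            "full_date": day,
            "weekday": day.isoweekday(),          # 1 = Monday ... 7 = Sunday
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            "is_holiday": is_holiday,
            "is_working_day": day.weekday() in working_weekdays and not is_holiday,
        })
        day += timedelta(days=1)
    return rows


dim_2013 = build_date_dimension(date(2013, 1, 1), date(2013, 12, 31))
```

Passing `working_weekdays=(0, 1, 2, 3, 4, 5)` would model a company that works 6 days a week.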
In many cases the time in a fact table must be tracked below the day level, at the hour or minute level. The reflex is to create a dimension table at that hour or minute grain. Be careful: a year has 8,760 hours, 525,600 minutes and 31,536,000 seconds, and a careless query against a table that size can bring down the whole data warehouse. In this situation one option is to create a 24-hour dimension, a 60-minute dimension and a 60-second dimension and combine them. Another is to write the time value itself (the database's datetime type) into the fact table and treat it as a special value.
2.1.4 Very large dimension tables
When building a data warehouse we sometimes meet dimension tables with an enormous number of records, comparable to or even larger than the fact table. At Viettel, for example, the Subscriber dimension holds hundreds of millions of records.
On top of that, these huge dimensions tend to gather attributes from many different sources, producing tables with a very large number of columns.
Unfortunately, these huge dimension tables usually hold information about the outside world that the company does not control, so records are updated and added very frequently. This problem is covered later in the section "Updating dimension values".
2.1.5 Tiny dimension tables
A data warehouse also contains tiny dimension tables with only one or two columns and a handful of records. These dimensions usually have no source of their own and are extracted from an attribute of the source data. For example, the Gender dimension is extracted from the Subscriber information table with 3 values: Male, Female and Not provided. In telecom, the Call type dimension is extracted from the voice call detail records, which carry a code distinguishing ordinary calls, video calls and forwarded calls.
Each such dimension, however small, still needs careful handling. They usually have no separate ETL flow to load them and are instead populated inside the flow that processes the source data they come from.
2.1.6 Nested dimensions
A nested dimension, a dimension that sits inside another dimension, is common when building a data warehouse. For example, the Subscriber dimension usually has an activation date (itself a dimension) and an activation district (another dimension). When you meet this situation, remember that it occurs frequently and does not indicate a flaw in the data warehouse design.
2.1.7 One dimension table or two?
A common view in data warehouse design is that dimension tables are independent of each other. This view is not entirely accurate, although it holds in 90% of cases. Take the relationship between salespeople and stores. Some companies require each salesperson to belong to one store and to take stock only from that store. The Salesperson-Store relationship is then one-to-many, forms a tree, and should sit in a single dimension. In other companies the relationship is looser: a salesperson is assigned to one store but may take stock from several sources. In that case a single dimension is neither necessary nor advisable, and the two should be split into separate dimension entities.
In general, dimension tables should be designed to be fully independent of each other (or nested, as described in section 2.1.6), and dimensions with a many-to-many relationship should be split out into a fact table.
2.1.8 Updating dimension values
When the data warehouse detects that a value in a dimension record has changed, it must be programmed to respond appropriately. There are 3 main responses, numbered Type 1, Type 2 and Type 3.
Type 1 (Type 1 Slowly Changing Dimension) simply overwrites the changed data in the dimension table. The designer chooses Type 1 when the source change is a correction, or when the updated attribute is unimportant and does not alter the meaning of the fact table. Type 1 always applies the change with an UPDATE in the data warehouse, so note that the data in any aggregate fact tables that use the changed dimension fields changes as a result.
Type 2 (Type 2 Slowly Changing Dimension) tracks the changes made to a dimension table and links each fact record to the dimension record that was in effect at the time of the fact. The idea is simple: when the data warehouse detects that the source data has been updated, instead of overwriting, it updates the status of the old record and adds a new record to the dimension table. The new record receives a brand-new surrogate key (unrelated to that of the old record), and from then on the data warehouse uses the new record for the fact records it generates. Fact records generated earlier remain linked to the old dimension record.
Type 2 gives the clearest picture of how data changes over time, because every change to the entity in the source data, however small, is recorded in the data warehouse. Figure 2.4 shows an example. The employee Jane Doe has held different positions over time, and each job change is recorded as a new record in the data warehouse, each with its own new surrogate key.
Figure 2.4: Example of a Type 2 dimension change
For a small dimension table, with around ten fields and a few thousand records, changes can be detected by comparing every field of every record. When the number of fields or records is too large, this comparison takes too long to be practical. An alternative is to compare a checksum (for example a cyclic redundancy check, CRC) of the data warehouse record with a checksum of the operational record; if the checksums differ, the record has changed. The checksum input can be the concatenation of all the fields to be checked, each converted to a string. The common checksum and hash functions (CRC32, MD5, SHA1, SHA3, SHA1-Base32, SHA256, SHA384, SHA512) can all be used for this. Differing checksums always mean the data differs; equal checksums mean it is almost certainly unchanged, with collisions more likely for a short checksum such as CRC32.
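The checksum comparison described above might look like the following. It is a sketch under stated assumptions: the column names and sample rows are invented, SHA-256 is picked from the list of usable functions, and the `|` separator is an addition that keeps adjacent fields from running together.

```python
import hashlib


def row_checksum(row, columns):
    """Concatenate the tracked columns as strings and hash the result."""
    # The separator keeps ("ab", "c") from hashing the same as ("a", "bc").
    payload = "|".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical tracked attributes of a subscriber dimension.
TRACKED = ["name", "plan", "district"]
warehouse_row = {"id": 7, "name": "Jane Doe", "plan": "Economy", "district": "Ba Dinh"}
source_row = {"id": 7, "name": "Jane Doe", "plan": "Tomato", "district": "Ba Dinh"}

changed = row_checksum(warehouse_row, TRACKED) != row_checksum(source_row, TRACKED)
```

Storing the checksum alongside each dimension row means only the source side has to be hashed on every run.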
To generate fact data, however, the data warehouse must find the record with the latest surrogate key. This is not always easy and often costs system resources and performance. Instead, the dimension table can carry an is_active field to simplify the lookup. In that case the data warehouse does not just add a new record to the dimension table; it must first set is_active on the old record to inactive. Besides is_active, it is sometimes necessary to record additional information about the update. Commonly stored items include:
- Effective start date
- Expiry date (filled in only once the record has expired)
- Reason for the update
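A Type 2 change with the is_active flag and the additional fields listed above can be sketched as below. The field names `valid_from`, `valid_to` and `change_reason`, the function `apply_type2_change` and the sample employee are assumptions made for the illustration.

```python
from datetime import date


def apply_type2_change(dimension, natural_key, new_attrs, change_date, reason):
    """Expire the active row for natural_key and append a new version."""
    next_key = max((row["key"] for row in dimension), default=0) + 1
    for row in dimension:
        if row["natural_key"] == natural_key and row["is_active"]:
            row["is_active"] = False        # old version is no longer current
            row["valid_to"] = change_date   # expiry date filled in on expiry
    dimension.append({
        "key": next_key,                    # brand-new surrogate key
        "natural_key": natural_key,
        **new_attrs,
        "is_active": True,
        "valid_from": change_date,
        "valid_to": None,
        "change_reason": reason,
    })
    return next_key


employee_dim = [{
    "key": 1, "natural_key": "E042", "name": "Jane Doe", "job": "Analyst",
    "is_active": True, "valid_from": date(2010, 1, 1), "valid_to": None,
    "change_reason": "initial load",
}]
new_key = apply_type2_change(
    employee_dim, "E042", {"name": "Jane Doe", "job": "Manager"},
    date(2012, 6, 1), "promotion",
)
```

New fact rows for E042 would be linked to `new_key`, while facts loaded before the change keep pointing at key 1.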
Type 3 (Type 3 Slowly Changing Dimension) is used when a dimension value changes but data warehouse users may choose to use either the new or the old value. Type 3 is rarely used in practice (and can be replaced by a fact table).
Which type to apply to a dimension entity depends on the business requirements for that dimension. In many designs two types are combined within one dimension, most commonly Type 1 and Type 2. The attributes of the dimension are then given priorities. If the changed attribute has low priority, Type 1 is used and the value is overwritten on all historical records. If it has higher priority, Type 2 is used: the old record is deactivated and a new one is created.
2.1.9 Late-arriving dimension records and correcting wrong dimension data
Late-arriving data is a real headache in building and operating a data warehouse: tiring to fix, yet frequent. It is usually caused by the data warehouse and the operational system falling out of sync, or by a fault in the operational system that keeps it from pushing data to the data warehouse on time. When late data is detected, the data warehouse operators should ask the operators of the operational system to fix the problem so that it does not recur. The causes of the loss of sync are not discussed here; this section concentrates on the remedy.
For example, in telecom, a new policy moves a batch of mobile subscribers to a new tariff plan, but the change does not reach the data warehouse until several days later. In the meantime many aggregate fact tables still use the old plan, and the reports are wrong. The recommended steps are:
- Step 1: Back up the records that need to change
- Step 2: Apply the changes to the backed-up data
- Step 3: Delete the old records from the data warehouse
- Step 4: Write the updated data into the data warehouse
Updating records in place in a fact table is discouraged because the number of records is too large and the data warehouse is not optimized for updates.
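The four steps can be tried out on a toy table. The sketch uses Python's built-in sqlite3 module only so that it runs anywhere; the table `fact_usage`, its columns and the sample rows are hypothetical, and a production warehouse would use its own bulk tools for steps 3 and 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_usage (sub_id INTEGER, date_key INTEGER, plan TEXT, revenue REAL);
    INSERT INTO fact_usage VALUES (1, 20130520, 'Economy', 10.0),
                                  (1, 20130521, 'Economy', 12.0),
                                  (2, 20130521, 'Economy', 5.0);
""")

# Step 1: back up the records that need to change.
conn.execute("""CREATE TABLE fix_backup AS
                SELECT * FROM fact_usage
                WHERE sub_id = 1 AND date_key >= 20130521""")
# Step 2: apply the correction to the backup copy.
conn.execute("UPDATE fix_backup SET plan = 'Tomato'")
# Step 3: delete the stale records from the warehouse table.
conn.execute("DELETE FROM fact_usage WHERE sub_id = 1 AND date_key >= 20130521")
# Step 4: write the corrected records back.
conn.execute("INSERT INTO fact_usage SELECT * FROM fix_backup")
conn.commit()

plans = conn.execute(
    "SELECT date_key, plan FROM fact_usage WHERE sub_id = 1 ORDER BY date_key"
).fetchall()
```

The only UPDATE runs against the small backup table; the large fact table sees just a delete and an insert.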
Another example: the sales center launches a new tariff plan, but the plan's details have not yet been loaded into the data warehouse. This is the most typical case.
2.1.10 Summary
This section covered the concepts and techniques for building dimension tables in a data warehouse. Although a dimension table usually has far fewer records than a fact table, it is extremely important because it gives meaning to the numbers in the fact table.
All the techniques in this section are used in real designs. In particular, the three types of dimension change have become basic knowledge that every data warehouse designer must have.
The next section discusses fact tables and the techniques used to build them.
2.2 Building fact tables
A fact table stores the measurements of the company's business operations. The relationship between a measurement and the fact table is simple: a measurement is a fact record. A measurement is defined as a quantity observed in a consistent unit of measure.
The multidimensional data model is built around these measurements. The fact table holds the measurements, and the dimension tables hold their context. This seemingly simple relationship gives end users a view of the data in the data warehouse that is intuitive, complete and easy to use.
Section 2.1 covered the design of dimension tables; this section covers the design of fact tables. The numbers (facts) are what users actually want and are the soul of the data warehouse, but what is a soul without a body around it? The dimension concepts from the previous section make this section much easier to follow.
2.2.1 Structure of a fact table
Every fact table is defined by its grain, and the grain is defined by a measurement event in the real world. The designer must always state the grain of the fact table explicitly, that is, how the measurements in the table are taken in the real world. Figure 2.5 shows a fact table at the finest grain, where each record represents one order line (invoice line).
A fact table normally has no primary key column of its own; it holds a set of foreign keys that connect to the dimension tables (which provide the context for the information in the fact table). Most fact tables also have numeric fields, which are the business measurements. Figure 2.5 also contains a few special dimensions called degenerate dimensions: dimensions that exist in the fact table but do not need a dimension table of their own. Degenerate dimension is abbreviated DD.
In practice every fact table has at least 3 dimensions, and usually more. The finer the grain, the more dimension tables the data warehouse needs. Regrettably, demands on data warehouse data are increasingly hard to predict, the grain keeps getting finer, and the number of dimensions keeps growing.
Figure 2.5: Sales transaction fact table at the finest grain
Unlike a dimension, a fact table does not use a surrogate key; its primary key is made up of a subset of its own fields. In figure 2.5 the primary key is the combination {Ticket Number, Line Number}. These two fields identify a unique record across the whole checkout process of the operational system, and therefore also a unique record in the data warehouse. A sales fact table at a higher level of aggregation would use different fields as its primary key.
In the same example the primary key could also be the combination {Cash Register, Time of day, Line Number}. With that choice, if care is not taken, a different event carrying exactly the same values can occur, and there is then no way to tell whether a record is a duplicate or a separate event. To be stricter, fact records can be given an additional sequence field as they are loaded into the data warehouse.
2.2.2 Referential integrity
Referential integrity in the multidimensional data model means that every record in the fact table must be assigned valid dimension values; in other words, every fact record must be able to find its corresponding dimension records.
There are 2 main kinds of integrity violation:
- Writing data to the fact table when the corresponding dimension value does not exist
- Deleting a dimension record whose value is already in use
Tables in a data warehouse usually do not carry key constraints as strict as those in an operational system, so either violation (or both) happens very easily, and each occurrence seriously damages the data. A referential integrity violation in a data warehouse is not a minor error; it is a dangerous one. Queries over fact data that violates referential integrity return incomplete or wrong results, which undermines the accuracy of decision making.
Figure 2.6: When to check referential integrity
Figure 2.6 shows 3 sensible points at which to check referential integrity in the ETL flows that write fact data into the data warehouse:
- Check the fact data carefully before writing it to the data warehouse, and check before deleting data from a dimension table
- Place integrity constraints on the data warehouse so that invalid data cannot be written and data that is in use cannot be deleted
- Detect and fix integrity errors after loading, by periodically scanning the fact table for bad foreign key values
In practice the best point is before the data is written to the data warehouse. All that is needed is an extra step that finds dimension values not yet present in the dimension table and inserts them, with their details to be filled in later (this is described below). If this step is done carefully, the fact table will satisfy referential integrity. Similarly, before deleting records from a dimension table, first join the records to be deleted against the fact table, and delete only after confirming that the join returns no rows.
The second option is usually not used because it slows writes to the data warehouse considerably. Some data warehouses write quickly even with integrity constraints in place (IBM's Red Brick, for example), but these are special cases; most data warehouses are slow when integrity constraints are enforced.
Checking integrity after the data has been written is also rare in practice because the volume to check is too large. If it were done, the statement would look like this:
select f.product_key
from fact_table f
where f.product_key not in (select p.product_key from product_dimension p)
In general this statement cannot be run on a production system. Any such check has to be limited to a time window (today, this week, this month), so violations can still exist in older data.
All things considered, the best compromise between integrity and performance is to check fact data before writing it and not to delete data from dimension tables. Dimension data should be deleted only to correct an error, and otherwise left alone. A record (or several) that is unused today may well be used later, and a dimension table is in any case not large enough to seriously affect system performance.
Occasionally a brand-new dimension value is discovered in the fact data (a late-arriving dimension). In that case we write a new record to the dimension table (thereby creating a new surrogate key) with its dimension attributes simply set to Unknown.
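The pre-load check and the Unknown placeholder can be combined in one lookup step, roughly as follows. The function `resolve_keys`, the product codes and the row layout are hypothetical; the point is only that every fact row leaves the step with a surrogate key that exists in the dimension.

```python
def resolve_keys(fact_rows, product_dim):
    """Map natural product codes to surrogate keys before loading facts.

    Codes missing from the dimension get a placeholder row whose
    descriptive attribute is 'Unknown', to be filled in later.
    """
    by_code = {row["code"]: row["key"] for row in product_dim}
    next_key = max(by_code.values(), default=0) + 1
    resolved = []
    for fact in fact_rows:
        code = fact["product_code"]
        if code not in by_code:
            product_dim.append({"key": next_key, "code": code, "name": "Unknown"})
            by_code[code] = next_key
            next_key += 1
        resolved.append({"product_key": by_code[code], "amount": fact["amount"]})
    return resolved


product_dim = [{"key": 1, "code": "P-001", "name": "SIM card"}]
incoming = [
    {"product_code": "P-001", "amount": 50},
    {"product_code": "P-999", "amount": 20},   # not yet in the dimension
]
loaded = resolve_keys(incoming, product_dim)
```

When the real attributes for P-999 arrive, the placeholder row is updated in place and the fact rows need no change.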
2.2.3 Types of fact table
Fact tables hold all the measurements of the company's business operations. Despite that variety, fact tables fall into just 3 main types:
A transaction grain fact table (transactional grain fact table) usually describes an event that happened in the real world and was recorded in the data warehouse. Its measurements and dimensions therefore do not describe a process; they record the values at the moment the event occurred.
The transaction grain fact table is the largest (in number of records) and most detailed of the 3 types. It usually holds the fullest information about the event, including the exact time it occurred. Because of its size, end users rarely query it directly; it more often serves as the input for building fact tables at higher levels of aggregation.
A periodic snapshot fact table represents a fixed time span and repeats every period. The most common forms are daily, monthly and yearly snapshots.
There are 2 ways to build the data for a periodic snapshot fact table. One is to wait until the end of the month and compute everything in one pass, which is quick and tidy. The other is to maintain a running month-to-date record, adding each day's results until the first day of the next month, when the old month is closed and a new record is started.
An accumulating snapshot fact table stores figures over a time span that is not fixed in advance, such as the sales of a product from its launch to the present (and it continues to be updated in the future).
2.2.4 Loading data into fact tables
Fact tables, especially transaction grain fact tables, hold the largest number of records in the data warehouse. Loading them is therefore rarely simple and can go badly wrong if not handled carefully. This section covers the problems ETL operators commonly face during the load.
Handling indexes
Indexes are very good for queries but very costly when writing data to the database. A table with many indexes slows writes so much that the whole process can appear to have stalled. The remedy is to drop all indexes before the load into the data warehouse and rebuild them from scratch once the load has finished.
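The drop, load, rebuild sequence is shown below on an in-memory SQLite table, used here only because it runs without a server. The table and index names are made up, and on a real warehouse the same three steps would be issued in that database's own dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date_key INTEGER, store_key INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_date ON fact_sales (date_key)")

# Hypothetical batch of 1,000 fact rows for one day.
batch = [(20130523, store, 1.0) for store in range(1000)]

# 1. Drop the index so the load does not maintain it row by row.
conn.execute("DROP INDEX idx_sales_date")
# 2. Bulk load the batch.
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", batch)
# 3. Rebuild the index once, after all rows are in place.
conn.execute("CREATE INDEX idx_sales_date ON fact_sales (date_key)")
conn.commit()

index_names = [row[1] for row in conn.execute("PRAGMA index_list(fact_sales)")]
```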
Handling partitions
Partitioning physically divides a table (and its indexes) into smaller tables. This lets a query go straight to the partition that holds the data it needs instead of searching the whole table. Handled correctly, partitioning significantly improves query performance on large fact tables. It is usually transparent to users, and is operated jointly by the DBA and ETL teams.
The most common technique for fact tables is to partition on a time field. The advantage of a time field is that it is fixed and defined in advance, so we always know which key values are coming next. A common mistake is for the designer to add a time field to the fact record (perhaps the time the record was loaded into the warehouse) and partition on that. If that field does not appear in end users' queries, the partitioning is pointless. The lesson: partition only on a time field that users care about and actually use.
Time-partitioned fact tables are usually partitioned by year, quarter or month, and sometimes by week or day. The data warehouse design team normally works with the DBA team to decide the best partitioning scheme for each fact table. Also commonly, the DBA team does not touch the fact tables directly, and new partitions must be added by an ETL flow. In most cases the DBA team does not take part in operating the data warehouse tables, so the ETL operations team manages the partitions and builds a partition maintenance flow for each fact table. With partitioned tables the operations team will occasionally see this error:
ORA-14400: inserted partition key is beyond highest legal partition key
When that happens, the ETL flow that adds partitions has certainly failed, and the thing to do is fix that flow or contact the DBA team immediately.
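To avoid this error, before writing data to the data warehouse the ETL flow can check proactively by comparing the largest time value in the data about to be written with the highest partition bound of the fact table. For example, compare the result of the following query, run against the data about to be loaded: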
select max(date_key) from FACT_TABLE
with
select high_value from all_tab_partitions
where table_name = 'FACT_TABLE'
and partition_position = (select max(partition_position)
from all_tab_partitions where table_name= 'FACT_TABLE')
If the value from the first script is greater than or equal to the value from the second (the partition bound is exclusive), a new partition can be added before loading with the script:
ALTER TABLE FACT_TABLE
ADD PARTITION year_2013 VALUES LESS THAN (20140101)
The whole monitoring and partition maintenance process can be written as a stored procedure in the data warehouse and called by the ETL flow before each load.
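The decision logic of such a check can be sketched outside the database. The function below is an assumption-laden illustration: it presumes yearly partitions whose bounds fall on 1 January, YYYYMMDD integer keys, and that the two values have already been fetched and converted to integers (Oracle returns high_value as a LONG column, so a real flow needs a conversion step). The name `partitions_to_add` is hypothetical.

```python
def partitions_to_add(max_incoming_key, highest_bound):
    """Return ALTER TABLE statements for the yearly partitions still missing.

    highest_bound is the exclusive upper bound of the last partition,
    e.g. 20140101 means keys up to 20131231 are accepted.
    """
    statements = []
    bound = highest_bound
    while max_incoming_key >= bound:
        year = bound // 10000              # year the new partition will hold
        bound = (year + 1) * 10000 + 101   # 1 January of the following year
        statements.append(
            f"ALTER TABLE FACT_TABLE ADD PARTITION year_{year} "
            f"VALUES LESS THAN ({bound})"
        )
    return statements


# Data for 23 May 2013 arriving while the table only covers up to 2012.
pending = partitions_to_add(20130523, 20130101)
```

The loop form also covers the case where the flow has not run for more than a year and several partitions are missing at once.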
Bypassing the rollback log
By default every relational database supports recovery when a transaction fails. The system returns the affected records to their state before the transaction by keeping a log of every data change. When an error occurs, the database reads this log back and undoes every change that has not been committed. Committing a transaction tells the database that the transaction has succeeded and that its effects must be made permanent in the database.
The rollback log, referred to here also as the redo log (some databases, Oracle among them, keep undo and redo as separate structures), is therefore very important in an operational system. In a data warehouse, however, where every transaction is managed by an ETL flow, this logging gets in the way of loading and should be bypassed to improve throughput. The reasons a data warehouse does not need it are:
- Data is written to the data warehouse by a closely monitored process, the ETL flow
- Data is written to the data warehouse in bulk
- If something does go wrong, the operator can easily fix the problem and rerun the process
Note that database vendors manage logging differently, and some offer logging features that others lack. When building ETL flows, pay attention to these differences to get the best performance.
Writing the data
Loading a large volume of data into the data warehouse is more than a simple insert statement. The main techniques are:
- Separate updates from inserts: some ETL tools (and databases) offer an "update else insert" feature. It sounds convenient and makes the ETL flow look much simpler, but in practice it is too slow (every incoming record requires a lookup to see whether it already exists). Avoid it and build 2 separate flows instead: apply the updates first, then insert the new data.
- Use the database's bulk loader: most databases ship a tool for loading data that applies techniques specific to that database, so it is much faster than insert statements.
- Load in parallel: when a large number of records must be written, split the data into several parts and hand them to several ETL flows running in parallel.
- Minimize updates: UPDATE statements in a data warehouse are usually slow and hard to track, so real corrections are usually made another way. A common and simple technique is to delete the records that need changing and then write the same records back with the corrected information.
- Build aggregates outside the database: sorting, merging and aggregating should be done outside the data warehouse for better efficiency. Data warehouse resources are usually limited and hard to upgrade (whether vertically, with more RAM and CPU, or horizontally, with more servers), whereas the ETL layer can easily be deployed across several servers to share the load.
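The first technique in the list, splitting a batch into an update stream and an insert stream, comes down to one set lookup per batch instead of one database lookup per row. A minimal sketch, with invented record ids and a hypothetical `split_batch` helper:

```python
def split_batch(incoming, existing_keys):
    """Partition a batch into rows to update and rows to insert."""
    to_update = [row for row in incoming if row["id"] in existing_keys]
    to_insert = [row for row in incoming if row["id"] not in existing_keys]
    return to_update, to_insert


# Keys already in the warehouse, fetched once per batch rather than per row.
existing_keys = {101, 102}
incoming = [
    {"id": 101, "amount": 9},
    {"id": 103, "amount": 4},
    {"id": 104, "amount": 7},
]
updates, inserts = split_batch(incoming, existing_keys)
```

The `updates` list would feed the update flow (or a delete-and-reload), and the `inserts` list can go straight to the bulk loader.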
2.2.5 Factless fact tables
A fact table holds events with measurable figures. In some cases, though, the events have nothing to measure. Examples are a fact table of customers checking in to a hotel or, in telecom, a fact table of customers switching tariff plan from Economy to Tomato. These fact tables record only when the event happened and its attributes, with no figures at all.
They are called factless fact tables. The main operations on them are count and count distinct.
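As a small illustration, the plan-change example can be modelled as a factless fact table and queried with both operations. The table layout and rows are invented, and SQLite is used only so the example runs as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_plan_change (date_key INTEGER, sub_key INTEGER,
                                   from_plan TEXT, to_plan TEXT);
    INSERT INTO fact_plan_change VALUES
        (20130520, 1, 'Economy', 'Tomato'),
        (20130521, 2, 'Economy', 'Tomato'),
        (20130522, 1, 'Tomato', 'Economy');
""")

# COUNT(*) gives the number of events; COUNT(DISTINCT ...) the subscribers involved.
events, subscribers = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT sub_key) FROM fact_plan_change"
).fetchone()
```

Here subscriber 1 changed plan twice, so the event count and the distinct subscriber count differ.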
2.2.6 Aggregate tables
Fact tables hold a very large number of records, which slows queries for end users. This is a perennial problem on which DBAs, ETL specialists and BI users all agree. The most effective way to speed up queries, and the one data warehouse experts usually recommend, is to use aggregate tables.
An aggregate table is a fact table whose data has been summarized to a higher level of the dimension hierarchies than the base fact table. At query time, depending on the hierarchy level requested, either the base fact table or the matching aggregate table is used (in practice aggregate tables should be transparent to the user, and choosing the table should be the job of the query tool). Without the cost of upgrading hardware or software, using aggregate tables alone gives a clear improvement in query performance.
Figure 2.7: Example multidimensional data model with dimensions at the day, product and store level
Aggregate tables are nothing new; the topic has been discussed at length on data warehouse forums and in many books. A few key points:
- In a data warehouse design, each fact table has a set of aggregate tables representing the most common groupings. Switching between aggregates automatically is native to multidimensional databases; in a relational data warehouse it generally has to be provided by a separate selection layer.
- The aggregate selection module sits between the module that handles user queries and the data warehouse.
- The aggregate selection module processes queries from end users and, where possible, rewrites the SQL to read from an aggregate table instead of the base fact table.
- The aggregate selection module knows when to use an aggregate table, and which one, because this information is configured in advance in the metadata that describes the data warehouse.
Figure 2.8: Structure of the aggregate selection module
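The core decision made by such a selection module can be illustrated with a few lines of metadata lookup. Everything here is hypothetical: the table names, the row counts and the `levels` sets stand in for whatever metadata a real tool keeps, and a real module would also rewrite the SQL itself.

```python
# Hypothetical metadata: each table in the family and the levels it can answer.
FACT_FAMILY = [
    {"table": "agg_sales_day_store_group", "rows": 50_000,
     "levels": {"date", "store", "product_group"}},
    {"table": "fact_sales", "rows": 90_000_000,
     "levels": {"date", "store", "product", "product_group"}},
]


def choose_table(requested_levels):
    """Pick the smallest table whose levels cover everything the query asks for."""
    candidates = [t for t in FACT_FAMILY if requested_levels <= t["levels"]]
    if not candidates:
        raise ValueError("no table in the family can answer this query")
    return min(candidates, key=lambda t: t["rows"])["table"]


by_group = choose_table({"date", "product_group"})
by_product = choose_table({"product"})
```

A query grouped by product group is routed to the aggregate, while a query that needs individual products falls back to the base fact table.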
A good aggregate table facility in a data warehouse must meet the following criteria:
- It significantly improves performance for most user queries (no solution is optimal for everyone, and one that tried to be would not be practical)
- It does not increase storage too much. How much is "too much" depends on the situation, but as a rule the total size of the aggregate tables should not exceed the size of the base table.
- It is completely transparent to end users. Sales staff and senior managers neither need nor want to know the technical details.
- It does not disrupt the operation of the ETL system. Building an aggregate table requires a series of sorts and group-bys each time, so this work should be automated as part of the ETL process.
- It places little burden on the DBA. Deciding which aggregate tables to build should be automated by monitoring end users' query habits.
A well-designed aggregate scheme meets all 5 requirements; a poorly designed one meets none of them. The following design guidelines help meet them:
Design guideline 1
Aggregate tables must have their own storage, separate from the base fact table. Each level of aggregation must have its own fact table.
Keeping aggregates in separate fact tables matters and has several benefits. First, the structure of the fact and aggregate tables is simpler, so a query only needs a different table name, not extra conditions. Second, because no extra conditions are needed, end users cannot get wrong results by accidentally summing figures from different aggregation levels. Third, separate tables speed up queries at the database level because fewer records need to be scanned. Finally, separate tables make aggregates easier to manage: if one aggregate table is faulty, it can be fixed on its own without risk to other data.
Design guideline 2
The dimension tables attached to an aggregate table must match the level of the data in that aggregate table. In other words, if the aggregate table is a reduced version of the fact table with some hierarchy levels removed, then its dimension tables must drop the same levels.
Figure 2.9: Aggregate model after rolling up to the Product group level
Figure 2.9 is the aggregate model for the example in figure 2.7. It serves end users who want to query sales by day, by store and by product group (rather than by individual product as in figure 2.7).
Not every dimension appears at the higher aggregation levels. Some dimensions and measures in the fact table are meaningful only at the grain of the base fact table and would distort a report if combined with a higher level of another dimension.
These shrunken dimension tables must follow the same rules as any dimension table: surrogate keys, natural keys, attributes and so on. The dimension tables for the aggregate tables therefore need to be built before the dimensions for the base fact table (and the surrogate key of the shrunken dimension must also be stored in the more detailed dimension table).
Design guideline 3
The base fact table and all its aggregate tables must be linked together as a "family" of fact tables so that the selection module knows which table to run a query against. Each member of the family is a schema consisting of one fact table (base or aggregate) surrounded by its dimensions. Exactly one schema in the family contains the base fact table; the others are aggregate schemas. Figure 2.7 is the base schema, and figure 2.9 is one of the many possible aggregate schemas of figure 2.7.
The relationships between the base fact table and the aggregate tables (and their dimension tables) must be configured so that the query system can choose between them. This information is usually configured in the data warehouse's OLAP tooling.
Design guideline 4
End users should generally not know that aggregate tables exist. Queries that end users send directly to the data warehouse, bypassing the query tools, should go only to the base fact table and never to an aggregate table.
2.2.7 Summary
In this section we established that fact tables store all the figures of the company's business operations. Every fact table has 2 main parts: the figures, and the foreign keys that link to the dimension tables providing their context.
We showed that referential integrity is extremely important in the multidimensional data model, and proposed 3 options for checking it.
We then turned to the common techniques for improving fact table performance, including techniques for optimizing the load process and aggregate tables for speeding up queries.
Part 3: Building ETL flows
3.1 About ETL
3.2 Collecting data
3.3 Cleaning and standardizing data
Part 4: A practical application in telecom: building a data warehouse for prepaid mobile
4.1 Gathering requirements
4.2 Designing the data warehouse
4.3 Building the ETL flows
4.4 Building the OLAP cube