Implementação do DW. SAD Tagus 2004/05 H. Galhardas O problema e as soluções Grandes quantidades de dados => Métodos de acesso e processamento de interrogações

Implementação do DWImplementação do DW

SAD Tagus 2004/05 H. Galhardas

O problema e as O problema e as soluçõessoluções

Grandes quantidades de dados => Grandes quantidades de dados => Métodos de acessoMétodos de acesso e e processamento de interrogaçõesprocessamento de interrogações eficientes eficientes

Materialização de vistas para informação agregadaMaterialização de vistas para informação agregada: : Identificar, explorar e actualizar de forma eficienteIdentificar, explorar e actualizar de forma eficiente

Índices:Índices: Index intersection, bit map indices, join indicesIndex intersection, bit map indices, join indices


Escolha dos agregadosEscolha dos agregados

ObjectivoObjectivo: Minimizar o tempo de resposta: Minimizar o tempo de resposta

PerguntaPergunta: Quais os agregados a materializar?: Quais os agregados a materializar?

FactoresFactores: : Espaço de armazenamento – custo baixo! Custo de actualização – custo alto!

CompromissoCompromisso::

Tempo de resposta vs custo de actualizaçãoTempo de resposta vs custo de actualização


Cube OperationCube Operation Transform it into a SQL-like language (with a new operator Transform it into a SQL-like language (with a new operator

cube bycube by, introduced by Gray et al.’96), introduced by Gray et al.’96)SELECT item, city, year, SUM (amount)

FROM SALES

CUBE BY item, city, year Need compute the following Group-BysNeed compute the following Group-Bys

(date, product, customer),(date,product),(date, customer), (product, customer),(date), (product), (customer)()

2^n cuboids

(item)(city)

()

(year)

(city, item) (city, year) (item, year)

(city, item, year)


Goal: Goal: efficientefficient computation of aggregations across computation of aggregations across many sets of dimensionsmany sets of dimensions

Take into account Take into account amount of main memoryamount of main memory available and available and computation timecomputation time

Data cube can be viewed as a lattice of cuboids Data cube can be viewed as a lattice of cuboids

Materialization of data cubeMaterialization of data cube Materialize every (cuboid) (full materialization), none (no

materialization), or some (partial materialization)

Efficient Data Cube Efficient Data Cube ComputationComputation

)11(

n

i iLT


Partial materialization of Partial materialization of cuboidscuboids

Three factors must be considered:Three factors must be considered:1. Identify the subset of cuboids to materialize

Take into account queries in the workload, their frequencies and accessing costs; cost of incremental updates, total storage requirements

Popular approaches: use heuristics or materialize cuboids based on other frequently referenced cuboids

2. Exploit the materialized cuboids during query processing

How to use indexes and how to transform OLAP operations onto the selected cuboids

3. Efficiently update the materialized cuboids during load and refresh

Explore parallelism and incremental update


Classification of agg. facts Classification of agg. facts wrt the agg. functions wrt the agg. functions

DistributiveDistributive: if the result derived by applying the function to : if the result derived by applying the function to n n aggregate values is the same as that derived by applying aggregate values is the same as that derived by applying the function on all the data without partitioning.the function on all the data without partitioning.

Ex: count(), sum(), min(), max().

AlgebraicAlgebraic:: if it can be computed by an algebraic function with if it can be computed by an algebraic function with MM arguments (where arguments (where M M is a constant), each of which is is a constant), each of which is obtained by applying a distributive aggregate function.obtained by applying a distributive aggregate function.

Ex: avg(), min_N(), standard_deviation().

HolisticHolistic:: if there is no algebraic function with M arguments if there is no algebraic function with M arguments that characteriezes the computation.that characteriezes the computation.

Ex: median(), mode(), rank().


Arquitecturas de servidor Arquitecturas de servidor OLAPOLAP

Relational OLAP (ROLAP)Relational OLAP (ROLAP) Usa SGBDs relacionais ou relacional extendido para armazenar e gerir os

dados do datawarehouse e usa middleware OLAP para suportar funcinalidades específicas do OLAP.

Inclui optimização suportada pelo SGBDR, implementa lógica de navegação de agregação e serviços/ferramentas adicionais

Maior escalabilidade

Multidimensional OLAP (MOLAP) Multidimensional OLAP (MOLAP) Motor de armazenamento multidimensional baseado em arrays (sparse

matrix techniques) Indexação rápida de dados sumarizados pré-calculados

Hybrid OLAP (HOLAP)Hybrid OLAP (HOLAP) Flexibilidade: baixo nível: relacional, alto nível: array

Specialized SQL serversSpecialized SQL servers Suporte especializado para interrogações SQL sobre esquemas em estrela e

floco de neve


Some efficient cube Some efficient cube computation methodscomputation methods

ROLAP-based cubing algorithms (Agarwal et al’96)

Array-based cubing algorithm (Zhao et al’97)Bottom-up computation method (Bayer &

Ramarkrishnan’99)And others...


ROLAP-Based Cubing ROLAP-Based Cubing AlgorithmAlgorithm

Value-based addressing: dimension values accessed by

key-based addressing strategies

Sorting, hashing, and grouping operations are applied to the

dimension attributes in order to reorder and cluster related

tuples

Grouping is performed on some subaggregates as a “partial

grouping step”

Aggregates may be computed from previously computed

aggregates, rather than from the base fact table


Iceberg CubeIceberg Cube Computing only the cuboid cells whose count Computing only the cuboid cells whose count

or other aggregates satisfying the condition or other aggregates satisfying the condition likelike

HAVING COUNT(*) >= minsup

MotivationMotivation Only a small portion of cube cells may be “above the

water’’ in a sparse cube Only calculate “interesting” cells—data above certain

threshold Avoid explosive growth of the cube

Suppose 100 dimensions, only 1 base cell. How many aggregate cells if count >= 1? What about count >= 2?


Multi-way Array Multi-way Array Aggregation for Cube Aggregation for Cube

ComputationComputation Direct array addressingDirect array addressing: dimension values accessed via the index of their corresponding array : dimension values accessed via the index of their corresponding array locationslocations

Partition arrays into Partition arrays into chunkschunks (a small subcube which fits in memory). (a small subcube which fits in memory). Compressed sparse array addressing: (chunk_id, offset)Compressed sparse array addressing: (chunk_id, offset) Compute aggregates in “multiway” Compute aggregates in “multiway” by visiting cube cellsby visiting cube cells in the order which minimizes the # of times to in the order which minimizes the # of times to

visit each cell, thereby reducing memory access and storage cost.visit each cell, thereby reducing memory access and storage cost.

What is the best traversing order to do multi-way aggregation?

A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

B

4428 56

4024 52

3620

60


CuboidsCuboids

ABC – base cuboid – already computedABC – base cuboid – already computed AB, AC, BC – 2-D cuboidsAB, AC, BC – 2-D cuboids A, B, C – 1-D cuboidA, B, C – 1-D cuboid All – 0-D cuboidAll – 0-D cuboid

Many possible orderings w/ which chunks Many possible orderings w/ which chunks can be read into memorycan be read into memory


Example (1)Example (1)

A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

4428 56

4024 52

3620

60

B


Example (2)Example (2)

A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

4428 56

4024 52

3620

60

B


Multi-Way Array Multi-Way Array Aggregation for Cube Aggregation for Cube Computation (Cont.)Computation (Cont.)

MethodMethod: the planes should be sorted and computed : the planes should be sorted and computed according to their size in ascending order.according to their size in ascending order. See the details of Example 2.12 (pp. 75-78) Idea: keep the smallest plane in the main memory, fetch and

compute only one chunk at a time for the largest plane

Limitation of the methodLimitation of the method: computing well only for a small : computing well only for a small number of dimensionsnumber of dimensions If there are a large number of dimensions, “bottom-up

computation” and iceberg cube computation methods can be explored


Indexing OLAP Data: Indexing OLAP Data: Bitmap IndexBitmap Index

Index on a particular columnIndex on a particular column Each value in the column has a bit vectorEach value in the column has a bit vector: bit-op is fast: bit-op is fast The length of the bit vector: # of records in the base tableThe length of the bit vector: # of records in the base table The The i i-th bit is set if the -th bit is set if the i i-th row of the base table has the value for -th row of the base table has the value for

the indexed columnthe indexed column Not suitable for high cardinality domainsNot suitable for high cardinality domains

RecID Retail Dealer1 1 02 0 13 0 14 1 05 0 1

RecIDAsia Europe America1 1 0 02 0 1 03 1 0 04 0 0 15 0 1 0

Base table Index on Region Index on TypeCust Region TypeC1 Asia RetailC2 Europe DealerC3 Asia DealerC4 America RetailC5 Europe Dealer


Indexing OLAP Data: Join Indexing OLAP Data: Join IndicesIndices

Join index:Join index: JI(R-id, S-id) where R (R-id, JI(R-id, S-id) where R (R-id, …) …) S (S-id, …) S (S-id, …)

Traditional indices map the values to a Traditional indices map the values to a list of record idslist of record ids It materializes relational join in JI file and

speeds up relational join — a rather costly operation

In data warehouses, join index relates In data warehouses, join index relates the values of the the values of the dimensionsdimensions of a star of a star schema to schema to rowsrows in the fact table. in the fact table.Ex: fact table: Sales and two dimensions city

and productA join index on city maintains for each distinct

city a list of R-IDs of the tuples recording the Sales in the city


Example of join indexesExample of join indexes


Efficient Processing of Efficient Processing of OLAP QueriesOLAP Queries

1.1. Determine Determine which operationswhich operations should be performed should be performed

on the available cuboids:on the available cuboids:

transform drill, roll, etc. into corresponding SQL and/or

OLAP operations, e.g, dice = selection + projection

2.2. Determine Determine to which materialized cuboid(s)to which materialized cuboid(s) the the

relevant operations should be applied.relevant operations should be applied.

3.3. Exploring indexing structures and compressed vs. Exploring indexing structures and compressed vs.

dense array structures in MOLAPdense array structures in MOLAP


BibliografiaBibliografia

(Livro) (Livro) Data Mining: Concepts and TechniquesData Mining: Concepts and Techniques, J. , J. Han & M. Kamber, Morgan Kaufmann, 2001 Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 2 no livro e Capítulo 3 nos drafts)(Capítulo 2 no livro e Capítulo 3 nos drafts)

(Artigo) Data Cube: A relational aggregation operator (Artigo) Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, J. generalizing group-by, cross-tab, and sub-totals, J. Gray et al, DMKD 1(1), 1997Gray et al, DMKD 1(1), 1997

(Artigo) On the computation of multidimensional (Artigo) On the computation of multidimensional aggregates, Agarwal et al, VLDB, 1996aggregates, Agarwal et al, VLDB, 1996

Documents

Implementação do DW. SAD Tagus 2004/05 H. Galhardas O problema e as soluções Grandes quantidades de dados => Métodos de acesso e processamento de interrogações