Cs437 lecture 16-18

Page 1: Cs437 lecture 16-18

Lecture 16

Dr. Fawad Hussain

Primary and Secondary Indices

Page 2: Cs437 lecture 16-18

• Use multiple indices for certain types of queries.

• Example:

    select account-number
    from account
    where branch-name = “Perryridge” and balance = 1000

• Possible strategies for processing the query using indices on single attributes:

    1. Use the index on branch-name to find all records for the Perryridge branch; test balance = 1000.

    2. Use the index on balance to find accounts with balances of $1000; test branch-name = “Perryridge”.

    3. Use the branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly, use the index on balance. Take the intersection of both sets of pointers obtained.

Multiple-Key Access
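
The three strategies can be compared with a small in-memory sketch. The sample rows, the dictionary-based indices, and the function names are illustrative assumptions, not part of the lecture; each index maps an attribute value to a set of row positions, which stand in for record pointers.

```python
from collections import defaultdict

# Hypothetical sample of the account relation: (account_number, branch_name, balance).
ACCOUNTS = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 400),
    ("A-201", "Perryridge", 1000),
    ("A-215", "Mianus", 1000),
    ("A-217", "Brighton", 750),
    ("A-218", "Perryridge", 1000),
]


def build_index(rows, column):
    """Map each value of one column to the set of row positions holding it."""
    index = defaultdict(set)
    for position, row in enumerate(rows):
        index[row[column]].add(position)
    return index


BRANCH_INDEX = build_index(ACCOUNTS, 1)
BALANCE_INDEX = build_index(ACCOUNTS, 2)


def strategy_1(branch, balance):
    """Use the branch-name index, then test the balance on each fetched record."""
    fetched = BRANCH_INDEX.get(branch, set())
    hits = sorted(ACCOUNTS[p][0] for p in fetched if ACCOUNTS[p][2] == balance)
    return hits, len(fetched)


def strategy_2(branch, balance):
    """Use the balance index, then test the branch name on each fetched record."""
    fetched = BALANCE_INDEX.get(balance, set())
    hits = sorted(ACCOUNTS[p][0] for p in fetched if ACCOUNTS[p][1] == branch)
    return hits, len(fetched)


def strategy_3(branch, balance):
    """Intersect the two pointer sets; only the survivors are fetched."""
    pointers = BRANCH_INDEX.get(branch, set()) & BALANCE_INDEX.get(balance, set())
    hits = sorted(ACCOUNTS[p][0] for p in pointers)
    return hits, len(pointers)
```

Each function returns the matching account numbers together with the number of records it had to fetch, so the cost difference between the strategies is visible directly.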

Page 3: Cs437 lecture 16-18

• With an index on the combined search key (branch-name, balance) and the where clause

    where branch-name = “Perryridge” and balance = 1000

  the index on the combined search key will fetch only records that satisfy both conditions. Using separate indices is less efficient: we may fetch many records (or pointers) that satisfy only one of the conditions.

• Can also efficiently handle

    where branch-name = “Perryridge” and balance < 1000

Indices on Multiple Attributes
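
A combined search key behaves like a sorted list of (branch-name, balance) pairs. The sketch below is a minimal illustration under that assumption: a sorted Python list and `bisect` stand in for an ordered index, and the sample rows and function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical rows: (account_number, branch_name, balance).
ROWS = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 400),
    ("A-201", "Perryridge", 1000),
    ("A-215", "Mianus", 1000),
    ("A-218", "Perryridge", 1000),
]

# Index entries ordered by (branch_name, balance), with the account number as payload.
COMPOSITE_INDEX = sorted((branch, balance, acct) for acct, branch, balance in ROWS)


def lookup_equal(branch, balance):
    """branch-name = branch and balance = balance: one seek, then a short scan."""
    position = bisect_left(COMPOSITE_INDEX, (branch, balance))
    result = []
    while position < len(COMPOSITE_INDEX):
        entry_branch, entry_balance, acct = COMPOSITE_INDEX[position]
        if (entry_branch, entry_balance) != (branch, balance):
            break
        result.append(acct)
        position += 1
    return result


def lookup_balance_below(branch, limit):
    """branch-name = branch and balance < limit: a contiguous slice of the index."""
    start = bisect_left(COMPOSITE_INDEX, (branch,))
    end = bisect_left(COMPOSITE_INDEX, (branch, limit))
    return [acct for _, _, acct in COMPOSITE_INDEX[start:end]]
```

Both lookups touch only index entries that satisfy both conditions, which is the advantage the slide describes over separate single-attribute indices.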

Page 4: Cs437 lecture 16-18

Sample account File

Page 5: Cs437 lecture 16-18

Hash Function of branch-name

Page 6: Cs437 lecture 16-18

Bitmap Indices on Relation customer-info

Page 7: Cs437 lecture 16-18

• Primary Index

• Secondary Index

• Sparse Index vs Dense Index

• Indexing Techniques

• Primary versus secondary indexing.

• Single index access versus scanning.

• Combining multiple indexes.

What we studied

Note: https://msdn.microsoft.com/en-us/library/ms191195.aspx

Page 8: Cs437 lecture 16-18

• PI for a table (in Teradata) is a specification of its partitioning column(s).

• PI may be defined as unique (UPI) or non-unique (NUPI).

    • Automatic enforcement of uniqueness when UPI is specified.

• PI provides an implicit access path to any row just by knowing its value.

• Only one PI per table.

• PI can be on multiple columns, i.e., composite.

Primary Index

Page 9: Cs437 lecture 16-18

Primary index selection criteria:

• Common join and retrieval key.

• Distributes rows evenly across database partitions.

• Less than 10,000 rows per PI value when non-unique.

WHY?

Primary Index

Page 10: Cs437 lecture 16-18

Trick question: What should be the primary index of the transaction table for a large financial services firm?

create table tx
(tx_id decimal (15,0) NOT NULL
,account_id decimal (10,0) NOT NULL
,tx_amt decimal (15,2) NOT NULL
,tx_dt date NOT NULL
,tx_cd char (2) NOT NULL
....
) primary index (???);

Answer: It depends.

Primary Index

Page 11: Cs437 lecture 16-18

• Almost all joins and retrievals will come in through the account_id foreign key.

    • Want account_id as NUPI.

• But if data is “lumpy” when distributed on account_id, or if accounts have very large numbers of transactions (e.g., an institutional account could easily have 10,000+ transactions):

    • Want tx_id as UPI for good data distribution.

Primary Index
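
The “lumpy” case can be illustrated with a toy distribution. This is only a sketch: CRC32 modulo a fixed unit count stands in for Teradata's row hash (which works differently), and the unit count, workload, and names are made-up assumptions.

```python
import zlib
from collections import Counter

NUM_VAMPS = 4  # hypothetical number of parallel units


def vamp_for(pi_value):
    """Stand-in row hash: CRC32 of the PI value, modulo the number of units."""
    return zlib.crc32(str(pi_value).encode("ascii")) % NUM_VAMPS


def rows_per_vamp(pi_values):
    """Count how many rows land on each unit for a given choice of PI."""
    counts = Counter(vamp_for(value) for value in pi_values)
    return [counts[vamp] for vamp in range(NUM_VAMPS)]


# Hypothetical workload: one institutional account with 10,000 transactions
# and 100 retail accounts with 10 transactions each.
transactions = []  # list of (tx_id, account_id)
next_tx_id = 1
for _ in range(10_000):
    transactions.append((next_tx_id, 1))
    next_tx_id += 1
for account_id in range(2, 102):
    for _ in range(10):
        transactions.append((next_tx_id, account_id))
        next_tx_id += 1

by_account = rows_per_vamp(account_id for _, account_id in transactions)
by_tx = rows_per_vamp(tx_id for tx_id, _ in transactions)
```

With account_id as the PI, every transaction of the institutional account hashes to the same unit, so one entry of `by_account` is at least 10,000; with tx_id as the PI the rows spread across the units.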

Page 12: Cs437 lecture 16-18

• Joins and access via primary index are very efficient due to Teradata’s sophisticated row hashing algorithms that allow going directly to the data block containing the desired row.

• Single I/O operation for accessing a data row via UPI.

• Single I/O operation for accessing a data row via NUPI whenever all rows with the same PI value fit into a single block.

• Single VAMP operation for indexed retrieval.

• No spool space required.

Primary Index

Page 13: Cs437 lecture 16-18

• Primary index is free!

    • No storage cost.

    • No index build required.

• This is a direct result of the underlying hash-based file system implementation.

• OLTP databases use a page-based file system and therefore do not deliver this performance advantage.

Primary Index

Page 14: Cs437 lecture 16-18

• Any index that is NOT a PI.

• SI structures are implemented using the same underlying structure as base tables (often referred to as sub-tables).

• SI may be defined as unique (USI) or non-unique (NUSI).

    • Automatic enforcement of uniqueness when USI is specified.

• Up to thirty-two SIs per table in Teradata.

• Unlike a primary index, SIs are not “free” in terms of storage.

• SI is NOT required BUT desired.

Secondary Index(SI)

Page 15: Cs437 lecture 16-18

Lecture 17

Dr. Fawad Hussain

Primary and Secondary Indices - II

Page 16: Cs437 lecture 16-18

Primary index selection criteria:

• Common join and retrieval key.

• Distributes rows evenly across database partitions.

• Less than 10,000 rows per PI value when non-unique.

WHY?

Recall - 1

Page 17: Cs437 lecture 16-18

• Secondary Index (SI)

• Any index that is NOT a PI.

• SI structures are implemented using the same underlying structure as base tables (often referred to as sub-tables).

• SI may be defined as unique (USI) or non-unique (NUSI).

    • Automatic enforcement of uniqueness when USI is specified.

• Up to thirty-two SIs per table in Teradata.

• Unlike a primary index, SIs are not “free” in terms of storage.

• SI is NOT required BUT desired.

Recall - 2

Page 18: Cs437 lecture 16-18

• A non-unique secondary index (NUSI) is partitioned so that each index entry is co-located on the same VAMP (Virtual Access Module Processor) with its corresponding row in the base table.

• Each row access via a NUSI is a single-VAMP operation because the NUSI entry and data row are co-located.

• NUSI access is always performed in parallel across all VAMPs whenever it is appropriate to do so.

Secondary Index (NUSI)

Page 19: Cs437 lecture 16-18

Compressed ROWID index structure:

• Hash on index value to get block location (ROWID for sub-table).

• Store index value just once, followed by all ROWIDs in base table corresponding to the index value.

• Sorted by ROWID to facilitate maximum efficiency when accessing base table, performing updates and deletes, etc.

• Additional blocks allocated when NUSI is non-selective and compressed ROWID structure for the index value exceeds 64K.

Secondary Index (NUSI)
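
The compressed ROWID layout can be pictured as one entry per index value that carries a sorted list of base-table ROWIDs. The class below is a minimal sketch under that reading: a Python dict stands in for the hash lookup, ROWIDs are plain integers, and the 64K overflow into additional blocks is not modeled. The class and method names are hypothetical.

```python
from bisect import insort


class NusiSubtable:
    """Toy NUSI sub-table: each index value is stored once with its sorted ROWIDs."""

    def __init__(self):
        # Dict hashing stands in for "hash on index value to get block location".
        self._entries = {}

    def add(self, index_value, rowid):
        """Record that the base-table row `rowid` carries `index_value`."""
        insort(self._entries.setdefault(index_value, []), rowid)

    def remove(self, index_value, rowid):
        """Drop one ROWID; drop the whole entry when its list becomes empty."""
        rowids = self._entries[index_value]
        rowids.remove(rowid)
        if not rowids:
            del self._entries[index_value]

    def rowids(self, index_value):
        """Return the ROWIDs for a value in ascending order (empty if absent)."""
        return list(self._entries.get(index_value, []))
```

Keeping the list sorted means the base table is visited in ROWID order, which is the efficiency point the slide makes.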

Page 20: Cs437 lecture 16-18

[Figure: Secondary Index (NUSI). Diagram labels: Non-Unique Secondary Index Value, Hashing Algorithm, NUSI Sub-table, Base Table.]

Page 21: Cs437 lecture 16-18

• Building a NUSI helps when the selectivity of the indexed column is very high (that is, a predicate on it selects only a small fraction of the rows).

• Cost-based optimizer will determine when to access via NUSI:

    • Number of rows selected by NUSI must be less than number of blocks in the table to justify access via NUSI (assumes even distribution of rows with NUSI value within table).

    • Must also consider cost for reading the NUSI sub-table and building ROWID spool file.

Note that the extreme efficiency of table scanning in Teradata reduces the need for secondary indexing as compared to other databases.

When to build NUSI?
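
The first rule of thumb can be written as a one-line check. This is a sketch of that rule only: it ignores the cost of reading the NUSI sub-table and building the ROWID spool file, and the function name is hypothetical. The example figures (20M rows, 512 rows per block) are borrowed from the sample query used later in these lectures.

```python
import math


def nusi_access_justified(total_rows, rows_per_block, fraction_selected):
    """Return True when the rows selected via the NUSI are fewer than the table's blocks.

    Assumes rows with the NUSI value are evenly distributed within the table.
    """
    table_blocks = math.ceil(total_rows / rows_per_block)
    rows_selected = total_rows * fraction_selected
    return rows_selected < table_blocks
```

For a 20M-row table with 512 rows per block (39,063 blocks), a predicate selecting 4% of the rows (800,000 rows) fails the test, while one selecting 0.1% (20,000 rows) passes it.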

Page 22: Cs437 lecture 16-18

• A unique secondary index (USI) is partitioned by the unique column upon which the index is built.

• Row access via a USI is a two-VAMP operation.

    • First I/O is initiated on the VAMP with the USI entry.

    • Second I/O is initiated on the VAMP with the data row entry.

Secondary Index (USI)

Page 23: Cs437 lecture 16-18

When to Build a USI?

• To allow data access without an all-VAMP operation.

• Increased efficiency for (very) high selectivity retrievals.

• Obtain co-location of index with frequently joined tables.

When to build USI?

Page 24: Cs437 lecture 16-18

Example:

create table order_header
(order_id decimal(12, 0) NOT NULL
,customer_id decimal(9, 0) NOT NULL
,order_dt date NOT NULL
...
) primary index( customer_id );

create unique index oh_order_idx (order_id) on order_header;

create table order_detail
(order_id decimal(12, 0) NOT NULL
,product_id integer NOT NULL
,extended_price_amt decimal(15,2) NOT NULL
,item_cnt integer NOT NULL
...
) primary index( order_id );

When to build USI?

Page 25: Cs437 lecture 16-18

Example: How many customers ordered green socks in the last month? (Assume that green socks is quite selective.)

select count(distinct order_header.customer_id)
from order_header
,order_detail
,product
where order_header.order_id = order_detail.order_id
and order_header.order_dt > add_months(date, -1)
and order_detail.product_id = product.product_id
and product.product_subcategory_cd = 'SOCKS'
and product.color_cd = 'GREEN'
;

The order_id USI on the order_header table obviates the need for all-VAMP duplication of the spool result of the order_detail-to-product join when joining to the order_header table.

When to build USI?

Page 26: Cs437 lecture 16-18

Lecture 18

Dr. Fawad Hussain

Primary and Secondary Indices - III

Page 27: Cs437 lecture 16-18

Example: What is the average age (in years) of customers who live in California or Massachusetts, completed a graduate degree, are consultants, and have a hobby of volleyball or chess?

select avg( (days(date) - days(customer.birth_dt)) / 365.25 )
from customer
where customer.state_cd in ('CA', 'MA')
and customer.education_cd = 'G'
and customer.occupation_cd = 'CONSULTANT'
and customer.hobby_cd in ('VOLLEYBALL', 'CHESS')
;

A Simple Query

Page 28: Cs437 lecture 16-18

Assume:

• 20M customers.

• 128-byte rows.

• 64K data block size.

Results in approximately 512 rows per block and a total of 39,063 blocks in the customer table.

Note: We are ignoring block overhead for purposes of simplicity in calculations.

Sample Query Structure
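
The two derived figures follow directly from the stated assumptions. A short check, with variable names chosen here for illustration and block overhead ignored as on the slide:

```python
import math

CUSTOMERS = 20_000_000      # 20M customers
ROW_BYTES = 128             # 128-byte rows
BLOCK_BYTES = 64 * 1024     # 64K data block

# Rows that fit in one block, ignoring block overhead.
rows_per_block = BLOCK_BYTES // ROW_BYTES

# Blocks needed for the whole table; the last block is only partly full.
table_blocks = math.ceil(CUSTOMERS / rows_per_block)
```

Here 65,536 / 128 gives 512 rows per block, and 20,000,000 / 512 = 39,062.5 rounds up to 39,063 blocks.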

Page 29: Cs437 lecture 16-18

Assume:

• 8% of customers live in California.

• 4% of customers live in Massachusetts.

• 4% of customers have completed a graduate degree.

• 6% of customers are consultants.

• 2% of customers have a primary hobby of chess.

• 3% of customers have a primary hobby of volleyball.

Data Demographics

Page 30: Cs437 lecture 16-18

• Must read every block in the table.

• Apply where clause predicates to determine which customers to include in average.

• Adjust numerator and denominator of average as appropriate.

Total I/O count = 39,063

Note: Data demographics have no (minimal) impact on query performance when using a full table scan operation.

Full Table Scan

Page 31: Cs437 lecture 16-18

B-tree or hash organization of column values:

• Index entries store row IDs (RIDs), lists of RIDs, or pointers to lists of RIDs.

• Originally designed for columns with many unique values (OLTP legacy).

• Assuming an eight-byte RID, we will get roughly 8096 RIDs per 64K block (the exact quotient, 65,536 / 8, is 8,192).

Single Index Structure

Page 32: Cs437 lecture 16-18

• Optimizer chooses index with best selectivity based on values specified in query.

    1. Access next (first) index entry corresponding to specified column value(s).

    2. Use RID from index entry to locate row with specified column value.

    3. Validate remaining predicates to qualify row.

    4. Adjust average as appropriate.

    5. Go to 1 until no more matching index values.

Single Index Access
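
The access loop can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the five sample customers, a precomputed `age_years` field in place of the birth-date arithmetic, and a dict of RID lists standing in for the B-tree or hash index.

```python
# Hypothetical customer rows; list position plays the role of the RID.
CUSTOMERS = [
    {"state_cd": "CA", "education_cd": "G", "occupation_cd": "CONSULTANT", "hobby_cd": "CHESS", "age_years": 40.0},
    {"state_cd": "MA", "education_cd": "G", "occupation_cd": "CONSULTANT", "hobby_cd": "VOLLEYBALL", "age_years": 30.0},
    {"state_cd": "CA", "education_cd": "G", "occupation_cd": "TEACHER", "hobby_cd": "CHESS", "age_years": 55.0},
    {"state_cd": "NY", "education_cd": "G", "occupation_cd": "CONSULTANT", "hobby_cd": "CHESS", "age_years": 25.0},
    {"state_cd": "CA", "education_cd": "B", "occupation_cd": "CONSULTANT", "hobby_cd": "CHESS", "age_years": 60.0},
]


def build_index(rows, column):
    """Map each column value to the list of RIDs holding it."""
    index = {}
    for rid, row in enumerate(rows):
        index.setdefault(row[column], []).append(rid)
    return index


def average_age_via_education_index(rows, education_index):
    """Drive the query from the education_cd index; return (average, rows fetched)."""
    total, count, fetched = 0.0, 0, 0
    for rid in education_index.get("G", []):      # step 1: next matching index entry
        row = rows[rid]                           # step 2: follow the RID to the row
        fetched += 1
        if (row["state_cd"] in ("CA", "MA")       # step 3: validate remaining predicates
                and row["occupation_cd"] == "CONSULTANT"
                and row["hobby_cd"] in ("VOLLEYBALL", "CHESS")):
            total += row["age_years"]             # step 4: adjust the running average
            count += 1
    return (total / count if count else None), fetched


EDUCATION_INDEX = build_index(CUSTOMERS, "education_cd")
```

Note that every row with a graduate degree is fetched, including those later rejected by the other predicates; that fetch count is what drives the I/O cost of single index access.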

Page 33: Cs437 lecture 16-18

• What are my indexing choices?

    • state_cd (8% + 4% = 12% selectivity)

    • education_cd (4% selectivity)

    • occupation_cd (6% selectivity)

    • hobby_cd (2% + 3% = 5% selectivity)

Choose education_cd because it has best selectivity.

Single Index Access
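
The choice amounts to summing the percentages of the values named in each predicate and taking the smallest total. A minimal sketch using the stated demographics (the dictionary layout and names are illustrative assumptions; percentages are kept as integers):

```python
# Percentage of customers per value, taken from the data demographics.
VALUE_PERCENT = {
    "state_cd": {"CA": 8, "MA": 4},
    "education_cd": {"G": 4},
    "occupation_cd": {"CONSULTANT": 6},
    "hobby_cd": {"CHESS": 2, "VOLLEYBALL": 3},
}

# Values requested by the query for each indexed column.
QUERY_VALUES = {
    "state_cd": ["CA", "MA"],
    "education_cd": ["G"],
    "occupation_cd": ["CONSULTANT"],
    "hobby_cd": ["VOLLEYBALL", "CHESS"],
}

# Percentage of rows each single-column index would select.
selected_percent = {
    column: sum(VALUE_PERCENT[column][value] for value in values)
    for column, values in QUERY_VALUES.items()
}

# The most selective index is the one that selects the fewest rows.
best_index = min(selected_percent, key=selected_percent.get)
```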

Page 34: Cs437 lecture 16-18

Access via index on education_cd:

• 800,000 RIDs (4% of 20M)

• 99 blocks of RIDs to read

But...4% selectivity with 512 rows per block in the base table means that 800,000 selected RIDs will cause access to every block in the base table!

Total I/O count = 39,063 + 99 = 39,162

Worse than full table scan!

Single Index Access Performance

Page 35: Cs437 lecture 16-18

How do we calculate that number?

The selectivity on education_cd is 4%, i.e., 800,000 rows (8 lakh). If these rows were stored consecutively, we would have to access only

800,000 rows / 512 rows per block = 1,563 blocks.

However, assuming the qualifying rows are evenly distributed, 4% selectivity means that on average

0.04 * 512 = 20.48 desired rows are found in each block. Every block therefore holds qualifying rows and must be read, so the total I/O required is

800,000 / 20.48 = 39,062.5, i.e., 39,063 blocks,

plus the additional 99 blocks for accessing the index (800,000 RIDs / 8096 RIDs per block = 99 blocks of RIDs to read, rounded up).

The calculations
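
The same arithmetic can be checked in code, together with one extra step that is an assumption of this sketch rather than part of the slides: if each row qualifies independently with probability 4%, the chance that a 512-row block contains no qualifying row at all is 0.96^512, which is vanishingly small. The figure of 8096 RIDs per block is the one used on the slides; with 8,192 the index would take 98 blocks instead of 99.

```python
import math

CUSTOMERS = 20_000_000
ROWS_PER_BLOCK = 512
RIDS_PER_INDEX_BLOCK = 8096   # figure used on the slides
SELECTIVITY = 0.04            # education_cd = 'G'

selected_rows = round(CUSTOMERS * SELECTIVITY)
table_blocks = math.ceil(CUSTOMERS / ROWS_PER_BLOCK)

# Best case: all qualifying rows stored consecutively.
best_case_blocks = math.ceil(selected_rows / ROWS_PER_BLOCK)

# Even distribution: average number of qualifying rows in every block.
expected_hits_per_block = SELECTIVITY * ROWS_PER_BLOCK

# Assumption of this sketch: rows qualify independently of one another.
p_block_untouched = (1 - SELECTIVITY) ** ROWS_PER_BLOCK
expected_blocks_read = table_blocks * (1 - p_block_untouched)

# Index blocks holding the selected RIDs, plus the base-table blocks.
index_blocks = math.ceil(selected_rows / RIDS_PER_INDEX_BLOCK)
total_io = math.ceil(expected_blocks_read) + index_blocks
```

Under these assumptions the expected number of base-table blocks read is indistinguishable from the full 39,063, which is why index access ends up costing more than the full table scan.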