Ch06 Hashing


    HASHING


    Hashing Techniques

A record's placement is determined by the value in its hash field. A hash (randomizing) function is applied to this value to yield the address of the disk block where the record is stored. For most records, only a single block access is needed to retrieve the record.


    General hashing approach

A hash file for a relation R has N pages. Each page is called a bucket. Buckets are numbered from 0 to N-1. If a bucket gets full, an overflow page must be chained to it.

Each record t in R has a search key k, which can be made out of one or more attributes. Ex. Students(sid, name, login, age, gpa), with the age attribute as the search key.

A hash function is used to map the search key k of a record t in R to a bucket number in [0, N-1]. The hash function should distribute records uniformly. The record is then searched for inside that bucket.
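As a minimal sketch of this scheme (the bucket count N = 4, Python's built-in hash, and the sample Students records are illustrative assumptions, not from the slides):

```python
N = 4  # number of primary buckets (pages) in the hash file

def bucket_number(search_key):
    """Map a search key (here, the age attribute) to a bucket in [0, N-1]."""
    return hash(search_key) % N

buckets = [[] for _ in range(N)]   # each bucket holds the records hashed to it

def insert(record):
    buckets[bucket_number(record["age"])].append(record)

def search(age):
    # Only one bucket has to be scanned, not the whole file.
    return [r for r in buckets[bucket_number(age)] if r["age"] == age]

insert({"sid": 53666, "name": "Jones", "login": "jones@cs", "age": 18, "gpa": 3.4})
insert({"sid": 53688, "name": "Smith", "login": "smith@ee", "age": 19, "gpa": 3.2})
print(search(18))
```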


[Figure: example hash file hashing on the city attribute. The records (121, Jil, NY, $5595), (123, Bob, NY, $102), (1237, Pat, WI, $30), (2381, Bill, LA, $500), (8387, Ned, SJ, $73), (4882, Al, SF, $52303), (9403, Ned, NY, $3333), and (81982, Tim, MIA, $4000) are mapped by H(city) to buckets for LA, NY, MIA, SJ, SF, and WI.]


    Internal Hashing (cont)

Collisions occur when the hash field value of a record being inserted hashes to an address that already contains a different record.

The process of finding another position for this record is called collision resolution.


    Collision Resolution

Open addressing - places the record to be inserted in the first available position subsequent to the hash address.

Chaining - a pointer field is added to each record location. When an overflow occurs, this pointer is set to point to overflow blocks, making a linked list.
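Both strategies can be sketched as follows, assuming an in-memory table of size 7 and a simple mod hash function (all names and values here are illustrative):

```python
M = 7  # table size for internal hashing

def h(key):
    return key % M

# Open addressing: probe the positions following the hash address
# until a free slot is found.
open_table = [None] * M

def insert_open(key):
    pos = h(key)
    for i in range(M):
        slot = (pos + i) % M
        if open_table[slot] is None:
            open_table[slot] = key
            return slot
    raise OverflowError("table is full")

# Chaining: every slot heads a list of all records that hashed to it.
chain_table = [[] for _ in range(M)]

def insert_chained(key):
    chain_table[h(key)].append(key)

for k in (10, 17, 24):       # all three collide on slot 3
    insert_open(k)
    insert_chained(k)
print(open_table)            # 10, 17, 24 occupy consecutive slots
print(chain_table[3])        # [10, 17, 24]
```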


    Collision Resolution (cont)

Multiple hashing - if an overflow occurs, a second hash function is used to find a new location. If that location is also filled, either another hash function is applied or open addressing is used.


    Goals of the Hash Function

The goals of a good hash function are to uniformly distribute the records over the address space while minimizing collisions, to avoid wasting space.

Research has shown that a fill ratio of 70% to 90% works best, and that when a mod function is used, M should be a prime number.
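A small illustration of the prime-modulus advice (the keys, deliberately chosen to share a common factor with the composite M, are made up):

```python
from collections import Counter

keys = range(0, 400, 8)   # hypothetical keys that all share a common factor

for M in (64, 67):        # 64 is composite, 67 is prime
    used = Counter(k % M for k in keys)
    print(f"M = {M}: {len(used)} of {M} buckets receive records")
```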


    External Hashing for Disk Files

External hashing makes use of buckets, each of which can hold multiple records.

A bucket is either a block or a cluster of contiguous blocks.

The hash function maps a key into a relative bucket number, rather than an absolute block address for the bucket.


    Types of External Hashing

Static hashing: uses a fixed address space.

Dynamic hashing: uses a dynamically changing address space.
Extendible hashing: with a directory.
Linear hashing: without a directory.


    Static Hashing

Under static hashing, a fixed number of buckets (M) is allocated.

Based on the hash value, a bucket number is determined in the block directory array, which yields the block address.

If n records fit into each block, this method allows up to n*M records to be stored.


    Static Hashing: Scheme

[Figure: a hash function h maps a search key k to one of the primary buckets numbered 0 to N-1; full primary buckets chain to overflow pages.]


    Static Hashing: Issues

The number of primary buckets is fixed at file creation. The hash function maps a key to a bucket number.

A typical hash function is H(k) = a*k + b, with bucket number = H(k) mod N.

The key can be an int or a char string. For char keys, each character is mapped to its ASCII code and all values are added to get an integer.

The parameters a and b are chosen to tune the distribution of values (i.e., you need to play with these values to get them right).

When a primary bucket gets full, an overflow page must be created and chained to the primary bucket.
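A sketch of this hash function (the values of a, b, and N are arbitrary placeholders, and the ASCII-summing helper is illustrative):

```python
N = 16          # number of primary buckets
a, b = 31, 7    # tuning parameters; chosen by experimentation in practice

def key_to_int(key):
    """Character keys are mapped to an integer by summing ASCII codes."""
    if isinstance(key, str):
        return sum(ord(c) for c in key)
    return key

def bucket(key):
    k = key_to_int(key)
    return (a * k + b) % N        # H(k) = a*k + b, then mod N

print(bucket(53666))    # int search key
print(bucket("jones"))  # char search key
```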


    Static hashing: Operations

Search for key value k: hash k to find the bucket, call this bucket B, and search the records in B for the one(s) with key k.

If the records are found:
Clustered: the data record is there. Cost: 1 I/O.
Unclustered: need to fetch the actual data page. Cost: 2 I/Os.

If the records are not found, the overflow pages (if there are any) must be searched as well:
Clustered cost: (1 + number of overflow pages searched) I/Os.
Unclustered cost: (2 + number of overflow pages searched) I/Os.

The more overflow pages you have, the worse the performance gets. Overflow chains should be kept to 1 or 2 pages, but this rarely gets done!


    Static Hashing: Operations (cont.)

Insert (or update) a record with key k: hash k to find the bucket, call this bucket B.

If the bucket has room:
Clustered: write the data record there. Cost: 2 I/Os (read the page, then write it back updated).
Unclustered: also write the record to the actual data page. Cost: 4 I/Os.

If the bucket is full, write to an overflow page (creating one if needed):
Clustered cost: (2 + number of overflow pages searched) I/Os.
Unclustered cost: (4 + number of overflow pages searched) I/Os.

Delete costs are the same, since the page must be written back to disk. Again, overflow pages make performance worse as the number of records increases.
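A simplified sketch of search and insert against primary buckets with chained overflow pages (the Page class, the tiny capacity, and the I/O counting are assumptions made for illustration; clustered vs. unclustered data pages are not modelled):

```python
N = 4             # primary buckets
CAPACITY = 2      # records per page (kept tiny for illustration)

class Page:
    def __init__(self):
        self.records = []      # (key, record) pairs stored on this page
        self.overflow = None   # chained overflow page, if any

primary = [Page() for _ in range(N)]

def search(key):
    """Return matching records plus the number of page I/Os performed."""
    page = primary[hash(key) % N]
    ios, found = 1, []                       # 1 I/O for the primary bucket
    while page is not None:
        found.extend(r for k, r in page.records if k == key)
        if page.overflow is not None:
            ios += 1                         # each overflow page is one more I/O
        page = page.overflow
    return found, ios

def insert(key, record):
    page = primary[hash(key) % N]
    while len(page.records) >= CAPACITY:     # page full: follow/extend the chain
        if page.overflow is None:
            page.overflow = Page()           # create a new overflow page
        page = page.overflow
    page.records.append((key, record))

for k in (8, 12, 16, 20):                    # all four keys hash to bucket 0
    insert(k, f"rec{k}")
print(search(16))                            # found in an overflow page: 2 I/Os
```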


    Extensible hashing

Extensible hashing allows the number of buckets to grow or shrink. The hash function hashes to slots in a directory; each slot stores the page id of a bucket.

The directory can be kept in the buffer pool and can have hundreds or thousands of slots pointing to buckets.

When a bucket gets full, a new bucket is created and the records are split between the new and the full bucket. The data is redistributed, and the hash function still works!

An overflow page is needed only if there are many duplicate records.


    Extendible Hashing

In extendible hashing, a directory is maintained as an array of 2^d bucket addresses, where d refers to the first d high-order (left-most) bits and is referred to as the global depth of the directory. However, there does NOT have to be a DISTINCT bucket for each directory entry.

A local depth d' is stored with each bucket to indicate the number of bits used for that bucket.


    Binary Pattern Hashing Technique

The hash function maps a search key to a binary pattern, e.g. h(51) = 00110011.

The last d bits of the pattern are taken as the bucket number. Ex.: if d = 2, then h(51) = 00110011 yields the bit pattern 11, which is 3 in decimal; thus 51 goes to bucket 3.

The number d of bits used to hash the search key is called the depth. There are two types of depths:
Bucket depth: the number of bits needed to hash a value to a given bucket.
File depth: the largest depth of any bucket in the hash file.
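A minimal sketch of taking the last d bits as the bucket number (the helper name is illustrative; as in the slide's example, the hash pattern here is simply the key's own binary representation):

```python
def bucket_number(hash_value, d):
    """Take the last d bits of the binary hash pattern as the bucket number."""
    return hash_value & ((1 << d) - 1)

h51 = 0b00110011                  # pretend h(51) produced this pattern
print(bucket_number(h51, d=2))    # 3 -> bit pattern 11, so key 51 goes to bucket 3
print(bucket_number(h51, d=3))    # 3 -> bit pattern 011 at depth 3
```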


    Example Extensible Hashing Index

[Figure: example index hashing on an int attribute. The directory page has global depth 2 and slots 00, 01, 10, 11. Bucket A (local depth 2) holds 4, 12, 32, 16; Bucket B (local depth 2) holds 1, 5, 21; Bucket C (local depth 2) holds 10; Bucket D (local depth 2) holds 15, 7, 19. For value 4, H(4) = 100 and d = 2 gives slot 00; the bucket is then found from that slot.]


    The use of the depth

The depth tells us the number of bits we need to use to pick a bucket. Ex.: H(4) = 100 with d = 2 tells us to use the bits 00 to identify the slot; this would be slot 00.

The directory has a global depth, used to hash a key to the proper slot. Each bucket has a local depth, used when the bucket needs to be split.

Let us see what happens when we need to insert the value 22 into the hash index: H(22) = 10110, d = 2.


Hash File Before Inserting 22

[Figure: the hash index before the insert, identical to the previous example: directory (global depth 2) with slots 00, 01, 10, 11; Bucket A holds 4, 12, 32, 16; Bucket B holds 1, 5, 21; Bucket C holds 10; Bucket D holds 15, 7, 19; all buckets have local depth 2.]


    Hash File After Inserting 22

[Figure: the hash index after inserting 22. H(22) = 10110 and d = 2 gives slot 10, and that bucket has space, so Bucket C now holds 10 and 22. The other buckets and the directory are unchanged.]


    The issue of a full bucket: Insert 20

[Figure: inserting 20 into the same index. H(20) = 10100 and d = 2 gives slot 00, but Bucket A (holding 4, 12, 32, 16) is full.]


    Splitting a bucket

A full bucket gets split into two buckets; their directory slots are called corresponding elements.

These buckets have the same hash value at the current depth d, but at depth d+1 they differ by one bit: one has a 1 at bit position d+1, the other has a 0 at bit position d+1.

Example: Bucket A is split into two buckets, bucket A and bucket A2. Bucket A, with d = 2, has value 00, but at d = 3 becomes 000. Bucket A2, with d = 2, has value 00, but at d = 3 becomes 100.


    Splitting a bucket (cont.)

The values in the original bucket A and the new value to be inserted get distributed between buckets A and A2.

The local depth of buckets A and A2 is incremented to d+1, and keys are now hashed to these buckets using d+1 bits.

Recall that bucket A had 4, 12, 32, 16 with d = 2, and we wanted to insert 20. With d = 3 we get the following hashing:
4 = 100, so H(4) = 100
12 = 1100, so H(12) = 100
32 = 100000, so H(32) = 000
16 = 10000, so H(16) = 000
20 = 10100, so H(20) = 100
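The redistribution can be sketched as follows (bucket contents are taken from the example above; the function name is illustrative):

```python
def split_bucket(entries, d):
    """Split a full bucket of local depth d into two buckets of depth d + 1.
    Keys whose new (d+1-th) bit is 0 stay put; keys whose new bit is 1 move
    to the split image."""
    stay, image = [], []
    for key in entries:
        (image if (key >> d) & 1 else stay).append(key)
    return stay, image

# Bucket A held 4, 12, 32, 16 at local depth 2, and 20 is being inserted.
bucket_a, bucket_a2 = split_bucket([4, 12, 32, 16, 20], d=2)
print(bucket_a)     # [32, 16]     these now hash to pattern 000
print(bucket_a2)    # [4, 12, 20]  these now hash to pattern 100
```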


    Corresponding elements & buckets

[Figure: operation insert 20. The original bucket A (local depth 2, hash pattern 00) holding 4, 12, 32, 16 is split. After the split, bucket A (local depth 3, hash pattern 000) holds 32, 16, and its split image, bucket A2 (local depth 3, hash pattern 100), holds 4, 12, 20. The local depth is changed from 2 to 3: three bits are now needed for hashing.]


    Expanding the Directory

[Figure: after the split, the directory page still has global depth 2 with slots 00, 01, 10, 11, while bucket A (32, 16) and bucket A2 (4, 12, 20) both have local depth 3; buckets B, C, and D are unchanged at local depth 2. The number of slots is not enough: the directory must be doubled in size and the global depth increased.]


    Expanded Hash Index

[Figure: the expanded index. The directory page now has global depth 3 with slots 000 through 111. Bucket A (local depth 3) holds 32, 16; bucket A2 (local depth 3) holds 4, 12, 20; bucket B (local depth 2) holds 1, 5, 21; bucket C (local depth 2) holds 10; bucket D (local depth 2) holds 15, 7, 19.]


    Some Issues

Some corresponding elements point to the same bucket; this means the bucket has not been split.

Not all split operations cause the directory to be doubled. Each bucket has a local depth: if the local depth of a bucket is global depth - 1, splitting that bucket will not cause the directory to double.

Doubling only occurs when a bucket is full and cannot fit another insertion, and that bucket's local depth equals the global depth.


    Tradeoffs of Extensible Hashing

Advantages:
Can gracefully adapt to insertions and deletions.
Limits the number of overflow pages.
The hash function is easy to implement; no need for complex prime-number computations.

Disadvantages:
The directory can grow large when we have billions of records, and also when we have skewed data distributions:
Lots of values go to the same bucket; lots of empty buckets while a few hold all the data.
Overflow pages due to collisions (values that hash to the same bucket).
Too much doubling in the size of the directory.


    Shrinking Extendible Hashing Files

The generally used principle for shrinking extendible hashing files is to shrink when d > d' for all buckets after a deletion occurs.

Buckets may be combined when each of the buckets to be combined is less than half full and they have the same bit pattern except for the d'-th bit, e.g. d' = 3 and the bit patterns 110 and 111.


    Linear Hashing

Linear hashing is a dynamic hashing technique with no need for a directory. It limits overflow pages due to collisions, and splitting of buckets is done in a lazier fashion.

The idea is to have a family of hash functions h0, h1, h2, ..., where each function has a range twice as big as its predecessor: if h_i maps to M buckets, h_{i+1} maps to 2M buckets.

The next function is used when more buckets are needed: we switch from the current h_i to h_{i+1} when the number of buckets must grow beyond the current M (doubling the number of buckets to 2M).


    Linear Hashing

Linear hashing allows the hash file to expand and shrink its number of buckets dynamically without needing a directory.

It starts with M buckets numbered 0 to M-1 and uses the mod hash function h(K) = K mod M as the initial hash function, called h_i. Overflow is handled by chaining individual overflow chains to each bucket.

It works by methodically splitting the original buckets, starting with bucket 0: the contents of bucket 0 are redistributed between bucket 0 and bucket M (the new bucket) using a secondary hash function h_{i+1}(K) = K mod 2M.


    Linear Hashing (Cont)

This splitting of buckets is done in order (0, 1, ..., M-1) REGARDLESS of which bucket the collision occurred in. To keep track of the next bucket to be split we use n; after the first split, n is incremented to 1.

When a record hashes to a bucket less than n, we use the secondary hash function to determine which of the two buckets it belongs in.

When all of the original M buckets have been split, we have 2M buckets and n = M. We then reset M to 2M and n to 0, and our secondary hash function becomes our primary hash function.

Shrinking of the file is done based on the load factor, using the reverse of splitting.
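A minimal sketch of this splitting discipline (the bucket capacity and the split trigger are simplifying assumptions; a real implementation would use overflow chains and typically split based on the load factor):

```python
CAPACITY = 2                    # records per bucket before a split is triggered
M0 = 4                          # initial number of buckets
buckets = [[] for _ in range(M0)]
M, n = M0, 0                    # M = buckets at the start of this round,
                                # n = next bucket to be split

def address(key):
    b = key % M                 # primary hash function h_i(K) = K mod M
    if b < n:                   # bucket b was already split in this round,
        b = key % (2 * M)       # so use h_{i+1}(K) = K mod 2M instead
    return b

def split_next():
    """Split bucket n (regardless of where the overflow happened)."""
    global M, n
    buckets.append([])                      # the new bucket, number M + n
    old, buckets[n] = buckets[n], []
    for key in old:                         # redistribute with h_{i+1}
        buckets[key % (2 * M)].append(key)
    n += 1
    if n == M:                              # every original bucket has been split:
        M, n = 2 * M, 0                     # start a new round with 2M buckets

def insert(key):
    b = address(key)
    buckets[b].append(key)                  # an over-full list stands in for an
    if len(buckets[b]) > CAPACITY:          # overflow chain until its turn to split
        split_next()

for key in (3, 7, 11, 15, 19, 23):
    insert(key)
print(buckets, "next to split:", n)
```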


    Building the family of hash functions

The general form is h_i(key) = h(key) mod (2^i * N), where h(key) acts as the base function.

h(key) is the same as for extensible hashing: it looks at the bit pattern of the value.

If N is a power of 2 and d0 is the number of bits needed to represent N, then d_i = d0 + i gives the number of bits used by function h_i.
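In code, the family could look like this (the base function h and the value of N are placeholders):

```python
N = 4                          # initial number of buckets (a power of 2)
d0 = N.bit_length() - 1        # bits needed to address N buckets: d0 = 2 here

def h(key):
    return key                 # stand-in base function; uses the raw bit pattern

def h_i(key, i):
    """i-th member of the family: h_i(key) = h(key) mod (2^i * N)."""
    return h(key) % ((2 ** i) * N)

for i in range(3):
    print(f"h_{i}(21) = {h_i(21, i)}   (uses d{i} = d0 + {i} = {d0 + i} bits)")
```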


    General Scheme

The hash index file has an associated round number, called Level. At round number Level we use the hash functions h_Level and h_{Level+1}.

We keep track of the next bucket to be split. Buckets are split in a round-robin fashion, so every bucket eventually gets split.

The index file has three types of buckets:
Buckets that were split in this round.
Buckets that are yet to be split.
Buckets created by splits in this round.


    Organization of Index Hash File

[Figure: layout of the index hash file during a round. The original buckets in this Level are processed in order: buckets already split in this round, the next bucket to be split, and buckets not yet split (h_Level is used for these), followed by the new buckets resulting from splits in this round.]