lect12-c

  • 7/28/2019 lect12-c

    1/12

    CPSC 231 Sorting Large Files (D.H.) 1

    LEARNING OBJECTIVES

    Sorting of large files
        merge sort
        performance of merge sort
        multi-step merge sort

    Sorting of Large Files

    If a file is too large to be sorted in main memory, it has to be sorted on the disk.

    Example: if a file consists of 8 000 000 records and each record is 100 bytes long, the file size is approximately 800 MB. If a computer has 8 MB of RAM available for sorting, only a small part of this file fits into main memory.
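    The arithmetic in this example can be checked with a short sketch. It assumes decimal megabytes (1 MB = 10**6 bytes) and that all 8 MB of RAM can hold records; the variable names are illustrative only.

```python
RECORD_SIZE = 100          # bytes per record (from the example)
NUM_RECORDS = 8_000_000    # records in the file (from the example)
RAM_BYTES = 8 * 10**6      # 8 MB for sorting; assumes 1 MB = 10**6 bytes

# Total file size and how much of it fits into memory at once.
file_bytes = NUM_RECORDS * RECORD_SIZE
records_in_memory = RAM_BYTES // RECORD_SIZE
fraction_in_memory = RAM_BYTES / file_bytes

print(file_bytes // 10**6, "MB in the file")
print(records_in_memory, "records fit into memory")
print(f"{fraction_in_memory:.0%} of the file fits into memory")
```

    Under these assumptions only one hundredth of the file is in memory at any time, which is why the sort has to work on the disk.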

    Merge Sort

    If we do not have enough available RAM to sort the entire file, we can sort parts of the file, save the sorted sub-files (runs) on the disk, and then use a K-way merge to produce the sorted file.

    A run is a sorted subset of the file that is used later to sort the entire file. Runs can be created using heap sort.

    What is the maximum size of a run in the example on the previous slide?
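    The two phases described here (form sorted runs, then K-way merge them) can be sketched as follows. This is a minimal in-memory illustration, not the lecture's implementation: the runs are kept as Python lists instead of disk files, and the function names `make_runs` and `k_way_merge` are hypothetical.

```python
import heapq
from itertools import islice

def make_runs(records, run_size):
    """Split the input into sorted runs of at most run_size records each."""
    it = iter(records)
    runs = []
    while True:
        chunk = list(islice(it, run_size))   # as much as "fits in memory"
        if not chunk:
            break
        # Heap sort: build a heap, then pop the records in ascending order.
        heapq.heapify(chunk)
        runs.append([heapq.heappop(chunk) for _ in range(len(chunk))])
    return runs

def k_way_merge(runs):
    """Merge K sorted runs into one sorted list."""
    return list(heapq.merge(*runs))

records = [42, 7, 19, 3, 88, 61, 5, 23, 14, 70]
runs = make_runs(records, run_size=4)   # three runs: 4 + 4 + 2 records
merged = k_way_merge(runs)
print(runs)
print(merged)
```

    In a real external sort each run would be written to disk after it is formed, and the merge would read the runs back through K input buffers.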

    Pros of the Merge Sort

    Can sort very large files.

    Reading of the input file is sequential.

    Reading of each run and writing of the output file are also sequential.

    If heap sort is used for sorting the runs, we can overlap I/O and sorting.

    Since I/O is largely sequential, this method can be used for sorting files on tapes.

    See fig. 8.21 p.320

    Performance of Merge Sort

    Merge sort requires I/O time for the following operations:

        reading all records into memory for sorting and forming runs

        writing sorted runs to disk

        reading sorted runs into main memory for merging

        writing the sorted file to the disk
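    Each of the four operations above moves the whole file once, so the total data transferred is four times the file size. A small sketch, assuming the 800 MB file from the earlier example (the dictionary keys are just labels chosen for illustration):

```python
FILE_MB = 800   # file size from the earlier example

# Every operation transfers the entire file exactly once.
io_steps = {
    "read records to form runs": FILE_MB,
    "write sorted runs": FILE_MB,
    "read runs for merging": FILE_MB,
    "write sorted file": FILE_MB,
}

total_mb = sum(io_steps.values())
print(total_mb, "MB transferred in total")
```

    The transfer volume is the same for every operation; what differs between them is the number of seeks, which dominates the cost of the merge.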

    Merge Sort versus Key Sort

    It takes approximately 6 minutes to sort the 800 MB file from our example on a Seagate Cheetah 9 hard disk (track-to-track seek time = 11 msec).

    It would have taken approximately 24 hours to sort the same file using the Key Sort algorithm.

    Sorting a File that is Even Larger

    To sort a file that is ten times larger we need to do more seeks on the disk: since the main memory is the same size, we have to create more runs and perform more seeks to merge those runs.

    It takes approximately 2 hours and 6 minutes to merge sort an 8 GB file on the Seagate Cheetah 9 disk drive.

    The cost of merging a bigger file

    The number of seeks needed to merge a file that is 10 times larger than the original file is 100 times larger. WHY?

    In general, for a K-way merge of K runs where each run is as large as the memory space available, the buffer size for each of the runs is:

        (1/K) * size of each run

    The number of seeks needed to merge a big file

    K seeks are needed to read all records in each individual run, because each buffer holds only 1/K of a run.

    Since there are K runs altogether, the merge operation requires:

        K * K = K^2 seeks.

    Thus if a file is N times bigger, N^2 times as many seeks are needed to merge it.
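    The K^2 growth can be checked numerically. The sketch below assumes the simplified model of these slides: every run is exactly as large as memory, memory is split evenly into K input buffers, and each buffer refill costs one seek. The function name `merge_seeks` is hypothetical, and 8 GB is taken as 8000 MB.

```python
def merge_seeks(file_mb, memory_mb):
    """Seeks for a one-step K-way merge under the simplified model."""
    k = file_mb // memory_mb   # number of runs, each as large as memory
    seeks_per_run = k          # each buffer holds 1/K of a run
    return k * seeks_per_run   # K runs * K seeks = K^2

small = merge_seeks(800, 8)     # 100 runs
large = merge_seeks(8000, 8)    # 1000 runs (file is 10 times larger)
print(small, large, large // small)
```

    With the same 8 MB of memory, the file that is 10 times larger needs 10 times as many runs and therefore 100 times as many seeks.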

    How to improve performance of merge sort?

    Allocate more hardware: more main memory, multiple disk drives and I/O channels.

    Perform the merge in more than one step.

    Algorithmically increase the lengths of the initial sorted runs.

    Find ways to overlap I/O operations.

    Multi-Step Merge

    A multi-step merge is a merge in which not all runs are merged in one step. Rather, several sets of runs are merged separately, each set producing one long run consisting of the records from all its runs. These new, longer runs are then merged, either all together or in several sets.

    See example of a two-step merge: fig. 8.23 p.330
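    To see why a two-step merge can pay off, the seek counts can be compared under the same simplified model as before (initial runs as large as memory, memory split evenly among the input buffers, one seek per buffer refill). The grouping of 100 runs into 10 sets of 10 is an assumption chosen for illustration, not taken from the figure, and the function names are hypothetical.

```python
def one_step_seeks(num_runs):
    """K-way merge of K memory-sized runs: K seeks per run, K runs."""
    return num_runs * num_runs

def two_step_seeks(num_sets, runs_per_set):
    """Merge each set separately, then merge the resulting long runs."""
    # Step 1: num_sets independent merges of runs_per_set runs each.
    step1 = num_sets * runs_per_set * runs_per_set
    # Step 2: each long run is runs_per_set times the memory size, and
    # each buffer is 1/num_sets of memory, so one long run needs
    # runs_per_set * num_sets seeks.
    step2 = num_sets * (runs_per_set * num_sets)
    return step1 + step2

print(one_step_seeks(100))      # single 100-way merge
print(two_step_seeks(10, 10))   # 10 sets of 10 runs, then a 10-way merge
```

    In this model the two-step merge needs 2 000 seeks instead of 10 000, at the price of transferring every record one extra time.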

    Pros and Cons of Multi-Step Merge

    Con: it requires that each record be read twice (once to form the intermediate runs and again to form the final sorted file).

    Pros: we can create large runs by using bigger buffers and thus reduce the number of disk accesses. In some cases multi-step merge is the only reasonable way to perform a merge on tape if the number of tape drives is limited.