7/28/2019 lect12-c
1/12
CPSC 231 Sorting Large Files (D.H.) 1
LEARNING OBJECTIVES
Sorting of large files
merge sort
performance of merge sort
multi-step merge sort
Sorting of Large Files
If a file is too large to be sorted in main
memory then it has to be sorted on the disk.
Example: If a file consists of 8 000 000
records and each record is 100 bytes long
then the file size is approximately 800 MB.
If a computer has 8 MB of RAM available
for sorting then only a small part of this file
would fit into main memory.
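The arithmetic behind this example can be checked with a short sketch. It assumes decimal megabytes (1 MB = 10**6 bytes) and that all 8 MB can be used for records; the constant names are illustrative only.

```python
# Figures from the slide; decimal megabytes are an assumption.
RECORD_COUNT = 8_000_000
RECORD_SIZE = 100              # bytes per record
RAM_FOR_SORTING = 8 * 10**6    # bytes available for sorting

file_size = RECORD_COUNT * RECORD_SIZE
records_in_memory = RAM_FOR_SORTING // RECORD_SIZE
fraction_in_memory = RAM_FOR_SORTING / file_size

print(f"file size: {file_size / 10**6:.0f} MB")
print(f"records that fit in memory at once: {records_in_memory}")
print(f"fraction of the file in memory: {fraction_in_memory:.0%}")
```

Under these assumptions only about 1% of the file can be held in memory at any one time.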
Merge Sort
If we do not have enough available RAM
to sort the entire file, we may sort parts of
the file, save the sorted sub-files (runs) on
the disk and then use a K-way merge to
sort the entire file.
A run is a sorted subset of the file which is used
later to sort the entire file. Runs can be
created using heap sort.
What is the maximum size of a run in the
example on the previous slide?
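The two phases described above (create sorted runs, then K-way merge them) might be sketched as follows. This is a minimal illustration, not the textbook's implementation: it assumes one text record per line, uses Python's built-in sort in place of heap sort, and the function names `create_runs`, `merge_runs` and `external_sort` are hypothetical.

```python
import heapq
import itertools
import os
import tempfile


def create_runs(input_path, max_records, tmp_dir):
    """Phase 1: read memory-sized chunks, sort each, write it out as a run."""
    run_paths = []
    with open(input_path) as src:
        while True:
            # Normalise line endings so every record compares the same way
            # here and later in the merge phase.
            chunk = [line if line.endswith("\n") else line + "\n"
                     for line in itertools.islice(src, max_records)]
            if not chunk:
                break
            chunk.sort()  # any in-memory sort works; the slides suggest heap sort
            run_path = os.path.join(tmp_dir, f"run{len(run_paths)}.txt")
            with open(run_path, "w") as run:
                run.writelines(chunk)
            run_paths.append(run_path)
    return run_paths


def merge_runs(run_paths, output_path):
    """Phase 2: K-way merge of all runs; each run is read front to back."""
    runs = [open(path) for path in run_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for run in runs:
            run.close()


def external_sort(input_path, output_path, max_records):
    """Sort a file that is larger than `max_records` records of memory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        run_paths = create_runs(input_path, max_records, tmp_dir)
        merge_runs(run_paths, output_path)
```

Here `max_records` plays the role of the maximum run size: the number of records that fit in the memory available for sorting.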
Pros of the Merge Sort
Can sort very large files.
Reading of the input file is sequential.
Reading of each run and writing of the output file are
also sequential.
If heap sort is used for sorting the runs
then we can overlap I/O and sorting.
Since I/O is largely sequential, this method
can be used for sorting files on tapes.
See fig. 8.21 p.320
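Why heap sort allows overlapping can be hinted at with a small sketch. It is single-threaded, so nothing actually runs concurrently; it only shows that the heap can absorb records block by block as they arrive and can release them one at a time in sorted order. The name `build_run` and the `blocks` argument (an iterable of record lists standing in for disk buffers) are hypothetical.

```python
import heapq


def build_run(blocks):
    """Heap the records block by block, then yield them in sorted order."""
    heap = []
    for block in blocks:
        # In a real system the next block could be read from disk while
        # the records of this block are being inserted into the heap.
        for record in block:
            heapq.heappush(heap, record)
    while heap:
        # Records come out smallest first, so an output buffer could be
        # written to disk while the remaining records are still in the heap.
        yield heapq.heappop(heap)


run = list(build_run([[5, 1], [3], [4, 2]]))
print(run)
```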
Performance of Merge Sort
Merge sort requires I/O time for the
following operations:
reading all records into memory for sorting and forming runs
writing sorted runs to disk
reading sorted runs into main memory
writing sorted file to the disk
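As a rough illustration of the four operations listed above, the sketch below totals the data transferred for the 800 MB example. It assumes each operation moves the whole file exactly once and ignores seek and rotational delays, so it says nothing about elapsed time.

```python
FILE_SIZE_MB = 800  # from the example earlier in these slides

io_steps = [
    "read all records to form runs",
    "write sorted runs to disk",
    "read sorted runs for merging",
    "write sorted file to disk",
]

# Assumption: every step transfers the entire file once.
total_mb = FILE_SIZE_MB * len(io_steps)

for step in io_steps:
    print(f"{step}: {FILE_SIZE_MB} MB")
print(f"total transferred: {total_mb} MB")
```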
Merge Sort versus Key Sort
It takes approximately 6 minutes to sort the
800 MB file from our example on a Seagate
Cheetah 9 hard disk (track to track seek
time = 11 msec).
It would have taken approximately 24
hours to sort the same file using the Key
Sort algorithm.
Sorting a File that is Even Larger
To sort a file that is ten times larger we
need to do more seeks on the disk (since the
main memory is the same, we have to create
more runs and perform more seeks to merge
those runs).
It takes approximately 2 hours and 6
minutes to merge sort an 8 GB file on the
Seagate Cheetah 9 disk drive.
The cost of merging a bigger file
The number of seeks needed to merge a file
that is 10 times larger than the original file
is 100 times larger. WHY?
In general, for a K-way merge of K runs
where each run is as large as the memory
space available, the buffer size for each of
the runs is:
(1/K) * size of each run
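The buffer-size formula can be tried on the running example. The sketch assumes decimal megabytes and that a run fills all 8 MB of memory, which gives 100 runs for the 800 MB file and 1000 runs for the 8 GB file; `merge_buffer_size` is an illustrative name.

```python
def merge_buffer_size(run_size_bytes, k):
    """Buffer available to each run in a K-way merge: (1/K) * run size."""
    return run_size_bytes // k


RUN_SIZE = 8 * 10**6  # assumption: one run fills the 8 MB of memory

small_file_buffer = merge_buffer_size(RUN_SIZE, 100)    # 800 MB file, 100 runs
large_file_buffer = merge_buffer_size(RUN_SIZE, 1000)   # 8 GB file, 1000 runs

print(f"buffer per run with 100 runs: {small_file_buffer} bytes")
print(f"buffer per run with 1000 runs: {large_file_buffer} bytes")
```

With ten times as many runs, each buffer is ten times smaller, so each run needs ten times as many reads to pass through its buffer.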
The number of seeks needed to merge a big file
K seeks are needed to read all records in
each individual run.
Since there are K runs altogether, the
merge operation requires:
K^2 seeks.
Thus if a file is N times bigger, N^2 times as many
seeks are needed to merge it.
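A quick numeric check of the K^2 rule, using the same model as the slide (one seek per buffer-load, every run as large as memory). The run counts of 100 and 1000 are assumptions derived from the 800 MB and 8 GB examples with 8 MB of memory.

```python
def merge_seeks(k):
    """Seeks for a one-step K-way merge of K memory-sized runs."""
    seeks_per_run = k          # each run is read in K buffer-loads
    return k * seeks_per_run   # K runs altogether


base = merge_seeks(100)      # 800 MB file
bigger = merge_seeks(1000)   # 8 GB file, 10 times larger

print(f"seeks for 100 runs: {base}")
print(f"seeks for 1000 runs: {bigger}")
print(f"ratio: {bigger // base}")
```

A file 10 times larger has 10 times as many runs, and each run needs 10 times as many seeks, which gives the factor of 100.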
How to improve performance of merge sort?
Allocate more hardware: more main
memory, multiple disk drives and I/O
channels.
Perform the merge in more than one step.
Algorithmically increase the lengths of the
initial sorted runs.
Find ways to overlap I/O operations.
Multi-Step Merge
Multi-step merge is a merge in which not all
runs are merged in one step. Rather, several
sets of runs are merged separately, each set
producing one long run consisting of the
records from all its runs. These new, longer
runs are then merged, either all together or
in several sets.
See example of a two-step merge, fig. 8.23
p.330
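A two-step merge can be sketched with in-memory lists standing in for the on-disk runs. The function name `two_step_merge` and the `group_size` parameter are hypothetical, and a real implementation would write the intermediate runs back to disk rather than keep them in lists.

```python
import heapq


def two_step_merge(runs, group_size):
    """Merge sorted runs in two steps: within groups, then across groups."""
    # Step 1: merge each set of `group_size` runs into one longer run.
    intermediate = [
        list(heapq.merge(*runs[i:i + group_size]))
        for i in range(0, len(runs), group_size)
    ]
    # Step 2: merge the longer intermediate runs into the final output.
    return list(heapq.merge(*intermediate))


runs = [[1, 9], [2, 8], [3, 7], [4, 6], [5]]
print(two_step_merge(runs, group_size=2))
```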
Pros and Cons of Multi-Step Merge
Con: it requires that each record is read
twice (once to form the intermediate runs
and again to form the final sorted file)
Pros: Because fewer runs are merged at a time, we can use
bigger buffers and thus reduce the number
of disk accesses. In some cases multi-step
merge is the only reasonable way to
perform a merge on tape if the number of
tape drives is limited.
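The trade-off can be estimated with the same seek model used earlier in these slides (one seek per buffer-load, initial runs as large as memory). The sketch below is a back-of-the-envelope comparison, not a measurement; the choice of 100 initial runs and groups of 10 is an assumption made for illustration, and the run count is assumed to be a multiple of the group size.

```python
def one_step_seeks(k):
    """Seeks for merging K memory-sized runs in a single K-way merge."""
    return k * k


def two_step_seeks(k, group_size):
    """Seeks for a two-step merge; assumes k is a multiple of group_size."""
    groups = k // group_size
    # Step 1: each group is a group_size-way merge of memory-sized runs.
    step1 = groups * group_size * group_size
    # Step 2: each intermediate run is group_size memory-loads long and is
    # read through a buffer of 1/groups of memory: group_size * groups seeks.
    step2 = groups * (group_size * groups)
    return step1 + step2


single = one_step_seeks(100)
double = two_step_seeks(100, group_size=10)
print(f"one-step merge: {single} seeks")
print(f"two-step merge: {double} seeks")
```

Under this model the two-step merge needs far fewer seeks, at the price of reading and writing every record one extra time.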