
Chap 8. Cosequential Processing and the Sorting of Large Files


File Structures by Folk, Zoellick, and Riccardi. Chap 8. Cosequential Processing and the Sorting of Large Files. Seoul National University, School of Computer Science and Engineering, Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB). Professor Hyoung-Joo Kim.

Page 1: Chap 8. Cosequential Processing        and the Sorting of Large Files

File Structures SNU-OOPSLA Lab. 1

Chap 8. Cosequential Processing and the Sorting of Large Files

Seoul National University, School of Computer Science and Engineering, Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB)

Professor Hyoung-Joo Kim

File Structures by Folk, Zoellick, and Riccardi

Page 2: Chap 8. Cosequential Processing        and the Sorting of Large Files

Chapter Objectives (1)

Describe a class of frequently used processing activities known as cosequential processes

Provide a general object-oriented model for implementing varieties of cosequential processes

Illustrate the use of the model to solve a number of different kinds of cosequential processing problems, including problems other than simple merges and matches

Introduce heapsort as an approach to overlapping I/O with sorting in RAM

Page 3: Chap 8. Cosequential Processing        and the Sorting of Large Files

Chapter Objectives (2)

Show how merging provides the basis for sorting very large files

Examine the costs of K-way merges on disk and find ways to reduce those costs

Introduce the notion of replacement selection

Examine some of the fundamental concerns associated with sorting large files using tapes rather than disks

Introduce UNIX utilities for sorting, merging, and cosequential processing

Page 4: Chap 8. Cosequential Processing        and the Sorting of Large Files

Contents

8.1 Cosequential operations

8.2 Application of the OO Model to a General Ledger Program

8.3 Extension of the OO Model to Include Multiway Merging

8.4 A Second Look at Sorting in Memory

8.5 Merging as a Way of Sorting Large Files on Disk

8.6 Sorting Files on Tape

8.7 Sort-Merge Packages

8.8 Sorting and Cosequential Processing in Unix

Page 5: Chap 8. Cosequential Processing        and the Sorting of Large Files

Cosequential operations

Coordinated processing of two or more sequential lists to produce a single list

Kinds of operations
  merging, or union
  matching, or intersection
  a combination of the above

8.1 An Object-Oriented Model for Implementing Cosequential Processes

Page 6: Chap 8. Cosequential Processing        and the Sorting of Large Files

Matching Names in Two Lists (1)

The so-called "intersection operation": output the names common to two lists

Things that must be dealt with to make the match procedure work reasonably
  initializing: arranging things so that the procedure can start
  getting and accessing the next list item
  synchronizing between the two lists
  handling EOF conditions
  recognizing errors, e.g. duplicate names or names out of sequence

Page 7: Chap 8. Cosequential Processing        and the Sorting of Large Files

Matching Names in Two Lists (2)

In comparing two names
  if Item(1) is less than Item(2), read the next name from List 1
  if Item(1) is greater than Item(2), read the next name from List 2
  if the names are the same, output the name and read the next names from the two lists

Page 8: Chap 8. Cosequential Processing        and the Sorting of Large Files

Cosequential match procedure (1)

[Figure: flow of PROGRAM match. Item(1) comes from List 1 and Item(2) from List 2, using the input() and initialize() procedures. The three branches are Item(1) < Item(2), Item(1) > Item(2), and same name.]

Page 9: Chap 8. Cosequential Processing        and the Sorting of Large Files

Cosequential match procedure (2)

int Match(char * List1, char * List2, char * OutputList)
{
    int MoreItems; // true if items remain in both of the lists

    // initialize input and output lists
    InitializeList(1, List1);
    InitializeList(2, List2);
    InitializeOutput(OutputList);

    // get first item from both lists
    MoreItems = NextItemInList(1) && NextItemInList(2);
    while (MoreItems) { // loop until no items in one of the lists
        if (Item(1) < Item(2))
            MoreItems = NextItemInList(1);
        else if (Item(1) == Item(2)) {
            ProcessItem(1); // match found
            MoreItems = NextItemInList(1) && NextItemInList(2);
        }
        else // Item(1) > Item(2)
            MoreItems = NextItemInList(2);
    }
    FinishUp();
    return 1;
}
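The match loop can also be tried without any file handling. The following is a minimal sketch, assuming the two lists are already in memory as sorted vectors with no duplicates; the name matchSorted is hypothetical and is not part of the CosequentialProcess class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the slides): returns the names common to two
// sorted lists, using the same three-way comparison as Match().
std::vector<std::string> matchSorted(const std::vector<std::string>& list1,
                                     const std::vector<std::string>& list2) {
    // Cosequential processing assumes both inputs are already in order.
    assert(std::is_sorted(list1.begin(), list1.end()));
    assert(std::is_sorted(list2.begin(), list2.end()));

    std::vector<std::string> out;
    std::size_t i = 0, j = 0;
    while (i < list1.size() && j < list2.size()) {  // stop when either list ends
        if (list1[i] < list2[j]) {
            ++i;                        // Item(1) < Item(2): advance list 1
        } else if (list2[j] < list1[i]) {
            ++j;                        // Item(1) > Item(2): advance list 2
        } else {
            out.push_back(list1[i]);    // match found: output and advance both
            ++i;
            ++j;
        }
    }
    return out;
}
```

Each list is traversed once, so the work is proportional to the combined length of the two lists.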

Page 10: Chap 8. Cosequential Processing        and the Sorting of Large Files

General Class for Cosequential Processing (1)

template <class ItemType>
class CosequentialProcess
// base class for cosequential processing
{
public:
    // the following methods provide basic list processing
    // these must be defined in subclasses
    virtual int InitializeList (int ListNumber, char * ListName) = 0;
    virtual int InitializeOutput (char * OutputListName) = 0;
    virtual int NextItemInList (int ListNumber) = 0; // advance to next item in this list
    virtual ItemType Item (int ListNumber) = 0; // return current item from this list
    virtual int ProcessItem (int ListNumber) = 0; // process the item in this list
    virtual int FinishUp() = 0; // complete the processing

    // 2-way cosequential match method
    virtual int Match2Lists (char * List1, char * List2, char * OutputList);
};

Page 11: Chap 8. Cosequential Processing        and the Sorting of Large Files

General Class for Cosequential Processing (2)

A subclass to support lists that are files of strings, one per line

class StringListProcess : public CosequentialProcess<String &>
{
public:
    StringListProcess (int NumberOfLists); // constructor
    // Basic list processing methods
    int InitializeList (int ListNumber, char * List1);
    int InitializeOutput (char * OutputList);
    int NextItemInList (int ListNumber); // get next
    String & Item (int ListNumber); // return current
    int ProcessItem (int ListNumber); // process the item
    int FinishUp(); // complete the processing
protected:
    ifstream * List; // array of list files
    String * Items; // array of current Item from each list
    ofstream OutputList;
    static const char * LowValue; // used so that NextItemInList() doesn't
                                  // have to get the first item in a special way
    static const char * HighValue;
};

Page 12: Chap 8. Cosequential Processing        and the Sorting of Large Files

General Class for Cosequential Processing (3)

Appendix H: full implementation

An example of main

#include "coseq.h"

int main()
{
    StringListProcess ListProcess(2); // process with 2 lists
    ListProcess.Match2Lists ("list1.txt", "list2.txt", "match.txt");
}

Page 13: Chap 8. Cosequential Processing        and the Sorting of Large Files

Merging Two Lists (1)

Based on the matching operation

Differences
  must read each of the lists completely
  must change the behavior of MoreNames: keep this flag set to true as long as there are records in either list

HighValue
  a special value (we use "\xFF") that comes after all legal input values in the files, to ensure both input files are read to completion

Page 14: Chap 8. Cosequential Processing        and the Sorting of Large Files

Merging Two Lists (2)

Cosequential merge procedure based on a single loop
This method has been added to class CosequentialProcess
No modifications are required to class StringListProcess

template <class ItemType>
int CosequentialProcess<ItemType>::Merge2Lists (char * List1Name, char * List2Name, char * OutputListName)
{
    int MoreItems1, MoreItems2; // true if more items in list

(continued ...)

Page 15: Chap 8. Cosequential Processing        and the Sorting of Large Files

Merging Two Lists (3)

    InitializeList (1, List1Name);
    InitializeList (2, List2Name);
    InitializeOutput (OutputListName);
    MoreItems1 = NextItemInList(1);
    MoreItems2 = NextItemInList(2);
    while (MoreItems1 || MoreItems2) { // if either file has more
        if (Item(1) < Item(2)) { // list 1 has next item to be processed
            ProcessItem(1);
            MoreItems1 = NextItemInList(1);
        }
        else if (Item(1) == Item(2)) {
            ProcessItem(1);
            MoreItems1 = NextItemInList(1);
            MoreItems2 = NextItemInList(2);
        }
        else { // Item(1) > Item(2)
            ProcessItem(2);
            MoreItems2 = NextItemInList(2);
        }
    }
    FinishUp();
    return 1;
}
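The role of HighValue in the merge can be seen in a small in-memory version. This is a sketch under stated assumptions: the lists are sorted vectors of strings, no legal key equals "\xFF", and the names kHighValue, itemOrHigh and mergeSorted are hypothetical rather than taken from the class library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sentinel assumed to sort after every legal key, like "\xFF" on the slide.
static const std::string kHighValue = "\xFF";

// Hypothetical helper: the current item of a list, or kHighValue once the
// list is exhausted, so the loop needs no special end-of-list branch.
static const std::string& itemOrHigh(const std::vector<std::string>& list,
                                     std::size_t pos) {
    return pos < list.size() ? list[pos] : kHighValue;
}

std::vector<std::string> mergeSorted(const std::vector<std::string>& list1,
                                     const std::vector<std::string>& list2) {
    // The sentinel trick only works if no real key is the sentinel itself.
    assert(list1.empty() || list1.back() != kHighValue);
    assert(list2.empty() || list2.back() != kHighValue);

    std::vector<std::string> out;
    std::size_t i = 0, j = 0;
    while (i < list1.size() || j < list2.size()) {  // until both lists end
        const std::string& a = itemOrHigh(list1, i);
        const std::string& b = itemOrHigh(list2, j);
        if (a < b) {                 // list 1 has the next item
            out.push_back(a);
            ++i;
        } else if (a == b) {         // same key in both: output once
            out.push_back(a);
            ++i;
            ++j;
        } else {                     // list 2 has the next item
            out.push_back(b);
            ++j;
        }
    }
    return out;
}
```

Once one list is exhausted its current item compares as HighValue, so the remaining items of the other list are copied out by the ordinary comparison branches.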

Page 16: Chap 8. Cosequential Processing        and the Sorting of Large Files

Cosequential merge procedure (1)

[Figure: flow of PROGRAM merge. NAME_1 comes from List 1 and NAME_2 from List 2, and output goes to OutputList. The two branches are "Item(1) < Item(2) or match" and "Item(1) > Item(2)".]

Page 17: Chap 8. Cosequential Processing        and the Sorting of Large Files

Summary of the Cosequential Processing Model (1)

Assumptions
  two or more input files are processed in a parallel fashion
  each file is sorted
  in some cases, there must exist a high key value or a low key value
  records are processed in a logical sorted order
  for each file, there is only one current record
  records should be manipulated only in internal memory

Page 18: Chap 8. Cosequential Processing        and the Sorting of Large Files

Summary of the Cosequential Processing Model (2)

Essential components
  initialization: reads in the first logical records
  one main synchronization loop: continues as long as relevant records remain
  selection in the main synchronization loop:

    if (Item(1) > Item(2)) then ...
    else if (Item(1) < Item(2)) then ...
    else ... /* current keys equal */
    endif

  input files and output files are sequence checked by comparing the previous item value with the new one

Page 19: Chap 8. Cosequential Processing        and the Sorting of Large Files

Summary of the Cosequential Processing Model (3)

Essential components (cont'd)
  substitute high values for the actual key at EOF
    the main loop terminates when high values have occurred for all relevant input files
    no special code is needed to deal with EOF
  I/O and error detection are relegated to supporting methods, so the details of these activities do not obscure the principal processing logic

Page 20: Chap 8. Cosequential Processing        and the Sorting of Large Files

8.2 The General Ledger Program (1)

Account table (Fig 8.6)

Acct-No  Acct-Title  Jan  Feb  Mar  Apr
101      check #1    100  200  170
102      check #2    500  270  320
505      advertise   300  129  230

Journal entry table (Fig 8.7)

Acct-No  Check-No  Date      Description  Debit/Credit
101      112       04/02/86  auto-repair  -30
505      213       05/13/86  newspaper    -39
540      670       04/13/86  printer      +60

Ledger Printout (Fig 8.8)

101 check #1
    1271  04/02/86  auto-expense  -78
    1272  04/03/86  advertise     -30

Page 21: Chap 8. Cosequential Processing        and the Sorting of Large Files

8.2 The General Ledger Program (2)

Ledger List and Journal List (Fig 8.10)

Ledger list    Journal list
101 check#1    101 1271 Auto-expense
               101 1272 Rent
               101 1273 Advertising
102 check#2    102 670  Office-expense

Left key: the ledger (master) account number
Right key: the journal (transaction) account number

Class MasterTransactionProcess (Fig 8.12)
Subclass LedgerProcess (Fig 8.14)

Page 22: Chap 8. Cosequential Processing        and the Sorting of Large Files

8.2 The General Ledger Program (3)

template <class ItemType>
class MasterTransactionProcess : public CosequentialProcess<ItemType>
// a cosequential process that supports master/transaction processing
{
public:
    MasterTransactionProcess(); // constructor
    virtual int ProcessNewMaster() = 0; // processing when new master read
    virtual int ProcessCurrentMaster() = 0;
    virtual int ProcessEndMaster() = 0;
    virtual int ProcessTransactionError() = 0;

    // cosequential processing of master and transaction records
    int PostTransactions (char * MasterFileName, char * TransactionFileName, char * OutputListName);
};

Page 23: Chap 8. Cosequential Processing        and the Sorting of Large Files

A K-way Merge Algorithm

A very general form of cosequential file processing
Merge K input lists to create a single, sequentially ordered output list

Algorithm
  begin loop
    determine which list has the key with the lowest value
    output that key
    move ahead one key in that list
    (if there are duplicate input entries, move ahead in each list that holds the key)
  loop again

8.3 Extension of the Model to Include Multiway Merging
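The loop above can be sketched directly with a linear scan over the K current keys. This is an illustrative version only: it assumes integer keys, in-memory lists that are each sorted and free of internal duplicates, and a hypothetical function name kWayMerge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the K-way merge loop using a linear scan for the
// minimum. A key present in several lists is output once.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& lists) {
    for (const std::vector<int>& list : lists) {
        assert(std::is_sorted(list.begin(), list.end()));  // inputs must be ordered
    }
    std::vector<std::size_t> pos(lists.size(), 0);  // current position per list
    std::vector<int> out;
    while (true) {
        // determine which list has the key with the lowest value
        bool found = false;
        int minKey = 0;
        for (std::size_t k = 0; k < lists.size(); ++k) {
            if (pos[k] < lists[k].size()) {
                int key = lists[k][pos[k]];
                if (!found || key < minKey) {
                    minKey = key;
                    found = true;
                }
            }
        }
        if (!found) break;          // every list is exhausted
        out.push_back(minKey);      // output that key
        // move ahead in each list whose current key equals the one just output
        for (std::size_t k = 0; k < lists.size(); ++k) {
            if (pos[k] < lists[k].size() && lists[k][pos[k]] == minKey) {
                ++pos[k];
            }
        }
    }
    return out;
}
```

Each output key costs a scan of all K list heads, which is why this form is only attractive for small K.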

Page 24: Chap 8. Cosequential Processing        and the Sorting of Large Files

Selection Tree for Merging Large Number of Lists

K-way merge
  works nicely if K is no larger than 8 or so
  if K > 8, the loop of comparisons needed to find the minimum key becomes expensive

Selection tree (if K > 8)
  a time vs. space trade-off
  a kind of "tournament" tree
  the minimum value is at the root node
  the depth of the tree is log2 K

Page 25: Chap 8. Cosequential Processing        and the Sorting of Large Files

[Figure: selection tree over eight input lists.]

Input
  List 0: 7, 10, 17, ...
  List 1: 9, 19, 23, ...
  List 2: 11, 13, 32, ...
  List 3: 18, 22, 24, ...
  List 4: 12, 14, 21, ...
  List 5: 5, 6, 25, ...
  List 6: 15, 20, 30, ...
  List 7: 8, 16, 29, ...

Selection tree (each node holds the smaller of its two children)
  first level, over pairs of lists: 7, 11, 5, 8
  second level: 7, 5
  root: 5
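For larger K the linear scan can be replaced by a structure that keeps the minimum at its root. The sketch below uses std::priority_queue as a stand-in for the tournament tree in the figure; it is not the selection tree itself, and the function name mergeWithSelection is hypothetical. Unlike the duplicate-skipping loop of the K-way algorithm, this version outputs every key, including keys that occur in more than one list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch: K-way merge where a min-heap of (key, list index)
// pairs plays the role of the selection tree.
std::vector<int> mergeWithSelection(const std::vector<std::vector<int>>& lists) {
    using Entry = std::pair<int, std::size_t>;  // (current key, list index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heads;
    std::vector<std::size_t> pos(lists.size(), 0);

    for (std::size_t k = 0; k < lists.size(); ++k) {
        assert(std::is_sorted(lists[k].begin(), lists[k].end()));
        if (!lists[k].empty()) heads.push(Entry(lists[k][0], k));
    }

    std::vector<int> out;
    while (!heads.empty()) {
        Entry smallest = heads.top();   // minimum of all current keys
        heads.pop();
        out.push_back(smallest.first);
        std::size_t k = smallest.second;
        ++pos[k];                       // advance only the winning list
        if (pos[k] < lists[k].size()) {
            heads.push(Entry(lists[k][pos[k]], k));
        }
    }
    return out;
}
```

Each output key now costs on the order of log2 K comparisons instead of K.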

Page 26: Chap 8. Cosequential Processing        and the Sorting of Large Files

8.4 A Second Look at Sorting in Memory

Read the whole file from disk into memory, perform the sort, write the whole file back to disk

Can we improve on the time that it takes for this RAM sort?
  perform some of the parts in parallel
  selection sort is good but cannot be used to sort an entire file

Using the heap technique!
  processing and I/O can occur in parallel
  keep all the keys in a heap
  heap building while reading a block
  heap rebuilding while writing a block

Page 27: Chap 8. Cosequential Processing        and the Sorting of Large Files

Overlapping processing and I/O: Heapsort

Heap
  a kind of binary tree: a complete binary tree
  each node has a single key, and that key is greater than or equal to the key at its parent node
  storage for the tree can be allocated sequentially, so there is no need for pointers or other dynamic overhead for maintaining the heap

Page 28: Chap 8. Cosequential Processing        and the Sorting of Large Files

[Figure: a heap in both its tree form and as it would be stored in an array.]

Tree form (array position in parentheses)
  level 1: A (1)
  level 2: B (2), C (3)
  level 3: E (4), H (5), I (6), D (7)
  level 4: G (8), F (9)

Array form
  Position  1 2 3 4 5 6 7 8 9
  Key       A B C E H I D G F

* The node at position n has its children at positions 2n and 2n+1

Page 29: Chap 8. Cosequential Processing        and the Sorting of Large Files

Class Heap and Method Insert (1)

class Heap
{
public:
    Heap(int maxElements);
    int Insert (char * newKey);
    char * Remove();
protected:
    int MaxElements;
    int NumElements;
    char ** HeapArray;
    void Exchange (int i, int j); // exchange element i and j
    int Compare (int i, int j) // compare element i and j
        { return strcmp(HeapArray[i], HeapArray[j]); }
};

Page 30: Chap 8. Cosequential Processing        and the Sorting of Large Files

Class Heap and Method Insert (2)

int Heap::Insert(char * newKey)
{
    if (NumElements == MaxElements) return FALSE;
    NumElements++; // add the new key at the last position
    HeapArray[NumElements] = newKey;
    // re-order the heap
    int k = NumElements;
    int parent;
    while (k > 1) { // k has a parent
        parent = k / 2;
        if (Compare(k, parent) >= 0) break; // HeapArray[k] is in the right place
        // else exchange k and parent
        Exchange(k, parent);
        k = parent;
    }
    return TRUE;
}
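A self-contained variant of the insert logic makes it easy to experiment with the sift-up step. This is a sketch, not the Heap class from the slides: it stores std::string keys in a std::vector with slot 0 unused, and the names MinHeap, insert, top and arrayForm are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical min-heap using 1-based positions, so that the children of
// position n are at 2n and 2n+1 and the parent of position k is at k/2.
class MinHeap {
public:
    MinHeap() : keys_(1) {}  // slot 0 is a placeholder and never used

    void insert(const std::string& key) {
        keys_.push_back(key);                 // add at the last position
        std::size_t k = keys_.size() - 1;
        while (k > 1) {                       // k has a parent
            std::size_t parent = k / 2;
            if (keys_[k] >= keys_[parent]) break;  // already in the right place
            std::swap(keys_[k], keys_[parent]);    // otherwise move it up
            k = parent;
        }
    }

    const std::string& top() const {
        assert(keys_.size() > 1);             // heap must not be empty
        return keys_[1];
    }

    // Concatenates the keys in array order (positions 1..n) for inspection.
    std::string arrayForm() const {
        std::string s;
        for (std::size_t i = 1; i < keys_.size(); ++i) s += keys_[i];
        return s;
    }

private:
    std::vector<std::string> keys_;
};
```

Inserting the keys F, D, C, G, H, I, B, E, A one at a time leaves the array form A B C E H I D G F.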

Page 31: Chap 8. Cosequential Processing        and the Sorting of Large Files

Heap Building Algorithm (1)

Input key order: F D C G H I B E A

New key to be inserted   Heap, after insertion of the new key (positions 1 to 9)
F                        F
D                        D F
C                        C F D
G                        C F D G
H                        C F D G H

Selected heap in tree form (after inserting C)
  level 1: C
  level 2: F, D

(continued ...)

Page 32: Chap 8. Cosequential Processing        and the Sorting of Large Files

Heap Building Algorithm (2)

Input key order: F D C G H I B E A

New key to be inserted   Heap, after insertion of the new key (positions 1 to 9)
I                        C F D G H I
B                        B F C G H I D
E                        B E C F H I D G
A                        A B C E H I D G F

Selected heaps in tree form

After inserting I
  level 1: C
  level 2: F, D
  level 3: G, H, I

After inserting B
  level 1: B
  level 2: F, C
  level 3: G, H, I, D

(continued ...)

Page 33: Chap 8. Cosequential Processing        and the Sorting of Large Files

Heap Building Algorithm (3)

Input key order: F D C G H I B E A

New key to be inserted   Heap, after insertion of the new key (positions 1 to 9)
A                        A B C E H I D G F

Final heap in tree form
  level 1: A
  level 2: B, C
  level 3: E, H, I, D
  level 4: G, F

Page 34: Chap 8. Cosequential Processing        and the Sorting of Large Files

Illustration for overlapping input with heap building (1)

Total RAM area allocated for heap

First input buffer. The first part of the heap is built here. The first record is added to the heap, then the second record is added, and so forth

Second input buffer. This buffer is being filled while the heap is being built in the first buffer.

(Free ride of main memory processing: heap building is faster than I/O!)

Page 35: Chap 8. Cosequential Processing        and the Sorting of Large Files

Illustration for overlapping input with heap building (2)

Second part of the heap is built here. The first record is added to the heap, then the second record, etc.

Third input buffer. This buffer is filled while the heap is being built in the second buffer

Third part of the heap is built here

Fourth input buffer is filled while the heap is being built in the third buffer

(One heap is growing during I/O time!)

Page 36: Chap 8. Cosequential Processing        and the Sorting of Large Files

Sorting while Writing to the File

Heap rebuilding while writing a block (free ride of main memory processing)

Retrieving the keys in order (Fig 8.20)
  while (there are elements in the heap)
    get the smallest value (the root)
    put the last element of the heap into the root
    decrease the number of elements
    reorder the heap

Overlapping retrieve-in-order with I/O
  retrieve-in-order a block of records
  while writing this block, retrieve-in-order the next block
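The retrieve-in-order step can be sketched as a single remove operation applied repeatedly. The code below is an illustration, not Fig 8.20 itself: it assumes the heap is a std::vector of strings with slot 0 unused and a valid min-heap in positions 1..n, and the name removeMin is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: removes and returns the smallest key of a 1-based
// min-heap (heap[0] is unused), then restores the heap order.
std::string removeMin(std::vector<std::string>& heap) {
    assert(heap.size() > 1);            // at least one real element
    std::string smallest = heap[1];     // get the smallest value
    heap[1] = heap.back();              // put the last element into the root
    heap.pop_back();                    // decrease the number of elements

    // reorder the heap: sift the root down until both children are not smaller
    std::size_t n = heap.size() - 1;    // number of remaining elements
    std::size_t k = 1;
    while (2 * k <= n) {
        std::size_t child = 2 * k;      // left child
        if (child < n && heap[child + 1] < heap[child]) {
            ++child;                    // right child is the smaller one
        }
        if (heap[k] <= heap[child]) break;
        std::swap(heap[k], heap[child]);
        k = child;
    }
    return smallest;
}
```

Calling removeMin until the heap is empty yields the keys in ascending order, which is the property that lets a block of output be written while the next block is being retrieved.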

Page 37: Chap 8. Cosequential Processing        and the Sorting of Large Files

8.5 Merging as a Way of Sorting Large Files on Disk

Keysort: holding the keys in memory

Two shortcomings of keysort
  a substantial cost of seeking may be incurred after the keysort
  cannot sort really large files
    e.g. for a file with 800,000 records of 100 bytes each and a 10-byte key, the keys alone take 800,000 x 10 bytes = 8,000,000 bytes (8 MB)
    we may not even be able to hold all the keys in RAM

Multiway merge algorithm
  small overhead for maintaining pointers and temporary variables
  run: a sorted subfile
  heapsort is used for each run: split, read in, heapsort, write back

Page 38: Chap 8. Cosequential Processing        and the Sorting of Large Files

[Figure: sorting through the creation of runs and subsequent merging of runs. 800,000 unsorted records go through 80 internal sorts, giving 80 runs that each contain 10,000 sorted records; a merge then produces 800,000 records in sorted order.]

Page 39: Chap 8. Cosequential Processing        and the Sorting of Large Files

Multiway merging (K-way merge-sort)

Can be extended to files of any size
Reading during run creation is sequential
  no seeking due to sequential reading
Reading and writing are sequential
Sort each run: overlap I/O using heapsort
K-way merge with K runs
Since I/O is largely sequential, tapes can be used

Page 40: Chap 8. Cosequential Processing        and the Sorting of Large Files

How Much Time Does a Merge Sort Take?

Assumptions
  only one seek is required for any sequential access
  only one rotational delay is required per access

Four I/O steps (refer to page 39)
  during the sort phase
    reading all records into RAM for sorting and forming runs
    writing sorted runs out to disk
  during the merge phase
    reading sorted runs into RAM for merging
    writing the sorted file out to disk

Page 41: Chap 8. Cosequential Processing        and the Sorting of Large Files

Four Steps (1)

Step 1: Reading records into RAM for sorting and forming runs
  assume: 10 MB input buffer, 800 MB file size
  seek time: 8 msec, rotational delay: 3 msec
  transmission rate: 0.0145 MB/msec
  time for step 1: access 80 blocks (80 x 11 msec) + transfer 80 blocks (800 / 0.0145 msec)

Step 2: Writing sorted runs out to disk
  writing is the reverse of reading
  the time for step 2 equals the time for step 1

Page 42: Chap 8. Cosequential Processing        and the Sorting of Large Files

Four Steps (2)

Step 3: Reading sorted runs into RAM for merging
  10 MB of RAM is available for holding the 80 runs
  reallocate the 10 MB of RAM as 80 input buffers, one per run
  each buffer holds 1/80 of a run (0.125 MB), so each run takes 80 accesses to read all of it
  total seek and rotation time: 80 runs x 80 seeks = 6,400 seeks; 6,400 x 11 msec = about 70 seconds
  transfer time: about 60 seconds
  total time = total seek and rotation time + transfer time
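The step 3 arithmetic can be reproduced with a few lines of code. This is a sketch of the cost model only, using the disk parameters stated for step 1 (8 msec seek, 3 msec rotational delay, 0.0145 MB/msec); the struct and function names are hypothetical. With these exact figures the transfer of 800 MB comes out near 55 seconds, which the slide rounds to 60.

```cpp
#include <cassert>

// Hypothetical disk model holding the three parameters given on the slide.
struct DiskModel {
    double seekMs;      // average seek time in msec
    double rotationMs;  // rotational delay in msec
    double mbPerMs;     // transmission rate in MB per msec
};

// In a K-way merge of K runs, each run is read through a buffer that holds
// 1/K of the run, so each run needs K accesses: K * K in total.
long mergeReadAccesses(long runs) {
    assert(runs > 0);
    return runs * runs;
}

// Time spent on seeks and rotational delays for a number of accesses.
double accessSeconds(const DiskModel& disk, long accesses) {
    return accesses * (disk.seekMs + disk.rotationMs) / 1000.0;
}

// Time spent purely transferring the given amount of data.
double transferSeconds(const DiskModel& disk, double megabytes) {
    assert(disk.mbPerMs > 0.0);
    return megabytes / disk.mbPerMs / 1000.0;
}
```

For 80 runs this gives 6,400 accesses and 70.4 seconds of seek and rotation time, so under this model the merge read phase is dominated by access time rather than by transfer time.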

Page 43: Chap 8. Cosequential Processing        and the Sorting of Large Files

Four Steps (3)

Step 4: Writing the sorted file out to disk
  need to know how big the output buffers are
  with 20,000-byte output buffers: 80,000,000 bytes / 20,000 bytes per seek = 4,000 seeks
  total seek and rotation time = 4,000 x 11 msec
  transfer time is still 60 seconds

Consider Table 8.1 (p. 323)
What if we use keysort for an 800 MB file? --> 24 hrs 26 mins 40 secs

Page 44: Chap 8. Cosequential Processing        and the Sorting of Large Files

[Figure: effect of buffering on the number of seeks required. The 800 MB file consists of 80 runs; the 10 MB of RAM is divided into 80 buffers. The 1st run, the 2nd run, and so on through the 80th run are each 80 buffers' worth of data (80 accesses per run). The output is the file of sorted records.]

Page 45: Chap 8. Cosequential Processing        and the Sorting of Large Files

Sorting a Very Large File

Two kinds of I/O
  Sort phase
    I/O is sequential if heapsort is used
    since sequential access already implies minimal seeking, we cannot algorithmically speed up this I/O
  Merge phase
    the RAM buffers for each run get loaded and reloaded at unpredictable times -> random access
    for performance, look for ways to cut down on the number of random accesses that occur while reading runs
    there is some room for improvement here!

Page 46: Chap 8. Cosequential Processing        and the Sorting of Large Files

The Cost of Increasing the File Size

K-way merge of K runs
Merge sort = O(K^2) (the merge operation takes K^2 seeks)
If K is a big number, you are in trouble!

Some ways to reduce the time (8.5.4, 8.5.5, 8.5.6)
  more hardware (disk drives, RAM, I/O channels)
  reduce the order of the merge (K), increasing the buffer size of each run
  increase the lengths of the initial sorted runs
  find ways to overlap I/O operations

Page 47: Chap 8. Cosequential Processing        and the Sorting of Large Files

Hardware-based Improvements

Increasing the amount of RAM
  longer and fewer initial runs
  fewer seeks

Increasing the number of disk drives
  no delay due to seek time after generation of runs
  assign input and output to separate drives

Increasing the number of I/O channels
  with separate I/O channels, I/O can overlap
  improves transmission time

Page 48: Chap 8. Cosequential Processing        and the Sorting of Large Files

Decreasing the Number of Seeks Using Multiple-step Merges

K-way merge characteristics
  a selection tree is used
  the number of comparisons is N*log K (K-way merge with N records)
  if K is proportional to N: O(N*log N), which is reasonably efficient

Reducing seeks means reducing the number of runs
  give each run a bigger buffer space
  a multiple-step merge provides a way to do so without more RAM

Page 49: Chap 8. Cosequential Processing        and the Sorting of Large Files

Multiple-step merge (1)

Do not merge all runs at one time
Break the original set of runs into small groups and merge the runs in these groups separately
Leads to fewer seeks, but extra transmission time in the second pass
Reads every record twice: to form the intermediate runs and to form the final sorted file
Similar to having a selection tree when merging n lists!

Page 50: Chap 8. Cosequential Processing        and the Sorting of Large Files

Two-step merge of 800 runs

[Figure: the 800 runs are divided into 25 sets of 32 runs each (25 sets x 32 runs = 800 runs). Each set is merged separately into an intermediate run, and the intermediate runs are then merged into the final sorted file.]
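The saving in seeks from a two-step merge can be estimated with a simple count. The sketch below rests on assumptions that are not all spelled out on the slides: every initial run is one memory-load long, the merge buffer space is divided equally among the inputs of each merge, each buffer refill costs one seek, and only seeks for reading are counted. The function names are hypothetical.

```cpp
#include <cassert>

// Read seeks for merging all runs at once: each of the K runs is read
// through a buffer holding 1/K of a run, so it needs K refills.
long oneStepReadSeeks(long runs) {
    assert(runs > 0);
    return runs * runs;
}

// Read seeks for a two-step merge of sets * runsPerSet initial runs.
long twoStepReadSeeks(long sets, long runsPerSet) {
    assert(sets > 0 && runsPerSet > 0);
    // Step one: each set is a runsPerSet-way merge of memory-sized runs.
    long stepOne = sets * (runsPerSet * runsPerSet);
    // Step two: a sets-way merge. Each intermediate run is runsPerSet
    // memory-loads long and is read through a buffer of 1/sets of memory,
    // so it needs runsPerSet * sets refills.
    long stepTwo = sets * (runsPerSet * sets);
    return stepOne + stepTwo;
}
```

Under these assumptions an 800-way merge needs 640,000 read seeks, while 25 sets of 32 runs need 25,600 + 20,000 = 45,600, at the price of transferring every record one extra time.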

Page 51: Chap 8. Cosequential Processing        and the Sorting of Large Files

Multiple-step merge (2)

Essence of multiple-step merging
  increase the available buffer space for each run
  an extra pass is traded for a decrease in random accesses

Can we do even better with more than two steps?
  there is a trade-off between the seek and rotation time and the transmission time

Major cost factors in merge sort
  seek time, rotation time, transmission time, buffer size, number of runs

Page 52: Chap 8. Cosequential Processing        and the Sorting of Large Files

Increasing Run Lengths Using Replacement Selection (1)

Facts of life
  we want to use heapsort in memory
  we want to produce longer output runs
  can we produce longer output runs while using heapsort in memory?

Replacement selection
  Idea
    always select the key from memory that has the lowest value
    output the key
    replace it with a new key from the input list
    use 2 heaps in the memory buffer

(continued...)

Page 53: Chap 8. Cosequential Processing        and the Sorting of Large Files

Increasing Run Lengths Using Replacement Selection (2)

Implementation
  step 1: read records and sort them using heapsort; this heap is the primary heap
  step 2: write out only the record with the lowest key value
  step 3: bring in a new record and compare its key with that of the record just output
    step 3-a: if the new key is higher, insert the new record into its proper place in the primary heap along with the other records selected for output
    step 3-b: if the new key is lower, place the record in a secondary heap of records with key values lower than those already written out
  step 4: repeat step 3 while there are records in the primary heap and there are records to be read in. When the primary heap is empty, make the secondary heap into the primary heap and repeat steps 2 and 3
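The four steps can be sketched with two standard-library heaps. This is an illustration under stated assumptions rather than the book's implementation: keys are integers held in memory, p is the number of memory locations, a new key equal to the key just output is treated like a higher key, and the function name replacementSelection is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch: splits the input into sorted runs using replacement
// selection with p memory locations shared by a primary and a secondary heap.
std::vector<std::vector<int>> replacementSelection(const std::vector<int>& input,
                                                   std::size_t p) {
    assert(p > 0);
    using MinHeap = std::priority_queue<int, std::vector<int>, std::greater<int>>;
    MinHeap primary, secondary;
    std::size_t next = 0;

    // step 1: fill memory and build the primary heap
    while (next < input.size() && primary.size() < p) primary.push(input[next++]);

    std::vector<std::vector<int>> runs;
    while (!primary.empty()) {
        std::vector<int> run;
        while (!primary.empty()) {
            int smallest = primary.top();   // step 2: output the lowest key
            primary.pop();
            run.push_back(smallest);
            if (next < input.size()) {      // step 3: bring in a new key
                int key = input[next++];
                if (key >= smallest) {
                    primary.push(key);      // 3-a: still fits in the current run
                } else {
                    secondary.push(key);    // 3-b: must wait for the next run
                }
            }
        }
        runs.push_back(run);                // current run is complete
        primary.swap(secondary);            // step 4: secondary becomes primary
    }
    return runs;
}
```

With p = 3 and the keys read in the order 16, 47, 5, 12, 67, 21, 7, 17, 14, 58, 24, 18, 33, this produces two runs, the first of which (5, 12, 16, 21, 47, 67) is twice as long as the memory.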

Page 54: Chap 8. Cosequential Processing        and the Sorting of Large Files

Example of the principle underlying replacement selection (heapsort!)

Input: 21, 67, 12, 5, 47, 16 (the front of the input string is at the right)

Remaining input   Memory (P=3)   Output run
21, 67, 12        5  47 16       -
21, 67            12 47 16       5
21                67 47 16       12, 5
-                 67 47 21       16, 12, 5
-                 67 47 -        21, 16, 12, 5
-                 67 -  -        47, 21, 16, 12, 5
-                 -  -  -        67, 47, 21, 16, 12, 5

Page 55: Chap 8. Cosequential Processing        and the Sorting of Large Files

File Structures SNU-OOPSLA Lab. 55

Replacement Selection (1)

What happens if a key arrives in memory too late to be output into its proper position relative to the other keys? (for example, if the 4th key is 2 rather than 12)
  it goes into the second heap, to be included in the next run
  refer to page 335, Figure 8.25

Two questions
  Given P locations in memory, how long a run can we expect replacement selection to produce, on the average?
    on the average, we can expect a run length of 2P
    Knuth provides an excellent description (pages 335-336)
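The 2P figure can be checked empirically with a small simulation. The sketch below is an illustration under stated assumptions, not a proof: the helper name `run_lengths`, the buffer size P = 100, the input size of 100,000 keys, and the use of uniformly random keys are all choices made here. It repeats a compact replacement selection loop so that it can run on its own.

```python
import heapq
import random


def run_lengths(keys, p):
    """Return the lengths of the runs replacement selection forms with p slots."""
    source = iter(keys)
    primary = [k for _, k in zip(range(p), source)]
    heapq.heapify(primary)
    secondary, lengths, current = [], [], 0
    while primary:
        lowest = heapq.heappop(primary)
        current += 1
        incoming = next(source, None)  # keys are floats here, never None
        if incoming is not None:
            if incoming >= lowest:
                heapq.heappush(primary, incoming)  # still fits in this run
            else:
                secondary.append(incoming)         # held for the next run
        if not primary:
            lengths.append(current)                # current run is complete
            current = 0
            primary, secondary = secondary, []
            heapq.heapify(primary)
    return lengths


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the experiment is repeatable
    p = 100
    keys = [rng.random() for _ in range(100_000)]
    lengths = run_lengths(keys, p)
    print("average run length / P =", sum(lengths) / len(lengths) / p)
```

For randomly ordered input the printed ratio should come out close to 2; partially ordered input gives longer runs, which is why the later comparison tables treat that case separately.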

(continued...)

Page 56: Chap 8. Cosequential Processing and the Sorting of Large Files

Approach                                    Records per    Size of   Seeks required   Merge   Total      Total seek and
                                            seek to form   runs      to form runs     order   number     rotation delay
                                            runs           formed                     used    of seeks   time
800 RAM sorts followed by an                10,000         10,000    1,600            800     681,600    4 hr 58 min
800-way merge
Replacement selection followed by a         2,500          15,000    6,400            534     521,134    3 hr 48 min
534-way merge (records in random order)
Replacement selection followed by a         2,500          40,000    6,400            200     206,400    1 hr 30 min
200-way merge (records partially ordered)

Comparison of access times required to sort 8 million records using both RAM sort and replacement selection

Page 57: Chap 8. Cosequential Processing and the Sorting of Large Files

Step-by-step operation of replacement selection with 2 heaps, working to form two sorted runs (1)

Input: 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16 (the front of the input string is the right-hand end)
Keys in parentheses are held in the secondary heap for the next run

Remaining input                          Memory (P=3)       Output run (A)
33, 18, 24, 58, 14, 17, 7, 21, 67, 12      5    47    16    -
33, 18, 24, 58, 14, 17, 7, 21, 67         12    47    16    5
33, 18, 24, 58, 14, 17, 7, 21             67    47    16    12, 5
33, 18, 24, 58, 14, 17, 7                 67    47    21    16, 12, 5
33, 18, 24, 58, 14, 17                    67    47   ( 7)   21, 16, 12, 5
33, 18, 24, 58, 14                        67   (17)  ( 7)   47, 21, 16, 12, 5
33, 18, 24, 58                           (14)  (17)  ( 7)   67, 47, 21, 16, 12, 5

(Heap sort!)

Page 58: Chap 8. Cosequential Processing and the Sorting of Large Files

Step-by-step operation of replacement selection, working to form two sorted runs (2)

First run complete; start building the second

Remaining input   Memory (P=3)   Output run (B)
33, 18, 24, 58    14  17   7     -
33, 18, 24        14  17  58     7
33, 18            24  17  58     14, 7
33                24  18  58     17, 14, 7
-                 24  33  58     18, 17, 14, 7
-                  -  33  58     24, 18, 17, 14, 7
-                  -   -  58     33, 24, 18, 17, 14, 7
-                  -   -   -     58, 33, 24, 18, 17, 14, 7

Page 59: Chap 8. Cosequential Processing and the Sorting of Large Files

Replacement Selection Plus Multiple Merging

The total number of seeks is less than for the one-step merges
The two-step merge requires transferring the data two more times than does the one-step merge
  the two-step merges and replacement selection are still better, but the results are less dramatic
  refer to the table on the next slide

Page 60: Chap 8. Cosequential Processing and the Sorting of Large Files

Approach                      Records per    Merge     Number of   Seek +       Total      Total          Total of seek,
                              seek to form   pattern   seeks for   rotational   passes     transmission   rotation, and
                              runs           used      sorts and   delay time   over the   time (min)     transmission
                                                       merges      (min)        file                      times (min)
RAM sorts                     10,000         800-way   681,600     298          4          43             341
Replacement selection         2,500          534-way   521,134     228          4          43             271
(records in random order)
Replacement selection         2,500          200-way   206,400     90           4          43             133
(records partially ordered)

Comparison of merges, considering transmission times (1): 1-step merge

(continued...)

Page 61: Chap 8. Cosequential Processing and the Sorting of Large Files

Approach                      Records per    Merge            Number of   Seek +       Total      Total          Total of seek,
                              seek to form   pattern          seeks for   rotational   passes     transmission   rotation, and
                              runs           used             sorts and   delay time   over the   time (min)     transmission
                                                              merges      (min)        file                      times (min)
RAM sorts                     10,000         25 x 32-way      127,200     56           6          65             121
                                             (one 25-way)
Replacement selection         2,500          19 x 28-way      124,438     55           6          65             120
(records in random order)                    (one 19-way)
Replacement selection         2,500          20 x 10-way      110,400     48           6          65             113
(records partially ordered)                  (one 20-way)

Comparison of merges, considering transmission times (2): 2-step merge

Page 62: Chap 8. Cosequential Processing and the Sorting of Large Files

Using Two Disks with Replacement Selection

Two disk drives
  input and output can overlap
  transmission time is reduced by 50%
  seeking is virtually eliminated
Sort phase
  run selection and output can overlap
Merge phase
  the output disk becomes the input disk, and vice versa
  seeking will occur on the input disk; output is sequential
  merge and transmission time are substantially reduced

Page 63: Chap 8. Cosequential Processing and the Sorting of Large Files

Figure: Memory organization for replacement selection (labels: disk1, disk2, input buffers, heap, output buffers)

Page 64: Chap 8. Cosequential Processing and the Sorting of Large Files

More Drives? More Processors?

More drives?
  add drives until I/O becomes so fast that processing cannot keep up with it
More processors?
  mainframes
  vector and array processors
  massively parallel machines
  very fast local area networks

Page 65: Chap 8. Cosequential Processing and the Sorting of Large Files

Effects of Multiprogramming

Multiprogramming increases the efficiency of the overall system by overlapping processing and I/O

Its effects on a particular sort are very hard to predict

Page 66: Chap 8. Cosequential Processing and the Sorting of Large Files

A Concept Toolkit for External Sorting

For in-RAM sorting, use heapsort
Use as much RAM as possible
Use a multiple-step merge when the number of initial runs is so large that seek and rotation time is much greater than transmission time
Use replacement selection when there is a possibility that the records are partially ordered
Use more than one disk drive and I/O channel
  reading and writing can overlap
Look for ways to take advantage of new architectures and systems
  parallel processing or high-speed networks

Page 67: Chap 8. Cosequential Processing and the Sorting of Large Files

Sorting Files on Tape

Balanced merge with several tape drives

         Tape   Contains runs
Step 1   T1     R1 R3 R5 R7 R9
         T2     R2 R4 R6 R8 R10
         T3     --
         T4     --

Figure 8.28 (2-way balanced 4-tape merge)

P is the number of passes, N is the number of runs, and k is the number of input drives; then $P = \lceil \log_k N \rceil$
  4 tape drives (2 for input, 2 for output), 10 runs ==> 4 passes
  20 tape drives (10 for input, 10 for output), 200 runs ==> 3 passes
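The pass count can be checked with a few lines of Python. This is a small sketch; the function name `merge_passes` is made up for illustration. It uses repeated ceiling division rather than a floating-point logarithm, which gives the same value as the ceiling of log_k N while avoiding rounding problems when N is an exact power of k.

```python
def merge_passes(num_runs: int, input_drives: int) -> int:
    """Number of passes a balanced k-way merge needs to reduce num_runs to one."""
    if num_runs < 1 or input_drives < 2:
        raise ValueError("need at least 1 run and at least 2 input drives")
    passes = 0
    runs = num_runs
    while runs > 1:
        # Each pass merges groups of k runs, leaving ceil(runs / k) runs.
        runs = -(-runs // input_drives)
        passes += 1
    return passes


if __name__ == "__main__":
    print(merge_passes(10, 2))    # 4 tape drives, 2 for input
    print(merge_passes(200, 10))  # 20 tape drives, 10 for input
```

The two calls correspond to the two cases on the slide, 4 passes and 3 passes respectively.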

Page 68: Chap 8. Cosequential Processing and the Sorting of Large Files

Sorting Files on Tape

Other ways of merging on four tapes (each number is the length of a run; a dot marks a run that has already been read)

(Fig 8.30)
         T1          T2          T3       T4
Step 1   1 1 1 1 1   1 1 1 1 1   --       --
Step 2   --          --          2 2 2    2 2
Step 3   4           4           . . 2    --
Step 4   --          --          --       10

(Fig 8.31)
         T1          T2       T3     T4
Step 1   1 1 1 1 1   1 1 1    1 1    --
Step 2   . . 1 1 1   . . 1    --     3 3
Step 3   . . . 1 1   --       5      . 3
Step 4   . . . . 1   4        5      --
Step 5   --          --       --     10

Page 69: Chap 8. Cosequential Processing and the Sorting of Large Files

K-way Balanced Merge on Tapes

Some difficult questions

How does one choose an initial distribution that leads readily to an efficient merge pattern?

Are there algorithmic descriptions of the merge patterns, given an initial distribution?

Given N runs and J tape drives, is there some way to compute the optimal merging performance so we have a yardstick against which to compare the performance of any specific algorithm?

Page 70: Chap 8. Cosequential Processing and the Sorting of Large Files

Unix: Sorting and Cosequential Processing

Sorting in Unix
  the Unix sort command
  the qsort library routine

Cosequential processing utilities in Unix
  compare: cmp
  difference: diff
  common: comm
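As an illustration of why these utilities count as cosequential processing, the sketch below walks two sorted lists in step and sorts each item into one of three groups, which is the idea behind the three output columns of comm. It is not how comm is implemented; the function name `comm_lists` and the sample names are invented for this example.

```python
from typing import List, Sequence, Tuple


def comm_lists(a: Sequence[str], b: Sequence[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split two sorted lists into (only in a, only in b, in both)."""
    only_a: List[str] = []
    only_b: List[str] = []
    both: List[str] = []
    i = j = 0
    # Advance through both lists together, always moving past the smaller item.
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            only_a.append(a[i])
            i += 1
        elif a[i] > b[j]:
            only_b.append(b[j])
            j += 1
        else:
            both.append(a[i])
            i += 1
            j += 1
    # Whatever remains in one list has no partner in the other.
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return only_a, only_b, both


if __name__ == "__main__":
    list1 = ["ADAMS", "CARTER", "DAVIS"]
    list2 = ["ADAMS", "BECH", "DAVIS", "ROSEWALD"]
    print(comm_lists(list1, list2))
```

Both inputs must already be sorted, which is the same precondition the cosequential model places on its input lists.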

Page 71: Chap 8. Cosequential Processing and the Sorting of Large Files

Let's Review!

8.1 Cosequential operations

8.2 Application of the Model to a General Ledger Program

8.3 Extension of the Model to Include Multiway Merging

8.4 A Second Look at Sorting in Memory

8.5 Merging as a Way of Sorting Large Files on Disk

8.6 Sorting Files on Tape

8.7 Sort-Merge Packages

8.8 Sorting and Cosequential Processing in Unix