36
IBM Tokyo Research Laboratory September 18 th , 2007 PACT2007 at Brasov, Romania © 2007 IBM Corporation AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue , Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

September 18th, 2007 PACT2007 at Brasov, Romania © 2007 IBM Corporation

AA-Sort:A New Parallel Sorting Algorithmfor Multi-Core SIMD Processors

Hiroshi Inoue, Takao Moriyama,Hideaki Komatsu and Toshio NakataniIBM Tokyo Research Laboratory

Page 2: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Goal

Outperform existing algorithms, such as Quick sort, by exploiting SIMD instructions on a single thread

Scale well by using thread-level parallelism of multiple cores

Develop a fast sorting algorithm by exploiting the following features of today’s processors

►SIMD instructions

►Multiple Cores (Thread-level parallelism)

Page 3: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Benefit and limitation of SIMD instructions

Benefits: SIMD selection instructions (e.g. select minimum instruction) can accelerate sorting by

– parallelizing comparisons

– avoiding unpredictable conditional branches

A0 A1 A2 A3

B0 B1 B2 B3

C0 C1 C2 C3

MIN MIN MIN MIN

input 1

input 2

output

Limitation: SIMD load / store instructions can be effective only when they access contiguous 128 bits of data (e.g. four 32-bit values) aligned on 128-bit boundary

Page 4: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

SIMD instructions are effective for some existing sorting algorithms but slower than Quick sort

Such as Bitonic merge sort, Odd-even merge sort, and GPUTeraSort [Govindaraju ’05]They are slower than Quick sort for sorting a large number (N) of elements

– Their computational complexity is O(N (log (N))2)

1 865 7 4 3 2

A step of Bitonic merge sort

MIN MAX

Page 5: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Presentation Outline

Motivation

AA-sort: Our new sorting algorithm

Experimental results

Summary

Page 6: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

What is Comb sort?

Comb sort [Lacey '91] is an extension to Bubble sortIt compares two non-adjacent elements

distance = 6

starting to compare two elements with large distance

2 567 8 4 1 3

Page 7: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

What is Comb sort?

Comb sort [Lacey '91] is an extension to Bubble sortIt compares two non-adjacent elements

starting to compare two elements with large distance

2 567 8 4 1 3

decreasing distance by dividing a constant called shrink factor (1.3) for each iteration

distance = distance / 1.3 = 4

Page 8: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

What is Comb sort?

Comb sort [Lacey '91] is an extension to Bubble sortIt compares two non-adjacent elements

starting to compare two elements with large distance

2 567 8 4 1 3

decreasing distance by dividing a constant called shrink factor (1.3) for each iteration

distance = distance / 1.3 = 3

Page 9: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

What is Comb sort?

Comb sort [Lacey '91] is an extension to Bubble sortIt compares two non-adjacent elements

starting to compare two elements with large distance

2 567 8 4 1 3

decreasing distance by dividing a constant called shrink factor (1.3) for each iteration

distance = distance / 1.3 = 2

Page 10: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

What is Comb sort?

Comb sort [Lacey '91] is an extension to Bubble sortIt compares two non-adjacent elements

starting to compare two elements with large distance

2 567 8 4 1 3

repeat until all data are sorted with distance = 1 average complexity

N log (N)

decreasing distance by dividing a constant called shrink factor (1.3) for each iteration

distance = distance / 1.3 = 1

Page 11: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

123 45 6 789 1011 12

832 61 10 12115 94 7

sorted

input data

sorted output data

SIMD instructions are not effective for Comb sort due to

Our technique to SIMDize the Comb sort

► unaligned memory accesses► loop-carried dependencies

Page 12: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

123 45 6 789 1011 12

832 61 10 12115 94 7

sorted

1174 51 6 1292 310 8

input data

sorted output data

Transposed order

SIMD instructions are effective for Comb sort into the Transposed order

Reorder after sorting

Our technique to SIMDize the Comb sort

Page 13: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

123 45 6 789 1011 12

832 61 10 12115 94 7

sorted

1152 8

input data

sorted output data

assume four elements in one vector

741 10 6 1293

Our technique to SIMDize the Comb sort

Transposed order

Page 14: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

123 45 6 789 1011 12

832 61 10 12115 94 7

sorted

11

74

5

1

6 129

2

3

10

8

input data

sorted output data

Our technique to SIMDize the Comb sort

► no unaligned access► no loop-carried dependency

► unaligned access► loop-carried dependency

Transposed order

Page 15: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Analysis of our SIMDized Comb sort

N/AO(N log(N))almost 0Number of

unaligned memory accesses

O(N log(N))O(N log(N))O(N log(N))

reordering: O(N)

Computational complexity

O(N log(N))almost 0almost 0

Number of unpredictable conditional branches

original (scalar) Comb sort

naively SIMDized Comb sort

our SIMDized Comb sort

Page 16: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Overview of the AA-Sort

It consists of two phases:

– Phase 1: SIMDized Comb sort

– Phase 2: SIMDized Merge sort

block1 block2 block3 block4

sort each block using SIMDized Comb sort

merge the blocks using SIMDized Merge sort

input datadivide input data into small blocks that can fit into (L2) cache

sorted

Phase 1

Phase 2

Page 17: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Our approach to SIMDize merge operations

SIMD instructions are effective for Bitonic merge or Odd-even mergeTheir computational complexity are higher than usual merge operations

Our solution

– Integrate Odd-even merge into the usual merge operation to take advantage of SIMD instructions while keeping the computational complexity of usual merge operation

Page 18: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

2 3

Odd-even merge for values in two vector registers

Inputtwo vector registers contain four presorted values in each

Outputeight values in two vector registered are now sorted

Odd-even Merge one SIMD comparison and “shuffle” operations for each stage

No conditional branches!

1 4 7 8

stage 1

stage 2

stage 3

input

output

< < < <

< <

< <<

sorted sorted

sorted

2 3 5 6

1 4 7 85 6

Page 19: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

21 23

17 18

Our technique to integrate Odd-even merge into usual merge

sorted array 1 1 4 7 9

sorted array 2 2 3 5 6

sorted

sorted

10 11 12 14

8 13 15 16

Page 20: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

21 23

17 18

sorted array 1

sorted array 2

1 4 7 9 2 3 5 6

sorted sorted

vector registers

1 4 7 9

2 3 5 6

10 11 12 14

8 13 15 16

Odd-even merge

Our technique to integrate Odd-even merge into usual merge

Page 21: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

21 23

17 18

1 4 7 9

2 3 5 6

Our technique to integrate Odd-even merge into usual merge

sorted array 1

sorted array 2

· · ·

· · ·

1 4 7 92 3 5 6

sorted

vector registers

10 11 12 14

8 13 15 16

Page 22: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

1 4 7 9

2 3 5 6

21 23

17 18

Our technique to integrate Odd-even merge into usual merge

sorted array 1 10 11 12 14

sorted array 2

vector registers

merged array 1 42 3

8 13 15 16

7 95 6

use a scalar comparisonto select array to load from

output smaller four values as merged array

sorted

Page 23: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

1 4 7 9

2 3 5 6

21 23

17 188 13 15 16

8 13 15 16

Our technique to integrate Odd-even merge into usual merge

sorted array 1

sorted array 2

vector registers

merged array

Odd-even merge

7 95 6

1 42 3

10 11 12 14

sorted

Page 24: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

1 4 7 9

2 3 5 6 8 13 15 16

10 11 12 14

13 15 169

8

Our technique to integrate Odd-even merge into usual merge

sorted array 1 21 23

sorted array 2 17 18

vector registers

merged array 75 6

sorted

1 42 3

Page 25: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Comparing merge operations

O(N)O(N log(N))O(N)Computational

complexity

1 for every output element

01 for every

output vector

Number of unpredictable conditional branches

usual (scalar) merge operation

odd-even merge implemented with SIMD

our integrated merge

operation

Page 26: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Parallelizing AA-sort among multiple threads

block1 block2 block3 block4

thread 1 thread 2

thread 1, 2sorted

each thread executes independent merge operations

each thread sorts independent blocks using SIMDized Comb sort

multiple threads cooperate on one merge operation

Phase 1

Phase 2

Page 27: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Presentation Outline

Motivation

AA-sort: Our new sorting algorithm

Experimental results

Summary

Page 28: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Environment for Performance Evaluation

We used two processors for performance evaluation

– PowerPC 970 using VMX instructions (up to 4 cores)

– Cell BE using SPE cores (up to 16 SPE cores)

We compared the performance of four algorithms

– AA-sort

– GPUTeraSort [Govindaraju '05]

– ESSL (IBM’s optimized library)

– STL delivered with GCC (open source library)

Page 29: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

0.01

0.1

1

10

100

1 M 2 M 4 M 8 M 16 M 32 M 64 M 128 M

number of elements

exec

utio

n tim

e (s

ec) .

AA-sort

GPUTeraSort

ESSL

STL

Single-thread performance on PowerPC 970 for various input size

sorting 32-bit random integers on one core of PowerPC 970

faster

Page 30: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

0

1

2

3

4

5

AA-sort GPUTeraSort ESSL STL

algorithm

exec

utio

n tim

e (s

ec) .

Single-thread performance on PowerPC 970

x1.7

x3.0x3.3

sorting 32 M elements of random 32-bit integers on PowerPC 970

faster

Page 31: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

0

0.5

1

1.5

2

2.5

3

0 1 2 3 4 5

number of cores

rela

tive

thro

ughp

utov

er A

A-s

ort

usin

g 1

core

. AA-sort

GPUTeraSort

Scalability with multiple cores on PowerPC 970faster

x2.6 by 4 cores

x2.1 by 4 cores

sorting 32 M elements of random 32-bit integers on PowerPC 970

Page 32: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

0

2

4

6

8

10

12

14

0 4 8 12 16

number of cores

rela

tive

thro

ughp

utov

er A

A-s

ort

usin

g 1

core

. AA-sort

GPUTeraSort

Scalability with multiple cores on Cell BE

12.2x by 16 cores7.4x

by 8 cores

sorting 32 M elements of random 32-bit integers on Cell BE

faster

Page 33: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Summary

We proposed a new sorting algorithm called AA-sort, which can take advantage of

– SIMD instructions

– Multiple Cores (thread-level parallelism)

We evaluated the AA-sort on PowerPC 970 and Cell BE

– Using only 1 core of PowerPC 970, the AA-sort outperformed

• IBM’s ESSL by 1.7x

• GPUTeraSort by 3.3x

– On Cell BE, the AA-sort showed good scalability

• 8 cores 7.4x

• 16 cores 12.2x

Page 34: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Thank you for your attention!

Page 35: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

SIMD instructions are not effective for Quick sort

SIMD instructions are NOT effective for Quick sort, which requires element-wise and unaligned memory accesses

123 45 6 1389 1011 12

search an element larger than the pivot

search an element smaller than the pivot

swapelement-wise and unaligned memory accesses

A step of Quick sort (pivot = 7)

Page 36: AA-Sort - IBM · AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani IBM Tokyo Research Laboratory

IBM Tokyo Research Laboratory

A New Parallel Sorting Algorithm for Multi-Core SIMD Processors © 2007 IBM Corporation

Transposed order

To see the values to sort as a two dimensional array

0 1 2 3

4 5 6 7

8 9 10 11

N-4 N-3 N-2 N-1

Original order= row major

column: elements in a vector

row

: ar

ray

of v

ecto

rs

0 n 2n 3n1 n+1 2n+1 3n+1

2 n+2 2n+2 3n+2

n-1 2n-1 3n-1 N-1

Transposed order= column major

adjacent values are stored in one vector

adjacent values are stored in same slot of adjacent vectors