Part I: Fundamental Concepts (A. Broumandnia, [email protected])


1. Introduction to Parallelism

Topics in this chapter

1.1 What is parallelism?
1.2 An example of parallelism
1.3 The rise and fall of parallel processing
1.4 Types of parallelism
1.5 Roadblocks to parallel processing
1.6 Effectiveness of parallel processing


1.1 What Is Parallelism?


The exponential growth of microprocessor performance is known as Moore's law: speed doubles every 18 months.

[Figure: processor performance (KIPS, MIPS, GIPS, TIPS) versus calendar year (1980 to 2010), growing by a factor of about 1.6 per year. Plotted processors: 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, R10000.]
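The "doubles every 18 months" rule and the "1.6 / yr" slope on the plot describe the same growth rate. The short sketch below checks that arithmetic; the function name `annual_growth` is only illustrative and does not come from the slides.

```python
def annual_growth(doubling_months):
    """Yearly growth factor implied by doubling once every `doubling_months` months."""
    return 2 ** (12 / doubling_months)

factor = annual_growth(18)      # about 1.59, i.e. roughly 1.6x per year
per_decade = factor ** 10       # compound growth over ten years

print(f"growth per year:   {factor:.2f}x")
print(f"growth per decade: {per_decade:.0f}x")
```

At this rate performance grows by roughly two orders of magnitude per decade, which matches the step from MIPS toward GIPS and TIPS on the plot's vertical axis.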

Evaluating the Performance/Cost of Computers


[Figure: Mental power in four scales. From: “Robots After All,” by H. Moravec, CACM, pp. 90-97, October 2003.]

The Semiconductor Technology Roadmap

Calendar year            2001   2004   2007   2010   2013   2016
Half-pitch (nm)           140     90     65     45     32     22
Clock frequency (GHz)       2      4      7     12     20     30
Wiring levels               7      8      9     10     10     10
Power supply (V)          1.1    1.0    0.8    0.7    0.6    0.5
Maximum power (W)         130    160    190    220    250    290


From the 2001 edition of the roadmap [Alla02]

Performance is measured by the number of instructions executed per second.
Units: MIPS, GIPS, TIPS, PIPS

For numerical processors, performance is measured by the number of floating-point operations per second.
Units: MFLOPS, GFLOPS, TFLOPS, PFLOPS


Why Is High Performance Needed?


• Higher speed (solve problems faster)
  - Weather forecasting
  - Soft and hard deadlines
• Higher throughput (solve more problems)
  - Transaction processing
• Higher computational power (solve larger problems)
  - Weather forecast for a week in less than 24 hours

Categories of supercomputers
• Uniprocessor; aka vector machine
• Multiprocessor; centralized or distributed shared memory
• Multicomputer; communicating via message passing
• Massively parallel processor (MPP; 1K or more processors)

The Speed-of-Light Argument


• The speed of light is approximately 30 cm/ns.
• A signal travels through copper wire at about one-third the speed of light.
• If signals must travel 1 cm during the execution of an instruction, that instruction takes at least 0.1 ns, so performance is limited to 10 GIPS.
• This limit is eased to some extent by miniaturization and by architectural methods such as cache memory.
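The bound above follows from two divisions. The sketch below reproduces it; the constant and function names are illustrative, and the one-third factor for copper is the slide's own approximation.

```python
LIGHT_SPEED_CM_PER_NS = 30.0                       # approximate speed of light
SIGNAL_SPEED_CM_PER_NS = LIGHT_SPEED_CM_PER_NS / 3 # slide's estimate for copper wire

def min_instruction_time_ns(distance_cm):
    """Shortest possible instruction time if signals must cover `distance_cm`."""
    return distance_cm / SIGNAL_SPEED_CM_PER_NS

def max_rate_gips(distance_cm):
    """Upper bound on instruction rate; one instruction per ns equals 1 GIPS."""
    return 1.0 / min_instruction_time_ns(distance_cm)

print(min_instruction_time_ns(1.0), "ns per instruction")
print(max_rate_gips(1.0), "GIPS at most")
```

Halving the distance doubles the bound, which is why miniaturization and keeping data close to the processor (as a cache does) ease the limit.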

How does parallel processing help? Wouldn’t multiple processors need to communicate via signals as well?

Why Is TIPS/TFLOPS Performance Needed?


Reasonable running time = fraction of an hour to several hours ($10^3$ to $10^4$ s)
In this time, a TIPS/TFLOPS machine can perform $10^{15}$ to $10^{16}$ operations

Example 1: Southern oceans heat modeling (10-minute iterations)
300 GFLOP per iteration × 300 000 iterations per 6 yrs = $10^{16}$ FLOP
[Figure: ocean model grid with 4096 E-W regions × 1024 N-S regions × 12 layers in depth.]

Example 2: Fluid dynamics calculations (1000 × 1000 × 1000 lattice)
$10^9$ lattice points × 1000 FLOP/point × 10 000 time steps = $10^{16}$ FLOP

Example 3: Monte Carlo simulation of nuclear reactor
$10^{11}$ particles to track (for 1000 escapes) × $10^4$ FLOP/particle = $10^{15}$ FLOP

Decentralized supercomputing (from Mathworld News, 2006/4/7): Grid of tens of thousands networked computers discovers $2^{30\,402\,457} - 1$, the 43rd Mersenne prime, as the largest known prime (9 152 052 digits)
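The operation counts for the fluid dynamics and Monte Carlo examples can be checked with integer arithmetic. This sketch assumes a machine sustaining exactly $10^{12}$ operations per second; the variable names are illustrative.

```python
TERA = 10**12                      # operations per second of a TIPS/TFLOPS machine

# Operation budget within a "reasonable" running time of 10^3 to 10^4 seconds.
budget_low = TERA * 10**3
budget_high = TERA * 10**4

# Fluid dynamics: lattice points x FLOP per point x time steps.
fluid_flop = 1000**3 * 1000 * 10_000

# Monte Carlo reactor simulation: particles x FLOP per particle.
monte_carlo_flop = 10**11 * 10**4

for name, flop in [("fluid dynamics", fluid_flop), ("Monte Carlo", monte_carlo_flop)]:
    seconds = flop // TERA
    fits = budget_low <= flop <= budget_high
    print(f"{name}: {flop:.0e} FLOP, {seconds} s at 1 TFLOPS, within budget: {fits}")
```

Both workloads land inside the $10^{15}$ to $10^{16}$ window, which is the point of the slide: these problems become practical only at TFLOPS-level performance.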

Why Is Parallel Processing Needed?


Parallelism = Concurrency
Doing more than one thing at a time

Has been around for decades, since early computers
I/O channels, DMA, device controllers, multiple ALUs

The sense in which we use it in this course
Multiple agents (hardware units, software processes) collaborate to perform our main computational task
- Multiplying two matrices
- Breaking a secret code
- Deciding on the next chess move

1.2 An Example of Parallelism


Finding the primes between 1 and 30 with the sieve method

(m marks the current prime, whose multiples are removed in the next pass)

Init.    2m 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Pass 1   2 3m 5 7 9 11 13 15 17 19 21 23 25 27 29
Pass 2   2 3 5m 7 11 13 17 19 23 25 29
Pass 3   2 3 5 7m 11 13 17 19 23 29

Every composite (non-prime) number is a multiple of a prime that is less than or equal to its square root.

Implementation on a Uniprocessor System


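The implementation needs a bit-vector, initialized to all 1s, and two integer variables.

[Figure: processor P holding the Current Prime and Index variables, with a bit-vector whose entries are numbered 1, 2, ..., n.]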
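A minimal uniprocessor sketch of the sieve, using the bit-vector and the two integer variables named in the figure. Two details are choices made here rather than taken from the slides: marking starts at the square of the current prime, and the outer loop stops once that square exceeds n (which the square-root property justifies).

```python
def sieve(n):
    """Return all primes in 2..n using a bit-vector and two integer variables."""
    if n < 2:
        return []
    bit_vector = [True] * (n + 1)          # True means "still a prime candidate"
    bit_vector[0] = bit_vector[1] = False
    current_prime = 2
    while current_prime * current_prime <= n:
        index = current_prime * current_prime
        while index <= n:                  # cross out multiples of the current prime
            bit_vector[index] = False
            index += current_prime
        current_prime += 1                 # advance to the next unmarked entry
        while current_prime <= n and not bit_vector[current_prime]:
            current_prime += 1
    return [k for k in range(2, n + 1) if bit_vector[k]]

print(sieve(30))
```

For n = 30 the loop runs for the current primes 2, 3, and 5 only, matching the three passes in the table.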

Parallel Implementation on a Shared-Memory Multiprocessor


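[Figure (b): schematic of a p-processor shared-memory system. Processors P1, P2, ..., Pp each keep their own Index; the bit-vector (entries 1 to n) and the Current Prime are held in shared memory, which also connects to an I/O device.]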
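The shared-memory scheme can be sketched with threads: the bit-vector and the current prime are shared, and each worker keeps a private index. This is an illustration of the coordination only, under the assumption that a lock protects the shared current prime; in standard CPython the global interpreter lock prevents real speedup. The names `parallel_sieve` and `next_candidate` are hypothetical.

```python
import threading

def parallel_sieve(n, p):
    """Primes in 2..n, found by p worker threads sharing one bit-vector."""
    marked = [False] * (n + 1)       # shared: True means composite
    state = {"current": 1}           # shared "current prime"
    lock = threading.Lock()

    def next_candidate():
        # Atomically claim the next unmarked number beyond the current prime.
        with lock:
            k = state["current"] + 1
            while k * k <= n and marked[k]:
                k += 1
            state["current"] = k
            return k if k * k <= n else None

    def worker():
        while True:
            k = next_candidate()
            if k is None:
                return
            # Private index: step through the multiples of the claimed number.
            for index in range(k * k, n + 1, k):
                marked[index] = True

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [k for k in range(2, n + 1) if not marked[k]]

print(parallel_sieve(30, 3))
```

A worker may occasionally claim a composite that another worker has not yet crossed out. That wastes some work but never marks a prime, so the result stays correct.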

Serial and Parallel Running Times of the Sieve Algorithm


Serial and parallel simulation with n = 1000

[Figure: timing diagrams on a time axis from 0 to 1500, showing which prime (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31) each processor is handling at each moment.]

p = 1, t = 1411
p = 2, t = 706
p = 3, t = 499
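The three simulated times translate directly into speedup and efficiency. A small sketch, using only the numbers on the slide (the function names are illustrative):

```python
# Simulated running times for n = 1000, keyed by the number of processors p.
times = {1: 1411, 2: 706, 3: 499}

def speedup(p):
    """Serial time divided by the time with p processors."""
    return times[1] / times[p]

def efficiency(p):
    """Speedup per processor."""
    return speedup(p) / p

for p in sorted(times):
    print(f"p = {p}: speedup = {speedup(p):.2f}, efficiency = {efficiency(p):.2f}")
```

Two processors give a speedup of almost exactly 2, while three give about 2.83, so efficiency already drops to roughly 0.94 at p = 3.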

Implementing the Sieve Algorithm on a Multicomputer or Distributed System


Fig. 1.7 Data-parallel realization of the sieve of Eratosthenes.

[Figure: the list 1 to n is split into p blocks. P1 holds 1 to n/p, P2 holds n/p+1 to 2n/p, ..., Pp holds n–n/p+1 to n. Each processor has its own Current Prime and Index, and the processors are linked by a communication network.]

Assume at most $\sqrt{n}$ processors, so that all prime factors dealt with are in P1 (which broadcasts them): $\sqrt{n} < n/p$
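A sequential simulation of the data-parallel scheme: the range 1..n is cut into p blocks, the primes up to $\sqrt{n}$ (all inside the first block) play the role of P1's broadcast, and each block is sieved independently. The loop over blocks stands in for processors that would run concurrently; `small_primes` and `data_parallel_sieve` are illustrative names, not from the slides.

```python
import math

def small_primes(limit):
    """Primes up to `limit`; stands in for the primes P1 finds and broadcasts."""
    flags = [True] * (limit + 1)
    primes = []
    for k in range(2, limit + 1):
        if flags[k]:
            primes.append(k)
            for m in range(k * k, limit + 1, k):
                flags[m] = False
    return primes

def data_parallel_sieve(n, p):
    """Primes in 2..n, with 1..n split into p blocks sieved independently."""
    if p < 1 or p * p > n:
        raise ValueError("need 1 <= p <= sqrt(n)")
    block = (n + p - 1) // p                 # block size, rounded up
    broadcast = small_primes(math.isqrt(n))  # all lie within the first block
    result = []
    for i in range(p):                       # each iteration models one processor
        lo, hi = i * block + 1, min(n, (i + 1) * block)
        local = [True] * max(0, hi - lo + 1)
        for q in broadcast:
            # First multiple of q inside this block, but never below q*q.
            start = max(q * q, ((lo + q - 1) // q) * q)
            for m in range(start, hi + 1, q):
                local[m - lo] = False
        result.extend(k for k in range(max(lo, 2), hi + 1) if local[k - lo])
    return result

print(data_parallel_sieve(30, 5))
```

The only data that crosses block boundaries is the short list of broadcast primes, which is the communication the figure refers to.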

One Reason for Sublinear Speedup: Communication Overhead


Fig. 1.8 Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.

[Figure, two panels: solution time versus number of processors, split into computation and communication; ideal versus actual speedup.]

Another Reason for Sublinear Speedup: Input/Output Overhead


Fig. 1.9 Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.

[Figure, two panels: solution time, split into computation and a constant I/O time; ideal versus actual speedup.]

The Rise and Fall of Parallel Processing


Using thousands of “computers” (humans + calculators) for 24-hr weather prediction in a few hours

Fig. 1.10 Richardson’s circular theater for weather forecasting calculations. [Figure label: Conductor]

1960s: ILLIAC IV (U Illinois) – four 8 × 8 mesh quadrants, SIMD

1980s: Commercial interest – technology was driven by government grants & contracts. Once funding dried up, many companies went bankrupt

2000s: Internet revolution – info providers, multimedia, data mining, etc. need lots of power

Status of Computing Power (circa 2000)


GFLOPS on desktop: Apple Macintosh, with G4 processor

TFLOPS in supercomputer center:
1152-processor IBM RS/6000 SP (switch-based network)
Cray T3E, torus-connected

PFLOPS on drawing board: 1M-processor IBM Blue Gene (2005?)
32 proc’s/chip, 64 chips/board, 8 boards/tower, 64 towers
Processor: 8 threads, on-chip memory, no data cache
Chip: defect-tolerant, row/column rings in a 6 × 6 array
Board: 8 × 8 chip grid organized as 4 × 4 × 4 cube
Tower: Boards linked to 4 neighbors in adjacent towers
System: 32 × 32 × 32 cube of chips, 1.5 MW (water-cooled)

Corresponding levels for 2010: TFLOPS, PFLOPS, EFLOPS (Exa = $10^{18}$)

Types of Parallelism: A Taxonomy


The Flynn-Johnson classification of computer systems

Flynn’s categories
                           Single data stream          Multiple data streams
Single instr stream        SISD: Uniprocessors         SIMD: Array or vector processors
Multiple instr streams     MISD: Rarely used           MIMD: Multiprocessors or multicomputers

Johnson’s expansion (of MIMD)
                           Shared variables                        Message passing
Global memory              GMSV: Shared-memory multiprocessors     GMMP: Rarely used
Distributed memory         DMSV: Distributed shared memory         DMMP: Distributed-memory multicomputers

1.5 Roadblocks to Parallel Processing


Grosch’s law: Economy of scale applies, or power = cost$^2$
  No longer valid; in fact we can get more bang per buck in micros

Minsky’s conjecture: Speedup tends to be proportional to log p
  Has roots in analysis of memory bank conflicts; can be overcome

Tyranny of IC technology: Uniprocessors suffice (×10 faster/5 yrs)
  Faster ICs make parallel machines faster too; what about ×1000?

Tyranny of vector supercomputers: Familiar programming model
  Not all computations involve vectors; parallel vector machines

Software inertia: Billions of dollars investment in software
  New programs; even uniprocessors benefit from parallelism spec

Amdahl’s law: Unparallelizable code severely limits the speedup

Amdahl’s Law


Speedup limits under Amdahl’s law

[Figure: speedup s (0 to 50) versus enhancement factor p (0 to 50), with one curve each for f = 0, 0.01, 0.02, 0.05, and 0.1.]

$$ s = \frac{1}{f + (1 - f)/p} \le \min(p, 1/f) $$

f = fraction unaffected

p = speedup of the rest
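The formula is easy to evaluate for the f values shown on the plot. A minimal sketch (the function name `amdahl_speedup` is illustrative):

```python
def amdahl_speedup(f, p):
    """Speedup when a fraction f is unaffected and the rest runs p times faster."""
    return 1.0 / (f + (1.0 - f) / p)

p = 50
for f in (0.0, 0.01, 0.02, 0.05, 0.1):
    s = amdahl_speedup(f, p)
    bound = p if f == 0 else min(p, 1.0 / f)
    print(f"f = {f:<4}  s = {s:6.2f}  (bound min(p, 1/f) = {bound:g})")
```

Even with p = 50, an unaffected fraction of only 0.1 keeps the speedup below 10, which is why the curves on the plot flatten so quickly.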

Effectiveness of Parallel Processing


p       Number of processors
W(p)    Work performed by p processors
T(p)    Execution time with p processors; $T(1) = W(1)$; $T(p) \le W(p)$
S(p)    Speedup = $T(1) / T(p)$
E(p)    Efficiency = $T(1) / [p\,T(p)]$
R(p)    Redundancy = $W(p) / W(1)$
U(p)    Utilization = $W(p) / [p\,T(p)]$
Q(p)    Quality = $T^3(1) / [p\,T^2(p)\,W(p)]$

[Figure: task graph with 13 nodes, numbered 1 to 13.]

The task graph is a fundamental limit on speedup and parallelism.

$W(1) = 13$
$T(1) = 13$
$T(\infty) = 8$


Fig. 1.14 Computation graph for adding 16 integers.

[Figure: binary-tree computation graph; the 16 numbers to be added feed 15 “+” nodes arranged in 4 levels, producing Sum.]

Example: adding 16 numbers on 8 processors, where each addition takes one time unit.

Zero communication time
S(8) = 15 / 4 = 3.75
E(8) = 15 / (8 × 4) = 47%
R(8) = 15 / 15 = 1
Q(8) = 1.76

Unit communication time
S(8) = 15 / 7 = 2.14
E(8) = 15 / (8 × 7) = 27%
R(8) = 22 / 15 = 1.47
Q(8) = 0.39
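The figures in this example follow from the definitions of S, E, R, U, and Q given earlier. The sketch below recomputes them from T(1), T(p), and W(p); the helper name `metrics` and the dictionary layout are illustrative choices.

```python
def metrics(p, t1, tp, wp):
    """Effectiveness measures from T(1) = W(1) = t1, T(p) = tp, and W(p) = wp."""
    w1 = t1
    return {
        "S": t1 / tp,                      # speedup
        "E": t1 / (p * tp),                # efficiency
        "R": wp / w1,                      # redundancy
        "U": wp / (p * tp),                # utilization
        "Q": t1**3 / (p * tp**2 * wp),     # quality
    }

# Adding 16 numbers takes 15 additions, so T(1) = W(1) = 15.
zero_comm = metrics(p=8, t1=15, tp=4, wp=15)   # zero communication time
unit_comm = metrics(p=8, t1=15, tp=7, wp=22)   # unit communication time

for label, m in [("zero communication", zero_comm), ("unit communication", unit_comm)]:
    print(label, {name: round(value, 2) for name, value in m.items()})
```

With unit communication time the work grows from 15 to 22 units and the time from 4 to 7, so the quality measure falls from about 1.76 to about 0.39.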

ABCs of Parallel Processing in One Slide


A  Amdahl’s Law (Speedup Formula)
Bad news – Sequential overhead will kill you, because:
Speedup = $T_1/T_p \le 1/[f + (1 - f)/p] \le \min(1/f, p)$
Moral: For f = 0.1, speedup is at best 10, regardless of peak OPS.

B  Brent’s Scheduling Theorem
Good news – Optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
$T_1/p \le T_p \le T_1/p + T_\infty = (T_1/p)[1 + p/(T_1/T_\infty)]$
Result: For a reasonably parallel task (large $T_1/T_\infty$), or for a suitably small p (say, $p \le T_1/T_\infty$), good speedup and efficiency are possible.

C  Cost-Effectiveness Adage
Real news – The most cost-effective parallel solution may not be the one with highest peak OPS (communication?), greatest speed-up (at what cost?), or best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
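Brent's bounds can be tried on the 16-number addition from the earlier slide, taking the zero-communication values T1 = 15, T∞ = 4 (the four levels of the addition tree), and p = 8. This is a sketch under those assumptions; `brent_bounds` is an illustrative name.

```python
def brent_bounds(t1, t_inf, p):
    """Lower and upper bounds on T_p from Brent's scheduling theorem."""
    lower = t1 / p
    upper = t1 / p + t_inf
    return lower, upper

t1, t_inf, p = 15, 4, 8
lower, upper = brent_bounds(t1, t_inf, p)

# The factored form of the upper bound, (T1/p) * [1 + p / (T1/T_inf)].
factored = (t1 / p) * (1 + p / (t1 / t_inf))

print(f"{lower} <= T_p <= {upper}")
print(f"factored upper bound: {factored}")
```

The observed T(8) = 4 for that example lies between the two bounds. Here p exceeds T1/T∞ = 3.75, so the upper bound is loose; it tightens relative to T1/p as p gets smaller.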
