PARALLEL PROGRAMMING (并行程序设计)
Pingpeng Yuan




Slide 1
PARALLEL PROGRAMMING
Pingpeng Yuan

Slide 2
PARALLEL PROGRAMMING: What, Why, How, Goal, Exam

Slide 3
WHAT IS PARALLEL PROGRAMMING?
Coordinating multiple processing elements to solve a problem.

Slide 4
PARALLELISM: A SIMPLISTIC UNDERSTANDING
Multiple tasks at once: distribute work across multiple execution units. Two approaches: data parallelism, and functional (control) parallelism.

Slide 5
WHY
Technology trends. Application needs.

Slide 6
HUMAN ARCHITECTURE! GROWTH PERFORMANCE
(Chart: growth vs. age, 5-45 years; vertical growth early, horizontal growth later.)

Slide 7
COMPUTATIONAL POWER IMPROVEMENT
(Chart: computational power vs. number of processors, uniprocessor vs. multiprocessor.)

Slide 8
GENERAL TECHNOLOGY TRENDS
Microprocessor performance increases 50%-100% per year. Clock frequency doubles every 3 years. Transistor count quadruples every 3 years.

Slide 9
CLOCK FREQUENCY GROWTH RATE (INTEL FAMILY)
About 30% per year.

Slide 10
INTEL MANY INTEGRATED CORE (MIC)
32-core version of MIC.

Slide 11
TILERA'S 100 CORES (JUNE 2011)
Tilera has introduced a range of processors (the 64-bit Gx family: 36, 64, and 100 cores), aiming to take on Intel in servers that handle high-throughput web applications. 64-bit cores running at up to 1.5 GHz, manufactured in 40 nm technology.

Slide 12
TOP500
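The two approaches named on Slide 4 can be sketched in a few lines of Python. This is an illustrative example, not from the slides; the `square` function and the worker counts are arbitrary choices.

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: the SAME operation applied to different pieces of data.
def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each worker squares a different slice of the input.
    squares = list(pool.map(square, range(8)))

# Functional (control) parallelism: DIFFERENT operations run concurrently,
# here over the same data.
data = list(range(8))
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, data)    # one task computes the sum
    largest = pool.submit(max, data)  # another task finds the maximum
    results = (total.result(), largest.result())

print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(results)  # (28, 7)
```

Real data-parallel codes partition large arrays across processors and real task-parallel codes pipeline distinct stages, but the coordination pattern is the same as in this sketch.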
Paradigm change in HPC.

Slide 13
GPU ARCHITECTURE
NVIDIA Fermi: 512 processing elements (PEs).

Slide 14
THE GAP BETWEEN CPU AND GPU
(ref: Tesla GPU Computing Brochure)

Slide 15
GPU WILL TOP THE LIST IN NOV 2010

Slide 16
TRANSISTOR COUNT GROWTH RATE (INTEL FAMILY)
Transistor count grows much faster than clock rate: about 40% per year, an order of magnitude more contribution over two decades.

Slide 17
HOW TO USE MORE TRANSISTORS
Improve single-threaded performance via architecture (not keeping up with the potential offered by the technology). Use transistors for memory structures to improve data locality. Use parallelism: instruction-level and thread-level.

Slide 18
SIMILAR STORY FOR STORAGE (TRANSISTOR COUNT)

Slide 19
TRENDS IN DRAM CAPABILITIES
DRAM densities double every 3 years; projections for densities have been revised downwards over time; current densities are 4 Gb/die. DRAM data rates double every 4-5 years; projections for data rates have been revised upwards over time; current data rates are 2.2 Gb/s.

Slide 20
SIMILAR STORY FOR STORAGE
(Chart: 1980-95, processor performance grew about 1000x (50% per year) while memory improved about 3% per year, only 2x over the same period; caches help bridge the gap.)

Slide 21
MEMORY HIERARCHY
CPU registers: ~100 bytes, < 1 ns
L1 cache: 32 KB, 1 ns
L2 cache: 256 KB, 4 ns
Primary memory: 1 GB, 60 ns
Secondary storage: 1 TB, 10 ms
Tertiary storage: 1 PB, 1 s to 1 hr

Slide 22
SIMILAR STORY FOR STORAGE
(Chart: bit density trends.)

Slide 23
DISK TRENDS
Disks too: parallel disks plus caching. Disk capacity, 1975-1989: doubled every 3+ years (25% improvement each year, a factor of 10 every decade); still exponential, but far less rapid than processor performance. Disk capacity, 1990-recently: doubling every 12 months (100% improvement each year, a factor of 1000 every decade); capacity growth 10x as fast as processor performance!
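The doubling periods quoted on Slides 8, 9, and 23 are consistent with each other, which a little arithmetic confirms. The following check is illustrative and not part of the slides:

```python
# Doubling every T years means an annual growth factor of 2**(1/T).
def annual_rate(doubling_years):
    return 2 ** (1.0 / doubling_years) - 1.0

# Clock frequency doubling every 3 years (Slide 8) is about 26% per year,
# close to the ~30%/year growth rate quoted on Slide 9.
clock = annual_rate(3)            # ~0.26

# Disk capacity doubling every 12 months (Slide 23, 1990-recently):
# 100% per year, i.e. a factor of 2**10 = 1024 ~ 1000 per decade.
decade_factor_fast = 2 ** 10      # 1024

# Disk capacity doubling every ~3 years (Slide 23, 1975-1989):
# roughly a factor of 10 per decade.
decade_factor_slow = 2 ** (10 / 3)  # ~10.1
```

The same conversion explains Slide 16: 40% per year compounds to about 10x in 7 years and well over an order of magnitude across two decades.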
Slide 24
DISK TRENDS
Only a few years ago we purchased disks by the megabyte. Today 1 GB (a billion bytes) costs $1, then $0.50, now $0.05 from Dell; so 1 TB costs $1K, then $500, now $50, and 1 PB costs $1M, then $500K, now $50K. The technology is amazing: like flying a 747 6 above the ground while reading/writing a strip of postage stamps.

Slide 25

Slide 26
COMMODITY COMPUTER SYSTEMS
1946-2003, general-purpose computing was serial, with clocks going from 5 KHz to 4 GHz. From 2004, general-purpose computing goes parallel: clock frequency growth is flat. Transistors per chip, 1980 to 2011: 29K to 30B! Cores per chip: roughly doubling each year since 2003 (~2^(y-2003)).

Slide 27
"If you want your program to run significantly faster, you're going to have to parallelize it."

Slide 28
DRIVERS OF PARALLEL COMPUTING: APPLICATION NEEDS
(ref: http://www.nvidia.com/object/tesla_computing_solutions.html)

Slide 29
APPLICATIONS OF PARALLEL PROCESSING

Slide 30

Slide 31
WHY DO WE NEED PARALLEL PROCESSING?
Reasonable running time = a fraction of an hour to several hours (10^3-10^4 s). In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.
Example 1: Southern oceans heat modeling (10-minute iterations). 4096 E-W regions x 1024 N-S regions x 12 layers in depth; 300 GFLOP per iteration x 300,000 iterations per 6 years = 10^16 FLOP.
Example 2: Fluid dynamics calculation (1000 x 1000 x 1000 lattice). 10^9 lattice points x 1000 FLOP/point x 10,000 time steps = 10^16 FLOP.
Example 3: Monte Carlo simulation of a nuclear reactor. 10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP.
Decentralized supercomputing (from MathWorld News, 2006/4/7): a grid of tens of thousands of networked computers discovers 2^30,402,457 - 1, the 43rd Mersenne prime, as the largest known prime (9,152,052 digits).

Slide 32

Slide 33

Slide 34

Slide 35
IDC: 2.7 ZB of data in 2012; 35 ZB by 2020.

Slide 36
WHAT MAKES IT BIG DATA?
Volume, velocity, variety, value (social media, blogs, smart meters, ...).

Slide 37
NUMBERS
How much data is there in the world? 800 terabytes in 2000; 160 exabytes in 2006; 500 exabytes (the Internet) in 2009; 2.7 zettabytes in 2012; 35 zettabytes by 2020. How much data is generated in ONE day? 7 TB on Twitter; 10 TB on Facebook. (Source: "Big Data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, 2011.)

Slide 38
BIG DATA USE CASES
Today's Challenge | New Data | What's Possible
Healthcare: expensive office visits | remote patient monitoring | preventive care, reduced hospitalization
Manufacturing: in-person support | product sensors | automated diagnosis and support
Location-based services: based on home zip code | real-time location data | geo-advertising, traffic, local search
Public sector: standardized services | citizen surveys | tailored services, cost reductions
Retail: one-size-fits-all marketing | social media | sentiment analysis, segmentation

Slide 39
HOW

Slide 40
PARALLEL PROGRAMMING
Parallel architectures. Parallel algorithms. Parallel programming.

Slide 41
GOAL
Most people in the research community agree that at least two kinds of parallel programmers will be important to the future of computing: programmers who understand how to write software but are naive about parallelization and mapping to architecture, and programmers who are knowledgeable about parallelization and mapping to architecture, and so can achieve high performance.

Slide 42
32 4: +4 24

Slide 43
+1 doc; +20 80; 1 doc
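Slide 27's claim, that a program must be parallelized to run significantly faster, is usually quantified by Amdahl's law, which the deck does not state explicitly. A minimal sketch, where the parallel fraction p and processor counts are illustrative values:

```python
# Amdahl's law: if a fraction p of a program parallelizes perfectly
# across n processors, the overall speedup is bounded by the serial part.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# A program that is 90% parallel gains less than 10x even on
# 1000 processors: the serial 10% dominates the running time.
s = speedup(0.9, 1000)        # ~9.91
limit = speedup(0.9, 10**9)   # approaches 1 / (1 - 0.9) = 10
```

This is why the course's second kind of programmer (Slide 41), one who understands parallelization and the mapping to architecture, matters: high performance requires driving the serial fraction down, not just adding cores.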