Parallel Vision by GPGPU/CUDA

DESCRIPTION

Academic talk made by Yuan-Kai Wang

Citation preview

Wang, Yuan-Kai () Parallel Vision with GPGPU/CUDA

2. What about this Talk
   The Multicore Era: It's time for Parallel Computing
   GPGPU/CUDA: GPGPU Architecture; Parallel Programming by CUDA
   Some Examples: Image Restoration (Retinex); Feature Extraction (SIFT); Video Cloud Computing

3. Part 1: The Multicore Era for Computer Vision
   Paradigm shift from Clock Speed Race to Multicore Race
   Some examples of Multicore

4. Multicore Computing
   What is multicore: combine multiple processors into a single chip.
   Multicore computing is inevitable.

5. Moore's Law
   In 1965, Gordon Moore (Intel co-founder) predicted that the number of transistors on an IC would double at a regular interval (every year in the 1965 paper, revised to every two years in 1975; 18 months is the commonly quoted figure).
   The well-known law: the performance of computers doubles every 18 months.
   More transistors, more performance.
   The prediction held for Intel's CPUs for 40 years.

6. Review of Moore's Law
   Transistors in a chip did increase.

7. Problems
   More transistors need high frequency.
   High frequency needs high power consumption.
   We come into the Clock Speed Race.
   But 4 GHz has been the limit.
   Moore's law breaks.

8. Paradigm Shift from 2000
   General-purpose multicore comes of age.
   Chip companies race to create multicore processors:
   CPU: Intel Core Duo, Quad-core, ...
   DSP: TI DaVinci
   GPU: nVidia GeForce/Tesla
   ...

9. The Multicore Evolution
   From large mono-core to multiple lightweight cores
   [Figure: Pentium processor (optimized for single thread), then Core Duo, then in 5~10 years 10~100 energy-efficient cores optimized for parallel execution]
10. Moore's Law Needs Multicore
   Single core cannot fit Moore's law.
   Multicore can fit Moore's law if a parallel programming model exists.
   [Figure: performance over time, multi-core vs. single core]

11. Two Architectures for Multicore
   Symmetric multiprocessing (SMP): multicore CPU, GPGPU, multicore DSP; homogeneous computing.
   Asymmetric multiprocessing (AMP): CPU+GPGPU, CPU+FPGA, CPU+DSP; heterogeneous computing.

12. Multicore CPU (1/2)
   Two or more CPUs on a chip. Ex.: Intel Core i7
   One processor with multiple execution cores

13. Multicore CPU (2/2)
   [Figure: Windows Task Manager showing two cores and eight cores]

14. GPGPU (1/2)
   GPU (Graphics Processing Unit): the processor in a graphics card that speeds up 3D graphics. Game playing is a major application.
   GPGPU (General-Purpose GPU): general-purpose computation using the GPU in applications other than 3D graphics.

15. GPGPU (2/2)
   GPGPU has more cores than CPU: 120 ~ 512 cores.
   GPGPU is more powerful than multicore CPU.
   Vendors: nVidia, ATI, Intel, AMD

16. Computer Vision Needs High Performance Computing
   A CV example: video processing (intelligent video surveillance, ...)
   Its complexity is high.
   One video: 10 megapixels, 30 fps, 100 flops per pixel, i.e. 30 gigaflops per video
   Massive data processing
   Intensive computation

17. Approaches for HPC
   Cluster/distributed computing; MapReduce (Google); Supercomputer (Cloud Computing); MPI
   Multi-processing computing; Multicore CPU: programming with multithreading; FPGA/DSP; GPGPU: programming with CUDA
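The 30-gigaflops figure quoted for video processing follows directly from the three numbers given with it. As a quick check, taking 10 megapixels to mean 10^7 pixels per frame:

```latex
\[
10^{7}\,\frac{\text{pixels}}{\text{frame}}
\times 30\,\frac{\text{frames}}{\text{s}}
\times 100\,\frac{\text{flop}}{\text{pixel}}
= 3 \times 10^{10}\,\frac{\text{flop}}{\text{s}}
= 30\ \text{GFLOPS}
\]
```

A single video stream therefore needs about as much as the 32 GFLOPS quoted for a CPU in the GPGPU vs. CPU comparison of this talk.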
18. However
   Multicore is not a simple solution for upgrading performance.
   The transition from single core to multicore will be blocked by software.
   We are not ready to face the software programming challenges.

19. Multicore Demands Threading

20. Part 2: GPGPU and CUDA
   GPGPU Hardware
   Programming by CUDA

21. Why GPGPU
   GPGPU is many-core (> 100 cores).
   Suitable for intensive parallel computing.
   GPGPU vs. CPU:
   Calculation: 367 GFLOPS vs. 32 GFLOPS
   Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s

22. GPGPU Vendors
   NVIDIA, ATI, Intel, AMD

23. Hardware View
   PC-based: GPGPU card as a coprocessor
   From PC to PSC: Personal Super-Computer

24. Applications of GPGPU
   http://developer.nvidia.com/category/zone/cuda-zone

25. Two New GPGPUs from nVidia
   GT200: GTX 260/280, Quadro 5800, Tesla 1060
   Fermi: Tesla 2060
   [Figure: CPU (host), multicore, with Control, ALUs, Cache and DRAM, compared with GPU (device), many-core, with its own DRAM]

26. nVidia GPGPU Architecture
   SM/SP (Streaming multiprocessor/Streaming processor) + Shared memory + DRAM

27. Memory Hierarchy
   On-Chip Memory: Registers, Shared Memory, Constant Memory, Texture Memory (constant and texture data reside in device DRAM and are served through on-chip caches)
   Off-Chip Memory: Local Memory, Global Memory

28. Parallel Computing
   [Figure: serial computing compared with parallel computing on GPGPU cores]

29. Parallel Programming
   Many codes are written in C/C++/Java, especially algorithmic programs.
   Can we write GPGPU parallel programs in C/C++/Java?
   However, C/C++ is sequential.
   Three control structures of C/C++/Java: sequence, selection, repetition.

30. Multi-threading
   Multi-threading is the most important technique for parallel programming.
   Some techniques are ready: Pthread, Win32 thread, OpenMP, MPI, Intel TBB (Threading Building Blocks), ...
   New techniques: CUDA, OpenCL, ...

31. Parallel Programming in Sequential Language
   Do we need to learn new languages for multi-threading? No.
   Write multi-threading code in C/C++.
   Add functions/directives to C/C++ for multi-threading.
   That is what current solutions do: pthread, Win32 thread, OpenMP, MPI, CUDA, OpenCL, ...

32. CUDA
   CUDA: Compute Unified Device Architecture
   Parallel programming for nVidia's GPGPU
   Use the C/C++ language; Java, Fortran, and Matlab can also be used.
   When executing CUDA programs, the GPU operates as a coprocessor to the main CPU.

33. CUDA Hardware Environment: CPU+GPU
   CPU: organizes, interprets, and communicates information.
   GPU: handles the core processing on large quantities of parallel information.
   Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU.
   [Figure: CPU connected to GPU over PCI-E]

34. CUDA Software Stack

35. Processing Flow on CUDA
   1. Allocate device memory (memory for GPU)
   2. Copy the processing data from main memory
   3. The CPU instructs the processing
   4. Execute in parallel in each core
   5. Copy the result back to main memory
   6. Release device memory

36. Programming with Memory Hierarchy
   Locality principle: temporal locality, spatial locality
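The six-step processing flow listed above can be sketched as a small data-parallel program. This is a minimal sketch and not code from the talk: the kernel `invert`, the helper `run_invert_demo`, the 256-thread block size, and the test pattern are all illustrative assumptions. Only standard CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaFree`, `cudaGetLastError`) are used.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the talk): each thread inverts one 8-bit pixel.
__global__ void invert(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last, partial block
        out[i] = 255 - in[i];
}

// Runs the six steps on n pixels. Returns the number of wrong output
// pixels, or -1 if an allocation or CUDA call fails.
int run_invert_demo(int n)
{
    unsigned char* h_in  = (unsigned char*)malloc(n);
    unsigned char* h_out = (unsigned char*)malloc(n);
    if (!h_in || !h_out) { free(h_in); free(h_out); return -1; }
    for (int i = 0; i < n; ++i) h_in[i] = (unsigned char)(i % 256);

    unsigned char* d_in  = NULL;
    unsigned char* d_out = NULL;
    bool ok = true;

    // Step 1: allocate device memory.
    ok = ok && cudaMalloc((void**)&d_in,  n) == cudaSuccess;
    ok = ok && cudaMalloc((void**)&d_out, n) == cudaSuccess;

    // Step 2: copy the input data from host to device.
    ok = ok && cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice) == cudaSuccess;

    // Steps 3 and 4: the CPU launches the kernel; the GPU runs one thread per pixel.
    if (ok) {
        int threads = 256;                          // assumed block size
        int blocks  = (n + threads - 1) / threads;  // round up to cover all pixels
        invert<<<blocks, threads>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // Step 5: copy the result back (this copy waits for the kernel to finish).
    ok = ok && cudaMemcpy(h_out, d_out, n, cudaMemcpyDeviceToHost) == cudaSuccess;

    // Step 6: release device memory.
    cudaFree(d_in);
    cudaFree(d_out);

    // Verify on the host.
    int errors = 0;
    if (ok)
        for (int i = 0; i < n; ++i)
            if (h_out[i] != 255 - h_in[i]) ++errors;

    free(h_in);
    free(h_out);
    return ok ? errors : -1;
}

int main()
{
    int errors = run_invert_demo(1 << 20);
    printf("mismatched pixels: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

Unlike a single-thread kernel, this launch creates one thread per pixel, so the work is spread over all available cores; the `if (i < n)` guard is needed because the grid is rounded up to a whole number of blocks.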
37. Example - Hello World (1/3)
   [Diagram: Host holds src and h_hello; Device holds d_hello1 and d_hello2]

       #include <stdio.h>
       #include <stdlib.h>

       int main()
       {
           char src[12] = "Hello World";
           char h_hello[12];
           char* d_hello1;
           char* d_hello2;
           /* allocate two 12-byte buffers on the device */
           cudaMalloc((void**) &d_hello1, sizeof(char) * 12);
           cudaMalloc((void**) &d_hello2, sizeof(char) * 12);
           /* copy the string from host to device */
           cudaMemcpy(d_hello1, src, sizeof(char) * 12, cudaMemcpyHostToDevice);
           /* call the kernel function with one block of one thread */
           hello<<<1, 1>>>(d_hello1, d_hello2);

38. Example - Hello World (2/3)
   Kernel function (in a single source file it must be defined before main):

       __global__ void hello(char* hello1, char* hello2)
       {
           int k;
           /* copy characters up to the terminating zero */
           for (k = 0; hello1[k] != 0; k++) {
               hello2[k] = hello1[k];
           }
           hello2[k] = 0;  /* copy the terminator too, so printf sees a valid string */
       }

   No parallel processing in this example.

39. Example - Hello World (3/3)

           /* copy the result from device back to host */
           cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12, cudaMemcpyDeviceToHost);
           printf("%s\n", h_hello);
           /* release device memory */
           cudaFree(d_hello1);
           cudaFree(d_hello2);
           system("pause");
           return 0;
       }

   Result: [Figure: program output]

40. Parallelization
   Multicore/Multi-threading
   Data parallelization: data distribution, parallel convolution, reduction algorithm
   Amdahl's law
   Memory hierarchy management
   Locality principle: a program accesses a relatively small portion of the address space at any instant of time.

41. Develop Multi-thread Program
   Identify parallelism: analyze the algorithm
   Express parallelism: write parallel code
   Validate parallelism: debug and verify the parallel code
   Optimize parallelism: enhance parallel performance

42. Part 3: Image Restoration (Retinex) by CUDA

43. Image Restoration
   Restore and enhance an image.
   Its complexity is high for large images.
   Complexity: O(N^2) ~ O(N^3)
   [Figure: original image and restored image]
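Amdahl's law, listed on the Parallelization slide, bounds the speedup that extra cores can give to a computation of this size. In its usual form, with p the fraction of run time that can be parallelized and N the number of cores (the values p = 0.9 and N = 100 below are illustrative assumptions, not figures from the talk):

```latex
\[
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
S(100)\Big|_{p = 0.9} = \frac{1}{0.1 + 0.009} \approx 9.2
\]
```

Even 100 cores give less than a 10x speedup when 10% of the work stays serial, which is one reason algorithms without an iterative (serial) process are attractive candidates for GPGPU.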
44. Algorithms for Image Restoration
   Wiener Filter
   Histogram Based Approach: Histogram Equalization, Histogram Modification
   Retinex: Path-based Retinex; Recursive Retinex; Center/surround Retinex (no iterative process, so it is suitable for parallelization)
   Multi-Scale Retinex with Color Restoration (MSRCR) [Rahman et al. 1997]

45. MSRCR Algorithm
   $$R_i(x,y) = r_i(x,y) \sum_{k=1}^{n} W_k \left\{ \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right\}, \quad i \in \{R, G, B\}$$
   $R_i(x,y)$: the MSRCR output
   $I_i(x,y)$: the original image distribution in the ith spectral band
   $F_k(x,y)$: the kth Gaussian surround function
   $*$: the convolution operation
   $W_k$: the weight
   $r_i(x,y)$: the color restoration factor in the ith spectral band
   $$r_i(x,y) = \beta \log\left[ \frac{\alpha\, I_i(x,y)}{\sum_{j=1}^{N} I_j(x,y)} \right]$$
   $N$: the number of spectral bands
   $\beta$: the gain constant
   $\alpha$: controls the strength of the nonlinearity

46. Decompose the Problem
   Two basic approaches to partition computational work:
   Domain decomposition (GPGPU): partition the data used; the parts cooperate in solving the problem.
   Function decomposition (CPU): partition the jobs (functions) from the overall work (problem).

47. Multi-Threading
   A program running in serial and in parallel
   http://en.wikipedia.org/wiki/Thread_(computer_science)

48. Domain Decomposition (1/3)
   An image example: it is 2D data.
   Three popular partition ways

49. Domain Decomposition (2/3)
   Domain data are usually processed by loop
       for (i=0; i