GPU-based Medical Imaging

강 성태

Computer Graphics & Medical Imaging Lab., Seoul National University

Overview

GPUs and Medical Imaging
Case Study: Cone-beam CT Reconstruction

Medical Imaging

Data Acquisition
    CT, MRI, PET, SPECT, Ultrasound
    Tomographic reconstruction of acquired data
Image processing
    Segmentation
    Registration
Visualization
    Direct Volume Rendering
    Maximum Intensity Projection

Issues on Medical Imaging Applications

Storage and memory usage
    Medical data is huge, and getting larger
        Visible Human Project (1994): 15 GB; 65 GB when rescanned in 2000
        Time-series data
    The highest level of accuracy is required
Performance
    Most medical imaging applications are based on computationally heavy algorithms...
        Frequency-domain analysis, filtering, optical integration, ...
    ...applied to large data

GPUs and Medical Imaging

Parallelism
    Simple but heavy calculation over a huge domain
    Little or no data dependency between output elements
        Filtering
        Projection
GPU-friendly operations
    Interpolation
    Blending

GPUs and Medical Imaging

The first flexible shader (Robert L. Cook, 1984)
Accelerating Volume Reconstruction with 3D Texture Hardware (T. Cullip and U. Neumann, 1993)
Accelerated Volume Rendering and Tomographic Reconstruction using Texture Mapping Hardware (B. Cabral, N. Cam, and J. Foran, 1994)
Programmable shaders introduced to consumer H/W (Microsoft DirectX 8, 2000)

Case Study: CT Reconstruction

Computed Tomography
    A 3D image of the inside of an object, built from a large series of 2D X-ray images taken around a single axis of rotation

CT Reconstruction

Reconstruction of 3D data from 2D projected images

CT Reconstruction

Fourier Slice Theorem
    The Fourier transform of the projection of an n-D function is equal to the slice of that function's n-D Fourier transform that passes through the origin of Fourier space, perpendicular to the projection direction
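As a worked statement of the theorem, the 2D case can be written out as follows. The symbols (f for the object, P_θ for its parallel projection at angle θ, F for the 2D Fourier transform of f) are notation chosen here for illustration and do not come from the slides.

```latex
% Parallel projection of f at angle \theta
P_\theta(t) = \int_{-\infty}^{\infty}
    f(t\cos\theta - s\sin\theta,\; t\sin\theta + s\cos\theta)\, ds

% 2D Fourier transform of f
F(u, v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}
    f(x, y)\, e^{-2\pi i (ux + vy)}\, dx\, dy

% Fourier slice theorem: the 1D transform of the projection is a radial line of F
\hat{P}_\theta(\omega) = \int_{-\infty}^{\infty} P_\theta(t)\, e^{-2\pi i \omega t}\, dt
                       = F(\omega\cos\theta,\; \omega\sin\theta)
```

Each projection angle therefore fills in one line through the origin of F. The samples are densest near the origin, which leads to the low-frequency overweighting that filtered backprojection corrects.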

CT Reconstruction

Backprojection method

Filtered Backprojection

Blurring artifact
    Plain backprojection overweights the low-frequency area
Filtering
    High-pass filters

[Figure: source image and its backprojected reconstruction]

Filtering

[Figure: source image compared with reconstructions using no filtering, Ram-Lak filtering, and Shepp-Logan filtering]
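The two filters named in the figure can be sketched as frequency responses. This is a minimal CPU sketch in plain C++, not the code used for the slides. It uses the textbook definitions (Ram-Lak as a pure ramp, Shepp-Logan as the ramp times a sinc window). The function names and the normalized-frequency convention (cycles per sample, Nyquist at 0.5) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Ram-Lak (pure ramp) response at normalized frequency f, in cycles per sample.
// Valid for |f| <= 0.5, where 0.5 is the Nyquist frequency.
double ramLak(double f) {
    assert(std::fabs(f) <= 0.5);
    return std::fabs(f);
}

// Shepp-Logan response: the ramp multiplied by a sinc window.
// With the Nyquist frequency at 0.5 this is |f| * sin(pi f) / (pi f).
double sheppLogan(double f) {
    assert(std::fabs(f) <= 0.5);
    const double pi = 3.14159265358979323846;
    const double x = pi * f;
    if (std::fabs(x) < 1e-12) return 0.0;  // the ramp is zero at DC
    return std::fabs(f) * std::sin(x) / x;
}
```

Both responses suppress low frequencies. The Shepp-Logan window additionally rolls off toward Nyquist, trading a little sharpness for less amplified noise.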

Filtering (Cone-beam)

[Figure: source S at angle θ and its projection P_θ(Y, Z) on the detector]

Pre-weighting with ray distance
High-pass filter

Filtering

Brief algorithm

for each image in detected images
{
    FT
    for each pixel in the image
        multiply high-pass filter
    inverse FT
    for each pixel in the image
        multiply pre-weighting factor
}
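The pre-weighting factor is not spelled out on the slides. The sketch below assumes the usual FDK cosine weight D / sqrt(D² + Y² + Z²), with D taken as the source-to-detector distance and (Y, Z) measured from the point where the central ray hits the detector. The function names and the `pixelSize` parameter are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Cosine pre-weight for detector position (Y, Z), measured from the point
// where the central ray hits the detector. D is the source-to-detector
// distance (an assumed convention).
double preWeight(double Y, double Z, double D) {
    assert(D > 0.0);
    return D / std::sqrt(D * D + Y * Y + Z * Z);
}

// Applies the weight in place to a row-major width x height projection.
// pixelSize is the detector pixel pitch in the same unit as D (assumed).
void applyPreWeight(std::vector<float>& image, int width, int height,
                    double pixelSize, double D) {
    assert(image.size() == static_cast<std::size_t>(width) * height);
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            // Detector coordinates relative to the image center.
            const double Y = (col - 0.5 * (width - 1)) * pixelSize;
            const double Z = (row - 0.5 * (height - 1)) * pixelSize;
            image[static_cast<std::size_t>(row) * width + col] *=
                static_cast<float>(preWeight(Y, Z, D));
        }
    }
}
```

The weight is 1 on the central ray and falls off toward the detector edges, where rays travel farther. Standard FDK descriptions apply this weight before the high-pass filter, so the order of the two pixel loops is worth checking against the geometry actually in use.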

GPU-based Filtering

FFT on a GPU (K. Moreland and E. Angel, 2003)
    Packing real values into real-imaginary pairs
    Slower than CPU implementations at that time

GPU-based Filtering

FFT on a GPU (Sumanaweera and Liu, 2005)
    1.3 to 3 times faster than CPU-based FFT
    GPU implementation was complicated and tricky
CUDA cufft library
    Library for fast Fourier transforms using GPU resources
    Simple function calls
    Nvidia's black-box algorithm

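GPU-based Filtering

CUDA-based implementation

// Multiplies each frequency sample by the Shepp-Logan response
// (the kernel body is not shown on the slide).
__global__ void shepp_logan(cufftComplex *freqImage, int2 size) { /* ... */ }

// Filters every detected image in place. image[n] and freqImage are device
// buffers; freqImage holds size.y * (size.x / 2 + 1) complex values.
__host__ void filter(cufftReal **image, cufftComplex *freqImage,
                     int2 size, int numOfDetectedImages)
{
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blocks((size.x / 2 + BLOCK_SIZE) / BLOCK_SIZE,
                (size.y + BLOCK_SIZE - 1) / BLOCK_SIZE);

    cufftHandle forward_plan, inverse_plan;
    // cufftPlan2d takes the slowest-varying dimension (rows) first.
    cufftPlan2d(&forward_plan, size.y, size.x, CUFFT_R2C);
    cufftPlan2d(&inverse_plan, size.y, size.x, CUFFT_C2R);

    for (int n = 0; n < numOfDetectedImages; n++)
    {
        cufftExecR2C(forward_plan, image[n], freqImage);
        shepp_logan<<<blocks, threads>>>(freqImage, size);
        // cufft transforms are unnormalized: scale by 1 / (size.x * size.y).
        cufftExecC2R(inverse_plan, freqImage, image[n]);
    }

    cufftDestroy(forward_plan);
    cufftDestroy(inverse_plan);
}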

FDK Backprojection (Cone-beam)

[Figure: cone-beam geometry with the object volume V]

FDK Backprojection

Brief algorithm

for each image I in detected images
{
    Calculate the transform matrix M which maps voxel coordinates to projected image coordinates
    for each voxel v with coordinate v of the volume
    {
        Calculate projected coordinate p = Mv
        Sample I at p and accumulate: v += weight × I(p)
    }
}
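The loop above can be sketched on the CPU as follows. This is an illustrative reference version in plain C++, not the GPU implementation from the talk. The row-major 3x4 layout of M, the helper name `sampleBilinear`, and the 1/w² distance weight are assumptions, since the slides do not define the weight.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Bilinear sample of a row-major width x height image; 0 outside the image.
float sampleBilinear(const std::vector<float>& img, int width, int height,
                     double u, double v) {
    if (u < 0.0 || v < 0.0 || u > width - 1 || v > height - 1) return 0.0f;
    const int u0 = static_cast<int>(u), v0 = static_cast<int>(v);
    const int u1 = (u0 + 1 < width) ? u0 + 1 : u0;
    const int v1 = (v0 + 1 < height) ? v0 + 1 : v0;
    const double fu = u - u0, fv = v - v0;
    const double top = (1 - fu) * img[v0 * width + u0] + fu * img[v0 * width + u1];
    const double bot = (1 - fu) * img[v1 * width + u0] + fu * img[v1 * width + u1];
    return static_cast<float>((1 - fv) * top + fv * bot);
}

// Accumulates one filtered projection into the volume (row-major, x fastest).
// M is a row-major 3x4 matrix mapping (x, y, z, 1) to homogeneous detector
// coordinates (u*w, v*w, w). The weight 1/w^2 assumes w is the normalized
// depth along the central ray.
void backprojectOne(std::vector<float>& volume, int nx, int ny, int nz,
                    const std::vector<float>& img, int width, int height,
                    const double M[12]) {
    assert(volume.size() == static_cast<std::size_t>(nx) * ny * nz);
    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                const double pu = M[0] * x + M[1] * y + M[2] * z + M[3];
                const double pv = M[4] * x + M[5] * y + M[6] * z + M[7];
                const double w  = M[8] * x + M[9] * y + M[10] * z + M[11];
                if (w <= 0.0) continue;  // voxel behind the source
                const double weight = 1.0 / (w * w);
                volume[(static_cast<std::size_t>(z) * ny + y) * nx + x] +=
                    static_cast<float>(weight *
                        sampleBilinear(img, width, height, pu / w, pv / w));
            }
}
```

Every voxel update is independent of the others, which is what makes this loop a good fit for one GPU thread or one raster pixel per voxel.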

GPU-based Backprojection

DX/OpenGL programmable shaders vs. CUDA

Accumulation of the value
    CUDA doesn't support normal graphics raster operations like blending
    Can be replaced with atomic operations, but these are limited in the current version of CUDA
        Only on CUDA 1.1 compatible H/W
        Integer only
Manipulating volume data
    Volume textures are not yet supported by CUDA
    Can be implemented by combining bilinear samples
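Emulating a volume texture by combining bilinear samples can be sketched like this: take one bilinear sample in each of the two neighboring slices and blend them linearly along z. This is a CPU illustration in plain C++. The function names and the flat, x-fastest memory layout are assumptions, not details from the slides.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Bilinear sample inside slice z of a row-major volume (x fastest, then y,
// then z). Coordinates are clamped to the slice so edge lookups stay valid.
double bilinearInSlice(const std::vector<float>& vol, int nx, int ny, int z,
                       double x, double y) {
    x = std::fmin(std::fmax(x, 0.0), nx - 1.0);
    y = std::fmin(std::fmax(y, 0.0), ny - 1.0);
    const int x0 = static_cast<int>(x), y0 = static_cast<int>(y);
    const int x1 = (x0 + 1 < nx) ? x0 + 1 : x0;
    const int y1 = (y0 + 1 < ny) ? y0 + 1 : y0;
    const double fx = x - x0, fy = y - y0;
    const float* s = vol.data() + static_cast<std::size_t>(z) * nx * ny;
    const double top = (1 - fx) * s[y0 * nx + x0] + fx * s[y0 * nx + x1];
    const double bot = (1 - fx) * s[y1 * nx + x0] + fx * s[y1 * nx + x1];
    return (1 - fy) * top + fy * bot;
}

// Trilinear sample built from two bilinear lookups in neighboring slices.
double trilinear(const std::vector<float>& vol, int nx, int ny, int nz,
                 double x, double y, double z) {
    assert(vol.size() == static_cast<std::size_t>(nx) * ny * nz);
    z = std::fmin(std::fmax(z, 0.0), nz - 1.0);
    const int z0 = static_cast<int>(z);
    const int z1 = (z0 + 1 < nz) ? z0 + 1 : z0;
    const double fz = z - z0;
    return (1 - fz) * bilinearInSlice(vol, nx, ny, z0, x, y)
         + fz * bilinearInSlice(vol, nx, ny, z1, x, y);
}
```

On a GPU the two bilinear lookups would be hardware-filtered 2D texture fetches, leaving only the final blend along z to be done in code.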

GPU-based Backprojection

Reconstruction cube representation
    Stack of 2D texture render targets
    3D texture
        Supported by D3D10 compatible H/W
Detected image representation
    2D texture
    Only one detected image is loaded to the GPU in a rendering pass, for lack of GPU memory
        Causes frequent context switching and CPU-GPU memory transfer

GPU-based Backprojection

Calculating the sampling coordinate with a transformation matrix
    Mapping 3D voxel coordinates to 2D pixel coordinates of detected images
    Rotation of the detector and perspective projection of the cone-beam ray can be represented as a composition of simple transform matrices
Sampling a pixel is trivial
    Hardware-accelerated linear interpolation
Accumulate the weighted pixel to the voxel
    Color/alpha blending
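The composition of a rotation and a perspective projection can be sketched as a single 3x4 matrix. The geometry below is an assumption made for illustration: rotation about the z axis, a source at distance `dso` from the rotation axis, and a detector at distance `dsd` from the source. The names `projectionMatrix` and `project` are hypothetical, and the result is in detector units centered on the central ray rather than in pixel indices.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3x4 = std::array<double, 12>;  // row-major

// Builds M = P * R for gantry angle theta (radians).
// R rotates the voxel about the z axis:  x' = c x + s y,  y' = -s x + c y.
// P projects onto the detector:          (dsd * y', dsd * z', dso + x').
// The geometry and both distance parameters are assumptions.
Mat3x4 projectionMatrix(double theta, double dso, double dsd) {
    assert(dso > 0.0 && dsd > 0.0);
    const double c = std::cos(theta), s = std::sin(theta);
    return Mat3x4{
        -dsd * s, dsd * c, 0.0, 0.0,   // u * w
        0.0,      0.0,     dsd, 0.0,   // v * w
        c,        s,       0.0, dso    // w: depth along the central ray
    };
}

// Applies M to voxel (x, y, z) and performs the perspective divide.
std::array<double, 2> project(const Mat3x4& M, double x, double y, double z) {
    const double pu = M[0] * x + M[1] * y + M[2] * z + M[3];
    const double pv = M[4] * x + M[5] * y + M[6] * z + M[7];
    const double w  = M[8] * x + M[9] * y + M[10] * z + M[11];
    assert(w > 0.0);
    return {pu / w, pv / w};
}
```

Because M is fixed for a whole detected image, it only needs to be built once per image and passed to the shader or kernel as a constant.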

GPU-based Backprojection

D3D implementation

Prepare (volume slices / 4) 4-channel render targets; each render target represents 4 slices of the volume

for each image I in detected images
{
    Load I to GPU memory (2D texture)
    for each render target R in the stack of render targets
    {
        Draw a rectangle to R
        [
            Vertex Shader:
                Calculate projected coordinate p of the four vertices
            Pixel Shader:
                Sample the image I using the rasterized coordinate of p for the current raster pixel, which corresponds to a voxel
                Output the weighted sample value to the output blender with the ADD blend option
        ]
    }
}
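Packing four slices into each 4-channel render target implies a simple index mapping, sketched below. The slides do not say which slices share a target, so storing consecutive slices together is an assumption here. The struct and function names are hypothetical.

```cpp
#include <cassert>

struct SliceLocation {
    int target;   // index of the 4-channel render target
    int channel;  // 0..3, i.e. R, G, B, A
};

// Number of 4-channel render targets needed to hold sliceCount slices.
int renderTargetCount(int sliceCount) {
    assert(sliceCount >= 0);
    return (sliceCount + 3) / 4;
}

// Maps volume slice z to its render target and color channel.
// Consecutive slices sharing one target is an assumed layout.
SliceLocation locateSlice(int z) {
    assert(z >= 0);
    return SliceLocation{z / 4, z % 4};
}
```

With this layout a single draw call updates four slices at once, so the number of rectangles drawn per detected image is a quarter of the slice count.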

CUDA-D3D9 Interoperability

CUDA-D3D interoperability is so far limited to vertex buffers (1.1)
Filtered image transfer from CUDA to D3D9
    CUDA array → CPU memory → D3D9 texture
    GPU → CPU transfer
        cufft operations can hide upload latency
    CPU → GPU transfer
        One image per rendering pass
        Can be hidden with D3D9 reconstruction operations

[Figure: rendering pass]

GPU-based Backprojection

[Figure: the current render target within the volume]

Result

712 detected images, 720x924 resolution
512x512x608 reconstructed cube
Intel Core 2 Quad Q6600 CPU
Nvidia GeForce 8800GTX GPU

                      Filtering   Reconstruction   GPU-CPU transfer   Total
CPU (Multithreaded)   21.8 sec    352.3 sec        -                  374.1 sec
GPU (CUDA+D3D9)       7.9 sec     6.2 sec          3.0 sec            17.1 sec

GPU-based Medical Imaging Issues

Memory size
    GPU texture memory is not as large as main memory
    Still insufficient for some medical data
    512 MB to 1 GB for flagship models
        Nvidia 8800GTX (768 MB)
    128 to 256 MB in general
Memory transfer performance
    Data transfer between GPU memory and system memory depends on bus bandwidth
        AGP 8x: upload ~100 MB/s, download ~2.1 GB/s
        PCI Express x16: up/down ~4 GB/s
            Just theoretical
            Experimental: up 700 MB/s, down 1.2 GB/s
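The memory and bandwidth limits can be made concrete with a little arithmetic. The helpers below are a hypothetical sketch. The slides do not state the sample format, so any per-sample size passed in (for example 4 bytes for a 32-bit float) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a volume (or image stack) of nx * ny * nz samples.
std::uint64_t volumeBytes(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz,
                          std::uint64_t bytesPerSample) {
    return nx * ny * nz * bytesPerSample;
}

// Seconds to move `bytes` at a sustained rate given in bytes per second.
double transferSeconds(std::uint64_t bytes, double bytesPerSecond) {
    assert(bytesPerSecond > 0.0);
    return static_cast<double>(bytes) / bytesPerSecond;
}
```

Assuming 4 bytes per voxel, `volumeBytes(512, 512, 608, 4)` is 637,534,208 bytes, or 608 MiB. That already takes most of the 8800GTX's 768 MB. Under the same assumption the 712 detected images of 720x924 pixels come to about 1.9 GB, more than the card holds, which is consistent with loading one image per rendering pass.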
