Matrix operations

Class 7, 29/04/2015

http://fisica.cab.cnea.gov.ar/gpgpu/index.php/en/icnpg/clases

cp -a /share/apps/codigos/alumnos_icnpg2015/Matrices .

Memory hierarchy

● Registers = 8.0 TB/s
● Shared/L1 = 1.6 TB/s
● Global = 190 GB/s
● Mapped = 8.0 GB/s

● Other memories: texture, constant

Shared memory

● It can be thought of as a programmable cache (the programmer decides what to store and when).

● It is private to each block (48 kB max).

● It is distributed across memory banks.

● On Fermi there are 32 banks, each 32 bits wide (128 bytes across all banks).

The third kernel launch parameter is the amount of dynamically allocated shared memory per block, in bytes:

kernel<<<dimGrid, dimBlock, shared_memory>>>

Shared memory - Banks

● Each memory bank can service only one read/write operation at a time.

● On Fermi there are 32 banks of 4 bytes.
● On Kepler there are 32 banks of 8 bytes.

● On Fermi, the bandwidth of each bank is 32 bits per 2 clock cycles.
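The bank layout above can be explored on the CPU. This is a minimal sketch in plain C, assuming the Fermi layout (32 banks of 4 bytes, consecutive words in consecutive banks). The names `bank_of` and `conflict_degree` are hypothetical helpers, not CUDA API calls. It counts how many threads of one warp land on the same bank for a given access stride.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BANKS 32   /* assumption: Fermi layout */
#define BANK_WIDTH 4   /* assumption: bytes per bank word */
#define WARP_SIZE 32

/* Bank that serves the 4-byte word at a given byte offset in shared memory. */
unsigned bank_of(size_t byte_offset)
{
    return (unsigned)((byte_offset / BANK_WIDTH) % NUM_BANKS);
}

/* Largest number of threads of one warp that hit the same bank when
   thread t reads the 4-byte word at index t * stride_words.
   stride_words must be >= 1 (all threads reading the same word is a
   broadcast and is not modeled here). */
unsigned conflict_degree(unsigned stride_words)
{
    unsigned hits[NUM_BANKS] = {0};
    unsigned worst = 0;
    assert(stride_words >= 1);
    for (unsigned t = 0; t < WARP_SIZE; t++) {
        unsigned b = bank_of((size_t)t * stride_words * BANK_WIDTH);
        hits[b]++;
        if (hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}
```

Under these assumptions a stride of 1 word is conflict free, a stride of 2 gives a 2-way conflict, and a stride of 32 serializes the whole warp on a single bank.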

char, signed char, unsigned char: 8 bits (1 byte)

short, short int, signed short, signed short int, unsigned short, unsigned short int: 16 bits (2 bytes)

int, signed int: 32 bits (4 bytes)

long, long int, signed long, signed long int, unsigned long, unsigned long int: 32 bits (4 bytes) on Windows, 64 bits (8 bytes) on 64-bit Linux

(signed/unsigned) long long (int): 64 bits (8 bytes)

float: 32 bits (4 bytes)

double: 64 bits (8 bytes)

long double: 80 bits or more (stored in 12 / 16 bytes)

When in doubt... printf("%zu\n", sizeof(data_type));

Warps

● 1 warp = 32 threads, executed with a SIMD model.
● threadIdx.x is consecutive within each warp.
● For 2D blocks the threads are linearized and then split into warps. A block of 8x8 threads is split into 2 warps, indexed as follows:

● Warp 0 = (0,0), (1,0) … (7,0) … (0,3) … (7,3)
● Warp 1 = (0,4), (1,4) … (7,4) … (0,5) … (7,7)

● Memory accesses are issued per warp (>= Fermi).

● There is divergence if threads within the same warp take different execution paths.
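The split of a 2D block into warps can be checked with a few lines of host code. This is a minimal sketch in plain C, assuming threads are linearized with x varying fastest. `linear_thread_id` and `warp_of` are hypothetical helper names, not part of CUDA.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Linear index of thread (tx, ty) inside a 2D block: x varies fastest. */
int linear_thread_id(int tx, int ty, int block_dim_x)
{
    return tx + ty * block_dim_x;
}

/* Warp that thread (tx, ty) belongs to. */
int warp_of(int tx, int ty, int block_dim_x)
{
    assert(tx >= 0 && tx < block_dim_x && ty >= 0);
    return linear_thread_id(tx, ty, block_dim_x) / WARP_SIZE;
}
```

For the 8x8 block of the slide, `warp_of(7, 3, 8)` is 0 and `warp_of(0, 4, 8)` is 1, which matches the two warps listed above.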

Matrix transpose

int tidx = threadIdx.x + blockDim.x * blockIdx.x;
int tidy = threadIdx.y + blockDim.y * blockIdx.y;

int inputIndex = tidx + size * tidy;
int outputIndex = tidy + size * tidx;

outMatrix[outputIndex] = inpMatrix[inputIndex];

Reads are coalesced: consecutive threads (tidx) read consecutive addresses.

Writes have very large strides: consecutive threads write addresses that are size elements apart.
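To put a number on the difference between the two access patterns, the sketch below counts how many memory segments one warp touches. It is plain C and rests on several assumptions: 4-byte floats, 128-byte memory segments, a matrix base address aligned to 128 bytes, and a warp of 32 consecutive tidx values with the same tidy. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32
#define SEGMENT_BYTES 128   /* assumed size of one global memory segment */

/* Number of distinct segments touched by one warp when thread t accesses
   the float at element index base + t * stride. */
int segments_touched(size_t base, size_t stride)
{
    size_t seen[WARP_SIZE];
    int count = 0;
    assert(sizeof(float) == 4);
    for (int t = 0; t < WARP_SIZE; t++) {
        size_t seg = (base + (size_t)t * stride) * sizeof(float) / SEGMENT_BYTES;
        int is_new = 1;
        for (int k = 0; k < count; k++)
            if (seen[k] == seg)
                is_new = 0;
        if (is_new)
            seen[count++] = seg;
    }
    return count;
}

/* Reads: inputIndex = tidx + size * tidy, so the stride between threads is 1. */
int read_segments(size_t size, size_t tidy)
{
    return segments_touched(size * tidy, 1);
}

/* Writes: outputIndex = tidy + size * tidx, so the stride between threads is size. */
int write_segments(size_t size, size_t tidy)
{
    return segments_touched(tidy, size);
}
```

With size = 1024 the reads of a warp fall in a single segment, while the writes are spread over 32 different segments.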

Matrix transpose: a bit better

Data is loaded from the input matrix into a shared memory TILE with fully coalesced reads.

The writes to the output matrix are fully coalesced as well.

There is a shared memory bank conflict: when the tile is read by columns, each thread accesses a different word of the same shared memory bank.

__shared__ FLOAT tile[TILE_DIM][TILE_DIM];

// Position of this thread's element in the input matrix
unsigned int tidx = blockIdx.x * TILE_DIM + threadIdx.x;
unsigned int tidy = blockIdx.y * TILE_DIM + threadIdx.y;
unsigned int index_input = tidx + (tidy)*size;

// Position in the output matrix: the block indices are swapped
tidx = blockIdx.y * TILE_DIM + threadIdx.x;
tidy = blockIdx.x * TILE_DIM + threadIdx.y;
unsigned int index_output = tidx + (tidy)*size;

// Coalesced load of the tile, ROWS rows at a time
for (int i=0; i<TILE_DIM; i+=ROWS)
    tile[threadIdx.y+i][threadIdx.x] = input[index_input+i*size];

__syncthreads();

// Coalesced store; the tile is read by columns, which causes the bank conflict
for (int i=0; i<TILE_DIM; i+=ROWS)
    output[index_output+i*size] = tile[threadIdx.x][threadIdx.y+i];



Matrix transpose: even better

The bank conflict can be avoided by adding a small offset (one padding column) to the shared memory TILE.

__shared__ FLOAT tile[TILE_DIM][TILE_DIM+1];
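Why one extra column is enough can be seen by computing the bank of each element of a tile column. This is a minimal sketch in plain C, assuming TILE_DIM = 32, 4-byte elements and 32 banks of 4 bytes. `bank_of_element` and `column_conflict_degree` are hypothetical helper names.

```c
#include <assert.h>

#define TILE_DIM 32    /* assumption: tile as wide as a warp */
#define NUM_BANKS 32   /* assumption: 4-byte banks, as on Fermi */

/* Bank holding tile[row][col] when each tile row stores `width` floats. */
int bank_of_element(int row, int col, int width)
{
    return (row * width + col) % NUM_BANKS;
}

/* Largest number of threads of a warp that hit the same bank when
   thread t reads tile[t][col], i.e. the warp walks down one column. */
int column_conflict_degree(int col, int width)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    assert(col >= 0 && col < TILE_DIM && width >= TILE_DIM);
    for (int t = 0; t < TILE_DIM; t++) {
        int b = bank_of_element(t, col, width);
        hits[b]++;
        if (hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}
```

With a row width of TILE_DIM every element of a column falls in the same bank (a 32-way conflict). With TILE_DIM + 1 each row is shifted by one bank, so the 32 threads hit 32 different banks.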

Let's look at some examples: transpose.

Matrix multiplication

Given two matrices:
A: hA x wA
B: hB x wB

C = A x B: hA x wB (requires wA = hB)

CPU algorithm

void matrixMult(const float *A, const float *B, int hA, int wA, int wB, float *C)
{
    for (int i = 0; i < hA; i++) {          // rows of C
        for (int j = 0; j < wB; j++) {      // columns of C
            float aux = 0.0f;
            // Dot product of row i of A and column j of B
            for (int k = 0; k < wA; k++) {
                aux += A[i * wA + k] * B[k * wB + j];
            }
            C[i * wB + j] = aux;
        }
    }
}

A basic kernel

__global__ void matrixMultiply(float *A, float *B, float *C, int hA, int wA, int hB, int wB)
{
    // Each thread computes one element of C
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;

    // Threads that fall outside C do nothing
    if ((row < hA) && (col < wB)) {
        float sum = 0;
        for (int i = 0; i < wA; i++) {
            sum += A[row * wA + i] * B[col + i * wB];
        }
        C[row * wB + col] = sum;
    }
}

Coalesced reads.

Inefficient reads.

Each thread reads a whole row of A and a whole column of B.

A slightly better kernel

The matrix C is divided into blocks of size BLOCK_WIDTH (TILE_WIDTH in the kernel code).

A grid divided into blocks of size BLOCK_WIDTH is used.

Each block of the grid computes one block of C.

To compute a block of C, a certain number of columns of B and rows of A are needed.

These are loaded into shared memory.
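A rough count of global memory loads shows what the shared memory tiles buy. This is a back-of-the-envelope sketch in plain C, assuming all dimensions are multiples of the tile size and ignoring caches. `loads_basic` and `loads_tiled` are hypothetical names.

```c
#include <assert.h>

/* Basic kernel: each of the hA * wB threads reads wA elements of A
   and wA elements of B from global memory. */
long long loads_basic(long long hA, long long wA, long long wB)
{
    return hA * wB * 2 * wA;
}

/* Tiled kernel: in each of the wA / tile steps, every thread loads one
   element of A and one element of B into shared memory. */
long long loads_tiled(long long hA, long long wA, long long wB, long long tile)
{
    assert(tile > 0 && hA % tile == 0 && wA % tile == 0 && wB % tile == 0);
    return hA * wB * 2 * (wA / tile);
}
```

Under these assumptions the tiled version issues `tile` times fewer global loads. For 1024 x 1024 matrices and 16 x 16 tiles the count drops from 2147483648 to 134217728.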

A slightly better kernel

// Assumes blockDim.x == blockDim.y == TILE_WIDTH
__global__ void matrixMultiply(float *A, float *B, float *C, int hA, int wA, int hB, int wB)
{
    __shared__ float dA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float dB[TILE_WIDTH][TILE_WIDTH];

    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    float value = 0;

    // One iteration per tile along the shared dimension wA
    for (int i = 0; i < (wA + TILE_WIDTH - 1) / TILE_WIDTH; i++) {
        // Load one tile of A and one tile of B; pad with zeros outside the matrices
        if ((row < hA) && (i * TILE_WIDTH + tx < wA)) {
            dA[ty][tx] = A[row * wA + i * TILE_WIDTH + tx];
        } else {
            dA[ty][tx] = 0.0f;
        }
        if ((i * TILE_WIDTH + ty < wA) && (col < wB)) {
            dB[ty][tx] = B[(i * TILE_WIDTH + ty) * wB + col];
        } else {
            dB[ty][tx] = 0.0f;
        }
        __syncthreads();   // the tiles must be complete before they are used

        for (int j = 0; j < TILE_WIDTH; j++) {
            value += dA[ty][j] * dB[j][tx];
        }
        __syncthreads();   // do not overwrite the tiles while others still read them
    }

    if (row < hA && col < wB) {
        C[row * wB + col] = value;
    }
}
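The tile-by-tile accumulation can be checked on the CPU against the plain triple loop. This is a minimal sketch in plain C that follows what a single thread of the tiled kernel computes. It assumes a tiny TILE_WIDTH of 2 for readability, and the function names are hypothetical. Elements past the edge of the matrices count as zero, mirroring the zero padding of the shared memory tiles.

```c
#include <assert.h>

#define TILE_WIDTH 2   /* assumption: small tile so the example is easy to follow */

/* Reference product: the same triple loop as the CPU algorithm. */
void matmul_reference(const float *A, const float *B, float *C,
                      int hA, int wA, int wB)
{
    for (int i = 0; i < hA; i++)
        for (int j = 0; j < wB; j++) {
            float aux = 0.0f;
            for (int k = 0; k < wA; k++)
                aux += A[i * wA + k] * B[k * wB + j];
            C[i * wB + j] = aux;
        }
}

/* Value of C[row][col] accumulated tile by tile, as one GPU thread would. */
float tiled_element(const float *A, const float *B,
                    int hA, int wA, int wB, int row, int col)
{
    assert(row >= 0 && row < hA && col >= 0 && col < wB);
    int num_tiles = (wA + TILE_WIDTH - 1) / TILE_WIDTH;   /* rounded up */
    float value = 0.0f;
    for (int t = 0; t < num_tiles; t++) {
        for (int j = 0; j < TILE_WIDTH; j++) {
            int k = t * TILE_WIDTH + j;
            /* Past the edge the tile holds zeros, so the term vanishes. */
            float a = (k < wA) ? A[row * wA + k] : 0.0f;
            float b = (k < wA) ? B[k * wB + col] : 0.0f;
            value += a * b;
        }
    }
    return value;
}
```

With wA = 3 and TILE_WIDTH = 2 the loop runs over two tiles, the second one half empty, and still reproduces the reference result.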

Let's look at an example: multiplication.

References

● An Efficient Matrix Transpose in CUDA C/C++ - http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/

● Bank conflicts in shared memory in Cuda - http://cuda-programming.blogspot.com.ar/2013/02/bank-conflicts-in-shared-memory-in-cuda.html

● Optimizing Matrix Transpose in CUDA - http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf

● Memory optimizations - http://on-demand.gputechconf.com/gtc-express/2011/presentations/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf
