Intro to CUDA C

Heterogeneous Computing
- Host: CPU and its memory (host memory)
- Device: GPU and its memory (device memory)
Processing Flow
- Copy input data from CPU memory to GPU memory.
- Load GPU program and execute, caching data in GPU memory.
- Copy output data from GPU memory to CPU memory.
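In code, this flow maps onto three calls, spelled out in full in the main() example below (kernel, d_in, d_out, h_in, h_out and the launch sizes here are placeholders):

```c
cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);    // 1. copy input to GPU
kernel<<<blocks, threads>>>(d_in, d_out);                // 2. run the GPU program
cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);  // 3. copy output back
```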
Hello World with Device Code
hello.cu
```c
#include <stdio.h>

// Kernel: runs on the device (GPU)
__global__ void hello() {
    printf("Hello, World from GPU!\n");
}

int main() {
    hello<<<1, 1>>>();        // launch the kernel with 1 block of 1 thread
    cudaDeviceSynchronize();  // wait for the kernel (and its printf) to finish
    return 0;
}
```
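To build and run (assuming the CUDA toolkit is installed): `nvcc hello.cu -o hello`, then `./hello`. Device-side printf() output is buffered and flushed at synchronization points, which is one reason the cudaDeviceSynchronize() call matters here.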
- CUDA C/C++ keyword __global__ indicates a function that:
    - runs on the device (GPU)
    - is called from the host code
- nvcc separates source code into host and device components.
- mykernel<<<1, 1>>>: the triple angle brackets <<<>>> mark a call from host code to device code (a kernel launch).
Addition on the Device
- add() runs on the device, so a, b and c must point to device memory (see the sketch below).
- We need to allocate memory on the GPU!
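A minimal sketch of the add() kernel this refers to, consistent with the scalar main() below (one element, one block, one thread):

```c
__global__ void add(int *a, int *b, int *c) {
    // Runs on the device, so a, b and c must all point to device memory
    *c = *a + *b;
}
```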
Memory Management

- Host and device memory are separate entities
- Device pointers point to GPU memory.
- Host pointers point to CPU memory.
- Simple CUDA API for handling device memory: cudaMalloc(), cudaMemcpy(), cudaFree() (similar to the C equivalents malloc(), memcpy(), free())
CPU and GPU memory cannot access each other directly (in fact it is possible, but not covered here), yet the address spaces look the same: you cannot tell from a pointer value whether it refers to host or device memory.
Addition on the Device: main()
```c
int main(void) {
    int a, b, c;            // host copies of a, b, c
    int *d_a, *d_b, *d_c;   // device copies of a, b, c
    size_t size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Set up input values
    a = 2;
    b = 7;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1, 1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```
Moving to Parallel Programming
Instead of executing add() once, we can execute it multiple times in parallel.
- <<<N, M>>> (examples below):
    - N: number of thread blocks
    - M: number of threads per block
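Concretely (N and M are whatever sizes you pick; add and the d_ pointers are from the surrounding example):

```c
add<<<N, 1>>>(d_a, d_b, d_c);   // N blocks, 1 thread per block
add<<<1, M>>>(d_a, d_b, d_c);   // 1 block, M threads
add<<<N, M>>>(d_a, d_b, d_c);   // N blocks of M threads: N*M parallel invocations
```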
Vector Addition On the Device
- Each parallel invocation of add() is referred to as a block
- The set of blocks is referred to as a grid
- Each invocation can refer to its block index: use blockIdx.x to access the block index (see the sketch after this list)
- Use threadIdx.x to access the thread index within a block (a block can be split into multiple threads)
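A sketch of the block-parallel version of add(): each block handles one array element, selected by blockIdx.x (here a, b and c are device arrays of N ints):

```c
__global__ void add(int *a, int *b, int *c) {
    // One element per block: blockIdx.x selects it
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// Launch with N blocks of 1 thread each:
// add<<<N, 1>>>(d_a, d_b, d_c);
```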
Hint
With M threads per block, a unique index for each thread is given by: index = threadIdx.x + blockIdx.x * M.
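Applying the hint (inside a kernel, M is available as blockDim.x), a sketch of the version that combines blocks and threads:

```c
__global__ void add(int *a, int *b, int *c) {
    // Unique global index: thread index plus the block's offset
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

// Launch with N/M blocks of M threads (assuming N is a multiple of M):
// add<<<N / M, M>>>(d_a, d_b, d_c);
```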
Why bother with threads?

Unlike parallel blocks, threads within a block can communicate and synchronize, which the 1D stencil below relies on.

1D Stencil

Consider applying a 1D stencil to a 1D array of elements: each output element is the sum of the input elements within a fixed radius. With radius 3, every input element is read 7 times!
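To see where the factor of 7 comes from, here is a naive sketch that pulls every neighbour straight from global memory (stencil_1d_naive and RADIUS are illustrative names; boundary handling is omitted):

```c
#define RADIUS 3

__global__ void stencil_1d_naive(int *in, int *out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int result = 0;
    // Each thread reads 2*RADIUS + 1 inputs from global memory, so with
    // RADIUS = 3 every input element ends up being fetched 7 times.
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[i + offset];
    out[i] = result;
}
```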
Sharing Data Between Threads

- Shared memory: within a block, threads can share data
- Easier to control than a cache (?)
- Use __shared__ to declare shared memory; it is allocated per block and not visible to threads in other blocks (see the stencil kernel below)
Data race

Threads run independently: if one thread has not finished writing a value when another thread reads it, the program produces wrong results.

- Use __syncthreads() to synchronize all threads within a block (see the kernel below)
- This prevents RAW / WAW / WAR hazards
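Putting __shared__ and __syncthreads() together, a sketch of the shared-memory stencil (BLOCK_SIZE and RADIUS are assumed compile-time constants; halo handling at the ends of the array is omitted):

```c
#define BLOCK_SIZE 256
#define RADIUS 3

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Stage this block's elements (plus a halo of RADIUS on each side)
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Wait until every thread has finished writing (avoids the RAW hazard)
    __syncthreads();

    // Apply the stencil: each read now hits fast shared memory
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    out[gindex] = result;
}
```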
IDs and Dimensions

blockIdx and threadIdx are used as 1D indices by default, but grids and blocks can also be 2D or 3D (configured with the dim3 type).
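A sketch of a 2D launch using dim3 (my_kernel and the 16/8 sizes are made up for illustration):

```c
__global__ void my_kernel() {
    // Both x and y components are meaningful in a 2D launch
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    (void)col; (void)row;  // placeholder body
}

int main(void) {
    dim3 grid(16, 16);   // 16 x 16 grid of blocks
    dim3 block(8, 8);    // 8 x 8 threads per block
    my_kernel<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}
```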
Coordinating Host and Device
- Kernel launches are asynchronous
- control returns to the CPU immediately
- CPU needs to synchronize before consuming the results
- cudaMemcpy() is a blocking API: it stalls the CPU until the copy is complete, and the copy begins only after all preceding CUDA calls have completed.
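A sketch of the timeline, reusing the vector-add names from above:

```c
add<<<N, M>>>(d_a, d_b, d_c);   // asynchronous: control returns to the CPU at once

// Safe: cudaMemcpy() blocks, and the copy itself starts only after all
// preceding CUDA calls (including the kernel) have completed.
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Alternative: wait explicitly for all outstanding device work.
cudaDeviceSynchronize();
```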
Nowadays the newest CUDA work has gone down to the assembly level (e.g. PTX); CUDA C itself is rarely written by hand.