Note: the code snippets in this document use compiler explorer. If you're interesting in setting up a local instance of compiler explorer that supports GPU execution, check out this guide.
We'll start off with a simple "hello world" program.
This program's main calls the print_hello_cpu() function, which in turn outputs
xxxxxxxxxxhello world from the cpu
to the console before exiting. But that's not very exciting, let's write something that runs on the GPU.
To make code run on the GPU, we have to put it inside a kernel. A kernel is just a special kind of function that carries out the operations on the device (GPU) rather than the host (CPU). Syntactically, kernels follow the same rules as regular C++ functions, except they have this funny __global__ attribute in the declaration. Let's add a simple kernel to our program:
Unfortunately, this code snippet doesn't compile as written. If we try to call a kernel like it was a regular C++ function, we get an error message:
xxxxxxxxxxprint_hello_gpu(); // error: a __global__ function call must be configured
The issue here is that kernels are invoked slightly differently than regular C++ functions. The kernel launch expression expects more than just function arguments. That additional information is passed with this triple chevron notation before the parentheses:
xxxxxxxxxxprint_hello_gpu<<< gridsize, blocksize >>>();We'll go over the triple chevron parameters in more detail later, but for now let's take gridsize = blocksize = 1 and see what happens.
This fixes the compilation error, and now when we rerun the program we get ...
xxxxxxxxxxhello world from the cpu
Wait-- shouldn't we see a print statement from the kernel? What's going on?
When we launch a kernel, the CPU doesn't just sit and wait for it to finish. Instead, control is returned to the CPU almost immediately, so it can continue to do its own work while the GPU executes the kernel. This asynchronous execution pattern is a good thing: it gives us a way to keep both processors busy at the same time, but it does also add some complexity.
With that in mind, let's look at main() from the previous code snippet:
xxxxxxxxxxint main() { print_hello_cpu(); print_hello_gpu<<< 1, 1 >>>();}We see that, right after the print_hello_gpu kernel is launched, the program reaches the end of main and exits. The GPU kernel likely hasn't finished executing, so it isn't able to flush the printf output to the console in time. To prevent the program from exiting before the kernel has finished, we can add a cudaDeviceSynchronize() after the kernel launch:
cudaDeviceSynchronize makes the program wait until the GPU has finished executing its kernels. This ensures that the program does not exit prematurely, so we are now able to see the expected output
xxxxxxxxxxhello world from cpuhello world from gpu
Warning
Synchronization is expensive, only use it when necessary!
Now that the program is working as expected, let's see what happens when we vary the arguments in the triple chevron.
Important
With (gridsize = 4, blocksize = 1), we see "hello world from gpu" 4 times.
What output do we see with (gridsize = 2, blocksize = 2)? (gridsize = 2, blocksize = 4)?
After playing with this a bit, we see a pattern: the body of the kernel appears to be executed gridsize * blocksize times. Let's replace the print statement in the kernel by
xxxxxxxxxxprintf("hello world from gpu blockIdx: %d, threadIdx: %d\n", blockIdx.x, threadIdx.x); and see what happens.
blockIdx and threadIdx are special variables that can be accessed when inside a kernel scope. The range of blockIdx and threadIdx values correspond to the gridsize and blocksize arguments (respectively) that we pass in the triple chevron.
Each thread is guaranteed to have a unique combination of (blockIdx, threadIdx). Typically, these values are used to determine which part of the problem that thread should work on.
In this document we walked through the process of writing a very simple kernel. We learned that:
Kernels are the way we run code on the GPU

Kernels have some syntactic requirements:
__global__ attribute
must return void
Kernels are launched with "triple chevron" notation
<<< gridsize, blocksize, shmem, stream >>>
gridsize and blocksize control the number of threads to be launched
shmem and stream are optional (more on these later)
Kernels execute asynchronously
This enables CPU and GPU to work simultaneously
The CPU can queue up multiple kernels
Up next: the thread hierarchy and how the GPU executes kernels.
If the kernel body is executed gridsize and blocksize times, what's the difference between <<<4, 1>>> <<<1,4>>>, <<<2,2>>>? Wouldn't it just be simpler to let users write <<<4>>>?
As we will see later in the thread hierarchy discussion, <<<4, 1>>> and <<<1, 4>>> both execute the kernel body code 4 times, but they do so in different ways. These launch configuration parameters control how work is allocated to hardware resources, so it does affect the performance of the kernel. So, the choice is left up to the user to allow them to find the configuration that works best for their application.
Also, if the developer wants a simplified interface for a 1D-style parallel for loop, it is easy to write a small wrapper that accomplishes this. For example:
xxxxxxxxxx
template < typename T >__global__ void parallel_for_kernel(int n, T func) { int i = threadIdx.x + blockIdx.x * blockDim.x; if (i < n) { func(i); }}
template < typename T >void parallel_for(int n, T func) { int blocksize = 128; int gridsize = (n + blocksize - 1) / blocksize; parallel_for_kernel<<< gridsize, blocksize >>>(n, func);}
int main() { parallel_for(7, [] __host__ __device__ (int i){ printf("hello from thread %d\n", i); }); cudaDeviceSynchronize();}
// prints:// hello from thread 0// hello from thread 1// hello from thread 2// hello from thread 3// hello from thread 4// hello from thread 5// hello from thread 6