first_kernel

Note: the code snippets in this document use compiler explorer. If you're interesting in setting up a local instance of compiler explorer that supports GPU execution, check out this guide.

Hello World (CPU)

We'll start off with a simple "hello world" program.

This program's main calls the print_hello_cpu() function, which in turn outputs


xxxxxxxxxx
hello world from the cpu

to the console before exiting. But that's not very exciting, let's write something that runs on the GPU.

Hello World (GPU)

Kernel Syntax

To make code run on the GPU, we have to put it inside a kernel. A kernel is just a special kind of function that carries out the operations on the device (GPU) rather than the host (CPU). Syntactically, kernels follow the same rules as regular C++ functions, except they have this funny __global__ attribute in the declaration. Let's add a simple kernel to our program:

Unfortunately, this code snippet doesn't compile as written. If we try to call a kernel like it was a regular C++ function, we get an error message:


xxxxxxxxxx
print_hello_gpu(); // error: a __global__ function call must be configured

Triple Chevron Launch

The issue here is that kernels are invoked slightly differently than regular C++ functions. The kernel launch expression expects more than just function arguments. That additional information is passed with this triple chevron notation before the parentheses:


xxxxxxxxxx
print_hello_gpu<<< gridsize, blocksize >>>();

We'll go over the triple chevron parameters in more detail later, but for now let's take gridsize = blocksize = 1 and see what happens.

This fixes the compilation error, and now when we rerun the program we get ...


xxxxxxxxxx
hello world from the cpu

Wait-- shouldn't we see a print statement from the kernel? What's going on?

Asynchronous Execution

When we launch a kernel, the CPU doesn't just sit and wait for it to finish. Instead, control is returned to the CPU almost immediately, so it can continue to do its own work while the GPU executes the kernel. This asynchronous execution pattern is a good thing: it gives us a way to keep both processors busy at the same time, but it does also add some complexity.

With that in mind, let's look at main() from the previous code snippet:


xxxxxxxxxx
int main() {
    print_hello_cpu();
    print_hello_gpu<<< 1, 1 >>>();
}

We see that, right after the print_hello_gpu kernel is launched, the program reaches the end of main and exits. The GPU kernel likely hasn't finished executing, so it isn't able to flush the printf output to the console in time. To prevent the program from exiting before the kernel has finished, we can add a cudaDeviceSynchronize() after the kernel launch:

cudaDeviceSynchronize makes the program wait until the GPU has finished executing its kernels. This ensures that the program does not exit prematurely, so we are now able to see the expected output


xxxxxxxxxx
hello world from cpu
hello world from gpu

Warning

Synchronization is expensive, only use it when necessary!

Launch Configuration

Now that the program is working as expected, let's see what happens when we vary the arguments in the triple chevron.

Important

With (gridsize = 4, blocksize = 1), we see "hello world from gpu" 4 times.

What output do we see with (gridsize = 2, blocksize = 2)? (gridsize = 2, blocksize = 4)?

After playing with this a bit, we see a pattern: the body of the kernel appears to be executed gridsize * blocksize times. Let's replace the print statement in the kernel by


xxxxxxxxxx
printf("hello world from gpu blockIdx: %d, threadIdx: %d\n", blockIdx.x, threadIdx.x);

and see what happens.

blockIdx and threadIdx are special variables that can be accessed when inside a kernel scope. The range of blockIdx and threadIdx values correspond to the gridsize and blocksize arguments (respectively) that we pass in the triple chevron.

Each thread is guaranteed to have a unique combination of (blockIdx, threadIdx). Typically, these values are used to determine which part of the problem that thread should work on.

Recap

In this document we walked through the process of writing a very simple kernel. We learned that:

Kernels are the way we run code on the GPU
Kernels have some syntactic requirements:
- __global__ attribute
- must return void
Kernels are launched with "triple chevron" notation
- <<< gridsize, blocksize, shmem, stream >>>
- gridsize and blocksize control the number of threads to be launched
- shmem and stream are optional (more on these later)
Kernels execute asynchronously
- This enables CPU and GPU to work simultaneously
- The CPU can queue up multiple kernels

Up next: the thread hierarchy and how the GPU executes kernels.

FAQ

If the kernel body is executed gridsize and blocksize times, what's the difference between <<<4, 1>>> <<<1,4>>>, <<<2,2>>>? Wouldn't it just be simpler to let users write <<<4>>>?

As we will see later in the thread hierarchy discussion, <<<4, 1>>> and <<<1, 4>>> both execute the kernel body code 4 times, but they do so in different ways. These launch configuration parameters control how work is allocated to hardware resources, so it does affect the performance of the kernel. So, the choice is left up to the user to allow them to find the configuration that works best for their application.

Also, if the developer wants a simplified interface for a 1D-style parallel for loop, it is easy to write a small wrapper that accomplishes this. For example:


xxxxxxxxxx
#include <cstdio>

template < typename T >
__global__ void parallel_for_kernel(int n, T func) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        func(i);
    }
}

template < typename T >
void parallel_for(int n, T func) {
    int blocksize = 128;
    int gridsize = (n + blocksize - 1) / blocksize;
    parallel_for_kernel<<< gridsize, blocksize >>>(n, func);
}

int main() {
    parallel_for(7, [] __host__ __device__ (int i){
        printf("hello from thread %d\n", i);
    });
    cudaDeviceSynchronize();
}

// prints:
// hello from thread 0
// hello from thread 1
// hello from thread 2
// hello from thread 3
// hello from thread 4
// hello from thread 5
// hello from thread 6