what_and

Graphics

"GPU" stands for "Graphics Processing Unit". The name reveals that these processors were originally designed to accelerate graphics applications like photorealistic rendering and video games. However, things changed in the early 2000s when NVIDIA released the GeForce 3:

geforce_3 — Promotional image celebrating the Geforce 3's 25th anniversary

To appreciate why this was a big deal, you have to remember that at the time, 3D graphics applications were based on a "fixed-function" pipeline. That meant that the hardware was made to do incredibly specific calculations, and if you wanted to do something else you were out of luck!

color_by_number — a depiction of the fixed function pipeline

But the GeForce 3 was different: it had programmable shaders. This innovation enabled developers to write their own custom programs to control different parts of the graphics pipeline, like coordinate transformations and lighting effects.

blank_canvas — with programmable shaders, artists have more creative freedom

CUDA

Not long after, developers began to realize that these custom shader programs could be (ab)used to perform calculations for non-graphics applications. If you were so inclined, you could write a shader program to run Conway's Game of Life or a finite-difference approximation to simulate fluid flow, but there were some issues.

The graphics pipeline was intended to output images, so if you wanted to write a shader program to perform a general calculation on the GPU, the output would still have to be a 2D array of pixels. Of course, it was possible to work around this by packing your data into images and unpacking the results afterward, but the process was unergonomic.
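To make the packing workaround concrete, here is a minimal C++ sketch of one way a 32-bit float could be stored in, and recovered from, a four-channel pixel. The `Pixel` struct and the `pack_float` / `unpack_float` helpers are hypothetical names invented for this illustration. Shader-era code did the equivalent inside the shading language, and the details depended on the texture formats available.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical RGBA pixel with 8 bits per channel, standing in for
// one element of the image that a shader would write.
struct Pixel {
    std::uint8_t r, g, b, a;
};

static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
static_assert(sizeof(Pixel) == 4, "Pixel must be exactly four bytes");

// Pack: copy the raw bytes of a float into the four color channels.
Pixel pack_float(float value) {
    Pixel p;
    std::memcpy(&p, &value, sizeof(float));
    return p;
}

// Unpack: reassemble the float from the four color channels.
float unpack_float(Pixel p) {
    float value;
    std::memcpy(&value, &p, sizeof(float));
    return value;
}
```

Every input and output of a general calculation had to make a round trip like `unpack_float(pack_float(x))`, which is part of what made the approach awkward.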

NVIDIA recognized this opportunity early and began building a way to facilitate general-purpose computation on GPUs. In the summer of 2007, NVIDIA released CUDA 1.0: a platform for running code on GPUs that included the CUDA C++ language, the SIMT programming model, the nvcc compiler, and the cuBLAS and cuFFT libraries.

We use the term "CUDA" to refer to the entire platform for running code on GPUs (including languages, tools, libraries).

Hardware

Hardware-wise, I think of CPUs like cars: they're relatively ubiquitous, small, easy to drive, and have limited capacity. On the other hand, a GPU is like a bus: conceptually similar to a car, but bigger and with a higher capacity.

We intuitively understand that if you need to move a lot of people, the car isn't really the right tool for the job: you'd have to make so many trips back and forth that it would take forever and use a lot of fuel. Buses, on the other hand, are great for moving large groups of people, because that's what they were designed to do.

It's the same with processors. Short, important tasks that arise when running an operating system work best on the CPU, taking advantage of its higher-clocked, latency-optimized cores. But when you have a lot of work to do, the GPU's parallel, throughput-optimized design wins.

Let's compare some of the high-level specification differences between the two:

Specification	13900K CPU	RTX 5090 GPU
Core/SM count	8P + 16E	170
Clock Speed	4 GHz	2 GHz
Memory Capacity	128 GB	32 GB
Cache (L2 / L1)	32 MB / 48 KB	96 MB / 128 KB
Memory Throughput	50 GB/s	1800 GB/s
Compute Throughput	~250 GFLOP/s	~100,000 GFLOP/s

"A sufficient difference in quantity is a difference in kind"

The GPU opts for many small cores, in contrast to the CPU's few large cores. The GPU's SMs lack some features like branch prediction, and the die area saved is spent on larger register files and more floating-point units instead. This is how the GPU outpaces the CPU's compute throughput by a factor of 400. The GPU design also features a much higher-throughput memory subsystem to continuously supply data to its SMs.
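As a quick check of that arithmetic, the sketch below recomputes the ratios from the approximate figures in the comparison table. The constant and function names are made up for this illustration, and the inputs are only the rounded values quoted above, not measurements.

```cpp
// Approximate figures copied from the comparison table.
constexpr double cpu_gflops   = 250.0;     // 13900K compute throughput, GFLOP/s
constexpr double gpu_gflops   = 100000.0;  // RTX 5090 compute throughput, GFLOP/s
constexpr double cpu_mem_gbps = 50.0;      // 13900K memory throughput, GB/s
constexpr double gpu_mem_gbps = 1800.0;    // RTX 5090 memory throughput, GB/s

// How many times faster the GPU is on each axis.
constexpr double compute_ratio() { return gpu_gflops / cpu_gflops; }
constexpr double memory_ratio()  { return gpu_mem_gbps / cpu_mem_gbps; }

// Checked at compile time: 100,000 / 250 = 400 and 1800 / 50 = 36.
static_assert(compute_ratio() == 400.0, "compute throughput ratio");
static_assert(memory_ratio() == 36.0, "memory throughput ratio");
```

With these numbers, the GPU has roughly 400 times the compute throughput and 36 times the memory throughput of the CPU.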

Programming Models

CPUs have multiple cores, each of which supports vector instructions. However, most programming languages default to serial execution and scalar instructions, so it takes deliberate effort to thread and vectorize code to actually take advantage of the available CPU hardware.
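To give a feel for that deliberate effort, here is a minimal C++ sketch of the same computation (`y[i] = a * x[i] + y[i]`, often called "saxpy") written first as a plain serial loop and then split across threads by hand. The function names and the chunking scheme are just one possible choice for illustration. Vectorization is left to the compiler here, and would take further work to control explicitly.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Serial, scalar baseline. Assumes y has at least as many elements as x.
void saxpy_serial(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Manually threaded version: the index range is split into contiguous
// chunks, and each chunk is handed to its own thread.
void saxpy_threaded(float a, const std::vector<float>& x, std::vector<float>& y,
                    unsigned num_threads) {
    const std::size_t n = x.size();
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // round up

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) continue;  // nothing left for this thread
        workers.emplace_back([&x, &y, a, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                y[i] = a * x[i] + y[i];
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();  // wait for every chunk to finish
    }
}
```

The arithmetic is identical in both versions. Everything extra in `saxpy_threaded` is bookkeeping that the programmer has to write and get right.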

In contrast, code run on a GPU is parallel by default. CUDA kernels expose parallelism through a "SIMT" (Single Instruction, Multiple Threads) programming model. In practice, this means that you write code from the perspective of one work item, and the complexity of launching threads, scheduling, and vectorization is handled automatically rather than controlled explicitly.
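The "one work item" perspective can be mimicked in ordinary C++. The sketch below is plain CPU code, not CUDA: `saxpy_item` is written as if it handled a single index, and the hypothetical `launch` helper stands in for the GPU runtime by calling it once per work item, serially. In real CUDA, the index would come from built-in variables such as `threadIdx` and `blockIdx`, and the work items would run in parallel on the hardware.

```cpp
#include <cstddef>
#include <vector>

// Written from the perspective of ONE work item: given its index i,
// update element i and nothing else.
void saxpy_item(std::size_t i, float a,
                const std::vector<float>& x, std::vector<float>& y) {
    if (i < x.size()) {  // guard: there may be more work items than elements
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical stand-in for a GPU launch. It runs the per-item function
// once for each index, one after another, on the CPU.
template <typename PerItem>
void launch(std::size_t num_items, PerItem per_item) {
    for (std::size_t i = 0; i < num_items; ++i) {
        per_item(i);
    }
}
```

A call such as `launch(n, [&](std::size_t i) { saxpy_item(i, 2.0f, x, y); });` applies the function to every element. Note that `saxpy_item` contains no loop and no thread management: deciding how the work items are distributed is left entirely to whatever performs the launch.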

We'll explore this topic in much more depth in the documents to follow.

Summary

This has been a very brief introduction to some of the high-level concepts related to GPU programming. If you'd like to learn more about CUDA, stick around and check out some of the other articles, like the next one in the series, which walks through the process of writing your first CUDA kernel.