CUDA Programming Introduction

NumbaPro provides multiple entry points for programmers of different levels of expertise on CUDA. For expert CUDA-C programmers, NumbaPro provides a Python dialect <CUDAJit.html>`_ for low-level programming on the CUDA hardware. It provides full control over the hardware for fine tunning the performance of CUDA kernels. For new CUDA programmers, the high-level API such as the universal functions (ufunc) and generalized ufuncs (gufunc) are the easiest way to write array operations for the GPU.

A Very Brief Introduction to CUDA

A CUDA GPU contains one or more streaming multiprocessors (SMs). Each SM is a manycore processor that is optimized for high throughput. The manycore architecture is very different from the common multicore CPU architecture. Instead of having a large cache and complex logic for instruction level optimization, a manycore processor achieves high throughput by executing many threads in parallel on many simpler cores. It overcomes latency due to cache miss or long operations by using zero-cost context switching. It is common to launch a CUDA kernel with hundreds or thousands of threads to keep the GPU busy.

The CUDA programming model is similar to the SIMD vector model in modern CPUs. A CUDA SM schedules the same instruction from a warp of 32-threads at each issuing cycle. The advantage of CUDA is that the programmer does not need to handle the divergence of execution path in a warp, whereas a SIMD programmer would be required to properly mask and shuffle the vectors. The CUDA model decouples the data structure from the program logic.

To know more about CUDA, please refer to NVIDIA CUDA-C Programming Guide.