Multiple blocks are combined to form a grid. All the blocks in the same grid contain the same number of threads. The number of threads in a block is limited to 1024, but grids can be used for computations that require a large number of thread blocks to operate in parallel.

CUDA is a parallel computing platform and programming model that higher level languages can use to exploit parallelism. In CUDA, the kernel is executed with the aid of threads. The thread is an abstract entity that represents the execution of the kernel. A kernel is a small program or a function. Multi threaded applications use many such threads that are running at the same time, to organize parallel computation. Every thread has an index, which is used for calculating memory address locations and also for taking control decisions.

## Dimensions

CUDA operates on a heterogeneous programming model which is used to run host device application programs. It has an execution model that is similar to OpenCL. In this model, we start executing an application on the host device which is usually a CPU core. The device is a throughput oriented device, i.e., a GPU core which performs parallel computations. Kernel functions are used to do these parallel executions. Once these kernel functions are executed the control is passed back to the host device that resumes serial execution.

As many parallel applications involve multidimensional data, it is convenient to organize thread blocks into 1D, 2D or 3D arrays of threads. The blocks in a grid must be able to be executed independently, as communication or cooperation between blocks in a grid is not possible. 'When a kernel is launched the number of threads per thread block, and the number of thread blocks is specified, this, in turn, defines the total number of CUDA threads launched.[3]' The maximum x, y and z dimensions of a block are 1024, 1024 and 64, and it should be allocated such that x × y × z ≤ 1024, which is the maximum number of threads per block.[3] Blocks can be organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension.[3] The limitation on the number of threads in a block is actually imposed because the number of registers that can be allocated across all threads is limited.[3]

## Indexing

### 1D-Indexing

Every thread in CUDA is associated with a particular index so that it can calculate and access memory locations in an array.

Consider an example in which there is an array of 512 elements. One of the organization structure is taking a grid with a single block that has a 512 threads. Consider that there is an array C of 512 elements that is made of element wise multiplication of two arrays A and B which are both 512 elements each. Every thread has an index i and it performs the multiplication of ith element of A and B and then store the result in the ith element of C. i is calculated by using blockIdx (which is 0 in this case as there is only one block), blockDim (512 in this case as the block has 512 elements) and threadIdx that varies from 0 to 511 for each block.

The thread index i is calculated by the following formula :

${\displaystyle i=blockIdx.x*blockDim.x+threadIdx.x}$

blockIdx.x is the x dimension block identifier

blockDim.x is the x dimension of the block dimension

Thus ‘i’ will have values ranging from 0 to 511 that covers the entire array.

If we want to consider computations for an array that is larger than 1024 we can have multiple blocks with 1024 threads each. Consider an example with 2048 array elements. In this case we have 2 thread blocks with 1024 threads each. Thus the thread identifiers' values will vary from 0 to 1023, the block identifier will vary from 0 to 1 and the block dimension will be 1024. Thus the first block will get index values from 0 to 1023 and the last one will have index values from 1024 to 2047.

Thus each thread will first calculate the index of memory that it has to access and then proceed with the calculation. Consider an example in which elements from arrays A and B are added in parallel by using threads and the results is stored in an array C. The corresponding code in a thread is shown below :[5]

```__global__
void vecAddKernel (float *A , float *B , float * C , int n)
{
int index = blockIdx.x * blockDim.x + threadIdx.x ;
if( index < n )
{
C[index] = A[index] + B[index] ;
}
}
```

### 2D-Indexing

In the same way in particularly complex grids, the blockId as well as the threadId need to be calculated by each thread depending on geometry of the grid. Consider, a 2-dimensional Grid with 2-dimensional blocks. The threadId and the blockId will be calculated by the following formulae :

${\displaystyle blockId=blockIdx.x+blockIdx.y*gridDim.x;}$ ${\displaystyle threadId=blockId*(blockDim.x*blockDim.y)+(threadIdx.y*blockDim.x)+threadIdx.x;}$[6]

## Hardware perspective

Although we have stated the hierarchy of threads, we should note that, threads, thread blocks and grid are essentially a programmer’s perspective. In order to get a complete gist of thread block, it is critical to know it from a hardware perspective. The hardware groups threads that execute the same instruction into warps. Several warps constitute a thread block. Several thread blocks are assigned to a Streaming Multiprocessor (SM). Several SM constitute the whole GPU unit (which executes the whole Kernel Grid).

A pictorial correlation of a programmer's perspective versus a hardware perspective of a thread block in GPU.[7]

### Streaming multiprocessors

Each architecture in GPU (say Kepler or Fermi) consists of several SM or Streaming Multiprocessors. These are general purpose processors with a low clock rate target and a small cache. An SM is able to execute several thread blocks in parallel. As soon as one of its thread block has completed execution, it takes up the serially next thread block. In general, SMs support instruction-level parallelism but not branch prediction.[8]

An illustration of a streaming multiprocessor and its resources.[9]

To achieve this purpose, an SM contains the following:[8]

• Execution cores. (single precision floating-point units, double precision floating-point units, special function units (SFUs)).
• Caches:
1. L1 cache. (for reducing memory access latency).
2. Shared memory. (for shared data between threads).
4. Texture cache. (for aggregating bandwidth from texture memory).
• Schedulers for warps. (these are for issuing instructions to warps based on particular scheduling policies).
• A substantial number of registers. (an SM may be running a large number of active threads at a time, so it is a must to have registers in thousands.)

The hardware schedules thread blocks to an SM. In general an SM can handle multiple thread blocks at the same time. An SM may contains up to 8 thread blocks in total. A thread ID is assigned to a thread by its respective SM.

Whenever an SM executes a thread block, all the threads inside the thread block are executed at the same time. Hence to free a memory of a thread block inside the SM, it is critical that the entire set of threads in the block have concluded execution. Each thread block is divided in scheduled units known as a warp. These are discussed in detail in the following section.

An Illustration of a double warp scheduler implemented in the Fermi micro-architecture of NVidia.[10]

The warp scheduler of SM decides which of the warp gets prioritized during issuance of instructions.[11] Some of the warp prioritizing policies have also been discussed in the following sections.

### Warps

On the hardware side, a thread block is composed of ‘warps’. A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM.

Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes. Thus a new block is not launched on an SM until there is sufficient number of free registers for all warps of the new block, and until there is enough free shared memory for the new block.

Consider a warp of 32 threads executing an instruction. If one or both of its operands are not ready (e.g. have not yet been fetched from global memory), a process called ‘context switching’ takes place which transfers control to another warp.[12] When switching away from a particular warp, all the data of that warp remains in the register file so that it can be quickly resumed when its operands become ready. When an instruction has no outstanding data dependencies, that is, both of its operands are ready, the respective warp is considered to be ready for execution. If more than one warps are eligible for execution, the parent SM uses a warp scheduling policy for deciding which warp gets the next fetched instruction.

Different policies for scheduling warps that are eligible for execution are discussed below:[13]

1. Round Robin (RR) - Instructions are fetched in round robin manner. RR makes sure - SMs are kept busy and no clock cycles are wasted on memory latencies.
2. Least Recently Fetched (LRF) - In this policy, warp for which instruction has not been fetched for the longest time gets priority in the fetching of an instruction.
3. Fair (FAIR)[13] - In this policy, the scheduler makes sure all warps are given ‘fair’ opportunity in the number of instruction fetched for them. It fetched instruction to a warp for which minimum number of instructions have been fetched.
4. Thread block-based CAWS[14] (criticality aware warp scheduling) - The emphasis of this scheduling policy is on improving the execution time of the thread blocks. It allocated more time resources to the warp that shall take the longest time to execute. By giving priority to the most critical warp, this policy allows thread blocks to finish faster, such that the resources become available quicker.

Traditional CPU thread context "switching" requires saving and restoring allocated register values and the program counter to off-chip memory (or cache) and is therefore a much more heavyweight operation than with warp context switching. All of a warp's register values (including its program counter) remain in the register file, and the shared memory (and cache) remain in place too since these are shared between all the warps in the thread block.

In order to take advantage of the warp architecture, programming languages and developers need to understand how to coalesce memory accesses and how to manage control flow divergence. If each thread in a warp takes a different execution path or if each thread accesses significantly divergent memory then the benefits of the warp architecture are lost and performance will significantly degrade.

## References

1. ^ "CUDA Overview". cuda.ce.rit.edu. Retrieved 2016-09-21.
2. ^ https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
3. ^ a b c d "CUDA Thread Model". www.olcf.ornl.gov. Retrieved 2016-09-21.
4. ^ "Thread Hierarchy in CUDA Programming". Retrieved 2016-09-21.
5. ^ Kirk, David; Hwu, Wen-mei W (January 28, 2010). Programming Massively Parallel Processors: A Hands-on Approach.
6. ^ "Thread Indexing Cheatsheet" (PDF). Retrieved 2016-09-21.
7. ^
8. ^ a b Wilt, Nicholas (2013). The CUDA Handbook: A Comprehensive Guide to GPU Programming.
9. ^
10. ^
11. ^
12. ^
13. ^ a b
14. ^