CUDA Pinned memory
When computational codes are accelerated by parallel computing on a Graphics Processing Unit (GPU), the data to be processed must be transferred from the Central Processing Unit (CPU) to the GPU, and the results must be transferred back from the GPU to the CPU. In a code accelerated by General-Purpose GPU (GPGPU) computing, such transfers can occur many times and may affect overall performance, so it becomes important to carry them out as fast as possible.
To allow programmers to use a virtual address space larger than the RAM actually available, CPUs (or hosts, in the language of GPGPU) implement a virtual memory system in which a physical memory page can be swapped out to disk. When the host needs that page, it loads it back in from the disk. Memory managed in this way is called pageable (non-locked) memory. Its drawback for CPU<->GPU transfers is that they are slower, i.e., the bandwidth of the PCI-E bus connecting CPU and GPU is not fully exploited. PCI-E transfers are carried out only by Direct Memory Access (DMA), which requires the data to stay resident at fixed physical addresses. Non-locked memory is not guaranteed to reside in RAM (e.g., it can be in swap), so the driver must access every single page of the non-locked memory, copy it into a pinned buffer and pass it to the DMA engine (a synchronous, page-by-page copy). Accordingly, when a “normal” transfer is issued, a block of page-locked memory must be allocated, followed by a host copy from regular memory to the page-locked block, the transfer itself, the wait for the transfer to complete and the release of the page-locked memory. This consumes precious host time, which is avoided when page-locked memory is used directly.
However, with the amount of RAM available in today’s machines, swapping to disk is no longer necessary for many applications, whose data fit entirely within the host memory. In all those cases, it is more convenient to use page-locked (pinned) memory, which enables a DMA engine on the GPU to request transfers to and from the host memory without the involvement of the CPU. In other words, locked memory is guaranteed to stay in physical memory (RAM), so the GPU (or device, in the language of GPGPU) can fetch it without the help of the host (asynchronous copy).
GPU memory is automatically allocated as page-locked, since GPU memory does not support swapping to disk. To allocate page-locked memory on the host, the CUDA runtime API provides cudaHostAlloc (and the simpler cudaMallocHost); memory obtained in this way is released with cudaFreeHost.