Kepler (microarchitecture)

Nvidia Kepler
Predecessor Fermi
Successor Maxwell

Kepler is the codename for a GPU microarchitecture developed by Nvidia as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. All GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler and manufactured on a 28 nm process, as was the GK20A, the GPU component of the Tegra K1 SoC. Kepler was followed by the Maxwell microarchitecture and was used alongside Maxwell in the GeForce 700 series and GeForce 800M series.


Where the previous architecture focused on increasing raw performance for compute and tessellation, the Kepler architecture focuses on efficiency, programmability and performance.[1][2] The efficiency goal was achieved through the use of a unified clock. Abandoning the shader clock found in Nvidia's previous GPU designs increases efficiency, even though more cores are then required to achieve similar levels of performance. This is not only because the cores are more power-efficient (two Kepler cores use about 90% of the power of one Fermi core, according to Nvidia's numbers), but also because the reduction in clock speed delivers a 50% reduction in power consumption in that area.[3]
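The per-core arithmetic behind this trade-off can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it normalizes one Fermi core to 1.0 unit of power and assumes that the Fermi shader clock runs at twice the core clock, so that a Fermi core completes two operations per core-clock cycle where a Kepler core completes one. The variable names are invented for this sketch, and only the 90% ratio comes from Nvidia's numbers.

```python
# Normalized, illustrative figures; only the 90% ratio comes from Nvidia's numbers.
FERMI_CORE_POWER = 1.0    # one Fermi core, normalized to 1.0
KEPLER_PAIR_POWER = 0.9   # two Kepler cores use about 90% of that

# Assumption: the Fermi shader clock is double the core clock, so one Fermi
# core does 2 operations per core-clock cycle; a Kepler core does 1.
FERMI_OPS_PER_CYCLE = 2
KEPLER_OPS_PER_CYCLE = 1

# Number of Kepler cores needed to match one Fermi core's throughput.
kepler_cores_needed = FERMI_OPS_PER_CYCLE // KEPLER_OPS_PER_CYCLE
kepler_power_per_core = KEPLER_PAIR_POWER / 2
power_for_same_throughput = kepler_cores_needed * kepler_power_per_core
power_saving = 1 - power_for_same_throughput / FERMI_CORE_POWER

print(f"Kepler cores per Fermi core: {kepler_cores_needed}")
print(f"Relative power for equal throughput: {power_for_same_throughput:.2f}")
print(f"Power saving per unit of throughput: {power_saving:.0%}")
```

Under these assumptions, matching one Fermi core takes two Kepler cores, which is why the core count has to grow even as the power drawn per unit of work falls.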

Programmability was achieved with a new form of texture handling known as bindless textures. Previously, textures needed to be bound by the CPU to a particular slot in a fixed-size table before the GPU could reference them. This led to two limitations: one was that because the table was fixed in size, there could only be as many textures in use at one time as could fit in this table (128). The second was that the CPU was doing unnecessary work: it had to load each texture, and also bind each texture loaded in memory to a slot in the binding table.[2] With bindless textures, both limitations are removed. The GPU can access any texture loaded into memory, increasing the number of available textures and removing the performance penalty of binding.
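The difference between the two schemes can be illustrated with a toy model in Python. This is not a graphics API: the class names `BindingTable` and `BindlessHeap` and their methods are hypothetical, and the model only captures the two limitations described above, namely the fixed table size (128 slots) and the extra bind step performed by the CPU.

```python
class BindingTable:
    """Toy model of slot-based binding: the CPU binds textures into a fixed table."""

    def __init__(self, num_slots=128):
        self.slots = [None] * num_slots
        self.bind_calls = 0  # counts the extra CPU work spent on binding

    def bind(self, slot, texture):
        if not 0 <= slot < len(self.slots):
            raise IndexError("binding table is fixed in size")
        self.slots[slot] = texture
        self.bind_calls += 1

    def sample(self, slot):
        return self.slots[slot]


class BindlessHeap:
    """Toy model of bindless access: loading a texture yields a handle."""

    def __init__(self):
        self._textures = {}
        self._next_handle = 0

    def load(self, texture):
        handle = self._next_handle
        self._textures[handle] = texture
        self._next_handle += 1
        return handle

    def sample(self, handle):
        return self._textures[handle]


textures = [f"texture_{i}" for i in range(1000)]

# Slot-based: only 128 textures can be in use, each needing a bind call.
table = BindingTable()
for slot, tex in enumerate(textures[:128]):
    table.bind(slot, tex)

# Bindless: every loaded texture is reachable through its handle.
heap = BindlessHeap()
handles = [heap.load(tex) for tex in textures]
print(table.bind_calls, len(handles))
```

In the toy model, the bindless path keeps all 1000 textures addressable and never performs a bind call, while the table-based path stops at 128 textures and pays one bind per texture.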

Finally, for performance, Kepler raised the memory clock to 6 GHz. Accomplishing this required an entirely new memory controller and bus. While still shy of the theoretical 7 GHz limit of GDDR5, this is well above the 4 GHz speed of the memory controller in Fermi.[3]
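To put these figures in terms of bandwidth, the following Python sketch converts a memory speed into peak bandwidth. It rests on two assumptions that are not stated above: the quoted GHz figures are read as effective per-pin data rates in Gbit/s, and the bus is taken to be 256 bits wide purely for illustration. The function name is hypothetical.

```python
def peak_bandwidth_gb_s(data_rate_gbps, bus_width_bits):
    """Peak bandwidth in GB/s: per-pin data rate (Gbit/s) times bus width, over 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


BUS_WIDTH_BITS = 256  # assumption for illustration; not taken from the text above

kepler = peak_bandwidth_gb_s(6, BUS_WIDTH_BITS)       # 6 GHz memory clock
fermi = peak_bandwidth_gb_s(4, BUS_WIDTH_BITS)        # 4 GHz memory clock
gddr5_limit = peak_bandwidth_gb_s(7, BUS_WIDTH_BITS)  # theoretical 7 GHz limit

print(f"Kepler: {kepler} GB/s, Fermi: {fermi} GB/s, GDDR5 limit: {gddr5_limit} GB/s")
```

On the same assumed bus, moving from 4 GHz to 6 GHz raises peak bandwidth by half, and the 7 GHz limit would add a further sixth on top of that.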


The GK series GPUs contain features from both the older Fermi and newer Kepler generations. Kepler-based members add the following standard features:

  • PCI Express 3.0 interface
  • DisplayPort 1.2
  • HDMI 1.4a 4K x 2K video output
  • PureVideo VP5 hardware video acceleration (up to 4K x 2K H.264 decode)
  • Hardware H.264 encoding acceleration block (NVENC)
  • Support for up to 4 independent 2D displays, or 3 stereoscopic/3D displays (NV Surround)
  • Next Generation Streaming Multiprocessor (SMX)
  • Simplified Instruction Scheduler
  • Bindless Textures
  • CUDA Compute Capability 3.0 to 3.5
  • GPU Boost (Upgraded to 2.0 on GK110)
  • TXAA Support
  • Manufactured by TSMC on a 28 nm process
  • New Shuffle Instructions
  • Dynamic Parallelism
  • Hyper-Q (Hyper-Q's MPI functionality is reserved for Tesla only)
  • Grid Management Unit
  • NVIDIA GPUDirect (GPUDirect's RDMA functionality is reserved for Tesla only)

Next Generation Streaming Multiprocessor (SMX)

The Kepler architecture employs a new streaming multiprocessor design called SMX. The SMX is the main reason for Kepler's power efficiency, as the whole GPU uses a single unified "core clock" rather than the double-pumped "shader clock".[3] The unified clock improves power efficiency (two Kepler CUDA cores consume about 90% of the power of one Fermi CUDA core), but it also means that the SMX needs additional processing units to execute a whole warp per cycle. Nvidia therefore doubled the number of CUDA cores per CUDA array from 16 to 32. To feed the additional CUDA core arrays, the other SMX resources were doubled as well: the warp schedulers, the dispatch units and the register file (now 64K entries), along with the load/store and SFU groups. Alongside this doubling of processing units and resources, which increases die area, the PolyMorph Engine was enhanced so that it can output a polygon in 2 cycles instead of 4.[4]

With the SMX, Kepler had to improve area efficiency as well as power efficiency. Because the regular Kepler CUDA cores are not FP64-capable, Nvidia opted for dedicated FP64 CUDA cores in each SMX, saving die space while still offering FP64 capability. Together, the SMX improvements increase both graphics performance and FP64 performance.

With GK110, the 48 KB texture cache is unlocked for compute workloads. In compute, the texture cache becomes a read-only data cache that specializes in unaligned memory access workloads. Error detection capabilities have also been added to make it safer for workloads that rely on ECC. The maximum number of registers per thread is raised as well, from 63 to 255.

Simplified Instruction Scheduler

Additional die space was freed by replacing the complex hardware scheduler with a simpler one that relies on software scheduling. Warp scheduling was moved to Nvidia's compiler, and because the GPU math pipeline now has a fixed latency, the compiler can exploit instruction-level parallelism in addition to thread-level parallelism. Since instructions are statically scheduled, the move to fixed-latency instructions makes scheduling consistent, and static scheduling in the compiler removes a level of hardware complexity.[2][3][5][6]

GPU Boost

GPU Boost is a new feature which is roughly analogous to turbo boosting of a CPU. The GPU is always guaranteed to run at a minimum clock speed, referred to as the "base clock". This clock speed is set to the level which will ensure that the GPU stays within TDP specifications, even at maximum loads.[2] When loads are lower, however, there is room for the clock speed to be increased without exceeding the TDP. In these scenarios, GPU Boost will gradually increase the clock speed in steps, until the GPU reaches a predefined power target (which is 170W by default).[3] By taking this approach, the GPU will ramp its clock up or down dynamically, so that it is providing the maximum amount of speed possible while remaining within TDP specifications.

The power target, as well as the size of the clock increase steps that the GPU will take, are both adjustable via third-party utilities and provide a means of overclocking Kepler-based cards.[2]
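The stepping behaviour described above can be sketched as a simple loop. This is a minimal illustration rather than Nvidia's actual algorithm: the function names, the linear power model, the clock limits and the 13 MHz step size are all assumptions made for the example, and only the 170 W power target comes from the text.

```python
def boost_clock(base_mhz, max_mhz, step_mhz, power_target_w, estimate_power_w):
    """Raise the clock in fixed steps while the estimated power stays within the target."""
    clock = base_mhz
    while (clock + step_mhz <= max_mhz
           and estimate_power_w(clock + step_mhz) <= power_target_w):
        clock += step_mhz
    return clock


def make_power_model(watts_per_mhz):
    """Hypothetical linear power model: heavier workloads draw more watts per MHz."""
    return lambda clock_mhz: watts_per_mhz * clock_mhz


# Illustrative values; only the 170 W power target is taken from the text.
BASE_MHZ, MAX_MHZ, STEP_MHZ, POWER_TARGET_W = 1006, 1110, 13, 170

light = boost_clock(BASE_MHZ, MAX_MHZ, STEP_MHZ, POWER_TARGET_W, make_power_model(0.140))
heavy = boost_clock(BASE_MHZ, MAX_MHZ, STEP_MHZ, POWER_TARGET_W, make_power_model(0.165))
print(f"light load: {light} MHz, heavy load: {heavy} MHz")
```

In this model the light workload climbs to the assumed maximum clock, while the heavy workload stops after a single step because the next one would exceed the power target. Passing a larger `power_target_w` or `step_mhz` changes where the loop settles, which mirrors how adjusting those two values acts as an overclocking control.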

Microsoft Direct3D Support

Nvidia Fermi and Kepler GPUs of the GeForce 600 series support the Direct3D 11.0 specification. Nvidia originally stated that the Kepler architecture has full DirectX 11.1 support, which includes the Direct3D 11.1 path.[7] The following "Modern UI" Direct3D 11.1 features, however, are not supported:[8][9]

  • Target-Independent Rasterization (2D rendering only).
  • 16xMSAA Rasterization (2D rendering only).
  • Orthogonal Line Rendering Mode.
  • UAV (Unordered Access View) in non-pixel-shader stages.

According to the definition by Microsoft, Direct3D Feature Level 11_1 must be complete; otherwise the Direct3D 11.1 path cannot be executed.[10] The integrated Direct3D features of the Kepler architecture are the same as those of the GeForce 400 series Fermi architecture.[9]

Next Microsoft DirectX Support

NVIDIA Kepler GPUs of the GeForce 600/700 series support DirectX 12.[11]

NVIDIA will support the DX12 API on all the DX11-class GPUs it has shipped; these belong to the Fermi, Kepler and Maxwell architectural families.

TXAA Support

Exclusive to Kepler GPUs, TXAA is a new anti-aliasing method from Nvidia that is designed for direct implementation into game engines. TXAA is based on the MSAA technique and custom resolve filters. It is designed to address a key problem in games known as shimmering or temporal aliasing. TXAA resolves this by smoothing out the scene in motion, so that in-game scenes are cleared of aliasing and shimmering.[2]


NVENC

Main article: Nvidia NVENC

NVENC is Nvidia's power-efficient fixed-function encoder, which can decode, preprocess and encode H.264-based content. Its output format is limited to H.264, but within that format NVENC supports encoding at resolutions up to 4096x4096.[12]

Like Intel’s Quick Sync, NVENC is currently exposed through a proprietary API, though Nvidia does have plans to provide NVENC usage through CUDA.[12]

Shuffle Instructions

At a low level, GK110 adds instructions and operations to further improve performance. New shuffle instructions allow threads within a warp to share data without going back to memory, making the process much quicker than the previous load/share/store method. Atomic operations are also overhauled, speeding up their execution and adding some FP64 operations that were previously only available for FP32 data.[13]
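The data movement performed by a shuffle can be modelled on the CPU with plain Python lists, one element per lane of a 32-thread warp. The sketch below imitates the semantics of a "shuffle down" operation (as exposed in CUDA by `__shfl_down`) and uses it for a warp-wide sum. It is only a model of the exchange pattern, not GPU code, and the Python function names are chosen for this example.

```python
WARP_SIZE = 32


def shfl_down(lane_values, delta):
    """Each lane i receives the value held by lane i + delta.

    Lanes whose source would fall outside the warp keep their own value.
    """
    n = len(lane_values)
    return [lane_values[i + delta] if i + delta < n else lane_values[i]
            for i in range(n)]


def warp_reduce_sum(lane_values):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    values = list(lane_values)
    assert len(values) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(values, offset)  # register-to-register exchange
        values = [own + other for own, other in zip(values, received)]
        offset //= 2
    return values[0]


total = warp_reduce_sum(range(WARP_SIZE))
print(total)  # 496, the sum of 0..31
```

The reduction finishes in five exchange steps (offsets 16, 8, 4, 2 and 1), and none of them touches memory in the real hardware version, which is where the speed-up over a load/share/store sequence comes from.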


Hyper-Q

Hyper-Q expands the GK110 hardware work queues from 1 to 32. With a single work queue, Fermi could be under-occupied at times because there was not enough work in that queue to fill every SM. With 32 work queues, GK110 can in many scenarios achieve higher utilization by placing different task streams on what would otherwise be an idle SMX. Hyper-Q also maps easily to MPI, a common message passing interface frequently used in HPC. Legacy MPI-based algorithms that were originally designed for multi-CPU systems, and that became bottlenecked by false dependencies, now have a solution: by increasing the number of MPI jobs, Hyper-Q can be used to improve the efficiency of these algorithms without changing the code itself.[13]
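The effect of false dependencies can be illustrated with a small scheduling model in Python. It is a deliberately simplified toy, not a description of the real hardware: every kernel is assumed to take one time step on one SMX, streams are submitted one after another, a kernel may start only after the previous kernel of its own stream has finished, and a kernel that cannot start blocks everything behind it in the same hardware queue. The function name `steps_to_finish` is hypothetical.

```python
from collections import deque


def steps_to_finish(num_streams, kernels_per_stream, num_queues, num_smx):
    """Count time steps needed to drain all streams through the hardware queues."""
    queues = [deque() for _ in range(num_queues)]
    for stream in range(num_streams):
        for kernel in range(kernels_per_stream):
            queues[stream % num_queues].append((stream, kernel))

    done = [0] * num_streams  # kernels completed so far, per stream
    steps = 0
    while any(queues):
        free_smx = num_smx
        launched = []
        for queue in queues:
            while queue and free_smx > 0:
                stream, kernel = queue[0]
                if kernel != done[stream]:
                    break  # predecessor still running: the queue head is blocked
                queue.popleft()
                launched.append(stream)
                free_smx -= 1
        for stream in launched:
            done[stream] += 1  # kernels launched this step finish at its end
        steps += 1
    return steps


single_queue = steps_to_finish(4, 3, num_queues=1, num_smx=8)
hyper_q = steps_to_finish(4, 3, num_queues=32, num_smx=8)
print(single_queue, hyper_q)  # 9 steps versus 3 steps
```

With one queue, the four independent streams are almost fully serialized even though idle SMX units are available; with one queue per stream, they advance side by side and the work finishes in as many steps as a single stream needs.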

Dynamic Parallelism

Dynamic Parallelism is the ability of kernels to dispatch other kernels. With Fermi, only the CPU could dispatch a kernel, which incurs a certain amount of overhead from having to communicate back to the CPU. By giving kernels the ability to dispatch their own child kernels, GK110 both saves time by not having to go back to the CPU and, in the process, frees up the CPU to work on other tasks.[13]

Grid Management Unit

Enabling Dynamic Parallelism requires a new grid management and dispatch control system. The new Grid Management Unit (GMU) manages and prioritizes grids to be executed. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The CUDA Work Distributor in Kepler holds grids that are ready to dispatch, and is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.[14]

NVIDIA GPUDirect

NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory[citation needed]. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer‐to‐Peer and GPUDirect for Video.

See also