Tesla (microarchitecture)

This article is about the GPU microarchitecture. For GPGPU cards, see Nvidia Tesla.
Nvidia Tesla
Nvidia Tesla GPU
Predecessor G70
Successor Fermi

Tesla is the codename for a GPU microarchitecture developed by Nvidia as the successor to their prior microarchitectures. Tesla is Nvidia's first microarchitecture to implement unified shaders. It was used in the GeForce 8 Series, GeForce 9 Series, GeForce 100 Series, GeForce 200 Series, and GeForce 300 Series of GPUs, manufactured on 90 nm, 80 nm, 65 nm, and 55 nm processes. It also found use in the GeForce 405, and in the workstation market in the Quadro FX, Quadro x000, and Quadro NVS series, as well as in Nvidia Tesla computing modules. Tesla replaced the old fixed-pipeline microarchitectures and competed directly with TeraScale, AMD's first unified shader microarchitecture. Tesla was followed by Fermi.

The Tesla series takes its name from pioneering electrical engineer Nikola Tesla.


Tesla is Nvidia's first microarchitecture implementing the unified shader model. It supports Direct3D 10 with Shader Model 4.0 and OpenGL 2.1 (later drivers add OpenGL 3.3 support). The design is a major shift for Nvidia in GPU functionality and capability, the most obvious change being the move from the separate functional units (pixel shaders, vertex shaders) of previous GPUs to a homogeneous collection of universal floating-point processors (called "stream processors") that can perform a more general set of tasks.

Model Adrianne Curry watching a 3D animation of herself during a GeForce 8 demo.

GeForce 8's unified shader architecture consists of a number of stream processors (SPs). Unlike the vector processing approach taken with older shader units, each SP is scalar and thus can operate on only one component at a time. This makes the SPs less complex to build while still being quite flexible and universal. Scalar shader units also have the advantage of being more efficient in a number of cases than previous-generation vector shader units, which rely on an ideal instruction mixture and ordering to reach peak throughput. The lower maximum throughput of these scalar processors is compensated for by efficiency and by running them at a high clock speed (made possible by their simplicity). GeForce 8 runs the various parts of its core at differing clock speeds (clock domains), similar to the operation of the previous GeForce 7 Series GPUs. For example, the stream processors of the GeForce 8800 GTX operate at a 1.35 GHz clock rate while the rest of the chip operates at 575 MHz.[1]
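The efficiency argument for scalar units can be illustrated with a toy model. The sketch below is not a description of any real GPU scheduler: it assumes a hypothetical 4-lane vector unit that spends one issue slot per instruction regardless of how many components the instruction uses, and it ignores any co-issue or packing a real design might do. The function names and the sample instruction mix are invented for illustration.

```python
def vector_lane_utilization(widths, lanes=4):
    """Fraction of lanes doing useful work on a hypothetical vector unit.

    `widths` lists how many components (1..lanes) each instruction uses.
    Assumption: every instruction occupies one full issue slot of `lanes` lanes.
    """
    if not widths:
        return 0.0
    return sum(widths) / (lanes * len(widths))


def scalar_issue_slots(widths):
    """Issue slots needed by one scalar unit: one slot per component."""
    return sum(widths)


# Hypothetical instruction mix: a vec4 op, a vec3 op, a scalar op, a vec2 op.
mix = [4, 3, 1, 2]

# 10 useful components out of 16 available lane slots.
print(vector_lane_utilization(mix))  # 0.625

# A scalar unit needs 10 slots for the same work, with no idle lanes.
print(scalar_issue_slots(mix))  # 10
```

In this model the vector unit reaches full utilization only when every instruction uses all four components, which is the "ideal instruction mixture" the text refers to; a scalar unit does the same work in more slots but never leaves a lane idle.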

GeForce 8 performs significantly better texture filtering than its predecessors, which used various optimizations and visual tricks to speed up rendering at the expense of filtering quality. The GeForce 8 line implements angle-independent anisotropic filtering along with full trilinear texture filtering. G80, though not its smaller brethren, is equipped with much more texture filtering arithmetic ability than the GeForce 7 series. This allows high-quality filtering with a much smaller performance hit than previously.[1]

Nvidia has also introduced new polygon edge anti-aliasing methods, including the ability of the GPU's ROPs to perform both multisample anti-aliasing (MSAA) and HDR lighting at the same time, correcting various limitations of previous generations. GeForce 8 can perform MSAA with both FP16 and FP32 texture formats. GeForce 8 supports 128-bit HDR rendering, an increase from prior cards' 64-bit support. The chip's new anti-aliasing technology, called coverage sampling AA (CSAA), uses Z, color, and coverage information to determine final pixel color. This technique of color optimization allows 16X CSAA to look crisp and sharp.[2]

The claimed theoretical processing power for the 8 Series cards, given in FLOPS, may not be achievable in practice. For example, the GeForce 8800 GTX has a theoretical performance of 518.4 GigaFLOPs, given that there are 128 stream processors at 1.35 GHz, with each SP able to run 1 Multiply-Add and 1 Multiply instruction per clock [(MADD (2 FLOPs) + MUL (1 FLOP))×1350 MHz×128 SPs = 518.4 GigaFLOPs].[3] This figure may overstate real throughput because the Multiply operation is not always available,[4] giving a possibly more accurate performance figure of (2×1350×128) = 345.6 GigaFLOPs.
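The arithmetic behind both figures can be reproduced with a few lines of Python. This is only a sketch of the peak-rate formula quoted above (SPs × shader clock × FLOPs per SP per clock); the function name `peak_gflops` and its parameter names are chosen here for illustration.

```python
def peak_gflops(stream_processors, shader_clock_mhz, flops_per_clock):
    """Theoretical peak in GigaFLOPs.

    SPs x clock (MHz) x FLOPs per SP per clock gives MegaFLOPs;
    dividing by 1000 converts to GigaFLOPs.
    """
    return stream_processors * shader_clock_mhz * flops_per_clock / 1000.0


# GeForce 8800 GTX figures quoted in the text.
SPS = 128
SHADER_CLOCK_MHZ = 1350

# MADD (2 FLOPs) + MUL (1 FLOP) per clock: the advertised peak.
with_mul = peak_gflops(SPS, SHADER_CLOCK_MHZ, 3)

# MADD only (2 FLOPs per clock): the figure when the extra MUL is unavailable.
madd_only = peak_gflops(SPS, SHADER_CLOCK_MHZ, 2)

print(f"MADD + MUL: {with_mul:.1f} GigaFLOPs")   # MADD + MUL: 518.4 GigaFLOPs
print(f"MADD only:  {madd_only:.1f} GigaFLOPs")  # MADD only:  345.6 GigaFLOPs
```

The two results differ by exactly the one-third of the advertised peak that is attributed to the extra Multiply.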


  • G80
  • G84
  • G86
  • G92
  • G94
  • G96
  • G98
  • GT200
  • GT215
  • GT216
  • GT218

