# Translation lookaside buffer

A translation lookaside buffer (TLB) is a cache that memory management hardware uses to improve virtual address translation speed. All current desktop, laptop, and server processors use a TLB to map virtual and physical address spaces, and it is nearly always present in any hardware which utilizes virtual memory.

## Overview

A TLB has a fixed number of slots that contain page table entries, which map virtual addresses to physical addresses. The virtual memory is the space seen from a process. This space is segmented in pages of a prefixed size. The page table (generally loaded in memory) keeps track of where the virtual pages are loaded in the physical memory. The TLB is a cache of the page table; that is, only a subset of its contents are stored.

The TLB references physical memory addresses in its table. It may reside between the CPU and the CPU cache, between the CPU cache and primary storage memory, or between levels of a multi-level cache. The placement determines whether the cache uses physical or virtual addressing. If the cache is virtually addressed, requests are sent directly from the CPU to the cache, and the TLB is accessed only on a cache miss. If the cache is physically addressed, the CPU does a TLB lookup on every memory operation and the resulting physical address is sent to the cache. There are pros and cons to both implementations. Caches that use virtual addressing have for their key part of the virtual address plus, optionally, a key called an "address space identifier" (ASID). Caches that don't have ASIDs must be flushed every context switch in a multiprocessing environment.

In a Harvard architecture or hybrid thereof, a separate virtual address space or memory access hardware may exist for instructions and data. This can lead to distinct TLBs for each access type.

A common optimization for physically addressed caches is to perform the TLB lookup in parallel with the cache access. The low-order bits of any virtual address (e.g., in a virtual memory system having 4 KB pages, the lower 12 bits of the virtual address) represent the offset of the desired address within the page, and thus they do not change in the virtual-to-physical translation. During a cache access, two steps are performed: an index is used to find an entry in the cache's data store, and then the tags for the cache line found are compared. If the cache is structured in such a way that it can be indexed using only the bits that do not change in translation, the cache can perform its "index" operation while the TLB translates the upper bits of the address. Then, the translated address from the TLB is passed to the cache. The cache performs a tag comparison to determine if this access was a hit or miss. It is possible to perform the TLB lookup in parallel with the cache access even if the cache must be indexed using some bits that may change upon address translation; see the address translation section in the cache article for more details about virtual addressing as it pertains to caches and TLBs.

## Implications for performance

The CPU has to access main memory for a:

• initial access
• instruction cache miss
• data cache miss
• TLB miss

The third case (the simplest case) is where the desired information itself actually is in a cache, but the information for virtual-to-physical translation is not in a TLB. These are all about equally slow, so a program "thrashing" the TLB will run just as poorly as one thrashing an instruction or data cache. That is why a well functioning TLB is important.

## Multiple TLBs

Similar to caches, TLBs may have multiple levels. CPUs can be (and nowadays usually are) built with multiple TLBs, for example a small "L1" TLB (potentially fully associative) that is extremely fast, and a larger "L2" TLB that is somewhat slower. When ITLB and DTLB are used, a CPU can have three (ITLB1, DTLB1, TLB2) or four TLBs.

For instance, Intel's Nehalem microarchitecture has a four-way set associative L1 DTLB with 64 entries for 4 KiB pages and 32 entries for 2/4 MiB pages, an L1 ITLB with 128 entries for 4 KiB pages using four-way associativity and 14 fully associative entries for 2/4 MiB pages (both parts of the ITLB divided statically between two threads)[1] and a unified 512-entry L2 TLB for 4 KiB pages,[2] both 4-way associative.[3]

Some TLBs may have separate sections for small pages and huge pages.

## TLB miss handling

Two schemes for handling TLB misses are commonly found in modern architectures:

• With hardware TLB management, the CPU automatically walks the page tables (using the CR3 register on x86 for instance) to see if there is a valid page table entry for the specified virtual address. If an entry exists, it is brought into the TLB and the TLB access is retried: this time the access will hit, and the program can proceed normally. If the CPU finds no valid entry for the virtual address in the page tables, it raises a page fault exception, which the operating system must handle. Handling page faults usually involves bringing the requested data into physical memory, setting up a page table entry to map the faulting virtual address to the correct physical address, and resuming the program (see Page fault for more details.) With a hardware-managed TLB, the format of the TLB entries is not visible to software, and can change from CPU to CPU without causing loss of compatibility for the programs.
• With software-managed TLBs, a TLB miss generates a "TLB miss" exception, and operating system code is responsible for walking the page tables and performing the translation in software. The operating system then loads the translation into the TLB and restarts the program from the instruction that caused the TLB miss. As with hardware TLB management, if the OS finds no valid translation in the page tables, a page fault has occurred, and the OS must handle it accordingly. Instruction sets of CPUs that have software-managed TLBs have instructions that allow loading entries into any slot in the TLB. The format of the TLB entry is defined as a part of the instruction set architecture (ISA).[4] The MIPS architecture specifies a software-managed TLB;[5] the SPARC V9 architecture allows an implementation of SPARC V9 to have no MMU, an MMU with a software-managed TLB, or an MMU with a hardware-managed TLB.,[6] and the UltraSPARC architecture specifies a software-managed TLB.[7]

The Itanium architecture provides an option of using either software or hardware managed TLBs.[8]

The Alpha architecture's TLB is managed in PALcode, rather than in the operating system. As the PALcode for a processor can be processor-specific and operating-system-specific, this allows different versions of PALcode to implement different page table formats for different operating systems, without requiring that the TLB format, and the instructions to control the TLB, to be specified by the architecture.[9]

## Typical TLB

• Size: 12 - 4,096 entries
• Hit time: 0.5 - 1 clock cycle
• Miss penalty: 10 - 100 clock cycles
• Miss rate: 0.01 - 1%

[10]

If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of

$1 \times 0.99 + (1 + 30) \times 0.01 = 1.30$

(1.30 clock cycles per memory access).

## Context switch

On a context switch, some TLB entries can become invalid, since the virtual-to-physical mapping is different. The simplest strategy to deal with this is to completely flush the TLB. Newer CPUs use more effective strategies marking which process an entry is for. This means that if a second process runs for only a short time and jumps back to a first process, it may still have valid entries, saving the time to reload them.

For example in the Alpha 21264, each TLB entry is tagged with an "address space number" (ASN), and only TLB entries with an ASN matching the current task are considered valid. Another example in the Intel Pentium Pro, the page global enable (PGE) flag in the register CR4 and the global (G) flag of a page-directory or page-table entry can be used to prevent frequently used pages from being automatically invalidated in the TLBs on a task switch or a load of register CR3.

While selective flushing of the TLB is an option in software managed TLBs, the only option in some hardware TLBs (for example, the TLB in the Intel 80386) is the complete flushing of the TLB on a context switch. Other hardware TLBs (for example, the TLB in the Intel 80486 and later x86 processors, and the TLB in ARM processors) allow the flushing of individual entries from the TLB indexed by virtual address.

## Virtualization and x86 TLB

With the advent of virtualization for server consolidation, a lot of effort has gone into making the x86 architecture easier to virtualize and to ensure better performance of virtual machines on x86 hardware.[11][12] In a long list of such changes to the x86 architecture, the TLB is the latest.

Normally, the entries in the x86 TLBs are not associated with any address space. Hence, every time there is a change in address space, such as a context switch, the entire TLB has to be flushed. Maintaining a tag which associates each TLB entry with an address space in software and comparing this tag during TLB lookup and TLB flush is very expensive, especially since the x86 TLB is designed to operate with very low latency and completely in hardware. In 2008, both Intel (Nehalem)[13] and AMD (SVM)[14] have introduced tags as part of the TLB entry and dedicated hardware which checks the tag during lookup. Even though these are not fully exploited, it is envisioned that in the future, these tags will identify the address space to which every TLB entry belongs. Thus a context switch will not result in the flushing of the TLB – but just changing the tag of the current address space to the tag of the address space of the new task.

## References

1. ^ "Inside Nehalem: Intel's Future Processor and System". Real World Technologies.
2. ^ "Intel Core i7 (Nehalem): Architecture By AMD?". Tom's Hardware. Retrieved 2010-11-24.
3. ^ "Inside Nehalem: Intel's Future Processor and System". Real World Technologies. Retrieved 2010-11-24.
4. ^ J. Smith and R. Nair. Virtual Machines: Versatile Platforms for Systems and Processes (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., 2005.
5. ^ Welsh, Matt. "MIPS r2000/r3000 Architecture". Retrieved 16 November 2008. "If no matching TLB entry is found, a TLB miss exception occurs"
6. ^ SPARC International, Inc. The SPARC Architecture Manual, Version 9. PTR Prentice Hall.
7. ^ Sun Microsystems. UltraSPARC Architecture 2005. Draft D0.9.2, 19 Jun 2008. Sun Microsystems.
8. ^ Virtual Memory in the IA-64 Kernel > Translation Lookaside Buffer
9. ^ Compaq Computer Corporation. Alpha Architecture Handbook. Version 4. Compaq Computer Corporation.
10. ^ Computer Organization And Design. Hardware/Software interface. 4th edition. Burlington, MA 01803, USA: Morgan Kaufmann Publishers. 2009. p. 503. ISBN 978-0-12-374493-7.
11. ^ D. Abramson, J. Jackson, S. Muthrasanallur, G. Neiger, G. Regnier, R. Sankaran, I. Schoinas, R. Uhlig, B. Vembu, and J. Wiegert. Intel Virtualization Technology for Directed I/O. Intel Technology Journal, 10(03):179–192.
12. ^ Advanced Micro Devices. AMD Secure Virtual Machine Architecture Reference Manual. Advanced Micro Devices, 2008.
13. ^ G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig. Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization. Intel Technology Journal, 10(3).
14. ^ Advanced Micro Devices. AMD Secure Virtual Machine Architecture Reference Manual. Advanced Micro Devices, 2008.