Talk:CPU cache

From Wikipedia, the free encyclopedia
CPU cache is a former featured article. Please see the links under Article milestones below for its original nomination page (for older articles, check the nomination archive) and why it was removed.
This article appeared on Wikipedia's Main Page as Today's featured article on January 7, 2005.
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale and as Mid-importance on the project's importance scale.
This Engtech article has been selected for Version 0.5 and subsequent release versions of Wikipedia. It has been rated C-Class on the assessment scale.

AMD Centric Article

I came here looking into why the Intel Core 2 Duo chips, with 64 kB L1 cache per core and 4 MB shared L2 cache, work so well compared to an AMD K8 with 1MB L1 cache. However, there is very little Intel based information here. Any chance the knowledgeable can make this article less AMD centric? --Mgillespie 10:43, 7 August 2006 (UTC)

The AMD K8 has 128 kB L1 cache and 1 MB L2 cache (maximum). -- Darklock (talk) 02:36, 16 March 2008 (UTC)
There is a reference to the Intel Pro; this article is not AMD centric, and it is not desirable to make it more Intel centric. Other pages are dedicated to specific architecture implementations.
However, the page needs more architecture references to illustrate the concepts. It would also need to deal better with real-time and embedded constraints. Market1G (talk) 17:39, 6 April 2010 (UTC)


>> Latency: The virtual address is available from the MMU some time, perhaps
>> a few cycles, after the physical address is available from the address generator.

Isn't this a mistake? The MMU translates into *physical* addresses. Therefore the *physical* address is available from the MMU some time, perhaps a few cycles, after the *virtual* address is presented to it.

-- agl

That was a mistake, and it was fixed a while ago. 05:22, 4 July 2006 (UTC)
>> Historically, the first hardware cache used in a computer
>> system did not cache the contents of main memory but rather
>> translations between virtual memory addresses and physical
>> addresses. This cache is known by the awkward acronym
>> Translation Lookaside Buffer (TLB).

This needs some clarification, as early computer systems did not have virtual memory, though they had instruction caches.

--Stephan Leclercq 08:50, 22 Jul 2004 (UTC)

The Burroughs B5000 and the Ferranti Atlas were the first computers with a virtual memory; neither had an instruction cache. Shmuel (Seymour J.) Metz Username:Chatul (talk) 17:49, 23 November 2010 (UTC)

Yep. There's a whole history to write here, of which I only know a little. I know that early Crays had essentially a one-line cache. I have read that there were two IBM 360 projects developed simultaneously. One was the famous "Stretch", and the other was a simpler machine which had a cache. The simple one was somehow better.

Stretch was designed in the 1950s; the IBM System/360 was designed in the 1960s. IBM had stopped taking new orders for Stretch well before they announced the S/360. The machines aren't remotely similar, although IBM did cannibalize a lot of technology from Stretch for use in the 7000 series. Shmuel (Seymour J.) Metz Username:Chatul (talk) 17:49, 23 November 2010 (UTC)

I have read that TLBs predated data caches, but I have not yet tracked down an authoritative source. Perhaps I should remove that comment until I do.

Iain McClatchie 20:54, 22 Jul 2004 (UTC)

I know that the CDC Cyber (designed by Seymour Cray) had an 8-word instruction cache that contained the last 8 words executed, and was cleared at every jump instruction that did not land in the cache. Looks like nothing, but the cache sped up tight loops by a factor of 6-10...

Hope it helps ... --Stephan Leclercq 22:41, 22 Jul 2004 (UTC)

I would enjoy a history section. One tidbit I enjoyed is that the MC68010 CPU (which I believe found its widest use in the original LaserWriter printer) had an instruction cache big enough for exactly 2 instructions, which was just enough for a big memory-move loop of one MOVE and one DBRA instruction. Tempshill 04:39, 7 Jan 2005 (UTC)

I can say with a fair amount of confidence that TLBs predated caches (unless you call a TLB a cache, of course, which I don't). The first commercial computer with a cache (more or less as we think of it today) was the System/360 Model 85, announced in 1968 and delivered the following year. The 360 Model 67 had no cache, but did have an 8-entry TLB; it was delivered in May of 1966. I believe that at least 2 earlier machines also had TLBs: the Multics hardware, and the Atlas.

The CDC 6600 (1964, predating all the CDC Cyber machines) had an 8-word instruction stack, which could be used to contain a 7 word loop, which might have as many as 27 instructions. The words had to be from consecutive memory locations. The 7600 (1968) had a 12 word stack whose contents did not have to be from consecutive locations.

Lastly, the Stretch vs. System/360 story recounted above doesn't ring true, at least as told. Stretch was delivered to customers before System/360 was much more than a gleam in anyone's eye. Capek 07:20, 10 Jan 2005 (UTC)

In fact it would be worth adding read/pre-fetch and write-back buffers, which are a kind of 1-entry cache most often associated with proper caches —Preceding unsigned comment added by Market1G (talkcontribs) 17:43, 6 April 2010 (UTC)
Stretch was designed in the 1950s and the IBM System/360 was designed in the 1960s. However, the first delivery of Stretch was only a few years before IBM announced the S/360 and overlapped the writing of the SPREAD report.
Shmuel (Seymour J.) Metz Username:Chatul (talk) 17:49, 23 November 2010 (UTC)

Inclusion property

We need two expressions for inclusive cache hierarchies because implementations do not necessarily enforce the inclusion property. IIRC x86 implementations generally do not. When contents of L1 are not guaranteed to be backed by L2, L2 snoop misses do not imply L1 misses even though the hierarchy is generally labeled as inclusive. Guaranteeing inclusion, however, may have adverse effects on associativity: backing two n-way L1 caches by a direct mapped L2 cache (Alpha EV6?) significantly restricts L1 associativity.

Why is that? i.e. Why does it significantly restrict L1 associativity? Isn't that only if the L2 is small? —Preceding unsigned comment added by (talk) 07:18, 27 February 2010 (UTC)

A.kaiser 09:31, 25 Sep 2004 (UTC)

The K7 and K8 L2/L1 designs obviously are not inclusive, but rather exclusive. My current understanding is that the P3 and P4 designs are inclusive, so that bus snoops check only the L2 tag. Can you point to any evidence to the contrary?

The adverse effects on associativity from the inclusion guarantee is an excellent point and should be added to the page somewhere.

Iain McClatchie 07:01, 26 Sep 2004 (UTC)

Intel Optimization guide on P-M and both P4s: "Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does not imply that it is also in level i+1."

Since the P-M shares much of its microarchitecture with the P3, I expect the P3 to be similar.

A.kaiser 12:34, 26 Sep 2004 (UTC)

That's good evidence. I'll go think about what that means and how to talk about it. Unless you'd like to hack the article, in which case, please go ahead. I might get to it in a week or so if you don't.

It does seem like the right hierarchy isn't exclusive vs. inclusive, with inclusive broken into "really inclusive" and "not actually inclusive". I think I'm seeing three completely different categories: inclusive, exclusive, and "serial". I'm making up that last name, because I don't know what it's formally called in the literature.

Iain McClatchie 19:51, 27 Sep 2004 (UTC)

It looks like the article still doesn't explain this well. In my experience, you have at least three different kinds of inclusivity possibilities:

  1. inclusive: when a higher-level cache (L1) allocates, you allocate also. When you evict, you also back-invalidate the higher-level cache. No special requirements for when the higher level evicts, or when you allocate of your own volition.
  2. exclusive: when a higher-level cache allocates, you evict (possibly allocating in its spot what the higher level evicted). When you allocate, you back-invalidate the higher-level cache. No special requirements on when the higher level evicts (although you could choose to allocate), or when you evict of your own volition.
  3. pseudo-inclusive (I made that up a few years ago, but don't think it's gotten very widespread use anywhere): this is what is done on the L2 caches that I am most familiar with (Freescale/Motorola): when a higher-level cache allocates, you allocate as well. All other actions do not have strict requirements (L1 evict, L2 alloc, L2 evict). In particular, when you evict, you do not back-invalidate (this is the main difference from true inclusive). You start out as inclusive, but don't maintain inclusivity via back-invalidates. This allows you to make better use of your L2 cache (especially if it has a smaller associativity than your L1s, or has very different set size), at the expense of requiring all snoops to go to the L2 and the L1. It also means that on a non-dirty L1 eviction, you don't need to explicitly cast out to the L2 -- it likely still has a copy from when you allocated.
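To make the three policies concrete, here is a toy sketch in Python (the class, the set-based model, and all names are invented for illustration; real caches track sets, ways, dirty bits, and coherence states):

```python
class TwoLevel:
    """Toy two-level hierarchy; each level is just a set of line addresses."""
    def __init__(self, policy):
        self.l1, self.l2 = set(), set()
        self.policy = policy  # "inclusive", "exclusive", or "pseudo-inclusive"

    def l1_allocate(self, line):
        self.l1.add(line)
        if self.policy in ("inclusive", "pseudo-inclusive"):
            self.l2.add(line)      # allocating in L1 also allocates in L2
        else:                      # exclusive: the line lives in exactly one level
            self.l2.discard(line)

    def l2_evict(self, line):
        self.l2.discard(line)
        if self.policy == "inclusive":
            self.l1.discard(line)  # back-invalidate keeps L1 a subset of L2
        # pseudo-inclusive: no back-invalidate, so inclusion can decay over time

    def snoop_needs_l1_check(self, line):
        # Only strict inclusion lets an L2 snoop miss prove an L1 miss.
        if self.policy == "inclusive":
            return line in self.l2
        return True
```

With the pseudo-inclusive policy, evicting a line from L2 leaves it valid in L1, which is exactly why every snoop must still check both levels.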

There are likely many possibilities between full inclusive and full exclusive, but most of Freescale's L2 caches have been what I'm calling pseudo-inclusive.

Actually, now that I think about it, the MPC7400 was a true victim L2: when the L1 evicts, allocate. That's it. The MPC7450/e600 and e500 are pseudo-inclusive: when the L1 allocates, you allocate. That's it.

Now I'm wondering if we could merge the concept of victim with inclusion, and show that victim caches and inclusion properties are special cases of the more general allocation/eviction policy as it concerns two or more levels of caching. That would be more of a sweeping change....

I can try to write up something about this a little more carefully than the above, if y'all think it's a Good Idea. I'd definitely want feedback before I drastically alter the page.

BGrayson 14:50, 27 April 2007 (UTC)

K8 Caching diagram

The diagram of the K8 cache hierarchy is misleading. While the TLBs are caches, they are not filled from "normal" memory as the icache and dcache are, but are filled by the OS from page tables. Dyl 07:40, Dec 24, 2004 (UTC)

I was under the impression the P5, P6, P4, K7, and K8 all have hardware page table walkers. Is this not correct?

Also, can you be more specific about how the diagrams are misleading? The icache and dcache cache main memory, and the TLBs cache the page tables (which are in main memory). If the TLBs are not filled by a hardware table walker, then I agree there should be some distinction made between the hardware and software fill paths on the diagram. Iain McClatchie 09:09, 25 Dec 2004 (UTC)

I believe all x86 implementations have hardware page table walkers. It's generally a per-instruction-set type of thing rather than per-implementation, since it has software consequences. --CTho 01:24, 23 December 2005 (UTC)

Can you elaborate? I don't understand. --Patrik Hägglund 07:33, 21 September 2006 (UTC)
If you don't provide hardware table-walk, then you need some kind of special instructions in your instruction set that enable management of the TLB. However, the two are not orthogonal: the Power Architecture (and I'm sure many others) provides TLB management instructions for those that use software tablewalk (SWTW), even though many implementations have supported hardware tablewalk (HWTW) also. Some OS people like HWTW because it's fast, but others don't like it because it constrains how they manage virtual memory -- they can't provide their own preferred TLB replacement algorithms, or have page table entries with extra information, because those details are constrained by the HWTW algorithm. BGrayson 15:49, 27 April 2007 (UTC)

I think that the K8 diagram and its text was a very informative example. Thanks! However, I want to know more about, for example, how the load-store unit is connected to the caches. In AMD's "BIOS and Kernel Developer's Guide", Miss Address Buffers (MABs) and a Page Directory Cache (PDC) are mentioned. How do they fit into the picture?

How are the L1 and L2 caches indexed and tagged? Reading the text about address translation, I assume that the L1 caches use virtual indexing and physical tagging with vhints, and the L2 cache uses physical indexing and physical tagging. Is that correct? --Patrik Hägglund 07:33, 21 September 2006 (UTC)

In general, not all L1 and L2 caches are virtually-indexed -- this article seems too slanted in that direction. For example, Freescale and IBM PowerPC chips have always used physically indexed and physically tagged L1 caches, without sacrificing L1 cache latency. BGrayson

Stalls, rewording

The article does not mention stalls, which are what happens to program execution when a cache miss occurs. That is where the penalty is ultimately felt, because the program executes more slowly.


Also the article needs rewording. I'm a software developer, and still had a great deal of difficulty trying to follow along with the article. I wouldn't think it would be very useful to a lay-person in this state. It contains lots of good information; the sentences are just hard to follow.

Very disappointing to hear. If you can say anything about where you were having trouble following along, it might help me fix the article. Iain McClatchie 01:28, 7 Jan 2005 (UTC)

Better organization?

Just an opinion, but this article might benefit from a reorganization along these lines:

  • Intro
  • Why cache is necessary/important
  • How cache works
  • Design - I think this is the most important reorganization, since the concept of cache design is somewhat convoluted in this article: it should clearly state 1. why a design choice/option exists, 2. what problem this design choice solves, and 3. how this design choice solves the problem
    • How researchers analyze cache performance
    • Areas where current research is focused on, in regards to cache design
  • Implementation (how design concepts are implemented on CPUs) - stuff like address translation should go here, imo
  • History (discusses how cache design has evolved along with CPU development)--Confuzion 02:02, 7 Jan 2005 (UTC)

Organize this AND dumb the wording down please! Not everyone knows all the terminology of processors.

It strikes me that this page is a victim of its own success. Not all of this is specific to the CPU cache and much text duplicates the Cache page. I understand the need for completeness and narrative, but this page could benevolently improve the cache page yet retain narrative and be more specific to what it is. Stuff like address translation would then work well here, as that is quite specific to CPU caches. Notwithstanding of course that address translation has a disambiguation page that does not reference address translation in the context of CPU caches. 22:16, 7 May 2007 (UTC)

Working sets

The phrase "working set" doesn't appear in this article at all, which I think is a fairly major omission. I must sleep now, or else I'd add it right now. A quick Google search shows that most people consider a "working set" to refer to memory pages, but my understanding is that the concept also applies to cache lines. --Doradus 05:26, Jan 7, 2005 (UTC)

I'd like to stay away from adding "working sets" into this article.

Working sets are generally attributed to the use of main memory by processes in a multiprocessing virtual memory system. The set size matters because the operating system can allocate more memory to one process and less to another. There is some similarity to the hit rates versus size that characterize caches, but folks have found hit rate rather than working set to be a more useful concept for hardware caches with fixed sizes. Iain McClatchie 06:13, 4 July 2006 (UTC)

indexing vs tagging

Virtually indexed and/or tagged caches. What is the difference between indexing and tagging? 14:32, 7 Jan 2005 (UTC)

The index bits are *not* stored in the cache. The index bits are (typically) the "middle bits" of the effective address. The index bits select a particular row of the cache.

The tag bits *are* stored in the cache. The tag bits are (typically) the "high bits" of the effective address. After a particular row of the cache is selected, the cache memory sends out all the bits on that row (including the tag bits). If the tag bits that come out exactly match the "high bits" of the address we are trying to look up, we have a hit.

The block offset is the "low bits" of the effective address. When we have a hit, we use the block offset to select a particular word of the block of data from that row of the cache.
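One way to keep them straight: the index is computed from the address and used to select a row, while the tag is stored in the row and compared. A small sketch (Python, with made-up parameters of 64-byte lines and 128 rows in a direct-mapped cache):

```python
LINE_BYTES = 64   # block offset: low 6 bits of the address
NUM_ROWS   = 128  # index: next 7 bits
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 6
INDEX_BITS  = NUM_ROWS.bit_length() - 1     # 7

def split(addr):
    offset = addr & (LINE_BYTES - 1)                  # selects a byte within the line
    index  = (addr >> OFFSET_BITS) & (NUM_ROWS - 1)   # selects a row; never stored
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)       # stored in the row, compared on lookup
    return tag, index, offset

def lookup(tags, addr):
    """Hit iff the tag stored in the indexed row matches the address's tag bits."""
    tag, index, _offset = split(addr)
    return tags[index] == tag
```

The tag SRAM stores only `tag`; `index` and `offset` are consumed by the addressing logic and never stored anywhere in the cache.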

Both indexing and tagging have something to do with "address bits" -- how can we write this article to avoid confusing them? How can we make it more clear in this article? --DavidCary (talk) 15:41, 8 December 2011 (UTC)


This part is incomprehensible (to me):

Because cache reads are the most common operation that take more than a single cycle, the recurrence from a load instruction to an instruction dependent on that load tends to be the most critical path in well-designed processors, so that data on this path wastes the least amount of time waiting for the clock. As a result, the level-1 cache is the most latency sensitive block on the chip.

-- 14:54, 7 Jan 2005 (UTC)

I think the point being made is that most ALU operations complete in one clock cycle and are probably the most common instructions. Beyond ALU operations and JMPs, reading/writing memory is probably the next most common/useful operation. When doing a bunch of memory reads, the CPU will probably fill the cache with what's in RAM at those locations, so most of your memory reads are going to be from cache. So the cache reads are extremely common, but take multiple clock cycles because they may need to transparently read from RAM and fill into cache, and from there it may still take multiple clock cycles just to read data from cache. The more clocks it takes to read from cache, the slower you are going to operate on your data. Since a majority of memory reads are actually from cache, the time required to read from cache has more impact than how quickly cache can fill from RAM. Also, if it takes an extra clock cycle to read from cache, any typical operation on data will require an additional clock cycle. Rmcii 02:37, 5 May 2006 (UTC)
I wrote this paragraph, and I've just cut out most of it. I was attempting to convey too much insight, and the troubles were many: data caches are often NOT the critical path, due to all sorts of practical difficulties; understanding this paragraph required some understanding of synchronous systems; and finally, it was only really necessary to motivate the following description of the implementation. So I just said that folks try hard to make caches go fast, and left it at that. Iain McClatchie 06:06, 4 July 2006 (UTC)

This section also confuses me:

The diagram to the right shows two memories. Each location in each memory has a datum (a cache line), which in different designs ranges in size from 8 to 512 bytes. The size of the cache line is usually larger than the size of the usual access, which ranges from 1 to 16 bytes.

"Each location" has a "datum" == "cache line" == "between 8 and 512 bytes in size"? And between the CPU, the cache, the main memory, and all the kinds of things they contain, what exactly does a "usual access" mean? --Piet Delport 11:22, 11 April 2006 (UTC)

I think the point being made is that most caches allow storage of more than the typical datum size. In x86, memory is accessed 1 byte at a time, and there's support for up to 8 byte (qwords) registers (expanded by MMX/SSE), so you're not likely to find a cache with a size on the order of 8 bytes. I think DDR/DDR2 supports streaming an entire row to the memory controller. If you're going to stream a row, you need a cache large enough to hold it or else there's no benefit from its use. Rmcii 02:37, 5 May 2006 (UTC)
The "usual access" is the usual access from a CPU instruction. On a 32-bit CPU, this is usually a 32-bit access, but sometimes it's 64 or 128 bits. It is very unusual for a cache request to be larger than that. CPUs may have various bus widths throughout the design which have little relationship to the size of these accesses. I've updated the article, please let me know if it's more understandable. Iain McClatchie 06:06, 4 July 2006 (UTC)

By the way, I'd just like to point out to anyone frustrated by the article that these two feedback comments were quite valuable to me. Once resolved, I think they will have helped improve the clarity of the article, and I appreciate that. Iain McClatchie 06:09, 4 July 2006 (UTC)

Request for references

Hi, I am working to encourage implementation of the goals of the Wikipedia:Verifiability policy. Part of that is to make sure articles cite their sources. This is particularly important for featured articles, since they are a prominent part of Wikipedia. The Fact and Reference Check Project has more information. Thank you, and please leave me a message when a few references have been added to the article. - Taxman 19:31, Apr 22, 2005 (UTC)

Is it ok to use references like which require accounts to access them? --CTho 01:29, 23 December 2005 (UTC)

Yes, it *is* OK to use inline references that require paid accounts to access. The WP:EL#Sites_requiring_registration guideline clearly states "A site that requires registration or a subscription should not be linked unless the web site itself is the topic of the article or is being used as an inline reference."
If you got information from it, WP:SAYWHEREYOUGOTIT -- it doesn't matter if other people can get it for free or if it requires a paid account to access it. -- (talk) 02:05, 26 April 2009 (UTC)
I found the following articles which may benefit the authors of this article:
Whetham, Benjamin. (5/9/00), "Theories about modern cpu cache". Retrieved: 31st May 2007 From:
The Computer Language Co. Inc., (1999), "Cache". Retrieved: 31st May 2007 From:
Alan Jay Smith. (August, 1987). "Design of cpu cache memories". Retrieved: 31st May 2007 From:
Jupitermedia. (16/09/04). "Cache". HardwareCentral. Retrieved: 31st May 2007 From:
PantherProducts. (2006). "Central processing unit cache memory". Retrieved: 31st May 2007 From:
Slowbro 03:54, 31 May 2007 (UTC)

does address translation really belong here?

It seems to me that much of this section should be moved to the virtual memory article (or removed, if it is redundant) --CTho 01:34, 23 December 2005 (UTC)

Perhaps the design section of this article is not filled out enough. Address translation fundamentally affects cache design: virtual vs physical tagging/indexing, virtual hints, and virtual aliasing can only be explained in the context of address translation.
As a separate issue, address translation is performed by TLBs. Many common implementations of TLBs are, in a broad but useful sense, caches of the page tables in memory. I think this is a useful similarity to present. Iain McClatchie 05:36, 4 July 2006 (UTC)

Clarification for increasing associativity vs increasing cache size

This sentence is not clear: "The rule of thumb is that doubling the associativity has about the same effect on hit rate as doubling the cache size, from 1-way (direct mapped) to 4-way." Is the associativity doubling from 1-way to 4-way ? Isn't that quadrupling ? Does the same apply for doubling from 4-way to 8-way ? Beside clarification, I think this deserves further explanation, perhaps with examples - e.g. cache sizes and associativies for Athlon vs P6, P4 and Core, etc.

Attempted a fix. Please let me know if it's better/understandable now. Iain McClatchie 05:32, 4 July 2006 (UTC)
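For what it's worth, the effect can be seen in a toy simulation (Python; the sizes, the LRU model, and the access trace are all made up for illustration): two hot lines that collide in a direct-mapped cache thrash it, while a 2-way cache of the same total capacity holds both.

```python
def misses(num_sets, ways, trace):
    """Count misses for an LRU set-associative cache over a trace of line numbers."""
    sets = [[] for _ in range(num_sets)]   # each set: tags in MRU-first order
    miss = 0
    for line in trace:
        s, tag = line % num_sets, line // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)            # hit: will reinsert at MRU position
        else:
            miss += 1
            if len(sets[s]) == ways:
                sets[s].pop()              # evict the least recently used tag
        sets[s].insert(0, tag)
    return miss

trace = [0, 8, 0, 8, 0, 8]                   # lines 0 and 8 collide when num_sets = 8
direct  = misses(num_sets=8, ways=1, trace=trace)  # thrashes: every access misses
two_way = misses(num_sets=4, ways=2, trace=trace)  # same capacity: only compulsory misses
```

Here the direct-mapped cache takes 6 misses on the trace and the 2-way cache of equal capacity takes 2, which is the kind of effect the rule of thumb is summarizing.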

Trace cache history

A discussion of "first proposed" should take into account Alex Peleg and Uri Weiser, "Dynamic flow instruction cache memory organized around trace segments independent of virtual address line," US Patent 5,381,533 (filed in 1994, which continued an application filed in 1992; granted to Intel 1995).--M.smotherman 13:42, 23 June 2006 (UTC)


There was an apparent merger of the L1, L2, and L3 cache articles into this one. I would like it if there were sections depicting each, or at least a section that explains them.


Could we have a guide to pronunciation? I have heard it pronounced "cash", "catch" and "cashay" before; which one is right? —Preceding unsigned comment added by (talk) 20:35, 8 September 2007 (UTC)

The main article (Cache) contains a transcription of this word, so I don't think it is necessary to include it in this article. If you are really interested, you can read and listen to the pronunciation of this word here. Dan Kruchinin 03:13, 21 October 2007 (UTC)

Recent edit

This edit seems generally good, but I don't like "the more economically viable solution has been found: ", because von Neumann's original paper proposed a hierarchy of memories. The solution was "found" before the first machine was even built. --CTho 13:26, 7 November 2007 (UTC)

Yes, my bad, please WP:SOFIXIT next time. --Kubanczyk 15:05, 7 November 2007 (UTC)

History. 1970 vs 1980

In the history section I pointed out that the performance gap between processor and memory has been growing since 1980. But in this edit that year was changed to 1970. When I wrote about it I used "Computer Architecture: A Quantitative Approach" (ISBN 1-558-60596-7) by John L. Hennessy as a source of information. On page 289 he says a bit about cache history; there he writes that 1980 was the starting point of the growing processor-memory performance gap. Also, the same information can be found here in "The Processor-Memory Performance Gap" section. Dan Kruchinin 03:45, 8 November 2007 (UTC)

Yes, my bad, please WP:SOFIXIT and provide those refs in the normal way :)) --Kubanczyk 08:09, 8 November 2007 (UTC)


I find it confusing that the same word (index) denotes both tags in the Tag SRAM and words in the Data SRAM. Index often denotes the part of address used for selecting the whole cache line (Addr[10:6]), which is not the same as the part used for addressing the Data SRAM (Addr[10:2]) as shown in the image.

Usually people draw the index field connected to a decoder which selects the line. The relevant portion of the line is finally extracted by an additional decoder, which is addressed by the offset field of the address.

The detached organization in the image is also fine, but the words "index" in each line seem redundant and confusing.

Perhaps you could label the Data SRAM entries as word 0, word 1, etc., and the Tag SRAM entries as tag 0, tag 1, etc.?

Victim cache section seems wrong

There were some questionable claims in the victim cache section. I've fixed some of them, but it needs some additional work.

I'll try and read up some papers to see and add some information, but that will likely take some time. In the meantime if there are some experts who know this well enough, please contribute.

Pramod 10:28, 27 December 2008 (UTC) —Preceding unsigned comment added by Pramod.s (talkcontribs)

Note: it has been observed that a faulty L2 cache will prevent Windows XP systems from booting unless the cache is manually disabled in the BIOS. Doing so, however, will severely reduce overall system performance.-- (talk) 14:19, 12 July 2009 (UTC)

low power cache

As far as I know, the people designing caches generally ignored the amount of energy consumed by the cache until fairly recently. And so it is understandable that, until recently, this article has said nothing about low power cache.

I think this article should say something about current research in CPU caches. In particular, I think this article should say something about research on low power caches.

I attempted to add a couple of sentences about research on low power caches, but they were deleted a minute later. -- (talk) 06:21, 13 December 2009 (UTC)

I'm reverting that delete. I hope this doesn't ignite a huge edit war. Feel free to replace my text with a better description of current research in CPU caches. -- (talk) 03:06, 30 December 2009 (UTC)

In fact, it would be worth a special page on optimizing CPU consumption! On the one hand, fast caches consume a lot of power, but on the other hand, the memory hierarchy and cache efficiency drastically reduce power for a given CPU throughput. Market1G (talk) 20:03, 6 April 2010 (UTC)
Yes, I would like an article dedicated to techniques for designing CPUs with improved (reduced) CPU power consumption.
Since Google pointed out the importance of power consumption in their servers, I think that article should not be limited to CPUs for laptops.
There are already a few articles that briefly mention in passing part of the information such an article would have: CPU design, low-power electronics, performance per watt, power management, CPU power dissipation, and this CPU cache article.
Is it possible to piece together a first rough draft entirely from the information in those articles? -- (talk) 05:07, 7 September 2010 (UTC)

ways and sets

In the example describing ways and sets, the number of ways and sets is the same. This might lead one to believe that ways and sets are the same things, which I think is wrong. Something needs to be done to clarify the difference between a way and a set. —Preceding unsigned comment added by Skysong263 (talkcontribs) 02:34, 2 January 2010 (UTC)

Is way prediction the same as pseudo-associativity?

Is way prediction the same as pseudo-associativity? -- (talk) 09:16, 5 February 2010 (UTC)

While direct cache and normal (parallel) n-way set associative cache always respond in a fixed amount of time on a hit, caches that use "way prediction" or (serial) "pseudo associativity" respond in two different amounts of time (the fast hit time, and the slow hit time).

I was under the impression that they were two slightly different techniques: My understanding was that:

  • Starting from a direct-mapped cache, switching to n-way pseudo associativity keeps the same number of tag comparators (1), but reduces the number of cache misses to the same as n-way set associative. Instead of using that single comparator once and giving up if it doesn't match, the comparator is re-used over several cycles to check n other cache lines. The "fast hit time" (a match on the first compare) is about the same as the original direct-mapped cache, but the "slow hit time" is much slower -- but those "slow hits" *would* have been a miss in the original direct mapped cache, so overall it's faster.
  • Starting from some (parallel) n-way set associative cache, adding "way prediction" keeps the same number of tag comparators (n) and the same number of cache misses. Instead of waiting for one comparator (or possibly no comparators, on a miss) to announce a hit, and then choosing the data line associated with that comparator to feed into the CPU, we pick the data associated with some comparator and preemptively feed that into the CPU and start speculatively executing based on that data. If we are lucky and guessed right, the "fast hit time" is a few cycles shorter. Hopefully we guess intelligently enough to save cycles on most reads, but no matter how badly we guess, the "slow hit time" is no slower than the original n-way set associative cache without way prediction.
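A sketch of the serial pseudo-associative lookup as I understand it (Python; the cycle counts and the rehash function are made up for illustration, not taken from any real design):

```python
FAST, SLOW, MISS = 1, 3, 20   # hypothetical cycle counts for the three outcomes

def pseudo_assoc_lookup(tags, index_bits, addr):
    """Probe a direct-mapped tag array twice: primary index, then a rehashed one."""
    index = addr & ((1 << index_bits) - 1)
    tag = addr >> index_bits
    if tags[index] == tag:
        return "fast hit", FAST               # same latency as plain direct-mapped
    alt = index ^ (1 << (index_bits - 1))     # example rehash: flip the top index bit
    if tags[alt] == tag:
        return "slow hit", SLOW               # would have missed in plain direct-mapped
    return "miss", MISS
```

The "slow hit" path is the whole point: it trades a second serial probe for the conflict misses a direct-mapped cache would otherwise take.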

Alas, a quick search to refresh my memory gave me a reference ([1]) that implies that they are basically the same.

Could someone update the article to add some information on way prediction? -- (talk) 21:56, 19 January 2011 (UTC)

Technical sections[edit]

I am having a hard time understanding the Structure and Associativity sections.

Structure is overly detailed. I'm skeptical of its general application to cache architectures. I'm tempted to delete the section.

Associativity launches into an explanation of how associativity works without first explaining what associativity is. I'm not familiar enough with the concept to write an introductory paragraph myself. --Kvng (talk) 19:25, 29 March 2010 (UTC)

You are right that this needs a better explanation and introduction, but please do not delete it.
Also, the link for reference [2] is broken, and the last paragraph of the Associativity section is missing text.

Market1G (talk) 19:50, 6 April 2010 (UTC)

The section on Associativity was mangled in this edit. — Aluvus t/c 00:28, 7 April 2010 (UTC)

Details of operation to be clarified[edit]

>> If data are written to the cache, they must at some point be written to main memory as well.
>> The timing of this write is controlled by what is known as the write policy.
>> In a write-through cache, every write to the cache causes a write to main memory.
>> Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to the main memory.
>> Instead, the cache tracks which locations have been written over (these locations are marked dirty).
>> The data in these locations are written back to the main memory when that data is evicted from the cache.
>> For this reason, a miss in a write-back cache may sometimes require two memory accesses to service:
>> one to first write the dirty location to memory and then another to read the new location from memory.
1. The last bit about two memory accesses, a write followed by a read, is hard to understand; neither the reasons nor the implications are made clear. Should this be clarified or deleted? If clarified, it should rather be moved to a section dealing with details such as prefetch, bypass, and write buffers.
2. To me this section sounds more like an overview than 'Details of operation'; can the title be changed?

Market1G (talk) 19:07, 6 April 2010 (UTC)

I thought I understood what it was saying, although as always the wording could be improved.
Every cache is either a write-through cache, or a write-back cache.
In a write-through cache, no dirty data is ever stored in the cache -- so a miss on a read requires (in the worst case) 1 memory access: data read from RAM into cache.
In a write-back cache, a miss on a read may require (in the worst case) 2 memory accesses:
In the worst case, the cache line chosen to hold the about-to-be-read data may be marked dirty (and the write buffer, if any, may already be full). So the dirty data in the cache must be written to RAM (or some data in the write buffer must be written to RAM, and the dirty cache data then pushed out to the write buffer).
Only after there is a place to put the desired data, then that data can be read from RAM into cache.
If the data were read from RAM before there was a place for it, then it would overwrite some piece of dirty data and that dirty data would be incorrectly lost.
Is there a better way for the article to explain this? -- (talk) 05:24, 7 September 2010 (UTC)
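The worst-case sequence described above can be sketched as a toy Python model of a direct-mapped write-back cache (all names invented; write buffers are ignored beyond the basic dirty-line write-back):

```python
# Toy model (invented names) of a read miss in a write-back cache.
# A dirty victim line forces a write to RAM before the requested line
# can be read in -- two memory accesses total.

def read_with_writeback(cache, ram, index, tag):
    """Return (data, memory_accesses) for a read of (tag, index)."""
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return line["data"], 0                      # hit: no memory traffic
    accesses = 0
    if line["valid"] and line["dirty"]:
        ram[(line["tag"], index)] = line["data"]    # 1st access: write back
        accesses += 1
    line.update(tag=tag, valid=True, dirty=False,
                data=ram.get((tag, index)))         # 2nd access: line fill
    return line["data"], accesses + 1
```

A read that evicts a clean line costs one access; only the dirty-victim case pays the second, which is exactly the point the quoted passage is making.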

Cache entry structure[edit]

This section seems to imply that index and displacement fields are stored in the cache, as opposed to being used to address entries within the cache:

Cache row entries usually have the following structure:

Data blocks Tag Index Displacement Valid bit

Unless this has changed since years ago when I thought I understood how caches work, then it seems this should be:

Cache row entries usually have the following content:

Data blocks Tag Valid bit

Cache row entries are usually addressed by:

Index

Data blocks within each cache row entry are addressed by:

Displacement
The data blocks (cache line) contain the actual data fetched from the main memory. The memory address (physical or virtual) is split (MSB to LSB) into a tag, an index and a displacement (offset), while the valid bit denotes that this particular entry has valid data. The index length is \lceil \log_2(cache\_rows) \rceil bits and describes which row the data has been put in. The displacement length is \lceil \log_2(data\_blocks) \rceil bits and specifies which of the stored blocks is needed. The tag length is address_length - index_length - displacement_length, and the tag contains the most significant bits of the address. The tag from the address is compared to the tag stored in the row(s) addressed by the index to see whether a row contains valid data for that address (a match with the valid bit set) or does not (a non-match, or the valid bit not set).

For a 2-way cache there are two sets of cache row entries (requiring 2 tag comparators); for a 4-way cache, there are 4 sets of cache row entries (requiring 4 tag comparators). For a fully associative cache, the index is not used; instead, the tag includes all address bits except the displacement, and the equivalent of a content-addressable memory (requiring n tag comparators, where n is the number of cache row entries) is used.

Jeffareid (talk) 17:07, 22 April 2010 (UTC)
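The field widths described above can be checked with a short Python sketch (parameter names are invented for illustration, and a power-of-two geometry is assumed so the ceilings are exact):

```python
from math import log2

def split_address(addr, cache_rows, blocks_per_row):
    """Split an address (MSB..LSB) into tag | index | displacement, with
    widths as in the text: index = log2(cache_rows) bits, displacement =
    log2(data_blocks) bits, tag = the remaining high-order bits."""
    index_bits = int(log2(cache_rows))
    disp_bits = int(log2(blocks_per_row))
    displacement = addr & ((1 << disp_bits) - 1)
    index = (addr >> disp_bits) & ((1 << index_bits) - 1)
    tag = addr >> (disp_bits + index_bits)
    return tag, index, displacement

# Example: 256 rows -> 8 index bits; 16 blocks per row -> 4 displacement bits.
tag, index, disp = split_address(0xDEADBEEF, 256, 16)
```

Reassembling (tag << 12) | (index << 4) | disp gives back the original address, which is the sense in which index and displacement "address" entries rather than being stored in them.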

I agree -- this looks like a misguided combination of all the parts of a virtual address plus all the parts of a cache line.
I tried to fix it -- did I get those two ideas properly separated? -- (talk) 05:12, 20 September 2010 (UTC)
I'm just learning about this stuff myself, but the textbook I'm using (Computer Architecture: A Quantitative Approach, 4th Ed., by Hennessy and Patterson) calls the displacement field the "block offset," and says the offset field specifies the minimum addressable unit within a block (not which block). Several quick searches on the internet turn up the same information. Could someone confirm this and correct the article? I would, but I'm not completely sure I'm right. — Preceding unsigned comment added by (talk) 09:19, 9 October 2011 (UTC)

I would rephrase to:
The index describes which row the data has been put in. The index length is \lceil \log_2(cache\_rows) \rceil bits. The displacement (offset) specifies which block of the stored data blocks from the cache line is needed. The displacement length is \lceil \log_2(data\_blocks) \rceil bits.

BigEndian77 (talk) 17:03, 30 October 2011 (UTC)

Dear BigEndian77, I agree entirely. I would have added your improved phrasing to the article, but I see you were WP:BOLD and already went ahead -- good job. --DavidCary (talk) 14:44, 20 December 2011 (UTC)
Dear, I agree entirely, so I changed every mention of "displacement" to "block offset", with the appropriate H&P reference, and added a brief footnote on the "displacement" terminology. --DavidCary (talk) 14:44, 20 December 2011 (UTC)


What is the L1 cache? This article is a bit ambiguous. A typical CPU is directly connected to one instruction cache and one data cache (similar to a Harvard architecture). The main memory and all other cache levels, if any, between main memory and the instruction cache are all unified and contain both instruction and data (similar to a Princeton architecture). Is the first unified cache "under" the instruction cache the L1 cache? Or is the instruction cache a L1 cache, and the first unified cache "under" the instruction cache the L2 cache? -- (talk) 04:41, 20 September 2010 (UTC)

Ahistorical historical note[edit]

The very first virtual memory machine, the Ferranti Atlas, was not very slow; in fact, it was one of the fastest computers of its day. Nor did it have a page table (held in main memory); it had an associative (content addressable) memory with one entry for every 512 word block. Shmuel (Seymour J.) Metz Username:Chatul (talk) 23:27, 22 November 2010 (UTC)

Dispute sequence of events for paging[edit]

If instruction prefetch buffers such as those in the IBM 7030 (Stretch), CDC 6600 and S/360 Model 91[1] are considered to be caches, then the TLB was not the first use of a cache. Note that a loop within the instruction stack did not refetch the instructions from main memory.

There was no semiconductor memory on the early computers, other than registers. Main memory used a variety of technologies, including delay lines, drums and, most often, core. Shmuel (Seymour J.) Metz Username:Chatul (talk) 14:53, 1 December 2010 (UTC)

  1. ^ "System/360 Model 91", IBM archives, IBM. 

Is ECS on CDC 6x00 in scope[edit]

The basic CDC 6x00 and 6416 are limited to a maximum of 256 Ki words of 60 bit core storage with no cache. However, the optional Extended Core Storage (ECS)[1][2] has an 8-word buffer for each bank, which serves as a 1-way data cache. Accesses to ECS words in the buffer are actually faster than accesses to words in Central Memory. ECS is logically distinct from CM; there are special instructions for accessing it, and the address of an ECS location is taken from X0 rather than from an A register. If the buffers for ECS qualify as CPU caches then they may well be the first data cache. The question is whether ECS is in scope for this article.

  1. ^ CDC (2-21-69), Control Data 6400/6500/6600 Computer Systems Reference Manual, Revision H, 60100000.  Check date values in: |date= (help)
  2. ^ CDC (2-16-68), Control Data 6400/6500/6600 Extended Core Storage Systems Reference Manual, Revision A, 60225100.  Check date values in: |date= (help)

Shmuel (Seymour J.) Metz Username:Chatul (talk) 22:32, 9 December 2010 (UTC)

tone of section[edit]

Parts of this are in the first person ("we can see", "Note that", etc.), and overall the section sounds like a lecture. I'd have tagged it, but the "lecture/lesson" template seems to have been removed since I last used it. Also, is this Mark Hill notable enough to be mentioned? (There is no page on wiki; that's an external link in the article.) PS: regardless, could someone add a disambig hatnote to Mark Hill? We have several people on the disambig and no link to it from the default, i.e. (this article is about... for others see...) — Preceding unsigned comment added by (talk) 23:55, 20 August 2011 (UTC)

The line?[edit]

Can anyone confirm whether the cache was initially referred to as "the line", at least for x86 CPUs? This term is used in the book Understanding the Linux Kernel. Specifically, I think it was referring to the first cache "units", which were off-CPU SRAM. If so, should I add this to the x86 subsection of the history section? -- (talk) 01:42, 9 September 2011 (UTC)

My understanding is that the terms "cache line", "cache row", "cache entry" are all synonymous and refer to one of the many blocks of data in the cache. Each cache line is associated with its own tag bits and other flag bits.
People who mention "the cache line" are referring to some specific block of data in the cache, not the entire cache.
As far as I can tell, that book[2] uses "cache line" the same way.
Is there some way we can improve this article to point out and clear up this common misconception? --DavidCary (talk) 15:47, 20 December 2011 (UTC)

Are references and rephrasing adequate?[edit]

I rephrased the beginning of the first subsection and added references (deleting the 'citation needed' notes and the comments about rephrasing, to avoid having to present an exhaustive survey of cache line sizes and cache access sizes). However, someone might complain that the claim that the largest common reference size is typically equal to the register size itself requires a reference (even though it is effectively common knowledge, and proof would require an exhaustive survey).

The rephrasing does seem to interfere with the flow, but I was annoyed by the 'citation needed' notes and so provided a quick and dirty fix. (I may attempt a more broad reworking of the article at some point.) Paul A. Clayton (talk) 08:39, 23 September 2011 (UTC)

Proposed merge with Tag RAM[edit]

Extra small stub Christian75 (talk) 10:43, 21 July 2013 (UTC)

The Tag RAM is actually part of the CPU's caches and registers; therefore it is appropriate that such a small article (which could be considered a stub) be merged with CPU cache, with a redirect to the proper section of CPU cache provided for the URL. (talk) 01:59, 7 March 2014 (UTC)

Each tag RAM is a part of some CPU cache. So if there's so little to say about "tag RAM" that it's always going to be a stub, then I agree with Christian75 and that "tag RAM" should be redirected and merged into the more general article "CPU cache" that talks about all the parts of the cache, including the tag RAM.
Many CPUs, such as the Clipper architecture, have one chip that contains the processor registers, and (a) completely separate chip(s) that contains the CPU cache; that cache in turn includes the tag RAM. Even on current microprocessors that put them all on one chip, the processor registers are often visible in a completely separate region from the cache(s). So I don't understand why anyone would even consider merging tag RAM into processor register instead of CPU cache. Does that answer your question, Technical 13? --DavidCary (talk) 20:56, 27 January 2015 (UTC)
  • Wow. I had forgotten all about this. If this is still a desired merge, I suggest tagging it with the appropriate templates and getting a formal discussion started or being BOLD and doing it. :) — {{U|Technical 13}} (etc) 21:02, 27 January 2015 (UTC)
The tags have been there since, it appears, July 2013, if Tag RAM and CPU cache are to be believed. Both tags point here for discussion. So I guess this section is the formal discussion in question. Guy Harris (talk) 21:37, 27 January 2015 (UTC)


Some parts of the article, e.g., CPU cache#Two-way set associative cache, describe concepts as being tied to microprocessors when they were actually used on other types of machines. Shmuel (Seymour J.) Metz Username:Chatul (talk) 17:31, 5 March 2014 (UTC)

Well, the article is called "CPU cache", right? :) — Dsimic (talk | contribs) 04:12, 6 March 2014 (UTC)
I changed the two-way set associative cache section to say "they require fewer transistors, take less space on the processor circuit board or on the microprocessor chip, and can be read and compared faster", which is hopefully a bit less recentistic but still discusses current processors. Further improvements are welcome. Guy Harris (talk) 20:03, 13 June 2014 (UTC)

Rule of thumb on how many cycles does it take for the CPU to get data from different types of memory[edit]

I'm new to wiki, but I suppose this should be considered to be added to the CPU cache article in some form, if smart people around here agree on it. And I also hope that someone has a more reliable source on this subject. The following is the best answer, by gamerk316 (CPUs forum, 24 July 2012 20:07:13): Cache is basically just high speed RAM built directly on the CPU. An old rule of thumb:

  • If data is in the L1 cache, the CPU can get to it in about 1-2 clock cycles
  • If data is in the L2 cache, the CPU can get to it in about 10-20 clock cycles
  • If data is in the L3 cache, the CPU can get to it in about 50-80 clock cycles
  • If data is in RAM, the CPU can get to it in about 80-100 clock cycles
  • If data is on the HDD, the CPU can get to it in about 100,000 clock cycles [see why more RAM helps performance?]

Numbers vary a bit by processor architecture, but each level of cache down gets larger, yet also takes longer to access. As you can see, on a system with enough RAM to avoid a page fault [needing to go to the HDD to load data into RAM], the L3 cache has very limited performance benefits. Hence some argue that the space the L3 cache occupies on the CPU die would be better used for some other purpose. — Preceding unsigned comment added by (talk) 12:44, 7 April 2014 (UTC)
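Rule-of-thumb latencies like those above plug into the standard average-memory-access-time recurrence, AMAT = T_L1 + m_L1*(T_L2 + m_L2*(... + m_Ln*T_mem)). A minimal Python sketch (the miss rates in the example are made-up illustrative values, not measurements):

```python
def amat(hit_times, miss_rates, memory_time):
    """Average memory access time for a cache hierarchy, evaluated from
    the innermost level outward:
    AMAT = T1 + m1*(T2 + m2*(... + mn*T_mem))."""
    time = memory_time
    for t, m in zip(reversed(hit_times), reversed(miss_rates)):
        time = t + m * time
    return time

# Using rule-of-thumb latencies with made-up miss rates:
# L1 hit 2 cycles (5% miss), L2 hit 15 (20% miss),
# L3 hit 60 (50% miss), RAM 100 cycles.
average = amat([2, 15, 60], [0.05, 0.20, 0.50], 100)
```

With those numbers the hierarchy averages under 4 cycles per access, which illustrates the post's point: levels far down the hierarchy contribute little once the upper levels absorb most references.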