Runahead: Difference between revisions

From Wikipedia, the free encyclopedia
{{short description|Microprocessing technique}}
'''Runahead''' is a technique that allows a [[Processor (computing)|computer processor]] to [[Speculative execution|speculatively]] pre-process [[Instruction (computer science)|instructions]] during [[CPU cache|cache]] miss cycles. The pre-processed instructions are used to generate instruction and [[data stream]] [[Instruction prefetch|prefetches]] by executing instructions leading to [[cache misses]] (typically called '''long latency loads''') before they would normally occur, effectively hiding [[memory latency]]. In runahead, the processor uses the idle execution resources to calculate instruction and data stream addresses using the available information that is independent of a cache miss. Once the processor has resolved the initial cache miss, all runahead results are discarded, and the processor resumes execution as normal. The primary use case of the technique is to mitigate the effects of the [[memory wall]]. The technique may also be used for other purposes, such as pre-computing branch outcomes to achieve highly accurate [[Branch predictor|branch prediction]].<ref>{{Cite journal |last=Pruett |first=Stephen |last2=Patt |first2=Yale |date=October 2021 |title=Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches |url=https://dl.acm.org/doi/10.1145/3466752.3480053 |journal=MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture |series=MICRO '21 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=804–815 |doi=10.1145/3466752.3480053 |isbn=978-1-4503-8557-2}}</ref>


The principal hardware cost is a means of [[Application checkpointing|checkpointing]] the [[Hardware register|register]] file state. Typically, runahead processors will also contain a small additional [[Cache (computing)|cache]], which allows runahead store operations to execute without modifying actual [[Computer memory|memory]]. Certain implementations also use dedicated [[hardware acceleration]] units to execute specific slices of pre-processed instructions.<ref>{{Cite journal |last=Hashemi |first=Milad |last2=Mutlu |first2=Onur |last3=Patt |first3=Yale N. |date=October 2016 |title=Continuous runahead: Transparent hardware acceleration for memory intensive workloads |url=https://ieeexplore.ieee.org/document/7783764/ |journal=2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) |pages=1–12 |doi=10.1109/MICRO.2016.7783764}}</ref><ref>{{Cite journal |last=Pruett |first=Stephen |last2=Patt |first2=Yale |date=October 2021 |title=Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches |url=https://dl.acm.org/doi/10.1145/3466752.3480053 |journal=MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture |series=MICRO '21 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=804–815 |doi=10.1145/3466752.3480053 |isbn=978-1-4503-8557-2}}</ref>


Runahead was initially investigated in the context of an in-order microprocessor;<ref>Dundas, James D. and Mudge, Trevor N. (September 1996). "[https://tnm.engin.umich.edu/wp-content/uploads/sites/353/2019/04/1996-Using-Stall-Cycles-to-Improve-Microprocessor-Performance.pdf Using stall cycles to improve microprocessor performance]". Technical report. Department of Electrical Engineering and Computer Science, University of Michigan.</ref> however, this technique has been extended for use with [[out-of-order execution|out-of-order]] microprocessors.<ref name=":0">{{Cite journal |last=Mutlu |first=O. |last2=Stark |first2=J. |last3=Wilkerson |first3=C. |last4=Patt |first4=Y.N. |date=February 2003 |title=Runahead execution: an alternative to very large instruction windows for out-of-order processors |url=https://ieeexplore.ieee.org/document/1183532/ |journal=The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. |pages=129–140 |doi=10.1109/HPCA.2003.1183532}}</ref>



==Entering runahead==
In principle, any event can trigger runahead, though typically the entry condition is a [[Memory hierarchy|last level data cache]] miss that makes it to the head of the [[re-order buffer]]<ref name=":0" />. In a normal out-of-order processor, such long latency load instructions block retirement of all younger instructions until the miss is serviced and the load is retired.


When a processor enters runahead mode, it checkpoints all [[Instruction set architecture|architectural]] registers and records the address of the load instruction that caused entry into runahead. All instructions in the [[Instruction pipelining|pipeline]] are then marked as runahead. Because the value returned from a cache miss cannot be known ahead of time, it is possible for pre-processed instructions to be dependent upon unknown or '''invalid data'''. Registers containing such data, or data dependent on it, are denoted by adding an "invalid" or INV bit to every register in the register file. Instructions that use or write such invalid data are also marked with an INV bit. If the instruction that initiated runahead was a load, it is issued a bogus result and marked as INV, allowing it to mark its destination register as INV and drain out of the pipeline.
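The entry sequence can be sketched as a toy Python model. This is purely illustrative: the class and field names (`RunaheadCore`, `enter_runahead`, and so on) are invented for this example and do not come from any real simulator.

```python
class RunaheadCore:
    """Toy model of a core entering runahead mode (illustrative only)."""

    def __init__(self, n_regs=8):
        self.regs = [0] * n_regs      # architectural register values
        self.inv = [False] * n_regs   # INV bit per register
        self.checkpoint = None        # register state saved on entry
        self.runahead_pc = None       # address of the blocking load
        self.in_runahead = False

    def enter_runahead(self, load_pc, load_dest):
        # Checkpoint all architectural registers and record the load's address.
        self.checkpoint = list(self.regs)
        self.runahead_pc = load_pc
        self.in_runahead = True
        # The blocking load is issued a bogus result and marks its
        # destination register INV so it can drain out of the pipeline.
        self.regs[load_dest] = 0
        self.inv[load_dest] = True

core = RunaheadCore()
core.regs[3] = 42                     # some pre-existing architectural state
core.enter_runahead(load_pc=0x400, load_dest=3)
```

After `enter_runahead`, register 3 is marked INV and its pre-runahead value survives only in the checkpoint, which is what allows all runahead results to be discarded later.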


==Pre-processing instructions in runahead==
In runahead mode, the processor continues to execute instructions after the instruction that initiated runahead. However, runahead is a speculative state in which the processor only attempts to generate additional data and instruction cache misses, which are effectively prefetches. The designer can opt to let runahead skip instructions that are not present in the instruction cache, with the understanding that the quality of any prefetches generated will be reduced, since the effect of the missing instructions is unknown.


Registers that are the target of an instruction with one or more source registers marked INV are themselves marked INV. This allows the processor to know which register values can (probably) be trusted during runahead mode. Branch instructions that cannot be resolved due to INV source registers are simply assumed to have been [[Branch predictor|predicted]] correctly. If such a branch was in fact mispredicted, the processor continues executing wrong-path instructions until it reaches a branch-independent point, potentially executing wrong-path loads that pollute the cache with useless entries. Valid branch outcomes can be saved for later use as highly accurate predictions during normal operation.
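The INV propagation rule and the branch fallback can be sketched as follows; the function names and the register encoding are invented for illustration, and a branch is assumed to test a single source register against zero.

```python
def execute_alu(regs, inv, dest, srcs, op):
    """Run one ALU instruction in runahead mode, propagating INV bits."""
    if any(inv[s] for s in srcs):
        inv[dest] = True                       # result cannot be trusted
    else:
        inv[dest] = False
        regs[dest] = op(*(regs[s] for s in srcs))

def resolve_branch(regs, inv, src, predicted_taken):
    """Return the branch direction used during runahead mode."""
    if inv[src]:
        return predicted_taken                 # INV source: trust the predictor
    return regs[src] != 0                      # valid source: real outcome

regs = [0, 7, 0, 0]
inv = [False, False, True, False]              # r2 holds invalid data
execute_alu(regs, inv, dest=3, srcs=[1, 2], op=lambda a, b: a + b)
```

Here the add targeting `r3` reads the INV register `r2`, so `r3` becomes INV; a branch on `r3` would then fall back on its prediction, while a branch on the valid `r1` resolves normally.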


Since runahead is a speculative state, store instructions cannot be allowed to modify memory. To communicate store results to dependent loads, a very small cache accessed only by runahead loads and stores, called a '''runahead cache''', can be used.<ref name=":0" /> This cache is functionally similar to a normal cache, but each line carries an INV bit to track which data is invalid. INV stores set the INV bit of their target cache line, while valid stores clear it. Every runahead load must check both the real caches and the runahead cache. If the load hits in the runahead cache, it discards the real cache result and uses the runahead cache data, which may itself be invalid if the line's INV bit is set. Because the runahead cache is separate from the [[memory hierarchy]], there is no place to evict old data to; in case of a [[cache conflict]], the old data is simply dropped from the cache. Note that because of the limited size of the runahead cache, it is not possible to perfectly track INV data during runahead mode (as INV data may be overwritten by valid data in a cache conflict). In practice, this is not crucial, since all results computed during runahead mode are discarded.
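The runahead cache's behavior (INV bits per line, drop-on-conflict, hit overriding the real cache) can be modeled in a few lines of Python. This is a sketch under assumed simplifications: a direct-mapped cache with one word per line and invented names.

```python
class RunaheadCache:
    """Tiny direct-mapped runahead cache; each line carries an INV bit."""

    def __init__(self, n_lines=4):
        self.n = n_lines
        self.lines = {}                    # index -> (tag, data, inv)

    def store(self, addr, data, inv):
        # A conflicting older line is simply dropped: this cache sits
        # outside the memory hierarchy, so there is nowhere to evict to.
        self.lines[addr % self.n] = (addr // self.n, data, inv)

    def load(self, addr):
        line = self.lines.get(addr % self.n)
        if line is not None and line[0] == addr // self.n:
            tag, data, inv = line
            return data, inv               # hit: overrides the real cache
        return None                        # miss: fall back to the real cache

rc = RunaheadCache(n_lines=4)
rc.store(0x10, data=99, inv=False)   # valid runahead store
rc.store(0x11, data=0, inv=True)     # INV store marks its line invalid
rc.store(0x14, data=5, inv=False)    # maps to 0x10's line: old data dropped
```

A load of `0x10` now misses (its line was dropped by the conflict), while loads of `0x14` and `0x11` hit, the latter returning data flagged INV.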


==Leaving runahead==
As with entering runahead, any event can in principle cause the processor to leave runahead, though a runahead period initiated by a cache miss is typically exited once that miss has been serviced.

When the processor exits runahead, all instructions younger than and including the instruction that initiated runahead are squashed and drained from the pipeline. The architectural register file is then restored from the checkpoint, and a predetermined [[Register renaming|register alias table]] (RAT) is copied into both the front-end and back-end RATs. Finally, the processor is redirected to the address of the instruction that initiated runahead and resumes execution in normal mode.
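The exit sequence can be sketched as a toy model mirroring the entry sketch: discard every runahead result, restore the checkpoint, and redirect fetch to the blocking instruction. The dictionary layout and field names are invented for illustration.

```python
def exit_runahead(state):
    """Squash runahead results and resume normal execution (toy model)."""
    n = len(state["regs"])
    state["regs"] = list(state["checkpoint"])  # restore architectural state
    state["inv"] = [False] * n                 # runahead INV bits discarded
    state["pc"] = state["runahead_pc"]         # re-fetch the blocking load
    state["in_runahead"] = False

state = {
    "regs": [1, -1, -1],         # values clobbered during runahead
    "inv": [False, True, True],
    "checkpoint": [1, 2, 3],     # state saved when runahead began
    "runahead_pc": 0x400,        # address of the load that started runahead
    "pc": 0x47C,                 # wherever runahead fetch got to
    "in_runahead": True,
}
exit_runahead(state)
```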


==Register file checkpoint options==
The simplest method of checkpointing the architectural register file (ARF) is to copy the entire [[physical register file]] (PRF), which is a superset of the ARF, to a checkpoint register file (CRF) when the processor enters runahead mode. When runahead is exited, the processor then copies the CRF back to the PRF. However, more efficient options are available.


One way to eliminate the copy operations is to write to both the PRF and the CRF during normal operation, but only to the PRF in runahead mode. Because the CRF is kept up to date in parallel, entering runahead then incurs no checkpointing overhead, though the processor must still restore the PRF from the CRF when runahead is exited.


Because the only registers that need to be checkpointed are the architectural registers, the CRF only needs to contain as many registers as there are architectural registers, as defined by the [[instruction set architecture]]. Since processors typically contain far more physical registers than architectural registers, this significantly shrinks the size of the CRF.

An even more aggressive approach is to rely only upon the [[operand forwarding]] paths of the microarchitecture to provide modified values during runahead mode.{{Citation needed}} The register file is then "checkpointed" by disabling writes to the register file during runahead.
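The parallel-write option described above can be sketched in Python. `CheckpointedRF` and its methods are invented names, and the model ignores register renaming for simplicity.

```python
class CheckpointedRF:
    """PRF plus a checkpoint RF kept up to date in parallel (toy model)."""

    def __init__(self, n):
        self.prf = [0] * n            # physical register file
        self.crf = [0] * n            # checkpoint register file
        self.in_runahead = False

    def write(self, reg, val):
        self.prf[reg] = val
        if not self.in_runahead:
            self.crf[reg] = val       # CRF shadows normal-mode writes only

    def exit_runahead(self):
        self.prf = list(self.crf)     # discard all runahead writes
        self.in_runahead = False

rf = CheckpointedRF(4)
rf.write(0, 10)                       # normal mode: goes to PRF and CRF
rf.in_runahead = True                 # entering runahead costs no copy
rf.write(0, 99)                       # runahead write: PRF only
rf.exit_runahead()                    # PRF restored from the CRF
```

The design trade-off is visible here: entry is free because the checkpoint is maintained continuously, but exit still pays for a full restore of the PRF.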

== Optimizations ==
While runahead is intended to increase processor performance, pre-processing instructions when the processor would otherwise have been idle decreases the processor's [[Energy efficiency of computer hardware|energy efficiency]] due to an increase in [[Processor power dissipation|dynamic power]] draw. Additionally, entering and exiting runahead incurs a performance overhead, as register checkpointing and particularly flushing the pipeline may take many cycles to complete. Therefore, it is not wise to initiate runahead at every opportunity.

Some optimizations that improve the energy efficiency of runahead are:

* Only entering runahead if the processor is expected to execute long latency loads during runahead, thereby reducing short, unproductive runahead periods.<ref>{{Citation |last=Van Craeynest |first=Kenzo |title=MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor |date=2009 |url=http://link.springer.com/10.1007/978-3-540-92990-1_10 |work=High Performance Embedded Architectures and Compilers |volume=5409 |pages=110–124 |editor-last=Seznec |editor-first=André |access-date=2023-06-02 |place=Berlin, Heidelberg |publisher=Springer Berlin Heidelberg |doi=10.1007/978-3-540-92990-1_10 |isbn=978-3-540-92989-5 |last2=Eyerman |first2=Stijn |last3=Eeckhout |first3=Lieven |editor2-last=Emer |editor2-first=Joel |editor3-last=O’Boyle |editor3-first=Michael |editor4-last=Martonosi |editor4-first=Margaret}}</ref>
* Limiting the length of runahead periods to only run as long as they are expected to generate useful results.<ref>{{Cite journal |last=Van Craeynest |first=Kenzo |last2=Eyerman |first2=Stijn |last3=Eeckhout |first3=Lieven |date=2009 |editor-last=Seznec |editor-first=André |editor2-last=Emer |editor2-first=Joel |editor3-last=O’Boyle |editor3-first=Michael |editor4-last=Martonosi |editor4-first=Margaret |editor5-last=Ungerer |editor5-first=Theo |title=MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor |url=https://link.springer.com/chapter/10.1007/978-3-540-92990-1_10 |journal=High Performance Embedded Architectures and Compilers |series=Lecture Notes in Computer Science |language=en |location=Berlin, Heidelberg |publisher=Springer |pages=110–124 |doi=10.1007/978-3-540-92990-1_10 |isbn=978-3-540-92990-1}}</ref>
* Only pre-processing instructions that eventually lead to load instructions.<ref>{{Cite journal |last=Hashemi |first=Milad |last2=Patt |first2=Yale N. |date=2015-12-05 |title=Filtered runahead execution with a runahead buffer |url=https://doi.org/10.1145/2830772.2830812 |journal=Proceedings of the 48th International Symposium on Microarchitecture |series=MICRO-48 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=358–369 |doi=10.1145/2830772.2830812 |isbn=978-1-4503-4034-2}}</ref>
* Only using free processor resources to pre-process instructions.<ref name=":1">{{Cite journal |last=Naithani |first=Ajeya |last2=Feliu |first2=Josué |last3=Adileh |first3=Almutaz |last4=Eeckhout |first4=Lieven |date=February 2020 |title=Precise Runahead Execution |url=https://ieeexplore.ieee.org/document/9065552/ |journal=2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) |pages=397–410 |doi=10.1109/HPCA47549.2020.00040}}</ref>
* Buffering micro-operations that were decoded during runahead for reuse in normal mode.<ref name=":1" />
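Two of the policies above (gating entry on expected long-latency loads, and capping the length of a runahead period) can be expressed as a simple controller sketch. The function name and both thresholds are invented for illustration; real implementations derive these decisions from hardware predictors.

```python
def runahead_controller(predicted_far_misses, cycles_in_period,
                        min_misses=2, max_cycles=256):
    """Decide whether to enter runahead and whether to remain in it."""
    enter = predicted_far_misses >= min_misses  # skip unproductive periods
    stay = cycles_in_period < max_cycles        # cap the period length
    return enter, stay
```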

== Side effects of runahead ==
Runahead has been found to reduce [[soft error]] rates in processors as a side effect. While a processor is stalled on an outstanding cache miss, its entire state is vulnerable to soft errors. By continuing execution, runahead incidentally shortens the time during which the processor's state sits idle and exposed, thereby reducing soft error rates.<ref>{{Cite journal |last=Naithani |first=Ajeya |last2=Eeckhout |first2=Lieven |date=April 2022 |title=Reliability-Aware Runahead |url=https://ieeexplore.ieee.org/document/9773198/ |journal=2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA) |publisher=IEEE |pages=772–785 |doi=10.1109/HPCA53966.2022.00062 |isbn=978-1-6654-2027-3}}</ref>


== See also ==


==References==
<references />

[[Category:Instruction processing]]

Revision as of 16:29, 2 July 2023
