|This is the talk page for discussing improvements to the Out-of-order execution article.
This is not a forum for general discussion of the article's subject.
|WikiProject Computing||(Rated Start-class)|
|It is requested that a computing diagram or diagrams be included in this article to improve its quality. Specific illustrations, plots or diagrams can be requested at the Graphic Lab.
For more information, refer to discussion on this page and/or the listing at Wikipedia:Requested images.
|The content of Decoupled architecture was merged into Out-of-order execution. For the contribution history and old versions of the redirected page, please see ; for the discussion at that location, see its talk page. (2016-02-13)|
how does out of order execution fit in with branch prediction? some people claim that to have branch prediction there needs to be an out of order execution, since otherwise the next instruction can only be fetched when the branch is done, thus having no benefit from any prediction at all. — Preceding unsigned comment added by 126.96.36.199 (talk) 21:19, 15 June 2011 (UTC)
- There is a tradition of instruction fetch, instruction decode, instruction execution, instruction retirement, without OoO you can still do those in parallel. So, with branch prediction you can still do fetch and decode ahead of time, and then do execution in order. With core memory, this was pretty important. A big part of the 360/91 is 16-way interleaved memory, but it still took a long time. After the 360/91, the 360/85, first machine with memory cache, was close to the speed on the 360/91 for many programs, as the cache made up for the processing speed. As well as I remember, the 360/91 doesn't do branch prediction, but can prefetch on two possible branch paths. The 360/91 has a special 'loop mode' which keeps a cache of instruction words, such that tight loops can run without fetching instructions from core. Not exactly prediction, but it only works for successful back branches. Gah4 (talk) 18:01, 16 November 2015 (UTC)
Isn't the point of a queue to have something first in first out? FIFO?. If the instruction is fetched and decoded and stuck in a queue, wouldn't that imply some kind of order of instruction? Or is it more like a priority queue? I'm kind of confused on that point, could someone clarify? Epachamo 03:54, 2 March 2006 (UTC)
Conceptually, the queue holds instructions and (once produced) their results. Instructions are kept in order by the queue. When the result of an instruction is computed, it is written to the proper entry in the queue, next to the instruction. When the instruction commits (reaches head of the queue) its result is read from the queue and written to the programmer visible registers. In summary: it is a FIFO discipline with respect to instructions, but not results. Results are written into the queue out of order, but extracted in program order. Schiermer 19:53, 30 March 2006 (UTC)
I suppose that seems right for in-order retirement. I believe the 360/91 does out-of-order retirement, with register renaming to get the data in the right place. But it is hard to do OoO retirement and virtual memory, as you don't know when a page fault might occur. The 360/91 has the dreaded "multiple imprecise interrupt" when more than one interrupt happens before the pipeline has been flushed. The next instruction address doesn't tell you where the fault occurred, either. You can't have a page fault that doesn't identify the faulting instruction and allow it to be retried. Gah4 (talk) 18:09, 16 November 2015 (UTC)
I believe there are a few inaccuracies in the history section of the article.
Firstly, the article mentions, 'IBM also introduced the first out-of-order microprocessor, the POWER1 (1990) for its RS/6000.'
This is misleading and has potential for confusion as the POWER1 could only execute floating point instructions out of order, but the article speaks of it in such a way that suggests that it was fully out of order. It was not until the POWER3 was released in 1997 that the POWER series had a fully OoO member.
Secondly, the article says, 'It was the release of the Intel Pentium Pro (1995) which brought the technology to the mainstream.'
This too is also misleading and has potential for confusion because it sounds like that Intel, by using OoO, prompted other companies such as MTI, HP, HAL and DEC design equivalent CPUs. In fact, the Pentium Pro was based on the OoO technique of the PA-8000, which Intel had access to thanks to its alliance with HP. Other companies such as MIPS also had OoO CPUs in development at roughly the same time as HP and Intel. Further more, the article says that the Pentium Pro brought the technology into the mainstream which is inaccurate as the processor was deemed too expensive for the consumer market by Intel late into the design stage and was retargeted towards enterprise markets.
Does anyone have any thoughts or comments? Rilak 17:39, 14 August 2007 (UTC)
- I just find odd that the claim "it was the release of the Intel Pentium Pro (1995) which brought the technology to the mainstream" when the Power Macs (sp. models 6100/60, 7100/66, and 8100/80) were introduced in March of 1994. These models were based on the PPC601 and the Motorola & IBM white papers I head read in 1992/1993 stated that the 601 operated "fully" out-of-order (or "out-of-queue" as the authors preferred) as opposed to out-of-order FPU execution of the POWER1. (I hope I still have those 15-year-old papers) ▪ NeoAmsterdamTalkEdits 00:10, 18 March 2008 (UTC)
- From this paper: The Design Space of Register Renaming Techniques by Dezsö Sima of Budapest Polytechnic, published in the September/October 2000 issue of the IEEE Micro, the ES/9000 mainframe of 1992 from IBM was the first processor to implement full out of order execution. I'm not sure of the out of order capabilities of the PowerPC, but I think that they were limited much like the POWER1 and POWER2. The NextGen Nx586 of 1994, released one year before the Pentium Pro could execute fixed point instructions out of order, and the AMD K6 was released in the same year as the Pentium Pro had complete capability. Other designs such as the SPARC64 from HAL and the PowerPC chips of 1995 were also fully out of order. I think with all this, the we can conclude that the Pentium Pro was certainly not innovative at all in regards of OoO and was following the general trend like everyone else. Rilak (talk) 06:22, 20 March 2008 (UTC)
- I was the person who wrote the sentence "brought the technology to the mainstream". What I meant is that the Pentium Pro device was the first OoO device to be sold in large mass-market volumes, not so much that it caused other design houses to change their microarchitectures. Reading all of the other CPUs listed on this discussion page, I still believe my statement is an accurate assessment. The only other device that you might claim sold in large volumes are the PPCs used in the Macs, but by the early 1990s Apple had perhaps 10 percent(?) of the PC market in total and much of that still 68K based. As a separate point, the Pentium Pro CPU was the one that ended the architecture wars of the '80s and '90s by removing any hope of performance advantage to be provided by the simpler RISC architectures. Dyl (talk) 07:19, 20 March 2008 (UTC)
- Maybe the statement could be clarified further? The Pentium Pro was not exactly common, it was only used in expensive (relative to PCs of the time) x86 workstations and entry level servers. I'm not sure what you mean by the PPro removing any hope of a performance advantage by simpler RISC architectures, but I think that it is fair to say that most RISC designs of the same period were still faster, to put it bluntly. I think that the only significance of the PPro is that it was part of the trend to translate x86 instructions in simpler micro-ops that could be executed more effectively by a RISC-like or inspired core. Rilak (talk) 19:29, 20 March 2008 (UTC)
- My belief is that the volume of x86 workstations/entry-level servers is still higher then Apple's 601-based systems. Though I don't have the references to back that up. Perhaps the statement would be less controversial if it said PentiumPro and its descendant PentiumII brought OoO to the masses. I disagree about the performance issue. I was there at Microprocessor Forum '96? when the first PPro performance numbers were announced and the integer numbers were as good as any at the time. Yes, the RISC machines would continue to have better FP numbers for a long while, but for the mass-market that's a non-issue. Dyl (talk) 16:34, 21 March 2008 (UTC)
- Yes, PPro workstations and servers were far more widespread than PPC systems but isn't that the point? The PPro was not a consumer device, eg. Pentium in a $1,500 box. I would say that the Pentium II was the processor that really brought OoO to the masses but I think it would be better if we simply didn't state what chip did this and that first. Instead by providing the history without any "controversial" words with debatable meanings such as 'mainstream' or 'masses', the article would be a lot more neutral. As for performance, I would say that the PA-8000 ad Alpha beat the P6, but that's completely irrelevant to this article. Rilak (talk) 06:52, 22 March 2008 (UTC)
Regarding POWER1 could only execute floating point instructions out of order,: For scientific programming, floating point is the dominant time use, and most fixed point instructions are fast enough that in-order is fine. The design of the CDC 6600, 7600, and IBM 360/91 is mostly to speed up floating point calculations. Gah4 (talk) 02:55, 24 September 2016 (UTC)
During the development of ACS-1, Lynn Conway invented Dynamic Instruction Scheduling. Her paper can be found here: http://ai.eecs.umich.edu/people/conway/ACS/DIS/DynInstSched-paper.html. This is likely the most importent development in out-of-order execution, but is nowhere mentioned in the article.
Yale Patt seems to be getting to much credit. As he mostly plagiarized Lynn's work. This is mentioned here: http://ai.eecs.umich.edu/people/conway/Retrospective2.html#anchor305182 — Preceding unsigned comment added by 188.8.131.52 (talk) 22:49, 6 February 2012 (UTC)
- It's an interesting debate. The problem is that Lynn (who wasn't named Lynn at the time, which makes some things more tricky in some cases) published internally only (as I understand it). Obviously Tomasulo is the one who gets the base credit--and it's well deserved--even though his algorithm was really only used for floating point scheduling. I think Lynn's work deserves credit, but needs some independent reliable sourcing (WP:V) about its impact. The links above probably don't cut it. And I very much doubt the movement of the idea from Lynn to Yale will able to be shown to have occurred one way or the other.
- Out of curiosity, does Dr. Conway's paper address renaming? If not, I'm not really seeing how this is an improvement over scoreboarding. In fact, after a quick scan, I'm not seeing how name dependencies are resolved/dealt with at all. I assuming I'm missing something obvious. While a bit off topic for Wikipedia, I'd love to know what I'm missing. Back to work! Hobit (talk) 01:06, 8 February 2012 (UTC)
Merge from Decoupled architecture
- Support looks like the contents of Decoupled architecture can be effectively merged into Out-of-order_execution#Dispatch_and_issue_decoupling_allows_out-of-order_issue. ~KvnG 00:00, 3 February 2015 (UTC)
- Support the Decoupled architecture is too complicated for some readers and is very short. I think this should be merged. --☣Anarchyte☣ 02:53, 22 May 2015 (UTC)
- Support I think I understand Decoupled architecture, but it seems to be saying that "decoupled architecture" is another way of saying "an architecture with out-of-order execution". If that isn't the case, then it should be rewritten in a way that explains the distinction.--Wcoole (talk) 19:24, 30 October 2015 (UTC)
Static vs. Dynamic Scheduling Superscalar
I'm a bit muddled on the distinction between Static and Dynamic Scheduling and they overlap with OOOE, VLIW, and Superscalar. I think I understand, in a general way, the technology involved, but the terminology seems to be a bit variable. The Conway paper uses "Dynamic Instruction Scheduling", but describes something I recognize as OOOE. http://foldoc.org/Very%20Long%20Instruction%20Word says that VLIW is equivalent to "staticly scheduled superscalar". It seems as though "static" in this context means "in-order" and "dynamic" means "out-of-order", but I haven't found a reference that states this clearly. I also haven't found a source that contradicts this interpretation.
- It isn't so obvious. Seems to me that the idea of dynamic is that it is determined during execution. For example, it could change from run to run, as memory access, and especially effects from other tasks running, could be unpredictable.
- Note that Itanium is listed as not OoO. Instead, Itanium depends on the compiler to order things in the optimal order. I would say, echnically, static should refer to fixed order, not necessarily in-order. Gah4 (talk) 02:30, 24 September 2016 (UTC)
The Basic concept section seem to suggest that the steps (fetch, decode, execute, retire) happen sequentially for in-order. As well as I know, they can overlap as long as execute is in-order.
Also, for OoO, the section seems to imply in-order retirement. As far as I know, all current processors do that, as it is almost necessary for dynamic memory. The 360/91, as well as I know, allows OoO retirement, consistent with register renaming. With OoO retirement, it is necessary to flush all pipelines before an interrupt can be taken. Gah4 (talk) 02:35, 24 September 2016 (UTC)
There isn't much discussion on pipelining and parallel execution. The big advantages of OoO come when you can execute more than one instruction at a time, and especially if you pipeline them. There is a small advantage in memory access without execution overlap, but not the big advantage possible. Gah4 (talk) 03:12, 24 September 2016 (UTC)
although out-of-order execution was limited to floating-point instructions
I am wondering about the meaning of: although out-of-order execution was limited to floating-point instructions. Does that mean that if you erase all the floating point instructions, those that are left are in-order? If you exchange the order of a floating point and fixed point instruction, which one is out of order? But yes, all the OoO hardware on the 360/91, such as register renaming and reservation stations is for floating point, but that allows faster fixed-point operations to finish earlier. Gah4 (talk) 15:31, 10 April 2017 (UTC)
- The only part of that machine that could execute instructions out of order was the floating point processor. I don't recall how interactions between the integer side and the fp side were handled. Hobit (talk) 18:23, 10 April 2017 (UTC)
- OK, but this is pretty fundamental to the meaning of OoO execution, the subject here. Many processors overlap fetch, decoding, and retirement with execution, but still do execution in order. For the 8086, you can see this by modifying instructions after fetch, where the processor does not detect the change. That is part of the 8086 model. The 360/91 does detect such modification, because S/360 allows for self modifying code. Since fixed point is used for computing addresses, you pretty much have to be able to execute fixed point instructions mixed in with floating point instructions. Next, consider the 8086/8087 where the 8087 executes concurrently with the 8086. (A WAIT instruction is required to resynchronize the instruction stream.) This does not, as far as I know, count as OoO, as fixed point instructions are executed in order, and floating point instructions are in order, even though the two instruction streams can overlap. It seems to me that the article needs to make this distinction clear. Gah4 (talk) 20:16, 10 April 2017 (UTC)