Talk:Loop unrolling

This is the talk page for discussing improvements to the Loop unrolling article.
This is not a forum for general discussion of the article's subject.

Put new text under old text. Click here to start a new topic.
New to Wikipedia? Welcome! Learn to edit; get help.

Article policies

Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL

Computing: Software Low‑importance

	This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.ComputingWikipedia:WikiProject ComputingTemplate:WikiProject ComputingComputing articles
Low	This article has been rated as Low-importance on the project's importance scale.
	This article is supported by WikiProject Software (assessed as Low-importance).

Unnamed section[edit]

Programming languages should be indicated. Current compiler implementation and actual gains should be discussed Sboehringer

Wrong primary name[edit]

"Loop unwinding"? Who uses that term? Not many people. There are 1,450 Google hits to that term, and 85,800 to "loop unrolling." —Preceding unsigned comment added by 83.249.241.102 (talk) 20:57, 1 July 2008 (UTC)[reply]

Agreed, every time I've wanted to view this article I've had a redirect. Themania (talk) 11:05, 25 July 2008 (UTC)[reply]

+1; the gcc option to perform this optimization explicitly is '-funroll-loops' —Preceding unsigned comment added by 124.168.48.205 (talk) 21:35, 5 December 2009 (UTC)[reply]

Cleanup Loop[edit]

Should the code example include a cleanup loop? The unwound loop now doesn't give the right answer unless you're looking for a multiple of five (which, in this case, is happening).

int max = 101;
int remainder = max % 5;
int x;
for ( x = 0; x < max - 4; x += 5)
{
    cout << (x)  << endl;
    cout << (x+1)<< endl;
    cout << (x+2)<< endl;
    cout << (x+3)<< endl;
    cout << (x+4)<< endl;
}
for ( int i = 0; i < remainder; i++)
    cout << x++ << endl;

Here, max can be set to any value and will print out the right answer.

(I replaced delete with cout so I could get feedback. delete could be put back in ("delete(x++)"))

Also, loops are typically unrolled in powers of two, at least in my experience

DevastatorIIC 19:00, 29 November 2005 (UTC)[reply]

Pipelining[edit]

The main reason for loop unrolling today is to make better use of the pipelined resources eg. fpu. The loop overhead nowadays is less important.

I agree with you, this article is wrong. The main point of loop unrolling is to increase the number of independent operations available, to take advantage of superscalar processors, not to reduce the loop overhead. Xavier andrade 16:34, 14 September 2007 (UTC)[reply]

Think again. A very important consideration is whether or not each iteration involves results from earlier iterations. For instance, a calculation involving A(i) often involves A(i - 1) and even A(i + 1) or similar, so that initiating all N calculations in parallel without attention to phasing will produce wrong results. In the example above, activating each statement in parallel could result in the values 0, 1, 2 etc. appearing in any order in the output via "cout". Loop unrolling as it stands makes no change to the order of operations, and parallel execution or not (of floating-point and integer operations for example) remains available because the meaning of the code is unchanged. NickyMcLean 03:38, 15 September 2007 (UTC)[reply]

There is also additional saving that occurs by delaying the execution of a jump instruction.User:dhessler 22:30, 3 November 2010 (EST)

I came here to say something fairly similar to this: that the avoidance of the branch penalty is likely a bigger win than the pointer arithmetic gain. However, I'm definitely without any sources to demonstrate this factoid William M. Connolley (talk) 23:05, 22 November 2010 (UTC)[reply]

Well I came here to say exactly the opposite! Branch prediction is going to hide most of the cost of the branch (if it can't do this then unrolling is going to make the problem worse). Loop construct overheads are rarely a significant factor in execution time. The real benefits of unrolling a loop come from the flexibility to rearrange, or even eliminate, some operations. It should be understood that these are often assembly-level operations being rearranged, and the rearrangement can occur at a level of granularity which cannot reasonably be expressed in a high level language. Obviously this must be done only when it is correct and efficient to do so, and NickyMcLean gives a valid example of where this is not the case, but in many other instances the principal gains are around things other than the test and branch operations. See, for example, constant folding and restrict.

As was stated initially, the article is wrong. It misrepresents the main purpose of the transformation. --sh1 (talk) 22:42, 17 February 2014 (UTC)[reply]

Register Usage[edit]

The comments that loop unrolling causes the compiler to allocate more significantly more registers is silly, and in most cases bogus. Whoever wrote that probably doesn't understand how their compiler works very well. (or they use a compiler that just isn't very good.)

Of course, it depends upon which compiler you use. However, in most decent compilers the example given shouldn't use more than one extra register, if that. For example, compiling the example with GCC, with optimization, and dumping ARM assembly code, the loop and the "unrolled" versions use exactly the same number of registers. Obscuranym (talk) 01:42, 23 October 2008 (UTC)[reply]

Agreed. If you really were to write code that needed to use additional registers in the assembly, then you've either unrolled your code wrong or have written horribly inefficient code. It is worth noting, that depending on the architecture use of an additional register may not affect performance. A performance loss could occur if you're unrolled code forced the machine to store temporary variables in memory as opposed to registers (because you have too many temporary variables). These cases seem academic at best. Dhessler (talk) 02:37, 4 November 2010 (UTC)[reply]

Seeing this marked as dubious was pretty surprising and I don't really understand the basis for the disagreement with the statements about registers here. Doesn't really seem disputable that loop unrolling can increase register usage if your unrolled loops are generating an increased number of variables. More vars correlates to more registers being used at best and direct memory accesses at worst. Good compilers and what they optimise isn't really even a factor in the conversation because you wouldn't be manually unrolling a loop with one, anyway. The registers issue isn't about loop unrolling in general, but is likely addressing the specific case of a manual unroll where it is common to introduce additional temporary variables. Razoras (talk) 23:50, 7 June 2012 (UTC)[reply]

IBM example too specialized.[edit]

I think the IBM assembly example is way too specialized for a general encyclopedia. Perhaps it could be explained in a way that can be understood by the 99.9% of readers that are unfamiliar with IBM assembly. Maybe rewrite it in pseudocode or pseudo-assembly? 195.35.160.133 (talk) 11:58, 4 December 2009 (UTC) Martin.[reply]

IBM/360 has been around for more than 40 years, probably comprises the bulk of commercial programming ever produced, is still extant and still compatible in todays evolved Z/Architecture. It also underpins many optimizing and non-optimizing compilers for numerous languages that are also still extant. It is also quite similar to the "ARM" language. How many other languages/ platforms can you say that about?

Having said that, perhaps someone can come up with an equivalent X386 example that is as easy to comprehend - although in my experience, someone who understands one assembler understands them all, especially if it is quite well documented, as I believe this one is! I agree pseudocode / pseudo-assemby would be an alternative , but would it execute?ken (talk) 12:51, 4 December 2009 (UTC)[reply]

I agree. You are right, I was wrong. The page needs an example of that unrolling technique, which works best in assembly. (Btw, I will add an external link to an x86 assembly example.) 195.35.160.133 (talk) 11:24, 15 December 2009 (UTC) Martin (=OP).[reply]

Resolved

C examples do not demonstrate optimal unrolling[edit]

I think whoever originally devised the C examples did not really appreciate that there is a really large benefit in avoiding run time pointer arithmetic - since the benefits were (originally) only stated as elimination of 'end of loop' tests. The C examples, which currently use variables ('x', 'x+1') etc in the unrolled loop, are not optimal and may only rid the generated code of additional run time calculations if absolute index values are used instead. Someone should replace these examples with ones where the C compiler is able perform assignments etc without repeated index/pointer recalculations for each iteration, as performed in the IBM/360 and Z/Architecture example. ken (talk) 13:40, 4 December 2009 (UTC)[reply]

I have now added a C example for dynamic unrolling . This however still suffers from the disadvantage that a variable (i) is used within both the While loop and Switch statements, still requiring pointer/index arithmetic to be generated by the compiler, negating part of the benefit of loop unrolling, even in the supposedly 'low level' C language ken (talk) 15:07, 11 December 2009 (UTC)[reply]

Side by side comparisons[edit]

I have amended the article to present program code before & after unrolling/unwinding side-by-side for ease of viewing the differemces.ken (talk) 10:04, 13 December 2009 (UTC)[reply]

Differenciate kinds of unrolling[edit]

I think we could differenciate the "total" unroll that is available only when the number of iterations in the loops is known, and the "partial" unroll that is available quite every time. Moreover, when unrolling partially, there are many way to do it. Given the following code to unroll partially:

 //N is unknown at compile time
 for (int i = 0; i < N; i++) { 
   somestuffwith(i);
 }

we want to unroll this loop partially by a factor x. But we don't know if N mod x equals to 0. Then we have many options, such as :

 for (int i = 0; i < N; i+=x) { 
   somestuffwith(i+0); if (i+1 >= N) break;
   somestuffwith(i+1); if (i+2 >= N) break;
   somestuffwith(i+2); if (i+3 >= N) break;
   //...
   somestuffwith(i+(x-3)); if (i+(x-2) >= N) break;
   somestuffwith(i+(x-2)); if (i+(x-1) >= N) break;
   somestuffwith(i+(x-1));
 }

In this case, the overhead of the if may break the pipeline, and add some overhead anyway. But the code is then more regular. We can use another kind of unrolling :

 int N2 = N - N%x;
 for (int i = 0; i < N2; i+=x) { 
   somestuffwith(i); 
   somestuffwith(i+1);
   somestuffwith(i+2);
   //...
   somestuffwith(i+(x-3));
   somestuffwith(i+(x-2)); 
   somestuffwith(i+(x-1));
 }
 for (int i = N2; i < N; i++) { 
   somestuffwith(i);
 }

Here, we may use the pipeline of a superscalar architecture, but we then have a little overhead induced by the first line, and the last for that take at most (x-1) iterations. At the end, the speedup depends on the compiler and/or the architecture. Now suppose that we know N at compile time. Then we can enhance the unroll.

 for (int i = 0; i < N; i+=x) { 
   somestuffwith(i+0);
   somestuffwith(i+1);
   somestuffwith(i+2);
   //...
   //here N is known at compile time, and obviously the unroll factor x is known
   //then we know m = N%x and we know where the loop should break.
   somestuffwith(i+(x-m)); if (x-m+1 >= N) break;
   //...
   somestuffwith(i+(x-3));
   somestuffwith(i+(x-2));
   somestuffwith(i+(x-1));
 }

Using this method, we have now only one if, that still break the pipeline, but the overhead is now reduced. Finaly, we can use this method :

 //N2 is here known at compile time : N2 = N - N%x
 for (int i = 0; i < N2; i+=x) { 
   somestuffwith(i); 
   somestuffwith(i+1);
   somestuffwith(i+2);
   //...
   somestuffwith(i+(x-3));
   somestuffwith(i+(x-2)); 
   somestuffwith(i+(x-1));
 }
 //here we still know m = N%x and we know how many iterations we have to add
 somestuffwith(i+(N2));
 somestuffwith(i+(N2+1));
 //...
 somestuffwith(i+(N2+m));

Here the pipeline may be used inside the loop's body, and at the end, when processing the remaining m last statements. —Preceding unsigned comment added by 131.254.15.243 (talk) 09:51, 9 April 2010 (UTC)[reply]

Inconsistency in "Dynamic unrolling"[edit]

The Dynamic unrolling section appears to conflate two quite different ideas:

JIT code generation
A static unwound loop with a bunch of special cases to handle when n % N != 0

Surely these are completely unrelated? In what way is the latter "dynamic"? Oli Filth^{(talk|contribs)} 22:35, 18 May 2010 (UTC)[reply]

The use of --> in C pseudo code[edit]

While technically correct, and difficult to change without altering the example, the use of --> can be confusing to people new to C. I wonder if the example could be altered so it gives the same message without the -->

the final C example[edit]

It shouldn't use "printf" as code of the loop body, better use something more abstract like "foo(i)".
The switch statement is incorrect because it changes the order in which the index values are processed (e.g. i+6 before i+3). This can be very relevant depending on the code using i. Use entries-x.
It is suboptimal to have both "i" and "repeat" as induction variables in the loop. Also the division to calculate "repeat" is superfluous. Use "i < entries-left" as loop condition.

Croorg (talk) 06:22, 3 March 2017 (UTC)[reply]

External links modified[edit]

Hello fellow Wikipedians,

I have just modified one external link on Loop unrolling. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive https://web.archive.org/web/20081201132152/http://www.nondot.org/~sabre/Mirrored/GraphicsProgrammingBlackBook/gpbb7.pdf to http://nondot.org/~sabre/Mirrored/GraphicsProgrammingBlackBook/gpbb7.pdf

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 06:33, 21 December 2017 (UTC)[reply]