Talk:X87

article expansion

I'd argue that this article isn't technical enough, and perhaps that it is still a stub.
This article could use more information, such as a list of common instructions in the instruction set, as the SSE articles have. It would also help to cover the precision of individual instructions, for example that FMUL is 80-bit and operates on registers ... --216.37.197.174 16:57, 2 January 2007 (UTC)

merge

These math coprocessor chips are so very closely related that I think it is best to cover them all in one article: Intel 8087, Intel 80287, Intel 80387, Intel 80487.

I suggest merging all these articles into the x87 article for now. Later, if the article WP:SIZE grows too great, I think I would prefer one article on the physical coprocessor chips (that doesn't even mention SSE), and another article on the instruction set and programming model used by those chips and also by later CPUs with integrated coprocessors (and also mentions SSE). --68.0.124.33 (talk) 15:06, 17 March 2008 (UTC)

Probably prefer not to merge Intel 8087

The 8087 was historic in a number of ways (the first IEEE 754 standard quasi-implementation, arguably the first "serious" numerics widely-available in hardware form for microcomputer users, etc.), and is worthy of detailed specific discussion in ways that the other three probably aren't. Why not first merge the three short articles in (287, 387, 487), and then decide whether it makes real sense for the 8087 article to also be merged? AnonMoos (talk) 06:07, 18 March 2008 (UTC)

I'd back the last suggestion by AnonMoos... may get around to this later, unless there are further objections. Nate (talk) 23:38, 19 September 2008 (UTC)

I'd feel more inclined to keep the articles about the architecture separate from articles about specific implementations of said architecture. For that reason I'd prefer to keep these articles separate. 82.108.106.146 (talk) 08:21, 23 June 2009 (UTC)

Well, there's still the problem that there probably isn't enough to say about 287, 387, and 487 to make them individually into good stand-alone articles... AnonMoos (talk) 16:59, 23 June 2009 (UTC)

It's worth noting that the IEEE 754 standard grew out of the 8087, rather than the other way round, and therefore the 8087 deserves its own page from a historical perspective. --PeterJeremy (talk) 10:35, 19 February 2010 (UTC)

Issues

There are some unclear statements in the article:

What's a "287 pinout"? If the number of pins changes, I assume it would read "287 pins". If the pins' purpose changes, it would be simply "pinout". What is this supposed to mean?

It means the electrical signals are compatible with those on a 287 chip, so that it can be inserted in a socket intended for a 287 (despite being more of a 387 internally). 83.255.33.95 (talk) 11:51, 23 December 2009 (UTC)

Silly me, I completely misunderstood that. --Berntie (talk) 21:35, 25 December 2009 (UTC)

What's a "zero clock penalty"?

It means that this particular instruction does not inflict any delay on subsequent FPU instructions, simply because it is executed in parallel, by another execution unit. The execution of integer instructions (if any) may still be affected, of course. 83.255.33.95 (talk) 11:51, 23 December 2009 (UTC)

What does "(by using one of the integer paths for exch st(x) in parallel with the FPU instruction)" mean?

The same as above. 83.255.33.95 (talk) 11:51, 23 December 2009 (UTC)

Ok, that's clear now, too. Thanks for your explanations. --Berntie (talk) 21:35, 25 December 2009 (UTC)

Can someone provide a better explanation of those points? --Berntie (talk) 18:24, 21 December 2009 (UTC)

Intel 80387

I'm currently trying to clean up the corresponding article in the German WP and it contradicts the article here in 2 points:

It says the original i387 was released in 1986 (vs. 1987 here). Neither article gives sources. Who's right? Any sources?
It also says the i387 was compatible with "both the i386DX and the i386SX". The reasoning here seems convincing, so I assume that statement is simply false. Nonetheless, some references would be helpful...

Comments? --Berntie (talk) 15:37, 25 January 2010 (UTC)

80287 clock boards

An accessory made by several companies for a few years was a clock board that fit between the 80287 and its socket. Its intended use was to raise the clock speed by one third, so that a 287 rated at the same speed as the 286 CPU could be used. The price of such a board plus an 80287 matching the CPU's MHz rating was initially lower than that of an 80287XL matching the CPU's MHz rating. —Preceding unsigned comment added by Bizzybody (talk • contribs) 03:42, 22 May 2010 (UTC)

80386 motherboards with dual FPU sockets.

A few designs of early 80386 motherboards had a DIP socket for an 80287 and a PGA socket for an 80387. Of course only one FPU could be installed. The practice was short lived. I've only seen a single such board in 29 years of working on computers. Bizzybody (talk) 11:06, 21 February 2012 (UTC)

User:Paraboloid01 experiments

486 VS Pentium and later

With Free Pascal you can easily check that what is said about the x87 FPU of, say, a Pentium 4 CPU is true: it needs about 3.5 CPU clock cycles for a multiplication, addition or subtraction, and 20 CPU clock cycles for a division, but 28 cycles for a division and a squaring together. This code, which calculates π, shows it:

 var
 a:longint;
 c:real;
 begin
 for a:=3 to 1000000000  do
 c:=c+6/(sqr(a*1.0-2));
 writeln(c);
 writeln(sqrt(c));
 readln;
 end.

Including one almost-free subtraction per iteration, it prints the results 9.86960433860995E+000 and 3.1415926436458864E+0000 after 11 seconds on a 2.6 GHz CPU.

The division and the squaring together probably take about 28 cycles, because 11 s × 2.6 GHz = 28.6 × 10⁹ cycles spread over 10⁹ iterations is 28.6 cycles per iteration. Without the subtraction, the calculation still takes 10-11 seconds, as described below.

For multiplication I recommend this Free Pascal benchmark:

 var a:longint; c:real;
 begin
 for a:=0 to 1000000000 do
 c:=c+a*(1+
 a*(0.16666666666666667+
 a*(0.0083333333333333333+
 a*(0.00019841269841269841+
 a*(0.0000027557319223985891+
 a*(0.000000025052108385441718775+
 a*(0.000000000160590438368216146+
 a*(0.00000000000076471637318198164759+
 a*(0.000000000000002811457254345520763+
 a*(0.000000000000000008220635246624329717+
 a*(0.00000000000000000001957294106339126123+
 a*(0.000000000000000000000038681701706306840377+
 a*(0.00000000000000000000000006446950284384473396+
 a*(0.000000000000000000000000000091836898637955461484+
 a*0.0000000000000000000000000000001130996288644771693))))))))))))));
 writeln(c);
 Readln;
 End.

It uses 15 multiplications and 15 additions, so 30 operations per iteration. It prints the result after 41 seconds on a 2.6 GHz CPU with DDR2-800 (400 MHz) RAM. That is 30 × 10⁹ operations in 41 s, or about 0.73 × 10⁹ operations/s, which is 0.73/2.6 ≈ 0.28 operations per CPU clock cycle, or 41 × 2.6 / 30 ≈ 3.55 cycles per operation (about 3.5 CPU clock cycles for one multiplication or one addition).
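
The cycle estimate above is just run time multiplied by clock rate and divided by the total number of operations. A minimal C++ sketch of that arithmetic follows. The function name `cycles_per_operation` and its parameters are made up for illustration, and it assumes the loop body dominates the run time and that the CPU runs at its nominal clock throughout.

```cpp
#include <cassert>

// Average clock cycles spent per floating-point operation.
//   seconds            measured wall-clock run time
//   clock_ghz          nominal CPU clock in GHz (assumed constant)
//   iterations         number of loop iterations
//   ops_per_iteration  additions + multiplications in one iteration
double cycles_per_operation(double seconds, double clock_ghz,
                            double iterations, double ops_per_iteration) {
    assert(iterations > 0.0 && ops_per_iteration > 0.0);
    const double total_cycles = seconds * clock_ghz * 1e9;
    return total_cycles / (iterations * ops_per_iteration);
}
```

With the figures quoted here, `cycles_per_operation(41, 2.6, 1e9, 30)` comes to roughly 3.55. The result is an average over the whole loop, so it also includes loop overhead and the integer-to-float conversion of the counter.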

Going by the article's table, which says a 486 CPU needs 16-20 CPU clock cycles for an addition or a multiplication, I came up with two possible explanations:

1) The 486 does multiplication and addition using a lookup table on 16-digit BCD numbers (binary-coded decimal, where one decimal digit takes 4 bits), as the Intel 4004 CPU possibly does. This is plausible at least for addition; multiplication would need 16*16=256 digit multiplications and 256 additions, so 512 operations in total, but maybe the multiplication in the article is only of two single BCD digits (two decimal digits from 0 to 9).

2) The 486 data bus is 32 bits wide and the Pentium data bus is 64 bits wide, and the x87 FPU, as we know, calculates (gives results) in 64-bit precision. The 486 should therefore be at most 2-4 times slower than the Pentium 4, because of transfers in pieces between RAM and the CPU (transfers between the CPU and the x87 FPU may also add lag because of data bus width limits; it would be interesting to know how a Pentium without MMX could put these pieces together).

One natural question is why a Free Pascal program, which does not use the CPU cache, runs so fast on 400 MHz RAM. Either DDR allows two reads, then two writes, then two reads, then two writes per RAM clock cycle (but not a read and a write to the same address in the same RAM clock cycle), or you have probably not heard that there are such things as CPU registers. The CPU has 8 8-bit general-purpose registers, 8 16-bit general-purpose registers, 8 32-bit general-purpose registers, 8 64-bit general-purpose registers, 8 80-bit (double-extended precision) x87 FPU registers, 8 64-bit MMX registers, 16 128-bit SSE registers and 16(?) 256-bit AVX registers. I don't know how many of them can be filled at once, but at least the 8 general-purpose registers and the 8 FPU registers should all be usable (and those registers are probably not simulated in RAM, because even the Intel 4004, which has about 1753 transistors, has 16 4-bit index registers and 4 12-bit address registers). How do you better understand CPU performance: by counting registers, which do not necessarily prove any improvement (they may be different names for the same register with added instruction functionality, or more registers with useless instructions), or by the supposed cache at the heart of the CPU? I will not even mention that nVidia fears the word "core" like the devil fears the cross, because GPUs have only 2-6 times more cores than CPUs. Worst of all, it is very hard to use cores for 3D rendering. I found that GPU cores (8 cores for RV770LE and 10 cores for RV770XT (4870)) are useful only for the blur used for bright bloom areas (and even there, because 256 levels on screen span 256 pixels, a higher resolution makes no difference, since the human eye does not perceive more than 256 levels of smooth transition). — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 16:49, 27 August 2012 (UTC)

Here, from page 99 (section 3.7.2):

"3.7.2 Register Operands

Source and destination operands can be any of the following registers, depending on the instruction being executed:

• 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, or EBP)

• 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, or BP)

• 8-bit general-purpose registers (AH, BH, CH, DH, AL, BL, CL, or DL)

• segment registers (CS, DS, SS, ES, FS, and GS)

• EFLAGS register

• x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)

• MMX registers (MM0 through MM7)

• XMM registers (XMM0 through XMM7) and the MXCSR register

• control registers (CR0, CR2, CR3, and CR4) and system table pointer registers (GDTR, LDTR, IDTR, and task register)

• debug registers (DR0, DR1, DR2, DR3, DR6, and DR7)

• MSR registers

Some instructions (such as the DIV and MUL instructions) use quadword operands contained in a pair of 32-bit registers. Register pairs are represented with a colon separating them. For example, in the register pair EDX:EAX, EDX contains the high order bits and EAX contains the low order bits of a quadword operand.

Several instructions (such as the PUSHFD and POPFD instructions) are provided to load and store the contents of the EFLAGS register or to set or clear individual flags in this register. Other instructions (such as the Jcc instructions) use the state of the status flags in the EFLAGS register as condition codes for branching or other decision making operations.

The processor contains a selection of system registers that are used to control memory management, interrupt and exception handling, task management, processor management, and debugging activities. Some of these system registers are accessible by an application program, the operating system, or the executive through a set of system instructions. When accessing a system register with a system instruction, the register is generally an implied operand of the instruction.

3.7.2.1 Register Operands in 64-Bit Mode

Register operands in 64-bit mode can be any of the following:

• 64-bit general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, or R8-R15)

• 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, or R8D-R15D)

• 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, BP, or R8W-R15W)

• 8-bit general-purpose registers: AL, BL, CL, DL, SIL, DIL, SPL, BPL, and R8L-R15L are available using REX prefixes; AL, BL, CL, DL, AH, BH, CH, DH are available without using REX prefixes.

• Segment registers (CS, DS, SS, ES, FS, and GS)

• RFLAGS register

• x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)

• MMX registers (MM0 through MM7)

• XMM registers (XMM0 through XMM15) and the MXCSR register

• Control registers (CR0, CR2, CR3, CR4, and CR8) and system table pointer registers (GDTR, LDTR, IDTR, and task register)

• Debug registers (DR0, DR1, DR2, DR3, DR6, and DR7)

• MSR registers

• RDX:RAX register pair representing a 128-bit operand".

Maybe somebody forgot how to change the ROM so that the FSB shows anything other than 100 or 200 MHz, hence this upcoming RAM speed "downgrade" and the addition of cache, because of the "illogically small system bus frequency and RAM frequency" (especially since the CPU does not have to multiply in 1 cycle, because the numbers must first be loaded into at least the CPU accumulator or a latch, and so on). -- 17:34, 27 August 2012 Paraboloid01

Multiplication and addition benchmark, when the registers are just not enough

This Free Pascal (Free Pascal IDE Version 1.0.12 [2011/12/25]; Compiler Version 2.6.0; GDB Version GDB 7.2) benchmark:

var a:longint; c:real;
begin
for a:=0 to 1000000000 do
c:=c+a*(1+
a*(0.16666666666666666667+
a*(0.0083333333333333333333+
a*(0.0001984126984126984127+
a*(0.000002755731922398589065+
a*(0.000000025052108385441718775+
a*(0.000000000160590438368216145994+
a*(0.00000000000076471637318198164759+
a*(0.0000000000000028114572543455207632+
a*(0.000000000000000008220635246624329716956+
a*(0.0000000000000000000195729410633912612308+
a*(0.000000000000000000000038681701706306840377+
a*(0.000000000000000000000000064469502843844733962+
a*(0.000000000000000000000000000091836898637955461484+
a*(0.0000000000000000000000000000001130996288644771693156+
a*(0.00000000000000000000000000000000012161250415535179496+
a*(0.00000000000000000000000000000000000015163356207719502806+
a*(0.0000000000000000000000000000000000000000967759295863189099209+
a*(0.000000000000000000000000000000000000000000072654601791530713154+
a*(0.000000000000000000000000000000000000000000000049024697565135433977+
a*(0.000000000000000000000000000000000000000000000000029893108271424045108+
a*(0.000000000000000000000000000000000000000000000000000016552108677421951887+
a*(0.0000000000000000000000000000000000000000000000000000000083596508471828039833+
a*(0.0000000000000000000000000000000000000000000000000000000000038666285139605938868+
a*(0.0000000000000000000000000000000000000000000000000000000000000016439747083165790335+
a*(0.000000000000000000000000000000000000000000000000000000000000000000644695964045717268045+
a*(0.00000000000000000000000000000000000000000000000000000000000000000000023392451525606577215+
a*(0.000000000000000000000000000000000000000000000000000000000000000000000000078762463049180394663+
a*(0.000000000000000000000000000000000000000000000000000000000000000000000000000024674957095607893065+
a*0.0000000000000000000000000000000000000000000000000000000000000000000000000000000072106829618959360213)))))))))))))))))))))))))))));
writeln(c);
Readln;
End.

prints the result "2.32603502296870E+197" after 87 seconds on a 2.6 GHz CPU with DDR2-800 (400 MHz) RAM. — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 19:20, 1 October 2012 (UTC)

There are 30 additions and 30 multiplications, so 60 operations per iteration. This is (87/60)*2.6=3.77 cycles per operation. I also tried this benchmark with DDR2-400 (200 MHz), setting that frequency in the BIOS (on most systems you press the Delete key before the OS boots to enter BIOS setup). Even with DDR2-400 (200 MHz) RAM, the benchmark still finishes after 87 seconds on the 2.6 GHz CPU. So either the RAM works at the same frequency as the CPU (as it did with the 8086, 286 and 386), or there really are L1 and L2 caches on the CPU chip. By the way, if you change the CPU clock to 1.3 GHz (leaving the RAM at DDR2-800), you get the result after about 2*87=174 seconds (maybe a few seconds less). Still, the fact that changing the CPU multiplier (13 on my system for 2.6 GHz, so a 200 MHz bus) has an effect does not necessarily mean that changing a RAM setting also changes something at the physical RAM level rather than just the text shown in the BIOS. — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 20:13, 1 October 2012 (UTC)

power benchmark

Free Pascal (Compiler Version 2.6.0) code:

var a:longint; c:real;
begin
for a:=1 to 1000000000 do
// c:=c+exp(a*ln(0.7));   //0.7^a, executes after 2 minutes only this line
c:=c+exp(-0.3566749439387323789*a);  //0.7^a, executes after 119 s on 2.6 GHz CPU
// c:=c+ln(0.7*a);  //if to turn on this line (and to turn off exp() line) it executes after 53 s on 2.6 GHz CPU
writeln(c);
Readln;
End.

which prints the result "2.33333333333333E+000" after 119 seconds on a 2600 MHz CPU with DDR2-800 (400 MHz) RAM. So one power (exp()) operation needs about 119*2.6=309 CPU clock cycles, and one natural logarithm needs about 53*2.6=138 CPU clock cycles.

This code numerically confirms this formula for x = 0.7:

{\frac {|x|}{1-|x|}}\geq x+x^{2}+x^{3}+x^{4}+...\quad (-1<x<1),

which is part of the proof of the fundamental theorem of algebra.

{\frac {|0.7|}{1-|0.7|}}={\frac {0.7}{0.3}}=2.3(3).
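
For a positive x below 1 the series reaches the bound exactly in the limit, which a few lines of C++ can check. This is only a sketch: the names `geometric_partial_sum` and `geometric_limit` are hypothetical, and the powers are built by repeated multiplication rather than with exp() as in the Pascal benchmark.

```cpp
#include <cassert>

// Partial sum x + x^2 + ... + x^terms, accumulated one power at a time.
double geometric_partial_sum(double x, int terms) {
    assert(terms >= 0);
    double power = 1.0;
    double sum = 0.0;
    for (int k = 1; k <= terms; ++k) {
        power *= x;    // power now holds x^k
        sum += power;
    }
    return sum;
}

// Closed-form limit x / (1 - x) of the series, valid for |x| < 1.
double geometric_limit(double x) {
    assert(x > -1.0 && x < 1.0);
    return x / (1.0 - x);
}
```

For x = 0.7 a couple of hundred terms already agree with 0.7/0.3 to within double-precision rounding, so a billion iterations are not needed to see the value 2.333...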

— Preceding unsigned comment added by Paraboloid01 (talk • contribs) 18:59, 5 September 2012 (UTC)

Article table 8087 results don't exactly match Intel's claims

Here, on page 3 (page 6 in Acrobat Reader), there is a table for the 8 MHz 8087. One clock cycle of an 8 MHz 8087 is 1/8000000 s = 0.000000125 s, that is 0.125 microseconds or 125 nanoseconds. The Intel 8087 datasheet claims that an 8086 with an 8087 does an addition in 10.6 microseconds, a division in 24.4 microseconds, a multiplication in 16.9 microseconds and a square root in 22.5 microseconds. This gives:

10.6/0.125≈85 cycles for addition;

24.4/0.125≈195 cycles for division;

16.9/0.125≈135 cycles for multiplication;

22.5/0.125=180 cycles for square root.
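
The conversion used in this list is simply time multiplied by clock frequency, since a chip clocked at 1 MHz completes one cycle per microsecond. A small C++ sketch, with the hypothetical helper name `microseconds_to_cycles`:

```cpp
#include <cassert>

// Clock cycles taken by an operation that lasts `microseconds`
// on a chip clocked at `clock_mhz` (1 MHz = 1 cycle per microsecond).
double microseconds_to_cycles(double microseconds, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return microseconds * clock_mhz;
}
```

For the 8 MHz part, `microseconds_to_cycles(22.5, 8.0)` is exactly 180, while the addition time of 10.6 microseconds works out to about 84.8 cycles, which the list rounds to 85.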

If you want to know how these operations are emulated on an 8086 alone: addition works in 8-bit chunks using the carry bit, and multiplication uses logical AND. Square root can be emulated with only division, multiplication and addition. Four 8-bit vectors can be packed with shifts or rotates (left or right) followed by adds; you can then multiply packed vectors if the 8-bit pieces are separated by zero padding inside a 64-bit integer. This is how colors are packed into 64 bits, maybe to take less space on the HDD. The basic operation is a shift left or right followed by an add (this is probably how MMX works). On 3 or 4 bytes packed into a 64-bit integer you can only add or subtract, multiply by a scalar or divide by a scalar (mostly an integer), but not take a square root or compute more complex functions like sine or cosine. If the GPU can decode those bytes (4 color components RGBA or a position xyz) appropriately, we get a 3-4 times speedup (decoding colors in GPU hardware could be much faster than shifting about 64 times to unpack).

The datasheet also claims that emulation on the 8086 needs 2000 microseconds for division, 1000 microseconds for addition (or multiplication) and 12250 microseconds for square root.

I would say that if the 8086 can add 8-bit data to 8-bit data (and it can), then addition should be about 10-20 times slower than on the 8087, not 1000/10=100 times slower. Multiplication needs 64*64=4096 operations, so multiplying big numbers on the 8086 alone should take even more than 1000 microseconds. The exception would be if each 8-bit by 8-bit AND step takes 8 AND operations, with the source 8 bits shifted left by 1 bit each time; an 8-bit by 8-bit multiplication would then need only about 8-16 operations, and a 64-bit by 64-bit multiplication 8^2=64 or 16^2=256 operations. Basically, multiplying 64-bit numbers needs at least 4096 transistors (if you want to do it in 3-4 CPU clock cycles).

Here are the steps for the square root of x:

F = 1;

G = (F+x/F)/2;

H = (G + x/G)/2;

J = (H + x/H)/2;

K = (J + x/J)/2;

L = (K + x/K)/2.

When x=2, the square root of 2 is approximated by L:

F = 1;

G = (1+2/1)/2 = 3/2 = 1.5;

H = (3/2 + 4/3)/2 = 17/12 = 1.4166666667;

J = (17/12 + 24/17)/2 = 577/408 = 1.414215686;

K = (577/408 + 816/577)/2 = 665857/470832 = 1.414213562;

L = (665857/470832 + 2/(665857/470832))/2 = 1.4142135623730950488016896235025.

With Windows calculator

L-{\sqrt {2}}=8.9929283216501666497195153305433\cdot 10^{-25}.

This is amazing, because double-extended precision (80 bits) holds only about 19 decimal digits. So the result L is good to about 24 decimal digits (Windows calculator gives results with 30 decimal digits of precision).
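
The fractions in these steps can be reproduced exactly with integer arithmetic. The sketch below is an illustration only: `Fraction`, `gcd64` and `newton_sqrt_fraction` are hypothetical names, the start value is fixed at 1/1 as in the steps above, and there is no overflow protection, so 64-bit integers last only for a handful of steps.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

using Fraction = std::pair<std::int64_t, std::int64_t>;  // numerator, denominator

// Greatest common divisor by Euclid's algorithm (inputs are positive here).
std::int64_t gcd64(std::int64_t a, std::int64_t b) {
    while (b != 0) {
        const std::int64_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

// Apply `steps` Newton steps for sqrt(x), starting from 1/1.
// Each step maps p/q to (p/q + x*q/p)/2 = (p*p + x*q*q) / (2*p*q).
Fraction newton_sqrt_fraction(std::int64_t x, int steps) {
    assert(x > 0 && steps >= 0);
    std::int64_t p = 1, q = 1;
    for (int i = 0; i < steps; ++i) {
        const std::int64_t num = p * p + x * q * q;
        const std::int64_t den = 2 * p * q;
        const std::int64_t g = gcd64(num, den);  // keep the fraction in lowest terms
        p = num / g;
        q = den / g;
    }
    return {p, q};
}
```

For x = 2, four steps give 665857/470832 (the value K) and five steps give 886731088897/627013566048 (the value L). Every iterate after the first lies slightly above the true square root, and the number of correct digits roughly doubles with each step.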

Notice that shifting bits left is binary multiplication by 2 and shifting bits right is binary division by 2. — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 16:33, 7 September 2012 (UTC)

By the way, because a CPU chip is about 1 cm wide and 1 cm long, light can cross that 1 cm distance at most 3*10^8 (m/s) * 100 (1/m) = 3*10^10 times per second, which is 30 GHz. The distance from the CPU to RAM is about 10 cm, so the maximum RAM frequency would be 3 GHz. If we assume that the x87 or coprocessor (the 386 is the last CPU with a separate coprocessor, the 387, apart from some modified 486s) works at 30 GHz, then a multiplication needs about 35 cycles and a division about 200 cycles. — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 17:41, 7 September 2012 (UTC)

Fast multiplication without x87 FPU

Here is how you can quickly multiply two 8-bit integers. The trick is to understand that binary multiplication is faster than decimal. Say the first number is in register AX and the second number is in register BX.

1. AX is the accumulator in common cases, so we shift all bits in AX one position to the left, and the leftmost bit is pushed out into the carry flag. If the carry flag is 1, the number in BX is added to CX (CX starts as all zeros, 00000000, so its value is initially 0). If the carry flag is 0, the number in BX is not added to CX.

2. We again shift all bits in AX left by one bit position, and the leftmost bit of AX ends up in the carry flag. If the carry flag is 1, we shift BX right by one bit position and add it to CX. If the carry flag is 0, we do not add BX to CX.

3. We again shift all bits in AX left by one bit position, and the leftmost bit of AX ends up in the carry flag. If the carry flag is 1, we shift BX right by one bit position and add it to CX. If the carry flag is 0, we do not add BX to CX.

...

8. We shift all bits in AX left by one bit position for the eighth time (only one bit, the leftmost, can still be nonzero, unless a rotation was used). The leftmost bit of AX ends up in the carry flag. If the carry flag is 1, we shift BX right by one bit position and add it to CX. If the carry flag is 0, we do not add BX to CX.

Now CX holds an integer that is the product of AX and BX.

So we do not need 8*8=64 multiplications and 8*8=64 additions (half of them are lost to underflow anyway, so count (64+64)/2=64 operations), but only 8 or 16 shifts, so about 8-16 operations. To multiply 222 by 255 in decimal digits you multiply 5 by 2 three times, then the other 5 by 2 three times, and then 2 by 2 three times, which is 3+3+3=9 multiplications and 8 additions. So you see that binary digits can be multiplied by a faster rule than decimal digits.

If, say, AX holds 255 and BX holds 255, the result will not be 255*255=65025 but 650 (or maybe 65, or maybe still 255, because there is no exponent).

Notice that the Intel 8080 and probably the 8086 have a 16-bit accumulator and registers. That would take about 16 shifts of AX and 16 shifts of BX. (Counting on average that half of the data bits are zeros, about 16 shifts are needed; but if BX is shifted only when adding, then after an AX shift with the carry flag at zero, BX must be shifted twice when the next carry flag is 1. So it is better to shift BX every time and simply skip the add to CX, rather than skip both the shift and the add and then shift twice when an add is needed.) — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 21:21, 14 September 2012 (UTC)
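
For comparison, here is a minimal C++ sketch of the usual shift-and-add multiplication. It is not a transcription of the steps above: it scans the multiplier from its least significant bit and shifts the multiplicand left every round, and it assumes a 16-bit accumulator so that the full product of two 8-bit values fits. The function name `shift_add_multiply` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Multiply two 8-bit unsigned integers using only shifts, bit tests and adds.
std::uint16_t shift_add_multiply(std::uint8_t a, std::uint8_t b) {
    std::uint16_t product = 0;        // accumulator, plays the role of CX
    std::uint16_t multiplicand = b;   // widened so left shifts keep every bit
    std::uint8_t multiplier = a;
    for (int bit = 0; bit < 8; ++bit) {
        if (multiplier & 1u) {
            // The current multiplier bit is set: add the weighted multiplicand.
            product = static_cast<std::uint16_t>(product + multiplicand);
        }
        multiplicand = static_cast<std::uint16_t>(multiplicand << 1);  // weight doubles
        multiplier = static_cast<std::uint8_t>(multiplier >> 1);       // next bit
    }
    assert(product == static_cast<std::uint16_t>(a * b));  // sanity check
    return product;
}
```

Under that assumption the loop runs exactly eight times, and multiplying 255 by 255 yields 65025, because no bits are dropped from the 16-bit accumulator.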

x87 FPU may not be present on system

So calculating π with "Free Pascal" is fast. But guess how slow it is with "Visual Studio 2010": about 20 times slower than with "Free Pascal" (the only difference is that "Visual Studio 2010" gives the result in "double-extended precision" (80-bit), while "Free Pascal" gives the result in "double precision" (64-bit)). To check this you need "Visual Studio 2010" (or maybe any other later or earlier version; "Visual Studio 2010 Express" might not be enough). Run the "Visual Studio 2010 [Professional]" shortcut or "C:\Program Files\Microsoft Visual Studio 10.0\Common7\IDE\devenv.exe". In the upper left corner click File >> New >> Project... or press Ctrl+Shift+N. On the left choose Installed Templates >> Visual C++ >> Win32 >> Win32 Console Application. Enter the name "Ex13_04(Win32 Console Application)". The solution name is the same by default, so leave it as it is (the "Create directory for solution" box is checked by default, so leave that as well). Press "OK", then press "Next >". The "Console application" and "Precompiled header" options are checked; the rest are unchecked. Press "Finish". Now, in "Solution Explorer" on the right, open "Ex13_04(Win32 Console Application).cpp" in the "Source Files" folder and change the default entry code:

// ex13_04teach.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
 return 0;
}

to [pi calculation] code:

// Ex13_04(Win32 Console Application).cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
// Computing pi by summing a series
#include <iostream>
#include <iomanip>
#include <cmath>
#include "ppl.h"
int main()
{
Concurrency::combinable <double> piParts;
Concurrency::parallel_for(1, 100000000, [&piParts](long long n)
{ piParts.local() += 6.0/(n*n); }); 
double pi2 = piParts.combine([](double left, double right)
{ return left + right; });
std::cout << "pi squared = " << std::setprecision(20) << pi2 << std::endl;
std::cout << "pi = " << std::sqrt(pi2) << std::endl; 
return 0;
}

Now you can press Ctrl+F5 or the green triangle (if you press the green triangle, the window closes instantly after the result is calculated). After you press Ctrl+F5 you get the result:

pi squared = 9.8696043407861431

pi = 3.1415926439922384

Press any key to continue . . .

after about 22 seconds on a 2.6 GHz CPU (dual-core with DDR2-800 [400MHz] RAM).

If you replace the line "Concurrency::parallel_for(1, 100000000, [&piParts](long long n)" with the line "Concurrency::parallel_for(1, 1000000000, [&piParts](long long n)", you get the result:

pi squared = 9.8696043499084052

pi = 3.1415926454440917

Press any key to continue . . .

after 200-205 seconds (about 3 minutes and 23 seconds) on a 2.6 GHz CPU (dual-core with DDR2-800 [400MHz] RAM). The generated exe file "Ex13_04(Win32 Console Application).exe" in the directory "C:\Users\p\Documents\Visual Studio 2010\Projects\Ex13_04(Win32 Console Application)\Debug" gives the result after 203 seconds, so don't blame the "Visual Studio 2010" multicolored font or anything like that for slowing down the calculation, because "Ex13_04(Win32 Console Application).exe" shows only a black window (which closes instantly after finishing). Here is the value of pi from Windows calculator: 3.1415926535897932384626433832795. Pi squared (from Windows calculator) is 9.8696044010893586188344909998762.

"Free Pascal" (Compiler Version 2.6.0) pi calculating code:

var
a:longint;
c:real;
begin
for a:=1 to 1000000000  do
c:=c+6/sqr(a*1.0);
writeln(c);
writeln(sqrt(c));
readln;
end.

shows results:

9.86960433860995E+000

3.1415926436458864E+0000

after 10-11 seconds on a 2.6 GHz CPU (dual-core with DDR2-800 [400MHz] RAM). — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 13:01, 14 October 2012 (UTC)

I think it's pretty clear that Paraboloid01 is mentally ill and his endless, pointless tangents that have nothing to do with the article prove that. Nothing he's said is relevant or logical. It's just a waste of space in the Talk section. — Preceding unsigned comment added by 130.65.158.119 (talk) 20:37, 27 February 2014 (UTC)

2022

Using 80-bit versus 64-bit calculations on modern processors is not a small difference. While, for one, I'm not entirely sure how you coaxed VS2010 into using the 80-bit format for its calculations, the main point is that 80-bit calculations must be done using x87 instructions but 64-bit calculations can be done using SSE instructions. Certain kinds of calculations can be done much faster using SSE instructions than x87 instructions, and nearly all compilers these days will use SSE instructions for calculations rather than x87 unless they're forced to. Of course, this is separate from...whatever it is exactly you're trying to imply with "x87 FPU may not be present on the system". x87 hasn't been implemented as a "coprocessor" in two decades at this point; but if you're trying to imply the x87 instruction set itself may not be present, that's ludicrous. Every single x86 processor since the 80486 has had the x87 instruction set built-into it. Its absence would be as ludicrous as the absence of 32-bit addressing modes. 69.248.160.237 (talk) 18:01, 4 January 2013 (UTC)
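
Whether a C++ program can reach the 80-bit format at all depends on how the compiler defines long double, and that can be queried directly. The sketch below uses only std::numeric_limits; the helper names `significand_bits` and `decimal_digits_for` are made up for illustration. As far as I know, Microsoft's compiler gives long double the same 64-bit format as double, while GCC and Clang on x86 usually map it to the 80-bit x87 format, but treat both statements as assumptions to verify on your own toolchain.

```cpp
#include <cassert>
#include <limits>

// Number of significand bits (including any implicit leading bit)
// that the compiler gives to floating-point type T.
template <typename T>
constexpr int significand_bits() {
    static_assert(std::numeric_limits<T>::is_specialized, "T must be a numeric type");
    return std::numeric_limits<T>::digits;
}

// Decimal digits that a binary significand of `bits` bits is guaranteed to hold:
// floor(bits * log10(2)).
int decimal_digits_for(int bits) {
    assert(bits > 0);
    return static_cast<int>(bits * 0.30102999566398120);
}
```

A 53-bit significand (double) works out to 15 guaranteed decimal digits and a 64-bit significand (x87 extended) to 19. If `significand_bits<long double>()` returns 53, the program cannot be computing in the 80-bit format through long double.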

Actually, earlier versions of "Free Pascal" print results with 15 decimal digits, and later versions print results with 17 decimal digits (like Visual Studio 2010) at the same speed as with 15 decimal digits. Both are 64-bit (double precision), because extended double precision (80-bit) would have about 20 decimal digits. I checked this code, and it turns out there is no difference in precision between a billion iterations (10-11 seconds) and 100 million iterations, but the precision decreases with 10 million iterations (with the FP code). It is still a little strange that the result calculated with Visual Studio 2010 is slightly more precise than the one from "Free Pascal" (although the number of correct decimal digits of pi is the same in both programs). Maybe the two programs (FP and VS2010) use different algorithms. But Free Pascal calculates about 20 times faster than VS2010 for the same number of iterations. Maybe Visual Studio 2010 was using SSE double-precision instructions instead of the x87 FPU? Or maybe VS2010 produced very unoptimized code? Or maybe it was using multiple cores? The lines " Concurrency::combinable <double> piParts; Concurrency::parallel_for(1, 100000000, [&piParts](long long n) " and the word "Concurrency" have something to do with dual-core (multicore) operation, and I no longer remember whether the code I took for VS2010 was meant for dual-core CPU calculations or for just one core... I now have no doubt that the x87 FPU is present on the CPU. But I have a small doubt that SSE or AVX is 4 times faster than the x87 FPU, given that SSE has 16 128-bit XMM registers (XMM0 through XMM15) and AVX has 16(?) 256-bit registers... Maybe nothing is faster with SSE or AVX than with the x87 FPU at all. But maybe not. — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 19:09, 25 December 2022 (UTC)

I think I know why Free Pascal gives the same 17-decimal-digit results ("9.8696043386099497E+000" and "3.1415926436458864E+000") after a billion iterations and after 100 million iterations. It is because, for example, 6/(10^9)^2=6/(10^18), and to store the number 6·10⁻¹⁸ (that is, to add it to the 9.8696... number) you need hardware that can store 19 decimal digits. So, of course, after 10 billion iterations the precision of the pi calculation will be the same as after 100 million iterations. From the double precision Wikipedia article: "The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2⁻⁵³ ≈ 1.11 × 10⁻¹⁶)." and "With the 52 bits of the fraction (F) significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log₁₀(2) ≈ 15.955)." So double precision has about 15 or 16 decimal digits. And 6/(10^8)^2=6/(10^16) requires 17 decimal digits (if there is no exponent). Storing the number 2⁻⁵³ ≈ 1.11·10⁻¹⁶ also needs hardware with 17 decimal digits. So that's why there are 17 decimal digits. — Preceding unsigned comment added by Paraboloid01 (talk • contribs) 11:50, 26 December 2022 (UTC)
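
The effect described here, where late terms of the series no longer change the running sum, can be tested directly. The C++ sketch below is an illustration under the assumption that the sum is held in an IEEE 754 double; `term_is_absorbed` is a hypothetical helper name, and the `volatile` store is there to force rounding to 64 bits even on compilers that keep intermediates in x87 registers.

```cpp
#include <cassert>

// True when adding the series term 6/n^2 to `sum` leaves the double unchanged,
// i.e. the term is too small to move `sum` to a neighbouring representable value.
bool term_is_absorbed(double sum, double n) {
    assert(n > 0.0);
    volatile double updated = sum + 6.0 / (n * n);  // rounded to a 64-bit double
    return updated == sum;
}
```

For a sum near 9.87 adjacent doubles are 2⁻⁴⁹ ≈ 1.8·10⁻¹⁵ apart, so terms smaller than about half of that, which means n beyond roughly 8·10⁷, are absorbed. That is consistent with a billion iterations printing the same digits as a few hundred million.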

Marketing

These are the next generation computers we have been speculating about. — Preceding unsigned comment added by 23.117.16.45 (talk) 01:27, 12 February 2016 (UTC)

Use of x87 instructions in 64-bit mode.

Are x87 instructions available in 64-bit mode? Is the x87 stack separate from all the other registers (like the XMM registers)? Or is it no longer accessible in 64-bit mode, since one should be using SSE2 and later instead? — Preceding unsigned comment added by 81.6.34.246 (talk) 23:18, 25 September 2018 (UTC)

According to chapter 2 "x87 Floating-Point Instruction Reference" of AMD64 Architecture Programmer’s Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions:

The AMD64 architecture requires support of the x87 floating-point instruction subset including the floating-point conditional moves and the FCOMI(P) and FUCOMI(P) instructions. On compliant processor implementations both the FPU and the CMOV feature flags are set. These are indicated by EDX[FPU] (bit 0) and EDX[CMOV] (bit 15) respectively returned by CPUID Fn0000_0001 or CPUID Fn8000_0001.

The x87 instructions can be used in legacy mode or long mode. Their use in long mode is available if the following feature bit is set:

• Long Mode, as indicated by CPUID Fn8000_0001_EDX[LM] = 1.

(which just means "you can use them in long mode if the CPU supports long mode") and, according to section 8.1.1 "x87 FPU in 64-Bit Mode and Compatibility Mode" of Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture:

In compatibility mode and 64-bit mode, x87 FPU instructions function like they do in protected mode. Memory operands are specified using the ModR/M, SIB encoding that is described in Section 3.7.5, “Specifying an Offset.”

So, yes, x87 instructions are available in 64-bit mode. The stack is separate from everything except for the MMX registers, just as is the case in 32-bit x86. Guy Harris (talk) 01:03, 26 September 2018 (UTC)