We're trying to measure how fast some code runs on a Pentium 3, using a magic instruction that reads the cycle counter. Dividing the number of cycles by the number of loop iterations gives 1962 cycles per iteration. That number seemed unreasonable to us, so we went and looked at the assembly output of gcc.
The loop has 3772 instructions in it. According to the P3 manual, floating-point instructions pipeline but never run in parallel with each other (except for the FXCH instruction). There are 1007 FXCH instructions, and another 66 instructions that aren't floating-point. Even assuming the scheduling is perfect (it isn't, judging by the first couple of dozen instructions worked through on paper), that leaves 3772 - 1007 - 66 = 2699 floating-point instructions that each need their own cycle, so the loop should take, at an absolute minimum, 2699 cycles per iteration. That is much more than the 1962 we measured. Something is wrong here...
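To make the arithmetic easy to re-check, here is a minimal sketch of the lower-bound estimate in Python. The function name and the model are illustrative assumptions, not anything from the P3 manual: it assumes one non-FXCH floating-point instruction issues per cycle, and that FXCH and the non-floating-point instructions overlap completely and cost nothing.

```python
def min_cycles_lower_bound(total_instructions, fxch_count, non_fp_count):
    """Lower bound on cycles per iteration under the simple model above.

    Assumption: each floating-point instruction other than FXCH needs its
    own issue cycle; FXCH and non-FP instructions are treated as free.
    """
    return total_instructions - fxch_count - non_fp_count


# Counts taken from the gcc assembly output described in the post.
TOTAL_INSTRUCTIONS = 3772
FXCH_COUNT = 1007
NON_FP_COUNT = 66

# Measured value: cycle counter delta divided by loop iterations.
MEASURED_CYCLES_PER_ITERATION = 1962

bound = min_cycles_lower_bound(TOTAL_INSTRUCTIONS, FXCH_COUNT, NON_FP_COUNT)
shortfall = bound - MEASURED_CYCLES_PER_ITERATION

print(f"lower bound: {bound} cycles per iteration")
print(f"measured:    {MEASURED_CYCLES_PER_ITERATION} cycles per iteration")
print(f"measurement is {shortfall} cycles below the bound")
```

If the measurement comes out below the bound, as it does here, then either one of the counts is off, the measurement is off, or the one-FP-instruction-per-cycle model doesn't describe what the chip is doing.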