DOS ain't dead

Assembler optimisation - how to avoid a jump? (Developers)

posted by Rugxulo Homepage, Usono, 16.05.2012, 10:34

> I measured my original code, CM's code and another variant found on the
> internet:

Could depend on many factors.

> VARIANT 1:
> sub edx,ecx
> jnc @RLE_CLIP_MOVE_CONT
> add ecx,edx
> @RLE_CLIP_MOVE_CONT:


6 bytes

> VARIANT 2:
> sub edx, ecx
> rcr eax, 1
> sar eax, 31
> and eax, edx
> add ecx, eax


11 bytes

> VARIANT 3:
> sub edx,ecx
> sbb eax,eax
> and eax,edx
> add ecx,eax


8 bytes
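As a sanity check that the three variants really are interchangeable, here is a rough Python model of them. It is only a sketch: the helper names (`sub32`, `variant1`, etc.) are made up for illustration, registers are modeled as unsigned 32-bit integers, and the carry flag is modeled as a plain 0/1 value. By my reading, all three leave ECX = min(ECX, EDX) (unsigned) and EDX = EDX - ECX (wrapped).

```python
MASK = 0xFFFFFFFF  # registers are modeled as unsigned 32-bit values


def sub32(a, b):
    """Model SUB: return (a - b) mod 2**32 and the carry (borrow) flag."""
    return (a - b) & MASK, int(a < b)


def variant1(ecx, edx):
    edx, cf = sub32(edx, ecx)           # sub edx,ecx
    if cf:                              # jnc skips the add when CF = 0
        ecx = (ecx + edx) & MASK        # add ecx,edx
    return ecx, edx


def variant2(ecx, edx, eax=0):
    edx, cf = sub32(edx, ecx)           # sub edx,ecx
    eax = (cf << 31) | (eax >> 1)       # rcr eax,1: CF rotates into bit 31
    eax = MASK if eax >> 31 else 0      # sar eax,31: replicate the sign bit
    eax &= edx                          # and eax,edx
    return (ecx + eax) & MASK, edx      # add ecx,eax


def variant3(ecx, edx):
    edx, cf = sub32(edx, ecx)           # sub edx,ecx
    eax = MASK if cf else 0             # sbb eax,eax: eax = -CF
    eax &= edx                          # and eax,edx
    return (ecx + eax) & MASK, edx      # add ecx,eax
```

Note that VARIANT 2 gives the same result whatever EAX held beforehand, because the SAR overwrites every bit with the rotated-in carry. Both branchless versions clobber EAX, which VARIANT 1 does not.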

> Differences are small. Variants 2 and 3 are slightly faster than V1.

Jumps usually aren't that expensive, and branch prediction makes them reasonable. Of course, Darek Mihocka says avoid them where possible (e.g. his BOCHS optimizations), but it's such a common thing for x86 that I've never even bothered worrying about it.

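I think a taken conditional jump costs 3 cycles on a 486 (1 if not taken), since it has no branch prediction at all. The "backward taken, forward not taken" rule is the static fallback on PPro and later when a branch isn't in the BTB yet; the original Pentium just assumes not taken for branches it hasn't seen. On a P4, you could also use jump hints, but I doubt it would help (much, if at all; it might even hurt, who knows).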

It gets more complicated because of EFLAGS and register dependencies, which may or may not cause AGIs (esp. on 486). And some of the more CISC-y instructions (RCR, I assume) will probably not be pairable on a Pentium. PPro/686 has the whole 4-1-1 microcode bullcrap, and don't forget that pipelines fill up faster on older machines, hence sometimes smaller code is better.

> The difference between V2 and V3 is only borderline significant; maybe V3 is
> slightly faster, but the measurement would have to be done in pure DOS, not in
> the Win98 I am running just now.
> I tested only on a Pentium 4 machine; for lack of time I haven't tested on
> my Pentium III.

Pentium 4 has no barrel shifter, so VARIANT 2 will always be slower there (I think?).

There's honestly nothing horrible about any of these versions; they all work more or less the same, and the difference is very minor. You also have to worry about on-chip cache size, instruction timings, latency / throughput, and code and data alignment, and avoid nearby self-modifying code. You're probably more limited by OS calls or HD or RAM access speeds.

It's fun to "pretend" to even barely (0.0001%) understand this stuff, but it's so incredibly arcane and (almost) useless, impossible, etc. (in my unprofessional opinion). There is no easy answer, and I'd doubt anybody really does it well across various x86 subarchitectures. (Some newer ones broke old optimizations, so that really sucks.) I wouldn't worry about it (or choose the safer path of cm and myself, optimize for size!).
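For the size angle, the byte counts quoted above can be tallied from the instruction encodings. This is only a sketch: it assumes 32-bit code (in 16-bit code each of these instructions would need an extra 66h operand-size prefix), it picks one of the two equivalent opcode forms for the register-to-register operations, and the table and function names are made up for illustration.

```python
# Assumed encodings in a 32-bit code segment (hex bytes, space separated).
ENCODINGS = {
    "sub edx,ecx": "29 CA",
    "jnc short +2": "73 02",     # skips the 2-byte add that follows
    "add ecx,edx": "01 D1",
    "rcr eax,1": "D1 D8",
    "sar eax,31": "C1 F8 1F",    # the only 3-byte instruction here (imm8)
    "sbb eax,eax": "19 C0",
    "and eax,edx": "21 D0",
    "add ecx,eax": "01 C1",
}

VARIANTS = {
    1: ["sub edx,ecx", "jnc short +2", "add ecx,edx"],
    2: ["sub edx,ecx", "rcr eax,1", "sar eax,31", "and eax,edx", "add ecx,eax"],
    3: ["sub edx,ecx", "sbb eax,eax", "and eax,edx", "add ecx,eax"],
}


def code_size(variant):
    """Sum the encoded length in bytes of every instruction in a variant."""
    return sum(len(ENCODINGS[insn].split()) for insn in VARIANTS[variant])


for number in VARIANTS:
    print(f"VARIANT {number}: {code_size(number)} bytes")
```

Under those assumptions the totals come out to 6, 11 and 8 bytes, matching the figures given for each variant.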

 
