<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Loop control overhead on&nbsp;x86]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>This is gonna be a short one. Let&#8217;s start with this simple loop to compute the dot product of two vectors vec_a, vec_b of signed 16-bit values:</p>
<pre>
	; (rsi=vec_a, rdi=vec_b, rcx=N_elements/16)
	pxor		xmm0, xmm0
	pxor		xmm1, xmm1

dot_loop:
	movdqa		xmm2, [rsi]
	movdqa		xmm3, [rsi+16]
	pmaddwd		xmm2, [rdi]
	pmaddwd		xmm3, [rdi+16]
	paddd		xmm0, xmm2
	paddd		xmm1, xmm3
	add		rsi, 32
	add		rdi, 32
	dec		rcx
	jnz		dot_loop
</pre>
<p>There&#8217;s nothing really wrong with this code, but it executes more ops per loop iteration than strictly necessary. Whether this actually costs you cycles (and how many) heavily depends on the processor you&#8217;re running on, so I&#8217;ll dodge that subject for a bit and pretend there&#8217;s no out-of-order execution to help us and that every instruction costs at least one cycle.</p>
<p>Anyway, the key insight here is that we&#8217;re basically using three registers (<code>rsi</code>, <code>rdi</code>, <code>rcx</code>) as counters, and we end up updating all of them every iteration. The trick is to update fewer counters and shift some of the work to the magic x86 addressing modes. One way to do this is to realize that while <code>rsi</code> and <code>rdi</code> change every iteration, <code>rdi - rsi</code> is loop-invariant. So we can compute it once and do this:</p>
<pre>
	sub		rdi, rsi
	pxor		xmm0, xmm0
	pxor		xmm1, xmm1

some_loop:
	movdqa		xmm2, [rsi]
	movdqa		xmm3, [rsi+16]
	pmaddwd		xmm2, [rsi+rdi]
	pmaddwd		xmm3, [rsi+rdi+16]
	paddd		xmm0, xmm2
	paddd		xmm1, xmm3
	add		rsi, 32
	dec		rcx
	jnz		some_loop
</pre>
<p>One extra <code>sub</code> before the loop, but one fewer <code>add</code> inside it (of course that work now happens in the addressing modes, but luckily for us, those are basically free).</p>
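<p>The same transformation can be sketched in C (again, the helper name is mine): only one pointer advances, and the other array is reached through the loop-invariant difference, just like the <code>[rsi+rdi]</code> addressing mode.</p>

```c
#include <stdint.h>
#include <stddef.h>

/* Pointer-difference trick in C: `d` is computed once, then b's
   element is found at a fixed offset from `a`. Note that strictly
   speaking, b - a is only defined by the C standard when both point
   into the same array; treat this as an illustration of the asm. */
static int32_t dot16_diff(const int16_t *a, const int16_t *b, size_t n)
{
    ptrdiff_t d = b - a;   /* loop-invariant, computed once */
    int32_t sum = 0;
    for (; n; n--, a++)
        sum += (int32_t)*a * a[d];   /* a[d] == *(a + d), i.e. b's element */
    return sum;
}
```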
<p>There&#8217;s a variant of this that computes <code>vec_a_end = rsi + rcx*32</code> at the start and replaces the <code>dec rcx</code> with a <code>cmp rsi, vec_a_end</code>. Two more instructions at the loop head, no instructions saved inside the loop, but we get a <code>cmp</code>/<code>jne</code> pair which Core2 onwards can fuse into one micro-op, so that&#8217;s a plus. VC++ will often generate this style of code for simple loops.</p>
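<p>In C, that compiler-favored style is just the usual end-pointer loop (helper name mine):</p>

```c
#include <stdint.h>
#include <stddef.h>

/* End-pointer style: compute the end of vec_a once at the loop head
   and compare against it instead of decrementing a separate counter -
   the C analogue of the cmp/jne pair. */
static int32_t dot16_end(const int16_t *a, const int16_t *b, size_t n)
{
    const int16_t *a_end = a + n;   /* computed once up front */
    int32_t sum = 0;
    while (a != a_end)
        sum += (int32_t)*a++ * *b++;
    return sum;
}
```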
<p>We can still go one lower, though &#8211; there are still two adds (or one add plus one compare), after all! The trick is to keep the loop count in bytes instead of elements and have it count up from <code>-sizeof(vec_a)</code> to zero, so a single <code>jnz</code> at the end still does our conditional jump. The code looks like this:</p>
<pre>
	shl		rcx, 5			; sizeof(vec_a)
	pxor		xmm0, xmm0
	pxor		xmm1, xmm1
	add		rsi, rcx		; end of vec_a
	add		rdi, rcx		; end of vec_b
	neg		rcx

some_loop:
	movdqa		xmm2, [rsi+rcx]
	movdqa		xmm3, [rsi+rcx+16]
	pmaddwd		xmm2, [rdi+rcx]
	pmaddwd		xmm3, [rdi+rcx+16]
	paddd		xmm0, xmm2
	paddd		xmm1, xmm3
	add		rcx, 32
	jnz		some_loop
</pre>
<p>The trend should be clear at this point &#8211; a few more instrs in the loop head, but less work inside the loop. How much less depends on the processor. But the newest batch of Core i7s (Sandy Bridge) can fuse the <code>add</code>/<code>jnz</code> pair into one micro-op, so it&#8217;s very cheap indeed.</p>
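<p>For completeness, the negative-index version in C (helper name mine): point both bases at the <em>end</em> of the arrays and run a signed index from <code>-n</code> up to zero, so the loop condition is a plain test against zero, like the <code>add</code>/<code>jnz</code> pair.</p>

```c
#include <stdint.h>
#include <stddef.h>

/* Negative-index counting up to zero: both base pointers are bumped
   past the end once, then a single signed index does the work of all
   three counters from the original loop. */
static int32_t dot16_neg(const int16_t *a, const int16_t *b, size_t n)
{
    a += n;                    /* end of vec_a */
    b += n;                    /* end of vec_b */
    int32_t sum = 0;
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}
```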
]]></html></oembed>