<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[64-bit mode and 3-operand&nbsp;instructions]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>One interesting thing about x86 is that it&#8217;s changed two major architectural &#8220;magic values&#8221; in the past 10 years. The first is the addition of 64-bit mode, which not only widens all general-purpose registers and gives a much larger virtual address space, it also increases the number of general-purpose and XMM registers from 8 to 16. The second is AVX, which allows all SSE (and other SIMD) instructions to be encoded using non-destructive 3-operand forms instead of the original 2-operand forms.</p>
<p>Since modern x86 processors are trying really hard to run both 32- and 64-bit code well (and same for SSE vs. AVX), this gives us an opportunity to compare the relative performance of these choices in a reasonably level playing field, when running the same (C++) code. Of course, this is nowhere near a perfect comparison, especially since switching from 32 to 64 bits also changes the sizes of pointers and (at the very least) the code generator used by the compiler, but it&#8217;s still interesting to be able to do the experiment on a single machine with no fuss. So, without further ado, here&#8217;s a quick comparison using the <a href="https://fgiesen.wordpress.com/2013/03/10/optimizing-software-occlusion-culling-the-reckoning/">Software Occlusion Culling demo</a> I&#8217;ve been writing about for the past month &#8211; a fairly SIMD-heavy workload.</p>
<table>
<tr>
<th>Version</th>
<th>Occlusion cull</th>
<th>Render scene</th>
</tr>
<tr>
<td>x86 (baseline)</td>
<td>2.88ms</td>
<td>1.39ms</td>
</tr>
<tr>
<td>x86, <code>/arch:SSE2</code></td>
<td>2.88ms (+0.2%)</td>
<td>1.48ms (+5.8%)</td>
</tr>
<tr>
<td>x86, <code>/arch:AVX</code></td>
<td>2.77ms (-3.8%)</td>
<td>1.43ms (+2.7%)</td>
</tr>
<tr>
<td>x64</td>
<td>2.71ms (-5.7%)</td>
<td>1.29ms (-7.2%)</td>
</tr>
<tr>
<td>x64, <code>/arch:AVX</code></td>
<td>2.63ms (-8.7%)</td>
<td>1.28ms (-8.5%)</td>
</tr>
</table>
<p>Note that <code>/arch:AVX</code> makes VC++ use the AVX forms of SSE vector instructions (i.e. 3-operand), but it&#8217;s all still 4-wide SIMD, not the new 8-wide SIMD floating point; getting that would require changes to the code. And of course the code uses SSE2 (and, in fact, even SSE4.1) instructions whether we turn on <code>/arch:SSE2</code> on x86 or not &#8211; that switch only affects how &#8220;regular&#8221; floating-point code is generated. Also, the speedup percentages are computed from the full-precision timings, not the truncated values I put in the table (which doesn&#8217;t mean much, since I truncated the values to about their level of accuracy).</p>
<p>So what does this tell us? Hard to be sure. It&#8217;s very few data points, and I haven&#8217;t done any work to eliminate the effects of e.g. memory layout / code placement, which can be quite significant. And of course the compiler&#8217;s code generator changes between these targets too. That said, a few observations:</p>
<ul>
<li>Not much of a win from turning on <code>/arch:SSE2</code> on the regular x86 code. If anything, the rendering part of the code gets worse from the &#8220;enhanced instruction set&#8221; usage. I did not investigate further.</li>
<li>The 3-operand AVX instructions provide a solid win of a few percentage points in both 32-bit and 64-bit mode. Considering I&#8217;m not using any 8-wide instructions, this is almost exclusively the impact of having fewer register-register move instructions.</li>
<li>Yes, going to 64 bits does make a noticeable difference. Note in particular the dip in rendering time. Whether it&#8217;s due to the overhead of 32-bit thunks on a 64-bit system, better code generation on the app side, better code on the D3D runtime/driver side, or most likely a combination of all these factors, the D3D rendering code sure gets a lot faster. And similarly, the SIMD-heavy occlusion cull code sees a good speed-up too. I have not investigated whether this is primarily due to the extra registers, or due to code generation improvements.</li>
</ul>
<p>I don&#8217;t think there&#8217;s any particular lesson here, but it&#8217;s definitely interesting.</p>
]]></html></oembed>