<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Speculatively speaking]]></title><type><![CDATA[link]]></type><html><![CDATA[<p><em>This post is part of a series &#8211; go <a href="https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>Welcome back! Today, it&#8217;s time to take a closer look at the triangle binning code, which we&#8217;ve only seen mentioned briefly so far, and we&#8217;re going to see a few more pitfalls that all relate to <a href="http://en.wikipedia.org/wiki/Speculative_execution">speculative execution</a>.</p>
<h3>Loads blocked by what?</h3>
<p>There&#8217;s one more micro-architectural issue this program runs into that I haven&#8217;t talked about before. Here&#8217;s the obligatory profiler screenshot:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png"><img loading="lazy" data-attachment-id="1856" data-permalink="https://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/hotspots_stlf/" data-orig-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png" data-orig-size="497,279" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Store-to-load forwarding issues" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png?w=497" src="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png?w=497&#038;h=279" alt="Store-to-load forwarding issues" width="497" height="279" class="aligncenter size-full wp-image-1856" srcset="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png 497w, https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png?w=150&amp;h=84 150w, https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png?w=300&amp;h=168 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>The full column name reads &#8220;Loads Blocked by Store Forwarding&#8221;. So, what&#8217;s going on there? For this one, I&#8217;m gonna have to explain a bit first.</p>
<p>So let&#8217;s talk about stores in an out-of-order processor. In this series, we already saw how conditional branches and memory sharing between cores get handled on modern x86 cores: namely, with <em>speculative execution</em>. For branches, the core tries to predict which direction they will go, and automatically starts fetching and executing the corresponding instructions. Similarly, memory accesses are assumed to not conflict with what other cores are doing at the same time, and just march on ahead. But if it later turns out that the branch actually went in the other direction, that there was a memory conflict, or that some exception / hardware interrupt occurred, all the instructions that were executed in the meantime are invalid and their results must be discarded &#8211; the speculation didn&#8217;t pan out. The implicit assumption is that our speculation (branches behave as predicted, memory accesses generally don&#8217;t conflict and CPU exceptions/interrupts are rare) is right most of the time, so it generally pays off to forge ahead, and the savings are worth the occasional extra work of undoing a bunch of instructions when we turned out to be wrong.</p>
<p>But wait, how does the CPU &#8220;undo&#8221; instructions? Well, conceptually it takes a &#8220;snapshot&#8221; of the current machine state every time it&#8217;s about to start an operation that it might later have to undo. If that instructions makes it all the way through the pipeline without incident, it just gets retired normally, the snapshot gets thrown away and we know that our speculation was successful. But if there is a problem somewhere, the machine can just throw away all the work it did in the meantime, rewind back to the snapshot and retry.</p>
<p>Of course, CPUs don&#8217;t actually take full snapshots. Instead, they make use of the out-of-order machinery to do things much more efficiently: out-of-order CPUs have more registers internally than are exposed in the ISA (Instruction Set Architecture), and use a technique called &#8220;register renaming&#8221; to map the small set of architectural registers onto the larger set of physical registers. The &#8220;snapshotting&#8221; then doesn&#8217;t actually need to save register contents; it just needs to keep track of what the current register mapping at the snapshot point was, and make sure that the associated physical registers from the &#8220;before&#8221; snapshot don&#8217;t get reused until the instruction is safely retired.</p>
<p>This takes care of register modifications. We already know what happens with loads from memory &#8211; we just run them, and if it later turns out that the memory contents changed between the load instruction&#8217;s execution and its retirement, we need to re-run that block of code. Stores are the tricky part: we can&#8217;t easily do &#8220;memory renaming&#8221; since memory (unlike registers) is a shared resource, and also unlike registers rarely gets written in whole &#8220;accounting units&#8221; (cache lines) at a time.</p>
<p>The solution are <em>store buffers</em>: when a store instruction is executed, we do all the necessary groundwork &#8211; address translation, access right checking and so forth &#8211; but don&#8217;t actually write to memory just yet; rather, the target address and the associated data bits are written into a store buffer, where they just sit around for a while; the store buffers form a log of all pending writes to memory. Only after the core is sure that the store instruction will actually be executed (branch results etc. are known and no exceptions were triggered) will these values <em>actually</em> be written back to the cache.</p>
<p>Buffering stores this way has numerous advantages (beyond just making speculation easier), and is a technique not just used in out-of-order architectures; there&#8217;s just one problem though: what happens if I run code like this?</p>
<pre>
  mov  [x], eax
  mov  ebx, [x]
</pre>
<p>Assuming no other threads writing to the same memory at the same time, you would certainly hope that at the end of this instruction sequence, <code>eax</code> and <code>ebx</code> contain the same value. But remember that the first instruction (the store) just writes to a store buffer, whereas the second instruction (the load) normally just references the cache. At the very least, we have to detect that this is happening &#8211; i.e., that we are trying to load from an address that currently has a write logged in a store buffer &#8211; but there&#8217;s numerous things we could do with that information.</p>
<p>One option is to simply stall the core and wait until the store is done before the load can start. This is fairly cheap to implement in hardware, but it does slow down the software running on it. This option was chosen by the in-order cores used in the current generation of game consoles, and the result is the dreaded &#8220;Load Hit Store&#8221; stall. It&#8217;s a way to solve the problem, but let&#8217;s just say it won&#8217;t win you many friends.</p>
<p>So x86 cores normally use a technique called &#8220;store to load forwarding&#8221; or just &#8220;store forwarding&#8221;, where loads can actually read data directly from the store buffers, at least under certain conditions. This is much more expensive in hardware &#8211; it adds a <em>lot</em> of wires between the load unit and the store buffers &#8211; but it is far less finicky to use on the software side.</p>
<p>So what are the conditions? The details depend on the core in question. Generally, if you store a value to a naturally aligned location in memory, and do a load with the same size as the store, you can expect store forwarding to work. If you do trickier stuff &#8211; span multiple cache lines, or use mismatched sizes between the loads and stores, for example &#8211; it really does depend. Some of the more recent Intel cores can also forward larger stores into smaller loads (e.g. a DWord read from a location written with <code>MOVDQA</code>) under certain circumstances, for example. The dual case (large load overlapping with smaller stores) is substantially harder though, because it can involved multiple store buffers at the same time, and I currently know of no processor that implements this. And whenever you hit a case where the processor can&#8217;t perform store forwarding, you get the &#8220;Loads Blocked by Store Forwarding&#8221; stall above (effectively, x86&#8217;s version of a Load-Hit-Store).</p>
<h3>Revenge of the cycle-eaters</h3>
<p>Which brings us back to the example at hand: what&#8217;s going on in those functions, <code>BinTransformedTrianglesMT</code> in particular? Some investigation of the compiled code shows that the first sign of blocked loads is near these reads:</p>
<pre>
Gather(xformedPos, index, numLanes);
		
vFxPt4 xFormedFxPtPos[3];
for(int i = 0; i &lt; 3; i++)
{
    xFormedFxPtPos[i].X = ftoi_round(xformedPos[i].X);
    xFormedFxPtPos[i].Y = ftoi_round(xformedPos[i].Y);
    xFormedFxPtPos[i].Z = ftoi_round(xformedPos[i].Z);
    xFormedFxPtPos[i].W = ftoi_round(xformedPos[i].W);
}
</pre>
<p>and looking at the code for <code>Gather</code> shows us exactly what&#8217;s going on:</p>
<pre>
void TransformedMeshSSE::Gather(vFloat4 pOut[3], UINT triId,
    UINT numLanes)
{
    for(UINT l = 0; l &lt; numLanes; l++)
    {
        for(UINT i = 0; i &lt; 3; i++)
        {
            UINT index = mpIndices[(triId * 3) + (l * 3) + i];
            pOut[i].X.lane[l] = mpXformedPos[index].m128_f32[0];
            pOut[i].Y.lane[l] = mpXformedPos[index].m128_f32[1];
            pOut[i].Z.lane[l] = mpXformedPos[index].m128_f32[2];
            pOut[i].W.lane[l] = mpXformedPos[index].m128_f32[3];
        }
    }
}
</pre>
<p>Aha! This is the code that transforms our vertices from the AoS (array of structures) form that&#8217;s used in memory into the SoA (structure of arrays) form we use during binning (and also the two rasterizers). Note that the output vectors are written element by element; then, as soon as we try to read the whole vector into a register, we hit a forwarding stall, because the core can&#8217;t forward the results from the 4 different stores per vector to a single load. It turns out that the other two instances of forwarding stalls run into this problem for the same reason &#8211; during the gather of bounding box vertices and triangle vertices in the rasterizer, respectively.</p>
<p>So how do we fix it? Well, we&#8217;d really like those vectors to be written using full-width SIMD stores instead. Luckily, that&#8217;s not too hard: converting data from AoS to SoA is essentially a matrix transpose, and our typical use case happens to be 4 separate 4-vectors, i.e. a 4&#215;4 matrix; luckily, a 4&#215;4 matrix transpose is fairly easy to do in SSE, and Intel&#8217;s intrinsics header file even comes with a macro that implements it. So here&#8217;s the updated <code>Gather</code> that uses a SSE transpose:</p>
<pre>
void TransformedMeshSSE::Gather(vFloat4 pOut[3], UINT triId,
    UINT numLanes)
{
    const UINT *pInd0 = &amp;mpIndices[triId * 3];
    const UINT *pInd1 = pInd0 + (numLanes &gt; 1 ? 3 : 0);
    const UINT *pInd2 = pInd0 + (numLanes &gt; 2 ? 6 : 0);
    const UINT *pInd3 = pInd0 + (numLanes &gt; 3 ? 9 : 0);

    for(UINT i = 0; i &lt; 3; i++)
    {
        __m128 v0 = mpXformedPos[pInd0[i]]; // x0 y0 z0 w0
        __m128 v1 = mpXformedPos[pInd1[i]]; // x1 y1 z1 w1
        __m128 v2 = mpXformedPos[pInd2[i]]; // x2 y2 z2 w2
        __m128 v3 = mpXformedPos[pInd3[i]]; // x3 y3 z3 w3
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);
        // After transpose:
        pOut[i].X = VecF32(v0); // v0 = x0 x1 x2 x3
        pOut[i].Y = VecF32(v1); // v1 = y0 y1 y2 y3
        pOut[i].Z = VecF32(v2); // v2 = z0 z1 z2 z3
        pOut[i].W = VecF32(v3); // v3 = w0 w1 w2 w3
    }
}
</pre>
<p>Not much to talk about here. The other two instances of this get modified in the exact same way. So how much does it help?</p>
<p><b>Change:</b> Gather using SSE instructions and transpose</p>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.148</td>
<td>3.208</td>
<td>3.243</td>
<td>3.305</td>
<td>4.321</td>
<td>3.271</td>
<td>0.100</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.934</td>
<td>3.078</td>
<td>3.110</td>
<td>3.156</td>
<td>3.992</td>
<td>3.133</td>
<td>0.103</td>
</tr>
</table>
<table>
<tr>
<th>Render depth</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>2.206</td>
<td>2.220</td>
<td>2.228</td>
<td>2.242</td>
<td>2.364</td>
<td>2.234</td>
<td>0.022</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.099</td>
<td>2.119</td>
<td>2.137</td>
<td>2.156</td>
<td>2.242</td>
<td>2.141</td>
<td>0.028</td>
</tr>
</table>
<table>
<tr>
<th>Depth test</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>0.813</td>
<td>0.830</td>
<td>0.839</td>
<td>0.847</td>
<td>0.886</td>
<td>0.839</td>
<td>0.013</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>0.773</td>
<td>0.793</td>
<td>0.802</td>
<td>0.809</td>
<td>0.843</td>
<td>0.801</td>
<td>0.012</td>
</tr>
</table>
<p>So we&#8217;re another 0.13ms down, about 0.04ms of which we gain in the depth testing pass and the remaining 0.09ms in the rendering pass. And a re-run with VTune confirms that the blocked loads are indeed gone:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png"><img loading="lazy" data-attachment-id="1876" data-permalink="https://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/hotspots_stlf_fixed/" data-orig-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png" data-orig-size="497,282" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Store forwarding fixed" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png?w=497" src="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png?w=497&#038;h=282" alt="Store forwarding fixed" width="497" height="282" class="aligncenter size-full wp-image-1876" srcset="https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png 497w, https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png?w=150&amp;h=85 150w, https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png?w=300&amp;h=170 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<h3>Vertex transformation</h3>
<p><a href="https://fgiesen.wordpress.com/2013/02/28/reshaping-dataflows/">Last time</a>, we modified the vertex transform code in the depth test rasterizer to get rid of the z-clamping and simplify the clipping logic. We also changed the logic to make better use of the regular structure of our input vertices. We don&#8217;t have any special structure we can use to make vertex transforms on regular meshes faster, but we definitely can (and should) improve the projection and near-clip logic, turning this:</p>
<pre>
mpXformedPos[i] = TransformCoords(&amp;mpVertices[i].position,
    cumulativeMatrix);
float oneOverW = 1.0f/max(mpXformedPos[i].m128_f32[3], 0.0000001f);
mpXformedPos[i] = _mm_mul_ps(mpXformedPos[i],
    _mm_set1_ps(oneOverW));
mpXformedPos[i].m128_f32[3] = oneOverW;
</pre>
<p>into this:</p>
<pre>
__m128 xform = TransformCoords(&amp;mpVertices[i].position,
    cumulativeMatrix);
__m128 vertZ = _mm_shuffle_ps(xform, xform, 0xaa);
__m128 vertW = _mm_shuffle_ps(xform, xform, 0xff);
__m128 projected = _mm_div_ps(xform, vertW);

// set to all-0 if near-clipped
__m128 mNoNearClip = _mm_cmple_ps(vertZ, vertW);
mpXformedPos[i] = _mm_and_ps(projected, mNoNearClip);
</pre>
<p>Here, near-clipped vertices are set to the (invalid) x=y=z=w=0, and the binner code can just check for <code>w==0</code> to test whether a vertex is near-clipped instead of having to use the original w tests (which again had a hardcoded near plane value).</p>
<p>This change doesn&#8217;t have any significant impact on the running time, but it does get rid of the hardcoded near plane location for good, so I thought it was worth mentioning.</p>
<h3>Again with the memory ordering</h3>
<p>And if we profile again, we notice there&#8217;s at least one more surprise waiting for us in the binning code:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png"><img loading="lazy" data-attachment-id="1883" data-permalink="https://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/hotspots_binning_mc/" data-orig-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png" data-orig-size="497,246" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Binning Machine Clears" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png?w=497" src="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png?w=497&#038;h=246" alt="Binning Machine Clears" width="497" height="246" class="aligncenter size-full wp-image-1883" srcset="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png 497w, https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png?w=150&amp;h=74 150w, https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png?w=300&amp;h=148 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Machine clears? We&#8217;ve seen them before, way back in &#8220;<a href="https://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share/">Cores don&#8217;t like to share</a>&#8220;. And yes, they&#8217;re again for memory ordering reasons. What did we do wrong this time? It turns out that the problematic code has been in there since the beginning, and ran just fine for quite a while, but ever since the scheduling optimizations we did in &#8220;<a href="https://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/">The care and feeding of worker threads</a>&#8220;, we now have binning jobs running tightly packed enough to run into memory ordering issues. So what&#8217;s the problem? Here&#8217;s the code:</p>
<pre>
// Add triangle to the tiles or bins that the bounding box covers
int row, col;
for(row = startY; row &lt;= endY; row++)
{
    int offset1 = YOFFSET1_MT * row;
    int offset2 = YOFFSET2_MT * row;
    for(col = startX; col &lt;= endX; col++)
    {
        int idx1 = offset1 + (XOFFSET1_MT * col) + taskId;
        int idx2 = offset2 + (XOFFSET2_MT * col) +
            (taskId * MAX_TRIS_IN_BIN_MT) + pNumTrisInBin[idx1];
        pBin[idx2] = index + i;
        pBinModel[idx2] = modelId;
        pBinMesh[idx2] = meshId;
        pNumTrisInBin[idx1] += 1;
    }
}
</pre>
<p>The problem turns out to be the array <code>pNumTrisInBin</code>. Even though it&#8217;s accessed as 1D, it is effectively a 3D array like this:</p>
<p><code>uint16 pNumTrisInBin[TILE_ROWS][TILE_COLS][BINNER_TASKS]</code></p>
<p>The <code>TILE_ROWS</code> and <code>TILE_COLS</code> parts should be obvious. The <code>BINNER_TASKS</code> needs some explanation though: as you hopefully remember, we try to divide the work between binning tasks so that each of them gets roughly the same amount of triangles. Now, before we start binning triangles, we don&#8217;t know which tiles they will go into &#8211; after all, that&#8217;s what the binner is there to find out.</p>
<p>We could have just one output buffer (bin) per tile; but then, whenever two binner tasks simultaneously end up trying to add a triangle to the same tile, they will end up getting serialized because they try to increment the same counter. And even worse, it would mean that the actual order of triangles in the bins would be different between every run, depending on when exactly each thread was running; while not fatal for depth buffers (we just end up storing the max of all triangles rendered to a pixel anyway, which is ordering-invariant) it&#8217;s still a complete pain to debug.</p>
<p>Hence there is one bin per tile per binning worker. We already know that the binning workers get assigned the triangles in the order they occur in the models &#8211; with the 32 binning workers we use, the first binning task gets the first 1/32 of the triangles, and second binning task gets the second 1/32, and so forth. And each binner processes triangles in order. This means that the rasterizer tasks can still process triangles in the original order they occur in the mesh &#8211; first process all triangles inserted by binner 0, then all triangles inserted by binner 1, and so forth. Since they&#8217;re in distinct memory ranges, that&#8217;s easily done. And each bin has a separate triangle counter, so they don&#8217;t interfere, right? Nothing to see here, move along.</p>
<p>Well, except for the bit where coherency is managed on a cache line granularity. Now, as you can see from the above declaration, the triangle counts for all the binner tasks are stored in adjacent 16-bit words; 32 of them, to be precise, one per binner task. So what was the size of a cache line again? 64 bytes, you say?</p>
<p>Oops.</p>
<p>Yep, even though it&#8217;s 32 separate counters, for the purposes of the memory subsystem it&#8217;s just the same as if it was all a single counter per tile (well, it might be slightly better than that if the initial pointer isn&#8217;t 64-byte aligned, but you get the idea).</p>
<p>Luckily for us, the fix is dead easy: all we have to do is shuffle the order of the array indices around.</p>
<p><code>uint16 pNumTrisInBin[BINNER_TASKS][TILE_ROWS][TILE_COLS]</code></p>
<p>We also happen to have 32 tiles total &#8211; which means that now, each binner task gets its own cache line by itself (again, provided we align things correctly). So again, it&#8217;s a really easy fix. The question being &#8211; how much does it help?</p>
<p><b>Change:</b> Change pNumTrisInBin array indexing</p>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.148</td>
<td>3.208</td>
<td>3.243</td>
<td>3.305</td>
<td>4.321</td>
<td>3.271</td>
<td>0.100</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.934</td>
<td>3.078</td>
<td>3.110</td>
<td>3.156</td>
<td>3.992</td>
<td>3.133</td>
<td>0.103</td>
</tr>
<tr>
<td>Change bin inds</td>
<td>2.842</td>
<td>2.933</td>
<td>2.980</td>
<td>3.042</td>
<td>3.914</td>
<td>3.007</td>
<td>0.125</td>
</tr>
</table>
<table>
<tr>
<th>Render depth</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>2.206</td>
<td>2.220</td>
<td>2.228</td>
<td>2.242</td>
<td>2.364</td>
<td>2.234</td>
<td>0.022</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.099</td>
<td>2.119</td>
<td>2.137</td>
<td>2.156</td>
<td>2.242</td>
<td>2.141</td>
<td>0.028</td>
</tr>
<tr>
<td>Change bin inds</td>
<td>1.980</td>
<td>2.008</td>
<td>2.026</td>
<td>2.046</td>
<td>2.172</td>
<td>2.032</td>
<td>0.035</td>
</tr>
</table>
<p>That&#8217;s right, a 0.1ms difference from <em>changing the memory layout of a 1024-entry, 2048-byte array</em>. You really need to be extremely careful with the layout of shared data when dealing with multiple cores at the same time.</p>
<h3>Once more, with branching</h3>
<p>At this point, the binner is starting to look fairly good, but there&#8217;s one more thing that springs to eye:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png"><img loading="lazy" data-attachment-id="1894" data-permalink="https://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/hotspots_binning_mispred/" data-orig-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png" data-orig-size="497,229" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Binning branch mispredicts" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png?w=497" src="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png?w=497&#038;h=229" alt="Binning branch mispredicts" width="497" height="229" class="aligncenter size-full wp-image-1894" srcset="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png 497w, https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png?w=150&amp;h=69 150w, https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png?w=300&amp;h=138 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Branch mispredictions. Now, the two rasterizers have legitimate reason to be mispredicting branches some of the time &#8211; they&#8217;re processing triangles with fairly unpredictable sizes, and the depth test rasterizer also has an early-out that&#8217;s hard to predict. But the binner has less of an excuse &#8211; sure, the triangles have very different dimensions measured <em>in 2&#215;2 pixel blocks</em>, but the vast majority of our triangles fits inside one of our (generously sized!) 320&#215;90 pixel tiles. So where are all these branches?</p>
<pre>
for(int i = 0; i &lt; numLanes; i++)
{
    // Skip triangle if area is zero 
    if(triArea.lane[i] &lt;= 0) continue;
    if(vEndX.lane[i] &lt; vStartX.lane[i] ||
       vEndY.lane[i] &lt; vStartY.lane[i]) continue;
			
    float oneOverW[3];
    for(int j = 0; j &lt; 3; j++)
        oneOverW[j] = xformedPos[j].W.lane[i];
			
    // Reject the triangle if any of its verts are outside the
    // near clip plane
    if(oneOverW[0] == 0.0f || oneOverW[1] == 0.0f ||
        oneOverW[2] == 0.0f) continue;

    // ...
}
</pre>
<p>Oh yeah, that. In particular, the first test (which checks for degenerate and back-facing triangles) will reject roughly half of all triangles and can be fairly random (as far as the CPU is concerned). Now, <a href="https://fgiesen.wordpress.com/2013/02/16/depth-buffers-done-quick-part-2/">last time we had an issue with branch mispredicts</a>, we simply removed the offending early-out. That&#8217;s a really bad idea in this case &#8211; any triangles we don&#8217;t reject here, we&#8217;re gonna waste even more work on later. No, these tests really should all be done here.</p>
<p>However, there&#8217;s no need for them to be done like this; right now, we have a whole slew of branches that are all over the map. Can&#8217;t we consolidate the branches somehow?</p>
<p>Of course we can. The basic idea is to do all the tests on 4 triangles at a time, while we&#8217;re still in SIMD form:</p>
<pre>
// Figure out which lanes are active
VecS32 mFront = cmpgt(triArea, VecS32::zero());
VecS32 mNonemptyX = cmpgt(vEndX, vStartX);
VecS32 mNonemptyY = cmpgt(vEndY, vStartY);
VecF32 mAccept1 = bits2float(mFront &amp; mNonemptyX &amp; mNonemptyY);

// All verts must be inside the near clip volume
VecF32 mW0 = cmpgt(xformedPos[0].W, VecF32::zero());
VecF32 mW1 = cmpgt(xformedPos[1].W, VecF32::zero());
VecF32 mW2 = cmpgt(xformedPos[2].W, VecF32::zero());

VecF32 mAccept = and(and(mAccept1, mW0), and(mW1, mW2));
// laneMask == (1 &lt;&lt; numLanes) - 1; - initialized earlier
unsigned int triMask = _mm_movemask_ps(mAccept.simd) &amp; laneMask;
</pre>
<p>Note I change the &#8220;is not near-clipped test&#8221; from <code>!(w == 0.0f)</code> to <code>w &gt; 0.0f</code>, on account of me knowing that all legal w&#8217;s happen to not just be non-zero, they&#8217;re positive (okay, what really happened is that I forgot to add a &#8220;cmpne&#8221; when I wrote <code>VecF32</code> and didn&#8217;t feel like adding it here). Other than that, it&#8217;s fairly straightforward. We build a mask in vector registers, then turn it into an integer mask of active lanes using <code>MOVMSKPS</code>.</p>
<p>With this, we could turn all the original branches into a single test in the <code>i</code> loop:</p>
<pre>
if((triMask &amp; (1 &lt;&lt; i)) == 0)
    continue;
</pre>
<p>However, we can do slightly better than that: it turns out we can iterate pretty much directly over the set bits in <code>triMask</code>, which means we&#8217;re now down to one single branch in the outer loop &#8211; the loop counter itself. The modified loop looks like this:</p>
<pre>
while(triMask)
{
    int i = FindClearLSB(&amp;triMask);
    // ...
}
</pre>
<p>So what does the magic <code>FindClearLSB</code> function do? It better not contain any branches! But lucky for us, it&#8217;s quite straightforward:</p>
<pre>
// Find index of least-significant set bit in mask
// and clear it (mask must be nonzero)
static int FindClearLSB(unsigned int *mask)
{
    unsigned long idx;
    _BitScanForward(&amp;idx, *mask);
    *mask &amp;= *mask - 1;
    return idx;
}
</pre>
<p>all it takes is <code>_BitScanForward</code> (the VC++ intrinsic for the x86 <code>BSF</code> instruction) and a really old trick for clearing the least-significant set bit in a value. In other words, this compiles into about 3 integer instructions and is completely branch-free. Good enough. So does it help?</p>
<p><b>Change:</b> Less branches in binner</p>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.148</td>
<td>3.208</td>
<td>3.243</td>
<td>3.305</td>
<td>4.321</td>
<td>3.271</td>
<td>0.100</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.934</td>
<td>3.078</td>
<td>3.110</td>
<td>3.156</td>
<td>3.992</td>
<td>3.133</td>
<td>0.103</td>
</tr>
<tr>
<td>Change bin inds</td>
<td>2.842</td>
<td>2.933</td>
<td>2.980</td>
<td>3.042</td>
<td>3.914</td>
<td>3.007</td>
<td>0.125</td>
</tr>
<tr>
<td>Less branches</td>
<td>2.786</td>
<td>2.879</td>
<td>2.915</td>
<td>2.969</td>
<td>3.706</td>
<td>2.936</td>
<td>0.092</td>
</tr>
</table>
<table>
<tr>
<th>Render depth</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>2.206</td>
<td>2.220</td>
<td>2.228</td>
<td>2.242</td>
<td>2.364</td>
<td>2.234</td>
<td>0.022</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.099</td>
<td>2.119</td>
<td>2.137</td>
<td>2.156</td>
<td>2.242</td>
<td>2.141</td>
<td>0.028</td>
</tr>
<tr>
<td>Change bin inds</td>
<td>1.980</td>
<td>2.008</td>
<td>2.026</td>
<td>2.046</td>
<td>2.172</td>
<td>2.032</td>
<td>0.035</td>
</tr>
<tr>
<td>Less branches</td>
<td>1.905</td>
<td>1.934</td>
<td>1.946</td>
<td>1.959</td>
<td>2.012</td>
<td>1.947</td>
<td>0.019</td>
</tr>
</table>
<p>That&#8217;s another 0.07ms off the total, for about a 10% reduction in median total cull time for this post, and a 12.7% reduction in median rasterizer time. And for our customary victory lap, here&#8217;s the VTune results after this change:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png"><img loading="lazy" data-attachment-id="1903" data-permalink="https://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/hotspots_binning_done/" data-orig-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png" data-orig-size="497,268" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Binning with branching improved" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png?w=497" src="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png?w=497&#038;h=268" alt="Binning with branching improved" width="497" height="268" class="aligncenter size-full wp-image-1903" srcset="https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png 497w, https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png?w=150&amp;h=81 150w, https://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png?w=300&amp;h=162 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>The branch mispredictions aren&#8217;t gone, but we did make a notable dent. It&#8217;s more obvious if you compare the number of clock cyles with the previous image.</p>
<p>And with that, I&#8217;ll conclude this journey into both the triangle binner and the dark side of speculative execution. We&#8217;re also getting close to the end of this series &#8211; the next post will look again at the loading and rendering code we&#8217;ve been intentionally ignoring for most of this series :), and after that I&#8217;ll finish with a summary and wrap-up &#8211; including a list of things I didn&#8217;t cover, and why not.</p>
]]></html><thumbnail_url><![CDATA[https://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png?fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[]]></thumbnail_width><thumbnail_height><![CDATA[]]></thumbnail_height></oembed>