<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Depth buffers done quick, part&nbsp;2]]></title><type><![CDATA[link]]></type><html><![CDATA[<p><em>This post is part of a series &#8211; go <a href="https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>Welcome back! At the end of the <a href="https://fgiesen.wordpress.com/2013/02/11/depth-buffers-done-quick-part/">last post</a>, we had just finished doing a first pass over the depth buffer rendering loops. Unfortunately, the first version of that post listed a final rendering time that was an outlier; more details in the post (which also has been updated to display the timing results in tables).</p>
<h3>Notation matters</h3>
<p>However, while writing that post, it became clear to me that I needed to do something about those damn over-long Intel SSE intrinsic names. Having them in regular source code is one thing, but it really sucks for presentation when performing two bitwise operations barely fits inside a single line of source code. So I whipped up two helper classes <code>VecS32</code> (32-bit signed integer) and <code>VecF32</code> (32-bit float) that are actual C++ implementations of the pseudo-code <code>Vec4i</code> I used in <a href="https://fgiesen.wordpress.com/2013/02/10/optimizing-the-basic-rasterizer/">&#8220;Optimizing the basic rasterizer&#8221;</a>. I then converted a lot of the SIMD code in the project to use those classes instead of dealing with <code>__m128</code> and <code>__m128i</code> directly.</p>
<p>I&#8217;ve used this kind of approach in the past to provide a useful common subset of SIMD operations for cross-platform code; in this case, the main point was to get some basic operator overloads and more convenient notation, but as a happy side effect it&#8217;s now much easier to make the code use SSE2 instructions only. The original code uses SSE4.1, but with everything nicely in one place, it&#8217;s easy to use <code>MOVMSKPS / CMP</code> instead of <code>PTEST</code> for the mask tests and <code>PSRAD / ANDPS / ANDNOTPS / ORPS</code> instead of <code>BLENDVPS</code>; you just have to do the substitution in one place. I haven&#8217;t done that in the code on Github, but I wanted to point out that it&#8217;s an option.</p>
<p>Anyway, I won&#8217;t go over the details of either the helper classes (it&#8217;s fairly basic stuff) or the modifications to the code (just glorified search and replace), but I will show you one before-after example to illustrate why I did it:</p>
<pre>
col = _mm_add_epi32(colOffset, _mm_set1_epi32(startXx));
__m128i aa0Col = _mm_mullo_epi32(aa0, col);
__m128i aa1Col = _mm_mullo_epi32(aa1, col);
__m128i aa2Col = _mm_mullo_epi32(aa2, col);

row = _mm_add_epi32(rowOffset, _mm_set1_epi32(startYy));
__m128i bb0Row = _mm_add_epi32(_mm_mullo_epi32(bb0, row), cc0);
__m128i bb1Row = _mm_add_epi32(_mm_mullo_epi32(bb1, row), cc1);
__m128i bb2Row = _mm_add_epi32(_mm_mullo_epi32(bb2, row), cc2);

__m128i sum0Row = _mm_add_epi32(aa0Col, bb0Row);
__m128i sum1Row = _mm_add_epi32(aa1Col, bb1Row);
__m128i sum2Row = _mm_add_epi32(aa2Col, bb2Row);
</pre>
<p>turns into:</p>
<pre>
VecS32 col = colOffset + VecS32(startXx);
VecS32 aa0Col = aa0 * col;
VecS32 aa1Col = aa1 * col;
VecS32 aa2Col = aa2 * col;

VecS32 row = rowOffset + VecS32(startYy);
VecS32 bb0Row = bb0 * row + cc0;
VecS32 bb1Row = bb1 * row + cc1;
VecS32 bb2Row = bb2 * row + cc2;

VecS32 sum0Row = aa0Col + bb0Row;
VecS32 sum1Row = aa1Col + bb1Row;
VecS32 sum2Row = aa2Col + bb2Row;
</pre>
<p>I don&#8217;t know about you, but I already find this <em>much</em> easier to parse visually, and the generated code is the same. And as soon as I had this, I just got rid of most of the explicit temporaries since they&#8217;re never referenced again anyway:</p>
<pre>
VecS32 col = VecS32(startXx) + colOffset;
VecS32 row = VecS32(startYy) + rowOffset;

VecS32 sum0Row = aa0 * col + bb0 * row + cc0;
VecS32 sum1Row = aa1 * col + bb1 * row + cc1;
VecS32 sum2Row = aa2 * col + bb2 * row + cc2;
</pre>
<p>And suddenly, with the ratio of syntactic noise to actual content back to a reasonable range, it&#8217;s actually possible to see what&#8217;s really going on here in one glance. Even if this <em>was</em> slower &#8211; and as I just told you, it&#8217;s not &#8211; it would still be totally worthwhile for development. You can&#8217;t always do it this easily; in particular, with integer SIMD instructions (particularly when dealing with pixels), I often find myself switching between the interpretation of values (&#8220;typecasting&#8221;), and adding explicit types adds more syntactic noise than it eliminates. But in this case, we actually have several relatively long functions that only deal with either 32-bit ints or 32-bit floats, so it works beautifully.</p>
<p>And just to prove that it really didn&#8217;t change the performance:</p>
<p><b>Change</b>: VecS32/VecF32</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.367</td>
<td>3.420</td>
<td>3.432</td>
<td>3.445</td>
<td>3.512</td>
<td>3.433</td>
<td>0.021</td>
</tr>
<tr>
<td>End of part 1</td>
<td>3.020</td>
<td>3.081</td>
<td>3.095</td>
<td>3.106</td>
<td>3.149</td>
<td>3.093</td>
<td>0.020</td>
</tr>
<tr>
<td>Vec[SF]32</td>
<td>3.022</td>
<td>3.056</td>
<td>3.067</td>
<td>3.081</td>
<td>3.153</td>
<td>3.069</td>
<td>0.018</td>
</tr>
</table>
<h3>A bit more work on setup</h3>
<p>With that out of the way, let&#8217;s spiral further outwards and have a look at our triangle setup code. Most of it sets up edge equations etc. for 4 triangles at a time; we only drop down to individual triangles once we&#8217;re about to actually rasterize them. Most of this code works exactly as we saw in <a href="https://fgiesen.wordpress.com/2013/02/10/optimizing-the-basic-rasterizer/">&#8220;Optimizing the basic rasterizer&#8221;</a>, but there&#8217;s one bit that performs a bit more work than necessary:</p>
<pre>
// Compute triangle area
VecS32 triArea = A0 * xFormedFxPtPos[0].X;
triArea += B0 * xFormedFxPtPos[0].Y;
triArea += C0;

VecF32 oneOverTriArea = VecF32(1.0f) / itof(triArea);
</pre>
<p>Contrary to what the comment says :), this actually computes twice the (signed) triangle area and is used to normalize the barycentric coordinates. That&#8217;s also why there&#8217;s a divide to compute its reciprocal. However, the computation of the area itself is more complicated than necessary and depends on <code>C0</code>. A better way is to just use the direct determinant expression. Since the area is computed in integers, this gives exactly the same results with one operation less, and without the dependency on <code>C0</code>:</p>
<pre>
VecS32 triArea = B2 * A1 - B1 * A2;
VecF32 oneOverTriArea = VecF32(1.0f) / itof(triArea);
</pre>
<p>And talking about the barycentric coordinates, there&#8217;s also this part of the setup that is performed per triangle, not across 4 triangles:</p>
<pre>
VecF32 zz[3], oneOverW[3];
for(int vv = 0; vv &lt; 3; vv++)
{
    zz[vv] = VecF32(xformedvPos[vv].Z.lane[lane]);
    oneOverW[vv] = VecF32(xformedvPos[vv].W.lane[lane]);
}

VecF32 oneOverTotalArea(oneOverTriArea.lane[lane]);
zz[1] = (zz[1] - zz[0]) * oneOverTotalArea;
zz[2] = (zz[2] - zz[0]) * oneOverTotalArea;
</pre>
<p>The latter two lines perform the half-barycentric interpolation setup; the original code multiplied the <code>zz[i]</code> by <code>oneOverTotalArea</code> here (this is the normalization for the barycentric terms). But note that all the quantities involved here are vectors of four broadcast values; these are really scalar computations, and we can perform them while we&#8217;re still dealing with 4 triangles at a time! So right after the triangle area computation, we now do this:</p>
<pre>
// Z setup
VecF32 Z[3];
Z[0] = xformedvPos[0].Z;
Z[1] = (xformedvPos[1].Z - Z[0]) * oneOverTriArea;
Z[2] = (xformedvPos[2].Z - Z[0]) * oneOverTriArea;
</pre>
<p>Which allows us to get rid of the second half of the earlier block &#8211; all we have to do is load <code>zz</code> from <code>Z[vv]</code> rather than <code>xformedvPos[vv].Z</code>. Finally, the original code sets up <code>oneOverW</code> but never uses it, and it turns out that in this case, VC++&#8217;s data flow analysis was <em>not</em> smart enough to figure out that the computation is unnecessary. No matter &#8211; just delete that code as well.</p>
<p>So this batch is just a bunch of small, simple, local improvements: getting rid of a little unnecessary work in several places, or just grouping computations more effectively. It&#8217;s small fry, but it&#8217;s also very low-effort, so why not.</p>
<p><b>Change</b>: Various minor setup improvements</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.367</td>
<td>3.420</td>
<td>3.432</td>
<td>3.445</td>
<td>3.512</td>
<td>3.433</td>
<td>0.021</td>
</tr>
<tr>
<td>End of part 1</td>
<td>3.020</td>
<td>3.081</td>
<td>3.095</td>
<td>3.106</td>
<td>3.149</td>
<td>3.093</td>
<td>0.020</td>
</tr>
<tr>
<td>Vec[SF]32</td>
<td>3.022</td>
<td>3.056</td>
<td>3.067</td>
<td>3.081</td>
<td>3.153</td>
<td>3.069</td>
<td>0.018</td>
</tr>
<tr>
<td>Setup cleanups</td>
<td>2.977</td>
<td>3.032</td>
<td>3.046</td>
<td>3.058</td>
<td>3.101</td>
<td>3.045</td>
<td>0.020</td>
</tr>
</table>
<p>As said, it&#8217;s minor, but a small win nonetheless.</p>
<h3>Garbage in the bins</h3>
<p>When I was originally performing the experiments that led to this series, I discovered something funny when I had the code at roughly this stage: occasionally, I would get triangles that had <code>endXx &lt; startXx</code> (or <code>endYy &lt; startYy</code>). I only noticed this because I changed the loop in a way that should have been equivalent, but turned out not to be: I was computing <code>endXx - startXx</code> as an unsigned integer, and it wrapped around, causing the code to start stomping over memory and eventually crash. At the time, I just made a note to investigate later and added an <code>if</code> to detect the case early for the time being, but when I later came back to figure out what was going on, the explanation turned out to be quite interesting.</p>
<p>So, where do these triangles with empty bounding boxes come from? The actual per-triangle assignments</p>
<pre>
int startXx = startX.lane[lane];
int endXx   = endX.lane[lane];
</pre>
<p>just get their values from these vectors:</p>
<pre>
// Use bounding box traversal strategy to determine which
// pixels to rasterize 
VecS32 startX = vmax(
    vmin(
        vmin(xFormedFxPtPos[0].X, xFormedFxPtPos[1].X),
        xFormedFxPtPos[2].X), VecS32(tileStartX))
    &amp; VecS32(~1);
VecS32 endX = vmin(
    vmax(
        vmax(xFormedFxPtPos[0].X, xFormedFxPtPos[1].X),
        xFormedFxPtPos[2].X) + VecS32(1), VecS32(tileEndX));
</pre>
<p>Horrible line-breaking aside (I just need to switch to a wider layout), this is fairly straightforward: <code>startX</code> is determined as the minimum of all vertex X coordinates, then clipped against the left tile boundary and finally rounded down to be a multiple of 2 (to align with the 2&#215;2 tiling grid). Similarly, <code>endX</code> is the maximum of vertex X coordinates, clipped against the right boundary of the tile. Since we use an inclusive fill convention but exclusive loop bounds on the right side (the test is for <code>&lt; endXx</code> not <code>&lt;= endXx</code>), there&#8217;s an extra +1 in there.</p>
<p>Other than the clip to the tile bounds, this really just computes an axis-aligned bounding rectangle for the triangle and then potentially makes it a little bigger. So really, the only way to get <code>endXx &lt; startXx</code> from this is for the triangle to have an empty intersection with the active tile&#8217;s bounding box. But if that&#8217;s the case, why was the triangle added to the bin for this tile to begin with? Time to look at the binner code.</p>
<p>The relevant piece of code is <a href="https://github.com/rygorous/intel_occlusion_cull/blob/2d1282e5/SoftwareOcclusionCulling/TransformedMeshSSE.cpp#L127">here</a>. The bounding box determination for the whole triangle looks as follows:</p>
<pre>
VecS32 vStartX = vmax(
    vmin(
        vmin(xFormedFxPtPos[0].X, xFormedFxPtPos[1].X), 
        xFormedFxPtPos[2].X), VecS32(0));
VecS32 vEndX   = vmin(
    vmax(
        vmax(xFormedFxPtPos[0].X, xFormedFxPtPos[1].X),
        xFormedFxPtPos[2].X) + VecS32(1), VecS32(SCREENW));
</pre>
<p>Okay, that&#8217;s basically the same we saw before, only we&#8217;re clipping against the screen bounds not the tile bounds. And the same happens with Y. Nothing to see here so far, move along. But then, what does the code do with these bounds? Let&#8217;s have a look:</p>
<pre>
// Convert bounding box in terms of pixels to bounding box
// in terms of tiles
int startX = max(vStartX.lane[i]/TILE_WIDTH_IN_PIXELS, 0);
int endX   = min(vEndX.lane[i]/TILE_WIDTH_IN_PIXELS,
                 SCREENW_IN_TILES-1);

int startY = max(vStartY.lane[i]/TILE_HEIGHT_IN_PIXELS, 0);
int endY   = min(vEndY.lane[i]/TILE_HEIGHT_IN_PIXELS,
                 SCREENH_IN_TILES-1);

// Add triangle to the tiles or bins that the bounding box covers
int row, col;
for(row = startY; row &lt;= endY; row++)
{
    int offset1 = YOFFSET1_MT * row;
    int offset2 = YOFFSET2_MT * row;
    for(col = startX; col &lt;= endX; col++)
    {
        // ...
    }
}
</pre>
<p>And in this loop, the triangles get added to the corresponding bins. So the bug must be somewhere in here. Can you figure out what&#8217;s going on?</p>
<p>Okay, I&#8217;ll spill. The problem is triangles that are completely outside the top or left screen edges, but not too far outside, and it&#8217;s caused by the division at the top. Being regular C division, it&#8217;s truncating &#8211; that is, it always rounds towards zero (Note: In C99/C++11, it&#8217;s actually defined that way; C89 and C++98 leave it up to the compiler, but on x86 all compilers I&#8217;m aware of use truncation, since that&#8217;s what the hardware does). Say that our tiles measure 100&#215;100 pixels (they don&#8217;t, but that doesn&#8217;t matter here). What happens if we get a triangle whose bounding box goes from, say, <code>minX=-75</code> to <code>maxX=-38</code>? First, we compute <code>vStartX</code> to be 0 in that lane (<code>vStartX</code> is clipped against the left edge) and <code>vEndX</code> as -37 (it gets incremented by 1, but not clipped). This looks weird, but is completely fine &#8211; that&#8217;s an empty rectangle. However, in the computation of <code>startX</code> and <code>endX</code>, we divide both these values by 100, and get zero both times. And since the tile start and end coordinates are inclusive not exclusive (look at the loop conditions!), this is <em>not</em> fine &#8211; the leftmost column of tiles goes from x=0 to x=99 (inclusive), and our triangle doesn&#8217;t overlap that! Which is why we then get an empty bounding box in the actual rasterizer.</p>
<p>There&#8217;s two ways to fix this problem. The first is to use &#8220;floor division&#8221;, i.e. division that always rounds down, no matter the sign. This will again generate an empty rectangle in this case, and everything works fine. However, C/C++ don&#8217;t have a floor division operator, so this is somewhat awkward to express in code, and I went for the simpler option: just check whether the bounding rectangle is empty before we even do the divide.</p>
<pre>
if(vEndX.lane[i] &lt; vStartX.lane[i] ||
   vEndY.lane[i] &lt; vStartY.lane[i]) continue;
</pre>
<p>And there&#8217;s another problem with the code as-is: There&#8217;s an off-by-one error. Suppose we have a triangle with <code>maxX=99</code>. Then we&#8217;ll compute <code>vEndX</code> as 100 and end up inserting the triangle into the bin for x=100 to x=199, which again it doesn&#8217;t overlap. The solution is simple: stop adding 1 to <code>vEndX</code> and clamp it to <code>SCREENW - 1</code> instead of <code>SCREENW</code>! And with these two issues fixed, we now have a binner that really only bins triangles into tiles intersected by their bounding boxes. Which, in a nice turn of events, also means that our depth rasterizer sees slightly fewer triangles! Does it help?</p>
<p><b>Change</b>: Fix a few binning bugs</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.367</td>
<td>3.420</td>
<td>3.432</td>
<td>3.445</td>
<td>3.512</td>
<td>3.433</td>
<td>0.021</td>
</tr>
<tr>
<td>End of part 1</td>
<td>3.020</td>
<td>3.081</td>
<td>3.095</td>
<td>3.106</td>
<td>3.149</td>
<td>3.093</td>
<td>0.020</td>
</tr>
<tr>
<td>Vec[SF]32</td>
<td>3.022</td>
<td>3.056</td>
<td>3.067</td>
<td>3.081</td>
<td>3.153</td>
<td>3.069</td>
<td>0.018</td>
</tr>
<tr>
<td>Setup cleanups</td>
<td>2.977</td>
<td>3.032</td>
<td>3.046</td>
<td>3.058</td>
<td>3.101</td>
<td>3.045</td>
<td>0.020</td>
</tr>
<tr>
<td>Binning fixes</td>
<td>2.972</td>
<td>3.008</td>
<td>3.022</td>
<td>3.035</td>
<td>3.079</td>
<td>3.022</td>
<td>0.020</td>
</tr>
</table>
<p>Not a big improvement, but then again, this wasn&#8217;t even for performance, it was just a regular bug fix! Always nice when they pay off this way.</p>
<h3>One more setup tweak</h3>
<p>With that out of the way, there&#8217;s one bit of unnecessary work left in our triangle setup: If you look at the <a href="https://github.com/rygorous/intel_occlusion_cull/blob/db909a37/SoftwareOcclusionCulling/DepthBufferRasterizerSSEMT.cpp#L294">current triangle setup code</a>, you&#8217;ll notice that we convert all four of X, Y, Z and W to integer (fixed-point), but we only actually look at the integer versions for X and Y. So we can stop converting Z and W. I also renamed the variables to have shorter names, simply to make the code more readable. So <a>this change</a> ends up affecting lots of lines, but the details are trivial, so I&#8217;m just going to give you the results:</p>
<p><b>Change</b>: Don&#8217;t convert Z/W to fixed point</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.367</td>
<td>3.420</td>
<td>3.432</td>
<td>3.445</td>
<td>3.512</td>
<td>3.433</td>
<td>0.021</td>
</tr>
<tr>
<td>End of part 1</td>
<td>3.020</td>
<td>3.081</td>
<td>3.095</td>
<td>3.106</td>
<td>3.149</td>
<td>3.093</td>
<td>0.020</td>
</tr>
<tr>
<td>Vec[SF]32</td>
<td>3.022</td>
<td>3.056</td>
<td>3.067</td>
<td>3.081</td>
<td>3.153</td>
<td>3.069</td>
<td>0.018</td>
</tr>
<tr>
<td>Setup cleanups</td>
<td>2.977</td>
<td>3.032</td>
<td>3.046</td>
<td>3.058</td>
<td>3.101</td>
<td>3.045</td>
<td>0.020</td>
</tr>
<tr>
<td>Binning fixes</td>
<td>2.972</td>
<td>3.008</td>
<td>3.022</td>
<td>3.035</td>
<td>3.079</td>
<td>3.022</td>
<td>0.020</td>
</tr>
<tr>
<td>No fixed-pt. Z/W</td>
<td>2.958</td>
<td>2.985</td>
<td>2.991</td>
<td>2.999</td>
<td>3.048</td>
<td>2.992</td>
<td>0.012</td>
</tr>
</table>
<p>And with that, we are &#8211; finally! &#8211; down about 0.1ms from where we ended the previous post.</p>
<h3>Time to profile</h3>
<p>Evidently, progress is slowing down. This is entirely expected; we&#8217;re running out of easy targets. But while we&#8217;ve been staring intently at code, we haven&#8217;t really done any more in-depth profiling than just looking at overall timings in quite a while. Time to bring out VTune again and check if the situation&#8217;s changed since our last detailed profiling run, way back at the start of <a href="https://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/">&#8220;Frustum culling: turning the crank&#8221;</a>.</p>
<p>Here&#8217;s the results:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png"><img loading="lazy" data-attachment-id="1685" data-permalink="https://fgiesen.wordpress.com/2013/02/16/depth-buffers-done-quick-part-2/hotspots_rast/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png" data-orig-size="497,203" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Rasterization hot spots" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png?w=497&#038;h=203" alt="Rasterization hot spots" width="497" height="203" class="aligncenter size-full wp-image-1685" srcset="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png 497w, https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png?w=150&amp;h=61 150w, https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png?w=300&amp;h=123 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Unlike our previous profiling runs, there&#8217;s really no smoking guns here. At a CPI rate of 0.459 (so we&#8217;re averaging about 2.18 instructions executed per cycle over the whole function!) we&#8217;re doing pretty well: in &#8220;Frustum culling: turning the crank&#8221;, we were still at 0.588 clocks per instruction. There&#8217;s a lot of L1 and L2 cache line replacements (i.e. cache lines getting cycled in and out), but that is to be expected &#8211; at 320&#215;90 pixels times one float each, our tiles come out at about 112kb, which is larger than our L1 data cache and takes up a significant amount of the L2 cache for each core. But for all that, we don&#8217;t seem to be terribly bottlenecked by it; if we were seriously harmed by cache effects, we wouldn&#8217;t be running nearly as fast as we do.</p>
<p>Pretty much the only thing we do see is that we seem to be getting a lot of branch mispredictions. Now, if you were to drill into them, you would notice that most of them relate to the row/column loops, so they&#8217;re purely a function of the triangle size. However, we do still perform the early-out check for each quad. With the initial version of the code, that check was a slight win (I checked, even though I didn&#8217;t bother telling you about it), but that was with a version of the code that had more work in the inner loop, and of course the test itself has some execution cost too. Is it still worthwhile? Let&#8217;s try removing it.</p>
<p><b>Change</b>: Remove &#8220;quad not covered&#8221; early-out</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.367</td>
<td>3.420</td>
<td>3.432</td>
<td>3.445</td>
<td>3.512</td>
<td>3.433</td>
<td>0.021</td>
</tr>
<tr>
<td>End of part 1</td>
<td>3.020</td>
<td>3.081</td>
<td>3.095</td>
<td>3.106</td>
<td>3.149</td>
<td>3.093</td>
<td>0.020</td>
</tr>
<tr>
<td>Vec[SF]32</td>
<td>3.022</td>
<td>3.056</td>
<td>3.067</td>
<td>3.081</td>
<td>3.153</td>
<td>3.069</td>
<td>0.018</td>
</tr>
<tr>
<td>Setup cleanups</td>
<td>2.977</td>
<td>3.032</td>
<td>3.046</td>
<td>3.058</td>
<td>3.101</td>
<td>3.045</td>
<td>0.020</td>
</tr>
<tr>
<td>Binning fixes</td>
<td>2.972</td>
<td>3.008</td>
<td>3.022</td>
<td>3.035</td>
<td>3.079</td>
<td>3.022</td>
<td>0.020</td>
</tr>
<tr>
<td>No fixed-pt. Z/W</td>
<td>2.958</td>
<td>2.985</td>
<td>2.991</td>
<td>2.999</td>
<td>3.048</td>
<td>2.992</td>
<td>0.012</td>
</tr>
<tr>
<td>No quad early-out</td>
<td>2.778</td>
<td>2.809</td>
<td>2.826</td>
<td>2.842</td>
<td>2.908</td>
<td>2.827</td>
<td>0.025</td>
</tr>
</table>
<p>And just like that, another 0.17ms evaporate. I could do this all day. Let&#8217;s run the profiler again just to see what changed:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png"><img loading="lazy" data-attachment-id="1689" data-permalink="https://fgiesen.wordpress.com/2013/02/16/depth-buffers-done-quick-part-2/hotspots_rast2/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png" data-orig-size="497,205" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Rasterizer hotspots without early-out" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=497&#038;h=205" alt="Rasterizer hotspots without early-out" width="497" height="205" class="aligncenter size-full wp-image-1689" srcset="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png 497w, https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=150&amp;h=62 150w, https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=300&amp;h=124 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Yes, branch mispredicts are down by about half, and cycles spent by about 10%. And we weren&#8217;t even that badly bottlenecked on branches to begin with, at least according to VTune! Just goes to show &#8211; CPUs really do like their code straight-line.</p>
<h3>Bonus: per-pixel increments</h3>
<p>There&#8217;s a few more minor modifications in the most recent set of changes that I won&#8217;t bother talking about, but there&#8217;s one more that I want to mention, and that several comments brought up last time: stepping the interpolated depth from pixel to pixel rather than recomputing it from the barycentric coordinates every time. I wanted to do this one last, because unlike our other changes, this one does change the resulting depth buffer noticeably. It&#8217;s not a huge difference, but changing the results is something I&#8217;ve intentionally avoided doing so far, so I wanted to do this change towards the end of the depth rasterizer modifications so it&#8217;s easier to &#8220;opt out&#8221; from.</p>
<p>That said, the change itself is really easy to make now: only do our current computation</p>
<pre>
VecF32 depth = zz[0] + itof(beta) * zz[1] + itof(gama) * zz[2];
</pre>
<p>once per line, and update <code>depth</code> incrementally per pixel (note that doing this properly requires changing the code a little bit, because the original code overwrites <code>depth</code> with the value we store to the depth buffer, but that&#8217;s easily changed):</p>
<pre>
depth += zx;
</pre>
<p>just like the edge equations themselves, where <code>zx</code> can be computed at setup time as</p>
<pre>
VecF32 zx = itof(aa1Inc) * zz[1] + itof(aa2Inc) * zz[2];
</pre>
<p>It should be easy to see why this produces the same results in exact arithmetic; but of course, in reality, there&#8217;s floating-point round-off error introduced in the computation of <code>zx</code> and by the repeated additions, so it&#8217;s not quite exact. That said, for our purposes (computing a depth buffer for occlusion culling), it&#8217;s probably fine. This gets rid of a lot of instructions in the loop, so it should come as no surprise that it&#8217;s faster, but let&#8217;s see by how much:</p>
<p><b>Change</b>: Per-pixel depth increments</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.367</td>
<td>3.420</td>
<td>3.432</td>
<td>3.445</td>
<td>3.512</td>
<td>3.433</td>
<td>0.021</td>
</tr>
<tr>
<td>End of part 1</td>
<td>3.020</td>
<td>3.081</td>
<td>3.095</td>
<td>3.106</td>
<td>3.149</td>
<td>3.093</td>
<td>0.020</td>
</tr>
<tr>
<td>Vec[SF]32</td>
<td>3.022</td>
<td>3.056</td>
<td>3.067</td>
<td>3.081</td>
<td>3.153</td>
<td>3.069</td>
<td>0.018</td>
</tr>
<tr>
<td>Setup cleanups</td>
<td>2.977</td>
<td>3.032</td>
<td>3.046</td>
<td>3.058</td>
<td>3.101</td>
<td>3.045</td>
<td>0.020</td>
</tr>
<tr>
<td>Binning fixes</td>
<td>2.972</td>
<td>3.008</td>
<td>3.022</td>
<td>3.035</td>
<td>3.079</td>
<td>3.022</td>
<td>0.020</td>
</tr>
<tr>
<td>No fixed-pt. Z/W</td>
<td>2.958</td>
<td>2.985</td>
<td>2.991</td>
<td>2.999</td>
<td>3.048</td>
<td>2.992</td>
<td>0.012</td>
</tr>
<tr>
<td>No quad early-out</td>
<td>2.778</td>
<td>2.809</td>
<td>2.826</td>
<td>2.842</td>
<td>2.908</td>
<td>2.827</td>
<td>0.025</td>
</tr>
<tr>
<td>Incremental depth</td>
<td>2.676</td>
<td>2.699</td>
<td>2.709</td>
<td>2.721</td>
<td>2.760</td>
<td>2.711</td>
<td>0.016</td>
</tr>
</table>
<p>Down by about another 0.1ms per frame &#8211; which might be less than you expected considering how many instructions we just got rid of. What can I say &#8211; we&#8217;re starting to bump into other issues again.</p>
<p>Now, there&#8217;s more things we could try (isn&#8217;t there always?), but I think with five in-depth posts on rasterization and a 21% reduction in median run-time on what already started out as fairly optimized code, it&#8217;s time to close this chapter and start looking at other things. Which I will do in the next post. Until then, code for the new batch of changes is, as always, on <a href="https://github.com/rygorous/intel_occlusion_cull/tree/blog">Github</a>.</p>
]]></html><thumbnail_url><![CDATA[https://fgiesen.files.wordpress.com/2013/02/hotspots_rast.png?fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[]]></thumbnail_width><thumbnail_height><![CDATA[]]></thumbnail_height></oembed>