<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Frustum culling: turning the&nbsp;crank]]></title><type><![CDATA[link]]></type><html><![CDATA[<p><em>This post is part of a series &#8211; go <a href="https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>In the past few posts, we&#8217;ve been looking at Intel&#8217;s <a href="http://software.intel.com/en-us/vcsource/samples/software-occlusion-culling">Software Occlusion Culling sample</a>. This post is going to be a bit shorter than the others so far. There are two reasons for this: first, next up is the rasterizer. It turns out there&#8217;s another common performance problem we&#8217;re gonna see in this series, but right now, fixing it is not going to make much of a difference: as it is, the rasterizer in the sample is fairly well balanced, and it&#8217;s not making any serious mistakes. In other words, this time round, we don&#8217;t get any easy pickings. Unlike the somewhat peripheral framework code we&#8217;ve been looking at so far, this is the actual heart of this sample, and it was written by people who know what they&#8217;re doing. Speeding it up is going to take some actual algorithmic improvements, which means a lot of prep work for me, since I&#8217;ll need to teach you the necessary groundwork first. 🙂 This is gonna take several posts, but I promise that we&#8217;ll get a properly satisfying finale.</p>
<p>Well, that&#8217;s the first reason. The second reason is that I&#8217;ve actually had people complain about the rate at which I&#8217;m posting these, because they&#8217;re falling behind on reading! Sheesh. You slackers &#8211; try writing the damn things! 🙂 Anyway, I&#8217;m going to give you this one for the road, and then I&#8217;m gonna stop for the weekend so you can catch up and I can start mentally preparing for the figures on rasterization I&#8217;m gonna have to produce. A picture may be worth a thousand words, but making a decent illustration takes me a lot longer than writing a thousand words does, so it&#8217;s not a happy trade-off.</p>
<p>Anyway, enough introduction. One final batch of frustum cull optimization coming right up. Let&#8217;s get cracking.</p>
<h3>What we need here is some work ethic</h3>
<p>So the whole point of my last post was that you can often make a big difference with fairly small changes, provided you know what you&#8217;re doing &#8211; minimally invasive surgery, so to speak. This has lots of advantages: it&#8217;s easy to write and review, very testable, and less likely to cause major disruptions or friction with other existing code. When we last saw our frustum culling code, we managed to speed it up by a factor of over 4.5x with a handful of fixes, none of which touched more than 3 lines of code. That&#8217;s both satisfying and impressive to family and friends (if you have the kind of family which tends to be impressed by these things), and it got us from a #4 location in our hot spot list all the way down to #10:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png"><img loading="lazy" data-attachment-id="1372" data-permalink="https://fgiesen.wordpress.com/2013/02/01/fixing-cache-issues-the-lazy-way/hotspots_data_density/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png" data-orig-size="497,301" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Hotspots with improved data density" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=497&#038;h=301" alt="Hotspots with improved data density" width="497" height="301" class="aligncenter size-full wp-image-1372" srcset="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png 497w, https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=150&amp;h=91 150w, https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=300&amp;h=182 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>But as you can see, while that blue bar denoting Level-3 cache misses has gotten considerably shorter, it&#8217;s still there, and a lot of the things I mentioned in the previous part are still true: in particular, one instance of our <code>TransformedAABBoxSSE</code> is still well above the size of a cache line, and while we&#8217;ve managed to beat it all into a shape where the prefetcher works for us, we&#8217;re still only accessing 25 bytes per cache line, best case. That&#8217;s over 60% of our memory bandwidth wasted. Surely, we can do better?</p>
<p>Well, of course we can, but this time we&#8217;re gonna have to roll up our sleeves and do some more invasive changes to our code. Let&#8217;s first recap the struct layout:</p>
<pre>
class TransformedAABBoxSSE
{
    // Methods elided

    CPUTModelDX11 *mpCPUTModel;
    __m128 *mWorldMatrix;
    __m128 *mpBBVertexList;
    __m128 *mpXformedPos;
    __m128 *mCumulativeMatrix; 
    bool   *mVisible;
    float   mOccludeeSizeThreshold;
    __m128 *mViewPortMatrix; 

    float3 mBBCenter;
    float3 mBBHalf;
    bool   mInsideViewFrustum;
    bool   mTooSmall;
    float3 mBBCenterWS;
    float3 mBBHalfWS;
};
</pre>
<p>The part we care about right now is at the bottom: The two bools and the world-space bounding box. Now, it turns out that while one of the bools (<code>mTooSmall</code>) gets written by the function <code>TransformedAABBoxSSE::IsTooSmall</code>, nobody ever <em>reads</em> it &#8211; everyone just uses the return value of <code>IsTooSmall</code>. So we can just make it a local variable in that function and stop spending per-instance memory on it. That one&#8217;s fairly easy.</p>
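<p>Sketched out, the fix looks something like this. This is a hedged before/after sketch with hypothetical, heavily simplified signatures &#8211; the real <code>IsTooSmall</code> takes the view and projection matrices plus the camera &#8211; but the point carries over: the flag only ever reaches callers through the return value, so it can become a local instead of per-instance state.</p>

```cpp
#include <cassert>

// Before: the result gets written to a member that nobody ever reads
// back - all callers use the return value anyway.
struct BoxBefore {
    bool mTooSmall; // per-instance state, pure waste
    bool IsTooSmall(float screenSize, float threshold) {
        mTooSmall = screenSize < threshold;
        return mTooSmall;
    }
};

// After: the flag is a local, and the per-instance byte (plus any
// padding it drags along) is gone.
struct BoxAfter {
    bool IsTooSmall(float screenSize, float threshold) const {
        bool tooSmall = screenSize < threshold;
        return tooSmall;
    }
};
```

<p>Callers are completely unaffected, which is what makes this the easy case.</p>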
<h3>Ta data rhei</h3>
<p>For <code>mInsideViewFrustum</code> though, we&#8217;re going to have to work a bit more. In particular, we&#8217;re gonna have to understand the actual dataflow patterns to figure out where the right place to put it is.</p>
<p>We already know that it gets set in <code>IsInsideViewFrustum</code>, because we&#8217;ve spent some time looking at that function already, although it&#8217;s gotten shorter since we last saw it:</p>
<pre>
void TransformedAABBoxSSE::IsInsideViewFrustum(CPUTCamera *pCamera)
{
    mInsideViewFrustum = pCamera-&gt;mFrustum.IsVisible(mBBCenterWS,
        mBBHalfWS);
}
</pre>
<p>Unfortunately, unlike the previous case, <code>IsInsideViewFrustum</code> doesn&#8217;t have a return value, so our boolean flag is actual state, and there&#8217;s two more methods that access it, one of which is <em>also</em> called <code>IsInsideViewFrustum</code>. I&#8217;m really not a fan of overloading when the two methods do completely different things &#8211; it&#8217;s confusing and error-prone &#8211; but I digress. Both of the other methods are inline:</p>
<pre>
inline void SetInsideViewFrustum(bool insideVF)
{
    mInsideViewFrustum = insideVF;
}

inline bool IsInsideViewFrustum()
{
    return mInsideViewFrustum;
}
</pre>
<p>And both of these get called from the outside, so we can&#8217;t simply nuke them. However, lucky for us, these dependencies don&#8217;t go very far upstream in the call graph at all. So let&#8217;s have a look at where our three frustum cull-related functions get called. First, the function that updates our visibility state. Turns out there&#8217;s only two callers. Let&#8217;s look at the first one:</p>
<pre>
void AABBoxRasterizerSSEST::IsInsideViewFrustum(CPUTCamera *pCamera)
{
    mpCamera = pCamera;
    for(UINT i = 0; i &lt; mNumModels; i++)
    {
        mpTransformedAABBox[i].IsInsideViewFrustum(mpCamera);
    }
}
</pre>
<p>Straightforward enough. The second one is in the class <code>AABBoxRasterizerSSEMT</code>, which does the exact same thing with some additional setup to figure out which part of the model list each task needs to process (it&#8217;s multi-threaded, as the name suggests). Both classes derive from the base class <code>AABBoxRasterizer</code>, which holds a bunch of things common to both the single- and multi-threaded implementations, including the array of <code>TransformedAABBoxSSE</code>s.</p>
<p>Because there&#8217;s first a global frustum culling pass on multiple threads, which is only then followed by a second pass that looks at the results, we can&#8217;t simply get rid of the per-model bookkeeping: it&#8217;s actual state. Let&#8217;s look at the callers of the no-parameters version of <code>IsInsideViewFrustum</code> to figure out where that state is read:</p>
<pre>
void AABBoxRasterizerSSEST::TransformAABBoxAndDepthTest()
{
    mDepthTestTimer.StartTimer();

    for(UINT i = 0; i &lt; mNumModels; i++)
    {
        mpVisible[i] = false;
        mpTransformedAABBox[i].SetVisible(&amp;mpVisible[i]);
	
        if(mpTransformedAABBox[i].IsInsideViewFrustum() &amp;&amp;
           !mpTransformedAABBox[i].IsTooSmall(
               mViewMatrix, mProjMatrix, mpCamera))
        {
            mpTransformedAABBox[i].TransformAABBox();
            mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(
                mpRenderTargetPixels);
        }		
    }
    mDepthTestTime[mTimeCounter++] = mDepthTestTimer.StopTimer();
    mTimeCounter = mTimeCounter &gt;= AVG_COUNTER ? 0 : mTimeCounter;
}
</pre>
<p>And again, there&#8217;s a multi-threaded version that does pretty much the same, and no other callers. </p>
<p>Finally, searching for callers to <code>SetInsideViewFrustum</code> turns up exactly one hit, an inline function in <code>AABBoxRasterizerSSE</code>:</p>
<pre>
inline void ResetInsideFrustum()
{
    for(UINT i = 0; i &lt; mNumModels; i++)
    {
        mpTransformedAABBox[i].SetInsideViewFrustum(true);
    }
}
</pre>
<p>As far as dataflow expeditions go, this one was pretty much as tame as it gets: it&#8217;s all concentrated in a few source files, among functions that are directly related and are at similar levels of the call tree. Refactoring this won&#8217;t be hard at all. Mind you, this is pretty much the best-case result &#8211; we got off lightly. In a lot of codebases, doing this kind of thing will quickly lead you to realize that the banana you&#8217;re interested in <a href="http://www.johndcook.com/blog/2011/07/19/you-wanted-banana/">and the gorilla holding it</a> are very tightly coupled. But now that we&#8217;ve determined that&#8217;s not the case, how do we rearrange things for better cache efficiency?</p>
<h3>Shuffling data around</h3>
<p>As we just saw, <code>AABBoxRasterizerSSE</code> and its subclasses are clearly in charge of running the whole frustum culling operation. Not only do they trigger the frustum culling computation, they also hold the array of bounding boxes, and they&#8217;re the only ones who actually look at the frustum culling results. That suggests that <code>AABBoxRasterizerSSE</code> is the natural place to put our frustum culling state. So let&#8217;s add an array of <code>bool</code>s for the visibility state of the boxes, and make it parallel to the array we already have:</p>
<pre>
class AABBoxRasterizerSSE : public AABBoxRasterizer
{
  // ...
  TransformedAABBoxSSE *mpTransformedAABBox;
  bool *mpBBoxVisible; // &lt;--- this is new
  // ...
};
</pre>
<p>This needs to be allocated and freed, but all of that is perfectly routine, so I won&#8217;t go into it. And once we&#8217;ve added it, we have a fairly simple plan of attack:</p>
<ul>
<li>Replace all calls to <code>mpTransformedAABBox[i].IsInsideViewFrustum()</code> (the version without arguments) by <code>mpBBoxVisible[i]</code>.</li>
<li>Similarly, replace calls to <code>SetInsideViewFrustum</code> by the corresponding assignment.</li>
<li>Instead of writing the culling state to a member variable, have <code>IsInsideViewFrustum(camera)</code> (the update version) return the frustum culling state, and write it to the corresponding slot in <code>mpBBoxVisible</code> at the call site.</li>
<li>Get rid of <code>TransformedAABBoxSSE::mInsideViewFrustum</code> now that it&#8217;s unreferenced.</li>
</ul>
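<p>A minimal model of where those four steps lead (hedged sketch with hypothetical, stripped-down types &#8211; the real classes carry plenty of other state and the real test checks all six frustum planes):</p>

```cpp
#include <cassert>

// Sketch, not the sample's actual code: visibility lives in a dense
// array owned by the rasterizer (mpBBoxVisible in the post), and the
// per-box update *returns* its result instead of storing it.
struct Frustum { float minX; }; // stand-in: "visible" means center.x >= minX

struct Box {
    float centerX;
    // update version: returns the culling result, writes no member
    bool IsInsideViewFrustum(const Frustum& f) const {
        return centerX >= f.minX;
    }
};

struct Rasterizer {
    enum { kNumModels = 3 };
    Box boxes[kNumModels];
    bool bboxVisible[kNumModels]; // parallel array, replaces the member flag

    void CullAll(const Frustum& f) {
        for (unsigned i = 0; i < kNumModels; i++)
            bboxVisible[i] = boxes[i].IsInsideViewFrustum(f); // written at the call site
    }
    void ResetInsideFrustum() { // replaces per-box SetInsideViewFrustum(true)
        for (unsigned i = 0; i < kNumModels; i++)
            bboxVisible[i] = true;
    }
};
```

<p>Reads that used to be <code>mpTransformedAABBox[i].IsInsideViewFrustum()</code> turn into plain <code>bboxVisible[i]</code> lookups, which never touch the box object at all.</p>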
<p>Each of these items results in a handful of changes; the complete diff is <a href="https://github.com/rygorous/intel_occlusion_cull/commit/28e18336b1ae054e5afca0f03bcc8039163ed2de">here</a>, for the curious.</p>
<p>And presto, we have a densely packed visibility state array (well, not that densely packed, since we still use a whole byte to store what&#8217;s effectively a 1-bit flag, but you get the idea). By itself, that won&#8217;t buy us much in the frustum culling pass, although it&#8217;s likely to make the later pass that checks for visible boxes faster, since we now never need to fetch the whole <code>TransformedAABBoxSSE</code> from memory if it was frustum culled.</p>
<p>But we can now turn the crank one more time and do the same with the world-space bounding boxes, creating yet another array held by <code>AABBoxRasterizerSSE</code>. We also move the actual visibility test to <code>AABBoxRasterizerSSE</code> (since the test function is a one-liner, that&#8217;s a simple change to make), wrap it inside a loop (since we&#8217;re always going to be culling a group of models), and call it from the two original frustum-culling loops in the single-threaded and multi-threaded rasterizer variants with the correct loop bounds. All of this is in <a href="https://github.com/rygorous/intel_occlusion_cull/commit/bd29f465c1f607e9e13a9df37d4fb5351877f66a">this commit</a> &#8211; as you can see, again it turns out to be mostly small changes.</p>
<p>Finally, for bonus points, we do some cleanup and remove the now-unnecessary fields and methods from <code>TransformedAABBoxSSE</code>. That&#8217;s in <a href="https://github.com/rygorous/intel_occlusion_cull/commit/0a82ba4330afb718836a4667d154a6f943f12e65">this commit</a>.</p>
<p>And just like that, we have our bounding boxes densely packed in a nice linear array, and the output visibility flags densely packed in another array. No more reading a whole cache line to only use 25 bytes &#8211; this time, we look at everything in the cache lines we access, and we access it all sequentially. That should result in better cache hit rates, lower memory bandwidth usage, and generally better performance. But how much does it actually buy us? Let&#8217;s find out!</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/hotspots_frustum_dense.png"><img loading="lazy" data-attachment-id="1393" data-permalink="https://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/hotspots_frustum_dense/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_frustum_dense.png" data-orig-size="497,393" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Frustum culling, densely packed" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_frustum_dense.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_frustum_dense.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/hotspots_frustum_dense.png?w=497&#038;h=393" alt="Frustum culling, densely packed" width="497" height="393" class="aligncenter size-full wp-image-1393" srcset="https://fgiesen.files.wordpress.com/2013/02/hotspots_frustum_dense.png 497w, https://fgiesen.files.wordpress.com/2013/02/hotspots_frustum_dense.png?w=150&amp;h=119 150w, https://fgiesen.files.wordpress.com/2013/02/hotspots_frustum_dense.png?w=300&amp;h=237 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Whoa &#8211; almost down to a third of what we had before we started (for the record, the last few times, I&#8217;ve tried to keep run lengths roughly consistent so we can actually compare the cycles directly). Our CPI rate is down below 0.5 &#8211; meaning we run at over two instructions executed per clock cycle, sustained, through the whole loop. Those pesky L3 cache misses? Gone completely. And we seem to be surrounded by a lot of functions we haven&#8217;t seen before in this series, because by now we&#8217;re at rank 20 in the hot spots list &#8211; down by another 10 positions! (But wait, is that tan() right below us? <a href="https://fgiesen.wordpress.com/2010/10/21/finish-your-derivations-please/">What the hell is that doing there&#8230;</a> ah well, never mind).</p>
<p>When people tell you that you should optimize for cache usage patterns above all else, <em>this</em> is what they mean.</p>
<p>Well, even before we started, the frustum culling performance was good enough that there was no pressing need to deal with it immediately. At this point, it&#8217;s fast enough that we should really focus our attention elsewhere; there are bigger fish to fry. But then again&#8230; we seem to be on a winning streak, so why stop now? Let&#8217;s aim for some extra credits and see if we can push it a bit further.</p>
<h3>Up To Eleven</h3>
<p>Now, since I&#8217;m cropping the screenshots heavily to make them fit in the blog layout, you can&#8217;t see what I see. For all the screenshots we&#8217;ve seen so far, I&#8217;ve always made the columns narrow and sorted them so that whatever I want to show you happens to be next to the labels. But what you actually get out of the &#8220;General Exploration&#8221; analysis I&#8217;ve had VTune run is more than 20 columns worth of various counters. So for most of the functions on the screen, there&#8217;s a bunch of other blue bars and counters that I haven&#8217;t shown you, representing various kinds of bottlenecks.</p>
<p>So you can&#8217;t see what I see, namely: absolutely nothing next to <code>CalcInsideFrustum</code>. In short, there&#8217;s nothing significant left to be gained by modifying data layout or implementation details. This code runs as smoothly as code can be expected to run. If we want to make things go faster still, we actually have to do less work.</p>
<p>Luckily, there&#8217;s still one source of inefficiency in the current algorithm: we pass in one box at a time, and test it against all 6 frustum planes. Now, this code uses SSE to test against 4 planes simultaneously, so it&#8217;s a fairly decent implementation. But the second half of the test only gives us 2 more planes; the other 2 SIMD lanes are wasted.</p>
<p>This can be fixed by turning the packing around: instead of testing one box against groups of four planes at a time, we test groups of four boxes against one plane at a time. Because we have a lot more boxes than we have planes, that means we have a lot less wasted work overall, at least potentially: the old test always checks one box against 8 planes, of which we actually care about 6. That means 6/8=75% of the computations done are useful. If we instead test groups of four boxes at a time, we run at perfect utilization except for the very last group, which might have less than 4 boxes in it if our total number of boxes is not divisible by four.</p>
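<p>To give a flavor of what that looks like, here&#8217;s a hedged sketch with made-up names (not the sample&#8217;s actual code): store the centers and half-extents of four boxes in structure-of-arrays form, splat one plane across all four lanes, and run the usual center-distance-versus-projected-radius plane test on all four boxes at once.</p>

```cpp
#include <xmmintrin.h>
#include <cmath>

// Sketch only; names are hypothetical. A plane is stored as (n, d)
// with "inside" meaning dot(n, p) + d >= 0.
struct Plane { float nx, ny, nz, d; };

// One group of four AABBs in structure-of-arrays form, so four centers
// (or four half-extents) load straight into one SSE register.
struct BoxGroup4 {
    float cx[4], cy[4], cz[4]; // box centers
    float hx[4], hy[4], hz[4]; // box half-extents
};

// Returns a 4-bit mask; bit i is set if box i is at least partially
// on the inside of the plane. All four SIMD lanes do useful work.
static int InsideMask(const BoxGroup4& g, const Plane& p)
{
    __m128 cx = _mm_loadu_ps(g.cx), cy = _mm_loadu_ps(g.cy), cz = _mm_loadu_ps(g.cz);
    __m128 hx = _mm_loadu_ps(g.hx), hy = _mm_loadu_ps(g.hy), hz = _mm_loadu_ps(g.hz);

    // signed distance of each box center from the plane
    __m128 dist = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(cx, _mm_set1_ps(p.nx)),
                   _mm_mul_ps(cy, _mm_set1_ps(p.ny))),
        _mm_add_ps(_mm_mul_ps(cz, _mm_set1_ps(p.nz)),
                   _mm_set1_ps(p.d)));

    // projected "radius" of each box along the plane normal
    __m128 radius = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(hx, _mm_set1_ps(std::fabs(p.nx))),
                   _mm_mul_ps(hy, _mm_set1_ps(std::fabs(p.ny)))),
        _mm_mul_ps(hz, _mm_set1_ps(std::fabs(p.nz))));

    // box i is (at least partially) inside if dist[i] >= -radius[i]
    __m128 inside = _mm_cmpge_ps(dist, _mm_sub_ps(_mm_setzero_ps(), radius));
    return _mm_movemask_ps(inside);
}
```

<p>With six frustum planes, each group of four boxes then takes exactly six of these tests at full lane utilization, instead of two 4-plane tests per box with two lanes wasted.</p>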
<p>Of course, to do this, we need to reorder our box structures so we can grab those four boxes efficiently. Given that the original goal of this post was to be shorter than the other ones and I&#8217;m already above 2300 words, I&#8217;m not going to delve into the details here, but again, you can just <a href="https://github.com/rygorous/intel_occlusion_cull/commit/34d60ce0fc8d5409784d26b19c210d1f0033da81">look at the code</a>. So, does it help?</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/hotspots_packetize.png"><img loading="lazy" data-attachment-id="1396" data-permalink="https://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/hotspots_packetize/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_packetize.png" data-orig-size="497,209" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="After changing packing scheme" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_packetize.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_packetize.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/hotspots_packetize.png?w=497&#038;h=209" alt="After changing packing scheme" width="497" height="209" class="aligncenter size-full wp-image-1396" srcset="https://fgiesen.files.wordpress.com/2013/02/hotspots_packetize.png 497w, https://fgiesen.files.wordpress.com/2013/02/hotspots_packetize.png?w=150&amp;h=63 150w, https://fgiesen.files.wordpress.com/2013/02/hotspots_packetize.png?w=300&amp;h=126 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>You bet. In fact, if you compare the numbers, we come pretty close to the 1.33x speedup you would expect when increasing utilization from 75% to near 100%. However, as you can see, our clocks per instruction went up again, and so did our L3 misses. That&#8217;s because we&#8217;re now starting to outrun the cache prefetcher again.</p>
<p>Now, I have a processor with AVX support, and if we were compute limited, we could try using 8-wide SIMD instead of 4-wide SIMD. But considering that we already seem to be processing data faster than we can fetch it, there&#8217;s not much point to it. I tried it anyway to be certain, and sure enough, it&#8217;s really mostly a way of turning code with slightly too little computation per data item into code with far too little computation per data item. Now, given what I saw in that code, I believe that things might look slightly different in x64 mode, where we get 8 more YMM registers that this code could really make great use of, but I didn&#8217;t look into it; this post has gone on for long enough already.</p>
<h3>Conclusions</h3>
<p>I still stand by what I said in my previous post, namely that you don&#8217;t need to go full-on Data-Oriented Design to get good performance on modern CPUs. But all that said, if you&#8217;re willing to put in the effort, it definitely does pay off: we got a 3.33x speedup <em>on code that was already using SSE to begin with</em>. Stop counting ALU cycles, people. As this series should have shown you by now, it&#8217;s really not so much about what happens when your code runs &#8211; it&#8217;s about getting rid of the things that make it grind to a halt. As you just saw, data density makes a <em>huge</em> difference in cache efficiency (and hence execution times), and the widest ALUs in the world won&#8217;t do you any good if you can&#8217;t keep them fed.</p>
<p>And on that note, I&#8217;m gonna let this particular pipeline drain over the weekend so you have some time to let it all settle :). See you next time!</p>
]]></html><thumbnail_url><![CDATA[https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[]]></thumbnail_width><thumbnail_height><![CDATA[]]></thumbnail_height></oembed>