<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Fixing cache issues, the lazy&nbsp;way]]></title><type><![CDATA[link]]></type><html><![CDATA[<p><em>This post is part of a series &#8211; go <a href="https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>Last time, we ended on a bit of a cliffhanger. We&#8217;ll continue right where we left off, but first, I want to get a few things out of the way.</p>
<p>First, a lot of people have been asking me what profiler I&#8217;ve been using for the various screenshots. So, for the record, the answer is that all these measurements were done using the current version of Intel VTune. VTune has a bit of a learning curve, and if I just want to quickly figure out why something is slow I prefer other tools. But if you&#8217;re trying to figure out what&#8217;s really going on while your code is running, you want something that supports the CPU&#8217;s internal event-based sampling counters and can do the proper analysis, and VTune does. If on the other hand you just want to take a quick peek to figure out why something is slow (which is most of the time while you&#8217;re not fine-tuning), I suggest you start with <a href="http://www.codersnotes.com/sleepy">Very Sleepy</a> &#8211; it&#8217;s free, small and easy to use.</p>
<p>Next, some background on why I&#8217;m writing this: I do not intend to badmouth the sample project I&#8217;ve been using for this series, nor do I want to suggest it&#8217;s doing anything particularly stupid. As you might have noticed, both the write-combining issues and the involuntary sharing we saw last post boiled down to two-line fixes in the source. These kinds of things just happen as code gets written and modified, particularly if there&#8217;s deadlines involved. In fact, I&#8217;m using this example precisely because the problems I&#8217;ve found in it are so very typical: I have run into all of these problems before on other projects, and I assume so have most engine programmers who&#8217;ve shipped a game. That&#8217;s exactly why I think this is worth writing down: so that people who don&#8217;t have much optimization experience and are running into performance problems know what to look for.</p>
<p>Third, lest you get a false impression: I&#8217;m in a comfortable position here &#8211; I spent two weekends (and the equivalent of maybe 3 days worth of full-time work) looking at the code, profiling and tweaking it. And of course, I&#8217;m only writing up the changes that worked. You don&#8217;t get to see the false starts and all the ideas that didn&#8217;t pan out. Nor am I presenting my changes in chronological order: as you can see in the <a href="https://github.com/rygorous/intel_occlusion_cull/commits/dev">Github repository</a>, in fact I did the SSE version of <code>CPUTFrustum::IsVisible</code> a whole day before I found the sharing issues that were the actual bottleneck. With 20-20 hindsight, I get to present changes in order of impact, but that&#8217;s not how it plays out in practice. The whole process is a lot messier (and less deterministic) than it may seem in these posts.</p>
<p>And with all that out of the way, let&#8217;s look at cache effects.</p>
<h3>Previously&#8230;</h3>
<p>&#8230;we looked at the frustum culling code in <a href="http://software.intel.com/en-us/vcsource/samples/software-occlusion-culling">Intel&#8217;s Software Occlusion Culling sample</a>. The last blog post ended with me showing this profile:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png"><img loading="lazy" data-attachment-id="1356" data-permalink="https://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share/hotspots_isinside_slower/" data-orig-file="https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png" data-orig-size="497,238" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="And the bottleneck has moved" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png?w=497" src="https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png?w=497&#038;h=238" alt="And the bottleneck has moved" width="497" height="238" class="aligncenter size-full wp-image-1356" srcset="https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png 497w, https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png?w=150&amp;h=72 150w, https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png?w=300&amp;h=144 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>and explaining that the actual issue is triggered by this (inlined) function:</p>
<pre>
void TransformedAABBoxSSE::IsInsideViewFrustum(CPUTCamera *pCamera)
{
    float3 mBBCenterWS;
    float3 mBBHalfWS;
    mpCPUTModel-&gt;GetBoundsWorldSpace(&amp;mBBCenterWS, &amp;mBBHalfWS);
    mInsideViewFrustum = pCamera-&gt;mFrustum.IsVisible(mBBCenterWS,
        mBBHalfWS);
}
</pre>
<p>which spends a considerable amount of time missing the cache while trying to read the world-space bounding box from <code>mpCPUTModel</code>. Well, I didn&#8217;t actually back that up with any data yet. As you can see in the above profile, we spend about 13.8 billion cycles total in <code>AABBoxRasterizerSSEMT::IsInsideViewFrustum</code> in the profile. Now, if you look at the actual assembly code, you&#8217;ll notice that a majority of them are actually spent in a handful of instructions:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/cycles_load.png"><img loading="lazy" data-attachment-id="1366" data-permalink="https://fgiesen.wordpress.com/2013/02/01/fixing-cache-issues-the-lazy-way/cycles_load/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/cycles_load.png" data-orig-size="497,260" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="The code in question" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/cycles_load.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/cycles_load.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/cycles_load.png?w=497&#038;h=260" alt="The code in question" width="497" height="260" class="aligncenter size-full wp-image-1366" srcset="https://fgiesen.files.wordpress.com/2013/02/cycles_load.png 497w, https://fgiesen.files.wordpress.com/2013/02/cycles_load.png?w=150&amp;h=78 150w, https://fgiesen.files.wordpress.com/2013/02/cycles_load.png?w=300&amp;h=157 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>As you can see, about 11.7 of our 13.8 billion cycles are get counted on a mere two instructions. The column right next to the cycle counts is the number of &#8220;last-level cache&#8221; (L3 cache) misses. It doesn&#8217;t take a genius to figure out that we might be running into cache issues here.</p>
<p>The code you&#8217;re seeing is inlined from <code>CPUTModel::GetBoundsWorldSpace</code>, and simply copies the 6 floats describing the bounding box center and half-extents from the model into the two provided locations. That&#8217;s all this fragment of code does. Well, the member variables of <code>CPUTModel</code> are one indirection through <code>mpCPUModelT</code> away, and clearly, following that pointer seems to result in a lot of cache misses. In turns out that this is the only time anyone ever looks at data from <code>CPUTModel</code> during the frustum-culling pass. Now, what we really want is the 24 bytes worth of bounding box that we&#8217;re going to read. But CPU cores fetch data in units of cache lines, which is 64 bytes on current x86 processors. So best case, we&#8217;re going to get 24 bytes worth of data that we care about, and 40 bytes of data that we don&#8217;t. If we&#8217;re unlucky and the box crosses a cache line boundary, we might even end up fetching 128 bytes. And because it&#8217;s behind some arbitrary pointer, the processor can&#8217;t easily do tricks like automated memory prefetching that reduce the cost of memory accesses: prefetching requires predictable access patterns, and following pointers isn&#8217;t very predictable &#8211; not to the CPU core, anyway.</p>
<p>At this point, you might decide to rewrite the whole thing to have more coherent access patterns. Now the frustum culling loop actually isn&#8217;t that complicated, and rewriting it (and changing the data structures to be more cache-friendly) wouldn&#8217;t take very long, but for new let&#8217;s suppose we don&#8217;t know that. Is there any way incremental, less error-prone way to give us a quick speed boost, and maybe get us in a better position should we choose to change the frustum culling code later?</p>
<h3>Making those prefetchers work</h3>
<p>Of course there is, or I wouldn&#8217;t be asking. The key realization is that the outer loop in <code>AABBoxRasterizerSSEMT::IsInsideViewFrustum</code> actually traverses an array of bounding boxes (type <code>TransformedAABBoxSSE</code>) in order:</p>
<pre>
for(UINT i = start; i &lt; end; i++)
{
    mpTransformedAABBox[i].IsInsideViewFrustum(mpCamera);
}
</pre>
<p>One linear traversal is all we need. We know that the hardware prefetcher is going to load that ahead for us &#8211; and by now, they&#8217;re smart enough to do that properly even if our accesses are strided, that is, we don&#8217;t read all the data between the start and the end of the array, but only some of them with a regular spacing. This means that if we can get those world-space bounding boxes into <code>TransformedAABBoxSSE</code>, they&#8217;ll automatically get prefetched for us. And it turns out that in this example, all models are at a fixed position &#8211; we can determine the world-space bounding boxes once, at load time. Let&#8217;s look at our function again:</p>
<pre>
void TransformedAABBoxSSE::IsInsideViewFrustum(CPUTCamera *pCamera)
{
    float3 mBBCenterWS;
    float3 mBBHalfWS;
    mpCPUTModel-&gt;GetBoundsWorldSpace(&amp;mBBCenterWS, &amp;mBBHalfWS);
    mInsideViewFrustum = pCamera-&gt;mFrustum.IsVisible(mBBCenterWS,
        mBBHalfWS);
}
</pre>
<p>Here&#8217;s the punch line: all we really have to do is promote these two variables from locals to member variables, and move the <code>GetBoundsWorldSpace</code> call to init time. Sure, it&#8217;s a bit crude, and it leads to data duplication, but on the plus side, this is a really easy thing to try &#8211; just move a few lines of code around. If it pans out, we can always do it cleaner later. Which leaves the question &#8211; <em>does</em> it pan out?</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/hotspots_bbox_inline.png"><img loading="lazy" data-attachment-id="1370" data-permalink="https://fgiesen.wordpress.com/2013/02/01/fixing-cache-issues-the-lazy-way/hotspots_bbox_inline/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_bbox_inline.png" data-orig-size="497,262" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Hotspots after inlining bounding box data" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_bbox_inline.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_bbox_inline.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/hotspots_bbox_inline.png?w=497&#038;h=262" alt="Hotspots after inlining bounding box data" width="497" height="262" class="aligncenter size-full wp-image-1370" srcset="https://fgiesen.files.wordpress.com/2013/02/hotspots_bbox_inline.png 497w, https://fgiesen.files.wordpress.com/2013/02/hotspots_bbox_inline.png?w=150&amp;h=79 150w, https://fgiesen.files.wordpress.com/2013/02/hotspots_bbox_inline.png?w=300&amp;h=158 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Of course it does &#8211; I get to cheat and only write about the changes that work, remember? As you see, now the clock cycles are back in <code>CPUTFrustum::IsVisible</code>. This is not because it&#8217;s gotten mysteriously slower, it&#8217;s because <code>IsInsideViewFrustum</code> doesn&#8217;t copy any data anymore, so <code>IsVisible</code> is the first function to look at the bounding box cache lines now. Which means that it gets billed for those cache misses now.</p>
<p>It&#8217;s still not great (I&#8217;ve included the Clocks Per Instruction Rate again so we can see where we stand), but we&#8217;re clearly making progress: compared to the first profile at the top of this post, which has a similar total cycle count, we&#8217;re very roughly twice as fast &#8211; and that&#8217;s for <code>IsVisible</code>, which includes not just the cache misses but also the actual frustum culling work. Meanwhile, <code>AABBoxRasterizerSSEMT::IsInsideViewFrustum</code>, now really just a loop, has dropped well out of the top 20 hot spots, as it should. Pretty good for just moving a couple of lines of code around.</p>
<h3>Order in the cache!</h3>
<p>Okay, our quick fix got the HW prefetchers to work for us, and clearly that gave us a considerable improvement. But we still only need 24 bytes out of every <code>TransfomedAABBoxSSE</code>. How big are they? Let&#8217;s have a look at the data members (methods elided):</p>
<pre>
class TransformedAABBoxSSE
{
    // Methods elided

    CPUTModelDX11 *mpCPUTModel;
    __m128 *mWorldMatrix;
    __m128 *mpBBVertexList;
    __m128 *mpXformedPos;
    __m128 *mCumulativeMatrix; 
    UINT    mBBIndexList[AABB_INDICES]; /* 36 */
    bool   *mVisible;
    bool    mInsideViewFrustum;
    float   mOccludeeSizeThreshold;
    bool    mTooSmall;
    __m128 *mViewPortMatrix; 

    float3 mBBCenter;
    float3 mBBHalf;
    float3 mBBCenterWS;
    float3 mBBHalfWS;
};
</pre>
<p>In a 32-bit environment, that gives us 226 bytes of payload per BBox (the actual size is a bit more, due to alignment padding). Of these 226 bytes, for the frustum culling, we actually read 24 bytes (<code>mBBCenterWS</code> and <code>mBBHalfWS</code>) and write one (<code>mInsideViewFrustum</code>). That&#8217;s a pretty bad ratio, and there&#8217;s a lot of memory wasting going on, but for the purposes of caching, we only pay for what we actually read, and that&#8217;s not much. That said, even though we don&#8217;t access it here, the biggest chunk of data in the whole thing is <code>mBBIndexList</code> at 144 bytes, which is just a list of triangle indices for this BBox. That&#8217;s completely unnecessary, since that list is going to be the same for every single BBox in the system. So let&#8217;s fix that one and reorder some of the other fields so that the members we&#8217;re going to access during frustum culling are close by each other (and hence more likely to hit the same cache line):</p>
<pre>
class TransformedAABBoxSSE
{
    // Methods elided

    CPUTModelDX11 *mpCPUTModel;
    __m128 *mWorldMatrix;
    __m128 *mpBBVertexList;
    __m128 *mpXformedPos;
    __m128 *mCumulativeMatrix; 
    bool   *mVisible;
    float   mOccludeeSizeThreshold;
    __m128 *mViewPortMatrix; 

    float3 mBBCenter;
    float3 mBBHalf;
    bool   mInsideViewFrustum;
    bool   mTooSmall;
    float3 mBBCenterWS;
    float3 mBBHalfWS;
};
</pre>
<p>Note that we&#8217;re writing <code>mInsideViewFrustum</code> right after we read the bounding boxes, so it makes sense to make them adjacent. I put the fields between the object-space and the world-space bounding box simply because the object-space bounding box is reasonably large (24 bytes, about a third of a cache line) and having it between the flags and the box greatly increases our chance of having to fetch two cache lines not one per box.</p>
<p>So, did it help?</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png"><img loading="lazy" data-attachment-id="1372" data-permalink="https://fgiesen.wordpress.com/2013/02/01/fixing-cache-issues-the-lazy-way/hotspots_data_density/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png" data-orig-size="497,301" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Hotspots with improved data density" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=497&#038;h=301" alt="Hotspots with improved data density" width="497" height="301" class="aligncenter size-full wp-image-1372" srcset="https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png 497w, https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=150&amp;h=91 150w, https://fgiesen.files.wordpress.com/2013/02/hotspots_data_density.png?w=300&amp;h=182 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Sure did. <code>IsVisible</code> is down to the number 10 spot, and the CPI Rate is down to an acceptable 1.042 clocks/instruction. Now that&#8217;s by no means the end of the line, but I want to make this clear: all I did here was factor out one common array to be a shared <code>static const</code> variable, and reorder some class members. That&#8217;s it. If you don&#8217;t count the initializers for the 36-element index list (which I&#8217;ve copied with comments and generous spacing, so it&#8217;s a few lines long), we&#8217;re talking less than 10 lines of code changed for all the improvements in this post. Total.</p>
<p>In the last few years, there&#8217;s been a push by several prominent game developers to &#8220;Data-Oriented Design&#8221;, which emphasizes structuring code around desired data-flow patterns, rather than the other way round. That&#8217;s a sound design strategy particularly for subsystems like the one we&#8217;re looking at. It&#8217;s also a good guideline for what you want to work <em>towards</em> when refactoring existing code. But the point I want to make here is that even when trying to optimize existing code within its existing environment, you can achieve substantial gains by a sequence of simple, localized improvements. That will only get you so far, but there&#8217;s a lot to be said for incremental techniques, especially if you&#8217;re just trying to hit a given performance goal in a limited time budget.</p>
<p>And that&#8217;s it for today. I might do another post on the frustum culling (I want it gone from the top 10 completely!), or I might turn to the actual rasterizer code next for a change of pace &#8211; haven&#8217;t decided yet. Until next time!</p>
]]></html><thumbnail_url><![CDATA[https://fgiesen.files.wordpress.com/2013/01/hotspots_isinside_slower.png?fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[]]></thumbnail_width><thumbnail_height><![CDATA[]]></thumbnail_height></oembed>