<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[The care and feeding of worker threads, part&nbsp;1]]></title><type><![CDATA[link]]></type><html><![CDATA[<p><em>This post is part of a series &#8211; go <a href="https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>It&#8217;s time for another post! After all the time I&#8217;ve spent on squeezing about 20% out of the depth rasterizer, I figured it was time to change gears and look at something different again. But before we get started on that new topic, there&#8217;s one more set of changes that I want to talk about.</p>
<h3>The occlusion test rasterizer</h3>
<p>So far, we&#8217;ve mostly been looking at one rasterizer only &#8211; the one that actually renders the depth buffer we cull against, and even more precisely, only the multi-threaded SSE version of it. But the occlusion culling demo has two sets of rasterizers: the other set is used for the occlusion tests. It renders bounding boxes for the various models to be tested and checks whether they are fully occluded. Check out the <a href="https://github.com/rygorous/intel_occlusion_cull/blob/4c64fd75/SoftwareOcclusionCulling/TransformedAABBoxSSE.cpp#L165">code</a> if you&#8217;re interested in the details.</p>
<p>This is basically the same rasterizer that we already talked about. In the previous two posts, I talked about optimizing the depth buffer rasterizer, but most of the same changes apply to the test rasterizer too. It didn&#8217;t make sense to talk through the same thing again, so I took the liberty of just making the same changes (with some minor tweaks) to the test rasterizer &#8220;off-screen&#8221;. So, just a heads-up: the test rasterizer has changed while you weren&#8217;t looking &#8211; unless you closely watch the Github repository, that is.</p>
<p>And now that we&#8217;ve established that there&#8217;s another inner loop we ought to be aware of, let&#8217;s zoom out a bit and look at the bigger picture.</p>
<h3>Some open questions</h3>
<p>There&#8217;s two questions you might have if you&#8217;ve been following this series closely so far. The first concerns a very visible difference between the depth and test rasterizers that you might have noticed if you ran the code. It&#8217;s also visible in the data in <a href="https://fgiesen.wordpress.com/2013/02/11/depth-buffers-done-quick-part/">&#8220;Depth buffers done quick, part 1&#8221;</a>, though I didn&#8217;t talk about it at the time. I&#8217;m talking, of course, about the large standard deviation we get for the execution time of the occlusion tests. Here&#8217;s a set of measurements for the code right after bringing the test rasterizer up to date:</p>
<table>
<tr>
<th>Pass</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Render depth</td>
<td>2.666</td>
<td>2.716</td>
<td>2.732</td>
<td>2.745</td>
<td>2.811</td>
<td>2.731</td>
<td>0.022</td>
</tr>
<tr>
<td>Occlusion test</td>
<td>1.335</td>
<td>1.545</td>
<td>1.587</td>
<td>1.631</td>
<td>1.761</td>
<td>1.585</td>
<td>0.066</td>
</tr>
</table>
<p>Now, the standard deviation actually got a fair bit lower with the rasterizer changes (originally, we were well above 0.1ms), but it&#8217;s still surprisingly large, especially considering that the occlusion tests run roughly half as long (in terms of wall-clock time) as the depth rendering. And there&#8217;s also a second elephant in the room that&#8217;s been staring us in the face for quite a while. Let me recycle one of the VTune screenshots from last time:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png"><img loading="lazy" data-attachment-id="1689" data-permalink="https://fgiesen.wordpress.com/2013/02/16/depth-buffers-done-quick-part-2/hotspots_rast2/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png" data-orig-size="497,205" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Rasterizer hotspots without early-out" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=497&#038;h=205" alt="Rasterizer hotspots without early-out" width="497" height="205" class="aligncenter size-full wp-image-1689" srcset="https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png 497w, https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=150&amp;h=62 150w, https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=300&amp;h=124 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Right there at #4 is some code from <a href="http://threadingbuildingblocks.org/">TBB</a>, namely, what turns out to be the &#8220;thread is idle&#8221; spin loop.</p>
<p>Well, so far, we&#8217;ve been profiling, measuring and optimizing this as if it was a single-threaded application, but it&#8217;s not. The code uses TBB to dispatch tasks to worker threads, and clearly, a lot of these worker threads seem to be idle a lot of the time. But why? To answer that question, we need somewhat different information than what either a normal VTune analysis run or our summary timers give us. We want a detailed breakdown of what happens during a frame. Now, VTune has <em>some</em> support for that (as part of their threading/concurrency profiling), but the UI doesn&#8217;t work well for me, and neither does the visualization; it seems to be geared towards HPC/throughput computing more than latency-sensitive applications like real-time graphics, and it&#8217;s also still based on sampling profiling, which means it&#8217;s low-overhead but fairly limited in the kind of data it can collect.</p>
<p>Instead, I&#8217;m going to go for the shameless plug and use <a href="http://www.radgametools.com/telemetry.htm">Telemetry</a> (full disclosure: I work at RAD). It works like this: I manually instrument the source code to tell Telemetry when certain events are happening, and Telemetry collects that data, sends the whole log to a server and can later visualize it. Most games I&#8217;ve worked on have some kind of &#8220;bar graph profiler&#8221; that can visualize within-frame events, but because Telemetry keeps the whole data stream, it can also be used to answer the favorite question (not!) of engine programmers everywhere: &#8220;Wait, what the hell just happened there?&#8221;. Instead of trying to explain it in words, I&#8217;m just gonna show you the screenshot of my initial profiling run after I hooked up Telemetry and added some basic markup: (Click on the image to get the full-sized version)</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png"><img data-attachment-id="1725" data-permalink="https://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/tmviz_initial/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png" data-orig-size="1920,1040" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Initial Telemetry run" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=1024" src="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=1024&#038;h=554" alt="Initial Telemetry run"   class="aligncenter size-large wp-image-1725" srcset="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=1024&amp;h=554 1024w, https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=150&amp;h=81 150w, https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=300&amp;h=163 300w, https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=768&amp;h=416 768w, https://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png 1920w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></p>
<p>The time axis goes from left to right, and all of the blocks correspond to regions of code that I&#8217;ve marked up. Regions can nest, and when they do, the blocks stack. I&#8217;m only using really basic markup right now, because that turns out to be all we need for the time being. The different tracks correspond to different threads.</p>
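<p>To make the idea of nested, marked-up regions concrete, here is a minimal sketch of a scoped zone recorder. This is <em>not</em> Telemetry&#8217;s API; <code>ZoneLog</code>, <code>ZoneEvent</code> and <code>ScopedZone</code> are hypothetical names made up for illustration, and the sketch assumes one log per thread (it does no locking).</p>

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

// Hypothetical stand-in for an instrumentation library; not Telemetry's API.
struct ZoneEvent
{
    std::string name;                             // label given at the call site
    int depth;                                    // nesting level (0 = outermost)
    std::chrono::steady_clock::duration elapsed;  // time spent inside the zone
};

// One log per thread is assumed, so no synchronization is done here.
struct ZoneLog
{
    int depth = 0;
    std::vector<ZoneEvent> events;
};

// RAII marker: the zone opens in the constructor and closes in the destructor,
// so zones nest exactly the way C++ scopes nest.
class ScopedZone
{
public:
    ScopedZone(ZoneLog& log, const char* name)
        : mLog(log), mName(name), mDepth(log.depth++),
          mStart(std::chrono::steady_clock::now())
    {
    }

    ~ScopedZone()
    {
        mLog.depth--;
        mLog.events.push_back(
            {mName, mDepth, std::chrono::steady_clock::now() - mStart});
    }

    ScopedZone(const ScopedZone&) = delete;
    ScopedZone& operator=(const ScopedZone&) = delete;

private:
    ZoneLog& mLog;
    const char* mName;
    int mDepth;
    std::chrono::steady_clock::time_point mStart;
};
```

<p>A zone is recorded when its scope ends, so inner zones land in the log before the zones that enclose them; the stored depth is what a visualizer would use to stack the blocks.</p>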
<p>As you can see, despite the code using TBB and worker threads, it&#8217;s fairly rare for more than 2 threads to be actually running anything interesting at a time. Also, if you look at the &#8220;Rasterize&#8221; and &#8220;DepthTest&#8221; tasks, you&#8217;ll notice that we&#8217;re spending a fair amount of time just waiting for the last 2 threads to finish their respective jobs, while the other worker threads are idle. That&#8217;s where our variance in latency ultimately comes from &#8211; it all depends on how lucky (or unlucky) we get with scheduling, and the exact scheduling of tasks changes every frame. And now that we&#8217;ve seen how much time the worker threads spend being idle, it also shouldn&#8217;t surprise us that TBB&#8217;s idle spin loop ranked as high as it did in the profile.</p>
<p>What do we do about it, though?</p>
<h3>Let&#8217;s start with something simple</h3>
<p>As usual, we go for the low-hanging fruit first, and if you look at the screenshot I just posted, you&#8217;ll see <em>a lot</em> of blocks (&#8220;zones&#8221;) on the left side of the screen. In fact, the count is much higher than you probably think &#8211; these are LOD zones, which means that Telemetry has grouped a bunch of very short zones into larger groups for the purposes of visualization. As you can see from the mouse-over text, the single block I&#8217;m pointing at with the mouse cursor corresponds to 583 zones &#8211; and each of those zones corresponds to an individual TBB task! That&#8217;s because this culling code uses one TBB task per model to be culled. <em>Ouch.</em> Let&#8217;s zoom in a bit:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png"><img data-attachment-id="1729" data-permalink="https://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/tmviz_occluders_zoomed/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png" data-orig-size="1920,1040" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Telemetry: occluder visibility, zoomed" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=1024" src="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=1024&#038;h=554" alt="Telemetry: occluder visibility, zoomed"   class="aligncenter size-large wp-image-1729" srcset="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=1024&amp;h=554 1024w, https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=150&amp;h=81 150w, https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=300&amp;h=163 300w, https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=768&amp;h=416 768w, https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png 1920w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></p>
<p>Note that even at this zoom level (the whole screen covers about 1.3ms), most zones are <em>still</em> LOD&#8217;d out. I&#8217;ve mouse-over&#8217;ed on a single task that happens to hit one or two L3 cache misses and so is long enough (at about 1500 cycles) to show up individually, but most of these tasks are closer to 600 cycles. In total, frustum culling the approximately 1600 occluder models takes up just above 1ms, as the captions helpfully say. For reference, the much smaller block that says &#8220;OccludeesVisible&#8221; and takes about 0.1ms? That one actually processes about 27000 models (it&#8217;s the code we optimized in <a href="https://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/">&#8220;Frustum culling: turning the crank&#8221;</a>). Again, <em>ouch</em>.</p>
<p>Fortunately, there&#8217;s a simple solution: don&#8217;t use one task per model. Instead, use a smaller number of tasks (I just used 32) that each cover multiple models. The code is fairly obvious, so I won&#8217;t bother repeating it here, but I am going to show you the results:</p>
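<p>For readers who want to see the shape of such a change anyway, here is a minimal sketch of splitting a model list into a fixed number of contiguous chunks. It is not the code from the repository; <code>TaskRange</code> and <code>RunCullTask</code> are hypothetical helpers, and the per-model work is passed in as a callback because the real culling function isn&#8217;t reproduced here.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not from the demo source): returns the half-open range
// [first, last) of item indices that task `taskId` out of `taskCount` should
// process when `numItems` items are split into contiguous chunks.
inline std::pair<std::size_t, std::size_t>
TaskRange(std::size_t numItems, std::size_t taskId, std::size_t taskCount)
{
    assert(taskCount > 0 && taskId < taskCount);
    // Round up so every item is covered; trailing tasks may get fewer items
    // (or none at all if there are more tasks than items).
    const std::size_t perTask = (numItems + taskCount - 1) / taskCount;
    const std::size_t first = std::min(taskId * perTask, numItems);
    const std::size_t last = std::min(first + perTask, numItems);
    return {first, last};
}

// Body of one batched task: runs the per-model work for every model in this
// task's chunk. `cullOneModel` stands in for the actual frustum test.
template <typename PerModelFn>
void RunCullTask(std::size_t numModels, std::size_t taskId,
                 std::size_t taskCount, PerModelFn&& cullOneModel)
{
    const auto range = TaskRange(numModels, taskId, taskCount);
    for (std::size_t i = range.first; i < range.second; ++i)
        cullOneModel(i);
}
```

<p>With roughly 1600 models and 32 tasks, each task would cover 50 models, so the per-task scheduling overhead is paid 32 times per frame instead of about 1600 times.</p>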
<p><a href="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png"><img data-attachment-id="1734" data-permalink="https://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/tmviz_occluders_fixed/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png" data-orig-size="1920,1040" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Telemetry: Occluder culling fixed" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=1024" src="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=1024&#038;h=554" alt="Telemetry: Occluder culling fixed"   class="aligncenter size-large wp-image-1734" srcset="https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=1024&amp;h=554 1024w, https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=150&amp;h=81 150w, https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=300&amp;h=163 300w, https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=768&amp;h=416 768w, https://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png 1920w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></p>
<p>Down from 1ms to 0.08ms in two minutes of work. Now we could apply the same level of optimization as we did to the occludee culling, but I&#8217;m not going to bother, because it&#8217;s fast enough, at least for the time being. And with that out of the way, let&#8217;s look at the rasterization and depth testing part.</p>
<h3>A closer look</h3>
<p>Let&#8217;s look a bit more closely at what&#8217;s going on during rasterization:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png"><img loading="lazy" data-attachment-id="1737" data-permalink="https://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/tmviz_raster_closeup/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png" data-orig-size="497,283" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Rasterization close-up" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png?w=497&#038;h=283" alt="Rasterization close-up" width="497" height="283" class="aligncenter size-full wp-image-1737" srcset="https://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png 497w, https://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png?w=150&amp;h=85 150w, https://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png?w=300&amp;h=171 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>There are at least two noteworthy things clearly visible in this screenshot:</p>
<ol>
<li>There&#8217;s three separate passes &#8211; transform, bin, then rasterize.</li>
<li>For some reason, we seem to have an odd mixture of really long tasks and very short ones.</li>
</ol>
<p>The former shouldn&#8217;t come as a surprise, since it&#8217;s explicit in the code:</p>
<pre>
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::TransformMeshes, this,
    NUM_XFORMVERTS_TASKS, NULL, 0, "Xform Vertices", &amp;mXformMesh);
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::BinTransformedMeshes, this,
    NUM_XFORMVERTS_TASKS, &amp;mXformMesh, 1, "Bin Meshes", &amp;mBinMesh);
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::RasterizeBinnedTrianglesToDepthBuffer, this,
    NUM_TILES, &amp;mBinMesh, 1, "Raster Tris to DB", &amp;mRasterize);	

// Wait for the task set
gTaskMgr.WaitForSet(mRasterize);
</pre>
<p>What the screenshot does show us, however, is the cost of those synchronization points. There sure is a lot of &#8220;air&#8221; in that diagram, and we could get some significant gains from squeezing it out. The second point is more of a surprise though, because the code does in fact try pretty hard to make sure the tasks are evenly sized. There&#8217;s a problem, though:</p>
<pre>
void TransformedModelSSE::TransformMeshes(...)
{
    if(mVisible)
    {
        // compute mTooSmall

        if(!mTooSmall)
        {
            // transform verts
        }
    }
}

void TransformedModelSSE::BinTransformedTrianglesMT(...)
{
    if(mVisible &amp;&amp; !mTooSmall)
    {
        // bin triangles
    }
}
</pre>
<p>Just because we make sure each task handles an equal number of vertices (as happens for the &#8220;TransformMeshes&#8221; tasks) or an equal number of triangles (&#8220;BinTransformedTriangles&#8221;) doesn&#8217;t mean they are similarly-sized, because the work subdivision ignores culling. Evidently, the tasks end up <em>not</em> being uniformly sized &#8211; not even close. Looks like we need to do some load balancing.</p>
<h3>Balancing act</h3>
<p>To simplify things, I moved the computation of <code>mTooSmall</code> from <code>TransformMeshes</code> into <code>IsVisible</code> &#8211; right after the frustum culling itself. That required some shuffling arguments around, but it&#8217;s exactly the kind of thing we already saw in <a href="https://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/">&#8220;Frustum culling: turning the crank&#8221;</a>, so there&#8217;s little point in going over it in detail again.</p>
<p>Once <code>TransformMeshes</code> and <code>BinTransformedTrianglesMT</code> use the exact same condition &#8211; <code>mVisible &amp;&amp; !mTooSmall</code> &#8211; we can determine the list of models that are visible and not too small once, compute how many triangles and vertices these models have in total, and then use these corrected numbers which account for the culling when we&#8217;re setting up the individual transform and binning tasks.</p>
<p>This is easy to do: <code>DepthBufferRasterizerSSE</code> gets a few more member variables</p>
<pre>
UINT *mpModelIndexA; // 'active' models = visible and not too small
UINT mNumModelsA;
UINT mNumVerticesA;
UINT mNumTrianglesA;
</pre>
<p>and two new member functions</p>
<pre>
inline void ResetActive()
{
    mNumModelsA = mNumVerticesA = mNumTrianglesA = 0;
}

inline void Activate(UINT modelId)
{
    UINT activeId = mNumModelsA++;
    assert(activeId &lt; mNumModels1);

    mpModelIndexA[activeId] = modelId;
    mNumVerticesA += mpStartV1[modelId + 1] - mpStartV1[modelId];
    mNumTrianglesA += mpStartT1[modelId + 1] - mpStartT1[modelId];
}
</pre>
<p>that handle the accounting. The depth buffer rasterizer already kept cumulative vertex and triangle counts for all models; I added one more element at the end so I could use the simplified vertex/triangle-counting logic.</p>
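<p>The &#8220;one more element at the end&#8221; trick is easiest to see in isolation. The following is a sketch under assumptions, not the demo&#8217;s code: <code>BuildStartOffsets</code> is a hypothetical helper that turns per-model counts into the kind of cumulative array that <code>mpStartV1</code> and <code>mpStartT1</code> are described as being.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds an exclusive prefix sum with one extra trailing
// element, so that start[i + 1] - start[i] is the count for item i for every
// i, including the last one, without a special case.
inline std::vector<unsigned> BuildStartOffsets(const std::vector<unsigned>& counts)
{
    std::vector<unsigned> start(counts.size() + 1);
    unsigned running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        start[i] = running;   // index of the first element belonging to item i
        running += counts[i];
    }
    start[counts.size()] = running;   // sentinel entry: the grand total
    return start;
}
```

<p>Without the sentinel, the count for the last model would need separate handling; with it, the subtraction in <code>Activate</code> works uniformly for every <code>modelId</code>.</p>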
<p>Then, at the end of the <code>IsVisible</code> pass (after the worker threads are done), I run</p>
<pre>
// Determine which models are active
ResetActive();
for (UINT i=0; i &lt; mNumModels1; i++)
    if(mpTransformedModels1[i].IsRasterized2DB())
        Activate(i);
</pre>
<p>where <code>IsRasterized2DB()</code> is just a predicate that returns <code>mVisible &amp;&amp; !mTooSmall</code> (it was already there, so I used it).</p>
<p>After that, all that remains is distributing work over the active models only, using <code>mNumVerticesA</code> and <code>mNumTrianglesA</code>. This is as simple as turning the original loop in <code>TransformMeshes</code></p>
<pre>
for(UINT ss = 0; ss &lt; mNumModels1; ss++)
</pre>
<p>into</p>
<pre>
for(UINT active = 0; active &lt; mNumModelsA; active++)
{
    UINT ss = mpModelIndexA[active];
    // ...
}
</pre>
<p>and the same for <code>BinTransformedMeshes</code>. All in all, this took me about 10 minutes to write, debug and test. And with that, we should have proper load balancing for the first two passes of rendering: transform and binning. The question, as always, is: does it help?</p>
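<p>To illustrate what &#8220;distributing work over the active models only&#8221; can look like, here is a hedged sketch of how one task might work out which vertices it owns. It is an assumption about the general technique, not a copy of the demo&#8217;s loop: <code>ModelSlice</code> and <code>SlicesForTask</code> are hypothetical, <code>activeModels</code> plays the role of <code>mpModelIndexA</code>, and <code>startV</code> plays the role of <code>mpStartV1</code> (cumulative counts with one extra element at the end).</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical result type: a run of vertices [firstVert, endVert) inside one
// model, with indices relative to that model's first vertex.
struct ModelSlice
{
    unsigned modelId;
    unsigned firstVert;
    unsigned endVert;
};

// Splits the vertices of the active models evenly across `taskCount` tasks
// and returns the slices that task `taskId` has to process. Culled models
// contribute nothing, so they no longer skew the split.
inline std::vector<ModelSlice> SlicesForTask(
    const std::vector<unsigned>& activeModels,
    const std::vector<unsigned>& startV,
    unsigned taskId, unsigned taskCount)
{
    assert(taskCount > 0 && taskId < taskCount);

    // Total vertex count over active models only.
    unsigned total = 0;
    for (unsigned m : activeModels)
        total += startV[m + 1] - startV[m];

    const unsigned perTask = (total + taskCount - 1) / taskCount;
    const unsigned taskBegin = std::min(taskId * perTask, total);
    const unsigned taskEnd = std::min(taskBegin + perTask, total);

    std::vector<ModelSlice> slices;
    unsigned modelBegin = 0;   // position of the model in the active stream
    for (unsigned m : activeModels)
    {
        const unsigned modelEnd = modelBegin + (startV[m + 1] - startV[m]);
        // Intersect this model's span with the task's span.
        const unsigned lo = std::max(modelBegin, taskBegin);
        const unsigned hi = std::min(modelEnd, taskEnd);
        if (lo < hi)
            slices.push_back({m, lo - modelBegin, hi - modelBegin});
        modelBegin = modelEnd;
    }
    return slices;
}
```

<p>The important part is that the total and the per-task share are computed from the active set; if they were computed over all models, a task whose share happened to consist mostly of culled models would finish almost immediately while others did the real work.</p>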
<p><b>Change</b>: Better rendering &#8220;front end&#8221; load balancing</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial depth render</td>
<td>2.666</td>
<td>2.716</td>
<td>2.732</td>
<td>2.745</td>
<td>2.811</td>
<td>2.731</td>
<td>0.022</td>
</tr>
<tr>
<td>Balance front end</td>
<td>2.282</td>
<td>2.323</td>
<td>2.339</td>
<td>2.362</td>
<td>2.476</td>
<td>2.347</td>
<td>0.034</td>
</tr>
</table>
<p>Oh boy, does it ever. That&#8217;s a 14.4% reduction in median time <em>on top of what we already got last time</em>. And Telemetry tells us we&#8217;re now doing a much better job at submitting uniform-sized tasks:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png"><img loading="lazy" data-attachment-id="1751" data-permalink="https://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/tmvis_rasterbal1/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png" data-orig-size="497,331" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Balanced rasterization front end" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png?w=497&#038;h=331" alt="Balanced rasterization front end" width="497" height="331" class="aligncenter size-full wp-image-1751" srcset="https://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png 497w, https://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png?w=150&amp;h=100 150w, https://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png?w=300&amp;h=200 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>In this frame, there&#8217;s still one transform batch that takes longer than the others; this happens sometimes, because of context switches for example. But note that the other threads nicely pick up the slack, and we&#8217;re still fine: a ~2x variation on the occasional item isn&#8217;t a big deal, provided most items are still roughly the same size. Also note that, even though there&#8217;s 8 worker threads, we never seem to be running more than 4 tasks at a time, and the hand-offs between threads (look at what happens in the BinMeshes phase) seem too perfectly synchronized to just happen accidentally. I&#8217;m assuming that TBB intentionally never uses more than 4 threads because the machine I&#8217;m running this on has a quad-core CPU (albeit with HyperThreading), but I haven&#8217;t checked whether this is just a configuration option or not; it probably is.</p>
<h3>Balancing the rasterizer back end</h3>
<p>Now we can&#8217;t do the same trick for the actual triangle rasterization, because it works in tiles, and they just end up with uneven amounts of work depending on what&#8217;s on the screen &#8211; there&#8217;s nothing we can do about that. That said, we&#8217;re definitely hurt by the uneven task sizes here too &#8211; for example, on my original Telemetry screenshot, you can clearly see how the non-uniform job sizes hurt us:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png"><img loading="lazy" data-attachment-id="1758" data-permalink="https://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/tmviz_initial_badbal/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png" data-orig-size="497,360" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Initial bad rasterizer balance" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png?w=497&#038;h=360" alt="Initial bad rasterizer balance" width="497" height="360" class="aligncenter size-full wp-image-1758" srcset="https://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png 497w, https://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png?w=150&amp;h=109 150w, https://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png?w=300&amp;h=217 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>The green thread picks up a tile with lots of triangles to render pretty late, and as a result everyone else ends up waiting for him to finish. This is not good.</p>
<p>However, lucky for us, there&#8217;s a solution: the TBB task manager will parcel out tasks roughly in the order they were submitted. So all we have to do is to make sure the &#8220;big&#8221; tiles come first. Well, after binning is done, we know exactly how many triangles end up in each tile. So what we do is insert a single task between binning and rasterization that determines the right order to process the tiles in, then make the actual rasterization depend on it:</p>
<pre>
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::BinSort, this,
    1, &amp;mBinMesh, 1, "BinSort", &amp;sortBins);
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::RasterizeBinnedTrianglesToDepthBuffer,
    this, NUM_TILES, &amp;sortBins, 1, "Raster Tris to DB", &amp;mRasterize);	
</pre>
<p>So how does that function look? Well, all we have to do is count how many triangles ended up in each tile, and then sort the tiles by that. The function is so short I&#8217;m just gonna show you the whole thing:</p>
<pre>
void DepthBufferRasterizerSSEMT::BinSort(VOID* taskData,
    INT context, UINT taskId, UINT taskCount)
{
    DepthBufferRasterizerSSEMT* me =
        (DepthBufferRasterizerSSEMT*)taskData;

    // Initialize sequence in identity order and compute total
    // number of triangles in the bins for each tile
    UINT tileTotalTris[NUM_TILES];
    for(UINT tile = 0; tile &lt; NUM_TILES; tile++)
    {
        me-&gt;mTileSequence[tile] = tile;

        UINT base = tile * NUM_XFORMVERTS_TASKS;
        UINT numTris = 0;
        for (UINT bin = 0; bin &lt; NUM_XFORMVERTS_TASKS; bin++)
            numTris += me-&gt;mpNumTrisInBin[base + bin];

        tileTotalTris[tile] = numTris;
    }

    // Sort tiles by number of triangles, decreasing.
    std::sort(me-&gt;mTileSequence, me-&gt;mTileSequence + NUM_TILES,
        [&amp;](const UINT a, const UINT b)
        {
            return tileTotalTris[a] &gt; tileTotalTris[b]; 
        });
}
</pre>
<p>where <code>mTileSequence</code> is just an array of <code>UINT</code>s with <code>NUM_TILES</code> elements. Then we just rename the <code>taskId</code> parameter of <code>RasterizeBinnedTrianglesToDepthBuffer</code> to <code>rawTaskId</code> and start the function like this:</p>
<pre>
    UINT taskId = mTileSequence[rawTaskId];
</pre>
<p>and presto, we have bin sorting. Here&#8217;s the results:</p>
<p><b>Change</b>: Sort back-end tiles by amount of work</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial depth render</td>
<td>2.666</td>
<td>2.716</td>
<td>2.732</td>
<td>2.745</td>
<td>2.811</td>
<td>2.731</td>
<td>0.022</td>
</tr>
<tr>
<td>Balance front end</td>
<td>2.282</td>
<td>2.323</td>
<td>2.339</td>
<td>2.362</td>
<td>2.476</td>
<td>2.347</td>
<td>0.034</td>
</tr>
<tr>
<td>Balance back end</td>
<td>2.128</td>
<td>2.162</td>
<td>2.178</td>
<td>2.201</td>
<td>2.284</td>
<td>2.183</td>
<td>0.029</td>
</tr>
</table>
<p>Once again, we&#8217;re 20% down from where we started! Now let&#8217;s check in Telemetry to make sure it worked correctly and we weren&#8217;t just lucky:</p>
<p><a href="https://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png"><img loading="lazy" data-attachment-id="1767" data-permalink="https://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/tmviz_rasterbal2/" data-orig-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png" data-orig-size="497,387" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Rasterizer fully balanced" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png?w=497" src="https://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png?w=497&#038;h=387" alt="Rasterizer fully balanced" width="497" height="387" class="aligncenter size-full wp-image-1767" srcset="https://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png 497w, https://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png?w=150&amp;h=117 150w, https://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png?w=300&amp;h=234 300w" sizes="(max-width: 497px) 100vw, 497px" /></a></p>
<p>Now that&#8217;s just <em>beautiful</em>. See how the whole thing is now densely packed into the live threads, with almost no wasted space? This is how you want your profiles to look. Aside from the fact that our rasterization only seems to be running on 3 threads, that is &#8211; there&#8217;s always more digging to do. One fun thing I noticed is that TBB actually doesn&#8217;t process the tasks fully in-order; the two top threads indeed start from the biggest tiles and work their way forwards, but the bottom-most thread actually starts from the end of the queue, working its way towards the beginning. The tiny LOD zone I&#8217;m hovering over covers both the bin sorting task and the seven smallest tiles; the packets get bigger from there.</p>
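<p>Why putting the big tiles first helps can be reproduced with a toy model. The sketch below assumes an idealized scheduler in which tasks are handed out strictly in submission order to whichever worker becomes free first; real TBB is not that simple, as the out-of-order behavior just described shows, and the function names and tile costs used here are made up for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Idealized list scheduling: each task, in submission order, goes to the
// worker that becomes free first. Returns the time at which the last worker
// finishes, i.e. how long everyone waits at the following sync point.
inline unsigned SimulateMakespan(const std::vector<unsigned>& taskCosts,
                                 unsigned numWorkers)
{
    assert(numWorkers > 0);
    std::vector<unsigned> busyUntil(numWorkers, 0);
    for (unsigned cost : taskCosts)
    {
        // Pick the worker that is idle the earliest and give it this task.
        std::vector<unsigned>::iterator next =
            std::min_element(busyUntil.begin(), busyUntil.end());
        *next += cost;
    }
    return *std::max_element(busyUntil.begin(), busyUntil.end());
}

// Returns the same costs in decreasing order, which is the order the bin
// sorting step produces for the tiles.
inline std::vector<unsigned> LargestFirst(std::vector<unsigned> costs)
{
    std::sort(costs.begin(), costs.end(), std::greater<unsigned>());
    return costs;
}
```

<p>With six cheap tiles of cost 1 followed by one expensive tile of cost 8 on four workers, the unsorted order finishes at time 9, because the expensive tile only starts once a worker has already done a cheap one. Sorted largest-first, it finishes at time 8, which is the best possible since the expensive tile alone takes that long.</p>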
<p>And with that, I think we have enough changes (and images!) for one post. We&#8217;ll continue ironing out scheduling kinks next time, but I think the lesson is already clear: you can&#8217;t just toss tasks to worker threads and expect things to go smoothly. If you want to get good thread utilization, better profile to make sure your threads actually do what you think they&#8217;re doing! And as usual, you can find the code for this post on <a href="https://github.com/rygorous/intel_occlusion_cull/tree/blog">Github</a>, albeit without the Telemetry instrumentation for now &#8211; Telemetry is a commercial product, and I don&#8217;t want to introduce any dependencies that make it harder for people to compile the code. Take care, and until next time.</p>
]]></html><thumbnail_url><![CDATA[https://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[]]></thumbnail_width><thumbnail_height><![CDATA[]]></thumbnail_height></oembed>