<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Notes on caches and&nbsp;threads]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>This is a bit unsorted &#8211; just a bunch of things I wanted to write down at some point but didn&#8217;t feel like writing a full article about.</p>
<h3>Cache size is measured in lines, not bytes</h3>
<p>This is important to keep in mind when thinking about cache utilization, particularly L1. For example, the Cell (and also the Xbox 360 CPU) has 32k of L1 data cache but 128-byte cache lines, i.e. 256 cache lines total. Compare to e.g. the original 1993 Intel Pentium, which has 8k of L1 and 32-byte cache lines, for a total of&#8230; 256 cache lines! If your data is laid out sequentially, you get to use the 4x increased capacity. But when you use linked data structures, you really don&#8217;t: you&#8217;re more likely than not to end up in a different cache line every time you dereference a pointer, so assuming each node fits into a cache line, you&#8217;re limited to 256 nodes in your L1. Tops. Assuming no node straddles two cache lines and there are no set conflicts whatsoever.</p>
<p>32k of cache may sound like enough to keep your hot data structures in them &#8211; but if you use linked data structures with small nodes, you only really get to use a fraction. Good thing to keep in mind.</p>
<h3>&#8220;Hardware thread&#8221; is an awful name</h3>
<p>It just is. Overloading the term &#8220;thread&#8221; is a poor choice &#8211; when talking among software people anyway. When talking about code running on chips with SMT (Simultaneous multithreading, aka &#8220;HyperThreading&#8221; in Intel parlance) there&#8217;s instant confusion whenever the word &#8220;thread&#8221; is used, since it could either be a &#8220;software&#8221; (OS) thread or a hardware one.</p>
<p>Make up your own terminology, goddamnit! &#8220;Threads&#8221; (like &#8220;Processes&#8221;, &#8220;Fibers&#8221; and &#8220;Files&#8221;) are a software concept &#8211; an abstraction provided by the OS. A hardware thread isn&#8217;t a thread at all; it&#8217;s some extra infrastructure added to the hardware that is typically used by the OS to implement the &#8220;thread&#8221; abstraction. It deserves its own name.</p>
<p>That said, I don&#8217;t have any brilliant suggestions either. The best I&#8217;ve come up with so far is &#8220;head&#8221;, which is just a portmanteau of &#8220;Hardware thread&#8221; into a nice pronunciation-friendly one-syllable word. Pros are that it&#8217;s from a semantic field not yet overused in programming (body parts), and it suggests a nice metaphor: imagine a hydra which has multiple heads but just one torso &#8211; that&#8217;s roughly how SMT processors work. Cons are that it&#8217;s not really catchy and phonetically too similar to &#8220;thread&#8221;. As said, it&#8217;s not perfect. I&#8217;d be glad to hear better suggestions!</p>
<p>And while we&#8217;re on the subject of terminology, I submit this for your consideration: Just continue the list from Fiber-&gt;Thread naturally! A thread group is a &#8220;rope&#8221;, and a group of ropes (thread groups) is a &#8220;dew&#8221;. Yes, it&#8217;s only half-serious, but with the 3-dimensional D3D11 compute shader &#8220;thread coordinates&#8221; it doesn&#8217;t seem that silly anymore&#8230;</p>
<h3>Hardware threads compete for core resources!</h3>
<p>Remember those 256 cache lines up there? Yeah, that&#8217;s if you&#8217;re only using one head. If you&#8217;re using two, they&#8217;re both running out of the same L1 cache, and neither will get all 256 lines &#8211; with fun interference all around. The same goes for instruction cache, by the way. A great way to get 2-way hardware multithreading except-not-really is by running 2 threads with code that branches around a lot. Pick your poison: use large unrolled functions with plenty of inlining and run out of L1 I$ capacity, or use lots of small functions with tons of calls and end up using I$ lines inefficiently (and eating lots of branch penalties besides). The only winning move is not to play that game in the first place. What you want is reasonably small, tight loops. <em>Especially</em> if you&#8217;re using multiple HW threads &#8211; since you usually also only get half the branch prediction slots, half the BTB entries, and so on.</p>
<p>Also note that extra heads really aren&#8217;t like extra cores. If you have more cores, you can get substantial speedups even using algorithms that don&#8217;t scale linearly (i.e. more threads cause some amount of extra work). With extra heads, that doesn&#8217;t work, or at least doesn&#8217;t work very well; if you&#8217;re running into a bottleneck of the core (be it computation, memory bandwidth, instruction decoding, or something else), adding extra heads doesn&#8217;t help &#8211; they&#8217;re meant to avoid stalls. If you&#8217;re keeping the core busy as-is, you need an extra head about as much as you&#8230; well, need a second head, actually. 🙂 And there&#8217;s also the bit about the smaller effective data cache size and so on. Of course there&#8217;s advantages too: heads on one core can share data much more efficiently than different cores can since they&#8217;re using the same L1 cache, they have no coherency traffic between them, and they don&#8217;t suffer from false sharing. So they&#8217;re very well suited to low-latency fine-grained threading.</p>
<h3>Threads, threads, and threads</h3>
<p>The concept of threads (the software kind) has been around for a while, but not all threads are created (or should I say designed?) equal. Threads are used for different reasons, and even some fit the underlying OS thread abstraction better than others.</p>
<p>The first major use case is different &#8220;subprograms&#8221; that execute concurrently from the user&#8217;s point of view &#8211; the OS might decide to put two threads onto different cores so they actually execute at the same time, or it might opt to multiplex both onto one core by time-slicing. This is the &#8220;original&#8221; thread model; it&#8217;s what you use to keep UI responsive even when the app is doing some intensive number crunching right now, etc. It&#8217;s what OSes expect you to do when you create a thread.</p>
<p>The second case is where you want the different &#8220;subprograms&#8221;, but not the concurrency. You use threads purely to have several execution contexts in your program, but they don&#8217;t run simultaneously, and you &#8220;switch&#8221; from one to the other at set points (by &#8220;starting&#8221; thread 2 and putting thread 1 to sleep by waiting on some event/condition variable). Basically, all you want from threads is to have two separate stacks. This is effectively using OS facilities to implement coroutines in a language that doesn&#8217;t have them. Of course, this is overkill &#8211; threads are fairly heavyweight entities and they get scheduled; that&#8217;s a big price to pay for what&#8217;s effectively a stack-switching mechanism. When OS developers saw people doing this, they introduced &#8220;Fibers&#8221; to fill the niche. These exist (under different names) in basically all modern OSes and allow you to have multiple stacks, program counters and hence flows of execution, but without preemption and most of the hidden costs involved in creating a thread.</p>
<p>The third case is threads you create exclusively for the purpose of getting concurrency, i.e. what&#8217;s commonly called &#8220;worker threads&#8221;. Unlike the first two groups, they typically all run the same code (or at least the same main loop). This <em>only</em> makes sense if you have multiple cores, or one core with SMT. If you have neither, cut out the middleman &#8211; just process the work in your main thread! Anyway, this kind of thread has existed for a long time in the server and high-performance computing world, but its appearance on the desktop is relatively recent. And it shows: OSes don&#8217;t have really great support for this kind of thread yet. There&#8217;s lot of support code and libraries, but worker threads aren&#8217;t really first-class citizens at the OS level yet.</p>
<p>There&#8217;s two problems: First, the way this is currently done is by setting up a &#8220;thread pool&#8221;. At startup time, you create some number of threads (usually as many as the number of available &#8220;logical&#8221; processors minus 1 for the main thread) to be worker threads. The threads all check if there&#8217;s work queued up, execute it if available and go to sleep if there&#8217;s nothing to do. This works fine if all threads actually get assigned to a different head, but it quickly starts to suck once there&#8217;s other apps doing the same thing, especially if one or more of them mess around with thread or process affinities. The second, more serious problem affects latency-critical applications, which on desktop OSes usually means soft real-time apps like games or media players. Such programs usually work frame-based, with one sync-point per frame where results must be ready (not necessarily results from a computation started this frame &#8211; the process can be pipelined across more than one frame &#8211; but the &#8220;one sync per frame&#8221; design is fairly common). This design interacts really badly with OS context switches. Here&#8217;s a simple example with 16 jobs distributed across 4 worker threads to illustrate the problem:</p>
<p><a href="https://fgiesen.files.wordpress.com/2011/02/worker_thread.png"><img loading="lazy" data-attachment-id="240" data-permalink="https://fgiesen.wordpress.com/2011/02/26/notes-on-caches-and-threads/worker_thread/" data-orig-file="https://fgiesen.files.wordpress.com/2011/02/worker_thread.png" data-orig-size="457,158" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="worker_thread" data-image-description="" data-image-caption="" data-medium-file="https://fgiesen.files.wordpress.com/2011/02/worker_thread.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2011/02/worker_thread.png?w=457" src="https://fgiesen.files.wordpress.com/2011/02/worker_thread.png?w=457&#038;h=158" alt="Worker thread example" title="worker_thread" width="457" height="158" class="aligncenter size-full wp-image-240" srcset="https://fgiesen.files.wordpress.com/2011/02/worker_thread.png 457w, https://fgiesen.files.wordpress.com/2011/02/worker_thread.png?w=150&amp;h=52 150w, https://fgiesen.files.wordpress.com/2011/02/worker_thread.png?w=300&amp;h=104 300w" sizes="(max-width: 457px) 100vw, 457px" /></a></p>
<p>Worker 3 gets interrupted shortly after it grabs its fourth job, and before it finishes. At that point in time, there&#8217;s three remaining tasks left, which the three other workers divide among each other. And then the task list is empty and everyone waits for Worker 3 to finish&#8230; and waits&#8230; and waits some more until Worker 3 finally gets re-activated and finishes its job. Ouch &#8211; it would&#8217;ve been <em>much</em> faster if Worker 3 had only processed 3 items then gone to sleep, with the remaining 3 threads dividing up 4 jobs.</p>
<p>It all depends on timing. If you have a system that&#8217;s self-balancing (e.g. a &#8220;work stealing&#8221; type job manager), you will avoid most of the extra cost when one of the workers is interrupted early in the frame: the remaining threads just crunch through the work alone while one thread stalls. But if the interruption is shortly before a sync point, you&#8217;re just screwed no matter what you do.</p>
<p>This is really a scheduling problem. If you have a worker thread, it&#8217;s great to interrupt them when they&#8217;ve just finished a job, and extremely bad to interrupt them just after they agreed to start work on a new one. Ideally, you&#8217;d have a semi-cooperative multitasking system for worker threads; after every job (or every 10 jobs or whatever, depending on granularity), worker threads do a syscall that signals &#8220;if you want to do a context switch, now would be a good time&#8221;. This is not the same as a &#8220;sleep&#8221;/&#8221;yield&#8221; type operation; we don&#8217;t want to relinquish our time slice, just signal that now is a good time to schedule something different if the time is almost up! The scheduler in turn promises to <em>only</em> do context switches at these points, unless it hasn&#8217;t seen one in some set time interval (say 20ms or so), after which the thread is up for regular preemption.</p>
<p>That&#8217;s a quick patch, but really you want to make the OS to be aware of worker threads on a deep level, just as it &#8220;knows&#8221; Threads and Fibers right now. A static thread pool works fine if you have exclusive control of the system, but it&#8217;s the wrong model for a desktop OS. If you have multiple apps each running their own thread pools, you don&#8217;t want to multiplex them onto the available hardware threads; ideally you want to change the number of worker threads available to each app instead! But to do this properly, the OS needs to be aware of the distinction first.</p>
]]></html><thumbnail_url><![CDATA[https://fgiesen.files.wordpress.com/2011/02/worker_thread.png?fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[]]></thumbnail_width><thumbnail_height><![CDATA[]]></thumbnail_height></oembed>