<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[More PPC compiler&nbsp;babysitting]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>Another recent discovery from looking at generated code. On in-order PPC processors, you can&#8217;t move data directly between the integer and floating-point units &#8211; it has to go through memory first. This usually involves storing some value to memory and reading it immediately afterwards, a guaranteed LHS (Load-Hit-Store) stall. A full integer-to-floating-point conversion on PPC involves multiple steps:</p>
<ol>
<li>Sign-extend the integer to 64 bits (<code>extsw</code>)</li>
<li>Store 64-bit value into memory (<code>std</code>)</li>
<li>Load 64-bit value into floating-point register (<code>lfd</code>)</li>
<li>Convert to double (<code>fcfid</code>)</li>
<li>Round to single precision (<code>frsp</code>)</li>
</ol>
<p>The sign-extend and round to single steps may be omitted depending on context, but the rest is pretty much fixed, and the dreaded LHS is triggered by step 3. There are ways to work around this problem &#8211; if you have a small set of integers, it can make sense to use a small table for int-&gt;float conversion. You can also use SIMD instructions to do the conversion, provided you do the rest of your computation in SIMD registers too (again, no direct movement between the integer, vector and floating-point units, you have to go through memory).</p>
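<p>As a rough illustration of the table workaround mentioned above, here is a minimal sketch in plain C. The names <code>ITOF_TABLE_SIZE</code>, <code>init_itof_table</code> and <code>table_itof</code> are made up for this example, and the range of 0 to 255 is an assumption; pick whatever range your integers actually cover. The lookup is an ordinary indexed float load, so it avoids the store-then-load round trip, at the cost of a possible cache miss on the table.</p>

```c
#include <assert.h>

/* Hypothetical table size: assumes the integers to convert lie in [0, 255]. */
#define ITOF_TABLE_SIZE 256

static float itof_table[ITOF_TABLE_SIZE];

/* Fill the table once at startup. These conversions still go through
   the slow path, but they are paid only once. */
static void init_itof_table(void)
{
    int i;
    for (i = 0; i < ITOF_TABLE_SIZE; ++i)
        itof_table[i] = (float) i;
}

/* Table-based int->float: a plain indexed load from memory,
   with no integer store followed by a floating-point load. */
static float table_itof(int x)
{
    assert(x >= 0 && x < ITOF_TABLE_SIZE);
    return itof_table[x];
}
```

<p>After calling <code>init_itof_table()</code> once, <code>table_itof(42) * scale</code> replaces <code>42 * scale</code> wherever the argument is known to be in range.</p>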
<p>That&#8217;s not what this post is about, though. Let&#8217;s just accept that LHS as a fact of life for now. Does that mean we have to eat it on every int to float (or float to int) conversion? Not really. Have a look at this code:</p>
<pre>
void some_function(float a, float b, float c, float d);

void problem(int a, int b, int c, int d, float scale)
{
  some_function(a*scale, b*scale, c*scale, d*scale);
}
</pre>
<p>We need to perform four int-to-float conversions for this function call. They&#8217;re completely independent, so the compiler could just do steps 1 and 2 for all four values, then steps 3-5. Unless we&#8217;re unlucky, we expect all four temporaries to be in the same cache line on the stack, so we expect to get only one LHS stall on the first load. So much for the theory, anyway &#8211; I recently noticed that one of the PPC compilers didn&#8217;t do this, so I whipped up the small example above and checked the other compilers we use, and it turns out that all three of them happily produced code with 4 LHS stalls.</p>
<p>When the swelling from the subsequent Mother Of All Facepalms(tm) abated, I went on to check if there was some way to coax the compilers into generating better code. And yes, on all 3 compilers there&#8217;s a way to get the desired behavior, though the details differ a bit:</p>
<pre>
 // Names changed to protect the guilty
#ifndef COMPILER_C
typedef volatile S64 S32itof;
#else 
typedef S64 S32itof;
#endif

static inline F32 fast_itof(S32itof x)
{
#ifdef COMPILER_A
  return x;
#else
  return (F32) __fcfid(x);
#endif
}

void better(int a, int b, int c, int d, float scale)
{
  S32itof fa = a, fb = b, fc = c, fd = d;
  some_function(fast_itof(fa)*scale, fast_itof(fb)*scale,
    fast_itof(fc)*scale, fast_itof(fd)*scale);
}
</pre>
<p>My original implementation uses a macro for <code>fast_itof</code> since it needs to work in plain C89 code, and the temporary values of type <code>S32itof</code> aren&#8217;t optional in that case. With the inline function, you might be able to get rid of them, but I haven&#8217;t checked this.</p>
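<p>For reference, a macro variant along the lines described above might look roughly like the sketch below. This is not the original implementation: <code>FAST_ITOF</code>, <code>USE_PPC_FCFID</code> and <code>scaled_sum</code> are hypothetical names, the <code>S64</code> and <code>F32</code> typedefs are stand-ins (and <code>long long</code> is itself an extension in strict C89), and the <code>__fcfid</code> branch simply mirrors the intrinsic call shown earlier without having been checked against any particular compiler. Without the switch defined, it falls back to an ordinary cast so the code builds anywhere.</p>

```c
typedef long long S64;  /* assumption: stand-in for a signed 64-bit type */
typedef float F32;      /* assumption: stand-in for a 32-bit float type */

#if defined(USE_PPC_FCFID)  /* hypothetical switch for compilers with the intrinsic */
#define FAST_ITOF(x) ((F32) __fcfid(x))
#else
#define FAST_ITOF(x) ((F32) (x))  /* portable fallback: plain conversion */
#endif

/* The 64-bit temporaries are all initialized before any of them is
   converted, which is the store-everything-first pattern from the post.
   With a macro they are required, since the macro argument must already
   be a 64-bit value sitting in memory. */
static F32 scaled_sum(int a, int b, int c, int d, F32 scale)
{
    volatile S64 ta = a, tb = b, tc = c, td = d;
    return FAST_ITOF(ta)*scale + FAST_ITOF(tb)*scale
         + FAST_ITOF(tc)*scale + FAST_ITOF(td)*scale;
}
```

<p>Whether the stores and loads actually get grouped this way still depends on the compiler, so the generated code is worth checking.</p>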
]]></html></oembed>