<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Half to float done&nbsp;quick]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>It all started a few days ago with <a href="https://twitter.com/#!/castano/status/182252019418021888">this tweet</a> by Ignacio Castaño &#8211; complaining about the speed of the standard library <code>half_to_float</code> in <a href="http://ispc.github.com/">ISPC</a>. Tom then <a href="https://twitter.com/#!/tom_forsyth/status/182315716077301762">replied</a> that this was hard to do well in code without dedicated HW support &#8211; then immediately followed it up with <a href="https://twitter.com/#!/tom_forsyth/status/182316268681043968">an idea</a> of how you might do it. I love this kind of puzzle (and obviously follow both Ignacio and Tom), so I jumped in and offered to write some code. Tom&#8217;s initial approach was a dead end, but it turned out that it was in fact possible to do a pretty decent job using standard SSE2 instructions (available anywhere from Pentium 4 onwards) with no dedicated hardware support at all; that said, dedicated HW support will exist in Intel processors starting from &#8220;Ivy Bridge&#8221;, due later this year. My solutions involve some fun but not immediately obvious bit-twiddling on floats, so I figured I should take the time to write it up. If you&#8217;re impatient and just want the code, <a href="https://gist.github.com/2144712">knock yourself out</a>. Actually you might want to open that up anyway; I&#8217;ll copy over key passages but not all the details. The code was written for x86 machines using Little Endian, but it doesn&#8217;t use any particularly esoteric features, so it should be easily adapted for BE architectures or different vector instruction sets.</p>
<h3>What&#8217;s in a float?</h3>
<p>If you don&#8217;t know or care how floating-point numbers are represented, this is <em>not</em> a useful article for you to read, because from here on out it will deal almost exclusively with various storage format details. If you don&#8217;t know how floats work but would like to learn, a good place to start is Bruce Dawson&#8217;s <a href="https://randomascii.wordpress.com/2012/01/11/tricks-with-the-floating-point-format/">article on the subject</a>. If you do know how floats work in principle but don&#8217;t remember the storage details of IEEE formats, make a quick stop at Wikipedia and read up on the layout of <a href="http://en.wikipedia.org/wiki/Single-precision">single-precision</a> and <a href="http://en.wikipedia.org/wiki/Half-precision_floating-point_format">half-precision</a> floats. Still here? Okay, great, let&#8217;s get started then!</p>
<h3>Half to float basics</h3>
<p>Converting between the different float formats correctly is mostly about making sure you catch all the important cases and map them properly. So let&#8217;s make a list of all the different classes a floating point number can fall into:</p>
<ul>
<li><b>Normalized numbers</b> &#8211; the ones where the exponent bits are neither all-0 nor all-1. This is the majority of all floats. Single-precision floats have both a larger exponent range and more mantissa bits than half-precision floats, so converting normalized halfs is easy: just add a bunch of 0 bits at the end of the mantissa (a plain left shift on the integer representation) and adjust the exponent accordingly.</li>
<li><b>Zero</b> (exponent and mantissa both 0) should map to zero &#8211; easy.</li>
<li><b>Denormalized numbers</b> (exponent zero, mantissa nonzero). Values that are denormals in half-precision map to regular normalized numbers in single precision, so they need to be renormalized during conversion.</li>
<li><b>Infinities</b> (exponent all-ones, mantissa zero) map to infinities.</li>
<li>Finally, <b>NaNs</b> (exponent all-ones, mantissa nonzero) should map to NaNs. There&#8217;s often an extra distinction between different types of NaNs, like quiet vs. signaling NaNs. It&#8217;s nice to preserve these semantics when possible, but in principle anything that maps NaNs to NaNs is acceptable.</li>
</ul>
<p>Finally there&#8217;s also the matter of the sign bit. There&#8217;s two slightly weird cases here (signed zero and signed NaNs), but actually as far as conversions go you&#8217;ll never go wrong if you just preserve the sign bit on conversion, so that&#8217;s what we&#8217;ll do. If you just work through these cases one by one, you get something like <code>half_to_float_full</code> in the code:</p>
<pre>static FP32 half_to_float_full(FP16 h)
{
    FP32 o = { 0 };

    // From ISPC ref code
    if (h.Exponent == 0 &amp;&amp; h.Mantissa == 0) // (Signed) zero
        o.Sign = h.Sign;
    else
    {
        if (h.Exponent == 0) // Denormal (converts to normalized)
        {
            // Adjust mantissa so it's normalized (and keep
            // track of exponent adjustment)
            int e = -1;
            uint m = h.Mantissa;
            do
            {
                e++;
                m &lt;&lt;= 1;
            } while ((m &amp; 0x400) == 0);

            o.Mantissa = (m &amp; 0x3ff) &lt;&lt; 13;
            o.Exponent = 127 - 15 - e;
            o.Sign = h.Sign;
        }
        else if (h.Exponent == 0x1f) // Inf/NaN
        {
            // NOTE: Both can be handled with same code path
            // since we just pass through mantissa bits.
            o.Mantissa = h.Mantissa &lt;&lt; 13;
            o.Exponent = 255;
            o.Sign = h.Sign;
        }
        else // Normalized number
        {
            o.Mantissa = h.Mantissa &lt;&lt; 13;
            o.Exponent = 127 - 15 + h.Exponent; 
            o.Sign = h.Sign;
        }
    }

    return o;
}</pre>
<p>Note how this code handles NaNs and infinities with the same code path; despite having a different <em>interpretation</em>, the actual work that needs to happen in both cases is exactly the same.</p>
<h3>Dealing with denormals</h3>
<p>Clearly, the ugliest part of the whole thing is the handling of denormals. Can&#8217;t we do any better than that? Turns out we can. After all, the whole point of denormals is that they are <em>not normalized</em>; in other words, they&#8217;re just a scaled integer (fixed-point) representation of a fairly small number. For half-precision floats, they represent <code>Mantissa * 2^(-14)</code>. If you&#8217;re on one of the architectures with a &#8220;convert integer to float&#8221; instruction that can scale by an arbitrary power of 2 along the way, you can handle this case with a single instruction. Otherwise, you can either use regular integer→float conversion followed by a multiply to scale the value properly or use a &#8220;magic number&#8221; based conversion (if you don&#8217;t know what that is, check out <a href="http://chrishecker.com/images/f/fb/Gdmfp.pdf">Chris Hecker&#8217;s old GDMag article</a> on the subject). Either way, 0 happens to have all-0 mantissa bits; in short, same as with NaNs and infinities, we can actually funnel zero and denormals through the same code path. This leaves just three cases to take care of: normal, denormal, or NaN/infinity. <code>half_to_float_fast2</code> is an implementation of this approach:</p>
<pre>static FP32 half_to_float_fast2(FP16 h)
{
    static const FP32 magic = { 126 &lt;&lt; 23 };
    FP32 o;

    if (h.Exponent == 0) // Zero / Denormal
    {
        o.u = magic.u + h.Mantissa;
        o.f -= magic.f;
    }
    else
    {
        o.Mantissa = h.Mantissa &lt;&lt; 13;
        if (h.Exponent == 0x1f) // Inf/NaN
            o.Exponent = 255;
        else
            o.Exponent = 127 - 15 + h.Exponent;
    }

    o.Sign = h.Sign;
    return o;
}</pre>
<p>Variants 3, 4 and 4b all use this same underlying idea; they&#8217;re slightly different implementations, but nothing major (and the SSE2 versions are fairly straight translations of the corresponding scalar variants). Variant 5, however, uses a very different approach that reduces the number of distinct cases to handle from three down to two.</p>
<h3>A different method with other applications</h3>
<p>Both single- and half-precision floats have denormals at the bottom of the exponent range. Other than the exact location of that &#8220;bottom&#8221;, they work exactly the same way. The idea, then, is to translate denormal halfs into denormal floats, and let the floating-point hardware deal with the rest (provided it supports denormals efficiently, that is). Essentially, all we need to do is to shift the input half by the difference in the amount of mantissa bits (13, as already seen above). This will map half-denormals to float-denormals and normalized halfs to normalized floats. The only problem is that all numbers converted this way will end up too small by a fixed factor that depends on the difference between the exponent biases (in this case, they need to be scaled up by <img src="https://s0.wp.com/latex.php?latex=2%5E%7B127-15%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=2%5E%7B127-15%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=2%5E%7B127-15%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="2^{127-15}" class="latex" />). That&#8217;s easily fixed with a single multiply. This reduces the number of fundamentally different cases from three to two: we still need to dedicate some work to handling infinities and NaNs, but that&#8217;s it. The code looks like this:</p>
<pre>static FP32 half_to_float_fast5(FP16 h)
{
    static const FP32 magic = { (254 - 15) &lt;&lt; 23 };
    static const FP32 was_infnan = { (127 + 16) &lt;&lt; 23 };
    FP32 o;

    o.u = (h.u &amp; 0x7fff) &lt;&lt; 13;     // exponent/mantissa bits
    o.f *= magic.f;                 // exponent adjust
    if (o.f &gt;= was_infnan.f)        // make sure Inf/NaN survive
        o.u |= 255 &lt;&lt; 23;
    o.u |= (h.u &amp; 0x8000) &lt;&lt; 16;    // sign bit
    return o;
}
</pre>
<p>This is not the fastest way to do the conversion, because it leans on HW to deal with denormals should they arise (something that floating-point HW tends to be quite slow at), but it&#8217;s definitely the slickest out of this bunch.</p>
<p>More importantly, unlike the other variants mentioned, this basic approach also works in the opposite direction (converting from floats to halfs), which in turn inspired <a href="https://gist.github.com/2156668">this code</a>. There&#8217;s a bit of extra work to ensure there&#8217;s no unintended double rounding and to handle NaNs and overflows correctly, but it&#8217;s still the same idea. Which goes to show &#8211; sometimes making things as simple as possible really does have rewards beyond the pure intellectual satisfaction. 🙂</p>
<p>And that&#8217;s it. I admit that play-by-play narration of source code isn&#8217;t particularly exciting, but in this case I thought the code itself was interesting and short enough to give it a shot. And now back to our regularly scheduled posts 🙂</p>
]]></html></oembed>