<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Linear interpolation past, present and&nbsp;future]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>Linear interpolation. Lerp. The bread and butter of graphics programming. Well, turns out I have three tricks about lerps to share. One of them is really well known, the other two less so. So let&#8217;s get cracking!</p>
<p>Standard linear interpolation is just <code>lerp(t, a, b) = (1-t)*a + t*b</code>. You should already know this. At t=0 we get a, at t=1 we get b, and for inbetween values of t we interpolate linearly between the two. And of course we can also linearly extrapolate by using a t outside [0,1].</p>
<h3>Past</h3>
<p>The expression shown above has two multiplies. Now, it used to be the case that multiplies were really slow (and by the way they really aren&#8217;t anymore, so please stop doubling the number of adds in an expression to get rid of a multiply, no matter what your CS prof tells you). If multiplies are expensive, there&#8217;s a better (equivalent) expression you can get with some algebra: <code>lerp(t, a, b) = a + t*(b-a)</code>. This is the first of the three tricks, and it&#8217;s really well known.</p>
<p>But in this setting (int multiplies are slow) you&#8217;re also really unlikely to have floating-point hardware, or to be working on floating-point numbers. Especially if we&#8217;re talking about pixel processing in computer graphics, it used to be all-integer for a <em>really</em> long time. It&#8217;s more likely to be all fixed-point, and in the fixed-point setting you typically have something like <code>fixedpt_lerp(t, a, b) = a + ((t * (b-a)) &gt;&gt; 8)</code>. And more importantly, your a&#8217;s and b&#8217;s are also fixed-point (integer) numbers, typically with a very low range.</p>
<p>So here comes the second trick: for some applications (e.g. cross fades), you&#8217;re doing a lot of lerps with the same t (at least one per pixel). Now note that if t is fixed, the <code>(t * (b - a)) &gt;&gt; 8</code> part really only depends on b-a, and that&#8217;s a small set of possible values: If a and b are byte values in [0,255], then d=b-a is in [-255,255]. So we really only ever need to do 511 multiplies based on t, to build a table indexed by d. Except that&#8217;s not true either, because we compute first t*0 then t*1 then t*2 and so forth, so we&#8217;re really just adding t every time, and we can build the whole table without any multiplies at all. So here we get trick two, for doing lots of lerps with constant t on integer values:</p>
<pre>U8 lerp_tab[255*2+1]; // U8 is sufficient here

// build the lerp table
for (int i=0, sum = 0; i &lt; 256; i++, sum += t) {
  lerp_tab[255-i] = (U8) (-sum &gt;&gt; 8); // negative half of table
  lerp_tab[255+i] = (U8) ( sum &gt;&gt; 8); // positive half
}

// cross-fade (grayscale image here for simplicity)
for (int i=0; i &lt; npixels; i++) {
  int a = src1[i];
  int b = src2[i];
  out[i] = a + lerp_tab[255 + b-a];
}
</pre>
<p>Look ma, no multiplies!</p>
<h3>Present</h3>
<p>But back to the present. Floating-point hardware is readily available on most platforms and purely counting arithmetic ops is a poor estimator for performance on modern architectures. Practically speaking, for almost all code, you&#8217;re never going to notice any appreciable difference in speed between the two versions</p>
<pre>
  lerp_1(t, a, b) = (1 - t)*a + t*b
  lerp_2(t, a, b) = a + t*(b-a)
</pre>
<p>but you might notice something else: unlike the real numbers (which our mathematical definition of lerp is based on) and the integers (which our fixed-point lerps worked on), floating-point numbers don&#8217;t obey the arithmetic identities we&#8217;ve been using to derive this. In particular, for two floating point numbers we generally have <code>a + (b-a) != b</code>, so the second lerp expression is generally not correct at t=1! In contrast, with IEEE floating point, the first expression is guaranteed to return the exact value of a (b) at t=0 (t=1). So there&#8217;s actually some reason to prefer the first expression in that case, and using the second one tends to produce visible artifacts for some applications (you can also hardcode your lerp to return b exactly at t=1, but that&#8217;s just ugly and paying a data-dependent branch to get rid of one FLOP is a bad idea). While for pixel values you&#8217;re unlikely to care, it generally pays to be careful for mesh processing and the like; using the wrong expression can produce cracks in your mesh.</p>
<p>So what to do? Use the first form, which has one more arithmetic operation and does two multiplies, or the second form, which has one less operation but unfavorable rounding properties? Luckily, this dilemma is going away.</p>
<h3>Future</h3>
<p>Okay, &#8220;future&#8221; is stretching a bit here, because for some platforms this &#8220;future&#8221; started in 1990, but I digress. Anyway, in the future we&#8217;d like to have a magic lerp instruction in our processors that solves this problem for us (and is also fast). Unfortunately, that seems very unlikely: Even GPUs don&#8217;t have one, and <a href="http://software.intel.com/file/15542">Michael Abrash</a> never could get Intel to give him one either. However, he did get fused multiply-adds, and that&#8217;s one thing all GPUs have too, and it&#8217;s either already on the processor you&#8217;re using or soon coming to it. So if fused multiply-adds can help, then maybe we&#8217;re good. And it turns out they can.</p>
<p>A fused multiply-add is just an operation <code>fma(a, b, c) = a*b + c</code> that computes the inner expression using only one exact rounding step. And FMA-based architectures tend to compute regular adds and multiplies using the same circuitry, so all three operations cost roughly the same (in terms of latency anyway, not necessarily in terms of power). And while I say &#8220;fma&#8221; these chips usually support different versions with sign flips in different places too (implementing this in the HW is almost free); the second most important one is &#8220;fused negative multiply-subtract&#8221;, which does <code>fnms(a, b, c) = -(a*b - c) = c - a*b</code>. Let&#8217;s rewrite our lerp expressions using FMA:</p>
<pre>
  lerp_1(t, a, b) = fma(t, b, (1 - t)*a)
  lerp_2(t, a, b) = fma(t, b-a, a)
</pre>
<p>Both of these still have arithmetic operations left that aren&#8217;t in FMA form; lerp_1 has two leftover ops and lerp_2 has one. And so far both of them aren&#8217;t significantly better than their original counterparts. However, lerp_1 has exactly one multiply and one subtract left; they&#8217;re just subtract-multiply (which we don&#8217;t have HW for), rather than multiply-subtract. However, that one is easily remedied with some algebra: <code>(1 - t)*a = 1*a - t*a = a - t*a</code>, and that last expression <em>is</em> in fma (more accurately, fnms) form. So we get a third variant, and our third trick:</p>
<pre>
  lerp_3(t, a, b) = fma(t, b, fnms(t, a, a))
</pre>
<p>Two operations, both FMA-class ops, and this is based on lerp_1 so it&#8217;s actually exact at the end points. Two dependent ops &#8211; as long as we don&#8217;t actually get a hardware lerp instruction, that&#8217;s the best we can hope for. So this version is both fast <em>and</em> accurate, and as you migrate to platforms with guaranteed FMA support, you should consider rewriting your lerps that way &#8211; it&#8217;s the lerp of the future!</p>
]]></html></oembed>