<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Fast blurs 2]]></title><type><![CDATA[link]]></type><html><![CDATA[<p><em>This post is part of the series “<a href="https://fgiesen.wordpress.com/2012/02/13/debris-opening-the-box/">Debris: Opening the box</a>“.</em></p>
<p>At the end of the last post, I described the basic &#8220;moving average&#8221; implementation for box filters and how to build a simple blur out of it. I also noted that the approach given only offers very coarse control of the blur radius, and that we&#8217;d like to have something better than box filters. So let&#8217;s fix both of these issues.</p>
<h3>Subpixel resolution</h3>
<p>In the previous article, I used &#8216;r&#8217; to denote what is effectively the &#8220;radius&#8221; of a box filter, and our filters always had an odd number of taps 2r+1. So r=0 has 1 tap (which is 1, so this is just the identity &#8211; no blur), r=1 has 3 taps, r=2 has 5 taps and so forth. But how do we get intermediate stages, say r=0.1 or r=1.5?</p>
<p>Well, Jim Blinn once said that &#8220;All problems in computer graphics can be solved with a matrix inversion&#8221;. But here&#8217;s the thing: Jim Blinn lied to you. Disregard Jim Blinn. Because at least 50% of all problems in computer graphics can in fact be solved using linear interpolation without any matrices whatsoever. And this is one of them. We know how to handle r=0, and we know how to handle r=1. Want something in the middle? Well, just linearly fade in another 1 and then normalize the whole thing so the weights still sum to 1. To implement this, let&#8217;s split our radius into an integer and fractional component first:</p>
<p><img src="https://s0.wp.com/latex.php?latex=r+%3D+m+%2B+%5Calpha+%5Cquad+%5Ctextrm%7Bwhere%7D%5C+m+%5Cin+%5Cmathbb%7BZ%7D%2C+0+%5Cle+%5Calpha+%3C+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=r+%3D+m+%2B+%5Calpha+%5Cquad+%5Ctextrm%7Bwhere%7D%5C+m+%5Cin+%5Cmathbb%7BZ%7D%2C+0+%5Cle+%5Calpha+%3C+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=r+%3D+m+%2B+%5Calpha+%5Cquad+%5Ctextrm%7Bwhere%7D%5C+m+%5Cin+%5Cmathbb%7BZ%7D%2C+0+%5Cle+%5Calpha+%3C+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="r = m + &#92;alpha &#92;quad &#92;textrm{where}&#92; m &#92;in &#92;mathbb{Z}, 0 &#92;le &#92;alpha &lt; 1" class="latex" /></p>
<p>Then our blur kernel looks like this:</p>
<p><img src="https://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2r+%2B+1%7D+%5Cbegin%7Bbmatrix%7D%5Calpha+%26+1+%26+1+%26+%5Ccdots+%26+1+%26+1+%26+%5Calpha+%5Cend%7Bbmatrix%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2r+%2B+1%7D+%5Cbegin%7Bbmatrix%7D%5Calpha+%26+1+%26+1+%26+%5Ccdots+%26+1+%26+1+%26+%5Calpha+%5Cend%7Bbmatrix%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2r+%2B+1%7D+%5Cbegin%7Bbmatrix%7D%5Calpha+%26+1+%26+1+%26+%5Ccdots+%26+1+%26+1+%26+%5Calpha+%5Cend%7Bbmatrix%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="&#92;frac{1}{2r + 1} &#92;begin{bmatrix}&#92;alpha &amp; 1 &amp; 1 &amp; &#92;cdots &amp; 1 &amp; 1 &amp; &#92;alpha &#92;end{bmatrix}" class="latex" /></p>
<p>with exactly 2m+1 ones in the middle. This is still cheap to implement &#8211; for example, we could just treat the whole thing as a box filter with r=m, and then add the two samples at the ends with weight &alpha; after the fact (and then apply our usual normalization). This is perfectly reasonable and brings us up to four samples per pixel instead of two, but it&#039;s still a constant cost independent of the length of the filter, so we&#039;re golden.</p>
<p>However there&#039;s a slightly different equivalent formulation that fits in a bit better with the spirit of the filter and is very suitable when texture filtering hardware is available: consider what happens when we move one pixel to the right. Let&#039;s look at our five-tap case again (without normalization to keep things simple) and take another look at the differences between adjacent output samples:</p>
<p><img src="https://s0.wp.com/latex.php?latex=y%28n%29+%3D+%5Calpha+%5C%2C+x%28n-2%29+%2B+x%28n-1%29+%2B+x%28n%29+%2B+x%28n%2B1%29+%2B+%5Calpha+%5C%2C+x%28n%2B2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=y%28n%29+%3D+%5Calpha+%5C%2C+x%28n-2%29+%2B+x%28n-1%29+%2B+x%28n%29+%2B+x%28n%2B1%29+%2B+%5Calpha+%5C%2C+x%28n%2B2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=y%28n%29+%3D+%5Calpha+%5C%2C+x%28n-2%29+%2B+x%28n-1%29+%2B+x%28n%29+%2B+x%28n%2B1%29+%2B+%5Calpha+%5C%2C+x%28n%2B2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="y(n) = &#92;alpha &#92;, x(n-2) + x(n-1) + x(n) + x(n+1) + &#92;alpha &#92;, x(n+2)" class="latex" /><br />
<img src="https://s0.wp.com/latex.php?latex=y%28n%2B1%29+%3D+%5Calpha+%5C%2C+x%28n-1%29+%2B+x%28n%29+%2B+x%28n%2B1%29+%2B+x%28n%2B2%29+%2B+%5Calpha+%5C%2C+x%28n%2B3%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=y%28n%2B1%29+%3D+%5Calpha+%5C%2C+x%28n-1%29+%2B+x%28n%29+%2B+x%28n%2B1%29+%2B+x%28n%2B2%29+%2B+%5Calpha+%5C%2C+x%28n%2B3%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=y%28n%2B1%29+%3D+%5Calpha+%5C%2C+x%28n-1%29+%2B+x%28n%29+%2B+x%28n%2B1%29+%2B+x%28n%2B2%29+%2B+%5Calpha+%5C%2C+x%28n%2B3%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="y(n+1) = &#92;alpha &#92;, x(n-1) + x(n) + x(n+1) + x(n+2) + &#92;alpha &#92;, x(n+3)" class="latex" /></p>
<p><img src="https://s0.wp.com/latex.php?latex=y%28n%2B1%29+-+y%28n%29+%5C%5C+%3D+%28%281-%5Calpha%29+x%28n%2B2%29+%2B+%5Calpha+%5C%2C+x%28n%2B3%29%29+-+%28%281-%5Calpha%29+x%28n-1%29+%2B+%5Calpha+%5C%2C+x%28n-2%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=y%28n%2B1%29+-+y%28n%29+%5C%5C+%3D+%28%281-%5Calpha%29+x%28n%2B2%29+%2B+%5Calpha+%5C%2C+x%28n%2B3%29%29+-+%28%281-%5Calpha%29+x%28n-1%29+%2B+%5Calpha+%5C%2C+x%28n-2%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=y%28n%2B1%29+-+y%28n%29+%5C%5C+%3D+%28%281-%5Calpha%29+x%28n%2B2%29+%2B+%5Calpha+%5C%2C+x%28n%2B3%29%29+-+%28%281-%5Calpha%29+x%28n-1%29+%2B+%5Calpha+%5C%2C+x%28n-2%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="y(n+1) - y(n) &#92;&#92; = ((1-&#92;alpha) x(n+2) + &#92;alpha &#92;, x(n+3)) - ((1-&#92;alpha) x(n-1) + &#92;alpha &#92;, x(n-2))" class="latex" /></p>
<p>On the right side, we add a new pixel at the right end with weight &alpha;, and we need to &quot;upgrade&quot; our old rightmost pixel to weight 1 by adding it in again weighted with (1-&alpha;); we do the same on the left side to subtract out the old samples. Interestingly, if you look at what we&#039;re actually adding and subtracting, it&#039;s just linear interpolation between two adjacent samples &#8211; i.e. linear filtering (the 1D version of bilinear filtering), the kind of thing that texture sampling hardware is great at!</p>
<p>Time to update our pseudocode: (I&#039;m writing this using floats for convenience, in practice you might want to use fixed point)</p>
<pre>
  // Convolve x with a box filter of "radius" r, ignoring boundary
  // conditions.
  float scale = 1.0f / (2.0f * r + 1.0f); // or use fixed point
  int m = (int) r; // integer part of radius
  float alpha = r - m; // fractional part

  // Compute sum at first pixel
  float sum = x[0];
  for (int i=0; i &lt; m; i++)
    sum += x[-i] + x[i];
  sum += alpha * (x[-m] + x[m]);

  // Generate output pixel, then update running sum for next pixel.
  // lerp(t, a, b) = (1-t)*a + t*b = a + t * (b-a)
  for (int i=0; i &lt; n; i++) {
    y[i] = sum * scale;
    sum += lerp(alpha, x[i+m+1], x[i+m+2]);
    sum -= lerp(alpha, x[i-m], x[i-m-1]);
  }
</pre>
<p>I think this is really pretty slick &#8211; especially when implemented on the GPU, where you can combine the linear interpolation and sampling steps into a single bilinear sample, for an inner loop with two texture samples and a tiny bit of arithmetic. However nice this may be, though, we&#8217;re still stuck with box filters &#8211; let&#8217;s fix that one too.</p>
<h3>Unboxing</h3>
<p>Okay, we know we can do box filters fast, but they&#8217;re crappy, and we also know how to do arbitrary filters, but that&#8217;s much slower. However, we can build better filters by using our box filters as a building block. In particular, we can apply our box filters multiple times. Now, to give you an understanding of what&#8217;s going on, I&#8217;m going to switch to the continuous domain for a bit, because that gets rid of a few discretization artifacts like the in-between box filters introduced above (which make analysis trickier), and it&#8217;s also nicer for drawing plots. As in the introduction, I&#8217;m not going to give you a proper definition or theory of convolution (neither discrete nor continuous); as usual, you can step by <a href="http://en.wikipedia.org/wiki/Convolution">Wikipedia</a> if you&#8217;re interested in details. What we need to know here are just two things: first, there is a continuous version of the discrete convolution we&#8217;ve been using so far that behaves essentially the same way (except now, we perform integration instead of the finite summation we had before), and second, convolution is associative, i.e. <img src="https://s0.wp.com/latex.php?latex=%28f+%2A+g%29+%2A+h+%3D+f+%2A+%28g+%2A+h%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=%28f+%2A+g%29+%2A+h+%3D+f+%2A+%28g+%2A+h%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%28f+%2A+g%29+%2A+h+%3D+f+%2A+%28g+%2A+h%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="(f * g) * h = f * (g * h)" class="latex" />. In our case, we&#8217;re gonna convolve our image I with a box filter b multiple times, and for this case we get something like <img src="https://s0.wp.com/latex.php?latex=b+%2A+b+%2A+I+%3D+%28b+%2A+b%29+%2A+I&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=b+%2A+b+%2A+I+%3D+%28b+%2A+b%29+%2A+I&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=b+%2A+b+%2A+I+%3D+%28b+%2A+b%29+%2A+I&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="b * b * I = (b * b) * I" class="latex" />. Point being, applying a box filter p times is equivalent to filtering the image once with a filter that&#8217;s a box convolved p-1 times with itself.</p>
<p>In other words, using just our single lowly box filter, we can build fancier filters just by applying it multiple times. So what filters do we get? Let&#8217;s plot the first few iterations, starting from a single box filter with radius 1. Note that in the continuous realm, there really is only one box filter &#8211; because we&#8217;re not on a discrete grid, we can generate other widths just by scaling the &#8220;time&#8221; value we pass into our filter function. This is one of the ways that looking at continuous filters makes our life easier here. Anyway, here&#8217;s the pretty pictures:</p>
<p><div data-shortcode="caption" id="attachment_1021" style="width: 411px" class="wp-caption aligncenter"><a href="https://fgiesen.files.wordpress.com/2012/07/box11.png"><img loading="lazy" aria-describedby="caption-attachment-1021" data-attachment-id="1021" data-permalink="https://fgiesen.wordpress.com/2012/08/01/fast-blurs-2/box1/" data-orig-file="https://fgiesen.files.wordpress.com/2012/07/box11.png" data-orig-size="401,201" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Box filter" data-image-description="" data-image-caption="&lt;p&gt;Box filter (p=1)&lt;/p&gt;
" data-medium-file="https://fgiesen.files.wordpress.com/2012/07/box11.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2012/07/box11.png?w=401" src="https://fgiesen.files.wordpress.com/2012/07/box11.png?w=401&#038;h=201" alt="" title="Box filter" width="401" height="201" class="size-full wp-image-1021" srcset="https://fgiesen.files.wordpress.com/2012/07/box11.png 401w, https://fgiesen.files.wordpress.com/2012/07/box11.png?w=150&amp;h=75 150w, https://fgiesen.files.wordpress.com/2012/07/box11.png?w=300&amp;h=150 300w" sizes="(max-width: 401px) 100vw, 401px" /></a><p id="caption-attachment-1021" class="wp-caption-text">Box filter (p=1)</p></div><br />
<div data-shortcode="caption" id="attachment_1022" style="width: 411px" class="wp-caption aligncenter"><a href="https://fgiesen.files.wordpress.com/2012/07/box21.png"><img loading="lazy" aria-describedby="caption-attachment-1022" data-attachment-id="1022" data-permalink="https://fgiesen.wordpress.com/2012/08/01/fast-blurs-2/box2/" data-orig-file="https://fgiesen.files.wordpress.com/2012/07/box21.png" data-orig-size="401,201" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Triangle filter" data-image-description="" data-image-caption="&lt;p&gt;Triangle filter (p=2)&lt;/p&gt;
" data-medium-file="https://fgiesen.files.wordpress.com/2012/07/box21.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2012/07/box21.png?w=401" src="https://fgiesen.files.wordpress.com/2012/07/box21.png?w=401&#038;h=201" alt="" title="Triangle filter" width="401" height="201" class="size-full wp-image-1022" srcset="https://fgiesen.files.wordpress.com/2012/07/box21.png 401w, https://fgiesen.files.wordpress.com/2012/07/box21.png?w=150&amp;h=75 150w, https://fgiesen.files.wordpress.com/2012/07/box21.png?w=300&amp;h=150 300w" sizes="(max-width: 401px) 100vw, 401px" /></a><p id="caption-attachment-1022" class="wp-caption-text">Triangle filter (p=2)</p></div><br />
<div data-shortcode="caption" id="attachment_1023" style="width: 411px" class="wp-caption aligncenter"><a href="https://fgiesen.files.wordpress.com/2012/07/box31.png"><img loading="lazy" aria-describedby="caption-attachment-1023" data-attachment-id="1023" data-permalink="https://fgiesen.wordpress.com/2012/08/01/fast-blurs-2/box3/" data-orig-file="https://fgiesen.files.wordpress.com/2012/07/box31.png" data-orig-size="401,201" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Piecewise quadratic filter" data-image-description="" data-image-caption="&lt;p&gt;Piecewise quadratic filter (p=3)&lt;/p&gt;
" data-medium-file="https://fgiesen.files.wordpress.com/2012/07/box31.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2012/07/box31.png?w=401" src="https://fgiesen.files.wordpress.com/2012/07/box31.png?w=401&#038;h=201" alt="" title="Piecewise quadratic filter" width="401" height="201" class="size-full wp-image-1023" srcset="https://fgiesen.files.wordpress.com/2012/07/box31.png 401w, https://fgiesen.files.wordpress.com/2012/07/box31.png?w=150&amp;h=75 150w, https://fgiesen.files.wordpress.com/2012/07/box31.png?w=300&amp;h=150 300w" sizes="(max-width: 401px) 100vw, 401px" /></a><p id="caption-attachment-1023" class="wp-caption-text">Piecewise quadratic filter (p=3)</p></div><br />
<div data-shortcode="caption" id="attachment_1024" style="width: 411px" class="wp-caption aligncenter"><a href="https://fgiesen.files.wordpress.com/2012/07/box41.png"><img loading="lazy" aria-describedby="caption-attachment-1024" data-attachment-id="1024" data-permalink="https://fgiesen.wordpress.com/2012/08/01/fast-blurs-2/box4/" data-orig-file="https://fgiesen.files.wordpress.com/2012/07/box41.png" data-orig-size="401,201" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="Piecewise cubic filter" data-image-description="" data-image-caption="&lt;p&gt;Piecewise cubic filter (p=4)&lt;/p&gt;
" data-medium-file="https://fgiesen.files.wordpress.com/2012/07/box41.png?w=300" data-large-file="https://fgiesen.files.wordpress.com/2012/07/box41.png?w=401" src="https://fgiesen.files.wordpress.com/2012/07/box41.png?w=401&#038;h=201" alt="" title="Piecewise cubic filter" width="401" height="201" class="size-full wp-image-1024" srcset="https://fgiesen.files.wordpress.com/2012/07/box41.png 401w, https://fgiesen.files.wordpress.com/2012/07/box41.png?w=150&amp;h=75 150w, https://fgiesen.files.wordpress.com/2012/07/box41.png?w=300&amp;h=150 300w" sizes="(max-width: 401px) 100vw, 401px" /></a><p id="caption-attachment-1024" class="wp-caption-text">Piecewise cubic filter (p=4)</p></div></p>
<p>These all show the filter in question in red, plotted over a Gaussian with matching variance (<img src="https://s0.wp.com/latex.php?latex=%5Csigma%5E2+%3D+p%2F12&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=%5Csigma%5E2+%3D+p%2F12&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Csigma%5E2+%3D+p%2F12&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="&#92;sigma^2 = p/12" class="latex" />) and scale in blue (ahem, not to suggest any conclusions or anything!). Triangle filters also have a name (which is just as inspired as &#8220;box filter&#8221;, natch), the other ones I&#8217;ve labeled simply by what type of function they correspond to. Anyway, the point to take away here is that convolving a box with itself quickly leads to pretty smooth functions (the <a href="http://en.wikipedia.org/wiki/B-spline">Cardinal B-splines</a>, in fact) that happen to get quite similar to a Gaussian very fast. In fact, as we keep increasing p (the order), they will converge towards a &#8220;real&#8221; Gaussian &#8211; proof <a href="http://bigwww.epfl.ch/publications/unser9201.pdf">here</a> for the curious. (You can also stay with the discrete version and prove this using the Central Limit Theorem, but I like this version more). The piecewise cubic function you get for p=4 above is already within 3% (absolute) error of the corresponding Gaussian. In practice, exact match with a Gaussian isn&#8217;t that important anyway as long as you&#8217;re just blurring images, and just using p=3 is normally fine.</p>
<p>So, to get better filters, just blur multiple times. Easy! So easy in fact that I&#8217;m not gonna bore you with another piece of pseudocode &#8211; I trust you can imagine a &#8220;do this p times&#8221; outer loop on your own. Now, to be fair, the convolutions I showed you were done in the continuous domain, and none of them actually prove anything about the modified box filters we&#8217;re using. To their defense, we can at least argue that our modified box filter definition is reasonable in that it&#8217;s the discretization of the continuous box filter with radius r to a pixel grid, where the pixels themselves have a rectangular shape (this is a fairly good model of current displays). But if we actually wanted to prove consistency and convergence using our filters, we&#8217;d need to prove that our discrete version is a good approximation of the continuous version. I&#8217;ll skip that here since it gets technical and doesn&#8217;t really add anything if you just want fast blurs, but if you want to see proof (and also get an exact expression for the variance of our iterated modified box filters), <a href="http://www.mia.uni-saarland.de/Publications/gwosdek-ssvm11.pdf">this paper</a> has the answers you seek.</p>
<h3>Implementation notes</h3>
<p>And that&#8217;s really all you need to know to get nice, fast quasi-Gaussian blurs, no matter how wide the kernel. For a GPU implementation, you should be able to more or less directly take my pseudo-code above and turn it into a Compute Shader: for the horizontal blur passes, have each thread in a group work on a different scan line (and for vertical passes, have each thread work on a different row). To get higher orders, you just keep iterating the passes, ping-ponging between destination render targets (or UAVs if you want to stick with DX11 terminology). Important caveat: I haven&#8217;t actually tried this (yet) and I don&#8217;t know how well it performs; you might need to whack the code a bit to actually get good performance. As usual with GPUs, once you have the basic algorithm you&#8217;re at best half-done; the rest is figuring out the right way to package and partition the data.</p>
<p>I can, however, tell you how to implement this nicely on CPUs. Computation-wise, we&#8217;ll stick exactly with the algorithm above, but we can eke out big wins by doing things in the right order to maximize cache hit ratios.</p>
<p>Let&#8217;s look at the horizontal passes first. These nicely walk the image left to right and access memory sequentially. Very nice locality of reference, very good for memory prefetchers. There&#8217;s not much you can do wrong here &#8211; however, one important optimization is to perform the multiple filter passes <em>per scanline</em> (or per group of scanlines) and not filter the whole image multiple times. That is, you want your code to do this:</p>
<pre>
for (int y=0; y &lt; height; y++) {
  for (int pass=0; pass &lt; num_passes; pass++) {
    blur_scanline(y, radius);
  }
}
</pre>
<p>and not</p>
<pre>
// Do *NOT* do it like this!
for (int pass=0; pass &lt; num_passes; pass++) {
  for (int y=0; y &lt; height; y++) {
    blur_scanline(y, radius);
  }
}
</pre>
<p>This is because a single scan line might fit inside your L1 data cache, and very likely fits fully inside your L2 cache. Unless your image is small, the whole image won&#8217;t. So in the second version, by the point you get back to y=0 for the second pass, all of the image data has likely dropped out of the L1 and L2 caches and needs to be fetched again from memory (or the L3 cache if you&#8217;re lucky). That&#8217;s a completely unnecessary waste of time and memory bandwidth &#8211; so avoid this.</p>
<p>Second, vertical passes. If you implement them naively, they will likely be significantly slower (sometimes by an order of magnitude!) than the horizontal passes. The problem is that stepping through images vertically is not making good use of the cache: a cache line these days is usually 64 bytes (even 128 on some platforms), while a single pixel usually takes somewhere between 4 and 16 bytes (depending on the pixel format). So we keep fetching cache lines from memory (or lower cache levels) and only looking at between 1/16th and 1/4th of the data we get back! And since the cache operates at cache line granularity, that means we get an effective cache size of somewhere between a quarter and a sixteenth of the actual cache size (this goes both for L1 and L2). Ouch! And same as above with the multiple passes, if we do this, then it becomes a lot more likely that by the time we come back to any given cache line (once we&#8217;re finished with our current column of the image and proceeded to the next one), a lot of the image has dropped out of the closest cache levels.</p>
<p>There&#8217;s multiple ways to solve this problem. You can write your vertical blur pass so that it always works on multiple columns at a time &#8211; enough to always be reading (and writing) full cache lines. This works, but it means that the horizontal and vertical blur code are really two completely different loops, and on most architectures you&#8217;re likely to run out of registers when doing this, which also makes things slower.</p>
<p>A better approach (in my opinion anyway) is to have only one blur loop &#8211; let&#8217;s say the horizontal one. Now, to do a vertical blur pass, we first read a narrow stripe of N columns and write it out, transposed (i.e. rows and columns interchanged), into a scratch buffer. N is chosen so that we&#8217;re consuming full cache lines at a time. Then we run our multiple horizontal blur passes on the N scan lines in our scratch buffer, and finally transpose again as we&#8217;re writing out the stripe to the image. The upshot is that we only have one blur loop that gets used for both directions (and only has to be optimized once); the complexity gets shifted into the read and write loops, which are glorified copy loops (okay, with a transpose inside them) that are much easier to get right. And the scratch buffer is small enough to stay in the cache (maybe not L1 for large images, but probably L2) all the time.</p>
<p>However, we still have two separate copy loops, and now there&#8217;s this weird asymmetry between horizontal and vertical blurs. Can&#8217;t we do better? Let&#8217;s look at the sequence of steps that we effectively perform:</p>
<ol>
<li>Horizontal blur pass.</li>
<li>Transpose.</li>
<li>Horizontal blur pass.</li>
<li>Transpose.</li>
</ol>
<p>Currently, our horizontal blur does step 1, and our vertical blur does steps 2-4. But as should be obvious when you write it this way, it&#8217;s actually much more natural to write a function that does steps 1 and 2 (which are the same as steps 3 and 4, just potentially using a different blur radius). So instead of writing one copy loop that transposes from the input to a scratch buffer, and one copy loop that transposes from a scratch buffer to the output, just have the horizontal blur always write to the scratch buffer, and only ever use the second type of copy loop (scratch buffer to output with transpose).</p>
<p>And that&#8217;s it! For reference, a version of the Werkkzeug3 Blur written using SSE2 intrinsics (and adapted from an earlier MMX-only version) is <a href="https://github.com/farbrausch/fr_public/blob/master/altona_wz4/wz4/wz4frlib/wz3_bitmap_code.cpp">here</a> &#8211; the function is <code>GenBitmap::Blur</code>. This one uses fixed-point arithmetic throughout, which gets a bit awkward in places, because we have to work on 32-bit numbers using the MMX set of arithmetic operations which doesn&#8217;t include any 32-bit multiplies. It also performs blurring and transpose in two separate steps, making a pass over the full image each time, for no good reason whatsoever. 🙂 If there&#8217;s interest, I can write a more detailed explanation of exactly how that code works in a separate post; for now, I think I&#8217;ve covered enough material. Take care, and remember to blur responsibly!</p>
]]></html><thumbnail_url><![CDATA[https://fgiesen.files.wordpress.com/2012/07/box11.png?fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[]]></thumbnail_width><thumbnail_height><![CDATA[]]></thumbnail_height></oembed>