<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[rANS with static probability&nbsp;distributions]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>In the <a href="https://fgiesen.wordpress.com/2014/02/02/rans-notes/">previous post</a>, I wrote about rANS in general. The ANS family is, in essence, just a different design approach for arithmetic coders, with somewhat different trade-offs, strengths and weaknesses than existing coders. In this post, I am going to talk specifically about using rANS as a drop-in replacement for (static) Huffman coding: that is, we are encoding data with a known, static probability distribution for symbols. I am also going to assume a compress-once decode-often scenario: slowing down the encoder (within reason) is acceptable if doing so makes the decoder faster. It turns out that rANS is very useful in this kind of setting.</p>
<h3>Review</h3>
<p>Last time, we defined the rANS encoding and decoding functions, assuming a finite alphabet <img src="https://s0.wp.com/latex.php?latex=%5Cmathcal%7BA%7D+%3D+%5C%7B+0%2C+%5Cdots%2C+n+-+1+%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=%5Cmathcal%7BA%7D+%3D+%5C%7B+0%2C+%5Cdots%2C+n+-+1+%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Cmathcal%7BA%7D+%3D+%5C%7B+0%2C+%5Cdots%2C+n+-+1+%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="&#92;mathcal{A} = &#92;{ 0, &#92;dots, n - 1 &#92;}" class="latex" /> of n symbols numbered 0 to n-1.</p>
<p><img src="https://s0.wp.com/latex.php?latex=C%28s%2Cx%29+%3A%3D+M+%5Clfloor+x%2FF_s+%5Crfloor+%2B+B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=C%28s%2Cx%29+%3A%3D+M+%5Clfloor+x%2FF_s+%5Crfloor+%2B+B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=C%28s%2Cx%29+%3A%3D+M+%5Clfloor+x%2FF_s+%5Crfloor+%2B+B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="C(s,x) := M &#92;lfloor x/F_s &#92;rfloor + B_s + (x &#92;bmod F_s)" class="latex" /><br />
<img src="https://s0.wp.com/latex.php?latex=D%28x%29+%3A%3D+%28s%2C+F_s+%5Clfloor+x%2FM+%5Crfloor+%2B+%28x+%5Cbmod+M%29+-+B_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=D%28x%29+%3A%3D+%28s%2C+F_s+%5Clfloor+x%2FM+%5Crfloor+%2B+%28x+%5Cbmod+M%29+-+B_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=D%28x%29+%3A%3D+%28s%2C+F_s+%5Clfloor+x%2FM+%5Crfloor+%2B+%28x+%5Cbmod+M%29+-+B_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="D(x) := (s, F_s &#92;lfloor x/M &#92;rfloor + (x &#92;bmod M) - B_s)" class="latex" /> where <img src="https://s0.wp.com/latex.php?latex=s+%3D+s%28x+%5Cbmod+M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=s+%3D+s%28x+%5Cbmod+M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=s+%3D+s%28x+%5Cbmod+M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="s = s(x &#92;bmod M)" class="latex" />.</p>
<p>where <img src="https://s0.wp.com/latex.php?latex=F_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=F_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=F_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="F_s" class="latex" /> is the frequency of symbol s, <img src="https://s0.wp.com/latex.php?latex=B_s+%3D+%5Csum_%7Bi%3D0%7D%5E%7Bs-1%7D+F_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=B_s+%3D+%5Csum_%7Bi%3D0%7D%5E%7Bs-1%7D+F_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=B_s+%3D+%5Csum_%7Bi%3D0%7D%5E%7Bs-1%7D+F_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="B_s = &#92;sum_{i=0}^{s-1} F_i" class="latex" /> is the sum of the frequencies of all symbols before s, and <img src="https://s0.wp.com/latex.php?latex=M+%3D+%5Csum_%7Bi%3D0%7D%5E%7Bn-1%7D+F_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=M+%3D+%5Csum_%7Bi%3D0%7D%5E%7Bn-1%7D+F_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=M+%3D+%5Csum_%7Bi%3D0%7D%5E%7Bn-1%7D+F_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="M = &#92;sum_{i=0}^{n-1} F_i" class="latex" /> is the sum of all symbol frequencies. Then a given symbol s has (assumed) probability <img src="https://s0.wp.com/latex.php?latex=p_s+%3D+F_s+%2F+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=p_s+%3D+F_s+%2F+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=p_s+%3D+F_s+%2F+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="p_s = F_s / M" class="latex" />.</p>
<p>Furthermore, as noted in the previous post, M can&#8217;t be chosen arbitrarily; it must divide L (the lower bound of our normalized interval) for the encoding/decoding algorithms we saw to work.</p>
<p>Given these constraints and the form of C and D, it&#8217;s very convenient to have M be a power of 2; this replaces the divisions and modulo operations in the decoder with bit masking and bit shifts. We also choose L as a power of 2 (which needs to be at least as large as M, since otherwise M can&#8217;t divide L).</p>
<p>This means that, starting from a reference probability distribution, we need to approximate the probabilities as fractions with common denominator M. My colleague Charles Bloom just wrote <a href="http://cbloomrants.blogspot.com/2014/02/02-11-14-understanding-ans-10.html">a blog post on that very topic</a>, so I&#8217;m going to refer you there for details on how to do this optimally.</p>
<h3>Getting rid of per-symbol divisions in the encoder</h3>
<p>Making M a power of two removes the division/modulo operations in the decoder, but the encoder still has to perform them. However, note that we&#8217;re only ever dividing by the symbol frequencies <img src="https://s0.wp.com/latex.php?latex=F_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=F_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=F_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="F_s" class="latex" />, which are known at the start of the encoding operation (in our &#8220;static probability distribution&#8221; setting). The question is, does that help?</p>
<p>You bet it does. A little known fact (amongst most programmers who aren&#8217;t compiler writers or bit hacking aficionados anyway) is that division of a <img src="https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="p" class="latex" />-bit unsigned integer by a constant can always be performed as fixed-point multiplication with a reciprocal, using <img src="https://s0.wp.com/latex.php?latex=2p%2B1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=2p%2B1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=2p%2B1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="2p+1" class="latex" /> bits (or less) of intermediate precision. This is <em>exact</em> &#8211; no round-off error involved. Compilers like to use this technique on integer divisions by constants, since multiplication (even long multiplication) is typically much faster than division.</p>
<p>There are several papers on how to choose the &#8220;magic constants&#8221; (with proofs); however, most of them are designed to be used in the code generator of a compiler. As such, they generally have several possible code sequences for division by constants, and try to use the cheapest one that works for the given divisor. This makes sense in a compiler, but not in our case, where the exact frequencies are not known at compile time and doing run-time branching between different possible instruction sequences would cost more than it saves. Therefore, I would suggest sticking with Alverson&#8217;s original paper <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.1710">&#8220;Integer division using reciprocals&#8221;</a>.</p>
<p>The <a href="https://github.com/rygorous/ryg_rans/blob/854451f698c3a604de7441d0a3f953a2b3053ac9/rans_byte.h#L150">example code</a> I linked to implements this approach, replacing the division/modulo pair with a pair of integer multiplications; when using this approach, it makes sense to limit the state variable to 31 bits (or 63 bits on 64-bit architectures): as said before, the reciprocal method requires <img src="https://s0.wp.com/latex.php?latex=2p%2B1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=2p%2B1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=2p%2B1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="2p+1" class="latex" /> bits working precision for worst-case divisors, and reducing the range by 1 bit enables a faster (and simpler) implementation than would be required for a full-range variant (especially in C/C++ code, where multi-precision arithmetic is not easy to express). Note that handling the case <img src="https://s0.wp.com/latex.php?latex=F_s%3D1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=F_s%3D1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=F_s%3D1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="F_s=1" class="latex" /> requires some extra work; details are explained in the code.</p>
<h3>Symbol lookup</h3>
<p>There&#8217;s one important step in the decoder that I haven&#8217;t talked about yet: mapping from the &#8220;slot index&#8221; <img src="https://s0.wp.com/latex.php?latex=x+%5Cbmod+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=x+%5Cbmod+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=x+%5Cbmod+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="x &#92;bmod M" class="latex" /> to the corresponding symbol index. In normal rANS, each symbol covers a contiguous range of the &#8220;slot index&#8221; space (by contrast to say tANS, where the slots for any given symbol are <a href="http://cbloomrants.blogspot.com/2014/02/02-06-14-understanding-ans-8.html">spread relatively uniformly across the slot index space</a>). That means that, if all else fails, we can figure out the symbol ID using a binary search in <img src="https://s0.wp.com/latex.php?latex=%5Clceil%5Clog_2+n%5Crceil&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=%5Clceil%5Clog_2+n%5Crceil&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%5Clceil%5Clog_2+n%5Crceil&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="&#92;lceil&#92;log_2 n&#92;rceil" class="latex" /> steps (remember that n is the size of our alphabet) from the cumulative frequency table (the <img src="https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="B_s" class="latex" />, which take O(n) space) &#8211; independent of the size of M. That&#8217;s comforting to know, but doing a binary search per symbol is, in practice, quite expensive compared to the rest of the decoding work we do.</p>
<p>At the other extreme, we can just prepare a look-up table mapping from the cumulative frequency to the corresponding symbol ID. This is very simple (and the technique used in the example code) and theoretically constant-time per symbol, but it requires a table with M entries &#8211; and if the table ends up being too large to fit comfortably in a core&#8217;s L1 cache, real-world performance (although still technically bounded by a constant per symbol) can get quite bad. Moving in the other direction, if M is small enough, it can make sense to store the per-symbol information in the M-indexed table too, and avoid the extra indirection; I would not recommend this far beyond M=2<sup>12</sup> though.</p>
<p>Anyway, that gives us two design points: we can use <img src="https://s0.wp.com/latex.php?latex=O%28n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=O%28n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=O%28n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="O(n)" class="latex" /> space, at a cost of <img src="https://s0.wp.com/latex.php?latex=O%28%5Clog+n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=O%28%5Clog+n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=O%28%5Clog+n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="O(&#92;log n)" class="latex" /> per symbol lookup; or we can use <img src="https://s0.wp.com/latex.php?latex=O%28M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=O%28M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=O%28M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="O(M)" class="latex" /> space, with <img src="https://s0.wp.com/latex.php?latex=O%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=O%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=O%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="O(1)" class="latex" /> symbol lookup cost. Now what we&#8217;d really like is to get <img src="https://s0.wp.com/latex.php?latex=O%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=O%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=O%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="O(1)" class="latex" /> symbol lookup in <img src="https://s0.wp.com/latex.php?latex=O%28n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=O%28n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=O%28n%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="O(n)" class="latex" /> space, but sadly that option&#8217;s not on the menu.</p>
<p>Or is it?</p>
<h3>The alias method</h3>
<p>To make a long story short, I&#8217;m not aware of any way to meet our performance goals with the original unmodified rANS algorithm; however, we can do much better if we&#8217;re willing to relax our requirements a bit. Notably, there&#8217;s no deep reason for us to require that the slots assigned to a given symbol s be contiguous; we already know that e.g. tANS works in a much more relaxed setting. So let&#8217;s assume, for the time being, that we can rearrange our slot to symbol mapping arbitrarily (we&#8217;ll have to check if this is actually true later, and also work through what it means for our encoder). What does that buy us?</p>
<p>It buys us all we need to meet our performance goals, it turns out (props to my colleague Sean Barrett, who was the first one to figure this out, in our internal email exchanges anyway). As the section title says, the key turns out to be a stochastic sampling technique called the <a href="http://en.wikipedia.org/wiki/Alias_method">&#8220;alias method&#8221;</a>. I&#8217;m not gonna explain the details here and instead refer you to <a href="http://pandasthumb.org/archives/2012/08/lab-notes-the-a.html">this short introduction</a> (written by a computational evolutionary geneticist, on randomly picking base pairs) and <a href="http://www.keithschwarz.com/darts-dice-coins/">&#8220;Darts, Dice and Coins&#8221;</a>, a much longer article that covers multiple ways to sample from a nonuniform distribution (by the way, note that the warnings about numerical instability that often accompany descriptions of the alias method need not worry us; we&#8217;re dealing with integer frequencies here so there&#8217;s no round-off error).</p>
<p>At this point, you might be wondering what the alias method, a technique for sampling from a non-uniform discrete probability distribution, has anything to do with entropy (de)coding. The answer is that the symbol look-up problem is essentially the same thing: we have a &#8220;random&#8221; value <img src="https://s0.wp.com/latex.php?latex=x+%5Cbmod+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=x+%5Cbmod+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=x+%5Cbmod+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="x &#92;bmod M" class="latex" /> from the interval [0,M-1], and a matching non-uniform probability distribution (our symbol frequencies). Drawing symbols according to that distribution defines a map from [0,M-1] to our symbol alphabet, which is precisely what we need for our decoding function.</p>
<p>So what does the alias method do? Well, if you followed the link to the <a href="http://pandasthumb.org/archives/2012/08/lab-notes-the-a.html">article I mentioned earlier</a>, you get the general picture: it partitions the probabilities for our n-symbol alphabet into n &#8220;buckets&#8221;, such that each bucket i references at most 2 symbols (one of which is symbol i), and the probabilities within each bucket sum to the same value (namely, 1/n). This is <em>always</em> possible, and there is an algorithm (due to Vose) which determines such a partition in O(n) time. More generally, we can do so for any N&ge;n, by just adding some dummy symbols with frequency 0 at the end. In practice, it&#8217;s convenient to have N be a power of two, so for arbitrary n we would just pick N to be the smallest power of 2 that is &ge;n.</p>
<p>Translating the sampling terminology to our rANS setting: we can subdivide our interval [0,M-1] into N sub-intervals (&#8220;buckets&#8221;) of equal size k=M/N, such that each bucket i references at most 2 distinct symbols, one of which is symbol i. We also need to know what the other symbol referenced in this bucket is &#8211; <code>alias[i]</code>, the &#8220;alias&#8221; that gives the methods its name &#8211; and the position <code>divider[i]</code> of the &#8220;dividing line&#8221; between the two symbols.</p>
<p>With these two tables, determining the symbol ID from x is quick and easy:</p>
<pre>
  uint xM = x % M; // bit masking (power-of-2 M)
  uint bucket_id = xM / K; // shift (power-of-2 M/N!)
  uint symbol = bucket_id;
  if (xM &gt;= divider[bucket_id]) // primary symbol or alias?
    symbol = alias[bucket_id];
</pre>
<p>This is O(1) time and O(N) = O(n) space (for the &#8220;divider&#8221; and &#8220;alias&#8221; arrays), as promised. However, this isn&#8217;t quite enough for rANS: remember that for our decoding function D, we need to know not just the symbol ID, but also which of the (potentially many) slots assigned to that symbol we ended up in; with regular rANS, this was simple since all slots assigned to a symbol are sequential, starting at slot <img src="https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="B_s" class="latex" />:<br />
<img src="https://s0.wp.com/latex.php?latex=D%28x%29+%3D+%28s%2C+F_s+%5Clfloor+x%2FM+%5Crfloor+%2B+%28x+%5Cbmod+M%29+-+B_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=D%28x%29+%3D+%28s%2C+F_s+%5Clfloor+x%2FM+%5Crfloor+%2B+%28x+%5Cbmod+M%29+-+B_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=D%28x%29+%3D+%28s%2C+F_s+%5Clfloor+x%2FM+%5Crfloor+%2B+%28x+%5Cbmod+M%29+-+B_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="D(x) = (s, F_s &#92;lfloor x/M &#92;rfloor + (x &#92;bmod M) - B_s)" class="latex" /> where <img src="https://s0.wp.com/latex.php?latex=s+%3D+s%28x+%5Cbmod+M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=s+%3D+s%28x+%5Cbmod+M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=s+%3D+s%28x+%5Cbmod+M%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="s = s(x &#92;bmod M)" class="latex" />.<br />
Here, the <img src="https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+M%29+-+B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+M%29+-+B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+M%29+-+B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="(x &#92;bmod M) - B_s" class="latex" /> part is the number we need. Now with the alias method, the slot IDs assigned to a symbol aren&#8217;t necessarily contiguous anymore. However, within each bucket, the slot IDs assigned to a symbol are sequential &#8211; which means that instead of the cumulative frequencies <img src="https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=B_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="B_s" class="latex" />, we now have two separate per bucket. This allows us to define the complete &#8220;alias rANS&#8221; decoding function:</p>
<pre>
  // s, x = D(x) with "alias rANS"
  uint xM = x % M;
  uint bucket_id = xM / K;
  uint symbol, bias;
  if (xM &lt; divider[bucket_id]) { // primary symbol or alias?
    symbol = bucket_id;
    bias = primary_start[bucket_id];
  } else {
    symbol = alias[bucket_id];
    bias = alias_start[bucket_id];
  }
  x = (x / M) * freq[symbol] + xM - bias;
</pre>
<p>And although this code is written with branches for clarity, it is in fact fairly easy to do branch-free. We gained another two tables indexed with the bucket ID; generating them is another straightforward linear pass over the buckets: we just need to keep track of how many slots we&#8217;ve assigned to each symbol so far. And that&#8217;s it &#8211; this is all we need for a complete &#8220;alias rANS&#8221; decoder.</p>
<p>However, there&#8217;s one more minor tweak we can make: note that the only part of the computation that actually depends on <code>symbol</code> is the evaluation of <code>freq[symbol]</code>; if we store the frequencies for both symbols in each alias table bucket, we can get rid of the dependent look-ups. This can be a performance win in practice; on the other hand, it does waste a bit of extra memory on the alias table, so you might want to skip on it.</p>
<p>Either way, this alias method allow us to perform quite fast (though not as fast as a fully-unrolled table for small M) symbol look-ups, for large M, with memory overhead (and preparation time) proportional to n. That&#8217;s quite cool, and might be particularly interesting in cases where you either have relatively small alphabets (say on the order of 15-25 symbols), need lots of different tables, or frequently switch between tables.</p>
<h3>Encoding</h3>
<p>However, we haven&#8217;t covered encoding yet. With regular rANS, encoding is easy, since &#8211; again &#8211; the slot ranges for each symbol are contiguous; the encoder just does<br />
<img src="https://s0.wp.com/latex.php?latex=C%28s%2Cx%29+%3D+M+%5Clfloor+x%2FF_s+%5Crfloor+%2B+B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=C%28s%2Cx%29+%3D+M+%5Clfloor+x%2FF_s+%5Crfloor+%2B+B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=C%28s%2Cx%29+%3D+M+%5Clfloor+x%2FF_s+%5Crfloor+%2B+B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="C(s,x) = M &#92;lfloor x/F_s &#92;rfloor + B_s + (x &#92;bmod F_s)" class="latex" /><br />
where <img src="https://s0.wp.com/latex.php?latex=B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=B_s+%2B+%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="B_s + (x &#92;bmod F_s)" class="latex" /> is the slot id corresponding to the <img src="https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="(x &#92;bmod F_s)" class="latex" />&#8216;th appearance of symbol s.</p>
<p>With alias rANS, each symbol may have its slots distributed across multiple, disjoint intervals &#8211; up to N of them. And the encoder now needs to map <img src="https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=%28x+%5Cbmod+F_s%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="(x &#92;bmod F_s)" class="latex" /> to a corresponding slot index that will decode correctly. One way to do this is to just keep track of the mapping as we build the alias table; this takes O(M) space and is O(1) cost per symbol. Another is to keep a sorted list of subintervals (and their cumulative sizes) assigned to each symbol; this takes only O(N) space, but adds a $O(\log_2 N)$ (worst-case) lookup per symbol in the encoder. Sound familiar?</p>
<p>In short, using the alias method doesn&#8217;t <em>really</em> solve the symbol lookup problem for large M; or, more precisely, it solves the lookup problem on the decoder side, but at the cost of adding an equivalent problem on the encoder side. What this means is that we have to pick our poison: faster encoding (at the some extra cost in the decoder), or faster decoding (at some extra cost in the encoder). This is fine, though; it means we get to make a trade-off, depending on which of the two steps is more important to us overall. And as long as we are in a compress-once decompress-often scenario (which is fairly typical), making the decoder faster at some reasonable extra cost in the encoder is definitely useful.</p>
<h3>Conclusion</h3>
<p>We can exploit static, known probabilities in several ways for rANS and related coders: for encoding, we can precompute the right &#8220;magic values&#8221; to avoid divisions in the hot encoding loop; and if we want to support large M, the alias method enables fast decoding without generating a giant table with M entries &#8211; with an O(n) preprocessing step (where n is the number of symbols), we can still support O(1) symbol decoding, albeit with a (slightly) higher constant factor.</p>
<p>I&#8217;m aware that this post is somewhat hand-wavey; the problem is that while Vose&#8217;s algorithm and the associated post-processing are actually quite straightforward, there&#8217;s a lot of index manipulation, and I find the relevant steps to be quite hard to express in prose without the &#8220;explanation&#8221; ending up harder to read than the actual code. So instead, my intent was to convey the &#8220;big picture&#8221;; a sample implementation of alias table rANS, with all the details, can be found &#8211; as usual &#8211; on <a href="https://github.com/rygorous/ryg_rans/blob/master/main_alias.cpp#L159">Github</a>.</p>
<p>And that&#8217;s it for today &#8211; take care!</p>
]]></html></oembed>