<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Bit scanning equivalencies]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>A lot of bit manipulation operations exist basically everywhere: AND, OR, XOR/EOR, shifts, and so forth. Some exist only on some architectures but have obvious implementations everywhere else &#8211; NAND, NOR, equivalence/XNOR and so forth.</p>
<p>Some exist as built-in instructions on some architectures and require fairly lengthy multi-instruction sequences when they&#8217;re not available. Some examples would be population count (number of bits set; available on recent x86s and POWER but not PowerPC), or bit reversal (available on ARMv6T2 and above).</p>
<p>And then there&#8217;s bit scanning operations. All of these can be done fairly efficiently on most architectures, but it&#8217;s not always obvious how.</p>
<h3>x86 bit scanning operations</h3>
<p>On x86, there&#8217;s <code>BSF</code> (bit scan forward), <code>BSR</code> (bit scan reverse), <code>TZCNT</code> (trailing zero count) and <code>LZCNT</code> (leading zero count). The first two are &#8220;old&#8221; (having existed since 386), the latter are very new. <code>BSF</code> and <code>TZCNT</code> do basically the same thing, with one difference: <code>BSF</code> has &#8220;undefined&#8221; results on a zero input (what is the index of the first set bit in the integer 0?), whereas the &#8220;trailing zeros&#8221; definition for <code>TZCNT</code> provides an obvious answer: the width of the register. <code>BSR</code> returns the index of the &#8220;last&#8221; set bit (again undefined on 0 input), <code>LZCNT</code> returns the number of leading zero bits, so these two actually produce different results.</p>
<ul>
<li><code>x86_bsf(x) = x86_tzcnt(x)</code> unless x=0 (in which case <code>bsf</code> is undefined).</li>
<li><code>x86_bsr(x) = (reg_width - 1) - x86_lzcnt(x)</code> unless x=0 (in which case <code>bsr</code> is undefined).</li>
</ul>
<p>In the following, I&#8217;ll only use <code>LZCNT</code> and <code>TZCNT</code> (use above table to convert where necessary) simply to avoid that nasty edge case at 0.</p>
<h3>PowerPC bit scanning operations</h3>
<p>PPC has <code>cntlzw</code> (Count Leading Zeros Word) and <code>cntlzd</code> (Count Leading Zeros Double Word) but no equivalent for trailing zero bits. We can get fairly close though: there&#8217;s the old trick <code>x &amp; -x</code> which clears all but the lowest set bit in <code>x</code>. As long as x is nonzero, this value has exactly one bit set, and we can determine its position using a leading zero count.</p>
<ul>
<li><code>x86_tzcnt(x) = (reg_width - 1) - ppc_cntlz(x &amp; -x)</code> unless x=0 (in which case the PPC expression returns -1).
<li>
<li><code>x86_lzcnt(x) = ppc_cntlz(x)</code>.</li>
</ul>
<h3>ARM bit scanning operations</h3>
<p>ARM also has <code>CLZ</code> (Count Leading Zeros) but nothing for trailing zero bits. But it also offers the aforementioned <code>RBIT</code> which reverses the bits in a word, which makes a trailing zero count easy to accomplish:</p>
<ul>
<li><code>x86_tzcnt(x) = arm_clz(arm_rbit(x))</code>.</li>
<li><code>x86_lzcnt(x) = arm_clz(x)</code>.</li>
</ul>
<h3>Bonus</h3>
<p>Finally, ARMs NEON also offers <code>VCLS</code> (Vector Count Leading Sign Bits), which (quoting from the documentation) &#8220;counts the number of consecutive bits following the topmost bit, that are the same as the topmost bit&#8221;. Well, we can do that on all architectures I mentioned as well, using only ingredients we already have: <code>arm_cls(x) = x86_lzcnt(x ^ (x &gt;&gt; 1)) - 1</code> (the shift here is an arithmetic shift). The expression <code>y = x ^ (x &gt;&gt; 1)</code> gives a value that has bit <code>n</code> set if and only if bits <code>n</code> and <code>n + 1</code> of x are the same. By induction, the number of leading zeros in <code>y</code> is thus exactly the number of leading bits in <code>x</code> that match the sign bit. This count includes the topmost (sign) bit, so it&#8217;s always at least 1, and the instruction definition I just quoted requires us to return the number of bits <em>following</em> the topmost bit that match it. So we subtract 1 to get the right result. Since we can do a fast leading zero count on all quoted platforms, we&#8217;re good.</p>
<p>Conclusion: bit scans / zero bit counts in either direction can be done fairly efficiently on all architectures covered, but you need to be careful when zeros are involved.</p>
]]></html></oembed>