<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[Experimental chill]]></provider_name><provider_url><![CDATA[http://danlark.org]]></provider_url><author_name><![CDATA[Danila Kutenin]]></author_name><author_url><![CDATA[https://danlark.org/author/kutdanila/]]></author_url><title><![CDATA[Optimizing 128-bit Division]]></title><type><![CDATA[link]]></type><html><![CDATA[
<p>When it comes to hashing, sometimes 64 bit is not enough, for example, because of <a href="https://en.wikipedia.org/wiki/Birthday_problem">birthday paradox</a> &#8212; the hacker can iterate through random <img src="https://s0.wp.com/latex.php?latex=2%5E%7B32%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="2^{32}" class="latex" /> entities and it can be proven that with some constant probability they will find a collision, i.e. two different objects will have the same hash. <img src="https://s0.wp.com/latex.php?latex=2%5E%7B32%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="2^{32}" class="latex" /> is around 4 billion objects and with the current power capacity in each computer it is certainly achievable. That&#8217;s why we need sometimes to advance the bitness of hashing to at least 128 bits. Unfortunately, it comes with a cost because platforms and CPUs do not support 128 bit operations natively.</p>



<p>Division historically is the most complex operation on CPUs and all guidelines suggest avoiding the division at all costs.</p>



<p>At my job I faced an interesting problem of optimizing 128 bit division from <a href="https://github.com/abseil/abseil-cpp/blob/master/absl/numeric/int128.cc#L155">abseil library</a> in order to split some data across buckets with the help of 128 bit hashing (the number of buckets is not fixed for some uninteresting historical reasons). I found out that the division takes a really long time. The <a href="https://github.com/abseil/abseil-cpp/blob/master/absl/numeric/int128_benchmark.cc#L52">benchmarks</a> from abseil on Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz show some horrible results</p>



<pre class="wp-block-code"><code>Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor     13.8     13.8  // 128 bit by 128 bit
BM_DivideClass128SmallDivisor        168      168  // 128 bit by 64 bit</code></pre>



<p>150 nanoseconds for dividing the random 128 bit number by a random 64 bit number? Sounds crazy. For example, <code>div</code> instruction on x86-64 Skylake takes 76 cycles (also, for AMD processors it is much less), the division takes around 20-22ns.</p>



<figure class="wp-block-image size-large"><img data-attachment-id="361" data-permalink="https://danlark.org/d2elnjfdkne/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png" data-orig-size="921,461" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="d2elnjfdkne" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png?w=921" src="https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png?w=921" alt="" class="wp-image-361" srcset="https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png 921w, https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png?w=150 150w, https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png?w=300 300w, https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png?w=768 768w" sizes="(max-width: 921px) 100vw, 921px" /><figcaption><a href="https://godbolt.org/z/o2vTZr">https://godbolt.org/z/o2vTZr</a></figcaption></figure>



<p>In reality everything is slightly better because of pipeline execution and division has its own ALU, so if you divide something and do something else in the next instructions, you will get lower average latency. Still, 128 bit division cannot be 8x slower than 64 bit division. All latencies you can find in Agner Fog <a href="https://www.agner.org/optimize/instruction_tables.pdf">instruction table</a> for most of the modern x86 CPUs. The truth is more complex and division latency can even depend on the values given.</p>



<figure class="wp-block-image size-large"><img data-attachment-id="299" data-permalink="https://danlark.org/2020-06-14-182043_835x215_scrot/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-182043_835x215_scrot.png" data-orig-size="835,215" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="2020-06-14-182043_835x215_scrot" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-182043_835x215_scrot.png?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-182043_835x215_scrot.png?w=835" src="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-182043_835x215_scrot.png?w=835" alt="" class="wp-image-299" srcset="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-182043_835x215_scrot.png 835w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-182043_835x215_scrot.png?w=150 150w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-182043_835x215_scrot.png?w=300 300w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-182043_835x215_scrot.png?w=768 768w" sizes="(max-width: 835px) 100vw, 835px" /><figcaption>Agner Fog instruction table for Skylake CPUs, the second but last column is the latency.</figcaption></figure>



<p>Even compilers when dividing by some constants, try to use the reciprocal (or, the same as inverse in a ring) value and multiply the reciprocal and the value with some shifts afterwards</p>



<figure class="wp-block-image size-large"><img data-attachment-id="315" data-permalink="https://danlark.org/2020-06-14-192300_861x251_scrot/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-192300_861x251_scrot.png" data-orig-size="861,251" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="2020-06-14-192300_861x251_scrot" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-192300_861x251_scrot.png?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-192300_861x251_scrot.png?w=861" src="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-192300_861x251_scrot.png?w=861" alt="" class="wp-image-315" srcset="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-192300_861x251_scrot.png 861w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-192300_861x251_scrot.png?w=150 150w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-192300_861x251_scrot.png?w=300 300w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-192300_861x251_scrot.png?w=768 768w" sizes="(max-width: 861px) 100vw, 861px" /><figcaption><a href="https://gcc.godbolt.org/z/PRibsx">https://gcc.godbolt.org/z/PRibsx</a></figcaption></figure>



<p>Overall, given the fact that only some <code>sin</code>, <code>cos</code> instructions cost more than division, division is one of the most complex instructions in CPUs and optimizations in that place matter a lot. My exact case was more or less general, maybe I was dividing 128 bit by 64 bit a bit more frequent. We are going to optimize the general case in LLVM.</p>



<p>We need to understand how 128 bit division is working through the compiler stack.</p>



<figure class="wp-block-image size-large"><img data-attachment-id="303" data-permalink="https://danlark.org/2020-06-14-183125_682x238_scrot/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-183125_682x238_scrot.png" data-orig-size="682,238" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="2020-06-14-183125_682x238_scrot" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-183125_682x238_scrot.png?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-183125_682x238_scrot.png?w=682" src="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-183125_682x238_scrot.png?w=682" alt="" class="wp-image-303" srcset="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-183125_682x238_scrot.png 682w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-183125_682x238_scrot.png?w=150 150w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-183125_682x238_scrot.png?w=300 300w" sizes="(max-width: 682px) 100vw, 682px" /><figcaption><a href="https://gcc.godbolt.org/z/fB3aq2">https://gcc.godbolt.org/z/fB3aq2</a></figcaption></figure>



<p>It calls <code>__udivti3</code> function. Let&#8217;s first understand how to read these functions. In runtime libraries the modes of the functions are:</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container">
<pre class="wp-block-code"><code>QI: An integer that is as wide as the smallest addressable unit, usually 8 bits.
HI: An integer, twice as wide as a QI mode integer, usually 16 bits.
SI: An integer, four times as wide as a QI mode integer, usually 32 bits.
DI: An integer, eight times as wide as a QI mode integer, usually 64 bits.
SF: A floating point value, as wide as a SI mode integer, usually 32 bits.
DF: A floating point value, as wide as a DI mode integer, usually 64 bits.
TI: An integer, 16 times as wide as a QI mode integer, usually 128 bits.</code></pre>
</div></div>



<p>So, <code>udivti3</code> is an <strong>u</strong>nsigned division of TI (128 bits) integers, last &#8216;<em>3&#8242;</em> means that it has 3 arguments including the return value. Also, there is a function <code>__udivmodti4</code> which computes the divisor and the remainder (division and modulo operation) and it has 4 arguments including the returning value. These functions are a part of runtime libraries which compilers provide by default. For example, in GCC it is <a href="http://gcc.gnu.org/onlinedocs/gccint/Libgcc.html#Libgcc">libgcc</a>, in LLVM it is <a href="https://compiler-rt.llvm.org/">compiler-rt</a>, they are linked almost in every program if you have the corresponding toolchain. In LLVM, <code>__udivti3</code> is a simple alias to <code>__udivmodti4</code>.</p>



<figure class="wp-block-embed is-type-rich"><div class="wp-block-embed__wrapper">
<style>.gist table { margin-bottom: 0; }</style><div style="tab-size: 8" id="gist103719357" class="gist">
    <div class="gist-file" translate="no">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container file-box">
  <div id="file-modti3-c" class="file my-2">
    
  <div itemprop="text" class="Box-body p-0 blob-wrapper data type-c  ">

      
<table class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip>
      <tr>
        <td id="file-modti3-c-L1" class="blob-num js-line-number" data-line-number="1"></td>
        <td id="file-modti3-c-LC1" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">include</span> <span class="pl-s"><span class="pl-pds">&quot;</span>int_lib.h<span class="pl-pds">&quot;</span></span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L2" class="blob-num js-line-number" data-line-number="2"></td>
        <td id="file-modti3-c-LC2" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L3" class="blob-num js-line-number" data-line-number="3"></td>
        <td id="file-modti3-c-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">/*</span></span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L4" class="blob-num js-line-number" data-line-number="4"></td>
        <td id="file-modti3-c-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef      int si_int;</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L5" class="blob-num js-line-number" data-line-number="5"></td>
        <td id="file-modti3-c-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef unsigned su_int;</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L6" class="blob-num js-line-number" data-line-number="6"></td>
        <td id="file-modti3-c-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-c"></span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L7" class="blob-num js-line-number" data-line-number="7"></td>
        <td id="file-modti3-c-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef          long long di_int;</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L8" class="blob-num js-line-number" data-line-number="8"></td>
        <td id="file-modti3-c-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef unsigned long long du_int;</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L9" class="blob-num js-line-number" data-line-number="9"></td>
        <td id="file-modti3-c-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-c"></span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L10" class="blob-num js-line-number" data-line-number="10"></td>
        <td id="file-modti3-c-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef int ti_int __attribute__((mode(TI))); // 128 signed</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L11" class="blob-num js-line-number" data-line-number="11"></td>
        <td id="file-modti3-c-LC11" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef unsigned tu_int __attribute__((mode(TI))); // 128 bit unsigned</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L12" class="blob-num js-line-number" data-line-number="12"></td>
        <td id="file-modti3-c-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">*/</span></span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L13" class="blob-num js-line-number" data-line-number="13"></td>
        <td id="file-modti3-c-LC13" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L14" class="blob-num js-line-number" data-line-number="14"></td>
        <td id="file-modti3-c-LC14" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">ifdef</span> CRT_HAS_128BIT</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L15" class="blob-num js-line-number" data-line-number="15"></td>
        <td id="file-modti3-c-LC15" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L16" class="blob-num js-line-number" data-line-number="16"></td>
        <td id="file-modti3-c-LC16" class="blob-code blob-code-inner js-file-line">COMPILER_RT_ABI tu_int <span class="pl-en">__umodti3</span>(tu_int a, tu_int b) {</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L17" class="blob-num js-line-number" data-line-number="17"></td>
        <td id="file-modti3-c-LC17" class="blob-code blob-code-inner js-file-line">  tu_int r;</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L18" class="blob-num js-line-number" data-line-number="18"></td>
        <td id="file-modti3-c-LC18" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">__udivmodti4</span>(a, b, &amp;r);</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L19" class="blob-num js-line-number" data-line-number="19"></td>
        <td id="file-modti3-c-LC19" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> r;</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L20" class="blob-num js-line-number" data-line-number="20"></td>
        <td id="file-modti3-c-LC20" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L21" class="blob-num js-line-number" data-line-number="21"></td>
        <td id="file-modti3-c-LC21" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L22" class="blob-num js-line-number" data-line-number="22"></td>
        <td id="file-modti3-c-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">//</span> Returns: a % b</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L23" class="blob-num js-line-number" data-line-number="23"></td>
        <td id="file-modti3-c-LC23" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L24" class="blob-num js-line-number" data-line-number="24"></td>
        <td id="file-modti3-c-LC24" class="blob-code blob-code-inner js-file-line">COMPILER_RT_ABI ti_int <span class="pl-en">__modti3</span>(ti_int a, ti_int b) {</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L25" class="blob-num js-line-number" data-line-number="25"></td>
        <td id="file-modti3-c-LC25" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">const</span> <span class="pl-k">int</span> bits_in_tword_m1 = (<span class="pl-k">int</span>)(<span class="pl-k">sizeof</span>(ti_int) * CHAR_BIT) &#8211; <span class="pl-c1">1</span>;</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L26" class="blob-num js-line-number" data-line-number="26"></td>
        <td id="file-modti3-c-LC26" class="blob-code blob-code-inner js-file-line">  ti_int s = b &gt;&gt; bits_in_tword_m1; <span class="pl-c"><span class="pl-c">//</span> s = b &lt; 0 ? -1 : 0</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L27" class="blob-num js-line-number" data-line-number="27"></td>
        <td id="file-modti3-c-LC27" class="blob-code blob-code-inner js-file-line">  b = (b ^ s) &#8211; s;                  <span class="pl-c"><span class="pl-c">//</span> negate if s == -1</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L28" class="blob-num js-line-number" data-line-number="28"></td>
        <td id="file-modti3-c-LC28" class="blob-code blob-code-inner js-file-line">  s = a &gt;&gt; bits_in_tword_m1;        <span class="pl-c"><span class="pl-c">//</span> s = a &lt; 0 ? -1 : 0</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L29" class="blob-num js-line-number" data-line-number="29"></td>
        <td id="file-modti3-c-LC29" class="blob-code blob-code-inner js-file-line">  a = (a ^ s) &#8211; s;                  <span class="pl-c"><span class="pl-c">//</span> negate if s == -1</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L30" class="blob-num js-line-number" data-line-number="30"></td>
        <td id="file-modti3-c-LC30" class="blob-code blob-code-inner js-file-line">  tu_int r;</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L31" class="blob-num js-line-number" data-line-number="31"></td>
        <td id="file-modti3-c-LC31" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">__udivmodti4</span>(a, b, &amp;r);</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L32" class="blob-num js-line-number" data-line-number="32"></td>
        <td id="file-modti3-c-LC32" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> ((ti_int)r ^ s) &#8211; s; <span class="pl-c"><span class="pl-c">//</span> negate if s == -1</span></td>
      </tr>
      <tr>
        <td id="file-modti3-c-L33" class="blob-num js-line-number" data-line-number="33"></td>
        <td id="file-modti3-c-LC33" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L34" class="blob-num js-line-number" data-line-number="34"></td>
        <td id="file-modti3-c-LC34" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-modti3-c-L35" class="blob-num js-line-number" data-line-number="35"></td>
        <td id="file-modti3-c-LC35" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">endif</span> <span class="pl-c"><span class="pl-c">//</span> CRT_HAS_128BIT</span></td>
      </tr>
</table>


  </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/danlark1/7ae49544dfaa5cfebb6c70a11c198e92/raw/0ca715c2778335088be6ba6de0ffb344810396cc/modti3.c" style="float:right">view raw</a>
        <a href="https://gist.github.com/danlark1/7ae49544dfaa5cfebb6c70a11c198e92#file-modti3-c">modti3.c</a>
        hosted with &#10084; by <a href="https://github.com">GitHub</a>
      </div>
    </div>
    <div class="gist-file" translate="no">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container file-box">
  <div id="file-udivti3-c" class="file my-2">
    
  <div itemprop="text" class="Box-body p-0 blob-wrapper data type-c  ">

      
<table class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip>
      <tr>
        <td id="file-udivti3-c-L1" class="blob-num js-line-number" data-line-number="1"></td>
        <td id="file-udivti3-c-LC1" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">include</span> <span class="pl-s"><span class="pl-pds">&quot;</span>int_lib.h<span class="pl-pds">&quot;</span></span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L2" class="blob-num js-line-number" data-line-number="2"></td>
        <td id="file-udivti3-c-LC2" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L3" class="blob-num js-line-number" data-line-number="3"></td>
        <td id="file-udivti3-c-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">/*</span></span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L4" class="blob-num js-line-number" data-line-number="4"></td>
        <td id="file-udivti3-c-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef      int si_int;</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L5" class="blob-num js-line-number" data-line-number="5"></td>
        <td id="file-udivti3-c-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef unsigned su_int;</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L6" class="blob-num js-line-number" data-line-number="6"></td>
        <td id="file-udivti3-c-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-c"></span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L7" class="blob-num js-line-number" data-line-number="7"></td>
        <td id="file-udivti3-c-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef          long long di_int;</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L8" class="blob-num js-line-number" data-line-number="8"></td>
        <td id="file-udivti3-c-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef unsigned long long du_int;</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L9" class="blob-num js-line-number" data-line-number="9"></td>
        <td id="file-udivti3-c-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-c"></span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L10" class="blob-num js-line-number" data-line-number="10"></td>
        <td id="file-udivti3-c-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef int ti_int __attribute__((mode(TI))); // 128 signed</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L11" class="blob-num js-line-number" data-line-number="11"></td>
        <td id="file-udivti3-c-LC11" class="blob-code blob-code-inner js-file-line"><span class="pl-c">typedef unsigned tu_int __attribute__((mode(TI))); // 128 bit unsigned</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L12" class="blob-num js-line-number" data-line-number="12"></td>
        <td id="file-udivti3-c-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">*/</span></span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L13" class="blob-num js-line-number" data-line-number="13"></td>
        <td id="file-udivti3-c-LC13" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L14" class="blob-num js-line-number" data-line-number="14"></td>
        <td id="file-udivti3-c-LC14" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">ifdef</span> CRT_HAS_128BIT</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L15" class="blob-num js-line-number" data-line-number="15"></td>
        <td id="file-udivti3-c-LC15" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L16" class="blob-num js-line-number" data-line-number="16"></td>
        <td id="file-udivti3-c-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">//</span> Returns: a / b</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L17" class="blob-num js-line-number" data-line-number="17"></td>
        <td id="file-udivti3-c-LC17" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L18" class="blob-num js-line-number" data-line-number="18"></td>
        <td id="file-udivti3-c-LC18" class="blob-code blob-code-inner js-file-line">COMPILER_RT_ABI tu_int <span class="pl-en">__udivti3</span>(tu_int a, tu_int b) {</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L19" class="blob-num js-line-number" data-line-number="19"></td>
        <td id="file-udivti3-c-LC19" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> <span class="pl-c1">__udivmodti4</span>(a, b, <span class="pl-c1">0</span>);</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L20" class="blob-num js-line-number" data-line-number="20"></td>
        <td id="file-udivti3-c-LC20" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L21" class="blob-num js-line-number" data-line-number="21"></td>
        <td id="file-udivti3-c-LC21" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L22" class="blob-num js-line-number" data-line-number="22"></td>
        <td id="file-udivti3-c-LC22" class="blob-code blob-code-inner js-file-line">COMPILER_RT_ABI ti_int <span class="pl-en">__divti3</span>(ti_int a, ti_int b) {</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L23" class="blob-num js-line-number" data-line-number="23"></td>
        <td id="file-udivti3-c-LC23" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">const</span> <span class="pl-k">int</span> bits_in_tword_m1 = (<span class="pl-k">int</span>)(<span class="pl-k">sizeof</span>(ti_int) * CHAR_BIT) &#8211; <span class="pl-c1">1</span>;</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L24" class="blob-num js-line-number" data-line-number="24"></td>
        <td id="file-udivti3-c-LC24" class="blob-code blob-code-inner js-file-line">  ti_int s_a = a &gt;&gt; bits_in_tword_m1;                   <span class="pl-c"><span class="pl-c">//</span> s_a = a &lt; 0 ? -1 : 0</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L25" class="blob-num js-line-number" data-line-number="25"></td>
        <td id="file-udivti3-c-LC25" class="blob-code blob-code-inner js-file-line">  ti_int s_b = b &gt;&gt; bits_in_tword_m1;                   <span class="pl-c"><span class="pl-c">//</span> s_b = b &lt; 0 ? -1 : 0</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L26" class="blob-num js-line-number" data-line-number="26"></td>
        <td id="file-udivti3-c-LC26" class="blob-code blob-code-inner js-file-line">  a = (a ^ s_a) &#8211; s_a;                                  <span class="pl-c"><span class="pl-c">//</span> negate if s_a == -1</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L27" class="blob-num js-line-number" data-line-number="27"></td>
        <td id="file-udivti3-c-LC27" class="blob-code blob-code-inner js-file-line">  b = (b ^ s_b) &#8211; s_b;                                  <span class="pl-c"><span class="pl-c">//</span> negate if s_b == -1</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L28" class="blob-num js-line-number" data-line-number="28"></td>
        <td id="file-udivti3-c-LC28" class="blob-code blob-code-inner js-file-line">  s_a ^= s_b;                                           <span class="pl-c"><span class="pl-c">//</span> sign of quotient</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L29" class="blob-num js-line-number" data-line-number="29"></td>
        <td id="file-udivti3-c-LC29" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> (<span class="pl-c1">__udivmodti4</span>(a, b, (tu_int *)<span class="pl-c1">0</span>) ^ s_a) &#8211; s_a; <span class="pl-c"><span class="pl-c">//</span> negate if s_a == -1</span></td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L30" class="blob-num js-line-number" data-line-number="30"></td>
        <td id="file-udivti3-c-LC30" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L31" class="blob-num js-line-number" data-line-number="31"></td>
        <td id="file-udivti3-c-LC31" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-udivti3-c-L32" class="blob-num js-line-number" data-line-number="32"></td>
        <td id="file-udivti3-c-LC32" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">endif</span> <span class="pl-c"><span class="pl-c">//</span> CRT_HAS_128BIT</span></td>
      </tr>
</table>


  </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/danlark1/7ae49544dfaa5cfebb6c70a11c198e92/raw/0ca715c2778335088be6ba6de0ffb344810396cc/udivti3.c" style="float:right">view raw</a>
        <a href="https://gist.github.com/danlark1/7ae49544dfaa5cfebb6c70a11c198e92#file-udivti3-c">udivti3.c</a>
        hosted with &#10084; by <a href="https://github.com">GitHub</a>
      </div>
    </div>
</div>

</div></figure>



<p><code><a href="https://github.com/llvm-mirror/compiler-rt/blob/release_90/lib/builtins/udivmodti4.c#L20">__udivmodti4</a></code> function was written with the help of <code>Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide.</code> After looking at it <a href="https://cr.yp.to/2005-590/powerpc-cwg.pdf">here</a>, it looks like this was written long time ago and things have changed since then</p>



<figure class="wp-block-image size-large"><img data-attachment-id="313" data-permalink="https://danlark.org/2020-06-14-191400_1038x718_scrot/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png" data-orig-size="1038,718" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="2020-06-14-191400_1038x718_scrot" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png?w=1024" src="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png?w=1024" alt="" class="wp-image-313" srcset="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png?w=1024 1024w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png?w=150 150w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png?w=300 300w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png?w=768 768w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-191400_1038x718_scrot.png 1038w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>First of all, let&#8217;s come up with something easy, like shift-subtract algorithm that we have been learning since childhood. First, if <code>divisor &gt; dividend</code>, then the quotient is zero and remainder is the <code>dividend</code>, not an interesting case. </p>



<figure class="wp-block-embed is-type-rich"><div class="wp-block-embed__wrapper">
<style>.gist table { margin-bottom: 0; }</style><div style="tab-size: 8" id="gist103719706" class="gist">
    <div class="gist-file" translate="no">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container file-box">
  <div id="file-shift_subtract-c" class="file my-2">
    
  <div itemprop="text" class="Box-body p-0 blob-wrapper data type-c  ">

      
<table class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip>
      <tr>
        <td id="file-shift_subtract-c-L1" class="blob-num js-line-number" data-line-number="1"></td>
        <td id="file-shift_subtract-c-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">//</span> dividend / divisor, remainder is stored in rem.</span></td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L2" class="blob-num js-line-number" data-line-number="2"></td>
        <td id="file-shift_subtract-c-LC2" class="blob-code blob-code-inner js-file-line">uint128 <span class="pl-en">__udivmodti4</span>(uint128 dividend, uint128 divisor, uint128* rem) {</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L3" class="blob-num js-line-number" data-line-number="3"></td>
        <td id="file-shift_subtract-c-LC3" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">if</span> (divisor &gt; dividend) {</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L4" class="blob-num js-line-number" data-line-number="4"></td>
        <td id="file-shift_subtract-c-LC4" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> (rem)</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L5" class="blob-num js-line-number" data-line-number="5"></td>
        <td id="file-shift_subtract-c-LC5" class="blob-code blob-code-inner js-file-line">      *rem = dividend;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L6" class="blob-num js-line-number" data-line-number="6"></td>
        <td id="file-shift_subtract-c-LC6" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">return</span> <span class="pl-c1">0</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L7" class="blob-num js-line-number" data-line-number="7"></td>
        <td id="file-shift_subtract-c-LC7" class="blob-code blob-code-inner js-file-line">  }</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L8" class="blob-num js-line-number" data-line-number="8"></td>
        <td id="file-shift_subtract-c-LC8" class="blob-code blob-code-inner js-file-line">   <span class="pl-c"><span class="pl-c">//</span> Calculate the distance between most significant bits, 128 &gt; shift &gt;= 0.</span></td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L9" class="blob-num js-line-number" data-line-number="9"></td>
        <td id="file-shift_subtract-c-LC9" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">int</span> shift = <span class="pl-c1">Distance</span>(dividend, divisor);</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L10" class="blob-num js-line-number" data-line-number="10"></td>
        <td id="file-shift_subtract-c-LC10" class="blob-code blob-code-inner js-file-line">  divisor &lt;&lt;= shift;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L11" class="blob-num js-line-number" data-line-number="11"></td>
        <td id="file-shift_subtract-c-LC11" class="blob-code blob-code-inner js-file-line">  quotient = <span class="pl-c1">0</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L12" class="blob-num js-line-number" data-line-number="12"></td>
        <td id="file-shift_subtract-c-LC12" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">for</span> (; shift &gt;= <span class="pl-c1">0</span>; &#8211;shift) {</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L13" class="blob-num js-line-number" data-line-number="13"></td>
        <td id="file-shift_subtract-c-LC13" class="blob-code blob-code-inner js-file-line">    quotient &lt;&lt;= <span class="pl-c1">1</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L14" class="blob-num js-line-number" data-line-number="14"></td>
        <td id="file-shift_subtract-c-LC14" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> (dividend &gt;= divisor) {</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L15" class="blob-num js-line-number" data-line-number="15"></td>
        <td id="file-shift_subtract-c-LC15" class="blob-code blob-code-inner js-file-line">       dividend -= divisor;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L16" class="blob-num js-line-number" data-line-number="16"></td>
        <td id="file-shift_subtract-c-LC16" class="blob-code blob-code-inner js-file-line">       quotient |= <span class="pl-c1">1</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L17" class="blob-num js-line-number" data-line-number="17"></td>
        <td id="file-shift_subtract-c-LC17" class="blob-code blob-code-inner js-file-line">    }</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L18" class="blob-num js-line-number" data-line-number="18"></td>
        <td id="file-shift_subtract-c-LC18" class="blob-code blob-code-inner js-file-line">    divisor &gt;&gt;= <span class="pl-c1">1</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L19" class="blob-num js-line-number" data-line-number="19"></td>
        <td id="file-shift_subtract-c-LC19" class="blob-code blob-code-inner js-file-line">  }</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L20" class="blob-num js-line-number" data-line-number="20"></td>
        <td id="file-shift_subtract-c-LC20" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">if</span> (rem)</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L21" class="blob-num js-line-number" data-line-number="21"></td>
        <td id="file-shift_subtract-c-LC21" class="blob-code blob-code-inner js-file-line">    *rem = dividend;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L22" class="blob-num js-line-number" data-line-number="22"></td>
        <td id="file-shift_subtract-c-LC22" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> quotient;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract-c-L23" class="blob-num js-line-number" data-line-number="23"></td>
        <td id="file-shift_subtract-c-LC23" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
</table>


  </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/danlark1/83cd2eba75ef0426793c2abca98f5725/raw/8d01bc83157a5986c4d97cee6cd8eef0b5e026fd/shift_subtract.c" style="float:right">view raw</a>
        <a href="https://gist.github.com/danlark1/83cd2eba75ef0426793c2abca98f5725#file-shift_subtract-c">shift_subtract.c</a>
        hosted with &#10084; by <a href="https://github.com">GitHub</a>
      </div>
    </div>
</div>

</div></figure>



<p>The algorithm is easy, we align the numbers by their most significant bits, if dividend is more than divisor, subtract and add 1 to the output, then shift by 1 and repeat.  Some sort of animation can be seen like that:</p>



<figure class="wp-block-image size-large"><img data-attachment-id="318" data-permalink="https://danlark.org/simplescreenrecorder-2020-06-14_20-23-25/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/simplescreenrecorder-2020-06-14_20.23.25.gif" data-orig-size="918,442" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="simplescreenrecorder-2020-06-14_20.23.25" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/simplescreenrecorder-2020-06-14_20.23.25.gif?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/simplescreenrecorder-2020-06-14_20.23.25.gif?w=918" src="https://danlarkorg.files.wordpress.com/2020/06/simplescreenrecorder-2020-06-14_20.23.25.gif?w=918" alt="" class="wp-image-318" /></figure>



<p>For 128 bit division it will take at most 128 iterations in the for loop. Actually, the implementation in <a href="https://github.com/llvm-mirror/compiler-rt/blob/release_90/lib/builtins/udivmodti4.c#L173">LLVM</a> for loop is a fallback and we saw it takes 150+ns to complete it because it requires to shift many registers because 128 bit numbers are represented as two registers.</p>



<p>Now, let&#8217;s dive into the architecture features. I noticed that while the compiler generates the <code>divq</code> instructions, it frees <code>rdx</code> register</p>



<figure class="wp-block-image size-large"><img data-attachment-id="362" data-permalink="https://danlark.org/2ugj4bgvw4x/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/2ugj4bgvw4x.png" data-orig-size="891,206" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="2ugj4bgvw4x" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/2ugj4bgvw4x.png?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/2ugj4bgvw4x.png?w=891" src="https://danlarkorg.files.wordpress.com/2020/06/2ugj4bgvw4x.png?w=891" alt="" class="wp-image-362" srcset="https://danlarkorg.files.wordpress.com/2020/06/2ugj4bgvw4x.png 891w, https://danlarkorg.files.wordpress.com/2020/06/2ugj4bgvw4x.png?w=150 150w, https://danlarkorg.files.wordpress.com/2020/06/2ugj4bgvw4x.png?w=300 300w, https://danlarkorg.files.wordpress.com/2020/06/2ugj4bgvw4x.png?w=768 768w" sizes="(max-width: 891px) 100vw, 891px" /></figure>



<p>In the manual they say the following</p>



<figure class="wp-block-image size-large"><img data-attachment-id="322" data-permalink="https://danlark.org/2020-06-14-204644_860x119_scrot/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-204644_860x119_scrot.png" data-orig-size="860,119" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="2020-06-14-204644_860x119_scrot" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-204644_860x119_scrot.png?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-204644_860x119_scrot.png?w=860" src="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-204644_860x119_scrot.png?w=860" alt="" class="wp-image-322" srcset="https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-204644_860x119_scrot.png 860w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-204644_860x119_scrot.png?w=150 150w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-204644_860x119_scrot.png?w=300 300w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-14-204644_860x119_scrot.png?w=768 768w" sizes="(max-width: 860px) 100vw, 860px" /></figure>



<p><code>divq</code> instruction provides 128 bit division from [%rdx]:[%rax] by <code>S</code>. The quotient is stored in <code>%rax</code> and the remainder in <code>%rdx</code>. After some experimenting with inline asm in C/C++, I figured out that if the result does not fit in 64 bits, SIGFPE is raised. See:</p>



<figure class="wp-block-embed is-type-rich"><div class="wp-block-embed__wrapper">
<style>.gist table { margin-bottom: 0; }</style><div style="tab-size: 8" id="gist103720424" class="gist">
    <div class="gist-file" translate="no">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container file-box">
  <div id="file-divq-c" class="file my-2">
    
  <div itemprop="text" class="Box-body p-0 blob-wrapper data type-c  ">

      
<table class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip>
      <tr>
        <td id="file-divq-c-L1" class="blob-num js-line-number" data-line-number="1"></td>
        <td id="file-divq-c-LC1" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">include</span> <span class="pl-s"><span class="pl-pds">&lt;</span>cinttypes<span class="pl-pds">&gt;</span></span></td>
      </tr>
      <tr>
        <td id="file-divq-c-L2" class="blob-num js-line-number" data-line-number="2"></td>
        <td id="file-divq-c-LC2" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-divq-c-L3" class="blob-num js-line-number" data-line-number="3"></td>
        <td id="file-divq-c-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">uint64_t</span> <span class="pl-en">div</span>(<span class="pl-c1">uint64_t</span> u1, <span class="pl-c1">uint64_t</span> u0, <span class="pl-c1">uint64_t</span> v) {</td>
      </tr>
      <tr>
        <td id="file-divq-c-L4" class="blob-num js-line-number" data-line-number="4"></td>
        <td id="file-divq-c-LC4" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">uint64_t</span> result;</td>
      </tr>
      <tr>
        <td id="file-divq-c-L5" class="blob-num js-line-number" data-line-number="5"></td>
        <td id="file-divq-c-LC5" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">uint64_t</span> <span class="pl-c1">remainder</span>;</td>
      </tr>
      <tr>
        <td id="file-divq-c-L6" class="blob-num js-line-number" data-line-number="6"></td>
        <td id="file-divq-c-LC6" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">__asm__</span>(<span class="pl-s"><span class="pl-pds">&quot;</span>divq %[v]<span class="pl-pds">&quot;</span></span> : <span class="pl-s"><span class="pl-pds">&quot;</span>=a<span class="pl-pds">&quot;</span></span>(result), <span class="pl-s"><span class="pl-pds">&quot;</span>=d<span class="pl-pds">&quot;</span></span>(<span class="pl-c1">remainder</span>) : [v] <span class="pl-s"><span class="pl-pds">&quot;</span>r<span class="pl-pds">&quot;</span></span>(v), <span class="pl-s"><span class="pl-pds">&quot;</span>a<span class="pl-pds">&quot;</span></span>(u0), <span class="pl-s"><span class="pl-pds">&quot;</span>d<span class="pl-pds">&quot;</span></span>(u1));</td>
      </tr>
      <tr>
        <td id="file-divq-c-L7" class="blob-num js-line-number" data-line-number="7"></td>
        <td id="file-divq-c-LC7" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> result;</td>
      </tr>
      <tr>
        <td id="file-divq-c-L8" class="blob-num js-line-number" data-line-number="8"></td>
        <td id="file-divq-c-LC8" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
      <tr>
        <td id="file-divq-c-L9" class="blob-num js-line-number" data-line-number="9"></td>
        <td id="file-divq-c-LC9" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-divq-c-L10" class="blob-num js-line-number" data-line-number="10"></td>
        <td id="file-divq-c-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-k">int</span> <span class="pl-en">main</span>() {</td>
      </tr>
      <tr>
        <td id="file-divq-c-L11" class="blob-num js-line-number" data-line-number="11"></td>
        <td id="file-divq-c-LC11" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">div</span>(<span class="pl-c1">1</span>, <span class="pl-c1">0</span>, <span class="pl-c1">1</span>); <span class="pl-c"><span class="pl-c">//</span> 2**64 / 1</span></td>
      </tr>
      <tr>
        <td id="file-divq-c-L12" class="blob-num js-line-number" data-line-number="12"></td>
        <td id="file-divq-c-LC12" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
      <tr>
        <td id="file-divq-c-L13" class="blob-num js-line-number" data-line-number="13"></td>
        <td id="file-divq-c-LC13" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-divq-c-L14" class="blob-num js-line-number" data-line-number="14"></td>
        <td id="file-divq-c-LC14" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">/*</span></span></td>
      </tr>
      <tr>
        <td id="file-divq-c-L15" class="blob-num js-line-number" data-line-number="15"></td>
        <td id="file-divq-c-LC15" class="blob-code blob-code-inner js-file-line"><span class="pl-c">g++ -std=c++17 -O0 main.cpp -o main</span></td>
      </tr>
      <tr>
        <td id="file-divq-c-L16" class="blob-num js-line-number" data-line-number="16"></td>
        <td id="file-divq-c-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-c">./main </span></td>
      </tr>
      <tr>
        <td id="file-divq-c-L17" class="blob-num js-line-number" data-line-number="17"></td>
        <td id="file-divq-c-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-c">“./main” terminated by signal SIGFPE (Floating point exception)</span></td>
      </tr>
      <tr>
        <td id="file-divq-c-L18" class="blob-num js-line-number" data-line-number="18"></td>
        <td id="file-divq-c-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">*/</span></span></td>
      </tr>
</table>


  </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/danlark1/5afcaba4fdd3e74f0d4a014e332948aa/raw/acaf07f94d831a607c297d16d974488f4501d3df/divq.c" style="float:right">view raw</a>
        <a href="https://gist.github.com/danlark1/5afcaba4fdd3e74f0d4a014e332948aa#file-divq-c">divq.c</a>
        hosted with &#10084; by <a href="https://github.com">GitHub</a>
      </div>
    </div>
</div>

</div></figure>



<p>Compilers don&#8217;t use this instruction in 128 bit division because they cannot know for sure if the result is going to fit in 64 bits. Yet, if the high 64 bits of the 128 bit number is smaller than the divisor, the result fits into 64 bits and we can use this instruction. As compilers don&#8217;t generate <code>div</code>q instruction for their own reasons, we would use inline asm for x86-64.</p>



<figure class="wp-block-embed is-type-rich"><div class="wp-block-embed__wrapper">
<style>.gist table { margin-bottom: 0; }</style><div style="tab-size: 8" id="gist103720539" class="gist">
    <div class="gist-file" translate="no">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container file-box">
  <div id="file-optimizing_64_bit-cpp" class="file my-2">
    
  <div itemprop="text" class="Box-body p-0 blob-wrapper data type-c  ">

      
<table class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L1" class="blob-num js-line-number" data-line-number="1"></td>
        <td id="file-optimizing_64_bit-cpp-LC1" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">if</span> defined(__x86_64__)</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L2" class="blob-num js-line-number" data-line-number="2"></td>
        <td id="file-optimizing_64_bit-cpp-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">inline</span> <span class="pl-c1">uint64_t</span> <span class="pl-en">Divide128Div64To64</span>(<span class="pl-c1">uint64_t</span> high, <span class="pl-c1">uint64_t</span> low,</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L3" class="blob-num js-line-number" data-line-number="3"></td>
        <td id="file-optimizing_64_bit-cpp-LC3" class="blob-code blob-code-inner js-file-line">                                   <span class="pl-c1">uint64_t</span> divisor, <span class="pl-c1">uint64_t</span>* remainder) {</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L4" class="blob-num js-line-number" data-line-number="4"></td>
        <td id="file-optimizing_64_bit-cpp-LC4" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">uint64_t</span> result;</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L5" class="blob-num js-line-number" data-line-number="5"></td>
        <td id="file-optimizing_64_bit-cpp-LC5" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">__asm__</span>(<span class="pl-s"><span class="pl-pds">&quot;</span>divq %[v]<span class="pl-pds">&quot;</span></span></td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L6" class="blob-num js-line-number" data-line-number="6"></td>
        <td id="file-optimizing_64_bit-cpp-LC6" class="blob-code blob-code-inner js-file-line">          : <span class="pl-s"><span class="pl-pds">&quot;</span>=a<span class="pl-pds">&quot;</span></span>(result), <span class="pl-s"><span class="pl-pds">&quot;</span>=d<span class="pl-pds">&quot;</span></span>(*<span class="pl-c1">remainder</span>) <span class="pl-c"><span class="pl-c">//</span> Output parametrs, =a for rax, =d for rdx, [v] is an</span></td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L7" class="blob-num js-line-number" data-line-number="7"></td>
        <td id="file-optimizing_64_bit-cpp-LC7" class="blob-code blob-code-inner js-file-line">                                           <span class="pl-c"><span class="pl-c">//</span> alias for divisor, input paramters &quot;a&quot; and &quot;d&quot; for low and high.</span></td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L8" class="blob-num js-line-number" data-line-number="8"></td>
        <td id="file-optimizing_64_bit-cpp-LC8" class="blob-code blob-code-inner js-file-line">          : [v] <span class="pl-s"><span class="pl-pds">&quot;</span>r<span class="pl-pds">&quot;</span></span>(divisor), <span class="pl-s"><span class="pl-pds">&quot;</span>a<span class="pl-pds">&quot;</span></span>(low), <span class="pl-s"><span class="pl-pds">&quot;</span>d<span class="pl-pds">&quot;</span></span>(high));</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L9" class="blob-num js-line-number" data-line-number="9"></td>
        <td id="file-optimizing_64_bit-cpp-LC9" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> result;</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L10" class="blob-num js-line-number" data-line-number="10"></td>
        <td id="file-optimizing_64_bit-cpp-LC10" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L11" class="blob-num js-line-number" data-line-number="11"></td>
        <td id="file-optimizing_64_bit-cpp-LC11" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">endif</span></td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L12" class="blob-num js-line-number" data-line-number="12"></td>
        <td id="file-optimizing_64_bit-cpp-LC12" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L13" class="blob-num js-line-number" data-line-number="13"></td>
        <td id="file-optimizing_64_bit-cpp-LC13" class="blob-code blob-code-inner js-file-line">tu_int <span class="pl-en">__udivmodti4</span>(tu_int dividend, tu_int divisor, tu_int* remainder) {</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L14" class="blob-num js-line-number" data-line-number="14"></td>
        <td id="file-optimizing_64_bit-cpp-LC14" class="blob-code blob-code-inner js-file-line">&#8230;</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L15" class="blob-num js-line-number" data-line-number="15"></td>
        <td id="file-optimizing_64_bit-cpp-LC15" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L16" class="blob-num js-line-number" data-line-number="16"></td>
        <td id="file-optimizing_64_bit-cpp-LC16" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">if</span> defined(__x86_64__)</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L17" class="blob-num js-line-number" data-line-number="17"></td>
        <td id="file-optimizing_64_bit-cpp-LC17" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">if</span> (divisor.<span class="pl-smi">high</span> == <span class="pl-c1">0</span> &amp;&amp; dividend.<span class="pl-smi">high</span> &lt; divisor) {</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L18" class="blob-num js-line-number" data-line-number="18"></td>
        <td id="file-optimizing_64_bit-cpp-LC18" class="blob-code blob-code-inner js-file-line">    <span class="pl-c1">remainder</span>-&gt;<span class="pl-smi">high</span> = <span class="pl-c1">0</span>;</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L19" class="blob-num js-line-number" data-line-number="19"></td>
        <td id="file-optimizing_64_bit-cpp-LC19" class="blob-code blob-code-inner js-file-line">    <span class="pl-c1">uint64_t</span> quotient =</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L20" class="blob-num js-line-number" data-line-number="20"></td>
        <td id="file-optimizing_64_bit-cpp-LC20" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">Divide128Div64To64</span>(dividend.<span class="pl-smi">high</span>, dividend.<span class="pl-smi">low</span>,</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L21" class="blob-num js-line-number" data-line-number="21"></td>
        <td id="file-optimizing_64_bit-cpp-LC21" class="blob-code blob-code-inner js-file-line">                           divisor.<span class="pl-smi">low</span>, &amp;<span class="pl-c1">remainder</span>-&gt;<span class="pl-smi">low</span>);</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L22" class="blob-num js-line-number" data-line-number="22"></td>
        <td id="file-optimizing_64_bit-cpp-LC22" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">return</span> quotient;</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L23" class="blob-num js-line-number" data-line-number="23"></td>
        <td id="file-optimizing_64_bit-cpp-LC23" class="blob-code blob-code-inner js-file-line">  }</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L24" class="blob-num js-line-number" data-line-number="24"></td>
        <td id="file-optimizing_64_bit-cpp-LC24" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">endif</span></td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L25" class="blob-num js-line-number" data-line-number="25"></td>
        <td id="file-optimizing_64_bit-cpp-LC25" class="blob-code blob-code-inner js-file-line">&#8230;</td>
      </tr>
      <tr>
        <td id="file-optimizing_64_bit-cpp-L26" class="blob-num js-line-number" data-line-number="26"></td>
        <td id="file-optimizing_64_bit-cpp-LC26" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
</table>


  </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/danlark1/fc203457bf604071e128f59b0afd6df3/raw/ef261b722950b812a37d00accdb7977ed54a590a/optimizing_64_bit.cpp" style="float:right">view raw</a>
        <a href="https://gist.github.com/danlark1/fc203457bf604071e128f59b0afd6df3#file-optimizing_64_bit-cpp">optimizing_64_bit.cpp</a>
        hosted with &#10084; by <a href="https://github.com">GitHub</a>
      </div>
    </div>
</div>

</div></figure>



<p>What to do if the high is not less than the divisor? The right answer is to use 2 divisions because</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" data-attachment-id="325" data-permalink="https://danlark.org/2020-06-11-135245_1608x410_scrot/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png" data-orig-size="1608,410" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="2020-06-11-135245_1608x410_scrot" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=1024" src="https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=1024" alt="" class="wp-image-325" width="780" height="198" srcset="https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=1024 1024w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=777 777w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=1553 1553w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=150 150w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=300 300w, https://danlarkorg.files.wordpress.com/2020/06/2020-06-11-135245_1608x410_scrot.png?w=768 768w" sizes="(max-width: 780px) 100vw, 780px" /></figure>



<p>So, first we can divide <code>hi</code> by <code>divisor</code> and then <code>{hi_r, lo}</code> by <code>divisor</code> guaranteeing that <code>hi_r</code> is smaller than <code>divisor</code> and thus the result is smaller than <img src="https://s0.wp.com/latex.php?latex=2%5E%7B64%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="2^{64}" class="latex" />. We will get something like</p>



<figure class="wp-block-embed is-type-rich"><div class="wp-block-embed__wrapper">
<style>.gist table { margin-bottom: 0; }</style><div style="tab-size: 8" id="gist103720611" class="gist">
    <div class="gist-file" translate="no">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container file-box">
  <div id="file-optimizing_all_64_bit-cpp" class="file my-2">
    
  <div itemprop="text" class="Box-body p-0 blob-wrapper data type-c  ">

      
<table class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L1" class="blob-num js-line-number" data-line-number="1"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC1" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">if</span> defined(__x86_64__)</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L2" class="blob-num js-line-number" data-line-number="2"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">inline</span> <span class="pl-c1">uint64_t</span> <span class="pl-en">Divide128Div64To64</span>(<span class="pl-c1">uint64_t</span> high, <span class="pl-c1">uint64_t</span> low,</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L3" class="blob-num js-line-number" data-line-number="3"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC3" class="blob-code blob-code-inner js-file-line">                                   <span class="pl-c1">uint64_t</span> divisor, <span class="pl-c1">uint64_t</span>* remainder) {</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L4" class="blob-num js-line-number" data-line-number="4"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC4" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">uint64_t</span> result;</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L5" class="blob-num js-line-number" data-line-number="5"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC5" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">__asm__</span>(<span class="pl-s"><span class="pl-pds">&quot;</span>divq %[v]<span class="pl-pds">&quot;</span></span></td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L6" class="blob-num js-line-number" data-line-number="6"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC6" class="blob-code blob-code-inner js-file-line">          : <span class="pl-s"><span class="pl-pds">&quot;</span>=a<span class="pl-pds">&quot;</span></span>(result), <span class="pl-s"><span class="pl-pds">&quot;</span>=d<span class="pl-pds">&quot;</span></span>(*<span class="pl-c1">remainder</span>) <span class="pl-c"><span class="pl-c">//</span> Ouput parametrs, =a for rax, =d for rdx, [v] is an</span></td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L7" class="blob-num js-line-number" data-line-number="7"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC7" class="blob-code blob-code-inner js-file-line">                                           <span class="pl-c"><span class="pl-c">//</span> alias for divisor, input paramters &quot;a&quot; and &quot;d&quot; for low and high.</span></td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L8" class="blob-num js-line-number" data-line-number="8"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC8" class="blob-code blob-code-inner js-file-line">          : [v] <span class="pl-s"><span class="pl-pds">&quot;</span>r<span class="pl-pds">&quot;</span></span>(divisor), <span class="pl-s"><span class="pl-pds">&quot;</span>a<span class="pl-pds">&quot;</span></span>(low), <span class="pl-s"><span class="pl-pds">&quot;</span>d<span class="pl-pds">&quot;</span></span>(high));</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L9" class="blob-num js-line-number" data-line-number="9"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC9" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> result;</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L10" class="blob-num js-line-number" data-line-number="10"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC10" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L11" class="blob-num js-line-number" data-line-number="11"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC11" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">endif</span></td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L12" class="blob-num js-line-number" data-line-number="12"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC12" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L13" class="blob-num js-line-number" data-line-number="13"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC13" class="blob-code blob-code-inner js-file-line">tu_int <span class="pl-en">__udivmodti4</span>(tu_int dividend, tu_int divisor, tu_int* remainder) {</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L14" class="blob-num js-line-number" data-line-number="14"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC14" class="blob-code blob-code-inner js-file-line">…</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L15" class="blob-num js-line-number" data-line-number="15"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC15" class="blob-code blob-code-inner js-file-line">
</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L16" class="blob-num js-line-number" data-line-number="16"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC16" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">if</span> defined(__x86_64__)</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L17" class="blob-num js-line-number" data-line-number="17"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC17" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">if</span> (divisor.<span class="pl-smi">high</span> == <span class="pl-c1">0</span>) {</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L18" class="blob-num js-line-number" data-line-number="18"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC18" class="blob-code blob-code-inner js-file-line">    <span class="pl-c1">remainder</span>-&gt;<span class="pl-smi">high</span> = <span class="pl-c1">0</span>;</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L19" class="blob-num js-line-number" data-line-number="19"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC19" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> (dividend.<span class="pl-smi">high</span> &lt; divisor) {</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L20" class="blob-num js-line-number" data-line-number="20"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC20" class="blob-code blob-code-inner js-file-line">      <span class="pl-c1">uint64_t</span> quotient =</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L21" class="blob-num js-line-number" data-line-number="21"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC21" class="blob-code blob-code-inner js-file-line">          <span class="pl-c1">Divide128Div64To64</span>(dividend.<span class="pl-smi">high</span>, dividend.<span class="pl-smi">low</span>,</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L22" class="blob-num js-line-number" data-line-number="22"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC22" class="blob-code blob-code-inner js-file-line">                             divisor.<span class="pl-smi">low</span>, &amp;<span class="pl-c1">remainder</span>-&gt;<span class="pl-smi">low</span>);</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L23" class="blob-num js-line-number" data-line-number="23"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC23" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> quotient;</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L24" class="blob-num js-line-number" data-line-number="24"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC24" class="blob-code blob-code-inner js-file-line">    } <span class="pl-k">else</span> {</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L25" class="blob-num js-line-number" data-line-number="25"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC25" class="blob-code blob-code-inner js-file-line">      tu_int quotient;</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L26" class="blob-num js-line-number" data-line-number="26"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC26" class="blob-code blob-code-inner js-file-line">      quotient.<span class="pl-smi">high</span> = <span class="pl-c1">Divide128Div64To64</span>(<span class="pl-c1">0</span>, dividend.<span class="pl-smi">high</span>, divisor.<span class="pl-smi">low</span>,</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L27" class="blob-num js-line-number" data-line-number="27"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC27" class="blob-code blob-code-inner js-file-line">                                         &amp;dividend.<span class="pl-smi">high</span>);</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L28" class="blob-num js-line-number" data-line-number="28"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC28" class="blob-code blob-code-inner js-file-line">      quotient.<span class="pl-smi">low</span> = <span class="pl-c1">Divide128Div64To64</span>(dividend.<span class="pl-smi">high</span>, dividend.<span class="pl-smi">low</span>,</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L29" class="blob-num js-line-number" data-line-number="29"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC29" class="blob-code blob-code-inner js-file-line">                                        divisor.<span class="pl-smi">low</span>, &amp;<span class="pl-c1">remainder</span>-&gt;<span class="pl-smi">low</span>);</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L30" class="blob-num js-line-number" data-line-number="30"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC30" class="blob-code blob-code-inner js-file-line">      <span class="pl-k">return</span> quotient;</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L31" class="blob-num js-line-number" data-line-number="31"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC31" class="blob-code blob-code-inner js-file-line">    }</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L32" class="blob-num js-line-number" data-line-number="32"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC32" class="blob-code blob-code-inner js-file-line">  }</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L33" class="blob-num js-line-number" data-line-number="33"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC33" class="blob-code blob-code-inner js-file-line">#<span class="pl-k">endif</span></td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L34" class="blob-num js-line-number" data-line-number="34"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC34" class="blob-code blob-code-inner js-file-line">&#8230;</td>
      </tr>
      <tr>
        <td id="file-optimizing_all_64_bit-cpp-L35" class="blob-num js-line-number" data-line-number="35"></td>
        <td id="file-optimizing_all_64_bit-cpp-LC35" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
</table>


  </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/danlark1/a6bc8805f9e4c7b2e1f08d2d3ba3cae3/raw/664dff9a824532c22d73bc62fe20cd5ef62b00cf/optimizing_all_64_bit.cpp" style="float:right">view raw</a>
        <a href="https://gist.github.com/danlark1/a6bc8805f9e4c7b2e1f08d2d3ba3cae3#file-optimizing_all_64_bit-cpp">optimizing_all_64_bit.cpp</a>
        hosted with &#10084; by <a href="https://github.com">GitHub</a>
      </div>
    </div>
</div>

</div></figure>



<p>  After that the benchmarks improved significantly</p>



<pre class="wp-block-code"><code>Benchmark                       Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor 11.9      11.9
BM_DivideClass128SmallDivisor   26.6      26.6</code></pre>



<p>Only 26.6ns for small divisors, that&#8217;s a clear 6x win.</p>



<p>Then there are multiple choices to do next but we know that both dividend and divisor have at least one bit in their high registers and the shift-subtract algorithm will have at most 64 iterations. Also the quotient is guaranteed to fit in 64 bits, thus we can use only the low register of the resulting quotient and save more shifts in the shift-subtract algorithm. That&#8217;s why the uniform divisor slightly improved.</p>



<p>One more optimization to do in shift-subtract algorithm is to remove the branch inside the for loop (read carefully, it should be understandable).</p>



<figure class="wp-block-embed is-type-rich"><div class="wp-block-embed__wrapper">
<style>.gist table { margin-bottom: 0; }</style><div style="tab-size: 8" id="gist103720794" class="gist">
    <div class="gist-file" translate="no">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container file-box">
  <div id="file-shift_subtract_branch_free-c" class="file my-2">
    
  <div itemprop="text" class="Box-body p-0 blob-wrapper data type-c  ">

      
<table class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L1" class="blob-num js-line-number" data-line-number="1"></td>
        <td id="file-shift_subtract_branch_free-c-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">//</span> dividend / divisor, remainder is stored in rem.</span></td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L2" class="blob-num js-line-number" data-line-number="2"></td>
        <td id="file-shift_subtract_branch_free-c-LC2" class="blob-code blob-code-inner js-file-line">uint128 <span class="pl-en">__udivmodti4</span>(uint128 dividend, uint128 divisor, uint128* rem) {</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L3" class="blob-num js-line-number" data-line-number="3"></td>
        <td id="file-shift_subtract_branch_free-c-LC3" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">if</span> (divisor &gt; dividend) {</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L4" class="blob-num js-line-number" data-line-number="4"></td>
        <td id="file-shift_subtract_branch_free-c-LC4" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> (rem)</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L5" class="blob-num js-line-number" data-line-number="5"></td>
        <td id="file-shift_subtract_branch_free-c-LC5" class="blob-code blob-code-inner js-file-line">      *rem = dividend;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L6" class="blob-num js-line-number" data-line-number="6"></td>
        <td id="file-shift_subtract_branch_free-c-LC6" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">return</span> <span class="pl-c1">0</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L7" class="blob-num js-line-number" data-line-number="7"></td>
        <td id="file-shift_subtract_branch_free-c-LC7" class="blob-code blob-code-inner js-file-line">  }</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L8" class="blob-num js-line-number" data-line-number="8"></td>
        <td id="file-shift_subtract_branch_free-c-LC8" class="blob-code blob-code-inner js-file-line">  <span class="pl-c"><span class="pl-c">//</span> 64 bit divisor implementation</span></td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L9" class="blob-num js-line-number" data-line-number="9"></td>
        <td id="file-shift_subtract_branch_free-c-LC9" class="blob-code blob-code-inner js-file-line">  &#8230;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L10" class="blob-num js-line-number" data-line-number="10"></td>
        <td id="file-shift_subtract_branch_free-c-LC10" class="blob-code blob-code-inner js-file-line">  <span class="pl-c"><span class="pl-c">//</span> end</span></td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L11" class="blob-num js-line-number" data-line-number="11"></td>
        <td id="file-shift_subtract_branch_free-c-LC11" class="blob-code blob-code-inner js-file-line">  <span class="pl-c"><span class="pl-c">//</span> Calculate the distance between most significant bits, 128 &gt; shift &gt;= 0.</span></td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L12" class="blob-num js-line-number" data-line-number="12"></td>
        <td id="file-shift_subtract_branch_free-c-LC12" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">int</span> shift = <span class="pl-c1">Distance</span>(dividend, divisor);</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L13" class="blob-num js-line-number" data-line-number="13"></td>
        <td id="file-shift_subtract_branch_free-c-LC13" class="blob-code blob-code-inner js-file-line">  divisor &lt;&lt;= shift;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L14" class="blob-num js-line-number" data-line-number="14"></td>
        <td id="file-shift_subtract_branch_free-c-LC14" class="blob-code blob-code-inner js-file-line">  quotient.<span class="pl-smi">low</span> = <span class="pl-c1">0</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L15" class="blob-num js-line-number" data-line-number="15"></td>
        <td id="file-shift_subtract_branch_free-c-LC15" class="blob-code blob-code-inner js-file-line">  quotient.<span class="pl-smi">high</span> = <span class="pl-c1">0</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L16" class="blob-num js-line-number" data-line-number="16"></td>
        <td id="file-shift_subtract_branch_free-c-LC16" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">for</span> (; shift &gt;= <span class="pl-c1">0</span>; &#8211;shift) {</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L17" class="blob-num js-line-number" data-line-number="17"></td>
        <td id="file-shift_subtract_branch_free-c-LC17" class="blob-code blob-code-inner js-file-line">    quotient.<span class="pl-smi">low</span> &lt;&lt;= <span class="pl-c1">1</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L18" class="blob-num js-line-number" data-line-number="18"></td>
        <td id="file-shift_subtract_branch_free-c-LC18" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">const</span> int128 s = (int128)(divisor.<span class="pl-smi">all</span> &#8211; dividend.<span class="pl-smi">all</span> &#8211; <span class="pl-c1">1</span>) &gt;&gt; <span class="pl-c1">127</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L19" class="blob-num js-line-number" data-line-number="19"></td>
        <td id="file-shift_subtract_branch_free-c-LC19" class="blob-code blob-code-inner js-file-line">    quotient.<span class="pl-smi">low</span> |= s &amp; <span class="pl-c1">1</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L20" class="blob-num js-line-number" data-line-number="20"></td>
        <td id="file-shift_subtract_branch_free-c-LC20" class="blob-code blob-code-inner js-file-line">    dividend.<span class="pl-smi">all</span> -= divisor.<span class="pl-smi">all</span> &amp; s;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L21" class="blob-num js-line-number" data-line-number="21"></td>
        <td id="file-shift_subtract_branch_free-c-LC21" class="blob-code blob-code-inner js-file-line">    divisor &gt;&gt;= <span class="pl-c1">1</span>;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L22" class="blob-num js-line-number" data-line-number="22"></td>
        <td id="file-shift_subtract_branch_free-c-LC22" class="blob-code blob-code-inner js-file-line">  }</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L23" class="blob-num js-line-number" data-line-number="23"></td>
        <td id="file-shift_subtract_branch_free-c-LC23" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">if</span> (rem)</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L24" class="blob-num js-line-number" data-line-number="24"></td>
        <td id="file-shift_subtract_branch_free-c-LC24" class="blob-code blob-code-inner js-file-line">    *rem = dividend;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L25" class="blob-num js-line-number" data-line-number="25"></td>
        <td id="file-shift_subtract_branch_free-c-LC25" class="blob-code blob-code-inner js-file-line">  <span class="pl-k">return</span> quotient;</td>
      </tr>
      <tr>
        <td id="file-shift_subtract_branch_free-c-L26" class="blob-num js-line-number" data-line-number="26"></td>
        <td id="file-shift_subtract_branch_free-c-LC26" class="blob-code blob-code-inner js-file-line">}</td>
      </tr>
</table>


  </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/danlark1/b89d67ad5e1989c0336c3ec921e5c67a/raw/d1a1ea5f4bfe8274d5b60554685056cd0191d686/shift_subtract_branch_free.c" style="float:right">view raw</a>
        <a href="https://gist.github.com/danlark1/b89d67ad5e1989c0336c3ec921e5c67a#file-shift_subtract_branch_free-c">shift_subtract_branch_free.c</a>
        hosted with &#10084; by <a href="https://github.com">GitHub</a>
      </div>
    </div>
</div>

</div></figure>



<p>In the end, it gives 0.4ns more for uniform 128 bit divisor.</p>



<p>And finally I believe that&#8217;s one of the best algorithm to divide 128 bit by 128 bit numbers. From statistics, the case when the divisor is 64 bit is worth optimizing and we showed that additional checks on the high register of divisor has its own advantages and expansion of the invariants. Now let&#8217;s see what other libraries perform in that case.</p>



<h2>LibDivide</h2>



<p><a href="https://github.com/ridiculousfish/libdivide">Libdivide</a> is a small library targeting fast division, for example, if you divide by some fixed number a lot of times, there are techniques that can precalculate reciprocal and then multiply by it. Libdivide provides a very good interface for such optimizations. Even though, it has some optimizations regarding 128 bit division. For example, function <code>libdivide_128_div_128_to_64</code> computes the division 128 bit number by 128 bit number if the result fits in 64 bits. In the case where both numbers are more or equal to <img src="https://s0.wp.com/latex.php?latex=2%5E%7B64%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="2^{64}" class="latex" /> it does the following algorithm that they took from <a href="https://www.amazon.de/Hackers-Delight-Henry-S-Warren/dp/0321842685/ref=sr_1_1?dchild=1&amp;keywords=Hackers+Delight&amp;qid=1592164234&amp;sr=8-1">Hackers Delight</a> book:</p>



<p><img src="https://s0.wp.com/latex.php?latex=%5Cbegin%7Bcases%7D+n+%3D+MSB%28%5Cmathrm%7Bdivisor%7D%29+%5Cgeq+1+%5C%5C+%5Cmathrm%7Bdivisor_1%7D+%3D+%5Clfloor+%5Cmathrm%7Bdivisor%7D%2F2%5E%7B64+-+n%7D+%5Crfloor+%5C%5C+%5Cmathrm%7Bdividend_1%7D+%3D+%5Clfloor+%5Cmathrm%7Bdividend%7D%2F2+%5Crfloor+%5Cend%7Bcases%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;begin{cases} n = MSB(&#92;mathrm{divisor}) &#92;geq 1 &#92;&#92; &#92;mathrm{divisor_1} = &#92;lfloor &#92;mathrm{divisor}/2^{64 - n} &#92;rfloor &#92;&#92; &#92;mathrm{dividend_1} = &#92;lfloor &#92;mathrm{dividend}/2 &#92;rfloor &#92;end{cases}" class="latex" /></p>



<p>With the instruction that produces the 64 bit result when the divisor is 128 bit result we can compute</p>



<p><img src="https://s0.wp.com/latex.php?latex=%5Cmathrm%7Bquotient_1%7D+%3D+%5Clfloor+%5Cmathrm%7Bdividend_1%7D%2F%5Cmathrm%7Bdivisor_1%7D+%5Crfloor&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;mathrm{quotient_1} = &#92;lfloor &#92;mathrm{dividend_1}/&#92;mathrm{divisor_1} &#92;rfloor" class="latex" /></p>



<p>Then we compute</p>



<p><img src="https://s0.wp.com/latex.php?latex=%5Cmathrm%7Bquotient_0%7D+%3D+%5Clfloor+%5Cmathrm%7Bquotient_1%7D%2F2%5E%7B63+-+n%7D+%5Crfloor&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;mathrm{quotient_0} = &#92;lfloor &#92;mathrm{quotient_1}/2^{63 - n} &#92;rfloor" class="latex" />.</p>



<p>It cannot overflow because <img src="https://s0.wp.com/latex.php?latex=%5Cmathrm%7Bquotient_1%7D+%3C+2%5E%7B64%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;mathrm{quotient_1} &lt; 2^{64}" class="latex" /> because the maximum value of <img src="https://s0.wp.com/latex.php?latex=%5Cmathrm%7Bdividend_1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;mathrm{dividend_1}" class="latex" /> is <img src="https://s0.wp.com/latex.php?latex=2%5E%7B127%7D+-+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="2^{127} - 1" class="latex" /> and minimum value of <img src="https://s0.wp.com/latex.php?latex=%5Cmathrm%7Bdivisor_1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;mathrm{divisor_1}" class="latex" /> is <img src="https://s0.wp.com/latex.php?latex=2%5E%7B63%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="2^{63}" class="latex" />. Now let&#8217;s show that</p>



<p><img src="https://s0.wp.com/latex.php?latex=%5Clfloor+%5Cmathrm%7Bdividend%7D%2F%5Cmathrm%7Bdivisor%7D+%5Crfloor+%5Cleq+%5Cmathrm%7Bquotient_0%7D+%5Cleq++%5Clfloor+%5Cmathrm%7Bdividend%7D%2F%5Cmathrm%7Bdivisor%7D+%5Crfloor+%2B+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;lfloor &#92;mathrm{dividend}/&#92;mathrm{divisor} &#92;rfloor &#92;leq &#92;mathrm{quotient_0} &#92;leq  &#92;lfloor &#92;mathrm{dividend}/&#92;mathrm{divisor} &#92;rfloor + 1" class="latex" /></p>



<p><img src="https://s0.wp.com/latex.php?latex=%5Cmathrm%7Bquotient_0%7D+%3D+%5Cleft%5Clfloor+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%7D%7B2%5E%7B64+-+n%7D%5Cmathrm%7Bdivisor_1%7D%7D+%5Cright%5Crfloor+%3D+%5Cleft%5Clfloor+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%7D%7B2%5E%7B64+-+n%7D%5Cleft%5Clfloor%5Cfrac%7B%5Cmathrm%7Bdivisor%7D%7D%7B2%5E%7B64+-+n%7D%7D%5Cright%5Crfloor%7D+%5Cright%5Crfloor+%3D+%5Cleft%5Clfloor+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%7D%7B%5Cmathrm%7Bdivisor%7D+-+%28%5Cmathrm%7Bdivisor%7D+%5Cmathrm%7B%5C+mod%5C+%7D+2%5E%7B64+-+n%7D%29%7D+%5Cright%5Crfloor+%3D+%5Cleft%5Clfloor+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%7D%7B%5Cmathrm%7Bdivisor%7D%7D+%2B+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%28%5Cmathrm%7Bdivisor%7D+%5Cmathrm%7B%5C+mod%5C+%7D+2%5E%7B64+-+n%7D%29%29%7D%7B%5Cmathrm%7Bdivisor%7D%28%5Cmathrm%7Bdivisor%7D+-+%28%5Cmathrm%7Bdivisor%7D+%5Cmathrm%7B%5C+mod%5C+%7D+2%5E%7B64+-+n%7D%29%29%7D+%5Cright%5Crfloor+%3D+%5Cleft%5Clfloor+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%7D%7B%5Cmathrm%7Bdivisor%7D%7D+%2B+%5Cdelta+%5Cright%5Crfloor&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;mathrm{quotient_0} = &#92;left&#92;lfloor &#92;frac{&#92;mathrm{dividend}}{2^{64 - n}&#92;mathrm{divisor_1}} &#92;right&#92;rfloor = &#92;left&#92;lfloor &#92;frac{&#92;mathrm{dividend}}{2^{64 - n}&#92;left&#92;lfloor&#92;frac{&#92;mathrm{divisor}}{2^{64 - n}}&#92;right&#92;rfloor} &#92;right&#92;rfloor = &#92;left&#92;lfloor &#92;frac{&#92;mathrm{dividend}}{&#92;mathrm{divisor} - (&#92;mathrm{divisor} &#92;mathrm{&#92; mod&#92; } 2^{64 - n})} &#92;right&#92;rfloor = &#92;left&#92;lfloor &#92;frac{&#92;mathrm{dividend}}{&#92;mathrm{divisor}} + &#92;frac{&#92;mathrm{dividend}(&#92;mathrm{divisor} &#92;mathrm{&#92; mod&#92; } 2^{64 - n}))}{&#92;mathrm{divisor}(&#92;mathrm{divisor} - (&#92;mathrm{divisor} &#92;mathrm{&#92; mod&#92; } 2^{64 - n}))} &#92;right&#92;rfloor = &#92;left&#92;lfloor &#92;frac{&#92;mathrm{dividend}}{&#92;mathrm{divisor}} + &#92;delta &#92;right&#92;rfloor" class="latex" />.</p>



<p>Now we want to show that <img src="https://s0.wp.com/latex.php?latex=%5Cdelta+%3C+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;delta &lt; 1" class="latex" />. <img src="https://s0.wp.com/latex.php?latex=%5Cdelta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;delta" class="latex" /> is the largest when the remainder in the numerator is as large as possible, it can be up to <img src="https://s0.wp.com/latex.php?latex=2%5E%7B64+-+n%7D+-+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="2^{64 - n} - 1" class="latex" />. Because of the definition of <img src="https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="n" class="latex" />, <img src="https://s0.wp.com/latex.php?latex=%5Cmathrm%7Bdivisor%7D+%5Cgeq+2%5E%7B127+-+n%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;mathrm{divisor} &#92;geq 2^{127 - n}" class="latex" />. The smallest value of <img src="https://s0.wp.com/latex.php?latex=%5Cmathrm%7Bdivisor%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;mathrm{divisor}" class="latex" /> in the denominator is <img src="https://s0.wp.com/latex.php?latex=2%5E%7B127+-+n%7D+%2B+2%5E%7B64+-+n%7D+-+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="2^{127 - n} + 2^{64 - n} - 1" class="latex" />. That&#8217;s why</p>



<p><img src="https://s0.wp.com/latex.php?latex=%5Cdelta+%5Cleq+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%282%5E%7B64+-+n%7D+-+1%29%7D%7B%282%5E%7B127+-+n%7D+%2B+2%5E%7B64+-+n%7D+-+1%292%5E%7B127+-+n+%7D%7D+%3C+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%282%5E%7B64+-+n%7D+-+1%29%7D%7B%282%5E%7B127+-+n+%7D%29%5E2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;delta &#92;leq &#92;frac{&#92;mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n} + 2^{64 - n} - 1)2^{127 - n }} &lt; &#92;frac{&#92;mathrm{dividend}(2^{64 - n} - 1)}{(2^{127 - n })^2}" class="latex" />. As n iterates from 0 to 63, we can conclude that <img src="https://s0.wp.com/latex.php?latex=%5Cdelta+%3C+%5Cfrac%7B%5Cmathrm%7Bdividend%7D%7D%7B2%5E%7B128%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;delta &lt; &#92;frac{&#92;mathrm{dividend}}{2^{128}}" class="latex" />. So we got either the correct value, either the correct plus one. Everything else in the algorithms is just a correction of which result to choose.</p>



<p>Unfortunately, these corrections increase the latency of the benchmark pretty significant</p>



<pre class="wp-block-code"><code>Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor&lt;LibDivideDivision&gt;    26.3    26.3  
BM_RemainderClass128UniformDivisor&lt;LibDivideDivision&gt; 26.2    26.2
BM_DivideClass128SmallDivisor&lt;LibDivideDivision&gt;      25.8    25.8
BM_RemainderClass128SmallDivisor&lt;LibDivideDivision&gt;   26.3    26.3</code></pre>



<p>So I decided to drop this idea after I&#8217;ve tried this.</p>



<h2>GMP</h2>



<p><a href="https://gmplib.org/">GMP</a> library is a standard GNU library for long arithmetic. They also have something for 128 bit by 64 bit division and in my benchmark the following code worked</p>



<figure class="wp-block-embed is-type-rich"><div class="wp-block-embed__wrapper">
<style>.gist table { margin-bottom: 0; }</style><div style="tab-size: 8" id="gist103721337" class="gist">
    <div class="gist-file" translate="no">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container file-box">
  <div id="file-gmp_div-h" class="file my-2">
    
  <div itemprop="text" class="Box-body p-0 blob-wrapper data type-c  ">

      
<table class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip>
      <tr>
        <td id="file-gmp_div-h-L1" class="blob-num js-line-number" data-line-number="1"></td>
        <td id="file-gmp_div-h-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">struct</span> GmpDiv {</td>
      </tr>
      <tr>
        <td id="file-gmp_div-h-L2" class="blob-num js-line-number" data-line-number="2"></td>
        <td id="file-gmp_div-h-LC2" class="blob-code blob-code-inner js-file-line">  <span class="pl-c1">uint64_t</span> <span class="pl-smi">operator</span>()(<span class="pl-c1">uint64_t</span> u1, <span class="pl-c1">uint64_t</span> u0, <span class="pl-c1">uint64_t</span> v, du_int* r) <span class="pl-k">const</span> {</td>
      </tr>
      <tr>
        <td id="file-gmp_div-h-L3" class="blob-num js-line-number" data-line-number="3"></td>
        <td id="file-gmp_div-h-LC3" class="blob-code blob-code-inner js-file-line">    <span class="pl-c1">mp_limb_t</span> q[<span class="pl-c1">2</span>] = {u0, u1};</td>
      </tr>
      <tr>
        <td id="file-gmp_div-h-L4" class="blob-num js-line-number" data-line-number="4"></td>
        <td id="file-gmp_div-h-LC4" class="blob-code blob-code-inner js-file-line">    <span class="pl-c1">mp_limb_t</span> result[<span class="pl-c1">2</span>] = {<span class="pl-c1">0</span>, <span class="pl-c1">0</span>};</td>
      </tr>
      <tr>
        <td id="file-gmp_div-h-L5" class="blob-num js-line-number" data-line-number="5"></td>
        <td id="file-gmp_div-h-LC5" class="blob-code blob-code-inner js-file-line">    *r = <span class="pl-c1">mpn_divrem_1</span>(result, <span class="pl-c1">0</span>, q, <span class="pl-c1">2</span>, v);</td>
      </tr>
      <tr>
        <td id="file-gmp_div-h-L6" class="blob-num js-line-number" data-line-number="6"></td>
        <td id="file-gmp_div-h-LC6" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">return</span> result[<span class="pl-c1">0</span>];</td>
      </tr>
      <tr>
        <td id="file-gmp_div-h-L7" class="blob-num js-line-number" data-line-number="7"></td>
        <td id="file-gmp_div-h-LC7" class="blob-code blob-code-inner js-file-line">  }</td>
      </tr>
      <tr>
        <td id="file-gmp_div-h-L8" class="blob-num js-line-number" data-line-number="8"></td>
        <td id="file-gmp_div-h-LC8" class="blob-code blob-code-inner js-file-line">};</td>
      </tr>
</table>


  </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/danlark1/12fd52c74e131f90d708868ee5092562/raw/0939babc13b5d644c0d04915d3001eee97e211f6/gmp_div.h" style="float:right">view raw</a>
        <a href="https://gist.github.com/danlark1/12fd52c74e131f90d708868ee5092562#file-gmp_div-h">gmp_div.h</a>
        hosted with &#10084; by <a href="https://github.com">GitHub</a>
      </div>
    </div>
</div>

</div></figure>



<p>It divides the two limbs by a <code>uint64_t</code> and provides the result. Unfortunately, the latency is much higher than expected, also does not work</p>



<pre class="wp-block-code"><code>Benchmark                                          Time(ns)  CPU(ns)
BM_DivideClass128UniformDivisor&lt;GmpDivision&gt;          11.5    11.5
BM_RemainderClass128UniformDivisor&lt;GmpDivision&gt;       10.7    10.7
BM_DivideClass128SmallDivisor&lt;GmpDivision&gt;            47.5    47.5
BM_RemainderClass128SmallDivisor&lt;GmpDivision&gt;         47.8    47.8 </code></pre>



<h2>Conclusion</h2>



<p>In the end I&#8217;ve tried several method of 128 bit division and came up with something that is the fastest among popular alternatives. <a href="https://gist.github.com/danlark1/92334a0b7f185588b6cd8ebd0438ac76">Here</a> is the full code for benchmarks, though it is quite hard to install, maybe later I will provide an easy version of installation. The final benchmarks are</p>



<pre class="wp-block-code"><code>Benchmark                                           Time(ns)  CPU(ns)
---------------------------------------------------------------------
BM_DivideClass128UniformDivisor&lt;absl::uint128&gt;        13.7      13.7
BM_RemainderClass128UniformDivisor&lt;absl::uint128&gt;     14.0      14.0  
BM_DivideClass128SmallDivisor&lt;absl::uint128&gt;          169       169    
BM_RemainderClass128SmallDivisor&lt;absl::uint128&gt;       153       153    
BM_DivideClass128UniformDivisor&lt;LLVMDivision&gt;         12.6      12.6  
BM_RemainderClass128UniformDivisor&lt;LLVMDivision&gt;      12.3      12.3  
BM_DivideClass128SmallDivisor&lt;LLVMDivision&gt;           145        145    
BM_RemainderClass128SmallDivisor&lt;LLVMDivision&gt;        140        140    
**BM_DivideClass128UniformDivisor&lt;MyDivision1&gt;          11.6      11.6**
**BM_RemainderClass128UniformDivisor&lt;MyDivision1&gt;       10.7      10.7**
**BM_DivideClass128SmallDivisor&lt;MyDivision1&gt;            25.5      25.5**
**BM_RemainderClass128SmallDivisor&lt;MyDivision1&gt;         26.2      26.2**
BM_DivideClass128UniformDivisor&lt;MyDivision2&gt;          12.7      12.7  
BM_RemainderClass128UniformDivisor&lt;MyDivision2&gt;       12.8      12.8  
BM_DivideClass128SmallDivisor&lt;MyDivision2&gt;            36.9      36.9  
BM_RemainderClass128SmallDivisor&lt;MyDivision2&gt;         37.0      37.1  
BM_DivideClass128UniformDivisor&lt;GmpDivision&gt;          11.5      11.5  
BM_RemainderClass128UniformDivisor&lt;GmpDivision&gt;       10.7      10.7  
BM_DivideClass128SmallDivisor&lt;GmpDivision&gt;            47.5      47.5  
BM_RemainderClass128SmallDivisor&lt;GmpDivision&gt;         47.8      47.8  
BM_DivideClass128UniformDivisor&lt;LibDivideDivision&gt;    26.3      26.3  
BM_RemainderClass128UniformDivisor&lt;LibDivideDivision&gt; 26.2      26.2  
BM_DivideClass128SmallDivisor&lt;LibDivideDivision&gt;      25.8      25.8  
BM_RemainderClass128SmallDivisor&lt;LibDivideDivision&gt;   26.3      26.3</code></pre>



<p>MyDivision1 is going to be <a href="https://reviews.llvm.org/D81809">upstreamed</a> in LLVM, MyDivision2 will be the default version for all non x86-64 platforms which also has a solid latency, much better than the previous one.</p>



<h2>Future Work</h2>



<p>However, benchmarks are biased in the uniform divisor case because the distance between most significant bits in the dividend and divisor falls exponentially and starting from 10-15, the benchmark becomes worse rather than libdivide approach.</p>



<p>I also prepared some recent research <a href="https://gmplib.org/~tege/division-paper.pdf">paper</a> patch in <a href="https://reviews.llvm.org/D83547">https://reviews.llvm.org/D83547</a> where the reciprocal is computed beforehand and then only some multiplication happens. Yet, with the cold cache of the 512 byte lookup table it is worse than the already submitted approach. I also tried just division by <img src="https://s0.wp.com/latex.php?latex=d_9&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="d_9" class="latex" /> in the screenshot below and it showed some inconsistent results which I don&#8217;t understand for now.</p>



<figure class="wp-block-image size-large"><img data-attachment-id="369" data-permalink="https://danlark.org/photo_2020-07-19_12-02-41/" data-orig-file="https://danlarkorg.files.wordpress.com/2020/07/photo_2020-07-19_12-02-41.jpg" data-orig-size="633,412" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="photo_2020-07-19_12-02-41" data-image-description="" data-image-caption="" data-medium-file="https://danlarkorg.files.wordpress.com/2020/07/photo_2020-07-19_12-02-41.jpg?w=300" data-large-file="https://danlarkorg.files.wordpress.com/2020/07/photo_2020-07-19_12-02-41.jpg?w=633" src="https://danlarkorg.files.wordpress.com/2020/07/photo_2020-07-19_12-02-41.jpg?w=633" alt="" class="wp-image-369" srcset="https://danlarkorg.files.wordpress.com/2020/07/photo_2020-07-19_12-02-41.jpg 633w, https://danlarkorg.files.wordpress.com/2020/07/photo_2020-07-19_12-02-41.jpg?w=150 150w, https://danlarkorg.files.wordpress.com/2020/07/photo_2020-07-19_12-02-41.jpg?w=300 300w" sizes="(max-width: 633px) 100vw, 633px" /></figure>



<p>Also, subscribe to my Twitter <a href="https://twitter.com/Danlark1">https://twitter.com/Danlark1</a> 🙂</p>
]]></html><thumbnail_url><![CDATA[https://danlarkorg.files.wordpress.com/2020/06/d2elnjfdkne.png?fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[440]]></thumbnail_width><thumbnail_height><![CDATA[220]]></thumbnail_height></oembed>