<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[A small note on SIMD matrix-vector&nbsp;multiplication]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>Suppose we want to calculate a product between a 4&#215;4 matrix M and a 4-element vector v:</p>
<p><img src="https://s0.wp.com/latex.php?latex=Mv+%3D+%5Cbegin%7Bpmatrix%7Da_x+%26+b_x+%26+c_x+%26+d_x+%5C%5C+a_y+%26+b_y+%26+c_y+%26+d_y+%5C%5C+a_z+%26+b_z+%26+c_z+%26+d_z+%5C%5C+a_w+%26+b_w+%26+c_w+%26+d_w%5Cend%7Bpmatrix%7D+%5Cbegin%7Bpmatrix%7Dv_x+%5C%5C+v_y+%5C%5C+v_z+%5C%5C+v_w%5Cend%7Bpmatrix%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=Mv+%3D+%5Cbegin%7Bpmatrix%7Da_x+%26+b_x+%26+c_x+%26+d_x+%5C%5C+a_y+%26+b_y+%26+c_y+%26+d_y+%5C%5C+a_z+%26+b_z+%26+c_z+%26+d_z+%5C%5C+a_w+%26+b_w+%26+c_w+%26+d_w%5Cend%7Bpmatrix%7D+%5Cbegin%7Bpmatrix%7Dv_x+%5C%5C+v_y+%5C%5C+v_z+%5C%5C+v_w%5Cend%7Bpmatrix%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=Mv+%3D+%5Cbegin%7Bpmatrix%7Da_x+%26+b_x+%26+c_x+%26+d_x+%5C%5C+a_y+%26+b_y+%26+c_y+%26+d_y+%5C%5C+a_z+%26+b_z+%26+c_z+%26+d_z+%5C%5C+a_w+%26+b_w+%26+c_w+%26+d_w%5Cend%7Bpmatrix%7D+%5Cbegin%7Bpmatrix%7Dv_x+%5C%5C+v_y+%5C%5C+v_z+%5C%5C+v_w%5Cend%7Bpmatrix%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="Mv = &#92;begin{pmatrix}a_x &amp; b_x &amp; c_x &amp; d_x &#92;&#92; a_y &amp; b_y &amp; c_y &amp; d_y &#92;&#92; a_z &amp; b_z &amp; c_z &amp; d_z &#92;&#92; a_w &amp; b_w &amp; c_w &amp; d_w&#92;end{pmatrix} &#92;begin{pmatrix}v_x &#92;&#92; v_y &#92;&#92; v_z &#92;&#92; v_w&#92;end{pmatrix}" class="latex" /></p>
<p>The standard approach to computing Mv using SIMD instructions boils down to taking a linear combination of the four column vectors a, b, c and d, using standard SIMD componentwise addition, multiplication and broadcast shuffles.</p>
<pre>
  // Given M as its four constituent column vectors a, b, c, d,
  // compute r=M*v.
  r = v.xxxx*a + v.yyyy*b + v.zzzz*c + v.wwww*d;
</pre>
<p>This computes the matrix-vector product using four shuffles, four (SIMD) multiplies, and three additions. This is all bog-standard. And if the ISA we&#8217;re working on has free broadcast swizzles (ARM NEON for example), we&#8217;re done. But if not, can we do better? Certainly if we know things about M or v: if M has a special structure, or some components of v are known to be always 0, 1 or -1, chances are good we can save a bit of work (whether it makes a difference is another matter). But what if M and v are completely general, and all we know is that we want to transform a lot of vectors with a single M? If v is either given as or returned in SoA form (structure-of-arrays), we can reduce the number of per-vector shuffles greatly if we&#8217;re willing to preprocess M a bit and have enough registers available. But let&#8217;s say we&#8217;re not doing that either: our input v is in packed form, and we want the results packed too. Is there anything we can do?</p>
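<p>For reference, here&#8217;s the standard form written out in plain C. The 4-lane <code>vec4</code> type and the helpers <code>vadd</code>, <code>vmul</code> and <code>vbroadcast</code> are stand-ins I made up to model the corresponding SIMD operations; this is a sketch of the data flow, not any particular ISA&#8217;s intrinsics.</p>

```c
#include <assert.h>

/* Hypothetical 4-lane value standing in for a SIMD register. */
typedef struct { float f[4]; } vec4;

/* Component-wise add, modeling a SIMD addition. */
static vec4 vadd(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* Component-wise multiply, modeling a SIMD multiplication. */
static vec4 vmul(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* Models a broadcast shuffle such as v.xxxx (lane = 0). */
static vec4 vbroadcast(vec4 v, int lane) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = v.f[lane];
    return r;
}

/* r = v.xxxx*a + v.yyyy*b + v.zzzz*c + v.wwww*d:
   four broadcasts, four multiplies, three adds. */
static vec4 mat_vec_columns(vec4 a, vec4 b, vec4 c, vec4 d, vec4 v) {
    vec4 r = vmul(vbroadcast(v, 0), a);
    r = vadd(r, vmul(vbroadcast(v, 1), b));
    r = vadd(r, vmul(vbroadcast(v, 2), c));
    r = vadd(r, vmul(vbroadcast(v, 3), d));
    return r;
}
```

<p>Each <code>vbroadcast</code> is the per-vector shuffle we&#8217;d like to get rid of on ISAs where broadcasts aren&#8217;t free.</p>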
<p>There&#8217;s no way to reduce the number of multiplies or additions in general, but we can get rid of exactly one shuffle per vector, if we&#8217;re willing to rearrange M a bit. The trick is to realize that we&#8217;re using each of v.x, v.y, v.z, and v.w exactly four times, and that the computations we&#8217;re doing (a bunch of component-wise multiplies and additions) are commutative and associative, so we can reorder them, in exact arithmetic anyway. (This type of computation is usually done in floating point, where we don&#8217;t actually have associativity, but I&#8217;m going to gloss over this.)</p>
<p>Let&#8217;s look at our first set of products, <code>v.xxxx * a</code>. We&#8217;re just walking down a column of M, multiplying each element we see by v<sub>x</sub>. What if we walk in a different direction? Going along horizontals turns out to be boring (it&#8217;s essentially the same, just transposed), but diagonals of M are interesting, the main diagonal in particular.</p>
<p>So here&#8217;s the punch line: we form four new vectors by walking along diagonals (with wrap-around) as follows:</p>
<p><img src="https://s0.wp.com/latex.php?latex=e+%3D+%5Cbegin%7Bpmatrix%7D+a_x+%5C%5C+b_y+%5C%5C+c_z+%5C%5C+d_w+%5Cend%7Bpmatrix%7D+%5Cquad++f+%3D+%5Cbegin%7Bpmatrix%7D+b_x+%5C%5C+c_y+%5C%5C+d_z+%5C%5C+a_w+%5Cend%7Bpmatrix%7D+%5Cquad++g+%3D+%5Cbegin%7Bpmatrix%7D+c_x+%5C%5C+d_y+%5C%5C+a_z+%5C%5C+b_w+%5Cend%7Bpmatrix%7D+%5Cquad++h+%3D+%5Cbegin%7Bpmatrix%7D+d_x+%5C%5C+a_y+%5C%5C+b_z+%5C%5C+c_w+%5Cend%7Bpmatrix%7D++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" srcset="https://s0.wp.com/latex.php?latex=e+%3D+%5Cbegin%7Bpmatrix%7D+a_x+%5C%5C+b_y+%5C%5C+c_z+%5C%5C+d_w+%5Cend%7Bpmatrix%7D+%5Cquad++f+%3D+%5Cbegin%7Bpmatrix%7D+b_x+%5C%5C+c_y+%5C%5C+d_z+%5C%5C+a_w+%5Cend%7Bpmatrix%7D+%5Cquad++g+%3D+%5Cbegin%7Bpmatrix%7D+c_x+%5C%5C+d_y+%5C%5C+a_z+%5C%5C+b_w+%5Cend%7Bpmatrix%7D+%5Cquad++h+%3D+%5Cbegin%7Bpmatrix%7D+d_x+%5C%5C+a_y+%5C%5C+b_z+%5C%5C+c_w+%5Cend%7Bpmatrix%7D++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002 1x, https://s0.wp.com/latex.php?latex=e+%3D+%5Cbegin%7Bpmatrix%7D+a_x+%5C%5C+b_y+%5C%5C+c_z+%5C%5C+d_w+%5Cend%7Bpmatrix%7D+%5Cquad++f+%3D+%5Cbegin%7Bpmatrix%7D+b_x+%5C%5C+c_y+%5C%5C+d_z+%5C%5C+a_w+%5Cend%7Bpmatrix%7D+%5Cquad++g+%3D+%5Cbegin%7Bpmatrix%7D+c_x+%5C%5C+d_y+%5C%5C+a_z+%5C%5C+b_w+%5Cend%7Bpmatrix%7D+%5Cquad++h+%3D+%5Cbegin%7Bpmatrix%7D+d_x+%5C%5C+a_y+%5C%5C+b_z+%5C%5C+c_w+%5Cend%7Bpmatrix%7D++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002&#038;zoom=4.5 4x" alt="e = &#92;begin{pmatrix} a_x &#92;&#92; b_y &#92;&#92; c_z &#92;&#92; d_w &#92;end{pmatrix} &#92;quad  f = &#92;begin{pmatrix} b_x &#92;&#92; c_y &#92;&#92; d_z &#92;&#92; a_w &#92;end{pmatrix} &#92;quad  g = &#92;begin{pmatrix} c_x &#92;&#92; d_y &#92;&#92; a_z &#92;&#92; b_w &#92;end{pmatrix} &#92;quad  h = &#92;begin{pmatrix} d_x &#92;&#92; a_y &#92;&#92; b_z &#92;&#92; c_w &#92;end{pmatrix}  " class="latex" /></p>
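<p>In code, the wrap-around diagonal walk is just a rotated column index per row: diagonal k reads column (i+k) mod 4 at row i. A minimal sketch in plain C, assuming M is given as four column arrays (the helper name <code>make_diagonals</code> is made up for illustration):</p>

```c
#include <assert.h>

/* Build the four wrap-around diagonals of a 4x4 matrix from its
   columns a, b, c, d (cols[0..3], each a 4-element array):
   diag[0] = e = (a_x, b_y, c_z, d_w)
   diag[1] = f = (b_x, c_y, d_z, a_w)
   diag[2] = g = (c_x, d_y, a_z, b_w)
   diag[3] = h = (d_x, a_y, b_z, c_w) */
static void make_diagonals(const float *cols[4], float diag[4][4]) {
    for (int k = 0; k < 4; k++)
        for (int i = 0; i < 4; i++)
            diag[k][i] = cols[(i + k) % 4][i];
}
```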
<p>Phrasing the matrix multiply in terms of these four vectors, we get:</p>
<pre>
  r = v*e + v.yzwx*f + v.zwxy*g + v.wxyz*h;
</pre>
<p>Same number of multiplies and adds, but one fewer shuffle per vector (because the swizzle pattern for v in the first term is <code>xyzw</code>, the natural ordering of v). Note that forming e, f, g, and h given M in column-vector form is also relatively cheap: it&#8217;s a matrix transposition with a few post-swizzles to implement the cyclic rotations. If you have M as row vectors (for example because it&#8217;s stored in row-major order), it&#8217;s even cheaper.</p>
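<p>To sanity-check the bookkeeping, here&#8217;s the diagonal form written out in scalar C, with the cyclic swizzles <code>v.yzwx</code>, <code>v.zwxy</code>, <code>v.wxyz</code> expressed as rotated lane indices (again an illustrative sketch, not intrinsics):</p>

```c
#include <assert.h>

/* r = v*e + v.yzwx*f + v.zwxy*g + v.wxyz*h, written out in scalar form.
   v.yzwx is v rotated left by one lane, so term k uses v[(i+k) % 4];
   only terms k = 1..3 need a shuffle, since k = 0 uses v as-is. */
static void mat_vec_diagonals(const float e[4], const float f[4],
                              const float g[4], const float h[4],
                              const float v[4], float r[4]) {
    const float *diag[4] = { e, f, g, h };
    for (int i = 0; i < 4; i++) {
        r[i] = 0.0f;
        for (int k = 0; k < 4; k++)
            r[i] += v[(i + k) % 4] * diag[k][i];
    }
}
```

<p>Expanding lane 0 recovers the first row of M dotted with v: r<sub>x</sub> = v<sub>x</sub>a<sub>x</sub> + v<sub>y</sub>b<sub>x</sub> + v<sub>z</sub>c<sub>x</sub> + v<sub>w</sub>d<sub>x</sub>, and likewise for the other lanes.</p>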
<p>So: multiplying a packed 4-vector by a constant 4&#215;4 matrix takes one fewer shuffle than the standard approach, if we&#8217;re willing to do some preprocessing on M (or store our matrices in a weird layout to begin with). Does this matter? It depends. On current desktop x86 cores, it&#8217;s pretty marginal, because SIMD shuffles can execute in parallel (during the same cycle) with additions and multiplications. On older cores with fewer execution resources, on in-order SIMD CPUs, and on low-power parts, it can definitely help though.</p>
<p>For what it&#8217;s worth: if your 4D vectors come from graphics or physics workloads and are actually homogeneous 3-vectors with a constant w=1 and no projective transforms anywhere in sight, you can exploit that structure explicitly for higher gains than this. But I ran into this with a DSP workload (with v just being a vector of 4 arbitrary samples), and in that case it&#8217;s definitely useful to know, especially since anything convolution-related tends to have highly diagonal (<a href="http://en.wikipedia.org/wiki/Toeplitz_matrix">Toeplitz</a>, to be precise) structure to begin with.</p>
]]></html></oembed>