<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[x86 code compression in&nbsp;kkrunchy]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>This is about the &#8220;secret ingredient&#8221; in my EXE packer kkrunchy, which was used in our (Farbrausch) 64k intros starting from &#8220;fr-030: Candytron&#8221;, and also in a lot of other 64k intros or similarly size-limited productions by other groups &#8211; including several well-known 64ks from other groups such as Conspiracy, Equinox and Fairlight, I&#8217;m happy to add.</p>
<p>There are two primary versions of kkrunchy. The first one was developed between 2003 and 2005 and uses a fairly simple LZ77 coder with arithmetic coding backend, designed to have good compression and a small depacker (about 300 bytes when using hand-optimized x86 code). If you&#8217;re interested in the details, a simplified C++ version of the depacker that I then used to write the ASM code from is still available <a href="http://www.farbrausch.de/~fg/stuff/depacker_simple.cpp">here</a> (of the various programs I&#8217;ve written, this is still one of my favorites: it&#8217;s just nice to have something so fundamentally <em>simple</em> that&#8217;s actually useful).</p>
<p>The second was developed roughly between 2006 and 2008 and uses PAQ-style context mixing. To be precise, it&#8217;s a spin-off borrowing a lot of ideas from PAQ7 including its neural network mixer, but with a custom model design optimized to use a lot less memory (around 5.5mb instead of hundreds of megabytes) and run <em>a lot</em> faster. It&#8217;s still dog slow, but it managed to process &gt;120kb/s on my then-current P4 2.4GHz instead of the roughly 5kb/s that PAQ7 managed on the same machine when targeting roughly the same compression ratio. The depacker was optimized for both size and speed (with this much computation per bit, making it 4x slower for the sake of saving a few bytes is a bad deal!), and took up a comparatively hefty 1.5k, but normally more than made up for it. For example, our 96k shooter &#8220;kkrieger&#8221; shrunk by over 10k when recompressed with the new engine (if only we&#8217;d had that before, it would&#8217;ve saved us a lot of work in the last week before release and prevented some very embarrassing bugs).</p>
<p>Anyway, both packing engines were competitive with the state of the art in EXE compression for the given target sizes when they were first released, but neither were in any way spectacular, and both were biased towards simplicity and tweakability, not absolutely maximum compression. That&#8217;s because unlike most other EXE packers, kkrunchy doesn&#8217;t rely solely on its backend compression engine to bring in the space savings.</p>
<p><!--more--></p>
<h3>Preprocessing code for fun and profit</h3>
<p>Especially 64k intros (the original target for kkrunchy) usually have a fairly large amount of the overall executable size devoted to code, and a much smaller amount spent on content (since 64ks are usually one-file packages, this content comes in the form of either statically initialized data or resource sections). A typical ratio for us was between 48k and 56k of the final compressed executable spent on code and overhead (you still need OS-parseable file headers and some way of importing DLLs), with the rest being data. Clearly, if you want to really make a dent on the packed size, you should concentrate on code first. This is especially true because code always comes in the same format (executable x86 opcodes) while the composition and structure of data in the executable varies a lot.</p>
<p>If you have a context modeling based compressor, one way to inject &#8220;domain knowledge&#8221; about x86 code into the compressor is to include models specifically designed to analyze x86 opcodes and deal with them. This is the approach taken by recent PAQ versions, and also what you&#8217;ll find in the <a href="http://research.microsoft.com/en-us/um/people/darkok/papers/TOPLAS.pdf">PPMexe paper</a> (I&#8217;ll come back to this in a bit). In context coders, this is probably the most direct way to inject domain knowledge into the (de)compressor, but it has two problems: first, that domain knowledge translates to a lot of extra code you need to carry around in your decompressor, and second, you can&#8217;t do it if you&#8217;re not using a context coder! (Remember the first version of kkrunchy used a plain LZ+Ari backend).</p>
<p>So kkrunchy instead uses a <em>filter</em> on x86 code. The compressor first takes the original x86 code, passes it through the filter (this should better be a reversible process!), and then hands it to the main compression engine. The decompressor first undoes the actual compression, then reverse-applies the filter in a second step. This is something done by most EXE compressors, but all others (that I know of) use very simple filters, mainly intended to make jump and call offsets more compressible. kkrunchy looks deeper than that; it actually disassembles the instruction stream, albeit only very roughly. This is based on the &#8220;split-stream&#8221; algorithm in the PPMexe paper (even though the paper only appeared in a journal in 2007, an earlier version was available as a MS Research Technical Report in 2005, when I first read it).</p>
<h3>Split-stream encoding</h3>
<p>Unlike most modern, RISC-influenced instruction encodings, x86 doesn&#8217;t have fixed-length instruction words, but instead uses a variable-length encoding that leads to higher code density but is full of exceptions and special cases, making instruction decoding (or even the simpler problem of just determining instruction lengths) fairly complicated. An x86 instruction can have these components:</p>
<ul>
<li>Zero or more <em>prefixes</em>. A prefix is a single byte that can influence the way an operation is performed (e.g. <code>REP</code>, <code>LOCK</code>), the size of an operand (the <em>operand-size prefix</em> <code>0x66</code> turns a 16-bit operation into a 32-bit operation and vice versa), the size of an addressing computation (the <em>address-size prefix</em> <code>0x67</code> which does the same for address calculations) or the source/target of a read/write operation (via segment prefixes). There&#8217;s also a set of extra prefixes in 64-bit mode (the so-called REX byte), and finally some of the prefixes are also used to extend the opcode byte for newer instructions (e.g. SSE/SSE2).
<li>The actual <em>opcode</em>, which encodes the operation to be performed (duh). &#8220;Classic&#8221; x86 has one-byte and two-byte opcodes. Two-byte opcodes have 0x0f as the first byte, the rest is one-byte opcodes. (More recent instructions also have three-byte opcodes that start with either 0x0f 0x38 or 0x0f 0x3a).
<li>A <em>ModR/M byte</em>, which encodes either the addressing mode or registers used for an instruction.
<li>A <em>SIB byte</em> (scale-index-base), used to extend the ModR/M byte for more advanced 386+ addressing modes like <code>[esi+eax*4+0x1234]</code>.
<li>A <em>displacement</em> (usually 8-bits or 32-bits, signed, in 32-bit code) which is just an immediate offset added to the result of an address calculation (like the &#8220;0x1234&#8221; in the example above). There&#8217;s also absolute address mode (e.g. <code>mov [0x01234567], 0x89abcdef</code>) which is not relative to any base register.
<li>An <em>immedate operand</em>. Again, usually 8 or 32 bits in 32-bit code, but there are some exceptions. Some ops (like <code>add eax, 42</code>) or (<code>mov [edi], 23</code>) have an immediate operand. This field is separate from the displacement; both can occur in the same instruction.
</ul>
<p>What the split-stream filter does is this: It splits x86 instructions (which are an interleaved sequence of these bytes with different meanings) into separate streams based on their function. The idea is that there&#8217;s lots of similarity between values in the same stream, but very little relation between values in different streams: Especially compiler-generated code won&#8217;t use all opcodes (some are a lot more frequent than others), but 8-bit jump offsets are almost completely random. Small 32-bit immediates are a lot more common than large ones; most 32-bit displacements are positive, but stack-based references (based off either <code>ESP</code> or <code>EBP</code>, e.g. used to access function parameters or local variables) are often negative. SIB bytes look different from ModRM bytes (or opcode bytes), and some are a lot more common than others. And so on. kkrunchy doesn&#8217;t split all of these into distinct groups (for example, the prefix, opcode and ModR/M bytes tend to be heavily correlated, so they&#8217;re kept all in one stream) and splits some of them even further: jump offsets are special 32-bit displacements, as are call offsets, and for 8-bit displacements, it pays to have separate streams based on the &#8220;base register&#8221; field of the ModR/M byte, since offsets used off the same register &#8211; usually containing a pointer &#8211; tend to be correlated. But the basic idea is just to split instructions into their components and encode the streams separately. This is done by just keeping them in distinct regions, so we need to add a header at the beginning that tells us how long each stream is (a bit of overhead there).</p>
<p>One thing worth noting is that the regular (non-far, non-indirect) <code>JMP</code> (plus conditional variants) and <code>CALL</code> instructions use <em>displacements</em>, not absolute target addresses; i.e. they encode the distance between the target instruction and the end of the current (call/jump) instruction, not the actual target address. This is intentionally done to greatly reduce the number of <em>relocations</em> that need to be made if some piece of code needs to be moved to an address other than its preferred load address in memory. However, it also means that code that repeatedly jumps to or calls the same target address (e.g. a frequently-called function) will have a different encoding of the target address every time, since the displacement depends on the address of the call instruction! For compression purposes, we&#8217;d like calls to the same target instruction to result in the exact same encoding every time, i.e. convert the offsets into absolute addresses. This is fairly easy to do &#8211; just add the address of a call instruction to its destination address (of course, you need to undo this after decompression or the code will jump somewhere wrong). This is the one thing that most executable compressors (or general-purpose compressors that have special modes for x86 code, for that matter) will do on their own: Search for byte sequences that look like a <code>CALL</code> or <code>JMP</code> instruction in the byte stream, and add the current offset to the 4 bytes following it. Then do the same thing in reverse on decompression. Very simple trick, but huge effect on the compression ratio: doing this one transform before compression usually reduces the final compressed size by about 10%, both for LZ-based and context-based coders.</p>
<h3>Fun with calls</h3>
<p>Function calls are special: A lot of functions will get called very often indeed. You can just rely on the relative-to-absolute offset transform and your backend encoder to handle this; but when developing kkrunchy I found that it actually pays to special-case this one a bit. So instead I keep a 255-entry LRU cache (or a FIFO for simplicity in earlier kkrunchy versions) of recent call sites. If a call is repeated, I only need to store the index; otherwise I write a &#8220;255&#8221; byte to denote a cache miss, then code the full 32-bit target address and add it to the cache. The actual call offsets aren&#8217;t correlated to the offsets, so the &#8220;cache index&#8221; and &#8220;verbatim call offset&#8221; fields are written into separate streams.</p>
<p>There&#8217;s two more tricks to the function call cache: First, code that follows a <code>RET</code> instruction has a good likelihood of being the start of a new function, and second, VC++ pads the unused space between functions (for alignment reasons) with <code>0xcc</code> bytes which correspond to the <code>INT3</code> opcode (software breakpoint). Since <code>INT3</code> opcodes don&#8217;t appear in most functions, any instruction following one or more <code>INT3</code> ops is also likely to start a new function. All such potential function addresses are also speculatively added to the function address cache. A false positive will throw an address not used for at least 255 subsequent <code>CALL</code> instructions of the cache &#8211; not a big deal. OTOH, if our guess turns out to be right and we&#8217;ve found the start of a function that gets called soon, that&#8217;s 32 bits we never need to encode now.</p>
<p>Typical hit rates for the function address cache are between 70 and 80 percent, and every hit saves us 3 bytes. Trust me, there&#8217;s a lot of calls in most pieces of x86 code, and this is a big deal. This is the <em>only</em> transform in the code preprocessor that results in a lower number of byte to be encoded; all other steps of the transform are either 1:1 in terms of bytes or add some overhead (headers, escape codes). Still, running the filter usually reduces the size of the file by a few percent <em>even before any actual compression is done</em>.</p>
<h3>Dealing with non-code bytes</h3>
<p>The original PPMexe algorithm assumes that it&#8217;s only actually running on code, and that it knows about all basic blocks in the program (i.e. all potential jump targets are known). It will then reorder instructions to get better compression. This is not an option when dealing with arbitrary programs; for once, we don&#8217;t know where the basic blocks are (the compiler doesn&#8217;t tell us!). We also don&#8217;t want to change the program in any way (even ignoring potential performance penalties, just think about the kind of problems arbitrary reordering of instructions could introduce in lock-free code&#8230;), and finally, it&#8217;s not generally a safe assumption that the code section only contains code, or that the compressor knows about all x86 instructions (after all, there&#8217;s new ones being added every so often, and we can&#8217;t generally parse them correctly without knowing about them). Luckily, x86 disassembling is mostly self-synchronizing (even if you get a few instructions wrong, you will generally by in sync again after a few instructions unless the code was explicitly designed to be &#8220;ambiguous&#8221;), so we don&#8217;t need to sweat this too much. We do, however, need to be able to encode instruction bytes that we can&#8217;t disassemble properly. </p>
<p>Luckily, there&#8217;s a few one-byte opcodes that basically never occur in any real program; we can use any of them as an &#8220;escape code&#8221;. We then flag that instruction as invalid in our disassembler table; if the program actually contains that instruction, it will need to use an escape code to do so, but that&#8217;s fine. kkrunchy originally used <code>INTO</code> (<code>0xce</code>) as escape opcode that just means &#8220;copy the next byte verbatim&#8221;; this is supposed to trigger an interrupt if arithmetic overflow occurred in the previous flag-setting instruction. I&#8217;ve never seen this in any real program; certainly not a compiled one. Should kkrunchy ever run across it, it can still encode that instruction byte using an escape: <code>INTO</code> turns into <code>0xce 0xce</code>. No problem. More recent versions use the <code>ICEBP</code> opcode (<code>0xf1</code>) as escape, which is even more obscure. It only exists on a few x86 models, and again, I&#8217;ve never seen it actually used anywhere.</p>
<p>One piece of non-code that tends to occur in code sections is jump tables for switch statements. Luckily they&#8217;re fairly easy to detect with good accuracy: If we have at least 12 consecutive bytes, starting at an address divisible by 4, that look like addresses in the code (i.e. are between the &#8220;code start&#8221; and &#8220;code end&#8221; addresses which are assumed to be known), we use another escape opcode to denote &#8220;jump table with N entries follows&#8221;, where N-1 is encoded as a single-byte value. (If a jump table has more than 256 entries, we just use chain several such escapes).</p>
<p>Between the two tricks, unknown instructions or junk in the code section doesn&#8217;t usually cause a problem.</p>
<h3>A box inside a box inside a box&#8230;</h3>
<p>Because the kkrunchy code preprocessing filter does more than most such filters, it&#8217;s also bigger (around 900 bytes of code+data combined in the last version I used). But since it runs after the main decompression phase, there&#8217;s no need to pay that cost in full: The filter code can be compressed along with the filtered data! This shaves off another few hundred bytes at the cost of one extra jump at the end of the main decompression code. All other code that runs after decompression (e.g. import processing) can also be stored in compressed format. This is a bit like the &#8220;bootstrapping&#8221; process used for some C compilers: The core part of a C compiler (syntactic analysis and a simple code generator for the target machine) is written in a very portable subset of C. This core C compiler is compiled using whatever other C compiler is currently present on the system, with optimizations disabled. The result is the stage-1 compiler. The stage-1 compiler is then used to compile a fully-featured version of our C compiler (including extensions and a fully-featured code generator with optimization), creating the stage-2 compiler. This stage-2 compiler is &#8220;feature complete&#8221;, but may be slow because it was compiled by the stage-1 compiler which doesn&#8217;t have all optimizations, so the stage-2 compiler compiles itself again (with full optimization this time!) to produce the stage-3 compiler, which is the final result (some compilers only use a stage-2 bootstrap process, and some cross compilers may use more stages, but you get the general idea). You can chain compression and filtering stages the same way; indeed, at some point I was bothered by the 1.5k spent on the context modeling depacker and considered adding a very simple pure LZ at the start of the chain (to unpack the unpacker&#8230;), but the best I managed was an overall gain of less than 100 bytes (after subtracting the &#8220;stage-0&#8221; depacker size and overhead), so I didn&#8217;t pursue that any further.</p>
<h3>So how much does it help?</h3>
<p>All these techniques combined typically reduce final compressed size by about 20% compared to no preprocessing at all, with both the LZ-based and the context model coder (this is true not just with the kkrunchy backends, but also with other compressors like ZIP, RAR, LZMA or PAQ). Compare to the simpler &#8220;call offsets only&#8221; preprocessing, which typically reduces size by about 10%. That&#8217;s clearly the low-hanging fruit right there, but an extra 10% isn&#8217;t to be sneezed at, particularly since both the encoding and decoding are really quite easy and quick.</p>
<p>If you&#8217;re interested in the details, I made a (cleaned-up version) of the preprocessor available a while ago <a href="http://farbrausch.com/~fg/code/disfilter/">here</a> (that code is placed into the public domain); there&#8217;s extra documentation with some more details in the code. The biggest lacking feature is probably x86-64 support. It shouldn&#8217;t be too hard to add, but I never got around to it since I stopped doing size-optimized stuff a while ago, and besides, even now, 64-bit x86 apps are relatively rare.</p>
<p>Anyway, if you have any questions or are interested in extending/using that code somewhere, just drop me a mail (or reply here). I&#8217;d certainly like to see it being useful in other contexts!</p>
]]></html></oembed>