<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The Osmosian Order of Plain English Programmers Welcomes You]]></provider_name><provider_url><![CDATA[http://osmosianplainenglishprogramming.blog]]></provider_url><author_name><![CDATA[gerryrzeppa]]></author_name><author_url><![CDATA[https://osmosianplainenglishprogramming.blog/author/gerryrzeppa/]]></author_url><title><![CDATA[Parsing with Riders]]></title><type><![CDATA[link]]></type><html><![CDATA[<header class="entry-header">
<h1 class="entry-title">Parsing with Riders</h1>
</header>
<div class="entry-content">
<p>A Plain English compiler, as you might expect, needs to parse text simply, flexibly, and quickly. We do that with a thing we call a <em>rider</em>. It’s a generalized version of an idea (with the same name) that Niklaus Wirth taught us about in his magnificent <em>Oberon </em>system. To understand our implementation and use of riders, however, you must first understand the way we implement…</p>
<p><strong>Strings</strong></p>
<p>The prototype <em>string</em> record, in Plain English, is defined like this:</p>
<p><span style="color:#00ccff;">A string has a first byte pointer and a last byte pointer.</span></p>
<p>The data bytes of a string are dynamically allocated in a contiguous sequence on the Heap. A string is considered blank if the first byte pointer is nil (or greater than the last byte pointer). When a string consists of just one byte, the pointers are equal. The length of a string is thus the last byte’s address minus the first byte’s address plus one. Our compiler generates all the code and calls necessary to manage string memory.</p>
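<p>For readers who think in indices rather than pointers, here is a minimal Python sketch of that record. It is not how Plain English stores strings: the <code>Span</code> class and its method names are hypothetical, integers stand in for byte addresses, and <code>None</code> stands in for nil.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a string record: two inclusive byte addresses."""

    first: Optional[int]  # None models a nil pointer
    last: Optional[int]

    def is_blank(self) -> bool:
        # Blank when a pointer is nil or the first pointer has passed the last.
        if self.first is None or self.last is None:
            return True
        return self.first > self.last

    def length(self) -> int:
        # Last address minus first address plus one; zero for a blank string.
        if self.is_blank():
            return 0
        return self.last - self.first + 1
```

<p>With equal pointers the length works out to one, which matches the one-byte case described above.</p>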
<p>And that brings us to…</p>
<p><strong>Substrings</strong></p>
<p>A <em>substring </em>is any contiguous part of a string, up to and including the whole string. In Plain English it is defined like this:</p>
<p><span style="color:#00ccff;">A substring is a string.</span></p>
<p>In other words, the substring type not only looks like a string (with a first byte pointer and a last byte pointer), but is also compatible with the <em>string</em> type, which allows most of our string routines to operate on either type.</p>
<p>And now we’re ready to talk about…</p>
<p><strong>Riders</strong></p>
<p>A <em>rider</em>, in Plain English, is defined like this:</p>
<p><span style="color:#00ccff;">A rider has an original substring, a source substring and a token substring.</span></p>
<p>When we…</p>
<p><span style="color:#00ccff;">Slap a rider on a string.</span></p>
<p>…the pointers in the original and source substrings are set to span the entire string. At the same time, the token’s first byte pointer is set to the first byte of the string, and the token’s last byte pointer is set to nil (making the token substring initially blank, but ready to be extended).</p>
<p>Now when we…</p>
<p><span style="color:#00ccff;">Bump a rider.</span></p>
<p>…we add 1 to the source’s <em>first </em>byte pointer, and add 1 to the token’s <em>last </em>byte pointer. It thus appears that we have “moved” a byte of the source into the token, though no physical movement of string data has actually taken place. Very fast, even on huge strings. We keep the original substring pointers intact so we can check, as necessary, to make sure we don’t fall off either end of the source.</p>
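<p>The slap and bump steps can be sketched in Python as follows. This is an illustration under assumptions, not the compiler's code: the <code>Rider</code> class and its field names are made up, indices into a <code>bytes</code> object stand in for pointers, and "one before the first byte" stands in for the nil last pointer of a blank token.</p>

```python
class Rider:
    """Hypothetical model of a rider; indices stand in for byte pointers."""

    def __init__(self, text: bytes) -> None:
        # "Slap a rider on a string": original and source span the whole text.
        self.text = text
        self.orig_first, self.orig_last = 0, len(text) - 1
        self.src_first, self.src_last = 0, len(text) - 1
        # The token starts on the first byte with its last index one step
        # behind, which makes it blank but ready to be extended.
        self.tok_first, self.tok_last = 0, -1

    def bump(self) -> None:
        # "Bump a rider": one byte appears to move from the source to the token.
        self.src_first += 1
        self.tok_last += 1

    def source(self) -> bytes:
        return self.text[self.src_first:self.src_last + 1]

    def token(self) -> bytes:
        return self.text[self.tok_first:self.tok_last + 1]
```

<p>The slices in <code>source</code> and <code>token</code> copy bytes only so the spans can be inspected; the rider itself is just six integers, and a bump changes two of them.</p>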
<p>Note that we can peek back at previous bytes in the source string simply by subtracting from the source’s first byte pointer.</p>
<p>Note also that we can have as many riders on a string as we need, so we can parse different parts of the source at the same time.</p>
<p>Note, thirdly, that we can save substrings of the source of any length (as just two pointers) so we can process them later without having to hunt them down again in the source.</p>
<p>Finally, note that we can code up a wide variety of “Move a rider” routines to extract any kind of token from any kind of source. Here, from our compiler, are some…</p>
<p><strong>Examples</strong></p>
<p>Our compiler ignores spaces, tabs, linefeeds, carriage returns, and other <em>noise </em>between meaningful characters. When we find ourselves sitting on a character that doesn’t interest us, we use this routine to move past it:</p>
<p><span style="color:#00ccff;">To move a rider (code rules &#8211; noise):</span><br />
<span style="color:#00ccff;"> Bump the rider.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source is blank, exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is noise, repeat.</span></p>
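<p>In Python terms, that loop might look like the sketch below. Here <code>pos</code> plays the part of the source's first byte pointer, and the function returns where it ends up. The <code>NOISE</code> set is an assumption; the compiler's own definition of noise may cover more bytes.</p>

```python
NOISE = b" \t\r\n"  # assumption: space, tab, carriage return, linefeed only


def move_noise(text: bytes, pos: int) -> int:
    """Return the index just past the run of noise that starts at pos."""
    while True:
        pos += 1  # Bump the rider.
        # Exit when the source is used up or the next byte is not noise.
        if pos >= len(text) or text[pos] not in NOISE:
            return pos
```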
<p>A <em>comment</em>, in Plain English, starts with a backslash and ends at the end of a line. When we find ourselves sitting on a comment (i.e., the first remaining byte of the rider’s source is a backslash), we suck it up into a token using this routine:</p>
<p><span style="color:#00ccff;">To move a rider (code rules &#8211; comment):</span><br />
<span style="color:#00ccff;"> Bump the rider.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source is blank, exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is not the return byte, repeat.</span></p>
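<p>A rough Python equivalent, with an index in place of the rider: it assumes "the return byte" means carriage return (byte 13), and the function name is invented for this sketch.</p>

```python
RETURN = 13  # assumption: "the return byte" is the carriage return


def move_comment(text: bytes, pos: int) -> int:
    """pos sits on the backslash; return the index where the comment ends."""
    while True:
        pos += 1  # Bump the rider.
        # Stop at the end of the source or on the return byte, which is
        # left in the source rather than pulled into the token.
        if pos >= len(text) or text[pos] == RETURN:
            return pos
```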
<p>A <em>remark</em>, in Plain English, is an “inline” comment surrounded by square brackets. When we find ourselves sitting on a remark, we call this guy:</p>
<p><span style="color:#00ccff;">To move a rider (code rules &#8211; remark):</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source is blank, exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the return byte, break.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the left-bracket byte,<br />
add 1 to a count.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the right-bracket byte,<br />
subtract 1 from the count.</span><br />
<span style="color:#00ccff;"> Bump the rider.</span><br />
<span style="color:#00ccff;"> If the count is 0, break.</span><br />
<span style="color:#00ccff;"> Repeat.</span></p>
<p>Remarks can be nested, so that routine is a little more complex.</p>
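<p>The counting trick is easier to see with a trace, so here is a small Python sketch of the same loop. It is an approximation for illustration: the function name is hypothetical, an index stands in for the rider, and the return byte is assumed to be carriage return.</p>

```python
LEFT, RIGHT, RETURN = ord("["), ord("]"), 13


def move_remark(text: bytes, pos: int) -> int:
    """pos sits on the opening bracket; return the index just past the remark."""
    count = 0
    while len(text) > pos:  # stop when the source is blank
        byte = text[pos]
        if byte == RETURN:  # an unclosed remark ends at the end of the line
            break
        if byte == LEFT:
            count += 1
        if byte == RIGHT:
            count -= 1
        pos += 1  # Bump the rider.
        if count == 0:  # the outermost bracket has just been closed
            break
    return pos
```

<p>On <code>[a [b] c] d</code> the count goes 1, 2, 1, 0, and the loop stops right after the final bracket.</p>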
<p>When we find ourselves at the beginning of a literal <em>string </em>(i.e., the first remaining byte of the source is a double-quote mark), we call this routine to suck the string into a token:</p>
<p><span style="color:#00ccff;">To move a rider (code rules &#8211; string):</span><br />
<span style="color:#00ccff;"> Bump the rider.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source is blank, exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the return byte, exit.</span><br />
<span style="color:#00ccff;"> If the rider is on any nested double-quote, bump the rider; repeat.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the double-quote byte,<br />
bump the rider; exit.</span><br />
<span style="color:#00ccff;"> Repeat.</span></p>
<p>Note that we don’t allow strings to span multiple lines (to avoid common errors), and that we use doubled-up double-quotes to allow for double quotes within strings. For example…</p>
<p><span style="color:#00ccff;">&#8220;This is a string with the next word &#8220;&#8221;in&#8221;&#8221; double quote marks.&#8221;</span></p>
<p>…is interpreted like this:</p>
<p><span style="color:#00ccff;">This is a string with the next word &#8220;in&#8221; double quote marks.</span></p>
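<p>Both halves of that, finding the end of the literal and then collapsing the doubled quotes, can be sketched in Python. The function names are hypothetical, plain ASCII double quotes are used, the return byte is assumed to be carriage return, and the collapsing step is our guess at what "interpreted" involves rather than a copy of the compiler's code.</p>

```python
QUOTE, RETURN = ord('"'), 13


def move_string(text: bytes, pos: int) -> int:
    """pos sits on the opening quote; return the index just past the literal."""
    while True:
        pos += 1  # Bump the rider.
        if pos >= len(text) or text[pos] == RETURN:
            return pos  # ran out of source or line: literals stay on one line
        if text[pos:pos + 2] == b'""':
            pos += 1  # step over the first quote; the loop's bump takes the second
            continue
        if text[pos] == QUOTE:
            return pos + 1  # include the closing quote in the token


def literal_value(token: bytes) -> bytes:
    # Drop the outer quotes, then collapse each doubled quote to a single one.
    return token[1:-1].replace(b'""', b'"')
```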
<p><em>Qualifiers</em> are used to distinguish special cases of similar routines. In Plain English, they’re enclosed in parentheses. This is the “move a rider” routine that handles qualifiers:</p>
<p><span style="color:#00ccff;">To move a rider (code rules &#8211; qualifier):</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source is blank, exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the return byte, break.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the left-paren byte,<br />
add 1 to a count.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the right-paren byte,<br />
subtract 1 from the count.</span><br />
<span style="color:#00ccff;"> Bump the rider.</span><br />
<span style="color:#00ccff;"> If the count is 0, break.</span><br />
<span style="color:#00ccff;"> Repeat.</span></p>
<p>Qualifiers can also be nested, so we have to take that into account as we did with remarks.</p>
<p>Punctuation <em>marks </em>in Plain English are all single characters, so we just suck ’em up into the token:</p>
<p><span style="color:#00ccff;">To move a rider (code rules &#8211; mark):</span><br />
<span style="color:#00ccff;"> Bump the rider.</span></p>
<p><em>Possessives</em> typically come at the end of names and can either be an apostrophe followed by the letter “s”, or an apostrophe all by itself, if the preceding letter is “s”. This is the routine that parses possessives:</p>
<p><span style="color:#00ccff;">To move a rider (code rules &#8211; possessive):</span><br />
<span style="color:#00ccff;"> Bump the rider.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source is blank, exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source starts with &#8220;s&#8221;, bump the rider.</span></p>
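<p>As a Python sketch, with an index standing in for the rider, this comes to only a few lines. The function name is made up, and matching only a lowercase "s" is an assumption about how "starts with" compares bytes.</p>

```python
def move_possessive(text: bytes, pos: int) -> int:
    """pos sits on the apostrophe; return the index just past the possessive."""
    pos += 1  # Bump the rider past the apostrophe.
    if pos >= len(text):
        return pos  # the source is blank
    if text[pos:pos + 1] == b"s":
        pos += 1  # take the "s" as part of the token too
    return pos
```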
<p>Below is the higher-level routine that calls the above routines, and one more at the end:</p>
<p><span style="color:#00ccff;">To move a rider (code rules):</span><br />
<span style="color:#00ccff;"> Position the rider&#8217;s token on the rider&#8217;s source.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source is blank, exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is noise,</span><br />
<span style="color:#00ccff;"> move the rider (code rules &#8211; noise); exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the backslash byte,</span><br />
<span style="color:#00ccff;"> move the rider (code rules &#8211; comment); exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the left-bracket byte,</span><br />
<span style="color:#00ccff;"> move the rider (code rules &#8211; remark); exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the double-quote byte,</span><br />
<span style="color:#00ccff;"> move the rider (code rules &#8211; string); exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is the left-paren byte,</span><br />
<span style="color:#00ccff;"> move the rider (code rules &#8211; qualifier); exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is any mark,</span><br />
<span style="color:#00ccff;"> move the rider (code rules &#8211; mark); exit.</span><br />
<span style="color:#00ccff;"> If the rider is on any possessive,</span><br />
<span style="color:#00ccff;"> move the rider (code rules &#8211; possessive); exit.</span><br />
<span style="color:#00ccff;"> Move the rider (code rules &#8211; glom).</span></p>
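<p>The shape of that dispatcher can be shown with a deliberately reduced Python sketch. It handles only noise, comments, marks, and a simplified catch-all, and it is not the compiler's logic: the <code>MARKS</code> set, the helper names, and the rule that a catch-all token runs until the next noise, mark, or backslash are all assumptions made to keep the example short.</p>

```python
NOISE = b" \t\r\n"           # assumption: a minimal noise set
MARKS = b".,;:!?"            # assumption: a small sample of punctuation marks
BACKSLASH, RETURN = ord("\\"), 13
STOPS = NOISE + MARKS + b"\\"


def run_while(text, pos, keep_going):
    """Bump once, then keep bumping while keep_going(byte) holds."""
    pos += 1
    while len(text) > pos and keep_going(text[pos]):
        pos += 1
    return pos


def next_token(text, pos):
    """Return (kind, end) for the token that starts at pos."""
    byte = text[pos]
    if byte in NOISE:
        return "noise", run_while(text, pos, lambda b: b in NOISE)
    if byte == BACKSLASH:
        return "comment", run_while(text, pos, lambda b: b != RETURN)
    if byte in MARKS:
        return "mark", pos + 1
    return "glom", run_while(text, pos, lambda b: b not in STOPS)


def tokens(text):
    """Collect every token except noise as (kind, bytes) pairs."""
    pos, out = 0, []
    while len(text) > pos:
        kind, end = next_token(text, pos)
        if kind != "noise":
            out.append((kind, text[pos:end]))
        pos = end  # like positioning the token on what is left of the source
    return out
```

<p>Each branch tests the first remaining byte, hands off to one specialised mover, and returns, which is the same pattern as the routine above.</p>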
<p>A <em>glom </em>is any character or collection of characters that can be processed at the next level up — the <em>semantic</em>, rather than the <em>syntactic </em>level. Since noise, comments, remarks, strings, qualifiers, punctuation marks and possessives have been weeded out at this point, the “move a rider” routine for gloms is quite simple:</p>
<p><span style="color:#00ccff;">To move a rider (code rules &#8211; glom):</span><br />
<span style="color:#00ccff;"> Bump the rider.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source is blank, exit.</span><br />
<span style="color:#00ccff;"> If the rider is on any possessive, exit.</span><br />
<span style="color:#00ccff;"> If the rider&#8217;s source&#8217;s first&#8217;s target is any glom byte, repeat.</span></p>
<p>We have to check for possessives a second time in case one pops up at the end of a glom.</p>
<p><em>Glom bytes</em> are defined as follows:</p>
<p><span style="color:#00ccff;">To decide if a byte is any glom byte:</span><br />
<span style="color:#00ccff;"> If the byte is any letter, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is any digit, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the tilde byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the at-sign byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the number-sign byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the percent-sign byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the ampersand byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the underscore byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the single-quote byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the dash byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the cross byte, say yes.</span><br />
<span style="color:#00ccff;"> If the byte is the slash byte, say yes.</span><br />
<span style="color:#00ccff;"> Say no.</span></p>
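<p>Putting the glom routine and the glom-byte test together in Python gives something like the sketch below. Several pieces are assumptions and are marked as such: letters and digits are limited to ASCII, "the cross byte" is taken to be the plus sign, and <code>on_possessive</code> is a guess, since the test behind "the rider is on any possessive" is not shown in this post.</p>

```python
import string

LETTERS_DIGITS = (string.ascii_letters + string.digits).encode()
# tilde, at-sign, number-sign, percent-sign, ampersand, underscore,
# single-quote, dash, cross (assumed to be "+"), slash
GLOM_EXTRAS = b"~@#%&_'-+/"


def is_glom_byte(byte: int) -> bool:
    return byte in LETTERS_DIGITS or byte in GLOM_EXTRAS


def on_possessive(text: bytes, pos: int) -> bool:
    # Assumed rule: an apostrophe, optionally followed by "s", counts as a
    # possessive only when the word ends right after it.
    if text[pos:pos + 1] != b"'":
        return False
    nxt = pos + 1
    if text[nxt:nxt + 1] == b"s":
        nxt += 1
    return nxt >= len(text) or not is_glom_byte(text[nxt])


def move_glom(text: bytes, pos: int) -> int:
    """Return the index just past the glom that starts at pos."""
    while True:
        pos += 1  # Bump the rider.
        if pos >= len(text):
            return pos  # the source is blank
        if on_possessive(text, pos):
            return pos  # leave the possessive for its own routine
        if not is_glom_byte(text[pos]):
            return pos
```

<p>Because the single-quote is itself a glom byte, the possessive check has to come first; otherwise the apostrophe in a word like <code>rider's</code> would be swallowed into the glom.</p>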
<p>I imagine you get the idea. Using riders in this way, we can simply, flexibly, and quickly parse our source files without having to think about too much at any one time.</p>
</div>
]]></html></oembed>