<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The ryg blog]]></provider_name><provider_url><![CDATA[https://fgiesen.wordpress.com]]></provider_url><author_name><![CDATA[fgiesen]]></author_name><author_url><![CDATA[https://fgiesen.wordpress.com/author/fgiesen/]]></author_url><title><![CDATA[Reading and writing are less symmetric than you (probably)&nbsp;think]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>I am talking about the I/O operations as used in computing here. A typical example of how this kind of thing is exposed is the pair of POSIX syscalls <code>read(2)</code> and <code>write(2)</code>, which have the following C function prototypes:</p>
<pre>
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count); 
</pre>
<p>Now these are raw system calls; user programs <em>can</em> use them directly, but they usually don&#8217;t. They normally go through some buffered IO layer; in the C standard library, this means <code>FILE*</code> and the functions <code>fread</code> and <code>fwrite</code>, which split <code>count</code> into a product of two values in a vestigial nod to record-based IO but are otherwise equivalent. For concreteness, suppose we&#8217;re interfacing with actual storage (i.e. not a pipe, socket, virtual filesystem etc.). Then conceptually, a &#8220;read&#8221;-class operation (like <code>read</code> or <code>fread</code>) grabs bytes from a file, say on a disk somewhere, and puts them into the specified memory buffer, and a &#8220;write&#8221;-class operation takes bytes in a memory buffer and writes them to the disk. Which definitely <em>sounds</em> nice and symmetric, but there are some important behavioral asymmetries between them, especially when errors are in the mix. The reasons have to do with buffering.</p>
<h3>Buffered I/O</h3>
<p>In general, file I/O operations in your program will not go directly to a storage device; data instead makes its way through several buffering layers (most of which can be disabled using various flags, but in normal usage these layers are on). These layers are there for good reason: on the kernel side, there&#8217;s what&#8217;s traditionally called the &#8220;buffer cache&#8221;. Storage devices are &#8220;block devices&#8221;, which means they store data in blocks. The block size depends on the device: on old hard disks it used to be 512 bytes; CDs, DVDs etc. tend to use 2k blocks; newer storage devices are now on 4k blocks. Block devices only read or write entire blocks at a time; that means random byte-aligned IO requests such as &#8220;read 100 bytes from disk at byte offset 1234567&#8221; or &#8220;write 2000 bytes to location 987654&#8221; can&#8217;t be directly passed to the device at all. The buffer cache is used to translate these requests into block-aligned read and write operations that the device understands; non-block-aligned writes also require reading the previous contents of the block that are not overwritten, and those go in the buffer cache as well. And of course, as the name suggests, it acts as a cache.</p>
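<p>To make the translation step concrete, here is a minimal sketch of the arithmetic involved. The helper <code>blocks_for_request</code> and the <code>block_range</code> struct are hypothetical names made up for this illustration, not part of any kernel API; a real buffer cache does a lot more than this.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Block-aligned range that covers a byte-granular request. */
struct block_range {
    uint64_t first_block;  /* index of the first block touched */
    uint64_t block_count;  /* number of whole blocks to transfer */
};

/* Hypothetical helper: which whole blocks must be transferred to service
 * a request for `length` bytes starting at byte `offset`? */
struct block_range blocks_for_request(uint64_t offset, uint64_t length,
                                      uint64_t block_size)
{
    struct block_range r = {0, 0};
    assert(block_size > 0);
    if (length == 0)
        return r;  /* empty request touches no blocks */

    uint64_t last_byte = offset + length - 1;
    r.first_block = offset / block_size;
    r.block_count = last_byte / block_size - r.first_block + 1;
    return r;
}
```

<p>With 512-byte blocks, the 100-byte read at offset 1234567 fits entirely inside block 2411, while the 2000-byte write at offset 987654 touches four blocks starting at block 1929. The first and last of those four are only partially overwritten, which is why their old contents have to be read first.</p>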
<p>On the user-space side, we also have buffers, albeit for a different reason: <code>read</code> and <code>write</code> are system calls, and as such incur a transition to kernel space and back. They also need to check for and report errors every time they are invoked. And of course they actually need to do the work we want them to do &#8211; copy the data from (<code>read</code>) or to (<code>write</code>) the buffer cache. System call overhead varies between OSes, but it&#8217;s safe to assume that the whole process takes at least a couple hundred clock cycles in the best case. So for the overhead not to completely dominate the actual work being done, you generally want to be reading or writing at least a few kilobytes at a time. For scale reference, typical IO buffer sizes as of this writing are 4096 bytes (e.g. Visual C++ 2013 <code>FILE*</code>, Go <code>bufio.Reader</code>/<code>bufio.Writer</code>) or 8192 bytes (e.g. GNU libc <code>FILE*</code>, Java <code>BufferedReader</code>/<code>BufferedWriter</code>).</p>
<p>Often there are more buffers too. For example, most hard drives and RAID controllers have their own caches, and it is not uncommon for user-space code to have several layers of buffering for various reasons. But this is enough to illustrate the basic structure.</p>
<p>All of these buffers are used in much the same way for reading and writing. So where&#8217;s the behavioral asymmetry between reading and writing that I&#8217;m talking about? You need to think about the state of the world (so to speak) after you call a <code>read</code>-type call and how it differs from the state of the world after a <code>write</code>-type call.</p>
<h3>What happens when you issue an IO operation</h3>
<p>Let&#8217;s look at what goes into servicing a <code>read</code>-type call first: say you open a C <code>FILE*</code> and want to read the first 100 bytes via <code>fread</code>. The C standard I/O library notices that its buffers are currently empty, and tries to fill them up, issuing a <code>read</code> system call to read say 4k worth of data. The kernel in turn asks the file system where the data for the first 4k of the file is located, checks the buffer cache to see if it already has a copy in memory, and if not, it issues a block read command to the storage device. Either way, the kernel makes sure to get those 4k of data into the buffer cache and from there copies them into the standard IO buffers in user-space memory, then returns. The standard IO library looks at the result of the system call, updates the state of its IO buffers, and then copies the 100 requested bytes into the memory buffer the app supplied.</p>
<p>And what if anything goes wrong? Say the file is smaller than 100 bytes, or there was an error reading from disk, or the file is on a network file system that&#8217;s currently inaccessible. Well, if that happens, we catch it too: if something goes wrong filling up the buffer cache, the kernel notices and returns an appropriate error to the I/O library, which can in turn pass errors on to the app. Either way, anything that can go wrong will go wrong <em>before</em> the <code>fread</code> call returns. All the intervening layers need to do is make sure to keep the error information around so it can be passed to the app at the appropriate time.</p>
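<p>On the read side, checking is correspondingly simple. The following is a small sketch using only standard C; the function name <code>read_exact</code> and the <code>read_status</code> enum are made up for this example. It relies on the fact that a short <code>fread</code> means either end-of-file or an error, and <code>ferror</code> tells you which.</p>

```c
#include <assert.h>
#include <stdio.h>

enum read_status { READ_OK = 0, READ_EOF = 1, READ_ERROR = -1 };

/* Hypothetical helper: read exactly n bytes, or say why that was not
 * possible. Whatever went wrong is known by the time this returns. */
enum read_status read_exact(FILE *f, void *buf, size_t n)
{
    assert(f != NULL && buf != NULL);

    size_t got = fread(buf, 1, n, f);
    if (got == n)
        return READ_OK;

    /* Short read: either the file ended early or an I/O error occurred. */
    return ferror(f) ? READ_ERROR : READ_EOF;
}
```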
<p>Now let&#8217;s go the other way round: let&#8217;s open a fresh new file with a 4k write buffer<a href="#foot1"><sup>[1]</sup></a> and issue a 100-byte <code>fwrite</code>. This time, the IO library copies the 100 bytes from the app buffer to the write buffer&#8230; and immediately returns, reporting success. The underlying <code>write</code> system call will not be executed until either the buffer fills up or is flushed as a result of calling <code>fflush</code>, <code>fseek</code>, <code>fclose</code> or similar.</p>
<p>Quick imaginary show of hands: who reading this habitually checks return codes of <code>fread</code> or <code>fwrite</code> at all? Of those saying &#8220;yes&#8221;, who also remembers to check return codes of <code>fflush</code>, <code>fseek</code> or <code>fclose</code>? Probably not a lot. Well, if you don&#8217;t, you&#8217;re not <em>actually</em> checking whether your writes succeeded at all. And while these remarks are C-specific, this general pattern holds for <em>all</em> buffered writer implementations. Buffered writing delays making the actual <code>write</code> system call; that&#8217;s kind of the point. But it implies that error reporting is delayed too!</p>
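<p>For reference, a sketch of what checking everything looks like with C stdio. The function name <code>write_file_checked</code> is invented for this example; the calls themselves are plain standard C. Note that <code>fclose</code> is called even after an earlier failure, so the <code>FILE*</code> is never leaked.</p>

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write len bytes to path via buffered stdio.
 * Returns 0 only if every step, including the flush and the close,
 * reported success; returns -1 otherwise. */
int write_file_checked(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);

    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    int ok = 1;
    if (fwrite(data, 1, len, f) != len)
        ok = 0;  /* may well "succeed" while data sits in the buffer */
    if (fflush(f) != 0)
        ok = 0;  /* this is where a delayed write error can surface */
    if (fclose(f) != 0)
        ok = 0;  /* fclose flushes too, and can fail as well */

    return ok ? 0 : -1;
}
```

<p>On Linux, pointing this at <code>/dev/full</code> (a device whose writes always fail with &#8220;no space left&#8221;) is a quick way to watch a small <code>fwrite</code> report success and the error show up only at the flush.</p>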
<h3>More buffers</h3>
<p>This type of problem is not restricted to user-space buffering either. The implementation of <code>write</code> itself has similar issues: generally, after a successful <code>write</code> call, your data made it to the buffer cache, but it hasn&#8217;t hit actual storage yet. The kernel will make its best effort to write that data to storage eventually (hopefully within the next few seconds), but if there&#8217;s a device error or a system crash, that data could still be lost. Both of these are relatively rare these days, so we don&#8217;t worry about them too much, right? Except for those of us who do.</p>
<p>Oh, and while <code>write</code> will go to some lengths to make sure there are no nasty surprises when writing to local filesystems (for example, even with delayed write-back, you want to make sure to reserve free space on the disk early<a href="#foot2"><sup>[2]</sup></a>, lest you run out during write-back), at least on POSIX systems there can still be write errors that you only get notified about on <code>close</code>, especially when network filesystems such as NFS or SMB/CIFS are in play (I&#8217;m not aware of any such late-reported error conditions on Windows, but that doesn&#8217;t mean there aren&#8217;t any). Something to be aware of: if you&#8217;re using these system calls and are not checking the return code of <code>close</code>, you might be missing errors.</p>
<p>Which brings up another point: even on local file systems, you only have the guarantee that the data made it to the buffer cache. It hasn&#8217;t necessarily made it to the storage device yet! If you want that (for example, you&#8217;ve just finished writing some important data and want to make sure it actually made it all the way), you need to call <code>fsync</code><a href="#foot3"><sup>[3]</sup></a> on the file descriptor before you close it. The Windows equivalent is <code>FlushFileBuffers</code>.</p>
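<p>At the system call level, the sequence might look like the sketch below. It assumes a POSIX system; the function name <code>write_durably</code> is made up for this example, and the file mode and open flags are just one reasonable choice.</p>

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path, then fsync and close,
 * checking every return value. Returns 0 on success, -1 on any failure. */
int write_durably(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    int ok = 1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);  /* may write fewer bytes than asked */
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted by a signal: retry */
        if (n <= 0) {
            ok = 0;                     /* real error, or no progress */
            break;
        }
        p += n;
        len -= (size_t)n;
    }

    if (ok && fsync(fd) != 0)
        ok = 0;  /* data did not make it from the buffer cache to the device */
    if (close(fd) != 0)
        ok = 0;  /* late-reported errors, e.g. on network file systems */

    return ok ? 0 : -1;
}
```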
<p>So, if you make sure to check error codes on every <code>write</code>, and you <code>fsync</code> before you <code>close</code> (again checking errors), that means that once you&#8217;ve done all that, you&#8217;re safe and the data has successfully made it to permanent storage, right?</p>
<p>Well, two final wrinkles. First, RAID controllers and storage devices themselves have caches too. They&#8217;re supposed to have enough capacitors so that if the system suddenly loses power, they still have sufficient power to actually get your data written safely. Hopefully that&#8217;s actually true. Good luck. Second, the <em>data</em> may have made it to storage, but that doesn&#8217;t necessarily mean it&#8217;s actually visible, because the metadata necessary to reach it might not have been written yet. Quoting the Linux man page on <code>fsync(2)</code>:</p>
<blockquote><p>
Calling <code>fsync()</code> does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit <code>fsync()</code> on a file descriptor for the directory is also needed.
</p></blockquote>
<p>For better or for worse, I can&#8217;t recall ever seeing code doing this in the wild, though. I&#8217;m honestly not sure what the actual guarantees are that popular Linux file systems provide about these things. If you&#8217;re handling really <em>really</em> important data, you should probably find out.</p>
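<p>For what it&#8217;s worth, doing what the man page asks for is not much code. Here is a sketch assuming a POSIX system where directories can be opened read-only and passed to <code>fsync</code> (true on Linux; other systems and some file systems may behave differently). The name <code>fsync_dir</code> is hypothetical.</p>

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations, O_DIRECTORY */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: flush a directory's entries to storage, so that a
 * newly created file in it stays reachable after a crash.
 * Returns 0 on success, -1 on failure. */
int fsync_dir(const char *dir_path)
{
    assert(dir_path != NULL);

    int fd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    int ok = (fsync(fd) == 0);
    if (close(fd) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

<p>The intended use is to call this on the containing directory after the <code>fsync</code> on the file itself has succeeded.</p>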
<h3>Conclusion and summary</h3>
<p>Buffering on the read side is great and pretty much transparent because if anything goes wrong, it will go wrong before you ever get to see the data, and you&#8217;ll get a proper error code.</p>
<p>Buffering on the write side is much trickier because it delays actual writing and error reporting in ways that most programmers are <em>supposed</em> to be aware of, but usually aren&#8217;t. Few are aware of the actual steps necessary to ensure that data made it to storage safely, and some of the OS abstractions involved don&#8217;t exactly make things easier (see the <code>fsync</code> quote above). Here be dragons.</p>
<h3>Footnotes</h3>
<p><a name="foot1"><sup>[1]</sup></a> Full buffering mode, not line buffering, in case anyone&#8217;s feeling nit-picky.<br />
<a name="foot2"><sup>[2]</sup></a> Actual block allocation (as in, selecting the physical location on the device where file writes will end up) is often delayed in modern file systems, to make it easier to assign mostly-contiguous space to large files where possible. However, even with delayed allocation, you want to keep track of how much space is going to be available on the device once all currently pending writes complete, so that you can return &#8220;out of disk space&#8221; errors appropriately instead of discovering that you&#8217;re out of space 10 seconds after the user exited the app he was using to edit and save his Important Document. Because that would be bad. This sounds as though it&#8217;s just a matter of accounting, but it gets tricky with file systems that use extents and not bitmap-based block allocation: getting the last few discontinuous blocks on the device means that you might need extra space to store the file extents! All of which is to say: this stuff is tricky to get right.<br />
<a name="foot3"><sup>[3]</sup></a> Yes, the name looks like it&#8217;s part of the C library buffered IO package, but it&#8217;s a proper syscall.</p>
]]></html></oembed>