NFS sync/async, some of the issue solved, or how to set vm.dirty_bytes and vm.dirty_background_bytes

2016 October 9
by Daniel Lakeland

tl;dr version: Set your vm.dirty_bytes and vm.dirty_background_bytes sysctl values based on the bandwidth of your ethernet times a reasonable initial delay, something less than 1 second. This dramatically reduces the initial delay before the disk starts working, so buffering in RAM no longer inflates the total time to flush a file to disk.

So, the pieces are falling together on this issue. With the server export set to async, the test program returns before everything is actually on disk. So it doesn't measure the full time to transfer everything to disk, only the time to get it across the ethernet wire. The test program is:

dd if=/dev/zero of=./MyTestFileOnNFS bs=1M count=1000 conv=fdatasync

That opens the file, writes 1 Gig of zeros and syncs the file. With async on the server, the sync returns immediately before the disk is finished, so of course you wind up getting about 100MB/s which is the bandwidth of the network transfer (gigE).

However, if you put sync on the server, the fdatasync() at the end of the dd waits until the disk finishes. How fast can the filesystem get files on the disk? With btrfs RAID 10 and my little USB 3.0 enclosure, the above command gives about 66 MB/s on the server itself without NFS in the mix. So, for the moment, we'll suppose that's the max bandwidth the USB drive controller will handle for RAID 10.

Still, over NFS with server=sync, I was getting around 44MB/s on a good run, which is only about 2/3 of the disk maximum, nowhere near the 66 we'd hope for.

Well, it turns out that the kernel will buffer things up in RAM and only start the disks writing after sufficient buffers are full. There are 4 possible sysctl settings involved:

vm.dirty_ratio
vm.dirty_background_ratio
vm.dirty_bytes
vm.dirty_background_bytes

By default, dirty_ratio = 20 and dirty_background_ratio = 10, and the *bytes options are 0 and therefore ignored. You can use bytes or ratio but not both: the kernel zeroes out the other one when you set one of these options. The way the defaults work is that when dirty data fills 10% of RAM (strictly, of available memory rather than total physical RAM), the kernel starts flushing to disk in the background, and when it reaches 20% the kernel starts throttling the programs that are writing and flushes to disk aggressively.
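Before changing anything, it helps to see which of the four settings are actually in effect. Here is a minimal sketch that reads them from /proc/sys/vm. The function name show_dirty_settings and its optional directory argument are my own, purely for illustration.

```shell
# Hypothetical helper: print the four writeback thresholds.
# A 0 in a *_bytes setting means the matching *_ratio setting is the one in use,
# and vice versa.
show_dirty_settings() {
    dir="${1:-/proc/sys/vm}"   # optional directory argument, handy for dry runs
    for knob in dirty_ratio dirty_background_ratio dirty_bytes dirty_background_bytes; do
        printf 'vm.%s = %s\n' "$knob" "$(cat "$dir/$knob")"
    done
}

# Only query the live kernel if the files exist (i.e. on Linux).
if [ -r /proc/sys/vm/dirty_ratio ]; then
    show_dirty_settings
fi
```

Running "sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes" should show the same information.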

Well setting these as a fraction of RAM is just a bad idea, especially as RAM prices decline and RAM on a server can easily be larger than even largish files. So first off go into /etc/sysctl.conf and put in lines like this:

vm.dirty_bytes =
vm.dirty_background_bytes =

But the question is, what to put on the right hand side of the =, how big to make these buffers?

The right way to do this is to think about how much time you're willing to delay starting the flush to disk if a large write comes in. If, for example, you have dirty_background_ratio = 10, and you've got 32 gigs of RAM, then 3.2 GB of data have to come over the wire before the kernel will wake up and flush the disk.... which at gigE speeds is about 26 seconds, nearly as long as the kernel's 30 second timeout for old dirty data. So if transferring a file takes less than that, the kernel won't even start putting the file on disk before the xfer completes over the wire. If the disk writes a LOT faster than the wire, this doesn't cause a big problem. But when the disk writes at about the same speed as the wire... you're basically taking twice as long as if you'd just started writing to the disk right away.
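To check this kind of estimate for your own machine, here is a rough sketch that computes how many seconds of incoming data it takes to fill a given threshold. The helper name fill_seconds is hypothetical, and the link speed is taken as 1024^3 bits/s, an approximation of gigE.

```shell
# Hypothetical helper: seconds of incoming data needed to fill a buffer.
# $1 = buffer size in bytes, $2 = link speed in bits per second
fill_seconds() {
    awk -v bytes="$1" -v bps="$2" 'BEGIN { printf "%.1f\n", bytes * 8 / bps }'
}

ram_bytes=$(( 32 * 1024 * 1024 * 1024 ))   # 32 GiB of RAM
threshold=$(( ram_bytes / 10 ))            # dirty_background_ratio = 10
fill_seconds "$threshold" $(( 1024 * 1024 * 1024 ))
```

Plug in your own RAM size and link speed to see how long the kernel would wait before it starts the background flush.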

So, instead, think about the quantity bandwidth*delay. For gigE the bandwidth is about 1024*1024*1024/8 bytes/s (nominally 10^9/8, but the power-of-two figure is close enough for sizing a buffer), and a reasonable delay before you wake up to do something about it is probably less than a second. I put 0.25 seconds. And if a buffer builds up representing about 1.5 seconds of continuously receiving data, you want to flush that to disk aggressively. The result is dirty_bytes = 1024^3/8 * 1.5 and dirty_background_bytes = 1024^3/8 * 0.25:

vm.dirty_bytes = 201326592
vm.dirty_background_bytes = 33554432
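The arithmetic behind those two numbers can be sketched in a few lines of shell. The helper name bdp_bytes is made up for this example, and the delay is passed in milliseconds only so that integer arithmetic is enough.

```shell
# Hypothetical helper: bandwidth*delay product in bytes.
# $1 = link speed in bits per second, $2 = delay in milliseconds
bdp_bytes() {
    echo $(( $1 / 8 * $2 / 1000 ))
}

gige=$(( 1024 * 1024 * 1024 ))   # gigE approximated as 1024^3 bits/s

echo "vm.dirty_bytes = $(bdp_bytes "$gige" 1500)"             # 1.5 s of data
echo "vm.dirty_background_bytes = $(bdp_bytes "$gige" 250)"   # 0.25 s of data
```

This prints the same two lines as above, so the output can be pasted straight into /etc/sysctl.conf.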

After putting this in /etc/sysctl.conf and doing "sysctl -p" to actually set it in the kernel on both the client and the server, the dd bandwidth for 1000 MB was about 56MB/s instead of 44 MB/s which is now about 85% of the filesystem's local bandwidth. Not bad. There are probably other issues I can tune but at least we're not sitting around twiddling thumbs for 5 to 30 seconds before servicing large writes.

So my earlier explanation, in which I assumed a whole bunch of very short delays accumulated as each 1MB block was synced to disk, was wrong. Instead, a long initial wait, up to 5 seconds (the btrfs commit interval that I set), was being added into the total transfer time, because the kernel was happy to just put my data into RAM and sit on it until 30 seconds had passed, 1.6 GB had accumulated, or the btrfs 5 second commit interval expired. Since I was only transferring 1GB anyway, it makes sense that a lot of delay occurred before writing started, thereby increasing the total time to get things on disk.

There are quite a few articles online telling you that setting dirty_bytes and dirty_background_bytes is a good idea, but few of them tell you how to think about the selection of the buffer sizes. They tend to say things like "try setting it to 100MB and see how that works...". By thinking in terms of the bandwidth*delay product you can make a principled guess as to what would be a good size. If you have 10GigE or are bonding 3 GigE ports or just have old 100 Mbps ethernet or whatever, you'll need to choose completely different quantities, but it will still be a good idea to pick a reasonable delay, and start tuning with your bandwidth * delay number.
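As a starting point for those other links, here is a sketch that prints candidate values for a link speed given in Mbit/s, using the same 0.25 and 1.5 second delays. The helper name suggest_dirty is hypothetical, the speeds are nominal decimal rates, and the 3000 Mbit/s line assumes a 3-port bond really delivers three times the throughput to a single client, which depends on the bonding mode.

```shell
# Hypothetical helper: starting values for a link speed given in Mbit/s.
# Delays: 1.5 s of data before throttling, 0.25 s before background flushing.
suggest_dirty() {
    bytes_per_sec=$(( $1 * 1000000 / 8 ))
    echo "vm.dirty_bytes = $(( bytes_per_sec * 1500 / 1000 ))"
    echo "vm.dirty_background_bytes = $(( bytes_per_sec * 250 / 1000 ))"
}

# 100 Mbps ethernet, 3 bonded gigE ports (assumed ideal), 10GigE
for mbit in 100 3000 10000; do
    echo "# ${mbit} Mbit/s"
    suggest_dirty "$mbit"
done
```

Treat the output as a first guess to tune from, not a final answer.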

 
