NFS sync/async still a mystery

2016 October 9
by Daniel Lakeland

It turns out that the "sync on writing each block" explanation from previous posts doesn't hold up. Specifically, quoting an email from someone in the know:

No, what you're missing is that neither client nor server actually performs writes completely synchronously by default.

The client returns from the "write" system call without waiting for the writes to complete on the server. It also does not wait for one WRITE call to the server to complete before sending the next; instead it keeps multiple WRITEs in flight.

An NFS version >= 3 server, even in the "sync" case, also does not wait for data to hit disk before replying to a WRITE call, allowing it to keep multiple IOs in flight to the disk simultaneously (and hopefully also allowing it to coalesce contiguous writes to minimize seeks).

So, at least with the default "sync" option on the export, the NFS server is supposed to do the right thing: batch writes up in the filesystem cache and flush to disk only when the client requests a COMMIT. But this still doesn't explain the massive throughput differences I'm seeing between sync and async, so it remains to be seen whether I can find the explanation and get the information into the right hands to "fix" this. A short packet capture with "sync" vs "async" also showed no apparent difference in the pauses between a WRITE and the server's acknowledgement (for short files, anyway; I wasn't prepared to packet-capture a full gigabyte transfer).
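
For reference, the server-side behavior in question is set per-export in /etc/exports. A minimal sketch (the path and subnet are placeholders, not my actual setup):

    # "sync" makes the server delay its COMMIT reply until data is on stable
    # storage; "async" acknowledges immediately, so a server crash can lose
    # data the client believes is safely on disk.
    /srv/media  192.168.1.0/24(rw,sync,no_subtree_check)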

It seems like the main difference between sync and async on the server is whether a COMMIT request returns immediately or actually waits for the data to hit the disk. So you really do want "sync" on your server, because otherwise the server lies to clients about the stability of their data. If that is the only difference, though, then I really can't explain what's going on. A proper test on the raw filesystem locally on the server shows it's capable of about 60-65 MB/s, limited by btrfs RAID10 operations going down a USB 3.0 connection; testing an individual drive in the cabinet with hdparm gives 100 MB/s. So perhaps the drop from 60-65 MB/s to 45 MB/s over NFS (roughly 70-75% of raw filesystem throughput) is all you can expect, given the NFS layer in between and two different computers with different amounts of RAM, different processors, and so forth.
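
As a sanity check on that local 60-65 MB/s number, here's a rough Python sketch of the two write disciplines in question, run directly on the server's filesystem. The path and sizes are arbitrary, and this is only loosely analogous to NFS's UNSTABLE-WRITE-plus-COMMIT versus per-write-sync behavior; it is not an NFS test itself:

    import os
    import time

    PATH = "/tmp/writetest.bin"  # arbitrary; point this at the btrfs volume to test it
    BLOCK = b"\0" * (1 << 20)    # 1 MiB per write
    COUNT = 1024                 # 1 GiB total

    def bench(flags, commit_at_end):
        """Write COUNT blocks with the given open() flags, return MB/s."""
        fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | flags, 0o644)
        t0 = time.perf_counter()
        for _ in range(COUNT):
            os.write(fd, BLOCK)
        if commit_at_end:
            os.fsync(fd)  # one big flush at the end, like a single COMMIT
        elapsed = time.perf_counter() - t0
        os.close(fd)
        os.unlink(PATH)
        return COUNT * len(BLOCK) / elapsed / 1e6

    # Buffered writes plus one fsync: roughly UNSTABLE WRITEs + COMMIT.
    print("buffered + fsync: %.1f MB/s" % bench(0, True))
    # O_SYNC forces each write to stable storage before returning.
    print("O_SYNC per write: %.1f MB/s" % bench(os.O_SYNC, False))

If the second number came out dramatically lower than the first, forced per-write syncing somewhere in the stack would be a plausible culprit; if they're similar, the mystery stays with NFS.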

I'd be interested to see whether I can increase throughput to the btrfs volumes over the USB 3 link; I suspect the little Orico hot-swap cabinet is the bottleneck there. Later, when a larger chassis becomes available, I may put the server in a regular mid-tower and attach all the disks directly via SATA, leaving the hot-swap cabinet for replacing failed disks or for later expansion.
