Typical Nerd Network?

2016 November 20
by Daniel Lakeland

So, I'm not sure how typical this is for people who actually use computers for work and such, but this is more or less how my home network looks. With my main workstation taking files off the home server via NFS, it seemed like a good idea to split the server and printer off from the Buffalo WiFi "router" (no longer routing, just a fancy AP) and the ATA, so that calls go over a separate cable from any filesystem operations.

Once gigabit fiber WAN hit, I needed to move the routing load from the little Buffalo router to the home server, which is when I added a "Smart Managed" switch. Doing that also lets me prioritize voice packets via DSCP on the local LAN so if I'm doing something like talking on the phone while browsing PDF scans of documents, the traffic to my computer reading images off the server doesn't interfere with my calls.

The home server runs on an Asrock Rack J1900D2Y which is a mini-itx motherboard with a dedicated IPMI port (shown in red). The switch also lets me prioritize IPMI traffic (slightly lower than the voice traffic but higher than default traffic) so that if something goes wrong and I need to reboot the server I can connect without lag. The server is headless and has no keyboard.

I discovered that the IPMI port on these (and also on Supermicro mobos) can fail over to share the first ethernet port on the motherboard. This causes no end of trouble when the IPMI traffic starts appearing on one of the links in the bonded LAG group. The obvious fix would be to force the IPMI to use only the dedicated port, but I haven't yet figured out how to actually make that happen: altering it in the IPMI network settings didn't seem to work, and I think I need to get into the BIOS, but that requires rebooting the server over IPMI, and I keep losing my IPMI connection during boot (so I never see the BIOS screen) because the port tries to fail over at exactly that moment. A reasonable workaround was to statically assign the IPMI MAC address to the switch port where the dedicated IPMI link plugs in. At least then the managed switch doesn't get confused about the LAG group, and the IPMI controller eventually figures out it can't talk to anyone on its failover port, so it sticks to the dedicated one.
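
For what it's worth, you can at least inspect the BMC's current network settings from the host itself with ipmitool (a sketch; it assumes the ipmitool package and an in-band /dev/ipmi0 interface, and channel 1 is typical but board-specific). Forcing dedicated vs. shared/failover mode, though, is an OEM-specific setting I haven't found a portable command for:

    # dump the BMC's LAN configuration for channel 1 (channel number varies by board)
    ipmitool lan print 1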

Yellow links show battery-backed UPS power, so I can still have internet access and phone calls for several tens of minutes to an hour after power goes out. That duration was much longer when the whole thing was routed by the Buffalo device, but it's still a reasonable amount of time since the little Mini-ITX server, switch, and RAID enclosure all use about 8% of the max UPS output, or something like 75 watts.

I'm pretty sure this system could handle a full office for a medium business of around 100 people with just the addition of a couple of POE switches and some desk phones (and a whole lot more floor space!). For a system like that I'd probably move the file server function to a cluster of two servers running glusterfs to handle hardware failures and planned downtime more smoothly.

As it is right now the whole thing works pretty smoothly and gives me a place to store large files and archive lots of photos without having my important files connected directly to my desktop machine, where I run more bleeding-edge kernels and occasionally run large computations that can accidentally fill all the RAM and bring the machine to its knees. (There was a particularly nasty issue plotting some of my recent graphs: I think ggplot was making a full copy of a large data frame once for each plot in a spaghetti plot, leading to 65GB of RAM usage. Moving to data.tables helped a lot, as did re-thinking how the plot was constructed so that it used just one copy of the table.)

How much effort goes into your home network? Considering how many WiFi networks I see where no-one even bothered to set the ESSID to something other than ATTWiFi-992 or whatever, I'm guessing it varies a lot. One thing that seems clear though is that lots of people are frustrated with their home networks as they load on more and more devices. Even a non-tech-savvy family of 2 adults and 2 teens probably has a minimum of 10-15 WiFi devices these days given smartphones, tablets, laptops, a security camera or baby monitor, a game console, Roku/FireTV/Chromecast etc.

The fact that this technology works as well as it does is a testament to how inefficiently our radio spectrum is being used. Think of the whole FM radio and TV bands, and the spectrum in use for business, police, fire, and so forth. Something like 50MHz to 1GHz is, in my opinion, utterly wasted compared to what could be done with modern techniques like frequency hopping spread spectrum or dynamic frequency allocation via negotiation protocols (compare to the way DHCP works for IP addresses).

 

For Dave, the QoS update

2016 November 19
by Daniel Lakeland

I've been using Fireqos for my home network. Switching recently to gigabit fiber required a lot of reconfiguring of my internal network, and in the process I discovered a few things:

  1. Typical consumer level routers from even a few years ago can't even begin to handle a gigabit through their firewall. You need something with an x86 type processor or a very modern ARM based consumer router. My Buffalo router could push about 150Mbps through the firewall at most.
  2. QoS is still important at gigabit speeds. You can push a lot of data into buffers very quickly. Furthermore, keeping things well paced actually lets you go faster, because ACKs make it back to where they need to go promptly.
  3. Don't forget the effect of crappy cables. Replace the random patch cables lying around from old gear with something good. I made my own with a crimp tool and high quality Cat5e, and it cleared up packet loss that may well have been an issue before.
  4. As Dave Taht suggested, switching from pfifo to fq_codel helped for the ssh connection class. I had been thinking of this class as mainly handling keystrokes for interactive ssh sessions, but of course scp and rsync both like to push bulk data over ssh. Because of that, I needed to put an fq_codel qdisc on the ssh class so my keystrokes would still make it through when some rsync was going (see the sketch just after this list).
  5. Too many things have changed at once for me to know whether fq_codel would have any effect on my voip RTP queue, but I suspect not. Every 0.02 seconds it'll send a single udp packet for each call. Each packet is around 1000 bytes, and there are typically 1-4 calls at most. They jump to the front of the line due to the QoS, so the queue is never going to have more than 1 or 2 packets in it. The overhead of fq_codel makes no sense when the queue never gets longer than 3 packets and never takes longer than .00002 seconds to drain. If I have any issues though, I'll revisit.
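
For concreteness, the fq_codel change in item 4 amounts to roughly this in fireqos.conf, following the same class syntax as the full config in the "QoS is hard, but worth it" post further down (the rate and ceiling values are just my numbers, adjust to taste):

    class ssh rate 100kbit ceil 93% qdisc fq_codel
      match dport 22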

 

VOIP update: gigabit fiber cures most of what ails you

2016 November 16
by Daniel Lakeland

Ok, so I had a couple of weeks of continually degrading call quality, and then realized that ATT had just started offering fiber service in my neighborhood, so I jumped on that like a gymnast on a trampoline and dropped my cable connection like a ton of bricks. After a lot of fooling around getting the install done, turning off the cable service, and re-configuring my internal network to handle the now ridiculously fast connection... I can assure you that if you have crappy audio quality over a cable modem, it is most likely your ISP's jittery, lossy junk connection, and no amount of QoS will make their part of it any better.

My impression is that VOIP technology works really hard to mask the shittiness of shitty internet. I use the opus codec, for example, with CSipSimple, and it's a super high quality codec with forward error correction and blablabla. So if your internet jitters and packets arrive late or not at all, it covers up losses of up to several percent in such a way that it just sounds like you're muffled (it can reconstruct the lower-frequency components even with packet losses). Well, it turns out that by the end of my cable modem experience, sure enough, I was seeing something like 2% packet loss, and jitter that would push inter-packet arrival times from the normal ~20 ms up to ~100 ms every second or two. It seemed to coincide with time of day, and I suspect a poor/corroded/temperature-sensitive splitter connection somewhere along the coax, together with the shared nature of the cable in my neighborhood. Monitoring incoming traffic during a call using "fireqos status wan-in" would show a steady 100 kbps incoming for several seconds (when using ulaw) and then 15-30 kbps for a second where a bunch of packets would drop out... and it would do this randomly, on average at around 5 second intervals. No amount of QoS on my end could solve this issue.

Also, testing this issue was hard. Things like the dslreports speed test try hard, but if the ISP is prioritizing icmp/ping packets they can give a very misleading view of packet loss and jitter for the UDP/RTP packets that make up your voice.

Now, the FIBER connection is a whole different story. It includes an IPv6 prefix I can delegate, and I now have a fully routable public IPv6 network with no extra lag/delay. In fact, it exposed a bunch of issues in my network that weren't visible at the lower speeds. I moved the routing load from a consumer Buffalo wifi router to my home server, which has a 4 core Celeron J1900 processor, 16 Gigs of RAM and 3 bonded NICs. It provides an NFS server, an SMB server, a web server, dnsmasq, and FireHOL/FireQOS traffic control. Pushing packets through the QoS and firewall doesn't even hit 1% CPU usage on this machine, whereas the Buffalo router was maxing out its CPU at 150-200Mbps. So now I consistently get over 800 Mbps on a symmetric internet connection. The best part: packet loss is nonexistent, the fiber transceiver has a battery backup, jitter is around 1 or 2 ms under load, and I can dedicate 10Mbps to voice calls without even noticing it; it's roundoff error.
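
For what it's worth, the 3-NIC bond is just the standard Linux bonding setup. A minimal sketch in Debian-style /etc/network/interfaces form (requires the ifenslave package; the interface names and addresses are made up, and the switch needs a matching 802.3ad LAG configured on those ports):

    # hypothetical 802.3ad bond of three ports
    auto bond0
    iface bond0 inet static
        address 10.0.0.2/24
        gateway 10.0.0.1
        bond-slaves eno1 eno2 eno3
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4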

Still, the WiFi part can be tricky, especially with neighbors interfering, but I can assure you gigabit fiber is the way to go 😉

Yes, I still run QOS on the gigabit, because regardless of how big your pipe is, someone will come along and open up a big video buffering task and suck up the bandwidth if you let them. Prioritizing your voice and your ssh keystrokes ahead of all else still makes sense.

Onward to internet the way it was meant to be.

 

Bayesian model of differential gene expression

2016 November 5
by Daniel Lakeland

My wife has been working on a project where she's looking at 3 time-points during a healing process and doing RNA-seq. I'm analyzing the known secreted proteins for differential expression, using a fully Bayesian model in Stan:

[Figure: Red genes have 90% posterior probability of up or down regulation relative to week 0 controls, at least according to the current overnight Stan run.]

Fun, but non-trivial. There are around 6000 genes being considered, and there are 5 or 6 of these parameter vectors... 36000 parameters to be estimated from 60000 data points, and the parameters all have fairly long tails... cauchy priors and so forth. Then I need a couple hundred samples, so Stan is outputting something like 5-10 million numbers. Warmup has been a problem here, and the long tails have required re-parameterizations and tuning. Fun stuff.
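
For anyone curious what "re-parameterization" means here: one standard trick for heavy-tailed priors (not necessarily the exact one used in this model) is to write a Cauchy-distributed parameter in non-centered form,

\beta_i = \mu + \tau \tan\left(\pi\left(u_i - \tfrac{1}{2}\right)\right), \quad u_i \sim \mathrm{Uniform}(0,1)

which is equivalent to \beta_i \sim \mathrm{Cauchy}(\mu,\tau) but lets the sampler work on a bounded, better-behaved space instead of chasing the long tails directly.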

 

More VOIP over WLAN: With low bitrates disabled, decrease the beacon interval for more responsive AP switching

2016 November 4
by Daniel Lakeland

It seems that beacons from your APs are how your wireless client (say a smartphone) decides which access point to connect to. Beacons obviously take up airtime that would otherwise be used for transmitting data, which suggests sending them less often to leave more time for useful stuff. But with your low bitrate modes turned off, beacons get transmitted at 6000 or 12000 kbps and are therefore pretty short compared to how long they'd take if you were still allowing the 1000 kbps modes. So this actually suggests increasing their frequency, so that clients can switch access points more quickly.

Typical settings are something like (in hostapd on OpenWRT)

    option beacon_int '100'

These counts are in 1.024 ms units, which means a beacon is transmitted every 102.4 ms; round off and call it 100ms between beacons. If you're running around your house at 10mph you will travel about 1.5ft in 100ms, which suggests the beacons could be spaced even further apart. But it takes a while to re-associate with a new AP. Typical VOIP packets are sent at 20ms intervals, so if you're dissociated from an AP for, say, 2 beacon intervals, you're going to drop 10 packets, which is a noticeable audio dropout.

Instead, let's drop the beacon interval to 50 and then if you're dissociated for 2 beacon intervals that's 100ms total and you drop about 5 packets, an amount that might be covered by typical jitter buffers in hard/soft phones.
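
On OpenWRT that's a quick change from the shell (assuming the radio section is named radio0, as in the config shown in the next post down):

    uci set wireless.radio0.beacon_int='50'
    uci commit wireless
    wifi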

I literally just ran around my house while talking to a test number I have set up (it records for 30 seconds and then plays back the recording). With beacon=250 I definitely had little drop-outs in the audio; with beacon=50, even running around the house to change position as fast as possible, no noticeable dropouts occurred.

Progress.

 

Disabling WiFi low bitrates for more stable performance

2016 October 31
by Daniel Lakeland

Next step in my FIX MY PHONE campaign is dealing with the WiFi part. Step one was to DSCP-mark my inbound RTP packets with DSCP=48 so that Linux will send them over the highest priority "VO" WMM queue (WiFi Multi Media, a simple priority based QoS system for the radio). The standard for voice is DSCP=46, but the Linux drivers map DSCP=46 to the slightly lower priority VI (video) queue, where I'd be competing with other traffic, especially traffic to my FireTV Stick.

Second step was to disable the lower bitrates. 802.11b came out in something like 1999; every piece of kit in my house supports 802.11n, which came out around 2007, almost 10 years ago. There is really no excuse for using 1Mbps rates except that they work at long distances. Now that I have 3 access points in my house, long range is actually a detriment, and forcing clients to switch between access points as they move around is beneficial (and I like to pace around while talking on the phone... people would complain about that, as my phone probably lowered its data rate to stay associated rather than jumping to the closer AP and keeping a higher rate).

Nevertheless I noticed that in fact lots of sleeping mobile devices will connect at 1Mbps while they sit there waiting to refresh their email connections, etc. At 1Mbps sending a single packet of 1kB of data takes 8 ms. If I'm in the middle of a voice conversation, 8ms is close to half the inter-frame duration of 20ms. Let's not forget that beacons happen something like every 100ms (I set mine to 250ms). Beacons get transmitted at the lowest data rate in the basic set. So, for efficiency and channel access purposes, it makes sense to limit everyone to using higher rates, including sleeping mobile devices.

In OpenWRT the /etc/config/wireless file can be used to force clients onto much higher data rates. I set my minimum to 12 Mbps, so now a 250 byte RTP packet takes on the order of 0.16 milliseconds to send (or less if we're at even higher rates).

Here's a stanza defining the radio on my router

config wifi-device 'radio0'
    option type 'mac80211'
    option path 'pci0000:00/0000:00:11.0'
    list ht_capab 'SHORT-GI-40'
    list ht_capab 'TX-STBC'
    list ht_capab 'RX-STBC1'
    list ht_capab 'DSSS_CCK-40'
    option country 'US'
    option htmode 'HT20'
    option hwmode '11g'
    option txpower '14'
    option channel '1'
    option frag '2200'
    option rts '1000'
    option beacon_int '250'
    option basic_rate '12000 18000 24000 36000 48000 54000'
    option supported_rates '12000 18000 24000 36000 48000 54000'

The last 2 lines determine the basic rates required and allowed by this access point; that rate set is advertised in the beacons, which go out about 4 times a second (beacon_int 250). I'm also turning on rts/cts for packets over 1000 bytes and fragmentation for packets over 2200. I may turn those off; they were attempts to improve reliability given a lot of neighbor interference (fragmenting big packets so they transmit more reliably) and a lot of low-power mobile devices (with rts/cts the AP asks a device to pay attention before transmitting to it, so devices that can't hear each other don't collide as much).
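
If you want to check what rates associated clients actually end up using, something like this works (a sketch; it assumes the AP's wireless interface is named wlan0 and the iw tool is installed):

    iw dev wlan0 station dump | grep -E 'Station|bitrate'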

It makes sense to me that all this should really help. In particular, less time spent futzing around at 1Mbps serving low priority mobile devices refreshing their email server connections, or sending beacons saying "here I am... you can talk to me at 1Mbps if you like" means more clear-air for getting regularly spaced 20ms packets over the airways.

This stuff isn't rocket science, but it's not wiring up electrical outlets either. Understanding the physical limitations of radio spectrum use and making good choices about the way in which multiple connected devices cooperate should take something that "kind of worked" and turn it into "rock solid", I hope.

 

QoS is hard, but worth it... fireqos on openwrt

2016 October 30
by Daniel Lakeland

I finally got tired of the way OpenWRT on my router was handling QoS, and went ahead and installed fireqos on my wireless router. Instructions are here and are mostly complete, except that with current fireqos you also need to upload /sbin/functions.common.sh from the fireqos distribution, not just /sbin/fireqos itself.

The thing I wanted was to fully control my VOIP traffic. The basic principle is that VOIP gets full bandwidth with NO DROPS up to several times what I actually need (worst case, if friends or family are visiting, I have maybe 6 phone calls going at once). Everyone else can wait or tolerate some drops, but don't drop my RTP packets. It gets really annoying to be on the phone with a lawyer or a real-estate person or some other business call and have people say "hey, you're breaking up... you sound like a robot" for a few seconds every few minutes. Cable internet is just like that sometimes.

fireqos has a great purpose-built language for configuring a QoS system; it translates that language into the raw "tc" commands that need to be executed to tell the kernel what to do. It uses HTB, the hierarchical token bucket classful queuing discipline.

There are lots of things on the internet about how these work, but let's just say that fireqos hides some of it from you so you can get on with things a little like a high level programming language hides details like how many registers your CPU has.
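
To give a flavor of what's being hidden, the generated commands look very roughly like the following. This is a simplified hand-written sketch, not fireqos's literal output; eth1, the class ids, and the rates are placeholders echoing the config below.

    # root HTB on the device, unclassified traffic goes to class 1:2
    tc qdisc add dev eth1 root handle 1: htb default 2
    tc class add dev eth1 parent 1: classid 1:1 htb rate 3600kbit
    # the default class gets fq_codel
    tc class add dev eth1 parent 1:1 classid 1:2 htb rate 500kbit ceil 3600kbit
    tc qdisc add dev eth1 parent 1:2 handle 2: fq_codel
    # a voice class with its own rate/ceiling and a short pfifo
    tc class add dev eth1 parent 1:1 classid 1:3 htb rate 1000kbit ceil 2000kbit
    tc qdisc add dev eth1 parent 1:3 handle 3: pfifo limit 30
    # steer DSCP EF (ToS byte 0xb8) packets into the voice class
    tc filter add dev eth1 parent 1: protocol ip prio 1 u32 match ip tos 0xb8 0xfc flowid 1:3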

Here is my fireqos.conf (with some specific IP addresses removed)

DEVICE=eth1
INPUT_SPEED=55000kbit
OUTPUT_SPEED=3600kbit

interface $DEVICE world-in input rate $INPUT_SPEED qdisc pfifo

    class voicertp rate 1000kbit ceil 2000kbit
      match dscp EF
      match dscp CS6
      match udp src <ASTERISK BOX IP HERE>
      match udp src <ASTERISK BOX 2 IP HERE>


    class voipsip rate 600kbit ceil 1500kbit
      match tcp sports 5060,5061
      match udp sports 5060,5061

    class netctrl commit 1% ceil 3%
      match icmp
      match udp sport 53

    class ssh rate 100kbit ceil 93%
      match dport 22

    class default qdisc fq_codel


interface $DEVICE world-out output rate $OUTPUT_SPEED qdisc pfifo

    class voicertp rate 1000kbit ceil 2000kbit
      match dscp EF
      match dscp CS6
      match udp dst <ASTERISK BOX HERE>
      match udp dst <ASTERISK BOX 2 HERE>

    class voipsip rate 100kbit ceil 500kbit
      match tcp dports 5060,5061
      match udp dports 5060,5061

    class netctrl commit 1% ceil 5%
      match icmp
      match udp dport 53


    class ssh rate 100kbit ceil 93%
      match dport 22


    class default qdisc fq_codel

Some things to note about this configuration. I use a default interface qdisc of pfifo and have set my outbound txqueuelen to 300 packets or so. For 250 byte packets that's about 170 ms to drain at 3600kbit, and for 1500 byte packets (the MTU, or maximum transmission unit, size) it'd take about a second. So the max outbound queue delay is about 1 s.
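
Setting the queue length itself is a one-liner (assuming eth1 is the WAN-side device, as in the config above):

    ip link set dev eth1 txqueuelen 300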

Secondly, I'm using the pfifo qdisc for each class except the default class. If I've bothered to classify packets into their own queues, it means they're important, so I want them shoved through the pipe as fast as possible in priority order. For the default class I use fq_codel, which tries to drop packets in a smart way to keep buffers from getting too long. For default traffic this is fine, but I found that the openwrt default qos-scripts used fq_codel everywhere, and when I had congestion I would lose voice packets too, and that was unacceptable.

I go further than this and also DSCP mark packets in a PREROUTING and POSTROUTING mangle table using iptables:

iptables -I PREROUTING 1 -t mangle ! --source 10.79.1.0/24 -j DSCP --set-dscp 0
iptables -I PREROUTING 2 -p udp --source MYASTERISKBOX --source-port 31000:31550 -t mangle -j DSCP --set-dscp 48
iptables -I PREROUTING 3 -p udp --source SECONDASTERISKBOX --source-port 10000:20000 -t mangle -j DSCP --set-dscp 48

Note that I first zero out the DSCP values on anything not coming from my own LAN, because I don't want to trust what comes in. Then I mark udp packets from certain machines and certain ports as DSCP=48 (EDIT: previous value of 46 was not correct), which according to the mac80211 documentation in the Linux kernel means the WMM (WiFi Multi Media) queue used will be the high priority VO queue for voice.
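
A quick sanity check that the marks are actually landing on packets (a sketch; eth1 is a placeholder for whichever interface you want to watch, and 0xc0 is DSCP 48 shifted into the ToS byte):

    # packet/byte counters for the mangle rules
    iptables -t mangle -L PREROUTING -v -n
    # watch for packets carrying DSCP 48 on the wire
    tcpdump -n -i eth1 'ip[1] & 0xfc == 0xc0'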

So, now the question is: how well does all this work? Over the next week I'll try it out and see if anyone complains... I think they will, and I think it will come down to neighbors having WiFi APs set to too-high power output and parked on random strange channels like 3, 4, 7, 8, 9 instead of the non-overlapping 1, 6, 11 scheme most people who understand WiFi use. But at least if my kids are playing Minecraft over my LAN they may not ruin my telephone calls.

 

NFS sync/async, some of the issue solved, or how to set vm.dirty_bytes and vm.dirty_background_bytes

2016 October 9
by Daniel Lakeland

tl;dr version: Set your vm.dirty_bytes and vm.dirty_background_bytes sysctl values to a quantity based on the bandwidth of your ethernet times a reasonable initial delay period, something less than 1 second. This dramatically reduces the initial delay before the kernel starts the disk working, and so keeps a large write from taking much longer than necessary to actually reach disk.

So, the pieces are falling together on this issue. With the server export set to async, the test program returns before everything is actually on disk, so it doesn't measure the full time to transfer everything to disk, only the time to get it across the ethernet wire. The test program is:

dd if=/dev/zero of=./MyTestFileOnNFS bs=1M count=1000 conv=fdatasync

That opens the file, writes 1 Gig of zeros and syncs the file. With async on the server, the sync returns immediately before the disk is finished, so of course you wind up getting about 100MB/s which is the bandwidth of the network transfer (gigE).

However, if you put sync on the server, the fsync() at the end of the dd waits until the disk finishes. How fast can the filesystem get files on the disk? With btrfs RAID 10 and my little USB 3.0 enclosure, the above command gives about 66 MB/s on the server itself without NFS in the mix. So, for the moment, we'll suppose that's the max bandwidth the USB drive controller will handle for RAID 10.

Still, over NFS with server=sync, I was getting around 44MB/s on a good run, which is only about 2/3 of the disk maximum, nowhere near the 66 we'd hope for.

Well, it turns out that the kernel will buffer things up in RAM and only start the disks writing after sufficient buffers are full. There are 4 possible sysctl settings involved:

vm.dirty_ratio
vm.dirty_background_ratio
vm.dirty_bytes
vm.dirty_background_bytes

By default, dirty_ratio = 20 and dirty_background_ratio = 10 (and the *_bytes options are 0 and therefore ignored; you can only use bytes or ratio, not both, and the kernel zeroes out the other one when you set one of these options). The way the defaults work is that when 10% of the physical RAM is full of dirty data, the kernel starts flushing to disk in the background, and when 20% is full it starts throttling programs that are writing and flushing to disk aggressively.

Well setting these as a fraction of RAM is just a bad idea, especially as RAM prices decline and RAM on a server can easily be larger than even largish files. So first off go into /etc/sysctl.conf and put in lines like this:

vm.dirty_bytes =
vm.dirty_background_bytes =

But the question is, what to put on the right hand side of the =, how big to make these buffers?

The right way to do this is to think about how much time you're willing to delay starting the flush to disk when a large write comes in. If, for example, you have dirty_background_ratio = 10 and you've got 32 gigs of RAM, then 3.2 GB of data have to come over the wire before the kernel will wake up and flush the disk... which at gigE speeds takes roughly 26 seconds to arrive, so in practice the kernel's 30 second writeback timeout kicks in around the same point anyway. Either way, if transferring a file takes less than about 30 seconds, the kernel won't even start putting the file on disk before the transfer completes over the wire. If the disk writes a LOT faster than the wire, this doesn't cause a big problem. But when the disk writes at about the same speed as the wire... you're basically taking twice as long as if you'd just started writing to the disk right away.

So, instead, think about the quantity bandwidth*delay. For gigE the bandwidth is 1024*1024*1024/8 bytes/s, and a reasonable delay before you wake up and do something about it is probably less than a second; I chose 0.25 seconds. And if a buffer builds up representing about 1.5 seconds of continuously received data, you want to flush that disk aggressively. The result is dirty_bytes = 1024^3/8 * 1.5 and dirty_background_bytes = 1024^3/8 * 0.25:

vm.dirty_bytes = 201326592
vm.dirty_background_bytes = 33554432
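
For reference, the same arithmetic as a quick shell sketch (it just prints the values; GbE assumed):

    BW_BYTES=$(( 1024 * 1024 * 1024 / 8 ))                   # ~1 Gbit/s in bytes per second
    echo "vm.dirty_background_bytes = $(( BW_BYTES / 4 ))"   # 0.25 s worth of data
    echo "vm.dirty_bytes = $(( BW_BYTES * 3 / 2 ))"          # 1.5 s worth of data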

After putting this in /etc/sysctl.conf and doing "sysctl -p" to actually set it in the kernel on both the client and the server, the dd bandwidth for 1000 MB was about 56MB/s instead of 44 MB/s which is now about 85% of the filesystem's local bandwidth. Not bad. There are probably other issues I can tune but at least we're not sitting around twiddling thumbs for 5 to 30 seconds before servicing large writes.

So, rather than my earlier explanation, in which I assumed a whole bunch of very short delays accumulated as each 1MB block was synced to disk, what was actually going on was a long initial wait being added to the total transfer time: the kernel was happy to just put my data into RAM and sit on it until 30 seconds passed, or 1.6 GB of dirty data accumulated, or the btrfs commit interval (which I have set to 5 seconds) came due. Since I was only transferring 1GB anyway, a lot of delay occurred before writing even started, thereby increasing the total time to get things on disk.

There are quite a few articles online telling you that setting dirty_bytes and dirty_background_bytes is a good idea, but few of them tell you how to think about the selection of the buffer sizes. They tend to say things like "try setting it to 100MB and see how that works..." Well, by thinking in terms of the bandwidth*delay product you can make a principled guess at a good size. If you have 10GigE, or are bonding 3 GigE ports, or just have old 100 Mbps ethernet, you'll need completely different quantities, but it will still be a good idea to pick a reasonable delay and start tuning with your bandwidth*delay number.

 

NFS sync/async still a mystery

2016 October 9
by Daniel Lakeland

It turns out that the "sync on writing each block" explanation from previous posts doesn't work. Specifically, from an email from someone in the know:

No, what you're missing is that neither client nor server actually
performs writes completely synchronously by default.

The client returns from the "write" systemcall without waiting for the
writes to complete on the server.  It also does not wait for one WRITE
call to the server to complete before sending the next; instead it keeps
multiple WRITEs in flight.

An NFS version >=3 server, even in the "sync" case, also does not wait
for data to hit disk before replying to the WRITE systemcall, allowing
it to keep multiple IOs in flight to the disk simultaneously (and
hopefully also allowing it to coalesce contiguous writes to minimize
seeks).

So, at least with the default "sync" option on export, it seems like the NFS server is supposed to do the right thing, namely batch stuff up in the FS disk cache, and wait for a request for COMMIT to flush to disk. But, this still doesn't explain the massive differences in throughput I'm seeing for sync vs async... so, it remains to be seen if I can find the explanation, and hopefully get the info into the right hands to "fix" this. A short packet capture with "sync" vs "async" also showed that there didn't seem to be any difference in terms of pauses between WRITE and the acknowledgement from the server. (for short files, I wasn't prepared to packet-capture a full gigabyte file transfer).

It seems like the main difference between sync and async on the server is whether a COMMIT request returns immediately or actually waits for things to hit the disk. So you really do want "sync" on your server, because otherwise the server lies about the stability of your file. If this is the only difference, then I really can't explain what's up. Doing a proper test on the raw filesystem locally on the server shows it's capable of about 60-65MB/s, due to needing to do btrfs RAID10 operations down a USB 3.0 connection. Testing an individual drive in the cabinet with hdparm gives 100MB/s. So perhaps the reduction from 60-65 down to 45 MB/s (70-75% of raw disk efficiency) is all you can expect given the NFS stuff in between and two different computers with different amounts of RAM, different processors, and so forth.
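
(The per-drive figure comes from an hdparm timing test, something like the following; the device name is a placeholder.)

    hdparm -t /dev/sdX    # buffered sequential read timing for a single drive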

I'd be interested to see if I can increase the speed of throughput to the BTRFS volumes down the USB 3 link. I suspect the little Orico hot-swap cabinet is the bottleneck there. Later, when a larger chassis becomes available I may put the server in a regular mid-tower and put all the disks directly on SATA, leaving the hot-swap for replacing broken disks or later expansion.

NFS network efficiency sync vs async

2016 October 8
by Daniel Lakeland

So, in my previous post I discussed sync vs async operation in an NFS server (not client), and there was a bit of math there that I glossed over. It's worth developing. Suppose that B_w is the network bandwidth, B_d is the bandwidth of continuous writes to disk, t_s = (60/RPM)/2 = 30/RPM is the average rotational delay (half a rotation of the disk), T_c is the time between commits on the underlying filesystem, and S_b is the block size of the NFS network transfer. Then, with sync on the export, writing each block puts S_b bytes on the disk in a time T_s = S_b/B_w + t_s, so the bandwidth of transfer to disk as a fraction of the network bandwidth (the efficiency) is:

\frac{\frac{S_b}{\frac{S_b}{B_w} + \frac{30}{RPM}}}{B_w}

Which after some algebra can be expressed as:

\frac{1}{1 + \frac{30 B_w}{S_b RPM}}

So, as the bandwidth of the network increases, efficiency decreases, and this can only be compensated by increasing the block size of the transfer. In the end, to get high efficiency, it must take much longer than a rotation of the disk to transfer a block. That suggests something like a 500 MB block size for gigabit ethernet! Hmm...
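
To put a rough number on that (using B_w ~ 125 MB/s for gigE and the same 5600 RPM figure as the example at the end of this post): keeping the rotational term below 1%, i.e. efficiency around 99%, already requires

S_b \geq \frac{30 B_w}{0.01 \times RPM} = \frac{30 \times 125 \mathrm{\,MB/s}}{0.01 \times 5600} \approx 67 \mathrm{\,MB}

and pushing it below 0.1% takes roughly ten times that, which is where the hundreds-of-megabytes ballpark comes from.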

Alternatively, we can do async transfer. Then, the time to transfer the data D across the wire is D/B_w and the time to write it to disk is more or less 1 seek, the time to write the data to the disk in a large block, and then maybe another seek. Efficiency is then something like:

\frac{D}{\mathrm{max}(D, \frac{DB_{we}}{B_{wd}}) + \frac{30B_{we}}{RPM}}

Where now B_{we} is the ethernet bandwidth and B_{wd} is the disk bandwidth. Assuming disks are similar bandwidth to gigabit ethernet, we can simplify to

\frac{1}{1+\frac{30B_{we}}{D\times RPM}}

In other words, it's just like the first case, except instead of the 1 MB network block size, the efficiency depends on D the full size of the file. I suppose we can get a little more fancy but as a first approximation, clearly the scaling of the "async" option gets us MUCH closer to 100% efficiency with large file transfers. The big problem is that if someone wants to ensure that things are really written to disk, NFS seems to treat explicit fsync() as a no-op when async is on.

Regular local filesystems like ext4 already buffer writes and then flush them periodically in large batches. So having NFS insist on flushing 1MB writes is basically like mounting your ext4 filesystem with the "sync" option, something NO-ONE does.

But consider back in the bad old days when NFS was designed. Look at the formula for efficiency under the "sync" condition, and put in B_w = 1 Mbps coaxial ethernet, S_b = 8kB, and a 5600 RPM disk:

\frac{1}{1 + \frac{30 \times 1000/8}{8\times 5600}} = 0.92

In the bad old days, no one noticed an efficiency problem, because it took so long to transfer 8kB at 1 Mbps that an extra half-rotation of the platters wasn't a factor.