Inbound QoS with virtual ethernet and policy routing

2017 December 11
by Daniel Lakeland

Starting in mid-October I began working on many projects involving computer networking. When you run a fairly complicated set of computing services, you need to do maintenance on them, and inevitably something is wrong with some of them that needs fixing. But that takes concentrated time, so you put it off...

Anyway, I made a bunch of things better, including my SIP-based VoIP system and the internal computer network in my house, which is now IPv6-only except for certain legacy devices. IPv6 solves many of the problems that SIP has with NAT, so this was an overall win. Along the way I spent some time updating my routers and got involved with the LEDE project (a fork of OpenWRT). In the process I started helping someone on the forum understand QoS, and learned something about setting up QoS myself.

The problem:

Inbound QoS is normally considered "impossible" because once you have a packet you can't keep it from being sent to you... and so the best thing to do is just forward it as quickly as possible. This is not true. In fact it relies on a mistaken idea of what QoS is all about. QoS is about increasing the "utility" of your network by selectively slowing down things that you care less about so that things that you care more about go faster than they would if you didn't slow down those unimportant things.

The "value" of a stream of packets is related to how important they are to you, not how many of them you get per second. If a download you plan to run all night while you sleep is slowed down so your voice conversation is crystal clear, even though you're passing fewer packets over the wire per second, your quality of service increases.

Inbound QoS in Linux:

In Linux, the receive queue has far fewer features than the send queue on an interface. Also, the receive queue has to make decisions before Linux has sent the packet through iptables and had a chance to use marking or DSCP setting iptables rules (such as in the mangle/PREROUTING table). The logical thing to do with inbound QoS is to put it in the router that routes to the wide internet, and run the packets through the iptables and then put the packets into the LAN output queue with appropriate tags and all the capabilities of the output queue. The problem comes when you have a router with several interfaces. For example maybe it has two separate LANs or it has 2 wifi interfaces and an ethernet interface all bridged together. You can't control the total bandwidth by setting limits on the individual interfaces. You want to control the total bandwidth though, because this is what comes in your WAN pipe.

Solution:

Suppose you have a bridged LAN with 3 physical interfaces, such as you might on a LEDE/OpenWRT wifi router. Let's call the interfaces wlan0, wlan1, and eth0 and they're bridged into a bridge called br-lan. Let's suppose there's also an interface "wan" where packets come in from the internet.

We can force packets received on wan to be routed to a virtual ethernet pair veth0 and veth1. This is based on where they're received from, and so is part of Linux policy routing. The way a veth works is that anything sent to one of them immediately is received by the other as if they were a pair of ports on your computer with an ethernet patch cable between them. To set this up we can do something like:

ip link add veth0 type veth peer name veth1 ## create the pair with explicit names
ip link set veth0 up ## both ends must be up before traffic will flow
ip link set veth1 up
ip link set veth1 promisc on ## might be unnecessary
ip link set veth1 master br-lan ## bridge veth1 into the lan

ip route add default dev veth0 table 100 ## send everything that uses table 100 to veth0
ip rule add iif wan table 100 priority 100 ## if it comes in wan interface use table 100 for routing

Now packets that come in wan go through the iptables, where we can have -j DSCP rules that tag them with DSCP values describing their importance. Then the packets hit the routing code, which looks up the rule at priority 100, sees that it applies, and uses table 100 to route the packets. Table 100 says to send everything down veth0 towards veth1. Since the packets have been through the iptables before routing, by the time they hit veth0 we can, for example, use fireqos to set up the output of veth0 with queues that shape the traffic. In particular, those queues delay and/or drop the traffic we care less about, which carries less important DSCP tags.
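To make the idea concrete, here's a toy sketch in Python of what "serve the important DSCP classes first" means for one scheduling interval. This is not how fireqos or the kernel queueing disciplines work internally. The function name shape, the byte budget, and the choice of DSCP 46 and 48 as "high priority" are all assumptions for illustration.

```python
from collections import deque

# Toy strict-priority shaper for a single scheduling interval.
# Assumption: DSCP 48 (CS6) and 46 (EF) are the "important" tags.
HIGH_DSCP = {48, 46}

def shape(packets, budget_bytes):
    """packets: list of (name, dscp, size_bytes) in arrival order.
    Returns (sent, delayed) for one interval with the given byte budget."""
    high = deque(p for p in packets if p[1] in HIGH_DSCP)
    low = deque(p for p in packets if p[1] not in HIGH_DSCP)
    sent = []
    for queue in (high, low):              # drain the high-priority queue first
        while queue and queue[0][2] <= budget_bytes:
            pkt = queue.popleft()
            budget_bytes -= pkt[2]
            sent.append(pkt)
    delayed = list(high) + list(low)       # whatever did not fit waits (or gets dropped)
    return sent, delayed

arrivals = [("bulk1", 0, 1500), ("rtp1", 48, 200), ("bulk2", 0, 1500), ("rtp2", 48, 200)]
sent, delayed = shape(arrivals, budget_bytes=2000)
print([p[0] for p in sent])      # voice packets go out ahead of the bulk download
print([p[0] for p in delayed])   # the bulk packet that did not fit is held back
```

The held-back bulk packet is exactly the delay that, a moment later, turns into slower ACKs and a slower sender.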

There are some tricks here though. The bridge itself will send packets through the iptables again if we have certain sysctls enabled. This may filter the traffic so that it doesn't go from veth1 into the bridge. If you don't want that, you need the following in /etc/sysctl.conf:

net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0

Or if you do want it, you need to set up your iptables to allow the appropriate traffic.

The delay/dropping of packets that are low priority is critical. We may not be able to stop people from sending us packets we've already received, but because of the TCP window and the ACK feedback, a short time in the future, if we haven't sent back ACK packets, they will slow down their sending rate. So with TCP in the mix, shaping your inbound packet stream results in feedback going back to the sender in the form of delays in acknowledgement, and then slowdowns in their send rate. Explicit congestion notification can also help here. So it's not good to just forward your packets as fast as possible on inbound... you want to give that feedback in the form of delayed ACKs to tell your other party to slow down and open up the channel because you're using some of it for something else.

The result is that the very beginning of your voice call might have a little garbled audio, but after say 1 second the delays on your download caused at the veth0 output queue result in your download partner throttling the download, and then your audio packets can come in and will be fine for the rest of the call. Similarly for latency-sensitive games, provided you're tagging appropriately.

Tagging your high priority traffic with DSCP tags as it comes in your router is also a good idea because downstream inside your LAN you may have managed switches or wifi access points which have WMM queues, and they will interpret the higher priority DSCP tags to mean they should provide basic QoS as well. (Note that all 802.11n or later access points, including 802.11ac, use WMM for prioritizing different traffic; this also changes the radio-collision algorithms used.)

Because of the way Linux wifi drivers work, it's a good idea to use DSCP tag CS6 = decimal 48 on your highest priority traffic, such as voice RTP traffic. This isn't the standard; the standard for voice seems to be EF = decimal 46, but this doesn't in general cause Linux drivers to use the WMM Voice queue. So setting up DSCP 48 on your softphones and retagging to 48 as packets come in from the internet is probably best.
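Here's a small Python sketch of why 48 lands in the Voice queue and 46 doesn't. It assumes the driver takes the top three bits of the DSCP value as the 802.1D user priority and then applies the usual WMM priority-to-queue table; the names WMM_QUEUE, tos_byte and wmm_queue are mine. Check your own driver before relying on this.

```python
# Assumed mapping from 802.1D user priority to WMM access category.
WMM_QUEUE = {
    1: "Background", 2: "Background",
    0: "Best Effort", 3: "Best Effort",
    4: "Video", 5: "Video",
    6: "Voice", 7: "Voice",
}

def tos_byte(dscp):
    # DSCP occupies the upper six bits of the TOS / Traffic Class byte
    return dscp << 2

def wmm_queue(dscp):
    # Assumption: user priority = top three bits of the six-bit DSCP value
    return WMM_QUEUE[dscp >> 3]

for name, dscp in [("EF", 46), ("CS6", 48)]:
    print(name, dscp, hex(tos_byte(dscp)), wmm_queue(dscp))
```

Under that assumption EF (46) has user priority 5 and goes to the Video queue, while CS6 (48) has user priority 6 and goes to Voice.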

Hopefully this helps some people. I'll be linking this post to FireQOS issues so people can find it that way.

 

Something interesting happened last May in ipv6 land

2017 November 8
by Daniel Lakeland

Here's a chart from Akamai:

Akamai ipv6 in the US

I think this might have had to do with Charter / Spectrum or whatever they are called now after all the mergers. They enabled native IPv6 on their home cable internet services some time around there.

What this tells me is that more than 50% of the US traffic will be IPv6 by some time in 2018, and that's a good thing, because it means we'll see a rapid abandonment of ipv4 as it becomes a minority share.

*rubs hands together and cackles maniacally*

 

IPv6 only networks: experiences

2017 November 6
by Daniel Lakeland

T-Mobile has been IPv6-only on their mobile network, at least for Android phones, for quite some time now, a year or more. Apple required that all apps in their app store be IPv6-only ready by June 2016 or so... the IPv6 transition is in full swing right now. Here's the traffic graph from Google:

IPv6 traffic at Google globally as of 2017-11-06

I use SIP based phones, and to tell you the truth, they have a lot of problems... most of which are caused by the non-end-to-end nature of modern ipv4. NAT means that if a SIP server needs to open a connection and tell your phone what to do, such as that the other end is putting the call on hold, you are screwed. Instead, your phone has to keep a connection open to the server at all times, and so even a half-second glitch in your wifi can reset TCP connections and lead to dropped calls. (this is particularly true for SIP over TCP or more importantly SIP over TLS... who wants coffee-shop snoopers to record phone calls? Well it's easy with wireshark and an unencrypted SIP call so use TLS + SRTP).

Anyway, other things don't work that well over ipv4 either: ever wanted to reach your home computer from a remote location and pull a file off of it? It's basically impossible. Similarly for screen-sharing/remote control. There are tons of hacks that kinda make this work, but they're insecure and usually involve giving a totally unknown third party control over your computers (some cloud provider) or port-forwarding, and a lot of other hackishness. Third parties employ people, some of whom get disgruntled. Some of whom stalk their disgruntled exes online... Think I'm joking? I'm not. Read the news:  so "fixing" the ipv4 brokenness by buying a "cloud service" that lets you connect to your home network through some proxy server run by Google/Samsung/CloudsRUs or whatever is a terrible idea.

So, it makes lots of sense to want to eliminate IPv4 as soon as possible. For one thing, it makes the internet work again. For another, it's much easier to administer. For a third, it means you don't have two configurations to manage. Finally, it just supports more devices on your network in a trivially easy way, and there are lots of devices these days.

In order to do this, you need to map the entire IPv4 internet into a tiny little corner of the IPv6 internet, and have a translator. This is called NAT64 and DNS64. Google provides a DNS64 service:  so that takes care of that, you just need to advertise their servers, or have a caching server that uses them as upstream. And, to do the protocol mapping, you need Tayga on your router:  problem solved. This is the technique that T-Mobile uses to make things relatively transparent for your phone.
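The "tiny little corner" is just a /96 prefix with the 32-bit IPv4 address stuffed into the low bits. Here's a sketch in Python, assuming the well-known prefix 64:ff9b::/96 (Tayga can be configured with a different one); to_nat64 and from_nat64 are names I made up.

```python
import ipaddress

# Assumption: the well-known NAT64 prefix; your translator may use another /96.
NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")

def to_nat64(ipv4_text, prefix=NAT64_PREFIX):
    """Embed an IPv4 address in the low 32 bits of the NAT64 prefix."""
    v4 = ipaddress.IPv4Address(ipv4_text)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))

def from_nat64(ipv6_text, prefix=NAT64_PREFIX):
    """Recover the IPv4 address from a synthesized NAT64 address."""
    v6 = ipaddress.IPv6Address(ipv6_text)
    if v6 not in prefix:
        raise ValueError(f"{v6} is not inside {prefix}")
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)

print(to_nat64("192.0.2.33"))             # the AAAA answer a DNS64 server would synthesize
print(from_nat64("64:ff9b::c000:221"))    # what the translator sends out on the IPv4 side
```

DNS64 does the first mapping when a name has no AAAA record, and the NAT64 translator does the second when the packet arrives at the router.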

So, here are my experiences with turning off ipv4 entirely:

  1. SIP calls were broken initially. Unfortunately none of the SIP providers do IPv6, so you still need IPv4 connectivity, and more to the point many ATAs don't do IPv6, so that becomes broken. The way to fix this is via "topology hiding" at a SIP proxy like OpenSIPS. I'll give a detailed explanation elsewhere. Once I got this fixed, calls drop a lot less and generally things work better, including better audio quality (no doubt all that carrier grade NAT reduces real-time performance of the switching gear in your ISP, and IPv6 bypasses all that).
  2. NFS4 with kerberos encryption just works exactly the way it should. I can access my fileserver from anywhere on the ipv6 web.
  3. I still can't figure out how to get OpenVPN to push ipv6 routes to clients properly. But it does work insofar as it recognizes the ipv6 commands and spits out meaningful logs. I just didn't get the config right.
  4. FireTV Stick is flaky. It seems like if you're broadcasting a DNS entry in your router advertisements, the FireTV stick decides it doesn't have an internet connection. This is true even if you turn on ipv4 for it and hand it a DHCP lease. Hmm... but if you leave it alone for several hours, magically it starts working fine. Clearly a software update is in order, Amazon.
  5. Kindle Fire tablets work fine. Flawlessly for all their basic functions.
  6. Minecraft pocket edition on the Kindle Fire works fine on its own, but doesn't do any LAN games. If you give it an ipv4 it works for LAN games. Sigh. This is a known issue apparently. I suspect it will get better as the new, one month old code base gets ironed out. Also, using a server is recommended, but the new code base broke all the servers. Wait a few months if you play minecraft a lot (or your kids do, whatever). New code base seems to demand an XBox One login even if you want to play on your own server. No thanks Microsoft. I consider these things bugs, or worse yet the XBox login thing is malicious spyware behavior.
  7. Linux desktops, MacOS desktops, and Windows desktops all work totally flawlessly, including raspberry pi. My wife, who is a Mac user, didn't know I turned off ipv4 until a week afterwards when I told her. Of course she knew I broke the phone system, but I've done that before. Her Mac just worked. There were a few glitches but they turned out to be unrelated (right about the same time, Google wanted her to upgrade from the "Drive" client to the "Sync" client... so her Drive stopped working, not because of ipv6 but because of Google's new system).
  8. Android phones work flawlessly. In fact they seem to work better under ipv6 only to tell you the truth. I think this is the end-to-end nature of the network, no NAT bullshit breaking your communications.

So, if you want to replicate this experiment and ditch the broken old internet protocol here's what you need to replicate my setup:

  • A router running linux.
  • Firehol firewall software.
  • DNSmasq LAN management software (does DNS caching, DHCP, SLAAC, router advertisements, DHCPv6, local DNS for your LAN, etc).
  • Tayga on the router to do NAT64
  • The google DNS64 server info on the above linked page.
  • Some knowledge of how ipv6 works.
  • An ISP that provides ipv6 natively: ATT, Comcast, Spectrum/Charter/TWC, and Cox all do this to a substantial degree.
  • Wide DHCPv6 client to request prefixes from your ISP
  • Alternatively, get a router that runs LEDE (the more up-to-date project that came from OpenWRT). It handles ipv6 pretty much flawlessly out of the box, I just have slightly more requirements for my router than average, so I run it on an Intel machine running Debian.

 

On IPv6 only networks and FireTV Stick

2017 October 17
by Daniel Lakeland

The FireTV stick from Amazon (I have I think version 2) is a useful bit of kit. However, while it can use ipv6 it seems as of today (FireOS 5.2.6.0 October 2017) it must have an ipv4 connection in addition in order to be happy.

However, even if you give it dual-stack, it can still flake out, and the way it seems to work is it looks like it'll be connected for a few seconds, and then it will show that "home is unavailable" and ask you to check your network settings.

If you go to settings -> about -> network you will see it has an ipv4 address and everything looks fine! What's up? Look closer and you'll see that it shows a DNS address with ipv6.

The key is that ipv6 can advertise an ipv6 DNS server, and if you aren't also DHCP advertising an ipv4 DNS server, you'll be up a creek.

Using dnsmasq on my network, what fixed this for me was:

dhcp-option=option:dns-server,10.x.x.x

where 10.x.x.x was my server/router running dnsmasq

If this is there, even if you have an additional:

dhcp-option=option6:dns-server,......

where ...... is your ipv6 for dnsmasq, firetv will seem to be happy. This seems to be because it overrides its DNS settings if it hears from dnsmasq who the DNS servers are, and if it doesn't have any ipv4 DNS servers, it just refuses to be happy about its network connection...

Hope that helps someone.

 

Followup on Implicit Function Theorem / Likelihoods

2017 September 25
by Daniel Lakeland

I think it's important to understand conceptually what is going on in these cases where we have an implicit relationship that data and parameters are supposed to follow, and to know when it is that we need to do some kind of Jacobian corrections.

A Jacobian correction is required when you have a GIVEN probability distribution on space A and you have a transformation from A to B, call it B=F(A) and you want to express a probability distribution on the B space which is *equivalent in every way* to the GIVEN distribution on space A. The distribution on B is called the *push forward* distribution on B. The mnemonic here is that if you have a small neighborhood in A and you "push it forward" through the F function into the B space, it produces a small neighborhood in B, and if you want this to be equivalent in every way, then the measure of the neighborhood on A is going to be forced to be equal to the measure of the pushed-forward neighborhood in B.

GIVEN: A ~ DistA(values)

GIVEN: B = F(A)

DERIVE: B ~ DistB(values)

This process requires using DistA, the inverse transform Finv(B) and a Jacobian correction.
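Written out for the one-dimensional case, and assuming F is invertible and differentiable (p_A and p_B are my names for the densities of DistA and DistB), the recipe is:

```latex
p_B(b)\,db \;=\; p_A\!\left(F^{-1}(b)\right)\left|\frac{dF^{-1}(b)}{db}\right| db
```

The absolute value of the derivative of the inverse transform is the Jacobian correction.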

Compare this to:

UNKNOWN: A ~ UnknownDistro(Avalues)

GIVEN: B = F(A)

GIVEN: B ~ GivenDistB(Bvalues)

Here we don't know what measure we have on the A space, but we know (by a modeling assumption) what measure we have on the B space. If F is an invertible function, then this situation is entirely symmetric with the one above; it's just that *which space the distributional information is given in is different*.

Now, let's add some semantics to all of this.

In the above problem let A be a space in which your data measurements live. Let DistA(values) then be p(A | values) a likelihood factor in your model. Here, you know what the distribution is on your data. So, just use it. But if you insist on transforming your data to some other space, like say taking the log of your data, in order to leave your model unchanged by the fact that you insist on taking the log, you will have to find a DistB which is the push-forward measure of DistA through the transformation B=log(A).
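Here's a quick numerical sanity check of the log case in Python. The Exponential(1) distribution for the data is purely my choice for illustration, and p_A, p_B and integrate are made-up names. The point is that the pushed-forward density assigns the same probability to equivalent regions in the two spaces.

```python
import math

def p_A(a):
    # Assumed data distribution for illustration: Exponential(1)
    return math.exp(-a) if a >= 0 else 0.0

def p_B(b):
    # Push-forward of p_A through B = log(A): p_A(exp(b)) * |d exp(b) / db|
    return p_A(math.exp(b)) * math.exp(b)

def integrate(f, lo, hi, n=100_000):
    # Plain trapezoid rule, good enough for smooth densities
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

prob_A = integrate(p_A, 0.0, 1.0)      # P(A <= 1) computed in the data space
prob_B = integrate(p_B, -30.0, 0.0)    # P(B <= log(1) = 0) computed in the log space
total_B = integrate(p_B, -30.0, 5.0)   # essentially all of the mass of p_B

print(prob_A, prob_B, total_B)
```

If you drop the exp(b) factor from p_B, the two probabilities no longer agree, which is what forgetting the Jacobian looks like in practice.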

Now, suppose you don't know what likelihood to give for your data values, but you know that if you calculate some complicated function B = F(A) you would be willing to model the results, in the B space, as having a distribution p(B|Parameters) = DistB(Parameters)

Now, if you want to know what measure this implies in the data space, you will have to do the whole change of variables rigamarole with Jacobians. The important thing to understand is *what is given vs what is derived*

Now, let's imagine a situation where you have a non-separable relationship between various data and parameters, which is constant plus error, a typical situation where the implicit function theorem applies. Here x,y are data, a,b,c are parameters in your model, and we'll assume F is a "nice" function of the kind you're likely to write down as part of a modeling exercise not something really weird which is nowhere differentiable on any of its inputs or the like. Our model says that there is a relationship between x,y,a,b,c which is a constant plus noise. This relationship will be written:

F(x,y,a,b,c) = 0 + \epsilon

And let's say \epsilon \sim De(C) has GIVEN distribution De(C) where C are some constants (easiest case).

Now suppose that a,b,c have given values, and x,y are measured. Then the quantity on the left of this equation is a number, F(x,y,a,b,c)=3.310 for example. And so 3.310 = \epsilon is data: derived data to be sure, but data nonetheless. For given a,b,c and measured x,y there is no uncertainty left, it's just a number. By MODELING ASSUMPTION the probability that this \epsilon would be calculated to be within d\epsilon of 3.310, if the true values of a,b,c were the ones given by the sampler, is De(3.310|C)d\epsilon where De is a given function.

And so the distribution De(C) is of the form p(\epsilon | a,b,c): it is a *given* likelihood in "epsilon space". Note that x,y are needed to get \epsilon, but they are known data values; throughout the sampling process they stay constant. So this is really a function L(a,b,c) where a,b,c are the only things that change while you're sampling. Given the data x,y the post-data distribution on a,b,c is

L(a,b,c) prior(a,b,c)/Z da db dc

Where Z is a normalization factor Z = \int L(a,b,c) prior(a,b,c)da db dc

Now, if you have this given likelihood in epsilon space, and you want to see what the equivalent likelihood is over say y space where we think of y as data we'd like to predict, and x as covariates, and a,b,c as parameter values:

p(y | a,b,c) dy = p(\epsilon(x,y) | a,b,c) \left|\frac{d\epsilon(y)}{dy}\right| dy

This holds under the assumption that F is sufficiently well behaved that the implicit function theorem gives us a unique differentiable transform from y to epsilon for given x,a,b,c. The factor |d\epsilon(y)/dy| is the "Jacobian Correction"; the absolute value keeps the density non-negative when \epsilon decreases with y. Now divide both sides by dy and we have our answer for the density of y (I'm using nonstandard analysis, dy is an infinitesimal number).
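As a check on the formula, here are two simple cases where the derivative can be written down directly. Both choices of F are my own illustrations, not taken from any particular model.

```latex
\begin{align*}
F = y - ax - b = \epsilon
  &\;\Rightarrow\; \left|\frac{d\epsilon}{dy}\right| = 1,
  & p(y \mid a,b) &= De(y - ax - b \mid C) \\
F = \log(y) - ax = \epsilon
  &\;\Rightarrow\; \left|\frac{d\epsilon}{dy}\right| = \frac{1}{y},
  & p(y \mid a) &= De(\log(y) - ax \mid C)\,\frac{1}{y}, \quad y > 0
\end{align*}
```

In the linear case the correction is 1, which is why the usual regression likelihood never mentions it.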

The point is, the likelihood is strictly implied to be the push-forward measure of the GIVEN distribution over \epsilon. But the truth is, we don't know the transformation y = f(x,a,b,c,\epsilon) or its inverse. The typical way we'd do predictions would be to set \epsilon_n to be a parameter with the epsilon distribution, and then sample. Then we take the \epsilon_n values and use an iterative numerical solver to get y values. And so, now we have a computational criterion for deciding if F is sufficiently nice: it produces a unique answer (you might be able to extend this to a countable number of possible alternative answers) under iterative numerical solution for y from a given x,a,b,c,\epsilon_n.
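A minimal sketch of that prediction step in Python, assuming a hypothetical F that is strictly increasing in y (so the solution is unique) and standard normal \epsilon. The particular F, the bracketing interval, and the names solve_y and draws are all made up for illustration; a real model would plug in its own F and parameter draws from the sampler.

```python
import random

def F(x, y, a, b):
    # Hypothetical implicit relationship; strictly increasing in y when a > 0
    return y + a * y**3 - b * x

def solve_y(x, a, b, eps, lo=-1e3, hi=1e3, tol=1e-10):
    """Solve F(x, y, a, b) = eps for y by bisection on [lo, hi]."""
    def g(y):
        return F(x, y, a, b) - eps
    if g(lo) > 0 or g(hi) < 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = random.Random(1)
x, a, b = 2.0, 0.5, 1.5          # covariate and one set of parameter values
draws = []
for _ in range(5):
    eps_n = rng.gauss(0.0, 1.0)  # epsilon_n drawn from its GIVEN distribution
    draws.append((eps_n, solve_y(x, a, b, eps_n)))

for eps_n, y in draws:
    print(round(eps_n, 4), round(y, 4))
```

If the solver finds more than one root, or none, for plausible \epsilon_n values, that's the computational sign that F is not "nice" in the sense above.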

 

The Implicit Function Theorem and Likelihood Functions

2017 September 22
by Daniel Lakeland

In Bayesian statistical modeling we often use the symbol ~ which denotes a kind of "statistically equal to". Consider the following:

y = ax+b+\epsilon

If \epsilon = 0 then this is an equation of a line, whereas if we say \epsilon \sim N(0,1) for example then this denotes a line with errors that have a certain range of credible sizes and are centered around 0. Well this statement about the distribution of \epsilon doesn't alter the algebraic properties of the symbolic expression y=ax+b+\epsilon and so that equality still respects all the usual algebraic rules.

y-ax-b = \epsilon

Is true, and so y-ax-b \sim N(0,1) is true by substitution of \epsilon.

In general you might have a fairly complicated relationship, something like

F(x,y,a,b) = \epsilon

With F a nonlinear non-separable relationship between the quantities, for example

y^2 -\frac{x}{y}\mathrm{atan}(ay)+\frac{b}{a} = \epsilon

Or something equally nasty from the perspective of trying to, say, "solve for y". We can suppose that y is our data, and x a covariate and a,b are parameters. What do we make of this relationship in a Bayesian model and how do we use it?

In Stan, if you create a transformed parameter F = y^2-x/y*atan(a*y)+b/a and then say

F ~ normal(0,1)

You will get a message about how the left hand side of this sampling statement contains a transform of a parameter, and if it's nonlinear you need to include the Jacobian of the transformation in the target. This warning message is designed to alert you to something you need to do when you re-parameterize. But a re-parameterization is a purely formal transformation. It doesn't alter the meaning of your model, it alters the way in which the model is expressed. For example if you have y = ax+b and you change this to y/a = x +b/a and then rename y/a = y' and b/a = b' and say y' = x + b', this is a formal transformation that doesn't alter the meaning of the equation (provided a is not 0). On the other hand, if you do y = ax^2 + b then you're changing your model.

The statement F ~ normal(0,1) above is not a formal transformation, it is in fact a statement about your data, a kind of likelihood, it's just an implicit statement about your data.

Although we can't necessarily solve our equation for y symbolically, there is a theorem called the implicit function theorem which enables us to say that as long as our relationship F(x,y,a,b) is sufficiently well behaved, then in some region around any given point x',a',b'  there exists a function y = f(x,a,b) even if we don't know how to express it. For example when the distribution for a is well separated from 0 then we won't be dividing by a=0 and so our expression F is well behaved. And so, our statement

F ~ normal(0,1) is really a statement about

y-f(x,a,b) = \epsilon

And could be re-expressed as

y \sim N(f(x,a,b),1)

Which for y a data value is obviously the usual kind of likelihood expression. The problem is, although this f function exists, that doesn't mean we know what it is. We do, however, know the relationship F(x,y,a,b) and so why not do

F(x,y,a,b) \sim N(0,1)

Which has exactly the same meaning.
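Here's a minimal sketch, in Python rather than Stan, of what that statement contributes to the log posterior for the nasty example above: evaluate F at the data and the current parameter values, then take the normal log density of the result. The names normal_logpdf and log_lik and the data values are made up; Stan's sampling statement adds the same quantity up to an additive constant.

```python
import math

def F(x, y, a, b):
    # The example relationship from above: y^2 - (x/y) atan(a y) + b/a
    return y**2 - (x / y) * math.atan(a * y) + b / a

def normal_logpdf(z, mu=0.0, sigma=1.0):
    return -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def log_lik(a, b, xs, ys):
    # Sum over data points of log N(F(x_i, y_i, a, b) | 0, 1)
    return sum(normal_logpdf(F(x, y, a, b)) for x, y in zip(xs, ys))

xs = [1.0, 2.0, 0.5]   # made-up covariates
ys = [1.0, 1.5, 0.8]   # made-up data
print(log_lik(a=1.0, b=0.0, xs=xs, ys=ys))
```

Only a and b change from one sampler iteration to the next; xs and ys stay fixed.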

Note, to the best of our knowledge we have decided to model \epsilon \sim N(0,1) which is a modeling choice, and subject to questions regarding whether it expresses valid facts about the world more than it is subject to questions about mathematical correctness. This fact isn't derived mathematically from anything, it's assumed and so it should be questioned primarily on modeling grounds more than anything else. There can be mathematical facts that are relevant I suppose, but the main question is "is this a good model" not "did you derive the N(0,1) correctly" since it isn't derived.

All of this is another way to think about what I called "declarative models" a while back when I first started thinking about this topic.

 

On the lack of Lebesgue Measure on countably infinite dimensional spaces, and Nonstandard Analysis

2017 September 18
by Daniel Lakeland

Consider the interval [0,1]; it has length 1. The generalized notion of length in the reals is Lebesgue measure: whenever you have something like a closed interval, so that there's a trivial length for the set, the Lebesgue measure is the same as the length.

Now consider the 2D plane. The square [0,1] \times [0,1] consists of all the points (x,y) where x is in [0,1] and so is y. What is the area? It's 1. This continues to work for integer dimensions 3, 4, etc. What's the volume of the hypercube [0,1]^N for N some large integer like 3105? Again, it's 1^{3105} = 1.

But now let's see what happens when we consider intervals of the form [0,0.5] the length is 0.5 and for high dimensions N the hyper-volume of the hyper-cube is 0.5^N which goes to zero as N gets big. Similarly for intervals [0,2] the volume goes to 2^N which goes to infinity as N gets big.

Intuitively this is why we don't have (standard) Lebesgue measure on the infinite dimensional space. An infinitesimal interval dx is small, but when you calculate dx^N for N nonstandard, the hyper-volume is REALLY small. Similarly, for hypercubes with sides slightly larger than 1, the hyper-volume is infinite.

On the other hand, consider the interval [0,1.1]^N for N a nonstandard integer. Sure, the hyper-volume 1.1^N is nonstandard. But, it's a perfectly fine nonstandard number. If this calculation is an intermediate calculation in a series of calculations that eventually leads you to prove some property, there is nothing that keeps you from carrying it out. For example you want to show that one set is much smaller than another, the ratio of sizes is r = 1.1^N/1.2^N for N nonstandard. This ratio is clearly infinitesimal as 1.1/1.2 \approx 0.916667 is a fraction less than 1 and it's raised to a nonstandard power.

But if you have some other infinitesimal ratio, and we want to discern how big they are relative to each other, for example how big is 0.995^{N-K} relative to (1.1/1.2)^N you can do so easily and algebraically. [0.995^{N-K}/(1.1/1.2)^N] \approx 1.0855^{N-K} \times (1.1/1.2)^{-K}.

When N and K are nonstandard, you rapidly get either an unlimited or an infinitesimal result. But if you prove that this is true for all N,K and then need to later consider the finite standard case say N=1331512 and K=89331 then you have the formula available to you, and you can get a perfectly fine standard value. This is useful if you're doing something like considering a function of space evaluated at a set of points and you don't know ahead of time exactly how many points. For example each point might be the location of competing insects, and you're working out a PDE to approximate how these insect populations change in time. The insects come at discrete locations, but the particulars of how many and which locations are not known ahead of time. You can develop a continuous model, in which you have a smooth function of space, and then you've got an "infinite dimensional" model, but the truth is your infinite dimensional model is just a device for calculations approximating a finite but "large N" number of points. It's not helpful to say that "there is no Lebesgue measure on infinite dimensional space" because the property "there is Lebesgue measure on space of finite dimensions N for all integer N" is the property you care about. In your model you would only actually ever care about say N = a few million to billion. So developing a nonstandard expression makes more sense to the modeler, even though it makes no sense to the pure mathematician trained in classical analysis.
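For the standard case mentioned above, N=1331512 and K=89331, the ratio itself overflows ordinary floating point, so a quick check works in logarithms. This Python sketch (function names are mine) confirms that the factored form agrees with the direct ratio 0.995^{N-K}/(1.1/1.2)^N.

```python
import math

def log10_ratio_direct(N, K):
    # log10 of 0.995^(N-K) / (1.1/1.2)^N
    return (N - K) * math.log10(0.995) - N * math.log10(1.1 / 1.2)

def log10_ratio_factored(N, K):
    # log10 of (0.995 / (1.1/1.2))^(N-K) * (1.1/1.2)^(-K)
    return (N - K) * math.log10(0.995 / (1.1 / 1.2)) - K * math.log10(1.1 / 1.2)

N, K = 1331512, 89331
direct = log10_ratio_direct(N, K)
factored = log10_ratio_factored(N, K)
print(direct, factored)   # the number of decimal digits in the ratio, computed two ways
```

The result is a large positive number of digits: a perfectly fine standard value, just an enormous one.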

 

On finite vs countable additivity and Cox/Jaynes/DeFinetti etc

2017 September 12
by Daniel Lakeland

There is some concern in the back of people's minds about Cox/Jaynes Bayes and things like continuous parameter spaces, or infinite dimensional models and so forth. It comes down to questions about Finite Additivity. About the only good thing for me that came out of Dan Simpson posting on Andrew's blog about the "Gay Face" neural net baloney that came out this week is that he indirectly pointed me to a guest post on Xian's blog from a few years back. (To be clear, it's the gay face "research" baloney, not Dan's post, that I object to.)

So, here I want to try to organize some thoughts I have on finite additivity type stuff. I've mentioned why I don't think this is actually an issue for science in the past.

As regular blog readers know, I'm a fan of the IST form of Nonstandard Analysis. I think it connects models to formalism in a way that is transparent, and in a way that "typical" measure theory type math doesn't.

So, let's try to define some probability foundations in a Nonstandard Analysis setting (warning, I will probably do a poor job of this compared to professionals but if I screw it up I am pretty confident it will be both detectable and fixable by a more serious formalist)

First, let's work with a simple continuous-valued random variable X.

Define a set of points \{x_i = -N + i\,dx\} where dx = 1/N^2, i \in \{0,1,...,2N^3\} and N is a nonstandard integer, so that the grid runs from -N to N.

Now define a function p(x_i) which is non-negative and \sum_i p(x_i) dx = 1

Suppose the standardization p^*(x) exists as a standard function. Then p^*(x) is a standard probability density function. The proof is simple: it's non-negative, it's integrable, and its integral is 1. Every standard integrable function that is non-negative and has integral 1 is a probability density, by definition of a probability density.

Now, suppose instead that p(x) is a nonstandard function, that is, it takes on nonstandard values in such a way that p(x) dx is appreciable for some x values, but suppose this only occurs for values infinitesimally close to some standard values x_j (ie. the "delta functions" are at standard locations). Then there is no standardization of this function. However, we can define a probability measure on X such that if s is a standard set of points in X

\mu(s) = \mathrm{st}(\sum_{i : x_i \in s} p(x_i)dx)

It's trivial to show that this \mu is a countably additive probability measure. First off, we're adding up non-negative values, and the sum of all the nonstandard values equals 1, so the outcome is always in [0,1]. If s1 and s2 are disjoint standard sets then \mu(s1) + \mu(s2) = \mu(s1\cup s2) by the property of the sum operation that defines \mu in terms of p(x_i). This is true for all unions of K disjoint standard sets for K any standard integer, because the K standard sets are disjoint and therefore they partition the sum: we don't double-count any of the nonstandard grid points, for example. By Transfer this is true for all K including nonstandard K. To see that this implies countable additivity we use proof by induction. In standard mathematics, we have a sequence of subsets \{s_i\} whose full union is the full set S. Our equality is true for every K prefix of the sequence of subsets by our partitions-the-nonstandard-sum argument, and hence by induction is true for the whole sequence of subsets.
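A finite stand-in can make the "partition the sum" argument tangible. In this Python sketch N=10 plays the role of the nonstandard integer, so there's no standard-part step and nothing here proves anything. The grid spanning [-N, N], the normal-shaped weights, and the sets s1 and s2 are my assumptions for illustration.

```python
import math

N = 10                         # finite stand-in for the nonstandard integer
dx = 1.0 / N**2
grid = [-N + i * dx for i in range(2 * N**3 + 1)]   # assumption: grid spans [-N, N]

# Standard-normal-shaped weights, rescaled so that sum_i p(x_i) dx = 1
raw = [math.exp(-0.5 * x * x) for x in grid]
norm = sum(raw) * dx
p = [r / norm for r in raw]

def mu(indicator):
    """Measure of the set {x : indicator(x)} as a sum over grid points."""
    return sum(pi * dx for x, pi in zip(grid, p) if indicator(x))

def s1(x):
    return -1.0 <= x < 0.0

def s2(x):
    return 0.0 <= x < 2.0

def union(x):
    return s1(x) or s2(x)

print(mu(s1) + mu(s2), mu(union))   # disjoint sets partition the sum
print(mu(lambda x: True))           # total mass
```

Each grid point falls in at most one of the disjoint sets, so the two sides differ only by floating point rounding.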

So, finite additivity may be problematic as a foundation for Bayesian statistics on continuous parameter spaces, but nonstandard-finite additivity is sufficient to give measure theory, and to boot, in the nonstandard case, every standard measure has a nonstandard density.

Now, suppose we're dealing with infinite dimensional spaces, like the sample paths of gaussian processes. There may not be any standard Lebesgue measure on infinite dimensions, but there is a perfectly fine Lebesgue measure on every finite dimensional space of standard positive integer dimension D, and therefore by Transfer there is a Lebesgue measure on every nonstandard positive integer dimension D as well. Let (x,f(x)) be the graph of a sample path on a nonstandard grid of nonstandard dimension D in x and in each f. Let f have a nonstandard density defined by the nonstandardly-discretized multivariate normal distribution. Then the standardization of some realization f(x) is a gaussian process sample path (because every finite standard set of x values has f(x) values with multivariate gaussian distribution, which is the definition of a gaussian process).

The problem with standard measure theory is that when you build your formalism on taking limits, things change from one type of thing to another under certain limits. For example, the normal(0,s) density for s an infinitesimal is a perfectly fine nonstandard density function, but in standard mathematics it is a "delta function measure" which is to say a measure over sets, not a function f(x).

 

Statics methods for MC ensembles

2017 September 5
by Daniel Lakeland

Recently I've been working on problems for which Stan struggles. The typical situation is that I have a large model with many parameters, and usually most of the parameters represent some measurement for an individual item. For example a Public Use Microdata Area in the Census, or the behavior of some individual who has been surveyed, or the expression level of some gene, or what have you. And then there are a small number of parameters that describe the group-level properties of these individual measurements taken as an ensemble (a hierarchical model).

The symptoms of this struggle are that Stan adapts its step size to be very small, uses long trajectories, maxes out its treedepth, takes many iterations to get into the typical set, and then generally takes a long time to sample, and probably has divergences anyway. Often I kill the sampler before it even really adapts because I see the stepsize has crashed down to 1e-7 or the like and each iteration is taking tens of minutes to an hour.

Now, when we have better diagnostics for divergences, perhaps this kind of problem will be remedied somewhat. But in the meantime, solving dynamics problems is known to be hard, typically a lot harder than solving statics problems. For example, if you want to put electrons on the surface of a sphere in a configuration that minimizes total potential energy, you can do so a lot more easily computationally than you can solve the dynamics of those electrons if you give each one of them a velocity constrained to the sphere... Solving the dynamics problem requires you to take small timesteps and move all the electrons along complicated curves. Solving the statics problem just requires you to simultaneously perturb all the positions in an energy-minimizing way.

So, one thought I had was to solve an optimization problem on ensembles.

Here's one way to think about it: suppose p(x) is a probability density on a high dimensional space x. We want to create a sample of say N points in the X space, let N = 100 or so for example.

One thing we know is that if \{x_i\} is any set of points in the X space then for K a nonstandard number of steps \{x_i'\} = \mathrm{RWMH}(x_i,K) is an ensemble in equilibrium with this p(x) distribution, where RWMH(x_i,K) means K random walk Metropolis-Hastings updates starting from the point x_i.

Furthermore, once in equilibrium, it stays in equilibrium even for K=1. Of course for K=1 it is possible that each point is rejected in the MCMC update step, and so the ensemble is also unchanged. But as both the size of the ensemble increases and K increases, this non-movement becomes dramatically improbable. In particular, we can choose something like an isotropic gaussian kernel with sufficiently small standard deviation that we get something like 10% chance of acceptance for each point, and with 100 points, the probability that the ensemble is unmoved after 1 update per point is 0.9^100 ≈ 2.7e-5. Of course, the distance moved may not be very large, but we will have some movement of at least some of the points. If we make K = 10 or so, we'll get some movement of most of the points.
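The arithmetic here is easy to check. The snippet below is just that check, treating the acceptances as independent across points and updates (the same simplification the estimate above relies on); the variable names are mine:

```python
accept_prob = 0.10   # per-point acceptance probability quoted in the text
n_points = 100       # ensemble size
k_updates = 10       # updates per point for the "K = 10 or so" case

# Probability that every one of the 100 points rejects its single update.
p_ensemble_unmoved = (1.0 - accept_prob) ** n_points

# Probability that a given point accepts at least one of its K updates.
p_point_moved = 1.0 - (1.0 - accept_prob) ** k_updates

print(f"P(ensemble unmoved after 1 update per point) = {p_ensemble_unmoved:.3e}")
print(f"P(a given point moves within {k_updates} updates) = {p_point_moved:.3f}")
```

With these numbers roughly two thirds of the points move at least once when K = 10, which is the sense in which "most" of the points move.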

However, we want to guarantee that we do get large movement relative to our usually terrible starting point. One way to get movement is to just continue moving each particle along a line:

For each particle:

x(t) = x(0) + t (RWMH(x(0),K) - x(0))

When mapped across the ensemble, this can be thought of as transport of the ensemble in the direction of an acceptable random perturbation (acceptable because the RWMH accepted that update).
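Here is a minimal sketch of the two ingredients, in plain Python rather than anything tuned. The target `log_p` (an unnormalized standard normal), the kernel width `sigma`, and the function names are all placeholders I made up for illustration, not a tested implementation of the idea:

```python
import math
import random

def log_p(x):
    # Hypothetical target: unnormalized standard normal in len(x) dimensions.
    return -0.5 * sum(v * v for v in x)

def rwmh(x, K, sigma, rng):
    """K random walk Metropolis-Hastings updates from x, isotropic gaussian kernel."""
    x = list(x)
    lp = log_p(x)
    for _ in range(K):
        proposal = [v + rng.gauss(0.0, sigma) for v in x]
        lp_proposal = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < lp_proposal - lp:
            x, lp = proposal, lp_proposal
    return x

def transport(x0, x1, t):
    """Point on the line x(t) = x(0) + t (x1 - x(0))."""
    return [a + t * (b - a) for a, b in zip(x0, x1)]

rng = random.Random(0)
ensemble = [[rng.uniform(4.0, 6.0), rng.uniform(4.0, 6.0)] for _ in range(100)]
updated = [rwmh(x, 10, 0.5, rng) for x in ensemble]
# t = 1 reproduces the RWMH result; t > 1 keeps going in the same direction.
moved = [transport(x0, x1, 3.0) for x0, x1 in zip(ensemble, updated)]
```

A particle whose K updates were all rejected has x1 equal to x(0), so it simply stays put for every t.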

How far should we allow this transport to go? Obviously if t goes to infinity, some of the particles go to infinite distances and since our probability density is normalizable this means some of the particles go well outside the high probability region.

One particular expectation that is of interest is the ensemble entropy:

\frac{1}{N} \sum_{i=1}^N -\log(p(x_i))

This has the advantage that the function f(x) = -\log(p(x)) has domain equal to the domain of the density, and is therefore sensitive to all the regions of space. Furthermore, because p(x) goes to zero outside some ball, the function -\log(p(x)) is positive and so as t increases the entropy of the ensemble eventually increases.

The result is, that if x is out of equilibrium, initially as a function of increasing t the ensemble entropy should first decrease (because the RWMH accepts higher probability transitions and many of the transitions will be towards higher probability) then after a while, it should become somewhat constant, then as time goes on it has to increase again. The region of space near the entropy minimum is what we want. Yet, we don't want to just send all the points to the mode... that would be the lowest entropy ensemble possible.

This whole concept is related to concentration of measure. The entropy of an ensemble of 100 randomly chosen iid points is a function of 100 random variables and is therefore more or less constant under different realizations. The region of constant entropy is called the typical set and contains essentially 100% of the probability mass.

Now, suppose that \{x_i\} starts in equilibrium. Then, as t increases, ensemble entropy may decrease or increase initially, but the change will not be dramatic, and as t increases through time eventually entropy increases again. Since p(x) is typically smooth, the ensemble entropy as a function of t is also smooth. If it comes to a minimum somewhere, then in the vicinity of that minimum it acts like a parabola. How far should we go? Suppose that the ensemble entropy has a central limit type theorem such that when the ensemble size is large enough the ensemble entropy is distributed according to Normal(e,s(N)) for e the actual entropy of the distribution, and s some decreasing function of ensemble size N. This manifests as the trajectory having a parabolic minimum entropy. If we sample t with weights according to \exp(-e(x(t))) where e is the ensemble entropy then we should come into equilibrium (intuitively, I have no idea if this works in reality). Note that initially we are going to always go forward in time, but at equilibrium we need to project t both forward and backward.
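To make the "sample t with weights \exp(-e(x(t)))" step concrete, here is a rough sketch on a grid of t values. I take the ensemble entropy to be the ensemble average of f(x) = -\log(p(x)), use an unnormalized standard normal as a stand-in target, and hand-pick the line endpoints instead of running RWMH. All of those are assumptions for illustration, and like the idea itself this is untested as a sampler:

```python
import math
import random

def neg_log_p(x):
    # Hypothetical target: unnormalized standard normal, -log p(x) = |x|^2 / 2 + const.
    return 0.5 * sum(v * v for v in x)

def ensemble_entropy(points):
    """Ensemble average of -log p over the points."""
    return sum(neg_log_p(x) for x in points) / len(points)

def transported(starts, ends, t):
    """Move every particle along its own line x(t) = x(0) + t (x1 - x(0))."""
    return [[a + t * (b - a) for a, b in zip(x0, x1)]
            for x0, x1 in zip(starts, ends)]

def sample_t(starts, ends, t_grid, rng):
    """Pick one t from t_grid with weight exp(-e(x(t)))."""
    entropies = [ensemble_entropy(transported(starts, ends, t)) for t in t_grid]
    e_min = min(entropies)  # shifting by a constant leaves relative weights unchanged
    weights = [math.exp(-(e - e_min)) for e in entropies]
    return rng.choices(t_grid, weights=weights, k=1)[0], entropies

starts = [[2.0], [-2.0]]
ends = [[1.0], [-1.0]]                     # stand-ins for accepted RWMH updates
t_grid = [0.5 * i for i in range(-4, 13)]  # t from -2 to 6: backward and forward
t, entropies = sample_t(starts, ends, t_grid, random.Random(1))
print(t, min(entropies))
```

For this gaussian stand-in the entropy along the line is exactly a parabola in t, with its minimum at t = 2 where both particles sit at the mode. Using an unnormalized density only shifts the entropy by a constant, which drops out of the weights.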

Some questions: why doesn't this collapse to finding the mode?

The reason is that at equilibrium each individual random walk metropolis step goes either up or down in the distribution equally, so some points are heading into the tail, and others into the core. If we get close to the mode, the RWMH has more ways to go away from the mode and so we will get directions that mostly point away from it. If we get too far from the mode, RWMH will tend to ratchet upwards and give us a direction back into the typical set.

Is this any better than random walk?

I don't know. Intuitively, we're using a random walk to get a random perturbation direction that at equilibrium leaves the ensemble in equilibrium. So in the direction of this perturbation, we're moving along the typical set manifold. We do so in a way that more or less equally weights all points that we visited which were in the typical set (i.e. if the entropy is constant along our curve, then we go to any of those points with equal probability). It seems likely that this devolves to a random walk when the typical set is highly curved, and works more like HMC when the typical set is elongated. Now, Stan detects when there is a lot of curvature and drops the step size, but it then keeps the same step size for all the trajectories, so the computation required is based on the worst case. Perhaps this other technique for following the typical set as an ensemble is more robust, allowing a larger step size when it's warranted? Also, it doesn't require calculating gradients. In some sense, the RWMH step finds us a direction which is pointed along the typical set, and so has zero ensemble entropy gradient on average.

The nice thing is, I have been working on some example problems in Julia, so I think I have most of the requirements in place to test this on some example problems. I'll report back.

 

On Morality of Real World Decisions, and Frequentist Principles

2017 June 27
by Daniel Lakeland

If you want to make decisions, such as choosing a particular point estimate of a parameter, or deciding whether to give a drug, or whatever, and you want your decision making rule to have the *Frequency related* (Frequentist) property that under repeated application it will on average have small "badness" (or large "goodness"), then you should look for your procedure within the class of procedures mathematically proven to have unbeatable frequency of badness properties. This class is the Bayesian decision rules (see Wald's Theorem and the Wiki). The boundary of the class is the Bayesian decision rules with flat priors, but we know more: we know that the frequency properties of Rule A will be better than Rule B whenever a region around the real parameter is higher in the prior probability distribution under A than under B. Then we are giving more weight to the actual correct value and so our decision is based more on what will turn out to happen.

Unfortunately we don't know the correct value exactly, but in the case where we use flat priors, they are equivalent to a prior that places 100% probability on the value being infinitely large.

Now, if you, as a person who cares about the frequency properties of procedures, agree that your parameter *is* a real number, and you have any idea at all what the magnitude of that number is, say its base 10 logarithm rounded to the nearest integer is N, then by choosing a proper prior such as normal(0,10^(N+100)) you will have lower Frequency risk of making "bad" decisions than if you used a flat prior. Of course, you can do better, normal(0,10^(N+1)) will do better...
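For one very simple case this can be checked by hand. The sketch below assumes a toy model that is not part of the argument above: a single observation x ~ normal(theta, 1), squared error loss, and the second argument of normal(0, tau) read as a standard deviation. The flat prior estimate is then x itself with risk 1, and the normal(0, tau) prior gives the posterior mean (1 - s) x with s = 1/(1 + tau^2):

```python
def risk_improvement(theta, tau):
    """Flat-prior risk minus normal(0, tau)-prior risk under squared error.

    Toy model (an assumption for illustration): x ~ normal(theta, 1).
    Flat prior estimate x has risk 1. The posterior mean (1 - s) * x,
    with s = 1 / (1 + tau**2), has risk (1 - s)**2 + s**2 * theta**2.
    The difference is 1 - (1 - s)**2 - s**2 * theta**2, rearranged as
    s * (2 - s * (1 + theta**2)) so it stays representable when s is tiny.
    """
    s = 1.0 / (1.0 + tau ** 2)
    return s * (2.0 - s * (1.0 + theta ** 2))

N = 2                 # rounded log10 magnitude of the parameter
theta = 10.0 ** N     # a "true" value of that magnitude
for exponent in (N + 100, N + 1):
    tau = 10.0 ** exponent
    gain = risk_improvement(theta, tau)
    print(f"normal(0, 10^{exponent}): risk improvement over flat = {gain:.3e}")
```

A positive number means the proper prior has lower frequency risk than the flat prior. In this toy model that happens exactly when theta^2 < 1 + 2 tau^2, so both priors above win, the tighter one by more, while a prior that is far too narrow for the real parameter (say tau = 1 with theta = 100) does worse than flat.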

Now, decision making and morality are inevitably intertwined. Consider the existence of the "trolley problem" in moral philosophy. It's all about making choices each of which has bad consequences, but we have to make a choice, including the choice of "do nothing" which also has bad consequences. On the other hand, if you have no choice, there is no morality associated with your action. Getting hit by a drunk driver who crashes through the wall of your bedroom while you're sleeping is not a moral failing on your part for example.

But, if you have a choice of how to make real important, *CLINICAL* decisions about people's lives, and health, and societal health through engineering civil structures and the like, and you care about the frequency with which you do things that have bad consequences that you can't foresee exactly, and you *don't* make your decision by choosing a method that is better than your likelihood + flat prior + point estimate based on Mean Squared Error because you refuse to use a prior on some kind of principle, or you refuse to consider real world consequences other than mean squared error on some kind of principle, then in my opinion your principle is immoral, in the same way as prescribing a toxic drug on the principle that "I get a cut of the proceeds" is immoral.

If you make the decision because you don't know any better... then you're like the guy in the bed who gets hit by the car. But if you write books on statistics from a Frequentist perspective, and you fail to teach the Complete Class result, and you fail to emphasize the fact that you have a choice in what measure you will use in deciding on your decisions (such as the choice between Mean Squared Error in your estimate of the parameter value vs Quality Adjusted Life Years Saved of your clinical decision) then I think you're doing evil work in the same way that a person who teaches a Civil Engineering design rule that has been proven to be wrong and risks people's lives is doing evil work.

So, I do get a little worked up over this issue. Remember I have a background in Civil Engineering and I work with my wife who is a research biologist at a medical school. None of this is abstract to me, it's all real world "you shouldn't do that because it's wrong/bad for society/evil/it hurts people more often".

To back up a bit though: I don't think it's evil to care about the frequency properties of your procedures. I think it's evil to *fail to care* about the real world consequences of your decisions.

From the perspective of making decisions about things, such as point estimates of parameters, or clinical decisions about treatments, being a Frequentist (meaning, trying to reduce the average badness of outcomes under repeated trials) actually *entails* doing Bayesian Decision Theory. The Frequentist principle "try to reduce the average severity of bad outcomes" implies "Do Bayesian Decision Theory".