I had a conversation with Christian Hennig that enlightened me about some confusion over what qualifies as Frequentist theory vs what qualifies as Bayesian theory.

Of course we’re free to create technical jargony definitions of things, but essential to my conception of Frequentism is the idea that probability is placed on observable outcomes only, so that in principle you could if necessary get a large sample of things and have the real actual frequency distribution in the form of say a histogram. Then you could say that a given parametric distribution was good enough as a model of that histogram for example. I quoted some wikipedia text which essentially said the same thing, and is consistent with various sources you might find online or in books. In general at least if this isn’t the only definition of Frequentism, it’s a common one.

The alternative, Bayesian viewpoint in my view was that you use probability distributions either on quantities that don’t vary (like say the speed of light or the gravitational acceleration in your lab during your wednesday morning experiment), or you use distributions over things that vary which are notional and have no validated connection to observed frequency, but just represent your own information about how widely things may vary.

The question “could this random number generator have produced our dataset” which is essentially the purpose to which a p value is put, is not exclusively the realm of Frequentist statistics. Every time we fit a Bayesian model we could be considered to be asking that question over and over again and using the likelihood of the data under the RNG model to answer the question, and determine whether to keep our parameter vector in the sample or not.

What this means is, a lot of people are using what you’d call Bayesian methods under this classification scheme, without really thinking about it. For example linear mixed models or “random effects” models… These are hierarchical models in which essentially the likelihood function based on a normal distribution is *very* rarely questioned against the data (ie. goodness of fit tests run) and the distributions over the “random effects” is essentially *never* tested against a repeated sampling process. This means the distribution over the random effects process is just a Bayesian distribution as it represents how likely it is to get effects of that size. It is typically taken as a normal distribution with mean 0 and standard deviation equal to essentially some maximum likelihood estimate (the standard deviation is found during the numerical fitting process I think).

In fact, there are plenty of situations where a mixed effects model is used and there are a finite few groups involved. The simplest would be 2 groups, but let’s even say 8, like in the “8 schools” example. The distribution of individual effects in the 8 schools *can not* be a normal distribution. In fact, these must be just fixed values one for each school. The notion that these 8 are a sample from a normal distribution is entirely notional, and has no direct connection to observable frequency.

The only thing these kinds of models don’t have is an explicit proper prior. And Bayes isn’t just “frequentism with priors” it’s probability as measure of credence of an idea. People are already doing Bayesian analysis every time they run a mixed effects model, they just are dodging some of the responsibility for it by hiding it under the hood of “well accepted practices codified into the lme4 library” or some such thing.

Next up: Using an ABC like process and a likelihood based on a p value, we can construct confidence intervals, showing that confidence intervals are just shitty flat posterior intervals.

Increasingly, “science” is being used as club to wield in order to have power, for example the power to influence policy, or to get career advancement, or the like. The stories are legion on Andrew Gelman’s blog: Brian Wansink is an excellent example, he received millions in funding to do his “food industry” research… that didn’t have a hope of being correct.

So when my friend posted this video claiming that randomized controlled trials have proven that sugar does not cause kids to become hyperactive, I was skeptical, to say the least.

But at least it’s based on the “gold standard” of randomized controlled trials, and presented by someone with a biology background who understands and is an expert on this topic right? (actually, sure he’s an MD, but his “research focuses on the study of information technology to improve pediatric care and areas of health policy including physician malpractice…” and he’s publishing on an Econ blog. Is he perhaps a bit too credulous of this research?)

So here is where we unpack the all too common situation where meta-analysis of various randomized controlled trials is used as a club to beat people with, regardless of what those trials even say.

Let’s start with the meta-analysis linked on the YouTube video. This is Wolraich et al. 1995, a review of quite a few different somewhat related RCTs… which is hidden behind a paywall. If you manage to access it you will find that it cites 37 studies. Some of the results are summarized in this graph:

Hmm… what is up with Sarvais et al? (actually the correct spelling of her name is Saravis, not Sarvais). In any case, I looked through the cited research summarized in table 1.

You can see, this is a lot of studies of ADHD, “sugar reactors”, “Delinquent”, “aggressive”, “Prader-Willi syndrome” and other unusual populations. Furthermore the vast majority are 100% male subjects. Finally, almost all of these studies use a sweet beverage as their “control”. Unless they have an additional control for “not sweet” condition, this means they have zero chance to understand the difference between giving a kid say a sugary beverage, vs say water or a candy bar vs say a piece of whole wheat bread. In other words, sweetness may itself be the mechanism by which sugar makes kids hyperactive. If you were a biology major and proposed a mouse experiment like this to my wife she would say “where’s your negative control?” and send you back to the lab to think about how to run an experiment. Some of the papers do have a negative control.

Among the “normal” populations, number of subjects is typically less than around 25… I wasn’t going to chase down all these studies myself, so I decided to look at selected few that use “normal” population and have more than just a handful of subjects.

The ones I chased down were: Rosen et al 1988, Saravis et al 1990, and Wolraich et al. 1994

Let’s start with the last one, also by Wolraich. Like a number of these studies, Wolraich et al. 1994 is designed to investigate whether an overall diet of high sugar leads through time to an overall increase in hyperactivity… Although this is an interesting question, it is not the relevant question being discussed in the YouTube video. When a parent says their kid gets hyperactive after eating sugar, what they mean is “my kid got a candy bar from his friend and ate it, and now he’s bouncing off the walls for the next hour while I’m trying to get him to get ready for bed”. Of course there are kids with more general hyperactivity issues, but that is a more complicated issue. The YouTube video strongly suggests that sugar in any form, bolus or through time, never makes kids hyperactive ever.

The Wolraich et al 1994 article says: “The subjects and their families were placed on a different diet for each of three consecutive three-week periods. One of the three diets was high in sucrose with no artificial sweeteners, another was low in sucrose and contained aspartame, and the third was low in sucrose and contained saccharin (the placebo). So we can totally ignore this study as the design simply doesn’t address the question, and it fails to have any negative control. But, even if we don’t ignore it, what do we find… Table 4 is a summary (stupid, where’s the graph?)

It’s clear, from a Bayesian perspective, that the study was vastly underpowered to detect anything of interest. For example in the “Parent’s rating of behavior… Conduct” the numbers are 8.1+- 6.7 meaning expected random noise is almost as big as the average score… How meaningful is this rating if people ranged from say 1 to 14? Furthermore, the scores were complicated tests of a wide variety of things “administered in the mobile laboratory each week on the same day of the week and at the same time of day”. Any short term effect of a bolus of sweet foods would only be detectable if they had just given them the foods in the minutes before testing. So the results of this study are simply *irrelevant* to the main question at hand.

Let’s move to Rosen et al 1988: this study at least is designed to measure a bolus of sugar… of course it’s based on two studies with 30 preschool kids and 15 elementary school kids. They designed the study around a “high-sugar condition, a low-sugar condition, and a control aspartame condition (low in sugar but with a sweet taste).” The diet was randomized for each kid each day, and everyone was blinded as much as possible (obviously, the kids could tell if their diet was sweet or not, but weren’t told if it had sugar vs aspartame because no-one involved supposedly knew). The manipulation was by giving the kids controlled breakfast meals. The children were tested

“approximately 20-30 min following the completion of breakfast… on several measures sensitive to cognitive functioning…” as well as “each day teachers completed… [a] global rating scale… immediately preceding the child’s lunch time…to reflect the child’s behavior for the entire morning.”

The research is under-powered to really measure much of anything as well, but when they looked at global rating across all kids they found

“An analysis of the means for this measure revealed that the ratings during the high-sugar condition (M = 6.2, SD = 1.08) were only slightly, although significantly (Tukey p < .05) higher than those during the low-sugar condition (M = 5.9, SD = 1.04) (see Figure 2). The control condition (M = 6.0, SD = 1.07) was not significantly different from either the high or low condition. No other significant differences were observed for the teacher rating measures.”

So by the “rules” of standard frequentist statistics, we should determine that “high sugar” leads to hyperactivity (the scale increases with increasing hyperactivity) so this is (very weak) evidence but it tends *against* the YouTube video.

Now we get to the Saravis et al. paper. This paper fails to have a non-sweet control. The goal of the study was to determine if there was some difference between sugar vs aspartame, and not to determine the difference between “sweet” and “not sweet” diets. So right away it’s going to be weak evidence as to whether something like “feeding your child candy will cause them to be hyperactive for the next few tens of minutes”… But we’ll soldier on anyway, especially because if you look back above at the meta-analysis graph, you’ll find Saravis (misspelled Sarvais in the graph) is an utter outlier in terms of results…

The design of the Saravis study is that:

“Two experiments were conducted. The objective of the first study was to compare the effects of consuming a large dose of aspartame or sodium cyclamate…with carbohydrate on measures of learning, behavior, and mood in children. The treatment consisted of an ice slurry … of unsweetened strawberry Kool-Aid containing carbohydrate (as Polycose …) plus either the test dose of aspartame or the equivalent in sweetness as sodium cyclamate…”

…

“The objective of the second study was to compare the effects of aspartame with a common nutritive sweetener (sucrose) on the same variables. In this experiment, the treatments consisted of … Kool-Aid, plus either 1.75 g/kg of sucrose or the equivalent sweetness of aspartame. The drink with sucrose provided the same amount of carbohydrate as in the first experiment.”

So the study is always giving the kids a sweet drink, and only the second study even varies the amount of carbohydrate in the dose… The studies were 20 children 10 boys and 10 girls ages 9-10…We’ll soldier on…

Guess when they gave the high sweetness drink? Well they claim it was at 10:30, but if you know anything about a pre-school, you’d know that it takes time to pour out the drinks, get the kids in line, etc etc. So you can guess that at 10:25 the kids were aware that they were about to get a bolus of sweetness… and of course their activity level is very high right at that point in time, and continuing.

So the study is not designed to test the main question of interest “if you give kids candy/sweets do they go bonkers” but to the extent that they did give kids sweet stuff… they did go bonkers afterwards, independent of what kind of sweet stuff… Because there was “no difference” between the two kinds of sweet syrupy Kool-Aid, this is taken by the YouTube video as evidence that “sugar doesn’t cause hyperactivity”, essentially the opposite of the conclusion I draw, which is that sugar, and other super-sweet foods, given in a bolus, may very well make kids crazy, but the study actually can’t really determine that… Maybe just getting kids in line and giving them anything makes them crazy… who knows. Welcome to modern science, where a study completely incapable of determining anything of use which nevertheless is entirely consistent with out prior guess, and under-powered to determine anything precisely, is taken as definitive proof that our prior expectation is completely false. RCT as billy-club, wielded by econo-physician to show that stuff “everyone knows” is nevertheless wrong and only heroic econo-physician can save us from irrationality.

Chris Wilson commented on some of my comments related to Jacobians and Masking in prior specification… It’s kind of long past the discussion, so I’m not sure if others will chime in. I figured I’d put my response here..

In the past I’ve never seemed to be able to express this idea effectively, I hope this somehow was more effective. The green screen idea is really a good analogy… you calculate some quantity (greenness, or Q(X))… and decide whether you want to upweight a given region of space or downweight that region of space based on some strictly non-negative function of the quantity…

so formally for any prior p_base(X)/Z_base, you can always use p_base(X)*deformation_factor(Q(X))/Z_deformed as a prior provided deformation_factor(Q(X)) is a strictly non-negative bounded function (you’re guaranteed to have a normalizable distribution if it started out normalizable and nothing is multiplied by anything bigger than some constant). The only formal requirement for a prior is that it be normalizable.

The easiest interpretation is when deformation_factor is between 0 and 1. Regions near where deformation_factor = 1 are not squashed at all, regions less than 1 are squashed… and if you are doing MCMC you don’t need to worry about the normalization so it’s just a squishing factor.

Then the non-formal question is just: did this actually express the knowledge I have? The easiest way to determine this is sample from the prior and see if the samples make sense. You have to do that whether or not you are using nonlinear transformed functions.

The fact that Q(X) is a nonlinear function of the X vector doesn’t enter into this at all… since deformation_factor is just squishing those regions that result in “weird” Q(x) values, and not squishing regions that result in “ok” Q(x) values. It isn’t directly a probability measure, it’s a dimensionless scaling factor. “Adding” a Jacobian correction is another way of saying “you squished this the wrong amount”… umm… says who?

A similar thing occurs when you use a nonlinear Q(data,params) in a likelihood function L(Q(data,params))… This expresses a conditional probability of a particular transformed “summary statistic” Q(data,params), conditional on *both the data and the parameters* so it’s just a squishing factor for how much to downweight the prior… which, it turns out, is what a likelihood is. You might call this a “pseudo-likelihood” or you might just call it a way to directly express a joint probability distribution on data and parameters without factoring it into p(data | params) p(params), instead it’s p(Q(data,params),params)

this is more or less what’s going on in approximate bayesian computation (ABC) where the matching is often expressed in terms of low-dimensional summary statistics. If all your observed data is is low dimensional summary statistics… then you’d have just called this the “likelihood” but if you happened to observe the high dimensional version of your data and you still decide to match on the low dimensional summary… you call it “approximate”. It’s really just a different but related model.

You certainly shouldn’t be doing jacobian magic on the Q(data,params), because that would express a different model. If your model really is:

L(Q(data,params))p(params)

and someone comes along and tells you it should be

J(params) L(Q(data,params)) p(params)

they are not telling you “you made a formal calculation mistake” they are telling you “I don’t like your model, use this other one instead” which is probably wrong.

Jacobian corrections are verifiable formal consequences of model transformations… These other things are non-formal modeling choices whose goodness is a matter of opinion in the same way that linear regression’s goodness for a particular problem is a matter of opinion. Often Q(data,params) is a scalar and {data,params} is a high dimensional vector. There is no Jacobian for a transformation from high dimensions to a scalar. The transformation isn’t invertible.

There was a discussion on Andrew Gelman’s blog about creating informative priors using “quantities of interest” which are basically functions of the parameters.

His example included some code which sets a prior over parameters a, b by combining several functions of a,b:

model { target += normal(y | a + b*x, sigma); \\ data model target += normal(a | 0, 10); \\ weak prior information on a target += normal(b | 0, 10); \\ weak prior information on b target += normal(a + 5*b | 4.5, 0.2); \\ strong prior information on a + 5*b

Andrew then went on to say: “You should be able to do the same thing if you have information on a nonlinear function of parameters too, but then you need to fix the Jacobian, or maybe there’s some way to do this in Stan.”

I disagreed with this idea of “fixing the jacobian”. Deep in the comments I discussed how this works and how to understand it vs how to deal with “jacobian corrections” when describing priors in terms of probability measures on a target space. The question of whether you need a Jacobian is determined by the role that the information plays, whether you have absolute knowledge of a probability measure, or relative knowledge of how much an underlying probability density should be modified (ie. “masked” if you’re familiar with using masks in Photoshop). I thought I’d post it here so I can refer to it easily:

The first thing you have to understand is what is a Jacobian correction for.

Essentially a Jacobian “correction” allows you to sample in one space A in such a way that you induce a particular given known density on B when B is a known function of A.

if B = F(A) is a one to one and onto mapping (invertible) and you know what density you want B to have (pb(B)), then there is only one density you can define for A which will cause A to be sampled correctly so that B has the right density. Most people might work out that since B = F(A) the density should be pb(F(A))… but instead…

To figure this out, we can use nonstandard analysis, in which dA and dB are infinitesimal numbers, and we can do algebra on them. We will do algebra on *probability values*. Every density must be multiplied by an infinitesimal increment of the value over which the density is defined in order to keep the “units” of probability (densities are probability per unit something).

We want to define a density pa(A) such that any small set of width dA at any given point A* has total probability pb(F(A*)) abs(dB*)

That is, we have the infinitesimal equation:

pa(A) dA = pb(F(A)) abs(dB)

solve for pa(A) = pb(F(A)) abs(dB/dA)

if we said pa(A) = pb(F(A)) we’d be wrong by a factor involving the derivative dB/dA = dF/dA evaluated at A, which is itself a function of A. The absolute value is to ensure that everything remains positive.

abs(dB/dA) is a jacobian “correction” to the pb(F(A)) we derived naively at first.

—————

So, the applicability is when

1) you know the density in one space

2) you want to sample in a different space

3) There is a straightforward transformation between the two spaces

In Andrew’s example, this isn’t the case. We are trying to decide on a probability density *in the space where we’re sampling* A, and we’re not basing it on a known probability density in another space. Instead we’re basing it on

1) Information we have about the plausibility of values of A based on examination of the value of A itself. p(A)

2) Information about the relative plausibility of a given A value after we calculate some function of A… as I said, this is a kind of “mask”. pi(F(A))

Now we’re defining the full density P(A) in terms of some “base” density little p, p(A) and multiplying it by a masking function pi(F(A)) and then dividing by Z where Z is a normalization factor. So the density on space A is *defined* as P(A) = p(A) pi(F(A)) / Z

Notice how if p(A) is a properly normalized density, then pi(F(A)) K is a perfectly good mask function for all values of K, because the K value is eliminated by the normalization constant Z, which changes with K. In other words, pi tells us only *relatively* how “good” a given A value is in terms of the value of its F(A) value. It need not tell us any “absolute” goodness quantity.

Should “how much we like A” depend on how many different possible A values converge to the region F(A)? I think this seems wrong. If you have familiarity with photoshop, think like this: you want to mask away a “green screen”, should whether you mask a given pixel depend on “how green that pixel is” or “how many total green pixels there are”? The *first* notion is the masking notion I’m talking about, it’s local information about what is going on in vicinity of A, the second notion is the probability notion: how much total spatial locations get mapped to “green” that’s “probability of being green”

For example pi(F(A)) = 1 is a perfectly good mask, it says “all values of F(A) are equally good” (it doesn’t matter what color the pixel is). Clearly pi is not a probability density, since you can’t normalize that. You could also say “A is totally implausible if it results in negative F(A)” (if the pixel is green don’t include it) so then pi(F(A)) = 1 for all F(A) >= 0 and 0 otherwise is a good mask function as well. It’s also not normalizable in general.

If you start including “jacobian corrections” in your mask function, then your mask function isn’t telling you information like “mask this based on how green it is” it’s telling you some mixed up information instead that involves “how rapidly varying the “greenness measurement” is in the vicinity of this level of greenness”. This isn’t the nature of the information you have, and so you shouldn’t just blindly think that because you’re doing a nonlinear transform of a parameter, that you’re obligated to start taking derivatives and calculating Jacobians.

## Emile Borel, Leon Bruillouin, The Butterfly Effect, and the illogic of Null Hypothesis Significance Testing

I first read about this idea in Raghuveer Parthasarathy’s blog. And I’ve used it in several places in arguments, but hadn’t looked up the source.

The idea is this, about 1914 or so Emile Borel calculated that a perturbation corresponding to a single gram of matter moving in a certain way in a star several lightyears away would perturb the individual trajectories of the molecules of an ideal gas in a lab in such a way as to cause those trajectories to diverge from the ones you would calculate without the perturbation so that after a few seconds any given molecule would have diverged dramatically from your calculated trajectory. (To find citations on this I relied on this blog which ultimately says that the basic idea was described in Leon Brillouin’s 1964 book “Scientific Uncertainty and Information” (pg 125) where he cites Emile Borel’s 1914 book “*Introduction géométrique a quelques théories physiques*” (p 94) for Borel’s calculation.)

This is a manifestation of the basic idea of Sensitive Dependence on Initial Conditions (aka the Butterfly Effect). A measure of this phenomenon is called the Lyapunov Exponent. Basically for small times after the initial point where your model begins to track the molecules, the error in your calculation grows exponentially like $$exp(\lambda t)$$ for some $$\lambda$$. Exponential growth is of course very fast, so whenever $$\lambda$$ is positive then your calculation is useless after some amount of time because there is **always** some error in your calculation and it rapidly blows up.

What does this have to do with Null Hypothesis Significance Testing (NHST)? Well, a common way to utilize this “paradigm” is to collect some data, experimental or otherwise, do some analysis, test a “Null Hypothesis” that the effect of X on Y is zero on average, and then find that the effect is either “statistically significant” or “non-statistically significant”. This is taken as evidence that “X affects Y” or “X does not affect Y”.

What the Borel calculation tells us is “everything affects everything else to some extent” and so logically in this NHST statistics there are only “true positives” and “false negatives”. If you find a non-significant effect it’s just because you didn’t collect enough data.

Rethinking this paradigm logically, we should really be asking questions like “how big is the effect of X on Y” and “what other effects are there, and how big are their contributions, so that we can determine whether we need to take X into account in order to make useful description of Y”

Rethinking things in this manner immediately leads you to a Bayesian viewpoint on statistics: find out what regions of X effect space are plausible given our model and our data, and then determine whether we can safely neglect X relative to the other factors in our model.

The RED qdisc, which is available in Linux to control network traffic is mostly not used these days in favor of things like codel and fq_codel. Originally it was designed to give feedback to TCP streams by dropping packets randomly before hitting a hard tail-drop, as a way to get the TCP streams to throttle back.

Today, the place where it might get used more often is something like a core router at an ISP because it doesn’t require much computation and can maybe handle things like 40Gbps or 100Gbps ethernet.

But I recently had some discussions that led me to think it might have a use in certain circumstances where it wasn’t traditionally in use. The particular use I’m considering is for UDP streams carrying realtime traffic. For example, game or phone traffic where a small packet is sent every 10 or 20 ms or so to update the state of the game or provide the next piece of audio data.

Now, if you set up GAMEBW to contain the bandwidth allocated in kbps for your total game traffic, and you assume typical packets are about 220 bytes (as estimated by example data I collected) you can set up a RED queue as follows:

tc qdisc add dev ${DEV} parent 1:20 handle 20: red limit $(($GAMEBW * 70 /8)) min $(($GAMEBW * 20/8)) max $(($GAMEBW * 50/8)) avpkt 220 burst $(($GAMEBW *30/8 / 220)) probability 1.0 bandwidth ${GAMEBW}kbit harddrop

Here the magic numbers 20, 50, 30, etc refer to the time in milliseconds that it would take at GAMEBW to send the data in the queue so far. min at 20ms means that the RED queue acts like a bfifo for the first 20ms worth of data… assuming that this is a leaf qdisc for some bandwidth limiting qdisc, we assume that enough bandwidth is available to ensure that this level of queue rarely occurs. But if there’s a burst of traffic for some reason, we might start queuing past a 20ms queue, and then we have some probability to drop packets… which rises at 20ms from 0 linearly up to 100% (probability 1.0) once we hit 50ms of queue (we set limit at 70ms just in case).

If we start to exceed our allotted bandwidth for a while, we build up a queue, but we want to ensure that this queue doesn’t build up past some point, in our case 50ms worth of data. If we have 1.5 times the bandwidth allowed coming in, we’ll build up to a steady state where 1-1/1.5 = 33% of the packets are dropped, which occurs at 20+(50-20)*.33 = 30ms of queue length.

If a burst of packets lasts for a while, we either need to buffer it and hope that later it will go away, in which case our delay grows without bound until the burst stops… (bufferbloat!) or drop some of the packets and make sure that what goes out is no more than the max bandwidth and try to keep the delay low at the expense of sending only some of the packets.

The RED queue linearly interpolates between 0% drop at min, and 100% drop at max, so that no matter what the incoming over-bandwidth rate is, the delay stays below 50ms in this case.

RED is like a “soft tailed” bfifo, and although the TCP exponentially backs off when there’s a little droppage… UDP does no such thing, and so the recommendations in the RED manual page don’t really apply here… we want to get up to 100% droppage at our maximum allowable delay level, whereas the RED manual page assumes something like just 1-2% droppage, because TCP will dramatically reduce its bandwidth usage when it detects drops.

You can think of a model of the RED queue as a bucket with a small hole in the bottom, connected to a pipe, and a large hole in the side that dumps water on the ground. Water pours into the top of the bucket, and goes out the bottom. If we pour a little faster, we can build up a queue of water waiting to go down the pipe… up to 20ms of delay and the bucket doesn’t leak on the ground…. As we pour even faster… the hole in the side pours some of the water out on the ground (drops packets). Depending on how fast we pour, the height of the water increases, and the side hole dumps water faster… Eventually the water reaches the top of the bucket and 100% of the excess dumps onto the ground. That’s basically what’s going on. When we turn off the tap… no more than 80ms later we’ve got an empty bucket.

A recent paper came out in The Lancet about alcohol consumption claiming that overall considering all causes, there’s no “safe” level of consumption. Of course that made a big splash in the news. It happened that Andrew Gelman also posted another study about alcohol and when I mentioned some serious concerns about this Lancet paper, the study author actually responded on Gelman’s blog.

I followed up with some specific concerns, but the main issue is that if you want to get causal inference you must either have randomized assignment to ensure on average zero probabilistic dependence on important unaccounted for effects, or have a *causal model* that mimics the real mechanism at least approximately and is believable. Of course the mechanism of death from diseases is … life and all its complexities. In particular the time-dependency aspects.

To illustrate how this time dependency is important I created a qualitative differential equation model that is intended to simulate some kind of simple risk model through time. The basics are this:

1 unit of people are born at time t=0, as time goes on (1 unit of time is about 20 years) they slowly die of various causes. There are 3 causes we track through to 6 units of time:

- “Heart disease” which we assume begins to affect people around 2.5 units of time, and is reduced by 25% for each dose of alcohol, assumed valid only for small doses, such as less than 2.
- “Cancer” which we assume begins to affect people later, around 3 units of time, and is completely unaffected by alcohol (just for illustration purposes).
- “Other causes” which ramps up somewhat slowly between about 3 and 5 units of time but ensures that very few people really make it past 5 units or so (~100 yrs). Other causes are unaffected by alcohol.

Code is below, but first, the graph which shows overall survival, cumulative heart disease deaths, and cumulative cancer deaths through time for two groups: Red = no alcohol, and Blue = 1 dose of alcohol daily

Notice that by construction the people who take 1 dose of alcohol have lower instantaneous heart disease risk, and the same cancer and other cause risk. This means they live longer (blue survival curve on top is above red). But if you look at the individual causes, you see a reduced heart disease risk, but in cancer you see there are MORE deaths? By construction, alcohol doesn’t change your cancer risk in this model… So why do more alcohol drinkers die of cancer? The answer is that they live longer on average because they don’t die of heart disease as much… So eventually they die of cancer, or “other causes”.

At the *very minimum* you need to reproduce this kind of thing in your model, or you have no hope of teasing out how things work.

Now for the code:

## The Build Out Line in US Youth Soccer: A Terrible Idea full of unintended adverse consequences

The US Soccer Federation has created a new set of rules for youth games. These games are played on reduced sized fields with sides of 7 players (7 on 7). The rules include a new line on the field called the “build out line” which is half-way between the top of the goalkeeper box and the midfield line. When the goalkeeper receives the ball in his hands, the rule is that play must stop, and the opponents must move out past the build-out line before the goalkeeper can re-start play. I’ve seen websites claiming that a quick-restart is still allowed, but the rules adopted from US Soccer by CalSouth are actually explicit that this is not allowed

“**once the opposing team is behind the build out line**, the goalkeeper can pass, throw or roll the ball into play (*punts and drop kicks are not allowed*)” (bold emphasis is mine, italics is theirs)

Now, whenever you create a rule in a game, you modify the tactics that work well in that game. The game *is* the rules and everything that the rules imply about what will work well to win. The point of a game is to make an honest attempt to win. If we’re just trying to work out certain limited skills, it’s called practice. Furthermore, the players want to do a good job at the game. The players do not want to be obliterated, nor to lose opportunities to score goals “for no reason” (ie. if the rules don’t prohibit their tactic).

This new rule does several things to change the effective tactics available, all of them bad:

- It eliminates the role of the goalkeeper in breaking up the attack and coordinating a counterattack, forcing the keeper to wait and hold the ball until the opposing team has built up a defense rather than look for the best opportunity and method to distribute the ball. Quick and accurate distribution of the ball to the most well positioned players is one of the primary functions of the keeper in real soccer.
- It eliminates the threat of a long-ball by prohibiting punts and drop-kicks, allowing the defending team to push their entire team up to the build-out line to trap the ball in the back third and provide a high pressure defense that quickly turns over into offense in a dangerous area. The reason the defending team can build up to the build-out line is because they don’t have to worry about a long ball. The basic concept of tactics is to tune your action to the situation. Tactically, if you think there’s no long ball possibility, you can put all the players on front-line defense. The proper response to that tactic by the offense is actually … the long ball. A game-theoretic correct response is therefore to figure out a hack around the rule prohibiting punts… Namely drop the ball to a fullback and have him punt it. Or take a “goal kick” type kick from the top of the box, neither of which are prohibited, but both of which are completely artificial.
- The build-out line is also used as the line past which offside can be called, allowing a team to cherry-pick goals by leaving players essentially a few yards above the top of the box, or for players to sprint ahead on breaks and receive passes well into the defending side’s half without being called offside.

All of these tactics were immediately figured out as soon as this rule was put in place, as you can see in the comment section on ussoccer

For example Ryan Gilbert says:

I am a coach of U9 soccer in NJ. Firstly, congratulations on creating a entirely new game at your board meeting. Yesterday was our first game of the season and it was full of penalties and free kicks in front of goal due to this so called ‘build-out line’. 30 mins of our 50 min game were actually soccer and all the goals on both teams came from attack taking advantage of the new ‘build-out’ rules. What a total calamity you’ve created, a travesty to the beautiful game

…

All the opposition needs to do is flood the build-out line and to exert maximum pressure on their defense and the ball never leaves the half way line. I don’t know what game this is you’ve created but it’s not soccer. This season is now a farce and it must be scrapped as soon as possible.

*fuck that rule.*

*does occur*is: Whenever the opponent keeper gets the ball, the previously attacking team gets a break to set up a massive high pressure attack just outside the box, lines up like the charge of the light brigade and if 3 on 1 can’t win the ball, all you really have to do is slam the ball off an opponent into touch: now you have a throw in right next to the goal and can kick the next goal… over and over.

*the real game*from a good and experienced coach, and it’s making his team lose dramatically. Imagine how that makes the players feel? They’re asking themselves “why are we doing so badly? what are we doing wrong? how come we can’t get the ball out of our side?”

Now look at them yo-yo's that's the way you do it You play the lottery on the point oh five That ain't workin' that's the way you do it Grant money for nothin', academic chits for free Now that ain't workin' that's the way you do it Lemme tell ya them guys ain't dumb Maybe get a blister on your control-c finger Maybe get a blister on your mousing thumb We got to collect that carefully measured data, controls for all the conditions we see. We got to run tests with simulated data, run long chains of MCMC...

So, over the last few years I’ve been working on getting VOIP phone systems working as flawlessly as possible. It’s a surprisingly difficult thing to do. There are many pieces of the puzzle that can go wrong.

One of the important pieces of the puzzle is the network behavior at your router. The issue is that ideally for each call that comes through every 0.02 seconds a new audio packet will arrive and it should be delivered to the phone. If it’s delayed by more than 0.02 seconds, it’s basically worthless as it’s too late to play that bit of audio anyway, the next audio bit needs to be played. This isn’t quite true, because SIP phones use jitter buffers, so if once in a while you get some delays it can recover. But jitter buffers add to the overall phone call delay, so with too big of a jitter buffer, conversation participants start to talk over each other. This is a particular problem when calling between VOIP systems and cell phones as cell phones often have fairly long delays.

Well, so far, I’ve been using FireQOS to set up a quality of service system that manages the queue of packets on my router. It’s a pretty nice system, and it does fairly well. But it isn’t perfect. In particular, it uses the hierarchical token bucket (HTB) qdisc on Linux. This particular algorithm shares the bandwidth between different classes of traffic using a particular scheme that allows certain classes to borrow bandwidth from other classes that aren’t using it.

An alternative is the Hierarchical Fair Service Curve (HFSC) qdisc, and in general, this is much more poorly understood. However I discovered a great set of resources about it, and after reading that explanation I tried it out. The result was substantially more reliable low-jitter connections. Here are the resources: Stackexchange Thread and Tutorial

Here’s a very useful scheme you can set up. In addition to this set of classes, you need to also have filters that select which class each packet goes into. That’s a separate issue. But suppose you can select your voip packets to go into 1:10, and game packets into 1:20 and say interactive packets like ssh or ping or maybe video streams into 1:30, by default everything goes to 1:40 and packets for long-running file transfers go into 1:50…

setupdev(){ DEV=$1 BW=$2 NCALL=$3 tc qdisc del dev ${DEV} root tc qdisc add dev ${DEV} handle 1: root hfsc default 40 tc class add dev ${DEV} parent 1: classid 1:1 hfsc ls m2 ${BW}kbit ul m2 ${BW}kbit ## voip class tc class add dev ${DEV} parent 1:1 classid 1:10 hfsc rt m1 $((250*8*$NCALL*2/5))kbit d 5ms m2 $((250*8*$NCALL/20))kbit ## games class tc class add dev ${DEV} parent 1:1 classid 1:20 hfsc rt m1 $((250*8*$NCALL*2/5))kbit d 5ms m2 $((250*8*$NCALL/20))kbit ## above classes get pfifo behavior ## LS classes, interactive tc class add dev ${DEV} parent 1:1 classid 1:30 hfsc ls m1 $(($BW * 75/100))kbit d 100ms m2 $(($BW * 2/10))kbit tc qdisc add dev ${DEV} parent 1:30 handle 30: fq_codel ## default tc class add dev ${DEV} parent 1:1 classid 1:40 hfsc ls m1 $(($BW * 20/100))kbit d 100ms m2 $(($BW * 6/10))kbit tc qdisc add dev ${DEV} parent 1:40 handle 40: fq_codel ## lowprio tc class add dev ${DEV} parent 1:1 classid 1:50 hfsc ls m1 $(($BW * 5/100))kbit d 100ms m2 $(($BW * 2/10))kbit tc qdisc add dev ${DEV} parent 1:50 handle 50: fq_codel }

Now, this sets up a top-level HFSC class called 1:1 which has a maximum of $BW output bandwidth. It then sets up 2 real-time queues, and 3 link sharing queues to apportion the bandwidth. I assume 100kbit/s for a phone call, and a similar amount for a “game”. I currently am not using the game class.

The voip class is set up using the specification

rt m1 $((250*8*$NCALL*2/5))kbit d 5ms m2 $((250*8*$NCALL/20))kbit

This means “real time” class, with a burst rate of 800*N kbit/s for up to 5milliseconds, and a steady rate of 100 N kbit/s. How does this work?

First off, if I understand this correctly, whenever there is traffic in real-time classes, all link share classes have to wait. Because of that, it makes sense to make sure that the total amount reserved for real-time is a smallish fraction of your total budget. In other words, you can’t really expect VOIP calls to work great if you have 1Mbit of upstream speed or less because 800*2 = 1600 kbits/s which exceeds your 1000 kbit/s available. Now, what’s the “d 5ms” mean? It means that this “burst” speed is only available for up to 5ms. The idea here is that if your have N calls, and there are 2 packets in the queue per call, you can completely drain the queue by 5ms later. The speed I calculated is: 250 bytes * 8 bits/byte * N calls * 2 packets/call / 5ms. But, we don’t want this to go on forever, the goal is just to keep the queues short by draining them quickly. Long run, only 250*8*N/20 kbit/s are in use.

The “burst” speed time-period begins at the point where competition for sending occurs, in other words, when there is more than one real-time queue with a packet. HFSC looks at all the real-time queues and determines which one has the *quickest* packet to send, and sends that. The time it takes to send a packet is packetLength / burstSpeed. Real time packets get bandwidth until one of two things happens: they exceed their total sending allotment, or they drain their queues.

For real time, the total amount you can have sent from time t = 0 to time T=now is m2 * T. You can think of this as a line that increases in time at slope m2. If you had a long period with no calls going on, then the amount of bytes you have sent so far is much smaller than m2*T. This means whenever a packet comes in, it gets sent at the m1 rate until the queue drains. Since the m1 rate drains the queue much faster than it fills up from new VOIP packets. We basically ensure that as each VOIP packet arrives, it gets sent very very quickly and we rarely have more than a few packets in the queue.

Suppose you wanted to use the games class? If you care more about VOIP than games, as you probably should. You’d want to make sure that the m1 for VOIP calls was several times faster than the m1 for games. This means several packets would be sent for VOIP for every packet sent for your game. Let’s say for example you need 500 kbit/s for your games, and you want to be able to send 3 game packets within 5ms during burst to drain backlogs. Game packets are say 100 bytes. The m1 speed should be 100*8*3/5 = 480 kbit/s. If you want to send at least 3 VOIP packets PER CALL for each game packet, with 250 byte voip packets, you make m1 = 480*N * 3 * 250/100 = 3600 *N kbit. The m2 rate stays the same, it is what limits the real-time class from completely using ALL the available bandwidth. It’s worth pointing out that what matters is really the *ratio* of speeds between the competing classes, not the absolute speed of any class.

Now, what about link-sharing?

Link sharing always goes after real-time, either because real-time queues were totally drained, or because real-time hit its limiting total amount to be sent based on m2 and the time from boot-up to now.

Assuming real-time has drained, let’s look at the classes 1:30 1:40 and 1:50

Assume all three classes have packets that arrived just now. We’re at the start of a congestion period. Which packet gets sent first? again, its the one that’s the fastest to send, but based on the m1 rates for these classes for the first 100ms of congestion. During an initial 100ms period, the class 1:30 can send at 75% of the link rate, the default queue 1:40 can send at 20%, and the low-priority queue can send at 5% speed. This means roughly that if all the packets are similar sized, and there are say 100 packets backlog in each class, the high priority queue will send 75 packets the medium will send 20 packets and the low priority will send 5 packets in a given chunk of time.. provided that chunk is less than 100ms and provided the queue doesn’t drain. After the 100ms the rates change, high priority has to slow down to 20%, default priority can send at 60% and low priority can send at 20%.

Lets say the burst speed lets the high priority queue drain, now only default and low priority are competing. The default queue will send 20/5 times as many packets as the low priority. If default drains its queue, then low priority gets 5/5 = 100% of the remaining bandwidth.

And this brings up a good point: limits only apply when there is competition between queues. If default is the only thing that has traffic to send, it gets 100% of the bandwidth. And in general if there are several queues the amount queue i gets is bw[i]/sum(bw[j] for all j).

Once you understand how HFSC really works, which is not trivial without the help of the tutorials… You can design a very good queueing system that handles truly real-time traffic like VOIP or robot feedback control or even games, while also giving good service performance in link-sharing. The link-share definition alone even without the real time, tends to keep latency down on the high priority queue at the expense of increasing latency for traffic that doesn’t care, such as your large download of a gigabyte operating system install image or whatever.

Note that if you do a hierarchy that’s deeper, only the *leaf* classes get real-time service. It really makes sense to have all your real time classes directly attached to the root class 1:1