From Gelman’s blog, he shows yet another regression discontinuity. Apparently people have never heard of the Runge phenomenon, or the basis for why it happens. Here’s some R code, and a PDF of the output…

```
## regression discontinuity such as:
## https://statmodeling.stat.columbia.edu/2020/01/09/no-i-dont-think-that-this-study-offers-good-evidence-that-installing-air-filters-in-classrooms-has-surprisingly-large-educational-benefits/
#generally is garbage that ignores what's essentially a correlary to
##the well known Runge phenomenon... we demonstrate here.
library(ggplot2)
set.seed(131211)
datasets = list()
for (i in 1:20) {
datasets[[i]] = data.frame(x=runif(20,0,2),y=rt(20,5))
}
plotgraph = function(d){
g = ggplot(d,aes(x,y)) + geom_point() + geom_smooth(data=d[d$x < 1,],method="lm") + geom_smooth(data=d[d$x >= 1,],method="lm")
return(g)
}
graphs = lapply(datasets,plotgraph)
pdf("discplots.pdf")
sapply(graphs,print)
dev.off()
```

In almost every plot there is “something going on” at the discontinuity, either the level of the function has changed, or the slope, or both. And yet, the whole thing is random t-distributed noise…

I don’t know what that paper did to calculate its p values, but it probably wasn’t simulations like this, and it should have been.

Every year about this time there’s only one thing on a young child’s mind… if Zombie Santa comes for me, will I survive and have presents ‘neath tree?

Agent Based Models have the answer: fight back with a rocket launcher.

Merry Christmas to all and to all a good cricket bat.

Get the code below

My assertion is that among people doing more than just a couple t tests, particularly among people using Random Effects or Mixed Effects models, they are *already* doing a shabby sort of Bayesian modeling without taking responsibility for including real useful prior information, or doing appropriate model checking etc. It’s time to recognize that “Bayes with a flat prior and a linear predictor function” is not “Objective, Frequentist Statistics” it’s low grade passive Bayes.

I had a conversation with Christian Hennig that enlightened me about some confusion over what qualifies as Frequentist theory vs what qualifies as Bayesian theory.

Of course we’re free to create technical jargony definitions of things, but essential to my conception of Frequentism is the idea that probability is placed on observable outcomes only, so that in principle you could if necessary get a large sample of things and have the real actual frequency distribution in the form of say a histogram. Then you could say that a given parametric distribution was good enough as a model of that histogram for example. I quoted some wikipedia text which essentially said the same thing, and is consistent with various sources you might find online or in books. In general at least if this isn’t the only definition of Frequentism, it’s a common one.

The alternative, Bayesian viewpoint in my view was that you use probability distributions either on quantities that don’t vary (like say the speed of light or the gravitational acceleration in your lab during your wednesday morning experiment), or you use distributions over things that vary which are notional and have no validated connection to observed frequency, but just represent your own information about how widely things may vary.

The question “could this random number generator have produced our dataset” which is essentially the purpose to which a p value is put, is not exclusively the realm of Frequentist statistics. Every time we fit a Bayesian model we could be considered to be asking that question over and over again and using the likelihood of the data under the RNG model to answer the question, and determine whether to keep our parameter vector in the sample or not.

What this means is, a lot of people are using what you’d call Bayesian methods under this classification scheme, without really thinking about it. For example linear mixed models or “random effects” models… These are hierarchical models in which essentially the likelihood function based on a normal distribution is *very* rarely questioned against the data (ie. goodness of fit tests run) and the distributions over the “random effects” is essentially *never* tested against a repeated sampling process. This means the distribution over the random effects process is just a Bayesian distribution as it represents how likely it is to get effects of that size. It is typically taken as a normal distribution with mean 0 and standard deviation equal to essentially some maximum likelihood estimate (the standard deviation is found during the numerical fitting process I think).

In fact, there are plenty of situations where a mixed effects model is used and there are a finite few groups involved. The simplest would be 2 groups, but let’s even say 8, like in the “8 schools” example. The distribution of individual effects in the 8 schools *can not* be a normal distribution. In fact, these must be just fixed values one for each school. The notion that these 8 are a sample from a normal distribution is entirely notional, and has no direct connection to observable frequency.

The only thing these kinds of models don’t have is an explicit proper prior. And Bayes isn’t just “frequentism with priors” it’s probability as measure of credence of an idea. People are already doing Bayesian analysis every time they run a mixed effects model, they just are dodging some of the responsibility for it by hiding it under the hood of “well accepted practices codified into the lme4 library” or some such thing.

Next up: Using an ABC like process and a likelihood based on a p value, we can construct confidence intervals, showing that confidence intervals are just shitty flat posterior intervals.

Increasingly, “science” is being used as club to wield in order to have power, for example the power to influence policy, or to get career advancement, or the like. The stories are legion on Andrew Gelman’s blog: Brian Wansink is an excellent example, he received millions in funding to do his “food industry” research… that didn’t have a hope of being correct.

So when my friend posted this video claiming that randomized controlled trials have proven that sugar does not cause kids to become hyperactive, I was skeptical, to say the least.

But at least it’s based on the “gold standard” of randomized controlled trials, and presented by someone with a biology background who understands and is an expert on this topic right? (actually, sure he’s an MD, but his “research focuses on the study of information technology to improve pediatric care and areas of health policy including physician malpractice…” and he’s publishing on an Econ blog. Is he perhaps a bit too credulous of this research?)

So here is where we unpack the all too common situation where meta-analysis of various randomized controlled trials is used as a club to beat people with, regardless of what those trials even say.

Let’s start with the meta-analysis linked on the YouTube video. This is Wolraich et al. 1995, a review of quite a few different somewhat related RCTs… which is hidden behind a paywall. If you manage to access it you will find that it cites 37 studies. Some of the results are summarized in this graph:

Hmm… what is up with Sarvais et al? (actually the correct spelling of her name is Saravis, not Sarvais). In any case, I looked through the cited research summarized in table 1.

You can see, this is a lot of studies of ADHD, “sugar reactors”, “Delinquent”, “aggressive”, “Prader-Willi syndrome” and other unusual populations. Furthermore the vast majority are 100% male subjects. Finally, almost all of these studies use a sweet beverage as their “control”. Unless they have an additional control for “not sweet” condition, this means they have zero chance to understand the difference between giving a kid say a sugary beverage, vs say water or a candy bar vs say a piece of whole wheat bread. In other words, sweetness may itself be the mechanism by which sugar makes kids hyperactive. If you were a biology major and proposed a mouse experiment like this to my wife she would say “where’s your negative control?” and send you back to the lab to think about how to run an experiment. Some of the papers do have a negative control.

Among the “normal” populations, number of subjects is typically less than around 25… I wasn’t going to chase down all these studies myself, so I decided to look at selected few that use “normal” population and have more than just a handful of subjects.

The ones I chased down were: Rosen et al 1988, Saravis et al 1990, and Wolraich et al. 1994

Let’s start with the last one, also by Wolraich. Like a number of these studies, Wolraich et al. 1994 is designed to investigate whether an overall diet of high sugar leads through time to an overall increase in hyperactivity… Although this is an interesting question, it is not the relevant question being discussed in the YouTube video. When a parent says their kid gets hyperactive after eating sugar, what they mean is “my kid got a candy bar from his friend and ate it, and now he’s bouncing off the walls for the next hour while I’m trying to get him to get ready for bed”. Of course there are kids with more general hyperactivity issues, but that is a more complicated issue. The YouTube video strongly suggests that sugar in any form, bolus or through time, never makes kids hyperactive ever.

The Wolraich et al 1994 article says: “The subjects and their families were placed on a different diet for each of three consecutive three-week periods. One of the three diets was high in sucrose with no artificial sweeteners, another was low in sucrose and contained aspartame, and the third was low in sucrose and contained saccharin (the placebo). So we can totally ignore this study as the design simply doesn’t address the question, and it fails to have any negative control. But, even if we don’t ignore it, what do we find… Table 4 is a summary (stupid, where’s the graph?)

It’s clear, from a Bayesian perspective, that the study was vastly underpowered to detect anything of interest. For example in the “Parent’s rating of behavior… Conduct” the numbers are 8.1+- 6.7 meaning expected random noise is almost as big as the average score… How meaningful is this rating if people ranged from say 1 to 14? Furthermore, the scores were complicated tests of a wide variety of things “administered in the mobile laboratory each week on the same day of the week and at the same time of day”. Any short term effect of a bolus of sweet foods would only be detectable if they had just given them the foods in the minutes before testing. So the results of this study are simply *irrelevant* to the main question at hand.

Let’s move to Rosen et al 1988: this study at least is designed to measure a bolus of sugar… of course it’s based on two studies with 30 preschool kids and 15 elementary school kids. They designed the study around a “high-sugar condition, a low-sugar condition, and a control aspartame condition (low in sugar but with a sweet taste).” The diet was randomized for each kid each day, and everyone was blinded as much as possible (obviously, the kids could tell if their diet was sweet or not, but weren’t told if it had sugar vs aspartame because no-one involved supposedly knew). The manipulation was by giving the kids controlled breakfast meals. The children were tested

“approximately 20-30 min following the completion of breakfast… on several measures sensitive to cognitive functioning…” as well as “each day teachers completed… [a] global rating scale… immediately preceding the child’s lunch time…to reflect the child’s behavior for the entire morning.”

The research is under-powered to really measure much of anything as well, but when they looked at global rating across all kids they found

“An analysis of the means for this measure revealed that the ratings during the high-sugar condition (M = 6.2, SD = 1.08) were only slightly, although significantly (Tukey p < .05) higher than those during the low-sugar condition (M = 5.9, SD = 1.04) (see Figure 2). The control condition (M = 6.0, SD = 1.07) was not significantly different from either the high or low condition. No other significant differences were observed for the teacher rating measures.”

So by the “rules” of standard frequentist statistics, we should determine that “high sugar” leads to hyperactivity (the scale increases with increasing hyperactivity) so this is (very weak) evidence but it tends *against* the YouTube video.

Now we get to the Saravis et al. paper. This paper fails to have a non-sweet control. The goal of the study was to determine if there was some difference between sugar vs aspartame, and not to determine the difference between “sweet” and “not sweet” diets. So right away it’s going to be weak evidence as to whether something like “feeding your child candy will cause them to be hyperactive for the next few tens of minutes”… But we’ll soldier on anyway, especially because if you look back above at the meta-analysis graph, you’ll find Saravis (misspelled Sarvais in the graph) is an utter outlier in terms of results…

The design of the Saravis study is that:

“Two experiments were conducted. The objective of the first study was to compare the effects of consuming a large dose of aspartame or sodium cyclamate…with carbohydrate on measures of learning, behavior, and mood in children. The treatment consisted of an ice slurry … of unsweetened strawberry Kool-Aid containing carbohydrate (as Polycose …) plus either the test dose of aspartame or the equivalent in sweetness as sodium cyclamate…”

…

“The objective of the second study was to compare the effects of aspartame with a common nutritive sweetener (sucrose) on the same variables. In this experiment, the treatments consisted of … Kool-Aid, plus either 1.75 g/kg of sucrose or the equivalent sweetness of aspartame. The drink with sucrose provided the same amount of carbohydrate as in the first experiment.”

So the study is always giving the kids a sweet drink, and only the second study even varies the amount of carbohydrate in the dose… The studies were 20 children 10 boys and 10 girls ages 9-10…We’ll soldier on…

Guess when they gave the high sweetness drink? Well they claim it was at 10:30, but if you know anything about a pre-school, you’d know that it takes time to pour out the drinks, get the kids in line, etc etc. So you can guess that at 10:25 the kids were aware that they were about to get a bolus of sweetness… and of course their activity level is very high right at that point in time, and continuing.

So the study is not designed to test the main question of interest “if you give kids candy/sweets do they go bonkers” but to the extent that they did give kids sweet stuff… they did go bonkers afterwards, independent of what kind of sweet stuff… Because there was “no difference” between the two kinds of sweet syrupy Kool-Aid, this is taken by the YouTube video as evidence that “sugar doesn’t cause hyperactivity”, essentially the opposite of the conclusion I draw, which is that sugar, and other super-sweet foods, given in a bolus, may very well make kids crazy, but the study actually can’t really determine that… Maybe just getting kids in line and giving them anything makes them crazy… who knows. Welcome to modern science, where a study completely incapable of determining anything of use which nevertheless is entirely consistent with out prior guess, and under-powered to determine anything precisely, is taken as definitive proof that our prior expectation is completely false. RCT as billy-club, wielded by econo-physician to show that stuff “everyone knows” is nevertheless wrong and only heroic econo-physician can save us from irrationality.

Chris Wilson commented on some of my comments related to Jacobians and Masking in prior specification… It’s kind of long past the discussion, so I’m not sure if others will chime in. I figured I’d put my response here..

In the past I’ve never seemed to be able to express this idea effectively, I hope this somehow was more effective. The green screen idea is really a good analogy… you calculate some quantity (greenness, or Q(X))… and decide whether you want to upweight a given region of space or downweight that region of space based on some strictly non-negative function of the quantity…

so formally for any prior p_base(X)/Z_base, you can always use p_base(X)*deformation_factor(Q(X))/Z_deformed as a prior provided deformation_factor(Q(X)) is a strictly non-negative bounded function (you’re guaranteed to have a normalizable distribution if it started out normalizable and nothing is multiplied by anything bigger than some constant). The only formal requirement for a prior is that it be normalizable.

The easiest interpretation is when deformation_factor is between 0 and 1. Regions near where deformation_factor = 1 are not squashed at all, regions less than 1 are squashed… and if you are doing MCMC you don’t need to worry about the normalization so it’s just a squishing factor.

Then the non-formal question is just: did this actually express the knowledge I have? The easiest way to determine this is sample from the prior and see if the samples make sense. You have to do that whether or not you are using nonlinear transformed functions.

The fact that Q(X) is a nonlinear function of the X vector doesn’t enter into this at all… since deformation_factor is just squishing those regions that result in “weird” Q(x) values, and not squishing regions that result in “ok” Q(x) values. It isn’t directly a probability measure, it’s a dimensionless scaling factor. “Adding” a Jacobian correction is another way of saying “you squished this the wrong amount”… umm… says who?

A similar thing occurs when you use a nonlinear Q(data,params) in a likelihood function L(Q(data,params))… This expresses a conditional probability of a particular transformed “summary statistic” Q(data,params), conditional on *both the data and the parameters* so it’s just a squishing factor for how much to downweight the prior… which, it turns out, is what a likelihood is. You might call this a “pseudo-likelihood” or you might just call it a way to directly express a joint probability distribution on data and parameters without factoring it into p(data | params) p(params), instead it’s p(Q(data,params),params)

this is more or less what’s going on in approximate bayesian computation (ABC) where the matching is often expressed in terms of low-dimensional summary statistics. If all your observed data is is low dimensional summary statistics… then you’d have just called this the “likelihood” but if you happened to observe the high dimensional version of your data and you still decide to match on the low dimensional summary… you call it “approximate”. It’s really just a different but related model.

You certainly shouldn’t be doing jacobian magic on the Q(data,params), because that would express a different model. If your model really is:

L(Q(data,params))p(params)

and someone comes along and tells you it should be

J(params) L(Q(data,params)) p(params)

they are not telling you “you made a formal calculation mistake” they are telling you “I don’t like your model, use this other one instead” which is probably wrong.

Jacobian corrections are verifiable formal consequences of model transformations… These other things are non-formal modeling choices whose goodness is a matter of opinion in the same way that linear regression’s goodness for a particular problem is a matter of opinion. Often Q(data,params) is a scalar and {data,params} is a high dimensional vector. There is no Jacobian for a transformation from high dimensions to a scalar. The transformation isn’t invertible.

There was a discussion on Andrew Gelman’s blog about creating informative priors using “quantities of interest” which are basically functions of the parameters.

His example included some code which sets a prior over parameters a, b by combining several functions of a,b:

model { target += normal(y | a + b*x, sigma); \\ data model target += normal(a | 0, 10); \\ weak prior information on a target += normal(b | 0, 10); \\ weak prior information on b target += normal(a + 5*b | 4.5, 0.2); \\ strong prior information on a + 5*b

Andrew then went on to say: “You should be able to do the same thing if you have information on a nonlinear function of parameters too, but then you need to fix the Jacobian, or maybe there’s some way to do this in Stan.”

I disagreed with this idea of “fixing the jacobian”. Deep in the comments I discussed how this works and how to understand it vs how to deal with “jacobian corrections” when describing priors in terms of probability measures on a target space. The question of whether you need a Jacobian is determined by the role that the information plays, whether you have absolute knowledge of a probability measure, or relative knowledge of how much an underlying probability density should be modified (ie. “masked” if you’re familiar with using masks in Photoshop). I thought I’d post it here so I can refer to it easily:

The first thing you have to understand is what is a Jacobian correction for.

Essentially a Jacobian “correction” allows you to sample in one space A in such a way that you induce a particular given known density on B when B is a known function of A.

if B = F(A) is a one to one and onto mapping (invertible) and you know what density you want B to have (pb(B)), then there is only one density you can define for A which will cause A to be sampled correctly so that B has the right density. Most people might work out that since B = F(A) the density should be pb(F(A))… but instead…

To figure this out, we can use nonstandard analysis, in which dA and dB are infinitesimal numbers, and we can do algebra on them. We will do algebra on *probability values*. Every density must be multiplied by an infinitesimal increment of the value over which the density is defined in order to keep the “units” of probability (densities are probability per unit something).

We want to define a density pa(A) such that any small set of width dA at any given point A* has total probability pb(F(A*)) abs(dB*)

That is, we have the infinitesimal equation:

pa(A) dA = pb(F(A)) abs(dB)

solve for pa(A) = pb(F(A)) abs(dB/dA)

if we said pa(A) = pb(F(A)) we’d be wrong by a factor involving the derivative dB/dA = dF/dA evaluated at A, which is itself a function of A. The absolute value is to ensure that everything remains positive.

abs(dB/dA) is a jacobian “correction” to the pb(F(A)) we derived naively at first.

—————

So, the applicability is when

1) you know the density in one space

2) you want to sample in a different space

3) There is a straightforward transformation between the two spaces

In Andrew’s example, this isn’t the case. We are trying to decide on a probability density *in the space where we’re sampling* A, and we’re not basing it on a known probability density in another space. Instead we’re basing it on

1) Information we have about the plausibility of values of A based on examination of the value of A itself. p(A)

2) Information about the relative plausibility of a given A value after we calculate some function of A… as I said, this is a kind of “mask”. pi(F(A))

Now we’re defining the full density P(A) in terms of some “base” density little p, p(A) and multiplying it by a masking function pi(F(A)) and then dividing by Z where Z is a normalization factor. So the density on space A is *defined* as P(A) = p(A) pi(F(A)) / Z

Notice how if p(A) is a properly normalized density, then pi(F(A)) K is a perfectly good mask function for all values of K, because the K value is eliminated by the normalization constant Z, which changes with K. In other words, pi tells us only *relatively* how “good” a given A value is in terms of the value of its F(A) value. It need not tell us any “absolute” goodness quantity.

Should “how much we like A” depend on how many different possible A values converge to the region F(A)? I think this seems wrong. If you have familiarity with photoshop, think like this: you want to mask away a “green screen”, should whether you mask a given pixel depend on “how green that pixel is” or “how many total green pixels there are”? The *first* notion is the masking notion I’m talking about, it’s local information about what is going on in vicinity of A, the second notion is the probability notion: how much total spatial locations get mapped to “green” that’s “probability of being green”

For example pi(F(A)) = 1 is a perfectly good mask, it says “all values of F(A) are equally good” (it doesn’t matter what color the pixel is). Clearly pi is not a probability density, since you can’t normalize that. You could also say “A is totally implausible if it results in negative F(A)” (if the pixel is green don’t include it) so then pi(F(A)) = 1 for all F(A) >= 0 and 0 otherwise is a good mask function as well. It’s also not normalizable in general.

If you start including “jacobian corrections” in your mask function, then your mask function isn’t telling you information like “mask this based on how green it is” it’s telling you some mixed up information instead that involves “how rapidly varying the “greenness measurement” is in the vicinity of this level of greenness”. This isn’t the nature of the information you have, and so you shouldn’t just blindly think that because you’re doing a nonlinear transform of a parameter, that you’re obligated to start taking derivatives and calculating Jacobians.

## Emile Borel, Leon Bruillouin, The Butterfly Effect, and the illogic of Null Hypothesis Significance Testing

I first read about this idea in Raghuveer Parthasarathy’s blog. And I’ve used it in several places in arguments, but hadn’t looked up the source.

The idea is this, about 1914 or so Emile Borel calculated that a perturbation corresponding to a single gram of matter moving in a certain way in a star several lightyears away would perturb the individual trajectories of the molecules of an ideal gas in a lab in such a way as to cause those trajectories to diverge from the ones you would calculate without the perturbation so that after a few seconds any given molecule would have diverged dramatically from your calculated trajectory. (To find citations on this I relied on this blog which ultimately says that the basic idea was described in Leon Brillouin’s 1964 book “Scientific Uncertainty and Information” (pg 125) where he cites Emile Borel’s 1914 book “*Introduction géométrique a quelques théories physiques*” (p 94) for Borel’s calculation.)

This is a manifestation of the basic idea of Sensitive Dependence on Initial Conditions (aka the Butterfly Effect). A measure of this phenomenon is called the Lyapunov Exponent. Basically for small times after the initial point where your model begins to track the molecules, the error in your calculation grows exponentially like $$exp(\lambda t)$$ for some $$\lambda$$. Exponential growth is of course very fast, so whenever $$\lambda$$ is positive then your calculation is useless after some amount of time because there is **always** some error in your calculation and it rapidly blows up.

What does this have to do with Null Hypothesis Significance Testing (NHST)? Well, a common way to utilize this “paradigm” is to collect some data, experimental or otherwise, do some analysis, test a “Null Hypothesis” that the effect of X on Y is zero on average, and then find that the effect is either “statistically significant” or “non-statistically significant”. This is taken as evidence that “X affects Y” or “X does not affect Y”.

What the Borel calculation tells us is “everything affects everything else to some extent” and so logically in this NHST statistics there are only “true positives” and “false negatives”. If you find a non-significant effect it’s just because you didn’t collect enough data.

Rethinking this paradigm logically, we should really be asking questions like “how big is the effect of X on Y” and “what other effects are there, and how big are their contributions, so that we can determine whether we need to take X into account in order to make useful description of Y”

Rethinking things in this manner immediately leads you to a Bayesian viewpoint on statistics: find out what regions of X effect space are plausible given our model and our data, and then determine whether we can safely neglect X relative to the other factors in our model.

The RED qdisc, which is available in Linux to control network traffic is mostly not used these days in favor of things like codel and fq_codel. Originally it was designed to give feedback to TCP streams by dropping packets randomly before hitting a hard tail-drop, as a way to get the TCP streams to throttle back.

Today, the place where it might get used more often is something like a core router at an ISP because it doesn’t require much computation and can maybe handle things like 40Gbps or 100Gbps ethernet.

But I recently had some discussions that led me to think it might have a use in certain circumstances where it wasn’t traditionally in use. The particular use I’m considering is for UDP streams carrying realtime traffic. For example, game or phone traffic where a small packet is sent every 10 or 20 ms or so to update the state of the game or provide the next piece of audio data.

Now, if you set up GAMEBW to contain the bandwidth allocated in kbps for your total game traffic, and you assume typical packets are about 220 bytes (as estimated by example data I collected) you can set up a RED queue as follows:

tc qdisc add dev ${DEV} parent 1:20 handle 20: red limit $(($GAMEBW * 70 /8)) min $(($GAMEBW * 20/8)) max $(($GAMEBW * 50/8)) avpkt 220 burst $(($GAMEBW *30/8 / 220)) probability 1.0 bandwidth ${GAMEBW}kbit harddrop

Here the magic numbers 20, 50, 30, etc refer to the time in milliseconds that it would take at GAMEBW to send the data in the queue so far. min at 20ms means that the RED queue acts like a bfifo for the first 20ms worth of data… assuming that this is a leaf qdisc for some bandwidth limiting qdisc, we assume that enough bandwidth is available to ensure that this level of queue rarely occurs. But if there’s a burst of traffic for some reason, we might start queuing past a 20ms queue, and then we have some probability to drop packets… which rises at 20ms from 0 linearly up to 100% (probability 1.0) once we hit 50ms of queue (we set limit at 70ms just in case).

If we start to exceed our allotted bandwidth for a while, we build up a queue, but we want to ensure that this queue doesn’t build up past some point, in our case 50ms worth of data. If we have 1.5 times the bandwidth allowed coming in, we’ll build up to a steady state where 1-1/1.5 = 33% of the packets are dropped, which occurs at 20+(50-20)*.33 = 30ms of queue length.

If a burst of packets lasts for a while, we either need to buffer it and hope that later it will go away, in which case our delay grows without bound until the burst stops… (bufferbloat!) or drop some of the packets and make sure that what goes out is no more than the max bandwidth and try to keep the delay low at the expense of sending only some of the packets.

The RED queue linearly interpolates between 0% drop at min, and 100% drop at max, so that no matter what the incoming over-bandwidth rate is, the delay stays below 50ms in this case.

RED is like a “soft tailed” bfifo, and although the TCP exponentially backs off when there’s a little droppage… UDP does no such thing, and so the recommendations in the RED manual page don’t really apply here… we want to get up to 100% droppage at our maximum allowable delay level, whereas the RED manual page assumes something like just 1-2% droppage, because TCP will dramatically reduce its bandwidth usage when it detects drops.

You can think of a model of the RED queue as a bucket with a small hole in the bottom, connected to a pipe, and a large hole in the side that dumps water on the ground. Water pours into the top of the bucket, and goes out the bottom. If we pour a little faster, we can build up a queue of water waiting to go down the pipe… up to 20ms of delay and the bucket doesn’t leak on the ground…. As we pour even faster… the hole in the side pours some of the water out on the ground (drops packets). Depending on how fast we pour, the height of the water increases, and the side hole dumps water faster… Eventually the water reaches the top of the bucket and 100% of the excess dumps onto the ground. That’s basically what’s going on. When we turn off the tap… no more than 80ms later we’ve got an empty bucket.

A recent paper came out in The Lancet about alcohol consumption claiming that overall considering all causes, there’s no “safe” level of consumption. Of course that made a big splash in the news. It happened that Andrew Gelman also posted another study about alcohol and when I mentioned some serious concerns about this Lancet paper, the study author actually responded on Gelman’s blog.

I followed up with some specific concerns, but the main issue is that if you want to get causal inference you must either have randomized assignment to ensure on average zero probabilistic dependence on important unaccounted for effects, or have a *causal model* that mimics the real mechanism at least approximately and is believable. Of course the mechanism of death from diseases is … life and all its complexities. In particular the time-dependency aspects.

To illustrate how this time dependency is important I created a qualitative differential equation model that is intended to simulate some kind of simple risk model through time. The basics are this:

1 unit of people are born at time t=0, as time goes on (1 unit of time is about 20 years) they slowly die of various causes. There are 3 causes we track through to 6 units of time:

- “Heart disease” which we assume begins to affect people around 2.5 units of time, and is reduced by 25% for each dose of alcohol, assumed valid only for small doses, such as less than 2.
- “Cancer” which we assume begins to affect people later, around 3 units of time, and is completely unaffected by alcohol (just for illustration purposes).
- “Other causes” which ramps up somewhat slowly between about 3 and 5 units of time but ensures that very few people really make it past 5 units or so (~100 yrs). Other causes are unaffected by alcohol.

Code is below, but first, the graph which shows overall survival, cumulative heart disease deaths, and cumulative cancer deaths through time for two groups: Red = no alcohol, and Blue = 1 dose of alcohol daily

Notice that by construction the people who take 1 dose of alcohol have lower instantaneous heart disease risk, and the same cancer and other cause risk. This means they live longer (blue survival curve on top is above red). But if you look at the individual causes, you see a reduced heart disease risk, but in cancer you see there are MORE deaths? By construction, alcohol doesn’t change your cancer risk in this model… So why do more alcohol drinkers die of cancer? The answer is that they live longer on average because they don’t die of heart disease as much… So eventually they die of cancer, or “other causes”.

At the *very minimum* you need to reproduce this kind of thing in your model, or you have no hope of teasing out how things work.

Now for the code: