Dirichlet Processes as Probability over Histograms

2016 July 22
by Daniel Lakeland

Suppose you are sampling from a large population, trying to find out how that population is distributed. Suppose it's a large but finite population at a fixed time, say 1 million balances in checking accounts. You can't observe the whole population, so you take a much smaller sample, say 100 balances. Yet you want to use that information to come up with plausible models for the frequency distribution of the balances, so that you could, for example, run simulations of what might happen if you offered people a chance to buy something at various prices (some calculation that requires simulating from the full frequency distribution, not just knowing its mean and sd).

Most of us have worked with histograms in the past. In the simple version, you take your sample, break up the range into bins of equal size, and count up how many samples fall in each bin. The value n_i/N for each bin i gives an estimate of the value

\int_{x_i}^{x_{i+1}} F(s)ds

where F is thought of as the "real" frequency density, and when the population is large this is taken to be basically a continuous function (hopefully the nonstandard analysis analogy is already coming into your head).

Well, each of the bins is basically a "categorical" variable, and the n_i are counts in each category. The simplest categorical variable is the binary one, where there are two possible outcomes {1,0}, or "true vs false," "success vs failure," "heads vs tails." A more complex categorical variable is drawing balls from an urn. Suppose the urn is full of balls of 5 different colors. If you draw a sample of 100 balls, you can report a vector of the counts of red, orange, yellow, green, blue, for example. Similarly, by breaking up the range of a variable into different bins, you can take your sample and turn it into a count of the number of times a data point fell in each bin.

Now, we're familiar with the idea that if you take a large sample, the histogram should be close to the real frequency distribution F(x) whereas with a small sample, the histogram might be very different from F(x).

Many people are also familiar with the Beta-Binomial conjugate family. If you check 100 out of a large number of items that come in only two colors, and 40 are red and 60 are blue, then with a prior of beta(1,1) your posterior over the frequency of reds is beta(41,61), for example. There is a similar distribution that is conjugate to the multinomial/categorical distribution with k categories. This distribution is called the "Dirichlet" distribution, and it's a distribution over the vector of relative frequencies with which each category appears in random sampling from a giant urn with k colors of balls, or in the process of creating histograms with k bins...
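
As a quick check in R (using only base R's rbeta), the beta(41,61) posterior is just beta(1+40, 1+60):

hist(rbeta(10000, 41, 61), main = "posterior frequency of reds")  # concentrates around 0.4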

In R, if you load the "gtools" package, you can draw samples from a Dirichlet distribution:

\vec f = \mathrm{rdirichlet}(1,\vec N)

This call returns a vector \vec f which is, in essence, a frequency histogram drawn from the conjugate model under the assumption that you've observed a vector of counts \vec N, where the N_i values are the observed counts plus the "effective prior counts".

In the absence of strong prior information and the presence of a large sample we can think of the \vec N as just observed counts in a histogram process, and the f_i as plausible values for the true frequency with which values from a very large sample would fall into histogram bin i for all the possible i.

Unsurprisingly, the larger N_{tot}=\sum N_i is, the closer different draws of f_i will be to the single consistent value N_i/N_{tot}. In other words, all the f vectors will look similar. Whereas when N_{tot} is small, the histograms will vary a lot and tend to look like histograms built from only a small number of observations... noisy. Note that there's no requirement for the values in \vec N to be integers, even though we're understanding them in terms of "effective counts".
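
Here's a quick illustration of that concentration effect in R, assuming the gtools package (the parameter vectors are made up):

library(gtools)
round(rdirichlet(3, c(4, 3, 2, 1)), 2)          # N_tot = 10: the three sampled histograms differ a lot
round(rdirichlet(3, c(400, 300, 200, 100)), 2)  # N_tot = 1000: the three draws are nearly identical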

If you want to simulate values that are plausible for the full population using only your prior guess at the frequency histogram and the counts of observed data points, one way to do it is to use the rdirichlet function to generate an f, then generate data consistent with that histogram. Each time you do this you'll generate a plausible full population which you can then use to calculate whatever complex thing you want as if you had the whole population dataset. Doing this over and over shows you how much your uncertainty in the whole population affects whatever your calculated statistic is.

There's nothing to say that the bins have to be equally spaced, or even have finite ends, though you'll need a model for the tail of the distribution if you include an infinite bin. Also, there's no reason you can't use an informative prior over the histograms, and the more informative you want the prior to be, the bigger the "count" of the fake data. If \vec f is a vector of prior guesses for the histogram bin frequencies, then the N f_i are in essence prior counts "worth" N data points. If you have an additional K data points with observed relative frequencies F_i, then Nf_i + KF_i is the parameter vector you should give to rdirichlet to get a sample from the posterior over histograms.

Once you have a vector of histogram frequencies f you can generate a plausible population of values by selecting a histogram bin at random and generating a value uniformly within that bin. But the real F is probably not a piecewise constant function, so you can maybe do better with a little smoothing. Adding a small amount of normal noise to each simulated value produces a draw from a smoothed density F_s: the piecewise constant function convolved with a normal curve. If you simply use the midpoint of each interval and then smooth with the appropriate normally distributed perturbation, you have a "nonparametric" gaussian mixture model. It's nonparametric in the sense that it is actually a family of models: you can always increase the number of bins to fit the size of your data.
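
Putting the last few paragraphs together, here is a minimal sketch in R, assuming the gtools package; all the bin edges, counts, and variable names are made up for illustration:

library(gtools)

breaks  <- seq(0, 5000, by = 500)               # histogram bin edges for the balances, say
k       <- length(breaks) - 1
f_prior <- rep(1 / k, k)                        # prior guess at the bin frequencies
N       <- 20                                   # how many data points the prior is "worth"
counts  <- c(30, 22, 15, 11, 8, 6, 4, 2, 1, 1)  # observed counts from K = 100 sampled balances

alpha <- N * f_prior + counts                   # the N f_i + K F_i parameter vector
f     <- as.vector(rdirichlet(1, alpha))        # one plausible frequency histogram

## one plausible full population of a million balances: pick a bin for each
## account with probabilities f, draw uniformly within the bin, then add a
## little normal noise to smooth the piecewise-constant histogram
bin <- sample(1:k, 1e6, replace = TRUE, prob = f)
pop <- runif(1e6, breaks[bin], breaks[bin + 1]) + rnorm(1e6, 0, 50)

Repeat the rdirichlet/population step many times, recomputing whatever statistic you care about each time, to see how much your uncertainty about the whole population matters.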

Basically, this scheme allows you to put Bayesian probabilities over the unknown frequency histogram of a large but finite population from which you have taken a random sample. If you have models for bias in your sampling, you can include that model through appropriate perturbations to the parameter vector of the Dirichlet.

 

Floating point arithmetic and probability

2016 July 18
by Daniel Lakeland

Kit Joyce asks about how floating point arithmetic interacts with random number generation.

I thought about this for a while, and I think this is a good place to show the utility of concepts from nonstandard analysis and Internal Set Theory in particular. Here's how my reasoning goes.

First off, rather than using measure theory, we'll imagine the process of choosing a random number uniformly in the interval [0,1] as choosing one of a finite but nonstandard number of rational numbers, and then taking the standard part. So, we'll divide up the interval [0,1] into a bunch of possible outcomes r_i as follows:

r_i = 0 + i dx

where dx = 1/N for some nonstandard N, and so we are essentially choosing a random integer between 0 and N and then mapping that process into the interval [0,1]. To choose a standard number you take the standard part of r_i.

Now, what's the probability of getting a value between a and b, where a < b and both are standard numbers in [0,1]? It's N(a,b)/N, where N(a,b) is the number of possible outcomes that result in a standard number between a and b. Since there is one outcome in every interval of length dx, N(a,b) \ge (b-a)/dx. If you select a nonstandard number outside the interval [a,b] that is infinitesimally close to a or to b, it will also "round off" via the standard part to a or to b, hence the \ge. Still, N(a,b)/N = ((b-a)/dx)/N + K/N \cong b-a, because the number of outcomes in the halo of a or b outside the interval [a,b] is what I'm calling K, and it is infinitesimal relative to N, so K/N \cong 0.

Now, I think this maps pretty directly to the idea of generating a random 64 bit integer, mapping it to the rational numbers between 0 and 1 that are dx=1/2^{64} apart, and then modeling the "standard part" operation by the "round off to nearest floating point" operation. And, so long as N(a,b)/K is big, the roundoff error doesn't introduce any problems, and that's true whether the floating point numbers between a and b are evenly or unevenly spaced.

So, when do we have a problem? Mainly when we're trying to figure out the probability of landing in some region that is small relative to the machine epsilon of the floating point numbers. For example, suppose you're trying to model the extreme tail of the normal distribution by generating floats in the range 0.99999999999 to 1.0 and then transforming them by the inverse cumulative distribution to their normal quantiles... you are probably doing something wrong! If you want to generate normal variates in the extreme tail, it's better to do something like rejection sampling, where you generate the deviates directly in the outcome space and don't have big gaps caused by the roundoff that happens when generating numbers between 0.99999999999 and 1.0.
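
To make the problem concrete, here's a small R illustration; the tail sampler at the end is a rejection scheme along the lines of Marsaglia's classic method, not anything from Kit's post:

## doubles just below 1 are spaced 2^-53 apart, so the inverse-CDF trick can
## only ever produce a few coarsely spaced values in the far tail
u <- 1 - (1:5) * 2^-53      # the five largest doubles below 1
qnorm(u)                    # only a handful of distinct tail values, all near 8.0-8.2
qnorm(1 - 2^-53)            # about 8.2: the largest normal deviate this scheme can ever produce

## rejection sampling directly in the outcome space has no such gap; a sampler
## for the tail x > a of the standard normal:
rtail <- function(n, a) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    repeat {
      x <- sqrt(a^2 + 2 * rexp(1))   # exponential proposal for x^2/2 beyond a^2/2
      if (runif(1) < a / x) { out[i] <- x; break }
    }
  }
  out
}
summary(rtail(1000, 6))              # deviates beyond 6 sigma, no roundoff gaps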

 

On U shapes and NHST...

2016 June 28
by Daniel Lakeland

Gelman posts about a paper in which social scientists run a quadratic regression and discover that it produces a U shape!!! Amazing. The story is that too much talent on your sports team will lead to poorer performance. Further evidence that many researchers' understanding of function approximation and basis expansion is seriously flawed.

Uri and Leif weren't able to get the data, but were able to get the original researchers to run a test of a hypothesis. The idea is they take their quadratic regression, find the maximum or minimum point, and split the data at that point, running two linear regressions, and testing whether the sign of the slope is different. This is a bit better than assuming the quadratic means there really is a U shape, but basically still wrong. The problem is NHST of a slope against zero isn't the right way to figure out whether you have a U shape. In fact NHST isn't the right way to do much of anything (except test the output of PRNG algorithms).

I hear you asking: Ok Dan, well then, how should we do this right?

First off, Uri and Leif have some data on competition vs patents and claim that this is a good testbed for their method. I disagree because I don't think number of patents is a good measure of "creativity" and there are serious problems with that analysis even before you have the data. The sports data on the other hand is pretty good conceptually, but the authors didn't provide the data so I can't just show you the kinds of stuff you ought to do...

So I'll have to resort to just telling you. The first thing is that you need a system which is sufficiently flexible to fit whatever nonlinear shape your data "really" has. With the sports data the x axis is logically limited to the range [0,1] (0 to 100% of the team is high-talent). This is a bounded interval, and polynomials are a complete basis for continuous functions on a bounded interval. That means, with sufficiently high order polynomials we can fit any nonlinear shape. However, with noisy measurements, we also have a problem of potentially over-fitting. Furthermore, we'll do better for mathematical reasons if we use orthogonal polynomials.

But, a complete orthogonal basis isn't necessarily the best way to go, especially in a Bayesian analysis. For example, we have the idea from sports fans who haven't seen the data that we should get a diminishing returns, horizontal asymptote type result. So, perhaps we should include such a family in our analysis. Unfortunately, I couldn't get Chebyshev polynomials into Stan because the Stan authors didn't really understand them even though someone DID do the work to include them. So, let's set up a model for y vs x using regular old non-orthogonal polynomial regression.

 y = a_0 + \frac{h}{(1+\exp(-r (x-x_0)))} + \sum_{i=1}^5 a_ix^i

We're going to expect some correlations between the various coefficients, because we don't have an orthogonal system, but Stan is pretty good at working that out; in the end all we care about are the function shapes, not the coefficients anyway. The second term represents a sigmoid function which has horizontal asymptotes at 0 and h and which transitions between them in the region of x_0. The polynomial represents a general purpose correction to this basic shape. It would be even better to have Chebyshev polynomials and to give strong priors to the coefficients a_i, for reasons I won't go into (but they do involve a less-correlated posterior distribution for the coefficients, which is easier to sample from). With the sigmoid function doing a reasonable job of representing our core expectation, we can at least put zero-mean priors on the a_i, as we wouldn't be too surprised to find that large polynomial corrections aren't necessary.

Ok, so now, pry the data out of the researchers' fingers and fit the model in Stan. Then take a sample of 100 of the coefficient vectors and plot the curves in a spaghetti plot. Next, calculate the derivative of this function (if you haven't done calculus in a while that's ok, use Maxima or ask Wolfram Alpha or something similar) and spaghetti plot the derivative in a separate plot. It'll be noisier, as the derivative is less constrained by the data, but we've regularized our result by truncating at 5th order polynomials and we can regularize further by putting strong priors on the a_i. Looking at the spaghetti plot of the derivative: does it almost always have a positive value on the left and then dive down to a negative value on the right?
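
Here's a sketch of that plotting step in R, assuming you've already extracted a matrix post of posterior draws from Stan with (illustrative) column names a0, h, r, x0, a1, ..., a5:

f_curve <- function(x, p) {
  sig  <- 1 / (1 + exp(-p["r"] * (x - p["x0"])))
  poly <- sapply(1:5, function(i) p[paste0("a", i)] * x^i)
  p["a0"] + p["h"] * sig + rowSums(poly)
}
f_deriv <- function(x, p) {            # derivative of the curve with respect to x
  sig   <- 1 / (1 + exp(-p["r"] * (x - p["x0"])))
  dpoly <- sapply(1:5, function(i) i * p[paste0("a", i)] * x^(i - 1))
  p["h"] * p["r"] * sig * (1 - sig) + rowSums(dpoly)
}

x      <- seq(0, 1, length.out = 200)            # fraction of high-talent players
draws  <- post[sample(nrow(post), 100), ]        # 100 posterior coefficient vectors
curves <- apply(draws, 1, function(p) f_curve(x, p))
derivs <- apply(draws, 1, function(p) f_deriv(x, p))

matplot(x, curves, type = "l", lty = 1, col = rgb(0, 0, 0, 0.2),
        xlab = "fraction high-talent", ylab = "performance")
matplot(x, derivs, type = "l", lty = 1, col = rgb(0, 0, 0, 0.2),
        xlab = "fraction high-talent", ylab = "d(performance)/dx")
abline(h = 0, lty = 2)
mean(derivs[length(x), ] < 0)   # fraction of draws with a negative slope at the top end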

If you have those plots of the curve and its derivative, and they really do produce a fairly consistent change in the sign of the derivative, THEN you have reasonable evidence for a transition to negative slope. The posterior probability of a curve shape can be directly assessed in these plots. There's no need to "test" your hypothesis using an NHST test that is appropriate for assessing whether the coefficients are the output of an RNG. But, if you do find that out of the 100 spaghetti derivative plots all but 1 to 5 of them have entirely negative values at the high end, then, yes you have relatively strong evidence for your decreasing performance at the high end.

The fact is, the posterior distribution is what summarizes our uncertainty. A binary decision about rejecting a linear regression on the upper end is NOT the same thing for the same reasons that Uri and Leif give about why the quadratic isn't good as an indicator of a U shape.

What should the analysis of acupuncture for allergic rhinitis have looked like

2016 June 14
by Daniel Lakeland

Ok, so I'm down on the NHST analysis of the acupuncture-for-allergic-rhinitis trial. And I probably am not going to get ahold of the data to re-do the analysis, but what kinds of things SHOULD be done if the data were actually available?

First off, let's acknowledge that the goal of all of this is to make people feel and function better. There is therefore a latent "how good do I feel" variable, which we can only measure imperfectly, and which is our real subject of interest. There are two ways they tried to measure this variable. The first is the RQLQ (symptom) score, which says how bad the person perceives their symptoms, and the second is the RMS (medicine usage) score, which tells how much the person used medication to combat the symptoms. However, the RMS score is also a measure of a kind of forcing function to reduce symptoms. The medications available were Cetirizine and oral steroids, and a diary was kept by each patient, so we actually know how much medicine of what type was used on every day of the study.

So, let's imagine a function S(t) for "symptoms". We have some causal process we think is going on in which S increases through time when exposure to pollen occurs but eventually saturates (fortunately pollen rarely actually makes your whole body combust). We have some treatments thought to reduce the symptoms, including medications which are taken according to the patients own choice, based on a tolerance for symptoms, and in addition, sham acupuncture, or real acupuncture. Real acupuncture is given either in period t \in [0,8] or t \in [8,16] weeks and symptoms assessed at t=0, t=8, t=16, t=60. Furthermore, we have individual patient beliefs about whether sham or real acupuncture was being applied, which were assessed after their 3rd session with needles.

The model should therefore be that S(t) is observed with error via the RQLQ assessment, and furthermore that the choice to take various medications at time t is based on a combination of S(t) and individual tolerance for the symptoms (so that M(t) gives us information about the patient's S(t)). If M(t) is the medication dosage function (dose per day) then M(t) = M(S(t), Tol), and also dS/dt = F(M(t), Ac(t), Sa(t), B, P(t)) for each patient separately, where M is medicine dose, Ac is acupuncture dose, Sa is sham acupuncture dose, B is belief in whether they are getting sham or real acupuncture, and P(t) is pollen exposure, which we might imagine as a constant for each clinical center.

Put all of those ideas together with some simplifying mathematical assumptions, analyze with a Bayesian computational software program like Stan, and discover how strongly Ac, Sa, and M affect symptoms. Compare the effect Ac/M and Sa/M averaged across people to get an answer to the question "what is the average effect of Acupuncture measured relative to the average effect of taking Cetirizine on patient Symptoms, and how does it compare to sham acupuncture?"
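
To be concrete about that generative story, here's a toy, discretized version for one patient in R. This is not the trial's actual model; every constant, name, and functional form below is an assumption made up for illustration:

days <- 1:420                          # daily grid out to the week-60 assessment
S <- numeric(length(days)); S[1] <- 2  # latent symptom level
M <- numeric(length(days))             # daily medication use
Ac  <- as.numeric(days <= 56)          # real acupuncture given in weeks 0-8
Sa  <- numeric(length(days))           # no sham acupuncture for this patient
P   <- 1                               # pollen exposure, constant at this clinical center
Tol <- 3                               # this patient's tolerance for symptoms
k_p <- 0.3; k_m <- 0.5; k_ac <- 0.2; k_sa <- 0.1; S_max <- 6

for (t in 2:length(days)) {
  M[t] <- as.numeric(S[t - 1] > Tol)               # M(t) = M(S(t), Tol): medicate when symptoms exceed tolerance
  dS   <- k_p * P * (1 - S[t - 1] / S_max) -       # pollen pushes S up but saturates
          k_m * M[t] - k_ac * Ac[t] - k_sa * Sa[t] # medication and (sham) acupuncture push it down
  S[t] <- max(0, S[t - 1] + dS)
}

The RQLQ assessments at weeks 0, 8, 16 and 60 would then be modeled as noisy observations of S on those days, and fitting the k coefficients across all patients in Stan tells you how strongly each treatment moves the symptoms.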

 

NO NO NO NO...

2016 June 13
by Daniel Lakeland

From this paper on a Randomized Controlled Trial of acupuncture for allergic rhinitis

We therefore started with noninferiority tests of change in RQLQ score and concluded that real acupuncture was noninferior if the left limit of the covariance-based, 2-sided 95% CI surrounding the between-group difference between real and sham acupuncture was greater than the noninferiority margin of −0.5 point (12). If we showed noninferiority, we repeated the analysis for the RMS outcome using a noninferiority margin of −1.5 points. This threshold was chosen on the basis of a review that we performed of unpublished RQLQ and RMS data suggesting a rough equivalence of scales at a ratio of 1:3, so the RQLQ threshold of 0.5 point translated to an RMS threshold of 1.5 points. If this procedure also showed noninferiority, we tested for superiority and concluded that real acupuncture was superior to sham acupuncture if at least one of the Bonferroni-adjusted, analyses of covariance–based, 2-sided 97.5% CIs surrounding the between-group difference in RQLQ score and RMS was completely greater than zero.

This is how NHST turns people's brains to mush. In this study both the acupuncture group and the "medication only" group got acupuncture; it just happened at different times. They have assessments at 4 time points: t=0 initial baseline, t=8 weeks (only the acupuncture group has had acupuncture), t=16 weeks (the acupuncture group had acupuncture between week 0 and 8, the medication group had acupuncture between t=8 and t=16), and again at week 60 (no treatment between t=16 and t=60 weeks). Clearly, what's needed is a time-series analysis of the score as a function of what treatments happened when, but instead, we get "noninferiority and superiority testing at a predefined threshold of ..." bullshit.

Unfortunately, their reproducible research statement says "Data set: Certain portions of the analytic data set are available to approved individuals through written agreements with the author or research sponsor." which means getting the data is a lot more complicated than just downloading it off an open public website.

 

5000-1 odds on Leicester, what does it mean?

2016 June 3
by Daniel Lakeland

In this comment at Gelman's Blog "Ney" asks "Does the 5000:1 mean that a team like Leicester would be expected to win the English championship only once within 5000 years?"

My answer to that is no. At least, if a Bayesian gives 5000:1 odds on something like a team winning a championship or a particular earthquake occurring or a particular financial event, it need not have any long-run frequency-of-occurrence interpretation. But what is the interpretation?

Bayesian probability under the Cox/Jaynes approach just says that different things have different degrees of credibility or plausibility or believability or whatever. In this system, probability is like energy: if we write energy in units where all the energy in the universe is 1 unit, then since energy is conserved, we can account for what fraction of the universe's energy there is in any one object. Same idea for Bayesian probability: what fraction of our credibility is associated with a particular value or small range of values?

So, we could imagine a whole series of events, say each week, there are some soccer games played, and then someone wins, and that means that different matches happen next week, and then someone wins, and etc etc etc. By the end of the season there is some enormous combinatorial explosion of different possible "paths" through the season. A 5000:1 odds for a Bayesian roughly means that of these N possible paths through the season 5000/5001 N of them have Leicester losing and 1/5001 N of them have Leicester winning. Now, it's not quite that simple, because there's no reason why each of the N possible paths have to have equal plausibility, so some paths might "count more" than others, but it's clear that we're not talking about what will happen in 5000 future seasons, we're talking about the weight of plausibility among all the different detailed ways in which the outcome LEICESTER WINS THIS SEASON could occur.

 

Shorthand terminology and personification of assumptions

2016 May 30
by Daniel Lakeland

In philosophical discussions it's often the case that we use a kind of shorthand which can be confusing. In many cases, we "personify" particular assumptions. That is, we sort of imagine a person who holds those assumptions as true and talk about what such a person would also logically believe. For example:

"A Frequentist has to believe that a moderate sized sample will fill up the high probability region of the sampling distribution."

or

"A Bayesian believes that if a moderate sized sample doesn't fill the data distribution it's not a problem for the model"

These and other statements like them are really shorthand for more complicated sentences such as:

"Once you assume that your distribution is the long run frequency distribution of IID repeated trials, this also implies that a moderate sized sample will have data points throughout the high probability region as can be seen by simulations"

or

"The logic of Bayesian inference does not require that we believe the data distribution is the frequency distribution, so if there is a portion of the high probability region which isn't occupied by any data in a moderate sized sample it doesn't invalidate the basic assumptions of the model"

Real people are free of course to have additional assumptions beyond the basics (for example a person doing a Bayesian analysis might actually want to fit a frequency distribution, so the model with gaps in the sampling would be considered wrong), or to acknowledge that their choice of distribution is intentionally approximate or regularized in a way that doesn't fully satisfy the basic assumption. But it's a useful construct to think about a person who takes their model at literal face value and then see what else that model logically implies they should believe. It helps to detect when additional assumptions are needed, and what they might be, or it also helps to detect when the basic assumptions of a model are contrary to reality.

As such, when talking about "Frequentists" vs "Bayesians" I'm really talking about what the math associated with pure forms of those kinds of analyses implies, not what particular real people actually think.

So much of Statistics is taught as formulas, procedures, heuristics, calculation methods, techniques etc that many people (myself included up to a few years ago) never really see an overarching organizing principle. Discussing those organizing principles explicitly can be very useful to help people with their future analyses.

Bayesian models are also (sometimes) Non-Reproducible

2016 May 17

Shravan at Gelman's blog writes "I moved away from frequentist to Bayesian modeling in psycholinguistics some years ago (2012?), and my students and I have published maybe a dozen or more papers since then using Bayesian methods (Stan or JAGS). We still have the problem that we cannot replicate most of our and most of other people’s results."

So, Bayesian modeling isn't a panacea. Somehow I think this has been the take-away message for many people from my recent discussions of the philosophy of what Bayes vs Frequentist methods are about. "Switch to Bayes and your models will stop being un-reproducible!"

So, I feel like I've failed (so far, in part), because that certainly isn't my intention. The only thing that will make your science better is to figure out how the world works. This means figuring out the details of the mechanism that connects some observables to other observables using some parameter values which are unobservable. If you're doing science, you're looking for mechanisms, causality.

But, I believe that Bayesian analysis helps us detect when we have a problem, and by doing Bayesian analysis, we can force a kind of knowledge of our real uncertainty onto the problem, that we can't do with Frequentist procedures!

That detection of un-reproducibility, and the ability to force realistic evaluations of uncertainty, both help us be less wrong.

Some Bayesian Advantages

So, what are some of the things that make Bayesian models helpful for detecting scientific hypotheses that are unreproducible or keep our models more realistic?

Consistency is Detectable:

First off, there's a huge flexibility of modeling possibilities, with a single consistent evaluation procedure (Bayesian probability theory).

For example, you can easily distinguish between the following cases (sketched in notation just after the list):

  1. There is a single approximately fixed parameter in the world which describes all of the instances of the data. (ie. charge of an electron, or vapor pressure of water at temperature T and atmospheric pressure P, or the maximum stack depth of linguistic states required for a pushdown automaton to parse a given sentence)
  2. There are parameters that describe individual sets of data and all of these parameters are within a certain range of each other (ie. basal metabolic rate of individual athletes, fuel efficiency of different cars on a test track).
  3. There is no stable parameter value that describes a given physical process (ie. basal metabolic rate of couch potatoes who first start training for a marathon, fuel efficiency of a car with an intermittent sensor problem, or, if you believe the Tracy and Beall paper that Andrew Gelman likes to use as an example, the shirt color preferences of women through time).
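
In sampling notation (just a sketch; x stands for the data and the normal distributions are placeholders), the three cases might look like:

(1) \quad x_{ij} \sim \mathrm{normal}(\mu,\sigma): a single universal \mu for every dataset j

(2) \quad x_{ij} \sim \mathrm{normal}(\mu_j,\sigma), \; \mu_j \sim \mathrm{normal}(\mu_0,\tau): per-dataset \mu_j clustered around a population value \mu_0

(3) \quad x_{ij}(t) \sim \mathrm{normal}(\mu_j(t),\sigma): with \mu_j(t) drifting in time, no stable value to estimate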

When Shravan says that his Bayesian models are unreproducible, what does that indicate?

Most likely it indicates that he's created a model like (1), where he hypothesizes a universal facet of language processing, fits it to data, and finds a parameter value; then, when he fits the model to different data, the parameter value isn't similar to the first fit. That is, the universality fails to hold. If, in fact, he were to use the posterior distribution from the first fit as the prior for the second, he'd find that the posterior widened, or maybe shifted somewhere the posterior from the first analysis wouldn't predict, once the additional data was taken into account.

Or, maybe he's got a model more like (2) where he hypothesizes that at least there's some stable parameter for each person but that all the people are clustered in some region of space. But, when he measures the same person multiple times he gets different parameter values for each trial, or when he measures different populations he winds up with different ranges of parameter values.

Or, maybe he has a model like (3) where he expects changes in time, but the kinds of changes he sees are so broad and have so little structure to them that they aren't really predictive of anything at all.

In each case, probability as extended logic gives you a clear mechanism to detect that your hypothesis is problematic. When you collect more data and it fails to concentrate your parameter estimates, or it causes the parameters to shift outside the region they had previously been constrained to,  it indicates that the model is inconsistent and can't explain the data sufficiently.

Bayesian models can be structured to be valid even when the data isn't an IID sample:

IID sampling is just way more powerful mathematically than the reality of running real experiments, so that assumption, implicit in a Frequentist test, will fool you into failing to realize the range of possibilities. In other words, Frequentism relies on you throwing away knowledge of alternatives that might happen under other circumstances ("let the data speak for themselves," which is sometimes attributed to Ronald Fisher but may not be something he really said).

Suppose instead you are analyzing performance of a linguistic test in terms of a Frequency distribution of events. The basic claim is that each event is an IID sample from a distribution F(x ; a,b) where x is the measured value, and a,b are some fixed but unknown parameters. Now, you do a test to see if a > 0 and you reject this hypothesis, so you assume a < 0 and you publish your result. First off, the assumption of a fixed F and IID sampling is usually a huge huge assumption. And, with that assumption comes a "surety" about the past, the future, about other places in the world, other people, etc. In essence what that assumption means is "no matter where I go or when I take the measurements all the measurements will look like samples from F and every moderate size dataset will fill up F". How is this assumption different from "no matter where I go all I know about the data is that they won't be unusual events under distribution P" which is the Bayesian interpretation?

Consider the following specific example: P = normal(0,10), and I go and get a sample of 25 data points, and I find they are all within the range [-1,1]. A Frequentist rejects the model of normal(0,10): any random sample of that size from normal(0,10) would almost certainly contain values in the ranges, say, [-15,-1] and [1,15], but this sample doesn't.
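
The arithmetic backs this up:

(pnorm(1, 0, 10) - pnorm(-1, 0, 10))^25   # about 3e-28: essentially impossible under IID normal(0,10)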

Does the Bayesian reject the model? No, not necessarily. Because the Bayesians know that a) the probability is in their head, based on the knowledge they have, whereas the Frequency is in the world, based on the laws of physics; and b) in many many cases there is no sense in which the data is a representative random sample of all the things that can happen in the world. So, even if Bayesians really do believe that P=F is a stable Frequency distribution across the full range of conditions, there's nothing to say that the current conditions couldn't all be within a small subset of what's possible. In other words, the Bayesian's knowledge may tell them only about the full range of possibilities, but not about the particulars of how this sample might vary from the full range of possibilities. Since Bayesians know that their samples aren't IID samples from F, they need not be concerned that the sample fails to fill up all of F. The only time they need to be concerned is if they see samples that are far outside the range of possibilities considered, like say x=100 in this example. That would never happen under normal(0,10).

To the Bayesian, x ~ normal(0,s); s ~ prior_for_s(constants) is very different from x ~ normal(0,constant). In the first case the scale s is unknown and has a range of possibilities which we can do inference on; the implicit assumption is that the data is informative about the range of s values. In the second case, the constant scale comes from theoretical principles and we're not interested in restricting the range down to some other value of s based on the data.

This can be important, because it lets the Bayesian impose his knowledge on the model, forcing the model to consider a wide range of possibilities for data values even when the data doesn't "fill up" the range. Furthermore, the Bayesian can be less dogmatic: they can provide informative priors that are not as informative as a delta function (ie. a fixed constant) but also will not immediately mold themselves to the particular sample. For example, s ~ gamma(100,100.0/10) says that the scale is close to 10 and this information is "worth" about 100 additional samples, so don't take the 25 samples we've got as fully indicative of the full range.
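
Drawing from that prior in R shows how tightly it pins the scale near 10 before any data arrive:

s_draws <- rgamma(1e5, shape = 100, rate = 100 / 10)
c(mean(s_draws), sd(s_draws))   # roughly 10 and 1: like having already seen ~100 observations of the scale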

The assumption of random sampling that comes out of Frequentist theory says that there's no way in hell that you could fail to fill up the Frequency distribution if your samples were 25 random ones. This means implicitly that the current data and any future data look "alike" for sufficiently large sample sizes. That's a very strong assumption, which only holds when you are doing random sampling from a finite population using an RNG. But all of the Frequentist tests rely on that assumption; without it you simply can't interpret the p values as meaningful at all. If you know that your sample is not a random sample from a finite population but just "some stuff that happened through time where people were recruited to take part in various experiments in various labs," then assuming that they are random samples from a fixed population says automatically "when n is more than a smallish number, like 25 to 50, there is no way in hell we've failed to sample any part of the possible outcomes."

The p values you get out of Frequentist tests are typically only valid when the data really IS an IID sample of the possibilities. So you're left with either being fooled by statistics, or being a Bayesian and imposing your knowledge onto your inferences.

 

High Dimensional Data/Params as functions

2016 May 16
by Daniel Lakeland

My wife had a problem where she had measurements of the expression level of thousands of genes under two different genotypes and two different time points of development. Even after subsetting these genes to those she was interested in, she still had the problem of understanding the differences in a multi-thousand-dimensional vector. I invented a plot where the genetic expression profile is described as a function of the rank of the expression under a reference condition. It's easier to understand if you've seen the data, so here's an example plot with the specifics of the labels removed (click the image to get a zoomed version).

The idea is that we're displaying the logarithm of normalized expression levels on the y axis, with the genes shown in a particular order. The order is always the rank-order under the reference condition +/- (heterozygous). That rank is preserved for all of the plots at both time points (early and late). So when you plot the alternate condition on the same graph, you simultaneously see how much expression there typically is for each gene, how much variation there is on average between genotypes, and how much of a difference there is for a particular gene compared to the reference genotype. Finally, plotting the later time point shows you that BOTH genotypes diverge from the original expression profile, but in a noisy way; there is still an overall curving upward trend which is somewhat similar to the original.

Now, suppose you have a high dimensional parameter vector in a Bayesian model. You could choose the rank of each parameter within the average parameter vector as the x values, and then plot that average parameter vector as the reference condition. Then, over that, you can spaghetti plot particular samples, which will show you how much variation there is as well as to some extent correlations in the variation, for example when parameter number 50 is higher than average, maybe parameter number 70 is often lower than average, so that the curve will have an upward blip at 50 and a downward blip at 70 for many of the spaghetti curves.
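
A sketch of that in R, assuming samples is an iterations-by-parameters matrix of posterior draws (for example from as.matrix on an rstan fit):

post_mean <- colMeans(samples)
ord <- order(post_mean)          # rank order under the "reference condition" (the posterior mean)

plot(post_mean[ord], type = "l", lwd = 2,
     xlab = "parameter rank (by posterior mean)", ylab = "parameter value")
for (i in sample(nrow(samples), 100)) {
  lines(samples[i, ord], col = rgb(0, 0, 0, 0.1))   # spaghetti of individual posterior draws
}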

This would be particularly useful if the parameter values were fairly well constrained; if they have a huge amount of variation, the spaghetti plot could get really really messy, even messier than the "late" curves in this plot, for example.

Another thing to note is that this plot will work a LOT better if you have dimensionless parameters which are scaled to be O(1), none of this "parameter 1 is in ft/s and parameter 2 is in dollars/hour and parameter 3 is in Trillions of Dollars per Country per Decade" stuff.

Generating a Random Seed

2016 May 10
by Daniel Lakeland

Suppose you want to hold a lottery, and you want it to be such that no one has a good way to influence the outcome of the lottery. What you need is a cryptographic seed for a strong random number generator.

A strong RNG can be made by encrypting a counter using a block cipher in CBC mode, starting at a randomly chosen initial number with a randomly chosen key, or by using a strong stream cipher (a cipher that basically outputs one bit at a time).

But how do you get the initial seed? There are many ways, but one thing you want is that no one person can fiddle with it too much. You also want transparency: you aren't going to rely on some compiled function to do what you need, because that requires people to audit the function, something most people can't do.

If your audience understands cryptography, you can simply execute on the Linux command line:

dd if=/dev/random bs=1024 count=10 | sha256sum

and read off the resulting hash value. Otherwise, a reasonably analog way to create initial vectors for valuable lotteries would be to roll 10 different twenty-sided dice 10 times, read them off into a text file, and then cat the file into sha256sum. If the dice are perfectly symmetrical, those 100 rolls generate around 430 bits of entropy, which gets compressed down to the 256 you need. Or, you could roll regular 6 sided dice, say 10 at a time about 10 times, to get around 258 bits of entropy, which still fills up the SHA256 hash value.
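
A sketch of the dice version in R, assuming the digest package for SHA-256 (the rolls shown are made up, and you'd want 100 of them):

library(digest)
100 * log2(20)    # about 432 bits of entropy from 100 fair d20 rolls
100 * log2(6)     # about 258 bits from 100 fair d6 rolls
rolls <- c(17, 3, 20, 8, 11, 6, 14, 2, 19, 9)   # ...and 90 more in practice
digest(paste(rolls, collapse = " "), algo = "sha256", serialize = FALSE)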

Once you've done this, you can then set about randomly choosing grants to fund, a step up vs the BS we currently do 😉