Bayesian models are also (sometimes) Non-Reproducible

2016 May 17

Shravan at Gelman's blog writes "I moved away from frequentist to Bayesian modeling in psycholinguistics some years ago (2012?), and my students and I have published maybe a dozen or more papers since then using Bayesian methods (Stan or JAGS). We still have the problem that we cannot replicate most of our and most of other people’s results."

So, Bayesian modeling isn't a panacea. Somehow I think this has been the take-away message for many people from my recent discussions of the philosophy of what Bayes vs Frequentist methods are about. "Switch to Bayes and your models will stop being un-reproducible!"

So, I feel like I've failed (so far, in part), because that certainly isn't my intention. The only thing that will make your science better is to figure out how the world works. This means figuring out the details of the mechanism that connects some observables to other observables using some parameter values which are unobservable. If you're doing science, you're looking for mechanisms, causality.

But, I believe that Bayesian analysis helps us detect when we have a problem, and by doing Bayesian analysis, we can force a kind of knowledge of our real uncertainty onto the problem, that we can't do with Frequentist procedures!

That detection of un-reproducibility, and the ability to force realistic evaluations of uncertainty, both help us be less wrong.

Some Bayesian Advantages

So, what are some of the things that make Bayesian models helpful for detecting scientific hypotheses that are unreproducible or keep our models more realistic?

Consistency is Detectible:

First off, there's a huge flexibility of modeling possibilities, with a single consistent evaluation procedure (Bayesian probability theory).

For example, you can easily distinguish between the following cases:

  1. There is a single approximately fixed parameter in the world which describes all of the instances of the data. (ie. charge of an electron, or vapor pressure of water at temperature T and atmospheric pressure P, or the maximum stack depth of linguistic states required for a pushdown automaton to parse a given sentence)
  2. There are parameters that describe individual sets of data and all of these parameters are within a certain range of each other (ie. basal metabolic rate of individual atheletes, fuel efficiency of different cars on a test track).
  3. There is no stable parameter value that describes a given physical process (ie. basal metabolic rate of couch potatoes who first start training for a marathon, fuel efficiency of a car with an intermittent sensor problem, or if you believe that Tracy and Beall paper Andrew Gelman likes to use as an example, the shirt color preferences of women through time).

When Shravan says that his Bayesian models are unreproducible, what does that indicate?

Most likely it indicates that he's created a model like (1) where he hypothesizes a universal facet of language processing, and then when he fits it to data and finds his parameter value, if he fits the model to different data, the parameter value isn't similar to the first fit. That is, the universality fails to hold. If, in fact, he were to use the posterior distribution of the first fit as the prior for the second, he'd find that the posterior widened or maybe shifted somewhere that the posterior from the first analysis wouldn't predict when the additional data was taken into account.

Or, maybe he's got a model more like (2) where he hypothesizes that at least there's some stable parameter for each person but that all the people are clustered in some region of space. But, when he measures the same person multiple times he gets different parameter values for each trial, or when he measures different populations he winds up with different ranges of parameter values.

Or, maybe he has a model like (3) where he expects changes in time, but the kinds of changes he sees are so broad and have so little structure to them that they aren't really predictive of anything at all.

In each case, probability as extended logic gives you a clear mechanism to detect that your hypothesis is problematic. When you collect more data and it fails to concentrate your parameter estimates, or it causes the parameters to shift outside the region they had previously been constrained to,  it indicates that the model is inconsistent and can't explain the data sufficiently.

Bayesian models can be structured to be valid even when the data isn't an IID sample:

IID sampling is just way more powerful mathematically than the reality of running real experiments, so that assumption implicit in a Frequentist test, will fool you into failing to realize the range of possibilities. In other words, Frequentism relies on you throwing away knowledge of alternatives that might happen under other circumstances ("let the data speak for themselves" which is sometimes attributed to Ronald Fisher but may not be something he really said)

Suppose instead you are analyzing performance of a linguistic test in terms of a Frequency distribution of events. The basic claim is that each event is an IID sample from a distribution F(x ; a,b) where x is the measured value, and a,b are some fixed but unknown parameters. Now, you do a test to see if a > 0 and you reject this hypothesis, so you assume a < 0 and you publish your result. First off, the assumption of a fixed F and IID sampling is usually a huge huge assumption. And, with that assumption comes a "surety" about the past, the future, about other places in the world, other people, etc. In essence what that assumption means is "no matter where I go or when I take the measurements all the measurements will look like samples from F and every moderate size dataset will fill up F". How is this assumption different from "no matter where I go all I know about the data is that they won't be unusual events under distribution P" which is the Bayesian interpretation?

Consider the following specific example: P = normal(0,10), and I go and I get a sample of 25 data points, and I find they are all within the range -1,1. A Frequentist rejects the model of normal(0,10). Any random sample from normal(0,10) would contain values in the ranges say -15,-1 and 1,15 but this sample doesn't.

Does the Bayesian reject the model? No, not necessarily. Because the Bayesians know that a) the probability is in their head based on the knowledge they have, whereas the Frequency is in the world based on the laws of physics. And, b) in many many cases there is no sense in which the data is a representative random sample of all the things that can happen in the world. So, even if Bayesians really do believe that P=F is a stable Frequency distribution across the full range of conditions, there's nothing to say that the current conditions couldn't be all within a small subset of what's possible. In other words, the Bayesian's knowledge may tell them only about the full range of possibilities, but not about the particulars of how this sample might vary from the full range of possibilities. Since Bayesians know that their samples aren't IID samples from F they need not be concerned that it fails to fill up all of F. The only time they need to be concerned is if they see samples that are far outside the range of possibilities considered. Like say x=100 in this example. That would never happen under normal(0,10);

To the Bayesian, x ~ normal(0,s); s ~ prior_for_s(constants) is very different than x ~ normal(0,constant); In the first case the scale s is unknown and has a kind of range of possibilities which we can do inference on. The assumption is implicitly that the data is informative about the range of s values. In the second case, the constant scale comes from theoretical principles and we're not interested in restricting the range down to some other value s based on the data.

This can be important, because it lets the Bayesian impose his knowledge on the model, forcing the model to consider a wide range of possibilities for data values even when the data doesn't "fill up" the range. Furthermore, it's possible to be less dogmatic for the Bayesian, they can provide informative priors that are not as informative as the delta-function (ie. a fixed constant) but also will not immediately mold themselves to the particular sample. For example s ~ gamma(100,100.0/10) says that the scale is close to 10 and this information is "worth" about 100 additional samples, so don't take the 25 samples we've got as fully indicative of the full range.

The assumption of random sampling that comes out of Frequentist theory says that there's no way in hell that you could fail to fill up the Frequency distribution if your samples were 25 random ones. This means implicitly that the current data and any future data look "alike" for sufficiently large sample sizes. That's a very strong assumption which only holds when you are doing random sampling from a finite population using an RNG. But all of the Frequentist tests rely on that assumption. Without that assumption you simply can't interpret the p values as meaningful at all. If you know that your sample is not a random sample from a finite population but just "some stuff that happened through time where people were recruited to take part in various experiments in various labs" then assuming that they are random samples from a fixed population says automatically "when n is more than a smallish number, like 25 to 50, there is no way in hell we've failed to sample any part of the possible outcomes"

Since the p values you get out of Frequentist tests are typically only valid when the data really IS an IID sample of the possibilities. You're left with either being fooled by statistics, or being a Bayesian and imposing your knowledge onto your inferences.


High Dimensional Data/Params as functions

2016 May 16
by Daniel Lakeland

My wife had a problem where she had measurements of the expression level of thousands of genes under two different genotypes and two different time points of development. Subsetting these genes into those she was interested in, she still had the problem of understanding the differences in a multi-thousand-dimensional vector. I invented a plot where the genetic expression profile was described in terms of a function of the rank of the expression under a reference condition. It's easier to understand if you've seen the data, so here's an example plot with the specifics of the labels removed (click the image to get a zoomed version)

The idea is that we're displaying the logarithm of normalized expression levels on the y axis, with the genes shown in a particular order. The order they're shown in is always the rank-order under the reference condition +/- (heterozygous). That rank is preserved for all of the plots at both time points (early and late). So when you plot the alternate condition on the same graph, you get simultaneously a view of how much expression there is typically for this gene, how much variation there is on average between genotypes, and how much of a difference there is for the particular gene compared to the reference genotype. Finally, plotting the later time point shows you that BOTH genotypes diverge from the original expression profile, but in a noisy way, there is still an overall curving upward trend which is somewhat similar to the original.

Now, suppose you have a high dimensional parameter vector in a Bayesian model. You could choose the rank of each parameter within the average parameter vector as the x values, and then plot that average parameter vector as the reference condition. Then, over that, you can spaghetti plot particular samples, which will show you how much variation there is as well as to some extent correlations in the variation, for example when parameter number 50 is higher than average, maybe parameter number 70 is often lower than average, so that the curve will have an upward blip at 50 and a downward blip at 70 for many of the spaghetti curves.

This would be particularly useful if the parameter values were fairly well constrained, if they have a huge amount of variation, then the spaghetti plot could get really really messy, even messier than the "late" curves in this plot, for example.

Another thing to note is that this plot will work a LOT better if you have dimensionless parameters which are scaled to be O(1), none of this "parameter 1 is in ft/s and parameter 2 is in dollars/hour and parameter 3 is in Trillions of Dollars per Country per Decade" stuff.

Generating a Random Seed

2016 May 10
by Daniel Lakeland

Suppose you want to hold a lottery, and you want it to be such that no one has a good way to influence the outcome of the lottery. What you need is a cryptographic seed for a strong random number generator.

A strong RNG can be made by encrypting a counter using a block cipher in CBC mode starting at a randomly chosen initial number with a randomly chosen seed. Or by using a strong stream cipher (a cipher that outputs one bit-at-a-time basically)

But how to get the initial seed? There are many ways, but one of the things you want is that no one person can fiddle around with it too much. And also, you want transparency, you aren't going to rely on some compiled function to do what you need because that requires people to audit the function something most people can't do.

If your audience understands cryptography. You can simply execute on the Linux command line dd if=/dev/random bs=1024 count=10 | sha256sum

and read off the resulting hash value. Otherwise reasonably analog ways to create initial vectors for valuable lotteries would be to roll 10 different twenty-sided dice 10 times and read them off into a text file, then cat the file into sha256sum, if the dice are perfectly symmetrical they'd generate around 400 bits of entropy which would be compressed down to the 256 you need. Or, you could roll regular 6 sided dice, say 10 at a time about 10 times, to get around 258 bits of entropy which still fills up the SHA256 hash value.

Once you've done this, you can then set about randomly choosing grants to fund, a step up vs the BS we currently do 😉

Confusion of definitions Bayesian vs Frequentist (vs yet a third category)

2016 May 9
by Daniel Lakeland

Generally I consider it unhelpful to argue about definitions of words. When that arises I think it's important to make your definitions clear, and then move on to substantive arguments about things that matter (like the logic behind doing one thing vs another).

So, what makes a calculation Bayesian vs Frequentist and is there any other category to consider? Here are my definitions:

Bayesian Inference: The calculation begins from a position where probability defines how much you know about an aspect of the world, and proceeds to calculate the consequences of what you know together with what you observed (your data) using the sum and product rule of probability to arrive at a new explicit description of a state of knowledge. Analogy is that in Prolog you use Aristotelian logic to compute true statements from facts you supply, and in Stan you use Bayesian logic to compute samples from output distributions from data and your input distributions.

Frequentist Inference: The calculation begins with a description of the data collection or generation process as repeated sampling from a collection of possibilities, and proceeds to calculate how often something might happen if you do more sampling in the future, and how that relates to some unknown aspect of the collection. (Wikipedia definition seems to agree qualitatively)

Those are the definitions I'm using. You may well believe these are not good definitions. But I think these are principled definitions, in that they capture the essence of the metaphysical ideas (metaphysics: "a traditional branch of philosophy concerned with explaining the fundamental nature of being and the world that encompasses it" wikipedia)

The problem arises in that a lot of classical statistics is kind of "neither" of these, or maybe "a mishmash of both". My take on that is more or less as follows:

  1. There are Frequentist procedures which are very clearly Frequentist. For example the chi-square, Kolmogorov-Smirnov, Anderson-Darling, Shapiro-Wilks, and similar tests. For a hard-core set of tests we can look at the Die Harder tests for Random Number Generators which explicitly try to find ways in which RNGs deviate from uniform and IID. These things really do answer the question "does this sample look like a sample from an RNG with given distribution?"
  2. There are Bayesian models, in which people choose a model based on what they know about a process (physics, chemistry, biology, economics, technological design of measurement instruments, etc) and then turn that into a joint distribution over the data and some parameters that describe the process, typically via two factors described as a likelihood and a prior. In this case, the question of whether we can reject the idea that the data came from a particular distribution on frequency grounds does not enter into the calculation.
  3. There are a large number of people who do things like what's done in this paper where they construct a likelihood in what I would call a Bayesian way based on their knowledge of a process (in this case, it's a point process in time, detection of gram-negative bacteria in a hospital test, they know that the average is not necessarily the same as the variance, so they choose a default likelihood function for the data, a negative binomial distribution which has two parameters, instead of a Poisson distribution with only one but there emphatically is NOT some well defined finite collection of negative-binomially distributed events that they are sampling from). They then take this likelihood function and treat it as if it were describing a bag of data from which the actual data were drawn, and do Frequentist type testing to determine whether they can reject the idea that certain explanatory parameters they are thinking of putting into the model could be dropped out (what you might call a "Frequentist tuning procedure"). In the end they choose a simplified model and maximize the likelihood (a Bayesian calculation).

Since 1 and 2 are pretty straightforward, the real question is how to understand 3? And, probably the best way to understand 3 is that 3 is what's done when you've been classically trained and therefore have no real critical idea of the principles behind what you're doing. That is, you know about "what is done" but not "why". Note, that doesn't make it "wrong" necessarily, it's just that the people who do it rarely have a guiding principal other than "this is what I know how to do based on my education".

On the other hand, I think it's possible to argue that there is a principle behind 3, I just think that most people who do 3 aren't much aware of that principle. The basic principle might be "data compression". These kinds of models start with a Bayesian model in which a likelihood is picked based not on sampling from some known population but instead "what is known about where the data might lie" and then this model is tuned with the goal of giving a sufficiently short bit-length approximation to a Bayesian model that has good enough Frequency properties to reduce your bit cost of transmitting the data.

Then, if you collect additional data in the future, and it continues to look a lot like the past, awesome, you can send that data to your friends with low bandwidth costs. For example, instead of say a 16 bit integer per observation, you might be able to send the data using only on average 9 bits per observation, with no approximation or loss of information relative to the 16 bit version.

In this sense, (3) isn't really an inference procedure at all, it's a tuning procedure for a channel encoding. If you do some Frequentist tests, and decide to drop factor X from your model, it's not a commitment to "factor X = 0 in the real world" it's more an approximation to "Factor X doesn't save me much on data transmission costs if I had to send a lot more of this kind of data"

Although there are a certain number of "information theoretic" statisticians who actually think of this explicitly as a guiding principle, I don't think this is the position you'd get if you asked the authors of that paper what their motivation was for the statistical procedures they used.

Is 3 Bayesian inference? is 3 Frequentist inference? the answer is no and no. 3 is not inference at all, it's communications engineering. But, it does use Bayesian ideas in the development of the likelihood, and it does use Frequentist ideas in the evaluation of the channel bandwidth. I think some of those principled information theoretic statisticians may have an argument that it is some kind of inference, in other words, that the word inference should include this type of stuff, but I don't know much about those arguments and I don't really care to argue about the definitions of words. Communications Engineering seems to be a perfectly good description of this process, the question is, does it accomplish what you need to accomplish? In most scientific questions, I doubt it.

Again, thanks to ojm who keeps pushing back and making me think about things in more depth.

PS: the careful, principled version of this stuff is called "Minimum Message Length" modeling and Wikipedia has a nice intro. In essence it takes a Bayesian model, breaks it down to a discretized version, making decisions about the precision of the discretization which are ultimately based on a loss function related to Frequentist ideas (since that's what's relevant to transmission of a message). The version performed by many classically trained statisticians such as the ones in the article from PLoS One is more or less an ad-hoc version.

The probability density vs a sequence of random numbers

2016 May 6
by Daniel Lakeland

How can we map back and forth between (nonstandard) density functions, and random number sequences?

Let x_i be a finite sequence of numbers of length L with L a nonstandard integer. Let -N,-N+dx,..-N+i dx,..N be a grid of values on the real line with N a nonstandard integer and dx = 1/N so that there are 2N^2+1 grid point values.

Let L/N^2 be nonstandard (L is ULTRA ULTRA ULTRA BIG, so that there are roughly "a nonstandard number of values for each grid point").

Consider a "randomly chosen" permutation of the integers from 1..L where we intuitively understand somehow how this is done (ie. we pick one of the permutations intuitively). Call this sequence c_i (for choice) then imagine choosing at random from the sequence x_i by choosing the integers in order i=1..L calculating c_i and choosing the appropriate x_{c_i}. Assert without proof that x_{c_i} is an IID random number generator for some distribution when c is chosen "well enough". What distribution?

For each value, determine which grid point it falls in, and increment a counter for that grid point until we get to the end of the sequence. Call this counter n(x). Construct the nonstandard function n(x)/L this is the nonstandard probability density associated with x_i. When the standardization of this function exists, use it. Otherwise, perhaps you have something like a delta-function, where all the values are infinitesimally close to a single value, or some such thing. Those are useful too so we don't necessarily require a standard density. The fact that we include nonstandard densities means we are able to actually represent arbitrary probability measures, where the measure of an open set S is \int_S n(x)/L dx

To reverse this process, start with the function n(x)/L, multiply the function by L then for each grid point append a string of n values equal to x on our sequence. Once we have the full sequence x_i we can "sample from it" by choosing a permutation of 1..L.

Now, let's canonicalize our sequences, namely if we have a sequence x_i we can sort it in ascending order. Now for every function n(x)/L we can find a canonical x_i of length L.

How many different RNG sequences of length L are there for a given density function? There are one for each permutation, or (L!). How big is that? Remember Stirling's approximation: \log(N!) = N \log(N) - N which is already quite accurate at N = 3. So there are about \exp(L \log(L) - L) different RNGs corresponding to the same distribution. We could perhaps require only permutations that leave no element in place, called the derangements, there are about L!/e of those so that'd put us at \exp(L\log(L)-L-1)

Are all of these really "random number generators" that will satisfy Per Martin-Lof? Probably not. But a big fraction of them will be. Imagine that an oracle tells us Martin-Lof's "most powerful test" for a binary random number generator. Take the subset of the permutations which cause the sequence x_{c_i} to pass that test at the \alpha level, when x_i is the sequence {0,0,\ldots,0,1,1\ldots,1} where the first half of the sequence is 0 and the second half 1. Constrain our choice of c_i to be one of these sequences, it's still going to be a big big number of sequences (asserted without proof).

So, it's always possible to take a probability measure (a nonstandard density) and convert it into many different sequences of IID samples from that probability measure (a random number generator) and back and forth.

Mathematically, this means in essence that "you can sample from any probability distribution". But this doesn't mean that placing a probability distribution over real world data D implies that you think D is a sample from that distribution. As a Bayesian, you are free to reject that. The probability involved is in your head or at least assumed for the purpose of calculation (ie. in the head of the Jaynesian Robot, or of MC Stan).  As Jaynes says: "It is therefore highly illogical to speak of 'verifying' (3.8 [the Bernoulli urn equation]) by performing experiments with the urn; that would be like trying to verify a boy's love for his dog by performing experiments on the dog." - E.T. Jaynes, Probability Theory


Frequentist vs Bayesian reasoning, the essence

2016 May 5
by Daniel Lakeland

A jeweler reaches into his desk drawer and pulls out 3 photographs of diamonds on velvet backgrounds, he is very proud of the workmanship in cutting these expensive diamonds. "What do you think of that?" He says to his two friends, both of whom are statisticians.

Bayesian: Well, I know something about photography and light and diamonds, so I know that if I were to take the digital photos and apply a Gaussian blur to them then in each of the photographs the center of the diamond would be one of the pixel values that had at least brightness B. (Plausibility of the location is measured by brightness)

Frequentist: I see that the diamonds are at x1,y1 and x2,y2 and x3,y3 so if I assume all of the photographs in your desk drawer are of diamonds on velvet backgrounds, I can reject the hypothesis that the diamonds are on average in the center of those other photographs. (Repeated sampling would almost never produce 3 photos in a row with the diamond this far off center)

Jeweler: Yes, yes, but which one should I give to my wife for our anniversary?



Fourier Series and Transform using NSA

2016 May 3
by Daniel Lakeland

So, we've been talking about NSA and Fourier series. Some of you are probably really familiar with the whole idea of Fourier series and/or transforms, others might not be. Hopefully this post will help those who haven't ever really figured out what Fourier series or transforms are. Maybe some Economists who currently are reading papers about seasonal corrections to timeseries will find it interesting or something. Anyway let's get started. (The TL;DR summary is at the end)

First off, imagine you have a periodic function f(x) with period p. An example is the Bangkok daylight hours function from the seasonal trends post. The trig functions \sin(2\pi k (x/p)) and \cos(2\pi k (x/p)) are also periodic with period p for all integers k. Furthermore, they are "orthogonal" which is to say

\int_0^p \sin(2\pi k (x/p)) \cos(2 \pi k (x/p)) dx = 0

You can think of this in a couple of ways... this is the definition of the "timewise covariance" of two timeseries (the average value of the product through time).

The vector space view, Fourier Series for periodic functions, with a discrete set of standard frequencies:

It's also possible to imagine the cos and sin functions as vectors in a very large vector space, let's say one with N a nonstandard number of dimensions. Then let s_k and c_k be two different vectors, whose i'th entries are:

 s_k[i] = \sin(2\pi k (i/N)), c_k[i] = \cos(2\pi k (i/N))

Now what's the dot product of these vectors?

\sum_{i=1}^N s_k[i] c_k[i]

This is the sum of a nonstandard number of appreciable values, to make this thing have a limited value we'd better multiply it by dx = p/N and as soon as you do that, you see it's the nonstandard definition of the integral mentioned above.

\sum_{i=1}^N s_k[i] c_k[i] dx = \int_0^p \sin(2\pi k x/p) \cos(2\pi k x/p) dx

In other words, we can define a "dot product" on functions by using the covariance integral. Since the two functions have nonzero length, their dot product is only zero because the angle between them is \pi/2 or 90^\circ (the dot product of two vectors a,b is |a||b|\cos(\theta) and if a,b have nonzero length then \cos(\theta)=0 is the only solution. In an N dimensional vector space, there are N different orthogonal directions, and the sin and cos functions fill up all those dimensions (there's a sin and cos function for each k from 0 to N/2 that defines one of the directions in this vector space).

So, periodic functions are elements of a vector space, and the sin,cos functions span that space, we can create any periodic function by linear combinations of sin and cos. If you're a mathematician you now come out of the woodwork to bop me on the head about Hilbert spaces, and how we don't get pointwise convergence. Specifically you could have just a few of those N dimensions fail to converge. Imagine the function \cos(2\pi x) except when x = \sqrt(2)+2\pi n for any integer n, the function equals 97.3. This little blip is an irritation only, for example its contribution to any integral is always infinitesimal since it only occurs for an infinitesimal interval, and the closest standard function (in terms of sum of squared differences) to this function is the usual \cos(2\pi x) so we identify and collect together all the nonstandard vectors with their closest standard function in the vector space, and when we do that we can span the whole space of standard periodic functions with the sin and cos functions with frequencies that are integer multiples of 1 cycle per distance p f_i = i/p

Fourier Transforms: nonperiodic functions using a discrete set of infinitesimally close frequencies

Fine and dandy. How about non-periodic functions? Like our old friend the quadratic exponential radial basis function (a non-normalized normal curve)

\exp\left(-\frac{x^2}{2}\right )

Well, we can still do the whole vectorspaceification (yes it's a real word, one I just made up) of functions but now instead of on the interval [0,p] we do it on the interval [-N,N] for N a nonstandard integer, and we space our samples a distance dx = 1/N apart, so we've got 2N/(1/N) = 2N^2 different dimensions in our vector and two separate types of functions, so we need 2N^2/2 = N^2 frequencies. And we define the same kind of dot product on this vector space... and we can again span the vector space with sin and cos functions that are mutually near-orthogonal (infinitesimal covariance integral).

\sum_{i=0}^{N^2} \sin(2\pi f_1 (-N+i dx) \cos(2\pi f_2 (-N + i dx)) dx = \int_{-\infty}^{\infty} \sin(2\pi f_1 x) \cos(2\pi f_2 x) dx \cong 0

for any two different positive standard frequencies f1 \ne f2 from among the appropriate nonstandard sequence of frequencies. Which frequencies do we have? I think the usual way to think of it is f_i = -N/2 + i/N for i \in 0\ldots N^2. Intuitively, that is basically from minus infinity to infinity in increments of 1/N.

And, we can approximate an arbitrary function g(x) by taking its dot product with any of our basis vectors. And constructing the function:

 g^*(x) = \sum_{i=0}^{N^2}\left [ \left (\int g(x) \sin(2\pi f_i x)dx\right ) \sin(2\pi f_i x) + \left (\int g(x) \cos(2\pi f_i x)dx\right ) \cos(2\pi f_i x) \right ] df

This constructs the function as a nonstandard sum of infinitesimal contributions from each frequency. The sum is itself the definition of an integral in frequency space. Multiplying the sum by df normalizes the sum properly. It has N^2 terms, and df = O(1/N). This means that the coefficients (the \int g(x) \sin(..)dx parts) have to decline fast enough in order to ensure we get a near-standard function.

Remember also that we don't get precise pointwise convergence. In particular, at discontinuities we're going to get problems, there are a variety of technicalities. The point of this blog post is more or less to give you the following basic ideas:


  • Functions of a single variable (or even of many variables) could be considered like vectors with a nonstandard number of dimensions.
  • Linear algebra works fine in these large vector spaces, and we can compute dot products.
  • We can approximate functions by linear combinations of basis vectors (basis functions). The initially most popular one was the Fourier basis, sin, and cos
  • For periodic functions, you can construct a Fourier series. The frequencies are on a discrete grid corresponding to integer multiples of a standard base frequency.
  • For non-periodic functions you can construct a Fourier integral, the frequencies are on a discrete grid corresponding to integer multiples of an infinitesimal base frequency.
  • All of it is hand-wavy here, and I'm sure I am probably mussing up one or two technicalities but in order to use these ideas fruitfully you don't usually need to know all the technicalities. If you do need to look up technicalities, you can start with this intuitive view and map your ideas to the exposition in reference books on the subject. When they say "integral" you think "nonstandard sum"... etc.
  • In particular, I hope this makes clear why we might not want to do things like use "monthly dummy variables" or "weekly dummy variables" to approximate seasonal effects. When we do that, the dot product of our timeseries with our basis function is basically just the average value during the time of interest. Since the width of our dummy function is standard, we can't capture the standard sized changes that occur during that time interval.
  • Another way to think about the issue is that the dot product with the Fourier basis uses information about how the function changes over the whole periodic interval, whereas the dot product with the dummy "window" function only uses local information to that time window.


What does "seasonal" mean?

2016 May 2
by Daniel Lakeland

So in followup to the ideas about "seasonal" variations, there's a fundamental philosophical question here. What is a seasonal variation?

Clearly, length of day in Bangkok is determined by the geometry of the earth and its spin about its axis and tilt relative to the orbit around the sun. Purely by being at a particular point on the earth and at a particular point in the orbit there will be a particular period of the day when light falls on Bangkok.

But in general, there are relatively regular but less strictly regular variations in things, such as for example the amount of cheese produced each month:

It rises and falls throughout the year but it's not periodic, more like quasi-periodic with a clear trend and a clear change in the pattern that occurs slowly in time, over a timescale of 5 to 10 years.

In general, you can break down every function into a part that changes "slowly" and a part that oscillates "quickly". There's a theorem that is related to this idea by Cartier and Perrin published in "Nonstandard Analysis In Practice" (Diener and Diener 1995). For the gist, consider the following idea.

Let g(x) be a standard, bounded periodic function of x with period p such that \int_0^p g(x) = 0. Suppose that it's bounded by |g(x)| < B for all x. It is, basically a Fourier type series of \sin and \cos terms with no constant term.

Now consider the nonstandard function \hat g(x) = g(Npx) for N a nonstandard integer.

Consider the integral over any standard length L

\int_0^L \hat g(x) dx = \left ( \int_0^{Kp} g(s) ds + \int_{Kp}^{Kp+\epsilon} g(s)ds \right ) dx/ds

Where s = Npx and K= \lfloor (NpL/p)\rfloor

On the right, inside the parens, the first term is zero due to the construction of g(x) as a zero mean oscillating periodic function, and the second term is bounded by \epsilon B. What is the conversion factor dx/ds ? A small distance ds in the domain of the g function represents a distance ds/Np in the x domain of \hat g(x), so the conversion factor is 1/Np suggesting that the little bit at the end of the integral is bounded by B\epsilon/Np which is infinitesimal.

So this nonstandard function is a function which when integrated over any appreciable domain takes on an infinitesimal value. In other words, although it is pointwise very different from zero by as much as B which could be any appreciable number like 100 for example, over any observable distance its average is "essentially zero".

This is the essence of homogenization theory, and we don't need the function to be periodic necessarily, it just simplifies the construction here.

The point is, it's often natural to talk about two different timescales, a "slow" process t1 and a "fast" process t2 such that a model for what goes on at the scale of dt2 will average out to zero over times that are appreciable in the scale of t1. In other words, "daily" fluctuations in temperature (scale t2) are irrelevant to modeling "annual" changes in seasonal temperature averages (scale t1). When these scales are sufficiently far apart we can consider one scale the "seasonal" scale and another the "deviations".

The problem comes when "deviations" last for appreciable times on the "seasonal" scale, and now it's hard to separate the timescales meaningfully.


Philly vs Bangkok, thinking about seasonality in terms of fourier series

2016 May 2
by Daniel Lakeland

Andrew Gelman linked to Uri Simonsohn and Joe Simmons' blog post on analyzing things in terms of seasonal effects. The idea is that seasonal effects sometimes need to be removed before you can look at whether A and B are related. The season is more or less a common cause for both A and B and it's only the deviations from the seasonality which can be used to see if there is something about A and B together.

The techniques considered were basically weekly and even daily dummy variables and they have very nice graphs to show how those work (or don't work). They point out how poorly those do the job. In essence when you do that you're trying to reconstruct a continuous series as if it were a series of step functions. As we know, that can work well when you make the step functions small enough (that's the basic idea behind integration!) but for data analysis it's a poor way to go. As you get down to the daily dummy variables, you now have as many predictors as you have things to predict! Not a good idea, even if you do use some kind of prior on their values to reduce the "effective" degrees of freedom.

When handling a timeseries that has this kind of repetitive pattern, a better choice is a basis expansion in continuous functions, and for seasonality, in periodic functions. What a basis expansion does is it gives you regularization and representation. Regularization in the sense that you can control how complex the basis expansion is, and representation in that taken to a high enough order you can approximate pretty much any function. Also, a Fourier series will converge to this kind of near-periodic continuous function typically very rapidly (the coefficients usually decline exponentially fast asymptotically for high-order coefficients) which means that the number of things you need to estimate is relatively very small (you can truncate the series at just a few terms). That is part of why it gives you regularization.

Here is the result of trying their "predicting the day duration in Bangkok" example (data is here) just using the regular old least-squares function "lm" in R:

lm(formula = day_duration_bangkok ~ temp_today_philly + sin(2 * 
    pi * day/365) + cos(2 * pi * day/365) + sin(2 * pi * day * 
    2/365) + cos(2 * pi * day * 2/365) + sin(2 * pi * day * 3/365) + 
    cos(2 * pi * day * 3/365) + sin(2 * pi * day * 4/365) + cos(2 * 
    pi * day * 4/365), data = dat)

      Min        1Q    Median        3Q       Max 
-0.016515 -0.004719  0.000195  0.004538  0.015385 

                            Estimate Std. Error   t value Pr(>|t|)    
(Intercept)                1.461e+01  1.939e-03  7534.174  < 2e-16 ***
temp_today_philly          1.556e-05  3.399e-05     0.458    0.647    
sin(2 * pi * day/365)     -5.004e-03  5.841e-04    -8.567  < 2e-16 ***
cos(2 * pi * day/365)     -8.446e-01  7.502e-04 -1125.795  < 2e-16 ***
sin(2 * pi * day * 2/365) -1.533e-03  3.561e-04    -4.306 1.89e-05 ***
cos(2 * pi * day * 2/365)  1.153e-01  3.493e-04   330.162  < 2e-16 ***
sin(2 * pi * day * 3/365)  2.042e-04  3.455e-04     0.591    0.555    
cos(2 * pi * day * 3/365) -1.764e-02  3.556e-04   -49.588  < 2e-16 ***
sin(2 * pi * day * 4/365)  4.626e-04  3.455e-04     1.339    0.181    
cos(2 * pi * day * 4/365)  4.176e-03  3.490e-04    11.965  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0066 on 720 degrees of freedom
Multiple R-squared:  0.9999,    Adjusted R-squared:  0.9999 
F-statistic: 6.773e+05 on 9 and 720 DF,  p-value: < 2.2e-16


I created an extra variable "day" by doing as.numeric(as.Date(data$date,format="%m/%d/%Y")). This allows me to use it in the sin, and cos functions.

As you can see, the Fourier series to 4th order (8 coefficients plus the overall mean) has completely hammered out any seasonal effect, and the regression with the temperature in Philadelphia is now an effect on the order of 10^{-5} hours / degree F (note it's important to state the units here!) that has no statistical significance... the effect predicted by the mental model where seasonality is a common cause.


Recommended Reading IST

2016 April 30
by Daniel Lakeland

There's a fabulous, inexpensive Dover text by Alain Robert that gives a very accessible introduction to nonstandard analysis (NSA) through Internal Set Theory (IST). I wish there were an e-book version, but alas there isn't.

There is however an e-book version of another useful accessible book on nonstandard analysis  (by Henle and Kleinberg) that comes at the subject from a more "traditional" viewpoint (based on Abraham Robinson's original construction). It's so cheap you have to buy it right now. It spends more time on applications than on the constructions required to define infinitesimals etc.

For someone interested in applied mathematics, the IST version of NSA is extremely accessible, but the techniques are similar in both Robinson's version and IST once you get past the Robinsonian foundations (which require heavy-hitting mathematical logic).

In any case, if you build mathematical models using calculus, or you do statistics, you should know something about NSA because it "algebraifies" analysis. One advantage to that is it brings in lots of possibilities for Computer Algebra. Another advantage to NSA is that it has hugely enriched function spaces. Things that "ought" to be functions but in standard mathematics aren't, like delta functions, are perfectly reasonable nonstandard objects.

In many areas of "standard" mathematics, we have a sequence of "finite" objects which miraculously transforms into a totally different TYPE of object in the limit. In NSA that isn't the case.