# Philly vs Bangkok, thinking about seasonality in terms of fourier series

Andrew Gelman linked to Uri Simonsohn and Joe Simmons' blog post on analyzing things in terms of seasonal effects. The idea is that seasonal effects sometimes need to be removed before you can look at whether A and B are related. The season is more or less a common cause for both A and B and it's only the deviations from the seasonality which can be used to see if there is something about A and B together.

The techniques considered were basically weekly and even daily dummy variables and they have very nice graphs to show how those work (or don't work). They point out how poorly those do the job. In essence when you do that you're trying to reconstruct a continuous series as if it were a series of step functions. As we know, that can work well when you make the step functions small enough (that's the basic idea behind integration!) but for data analysis it's a poor way to go. As you get down to the daily dummy variables, you now have as many predictors as you have things to predict! Not a good idea, even if you do use some kind of prior on their values to reduce the "effective" degrees of freedom.

When handling a timeseries that has this kind of repetitive pattern, a better choice is a basis expansion in continuous functions, and for seasonality, in *periodic functions*. What a basis expansion does is it gives you regularization and representation. Regularization in the sense that you can control how complex the basis expansion is, and representation in that taken to a high enough order you can approximate pretty much any function. Also, a Fourier series will converge to this kind of near-periodic continuous function typically very rapidly (the coefficients usually decline exponentially fast asymptotically for high-order coefficients) which means that the number of things you need to estimate is relatively very small (you can truncate the series at just a few terms). That is part of why it gives you regularization.

Here is the result of trying their "predicting the day duration in Bangkok" example (data is here) just using the regular old least-squares function "lm" in R:

Call: lm(formula = day_duration_bangkok ~ temp_today_philly + sin(2 * pi * day/365) + cos(2 * pi * day/365) + sin(2 * pi * day * 2/365) + cos(2 * pi * day * 2/365) + sin(2 * pi * day * 3/365) + cos(2 * pi * day * 3/365) + sin(2 * pi * day * 4/365) + cos(2 * pi * day * 4/365), data = dat) Residuals: Min 1Q Median 3Q Max -0.016515 -0.004719 0.000195 0.004538 0.015385 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.461e+01 1.939e-03 7534.174 < 2e-16 *** temp_today_philly 1.556e-05 3.399e-05 0.458 0.647 sin(2 * pi * day/365) -5.004e-03 5.841e-04 -8.567 < 2e-16 *** cos(2 * pi * day/365) -8.446e-01 7.502e-04 -1125.795 < 2e-16 *** sin(2 * pi * day * 2/365) -1.533e-03 3.561e-04 -4.306 1.89e-05 *** cos(2 * pi * day * 2/365) 1.153e-01 3.493e-04 330.162 < 2e-16 *** sin(2 * pi * day * 3/365) 2.042e-04 3.455e-04 0.591 0.555 cos(2 * pi * day * 3/365) -1.764e-02 3.556e-04 -49.588 < 2e-16 *** sin(2 * pi * day * 4/365) 4.626e-04 3.455e-04 1.339 0.181 cos(2 * pi * day * 4/365) 4.176e-03 3.490e-04 11.965 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.0066 on 720 degrees of freedom Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999 F-statistic: 6.773e+05 on 9 and 720 DF, p-value: < 2.2e-16

I created an extra variable "day" by doing as.numeric(as.Date(data$date,format="%m/%d/%Y")). This allows me to use it in the sin, and cos functions.

As you can see, the Fourier series to 4th order (8 coefficients plus the overall mean) has completely hammered out any seasonal effect, and the regression with the temperature in Philadelphia is now an effect on the order of hours / degree F (note it's important to state the units here!) that has no statistical significance... the effect predicted by the mental model where seasonality is a common cause.

Andrew Gelman wanted the "display" output.

~~Which I don't like as much in this case. This is an astronomical calculation and has standard errors on coefficients of 10^-4, it really does have that kind of precision, unlike the models he's used to fitting which are to more fuzzy data like economics or voting.~~Andrew points out you can give an argument "digits" to get higher precision, display(mod,digits=5) shown below.

I should also point out that the "dummy variable approach" is itself a basis expansion, it's just a basis expansion in local, discontinuous, constant functions (step functions). A basis expansion in general has the following form (a linear combination of other functions).

y(x) = sum(a[i] * f_i(x))

if you consider f_i to be the function Ind(xlow,xhigh,x) (the function that is 1 when x between xlow and xhigh) then you'd get the "monthly dummy" or "weekly dummy" variable approach.

if you use sin, and cos for the f functions, you get the fourier approach.

if you use something like f_i = exp(-((x-x[i])/c[i])^2) you get the "quadratic radial basis function" approach...

if you use something like a polynomial multiplied by an indicator function, you get a piecewise polynomial type approach (like say the cubic splines)

in general, if you have a periodic function you want to use Fourier series, if you have an infinitely smooth function you want Fourier series. If your function is on a finite domain and smooth you might want the Chebyshev polynomials (which is kind of a Fourier series transformed) if you have a function that can have kinks in it you might want something like cubic splines. If you have functions that have multiple steps in it, you could do well with the "dummy variable" approach (step functions).

Very cool! Have you also tried dynamic linear models (state space models)?

I've built a lot of ODE models, and I've worked on systems that have an unobserved state, but I don't know whether they classify as "dynamic linear models" per se.

The idea that something can be described by a hidden state that determines the observations is a powerful general purpose concept though. For example, you might describe the Bangkok timeseries in terms of the orbital mechanics of the planet, with its tilted axis and the wobble in that axis and etc etc none of which is observed directly, only the consequences for the day-night lengths.

This is really cool! Is there any intuition for understanding the plot of the fitted model values versus the temp_today_philly? The oval drawn around the points is intriguing, but I don't know if there's really any valuable information there in this toy example.

I think there is a small typo in the date conversion to day (the format part):

as.numeric(as.Date(data$date,format="%y/%m/%Y")) should be

as.numeric(as.Date(data$date,format="%m/%d/%Y"))

Typo fixed, thanks.

I'm not quite sure what plot you're referring to, is it

plot(fitted.values(model) ~ temp_today_philly,data=data)

?

Since the model fits really well, that's more or less just looking at the trend between daylight hours and temp in philly. This is the false trend that the original problem was trying to eliminate.

You can remove temp_today_philly from the model, re-fit, and then plot the residuals against temp_today_philly and see how the trend is totally removed by the Fourier series.

I should have included the plotting code! And if I did include it, we'd have realized my tab complete had me plotting the "temp_hist_philly" as my independent variable rather than "temp_today_philly." So now I'm seeing what I expected: the fitted values essentially overlapping the data. And the residuals versus temp_today_philly (with temp_today_philly removed from the model) showing no trend.

Thanks!

One more question: do the coefficients of the components to the Fourier series have any meaning? Or is this strictly a way to remove the seasonality and focus on the influence of the other predictors?

The coefficients do carry some information. From my post on Fourier analysis you can see that the sin and cos functions fill up (span) a vector space. The square root of sum of squares of the coefficients then defines a "euclidean length" in that vector space. That Euclidean length is the same as the square root of the integral(f^2(x),dx) over the periodic domain. It defines sort of "how much variability around the average" there is.

Also the cos, sin coefficients can be paired together into a complex number a+bi and from that you can get a phase angle. So the fact that the cos terms dominate here shows you that the calendar day (the result of as.numeric(as.Date(...)) is aligned with the variation in the length of the day very well. The function is nearly "even" (symmetrical about 0) with respect to a calendar period starting day = 0 on Jan 1....