Philly vs Bangkok: thinking about seasonality in terms of Fourier series
Andrew Gelman linked to Uri Simonsohn and Joe Simmons' blog post on analyzing data with seasonal effects. The idea is that seasonal effects sometimes need to be removed before you can ask whether A and B are related. The season is more or less a common cause of both A and B, and only the deviations from seasonality can tell you whether there is something connecting A and B directly.
The techniques they considered were basically weekly and even daily dummy variables, and they have very nice graphs showing how those work (or don't) and how poorly they do the job. In essence, with dummy variables you're trying to reconstruct a continuous series as if it were a series of step functions. As we know, that can work well when you make the steps small enough (that's the basic idea behind integration!), but for data analysis it's a poor way to go. By the time you get down to daily dummy variables, you have as many predictors as you have things to predict! Not a good idea, even if you do put some kind of prior on their values to reduce the "effective" degrees of freedom.
When handling a time series with this kind of repetitive pattern, a better choice is a basis expansion in continuous functions, and for seasonality, in periodic functions. A basis expansion gives you both regularization and representation: regularization in the sense that you control how complex the expansion is, and representation in that, taken to a high enough order, it can approximate pretty much any function. Moreover, for a smooth near-periodic function like this, the Fourier series typically converges very rapidly (the high-order coefficients usually decay exponentially fast), so the number of things you need to estimate is relatively small: you can truncate the series after just a few terms. That is part of where the regularization comes from.
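To make the truncated expansion concrete, here is a minimal sketch (not from the original post) of building a Fourier design matrix for a yearly cycle; the helper name fourier_basis is my own invention:

```r
## Sketch: truncated Fourier basis for a yearly cycle.
## fourier_basis() is a hypothetical helper, not from the original post.
fourier_basis <- function(day, K = 4, period = 365) {
  ## one sin/cos pair per harmonic k = 1..K
  cols <- lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * day / period),
          cos(2 * pi * k * day / period))
  })
  X <- do.call(cbind, cols)
  colnames(X) <- as.vector(rbind(paste0("sin", seq_len(K)),
                                 paste0("cos", seq_len(K))))
  X
}

X <- fourier_basis(1:730)   # two years of daily data
dim(X)                      # 730 rows, 2 * K = 8 columns
```

Truncating at K = 4 means the whole seasonal shape costs only 8 coefficients, no matter how many days of data you have, which is the regularization-by-truncation point above.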
Here is the result of trying their "predicting the day duration in Bangkok" example (data is here) just using the regular old least-squares function "lm" in R:
Call:
lm(formula = day_duration_bangkok ~ temp_today_philly + sin(2 * pi * day/365) +
    cos(2 * pi * day/365) + sin(2 * pi * day * 2/365) + cos(2 * pi * day * 2/365) +
    sin(2 * pi * day * 3/365) + cos(2 * pi * day * 3/365) + sin(2 * pi * day * 4/365) +
    cos(2 * pi * day * 4/365), data = dat)

Residuals:
      Min        1Q    Median        3Q       Max
-0.016515 -0.004719  0.000195  0.004538  0.015385

Coefficients:
                            Estimate Std. Error   t value Pr(>|t|)
(Intercept)                1.461e+01  1.939e-03  7534.174  < 2e-16 ***
temp_today_philly          1.556e-05  3.399e-05     0.458    0.647
sin(2 * pi * day/365)     -5.004e-03  5.841e-04    -8.567  < 2e-16 ***
cos(2 * pi * day/365)     -8.446e-01  7.502e-04 -1125.795  < 2e-16 ***
sin(2 * pi * day * 2/365) -1.533e-03  3.561e-04    -4.306 1.89e-05 ***
cos(2 * pi * day * 2/365)  1.153e-01  3.493e-04   330.162  < 2e-16 ***
sin(2 * pi * day * 3/365)  2.042e-04  3.455e-04     0.591    0.555
cos(2 * pi * day * 3/365) -1.764e-02  3.556e-04   -49.588  < 2e-16 ***
sin(2 * pi * day * 4/365)  4.626e-04  3.455e-04     1.339    0.181
cos(2 * pi * day * 4/365)  4.176e-03  3.490e-04    11.965  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0066 on 720 degrees of freedom
Multiple R-squared:  0.9999,	Adjusted R-squared:  0.9999
F-statistic: 6.773e+05 on 9 and 720 DF,  p-value: < 2.2e-16
I created an extra variable "day" by doing as.numeric(as.Date(data$date, format="%m/%d/%Y")), which allows me to use it inside the sin and cos functions.
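Putting the pieces together, the whole fit can be sketched as below. Since I can't reproduce their CSV here, the data frame is synthetic stand-in data (an assumption; its column names match the regression output above, and the day-of-year conversion shown in the comment is the one described in the text):

```r
## Sketch of the full pipeline with synthetic stand-in data.
## With the real dataset you would instead read their CSV and build
## dat$day <- as.numeric(as.Date(dat$date, format = "%m/%d/%Y"))
set.seed(1)
day <- 1:730
dat <- data.frame(
  day = day,
  temp_today_philly    = 55 + 25 * cos(2 * pi * (day - 200) / 365) + rnorm(730, 0, 5),
  day_duration_bangkok = 14.61 - 0.845 * cos(2 * pi * day / 365) + rnorm(730, 0, 0.006)
)

fit <- lm(day_duration_bangkok ~ temp_today_philly +
            sin(2 * pi * day / 365)     + cos(2 * pi * day / 365) +
            sin(2 * pi * day * 2 / 365) + cos(2 * pi * day * 2 / 365) +
            sin(2 * pi * day * 3 / 365) + cos(2 * pi * day * 3 / 365) +
            sin(2 * pi * day * 4 / 365) + cos(2 * pi * day * 4 / 365),
          data = dat)
summary(fit)
```

The model has 10 estimated coefficients in total: the intercept, the Philly temperature, and the 8 Fourier terms.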
As you can see, the Fourier series to 4th order (8 coefficients plus the overall mean) has completely hammered out the seasonal effect, and the coefficient on the temperature in Philadelphia is now on the order of 10^-5 hours per degree F (note that it's important to state the units here!) with no statistical significance: exactly what you'd expect under the mental model where seasonality is a common cause.