Boobies. There, I had to say it. This is a post about boobies, and math, and consulting with experts before making too many claims.

In this clickbait article, which I found while searching Google News for unrelated topics, I see that some "medical anthropologists" are claiming that bras seem to cause breast cancer (not a new claim; their book came out in 1995, but I guess their push against the scientific establishment has been reignited). At least part of this conclusion seems to be based on this observation from their PDF:

Dressed To Kill described our 1991-93 Bra and Breast Cancer Study, examining the bra wearing habits and attitudes of about 4,700 American women, nearly half of whom had had breast cancer. The study results showed that wearing a bra over 12 hours daily dramatically increases breast cancer incidence. Bra-free women were shown to have about the same incidence of breast cancer as men, while those who wear a bra 18-24 hours daily have over 100 times greater incidence of breast cancer than do bra-free women. This link was 3-4 times greater than that between cigarettes and lung cancer!

They further claim "bras are the leading cause of breast cancer."

That's pretty shocking data! I mean, really? Now, according to http://seer.cancer.gov/statfacts/html/breast.html there are about 2 million women in the US living with breast cancer, and about 12% will be diagnosed at some point in their lives. There are around 150 million women in the US overall, so $P(BC) = 2/150 = O(0.01)$.

However, in their sample $P(BC) = 0.5$. That's about 50 times the background rate (OK, 37.5 if you do the math precisely).
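A quick sanity check of this arithmetic, using the SEER numbers above:

```python
# Back-of-the-envelope check of the rates quoted above (SEER numbers).
living_with_bc = 2e6      # US women currently living with breast cancer
us_women = 150e6          # total US women, roughly
background = living_with_bc / us_women
sample_rate = 0.5         # about half of their ~4,700 surveyed women had breast cancer

print(background)                # ~0.013
print(sample_rate / background)  # ~37.5: their sample is enormously enriched for cancer
```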

Doesn't it seem plausible that in winnowing through the roughly 1% of women who are living with breast cancer, or even the 5 or 6 percent who have been diagnosed in the past and are still alive (figure that about half of the women alive today who will ever be diagnosed have already been diagnosed), they could have introduced a bias in whether or not their sample wears bras?

So "looking for cancer patients causes us to find bra-wearing women" is maybe the more likely story here. Or perhaps "cancer patients who didn't wear bras were overwhelmingly more likely to have died of their breast cancer, so we couldn't find any of them"? That's somehow not as reassuring to the non-bra-wearers in the audience, I think.

Symbolically, their claim is that among surveyed (i.e. living) women,

$P(Alive \& BC \& NBra) = \frac{1}{100} P(Alive \& BC \& Bra).$

Expanding both sides,

$P(Alive | BC \& NBra) P(BC | NBra) P(NBra) = \frac{1}{100} P(Alive | BC \& Bra) P(BC | Bra) P(Bra).$

Pretend $BC$ and $Bra$ are independent, so the $P(BC | \cdot)$ factors cancel, and we conclude

$P(Alive | BC \& NBra) = \frac{1}{100} P(Alive | BC \& Bra) \frac{P(Bra)}{1 - P(Bra)}.$

In other words, if $P(Bra) \approx 0.9$, not wearing a bra reduces your chance of surviving breast cancer by a factor of 10 or so. Put on those bras, ladies! The exact opposite of their conclusion!
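For concreteness, here's that arithmetic in Python, assuming (my assumption, for illustration) that 90% of women wear bras:

```python
# Survival implication of their claimed 100x incidence difference among respondents,
# assuming P(Bra) = 0.9 (an illustrative assumption, not their number).
p_bra = 0.9
incidence_ratio = 1 / 100   # claimed bra-free vs heavy-wearer incidence in the survey

# P(Alive | BC & NBra) / P(Alive | BC & Bra)
survival_ratio = incidence_ratio * p_bra / (1 - p_bra)
print(survival_ratio)  # ~0.09: non-wearers would be ~10x less likely to be alive
```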

I personally suspect something else spurious in their research. But nothing in their PDF convinces me that they know what they are doing.

Note that Wikipedia has some discussion of their book.

A friend of mine posted a link to news articles about a recent Surgeon General warning about sunscreen

http://www.cnn.com/2014/07/29/health/surgeon-general-skin-cancer/

Quoting from that article:

Skin cancer is on the rise, according to the American Cancer Society, with more cases diagnosed annually than breast, prostate, lung and colon cancer cases combined.

On Tuesday, the United States surgeon general issued a call to action to prevent the disease, calling it a major public health problem that requires immediate action. Nearly 5 million people are treated for skin cancer each year.

But let's dissect this a little bit. First of all, most skin cancer is unlikely to really hurt you. It's melanoma that is the real concern.

cancer.gov gives melanoma rates according to the following graph:

Melanoma diagnosis and death rates through time from seer.cancer.gov

As for melanoma itself, clearly the number of diagnosed cases is on the rise, but look at the deaths per year: totally flat. This is consistent with improved diagnosis procedures without any change in actual incidence in the population. Furthermore, looking deeper on the melanoma site, we see that 5-year survival rates have increased from 82% in 1975 to 93% in 2006. This is also consistent with earlier diagnosis (so that the 5-year survival clock starts at an earlier point relative to the initiation of the melanoma).

How about melanoma by state? Climate should be involved right? More sun exposure should mean more melanoma?

Melanoma rate by state

As you can see, states in the southwest have lower rates, and states in the northwest and northeast have higher rates. The cluster of southeastern states with high rates is interesting too.

Vitamin D is suspected to be involved in several health effects related to cancer, so overall exposure to sun may be beneficial. However, I think the data are also clear that intense exposure to sun from tanning beds, intentional tanning, and repeated sunburn is bad for your skin.

Sun exposure, like other exposures such as alcohol, may have nonlinear risk effects. At moderate exposures you are better off than at zero exposure (in Oregon rain or Maine winters) or heavy exposure (leading to repeated sunburn and high melanoma risk).

So, is the advice to slather on sunscreen good? I don't think the conclusion is so clear-cut, but I don't have any in-depth study data to go on either. All I can tell you is that I'll continue to avoid sunburn by covering up and using zinc-based sunblock when I'm outside for long periods, but I'll continue to get regular sun exposure without sunblock in mornings, evenings, and midday during non-summer months.

A serious error in refereeing occurred at the beginning of the second half of the 2014 World Cup final, when German goalkeeper Neuer challenged Gonzalo Higuain in a reckless and dangerous manner, resulting in a collision that is absolutely shocking to watch. I suspect that Higuain only survived this collision because he was unaware that it was coming and did not tense up (his eyes are facing away from Neuer and on the ball the whole time). Videos abound of this event, such as this one (EDIT: takedown requests have made getting good video more difficult; much of what is available is people glorifying this event by replaying it over and over, set to aggressive music).

After this collision, a foul was given in Neuer's favor (i.e. committed by Higuain??) and Germany took possession of the ball. The charitable interpretation is that the referee simply didn't see what happened, and therefore applied a heuristic that the goalkeeper gets the benefit of the doubt. The alternative is actual misconduct by the referee.

Uninformed claptrap abounds on the internet, with people claiming that this was not a foul by Neuer against Higuain, or that it was not worthy of a red card. Fortunately, there are rules to the game, and they can be looked up, such as on the FIFA website's interpretation of the rules for fouls and penalties. On page 122 of this commentary, under "Serious Foul Play":

"A player is guilty of serious foul play if he uses excessive force or brutality against an opponent when challenging for the ball when it is in play.

A tackle that endangers the safety of an opponent must be sanctioned as serious foul play. Any player who lunges at an opponent in challenging for the ball from the front, from the side or from behind using one or both legs, with excessive force and endangering the safety of an opponent is guilty of serious foul play. ....

A player who is guilty of serious foul play should be sent off"

Neuer lunges at Higuain, using one leg, from the side (or from behind Higuain's direction of view), using excessive force (at full tilt with knee raised) and certainly endangering Higuain's safety. Some have said this was ruled a foul on Higuain because he was impeding Neuer's progress, which is nonsense. On page 118 of the same PDF:

"Impeding the progress of an opponent means moving into the path of the opponent to obstruct, block, slow down or force a change of direction by an opponent when the ball is not within playing distance of either player.

All players have a right to their position on the field of play, being in the way of an opponent is not the same as moving into the way of an opponent.

Shielding the ball is permitted. A player who places himself between an opponent and the ball for tactical reasons has not committed an offence as long as the ball is kept within playing distance and the player does not hold off the opponent with his arms or body. If the ball is within playing distance, the player may be fairly charged by an opponent."

Higuain was within playing distance of the ball when the foul was committed, Higuain had a right to his position, and Neuer could have avoided the collision (as opposed to Higuain impeding progress by stepping in front of Neuer to intentionally trip him, which is what the above commentary is about).

Was this reckless and endangering the safety of Higuain? Anyone who watches the collision has to see that Higuain's safety was endangered. It's actually shocking that he survived it without being paralyzed. I suspect that this is entirely due to Higuain being unaware of the brutal challenge that was coming, since his eyes are clearly on the ball, away from Neuer, the entire time. If he had tensed up when the impact occurred, his neck vertebrae would have been subjected to much higher forces (as opposed to simply bending), and that might have resulted in his permanent paralysis. That isn't even to say anything about the possibility of concussion from the impact.

My prediction is that this World Cup will result in rule changes regarding the treatment of head injuries; the issue was constantly on the lips of commentators, fans, and others throughout the tournament.

As far as I am concerned, without their primary goalkeeper (i.e. playing with 10 men, one of them a substitute goalkeeper), Germany would have lost this match badly, and since the rules clearly show that the goalkeeper should have left the game at around 56 minutes, Argentina are the proper world champions.

The article purports to show that alcohol played a role in about 1/10 of all deaths among "working age" people (ages 20-64). That's a pretty big percentage, but we should remember that most people don't die until after age 64. In fact, according to the CDC life tables, about 85% of people survive to 65, and 50% of people live past 84. So if alcohol is attributable to 10% of the 15% of deaths that occur before age 65, then alcohol is attributable to about 1.5% of all deaths in the US.
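The arithmetic in that last step, spelled out using the life-table figures above:

```python
frac_deaths_before_65 = 0.15  # ~85% of people survive to 65, so ~15% of deaths come earlier
alcohol_share = 0.10          # the article's ~1/10 of working-age deaths
overall_share = frac_deaths_before_65 * alcohol_share
print(overall_share)  # ~0.015, i.e. about 1.5% of all US deaths
```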

Since most people are living so long, the kinds of things that do kill younger people are likely to be things that are not very age specific, or which are generally quite harmful, like car accidents, and crime. It's pretty clear that excessive use of alcohol is problematic, but just because it's involved in 10% of all deaths in this population doesn't mean it's involved in 10% of all deaths in the US, most of which occur in people over 64 years of age, and are related to things like cancer, heart disease, pneumonia, etc.

How do they estimate the role that alcohol plays? They plug things into a model developed by people who study the harmful effects of alcohol; the model has an online application available. Assumptions in this model are going to play a big role in the outcomes. For example, they assume that 42% of fire injuries and 47% of homicides in this population are "attributable to alcohol". I don't know where they get those kinds of numbers, but even if they are decent point estimates, the uncertainty in the attributable fraction would seem to be significant. Similarly, suicide is somehow 23% attributable to alcohol.

This kind of result is pretty questionable. Clearly a causal attribution is being made; implicitly, this means that "without alcohol" these deaths somehow wouldn't have occurred. But suppose you have, for example, a depressed Iraq war veteran with post-traumatic stress disorder, a small child (a big stress), and a wife who is considering leaving him (my sister, who is a psych NP, used to see about 10 cases a day like this at a VA hospital). Suppose that this person, who has been threatening suicide for a few years, one day drinks 500 ml of 40% spirits (about 10 or 12 drinks) over a period of 4 or 5 hours, and then shoots himself. Is this "attributable" to alcohol, or is it "attributable" to traumatic combat stress, social safety net issues, and perhaps genetic risk factors that affect personality and make this person both more likely to join the military and more likely to contemplate suicide when faced with serious life problems? It's a pretty difficult and complex thing to tease out. If you want to make "causal" claims, it has to somehow be that with alcohol the death occurs, and without alcohol the death wouldn't have occurred. Pretty clearly, when a 19-year-old fraternity pledge drinks until he passes out and dies of alcohol poisoning, the cause is plainly attributable "to alcohol" (as well as to poor social norms that encourage excessive drinking in that context). But what is the "intervention" that would involve alcohol and would have prevented these events? If it's not possible to intervene to control the variable, and thereby control the outcome, then we can't really have "causality".

It's also clear that interventions don't leave "all else equal". Clearly, in the US, when we tried Prohibition, violence went way up. We couldn't just "eliminate alcohol" while keeping everything else about society the same. In that context you might say that, thanks to the fact that we have alcohol, overall deaths are much lower than they would be under legally enforced prohibition. No, I wouldn't say that either, because the whole concept is just stupid. Prohibition was predictably a bad idea from the start, and it was motivated by the "all else equal" fallacy.

I personally think of these CDC-type reports as largely politically motivated. Some people study the harmful effects of alcohol (or guns, another political topic the CDC has published on), and by publishing this kind of "attribution" article they can justify further funding and continue their research, and they can also bring some publicity to what is undoubtedly a real problem. But typically the way these things are stated winds up over-generalizing and turning a politically charged issue into a ruckus.

To contrast that approach, consider this report from a group that studies alcohol issues in Canada and has published "low risk" guidelines. The estimate is that if everyone were to adopt the "low risk" guidelines, deaths in Canada would decrease by 4600 per year. This is a concrete, causal sort of intervention, and most likely achievable in large part without major changes to "all else." How big is that 4600-per-year decrease? This website lists about 250,000 deaths in Canada per year, so the idea is that by everyone adopting the low-risk guidelines, deaths would be reduced overall by about 2%. It's interesting that my 2% calculation here is about the same order of magnitude as the 1.5% calculated back in the first paragraph or so. Implicitly, then, the real issue is heavy or binge drinking, since that is what the "low risk" guidelines rule out.
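The same kind of sanity arithmetic for the Canadian figures:

```python
deaths_averted = 4600           # estimated annual reduction under the low-risk guidelines
canada_annual_deaths = 250_000  # total deaths in Canada per year, roughly
canada_share = deaths_averted / canada_annual_deaths
print(canada_share)  # ~0.018, i.e. about 2% of all deaths, comparable to the 1.5% above
```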

The Canadian report takes the point of view that its guidelines are set so that if you max out the guidelines (for men: 15 drinks per week, no more than 3 in 24 hours most days, no more than 4 in 24 hours ever, consumed in a safe environment that rules out drunk driving, etc.), the net effect on deaths versus complete abstinence from alcohol would be zero. How can that be? Wouldn't the fewest deaths come from zero alcohol consumption? It's been known for a long time that mortality as a function of alcohol consumption has a sort of U-shaped profile: those who drink a little live longer, on average, than either heavy drinkers or complete abstainers. Mostly the effect is related to reduced cardiovascular disease, but there may be other effects.

So, to that, I say: if you are not in a high-risk category for alcohol dependence (such as the offspring of an alcoholic, or someone who has suffered from alcohol-related problems in the past), raise one or maybe two glasses of wine, beer, or well-made cocktails tonight with some friends, and be thankful for a substance that is beneficial in moderation. Keep to those low-risk guidelines, and don't drive anywhere until the effects are fully worn off.

Oh, and don't believe everything you read.

Stanislaw Ulam was a mathematician who more or less invented the use of Monte Carlo computation for physical problems, and he's the namesake of the Stan Bayesian modeling system.

Following up on my previous analysis of Himmicanes vs Hurricanes, I decided to go fully Bayesian on this problem (imagine the voice of Marsellus Wallace from Pulp Fiction). Hurricane damage is of course a traditional Civil Engineering interest, and if the name of the storm has a lot to do with how deadly it is, for social reasons, then this seems like information we really want to know.

The big issue in the hurricane analysis is measurement uncertainty. For example, in examining Hurricane Isaac in my previous analysis (before I had gotten access to the PNAS paper), I decided that their estimate of 5 deaths was somehow wrong and put in the 34 deaths from Wikipedia. Then, after reading the PNAS article, I realized they were only counting US deaths, so I looked more carefully at the Wikipedia article and counted 9 deaths mentioned in news stories. So even the direct deaths have errors and uncertainty, and that says nothing about indirect deaths, such as increased car accidents in regions outside the evacuation, heart attacks or strokes caused by stress, or even potential reductions in overall deaths due to lower traffic levels during hurricanes. It's a confusing and difficult thing to measure. And if it's hard to measure deaths, think how hard it is to measure dollar damages. First, there is some kind of normalization procedure used by the original authors. Second, a car crushed by a fallen tree is pretty clearly hurricane damage, but what about a car that gets "lightly" flooded, causing electrical damage that ultimately leads to the car failing 6 months later and being scrapped because the repair is too costly? We should treat measures such as damage and deaths as carrying significant measurement error.

With that in mind, I decided to re-analyze the data using a Bayesian errors-in-variables model on top of my basic model of "involvement" using $D/(C v^2)$ as a measure of number of people potentially affected by the storm (where D is damage, C is capitalization of the region in dollars per person, and v is the velocity of the storm winds).
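As a sketch of the involvement measure itself (in Python rather than the Stan used below, and with made-up inputs purely for illustration):

```python
def involvement(damage, cap_per_capita, v):
    """People-affected proxy D / (C v^2): damage in $B, capital per person in $B,
    v as wind speed normalized to a Category 1 storm."""
    return damage / (cap_per_capita * v**2)

# Hypothetical storm: $10B damage, $0.005B of capital per person, winds 1.2x Cat 1.
print(involvement(10.0, 0.005, 1.2))  # ~1389, in the model's arbitrary units
```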

To accomplish this, I set up parameters that represented actual values of $D, C, v$ for each storm, and linked them to the measured values through probability distributions, then linked these unobserved "actual" values together through a regression line in which the masculinity and femininity were allowed to play a role after 1979 (when both male and female names were in use). Before 1979 we simply treat the data as a separate era, with its own slope and intercept terms.

The model is perhaps best described by the code itself:
stanmodl <- "
data {
  int Nhur;
  real<lower=0> NDAM[Nhur];
  real alldeaths[Nhur];
  real<lower=0> Category[Nhur];
  real Year[Nhur];
  real MasFem[Nhur];
}

parameters {
  real<lower=0> Damage[Nhur];
  real actualdeathsnorm[Nhur]; /* the actual deaths, normalized by 150, unobserved */
  real<lower=0> v[Nhur];
  real<lower=0> Cappercapita[Nhur];
  real<lower=0> ndsig;
  real<lower=0> involvement[Nhur];
  real<lower=0> invsd;
  real<lower=0> slope;
  real<lower=0> intercept;
  real MFslope;
  real MFint;
  real MasFemAct[Nhur];
  real<lower=0> predsd;
  real Pre1979Int;
  real Pre1979Slope;
  real<lower=0> avgcappercap;
}

/* dollar amounts are converted consistently to billions */
model {
  ndsig ~ exponential(1/.1);
  invsd ~ exponential(1/.5);
  MFslope ~ normal(0.05, 10.0);
  MFint ~ normal(0.05, 1.0);
  intercept ~ normal(0, 1);
  slope ~ normal(0, 10);
  predsd ~ gamma(1.3, 1/4.0);
  Pre1979Int ~ normal(0, 1);
  Pre1979Slope ~ normal(0, 10);
  avgcappercap ~ lognormal(log(5.0/1000.0), log(1.5));
  for (i in 1:Nhur) {
    Damage[i] ~ normal(NDAM[i]/1000.0, NDAM[i]/1000.0 * ndsig);
    Cappercapita[i] ~ normal(avgcappercap, .25*avgcappercap);
    v[i] ~ normal((25.5 + 8.2*Category[i])/(25.5 + 8.2), 0.15); /* normalized wind velocity, with error */
    MasFemAct[i] ~ normal((MasFem[i] - 6.0)/5.0, .15);
    involvement[i] ~ normal(Damage[i]/(Cappercapita[i]*v[i]*v[i])/2000.0, invsd);
    actualdeathsnorm[i] ~ normal(alldeaths[i]/150, .1*alldeaths[i]/150 + 5.0/150.0);
    if (Year[i] < 1979) {
      actualdeathsnorm[i] ~ normal(intercept + Pre1979Int + (slope + Pre1979Slope)*involvement[i], predsd);
    } else {
      actualdeathsnorm[i] ~ normal(intercept + MFint*MasFemAct[i] + (slope + MasFemAct[i]*MFslope)*involvement[i], predsd);
    }
  }
}"

out <- stan(model_code=stanmodl, data=list(Nhur=NROW(Data), NDAM=Data$NDAM, alldeaths=Data$alldeaths, Category=as.numeric(Data$Category), Year=Data$Year, MasFem=Data$MasFem), chains=1, iter=800)

I run 800 samples (only 400 of which are post-warm-up) on 1 chain because the model has a lot of parameters (one for each measurement, so a couple hundred) and takes a while to run: about 5 minutes on my desktop machine. If I were going for publication I'd probably do 3000 or more samples in two separate chains in parallel (I have two cores). But the basic message is already visible in this smaller sample set:

Here we're plotting the posterior mean values of the "actual deaths" vs the "actual involvement" $D/(C v^2)$. As you can see, after 1979 the posterior values line up along the same line regardless of the masculinity or femininity of the name. Before 1979 the slope of the line is larger, indicating that a higher fraction of the people involved tended to be killed. This can most likely be attributed to much poorer forecasting and communications in the 1950s and 1960s. For example, the Wikipedia article on Hurricane Camille mentions many people refusing to evacuate, including some jail prisoners who thought this Category 5 hurricane would be no big deal.

Why do these data fall right on the line? In the model we have "wiggle room" for all of the actual measurements. This includes some room for variability in capitalization per capita (think "wealth of the area affected") as well as in the actual number of deaths and the actual wind velocity at landfall. Put together, there is enough prior uncertainty in the measurements that the posterior average values of the estimates can land directly along the regression line we've specified. The prior we specified for the model error (predsd) is relatively wide, but the Bayesian machinery is telling us the more important source of uncertainty is in the measurements; somehow it picks a very small predictive sd, O(0.01), even under alternative priors (this is actually somewhat suspicious, I suppose, and would be worth following up if you were trying to publish this non-news). So all the estimates of the actual deaths and actual involvement are pulled towards the regression line we've specified. We've plotted, with low alpha, some grey dots that show alternative possibilities from our samples.

How about the masculinity and femininity effects?:

The effect of femininity on slope and intercept.

As you can see, these values are essentially 0 though there is plenty of uncertainty left to allow for the possibility of small effects. We don't know if there is some small effect, but we sure as heck shouldn't publish a result claiming an important effect based on this information!

Many people in the blogosphere have been commenting on the PNAS paper about male- vs female-named hurricanes. Thanks to Bob O'Hara, who posted some code doing his own analysis, I decided to grab the data and do a little modeling myself. My main concern was that I was seeing a lot of statistical analysis but very little modeling involving causal and physical factors. The following model actually considers the causality and the physics, and produces results that match the intuitions that go into the model. And the femininity turns out to be meaningless. Note also that I found one outlying point and corrected the number of deaths in Hurricane Isaac to 34, as shown on Wikipedia, instead of the 5 in the dataset. Whatever, I'm sure there are more data cleaning issues too, but I don't get paid to clean other people's data and I don't enjoy it, so feel free to scrutinize their data more carefully if you like.

So let's cut to the punchline of a couple of graphs, and then hear the analysis:

The basic model

And the residuals from the fit:

The first thing I decided was to analyze only the period after 1975, since the use of male names begins shortly afterward (1979), and this "modern" era of forecasting is more relevant to any future effect of name gender than storms from 1950 or 1960.

### The Basic Model:

So I decided to think about how damage and deaths work. In both cases, the number of deaths and the amount of damage are tied to the population in the area affected. In particular, I used the following basic idea:

$D = K \, N \, \frac{C}{N} \left(\frac{v}{v_1}\right)^2$

In other words, damage $D$ is related linearly via a constant $K$ (a first-order expansion) to the number of people in the area affected, times the capital stock per person $(C/N)$, times a measure of the destructive force of the winds (the fraction of the capital stock that will be damaged). Since the pressure exerted on a building is proportional to the wind velocity squared, I used $(v/v_1)^2$ to measure destructive force. I get the wind velocity from the "Category" variable in the dataset and transform it according to Wikipedia's description of the velocity (I transform to m/s, but then divide by $v_1$, the velocity of Category 1 hurricanes, which makes the result independent of units).

Last week I went swimming and met a couple who were very helpful and friendly; they are both professors at the Art Center College of Design here in Pasadena. One of them pointed out that the pool has a lap-timer clock, which I hadn't bothered to notice before, so I timed myself swimming 50- and 100-yard sets (two and four laps of the pool). My time for 50 was about 40 seconds (1.14 m/s), and for 100 it was 1:40 (0.914 m/s); these were fairly consistent after the first set or two. I consider this not bad, since I'm a father of two small boys who hasn't been exercising enough for the last 4 years or so. Another thing I discovered is that I swim 25 yards in about 13 or 14 strokes, sometimes 15. Together with my lap times, this gives me a range of stroke rates around 36 or so strokes per minute. Now, typical advocated stroke rates are around 60 to 70 per minute, which suggests that I should somehow increase my stroke rate. Intuitively it seemed to me that this wouldn't work for me, but I thought I'd analyze the math of it. Thinking about how this would affect power output led me to the following analysis:
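The stroke-rate arithmetic, for what it's worth, using the lap times and stroke counts above:

```python
strokes_per_25yd = 14   # typical count from above (13-15)
secs_fast = 40 / 2      # 50 yd in 40 s -> 20 s per 25 yd
secs_slow = 100 / 4     # 100 yd in 1:40 -> 25 s per 25 yd
rate_fast = 60 * strokes_per_25yd / secs_fast
rate_slow = 60 * strokes_per_25yd / secs_slow
print(rate_fast)  # 42 strokes/min at the faster pace
print(rate_slow)  # ~34 strokes/min at the slower pace
```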

Let's add in the energy cost of stroking to the previous analysis. To get an energy cost, consider that your muscle "wastes" the kinetic energy of your arm at each end of the stroke. So the power you're putting out overcomes drag, and also accelerates and decelerates your arm every stroke:

$P = D v + \frac{m (2 a r)^2 r}{2}$

Where $a$ is the length of the arm stroke, which is different from $sL$; so $m(2ar)^2 r/2$ has the dimensions of power (kinetic energy per stroke, times the stroke rate). Also note that $s = v/(Lr)$, which I substitute in. We're going to try to control $v$ and $r$, so having things in terms of those variables is more helpful.

Another change I'm going to make here is to switch from using $h$ and $d$ to using $h$ and $t$, where $h$ is still the height of the wake above the water, but $t$ is the full cross-sectional depth of the body, so $t = h + d$. I do this because for people who have a fairly flat profile in the water, $t$ is fixed by body shape, whereas $h$ can be controlled by things like head position. The previous analysis normalized the drag using $d$, but normalizing by the full-submersion drag using $t$ makes a little more sense because it's a single fixed reference rather than something that changes when $h$ changes. Substituting $d = t - h$ and normalizing by $K\rho t w v^3$ gives the following dimensionless power equation (note that, as in the previous analysis, I've dropped the lateral viscous drag term, since we've already seen it's multiplied by a trivially small constant):

$\frac{P}{K \rho t w v^3} = \left(1 - \frac{h}{t}\right) + C_1 \frac{g h^2}{v^2 t} + C_2 \frac{m a^2 r^3}{\rho t v^3 w}$

where $C_1$ and $C_2$ are dimensionless constants.

This is a slightly different form than before; in this analysis we have three dimensionless groups: $h/t$, $gh^2/(v^2 t)$, and $m a^2 r^3/(\rho t v^3 w)$.

From this analysis, if we take a very low in the water body position, then $h/t=0$, and we see that we can decrease our drag by riding up a little bit so that $h>0$. Since the wave drag grows like $h^2$ when $h$ is small the decrease in piston drag $(-h/t)$ exceeds the increase in wave drag. This probably accounts for the fact that elite swimmers often have a relatively forward looking head position. This lets them reduce their piston drag significantly while suffering only a small wave drag penalty. It also probably helps to produce the trough that allows for effective breathing.

Plugging in typical values for the mass of the arm, the density of the fluid, and so forth, the order-of-magnitude estimates for the dimensionless groups are about 0.2 for the wave group and $0.02 r^3$ for the arm group. So, although the equation predicts that if I double my stroke rate I will waste 8 times more arm power, this only brings the arm group into the same range as the wave and piston drag power. But why would I want to do that? The only reason I can see is that it would give me significantly more opportunities to breathe. One thing I find with a stroke rate of 30 or so per minute is that I pretty much have to breathe to one side only, because otherwise I build up carbon dioxide. But doing so gives me an asymmetric stroke, which might increase the wave and piston drag coefficients, and it makes it hard to empty my lungs fast enough. A faster stroke rate will have to be combined with a shorter stroke length, but it will allow me more opportunities to breathe and may give me a better sustained power output, actually making me faster even though my power consumption increases.
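The cubic scaling is the whole story here; a two-line check (the 0.02 coefficient is the rough order-of-magnitude estimate from above):

```python
def arm_group(r):
    # Dimensionless arm-power group, r in strokes per second.
    return 0.02 * r**3

# Doubling the stroke rate multiplies the arm power term by 2^3 = 8.
print(arm_group(1.2) / arm_group(0.6))  # ~8
```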

At least, that's how I interpret the math. I plan to get a little metronome that beeps regularly to help me time my stroke and see how it works for me.

I've been working on my front-crawl swim stroke in an effort to improve my fitness. I really enjoy the challenge of getting better at the technique, and I also enjoy the challenge of not over-thinking things while I'm swimming. Of course, when I'm not in the water it's another story. So here's a blog post about dimensional analysis and the front crawl.

First off, let's be clear about what we mean. Some people call this stroke "freestyle", some the "front crawl", others the "Australian crawl"; you can read about it at the Wikipedia page, but almost everyone who can swim has used some variant of this stroke, so I'm going to assume you know what I'm talking about.

Many people of course are interested in how to swim quickly. The front crawl is generally considered the fastest stroke, which is why people use it in the "freestyle" event (where you really could swim any stroke you wanted). Let's think about how we get velocity:

$v = s L r$

In this equation, $v$ is the velocity of the swimmer through the water, and $sL$ is the "stroke length" (the distance traveled per stroke), which we get by multiplying the swimmer's length $L$ by a dimensionless factor $s$ (this already anticipates the later analysis of lengths), while $r$ is the stroke rate (the number of strokes per minute, for example).

If we want to increase $v$ we can make $sL$ bigger, or $r$ bigger, or both. Since we can't make the swimmer much longer without injury, lengthening the stroke requires making $s$ bigger, which is part of working on technique.

Now, let's think about energy consumption. If we want to go fast in a long-distance swim, we will need to make the time-averaged velocity $\frac{1}{t} \int_0^t s(q) L\, r(q)\, dq$ large, averaged over the total time $t$. For this to make sense we had better be swimming long enough that we take many strokes and can treat the time between strokes as infinitesimal compared to the overall time. A good stroke rate is on the order of 1 stroke per second, so we're interested in averages over minutes or even tens of minutes.

Now, when trying to go fast, what happens is that we have some ability to output power $P$ such that we can sustain it for the entire time $t$. We will go as fast as that power budget allows, settling at the point where the power dissipated by drag equals our sustained power output. So keeping drag low at a given velocity will let us go faster.

So let's try to model the power involved. The simplest model is a body averaged one:

$$P = D v$$

where $D$ is the net drag force on your body.

To figure out what controls the drag $D$, consider that it has dimensions of force. Using dimensional analysis, we need to generate force terms from the following variables: $h$, the height of the bow wave; $d$, the depth of your body in the water; $w$, the width of your body at the shoulders; $L$, the length of your body; $\rho$, the fluid density; $\mu$, the dynamic viscosity; $g$, the gravitational acceleration; and $v$, the velocity.

Our drag $D$ will be made up of several components. We can use dimensional analysis to get the functional form of these components. We need combinations of the above variables which have dimensions of force for use in our power equation.

A good candidate force term is proportional to $hw(\rho g h) = \rho g h^2 w$. This is more or less the static pressure head generated by the bow wave in front of the swimmer. Another good candidate term is $\rho d w v^2$, which has dimensions of force and relates to the difficulty of pushing the fluid out of the way in front of you (without lifting it). Finally, we expect drag to be related to viscosity as well: $\mu v L$ has dimensions of force, and is related to the drag along the length of the swimmer's body, essentially caused by pulling the fluid along with us. Now that we've identified several sources of drag, let's write an equation for them:

$$D = K_w\, \rho g h^2 w + K\, \rho d w v^2 + K_L\, \mu v L$$

where $K_w$, $K$, and $K_L$ are dimensionless coefficients.

In our overall problem there are 2 equations and 9 variables ($L, d, w, h, \rho, \mu, v, g, r$) and 3 independent dimensions (length, time, mass), so we expect to have 6 independent dimensionless groups. One thing we can see, though, is that $L$, $d$, $w$, $h$ are all lengths. Clearly we can make 3 dimensionless groups by taking $h/L$, $w/L$, $d/L$, which makes sense since $L$ is the largest dimension, so all of these length ratios are less than 1.

Let's rewrite our power equation in dimensionless form. We could normalize the power equation by one of the drag powers; since we're interested in going fast, consider that the term multiplied by $v^2$ may dominate, and normalize by that. Let's assume that over the range of speeds of interest to swimmers the dimensionless multiplier of this drag term is nearly constant, so that the piston-drag power is $K \rho d w v^3$ for some dimensionless constant $K$. Dividing through we get:

$$\frac{P}{K \rho d w v^3} = \frac{K_w}{K}\,\frac{g h^2}{d v^2} + 1 + \frac{K_L}{K}\,\frac{\mu}{\rho v d}\,\frac{L}{w}$$

In this equation we've identified 3 important dimensionless groups: $g h^2 / (d v^2)$, $\mu/(\rho v d)$, and $w/L$. There is also another important dimensionless group, $v/(Lr)$, which basically measures our velocity in body lengths per stroke interval and is equal to $s$, the dimensionless stroke length. Finally, there are $h/L$ and $d/L$, the natural measures of bow height and depth in the water (we are essentially measuring all lengths as a fraction of the swimmer's full length). That is the full 6 dimensionless groups promised. If we write $w/L = w'$, $d/L = d'$, $h/L = h'$, and $v/(Lrs) = v' = 1$, we can rewrite our equations as (note, the maxima computer algebra system is helpful in avoiding mistakes):

$$v' = \frac{v}{s L r} = 1$$

and

$$P' \equiv \frac{P}{K \rho d w v^3} = \frac{K_w}{K}\,\frac{g h'^2}{s^2 r^2\, d' L} + 1 + \frac{K_L}{K}\,\frac{\mu}{\rho\, s r\, d' w' L^2}$$

In other words, if we measure velocity in stroke lengths per stroke time, distance in units of your body length, and power in units of piston-drag power (the drag term proportional to $v^2$), we get things into a very simple form. Clearly we can learn a lot from this. It becomes obvious why tall swimmers are desirable: their relative wave drag decreases like $1/L$ and lateral drag like $1/L^2$. Plugging in the following approximate order-of-magnitude constants, $g = 9.81$, $L = 2$, $r = 1.25$, $\mu = 0.001$, $\rho = 1000$, $w' = 1/3$ (SI units), we get order-of-magnitude estimates of how important our different drags are:
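As a quick numeric sketch of those estimates, using the constants just listed and assuming the remaining $O(1)$ ratios ($h'^2/(d' s^2)$ for the wave term, $1/(d' s)$ for the viscous term) are near 1 — an assumption on my part:

```python
# Order-of-magnitude sizes of the dimensionless drag groups, using the
# constants from the text: g = 9.81, L = 2, r = 1.25, mu = 1e-3,
# rho = 1000, w' = 1/3. The remaining O(1) ratios are taken to be ~1.
g, L, r = 9.81, 2.0, 1.25
mu, rho, wp = 1e-3, 1000.0, 1 / 3

# wave group: g h'^2 / (s^2 r^2 d' L), with h'^2 / (d' s^2) ~ 1
wave_group = g / (r**2 * L)
# viscous group: mu / (rho s r d' w' L^2), with d' s ~ 1
viscous_group = mu / (rho * r * wp * L**2)

print(wave_group)     # ~3.1: wave drag matters
print(viscous_group)  # ~6e-7: viscous drag is negligible
```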

Now, we can't estimate $K_w$ or $K_L$ without data, but the fact that the typical size of the dimensionless group in front of the third term is orders of magnitude smaller than 1, suggests that lateral drag is really not very important in the real world (though we haven't accounted for kicking here, it's quite possible that excessive kick churn would enter into essentially lateral drag). To support this conclusion, consider that the Mythbusters tested out swimming in various viscosities and found that for non-elite swimmers the viscosity had essentially no important effect on speed over multiple orders of magnitude of viscosity.

Clearly though, with a dimensionless premultiplication factor of 3 or so, wave drag is important. One reason to keep your head low in the water ($h'$ small) and your stroke length long ($s \sim O(1)$) is that small stroke lengths and high head positions will severely increase your wave drag. This is why swimmers try to stay under the water after a push off, so that waves aren't sapping their momentum.

So, to swim efficiently, we need small total power, which means a long stroke (large $s$), low in the water (small $h'$), and we need to keep our body stretched out and lying flat in the water to maximize L and minimize $w'$ (which produces a small dimensional piston drag $dwv^2\rho$). Keep the stroke rate high, but keep it smooth so as to minimize other kinds of wave drag (choppiness) which is a kind of drag not included in this model. Since piston drag is proportional to depth in the water, and wave drag is proportional to height of the wave above the water squared, there is an optimum depth that involves not being too deep, but also not creating too big of a wave. It's probably a good idea to try to stay deeper in the water, with head down and see how well that works for you. Obviously if it interferes with breathing, or alters the constant in front of the piston drag due to strange turbulence around your head, it won't help. There is some optimum depth tradeoff for each swimmer.

What about the "glide"? This is something some people advocate, because it lengthens the stroke, and people have found that long strokes are efficient. However, it decreases the stroke rate as well. As long as $rs$ increases, everything is good. In other words, if you are "over-paddling" with a short stroke length and a high stroke rate, slowing down your strokes and making them longer makes sense. But if you are "over-gliding" with a long stroke length ($s$) but a slow stroke rate ($r$), you should instead speed up your stroking so that the overall combination $rs$ increases. Note how the relative wave drag is inversely proportional to $(rs)^2$, which is related to the velocity. When we slow down, piston drag decreases, so wave drag becomes a bigger relative factor. Ideally we're maximizing $v$, which means we're maximizing piston drag, and wave drag becomes relatively less important. Eventually we reach a steady state where the drag force and velocity are high enough that we can't go faster without exceeding our power budget and tiring out too soon. Since piston drag is related to $v^2$, which is what we're trying to maximize, we can't get rid of it entirely, but we can try to minimize the additional drag caused by waves.
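To make the over-gliding trade-off concrete, here's a toy comparison; the specific rates and stroke lengths are made-up illustrative numbers, not measurements:

```python
L = 2.0  # swimmer length in meters, as in the estimates above

def velocity(r, s):
    """v = s * L * r: dimensionless stroke length times body length times rate."""
    return s * L * r

over_glide = velocity(0.7, 0.9)   # long stroke, slow rate:  rs = 0.63
quicker    = velocity(1.0, 0.75)  # shorter stroke, faster rate: rs = 0.75

print(over_glide)  # 1.26 m/s
print(quicker)     # 1.5 m/s -- higher rs wins despite the shorter stroke
```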

I have a friend who is a Philosophy professor. He recently posted to his Facebook page that he was offering his students extra credit if they could come up with false positive and false negative rates for the drug tests that used to be required by Florida law in order to receive welfare benefits. His students were having a heck of a time finding anything on these rates.

The only data he had was that about 2.4% of people who took the test tested positive (reported in the news, I guess).

So, I did a little quick googling, and came up with a WebMD article that had an approximate range for these tests.

I wrote him the following email, which describes how to use this range to construct a simple prior over the false-positive and false-negative frequencies for these tests, and then, using a single 6-sided die to generate a sample from this prior, to calculate the range of frequencies with which a person who tests positive is an actual drug user.

Since it's a Philosophy course, I put in some stuff at the end about the distinction between frequencies and probabilities within the Bayesian framework. If you're either a teacher, or a student just seeing these ideas for the first time, it might be helpful to read this description
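A minimal sketch of that calculation in Python. The WebMD range isn't reproduced here, so the error-rate grids below are hypothetical placeholders (false positives around 0.2–1.2%, false negatives around 5–30%); the one hard number from the post is the 2.4% positive rate:

```python
import random

POSITIVE_RATE = 0.024  # the one hard number: 2.4% of tests came back positive

def sample_rates(die=random):
    """Roll a single six-sided die twice to pick a point from a crude grid prior.
    These grids are hypothetical placeholders, not the WebMD figures."""
    fp = 0.002 * die.randint(1, 6)  # false-positive frequency: 0.2% ... 1.2%
    fn = 0.05 * die.randint(1, 6)   # false-negative frequency: 5% ... 30%
    return fp, fn

def p_user_given_positive(fp, fn):
    """Infer the base rate of users from the observed positive rate, then
    compute the frequency with which a positive result marks a real user."""
    # POSITIVE_RATE = p*(1 - fn) + (1 - p)*fp, solved for p (the user base rate):
    p = (POSITIVE_RATE - fp) / (1 - fn - fp)
    return p * (1 - fn) / POSITIVE_RATE

fp, fn = sample_rates()
print(f"fp={fp:.3f}, fn={fn:.2f} -> P(user | positive) = {p_user_given_positive(fp, fn):.2f}")
```

Sweeping the whole grid instead of sampling, the implied frequency ranges from roughly 0.5 to 0.9 under these assumed rates, i.e. even a small false-positive rate means a nontrivial fraction of positive tests are not users.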

If you follow my blog you probably know that I was a PhD student at USC. I graduated in December, so you can read my dissertation, which discusses both the mechanics of liquefaction and the thermodynamics and mechanics of wave dissipation due to microscopic vibrations in a molecular solid, along with some Bayesian fitting of the associated ODE model.

The liquefaction research was recently published in Proceedings of the Royal Society A. My version of the preprint can be downloaded here: Grain Settlement and Fluid Flow Cause the Earthquake Liquefaction of Sand, and my postprint version will be available after 12 months (though I understand the definitive version will also become available from the journal after a year or so). The supplemental materials are available as well.

The article in Proceedings A was picked up by two different news sources so far:

ABC News Australia interviewed me for their article from a week or two ago, and in the last few days ScienceNOW interviewed me and one of my advisers Amy Rechenmacher for an article that came out today.

I was particularly happy to see Ross Boulanger quoted with a favorable comment. He was a professor of mine when I did my second undergrad degree at Davis, and he definitely motivated me to think about the problems with standard liquefaction explanations.

In the meantime, I am now working on building up a consulting practice. My company is Lakeland Applied Sciences LLC. My emphasis has been on building mathematical models of all sorts of phenomena, from liquefaction and wave vibration, to scaling laws for the healing time of bones and cartilage, to economic decision making, to biochemical explanations for behavior. You can see examples of projects I've worked on in the past on the company web site.