Regular readers of my blog will know that I'm an IST nonstandard analysis fan. It fits so well with model building. So, I thought I'd clear up how I interpret "maximum likelihood". If you write down a likelihood for data D with parameter vector $q$ which we'll notate as $p(D|q)$ and you maximize it without putting any prior on $q$, clearly the maximum occurs at the same place independent of multiplication by any positive constant $C$.

Now, suppose for ease of exposition, that $q$ is just a single parameter. In nonstandard analysis, we can choose a nonstandard integer $N$ and create a prior for $q$ which is $p(q) = 1/(2N)$ for any q value in the range $[-N,N]$ and zero otherwise. Clearly this is a constant for all standard values of $q$.

Now do the maximization of this new nonstandard posterior: provided that your maximum likelihood method picks out a limited value for $q$, this maximization picks out the same $q$ as maximum likelihood without the nonstandard prior. It's clear that multiplying by this prior couldn't have any effect on the location of the maximum.
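Here's a quick numerical sketch of the constant-multiplication point (my own illustration, in Python; the normal model and the grid are arbitrary choices): multiplying the likelihood by any positive constant $C$, i.e. adding $\log C$ to the log-likelihood, leaves the argmax untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, 50)

# log-likelihood of a normal(q, 1) model, up to an additive constant
qs = np.linspace(0.0, 6.0, 6001)
loglik = np.array([-0.5 * np.sum((data - q) ** 2) for q in qs])

q_hat = qs[np.argmax(loglik)]

# multiply the likelihood by an arbitrary positive constant C (a flat "prior")
C = 7.3
q_hat_scaled = qs[np.argmax(loglik + np.log(C))]
# q_hat and q_hat_scaled are identical, and both sit at the sample mean
```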

Is this "legitimate?" Well, let me ask you a totally equivalent question: are integrals legitimate? Because the integral of a continuous function $f(x)$ over a region $[a,b]$ can be defined as follows: let $N$ be a nonstandard integer, $dx = (b-a)/N$, and $\mathrm{st}(x)$ be the standard part function, then

$$\int_a^b f(x)\,dx = \mathrm{st}\left(\sum_{i=0}^{N-1} f(a + i\,dx)\,dx\right)$$
and if you don't like the Riemann integral, you can do the Lebesgue integral instead; in that case you need to evaluate $f$ at standard locations:

$$\int_a^b f(x)\,dx = \mathrm{st}\left(\sum_{i=0}^{N-1} f(\mathrm{st}(a + i\,dx))\,dx\right)$$
Both of these ideas are mathematical constructs in which nonstandard numbers are used to define a mapping from a standard thing to another standard thing. So, maximum likelihood, when it gives a unique maximum, is just maximum a-posteriori for a nonstandard posterior with a nonstandard flat prior, the same way that the integral of a function is just the standard part of a nonstandard sum.
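The finite shadow of this construction is easy to play with: for a large (but standard) $N$ the Riemann sum is already very close to what the standard part operation extracts. A small Python sketch (the function and interval are just examples):

```python
import numpy as np

def riemann_sum(f, a, b, N):
    """Left-endpoint Riemann sum: sum of f(a + i*dx) * dx for i = 0..N-1."""
    dx = (b - a) / N
    x = a + dx * np.arange(N)
    return np.sum(f(x)) * dx

# the integral of sin over [0, pi] is exactly 2;
# as N grows the sum approaches it, mirroring st() of the nonstandard sum
approx = riemann_sum(np.sin, 0.0, np.pi, 1_000_000)
```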

It's not like this isn't a known thing, that maximum likelihood is the same as maximizing the Bayesian posterior with a "flat" prior. But usually that's taken as a kind of "intuition" because there *is no* (standard) flat prior on the real line. Well, there *is no* standard $dx$ value either, but that doesn't keep us from using nonstandard $dx$ values to define an integral, and it doesn't keep us from using nonstandard priors to define a standard posterior either.

Of course, if you pick some likelihood that can't be normalized... then we're talking about a different story. You should probably rethink your model.

In discussion at Gelman's blog, I gave a kind of summary description of what I thought Frequentist statistics was about (that is, principled frequentist as opposed to just "what is done a lot"). One of the commenters "ojm" felt that I was really misstating the case.

So, I sat down to try to figure out what frequentist statistics was more carefully, and went to one of the "sources" I felt very comfortable with: Larry Wasserman. Larry defines a Frequentist Inference as one in which we "construct [a] procedure with frequency guarantees".

But in what sense is this "guarantee" correct? Every one of these guarantees is of the approximate form "Random number generators of the type $R(i,q)$ will produce data $\{D_i\}$ such that $q^*(\{D_i\}) - q$ is in interval $I(R,q^*)$ 95% of the time that a sample is taken"

Notice how this doesn't say anything about the world? It says something about the relationship between a mathematical function $R(i,q)$ and an estimator function $q^*$.
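To make that concrete, here's a sketch (mine, not Wasserman's) of such a guarantee, with $R(i,q)$ played by a normal random number generator and the estimator played by the sample mean. The "95% of the time" claim is verifiable by simulation, but note it is verified entirely about the RNG and the estimator, not about any real-world process:

```python
import numpy as np

rng = np.random.default_rng(0)
q = 3.0                 # the true parameter of the "data generator" R(i, q)
n, trials = 50, 2000

covered = 0
for _ in range(trials):
    d = rng.normal(q, 1.0, n)                    # a sample from R(i, q)
    half = 1.96 * d.std(ddof=1) / np.sqrt(n)     # normal-theory 95% interval
    if d.mean() - half <= q <= d.mean() + half:
        covered += 1

coverage = covered / trials   # close to 0.95, as the guarantee promises
```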

So, when can we apply Frequentist procedures to *real world* processes, and expect anything useful?

Here's some distinguishing cases that I think are important:

1. Frequentist inference about a finite population at a given time, with an approximately known distribution, using a random number generator to sample the population.
   1. In this case, I think we're good: since we're relying on a random number generator, there really is an $R(i,q)$, so to speak.
2. Frequentist inference about a finite population at a given time, without an approximately known distribution, using a random number generator to sample the population.
   1. Again, we have a random number generator $R(i,q)$, though it gets mapped through an unknown distribution of population values; we can still construct "distribution free" inferences. It's possible! Things like the bootstrap, permutation tests, and so forth have this flavor.
3. Frequentist inference about a finite population at a given time, with an approximately known distribution, and a sample where we have an approximate model of the sampling process.
   1. Here we're on shakier territory. We took a sample, and we don't know exactly how that sample was produced, but we have an approximate model. Here we can at least get something of use, conditional on the model being good enough.
4. Frequentist inference about a finite population at a given time, with an unknown distribution and an unknown sampling process for which we have no good model (say, convenience sampling of Mechanical Turk respondents to a survey).
   1. Here, I'm sorry, but we're just sunk. You can make up whatever Frequentist "guarantees" you like about random number generators $R(i,q)$, but you need to work VERY HARD to convince me that they mean anything about the world (and the kind of work required is going to more or less push you into categories 1-3).
5. Frequentist inference about a process through time, considered as an infinite set of potential samples (i.e. the measurement error distribution of an electronic instrument used to make many repeated measurements), where we have scientific reasons to believe the frequency distribution is stable.
   1. This might work, so long as that science holds. But the term "frequency guarantee" sounds a lot stronger than "provided the world acts like random number generator R, then...".
6. Frequentist inference in which you use an IID-sampling-type likelihood that isn't calibrated to a known distributional shape, for the output of some process that isn't RNG-based sampling of a finite population, and you do maximum likelihood estimation.
   1. This is a Bayesian calculation, though depending on how you actually construct a confidence interval, you might be adding in something Frequentist here... but most likely not, based on what I typically see people do. I'd recommend accepting that you made assumptions to choose that likelihood based on what you know (a fact in your head), and are calculating what that implies about other facts using probability theory as generalized logic. Oh, and just because you didn't put in a prior doesn't mean you don't have one; you just have a flat one.
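The "distribution free" flavor of case 2 can be sketched with a bootstrap percentile interval in Python (my own toy example: the exponential population is just a stand-in for some unknown, non-normal distribution; the RNG sampling is what makes the machinery legitimate):

```python
import numpy as np

rng = np.random.default_rng(0)
# a finite population with an unknown, skewed distribution (exponential here)
population = rng.exponential(2.0, 100_000)

# an honest RNG sample of the population
sample = rng.choice(population, 200, replace=False)

# bootstrap percentile interval for the population mean:
# resample the sample with replacement, many times
boots = np.array([rng.choice(sample, sample.size, replace=True).mean()
                  for _ in range(5000)])
lo, hi = np.percentile(boots, [2.5, 97.5])
# (lo, hi) brackets the sample mean and, most of the time, the population mean
```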

We had a long discussion over at Gelman's Blog on using convenience samples in studies. I had some strong opinions which I thought I'd try to summarize here.

First up, you have to ask yourself what it is you want to find out from your study. Typically, it's not just what would happen if you did your treatment again on a similar convenience sample. It's more likely to be something like "what would happen if you did your treatment on a more broadly defined population P?"

If this is the kind of question you have, then you now have to figure out what kind of statistics you are willing to do. The spectrum looks something like this:

1. Frequentist statistics: a probability means how often something would happen if you repeated it a lot of times. For example, if you take a random-number-generator sample of 100 measurements out of a population of 100,000 people, then you're guaranteed, based on the sampling distribution of the average, that the average of your sample will be within a certain distance of the average over all 100,000 people almost no matter what the RNG output (this is what, for example, the 95% confidence interval expresses).
2. Likelihoodist statistics: where we do a Bayesian calculation with a nonstandard, flat prior.
   1. Frequentist Likelihoodist statistics: where we assume that the data arises as if from a random number generator as IID samples, with a validated frequency distribution (we test this distribution to make sure the frequencies approximately match the assumptions). We then write down the likelihood of seeing the data, and use it to get a maximum likelihood value of the parameter.
   2. Default Bayesian Likelihoodist statistics: we write down some IID likelihood based on some default choice of distribution without doing any tests on the data distribution, multiply by a constant (flat) prior, and then usually look for the maximum likelihood. This is what's done in "ordinary least squares" in the vast majority of cases. Few people test the "normality" assumptions, and when they do, they often find that they're not met, but they stick with the calculation anyway because they don't have much else they know how to do, or don't believe in full Bayesian statistics for some reason.
3. Full Bayesian statistics: we write down a full joint distribution of the data and the parameters, providing some kind of prior that isn't a nonstandard construct (i.e. not flat on the whole real line), and we don't restrict ourselves necessarily to IID likelihoods. For example, we might model a whole timeseries as a single observation from a Gaussian process, where each data point has a complex covariance structure with every other data point. We reject the idea that distributions must represent frequencies under repeated sampling, and instead use them as measurements of plausibility conditional on some scientific knowledge, and we're aware of this fact (usually unlike the Default Bayesian Likelihoodist).
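The claim about "ordinary least squares" being a flat-prior Bayesian calculation is easy to check numerically: with a normal IID likelihood and a flat prior, the posterior mode is exactly the least-squares solution. A sketch (the simulated data and scipy's optimizer are my own choices):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, 50)

# ordinary least squares
X = np.column_stack([np.ones_like(x), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# maximum of the normal IID likelihood; a flat prior adds nothing to the argmax
def negloglik(beta):
    resid = y - X @ beta
    return 0.5 * np.sum(resid ** 2)   # sigma fixed, constants dropped

beta_mle = minimize(negloglik, x0=[0.0, 0.0]).x
# beta_ols and beta_mle agree to numerical precision
```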

Now, we've done a study on a convenience sample. What are the challenges?

For a Frequentist, we imagine we're interested in some function of the population, f(P), and a given data set S, which is a sample from a population P, has a value f(S). If we understand the sampling methodology and the distribution of the values in P that go into calculating f(P), then we can get a sampling distribution for f(S) and see how it relates to f(P). This automatically allows us to make extrapolations to the population P which are not exact, but which hold to a certain approximation a certain percentage of the time. The only problem with this approach is that it requires us to know, at least approximately, the frequency distribution of the values in P, or to be able to calculate the sampling distribution of f(S) approximately independently of the frequency distribution of P (such as when a mathematical attractor like the Central Limit Theorem is involved).

But, in the absence of a specific sampling model, when all we know is that we're sub-sampling P in some definitely biased but unknown way, there is no way to extrapolate back to the population without making a specific guess for what the population values P look like, AND how the sampling works. And since a Frequentist will not put probabilities on things like "the distribution of the P distributions" (which a Bayesian might accomplish by doing something like a Dirichlet Process or a Gaussian Mixture Model), there is no way to get probabilistic statements here. You can at best do something like "worst case" among all the sensitivity analyses you tried.

What would be the way forward for a Bayesian? Well, first of all, there are quite a few of them: we can build a wide variety of models. But in the basic case, what we'd do is assume that we have some covariates we can measure, and that we can get a functional form for the prediction of the outcome from the covariates, so that for each s in the sample S we have outcome(s) = f(s) + error. Now we have to assume that this relationship still holds, or holds with some modification over which we have a prior, so that outcome(s') = f(s') + error + error2(s') for any s' value in the whole population P. Here we say that we at least have some probabilistic information about error2, the out-of-sample extrapolation error.

Next, we assume something about what P looks like, and since we're Bayesian, we can put probabilities over the frequency distribution for P. So we can do things like set up Gaussian Mixture Models, or Dirichlet Process Models or other models so that we can describe what we think we know about the full range of the population. This could be "pure" prior guess, or it could be based on some data from an alternative source (like Census data, or patient summary data from other hospitals, or operating characteristics of fighter jets published by military intelligence gatherers, or aerial photos of forests from weather satellites, whatever your subject matter is). Finally, we can generate a series of samples from the assumed population P and apply our predictors f*(P) to get predictions extrapolated back to our model of the population P.
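A minimal Monte Carlo sketch of that last step in Python, with everything (the two-component mixture standing in for the population model, the fitted relationship f, and the error and error2 scales) as assumed placeholders rather than anything fitted to real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed population model for P: a two-component Gaussian mixture (prior guess)
def draw_population(n):
    comp = rng.random(n) < 0.6
    return np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(4.0, 2.0, n))

# the in-sample fitted relationship f (hypothetical)
def predictor(s):
    return 1.5 * s + 0.5

n = 10_000
covariates = draw_population(n)
error = rng.normal(0.0, 0.3, n)    # in-sample error
error2 = rng.normal(0.0, 0.6, n)   # assumed out-of-sample extrapolation error

# predictions extrapolated back to the assumed population P
outcomes = predictor(covariates) + error + error2
pred_mean = outcomes.mean()
```

In a full analysis the mixture weights and the error2 scale would themselves get priors and be averaged over, rather than fixed as here.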

So, in this case, the Bayesian can make assumptions which give probabilistic statements about the extrapolated results, but these probabilities are not Frequencies in any sense (there's no sense in which the real population P is somehow part of an ensemble of repeated populations P_i whose frequency distribution is given by the Gaussian Mixture Model of the Bayesian or whatever). The true frequency distribution of the real population P is a fixed thing at a given point in time. Like, the true frequency distribution of the weight in lbs of people between the ages of 18 and 60 in the US on April 28 2016 at 3:45PM Pacific Time. Nevertheless, although there's a very definite distribution for those weights, we DO NOT know what it is, and the Bayesian distribution over that frequency distribution is NOT describing an infinite set of worlds where our true population P is just one population out of many...

The Frequentist can only give an infinite set of statements of the form "if P looks like P_i and the sampling method looks like S_j then the confidence interval for f(P) = f(S) +- F_(i,j)"

A friend linked on Facebook to graphs from HumanProgress.org showing breast and prostate cancer death rates in the US. I immediately checked those against SEER at cancer.gov and also didn't like the y axes used by HumanProgress.org so here are my own graphs based on SEER data:

It's nice to see it go down, but it's better to see the data presented well, based on the most authoritative source, with a y axis that makes it clear what the magnitude of the change really is. Also, the CDC gets death certificates directly, while the WHO has to do more modeling to get its data.

So, pretty much everyone has analyzed the golf swing as a rigid double pendulum. Since that's well understood (see "The Physics of Golf" for example), the question I have is what is missing from this analysis? The way to answer that is to make a slightly more complex model that potentially captures different effects, and see if any of them are worth considering.

The model I'm considering has a spine pivot that can translate left-right, a rigid left arm, a variable length, actuated, right arm, two hand attachment points separated by a small distance at the grip, and a club with two stiff torsional spring joints in the shaft, to capture some shaft bend. Furthermore, you could add the golf ball itself with a very stiff nonlinear spring whose interaction force goes to zero rapidly outside the radius of the ball but goes to infinity rapidly as the club approaches the center of the ball.

The problem with this model is that it is a nonlinear actuated control system. What you'd like to do is specify the spine torque, the weight shift, the wrist torques, and the right arm extension as parameterized functions of time or of angle, and then train the system to maximize the component of ball velocity along the target line, subject to constraints on the available torques, forces, etc. Then once it's found its "perfect swing", you'd want to figure out what it's doing by looking at the chosen input functions.

Not exactly a trivial code-it-up-at-home-in-an-afternoon kind of problem, but an interesting area of research. I'd be particularly interested in whether the shaft flexibility really has much effect. After all, the fancy expensive golf clubs are often expensive and fancy because they have graphite shafts and so forth. The bend of the shaft is visible in all the high speed videos as well.

On the way down at 0:49 see how the club is bent back away from the direction of travel? This is due to a torque applied by the arms to get the club rotating as well as translating. Since the club is going around a circle with a constant angle relative to his arms the rotation and the translation of the club are kinematically linked.

At about 0:53 as the wrists unlock, the club is straight, and at 0:56 or so it's actually bending the other way (towards the ball). At about 0:56 the hands are also at their slowest. My assertion is that the left hand is applying a backwards force on the grip of the club while the hands hinge. That backwards force is what's responsible for the *forward* club bend that is very visible at 0:57. That backwards force is a necessary part of getting the club into almost pure rotation around the hands, because it's stopping, or at least slowing, the translation of the club's grip end. Tutelman's point from the previous post is that you shouldn't apply a pure torque with your hands, or at least should do so very late. But a single backwards force applied at the grip end of the club, together with the inertial force of the center of mass, forms a torque (force couple) in the accelerating frame of reference of the club's center of mass, and that's the torque that is responsible for the forward bend of the club.

No, the above bit doesn't make sense. Sure, a backwards force at the grip and the inertial force form a couple, but they wouldn't cause shaft bend; they'd cause a change in rotational speed. Only with a torque applied to the club by the hands would we get a shaft bend. But the whole thing happens quickly: a torque that was helpful for getting the club head going by locking the lag angle is probably still persisting, because it takes a while for humans to respond.

Thanks to Richard Kennaway for getting me thinking a little more carefully about the angular momentum issues in the golf swing. Here's another fantastic slow motion shot of Tiger Woods hitting a driver. It's slow enough that we can mention time points and see what's going on. I'm going to try to analyze how energy is pumped into the club, and how it's transferred to the ball.

Let's ignore the backswing: although it's important for developing good feel and rhythm (in other words, for control), it's not involved in adding power to the club, since the club is stationary at the top. So, starting at 0:33 the downswing begins. The wrists are cocked so that the club is at approximately a 90 degree angle to the left arm. By 0:39 we can see the hands are at the belt line, the wrists are just beginning to uncock, and the angle of the club to the left arm is starting to increase. Between these two time points he's put a lot of torque into turning his upper body, arms and torso, but it seems mostly arms. A lot of energy is now stored in the motion of the club and the motion of his upper body and arms, like the energy stored in a gyroscope.

At 0:41 Tiger's whole upper body has nearly zero angular velocity relative to the large angular velocity at say 0:39. His hands become nearly stationary at a point near his belt buckle. Well, perhaps that's an exaggeration, but I'd estimate maybe half the peak linear velocity of the hands. I'd love to have access to one of those little video drawing and measuring tools. But, the club continues to swing thanks to the unhinging of the wrists.

During the period from 0:40 to 0:42 the club is unhinging. Let's approximate the linear velocity of the hands when they reach the belt buckle area as zero. The club head is still going quite fast, so we have as an approximation, the club purely rotating around a point where his left thumb is, and his upper body is stationary. This is a pretty rough approximation, clearly he is still turning, but not nearly as fast as at the middle of the downswing. This is the period of time when energy is being transferred from the upper body into the club. The club actually bends forward.

In fact, I think he is feeling a force couple applying torque to the club. His right hand will feel like it's "pushing forward" in the direction of the swing, and the left hand is pulling backward. Those forces are responsible for pumping energy into the club. The key in the swing is to do this late enough that the club doesn't "get ahead" of your arms. At impact the club should still form an angle of less than or just about equal to 180 degrees to the wrist. At this point, your left arm is pulling upward, and you'll see Tiger's shoulder rising post impact.

EDIT: Something is definitely going on at that point in the video, the hands do seem to be slowing, but I don't think that the above description is accurate having looked and thought more carefully about it.

Now, here's a simple mechanical analogy experiment you can try at home. Take a normal desk ruler, preferably a flat metal one. Hold it flat between your right thumb and forefinger on one end of the ruler. Cock the ruler so it's trailing at 90 degrees to your arm. Move your right arm from level out at your right side around in an arc towards your front, holding tight until you get about halfway through this right arc, and then partially release the pressure on your thumb so that the ruler is free to rotate. Decelerate your arm so that it's stationary when it gets in front of you. You will see the ruler rapidly begin turning around your thumb, and whack you in the forearm... watch out that the ruler doesn't fly out of your hand, or hit you in the face or arm in a painful way.

I think there's a flavor of this type of energy transfer in the golf swing. It's similar to a trebuchet which has a stiff arm, and a rope-sling attached to the end. Towards the end of the stiff arm's swing it slows down and strong forces are exerted on the rope to accelerate the rock around a pivot point with arm length equal only to the rope length. In essence, the orbital radius changes rapidly from something approximating the length of the trebuchet arm, to something approximating the length of the rope.

Tutelman analyzes the whole thing as a double physical pendulum, and concludes that the wrists shouldn't add positive torque until very late in the swing, tens of milliseconds before impact... But Hogan, who advocated the "swing with the hands" concept, probably is feeling something real even if he doesn't understand how to turn it into physics.

But what Tutelman doesn't analyze, and I'd be interested to see, is what happens when the body adds negative torque at an appropriate point. That is, slow the upper body rotation speed while allowing the wrists to unhinge. That will induce a backwards force on the top of the club and explain the forward bend of the club that he mentions. It is the same basic motion as stopping your arm while letting the ruler swing freely.

I'll write some Maxima code to integrate the ODEs and see what I get.

Here are some nice videos of golf impacts and swings.

There's a lot you can learn from these videos. But the one thing that I want to make indisputably clear here is that how you hold the club or your body or the forces you exert on the club during impact have absolutely no discernible effect on the actual forces applied to the ball.

The second video is at 68,000 fps, and replayed at say 24 frames per second, the whole impact takes less than 1 second of video watching time, which means that the impact time is about 24/68000 = 0.00035 seconds. The ball goes from zero to say 200 mph in this time. Thanks to GNU units we can get that as about 26,000 gees (gravitational accelerations, 1 gee = 9.8 m/s^2). A golf ball weighs about 46 grams, which means at 26,000 gees the force on the ball is about 2,640 lbs, the weight of a 1,200 kg mass, which is about as much as a typical sedan. Can you lift your car off the ground with your hands? Do you experience anything even remotely close to that force during a golf swing? No.

The speed of sound in steel is around 6000 m/s and a golf club is about 1 m long, so it takes 1m/(6000m/s) = .0002 s for a stress wave to travel the length of the club. So in the roughly .0004 seconds of contact, there's just about enough time for a wave to travel up the club to your hands and back down around the time the ball leaves the face.
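The back-of-the-envelope numbers above are easy to reproduce (a sketch; the 200 mph, 46 gram, and 6,000 m/s figures are the post's own rough inputs):

```python
# quoted figures from the high-speed video
frames = 24                    # ~1 s of playback at 24 fps
fps_capture = 68_000
contact = frames / fps_capture             # impact duration, ~0.00035 s

ball_speed = 200 * 0.44704                 # 200 mph in m/s
accel = ball_speed / contact               # m/s^2
gees = accel / 9.80665                     # ~26,000 g

mass = 0.046                               # kg, golf ball
force_N = mass * accel
force_lbf = force_N / 4.44822              # ~2,640 lbf
equiv_kg = force_N / 9.80665               # ~1,200 kg, about a sedan

# wave round trip in a ~1 m shaft at ~6,000 m/s
roundtrip = 2 * 1.0 / 6000.0               # ~0.00033 s, comparable to contact
```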

NOTHING about your hands directly affects the ball, EXCEPT position. And obviously, the way that position affects the ball is that it puts the club head at the right place to impact the ball.

Of course, leading up to impact, what you do with your hands and arms and body is critical, critical to get the club head positioned in the right place and traveling at the right speed, but during impact, all the effect comes from the club interacting with the ball as if your hands weren't even attached.

Golf is a game of precision more than power. You need to slot that club head into the right position plus or minus less than a cm while getting it traveling at 100 mph or so. Kind of an amazing feat of control.

How does Tiger Woods get a high club speed? See the second video between 1 and 4 seconds: at the point where the club is horizontal, about at the level of his right knee, his wrists begin to hinge. The club goes from an L shape at almost 90 degrees to his arms into a straight line with his left arm by the time the club passes the ball by say 5 inches. This means the club is basically rotating around the contact point with his hand, whereas it started out rotating around a point about at his left shoulder. The radius of rotation of the club becomes about half what it was, roughly equal to the length of the club. Like an ice skater pulling in her arms to speed up her spin, conservation of angular momentum ensures that when the radius of rotation becomes smaller, the velocity of the club increases. During the initial downswing, his hands and wrists apply a torque to the club; this is what golfers call "the feeling of lag". At the moment when the club is horizontal or so, he stops applying that torque, stops holding his wrist stiff, and begins to apply only a centripetal force along the shaft to keep the club head going in a circle around his wrist. The angular momentum is $mvR$ where $R$ is the radius from his shoulder pivot point to the club head, and if suddenly we reduce the radius to $r \approx R/2$, then the speed becomes $V \approx 2v$, essentially doubling the club speed by this slingshot effect.

The above stuff about the change in radius is wrong. Yes the radius of rotation changes, but we can only invoke conservation of angular momentum if we keep the reference pivot point constant, which would be the point close to his left shoulder or something. If you have a ball on a string going around a pivot point, and you reel the string in, you will speed up the ball, but that's not what's going on here.

"Lagging" the club is all about applying torque so that you can get the club rotating, and then you release that stiff wrist to increase the club speed dramatically. The feeling of "retaining your lag" is about releasing the club late enough that it swings through the ball before it reaches the straight position with your left (leading) arm.

Mechanics... it works!

If you've got a bunch of stuff out in the world, and you send some people out to check up on them, and they only bother to look at and report things that are bigger than some value, then you wind up with a truncated random sample. For example, suppose you run around a housing development and see if there are any driveways with large cracks in them indicating a problem with the quality of the concrete. Suppose you only stop at driveways where the cracks are noticeable in size. You don't know what "noticeable" really is (unknown cutoff) but you only get reports of how wide the cracks are when they can be seen from the street on a drive-by (this is not a real problem I'm having, just a plausible example).

Let's just model this in the simplest case: approximately normally distributed crack widths with unknown cutoff.

vals <- rnorm(1000, 10, 1)  # 1000 cracks, mean width 10, sd 1
actualcutoff <- 10
N <- 100
seenvals <- sample(vals[vals > actualcutoff], N)


Now we want to estimate the population parameters, which are mu=10, sd=1 in this case, and we need to simultaneously estimate the cutoff, which we know logically needs to be, say, positive and less than min(seenvals) (a data-dependent prior!)

Stan code:

data {
  int<lower=1> N;
  vector[N] seenvals;
}
parameters {
  real<lower=0, upper=min(seenvals)> cutoff;
  real<lower=0> mu;
  real<lower=0> sigma;
}
model {
  mu ~ exponential(1.0 / 100);     /* we have a vague idea about mu and sigma */
  sigma ~ exponential(1.0 / 100);
  seenvals ~ normal(mu, sigma);
  increment_log_prob(-N * normal_ccdf_log(cutoff, mu, sigma)); /* renormalize the distribution based on cutoff */
}


What I discovered in doing this was that for truncation points near or above the mean, the estimates were biased towards lower means and higher standard deviations, and this held whether I used a flat or an exponential prior. The more observations you saw to the left of the mean, the more reliable and accurate your estimates were. As sample size increased, things got better.

I am not sure why this bias occurs, but it is interesting and might inform some analyses. There is definitely a tendency in practical problems to have situations where we can't even detect some of the smaller values, but get a full complement of the larger ones.
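The basic truncation machinery is easy to reproduce outside Stan; here's a sketch in Python using a direct maximum-likelihood fit of the truncated normal. The simulation setup mirrors the R snippet above, but for simplicity I treat the cutoff as known rather than estimating it:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
vals = rng.normal(10.0, 1.0, 100_000)     # true mu = 10, sd = 1
cutoff = 10.0                             # truncation at the mean: hard case
seen = rng.choice(vals[vals > cutoff], size=2000, replace=False)

def negloglik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)             # keeps sigma positive
    # truncated-normal log density: normal logpdf minus log P(X > cutoff),
    # the same renormalization the Stan model applies
    return -np.sum(norm.logpdf(seen, mu, sigma)
                   - norm.logsf(cutoff, mu, sigma))

fit = minimize(negloglik, x0=[seen.mean(), 0.0])
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
# with n = 2000 the estimates land near (10, 1) despite seeing only the tail
```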