The p value is perhaps one of the most misused concepts in statistics. In fact, many researchers in science who are not statistics experts seem to believe that statistics is really the study of how to define and calculate p values. I'd say this attitude is especially prevalent in biology, medicine, and some areas of the social sciences.
The truth is, to a Bayesian such as myself, p values are largely irrelevant, but they DO have one area where they make sense. First, let's understand what a p value is.
p value meaning
The p value in a test is the probability that a random number generator of a certain type would output test-statistic values as extreme as, or more extreme than, the actually observed value of that test statistic: p = P(t(D) ≥ t(d)), where d is your observed data, D is data produced by the random number generator, and t is the test statistic.
Procedurally, imagine that your data comes from a random number generator which has boring properties that you don't care about. If your data came from this random number generator, it would, by definition, be an uninteresting process that you'd stop studying. Call this random number generator's output D, and call your actual data d (little d). Now consider some function t which maps your data to a real number: t(d) for your data, or t(D) for random generator output. Generally the function t measures in some sense how far away your data falls from an uninteresting value of t (often t = 0). Now, how often would your specifically chosen boring random number generator produce fake data D whose value t(D) is more extreme than the value t(d) of your actual data? This is what the formula above describes.
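As a concrete sketch (my own toy example, not any particular named test): take the "boring" generator to be standard normal noise, take the test statistic to be the absolute value of the sample mean, and estimate the p value by simulation.

```python
import random

def t(data):
    # test statistic: absolute value of the sample mean,
    # i.e. the distance from the "uninteresting" value t = 0
    return abs(sum(data) / len(data))

def p_value(d, n_sims=100_000, seed=0):
    """Fraction of 'boring' datasets D (standard normal noise)
    whose test statistic is at least as extreme as t(d)."""
    rng = random.Random(seed)
    t_obs = t(d)
    extreme = 0
    for _ in range(n_sims):
        D = [rng.gauss(0.0, 1.0) for _ in range(len(d))]
        if t(D) >= t_obs:
            extreme += 1
    return extreme / n_sims

# data whose mean is clearly far from 0 should give a tiny p value
print(p_value([1.2, 0.9, 1.5, 1.1, 0.8, 1.3, 1.0, 1.4]))
```

Nothing here says the normal generator is the right model for anything; it is just one specific boring process we chose to compare against.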
p value use
So, the above description seems rather sterile; here is an example of "proper" use of a p value: filtering data.
You have a large sample of 1 second long audio recordings of the ambient noise around the area of a surveillance camera. You want to detect when the overall loudness of the noise is "unusual" so that you can tag and save the audio and video recordings for 30 minutes on either side of the "unusual" event. These events will be saved indefinitely, and other time periods will be deleted after 7 days to reduce data storage requirements.

You calculate an overall amplitude A of the sound recording, for example the root mean square of the N recorded voltages v_i: A = sqrt((1/N) * sum_i v_i^2). This is a real number, and its calculation from the data does not require generating random numbers; the formula is therefore a deterministic function that maps your data (a sequence of voltages) to a real number, and qualifies as a "test statistic". Next you manually identify a set of 1000 time intervals during which "nothing happened" on your recording, and you calculate the A values for each of these "uninteresting" intervals.

Now, if you have an A value which is greater than 99% of all the "uninteresting" A values, then you know that the A value is unusually large under the assumption that it was generated by the "nothing happened" random number generator. In this case, the p value for the amplitude to come from a "nothing happened" time period is p = 0.01, because 99% of "nothing happened" samples have amplitude less than this given amplitude.
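A minimal sketch of this filter (the function names and the RMS amplitude are my own illustration, not from the original set-up): compute A for a new clip and compare it against the empirical distribution of A over the "nothing happened" reference clips.

```python
import math
import random

def amplitude(voltages):
    # RMS amplitude: a deterministic map from a clip to a real number
    return math.sqrt(sum(v * v for v in voltages) / len(voltages))

def empirical_p(a_new, a_reference):
    """p value of a_new under the 'nothing happened' generator,
    estimated as the fraction of reference amplitudes >= a_new."""
    return sum(1 for a in a_reference if a >= a_new) / len(a_reference)

def should_save(clip, reference_amps, threshold=0.01):
    # tag the clip as "unusual" when its p value is small
    return empirical_p(amplitude(clip), reference_amps) <= threshold

# toy reference set: 1000 quiet clips, plus one loud clip to classify
rng = random.Random(42)
reference = [amplitude([rng.gauss(0, 0.1) for _ in range(100)])
             for _ in range(1000)]
loud_clip = [rng.gauss(0, 1.0) for _ in range(100)]
print(should_save(loud_clip, reference))  # the loud clip gets tagged
```

Note that all this does is flag clips worth keeping; it says nothing about *why* a clip was loud.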
Note that this does not mean in any way that "a crime happened": perhaps a cat knocked over a trash can, or a window washer came and bumped the camera, or a jet airplane flew over, or a Harley Davidson drove by, or whatever. Taking the fact that the audio was louder than most of the "nothing happened" samples as evidence that "a crime was committed" is seriously WRONG, in just the way that taking the fact that your psychology experiment produced measurements that are different from some simple "null hypothesis" as evidence that "my explanatory mechanism is true" is also seriously WRONG.
The real value of p: filtering
So, we see the real value of "p" values: filters. We have lots of things that we probably shouldn't pay attention to: chemicals synthesized at random in a pharma lab, noise produced by unimportant events near our surveillance camera, time periods on a seismometer during which no earthquake waves are being received, psychology experiments that we shouldn't pay attention to. The p value gives us a way to say "something is worth considering here". The big problem comes when we make the unjustified leap from "something is worth considering here" to "my pet theory is true!".
As a Bayesian, I appreciate the use of p values in filtering down the data to stuff I probably shouldn't ignore; it's a first step. But the next step is always going to be "let's build a model of what's going on, and then find out what the data tells us about the unobserved explanatory variables within the model." That's where the real science occurs!
"A 20x20x8 ft room is full of air containing 1000 pollen grains per cubic foot, a HEPA filter runs in the room, stirring the air well. The HEPA filter removes 99.997% of pollen grains that pass through the filter, and moves r=300 cfm of air. How long before the room is down to 1 pollen grain per cubic foot?"
It's a more or less textbook problem for Calculus and/or differential equations classes, but it's worth thinking about how to set it up, and also it's worth knowing, because damn the allergens are serious around here.
Using nonstandard analysis, we'll suppose we have an infinitesimal time dt in which a volume r dt of air passes through the filter, with the concentration of pollen in the exhaust being 0.00003 c and the concentration in the remaining air remaining as c. The overall concentration is then (c (V - r dt) + 0.00003 c r dt) / V. The change in concentration per unit time is then dc/dt = (this quantity - c)/dt, which works out to dc/dt = -0.99997 (r/V) c, which has solution c(t) = c(0) exp(-0.99997 (r/V) t) (differentiate to verify). Since 0.99997 is about 1, let's just say we're looking for the time when exp(-(r/V) t) = 1/1000, or t = ln(1000) V/r. Plugging in our numbers (V = 20 × 20 × 8 = 3200 ft³, r = 300 cfm): t = ln(1000) × 3200/300, about 74 minutes for our example room!
So although it only takes 11 minutes or so to move one roomful of air through the filter, because the clean air mixes with the "dirty" air it takes 74 minutes to fully scrub the room. Also, the time is proportional to the ratio V/r, so for a fixed-size filter, the bigger the room, the longer the time to clean it, scaling linearly. Get a big filter, and run it for a lot longer than you think!
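The closed-form answer is easy to cross-check numerically (a quick sketch using the numbers from the problem):

```python
import math

V = 20 * 20 * 8        # room volume, cubic feet
r = 300                # filter flow rate, cfm
eff = 0.99997          # fraction of pollen removed per pass

# closed-form time to fall from 1000 to 1 grain per cubic foot
t_exact = math.log(1000) / (eff * (r / V))

# crude Euler integration of dc/dt = -eff*(r/V)*c as a sanity check
c, t, dt = 1000.0, 0.0, 0.01
while c > 1.0:
    c += -eff * (r / V) * c * dt
    t += dt

print(round(t_exact, 1), round(t, 1))  # both come out near 74 minutes
```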
I have grass allergies. I finally got the allergy testing panel done and BAM, all the grasses lit up like a Christmas tree. So my doctor suggested that I consider sub-lingual grass allergy pills. He recommended that I do a little research on them, and then if I decide I want to try them, make an appointment, because you have to do the first dose in the doctor's office as there's a small risk of anaphylaxis (O(1/1000) or so).
So, after getting the original article and looking at it, it seems that there is a consistent result: allergy symptoms ARE reduced vs placebo. But as the quote in the popular article puts it:
"Consider that the symptom score scale is 18 points," Di Lorenzo said. "So, less than 1 point difference is not clinically significant. This means that the treatment alone is not sufficient to control symptoms."
Well, the truth is, no one knows how "clinically significant" the 1 point difference is, and the reason is that the scale is fully arbitrary. Consider for example if we'd measured how much life-extension a given cancer drug gave. Clearly the dimensions of the measurement are "time", so if I told you "it gave an increase of 1 on a scale of 18" you obviously couldn't tell what the heck was going on. Now, if I said it gave an increase of "1 day" or an increase of "1 year" or an increase of "1 million years!" you'd know what I meant, but the fact that the largest unit you can measure is "18" doesn't tell you anything (for example you can have a clock with a battery that lasts 18 months vs a clock with a battery that lasts 18 million years).
In essence, we're creating a new dimension for measurement, like distance, time, or mass, "symptomaticity" is a dimension, and the scale is arbitrary. So the only way we can determine what is going on is to compare the results under the sublingual treatment to some meaningful benchmark, and since placebo isn't a treatment, we need it to be some other meaningful treatment. For example, the immunotherapy injections, which apparently none of these studies compared to. It's not surprising, since immunotherapy injections are a big commitment that takes a while to take effect and involves a lot of trips to the doctor and a lot of needle sticks. But without such a comparison, we can't determine the "units" (like "year" or "month" in my example).
Another way to deal with this issue would be to convert to another meaningful scale. For example, you could give people randomly either placebo or treatment, and then after a while switch them to the other arm (placebo->treatment, and treatment->placebo), all double-blinded of course, then ask them how much they'd be willing to pay per week to have whatever their preferred treatment was (and which did they prefer). Now you're measuring things in the dimensions of "preferences" as measured in units of dollars, and you can compare the dollar amounts. The dollar is a well-defined unit: we have a large number of products that cost various amounts of dollars, so we can determine what the preference is "equivalent to", just as we can determine that "3 feet" is "equivalent to the height of a 4 year old" or whatever.
In any case, while the result of this meta-analysis is interesting, it's not by itself enough for anyone to make decisions like "should we bother with using sublingual grass allergy tablets?" since we're ultimately not sure what the scale means.
Anyway, you see things like: "compar[ed]...more than 1600 medical conditions... and found 55 diseases that correlated with specific birth month".
That's 1600 conditions times 12 different possible birth months, a total of 19200 comparisons! And they only found 55 spurious results??? Surely they could have pumped it up a bit more than that!
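A back-of-the-envelope sketch of the point (assuming independent tests at a nominal p = 0.05 cutoff, with no multiple-comparisons correction): chance alone would be expected to flag far more than 55 of those comparisons.

```python
comparisons = 1600 * 12   # conditions x possible birth months
alpha = 0.05              # nominal significance level

# expected number of "significant" results from pure noise
expected_false_positives = comparisons * alpha
print(comparisons, expected_false_positives)  # 19200 comparisons, ~960 by chance
```

Presumably the authors applied some correction to get down to 55, but the raw multiplicity here is the thing to keep in mind.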
Note that the author says "we found significant associations". This is probably another one of those cases where "significant" doesn't mean what you think it means: statistically significant, at the p=0.05 level or some similar thing. He later says "the risk related to birth month is relatively minor when compared to more influential variables like diet and exercise."
No doubt there are diseases where, in your first year, exposure to sunlight, or diet differences between seasons, or air pollution, or other effects change your lifetime risk of something or other: maybe allergies, asthma, diabetes, leukemia, whatever. But this study smells pretty fishy and I think I can even see the tacklebox from here....
If you use Kerberos and NFSv4 you can get pretty decent security from NFS, which historically was pretty shaky on the security front. On the other hand, you can also get bitten by bugs. In particular, it seems that Thunderbird and Firefox (Icedove, Iceweasel on Debian) tend to do a lot of sqlite based file access, including a bunch of file locking which can be buggy on NFS. This turns into a big problem when overnight your Kerberos tickets expire because you haven't unlocked your screensaver, or done a kinit in a while. The symptom is log files (on Linux) like the following:
[50184.811684] NFS: v4 server returned a bad sequence-id error on an unconfirmed
Repeated over and over. There are other reasons this can happen, but expired Kerberos tickets seem to be one of them. Fortunately, there's a reasonable way to deal with this: crontab.
Every night at 2am or so, you can run "kinit -R" via cron, and it will renew your tickets. This enables you to get through the night without entering your password, or having thunderbird/firefox bork on you. Maybe. It will only work if your renewable lifetime is sufficient, but for anyone who logs into / unlocks their computer at least a couple times a day this should solve the problem of overnight expiration. It might not work so well if you go off on vacation for a week. If you need to keep things running longer, you could crontab this every 9 hours or so.
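For example (the path and timing are illustrative, and this assumes cron runs with access to your default Kerberos credential cache), a per-user crontab entry along these lines does the nightly renewal:

```shell
# In your user crontab (crontab -e):
# renew renewable Kerberos tickets every night at 2am
0 2 * * * /usr/bin/kinit -R
```

Remember that `kinit -R` can only renew a ticket that is still within its renewable lifetime; check `klist` to see what yours is.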
Google Cloud storage claims "11 nines" of reliability in their storage. The earth is 4 billion years old, and has experienced about 5 mass extinction events. Google believes that if I put a picture of my cat in their cloud storage today, and then wait another 4 billion years, after 5 more mass extinctions, there's a 96% chance I'll still be able to access my cat picture?
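The arithmetic behind that 96% (treating "11 nines" as an annual loss probability of 10^-11, which is my reading of the durability claim):

```python
annual_loss = 1e-11   # "11 nines" of durability, per year
years = 4e9           # another age-of-the-earth of waiting

# probability the cat picture survives every one of those years
survival = (1 - annual_loss) ** years
print(round(survival, 3))
```

Since 4e9 × 1e-11 = 0.04, this is essentially exp(-0.04) ≈ 0.96.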
Suppose for a moment that you are involved in a psychology experiment. A grad student brings you into a room with a table, on which is a bag full of coins, and a small pile of pennies. The student tells you that the pennies were sampled from the bag; there are 6 of them on the table. Using a scale, you can determine that the 6 pennies weigh 0.5 grams, and that the bag weighs 200 g.
Seeing this set-up, you are then asked: what can you determine about the bag?
What can you write? Well, the answer is "not much". In particular, from the set-up, you can write "the bag used to contain at least 6 pennies" and very little else. However, it seems plausible that, given this set-up, many people would naturally be inclined to extrapolate the number of pennies in the bag, saying that there are about 200 × 6/0.5 = 2400 pennies total in the bag. And if we extend this example to more realistic situations, such as determining in a lawsuit how much damage occurred in a particular building, the likelihood that the "natural" extrapolation would be employed by a typical participant becomes very high (in my experience). The set-up, with a bunch of seemingly precise data in a textbook-type example, leads you along the garden path towards the "desired answer".
To counteract this, I came up with the bag example, and an explanation about the missing information: "how were the pennies sampled?"
That this question is crucial becomes obvious when we consider the following different options:
1. Reach into the bag, stir it around, pinch a bunch of coins, pull them all out, and put them on the table.
2. Reach into the bag, pull out coins one at a time. If you get anything other than a penny, flip it, and if it comes up heads, throw it back. Pull coins until you have 6 total.
3. Reach into the bag, pull out one coin at a time, and throw back anything that isn't a penny. Continue until you get 6 pennies.
4. Reach into the bag, pull out the first coin, and flip it: if it comes up heads, use protocol (1) above to pull a sample of coins; if it comes up tails, use protocol (2) above...
5. ........ there are an enormous set of possible methods to sample coins .....
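A toy simulation (entirely my own construction, with a bag composition I made up) of why the protocol matters: fill a bag with a known mix of coins, draw 6 pennies under a random-pinch protocol versus a pennies-only protocol, and apply the naive weight-based extrapolation to each.

```python
import random

PENNY, QUARTER = 2.5, 5.67   # approximate US coin weights in grams
rng = random.Random(1)

# a bag with a KNOWN composition: 200 pennies, 600 quarters
bag = [PENNY] * 200 + [QUARTER] * 600
bag_weight = sum(bag)

def protocol_1(bag, n=6):
    """Random pinch: n coins drawn uniformly without replacement."""
    return rng.sample(bag, n)

def protocol_3(bag, n=6):
    """Draw coins one at a time, throwing back anything that isn't a penny.
    (Simplified: draws with replacement, which hardly matters here.)"""
    coins = []
    while len(coins) < n:
        c = rng.choice(bag)
        if c == PENNY:
            coins.append(c)
    return coins

def naive_estimate(sample, bag_weight):
    # "pennies in bag ~= bag_weight * count / sample_weight":
    # implicitly assumes the whole bag looks like the sample!
    return bag_weight * len(sample) / sum(sample)

print("true penny count:", 200)
print("naive estimate, protocol 1:", round(naive_estimate(protocol_1(bag), bag_weight)))
print("naive estimate, protocol 3:", round(naive_estimate(protocol_3(bag), bag_weight)))
```

Under protocol (3) the sample is 6 pennies by construction, so the naive estimate says the bag holds roughly 1500+ pennies when it actually holds 200: the data are precise, the sampling process makes the extrapolation meaningless.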
In litigation situations, people with expertise in the relevant type of damages are often called upon to look carefully at some set of samples and determine in detail the degree or types of damage involved in those samples. Statisticians are very familiar with the scientist who shows up with a pile of detailed, expensive data and, in the words of John Tukey, "an aching desire for an answer". Unfortunately, detailed data about some totally unknown sampling process produces... totally unknown answers.
I have a fairly extensive SIP based VOIP system. Recently, after buying a FireTV Stick for Christmas, I've been having lots of complaints about "breaking up" when the kids are streaming high def content, like Sesame Street or nature shows about ants or whatever.
I thought this was fairly odd, because I run an OpenWRT router with QOS settings that I THOUGHT would take care of prioritizing my upstream voice (RTP packets).
So, to make a long story short: the standard QOS scripts prioritize TCP SYN and TCP ACK packets, and apparently the way Netflix works is, every 10 seconds or so, to open up a gazillion HTTP requests to grab blocks of video. These got classified highest priority due to the SYNs and ACKs or some such thing.
Eliminating this reclassifying behavior makes my voice nice and steady even while streaming.
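A sketch of the general idea on an OpenWRT-style Linux router (illustrative only, not my exact config; the RTP port range is an assumption, so adjust it to match your SIP setup): mark outbound RTP for expedited forwarding so the QoS layer prioritizes voice rather than whatever the SYN/ACK boost picks up.

```shell
# Mark outbound RTP (assumed here to be UDP ports 10000-20000) with the
# EF (expedited forwarding) DSCP class so the QoS scheduler favors voice.
iptables -t mangle -A POSTROUTING -p udp --dport 10000:20000 \
         -j DSCP --set-dscp-class EF
```

The matching change on the download side is to stop reclassifying bulk TCP flows upward just because they carry SYN/ACK flags.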
Andrew Gelman discusses John Cook's post about order-of-magnitude estimates and goes on to state that a certain astronomically tiny figure is a "hopelessly innumerate" estimate of the probability of a decisive vote in a large election.
What I have to say about that is that the probability of a decisive vote in a particular election is very different from the long-run frequency of decisive votes in national elections. I'll venture to say that the long-run frequency of decisive votes in elections involving more than 10 million voters will be exactly zero. There will be a finite number of these elections before the end of the human race, and, as in the Gore/Bush case, there are too many ways to fiddle with the vote counts for a decisive vote to ever really occur: any election where the margin comes down to a few hundred votes will be decided by committee, even if that committee is the people deciding which ballots are invalidated. Committees invalidating ballots will always find a way to invalidate enough that the difference isn't down to 1 vote (a prediction, but not an unreasonable one, I think).
But whether an estimate like that is terrible in a given election comes down to what our state of knowledge is about that election. Consider the following graph:
This binomial model puts a moderate amount of probability on an exact 1M-1M tie in a 2M vote election if you put exactly p = 0.50000..., but if you vary p from this by even ±0.001 the probability plummets towards "hopelessly innumerate" levels. But what does this even mean in our case?
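That sensitivity is easy to check directly (my own illustration: the exact binomial pmf of a 1,000,000-vote tie out of 2,000,000, computed in log space via the log-gamma function):

```python
import math

def binom_pmf(k, n, p):
    """Exact binomial pmf, computed via lgamma to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

n, k = 2_000_000, 1_000_000
for p in (0.5, 0.5001, 0.501, 0.51):
    print(p, binom_pmf(k, n, p))
```

At p = 0.5 the tie probability is around 1 in 2000; nudging p away from 0.5 sends it down by orders of magnitude, and by p = 0.51 it is astronomically small.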
In real world situations, we have the following uncertain variables:
- How many votes will be cast.
- How many votes will be allowed, and from which districts.
- What will be the total count of the allowed votes (assume a yes/no vote on a ballot measure for simplicity).
Note that there is no "p" that corresponds to the binomial probability formula. The usual intuition on such formulas is that p is the long-run frequency that will be observed in infinitely repeated trials. Such a parameter is meaningful for an algorithmic random number generator, but that interpretation is meaningless for a single election. Still, a binomial distribution is a reasonable model for counts of yes/no sequence outcomes where we know nothing about which individual sequence we will get, except that sequences with certain counts are more or less probable (in the sense that the parameter p indexes the highest-probability count).
So, if we're in a state where we are quite certain that the highest-probability count is a little different from 1000000/2000000, it is very reasonable to call the chance of a decisive vote in a given election astronomically small. The fact is, though, that a prior over the hyperparameter p (the location of the maximum-probability count) is rarely strongly peaked around any given value (i.e. peaked around 0.5001 ± 0.00005). Much more likely, a probability distribution for the "highest probability count" (i.e. a prior over p) would be broad at the level of 0.5 ± 0.02...
I think the average researcher views statisticians as a kind of "Gatekeeper" of publication. Do the right incantations, appease the worries about distributional approximations or robustness of estimators, get the p < 0.05, or you can't publish. In this view, the statistician doesn't add to the researcher's substantive hypothesis, but merely keeps the researcher from getting into an accident, like a kind of research seat-belt.
The alternative version is what I like to think of as the Keymaster role. A researcher, having a vague substantive hypothesis and an idea of technically how to go about collecting some data that would be relevant, can come to a good statistician, or better yet mathematical-modeler (which encompasses a little more than just applied probability, sampling theory etc) who will help make a vague notion into a fairly precise and quantitative statement about the world. This process will get you thinking about the relationships between your quantities of interest, and identify some substantive but unknown parameters that describe the system you are studying. That model structure will then give you a sense of what data will best inform you about these precise quantities, and then ultimately when the Keymaster analyzes the collected data, he or she can extract the meaningful internal unobserved quantities that you really care about (but didn't know about) originally.
This explains why I think it's a big mistake to go out and collect some data first and then show up and expect a statistician to help you make sense of it.
And, I mean really, who wouldn't want to be Rick Moranis??