# Fly Genetics, Censoring, and Bayes Theorem

Today I had the good fortune to talk with a longtime friend of mine Dennis Hazelett. He is a developmental biologist and fly geneticist currently working in David Morton's lab at OHSU. Previously we wrote a paper together on using the mathematics of simple convolutions to detect regulatory regions in fly chromosomes.

In his fly experiment, he has set up a control fly cross in such a way that on average half the eggs will be of type A and half the eggs will be of type B. However, due to other typically tricky fly genetics factors, eggs of type B will die with some probability which is guessed to be largish but is not known. In a separate experimental cross in the presence of a third factor C (a null heterozygous mutant allele on a different chromosome), half the eggs will be of type A and half the eggs will be of type B but the probability that type B eggs die is assumed to be different, probably smaller. This is an example of "suppression" of one gene by another. These days many scientists study interactions such as "suppression" or "enhancement" of one gene by another.

The standard technique to analyze this kind of experiment is to collect a large number of flies from the control condition, and count the number of each type. Then use the fraction of flies from each type as a model for the "expected" frequency. The experimental flies are then collected and a chi-squared test for goodness of fit is used to determine whether it is unlikely that the experimental frequency for the two types A and B is the same as the control frequency for A and B. If you reject this hypothesis then you can claim evidence for some kind of interaction. However, the problem is that such a procedure does not give you a method for estimating the probability of lethality (unless the count is very large and hence uncertainty can be ignored). Most biologists Dennis knows will have left the room by now, but those who are still along for the ride may be interested in developing a better method for analyzing this kind of genetic interaction experiment.

The trick is that the data he collects is the number of surviving flies of type A and the number of surviving flies of type B for each of the two cases C (experimental conditions containing factor C) and ~C (no factor C, or control). What is censored from him by the experimental conditions is N the number of total eggs that were laid in each experiment, and he doesn't know p(B survives| C) or p(B survives | ~C) which are the quantities he is really interested in.

To calculate what he wanted to know, after some complicated back and forth to figure out what was really going on, we decided on the following application of Bayes' Theorem.

Let's define some variables. NA is the number of A flies counted, NB is the number of B flies counted, p = p(B survives | C) or p = p(B survives | ~C) depending on which experiment we're talking about (C and ~C are known experimental conditions set by him) and let Ntot be the total number of eggs laid by the flies which he *doesn't know* and is a nuisance parameter (he doesn't care what Ntot is).

Bayes Theorem for the joint probability of observing NA,NB,Ntot,p is:

p(NB | NA,Ntot, p) p(NA|Ntot,p) p(p|Ntot) p(Ntot) = p(p | NA,NB,Ntot) p(NB | NA,Ntot) P(NA|Ntot)p(Ntot)

The quantity of interest for each experimental condition is p(p | NA, NB) since he observes NA and NB and he doesn't care about Ntot. He can get this by solving for p(p|NA,NB,Ntot) and integrating out Ntot.

How can he set up the calculations to get what he wants? Let's look first at the left hand side of the Bayes' Theorem. This is where we have a "likelihood" of seeing the observed values NA,NB.

using R notation:

p(NB |NA, Ntot, p) = dbinom(NB,size = Ntot-NA,prob=p)

p(NA|Ntot,p) = dbinom(NA,size=Ntot,p=1/2)

p(p|Ntot) = p(p) = a prior distribution for the probability of a type B egg surviving. let's use dbeta(1/2,1) for the NC condition (more likely to be lethal) and dbeta(1,1/2) for the C condition (more likely to be rescued).

p(Ntot) ~ dexp(Ntot-NA,rate=1/(NB/E(p) ) Here we're using the maximum entropy distribution for Ntot by transforming it to a distribution on number of B eggs laid, this says we expect the number of B eggs (Ntot-NA) to be NB/E(p) where E(p) is the expected value for p, the probability of egg of type B surviving. This is an ok prior model. perhaps something better could be thought up.

Finally, to get the thing we're interested in, (the first factor on the right) we need to divide both sides by the probability of p(NB|NA,Ntot) p(NA|Ntot). Then we need to integrate out the Ntot from the remaining expressions.

We can calculate p(NB|NA,Ntot) p(NA|Ntot) by conditioning on p, and then integrating over all prior values of p.

What we get is this (if I haven't made mistakes)

Substituting in for the expressions for the associated probabilities from above, we could potentially even try to calculate this in closed form. However, the easiest way would be to do numerical integration over Ntot and p using monte-carlo simulation. For each different value of Ntot you draw a sample of the denominator and calculate an average of that sample, then draw samples of the numerator and divide by the denominator average. Then once you have all the samples of the numerator one set for each value of Ntot with size proportional to the probability of Ntot, you marginalize over Ntot to get the distribution of p given NA and NB which is what you wanted in the first place.

I'll see if I can implement this in R and post a link.

Dan, what if you had more than two phenotypic classes to score? I suppose if you're only interested in one of the classes you might pool the other classes and treat them as one. But suppose you want to know whether males or females are affected more. Can you recursively model the distributions for these other classes within your A variable, or would it be best to create an independent estimate pB for each class based on this model?

I think it's best, or at least easiest to understand, if you separate the groups and get separate estimates for p(B survives | each class)

On the other hand, if the classes don't completely separate (ie. they're a mixture of sex, genotype, and drug treatments and you don't have all possible mixtures separately) then there are a bunch of techniques that Andrew Gelman advocates for Hierarchical Bayesian Modeling.