# Bayesian model of differential gene expression

2016 November 5

My wife has been working on a project where she's looking at 3 time-points during a healing process and doing RNA-seq. I'm analyzing the known secreted proteins for differential expression, using a fully Bayesian model in Stan.

Fun, but non-trivial. There are around 6000 genes being considered, and there are 5 or 6 of these parameter vectors... 36,000 parameters to be estimated from 60,000 data points, and the parameters all have fairly long tails... Cauchy priors and so forth. Then I need a couple hundred samples, so Stan is outputting something like 5-10 million numbers. Warmup has been a problem here, and the long tails have required re-parameterizations and tuning. Fun stuff.
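The post doesn't show the actual Stan code, but the kind of re-parameterization it alludes to for heavy tails is a standard trick: instead of sampling a Cauchy-distributed parameter directly, sample a bounded uniform variable and transform it through the tangent function. A minimal numerical sketch (the `mu` and `sigma` values here are made up for illustration):

```python
import numpy as np

# A heavy-tailed Cauchy prior sampled directly can stall HMC warmup.
# Standard reparameterization: draw u ~ Uniform(-pi/2, pi/2) and set
# x = mu + sigma * tan(u), which is exactly Cauchy(mu, sigma) but gives
# the sampler a bounded, well-behaved parameter to work with.
rng = np.random.default_rng(0)

mu, sigma = 0.0, 2.5  # hypothetical prior location and scale
u = rng.uniform(-np.pi / 2, np.pi / 2, size=100_000)
x = mu + sigma * np.tan(u)  # Cauchy(mu, sigma) samples

# The median is a stable summary for a Cauchy (the mean does not exist).
median_x = float(np.median(x))
```

In Stan the same idea is expressed by declaring the uniform variable as the parameter and building the Cauchy quantity in the transformed parameters block.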

3 Responses

This is something I am interested in. Would you elaborate on the tools/pipeline/methodology you used to do this analysis, please? I am a biologist and I am aware of the common tools used in this analysis. I am not of the statistical variety, so I have been wondering if anybody has used Stan for the analysis, and this is where I found it!

Did you ever test Salmon (from Rob Patro's group) or Kallisto/Sleuth (from the Pachter group) for expression analysis like this?

Thanks

This is the first real RNA-seq analysis I've done; my wife is the biologist. I'm not that familiar with the "standard" tools, but I have seen enough to know that they're usually frequentist and, in my opinion, usually going down the wrong road. (For example, the Receiver Operating Characteristic (ROC) curve is a statistic used to compare methods, and I think this is highly mistaken: RNA-seq is not a transmission line from the cell to your ear telling you "which are the true positives," which is the philosophy behind ROC.) The goal in an RNA-seq analysis should be to measure the expression profile given that you have a noisy measurement process, and that most of the noise comes from biological replication, not measurement error in a single sample.

In any case, the theory I'm using here is something like this:

A single set of cells is sent through the RNA-seq machinery and a set of gene counts comes back. If you divide all the counts by the total count, this is in essence a single vector sample from a distribution over categorical frequency distributions. A typical distribution over categorical frequency distributions is the Dirichlet distribution. So, our Bayesian goal in doing this repeatedly with different biological samples is to find a single vector parameter Ceff (for "effective counts") so that our data vectors look like samples from dirichlet_rng(Ceff).
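That generative picture can be sketched numerically. The name `Ceff` follows the comment; the gene count is shrunk from the post's ~6000 to 50 for speed, and the gamma draw for `Ceff` is just a hypothetical way to get some positive effective counts:

```python
import numpy as np

rng = np.random.default_rng(1)

n_genes = 50  # stand-in for the ~6000 genes in the post
Ceff = rng.gamma(shape=2.0, scale=50.0, size=n_genes)  # hypothetical effective counts

# Each biological replicate's normalized count vector is modeled as one
# draw from Dirichlet(Ceff): a distribution over frequency distributions.
n_reps = 4
freqs = rng.dirichlet(Ceff, size=n_reps)  # shape (n_reps, n_genes)

# Each row behaves like counts divided by the total count: non-negative,
# summing to 1, scattered around the expected profile Ceff / sum(Ceff).
expected_profile = Ceff / Ceff.sum()
row_sums = freqs.sum(axis=1)
```

The larger the total of `Ceff`, the more tightly the replicate frequency vectors cluster around `expected_profile`, which is what makes `Ceff` an "effective count" rather than just a frequency.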

Then, we can construct priors over the Ceff based on our biological model for what's going on (for example in my wife's case these are the same cell type at different time points in a healing process, so the expression profile should mostly be the same, but certain important genes could be up or down regulated).

I write a prior over log(Ceff) by specifying a prior over the baseline day-zero expression profile, and priors over "expression differences" for the different time points. The expression differences are Cauchy distributed, reflecting the fact that most genes are very similar between the time points, especially those which have low counts, and yet some small set of genes can be many factors bigger or smaller than the control.

Finally, I run the Stan code and get estimates of these differences in logarithms and these differences define whether and how much I'm estimating that a gene is up or down-regulated.
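Given posterior draws of those log differences, the up/down call and its size can be read off directly. The draws below are fake stand-ins for real Stan output (the post mentions a couple hundred samples), generated from an arbitrary normal just to show the summaries:

```python
import numpy as np

rng = np.random.default_rng(3)

# Pretend these are 200 posterior draws of one gene's log expression
# difference between a time point and the day-zero baseline.
draws = rng.normal(loc=0.8, scale=0.3, size=200)

# "Whether" it's up-regulated: posterior probability the difference is > 0.
prob_up = float(np.mean(draws > 0))

# "How much": a typical size of the shift, reported as a fold change.
median_log_diff = float(np.median(draws))
fold_change = float(np.exp(median_log_diff))
```

Unlike a p-value, `prob_up` is a direct statement about the quantity of interest, which fits the post's framing of measurement rather than true-positive detection.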

Runs take many hours, but I am confident at least that I understand the model, as opposed to some canned piece of software based on some basic theoretical ideas I have no confidence in.

Ok. I was aware that most of the tools are frequentist tools, and hence my interest in your post. I am a biologist not well versed in statistics, so the theory you described will take a while to sink in. In the meantime, I have a couple more things to add.

There are multiple sources of error in RNA-Seq and the data is inherently noisy. You already know that. Methods development people have been fiercely arguing about the noise and its incorporation in the data analysis. Hence most of the methods/tools try to minimize the (biological and non-biological) errors as part of data cleaning/quality assessment.

Most of the tools need a lot of computational resources, making it prohibitive for bench scientists to try them out. The reasoning goes like, 'if I have to go to the cloud to do the analysis, then I might as well pass it on to the statistician down the hall.' Unfortunately, that is not good from an experimentalist's point of view: the one generating the data will have insights on the relevant biology and should be doing the analysis. This is where methods like Kallisto/Sleuth and Salmon come in. They are reportedly fast and need only resources available to everyone; they are supposed to run on laptops in reasonable time. I was wondering if you had a chance to use them.

Thanks