We propose a new prior for ultra-sparse signal detection that we term the "horseshoe+ prior." The horseshoe+ prior is a natural extension of the horseshoe prior, which has achieved success in the estimation and detection of sparse signals, possesses a number of desirable theoretical properties, and remains computationally feasible in high dimensions. The horseshoe+ prior builds upon these advantages. We prove that the horseshoe+ posterior concentrates at a rate faster than that of the horseshoe in the Kullback-Leibler (K-L) sense. We also establish theoretically that the proposed estimator has lower posterior mean squared error in estimating signals than the horseshoe and achieves the optimal Bayes risk in testing up to a constant. For global-local scale mixture priors, we develop a new technique for analyzing the marginal sparse prior densities using the class of Meijer G-functions. In simulations, the horseshoe+ estimator demonstrates superior performance in a standard design setting against competing methods, including the horseshoe and Dirichlet-Laplace estimators. We conclude with an illustration on a prostate cancer data set and some directions for future research.

I like to be able to develop models of my data and would like to factor in the variation in the data rather than discarding reads. Current methods do not remove reads wholesale simply because they do not align; the bulk of unaligned reads is a direct result of inherent limitations of the equipment/sequencers as well as of library construction methods. The field has come to this stage the hard way, by committing the errors and figuring them out. However, Bayesian methods have yet to impact us in a big way, and that is disappointing. They would be greatly helpful when single-cell data are involved: no replication is possible, because each cell is different. That means the inherent noise level is sky high. I will not be very happy to discard reads due to noise if there is a way to factor them in.

I get the difficulty you have with putting the current project's analysis on GitHub. I probably can't do that for my own data either; too many people involved and too many approvals needed. However, it should hopefully be possible to put up a general description of the various steps.

Cheers

If it was important enough, I think you could spin up a hefty preemptible Google Compute machine to do this with 200 GB of RAM and 32 cores or whatnot and make it work for not much money (it'd cost $12 to do 30 hours of computing). My model runs on 4 cores and a machine that has 64 GB of RAM in 4 to 6 hours to get 600 samples per chain (300 burn-in and 300 regular). If you don't save too many samples, it would be tractable on today's machines to do the full 40k genes in the genome.

The advantage of a bespoke analysis is putting in information about your specific situation. For example, because these are the same tissue at two different timepoints, I expect much less difference in expression than if these were, say, two different cancers from two different tissues in two different patients.

Sharing the specifics of this will require getting buy-in from other people, but I'm sharing the general idea here because I really do think it's something people should be doing, namely thinking about their data and building models. Looking for a canned utility to solve your problem is itself a problem in many cases.

1) I figured fitting would take an inordinate amount of time, with parameter estimates for 20-40K+ transcripts plus internal parameters.

2) I didn't want to re-invent the mean-variance / overdispersion modeling methods covered by canned procedures in the literature. Granted, these usually aren't that complicated to represent, but I always wonder if it's worth it if #1 is limiting.

Maybe it's gotten better with the HMC or VB procedures available now. Any chance you could share some of this on GitHub?

I would like to write up a discussion on this and get it published. I'll have to work with my wife on how to get that done, as it's her data and it was collected in collaboration with another organization. So it might be a little while before it becomes available.

A big problem is that biological researchers are still thinking along the lines of creating a "standardized pipeline for discovery" as if somehow you can discover things in an automated way without ever accounting for the specifics of what biological experiment you actually ran! My goal would be to turn that conversation back towards "here are tools for understanding data that should be put together in ways that help you make sense of the specific biological situation you are studying"

Are you planning on writing up some of those scripts here? Here is a motivation from the biology side: I had not heard of the "Dirichlet distribution" until you wrote about it above!

Thanks

The error and noise in RNA-seq is something that needs to be acknowledged. In a Frequentist analysis you "clean" the data to make it behave more like some default assumptions. In a Bayesian analysis you try to actually describe what the noise looks like and then infer what the observed values mean about the unknown things you want to estimate.

As such, I very simply take the total counts for each gene, add 1 so that none of them are exactly zero, and then divide by the total. There is no "cleaning" the data; there is only describing what noise I expect to have in the data, and I think this is the only way to go. Specifically, in my case, the Dirichlet distribution describes the noise in the data, and I need to find out which parameters to put into the Dirichlet to get output that "looks like" the actual data.
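As a minimal sketch of that step (the `counts` vector and its values here are made up for illustration, not taken from the actual data):

```r
# Hypothetical per-gene read totals for four genes (made-up numbers)
counts <- c(120, 0, 85, 310)

# Add 1 so no gene has an exactly-zero count, then divide by the new total
# to get observed frequencies; no reads are dropped or "cleaned"
freqs <- (counts + 1) / sum(counts + 1)

# The frequencies are positive and sum to 1 by construction
sum(freqs)  # 1 (up to floating point)
```

The add-1 step matters because a frequency of exactly zero would sit on the boundary of the Dirichlet's support.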

In R you can see how Dirichlet distributions work as follows. Imagine you have only 4 genes of interest. After

library(gtools)

you can see some examples of frequency distributions (the rows) that occur when you randomly generate from a Dirichlet distribution with parameter c(1,1,2,1) vs. when you use parameter c(100,100,200,100):

rdirichlet(20,c(1,1,2,1))

vs

rdirichlet(20,c(1,1,2,1)*100)

In the first case, the output will cluster around the vector c(1/5,1/5,2/5,1/5), but the draws will vary from that base value a fair amount, whereas in the second case they are all much closer to c(1/5,1/5,2/5,1/5).

The output of a Dirichlet distribution is always a frequency distribution; that is, it's a vector of positive numbers that add to 1. The Dirichlet distribution itself represents uncertainty about what the underlying frequencies will be.
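Both properties can be checked without gtools, using the standard Gamma construction of the Dirichlet: draw g_i ~ Gamma(alpha_i, 1) independently and divide by their sum. The helper name `rdirichlet_base` is my own, not from any package:

```r
# Sample n rows from Dirichlet(alpha) via the Gamma construction:
# g_i ~ Gamma(alpha_i, 1), then divide each row by its row sum.
# rgamma recycles the shape vector, and byrow = TRUE lines each
# alpha_i up with its own column.
rdirichlet_base <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}

set.seed(1)
x_wide  <- rdirichlet_base(1000, c(1, 1, 2, 1))        # diffuse
x_tight <- rdirichlet_base(1000, c(1, 1, 2, 1) * 100)  # concentrated

# Every row is a frequency distribution: positive, summing to 1
range(rowSums(x_wide))

# Larger parameters concentrate the draws around c(1/5, 1/5, 2/5, 1/5):
# the spread of the third component shrinks by roughly an order of magnitude
sd(x_wide[, 3])
sd(x_tight[, 3])
```

Multiplying the parameter vector by a constant leaves the mean frequencies unchanged and only tightens the distribution around them, which is why c(1,1,2,1)*100 and c(1,1,2,1) cluster around the same vector.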
