High Dimensional Data/Params as functions

2016 May 16
by Daniel Lakeland

My wife had measurements of the expression levels of thousands of genes under two different genotypes and at two different time points of development. Even after subsetting to the genes she was interested in, she still had the problem of understanding the differences in a multi-thousand-dimensional vector. I invented a plot in which the expression profile is described as a function of each gene's rank under a reference condition. It's easier to understand if you've seen the data, so here's an example plot with the specifics of the labels removed.

The idea is that we're displaying the logarithm of normalized expression levels on the y axis, with the genes shown in a particular order along the x axis. That order is always the rank order under the reference condition +/- (heterozygous), and it is preserved for all of the plots at both time points (early and late). So when you plot the alternate condition on the same graph, you simultaneously get a view of how much expression there typically is for each gene, how much variation there is on average between genotypes, and how much a particular gene differs from the reference genotype. Finally, plotting the later time point shows you that BOTH genotypes diverge from the original expression profile, but in a noisy way: there is still an overall upward-curving trend that is somewhat similar to the original.
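To make the ordering concrete, here is a minimal sketch in plain Python. The numbers are made up for five genes, and the names `rank_order`, `reorder`, and the profile labels are mine for illustration only, not the actual analysis code. I'm assuming a single profile has been picked as the reference and that every other profile is simply re-indexed by its rank order.

```python
def rank_order(reference):
    """Return gene indices sorted by increasing value under the reference condition."""
    return sorted(range(len(reference)), key=lambda i: reference[i])


def reorder(profile, order):
    """Re-index one expression profile so gene k sits at its reference rank."""
    return [profile[i] for i in order]


# Hypothetical log normalized expression for five genes (illustrative values only).
profiles = {
    "early reference": [2.0, 0.5, 3.1, 1.2, 0.9],
    "early alternate": [2.4, 0.3, 2.2, 1.5, 1.0],
    "late reference": [2.9, 1.1, 2.5, 0.4, 1.8],
    "late alternate": [1.7, 1.6, 3.4, 0.8, 0.2],
}

# The x axis is always the rank under the one reference profile.
order = rank_order(profiles["early reference"])
curves = {name: reorder(values, order) for name, values in profiles.items()}

# Each curve is now a list of y values against x = 0, 1, ..., n_genes - 1.
# The reference curve is monotone by construction; the others wiggle around it.
```

Each entry of `curves` can be handed to whatever plotting tool you like as y values against the common rank axis.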

Now, suppose you have a high dimensional parameter vector in a Bayesian model. You could choose the rank of each parameter within the average parameter vector as the x values, and then plot that average parameter vector as the reference condition. Then, over that, you can spaghetti plot particular samples, which will show you how much variation there is and, to some extent, the correlations in that variation. For example, when parameter number 50 is higher than average, maybe parameter number 70 is often lower than average, so that many of the spaghetti curves will have an upward blip at 50 and a downward blip at 70.
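The same construction for posterior samples can be sketched as follows. This assumes the draws are available as a list of equal-length lists (one list per sample); the fake draws, the seed, and the name `rank_by_mean` are illustrative assumptions, not output from any real model or sampler.

```python
import random


def rank_by_mean(draws):
    """Return the average parameter vector and the parameter indices sorted by it."""
    n_params = len(draws[0])
    mean = [sum(d[j] for d in draws) / len(draws) for j in range(n_params)]
    order = sorted(range(n_params), key=lambda j: mean[j])
    return mean, order


# Fake posterior draws: 20 samples of a 100-dimensional parameter vector,
# each parameter fairly well constrained around its own center.
random.seed(1)
n_params, n_draws = 100, 20
centers = [random.gauss(0.0, 1.0) for _ in range(n_params)]
draws = [[c + random.gauss(0.0, 0.2) for c in centers] for _ in range(n_draws)]

mean, order = rank_by_mean(draws)

# Reference curve: the average vector in rank order (monotone by construction).
reference_curve = [mean[j] for j in order]

# Spaghetti curves: every individual draw re-indexed by the same rank order.
spaghetti = [[d[j] for j in order] for d in draws]
```

Plot `reference_curve` as one heavy line and each row of `spaghetti` as a thin, semi-transparent line over it, all against x = 0 to `n_params - 1`.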

This would be particularly useful if the parameter values were fairly well constrained. If they have a huge amount of variation, the spaghetti plot could get really, really messy, even messier than the "late" curves in this plot.

Another thing to note is that this plot will work a LOT better if you have dimensionless parameters which are scaled to be O(1), none of this "parameter 1 is in ft/s and parameter 2 is in dollars/hour and parameter 3 is in Trillions of Dollars per Country per Decade" stuff.
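As a rough illustration of that rescaling, one option is to divide each parameter by a characteristic scale before ranking and plotting. The scales below are hypothetical and would have to come from what you know about the problem; the helper name `rescale` is mine.

```python
def rescale(draws, scales):
    """Divide each parameter by its characteristic scale to make it dimensionless."""
    return [[value / scale for value, scale in zip(draw, scales)] for draw in draws]


# Two made-up draws of three parameters in wildly different units.
raw_draws = [
    [31.0, 14.5, 2.1e12],
    [29.0, 16.0, 1.8e12],
]

# Hypothetical typical sizes: 30 ft/s, 15 dollars/hour, 2 trillion dollars.
scales = [30.0, 15.0, 2.0e12]

scaled_draws = rescale(raw_draws, scales)
# Every entry is now a pure number of order 1, so all parameters share one y axis.
```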
