Data visualization is extremely important. I never believe any kind of “significant result” until I’ve visualized it to see that there is some convincing difference or pattern. However, today I ran into some trouble when trying to visualize some of my data with a simple barplot. It’s silly enough that I think others might make the same mistake, and so it’s important that I share. Here I was trying to show “interesting gene probes” in a single brain region, defined as probes whose expression is more than 3 standard deviations above or below the mean expression of the same probe across the entire brain:

The red lines on the top and bottom are the three standard deviation thresholds from the mean. The bars themselves represent the differences from the mean: a bar pointing up is above the mean, a bar pointing down is below it. GREEN bars mean that the probe is above the three standard deviation threshold, and BLUE bars mean that the probe is below it. ORANGE bars are a randomly selected set of 100 probes that were neither above nor below. See any HUGE problem here? Yeah! There are green and blue bars that aren’t above/below the red lines!

This is just NOT complicated. I was tearing out my hair (not really, don’t worry) and SO carefully going through my methods, but I couldn’t find anything wrong. Why was this so strange looking? Then it occurred to me: could it be that the bars of a barplot with N bars do NOT sit at x coordinates 1 through N? The answer is YES. With the defaults (bar width 1, gap 0.2), the bar midpoints land at 0.7, 1.9, 3.1, and so on, so the mismatch grows with every bar. When you add additional lines / stuffs to a barplot, you need to give them the x positions of the original barplot. Here is how I was doing it before:

<pre>
# WRONG: assumes the bars sit at x = 1, 2, ..., N
barplot(df$differences,ylim=c(min(df$differences),max(3*df$sd)),main=paste("Interesting probes for region",all_regions[r]),col=df$colors,las=2,xlab="gene probes",ylab="normalized expression")
lines(x=seq(1,nrow(df)),y=3*df$sd,col="red")
lines(x=seq(1,nrow(df)),y=-3*df$sd,col="red")
legend(50,3, c("3 standard deviations > mean","3 standard deviations < mean","random sample N=100","three standard deviations"),lty=c(1,1),lwd=c(2.5,2.5),col=c("green","blue","orange","red"))
</pre>

“lines” is how you add a line (a trendline, a threshold, whatever) on top of an existing plot. NOTICE that I was setting the x values to be a sequence from 1 to the number of data points (the rows of my data frame). That’s totally logical, right? Why would the x range be anything else? Nope! Bad idea! Not the right way to do it! Here is how it should be done:

<pre>
# RIGHT: barplot() returns the x coordinates of the bar midpoints, so keep them
bp = barplot(df$differences,ylim=c(min(df$differences),max(3*df$sd)),main=paste("Interesting probes for region",all_regions[r]),col=df$colors,las=2,xlab="gene probes",ylab="normalized expression")
lines(x=bp,y=3*df$sd,col="red")
lines(x=bp,y=-3*df$sd,col="red")
legend(50,3, c("3 standard deviations > mean","3 standard deviations < mean","random sample N=100","three standard deviations"),lty=c(1,1),lwd=c(2.5,2.5),col=c("green","blue","orange","red"))
</pre>
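
If you want to convince yourself of the mismatch, here is a tiny sketch with made-up data (the `heights` vector and the variable names are mine, not from my real analysis). It assumes the default `width` and `space` settings of `barplot`, and it draws to a throwaway device so it also runs outside of an interactive session:

```r
# Toy data (hypothetical): five bars, default width = 1 and space = 0.2
heights <- c(3, -1, 2, 4, -2)

# pdf(NULL) opens a graphics device that does not write a file
pdf(NULL)
bp <- barplot(heights)    # return value: x coordinate of each bar's midpoint
invisible(dev.off())

mids  <- as.vector(bp)        # 0.7 1.9 3.1 4.3 5.5
naive <- seq_along(heights)   # 1 2 3 4 5
drift <- mids - naive         # grows by 0.2 with every bar

print(rbind(naive, mids, drift))
```

The drift is small for five bars, but with the defaults it keeps growing by 0.2 per bar, so with a few hundred probes a line drawn at 1 through N ends well before the last bar and no longer lines up with the bars it was meant for.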

NOW notice that I am saving the result of my barplot into a variable “bp,” and setting the x values of the lines to be… that variable. The barplot function returns the x coordinates of the bar midpoints, so the lines end up on the same x axis that was created for my barplot! Here is the fixed plot: