This analysis aimed to discover an optimal strategy for pairwise comparison of brain maps in the context of image classification. We tested classification across a range of thresholds, image comparison strategies, and similarity metrics, and this web portal shows "how we did" for any combination of those variables by way of a confusion matrix.
In functional magnetic resonance imaging (fMRI), we can put people in the scanner and have them perform different tasks that measure brain function, or behavioral paradigms that test a cognitive process of interest, to help us better understand how the human brain works. When we put a bunch of human brains in a standard space and do calculations to determine what areas are activated by the task, we generate summary brain maps that describe our result.
In a classification framework, we give ourselves points (toward accuracy) when some object, A, that we are trying to classify, is predicted as "A" by our classifier. You can imagine four possibilities: predicting A when the item is actually A (true positive), predicting A when it is B (false positive), predicting B when it is B (true negative), and predicting B when it is A (false negative). In machine learning we summarize this performance in a (typically) 2x2 table called a "confusion matrix":
         | Predicted A    | Predicted B
Actual A | True Positive  | False Negative
Actual B | False Positive | True Negative
Each cell holds the count of true positives, false negatives, false positives, or true negatives, and you can imagine that a perfect classifier will have all of its counts along the diagonal from top left to bottom right. Now imagine that we have many more classes than "A" and "B," as is the case with our brain map classification task, for which there are 47. The idea is the same, except now we have a 47 x 47 matrix.
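As a minimal sketch of the bookkeeping (not the code used in this analysis), a confusion matrix for any number of classes can be built by counting how often each actual label is assigned each predicted label; the function and variable names below are illustrative only.

```python
import numpy as np

def confusion_matrix(actual, predicted, labels):
    """Count how often each actual label (row) was predicted as each label (column)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[index[a], index[p]] += 1
    return matrix

# Toy two-class example; the diagonal cells hold correct classifications
actual    = ["A", "A", "B", "B", "A", "B"]
predicted = ["A", "B", "B", "B", "A", "A"]
cm = confusion_matrix(actual, predicted, labels=["A", "B"])
print(cm)                                    # rows = actual, columns = predicted
print("accuracy:", np.trace(cm) / cm.sum())  # diagonal counts / total
```

The same function works unchanged for a 47-class problem; only the `labels` list grows.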
Our brain maps were derived using data from the Human Connectome Project, using a permutation-based approach. Any two images could be compared by either taking the intersection of the data in the two maps (complete case analysis) or their union (single value imputation), which substitutes missing values in either map with zeros. Each unthresholded image associated with a particular behavior in a cognitive task was compared to all other images for each comparison strategy (complete case analysis and single value imputation) and similarity metric (a Pearson or Spearman score), across a range of thresholds applied to the second map. When you mouse over a cell, we show the actual (row) and predicted (column) thresholded maps. Keep in mind that these images were produced from only one of 500 permutations using a subset of the data, so a map that appears empty at a higher threshold yet still shows mis-classifications simply indicates that at least one of the other samplings did not produce an empty map. We also show the same set of images for rows and columns; however, the comparisons in the analysis were done between two separate groups, A and B.
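To make the two comparison strategies concrete, here is a hedged sketch of how a single pairwise comparison could be computed; the function name `compare_maps`, the `"cca"`/`"svi"` strategy flags, and the use of NaN to mark voxels without data are assumptions for illustration, not the actual analysis code.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def compare_maps(map_a, map_b, threshold=1.0, strategy="cca", metric="pearson"):
    """Compare two 1-D voxel vectors after thresholding the second map at +/- threshold.

    strategy: "cca" keeps only voxels with data in both maps (complete case analysis);
              "svi" keeps the union of voxels, filling missing values with zeros
              (single value imputation). NaN marks voxels with no data.
    """
    # Apply an absolute-value threshold to the second map only
    map_b = np.where(np.abs(map_b) >= threshold, map_b, np.nan)

    has_a, has_b = ~np.isnan(map_a), ~np.isnan(map_b)
    if strategy == "cca":
        mask = has_a & has_b                      # intersection of defined voxels
        a, b = map_a[mask], map_b[mask]
    else:
        mask = has_a | has_b                      # union of defined voxels
        a = np.nan_to_num(map_a[mask], nan=0.0)   # impute missing values with zero
        b = np.nan_to_num(map_b[mask], nan=0.0)

    score = pearsonr(a, b)[0] if metric == "pearson" else spearmanr(a, b)[0]
    return score
```

In a classification setting, an image from group A would be assigned the label of whichever group B image yields the highest score under a given strategy, metric, and threshold.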
We found that, for our particular dataset, using a Pearson score with complete case analysis at a threshold of +/-1.0 had the highest classification accuracy (0.984, 95% CI = 0.983, 0.985), and that accuracy decreases as the threshold increases. Complete results have been released with our manuscript.
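For readers who want to recompute an overall accuracy from any confusion matrix shown here, a minimal sketch follows; it uses a normal-approximation binomial interval, which is an assumption and may not match the interval reported in the manuscript.

```python
import numpy as np

def accuracy_with_ci(cm, z=1.96):
    """Overall accuracy from a confusion matrix, with an approximate 95% CI.
    NOTE: the normal-approximation interval is an assumption for illustration."""
    n = cm.sum()
    acc = np.trace(cm) / n
    se = np.sqrt(acc * (1 - acc) / n)   # binomial standard error
    return acc, (acc - z * se, acc + z * se)
```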
The task of image comparison is relevant to many fundamentals of neuroimaging analysis, including performing meta-analysis, clustering, and evaluating whether a result has been replicated. As the sharing of statistical brain maps becomes the norm, leading to entire databases of such maps, having automated and optimal approaches to evaluate new results, or to find similar maps, is essential. We are not claiming that there exists a single optimal strategy across all brain maps, but rather that care must be taken to test for optimal comparison strategies, and that assuming traditional thresholding approaches are best is not always the right thing to do.
There are two contrasting viewpoints held in the neuroimaging community pertaining to image thresholding. The first claims that unthresholded maps, by virtue of containing more data, are always better than thresholded maps. By demonstrating that thresholded maps produce higher classification accuracy than unthresholded ones, we challenge this view and suggest that very small values in a map may act as noise. The second viewpoint is the standard use of a thresholding strategy based on random field theory. We found that random field theory produces maps that are more heavily thresholded, corresponding to lower classification accuracy in our analysis. This finding suggests that voxels that were "thresholded away" by random field theory may in fact contain meaningful signal.