Tuesday, 8 January 2013

Paper note: Nearest Template Prediction: A Single-Sample-Based Flexible Class Prediction with Confidence Assessment

Read on 7 January 2013. Hoshida Y (2010) Nearest Template Prediction: A Single-Sample-Based Flexible Class Prediction with Confidence Assessment. PLoS ONE 5(11): e15543. doi:10.1371/journal.pone.0015543

The Nearest Template Prediction (NTP) algorithm predicts sample classes using pre-defined gene signatures and a permutation test.

Suppose we are interested in a two-class prediction problem with sample classes A and B. Further suppose we have nA signature genes (marker.A) over-expressed in class A and nB signature genes (marker.B) over-expressed in class B. The template is then defined as an ordered vector of nA+nB elements: the first nA elements correspond to the genes over-expressed in class A (marker.A), and the next nB elements correspond to the genes over-expressed in class B (marker.B). The template of class A has 1 in the positions of marker.A and -1 in those of marker.B. The template of class B is defined analogously.
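To make the template definition concrete, here is a minimal sketch in Python. The function name `two_class_templates` and the gene names are my own and hypothetical. The paper note only says the class B template is defined in the same way, so the sign-flipped template for class B is an assumption of this sketch.

```python
def two_class_templates(markers_a, markers_b):
    """Return (genes, template_a, template_b) for a two-class problem.

    genes lists the marker.A genes first, then the marker.B genes;
    both templates are aligned with that order.
    """
    genes = list(markers_a) + list(markers_b)
    # Class A: 1 at marker.A positions, -1 at marker.B positions.
    template_a = [1] * len(markers_a) + [-1] * len(markers_b)
    # Assumption: the class B template mirrors class A with signs flipped.
    template_b = [-value for value in template_a]
    return genes, template_a, template_b


# Hypothetical gene names, for illustration only.
genes, template_a, template_b = two_class_templates(
    ["G1", "G2", "G3"], ["G4", "G5"]
)
print(genes)
print(template_a)
print(template_b)
```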

Suppose we have measured the expression of N genes (N>=nA+nB) in a new sample S. First, the (raw) expression values of the marker.A and marker.B genes are extracted in the same order as in the template defined above. The similarity between S and each of the two class templates is then determined by cosine similarity (by default) or Pearson's correlation coefficient, and S is assigned to the class with the nearest template. The prediction confidence is assessed by a permutation test: nA+nB genes are randomly drawn from the N genes a large number of times (say 10000), and the distribution of the resulting distances is used to measure the statistical significance of the prediction.
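The prediction and permutation steps can be sketched as follows. This is my own illustration under stated assumptions, not the author's implementation. The names `cosine_distance` and `ntp_predict`, the toy data, the sign-flipped class B template, the use of one minus cosine similarity as the distance, the comparison of each random draw against the predicted class template only, and the +1 correction in the empirical p-value are all assumptions of the sketch.

```python
import math
import random


def cosine_distance(x, y):
    """Return 1 - cosine similarity (1.0 if either vector has zero norm)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm


def ntp_predict(sample, genes, templates, n_perm=1000, seed=0):
    """Predict the class of one sample and estimate a permutation p-value.

    sample:    dict mapping gene name -> expression value (all N genes)
    genes:     signature genes, in template order
    templates: dict mapping class label -> template aligned with genes
    """
    rng = random.Random(seed)
    observed = [sample[g] for g in genes]
    distances = {c: cosine_distance(observed, t) for c, t in templates.items()}
    predicted = min(distances, key=distances.get)  # nearest template
    d_obs = distances[predicted]

    # Null distribution: draw len(genes) values at random from all N genes.
    all_values = list(sample.values())
    hits = 0
    for _ in range(n_perm):
        drawn = rng.sample(all_values, len(genes))
        if cosine_distance(drawn, templates[predicted]) <= d_obs:
            hits += 1
    # Assumption: +1 correction so the p-value is never exactly zero.
    p_value = (hits + 1) / (n_perm + 1)
    return predicted, d_obs, p_value


# Toy data (hypothetical): 45 background genes plus 5 signature genes.
data_rng = random.Random(1)
sample = {f"BG{i}": data_rng.uniform(1, 12) for i in range(45)}
sample.update({"G1": 10.0, "G2": 9.0, "G3": 11.0, "G4": 1.0, "G5": 2.0})
genes = ["G1", "G2", "G3", "G4", "G5"]
templates = {"A": [1, 1, 1, -1, -1], "B": [-1, -1, -1, 1, 1]}

predicted, d_obs, p_value = ntp_predict(sample, genes, templates)
print(predicted, round(d_obs, 3), round(p_value, 4))
```

In the toy sample the marker.A genes are high and the marker.B genes are low, so the sample lies nearer to the class A template. The p-value depends on the random draws and on the background values.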

In multi-class prediction problems, the template is a concatenation of the genes over-expressed in each class. In the template of each class, the genes over-expressed in that class are given the value 1 and the remaining genes are given the value 0.
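A small sketch of the multi-class templates, again in Python. The function name `multi_class_templates`, the class labels and the gene names are hypothetical and chosen only for illustration.

```python
def multi_class_templates(markers):
    """Build one 0/1 template per class.

    markers: dict mapping class label -> list of genes over-expressed
             in that class. The dict order fixes the gene order.
    """
    # Concatenate the marker genes of all classes.
    genes = [g for class_genes in markers.values() for g in class_genes]
    templates = {}
    for label, class_genes in markers.items():
        members = set(class_genes)
        # 1 for genes over-expressed in this class, 0 for the rest.
        templates[label] = [1 if g in members else 0 for g in genes]
    return genes, templates


markers = {"A": ["G1", "G2"], "B": ["G3"], "C": ["G4", "G5"]}
genes, templates = multi_class_templates(markers)
for label, template in templates.items():
    print(label, template)
```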

One desirable property of the NTP algorithm is that it can take a single sample as input: no control sample is needed, and no covariate structure has to be known from a collection of samples. It was shown to have an error rate similar to that of other single-sample prediction algorithms, including support vector machines (SVM; the author did not state whether the SVM was optimized), weighted voting, CART (classification and regression tree) and k-nearest neighbour.

Compared with the Wilcoxon-Mann-Whitney (WMW) test on gene signatures, the NTP algorithm can handle over-expressed and under-expressed genes simultaneously. In addition, the null hypothesis of NTP is that the signature genes in S are no closer to the template than randomly chosen genes. This is closer to what we are interested in than the null hypothesis of the WMW test, namely that the mean of the signature genes is the same as that of the background.

Still unclear to me is how the signatures are best defined. According to the author, t-statistics, log fold changes (logFCs), signal-to-noise ratios or other measures can be used to (1) identify the signatures and (2) weight the template vector. I could not find information on which measure (or combination of measures) works best with the NTP algorithm.