Yes, using Pearson correlation alone is limited, particularly because the data are very often non-linear.

If we want to use "correlations" then Kendall's tau and/or Spearman's rho are better. These scores more accurately reflect the ability of methods to rank models correctly. You can have situations where Pearson's r is low but the ranking is still reasonably good. Conversely, you can have situations where Pearson's r is very high but the ranking is way off.
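To make those two situations concrete, here is a minimal pure-Python sketch. The scores are made up purely for illustration, the helper names (`pearson`, `spearman`, `kendall_tau`) are my own rather than any library's API, and the rank-based helpers assume there are no tied values.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's r: strength of the *linear* relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Case 1 (made-up data): monotonic but strongly non-linear.
# The ranking is perfect, yet Pearson's r is well below 1.
predicted_1 = list(range(1, 11))
observed_1 = [math.exp(p) for p in predicted_1]

# Case 2 (made-up data): five models ranked in exactly the wrong
# order plus one extreme point that dominates Pearson's r.
predicted_2 = [1, 2, 3, 4, 5, 100]
observed_2 = [5, 4, 3, 2, 1, 100]

for label, x, y in (("case 1", predicted_1, observed_1),
                    ("case 2", predicted_2, observed_2)):
    print(f"{label}: r={pearson(x, y):.3f} "
          f"rho={spearman(x, y):.3f} tau={kendall_tau(x, y):.3f}")
```

In case 1 the rank-based scores are perfect while Pearson's r is not; in case 2 Pearson's r is close to 1 while both rank-based scores are negative.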

I agree that measuring the observed quality of the top-ranked models is also very useful, but only if it is used along with an appropriate significance test, i.e. a paired Wilcoxon signed-rank test. If the sum of GDT_TS scores of the top models is used alone to select a "winning" method, then one or two incorrect models can distort the perceived difference in performance between methods, when there may actually be no significant difference.
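As a rough illustration of that point, the sketch below compares two hypothetical methods on ten targets. The GDT_TS values (0-100 scale) are invented, and the function names are my own. The two-sided p-value is obtained by enumerating every possible sign assignment of the ranks, which is exact but only practical for a small number of targets; a statistics package would normally be used instead.

```python
from itertools import product

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def wilcoxon_signed_rank(a, b):
    """Paired Wilcoxon signed-rank test.

    Returns (W+, two-sided p-value). Zero differences are dropped.
    The p-value enumerates all 2^n sign assignments, so keep n small.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2  # expected W+ under the null hypothesis
    observed = abs(w_plus - centre)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)

# Invented GDT_TS of the top-ranked model per target for two methods.
# Method B picks one badly wrong model (target 6); otherwise the two
# methods are within a few points of each other.
gdt_a = [62, 55, 71, 48, 80, 66, 59, 73, 51, 68]
gdt_b = [63, 53, 72, 50, 79, 22, 60, 70, 52, 69]

w_plus, p_value = wilcoxon_signed_rank(gdt_a, gdt_b)
print("sum A:", sum(gdt_a), "sum B:", sum(gdt_b))
print(f"W+ = {w_plus}, two-sided p = {p_value:.3f}")
```

On the summed score alone, method A looks clearly ahead, but almost all of the gap comes from a single target, and the paired test gives no evidence of a real difference.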

I've been saying this for several years - I go into more detail here:

http://www.biomedcentral.com/1471-2105/8/345