This document outlines metrics used in contact prediction in the past and provides feedback from the CASP13 contact prediction assessor, Andras Fiser.

========================================================

Assessment of the Residue-Residue contacts (RR) in CASP:

the experiment setting, evaluation measures, open issues and suggestions

========================================================

1. What to read

Assessment papers: at least from CASP9-CASP12 [1-4]

2. What is predicted?

Contacts in target proteins. Each predicted contact is assigned a probability p in the range [0, 1] that reflects the confidence of the assignment.

3. Definition of contacts (residue centers and distance thresholds)

(i) The definition historically used in CASP: a pair of residues is defined to be in contact when the distance between their Cβ atoms (Cα in case of glycine) is smaller than 8.0 Å.
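
Definition (i) can be sketched in a few lines of Python. This is only an illustration: the input format (one coordinate triple per residue, already chosen as Cβ, or Cα for glycine) and the function name are assumptions, not part of any CASP tool.

```python
import math

CONTACT_CUTOFF = 8.0  # Å, threshold from definition (i)

def contact_pairs(coords, cutoff=CONTACT_CUTOFF):
    """Return the set of residue index pairs (i, j), i < j, that are in contact.

    coords: list of (x, y, z) tuples, one per residue. The caller is assumed
    to supply the Cβ atom, or Cα for glycine (hypothetical input format).
    """
    pairs = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            # Strictly smaller than the cutoff, as in the definition.
            if math.dist(coords[i], coords[j]) < cutoff:
                pairs.add((i, j))
    return pairs

# Toy example: four pseudo-residues on a line, 5 Å apart.
toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
toy_contacts = contact_pairs(toy)
```

In the toy example only neighbouring points are in contact, since points two steps apart are 10 Å from each other.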

Discussion:

Other definitions are possible, e.g. (ii) distance between any two heavy atoms from different residues < 5 Å (used in the CASP11/CAPRI assessment and the CASP12 assessment of Assembly predictions [5,6]), or (iii) distance between two atoms < sum of the atoms’ van der Waals radii + diameter of a water molecule (2.8 Å), as used in the CAD definition [7]. A few papers discuss the contact definition issue, including [8]. An in-house analysis shows that the three definitions agree on CASP targets in 80+% of cases (i.e., a contact between two residues according to measure x is also a contact according to measure y). Still, up to 20% of contact pairs are definition-unique.

4. Targets

The main evaluation is carried out on the free modeling (FM) target domains, for which structural templates could not be identified even by a-posteriori structure similarity search. Some of the analyses were also performed on the extended (FM+TBM_hard) target set, which additionally included the TBM_hard domains, for which templates did exist but were relatively difficult to identify.

5. Types of contacts

Assessment is concentrated on long-range contacts (the interacting residues are separated by at least 24 positions along the sequence), as these are the most valuable for structure prediction. Previous assessments also evaluated predictions on medium + long-range contacts (separation of 12 or more residues). Long-range contacts were the main subject of analysis in the pre-CASP12 experiments; in the CASP12 evaluation, the main attention was paid to long + medium-range contacts.
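
As a small illustration, the separation ranges above can be expressed as a classifier over residue index pairs. The function name and the label "other" for separations below 12 are assumptions made for this sketch; this section gives only the 12 and 24 boundaries.

```python
def separation_class(i, j):
    """Classify a residue pair by sequence separation |i - j|.

    Boundaries follow the text: long-range is 24 or more positions,
    medium-range is 12 to 23. Everything closer is labelled "other"
    here, because the lower ranges are not defined in this section.
    """
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    return "other"

def filter_pairs(pairs, classes=("long",)):
    """Keep only the pairs whose separation class is in `classes`."""
    return [(i, j) for (i, j) in pairs if separation_class(i, j) in classes]
```

For the medium + long-range evaluation one would call `filter_pairs(pairs, classes=("medium", "long"))`.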

6. Evaluation lists

To ensure fairness of the comparison, all participating groups are evaluated on the same number of contacts. Two different approaches are used. In the first approach, the lists of predicted contacts are truncated to the same number of contacts so that only the most reliably predicted contacts are considered in the evaluation; in the second, these lists are “padded” with zero-probabilities for pairs of residues that are not predicted as being in contact. We call the datasets used in the first approach “reduced lists” (RL), and those in the second - “full lists” (FL).

In the RL evaluation, previous assessments mainly discussed results on the L/5 long-range contact lists. In CASP12, the L/2 list was given higher prominence and the L/5 long-range list was used for comparison with previous CASPs. The results for the two shorter lists (L/10 and Top5) are also available on the web.
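
The two list types can be sketched as follows. The data layout (a dict mapping 1-based residue pairs with i < j to probabilities), the function names, and the integer rounding of L/5 are assumptions made for illustration; the sketch also does not address what happens when a group submits fewer contacts than the list length.

```python
def reduced_list(predictions, seq_len, divisor=5, min_sep=24):
    """Reduced list (RL): the top L/divisor pairs by probability.

    predictions: dict {(i, j): p} with i < j (hypothetical layout).
    Only pairs with sequence separation >= min_sep are considered.
    """
    ranked = sorted(
        ((pair, p) for pair, p in predictions.items()
         if pair[1] - pair[0] >= min_sep),
        key=lambda item: item[1],
        reverse=True,
    )
    n = max(1, seq_len // divisor)  # rounding rule is an assumption
    return ranked[:n]

def full_list(predictions, seq_len, min_sep=24):
    """Full list (FL): every pair at separation >= min_sep.

    Pairs that were not predicted are padded with probability 0.0.
    """
    return {
        (i, j): predictions.get((i, j), 0.0)
        for i in range(1, seq_len + 1)
        for j in range(i + min_sep, seq_len + 1)
    }
```

With `divisor=5` this gives the L/5 list; `divisor=2` and `divisor=10` give the L/2 and L/10 lists.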

Discussion:

The David Baker/Rosetta and Jinbo Xu/RaptorX-Contact groups speculate that 1.5L to 2L contacts are needed for good contact-assisted ab initio structure prediction. An analysis of non-redundant SCOP structures by Y. Zhang’s group shows that the average number of short, medium, and long-range contacts in a well-folded protein domain is 0.3*L, 0.4*L, and 1.2*L, respectively. Maybe these higher numbers are more reasonable for assessing contact map accuracy? Or could we simply use the number of long-range contacts in the native structure? The CASP12 assessors stated that “No definite answer can be drawn regarding the best number of contacts to use for modeling.” It would be interesting to track this in the next CASP as well.

7. Evaluation analyses

(1) how accurate are the methods in predicting contacts with the highest reliability (RL)

(2) how accurate are all submitted contact predictions, including those predicted with lower reliability, i.e. on FL

(3) how dispersed are predicted contacts

(4) how accuracy of contact prediction depends on the depth of the alignment

(5) how accuracy of tertiary structure prediction depends on accuracy of contact prediction

Note: 4 and 5 are very important analyses in light of recent advances of approaches using fixed correlated mutations methodology.

(6) how accuracy of contact prediction depends on the length of the target

(7) auxiliary check if results from different methods are different (Jaccard-distance-based)
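
For analysis (7), a Jaccard distance between the contact sets of two methods could be computed as below. This is a generic sketch of the Jaccard distance on sets of residue pairs; how the compared sets were selected in past assessments is not specified here, so the inputs are an assumption.

```python
def jaccard_distance(pairs_a, pairs_b):
    """Jaccard distance between two sets of predicted contact pairs.

    0.0 means the two sets are identical, 1.0 means they share no pair.
    Two empty sets are treated as identical (a convention chosen here).
    """
    a, b = set(pairs_a), set(pairs_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

For example, two lists that share one pair out of three distinct pairs overall have a distance of 2/3.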

8. Evaluation measures

Historically the main evaluation measure is the ratio of correctly predicted contacts in an RL list:

precision=TP/(TP+FP).

Since TP+FP=const and TP+FN=const for a selected RL list, the precision score in the RL analysis is proportional to

recall=TP/(TP+FN)

and

F1= 2*precision*recall/(precision+recall).

For the calculation of these scores, the true positives (TP) and false positives (FP) values are the numbers of correctly and incorrectly predicted contacts regardless of the associated probabilities.
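
A minimal sketch of the three RL scores is given below; the function name and the set-based inputs are assumptions. It also makes the proportionality explicit: with N_pred = TP+FP and N_nat = TP+FN both fixed, precision = TP/N_pred, recall = TP/N_nat and F1 = 2*TP/(N_pred+N_nat), so all three depend on TP alone.

```python
def rl_scores(predicted_pairs, native_pairs):
    """Precision, recall and F1 for a reduced list.

    predicted_pairs: pairs in the RL (probabilities are ignored).
    native_pairs: contacts observed in the experimental structure.
    Returns (precision, recall, f1); zero denominators yield 0.0.
    """
    predicted, native = set(predicted_pairs), set(native_pairs)
    tp = len(predicted & native)   # correctly predicted contacts
    fp = len(predicted - native)   # predicted, but not native
    fn = len(native - predicted)   # native, but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

With 4 predicted pairs of which 2 are correct, and 6 native contacts, this gives precision 0.5, recall 1/3 and F1 = 2*2/(4+6) = 0.4.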

In the FL analysis, the main estimators of binary classifier performance are the Matthews correlation coefficient, MCC (or the F1-score discussed above), and the area under the precision-recall curve (AUC_PR). The AUC_PR analysis is conceptually similar to the ROC curve analysis, but is better suited to imbalanced datasets, which is the case in contact prediction: just a few positive cases (contacts) and many more negative ones (non-contact pairs of residues) [CASP9 or CASP10 evaluation paper]. The PR-curve analysis implicitly takes into account the predicted contact probabilities. For the threshold-based measures, the threshold separating contacts from non-contacts is set at the p=0.5 level, so a contact is considered correctly predicted (TP) if it was included in the prediction with a probability of 0.5 or higher (this is stated in the description of the RR format). Since both the MCC and PR analyses assess predictors as two-class classifiers, their results were shown to be similar in previous CASPs, so using only one of them may be sufficient. The measures used in the RL analysis can also be applied, but they are no longer proportional to one another.
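
The two FL measures can be sketched as follows. The input layout (a dict of all evaluated pairs to probabilities, as in a padded full list) and the function names are assumptions. The area under the PR curve is integrated here with a simple step rule over distinct probability levels; the integration scheme used in the actual assessments is not stated in this document, so this choice is also an assumption.

```python
import math

def mcc_at_threshold(full_list, native_pairs, threshold=0.5):
    """Matthews correlation coefficient with contacts called at p >= threshold."""
    native = set(native_pairs)
    tp = fp = fn = tn = 0
    for pair, p in full_list.items():
        predicted, actual = p >= threshold, pair in native
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc_pr(full_list, native_pairs):
    """Area under the precision-recall curve (step-wise integration)."""
    native = set(native_pairs)
    n_pos = sum(1 for pair in full_list if pair in native)
    if n_pos == 0:
        return 0.0
    ranked = sorted(full_list.items(), key=lambda kv: kv[1], reverse=True)
    tp, area, prev_recall, idx = 0, 0.0, 0.0, 0
    while idx < len(ranked):
        p, end = ranked[idx][1], idx
        # Pairs with the same probability enter the curve together.
        while end < len(ranked) and ranked[end][1] == p:
            tp += ranked[end][0] in native
            end += 1
        precision, recall = tp / end, tp / n_pos
        area += precision * (recall - prev_recall)
        prev_recall, idx = recall, end
    return area
```

Because the padded pairs all carry probability 0.0, they enter the PR curve as one final tied block and do not change the area once all native contacts above them have been counted.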

In CASP12 a new Entropy score (ES) was introduced, which favors predictions with more dispersed correctly predicted contacts [1]. Assessors should be careful with this measure: a prediction that labels all pairs as contacts gets the perfect ES score, so it should be used only in combination with other measures.

Discussion:

(1) CASP12 assessors tested probability-weighted measures [1], but refrained from using them in their final analysis because their scientific reliability was questioned. CASP organizers do not recommend using them in future assessments.

(2) CASP12 assessors used the formula F1 + 0.5*ES for ranking groups. As the Entropy Score is only an auxiliary measure, what is the most reasonable coefficient for it in the final score? (Some predictors argued that 0.5 is too high.)

9. Group ranking

The scores for each group are calculated on a per-target basis and subsequently averaged. The exception is the AUC_PR score, which is calculated on the dataset containing contacts from all targets pooled together.

The groups are ranked according to the cumulative z-scores from the evaluation measures selected by the assessor. For each measure, the z-scores are calculated in accordance with the procedure for calculating the corresponding raw scores, e.g. on a per-target basis for precision or MCC, and on all targets together for AUC_PR. After the initial computation, the z-scores are recalculated on the outlier-free datasets, with outliers defined as those with a score lower than the mean minus two standard deviations. For the per-target measures, these adjusted z-scores are averaged over all domains predicted by the group. Finally, before adding the z-scores from different measures, all negative z-scores are set to zero in order not to penalize too severely groups underperforming with respect to some of the scores.
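
The ranking procedure can be sketched as below. Several details are assumptions made for this sketch: the nested-dict input layout, the use of the population standard deviation, and the point at which negative z-scores are set to zero (here, on each group's per-measure average, which is one reading of the description above). A pooled measure such as AUC_PR can be passed as a measure with a single pseudo-target.

```python
import statistics

def robust_z_scores(raw):
    """Z-scores for one measure on one target, recomputed without outliers.

    raw: {group: raw score}. Outliers are scores below mean - 2*SD; they
    are excluded when re-estimating mean and SD, but still receive a z-score.
    """
    values = list(raw.values())
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    kept = [s for s in values if s >= mean - 2 * sd]
    mean, sd = statistics.mean(kept), statistics.pstdev(kept)
    if sd == 0:
        return {group: 0.0 for group in raw}
    return {group: (s - mean) / sd for group, s in raw.items()}

def rank_groups(per_measure):
    """Rank groups by the sum over measures of their floored mean z-score.

    per_measure: {measure: {target: {group: raw score}}} (hypothetical layout).
    """
    totals = {}
    for targets in per_measure.values():
        per_group = {}
        for raw in targets.values():
            for group, z in robust_z_scores(raw).items():
                per_group.setdefault(group, []).append(z)
        for group, zs in per_group.items():
            # Negative averages are set to zero before summing over measures.
            totals[group] = totals.get(group, 0.0) + max(0.0, statistics.mean(zs))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

In this sketch a group is averaged only over the targets it predicted, since a target's raw-score dict contains only the groups that submitted for it.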

10. Statistical significance of difference in results

To establish the significance of the differences between the scores of the best groups, we performed t-tests and “head-to-head” comparisons [3] on the per-target measures (e.g., precision, MCC) and bootstrapping tests on all measures [4]. For the bootstrapping, we randomly sampled (with replacement) the list of targets predicted by each group, and re-calculated the evaluation scores on the resampled target sets. The 95% confidence intervals were established using the two-tailed bootstrap percentile method on 1000 resampling trials. The statistical significance of the differences in group performance was inferred from the comparison of the corresponding confidence intervals.
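
The bootstrap step can be sketched as follows. The function names, the `score_fn` callback and the fixed random seed are assumptions for illustration; the percentile indexing is one common way to read off a two-tailed interval and may differ in detail from what was used in past assessments.

```python
import random
import statistics

def bootstrap_ci(target_ids, score_fn, n_trials=1000, alpha=0.05, seed=0):
    """Two-tailed percentile bootstrap confidence interval over targets.

    target_ids: list of targets predicted by one group.
    score_fn: maps a resampled list of target ids to one score, e.g. the
    mean per-target precision, or a pooled AUC_PR on those targets.
    """
    rng = random.Random(seed)
    n = len(target_ids)
    # Resample targets with replacement and recompute the score each time.
    resampled = sorted(
        score_fn(rng.choices(target_ids, k=n)) for _ in range(n_trials)
    )
    lo_idx = max(0, round(alpha / 2 * n_trials))
    hi_idx = min(n_trials - 1, round((1 - alpha / 2) * n_trials) - 1)
    return resampled[lo_idx], resampled[hi_idx]

def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) confidence intervals share any point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Usage with made-up per-target precision values for one group.
precision = {"T1": 0.2, "T2": 0.6, "T3": 0.9, "T4": 0.4}
ci = bootstrap_ci(
    list(precision),
    lambda ts: statistics.mean(precision[t] for t in ts),
)
```

Two groups whose intervals do not overlap would then be treated as significantly different under the comparison described above.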

11. Inter-domain and inter-chain contact predictions

Assessing inter-domain contact predictions provides an estimate of the ability of predictors to recognize proper packing of the constituent domains in multi-domain proteins and correct oligomerization interfaces [should we discuss this here or leave it for the Assembly assessor?].

1. Schaarschmidt J, Monastyrskyy B, Kryshtafovych A, Bonvin A. Assessment of contact predictions in CASP12: Co-evolution and deep learning coming of age. Proteins 2018;86 Suppl 1:51-66.

2. Monastyrskyy B, D'Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. New encouraging developments in contact prediction: Assessment of the CASP11 results. Proteins 2016;84 Suppl 1:131-144.

3. Monastyrskyy B, D'Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. Evaluation of residue-residue contact prediction in CASP10. Proteins 2014;82 Suppl 2:138-153.

4. Monastyrskyy B, Fidelis K, Moult J, Tramontano A, Kryshtafovych A. Evaluation of disorder predictions in CASP9. Proteins 2011;79 Suppl 10:107-118.

5. Lafita A, Bliven S, Kryshtafovych A, Bertoni M, Monastyrskyy B, Duarte JM, Schwede T, Capitani G. Assessment of protein assembly prediction in CASP12. Proteins 2018;86 Suppl 1:247-256.

6. Lensink MF, Velankar S, Kryshtafovych A, Huang SY, Schneidman-Duhovny D, Sali A, Segura J, Fernandez-Fuentes N, Viswanath S, Elber R, Grudinin S, Popov P, Neveu E, Lee H, Baek M, Park S, Heo L, Rie Lee G, Seok C, Qin S, Zhou HX, Ritchie DW, Maigret B, Devignes MD, Ghoorah A, Torchala M, Chaleil RA, Bates PA, Ben-Zeev E, Eisenstein M, Negi SS, Weng Z, Vreven T, Pierce BG, Borrman TM, Yu J, Ochsenbein F, Guerois R, Vangone A, Rodrigues JP, van Zundert G, Nellen M, Xue L, Karaca E, Melquiond AS, Visscher K, Kastritis PL, Bonvin AM, Xu X, Qiu L, Yan C, Li J, Ma Z, Cheng J, Zou X, Shen Y, Peterson LX, Kim HR, Roy A, Han X, Esquivel-Rodriguez J, Kihara D, Yu X, Bruce NJ, Fuller JC, Wade RC, Anishchenko I, Kundrotas PJ, Vakser IA, Imai K, Yamada K, Oda T, Nakamura T, Tomii K, Pallara C, Romero-Durana M, Jimenez-Garcia B, Moal IH, Fernandez-Recio J, Joung JY, Kim JY, Joo K, Lee J, Kozakov D, Vajda S, Mottarella S, Hall DR, Beglov D, Mamonov A, Xia B, Bohnuud T, Del Carpio CA, Ichiishi E, Marze N, Kuroda D, Roy Burman SS, Gray JJ, Chermak E, Cavallo L, Oliva R, Tovchigrechko A, Wodak SJ. Prediction of homoprotein and heteroprotein complexes by protein docking and template-based modeling: A CASP-CAPRI experiment. Proteins 2016;84 Suppl 1:323-348.

7. Olechnovic K, Kulberkyte E, Venclovas C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 2013;81(1):149-162.

8. Duarte JM, Srebniak A, Scharer MA, Capitani G. Protein interface classification by evolutionary analysis. BMC Bioinformatics 2012;13:334.

========================

Feedback from Andras Fiser

========================

I read the past CASP evaluations on contact prediction and related categories. I think there are a number of evolved ideas that I can fully agree with and we could just proceed business as usual. In fact we must do that to some extent so we can perform historical comparisons about progress.

I like that contacts are classified into different sequence separation ranges, as obviously a single long range contact 50-100 residues away can have more impact than a number of contacts within 20 residues. To this end I would argue that 24 aa cutoff is still probably too short, especially in case of all helical folds, as this will include many contacts packing two neighboring secondary structures, and really useful information comes from 1-3 and higher secondary structure packing/contact data. So, I wonder how contacts between residues separated by >50aa perform.

I also like the idea that the same number of contacts is compared among groups, but we need to be specific that these contacts come from similar range categories. I like that contacts were evaluated with MCC or F1, as these are used for highly unbalanced sets. I like the idea of the “entropy score”, which considers the spread of contacts along the sequence.

In connection with some comments above, and thinking about additional points of view: given the highly skewed distribution of protein folds, I think it is important that high-capacity deep learning neural networks or mutual-information-based multiple sequence analyses do not simply recognize folds and their corresponding “typical” contacts, but recognize contacts “genuinely”. To this end I would like to check separately how contact predictions work in 1) new folds, 2) known but rare folds, 3) frequent folds or Superfolds. Another aspect of this evaluation is the class of folds: I just want to make sure that not all success happens with all-helical folds only, so I would like to check separately the performance in different fold classes. Finally, we should consider protein lengths, or compare proteins with a similar number of secondary structures, as the number of possible packing combinations in a fold increases exponentially with the number of secondary structures.

An overarching interest within this topic to me is the “quality” of contacts, i.e. the minimum number of necessary and sufficient contacts, or simply, the most informative contacts. To this end, an interesting conceptual question is the minimum number and location of contacts that can deliver a correct prediction, or a prediction that is as good as any other. I would imagine that as few as 5-10 contacts can do that, but 5-10 correct predictions might not fare well with the existing evaluation scheme, which will punish for missed contacts. Yet they might deliver models as good as any other. I will think about how to address it. I saw that in the past you performed Rosetta modeling on 5 proteins with and without contacts and assessed the impact of this extra information. We may want to extend that line of evaluation by digging into some details, asking predictors to re-run the modeling exercise with different subsets of contacts, etc.