Discussion: contact prediction (RR)

by **akryshtafovych** on Wed Apr 18, 2018 12:31 pm

This document outlines metrics used in contact prediction in the past and provides feedback from the CASP13 contact prediction assessor, Andras Fiser.

========================================================
Assessment of the Residue-Residue contacts (RR) in CASP:
the experiment setting, evaluation measures, open issues and suggestions
========================================================

1. What to read
Assessment papers: at least from CASP9-CASP12 [1-4]

2. What is predicted?
Contacts in target proteins. Each contact is assigned a probability p in the range [0, 1] reflecting confidence in the assignment.

3. Definition of contacts (residue centers and distance thresholds)
(i) The definition historically used in CASP: a pair of residues is defined to be in contact when the distance between their Cβ atoms (Cα in case of glycine) is smaller than 8.0 Å.
Discussion:
Other definitions are possible, e.g. (ii) distance between any two heavy atoms from different residues <5 Å (used in CASP11/CAPRI assessment and CASP12 assessment of Assembly predictions [5,6]), or (iii) distance between two atoms < sum of atoms’ van der Waals radii + radius of water molecule (2.8 Å) – used in CAD definition [7]. There are a few papers discussing contact definition issue, including [8]. An in-house analysis shows that the three definitions on CASP targets agree in 80+ % of cases (i.e., contact between two residues according to measure x is also a contact according to measure y). Still, up to 20% of contact pairs are definition-unique.
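As an illustration of definition (i), here is a minimal Python sketch. The residue representation (a dict with a `name` and an `atoms` mapping) is a hypothetical layout chosen for the example, not a CASP data format.

```python
import math

def representative_atom(residue):
    """Return the Cβ coordinate, or Cα for glycine, as in definition (i)."""
    key = "CA" if residue["name"] == "GLY" else "CB"
    return residue["atoms"][key]

def in_contact(res_a, res_b, threshold=8.0):
    """True when the representative atoms are closer than `threshold` Å."""
    a = representative_atom(res_a)
    b = representative_atom(res_b)
    return math.dist(a, b) < threshold

# Hypothetical coordinates, in Å
ala = {"name": "ALA", "atoms": {"CA": (0.0, 0.0, 0.0), "CB": (1.5, 0.0, 0.0)}}
gly = {"name": "GLY", "atoms": {"CA": (9.0, 0.0, 0.0)}}
print(in_contact(ala, gly))  # distance is 7.5 Å
```

Definitions (ii) and (iii) would need all heavy atoms (and atomic radii) per residue, so they are not covered by this sketch.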

4. Targets
The main evaluation is carried out on the free modeling (FM) target domains, for which structural templates could not be identified even by a-posteriori structure similarity search. Some of the analyses were also performed on the extended (FM+TBM_hard) target set, which additionally included the TBM_hard domains, for which templates did exist but were relatively difficult to identify.

5. Types of contacts
Assessment concentrates on long-range contacts (interacting residues separated by at least 24 positions along the sequence), as these are the most valuable for structure prediction. Previous assessments also evaluated predictions on medium + long-range contacts (12+ residues separation). Long-range contacts were the main subject of analysis in the pre-CASP12 experiments; in the CASP12 evaluation, the main attention was paid to long+medium.
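The separation ranges can be sketched as a small classifier. The long-range (24+) and medium-range (12+) bounds come from the text above; the lower bound of 6 for short-range contacts is an assumption added for the example.

```python
def separation_class(i, j, short_min=6):
    """Classify a residue pair by sequence separation |i - j|.

    `short_min` is an assumed lower bound for short-range contacts;
    the 12 and 24 bounds are the ones quoted in the text.
    """
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= short_min:
        return "short"
    return "local"

print(separation_class(10, 40))  # separation of 30 residues
```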

6. Evaluation lists
To ensure fairness of the comparison, all participating groups are evaluated on the same number of contacts. Two different approaches are used. In the first approach, the lists of predicted contacts are truncated to the same number of contacts so that only the most reliably predicted contacts are considered in the evaluation; in the second, these lists are “padded” with zero-probabilities for pairs of residues that are not predicted as being in contact. We call the datasets used in the first approach “reduced lists” (RL), and those in the second - “full lists” (FL).
In RL evaluation, the previous assessments mainly discussed results on the L/5 long-range contact lists. In CASP12, the L/2 list was given a higher prominence and the L/5 long-range list was used for comparison with previous CASPs. The results for the two shorter lists (L/10 and Top5) are also available on the web.
Discussion:
David Baker/Rosetta and Jinbo Xu/RaptorX-Contact groups speculate that we need 1.5L ~ 2L contacts to obtain good contact-assisted ab initio structure prediction. Analysis of non-redundant SCOP structures from Y. Zhang’s group shows that the average number of short, medium, and long-range contacts of a well folded protein domain is 0.3*L, 0.4*L, and 1.2*L, respectively. Maybe these higher numbers are more reasonable to assess contact map accuracy? Or we simply can use the number of long-range contacts in the native structure? In CASP12 assessors stated that “No definite answer can be drawn regarding the best number of contacts to use for modeling.” It would be interesting to track this in the next CASP as well.
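A reduced list (RL) can be sketched as follows. The function name, the `(i, j, p)` tuple layout, and the use of integer division for L/5 are assumptions made for the example; the actual rounding rule is not stated here.

```python
def reduced_list(predictions, length, divisor=5, min_sep=24):
    """Keep the top L/divisor most confident contacts with separation >= min_sep.

    predictions: iterable of (i, j, p) tuples, p being the predicted probability.
    length: target length L. Integer division is an assumed rounding rule.
    """
    eligible = [c for c in predictions if abs(c[0] - c[1]) >= min_sep]
    eligible.sort(key=lambda c: c[2], reverse=True)
    n = max(1, length // divisor)
    return eligible[:n]

# Hypothetical predictions for a 10-residue-list example (L/5 = 2 contacts)
preds = [(1, 30, 0.9), (2, 5, 0.99), (3, 40, 0.2), (4, 50, 0.7), (5, 60, 0.5)]
print(reduced_list(preds, length=10))
```

Using the number of long-range contacts in the native structure instead of L/5 would only change how `n` is chosen.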

7. Evaluation analyses
(1) how accurate are the methods in predicting contacts with the highest reliability (RL)
(2) how accurate are all submitted contact predictions, including those predicted with lower reliability, i.e. on FL
(3) how dispersed are predicted contacts
(4) how accuracy of contact prediction depends on the depth of the alignment
(5) how accuracy of tertiary structure prediction depends on accuracy of contact prediction
Note: 4 and 5 are very important analyses in light of recent advances of approaches using fixed correlated mutations methodology.
(6) how accuracy of contact prediction depends on the length of the target
(7) auxiliary check if results from different methods are different (Jaccard-distance-based)
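For item (7), a Jaccard distance between two groups' contact lists could look like the sketch below. Treating each list as a set of residue pairs is an assumption; the exact procedure used in past assessments is not described here.

```python
def jaccard_distance(contacts_a, contacts_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of residue pairs."""
    a, b = set(contacts_a), set(contacts_b)
    union = a | b
    if not union:
        return 0.0  # two empty lists are treated as identical
    return 1.0 - len(a & b) / len(union)

group1 = [(1, 30), (2, 40), (3, 50)]
group2 = [(2, 40), (3, 50), (4, 60)]
print(jaccard_distance(group1, group2))  # 2 shared pairs out of 4 distinct
```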

8. Evaluation measures
Historically the main evaluation measure is the ratio of correctly predicted contacts in an RL list:
precision=TP/(TP+FP).
Since TP+FP=const and TP+FN=const for a selected RL list, the precision score in the RL analysis is proportional to
recall=TP/(TP+FN)
and
F1= 2*precision*recall/(precision+recall).
For the calculation of these scores, the true positives (TP) and false positives (FP) values are the numbers of correctly and incorrectly predicted contacts regardless of the associated probabilities.
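The three RL scores can be computed from sets of residue pairs as in this sketch (the function name and the pair representation are illustrative only).

```python
def rl_scores(predicted, native):
    """Precision, recall and F1 for a list of predicted contacts.

    predicted, native: collections of residue pairs; probabilities are ignored.
    """
    predicted, native = set(predicted), set(native)
    tp = len(predicted & native)
    fp = len(predicted - native)
    fn = len(native - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

predicted = [(1, 30), (2, 40), (3, 50), (4, 60)]
native = [(1, 30), (2, 40), (5, 70), (6, 80), (7, 90)]
print(rl_scores(predicted, native))
```

With the list size (TP+FP) and the native contact count (TP+FN) both fixed, precision and recall are each TP divided by a constant, which is why they are proportional in the RL analysis.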
In the FL analysis, the main estimators of binary classifiers are the Matthews correlation coefficient (or the F1-score discussed above), and the area under the precision-recall curve (AUC_PR). The AUC_PR analysis is conceptually similar to the ROC curve analysis, but is better suited for analysis of imbalanced datasets, which are the case in contact prediction – just a few positive cases (contacts) and many more negative (non-contact pairs of residues) [CASP9 or CASP10 evaluation paper]. The PR-curve analysis implicitly takes into account the predicted contact probabilities. The threshold for separating contacts from non-contacts is selected at the p=0.5 level, thus a contact is considered as correctly predicted (TP) if it was included in the prediction with a probability of 0.5 or higher (this is stated in the description of RR format). Since both the MCC and PR analyses account for the accuracy of predictors as two-class classifiers, their results are shown to be similar in previous CASPs, so using only one of them may be sufficient. The measures used in the RL analysis can also be applied, but they are not proportional any more.
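A sketch of the MCC on a full list, using the p=0.5 threshold described above. The dict-based input (every residue pair mapped to a probability, with unlisted pairs already padded with 0) is an assumed layout for the example.

```python
import math

def mcc_full_list(probabilities, native, threshold=0.5):
    """Matthews correlation coefficient over all residue pairs.

    probabilities: dict mapping every considered pair to its probability
                   (pairs absent from the submission padded with 0.0).
    native: set of pairs that are contacts in the experimental structure.
    """
    tp = fp = tn = fn = 0
    for pair, p in probabilities.items():
        predicted = p >= threshold
        actual = pair in native
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

probs = {(1, 30): 0.9, (2, 40): 0.6, (3, 50): 0.4,
         (4, 60): 0.0, (5, 70): 0.0, (6, 80): 0.0}
print(mcc_full_list(probs, native={(1, 30), (3, 50)}))
```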
In CASP12 a new Entropy score (ES) was introduced, which favors predictions with more dispersed correctly predicted contacts [1]. Assessors should be careful when using this measure, because a prediction in which all pairs are predicted as contacts gets the perfect ES score; the measure should therefore be used only in combination with other measures.
Discussion:
(1) CASP12 assessors tested probability-weighted measures [1], but refrained from using them in their final analysis as those were questioned for their scientific reliability. CASP organizers do not recommend using them in future assessments.
(2) CASP12 assessors used the formula F1 + 0.5*ES for ranking groups. As the Entropy Score is only an auxiliary measure, what is the most reasonable coefficient for it in the final score? (Some predictors argued that 0.5 is too high.)

9. Group ranking
The scores for each group are calculated on a per-target basis and subsequently averaged. The AUC_PR score was calculated on the dataset containing contacts from all targets pooled together.
The groups are ranked according to the cumulative z-scores over the evaluation measures selected by the assessor. For each measure, the z-scores were calculated in accordance with the procedure for calculating the corresponding raw scores, e.g. on a per-target basis for the precision or MCC, and on all targets together for the AUC_PR. After the initial computation, the z-scores are recalculated on the outlier-free datasets, with outliers defined as those with a score lower than the mean minus two standard deviations. For the per-target measures, these adjusted z-scores are averaged over all domains predicted by the group. Finally, before adding the z-scores from different measures, all negative z-scores are set to zero in order not to penalize too severely groups underperforming with respect to some of the scores.
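The z-score procedure for one per-target measure can be sketched as below. Several details are assumptions for the example: the population standard deviation is used, the mean and standard deviation are re-estimated without the outliers but z-scores are still reported for every group, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def adjusted_z_scores(scores):
    """z-scores for one target, recalculated on the outlier-free data.

    scores: dict mapping group name to raw score on this target.
    Outliers are scores below mean - 2 * SD of the initial distribution.
    """
    values = list(scores.values())
    mu, sd = mean(values), pstdev(values)
    kept = [v for v in values if v >= mu - 2 * sd]
    mu2, sd2 = mean(kept), pstdev(kept)
    if sd2 == 0:
        return {g: 0.0 for g in scores}
    return {g: (v - mu2) / sd2 for g, v in scores.items()}

def group_score(per_target_z):
    """Average a group's z-scores over its targets, then floor at zero."""
    return max(0.0, mean(per_target_z))

# Hypothetical scores: nine ordinary groups and one clear outlier
scores = {f"G{k}": v for k, v in enumerate([0.4, 0.5, 0.6] * 3 + [0.0])}
z = adjusted_z_scores(scores)
print(round(z["G2"], 3), round(z["G9"], 3))
```

The floored values from `group_score` for each measure would then be summed to obtain the cumulative score used for ranking.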

10. Statistical significance of difference in results
To establish the significance of the differences between the scores for best groups, we performed t-tests and “head-to-head” comparisons [3] on the per-target measures (e.g., precision, MCC) and bootstrapping tests on all measures [4]. For the bootstrapping, we randomly sampled (with replacement) the list of targets predicted by each group, and re-calculated the evaluation scores on the resampled target sets. The 95% confidence intervals were established using the two-tailed bootstrap percentile method on 1000 resampling trials. The statistical significance of the differences in group performance was inferred based on the comparison of the corresponding confidence intervals.
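The bootstrap percentile interval can be sketched with the standard library. The fixed seed and the simple index-based percentile rule are choices made for the example, not details taken from the assessment papers.

```python
import random
from statistics import mean

def bootstrap_ci(per_target_scores, trials=1000, alpha=0.05, seed=0):
    """Two-tailed percentile confidence interval for a group's mean score.

    Targets are resampled with replacement `trials` times; the interval is
    read from the sorted resampled means.
    """
    rng = random.Random(seed)
    n = len(per_target_scores)
    means = sorted(
        mean(rng.choices(per_target_scores, k=n)) for _ in range(trials)
    )
    low = means[round(alpha / 2 * trials)]
    high = means[round((1 - alpha / 2) * trials) - 1]
    return low, high

# Hypothetical per-target precision values for one group
scores = [0.2, 0.35, 0.5, 0.1, 0.8, 0.45, 0.6, 0.3]
print(bootstrap_ci(scores))
```

Two groups would then be compared by checking how their intervals relate to each other, as described above.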

11. Inter-domain and inter-chain contact predictions
Assessing inter-domain contact predictions provides an estimate of the ability of predictors to recognize proper packing of the constituent domains in multi-domain proteins and correct oligomerization interfaces [should we discuss this here or leave it for the Assembly assessor?].

1. Schaarschmidt J, Monastyrskyy B, Kryshtafovych A, Bonvin A. Assessment of contact predictions in CASP12: Co-evolution and deep learning coming of age. Proteins 2018;86 Suppl 1:51-66.
2. Monastyrskyy B, D'Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. New encouraging developments in contact prediction: Assessment of the CASP11 results. Proteins 2016;84 Suppl 1:131-144.
3. Monastyrskyy B, D'Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. Evaluation of residue-residue contact prediction in CASP10. Proteins 2014;82 Suppl 2:138-153.
4. Monastyrskyy B, Fidelis K, Moult J, Tramontano A, Kryshtafovych A. Evaluation of disorder predictions in CASP9. Proteins 2011;79 Suppl 10:107-118.
5. Lafita A, Bliven S, Kryshtafovych A, Bertoni M, Monastyrskyy B, Duarte JM, Schwede T, Capitani G. Assessment of protein assembly prediction in CASP12. Proteins 2018;86 Suppl 1:247-256.
6. Lensink MF, Velankar S, Kryshtafovych A, Huang SY, Schneidman-Duhovny D, Sali A, Segura J, Fernandez-Fuentes N, Viswanath S, Elber R, Grudinin S, Popov P, Neveu E, Lee H, Baek M, Park S, Heo L, Rie Lee G, Seok C, Qin S, Zhou HX, Ritchie DW, Maigret B, Devignes MD, Ghoorah A, Torchala M, Chaleil RA, Bates PA, Ben-Zeev E, Eisenstein M, Negi SS, Weng Z, Vreven T, Pierce BG, Borrman TM, Yu J, Ochsenbein F, Guerois R, Vangone A, Rodrigues JP, van Zundert G, Nellen M, Xue L, Karaca E, Melquiond AS, Visscher K, Kastritis PL, Bonvin AM, Xu X, Qiu L, Yan C, Li J, Ma Z, Cheng J, Zou X, Shen Y, Peterson LX, Kim HR, Roy A, Han X, Esquivel-Rodriguez J, Kihara D, Yu X, Bruce NJ, Fuller JC, Wade RC, Anishchenko I, Kundrotas PJ, Vakser IA, Imai K, Yamada K, Oda T, Nakamura T, Tomii K, Pallara C, Romero-Durana M, Jimenez-Garcia B, Moal IH, Fernandez-Recio J, Joung JY, Kim JY, Joo K, Lee J, Kozakov D, Vajda S, Mottarella S, Hall DR, Beglov D, Mamonov A, Xia B, Bohnuud T, Del Carpio CA, Ichiishi E, Marze N, Kuroda D, Roy Burman SS, Gray JJ, Chermak E, Cavallo L, Oliva R, Tovchigrechko A, Wodak SJ. Prediction of homoprotein and heteroprotein complexes by protein docking and template-based modeling: A CASP-CAPRI experiment. Proteins 2016;84 Suppl 1:323-348.
7. Olechnovic K, Kulberkyte E, Venclovas C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 2013;81(1):149-162.
8. Duarte JM, Srebniak A, Scharer MA, Capitani G. Protein interface classification by evolutionary analysis. BMC Bioinformatics 2012;13:334.

========================
Feedback from Andras Fiser
========================

I read the past CASP evaluations on contact prediction and related categories. I think there are a number of evolved ideas that I can fully agree with and we could just proceed business as usual. In fact we must do that to some extent so we can perform historical comparisons about progress.

I like that contacts are classified into different sequence separation ranges, as obviously a single long range contact 50-100 residues away can have more impact than a number of contacts within 20 residues. To this end I would argue that 24 aa cutoff is still probably too short, especially in case of all helical folds, as this will include many contacts packing two neighboring secondary structures, and really useful information comes from 1-3 and higher secondary structure packing/contact data. So, I wonder how contacts between residues separated by >50aa perform.

I also like the idea that same number of contacts are compared among groups, but we need to be specific that these contacts come from similar range categories. I like that contacts were evaluated with MCC or F1 evaluation, as is used for highly unbalanced sets. I like the idea of “entropic score” which considers the spread of contacts along the sequence.

In connection to some comments above and thinking about additional points of view: given the highly skewed distribution of protein folds, I think it is important that high capacity, deep learning neural networks or mutual information based multiple sequence analysis not simply recognize folds and their corresponding “typical” contacts but recognize contacts “genuinely”. To this end I would like to check separately how contact predictions work in 1) new folds, 2) known but rare folds, 3) frequent folds or Superfolds. Another aspect of this evaluation is the class of folds: I just want to make sure that not all success happens with all helical folds only, so I would like to check separately the performance in different fold classes. Finally we should consider protein lengths, or compare proteins with similar numbers of secondary structures, as the number of possible ways of packing a fold increases exponentially with the number of secondary structures.

An overarching interest within this topic to me is the “quality” of contacts, i.e. the minimum number of necessary and sufficient contacts, or simply, most informative contacts. To this end, an interesting conceptual question is the minimum number and location of contacts that can deliver a correct prediction, or a prediction that is as good as any other. I would imagine that as little as 5-10 contacts can do that but 5-10 correct predictions might not fare well with the existing evaluation scheme that will punish for missed contacts. Yet it might deliver as good models as any other. I will think about how to address it. I saw that in the past you performed Rosetta modeling on 5 proteins with and without contacts and assessed the impact of this extra information. We may want to extend that line of evaluation by digging into some details, asking predictors to re-run modeling exercise with different subsets of contacts etc.

by **djones** on Wed Apr 18, 2018 1:39 pm

3. Definition of contacts

Personally I think that heavy atoms < 5A apart is a good enough working definition of a residue-residue contact. In terms of interpreting residue covariation, this is the
definition I prefer to use. However, if the intent of contact prediction is to model 3-D protein structure then CB-CB distances are more useful as they allow constraints
to be sensibly applied in modelling before the structure is refined enough to even think about side chain atoms.

Also, machine learning based methods are heavily optimized towards the standard CB-CB definition. Obviously they probably could be retrained on almost any realistic
definition of a contact, but in practice the majority of these methods will have been trained according to that definition, and so using any other definition to assess them
will produce arbitrary results. Now of course, it could be argued that we don't want people to produce methods that are overfitted to CASP assessment criteria, but with machine
learning we really have no choice. If you train a neural network to distinguish cats from dogs, it's not really very informative to assess how well that same trained network
can distinguish cows from sheep, without first retraining it with the new problem definition.

5. I have to say all this stuff about trying to work out how many contacts you need to model a structure is of very little practical use. As we all know, there are just too
many variables - sequence separation, fold type, length of protein, redundancy in the contact list. One metric I find useful when comparing in house methods is to ask
what is the largest number of (long range) contacts that can be produced by a method for which the overall fraction of correct contacts is > 0.5. In other words, how far down
the list of contacts can you go before the overall set contains more false than true contacts (edit: actually the correct algorithm would start with the complete list and work
backwards until the criterion is met). The larger that set of contacts is, the more useful the contacts tend to be for 3-D modelling in my experience.

Regarding the entropy score - I've never liked the idea of adding apples to oranges with arbitrary weights. I did it when we assessed refinement, I know, but I was never
happy with it. Problem is that the crowd demands a single number for ranking, so what can you do.

One thought I have had in this area is to compute Z-scores not per target, but per-contact. Let's suppose you have one contact that is found in everyone's contact list but
another contact that appears in only very few lists. Perhaps predicting these "rarer" contacts should be given more weight. Likewise, a common false positive contact that is
uniquely avoided by a particular method could also be upweighted. A per-contact Z-score could even partially replace the entropy score, because low entropy contacts probably will be
common and so will produce Z-scores close to zero.

Problem with this is that to do it properly, it probably would require complete contact maps to be produced by each method. The current procedure of simply giving zero to any
unlisted contact may skew results too much in favour of methods that predict very few contacts. I suspect this idea needs some refinement before it could be used in anger, but it might
be useful in the future.

by **kad-ecoli** on Fri May 18, 2018 2:35 pm

Quote:
(6. Evaluation lists)
David Baker/Rosetta and Jinbo Xu/RaptorX-Contact groups speculate that we need 1.5L ~ 2L contacts to obtain good contact-assisted ab initio structure prediction. Analysis of non-redundant SCOP structures from Y. Zhang’s group shows that the average number of short, medium, and long-range contacts of a well folded protein domain is 0.3*L, 0.4*L, and 1.2*L, respectively. Maybe these higher numbers are more reasonable to assess contact map accuracy? Or we simply can use the number of long-range contacts in the native structure? In CASP12 assessors stated that “No definite answer can be drawn regarding the best number of contacts to use for modeling.” It would be interesting to track this in the next CASP as well.

The above discussion implicitly assumes that the number of Cβ-Cβ contacts in the native structure is proportional to L. While it is roughly true that a protein usually has around 0.3*L and 0.4*L short- and medium-range contacts, it is certainly not true for long-range contacts. While a well folded protein has on average 1.2*L long-range contacts, it is obviously incorrect to assume that a small protein with L=24 residues has 1.2*L=29 long-range contacts. In fact, this small protein has zero long-range contacts.

https://ndownloader.figshare.com/files/ ... review.jpg

To further illustrate this fact, 9896 non-redundant (pairwise sequence identity <40%) SCOP domain structures were collected from SCOPe 2.06. The attached figure shows the number of Cα-Cα contacts (upper panel y-axis) or Cβ-Cβ contacts (lower panel y-axis) versus sequence length (x-axis). The four columns stand for short, medium, long, and all (=short+medium+long) contacts. Each point in the figure is one SCOP domain, with red, green, and blue standing for α proteins, β proteins, and others. The black straight line in each subplot is the least-squares fit. While there is a rough linear correlation between contact number and length, the third column of the figure shows that the least-squares fit for long-range contacts has an intercept that is clearly not zero. Indeed, the best linear fit for Cβ-Cβ contacts is #long_range_contact = 1.81*L - 80.75. The correlation between sequence length and long-range contact number is particularly weak for small (L<=100) proteins. For the above reasons, the best strategy for the sake of evaluation might be simply to “use the number of long-range contacts in the native structure” as the length of the evaluation list.
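To make the difference concrete, the two estimates can be compared numerically. This is only a worked illustration using the coefficients quoted above (1.2*L and 1.81*L - 80.75); the function names are made up for the example.

```python
def long_range_proportional(length):
    """Estimate assuming long-range contacts scale as 1.2 * L."""
    return 1.2 * length

def long_range_fit(length):
    """Least-squares fit for Cβ-Cβ long-range contacts quoted above."""
    return 1.81 * length - 80.75

for length in (24, 50, 100, 200):
    print(length, long_range_proportional(length), long_range_fit(length))

# Length at which the fitted line crosses zero
print(80.75 / 1.81)
```

For L=24 the fit is negative (so effectively zero contacts), while the proportional rule still gives about 29; the fitted line only becomes positive for lengths above roughly 45 residues.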

As another issue, the use of full list evaluation, which is meant to assess “how accurate are all submitted contact predictions, including those predicted with lower reliability, i.e. on FL”, is misleading. Specifically, FL precision is a metric that strongly favors predictors that submit a particularly small list of predicted contacts. On the 37 FM domains used for CASP12 RR evaluation, the number of predicted contacts is remarkably correlated with FL precision for almost every target, with an average Pearson correlation coefficient of -0.55 over all targets, and a correlation of -0.67 if we only consider the top 20 groups for each target. Based on these data, it should be called into serious question whether FL evaluation is a valid assessment approach, as the best way to get good FL precision is to submit as few contacts per target as possible.

by **kad-ecoli** on Sat May 19, 2018 11:56 am

Quote:
I like that contacts are classified into different sequence separation ranges, as obviously a single long range contact 50-100 residues away can have more impact than a number of contacts within 20 residues. To this end I would argue that 24 aa cutoff is still probably too short, especially in case of all helical folds, as this will include many contacts packing two neighboring secondary structures, and really useful information comes from 1-3 and higher secondary structure packing/contact data. So, I wonder how contacts between residues separated by >50aa perform.

Is it possible for the assessor to elaborate a bit more on why predicting long-range contacts separated by 24-49 residues is not "really useful"? It is true that many of these contacts belong to the packing of neighboring secondary structure elements. This neither means that contacts separated by 24-49 residues are unimportant for protein folding, nor that it is computationally trivial to predict them.

Many proteins, especially the small and medium targets commonly found in CASP, have their fold mainly defined by contacts between adjacent secondary structure elements, with few contacts beyond sequence separation 50. See, as a non-comprehensive list of CASP12 targets, T0898-D2, T0943-D1, T0859-D1, T0870-D1, T0903-D1, and many others.

As a specific example, CASP12 target T0870-D1 is a helical protein http://predictioncenter.org/casp12/rrc_ ... st_size=L2. Of the 102 long-range contacts this protein has, 101 are separated by 24-49 residues. Despite the lack of contacts with >=50 sequence separation, it is still not trivial for most predictors to predict its contacts, with the top L/2 long-range precision for the best performing servers being only 24%.
