Evaluation measures and procedures used in the assessments.

Historically, due to typically large differences between the model and target structures, assessors in this category relied on visual inspection aided by an array of numerical tools. These tools evaluated the correctness of modeled secondary structure, residue-residue long range contacts, and contacts between SS elements [1-4]. In addition, to identify regions of structure that were modeled correctly, the assessors used tools such as RMSD running window plots (Lesk window plots) [2], RMSD/coverage plots [5], or GDT summary plots [6, 7].

More recent CASPs saw further development of measures designed to more robustly identify correct features in a model, and – with the increasing numbers of models submitted for evaluation – to aid in assessment by limiting the number of models to consider. Three recently developed measures, LDDT, CAD, and SphereGrinder, are of particular interest. All three are independent of global model-target rigid body superpositions, and therefore only marginally influenced by intramolecular domain re-orientations or other large-scale deformations of structure. These three measures are conceptually distinct and address different aspects of model accuracy.

The Local Distance Difference Test (LDDT), developed by Torsten Schwede’s group [8], is a robust measure based on the comparison of interatomic (all-atom) distances of model and target structures. It focuses on local structure similarities. It was used in CASP e.g. by the CASP10 assessor Gaetano Montelione [9], CASP11 assessors Nick Grishin [10] and Roland Dunbrack [11, 12], as well as in the CASP11/12 assessments of model accuracy performed by Andriy Kryshtafovych [13, 14].

The Contact Area Difference (CAD) score, developed by Ceslovas Venclovas’ group, was designed to account for differences between physical contacts in model and target structures [15]. CAD scores are based on the Voronoi tessellation of protein structure and the concept of residue-residue contact area difference introduced by Abagyan and Totrov [16]. In addition to single domain applications, they allow for direct assessment of inter-domain or inter-subunit interfaces. In CASP, the CAD score was used as part of the global model accuracy assessments [13, 14].

The SphereGrinder (SG-score), developed by Krzysztof Fidelis’ group (Prediction Center) in collaboration with Piotr Lukasiak [17, 18], is based on local RMSD scores calculated for neighborhoods of each amino acid in a protein. Spheres of a specified radius, centered on the reference structure Cαs, define the comparison units between model and target structures. The measure retains the universally recognized RMSD scores while avoiding the limitations of the global rigid-body superpositions. In CASP assessments, the SG-score was used e.g. by the CASP10 assessor David Jones [19], CASP11 assessor Roland Dunbrack [11, 12], and CASP12 assessor Francesco Gervasio [20]. It was also used in the CASP11/12 assessments of model accuracy [13, 14].

In addition to the above measures and the principal measure of structure similarity used in CASP, the GDT_TS, several other measures recently used in the FM category assessments are characterized by conceptual ingenuity and diversity, and are worthy of consideration:

The QCS (Quality Control Score), based on a comparison of secondary structure elements in model and target structures and devised to mimic expert inspection.

The Handedness score, comparing global conformations by looking at the relative orientations of randomly picked atom tetrads.

The DFM (DeForMation) score, designed to measure distortion over residue tetrads in the model relative to target.

The CoDM (Correlation of Distance Matrix) score, measuring the correlation of the residue-residue distance matrices of the model and target structures.

The LGA_S score, reflecting the percentage of residues that can be superimposed under a certain distance cutoff, and using sequence alignment independent superpositions.

The TM-align score, reflecting the TM-score calculated using sequence alignment independent superpositions.

MolProbity, a stereochemistry-based model validation score.

For quick reference, all the above methods are further briefly described in the last section.

Specific measures and procedures used in CASP11

The following scores were combined to estimate overall prediction quality in CASP11 FM assessment: GDT_TS, QCS, LDDT, and MolProbity, as well as TenS and ContS.

For quick reference, the GDT_TS, QCS, LDDT, and MolProbity are briefly described in the last section. TenS is a combination of ten scores previously used in CASP, including six structure similarity and four alignment quality scores, compiled by Nick Grishin’s assessment team for the template-free modeling category in CASP9 [21]. ContS [22, 23] is a score in the general category of contact-based measures (such as LDDT).

In addition, two types of model comparisons were performed. GDT_TS scores for the submitted models were compared to those obtained for models generated randomly (generation of random models is described in the assessment paper). Similarly, LGA_S scores [24] for the submitted models were compared to those obtained for the best structural templates (the LGA_S calculations were performed relative to target structures for both models and templates).

Models were ranked using the Z-score analysis, statistical tests, and head-to-head comparisons (the current CASP implementation is described in the last section).

Reference:

Evaluation of free modeling targets in CASP11 and ROLL. Kinch LN, Li W, Monastyrskyy B, Kryshtafovych A, Grishin NV. Proteins. 2016 Sep;84 Suppl 1:51-66. doi: 10.1002/prot.24973.

Specific measures and procedures used in CASP12 (this section edited by CASP12 and 13 assessors Matteo Dal Peraro and Luciano Abriata)

Model evaluation: In CASP12 the assessors used GDT_TS, QCS, Handedness, CoDM, DFM and TM-align scores to highlight candidate top models for subsequent visual evaluation (models were actually clustered at 3 Å RMSD to reduce the number of visual evaluations). A web app was developed to easily navigate models through these 6 scores, to select top models pooled from all scores, and to compare any model to the target structure on the fly (at http://lucianoabriata.altervista.org/pa ... sters.html, only targets released in the PDB are displayed). For CASP13 the assessors might use a similar procedure but focusing primarily on GDT_TS, QCS and TM-align, which turned out to be most informative for suggesting varied candidates for visual inspection. Other scores currently provided by the Prediction Center might be incorporated into the web app, at least to test SG, CAD and LDDT.

Group ranking: The “official” ranking was performed in CASP12 using GDT_TS z-scores, but it was additionally shown that z-scores based on other metrics (including some from previous CASPs) led to the same top predictors.

Domain definition and evaluation: In CASP12 the assessors split target units into three categories, based on the likelihood of finding useful templates in the PDB when doing sequence and structure searches, and on actual server performances. The CASP12 assessment for topology prediction provided evidence that contact predictions are becoming of practical utility to model proteins with no available templates, and possibly also to detect templates of distant sequence and different continuity over sequences. Given this, CASP13 could include alignment depths as an additional metric for difficulty categorization.

Reference:

Assessment of hard target modeling in CASP12 reveals an emerging role of alignment-based contact prediction methods. Abriata LA, Tamò GE, Monastyrskyy B, Kryshtafovych A, Dal Peraro M. Proteins. 2018 Mar;86 Suppl 1:97-112. doi: 10.1002/prot.25423.

Brief description of relevant measures and ranking procedures

GDT_TS score [24, 25] (Global Distance Test – Total Score) is calculated using multiple global rigid-body superpositions. In the GDT calculations, only the well-modeled regions contribute to the score, in contrast to the RMSD, where all residues do (including the superposition outliers). In the GDT analysis, the two structures are compared using 20 different superpositions, each maximizing the number of Cα atom pairs fulfilling the deviation criterion specified by the cutoff value. A set of incrementally increasing distance cutoffs is used, ranging from 0.5 Å to 10.0 Å with a 0.5 Å increment. The GDT_TS score is an average of the selected 4 GDT values calculated at 1, 2, 4, and 8 Å and normalized by the number of residues Nt in the target structure:

GDT_TS = 100 * (GDT_1 + GDT_2 + GDT_4 + GDT_8) / (4 * Nt),

where GDT_i (i = 1, 2, 4, 8) is the largest number of Cα atoms in the model that fit to the target structure under the i Å cutoff. The GDT_TS score is in the range of [0; 100], with the higher scores corresponding to better predictions. GDT_TS scores over 50 indicate structures with significant similarity to one another, while scores below 20 indicate unrelated structures (poor models).
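The averaging step can be sketched in a few lines of Python. This is only an illustration: it assumes the superposition maximizing the number of fitted Cα atoms has already been found for each cutoff (that search is not reproduced here), and the function name and input layout are hypothetical. The result is expressed as a percentage, matching the [0; 100] range stated above.

```python
def gdt_ts(deviations_by_cutoff, n_target):
    """Average the fraction of target residues fitting under 1, 2, 4, and 8 A.

    deviations_by_cutoff maps each cutoff (in Angstroms) to the list of
    model-target C-alpha deviations measured under the superposition that
    was optimized for that cutoff (assumed to be computed elsewhere).
    """
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    total = 0
    for cutoff in cutoffs:
        # GDT_i: number of C-alpha atoms that fit under the i Angstrom cutoff
        total += sum(1 for d in deviations_by_cutoff[cutoff] if d <= cutoff)
    # Normalize by the target length and express as a percentage
    return 100.0 * total / (len(cutoffs) * n_target)


# Toy input: a 4-residue target, same deviations reused for every cutoff
devs = [0.5, 1.5, 3.0, 9.0]
score = gdt_ts({c: devs for c in (1.0, 2.0, 4.0, 8.0)}, n_target=4)
print(score)  # 56.25
```

In the toy input, 1, 2, 3, and 3 residues fit under the four cutoffs, so the score is 100 * 9 / 16.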

The GDT scores are typically plotted as a curve showing the percentage of fitted residues (x-axis) for the 20 distance cutoffs (y-axis). A quick look at such a GDT summary plot can give an idea of how good the model at hand is: a larger area over the curve indicates a more accurate prediction, with the line running all the way along the x-axis and then straight up corresponding to the ideal prediction.

Contact-based measures

LDDT [8] (Local Distance Difference Test) is a superposition-free contact-based measure using comparison of interatomic (all-atom) distances of model and target structures. Only distances below a specified cutoff in the target structure (15 Å by default) are considered. Distances in the model are labelled “preserved” if they do not deviate from the corresponding distance in the target by more than a specified threshold. Fractions of preserved distances are then calculated for each of the incrementally increasing thresholds of 0.5, 1, 2, and 4 Å, matching the thresholds in the GDT_HA. The final LDDT score is the average of the corresponding four fractions and spans the [0;1] interval. By default, residues with unfeasible stereochemistry are considered as modelled incorrectly. The score is well suited for the evaluation of local model accuracy in the presence of relative domain shifts and large conformational flexibility. In addition to single structures, the method allows using sets of structures as reference, as is the case with NMR generated ensembles.
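The following Python sketch illustrates the distance-preservation idea in a deliberately simplified form. It is not the reference implementation: it uses Cα atoms only rather than all atoms, skips the stereochemistry checks, and handles a single reference structure. The function name and the coordinate layout are assumptions made for the example.

```python
import math


def lddt_ca(model, target, inclusion_radius=15.0,
            thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified LDDT over C-alpha atoms only.

    model and target are equal-length lists of (x, y, z) tuples, with
    index i referring to the same residue in both structures.
    """
    considered = 0
    preserved = [0] * len(thresholds)
    n = len(target)
    for i in range(n):
        for j in range(i + 1, n):
            d_target = math.dist(target[i], target[j])
            # Only distances below the cutoff in the target are considered
            if d_target >= inclusion_radius:
                continue
            considered += 1
            diff = abs(d_target - math.dist(model[i], model[j]))
            for k, threshold in enumerate(thresholds):
                if diff <= threshold:
                    preserved[k] += 1
    if considered == 0:
        return 0.0
    # Average the preserved fractions over the four thresholds
    return sum(p / considered for p in preserved) / len(thresholds)


target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(10.0, 5.0, 1.0), (13.8, 5.0, 1.0), (17.6, 5.0, 1.0)]
stretched = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (9.1, 0.0, 0.0)]
print(lddt_ca(shifted, target))    # rigid translation, no superposition needed
print(lddt_ca(stretched, target))  # two of three distances are off by 1.5 A
```

Because only internal distances are compared, the translated copy scores as a perfect model, while the stretched one loses the two tightest thresholds for the affected pairs.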

CoDM [26] (Correlation of Distance Matrix) score is a weighted Pearson's correlation of the distance matrices of the target and model structures. It ranges over the [-1;1] interval.
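As an illustration only, a weighted Pearson correlation between two distance matrices can be sketched as below. The weighting scheme actually used for CoDM is described in [26] and is not reproduced here; the `weight_fn` parameter is a hypothetical placeholder that defaults to uniform weights, and Cα-only coordinates are assumed.

```python
import math


def codm_sketch(model, target, weight_fn=lambda d_target: 1.0):
    """Weighted Pearson correlation of model and target distance matrices.

    model and target are equal-length lists of (x, y, z) tuples.
    weight_fn maps a target distance to a weight (uniform by default,
    which is an assumption of this sketch, not the published scheme).
    """
    xs, ys, ws = [], [], []
    n = len(target)
    for i in range(n):
        for j in range(i + 1, n):
            d_target = math.dist(target[i], target[j])
            xs.append(math.dist(model[i], model[j]))
            ys.append(d_target)
            ws.append(weight_fn(d_target))
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) / sw
    vx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)) / sw
    vy = sum(w * (y - my) ** 2 for w, y in zip(ws, ys)) / sw
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(vx * vy)


target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.0, 0.0), (1.0, 4.0, 2.0)]
print(codm_sketch(target, target))
```

An identical model gives a correlation of 1 up to floating-point rounding; the score falls toward -1 as the model distances become anti-correlated with those of the target.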

Contact area similarity-based measure

CAD-score [15] (Contact Area Difference) estimates similarity of two structures based on the calculated differences between their residue-residue contact areas. Two variants of the CAD-score, based on the all-atom (CAD-aa) and side-chain (CAD-sc) comparisons, are computed. The contact areas are calculated using the Voronoi tessellation approach separately in model and target structures, and then the differences for the same pairs of residues are summed up and normalized to the [0;1] interval. Based on the CASP evaluation data, the CAD-aa score has a bell-shaped distribution with around 90% of scores falling in the [0.3;0.7] range; the CAD-sc score has a monotonically descending distribution with over 90% of scores falling in the [0;0.5] range. Like all measures in this section, the CAD scores can be directly applied to assessing the quality of models on multi-domain targets, without first splitting them into separate evaluation units. Compared with the GDT_TS, the scores were shown to more effectively favor physically realistic models, as demonstrated by a cross evaluation with the MolProbity score [15].

Similarity of local substructures-based measure

SG-score [17, 18] (SphereGrinder score) is an all-atom measure based on the local similarity of the Cα neighborhoods in the model and target structures. Local substructures are defined by spheres centered on the Cα atoms of the target structure. For every residue, an RMSD score is calculated for a set of atoms found inside a sphere of a selected radius. In CASP, the SG-scores are calculated using a sphere radius of 6 Å. The reported results are the percentages of residues fulfilling the similarity criterion within their sphere. In CASP, two RMSD cutoffs are used (2Å and 4Å), and scores are averaged. Furthermore, to visualize the similarity between model and target structures, the calculations may be performed for a range of radii (e.g. from 4 to 300 Å). The sphere RMSD results may then be plotted as a function of the amino acid position in sequence and the sphere radius.

Measuring structure distortion and handedness

DFM [26] (DeForMation Score) is a measure of distortion calculated on a representative set of Delaunay tetrahedrons defined on the Cα atoms of the target structure. A corresponding set of tetrahedrons, each anchored on the same residues that make up the Delaunay tetrahedrons in the target, is then defined in the model structure. Deformation of the tetrahedrons is expressed as the ratio of their volumes in model and target. The DFM score is a weighted penalty function reflecting the degree of deformation summed over all tetrahedron pairs. Lower scores correspond to better models. DFM equals zero when the volumes of the corresponding tetrahedrons are identical and reaches one when the volume of the model tetrahedron is 0 or 3 times that in the target. Details of the deformation penalty function and the specifics of implementation, including Delaunay tessellation, are provided in Tai et al. [26].

Handedness [26] score was designed to compare global characteristics of the model and target structures, especially when models considerably deviate from target (e.g. in the New Fold category predictions). By nature of the tessellation procedure, residues forming Delaunay tetrahedrons used in the DFM score tend to be physically close to one another in space. To define a global measure, a random selection of a large number (50,000 in CASP) of the tetrahedron vertex sets was proposed. The Handedness score is the fraction of pairs of such tetrahedrons, in the model and target structure, that have the same handedness. The DFM and Handedness scores were proposed by the CASP10 assessor B.K. Lee’s team [26].
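A minimal Python sketch of the Handedness idea is given below, under several assumptions: Cα coordinates only, handedness taken as the sign of the tetrahedron's signed volume, and tetrads with zero volume in either structure counted as mismatches. The function names, the fixed random seed, and the handling of degenerate tetrads are choices made for this example and are not taken from [26].

```python
import random


def signed_volume(a, b, c, d):
    """Six times the signed volume of tetrahedron a-b-c-d (triple product)."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


def handedness(model, target, n_samples=50000, seed=0):
    """Fraction of random residue tetrads with the same handedness."""
    rng = random.Random(seed)
    indices = range(len(target))
    same = 0
    for _ in range(n_samples):
        # Pick four distinct residues; use the same ordering in both structures
        i, j, k, l = rng.sample(indices, 4)
        vm = signed_volume(model[i], model[j], model[k], model[l])
        vt = signed_volume(target[i], target[j], target[k], target[l])
        if vm * vt > 0:
            same += 1
    return same / n_samples


target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
          (0.0, 0.0, 1.0), (1.0, 1.0, 2.0), (2.0, 1.0, 0.5)]
mirror = [(-x, y, z) for x, y, z in target]
print(handedness(target, target, n_samples=1000))  # 1.0
print(handedness(mirror, target, n_samples=1000))  # 0.0
```

The mirror image has identical internal distances to the target, so distance-based scores cannot distinguish it, whereas every tetrad flips its handedness.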

A score approximating expert assessment of structure topology

QCS [21] (Quality Control Score) examines the topological similarity of the model and target structures. During the CASP9 experiment (2010), the free modeling assessment team led by Nick Grishin defined a composite measure designed to correlate well with the results of expert assessment. The ensuing QCS score was based on examination of secondary structure elements (SSEs) in the model and target structures, comparing the SSE lengths, distances between the SSE centers and the center of the protein, angles between the SSE-defined vectors, and distances between the Cα atoms in key contacts that reflect the relative packing of the SSEs. The score was shown to produce results close to those of the expert assessment [21].

Sequence alignment independent measures

LGA_S [24] (Local-Global Alignment - Score) is a structure similarity score based on a sequence-independent superposition of the model and target structures calculated with the LGA program. One important feature that distinguishes the LGA algorithm from other sequence-independent structure superposition methods is that the reported list of equivalent residues used to calculate the final superposition is rigorously defined: it consists only of the residues that deviate from one another by no more than a selected distance cutoff (5 Å by default). The LGA_S scoring function is defined with the following formula:

LGA_S = w*S(GDT) + (1-w)*S(LCS),

(See [24] for the description of the weighting factor w and the S-function used to calculate the score). The combined score reflects the percentage of residues that can be superimposed under a certain distance cutoff and is in the range of [0; 100], with higher scores corresponding to structural alignments with larger number of residues and longer aligned fragments. LGA_S scores are close to the GDT_TS scores for targets where alignment errors are insignificant.

TM-align [27] is a sequence-independent structure superposition algorithm based on the TM-score (used as the cost function) [28]. The initial alignment is obtained with heuristics based on secondary structure assignments, TM-score guided threading, and dynamic programming. In an iterative process, the method then uses a series of sequence-dependent superpositions, each based on the alignment obtained in the preceding step and followed by dynamic programming to optimize the alignment. Alignment optimization uses a TM-score derived similarity matrix [27]. TM-score of the final alignment is a measure of structure similarity returned by this algorithm.

Geometry-based model validation score

MolProbity [29] is an all-atom structure validation package reporting the agreement of a model with physical constraints derived from known crystal structures of proteins. It defines the geometric clash score (Clash_score), rotamer outlier score (Rot_out), Ramachandran outlier score (Ram_out), and Ramachandran favored score (Ram_fv). All MolProbity scores are calculated from the coordinates of the predicted model. The cumulative MolProbity score (MPscore) combines three of the above statistics, reporting one number that reflects an approximate crystallographic resolution at which those values would be expected [29]:

MPscore = 0.426 * ln(1 + Clash_score) + 0.33 * ln(1 + max(0, Rot_out - 1)) + 0.25 * ln(1 + max(0, (100 - Ram_fv) - 2)) + 0.5

The coefficients in the above formula were derived from a log-linear fit to the values of crystallographic resolution on a filtered set of PDB structures. The lower MPscores correspond to stereochemically better structures. The cumulative scores below 3 usually indicate models of acceptable geometry [29].
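The MPscore formula above translates directly into code. The sketch below assumes that Rot_out and Ram_fv are given as percentages and that Clash_score is the value reported by MolProbity; the function name is hypothetical, and the component scores themselves must come from a MolProbity run.

```python
import math


def mpscore(clash_score, rot_out, ram_fv):
    """Cumulative MolProbity score from three component statistics.

    clash_score: MolProbity clash score
    rot_out:     rotamer outliers, in percent (assumed)
    ram_fv:      Ramachandran favored residues, in percent (assumed)
    """
    return (0.426 * math.log(1 + clash_score)
            + 0.33 * math.log(1 + max(0.0, rot_out - 1))
            + 0.25 * math.log(1 + max(0.0, (100 - ram_fv) - 2))
            + 0.5)


# No clashes, no rotamer outliers, all residues in favored regions:
# every logarithm vanishes and only the constant term remains.
print(mpscore(0.0, 0.0, 100.0))  # 0.5

# Made-up values for a model with poorer geometry; the score is higher (worse).
print(mpscore(50.0, 10.0, 85.0))
```

The max(0, ...) terms mean that up to 1% rotamer outliers and up to 2% of residues outside favored Ramachandran regions carry no penalty.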

Model ranking procedures

Z-scores. Prediction methods can be ranked by combining the individual scores for the submitted models over all targets. Since different targets may have different difficulty, the same difference in score may carry different weight for different targets and direct combining of raw scores can be misleading. The Z-score approach takes the predictive difficulty of a target into account, as the normalized score reflects relative accuracy of the model with respect to the results of other predictors. The use of Z-scores instead of raw scores proved to be very effective in analyzing relative model accuracy, although the results should be taken with a grain of salt for targets where only a few good models are generated, as Z-scores can overestimate model accuracy in these cases. In addition to combining scores for different targets, the Z-score approach allows combining scores across different evaluation measures as Z-scores are dimensionless and, regardless of the evaluation measure, express raw scores in terms of their deviation from the population mean. This feature of Z-scores also allows combining scores with different weights.

A typical CASP ranking procedure is as follows:

1. Z-scores are calculated for every measure and for the selected subsets of groups (e.g., all groups or servers), models (e.g., models designated as number one or all models), and targets (e.g., ‘human’ targets or all targets).

2. Outliers are identified. The outliers are defined as models with raw scores lower than the population mean μ minus N standard deviations σ, where parameter N is usually set to 0 or 2. This parameter defines the largest assigned error in terms of standard deviation. Z-scores for the outliers are set to -N. Setting the Z-score to -2 puts the outlier model at the level of the model scoring μ-2σ; setting the Z-score to 0 bumps up the outlier model to the level of an average model.

3. Z-scores are recalculated for the outlier-free datasets.

4. Z-scores for the second-round outliers are set to -N, where N is the parameter from step 2.

5. If Z-scores are calculated on a per-target basis (as opposed to all targets pooled together), they are averaged over the domains predicted by the group.

6. If a participant did not predict some targets, they are penalized by setting the Z-score to -N on the missed targets (N is the parameter from step 2). The penalty treats missing predictions as outlier models.

7. If Z-scores are calculated on a per-target basis, they are summed over all domains evaluated in a particular evaluation category. Penalties for missing targets are applied.

8. Z-scores are combined for different measures with weights defined by the assessors.
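The per-target part of the procedure above (outlier filtering, recalculation, and the penalty for missed targets, followed by summation) can be sketched as follows. This is a simplified illustration for a single evaluation measure where higher raw scores are better. The function names and data layout are hypothetical, and the use of the population standard deviation is an assumption of the sketch.

```python
import statistics


def filtered_z_scores(raw_scores, n=2.0):
    """Two-pass Z-scores for one target; raw_scores maps group -> raw score."""
    values = list(raw_scores.values())
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    # First pass: flag models scoring below mu - n * sigma as outliers
    outliers = {g for g, s in raw_scores.items() if s < mu - n * sigma}
    kept = [s for g, s in raw_scores.items() if g not in outliers]
    # Second pass: recompute the statistics on the outlier-free set
    mu2 = statistics.mean(kept)
    sigma2 = statistics.pstdev(kept)
    z = {}
    for group, score in raw_scores.items():
        if group in outliers or sigma2 == 0.0:
            z[group] = -n if group in outliers else 0.0
        else:
            # Second-round outliers are also floored at -n
            z[group] = max((score - mu2) / sigma2, -n)
    return z


def rank_groups(per_target_raw, n=2.0):
    """Sum Z-scores over targets; a missed target is penalized with -n."""
    groups = {g for scores in per_target_raw.values() for g in scores}
    totals = {g: 0.0 for g in groups}
    for scores in per_target_raw.values():
        z = filtered_z_scores(scores, n)
        for g in groups:
            totals[g] += z.get(g, -n)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


data = {
    "T1": {"A": 80, "B": 70, "C": 75, "D": 72, "E": 78, "F": 5},
    "T2": {"A": 60, "B": 65, "C": 62, "D": 58, "E": 61},  # F did not predict
}
for group, total in rank_groups(data):
    print(group, round(total, 2))
```

In this made-up data, group F is a first-round outlier on T1 and missed T2, so it receives -2 on both targets. Combining several measures with assessor-defined weights would be a weighted sum of such totals.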

Note that the scoring scheme described above involves several subjective choices that have to be made by assessors. First, it is necessary to decide on whether to use scores from all models for calculation of the Z-scores (i.e., skip step 2) or to filter the model dataset by removing outliers. Since the introduction of Z-scores in CASP, there has been a strong preference among the assessors to pre-filter the data as the mean and standard deviation of the model population can be significantly distorted by some extremely bad models. This can happen not only due to methodological reasons, but also simply because of technical bugs in servers or unintentional human errors. Therefore, pre-filtering can help eliminate the effect of these unrealistic models on the scoring system.

Next, a balanced scoring scheme should not over-penalize methods for very bad models, since a hefty penalty can overshadow advantages of the otherwise good method. For example, a method that generates models with one standard deviation above the mean (i.e., Z-score=1) for four out of five targets is definitely worth attention; but if this method misses the 5th target badly with the Z-score of -4 on that target, then its performance will be ranked as just average (the average Z-score on five targets equals 0), and the method most likely will be overlooked, if judged on the basis of Z-scores only. One way to avoid this potential problem is to adjust the Z-scores that are below some threshold upwards, to the value of that threshold. In previous CASPs two such thresholds were used: 0 and -2. In the first case, incorrect models are assigned the average score for the target, while in the second they are assigned the score of the 2σ outlier.
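The arithmetic of this example can be checked with a short Python snippet. The helper name is hypothetical; the three calls correspond to no adjustment and to the two thresholds mentioned above.

```python
def mean_z(z_scores, floor=None):
    """Average Z-score, optionally raising values below `floor` to `floor`."""
    if floor is not None:
        z_scores = [max(z, floor) for z in z_scores]
    return sum(z_scores) / len(z_scores)


z = [1.0, 1.0, 1.0, 1.0, -4.0]   # four good targets, one badly missed
print(mean_z(z))                 # 0.0  (looks like an average method)
print(mean_z(z, floor=-2.0))     # 0.4
print(mean_z(z, floor=0.0))      # 0.8
```

With either threshold the method's average rises above zero, so a single failure no longer cancels four above-average results.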

Another choice involves whether to use the average of Z-scores (step 5) or their sum (steps 6+7) for ranking. This choice is irrelevant if all groups predict the same set of targets. But when this is not the case, the ranking may be affected: the summing of the scores gives advantage to groups that predicted more targets.

Finally, rankings depend on a combination of scores and weights selected by the assessors (step 8).

Statistical tests - comparison of group performance. Z-score calculations are supplemented by tests establishing the statistical significance of the differences in group performance. For measures that are calculated on a per-target basis, paired t-tests and “head-to-head” comparisons are performed. These tests are run on the common set of predicted targets. As the t-test assumes normal distributions, one should verify that this is the case. If not, a non-parametric test such as the Wilcoxon signed rank test should be used. The statistical significance of the differences in group performance on the measures that are calculated on the data pooled together for all targets is inferred from bootstrapping tests and subsequent comparison of the corresponding confidence intervals. The statistical significance of the differences in ROC analysis or PR-curve analysis is assessed with the DeLong tests.

REFERENCES

1. Defay, T. and F.E. Cohen, Evaluation of current techniques for ab initio protein structure prediction. Proteins, 1995. 23(3): p. 431-45.

2. Lesk, A.M., CASP2: report on ab initio predictions. Proteins, 1997. Suppl 1: p. 151-66.

3. Zemla, A., et al., Numerical criteria for the evaluation of ab initio predictions of protein structure. Proteins, 1997. Suppl 1: p. 140-50.

4. Zemla, A., et al., A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins, 1999. 34(2): p. 220-3.

5. Orengo, C.A., et al., Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction. Proteins, 1999. Suppl 3: p. 149-70.

6. Aloy, P., et al., Predictions without templates: new folds, secondary structure, and contacts in CASP5. Proteins, 2003. 53 Suppl 6: p. 436-56.

7. Vincent, J.J., et al., Assessment of CASP6 predictions for new and nearly new fold targets. Proteins, 2005. 61 Suppl 7: p. 67-83.

8. Mariani, V., et al., lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 2013. 29(21): p. 2722-8.

9. Huang, Y.J., et al., Assessment of template-based protein structure predictions in CASP10. Proteins, 2014. 82 Suppl 2: p. 43-56.

10. Kinch, L.N., et al., Evaluation of free modeling targets in CASP11 and ROLL. Proteins, 2016. 84 Suppl 1: p. 51-66.

11. Modi, V., et al., Assessment of template-based modeling of protein structure in CASP11. Proteins, 2016. 84 Suppl 1: p. 200-20.

12. Modi, V. and R.L. Dunbrack, Jr., Assessment of refinement of template-based models in CASP11. Proteins, 2016. 84 Suppl 1: p. 260-81.

13. Kryshtafovych, A., et al., Methods of model accuracy estimation can help selecting the best models from decoy sets: Assessment of model accuracy estimations in CASP11. Proteins, 2016. 84 Suppl 1: p. 349-69.

14. Kryshtafovych, A., et al., Assessment of model accuracy estimations in CASP12. Proteins, 2018. 86 Suppl 1: p. 345-360.

15. Olechnovic, K., E. Kulberkyte, and C. Venclovas, CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins, 2013. 81(1): p. 149-62.

16. Abagyan, R.A. and M.M. Totrov, Contact area difference (CAD): a robust measure to evaluate accuracy of protein models. J Mol Biol, 1997. 268(3): p. 678-85.

17. Kryshtafovych, A., B. Monastyrskyy, and K. Fidelis, CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins, 2014. 82 Suppl 2: p. 7-13.

18. Lukasiak, P., M. Antczak, T. Ratajczak, and J. Blazewicz, SphereGrinder - reference structure-based tool for quality assessment of protein structural models. 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA, 9-12 Nov. 2015.

19. Nugent, T., D. Cozzetto, and D.T. Jones, Evaluation of predictions in the CASP10 model refinement category. Proteins, 2014. 82 Suppl 2: p. 98-111.

20. Hovan, L., et al., Assessment of the model refinement category in CASP12. Proteins, 2018. 86 Suppl 1: p. 152-167.

21. Kinch, L., et al., CASP9 assessment of free modeling target predictions. Proteins, 2011. 79 Suppl 10: p. 59-73.

22. Kinch, L.N., et al., CASP5 assessment of fold recognition target predictions. Proteins, 2003. 53 Suppl 6: p. 395-409.

23. Shi, S., et al., Analysis of CASP8 targets, predictions and assessment methods. Database (Oxford), 2009. 2009: p. bap003.

24. Zemla, A., LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res, 2003. 31(13): p. 3370-4.

25. Zemla, A., et al., Processing and evaluation of predictions in CASP4. Proteins, 2001. Suppl 5: p. 13-21.

26. Tai, C.H., et al., Assessment of template-free modeling in CASP10 and ROLL. Proteins, 2014. 82 Suppl 2: p. 57-83.

27. Zhang, Y. and J. Skolnick, TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res, 2005. 33(7): p. 2302-9.

28. Zhang, Y. and J. Skolnick, Scoring function for automated assessment of protein structure template quality. Proteins, 2004. 57(4): p. 702-10.

29. Keedy, D.A., et al., The other 90% of the protein: assessment beyond the Calphas for CASP8 template-based and high-accuracy models. Proteins, 2009. 77 Suppl 9: p. 29-49.