Discussion: TBM and refinement assessment

Assessment of high-accuracy modeling in CASP13

This document outlines the CASP template-based modeling and refinement metrics that have been used in the past. There are many of these, and the organizers and assessors are hoping for a lively discussion that will help guide CASP13 procedures. The last paragraph contains a few specific suggestions that have been made so far.

Traditionally, CASP has had an assessment category for template-based models (TBM), another for template-free models (FM), and another for refinement. These categories will be maintained in CASP13, although the distinction between TBM and FM is becoming less clear with the advent of effective contact prediction methods.

Models of free modeling targets are usually of low accuracy, and so direct or indirect evaluations of backbone accuracy are often sufficient. From an assessment point of view, TBM and refinement share the goal of not only providing an analysis of overall backbone agreement with the experimental structure, but also of evaluating the accuracy of structural details at the atomic level and of geometric properties. As discussed later, the assessments also include evaluation of estimated model accuracy. The range of evaluations required dictates that multiple metrics be used, and for some of the individual evaluation areas more than one metric has been deployed, resulting in a complicated landscape of metric choices and combinations.

Full definitions of the metrics that have been used in CASP are available on the Prediction Center web site. Extensive analysis of CASP models using these measures is also made available by the Prediction Center. The primary job of the assessors is to use the evaluation data to gain insight into which methods are currently most effective, which methods hold promise, what aspects of models are good and what aspects are bad, and what the biggest bottlenecks to progress are.

The following metrics have been used in TBM and/or refinement assessment:

Global Cα accuracy
Usually measured with GDT_TS [1] and the more fine-grained GDT_HA metric (the latter especially for refinement). Like all metrics based on global superposition, these may produce misleading results because of long-range structure changes. In CASP, this issue is addressed by dividing a target into evaluation domains, partly on the basis of Grishin plots [2], supplemented by visual inspection of the target structure. Global RMSD measures have not been used for TBM assessment in CASP, primarily because of sensitivity to missing residues in a model, but RMSD continues to be a commonly used measure in refinement assessment, where all models of a target have the same residue content. The TM-score [3], popular in the modeling community, has not been widely used by CASP assessors.
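
For orientation, here is a minimal Python sketch of the GDT_TS arithmetic, assuming model and target Cα coordinates are residue-matched and already superimposed. The real GDT calculation (via LGA) searches over many superpositions to maximize the fraction under each cutoff; this sketch only scores a single fixed superposition.

    import numpy as np

    def gdt_ts(model_ca, target_ca):
        # model_ca, target_ca: (N, 3) arrays of residue-matched Ca
        # coordinates, already superimposed.
        dists = np.linalg.norm(model_ca - target_ca, axis=1)
        cutoffs = (1.0, 2.0, 4.0, 8.0)   # GDT_HA uses 0.5, 1.0, 2.0, 4.0
        # Average, over the cutoffs, of the percentage of residues whose
        # model Ca lies within the cutoff of the target Ca.
        return 100.0 * np.mean([(dists <= c).mean() for c in cutoffs])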

Local atomic and residue accuracy
These measures avoid the issues of global superposition but, on the other hand, are not very intuitive. Three have been widely used in recent CASPs: (1) the Local Distance Difference Test (LDDT), introduced by assessor Torsten Schwede in CASP9 [4] and used in assessments in CASP10, 11, and 12. (2) The Contact Area Difference (CAD) score, developed by Ceslovas Venclovas and colleagues [5]. In CAD, Voronoi tessellation is used to define atom-atom interactions, and residue-residue contact area is calculated from that [6]. The CAD score was used as part of overall model accuracy evaluation in the two most recent CASP TBM assessments. (3) SphereGrinder (SG) [7], a multiple local superposition analysis developed by the Prediction Center in collaboration with Piotr Lukasiak. In SG, spheres with incrementally increasing radii, centered on each of the reference structure Cαs, are used to define the accuracy of local structure using RMSD. The SG score has been used in CASPs 10, 11, and 12.
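
For illustration, here is a much simplified, Cα-only Python sketch of the superposition-free idea behind LDDT. The published LDDT is all-atom, excludes within-residue pairs, and includes stereochemistry checks; the 0.5/1/2/4 Angstrom thresholds and 15 Angstrom inclusion radius follow the original description, but everything else here is a simplification.

    import numpy as np

    def lddt_ca(model_ca, target_ca, inclusion_radius=15.0):
        # For every Ca pair closer than inclusion_radius in the target,
        # test whether the model preserves that distance to within
        # 0.5/1/2/4 A; average the preserved fraction over the thresholds.
        d_t = np.linalg.norm(target_ca[:, None] - target_ca[None, :], axis=-1)
        d_m = np.linalg.norm(model_ca[:, None] - model_ca[None, :], axis=-1)
        i, j = np.triu_indices(len(target_ca), k=1)
        keep = d_t[i, j] < inclusion_radius
        diffs = np.abs(d_m[i, j][keep] - d_t[i, j][keep])
        return np.mean([(diffs <= t).mean() for t in (0.5, 1.0, 2.0, 4.0)])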

Accuracy limits
CASP treats X-ray structures as a gold standard against which models are compared. But there are limits to that, for four primary reasons. First, regions of structures may be distorted by crystal contacts away from the conformation present in solution. Second, some regions may be flexible, resulting either in atomic positions being poorly determined (reflected in high crystallographic temperature factors) or in an arbitrary conformation being frozen out by crystal contacts. Third, failure to inform predictors of bound ligands and crystal buffer conditions sometimes leads to apparent local 'errors'. Fourth, the global superposition may be misleading. For these reasons, it may be advisable to discount differences among high GDT_TS scores (greater than 90).

Local geometry accuracy
The MolProbity score [8] has been popular with recent assessors for providing a combined score of geometric feature accuracy, including main chain and side chain dihedral angles and hydrogen bonding. There has been argument in CASP about the importance of local geometry, and it is still unclear how often issues of that sort can be trivially refined away. CASP policy has been to insist that these factors be of high quality, to avoid any uncertainty.

Alignment accuracy
In the early days of CASP, models were usually based on a single template, and accuracy was dominated by the accuracy of mapping the target sequence onto that structure using sequence methods. Accuracy of alignment to a single template is still evaluated in CASP, although because of the use of multiple templates it is no longer clear when this measure is directly relevant to model accuracy. Measuring alignment accuracy requires a gold standard to compare against, and in CASP sequence-independent global superposition of evaluation units using LGA [1] has long been used for this purpose. Sequentially constrained template and target structure Cα atoms less than 3.8 Angstroms apart in the superimposed structures are considered aligned. Accuracy is then measured as the fraction of these distances that are also below 3.8 Angstroms in the LGA model-target superposition (A0), or the fraction for which a residue Cα less than four residues away in the target sequence is superposed by this criterion (A4).
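
A rough Python sketch of the A0/A4 arithmetic, assuming the LGA model-target superposition has already been applied and the template-aligned target residue indices have been identified. The handling of the "less than four residues away" window is one plausible reading, so treat it as illustrative.

    import numpy as np

    def alignment_accuracy(model_ca, target_ca, aligned_idx, window=0):
        # model_ca, target_ca: (N, 3) Ca coordinates in the LGA
        # model-target superposition, indexed by target residue position.
        # aligned_idx: target residue indices aligned to the template
        # (Ca pairs < 3.8 A in the template-target superposition).
        # window=0 gives A0; window=4 gives A4 (a match within four
        # residues along the target sequence also counts).
        n, hits = len(target_ca), 0
        for i in aligned_idx:
            lo, hi = max(0, i - window), min(n, i + window + 1)
            if np.linalg.norm(model_ca[i] - target_ca[lo:hi], axis=1).min() < 3.8:
                hits += 1
        return hits / len(aligned_idx)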

Accuracy of non-principal template covered residues
Residues not aligned to the principal template (often loosely referred to as 'loops') are more challenging to model. CASP measures this using the fraction of model Cαs less than 3.8 Angstroms from the corresponding target Cα in these regions. Because many non-template-covered regions are short, the measure only includes regions 15 or more residues long. Good accuracy in these regions may indicate that non-template methods have been successful, or that another template covers the region, increasingly the case as more templates become available.
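
A minimal sketch of that fraction, assuming the non-template regions (as target residue index ranges) have already been identified; the names and the exact region bookkeeping are illustrative, not the official implementation.

    import numpy as np

    def loop_region_accuracy(model_ca, target_ca, regions,
                             cutoff=3.8, min_len=15):
        # regions: list of (start, end) target residue index ranges not
        # covered by the principal template; only regions of min_len or
        # more residues are scored, per the convention described above.
        hits = total = 0
        for start, end in regions:
            if end - start < min_len:
                continue
            d = np.linalg.norm(model_ca[start:end] - target_ca[start:end],
                               axis=1)
            hits += int((d < cutoff).sum())
            total += end - start
        return hits / total if total else float("nan")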

Local accuracy improvements in refined models
An issue related to loop accuracy evaluation in template-based modeling is evaluating whether the largest errors have been corrected by refinement. At present, no formal procedure for doing this has been developed.

Estimated accuracy
Models are seldom competitive in accuracy with experimental structures. In practice, whether this matters depends on what a model is used for, and on what accuracy, global or local, is needed for that application. For example, drug design requires very high accuracy, whereas selection of epitopes for provoking an immune response may tolerate much lower accuracy. For these reasons, CASP places a high emphasis on estimated model error, overall and at the residue level, for template-based models. Future inclusion of these estimates is also important for refined models. Participants provide error estimates in the B-factor column of each model's PDB-format co-ordinate file, and this has led to some confusion about the exact requirement. The CASP official model error for an atom is the distance between that atom in the model and the corresponding atom in the evaluation-unit LGA superposition of the model and experimental structure. That usage conforms with what a structural biologist might expect an error estimate to indicate, but can be inappropriate for multi-domain structures. This issue is addressed more fully in the EMA metrics discussion document.
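
To make the B-factor-column convention concrete, here is a small sketch that reads a CASP model in PDB format and extracts the predicted per-residue error from that column, for comparison with the official error (the Cα displacement after the evaluation-unit LGA superposition). The fixed-column parsing follows the standard PDB layout, but the comparison step assumes pre-superposed coordinates.

    import numpy as np

    def read_ca_and_predicted_error(pdb_path):
        # Extract Ca coordinates and the B-factor column (columns 61-66),
        # which in CASP models carries the predicted error in Angstroms.
        coords, pred_err = [], []
        with open(pdb_path) as fh:
            for line in fh:
                if line.startswith("ATOM") and line[12:16].strip() == "CA":
                    coords.append([float(line[30:38]),
                                   float(line[38:46]),
                                   float(line[46:54])])
                    pred_err.append(float(line[60:66]))
        return np.array(coords), np.array(pred_err)

    # Official per-residue error, given target Cas already transformed
    # into the model frame by the LGA superposition:
    #   actual_err = np.linalg.norm(model_ca - target_ca_superposed, axis=1)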

Usefulness of a model as an assessment metric
As noted above, model accuracy is potentially useful for deciding whether a model is suitable for a particular application. Reversing that viewpoint, usefulness for particular applications can also be a metric of model quality. Some assessors, most notably Randy Read, the TBM assessor in CASP7 [9], have evaluated models based on their suitability for use in molecular replacement to solve the corresponding crystal structure. In the two most recent CASPs, target providers have been asked to report the biological question(s) that prompted the investment in the experimental work, with the goal of assessing the success of models in answering such questions compared to that of the experimental structures. Assessors have also examined the usefulness of models for interpreting the impact of missense mutations and understanding ligand binding properties [10,11].

Composite measures
Because no single measure captures all aspects of model accuracy, TBM and refinement assessors have often used a composite score formed from a weighted combination of several measures. These composite scores have often provided the overall ranking of the results, and thus are a sensitive issue. In CASP12, the TBM score used was:

Ranking score = 1/3 z[GDT_HA] + 1/9 (z[LDDT] + z[CADaa] + z[SG]) + 1/3 z[ASE]

That is, a weighted combination of the Z-score values for GDT_HA, LDDT, CAD, SphereGrinder, and the estimated error measure ASE. (ASE is a global accuracy measure derived from the set of residue error estimates for a model.) In calculating Z-scores, all models that scored below the average (i.e., those with negative Z-scores) were assigned Z-scores of 0, in order not to over-penalize groups attempting novel strategies. Other metric combinations have been used by previous assessors. This was the first time that estimated error had been included, a controversial change.
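
A minimal sketch of that combination, with per-metric raw scores for all models of a target passed in as arrays; the weights and the clamp-to-zero rule follow the formula above, and everything else is illustrative.

    import numpy as np

    def tbm_ranking_scores(gdt_ha, lddt, cad_aa, sg, ase):
        # Each argument: 1-D array of raw scores, one entry per model.
        def z(x):
            x = np.asarray(x, dtype=float)
            zs = (x - x.mean()) / x.std()
            return np.clip(zs, 0.0, None)   # negative Z-scores set to 0
        return (z(gdt_ha) / 3.0
                + (z(lddt) + z(cad_aa) + z(sg)) / 9.0
                + z(ase) / 3.0)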

For the CASP12 refinement assessment, the score used was:

Ranking score = 0.46 z[RMSD] + 0.17 z[GDT_HA] + 0.20 z[SG] + 0.15 z[QCS] + 0.02 z[MolPrb]

That is, a weighted combination of Z-scores for backbone RMSD, GDT_HA, SphereGrinder, the 'quality control score' (QCS), and MolProbity. QCS is a measure introduced by the Grishin group in the CASP9 template-free modeling assessment, chosen to best reproduce a subjective judgement ranking of the low-accuracy models in that category. It is the average of six individual scores that take into account the length, position, and reciprocal orientation of secondary structure elements and Cα-Cα contacts. Thus, altogether, 10 different metrics are included in the CASP12 refinement composite score. The weights were obtained using a genetic algorithm to optimize agreement with a subjective ranking of the models. Z-scores were defined in a slightly different way than in TBM, setting all values less than -2 to -2. An ad hoc procedure was also used to account for missing models.
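
And a matching sketch of the refinement combination, with Z-scores floored at -2 as described. Note one assumption not spelled out in the text: RMSD and MolProbity are sign-flipped here on the reasoning that lower raw values are better, so treat those signs as illustrative.

    import numpy as np

    def refinement_ranking_scores(rmsd, gdt_ha, sg, qcs, molprb):
        # Each argument: 1-D array of raw scores, one entry per model.
        def z(x, lower_is_better=False):
            x = np.asarray(x, dtype=float)
            zs = (x - x.mean()) / x.std()
            if lower_is_better:
                zs = -zs                    # assumption: flip so higher = better
            return np.clip(zs, -2.0, None)  # floor Z-scores at -2
        return (0.46 * z(rmsd, lower_is_better=True)
                + 0.17 * z(gdt_ha)
                + 0.20 * z(sg)
                + 0.15 * z(qcs)
                + 0.02 * z(molprb, lower_is_better=True))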

The stark differences between the ranking score formulas used in the two categories, and the dependence on subjective judgements as well as ad hoc rules, underscore the difficulties of fairly and objectively ranking models. Assessors have, of course, been aware of these difficulties, and have usually made many tests of the sensitivity of the rankings to the composite metric components and weights used.

Possible changes in CASP13
The assessor (Randy Read) would like feedback on the merits or otherwise of torsion-space metrics.
CASP refinement category participants have suggested that (a) refinement targets should be small (not more than 150 residues), or the target set should at least be biased towards small targets, and (b) as has sometimes been done in the past, clear guidance should be provided on which parts of the structure are most in need of adjustment. Small targets are expected to give a better assessment of current methodology, which is most effective in that size range. Small targets and more guidance might also help attract additional molecular dynamics experts to participate.


References
1. Zemla A, Venclovas C, Moult J, Fidelis K. Processing and analysis of CASP3 protein structure predictions. Proteins 1999;Suppl 3:22-29.
2. Kinch LN, Shi S, Cheng H, Cong Q, Pei J, Mariani V, Schwede T, Grishin NV. CASP9 target classification. Proteins 2011;79 Suppl 10:21-36.
3. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57(4):702-710.
4. Mariani V, Kiefer F, Schmidt T, Haas J, Schwede T. Assessment of template based protein structure predictions in CASP9. Proteins 2011;79 Suppl 10:37-58.
5. Olechnovic K, Kulberkyte E, Venclovas C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 2013;81(1):149-162.
6. Abagyan RA, Totrov MM. Contact area difference (CAD): a robust measure to evaluate accuracy of protein models. J Mol Biol 1997;268(3):678-685.
7. Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins 2014;82 Suppl 2:7-13.
8. Chen VB, Arendall WB, 3rd, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr 2010;66(Pt 1):12-21.
9. Read RJ, Chavali G. Assessment of CASP7 predictions in the high accuracy template-based modeling category. Proteins 2007;69 Suppl 8:27-37.
10. Huwe PJ, Xu Q, Shapovalov MV, Modi V, Andrake MD, Dunbrack RL, Jr. Biological function derived from predicted structures in CASP11. Proteins 2016;84 Suppl 1:370-391.
11. Liu T, Ish-Shalom S, Torng W, Lafita A, Bock C, Mort M, Cooper DN, Bliven S, Capitani G, Mooney SD, Altman RB. Biological and functional relevance of CASP predictions. Proteins 2018;86 Suppl 1:374-386.