Discussion: Estimates of model accuracy

We are opening a series of discussions on evaluation approaches for the upcoming CASP13 experiment. The first prediction category to discuss is EMA (estimates of model accuracy). Below we suggest three focus issues for discussion: what to predict in global assessment; what to predict in local assessment; and what the optimal evaluation functions are. We also provide a document with a more detailed description of the evaluation process and some of the open issues. The focus issues and the evaluation process description were compiled by the CASP organizers (former EMA assessors) with feedback from the future EMA assessor (Chaok Seok).
----------------
EMA prediction/evaluation focus issues
----------------
F1. What function to predict as a measure of global model fit?
Feedback from the future assessor: Chaok Seok thinks that having one main target function, e.g. GDT_TS, would be desirable.
One of the issues with GDT_TS is its non-optimal treatment of multi-domain targets (the superposition is usually dominated by one of the domains). While local-based measures are free of this problem, they are not very intuitive, especially for non-specialists. A possible solution at the evaluation stage is to consider single-domain and multi-domain targets separately. For a uniform approach to the evaluation of both single-domain and multi-domain targets, a preliminary (i.e., before prediction) split of multi-domain targets into domains may be needed. If we do this, what is the best way to do it? One option is to predict domain definitions using sequence-based methods (e.g., Ginzu/Robetta, ThreaDom, DomPred). Note that the predicted domain boundaries may be imperfect; the plan is to use them for the EMA assessment only if they prove reasonable upon structural analysis at the later stages of the evaluation. We can update the EMA format to allow providing global EMA scores for the whole target and for the suggested domains. Note that these domain definitions may differ from those that will be used for the TS evaluation, which will be based on structural analysis of the target. However, having analyzed data from previous CASP experiments, the future EMA assessor thinks that current domain parsing methods are not accurate enough, and that it would be better to keep the same format of EMA prediction as in CASP12.
Global domain-based evaluation is possible without a preliminary split into domains for local accuracy prediction methods (i.e., those providing per-residue accuracy scores; unfortunately, only 1/3 of CASP12 methods provided local estimates). The overall domain score can be calculated by averaging local scores over the residues of the specified domain only. Note that the CASP12 evaluation paper reported that the difference between domain-based and whole-structure-based assessment was marginal with respect to ranking.
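To make the averaging above concrete, here is a minimal sketch in Python; the dictionary layout, residue-range domain definitions and the function name are assumptions made for illustration, not part of any CASP format.

# Sketch: derive per-domain global EMA scores by averaging per-residue local
# scores over the residues of each domain only.
# local_scores maps residue number -> local accuracy estimate in the 0-1 range;
# domains maps a domain name -> list of (start, end) residue ranges.
def domain_global_scores(local_scores, domains):
    scores = {}
    for name, ranges in domains.items():
        residues = [r for start, end in ranges for r in range(start, end + 1)]
        covered = [local_scores[r] for r in residues if r in local_scores]
        scores[name] = sum(covered) / len(covered) if covered else None
    return scores

# Hypothetical two-domain example:
local = {i: 0.9 for i in range(1, 101)}
local.update({i: 0.4 for i in range(101, 181)})
print(domain_global_scores(local, {"D1": [(1, 100)], "D2": [(101, 180)]}))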
F2. What function to predict as a measure of local model fit?
As has been the practice until now, the quantity to predict is the distance in Ångströms between the corresponding atoms in a rigid-body superposition (e.g., LGA- or TM-based). This distance can be measured from whole-structure superpositions or from separate per-domain superpositions (the latter requires a preliminary split – see above).
In addition, other per-residue reliability scores could be considered, for example a score scaled to the 0-1 range that reports “1” for a residue predicted to be reliable. Reliability can be defined as the local accuracy of the specific residue or as the local accuracy of the residue’s neighborhood. What are appropriate local scores for measuring reliability in each case? How should the prediction format be adjusted?
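One possible convention for such a 0-1 score, sketched below, would be to map a predicted per-residue distance error through the S-function of reference [7]; the d0 value is an assumption made for illustration, not an agreed CASP setting.

# Sketch: convert a predicted per-residue distance error (in Angstroms) into a
# 0-1 reliability score using the S-function S(d) = 1 / (1 + (d/d0)^2).
# d0 = 5 A is an assumed parameter.
def s_function(distance_error, d0=5.0):
    return 1.0 / (1.0 + (distance_error / d0) ** 2)

print(s_function(0.0))   # 1.0 - residue predicted to be fully reliable
print(s_function(5.0))   # 0.5
print(s_function(15.0))  # 0.1 - residue predicted to be unreliable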
F3. What is the optimal global and local evaluation formula?
For global evaluation, the future assessor leans toward one target function, which can be GDT_TS. Additionally, a combination of GDT_TS with local measures (CAD, LDDT, SphereGrinder) can be considered for scoring. Other options?
For local evaluation, the averaged distance-based S-function (the ASE measure) is considered one of the most viable possibilities. Per-residue scores from local measures (CAD, LDDT, SphereGrinder) can also be incorporated, but they are hard to combine.
==============
CASP organizers’ description of the evaluation process,
future assessors’ views,
and detailed subjects for the discussion with predictors
==============
1. Types of methods
In CASP, we categorize the methods into three broad categories: single-model methods (which need no information other than the structure of the model itself), quasi-single methods (which generate scores for a single model but use structural information from related structures (templates or other in-house generated models) in the background), and clustering (or consensus-based) methods, which need many models (on the order of tens) to operate effectively. Since not all EMA methods are created equal with respect to their input, it is important to establish the abilities of methods across the categories (what EMA methods are capable of in general), within the categories (what specific types of methods can achieve), and to compare the performance of methods across different categories.
2. Targets and the two-stage evaluation procedure
As shown in the CASP9 EMA evaluation paper [1] (Figure 6), evaluation scores depend on the datasets: it was easier to estimate relative model accuracy in larger and more diverse model sets. This is true not only for clustering methods but also for single-model methods. To adjust for that, in CASP10 we started filtering models and offered two conceptually different sets of models for accuracy estimation. One set (called ‘stage1’ or ‘select20’) is small (20 models) and very diverse, while the other (‘stage2’ or ‘best150’) is larger (150 models) and excludes the tentatively worst models according to the in-house consensus accuracy predictor Davis-EMAconsensus [2]. Models are released in a two-stage procedure in order not to compromise prediction results.
Open issue:
(O2.1) Do we need stage 1?
Stage1 is a small dataset (20 models per target) containing models that, by design, span the whole range of accuracy. It was introduced to show the limits of different types of methods, and as such it has arguably exhausted its purpose. In the latest CASP (CASP12) it was used only to verify that the methods claimed to be single-model indeed produced the same results in stages 1 and 2 (since pure single-model methods are based exclusively on the coordinates of the assessed models, the accuracy estimates they produce are expected to be the same every time a method is applied to the same model). Organizers and assessors suggest keeping it only for that purpose.
3. Global accuracy evaluation: target function
In the global assessment mode (QAglob), each model has to be assigned a score between 0 and 1 reflecting the accuracy of the model (the higher the score, the better).
Since global model accuracy estimates are submitted for whole models (and not domains), evaluation of the results is also carried out at the whole-model level (unlike tertiary structure prediction, which is evaluated at the level of domains).
More than a dozen measures are used in CASP to evaluate the structural similarity of a model to the target, and each of these measures can be considered as a target function for model accuracy assessment. From CASP7 through CASP10, the measure of choice was GDT_TS. In CASP11 and CASP12, three non-rigid-body-based measures (LDDT [3], CAD [4] and SphereGrinder [5]) were added to the evaluation tool chest. This way, prediction results were assessed from different perspectives, recognizing the ability of EMA methods not only to properly estimate the accuracy of the backbone, but also to identify models with better local geometry or local structural context.
Open issues:
(O3.1) What is the optimal target function to predict?
Up to CASP11, predictors were asked to reproduce the GDT_TS score of the assessed model. Now that additional measures have been added to the evaluation package, the question to the assessor is how to balance (combine) evaluation using the different measures. It would be good to tell predictors what the target function to predict is (what the components of the final score are) and how we would evaluate the accuracy of the prediction with this target function. This can help predictors optimize their global score, as there seems to be no one-size-fits-all solution at present. Some predictors claim that GDT_TS is a bad target measure as it depends on the superposition and suggest replacing it with LDDT or CAD, but the GDT_TS dependence on superposition seems to be a lesser problem (i.e., different superpositions may deviate, but should agree in general) than using less intuitive measures with different accuracy ranges. The future assessor thinks that GDT_TS should be kept as the target function for global prediction/evaluation.
(O3.2) How to evaluate multidomain targets?
GDT-based evaluation becomes a problem for multi-domain targets, as one domain usually dominates the superposition, and the evaluation method will assign high quality scores to the residues from one domain (usually the larger one) and relatively low quality scores to the residues from the other domain. Can we assess EMA on a per-domain basis (as we do in tertiary structure assessment)? Some methods (but not all) generate global scores by averaging per-residue scores over the whole model. For these methods, we can generate per-domain global scores from their local scores ourselves. But we cannot do this for other methods (including the 2/3 of CASP EMA methods that do only global prediction). Should we ask predictors to provide a global score for the whole model (whatever it is) and additionally provide scores for the different domains as they predict them? (This would require adjusting the QA format.) This would also raise the problem of differing domain definitions. The organizers/assessors can suggest preliminary domain definitions based on sequence analysis (not necessarily accurate), and the QA assessment would be done on these preliminary domains, without adjustment for the later defined ‘official’ definitions. Feedback from the future assessor: This may be a good idea, but current domain parsing methods are not accurate enough, and it would be better to keep the same format of EMA prediction as in CASP12.
4. Global accuracy evaluation: analyses
In previous CASPs, the effectiveness of EMA methods in assigning an overall accuracy score to a model was evaluated by assessing the methods’ ability to (4.1) find the best model among many others, (4.2) reproduce model-target similarity scores, (4.3) discriminate between good and bad models, and (4.4) rank models. All four evaluation target functions are used as the “ground truth” measures in these analyses. To establish the statistical significance of the differences in performance, two-tailed paired t-tests on the common sets of predicted targets and models were performed for each evaluation measure separately (the DeLong test for the ROC curve analysis).
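For concreteness, a minimal sketch of the per-measure paired comparison is given below; the numbers are made up, and the real analysis runs over the common sets of targets and models.

# Sketch: two-tailed paired t-test comparing two EMA methods on the same
# targets for one evaluation measure; illustrative values only.
from scipy import stats

method_a = [2.1, 0.0, 5.3, 1.2, 0.4, 3.3, 0.9]  # per-target scores, method A
method_b = [3.0, 0.5, 4.9, 2.6, 0.4, 4.1, 1.5]  # same targets, method B

t_stat, p_value = stats.ttest_rel(method_a, method_b)  # two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")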
4.1. Identifying the best models
To assess the ability of methods to identify the best models from among several available, for each target we calculated the difference between the score of the model predicted to be the best (i.e., the one with the highest predicted EMA score) and that of the model with the highest similarity to the native structure. Measuring the difference in accuracy between the predicted best model and the actual best model makes sense only if the actual best model is of good quality itself; therefore, this analysis was performed only on targets for which at least one model was of ‘good enough’ quality, defined as 40% of the selected measure’s top score.
As a complement to the accuracy loss analysis (above), we carried out a recognition rate analysis showing the success and failure rates of EMA methods in identifying the best models. We assume that a method succeeds if the difference in scores between the best EMA model and the actual best model is small (within 2 score units) and fails if the difference is larger than 10. Since a high success rate and a low failure rate are the desired features of an EMA method, we used the difference between these rates as the criterion to examine the methods’ efficiency.
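A sketch of both analyses for a single target follows, assuming predicted and true scores are on the same 0-100 scale; the 2- and 10-unit thresholds follow the text above, while the data and function names are made up.

# Sketch: per-target accuracy loss and success/failure classification.
# predicted / actual: dicts mapping model id -> score on a 0-100 scale.
def accuracy_loss(predicted, actual):
    best_by_ema = max(predicted, key=predicted.get)  # model ranked first by the EMA method
    best_actual = max(actual.values())               # score of the truly best model
    return best_actual - actual[best_by_ema]         # loss in score units (>= 0)

def classify(loss, success_cut=2.0, failure_cut=10.0):
    if loss <= success_cut:
        return "success"
    if loss > failure_cut:
        return "failure"
    return "neither"

predicted = {"m1": 72.0, "m2": 80.0, "m3": 65.0}
actual = {"m1": 78.0, "m2": 70.5, "m3": 60.0}
loss = accuracy_loss(predicted, actual)
print(loss, classify(loss))  # 7.5 neither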
4.2. Reproducing model-target similarity scores
To assess the overall correctness of global model accuracy estimates, we calculated the absolute difference between the actual evaluation scores and the predicted accuracies for every server model included in the best150 datasets. A smaller average difference over all targets signifies better performance of a predictor.
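In code, this analysis reduces to a per-target mean absolute difference, averaged over targets; the sketch below assumes scores have been brought to a common 0-100 scale.

# Sketch: average absolute difference between predicted and actual global
# scores over the models of one target (lower is better).
def mean_abs_error(predicted, actual):
    common = predicted.keys() & actual.keys()
    return sum(abs(predicted[m] - actual[m]) for m in common) / len(common)

print(mean_abs_error({"m1": 70.0, "m2": 55.0}, {"m1": 64.0, "m2": 58.0}))  # 4.5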
Open issue:
(O4.2) This evaluation method puts at a disadvantage those methods whose scores are not spread across the 0-1 range in the same way as the target function (for example, the CAD_aa score is theoretically in the 0-1 range but practically falls within 0.3-0.7; when it is used as the reference to evaluate accuracy estimates that were optimized to mimic the GDT_TS score, the predictor will be penalized).
Feedback from the future assessor: If we assess the absolute accuracy score, it is preferable to have a single target function, not multiple ones. However, if we want to evaluate local accuracy in the global context, not just the accuracy of the backbone, it would be good to also have other superposition-free scores as reference measures. The question is how best to combine them.
4.3 Distinguishing between good and bad models
To assess the ability of methods to discriminate between good and bad models, we pooled together models for all targets and then carried out a Receiver Operating Characteristic (ROC) analysis using a threshold of 50 on the evaluation measure to separate good and bad models. The area under the ROC curve (AUC) was used as a measure of the methods’ accuracy.
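A sketch of this pooled ROC analysis, assuming scikit-learn and the good/bad cutoff of 50 on the evaluation measure; the scores below are invented.

# Sketch: pooled ROC analysis of one EMA method against one evaluation measure.
# Models from all targets are pooled; a model counts as "good" if its true score >= 50.
from sklearn.metrics import roc_auc_score

true_scores = [82.0, 35.5, 61.0, 48.0, 90.2, 22.4]  # e.g., GDT_TS of each model
predicted = [0.85, 0.40, 0.55, 0.60, 0.95, 0.30]    # EMA estimates in the 0-1 range

labels = [1 if s >= 50.0 else 0 for s in true_scores]  # good = 1, bad = 0
print("AUC =", roc_auc_score(labels, predicted))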
Open issue:
(O4.3) ‘Goodness’ thresholds may be adjusted for each evaluation measure.
Feedback from the future assessor: a measure that puts more emphasis on higher-rank models would be nice.
In previous CASPs we also used the Matthews Correlation Coefficient to estimate the accuracy of separation between good and bad models (and also between good and bad regions in models). The results were shown to be very highly correlated with those of the ROC analysis, so since CASP11 we have decided to show only the ROC results.
4.4. Correlation between the predicted and observed scores
Correlation was a part of all QA assessments until CASP12. In CASP12 the assessors decided that ranking of models is of lower practical interest and did not use this measure for ranking the participating groups.
Remark: This analysis benefits methods whose scores differ substantially from the target absolute accuracy score but reproduce similar relative scores (i.e., model ranks). Also, correlation is intuitive and popular among predictors.
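For reference, a sketch of the per-target correlation computation (plain Pearson here; whether to use a top-N restriction instead is exactly the open issue below, and the numbers are invented):

# Sketch: Pearson correlation between predicted and observed scores for one
# target; per-target correlations can then be averaged over all targets.
from scipy.stats import pearsonr

observed = [78.0, 70.5, 60.0, 55.2, 42.1]   # true model-target similarity scores
predicted = [0.80, 0.72, 0.66, 0.50, 0.45]  # EMA estimates for the same models

r, p = pearsonr(predicted, observed)
print(f"per-target Pearson r = {r:.2f}")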
Open issue:
(O4.4) Is ranking of models (the clustering methods’ strong suit) all that important, given that users need absolute scores rather than relative ones? If it is, should we limit the correlation analysis to the top N predictions (and what is N?) or remove outliers (below 1 or 2 standard deviations)?
Feedback from the future assessor: I would like to emphasize more on correlation of higher-rank models.
5. Local accuracy estimation: target function
In the local assessment mode, each residue in the model has to be assigned an estimated distance error in Ångströms, as would be seen for that residue in the optimal model-target superposition. Since distance errors are submitted for each residue separately, evaluation can be carried out at both the whole-target and domain levels. For single-domain targets, the results from the two evaluation modes are identical. For multi-domain targets, the whole-target evaluation gives extra credit to methods capable of correctly identifying the relative orientation of the constituent domains, while the domain-level evaluation gives an advantage to methods that are more accurate in predicting within-domain distance errors.
To evaluate the accuracy of predicted per-residue error estimates, in CASP12 we employed the ASE measure [6]. For each residue, the distance d is normalized to the [0;1] range using the S-function [7], then averaged over the whole evaluation unit (target or domain) and rescaled to the [0;100] range. The higher the score, the more accurate the prediction of the distance errors in a model. If error estimates for some residues are not included in the prediction, they are set to a high value so that the contribution of that specific error to the total score is negligible.
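As we understand the description above and [6], an ASE-style score compares the S-transformed predicted and observed per-residue distances; a hedged sketch follows, with d0 = 5 Å assumed (see O5.4 below on the d0 dependence).

# Sketch of an ASE-style score: S-transform the predicted and observed
# per-residue distances, average the absolute differences, rescale to 0-100
# (higher = better). d0 = 5 A is assumed; O5.4 below asks about other d0 values.
def s(d, d0=5.0):
    return 1.0 / (1.0 + (d / d0) ** 2)

def ase(predicted_dist, observed_dist, d0=5.0):
    diffs = [abs(s(p, d0) - s(o, d0)) for p, o in zip(predicted_dist, observed_dist)]
    return 100.0 * (1.0 - sum(diffs) / len(diffs))

print(ase([1.0, 2.0, 8.0, 20.0], [0.5, 3.0, 4.0, 25.0]))  # ~87.4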
Open issues:
(O5.1) The local accuracy target function (distance) depends on the superposition. This becomes a bigger issue for multi-domain targets. If at least a part of the final score depends on a superposition-dependent measure, we have to tell predictors which distance will be used as the reference – the one from the whole-target superposition (in which case it is hard to control which domain will dominate the superposition) or the one from per-domain superpositions (with the domain boundary issues discussed above for global measures)?
Feedback from the future assessor: It is preferable that superposition is done per-domain only, since we are evaluating local accuracy. The issue is how to do it.
(O5.2) Reliability of the prediction can potentially be expressed in other superposition-free measures, but they are not necessarily intuitive. Which alternative measures could be used (CAD and LDDT are not overly intuitive)? Maybe SphereGrinder, which assesses the similarity of local neighborhoods in terms of the percentage of well-fit residues, or some other RMSD-based local measure?
(O5.3) Some people tried to predict B-factors – how did they do this without relying on the formula B = 8π²d²/3?
(O5.4) The ASE score depends on the d0 parameter, but a combination of ASE scores with several d0 values could potentially be beneficial (is the difference between ASE scores with d0=5 and d0=3 significant?)
Feedback from the future assessor: We will check.
6. Local accuracy evaluation: analyses
The effectiveness of local model accuracy estimators can be evaluated by verifying how well these methods (6.1) assign correct distance errors at the residue level, and (6.2) discriminate between reliable and unreliable regions in a model. Both analyses are carried out on the per-residue estimates submitted for all models and all targets.
6.1. Assigning residue error estimates
The accuracy of predicted per-residue error estimates was evaluated in CASP12 with the ASE measure in both whole-model and domain-based evaluation modes. The CASP12 results in the two modes are very similar: on average, single-model methods deviate by 0.0% ASE between the whole-model and domain-based scores, quasi-single methods by 1.5%, and clustering methods by <3%.
In previous CASPs, the correspondence between the predicted and actual distances was evaluated with a log-linear correlation (the log is used to smooth the effect of large distances) and with the correlation between S-scores calculated from distances capped at 15 Å. The correlation statistics were also calculated for CASP12 and shown on the Prediction Center web site, but they were not used for group ranking.
6.2. Discriminating between good and bad regions in the model
To evaluate how well CASP predictors can discriminate between accurately and poorly modeled regions, in CASP12 we carried out a ROC analysis on the submitted distance errors, setting the threshold for correct positioning of a residue at 3.8 Å.
In previous CASPs we also used MCC (similarly to 4.3).
Feedback from the future assessor: I think this aspect of the analysis is highly relevant to the refinement category. It would be interesting to analyze performance in predicting stretches of poorly modeled residues, in addition to the per-residue analysis.
7. Comparison to the baseline method (one of the measures of progress)
To provide a baseline for assessing the performance of the participating methods, we used the in-house developed Davis-EMAconsensus method, which has not changed since its first implementation in CASP9 (2010) [1]. During the latest four CASP experiments this method was run as an ordinary EMA predictor, alongside the other participating methods. The ratio between the scores of the reference method in different CASPs may indicate changes in target difficulty. The change in the relative scores of the best methods with respect to the baseline method may reflect performance changes associated with the development of methods, rather than changes in the databases or in target difficulty.
Feedback from the future assessor: I think this should be continued.
8. Ranking accuracy assessment methods alongside the tertiary structure prediction methods
Global scores generated by EMA methods can be used to pick the five highest-scoring models out of the 150 server models released to predictors for every target. This way, every CASP EMA method can be considered a tertiary structure meta-predictor (selector) and ranked alongside the TS prediction methods.
To insert EMA methods into the ranking tables for tertiary structure methods, we calculated their pseudo z-scores using the mean and standard deviation computed from the distribution of tertiary structure prediction scores (sketched below). This way, the z-scores of TS models remain intact and have the same values in both the TS-only and the joint TS+EMA rankings. Note that tertiary structure prediction methods are ranked differently depending on the model comparison environment (i.e., group types (server or expert), target subsets (all or human; TBM or FM), model types (model_1 or best-of-five)), and so are the EMA methods.
Feedback from the future assessor: I think this should be continued.
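The sketch referenced above: a pseudo z-score for an EMA-selected model computed against the per-target TS score distribution, so that the TS z-scores themselves are untouched; the numbers are illustrative.

# Sketch: pseudo z-score of an EMA-selected model, computed with the mean and
# standard deviation of the tertiary structure (TS) score distribution for the
# same target, leaving the TS z-scores themselves unchanged.
from statistics import mean, stdev

ts_scores = [72.1, 65.3, 58.0, 80.4, 44.9, 69.7]  # TS model scores for one target
mu, sigma = mean(ts_scores), stdev(ts_scores)

ema_selected_score = 75.0                         # score of the model picked by an EMA method
pseudo_z = (ema_selected_score - mu) / sigma
print(f"pseudo z = {pseudo_z:.2f}")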
REFERENCES
1. Kryshtafovych A, Fidelis K, Tramontano A. Evaluation of model quality predictions in CASP9. Proteins 2011;79 Suppl 10:91-106.
2. Kryshtafovych A, Barbato A, Monastyrskyy B, Fidelis K, Schwede T, Tramontano A. Methods of model accuracy estimation can help selecting the best models from decoy sets: Assessment of model accuracy estimations in CASP11. Proteins 2016;84 Suppl 1:349-369.
3. Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 2013;29(21):2722-2728.
4. Olechnovic K, Kulberkyte E, Venclovas C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 2013;81(1):149-162.
5. Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins 2014;82 Suppl 2:7-13.
6. Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP11 statistics and the prediction center evaluation system. Proteins 2016;84 Suppl 1:15-19.
7. Gerstein M, Levitt M. Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures. Proc Int Conf Intell Syst Mol Biol 1996;4:59-67.