Discussion: Estimates of model accuracy

by **akryshtafovych** on Tue Mar 27, 2018 2:21 pm

We are opening a series of discussions on evaluation approaches for the upcoming CASP13 experiment. The first prediction category to discuss is EMA (estimates of model accuracy). Below we propose three focus issues for discussion: what to predict in global assessment; what to predict in local assessment; and what the optimal evaluation functions are. We also provide a document with a more detailed description of the evaluation process and some of the open issues. The focus issues and the evaluation process description were compiled by the CASP organizers (former EMA assessors) with feedback from the future EMA assessor (Chaok Seok).

----------------
EMA prediction/evaluation focus issues
----------------

F1. What function to predict as a measure of global model fit?
Feedback from the future assessor: Chaok Seok thinks that having one main target function, e.g. GDT_TS, should be desirable.
One of the issues with GDT_TS is its non-optimal treatment of multi-domain targets (the superposition is usually dominated by one of the domains). While local measures are free of this problem, they are not very intuitive, especially for non-specialists. A possible solution at the evaluation stage: consider single-domain and multi-domain targets separately. For a uniform approach to the evaluation of both single-domain and multi-domain targets, a preliminary (i.e. before prediction) split of multi-domain targets into domains may be needed. If we do this, what is the best way to do it? One of the options is to predict domain definitions using sequence-based methods (e.g., Ginzu/Robetta, ThreaDom, DomPred). Note that the predicted domain boundaries may be imperfect, and they are planned to be used for the EMA assessment only if they prove reasonable in the structure analysis at the later stages of the evaluation. We can update the EMA format to allow providing global EMA scores for the whole target and for the suggested domains. Note that these domain definitions may be different from those that will be used for the TS evaluation, which will be based on structural analysis of the target. However, having analyzed data from previous CASP experiments, the future EMA assessor thinks that current domain parsing methods are not accurate enough, and it would be better to keep the same format of EMA prediction as in CASP12.
Global domain-based evaluation is possible without preliminary split into domains for local accuracy prediction methods (i.e. those providing per-residue accuracy scores – unfortunately only 1/3 of CASP12 methods provided local estimates). The overall domain score can be calculated by averaging local scores only for residues from the specified domain. Note that CASP12 evaluation paper reported that difference between domain-based and whole-structure based assessment was marginal with respect to ranking.
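The averaging of local scores over a domain can be sketched as follows. This is only an illustration: it assumes the per-residue scores are already on a 0-1 scale, and the function name and the data layout (a dict keyed by residue number, domains given as residue ranges) are hypothetical, not a CASP format.

```python
def domain_score(local_scores, domain_ranges):
    """Average per-residue scores over the residues of one domain.

    local_scores: dict mapping residue number -> predicted score in [0, 1]
    domain_ranges: list of (first, last) inclusive residue-number ranges
    """
    selected = [
        score
        for resnum, score in local_scores.items()
        if any(first <= resnum <= last for first, last in domain_ranges)
    ]
    if not selected:
        raise ValueError("no scored residues fall inside the domain")
    return sum(selected) / len(selected)


# Toy model: residues 1-4 form domain 1, residues 5-8 form domain 2.
scores = {1: 0.9, 2: 0.8, 3: 0.9, 4: 0.8, 5: 0.2, 6: 0.4, 7: 0.2, 8: 0.4}
d1 = domain_score(scores, [(1, 4)])     # well-modelled domain
d2 = domain_score(scores, [(5, 8)])     # poorly modelled domain
whole = domain_score(scores, [(1, 8)])  # whole-model average
```

In this toy case the whole-model average hides the fact that one domain is predicted to be much better than the other.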

F2. What function to predict as a measure of local model fit?
As has been the practice until now, predictors estimate the distance in Angstroms between the corresponding atoms in a rigid-body superposition (e.g., LGA or TM). This distance can be measured from whole-structure superpositions or separate-domain superpositions (the latter requires a preliminary split – see above).
In addition, other target per-residue reliability scores are possible to consider, for example a score scaled in the 0-1 range and reporting “1” for a residue that is predicted ‘reliably’. Reliability can be defined as local accuracy of a specific residue or local accuracy of the residue’s neighborhood. What are appropriate local scores to measure the reliability in both cases? How to adjust prediction format?

F3. What is the optimal global and local evaluation formula?
For global evaluation, the future assessor leans toward one target function, which can be GDT_TS. Additionally, a combination of GDT_TS with local measures (CAD, LDDT, SphereGrinder) can be considered for scoring. Other options?
For local evaluation, the averaged distance-based S-function (ASE measure) is considered one of the most viable possibilities. Per-residue scores from local measures (CAD, LDDT, SphereGrinder) can also be incorporated, but they are hard to combine.

==============
CASP organizers’ description of the evaluation process,
future assessors’ views,
and detailed subjects for the discussion with predictors
==============

1. Types of methods
In CASP, we categorize the methods into three broad categories: single-model methods (need no other information than structure of the model itself), quasi-single methods (generate scores for a single model, but use structural information from related structures (templates or other in-house generated models) in the background), and clustering (or consensus-based) methods that need many models (in amount of tens) to operate effectively. Since not all EMA methods are created equally with respect to their input, it is important to establish the abilities of methods across the categories (what EMA methods are capable of in general), within the categories (what specific types of methods can achieve), and compare performance of methods across different categories.

2. Targets and the two-stage evaluation procedure
As was shown in the CASP9 EMA evaluation paper [1] (Figure 6), evaluation scores depend on the datasets. It was shown that it was easier to estimate relative model accuracy in larger and more diverse model sets. This is true not only for clustering methods, but also for single-model methods. To adjust for that, in CASP10 we started filtering models, and suggested two conceptually different sets of models for accuracy estimation. One set (called ‘stage1’ or ‘select20’) is short (20 models) and very diverse, while the other (‘stage2’ or ‘best150’) is larger (150 models) and does not contain the tentatively worst models according to the in-house consensus accuracy predictor Davis-EMAconsensus [2]. Models are released in a two-stage procedure in order not to compromise prediction results.
Open issue:
(O2.1) Do we need stage 1?
Stage1 is a small dataset (20 models per target) containing models spanning the whole range of accuracy (by design). It was introduced to show the limits of different types of methods, and as such has exhausted itself. In the latest CASP (CASP12) it was used only to verify that the methods that were claimed as single-model indeed produced the same results in stages 1 and 2 (since pure single-model methods are based exclusively on the coordinates of the assessed models, it is expected that the accuracy estimates they produce would be the same every time a method is applied to the same model). Organizers and assessors suggest keeping it only for that purpose.

3. Global accuracy evaluation: target function
In the global assessment mode (QAglob), each model has to be assigned a score between 0 and 1 reflecting accuracy of the model (the higher the score the better).
Since global model accuracy estimates are submitted for whole models (and not domains), evaluation of the results is also carried out at the whole model level (differently from the tertiary structure prediction, which is evaluated at the level of domains).
More than a dozen measures are used in CASP to evaluate structural similarity of a model to the target, and each of these measures can be considered as a target function for model accuracy assessment. From CASP7 through CASP10, the measure of choice was GDT_TS. In CASP11 and CASP12 three non-rigid-body based measures - LDDT [3], CAD [4] and SphereGrinder [5] - were added to the evaluation tool chest. This way prediction results were assessed from different perspectives: recognizing the ability of EMA methods to not only properly estimate accuracy of the backbone, but also identify models with better local geometry or local structure context.
Open issues:
(O3.1) What is the optimal target function to predict?
Up to CASP11 predictors were asked to reproduce the GDT_TS score of the assessed model. Now that additional measures have been added to the evaluation package, the question to the assessor is how to balance (combine) evaluation using different measures. It would be good to tell predictors what the target function to predict is (what the components of the final score are) and how we would evaluate accuracy of the prediction with this target function. This can help predictors to optimize their global score, as there seems to be no one-size-fits-all solution presently. Some predictors claim that GDT_TS is a bad target measure as it depends on the superposition and suggest replacing it with LDDT or CAD, but the GDT_TS dependence on superposition seems to be a lesser problem (i.e., different superpositions may deviate, but should agree in general) than using less intuitive measures with different accuracy ranges. The future assessor thinks that GDT_TS should be kept as a target function for global prediction/evaluation.
(O3.2) How to evaluate multidomain targets?
GDT-based evaluation becomes a problem on multidomain targets, as one domain usually dominates the superposition and the evaluation method will assign high quality scores to the residues from one domain (usually the larger), and relatively low quality scores to the residues from the other domain. Can we assess EMA on a per-domain basis (like we do in tertiary structure assessment)? Some methods (but not all) generate global scores by averaging per-residue scores for the whole model. For these methods, we can generate per-domain global scores from their local scores ourselves. But we cannot do this for other methods (including 2/3 of CASP EMA methods that do only global prediction). Should we ask predictors to provide a global score for the whole model (whatever it is), and additionally provide scores for different domains as they predict them? (This would require adjusting the QA format.) This will also cause a problem of different domain definitions. The organizers/assessors can suggest preliminary domain definitions based on sequence analysis (not necessarily accurate), and the QA assessment will be done on these preliminary domains, without adjustment for later defined ‘official’ definitions. Feedback from the future assessor: This may be a good idea, but current domain parsing methods are not accurate enough, and it would be better to keep the same format of EMA prediction as in CASP12.

4. Global accuracy evaluation: analyses
In previous CASPs, the effectiveness of EMA methods in assigning an overall accuracy score to a model was evaluated by assessing the methods’ ability to (4.1) find the best model among many others, (4.2) reproduce model-target similarity scores, (4.3) discriminate between good and bad models and (4.4) rank models. All four evaluation target functions are used as the “ground truth” measures in these analyses. To establish the statistical significance of the differences in performance, two-tailed paired t-tests on the common sets of predicted targets and models were performed for each evaluation measure separately (DeLong test for the ROC curve analysis).
4.1. Identifying the best models
To assess the ability of methods to identify the best models from among several available, for each target we calculated the difference between the scores of the model predicted to be the best (i.e. that with the highest predicted EMA score) and the model with the highest similarity to the native structure. Measuring the difference in accuracy between the predicted best model and the actual best model makes sense only if the actual best model is of good quality itself, therefore this analysis was performed only on targets for which at least one model was of ‘good enough’ quality, defined as 40% of the selected measures’ top score.
In complement to the accuracy loss analysis (above), we carried out the recognition rate analysis, showing the success and failure rates of EMA methods in identifying the best models. We assume that a method succeeds if the difference in scores between the best EMA model and the actual best model is small (within 2 score units) and fails if the difference is larger than 10. Since high success rate and low failure rate are the desired features of an EMA method, we used the difference between these rates as the criterion to examine methods’ efficiency.
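The accuracy loss and the recognition rates described above can be sketched roughly as below. This is a minimal illustration, not the Prediction Center code: it assumes observed scores on a 0-100 scale (so that "40% of the top score" becomes 40), treats "within 2 score units" as a loss of at most 2, and uses hypothetical function names and a hypothetical per-target dict layout.

```python
def best_model_loss(predicted, observed):
    """Observed-score gap between the actual best model and the EMA-selected one."""
    picked = max(predicted, key=predicted.get)  # model with the highest EMA score
    return max(observed.values()) - observed[picked]


def recognition_rates(targets, good_enough=40.0, success=2.0, failure=10.0):
    """Return (success rate, failure rate, success minus failure).

    targets: list of (predicted, observed) dict pairs, one pair per target,
             both keyed by model name; observed scores are on a 0-100 scale.
    """
    losses = [
        best_model_loss(pred, obs)
        for pred, obs in targets
        if max(obs.values()) >= good_enough  # skip targets with no good model
    ]
    if not losses:
        raise ValueError("no target has a good-enough model")
    ok = sum(loss <= success for loss in losses) / len(losses)
    bad = sum(loss > failure for loss in losses) / len(losses)
    return ok, bad, ok - bad


targets = [
    ({"m1": 0.7, "m2": 0.5}, {"m1": 60.0, "m2": 61.0}),  # loss 1: success
    ({"m1": 0.4, "m2": 0.6}, {"m1": 80.0, "m2": 65.0}),  # loss 15: failure
    ({"m1": 0.3, "m2": 0.2}, {"m1": 20.0, "m2": 30.0}),  # best model < 40: skipped
]
ok_rate, fail_rate, rate_diff = recognition_rates(targets)
```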
4.2. Reproducing model-target similarity scores
To assess overall correctness of global model accuracy estimates, we calculated the absolute difference between the actual evaluation scores and the predicted accuracies for every server model included in the best150 datasets. Smaller average difference over all targets signifies better performance of a predictor.
Open issue:
(O4.2) This evaluation method puts at disadvantage those methods that are not spread in the 0-1 range similarly to the target function (for example, CAD_aa score is theoretically in 0-1 range, but practically in 0.3-0.7 range; and when used to evaluate accuracy of models optimized to mimic the GDT_TS score, the predictor will be penalized).
Feedback from the future assessor: If we assess absolute accuracy score, it is preferable that we have a single target function, not multiple. However, if we want to evaluate local accuracy in the global context, not just the accuracy of the backbone, it would be good to also have other superposition-free scores as reference measures. The question is how to combine them in a best manner.
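The range issue raised in (O4.2) can be illustrated with a toy calculation. The numbers are invented for illustration only, and the function name is hypothetical; the sketch simply computes the mean absolute difference from 4.2 for two imaginary predictors.

```python
def mean_abs_diff(predicted, observed):
    """Mean absolute difference between predicted and observed scores."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


observed = [0.20, 0.40, 0.60, 0.80]    # observed scores of four models

# Hypothetical predictor with perfect ranking but a compressed score range.
compressed = [0.40, 0.45, 0.55, 0.60]

# Hypothetical predictor on the observed scale that swaps two models.
swapped = [0.20, 0.60, 0.40, 0.80]

diff_compressed = mean_abs_diff(compressed, observed)  # about 0.125
diff_swapped = mean_abs_diff(swapped, observed)        # about 0.100
```

Here the predictor that ranks every model correctly gets the worse (larger) difference, purely because its scores are not spread like the target function.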
4.3 Distinguishing between good and bad models
To assess the ability of methods to discriminate between good and bad models, we pulled together models for all targets and then carried out a Receiver Operating Characteristic (ROC) analysis using Measure=50 threshold to separate good and bad models. The area under the ROC curve (AUC) was used as a measure of the methods’ accuracy.
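A rough sketch of this ROC analysis is given below. It uses the rank-based form of the AUC (the fraction of good/bad model pairs in which the good model gets the higher predicted score, with ties counted as one half), which is equivalent to the area under the ROC curve. Treating a model as good when its observed score is at least 50 is an assumption, as is the function name.

```python
def roc_auc(predicted, observed, threshold=50.0):
    """AUC for separating good models (observed >= threshold) from bad ones."""
    good = [p for p, o in zip(predicted, observed) if o >= threshold]
    bad = [p for p, o in zip(predicted, observed) if o < threshold]
    if not good or not bad:
        raise ValueError("need at least one good and one bad model")
    # Count good/bad pairs ranked correctly; a tie counts as half a pair.
    wins = sum((g > b) + 0.5 * (g == b) for g in good for b in bad)
    return wins / (len(good) * len(bad))


# Models pooled over all targets: predicted EMA scores and observed scores.
predicted = [0.9, 0.8, 0.3, 0.6, 0.2]
observed = [70.0, 55.0, 60.0, 40.0, 30.0]
auc = roc_auc(predicted, observed)  # 5 of 6 good/bad pairs are ordered correctly
```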
Open issue:
(O4.3) ‘Goodness’ thresholds may be adjusted for each evaluation measure.
Feedback from the future assessor: a measure that puts more emphasis on higher-rank models would be nice.
In previous CASPs we also used Matthews’ Correlation Coefficient to estimate accuracy of separation between good and bad models (and also good and bad regions in models). Results were shown to be very highly correlated with the results of ROC analysis, so since CASP11 we decided to show only the ROC results.
4.4. Correlation between the predicted and observed scores
Correlation was a part of all QA assessments until CASP12. In CASP12 the assessors decided that the problem of ranking groups is of lower practical interest and did not use the measure for ranking groups.
Remark: This score benefits the measures that have substantial difference from target absolute accuracy score, but can produce similar relative scores (i.e., ranks of models). Also, the score is intuitive and popular among predictors.
Open issue:
(O4.4) Is ranking of models (the clustering methods’ strong side) all that important, given that users need absolute scores, not relative ones? If so, should we limit the correlation analysis to the top N predictions (what is N?) or remove outliers (below 1 or 2 STD)?
Feedback from the future assessor: I would like to put more emphasis on the correlation for higher-rank models.

5. Local accuracy estimation: target function
In the local assessment mode, each residue in the model has to be assigned an estimated distance error in Ångströms, as would be seen for that residue in the optimal model-target superposition. Since distance errors are submitted for each residue separately, evaluation can be carried out at both the whole-target and domain levels. For single-domain targets, the results from both evaluation modes are identical. For multi-domain targets, the whole-target evaluation gives extra credit to methods capable of correctly identifying the relative orientation of the constituent domains, while the domain-level evaluation favors methods that are more accurate in predicting within-domain distance errors.
To evaluate the accuracy of predicted per-residue error estimates, in CASP12 we employed the ASE measure [6]. For each residue, the distance d is normalized to the [0;1] range using the S-function [7] and then averaged over the whole evaluation unit (target or domain) and rescaled to the [0;100] range. The higher the score, the more accurate the prediction of the distance errors in a model. If error estimates for some residues are not included in the prediction, they are set to a high value so the contribution of that specific error to the total score is negligible.
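As a rough sketch of how such a score can be computed, the code below assumes the S-function S(d) = 1/(1 + (d/d0)^2) and the form ASE = 100 * (1 - mean |S(e_i) - S(d_i)|), where e_i is the predicted error and d_i the observed distance for residue i. The exact formula, the default d0 = 5, the value substituted for missing estimates and the function names are assumptions made here for illustration; see reference [6] for the definition actually used.

```python
def s_function(d, d0=5.0):
    """Map a distance in Angstroms to the (0, 1] range; S(0) = 1, S(d0) = 0.5."""
    return 1.0 / (1.0 + (d / d0) ** 2)


def ase(predicted_errors, observed_distances, d0=5.0, missing_value=1000.0):
    """ASE-like score on a 0-100 scale; higher means better error estimates.

    predicted_errors: per-residue estimates in Angstroms, None if not submitted
    observed_distances: per-residue model-target distances in Angstroms
    """
    total = 0.0
    for e, d in zip(predicted_errors, observed_distances):
        if e is None:
            e = missing_value  # missing estimate replaced by a high value
        total += abs(s_function(e, d0) - s_function(d, d0))
    return 100.0 * (1.0 - total / len(observed_distances))


perfect = ase([1.0, 5.0], [1.0, 5.0])                 # estimates match exactly
with_gap = ase([1.0, 5.0, None], [1.0, 5.0, 5.0])     # one estimate is missing
```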
Open issues:
(O5.1) The local accuracy target function (distance) depends on the superposition. This becomes a bigger issue for multidomain targets. If at least a part of final score depends on a superposition-dependent measure, we have to tell predictors what distance will be used as a reference – the one from the whole-target superposition (and in this case it is hard to control which domain will dominate the specific superposition) or from per-domain ones (issue with domain boundaries – discussed above for global measures)?
Feedback from the future assessor: It is preferable that superposition is done per-domain only, since we are evaluating local accuracy. The issue here is how to do it.
(O5.2) Reliability of the prediction can potentially be expressed in other superposition-free measures, but they are not necessarily intuitive. Which alternative measures can be used (CAD or LDDT are not overly intuitive)? Maybe SphereGrinder, which assesses similarity of local neighborhoods in terms of % of well-fit residues, or some other RMSD-based local measure?
(O5.3) Some people tried to predict B-factors – how did they do this without relying on the formula B = 8π²d²/3?
(O5.4) The ASE score depends on d0 parameter, but potentially combination of ASE scores with several d0s may be beneficial (is the difference in ASE scores with d0=5 and d0=3 significant?)
Feedback from the future assessor: We will check.
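For reference, the B-factor relation mentioned in (O5.3) can be written as a pair of conversions. This is only a sketch of that formula, with d read as the positional error in Angstroms and B in square Angstroms; the function names are hypothetical.

```python
import math


def distance_to_bfactor(d):
    """B = 8 * pi^2 * d^2 / 3 for a positional error d in Angstroms."""
    return 8.0 * math.pi ** 2 * d ** 2 / 3.0


def bfactor_to_distance(b):
    """Inverse relation: d = sqrt(3 * B / (8 * pi^2))."""
    return math.sqrt(3.0 * b / (8.0 * math.pi ** 2))


b_for_1A = distance_to_bfactor(1.0)          # roughly 26.3
d_roundtrip = bfactor_to_distance(b_for_1A)  # back to 1.0 within rounding
```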

6. Local accuracy evaluation: analyses
The effectiveness of local model accuracy estimators can be evaluated by verifying how well these methods (6.1) assign correct distance errors at the residue level, and (6.2) discriminate between reliable and unreliable regions in a model. Both analyses are carried out on the per-residue estimates submitted for all models and all targets.
6.1. Assigning residue error estimates
The accuracy of predicted per-residue error estimates was evaluated in CASP12 with the ASE measure in the whole-model and domain-based evaluation. CASP12 results in both evaluation modes are very similar, with single-model methods deviating by 0.0% ASE between the whole-model and domain-based ASE scores (on average), quasi-single methods deviating by 1.5% and clustering methods by <3%.
In previous CASPs the correspondence between the predicted and actual distances was evaluated with log-linear correlation (log used to smooth effect of large distances) and correlation between S-scores calculated from the distances capped at 15A. The correlation statistics were also calculated for CASP12 and shown on the Prediction Center web site, but not used for group ranking.
6.2. Discriminating between good and bad regions in the model
To evaluate how well CASP predictors can discriminate between accurately and poorly modeled regions in the model, in CASP12 we carried out the ROC analysis on the submitted distance errors setting the threshold for correct positioning of a residue at the 3.8Å level.
In previous CASPs we also used MCC (similarly to 4.3).
Feedback from the future assessor: I think that this aspect of analysis has a good relevance to the refinement category. It would be interesting to analyze performance on predicting stretches of poorly modeled residues, as well as per-residue analysis.

7. Comparison to the baseline method (one of the measures of progress)
To provide a baseline for assessing performance of the participating methods, we used the in-house Davis-EMAconsensus method, which has not changed since its first implementation in CASP9 [1] (2010). During the latest four CASP experiments this method was run as an ordinary EMA predictor alongside the other participating methods. The ratio between the scores of the reference method in different CASPs may indicate the change in difficulty of targets. The change in the relative scores of the best methods with respect to the baseline method may reflect performance changes associated with the development of methods and not the change in the databases or target difficulty.
Feedback from the future assessor: I think this should be continued.

8. Ranking accuracy assessment methods alongside the tertiary structure prediction methods
Global scores generated by EMA methods can be used to pick five highest scoring models out of the 150 server models released to predictors on every target. This way, every CASP EMA method can be considered as a tertiary structure meta-predictor (selector) and ranked alongside the TS prediction methods.
To insert EMA methods into ranking tables for tertiary structure methods, we calculated their pseudo z-scores using the mean and standard deviation computed from the distribution of tertiary structure prediction scores. This way z-scores of TS models are intact and have the same values both in TS-only and joined TS+EMA rankings. Note that tertiary structure prediction methods are ranked differently depending on the model comparison environment (i.e. group types (server or expert), target subsets (all or human; TBM or FM), model types (model_1 or best-of-five)), and so are the EMA methods.
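The pseudo z-score idea can be sketched as follows: the mean and standard deviation come from the TS scores only, so adding EMA-based selectors to the table leaves the TS z-scores unchanged. Group names and scores are invented, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def z_scores(scores, reference):
    """z-scores of `scores` against the mean and SD of the `reference` scores."""
    ref = list(reference.values())
    mu = mean(ref)
    sigma = pstdev(ref)  # population SD; the choice of SD estimator is assumed
    if sigma == 0:
        raise ValueError("reference scores have zero spread")
    return {group: (s - mu) / sigma for group, s in scores.items()}


ts = {"TS_1": 40.0, "TS_2": 60.0}  # scores of TS methods on one target
ema = {"EMA_1": 55.0}              # score of the model picked by an EMA method

ts_only = z_scores(ts, ts)           # TS-only ranking
joint = z_scores({**ts, **ema}, ts)  # joint TS+EMA ranking, same reference
```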
Feedback from the future assessor: I think this should be continued.

REFERENCES

1. Kryshtafovych A, Fidelis K, Tramontano A. Evaluation of model quality predictions in CASP9. Proteins 2011;79 Suppl 10:91-106.
2. Kryshtafovych A, Barbato A, Monastyrskyy B, Fidelis K, Schwede T, Tramontano A. Methods of model accuracy estimation can help selecting the best models from decoy sets: Assessment of model accuracy estimations in CASP11. Proteins 2016;84 Suppl 1:349-369.
3. Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 2013;29(21):2722-2728.
4. Olechnovic K, Kulberkyte E, Venclovas C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 2013;81(1):149-162.
5. Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins 2014;82 Suppl 2:7-13.
6. Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP11 statistics and the prediction center evaluation system. Proteins 2016;84 Suppl 1:15-19.
7. Gerstein M, Levitt M. Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures. Proc Int Conf Intell Syst Mol Biol 1996;4:59-67.

by **arneelof** on Wed Mar 28, 2018 12:11 am

F1. What function to predict as a measure of global model fit?
I absolutely think that a local measure is better. The argument that it is "not very intuitive, especially for non-specialists" is honestly ridiculous. We are the specialists and we know what we are doing. In addition, GDT_TS is not very intuitive. Both CAD and lDDT are also easier to predict (CC 0.8 vs 0.6) and in our experience the correlation between methods is almost identical. The only thing to consider is how to treat missing residues, as (in particular for some of the CAD versions) the random score (for a completely wrong model) is not 0 but about 0.2. This means that for a model with many missing residues the global score will be worse than that of a completely random model even if it is partly correct. This problem is not as severe with lDDT or the "all atom - sidechain" version of CAD.

F2. What function to predict as a measure of local model fit?
Same here. A local score is much better as it is not dependent on a single superposition.

F3. What is the optimal global and local evaluation formula?
Same again lDDT or CAD (all atom - sidechain).

by **jmoult** on Mon Apr 02, 2018 9:00 am

Arne:

I think I was the originator of the phrase "not very intuitive, especially for non-specialists", so I should respond to your comment.

You are right, we are the specialists, and that can be a problem - in my view, we should not be devising metrics that we like, we should be devising metrics that the intended users of models can relate to. I think if you tell almost anyone outside the CASP community 'the estimated error in this atom's position is X', they will assume that X refers to some kind of superposition frame. Or?

by **mcguffin** on Fri Apr 06, 2018 5:46 am

Hi!

F1. What function to predict as a measure of global model fit?
Arne has some valid points about lDDT - it does appear to be easier to train methods using it as a target function and it is not reliant on a single superposition. But superposition-based metrics do have their uses for scoring full chain models. For example, it may be very useful for a user to know whether the relative orientation of the domains in their model is correct or incorrect, even if the individual domains are correctly modelled. In fact this may be a weakness of methods trained on the lDDT score - they may not penalise full chain models as much if their domain orientations are incorrect.

Dividing EMA targets by domains using prediction methods is not really practical and I would not recommend going down this route. As mentioned, the averaged local scores can be used for scoring domains, once they have been specified.

F2. What function to predict as a measure of local model fit?
I do think the distance in Angstroms between the corresponding atoms is intuitive, it is established and I think we should keep this. Users do seem to easily understand it well. It is also easy to convert back to 0-1 for evaluation purposes. It is much harder for users to visualise an lDDT score, even if it is easier to predict.

F3. What is the optimal global and local evaluation formula?
We should evaluate using all available metrics, but if I had to pick two...
Global: GDT_TS and lDDT
Local: ASE measure (S-score/distance) and lDDT

Cheers,
Liam

by **uziela** on Mon Apr 09, 2018 2:10 am

I agree with Arne that GDT_TS is not more intuitive than LDDT or CAD. We have just used GDT_TS so much that now we are used to it. It is a bad habit and bad habits should be changed. If we want to have an intuitive metric, we should go back to the stone age when RMSD was used. If we want a robust metric for the future, we should switch to CAD and/or LDDT.

CAD and LDDT have a number of advantages, such as:
1) They don't need an "optimal" superposition which is a very arbitrary thing itself
2) They solve the multi-domain evaluation problem
3) They are much more suitable for local quality evaluation, because they show the correctness of a local area around a residue. The local quality should not depend on the superposition.
4) As we recently showed, CAD and LDDT are more suitable for training machine learning-based EMA methods [1]

The problem of "missing residues" that Arne mentioned is a valid point. We are talking here about residues that are present in the native structure, but are missing in the model. CAD and LDDT tend to give scores larger than zero to any residue even if it is entirely incorrect [1]. However, I don't see this as a big problem, because the structure prediction groups should be encouraged to model all of the residues in the target sequence. So it is OK to penalize models that do not have all residues modelled.

Below are my answers to the questions raised in the original post.

F1. CAD (All-atom - Side-chain) and/or LDDT
F2. CAD (All-atom - Side-chain) and/or LDDT
F3. Same as F1 and F2.

O2.1 I agree that stage1 should only be used to categorize methods as single-model / consensus (if we can't find a better way to do it). Moreover, I think stage2 should contain all models, not just 150 "best models" selected by an arbitrary consensus predictor.

O3.1 CAD and/or LDDT
O3.2 CAD and LDDT solve the multi-domain evaluation problem. Superposition-based scores can only be kept to evaluate the orientation of two domains, as Liam noticed.

4.1. I think introducing arbitrary thresholds, such as "the method succeeds if it selects a model within 2 score units from the best model" or "‘good enough’ quality, defined as 40% of the selected measures’ top score" is very subjective and dangerous. We should avoid using such thresholds wherever possible.

In my opinion, model selection should be evaluated as a simple mean difference between the selected model and the best model according to the given metric (first-ranked score loss). In addition to that, the differences converted to Z-scores should be used, as was done in the previous CASPs. This should be done for all targets, not just the ones that have at least one "good" model, because the definition of "good" is very arbitrary.

O4.2 I think using a Pearson/Spearman correlation is better than absolute difference between predicted and real scores. Alternatively, at the very least, the predictor groups should know in advance what evaluation measure to predict.

O4.3 As I mentioned before we should try to avoid using arbitrary thresholds, so I don't like so much the idea of evaluating the ability of discriminating between "good" and "bad" models. "Good" and "bad" is very arbitrary.

O4.3 and O4.4. I agree with Chaok Seok that top models should be given more weight. Therefore I suggest two novel methods for EMA evaluation:

A) Weighted per-target correlation. Weighted Pearson correlation is defined similarly to the regular Pearson correlation, but each observation is weighted by its degree of importance: https://en.wikipedia.org/wiki/Pearson_c ... oefficient Weighted Spearman correlation is defined similarly.

Weighted correlation solves the problem of giving more weight to good models. The weight coefficients could simply be the model scores according to the given evaluation metrics (for example, CAD or LDDT). If we want to give even more weight to top models, we can use squared CAD/LDDT scores as weights. We could also experiment with alternative weight definitions, but evaluation metrics as weights sound like the obvious first thing to try.

Moreover, I would like to emphasise that I suggest calculating "per-target" correlations, that is, calculating the correlation for each CASP target separately and taking the average. The reason is that we are mostly interested in ranking the models for each target. The evaluation of overall model quality (among all targets) is of lesser importance, but it could also be done with weighted correlation.

Weighted per-target correlation could potentially be more informative than first-ranked score loss. The problem with first-ranked loss is that it only evaluates the top-ranked model for each target, and because the differences in predicted scores are usually very small, this measure is very noisy [1]. Weighted correlation should be less noisy, because it evaluates the ranking of all models, but still gives more weight to the top models.

There is a ready-made R package for calculating weighted correlations: https://cran.r-project.org/web/packages/wCorr/wCorr.pdf
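For those not using R, the weighted per-target Pearson correlation suggested in (A) can be sketched in plain Python. This is a minimal illustration, assuming the weights are the observed scores raised to a power (1 for plain scores, 2 for squared scores); the function names are hypothetical and the toy numbers are invented.

```python
import math


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative observation weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)


def per_target_weighted_correlation(targets, power=1):
    """Average over targets of the correlation weighted by observed**power.

    targets: list of (predicted, observed) score lists, one pair per target
    """
    values = [
        weighted_pearson(pred, obs, [o ** power for o in obs])
        for pred, obs in targets
    ]
    return sum(values) / len(values)


observed = [0.2, 0.3, 0.8, 0.9]
swap_bottom = [0.3, 0.2, 0.8, 0.9]  # the two worst models are mis-ranked
swap_top = [0.2, 0.3, 0.9, 0.8]     # the two best models are mis-ranked

r_bottom = weighted_pearson(swap_bottom, observed, observed)
r_top = weighted_pearson(swap_top, observed, observed)
```

In this toy case the unweighted Pearson correlation is the same for both predictors, while the weighted version gives the lower value to the one that mis-ranks the top models.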

B) Plot the correlation against the top N models for each target.

What I mean here: for each CASP target, take the top N models according to a given evaluation metric (e.g. CAD or LDDT) and calculate the correlation between the predicted scores and that metric. Then plot the correlation against N for a given CASP target OR take the mean over all targets. N will vary from 3 to the number of models for each CASP target OR the max number of models for a CASP target (correlations for N = 1 or 2 are not defined/meaningful). So on the X axis we have N, on the Y axis we have the correlation, and we have a curve of a different color for each EMA method. The EMA methods whose curves are on top will be better than the EMA methods whose curves are at the bottom. To get a single measure for each method, we could calculate the Area Under the Curve (AUC).
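A possible sketch of the top-N correlation curve for one target is shown below. The details are assumptions, not an agreed procedure: plain Pearson correlation is used, the area is computed with the trapezoidal rule and divided by the N range so that a perfect predictor scores 1, and the function names are made up.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation; NaN if either list has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def top_n_curve(predicted, observed, n_min=3):
    """List of (N, correlation over the N models with the best observed scores)."""
    order = sorted(range(len(observed)), key=lambda i: observed[i], reverse=True)
    curve = []
    for n in range(n_min, len(order) + 1):
        top = order[:n]
        r = pearson([predicted[i] for i in top], [observed[i] for i in top])
        curve.append((n, r))
    return curve


def curve_auc(curve):
    """Trapezoidal area under the curve, normalised by the N range."""
    area = sum((n2 - n1) * (r1 + r2) / 2 for (n1, r1), (n2, r2) in zip(curve, curve[1:]))
    return area / (curve[-1][0] - curve[0][0])


observed = [0.9, 0.8, 0.7, 0.6, 0.5]
curve = top_n_curve(observed, observed)  # a predictor that reproduces the scores
auc = curve_auc(curve)
```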

This idea came up when I was discussing EMA evaluation with Björn Wallner, so he should be given credit for it. I haven't tried plotting such graphs myself yet, but I think the idea is potentially useful.
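A minimal Python sketch of the top-N curve for a single target and one EMA method follows. The function names are hypothetical, and two details are assumptions of this sketch rather than part of the proposal: the area is computed with the trapezoidal rule and divided by the range of N so that targets with different numbers of models are comparable, and a correlation that is undefined (constant scores among the top N) is reported as NaN.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation; NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / math.sqrt(sx * sy)


def top_n_curve(predicted, observed, min_n=3):
    """Return [(N, correlation)] over the N best models by observed score."""
    order = sorted(range(len(observed)), key=lambda i: observed[i], reverse=True)
    curve = []
    for n in range(min_n, len(order) + 1):
        idx = order[:n]
        r = pearson([predicted[i] for i in idx], [observed[i] for i in idx])
        curve.append((n, r))
    return curve


def curve_auc(curve):
    """Trapezoidal area under the curve, divided by the range of N."""
    if len(curve) < 2:
        return curve[0][1]
    area = sum((n2 - n1) * (c1 + c2) / 2
               for (n1, c1), (n2, c2) in zip(curve, curve[1:]))
    return area / (curve[-1][0] - curve[0][0])


# Hypothetical scores for five models of one target.
predicted = [0.9, 0.8, 0.7, 0.6, 0.5]
observed = [0.85, 0.75, 0.65, 0.55, 0.45]
curve = top_n_curve(predicted, observed)
print(curve, curve_auc(curve))
```

To compare methods over the whole experiment, the per-target curves (or their AUC values) would then be averaged over targets, as suggested above.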

6. Again, I think selecting an arbitrary threshold for good/bad residues is not a good idea. In my opinion a simple Pearson correlation between predicted and real scores is a better measure. The local correlation can be calculated in three ways: whole data set, per-target and per-model.

Once again, contact-based measures such as LDDT or CAD are much better suited here, because local scores should not depend on superposition.
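The three ways of calculating the local correlation mentioned in point 6 can be sketched in Python as follows. The nested `data` layout and the function names are hypothetical; the per-residue lists are assumed to hold predicted and observed local scores on the same scale (for example per-residue LDDT), and no list is assumed to be constant.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation (assumes neither variable is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)


def local_correlations(data):
    """data: {target: {model: (predicted, observed)}} with per-residue lists.

    Returns the correlation over the whole data set, the mean of the
    per-target correlations and the mean of the per-model correlations.
    """
    all_p, all_o = [], []
    per_target, per_model = [], []
    for models in data.values():
        tgt_p, tgt_o = [], []
        for predicted, observed in models.values():
            per_model.append(pearson(predicted, observed))
            tgt_p += predicted
            tgt_o += observed
        per_target.append(pearson(tgt_p, tgt_o))
        all_p += tgt_p
        all_o += tgt_o
    return {
        "whole": pearson(all_p, all_o),
        "per_target": sum(per_target) / len(per_target),
        "per_model": sum(per_model) / len(per_model),
    }


# Hypothetical example: one target, two models, three residues each.
data = {
    "T1": {
        "m1": ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]),
        "m2": ([0.1, 0.2, 0.3], [0.7, 0.8, 0.9]),
    }
}
print(local_correlations(data))
```

In this toy example the per-model correlation is perfect, while the pooled per-target value is much lower, because the method ranks residues correctly within each model but misses the difference in overall quality between the two models. The three variants therefore answer different questions.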

7-8. I agree this should be continued.

[1] Uziela K., Menéndez Hurtado D., Shu N., Wallner B., Elofsson A. (2018) "Improved protein model quality assessments by changing the target function." Proteins (epub ahead of print). doi: 10.1002/prot.25492

by **djones** on Tue Apr 10, 2018 3:57 pm

Not aimed at anyone in particular here, but I have to say that this area of CASP has gotten very hard (and boring) to follow, even for relative experts. Heaven knows what the wider community make of all this. Is there a Google Translate option for translating EMA into English I wonder? :-)

I have to say that as someone who would quite like just to be able to pick the best model from a large set of template-free models, say, I'm generally left mystified as to which of these programs is supposed to be best at this basic job.

All of these metrics don't really push developers to make innovative (and better) tools - just to overfit to the limited data points generated by the CASP experiment. I'm not even that sure that CASP actually produces the right volume and type of data to really push these tools hard enough. How good are they at separating 5000 quite similar models or 4990 quite similar bad models and 10 decent ones? Are some tools better at detecting fold-level differences than others? Are some tools biased towards particular sources of models, e.g. models from the Baker group having particularly good stereochemistry, that kind of thing? If I have nicely energy minimized models with the wrong fold and a few poorly minimized models with the right fold, can these programs pick these models out? Are any tools able to make reasonable decisions based just on C-alphas or main chain atoms? These sorts of questions encourage new ideas and perhaps even new participation, but all of this deep statistical introspection really doesn't.

by **gstuder** on Thu Apr 12, 2018 2:11 pm

As mentioned above by others, superposition-based scores depend on a (somewhat arbitrary and not very intuitive) choice of superposition and should probably be avoided.
Any metric that measures the accuracy of the local neighborhood of atoms in the models (i.e. lDDT and/or CAD) seems the more natural choice.

Independently of the choice of superposition-based score or not, the quality metrics should consider all atoms whenever possible. This is especially important to distinguish good and better models in the high quality range.

David raised a very important point on where any quality estimate is applicable. Depending on the application, different methods may perform better and different scores should reflect the different applications:
1) As a method developer, I want the estimates to guide the modelling process. Here, model ranking is crucial.
2) As a final user of a protein model, I want to know whether I can use the model for my purpose. Here, I want absolute numbers and it's likely that I am more interested in accurate model ranking in the high-quality range (again: all atom scores are crucial here) than in the low-quality range.

Regarding the original questions:
F1. Global score: lDDT and/or CAD (All-atom - Side-chain)
F2. Local score: lDDT and/or CAD (All-atom - Side-chain)
F3. Different evaluations due to different applications: show model ranking performance & correlation with score and split between low-quality and high-quality models (not targets)

by **venclovas** on Thu Apr 12, 2018 11:56 pm

It may well be that the CASP evaluation procedures of model quality assessment methods have become too convoluted. I guess this just reflects the fact that it is not so easy to define what should be evaluated and how. However, I believe that the problem of identifying realistic models and picking the most accurate one is extremely important. It is not only for picking the best template-free model. As previous CASPs have shown, even in the case of model refinement it is often difficult to say whether the model has improved or become worse (without knowing the target structure).

Comments regarding the focus issues:

F1. What function to predict as a measure of global model fit?
The function(s) should be suitable for both single- and multi-domain structures and preferably reflect physical nature of protein structure (interactions).

GDT-TS does not satisfy either of the two criteria. In my opinion, splitting the protein chain into domains before the structure is known would be going from bad to worse. The comment that the difference between domain-based and whole-structure based assessment in CASP12 was marginal with respect to ranking is a weak argument. I suspect that the ranking of modeling methods would also not change much if multidomain targets were not split into domains. Nonetheless, we all know that assessing whole-chain models for multidomain targets with GDT-TS does not make sense.

I believe local measures are much better suited for the task as they can be applied to both single- and multi-domain structures. At the same time, it is not as if the local measures do not care at all about the orientation of domains (subdomains, loops). The more extensive the interaction between domains, the stronger the penalty for their improper orientation. The same is true for locally deviating rigid substructures. In addition, local measures better reflect interactions between residues.

F2. What function to predict as a measure of local model fit?
I am definitely for the local scores as they do not depend on an arbitrarily defined superposition. In general I think that an attempt to predict the deviation of a residue position after a single global superposition is the wrong goal. Residue interactions (environment) are what matter for the formation of protein structure, and I think that is what should be assessed. As an example, let us consider a long rigid loop that does not make contacts with the rest of the structure. Let us say that in two models this loop is locally identical, including its backbone conformation, hydrogen bonds and side chain packing. Only a couple of residues at the root have a slightly different conformation. After the global superposition, in one case the largest deviation is 0.5 Å, in another case 5 Å. Use another cutoff for the superposition and it may become 1 Å and 10 Å. Do these values represent the “true” deviation? Which ones? In reality they are both misleading, because the two loops are identical.
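The loop example can be reproduced with a toy calculation. The Python sketch below is a deliberately simplified 2D illustration, not a real protein or a real superposition procedure: a straight "loop" of ten pseudo-residues spaced 3.8 Å apart is hinged at its root (the origin, standing in for the superimposed core) and tilted by a small angle. All names, the geometry and the angles are hypothetical choices for this illustration.

```python
import math


def rotate(points, angle_deg):
    """Rotate 2D points about the origin (the hinge at the loop root)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def max_deviation(a, b):
    """Largest per-residue distance between two conformations."""
    return max(math.dist(p, q) for p, q in zip(a, b))


def internal_distances(points):
    """All pairwise distances within the loop (superposition-free)."""
    return [math.dist(p, q)
            for i, p in enumerate(points) for q in points[i + 1:]]


# Reference loop: ten pseudo-residues, 3.8 Å apart, starting at the hinge.
reference = [(3.8 * i, 0.0) for i in range(1, 11)]
model_a = rotate(reference, 1)  # hinge differs by 1 degree
model_b = rotate(reference, 8)  # hinge differs by 8 degrees

print(max_deviation(reference, model_a))  # well below 1 Å
print(max_deviation(reference, model_b))  # above 5 Å

# The loop itself is identical in both models: every internal distance
# matches the reference to within floating-point error.
same = all(abs(d1 - d2) < 1e-9
           for d1, d2 in zip(internal_distances(reference),
                             internal_distances(model_b)))
print(same)
```

The deviation at the loop tip grows with the hinge angle and the loop length, while any measure built only on distances within the loop sees no difference between the two models.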

F3. What is the optimal global and local evaluation formula?
I think it should be local scores that can assess both individual residues (their environment) and the entire structure (including multidomain proteins). Specific local scores should be selected based on their objectively assessed properties and not ‘like/hate’ or ‘intuitive/non-intuitive’ criteria.

by **arneelof** on Tue Apr 17, 2018 1:21 pm

David, I fully agree. This is a discussion that is probably of interest to fewer than 10 people in the world.

But you address the two important problems. Can we find the best out of 5000 models and can we say when a prediction is ok or not.

For the first problem we do not really know what criteria we want. Most likely we want a model that explains biology or can be used for drug design. My guess is that all-atom RMSD is an OK measure as a proxy for that. Any model more than 3 Å away is wrong, so we can just ignore them.

The second problem I do think that consensus (or clustering) methods address quite well. But that we know from 20 years of running fragfold (or Rosetta). If you run it many times and get the same answer, you are probably correct; if not, most likely all models are wrong. CASP is just a proxy for that, and really no progress has happened here for more than a decade. Current MQA methods try to catch up here.

You are also right that CASP might distract us from the important problems. We have a bias of models where often some are much worse than the best ones. And if you always pick a Zhang method you will do well.

Unfortunately I do not think the development of the current generation of MQA methods is aimed at all at the first problem. And most likely a method that is good for that would not perform very well using current evaluation methods. Probably we should use the Refinement models for evaluation instead (but then we would have zero correlation).

by **uziela** on Sun Apr 29, 2018 3:05 am

I agree with what Č. Venclovas wrote - those were very wise remarks.

Furthermore, I don't agree with what Arne wrote in the last post. I think how we evaluate QA predictions can greatly influence the direction of the field. And in my opinion QA could revolutionise not just structural bioinformatics, but also have a big influence in medicine, biotechnology and other downstream applications.

Also, I think RMSD is a horrible metric, even for relatively accurate models.
