|
The final result of a structure prediction depends strongly on how much information from already known structures can be used. At one extreme, models competitive with experiment can be produced for proteins with sequences very similar to those of known structures. At the other, models for proteins with no detectable sequence or structure relationship to one of known structure are still at best very approximate. In CASP1-6, reflecting how extensively models could be based on knowledge of other structures, targets were divided into three broad categories: comparative modeling, fold recognition, and prediction of new folds. Over time, finer-grained intra-category distinctions have appeared. However, it is still useful to outline the general types of problems faced by structure prediction in each of the three main categories.
|
|
Comparative modeling. Comparative or homology modeling relies on the fact that, for all pairs of natural proteins so far encountered, a clear sequence relationship implies similar structures. Thus, the structure of a homologous protein can, to a large degree, guide the generation of a model of a new one. The questions specific to this level of prediction are the accuracy with which the new sequence is matched to the template provided by the known structure, and the extent to which the model improves in detail over simply copying that template. In assessment, attention is paid to the quality of alignments between the modeled sequence and the target structure, correctness of side chain building, correctness of protein-ligand interactions, and accuracy of model fragments that cannot be directly copied from the template. In addition, improvements over a copying operation, such as prediction of protein main chain shifts relative to the template and structure refinement with, for example, molecular dynamics, are also evaluated.
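As an illustration of what an alignment-quality check can look like, the sketch below scores a predicted sequence-to-template alignment against a reference alignment (for example, one derived from superimposing the experimental structure on the template). It is a minimal sketch, not the CASP assessors' procedure: the function name `alignment_accuracy`, the dictionary representation, and the `max_shift` tolerance are all assumptions made for this example.

```python
def alignment_accuracy(predicted, reference, max_shift=0):
    """Fraction of reference-aligned residues that the prediction aligns correctly.

    Both arguments are hypothetical dicts mapping a target residue index to
    the template residue index it is aligned with. A residue counts as
    correct if the predicted template position is within `max_shift`
    positions of the reference position.
    """
    if not reference:
        return 0.0
    correct = 0
    for target_res, ref_template_res in reference.items():
        pred_template_res = predicted.get(target_res)  # None if left unaligned
        if pred_template_res is not None and abs(pred_template_res - ref_template_res) <= max_shift:
            correct += 1
    return correct / len(reference)


# Made-up example: residues 3 and 4 are shifted by one position, residue 5 by four.
reference = {1: 10, 2: 11, 3: 12, 4: 13, 5: 14}
predicted = {1: 10, 2: 11, 3: 13, 4: 14, 5: 18}

exact = alignment_accuracy(predicted, reference)                 # strict match
tolerant = alignment_accuracy(predicted, reference, max_shift=1) # allow a one-residue shift
```

Allowing a small shift separates alignments that are nearly right from those that place a segment on an unrelated part of the template.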
|
|
Fold recognition. Fold recognition takes advantage of the fact that protein structure is much more strongly conserved than sequence. Increasingly, new structures deposited in the Protein Data Bank turn out to have folds that have been seen before, even though there is no obvious sequence relationship between the related structures. The goal is to identify these structural relationships in cases where the sequence signal is either weak or does not exist. Techniques for fold recognition include advanced sequence comparison, secondary structure prediction, tests of the compatibility of sequences with known three-dimensional folds ('threading'), and the use of expert human knowledge. Evaluation of the quality of the models produced has common components with comparative modeling, specifically alignment accuracy, and with new fold methods, specifically recognizing correct architecture, even in cases where the topology is incorrect. Targets in this category are subdivided into domains that are considered to have diverged from a common ancestor of known structure (homologous folds, FR(H)) and domains that are considered more likely to resemble known structures as a result of convergent evolution (analogous folds, FR(A)). There are two main questions to be asked: how successful are the different methods at identifying fold relationships, and, when successful, what is the quality of the models produced? Assessment focuses on the success of the fold assignments and the quality of the sequence-structure alignments, and, for more successful models, extends to the criteria considered in comparative modeling.
|
|
Prediction of new folds. In early CASPs, targets with no relationship to an already known complete structure were described as 'ab initio'. This name implies that there is no reliance on known structures in building models. In practice, most of the methods used for such targets do make extensive use of available structural information, both in devising scoring functions to distinguish between correct and incorrect predictions, and in choosing fragments to incorporate in the model. For this reason, the category was renamed 'new folds', starting in CASP4. Methods include the well-established secondary structure prediction tools, sequence-based identification of sets of possible conformations for short fragments of chain, methods that assemble three-dimensional folds from candidate fragments, prediction of which residues are in contact in the structure, 'mini-threading' methods that identify super-secondary structure motifs, and full domain fold recognition methods that may establish an approximate or partial topology. These approaches are sometimes combined with numerical simulation techniques and empirical potentials. Important evaluation criteria in the new fold category are the fraction of the structure predicted below a specified error level, and recognition of success in identifying general architecture.
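The criterion "fraction of the structure predicted below a specified error level" can be made concrete with a small sketch. The code below assumes the model and experimental C-alpha coordinates are already paired residue by residue and already superimposed; finding the best superposition is a separate problem and is not shown. The function name `fraction_within_cutoff`, the coordinates, and the cutoff values are illustrative assumptions, not the official CASP measure.

```python
import math


def fraction_within_cutoff(model_ca, target_ca, cutoff):
    """Fraction of paired C-alpha atoms whose distance is at most `cutoff` (angstroms).

    Assumes both coordinate lists are in the same residue order and that the
    model has already been superimposed on the experimental structure.
    """
    if len(model_ca) != len(target_ca):
        raise ValueError("coordinate lists must have the same length")
    if not target_ca:
        return 0.0
    within = sum(
        1 for m, t in zip(model_ca, target_ca) if math.dist(m, t) <= cutoff
    )
    return within / len(target_ca)


# Made-up coordinates: the four model atoms deviate by 0.5, 1.5, 3.0 and 6.0 angstroms.
target_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model_ca = [(0.0, 0.0, 0.5), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 6.0, 0.0)]

# Report the fraction at several illustrative error levels.
fractions = {c: fraction_within_cutoff(model_ca, target_ca, c) for c in (1.0, 2.0, 4.0, 8.0)}
```

Reporting the fraction at several thresholds, rather than a single overall deviation, gives credit to a model that captures part of the structure well even when other regions are badly wrong.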
|
|
Other Modeling Categories. New structure-related modeling challenges have emerged in recent years. It has become clear that many regions of proteins do not adopt a single three-dimensional structure. Many of the known disordered regions are involved in signaling, regulation, or control. Performance in predicting such disordered regions has been assessed since CASP5. Many proteins contain multiple domains, and identifying domain boundaries is important not only in modeling, but also in successfully over-expressing proteins for structural or functional studies. Assessment in this area started in CASP6. One of the primary uses of a three-dimensional model is to deduce more about the protein's function, and assessment of this also began in CASP6. Since CASP5, secondary structure prediction has not been included. Progress in this area is too slight to detect with the amount of data available, and larger-scale benchmarking by EVA has covered this area.
|