Comparative modeling for protein structure prediction
Krzysztof Ginalski

With the progression of structural genomics projects, comparative modeling remains an increasingly important method of choice. It helps to bridge the gap between the available sequence and structure information by providing reliable and accurate protein models. Comparative modeling based on more than 30% sequence identity is now approaching its natural template-based limits and further improvements require the development of effective refinement techniques capable of driving models toward native structure. For difficult targets, for which the most significant progress in recent years has been observed, optimal template selection and alignment accuracy are still the major problems.

Addresses
Centre for Mathematical and Computational Modelling, Warsaw University, Pawińskiego 5a, 02-106 Warsaw, Poland

Corresponding author: Ginalski, Krzysztof (kginal@icm.edu.pl)

Current Opinion in Structural Biology 2006, 16:172–177

This review comes from a themed issue on Theory and simulation
Edited by Joel Janin and Michael Levitt

Available online 28th February 2006

0959-440X/$ – see front matter
© 2005 Elsevier Ltd. All rights reserved.

DOI 10.1016/j.sbi.2006.02.003

Introduction
Knowledge of three-dimensional protein structure is crucial to answering many biological questions; however, the rapidly growing number of sequenced genes and genomes is heavily outpacing the number of experimentally determined structures. Despite considerable progress in de novo structure prediction [1], comparative modeling methods, when applicable, provide the most reliable and accurate protein structure models [2]. Comparative modeling is based on the general observation that evolutionarily related sequences have similar three-dimensional structures [3]. As a consequence, a three-dimensional model of a protein of interest (target) can be built from related protein(s) of known structure [template(s)] that share statistically significant sequence similarity. The traditional comparative modeling procedure consists of several consecutive steps usually repeated iteratively until a satisfactory model is obtained [4]: finding suitable template protein(s) related to the target; aligning target and template(s) sequences; identifying structurally conserved regions; predicting structurally variable regions, including insertions and missing N and C termini; modeling sidechains; and refining and evaluating the resulting model. Although each step can introduce errors that affect the modeled structure, optimal use of structural information from available templates and correctness of sequence-to-structure alignment are the most significant determinants of final model quality.

Traditionally, comparative modeling refers to cases in which related proteins of known structure can be found with PSI-BLAST [5]. The recent introduction of more sophisticated methods (reviewed in [6]) that derive their power from profile-profile comparison [7,8,9] and the effective use of structural information [10,11] has significantly increased not only the resulting alignment quality but also the remote homologue detection capability. Consequently, the boundary in template-based modeling between comparative modeling and fold recognition is now quite blurred. Increased interest in developing new comparative modeling and fold recognition algorithms has led to a variety of prediction services available on the Internet [6], including structure prediction meta-servers [12]. The latter are of enormous importance to biologists and modelers because they provide convenient access to the results of various independent two- and three-dimensional structure prediction methods. Also, they are frequently used as starting points in sequence analysis and three-dimensional model building. This review summarizes recent progress, and discusses the current roles, limitations and challenges of comparative modeling.

Objective evaluation of methods: current state of the art in comparative modeling
The launch of the biennial CASP (Critical Assessment of Techniques for Protein Structure Prediction) experiment [13,14,15], established to detect the capabilities and limitations of current modeling methods, to determine the progress made and to highlight specific bottlenecks, represented a crucial milestone in the protein structure prediction field. Results from the latest CASP experiments show that, in the comparative modeling category [16], the most successful approaches use consensus strategies [17,18] to build final models based on multiple templates or protein fragment recombination [19]. Consensus results from various fold recognition methods or multiple sequence searches [20] are frequently used for template selection and detection of reliable alignment regions, whereas alternative alignment variants are evaluated at the tertiary structure level using quality assessment methods [21,22] and/or visual inspection. Detailed sequence analysis of the target and template families, investigation of characteristic features of the fold and extensive literature searches for any available biochemical information (mutations, catalytic residues, etc.) are usually mandatory, as even tiny details can serve as alignment anchors and lead to the successful identification of the correct sequence-structure mapping in questionable regions. Division of the target sequence into single domains, removal of long insertions to the core of the fold and iterative submission to prediction servers are also strongly advised.
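The use of consensus among independent alignments to detect reliable alignment regions, as described above, can be illustrated with a minimal sketch. It is not the procedure of any of the cited groups: the function name `consensus_positions`, the input format (one dictionary per method, mapping a target residue index to a template residue index) and the agreement threshold `min_support` are all assumptions introduced here for illustration.

```python
from collections import Counter


def consensus_positions(alignments, min_support=0.75):
    """Return {target_pos: template_pos} for the target positions on which
    at least a fraction `min_support` of the input alignments agree.

    `alignments` is a list of dicts (hypothetical format), each mapping a
    target residue index to the template residue index it is aligned to.
    Positions missing from an alignment count as a vote for nothing.
    """
    n = len(alignments)
    if n == 0:
        return {}

    # Collect every target position aligned by at least one method.
    all_positions = set()
    for aln in alignments:
        all_positions.update(aln)

    reliable = {}
    for pos in sorted(all_positions):
        # Count how many alignments map this target position to each
        # template position, then keep the most frequent mapping.
        votes = Counter(aln[pos] for aln in alignments if pos in aln)
        template_pos, count = votes.most_common(1)[0]
        if count / n >= min_support:
            reliable[pos] = template_pos
    return reliable


# Three made-up alignments that agree on positions 1-2, mostly agree on
# position 3 and disagree on position 4.
aln_a = {1: 10, 2: 11, 3: 12, 4: 13}
aln_b = {1: 10, 2: 11, 3: 13, 4: 14}
aln_c = {1: 10, 2: 11, 3: 12}

reliable = consensus_positions([aln_a, aln_b, aln_c], min_support=0.66)
print(reliable)  # {1: 10, 2: 11, 3: 12}
```

Positions that fall outside the returned mapping would, in the spirit of the text, be the questionable regions where alternative alignment variants need to be judged at the level of the three-dimensional model or with additional biochemical evidence.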
Finally, model building for close homologues of the target may enable the detection of significant alignment errors, which manifest themselves in three-dimensional models only for some family members [17].

Modeling based on multiple templates is often advantageous, not least because it increases the chance that the optimal template is among those used [18]. However, it is not easy to benefit from the large number of available templates, especially when their local structures differ significantly. Although existing methods can provide reasonably accurate predictions for short loops, the modeling of longer regions not present in available templates remains a challenge and is frequently performed using de novo methods [23], with anecdotal examples of relative success [14,16]. Importantly, the quality of a modeled structurally variable region is greatly affected by its length, correctness of the alignment and accuracy of predicted neighboring regions [23]. Our ability to correctly predict sidechain conformations, which are backbone conformation dependent, is, not surprisingly, rather limited [16]. Incorrect sidechain rotamers are mainly caused by misaligned residues and/or backbone shifts, which must be either accurately modeled initially or refined simultaneously to improve sidechain predictions.

Human expertise appears to be very valuable for modeling difficult targets (template detected by PSI-BLAST) and critical in cases of unexpected evolutionary changes in protein structure (Figure 1). In contrast, for easy comparative modeling targets (related structure detected by simple BLAST), human improvements are often marginal, if not detrimental, as the performance of automatic methods on these targets has increased substantially [14,16].

Figure 1. This example of difficult comparative modeling based on a distantly related template illustrates the important role of human input in cases of unexpected evolutionary changes in protein structure. (a) Experimental structure of CASP6 target T0223, a putative nitroreductase from Thermotoga maritima (PDB code 1vkw, green), and the best model (T0223TS450_1, blue). (b) The available template, flavin reductase P from Vibrio harveyi (PDB code 1bkj, monomers in grey and orange), shares 18% sequence identity with the target. T0223 is a monomeric pseudo-dimer containing two duplicated reductase domains arranged exactly as within the dimeric template. Correct modeling of the complete protein chain required the use of a dimeric template instead of a monomer.

Evaluation of automatic structure prediction methods is conducted by the CAFASP (Critical Assessment of Fully Automated Structure Prediction) experiment [24], which runs in parallel with CASP on the same target set. A more continuous assessment of servers is provided by LiveBench [25] and EVA [26], which operate on a relatively large number of prediction targets compiled every week from newly released PDB [27] structures. However, LiveBench excludes easy comparative modeling targets and, more importantly, is not a blind prediction test. As clearly demonstrated by the CAFASP and LiveBench experiments, significant progress in template-based automated protein structure prediction has been achieved through the development of meta-servers [24,25], which detect common structural motifs (consensus) in the set of three-dimensional models generated by various independent structure prediction services. Meta-servers either generate a new overall ranking and select the potentially best model [28] or perform additional modifications (e.g. construct a hybrid from fragments of the original models) [29]. A well-designed meta-predictor should perform at least as well as the best of its input components; meta-servers do outperform individual servers and are already challenging most human expert predictions [25]. Nevertheless, the performance of several newly developed autonomous servers appears to be amongst the best in comparative modeling, suggesting that further improvements of the individual methods have recently been obtained [16,25]. These new autonomous methods base their strength on the comparison of sequence profiles combined with predicted secondary structure [30] or, in addition, structure-based profile energy scoring [31].

Quality and usefulness of comparative models
Comparative models may be used to identify critical residues involved in catalysis (or migration of catalytic residues), binding or structural stability, to examine protein–protein or protein–ligand interactions (including drug design), to correlate genotypic and phenotypic mutation data, and to guide experimental design. The usefulness of comparative models for a specific application depends on their quality, which tends to decrease as evolutionary distance between target and template increases [4,14,16].

Two important factors influence the ability to predict accurate models: the extent of structural conservation between target and template, and the correctness of alignment [4,14]. Models based on templates with more than 50% sequence identity are generally very accurate and can exhibit about 1 Å Cα atom rmsd from the experimental structure. Proteins with 30–50% sequence identity share at least 80% of their structures; the best CASP models within this range usually do not exceed 4 Å rmsd (typically 2–3 Å) from the native structure, with errors located mainly in loop regions. Structural conservation can be as low as 55% for proteins that display 20–30% sequence identity or even lower when sequence identity drops below 20%. Whereas alignments are most often near optimal for targets with more than 30% sequence identity to template structures (easy targets), below this threshold (mainly difficult targets), alignment quality sharply decreases and even as many as half of all residues may be misaligned when sequence identity is less than 20% [14].

Sequence identity between target and template is not, however, an effective parameter to estimate the difficulty or the quality of a comparative model [14,16]. A much better estimator of the expected model quality is the distribution of sequence identity in multiple sequence alignment, encompassing target, template and intermediate homologous sequences [32]. Given the unprecedented growth of both structural and sequence databases, improvements in the quality of comparative models seem to be largely due to the increased availability of sequences and structures homologous to the protein of interest [32].

Importantly, functional regions are not modeled with any greater accuracy than the rest of the protein [16], unless they are structurally better conserved than other regions, as is typically the case when the template structure has the same function and specificity. However, when function or specificity differs, larger changes are usually expected in functional regions [15]. Accurate modeling of the differences between similar structures (insertions and backbone shifts) is one of the most biologically relevant applications of comparative modeling, because these structural changes usually add novel functions and/or specificities. Nevertheless, in many cases, relatively good insight into the active site architecture and ligand binding can be obtained from comparative models, providing there are no alignment errors [33].

Recent systematic studies have proved that even simple comparative models carry additional information (added value) with respect to the template structure [34,35]. Although the average accuracy of structure-derived properties (residue exposure state, residue neighborhood, accessible surface area, electrostatic potential, etc.) decreases with lower target-template sequence identity, in general their added value (especially for sequence-dependent properties) increases, making models relatively more informative in spite of their lower accuracy [34]. In addition, depending on the property, the accuracy of comparative models based on templates with 25–40% sequence identity reaches the same value as differences observed between NMR and X-ray structures [35].

Using comparative models instead of template structures has also been shown to be invaluable in molecular replacement, whereby screening with a model or a diverse set of models can frequently be successful in cases in which the structural template used to build them failed [36]. Although target-template sequence identity is not a good diagnostic for the success of this procedure, models based on 30% sequence identity (and even sometimes less) seem to be sufficiently accurate for molecular replacement [36].

Limitations and current challenges
Despite steady but modest progress in difficult comparative modeling based on a distant evolutionary relationship (template detected by PSI-BLAST), there is still room for further improvement in both optimal template selection and the quality of sequence-to-structure alignments that are particularly error prone in cases of low sequence similarity [14,15]. Least progress has been made in comparative modeling from relatively high sequence identity templates, as measured in the last CASP experiment [14,16]. In general, predictions are not closer to the experimental structure than the structure of the closest template [16]; however, for easier targets the best models are now frequently as good as optimal multi-templates, suggesting that the template-based methodology is approaching its natural limits for easy comparative modeling [37]. Further progress in this area requires the development of appropriate refinement techniques and potentials that are capable of making adjustments on an atomic scale [14–16]. Typically, attempting the refinement of coordinates derived from template structures leads to the deterioration of, rather than improvement in, model quality. Nevertheless, there are some encouraging signs in the field of final model improvement with respect to the initial template alignment [38]. High-resolution refinement of comparative models remains a formidable challenge, because of inaccuracies in current force-fields and difficulties in sampling huge numbers of alternatively packed conformations [39]. As recently shown, some of these problems can be partially overcome by combining free energy optimizations with sampling along evolutionarily favored directions defined using principal components of the backbone structure variation within a homologous family [40].

More effective energy-based methods, coupled with relaxation techniques, should improve both best template detection (or optimal local template fragment combination) and the selection of correct alignments for regions where the proper evaluation of alignment variants in the context of three-dimensional structure is not possible without defrosting (refining) the inherited template backbone [18]. The development of such methods may provide a stepping-stone toward new comparative strategies that try to optimize all the modeling steps (template selection, alignment, modeling structurally variable regions and sidechain packing) in a more simultaneous way. For instance, one could apply a refinement protocol to generate a diverse set of models based on different templates and alignment variants, and use the energy function to discriminate near-native from erroneous models.

Additional challenges to the comparative modeling field, which relies heavily on the assumption 'similar sequences – similar structures', are brought about by the existence of evolutionarily related proteins that possess globally distinct structures (Figure 2). The major mechanisms by which proteins change their fold include insertions, deletions and substitutions of structural elements, circular permutations, rearrangements of β-sheet topologies (strand invasions and withdrawals, β-hairpin flips and swaps) and fusion of duplicated domains [41,42]. The possibility of such structural changes in evolution has important implications for protein design, but notably impedes comparative modeling methods; the ability to detect such cases from sequence is crucial.

Figure 2. Same sequence adopts different structures: a highly challenging case for comparative modeling. (a) Experimental structure of CASP6 target T0240, a 92-residue C-terminal fragment of TonB from Escherichia coli (TonB-92; PDB code 1u07, monomers in green and olive), and the best model (T0240TS450_1, blue). (b) The available template, an 85-residue C-terminal fragment of TonB (TonB-85; PDB code 1ihr, monomers in grey and orange). TonB-85 forms a swapped dimer through the exchange of a β-hairpin and a C-terminal β-strand, whereas TonB-92 dimerizes with considerably different structure without undergoing β-hairpin exchange. Although critical changes in the structure were modeled correctly, the model predicts TonB-92 to be a monomer (as indeed it is expected to be in solution), with no swapping of C-terminal β-strands seen in the experimental crystal structure.

Conclusions
The number of unique folds in nature is expected to be limited [43] and the principal goal of structural genomics initiatives is to provide template structures for most protein families [44,45]. Structure prediction approaches are thus destined to become largely limited to comparative modeling, as evolutionarily related structures will be available for the majority of naturally occurring proteins in the foreseeable future. As most modeling cases fall in the 20–30% sequence identity range [46], where the majority of new information is generated [34], further progress is necessary to overcome the current major bottlenecks in comparative modeling: improving alignments and refining models. The development of effective all-atom structure refinement procedures should tackle this problem, allowing the generation of high-resolution models that can reproduce all key functional features.

Acknowledgements
I would like to thank Lisa Kinch for critical reading of the manuscript.

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Bradley P, Misura KM, Baker D: Toward high-resolution de novo structure prediction for small proteins. Science 2005, 309:1868-1871.
A review of recent progress in de novo protein structure prediction.

2. Baker D, Sali A: Protein structure prediction and structural genomics. Science 2001, 294:93-96.

3. Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. EMBO J 1986, 5:823-826.

4. Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000, 29:291-325.

5. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.

6. Ginalski K, Grishin NV, Godzik A, Rychlewski L: Practical lessons from protein structure prediction. Nucleic Acids Res 2005, 33:1874-1891.
This comprehensive review outlines currently available practical approaches to protein structure prediction, including recent advances in model quality assessment.

7. Ohlson T, Wallner B, Elofsson A: Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins 2004, 57:188-197.
An evaluation of different profile-profile alignment methods.

8. Wang G, Dunbrack RL Jr: Scoring profile-to-profile sequence alignments. Protein Sci 2004, 13:1612-1626.

9. Sadreyev RI, Grishin NV: Quality of alignment comparison by COMPASS improves with inclusion of diverse confident homologs. Bioinformatics 2004, 20:818-828.

10. Wrabl JO, Grishin NV: Gaps in structurally similar proteins: towards improvement of multiple sequence alignment. Proteins 2004, 54:71-87.

11. Przybylski D, Rost B: Improving fold recognition without folds. J Mol Biol 2004, 341:255-269.

12. Bujnicki JM, Elofsson A, Fischer D, Rychlewski L: Structure prediction meta server. Bioinformatics 2001, 17:750-751.

13. Moult J, Fidelis K, Tramontano A, Rost B, Hubbard T: Critical assessment of methods of protein structure prediction (CASP) – round VI. Proteins 2005.

14. Kryshtafovych A, Venclovas C, Fidelis K, Moult J: Progress over the first decade of CASP experiments. Proteins 2005, 61(suppl 7):225-236.
A description of the progress made in protein structure prediction during the course of the CASP experiments.

15. Moult J: A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 2005, 15:285-289.
This paper reviews the state of the art in protein structure prediction in the context of a decade of CASP experiments.

16. Tress M, Ezkurdia I, Grana O, Lopez G, Valencia A: Assessment of predictions submitted for the CASP6 comparative modelling category. Proteins 2005, 61(suppl 7):27-45.
An assessment of the state of the art in comparative modeling from CASP6.

17. Ginalski K, Rychlewski L: Protein structure prediction of CASP5 comparative modeling and fold recognition targets using consensus alignment approach and 3D assessment. Proteins 2003, 53(suppl 6):410-417.

18. Venclovas C, Margelevicius M: Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment and structure assessment. Proteins 2005, 61(suppl 7):99-105.
A report from one of the best performing groups in the comparative modeling category of CASP6.

19. Kolinski A, Bujnicki JM: Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins 2005, 61(suppl 7):84-90.

20. Margelevicius M, Venclovas C: PSI-BLAST-ISS: an intermediate sequence search tool for estimation of the position-specific alignment reliability. BMC Bioinformatics 2005, 6:185.

21. Luthy R, Bowie JU, Eisenberg D: Assessment of protein models with three-dimensional profiles. Nature 1992, 356:83-85.

22. Sippl MJ: Recognition of errors in three-dimensional structures of proteins. Proteins 1993, 17:355-362.

23. Rohl CA, Strauss CE, Chivian D, Baker D: Modeling structurally variable regions in homologous proteins with Rosetta. Proteins 2004, 55:656-677.
A de novo method for modeling structurally variable regions in comparative models based on the Rosetta structure prediction algorithm is described and evaluated.

24. Fischer D, Rychlewski L, Dunbrack RL Jr, Ortiz AR, Elofsson A: CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins 2003, 53(suppl 6):503-516.

25. Rychlewski L, Fischer D: LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci 2005, 14:240-245.
A report on the performance of protein structure prediction servers in the LiveBench-8 experiment.

26. Koh IY, Eyrich VA, Marti-Renom MA, Przybylski D, Madhusudhan MS, Eswar N, Grana O, Pazos F, Valencia A, Sali A et al.: EVA: evaluation of protein structure prediction servers. Nucleic Acids Res 2003, 31:3311-3315.

27. Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, Zhang Q, Knezevich C, Xie L, Chen L, Feng Z et al.: The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res 2005, 33:D233-D237.

28. Ginalski K, Elofsson A, Fischer D, Rychlewski L: 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 2003, 19:1015-1018.

29. Fischer D: 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins 2003, 51:434-441.

30. Ginalski K, von Grotthuss M, Grishin NV, Rychlewski L: Detecting distant homology with Meta-BASIC. Nucleic Acids Res 2004, 32:W576-W581.

31. Zhou H, Zhou Y: SPARKS 2 and SP(3) servers in CASP 6. Proteins 2005.

32. Cozzetto D, Tramontano A: Relationship between multiple sequence alignments and quality of protein comparative models. Proteins 2005, 58:151-157.
The distribution of sequence identity in multiple sequence alignments is demonstrated to be a good estimator of the quality of comparative models.

33. DeWeese-Scott C, Moult J: Molecular modeling of protein function regions. Proteins 2004, 55:942-961.
The authors explore the usefulness of comparative models in deducing details of molecular function. They demonstrate that, in general, good insight into ligand binding can be obtained, providing there are no alignment errors.

34. Chakravarty S, Sanchez R: Systematic analysis of added-value in simple comparative models of protein structure. Structure 2004, 12:1461-1470.
This study justifies the use of comparative models instead of templates to estimate structure-derived properties of proteins, showing that, in general, their added value increases with lower target-template sequence identity.

35. Chakravarty S, Wang L, Sanchez R: Accuracy of structure-derived properties in simple comparative models of protein structures. Nucleic Acids Res 2005, 33:244-259.
In an extension of their previous work [34], the authors show that the average accuracy of structure-derived properties of comparative models increases with higher target-template sequence identity. They also reveal that, for most properties, the differences observed between NMR and X-ray structures are similar to the errors in models based on templates with 40% sequence identity.

36. Giorgetti A, Raimondo D, Miele AE, Tramontano A: Evaluating the usefulness of protein structure models for molecular replacement. Bioinformatics 2005, 21:ii72-ii76.
This study reveals that there is a clear relationship between the quality of comparative models and their suitability for molecular replacement. It also shows that target-template sequence identity is not a good diagnostic for the success of the procedure.

37. Contreras-Moreira B, Ezkurdia I, Tress ML, Valencia A: Empirical limits for template-based protein structure prediction: the CASP5 example. FEBS Lett 2005, 579:1203-1207.
An analysis of the empirical limits of template-based modeling of protein structure suggests that the methodology is approaching its limits for easy comparative modeling and that additional improvements in quality require information not available from template structures.

38. Zhang Y, Skolnick J: Automated structure prediction of weakly homologous proteins on a genomic scale. Proc Natl Acad Sci USA 2004, 101:7594-7599.

39. Misura KM, Baker D: Progress and challenges in high-resolution refinement of protein structure models. Proteins 2005, 59:15-29.

40. Qian B, Ortiz AR, Baker D: Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc Natl Acad Sci USA 2004, 101:15346-15351.
The authors present a novel approach to refining comparative models by free energy optimization along evolutionarily favored sampling directions. They show that improvement in model quality can be obtained.

41. Grishin NV: Fold change in evolution of protein structures. J Struct Biol 2001, 134:167-185.

42. Kinch LN, Grishin NV: Evolution of protein structures and functions. Curr Opin Struct Biol 2002, 12:400-408.

43. Grant A, Lee D, Orengo C: Progress towards mapping the universe of protein folds. Genome Biol 2004, 5:107.

44. Yan Y, Moult J: Protein family clustering for structural genomics. J Mol Biol 2005, 353:744-759.

45. Liu J, Hegyi H, Acton TB, Montelione GT, Rost B: Automatic target selection for structural genomics on eukaryotes. Proteins 2004, 56:188-200.

46. Sanchez R, Pieper U, Melo F, Eswar N, Marti-Renom MA, Madhusudhan MS, Mirkovic N, Sali A: Protein structure modeling for structural genomics. Nat Struct Biol 2000, 7(suppl):986-990.