5
Methodology

This chapter describes the proper order and combination of commands that you use for a typical project using the Homology module. To summarize, the process consists of the following steps:
1. Determine which proteins are related to the model protein.
2. Determine structurally conserved regions (SCRs).
3. Align sequences.
4. Assign coordinates within the SCRs.
5. Build loop or variable regions (VRs).
6. Explore the possible range of conformations of side chains for
particular residues.
7. Refine the model structure with Discover.
The chapter includes commands from both Homology and Insight II. For more information on these commands, refer to Chapter 4, Command summary, and the Insight II Reference Guide.

Step 1: Determine Which Proteins Are Related to the Model Protein
Many of the known proteins fall into families. This happens through evolution, and proteins from more closely related species resemble one another more than those from distantly related species. The resemblance is in terms of function, amino acid sequence, and three-dimensional structure. It can be generally assumed that if two proteins have similar sequences, then their three-dimensional structures will be similar.
Sequence Database Searching
The first step in predicting a protein structure from its amino acid sequence is to identify the family to which the sequence belongs. This is most easily done by searching a database of amino acid sequences. The sequence database files must be installed so that they can be accessed by the sequence database searching software. This procedure is explained in detail in the MSI System Guide.
The search can be restricted to include only the sequences of those proteins for which complete tertiary structures are known. This is a good way to find reference structures. To do this, you must first run the program seq_extract to create a database file that contains only sequences for which the structure is known (see Appendix D, seq_extract Utility).
MSI provides an alternative method for finding reference proteins in the Profiles_3D product. The Find Structures command in Profiles_3D directly tests the compatibility of a sequence with 3D folds, rather than with the sequences associated with those 3D folds. This is a highly sensitive method that may succeed in cases where a conventional sequence similarity search fails. This approach is described in the Profiles_3D User Guide. The remainder of this section concerns only conventional sequence similarity searches using the FASTA algorithm in the Homology program.
If there is only one known reference protein for your sequence, a search of the complete database of sequences can be especially valuable. A multiple alignment of related sequences often reveals those places in the sequences that are most highly conserved (by sequence similarity criteria). These regions are likely to be structurally conserved and can therefore reveal the best places in the reference protein from which to copy coordinates as you build the conserved core structure of your model.
The commands in the Databases pulldown provide a means to search sequence databases for proteins that are similar to a given sequence. The database(s) are not distributed by MSI but can be obtained from other sources (see Appendix G, Sequence Databases), and should be stored as one or more disk files accessible from your workstation.
The search is accomplished by running the FASTA program (Lipman and Pearson 1985, Pearson and Lipman 1988, Pearson 1990) as a background job (see the Background_Job Pulldown). Two of the commands in the pulldown, Input Databases and Output Databases, specify the various options to the FASTA program and the Run_FASTA Databases command submits the background job. The commands in the Background_Job pulldown control the execution of the process, specify the host machine, and query the job status.
All the parameter values for these commands are saved after a job is submitted. Therefore, multiple jobs can easily be run using the same parameters or with one or a few parameters modified before resubmitting the job. But once a parameter is set by selecting Execute, there is no way to go back to the original default value. Insight II must be restarted to accomplish that.
The Input Databases command specifies parameters concerning the execution of the FASTA search program. The command is divided into two parts. The set of parameters that are active is determined by the value of the End_Definition boolean parameter. When the command is first selected (with End_Definition off), parameters relating to the specification of sequences and databases are active. When End_Definition is toggled on, parameters having to do with the calculation of match scores become active.
With the End_Definition parameter off, the Probe Sequence parameter specifies which given sequence is to be matched to the database. It must be a part of the sequence display at the bottom of the screen, but it can be either a reference protein or a sequence with no defined coordinates. The Extract Sequences command can be used to display the sequence of the former type and Get Sequences for the latter type. The Activation parameter controls the creation and modification of the list of Database files to search. Thus, you can Add a Database file, Delete a specific one, Clear all the entries, or List the Database files to be searched. The choice of Database files is controlled by the Database Type parameter. There is a separate list of Protein and DNA databases, and the types cannot be mixed within a list. If you wish to change from one type to another, the current list must be cleared first. If the Database Type is Protein, you can specify the Scoring Matrix that is used to calculate the init1 score (see Chapter 2, Theory). This matrix gives high scores for residue identities and conservative substitutions and low scores for rare substitutions. See Scoring Matrices for a discussion of them. If the Database Type is DNA, then you can choose whether to consider matches in one direction (Downstream) or in both directions. Finally, the value of the Ktup parameter can be set. It specifies the number of residues to consider as a group during the initial scanning of the database. The higher the value, the faster the scan, but the less sensitive the search.
When the End_Definition parameter of the Input Databases command is toggled on, three other parameters become active. The Optimize parameter specifies whether initially matched regions are to be rescored using the Needleman and Wunsch method (see Chapter 2, Theory). If so, then the Optimize Threshold value specifies the scores above which optimization is done. When high-scoring single regions are found within a single sequence, they can be joined together to form a longer matched region containing insertions and deletions. Segments scoring higher than the Joining Threshold will be joined in this way. As a general rule, if optimization is done, the two threshold values are kept the same value.
The FASTA program creates several output files. One contains information concerning the run itself; another contains the results of the database search. The format of the latter file is specified with the parameters of the Output Databases command and its name is specified by the File Name parameter. If the Input Databases command has been previously executed, and there is a current choice for Probe Sequence, a default filename is constructed by adding the extension .fasta to the end of the sequence name. A value-aid is also provided that lists files that have already been written. If you choose a File Name that already exists, the file is overwritten upon execution.
The Alignment Style parameter specifies whether only the regions of the sequences found to be overlapping are written to the output file as an alignment, or whether the whole sequences is, with the similar regions demarcated with "X"s.
The Marking Style parameter controls the way the residue similarities and differences between the Probe Sequence and the matched sequences are indicated. Values for the Marking Style parameter are shown below:

If Ident_Conservative is chosen for the Marking Style, matches between identical residues are shown with a colon (":") and conservative substitutions are shown with a period ("."). If Conservtiv_and_Non is chosen, conservative substitutions are denoted by an "X" and nonconservative substitutions by an "x". If Replacements_Only is chosen, replacements are shown with the appropriate amino acid code and identities are indicated with a period ("."). If LibSequences_Only is chosen, only the aligned library sequences will be shown without the query sequence. This can be used to build a primitive multiple alignment.
The number of Scores to Show can also be controlled as well as the number of Alignments to Show. The latter parameter also specifies the number of sequence files that are written upon completion. For each sequence found and alignment shown, a sequence file is written with a format compatible with the Get Sequences command. The name is derived from the database entry code (accession code) of the sequence with the extension .fasta.seq. Finally, if optimization was done in the search, you may sort the sequences found based on the optimized score rather than the initn score.
Most of the parameters for the Input Databases and Output Databases commands have valid default values that can be used. The exceptions are the Probe Sequence, Database Type, and Database parameters for the Input Databases command and the File Name parameter for the Output Databases command.
After setting up both the input and output parameters, the Run_FASTA command is used to submit the background job. Any arbitrary string can be entered for the FASTA Run Name. This identifies the job for the system. The system then supplies a job number that is used when monitoring the job. The background job creates a file to log any job execution information or errors. It is named bkgd_job_<jobname><n>.csh_log, where <n> is the job number.
When the background job completes, you can examine the results in the output file named by the Output Databases command. Any desired sequence can be read into Homology with the Get Sequences command. The alignments seen in the output file can be reproduced using the scrolling buttons.
Motif searching
The structural similarity of different proteins may be due to a shared ancestor. On the other hand, it may simply be the result of the fact that the number of possible protein motifs is much smaller than the number of proteins. Even though new structures are reported almost every day, most of those structures are somewhat similar to structures already present in the Protein Data Bank/Brookhaven (see Setting Up the Brookhaven PDB for information about configuring the Brookhaven PDB for use in Insight II). It is estimated that the number of all possible, distinct folds is in the order of 1000. Therefore, it is essential to have a fast method for scanning the existing structural databases to identify possible structural matches.
A Motif_Search scan is performed in the following way. The query protein is first annotated with a secondary structure calculated using an algorithm similar to the Kabsch and Sander algorithm. Turn predictions are ignored and alpha-helices and beta-strands are smoothed to avoid sudden breaks in otherwise continuous secondary structure elements. Termini of secondary structure elements are then determined and the best line is fitted to the element using its inertia tensor. The Motif_Search structural database consists of secondary structure element vectors calculated in the same way for of all proteins in the current release of PDB.
For every protein from the database, Motif_Search looks for all possible matches for subsets of three secondary structure elements. Pairs of matching subsets are called triplets. Triplets have to be consistent -- i.e., the types of secondary structure elements and the direction (unless the Mix_Order parameter is specified) of the reference protein must match those of the query protein. Additionally, the spatial compactness of the triplet is constrained by the Compactness parameter, and the triplet RMS (the root mean square deviation of the hit from the query structure) must be lower than the value specified in the RMS_limit parameter.
Having identified all possible triplets that fulfill the above criteria, Motif_Search tries to merge triplets that have common secondary structure elements, provided that the merged substructure still satisfies the RMS deviation specified in the RMS_limit parameter. If the number of secondary structure elements in merged substructures is greater than the value of the Element_limit parameter, the substructure is marked as a hit and included in a hit list file.
When the Motif_Action parameter is set to Scan_Motif_DB, then the Motif_Search command spawns a background job that produces a human readable file (described in the file formats appendix). This file is read and analyzed by both the Load_Scan_Results and View_Scan_Results states. Additionally, loading structures of hits requires setting the INSIGHT_PDB environment variable to point to the top of the directory tree that contains pdb files.
A Motif_Search readable database of structures can be generated from the list of pdb files. This feature is not supported by the graphical interface, but an appropriate BCL macro does exist:
Motif_Generate Databases -Use_PDB_Definition pdb.flist motif_search.db
where Use_PDB_Definition has the same meaning as in the Motif_Search command. pdb.flist is a file containing a list of pdb files -- one per line with full path. motif_search.db is the name of the created database. Alternatively, the command line version motif_search can also be used, with a command line of the form:
motif_search -gen -list file_list -database file.db
where file_list is a list of PDB file names (full path, one per line) and file.db is the name of the database created.
The Motif_Search command
The Motif_Search command allows you to perform fast browsing of the provided comprehensive structural database in the search for structural motifs similar to whole or part of the query protein. A typical Motif_Search session consist of following steps:
Additionally, you can prepare your own version of the structural database and scan against this database. (For a full discussion of the parameters in Motif_Search, see the xhelp in the Insight II interface).
Scanning the database
In the Scan_Motif_DB state of the Motif_Action parameter, the Motif_Search parameter block allows you to prepare the proper query for the subsequent background job. Main parameters that control the behavior of the search are RMS_limit, Element_limit and Compactness.
RMS_limit specifies the maximum root mean square deviation of the matching substructures. For identification of closely related structures, it can be lowered to about 1 Angstrom, whereas, when distant structural relatives are sought, values in the range of 3 or 4 Angstroms are suggested. The Element_limit parameter specifies the minimum number of secondary structure elements for a substructure to be considered a match. For close matches, this number can approach the total number of secondary structure elements in the query structure. On the other hand, if distantly related structures significantly differ in the number of secondary structure elements, Element_limit should to be specified as a lower bound on the expected number of common subsets of secondary structure elements. Also, the search for a relatively small common motif may be performed by setting Element_limit to the element size of the motif. The Compactness parameter limits the size of the matching substructure and is used to restrict the existence of non-local substructures in large proteins. However, it also prevents matches from merging and therefore should be used cautiously.
The Default structure library provided by MSI consists of all structures from the January 1997 PDB release. It is possible, however, to use other databases by specifying a name in the DB_Name field (visible only when Use_Local_DB is toggled on. Instructions on how to prepare the database are located at the end of the methodology and implementation section.
By default, Motif_Search uses automatic secondary structure assignments. Additionally, secondary structure annotations from the PDB file can be used instead by specifying Use_PDB_Definition. This is not recommended, however, since the database has been prepared consistently with automatic annotation.
Loading and Viewing
matching structures
Loading results of the database scan can be accomplished by choosing the Load_Scan_Result value of the Motif_Action parameter. Note that you can load results from as many files as you wish. For each output file, a table containing matching structures will be produced. The table name consists of the protein name, capitalized background job name and string "MOTIFS" separated with an underscore (_), i.e table GOF_MYFIRST_MOTIFS correspond to the "myfirst" background job of the Motif_Search for galactose oxidase (GOF). The job name used as a prefix may be abbreviated if it is too long (>11 characters). The PDB code of the structure that contains a matching fragment is reported together with the number of matching elements, the root mean square deviation of the match from the query protein and the deviation, defined as a radius of gyration of the matching substructure.
There are two levels of browsing hits. In the first level, only the matching substructure can be shown, represented as a set of yellow arrows, superimposed onto a set of white arrows that represent all the secondary structure elements of the query protein. This allows you to explore the extent of the overlap and assess the quality of the match. Query and matching arrows are available as MS_TAR and MS_REF objects, respectively. Note that loading a new set of vectors will overwrite the MS_REF object.
The second level involves loading a whole structure that contains matching substructures and overlapping it with the query protein. The loaded structure is available as a protein called MOTIF_code, where code is the four character PDB code of the hit. In order to successfully perform this step, the environment variable INSIGHT_PDB has to be defined and must point to a directory tree which contains the pdb files. Often, the matching substructure is a small part of the whole structure. It is advantageous, therefore, to use Load_Vectors and Load_Structures simultaneously, which indicates which parts of the structures are considered a true hit and helps to assess how significant such a match is. Note that Load_Vectors and Load_Structures are not coupled and, in principle, can depict different matches.
Structural database customization
The currently provided database of structures is complete but redundant. Many structures are duplicated a few and in some cases hundreds of times, which can complicate the analysis. New databases, tailored to specific needs, can easily be prepared by using the Motif_Search database generation options from the UNIX command line as previously described.
Tutorials
To become acquainted with Motif_Search capabilities, run the Motif Database Searching tutorial from the Homology tutorials collection. In addition to basic operations, use of Motif_Search requires an understanding of the Insight II subset definition and subset operations. For more information, read the relevant chapter in the Insight II User Guide.

Step 2: Determining Structurally Conserved Regions (SCRs)
When two or more reference protein structures are available, it is advantageous to use the information they provide to establish structural guidelines for the family of proteins under consideration. Therefore, the first step in building a model protein by homology is determining what regions are structurally conserved or constant among all the reference proteins. It can then be assumed that the unknown protein has the same conformation in these conserved regions.
The proteins are read in with the Get command in the Molecule pulldown (top menu bar). Sequences for the reference proteins are then displayed by selecting the Extract command from the Sequences pulldown in the Homology module. This can take up to a minute for each protein. The sequences are shown in the sequence display at the bottom of the screen.
The Homology module provides two methods for finding SCRs, one that is fast and automatic, and another that requires an interactive, manual search. If the reference proteins are sufficiently similar in structure, then the automatic method is preferable. If, however, the structures are less well-conserved, the automatic method may not be able to find the regions of structural similarity. In such cases the manual method is preferable.
Automatic Determination of SCRs
The automatic method constructs a C
distance matrix for each molecule and compares small portions of the distance matrices at a time. RMS differences of the distance matrix elements are calculated to assess whether peptide segments are similar. Both local similarities and global orientations of the segments are examined. All overlapping segments are combined to form larger nonoverlapping SCRs. Segments of the sequences that all correspond to a given SCR form an m-block. An m-block is a group of m mutually related sequence segments, all the same length, that are related by their sequence similarity and/or structural similarity. (In this section the terms SCR and m-block are used interchangeably, but you should remember that the m-block is a more general concept). The segments in the structurally conserved m-blocks are aligned in the sequence window and enclosed by boxes. Also, the three-dimensional structures are superimposed so as to minimize the RMS deviations among corresponding atoms in the SCRs. For a complete description of the method used, see Automatic Determination of Structurally Conserved Regions.
To use this method, select the Structure command from the Alignment pulldown and choose Automatic as the Struct Align Mode.
Specifying the Initial Search Zone
The automatic Alignment/Structure command searches for SCRs in a "zone" of sequence segments. Normally this zone includes the full lengths of all the reference proteins. The Seq to Align 1 through Seq to Align 10 parameters specify the sequences to be searched for SCRs and aligned. By default they contain the names of all sequences present in the sequence display area for which there are corresponding three-dimensional structures. If there are fewer than ten sequences present, the unused Seq to Align parameters are set to None.
Each sequence name can be followed by an optional colon and residue number to specify the leftmost (closest to N-terminal) residue of the zone to be searched during the alignment. Any portions of the sequences outside this zone will not be aligned. If the residue number is omitted, then the zone begins at the N-terminus.
The length of the zone is specified by the Zone Length parameter. By default, it is set to the length of the longest sequence in the sequence display. Zone Length may be longer than all of the sequences, in which case the search zone extends to the C-termini of all sequences selected for alignment.
The Seq to Align parameters can be set by picking the sequences in the sequence display area, selecting the molecule names in the value-aid, or picking the molecules on the screen. If the Mol/Res Pick Level parameter is set to Residue, then individual residues can be picked to indicate the leftmost residue to be searched in the selected sequence. A sequence name can be deleted by selecting the parameter field, pressing the <Delete> key, and then pressing <Return>.
If you try to enter the same protein name in more than one Seq to Align parameter, then the redundant entries will just overwrite the first instance of that name. For example, if you enter FBJ:13 in Seq to Align 1 and then try to enter FBJ:22 in Seq to Align 2, FBJ:22 will appear in Seq to Align 1 and Seq to Align 2 will be left unchanged. This feature prevents duplication of sequence names and facilitates adjustment of the starting points of the search zone.
The Minimum Seq Per Blk parameter specifies the minimum number of sequence segments that an m-block must contain to be included in the alignment. It must be greater than or equal to 2, and \
91 less than or equal to the number of sequences selected for alignment. The default value is 2.
The search for SCRs initially covers the entire zone selected for alignment. As m-blocks are found, they subdivide the sequences into shorter zones that are in turn recursively searched for smaller m-blocks. The Minimum Zone Length parameter specifies the length, in residues, of the shortest zone that is considered worth searching. If any of the sequence segments between the m-blocks are shorter than Minimum Zone Length, then these will be excluded from the new search zones. Large values of Minimum Zone Length result in faster but less complete alignments. Minimum Zone Length must be greater than or equal to 1. The default value is 3.
The parameter Minimum Blk Length specifies the minimum length, in number of residues, that an m-block must have to be incorporated into the alignment. Any m-blocks shorter than Minimum Blk Length are automatically rejected.
Optimizing the Automatic Search for SCRs
Four numeric parameters control the structural similarity search: Probe Size, Contiguous Thresh, Scan Limit, and Orientation Thresh.
The Probe Size is the length of the portions of the distance matrices that are compared. If this parameter is small (three to five residues), then more matches will be found locally, but many of them will be screened out as spurious later in the algorithm when global orientations are considered. Values in the range 6 to 10 are recommended.
The Contiguous Thresh parameter indicates how closely matched two segments must be before they are accepted as similar. The lower the value, the more stringent the criterion, and the fewer the number of segments will be found, leading to shorter SCRs in the final analysis.
The Scan Limit indicates how far apart in sequence two segments can be before an attempt is be made to match them. The higher the value, the farther apart they can be. If it you expect that there is a large insertion in one of the proteins, then you should increase the value above the default of 20 residues.
The Orientation Thresh is a measure of how similar the relative orientations of two segments must be before they can be considered to be in the final set of SCRs. All pairs of matched segments are checked for self-consistency. The higher this value, the more segments are accepted, and the longer the final SCRs are as overlapping segments are combined.
The default values of these four parameters work well for proteins that are highly similar in their structurally conserved regions, such as the trypsin family of serine proteases or the C-type lysozymes used in the tutorials (Chapter 6, Tutorial). For less well-conserved proteins, however, some adjustment of the Contiguous Thresh and Orientation Thresh parameters may be necessary. In many cases, only Orientation Thresh need be increased. If no improvement is seen when Orientation Thresh is increased to about 3 or 4, then it may help to increase Contiguous Thresh as well. Normally Contiguous Thresh should be kept smaller than Orientation Thresh. As the thresholds are increased, the algorithm becomes less selective in its criteria for determining SCRs. Beyond a certain point, the thresholds may be so large that the SCRs found may be spurious. There is no perfect guideline for choosing the best values of these thresholds for all protein families, but, in general, if Contiguous Thresh and Orientation Thresh must be made larger than about 2 and 6, respectively, to find SCRs, then the SCRs are likely to be inaccurate. For structures this poorly conserved you should use either the manual method for finding SCRs or the Alignment/Multiple_Sequence command.
The four parameters just described control the search for structural similarity between two proteins. If more than two proteins are selected for alignment by the automatic Alignment/Structure command, then all possible pairs in the selected proteins are searched for structural similarity. As explained in Chapter 2, Theory, these pairwise SCRs (2-blocks) are merged into m-blocks of more than two sequences wherever a sufficient number of the overlapping 2-blocks are mutually consistent. When the parameter Strict_Overlap is on, this "sufficient number" is m (m-1) / 2, i.e., all possible pairs must be structurally similar to from a block of m sequence segments. When Strict_Overlap is off, a smaller number of overlapping 2-blocks is sufficient (see Theory for a more detailed explanation). When Strict_Overlap is on, m-blocks tend to be shorter, but their constituent segments are structurally more similar to one another. This often gives the best results for highly similar, closely related reference structures, such as the lysozymes in tutorial lesson 2 in Chapter 6, Tutorial. If the structures are less closely related, then the algorithm may be too conservative in setting the boundaries of the SCRs unless Strict_Overlap is turned off.
Subsets
For each sequence that is contained within one or more m-blocks, the Alignment/Structure command creates subsets that specify those regions of the sequence contained within the m-blocks. The subsets are named <protein_name>$AUTO$SCR<n> for each individual region, and <protein_name>$AUTO$SCR for the union of the regions for the given protein. These are ordinary subsets and can be used as input in other commands, such as the Molecule/Color and Molecule/Label commands.
The creation of subsets can be suppressed by setting the Create_Subsets parameter to off. This is reduces calculation time and is therefore useful when you are running the command many times with different combinations of parameters to optimize an alignment.
Automatic Superimposing of Structures
If two or more sequences of the reference proteins are contained within m-blocks, and if the parameter Superimpose_SCRs is on, then the Alignment/Structure command automatically superimposes their structures. It does this so as to minimize the RMS deviation over all corresponding alpha carbon atoms of residues within the m-blocks. This RMS deviation appears in the information area. If more than two structures are superimposed, and if the parameter Create_RMS_Table is on, then a table is also created that shows the RMS deviations between the corresponding alpha carbon atoms of each pair of superimposed proteins.
If the parameter Create_Assembly is on, then all superimposed structures are grouped into an assembly. An assembly is simply a grouping of molecules that allows them to be operated on as a whole. In particular, it allows them to be connected to the rotation and translation dials independently of other molecules not in the assembly. This is useful when only some of the structures on the screen are related by m-blocks and superimposed. The superimposed structures can be moved away from the unrelated structures and rotated independently of them. The assembly is named ASY_FAMILY, since it normally will be a grouping of proteins that belong to one family. Similarly, the RMS table described above is named TBL_RMS_FAMILY.
If the proteins selected for alignment are from two or more unrelated families, the Alignment/Structure command can align each family independently of the others. The grouping of the proteins into families will be evident from the contents of the m-blocks. In such cases the structures in each family will be independently superimposed, and a separate RMS table and assembly will be created for each family. The assembly and table names will be distinguished by 2-digit numbers appended to the names (e.g., ASY_FAMILY01, ASY_FAMILY02, etc.).
Characteristics of m-boxes
Each m-block found by the Alignment/Structure command is enclosed in a special type of box known as an m-box. The main difference between an m-box and a normal SCR box is that an m-box can contain more than two sequence segments, whereas a normal SCR box can only contain two sequence segments. Like normal SCR boxes, m-boxes can be frozen (not movable), and unfrozen (movable). They are magenta when frozen and blue when unfrozen. The Alignment/Structure command creates them in the frozen state if the parameter Freeze_New_Boxes is on; otherwise it creates them in the unfrozen state.
When the Alignment/Structure command finishes execution, it automatically switches the sequence window into the Contents coloring mode. In this mode the sequence segments that belong to m-blocks are colored, and all other residues are white. An m-block also has associated with it a p-value that represents the statistical significance of the sequence similarity of its constituent segments. The p-value can be displayed via color-coding by switching the sequence window coloring mode to p-value. This feature is explained in detail in the description of multiple sequence alignment (see Statistical Significance and Alternate Sequence Coloring).
Handling of Existing Boxes
The Delete_Boxes parameter determines how the Alignment/Structure command deals with existing boxes. The alignment cannot be done if any boxes exist within the zone selected for alignment, or if any boxes other than m-boxes exist outside the zone. If any such boxes exist and Delete_Boxes is off, then the program issues an error message and aborts the alignment. If Delete_Boxes is on, then the command automatically deletes any boxes that would otherwise prevent the alignment. Summary boxes are an exception: if present when Delete_Boxes is on, the program issues an error message and aborts the alignment without deleting the summary boxes (or any other boxes). The default value for Delete_Boxes is off.
Interrupting the Search
The Alignment/Structure command can be interrupted at any time by pressing the <Esc> key. A window appears that asks you to confirm or cancel your interrupt request. If you cancel the interrupt request, the calculation resumes. If you confirm the interrupt, any m-blocks that have been found up to that point will be incorporated into the alignment, and a message appears in the information area warning you that the alignment may be incomplete.
Manual Determination of SCRs
Manual determination of SCRs is done by interactively measuring the structural similarity between pairs of protein segments. Each such pair of segments is specified by a box that you create and manipulate. You move the box, change its length and alter the alignment of the sequences it contains while monitoring an RMS score that reflects the structural similarity of the two structural segments that correspond to the two sequence segments in the box. When you have found all the pairwise SCRs in this way, you then execute a command that merges them, where they overlap, into SCRs that contain more than two segments. These are enclosed in white boxes called summary boxes. They are similar to m-boxes, in that they contain more than two sequences, but, unlike m-boxes, summary boxes are permanently frozen. The final step in the procedure is to superimpose the three-dimensional structures over the SCRs specified by the summary boxes.
Finding Pairwise SCRs
To determine the best possible structurally conserved region in a particular part of two reference proteins, you need to perform RMS calculations. The position and size of the active (green) sequence box determines which residues are included in the calculations.
The Structure command in the Alignment pulldown is used to perform RMS calculations. When Manual is selected as the Struct Align Mode, internal flags are set so that a new RMS calculation is performed every time the active box is altered or a sequence is scrolled through it.
The RMS calculation is performed over the four backbone atoms of all residues enclosed by the active box, if these atoms are present. If one or both of the proteins contains only C
atoms, then the comparison is done only over C
atoms. The two proteins are optimally superimposed, via rigid rotation, over the corresponding regions enclosed by the active box, and labels appear on the residues that are being compared. The superposition, RMS value, and labels are all updated whenever the contents of the box change (for example, as a consequence of moving the box, changing its size, or moving a sequence through it).
Criteria for Evaluating Manually-Determined SCRs
Four criteria are used to determine whether a putative SCR is a good choice.
- Reasonably low RMS value - Because you are comparing backbone coordinates over such a small stretch of the protein, usually 10-15 residues, the expected RMS is much lower than for comparisons between whole proteins. Values less than 0.75 Å are appropriate. Values more than 1.0 Å usually indicate misalignment.
- Overall structural alignment - When the RMS deviation is very high, the two proteins being compared do not line up well. Superimposing the peptide segment causes the rest of the molecule to be rotated as well. If the two segments do not have the same orientation in their respective molecules, then the molecules cannot be aligned. Even if the RMS value is low in this situation, the peptide segments are poor choices for an SCR because it is impossible to choose which of the two orientations is correct for the model you are building.
- Increase in RMS - At the ends of an SCR, the RMS value rises steeply as the peptide chains diverge. When the sequence box is expanded beyond the limits of an SCR, the RMS increases abruptly.
- No manually determined SCR can span more than one secondary structural unit of a protein. In this definition, an antiparallel beta sheet would not be called a single conserved region. Instead, each of its beta strands would be classified as SCRs. This restriction is imposed to allow for insertions and deletions in the amino acid sequences, which are primarily seen at turns and loops.
It is best to find the SCRs in an orderly fashion from N- to C-terminus. If more than two reference proteins are available, SCRs can be determined between all combinations of proteins. They need not be of the same length at this stage. As they are decided upon, the positions of the boxes defining the SCRs should be preserved with the Freeze Boxes command. This prevents any inadvertent alteration of the boxes when scrolling sequences or introducing gaps elsewhere.
Summarizing the Manually-Determined SCRs
After all areas of structural conservation have been found, the Summarize command in the Boxes pulldown should be used to create subsets (abbreviations) for all the consensus SCRs in the reference proteins. If conserved regions have been defined in any reference proteins, this command examines all the proposed SCRs indicated by all the sequence boxes. In each region, it defines the final SCR as the maximum number of residues that are found in all overlapping boxes. An equal number of sequence boxes must exist for each pair of proteins, and every conserved region must have boxes defined for every possible pair. White summary boxes are drawn around the regions, and all the boxes used in the definitions for the summary boxes are frozen if they were not already.
The subsets created by the Summarize Boxes command are named as
<protein_name>$SCR<n> for each individual region, and <protein_name>$SCR for the union of the regions for a given protein. These are ordinary subsets and can be used as input in other commands, such as Color and Label in the Molecule pulldown.
Superimposing Reference Proteins Using Manually-Determined SCRs
As a final step in the manual determination of SCRs, the subsets created by the Summarize Boxes command should be used to superimpose the three dimensional structures of the reference proteins. The purpose of this is to define a consistent global coordinate frame for all the reference proteins. If this is not done, there may be a structural misalignment between any segments in the model protein built from different reference proteins.
The reference proteins are superimposed over their corresponding manually-determined SCRs by using the Superimpose command in the Transform pulldown. In the Superimpose parameter block, the Molecule Pick Level should be set to Subset and the Superposition Mode should be Backbone or Trace. Decide which protein should be the target molecule and enter protein1$SCR and protein2$SCR as the Source and Target Specs. After each of the reference proteins has been superimposed onto the same target protein, assignment of coordinates to the model protein can be done from any reference protein.
Multiple Sequence Alignment as an Alternative to the Manual Method
There are some homology modeling problems in which the structures are not sufficiently similar to allow automatic determination of SCRs, and yet the sequence similarity is still high enough that good results can be obtained with the Alignment/Multiple_Sequence command. In these situations the Alignment/Multiple_Sequence command may provide a fast and simple alternative to the manual method of determining SCRs just described. The Alignment/Multiple_Sequence command is described in detail under Multiple Sequence Alignment.

Step 3: Sequence Alignment
The alignment phase is most critical to the final structure of the model. It determines the correspondence between residues in the reference proteins and the model protein. The positions of conserved regions and lengths of variable regions are directly affected by the alignment of sequences.
If sequence alignment information from a previous project has been saved in a file with the Sequences/Put command, it can be read in and reused at this stage. The Sequences/Put command stores complete alignments with protein names, relative positions, and sequence gap information. The names of these files have the extension .align. Sequence boxes are not saved in these files, so the positions of any SCRs cannot be stored and retrieved by this method. A .align file can be read in by executing the Sequences/Get command with the Get Mode parameter set to Alignment. Any files with the extension .align are displayed in the value-aid of the File Name parameter. Picking one of these with the mouse and selecting Execute reads in the alignment contained in the file.
If no prior alignment information is available, the sequences can be automatically aligned using either the Alignment/Pairwise_Sequence or Alignment/Multiple_Sequence command. The Alignment/Pairwise_Sequence command aligns two sequences at a time. The Alignment/Multiple_Sequence command can align as many as ten sequences simultaneously.
The Alignment/Pairwise_Sequence command uses one of three methods for aligning a pair of sequences depending on whether or not the reference proteins have been structurally aligned. The appropriate method is chosen automatically.
The first method is that of Needleman and Wunsch (1970; see Needleman and Wunsch Algorithm for Pairwise Alignment). It is used to find an initial alignment when two sequences have just been read in and contain no sequence boxes, m-boxes, or summary boxes. This type of alignment maximizes the matching of amino acid types while minimizing the number of gaps that are inserted. The process can be done repeatedly with the parameters adjusted each time as necessary to optimize the alignment.
The second method is used when several reference proteins have been structurally aligned using the manual method and their SCRs have been summarized. It is then possible to automatically align a sequence protein to one of the reference proteins. This enhanced algorithm (see Needleman and Wunsch Algorithm for Pairwise Alignment) incorporates structural information into the sequence alignment. The program scores the sequence similarity only in the conserved regions and allows gaps to be inserted only between SCRs (i.e., between summary boxes). When a gap is required between two summary boxes in the reference protein, a corresponding gap is placed in each of the other reference proteins included in those summary boxes. In this way the existing structural alignment of the reference proteins with each other is not disrupted by the new alignment of the unknown to the reference protein.
The third method is used when several reference proteins have been structurally aligned using the automatic method. In this case the reference proteins contain m-boxes, not summary boxes. A sequence (typically the model sequence that has no three-dimensional structure) is aligned to one of the reference proteins using the standard Needleman-Wunsch algorithm. Unlike the alignment of a sequence to summary boxes, the alignment of a sequence to m-boxes can place gaps in the sequence opposite an m-box, or in the reference sequences within an m-box. The m-boxes are split automatically as necessary to accommodate the insertion of these gaps.
Choosing a Scoring Matrix
Whenever sequence comparisons are made, either manually or automatically, a scoring matrix is used to measure the degree of similarity between the amino acids.
There are four of these 20X20 matrices available for use with the Alignment/Pairwise_Sequence command, and two matrices available for use with the Alignment/Multiple_Sequence command (see Scoring Matrices for a description of each matrix). You must select one of these matrices before executing the Alignment/Pairwise_Sequence or Alignment/Multiple_Sequence command. The selection of a matrix for one of these commands has no effect on the matrix selection of the other command. For each of these two commands the selection remains in effect until a new matrix selection is made.
Automatic Sequence Alignment without SCRs
If it is not obvious where to begin the search for structurally conserved regions, or if there is only one reference protein, automatic sequence alignment should be done. Either the Alignment/Pairwise_Sequence command or the Alignment/Multiple_Sequence command can be used in this situation. The use of the Alignment/Multiple_Sequence command is explained in a separate section below (Multiple Sequence Alignment). This section concerns only the Alignment/Pairwise_Sequence command.
When no summary boxes have been defined, the Alignment/Pairwise_Sequence command uses the method of Needleman and Wunsch described in Needleman and Wunsch Algorithm for Pairwise Alignment. This standard method inserts sequence gaps into both sequences with equal weight to match amino acid similarity as much as possible. A Gap Penalty (a parameter of the Alignment/Pairwise_Sequence command) is associated with starting a new gap region; the higher the value, the lower the number of gaps. The cost of inserting a gap can also be made to increase in proportion to the length of the gap. The Gap Length Penalty parameter is the proportionality constant. The total cost of inserting a gap is:
cost = Gap_Penalty + Gap_Length_Penalty * ( L - 1 )
where L is the length of the gap in number of residues. The cost is subtracted from the alignment score. A higher Gap Length Penalty tends to produce shorter gaps. If Gap Length Penalty is set to zero, all gaps are penalized equally, regardless of length.You can try using different values for the Gap Penalty and Gap Length Penalty parameters to determine which are best suited for a particular pair of sequences. The computation is invoked by using the Alignment/Pairwise_Sequence command with the Seq Align Mode set to Automatic. It is recommended that Mutation be used for the Scoring Matrix parameter. The calculation can be performed on any two sequences, either reference or unknown. If gaps were already in either of the sequences, they are deleted prior to alignment. It is therefore possible to perform the calculation several times, using different values for the Gap Penalty and Gap Length Penalty parameters, until a satisfactory alignment is obtained. The calculation takes from one to several minutes depending on the lengths of the sequences.
Automatic Sequence Alignment with Automatically-Determined SCRs
After all the structurally conserved regions in the reference proteins have been found using the automatic method, the sequence of the unknown protein can be aligned with the reference proteins using the Alignment/Pairwise_Sequence command. This alignment must be done before coordinates can be copied from the SCRs of the reference protein to the model protein.
In this case the reference sequences contain m-boxes and the model sequences contain no boxes. The Alignment/Pairwise_Sequence command works exactly as just described for the case in which both sequences contain no boxes. You select one of the reference sequences for alignment with the model sequence. The best choice is usually the reference sequence most similar to the model sequence. The algorithm of Needleman and Wunsch is used to align the sequences, and gaps are inserted as necessary into either sequence to optimize the alignment score. The gap penalty parameters work exactly as when no boxes are present.
Although only one reference sequence is being aligned to the model sequence, any gaps inserted into the reference sequence are automatically duplicated as necessary in the other reference sequences to maintain the correct multiple alignment established by the m-blocks. If a gap must be inserted within an m-block, the corresponding m-box is split at that point to accommodate the gap. It may seem inappropriate to allow the insertion of a gap within an m-block, since an m-block found by the automatic Alignment/Structure command represents a structurally conserved region. Strictly speaking, though, such an m-block may actually contain several structurally conserved regions that are separated by surface loops that just happen to be structurally similar in the chosen reference proteins. A more distantly related protein from the same family may have an insertion in one of those loops, thus requiring a gap within the m-block.
The alignment may be repeated as often as necessary to experiment with different parameter values. If a prior execution of the command inserted a gap that split an m-box, and the current execution does not create that gap, the m-box will remain split. Repeated executions with different gap penalty parameters may therefore fragment the m-boxes unnecessarily. This does not cause any serious problems, since the residues that are aligned within m-blocks remain the same. It can, however, affect the coloring of the m-blocks in p-value mode, since the fragmented m-blocks are shorter and therefore less statistically significant. If you wish to eliminate this fragmentation, the best strategy is to execute the Alignment/Pairwise sequence command as many times as necessary to find the best parameters, then rerun the Alignment/Structure command to restore the original (unfragmented) m-blocks, and finally rerun the Alignment/Pairwise_Sequence command with the optimal parameters.
Automatic Sequence Alignment with Manually-Determined SCRs
After all the structurally conserved regions in the reference proteins have been found using the manual method, and the Boxes/Summarize command has been executed (see Manual Determination of SCRs), the sequence of the unknown protein can be aligned with the conserved regions using the Alignment/Pairwise_Sequence command. This alignment must be done before coordinates can be copied from the SCRs of the reference protein to the model protein.
The presence of summary boxes tells the program that the standard algorithm of Needleman and Wunsch should not be used. Instead a specialized proprietary method (described in MSI's Automatic Divide-and-Conquer Algorithm) is used that does not permit the insertion of gaps within any SCR. The best possible amino acid match for each SCR is made with gaps inserted between them, if necessary. This algorithm has the advantage of incorporating structural information into the alignment.
Like the Needleman and Wunsch automatic sequence alignment method, computation for the proprietary method is invoked by using the Alignment/Pairwise_Sequence command and setting the Seq Align Mode to Automatic. In this case, however, it is recommended that Identity be used as the type of Scoring Matrix because it has more stringent matching criteria. Because gaps are only allowed between SCRs, there are fewer chances to correct a spurious match farther down the sequence.
This nonstandard algorithm ignores any residue in the reference protein that is not within a summary box. It matches residues only within SCRs. Gaps are disallowed within SCRs, but can be put in either sequence between SCRs. If a gap is put in a reference protein that is tied to other rows via frozen or summary boxes, then gaps are put in all the reference protein rows at the same column position. The calculation takes from one to several minutes to complete. The Gap Penalty and Gap Length Penalty parameters are not used when performing automatic alignment with summary boxes.
Since not all the gaps are deleted prior to the calculation, the command is not repeatable. If you wish to try several alternative sequence alignments by this method, first save all objects using the File/Save_Folder command.This stores the original structural alignment of the reference proteins. You can then do the first alignment of the unknown and reference protein. When you are ready to try the next such alignment, delete all objects using the Object/Delete command, restore the original structural alignment using the File/Restore_Folder command, and then execute the next sequence alignment.
Pairwise Manual Sequence Alignment
The results of the automatic alignment procedures can be manually refined if necessary. Pairwise manual alignment is activated by executing the Alignment/Pairwise_Sequence command with the Seq Align Mode parameter set to Manual.
As with structural alignment, internal flags are set up to calculate an average match score for all residues contained in the active (green) sequence box whenever it is altered or a sequence is scrolled through it. The active box should be adjusted and the alignment done to maximize the score over a particular region that has the same width as the corresponding SCR found in the reference proteins.
As explained in the following section, alignments produced by the Alignment/
Multiple_Sequence command can also be manually refined.
Multiple Sequence Alignment
The Alignment/Multiple_Sequence command is typically used when you have only one reference protein, but you also have one or more sequences from the same family in addition to the model sequence. In this situation you cannot determine the SCRs by structural comparisons, as you could with multiple reference structures. The next best thing is to determine where the proteins are conserved with respect to sequence similarity. By reading in additional sequences from the same family and running the Alignment/Multiple_Sequence command, you can get a more accurate estimate of the locations of the conserved regions than you could from the reference and model sequences alone.
The Alignment/Multiple_Sequence command can also be used to align the model sequence and multiple reference proteins, but it will do so using only sequence similarity information, not structural similarity information. Normally the automatic Alignment/Structure command is a better choice in this situation, but, if the reference proteins are not sufficiently similar in structure, then Alignment/Multiple_Sequence may give better results and is faster than manual determination of SCRs.
The Alignment/Multiple_Sequence command simultaneously aligns up to ten amino acid sequences. Typically the best results are obtained with three to eight sequences. Using the segment pair overlap algorithm (see Automatic Sequence Alignment Methods), the command finds regions over which the sequences are mutually related and estimates the statistical significance of those relationships. Each such block of mutually related sequence segments is called an m-block, where m is the number of sequence segments in the block. Those m-blocks containing highly significant relationships are likely to contain structurally conserved regions. The algorithm automatically inserts gaps into the sequence display as necessary to bring the contents of the m-blocks into alignment. It makes no attempt to optimize the alignment of residues that are not contained within the m-blocks. It surrounds each m-block with a special type of box known as an m-box. Unlike the normal boxes that you can create interactively when the sequence window is in Box mode, m-boxes can contain more than two sequence segments. The Alignment/Multiple_Sequence command cannot be used if summary boxes are present.
Search Modes
The Alignment/Multiple_Sequence command can be used in either of three modes that are selected via the Mult Align Mode parameter.
When Manual is chosen as the Mult Align Mode, the command simply creates a single m-box having the location and length that you specify. No automatic searching for high-scoring m-blocks is done in this mode. This mode can be useful when you know that certain segments of the sequences should be aligned with one another (e.g., if they contain homologous active-site residues).
When Single_Search is chosen as the Mult Align Mode, the command searches the sequences only once each time Execute is selected. This normally finds only a few m-blocks (possibly none) and therefore produces an incomplete alignment. The alignment can be completed by executing additional single searches. This interactive searching strategy allows manual adjustment of the parameters between searches.
When Automatic is chosen as the Mult Align Mode, the command searches the sequences repeatedly, each time adjusting the search parameters according to heuristic rules. You can modify the behavior of these rules, and thereby adjust the thoroughness of the repetitive automatic searching, by changing certain parameters of the Alignment/Multiple_Sequence command. The repetitive searching continues until no more statistically significant m-blocks can be found. Automatic mode is the preferred method for most alignment problems.
Parameters that Control
the Automatic Repetitive
Search
When Mult Align Mode is set to Automatic, the behavior of the automatic divide-and-conquer algorithm is controlled by the Pairwise Threshold, Minimum Zone Length, Num Related Seqs, Minimum Seq Per Blk, Not Signif p, and Dimension Bias parameters.
As explained in the Theory section, the sensitivity of each segment pair overlap search is determined by the pairwise score threshold, t. In addition, each search in Automatic mode is further constrained by the requirement that m, the number of sequence segments in the m-blocks found, must be greater than or equal to an internal variable
mmin. The divide-and-conquer algorithm adjusts the variables t and mmin after each segment pair overlap search.
The first search begins with t equal to the Pairwise Threshold parameter and with
mmin equal to the Num Related Seqs parameter. The default value of Num Related Seqs is the total number of sequences selected for alignment. If no m-blocks are found, then t is lowered by a fixed percentage and the search is repeated. This cycle of searching and threshold reduction is repeated until one of the following two conditions is met:
1. One or more statistically significant m-blocks are found, or
2. The current threshold is so low that there is little chance that
further searches at lower thresholds would find significant
m-blocks.
If the first condition occurs, then the newly found m-blocks are saved for alignment, and the process of searching and threshold reductions is repeated for the remaining zones between the significant m-blocks. An m-block is considered significant if its p value (from Eq. 2-1 in the Homology 2.0 User Guide) is less than the Not Signif p parameter. As m-blocks are found, they are inserted into the alignment in the order of their statistical significance (most significant first). An m-block is rejected if it overlaps or conflicts with another m-block that is already in the alignment.
The second condition occurs when either of the following criteria is met:
a. The search found one or more m-blocks, but all of them were
statistically insignificant, or
b. The search found no m-blocks, and t is less than or equal to
(1 - Dimension Bias) (tmin).
The internal parameter tmin is the lowest pairwise threshold that could be expected to find a significant m-block, assuming that all possible segment pairs in the m-block have identical pairwise scores equal to tmin. It is calculated using the inverse of Eq. 2-1 in the Homology User Guide, with p set equal to Not Signif p. Because the pairwise segment scores within a real m-block are rarely identical, tmin is only an estimate of the lowest pairwise score threshold that can find significant m-blocks. In practice it is usually best to continue searching for m-blocks until t is somewhat lower than tmin. The parameter Dimension Bias determines how much lower than tmin the threshold t is allowed to go. Dimension Bias can have any value between 0 and 1, inclusive.
If the second condition occurs, then any further searching at lower thresholds would probably be unproductive. There may, however, be undiscovered significant m-blocks for which m is less than the current value of mmin. At this point, therefore, mmin is decremented. At the same time, the threshold t is reset to the value of Pairwise Threshold. Again the remaining zones are searched repeatedly, the threshold being decremented after each search, until either the first or second condition is met.
This entire process is repeated until mmin becomes less than the parameter Minimum Seq Per Blk, or until the remaining zones between m-blocks are so short that they are not worth searching. A zone is considered too short to search if it does not contain two or more sequence segments of length greater than or equal to Minimum Zone Length.
With this algorithm in mind, the intuitive meanings of the parameters become clear. The algorithm has an intrinsic bias towards selecting m-blocks with the highest possible dimension, m. This bias arises from the fact that the searching begins with mmin equal to Num Related Seqs, which is normally the total number of sequences selected for alignment (the highest possible m). Any blocks of lower dimension are ignored until no more significant blocks of dimension mmin can be found. Blocks of lower dimension are allowed into the alignment only after the second condition is met. If parameters are adjusted such that the second condition is more easily satisfied, then the bias towards m-blocks of high dimension is reduced. Reducing the Dimension Bias parameter has exactly this effect. Notice however, that the second condition is also satisfied if the last search finds only insignificant m-blocks. In many alignment problems this occurs before t becomes less than tmin, and in these cases adjusting Dimension Bias has no effect on the results.
The Num Related Seqs parameter specifies the number of sequences known to be related to one another. Normally it is the total number of sequences selected for alignment. If, however, you are not certain that all of the sequences are related to one another, you can speed the alignment by reducing Num Related Seqs. This allows the inclusion into the alignment of m-blocks with m equal to this lower number, even in the earliest searches. Reducing Num Related Seqs thereby reduces the bias towards blocks of high dimension. Note that if Num Related Seqs is less than the number of sequences selected, blocks containing more than Num Related Seqs sequence segments can still be found since blocks of dimension mmin or higher are acceptable.
The Not Signif p parameter is the criterion for statistical significance. Any m-block with p greater than or equal to Not Signif p is considered insignificant and is rejected from the alignment. Not Signif p can also affect the bias towards blocks of high dimension, since tmin is a function of Not Signif p. Larger values of Not Signif p tend to produce a more complete alignment, with more blocks and possibly greater bias towards blocks of high dimension. Smaller values of Not Signif p produce alignments with fewer but more significant m-blocks.
The Minimum Seq Per Blk parameter specifies the minimum number of sequence segments that an m-block must contain to be included in the alignment. It must be greater than or equal to 2, and \
91 less than or equal to the number of sequences selected for alignment. The default value is 2.
The Minimum Zone Length parameter specifies the length, in residues, of the shortest zone that is considered worth searching. If any of the sequence segments between the m-blocks are shorter than Minimum Zone Length, then these will be excluded from the new search zones. Large values of Minimum Zone Length result in faster but less complete alignments. Minimum Zone Length must be greater than or equal to 1. The default value is 3.
The Pairwise Threshold parameter specifies the first and highest pairwise score threshold used in the segment pair overlap searches. For each of the later searches, the threshold is automatically decremented as necessary to increase the sensitivity of the search. The default value depends on the number and lengths of the sequences selected for alignment (see Automatic Calculation of Pairwise Threshold).
Specifying the Initial Search Zone
The Seq to Align 1 through Seq to Align 10 parameters specify the sequences to be aligned. By default they contain the names of all sequences present in the sequence display area. If there are fewer than ten sequences present, then the unused Seq to Align parameters are set to None.
Each sequence name can be followed by an optional colon and residue number to specify the leftmost (closest to N-terminal) residue of the zone to be searched during the alignment. Any portions of the sequences outside this zone will not be aligned. If the residue number is omitted, then the zone begins at the N-terminus.
The length of the zone is specified by the Zone Length parameter. By default, it is set to the length of the longest sequence in the sequence display. Zone Length may be longer than all of the sequences, in which case the search zone extends to the C-termini of all sequences selected for alignment.
The Seq to Align parameters can be set by picking the sequences in the sequence display area, selecting the molecule names in the value-aid, or picking the molecules on the screen. If the Mol/Res Pick Level parameter is set to Residue, then individual residues can be picked to indicate the leftmost residue to be searched in the selected sequence. A sequence name can be deleted by selecting the parameter field, pressing the <Delete> key, and then pressing <Return>.
If you try to enter the same protein name in more than one Seq to Align parameter, then the redundant entries will just overwrite the first instance of that name. For example, if you enter FBJ:13 in Seq to Align 1 and then try to enter FBJ:22 in Seq to Align 2, FBJ:22 will appear in Seq to Align 1 and Seq to Align 2 will be left unchanged. This feature prevents duplication of sequence names and facilitates adjustment of the starting points of the search zone.
Specifying a Mandatory Sequence
The alignment can be constrained to reject all m-blocks that do not contain the sequence specified in the Mandatory Seq parameter. It is sometimes useful to set Mandatory Seq to the name of the model (unknown) protein. If the algorithm finds overlapping m-blocks, only one of which contains the model protein, and if the block containing the model protein is not the most significant, then this block would normally be rejected from the alignment. If, however, Mandatory Seq specifies that the model protein is a mandatory sequence, then the more significant blocks that lack this sequence will be rejected instead.
Automatic Calculation of Pairwise Threshold
The optimal value for Pairwise Threshold depends in a complex way on the scoring matrix used and on the number, lengths, and relatedness of the sequences selected for alignment. The default value of Pairwise Threshold approximates this optimum and is calculated from the number and lengths of the sequences specified in the initial search zone. It is automatically recalculated whenever there is a change in the Mult Scoring Matrix parameter, the Zone Length parameter, or in the number of sequences listed in the Seq to Align parameters.
This automatic recalculation of Pairwise Threshold is suppressed if you manually enter a new Pairwise Threshold value. If you do this and later wish to reactivate the automatic calculation of Pairwise Threshold, you can do so simply by deleting the current Pairwise Threshold value with the <Delete> key. When you then press <Return>, the automatically calculated value appears.
Statistical Significance and Alternate Sequence Coloring
The p value of an m-block is defined as the probability that a block of equal or greater score could be found by chance in a set of random sequences of the same lengths as those in the sequence display. It is calculated using Eq. 8 in the Theory section. The search space N in that equation is calculated as the product of the lengths of the n sequences in the display if the block contains n sequence segments. If the block contains only m segments, where m<n, then the search space is calculated as the sum, over all possible choices of m sequences, of the product of the sequence lengths.
Note the distinction between the search space, used in calculating the statistical significance, and the search zone that is selected for alignment. The former is always calculated from the full lengths of all sequences in the sequence display. The latter may be smaller than this, encompassing only a subset of the sequences or a restricted portion of their lengths. If you align some sequences and then add another sequence to the display, the statistical significance of the m-blocks will become lower because the search space has become larger.
The Alignment/Multiple_Sequence command surrounds each m-block with a magenta box (called an m-box) and colors the contents of the m-block magenta as well. The intensity of this magenta coloring indicates the statistical significance of the m-block, with paler shades indicating lower statistical significance (larger p). Residues that are white are not significantly related to any others.
The range of p values over which this color change occurs is determined by the parameters High Signif p and Not Signif p in the Alignment/Multiple_Sequence command. All m-blocks for which p is less than High Signif p are considered statistically significant and are colored the darkest magenta. Those for which p lies between High Signif p and Not Signif p are of questionable significance and are colored in successively paler shades as p approaches Not Signif p. Those for which p is greater than Not Signif p are insignificant and are colored white.
The High Signif p and Not Signif p parameters can be changed any time the Alignment/Multiple_Sequence command is active. They are stored internally, even if the Alignment/Multiple_Sequence command is not executed. The coloring of the m-blocks is immediately updated when either of these parameters is changed. When you enter a new value for one of these parameters, it is stored internally when you press the <Return> key, or when you move the focus by selecting another parameter with the mouse. If, however, you enter a new value and immediately select Cancel, then the newly entered value will be discarded.
The magenta coloring of the sequences does not replace the original coloring, but instead constitutes an alternate coloring mode. The button labeled p-value in the lower left corner of the sequence window activates this alternate coloring mode. The Alignment/Multiple_Sequence command always activates the p-value color mode to display the statistical significance of its results. Under some circumstances the contents of an m-block can become statistically insignificant (for example, when you manually shorten the box or change the scoring matrix). If this happens when the coloring mode is set to p-value, it is impossible to know which sequence segments are members of the block, because they are all white. You can always see the which segments belong to a block, regardless of its significance, by switching the coloring mode to Contents. In this mode all segments belonging to a block are colored and all other residues are white.
The Alignment/Multiple_Sequence command also displays a statistical summary of the quality of the alignment in the information area. This summary includes:
The statistical significance calculated by the Alignment/Multiple_Sequence command is only a rough estimate, based on a simplified model of the statistical nature of protein sequences. More importantly, its meaning is based on the null hypothesis that the m-block in question was found by searching random sequences having the same lengths as those in the sequence display. In practice the sequences used in homology modeling come from the much larger search space of a sequence or structural database. For this reason, the p values calculated by the Alignment/Multiple_Sequence command are underestimates. They are best used as guides to the relative significance of the m-blocks found, rather than as indicators of absolute significance.
Characteristics of m-boxes
Like normal SCR boxes, m-boxes can be frozen (not movable), and unfrozen (movable). They are magenta when frozen and blue when unfrozen. The Alignment/Multiple_Sequence command creates them in the frozen state if the parameter Freeze_New_Boxes is on; otherwise it creates them in the unfrozen state.
An m-box turns green when it is the active box in Scroll Boxes mode. When an unfrozen m-box becomes active, a statistical summary of its contents appears in the information area. This includes the p value, the number of residue pairs in the block, the percentage of these that are identical, the mean score of all the pairs in the block, and the worst (lowest) score among all the pairs. If p is less than 0.0001, then its logarithm (base 10) is displayed instead. When you change the scoring matrix parameter, the statistical summary is recalculated and updated in the information area. The coloring of the m-blocks is also updated if the coloring mode is set to p-value. This is necessary because the score is a function of the scoring matrix. The statistical summary and m-block coloring are also updated whenever the contents of the active m-box change, either because the box changes in size or position, or because a sequence is scrolled through it.
Symbols other than amino acid symbols, such as gap symbols, are never allowed within an m-block. The Alignment/Multiple_Sequence algorithm also requires that m-blocks never overlap. The enclosing m-boxes may in some cases appear to overlap, but their constituent sequence segments never do.
An unfrozen m-box can be used as a window for exploring the local statistical significance of the alignment in greater detail. As the m-box is moved, the coloring of its contents changes to indicate the significance of the alignment within the box.
Subsets
For each sequence that is contained within one or more m-blocks, the Alignment/
Multiple_Sequence command creates subsets that specify those regions of the sequence contained within the m-blocks. The subsets are named <protein_name>$MULT$BLK<n> for each individual region, and <protein_name>$MULT$BLK for the union of the regions for the given protein. These are ordinary subsets and can be used as input in other commands, such as the Molecule/Color and Molecule/Label commands.
The creation of subsets can be suppressed by setting the Create_Subsets parameter to off. This is reduces calculation time and is therefore useful when you are running the command many times with different combinations of parameters to optimize an alignment.
Automatic Superimposing of Structures
If two or more sequences that are contained within m-blocks have three-dimensional structures (i.e., are reference proteins), and if the parameter Superimpose_SCRs is on, then the Alignment/Multiple_Sequence command automatically superimposes their structures. It does this so as to minimize the RMS deviation over all corresponding alpha carbon atoms of residues within the m-blocks. This RMS deviation appears in the information area. If more than two structures are superimposed, and if the parameter Create_RMS_Table is on, then a table is also created that shows the RMS deviations between the corresponding alpha carbon atoms of each pair of superimposed proteins. From these RMS scores, and by visually inspecting the quality of the superposition of the structures, you can judge to what extent the m-blocks correspond to structurally conserved regions.
If the parameter Create_Assembly is on, then all superimposed structures are grouped into an assembly. An assembly is simply a grouping of molecules that allows them to be operated on as a whole. In particular, it allows them to be connected to the rotation and translation dials independently of other molecules not in the assembly. This is useful when only some of the structures on the screen are related by m-blocks and superimposed. The superimposed structures can be moved away from the unrelated structures and rotated independently of them. The assembly is named ASY_FAMILY, since it normally will be a grouping of proteins that belong to one family. Similarly, the RMS table described above is named TBL_RMS_FAMILY.
If the proteins selected for alignment are from two or more unrelated families, the Alignment/Multiple_Sequence command can align each family independently of the others. The grouping of the proteins into families will be evident from the contents of the m-blocks. In such cases the structures in each family will be independently superimposed, and a separate RMS table and assembly will be created for each family. The assembly and table names will be distinguished by 2-digit numbers appended to the names (e.g., ASY_FAMILY01, ASY_FAMILY02, etc.).
Adjusting the Sensitivity and Selectivity of the Search
In the search for m-blocks there is a trade-off between sensitivity and selectivity. If a search is too sensitive, then the alignment it produces contains many m-blocks, but some of these may contain insignificant or incorrect local alignments. If a search is too selective, then the m-blocks it finds are likely to be significant and correct, but large portions of the sequences may contain no m-blocks, and important relationships may be overlooked. The most important parameters for adjusting sensitivity and selectivity are Mult Scoring Matrix, Strict_Overlap, Not Signif p, and Dimension Bias.
The Mult Scoring Matrix parameter determines which amino acid scoring matrix is used during the search for m-blocks and in the calculation of the statistical significance of each m-block.
The two choices for the Mult Scoring Matrix, PAM_120 and PAM_250, are both evolutionary mutation matrices. They were derived from the same evolutionary data but differ in the way they were normalized. In general, the PAM_250 matrix produces alignments containing larger m-blocks than does the PAM_120 matrix. The PAM_250 matrix is often more sensitive to weak similarities in distantly related sequences than is the PAM_120 matrix. The PAM_120 matrix is more selective and tends to find m-blocks that are more likely to be structurally conserved. The default value for Mult Scoring Matrix is PAM_120.
As described in the Theory section, step 3 of the segment pair overlap algorithm can use either of two criteria for parsing an m-diagonal into m-blocks. The Strict_Overlap parameter specifies which of these criteria is used. It determines the degree of mutual relatedness that must exist among the sequences in a local region to form an m-block in that region. When Strict_Overlap is on, each sequence segment within an m-block is guaranteed to be related to every other segment in the block. This means that each possible pair of segments within the block lies within a region, over which the cumulative sequence similarity score for that pair exceeds the current pairwise score threshold, t. When Strict_Overlap is off, each sequence segment within the m-block is guaranteed to be related to at least one other segment in the m-block, but not necessarily to all of them. If only two sequences are being aligned, these two criteria are identical and Strict_Overlap has no effect on the results.
If the sequences are closely related (i.e., the percent identity is greater than about 60% within m-blocks), then changing the Strict_Overlap parameter may have little effect. For less closely related sequences, however, the m-blocks found with Strict_Overlap on tend to be shorter and tend to contain fewer sequences than when Strict_Overlap is off. The alignment may therefore be less complete with Strict_Overlap on, but the m-blocks found are more likely to be structurally conserved over their entire lengths. If the sequences are distantly related (i.e., the percent identity is less than about 25% within m-blocks), then significant relationships among the sequences may be missed if the alignment is performed with Strict_Overlap on.
The selections of the Mult Scoring Matrix and Strict_Overlap parameters can be adjusted in tandem to cover the spectrum from high sensitivity to high selectivity. For most sequence families that are sufficiently conserved to be useful for homology modeling, the spectrum is spanned as shown in the following table. There is, however, no simple way to predict which combination of these parameter values will give the best results for a given alignment problem.
The Not Signif p and Dimension Bias parameters also affect the sensitivity of the search for m-blocks. Increasing either of these parameters tends to increase the sensitivity of the search, although changing Not Signif p usually has a much greater effect than does changing Dimension Bias.
Handling of Existing Boxes
The Delete_Boxes parameter determines how the Alignment/Multiple_Sequence command deals with existing boxes. Multiple alignment cannot be done if any boxes exist within the zone selected for alignment, or if any boxes other than m-boxes exist outside the zone. If any such boxes exist and Delete_Boxes is off, then the program issues an error message and aborts the alignment. If Delete_Boxes is on, then the Alignment/Multiple_Sequence command automatically deletes any boxes that would otherwise prevent the alignment. Summary boxes are an exception: if present when Delete_Boxes is on, the program issues an error message and aborts the alignment without deleting the summary boxes (or any other boxes). The default value for Delete_Boxes is off.
Interrupting the Search
The Alignment/Multiple_Sequence command can be interrupted at any time by pressing the <Esc> key. A window will appear that asks you to confirm or cancel your interrupt request. If you cancel the interrupt request, the calculation resumes. If you confirm the interrupt, any m-blocks that have been found up to that point will be incorporated into the alignment, and a message will appear in the information area warning you that the alignment may be incomplete. The interrupt capability is useful for difficult alignment problems that result in excessive calculation time (see below). The best strategy in such cases is to interrupt the command and then rerun it using a better choice of parameters.
Excessive Calculation Time
As explained in the Theory section, the time required to find the globally optimal solution to the multiple alignment problem increases exponentially with the number of sequences. The Alignment/Multiple_Sequence command embodies a heuristic algorithm that, in most cases, avoids this exponential complexity. It does so by taking advantage of the fact that related biological sequences are not random, but instead typically contain highly conserved regions separated by less well-conserved regions. The calculation time therefore depends not only on the number and lengths of the sequences, but also on their relatedness. For most sequence families that are sufficiently conserved to be useful for homology modeling, the Alignment/Multiple_Sequence command requires on the order of tens of seconds or less to align the sequences.
There are, however, special circumstances that can result in unacceptably long calculation times. These include:
Even in these situations it is usually possible to obtain a good multiple alignment in a reasonably short time by appropriate adjustment of the parameters. The best strategies, listed roughly in order from most effective to least effective, are as follows:
- This involves searching for m-blocks over several restricted search zones, each of which covers only part of the full lengths of the sequences. This is especially helpful when the sequences differ greatly in length. In most cases only two or three separate search zones are necessary.
- This is especially helpful when not all of the sequences are known to be related. Note that m-blocks of higher dimension will not necessarily be excluded.
- This reduces the time necessary to find the first m-blocks, so that the problem is internally subdivided sooner.
- Although these changes reduce the sensitivity of the search, they generally have no effect on the time required to find the first m-block. Instead they tend to reduce the number of searches done before mmin is decremented.
- This speeds the search by making it less thorough. It may result in the omission of some of the shorter and less significant m-blocks.
- It is rarely helpful to increase Pairwise Threshold above the automatically calculated default value. If, however, you enter a Pairwise Threshold lower than the default and find that the calculation time is unacceptably long, then using a higher Pairwise Threshold will help.
- Single_Search mode can be used, several times and over different search zones if necessary, to find m-blocks quickly. To build a complete alignment this way is tedious, so this is usually the method of last resort.
Single_Search Mode
When Mult Align Mode is set to Single_Search, the Alignment/Multiple_Sequence command executes a single segment pair overlap search each time Execute is selected. The pairwise score threshold used in this search is the value of the Pairwise Threshold parameter. The search rejects any m-block that contains fewer than Minimum Seq Per Blk sequence segments. If Pairwise Threshold is too high, the search will find nothing. If it is too low, the search will produce many m-blocks, but the calculation time may be excessive. The automatically calculated default Pairwise Threshold is usually a good choice for the first search.
In Single_Search mode the parameters Minimum Zone Length, Num Related Seqs, and Dimension Bias are inactive because their only purpose is to control repetitive searching in Automatic mode. All other parameters have the same meanings in both Single_Search and Automatic mode. In Single_Search mode the command must be executed repeatedly, each time with manual adjustments of Pairwise Threshold, and perhaps other parameters, to build a complete multiple alignment.
Single_Search mode is most useful for difficult alignment problems that involve distantly related or highly repetitive sequences. It is also useful if you only want to find the most significant blocks of a certain dimension (specified by Minimum Seq Per Blk), without generating a complete alignment. This is a good way to find similar sequence motifs in otherwise unrelated sequences.
Manual Mode
When Mult Align Mode is set to Manual, the Alignment/Multiple_Sequence command creates a single m-block each time Execute is selected. You specify the location and contents of the m-block via the Seq to Align and Block Length parameters. When any of the Seq to Align parameters is in focus, the Mol/Res Pick Level parameter also appears and is set to Residue by default. You specify the contents of the m-block in a manner exactly analogous to the way you specify an initial search zone in Automatic mode. You specify the leftmost residue of each sequence segment simply by picking that residue with the mouse in the sequence window while any one of the Seq to Align parameters is in focus. To exclude a sequence from the m-block, select its Seq to Align parameter, press the <Delete> key and then press <Return>. You specify the length of the block via the integer parameter Block Length.
In Manual mode the parameters Zone Length, Mandatory Seq, Minimum Seq Per Blk, Pairwise Threshold, Strict_Overlap, Minimum Zone Length, Num Related Seqs, and Dimension Bias are all inactive because these parameters are only used when searching for m-blocks. In Manual mode there is no searching involved, because you specify the exact location and contents of the m-block. All other parameters have the same meanings as in Single_Search and Automatic mode.
Manual mode is useful for forcing the alignment of residues that may have low sequence similarity, and therefore are not properly aligned in Single_Search or Automatic mode, but that you know should be aligned because of their common biological function in the proteins. In such a situation it is usually best to create as much of the alignment as possible in Automatic mode, then correct any misaligned regions using Manual mode. The Manual mode of the Alignment/Multiple_Sequence command can also be used in this way to correct an alignment produced by the Alignment/Structure command.

Step 4: Assigning Coordinates Within the SCRs
Once conserved regions have been defined, and the sequence of the unknown protein has been aligned in these same regions, it is possible to begin building the model. Choose a particular SCR, and make sure there is a sequence box of the same length that contains both the unknown protein's sequence and an appropriate reference protein's sequence. The choice of which reference protein to use for each SCR can be made by comparing the sequence alignment scores found with Manual Sequence Alignment (see page 5-129).
The AssignCoords command in the Sequences pulldown is used to transfer coordinates from a reference protein to a model protein in a region defined by a sequence box that you select. The box must contain exactly one reference and one unknown sequence, and it must be frozen. An m-box can be used only if it meets these criteria; since m-boxes often contain more than two sequences, they often cannot be used. You can assign coordinates in a region enclosed by such an m-box by creating a normal SCR box that overlaps the m-box.
You select a box as input to the AssignCoords command by picking a residue within it. The selected box turns yellow. If there are several boxes that overlap the picked residue, you can repeatedly press and release the mouse button to select each of them in turn, until the desired box turns yellow.
When Execute is selected, the coordinates are copied from the reference protein to the model. If the two proteins have the same amino acids at corresponding positions in the sequence, then the coordinates are simply copied. Wherever they don't match, the side chains are automatically replaced. When this is done, the side chain atoms that are in common are copied, and for the rest, library amino acids are used and placed along the same direction as the reference protein. When the command has finished, the newly built segment of the model is displayed on the screen, and the corresponding amino acid code letters are promoted to uppercase in the sequence window. A list of steric clashes in the model is reported in the information area and in the textport if the parameter Bump_Check is on.
If two or more reference proteins will be used as sources of coordinates in different SCRs, then all the reference proteins must be superimposed beforehand. This superimposition of the reference proteins is extremely important: without it, the resulting model structure will be grossly incorrect. The reference structures can be superimposed either automatically by the Alignment/Structure or Alignment/Multiple_Sequence command, or from the results of manual structure alignment as described in Superimposing Reference Proteins Using Manually-Determined SCRs.

Step 5: Building Loop or Variable Regions (VRs)
Variable regions (VRs) are sections of the proteins that come between the SCRs in the sequence. They tend to be found at the surfaces of the proteins where differences in amino acid properties and conformation are much better tolerated. They can sometimes be taken directly from one of the reference proteins, but even here they cannot be classified as conserved because the segment is not common to all members of the protein family.
If one of the reference proteins can be used to define coordinates for a variable region, the AssignCoords Sequences command should be used with the Segment Definition parameter set to Designated_Loop. But if none of the reference proteins can be used, appropriate coordinates must be found elsewhere.
Searching for and Displaying Loops
The Search Loops command in the Loops pulldown scans the Brookhaven Protein Databank for protein structures of a predefined length that would fit properly into the model protein between two SCRs (see Setting Up the Brookhaven PDB for information about configuring the Brookhaven PDB for use in Insight II). The search is done by comparing the alpha-carbon distance matrix of the flanking SCR peptides with a precalculated matrix for all known proteins that have the same number of flanking residues and an intervening peptide segment of the given length. Ten loops are reported that have the best overall values for RMS differences between the model and database proteins in the flanking, or preflex and postflex, peptide segments.
The command is easily invoked by picking the two sequence boxes on either side of the loop or variable region (VR). The boxes must contain residues from the model protein for which coordinates have been defined (uppercase letters) and must be picked from left to right. Picking the boxes automatically fills all the required parameters for the command. The Start and Stop Residues indicate the residues of the model protein adjacent to other end of the loop that contains the valid coordinates. Flex Residues represent the number of residues within the loop. Preflex and Postflex Residues are the total number of residues of the model protein on either side of the loop. Any or all parameters can be entered by hand, but must be done with caution because an error could be made concerning the identity and number of residues with valid coordinates. If the sum of the preflex, flex, and postflex regions is greater than 40, then the preflex and postflex regions should be shortened as needed. They may not, however, be lengthened beyond the automatically determined values.
When Execute is selected, the database is scanned for the best ten loops, and then the Display Loops command is automatically activated. This command allows you to examine any of the ten choices at a time. Choose a loop and select Execute to view it. This can be done repeatedly. The Tails option determines whether the preflex and postflex residues are included when superimposing the loop with the model protein. Toggle it on. Remove deletes all the loops from the system. The best choice for insertion into the model is the one that has a low RMS and extends away from the body of the protein.
Generating and Displaying Loops
The Generate Loops command in the Loops pulldown should be used when no appropriate loop choices can be found with the Search Loops command. This can happen when there are no high resolution proteins similar to the model being built. The side chains of the ten loops found may overlap with the conserved regions of the model or the topologies of their backbones may trace obviously implausible paths. In such cases, the Generate Loops command creates a series of ten choices de novo. They are not necessarily the "best" loops that this algorithm could create, but they are the first ten loops to have passed all the geometric and algorithmic criteria established by the parameters of the Generate Loops command
As with the Search Loops command, the Start and Stop Residues are defined as the SCR residues of the model protein at either end of the loop itself. They are easily specified by picking, with the mouse, the sequence boxes to the left and right of the loop region. Note that the boxes must contain residues from the model protein for which coordinates have been defined (represented by uppercase letters) and must be picked from left to right. After specifying the Start and Stop Residues, the Flex Residues parameter is automatically filled in with the number of residues in the loop.
The main chain dihedral angles of the loop,
and
, are initially set to random values. A constrained minimization algorithm then adjust these angles as necessary to make the ends of the loop fit the adjacent SCRs. The Seed parameter is used as a seed to the random number generator that initializes these angles. If the command is run twice with identical parameters, including the same Seed, then the loops generated will be identical in the two cases.
All
angles are set to 180° (trans), except for the peptide bonds on the N-terminal side of proline residues, which can be either cis or trans. When the Pro_Torsions parameter is set to Cis, all prolines in the loop will be cis. Similarly, when it is Trans, all prolines in the loop will be trans. When it is set to Random, the prolines in the loop are randomly made either cis or trans, with a 50% probability of each.
Convergence, Closure Iterations, and Scale Torsions parameters control the loop closure process. As the base distance differences are minimized, the distances approach their ideal values. When the changes from one iteration to the next become small, the RMS difference is compared to the preselected value for Convergence. The smaller the value, the more exact the closure and hence, the larger the amount of computer time required. Closure Iterations specifies the number of cycles permitted before the loop closure process is aborted for any single generated loop. Too small a value here causes all proposed loops to be rejected, but the generation continues indefinitely. Therefore a high value is recommended. Finally, since the algorithm for calculating the changes in
and
very often yields small proposed values, the Scale Torsions parameter is a multiplicative factor forcing the changes in
and
to be larger, thus speeding the closure process.
In some situations the Generate Loops command may run indefinitely without converging. This can happen if the loop is too short to span the distance in three-dimensional space between the adjacent SCRs, even when the loop is fully extended. If this happens, interrupt the command by pressing the <Esc> key. A dialog box will appear, asking you to confirm the interrupt request, which you should do. You must then shorten the adjoining SCRs, thereby making the loop between them longer and more flexible. Often only one or two additional residues in the loop will suffice. To shorten the SCRs, unfreeze the two boxes that were used as input to the AssignCoords command and shorten one or both of them by one residue so as to make the loop between them longer. Then freeze the boxes and AssignCoords from them again. You should then be able to run Generate Loops again.
After the loop has been closed, checked for proper chirality, and superimposed onto the rest of the model protein, it is checked for steric overlaps. This is done in two stages. First, the atoms in the loop are checked among themselves to see whether there are internal steric clashes. Then the atoms of the loop are checked to see if they overlap with any part of the rest of the protein. Parameters are provided to assess each part independently. The Internal Overlap and External Overlap parameters specify the degree of atomic interpenetration that will be permitted before a violation is reported, and the proposed loop is rejected. They range from 0 to 1, with a higher number indicating a greater tolerance. Since the loops created here are meant to be approximate, and the structures will probably be refined with energy minimization via Discover, large values for Internal Overlap and External Overlap are suggested. The default values of both are 0.8.
After the Generate Loops command completes, the Display Loops command is automatically activated. You may choose one or more loops to be viewed simultaneously. After choosing one best loop, select the AssignCoords Loops command to transfer the coordinates from the proposed loop choice to the model protein.
The resulting loops will have extended conformations for all the side chains in the loop. Manual rotation, energy minimization, molecular dynamics, or a combination of these methods must be used to obtain a final structure for any loop created with Generate Loops.
Building Coordinates for the VRs
The AssignCoords Loops command is used to copy the coordinates from the chosen loop to the model protein. The command lists the same loop choices as Display Loops, with the last loop chosen still indicated and displayed. Selecting Execute triggers the command to copy the coordinates from the reference loop to the model protein just as AssignCoords Sequences did. As with AssignCoords Sequences, side chain replacements are done appropriately.

Step 6: Conformational Search for Side Chains Using Rotamers
On many occasions, it is desirable to explore the possible range of conformations for a certain residue's side chain or those of a set of interacting residues. There are two circumstances when conformational exploration is commonly done. The first is when the coordinates of a set of residues in the model protein have been taken from a reference protein with different amino acid types. The side chain may not fit properly in the new environment because it is larger or branches differently than the old one. The second is after the Loops/Generate command has been executed to create a loop de novo. That algorithm can only provide extended side chain conformations for the residues in the loop. Extended side chains are never seen in naturally occurring proteins and are only intended to be arbitrary starting points for a more careful conformational analysis.
Depending on how many side chains will be moved, either the Residue/Manual_Rotamer or Residue/Auto_Rotamer command can be used. The former is intended to find the best positions for a small number of residues. The residues are adjusted one at a time, and no attempt is made to deal with side chains that may be intertwined. Each new conformation can be evaluated by checking for improper steric overlap or in terms of its relative nonbond energy. The Residue/Auto_Rotamer command finds the optimum set of conformations for a user-defined list of moving residues. The optimum conformation is defined as the one with the lowest nonbond energy. An iterative approach is used, where residues in the list of moving residues are examined one at a time, according to the order in which they were added to the list.
To use the Residue/Manual_Rotamer command, first set the Manual Operation parameter to Review. Then, choose the residue for which the best conformation will be found by picking it from the screen or sequence display, or by typing its name in the parameter block. To evaluate the relative stability of each proposed conformation, toggle the Evaluate_Energy parameter on. The Nonbond Cutoff parameter specifies how close two residues must be before their interaction energy will be calculated. The higher the value, the faster, but less accurate, the calculation will be because fewer interactions will be included. You may also test whether the side chain bumps any part of the rest of the protein by toggling Bump_Check to on. The Overlap parameter specifies the percentage of van der Waals overlap that is permitted before a bump is flagged. The proper range is from 0 to 1, and the higher the value, the fewer the number of bumps that will be found. Proceed by selecting Next or Previous for the Rotamer Choice. A new side chain rotamer is placed on the backbone, either forward or backward in the list in the library for that amino acid type. You may also check the relative stability of the initial conformation by setting the Rotamer Choice to Original at any time.
The output of each execution of the Residue/Manual_Rotamer command can be seen in the information area at the bottom of the screen. The information displayed includes the residue name and number, and a rotamer number that identifies which rotamer entry is currently displayed on the screen. If selected, the calculated single-point nonbond energy, and/or the bumps between the moving residue and the rest of the protein are also given. The best rotamer choice is the one producing the lowest energy or the fewest bumps. You may switch the active residue at any time; the last conformation of the previously selected residue remains in place. If you wish to keep the conformation you have found, select Done as the Manual Operation, fill in the name of the Done Molecule from the value-aid, select Keep as the Update Mode, and select Execute. Selecting Undo for the Update Mode instead returns all altered side chains to their original conformations.
Whenever a new residue is altered by this or the Residue/Auto_Rotamer command, it is added to a list of active rotamer residues. The list contains the identities of the residues and their original conformations. It is maintained until either command is executed with Update Mode set to Done. Thus, you can cancel out of the commands and do other work without completing the conformational search. You can also transfer the list from one command to the other. The usefulness of the latter is described below.
The Residue/Auto_Rotamer command is organized in a similar fashion to the Residue/Manual_Rotamer command in that it is divided into separate modes. To begin, the Automatic Operation parameter must be set to Setup. Then, a list of moving residues needs to be specified. Note that this list is the same as that used by the Rotamers/
Manual command. New entries to the list are appended when the Activation parameter is set to Add and a residue is picked from the screen or sequence display, or when its name is typed in the Residue Spec parameter. If the Residue Pick Level is set to Subset, then a range of residues can be specified by picking from the list of subsets provided in the value-aid. If a residue or subset is picked, then the command executes automatically. However, you need to select Execute if the residue's name is typed, or if a residue range is typed. Residues can be deleted from the list when Activation is set to Delete and the command is executed. List pops the textport forward and displays the current list of moving residues. Finally, Clear removes all entries from the list that are currently in their original conformations, but not those for which the conformation has been altered. Thus if a mistake is made adding residues to the list, previously altered conformations will not be destroyed by selecting Clear.
The search algorithm does not test all possible combinations of rotamers for a given list of residues. Instead, an iterative approach is taken where each residue in the list is examined in turn. For each residue, the rotamer choice producing the lowest nonbond energy is found. All library entries as well as the original conformation are tested. A cycle is defined as a single pass through the list of residues in the order in which they were added. After each cycle, the energy is compared to the last one. If it has not dropped appreciably, the search is stopped. Note that the energy can never rise from one cycle to the next because the lowest-energy rotamer is always chosen.
The ability to transfer the active rotamer residue list between the Residue/Manual_Rotamer and Residue/Auto_Rotamer commands is designed to permit a semi-automatic conformational search to be performed. Because not every combination is tried, it is possible that a favorable conformation may be missed. If it is known in advance that a certain combination of rotamers is likely to be the best, as found by the Residue/Manual_Rotamer command, then that can be set up in advance. The residues involved will already be part of the active rotamer residue list, and the specific rotamer combination will be evaluated.
After the list of moving residues has been properly specified, select Search as the Automatic Operation. Now the Convergence and the Maximum Cycles parameters, that control the termination of the iterative search process, can be specified. The Convergence criterion is satisfied when two successive search cycles have energies that are the same within the specified tolerance. If the Convergence criterion has not been satisfied before Maximum Cycles have been performed, the search is aborted. As with the Residue/Manual_Rotamer command, if you wish to keep the conformation for all the residues that were found, select Done as the Automatic Operation, the appropriate molecule as the Rotamer Molecule, and Keep as the Update Mode, and then select Execute. Selecting Undo instead of Keep returns all the side chains to their original conformations.
On some occasions, it may be found that many of the residues involved in an automatic conformational search remain in their original conformations, but that these conformations are suspected not to be correct. This condition can arise when one of the side chains is locked into an incorrect rotamer by other surrounding side chains. It may be barred from the correct conformation because of very tight packing.
There are two approaches that may alleviate this situation. The first has been mentioned before, namely manually placing rotamers thought to be correct. If they are not known in advance, then it is possible to eliminate the steric problems by temporarily removing the side chains from the protein. Use the Residue/Replace command in the Biopolymer module to replace all the residues to be moved by alanines (except glycines and prolines). Then change the most tightly packed residue back to its original type. Use the Residue/Auto_Rotamer or the Residue/Manual_Rotamer command to find the lowest energy conformation. Continue to replace one or more residues at a time with their original types using the Rotamers commands at each step. The best overall conformation should be the result.
As explained in the Theory section, the Residue/Auto_Rotamer command stores the nonbond energies it calculates in two separate lists. The list of nonbond energies of interaction between moving and fixed side chains ("static" energies) is calculated all at once before the conformational search begins. The list of nonbond energies of interaction between pairs of moving side chains is created and enlarged as the conformational search progresses. Because the stored energies are reused in the later cycles of the search, the early cycles are more time-consuming, and the initial calculation of the static energies is typically the most time-consuming step of all. The algorithm is sufficiently fast, however, that it is feasible to include in the search all side chains of a protein of average size. The calculation time is roughly proportional to the square of the number of active residues. A search involving 100 side chains of elastase takes about six minutes on an Iris 4D-35; a search involving all the movable side chains (191 total) takes about 40 minutes. If necessary, the command can be interrupted at any time by pressing the <Esc> key.
The conformation found by the Residue/Auto_Rotamer command should be energy minimized. This is because the local environment of the side chain is not exactly the same as that for which the statistics were derived. Also, the rotamer
values used are averages over many different conformations. Therefore, it is recommended that you use the Refine/Relax command or the commands in the Discover module before proposing a final structure.

Step 7: Refining the Structure with Discover
Once steps 3, 4, and 5 are completed, most of the coordinates for the model protein have been defined. However, coordinates for the N- and C-terminal regions may not have been defined. In other regions, the coordinates that were initially assigned may be less than optimal. This is because larger side chains may have been built where small ones existed in the reference proteins, and because the peptide segments that are copied from the various reference proteins may not meet properly at the splice points. Therefore, the final step in the model building process is completing and refining the model protein structure.
Running Discover with Homology-Built Model Structures
There are two ways to use Discover with protein models built from sequence data. The most convenient way is to use the commands in the Refine pulldown in the Homology module. These commands are specific to the Homology program. When you use this method, all potentials, charges, and hydrogens of the protein are automatically examined and, if necessary, corrected.
The other way is to use the Discover interface by selecting Discover from the Module pulldown. However, in this case, it is up to you to check and correct potentials, partial charges, and hydrogens before submitting a Discover job. You are warned upon leaving the Homology module if there are problems with the model protein. To correct these problems, invoke the Builder module by selecting Builder from the Module pulldown. For hydrogens, use the Hydrogen command in the Modify pulldown. To check and correct potential atom types and partial charges, use the commands in the Forcefield pulldown.
Note that a model structure built with Homology may contain unacceptable steric overlaps. If the structure is submitted in this form, the Discover program notices the overlap violations and exits without performing any refinement. It may be necessary to manually refine the structure's coordinates prior to submission to Discover. Use the Bump command from the Measure pulldown to check for any overlaps. Use a value of 0.8 for the Overlap parameter. If any bumps are found, it is necessary to relieve them by moving atoms with the Torsion command in the Transform pulldown.
The Bump command does not find unacceptable steric overlaps caused by one to three interactions or short bond lengths. If either of these are present, the Discover job cannot run, and an error specifying the overlapping atoms is printed in the .out file. It is necessary to relieve the strain before resubmitting the job. This is done using the Geometry command in the Modify pulldown in the Builder module.
Several of the commands in the Refine pulldown set up and optionally submit Discover runs, either interactively or in the background. Since these runs can be time consuming, it is not practical to do all the required work during one interactive session. Most of the interface between Homology and Discover depends heavily on the subset definitions and internal flags for various parts of the model protein that were created during the course of coordinate assignment. Therefore it is necessary to save these definitions from one session to the next.
The Copy Sequences command is provided to propagate the required information from one version of the model protein to the next as it is refined in a series of successive Discover calculations. If you have used Discover in the batch mode, the refined structure contained in the output file (.cor file for minimizations, .arc file for molecular dynamics) must be read in with the Get Molecule command. Then select Copy Sequences. Enter the name of the original (unrefined) molecule into the Copy FROM Sequence parameter and the name of the newly loaded (refined) molecule into the Copy TO Molecule parameter. The sequence row, all internal residue status flags, and subsets will be copied from one protein to the other. All the commands in the Refine pulldown that rely on this information will now be able to operate properly to set up the next Discover job.
To speed up the calculation, cutoffs are used for the nonbond energy evaluation. Specifically, the value of the Discover constant, cutdis, is set to 7.0 Å, swtdis is set to 1.5 Å, and cutoff is set to 8.0 Å. (Refer to the Discover manual for a complete description of these variables.) You can change the values of any of these parameters by choosing Command File for the Computation Mode parameter and editing the input file before the job is submitted.
End Repair
Coordinates are only assigned to structurally conserved regions and the variable regions between them. If the SCRs do not extend to the N- and C-termini of the protein, then the coordinates will not have been assigned there. This is generally the case because the ends of a protein are often flexible. Since the reference proteins have been deemed unsuitable here, there is no model from which to obtain coordinates, and so the conformations for the end regions must be assigned arbitrarily.
The EndRepair command allows you to add coordinates by taking amino acid templates from the internal library within Insight II. The conformation given is simply an extended chain. Although this is not optimal, energy minimization can be performed on these regions at a later time.
To use the EndRepair command, select it from the Refine pulldown. When the parameter block appears, enter the object name by picking the object on the screen, typing its name, or selecting it from the list provided. Selecting Execute triggers the command.
Splice Repair
Since the various peptide segments used to build up the model protein may have been obtained from a large number of sources, the junction or splice points between the segments may not be smooth. Even though all the reference proteins have presumably been superimposed using all the SCRs as a guide before coordinate assignment (see Summarizing the Manually-Determined SCRs), and all the loops have been superimposed in the tail regions prior to coordinate transfer, there is no guarantee that the segments meet properly. There is a danger that the peptide bonds do not have the proper length, or that they are not trans. Therefore, there is a provision to repair the splice point peptide bonds.
Whenever coordinates are set with the AssignCoords Sequences or AssignCoords Loops commands, subset definitions are created to keep track of the locations of all the splice points. The SpliceRepair command sets up a molecular mechanics simulation with Discover, placing a torsion force on the omega angle(s) of one or all of these peptide bonds. Most of the atoms of the protein are held fixed during the calculation with the exception of the atoms in close proximity (along the sequence) to the splice. If a particular splice is between an SCR and a loop, then only two residues of the loop adjacent to the SCR are left free to move. If the splice is between two adjacent SCRs with no intervening loop, then only the few atoms surrounding the peptide bond are free to move. These are the alpha-carbons and side chains of the two residues flanking the peptide bond, and the carbonyl and imino groups of the bond itself.
The SpliceRepair command is used by indicating the splice points to be repaired. One, several, or all splice points can be repaired at one time. These parameters can be filled in by choosing from the lists provided. (Note that when the object is selected, the values of all the splice points' dihedral angles and bond lengths are calculated and displayed in the textport.) Torsion forcing of the peptide omega bond(s) to 180º is optional as is the force constant (suggested value is 50). Tethering of all moving atoms is also optional (suggested value is 100). You also have control of the minimization method, the maximum number of iterations allowed, and the derivative cutoff. The calculation can be performed with Morse functions, cross terms, or charges, or any combination. As with all Discover jobs run from Insight II, it can be run interactively, in the background, or with only a command file written out to be submitted at a later time.
Energy Minimization
It is possible that areas of the model protein may be involved in steric overlaps. These may have arisen because a large side chain has been placed in a position where a small side chain existed in the reference protein source, or a peptide segment from one reference protein may be interacting badly with a segment from another reference protein, or perhaps a loop segment may not fit perfectly with the core of the newly built model.
The Relax command provides for the selective minimization of sections of the model protein. The classes of regions that can be refined include the N- and C-termini, the variable regions or loops (side chains or all atoms), the side chains in the SCRs that were replaced (mutated) using the AssignCoords Sequences command, and all the side chains of the SCRs. Note that the backbone atoms of the SCRs are not one of the classes that can be chosen. This is because experience has shown that altering the conformation of the SCRs only serves to decrease the agreement between the final model and the homologous family of proteins.
The Relax command is organized similarly to the SpliceRepair command. You can specify one or several classes of subsets to be minimized, and any number and combination of qualifying regions. If Single is chosen for Subset Groups, a list of the appropriate subsets is displayed. As before, tethering, the type of minimization, the minimization parameters, and the type of job submission can be specified.
Molecular Dynamics
The final step in the structural refinement of the model protein is the exploration of conformational space for the variable regions (loops) using molecular dynamics. The Explore command in the Refine pulldown can accomplish this task. It is designed to perform minimization periodically during the dynamics run and save all intermediate minimized structures in an archive file for later review and analysis.
The Explore command is quite flexible and encompasses many parameters. As before, the object is entered by choosing from the list or picking the object. Only the names of the appropriate variable regions are displayed in the list of subsets in the Subset Name value-aid. Select the parameters for minimization, as well as parameters for molecular dynamics, such as Initialization Step, Temperature, and History. The Number of Cycles parameter is the number of combined dynamics and minimization loops that is performed and archive file frames that are saved.

Step 8: Validating Results
ProStat is a set of tools dedicated to the analysis of peptide and protein three-dimensional structures derived from experimental data obtained from X-ray, NMR, or from modelling procedures (such as Homology methods).
The analysis methods provided include checking of geometrical criteria (such as bond lengths, bond angles, and torsions) against a knowledge base (derived from small molecule crystallographic studies, calculation and tabulation of backbone and sidechain dihedral for single/multiple conformations, calculation of solvent accessible surface area, and calculation of the secondary structure classification).
Many of the properties computed by ProStat are per-residue properties, and many of the criteria used for determining the structural integrity are from per-residue spreadsheets. ProStat can either use an existing table or create one automatically as required.
Structure Checking
The ProStat/Struct_Check command provides access to a means of checking the bond lengths, bond angles, torsions, and other geometric properties of a 3D peptide or protein structure which is very easy to use but comprehensive. The aim of this command is to provide an assessment of the geometric correctness of the structure and also to focus the attention of the experimentalist or model builder on problem areas. These problem areas may be regions of an experimentally determined structure where there was a lack of data, or where noise or other experimental difficulties hinder the interpretation of the data.
The bond length and bond angle reference data used in ProStat were derived from a statistical survey (Engh and Huber 1991) of the common amino acids and appropriate chemical fragments extracted from the Cambridge Structural Databank (CSD) (Allen and Kennard 1993). The X-ray structures of compounds found in the CSD are sufficiently small that their molecular parameters can be determined fully from the diffraction data without using a forcefield or other means or referencing the bond lengths and bond angles. The structures drawn from the CSD were used to derive mean and standard deviation values for specific bond lengths and angles. In ProStat, the bond length reference data are stored in a file called pro_bond.dat. The format of this file is illustrated below.
! Information derived from
! R.A.Engh and R.Huber, Acta. Cryst., A47 292-300 (1991),
! File rules ! 1) Later lines take precedence over earlier ones with
! atoms of same names.
! 2) Atom name in first column is associated with residue name
! in third column.
! 3) Only wildcarding allowed is a single * character
! this implies a match with any residue name
CA C * 1.525 0.021
CA C GLY 1.516 0.018
C O * 1.231 0.020
CB CA * 1.530 0.020
CB CA ALA 1.521 0.033
CB CA ILE 1.540 0.027
CB CA THR 1.540 0.027
CB CA VAL 1.540 0.027
N CA * 1.458 0.019
N CA PRO 1.466 0.015
N CA GLY 1.451 0.016
N C * 1.329 0.014
N C PRO 1.341 0.016
SG SG CYS 2.000 0.100
Any line beginning with the "!" character is a comment line. When the
ProStat/Struct_Check command is run, Insight II first looks for the pro_bond.dat file in the current directory. If the file is found, it is used to define the bond length reference data, and the bond length checking proceeds. If no such file exists in the current directory, then Insight II looks for this file in the directory given by
$ROOT/data/macromol
if it finds the file here, it goes ahead with the program, otherwise the bond length checking is not performed. This method allows an expert user to modify the reference values and standard deviations should the need arise, using a local copy of the pro_bond.dat file. The pro_bond.dat file is extensible, in that any new bonds added to the file are checked by the program. Since ProStat is a protein-specific tool, a given monomer in Insight II is only checked if the monomer name corresponds to one of the 20 common amino acid types. The program does not give an error message, but rather simply does not perform the checking. This allows, for example, the protein portion of a protein-DNA complex to be checked for protein geometry distortions without a profusion of error messages relating to non-protein monomer types found in the DNA.
For bond angle checking, ProStat seeks a file called pro_angle.dat, using the same directory searching sequence just discussed. The format of this file is:
! MSI Protein Bond Angle Table 1
! Created Sept 7 1994.
! Residue Specific bond Angles and standard deviations
! Information derived from
! R.A.Engh and R.Huber, Acta. Cryst., A47 292-300 (1991),
! File rules
! 1) Later lines take precedence over earlier ones with
! atoms of same names.
! 2) Atom name in column 1 is associated with residue name
! in column 4, 2 with 5 etc.
! 3) Only wildcarding allowed is a single * character
! this implies a match with any residue name
! 4) A zero entry implies this specific bond not present
! in the data base and will not be checked.
C N CA * * * 121.7 1.8
C N CA * GLY GLY 120.6 1.7
C N CA * PRO PRO 122.6 5.0
CA C N * * * 116.2 2.0
CA C N GLY GLY * 116.4 2.1
CA C N * * PRO 116.9 1.5
CA C N GLY GLY PRO 118.2 2.1
CA C O * * * 120.8 1.7
CA C O GLY GLY GLY 120.8 2.1
CB CA C * * * 110.1 1.9
CB CA C ALA ALA ALA 110.5 1.5
CB CA C ILE ILE ILE 109.1 2.2
CB CA C THR THR THR 109.1 2.2
CB CA C VAL VAL VAL 109.1 2.2
N CA C * * * 111.2 2.8 N CA C *
GLY GLY 112.5 2.9
N CA C PRO * * 111.8 2.5
N CA C PRO GLY GLY 0.0 0.0
N CA CB * * * 110.5 1.7
N CA CB ILE ILE ILE 111.5 1.7
N CA CB THR THR THR 111.5 1.7
N CA CB VAL VAL VAL 111.5 1.7
N CA CB ALA ALA ALA 110.4 1.5
N CA CB PRO PRO PRO 103.0 1.1
O C N * * * 123.0 1.6
O C N * * PRO 122.0 1.4
Once again, an expert user can modify or extend the file and the program will work directly with the modified or extended pro_angle.dat file. In order to use modified files successfully, you should note the rules of precedence and format found in the comment lines of these files.
ProStat checks the torsion angles and other properties of a peptide/protein structure using stereochemical parameters derived from high resolution protein structures (Morris et al. 1992) The properties for the model structure are computed on a per residue basis and are compared to the values in Table 5 (Laskowski et al. 1993), where all angles are measured in degrees.
The ProStat data file holding these values is called Pro_misc.dat and is searched for in the same manner as the other data files. This data file contains keywords that specify the property name, along with the mean and standard deviation. This implies that the values in the data file can be modified, but unlike the bond length and bond angle data files, new properties cannot be added to the file so that they are automatically checked by ProStat.
The CA virtual torsion provides a simple check of the handedness of a chiral alpha carbon center that is not dependent upon the R/S Cahn-Ingold-Prelog notation. Note that for L forms of the 20 common amino acids, all 20 should have a positive virtual torsion around the value of 34 degrees; while for D amino acids the value will be -34 degrees. In the standard R/S nomenclature, the alpha carbons of the common L-amino acids are S, while L-Cys alone has an R configuration alpha carbon.
The ProStat/Struct_Check command checks for erroneous bond lengths, bond angles, and torsions. The command has three methods for outputting the results:
1. highlighting erroneous bonds
2. listing erroneous bonds to the textport
3. listing the bonds in a per-residue spreadsheet
The per-residue spreadsheet will contain the value of the property as measured in the model structure, the reference value, the difference between the property and the reference value, and the difference expressed as a number of standard deviations. Only those bonds, angles, and torsions whose number of standard deviations exceeds the threshold you specified in the user interface are output. When you specify a table name in the user interface, ProStat uses a spreadsheet of that name to output the data. If a table of the name you specify already exists, ProStat uses a spreadsheet of that name; otherwise ProStat appends the molecule name to the given name and automatically creates a per-residue spreadsheet with the concatenated name. This allows automatic creation of multiple spreadsheets when a wildcard is given for the molecule name and more than one molecule is present in Insight II.
When checking whether a given residue's phi-psi values fall within the favored regions of the Ramachandran plot, ProStat uses discrete 10 degree by 10 degree digitized regions of the phi-psi space (Macarthur and Thornton 1993).
When ProStat/Color_Molecule is on, ProStat highlights the two atoms defining a bond when it detects an erroneous bond. Similarly, when it detects an erroneous angle it highlights only the central atom of the three atoms defining the bond, while an erroneous torsion causes highlighting of the central two atoms defining that torsion. In the case of the CA virtual torsion check, only the CA atom is highlighted if this property is deemed outside the threshold.
Residue Dihedral Angles
The ProStat/Residue_Dihedral command provides an easy-to-use method of calculating and tabulating peptide and protein-specific backbone and sidechain dihedral angles. The built-in graphing capabilities of spreadsheets are then used to quickly generate 2D and 3D graphs of these dihedrals, and to search for correlations between these and other per residue properties. The calculation and tabulation of each of the individual dihedral angles is controlled by the parameter buttons (Phi, Psi, omega). Once a Residue_Table_Name has been given, then the calculated per-residue dihedral angles are stored there, if a spreadsheet of the chosen name exists. If a spreadsheet of this name does not exist then a new one is created, whose name is found by adding the molecule name to the end of the original name. This allows several spreadsheets to be created for individual molecules with a single command.
When the Assembly parameter is on, an assembly of conformers of the same molecule, as might be obtained from a distance geometry or simulated annealing calculation, are used in the calculation of the dihedral angles. If the Dihedral_Value parameter is on, individual tables are created for each molecule in the assembly, and the selected phi, psi, and other dihedral angles stored in them. If the Dihedral_Value parameter is off, then only one spreadsheet is created, and statistics relating to the dihedral angle distribution are stored there. The minimum value, maximum value, and circular variance of the selected dihedrals across the assembly of conformers are computed and stored in the spreadsheet if the parameters Min_Value, Max_Value, and Circular_Variance are turned on.
Since a dihedral angle is a circular rather than a linear measure, it can be inappropriate to average these in a linear fashion--the mean and standard deviation thus derived may be meaningless. ProStat therefore uses the circular variance (Mardia 1972) as a measure of the variability amongst a set of dihedral angle values. If a given dihedral angle has the value
i in conformer i, then the circular variance is defined as:
Eq. 17
where
Eq. 18
The circular variance can only vary between values of 0 and 1; a small value indicates a tight clustering of the dihedral angles.
The per-residue spreadsheet tables used in ProStat may be used in creating variable width and blended color ribbons using the Molecule/Ribbon and Molecule/Color_Ribbon commands within Insight II.
Secondary Structure Classification
The ProStat/SecondaryClassify command provides a simple means of computing and storing protein secondary structure classifications in a per-residue spreadsheet. When the Classify_Method parameter is set to Kabsch_Sander, the 3D coordinates are used to calculate the secondary structure classification for each residue, with the algorithm of Kabsch and Sander (Kabsch and Sander 1983). In this case, a set of Insight II subsets are also created if the Create_Subsets parameter is on. These may subsequently be used in place of an explicit atom specification for many of the commands within the Insight II program. When the Classify_Method parameter is set to PDB_Classification, then the Insight II program stores the secondary structure information derived from the PDB file (which is currently stored in Insight II) in the spreadsheet. Saving the secondary structure classification in a spreadsheet allows you to easily edit the classification. Finally, the secondary structure classification in the spreadsheet may be utilized in the construction of Richardson-type protein diagrams using the Molecule/ SecondaryRender command in the Insight II software.
Algorithmic Implementation
The solvent accessible surface (SAS) method is available in the ProStat/Access_Surf command. The ProStat pulldown is located in several modules, including NMR_Refine, Homology, and Xsight. The ProStat/Access_Surf command runs a program called hydroscopy in the background that implements the Lee and Richards (1971) algorithm as further refined by Shrake and Rupley (1973). The same hydroscopy program is used in the Solvation module within the Insight II interface, thereby facilitating the comparison of the results between Solvation and ProStat.
United Atom Models
Although Lee and Richards (1971) and Shrake and Rupley (1973) did not explicitly include hydrogen atoms in their model system (they used a united atom model instead), our commercial implementation is more flexible. You can use an all-atom model or you can designate a united atom model. There are two main choices for radii: a VDW radii option with radii from Insight II, or a user-defined radii option that must be used for united atoms. We have also provided for the inclusion of hydrogens (Heavy_Atoms_Only = Off) or their exclusion (Heavy_Atoms_Only = On). For critical applications without hydrogens, you should use a united atom model with the Heavy_Atoms_Only option set to On (alternatively, hydrogen atoms could be given a radius = 0.0, but this method requires unnecessarily-longer compute times).
Setting Atomic Radii
The united atom models for radii can be incorporated into the ProStat/Access_Surf command by specifying a radius file in the Radii_File parameter. The radii file format consists of three fields on each line, separated by white space:
atom_name radius (angstroms) element (optional)
The element designation is given within brackets, like [C] for carbon or [!C] for not carbon. For example, the line:
* 2.0 [C]
indicates that all atoms with the atom_name =* (wild card for any name) should be given radius = 2.0 if their element type is also carbon. The line:
C 1.5 [C]
indicates all atoms with atom_name = "C" should be given radius = 1.5 if their element type is [C]. Note that there is a hierarchy to the file descriptors--in the example above, all "C" atoms will first have a radius = 2.0, until the second line is read and the radius is re-set to 1.5 for "C" atoms. It is important to consider this hierarchy when designing the file. The more general lines should precede the more specific lines in the file.
Table 6 shows the united atom model from Shrake and Rupley (1973) along with the descriptor line for the radii file that is appropriate:
The radii file format can also be used to specify specific atoms. For example, if residue number 23 is Phe, and a specific radius = 1.80 is desired for the CG atom, the line would read:
PHE_23:CG 1.80
It is important to realize that the atom_name field must match exactly. For example, PHE can sometimes be called PHEN, PHEC, or PHEn depending upon charge and location in the protein. All residue type names need to be present with their own lines in the radii file (the wildcard `?' is not supported).
The file format supports the comment character `#' anywhere on the line, and the `!' comment character at the beginning of the line.
A final test of the correctness of the radii file is to run a ProStat/Access_Surf calculation and look at the output file with the name <job_name>.atm. This file contains the atomic radii used by the hydroscopy program. The atom radii in that file should contain the values that you desire for the corresponding atoms. If you detect errors, you can modify (or change the hierarchy of) the lines in the radii file to correct the atom's radius.
Another example of a united atom model is provided in Table 7:
Definition of Computed Surface Areas and their Significance
ProStat supplies three measurements of surface area for your analysis, sampled at the level of individual residues: total surface area, fractional surface area, and polar/nonpolar surface area. Each of these areas will be discussed next.
Total Surface Area
The total surface area of a residue is determined by one run of hydroscopy on the protein. The surface area is always calculated at the atomic level internally, and summed to yield the per residue value. The total surface area of a particular residue is obviously due to two factors: the size of the residue itself (that is, Gly vs. Arg) and the extent to which the residue is buried in the molecule. It is sometimes possible to compare the total surface area of residues between molecules and get a meaningful comparison, for example between conformers of the same protein. However, the relative surface area is usually a more useful parameter since it involves a normalization that eliminates the effects due to different residue types.
Relative Surface Area and the Tripeptide Model
The relative surface area is defined as the fraction (or percentage) of the maximum surface area of a residue that is exposed to solvent. This value goes from 0.0 to 1.0 (or from 0% to 100%) irrespective of the residue type. The value for a residue that is fully exposed is 100%, and the value for a residue that is fully buried is 0%. The formula is simply Ai/Aoi where Ai is the area calculated for residue `i' in the current molecule, and Aoi is the maximum (or total) surface area for residue `i'. Obviously this analysis requires a method for the determination of the maximum surface area Aoi for each residue.
It might seem reasonable to use canonical values from the literature for the maximum surface area Aoi of the residues (for example, from Shrake and Rupley, 1973), but this idea is impractical in our implementation. The fact that the atomic radii can be defined by the user clearly invalidates the use of a single canonical reference value. Therefore, we have adopted the rigorous treatment of determining the total solvent accessible surface area for each residue in its location in the current protein by directly calculating the surface area using an in situ tripeptide model.
The tripeptide model requires a separate hydroscopy run for each residue in the protein; the runs are launched sequentially after the regular run so that your machine is not swamped with processes. The tripeptide model that we have adopted is the Gly-X-Gly model, which has been used previously by Eisenberg and McLachlan (1986) and Shrake and Rupley (1973) (although they used a canonical rather than an in situ model). The tripeptide model for residue X is implemented by mutating the amino acids on each side of X to glycine (thus removing their sidechains but not otherwise changing their coordinates), and then determining the surface area of the central X residue by running hydroscopy on the peptide. Residues at either end of the protein have only one neighbor, so that a dipeptide model is used for them (X-Gly or Gly-X). The surface area value from each run is used for the Ao for that particular residue.
Polar and Apolar Surface Area
ProStat provides the option of analyzing the size of polar to apolar surface of a residue. As noted by Novotny et al. (1984), the apolar surface area is larger in incorrectly folded proteins.
There are two options for determining the polarity of atoms in ProStat. One option is to choose to use the Polarity_Source = Default option. This option uses the simple criterion that nitrogen, oxygen, and sulphur (and their hydrogens, if present) are polar, and carbon (and its hydrogens, if present) are apolar (Wesson and Eisenberg, 1992); any other atoms are undetermined (that is, they are given the value "None" in the <run name>.atm file). The second option is to choose Polarity_Source = File, and to specify a file name that contains polarity information in a format similar to the radii file discussed above, and detailed below.
The file format consists of three fields:
atom_name polarity element(optional)
Comment characters `#' can be used anywhere on the line, and `!' at the beginning of the line only. Here are some sample lines with their meaning:
!=========================================
* nonpolar [H] # all hydrogens
HN* polar [H] # all hydrogens connected to N
HO* polar [H] # all hydrogens connected to O
HIS:* polar [!C] # all atoms NOT carbon in all HIS residues
HIS_1:OXT none [O] # "none" means exclude from calculation
!=========================================
Note that the rules (lines) are applied from top to bottom in the file (like the radii file), so consider to the hierarchy when designing the file. The polarity of the atoms can be checked in the <run_name>.atm file that is written when a job is started.
Limitations in Implementation
The calculation of polar surface area values requires that the molecule be continuously present in the Insight II interface for the job to successfully finish. If the molecule is deleted, atomic values (rather than residue values) are written to the textport. Because the Insight II software performs the summation over all atoms, and because some polarity information is stored with each atom in the molecule, the same molecule must be available at the end of the job. For this same reason, no other surface area calculation is allowed on that molecule while it is waiting, since the new job could change the attributes stored with the atoms.
Conclusion
If you complete all of the steps described in the order suggested here, the model protein retains much of the structural information contained in the reference proteins.
The refinement procedure ensures that the protein has proper physical and chemical properties. The overall character of the homologous family is maintained through the use of structural constraints.
Libraries of structurally conserved regions for many families of proteins can be built with repeated use of the Homology module. These libraries can be saved and used again to speed the early stages of future projects.
Last updated January 06, 1999 at 05:40PM PST.
Copyright © 1998, 1999 Molecular Simulations, Inc. All rights
reserved.