SeqFold



3       Methodology


Introduction

SeqFold can be executed for any Insight II protein sequence object. Insight II automatically creates Chou-Fasman, GOR, or DSC annotated sequence files (pseq files).

SeqFold can also use a pseq file that you create yourself, provided it follows the pseq file format described in the File formats chapter. Converters for the two most popular secondary structure prediction programs are included in the SeqFold/Fold_Search command interface. In general, the closer the predicted secondary structure is to the real secondary structure, the greater the chance that the fold corresponding to the target sequence will be singled out from the fold library.


Typical steps

The typical steps in a SeqFold calculation are as follows:

1.   Fold_Search -- Searching for the compatible fold.

Prepare the input files for the search using the Fold_Search command.

Using the default parameters

The Annotated_Sequence parameter controls whether an Insight II sequence or one of the supported file formats will be used.

SeqFold runs as a background job, so a Run_Name must be supplied for each separate run. The run name is arbitrary; it is appended to the pseq file name to create a unique ID for the SeqFold output file.

To perform a SeqFold search with the default parameters, simply click Execute. The SeqFold calculation will be launched in the background and, after a few minutes, a notifier box should appear telling you that the job has completed successfully.

Changing the default parameter values

In most cases the default parameters are optimal; however, expert users may wish to change them. The three toggles at the bottom of the Fold_Search dialog control the display of the Score, Align, and Gap parameters, respectively.

In addition to the recommended default search parameters, the Fold_Search command allows you to change the sequence-structure scoring function parameters, the type of the alignment and the fold library.

Score Parameters

Gonnet, the default Substitution_Matrix, is that described in Gonnet et al. (1992). This is the most recently derived log-odds matrix based on the exhaustive matching of a large database of sequences.

The blosum62 matrix (derived from a database of protein blocks, Henikoff and Henikoff, 1992) and the classical PAM120 and PAM250 matrices (Dayhoff et al., 1978) are also available.

Additionally, you may specify another sequence substitution matrix by selecting User_Defined. By default, SeqFold searches the $BIOSYM/data/seqfold/mat directory for the Matrix_File_Name. This directory contains a number of sequence similarity matrices in different formats (which appear in a list box on the right when you select one of the toggles under Matrix_Format).

You may change the default location of the matrix database by selecting Change_Matrix_Path and editing the Matrix_Path parameter.
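
The lookup described above can be sketched in Python as follows. This is only an illustration: the function name resolve_matrix_file and its arguments are hypothetical and are not part of SeqFold; only the default directory is taken from the text.

```python
import os


def resolve_matrix_file(matrix_file_name, matrix_path=None):
    """Return the full path of a substitution matrix file.

    Hypothetical helper (not a SeqFold function). It mirrors the documented
    default directory, $BIOSYM/data/seqfold/mat, and lets a caller override
    it, as the Matrix_Path parameter does in the Fold_Search dialog.
    """
    if matrix_path is None:
        # Default location described in the text.
        biosym = os.environ.get("BIOSYM", "")
        matrix_path = os.path.join(biosym, "data", "seqfold", "mat")
    return os.path.join(matrix_path, matrix_file_name)
```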

The influence of the predicted secondary structure on the alignment can be controlled using the Seq_Weight and Sec_Weight parameters, which specify the relative weights of the sequence and secondary structure contributions to the scoring function. Note that SeqFold normalizes the sum of the two parameters to one, so multiplying both parameters by the same factor does not change the SeqFold score. Setting Sec_Weight to zero is equivalent to a search based solely on sequence similarity. Similarly, setting Seq_Weight to zero is equivalent to using only secondary structure identities for the matching.
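
The effect of the normalization can be illustrated with a short Python sketch. It assumes that the two contributions are combined linearly, which the text does not state explicitly; the function names are hypothetical and do not correspond to SeqFold internals.

```python
def normalize_weights(seq_weight, sec_weight):
    """Scale the two weights so that they sum to one."""
    total = seq_weight + sec_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return seq_weight / total, sec_weight / total


def combined_score(seq_score, sec_score, seq_weight, sec_weight):
    """Weighted sum of the two contributions (linear form is an assumption)."""
    w_seq, w_sec = normalize_weights(seq_weight, sec_weight)
    return w_seq * seq_score + w_sec * sec_score
```

With this form, combined_score(4.0, 2.0, 1, 1) and combined_score(4.0, 2.0, 5, 5) give the same value, and a zero weight removes the corresponding contribution entirely.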

Align Parameters

The default alignment optimization is the Global_Local method, which is local with respect to the target sequence and enforces a global alignment of the reference structure. Standard Local and Global alignments are also available.
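
The idea of an alignment that is global in one sequence and local in the other can be sketched with a small dynamic programming routine in Python. This is a simplified illustration, not the SeqFold implementation: it assumes a plain match/mismatch score and a single linear gap penalty, whereas SeqFold uses a substitution matrix, secondary structure information, and separate gap penalties. The function name and default values are hypothetical.

```python
def global_local_score(reference, target, match=2, mismatch=-1, gap=-2):
    """Best score aligning ALL of `reference` to ANY stretch of `target`."""
    m, n = len(reference), len(target)
    # dp[i][j]: best score for reference[:i] aligned against a stretch of
    # the target that ends at position j.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0 stays at zero: skipping a target prefix is free (local in target).
    for i in range(1, m + 1):
        # Column 0: reference residues can only be paired with gaps.
        dp[i][0] = dp[i - 1][0] + gap
        for j in range(1, n + 1):
            pair = match if reference[i - 1] == target[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair,  # align the two residues
                dp[i - 1][j] + gap,       # reference residue opposite a gap
                dp[i][j - 1] + gap,       # target residue opposite a gap
            )
    # Every reference residue has been consumed (global in reference); the
    # target suffix after the best end point is skipped for free.
    return max(dp[m])
```

For example, global_local_score("ACD", "XXACDYY") is not penalized for the unmatched flanking residues of the target, while every residue of the reference must be accounted for.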

Max_Top_Scores indicates the number of top scores shown in the SeqFold output file. Of those top hits, only the number specified by Max_Top_Alignments are reported in detail and are available for import using the Fold_Load command.
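
The relationship between the two limits can be sketched in Python as follows. The sketch assumes that higher scores are better and that the hits reported in detail are the highest-scoring ones among those listed; the function name and the (fold ID, score) tuple layout are hypothetical.

```python
def select_hits(hits, max_top_scores, max_top_alignments):
    """Split ranked hits into those listed and those reported in detail.

    `hits` is a list of (fold_id, score) tuples (illustrative layout only).
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    top_scores = ranked[:max_top_scores]
    # Assumption: detailed alignments are kept for the best of the listed hits.
    top_alignments = top_scores[:max_top_alignments]
    return top_scores, top_alignments
```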

The Change_Fold_Lib option allows you to specify a new name and/or location for the fold library. The fold library, which is based on a recent version of the PDB database, is a collection of reference protein structures (or folds) stored in a file called folds.1d_prf in the $BIOSYM/data/seqfold/lib directory.

To update the library, copy it to another location and edit the copy according to the 1d_prf file format specified in the File formats chapter. Then select Change_Fold_Path and set Fold_Lib_Path to the new directory.

The Random_Alignments parameter specifies the number of alignments of the reference sequence with a randomized target sequence; these alignments are used to estimate properties of the alignment score distribution.
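
A minimal Python sketch of this kind of estimate is given below. The text does not say which properties of the score distribution SeqFold estimates; the sketch assumes the mean and standard deviation, from which a Z-score can be derived. The function names and the toy identity_score function are hypothetical stand-ins for the real alignment score.

```python
import random
import statistics


def identity_score(reference, target):
    """Toy score: number of identical residues at matching positions."""
    return sum(1 for r, t in zip(reference, target) if r == t)


def randomized_score_stats(reference, target, score_fn, n_random, seed=0):
    """Mean and standard deviation of scores against shuffled targets."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    residues = list(target)
    scores = []
    for _ in range(n_random):
        rng.shuffle(residues)  # same composition, randomized order
        scores.append(score_fn(reference, "".join(residues)))
    return statistics.mean(scores), statistics.pstdev(scores)


def z_score(raw_score, mean, stdev):
    """Number of standard deviations by which a score exceeds the mean."""
    if stdev == 0:
        raise ValueError("score distribution has zero spread")
    return (raw_score - mean) / stdev
```

A larger number of randomized alignments gives a more stable estimate at the cost of a longer run.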

Gap Parameters

The gap penalties used by the alignment optimization subroutine can be customized in the Gap Penalties submenu. There are three gap penalties each for the reference structure (Struct) and the target sequence (Seq).

Note that the default gap values are optimal assuming that the Substitution_Matrix, Align_Type, Seq_Weight and Sec_Weight parameters are set to their default values. The gap values may not be optimal if any of those other parameters have been changed to non-default values.

After you have specified the parameters, click Execute. The SeqFold calculation will be launched in the background and, after a few minutes, a notifier box should appear telling you that the job has completed successfully.

2.   Fold_Browse -- Browsing search results.

After the background job initiated by the Fold_Search command is complete, the top scoring results may be loaded into an Insight table and conveniently summarized. The Fold_Browse command requires only one parameter: SeqFold_File_Name, the name of a SeqFold output file produced in the Fold_Search step.

The Seqfold_Table_Name parameter allows you to modify the name of the table in which the SeqFold results are presented (by default, the name is derived from the SeqFold output file name).

The table contains a list of hits, each identified by its PDB ID and PDB chain name and accompanied by its score values. For hits with alignment information (highlighted in yellow), additional descriptors are reported as described in the following table:

Table 1. SeqFold alignment descriptors.

Column   Description
B        SeqFold total raw score.
C        Matching fold ID (if available).
D        Total alignment length.
E        Gapless alignment length.
F        Percent of identical residues in the alignment.
G        Percent of similar residues in the alignment.
H        Percent of identical residues in the gapless alignment.
I        Percent of similar residues in the gapless alignment.
J        Gapless alignment to total alignment ratio.
K        Target sequence alignment coverage.
L        Reference structure alignment coverage.
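
To make the descriptors more concrete, the following Python sketch computes several of them from a pair of aligned rows in which a hyphen marks a gap. The exact definitions used by SeqFold are not given here, so the denominators below are assumptions: percentages in the alignment are taken over all columns, percentages in the gapless alignment over columns without gaps, and coverage over the full length of each sequence. The similarity percentages (columns G and I) are omitted because they depend on the substitution matrix. The function name and dictionary keys are hypothetical.

```python
def alignment_descriptors(target_row, reference_row, target_length, reference_length):
    """Compute illustrative versions of descriptors D, E, F, H, J, K and L."""
    if len(target_row) != len(reference_row):
        raise ValueError("aligned rows must have the same length")
    total = len(target_row)
    # Columns in which neither row has a gap.
    gapless = [(t, r) for t, r in zip(target_row, reference_row)
               if t != "-" and r != "-"]
    if total == 0 or not gapless:
        raise ValueError("alignment has no aligned residue pairs")
    identical = sum(1 for t, r in gapless if t == r)
    return {
        "total_length": total,                                     # D
        "gapless_length": len(gapless),                            # E
        "pct_identity_alignment": 100.0 * identical / total,       # F
        "pct_identity_gapless": 100.0 * identical / len(gapless),  # H
        "gapless_ratio": len(gapless) / total,                     # J
        "target_coverage":                                         # K
            sum(1 for t in target_row if t != "-") / target_length,
        "reference_coverage":                                      # L
            sum(1 for r in reference_row if r != "-") / reference_length,
    }
```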

The StartMultiView parameter activates the MultiView window, which is designed to aid detailed analysis of SeqFold results. The ScoresSet parameter indicates how many alignment scores will be included in the MultiView analysis. See also MultiView: multiple features analysis in the Theory section for more information.

The MultiView window facilitates the comparison of different alignment features, which are displayed as a scatter plot in the panel. Any group of outstanding hits can be color coded by drawing a circle around the group with the mouse. The left panel (List_scores) can be used to establish direct access to any Internet or intranet service. In the present release, it is configured to access the SCOP database; however, the default configuration can be modified by changing the seqfold_scop_query file (found in $BIOSYM/data/seqfold) or by creating a local copy of this file and setting the SEQFOLD_SCOP_QUERY environment variable to point to the modified copy. Note that a four-letter PDB code is appended to the content of the seqfold_scop_query file to create the relevant URL.
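
The way the URL is assembled can be sketched in Python as follows. The function name scop_query_url is hypothetical, and stripping surrounding whitespace from the file content is an assumption; the file name, its default directory, and the SEQFOLD_SCOP_QUERY variable are taken from the text.

```python
import os


def scop_query_url(pdb_code, query_file=None):
    """Build a query URL by appending a PDB code to the query file content."""
    if len(pdb_code) != 4:
        raise ValueError("expected a four-letter PDB code")
    if query_file is None:
        # A local copy named by the environment variable takes precedence.
        query_file = os.environ.get("SEQFOLD_SCOP_QUERY")
    if query_file is None:
        # Otherwise fall back to the default file under $BIOSYM.
        query_file = os.path.join(
            os.environ.get("BIOSYM", ""), "data", "seqfold", "seqfold_scop_query"
        )
    with open(query_file) as handle:
        prefix = handle.read().strip()  # assumption: ignore trailing newline
    return prefix + pdb_code
```

Pointing the query file at a different service therefore only requires that the service accept the PDB code as the final part of its URL.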

3.   Fold_Load -- Getting the target-reference hit alignment.

Using the Fold_Load command, the alignment for a specific hit may be imported into the Insight sequence window for analysis and model building.

The Fold_Load command takes three parameters:

The SeqFold_File_Name specifies a source for the alignment, whereas the Model_Name specifies the name of the model based on the alignment.

Fold_ID specifies the hit ID exactly as reported in column C (Matching Fold ID) of the Insight table created by the Fold_Browse command.

When Fold_Load is executed successfully, two objects are created, corresponding to the sequences contained within the alignment limits. For example, for the Model_Name testmodel and the Fold_ID 1cew_I, the two objects created, I_TESTMODEL_1CEW_I and I_1CEW_I_TESTMODEL, contain the aligned part of the target sequence and the aligned part of the reference structure, respectively. This naming allows different alignments corresponding to the same target to be loaded simultaneously.
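
The naming pattern in this example can be reproduced with a short Python sketch. It is inferred from the single example above and is not a documented rule: in particular, treating the leading I_ as a fixed prefix is an assumption, and the function name is hypothetical.

```python
def fold_load_object_names(model_name, fold_id, prefix="I_"):
    """Return (target object name, reference object name).

    Pattern inferred from one documented example; the fixed "I_" prefix
    is an assumption, not a documented rule.
    """
    model = model_name.upper()
    fold = fold_id.upper()
    target_object = f"{prefix}{model}_{fold}"
    reference_object = f"{prefix}{fold}_{model}"
    return target_object, reference_object
```

Because each name combines the model name with the fold ID, loading a second hit for the same model produces a different pair of object names.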




Last updated December 10, 1998 at 12:46PM PST.
Copyright © 1998, Molecular Simulations, Inc. All rights reserved.