Biopolymer



7 Structure Database Searching


Introduction

The Str_DB pulldown is in the Biopolymer module of Insight II. Insight II provides a graphical interface to a protein database search program, template_db_search. The protein fragments that satisfy the search query may be read into Insight II and displayed.

The database information is derived from a set of known protein structures in the Brookhaven Protein Databank and stored in the file $BIOSYM/data/biopolymer/database.dat. The same list of proteins used in loop searches is used to create this database. The database $BIOSYM/gifts/biopolymer/structure.db contains all the PDB files released by the PDB before 8/6/97.

The stored structural information is mostly residue-based and includes some information on atoms and secondary structure. Searches of this database can address many common protein structure questions.

Tutorials

The Biopolymer Pilot tutorial lesson 8 describes the creation of a custom database of pepsin-like and retroviral aspartic proteases and subsequently uses this database in a search exercise.

To start Pilot, click the mortarboard icon on the Insight II toolbar. When the Pilot interface appears, click Select and then choose the Biopolymer tutorial from the list.


Methodology and Implementation

Searching the structural database

The protein structural information has been extracted from a set of proteins in the current Brookhaven Protein Databank and stored in a single file which by default is called $BIOSYM/data/biopolymer/database.dat.

For each protein the stored information consists of:

For each residue the stored information consists of:

A program external to Insight, called template_db_search, is used to search the structural database file. This program can be run stand-alone using a control file created within Insight or edited manually.

Defining a structure database query

There are four types of information that can be used to define a structure database query:

Proteins to search

The search can be limited to proteins with a resolution better than a specified value or with a keyword such as "hemoglobin" in their description. By default, however, all proteins are searched.

Templates

A template is a fragment of consecutive residues. Each residue may be defined by residue type (or some wildcard group of residue types such as "hydrophobic"), by secondary structure type, by main chain or side chain torsions, or by intra-template Cα-Cα distance. Each residue in the template may be defined as much or as little as required.

Constraints

A constraint is a limiting relationship between two templates. It specifies that the two templates must be a certain distance apart or a certain number of residues apart in the sequence. For example:

If two or more templates are defined, they must be linked by constraints; otherwise they are completely independent and should be treated as separate database queries.

A database search usually finds a number of hit fragments that satisfy the query.

Search parameters

The user can specify how tightly the search criteria must be adhered to by changing the search tolerance. For example, if you specify a required main chain or side chain torsion, then, by default, all structures within 30° of the required value will be retrieved. After you enter the required information using the Define_Template, IntraTmplt_Cnstrnt, InterTmplt_Cnstrnt and Run_Search commands, the control file for the search program is generated, and the search runs as a background job under Insight II.

The Insight II interface to the structure database search program produces an input file called run_name.ddb that contains the details of the search query, the name of the database file to search, the maximum number of protein hits, etc. In the following example a local database (called ./asp_proteinase.db) is searched for a maximum of 50 hits. The two templates are denoted asp1 and asp2 respectively, and both consist of the tripeptide ASP-any-GLY with no side chain or main chain conformation restrictions. The query requires hits to have an ASP CA to ASP CA distance between 5.5 and 7.5 Å and requires the second template to come after the first template (i.e., between 0 and 500 residues later) in the protein sequence. Such a query finds both pepsin-like and retroviral aspartic proteases.

A graphical representation of some of the details of this query is shown in Figure 10 below.

Figure 10. A graphical representation of an example query.

An example of the input file that represents this query is shown below:


NHIT 50
DTOL 1.0
ATOL 30.0 30.0
CTOL 50.0
STOL 30.0
DBAS ./asp_proteinase.db
WILD ./wildcard.dat
RSLN 10.0
TMPL 3 asp1
RESD ASP * * * *
RESD ANY * * * *
RESD GLY * * * *
TMPL 3 asp2
RESD ASP * * * *
RESD ANY * * * *
RESD GLY * * * *
CONS CACA asp1:1 asp2:1 5.5 7.5 0
CONS IRNG asp1:1 asp2:1 0 500 0
The default database file used for searching is found in $BIOSYM/data/biopolymer/database.dat.
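
The layout of the control file shown above is regular enough to be read mechanically. The following Python sketch is not part of Insight II. It assumes, based only on the single example in this chapter, that every line starts with a four-letter keyword, that RESD lines belong to the most recent TMPL line, and that CONS lines list a constraint type, two template:residue references, a lower and an upper bound, and a trailing field whose meaning is not described here. The function name and the dictionary keys are hypothetical.

```python
def parse_ddb(text):
    """Parse a template_db_search control file into a plain dictionary.

    Assumption: the layout is inferred from the one example in this
    chapter; the real program may accept other keywords or fields.
    """
    query = {"settings": {}, "templates": {}, "constraints": []}
    current = None  # name of the template that RESD lines are added to
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, args = fields[0], fields[1:]
        if key == "TMPL":
            # TMPL <number of residues> <template name>
            current = args[1]
            query["templates"][current] = []
        elif key == "RESD":
            # RESD <residue type> plus four further fields ("*" in the example)
            query["templates"][current].append(args)
        elif key == "CONS":
            # CONS <type> <tmpl:res> <tmpl:res> <lower> <upper> <trailing field>
            query["constraints"].append({
                "type": args[0],
                "first": args[1],
                "second": args[2],
                "lower": float(args[3]),
                "upper": float(args[4]),
            })
        else:
            # Global settings such as NHIT, DTOL or DBAS, kept as strings
            query["settings"][key] = args
    return query


EXAMPLE = """\
NHIT 50
DBAS ./asp_proteinase.db
TMPL 3 asp1
RESD ASP * * * *
RESD ANY * * * *
RESD GLY * * * *
TMPL 3 asp2
RESD ASP * * * *
RESD ANY * * * *
RESD GLY * * * *
CONS CACA asp1:1 asp2:1 5.5 7.5 0
CONS IRNG asp1:1 asp2:1 0 500 0
"""

query = parse_ddb(EXAMPLE)
print(sorted(query["templates"]))
print(query["constraints"][0]["lower"], query["constraints"][0]["upper"])
```

A parser of this kind can be useful for checking a hand-edited control file before running the search stand-alone.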

Browsing database search results

The output of the Run_Search command is a list of hits stored in an Insight II table file called run_name.tab. In addition, a summary of the calculation, including query details and the hits found, is written to a file run_name.log in the local directory. The Read_Search_Results command loads the list of protein structural hits into an Insight II table when the Load_Type parameter is set to Load_Hit_List. When the Load_Type parameter is set to Load_Hit_Protein, one or more protein structure hits may be loaded into Insight II and superimposed using the atom subsets that are then automatically defined.

For example, if the table of structure database hits contains the following rows


Table 1

Hit    Protein   asp1:ASP   asp1:ANY   asp1:GLY   asp2:ASP   asp2:ANY   asp2:GLY
...
hit7   5pep      32:ASP     33:THR     34:GLY     215:ASP    216:THR    217:GLY
...
hit9   4hvp      A25:ASP    A26:THR    A27:GLY    B25:ASP    B26:THR    B27:GLY
...

then loading hits 7 and 9 will produce subsets named HIT7$PEP and HIT9$HVP respectively, and these subsets can be used directly in the Transform/Superimpose command in the Viewer module to superimpose the two hits.

The two hits in the above table represent porcine pepsin and HIV protease, respectively.
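
The rows of the hit table can also be processed outside Insight II. The Python sketch below is an illustration only: it assumes that each residue cell has the form "[chain]number:TYPE", as in the two rows shown in Table 1, and that columns are separated by whitespace. The function name is hypothetical, and the actual layout of the run_name.tab file may differ.

```python
import re


def parse_hit_row(row):
    """Split one hit-table row into hit name, protein code and residues.

    Assumption: each residue cell looks like "[chain]number:TYPE", as in
    the two rows shown in Table 1.
    """
    fields = row.split()
    hit, protein = fields[0], fields[1]
    residues = []
    for cell in fields[2:]:
        match = re.fullmatch(r"([A-Za-z]?)(-?\d+):(\w+)", cell)
        if match is None:
            raise ValueError(f"unrecognized residue cell: {cell!r}")
        chain, number, res_type = match.groups()
        # An empty chain identifier is reported as None
        residues.append((chain or None, int(number), res_type))
    return hit, protein, residues


print(parse_hit_row("hit9 4hvp A25:ASP A26:THR A27:GLY B25:ASP B26:THR B27:GLY"))
```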

Residue and secondary structure types

When a template is being defined in the construction of a structure database query, the Residue_Spec is a symbolic name that is entered by the user (e.g., res1). The Res_Type is selected from a value aid, and it may be one of the twenty standard amino acids or one of the following categories:

ANY -- any amino acid.

ALLXGP -- all except gly, pro.

HYPHOB -- hydrophobic.

HYPHIL -- hydrophilic.

ACIDIC -- glu, asp.

BASIC -- lys, arg, his.

NEUTRAL -- of neutral pH.

AROMTC -- phe, trp, tyr.

SMALL -- gly, ala, val, ser, thr.

GLNASN -- gln, asn.

These definitions are contained in the data file $BIOSYM/data/biopolymer/wildcard.dat.
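
The way a wildcard category widens a residue type match can be sketched in Python as follows. This is an illustration, not the format of wildcard.dat, which is not reproduced in this chapter. Only the categories whose members are spelled out in the list above are included, and ANY is assumed here to mean the twenty standard amino acids. The names STANDARD, WILDCARDS and residue_matches are hypothetical.

```python
# Twenty standard amino acids (three-letter codes).
STANDARD = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}

# Wildcard groups whose members are listed explicitly in the text.
# HYPHOB, HYPHIL and NEUTRAL are omitted because their members are not
# listed in this chapter.
WILDCARDS = {
    "ANY": STANDARD,  # assumption: restricted to the standard residues
    "ALLXGP": STANDARD - {"GLY", "PRO"},
    "ACIDIC": {"GLU", "ASP"},
    "BASIC": {"LYS", "ARG", "HIS"},
    "AROMTC": {"PHE", "TRP", "TYR"},
    "SMALL": {"GLY", "ALA", "VAL", "SER", "THR"},
    "GLNASN": {"GLN", "ASN"},
}


def residue_matches(res_type, spec):
    """Return True if a three-letter residue type satisfies a Res_Type spec."""
    res_type, spec = res_type.upper(), spec.upper()
    if spec in WILDCARDS:
        return res_type in WILDCARDS[spec]
    return res_type == spec


print(residue_matches("asp", "ACIDIC"))
print(residue_matches("GLY", "ALLXGP"))
```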

New residue type definitions can be entered into this file, or into a local copy specified by the Res_File_Type parameter in the Run_Search command. These new definitions may then be entered into the Res_Type parameter in the Define_Template command, but they do not show up in the value aid.

The set of valid secondary structure types includes:

H -- Folded, i.e., any of the next 4 types.

A -- α-helix.

3 -- 3-Turn.

4 -- 4-Turn.

5 -- 5-Turn.

T -- Turn, includes 3,4,5.

E -- Extended chain.

N -- N terminal.

C -- C terminal.

In addition, the ANY category matches any type. Each of the above types can be prefixed by NOT, and these categories are also presented in the SecStruct_Type value aid. The secondary structure types in the database are assigned using the Kabsch and Sander algorithm. An n-Turn is a backbone conformation in which a hydrogen bond is formed between the CO of residue i and the NH of residue i+n.
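
The relationship between the group codes and the individual types can be sketched in Python. This is an illustration under stated assumptions: H is expanded to A, 3, 4 and 5, and T to 3, 4 and 5, as the list above describes. How the NOT prefix is spelled in the control file is not shown in this chapter, so it is represented by a hypothetical negate flag. The names SEC_STRUCT_GROUPS and sec_struct_matches are also hypothetical.

```python
# Expansion of the group codes, read from the list of types in the text.
SEC_STRUCT_GROUPS = {
    "H": {"A", "3", "4", "5"},  # folded: helix or any n-turn
    "T": {"3", "4", "5"},       # any turn
}


def sec_struct_matches(assigned, spec, negate=False):
    """Check an assigned secondary structure code against a query code.

    `negate` stands in for the NOT prefix; it is passed as a flag because
    the spelling of the prefix is not documented here.
    """
    if spec == "ANY":
        hit = True
    else:
        # A group code matches itself and every member of its group
        hit = assigned in ({spec} | SEC_STRUCT_GROUPS.get(spec, set()))
    return hit != negate


print(sec_struct_matches("A", "H"))
print(sec_struct_matches("E", "H", negate=True))
```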

Restricting main chain and side chain torsions

Valid torsion angles are in the range -180 degrees to +180 degrees. The main chain conformation of a residue within a template can be restricted by turning on the Cnstrn_MainChain boolean within the Define_Template command. This then displays the mean values for the Phi, Psi and CA_Torsion parameters. The default values of -500.0 simply imply that these are unrestricted since they are outside the valid range of torsion angles. The default tolerance for the allowed values of Phi and Psi is 30.0 degrees; these defaults can be altered by turning on the More_Parameters boolean in the Run_Search command, which makes the Phi_Tolerance and Psi_Tolerance parameters visible in the user interface.
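
The effect of the tolerance and of the -500.0 default can be illustrated with a short Python sketch. It is not taken from template_db_search. In particular, it assumes that the comparison wraps around at ±180 degrees, so that 175 and -175 count as 10 degrees apart; whether the search program does this is not stated here. The names UNRESTRICTED and torsion_matches are hypothetical.

```python
UNRESTRICTED = -500.0  # default value, outside the valid torsion range


def torsion_matches(observed, required, tolerance=30.0):
    """Return True if `observed` lies within `tolerance` degrees of `required`.

    Assumption: the comparison wraps around at +/-180 degrees, so that
    175 and -175 are treated as 10 degrees apart.
    """
    if required == UNRESTRICTED:
        return True
    # Smallest signed difference, mapped into the range [-180, 180)
    diff = (observed - required + 180.0) % 360.0 - 180.0
    return abs(diff) <= tolerance


print(torsion_matches(-60.0, -57.0))
print(torsion_matches(60.0, -60.0))
```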

The CA torsion for the i-th residue is the virtual torsion defined by the positions of the Cα atoms of residues i-1 to i+2. By default this is unrestricted (value -500.0). The default tolerance for the CA virtual torsion is 50.0 degrees and can be altered by turning on the More_Parameters boolean in the Run_Search command and then modifying the CATor_Tolerance parameter.
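
A virtual torsion of this kind is an ordinary dihedral angle computed from four points. The Python sketch below uses the standard projection formula for a dihedral angle; it illustrates the geometry and is not code from the search program. The function name virtual_torsion and the helper names are hypothetical, and the four arguments are assumed to be the (x, y, z) positions of the CA atoms of residues i-1, i, i+1 and i+2.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def virtual_torsion(p0, p1, p2, p3):
    """Dihedral angle in degrees (between -180 and 180) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Components of b0 and b2 perpendicular to the central bond b1
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Four points arranged with a right-angle twist about the central bond
print(virtual_torsion((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```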

Side chain torsion restrictions can be set by turning on the Cnstrn_SideChain boolean in the Define_Template command; the default tolerance is again set by turning on the More_Parameters boolean in the Run_Search command and setting the Sch_Tolerance parameter value.

Intra-template constraints

Once templates have been defined, constraints within a single selected template can be applied to further restrict the set of proteins found with the query. A symbolic constraint name must be entered for the new constraint. The template to which the constraint applies can be picked from the value aid for the Template_Name parameter, and the two residues involved in the constraint can be selected from the value aids for the Residue_Spec1 and Residue_Spec2 parameters. The distance constraint can be defined in terms of a Cα atom to Cα atom distance, a Cα atom to side chain center distance, or a side chain center to side chain center distance. These options are available using the Constraint_Method parameter. A target distance for the constraint can be set, and the distance tolerance (default 1.0 Å) is one of the additional parameters in the Run_Search command. The atoms that define the centers of the side chain for specific residues are shown below.

Definition of side chain center:

PRO -- CG

GLY -- CA

ALA -- CB

VAL -- CG1/CG2

LEU -- CD1/CD2

ILE -- CD

MET -- CE

PHE -- CH

TRP -- NE1

SER -- OG

THR -- OG1

ASN -- OD1/ND2

GLN -- OE1/NE2

CYS -- SG

TYR -- OZ

ASP -- OD1/OD2

GLU -- OE1/OE2

LYS -- NZ

ARG -- NH1/NH2

HIS -- ND1/CD2

Here / denotes the mean coordinate (midpoint) of the two atoms.
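
The side chain center lookup can be sketched in Python as follows. The sketch covers only a few residues from the list above and assumes that a two-atom entry such as OD1/OD2 means the point midway between the two atoms. The names SIDE_CHAIN_CENTER and side_chain_center, and the idea of passing atom positions as a dictionary, are hypothetical.

```python
# A few entries from the list in the text; "A/B" is stored as a pair of names.
SIDE_CHAIN_CENTER = {
    "GLY": ("CA",),
    "ALA": ("CB",),
    "SER": ("OG",),
    "ASP": ("OD1", "OD2"),
    "GLU": ("OE1", "OE2"),
    "LYS": ("NZ",),
}


def side_chain_center(res_type, atoms):
    """Return the side chain center of a residue as an (x, y, z) tuple.

    `atoms` maps atom names to (x, y, z) tuples. For two-atom entries the
    center is taken as the average of the two positions (assumption).
    """
    names = SIDE_CHAIN_CENTER[res_type]
    coords = [atoms[name] for name in names]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))


asp_atoms = {"OD1": (0.0, 0.0, 0.0), "OD2": (2.0, 4.0, 0.0)}
print(side_chain_center("ASP", asp_atoms))
```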

Inter-template constraints

The InterTmplt_Cnstrnt command can be used to set constraints between two pre-existing templates. As above, a symbolic constraint name should be entered. The two templates can then be selected from the value aid; the symbolic names used for the residues in the Define_Template command are used to select the residues to which the constraint will apply. The distance constraint can again take one of three forms: Cα to Cα, Cα to side chain center, or side chain center to side chain center.

In the case of inter-template constraints, the Constrain parameter also contains one additional option. When Res_Separate_Range is selected, additional parameters are enabled that allow the user to select only hits where the separation, in terms of the number of residues apart in the protein sequence, is either inside or outside a specific range. One use of this feature is to ensure that a given query finds only one example of each hit by requiring template 2 to be after template 1 in the sequence; by default, two examples of each hit would otherwise be returned.

Another application of the residue separation constraint is to specify a variable number of residues between two fixed patterns of residues. For example, to find the sequence pattern G-G-(2,5)X-G (two consecutive gly residues followed by between two and five residues before another gly residue), the first template is two gly residues and the second template is one gly residue. The constraint is that the first residue of the first template and the residue of the second template are between four and seven residues apart.
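
The arithmetic behind this constraint can be checked with a small Python sketch that applies the same idea to a one-letter sequence string. It is an illustration only: the real search works on the structure database rather than on plain strings, and the function name find_pattern is hypothetical. The separation is counted from the first residue of the first template to the first residue of the second template.

```python
def find_pattern(sequence, first, second, min_sep, max_sep):
    """Find pairs of template matches with a bounded residue separation.

    `first` and `second` are literal one-letter residue strings. The
    separation is counted from the first residue of `first` to the
    first residue of `second`.
    """
    hits = []
    for i in range(len(sequence) - len(first) + 1):
        if sequence[i:i + len(first)] != first:
            continue
        for sep in range(min_sep, max_sep + 1):
            j = i + sep
            if sequence[j:j + len(second)] == second:
                hits.append((i, j))
    return hits


# G-G-(2,5)X-G: templates "GG" and "G", separated by 4 to 7 residues
print(find_pattern("AGGAAGAAAA", "GG", "G", 4, 7))
```

In this example the two gly residues start at index 1 and the final gly is at index 5, which leaves two intervening residues and gives a separation of four.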

It is also often necessary to set an exclusion range of residues. For example, in studying side chain-side chain interactions you may require that the interactions be between residues remote in sequence. You can define two templates, each containing one residue of the amino acid type of interest, and set an exclusion range between the two templates. If this range is set between -5 and +5 residues, then the two residues must be at least five residues apart in the sequence.

An instance where it is necessary, but not obvious, to set a constraint is when two identical templates are defined. For example, in searching for two interacting histidine residues, you would define two templates, one for each histidine residue. In this case you must also specify an exclusion range between the two templates of at least -1 to 1 to ensure that a single histidine residue does not satisfy both templates.

In other instances, it may be desirable for one residue in a protein to be in two templates simultaneously. Take, for example, a search for a structure with two α-helices that are connected by a loop of between 6 and 12 non-helical residues. To do this you would define two templates and set a residue separation constraint between them. The first template is 12 residues long, with the first six residues specified as helix and the second six specified as not helix. The second template is 12 residues long, with the first six residues specified as not helix and the second six as helix. The constraint is that the first residue of the first template and the first residue of the second template are between 6 and 12 residues apart.

This can be illustrated by the two extreme solutions, where H = helix and N = not helix.

First residues in the two templates separated by 6 residues:


Template 1:     H-H-H-H-H-H-N-N-N-N-N-N
Template 2:                 N-N-N-N-N-N-H-H-H-H-H-H
Fragment found: H-H-H-H-H-H-N-N-N-N-N-N-H-H-H-H-H-H

First residues in the two templates separated by 12 residues:


Template 1:     H-H-H-H-H-H-N-N-N-N-N-N
Template 2:                             N-N-N-N-N-N-H-H-H-H-H-H
Fragment found: H-H-H-H-H-H-N-N-N-N-N-N-N-N-N-N-N-N-H-H-H-H-H-H

In the case in which the first residues are separated by 6 residues, some residues in the hit structure are in both templates.
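
The two extreme solutions can be reproduced with a short Python sketch that overlays the two templates at a given separation. It is an illustration of the bookkeeping only, not of how the search program works internally. The function name overlay is hypothetical, and positions covered by neither template are marked X here as an arbitrary choice.

```python
def overlay(template1, template2, separation):
    """Merge two residue templates whose first residues are `separation`
    residues apart. Returns None if overlapping positions disagree."""
    length = max(len(template1), separation + len(template2))
    fragment = [None] * length
    for offset, template in ((0, template1), (separation, template2)):
        for k, code in enumerate(template):
            if fragment[offset + k] not in (None, code):
                return None  # the two templates conflict at this position
            fragment[offset + k] = code
    # Positions covered by neither template are unconstrained
    return "-".join(code or "X" for code in fragment)


T1 = "HHHHHHNNNNNN"  # six helix residues, then six non-helix residues
T2 = "NNNNNNHHHHHH"  # six non-helix residues, then six helix residues
print(overlay(T1, T2, 6))
print(overlay(T1, T2, 12))
```

With a separation of 6 the six non-helix residues are shared by both templates, and with a separation of 12 the templates simply abut. A separation smaller than 6 would require a residue to be both helix and not helix, which is why the sketch returns None in that case.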

Database search details

When the Run_Search command is executed, Insight II will launch a background job by calling the script:


$BIOSYM/bin/biopolymer/Run_Search 


The structure database search is then performed by the executable


$BIOSYM/$BIOSYM_PLATFORM/biosym_exe/template_db_search


The job can be made to run locally or remotely by setting the Background_Job/Setup_Bkgd_Job/Background_Job parameter to Run_Search and setting the Host parameter to either Local or to the hostname of a machine on the local area network. The progress of the search calculation can be monitored using the Background_Job/Completion_Status command, and the job can be aborted using the Background_Job/Kill_Bkgd_Job command. For the supplied database.dat file, a simple query of the database can be performed in a few minutes.

The Run_Search command defaults to a maximum of 50 hits. In addition to setting the torsion angle constraint tolerances, the search can be restricted to crystal structures of a given resolution by setting the Xtal_Resolution parameter to High, Medium or Low. The definitions of these terms are as follows:

The Run_Search/Job_Name parameter is used as a root name, and the input and output file names are all derived from this root name. The structure database search calculation can be run from the command line as follows:


Run_Search Job_Name


and the same set of files produced by running from within Insight II will be produced.

Custom database creation

The Str_DB/Create_DB command may be used to create a customized version of the database file searched by the Run_Search command. The PDBList_File parameter accepts the name of a file containing a list of names of PDB files. This file has a simple format: each line must specify the full directory and path name of the PDB file to be used. When a Job_Name parameter has been specified, the database creation runs as a background job. A script


$BIOSYM/bin/biopolymer/Run_Crebase


is called from Insight II and this in turn calls the executable


$BIOSYM/$BIOSYM_PLATFORM/biosym_exe/crebase


which creates the run_name.db file used in the search. By default, the Create_DB command will create a new database file, but if the Append_DB boolean is turned on, then the new data will be added to an existing run_name.db file.
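
A list file in the format expected by the PDBList_File parameter (one full path per line) can be generated rather than typed by hand. The Python sketch below is one way to do this. It assumes that the PDB files sit in a single directory and share a common suffix, given here as the hypothetical default "*.pdb"; adjust the pattern to the local naming scheme. The function name write_pdb_list is also hypothetical.

```python
from pathlib import Path


def write_pdb_list(pdb_dir, list_file, pattern="*.pdb"):
    """Write one absolute PDB file path per line and return the paths.

    Assumption: the PDB files share a common suffix given by `pattern`.
    """
    paths = sorted(str(p.resolve()) for p in Path(pdb_dir).glob(pattern))
    Path(list_file).write_text("".join(path + "\n" for path in paths))
    return paths
```

For example, write_pdb_list("./proteases", "asp_proteinase.list") would list every matching file under a hypothetical ./proteases directory; the resulting file name is then entered as the PDBList_File parameter.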

Overview of commands

What follows is a brief summary of the function of each command in the Str_DB pulldown.

Define_Template

A template is a fragment of consecutive residues. The Define_Template command allows you to either create a template by sequentially adding residues to the template or to modify an existing template. Each residue in the template may be defined by residue type (or a wildcard group of residue types such as "hydrophobic"), by secondary structure type, or by main chain or side chain torsions. Each residue in the template may be defined as much or as little as required.

There are two ways to add a residue to a template. If you load a protein structure into Insight, you can click the residue you want to add to the template. The parameters for this residue will be filled automatically. You can also add a residue by typing in values or selecting from the value aid.

IntraTmplt_Cnstrnt

The IntraTmplt_Cnstrnt command defines the relationship of two residues in a template in terms of the distance between two Cα atoms, a Cα atom and a side chain center, or two side chain centers.

InterTmplt_Cnstrnt

The InterTmplt_Cnstrnt command defines a constraint between two residues in different templates in terms of either the distance between the residues or their separation range in the protein sequence.

The distance between two residues can be measured between two Cα atoms, a Cα atom and a side chain center, or two side chain centers.

Delete_Query

The Delete_Query command allows the user to delete individual templates or constraints, or to delete all elements of a query.

List_Query

The List_Query command lists defined queries to the text port or to a file. The format of the list is the same as that in the command file jobname.ddb written out by the Run_Search command. Please refer to the manual for details of the format.

Run_Search

The Run_Search command writes out the control file for searching the database and starts the search as a background job.

Read_Search_Results

The Read_Search_Results command allows you to load either a table of hit lists or a list of hit pdb files with a subset of hit residues.

Create_DB

The Create_DB command allows you to create a database file from a list of pdb file names.




1 The Kabsch and Sander method is used to determine secondary structure type.

Last updated September 30, 1997 at 11:31AM PDT.
Copyright © 1997, Molecular Simulations, Inc. All rights reserved.