Supplementary Material

 

Suel, G.M., S.W. Lockless, M.A. Wall, R. Ranganathan, "Evolutionarily conserved networks of residues mediate allosteric communication in proteins".

 

Nature Structural Biology. January 2003. 10(1): 59-69

 

Below, we provide a more detailed description of the analytic methods used in this paper, organized into three sections: (1) Statistical Coupling Analysis (SCA), in which we describe the core calculations comprising the sequence analysis, (2) Acceptance Criteria for SCA, in which we describe our current methods for validation of alignments and potential statistical perturbations, and (3) Matrix assembly and Clustering, in which we describe the methods and software for identifying co-evolving clusters of amino-acid positions from a statistical perturbation scan of a multiple sequence alignment. Also available are the multiple sequence alignments used in this paper.

 

A software package that will allow interactive mapping between sequence, structure, and function, and that will embed the statistical coupling analysis is currently under construction; we expect that this will be available to the scientific community in the summer of 2003.  For now, code implementing these algorithms is available upon request (mail to: Rama Ranganathan), either as standard C or as a MATLAB toolbox.  A site containing practical examples of this sequence-based analysis to illustrate the features and conditions of applying the SCA will become available as our time allows.

 

I.  Statistical Coupling Analysis:

The calculation of statistical coupling was carried out as described in Lockless and Ranganathan [1].  This analysis quantitatively measures the change in the amino-acid distribution at one position  in a multiple sequence alignment (MSA) given a perturbation at another position  as a statistical coupling energy between then two ().  The goal of this analysis is to expose functional (rather than historical) relationships between positions in the evolutionary record of a protein family; accordingly, we apply two statistical criteria (described in the next section) to validate the MSA for this analysis and one statistical criterion for validating sites on the MSA for perturbation.

 

Step 1: Each position  in the MSA is described as a 20-element vector of individual amino-acid frequencies (e.g. =[,,,...,]).  The frequency vector is then converted to a vector of amino-acid probabilities (), where now each element is the binomial probability of observing each amino-acid residue at position  given its mean frequency in all natural proteins.  This calculation makes two important contributions to the analysis: (1) it provides a logical basis for dealing with cases where the frequency of an amino-acid at a site is zero; in such cases, it reports the likelihood that the amino-acid is not observed simply by chance given the total number of sequences in the MSA, and (2) it accounts for the intuitive expectation that changes in the frequency of an amino-acid when highly conserved should be scored higher than an equivalent frequency change when weakly conserved.  For example, the evolutionary "work" required to change an amino-acid frequency at a site from 0.7 to 0.75 is expected to be much greater than to change from 0.05 to 0.1 when the mean frequency of that amino-acid is near to 0.05. 

 

Step 2: The vector of amino-acid probabilities is converted to a vector of statistical energies (=[,,,...,]), where each term is the value for amino-acid  at site , and is given by the Boltzmann distribution: .   is probability of observing amino-acid  overall in the MSA, and serves as a common reference state for comparing different sites. 

 

Step 3: To measure coupling between a perturbation at  and any site , we calculate the difference energy vector, , where  is the statistical energy vector of site  in the parent alignment and  is that of site  in the sub-alignment derived from the perturbation at .  The scalar coupling energy () is the magnitude of this difference vector, and reports the combined effect of perturbation at  on all amino-acids at position .  If sites  and  are evolutionarily independent, the coupling energy is zero and is consistent with lack of interaction, but if the coupling energy is non-zero, the two sites interact to the extent measured by .  Calculation of  for all sites  given a perturbation at  is a mapping of how all sites in the protein feel the effect of perturbing  (e.g. Figure 1g, ref. [2]).

 

 

II.  Acceptance criteria for alignments and perturbations

 

A necessary condition of this analysis is that the multiple sequence alignment be large enough and diverse enough that it is statistically representative of the evolutionary constraints on the protein family.  We interpret this condition in the following way.  First, the MSA should be so diverse that several sites display amino-acid distributions near to the mean in all natural proteins (e.g. Figure 1d, ref [2]).  If so, we conclude that the MSA has experienced substantial evolution, and that the amino-acid distributions at sites are indeed reflective of the functional (and structural) constraints on the protein family.  Second, the MSA should be so large that random elimination of sequences from the alignment does not change the amino-acid frequencies at sites much (Figure 1, supplementary material).  If so, we can say that the MSA has reached a state of statistical equilibrium in sequence space, a necessary condition for applying Boltzmann statistics in the analysis.  As described in the manuscript, one additional condition constraints the perturbation analysis.  Perturbations at sites in the MSA should produce sub-alignments that are also large and diverse, such that they remain a representative subset of the parent MSA and do not substantially alter the state of statistical equilibrium.  If so, unconserved sites (which by definition are not evolutionarily constrained) should remain unconserved, and show coupling energies close to zero.  A simple method that implements this rule is to apply a sub-alignment size cutoff for each protein family such that perturbations must produce sub-alignments with more than the cutoff number of sequences.  The cutoff values were determined by graphing the average statistical coupling energy for five unconserved sites in each MSA upon random exclusion of varying numbers of sequences (Figure 2, supplementary material).  In fractions of the total alignment size, the perturbations chosen for study in each family produce sub-alignments that are: >0.32 (GPCRs), >0.49 (serine proteases), >0.68 (globins).   

 

III.  Matrix Assembly and Cluster Analysis

The statistical coupling matrix for each protein family represents all MSA perturbations that meet the criteria described above, and contains positions from N- to C-terminus as rows and perturbations (N- to C-terminus) as columns.  The clustering algorithm to determine co-evolving networks of positions is based on iterative clustering methods developed for DNA microarray analysis [3].  The overall idea is to carry out sequential rounds of two-dimensional clustering, each time extracting the submatrix that contains positions and perturbations that cluster together in the previous iteration and that contain large signals.  Thus, each iterative step is an attempt to refine the assignment of clusters by focusing the clustering algorithm around regions of positions  and perturbations  that show significant  values.  If, as in the serine proteases, two independent clusters are found at a given iteration, then each submatrix is extracted and subjected to independent two-dimensional clustering at the next round.  The city-block metric was used for calculating distances, and clustering was carried out using standard MATLAB 6.1 programs (The Mathworks Inc., Natick, MA).  Figure 3 of the supplementary section provides an example of this approach for the serine protease family.  Iterations are continued until no further refinement to clusters is found; protein families studied required three iterations at most to converge to stable clusters of positions

 

Multiple Sequence Alignments

These are the alignments of the globin family, serine protease family, and GPCR family.

 

 

References:

 

 

1.         Lockless, S.W. and R. Ranganathan, Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 1999. 286(5438): p. 295-9.

2.         Suel, G.M., et al., Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol, 2003: p. Advanced Online Publication.

3.         Getz, G., E. Levine, and E. Domany, Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci U S A, 2000. 97(22): p. 12079-84.