Supplementary Material
Suel, G.M., S.W. Lockless, M.A. Wall, R. Ranganathan, "Evolutionarily conserved networks of residues mediate allosteric
communication in proteins".
Nature Structural Biology. January 2003. 10(1): 59-69
Below, we provide a more detailed description of the analytic methods used in this paper, organized into three sections: (1) Statistical Coupling Analysis (SCA), in which we describe the core calculations comprising the sequence analysis, (2) Acceptance Criteria for SCA, in which we describe our current methods for validation of alignments and potential statistical perturbations, and (3) Matrix assembly and Clustering, in which we describe the methods and software for identifying co-evolving clusters of amino-acid positions from a statistical perturbation scan of a multiple sequence alignment. Also available are the multiple sequence alignments used in this paper.
A software package that will
allow interactive mapping between sequence, structure, and function, and that
will embed the statistical coupling analysis is currently under construction;
we expect that this will be available to the scientific community in the summer
of 2003. For now, code implementing these algorithms is available upon request (mail to: Rama Ranganathan), either as standard C or as a MATLAB toolbox. A site containing practical examples of this sequence-based
analysis to illustrate the features and conditions of applying the SCA will
become available as our time allows.
I. Statistical Coupling Analysis:
The calculation of
statistical coupling was carried out as described in Lockless and Ranganathan [1]. This
analysis quantitatively measures the change in the amino-acid distribution at
one position
in a multiple
sequence alignment (MSA) given a perturbation at another position
as a statistical
coupling energy between then two (
). The goal of
this analysis is to expose functional (rather than historical) relationships
between positions in the evolutionary record of a protein family; accordingly,
we apply two statistical criteria (described in the next section) to validate
the MSA for this analysis and one statistical criterion for validating sites on
the MSA for perturbation.
Step
1: Each position
in the MSA is
described as a 20-element vector of individual amino-acid frequencies (e.g.
=[
,
,
,...,
]). The frequency
vector is then converted to a vector of amino-acid probabilities (
), where now each element is the binomial probability of
observing each amino-acid residue at position
given its mean
frequency in all natural proteins.
This calculation makes two important contributions to the analysis: (1)
it provides a logical basis for dealing with cases where the frequency of an
amino-acid at a site is zero; in such cases, it reports the likelihood that the
amino-acid is not observed simply by chance given the total number of sequences
in the MSA, and (2) it accounts for the intuitive expectation that changes in
the frequency of an amino-acid when highly conserved should be scored higher
than an equivalent frequency change when weakly conserved. For example, the evolutionary "work" required to change an amino-acid frequency at a site from 0.7 to 0.75 is expected to be much greater than to change from 0.05 to 0.1 when the mean frequency of that amino-acid is near to 0.05.
Step
2: The vector of amino-acid probabilities is converted to a vector of
statistical energies (
=[
,
,
,...,
]), where each term is the value for amino-acid
at site
, and is given by the Boltzmann distribution:
.
is probability
of observing amino-acid
overall in the
MSA, and serves as a common reference state for comparing different sites.
Step
3: To measure coupling between a perturbation at
and any site
, we calculate the difference energy vector,
, where
is the
statistical energy vector of site
in the parent
alignment and
is that of site
in the sub-alignment derived from the perturbation at
. The scalar
coupling energy (
) is the magnitude of this difference vector, and reports the
combined effect of perturbation at
on all
amino-acids at position
. If sites
and
are
evolutionarily independent, the coupling energy is zero and is consistent with
lack of interaction, but if the coupling energy is non-zero, the two sites
interact to the extent measured by
. Calculation of
for all sites
given a
perturbation at
is a mapping of
how all sites in the protein feel the effect of perturbing
(e.g. Figure 1g, ref. [2]).
II. Acceptance criteria for alignments and
perturbations
A
necessary condition of this analysis is that the multiple sequence alignment be
large enough and diverse enough that it is statistically representative of the
evolutionary constraints on the protein family. We interpret this condition in the following way. First, the MSA should be so diverse
that several sites display amino-acid distributions near to the mean in all natural
proteins (e.g. Figure 1d, ref [2]). If so,
we conclude that the MSA has experienced substantial evolution, and that the
amino-acid distributions at sites are indeed reflective of the functional (and
structural) constraints on the protein family. Second, the MSA should be so large that random elimination of sequences from the alignment does not change the amino-acid frequencies at sites much (Figure 1, supplementary material). If so, we can say that the MSA has reached a state of
statistical equilibrium in sequence space, a necessary condition for applying
Boltzmann statistics in the analysis. As described in the manuscript, one additional condition
constraints the perturbation analysis.
Perturbations at sites in the MSA should produce sub-alignments that are
also large and diverse, such that they remain a representative subset of the
parent MSA and do not substantially alter the state of statistical
equilibrium. If so, unconserved
sites (which by definition are not evolutionarily constrained) should remain
unconserved, and show coupling energies close to zero. A simple method that implements this
rule is to apply a sub-alignment size cutoff for each protein family such that
perturbations must produce sub-alignments with more than the cutoff number of
sequences. The cutoff values were determined by graphing the average statistical coupling energy for five unconserved sites in each MSA upon random exclusion of varying numbers of sequences (Figure 2, supplementary material). In fractions of the total alignment size, the perturbations
chosen for study in each family produce sub-alignments that are: >0.32
(GPCRs), >0.49 (serine proteases), >0.68 (globins).
III. Matrix Assembly and Cluster Analysis
The
statistical coupling matrix for each protein family represents all MSA
perturbations that meet the criteria described above, and contains positions
from N- to C-terminus as rows and perturbations (N- to C-terminus) as
columns. The clustering algorithm
to determine co-evolving networks of positions is based on iterative clustering
methods developed for DNA microarray analysis [3]. The
overall idea is to carry out sequential rounds of two-dimensional clustering,
each time extracting the submatrix that contains positions and perturbations
that cluster together in the previous iteration and that contain large
signals. Thus, each iterative step
is an attempt to refine the assignment of clusters by focusing the clustering
algorithm around regions of positions
and
perturbations
that show
significant
values. If, as in the serine proteases, two
independent clusters are found at a given iteration, then each submatrix is
extracted and subjected to independent two-dimensional clustering at the next
round. The city-block metric was
used for calculating distances, and clustering was carried out using standard
MATLAB 6.1 programs (The Mathworks Inc., Natick, MA). Figure 3 of the supplementary section provides an example of this approach for the serine protease family. Iterations are continued until no further refinement to
clusters is found; protein families studied required three iterations at most
to converge to stable clusters of positions
These are the alignments of the globin family, serine protease family, and GPCR family.
References:
1. Lockless,
S.W. and R. Ranganathan, Evolutionarily conserved pathways of energetic
connectivity in protein families.
Science, 1999. 286(5438): p.
295-9.
2. Suel,
G.M., et al., Evolutionarily conserved networks of residues mediate allosteric
communication in proteins. Nat Struct
Biol, 2003: p. Advanced Online Publication.
3. Getz,
G., E. Levine, and E. Domany, Coupled two-way clustering analysis of gene
microarray data. Proc Natl Acad Sci U
S A, 2000. 97(22): p. 12079-84.