RANGANATHAN LAB
A summary of SCA calculations

Olivier Rivoire1, Stanislas Leibler1, and Rama Ranganathan2

(Dated: August 18, 2008)
I. INTRODUCTION

This document provides a short summary of the principles and implementations of the SCA method. A more thorough description of the assumptions, justifications and open questions related to the method will be given elsewhere in a formal publication.
II. PRELIMINARIES - MULTIPLE SEQUENCE ALIGNMENT AND FREQUENCIES

A. Frequencies

A multiple sequence alignment of M sequences of length L is represented by a binary array $x_{i,s}^{(a)}$, where $x_{i,s}^{(a)} = 1$ if sequence s has amino acid a at position i, and 0 otherwise (s = 1,…,M indexes sequences, i = 1,…,L indexes positions and a = 1,…,20 indexes amino acids). The frequency $f_i^{(a)}$ of an amino acid a at position i is computed as the number of sequences in the alignment having amino acid a at position i, divided by the total number of sequences, including those with a gap at i; it can also be written

$$f_i^{(a)} = \frac{1}{M}\sum_{s=1}^{M} x_{i,s}^{(a)} = \langle x_{i,s}^{(a)} \rangle_s,$$

where $x_{i,s}^{(a)}$ is averaged over all sequences s.
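The frequency definition above can be illustrated with a minimal Python sketch. The function name `frequencies`, the toy alignment, and the convention that any character outside the 20 one-letter codes is a gap are assumptions made for illustration; they are not part of the SCA toolbox.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes in alphabetical order

def frequencies(alignment):
    """Return f, where f[i][a] is the fraction of sequences with amino acid a at position i.

    Gaps (any character not in AMINO_ACIDS) count in the denominator M
    but are not attributed to any amino acid, as in the definition in the text.
    """
    M = len(alignment)
    L = len(alignment[0])
    f = [{a: 0.0 for a in AMINO_ACIDS} for _ in range(L)]
    for seq in alignment:
        for i, ch in enumerate(seq):
            if ch in f[i]:
                f[i][ch] += 1.0 / M
    return f

# Hypothetical toy alignment: M = 4 sequences of length L = 3.
toy = ["ACD", "AC-", "GCD", "AKD"]
f = frequencies(toy)
print(f[0]["A"], f[0]["G"], sum(f[2].values()))
```

At the third position the amino acid frequencies sum to less than one because one sequence carries a gap there.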
B. Binary approximation

In Halabi et al. (manuscript submitted), we also make use of a so-called "binary approximation" of the full alignment in which we consider only the most frequent amino acid $a_i$ at position i. The alignment is then represented by a binary array $x_{i,s}$, where $x_{i,s} = 1$ if sequence s contains the most frequent amino acid at position i, and 0 otherwise (i.e., $x_{i,s} = x_{i,s}^{(a_i)}$). This reduction is useful since, as calculated below (Sec. III), the positional conservation in the full alignment is well approximated by the positional conservation in the binary alignment. Other approaches to binary approximation are possible and are currently under investigation for improved approximation of positional conservation and correlation.
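The binary approximation can be sketched in a few lines of Python. The helper `binarize`, the gap character `-`, and the tie-breaking rule (the first amino acid encountered wins) are assumptions for illustration only; the sketch also assumes every column contains at least one non-gap character.

```python
from collections import Counter

def binarize(alignment, gap="-"):
    """Return (consensus, x) with x[s][i] = 1 if sequence s carries the
    most frequent amino acid at position i, and 0 otherwise."""
    L = len(alignment[0])
    consensus = []
    for i in range(L):
        # Count amino acids in column i, ignoring gaps.
        counts = Counter(seq[i] for seq in alignment if seq[i] != gap)
        consensus.append(counts.most_common(1)[0][0])
    x = [[1 if seq[i] == consensus[i] else 0 for i in range(L)]
         for seq in alignment]
    return consensus, x

# Hypothetical toy alignment.
toy = ["ACD", "AC-", "GCD", "AKD"]
consensus, x = binarize(toy)
print(consensus)
print(x)
```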
C. Background frequencies

As described in Sec. III, positional conservation is measured by the divergence of the observed frequency $f_i^{(a)}$ of amino acid a at position i from the background probability $q^{(a)}$ of amino acid a. This background probability is computed from the mean frequency of amino acid a in all proteins in the non-redundant database. Specifically, [equation: the 20 values of $q^{(a)}$], where amino acids are ordered according to the alphabetical order of their standard one-letter abbreviations.
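D. Gaps

Some calculations also require introducing a background probability for gaps. If γ represents the fraction of gaps in the alignment, a background probability distribution can be taken as

$$\bar{q}^{(0)} = \gamma, \qquad \bar{q}^{(a)} = (1-\gamma)\, q^{(a)} \quad (a = 1,\dots,20),$$

where a = 0 denotes a gap.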
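III. POSITION-SPECIFIC CONSERVATION - FIRST ORDER STATISTICS

A. Relative entropy

The conservation of amino acid a at position i, considered independently of other positions, is measured by the statistical quantity $D_i^{(a)}$, the so-called relative entropy (1) of $f_i^{(a)}$ given $q^{(a)}$. Its definition is derived from the probability $P_M[f_i^{(a)}]$ of observing $f_i^{(a)}$ in an alignment of M sequences given a background probability $q^{(a)}$:

$$P_M[f_i^{(a)}] = \frac{M!}{(M f_i^{(a)})!\,(M(1-f_i^{(a)}))!}\,\big(q^{(a)}\big)^{M f_i^{(a)}}\big(1-q^{(a)}\big)^{M(1-f_i^{(a)})} \simeq e^{-M D_i^{(a)}},$$

where

$$D_i^{(a)} = f_i^{(a)} \ln\frac{f_i^{(a)}}{q^{(a)}} + \big(1-f_i^{(a)}\big) \ln\frac{1-f_i^{(a)}}{1-q^{(a)}}.$$

The value of $D_i^{(a)}$ indicates how unlikely the observed frequency of amino acid a at position i would be if a occurred randomly with probability $q^{(a)}$, which serves as a definition of position-specific conservation.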
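As a hedged illustration, the relative entropy of one observed frequency against one background probability can be compared numerically with the exact binomial probability. The function names and the numbers (M = 1000, 600 occurrences, q = 0.05) are illustrative assumptions, not values from the source.

```python
import math

def relative_entropy(f, q):
    """D = f ln(f/q) + (1-f) ln((1-f)/(1-q)), with 0 ln 0 taken as 0."""
    d = 0.0
    if f > 0:
        d += f * math.log(f / q)
    if f < 1:
        d += (1 - f) * math.log((1 - f) / (1 - q))
    return d

def log_binomial_prob(k, M, q):
    """Natural log of the probability of k occurrences in M independent draws."""
    return (math.lgamma(M + 1) - math.lgamma(k + 1) - math.lgamma(M - k + 1)
            + k * math.log(q) + (M - k) * math.log(1 - q))

# Hypothetical example: amino acid seen in 600 of 1000 sequences, background 0.05.
M, k, q = 1000, 600, 0.05
D = relative_entropy(k / M, q)
exact = -log_binomial_prob(k, M, q) / M
print(D, exact)  # the two values agree up to a small finite-size correction
```

`math.lgamma` is used instead of factorials so that the exact probability can be evaluated for large M without overflow.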
B. Equivalence with previous definitions

$D_i^{(a)}$ is equivalent to measures of positional conservation introduced in previous reports of the SCA method. In essence, $D_i^{(a)}$ is the asymptotic limit for large M of $\Delta G_i^{\mathrm{stat},a}$ (MATLAB SCA Toolbox v1.0, as reported in Refs. (2–5)), and of $\Delta E_i^{\mathrm{stat},a}$ (SCA Toolbox v1.5, as reported in Ref. (6)):
The pre-factor -
C. Appropriate alignment sizes

A more precise relation between the probability $P_M[f_i^{(a)}]$ and the relative entropy $D_i^{(a)}$ is

$$-\frac{1}{M}\ln P_M[f_i^{(a)}] = D_i^{(a)} + \frac{\ln M}{2M} + O\!\left(\frac{1}{M}\right).$$

The values of $D_i^{(a)}$ are typically of order 1-3 (the scale is given by ln 20 ≈ 3), so the corrective term ln M∕(2M) can be neglected when M is of the order of 100 sequences or greater (M = 100 corresponds to ln M∕(2M) ≈ 0.02). This gives a lower bound on the size of alignments appropriate for SCA studies; provided one operates above this limit, the previous measurements of conservation are quantitatively close to $D_i^{(a)}$.
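The origin of the ln M∕(2M) correction can be sketched with Stirling's formula. This is a standard derivation added for illustration rather than taken from the source; $f$, $q$ and $D$ abbreviate $f_i^{(a)}$, $q^{(a)}$ and $D_i^{(a)}$, and $0 < f < 1$ is assumed.

```latex
\begin{align}
\ln n! &= n\ln n - n + \tfrac{1}{2}\ln(2\pi n) + O(1/n), \\
\ln P_M[f] &= \ln\frac{M!}{(Mf)!\,\big(M(1-f)\big)!} + Mf\ln q + M(1-f)\ln(1-q) \\
           &= -M D - \tfrac{1}{2}\ln\!\big(2\pi M f(1-f)\big) + O(1/M), \\
-\frac{1}{M}\ln P_M[f] &= D + \frac{\ln M}{2M} + \frac{\ln\!\big(2\pi f(1-f)\big)}{2M} + O(1/M^2).
\end{align}
```

The last two terms are of order 1∕M, so ln M∕(2M) is the leading correction for large M.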
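D. Overall positional conservation

An overall positional conservation $D_i$ taking into account the frequencies of all 20 amino acids can also be defined, but requires introducing a background probability $\bar{q}^{(a)}$ that includes gaps (see Sec. II.D). Denoting by $f_i^{(0)} = 1 - \sum_{a=1}^{20} f_i^{(a)}$ the fraction of gaps at position i, we can then write the probability of jointly observing the frequencies $(f_i^{(1)},\dots,f_i^{(20)})$ of each of the 20 possible amino acids at position i as

$$P_M[f_i^{(1)},\dots,f_i^{(20)}] = \frac{M!}{\prod_{a=0}^{20}\big(M f_i^{(a)}\big)!}\,\prod_{a=0}^{20}\big(\bar{q}^{(a)}\big)^{M f_i^{(a)}} \simeq e^{-M D_i},$$

where

$$D_i = \sum_{a=0}^{20} f_i^{(a)} \ln\frac{f_i^{(a)}}{\bar{q}^{(a)}}.$$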
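A minimal Python sketch of the overall conservation follows. It assumes the gap-inclusive background takes the form $\bar{q}^{(0)} = \gamma$ and $\bar{q}^{(a)} = (1-\gamma)q^{(a)}$; the uniform background `q_uniform` is a hypothetical placeholder and not the background distribution used by SCA.

```python
import math

def overall_conservation(f, q, gamma):
    """D_i = sum over a = 0..20 of f(a) ln(f(a) / qbar(a)), where a = 0 is the gap.

    f: dict amino acid -> frequency at one position (gaps not included);
    q: dict amino acid -> background probability (sums to 1);
    gamma: fraction of gaps in the alignment (assumed gap background).
    """
    qbar = {a: (1 - gamma) * q[a] for a in q}
    qbar["-"] = gamma
    full = dict(f)
    full["-"] = 1.0 - sum(f.values())  # gap frequency at this position
    # Terms with zero frequency contribute nothing (0 ln 0 = 0).
    return sum(p * math.log(p / qbar[a]) for a, p in full.items() if p > 0)

# Hypothetical uniform background, for illustration only.
q_uniform = {a: 0.05 for a in "ACDEFGHIKLMNPQRSTVWY"}

# A fully conserved position without gaps.
D_conserved = overall_conservation({"A": 1.0}, q_uniform, gamma=0.1)

# A position whose frequencies match the background exactly.
f_background = {a: 0.9 * 0.05 for a in q_uniform}
D_background = overall_conservation(f_background, q_uniform, gamma=0.1)
print(D_conserved, D_background)
```

A position that matches the background distribution, gaps included, has zero overall conservation, while a fully conserved position reaches the maximum set by the background probability of its amino acid.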
E. Relation between overall and amino acid-specific positional conservations

If considering gaps, we also define
F. Equivalence of various definitions in the binary approximation limit

Note that $D_i^{(a)}$,
behaves essentially as $D_i^{(a_i)}$.
IV. CORRELATED CONSERVATION - SECOND ORDER STATISTICS

The basic principle for defining a SCA correlation matrix is to weight correlations between pairs of positions by a function of the positional conservations. Different implementations of the SCA method correspond to different definitions of the weights, but are all based on this same principle (described below). The original implementation of SCA defined conserved correlations through a specific type of perturbation analysis on the sequence alignment (MATLAB SCA Toolbox 1.5, Sec. IV.E). The current implementation is based on a bootstrap (or, more precisely, jackknife) procedure, which amounts to using weights that are gradients of positional conservations (Sec. IV.D). With regard to distributions, SCA Toolbox 2.x computes the SCA correlation matrix using the bootstrap, and SCA Toolbox 3.0 using weights. In practice, versions 2.x and 3.0 report nearly identical values.
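A. Unweighted covariance matrix

In general, a covariance matrix reporting pair-wise correlations between amino acids at positions in a multiple sequence alignment can be defined as

$$C_{ij}^{(ab)} = f_{ij}^{(ab)} - f_i^{(a)} f_j^{(b)},$$

where $f_{ij}^{(ab)} = \langle x_{i,s}^{(a)} x_{j,s}^{(b)} \rangle_s$ is the joint frequency of amino acid a at position i and amino acid b at position j.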
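As an illustration, one entry of the unweighted covariance matrix, taken here as the joint frequency minus the product of the single-site frequencies, can be computed as follows. The function name and the toy alignments are assumptions for this sketch.

```python
def covariance(alignment, i, a, j, b):
    """C_ij^(ab) = f_ij^(ab) - f_i^(a) f_j^(b) for one pair of positions and amino acids."""
    M = len(alignment)
    fi = sum(seq[i] == a for seq in alignment) / M
    fj = sum(seq[j] == b for seq in alignment) / M
    fij = sum(seq[i] == a and seq[j] == b for seq in alignment) / M
    return fij - fi * fj

# Hypothetical alignments: positions fully coupled, then fully independent.
coupled = ["AC", "AC", "GK", "GK"]
independent = ["AC", "AK", "GC", "GK"]

c_pos = covariance(coupled, 0, "A", 1, "C")      # A and C always co-occur
c_neg = covariance(coupled, 0, "A", 1, "K")      # A and K never co-occur
c_zero = covariance(independent, 0, "A", 1, "C") # no association
print(c_pos, c_neg, c_zero)
```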
B. Weighted covariance matrix

As a general principle, SCA matrices can be obtained by weighting these covariance matrices by a functional φ of the positional conservations $D_i^{(a)}$ (or, more generally, a function of the frequencies $f_i^{(a)}$ and $q^{(a)}$ with properties similar to those of $D_i^{(a)}$):

$$\tilde{C}_{ij}^{(ab)} = \phi_i^{(a)}\, \phi_j^{(b)}\, C_{ij}^{(ab)}.$$
When the weights are taken as the gradients of the positional conservation (Sec. IV.D), $\phi_i^{(a)} = \partial D_i^{(a)}/\partial f_i^{(a)} = \ln\big[f_i^{(a)}(1-q^{(a)})\big/\big(q^{(a)}(1-f_i^{(a)})\big)\big]$, they rise even more steeply than $D_i^{(a)}$ as the frequencies of amino acids $f_i^{(a)}$ approach one, a property that reduces correlations arising from weakly conserved amino acids (since the gradient of $D_i^{(a)}$ approaches zero as $f_i^{(a)} \to q^{(a)}$), and emphasizes conserved correlations. This property addresses a central issue in assessing functional correlations in sequence alignments: the need to minimize the contribution of purely historical correlations between positions that derive from many small clades of sequences with close phylogenetic relationships.
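The behaviour of gradient weights can be checked numerically. The sketch below assumes the weight is the derivative of the binary relative entropy with respect to the frequency, ln[f(1-q)/(q(1-f))], and verifies that closed form against a finite difference; the background value q = 0.05 and the function names are illustrative assumptions.

```python
import math

def relative_entropy(f, q):
    """Binary relative entropy for 0 < f < 1."""
    return f * math.log(f / q) + (1 - f) * math.log((1 - f) / (1 - q))

def gradient_weight(f, q):
    """dD/df = ln[ f (1 - q) / (q (1 - f)) ] for 0 < f < 1."""
    return math.log(f * (1 - q) / (q * (1 - f)))

q = 0.05  # hypothetical background probability

# Finite-difference check of the closed-form gradient at f = 0.5.
h = 1e-6
numeric = (relative_entropy(0.5 + h, q) - relative_entropy(0.5 - h, q)) / (2 * h)
analytic = gradient_weight(0.5, q)

# The weight vanishes at f = q and keeps growing as f approaches 1.
for f in (0.05, 0.5, 0.9, 0.99, 0.999):
    print(f, relative_entropy(f, q), gradient_weight(f, q))
```

The relative entropy saturates at ln(1/q) as f approaches one, whereas the gradient diverges, which is why the weighting emphasizes strongly conserved amino acids.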
C. Reduced SCA matrix and binary approximation

In previous implementations of the SCA method, a reduced matrix
Within the range of validity of the binary approximation, this matrix corresponds to
D. Weights derived from the bootstrap procedure

If we introduce $D_{i,s}^{(a)}$, the positional conservation of amino acid a at position i for an alignment obtained by leaving out sequence s, the covariance matrix associated with this bootstrap procedure is:
i(a) is the relative entropy Di(a) with fi(a) replaced by fi(a). It thus follows that, to first order in
1∕M,
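The link between the leave-one-out (jackknife) procedure and gradient weights can be checked numerically on a binarized pair of positions. This is a sketch under stated assumptions: the jackknife covariance is taken as the plain covariance over s of the leave-one-out conservations, the overall factor 1/(M-1)^2 is kept explicit, and the two binary columns and q = 0.05 are made-up illustrative data.

```python
import math

def D(f, q):
    """Binary relative entropy for 0 < f < 1."""
    return f * math.log(f / q) + (1 - f) * math.log((1 - f) / (1 - q))

def cov(u, v):
    """Covariance of two equal-length lists, normalized by their length."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n

def jackknife_conservations(x, q):
    """Conservation recomputed M times, each time leaving out one sequence."""
    M, total = len(x), sum(x)
    return [D((total - xs) / (M - 1), q) for xs in x]

def phi(f, q):
    """Gradient of D with respect to f."""
    return math.log(f * (1 - q) / (q * (1 - f)))

q = 0.05
xi = [1] * 140 + [0] * 60              # hypothetical binary column i
xj = [1] * 120 + [0] * 60 + [1] * 20   # hypothetical binary column j
M = len(xi)
fi, fj = sum(xi) / M, sum(xj) / M

jack = cov(jackknife_conservations(xi, q), jackknife_conservations(xj, q))
approx = phi(fi, q) * phi(fj, q) * cov(xi, xj) / (M - 1) ** 2
ratio = jack / approx
print(jack, approx, ratio)
```

For M = 200 the two quantities agree closely, consistent with an expansion to first order in 1∕M.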
E. Weights derived from the original perturbation procedure

The implementation of the SCA method introduced originally in Lockless and Ranganathan was based on a perturbation to the amino acid distribution at one test site i to measure the difference in position-specific conservation of each amino acid at a second site j. In general, the perturbation consisted of restricting the test site to a highly prevalent amino acid $a_i$, a manipulation that extracts a sub-alignment with size equal to $f_i^{(a_i)} M$. For test sites in which sub-alignments retained sufficient size and diversity to be globally representative of the full alignment (i.e., $f_i^{(a_i)} M > 100$ sequences), a difference conservation value was calculated:
ln , corresponds to $D_j^{(b)}$. Given the assumption that perturbations lead to sub-alignments that are representative of the full alignment (a condition typically satisfied by only the most frequent amino acid at a subset of positions), $f_{j|i}^{(b)|a_i} \approx f_j^{(b)}$ for most amino acids b at positions j. We may therefore expand the second term,
- ln , by writing
V. DISTRIBUTIONS OF SCA

(1) SCA v1.5: The original SCA method as specified in Lockless and Ranganathan (2), with one modification that was used in all subsequent papers: the division of binomial probabilities by the mean probability of amino acids in the alignment is removed. This version is no longer in active use.

(2) SCA v2.5: The bootstrap-based approach for SCA. Position-specific conservation calculated as in Eq. (4) and correlations calculated as in Eq. (11). Matrix reduction per Eq. (14).

(3) SCA v3.0: The analytical calculation of correlations weighted by gradients of relative entropy. Position-specific conservation calculated as in Eq. (4) and correlations calculated as in Eqs. (11)-(12). An update to this version is expected shortly that will also include code for new statistical methods for identifying groups of correlated amino acid positions (the "sectors" in Halabi et al., submitted) and for assessing the statistical independence of sectors. For non-binarized alignments, matrix reduction is per Eq. (14). Current version.

Distributions are MATLAB Toolboxes that include various accessory code for data formatting, display, and analysis through hierarchical clustering. A tutorial with a sample alignment that illustrates the analytic process is also provided.
REFERENCES
[1] T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, New York, 1991.
[2] S. W. Lockless and R. Ranganathan. Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 286(5438):295–9, Oct 1999.
[3] Gürol M. Süel, Steve W. Lockless, Mark A. Wall, and Rama Ranganathan. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol, 10(1):59–69, Jan 2003.
[4] Mark E. Hatley, Steve W. Lockless, Scott K. Gibson, Alfred G. Gilman, and Rama Ranganathan. Allosteric determinants in guanine nucleotide-binding proteins. Proc Natl Acad Sci USA, 100(24):14445–50, Nov 2003.
[5] Andrew I. Shulman, Christopher Larson, David J. Mangelsdorf, and Rama Ranganathan. Structural determinants of allosteric ligand activation in RXR heterodimers. Cell, 116(3):417–29, Feb 2004.
[6] Michael Socolich, Steve W. Lockless, William P. Russ, Heather Lee, Kevin H. Gardner, and Rama Ranganathan. Evolutionary information for specifying a protein fold. Nature, 437(7058):512–8, Sep 2005.
[7] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. Chapman and Hall, 1994.