Unbiased selection of sample alignments

     This bio-recipe shows how to select random alignments to study certain properties in a way that the sample is not biased by the peculiar distribution of sequences in the database. We will then use this technique to sample alignments in a particular PAM distance range and study the distribution of gap lengths. The distribution of gap lengths is found to be Zipfian and this leads to a new deletion cost formula based on a logarithmic approximation. The discussion of alignments based on a logarithmic deletion cost is done in another bio-recipe with that subject.

     First, lets understand what is wrong with finding alignments just by taking pairs of randomly selected sequences until we satisfy some criteria. The main problem biasing this method is the existence of many very close homologues in the database. Suppose that we find an interesting alignment between sequences A and B and another one between sequences C and D. Suppose that neither A nor B have close homologues, but C has 9 very close homologues and D has also 9 very close homologues. It is clear now that the alignment of C and D is going to be overrepresented in any random sample. First it is 100 times more likely to be selected at random. Secondly, if the sample is large, it is likely to be represented more than once. The purpose of the first part of this bio-recipe is to build a sample of alignments free from this bias. If you want to follow it, you might have to install the SwissProt data file which can be downloaded here.

We need a protein database so we define the location of the SwissProt database and then load it. We also create Dayhoff matrices to be able to align sequences.

SwissProt := 'SampleAlignments/SwissProt.Z':
Reading 169638448 characters from file SampleAlignments/SwissProt.Z
Pre-processing input (peptides)
163235 sequences within 163235 entries considered
Building new Pat index in file SampleAlignments/SwissProt.tree with 59631787 entries
Pat index with 59631787 entries
 sorted, from "A</SEQ></E>\n<E><ID>" to "YYYYYYYYYYYYGMSSIIF"
Peptide file(SampleAlignments/SwissProt.Z(169638448), 163235 entries, 59631787

First we initialize several variables. The set where we will store the candidate sequences to be compared Cand , the set of good Alignments found so far SampleAl and the limits of significance and PAM values to the selected alignments

Cand := SampleAl := {}:
minSim := 150:
loPAM := 40:
hiPAM := 80:

We will run the loop until we collect the desired number of alignments (in this case just one to make the bio-recipe run faster).

while length(SampleAl) < 1 do
   e := Rand(Entry);
   s := Sequence(e);
   if length(s) < 80 or length(s) > 900 then next fi;

        Select a random Entry, and discard it its length is not between 80 and 900. Sequences too short or too long are unlikely to be desirable for this purpose. Now test the new sequence against all the ones which were selected before.

   tooClose := false:  possible := {}:
   for c in Cand do
      al := Align(Sequence(c),s);
      if al[Score] > minSim then
	  al := Align(al[Seq1],al[Seq2],DMS);
	  if al[PamDistance] < loPAM then
	       tooClose := true;  break
	  elif al[PamDistance] <= hiPAM then
	       possible := append(possible,al) fi

        If the new sequence is too close to an existing sequence, then discard it, else add any alignments that it could have contributed.

   if not tooClose then
	SampleAl := SampleAl union possible;
        Cand := append(Cand,e);
length(Cand), length(SampleAl);
128, 1

     The above loop will take considerable time to collect hundreds of matches, so for the purpose of this bio-recipe we will read a file which has been prepared exactly as above but with many more alignments.

ReadProgram( '/home/darwin/Repositories/teaching/bio-recipes/SampleAlignments/data.drw' );

And we see that we have 333 alignments. All these alignments, because of the proximity restriction of the sequences, are essentially independent. We print one as a sample.

lengths=539,545 simil=2064.3, PAM_dist=71, identity=48.7%
ID=CH60_STRA3   AC=Q8CX22;   DE=60 kDa chaperonin (Protein Cpn60) (groEL
protein).   OS=Streptococcus agalactiae (serotype III).   OC=Bacteria;
Firmicutes; Lactobacillales; Streptococcaceae; Streptococcus.   KW=ATP-binding;
Chaperone; Complete proteome.
ID=HS60_SCHPO   AC=Q09864; Q10285;   DE=Heat shock protein 60, mitochondrial
precursor (HSP60).   OS=Schizosaccharomyces pombe (Fission yeast).  
OC=Eukaryota; Fungi; Ascomycota; Schizosaccharomycetes; Schizosaccharomycetales;
Schizosaccharomycetaceae; Schizosaccharomyces.   KW=ATP-binding; Chaperone; Heat
shock; Mitochondrion; Transit peptide.

!!||||||||||||:||..|.:!||.||.||:.:||||:.||...|| |:. .:.!!..|.|:|||.!|:..! .!||.:

::||.|||.!||||!:|.|.:..||||.|||:|||||:|.|:!|| :....|:|||.||:!!||!|.!|!||| || ..:

..|||:|||!|||||||...!|||:||.:.|||!|||||||.|!.||.|:|!||...|..!!:.:.::.|. : ||....

|||.|!.|!|.:||||...!.:|...|!..| !...!!!!!||||||||||:||!||||||.:!|.|:.|.| ||.||||

|.!|||.||!:.|.||::|.     |:|!|      | |::.|..||.!|:..|.:.|. |||.||::|!.!||:   .|

...|!:.|...:||! ..|!:||:||.|!.|.:|:.||||: |||..!.:  .||...||| .|.|  |||:

     We will now study the distribution of the gap lengths in these alignments. This is almost exactly the procedure followed in the paper by Gaston. H. Gonnet, Mark A. Cohen, and Steven A. Benner. "Empirical and structural models for insertions and deletions in the divergent evolution of proteins". Journal of Molecular Biology, 229:1065-1082, 1993. This paper shows that the distribution of the gap length is Zipfian. A Zipfian probability distribution follows:

     Pr[i] = ----------
             i  Zeta(t)

where t>1 is a parameter of the distribution and Zeta(t) is needed to normalize the probabilities so that they add to one.

     The next computational task is to collect the gap lengths. We will take from each sequence in each alignment the gaps and tally the number of occurrences per gap length. This is done with the following statements.

G := CreateArray(1..50):
for al in SampleAl do
   dps := DynProgStrings(al);
   for s in [ dps[2], dps[3] ] do
      while length(s) > 0 do

          To find each gap we search for the first underscore and then count how many there are. Notice the operations on text shifting, done by adding an integer to a text (left addition shifts the left end, right addition shifts the right end).

	i := SearchString( '_', s );
	 if i<0 then break fi;
	 s := i+s;
	 for i to length(s)-1 while s[i+1]='_' do od;
	 s := i+s;
	 if i <= length(G) then G[i] := G[i] + 1 fi
[438, 142, 82, 54, 54, 20, 22, 18, 11, 9, 9, 8, 6, 7, 6, 3, 2, 0, 1, 2]

and we can see that the first 20 gaps decrease smoothly. To visually accept the Zipfian distribution as opposed to the exponential distribution we will produce two plots. (Notice that accepting a linear cost penalty function for deletions, is equivalent to accepting that the distribution of gaps is exponential.) The first one is a log-log plot, where a purely Zipfian distribution will show as a straight line.

DrawPlot( [ seq( [ln(i),ln(G[i])], i=1..15 )] ):

log-log gap length distribution

     An exponential distribution would show as a straight line in a semi-log plot of the distribution. The semi-log plot is definitely curved, so we reject the exponential distribution in favour of a Zipfian one. Of course, more data would show the difference more markedly.

DrawPlot( [ seq( [i,ln(G[i])], i=1..15 )] ):

linear-log gap length distribution

     To show the agreement between the data and the Zipfian distribution we show the expected values and cumulatives for the first 20 gap lengths. We will use the value t=1.5 that was proposed in the paper and as it can be seen, fits the data very well.

k := sum(G[1..20]) / sum( 1/i^1.5, i=1..20 );
for i to 20 do
   printf( '%3d    %3d %6.1f    %3d %6.1f\n', i,
      G[i], k/i^1.5,
      sum(G[j],j=1..i), k * sum(1/j^1.5,j=1..i)) od;
k := 411.8521
  1    438  411.9    438  411.9
  2    142  145.6    580  557.5
  3     82   79.3    662  636.7
  4     54   51.5    716  688.2
  5     54   36.8    770  725.0
  6     20   28.0    790  753.1
  7     22   22.2    812  775.3
  8     18   18.2    830  793.5
  9     11   15.3    841  808.8
 10      9   13.0    850  821.8
 11      9   11.3    859  833.1
 12      8    9.9    867  843.0
 13      6    8.8    873  851.8
 14      7    7.9    880  859.6
 15      6    7.1    886  866.7
 16      3    6.4    889  873.2
 17      2    5.9    891  879.0
 18      0    5.4    891  884.4
 19      1    5.0    892  889.4
 20      2    4.6    894  894.0

© 2011 by Gaston Gonnet, Informatik, ETH Zurich

Index of bio-recipes

Last updated on Wed Oct 5 21:34:22 2011 by GhG