Total Number of Pairs of Amino Acids in the SwissProt Database


     This bio-recipe shows how to compute the total number of pairs of amino acids based on the entire contents of the SwissProt database.

     The pairs are counted in a matrix called C. For compactness, we restrict this matrix to have only the first M x M columns/rows.

     The possible purpose of this exercise would be to find out whether the distribution of amino acids is independent of its neighbours, or there are biases, i.e. amino acid pairs which are more frequent than expected.

define the location of the SwissProt database and load it
SwissProt := '/home/darwin/DB/SwissProt.latest':
ReadDb(SwissProt);
Peptide file(/home/darwin/DB/SwissProt.latest(321441404), 397539 entries, 
143289088 aminoacids)
define the counters and loop over all sequences in SwissProt
M := 8:
C := CreateArray(1..M,1..M):
for s in Sequences() do
    for i to length(s)-1 do
	i1 := AToInt(s[i]);
	i2 := AToInt(s[i+1]);
	if i1 <= M and i2 <= M then
	    C[i1,i2] := C[i1,i2] + 1 fi;
	od
    od:
bytes alloc=34800000, time=116.280
print a header line and the count matrix
for i to M do printf( '%7s', IntToAAA(i) ) od;
printf('\n');  print(C);
    Ala    Arg    Asn    Asp    Cys    Gln    Glu    Gly
 1199232  651482  399716  610689  152584  464438  782546  868555
  616016  550642  305859  425169  106826  331426  551496  517746
  420093  282161  303269  289762   84911  227997  350832  400683
  614970  388803  293778  439627  103911  255748  564212  541122
  135532  115547   82104  108912   47946   80194  116418  177996
  492657  342858  225070  260535   74513  339270  378821  360777
  811994  561776  424678  531312  114006  407837  828196  589244
  787913  572003  368965  527853  145384  376755  625593  809527

     The above is correct, but laborious to code and inefficient to run. If we use the index facilities of Darwin, we can search through its Pat index very efficiently. For example, all the occurrences of AG pairs in the entire database is given by

SearchSeqDb(AG);
PatEntry(3204090..4072644)

     The range contains exactly the same number of entries as C[1,8]. To make this size computation easier, we can write a simple function and verify 3 values from the above matrix:

Size := proc( x:string )
  t := SearchSeqDb(x);
  t[1,2]-t[1,1]+1
end:
Size(AA), Size(AG), Size(GA);
1199232, 868555, 787913

     The Size function can be used to compute pairs, but also single amino acid counts or occurrences of any subsequence.

Size(A), Size(AA), Size(AAA), Size(AAAA), Size(AAAAAAAA);
11671756, 1199232, 161066, 33091, 2505

     While the matrix C is roughly symmetric, we see a pair of entries (Ala-Gly) which are too different. To conclude whether the differences are significant or not, we should estimate what is the expected value and variance of this event happening purely at random. The probability of an Ala followed by a Gly (or viceversa if they are independent) is given by the product of their frequencies:

p := Size(A)/DB[TotAA] * Size(G)/DB[TotAA];
p := 0.00573539

     The number of occurrences of pairs is equal to N, the total number of amino acids minus the total number of sequences.

N := DB[TotAA] - DB[TotEntries];
N := 142891549

     The expected value of pairs follows a Poisson distribution with:

ExpectedValue=N*p, Variance=N*p*(1-p);
ExpectedValue = 819538.5883, Variance = 814838.2158

     The number of standard deviations away from its expected value, for Ala-Gly is:

(C[1,8]-N*p) / sqrt(N*p*(1-p));
54.3007

     And we can unequivocably conclude that the deviation is very significant and cannot be considered random.

     Finally, we would like to analyze which are the largest deviations of pairs of amino acids, say larger than 50 standard deviations. This can be written as:

for i1 to 20 do
  a1 := IntToA(i1);
  for i2 to 20 do
    a2 := IntToA(i2);
    p := Size(a1)/DB[TotAA] * Size(a2)/DB[TotAA];
    std := ( Size(a1.a2) - N*p ) / sqrt( N*p*(1-p) );
    if |std| > 50 then
	printf( '%s%s occurs %6d times, %6.2f std away from expected\n',
	    a1, a2, Size(a1.a2), std ) fi
    od od:
AA occurs 1199232 times, 258.78 std away from expected
AN occurs 399716 times, -106.81 std away from expected
AG occurs 868555 times,  54.30 std away from expected
AL occurs 1199725 times,  69.68 std away from expected
AP occurs 497728 times, -78.09 std away from expected
AS occurs 725187 times, -57.63 std away from expected
AY occurs 297297 times, -75.01 std away from expected
RR occurs 550642 times, 178.93 std away from expected
RS occurs 480328 times, -60.88 std away from expected
RT occurs 378247 times, -67.01 std away from expected
NA occurs 420093 times, -77.14 std away from expected
NR occurs 282161 times, -66.50 std away from expected
NN occurs 303269 times, 138.50 std away from expected
NE occurs 350832 times, -64.82 std away from expected
NI occurs 398527 times,  93.19 std away from expected
NP occurs 326466 times,  93.38 std away from expected
NY occurs 192524 times,  54.22 std away from expected
DR occurs 388803 times, -56.70 std away from expected
DQ occurs 255748 times, -92.52 std away from expected
DE occurs 564212 times,  59.84 std away from expected
DI occurs 516430 times,  86.49 std away from expected
DF occurs 337431 times,  68.53 std away from expected
DS occurs 477441 times, -53.05 std away from expected
DT occurs 379713 times, -54.21 std away from expected
DY occurs 265635 times,  82.06 std away from expected
CA occurs 135532 times, -73.35 std away from expected
CC occurs  47946 times, 112.57 std away from expected
CE occurs 116418 times, -55.06 std away from expected
CG occurs 177996 times,  92.83 std away from expected
CM occurs  35803 times, -59.16 std away from expected
QR occurs 342858 times,  54.86 std away from expected
QD occurs 260535 times, -83.87 std away from expected
QQ occurs 339270 times, 240.79 std away from expected
QG occurs 360777 times, -61.11 std away from expected
QS occurs 329910 times, -78.37 std away from expected
EN occurs 424678 times,  53.39 std away from expected
EC occurs 114006 times, -61.59 std away from expected
EE occurs 828196 times, 223.26 std away from expected
EG occurs 589244 times, -107.98 std away from expected
EI occurs 617644 times,  62.94 std away from expected
EK occurs 682759 times, 155.11 std away from expected
EF occurs 333377 times, -65.63 std away from expected
EP occurs 343154 times, -172.30 std away from expected
ES occurs 513672 times, -160.26 std away from expected
GN occurs 368965 times, -62.56 std away from expected
GE occurs 625593 times, -63.73 std away from expected
GG occurs 809527 times, 120.43 std away from expected
GL occurs 922357 times, -51.90 std away from expected
GK occurs 631351 times,  51.62 std away from expected
GP occurs 402613 times, -112.51 std away from expected
HA occurs 232733 times, -64.48 std away from expected
HD occurs 148637 times, -66.81 std away from expected
HC occurs  57242 times,  50.45 std away from expected
HE occurs 173290 times, -99.67 std away from expected
HH occurs 107493 times, 120.44 std away from expected
HK occurs 154634 times, -85.38 std away from expected
HF occurs 147213 times,  57.86 std away from expected
HP occurs 199662 times, 110.84 std away from expected
HY occurs 113898 times,  58.95 std away from expected
IN occurs 385490 times,  70.94 std away from expected
ID occurs 518596 times,  89.69 std away from expected
IQ occurs 304933 times, -53.47 std away from expected
IL occurs 768740 times, -55.39 std away from expected
IM occurs 160421 times, -96.13 std away from expected
LA occurs 1188301 times,  58.87 std away from expected
LI occurs 736800 times, -90.79 std away from expected
LM occurs 281087 times, -89.91 std away from expected
LP occurs 708119 times,  59.14 std away from expected
LS occurs 971844 times,  52.68 std away from expected
LY occurs 364539 times, -63.78 std away from expected
LV occurs 886279 times, -58.27 std away from expected
KN occurs 386891 times,  77.66 std away from expected
KC occurs  99817 times, -56.64 std away from expected
KE occurs 651617 times, 113.64 std away from expected
KG occurs 516615 times, -97.84 std away from expected
KL occurs 765985 times, -52.34 std away from expected
KK occurs 647568 times, 218.45 std away from expected
KM occurs 177984 times, -54.25 std away from expected
KF occurs 272647 times, -93.45 std away from expected
KS occurs 505544 times, -73.10 std away from expected
MA occurs 328488 times,  91.13 std away from expected
MY occurs  83787 times, -53.68 std away from expected
FA occurs 399958 times, -76.72 std away from expected
FR occurs 272697 times, -58.72 std away from expected
FD occurs 338645 times,  70.75 std away from expected
FC occurs  95177 times,  58.66 std away from expected
FK occurs 292398 times, -58.82 std away from expected
FM occurs 112586 times, -57.16 std away from expected
FF occurs 245104 times,  65.11 std away from expected
FS occurs 422605 times,  87.67 std away from expected
PR occurs 333520 times, -68.85 std away from expected
PE occurs 518430 times,  86.60 std away from expected
PG occurs 524621 times,  63.81 std away from expected
PI occurs 341604 times, -98.48 std away from expected
PK occurs 342745 times, -92.60 std away from expected
PM occurs 132108 times, -79.52 std away from expected
PP occurs 410471 times, 148.38 std away from expected
PS occurs 491698 times,  54.76 std away from expected
PV occurs 501078 times,  52.53 std away from expected
SA occurs 695864 times, -91.01 std away from expected
SE occurs 566921 times, -93.64 std away from expected
SG occurs 724040 times,  65.40 std away from expected
SK occurs 521045 times, -52.35 std away from expected
SM occurs 194484 times, -72.87 std away from expected
SS occurs 828416 times, 243.51 std away from expected
TR occurs 377140 times, -68.72 std away from expected
TQ occurs 276098 times, -50.72 std away from expected
TE occurs 456530 times, -83.13 std away from expected
TG occurs 588678 times,  67.30 std away from expected
TK occurs 389093 times, -91.64 std away from expected
TM occurs 154785 times, -69.14 std away from expected
TP occurs 443004 times, 127.81 std away from expected
TT occurs 459818 times,  76.88 std away from expected
TV occurs 570763 times,  67.14 std away from expected
WQ occurs  75148 times,  52.16 std away from expected
WL occurs 172695 times,  54.54 std away from expected
WP occurs  59497 times, -55.88 std away from expected
WW occurs  24957 times,  59.56 std away from expected
YA occurs 290259 times, -87.08 std away from expected
YM occurs  82598 times, -57.42 std away from expected
YF occurs 190890 times,  70.77 std away from expected
YY occurs 148048 times,  72.45 std away from expected
VQ occurs 340335 times, -74.72 std away from expected
VT occurs 568580 times,  64.12 std away from expected
VV occurs 736042 times,  88.08 std away from expected

     This confirms and quantifies the well known fact that pairs of amino acids are not independent of each other.

© 2008 by Gaston Gonnet, Informatik, ETH Zurich

Index of bio-recipes

Last updated on Wed Dec 17 16:08:46 2008 by GhG