This bio-recipe shows how to compute the total number of pairs of amino acids based on the entire contents of the SwissProt database.
The pairs are counted in a matrix called C. For compactness, we restrict this matrix to have only the first M x M columns/rows.
The possible purpose of this exercise would be to find out whether the distribution of amino acids is independent of its neighbours, or there are biases, i.e. amino acid pairs which are more frequent than expected.
define the location of the SwissProt database and load itSwissProt := '/home/darwin/DB/SwissProt.latest': ReadDb(SwissProt);
Peptide file(/home/darwin/DB/SwissProt.latest(321441404), 397539 entries, 143289088 aminoacids)define the counters and loop over all sequences in SwissProt
M := 8: C := CreateArray(1..M,1..M): for s in Sequences() do for i to length(s)-1 do i1 := AToInt(s[i]); i2 := AToInt(s[i+1]); if i1 <= M and i2 <= M then C[i1,i2] := C[i1,i2] + 1 fi; od od:
bytes alloc=34800000, time=116.280print a header line and the count matrix
for i to M do printf( '%7s', IntToAAA(i) ) od; printf('\n'); print(C);
Ala Arg Asn Asp Cys Gln Glu Gly 1199232 651482 399716 610689 152584 464438 782546 868555 616016 550642 305859 425169 106826 331426 551496 517746 420093 282161 303269 289762 84911 227997 350832 400683 614970 388803 293778 439627 103911 255748 564212 541122 135532 115547 82104 108912 47946 80194 116418 177996 492657 342858 225070 260535 74513 339270 378821 360777 811994 561776 424678 531312 114006 407837 828196 589244 787913 572003 368965 527853 145384 376755 625593 809527
The above is correct, but laborious to code and inefficient to run. If we use the index facilities of Darwin, we can search through its Pat index very efficiently. For example, all the occurrences of AG pairs in the entire database is given by
SearchSeqDb(AG);
PatEntry(3204090..4072644)
The range contains exactly the same number of entries as C[1,8]. To make this size computation easier, we can write a simple function and verify 3 values from the above matrix:
Size := proc( x:string ) t := SearchSeqDb(x); t[1,2]-t[1,1]+1 end: Size(AA), Size(AG), Size(GA);
1199232, 868555, 787913
The Size function can be used to compute pairs, but also single amino acid counts or occurrences of any subsequence.
Size(A), Size(AA), Size(AAA), Size(AAAA), Size(AAAAAAAA);
11671756, 1199232, 161066, 33091, 2505
While the matrix C is roughly symmetric, we see a pair of entries (Ala-Gly) which are too different. To conclude whether the differences are significant or not, we should estimate what is the expected value and variance of this event happening purely at random. The probability of an Ala followed by a Gly (or viceversa if they are independent) is given by the product of their frequencies:
p := Size(A)/DB[TotAA] * Size(G)/DB[TotAA];
p := 0.00573539
The number of occurrences of pairs is equal to N, the total number of amino acids minus the total number of sequences.
N := DB[TotAA] - DB[TotEntries];
N := 142891549
The expected value of pairs follows a Poisson distribution with:
ExpectedValue=N*p, Variance=N*p*(1-p);
ExpectedValue = 819538.5883, Variance = 814838.2158
The number of standard deviations away from its expected value, for Ala-Gly is:
(C[1,8]-N*p) / sqrt(N*p*(1-p));
54.3007
And we can unequivocably conclude that the deviation is very significant and cannot be considered random.
Finally, we would like to analyze which are the largest deviations of pairs of amino acids, say larger than 50 standard deviations. This can be written as:
for i1 to 20 do a1 := IntToA(i1); for i2 to 20 do a2 := IntToA(i2); p := Size(a1)/DB[TotAA] * Size(a2)/DB[TotAA]; std := ( Size(a1.a2) - N*p ) / sqrt( N*p*(1-p) ); if |std| > 50 then printf( '%s%s occurs %6d times, %6.2f std away from expected\n', a1, a2, Size(a1.a2), std ) fi od od:
AA occurs 1199232 times, 258.78 std away from expected AN occurs 399716 times, -106.81 std away from expected AG occurs 868555 times, 54.30 std away from expected AL occurs 1199725 times, 69.68 std away from expected AP occurs 497728 times, -78.09 std away from expected AS occurs 725187 times, -57.63 std away from expected AY occurs 297297 times, -75.01 std away from expected RR occurs 550642 times, 178.93 std away from expected RS occurs 480328 times, -60.88 std away from expected RT occurs 378247 times, -67.01 std away from expected NA occurs 420093 times, -77.14 std away from expected NR occurs 282161 times, -66.50 std away from expected NN occurs 303269 times, 138.50 std away from expected NE occurs 350832 times, -64.82 std away from expected NI occurs 398527 times, 93.19 std away from expected NP occurs 326466 times, 93.38 std away from expected NY occurs 192524 times, 54.22 std away from expected DR occurs 388803 times, -56.70 std away from expected DQ occurs 255748 times, -92.52 std away from expected DE occurs 564212 times, 59.84 std away from expected DI occurs 516430 times, 86.49 std away from expected DF occurs 337431 times, 68.53 std away from expected DS occurs 477441 times, -53.05 std away from expected DT occurs 379713 times, -54.21 std away from expected DY occurs 265635 times, 82.06 std away from expected CA occurs 135532 times, -73.35 std away from expected CC occurs 47946 times, 112.57 std away from expected CE occurs 116418 times, -55.06 std away from expected CG occurs 177996 times, 92.83 std away from expected CM occurs 35803 times, -59.16 std away from expected QR occurs 342858 times, 54.86 std away from expected QD occurs 260535 times, -83.87 std away from expected QQ occurs 339270 times, 240.79 std away from expected QG occurs 360777 times, -61.11 std away from expected QS occurs 329910 times, -78.37 std away from expected EN occurs 424678 times, 53.39 std away from expected EC occurs 114006 times, -61.59 std away from expected EE occurs 828196 times, 223.26 std away from expected EG occurs 589244 times, -107.98 std away from expected EI occurs 617644 times, 62.94 std away from expected EK occurs 682759 times, 155.11 std away from expected EF occurs 333377 times, -65.63 std away from expected EP occurs 343154 times, -172.30 std away from expected ES occurs 513672 times, -160.26 std away from expected GN occurs 368965 times, -62.56 std away from expected GE occurs 625593 times, -63.73 std away from expected GG occurs 809527 times, 120.43 std away from expected GL occurs 922357 times, -51.90 std away from expected GK occurs 631351 times, 51.62 std away from expected GP occurs 402613 times, -112.51 std away from expected HA occurs 232733 times, -64.48 std away from expected HD occurs 148637 times, -66.81 std away from expected HC occurs 57242 times, 50.45 std away from expected HE occurs 173290 times, -99.67 std away from expected HH occurs 107493 times, 120.44 std away from expected HK occurs 154634 times, -85.38 std away from expected HF occurs 147213 times, 57.86 std away from expected HP occurs 199662 times, 110.84 std away from expected HY occurs 113898 times, 58.95 std away from expected IN occurs 385490 times, 70.94 std away from expected ID occurs 518596 times, 89.69 std away from expected IQ occurs 304933 times, -53.47 std away from expected IL occurs 768740 times, -55.39 std away from expected IM occurs 160421 times, -96.13 std away from expected LA occurs 1188301 times, 58.87 std away from expected LI occurs 736800 times, -90.79 std away from expected LM occurs 281087 times, -89.91 std away from expected LP occurs 708119 times, 59.14 std away from expected LS occurs 971844 times, 52.68 std away from expected LY occurs 364539 times, -63.78 std away from expected LV occurs 886279 times, -58.27 std away from expected KN occurs 386891 times, 77.66 std away from expected KC occurs 99817 times, -56.64 std away from expected KE occurs 651617 times, 113.64 std away from expected KG occurs 516615 times, -97.84 std away from expected KL occurs 765985 times, -52.34 std away from expected KK occurs 647568 times, 218.45 std away from expected KM occurs 177984 times, -54.25 std away from expected KF occurs 272647 times, -93.45 std away from expected KS occurs 505544 times, -73.10 std away from expected MA occurs 328488 times, 91.13 std away from expected MY occurs 83787 times, -53.68 std away from expected FA occurs 399958 times, -76.72 std away from expected FR occurs 272697 times, -58.72 std away from expected FD occurs 338645 times, 70.75 std away from expected FC occurs 95177 times, 58.66 std away from expected FK occurs 292398 times, -58.82 std away from expected FM occurs 112586 times, -57.16 std away from expected FF occurs 245104 times, 65.11 std away from expected FS occurs 422605 times, 87.67 std away from expected PR occurs 333520 times, -68.85 std away from expected PE occurs 518430 times, 86.60 std away from expected PG occurs 524621 times, 63.81 std away from expected PI occurs 341604 times, -98.48 std away from expected PK occurs 342745 times, -92.60 std away from expected PM occurs 132108 times, -79.52 std away from expected PP occurs 410471 times, 148.38 std away from expected PS occurs 491698 times, 54.76 std away from expected PV occurs 501078 times, 52.53 std away from expected SA occurs 695864 times, -91.01 std away from expected SE occurs 566921 times, -93.64 std away from expected SG occurs 724040 times, 65.40 std away from expected SK occurs 521045 times, -52.35 std away from expected SM occurs 194484 times, -72.87 std away from expected SS occurs 828416 times, 243.51 std away from expected TR occurs 377140 times, -68.72 std away from expected TQ occurs 276098 times, -50.72 std away from expected TE occurs 456530 times, -83.13 std away from expected TG occurs 588678 times, 67.30 std away from expected TK occurs 389093 times, -91.64 std away from expected TM occurs 154785 times, -69.14 std away from expected TP occurs 443004 times, 127.81 std away from expected TT occurs 459818 times, 76.88 std away from expected TV occurs 570763 times, 67.14 std away from expected WQ occurs 75148 times, 52.16 std away from expected WL occurs 172695 times, 54.54 std away from expected WP occurs 59497 times, -55.88 std away from expected WW occurs 24957 times, 59.56 std away from expected YA occurs 290259 times, -87.08 std away from expected YM occurs 82598 times, -57.42 std away from expected YF occurs 190890 times, 70.77 std away from expected YY occurs 148048 times, 72.45 std away from expected VQ occurs 340335 times, -74.72 std away from expected VT occurs 568580 times, 64.12 std away from expected VV occurs 736042 times, 88.08 std away from expected
This confirms and quantifies the well known fact that pairs of amino acids are not independent of each other.
© 2008 by Gaston Gonnet, Informatik, ETH Zurich
Last updated on Wed Dec 17 16:08:46 2008 by GhG