This bio-recipe shows how to search for a sequence alignment with special characteristics.
As an example we want to find a protein sequence from humans and one from spinach that have a similarity score greater than 150.
First we pick at random two protein sequences, one from humans, and one from spinach. Then we align them using the local alignment method. The iteration is stopped if the score of the alignment is greater than 150.
Define the location of the SwissProt database and then load it. We also create Dayhoff matrices to be able to align sequences.
SwissProt := '~cbrg/DB/SwissProt.Z': ReadDb(SwissProt); CreateDayMatrices();
Reading 169638448 characters from file /home/cbrg/DB/SwissProt.Z Pre-processing input (peptides) 163235 sequences within 163235 entries considered Peptide file(/home/cbrg/DB/SwissProt.Z(169638448), 163235 entries, 59631787 aminoacids)
Start the iteration that searches the sequence alignment
with the special characteristics.
for iter do
Select first sequence (from humans)
do e1 := Rand(Entry); if SearchString('homo sapiens',e1) > 0 then break fi od;
Select second sequence (from spinach)
do e2 := Rand(Entry); if SearchString('spinach',e2) > 0 then break fi od;
Do the alignment and stop if the score is > 150
a := Align( Sequence(e1), Sequence(e2), Local ); if a[Score] > 150 then break fi od: print(iter,a);
bytes alloc=35600000, time=7.170 bytes alloc=40800000, time=13.430 bytes alloc=40800000, time=19.660 bytes alloc=40800000, time=25.900 bytes alloc=40800000, time=32.180 bytes alloc=40800000, time=38.720 bytes alloc=40800000, time=44.830 bytes alloc=40800000, time=50.870 bytes alloc=40800000, time=56.840 bytes alloc=40800000, time=62.830 bytes alloc=40800000, time=69.400 bytes alloc=40800000, time=75.510 bytes alloc=40800000, time=81.900 bytes alloc=40800000, time=88.000 bytes alloc=40800000, time=94.240 bytes alloc=40800000, time=100.640 4994 lengths=109,110 simil=185.5, PAM_dist=250, identity=28.2% ID=AAC3_HUMAN AC=Q08043; DE=Alpha-actinin 3 (Alpha actinin skeletal muscle isoform 3) (F-actin cross linking protein). OS=Homo sapiens (Human). OC=Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. KW=Actin-binding; Multigene family; Polymorphism; Repeat. ID=CALM_SPIOL AC=P04353; DE=Calmodulin (CaM). OS=Spinacia oleracea (Spinach). OC=Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Caryophyllales; Amaranthaceae; Spinacia. KW=Acetylation; Calcium-binding; Direct protein sequencing; Methylation; Repeat. AKGLSQEQLNEFRASFNHFDRKRNGMMEPDDFRACLISMGYDLGEVEFARIMTMVDPNAAGVVTFQAFIDFMTRETAETD | |::||!.||!.:|:.||!:.:|.:.::!:.:.!.|!|.:..|:|:..:::.||:::.|:!.|..|!::|:|:..!|| AXXLTDEQIAEFKEAFSLFDKDGDGCITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLNLMARKMKDTD TTEQVVASFKILAGDKN_YITPEELRRELP :.|::..:|!!:..|:| !|::.|||:.!: SEEELKEAFRVFDKDQNGFISAAELRHVMT
Although the above search has done its job, it is
not very efficient.
It required more than 4000 iterations and about 100 secs to find a suitable match.
This is partly due to the way this was coded.
Once that we found a sequence for each, we align them and if
not suitable, we throw both of them out.
It would be much better to keep a set of human sequences and
a set of spinach sequences and test every new human/spinach
sequence against all the sequences in the other set.
This can be coded as follows:
Spinach := Human := {}: for iter do e := Rand(Entry); if SearchString(spinach,e) > 0 then als := zip( Align(Human,e) ); Spinach := append(Spinach,e)
After a random Entry is selected, if this entry belongs to spinach, we align it to all the sequences from Human. The alignments are stored in als. The zip function will notice that Human is a set and will compute a list with the Alignments of each of the components of Human. Next we add the new sequence to the set Spinach. The same is done if the sequence is from a human. At the end of the if statement we compute the maximum score of all the alignments in the list.
elif SearchString('homo sapiens',e) > 0 then als := zip( Align(e,Spinach) ); Human := append(Human,e) else next fi; if max( seq( a[Score], a=als )) > 150 then break fi od: for a in als do if a[Score] > 150 then print(a) fi od;
lengths=616,617 simil=1906.4, PAM_dist=250, identity=65.5% ID=HS7C_HUMAN AC=P11142; Q9H3R6; DE=Heat shock cognate 71 kDa protein. OS=Homo sapiens (Human). OC=Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. KW=Alternative splicing; ATP-binding; Chaperone; Direct protein sequencing; Heat shock; Multigene family; Nuclear protein. ID=BIP_SPIOL AC=Q42434; DE=Luminal binding protein precursor (BiP) (78 kDa glucose-regulated protein homolog) (GRP 78). OS=Spinacia oleracea (Spinach). OC=Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Caryophyllales; Amaranthaceae; Spinacia. KW=ATP-binding; Endoplasmic reticulum; Signal. GPAVGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFTDTERLIGDAAKNQVAMNPTNTVFDAKRLIGRRFDDAVV |::!|||||||||||||!::|||||||||||||.|||!||||:.|||||!|||||:|.||.:|!||:||||||!|!|..| GTVIGIDLGTTYSCVGVYKDGKVEIIANDQGNRITPSWVAFTNDERLIGEAAKNQAAANPERTIFDVKRLIGRKFEDKEV QSDMKHWPFMVVNDAGRPKVQVEYK_GETKSFYPEEVSSMVLTKMKEIAEAYLGKTVTNAVVTVPAYFNDSQRQATKDAG |:|||. |!.!||..|!|.!||:.: ||||.|.|||!|:|!||||||.||:!|||:!::|||||||||||:||||||||| QKDMKLVPYKIVNRDGKPYIQVKVQEGETKVFSPEEISAMILTKMKETAETFLGKKIKDAVVTVPAYFNDAQRQATKDAG TIAGLNVLRIINEPTAAAIAYGLDKKVGAERNVLIFDLGGGTFDVSILTIEDGIFEVKSTAGDTHLGGEDFDNRMVNHFI :||||||.|||||||||||||||||! |:|!|!|!|||||||||||!|||!:|!|||.:|.|||||||||||:|!:::|| VIAGLNVARIINEPTAAAIAYGLDKR_GGEKNILVFDLGGGTFDVSVLTIDNGVFEVLATNGDTHLGGEDFDQRLMEYFI AEFKRKHKKDISENKRAVRRLRTACERAKRTLSSSTQASIEIDSLYEGIDFYTSITRARFEELNADLFRGTLDPVEKALR . :|!||:||||:::||:.!||..||||||:|||:.|:.!||!||!!|!||..:!|||||||||.||||.|!:||:||!. KLIKKKHTKDISKDNRALGKLRRECERAKRALSSQHQVRVEIESLFDGVDFSEPLTRARFEELNNDLFRKTMGPVKKAMD DAKLDKSQIHDIVLVGGSTRIPKIQKLLQDFFNGKELNKSINPDEAVAYGAAVQAAILSGDKSENVQDLLLLDVTPLSLG ||.|!|:||:!||||||||||||!|:||:!||||||.:|:!|||||||!|||||::||||!.:|:::!!|||||:||:|| DAGLEKNQIDEIVLVGGSTRIPKVQQLLKEFFNGKEPSKGVNPDEAVAFGAAVQGSILSGEGGEETKEILLLDVAPLTLG IETAGGVMTVLIKRNTTIPTKQTQTFTTYSDNQPGVLIQVYEGERAMTKDNNLLGKFELTGIPPAPRGVPQIEVTFDIDA |||:|||||.||.|||:||||::|:||||:|:|:.|.|||!||||:!|||.:|||||!||||:|||||:|||||||!!|| IETVGGVMTKLIPRNTVIPTKKSQVFTTYQDQQTTVTIQVFEGERSLTKDCRLLGKFDLTGIAPAPRGTPQIEVTFEVDA NGILNVSAVDKSTGKENKITITNDKGRLSKEDIERMVQEAEKYKAEDEKQRDKVSSKNSLESYAFNMKATVED_EKLQGK ||||||:|.||::||::||||||||||||:|!|||||:|||:!..||:|.!!|!::!||||:|.!|||.:!:| !||.:| NGILNVKAEDKASGKSEKITITNDKGRLSQEEIERMVREAEEFAEEDKKVKEKIDARNSLETYIYNMKNQISDADKLADK INDEDKQKILDKCNEIINWLDKNQTAEKEEFEHQQKELEKVCNPIITKLYQSAGGMPG !::!!|:|| :..:|.!:|||:||:||||!!!::.||:|.|||||||.:||.:||.:| LESDEKEKIEGAVKEALEWLDDNQSAEKEDYDEKLKEVEAVCNPIITAVYQRSGGPSG
The required alignment has been found in 47 secs which is much faster than before. The number of iterations and final sizes of the sets are:
iter, length(Spinach), length(Human);
11417, 14, 764
© 2005 by Gaston Gonnet, Informatik, ETH Zurich
Last updated on Fri Sep 2 16:49:13 2005 by GhG