Sequence Alignments with Special Characteristics


     This bio-recipe shows how to search for a sequence alignment with special characteristics.

     As an example we want to find a protein sequence from humans and one from spinach that have a similarity score greater than 150.

     First we pick at random two protein sequences, one from humans, and one from spinach. Then we align them using the local alignment method. The iteration is stopped if the score of the alignment is greater than 150.

Define the location of the SwissProt database and then load it. We also create Dayhoff matrices to be able to align sequences.

SwissProt := '~cbrg/DB/SwissProt.Z':
ReadDb(SwissProt);
CreateDayMatrices();
Reading 169638448 characters from file /home/cbrg/DB/SwissProt.Z
Pre-processing input (peptides)
163235 sequences within 163235 entries considered
Peptide file(/home/cbrg/DB/SwissProt.Z(169638448), 163235 entries, 59631787
 aminoacids)

Start the iteration that searches the sequence alignment with the special characteristics.

for iter do

          Select first sequence (from humans)

    do  e1 := Rand(Entry);
	if SearchString('homo sapiens',e1) > 0 then break fi od;

          Select second sequence (from spinach)

    do  e2 := Rand(Entry);
	if SearchString('spinach',e2) > 0 then break fi od;

          Do the alignment and stop if the score is > 150

    a := Align( Sequence(e1), Sequence(e2), Local );
    if a[Score] > 150 then break fi
    od:
print(iter,a);
bytes alloc=35600000, time=7.170
bytes alloc=40800000, time=13.430
bytes alloc=40800000, time=19.660
bytes alloc=40800000, time=25.900
bytes alloc=40800000, time=32.180
bytes alloc=40800000, time=38.720
bytes alloc=40800000, time=44.830
bytes alloc=40800000, time=50.870
bytes alloc=40800000, time=56.840
bytes alloc=40800000, time=62.830
bytes alloc=40800000, time=69.400
bytes alloc=40800000, time=75.510
bytes alloc=40800000, time=81.900
bytes alloc=40800000, time=88.000
bytes alloc=40800000, time=94.240
bytes alloc=40800000, time=100.640
4994
lengths=109,110 simil=185.5, PAM_dist=250, identity=28.2%
ID=AAC3_HUMAN   AC=Q08043;   DE=Alpha-actinin 3 (Alpha actinin skeletal muscle
isoform 3) (F-actin cross linking protein).   OS=Homo sapiens (Human).  
OC=Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
Eutheria; Primates; Catarrhini; Hominidae; Homo.   KW=Actin-binding; Multigene
family; Polymorphism; Repeat.
ID=CALM_SPIOL   AC=P04353;   DE=Calmodulin (CaM).   OS=Spinacia oleracea
(Spinach).   OC=Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
Caryophyllales; Amaranthaceae; Spinacia.   KW=Acetylation; Calcium-binding;
Direct protein sequencing; Methylation; Repeat.
AKGLSQEQLNEFRASFNHFDRKRNGMMEPDDFRACLISMGYDLGEVEFARIMTMVDPNAAGVVTFQAFIDFMTRETAETD
|  |::||!.||!.:|:.||!:.:|.:.::!:.:.!.|!|.:..|:|:..:::.||:::.|:!.|..|!::|:|:..!||
AXXLTDEQIAEFKEAFSLFDKDGDGCITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLNLMARKMKDTD

TTEQVVASFKILAGDKN_YITPEELRRELP
:.|::..:|!!:..|:| !|::.|||:.!:
SEEELKEAFRVFDKDQNGFISAAELRHVMT

     Although the above search has done its job, it is not very efficient. It required more than 4000 iterations and about 100 secs to find a suitable match. This is partly due to the way this was coded. Once that we found a sequence for each, we align them and if not suitable, we throw both of them out. It would be much better to keep a set of human sequences and a set of spinach sequences and test every new human/spinach sequence against all the sequences in the other set. This can be coded as follows:

Spinach := Human := {}:
for iter do
    e := Rand(Entry);
    if SearchString(spinach,e) > 0 then
	 als := zip( Align(Human,e) );
	 Spinach := append(Spinach,e)

          After a random Entry is selected, if this entry belongs to spinach, we align it to all the sequences from Human. The alignments are stored in als. The zip function will notice that Human is a set and will compute a list with the Alignments of each of the components of Human. Next we add the new sequence to the set Spinach. The same is done if the sequence is from a human. At the end of the if statement we compute the maximum score of all the alignments in the list.

    elif SearchString('homo sapiens',e) > 0 then
	 als := zip( Align(e,Spinach) );
	 Human := append(Human,e)
    else next fi;
    if max( seq( a[Score], a=als )) > 150 then break fi
    od:
for a in als do if a[Score] > 150 then print(a) fi od;
lengths=616,617 simil=1906.4, PAM_dist=250, identity=65.5%
ID=HS7C_HUMAN   AC=P11142; Q9H3R6;   DE=Heat shock cognate 71 kDa protein.  
OS=Homo sapiens (Human).   OC=Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae;
Homo.   KW=Alternative splicing; ATP-binding; Chaperone; Direct protein
sequencing; Heat shock; Multigene family; Nuclear protein.
ID=BIP_SPIOL   AC=Q42434;   DE=Luminal binding protein precursor (BiP) (78 kDa
glucose-regulated protein homolog) (GRP 78).   OS=Spinacia oleracea (Spinach).  
OC=Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Caryophyllales;
Amaranthaceae; Spinacia.   KW=ATP-binding; Endoplasmic reticulum; Signal.
GPAVGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFTDTERLIGDAAKNQVAMNPTNTVFDAKRLIGRRFDDAVV
|::!|||||||||||||!::|||||||||||||.|||!||||:.|||||!|||||:|.||.:|!||:||||||!|!|..|
GTVIGIDLGTTYSCVGVYKDGKVEIIANDQGNRITPSWVAFTNDERLIGEAAKNQAAANPERTIFDVKRLIGRKFEDKEV

QSDMKHWPFMVVNDAGRPKVQVEYK_GETKSFYPEEVSSMVLTKMKEIAEAYLGKTVTNAVVTVPAYFNDSQRQATKDAG
|:|||. |!.!||..|!|.!||:.: ||||.|.|||!|:|!||||||.||:!|||:!::|||||||||||:|||||||||
QKDMKLVPYKIVNRDGKPYIQVKVQEGETKVFSPEEISAMILTKMKETAETFLGKKIKDAVVTVPAYFNDAQRQATKDAG

TIAGLNVLRIINEPTAAAIAYGLDKKVGAERNVLIFDLGGGTFDVSILTIEDGIFEVKSTAGDTHLGGEDFDNRMVNHFI
:||||||.|||||||||||||||||! |:|!|!|!|||||||||||!|||!:|!|||.:|.|||||||||||:|!:::||
VIAGLNVARIINEPTAAAIAYGLDKR_GGEKNILVFDLGGGTFDVSVLTIDNGVFEVLATNGDTHLGGEDFDQRLMEYFI

AEFKRKHKKDISENKRAVRRLRTACERAKRTLSSSTQASIEIDSLYEGIDFYTSITRARFEELNADLFRGTLDPVEKALR
. :|!||:||||:::||:.!||..||||||:|||:.|:.!||!||!!|!||..:!|||||||||.||||.|!:||:||!.
KLIKKKHTKDISKDNRALGKLRRECERAKRALSSQHQVRVEIESLFDGVDFSEPLTRARFEELNNDLFRKTMGPVKKAMD

DAKLDKSQIHDIVLVGGSTRIPKIQKLLQDFFNGKELNKSINPDEAVAYGAAVQAAILSGDKSENVQDLLLLDVTPLSLG
||.|!|:||:!||||||||||||!|:||:!||||||.:|:!|||||||!|||||::||||!.:|:::!!|||||:||:||
DAGLEKNQIDEIVLVGGSTRIPKVQQLLKEFFNGKEPSKGVNPDEAVAFGAAVQGSILSGEGGEETKEILLLDVAPLTLG

IETAGGVMTVLIKRNTTIPTKQTQTFTTYSDNQPGVLIQVYEGERAMTKDNNLLGKFELTGIPPAPRGVPQIEVTFDIDA
|||:|||||.||.|||:||||::|:||||:|:|:.|.|||!||||:!|||.:|||||!||||:|||||:|||||||!!||
IETVGGVMTKLIPRNTVIPTKKSQVFTTYQDQQTTVTIQVFEGERSLTKDCRLLGKFDLTGIAPAPRGTPQIEVTFEVDA

NGILNVSAVDKSTGKENKITITNDKGRLSKEDIERMVQEAEKYKAEDEKQRDKVSSKNSLESYAFNMKATVED_EKLQGK
||||||:|.||::||::||||||||||||:|!|||||:|||:!..||:|.!!|!::!||||:|.!|||.:!:| !||.:|
NGILNVKAEDKASGKSEKITITNDKGRLSQEEIERMVREAEEFAEEDKKVKEKIDARNSLETYIYNMKNQISDADKLADK

INDEDKQKILDKCNEIINWLDKNQTAEKEEFEHQQKELEKVCNPIITKLYQSAGGMPG
!::!!|:|| :..:|.!:|||:||:||||!!!::.||:|.|||||||.:||.:||.:|
LESDEKEKIEGAVKEALEWLDDNQSAEKEDYDEKLKEVEAVCNPIITAVYQRSGGPSG

     The required alignment has been found in 47 secs which is much faster than before. The number of iterations and final sizes of the sets are:

iter, length(Spinach), length(Human);
11417, 14, 764

© 2005 by Gaston Gonnet, Informatik, ETH Zurich

Index of bio-recipes

Last updated on Fri Sep 2 16:49:13 2005 by GhG