Multiple repetitions of a short motif


     It is well known to molecular biologists that multiple repetitions of short motifs happen frequently, much more frequently than they would happen purely at random. There are also very good biological reasons for these repetitions to occur. The first reason is purely structural, if a short sequence has some shape in space, a repetition of the sequence will produce a block with a regular structure. From the point of view of evolution, it is easier to create a pattern and to replicate it many times than to create a large pattern in the first place.

     This bio-recipe will illustrate two points. First we will simulate how natural evolution could create a multiple repeated pattern. Secondly we will describe how searching for such patterns is done. This requires a special kind of alignment, which is a simple variation of the dynamic programming alignment algorithm.

     Once that a repetition of a small sequence is incorporated in a protein, there are very good reasons for which this repetition may propagate. This has to be explained at the DNA level. We will carry out an example to illustrate the point. So we will first create a random sequence of coding DNA, assume it has been duplicated (by concatenating it to itself) and its complement.

s1 := Rand(CodingDNA(24));
s1 := s1 . s1;
s2 := Complement(s1);
s1 := GGGTGCGCAAACTTAGTATCCCGC
s1 := GGGTGCGCAAACTTAGTATCCCGCGGGTGCGCAAACTTAGTATCCCGC
s2 := CCCACGCGTTTGAATCATAGGGCGCCCACGCGTTTGAATCATAGGGCG

     Once that a duplication has happened, it is more likely that this repetition is extended than other random growth. This is due to the following phenomenon: suppose that s1 breaks at an arbitrary position, say between 10 and 11 then it has two possibilities of reattaching to its complement which will be very favourable. The first is just reattaching where it broke, there the complement matches perfectly. Secondly it could attach 24 bases further, where it will match its complement for at least 14 bases.

s1[1..10] . CreateString(24,'-') . s1[11..-1];
s2 .  CreateString(24,'-');
GGGTGCGCAA------------------------ACTTAGTATCCCGCGGGTGCGCAAACTTAGTATCCCGC
CCCACGCGTTTGAATCATAGGGCGCCCACGCGTTTGAATCATAGGGCG------------------------

     Now, a simple repair or duplication mechanism could fill the missing gap to match the complement and we have the original duplication extended by exactly one more period. With this mechanism in mind, we can see that if a repeated structure gives functional or structural advantages and be positively selected, then it is likely to happen.

     There are other mechanisms by which repetitions can be produced. Replication slippage is one of them. In this case the DNA reattaches to a similar pattern appearing later and forms a bow. The bow can be cut or expanded giving rise to repetitions. Another mechanism is unequal or asymmetrical recombination of DNA. This latter mechanism is likely to produce long repeats, sometimes involving many proteins, not just a short motif.

     The next step is to identify repeated subsequences in a given sequence. This is done by aligning the sequence with itself and preventing it from matching exactly onto itself. A variation of the string alignment algorithm that overcomes this problem is to penalize a match where we match the same string position against each other. This is implemented in Darwin and it is called the option "NoSelf" of an alignment. We will now show how this can be done. First we define the file name where we can find the SwissProt database and then load it. We also create a Dayhoff matrix at distance 40 to align sequences. We will use a shorter distance (than the 250 default) so that we find alignments which are less distant. We will also increase its fixed and incremental deletion cost so that deletions will be very unlikely to appear (we would like to find repetitions which are not affected by deletions).

SwissProt := '~cbrg/DB/SwissProt.Z':
ReadDb(SwissProt);
DM := DayMatrix(40);
DM[FixedDel] := -100: DM[IncDel] := -10:
Reading 169638448 characters from file /home/cbrg/DB/SwissProt.Z
Pre-processing input (peptides)
163235 sequences within 163235 entries considered
Peptide file(/home/cbrg/DB/SwissProt.Z(169638448), 163235 entries, 59631787
 aminoacids)
DM := DayMatrix(Peptide, pam=40, Sim: max=18.072, min=-17.119,
 del=-25.730-1.396*(k-1))

     The first example of an overlapping match can be found in the following sequence,

s1 := Sequence(AC(Q16527)):
a1 := Align(s1,s1,NoSelf);  print(a1);
a1 := Alignment(Sequence(AC('Q16527'))[3..79],Sequence(AC('Q16527'))[112..188],366.2357,DM,0,0,{Local,NoSelf})
lengths=77,77 simil=366.2, PAM_dist=40, identity=53.2%
ID=CSR2_HUMAN   AC=Q16527; Q93030;   DE=Cysteine and glycine-rich protein 2
(Cysteine-rich protein 2) (CRP2) (Smooth muscle cell LIM protein) (SmLIM) (LIM-
only protein 5).   OS=Homo sapiens (Human).   OC=Eukaryota; Metazoa; Chordata;
Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
Hominidae; Homo.   KW=Developmental protein; Differentiation; LIM domain; Metal-
binding; Nuclear protein; Repeat; Zinc.
ID=CSR2_HUMAN   AC=Q16527; Q93030;   DE=Cysteine and glycine-rich protein 2
(Cysteine-rich protein 2) (CRP2) (Smooth muscle cell LIM protein) (SmLIM) (LIM-
only protein 5).   OS=Homo sapiens (Human).   OC=Eukaryota; Metazoa; Chordata;
Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
Hominidae; Homo.   KW=Developmental protein; Differentiation; LIM domain; Metal-
binding; Nuclear protein; Repeat; Zinc.
WGGGNKCGACGRTVYHAEEVQCDGRSFHRCCFLCMVCRKNLDSTTVAIHDEEIYCKSCYGKKYGPKGYGYGQGAGTL
:||..||..||.:||.||.!  .|!..|!.||.|..|.|.|!|||.. .!.|||||.||.|.!||||!|||||||.|
YGGAEKCSRCGDSVYAAEKIIGAGKPWHKNCFRCAKCGKSLESTTLTEKEGEIYCKGCYAKNFGPKGFGYGQGAGAL

     We use the definition of a sequence by its SwissProt accession number and then match the sequence to itself. Without the NoSelf mode, the match would be perfect and for the entire length of the sequence. Here we see that we align positions 3..79 with positions 112..188. The alignment is of very good quality, with more than 50% identity. Although this is a repetition of a subsequence, this is not exactly what we were looking for, as the aligned subsequences do not overlap. If we try another sequence we find:

s2 := Sequence(AC(Q62010)):
a2 := Align(s2,s2,NoSelf);  print(a2);
a2 := Alignment(Sequence(AC('Q62010'))[486..626],Sequence(AC('Q62010'))[493..633],1053.7564,DM,0,0,{Local,NoSelf})
lengths=141,141 simil=1053.8, PAM_dist=40, identity=77.3%
ID=OGP_MOUSE   AC=Q62010;   DE=Oviduct-specific glycoprotein precursor
(Oviductal glycoprotein) (Oviductin) (Estrogen-dependent oviduct protein).  
OS=Mus musculus (Mouse).   OC=Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
Murinae; Mus.   KW=Fertilization; Glycoprotein; Repeat; Signal.
ID=OGP_MOUSE   AC=Q62010;   DE=Oviduct-specific glycoprotein precursor
(Oviductal glycoprotein) (Oviductin) (Estrogen-dependent oviduct protein).  
OS=Mus musculus (Mouse).   OC=Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
Murinae; Mus.   KW=Fertilization; Glycoprotein; Repeat; Signal.
SKTTTGVSKTTTGISKTTTGVSKTTTGVSKATAGISKTIPEISKATAGVSKTTTGVSKTTTGISKTITGVSKTTTGISKT
||||||!||||||!|||||||||.|.|!||....|||....!||.|.||||||||!|||.||!|||.||!||||||||||
SKTTTGISKTTTGVSKTTTGVSKATAGISKTIPEISKATAGVSKTTTGVSKTTTGISKTITGVSKTTTGISKTTTGISKT

TTGISKTTTGVSKITTGVSKTTTGISKTTTGISQTTTGISKTTTDISKTTTGISKTTPGIS
|||!||.||||||.|||!||||||||:||||||:|||.||||||.|||||.||||||||::
TTGVSKITTGVSKTTTGISKTTTGISQTTTGISKTTTDISKTTTGISKTTPGISKTTPGMT

     This is now a perfect example of what we were searching for, a long repetition of a relatively short pattern. It is the nature of an overlapped alignment, that the length of the pattern which repeats is given by the difference of the offsets. The length of the repeated motif is the total overlap. We can compute these values for the above sequence:

MotifLength := | GetOffset(a2[Seq1]) - GetOffset(a2[Seq2]) |;
TotalLength := MotifLength + a2[Length2];
Repetitions := TotalLength / MotifLength;
MotifLength := 7
TotalLength := 148
Repetitions := 21.1429

     So we can see that the motif is of length 7, being SKTTTGV and is repeated 21 times. In general, the replication process could happen at different times during evolution, and it may result in ambiguous motif lengths. This procedure will give the motif which was repeated last in evolution. It is possible (not in this example), that the motif itself has a repeated motif inside.

     Finally we would like to write a program which searches the database for repetitions of this kind. In this example we will search for any repetition which is at least 5 residues long, that represents a very good quality alignment (score larger than 1500) and that repeats a record number of times. To show this exercise we will run over 2000 random entries from the database (as opposed to the entire database).

to 2000 do
    s := Sequence(Rand(Entry));
    a := Align(s,s,NoSelf);
    if a[Score] < 1500 then next fi:
    ml := | GetOffset(a[Seq1]) - GetOffset(a[Seq2]) |;
    if ml < 5 then next fi;
    tl := ml + a[Length2];
    if tl/ml > Repetitions then Repetitions := tl/ml;  a2 := a fi
    od:
Repetitions;
print(a2);
21.1429
lengths=141,141 simil=1053.8, PAM_dist=40, identity=77.3%
ID=OGP_MOUSE   AC=Q62010;   DE=Oviduct-specific glycoprotein precursor
(Oviductal glycoprotein) (Oviductin) (Estrogen-dependent oviduct protein).  
OS=Mus musculus (Mouse).   OC=Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
Murinae; Mus.   KW=Fertilization; Glycoprotein; Repeat; Signal.
ID=OGP_MOUSE   AC=Q62010;   DE=Oviduct-specific glycoprotein precursor
(Oviductal glycoprotein) (Oviductin) (Estrogen-dependent oviduct protein).  
OS=Mus musculus (Mouse).   OC=Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
Murinae; Mus.   KW=Fertilization; Glycoprotein; Repeat; Signal.
SKTTTGVSKTTTGISKTTTGVSKTTTGVSKATAGISKTIPEISKATAGVSKTTTGVSKTTTGISKTITGVSKTTTGISKT
||||||!||||||!|||||||||.|.|!||....|||....!||.|.||||||||!|||.||!|||.||!||||||||||
SKTTTGISKTTTGVSKTTTGVSKATAGISKTIPEISKATAGVSKTTTGVSKTTTGISKTITGVSKTTTGISKTTTGISKT

TTGISKTTTGVSKITTGVSKTTTGISKTTTGISQTTTGISKTTTDISKTTTGISKTTPGIS
|||!||.||||||.|||!||||||||:||||||:|||.||||||.|||||.||||||||::
TTGVSKITTGVSKTTTGISKTTTGISQTTTGISKTTTDISKTTTGISKTTPGISKTTPGMT

and we find a pattern of length 7 which repeats 21 times. It should be relatively easy to change the above loop to select repetitions based on other criteria.

© 2005 by Gaston Gonnet, Informatik, ETH Zurich

Index of bio-recipes

Last updated on Fri Sep 2 16:43:11 2005 by GhG