The central dogma of molecular biology states that DNA is transcribed to RNA and RNA is translated into amino acids according to the genetic code. The transcription between DNA and RNA is one-to-one and does not offer any ambiguity. The translation from RNA to amino acids is a many-to-one translation, where 61 codons translate to 20 amino acids according to the genetic code.
?GeneticCode
GGG G Gly AGG R Arg CGG R Arg UGG W Trp GGA G Gly AGA R Arg CGA R Arg UGA Stop GGC G Gly AGC S Ser CGC R Arg UGC C Cys GGU G Gly AGU S Ser CGU R Arg UGU C Cys GAG E Glu AAG K Lys CAG Q Gln UAG Stop GAA E Glu AAA K Lys CAA Q Gln UAA Stop GAC D Asp AAC N Asn CAC H His UAC Y Tyr GAU D Asp AAU N Asn CAU H His UAU Y Tyr GCG A Ala ACG T Thr CCG P Pro UCG S Ser GCA A Ala ACA T Thr CCA P Pro UCA S Ser GCC A Ala ACC T Thr CCC P Pro UCC S Ser GCU A Ala ACU T Thr CCU P Pro UCU S Ser GUG V Val AUG M Met CUG L Leu UUG L Leu GUA V Val AUA I Ile CUA L Leu UUA L Leu GUC V Val AUC I Ile CUC L Leu UUC F Phe GUU V Val AUU I Ile CUU L Leu UUU F Phe See Also: ?AltGenCode ?BaseToInt ?CIntToAmino ?CodonToInt ?IntToBBB ?AminoToInt ?BBBToInt ?CIntToCodon ?Complement ?IntToCInt ?antiparallel ?BToInt ?CIntToInt ?GeneticCode ?IntToCodon ?AToCInt ?CIntToA ?CodonToA ?IntToB ?Reverse ?AToCodon ?CIntToAAA ?CodonToCInt ?IntToBase ------------
In this bio-recipe we will study how to produce DNA (or RNA) that will encode a given protein. This is in principle trivial, but in reality it is not so. We will first load a database for the entire genome of our target organism, in this case Saccharomyces cerevisiae (baker's yeast).
ProtDB := ReadDb('/home/darwin/DB/genomes/YEAST/YEAST.db');
ProtDB := Peptide file(/home/darwin/DB/genomes/YEAST/YEAST.db(17557412), 5799 entries, 2886213 aminoacids)
For the computations that will follow, it will be very convenient to create a new database over the coding DNA of yeast. This is done with the DNA entries in the database. The new database will have the same entries (in the same order) as the protein database, so it is very easy to correlate one with the other.
OpenWriting( 'yeastDNA.db' ); for e in Entries(ProtDB) do printf( '<E><ID>%s</ID><SEQ>%s</SEQ></E>\n', SearchTag(ID,e), SearchTag(DNA,e) ) od: OpenWriting( previous ); DNADB := ReadDb( 'yeastDNA.db' ):
Reading 8893314 characters from file yeastDNA.db Pre-processing input (DNA) 5799 sequences within 5799 entries considered Creating file yeastDNA.db.map for mapping Building new Pat index in file yeastDNA.db.tree with 8676036 entries Pat index with 8676036 entries sorted, from "A</SEQ></E>\n" to "XXXXXXXXXXXXXXXXXXX"
Now we can print a table by amino acid of the corresponding codons and their frequencies. For this we have to first accumulate the counts of the codons.
CodCounts := CreateArray(1..64): for s in Sequences(DNADB) do for i by 3 to length(s) do j := CodonToCInt(s[i..i+2]); if j > 0 then CodCounts[j] := CodCounts[j]+1 fi od; od; totaa := sum(CodCounts); for aa to 20 do printf( '%s:\n', IntToAmino(aa) ); for cod in IntToCodon(aa) do PatEntr := SearchSeqDb( cod ); printf( ' %s%5.2f%%', cod, CodCounts[CodonToCInt(cod)] / totaa * 100 ) od; printf( '\n' ) od;
totaa := 2826224 Alanine: GCA 1.63% GCC 1.22% GCG 0.62% GCT 2.02% Arginine: AGA 2.10% AGG 0.95% CGA 0.31% CGC 0.26% CGG 0.18% CGT 0.62% Asparagine: AAC 2.47% AAT 3.66% Aspartic acid: GAC 2.03% GAT 3.83% Cysteine: TGC 0.47% TGT 0.79% Glutamine: CAA 2.70% CAG 1.24% Glutamic acid: GAA 4.60% GAG 1.97% Glycine: GGA 1.12% GGC 0.98% GGG 0.61% GGT 2.27% Histidine: CAC 0.76% CAT 1.38% Isoleucine: ATA 1.83% ATC 1.69% ATT 3.02% Leucine: CTA 1.36% CTC 0.55% CTG 1.07% CTT 1.23% TTA 2.65% TTG 2.67% Lysine: AAA 4.28% AAG 3.06% Methionine: ATG 2.07% Phenylalanine: TTC 1.80% TTT 2.63% Proline: CCA 1.77% CCC 0.69% CCG 0.53% CCT 1.35% Serine: AGC 1.00% AGT 1.46% TCA 1.90% TCC 1.40% TCG 0.87% TCT 2.33% Threonine: ACA 1.80% ACC 1.24% ACG 0.82% ACT 2.01% Tryptophan: TGG 1.04% Tyrosine: TAC 1.45% TAT 1.90% Valine: GTA 1.20% GTC 1.12% GTG 1.08% GTT 2.16%
As we can see, codons are not used with uniform frequencies. There are multiple reasons for these differences, and currently we do not really know whether these peculiar distributions are random or due to precise reasons.
There are several simple-minded approaches to generate DNA that will encode for a protein:
(a) | Select a random codon for each amino acid |
(b) | Select the most frequent codon for each amino acid |
(c) | Select the codons randomly, but with the actual genome probability distribution of the codons |
(d) | As any of the above, but favouring reuse of codons (as has been found is the case in nature). |
Experience shows that sometimes these sequences do not work at all, or synthesize proteins very slowly or are too error prone. There are at least 3 reasons why a DNA sequence produced with these simple rules may fail:
(i) | The DNA or RNA produced has an undesirable physical property or structure, e.g. curls too much or in the wrong way, DNA attaches too loosely or too strongly, etc. |
(ii) | The DNA contains a signal for something unwanted, e.g. a restriction site, a promoter, or something else which affects transcription or translation. |
(iii) | The translation becomes very inefficient or error prone, usually because of undesired interactions in the ribosome or between rRNAs. |
Not knowing which criteria should be used to select the codons does not mean that we cannot build a DNA sequence which is as similar as possible to existing coding DNA. Since we have plenty of coding DNA - the entire genome - we will use a strategy of mimicking the existing coding DNA. That is, select codons so that the resulting DNA appears as often as possible in the coding DNA.
By using such strategy, we will be quite safe from using subsequences that have the problems mentioned above. This is because natural selection will not use these subsequences or use them very sparingly.
To make this notion of mimicking the coding DNA of the genome more formal we have to decide on a DNA-window size, say 8 for the purpose of this example. We want the generated DNA, divided in sliding windows of size 8, to be as frequent as possible. More formally, Let s be a sequence and s[i..j] the subsequence of s from symbol i to symbol j. We want the substrings s[1..8], s[2..9], s[3..10], ...to be most frequently appearing in the coding DNA. As a global measure we will take the product of all the frequencies (or the sum of their logarithms).
Lets take a short arbitrary peptide sequence:
s := 'AACID':
Now we will back-translate using random codons. We define the function RandL to take a random element of a list.
RandL := proc( L:list ) L[ Rand(1..length(L)) ] end: sD := '': for i to length(s) do sD := sD . RandL( AToCodon( s[i] )) od: sD;
GCCGCATGCATAGAT
The frequency of any DNA sequence can be found by searching the sequence with SearchSeqDb which will return a range of entries in the Pat index. From the range we can immediately compute the number of hits.
k := 8; sD[1..k]; SearchSeqDb(sD[1..k]);
k := 8 GCCGCATG PatEntry(5187869..5187911)
We can now compute all frequencies of all sliding windows of size k and the total score:
score := 0: for i to length(sD)-k+1 do pe := SearchSeqDb( sD[i..i+k-1] ); freq := pe[1,2]-pe[1,1]+1; lprint( sD[i..i+k-1], freq ); score := score + log(freq) od: score;
GCCGCATG 43 CCGCATGC 26 CGCATGCA 50 GCATGCAT 70 CATGCATA 85 ATGCATAG 65 TGCATAGA 98 GCATAGAT 90 32.8816
There are 1, 2, 3, 4 or 6 codon choices for each amino. Clearly the total number of choices will be exponential in the length of the protein and hence intractable. To find this optimal DNA we can use dynamic programming. We will not describe the algorithm in detail, but just give the main highlights:
(1) | Given a sequence of k bases, finding its frequency in the database can be done with the Pat indices of Darwin in a single operation. This requires (internally) logarithmic time in the size of the database and hence it is very efficient. |
(2) | We keep a variable number of partial optimal solutions, all the ones for which the last k-1 bases show some difference. Each partial solution is associated with a score (sum of logarithms of frequencies). |
(3) | We start with a single partial solution, with score 0 and the empty string (no DNA). |
(4) | If we have enough new bases to compute frequencies, we compute those and add their logarithms to the corresponding score. |
(5) | If any two partial solutions coincide on their last substring of length k-1, we keep only the one with the highest score. |
(6) | If no more scores can be computed, we add the next amino acid, which means adding all possible codons to each partial solution. E.g. if we add Isoleucine, which can be coded by 3 codons, the number of partial solutions will be multiplied by 3. |
(7) | When there are no more amino acids and no more frequencies to be computed, the solution is the partial solution with the highest score. |
In essence, the above describes a dynamic programming problem which is more complicated than usual because there will be a variable number of intermediate solutions to carry. This is implemented in Darwin with the function BackTranslateDP. We recompute the DNA for our example sequence.
sDopt := BackTranslateDP( s, k ); sD2 := sDopt[1]; score := 0: for i to length(sD2)-k+1 do pe := SearchSeqDb( sD2[i..i+k-1] ); freq := pe[1,2]-pe[1,1]+1; lprint( sD2[i..i+k-1], freq ); score := score + log(freq) od: score;
sDopt := [GCTGCTTGTATTGAT, 38.8420] sD2 := GCTGCTTGTATTGAT GCTGCTTG 83 CTGCTTGT 72 TGCTTGTA 97 GCTTGTAT 102 CTTGTATT 133 TTGTATTG 196 TGTATTGA 195 GTATTGAT 246 38.8420
As we can see the total score is higher and all individual frequencies are higher. So according to our optimization criteria this is a better sequence than the random one.
The number of intermediate states is exponential in k (bounded roughly by 6^(k/3)). So very large values of k (more context to asses similarities) will have a significant penalty in execution time. In practice this is not a problem, as when k is too large the process is likely to fail because at some point all the partial solutions end in k-nucleotide combinations which do not appear in the DNA database. A partial solution that ends in a k-nucleotide with frequency 0 is removed from the partial solution set. We can test this with:
BackTranslateDP( s, 10 ); BackTranslateDP( s, 13 );
[GCTGCTTGTATTGAC, 14.5066] [GCAGCTTGTATTGAT, -1.7976931348623147e+308]
So even for a very short peptide, k=13 is too much context for this genome.
If we want to incorporate the stop codon at the end of the sequence we can use the '$' symbol at the end of the peptide.
BackTranslateDP( s.'$', k );
[GCTGCTTGTATTGATTGA, 56.5626]
It is interesting to test whether this procedure will produce sequences more "yeast-like" than any of the simple-minded methods. An easy way to test it is to regenerate the DNA from the protein sequences in the genome and compare the number of DNA matches. So for each sequence in the database we will generate the DNA, compare it against the real DNA and count how many matches/mismatches we have.
We set up counters and run a loop over all the entries. Running the entire database is too time consuming, so we just do the first 100 entries.c1 := Counter('DNA matches'): c2 := Counter('DNA mismatches'): c3 := Counter('unknown DNA'): c4 := Counter('single occurrences'): for ie to 100 doPick up the entry from the protein database, then change to the DNA database to have access to the DNA fragment frequencies.
DB := ProtDB: e := Entry(ie); dna := SearchTag(DNA,e); DB := DNADB; r := BackTranslateDP( Sequence(e).'$', k )[1]; if length(r) <> length(dna) then error(e,'length of dna and protein disagree') fi;Count matches, mismatches and potential unknown DNA.
for i by 3 to length(r) do if r[i..i+2] = dna[i..i+2] then c1+1 elif CodonToCInt(dna[i..i+2]) = 0 then c3+1 else c2+1 fi od;
One observation is very relevant here. Since we are using the yeast DNA to generate yeast proteins, if we choose k sufficiently large we will reproduce the DNA exactly. That is, no matter how large is k, there is always one DNA sequence which will have frequency at least 1; the sequence for the protein we are converting. If k is too large, any other sequence will have such a low probability of being found that it will virtually disappear. An indication that this is happening is when a frequency is exactly 1. So as a safeguard against this potential problem we will count how many times the frequencies are 1, i.e. a single occurrence of the k-nucleotide sequence. We would like to select k (for this particular test) so that the frequencies of 1 are almost completely avoided. We will count the cases where the frequency is 1.
for i to length(r)-k+1 do pe := SearchSeqDb(r[i..i+k-1]); if pe[1,1]=pe[1,2] then c4+1 fi od; od: print(c1,c2,c3,c4);
DNA matches: 24797 DNA mismatches: 26207 unknown DNA: 0 single occurrences: 0
The first three simple-minded methods are simple enough to analyze exactly. For this we have to analyze each amino acid separately (and add the stop codons to this analysis). The analysis is based on the observed frequencies of codons that were calculated earlier.
corr := CreateArray(1..3): for i to 22 do if i <> 21 then fs := [ seq(CodCounts[j], j=IntToCInt(i) )]; corr := corr +The expected number of correct choices for a random choice is simply the total number of codons divided by the number of choices.
[ sum(fs) / length(fs),The number of correct ones when we select the most probable is the maximum.
max(fs),Finally, choosing codons randomly according to their frequency will result in a probability that is the sum of the squares of the individual probabilities.
fs*fs / sum(fs) ] fi od: PercentCorrect := 100*corr/sum(CodCounts);
PercentCorrect := [36.2333, 49.3079, 40.1234]
The value that our method obtains is approximately 48.62%. This is very competitive for a sequence that uses all the codons available.
© 2006 by Gaston Gonnet, Informatik, ETH Zurich
Please cite as:
@techreport{ Gonnet-BackTranslate, author = {Gaston H. Gonnet}, title = {Back Translation (protein to DNA) in an optimal way}, month = { December }, year = {2005}, number = { 505 }, howpublished = { Electronic publication }, copyright = {code under the GNU General Public License}, institution = {Informatik, ETH, Zurich}, URL = { http://www.biorecipes.com/BackTranslate/code.html } }
Last updated on Thu Apr 20 10:49:16 2006 by GhG