Back Translation (protein to DNA) in an optimal way

by Gaston H. Gonnet

The central dogma of molecular biology states that DNA is transcribed to RNA and RNA is translated into amino acids according to the genetic code. The transcription between DNA and RNA is one-to-one and does not offer any ambiguity. The translation from RNA to amino acids is a many-to-one translation, where 61 codons translate to 20 amino acids according to the genetic code.

?GeneticCode


GGG  G  Gly    AGG  R  Arg    CGG  R  Arg    UGG  W  Trp    
GGA  G  Gly    AGA  R  Arg    CGA  R  Arg    UGA    Stop    
GGC  G  Gly    AGC  S  Ser    CGC  R  Arg    UGC  C  Cys    
GGU  G  Gly    AGU  S  Ser    CGU  R  Arg    UGU  C  Cys    

GAG  E  Glu    AAG  K  Lys    CAG  Q  Gln    UAG    Stop    
GAA  E  Glu    AAA  K  Lys    CAA  Q  Gln    UAA    Stop    
GAC  D  Asp    AAC  N  Asn    CAC  H  His    UAC  Y  Tyr    
GAU  D  Asp    AAU  N  Asn    CAU  H  His    UAU  Y  Tyr    

GCG  A  Ala    ACG  T  Thr    CCG  P  Pro    UCG  S  Ser    
GCA  A  Ala    ACA  T  Thr    CCA  P  Pro    UCA  S  Ser    
GCC  A  Ala    ACC  T  Thr    CCC  P  Pro    UCC  S  Ser    
GCU  A  Ala    ACU  T  Thr    CCU  P  Pro    UCU  S  Ser    

GUG  V  Val    AUG  M  Met    CUG  L  Leu    UUG  L  Leu    
GUA  V  Val    AUA  I  Ile    CUA  L  Leu    UUA  L  Leu    
GUC  V  Val    AUC  I  Ile    CUC  L  Leu    UUC  F  Phe    
GUU  V  Val    AUU  I  Ile    CUU  L  Leu    UUU  F  Phe    

See Also:
   ?AltGenCode      ?BaseToInt    ?CIntToAmino    ?CodonToInt     ?IntToBBB  
   ?AminoToInt      ?BBBToInt     ?CIntToCodon    ?Complement     ?IntToCInt 
   ?antiparallel    ?BToInt       ?CIntToInt      ?GeneticCode    ?IntToCodon
   ?AToCInt         ?CIntToA      ?CodonToA       ?IntToB         ?Reverse   
   ?AToCodon        ?CIntToAAA    ?CodonToCInt    ?IntToBase                 


   ------------

In this bio-recipe we will study how to produce DNA (or RNA) that will encode a given protein. This is in principle trivial, but in reality it is not so. We will first load a database for the entire genome of our target organism, in this case Saccharomyces cerevisiae (baker's yeast).

ProtDB := ReadDb('/home/darwin/DB/genomes/YEAST/YEAST.db');

ProtDB := Peptide file(/home/darwin/DB/genomes/YEAST/YEAST.db(17557412), 5799
 entries, 2886213 aminoacids)

For the computations that will follow, it will be very convenient to create a new database over the coding DNA of yeast. This is done with the DNA entries in the database. The new database will have the same entries (in the same order) as the protein database, so it is very easy to correlate one with the other.

OpenWriting( 'yeastDNA.db' );
for e in Entries(ProtDB) do
    printf( '<E><ID>%s</ID><SEQ>%s</SEQ></E>\n',
        SearchTag(ID,e), SearchTag(DNA,e) ) od:
OpenWriting( previous );
DNADB := ReadDb( 'yeastDNA.db' ):

Reading 8893314 characters from file yeastDNA.db
Pre-processing input (DNA)
5799 sequences within 5799 entries considered
Creating file yeastDNA.db.map for mapping
Building new Pat index in file yeastDNA.db.tree with 8676036 entries
Pat index with 8676036 entries
 sorted, from "A</SEQ></E>\n" to "XXXXXXXXXXXXXXXXXXX"

Now we can print a table by amino acid of the corresponding codons and their frequencies. For this we have to first accumulate the counts of the codons.

CodCounts := CreateArray(1..64):
for s in Sequences(DNADB) do
    for i by 3 to length(s) do
	j := CodonToCInt(s[i..i+2]);
	if j > 0 then CodCounts[j] := CodCounts[j]+1 fi
    od;
od;
totaa := sum(CodCounts);
for aa to 20 do
    printf( '%s:\n', IntToAmino(aa) );
    for cod in IntToCodon(aa) do
	PatEntr := SearchSeqDb( cod );
	printf( '  %s%5.2f%%', cod,
	    CodCounts[CodonToCInt(cod)] / totaa * 100 )
    od;
    printf( '\n' )
od;

totaa := 2826224
Alanine:
  GCA 1.63%  GCC 1.22%  GCG 0.62%  GCT 2.02%
Arginine:
  AGA 2.10%  AGG 0.95%  CGA 0.31%  CGC 0.26%  CGG 0.18%  CGT 0.62%
Asparagine:
  AAC 2.47%  AAT 3.66%
Aspartic acid:
  GAC 2.03%  GAT 3.83%
Cysteine:
  TGC 0.47%  TGT 0.79%
Glutamine:
  CAA 2.70%  CAG 1.24%
Glutamic acid:
  GAA 4.60%  GAG 1.97%
Glycine:
  GGA 1.12%  GGC 0.98%  GGG 0.61%  GGT 2.27%
Histidine:
  CAC 0.76%  CAT 1.38%
Isoleucine:
  ATA 1.83%  ATC 1.69%  ATT 3.02%
Leucine:
  CTA 1.36%  CTC 0.55%  CTG 1.07%  CTT 1.23%  TTA 2.65%  TTG 2.67%
Lysine:
  AAA 4.28%  AAG 3.06%
Methionine:
  ATG 2.07%
Phenylalanine:
  TTC 1.80%  TTT 2.63%
Proline:
  CCA 1.77%  CCC 0.69%  CCG 0.53%  CCT 1.35%
Serine:
  AGC 1.00%  AGT 1.46%  TCA 1.90%  TCC 1.40%  TCG 0.87%  TCT 2.33%
Threonine:
  ACA 1.80%  ACC 1.24%  ACG 0.82%  ACT 2.01%
Tryptophan:
  TGG 1.04%
Tyrosine:
  TAC 1.45%  TAT 1.90%
Valine:
  GTA 1.20%  GTC 1.12%  GTG 1.08%  GTT 2.16%

As we can see, codons are not used with uniform frequencies. There are multiple reasons for these differences, and currently we do not really know whether these peculiar distributions are random or due to precise reasons.

Simple approaches to back translation

There are several simple-minded approaches to generate DNA that will encode for a protein:

(a) Select a random codon for each amino acid

(b) Select the most frequent codon for each amino acid

(c) Select the codons randomly, but with the actual genome probability distribution of the codons

(d) As any of the above, but favouring reuse of codons (as has been found is the case in nature).

Experience shows that sometimes these sequences do not work at all, or synthesize proteins very slowly or are too error prone. There are at least 3 reasons why a DNA sequence produced with these simple rules may fail:

(i) The DNA or RNA produced has an undesirable physical property or structure, e.g. curls too much or in the wrong way, DNA attaches too loosely or too strongly, etc.

(ii) The DNA contains a signal for something unwanted, e.g. a restriction site, a promoter, or something else which affects transcription or translation.

(iii) The translation becomes very inefficient or error prone, usually because of undesired interactions in the ribosome or between rRNAs.

An optimization criteria which solves most problems

Not knowing which criteria should be used to select the codons does not mean that we cannot build a DNA sequence which is as similar as possible to existing coding DNA. Since we have plenty of coding DNA - the entire genome - we will use a strategy of mimicking the existing coding DNA. That is, select codons so that the resulting DNA appears as often as possible in the coding DNA.

By using such strategy, we will be quite safe from using subsequences that have the problems mentioned above. This is because natural selection will not use these subsequences or use them very sparingly.

To make this notion of mimicking the coding DNA of the genome more formal we have to decide on a DNA-window size, say 8 for the purpose of this example. We want the generated DNA, divided in sliding windows of size 8, to be as frequent as possible. More formally, Let s be a sequence and s[i..j] the subsequence of s from symbol i to symbol j. We want the substrings s[1..8], s[2..9], s[3..10], ...to be most frequently appearing in the coding DNA. As a global measure we will take the product of all the frequencies (or the sum of their logarithms).

Example

Lets take a short arbitrary peptide sequence:

s := 'AACID':

Now we will back-translate using random codons. We define the function RandL to take a random element of a list.

RandL := proc( L:list ) L[ Rand(1..length(L)) ] end:
sD := '':
for i to length(s) do
    sD := sD . RandL( AToCodon(  s[i] )) od:
sD;

GCCGCATGCATAGAT

The frequency of any DNA sequence can be found by searching the sequence with SearchSeqDb which will return a range of entries in the Pat index. From the range we can immediately compute the number of hits.

k := 8;
sD[1..k];
SearchSeqDb(sD[1..k]);

k := 8
GCCGCATG
PatEntry(5187869..5187911)

We can now compute all frequencies of all sliding windows of size k and the total score:

score := 0:
for i to length(sD)-k+1 do
    pe := SearchSeqDb( sD[i..i+k-1] );
    freq := pe[1,2]-pe[1,1]+1;
    lprint( sD[i..i+k-1], freq );
    score := score + log(freq)
od:
score;

GCCGCATG 43
CCGCATGC 26
CGCATGCA 50
GCATGCAT 70
CATGCATA 85
ATGCATAG 65
TGCATAGA 98
GCATAGAT 90
32.8816

Variable-state dynamic programming

There are 1, 2, 3, 4 or 6 codon choices for each amino. Clearly the total number of choices will be exponential in the length of the protein and hence intractable. To find this optimal DNA we can use dynamic programming. We will not describe the algorithm in detail, but just give the main highlights:

(1) Given a sequence of k bases, finding its frequency in the database can be done with the Pat indices of Darwin in a single operation. This requires (internally) logarithmic time in the size of the database and hence it is very efficient.

(2) We keep a variable number of partial optimal solutions, all the ones for which the last k-1 bases show some difference. Each partial solution is associated with a score (sum of logarithms of frequencies).

(3) We start with a single partial solution, with score 0 and the empty string (no DNA).

(4) If we have enough new bases to compute frequencies, we compute those and add their logarithms to the corresponding score.

(5) If any two partial solutions coincide on their last substring of length k-1, we keep only the one with the highest score.

(6) If no more scores can be computed, we add the next amino acid, which means adding all possible codons to each partial solution. E.g. if we add Isoleucine, which can be coded by 3 codons, the number of partial solutions will be multiplied by 3.

(7) When there are no more amino acids and no more frequencies to be computed, the solution is the partial solution with the highest score.

In essence, the above describes a dynamic programming problem which is more complicated than usual because there will be a variable number of intermediate solutions to carry. This is implemented in Darwin with the function BackTranslateDP. We recompute the DNA for our example sequence.

sDopt := BackTranslateDP( s, k );
sD2 := sDopt[1];
score := 0:
for i to length(sD2)-k+1 do
    pe := SearchSeqDb( sD2[i..i+k-1] );
    freq := pe[1,2]-pe[1,1]+1;
    lprint( sD2[i..i+k-1], freq );
    score := score + log(freq)
od:
score;

sDopt := [GCTGCTTGTATTGAT, 38.8420]
sD2 := GCTGCTTGTATTGAT
GCTGCTTG 83
CTGCTTGT 72
TGCTTGTA 97
GCTTGTAT 102
CTTGTATT 133
TTGTATTG 196
TGTATTGA 195
GTATTGAT 246
38.8420

As we can see the total score is higher and all individual frequencies are higher. So according to our optimization criteria this is a better sequence than the random one.

The number of intermediate states is exponential in k (bounded roughly by 6^(k/3)). So very large values of k (more context to asses similarities) will have a significant penalty in execution time. In practice this is not a problem, as when k is too large the process is likely to fail because at some point all the partial solutions end in k-nucleotide combinations which do not appear in the DNA database. A partial solution that ends in a k-nucleotide with frequency 0 is removed from the partial solution set. We can test this with:

BackTranslateDP( s, 10 );
BackTranslateDP( s, 13 );

[GCTGCTTGTATTGAC, 14.5066]
[GCAGCTTGTATTGAT, -1.7976931348623147e+308]

So even for a very short peptide, k=13 is too much context for this genome.

If we want to incorporate the stop codon at the end of the sequence we can use the '$' symbol at the end of the peptide.

BackTranslateDP( s.'$', k );

[GCTGCTTGTATTGATTGA, 56.5626]

Testing the method

It is interesting to test whether this procedure will produce sequences more "yeast-like" than any of the simple-minded methods. An easy way to test it is to regenerate the DNA from the protein sequences in the genome and compare the number of DNA matches. So for each sequence in the database we will generate the DNA, compare it against the real DNA and count how many matches/mismatches we have.

We set up counters and run a loop over all the entries. Running the entire database is too time consuming, so we just do the first 100 entries.

c1 := Counter('DNA matches'):
c2 := Counter('DNA mismatches'):
c3 := Counter('unknown DNA'):
c4 := Counter('single occurrences'):
for ie to 100 do

Pick up the entry from the protein database, then change to the DNA database to have access to the DNA fragment frequencies.

    DB := ProtDB:
    e := Entry(ie);
    dna := SearchTag(DNA,e);
    DB := DNADB;
    r := BackTranslateDP( Sequence(e).'$', k )[1];
    if length(r) <> length(dna) then
	error(e,'length of dna and protein disagree') fi;

Count matches, mismatches and potential unknown DNA.

    for i by 3 to length(r) do
	if r[i..i+2] = dna[i..i+2] then c1+1
	elif CodonToCInt(dna[i..i+2]) = 0 then c3+1
	else c2+1 fi
    od;

One observation is very relevant here. Since we are using the yeast DNA to generate yeast proteins, if we choose k sufficiently large we will reproduce the DNA exactly. That is, no matter how large is k, there is always one DNA sequence which will have frequency at least 1; the sequence for the protein we are converting. If k is too large, any other sequence will have such a low probability of being found that it will virtually disappear. An indication that this is happening is when a frequency is exactly 1. So as a safeguard against this potential problem we will count how many times the frequencies are 1, i.e. a single occurrence of the k-nucleotide sequence. We would like to select k (for this particular test) so that the frequencies of 1 are almost completely avoided. We will count the cases where the frequency is 1.

    for i to length(r)-k+1 do
	pe := SearchSeqDb(r[i..i+k-1]);
	if pe[1,1]=pe[1,2] then c4+1 fi
    od;
od:
print(c1,c2,c3,c4);

DNA matches: 24797
DNA mismatches: 26207
unknown DNA: 0
single occurrences: 0

Exact analysis of the simple methods

The first three simple-minded methods are simple enough to analyze exactly. For this we have to analyze each amino acid separately (and add the stop codons to this analysis). The analysis is based on the observed frequencies of codons that were calculated earlier.

corr := CreateArray(1..3):
for i to 22 do if i <> 21 then
    fs := [ seq(CodCounts[j], j=IntToCInt(i) )];
    corr := corr +

The expected number of correct choices for a random choice is simply the total number of codons divided by the number of choices.

	[ sum(fs) / length(fs),

The number of correct ones when we select the most probable is the maximum.

          max(fs),

Finally, choosing codons randomly according to their frequency will result in a probability that is the sum of the squares of the individual probabilities.

          fs*fs / sum(fs) ]
fi od:
PercentCorrect := 100*corr/sum(CodCounts);

PercentCorrect := [36.2333, 49.3079, 40.1234]

The value that our method obtains is approximately 48.62%. This is very competitive for a sequence that uses all the codons available.

Please cite as:

@techreport{ Gonnet-BackTranslate,
        author = {Gaston H. Gonnet},
        title = {Back Translation (protein to DNA) in an optimal way},
        month = { December },
        year = {2005},
        number = { 505 },
        howpublished = { Electronic publication },
        copyright = {code under the GNU General Public License},
        institution = {Informatik, ETH, Zurich},
        URL = { http://www.biorecipes.com/BackTranslate/code.html }
        }

Index of bio-recipes

Last updated on Thu Apr 20 10:49:16 2006 by GhG

!!! Dieses Dokument stammt aus dem ETH Web-Archiv und wird nicht mehr gepflegt !!!
!!! This document is stored in the ETH Web archive and is no longer maintained !!!

(a)	Select a random codon for each amino acid
(b)	Select the most frequent codon for each amino acid
(c)	Select the codons randomly, but with the actual genome probability distribution of the codons
(d)	As any of the above, but favouring reuse of codons (as has been found is the case in nature).

(i)	The DNA or RNA produced has an undesirable physical property or structure, e.g. curls too much or in the wrong way, DNA attaches too loosely or too strongly, etc.
(ii)	The DNA contains a signal for something unwanted, e.g. a restriction site, a promoter, or something else which affects transcription or translation.
(iii)	The translation becomes very inefficient or error prone, usually because of undesired interactions in the ribosome or between rRNAs.

(1)	Given a sequence of `k` bases, finding its frequency in the database can be done with the Pat indices of Darwin in a single operation. This requires (internally) logarithmic time in the size of the database and hence it is very efficient.
(2)	We keep a variable number of partial optimal solutions, all the ones for which the last `k-1` bases show some difference. Each partial solution is associated with a score (sum of logarithms of frequencies).
(3)	We start with a single partial solution, with score 0 and the empty string (no DNA).
(4)	If we have enough new bases to compute frequencies, we compute those and add their logarithms to the corresponding score.
(5)	If any two partial solutions coincide on their last substring of length `k-1`, we keep only the one with the highest score.
(6)	If no more scores can be computed, we add the next amino acid, which means adding all possible codons to each partial solution. E.g. if we add Isoleucine, which can be coded by 3 codons, the number of partial solutions will be multiplied by 3.
(7)	When there are no more amino acids and no more frequencies to be computed, the solution is the partial solution with the highest score.