Alternate genetic codes in newly sequenced organisms


Variations of the standard genetic code are pretty rare, but as the cost of high-throughput genome sequencing continues to drop, there is a greater possibility of discovering additional exceptions. That being said, there is a clear emphasis in genome projects on nucleotide (genome and transcriptome) sequencing, with much less (if any) effort put into proteomics work (correct me if I'm wrong there).

Let's assume we're sequencing the genome of a new organism and we're focusing completely on genome and transcriptome sequencing--no proteomics. Let's also assume this organism has slight variations to the standard genetic code. Would it be possible to annotate this genome (for protein-coding genes) completely incorrectly since the gene prediction software does not take into account these variations, or would it be pretty obvious? What would you expect to see in this case?


If the organism uses an alternative code, the predicted protein sequences would always include the same type of amino acid substitution errors. This pattern should become apparent when you compare the proteins to other organisms. In reality, the most common alternative code uses UGA for tryptophan instead of the usual stop. If you make the mistake of using the standard code for these genomes, the predicted genes would become highly fragmented so it is almost impossible not to notice.
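To make the fragmentation concrete, here is a minimal Python sketch; it is not a real gene-prediction pipeline. The toy sequence `cds` and the helper names (`translate`, `fragments`) are made up for illustration. The standard table is built from the usual TCAG codon ordering, and the variant table differs only in reading TGA (UGA in the mRNA) as tryptophan.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (DNA alphabet).
BASES = "TCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip(("".join(c) for c in product(BASES, repeat=3)), STANDARD_AAS))

# Variant code: identical to the standard code except that TGA encodes Trp (W).
UGA_TRP = {**STANDARD, "TGA": "W"}

def translate(dna, table):
    """Translate frame 1 of a DNA string; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))

def fragments(protein):
    """Split a translation at stop codons and drop empty pieces."""
    return [piece for piece in protein.split("*") if piece]

# Hypothetical toy gene from an organism that reads TGA as Trp.
cds = "ATGTGAGCTTGGAAATGACCTGGTTGATTTCATTAA"

wrong = translate(cds, STANDARD)   # annotated with the wrong (standard) code
right = translate(cds, UGA_TRP)    # annotated with the organism's own code

print(wrong, fragments(wrong))     # M*AWK*PG*FH* ['M', 'AWK', 'PG', 'FH']
print(right, fragments(right))     # MWAWKWPGWFH* ['MWAWKWPGWFH']
```

With the wrong table, one open reading frame breaks into four short pieces; with the matching table it is read through to the real stop. Across a whole genome that shows up as an excess of short, truncated gene models.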


You are correct to be a little concerned, especially with mitochondrial genomes (where non-standard genetic codes are more prevalent). In addition to the use of non-standard codes you should also consider RNA editing (changed codons or deleted/inserted nucleotides to shift reading frames) and special amino acids (e.g., selenocysteine). All together, the emphasis on DNA sequence to infer protein structure is safe in the bulk of cases, but don't be surprised by differences (for particular genes or for whole organisms).


The short answer is yes, you could end up with many errors. See here for more detail:

NCBI: "The Genetic Codes"


The Non-Universality of the Genetic Code (And Its Implications) by Rich Deem

Original studies of the genetic code demonstrated that it was constant throughout the plant, protist, and animal kingdoms. It was taught (even in my college days) that the genetic code should be universal as predicted by the theory of evolution, since alterations of the genetic code would be lethal in those individuals that acquired genetic code mutations. Recently, many examples of variations in the genetic code have been discovered in many species of unrelated organisms. Variation was first shown in 1979 by Barrell et al. (1), and subsequent studies have demonstrated that the genetic code differs in diverse groups of unrelated animals, plants, and protists (2). There is no pattern of change in the code of related groups of organisms, and none of the organisms that possess an altered genetic code exhibit any form of evolutionary descent or common ancestry, which might imply that the genetic code had evolved.

Introduction to the genetic code and protein translation:

The molecule depicted on this page (Figure 1) is transfer RNA (tRNA). This molecule is responsible for translating messenger RNA (mRNA) to protein. Amino acids are esterified (bound) to the 3' end of the molecule. The anticodon portion of the molecule binds to the complementary codon of mRNA, subsequently adding the amino acid to the growing peptide (protein) chain. The anticodon site of the tRNA molecule provides specificity to the genetic code, since each specific tRNA binds to a specific mRNA codon sequence. All organisms that differ from the "universal" genetic code do so through a single nucleotide substitution in the anticodon site of the tRNA. In all the examples given in this paper, this single nucleotide substitution causes the tRNA to bind to a completely different mRNA codon (see Figure 2). Therefore, all codons that coded for this one amino acid would now be substituted by another amino acid. This kind of mutation would result in wholesale changes to all or nearly all of the proteins of an organism, which would be lethal to 100% of these mutants.


Figure 2. Translation of the same mRNA by (A) wild-type and (B) mutant tRNA

Since evolution requires a process by which the genetic code would be altered, evolutionists propose that organisms that express a non-standard genetic code must be genetic mutants. The problem is that this single point mutation would result in an organism that could not survive (in fact, the fertilized ovum of such a mutant would be unable to undergo even one round of cell division). This is not just my opinion. Even proponents of evolution agree that 100% of these tRNA mutants would be unable to survive. Dr. T. H. Jukes (University of California, Berkeley) has stated, "Any mutational change in the code would be lethal, because it would produce widespread alterations in the amino acid sequences of proteins. Such changes would destroy protein function, and hence would be intolerable." What these scientists have really found to be "intolerable" is what these "mutations" disclose about the theory of evolution.

tRNA "Mutants" of the "Universal" Genetic Code

What kinds of changes would tRNA mutations result in? Table 1 describes a number of these tRNA mutants, including the kinds of organisms and the changes in amino acids coded for. One can see that many of these mutations substitute amino acids that stabilize α-helical structures (the most common secondary structure in proteins) for those that destabilize α-helical structures or, alternatively, those that destabilize α-helical structures for those that stabilize them. A substitution that changes the configuration of a protein is virtually guaranteed to destroy its function. It actually gets worse in other instances. In some of the genetic code mutants, former stop codons (which cause protein translation to halt when the protein reaches the proper size) now code for amino acids, which would result in longer proteins. Although many proteins undergo post-translational modification, those that don't undergo post-translational modification will undergo conformational changes resulting in inactivation of function. The most lethal of all these mutants would be the ones that changed from coding for arginine to coding for the stop sequence. This mutant would produce proteins with very short polypeptide chains, since protein translation would stop at every instance where arginine (AGR) was coded. Some proteins would not be made at all and virtually all other proteins would be non-functional.

Table 1. tRNA "Mutants" of the "Universal" Genetic Code
"Mutant" code ("normal" amino acid) | Stabilizes α-helix? | "Mutant" amino acid ("universal" codons) | Stabilizes α-helix? | Where "mutant" found
AUA (isoleucine) | no | methionine (AUG) | yes | human mitochondria
UGA (stop) | N/A | tryptophan (UGG) | yes | human mitochondria
UGA (stop) | N/A | tryptophan (UGG) | yes | Mycoplasma spp.
UAA and UAG (stop) | N/A | glutamine (CAA, CAG) | no | ciliated protozoa, Acetabularia
UGA (stop) | N/A | cysteine (UGC, UGU) | yes | E. octacarinatus
CUG (leucine) | yes | serine (UCN) | no | Candida spp.
CUN (leucine) | yes | threonine (ACN) | no | yeasts
AAA (lysine) | no | asparagine (AAU, AAC) | yes | platyhelminths and echinoderms
UAA (stop) | N/A | tyrosine (UAU, UAC) | yes | planaria
AGR (arginine) | no | serine (AGU, AGC) | no | several animal orders
AGR (arginine) | no | stop (UGA, UAA, UAG) | N/A | some vertebrates
A = adenine, C = cytosine, G = guanine, U = uracil, N = any nucleotide, R = adenine or guanine (purine)

The evidence is overwhelming and the problem an extremely profound one. The scientific community and even the community of creationary scientists, despite its obvious implications, have largely ignored this problem. Those who have proposed evolutionary explanations have done so using mechanisms that are so improbable as to be statistically impossible. Here is their explanation:

We propose that the changes are typically preceded by loss of a codon from all coding sequences in an organism or organelle, often as a result of directional mutation pressure, accompanied by loss of the tRNA that translates the codon. The codon reappears later by conversion of another codon and emergence of a tRNA that translates the reappeared codon with a different assignment (2).

The Problem of Mechanism

Here is the essence of what evolutionists are proposing. They propose that every instance of a specific codon in the DNA is mutated and replaced by another codon. This requires replacement of 1-5% of the entire genetic sequence of an organism. Although this doesn't seem like a large amount of the genome, it is the specificity of replacement that makes this mechanism statistically impossible. In the case of vertebrates, this replacement would involve the specific replacement of millions of nucleotide pairs. There is no "directional mutation pressure" that would cause only one codon sequence to be replaced in an organism, even according to evolutionary theories. Evolution states that selection acts on the protein structure to improve its function. Since we are talking about all the proteins in an organism, there is no one selective pressure that would work to improve the function of all proteins simultaneously (especially by substituting only one specific amino acid for another). It would make much more sense from an evolutionary viewpoint that gene duplication of tRNA would occur, followed by mutation of the duplicated tRNA and gradual conversion of genetic sequences from those that bound to the universal tRNA to those that bound the mutant tRNA. However, such a scenario would be expected to produce intermediate forms of organisms (possessing both forms of tRNA), since the process would obviously be a long one. Even though there are dozens of examples of tRNA mutants, none of them exist as these hypothetical intermediates, indicating that this is not a reasonable mechanism.

The Problem of Descent

The existence of genetic code mutants in diverse groups of unrelated organisms presents a significant problem in evolutionary theory. Since all life must be related to their relatives, the genetics must reflect this fact. These mutants present a glaring problem in terms of descent. One would expect that related organisms would exhibit some form of evolutionary tree in regard to the genetic code mutants. Instead, what we find are randomly isolated individual species of organisms which possess these genetic code mutations. If the mechanism for producing these mutations was evolution, we would expect to find whole families and orders of organisms with these kinds of mutations. Alternatively, we must accept that all of these mutations occurred relatively recently, such that there would be no record of descent.

Conclusion

In conclusion, there is no reasonable evolutionary mechanism by which tRNA point mutations can occur in such a diversity of organisms. Evolutionists must believe in a magical directional mutation pressure that replaces all of one specific codon sequence in a variety of unrelated species. In addition, they must believe that none of these species have evolved once this mutation occurred, since there is no evidence of descent. The alternative, that a Creator designed tRNA "mutants" to show us that He and not blind chance is responsible for life, is intolerable to those whose "faith" is evolution.


In this section, you will explore the following questions:

  • What is the “Central Dogma” of protein synthesis?
  • What is the genetic code, and how does nucleotide sequence prescribe the amino acid and polypeptide sequence?

Connection for AP ® Courses

Since the rediscovery of Mendel’s work in the 1900s, scientists have learned much about how the genetic blueprints stored in DNA are capable of replication, expression, and mutation. Just as the 26 letters of the English alphabet can be arranged into what seems to be a limitless number of words, with new ones added to the dictionary every year, the four nucleotides of DNA—A, T, C, and G—can generate sequences of DNA called genes that specify tens of thousands of polymers of amino acids. In turn, these sequences can be transcribed into mRNA and translated into proteins which orchestrate nearly every function of the cell. The genetic code refers to the DNA alphabet (A, T, C, G), the RNA alphabet (A, U, C, G), and the polypeptide alphabet (20 amino acids). But how do genes located on a chromosome ultimately produce a polypeptide that can result in a physical phenotype such as hair or eye color—or a disease like cystic fibrosis or hemophilia?

The Central Dogma describes the normal flow of genetic information from DNA to mRNA to protein: DNA in genes specifies sequences of mRNA which, in turn, specify amino acid sequences in proteins. The process requires two steps, transcription and translation. During transcription, genes are used to make messenger RNA (mRNA). In turn, the mRNA is used to direct the synthesis of proteins during the process of translation. Translation also requires two other types of RNA: transfer RNA (tRNA) and ribosomal RNA (rRNA). The genetic code is a triplet code, with each RNA codon consisting of three consecutive nucleotides that specify one amino acid or the release of the newly formed polypeptide chain; for example, the mRNA codon CAU specifies the amino acid histidine. The code is degenerate; that is, some amino acids are specified by more than one codon, like synonyms you study in your English class (different word, same meaning). For example, CCU, CCC, CCA, and CCG are all codons for proline. It is important to remember that the same genetic code is universal to almost all organisms on Earth. Small variations in codon assignment exist in mitochondria and some microorganisms.
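The triplet code and its degeneracy can be explored with a short Python sketch. This is an illustration only: the names `CODON_TABLE` and `synonyms` are invented here, amino acids are written in one-letter codes (H = histidine, P = proline), and `*` stands for a stop codon.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code over the RNA alphabet, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))

# Group codons by the amino acid (or stop signal) they specify.
synonyms = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    synonyms[amino_acid].append(codon)

print(CODON_TABLE["CAU"])        # H (histidine)
print(sorted(synonyms["P"]))     # ['CCA', 'CCC', 'CCG', 'CCU'] (proline)
print(len(CODON_TABLE))          # 64 codons in total
print(len(synonyms["*"]))        # 3 stop codons
```

Sixty-four codons map onto only 20 amino acids plus a stop signal, so most amino acids have several synonymous codons.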

Deviations from the simple scheme of the central dogma are discovered as researchers explore gene expression with new technology. For example, the human immunodeficiency virus (HIV) is a retrovirus which stores its genetic information in single-stranded RNA molecules. Upon infection of a host cell, RNA is used as a template by the virally encoded enzyme, reverse transcriptase, to synthesize DNA. The viral DNA is later transcribed into mRNA and translated into proteins. Some RNA viruses such as the influenza virus never go through a DNA step. The RNA genome is replicated by an RNA-dependent RNA polymerase which is virally encoded.

The content presented in this section supports the Learning Objectives outlined in Big Idea 1 and Big Idea 3 of the AP ® Biology Curriculum Framework. The Learning Objectives merge Essential Knowledge content with one or more of the seven Science Practices. These Learning Objectives provide a transparent foundation for the AP ® Biology course, along with inquiry-based laboratory experiences, instructional activities, and AP ® Exam questions.

Big Idea 1 The process of evolution drives the diversity and unity of life.
Enduring Understanding 1.B Organisms are linked by lines of descent from common ancestry.
Essential Knowledge 1.B.1 Organisms share many conserved core processes and features that evolved and are widely distributed among organisms today.
Science Practice 3.1 The student can pose scientific questions.
Science Practice 7.2 The student can connect concepts in and across domain(s) to generalize or extrapolate in and/or across enduring understandings and/or big ideas.
Learning Objective 1.15 The student is able to describe specific examples of conserved core biological processes and features shared by all domains or within one domain of life, and how these shared, conserved core processes and features support the concept of common ancestry for all organisms.
Big Idea 3 Living systems store, retrieve, transmit and respond to information essential to life processes.
Enduring Understanding 3.A Heritable information provides for continuity of life.
Essential Knowledge 3.A.1 DNA, and in some cases RNA, is the primary source of heritable information.
Science Practice 6.5 The student can evaluate alternative scientific explanations.
Learning Objective 3.1 The student is able to construct scientific explanations that use the structure and functions of DNA and RNA to support the claim that DNA and, in some cases, that RNA are the primary sources of heritable information.

The Science Practice Challenge Questions contain additional test questions for this section that will help you prepare for the AP exam. These questions address the following standards:
[APLO 3.4][APLO 3.25]

The cellular process of transcription generates messenger RNA (mRNA), a mobile molecular copy of one or more genes with an alphabet of A, C, G, and uracil (U). Translation of the mRNA template converts nucleotide-based genetic information into a protein product. Protein sequences consist of 20 commonly occurring amino acids; therefore, it can be said that the protein alphabet consists of 20 letters (Figure 15.2). Each amino acid is defined by a three-nucleotide sequence called the triplet codon. Different amino acids have different chemistries (such as acidic versus basic, or polar versus nonpolar) and different structural constraints. Variation in amino acid sequence gives rise to enormous variation in protein structure and function.

The Central Dogma: DNA Encodes RNA; RNA Encodes Protein

The flow of genetic information in cells from DNA to mRNA to protein is described by the Central Dogma (Figure 15.3), which states that genes specify the sequence of mRNAs, which in turn specify the sequence of proteins. The decoding of one molecule to another is performed by specific proteins and RNAs. Because the information stored in DNA is so central to cellular function, it makes intuitive sense that the cell would make mRNA copies of this information for protein synthesis, while keeping the DNA itself intact and protected. The copying of DNA to RNA is relatively straightforward, with one nucleotide being added to the mRNA strand for every nucleotide read in the DNA strand. The translation to protein is a bit more complex because three mRNA nucleotides correspond to one amino acid in the polypeptide sequence. However, the translation to protein is still systematic and colinear, such that nucleotides 1 to 3 correspond to amino acid 1, nucleotides 4 to 6 correspond to amino acid 2, and so on.

The Genetic Code Is Degenerate and Universal

Given the different numbers of “letters” in the mRNA and protein “alphabets,” scientists theorized that combinations of nucleotides corresponded to single amino acids. Nucleotide doublets would not be sufficient to specify every amino acid because there are only 16 possible two-nucleotide combinations (4^2). In contrast, there are 64 possible nucleotide triplets (4^3), which is far more than the number of amino acids. Scientists theorized that amino acids were encoded by nucleotide triplets and that the genetic code was degenerate. In other words, a given amino acid could be encoded by more than one nucleotide triplet. This was later confirmed experimentally: Francis Crick and Sydney Brenner used the chemical mutagen proflavin to insert one, two, or three nucleotides into the gene of a virus. When one or two nucleotides were inserted, protein synthesis was completely abolished. When three nucleotides were inserted, the protein was synthesized and functional. This demonstrated that three nucleotides specify each amino acid. These nucleotide triplets are called codons. The insertion of one or two nucleotides completely changed the triplet reading frame, thereby altering the message for every subsequent amino acid (Figure 15.4). Though insertion of three nucleotides caused an extra amino acid to be inserted during translation, the integrity of the rest of the protein was maintained.
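The counting argument and the reading-frame effect described above can be illustrated with a short Python sketch. The message `original` and the helper names are hypothetical, chosen only for illustration; the sketch is not a model of the actual proflavin experiment.

```python
from itertools import product

BASES = "ACGU"  # the four-letter mRNA alphabet


def count_words(length: int) -> int:
    """Number of distinct nucleotide 'words' of the given length."""
    return len(BASES) ** length


def split_codons(mrna: str) -> list[str]:
    """Split an mRNA string into consecutive triplets, dropping an incomplete tail."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


def insert_bases(mrna: str, position: int, bases: str) -> str:
    """Return a copy of mrna with extra bases inserted at the given position."""
    return mrna[:position] + bases + mrna[position:]


# Doublets cannot cover 20 amino acids; triplets can.
doublets = count_words(2)                      # 16
triplets = len(list(product(BASES, repeat=3)))  # 64, counted by enumeration

# Hypothetical message used only to show the frame effect.
original = "AUGGCUGAAUUCCGA"
one_inserted = split_codons(insert_bases(original, 3, "C"))
three_inserted = split_codons(insert_bases(original, 3, "CCC"))

print(split_codons(original))
print(one_inserted)     # every codon after the insertion point changes
print(three_inserted)   # one extra codon, downstream codons unchanged
```

Inserting a single base shifts every downstream triplet, whereas inserting three bases adds one codon and leaves the rest of the message in frame.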

Scientists painstakingly solved the genetic code by translating synthetic mRNAs in vitro and sequencing the proteins they specified (Figure 15.5).

In addition to instructing the addition of a specific amino acid to a polypeptide chain, three of the 64 codons terminate protein synthesis and release the polypeptide from the translation machinery. These triplets are called nonsense codons, or stop codons. Another codon, AUG, also has a special function. In addition to specifying the amino acid methionine, it serves as the start codon to initiate translation. The reading frame for translation is set by the AUG start codon near the 5' end of the mRNA.
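As a minimal sketch of these rules, the following Python code builds the standard codon table and translates a message from the first AUG to the first in-frame stop codon. Treating the first AUG as the start site is a simplifying assumption, and the example mRNA is hypothetical.

```python
from itertools import product

# Standard genetic code written in the conventional U, C, A, G order;
# '*' marks the three stop (nonsense) codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon.

    Returns one-letter amino acid codes; returns '' if no AUG is present.
    """
    start = mrna.find("AUG")  # assumption: the first AUG sets the reading frame
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical message: two leader bases, AUG, three codons, then UAA.
print(translate("GGAUGGCUGAAUUCUAAGG"))
```

The dictionary is keyed by codon, so degeneracy shows up directly: several keys map to the same amino acid letter.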

The genetic code is universal. With a few exceptions, virtually all species use the same genetic code for protein synthesis. Conservation of codons means that a purified mRNA encoding the globin protein in horses could be transferred to a tulip cell, and the tulip would synthesize horse globin. That there is only one genetic code is powerful evidence that all of life on Earth shares a common origin, especially considering that there are about 10^84 possible combinations of 20 amino acids and 64 triplet codons.
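The text does not say how the figure of about 10^84 is obtained. One counting that gives this order of magnitude assumes that each of the 64 codons could independently be assigned any of 21 meanings (20 amino acids or "stop"); that assumption is ours, not the source's.

```python
import math

# Assumption: each codon independently takes one of 21 meanings
# (20 amino acids plus "stop"). This is one possible counting, not
# necessarily the one used to arrive at the figure quoted in the text.
meanings = 20 + 1
codons = 4 ** 3

assignments = meanings ** codons            # 21^64 as an exact integer
exponent = math.floor(math.log10(assignments))

print(f"about 10^{exponent} possible codon tables")
```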




It is noteworthy that the genetic code for all organisms is basically the same, so that all living beings use the same ’genetic language’. [4] In general, the introduction of new functional unnatural amino acids into proteins of living cells breaks the universality of the genetic language, which ideally leads to alternative life forms. [5] Proteins are produced thanks to the translational system molecules, which decode the RNA messages into a string of amino acids. The translation of genetic information contained in messenger RNA (mRNA) into a protein is catalysed by ribosomes. Transfer RNAs (tRNA) are used as keys to decode the mRNA into its encoded polypeptide. The tRNA recognizes a specific three nucleotide codon in the mRNA with a complementary sequence called the anticodon on one of its loops. Each three-nucleotide codon is translated into one of twenty naturally occurring amino acids. [6] There is at least one tRNA for any codon, and sometimes multiple codons code for the same amino acid. Many tRNAs are compatible with several codons. An enzyme called an aminoacyl tRNA synthetase covalently attaches the amino acid to the appropriate tRNA. [7] Most cells have a different synthetase for each amino acid (20 or more synthetases). On the other hand, some bacteria have fewer than 20 aminoacyl tRNA synthetases, and introduce the "missing" amino acid(s) by modification of a structurally related amino acid by an aminotransferase enzyme. [8] A feature exploited in the expansion of the genetic code is the fact that the aminoacyl tRNA synthetase often does not recognize the anticodon, but another part of the tRNA, meaning that if the anticodon were to be mutated the encoding of that amino acid would change to a new codon. In the ribosome, the information in mRNA is translated into a specific amino acid when the mRNA codon matches with the complementary anticodon of a tRNA, and the attached amino acid is added onto a growing polypeptide chain. 
When it is released from the ribosome, the polypeptide chain folds into a functioning protein. [7]

In order to incorporate a novel amino acid into the genetic code, several changes are required. First, for successful translation of a novel amino acid, the codon to which the novel amino acid is assigned cannot already code for one of the 20 natural amino acids. Usually a nonsense codon (stop codon) or a four-base codon is used. [6] Second, a novel pair of tRNA and aminoacyl tRNA synthetase is required; this pair is called the orthogonal set. The orthogonal set must not crosstalk with the endogenous tRNA and synthetase sets, while still being functionally compatible with the ribosome and other components of the translation apparatus. The active site of the synthetase is modified to accept only the novel amino acid. Most often, a library of mutant synthetases is screened for one that charges the tRNA with the desired amino acid. The synthetase is also modified to recognize only the orthogonal tRNA. [6] The tRNA synthetase pair is often engineered in other bacteria or eukaryotic cells. [9]

In this area of research, the 20 encoded proteinogenic amino acids are referred to as standard amino acids, or alternatively as natural or canonical amino acids, while the added amino acids are called non-standard amino acids (NSAAs), unnatural amino acids (uAAs, a term not used in papers dealing with natural non-proteinogenic amino acids such as phosphoserine), or non-canonical amino acids.

The first element of the system is the amino acid that is added to the genetic code of a certain strain of organism.

Over 71 different NSAAs have been added to different strains of E. coli, yeast or mammalian cells. [10] Due to technical details (easier chemical synthesis of NSAAs, less crosstalk and easier evolution of the aminoacyl-tRNA synthetase), the NSAAs are generally larger than standard amino acids and most often have a phenylalanine core with a large variety of different substituents. These allow a large repertoire of new functions, such as labeling (see figure), use as a fluorescent reporter (e.g. dansylalanine), [11] or the production in E. coli of proteins carrying eukaryotic post-translational modifications (e.g. phosphoserine, phosphothreonine, and phosphotyrosine). [10] [12]

Unnatural amino acids incorporated into proteins include heavy atom-containing amino acids to facilitate certain x-ray crystallographic studies; amino acids with novel steric/packing and electronic properties; photocrosslinking amino acids, which can be used to probe protein-protein interactions in vitro or in vivo; keto-, acetylene-, azide-, and boronate-containing amino acids, which can be used to selectively introduce a large number of biophysical probes, tags, and novel chemical functional groups into proteins in vitro or in vivo; redox-active amino acids to probe and modulate electron transfer; photocaged and photoisomerizable amino acids to photoregulate biological processes; metal-binding amino acids for catalysis and metal ion sensing; amino acids that contain fluorescent or infrared-active side chains to probe protein structure and dynamics; α-hydroxy acids and D-amino acids as probes of backbone conformation and hydrogen bonding interactions; and sulfated amino acids and mimetics of phosphorylated amino acids as probes of post-translational modifications. [13] [14] [15]

Availability of the non-standard amino acid requires that the organism either import it from the medium or biosynthesize it. In the first case, the unnatural amino acid is first synthesized chemically in its optically pure L-form. [16] It is then added to the growth medium of the cell. [10] A library of compounds is usually tested for use in incorporation of the new amino acid, but this is not always necessary; for example, various transport systems can handle unnatural amino acids with apolar side chains. In the second case, a biosynthetic pathway needs to be engineered. One example is an E. coli strain that biosynthesizes a novel amino acid (p-aminophenylalanine) from basic carbon sources and includes it in its genetic code. [15] [17] [18] Another example is phosphoserine: because it is a natural metabolite, its pathway flux had to be altered to increase its production. [12]

Another element of the system is a codon to allocate to the new amino acid.

A major problem for genetic code expansion is that there are no free codons. The genetic code has a non-random layout that shows tell-tale signs of various phases of primordial evolution; however, it has since frozen into place and is near-universally conserved. [19] Nevertheless, some codons are rarer than others. In E. coli (as in all organisms), codon usage is not uniform: there are several rare codons (see table), the rarest being the amber stop codon (UAG).

Codon usage in E. coli [20]
Codon Amino acid Abundance (%)
UUU Phe (F) 1.9
UUC Phe (F) 1.8
UUA Leu (L) 1.0
UUG Leu (L) 1.1
CUU Leu (L) 1.0
CUC Leu (L) 0.9
CUA Leu (L) 0.3
CUG Leu (L) 5.2
AUU Ile (I) 2.7
AUC Ile (I) 2.7
AUA Ile (I) 0.4
AUG Met (M) 2.6
GUU Val (V) 2.0
GUC Val (V) 1.4
GUA Val (V) 1.2
GUG Val (V) 2.4
UCU Ser (S) 1.1
UCC Ser (S) 1.0
UCA Ser (S) 0.7
UCG Ser (S) 0.8
CCU Pro (P) 0.7
CCC Pro (P) 0.4
CCA Pro (P) 0.8
CCG Pro (P) 2.4
ACU Thr (T) 1.2
ACC Thr (T) 2.4
ACA Thr (T) 0.1
ACG Thr (T) 1.3
GCU Ala (A) 1.8
GCC Ala (A) 2.3
GCA Ala (A) 0.1
GCG Ala (A) 3.2
UAU Tyr (Y) 1.6
UAC Tyr (Y) 1.4
UAA Stop 0.2
UAG Stop 0.03
CAU His (H) 1.2
CAC His (H) 1.1
CAA Gln (Q) 1.3
CAG Gln (Q) 2.9
AAU Asn (N) 1.6
AAC Asn (N) 2.6
AAG Lys (K) 3.8
AAA Lys (K) 1.2
GAU Asp (D) 3.3
GAC Asp (D) 2.3
GAA Glu (E) 4.4
GAG Glu (E) 1.9
UGU Cys (C) 0.4
UGC Cys (C) 0.6
UGA Stop 0.1
UGG Trp (W) 1.4
CGU Arg (R) 2.4
CGC Arg (R) 2.2
CGA Arg (R) 0.3
CGG Arg (R) 0.5
AGU Ser (S) 0.7
AGC Ser (S) 1.5
AGA Arg (R) 0.2
AGG Arg (R) 0.2
GGU Gly (G) 2.8
GGC Gly (G) 3.0
GGG Gly (G) 0.7
GGA Gly (G) 0.9
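
A table like the one above can be derived by counting codons across an organism's coding sequences. The following is a minimal Python sketch; the two toy sequences are hypothetical and far too short to give realistic frequencies.

```python
from collections import Counter


def codon_usage(coding_sequences: list[str]) -> dict[str, float]:
    """Return codon abundance in percent across in-frame coding sequences.

    DNA input (with T) is converted to the RNA alphabet (with U).
    """
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper().replace("T", "U")
        usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
        counts.update(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: 100.0 * n / total for codon, n in counts.items()}


# Hypothetical toy genes, used only to exercise the function.
toy_genes = ["ATGGCGCTGAAATAA", "ATGCTGCTGGAATAG"]
usage = codon_usage(toy_genes)

for codon, percent in sorted(usage.items(), key=lambda item: item[1]):
    print(f"{codon} {percent:.1f}")
```

Sorting by abundance puts the rarest codons first, which is the ranking that matters when looking for a codon to reassign.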

Amber codon suppression

The possibility of reassigning codons was realized by Normanly et al. in 1990, when a viable mutant strain of E. coli read through the UAG ("amber") stop codon. [21] This was possible thanks to the rarity of this codon and the fact that release factor 1 alone makes the amber codon terminate translation. Later, in the Schultz lab, the tRNATyr/tyrosyl-tRNA synthetase (TyrRS) pair from Methanococcus jannaschii, an archaeon, [6] was used to introduce a tyrosine instead of STOP, the default value of the amber codon. [22] This was possible because the endogenous bacterial synthetases and the orthologous archaeal synthetase differ enough that they do not recognize each other's tRNAs. Subsequently, the group evolved the orthogonal tRNA/synthetase pair to utilise the non-standard amino acid O-methyltyrosine. [23] This was followed by the larger naphthylalanine [24] and the photocrosslinking benzoylphenylalanine, [25] which proved the potential utility of the system.

The amber codon is the least used codon in Escherichia coli, but hijacking it results in a substantial loss of fitness. One study found that at least 83 peptides were substantially affected by the readthrough. [26] Additionally, the labelling was incomplete. As a consequence, several strains have been made to reduce the fitness cost, including by the removal of all amber codons from the genome. In most E. coli K-12 strains there are 314 UAG stop codons, so a very large amount of work has gone into replacing them. One approach, pioneered by the group of Prof. George Church at Harvard, was dubbed MAGE in CAGE: it relied on multiplex transformation and subsequent strain recombination to remove all UAG codons. The recombination step presented a halting point in a first paper, [27] but was later overcome. This resulted in the E. coli strain C321.ΔA, which lacks all UAG codons and RF1. [28] This strain was then used in an experiment to make it "addicted" to the amino acid biphenylalanine by evolving several key enzymes to require it structurally, thereby putting its expanded genetic code under positive selection. [29]
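The core of such a recoding can be sketched as a string operation on open reading frames. The sketch below assumes that each terminal TAG (UAG in the mRNA) is swapped for the synonymous stop TAA; the function names and toy genes are hypothetical, and the real work lies in making these edits in a living genome, which this code does not model.

```python
def recode_amber_stop(orf_dna: str) -> str:
    """Replace a terminal TAG stop codon with TAA; leave other ORFs unchanged.

    Assumes orf_dna is an in-frame open reading frame ending in a stop codon.
    """
    if len(orf_dna) % 3 != 0:
        raise ValueError("ORF length must be a multiple of three")
    if orf_dna[-3:].upper() == "TAG":
        return orf_dna[:-3] + "TAA"
    return orf_dna


def count_amber_stops(orfs: list[str]) -> int:
    """Count ORFs that terminate with the amber codon."""
    return sum(1 for orf in orfs if orf[-3:].upper() == "TAG")


# Hypothetical toy ORFs: one ends in TAG, one in TGA.
genes = ["ATGGCTGAATAG", "ATGAAACCCTGA"]
recoded = [recode_amber_stop(g) for g in genes]

print(count_amber_stops(genes), "->", count_amber_stops(recoded))
```

Because TAA is also a stop codon, the encoded proteins are unchanged, while UAG no longer terminates any gene and becomes free for reassignment.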

Rare sense codon reassignment

In addition to the amber codon, rare sense codons have also been considered for use. The AGG codon codes for arginine, but a strain has been successfully modified to make it code for 6-N-allyloxycarbonyl-lysine. [30] Another candidate is the AUA codon, which is unusual in that its respective tRNA has to differentiate against AUG that codes for methionine (primordially, isoleucine, hence its location). In order to do this, the AUA tRNA has a special base, lysidine. The deletion of the synthase (tilS) was possible thanks to the replacement of the native tRNA with that of Mycoplasma mobile (no lysidine). The reduced fitness is a first step towards pressuring the strain to lose all instances of AUA, allowing it to be used for genetic code expansion. [31]

Four base codons

Other approaches include the addition of extra base pairs or the use of orthogonal ribosomes that accept, in addition to tRNAs reading the regular triplet code, tRNAs that read quadruplet codons. [32] This allowed the simultaneous usage of two unnatural amino acids, p-azidophenylalanine (pAzF) and N6-[(2-propynyloxy)carbonyl]lysine (CAK), which cross-link with each other by Huisgen cycloaddition. [33]

Another key element is the tRNA/synthetase pair.

The orthogonal set of synthetase and tRNA can be mutated and screened through directed evolution to charge the tRNA with a different, even novel, amino acid. Mutations to the plasmid containing the pair can be introduced by error-prone PCR or through degenerate primers for the synthetase's active site. Selection involves multiple rounds of a two-step process. In the first step, the plasmid is transferred into cells expressing chloramphenicol acetyl transferase with a premature amber codon. In the presence of toxic chloramphenicol and the non-natural amino acid, the surviving cells will have overridden the amber codon using the orthogonal tRNA aminoacylated with either a standard amino acid or the non-natural one. To remove the former, the plasmid is then inserted into cells carrying a (toxic) barnase gene with a premature amber codon, but without the non-natural amino acid in the medium, which removes all the orthogonal synthetases that do not specifically recognize the non-natural amino acid. [6] In addition to being recoded to a different codon, the tRNAs can be mutated to recognize a four-base codon, allowing additional free coding options. [34] The non-natural amino acid, as a result, introduces diverse physicochemical and biological properties that can be used as a tool to explore protein structure and function or to create novel or enhanced proteins for practical purposes.
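The logic of the two-step selection can be sketched as two filters over a library of synthetase variants. This is a deliberately coarse model: each variant is reduced to two hypothetical yes/no properties, whereas real selections act on graded activities in living cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthetaseVariant:
    """Hypothetical synthetase variant, reduced to two boolean properties."""
    name: str
    charges_nsaa: bool      # charges the orthogonal tRNA with the non-natural amino acid
    charges_standard: bool  # charges the orthogonal tRNA with a standard amino acid


def positive_selection(library, nsaa_present=True):
    """Chloramphenicol step: cells survive only if the amber codon is suppressed.

    Any amino acid on the orthogonal tRNA rescues the resistance gene.
    """
    return [
        v for v in library
        if v.charges_standard or (nsaa_present and v.charges_nsaa)
    ]


def negative_selection(library):
    """Barnase step, run without the non-natural amino acid in the medium.

    Variants that charge a standard amino acid suppress the amber codon in
    the barnase gene, produce the toxin, and are lost.
    """
    return [v for v in library if not v.charges_standard]


library = [
    SynthetaseVariant("inactive", False, False),
    SynthetaseVariant("promiscuous", True, True),
    SynthetaseVariant("standard_only", False, True),
    SynthetaseVariant("specific", True, False),
]

survivors = negative_selection(positive_selection(library))
print([v.name for v in survivors])
```

Only the variant that charges the non-natural amino acid and nothing else passes both filters, which is why the two steps are alternated over several rounds.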

Orthogonal sets in model organisms

The orthogonal pairs of synthetase and tRNA that work for one organism may not work for another, as the synthetase may mis-aminoacylate endogenous tRNAs, or the tRNA may itself be mis-aminoacylated by an endogenous synthetase. As a result, the sets created to date differ between organisms.

In 2017 a mouse engineered with an extended genetic code that can produce proteins with unnatural amino acids was reported. [48]

Similarly to orthogonal tRNAs and aminoacyl tRNA synthetases (aaRSs), orthogonal ribosomes have been engineered to work in parallel to the natural ribosomes. Orthogonal ribosomes ideally use different mRNA transcripts than their natural counterparts and ultimately should draw on a separate pool of tRNA as well. This should alleviate some of the loss of fitness which currently still arises from techniques such as Amber codon suppression. Additionally, orthogonal ribosomes can be mutated and optimized for particular tasks, like the recognition of quadruplet codons. Such an optimization is not possible, or highly disadvantageous for natural ribosomes.

O-Ribosome

In 2005 three sets of ribosomes were published that did not recognize natural mRNA, but instead translated a separate pool of orthogonal mRNA (o-mRNA). [49] This was achieved by changing the recognition sequence of the mRNA, the Shine-Dalgarno sequence, together with the corresponding recognition sequence in the 16S rRNA of the ribosome, the so-called anti-Shine-Dalgarno sequence. This way the base pairing, which is usually lost if either sequence is mutated alone, is preserved. However, the mutations in the 16S rRNA were not limited to the obviously base-pairing nucleotides of the classical anti-Shine-Dalgarno sequence.
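The principle of co-mutating the two recognition sequences can be sketched with strict Watson-Crick pairing. The natural pair below uses the textbook Shine-Dalgarno consensus AGGAGG and its complement; the "orthogonal" pair is a hypothetical illustration and not one of the published sequences. Wobble pairs and partial matches are ignored.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the strand that pairs antiparallel with rna (Watson-Crick only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def pairs(shine_dalgarno: str, anti_shine_dalgarno: str) -> bool:
    """True if the two sequences are perfectly complementary (simplified model)."""
    return reverse_complement(shine_dalgarno) == anti_shine_dalgarno


natural_sd = "AGGAGG"                            # consensus mRNA recognition sequence
natural_anti_sd = reverse_complement(natural_sd)

orthogonal_sd = "CACCAC"                         # hypothetical mutated o-mRNA sequence
orthogonal_anti_sd = reverse_complement(orthogonal_sd)

print(pairs(natural_sd, natural_anti_sd))        # natural ribosome, natural mRNA
print(pairs(orthogonal_sd, orthogonal_anti_sd))  # orthogonal ribosome, o-mRNA
print(pairs(orthogonal_sd, natural_anti_sd))     # natural ribosome, o-mRNA
print(pairs(natural_sd, orthogonal_anti_sd))     # orthogonal ribosome, natural mRNA
```

Matched pairs are complementary and mismatched combinations are not, which is the property that keeps the two mRNA pools separate in this simplified picture.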

Ribo-X

In 2007 the group of Jason W. Chin presented an orthogonal ribosome that was optimized for amber codon suppression. [50] The 16S rRNA was mutated in such a way that it bound the release factor RF1 less strongly than the natural ribosome does. This ribosome did not eliminate the problem of lowered cell fitness caused by suppressed stop codons in natural proteins. However, through the improved specificity it raised the yields of correctly synthesized target protein significantly (from 20% to >60% for one amber codon to be suppressed, and from <1% to >20% for two amber codons).

Ribo-Q

In 2010 the group of Jason W. Chin presented a further optimized version of the orthogonal ribosome. The Ribo-Q is a 16S rRNA optimized to recognize tRNAs, which have quadruplet anti-codons to recognize quadruplet codons, instead of the natural triplet codons. [33] With this approach the number of possible codons rises from 64 to 256. Even accounting for a variety of stop codons, more than 200 different amino acids could potentially be encoded this way.
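The arithmetic behind these numbers is short enough to check directly. In the sketch below, the number of quadruplets set aside as stop signals is a hypothetical parameter; the text only says that "a variety" would be reserved.

```python
BASES = "ACGU"


def codon_space(codon_length: int) -> int:
    """Number of distinct codons of the given length over four bases."""
    return len(BASES) ** codon_length


def sense_codons(codon_length: int, reserved_stops: int) -> int:
    """Codons left for amino acids after reserving some as stop signals."""
    return codon_space(codon_length) - reserved_stops


triplet_sense = sense_codons(3, reserved_stops=3)       # standard code: 3 stops
# Hypothetical: even if 50 quadruplets were reserved as stops,
# more than 200 would remain available for amino acids.
quadruplet_sense = sense_codons(4, reserved_stops=50)

print(codon_space(3), codon_space(4))
print(triplet_sense, quadruplet_sense)
```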

Ribosome stapling

The orthogonal ribosomes described above all focus on optimizing the 16S rRNA. Thus far, this optimized 16S rRNA has been combined with natural large subunits to form orthogonal ribosomes. If the 23S rRNA, the main RNA component of the large ribosomal subunit, is to be optimized as well, it has to be ensured that there is no crosstalk in the assembly of orthogonal and natural ribosomes. To ensure that optimized 23S rRNA would only form ribosomes with the optimized 16S rRNA, the two rRNAs were combined into one transcript. [51] When the sequence for the 23S rRNA is inserted into a loop region of the 16S rRNA sequence, both subunits still adopt functioning folds. Since the two rRNAs are linked and thus in constant proximity, they preferentially bind each other rather than other free-floating ribosomal subunits.

Engineered peptidyl transferase center

In 2014 it was shown that by altering the peptidyl transferase center of the 23S rRNA, ribosomes could be created which draw on orthogonal pools of tRNA. [52] The 3’ end of tRNAs is universally conserved to be CCA. The two cytidines base pair with two guanines in the 23S rRNA to bind the tRNA to the ribosome. This interaction is required for translational fidelity. However, by co-mutating the binding nucleotides in such a way that they can still base pair, translational fidelity can be conserved. The 3’ end of the tRNA is mutated from CCA to CGA, while the pairing guanine nucleotides in the ribosome's A- and P-sites are mutated to cytosine. This leads to ribosomes that do not accept naturally occurring tRNAs as substrates and to tRNAs that cannot be used as substrates by natural ribosomes.
To use such tRNAs effectively, they would have to be aminoacylated by specific, orthogonal aaRSs. Most naturally occurring aaRSs recognize the 3’ end of their corresponding tRNA. [53] [54] aaRSs for these 3’-mutated tRNAs are not available yet. Thus far, this system has only been shown to work in an in vitro translation setting where the aminoacylation of the orthogonal tRNA was achieved using so-called “flexizymes”. Flexizymes are ribozymes with tRNA-aminoacylation activity. [55]

With an expanded genetic code, the unnatural amino acid can be genetically directed to any chosen site in the protein of interest. The high efficiency and fidelity of this process allows a better control of the placement of the modification compared to modifying the protein post-translationally, which, in general, will target all amino acids of the same type, such as the thiol group of cysteine and the amino group of lysine. [56] Also, an expanded genetic code allows modifications to be carried out in vivo. The ability to site-specifically direct lab-synthesized chemical moieties into proteins allows many types of studies that would otherwise be extremely difficult, such as:

  • Probing protein structure and function: By using amino acids with slightly different size such as O-methyltyrosine or dansylalanine instead of tyrosine, and by inserting genetically coded reporter moieties (color-changing and/or spin-active) into selected protein sites, chemical information about the protein's structure and function can be measured.
  • Probing the role of post-translational modifications in protein structure and function: By using amino acids that mimic post-translational modifications such as phosphoserine, biologically active protein can be obtained, and the site-specific nature of the amino acid incorporation can lead to information on how the position, density, and distribution of protein phosphorylation affect protein function. [57][58][59][60]
  • Identifying and regulating protein activity: By using photocaged amino acids, protein function can be "switched" on or off by illuminating the organism.
  • Changing the mode of action of a protein: One can start with the gene for a protein that binds a certain sequence of DNA and, by inserting a chemically active amino acid into the binding site, convert it to a protein that cuts the DNA rather than binding it.
  • Improving immunogenicity and overcoming self-tolerance: By replacing strategically chosen tyrosines with p-nitro phenylalanine, a tolerated self-protein can be made immunogenic. [61]
  • Selective destruction of selected cellular components: using an expanded genetic code, unnatural, destructive chemical moieties (sometimes called "chemical warheads") can be incorporated into proteins that target specific cellular components. [62]
  • Producing better protein: The evolution of T7 bacteriophages on a non-evolving E. coli strain that encoded 3-iodotyrosine on the amber codon resulted in a population fitter than wild-type thanks to the presence of iodotyrosine in its proteome. [63]

The expansion of the genetic code is still in its infancy. Current methodology uses only one non-standard amino acid at a time, whereas ideally multiple could be used.

Recoded synthetic genome

One way to achieve the encoding of multiple unnatural amino acids is by synthesising a rewritten genome. [64] In 2010, at a cost of $40 million, an organism, Mycoplasma laboratorium, was constructed that was controlled by a synthetic, but not recoded, genome. [65] In 2019, Escherichia coli Syn61 was created, with a 4 megabase recoded genome that uses only 61 codons instead of the natural 64. [3] [2] In addition to the elimination of the usage of rare codons, the specificity of the system needs to be increased, as many tRNAs recognise several codons. [64]

Expanded genetic alphabet

Another approach is to expand the number of nucleobases to increase the coding capacity.

An unnatural base pair (UBP) is a designed subunit (or nucleobase) of DNA which is created in a laboratory and does not occur in nature. A demonstration of UBPs was achieved in vitro by Ichiro Hirao's group at the RIKEN institute in Japan. In 2002, they developed an unnatural base pair between 2-amino-8-(2-thienyl)purine (s) and pyridine-2-one (y) that functions in vitro in transcription and translation for the site-specific incorporation of non-standard amino acids into proteins. [66] In 2006, they created 7-(2-thienyl)imidazo[4,5-b]pyridine (Ds) and pyrrole-2-carbaldehyde (Pa) as a third base pair for replication and transcription. [67] Afterward, Ds and 4-[3-(6-aminohexanamido)-1-propynyl]-2-nitropyrrole (Px) were discovered to be a high-fidelity pair in PCR amplification. [68] [69] In 2013, they applied the Ds-Px pair to DNA aptamer generation by in vitro selection (SELEX) and demonstrated that the genetic alphabet expansion significantly augments DNA aptamer affinities to target proteins. [70]

In 2012, a group of American scientists led by Floyd Romesberg, a chemical biologist at the Scripps Research Institute in San Diego, California, reported that the team had designed an unnatural base pair (UBP). [71] The two new artificial nucleotides were named "d5SICS" and "dNaM." More technically, these artificial nucleotides, which bear hydrophobic nucleobases, feature two fused aromatic rings that form a (d5SICS–dNaM) complex or base pair in DNA. [72] [73] In 2014 the same team from the Scripps Research Institute reported that they had synthesized a stretch of circular DNA known as a plasmid, containing natural T-A and C-G base pairs along with the best-performing UBP Romesberg's laboratory had designed, and inserted it into cells of the common bacterium E. coli, which successfully replicated the unnatural base pairs through multiple generations. [74] This is the first known example of a living organism passing along an expanded genetic code to subsequent generations. [72] [75] This was in part achieved by the addition of a supportive algal gene that expresses a nucleotide triphosphate transporter which efficiently imports both d5SICSTP and dNaMTP into E. coli bacteria. [72] The natural bacterial replication pathways then use them to accurately replicate the plasmid containing d5SICS–dNaM.

The successful incorporation of a third base pair into a living micro-organism is a significant breakthrough toward the goal of greatly expanding the number of amino acids which can be encoded by DNA, thereby expanding the potential for living organisms to produce novel proteins. [74] The artificial strings of DNA do not encode for anything yet, but scientists speculate they could be designed to manufacture new proteins which could have industrial or pharmaceutical uses. [76]

In May 2014, researchers announced that they had successfully introduced two new artificial nucleotides into bacterial DNA and, by including individual artificial nucleotides in the culture media, were able to passage the bacteria 24 times; however, they did not create mRNA or proteins able to use the artificial nucleotides. [72] [77] [78] [79]

Selective pressure incorporation (SPI) method for production of alloproteins

There have been many studies that have produced proteins with non-standard amino acids without altering the genetic code. These proteins, called alloproteins, are made by incubating cells with an unnatural amino acid in the absence of a similar coded amino acid, so that the former is incorporated into protein in place of the latter, for example L-2-aminohexanoic acid (Ahx) for methionine (Met). [80]

These studies rely on the natural promiscuous activity of the aminoacyl tRNA synthetase to add to its target tRNA an unnatural amino acid (i.e. an analog) similar to the natural substrate, for example methionyl-tRNA synthetase's mistaking isoleucine for methionine. [81] In protein crystallography, for example, the addition of selenomethionine to the media of a culture of a methionine-auxotrophic strain results in proteins containing selenomethionine as opposed to methionine, which is useful for multi-wavelength anomalous dispersion. [82] Another example is that photoleucine and photomethionine are added instead of leucine and methionine to cross-label protein. [83] Similarly, some tellurium-tolerant fungi can incorporate tellurocysteine and telluromethionine into their protein instead of cysteine and methionine. [84] The objective of expanding the genetic code is more radical, as it does not replace an amino acid but adds one or more to the code. On the other hand, proteome-wide replacements are most efficiently performed by global amino acid substitutions. For example, global proteome-wide substitutions of natural amino acids with fluorinated analogs have been attempted in E. coli [85] and B. subtilis. [86] A complete tryptophan substitution with thienopyrrole-alanine in response to 20899 UGG codons in E. coli was reported in 2015 by Budisa and Söll. [87] Moreover, many biological phenomena, such as protein folding and stability, are based on synergistic effects at many positions in the protein sequence. [88]

In this context, the SPI method generates recombinant protein variants or alloproteins directly by substitution of natural amino acids with unnatural counterparts. [89] An amino acid auxotrophic expression host is supplemented with an amino acid analog during target protein expression. [90] This approach avoids the pitfalls of suppression-based methods [91] and is superior to them in terms of efficiency, reproducibility and the simplicity of the experimental setup. [92] Numerous studies have demonstrated how global substitution of canonical amino acids with various isosteric analogs caused minimal structural perturbations but dramatic changes in thermodynamic, [93] folding, [94] aggregation, [95] and spectral properties [96] [97] and in enzymatic activity. [98]

In vitro synthesis

The genetic code expansion described above is in vivo. An alternative is to change the coding in in vitro translation experiments. This requires the depletion of all tRNAs and the selective reintroduction of certain aminoacylated tRNAs, some chemically aminoacylated. [99]

Chemical synthesis

There are several techniques to produce peptides chemically; generally this is done by solid-phase protection chemistry. This means that any (protected) amino acid can be added into the nascent sequence.

In November 2017, a team from the Scripps Research Institute reported having constructed a semi-synthetic E. coli genome using six different nucleotides (versus four found in nature). The two extra 'letters' form a third, unnatural base pair. The resulting organisms were able to thrive and synthesize proteins using "unnatural amino acids". [100] [101] The unnatural base pair used is dNaM–dTPT3. [101] This unnatural base pair has been demonstrated previously, [102] [103] but this is the first report of transcription and translation of proteins using an unnatural base pair.



  • Xenobiology has the potential to reveal fundamental knowledge about biology and the origin of life. In order to better understand the origin of life, it is necessary to know why life evolved seemingly via an early RNA world to the DNA-RNA-protein system and its nearly universal genetic code. [6] Was it an evolutionary "accident" or were there constraints that ruled out other types of chemistries? By testing alternative biochemical "primordial soups", researchers expect to better understand the principles that gave rise to life as we know it.
  • Xenobiology is an approach to develop industrial production systems with novel capabilities by means of enhanced biopolymer engineering and pathogen resistance. The genetic code encodes in all organisms 20 canonical amino acids that are used for protein biosynthesis. In rare cases, special amino acids such as selenocysteine, pyrrolysine or formylmethionine can be incorporated by the translational apparatus into proteins of some organisms. [7] By using additional amino acids from among the over 700 known to biochemistry, the capabilities of proteins may be altered to give rise to more efficient catalytic or material functions. The EC-funded project Metacode, [8] for example, aims to incorporate metathesis (a useful catalytic function so far not known in living organisms) into bacterial cells. Another reason why XB could improve production processes lies in the possibility of reducing the risk of virus or bacteriophage contamination in cultivations: XB cells would no longer be suitable hosts, which would make them more resistant (an approach called semantic containment).
  • Xenobiology offers the option to design a "genetic firewall", a novel biocontainment system, which may help to strengthen and diversify current biocontainment approaches. [2] One concern with traditional genetic engineering and biotechnology is horizontal gene transfer to the environment and possible risks to human health. One major idea in XB is to design alternative genetic codes and biochemistries so that horizontal gene transfer is no longer possible. [9] Additionally, alternative biochemistry allows for new synthetic auxotrophies. The idea is to create an orthogonal biological system that would be incompatible with natural genetic systems. [10]

In xenobiology, the aim is to design and construct biological systems that differ from their natural counterparts on one or more fundamental levels. Ideally these new-to-nature organisms would be different in every possible biochemical aspect exhibiting a very different genetic code. [11] The long-term goal is to construct a cell that would store its genetic information not in DNA but in an alternative informational polymer consisting of xeno nucleic acids (XNA), different base pairs, using non-canonical amino acids and an altered genetic code. So far cells have been constructed that incorporate only one or two of these features.

Xeno nucleic acids (XNA)

Originally this research on alternative forms of DNA was driven by the question of how life evolved on earth and why RNA and DNA were selected by (chemical) evolution over other possible nucleic acid structures. [12] Two hypotheses for the selection of RNA and DNA as life's backbone are that they are favored under the conditions of life on Earth, or that they were coincidentally present in pre-life chemistry and continue to be used now. [13] Systematic experimental studies aiming at the diversification of the chemical structure of nucleic acids have resulted in completely novel informational biopolymers. So far a number of XNAs with new chemical backbones or leaving group of the DNA have been synthesized, [3] [14] [15] [16] e.g. hexose nucleic acid (HNA), threose nucleic acid (TNA), [17] glycol nucleic acid (GNA) and cyclohexenyl nucleic acid (CeNA). [18] The incorporation of XNA in a plasmid, involving 3 HNA codons, was accomplished as early as 2003. [19] This XNA is used in vivo (E. coli) as template for DNA synthesis. This study, using a binary (G/T) genetic cassette and two non-DNA bases (Hx/U), was extended to CeNA, while GNA seems to be too alien at this moment for the natural biological system to be used as template for DNA synthesis. [20] Extended bases using a natural DNA backbone could, likewise, be transliterated into natural DNA, although to a more limited extent. [21]

Aside from being used as extensions to template DNA strands, XNAs have been tested for use as genetic catalysts. Although proteins are the most common components of cellular enzymatic activity, nucleic acids are also used in the cell to catalyze reactions. A 2015 study found that several different kinds of XNA, most notably FANA (2'-fluoroarabino nucleic acids), as well as HNA, CeNA and ANA (arabino nucleic acids), could be used to cleave RNA during post-transcriptional RNA processing, acting as XNA enzymes, hence the name XNAzymes. FANA XNAzymes also showed the ability to ligate DNA, RNA and XNA substrates. [13] Although XNAzyme studies are still preliminary, this study was a step toward finding synthetic circuit components that are more efficient than their DNA- and RNA-containing counterparts and that can regulate DNA, RNA, and their own XNA substrates.

Expanding the genetic alphabet

While XNAs have modified backbones, other experiments target the replacement or enlargement of the genetic alphabet of DNA with unnatural base pairs. For example, DNA has been designed that has – instead of the four standard bases A, T, G, and C – six bases A, T, G, C, and the two new ones P and Z (where Z stands for 6-amino-5-nitro-3-(1'-β-D-2'-deoxyribofuranosyl)-2(1H)-pyridone, and P stands for 2-amino-8-(1'-β-D-2'-deoxyribofuranosyl)imidazo[1,2-a]-1,3,5-triazin-4(8H)-one). [22] [23] [24] In a systematic study, Leconte et al. tested the viability of 60 candidate bases (yielding potentially 3600 base pairs) for possible incorporation in the DNA. [25]

In 2002, Hirao et al. developed an unnatural base pair between 2-amino-8-(2-thienyl)purine (s) and pyridine-2-one (y) that functions in vitro in transcription and translation toward a genetic code for protein synthesis containing a non-standard amino acid. [26] In 2006, they created 7-(2-thienyl)imidazo[4,5-b]pyridine (Ds) and pyrrole-2-carbaldehyde (Pa) as a third base pair for replication and transcription, [27] and afterward, Ds and 4-[3-(6-aminohexanamido)-1-propynyl]-2-nitropyrrole (Px) were found to form a high-fidelity pair in PCR amplification. [28] [29] In 2013, they applied the Ds-Px pair to DNA aptamer generation by in vitro selection (SELEX) and demonstrated that genetic alphabet expansion significantly augments DNA aptamer affinities to target proteins. [30]

In May 2014, researchers announced that they had successfully introduced two new artificial nucleotides into bacterial DNA, alongside the four naturally occurring nucleotides, and by including individual artificial nucleotides in the culture media, were able to passage the bacteria 24 times; they did not create mRNA or proteins able to use the artificial nucleotides. [31] [32] [33]

Novel polymerases

Neither the XNA nor the unnatural bases are recognized by natural polymerases. One of the major challenges is to find or create novel types of polymerases that will be able to replicate these new-to-nature constructs. In one case a modified variant of the HIV reverse transcriptase was found to be able to PCR-amplify an oligonucleotide containing a third type of base pair. [34] [35] Pinheiro et al. (2012) demonstrated that polymerase evolution and design enabled the storage and recovery of genetic information (of less than 100 bp in length) from six alternative genetic polymers based on simple nucleic acid architectures not found in nature, xeno nucleic acids. [36]

Genetic code engineering

One of the goals of xenobiology is to rewrite the genetic code. The most promising approach to change the code is the reassignment of seldom used or even unused codons. [37] In an ideal scenario, the genetic code is expanded by one codon that has been liberated from its old function and fully reassigned to a non-canonical amino acid (ncAA) ("code expansion"). As these methods are laborious to implement, some shortcuts can be applied ("code engineering"), for example in bacteria that are auxotrophic for specific amino acids and at some point in the experiment are fed isostructural analogues instead of the canonical amino acids for which they are auxotrophic. In that situation, the canonical amino acid residues in native proteins are substituted with the ncAAs. Even the insertion of multiple different ncAAs into the same protein is possible. [38] Finally, the repertoire of 20 canonical amino acids can not only be expanded, but also reduced to 19. [39] By reassigning transfer RNA (tRNA)/aminoacyl-tRNA synthetase pairs the codon specificity can be changed. Cells endowed with such aminoacyl-tRNA synthetases are thus able to read mRNA sequences that make no sense to the existing gene expression machinery. [40] Altering the codon:tRNA synthetase pairs may lead to the in vivo incorporation of the non-canonical amino acids into proteins. [41] [42] In the past, reassigning codons was mainly done on a limited scale. In 2013, however, Farren Isaacs and George Church at Harvard University reported the replacement of all 321 TAG stop codons present in the genome of E. coli with synonymous TAA codons, thereby demonstrating that massive substitutions can be combined into higher-order strains without lethal effects. [43] Following the success of this genome-wide codon replacement, the authors continued and achieved the reprogramming of 13 codons throughout the genome, directly affecting 42 essential genes. [44]
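At the sequence level, the TAG-to-TAA replacement amounts to swapping the terminal codon of each affected coding sequence. The following is only a string-level sketch of that idea: the function names and the toy sequences are hypothetical, and it says nothing about how the substitutions were actually introduced into living cells or how overlapping genes were handled.

```python
def recode_terminal_tag(cds):
    """Return (recoded_cds, changed): a terminal TAG stop is swapped for TAA."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    if cds[-3:].upper() == "TAG":
        return cds[:-3] + "TAA", True
    return cds, False


def recode_all(cds_by_gene):
    """Apply the swap to every coding sequence and count how many changed."""
    recoded = {}
    n_changed = 0
    for gene, cds in cds_by_gene.items():
        new_cds, changed = recode_terminal_tag(cds)
        recoded[gene] = new_cds
        n_changed += int(changed)
    return recoded, n_changed


# Toy input (invented sequences): only g1 ends in TAG.
toy_genes = {"g1": "ATGAAATAG", "g2": "ATGTAGTAA"}
recoded_genes, n_recoded = recode_all(toy_genes)
```

Only the final codon is inspected, so a TAG triplet that occurs inside a coding sequence (as in the toy gene g2) is left untouched.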

An even more radical change in the genetic code is the change of a triplet codon to a quadruplet and even pentaplet codon, pioneered by Sisido in cell-free systems [45] and by Schultz in bacteria. [46] Finally, non-natural base pairs can be used to introduce novel amino acids into proteins. [47]

Directed evolution

The goal of substituting DNA by XNA may also be reached by another route, namely by engineering the environment instead of the genetic modules. This approach has been successfully demonstrated by Marlière and Mutzel with the production of an E. coli strain whose DNA is composed of standard A, C and G nucleotides but has the synthetic thymine analogue 5-chlorouracil instead of thymine (T) in the corresponding positions of the sequence. These cells are then dependent on externally supplied 5-chlorouracil for growth, but otherwise they look and behave as normal E. coli. These cells, however, are currently not yet fully auxotrophic for the Xeno-base since they are still growing on thymine when this is supplied to the medium. [48]

Xenobiological systems are designed to convey orthogonality to natural biological systems. A (still hypothetical) organism that uses XNA, [49] different base pairs and polymerases and has an altered genetic code will hardly be able to interact with natural forms of life on the genetic level. Thus, these xenobiological organisms represent a genetic enclave that cannot exchange information with natural cells. [50] Altering the genetic machinery of the cell leads to semantic containment. In analogy to information processing in IT, this safety concept is termed a “genetic firewall”. [2] [51] The concept of the genetic firewall seems to overcome a number of limitations of previous safety systems. [52] [53] The first experimental evidence for the theoretical concept of the genetic firewall was obtained in 2013 with the construction of a genomically recoded organism (GRO). In this GRO all known UAG stop codons in E. coli were replaced by UAA codons, which allowed for the deletion of release factor 1 and reassignment of UAG translation function. The GRO exhibited increased resistance to T7 bacteriophage, thus showing that alternative genetic codes do reduce genetic compatibility. [54] This GRO, however, is still very similar to its natural “parent” and cannot be regarded as a genetic firewall. The possibility of reassigning the function of a large number of triplets opens the perspective of strains that combine XNA, novel base pairs, new genetic codes, etc. and that cannot exchange any information with the natural biological world. Regardless of changes leading to a semantic containment mechanism in new organisms, any novel biochemical system still has to undergo toxicological screening. XNA, novel proteins, etc. might represent novel toxins, or have an allergic potential that needs to be assessed. [55] [56]

Xenobiology might challenge the regulatory framework, as currently laws and directives deal with genetically modified organisms and do not directly mention chemically or genomically modified organisms. Taking into account that real xenobiology organisms are not expected in the next few years, policy makers do have some time at hand to prepare themselves for an upcoming governance challenge. Since 2012, the following groups have picked up the topic as a developing governance issue: policy advisers in the US, [57] four National Biosafety Boards in Europe, [58] the European Molecular Biology Organisation, [59] and the European Commission's Scientific Committee on Emerging and Newly Identified Health Risks (SCENIHR) in three opinions (definition, [60] risk assessment methodologies and safety aspects, [61] and risks to the environment and biodiversity related to synthetic biology and research priorities in the field of synthetic biology [62]).


Newly discovered genetic code controls bacterial survival during infections

Four nucleotides, abbreviated A, C, G, and T, spell out DNA sequences that code for all of the proteins cells need. MIT researchers have now discovered another layer of control, mediated by transfer RNA, that helps cells to rapidly divert resources in emergency situations. Credit: Arron Teo

The genetic code that allows cells to store the information necessary for life is well-known. Four nucleotides, abbreviated A, C, G, and T, spell out DNA sequences that code for all of the proteins cells need.

MIT researchers have now discovered another layer of control that helps cells to rapidly divert resources in emergency situations. Many bacteria, including strains that cause tuberculosis, use this strategy to enter a dormancy-like state that allows them to survive in hostile environments when deprived of oxygen or nutrients. For tuberculosis, lung infections can last for years, before eventually "re-awakening" and causing disease again.

"What this study does is reveal a system that the bacteria use to shut themselves down and enter one of these persistent states when they get stressed," says Peter Dedon, the Underwood-Prescott Professor of Biological Engineering at MIT.

Dedon and his colleagues studied a type of bacteria known as Mycobacterium bovis, one of several bacterial strains that can cause tuberculosis in humans. This strain causes a milder version of the disease than the more lethal Mycobacterium tuberculosis and is used in some countries to vaccinate against tuberculosis.

Targeting this newly identified genetic control system could help scientists develop new antibiotics against TB and other diseases, says Dedon, who is the senior author of a paper describing the findings in the Nov. 11 issue of Nature Communications. Yok Hian Chionh, a postdoc at the Singapore-MIT Alliance for Research and Technology (SMART), is the paper's lead author.

Dedon and colleagues have previously shown that stresses such as radiation or toxic chemicals provoke yeast cells to turn on a system that makes chemical modifications to transfer RNA (tRNA), which diverts the cells' protein-building machinery away from routine activities to emergency action.

In the new study, the researchers delved into how this switch influences the interactions between tRNA and messenger RNA (mRNA), which carries instructions for protein building from the nucleus to cell structures called ribosomes. The genetic code in mRNA is "read" on the ribosome as a series of three-letter sequences known as codons, each of which calls for a specific amino acid (the building blocks of proteins).

Those amino acids are delivered to the ribosome by tRNA. Like other types of RNA, tRNA consists of a sequence of four main ribonucleosides—A, G, C, and U. (U in RNA substitutes for the T found in DNA.) Each tRNA molecule has an anticodon that matches an mRNA codon, ensuring that the correct amino acid is inserted into the protein sequence. However, many amino acids can be encoded by more than one codon. For example, the amino acid threonine can be encoded by ACU, ACC, ACA, or ACG. In total, the genetic code has 61 codons that correspond to only 20 amino acids.
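The codon counts quoted above can be checked with a short sketch that builds the standard genetic code table. The variable and function names are illustrative only; the 64-letter string is the standard code written in the conventional U, C, A, G ordering, with `*` marking stop codons.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, third base varying fastest; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return the sorted list of codons that encode a one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = set(CODON_TABLE.values()) - {"*"}
threonine_codons = codons_for("T")
```

Here `threonine_codons` holds the four threonine codons named in the text, `sense_codons` has 61 entries and `amino_acids` has 20.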

Once a tRNA molecule is manufactured, it is altered with dozens of different chemical modifications. These modifications are believed to influence how tightly the tRNA anticodon binds to the mRNA codon at the ribosome.

In this study, Dedon and colleagues found that certain tRNA modifications went up dramatically when the bacteria were deprived of oxygen and stopped growing.

One of these modifications was found on the ACG threonine anticodon, so the researchers analyzed the entire genome of Mycobacterium bovis in search of genes that contain high percentages of that ACG codon compared to the other threonine codons. They found that genes with high levels of ACG included a family known as the DosR regulon, which consists of 48 genes that are needed for cells to stop growing and survive in a dormancy-like state.
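A genome-wide search of this kind can be approximated by computing, for each gene, the share of ACG among its threonine codons and ranking genes by that share. The sketch below is a minimal illustration under that assumption; it is not the researchers' pipeline, and the function names and toy sequences are invented.

```python
from collections import Counter

# Threonine codons in the DNA alphabet (ACU in RNA corresponds to ACT).
THREONINE_CODONS = ("ACT", "ACC", "ACA", "ACG")


def threonine_codon_fractions(cds):
    """Fraction of each threonine codon among all threonine codons in a CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in THREONINE_CODONS)
    total = sum(counts.values())
    if total == 0:
        return {}  # gene has no threonine codons at all
    return {c: counts[c] / total for c in THREONINE_CODONS}


def rank_genes_by_acg(cds_by_gene):
    """Sort genes by the share of ACG among their threonine codons, highest first."""
    scored = []
    for gene, cds in cds_by_gene.items():
        fractions = threonine_codon_fractions(cds)
        if fractions:
            scored.append((gene, fractions["ACG"]))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy input (invented sequences, read in frame from the first base).
toy_genes = {
    "geneA": "ATGACGACGACCTAA",
    "geneB": "ATGACTACAACCACGTAA",
    "geneC": "ATGAAATAA",
}
ranking = rank_genes_by_acg(toy_genes)
```

A real analysis would also need to account for gene length and for the background codon usage of the genome, which this sketch ignores.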

When oxygen is lacking, these bacterial cells begin churning out large quantities of the DosR regulon proteins, while production of proteins from genes containing one of the other codons for threonine drops. The DosR regulon proteins guide the cell into a dormancy-like state by shutting down cell metabolism and halting cell division.

"The authors present an impressive example of the new, emerging deep biology of transfer RNAs, which translate the genetic code in all living organisms to create proteins," says Paul Schimmel, a professor of cell and molecular biology at the Scripps Research Institute, who was not involved in the research. "This long-known function was viewed in a simple, straightforward way for decades. They present a powerful, comprehensive analysis to show there are layers and layers, ever deeper, to this function of translation."

"Alternative genetic code"

The researchers also showed that when they swapped different threonine codons into the genomic locations where ACG is usually found, the bacterial cells failed to enter a dormant state when oxygen levels were diminished. Because making this tRNA modification switch is critical to bacterial cells' ability to respond to stress, the enzymes responsible for this switch could make good targets for new antibiotics, Dedon says.

Dedon suspects that other families of genes, such as those required to respond to starvation or to develop drug resistance, may be regulated in a similar way by other tRNA modifications.

"It is really an alternative genetic code, in which any gene family that is required to change a cell phenotype is enriched with specific codons" that correspond to specific modified tRNAs, he says.

The researchers have also seen this phenomenon in other species, including the parasite that causes malaria, and they are now studying it in humans.


Results and Discussion

A new lineage of insect-dwelling Rhizaria

As reported in Záhonová et al. [18], the non-canonical code of Blastocrithidia spp. (Trypanosomatida) was discovered accidentally by one of us noticing that a transcriptome shotgun assembly (TSA) from the heteropteran Lygus hesperus [25] is contaminated by trypanosomatid sequences including in-frame termination codons. Surprisingly enough, looking for contigs in the TSA data from L. hesperus exhibiting in-frame termination codons revealed not only those with obvious trypanosomatid affinities, but also some that could not be readily assigned to any particular eukaryotic taxon when compared by blastx against the NCBI non-redundant protein sequence database. Therefore, we probed the TSA data from L. hesperus with a set of conserved proteins employed in previous phylogenomic studies of the global eukaryote phylogeny (see Methods for details) and detected 71 orthologs that clustered neither with metazoans nor with trypanosomatids in individual single-protein trees (Additional file 1: Table S1A-C). Most of them contained one or more in-frame UAG codons; 62 of these sequences branched with bootstrap support (BS) > 50% exclusively (in a clan sensu [26]) with homologs from Rhizaria, most often (46 sequences, 23 of which with BS at least 90%) as a sister group of the aggregative amoeba Guttulinopsis vulgaris. The remaining nine sequences clustered with homologs from other groups, but only in two cases such a grouping was supported by BS > 50%, and in both cases the non-rhizarian homologs were nested in a broader clan comprising primarily rhizarian sequences (Additional file 1: Table S1C). Despite inconclusive or contradicting phylogenetic evidence for rhizarian affinity of the nine sequences, they all likely come from the same organism as those with clear rhizarian affinity, as evidenced by the presence of at least one in-frame UAG codon in all of them. In addition, many of the genes in the reference set that we used for searching the TSA from L. hesperus (e.g., genes for ribosomal proteins) are highly expressed, so contamination of the TSA by more than one species would manifest by the presence of more than one ortholog of these genes. Putting aside the previously detected trypanosomatid contamination [18], no other eukaryotic orthologs were observed, nor were any additional 18S rRNA sequences. Therefore, we assumed that the TSA from L. hesperus was contaminated by two different protist species – Blastocrithidia sp. (see [18]) and an unidentified rhizarian, “Rhizaria gen. sp. ex Lygus hesperus”, hereafter for simplicity referred to as the rhizarian exLh.

To further illuminate the identity of the rhizarian exLh, we carried out a maximum likelihood (ML) phylogenetic analysis of a 70-protein supermatrix containing 54 orthologs from this organism (a subset of the 71 genes mentioned above, passing an initially imposed threshold of minimal sequence identity to orthologs from other eukaryotes). The resulting tree showed it branching with maximum support within Rhizaria, specifically within Filosa as a sister lineage to G. vulgaris (Additional file 2: Figure S1). However, only very few representatives of Filosa could be included in the phylogenomic analysis due to an extremely low number of sequenced genomes or transcriptomes of this diverse group. Therefore, we sought to determine the phylogenetic position of the rhizarian exLh within Filosa using the 18S ribosomal RNA (rRNA) gene, the most broadly sampled phylogenetic marker for rhizarian phylogeny. Searching the L. hesperus TSA sequences revealed two contigs that proved to be chimeric sequences consisting of artificially merged parts of different origin, including a 3’ segment of an 18S rRNA dissimilar to any 18S rRNA sequence in the GenBank database (and very different from the L. hesperus 18S rRNA sequence present in the TSA as another contig; see Methods for details). Using the partial 18S rRNA sequence as a seed and original RNA-seq reads we assembled a complete 18S rRNA sequence that fell phylogenetically into Filosa, specifically into the group Sainouroidea (Fig. 1a). This clade comprises several poorly studied free-living or coprophilic flagellates and amoebae, including G. vulgaris [27]. The result of the 18S rRNA analysis is thus concordant with the phylogenomic analysis of protein sequences, supporting the assumption that the assembled 18S rRNA sequence comes from the same organism as the protein-coding transcripts. Furthermore, it specifically indicates that the rhizarian exLh is a previously undetected lineage of Sainouroidea, presumably a separate genus.

Phylogenetic position of the organisms studied. a Phylogeny of eukaryotes including the rhizarian exLh based on 18S rDNA sequences. The maximum likelihood (ML) tree was inferred with RAxML using the GTRGAMMAI substitution model. The values at branches represent RAxML BS values followed by PhyloBayes posterior probabilities (GTRCAT model). b Phylogeny of Fornicata including I. spirale based on a concatenated data set of 18S rDNA and EF-1α, EF2, HSP70, and HSP90 protein sequences. The ML tree was inferred with RAxML using the substitution models GTRGAMMA (for 18S rDNA) and PROTGAMMALG4X (for the protein sequences). The values at branches represent RAxML BS values followed by PhyloBayes posterior probabilities (CAT Poisson model). Maximal support (100/1) is indicated with black dots. Asterisks indicate support values lower than 50% or 0.5, respectively, dashes mark branches in the ML tree that are absent from the PhyloBayes tree

The presence of sequences from a rhizarian species in the L. hesperus TSA may be explained by an accidental contamination or an error in data handling in the sequencing center. Alternatively, it may reflect a specific physical connection between L. hesperus and the rhizarian, most likely with the latter being a symbiont of the former. To distinguish between these possibilities, we obtained several individuals of L. hesperus captured in the wild (Kern County, CA, USA), isolated DNA from whole individuals, and carried out a PCR reaction using specific primers designed on the basis of the assembled 18S rRNA sequence of the rhizarian exLh. One of the reactions (with a template consisting of DNA pooled from five L. hesperus individuals) yielded a product of the expected size of 1200 bp, and sequencing revealed that, except for two substitutions, it was identical to the respective region of the 18S rRNA sequence obtained from the published transcriptomic data from L. hesperus. This result suggests that the rhizarian exLh is a natural inhabitant, perhaps an endobiont, of L. hesperus. The nature of this association (mutualism, commensalism, or parasitism) and the host range of the rhizarian exLh remain to be investigated.

The rhizarian exLh employs UAG to encode leucine

We next investigated the identity and significance of the termination codons interrupting the coding sequences of the rhizarian exLh. In total, we identified 384 instances of in-frame termination codons in 71 genes (transcripts); the codon was UAG in all cases. Based on a comparison with orthologs from other eukaryotes, we identified the bona fide termination codons of the respective transcripts (i.e., the 3’-ends of the respective coding sequences), which were exclusively represented by UAA (56 cases) or UGA (15 cases) (Additional file 1: Table S1A–C). This suggested that the UAG codon has been reassigned as a sense codon in the rhizarian exLh and does not signal translation termination anymore. Visual inspection of multiple protein sequence alignments revealed a conspicuous pattern in the distribution of in-frame UAG codons – a strong tendency to occur at positions occupied by conserved leucine residues (for an example see the alignment of Bat1 protein sequences, Fig. 2a). We hypothesized that UAG encodes leucine in the novel rhizarian, which was further formally tested by two approaches.
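The bookkeeping behind such counts, internal stop codons versus the terminal one, can be illustrated with a few lines of code. This is a hypothetical sketch rather than the procedure used in the study: it assumes the coding sequence is already in frame and ends at its true termination codon, and the example sequence is invented.

```python
from collections import Counter

STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_codons(cds):
    """Split an in-frame coding sequence (DNA or RNA letters) into RNA codons."""
    seq = cds.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def in_frame_stops(cds):
    """Return (codon_index, codon) for every stop codon before the last codon."""
    codons = split_codons(cds)
    return [(i, c) for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]


def stop_usage(cds):
    """Count internal stop codons by type and report the terminal stop, if any."""
    codons = split_codons(cds)
    internal = Counter(c for _, c in in_frame_stops(cds))
    terminal = codons[-1] if codons and codons[-1] in STOP_CODONS else None
    return internal, terminal


# Invented example: two internal UAG codons and a terminal UAA.
example = "AUGUAGCUGUAGGCAUAA"
internal_counts, terminal_stop = stop_usage(example)
```

A transcript set in which every internal stop is UAG while the terminal stops are only UAA or UGA would show the pattern described above.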

In-frame UAG codons in protein-coding genes of the rhizarian exLh and I. spirale. a An example of a rhizarian exLh gene with several in-frame UAG codons: multiple sequence alignment of orthologs of the Bat1 protein (spliceosome RNA helicase). b Relative frequency of hyperconserved positions (at least 90% amino acid identity across orthologs from 250 representatives of main eukaryotic groups in the alignment) corresponding to UAG-containing sites in the rhizarian exLh transcripts. c An example of an I. spirale gene with several in-frame UAG codons: multiple sequence alignment of orthologs of the Polr2a protein (also known as RNA polymerase II subunit RPB1). d Relative frequency of hyperconserved positions (at least 90% amino acid identity across orthologs from 54 representatives of main eukaryotic groups in the alignment) corresponding to UAG-containing sites in the rhizarian exLh transcripts. e, f Dominant amino acid identity at conserved alignment positions (defined using 90% and 50% threshold) in a broad-scale comparison of I. spirale sequences with eukaryotic homologs. e Positions corresponding to in-frame UAG codons in I. spirale sequences. f Positions corresponding to canonical glutamine codons (CAG, CAA) in I. spirale sequences. In Fig. 2a and c, only selected segments of the full alignments (separated by double slashes) are shown for simplicity. Asterisks indicate positions with in-frame UAG codons in the underlying coding sequences. In Fig. 2b and d, the hyperconserved positions are sorted according to the respective hyperconserved amino acid residue (only four most frequent position classes are shown). Source tables for Fig. 2b and d including data from read mapping are available in Additional file 1: Table S1D and S2C
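The 90% identity criterion for hyperconserved positions can be expressed as a simple column-wise test on an alignment. The sketch below assumes the alignment is given as a list of equal-length strings and that identity is measured against all sequences in the alignment. Both are assumptions made for illustration, and the function name and toy alignment are invented.

```python
from collections import Counter


def hyperconserved_columns(alignment, threshold=0.9, ignore="-X"):
    """Map column index -> residue for columns where one residue reaches threshold.

    Gaps and undetermined residues (characters in `ignore`) never count as the
    conserved residue, but they still count toward the number of sequences.
    """
    n_sequences = len(alignment)
    conserved = {}
    for column in range(len(alignment[0])):
        counts = Counter(
            seq[column] for seq in alignment if seq[column] not in ignore
        )
        if not counts:
            continue
        residue, count = counts.most_common(1)[0]
        if count / n_sequences >= threshold:
            conserved[column] = residue
    return conserved


# Toy alignment of ten sequences: column 1 is leucine in nine of them.
toy_alignment = ["MLA"] * 5 + ["MLG"] * 4 + ["MIG"]
conserved_positions = hyperconserved_columns(toy_alignment)
```

In an analysis like the one described, the query organism's UAG positions would be represented as X and then looked up in the resulting map to see which residue dominates in the other sequences.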

For the first analysis, we inspected the alignments of orthologous protein sequences used for the phylogenomic analysis described above and identified 192 positions in well conserved blocks that corresponded to an in-frame UAG in the sequences from the rhizarian exLh. Of those positions, 86 were classified as hyperconserved, i.e., with at least 90% amino acid identity across all the sequences in the alignment; 95% (i.e., 82) of these corresponded to a hyperconserved leucine (Fig. 2b). To ensure that the presence of the in-frame UAG codons at these hyperconserved sites was supported by original sequencing data, we inspected raw RNA-seq reads mapped onto the respective contigs. No conflicting signal concerning the identity of the nucleotides corresponding to any of these UAG codons was apparent (each in-frame UAG was supported by more than one read, with the read variability lower than 4.50%) (Additional file 1: Table S1D). As an alternative to the hyperconserved position-based inference of the UAG codon meaning, we devised a phylogeny-informed ML-based method that unselectively considers all UAG positions (see Methods for details). Briefly, we first inferred an organismal phylogeny using a smaller dataset of eight conserved proteins (to save computation time), with the respective genes from the rhizarian exLh containing 71 in-frame UAG codons, represented as an undetermined amino acid (X) in the alignment. We then prepared 20 modifications of the dataset, each with a different amino acid considered at positions corresponding to the in-frame UAG codons in the genes from the rhizarian exLh. Then, we calculated the best ML tree for each of the 20 datasets, using the same substitution model and the tree from the initial dataset as a constraint. The dataset where UAG was translated as leucine showed the highest likelihood score, and the conditional probability that UAG encodes leucine in the rhizarian exLh (calculated conditional upon UAG encoding one amino acid at all positions) is virtually 1.0, whereas it is negligible for any other amino acid (Table 1).
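The step from per-dataset likelihoods to a conditional probability can be sketched as follows. This is only an illustration, not the procedure from the Methods: it assumes that every candidate amino acid is equally probable a priori, so that each probability is simply the likelihood of that dataset divided by the sum over all datasets. The log-likelihood values are invented for the example, and only four candidates are shown instead of 20.

```python
import math

def conditional_probabilities(log_likelihoods):
    """Turn log-likelihoods (one per candidate amino acid) into probabilities.

    Assumes equal prior weight for every candidate (an assumption of this
    sketch). The maximum is subtracted before exponentiating so that very
    negative log-likelihoods do not underflow to zero.
    """
    max_lnl = max(log_likelihoods.values())
    weights = {aa: math.exp(lnl - max_lnl) for aa, lnl in log_likelihoods.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}

# Hypothetical log-likelihoods, not values from Table 1.
example_lnl = {"L": -50000.0, "I": -50040.0, "M": -50055.0, "Q": -50120.0}
probs = conditional_probabilities(example_lnl)
best = max(probs, key=probs.get)
print(best, probs[best])
```

Even a difference of a few tens of log-likelihood units pushes the best candidate to a probability that is indistinguishable from 1, which is why the reported value is "virtually 1.0".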

Based on these results, we conclude that UAG is the seventh codon for leucine in the rhizarian exLh and does not serve as a stop codon in this organism. This is not without precedent, as the stop-to-leucine UAG reassignment was previously reported from mitochondrial genomes of several green algae of the order Sphaeropleales [22, 28] and of the chytrid fungus Spizellomyces punctatus and its relatives [21]. Interestingly, we noticed striking differences in the UAG codon abundance between certain groups of genes from the rhizarian exLh. In-frame UAG codons were overrepresented in genes encoding components of the 26S proteasome, where the UAG codon was the most abundant codon for leucine (Fig. 3a). In contrast, UAG was the rarest leucine codon in genes for ribosomal proteins. In total, we identified sequences corresponding to 28 ribosomal protein genes of the rhizarian exLh, but 50% of those sequences did not contain any UAG codon, although they all branched with homologs from Rhizaria, most often (11/14) sisters to those from G. vulgaris. Genes for ribosomal proteins are highly expressed and typically exhibit a strong codon usage bias facilitating efficient synthesis of ribosomal proteins [29]. The low abundance of the UAG codon in ribosomal protein genes in the rhizarian exLh thus suggests that this codon is not as efficiently translated as the six standard codons for leucine.

Relative codon frequencies in the rhizarian exLh and I. spirale. a Relative codon frequencies in two different groups of genes (for ribosomal proteins and for subunits of the 26S proteasome listed in Additional file 1: Tables S1A and S1B) in the rhizarian exLh. b Relative codon frequencies in a reference set of genes of I. spirale (listed in Additional file 3: Tables S2A and S2B). The relative codon frequencies are calculated as the percentage of the codon among all occurrences of codons with the same meaning (i.e., coding for the same amino acid or terminating translation)
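The relative codon frequency defined in this caption (the share of a codon among all codons with the same meaning) can be computed with a short sketch like the one below. The function names and the toy coding sequence are hypothetical; the codon table is the standard code with UAG switched to leucine, as inferred for the rhizarian exLh.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code in the usual TCAG ordering ('*' marks a stop codon).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def build_code(reassignments=None):
    """Return a codon -> meaning dict, optionally with reassigned codons."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    code = dict(zip(codons, STANDARD_AAS))
    code.update(reassignments or {})
    return code

def relative_codon_frequencies(cds_list, code):
    """Percentage of each codon among all codons with the same meaning."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in code:  # skip codons with ambiguous bases
                counts[codon] += 1
    totals = Counter()
    for codon, n in counts.items():
        totals[code[codon]] += n
    return {codon: 100.0 * n / totals[code[codon]] for codon, n in counts.items()}

exlh_like_code = build_code({"TAG": "L"})  # UAG read as leucine
toy_cds = ["ATGCTGTAGTTATAGTAA"]           # made-up sequence for illustration
freqs = relative_codon_frequencies(toy_cds, exlh_like_code)
print(freqs)
```

In the toy sequence, two of the four leucine codons are TAG, so its relative frequency is 50%. Comparing such values between gene groups (for example ribosomal proteins versus proteasome subunits) is the comparison shown in Fig. 3a.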

I. spirale (Fornicata) uses UAG to encode glutamine

The second case of a novel non-canonical genetic code was unexpectedly encountered when we sequenced the transcriptome of I. spirale. It is a recently described anaerobic flagellate isolated from fresh feces of a gecko Phelsuma grandis and is considered an intestinal endobiont [30]. Phylogenetic analyses based on a fragment (1000 bp) of the 18S rRNA gene sequence affiliated I. spirale to the previously defined “Carpediemonas-like lineage 3” (CL3 [31]) in Fornicata (one of the three principal groups of Metamonada), although its morphology is highly unusual for a fornicate [30]. To corroborate this initial insight, we used the data from the newly sequenced transcriptome of I. spirale and carried out two multi-locus phylogenetic analyses. A 70-protein phylogenomic analysis (the same as used above for establishing the phylogenetic position of the rhizarian exLh) with the dataset including 60 orthologs from I. spirale showed this organism as a branch sister to, yet deeply diverged from, representatives of Diplomonadida (Additional file 2: Figure S1). This is consistent with the previous result based on the partial 18S rRNA gene sequence and with the fact that no other non-diplomonad fornicate could be included in the analysis due to lack of genome-scale sequence data. The second analysis was based on a dataset comprising a complete 18S rRNA sequence, which we identified as one of the assembled transcript contigs, and sequences of four conserved proteins used in a previous detailed study of the phylogeny of the Fornicata [32]. This analysis placed I. spirale with maximum support as the sister lineage of the CL3 representative Hicanonectes teleskopos (Fig. 1b), again congruently with the previously reported result. Thus, it is now robustly established that I. spirale is an unusual fornicate. In addition, it is a lineage well separated from Hexamitinae, a subgroup of diplomonads, which is a conclusion important for the interpretation of the evolution of the genetic code in fornicates (see below).

While analyzing the assembled transcript sequences from I. spirale, we noticed occasional in-frame UAG codons (for an example, see the alignment of Polr2a protein sequences, Fig. 2c). In total, we scrutinized sequences representing 126 genes, which contained 204 in-frame UAG codons (Additional file 3: Tables S2A and S2B). UAG was not used as an apparent bona fide termination codon in any of these transcripts, as 3’-termini of coding sequences predicted on the basis of conservation with orthologous sequences were marked only with UGA (in 15 cases) and, significantly, UAA (in 104 cases) (the remaining seven transcripts were truncated). This suggested that UAG, but not UAA, has been reassigned as an amino acid-encoding codon in I. spirale. We used similar approaches as employed for analyzing the genetic code of the rhizarian exLh to determine the identity of this UAG-encoded amino acid (see Methods for details). First, using a smaller subset of genes sampled broadly to include representatives from most major eukaryotic lineages, we identified 28 hyperconserved positions with an in-frame UAG in I. spirale (in all cases confirmed by inspection of raw reads mapped onto the respective transcript), 22 of which (i.e., 79%) corresponded to a hyperconserved glutamine (Fig. 2d and Additional file 3: Table S2C). In the second analysis, a concatenated protein sequence alignment considering glutamine in place of in-frame UAG codons in I. spirale sequences gave the highest likelihood among all 20 possible variants (considering 20 different amino acids) when tested against a precomputed species tree, and the conditional probability that UAG encodes glutamine in I. spirale was nearly 1.0, with negligible probabilities obtained for any other amino acid (Table 1).

To further test that UAG is the only termination codon reassigned in I. spirale and that its meaning is to encode glutamine, we carried out a broader analysis of the available transcriptomic data (see Methods for details). In total, we analyzed 730 contigs assigned to I. spirale (i.e., apparently not coming from the prokaryotes contaminating the culture) and exhibiting close homologs in other eukaryotes; 348 of them contained at least one UAG codon within the region aligned to the homologs. Considering positions with 90% amino acid identity across the alignment with the 50 best blastp hits, we identified 231 positions that corresponded to a UAG codon in the I. spirale sequence, 95 (41.13%) of which were dominated by glutamine (and no other amino acid reached such a frequency at these positions; Fig. 2e). Although this value may seem low and ambiguous, a similar proportion of conserved alignment positions dominated by glutamine (44.08%) was observed for positions occupied by canonical glutamine codons in I. spirale sequences (Fig. 2f). Imposing a less stringent threshold for amino acid conservation across the alignment (50%) yielded 634 positions with a UAG codon in the I. spirale sequence, 147 (23.19%) of which were dominated by glutamine. For canonical glutamine codons in I. spirale sequences the proportion was even lower (19.52%). In addition, none of the 730 examined transcripts included the UAG codon as an obvious termination codon marking the end of the coding sequence.
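The tallies above rest on a simple per-column test: is the alignment column conserved at the chosen threshold, and if so, which residue dominates it? The following sketch illustrates that test. The function names and the toy alignment are hypothetical, and the choice to ignore gaps and undetermined residues and to score only the homolog rows is an assumption of this sketch, not a detail taken from the Methods.

```python
from collections import Counter

def dominant_residue(column, threshold):
    """Return the most frequent residue if it reaches the threshold, else None.

    Gaps ('-') and undetermined residues ('X') are ignored (an assumption).
    """
    residues = [r for r in column if r not in "-X"]
    if not residues:
        return None
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(residues) >= threshold else None

def tally_dominant(homolog_rows, query_positions, threshold=0.9):
    """Count dominant residues at the columns where the query has a UAG codon."""
    tally = Counter()
    for pos in query_positions:
        residue = dominant_residue([row[pos] for row in homolog_rows], threshold)
        if residue is not None:
            tally[residue] += 1
    return tally

# Toy aligned homologs; columns 1 and 3 stand for UAG sites in the query.
homologs = ["MQLQK", "MQLQR", "MQIQK", "MQLEK"]
uag_columns = [1, 3]
strict = tally_dominant(homologs, uag_columns, threshold=0.9)
relaxed = tally_dominant(homologs, uag_columns, threshold=0.5)
print(strict, relaxed)
```

With the strict threshold only the fully conserved column counts; relaxing the threshold to 50% admits the second column as well. This mirrors how the number of qualifying positions grows from 231 to 634 in the analysis above when the threshold is lowered.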

All these results indicate that UAG in I. spirale encodes glutamine and does not serve as a stop codon. In contrast, our procedure identified only two contigs (a2783c02 and a9294c03) with candidate in-frame UAA or UGA codons, but manual scrutiny revealed that these codons are located in regions representing obvious retained introns. Thus, there is no evidence for reassignment or context-dependent dual meaning of the UAA or UGA codons in I. spirale and both apparently serve solely as standard termination codons, with UAA being the predominant termination codon in I. spirale (Fig. 3b). In the past, such a genetic code variant (UAG = Q, UAA = stop) was attributed to the ciliate genus Blepharisma by some sources, see, e.g., the code 15 (“Blepharisma Macronuclear”) in the list of genetic codes at an ftp page of NCBI [33] or a recent textbook on molecular and genome evolution [34]. However, this was an apparent error, as it is beyond any doubt that Blepharisma spp. use both UAA and UAG as stop codons, while they have reassigned UGA as a tryptophan codon [9, 16]. Indeed, the most recent list of genetic code tables provided on another NCBI page [35] omits the code 15. Thus, the code we here document for I. spirale is, to the best of our knowledge, truly unprecedented.
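To illustrate what such a code variant means for annotation, the sketch below translates a made-up coding sequence with the standard code and with a variant in which UAG encodes glutamine while UAA and UGA remain stops. The helper names and the toy sequence are hypothetical and are not taken from any annotation tool.

```python
BASES = "TCAG"
# Standard genetic code in the usual TCAG ordering ('*' marks a stop codon).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def make_code(reassignments=None):
    """Return a codon -> amino acid dict, optionally with reassigned codons."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    code = dict(zip(codons, STANDARD_AAS))
    code.update(reassignments or {})
    return code

def translate(cds, code):
    """Translate a coding sequence codon by codon (expects only A, C, G, T/U)."""
    cds = cds.upper().replace("U", "T")
    return "".join(code[cds[i:i + 3]] for i in range(0, len(cds) - len(cds) % 3, 3))

standard = make_code()
iotanema_like = make_code({"TAG": "Q"})  # UAG = Q, UAA and UGA still stops

toy_cds = "ATGTAGGCTTAA"  # invented sequence with one in-frame TAG
print(translate(toy_cds, standard))       # in-frame stop under the standard code
print(translate(toy_cds, iotanema_like))  # read-through as glutamine
```

Under the standard table the in-frame UAG looks like a premature stop, which is the fragmentation signal that should alert an annotator to a possible non-canonical code.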

Evolutionary origin of the non-standard genetic codes in the rhizarian exLh and I. spirale

Rhizaria is an extremely diverse eukaryotic grouping, but our knowledge of even the general biology, let alone molecular details, of most of rhizarian groups is lamentable. The discovery of a new rhizarian lineage exhibiting a peculiar feature of their gene expression machinery is thus not so unexpected. Our phylogenetic analyses place the new rhizarian with the stop-to-leucine reassignment of the UAG codon into the group Sainouroidea (see above). We should, therefore, ask how widespread this feature is in Sainouroidea or possibly a broader rhizarian clade. The closest relative of the rhizarian exLh, for which a substantial amount of sequences of protein coding genes (transcripts) are available, is G. vulgaris. Brown et al. [36] sequenced its transcriptome using the 454 method, but did not mention any observation concerning the genetic code employed by this organism. Therefore, we analyzed the 147 transcript sequences from G. vulgaris that have been released by Brown et al. [36] to the GenBank database (JT844885–JT845030). Only one of these sequences, JT844913, includes obvious in-frame termination codons (specifically a spot with UAG and UGA separated by a glutamine codon; see Additional file 4: Table S3), but checking unpublished Illumina RNA-Seq data from G. vulgaris revealed that the occurrence of the two termination codons in JT844913 is a sequencing error (Dr. M. Brown, personal communication). Interestingly, all but one available G. vulgaris transcript sequences that cover the 3’-end of the coding sequence use UAA as the termination codon, with the sole exception putatively employing UAG (Additional file 4: Table S3). Thus, G. vulgaris most likely does not share the stop-to-leucine UAG reassignment with the rhizarian exLh, although the prevalence of UAG as a termination codon needs to be investigated using more complete sequence data. Little data is available on nuclear protein-coding genes of other sainouroids, specifically a single sequence for each of Rosculus sp. (DQ388527.1), Helkesimastix (AY748812.1), and Sainouron acronematica (DQ098274.1). None of these sequences exhibits in-frame termination codons, and the gene from Rosculus sp. uses UAG as the bona fide termination codon, whereas the remaining two sequences have truncated 3’-ends. These sequences are, therefore, consistent with the notion that the genetic code has changed specifically in the rhizarian exLh lineage, but a systematic exploration of sainouroid transcriptomes or genomes is needed to pinpoint this evolutionary event with a higher confidence.

While our study uncovers the first case of a non-canonical nuclear code for the whole Rhizaria, the departure from the standard code reported here from I. spirale is not unprecedented in the Fornicata. Specifically, hexamitin diplomonads (Hexamitinae), for example members of the genera Spironucleus, Trimitus or Trepomonas, also encode glutamine by non-standard codons [37, 38]. However, in contrast to I. spirale, all hexamitin taxa investigated have reassigned both UAG and UAA as glutamine codons. The non-hexamitin diplomonads of the genus Giardia are known to employ the standard genetic code, and our inspection of the limited number of protein-coding gene sequences available from the various “Carpediemonas-like” fornicates [32] did not reveal any case of an in-frame UAG (or UAA) codon that would presumably encode glutamine. Hexamitins and the I. spirale lineage thus apparently modified their genetic codes independently of each other (Fig. 1b). The closest I. spirale relative, from which sequences of protein-coding genes are available, is H. teleskopos. All four of these sequences (GenBank accession numbers AB600290.1 to AB600293.1) lack in-frame UAG codons, suggesting that the stop-to-glutamine UAG reassignment occurred only after the Iotanema lineage split from the one of Hicanonectes. However, we need to be cautious, as the abundance of the UAG sense codon may be low (it corresponds to only 8% of all glutamine codons in I. spirale genes; Fig. 3b), and we also lack positive evidence for UAG as a termination codon in H. teleskopos due to 3’-end truncation of all four sequences. It will be interesting not only to obtain more complete data for investigating the genetic code of H. teleskopos, but also to study the free-living unidentified isolate PCS [31] that is phylogenetically closer to I. spirale than H. teleskopos [30], and members of the endobiotic genus Caviomonas, whose relationship to I. spirale has been suggested by morphological similarities of their flagellar apparati [30].

Molecular mechanisms of codon reassignment in the rhizarian exLh and I. spirale

Let us now touch briefly upon the actual molecular underpinnings of the changed specificity of the UAG codon in the rhizarian exLh and I. spirale. The occurrence of UAG as a sense codon implies the existence of a cognate aminoacyl-tRNA that translates UAG into the proper amino acid. As this aminoacyl-tRNA must at the same time ignore UAA, which has been retained as a dominant stop codon in both taxa, we are left with a single possible anticodon, CUA, which pairs with the UAG codon but not with the UAA codon due to C:A mismatch at the first anticodon:third codon position. Therefore, we predict that sequencing the genomes of the rhizarian exLh and I. spirale will reveal the presence of novel tRNAs with the CUA anticodon and with attributes characteristic of tRNAs recognized by leucyl-tRNA synthetase and glutaminyl-tRNA synthetase, respectively. Genes for tRNA Leu (CUA) were previously found in mitochondrial genomes of the chytrid S. punctatus [21] and several green algae of the order Sphaeropleales [22], so the postulated existence of a similar gene in the rhizarian exLh is not without precedent. Similarly, the occurrence of tRNA Gln (CUA) has already been documented, namely in ciliates with the stop-to-glutamine reassignment of UAR codons, e.g., Tetrahymena thermophila [38] or C. magnum [16], and in hexamitins, e.g., in the Spironucleus salmonicida genome (the tRNA gene SS50377_t0098 on the scaffold scf7180000020657; GenBank accession number KI546140.1). Notably, anticodons of these tRNAs differ in only one nucleotide position from anticodons of standard tRNAs carrying the respective amino acids, i.e., tRNA Leu (CAA) decoding the UUG leucine codon and tRNA Gln (CUG) decoding the CAG glutamine codon.
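The anticodon reasoning in this paragraph can be checked mechanically. The sketch below considers strict Watson-Crick pairing only and ignores wobble pairing and base modifications, which is a simplifying assumption; the function names are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def watson_crick_anticodon(codon):
    """Anticodon (written 5'->3') that pairs with the codon at all three positions."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))

def pairs_strictly(anticodon, codon):
    """True if the anticodon pairs with the codon by Watson-Crick rules only."""
    return anticodon == watson_crick_anticodon(codon)

def mismatches(a, b):
    """Number of positions at which two equally long sequences differ."""
    return sum(x != y for x, y in zip(a, b))

print(watson_crick_anticodon("UAG"))             # anticodon reading UAG
print(pairs_strictly("CUA", "UAA"))              # same anticodon against UAA
print(mismatches("CUA", watson_crick_anticodon("UUG")))  # vs tRNA Leu anticodon
print(mismatches("CUA", watson_crick_anticodon("CAG")))  # vs tRNA Gln anticodon
```

The CUA anticodon differs from both CAA (reading UUG) and CUG (reading CAG) at a single position, which is what makes a one-step mutational origin from either tRNA plausible.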

Hence, the simplest scenario for the evolutionary origin of the putative UAG-decoding tRNAs in the rhizarian exLh and I. spirale is a mutation of pre-existing (duplicated copies of) tRNA Leu (CAA) and tRNA Gln (CUG) genes, respectively. Checking a test set of 27 nuclear genomes confirmed that tRNA Leu (CAA) and tRNA Gln (CUG) commonly occur in eukaryotic genomes, often in multiple copies in the genome (Additional file 5: Table S4), which is conducive to the emergence of the postulated mutant variants required for reading the UAG codon. In addition, leucyl-tRNA synthetases generally do not recognize the anticodon as a tRNA identity element [39–41], suggesting that efficient charging by leucine of the newly emerged tRNA Leu (CUA) does not necessarily require changes in the enzyme. In contrast, glutaminyl-tRNA synthetases do use the anticodon as a tRNA identity element, and indeed, tRNA Gln (CUA) from Tetrahymena thermophila was not recognized as a substrate by a mammalian glutaminyl-tRNA synthetase [42]. Unfortunately, to our knowledge, glutaminyl-tRNA synthetases from eukaryotes decoding UAG (and UAA) as glutamine have not been studied in detail, so the putative modifications needed for efficient charging of tRNA Gln (CUA) are unknown. Nevertheless, the multiple independent cases of the stop-to-glutamine UAG reassignment in various eukaryotes (Fig. 4) by themselves suggest that such modifications are achieved readily.

Phylogenetic distribution of known non-canonical genetic codes in nuclear genes of eukaryotes. The schematic phylogenetic tree was drawn on the basis of phylogenetic and phylogenomic analyses for eukaryotes as a whole [60, 71, 72] (our own Fig. 1 and Additional file 2: Figure S1) and for the relevant subgroups with non-canonical codes [12, 13, 73–77]. Multifurcations indicate uncertain or controversial branching order, dashed branches indicate different positions of Metamonada within eukaryotes suggested by different studies, branches drawn as double lines indicate paraphyletic groupings. The types and occurrences of the different non-canonical codes are based on this study (the rhizarian exLh and Iotanema) and the following previous reports: fungi [14, 15], Amoeboaphelidium [13], oxymonads [11], Blastocrithidia [18], ulvophytes [12], and ciliates [7, 9, 16, 17]. Note that, for simplicity, code variants with a context-dependent dual meaning of UAR or UGA codons as sense or termination ones (UAR in Blastocrithidia and Condylostoma, UGA in Parduczia and Condylostoma) are not distinguished from those with a “complete” reassignment. We also omitted some ciliate species with their putative non-canonical codes supported by little data that are specifically related to and possibly sharing the same code with better studied species. Changes in the genetic code are mapped onto the tree primarily (black circles) using Dollo parsimony (no reversions are allowed). An alternative maximum parsimony scenario with reversions weighted the same as other changes is indicated by the respective code numbers in white circles. An alternative branching order to the one indicated in the figure was supported by some studies for some of the ciliate lineages, but the alternative topology does not decrease the minimal number of codon reassignments required to explain the distribution of non-standard genetic codes.

Identification of tRNAs cognate to the UAG codon in the rhizarian exLh and I. spirale and elucidation of their evolutionary origin awaits sequencing the genomes of the two organisms. This will require identifying and culturing the rhizarian exLh; work towards this goal is underway in our laboratory. Sequencing the genome of I. spirale is complicated by the fact that it grows slowly and the culture is dominated by bacteria [30], which by itself rules out direct sequencing of tRNA molecules as an alternative approach. Moreover, it should be noted that identification of tRNAs responsible for reading reassigned termination codons may not be straightforward even when the genome sequence is available. For example, Swart et al. [16] failed to find a gene for the expected tRNA Trp (UCA) in their draft genome sequence of the ciliate C. magnum when investigating the genetic code of this organism. The specificity of tRNAs is not necessarily obvious from the gene sequence itself, as post-transcriptional editing or base modifications may be involved, too. As a result, the actual tRNAs responsible for termination codon reassignments remain unknown for most of the previously described non-canonical codes in eukaryotic nuclear genomes.

An effective reassignment of any of the UAG, UAA, or UGA codons is commonly thought to also depend on specific changes in the mechanism of translation termination. In eukaryotes, translation termination is mediated by the interaction of all three termination codons with the same protein, eRF1 (eukaryotic release factor 1), specifically with its N-terminal domain [43, 44]. This domain includes several highly conserved motifs or individual residues directly or indirectly involved in the recognition or binding of the termination codons, namely GTS, (TAS)NIKS, YxCxxxF, E55, and S70 [45, 46] (the residue numbering is based on the human eRF1 sequence as a reference). Indeed, eRF1 sequences in eukaryotes that have altered the meaning of UAG, UAA, or UGA codons proved to typically exhibit various alterations in these motifs when compared to eRF1 sequences from organisms with the canonical code [9, 47, 48], and some of these changes have been demonstrated as causally linked to an altered specificity of the eRF1 protein towards the termination codons [45, 49].

We identified transcripts encoding eRF1 in both the rhizarian exLh and I. spirale and compared the critical region of the respective eRF1 protein sequences with homologs from diverse eukaryotes, including a wide selection of species with non-canonical genetic codes (Additional file 6: Figure S2). The eRF1 sequence from the rhizarian exLh does not display any obvious deviation in the conserved elements noted above, but it notably exhibits an alanine residue at the Leu69 position of the human eRF1 protein. Although this position is not particularly conserved among eRF1 proteins, the substitution to alanine is unique for the rhizarian exLh (Additional file 6: Figure S2) and a corresponding L69A mutation was shown to increase readthrough of all three stop codons, particularly of the UAG codon, suggesting that this position is specifically important for the recognition of guanine of the third stop codon position by eRF1 [45]. It is, therefore, possible that this substitution is partly responsible for efficient usage of UAG as a sense codon in the rhizarian exLh. The most conspicuous feature of the eRF1 protein from I. spirale is a mutation in the highly conserved GTS motif, specifically a T32G substitution (Additional file 6: Figure S2). This seems to be significant. With adenosine in the second position of the termination codon, Thr32 faces the base at the third position, making a hydrogen bond with the N2 atom of guanosine in UAG [46]. Hence, the T32G substitution presumably disrupts this interaction and weakens the affinity of the I. spirale eRF1 to the UAG codon. The analysis of the eRF1 sequences from the rhizarian exLh and I. spirale thus provides interesting testable working hypotheses on how the translation apparatus of these two taxa has been modified to interpret the UAG codon as a sense one.


Some Viruses Use an Alternative Genetic Alphabet

Abby Olena
Apr 29, 2021

In 1977, scientists showed that a virus called S-2L that infects cyanobacteria has no adenine in its genome. Instead, S-2L uses a nucleotide known as diaminopurine or 2-aminoadenine, shortened to Z, that makes three hydrogen bonds, rather than the two that adenine (A) makes, when paired with thymine (T). In three papers published today (April 29) in Science, researchers show that the use of Z by phages, those viruses that infect bacteria, is more widespread than previously believed, and they describe the pathways by which the alternative nucleotide is made and incorporated into phage genomes.

“It’s been known that there’s this phage that doesn’t have adenine in its genome . . . and it’s been an unsolved mystery about how it does that,” says Jef Boeke, a molecular biologist at New York University Grossman School of Medicine who was not involved in the work. These papers “spell that out in glorious molecular detail,” he tells The Scientist. Plus, the authors “have done an amazingly comprehensive job of showing that this is not one crazy outlier, but there’s a whole group of bacteriophages that have this kind of genetic material.”

In 1998, Pierre Alexandre Kaminski of the Institut Pasteur and colleagues sequenced the genome of the S-2L cyanophage in hopes of deciphering the pathways that allow the virus to skirt the canonical nucleotide code. They found a sequence related to purA—the gene that encodes succinoadenylate synthase, one of the enzymes in the adenine synthesis pathway—that seemed like a good lead, but then shelved the project due to the challenges of working with the phage and its cyanobacterial host.

Over the following years, the researchers periodically searched databases and compared published sequences to the purA-like gene. In late 2015, they hit on a homologous sequence in Vibrio phage, which infects Vibrio, a genus of gram-negative bacteria that are much easier to work with than the S-2L cyanobacterial host. The sequences of the purA-like genes in the S-2L and Vibrio phages were more similar to each other than to other known purA genes, indicating that the Vibrio phage might also use diaminopurine in its DNA.

The team analyzed the composition of the Vibrio phage DNA and found that it does contain diaminopurine in place of adenine. They describe these findings in one of the papers out today, as well as the structure and in vitro function of the enzyme encoded by the purA-like gene, which they term PurZ. They show that PurZ has a similar function in the Z biosynthetic pathway to PurA in the synthesis of adenine, and that the bacteriophage genomes also contain another enzyme involved in making Z, known as PurB.

They also identify 19 purZ genes in different types of bacteriophages that phylogenetically cluster with the purA genes present in archaea.

“It’s striking how far back it goes . . . in the phylogeny,” says David Dunlap, a biophysicist at Emory University who did not participate in the work. “These things have been evolving in parallel for a long time. Since 1977, the cyanophage seemed like some sort of one-off problem and not very interesting, but they really made it apparent that this is out there, and in more places than we expect.”

In a second study, the same team identified phage genes encoding DNA polymerases that selectively incorporate diaminopurine in lieu of adenine. The original S-2L phage does not appear to harbor one of these genes, but nine other phage genomes, including Vibrio phage, do include a polymerase gene, which the authors named dpoZ. In the Vibrio phage and the other phage genomes, this DNA polymerase gene was found in the vicinity of the purZ gene.

An independent group, led by Huimin Zhao at the University of Illinois, corroborates these findings in the third study published today, while further characterizing the enzymes responsible for synthesizing Z and Z-containing genomes and identifying dozens of phage genomes distributed worldwide that contain genes encoding these enzymes. Zhao’s group also found a conserved enzyme encoded in several phage genomes that supports Z-genome synthesis by depleting adenosine triphosphate and its precursor from the nucleotide pool of the host, thus preventing the phages from incorporating A into their genomes.

In addition to using PurZ and PurB, these phages also hijack host enzymes to help synthesize diaminopurine and incorporate it into the phage genome. Finally, the researchers showed that Z-containing genomes are resistant to degradation by host restriction enzymes.

“If you want to be a diaminopurine-containing genome, you’re going to have to eliminate the competitor,” explains Dunlap. “The phage walks into an adenine world and has to enforce its will.”

Zhao and colleagues are looking into how to harness this show of phage will for applications such as treating bacterial infections. Researchers have already leveraged phages to treat some bacterial infections, he says, but including this pathway in those phages may make them even more effective, as they’d be resistant to degradation by their bacterial targets.

“There are a lot of questions that remain unanswered,” says Kaminski. In a paper that came out earlier this month on which he’s a coauthor, researchers shed light on one of those questions—how the S-2L genome is copied—by identifying the relevant polymerase. But Kaminski explains that one of the most difficult questions to answer will be when this mechanism evolved. “It’s supposed to be ancient because it roots deeply in the phylogenetic tree and because of the similarity of the [enzymatic] structures,” but it’s not clear whether Z or A genomes came first.


Are genes our destiny? Scientists discover 'hidden' code in DNA evolves more rapidly than genetic code

A "hidden" code linked to the DNA of plants allows them to develop and pass down new biological traits far more rapidly than previously thought, according to the findings of a groundbreaking study by researchers at the Salk Institute for Biological Studies.

The study, published September 16 in the journal Science, provides the first evidence that an organism's "epigenetic" code -- an extra layer of biochemical instructions in DNA -- can evolve more quickly than the genetic code and can strongly influence biological traits.

While the study was limited to a single plant species called Arabidopsis thaliana, the equivalent of the laboratory rat of the plant world, the findings hint that the traits of other organisms, including humans, might also be dramatically influenced by biological mechanisms that scientists are just beginning to understand.

"Our study shows that it's not all in the genes," said Joseph Ecker, a professor in Salk's Plant Molecular and Cellular Biology Laboratory, who led the research team. "We found that these plants have an epigenetic code that's more flexible and influential than we imagined. There is clearly a component of heritability that we don't fully understand. It's possible that we humans have a similarly active epigenetic mechanism that controls our biological characteristics and gets passed down to our children."

With the advent of techniques for rapidly mapping the DNA of organisms, scientists have found that the genes stored in the four-letter DNA code don't always determine how an organism develops and responds to its environment. The more biologists map the genomes of various organisms (their entire genetic code), the more they are discovering discrepancies between what the genetic code dictates and how organisms actually look and function.

In fact, many of the major discoveries that led to these conclusions were based upon studies in plants. There are traits such as flower shape and fruit pigmentation in some plants that are under the control of this epigenetic code. Such traits, which defy the predictions of classical Mendelian genetics, are also found in mammals. In some strains of mice, for instance, a tendency for obesity can pass from generation to generation, but no difference between the genetic code of fat mice and thin mice explains this weight difference.

Scientists have even found that identical human twins exhibit different biological traits, despite their matching DNA sequences. They have theorized that such unexplained disparities could be the work of epigenetic variation.

"Since none of these patterns of variation and inheritance match what the genetic sequence says should happen, there is clearly a component of the 'genetic' heritability that is missing," Ecker said.

Ecker and other scientists have traced these mysterious patterns to chemical markers that serve as a layer of genetic control on top of the DNA sequence. Just as genetic mutations can arise spontaneously and be inherited by subsequent generations, epigenetic mutations can emerge in individuals and spread into the broader population.

Although scientists have identified a number of epigenetic traits, very little was known about how often they arose spontaneously, how quickly they could spread through a population and how significant an influence they could have on biological development and function.

"Perception of the extent of epigenetic variation in plants from generation to generation varies widely within our scientific community," said Robert Schmitz, a post-doctoral research in Eckers' laboratory and the lead author on the paper. "We actually did the experiment, and found that overall there is very little change between each generation, but spontaneous epimutations do exist in populations and arise at a rate much higher than the DNA mutation rate, and at times they had a powerful influence over how certain genes were expressed."

In their study, the Salk researchers and collaborators at Scripps Research Institute mapped the epigenome of a population of Arabidopsis plants, then observed how this biochemical landscape had changed after 30 generations. This mapping consisted of recording the state of all locations on the DNA molecule that could undergo a chemical modification known as methylation, a key epigenetic change that can alter how certain underlying genes are expressed. They then watched how the methylation states of these sites evolved over the generations.
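As a rough illustration of the comparison described above, the following Python sketch treats each plant's methylation map as a dictionary from site position to methylation state and lists the positions whose state differs between an ancestor and a descendant. The function name, the data layout and the example positions are all hypothetical; this is a toy model of the idea, not the researchers' actual analysis pipeline.

```python
def changed_sites(ancestor, descendant):
    """Return sorted positions whose methylation state differs between two maps.

    Each map is a dict {position: True/False}, where True means "methylated".
    Only positions recorded in both maps are compared (an assumption of this sketch).
    """
    shared = ancestor.keys() & descendant.keys()
    return sorted(pos for pos in shared if ancestor[pos] != descendant[pos])


# Hypothetical toy data: four sites in an ancestor and in a later-generation plant.
ancestor = {101: True, 250: False, 399: True, 512: False}
descendant = {101: True, 250: True, 399: False, 512: False}

print(changed_sites(ancestor, descendant))  # [250, 399]
```

Because the plants are clones, any positions reported by such a comparison would reflect epigenetic rather than sequence differences.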

The plants were all clones of a single ancestor, so their DNA sequences were essentially identical across the generations. Thus any changes in how the plants expressed certain genetic traits were likely to be a result of spontaneous changes in their epigenetic code -- variations in the methylation of the DNA sites -- not the result of variations in the underlying DNA sequences.

"You couldn't do this kind of study in humans, because our DNA gets shuffled each generation," Ecker said. "Unlike people, some plants are easily cloned, so we can see the epigenetic signature without all the genetic noise."

The researchers discovered that as many as a few thousand methylation sites on the plants' DNA were altered each generation. Although this represents a small proportion of the potentially six million methylation sites estimated to exist on Arabidopsis DNA, it dwarfs the rate of spontaneous change seen at the DNA sequence level by about five orders of magnitude.
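To make the scale of these figures concrete, here is a back-of-the-envelope calculation in Python. The article says only "a few thousand" sites per generation, so the value of 3,000 used here is an assumption chosen for illustration; the six million sites and the five orders of magnitude come from the text above.

```python
# Illustrative figures only: "a few thousand" is taken as 3,000 (an assumption).
changed_sites_per_generation = 3_000
total_methylation_sites = 6_000_000   # estimate quoted in the article
orders_of_magnitude = 5               # gap quoted in the article

# Fraction of methylation sites that change state in one generation.
epimutation_rate = changed_sites_per_generation / total_methylation_sites

# DNA-level rate implied by a gap of five orders of magnitude.
implied_dna_mutation_rate = epimutation_rate / 10 ** orders_of_magnitude

print(f"Epimutation rate per site per generation: {epimutation_rate:.1e}")
print(f"Implied DNA mutation rate per site per generation: {implied_dna_mutation_rate:.1e}")
```

Under that assumption, roughly one methylation site in 2,000 changes each generation, while the implied rate of change at the DNA sequence level is on the order of a few per billion sites.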

This suggests that the epigenetic code of plants -- and other organisms, by extension -- is far more fluid than their genetic code.

Even more surprising was the extent to which some of these changes turned genes on or off. A number of plant genes that underwent heritable changes in methylation also experienced substantial alterations in their expression -- the process by which genes control cellular function through protein production.

This meant that not only was the epigenome of the plants morphing rapidly despite the absence of any strong environmental pressure, but that these changes could have a powerful influence on the plants' form and function.

Ecker said the results of the study provide some of the first evidence that the epigenetic code can be rewritten quickly and to dramatic effect. "This means that genes are not destiny," he said. "If we are anything like these plants, our epigenome may also undergo relatively rapid spontaneous change that could have a powerful influence on our biological traits."

Now that they have shown the extent to which spontaneous epigenetic mutations occur, the Salk researchers plan to unravel the biochemical mechanisms that allow these changes to arise and get passed from one generation to the next.

They also hope to explore how different environmental conditions, such as differences in temperature, might drive epigenetic change in the plants, or, conversely, whether epigenetic traits provide the plants with more flexibility in coping with environmental change.

"We think these epigenetic events might silence genes when they aren't needed, then turned them back on when external conditions warrant," Ecker said. "We won't know how important these epimutations are until we measure the effect on plant traits, and we're just now to the point where we can do these experiments. It's very exciting."

The research is supported by the National Science Foundation, the National Institutes of Health, the Howard Hughes Medical Institute, the Gordon and Betty Moore Foundation and the Mary K. Chapman Foundation.

