2.6: Transcription - Biology


The flow of genetic information

The primary role of DNA is to store heritable information that encodes the instructions for creating an organism. In the past decade we have gotten to be very good at sequencing DNA, but we still don't know how to reliably decode ALL of the information, and we still don't understand ALL of the mechanisms by which it is expressed.

There are, however, some core principles and mechanisms associated with the reading and expression of the genetic code whose basic steps are understood and that need to be part of the conceptual toolkit for all biologists. Two of these processes are transcription and translation: the copying of parts of the genetic code written in DNA into molecules of the related polymer RNA, followed by the reading and decoding of the RNA sequence (of nucleotides) into the protein's sequence (of amino acids).

In BIS2A we focus largely on developing an understanding of the process of transcription (recall that an Energy Story is simply a way of describing a process) and its role in the expression of genetic information. We motivate our discussion of transcription by focusing on functional problems (bringing in parts of our problem solving/design challenge) that must be solved for the process to take place. We then go on to describe how the process is used by Nature to create a variety of functional RNA molecules (that may have various structural, catalytic or regulatory roles) including messenger RNA (mRNA) molecules that carry the information required to synthesize proteins. Likewise, we focus on challenges and questions associated with the process of translation, the process by which the ribosomes synthesize proteins.

The basic flow of genetic information in biological systems is often depicted in a scheme known as "the central dogma" (see figure below). This scheme states that information encoded in DNA flows into RNA via transcription and ultimately to proteins via translation. Processes like reverse transcription (the creation of DNA from an RNA template) and replication also represent mechanisms for propagating information in different forms. This scheme, however, doesn't say anything about how information is encoded or about the mechanisms by which regulatory signals move between the various layers of molecule types depicted in the model. Therefore, while the scheme below is a nearly required part of the lexicon of any biologist, perhaps left over from old tradition, students should also be aware that mechanisms of information flow are more complex (we'll learn about some as we go), and that "the central dogma" only represents some core pathways.

The flow of genetic information.
Attribution: Marc T. Facciotti (original work)

Genotype to Phenotype

An important concept in the following sections is the relationship between genetic information, the genotype, and the result of expressing it, the phenotype. These two terms and the mechanisms that link the two will be discussed repeatedly over the next few weeks - start becoming proficient with using this vocabulary.

The information stored in DNA is in the sequence of the individual nucleotides when read in the 5' to 3' direction. Conversion of the information from DNA into RNA (a process called transcription) produces the second form that information takes in the cell. The mRNA is used as the template for the creation of the amino acid sequence of proteins (in translation). Here two different sets of information are shown. The DNA sequence is slightly different, resulting in two different mRNAs produced, followed by two different proteins, and ultimately, two different coat colors for the mice.

Genotype refers to the information stored in the DNA of the organism, the sequence of the nucleotides, the compilation of its genes. Phenotype refers to any physical characteristic that you can measure, such as height, weight, amount of ATP produced, ability to metabolize lactose, response to environmental stimuli, etc. Differences in genotype, even slight, can lead to different phenotypes that are subject to natural selection. The figure above depicts this idea. Note also that while classic discussions of genotype and phenotype relationships are usually framed in the context of multicellular organisms, this nomenclature and the underlying concepts apply to all organisms, even single-celled organisms like bacteria and archaea.
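To make the genotype-to-phenotype chain concrete, here is a minimal Python sketch that follows two alleles differing by one nucleotide through transcription and translation. The two allele sequences and the function names are invented for illustration, and the codon table is deliberately cut down to the five codons this toy example uses.

```python
# Partial codon table: only the codons that appear in this toy example.
CODON_TABLE = {
    "AUG": "Met", "GAA": "Glu", "GUA": "Val", "UGC": "Cys", "UAA": "Stop",
}

def transcribe(coding_strand):
    """Return the mRNA (5'->3') that matches a DNA coding strand (5'->3')."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Read codons from position 0 until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

# Hypothetical alleles: a single A->T change in the second codon.
allele_1 = "ATGGAATGCTAA"
allele_2 = "ATGGTATGCTAA"

for allele in (allele_1, allele_2):
    mrna = transcribe(allele)
    print(allele, "->", mrna, "->", "-".join(translate(mrna)))
```

The single-nucleotide difference in the DNA carries through to a different mRNA and then to a different amino acid (Glu versus Val). That is the kind of change that could, in a real organism, alter a measurable trait.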

Suggested discussion

Can something you cannot see "by eye" be considered a phenotype?

Suggested discussion

Can single-celled organisms have multiple simultaneous phenotypes? If so, can you propose an example? If not, why?


What is a gene? A gene is a segment of DNA in an organism's genome that encodes a functional RNA (such as rRNA or tRNA, etc) or protein product (enzymes, tubulin, etc). A generic gene contains elements encoding regulatory regions (which are often not transcribed) and a region encoding a transcribed unit.

Genes can acquire mutations - defined as changes in the composition and/or sequence of the nucleotides - either in the coding or regulatory regions. These mutations can lead to several possible outcomes: (1) nothing measurable happens as a result; (2) the gene is no longer expressed; or (3) the expression or behavior of the gene product(s) is different. In a population of organisms sharing the same gene, different variants of the gene are known as alleles. Different alleles can lead to differences in phenotypes of individuals and contribute to the diversity in biology that is under selective pressure.

Start learning these vocabulary terms and associated concepts. You will then be somewhat familiar with them when we start diving into them in more detail over the next lectures.

A gene consists of a coding region for an RNA or protein product accompanied by its regulatory regions. The coding region is transcribed into RNA which is then translated into protein. Note that, unlike this example, most transcripts do not begin with a start codon and end with a stop codon; there are usually upstream and downstream untranslated regions.
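The layout of untranslated regions around a coding region can be sketched in a few lines of Python. This is a simplified illustration, not a model of how ribosomes choose a start site: it assumes translation begins at the first AUG and ends at the first in-frame stop codon. The function name and the example transcript are made up.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_transcript(mrna):
    """Split an mRNA into (5' UTR, coding region, 3' UTR).

    Simplifying assumption: the coding region starts at the first AUG and
    ends at the first in-frame stop codon. Returns None if either is missing.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    return None

# Hypothetical transcript: 5' UTR + coding region + 3' UTR.
example = "GGCACC" + "AUGGCUUAA" + "CCGUU"
print(split_transcript(example))
```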

Section Summary

All living things must transcribe genes from their genomes. While the cellular location may be different (eukaryotes perform transcription in the nucleus; bacteria and archaea, lacking a nucleus, perform transcription in the cytoplasm), the mechanisms by which organisms from each of these clades carry out this process are fundamentally the same and can be characterized by three stages: initiation, elongation, and termination.

Transcription: from DNA to RNA

A short overview of transcription

Transcription is the process of creating an RNA copy of a segment of DNA. Since this is a process, we want to apply the Energy Story to develop a functional understanding of transcription. What does the system of molecules look like before the start of the transcription? What does it look like at the end? What transformations of matter and transfers of energy happen during the transcription and what, if anything, catalyzes the process? We also want to think about the process from a Design Challenge standpoint. If the biological task is to create a copy of DNA in the chemical language of RNA, what challenges can we reasonably hypothesize, or anticipate given our knowledge about other nucleotide polymer processes, must be overcome? Is there evidence that Nature solved these problems in different ways? What seem to be the criteria for success of transcription? You get the idea.

Listing some of the basic requirements for transcription

Let us first consider the tasks at hand by using some of our foundational knowledge and imagining what might need to happen during the process of transcription if the goal is to make an RNA copy of a piece of one strand of a double stranded DNA molecule. We'll see that using some basic logic allows us to infer many of the important questions and things that we need to know about to properly describe the process.

Let's imagine that we want to design a nanomachine/nanobot that would conduct transcription. We can use some Design Challenge thinking to identify problems and subproblems that need to be solved by our little robot.

• The first thing that we might want our machine to know is where to start. Along the millions to billions of base pairs, where should the machine be directed?
• Likewise, we need to know where to stop.
• If we have start and stop sites, we will need ways of encoding that information so that our machine(s) can read this information - how will that be accomplished?
• How many RNA copies of the DNA will we need to make?
• How fast do the RNA copies need to be made?
• How accurately do the copies need to be made?
• How much energy will the process take and where is the energy going to come from?

These are, of course, only some of the core questions. One can dig deeper if they wish. However, they are already good enough for us to start getting a good feel for this process. Notice, too, that many of these questions are remarkably similar to those we inferred might be necessary to understand about DNA replication.

The Building Blocks of Transcription

The building blocks of RNA

Recall from our discussion on the structure of nucleotides that the building blocks of RNA are very similar to those in DNA. In RNA the building blocks consist of nucleotide triphosphates that are composed of a ribose sugar, a nitrogenous base and three phosphate groups. The key differences between the building blocks of DNA and those of RNA are that RNA molecules are composed of nucleotides with ribose sugars (as opposed to deoxyribose sugars) and that instead of utilizing thymidine (the thymine-containing nucleotide) RNA utilizes uridine (a uracil-containing nucleotide). Note below that uracil and thymine are structurally very similar - the uracil is just lacking a methyl (CH3) functional group compared to thymine.

The basic chemical components of nucleotides.
Attribution: Marc T. Facciotti (original work)

Transcription Initiation


Proteins responsible for creating an RNA copy of a specific piece of DNA (transcription) must first be able to recognize the beginning of the element to be copied. A promoter is a DNA sequence onto which various proteins, collectively known as the transcription machinery, bind and initiate transcription. In most cases, promoters exist upstream (5' to the coding region) of the genes they regulate. The specific sequence of a promoter is very important because it determines whether the corresponding coding portion of the gene is transcribed all the time, some of the time, or infrequently.

In the bacterium E. coli, at the -10 and -35 regions upstream of the initiation site (the site of the first nucleotide of the transcript), there are two promoter consensus sequences, or regions that are similar across many promoters and across various related species. Some promoters will have a sequence very similar to the consensus sequence (the sequence containing the most common sequence elements), and others will look very different. These sequence variations affect the strength with which the transcriptional machinery can bind to the promoter to initiate transcription. This helps to control the number of transcripts that are made and how often they get made.
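One rough way to picture "similarity to the consensus" is to count how many positions of a promoter's -35 and -10 boxes agree with the E. coli consensus hexamers TTGACA and TATAAT. The Python sketch below does only that. The scoring function and the second example promoter are hypothetical, and real promoter strength also depends on factors this count ignores, such as the spacing between the two boxes.

```python
CONSENSUS_35 = "TTGACA"  # E. coli -35 consensus
CONSENSUS_10 = "TATAAT"  # E. coli -10 consensus

def matches(box, consensus):
    """Count positions where a 6-nt box agrees with the consensus."""
    return sum(1 for a, b in zip(box.upper(), consensus) if a == b)

def promoter_score(box_35, box_10):
    """Crude similarity score from 0 to 12 (12 = perfect consensus)."""
    return matches(box_35, CONSENSUS_35) + matches(box_10, CONSENSUS_10)

# A perfect-consensus promoter versus a hypothetical weaker one.
print(promoter_score("TTGACA", "TATAAT"))
print(promoter_score("TTTACA", "TATGTT"))
```

Under this toy scoring, the second promoter differs from the consensus at three positions, so it would be expected to bind the transcription machinery less tightly than the first.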

(a) A general diagram of a gene. The gene includes the promoter sequence, an untranslated region (UTR), and the coding sequence. (b) A list of several strong E. coli promoter sequences. The -35 box and -10 box are highly conserved sequences throughout the strong promoter list. Weaker promoters will have more base pair differences when compared to these sequences. Source:

Suggested discussion

What types of interactions are changed between the transcription machinery and the DNA when the nucleotide sequence of the promoter changes? Why would some sequences create a "strong" promoter and why do others create a "weak" promoter?

Bacterial vs Eukaryotic Promoters

In bacterial cells, the -10 region is AT rich, often TATAAT. Sequences just upstream of the -10, as well as the -35 (TTGACA) region, are recognized and bound by the protein σ ("sigma factor"), which is one component of the RNA polymerase holoenzyme. Once this protein-DNA interaction is made, sigma factor facilitates unwinding of the -10 region and loads polymerase onto the template strand. Sigma factor thus assists the polymerase in recognizing promoter sequences and loading the polymerase onto the right spot, pointing in the correct direction. Interestingly, E. coli makes several different sigma factors. In different situations (for example, under a particular stress) the activation of a different sigma factor will change the types of genes most frequently transcribed by RNA polymerase. Some bacteriophage have taken advantage of this aspect of bacterial transcription, producing their own phage-gene-specific sigma factors that hijack the host's RNA polymerase and redirect it to the phage genome.

Eukaryotic promoters are much larger and more complex than prokaryotic promoters, but both have an AT-rich region - in eukaryotes it is typically called a "TATA box". For example, in the mouse thymidine kinase gene, the TATA box is located at approximately -30. For this gene, the exact TATA box sequence is TATAAAA, as read in the 5' to 3' direction on the nontemplate strand. This sequence is not identical to the E. coli -10 region, but both share the quality of being A–T rich elements.

Instead of a single bacterial polymerase, the genomes of most eukaryotes encode three different RNA polymerases, each made up of 10 protein subunits or more. Each eukaryotic polymerase also requires a distinct set of proteins known as transcription factors to recruit it to a promoter. The terminology for transcription factors is unfortunately rather inconsistent. Suffice it to say that there are many proteins that are always required to act simultaneously to load, for example, any RNA polymerase II (the polymerase that transcribes mRNAs). These are referred to as "basal" (or "general") transcription factors. In addition, an army of proteins may affect the frequency of the attraction of these basal factors to the promoters, the frequency of loading of RNA pol II, and even the "escape" of the pol II complex from eukaryotic promoters (so that it can actually perform transcription). Enhancers and silencers (both DNA sequences, not proteins) are recognized by these regulatory transcription factors. Basal transcription factors are crucial in the formation of a preinitiation complex on the DNA template that subsequently recruits RNA polymerase for transcription initiation.

image from Kelvinsong

Regardless of the details of bacterial vs. eukaryotic polymerase localization and orientation, initiation of transcription begins with the binding of RNA polymerase to the promoter. Transcription requires the DNA double helix to partially unwind such that one strand can be used as the template for RNA synthesis. Note that unwinding occurs within the polymerase; RNA polymerase, unlike DNA polymerase, has an intrinsic helicase activity. Double stranded DNA is sucked into the enzyme (as are NTPs), and double stranded DNA, plus RNA, emerges. The region of unwinding is called a transcription bubble.

During elongation, RNA polymerase tracks along the DNA template, synthesizes mRNA in the 5' to 3' direction, and unwinds then rewinds the DNA as it is read. The "nontemplate" strand illustrated here is often called the "coding strand", simply because its sequence will match the sequence of the transcript.


Transcription always proceeds from one of the two DNA strands, which is called the template strand. The RNA product is complementary to the template strand and is almost identical to the non-template strand, called the coding strand, with the exception that RNA contains a uracil (U) in place of the thymine (T) found in DNA. During elongation, an enzyme called RNA polymerase proceeds along the DNA template adding nucleotides by base pairing with the DNA template in a manner similar to DNA replication, with the difference that an RNA strand is being synthesized that does not remain bound to the DNA template. As elongation proceeds, the DNA is continuously unwound ahead of the core enzyme and rewound behind it. Note that the direction of synthesis is identical to that of synthesis in DNA - 5' to 3'.
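The relationship between the template strand, the coding strand and the transcript can be checked with a short Python sketch. The template sequence and function names are made up for illustration. The template is written 3' to 5' so that reading it left to right produces the RNA in its 5' to 3' direction of synthesis.

```python
# Base-pairing rules used when building RNA against a DNA template base.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
# Base-pairing rules between the two DNA strands.
TEMPLATE_TO_CODING = {"A": "T", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Return the RNA (5'->3') made from a template strand written 3'->5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5.upper())

def coding_strand(template_3_to_5):
    """Return the coding (nontemplate) strand, written 5'->3'."""
    return "".join(TEMPLATE_TO_CODING[base] for base in template_3_to_5.upper())

template = "TACCTTACG"            # hypothetical template, written 3'->5'
rna = transcribe(template)        # complementary to the template
coding = coding_strand(template)  # same as the RNA except T in place of U

print("template 3'-" + template + "-5'")
print("coding   5'-" + coding + "-3'")
print("RNA      5'-" + rna + "-3'")
```

Replacing every T in the coding strand with U gives exactly the RNA sequence, which is why the nontemplate strand is called the coding strand.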


A) The addition of nucleotides during the process of transcription is very similar to nucleotide addition in DNA replication. The RNA is polymerized from 5' to 3', and with each addition of a nucleotide, a phosphoanhydride bond is hydrolyzed by the enzyme, resulting in a longer polymer and the release of pyrophosphate (two linked inorganic phosphates).

Suggested discussion

Compare and contrast the energy story for the addition of a nucleotide in DNA replication to the addition of a nucleotide in transcription.

Bacterial vs Eukaryotic Elongation

In bacteria, elongation begins with the release of the σ subunit of the RNA polymerase holoenzyme. The dissociation of σ allows the core enzyme to proceed along the DNA template, synthesizing mRNA in the 5' to 3' direction at a rate of approximately 40 nucleotides per second. The base pairing between DNA and RNA is not stable enough to maintain the stability of the mRNA synthesis components. Instead, the RNA polymerase acts as a stable linker between the DNA template and the nascent RNA strands to ensure that elongation is not interrupted prematurely.
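As a quick back-of-the-envelope check on what a rate of roughly 40 nucleotides per second means, the Python sketch below estimates elongation time for transcripts of a given length. The function name and the example lengths are illustrative assumptions. The estimate ignores initiation, pausing and termination.

```python
ELONGATION_RATE_NT_PER_S = 40  # approximate bacterial rate quoted in the text

def elongation_time_seconds(transcript_length_nt, rate=ELONGATION_RATE_NT_PER_S):
    """Estimate the time to synthesize a transcript at a constant rate."""
    return transcript_length_nt / rate

# Hypothetical transcript lengths, in nucleotides.
for length in (400, 1200, 4800):
    print(length, "nt takes about", elongation_time_seconds(length), "s")
```

At this rate, a 1,200-nucleotide transcript would take about 30 seconds to elongate.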

In eukaryotes, following the formation of the preinitiation complex, the polymerase is released from the other transcription factors, and elongation is allowed to proceed as it does in prokaryotes with the polymerase synthesizing pre-mRNA in the 5' to 3' direction. As discussed previously, RNA polymerase II transcribes the major share of eukaryotic genes, so this section will focus on how this polymerase accomplishes elongation and termination.


In Bacteria

Once a gene is transcribed, the bacterial polymerase needs to be instructed to dissociate from the DNA template and liberate the newly made mRNA. Depending on the gene being transcribed, there are two kinds of termination signals. One is protein-based and the other is RNA-based. Rho-dependent termination is controlled by the rho protein, which tracks along behind the polymerase on the growing mRNA chain. Near the end of the gene, the polymerase encounters a run of G nucleotides on the DNA template and it stalls. As a result, the rho protein collides with the polymerase. The interaction with rho releases the mRNA from the transcription bubble.

Rho-independent termination is controlled by specific sequences in the DNA template strand. As the polymerase nears the end of the gene being transcribed, it encounters a region rich in C–G nucleotides. The mRNA folds back on itself, and the complementary C–G nucleotides bind together. The result is a stable hairpin that causes the polymerase to stall as soon as it begins to transcribe a region rich in A–T nucleotides. The complementary U–A region of the mRNA transcript forms only a weak interaction with the template DNA. This, coupled with the stalled polymerase, induces enough instability for the core enzyme to break away and liberate the new mRNA transcript.
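The sequence signature described above (a C-G rich stretch that can fold into a hairpin, followed by a run of U in the transcript) can be searched for with a very simple scan. The Python sketch below is only an illustration under strong assumptions: a fixed stem length, a fixed loop length and an exact U-run. The function names, parameters and example transcript are all hypothetical. Real terminators vary in all of these features.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))

def find_terminator(rna, stem=6, loop=4, u_run=4, min_gc=0.6):
    """Return the index of the first stem-loop + U-run motif, or -1.

    Assumes a perfectly paired stem of fixed length, a loop of fixed
    length, and a U-run immediately after the hairpin.
    """
    motif_len = 2 * stem + loop + u_run
    for i in range(len(rna) - motif_len + 1):
        left = rna[i:i + stem]
        right = rna[i + stem + loop:i + 2 * stem + loop]
        tail = rna[i + 2 * stem + loop:i + motif_len]
        gc_fraction = (left.count("G") + left.count("C")) / stem
        if (right == reverse_complement(left)
                and gc_fraction >= min_gc
                and tail == "U" * u_run):
            return i
    return -1

# Hypothetical transcript end: GC-rich stem, 4-nt loop, matching stem, U-run.
transcript = "AAUA" + "GCGCCC" + "GAAA" + "GGGCGC" + "UUUUAC"
print(find_terminator(transcript))
```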

In Eukaryotes

The termination of transcription is different for the different polymerases. Unlike in prokaryotes, elongation by RNA polymerase II in eukaryotes takes place 1,000–2,000 nucleotides beyond the end of the gene being transcribed. This pre-mRNA tail is subsequently removed by cleavage during mRNA processing. On the other hand, RNA polymerases I and III require termination signals. Genes transcribed by RNA polymerase I contain a specific 18-nucleotide sequence that is recognized by a termination protein. The process of termination in RNA polymerase III involves an mRNA hairpin similar to rho-independent termination of transcription in prokaryotes.

In Archaea

Termination of transcription in the archaea is far less studied than in the other two domains of life and is still not well understood. While the functional details are likely to resemble mechanisms that have been seen in the other domains of life, the details are beyond the scope of this course.

Cellular Location

In bacteria and archaea

In bacteria and archaea, transcription occurs in the cytoplasm, where the DNA is located. Because the location of the DNA, and thus the process of transcription, is not physically segregated from the rest of the cell, translation often starts before transcription has finished. This means that mRNA in bacteria and archaea is used as the template for a protein before the entire mRNA is produced. The lack of spatial segregation also means that there is very little temporal segregation for these processes. The image below shows the processes of transcription and translation occurring simultaneously.

Here the pale blue circles represent RNA polymerases proceeding along a DNA template (from left to right). Note that multiple polymerases can load sequentially onto a single gene. Because this is a prokaryote, ribosomes can begin to use a transcript to synthesize protein before the transcript is complete. Note also that multiple ribosomes can sequentially load onto a single transcript.
Source: Marc T. Facciotti (own work)

In eukaryotes

In eukaryotes, the process of transcription is physically segregated from the rest of the cell, sequestered inside of the nucleus. This results in two things: the mRNA is completed before translation can start, and there is time to "adjust" or "edit" the mRNA before translation starts. The physical separation of these processes gives eukaryotes a chance to alter the mRNA in such a way as to extend the lifespan of the mRNA or even alter the protein product that will be produced from the mRNA.

mRNA Processing

5' G-Cap and 3' Poly-A tail

When a eukaryotic gene is transcribed, the primary transcript is processed in the nucleus in several ways. Eukaryotic mRNAs are modified at the 3' end by the addition of a poly-A tail. This run of A residues is added by an enzyme that does not use genomic DNA as a template. Additionally, the mRNAs have a chemical modification of the 5' end, called a 5'-cap. Data suggest that these modifications help both to increase the lifespan of the mRNA (preventing its premature degradation in the cytoplasm) and to help the mRNA initiate translation.

Figure: An example of an almost-complete eukaryotic transcript and the signals required for addition of the polyA tail (not yet added!). These signals will be cleaved from the transcript upon addition of the tail.

Figure: the 5' cap of eukaryotic transcripts. Often the first 2 nucleotides (here purple) of the mRNA are modified also. Note the odd 5' to 5' linkage.


"Splicing" of eukaryotic mRNAs refers to the removal of noncoding sequences (referred to as introns) embedded within the coding region of the transcript (the exons) (see below). Dr. Britt's section will not discuss splicing at length, except to refer to it as a process that must occur before the mRNA is fully mature and can be shipped to the cytoplasm for translation.
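The bookkeeping behind intron removal can be shown with a small Python sketch: given a pre-mRNA and the positions of its introns, join the remaining exons. The function name, the coordinate convention (0-based, end-exclusive) and the example sequence are assumptions for illustration. The sketch says nothing about how the cell recognizes intron boundaries.

```python
def splice(pre_mrna, introns):
    """Remove introns given as (start, end) pairs, 0-based and end-exclusive.

    Assumes the intron intervals do not overlap.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)

# Hypothetical pre-mRNA: exon 1 + intron (begins GU, ends AG) + exon 2.
pre_mrna = "AUGGCU" + "GUAAGUUUUCAG" + "GAAUAA"
mature = splice(pre_mrna, [(6, 18)])
print(mature)
```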


1. Skill: Drawing simple diagrams of the structure of single nucleotides of DNA and RNA, using circles, pentagons and rectangles to represent phosphates, pentoses and bases.

2. DNA differs from RNA in the number of strands present, the base composition and the type of pentose.

3. The nucleic acids DNA and RNA are polymers of nucleotides linked together by covalent bonds into a single strand.

  • backbone: sugar - phosphate - sugar - phosphate
  • nitrogenous base attached to 5-carbon pentose sugar (deoxyribose in DNA, ribose in RNA)

4. DNA is a double helix made of two antiparallel strands of nucleotides linked by hydrogen bonding between complementary base pairs.

complementary base pairing:

  • A = T: adenine forms two hydrogen bonds with its complementary base, thymine
  • G = C: guanine forms three hydrogen bonds with its complementary base, cytosine
  • the combination of hydrogen bonds between A = T and G = C hold the two strands of DNA together
  • each strand forms a helix, the two strands together form a double helix
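The pairing rules in the list above lend themselves to a small worked example. This Python sketch builds the antiparallel complementary strand of a short, made-up DNA sequence and counts the hydrogen bonds holding the duplex together (two per A-T pair, three per G-C pair). The function names are illustrative.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Hydrogen bonds formed by the pair each base takes part in.
H_BONDS_PER_PAIR = {"A": 2, "T": 2, "G": 3, "C": 3}

def complementary_strand(strand_5_to_3):
    """Return the antiparallel complementary strand, also written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand_5_to_3.upper()))

def hydrogen_bonds(strand):
    """Total hydrogen bonds between a strand and its perfect complement."""
    return sum(H_BONDS_PER_PAIR[base] for base in strand.upper())

strand = "ATGCGC"  # hypothetical sequence
print(complementary_strand(strand))
print(hydrogen_bonds(strand))
```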

Applications and skills:

• Application: Crick and Watson’s elucidation of the structure of DNA using model making.

• In diagrams of DNA structure, the helical shape does not need to be shown, but the two strands should be shown antiparallel. Adenine should be shown paired with thymine and guanine with cytosine, but the relative lengths of the purine and pyrimidine bases do not need to be recalled, nor the numbers of hydrogen bonds between the base pairs.


DNase-I hypersensitive sites sequencing (DNase-seq [1–4]) and Assays for Transposase-Accessible Chromatin sequencing (ATAC-seq [5, 6]) are two widely used protocols for genome-wide identification of open chromatin. DNase-seq and ATAC-seq are based on the use of cleavage enzymes (DNase-I and Tn5, respectively), which recognize and cleave DNA in open chromatin regions. Sequencing and the alignment of reads from these fragments allow the detection of open chromatin by identifying genomic intervals with many reads [1, 2]. However, the presence of transcription factors (TFs) bound to the DNA prevents the enzyme from cleaving in an otherwise nucleosome-free region. This leaves small regions, referred to as footprints, where read coverage suddenly drops within peak regions of high coverage.
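The idea of a footprint as a local drop in coverage inside a high-coverage peak can be illustrated with a naive sliding-window scan. The Python sketch below is not the method of any published footprinting tool. The function name, window sizes, threshold and cleavage counts are all invented for illustration.

```python
def find_footprints(counts, width=4, flank=4, ratio=0.25):
    """Return start indices of windows that look like footprints.

    A window qualifies when its mean count is at most `ratio` times the
    mean count of BOTH flanking windows (a naive, assumed criterion).
    """
    hits = []
    for start in range(flank, len(counts) - width - flank + 1):
        center = counts[start:start + width]
        left = counts[start - flank:start]
        right = counts[start + width:start + width + flank]
        center_mean = sum(center) / width
        left_mean = sum(left) / flank
        right_mean = sum(right) / flank
        if (left_mean > 0 and right_mean > 0
                and center_mean <= ratio * min(left_mean, right_mean)):
            hits.append(start)
    return hits

# Hypothetical per-base cleavage counts across a peak with one protected site.
peak = [10] * 6 + [1] * 4 + [10] * 6
print(find_footprints(peak))
```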

Computational methods scanning open chromatin profiles to find footprints have been shown to predict transcription factor binding sites (TFBS) with high accuracy in DNase-seq data [7, 8]. Among others, computational footprinting has been used to detect the regulatory lexicon of several cell types [9, 10], to measure the effects of genetic variants in TF binding [11] and to assess changes in the activity of TFs, e.g., during inflammatory responses [12] or fasting conditions [13]. Computational footprinting, which only requires a single open chromatin experiment per cell of interest, is a powerful tool to study regulatory processes.

ATAC-seq has several experimental advantages over DNase-seq: it requires fewer cells (50,000 to single cells) and is less laborious [5, 6]. Not surprisingly, the number of ATAC-seq-based studies deposited in Gene Expression Omnibus is twelve times higher than the number of DNase-seq-based studies in the last year (366 ATAC-seq vs. 29 DNase-seq). There are also twice as many ATAC-seq samples as DNase-seq samples per study, confirming that its experimental simplicity makes it a good choice for studies with large sample size, for example in clinical settings [14]. However, computational footprinting is still poorly explored in ATAC-seq data. The single study contrasting ATAC-seq and DNase-seq shows that ATAC-seq footprints have lower accuracy than DNase-seq footprints [15]. It was also reported that ATAC-seq average footprint profiles are not as well defined as average footprint profiles from DNase-seq [11]. However, all the work with footprinting in ATAC-seq so far [5, 15, 16] used computational methods tailored to DNase-seq data and ignored characteristics intrinsic to the ATAC-seq protocol.

A possible reason for the lower performance of ATAC-seq footprinting might be the cleavage enzyme Tn5 itself, which has a large (17bp) “Tn5 motif” [5, 17] and a complex cleavage mechanism requiring a Tn5 dimer for action. The large size of the Tn5 dimer makes cleavage events dependent on structural features of the neighboring proteins (TFs or histones) and on the size of accessible DNA [18]. Cleavage events in small linker DNA between nucleosomes are possible, but less likely than cleavage of fragments from active regulatory regions [5]. Importantly, the DNA binding preferences of enzymes cause sequence-specific cleavage bias. Thus, computational bias correction is an important aspect of the analysis of DNase-seq [19, 20] and ATAC-seq data [21]. Some work uses position weight matrices (PWMs), which assume independence between positions, to model DNase-seq bias [22]. However, most bias correction methods infer bias estimates using k-mer sequences around the start of aligned reads, by estimating the probability of finding a k-mer at read start sites against occurrences in the genome [19]. For DNase-seq, a k equal to 6 was frequently used [8, 11, 19, 20, 23]. This method requires the estimation of a multinomial distribution and is likely to suffer from overfitting for large k-mers [24]. Alternatively, position dependency models (PDMs) allow flexibility in the type of dependencies being modeled [25, 26]. They have been shown to overcome the problem of overfitting in modeling protein-DNA binding preferences. We are unaware of methods exploring effects of the local chromatin structure in ATAC-seq or the use of PDMs for modeling the bias of cleavage enzymes.
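The k-mer approach to bias estimation mentioned above compares how often each k-mer is seen at read start sites with how often it occurs in the genome. The Python sketch below shows that ratio on a toy scale. It is a simplified illustration under stated assumptions, not the procedure of any cited method: the k-mer is taken to begin at the read start (published methods typically use a window around it), k is set to 2 rather than 6, and the genome string, read start positions and function names are made up.

```python
from collections import Counter

def kmer_frequencies(kmers):
    """Convert an iterable of k-mers into relative frequencies."""
    counts = Counter(kmers)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def cleavage_bias(genome, read_starts, k=2):
    """Ratio of each k-mer's frequency at read starts to its genome frequency.

    Values above 1 suggest the enzyme cuts at that k-mer more often than
    expected from its abundance alone; values below 1 suggest the opposite.
    """
    background = kmer_frequencies(
        genome[i:i + k] for i in range(len(genome) - k + 1)
    )
    observed = kmer_frequencies(
        genome[s:s + k] for s in read_starts if s + k <= len(genome)
    )
    return {kmer: observed.get(kmer, 0.0) / freq
            for kmer, freq in background.items()}

# Toy data: an 8-bp "genome" and three aligned read start positions.
bias = cleavage_bias("ACGTACGT", read_starts=[0, 4, 1], k=2)
for kmer in sorted(bias):
    print(kmer, round(bias[kmer], 3))
```

With realistic values of k the number of possible k-mers grows as 4^k, so many more reads are needed to estimate each ratio reliably. This is the overfitting concern raised in the text.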

Here, we propose HINT-ATAC, which is the first footprinting method dealing with the characteristics of the ATAC-seq protocol. First, we propose the use of a probabilistic PDM based on sparse local inhomogeneous mixtures (SLIM) models for the correction of cleavage bias [26] and evaluate it for both ATAC-seq and DNase-seq protocols. Second, we model a novel observation that ATAC-seq cleavage events show a strand bias, which is associated with the number of nucleosomes in ATAC-seq fragments. HINT-ATAC, which is based on hidden Markov models, uses strand-specific, nucleosome-size decomposed, and bias-corrected signals to identify footprints. We show that HINT-ATAC significantly improves the recovery of footprints supported by TF ChIP-seq data [8, 27] from ENCODE cell lines [9]. Moreover, HINT-ATAC footprints have similar predictive accuracy using either ATAC-seq or DNase-seq protocols. Finally, as an example of practical application of footprint analysis, we use HINT-ATAC to detect TFs associated with immune dendritic cell (DC) specification.

Results and discussion

To investigate the processing of CD19 exon 2, we treated the NALM-6 B-ALL cell line with thapsigargin, which induces unfolded protein response and IRE1 activity [10], and profiled select transcripts by RT-PCR. As anticipated, the levels of the spliced XBP1 isoform were increased, but we did not detect changes in the reported CD19 Δex2part product (Additional File 1: Fig. S1a). This called into question the role of IRE1 in exon 2 processing. We therefore decided to investigate aberrant splicing of CD19 mRNA in B-ALL in more detail. To this end, we performed dRNA-seq and cDNA-seq on the same RNA sample from a therapy-resistant patient-derived xenograft [17] using long-read ONT sequencing. Both datasets documented the occurrence of several previously reported pathological CD19 isoforms, including exon 2 skipping [2] and intron 2 retention [4]. Surprisingly, we failed to detect the Δex2part product in dRNA-seq, even though it was clearly observed in cDNA-seq (Fig. 1a). This suggested that it may be an artifact of the reverse transcription (RT)/PCR amplification-based protocol. Close examination of the CD19 exon 2 sequence revealed that the putative exitron could be folding into a stable hairpin flanked by two 8-nt direct repeats (Fig. 1b), hinting at possible RT or PCR slippage at the base of the hairpin and ensuing product truncation.

The reported exitron in the CD19 exon 2 is a reverse transcription artifact. a Genome browser view showing cDNA-seq and dRNA-seq data for RNA from a patient-derived xenograft (PDX). Junction reads supporting the reported Δex2part product can be observed in cDNA-seq but are absent in the dRNA-seq. b Schematic of the predicted secondary structure and the direct repeats of the putative intron in CD19 exon 2. c Schematic of the eGFP/mCherry-based reporter to detect splicing of the reported CD19 exitron. d RT-PCR assay characterizing the CD19 transcript isoforms for the wild type version and the variants of the reporter shown in panel c. They include two different point mutants predicted to stabilize the putative hairpin (mut+) or disrupt one of the direct repeats (mut−), as well as the control construct wherein the reported exitron has been deleted at the DNA level (exon2part-del). e Flow cytometry-based assay to characterize splicing of the reported exitron in HEK293T cells. f Genome browser view showing the region of CD19 exon 2. cDNA-seq, dcDNA-seq, and dRNA-seq were performed on the same RNA sample from HEK293T cells expressing the mut+ reporter shown in panel c. Several hundred junction reads supporting exitron excision at the direct repeats in the cDNA-seq and dcDNA-seq data are detected, while none are found in the dRNA-seq

To test this hypothesis, we engineered a dual-fluorescence GFP/RFP reporter (Fig. 1c) that would allow detection of CD19 exitron excision at the RNA level by standard RT-PCR and at the protein level by flow cytometry, since excision restores the RFP open reading frame. Consistent with the CD19 exitron excision being an RT-PCR artifact, we readily observed the corresponding RT-PCR product, but no RFP/GFP double-positive cells upon transfection into HEK293T cells (Fig. 1d, e). In addition, we introduced point mutations that were predicted to either increase the stability of the secondary structure (mut+; ΔΔG = −5.1 kcal/mol) or disrupt one of the direct repeats (mut−; Fig. 1b). Consistent with our hairpin hypothesis, these reporter variants altered the levels of the Δex2part product in the RT-PCR-based assay. Namely, the levels were 82% higher in the case of mut+, and the product was completely abolished in the case of mut− (Fig. 1d). Again, neither of them, not even mut+, yielded GFP/RFP double-positive cells (Fig. 1e). As a positive control, we removed the reported exitron from the reporter at the DNA level (exon2part-del) and readily observed both the truncated RT-PCR product (Fig. 1d, e; Additional File 1: Fig. S1b, c) and robust expression of RFP (Fig. 1e).

To differentiate between RT and PCR artifacts, we performed dRNA-seq, direct cDNA (dcDNA)-seq omitting PCR amplification, and regular PCR-aided cDNA-seq on the reporter-transfected cells. To rule out insufficient sensitivity as an explanation, we used the mut+ reporter variant, which yields the highest levels of the Δex2part product in RT-PCR (Fig. 1e). Strikingly, in the long-read ONT data, the Δex2part product accounted for > 25% of dcDNA-seq and almost 30% of cDNA-seq reads, but was undetectable using dRNA-seq (Fig. 1f). This direct comparison of sequencing protocols indicated that excision of the reported CD19 exitron occurs not in live cells, but in the test tube during the RT step, possibly due to the two direct repeats brought together at the base of the predicted hairpin structure. A similar phenomenon has been previously observed in the human LIP1 and FOXL2 genes [18, 19].

Our results indicate that RT-based sequencing protocols can lead to the widespread mis-identification of exitrons. Indeed, the CD19 exitron was recently reported to yield a new isoform in the long-read full-length cDNA-seq dataset obtained using the Rolling Circle Amplification to Concatemeric Consensus (R2C2) method, which serves to increase detection accuracy [7, 8]. To determine whether other transcripts are prone to such RT artifacts, we performed a targeted search in publicly available ONT sequencing datasets. Specifically, we screened for transcript isoforms that are present in cDNA-seq but absent from the matching dRNA-seq. This was achieved using several filtering steps, such as adjusting for read coverage and excluding the presence of canonical splice sites (Fig. 2a; Additional File 1: Fig. S2a; also see Methods). We first applied this comparison to cDNA-seq and dRNA-seq data for the B-lymphoblastoid cell line GM12878 from the Nanopore RNA Consortium [20]. We readily rediscovered the CD19 exitron along with 19 other questionable exitrons, which we dubbed “falsitrons” (Fig. 2b, c; Additional File 1: Fig. S2b; Additional File 2: Data 1; Additional File 3: Table S1), supporting the common nature of such artifacts. We then extended our search to ONT sequencing data for five commonly used cell lines from the Singapore Nanopore Expression Project (SG-NEx) [21]: A549, HCT116, HepG2, K562, and MCF-7. In total, we discovered 100 candidate events corresponding to 57 unique falsitrons in 43 genes, for which “spliced” reads were present in the cDNA-seq (up to 70% of reads) but completely absent in the matched dRNA-seq (Fig. 2c; Additional File 2: Data 1; Additional File 3: Table S1). Many of these falsitrons were short (median length 353 nt; Fig. 2d), with the “spliced” regions flanked by direct repeats (35 out of 57; Fig. 2c, e). This discovery strengthens our hypothesis that falsitrons in many instances arise from RT slippage.
These artifacts are not restricted to ONT data, but occur in other long-read sequencing protocols such as Iso-Seq (Isoform Sequencing, PacBio) as well [13]. We detected 33 out of 57 falsitrons in the reconstructed isoforms from publicly available Iso-Seq data for several human RNA samples (Alzheimer brain, lymphoblastoid cell line COLO829BL, melanoma cell line COLO829T and Human Universal Reference RNA—see the “Methods” section and Additional File 1: Fig. S2c).
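The screening logic described above (junctions supported by cDNA-seq reads but absent from the matched dRNA-seq, after coverage and splice-site filtering) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name, the input dictionaries, and both thresholds are hypothetical and chosen only for illustration, and junction counts are assumed to have been extracted from the alignments beforehand.

```python
from typing import Dict, List, Tuple

# A junction is the gap that a "spliced" read skips: (chromosome, start, end).
Junction = Tuple[str, int, int]

# Canonical splice-site dinucleotides in the DNA alphabet (GU/GC-AG in RNA).
CANONICAL_MOTIFS = {("GT", "AG"), ("GC", "AG")}


def candidate_falsitrons(
    cdna_reads: Dict[Junction, int],
    drna_reads: Dict[Junction, int],
    drna_coverage: Dict[Junction, int],
    motifs: Dict[Junction, Tuple[str, str]],
    min_cdna_reads: int = 5,       # hypothetical threshold
    min_drna_coverage: int = 10,   # hypothetical threshold
) -> List[Junction]:
    """Return junctions seen in cDNA-seq but never in matched dRNA-seq."""
    candidates = []
    for junction, n_cdna in cdna_reads.items():
        # Require enough cDNA-seq junction reads to call the event at all.
        if n_cdna < min_cdna_reads:
            continue
        # Any dRNA-seq junction read means the isoform exists as RNA.
        if drna_reads.get(junction, 0) > 0:
            continue
        # Absence is only informative if dRNA-seq covers the region well.
        if drna_coverage.get(junction, 0) < min_drna_coverage:
            continue
        # Junctions with canonical splice sites are excluded from the screen.
        if motifs.get(junction) in CANONICAL_MOTIFS:
            continue
        candidates.append(junction)
    return sorted(candidates)


# Toy data with made-up coordinates and counts.
cdna = {("chr16", 100, 200): 40, ("chr1", 500, 900): 30,
        ("chr2", 10, 60): 2, ("chr3", 5, 50): 12}
drna = {("chr1", 500, 900): 25}
coverage = {("chr16", 100, 200): 50, ("chr1", 500, 900): 60,
            ("chr2", 10, 60): 40, ("chr3", 5, 50): 3}
motifs = {("chr16", 100, 200): ("CC", "TG"), ("chr1", 500, 900): ("GT", "AG")}

print(candidate_falsitrons(cdna, drna, coverage, motifs))
```

In the toy data, only the first junction passes every filter: the second has dRNA-seq support, the third has too few cDNA-seq reads, and the fourth lies in a region with too little dRNA-seq coverage to interpret.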

Fig. 2 The detection of questionable exitrons is common in cDNA-seq and dcDNA-seq. a Schematic representation of the workflow to identify falsitrons in public ONT sequencing datasets. b Genome browser view showing the falsitron in TAX1BP3 in ONT sequencing data for GM12878. c Violin plots indicating the detection of falsitrons in cDNA-seq and dcDNA-seq of different human cell lines. d Stacked bar plots showing the fraction of falsitrons of different lengths. e Bar graph depicting the length of falsitron-flanking direct repeats. f Violin plots showing the relative abundance of falsitron products in DNAJC22 and GAS2L3 for three TCGA cancer cohorts. ESCA, esophageal carcinoma. OV, ovarian serous cystadenocarcinoma. STAD, stomach adenocarcinoma. g Plot showing the cumulative percentage with direct repeats of at least a given length. Dashed lines indicate the total fraction of introns with direct repeats (≥ 4 nt). h Sequence logos indicating nucleotide composition at 5′ and 3′ splice sites. Positions of splice site dinucleotide motifs are highlighted.

Conceptually, such RT artifacts would not be restricted to long-read cDNA-seq data either and should also be found in conventional short-read RNA-seq protocols. To test this hypothesis, we screened the Cancer Genome Atlas (TCGA) database [22] and immediately found six of the falsitrons in several cancer types. Overall, the abundance of the corresponding isoforms was low (< 5%), but could rise up to > 90% for certain samples and tumor types (Fig. 2f). This is potentially important, because a recent paper reported more than 100,000 exitrons in the TCGA database and suggested that the corresponding isoforms are novel cancer drivers and neoepitopes [23]. To learn whether such analyses might be affected by RT artifacts, we overlaid the falsitrons from our ONT data comparison onto these reported exitrons. We found that five falsitrons, including the CD19 one, overlapped with reported exitrons. To our surprise, we further detected direct repeats (≥ 4 nt) overlapping the putative splice sites in almost 75% of the reported exitrons (91,852 out of 123,337; median length 5 nt), i.e., even more than in our falsitron list (with the shorter median length of 4 nt; Fig. 2g). In contrast, only 25% of all annotated introns harbored such direct repeats at their splice sites (median length < 4 nt). Moreover, even though exitrons had been selected for canonical splice site dinucleotides (GU/GC-AG), they lacked other characteristics of 5′ and 3′ splice sites such as U1 complementarity and the polypyrimidine tract (Fig. 2h). This finding indicates that a significant fraction of the reported exitrons could also be RT artifacts. Although this observation awaits experimental validation, it suggests that caution is required when interpreting RNA-seq mapping data. We envision that as more dRNA-seq data become available, the unequivocal classification of cryptic introns as exitrons or falsitrons will be possible.
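To make the notion of a direct repeat overlapping the putative splice sites concrete, here is a minimal Python sketch. It assumes one possible definition, not necessarily the one used in the study: the repeat is the stretch of sequence shared by both boundaries of the excised region, measured by extending matches to the left and to the right of the two boundaries. The function names and the toy sequence are hypothetical; only the 4-nt threshold is taken from the text.

```python
def direct_repeat_length(seq: str, start: int, end: int) -> int:
    """Length of the direct repeat shared by both boundaries of seq[start:end].

    Coordinates are 0-based and half-open, so seq[start:end] is the excised
    region. Matches are extended rightward (region start vs. sequence after
    the region) and leftward (sequence before the region vs. region end).
    """
    if not 0 <= start < end <= len(seq):
        raise ValueError("start/end must satisfy 0 <= start < end <= len(seq)")
    gap = end - start

    right = 0
    while (end + right < len(seq) and right < gap
           and seq[start + right] == seq[end + right]):
        right += 1

    left = 0
    while (start - left - 1 >= 0 and left < gap
           and seq[start - left - 1] == seq[end - left - 1]):
        left += 1

    return left + right


def has_direct_repeat(seq: str, start: int, end: int, min_len: int = 4) -> bool:
    """True if the boundaries share a direct repeat of at least min_len nt."""
    return direct_repeat_length(seq, start, end) >= min_len


# Toy sequence: an 8-nt repeat (GATTACAG) on both sides of an 8-nt spacer.
toy = "AAAT" + "GATTACAG" + "TTTTCCCC" + "GATTACAG" + "CTTT"
print(direct_repeat_length(toy, 4, 20), has_direct_repeat(toy, 4, 20))
```

Because the repeat makes the exact breakpoint ambiguous, an aligner may place the junction anywhere within it. Summing the left and right extensions is meant to give the same repeat length regardless of where the junction was reported.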

2.6 – Enzymes

2.6.1 – Define enzyme and active site
Enzyme – A biological catalyst made of globular protein
Enzymes speed up reactions by influencing the stability of bonds in the reactants. They may also provide an alternative reaction pathway and reduce the activation energy needed for the reaction.
Active Site – The region of an enzyme molecule surface where the substrate molecule binds and catalysis occurs
The substrate is drawn into the active site, which has both binding and catalytic regions. The molecules are positioned there to promote the reaction.

2.6.2 – Explain enzyme-substrate specificity
A substrate is the starting substance, which is converted to the product.
Enzymes are very specific, and will only catalyse one type of reaction or a very small group of similar reactions. They recognize the substrate because the active site has a precise shape and distinctive chemical properties. Hence, only particular substrate molecules will be attracted to the active site and fit there. Others cannot fit and will not bind.
An enzyme can have high specificity (when it will bind only a single type of substrate) or low specificity (when it will bind a range of related substances). When enzyme and substrate bind, the enzyme-substrate complex is formed. The lock and key model suggests that the enzyme and substrate possess specific, complementary shapes that fit exactly into each other.

2.6.3 – Explain the effects of temperature, pH and substrate concentration on enzyme activity
Temperature – Each enzyme has an optimal temperature for function. At this temperature, the enzyme works at its peak, speeding up the reaction the most. Once the temperature rises above the optimum, the reaction rate declines abruptly. Many enzymes are adversely affected by high temperatures, at which point denaturation occurs. Many enzymes only have a narrow range of conditions under which they operate properly. For plant and animal enzymes, this range usually lies at relatively low temperatures.

pH – Enzymes also have an optimal pH. At this pH, the enzyme is most active and the reaction occurs fastest. There is lower activity above and below the optimum pH (see graph). Extremes in pH will usually result in a complete loss of activity for most enzymes, as they lead to a change in the shape of the active site. The H+ ions interfere with hydrogen and ionic bonds within the protein structure, which means that the substrate cannot bind. The optimum pH varies greatly between enzymes. For example, pepsin has an optimum pH of 1.5, but lipase has an optimum pH of 8.0.

Substrate Concentration – If the amount of the enzyme is kept constant and the substrate concentration is increased, the reaction velocity will increase until it reaches its maximum. After that, the velocity plateaus. At this point, the enzyme is saturated: all of the enzyme molecules have formed complexes with the substrate.
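The rise-then-plateau behaviour described above can be illustrated numerically. The sketch below uses the standard Michaelis-Menten equation, v = Vmax × [S] / (Km + [S]), which is not named in these notes and is brought in here only as a common textbook model of saturation. The function name and the values of Vmax and Km are arbitrary choices for illustration.

```python
def reaction_velocity(substrate: float, v_max: float, k_m: float) -> float:
    """Michaelis-Menten rate: v = v_max * [S] / (k_m + [S]).

    substrate: substrate concentration [S] (same units as k_m)
    v_max:     maximum velocity at a fixed enzyme concentration
    k_m:       substrate concentration at which v is half of v_max
    """
    if substrate < 0 or v_max < 0 or k_m <= 0:
        raise ValueError("concentrations and v_max must be non-negative, k_m > 0")
    return v_max * substrate / (k_m + substrate)


# Arbitrary illustrative parameters (not measured values).
V_MAX = 10.0
K_M = 2.0

for s in [0.0, 0.5, 2.0, 8.0, 32.0, 128.0]:
    v = reaction_velocity(s, V_MAX, K_M)
    print(f"[S] = {s:6.1f}  ->  v = {v:5.2f}")
```

With the enzyme amount held constant, the velocity climbs quickly at low substrate concentrations and then levels off just below Vmax, which corresponds to the point where essentially all enzyme molecules are occupied by substrate.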

Enzyme Concentration – So long as there is excess substrate, the rate of reaction will continue to increase as the concentration of the enzyme increases.

2.6.4 – Define denaturation
Denaturation is a structural change in a protein that results in the loss of its three-dimensional shape and, usually, of its biological function. It can be caused by extremes of pH or temperature and is often permanent. The bonds that maintain the secondary and tertiary structure are altered, although the amino acid sequence is unchanged.
Denaturation can result from strong acids and alkalis, which disrupt ionic bonds, resulting in coagulation. Long exposure will eventually break down the primary structure. Heavy metals also disrupt ionic bonds, and form bonds with the carboxyl groups of the R groups, reducing the charge of the protein. This generally causes the protein to precipitate. Heat and radiation (such as UV rays) disrupt the bonds because of the increased energy provided to the atoms. Detergents and solvents form bonds with the non-polar groups in the protein, which disrupts hydrogen bonding.

2.6.5 – Explain the use of lactase in the production of lactose-free milk
The production of lactose-free milk is an example of the industrial use of biotechnology, which is of huge and increasing economic importance. People who are lactose intolerant cannot digest lactose because they do not produce enough lactase. They can instead drink lactose-free milk, which is made by using lactase from bacteria.
Lactose removal used to be done with whole-cell preparations. This is not efficient, however, and is inappropriate for a food like liquid milk. Cell-free preparations are also used, although the enzymes cannot be re-used, and their removal can be expensive.
Instead, immobilized enzymes are used. The advantages of this method are:

  • The enzyme preparation can be re-used
  • The product received is enzyme-free
  • The enzyme may be more stable and long lasting due to protection by the inert matrix

Today, lactose-free milk is produced by passing milk over lactase enzyme bound to an inert carrier. The enzyme is obtained from bacteria, purified, and enclosed in capsules. Once the lactose is cleaved into glucose and galactose, it no longer causes ill effects. Alternatively, a harmless bacterium (such as L. acidophilus) may be added, which breaks down the lactose in milk and yoghurt.

Structural basis for backtracking by the SARS-CoV-2 replication-transcription complex

Backtracking, the reverse motion of the transcriptase enzyme on the nucleic acid template, is a universal regulatory feature of transcription in cellular organisms but its role in viruses is not established. Here we present evidence that backtracking extends into the viral realm, where backtracking by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) RNA-dependent RNA polymerase (RdRp) may aid viral transcription and replication. Structures of SARS-CoV-2 RdRp bound to the essential nsp13 helicase and RNA suggested the helicase facilitates backtracking. We use cryo-electron microscopy, RNA-protein cross-linking, and unbiased molecular dynamics simulations to characterize SARS-CoV-2 RdRp backtracking. The results establish that the single-stranded 3' segment of the product RNA generated by backtracking extrudes through the RdRp nucleoside triphosphate (NTP) entry tunnel, that a mismatched nucleotide at the product RNA 3' end frays and enters the NTP entry tunnel to initiate backtracking, and that nsp13 stimulates RdRp backtracking. Backtracking may aid proofreading, a crucial process for SARS-CoV-2 resistance against antivirals.

Keywords: RNA-dependent RNA polymerase; backtracking; coronavirus; cryo-electron microscopy; molecular dynamics.
