Why is Sanger sequencing inferior for detecting SNPs in cancer cells?


I am familiar with Sanger sequencing, but only at undergraduate level. A lecturer of mine described Sanger sequencing as losing the sequence information in noise when it is used to detect cancer mutations. This paper also says that it "lacks sufficient sensitivity for detecting mutant alleles in tumor biopsies" (Thomas et al., 2006). What is it about Sanger sequencing that makes it too insensitive for SNP analysis?

Thomas, R. K., et al. (2006) Sensitive mutation detection in heterogeneous cancer specimens by massively parallel picoliter reactor sequencing. Nat. Med. 12, 852-855.


In the Sanger approach, DNA would be isolated from the biopsy and would contain both normal alleles and mutant alleles of genes associated with the development of the tumour. If, for example, PCR amplification was then used to derive a sample of a target template region, this material would end up being sequenced as a mixed population: the derived sequence would be an average of that population of sequences, and rare alleles would be masked.

The key difference in most next generation approaches is that the template DNA is "cloned" physically (e.g. by sequestering individual molecules into droplets, or by binding them to a surface) so that the sequences of these individual molecules can be determined in parallel. This approach would reveal tumour-associated polymorphisms when the sequences of the individual template molecules were compared.
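
To make the contrast concrete, here is a minimal sketch in Python. It models a single position in a pool of template molecules and compares a bulk "Sanger-style" call, which only sees the averaged signal, with per-molecule counting. The function names and the 20% minimum peak fraction are illustrative assumptions, not values taken from the paper; real traces also carry baseline noise that this toy model ignores.

```python
import random
from collections import Counter


def make_templates(n_molecules, mutant_fraction, normal_base="G", mutant_base="A", seed=0):
    """Build a pool of template molecules, each represented by its base at one position."""
    rng = random.Random(seed)
    n_mutant = round(n_molecules * mutant_fraction)
    bases = [mutant_base] * n_mutant + [normal_base] * (n_molecules - n_mutant)
    rng.shuffle(bases)
    return bases


def sanger_call(bases, min_peak_fraction=0.2):
    """Bulk read of the mixed pool: a base is reported only if its share of the
    pooled signal reaches min_peak_fraction (an assumed, illustrative threshold)."""
    counts = Counter(bases)
    total = len(bases)
    return sorted(b for b, c in counts.items() if c / total >= min_peak_fraction)


def clonal_counts(bases):
    """Clonal reads: every molecule is read separately, so rare bases are counted."""
    return Counter(bases)


if __name__ == "__main__":
    pool = make_templates(1000, 0.05)   # 5% mutant allele
    print(sanger_call(pool))            # ['G']  (the mutant peak is masked)
    print(clonal_counts(pool)["A"])     # 50     (the mutant molecules are still visible)
```

At a 50% mutant fraction both bases pass the bulk threshold, which matches the familiar double peak of a heterozygous site; the problem only appears once the mutant allele is diluted by normal DNA.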

Added later as supplementary information.

I have now looked at the paper cited in the question. This quotation from the Introduction bears out my main point about sequencing a mixture of templates differing at just a few key positions.

Although commonly used in many clinical settings, dideoxynucleotide chain termination (or 'Sanger') sequencing of PCR products often lacks sufficient sensitivity for detecting mutant alleles in tumor biopsies, where the failure rate has reached 75% in some cases. Gain-of-function oncogenic mutations are frequently heterozygous events or may represent a single allele of an amplified gene; thus, the signal for mutated residues is typically reduced relative to neighboring bases. Moreover, the ability to detect single base mutations or small insertions or deletions in biopsy material by Sanger sequencing depends heavily on sample purity (for example, the extent of contaminating stromal DNA) and genomic DNA integrity. Furthermore, resistance to kinase inhibitors may correlate with low-frequency second-site mutations. These observations underscore the challenges for accurate mutation detection in cancer specimens.

A massively parallel sequencing-by-synthesis approach, 'picotiter plate pyrosequencing,' provides a new alternative to Sanger sequencing. This approach relies on emulsion PCR-based clonal amplification of a DNA library adapted onto micron-sized beads and subsequent pyrosequencing-by-synthesis of each clonally amplified template in a picotiter plate, generating over 200,000 unique clonal sequencing reads per experiment. Sequence variants that represent a fraction of a complex sample can be vastly oversampled, thus enabling statistically meaningful quantification of low-abundance variants.
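
(My own note, not part of the quotation.) The remark about oversampling can be illustrated with a simple binomial model: if a variant is present at allele fraction `vaf` and a position is covered by `depth` independent clonal reads, the number of mutant reads is roughly Binomial(depth, vaf). The sketch below computes the chance of seeing at least `k` mutant reads. The function name and the example numbers are my assumptions, and the model deliberately ignores sequencing and PCR errors, which set the practical floor in real data.

```python
from math import comb


def prob_at_least(k, depth, vaf):
    """P(X >= k) for X ~ Binomial(depth, vaf): the chance that at least k of
    `depth` independent reads carry a variant present at allele fraction `vaf`."""
    p_fewer = sum(comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k))
    return 1.0 - p_fewer


if __name__ == "__main__":
    # A 1% variant: shallow sampling often misses it, deep sampling essentially never does.
    print(prob_at_least(1, 100, 0.01))
    print(prob_at_least(5, 10000, 0.01))
```

With thousands of reads per position, even a variant at a fraction of a percent is expected to appear many times, which is what makes counting it statistically meaningful.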


Systematic comparative analysis of single-nucleotide variant detection methods from single-cell RNA sequencing data

Systematic interrogation of single-nucleotide variants (SNVs) is one of the most promising approaches to delineate the cellular heterogeneity and phylogenetic relationships at the single-cell level. While SNV detection from abundant single-cell RNA sequencing (scRNA-seq) data is applicable and cost-effective in identifying expressed variants, inferring sub-clones, and deciphering genotype-phenotype linkages, there is a lack of computational methods specifically developed for SNV calling in scRNA-seq. Although variant callers for bulk RNA-seq have been sporadically used in scRNA-seq, the performances of different tools have not been assessed.

Results

Here, we perform a systematic comparison of seven tools including SAMtools, the GATK pipeline, CTAT, FreeBayes, MuTect2, Strelka2, and VarScan2, using both simulation and scRNA-seq datasets, and identify multiple elements influencing their performance. While the specificities are generally high, with sensitivities exceeding 90% for most tools when calling homozygous SNVs in high-confident coding regions with sufficient read depths, such sensitivities dramatically decrease when calling SNVs with low read depths, low variant allele frequencies, or in specific genomic contexts. SAMtools shows the highest sensitivity in most cases especially with low supporting reads, despite the relatively low specificity in introns or high-identity regions. Strelka2 shows consistently good performance when sufficient supporting reads are provided, while FreeBayes shows good performance in the cases of high variant allele frequencies.
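
As an illustration of the metrics reported above, the following minimal Python sketch computes sensitivity and specificity for one caller from sets of genomic positions. The function name and the set-based inputs are hypothetical, and the exact definitions used in the benchmark may differ (for example in how the evaluated positions are chosen).

```python
def sensitivity_specificity(called, truth, evaluated):
    """Compare called SNV positions with a truth set over the evaluated positions.

    called, truth, evaluated: iterables of hashable position identifiers.
    Returns (sensitivity, specificity); a metric is NaN if its denominator is zero.
    """
    evaluated = set(evaluated)
    called = set(called) & evaluated
    truth = set(truth) & evaluated
    tp = len(called & truth)             # true SNVs that were called
    fn = len(truth - called)             # true SNVs that were missed
    fp = len(called - truth)             # calls at positions without a true SNV
    tn = len(evaluated) - tp - fn - fp   # non-variant positions correctly left uncalled
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    sens, spec = sensitivity_specificity(
        called={1, 2, 3, 50}, truth={1, 2, 3, 4}, evaluated=range(100)
    )
    print(sens, spec)
```

Because non-variant positions vastly outnumber variant ones, specificity computed this way stays close to 1 even with a noticeable number of false calls, which is one reason it is usually reported alongside sensitivity rather than alone.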

Conclusions

We recommend SAMtools, Strelka2, FreeBayes, or CTAT, depending on the specific conditions of usage. Our study provides the first benchmarking to evaluate the performances of different SNV detection tools for scRNA-seq data.


Workflow overview

Thermo Fisher Scientific has integrated all the tools necessary for genome editing and downstream analysis (Figure 1). The Invitrogen™ GeneArt™ design tool facilitates the design and ordering of target-specific gRNAs for CRISPR-mediated genome editing or TALs for TALEN-mediated genome editing. Invitrogen transfection reagents offer several options for delivery of genome editing tools into eukaryotic cells. In addition, Invitrogen TOPO™ TA cloning vectors and competent cells facilitate the sequence analysis of primary transformants. Gibco™ media is available for growing the primary transformants and secondary cultures following clonal expansion. Finally, Applied Biosystems sequencing instruments and reagents enable the determination of specific genomic editing events. In this application note, we demonstrate how this workflow comes together to generate and identify mutations in the human hypoxanthine phosphoribosyl transferase (HPRT) gene.

A brief overview of the steps used to generate and analyze a primary culture with HPRT mutations is shown in Figure 2. The target-specific CRISPR RNA (crRNA) sequence within the gRNA was designed to a HPRT-specific locus. The gRNA was synthesized via in vitro transcription using the Invitrogen GeneArt Precision gRNA Synthesis Kit. Following synthesis and purification, gRNA was cotransfected with Cas9 mRNA into 293FT cells using Invitrogen Lipofectamine™ MessengerMAX™ Transfection Reagent. The cells were harvested 78 hours posttransfection. The cell lysates were then used along with primers flanking the HPRT target to generate PCR amplicons no greater than 600 bp in length. The PCR products were then subcloned using the Invitrogen Zero Blunt™ TOPO PCR Cloning Kit and transformed into Invitrogen TOP10 E. coli cells. Ninety-six bacterial colonies were picked per transformed pool of gene-edited cells and processed for DNA isolation using the Invitrogen PureLink™ 96 Plasmid Purification System and subjected to Sanger sequencing. The resulting sequencing data was then analyzed to measure the percent of PCR products containing accurately edited sequence and to select which clonal isolates to maintain. Alternatively, although it was not performed for this study, the PCR product could be sequenced directly, without subcloning into TOPO cells.

Figure 1. Overall workflow for CRISPR genome editing. Thermo Fisher Scientific provides the tools, reagents, and expertise required for success at each step of the workflow.

Figure 2. Steps for determining the efficiency of an edit using TOPO cloning and Sanger sequencing by CE. 1. Transfect gRNA and Cas9 mRNA into cells. 2. Incubate cells to allow processing of genomic change. 3. Purify genomic DNA from the cell culture, PCR-amplify the engineered locus from the heterogeneous culture, and clone PCR fragments into TOPO vector. 4. Isolate plasmids from single colonies and PCR-amplify the insert. 5. Sequence the insert. The efficiency of the edit is the ratio of the number of inserts with an engineered change to the total number of inserts sequenced. Higher efficiency will likely result in fewer secondary clones that need to be screened to identify specific cells with the change.
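
The efficiency calculation described in Figure 2 can be sketched in a few lines of Python. This is an illustrative assumption rather than the analysis software used in the study: the function name is hypothetical, and any insert that differs from the unedited reference is counted as edited, whereas real data would first need alignment and filtering of sequencing errors.

```python
def edit_efficiency(insert_seqs, reference):
    """Fraction of sequenced inserts that differ from the unedited reference.

    insert_seqs: list of insert sequences, one per picked colony.
    reference:   the unedited (wild-type) amplicon sequence.
    """
    if not insert_seqs:
        raise ValueError("at least one sequenced insert is required")
    ref = reference.upper()
    edited = sum(1 for seq in insert_seqs if seq.upper() != ref)
    return edited / len(insert_seqs)


if __name__ == "__main__":
    wild_type = "ACGTACGT"   # toy reference sequence
    # 96 colonies: 72 unedited inserts and 24 carrying a 1 bp deletion
    colonies = [wild_type] * 72 + ["ACGACGT"] * 24
    print(edit_efficiency(colonies, wild_type))  # 0.25
```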


The development of RNA sequencing technologies

It was not until 1953, when Watson and Crick proposed the double-helix structure, that people truly realized at the molecular level that the essence of life is the result of gene interactions [4]. The continuous development of RNA sequencing has ushered transcriptome analysis into a new era, with higher efficiency and lower cost. The timeline of RNA sequencing technologies is shown in Fig. 1.

The development timeline of RNA sequencing technologies

First-generation sequencing technology is also called Sanger sequencing. In 1977, Sanger introduced the chain termination method and Maxam and Gilbert developed the chemical degradation method [5, 6]. The same year, Sanger determined the 5386 bp genome of phage φX174, the first DNA genome to be sequenced [7]. DNA microarrays have aided significant progress in many fields since they were first introduced. However, microarrays require prior knowledge of gene sequences and cannot detect the expression of novel genes [8]. After the first high-throughput sequencing platform appeared in 2005 [1], multiple next-generation sequencing platforms followed (Table 1, Figs. 2, 3). The accuracy and reproducibility of the different platforms depend on several factors, including the inherent features of the platform and the corresponding analysis pipelines [9, 10]. Pyrosequencing, developed by 454 Life Sciences and no longer supported after 2016, used a “sequencing by synthesis” method [1, 11, 12, 13]. The Ion Torrent sequencing platform is also based on the “sequencing by synthesis” method and outperforms pyrosequencing with respect to sensitivity. SOLiD (Sequencing by Oligonucleotide Ligation and Detection) exhibits high accuracy, as each base is sequenced twice, but its read length is short [11, 12, 13]. DNBS (DNA nanoball sequencing) enables large collections of DNA nanoballs to be sequenced simultaneously. Illumina-based sequencing technology uses a “reversible terminator sequencing” method. High-throughput sequencing, otherwise known as next-generation sequencing (NGS), has the advantages of fast speed, low sequencing cost and high accuracy. Compared to microarrays, it can detect unknown expressed sequences but is time intensive [14].

RNA extraction and template preparation before RNA sequencing. RNA is extracted from tissues and fragmented; the fragmented RNA molecules are converted into cDNA by reverse transcription and then amplified by emulsion PCR or bridge PCR to prepare the sequencing library.

Three kinds of sequencing methods: sequencing by synthesis, sequencing by reversible terminator and sequencing by ligation. Their different mechanisms are shown in detail.

In addition to NGS, there is third-generation sequencing, which allows for long-read sequencing of individual RNA molecules [15]. Single-molecule RNA sequencing enables the generation of full-length cDNA transcripts without clonal amplification or transcript assembly. Thus, third-generation sequencing is free from the shortcomings generated by PCR amplification and read mapping. It can greatly reduce the false positive rate of splice sites and capture the diversity of transcript isoforms [15]. Single-molecule sequencing platforms comprise Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing [16], Helicos single-molecule fluorescent sequencing [17] and Oxford Nanopore Technologies (ONT) nanopore sequencing [18]. Furthermore, RNA-seq recently evolved from bulk sequencing to single-cell sequencing. Single-cell RNA sequencing was first published in 2009 to profile the transcriptome at single-cell resolution [19]. Drop-Seq and InDrop were initially reported in 2015 by analyzing mouse retina cell and embryonic stem cell transcriptomes, identifying novel cell types. Sci-RNA-seq, single-cell combinatorial indexing RNA sequencing, was developed in 2017, and SPLiT-seq (split-pool ligation-based transcriptome sequencing) was first reported in 2018. Both approaches use a combinatorial indexing strategy in which attached RNAs are labeled with barcodes that indicate their cellular origin [20, 21].
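
The logic of combinatorial indexing can be illustrated with a short calculation. The Python sketch below estimates how many barcode combinations a split-pool scheme generates and how often a cell would be expected to share its combination with another cell. The function names, the 96-well plates and the three rounds are illustrative assumptions, not parameters taken from the cited protocols, and uniform random assignment of cells to wells is assumed.

```python
def barcode_space(wells_per_round, rounds):
    """Number of distinct barcode combinations after `rounds` of split-pool labeling."""
    return wells_per_round ** rounds


def expected_collision_fraction(n_cells, n_barcodes):
    """Probability that a given cell shares its barcode combination with at least
    one other cell, assuming each cell draws a combination uniformly at random."""
    return 1.0 - (1.0 - 1.0 / n_barcodes) ** (n_cells - 1)


if __name__ == "__main__":
    space = barcode_space(96, 3)   # three rounds in 96-well plates (assumed)
    print(space)                   # 884736
    print(expected_collision_fraction(10_000, space))
```

Each additional round multiplies the barcode space by the number of wells, so the collision rate can be kept low without ever isolating individual cells physically.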

Although scRNA-seq profiles transcriptomes at single-cell resolution, spatial information may be lost during single-cell isolation. Spatial transcriptomics has emerged to solve this problem. Spatial transcriptomics employs unique positional barcodes to visualize RNA distributions in RNA sequencing of tissue sections and was first published in 2016 [22]. Slide-seq, reported in 2019, uses DNA barcode beads with specific positional information [23]. Geo-seq was introduced in 2017 and integrated scRNA-seq with laser capture microdissection (LCM), which can isolate individual cells [24]. In situ sequencing refers to targeted sequencing of RNA fragments in morphologically preserved tissues or cells without RNA extraction, including in situ cDNA synthesis by padlock probes or stably cross-linked cDNA amplicons in fluorescent in situ RNA sequencing (FISSEQ) and in situ amplification by rolling-circle amplification (RCA) [25, 26]. Furthermore, various new technologies based on RNA-seq have been developed for specific applications. For example, a type of targeted RNA sequencing, CaptureSeq, employs biotinylated oligonucleotide probes and results in the enrichment of certain transcripts to identify gene fusion [27, 28].



Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883–892 (2012).

Sottoriva, A. et al. Intratumor heterogeneity in human glioblastoma reflects cancer evolutionary dynamics. Proc. Natl Acad. Sci. USA 110, 4009–4014 (2013).

Zhang, J. et al. Intratumor heterogeneity in localized lung adenocarcinomas delineated by multiregion sequencing. Science 346, 256–259 (2014).

de Bruin, E. C. et al. Spatial and temporal diversity in genomic instability processes defines lung cancer evolution. Science 346, 251–256 (2014).

Naxerova, K. et al. Hypermutable DNA chronicles the evolution of human colon cancer. Proc. Natl Acad. Sci. USA 111, E1889–E1898 (2014).

Reiter, J. G. et al. Reconstructing metastatic seeding patterns of human cancers. Nat. Commun. 8, 14114 (2017).

Marusyk, A. et al. Non-cell-autonomous driving of tumour growth supports sub-clonal heterogeneity. Nature 514, 54–58 (2014).

Yates, L. R. et al. Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nat. Med. 21, 751–759 (2015).

Ding, L. et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 506–510 (2012).

Sequist, L. V. et al. Genotypic and histological evolution of lung cancers acquiring resistance to EGFR inhibitors. Sci. Transl Med. 3, 75ra26 (2011).

Jamal-Hanjani, M. et al. Tracking the evolution of non-small-cell lung cancer. N. Engl. J. Med. 376, 2109–2121 (2017).

Andor, N. et al. Pan-cancer analysis of the extent and consequences of intratumor heterogeneity. Nat. Med. 22, 105–113 (2016).

Mroz, E. A. et al. High intratumor genetic heterogeneity is related to worse outcome in patients with head and neck squamous cell carcinoma. Cancer 119, 3034–3042 (2013).

Parker, W. T., Ho, M., Scott, H. S., Hughes, T. P. & Branford, S. Poor response to second-line kinase inhibitors in chronic myeloid leukemia patients with multiple low-level mutations, irrespective of their resistance profile. Blood 119, 2234–2238 (2012).

Landau, D. A. et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell 152, 714–726 (2013).

Klco, J. M. et al. Association between mutation clearance after induction therapy and outcomes in acute myeloid leukemia. JAMA 314, 811–822 (2015).

Misale, S. et al. Emergence of KRAS mutations and acquired resistance to anti-EGFR therapy in colorectal cancer. Nature 486, 532–536 (2012).

Stroun, M., Anker, P., Lyautey, J., Lederrey, C. & Maurice, P. A. Isolation and characterization of DNA from the plasma of cancer patients. Eur. J. Cancer Clin. Oncol. 23, 707–712 (1987).

Bettegowda, C. et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci. Transl Med. 6, 224ra24 (2014).

Wan, J. C. M. et al. Liquid biopsies come of age: towards implementation of circulating tumour DNA. Nat. Rev. Cancer 17, 223–238 (2017).

Murtaza, M. et al. Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA. Nature 497, 108–112 (2013).

Garcia-Murillas, I. et al. Mutation tracking in circulating tumor DNA predicts relapse in early breast cancer. Sci. Transl Med. 7, 302ra133 (2015).

Tie, J. et al. Circulating tumor DNA analysis detects minimal residual disease and predicts recurrence in patients with stage II colon cancer. Sci. Transl Med. 8, 346ra92 (2016).

Newman, A. M. et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 20, 548–554 (2014).

Fujii, T. et al. Mutation-enrichment next-generation sequencing for quantitative detection of KRAS mutations in urine cell-free DNA from patients with advanced cancers. Clin. Cancer Res. 23, 3657–3666 (2017).

Wang, Y. et al. Detection of tumor-derived DNA in cerebrospinal fluid of patients with primary tumors of the brain and spinal cord. Proc. Natl Acad. Sci. USA 112, 9704–9709 (2015).

Kinde, I. et al. Evaluation of DNA from the Papanicolaou test to detect ovarian and endometrial cancers. Sci. Transl Med. 5, 167ra4 (2013).

Maritschnegg, E. et al. Lavage of the uterine cavity for molecular detection of Müllerian duct carcinomas: a proof-of-concept study. J. Clin. Oncol. 33, 4293–4300 (2015).

Wang, Y. et al. Detection of somatic mutations and HPV in the saliva and plasma of patients with head and neck squamous cell carcinomas. Sci. Transl Med. 7, 293ra104 (2015).

Sidransky, D. et al. Identification of ras oncogene mutations in the stool of patients with curable colorectal tumors. Science 256, 102–105 (1992).

Aravanis, A. M., Lee, M. & Klausner, R. D. Next-generation sequencing of circulating tumor DNA for early cancer detection. Cell 168, 571–574 (2017).

Armitage, P. & Doll, R. The age distribution of cancer and a multi-stage theory of carcinogenesis. Br. J. Cancer 8, 1–12 (1954).

Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371, 2477–2487 (2014).

Jaiswal, S. et al. Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371, 2488–2498 (2014).

Young, A. L., Challen, G. A., Birmann, B. M. & Druley, T. E. Clonal haematopoiesis harbouring AML-associated mutations is ubiquitous in healthy adults. Nat. Commun. 7, 12484 (2016). A description of the use of a single-strand tag-based error correction technique to identify preneoplastic clones in nearly all adults, which had only 2 years earlier been believed to occur in only a subset of very elderly individuals. It is an important example of how a fundamental biological understanding can change quickly with improved discovery technologies.

Krimmel, J. D. et al. Ultra-deep sequencing detects ovarian cancer cells in peritoneal fluid and reveals somatic TP53 mutations in noncancerous tissues. Proc. Natl Acad. Sci. USA 113, 6005–6010 (2016).

Salk, J. J. et al. Duplex Sequencing detects cancer-associated mutations arising during normal aging: clonal evolution over a century of human lifetime [abstract]. Cancer Res. 77, 3041 (2017).

Jee, J. et al. Rates and mechanisms of bacterial mutagenesis from maximum-depth sequencing. Nature 534, 693–696 (2016).

Maslov, A. Y., Quispe-Tintaya, W., Gorbacheva, T., White, R. R. & Vijg, J. High-throughput sequencing in mutation detection: a new generation of genotoxicity tests? Mutat. Res. 776, 136–143 (2015).

Fielden, M. R. et al.Modernizing human cancer risk assessment of therapeutics. Trends Pharmacol. Sci. https://doi.org/10.1016/j.tips.2017.11.005 (2017).

Kim, D., Kim, S., Kim, S., Park, J. & Kim, J.-S. Genome-wide target specificities of CRISPR-Cas9 nucleases revealed by multiplex Digenome-seq. Genome Res. 26, 406–415 (2016).

Caperton, L. et al. Assisted reproductive technologies do not alter mutation frequency or spectrum. Proc. Natl Acad. Sci. USA 104, 5085–5090 (2007).

Nelson, J. L. The otherness of self: microchimerism in health and disease. Trends Immunol. 33, 421–427 (2012).

Eun, J. K., Guthrie, K. A., Zirpoli, G. & Gadi, V. K. In situ breast cancer and microchimerism. Sci. Rep. 3, 2192 (2013).

Fan, H. C., Blumenfeld, Y. J., Chitkara, U., Hudgins, L. & Quake, S. R. Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc. Natl Acad. Sci. USA 105, 16266–16271 (2008).

Chiu, R. W. K. et al. Non-invasive prenatal assessment of trisomy 21 by multiplexed maternal plasma DNA sequencing: large scale validity study. BMJ 342, c7401 (2011).

Bianchi, D. W. et al. Noninvasive prenatal testing and incidental detection of occult maternal malignancies. JAMA 314, 162–169 (2015).

Jamuar, S. S. & Walsh, C. A. Somatic mutations in cerebral cortical malformations. N. Engl. J. Med. 371, 2038–2038 (2014).

Poduri, A., Evrony, G. D., Cai, X. & Walsh, C. A. Somatic mutation, genomic variation, and neurological disease. Science 341, 1237758–1237758 (2013).

De Vlaminck, I. et al. Circulating cell-free DNA enables noninvasive diagnosis of heart transplant rejection. Sci. Transl Med. 6, 241ra77 (2014).

Shugay, M. et al. Towards error-free profiling of immune repertoires. Nat. Methods 11, 653–655 (2014).

DeWitt, W. S. et al. Dynamics of the cytotoxic T cell response to a model of acute viral infection. J. Virol. 89, 4517–4526 (2015).

Hsu, M. S. et al. TCR sequencing can identify and track glioma-infiltrating T cells after DC vaccination. Cancer Immunol. Res. 4, 412–418 (2016).

Tumeh, P. C. et al. PD-1 blockade induces responses by inhibiting adaptive immune resistance. Nature 515, 568–571 (2014).

Goodnow, C. C. Multistep pathogenesis of autoimmune disease. Cell 130, 25–35 (2007).

Qian, J. et al. B cell super-enhancers and regulatory clusters recruit AID tumorigenic activity. Cell 159, 1524–1537 (2014).

Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).

Lynch, S. V. & Pedersen, O. The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379 (2016).

Van de Wiele, T., Van Praet, J. T., Marzorati, M., Drennan, M. B. & Elewaut, D. How the microbiota shapes rheumatic diseases. Nat. Rev. Rheumatol. 12, 398–411 (2016).

Rosenbaum, M., Knight, R. & Leibel, R. L. The gut microbiota in human energy homeostasis and obesity. Trends Endocrinol. Metab. 26, 493–501 (2015).

Alexander, J. L. et al. Gut microbiota modulation of chemotherapy efficacy and toxicity. Nat. Rev. Gastroenterol. Hepatol. 1805, 105 (2017).

Vindigni, S. M. & Surawicz, C. M. Fecal microbiota transplantation. Gastroenterol. Clin. North Am. 46, 171–185 (2017).

Dominguez-Bello, M. G. et al. Partial restoration of the microbiota of cesarean-born infants via vaginal microbial transfer. Nat. Med. 22, 250–253 (2016).

Roach, D. J. et al. A year of infection in the intensive care unit: prospective whole genome sequencing of bacterial clinical isolates reveals cryptic transmissions and novel microbiota. PLOS Genet. 11, e1005413 (2015).

Cummings, L. A. et al. Clinical next generation sequencing outperforms standard microbiological culture for characterizing polymicrobial samples. Clin. Chem. 62, 1465–1473 (2016).

Grumaz, S. et al. Next-generation sequencing diagnostics of bacteremia in septic patients. Genome Med. 8, 73 (2016).

Kim, S. et al. High-throughput automated microfluidic sample preparation for accurate microbial genomics. Nat. Commun. 8, 13919 (2017).

Acevedo, A., Brodsky, L. & Andino, R. Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature 505, 686–690 (2014).

Eigen, M. The concept of the quasispecies will soon be 50 years old. Introduction. Curr. Top. Microbiol. Immunol. 392, vii (2016).

Henn, M. R. et al. Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection. PLOS Pathog. 8, e1002529 (2012).

Solmone, M. et al. Use of massively parallel ultradeep pyrosequencing to characterize the genetic diversity of hepatitis B virus in drug-resistant and drug-naive patients and to detect minor variants in reverse transcriptase and hepatitis B S antigen. J. Virol. 83, 1718–1726 (2009).

Svarovskaia, E. S., Martin, R., McHutchison, J. G., Miller, M. D. & Mo, H. Abundant drug-resistant NS3 mutants detected by deep sequencing in hepatitis C virus-infected patients undergoing NS3 protease inhibitor monotherapy. J. Clin. Microbiol. 50, 3267–3274 (2012).

Daum, L. T. et al. Next-generation ion torrent sequencing of drug resistance mutations in Mycobacterium tuberculosis strains. J. Clin. Microbiol. 50, 3831–3837 (2012).

Katz, M., Hover, B. & Brady, S. Culture-independent discovery of natural products from soil metagenomes. J. Ind. Microbiol. Biotechnol. 43, 129–141 (2016).

Bassil, N. M., Bryan, N. & Lloyd, J. R. Microbial degradation of isosaccharinic acid at high pH. ISME J. 9, 310–320 (2015).

Yamamoto, S. et al. Environmental DNA metabarcoding reveals local fish communities in a species-rich coastal sea. Sci. Rep. 7, 40368 (2017).

Mayo, B. et al. Impact of next generation sequencing techniques in food microbiology. Curr. Genom. 15, 293–309 (2014).

Jäger, A. C. et al. Developmental validation of the MiSeq FGx Forensic Genomics System for targeted next generation sequencing in forensic DNA casework and database laboratories. Forensic Sci. Int. Genet. 28, 52–70 (2017).

Stiller, M. et al. Patterns of nucleotide misincorporations during enzymatic amplification and direct large-scale sequencing of ancient DNA. Proc. Natl Acad. Sci. USA 103, 13578–13584 (2006).

Avery, O. T., Macleod, C. M. & McCarty, M. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J. Exp. Med. 79, 137–158 (1944).

Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

Mostovoy, Y. et al. A hybrid approach for de novo human genome sequence assembly and phasing. Nat. Methods 13, 587–590 (2016).

Bickhart, D. M. et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat. Genet. 49, 643–650 (2017).

King, D. A. et al. Mosaic structural variation in children with developmental disorders. Hum. Mol. Genet. 24, 2733–2745 (2015).

Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011).

Vitak, S. A. et al. Sequencing thousands of single-cell genomes with combinatorial indexing. Nat. Methods 14, 302–308 (2017).

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

Rosenberg, A. B. et al. Scaling single cell transcriptomics through split pool barcoding. Preprint at bioRxiv https://doi.org/10.1101/105163 (2017).

Ullal, A. V. et al. Cancer cell profiling by barcoding allows multiplexed protein analysis in fine-needle aspirates. Sci. Transl Med. 6, 219ra9 (2014).

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

Sun, W.-J. et al. RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data. Nucleic Acids Res. 44, D259–265 (2016).

Wellcome Collection. Charles Robert Darwin. Photograph by L. Darwin. Wellcome Trust https://wellcomecollection.org/works/s6x9wbsj?page=1&query=darwin (2016).


Results

The SCP and its implementation into the single-cell genotyping workflow

The original SCP principle has been previously described in detail [15]. Fig 1A–1C show the SCP prototype that we used for the present study and the workflow for single-cell isolation and analysis in the 384-well format: First, the cell suspension is pipetted into the disposable cartridge that consists of a milled plastic part and the microfluidic dispenser chip (Fig 1A). Next, the cartridge is mounted on the printhead that comprises the piezo actuator driving the dispenser chip (Fig 1B). A microscopic vision system monitors the nozzle of the dispenser chip and provides the image data for cell detection, classification and isolation (see below). Unwanted droplets are discarded via a vacuum shutter system. The printhead is mounted on a three-axis robotic stage which allows precise deposition of single-cell encapsulating droplets into microwells specified in the SCP software by the operator. Different from previous applications, in the present study, the SCP was used for the isolation of cancer cells and subsequent cell lysis, WGA and molecular genetic analyses (Fig 1D).

(A) The cell suspension is filled into the sterile single-use cartridge. (B) The microwell plate holder is equipped with a camera to automatically determine and adjust for the dispenser offset prior to cell printing (automatic offset compensation, AOC). The dispenser with the mounted cartridge and the cell detection optics are part of the printhead. (C) Total view of the SCP prototype that was used in this study. (D) Illustration of the workflow for single-cell genotyping. Individual cells are isolated via the SCP. After cell lysis, the DNA is subjected to whole genome amplification (WGA), which then can be used for routine molecular genetic analyses.

Evaluation of the precision and efficiency of single-cell deposition

A prerequisite for the genetic analyses is the exact deposition of the single cell in the well, since only then can the cell lysis and WGA be performed reliably, given the small reaction volumes.

Although the precision of the dispenser is sufficiently high to deposit single-cell encapsulating droplets into microwells, we observed that the droplet position within the well can vary. The reason is a variation of the nozzle position due to the cartridge fabrication process and the fixation of the cartridge to the printhead. In order to deposit droplets accurately onto the well bottom in an automated manner we designed a tool to compensate for such an offset by measuring the droplet placement position before and during the cell isolation process. For this, droplets are dispensed on a glass slide that is imaged by a digital camera attached to the microwell plateholder (Fig 2). The droplet position is extracted from the image data and the algorithm automatically calculates the correct dispensing position to target the center of the microwells.

(A) For this, droplets are dispensed on a glass slide that is imaged by a digital camera. (B) The actual droplet position is extracted from the image data by image processing with openCV. (C) displays the binary image after thresholding. The algorithm automatically calculates the correct dispensing position to target the center of the microwell.
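The offset compensation described above can be sketched in a few lines. This is a minimal illustration and not the SCP software: the function names, the assumption that the droplet appears brighter than the background, and the `mm_per_pixel` calibration factor are all hypothetical, and plain Python lists stand in for the openCV image processing mentioned in the caption.

```python
def droplet_centroid(image, threshold):
    """Return the (x, y) centroid, in pixels, of all pixels above `threshold`.

    `image` is a list of rows of grey values. Assumption: the droplet is
    brighter than the glass slide background.
    """
    sum_x, sum_y, count = 0, 0, 0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value > threshold:  # binary image after thresholding
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        raise ValueError("no droplet found above threshold")
    return sum_x / count, sum_y / count


def corrected_position(target_xy, expected_xy, measured_xy, mm_per_pixel):
    """Shift the stage target (mm) by the measured nozzle offset.

    `expected_xy` is where the droplet would land without any offset and
    `measured_xy` is where it actually landed (both in pixels).
    """
    dx_mm = (measured_xy[0] - expected_xy[0]) * mm_per_pixel
    dy_mm = (measured_xy[1] - expected_xy[1]) * mm_per_pixel
    return target_xy[0] - dx_mm, target_xy[1] - dy_mm


# Hypothetical 5 x 5 test image with a 2 x 2 pixel droplet.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
measured = droplet_centroid(image, threshold=5)
new_target = corrected_position((10.0, 20.0), (2.0, 2.0), measured, mm_per_pixel=0.1)
```

Subtracting the measured offset from the nominal target is the simplest possible correction; a real instrument would presumably also average over several droplets and repeat the measurement during a run, as the text indicates.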

In addition, free-flying droplets can be deflected by electrostatic forces, which occur due to the electric charge that accumulates on both the droplet and plate. Thus, we used ionized air to neutralize the electrostatic charging of the microwell plate.

In order to evaluate whether the SCP with the automatic dispense offset compensation and the deionization deposits single droplets with high efficiency and precision, single 10 μm sized green fluorescent latex beads as cell equivalents were printed into the wells of a 384-microwell plate, and the ejection and deposition efficiencies were assessed. The ejection efficiency (i.e. exactly one bead was ejected from the nozzle) was determined from the images automatically stored by the SCP that show the nozzle before, during, and after the dispensation (Fig 3A–3E). The deposition efficiency (i.e. a single bead was successfully delivered to the bottom of the well) was determined from fluorescence microscopy images (Fig 3F).

Four consecutive images are stored automatically for each printing event: (A-C) A cell (or bead as cell equivalent) is transported towards the nozzle of the dispenser-chip, where it is detected and classified within a region of interest (ROI, green area). Only if the object recognition meets predefined criteria in terms of size, roundness and singularity, the droplet ejected from the nozzle will be targeted to the well. (D) A final image confirms the absence of the cell in the nozzle after droplet ejection. The image series can be used to provide direct evidence that truly a single cell was ejected. (E) shows an example for an image where two cells would enter the droplet. Such droplets are automatically discarded by the vacuum suction. To evaluate the precision of the instrument, 2304 single fluorescent beads were printed into six 384-microwell plates. The images were evaluated to determine the ejection efficiency (99.7%). (F) Correctly deposited beads (dashed circle) were visualized by fluorescence microscopy of the well bottoms (1.2 mm in diameter). (G) The beads were correctly delivered in an average of 98.8% of the wells if the microwell plate was electrostatically neutralized before printing.
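The gating decision described in this caption (size, roundness and singularity) might look roughly like the sketch below. The thresholds, the `DetectedObject` type and the use of the common roundness measure 4πA/P² are assumptions made for illustration; the source does not state which criteria values or which roundness definition the SCP uses.

```python
import math
from dataclasses import dataclass


@dataclass
class DetectedObject:
    """One object found in the region of interest (hypothetical type)."""
    area_um2: float
    perimeter_um: float


def roundness(obj):
    """Isoperimetric roundness: 1.0 for a perfect circle, lower otherwise."""
    return 4 * math.pi * obj.area_um2 / obj.perimeter_um ** 2


def should_print(objects_in_roi, min_diameter_um=5.0, max_diameter_um=25.0,
                 min_roundness=0.8):
    """Return True only if exactly one object of acceptable size and shape
    is present. All threshold defaults are illustrative placeholders."""
    if len(objects_in_roi) != 1:  # singularity: empty or multi-cell droplets are discarded
        return False
    obj = objects_in_roi[0]
    diameter_um = 2 * math.sqrt(obj.area_um2 / math.pi)  # equivalent circle diameter
    size_ok = min_diameter_um <= diameter_um <= max_diameter_um
    return size_ok and roundness(obj) >= min_roundness


# A 10 um bead imaged as a circle, and an elongated object of the same area.
bead = DetectedObject(area_um2=math.pi * 25, perimeter_um=math.pi * 10)
debris = DetectedObject(area_um2=math.pi * 25, perimeter_um=60.0)
```

With these placeholder thresholds, `should_print([bead])` is accepted, while `should_print([bead, bead])` and `should_print([debris])` are rejected, which corresponds to droplets being removed by the vacuum shutter.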

Of the fluorescent beads, 1152 each were dispensed into three untreated and three deionized 384-well plates, for a total of 2304 beads. The overall single-bead ejection efficiency was on average 99.7 ± 0.3%. The deposition efficiency depended on whether the plate was deionized. Without prior deionization, a single bead was detected in only 20.7 ± 8.4% of the well bottoms, whereas after deionization 98.8 ± 1.5% of the beads were correctly delivered (Fig 3G).
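The bead counts and percentages above translate into well numbers as follows. This is only a back-of-the-envelope check using the figures reported in the text; the helper name is hypothetical, and the per-plate counts behind the reported averages are not given in the source, so the results are approximate expectations rather than measured counts.

```python
WELLS_PER_PLATE = 384
PLATES_PER_CONDITION = 3

beads_per_condition = WELLS_PER_PLATE * PLATES_PER_CONDITION  # 1152
total_beads = 2 * beads_per_condition                         # 2304


def expected_single_bead_wells(n_wells, efficiency_percent):
    """Approximate number of wells containing a single bead."""
    return round(n_wells * efficiency_percent / 100)


untreated = expected_single_bead_wells(beads_per_condition, 20.7)  # about 238 wells
deionized = expected_single_bead_wells(beads_per_condition, 98.8)  # about 1138 wells
```

In other words, deionization raises the expected number of correctly filled wells from roughly 238 to roughly 1138 out of 1152.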

Following this workflow, a total of 150 single cells of different origins (U-2 OS, n = 40; Kasumi-1, n = 44; AML patient, n = 66) were printed into deionized 384-well plates, resulting in a total single-cell ejection efficiency of 98.7%. A subset of the printed cells of each specimen was subjected to WGA and genotyping (as detailed below).

Whole genome amplification of single-cells

Single cells were subjected to WGA prior to downstream molecular analysis. In order to minimize the risk of contaminating DNA being co-amplified by the WGA, we worked with DNA-free cartridges and plates and reduced the hands-on steps during cell isolation, lysis and amplification. The success of the WGA was assessed by fluorometric quantitation of the DNA, and a PCR on repetitive LINE1 transposons was used to control for the amplification of human DNA in the samples and its absence in the no-template control.

Of the 40 U-2 OS cells that were deposited into the deionized microwells, 25 were subjected to a WGA, resulting in a median DNA yield of 3.8 μg (range, 3.5–5.5 μg) per cell (Fig 4A); the PCR on the LINE1 transposons was positive in all samples (Fig 4B). Of the Kasumi-1 cells, 33 were subjected to WGA, resulting in a median DNA yield of 14.3 μg (range, 8.6–20.8 μg) per cell, and the LINE1 PCR was positive in all cells (S1 Fig). Among the 23 single cells from an AML patient, the WGA resulted in a median DNA yield of 16.3 μg (range, 14.0–19.3 μg) per cell and a positive LINE1 PCR in all cells (S2 Fig).

(A) Bar diagram displaying the WGA DNA yields from the individual U-2 OS cells and the respective controls, as measured by Qubit™. (B) Agarose gel illustrating the differently sized products of the LINE1 multiplex PCR that was performed on the WGA DNA of the individual U-2 OS cells. (C) Exemplary sequencing chromatograms of the SLC34A2 and TET2 gene mutations in the cell bulk and individual cells. (D) Conclusions on the occurrence of allelic dropout (ADO) through sequencing of single nucleotide polymorphisms (SNPs). SNPs rs1391438 and rs7655890 are located in close genomic proximity to the TET2 mutation and show heterozygous patterns in the cell bulk (left). In the single U-2 OS cells B8 and C10, wild-type only is detected at the TET2 mutation site. The heterozygous patterns of the SNPs in B8 suggest true wild-type in TET2, while the detection of only one allele of both SNPs in C10 suggests loss of the genomic region due to ADO. NTC: no-template control, PTC: positive control.

We also examined whether free-floating DNA was present in the droplets generated by the SCP. Such DNA, if amplified by the WGA, would hinder the genetic analyses of the single cells. Therefore, empty droplets (n = 1, 3, and 10, respectively) from a suspension of Kasumi-1 cells were printed into individual wells of a 384-well plate and then subjected to WGA. All empty droplets yielded no product in the subsequent LINE1 PCR, while droplets containing single Kasumi-1 cells were positive (S3 Fig).

Genotyping of single cancer cells

We sought to evaluate the applicability of the SCP for the isolation and genetic analyses of single cancer cells. For this, we studied representative gene variants in U-2 OS, Kasumi-1 and the PBMCs of a patient with AML.

U-2 OS harbors mutations in the SLC34A2 (ENST00000382051: c.1538G>T p.R513L) and TET2 genes (ENST00000380013: c.1394C>T p.P465L) [17], both of which are of functional relevance in cancers [19–21]. We confirmed the mutations in SLC34A2 and TET2 in the bulk sample. In line with published data [17], the chromatograms suggested that the SLC34A2 mutation was homo- or hemizygous and the TET2 mutation heterozygous. From our CNV array data, we concluded that the zygosity of the SLC34A2 mutation was due to the loss of heterozygosity (LOH) of the respective genomic region. In the 25 U-2 OS cells analyzed, the SLC34A2 mutation was detected in 23 and the TET2 mutation in 19 cells (Figs 4C and 5A). In one cell, the SLC34A2 PCR and, in another cell, both the SLC34A2 and TET2 PCR repeatedly failed, which suggests insufficient amplification of the target region by the WGA. As expected from the zygosity in the bulk, no cell with SLC34A2 wild-type sequence was detected. In contrast, TET2 wild-type sequence only was detected in 5 cells. These cells were evaluated for the presence of ADO to allow conclusions regarding the co-occurrence of mutations in the individual cells (see below).

(A) U-2 OS cell line, (B) Kasumi-1 cell line and (C) AML patient. Displayed are the nucleotides identified by sequencing of the bulk specimens and the individual cells (annotated for example B1 or C1). Highlighted in red is the presence and in green the absence of the respective mutated sequence. Highlighted in grey are inconclusive analyses either due to failed PCR (n.d., not determined) or the likely occurrence of allelic dropout (*). For the gene mutation analyses, the clonal architecture concluded from the single-cell analyses is schematically displayed.

Kasumi-1 harbors mutations in the tyrosine kinase KIT (ENST00000288135: c.2466T>A p.N822K) and the tumor suppressor TP53 (ENST00000269305: c.743G>A p.R248Q) [17]. The mutations in KIT and TP53 were confirmed by NGS in a Kasumi-1 bulk sample. As verified by pyrosequencing, the VAF of the KIT mutation was 84.0% at median (range, 83.3–85.2%). The overrepresentation of the mutated allele is due to the amplification of the KIT genomic region [22,23]. The TP53 mutation was present with a VAF of 100% in line with an LOH of the chromosome 17p region in the CNV array. The KIT mutation was detected in 30 and the TP53 mutation in 25 cells (Fig 5B). In the remaining cells, the KIT or TP53 PCR failed, most likely due to an inefficient amplification of the respective regions by the WGA. No cell with KIT or TP53 wild-type sequence only was detected. With regard to the co-occurrence of the mutations, the analyses yielded informative results for both mutations in 23 cells. All these cells harbored both the KIT and TP53 mutation (Fig 5B).

In the PBMCs of a patient with AML, we assessed the potentially pathogenic non-synonymous SNP rs1042522 in TP53 (ENST00000269305: c.215C>G p.P72R) [24,25]. We chose this approach because no C-allele was detectable by Sanger sequencing of the bulk specimen and because, as indicated by FISH and CNV array, the AML harbored one or more clones with loss of a chromosome 17p allele (including TP53; S4 Fig), and chromosomal loss of TP53 in cancers preferentially affects the C-allele of rs1042522 [25]. Thus, we tested whether an ancestral C-allele would still be detectable in a subset of individual cells. Indeed, the TP53 PCR yielded a product in 21 cells, in 5 of which the C-allele was detected (Fig 5C).

Evaluation of allelic dropout and co-occurrence of mutations in U-2 OS

As stated above, in 5 U-2 OS cells only the TET2 wild-type sequence was detected. To evaluate whether the absence of TET2 mutations in these cells was due to ADO, we analyzed the SNPs rs1391438 and rs7655890 (located 4,650 bp and 15,992 bp 5’ from the TET2 mutation, respectively) in these cells; both SNPs were heterozygous in the bulk sample (Fig 4C). In one of the 5 cells, only one of the two alleles of each SNP was detected, which strongly suggests that ADO occurred in the genomic region that includes the SNPs and the TET2 mutation (Fig 4C). Thus, we deemed the analysis of this one cell inconclusive. In the remaining 4 cells, both SNPs were heterozygous, suggesting that the WGA successfully amplified both alleles (Fig 4C); this makes it unlikely that the absence of the TET2 mutation in these cells was due to ADO. Thus, we concluded that these 4 cells indeed lacked the TET2 mutation.

Thus, in terms of co-occurrence, our analyses yielded informative results for both mutation sites in 22 cells. Of these, 18 cells harbored both the SLC34A2 and TET2 mutation while 4 harbored the SLC34A2 but not the TET2 mutation, which indicates clonal heterogeneity with regard to TET2 mutated cells within the U-2 OS cell line (Fig 5A).
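The reasoning used here to separate true wild-type cells from allelic dropout, and the resulting co-occurrence tally, can be summarized in a short sketch. The function names, status labels and allele placeholders are hypothetical, and treating a cell with one heterozygous and one single-allele SNP as "possible ADO" is a conservative assumption, since that case is not described in the text. The per-cell list is reconstructed from the reported counts, not taken from the raw data.

```python
from collections import Counter


def classify_wildtype_cell(snp_alleles_seen):
    """Classify a cell that shows only wild-type sequence at the mutation site.

    `snp_alleles_seen` holds, for each flanking SNP that is heterozygous in
    the bulk, the set of alleles detected in this single cell.
    """
    if not snp_alleles_seen:
        return "inconclusive"
    if all(len(alleles) == 2 for alleles in snp_alleles_seen):
        return "true wild-type"  # both alleles amplified, so ADO is unlikely
    return "possible ADO"        # at least one allele missing nearby


def tally_cooccurrence(cells):
    """Count (SLC34A2, TET2) genotype pairs among informative cells only."""
    informative = [c for c in cells if set(c) <= {"mut", "wt"}]
    return Counter(informative)


# Cells B8 and C10 from the figure caption, with placeholder allele names.
b8 = classify_wildtype_cell([{"allele1", "allele2"}, {"allele1", "allele2"}])
c10 = classify_wildtype_cell([{"allele1"}, {"allele2"}])

# 25 U-2 OS cells, reconstructed from the counts reported in the text.
cells = (
    [("mut", "mut")] * 18      # both mutations detected
    + [("mut", "wt")] * 4      # TET2 wild-type confirmed by heterozygous SNPs
    + [("mut", "ado")] * 1     # TET2 call inconclusive due to likely ADO
    + [("failed", "mut")] * 1  # SLC34A2 PCR failed
    + [("failed", "failed")] * 1
)
tally = tally_cooccurrence(cells)
```

Here `b8` evaluates to "true wild-type" and `c10` to "possible ADO", and the tally contains 22 informative cells, 18 with both mutations and 4 with the SLC34A2 mutation only, matching the numbers given above.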

In Kasumi-1, no evaluation of ADO was necessary since no cell with KIT or TP53 wild-type sequence only was detected.


Troubleshooting Your Data

The two most common causes for failure to get good or any sequence data for your samples are purity and concentration of your template DNA. If you are having trouble getting good sequencing results for your samples, you may first want to look through our Sequencing Basics section for some recommendations on template preparation and quantitation. If it appears that you have done everything correctly and followed our suggestions, then look below for some additional reasons why you might obtain less than optimal DNA sequence data quality.

No Sequence Data

Cause: priming site not present
Solutions: If you’ve chosen one of the sequencing facility’s vector primers, make sure its priming site is present in your vector. While many of the primers we provide are common to many different vectors (e.g. T7, M13-48R), others are specific to one particular type (e.g. GL primer 1 can be used with the pGL2 vector but not the pGL3 vector). Double-check your plasmid maps/sequences.

- If you’ve designed your own custom primer from previous sequence data, make sure you were using a reliable area of sequence - look for sharp, well-defined peaks with no ambiguity. Avoid areas where the peaks are broader and not well separated - this will occur towards the end of the sequence where the fragments are larger and the polymer cannot adequately resolve single nucleotides, causing inaccurate basecalling.

Cause: Not enough or no DNA/primer in tube
Solutions: Doublecheck your quantitations, stock concentrations and dilutions. Check our What kinds of DNA can we sequence and how much do we need? section to make sure you’ve provided the appropriate amount of DNA and/or primer. While our sequencers are very sensitive and can detect a range of DNA concentrations, there is still a "threshold" amount that must be reached to obtain any sequence data.

Cause: Inhibitory contaminant
Solutions: The cycle sequencing reaction used to amplify samples for automated sequencing is very sensitive to the presence of certain contaminants, some of which will completely inhibit our sequencing enzyme. Please check the Contaminant chart in the Template preparation and purification section for a list of potential inhibitors and the amounts that are tolerable. You may need to reprep your sample to sufficiently remove one or more inhibitory components to obtain any sequence data.

Cause: Expired reagents
Solutions: Check the expiration dates of your reagents and replace any that have expired.

Noisy Data with Weak Signal

"Noisy" data can be identified by the presence of multiple peaks and numerous "N"s within your sequence. The Sequencing Analysis program assigns an "N" as a base identification when there are two or more peaks present at one position. This "N" may signify the legitimate occurrence of two nucleotides, as in the case of a heterozygote, but may also be seen when background noise is high or when multiple products are present. When your sample exhibits weak signal, the software attempts to compensate by boosting the signal of sample bands to detectable levels. However, the background noise will also be artificially amplified, giving a poor signal-to-noise ratio. Background noise appears as many smaller, undefined peaks under your sequence peaks of interest. This noise is always present, but with well-prepared samples of good signal strength, it will be undetectable. To determine if your noisy data may be due to weak signal, look at your ABI trace file. If you are looking at a paper chromatogram, look towards the top and middle of your trace for a line that says "Signal". If the file is on your computer, click the "A" radio button in the bottom left-hand corner, which is visible when you have opened the trace file within a viewing program, such as EditView or Chromas. Scroll down to the line that says "Signal" and you will see the four nucleotides followed by numbers in parentheses. These numbers represent the average signal strength of each nucleotide and their values should, optimally, be between 200 and 400. If they are much less than 100, then you can assume your noisy data is at least partially due to weak signal.
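As a rough illustration of the signal-strength check described above, the following Python sketch classifies the four average signal values once you have read them off the "Signal" line. The function name, the category labels, and the hard cut-off at 100 are assumptions made for this example (the text only says "much less than 100"); it does not parse trace files.

```python
def assess_signal(avg_signal, weak_below=100, optimal=(200, 400)):
    """Classify average per-base signal strengths (illustrative thresholds).

    avg_signal: dict mapping each base letter to its average signal value,
        typed in by hand from the "Signal" line of the trace.
    """
    values = list(avg_signal.values())
    if min(values) < weak_below:
        # Weak signal: noise gets amplified along with the real peaks.
        return "weak"
    low, high = optimal
    if all(low <= v <= high for v in values):
        return "optimal"
    # Above the weak cut-off but outside the 200-400 window.
    return "suboptimal"


# Hypothetical values for two samples.
print(assess_signal({"G": 312, "A": 280, "T": 265, "C": 301}))
print(assess_signal({"G": 45, "A": 38, "T": 52, "C": 41}))
```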

Cause: Not enough DNA
Solutions: Doublecheck your quantitations, stock concentrations, calculations and dilutions. Check our What kinds of DNA can we sequence and how much do we need? section to make sure you’ve provided the appropriate amount of DNA and/or primer.

Cause: Inhibitory contaminant (e.g. salts, phenol)
Solutions: The cycle sequencing reaction used to amplify samples for automated sequencing is very sensitive to the presence of certain contaminants, some of which can partially or completely inhibit our sequencing enzyme. Please check the Contaminant chart in the Template preparation and purification section for a list of potential inhibitors and the amounts that are tolerable. You may need to re-purify your sample to sufficiently remove one or more inhibitory components to obtain better sequence data.

Cause: Degraded DNA from nucleases, repeated freeze-thaw, excessive UV light exposure, bisulfite treatment.
Solutions: Nuclease contamination in a template preparation as well as repeated freeze-thaw cycles can degrade DNA over time. Even low amounts of nucleases can extensively degrade DNA depending on storage conditions and temperatures, as well as the length of time the DNA is stored. Generally, re-isolation and purification of the template DNA will be necessary to obtain good DNA sequence. When extracting PCR products from a gel, prolonged exposure to UV light will degrade and nick the DNA. Limit the time and UV intensity as much as possible to prevent degradation. When treating DNA with bisulfite for methylation experiments, it is important to avoid long incubations at higher temperatures as substantial amounts of DNA will be degraded in this process.

Cause: Trend in worsening data?
Solutions: If you have previously been able to obtain good sequence data but begin to see a deterioration in quality that gets progressively worse, you may have some contamination in one or more reagents, or have some reagents that have reached the end of their usefulness. Make up fresh stocks of commonly used reagents, such as buffers, and always use high quality distilled water in your preparations.

Cause: Inefficient primer binding (low Tm, degenerate primers, mismatch)
Solutions: the Tm of a primer is defined as the temperature at which 50% of the oligonucleotide and its perfect complement are in duplex. The Tm of an oligo can be roughly calculated by using the formula:

This is the most commonly used formula for calculating Tm, though it is not the most accurate as it does not factor in salt or formamide concentrations. A good website to check out if you are interested in some detailed theory behind Tm calculations is http://www.sigma-genosys.com/oligo_meltingtemp.asp.

In our cycle sequencing reaction, our primer/template annealing step occurs at 50ºC. Thus, if your primer Tm is much lower than 50ºC, hybridization to its complementary template will be much less efficient and a lesser number of extending fragments will be generated. Increase your primer Tm by adding additional bases to the 5’ or 3’ end to raise the Tm to be within the range of 52ºC-58ºC. Degenerate primers and those with mismatched bases will also show decreased hybridization efficiency due to reduction of the stability of primer binding, and if degeneracy or mismatches occur at or near the 3’ end of your primer, it is highly likely that your sequencing attempt will fail.
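The formula referred to above is not reproduced in this text. A rule of thumb widely used for short oligos is the Wallace rule, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), which likewise ignores salt and formamide concentrations. The Python sketch below assumes that this is the formula meant. The function names and the example 17-mer are illustrative, and the 52-58 °C window is the range suggested above.

```python
def wallace_tm(primer):
    """Rough Tm in degrees C by the Wallace rule: 2*(A+T) + 4*(G+C).

    Assumption: this is the 'commonly used formula' the text refers to.
    It ignores salt and formamide and is only a rough guide for short oligos.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def check_primer(primer, low=52, high=58):
    """Compare the estimated Tm with the suggested 52-58 degrees C window."""
    tm = wallace_tm(primer)
    if tm < low:
        return tm, "too low: consider adding bases to the primer"
    if tm > high:
        return tm, "above the suggested range"
    return tm, "within the suggested range"


# Example 17-mer (8 A/T and 9 G/C bases).
print(check_primer("GTAAAACGACGGCCAGT"))
```

For primers much longer than about 20 bases, or when salt concentration matters, a nearest-neighbor calculation is generally preferred over this rule of thumb.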

Multiple Peaks Within Your Sequence

The presence of multiple peaks within your sequence can be caused by numerous factors. To help determine the cause, it can be useful to look at two aspects - where the multiple peaks begin, as well as the overall signal strength of your sample. As mentioned above in the Noisy data with weak signal section, samples with low signal strength can have artificially high background noise that can give the appearance of multiple peaks. However, if your average signal strength numbers are above 100 or so, it’s probably unlikely that background interference is your exclusive problem. We’ve broken down this section into two parts, based on where your multiple peaks begin.

From the beginning

Cause: Multiple priming sites involving vectors
Solution: Your primer may have a secondary hybridization site that may be identical or closely related, with different nucleotide sequences following each site, giving superimposed bands within your sequence. If the priming sites are identical, (such as when more than one T7 promoter site is present, for example), the double peaks will be strong from the outset. The fragments may also show shifted migration so that the double peaks are not directly on top of one another but will be offset to one side or the other due to the differing mobility patterns of the strands with dissimilar nucleotide composition. In other instances, a secondary priming site may not be exactly the same, but may differ by a few internal bases. In this case, the mismatched primer may not hybridize as efficiently but can still anneal and extend, and give rise to less intense fragments that can be seen underneath your peaks of interest. In both cases, it’s necessary to screen both your vector and insert carefully to look for sequences that may match or be similar to your proposed primer. You may need to choose another vector primer on the same end of the multiple cloning site or redesign your custom primer. When choosing another primer is difficult, such as when primer walking through a repetitive area, try to find a primer that has a 3’-base match specific to your area of interest which can help act as an "anchor".

Cause: Multiple priming sites in PCR
Solution: This may occur when one or both of the PCR primers hybridizes to more than one position on the template DNA, giving rise to multiple PCR products. Often this will be obvious when visualizing the PCR products on an agarose gel as there will be more than one band present. In this case, gel purification of the desired product will be necessary. One can run into difficulty, however, when the products are very similar in size, which may arise when amplifying related or repetitive DNA, and do not separate well on the gel. In this case, optimization of the PCR reaction may be necessary or redesign of the PCR primers in order to choose a more specific priming site.

Cause: PCR primers acting as both forward and reverse
Solution: Sometimes, a PCR product may be generated when one primer functions as both the forward and reverse primer in the PCR reaction, giving rise to an artifactual product. This is fairly easy to detect when sequencing the PCR product as one primer will give double peaks from the start, while the other fails to give any sequence data. Redesign your set of PCR primers.

Cause: Residual PCR primers and/or dNTPs
Solution: As two primers are present in the PCR reaction, incomplete removal of these primers can lead to double peaks within the sequencing data. Both primers will act as sequencing primers and lead to superimposed bands which correspond to the complementary strands from opposite orientations. It is critical to remove excess primers and dNTPs from the PCR reaction by purification (look at our Template preparation and purification section for our recommendations on PCR purification). If attempting to do direct sequencing of PCR products without purification by diluting an aliquot of your PCR product with water to lower the concentration of residual primers and dNTPs (a method which we do not recommend), it is imperative to optimize your PCR reaction so that primers and dNTPs are present in limiting amounts and are mostly used up by the end of the PCR.

Cause: Primers with high Tm
Solution: Primers that have a Tm much higher (>65ºC) than our suggested 52ºC-58ºC often do not function well as sequencing primers. When primers have a Tm that high, it is often a result of increased G-C content or because the primer is quite long, both factors that can increase the potential for primer secondary structure formation. If possible, choose another primer with a lower Tm. If that is not optimal, let us know and we can perform a two-step cycle sequencing method that eliminates the lowest temperature 50ºC annealing step and proceeds from the 96ºC denaturing step to the 60ºC extension step. The 60ºC step, in this case, will function as both the annealing and extension step. This can sometimes improve sequencing results.

Cause: Primers with n-1 population

Solution: This problem is not uncommon and can result from poor quality synthesis of sequencing primers. Primers are synthesized from the 3’end to the 5’end and when synthesis is inefficient, there can be a significant population of less than full-length primers - n-1s, which are full-length primers minus one base, plus other shorter derivatives. These primers have a common 3’end but different 5’ends, thus chains that terminate at the same position will have different lengths and will run at different positions on the gel. Primers that have degraded from the 3’end will also give this appearance. It is easy to spot this problem within the sequencing chromatogram as each position will contain the true peak as well as the peak immediately to the right of it, giving the appearance of "shadow" peaks. Whatever the cause of the n-1s, it will be necessary to resynthesize the primer to obtain an oligo of suitable quality for sequencing. When high-quality reagents and proper protocols are utilized during oligo synthesis, cartridge or HPLC purification of the primers is usually not necessary for typical oligos (<30 bp), but sometimes additional purification can be beneficial.

Begin farther into the sequence

Cause: Mixed plasmid prep
Solution: A plasmid prep that is contaminated by more than one product, such as two vectors with different inserts or vector with insert and vector without, will generally show an early section of clean sequence data (common vector multiple cloning site sequence) followed by double peaks. Occasionally, a plasmid may contain more than one vector molecule or may encounter spontaneous deletions or insertions during growth. The point at which the double peaks begin corresponds to the start of the insert cloning site. To avoid this problem, it’s important to carefully pick a single colony from your growth plate, restreaking if necessary, to be sure that your colony is completely clonal. You should follow this up with a restriction digest of your plasmid run out on an agarose gel to ensure vector and insert are present as expected.

Cause: Homopolymeric regions

Solution: Regions that contain long stretches of a single nucleotide can be difficult to sequence through accurately. Short stretches of homopolymeric regions are generally not difficult to get through, but longer sections can be challenging. Sequence data up to and including the polynucleotide region may be fine, but the last base of the poly region and all peaks following it may show a wave-like, stuttering pattern of double peaks that cannot be interpreted. This tends to be more problematic in PCR products, but can also occur when sequencing plasmids, especially when trying to sequence the polyA region of cDNA. This difficulty is thought to arise due to enzyme "slippage" when the growing strand does not stay paired correctly with the template DNA during polymerization through the homopolymer region, thus giving rise to fragments of varying lengths that have the same sequence after this area. When sequencing cloned DNA with a homopolymer region, several options can be tried. In our BigDye Terminator sequencing chemistry, dTTP has been replaced with dUTP, which lowers the melting temperature of DNA. However, it also has the effect of increasing the predisposition of slippage to occur through polyT regions (where A is in the template strand). An alternative sequencing chemistry can be used - dRhodamine chemistry - where dTTP is still in the reaction, and generally gives better results through these polyA regions. Alternatively, an oligo dT(12-15) primer that contains a wobble base (A, G or C) on the 3’ end can be used to anchor the primer in place at the end of the polyA region and give clean sequence following. Sequencing the opposite strand can sometimes be more successful, especially when going through a polyG region as the polyC strand is often easier to get through. 
Sometimes designing a new primer that is closer to the homopolymeric region can help, as nucleotide concentration and enzyme activity will be in a more optimal range when extending the smaller fragments in the cycle sequencing reaction. And lastly, we can try adjusting our cycle sequencing conditions as higher annealing temperatures and longer extension times can sometimes be useful in cases like this. Similar approaches can be used when trying to sequence PCR products with homopolymeric regions, but, in the end, it may sometimes be necessary to clone the PCR product in order to read through the repetitive stretch.

Cause: Compression
Solution: Compressions can sometimes be observed when a region of secondary structure forms in the amplified strand of DNA, leading to an alteration in the electrophoretic mobility of the DNA strand. This can appear as overlapping fragments after a certain point and can resemble a contaminated plasmid prep, but the contaminated prep will show double peaks beginning at the insertion site. To relax this compression, we can sometimes alter cycle sequencing conditions or use additives to denature the secondary structure. Alternatively, you can linearize your DNA or use 7-deaza-dGTP in a PCR reaction to help relieve the compression.

Cause: Frame shift mutation
Solution: A frame shift mutation occurs when one or more bases are inserted into or deleted from the template DNA. If both the shifted and unshifted molecules are present in your sample, whether it be plasmid DNA or PCR product, you will see clean sequence up to the point of the mutation, followed by double peaks caused by the shift in the nucleotide sequence. In the case of plasmid DNA, it will be necessary to re-isolate your DNA to get a pure clone containing only one of the molecules. With PCR products, you will need to gel purify the two products in order to separate them.

Truncated Sequences

Truncated sequences can be characterized as abrupt or gradual. Abrupt truncations will show strong, clean signal up to a point and then drop sharply over the course of a few nucleotides to much weaker or no detectable signal. Gradual truncations will show good sequence data initially but then begin to taper off to progressively weaker, smaller peaks until there is nothing but background noise. The nature of the truncation can sometimes help to determine its cause.

Cause: Secondary structure


Cause: linearized DNA
Solution: If your DNA has been cut with one or more restriction enzymes, the sequence data will sharply end at the recognition site of the enzyme that cut at the 3’ end of your insert. Did you accidentally send us digested DNA? Run it out on a gel to see.

Cause: Too much DNA


Solution: While there is a range of DNA concentrations we can sequence reliably, too much DNA will cause premature termination of signal. Overloading of DNA will exhibit early top-heavy peaks followed by rapidly weakening peak height and strength. This occurs because the dNTPs in the cycle sequencing reaction will be distributed among too many extending chains and will be depleted early on, resulting in an excessive amount of short fragments. Overloading is of special concern when sequencing on our 3100 capillary system, as it is much more sensitive to DNA concentration and less tolerant of DNA overloading; severe overloading will also reduce the lifespan of our (very expensive) capillary arrays. In addition, if your template is impure, higher concentrations of DNA can be accompanied by higher amounts of contaminants that can further worsen your DNA sequence quality. So please quantitate your template DNA carefully and check our Methods for quantitation section for our recommendations.

Cause: Salts
Solution: Excessive amounts of salts will also give rise to premature termination and may look similar to DNA overloading, with strong signal followed by progressively weakening signal. Salts have an inhibitory effect on the processivity of the sequencing Taq polymerase, which can lead to an overabundance of short fragments, or if the salt concentration is too great, the enzyme will be completely inhibited with no sequence data obtained. If salts are potentially a problem, perform an ethanol precipitation for salt removal.

Cause: Repetitive regions

Solution: The nucleotide composition, as well as the size, of a repetitive region can play a large role in the success of sequencing through such an area. In general, G-C and G-T (often seen in bisulfite-treated DNA) repeats tend to be the most troublesome, though, as mentioned before, the newest version of Applied Biosystems BigDye Terminator v3.1 contains some modifications that have allowed for some striking improvements in certain previously difficult templates. However, there are still some that remain a pain. In general, one can sequence partially through the repetitive region before the signal begins to fade and eventually becomes unreadable. This may be due to premature dNTP depletion, secondary structure formation or enzyme slippage. Various methods can be tried to sequence the repeat entirely, and many are similar to those we would use for G-C rich templates that form secondary structures, including the addition of betaine or DMSO and/or alterations in cycle sequencing parameters. If the repeat region is not excessively large, sequencing from the opposite strand to complete the region can be successful, especially if the complementary strand has a nucleotide composition that is more efficiently extended. However, if the region is large, it may be difficult to complete its entire sequence and determine the exact number of repeats present. Alternative methods, such as directed deletions or the use of an in vitro transposon system, may need to be utilized.

Sequencing Artifacts

There is definitely a link between template preparation and the degree to which these artifacts can be a problem, so the cleaner and more accurately quantified your sample is, the less of a problem these artifacts will be. Clean samples with strong signal are generally not affected by these artifacts, and if they are, many times the true peaks can be identified and corrected. We do visually inspect every chromatogram and edit what we can. But when we do spot an artifact that was probably due to an instrument or post-cycle sequencing cleanup issue, and we can’t be sure of the correct basecalling, and your sample had good signal strength and no other obvious problems, we will repeat the sample, UPON REQUEST, at no charge. We ask that you inspect your chromatogram as well and if there is a sequencing artifact that causes difficulty with your analysis, and meets the above criteria, please let us know right away and we will rerun the sample as soon as we can. We store reacted samples for several days (DNA and primers for 2-3 months) but then discard them, so please let us know as quickly as you can if you wish a repeat.

Artifact: "Dye blobs"

Solution: Dye blobs are unincorporated dye terminator molecules that have passed through the cleanup columns and remain in solution with the purified DNA loaded onto the sequencers. They are most often seen with samples that have low signal strength. Samples with weak signal usually either 1) did not have enough DNA, so there was less starting template to amplify and label, thus leaving behind a greater proportion of unincorporated dye molecules, or 2) contained contaminants that inhibited the sequencing reaction; it’s theorized that certain contaminants may have a predisposition to bind to these dye clumps. We have also noticed a pattern where certain customer samples, as a whole, are more likely to contain dye blobs regardless of signal strength. In general, dye blobs appear as broad, undefined peaks of one or two colors (usually red and blue in 3100 data) with the true DNA peaks underneath, and tend to occur relatively early in the data - generally before 50-60 bp - so for many, they aren’t much of a problem as that is still vector sequence. Repetition of samples with dye blobs is generally not too successful, as they don’t often go away but sometimes do become less intense. With very weak samples, oftentimes there’s not much we can do to fix the data. With samples of average signal strength, however, they are usually easily correctable as the true peaks are often visible beneath.

Artifact: "Spikes"

Solution: "Spikes" are seen as multicolored peaks within the sequence that usually obscure just one or two nucleotides’ worth of data, and occur in samples run on capillary-based sequencers such as the 3100. They are caused by tiny air bubbles within the liquid polymer or by small pieces of dried polymer that have flaked off and entered a capillary. Again, there seems to be a slight predisposition for some customer samples to experience these artifacts and, when they do occur, they are much more pronounced in samples with weak signal. When a sample has strong signal, they are often not detectable, but there are times when they can be very visible. The good thing is that these are almost always correctable upon rerunning. So, please let us know if you want a repeat because of a spike - for those of you only interested in a small separate region that is not affected by something like this, there would be no need for rerunning, but for those who are looking at an entire reading frame, for example, we realize that this would be a problem. So, as we can’t know everybody’s experiments and regions of interest, we ask that you help us and let us know when this problem affects your analyses and we will quickly repeat it for you.

Artifact: Loss of resolution

Solution: Samples that exhibit loss of resolution (or LOR) will initially show sharp, well-defined peaks that, after a time, begin to widen and progressively deteriorate in quality to a point where these broad peaks become unreadable. The occurrence of this problem may sometimes be instrument-related and at other times may be sample-related. When the cause is instrument-related, it is generally due to improper capillary filling when fresh polymer is being pumped through the array. By modifying a parameter in the software that led to slower, more complete filling of the capillaries, we have been able to dramatically reduce the incidence of LOR. When the problem is sample-related, it is thought to be due to a (currently) unknown contaminant. It’s been speculated that this contaminant is more likely to be present when columns available in some commercial miniprep kits are heavily overloaded. As it’s usually difficult to determine the exact cause of sporadic LOR, we automatically rerun samples that exhibit LOR and, almost always, the sample will run fine the second time.

Homopolymeric Regions

- See discussion about homopolymers above under Multiple peaks within your sequences, homopolymeric regions.

Missing/Extra Bases

When you are analyzing your sequence data and it appears that one or more bases have been inserted or deleted, the first thing you should always do is visually inspect your chromatogram. Oftentimes, the ambiguity is due to incorrect software analysis of the peaks and will usually occur either early in the sequence or much later when resolution of large fragments is less than optimal. Analysis issues are usually easily correctable. Read over our Interpreting your chromatograms section first for a basic overview on how to quickly evaluate a trace and get a sense of its overall quality. Then look below for some more specific examples of software issues, as well as some less common situations where nucleotides may really be missing or inserted.

Cause: Basecalling difficulties
Solution: As mentioned above, our Sequencing Analysis software is not always 100% accurate when assigning base designations, and these errors will most often occur either very early in the sequence (the first 50 bp or so) or much later in the sequence, for different reasons. Early in the sequence, the smallest fragments show some minor variability in migration that can throw off the calculations that the software uses to adjust proper peak spacing, with certain nucleotide combinations being more susceptible than others. As mentioned before, we visually inspect all trace files and we will manually edit the alterations that we spot, and most are very easy to fix. Our manual edits will appear as lowercase letters within the sequence. However, we always STRONGLY suggest YOU also look over your chromatogram files to double-check the sequence data. What we look for, early on, are peaks that migrate very close together with a spacing gap that may occur on either side. In cases like this, the computer may insert an "N" to compensate for the odd peak spacing, though there is really no nucleotide there. Alternatively, when there is a spacing gap like this, the software may interpret a small rise in background noise as a peak and mistakenly insert an extra base where there shouldn’t be one. Sometimes when the peaks are very close together, there is also a tendency for the software to miss one of them entirely and leave out one nucleotide of the sequence. And lastly, some of the excess unincorporated dye molecules will migrate at the same position as the very first bases of your sequence, sometimes obscuring the first 10-20 bases. In addition, the smallest fragments are not always very sharply resolved, a problem that seems to be more pronounced on capillary-based sequencers. 
For all the reasons mentioned above, it’s always best to choose a primer that is at least 40-50 bases away from your sequence of interest so that you can be sure you are in a region of the highest accuracy. Certain nucleotide combinations are more likely to show odd migration in the first 40-50 bp and will sometimes jumble together or not separate well. Doublecheck sequence data where C’s are followed by A’s, where A’s are followed by G’s and where there are two or three A’s in a row.

You may also find that occurrences of extra or missing bases become more frequent towards the end of the sequence. There is a limit to the resolving power of the polymer to separate out the largest fragments, so while the signal intensity of the later bands may still be quite strong, the peaks will become broader and less sharp. The latest version of Sequencing Analysis software includes improved algorithms that allow for better interpretation of the spacing of these larger fragments, and has shown increased accuracy of the basecalling farther out, but there is still a limit. With the use of our modified POP-7 protocols and a 50cm array on our 3100s, we often get 900-950 bases, and often more, of >98% accuracy on well-prepared templates.

Cause: Site-directed mutagenesis primers
Solution: When making a DNA primer, synthesis of the oligo proceeds from the 3’ end to the 5’ end. During the synthesis procedure, truncations may occur when a specific base fails to be added to the growing oligo chain. The DNA synthesis chemistry generally allows for these failure sequences to be capped and not extended any further, but this process is not always 100% efficient and some of these truncations will continue to elongate, with the internal base deleted. As a result, a completed DNA synthesis will contain not only the desired full-length product, but potentially a population of all possible internal single-base deletion sequences as well. Purification of the primer will remove the majority of these failure sequences but a small proportion of these truncations may still remain. So, when using a synthetic primer for site-directed mutagenesis, the potential is there for picking a clone that contains an oligo that is one of these deletion products and not your full-length primer. If this should occur, you should try picking a few other clones for sequencing, and often you’ll find one that does contain the desired mutagenesis primer. If you find your clones consistently contain the same deletion in your primer region, an error may have occurred when programming the primer sequence and you should contact the synthesis company for a new primer. For site-directed mutagenesis primers, it’s always advisable to have your oligo purified by HPLC to minimize the population of failure sequences.


Affiliations

Department of Animal Molecular Biology, National Research Institute of Animal Production, Krakowska 1, Balice, 32-083, Kraków, Poland

Klaudia Pawlina-Tyszko, Ewelina Semik-Gurgul, Artur Gurgul, Maria Oczkowicz & Tomasz Szmatoła

Center for Experimental and Innovative Medicine, The University of Agriculture in Kraków, Rędzina 1c, 30-248, Kraków, Poland

Artur Gurgul & Tomasz Szmatoła

Department of Animal Reproduction, Anatomy and Genomics, The University of Agriculture in Kraków, al. Mickiewicza 24/28, 30-059, Kraków, Poland


Results

Performance of NGS on DNA Samples from Fresh Frozen and Formalin Fixed Material

Sequence runs containing only FF samples resulted in significantly more usable reads (p = 0.0009), defined as reads that passed quality filters (Fig 2A), although the absolute difference in usable reads was only 7.1%. Analysis of library statistics showed a significantly higher percentage of on-target reads (p = 0.002) for FF samples compared to FFPE samples (Fig 2B), where the number of samples containing a low percentage of on-target reads was limited. Moreover, the samples with a low percentage of on-target reads, and thus a low coverage, could easily be identified: in total, 7.7% of the samples were excluded due to a mean coverage <800x, of which 71% also showed <80% on-target reads. Of these excluded samples, 98% were FFPE samples. The remaining quality parameters, including the number of mapped reads, did not show differences. Furthermore, all targeted regions could be covered adequately: none of the amplicons showed an average mean coverage below 100x, which would have led to exclusion from analysis, and 87% of the amplicons for FFPE samples and 94% of the amplicons for FF samples were covered >800x on average (S2 Fig). During the 1.5-year intake period of this study, the sequence runs performed at a stable level (S3 Fig), with only a slight decrease in the percentage of usable reads and an increase in the percentage of low quality ISPs from the moment FFPE samples were included, halfway through this time period.

A) Boxplot of run statistics of FFPE (green) and FF (orange) samples for 4 variables: 1. the percentage of ISP (Ion Sphere Particle) density (the addressable wells on the chip which have detectable loading); 2. usable reads as a percentage of the total number of reads (percentage of ISPs that pass the polyclonal, low quality, and primer dimer filters); 3. polyclonals, ISPs that contain more than one template sequence per ISP; and 4. low quality, ISPs with a low or unrecognizable signal. The lower and upper “hinges” of the boxplots correspond to the first and third quartiles (the 25th and 75th percentiles). The upper “whisker” extends from the hinge to the highest value that is within 1.5*IQR of the hinge, where IQR is the inter-quartile range (the distance between the first and third quartiles). The lower “whisker” extends from the hinge to the lowest value within 1.5*IQR of the hinge. Data beyond the ends of the whiskers are outliers and are plotted as points. B) Library statistics of FFPE (green) and FF (orange) samples: the mean target base read depth (including non-covered target bases), the number of reads mapped to the full reference genome, and the percentage of mapped reads that are aligned to the target region. Significant differences between FFPE and FF samples, calculated by means of an independent t-test, are depicted with ** (p = 0.002) or *** (p = 0.0009).

In summary, a good quality sample could be recognized by a mean coverage of at least 800x and >80% on target.
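As an illustration of the rule above, a minimal sketch of such a quality filter might look as follows. The two thresholds are taken from the text; the function name, parameter names and sample values are hypothetical and are not part of the study's pipeline.

```python
# Thresholds quoted in the text; all names below are hypothetical.
MIN_MEAN_COVERAGE = 800    # mean coverage of at least 800x
MIN_ON_TARGET_PCT = 80.0   # more than 80% of mapped reads on target


def passes_sample_qc(mean_coverage: float, on_target_pct: float) -> bool:
    """Return True when a sample meets both quality criteria."""
    return mean_coverage >= MIN_MEAN_COVERAGE and on_target_pct > MIN_ON_TARGET_PCT


# Made-up example values: (mean coverage, percentage of on-target reads).
samples = {
    "sample_A": (1500.0, 95.2),
    "sample_B": (620.0, 71.0),
}
for name, (coverage, on_target) in samples.items():
    status = "pass" if passes_sample_qc(coverage, on_target) else "exclude"
    print(name, status)
```

In practice the two values would come from the run's coverage report rather than from a hand-written dictionary.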

When the coverage of all amplicons in the Ampliseq Cancer Hotspot Panel v2 was compared between FFPE and FF samples, the longer amplicons showed decreased coverage in FFPE samples (S4 Fig). There was no significant difference in the ratio of C > T or G > A base transitions between the FFPE samples and the FF samples (S5 Fig) [27, 28].

Defining the Requirements for Mutation Calling in DNA Samples

To determine the variant detection limit of the assay, dilution experiments of four FF DNA samples with known TP53 mutations were performed. With an R squared of 94.53%, the dilution data were close to the expected allele frequencies (fitted line, S6 Fig). The known TP53 mutations were reliably detected down to an allele frequency of 1%. Because dilution assays may overestimate the sensitivity of the assay, a cut-off of 5% allele frequency was set as reliable for future diagnostic use.

Since the percentage of tumour cells present in the material used for DNA extraction is an important variable defining the ability of any assay to detect somatic mutations in diagnostic specimens [29], we predicted that a variant could be detected when at least 20 variant reads were found at a coverage of 800x for the amplicon, provided the input material contained at least 10% tumour cells (Fig 3). For standard mutation calling, 800x is probably not necessary, but our assay was designed to achieve high sensitivity even for samples with low tumour cell percentages. Next, we analysed the entire dataset to assess whether the tumour cell percentage of the input material affected the mean variant allele frequency (VAF). Theoretically, a heterozygous mutation in a diploid sample with 10% tumour cells can be reliably detected with a detection limit of 5% allele frequency, but we did not find a relationship between tumour percentage and VAF (Fig 4).
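The arithmetic behind this reasoning can be sketched in a few lines. This is a simplified model, not the authors' calculation: it assumes that tumour and normal cells are both diploid, that the mutation is heterozygous in every tumour cell, and that there are no copy-number changes. The function names are hypothetical; the 20-read threshold, 800x coverage and 10% tumour cell percentage are the values quoted in the text.

```python
def expected_vaf(tumour_fraction: float, mutant_copies: int = 1, ploidy: int = 2) -> float:
    """Expected variant allele frequency under the simple diploid model.

    Assumption: every tumour cell carries `mutant_copies` mutant alleles out of
    `ploidy` alleles, and normal cells carry none.
    """
    return tumour_fraction * mutant_copies / ploidy


def expected_variant_reads(coverage: float, tumour_fraction: float) -> float:
    """Expected number of reads that show the variant at a given coverage."""
    return coverage * expected_vaf(tumour_fraction)


MIN_VARIANT_READS = 20  # read threshold quoted in the text

vaf = expected_vaf(0.10)                      # 10% tumour cells
reads = expected_variant_reads(800, 0.10)     # 800x coverage
print(f"expected VAF: {vaf:.1%}")
print(f"expected variant reads at 800x: {reads:.0f}")
print("above read threshold:", reads >= MIN_VARIANT_READS)
```

Under these assumptions, a sample with 10% tumour cells has an expected allele frequency of 5%, which at 800x corresponds to about 40 variant reads, above the 20-read threshold. Real samples deviate from this model because of aneuploidy, subclonality and sampling noise.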

Lines depict the coverage needed for a given tumour percentage. In this study a detection limit of 20 variant reads was used, which, combined with a tumour percentage of at least 10%, leads to a required coverage of 800x.

The observed allele frequency for all variants detected using NGS is plotted against the tumour cell percentage as determined by a pathologist. The green line depicts the theoretically expected allele frequency of a heterozygous (somatic) mutation as a function of tumour cell percentage. A forced linear regression line (black line) was plotted to determine whether an increased tumour percentage affects the mean allele frequency detected; the correlation coefficient was 0.041.

Validation of Mutational Profiles Obtained with the Ampliseq Assay

We validated 328 variants, of which 323 were concordant between NGS and the conventional techniques, resulting in an overall concordance of 98.5% (sensitivity of 99.1%) (Fig 5, Table 1). Of the 5 discordant variants, two false negative variants in TP53 exon 8 (p.G266E) were identified using Sanger sequencing but not using NGS, a discrepancy that could not be resolved. A third false negative variant, in TP53 exon 7 (c.757_758insA, p.T253fs*11), was not called by TVC but was clearly visible in IGV. The only false positive variant was in TP53 exon 7 (c.723delC, p.C242fs*5); it was called by TVC but was not visible in IGV on manual inspection. The final discordant variant was identified in EGFR exon 21 (p.L858R) with a VAF of 7.3%; it was not detected using HRM analysis because of the low tumour cell percentage of the input material (estimated at 5–10%). TP53 is not fully covered in the Ampliseq panel, so in 19 samples a TP53 variant identified with Sanger sequencing could not be identified using NGS (S4 Table). These data support the conclusion that the Ion Torrent AmpliSeq workflow is a reliable technique for mutation analysis and that manual checks in IGV further improve its reliability.
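The overall concordance quoted above follows directly from the two counts. A minimal sketch (the function name is hypothetical; the counts 323 and 328 are from the text):

```python
def percent_concordance(concordant: int, total: int) -> float:
    """Percentage of validated variants on which both methods agree."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * concordant / total


overall = percent_concordance(323, 328)
print(f"{overall:.1f}%")  # prints 98.5%
```

The sensitivity figure is not recomputed here, because it depends on how the individual discordant variants are classified, which the counts alone do not determine.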

A) The absolute number of samples with a mutation in the genes denoted on the x-axis that were used for the validation of NGS on the Ion Torrent platform. All samples are colour coded: dark blue represents concordant samples with the same mutation in the standard technique and NGS, intermediate blue represents concordant samples showing no mutation, and light blue represents discordant samples. B) The same data as depicted in Fig 5A, represented as percentages of all tested samples for a given gene.

Interpretation of Data Obtained with the Ampliseq Assay

To further understand whether the NGS results reflect an expected mutational pattern, we analysed all identified mutations in the TP53, KRAS, BRAF, EGFR and PIK3CA genes in a final dataset containing 386 samples, 290 derived from FF material and 96 derived from FFPE material. Even though the AmpliSeq panel does not cover the entire TP53 gene, mutations were identified throughout the targeted region (Fig 6A). Our findings show an 82% overlap with the TCGA database (S8 Fig). As expected, a limited mutation distribution was identified for KRAS, BRAF, EGFR and PIK3CA (Fig 6B–6E), as these genes contain mutational hotspot locations, which could be detected reliably in this assay. Of interest, several parts of the PIK3CA gene were sequenced without identifying mutations, suggesting the absence of a systematic bias towards false positive findings based on the choice of amplicons sequenced.

All graphs depict a lollipop plot (adapted from (Vohra and Biggin, 2013)) showing identified variants relative to a schematic representation of the gene. Any position with a mutation obtains a circle, the length of the line depends on the number of mutations detected at that codon. The grey bar represents the entire protein with the different amino acid positions (aa). The coloured boxes are specific functional domains. On top of the lollipops the most frequent variants are annotated as the amino-acid change at that specific site. Black lines underneath the grey box indicate the regions where the Ampliseq panel covers the gene. A) Mutations identified in the TP53 gene using NGS, B) KRAS, C) BRAF, D) EGFR and E) PIK3CA.

For all samples, the site of tumour origin was used to analyse the distribution of mutations among the different tumour types. As expected, TP53 was found to be the most frequently mutated gene in this unselected set of tumours (Fig 7A). The dataset is biased towards colorectal cancer, non-small cell lung carcinoma (NSCLC) and melanoma, probably because mutational data already influence therapy choice in these tumour types, and clinicians are therefore more likely to request NGS analysis for patients with such tumours (S7 Fig).

A) Heatmap of the number of variants per tumour group. The y-axis shows the different primary tumour sites and the x-axis shows all genes with mutational data. The relative number of mutations is defined as the number of mutations normalized to the number of samples in the tumour group. B) Co-occurrence of different variants in colorectal tumours. The size of the circle around a gene indicates the number of times a variant was identified in that gene. The lines represent co-occurrences between genes, with the line thickness indicating the number of co-occurrences. The colour of the circles indicates the function of the gene: green–tumour suppressor genes and oncogenes, purple–receptor tyrosine kinases, pink–PI3K pathway, yellow–KRAS/BRAF pathway.


Conclusions

Advances in technology have made it possible to improve nucleic acid sequencing. From the initial results of the Sanger technique to current next-generation sequencing, a great deal of work has gone into accounting for “individual variability” in order to move towards “personalized medicine”. Currently, NGS technology stands out as one of the most powerful and effective approaches for fast DNA/RNA sequencing. In cancer research, many scientists are striving to exploit this technology to the full, and some laboratories are starting to show exciting data, especially in the case of CRC. However, it should be noted that the amount of data in the field is still limited. Additional studies are required to establish the reliability of this technology for clinical application. This means that the optimization needed to realise the full potential of these platforms may still be some years away. The use of NGS in clinical routine is challenging: these tools produce good results in detecting clinically relevant mutations, but often cannot repeat this performance when wider regions of the genome are analysed. Specific improvements in quality control methods (i.e. the identification of correct quality parameters) could greatly help to overcome these problems. Additionally, the introduction of NGS technology as a clinical tool will certainly require measures for process standardization, data handling and interpretation. Greater attention should be paid to the work of bioinformaticians and biostatisticians in analysing the massive quantity of data these systems will generate. The clinical challenge is principally to obtain accurate data that are also easy to interpret, taking into consideration critical issues related to somatic mutation detection in CRC and solid tumors, foremost the accuracy in identifying lesions with very low allelic frequencies.
In this regard, innovative approaches to alignment, assembly and variant calling should be devised to increase the accuracy of the entire NGS workflow. Even today, bioinformatics approaches are agnostic about the disease under study and do not embed in their computation the knowledge specific to the disease or gene under analysis, as scientists do in their evaluations. In this direction, a disruptive approach would be to devise new bioinformatics methods that are aware of the pathology and disease the scientists are looking for and that use this knowledge while executing their analysis. In our opinion, this would considerably increase the accuracy of NGS results. Equally, investments should be made in the appropriate education and training of clinicians in interpreting the clinical significance of the data obtained.

In conclusion, NGS technology surely represents a giant step towards personalized medicine against CRC, but further analyses are necessary to obtain more complete results and a clearer view of the big picture.

