What is the difference between sequence, reads, and contigs of genetic material?

Can someone explain the differences between sequence, reads, and contigs of genetic material such as DNA, if possible with an example?

I am new to bioinformatics, and I have not found any conclusive answers for all these concepts on the web.

My understanding of those three words as follows:

  • sequence is a generic name describing the order of biological letters (DNA/RNA or amino acids). Both contigs and reads are DNA/RNA or amino acid sequences.

  • reads are just shorthand for sequenced reads. Sequenced reads usually refer to the digital information obtained from the sequencing machine (for example an Illumina MiSeq) and stored in a fastq file with quality scores per base. Reads are usually short, although what counts as "short" changes rapidly. Right now a MiSeq produces reads anywhere between 50 and 150 base pairs (bp) long. From a single run (this really depends on the run) you can get millions of reads, each of a set size, e.g. 100 bp. All reads are stored in a single fastq file per replicate, and all reads in that file are usually of uniform size, e.g. all 5 million reads are 100 bp long.

As a bioinformatician your first job is to identify where those reads come from. Depending on the experimental goal and on what sort of sequencing you were doing, e.g. DNA-seq or RNA-seq, you may or may not encounter contigs.

  • contigs are simply reads that have been assembled together. For example, if you are doing de novo transcriptomics, you would:

    1. purify your transcript from a tissue and send it off for sequencing
    2. get your fastq files with sequenced reads, that are all short reads (e.g 100 bp)
    3. assemble those 100bp reads into a longer contig that hopefully will resemble your individual transcript
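
To make the "reads with quality scores per base" idea concrete, here is a minimal sketch of reading a fastq file in Python. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; the function name `parse_fastq` and the two example reads are made up for illustration.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:          # end of file
            break
        seq = handle.readline().strip()
        handle.readline()       # the '+' separator line
        qual = handle.readline().strip()
        # Assumption: Phred+33 encoding (quality = ASCII code - 33)
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Two hypothetical 10 bp reads, standing in for a real fastq file
example = io.StringIO(
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2\nTTGACCGGTA\n+\nIIIIIHHHHH\n"
)
reads = list(parse_fastq(example))
for read_id, seq, quals in reads:
    print(read_id, seq, len(seq), min(quals))
```

Each read is just a sequence plus one quality value per base; a real file would hold millions of such records.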

I'm going to say the same thing as @Serine but in a slightly different context. Let's take an example where you want to compare smoking persons against non-smokers.

In this context, you'd want to obtain the DNA sequence of the smokers. However, due to technological limitations you won't get a single DNA sequence from the sequencing machine. You'll get millions of short overlapping DNA sequences known as reads.

We need an assembler to "map" the reads and compare them with a reference genome. In this example, the reference genome could have been the human HG38.

The assembler would need to merge the overlapping reads into a set of non-overlapping regions, known as contigs.
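
The merging step can be illustrated with a toy greedy overlap merger. This is only a sketch of the idea, not how a production assembler works: real tools handle sequencing errors, reverse complements and repeats, and the function names, the minimum overlap of 3 and the example reads are all assumptions chosen for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap until none remain."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return seqs  # whatever is left are the contigs
        merged = seqs[i] + seqs[j][k:]
        seqs = [s for n, s in enumerate(seqs) if n not in (i, j)] + [merged]

# Hypothetical reads: three overlap each other, one does not overlap anything
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGTTTAAA"]
contigs = greedy_assemble(reads)
print(contigs)
```

The three overlapping reads collapse into one longer contig, while the read with no overlap stays on its own as a second contig.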

What's the Difference Between a DNA and RNA Vaccine?

Laura Hensley is an award-winning lifestyle journalist who has worked in some of the largest newsrooms in Canada.

James Lacy, MLS, is a fact checker and researcher. James received a Master of Library Science degree from Dominican University.

Key Takeaways

  • DNA and RNA vaccines have the same goal as traditional vaccines, but they work slightly differently.
  • Instead of injecting a weakened form of a virus or bacteria into the body as with a traditional vaccine, DNA and RNA vaccines use part of the virus’ own genetic code to stimulate an immune response.
  • An mRNA vaccine for COVID-19 co-developed by Pfizer and BioNTech is the first of its kind authorized for emergency use in the United States.
  • Several other potential DNA and RNA COVID-19 vaccines are in clinical trials, meaning they are an important and promising area of vaccine development.

Researchers around the world are working on developing safe and effective vaccines for COVID-19, the disease caused by the novel coronavirus SARS-CoV-2. There are currently several global vaccine clinical trials taking place, including four major trials in the United States. Some of these potential COVID-19 vaccines are RNA and DNA vaccines, which is an emerging area of vaccine development.

On December 11, the Food and Drug Administration granted emergency use authorization for a messenger RNA (mRNA) vaccine for COVID‑19 co-developed by Pfizer and BioNTech. This emergency use is approved for people ages 16 and older.

What Is Genetic Material?

Genetic material is the medium by which instructions are transmitted from one generation of organisms to the next. In life on Earth, it takes the form of nucleotide sequences that are organized into genomes. A genome is all of the DNA contained within the cell of a living being. Each molecule of human DNA has billions of nucleotides arranged like steps on a ladder.

It's the sequence of nucleotides that determines the traits of the organism. At various places, called loci, along each chromosome, between large stretches of non-coding DNA, sequences of nucleotides resolve into coherent patterns (genes) that are transcribed into messenger RNA, which in turn instructs the cell in how to build proteins. These proteins are synthesized in the cytoplasm of the cell and work to build every structure of a living body. Genes, as a natural consequence of their nucleotide sequences, build proteins, and proteins build bodies.

Genetic material is passed among large organisms by vertical transmission from parent to offspring. Each offspring resembles its parent more closely than it resembles a randomly chosen member of its species because the exact sequence of genetic instructions on how to build the body has been inherited from the parent. Small errors in copying genes are known as mutations, and their proliferation throughout a gene pool drives the process of evolution.


Rapid haploid variant calling and core genome alignment

Snippy finds SNPs between a haploid reference genome and your NGS sequence reads. It will find both substitutions (snps) and insertions/deletions (indels). It will use as many CPUs as you can give it on a single computer (tested to 64 cores). It is designed with speed in mind, and produces a consistent set of output files in a single folder. It can then take a set of Snippy results using the same reference and generate a core SNP alignment (and ultimately a phylogenomic tree).

Install Homebrew (MacOS) or LinuxBrew (Linux) then:

This will install the latest version direct from Github. You'll need to add Snippy's bin directory to your $PATH .

Ensure you have the desired version:

Check that all dependencies are installed and working:

  • a reference genome in FASTA or GENBANK format (can be in multiple contigs)
  • sequence read file(s) in FASTQ or FASTA format (can be .gz compressed)
  • a folder to put the results in
Extension            Description
.tab                 A simple tab-separated summary of all the variants
.csv                 A comma-separated version of the .tab file
.html                A HTML version of the .tab file
.vcf                 The final annotated variants in VCF format
.bed                 The variants in BED format
.gff                 The variants in GFF3 format
.bam                 The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates.
.bam.bai             Index for the .bam file
.log                 A log file with the commands run and their outputs
.aligned.fa          A version of the reference but with - at position with depth=0 and N for 0 < depth < --mincov (does not have variants)
.consensus.fa        A version of the reference genome with all variants instantiated
.consensus.subs.fa   A version of the reference genome with only substitution variants instantiated
.raw.vcf             The unfiltered variant calls from Freebayes
.filt.vcf            The filtered variant calls from Freebayes
.vcf.gz              Compressed .vcf file via BGZIP
.vcf.gz.csi          Index for the .vcf.gz (via bcftools index)

⚠️ ❌ Snippy 4.x does NOT produce the following files that Snippy 3.x did

Extension            Description
.vcf.gz.tbi          Index for the .vcf.gz via TABIX
.depth.gz            Output of samtools depth -aa for the .bam file
.depth.gz.tbi        Index for the .depth.gz file

Columns in the TAB/CSV/HTML formats

Name         Description
CHROM        The sequence the variant was found in, e.g. the name after the > in the FASTA reference
POS          Position in the sequence, counting from 1
TYPE         The variant type: snp mnp ins del complex
REF          The nucleotide(s) in the reference
ALT          The alternate nucleotide(s) supported by the reads
EVIDENCE     Frequency counts for REF and ALT

If you supply a Genbank file as the --reference rather than a FASTA file, Snippy will fill in these extra columns by using the genome annotation to tell you which feature was affected by the variant:

Name         Description
FTYPE        Class of feature affected: CDS tRNA rRNA .
STRAND       Strand the feature was on: + - .
NT_POS       Nucleotide position of the variant within the feature / Length in nt
AA_POS       Residue position / Length in aa (only if FTYPE is CDS)
LOCUS_TAG    The /locus_tag of the feature (if it existed)
GENE         The /gene tag of the feature (if it existed)
PRODUCT      The /product tag of the feature (if it existed)
EFFECT       The snpEff annotated consequence of this variant (ANN tag in .vcf)

Type      Name                              Example
snp       Single Nucleotide Polymorphism    A => T
mnp       Multiple Nucleotide Polymorphism  GC => AT
ins       Insertion                         ATT => AGTT
del       Deletion                          ACGG => ACG
complex   Combination of snp/mnp            ATTC => GTTA
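
As a rough illustration of how the five types relate to the REF and ALT columns, the sketch below classifies a REF/ALT pair using simple length and mismatch rules. It is a simplification that happens to reproduce the examples in the table; it is not claimed to be the logic Freebayes or Snippy actually uses, and `classify_variant` is a hypothetical helper.

```python
def classify_variant(ref, alt):
    """Guess the variant type from REF and ALT strings (simplified rules)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical: not a variant")
    if len(alt) > len(ref):
        return "ins"
    if len(alt) < len(ref):
        return "del"
    if len(ref) == 1:
        return "snp"
    # Same length, more than one base: every position changed -> mnp,
    # otherwise a mixture of changed and unchanged positions -> complex
    mismatches = sum(r != a for r, a in zip(ref, alt))
    return "mnp" if mismatches == len(ref) else "complex"

# The example pairs from the table
for ref, alt in [("A", "T"), ("GC", "AT"), ("ATT", "AGTT"), ("ACGG", "ACG"), ("ATTC", "GTTA")]:
    print(ref, "=>", alt, classify_variant(ref, alt))
```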

The variant calling is done by Freebayes. The key parameters under user control are:

  • --mincov - the minimum number of reads covering a site to be considered (default=10)
  • --minfrac - the minimum proportion of those reads which must differ from the reference
  • --minqual - the minimum VCF variant call "quality" (default=100)
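
The effect of the three thresholds can be sketched as a simple per-site check. This is an illustration of the idea only, not Snippy's internal filter: the function name is hypothetical, the defaults for coverage and quality are taken from the list above, and the 0.9 default for the fraction is an assumption (no default is stated here).

```python
def passes_thresholds(depth, alt_count, qual, mincov=10, minfrac=0.9, minqual=100):
    """Return True if a candidate variant site clears all three hard thresholds.

    depth     -- number of reads covering the site
    alt_count -- number of those reads that differ from the reference
    qual      -- the VCF QUAL value of the call
    """
    if depth < mincov:
        return False                      # too few reads to consider the site
    if alt_count / depth < minfrac:
        return False                      # too few reads support the variant
    return qual >= minqual                # call quality must be high enough

print(passes_thresholds(depth=50, alt_count=48, qual=500))  # well supported
print(passes_thresholds(depth=8, alt_count=8, qual=500))    # below mincov
print(passes_thresholds(depth=50, alt_count=30, qual=500))  # below minfrac
```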

Looking at variants in detail with snippy-vcf_report

If you run Snippy with the --report option it will automatically run snippy-vcf_report and generate a report which has a section like this for each SNP in snps.vcf :

If you wish to generate this report after you have run Snippy, you can run it directly:

If you want a HTML version for viewing in a web browser, use the --html option:

It works by running samtools tview for each variant, which can be very slow if you have 1000s of variants. Using --cpus as high as possible is recommended.

--rgid will set the Read Group ( RG ) ID ( ID ) and Sample ( SM ) in the BAM and VCF file. If not supplied, it will use the --outdir folder name for both ID and SM .

--mapqual is the minimum mapping quality to accept in variant calling. BWA MEM uses 60 to mean a read is "uniquely mapped".

--basequal is the minimum quality a nucleotide needs to be used in variant calling. We use 13, which corresponds to an error probability of about 5%. It is a traditional SAMtools value.
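
The link between a quality of 13 and an error rate of about 5% comes from the standard Phred scale, where Q = -10 log10(p). A small sketch of the conversion (the function names are illustrative):

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse conversion: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

for q in (13, 20, 30, 60):
    print(q, phred_to_error_prob(q))
```

Mapping qualities are Phred-scaled in the same way, so a mapping quality of 60 corresponds to roughly a one-in-a-million chance that the reported position is wrong.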

--maxsoft is how many bases of an alignment to allow to be soft-clipped before discarding the alignment. This is to encourage global over local alignment, and is passed to the samclip tool.

--mincov and --minfrac are used to apply hard thresholds to the variant calling beyond the existing statistical measure. The optimal values depend on your sequencing depth and contamination rate. Values of 10 and 0.9 are commonly used.

--targets takes a BED file and only calls variants in those regions. Not normally needed unless you are only interested in variants in specific loci (e.g. AMR genes) but are still performing WGS rather than amplicon sequencing.

--ctgs allows you to call SNPs from contigs rather than reads. It shreds the contigs into synthetic reads, so as to put the calls on an even footing with other read samples in a multi-sample analysis.

If you call SNPs for multiple isolates from the same reference, you can produce an alignment of "core SNPs" which can be used to build a high-resolution phylogeny (ignoring possible recombination). A "core site" is a genomic position that is present in all the samples. A core site can have the same nucleotide in every sample ("monomorphic") or some samples can be different ("polymorphic" or "variant"). If we ignore the complications of "ins", "del" variant types, and just use variant sites, these are the "core SNP genome".
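
The definitions of core, monomorphic and polymorphic sites can be made concrete with a small sketch. It assumes each sample is given as a sequence already aligned to reference coordinates, with anything other than A, C, G or T meaning the position is missing in that sample; the function name and the toy data are hypothetical, and this is not the snippy-core implementation.

```python
def classify_core_sites(aligned):
    """Return [(position, 'monomorphic' | 'polymorphic')] for every core site.

    aligned -- dict mapping sample name to a sequence in reference coordinates;
               all sequences must have the same length.
    """
    if len({len(s) for s in aligned.values()}) != 1:
        raise ValueError("all sequences must have the same length")
    called = set("ACGT")
    sites = []
    for pos, column in enumerate(zip(*aligned.values()), start=1):
        bases = {b.upper() for b in column}
        if not bases <= called:
            continue  # missing or ambiguous in at least one sample: not a core site
        sites.append((pos, "monomorphic" if len(bases) == 1 else "polymorphic"))
    return sites

# Toy example: position 3 is missing in s3, position 5 varies between samples
samples = {"s1": "ACGTA", "s2": "ACGTT", "s3": "AC-TA"}
sites = classify_core_sites(samples)
core_snps = [pos for pos, kind in sites if kind == "polymorphic"]
print(sites)
print(core_snps)
```

Keeping only the polymorphic core sites gives the columns that would make up a core SNP alignment.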

To simplify running a set of isolate sequences (reads or contigs) against the same reference, you can use the snippy-multi script. This script requires a tab separated input file as follows, and can handle paired-end reads, single-end reads, and assembled contigs.

Then one would run this to generate the output script. The first parameter should be the file. The remaining parameters should be any remaining shared snippy parameters. The ID will be used for each isolate's --outdir .

It will also run snippy-core at the end to generate the core genome SNP alignment files core.* .

Extension        Description
.aln             A core SNP alignment in the --aformat format (default FASTA)
.full.aln        A whole genome SNP alignment (includes invariant sites)
.tab             Tab-separated columnar list of core SNP sites with alleles but NO annotations
.vcf             Multi-sample VCF file with genotype GT tags for all discovered alleles
.txt             Tab-separated columnar list of alignment/core-size statistics
.ref.fa          FASTA version/copy of the --ref
.self_mask.bed   BED file generated if --mask auto is used.

Why is core.full.aln an alphabet soup?

The core.full.aln file is a FASTA formatted multiple sequence alignment file. It has one sequence for the reference, and one for each sample participating in the core genome calculation. Each sequence has the same length as the reference sequence.

Character   Meaning
ATGC        Same as the reference
atgc        Different from the reference
-           Zero coverage in this sample or a deletion relative to the reference
N           Low coverage in this sample (based on --mincov )
X           Masked region of reference (from --mask )
n           Heterozygous or poor quality genotype (has GT=0/1 or QUAL < --minqual in snps.raw.vcf )

You can remove all the "weird" characters and replace them with N using the included snippy-clean_full_aln . This is useful when you need to pass it to a tree-building or recombination-removal tool:

  • If you want to mask certain regions of the genome, you can provide a BED file with the --mask parameter. Any SNPs in those regions will be excluded. This is common for genomes like M.tuberculosis where pesky repetitive PE/PPE/PGRS genes cause false positives, or masking phage regions. A --mask bed file for M.tb is provided with Snippy in the etc/Mtb_NC_000962.3_mask.bed folder. It is derived from the XLSX file from
  • If you use the snippy --cleanup option the reference files will be deleted. This means snippy-core can not "auto-find" the reference. In this case you simply use snippy-core --reference REF to provide the reference in FASTA format.

Increasing speed when too many reads

Sometimes you will have far more sequencing depth than you need to call SNPs. A common problem is a whole MiSeq flowcell for a single bacterial isolate, where 25 million reads results in genome depth as high as 2000x. This makes Snippy far slower than it needs to be, as most SNPs will be recovered with 50-100x depth. If you know you have 10 times as much data as you need, Snippy can randomly sub-sample your FASTQ data:
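
As a back-of-the-envelope check on how much data to keep, mean depth can be estimated as reads × read length / genome size. The sketch below is plain arithmetic rather than a Snippy feature; the 250 bp read length and 3 Mbp genome size are assumptions picked to match the "25 million reads, roughly 2000x" example, and the function names are hypothetical.

```python
def mean_depth(n_reads, read_len, genome_size):
    """Approximate mean sequencing depth (fold coverage)."""
    return n_reads * read_len / genome_size

def subsample_fraction(current_depth, target_depth):
    """Fraction of reads to keep in order to reach the target depth."""
    return min(1.0, target_depth / current_depth)

# Assumed values: 25 million reads of 250 bp on a 3 Mbp bacterial genome
depth = mean_depth(25_000_000, 250, 3_000_000)
fraction = subsample_fraction(depth, target_depth=100)
print(f"depth ~{depth:.0f}x, keep {fraction:.1%} of reads for ~100x")
```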

Only calling SNPs in particular regions

If you are looking for specific SNPs, say AMR related ones in particular genes in your reference genome, you can save much time by only calling variants there. Just put the regions of interest into a BED file:

Finding SNPs between contigs

Sometimes one of your samples is only available as contigs, without corresponding FASTQ reads. You can still use these contigs with Snippy to find variants against a reference. It does this by shredding the contigs into 250 bp single-end reads at 2 × --mincov uniform coverage.
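
The shredding idea can be sketched as a sliding window over each contig. The exact tiling Snippy uses is not described here, so the evenly spaced windows below are an assumption; only the 250 bp read length and the 2 × --mincov coverage (20x with the default --mincov of 10) come from the text, and `shred` is a hypothetical helper.

```python
def shred(contig, read_len=250, coverage=20):
    """Cut a contig into overlapping synthetic reads at roughly uniform coverage.

    Assumption: windows start every read_len // coverage bases, so each
    interior base is covered by about `coverage` reads.
    """
    if len(contig) <= read_len:
        return [contig]
    step = max(1, read_len // coverage)
    last_start = len(contig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the contig tail is covered
    return [contig[s:s + read_len] for s in starts]

contig = "ACGT" * 250            # a made-up 1000 bp contig
synthetic_reads = shred(contig)
print(len(synthetic_reads), len(synthetic_reads[0]))
```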

To use this feature, instead of providing --R1 and --R2 you use the --ctgs option with the contigs file:

This output folder is completely compatible with snippy-core so you can mix FASTQ and contig based snippy output folders to produce alignments.

Correcting assembly errors

The de novo assembly process attempts to reconstruct the reads into the original DNA sequences they were derived from. These reconstructed sequences are called contigs or scaffolds. For various reasons, small errors can be introduced into the assembled contigs which are not supported by the original reads used in the assembly process.

A common strategy is to align the reads back to the contigs to check for discrepancies. These errors appear as variants (SNPs and indels). If we can reverse these variants then we can "correct" the contigs to match the evidence provided by the original reads. Obviously this strategy can go wrong if one is not careful about how the read alignment is performed and which variants are accepted.

Snippy is able to help with this contig correction process. In fact, it produces a snps.consensus.fa FASTA file which is the ref.fa input file provided but with the discovered variants in snps.vcf applied!
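
What "applying the variants" means can be shown with a small sketch that rewrites a reference string from a list of (position, REF, ALT) records. It is an illustration of the concept, not the code Snippy runs: it assumes 1-based positions as in VCF, a single sequence, and non-overlapping variants, and `apply_variants` is a hypothetical helper.

```python
def apply_variants(reference, variants):
    """Return the reference with each (pos, ref, alt) variant applied.

    Variants are applied from right to left so that insertions and
    deletions do not shift the coordinates of variants still to be applied.
    """
    seq = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1  # convert 1-based position to 0-based index
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match the sequence at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq

# One substitution at position 2 and one insertion at position 5
consensus = apply_variants("ACGTACGT", [(2, "C", "T"), (5, "A", "AGG")])
print(consensus)
```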

However, Snippy is not perfect and sometimes finds questionable variants. Typically you would make a copy of snps.vcf (let's call it corrections.vcf ) and remove those lines corresponding to variants we don't trust. For example, when correcting Roche 454 and PacBio SMRT contigs, we primarily expect to find homopolymer errors and hence expect to see ins more than snp type variants.

In this case you need to run the correcting process manually using these steps:

You may wish to iterate this process by using corrected.fa as a new --ref for a repeated run of Snippy. Sometimes correcting one error allows BWA to align things it couldn't before, and new errors are uncovered.

Snippy may not be the best way to correct assemblies - you should consider dedicated tools such as PILON or iCorn2, or adjust the Quiver parameters (for Pacbio data).

Sometimes you are interested in the reads which did not align to the reference genome. These reads represent DNA that was novel to your sample which is potentially interesting. A standard strategy is to de novo assemble the unmapped reads to discover these novel DNA elements, which often comprise mobile genetic elements such as plasmids.

By default, Snippy does not keep the unmapped reads, not even in the BAM file. If you wish to keep them, use the --unmapped option and the unaligned reads will be saved to a compressed FASTQ file:

The name Snippy is a combination of SNP (pronounced "snip"), snappy (meaning "quick") and Skippy the Bush Kangaroo (to represent its Australian origin).

Snippy is free software, released under the GPL (version 2).

Please submit suggestions and bug reports to the Issue Tracker

  • perl >= 5.18
  • bioperl >= 1.7
  • bwa mem >= 0.7.12
  • minimap2 >= 2.0
  • samtools >= 1.7
  • bcftools >= 1.7
  • bedtools >= 2.0
  • GNU parallel >= 2013xxxx
  • freebayes >= 1.1 (freebayes, freebayes-parallel,
  • vcflib >= 1.0 (vcfstreamsort, vcfuniq, vcffirstheader) >= 0.5
  • snpEff >= 4.3
  • samclip >= 0.2
  • seqtk >= 1.2
  • snp-sites >= 2.0
  • any2fasta >= 0.4
  • wgsim >= 1.8 (for testing only - wgsim command)

For Linux (compiled on Ubuntu 16.04 LTS) and macOS (compiled on High Sierra Brew) some of the binaries, JARs and scripts are included.


We generated the first genome for a dioecious species within the genus Solanum, to assess the early emergence and genomic signatures of sex-differentiation and sex determination. To do so, we assembled a high-quality genome, took a k-mer approach to find sex-linked genomic regions, and carried out an RNA-seq experiment of floral tissues to find genes involved in sex determination and sexual dimorphism. We found that dioecious S. appendiculatum appears to have a recently evolved sex-determination region and that males are likely to be the heterogametic sex. Indeed, the patterns of male–female sequence divergence we observed do not indicate the presence of a large nonrecombining region containing genes involved in sex determination. Moreover, the specific loci associated with sex-differentiation suggest that the evolution of dioecy in this system involved changes in the regulation of pectin synthesis and degradation, including in specific phenotypic transitions observed in functionally female flowers. This genome, and the associated candidate genes, represents a valuable genomic resource for the continued investigation of recent transitions to dioecy within Solanum.

Limited Sex-Biased Gene Expression and Few Sex-Associated Regions Are Consistent with Recent Evolution of Sexual Dimorphism

We found a very modest amount of sex-biased gene expression in flower buds, and larger but still delimited sex differences in the expression profiles of mature flowers. Given that sex-specificity of gene expression is expected to accumulate with time since the origin of sexual dimorphism (Ellegren and Parsch 2007), the observation that few genes show sex-biased expression is consistent with a young sex-determination system. This very modest genomic and transcriptomic divergence between the sexes is consistent with the subtle morphological differentiation between male and female flowers, which is among the least pronounced in the dioecious nightshades (Anderson et al. 2015).

For mature flowers, sex-biased genes more commonly had higher expression in females than in males (fig. 2B). This finding contrasts with another species with a recently evolved sex-determining region, the garden asparagus (Harkess et al. 2015), likely because of developmental differences in sex expression between the two systems. In asparagus, anther development is arrested before microspore meiosis in female flowers (Caporali et al. 1994), thus genes associated with later pollen development are expected to be expressed only in males (Harkess et al. 2015). In contrast, in S. appendiculatum female flowers develop mature pollen, but fail to deposit primexine at the apertural regions (Zavada and Anderson 1997). Our observation of more female-biased genes in S. appendiculatum is therefore consistent with this maintenance of both functional styles (female reproductive parts) and active production of (inaperturate) pollen (Levine and Anderson 1986) in female flowers, and seems to indicate some loss of function of female reproductive parts in male plants. This possible loss of function, however, is not reflected in the morphology of male flowers, which have complete female reproductive parts (albeit with much shorter styles; Anderson 1979; Anderson and Levine 1982).

Regulation of Pectin as a Potential Mechanism for the Formation of Aperturate Pollen

Identification of candidate genes playing potential feminizing or masculinizing effects is important to understand sex determination in this recently evolved dioecious species. Collectively, three different approaches in this study (gene family dynamics, sex-biased expression, and sex-specific k-mers) detected a set of loci distinctive to S. appendiculatum. Some of these are likely unrelated to this species’ transition to dioecy, and some others are possibly associated with general physiological consequences of this breeding system transition rather than directly involved in sex-differentiation and sex-determination per se. For instance, our gene family analysis detected a contraction of the self-incompatibility protein S1 family specifically in S. appendiculatum. Because the evolution of dioecy dramatically reduces the possibility of self-fertilization, this transition might be expected to relax selection to maintain functional self-incompatibility genes; similar losses of self-incompatibility proteins have also been observed in other Solanaceae species that have undergone breeding system transitions (e.g., to self-compatibility; Wu et al. 2019). Nonetheless, among the genetic changes detected, it is striking that all three of our different approaches detected pectin-related genes in association with sexual differentiation in S. appendiculatum, including pectin acetylesterases (PAE), pectin lyase-like proteins (PLL), and pectin methylesterase inhibitors (PMEI). Our finding is particularly intriguing as pectin synthesis and regulation is known to play important roles in pollen wall development, and in pollen function more broadly. Pectin consists of homogalacturonan (HG), which can be methyl- and acetyl-esterified (Wu et al. 2018), and pectin polysaccharides are critical components of the pollen wall. Mutants in genes encoding pectin polysaccharide synthetic and degradative enzymes, including pectin methylesterase (PME), polygalacturonase (PG), PAE, and PLL, often show defective primexine, intine, or other pollen wall structures (Shi et al. 2015; Wu et al. 2018). Strikingly, in Nicotiana (Solanaceae), transgenic mutants of one pectin acetylesterase gene, PAE1, exhibit the loss of germination pores on the surface of the pollen grains (Gou et al. 2012), a very similar phenotype to the inaperturate pollen observed in the female flowers of S. appendiculatum. The overexpression of PAE1 in transgenic tobacco results in severe male sterility by affecting the germination of pollen grains and the growth of pollen tubes (Gou et al. 2012).

Other pectin-associated proteins are also implicated in numerous functional roles in pollen tube germination and growth, including via coordinated regulation between PMEs and their inhibitors, PMEIs (Mollet et al. 2013). For instance, PME is important for the generation of methyl esterified HG in the apical zone of growing pollen tubes, which provides sufficient plasticity for sustaining growth (Cheung and Wu 2008). The removal of methyl ester groups by PME may allow the pectin-degrading enzymes, such as PLL or PG, to cleave the HG backbone, which may affect the rigidity of the cell wall (Gaffe et al. 1994; Micheli 2001). It has been proposed that the pollen cell might maintain a closely regulated level of PME activity, via regulation by PMEIs, in order to maintain the equilibrium between strength and plasticity in the apical cell wall (Bosch and Hepler 2005, 2006). For example, silencing of the PME1 gene in tobacco (Bosch and Hepler 2006), and suppression of PMEI At1g10770 in Arabidopsis (Zhang et al. 2010), both result in slowed pollen tube growth.

In addition to detecting sex-specific expression of PAE, we also found three PMEIs in a candidate sex-determining region (scf14997) in S. appendiculatum. The arrangement and relationship between these putative sex-determining genes are consistent with them being recent duplications, similar to what has been found in other dioecious plants (Harkess et al. 2017; Akagi et al. 2018). Although the specific function of these genes is not yet known, the general roles of PMEIs, PAE, and other related proteins in the formation and function of pollen suggest some possible models for the emergence of sex-specific pollen functions in the two sexes of S. appendiculatum. For example, it is possible that these PMEI copies influence the differential (sex-specific) expression patterns of downstream pectin-related genes in mature flowers, including PAE, thereby inhibiting or initiating the feminizing effect (i.e., inaperturate pollen) observed in female flowers. This process could also involve other tightly linked genes: the same syntenic block contains a gene coding for a LOB domain protein (sapp25115), the Arabidopsis ortholog of which (AT1G06280) is specifically expressed during tapetum and microspore development in the anthers (Oh et al. 2010; Zhu et al. 2010). Other differentially expressed genes also have clearly relevant functions. For example, the pyruvate dehydrogenase E1 component subunit alpha (sapp29734) was differentially expressed between males and females in the mature flower; pyruvate dehydrogenase catalyzes the early steps of sporopollenin biosynthesis, a major component of the exine layer of pollen grains (Jiang et al. 2013).

Although pectin-related genes are promising candidates for the expected male-sterilizing step in the evolution of dioecy, it is possible that they are downstream of a master regulator of sex determination. For instance, a MYB-like transcription factor similar to that found in scf15476 (gene sapp39069) has been implicated in the determination of sex in Asparagus officinalis ( Murase et al. 2017), and the knockout of its putative ortholog causes male sterility in Arabidopsis thaliana ( Zhu et al. 2008). Although the sapp39069 transcription factor could be a regulator of sex, the R2R3 MYB superfamily has been shown to have an extreme diversity of regulatory functions ( Yanhui et al. 2006) and we do not yet have enough data to infer the role of this gene in S. appendiculatum. Therefore, whether some upstream genetic changes trigger the downstream changes in pectin-related genes will have to be addressed in future studies. For instance, transcriptome analysis of additional developmental stages of male and female flowers could clarify how pectin regulation changes across flower development and the specific timing of divergent expression differences between male and female flowers. Regardless, with a genome-wide search for sex-specific sequences, in conjunction with gene expression analyses, we were able to detect both putative sex-determining regions and genes that may contribute to at least one of the two steps expected in the path from hermaphroditism to dioecy. These loci provide clear candidates for direct functional analysis in this system, especially for inaperturate pollen development phenotypes in female flowers.

The S. appendiculatum Genome Provides a Foundation for Addressing Repeated Transitions to Dioecy

Although the speciose genus Solanum contains fewer than 20 documented dioecious species, dioecy is estimated to have arisen independently at least 4 times (Anderson et al. 2015). Many of these transitions appear to involve common phenotypic features, most notably the development of inaperturate pollen in female individuals and dramatic reduction of the pistil in male flowers (Anderson et al. 2015). As such, this young genus (estimated ∼17 My old; Särkinen et al. 2013) offers a promising system in which to address the genomic features and genetic mechanisms of repeated, recent transitions to dioecy.

Solanum appendiculatum is among the most recently evolved dioecious angiosperms with sequenced genomes (<4 My; Echeverría-Londoño et al. 2020). The resources generated here provide a valuable framework for examining additional transitions to dioecy in the highly speciose genus, including a high-quality assembled genome, transcriptome characterization for annotation and gene expression analyses, and a set of candidate loci for directed exploration in parallel systems. Because most dioecious nightshades have similar sexual traits, including inaperturate pollen in the stamens of female flowers (Anderson et al. 2015), addressing the parallel origins of dioecy in this group can also address whether these transitions have followed convergent paths at genomic, genetic, and developmental levels. In conjunction with the S. appendiculatum genome, sequence data from other dioecious Solanum species can be used to dissect these parallel origins of sex determination in Solanum, including whether these exhibit similar genomic features (in terms of the number, size, and distribution of emerging sex-determination regions), draw on the same kinds of genomic/genetic changes (i.e., share orthologous sex-linked regions), and/or involve the same specific pathways and individual loci, including whether there is a general role for pectin-related loci in the early emergence of sexual differentiation. In this context, study of the genetic control of sex expression in species like S. polygamum and S. conocarpum, both of which bear anthers on female flowers that are largely devoid of any pollen (Anderson et al. 2015), could prove especially informative. Data from multiple recent, parallel systems will also be critical for testing the general predictions of theoretical models of the evolution of dioecy and assessing whether the complexity of genomic transitions underpinning real empirical transitions matches well with these theoretical expectations.

Genomics & Systems Biology

David P. Clark, Nanette J. Pazdernik, in Molecular Biology (Second Edition), 2013

2 Assembling Small Genomes by Shotgun Sequencing

As described in Chapter 8, individual dideoxy sequencing reactions give lengths of sequence that are several hundred base pairs long. A whole genome must be assembled from vast numbers of such short sequences. There are three approaches to whole genome assembly: shotgun sequencing, cloned contig sequencing, and the directed shotgun approach, which is really a mixture of the first two.

In shotgun sequencing the genome is broken randomly into short fragments (1 to 2 kbp long) suitable for sequencing. The fragments are ligated into a suitable vector and then partially sequenced. Around 400–500 bp of sequence can be generated from each fragment in a single sequencing run. In some cases, both ends of a fragment are sequenced. Computerized searching for overlaps between individual sequences then assembles the complete sequence. Overlapping sequences are assembled to generate contigs (Fig. 9.04). The term contig refers to a known DNA sequence that is contiguous and lacks gaps.

Figure 9.04. Shotgun Sequencing

The first step in shotgun sequencing an entire genome is to digest the genome into a large number of small fragments suitable for sequencing. All the small fragments are then cloned and sequenced. Computers analyze the sequence data for overlapping regions and assemble the sequences into several large contigs. Since some regions of the genome are unstable when cloned, some gaps may remain even after this procedure is repeated several times.
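The overlap-and-merge idea behind contig building can be sketched in a few lines of Python. This is only a toy greedy assembler under simplifying assumptions (error-free reads, a single strand, no repeats), not the algorithm used by any real assembly program; the function names, the minimum overlap of 3 bases and the example reads are all made up for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences sharing the longest overlap.

    Whatever cannot be merged any further is returned as separate contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:  # no remaining overlaps: these are the contigs
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads from two regions that share no overlapping read.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATC", "CCCTTTAGG", "TTAGGAC"]
for contig in sorted(greedy_assemble(reads)):
    print(contig)
```

The five reads collapse into two contigs because no read bridges the two regions, which is exactly the kind of gap the text goes on to discuss.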

Sequencing very large numbers of small fragments provides enough information to assemble a complete genome sequence—if your computer is powerful enough.

Since fragments are cloned at random, duplicates will quite often be sequenced. To get full coverage the total amount of sequence obtained must therefore be several times that of the genome to allow for duplications. For example, 99.8% coverage requires a total amount of sequence that is 6- to 8-fold the genome size. In principle, all that is required to assemble a genome, however large, from small sequences is a sufficiently powerful computer. No genetic map or prior information is needed about the organism whose genome is to be sequenced. The original limitation to shotgun sequencing was the massive data handling that is required. The development of faster computers overcame this problem.
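The relationship between sequencing depth and coverage can be illustrated with a simple statistical model. The sketch below assumes that reads land uniformly at random, so that the number of reads covering any base is Poisson distributed; this model is an assumption brought in for illustration and is not stated in the text. Under it, a base is missed with probability e^(-c) at c-fold depth.

```python
import math


def fraction_covered(depth: float) -> float:
    """Expected fraction of the genome covered by at least one read.

    Assumes reads are placed uniformly at random (Poisson model).
    """
    return 1.0 - math.exp(-depth)


def depth_for_coverage(target: float) -> float:
    """Depth needed for the expected covered fraction to reach `target`."""
    return -math.log(1.0 - target)


for depth in (1, 2, 4, 6, 8):
    print(f"{depth}-fold depth -> {fraction_covered(depth):.2%} of bases covered")

print(f"99.8% coverage needs about {depth_for_coverage(0.998):.1f}-fold depth")
```

Under this idealized model, 99.8% coverage corresponds to a little over 6-fold depth, which sits at the lower end of the 6- to 8-fold range quoted above; real libraries are less uniform, so more sequence is typically needed.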

The first bacterial genome to be sequenced was that of Haemophilus influenzae. The sequence was deduced from just under 25,000 sequences averaging 480 bp each. This gave a total of almost 12 million bp of sequence, six times the genome size. Computerized assembly using overlaps resulted in 140 regions of contiguous sequence, that is, 140 contigs.
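The arithmetic behind these figures is easy to check. In the short sketch below, the read count and mean read length are the rounded values from the text, while the genome size of about 1.83 million bp is an assumption added here (it is not given in the passage).

```python
reads = 25_000            # "just under 25,000" sequences (rounded up)
mean_read_length = 480    # bp, from the text
genome_size = 1_830_000   # bp; assumed approximate size, not from the text

total_bases = reads * mean_read_length
depth = total_bases / genome_size

print(f"total sequence: {total_bases:,} bp")
print(f"sequencing depth: about {depth:.1f}-fold")
```

With these inputs the depth comes out between 6- and 7-fold, in line with the roughly six-fold figure quoted.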

The bacterium Haemophilus had the honor of being the first organism to be totally sequenced.

The gaps between the contigs may be closed by more individualistic procedures. The easiest method is to re-screen the original set of clones with pairs of probes corresponding to sequences on the two sides of each gap. Clones that hybridize to both members of such a pair of probes presumably carry DNA that bridges the gap between two contigs. Such clones are then sequenced in full to close the gaps between contigs. However, many of the gaps between contigs are due to regions of DNA that are unstable when cloned, especially in a multicopy vector. Therefore, a second library in a different vector, often a single copy vector such as a lambda phage, is often used during the later stages of shotgun cloning. Pairs of end-of-contig probes are used to screen the new library for clones that hybridize to both probes and carry DNA that bridges the gap between the two contigs (Fig. 9.05A). A third approach, which avoids cloning altogether, is to run PCR reactions on whole genomic DNA using random pairs of PCR primers corresponding to contig ends. A PCR product will result only if the two contig ends are within a few kb of each other (Fig. 9.05B).

Figure 9.05. Closing Gaps between Contigs

To identify gaps between contigs, probes or primers are made that correspond to the ends of the contigs (pink). In (A) a new library of clones (green) is screened with end-of-contig probes. Clones that hybridize to probes from two sides of a gap are isolated. In this example, a probe for the end of contig #3 (3b) and a probe for the beginning of contig #4 (4a) both hybridize to the fragment shown. Therefore, the sequence of this clone should close the gap between contigs #3 and #4. (B) The second approach uses PCR primers that correspond to the ends of contigs to amplify genomic DNA. If the two primers lie within a few kilobases of each other, a PCR product is made and can be sequenced.


Cryptosporidium specimens

Four C. hominis specimens were used in whole genome sequencing in the study: specimens 30974 and 37999 of the IbA10G2 subtype and 30976 and 33537 of the IaA28R4 subtype. Specimen 30974 was collected from a patient from a cryptosporidiosis outbreak in July 2010 in Columbia, South Carolina associated with a splash pad that had problems with filtration and chlorination. Testing of filter backflush and stools from six patients all identified the presence of the C. hominis IbA10G2 subtype. Specimen 30976 was collected from a patient in a cryptosporidiosis outbreak in July 2010 in the St. Louis area in Illinois and Missouri associated with swimming pools and a water park. Testing of nine patient specimens identified the occurrence of C. hominis IaA28R4 in seven patients, IaA24R4 in one patient, and IdA15G1 in another patient. Specimen 33537 was collected from a patient from a cryptosporidiosis outbreak in July 2011 in Walsenburg, Colorado associated with a waterpark that had problems with the chlorinator. Testing of filter backflush and stools from five patients identified IaA28R4 in all. Specimen 37999 was collected from a sporadic cryptosporidiosis patient in Twin Falls, Idaho in September 2012. All stool specimens were collected fresh from symptomatic patients and stored in 2.5% potassium dichromate at 4°C prior to being used in Cryptosporidium oocyst isolation for whole genome sequencing within 6 months. Cryptosporidium species and subtypes were determined by PCR-RFLP analysis of the small subunit rRNA and sequence analysis of the 60 kDa glycoprotein (gp60) genes, respectively [17].

Oocyst isolation and whole genome amplification

Cryptosporidium oocysts were isolated from stool specimens by discontinuous sucrose and cesium chloride gradients as previously described [52]. They were further purified by immunomagnetic separation using the Dynabeads Anti-Cryptosporidium kit (Invitrogen, Carlsbad, CA). After treating the purified oocysts with 10% commercial bleach on ice for 10 min and five cycles of freezing and thawing, DNA was extracted from them by using the Qiagen DNeasy Blood & Tissue Kit (Qiagen, Valencia, CA). Whole genome amplification (WGA) of the 25–100 ng of extracted DNA was conducted by using the REPLI-g Midi Kit (Qiagen). The quality of the WGA products was verified by sequencing BamHI-digested WGA products cloned into a pUC19 vector (Fermentas, Pittsburgh, PA). The sequencing was done by using the ABI BigDye Terminator v3.1 Cycle Sequencing Kit on an ABI3130 Genetic Analyzer (Applied Biosystems, Foster City, CA).

454 and Illumina sequencing and de novo contig assembly

The WGA products from specimens 30974 and 33537 were sequenced with 454 technology on a GS-FLX Titanium System (Roche, Branford, CT) by using approximately 1 μg of DNA for library construction and following standard Roche library protocols, with an average insert size of 600 bp. One full PTP plate was used in the analysis of each specimen. The sequence reads from each run were assembled using Newbler in the GS De Novo Assembler with the default settings.

The WGA products from specimens 30976 and 37999 were used to generate Illumina TruSeq (v3) libraries (average insert size: 350 bp) and sequenced 100×100 bp paired-end on an Illumina Genome Analyzer IIx (Illumina, San Diego, CA). The sequence reads with a minimum quality of 20 were trimmed by using CLC Assembly Cell 4.1.0. The data were then assembled with default parameters and a minimum contig length of 500 bp, with scaffolding using paired-end data.

Comparative genomic analyses

For comparisons of sequences at the genome level, contigs of each specimen were aligned with reference sequences of the near complete genome of the C. parvum IOWA isolate (version AAEE00000000.1) and the 1,422 contigs of the C. hominis TU5205 isolate (version NZ_AAEL00000000.1) using Nucmer, a tool in MUMmer 3.23 [53]. Multiple genome alignments were also constructed by using the progressive alignment algorithm of Mauve 2.3.1 with default options [54]. In-house Perl scripts were developed to calculate the average nucleotide identities. For the detection of SNPs, FastQC 0.10.0 was used for the QC analysis of Illumina sequence reads, and PRINSEQ 0.20.3 [55] was used to remove low quality reads, with a min_qual_mean setting of 20 and min_len of 65. Reads were then aligned to reference sequences by using Bowtie 0.12.7 [56]. The resulting SAM files were processed and sorted, and duplicates were removed, by using Picard 1.126. The mpileup in SAMtools was finally used to create the pileup file for SNP variant calls using the mpileup2snp in VarScan 2.3.7 [57]. Default parameters for VarScan were used except that min-avg-qual was set to 30.
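To make the read-filtering step concrete, the following Python sketch shows what a mean-quality and minimum-length filter amounts to. It is not PRINSEQ's implementation; it only mirrors the two thresholds named above (mean quality of 20, length of 65), and it assumes that quality strings use the Phred+33 encoding. The function names and example reads are hypothetical.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(seq: str, quality: str,
                  min_qual_mean: float = 20, min_len: int = 65) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough."""
    return len(seq) >= min_len and mean_phred(quality) >= min_qual_mean


# Hypothetical reads as (name, sequence, quality string) tuples.
reads = [
    ("good_read", "A" * 100, "I" * 100),   # 'I' encodes Q40
    ("too_short", "A" * 50, "I" * 50),     # fails the length threshold
    ("low_quality", "A" * 100, "#" * 100), # '#' encodes Q2
]

kept = [name for name, seq, qual in reads if passes_filter(seq, qual)]
print(kept)
```

Only the first read survives; the other two fail the length and the mean-quality threshold, respectively.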

PCR verification

As the comparative genomic analysis had identified some nucleotide sequences (AAEL01000413, AAEL01000728, and AAEL01000717) in the published C. hominis genome that had not been seen in the published C. parvum genome, primers were designed based on these sequences to verify their source by PCR (Additional file 6: Table S1). Five specimens each of C. parvum and C. hominis were used in PCR analysis of each target. In addition, two C. andersoni specimens were used to confirm the Cryptosporidium origin of contig AAEL01000728. Each specimen was analyzed in duplicate by nested PCR using a 50 μl PCR mixture consisting of 1 μl (100 ng) of extracted DNA or 2 μl of primary PCR products (in secondary PCR), 200 μM deoxynucleoside triphosphate, 1× PCR buffer (Applied Biosystems), 3.0 mM MgCl2, 5.0 U of Taq polymerase (Promega, Madison, WI), 100 nM primers, and 400 ng/μl of non-acetylated bovine serum albumin (Sigma-Aldrich, St. Louis, MO). The primary and secondary PCR reactions were performed in a GeneAmp PCR 9700 thermocycler (Applied Biosystems) for 35 cycles of 94°C for 45 s, 55°C for 45 s, and 72°C for 60 s, with an initial denaturation (94°C for 5 min) and a final extension (72°C for 7 min). The secondary PCR products were sequenced in both directions using the Sanger technology described above. Nucleotide sequences obtained were aligned with reference sequences downloaded from GenBank by using ClustalX.

NCBI BioProject No.

Nucleotide sequences generated from the project, including all SRA data and assembled contigs, were submitted to the NCBI BioProject under the accession number PRJNA252787.

Ethics statement

The study was done on delinked residual diagnostic specimens. It was covered by Human Subjects Protocol No. 990115 “Use of residual human specimens for the determination of frequency of genotypes or sub-types of pathogenic parasites”, which was reviewed and approved by the Institutional Review Board of the Centers for Disease Control and Prevention (CDC). No personal identifiers were associated with the specimens at the time of submission for diagnostic service at CDC.


The authors thank Otto van Poeselaere, Sabine Van Leirberghe and Lucas N. Davey for stimulating discussions during the preparation of this manuscript. We acknowledge access to the Syngenta Musa 3'EST database, donated by Syngenta to Bioversity International within the framework of the Global Musa Genomics Consortium. We thank Bioversity International, Dr. Gerard Ngoh-Newilah of CARBAP, Djombe, Cameroon, Dr. Angela Kepler of Pacific-Wide Ecological Consulting, Hawaii, and the late Dr. Lois Engelberger of Pohnpei for providing samples of fruit. We thank the Ministry of Higher Education, Malaysia, for University of Malaya grants RG006-09BIO, PV109/2011A and FRGS grant FP005-2011A to JAH, GR and NZK. We would like to thank Wendy Chin Yi Wen from Plant Biotechnology Research Laboratory, University of Malaya for providing the embryogenic cell suspension. Finally the authors would like to thank Mathieu Rouard from Bioversity International, Montpellier for constructing the website to host the data generated here.

Electronic supplementary material is available online at

Published by the Royal Society under the terms of the Creative Commons Attribution License, which permits unrestricted use, provided the original author and source are credited.



Influenza Virus Genome Sequencing and Genetic Characterization

Influenza viruses are constantly changing; in fact, all influenza viruses undergo genetic changes over time (for more information, see How the Flu Virus Can Change: "Drift" and "Shift"). An influenza virus's genome consists of all the genes that make up the virus. CDC conducts year-round surveillance of circulating influenza viruses to monitor changes to the genome (or parts of the genome) of these viruses. This work is performed as part of routine U.S. influenza surveillance and as part of CDC's role as a World Health Organization (WHO) Collaborating Center for Reference and Research on Influenza. The information CDC collects from studying genetic changes (also known as "substitutions," "variants" or "mutations") in influenza viruses plays an important public health role by helping to determine whether vaccines and antiviral drugs will work against currently circulating influenza viruses, as well as helping to determine the potential for influenza viruses in animals to infect humans.

Genome sequencing reveals the sequence of the nucleotides in a gene, like alphabet letters in words. Nucleotides are organic molecules that form the structural building blocks of nucleic acids, such as RNA or DNA. All influenza viruses consist of single-stranded RNA as opposed to double-stranded DNA. The RNA genes of influenza viruses are made up of chains of nucleotides that are bonded together and coded by the letters A, C, G and U, which stand for adenine, cytosine, guanine, and uracil, respectively. Comparing the order of nucleotides in one virus gene with the order of nucleotides in a different virus gene can reveal variations between the two viruses.

Genetic variations are important because they can affect the structure of an influenza virus's surface proteins. Proteins are made of sequences of amino acids.

The substitution of one amino acid for another can affect properties of a virus, such as how well a virus transmits between people, and how susceptible the virus is to antiviral drugs or current vaccines.
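How a single nucleotide change can lead to an amino acid substitution can be shown with a small Python sketch. The two RNA sequences below are invented for illustration and are not real influenza genes; the codon table is the standard genetic code, written compactly, and the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate an RNA sequence codon by codon ('*' marks a stop codon)."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


def substitutions(ref: str, other: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference letter, new letter) for each difference."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(ref, other), start=1) if a != b]


reference = "AUGAAAGGCUUU"  # made-up sequence
variant = "AUGAGAGGCUUU"    # same sequence with one nucleotide changed

print(substitutions(reference, variant))                        # nucleotide level
print(substitutions(translate(reference), translate(variant)))  # protein level
```

Here one A-to-G change at nucleotide 5 turns the second amino acid from lysine (K) into arginine (R). Not every nucleotide change has this effect: changing AAA to AAG, for example, still encodes lysine.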

Influenza A and B viruses, the primary influenza viruses that infect people, are RNA viruses that have eight gene segments. These genes contain "instructions" for making new viruses, and it's these instructions that an influenza virus uses once it infects a human cell to trick the cell into producing more influenza viruses, thereby spreading infection.

Influenza genes consist of a sequence of molecules called nucleotides that bond together in a chain-like shape. Nucleotides are designated by the letters A, C, G and U.

Genome sequencing is a process that determines the order, or sequence, of the nucleotides (i.e., A, C, G and U) in each of the genes present in the virus's genome. Full genome sequencing can reveal the approximately 13,500-letter sequence of all the genes of the virus's genome.

Each year CDC performs whole genome sequencing on about 7,000 influenza viruses from original clinical samples collected through virologic surveillance. An influenza A or B virus's genome contains eight gene segments that encode (i.e., determine the structure and features of) the virus's 12 proteins, including its two primary surface proteins: hemagglutinin (HA) and neuraminidase (NA). An influenza virus's surface proteins determine important properties of the virus, including how the virus responds to certain antiviral drugs, the virus's genetic similarity to current influenza vaccine viruses, and the potential for zoonotic (animal origin) influenza viruses to infect human hosts.

Genetic Characterization

CDC and other public health laboratories around the world have been sequencing the genes of influenza viruses since the 1980s. CDC contributes gene sequences to public databases, such as GenBank and the Global Initiative on Sharing Avian Influenza Data (GISAID), for use by public health researchers. The resulting libraries of gene sequences allow CDC and other laboratories to compare the genes of currently circulating influenza viruses with the genes of older influenza viruses and viruses used in vaccines. This process of comparing genetic sequences is called genetic characterization. CDC uses genetic characterization for the following reasons:

  • To determine how closely "related" or similar flu viruses are to one another genetically
  • To monitor how flu viruses are evolving
  • To identify genetic changes that affect the virus's properties. For example, to identify the specific changes that are associated with influenza viruses spreading more easily, causing more-severe disease, or developing resistance to antiviral drugs
  • To assess how well an influenza vaccine might protect against a particular influenza virus based on its genetic similarity to the vaccine virus
  • To monitor for genetic changes in influenza viruses circulating in animal populations that could enable them to infect humans.

The relative differences among a group of influenza viruses are shown by organizing them into a graphic called a 'phylogenetic tree.' Phylogenetic trees for influenza viruses are like family (genealogy) trees for people. These trees show how closely 'related' individual viruses are to one another. Viruses are grouped together based on whether their genes' nucleotides are identical or not. Phylogenetic trees of influenza viruses will usually display how similar the viruses' hemagglutinin (HA) or neuraminidase (NA) genes are to one another. Each sequence from a specific influenza virus has its own branch on the tree. The degree of genetic difference (number of nucleotide differences) between viruses is represented by the length of the horizontal lines (branches) in the phylogenetic tree. The further apart viruses are on the horizontal axis of a phylogenetic tree, the more genetically different the viruses are to one another.

Figure. A phylogenetic tree.
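The "number of nucleotide differences" that branch lengths represent can be sketched as a simple pairwise count. This is only an illustration under stated assumptions: the sequences are already aligned to the same length, the virus names and 12-letter sequences are made up, and real tree-building software uses more elaborate evolutionary models than a raw count.

```python
from itertools import combinations


def nucleotide_differences(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def pairwise_differences(seqs: dict[str, str]) -> dict[tuple[str, str], int]:
    """Difference count for every pair of named sequences."""
    return {
        (n1, n2): nucleotide_differences(seqs[n1], seqs[n2])
        for n1, n2 in combinations(seqs, 2)
    }


# Hypothetical, pre-aligned toy sequences (not real HA or NA genes).
toy = {
    "virus_A": "AUGAAGACUAUC",
    "virus_B": "AUGAAGACUAUU",
    "virus_C": "AUGAGGACCAUU",
}

for pair, n in pairwise_differences(toy).items():
    print(pair, n)
```

In this toy set, virus_A and virus_B differ at one position while virus_A and virus_C differ at three, so virus_C would sit further from virus_A along the horizontal axis of a tree.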

For example, after CDC sequences an influenza A(H3N2) virus collected through surveillance, the virus sequence is cataloged with other virus sequences that have a similar HA gene (H3), and a similar NA gene (N2). As part of this process, CDC compares the new virus sequence with the other virus sequences, and looks for differences among them. CDC then uses a phylogenetic tree to visually represent how genetically different the A(H3N2) viruses are from each other.
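Looking for differences between a new virus sequence and an existing one can be sketched as listing the positions where the letters disagree. The helper `substitutions`, the label format (reference letter, 1-based position, new letter) and both sequences are assumptions made for illustration; this is not a description of CDC's actual tooling.

```python
def substitutions(reference: str, query: str) -> list[str]:
    """List differences between two aligned sequences, e.g. 'A5G'.

    Each label gives the reference letter, the 1-based position,
    and the letter found in the query sequence.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{r}{pos}{q}"
        for pos, (r, q) in enumerate(zip(reference, query), start=1)
        if r != q
    ]


# Hypothetical aligned fragments, for illustration only.
cataloged = "AUGAAGACUAUC"
new_virus = "AUGAGGACCAUU"
print(substitutions(cataloged, new_virus))
```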

CDC performs genetic characterization of influenza viruses year round. This genetic data is used in conjunction with virus antigenic characterization data to help determine which vaccine viruses should be chosen for the upcoming Northern Hemisphere or Southern Hemisphere influenza vaccines. In the months leading up to the WHO vaccine consultation meetings in February and September, CDC collects influenza viruses through surveillance and compares the HA and NA gene sequences of current vaccine viruses against those of circulating flu viruses. This is one way to assess how closely related the circulating influenza viruses are to the viruses the seasonal flu vaccine was formulated to protect against. As viruses are collected and genetically characterized, differences can be revealed.

For example, sometimes over the course of a season, circulating viruses will change genetically, which causes them to become different from the corresponding vaccine virus. This is one indication that a different vaccine virus may need to be selected for the next flu season's vaccine, although other factors, including antigenic characterization findings, heavily influence vaccine decisions. The HA and NA surface proteins of influenza viruses are antigens, which means they are recognized by the immune system and are capable of triggering an immune response, including production of antibodies that can block infection. Antigenic characterization refers to the analysis of a virus's reaction with antibodies to help assess how it relates to another virus.

Methods of Flu Genome Sequencing

A single influenza sample contains many influenza virus particles that were grown in a test tube. Across this whole population of sibling viruses, the particles often have small genetic differences from one another.

Traditionally, scientists have used a sequencing technique called "the Sanger reaction" to monitor influenza evolution as part of virologic surveillance. Sanger sequencing identifies the predominant genetic sequence among the many influenza viruses found in an isolate. This means small variations in the population of viruses present in a sample are not reflected in the final result. Scientists often use the Sanger method to conduct partial genome sequencing of influenza viruses, while newer technologies (see next paragraph) are better suited for whole genome sequencing.

Over the past five years, CDC has been using "Next Generation Sequencing (NGS)" methodologies, which have greatly expanded the amount of information and detail that sequencing analysis can provide. NGS uses advanced molecular detection (AMD) to identify gene sequences from each virus in a sample. Therefore, NGS reveals the genetic variations among many different influenza virus particles in a single sample, and these methods also reveal the entire coding region of the genomes. This level of detail can directly benefit public health decision-making in important ways, but data must be carefully interpreted by highly-trained experts in the context of other available information. See AMD Projects: Improving Influenza Vaccines for more information about how NGS and AMD are revolutionizing flu genome mapping at CDC.
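The contrast between a single predominant sequence and per-particle variation can be sketched with a toy set of reads. This is a simplified illustration, not either laboratory method: it assumes the reads are already aligned and equal in length, and the function names, the `min_fraction` threshold and the read data are all hypothetical.

```python
from collections import Counter


def column_counts(reads: list[str]) -> list[Counter]:
    """Count the letters seen at each position across aligned reads."""
    if len({len(r) for r in reads}) != 1:
        raise ValueError("reads must be aligned to the same length")
    return [Counter(column) for column in zip(*reads)]


def consensus(reads: list[str]) -> str:
    """Predominant letter at each position (what a single summary sequence keeps)."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(reads))


def minor_variants(reads: list[str], min_fraction: float = 0.1):
    """Non-predominant letters at or above min_fraction, as (position, letter, fraction)."""
    found = []
    for pos, counts in enumerate(column_counts(reads), start=1):
        total = sum(counts.values())
        major = counts.most_common(1)[0][0]
        for base, n in counts.items():
            if base != major and n / total >= min_fraction:
                found.append((pos, base, n / total))
    return found


# Hypothetical aligned reads from one sample; one read carries a G at position 4.
reads = ["AUGAAG", "AUGAAG", "AUGAAG", "AUGGAG", "AUGAAG"]
print(consensus(reads))
print(minor_variants(reads))
```

The consensus hides the read that differs at position 4, while the per-position counts keep it visible as a variant present in one of the five reads.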

