How do you merge SNP data with a reference genome?

How do you merge SNP data with a reference genome?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

My Data

I have a 23andMe file listing SNPs in the form:

rsid chromosome position genotype rsXXXXX 1 PPPPPP CT rsXXXXX 1 PPPPPP GG

Fields are TAB-separated and each line corresponds to a single SNP. For each SNP, four fields of data are supplied.

  1. An identifier (an rsid or an internal id)
  2. Its location on the reference genome.
    • The chromosome it is located on.
    • The position within the chromosome is is located on.
  3. The genotype call oriented with respect to the plus strand on the human reference sequence.

The reference genome is the human assembly build 37 (also known as Annotation Release 104).

My Question

How do I merge the SNPs into the reference genome?

For example, take the first line in my SNP file:


Part 1

I can see that I need to replace the nucleotide at position PPPPPP on chomosome 1 of the reference genome with a nucleotide from the genotype field, but which nucleotide am I supposed to use? C or T? And why?

Part 2

Where am I supposed to start counting from on the reference genome? Looking at chromosome 1 of the human assembly build 37, the first ~10,000 characters (excluding the first line description) areN. Is the first N number 1? eg. If PPPPPP was 100,000 would I replace the 100,000th character in the reference genome with the correct nucleotide from Part 1 of this question? Or should I start counting from the first non-N character in the fasta file?

First, you need to know which genome sequence does the SNP file refer to. They must have mentioned the reference sequence that they used.

As others mentioned the case ofCTis heterozygosity. If you just want to mark the changes then discard the residue that is already present in the reference genome and use the other allele. However, you want to keep a track of the haplotype then you have make sure that a set of SNPs come from the same chromatid. This is difficult- you might still be able to know it for SNPs that are close enough to be mapped by a single read but it is almost impossible for SNPs that are separated well enough.

As Endre said, you have to start from the first nucleotide. However, it seems dubious that you are getting $ (NNNN)_n$ in the beginning of chromosome 1. Complete assembled chromosomes do not have such stretches. Below are the first 10 lines of the chromosome 1 fasta file. See for yourself.


How to replace $N^{th}$ residue is quite a straightforward task. But that is a programming question and not the scope of this forum. Assuming that you have solved the part 1 problem and have a tab separated sorted file like this:

chromosome position residue 1 79989 G 1 100232 T 3 341342 A

This script may not be the best but would work in a linux/*nix/Cygwin terminal, to replace the residues (ensure that you havegawk version >=4.0):

gawk -F "	" '(FNR==1){x++} (x==1){a[$1][$2]=$3;next} (x==2){if($0~/>/){h=$0;sub(/^.*chromosome /,"",h);sub(/ .*/,"",h)} else{seq[h]=seq[h]$0}} END{for(i in a){s=0; for(j in a[i]){m=m substr(seq[i],s,j-1) a[i][j];s=j+1} m=m substr(seq[i],s); print ">Chr"i"
"m}}' SNP_file Genome.fa | fold -w 60

Genetics 101, you have 2 copies of all your DNA at every position, one copy from your mother, one from your father. So for the "CT" one, you have one copy with a C, and one with a T.

And yea, it's normal for the first several thousand, or million letters to be N's. The genome is repetitive and yucky there, but it's counted for numbering purposes anyway.

Honestly, I wouldn't do this with a giant text file of the genome. Just search for your SNP in using the rs number, and you'll get the SNP, and some flanking sequence, and some context. Search it in PubMed if you want to see if it's ever turned up in any publication

Part 1:

According to Lior Pachter, 23andme data are not phased. Which means that for each entry in the genotype field, you do not know which chromosome copy it came from. This happens since modern microarray platforms are not able to tell which of the two copies of a chromosome a snp came from.

You can solve this problem for most snps by comparing your alleles to the reference genome, but this would take some programming chops. You could use as an example, which does the same thing, but for plink files.

Part 2:

I'm assuming that you would like to do this programatically, and not by copying and pasting the snps into the reference genome.

The short answer is that the first N is the first nucleotide. But, you should rather use a package such as Biopython to do the counting for you, it might be knottier than you think (you need to adjust for line endings in the fasta file, for example).

How do you merge SNP data with a reference genome? - Biology

Brief description of all scripts used in Picard and Gehring, 2017 Genome Biology. All scripts written by Colette L Picard (cpicard AT mit DOT edu) and licensed under the Apache License, version 2.0:

Copyright 2017 Colette L Picard

Licensed under the Apache License, Version 2.0 (the "License") you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Any questions or issues can be directed to CLP. Some scripts require additional tools to be installed, which will be indicated where possible.

Note that all scripts noted here can be called without arguments for more detail on options and usage. Any scripts not described here are helper scripts required by one or more of these primary scripts, but are not separately described.

v.1.3, python script, requires Python 2, tested on 2.7.6 - required packages sys, os, argparse, re

v.1.2, R script, requires R, tested on 3.3.2 - requires package optparse

v.1.0, python script, requires Python 2, tested on 2.7.6 - required packages sys, os, argparse, re, matplotlib, numpy, scipy

v.1.7, bash script, requires Python 2, tested on 2.7.6, and R, tested on 3.3.2 - required helper scripts (must be in same directory as this script): - - by Colette L Picard - - by Colette L Picard - ends_analysis_make_plot.R - by Colette L Picard - - by Colette L Picard (only required if using -M or -C options) - required installed on user PATH: - bedtools (tested on v2.23.0)

v.1.0, bash script - required helper scripts (must be in same directory as this script): - merge_by_column.R (by Colette Picard) - fishers_exact.R (by Colette Picard) - required installed on user PATH: - bedtools (tested on v2.23.0)


The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMer4 one the most versatile genome alignment packages available.

Citation: Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A (2018) MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14(1): e1005944.

Editor: Aaron E. Darling, University of Technology Sydney, AUSTRALIA

Received: August 15, 2017 Accepted: January 1, 2018 Published: January 26, 2018

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability: The data used for this paper is available from the NCBI SRA, and from the Cold Spring Harbor Laboratory web site

Funding: This research was supported in part by the U.S. National Institutes of Health under grant R01 GM083873 to Steven Salzberg, in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to Carl Kingsford, and in part by National Science Foundation Grants IOS-1238231 to Jan Dvorak, IOS-144893 to Herbert Aldwinckle, Keithanne Mockaitis, Aleksey Zimin, James Yorke and Marcela Yepes. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

This is a PLOS Computational Biology Software paper.

Putatively functional variation

When we restricted analyses to the variants most likely to affect gene function, we found a typical genome contained 149–182 sites with protein truncating variants, 10,000 to 12,000 sites with peptide-sequence-altering variants, and 459,000 to 565,000 variant sites overlapping known regulatory regions (untranslated regions (UTRs), promoters, insulators, enhancers, and transcription factor binding sites). African genomes were consistently at the high end of these ranges. The number of alleles associated with a disease or phenotype in each genome did not follow this pattern of increased diversity in Africa (Extended Data Fig. 4): we observed ∼ 2,000 variants per genome associated with complex traits through genome-wide association studies (GWAS) and 24–30 variants per genome implicated in rare disease through ClinVar with European ancestry genomes at the high-end of these counts. The magnitude of this difference is unlikely to be explained by demography 10,11 , but instead reflects the ethnic bias of current genetic studies. We expect that improved characterization of the clinical and phenotypic consequences of non-European alleles will enable better interpretation of genomes from all individuals and populations.


Genome-Wide SNP Analysis

All genomes had an average coverage of at least 29.6×, except DAL972 which had 5.7× coverage and was the only genome in the data set sequenced with the Sanger method ( supplementary table S2 , Supplementary Material online). In total, 890,170 SNPs were called in the genomes of the 56 Trypanosoma strains and 194,566 passed our filtering criteria. The filtered SNPs were used to construct a Neighbournet network ( fig. 1), a haplotype-based clustering analysis ( fig. 2), and a RAxML Maximum Likelihood tree ( supplementary fig. S1 , Supplementary Material online).

—NeighborNet network based on 194,566 genome-wide SNP loci in 3 Trypanosoma brucei gambiense group 1, 3 T. b. gambiense group 2, 17 Trypanosoma brucei rhodesiense, 21 Trypanosoma brucei brucei, 8 Trypanosoma evansi, and 4 Trypanosoma equiperdum strains.

—NeighborNet network based on 194,566 genome-wide SNP loci in 3 Trypanosoma brucei gambiense group 1, 3 T. b. gambiense group 2, 17 Trypanosoma brucei rhodesiense, 21 Trypanosoma brucei brucei, 8 Trypanosoma evansi, and 4 Trypanosoma equiperdum strains.

—Coancestry matrix based on phased haplotype data. Heatmap summarizes the number of haplotype segments (color key on the right) that a given parasite received (rows) from any another parasite (columns). Individuals are ordered along each axis according to the tree (left) inferred from the fineSTRUCTURE run.

—Coancestry matrix based on phased haplotype data. Heatmap summarizes the number of haplotype segments (color key on the right) that a given parasite received (rows) from any another parasite (columns). Individuals are ordered along each axis according to the tree (left) inferred from the fineSTRUCTURE run.

All three analyses revealed a similar evolutionary history for T. evansi and T. equiperdum. The 6 T. evansi type A strains form a monophyletic cluster and exhibit only minor SNP variation over time and space although they were isolated from different animal species in Kenya, Ethiopia, Brazil, Indonesia, and China between 1980 and 2013. Within this cluster, the largest genomic difference was found between STIB810 and E110 with a total of only 2,534 SNP differences (homozygous and heterozygous). The African strains C13 and MU09 showed the lowest genomic difference (375 SNPs) and were more closely related to the Brazilian strain E110 than to the Asian strains STIB805, STIB810, and RoTat 1.2. The two T. evansi type B strains KETRI2479 and MU010 also form a monophyletic cluster, which has emerged separately from the ancestor of Western/Central African trypanosomes.

The T. equiperdum strains are genetically most related to the Eastern African T. brucei strains. The T. equiperdum strains Dodola 943, TeAp-N/D1, and OVI form a monophyletic cluster, closely related to the Kiboko T. b. brucei strains TREU927 and KETRI1738 and the T. b. rhodesiense strain EATRO 240. We observed only 27 SNP differences (homozygous and heterozygous) between the T. equiperdum genomes of Dodola 943 and TeAp-N/D1, 27 SNP differences between OVI and TeAp-N/D1, and 24 SNP differences between Dodola 943 and OVI. The T. equiperdum strain BoTat, isolated from a horse in Morocco is distinct from this monophyletic cluster and its genome is closely related to the T. b. brucei strain J10 isolated from a hyena in Zambia. Both BoTat and J10 show an uncertain ancestry and share haplotypes with T. b. rhodesiense EATRO 240, T. b. brucei TRUE972 and KETRI1738, and T. equiperdum Dodola 943, TeAp-N/D1 and OVI ( fig. 2). To a lower extent, they also share haplotypes with Eastern and Western African T. brucei strains.

Subgroup-Specific SNPs

SNPs unique for T. evansi type A, T. evansi type B, T. equiperdum BoTat and the T. equiperdum Dodola 943, TeAp-N/D1, and OVI monophyletic cluster were identified. We only included mutations that differed from the homozygous reference state (compared with the reference genome TREU927) by being homozygous for the alternative allele. The complete list of SNPs for each studied subgroup is presented in supplementary table S3 , Supplementary Material online. We identified 354 SNPs that are unique to the monophyletic T. equiperdum cluster with Dodola 943, TeAp-N/D1 and OVI, and that did not occur in any other of the 53 strains of this study. Of the 354 SNPs, 224 were in coding regions of which 109 were nonsynonymous substitutions. In the T. equiperdum BoTat strain, 1,425 unique SNPs were observed, of which 850 in coding regions and 429 were nonsynonymous substitutions. Only five unique SNPs were shared by all T. equiperdum genomes, including the distinct BoTat genome. For T. evansi type B we detected 701 unique SNPs of which 454 in coding regions and 238 were nonsynonymous substitutions. An overview of the subgroup specific SNPs is presented in supplementary table S2 , Supplementary Material online.

Human Serum Resistance Genes, Diagnostic VSG’s, and F1-ATP Synthase Subunit γ

The TgsGP gene was detected in all T. brucei gambiense group 1 strains and in none of the other trypanosomes. The T. b. gambiense group 1 specific S210 codon in the TbHpHbR gene was also unique for all T. b. gambiense group 1 strains, whereas the other strains in this study coded for L210 in the TbHpHbR gene. All T. b. rhodesiense genomes contained SRA, except EATRO240. Surprisingly, the T. b. rhodesiense specific SRA gene was also detected in the T. b. brucei strains H883 and STIB213 isolated respectively from a dog in Uganda and a hyena in Tanzania. RoTat 1.2 was found in all T. evansi type A strains and not in any other strain. In contrast, our data show that the VSG JN 2118HU, considered to be unique to T. evansi type B, is also present in the T. b. gambiense type 2 strains ABBA, TH126 and STIB 386, and the T. b. brucei strains B8/18 Clone B, KP33 Clone 16 and TSW187/78E. The sequences for JN 2118HU were not identical in all genomes but there was not a single SNP identified that is unique for T. evansi type B. F1-ATP synthase subunit γ DNA and amino acid sequences were aligned for all genomes included in this study ( supplementary figs. S2 and S3 , Supplementary Material online). The nonsynonymous heterozygous substitution C142C/T (R48R/G) and heterozygous deletion GCT841del (A281del) are unique for all T. evansi type A strains, and the heterozygous A844A/T (M282M/L) for all T. evansi type B strains. The nonsynonymous homozygous substitution G817C (A273P) is unique for the T. equiperdum BoTat strain.

SNiPloid: A Utility to Exploit High-Throughput SNP Data Derived from RNA-Seq in Allopolyploid Species

High-throughput sequencing is a common approach to discover SNP variants, especially in plant species. However, methods to analyze predicted SNPs are often optimized for diploid plant species whereas many crop species are allopolyploids and combine related but divergent subgenomes (homoeologous chromosome sets). We created a software tool, SNiPloid, that exploits and interprets putative SNPs in the context of allopolyploidy by comparing SNPs from an allopolyploid with those obtained in its modern-day diploid progenitors. SNiPloid can compare SNPs obtained from a sample to estimate the subgenome contribution to the transcriptome or SNPs obtained from two polyploid accessions to search for SNP divergence.

1. Introduction

The advent of high-throughput sequencing technologies is revolutionizing our ability to discover and exploit single-nucleotide polymorphisms (SNPs). Polyploidy occurs in many animals and plants but is particularly widespread in flowering plants, including many major crops. However, most methods used to discover and validate predicted SNPs are optimized for diploid species, so specific challenges related to polyploidy remain to be addressed.

Many polyploid plants including coffee (Coffea arabica), wheat (Triticum durum Desf.), cotton (Gossypium hirsutum L.), and peanut (Arachis hypogaea L.) are allopolyploids and contain two or more distinct genomes (homoeologous chromosomes) after interspecific hybridization between related diploid species and chromosome doubling. As a consequence, allopolyploid genomes hold different copies of the most of their genes and genomic merger and doubling leads to an extensive array of genomic effects, including alterations in the expression of these duplicate genes (“homoeologs”). In an allopolyploid, the chromosomes derived from different parental species do not pair at meiosis and the gene copies, “homoeoalleles” or “homoeologs,” derived from different parental species have no allelic relationships and can consequently be distinguished from true alleles. In other words, sequence variation between subgenomes coexists with allelic variation within subgenomes. Accurate identification of homoeoSNPs (i.e., polymorphisms that occurred in only one of the subgenomes) in tetraploid sequence data is a challenge due to the coassembly of homoeologs. In a co-assembly, single nucleotide differences between the two subgenomes could be confused with SNP at a single locus.

The sequencing of transcripts using high-throughput sequencing methods (RNA-Seq) can provide fresh insights into polyploid biology [1]. Typically, the reads from a given allopolyploid are aligned to a reference transcriptome. Then, if the allele sequences of the diploid progenitor species can be sampled, it is possible to infer the genome origin of the identified SNPs and to estimate the contribution of the homoeologous genes to the total transcript level.

Here we present a new tool, SNiPloid, that can tackle the many aspects involved in the analysis of SNPs in the context of allopolyploidy. Based on the coassembly of homoeologs, SNiPloid compares either putative SNPs detected from an allopolyploid to those obtained in its parental genomes, or putative SNPs derived from two allopolyploid accessions to search for polymorphism. SNiPloid web server and source code (downloadable under the CeCILL public license) are accessible at

2. Methods

2.1. Data Preprocessing

Before interpreting the results of RNA-Seq data using SNiPloid, data preprocessing is required. Biologists can preprocess their data through the Galaxy public server ( as described in Figure 1.

Data preprocessing. Before launching SNiPloid, each individual sample needs to be preprocessed by successively running mapping alignments and SNP calling.

SNiPloid assumes that short reads datasets (i.e., samples) derived from unique single genotype or distinct accessions (diploid or polyploid) are separately aligned against a single diploid transcriptome reference corresponding to one of the parental diploids using dedicated mapping software such as BWA [2], Soap [3], or Bowtie [4].

Mapping alignment is a key step in data preprocessing and mapping parameters need to be adjusted and optimized to best fit the single diploid genome used as reference. Actually, since the reference diploid transcriptome is more closely related to one of the two subgenomes in the tetraploid, it might have collateral effects on the mapping efficiency and indirectly cause biases in the interpretation of the SNP, notably when analyzing the relative homoeologous gene expression represented by the contribution of subgenomes to total gene expression.

The SNiPloid utility uses the power of the Variant Call Format (VCF) which lists SNP variations and assigns alleles for each sequenced sample, by comparison with a reference sequence [5]. The VCF format is now widely recognized and is a standard format output of numerous SNP calling softwares. In this perspective, we suggest using the UnifiedGenotyper module in the GATK toolkit [6] for SNP discovery. A second type of input required by SNiPloid corresponds to a coverage depth file outputted by the Depth Of Coverage module of GATK. Optionally, SNP discovery and subsequent SNiPloid analysis can be improved by running the GATK ReadBackedPhasing utility to determine potential associations between alleles and produce phasing.

2.2. SNiPloid Utility

Inputs to the SNiPloid software consist of two different GATK outputs for each sample: (i) a VCF file listing putative SNPs and (ii) a coverage depth file (Figure 1). For each sample, the user can set the minimum depth coverage required to consider a position in the output statistics and the minimum minor allele frequency (MAF) required to consider the position as a variant.

SNiPloid comprises three main steps (Figure 2(a)). The first step of the utility consists in extracting regions that meet a minimum coverage depth threshold for each sample (previously set by the user) and then in identifying overlapping regions between samples. Subsequent analysis will be restricted to these regions for variant comparison. As a consequence, if putative SNPs show sufficient depth coverage in the allopolyploid but not in the diploid, or reciprocally, the position will not be processed.

(b) (a) SNiPloid procedure. For each reference sequence or gene of a diploid genome G2, SNiPloid extracts intervals that meet a minimal coverage depth threshold for each sample (1a) and identify overlapping intervals between samples (1b). It then extracts putative SNPs in both samples within these defined common regions (2) and compares the differences observed between samples in order to interpret the situation (3). (b) Phylogenetic contexts within a polyploidy genome and assignment of SNP categories.

In the second step also for each sample, SNiPloid extracts alleles from the VCF file for SNP positions within the defined common regions. In the third step the differences observed between samples are compared and the situation is interpreted.

Using its main functionality (“Polyploid versus parental diploid”), SNiPloid offers the option to compare, interpret, and cluster SNPs. Based on the coassembly of homoeologs, SNiPloid is able to infer the SNP genome origin and distinguish interspecific SNPs and homoeoSNPs (or genome specific SNP = HSV) [7] by comparing detected SNPs in the allopolyploid to the corresponding nucleotides in both modern parental diploid genomes. SNiPloid thus classifies SNPs in different categories by hypothesizing evolution patterns as follows (Figure 2(b)). (i) Patterns 1 and 2 correspond to interspecific SNPs and are assigned if an allele is specific to one of the parental genomes. The mutation occurred after the polyploidization event (e.g., diploid1 A/A, diploid2 G/G, and tetraploid G/G). (ii) Pattern 5 corresponds to putative homoeoSNPs because the same variation is observed in tetraploids and between parental genomes (e.g., diploid1 A/A, diploid2 G/G, and tetraploid A/G). With this pattern, SNiPloid identifies in which subgenome the homoeoallele resides by using diploid sequence alleles. In the second step, by retrieving and combining allelic depths for the reference and alternate alleles provided in the VCF format, it can estimate the subgenome contribution to the transcriptome for each homoeologous genes. (iii) Patterns 3 and 4 are attributed when the variation observed in the tetraploid is not identified between parental genomes (e.g., diploid1 A/A, diploid2 A/A, and tetraploid A/G). The mutation may have occurred in one of the subgenomes of the allotetraploid after the polyploidization event. With a mixture of reads originating from two subgenomes in the mapping of an allotetraploid, pattern 3 or 4 cannot be attributed without haplotype information, and a pattern “3 or 4” is assigned. In addition, SNiPloid can benefit from the phasing information included in the VCF file derived from the allotetraploid to infer the origin of an allele and distinguish between a hypothetical evolution pattern 3 or 4. Indeed, the VCF format anticipates the coding of allele phasing information (allele pairs specified by 0∣1 instead of 0/1 if phased with the previous polymorphism) in order to define haplotype blocks. Thus, if provided in the VCF, the phasing information can specify potential associations with SNP pattern 5 whose subgenome origin is known and thus distinguish between patterns 3 and 4. Basically, this process based on the haplotype makes it possible to identify putative subgenome specific SNPs.

3. Benefits

3.1. Web Application

SNiPloid is a component of the South Green Bioinformatics Platform ( and is accessible at as a specific utility of the SNiPlay application [8] for the analysis of allopolyploid species.

Alternatively, SNiPloid can be downloaded as a component of the Galaxy project [9], an open-source web based computational framework that allows easy incorporation of different tools. By downloading this package, it is also possible to run the utility by command line, meaning users can manage more voluminous input datasets.

3.2. SNiPloid Outputs

The Web application allows the export of the detailed list of classified SNPs in a tabulated format. At the end of the process, the program summarizes the analysis by counting the different SNP classes for each gene/contig of the reference dataset and by reporting the results in a dynamic sortable table (Figure 3(a)) so that users can easily classify and retrieve SNP classes of interest. For genes presenting at least one SNP class 5, an average ratio is given to obtain a global estimate of the subgenome contribution of the gene to the transcriptome.

SNiPloid outputs. (a) SNiPloid produces HTML outputs showing the number of predefined SNP categories and an approximate ratio of subgenome contribution to the transcriptome for each reference sequence. (b) SNiPloid is also able to generate a graphic image that shows the overall distribution of SNP categories and of subgenome contributions along the chromosomes.

In addition, when the objective is to calculate general statistics or SNP frequencies along the transcriptome, the counting of SNP categories can be reported to the number of positions taken into account for the analysis, that is, positions that had met the minimum coverage depth threshold defined by the user.

3.3. Comparison of Two Samples

Basically, the second option “Polyploid versus polyploid” of the application makes it possible to quickly distinguish and count specific and shared SNPs between two samples. The comparison can be made at three different levels: either between two samples originating from a single polyploid accession, or between two polyploid accessions, or more generally between two species. By using this functionality, new original approaches based on differential SNP can emerge for the study of the genome structure of polyploids or of the subgenome contribution to gene expression.

3.4. SNiPloid Map Viewer

Finally, SNiPloid includes a viewer that allows a graphical overview of the distribution of SNP categories and of subgenome contributions along the chromosomes (Figure 3(b)).

This functionality can only be applied on species for which a complete and fully annotated reference genome sequence is available and requires a structural genome annotation in General Feature Format (GFF) format as additional input, supplying the viewer program with the coordinates of gene models used as reference on the genome. The aim is to rapidly localize potential highly bias-expressed regions, introgressed genes, or homogenized regions within the genome.

3.5. Examples of Use Case

A whole transcriptome analysis was conducted on the allotetraploid Coffea arabica by using the SNiPloid software for the analysis of the contribution of subgenomes to the transcriptome [10]. This study enabled to characterize genome-wide homoeologous expression gene expression in C. arabica, a recent allopolyploid combining two subgenomes that derive from two closely related diploid species: C. canephora and C. eugenioides. Different samples of C. arabica obtained at contrasted temperatures and one C. eugenioides sample were mapped against the C. canephora reference transcriptome, analyzed for SNP discovery, before being compared with SNiPloid in order to estimate homoeologous gene expression and to highlight potential variation between growing conditions. Additionally, by mapping reads against the C. eugenioides transcriptome instead of C. canephora, this study showed that the relative homoeologous gene expression is slightly biased in favour of the genome used as reference, as anticipated above.

Sampled from this study, an example of datasets is provided by the SNiPloid Web server to familiarize users with the correct input and expected results.

3.6. Performance and Limitations

The main functionality of SNiPloid is dedicated to RNA-Seq data and to polyploid species for which a diploid transcriptome reference is available for at least one of the parents.

One limitation of the use of RNA-Seq for SNP detection and subsequent interpretation is that the transcript sequences represent only the expressed part of the genome and that the sequencing depth varies considerably across the genome due to the different gene expression levels. Thus, only SNPs in well-expressed genes can be detected and allele or homoeolog expression bias could make the detection of certain SNP difficult due to their low frequency in the transcriptome. However, NGS technologies and the use of appropriate read cutoffs allow to detect and interpret SNPs for a large number of genes distributed across the genome.

Theoretically, even though the allele expression quantification would not be performed, a genome wide analysis would be also possible on genomic data. However from a technical point of view, whole genome analysis would be difficult to perform through our Web server, since it requires uploading VCF and depths file inputs that would be sizeable and should be computed by command line after having downloaded the SNiPloid package or through Galaxy.

In terms of performance, in our practical experience two RNA-Seq samples derived from a polyploid and a diploid species first mapped against a complete reference transcriptome and then generating 600 000 putative SNPs each can be successfully compared by SNiPloid Web server in less than five minutes.

3.7. Comparison with Other SNP Bioinformatics Tools

Even though numerous SNP bioinformatics tools or pipelines exist for SNP calling (GATK [6], VarScan [11], WEP [12], and MiST [13]) or SNP annotation (SNPEff [14]) at a whole genome scale, only a few software packages allow to automatically categorize and interpret putative SNPs from polyploid species.

An example of pipeline reported by Hand et al. [15] predicts the subgenome-specific origin of SNPs using a phylogenetic approach based on comparison with orthologous sequences from predicted progenitor species. More recently a new pipeline called PolyCat [16] has been developed for mapping and categorizing NGS reads produced from allopolyploid organisms. Having the same aim as SNiPloid, the approach is a little bit different. PolyCat uses reads from diploids to generate preindexed homoeoSNPs that will be then used to assign reads from tetraploids to a subgenome. The subgenome attribution is made at the read level whereas SNiPloid manages the subgenome attribution by considering SNPs position by position, counting homoeoSNPs for each transcript of a whole transcriptome analysis.

This approach is relevant and more advanced but can appear slightly more fastidious to operate. The main advantage of SNiPloid is its ease to be applied since it does not require preliminary work to establish homoeoSNPs database that can be time-consuming, and offers to non-bioinformaticians a ready-to-use Web server allowing to rapidly obtain subgenome attribution thanks to a “one click” analysis.

In addition, our approach seems to be more appropriate for allopolyploid species for which the polyploidization event is relatively recent in the evolution such as Coffea or Spartina.

4. Conclusions

To our knowledge, SNiPloid is the first Web tool dedicated and optimized for the SNP analysis of RNA-Seq data obtained from an allopolyploid species. By exploiting the well-organized information stored in the standard VCF format, SNiPloid helps to interpret putative SNPs detected in a whole transcriptome by a comprehensive SNP categorization. SNiPloid is appropriate for allotetraploids and opens new prospects for investigating allopolyploid genome structure or expression.


  1. J. Higgins, A. Magusin, M. Trick, F. Fraser, and I. Bancroft, “Use of mRNA-Seq to discriminate contributions to the transcriptome from the constituent genomes of the polyploidy crop species Brassica napus,” BMC Genomics, vol. 13, article 247, 2012. View at: Google Scholar
  2. H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 25, no. 14, pp. 1754–1760, 2009. View at: Publisher Site | Google Scholar
  3. R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009. View at: Publisher Site | Google Scholar
  4. B. Langmead, “Aligning short sequencing read with Bowtie,” in Current Protocols in Bioinformatics, chapter 11, unit 11. 7, John Wiley & Sons, New York, NY, USA, 2010. View at: Publisher Site | Google Scholar
  5. “VCF format,”�ll𥈏ormat/vcf-variant-call-format-version-41. View at: Google Scholar
  6. A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010. View at: Publisher Site | Google Scholar
  7. S. Kaur, M. G. Francki, and J. W. Forster, “Identification, characterization and interpretation of single-nucleotide sequence variation in allopolyploid crop species,” Plant Biotechnology Journal, vol. 10, no. 2, pp. 125–138, 2012. View at: Publisher Site | Google Scholar
  8. A. Dereeper, S. Nicolas, L. Le Cunff et al., “SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects,” BMC Bioinformatics, vol. 12, article 134, 2011. View at: Publisher Site | Google Scholar
  9. J. Goecks, A. Nekrutenko, J. Taylor, and T. Galaxy Team, “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences,” Genome Biology, vol. 8, no. 8, article R86, 2010. View at: Publisher Site | Google Scholar
  10. M. C. . Combes, A. Dereeper, D. Severac, B. Bertrand, and P. Lashermes, “Contribution of subgenomes to the transcriptome and their intertwined regulation in the allopolyploid Coffea arabica grown at contrasted temperatures,” New Phytologist, vol. 200, no. 1, pp. 251–260, 2013. View at: Publisher Site | Google Scholar
  11. D. C. Koboldt, K. Chen, T. Wylie et al., “VarScan: variant detection in massively parallel sequencing of individual and pooled samples,” Bioinformatics, vol. 25, no. 17, pp. 2283–2285, 2009. View at: Publisher Site | Google Scholar
  12. M. D'Antonio, P. D. De Meo, D. Paoletti et al., “WEP: a high-performance analysis pipeline for whole-exome data,” BMC Bioinformatics, vol. 14, supplement 7, article S11, 2013. View at: Google Scholar
  13. S. Subramanian, V. Di Pierro, H. Shah et al., “MiST: a new approach to variant detection in deep sequencing datasets,” Genome Biology, vol. 11, no. 8, article R86, 2010. View at: Google Scholar
  14. P. Cingolani, A. Platts, L. Wang le et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w 1118 , iso-2, iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012. View at: Publisher Site | Google Scholar
  15. M. L. Hand, N. O. Cogan, and J. W. Forster, “Genome-wide SNP identification in multiple morphotypes of allohexaploid tall fescue (Festuca arundinacea Schreb),” BMC Genomics, vol. 13, article 219, 2012. View at: Publisher Site | Google Scholar
  16. J. T. Page, A. R. Gingle, and J. A. Udall, “PolyCat: a resource for genome categorization of sequencing reads from allopolyploid organisms,” G3, vol. 3, no. 3, pp. 517–525, 2013. View at: Google Scholar


Copyright © 2013 Marine Peralta et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The basic principles of SNP array are the same as the DNA microarray. These are the convergence of DNA hybridization, fluorescence microscopy, and solid surface DNA capture. The three mandatory components of the SNP arrays are: [3]

  1. An array containing immobilized allele-specific oligonucleotide (ASO) probes.
  2. Fragmented nucleic acid sequences of target, labelled with fluorescent dyes.
  3. A detection system that records and interprets the hybridization signal.

The ASO probes are often chosen based on sequencing of a representative panel of individuals: positions found to vary in the panel at a specified frequency are used as the basis for probes. SNP chips are generally described by the number of SNP positions they assay. Two probes must be used for each SNP position to detect both alleles if only one probe were used, experimental failure would be indistinguishable from homozygosity of the non-probed allele. [4]

A SNP array is a useful tool for studying slight variations between whole genomes. The most important clinical applications of SNP arrays are for determining disease susceptibility [5] and for measuring the efficacy of drug therapies designed specifically for individuals. [6] In research, SNP arrays are most frequently used for genome-wide association studies. [7] Each individual has many SNPs. SNP-based genetic linkage analysis can be used to map disease loci, and determine disease susceptibility genes in individuals. The combination of SNP maps and high density SNP arrays allows SNPs to be used as markers for genetic diseases that have complex traits. For example, genome-wide association studies have identified SNPs associated with diseases such as rheumatoid arthritis, [8] prostate cancer, [9] A SNP array can also be used to generate a virtual karyotype using software to determine the copy number of each SNP on the array and then align the SNPs in chromosomal order. [10]

SNPs can also be used to study genetic abnormalities in cancer. For example, SNP arrays can be used to study loss of heterozygosity (LOH). LOH occurs when one allele of a gene is mutated in a deleterious way and the normally-functioning allele is lost. LOH occurs commonly in oncogenesis. For example, tumor suppressor genes help keep cancer from developing. If a person has one mutated and dysfunctional copy of a tumor suppressor gene and his second, functional copy of the gene gets damaged, they may become more likely to develop cancer. [11]

Other chip-based methods such as comparative genomic hybridization can detect genomic gains or deletions leading to LOH. SNP arrays, however, have an additional advantage of being able to detect copy-neutral LOH (also called uniparental disomy or gene conversion). Copy-neutral LOH is a form of allelic imbalance. In copy-neutral LOH, one allele or whole chromosome from a parent is missing. This problem leads to duplication of the other parental allele. Copy-neutral LOH may be pathological. For example, say that the mother's allele is wild-type and fully functional, and the father's allele is mutated. If the mother's allele is missing and the child has two copies of the father's mutant allele, disease can occur.

High density SNP arrays help scientists identify patterns of allelic imbalance. These studies have potential prognostic and diagnostic uses. Because LOH is so common in many human cancers, SNP arrays have great potential in cancer diagnostics. For example, recent SNP array studies have shown that solid tumors such as gastric cancer and liver cancer show LOH, as do non-solid malignancies such as hematologic malignancies, ALL, MDS, CML and others. These studies may provide insights into how these diseases develop, as well as information about how to create therapies for them. [12]

Breeding in a number of animal and plant species has been revolutionized by the emergence of SNP arrays. The method is based on the prediction of genetic merit by incorporating relationships among individuals based on SNP array data. [13] This process is known as genomic selection.

Genome-wide genetic changes during modern breeding of maize

The success of modern maize breeding has been demonstrated by remarkable increases in productivity over the last four decades. However, the underlying genetic changes correlated with these gains remain largely unknown. We report here the sequencing of 278 temperate maize inbred lines from different stages of breeding history, including deep resequencing of 4 lines with known pedigree information. The results show that modern breeding has introduced highly dynamic genetic changes into the maize genome. Artificial selection has affected thousands of targets, including genes and non-genic regions, leading to a reduction in nucleotide diversity and an increase in the proportion of rare alleles. Genetic changes during breeding happen rapidly, with extensive variation (SNPs, indels and copy-number variants (CNVs)) occurring, even within identity-by-descent regions. Our genome-wide assessment of genetic changes during modern maize breeding provides new strategies as well as practical targets for future crop breeding and biotechnology.

Author information


L.K. Ernst Federal Science Center for Animal Husbandry, Dubrovitzy 60, Podolsk, Moscow, Russia, 142132

Alexander A. Sermyagin, Arsen V. Dotsev, Elena A. Gladyr, Alexey A. Traspov, Tatiana E. Deniskova, Olga V. Kostyunina, Gottfried Brem & Natalia A. Zinovieva

Institute of Genome Biology, Leibniz Institute for Farm Animal Biology (FBN), 18196, Dummerstorf, Mecklenburg-Vorpommern, Germany

Henry Reyer & Klaus Wimmers

Department of Animal Sciences, Food and Nutrition, Università Cattolica del Sacro Cuore, via Emilia Parmense 84, Piacenza, Italy

Russian Research Institute of Farm Animal Genetics and Breeding, Moskovskoe shosse 55a, St. Petersburg–Pushkin, Russia, 196601

Ivan A. Paronyan & Kirill V. Plemyashov

Division of Livestock Sciences, University of Natural Resources and Life Sciences, Gregor-Mendel-Straße 33, 1180, Vienna, Austria

Yakut Scientific Research Institute of Agriculture, 23/1, ul. Bestuzheva-Marlynskogo, Yakutsk, Sakha Republic, Russia, 677001

Institute of Animal Breeding and Genetics, University of Veterinary Medicine, Veterinärplatz 1, 1210, Vienna, Austria