I am new to reading raw DNA. When comparing two people's raw data, why does one person have a different SNP than the other, at the same location, on a specific chromosome? But on a different chromosome there will be the same SNP, at the same location, for both people.
Because that's how genetic variation works. SNPs are called single nucleotide polymorphisms for a reason: they are polymorphisms. This means that they are loci where different individuals will have a different nucleotide. This is precisely why they are studied and why we have databases of SNPs and the various genotypes they can manifest.
Remember that mutations can occur spontaneously in a single individual. In fact, they can appear spontaneously in a single cell. This means that if you sequence the genome of two different cells of the same individual it is actually possible that you will find small differences.
Finding small differences between different individuals is certain. Our genomes are not identical and while most differences will be in non-coding regions, you will also have SNPs within genes. How much of our phenotypic variation depends on such small differences is an active topic of research but it is safe to assume that small, single nucleotide changes cause at least some of the variation you see in the people around you.
In any case, what would be surprising is if two individuals were to share the exact same SNP forms for all SNPs in their genomes (I'm sure this happens, but I would expect it to be the case only for close relatives and probably not even then). This variation is the whole point of SNPs and the only reason we analyze them.
It sounds to me as though you have misunderstood something about the chromosomes that make up a human genome.
on a different chromosome [… ] at the same location
is not the same location at all. There is no surprise in finding 2 genomes to differ in one place, and not at another unrelated place.
edit If this is a mis-reading of your question, and you meant that the 2 persons' SNP was "at the same location" as the other person, then the rest of the answer is irrelevant. 'A SNP' refers to a location, given a name/identifier for the purpose of measuring variation with SNP-chips; necessarily the same location in any person, which may vary, but needn't always differ between everyone (come on, there are only 4 possible values). You should re-phrase your question "why do they differ at SNP 1 and not at SNP2?".
SNP means nothing more than 'a single basepair in the genome, that might vary in a population', so let's just talk of 'difference at a position'.
A single haploid set of the human genome consists of 23 chromosomes; 22 of these are quite 'ordinary', while 1 is involved in sex determination. Of course, most of our cells are diploid and so we have 2 of these sets, with pairs of Chr1, Chr2,… Chr22 and X/Y.
What I think your problem isn't (though you may have thought it was)
I haven't analysed large bits of genomic data but I believe heterozygous positions (positions at which one individual's paired chromosomes differ) are represented in that individual's sequence data. For this reason, I do not think your question is asking 'how can 2 persons have a difference at basepair x on one copy of Chr1, and no difference on the other copy of Chr1' - because both copies of Chr1 are merged as one person's dataset.
In fact it's impossible to assign (meaning to group into haplotypes) heterozygous sequence to either one copy or the other without more information, i.e. sequences from parents & grandparents, to see which variants are inherited together (are linked).
What I think your problem really is
I think you are asking 'how can 2 persons have a difference at basepair x on Chr1, and no difference at basepair x on Chr6 (say)?' This is easily answered. Basepair 100 of Chr1 is a completely different and unrelated position in the genome to basepair 100 of Chr6. There is no reason to expect that these positions should be related.
A helpful thought experiment
We can concatenate the chromosomes rather than re-setting our count at the beginning of each, then bp 100 of Chr6 will instead be denoted (approximately) bp 1,080,000,100 of the genome - this makes the difference in these positions crystal clear.
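The thought experiment can be sketched in a few lines of Python. The chromosome lengths below are assumed values rounded to the nearest megabase; the exact offset depends on which genome assembly you use, so the result only roughly matches the figure quoted above. The function name `genome_coordinate` is made up for this illustration.

```python
# Approximate human chromosome lengths in bp, rounded to the nearest Mb.
# These are assumed, illustrative values; real lengths depend on the assembly.
APPROX_LENGTHS = [
    ("Chr1", 249_000_000),
    ("Chr2", 242_000_000),
    ("Chr3", 198_000_000),
    ("Chr4", 190_000_000),
    ("Chr5", 182_000_000),
    ("Chr6", 171_000_000),
]


def genome_coordinate(chrom, pos, lengths=APPROX_LENGTHS):
    """Convert a 1-based position on one chromosome to a 1-based position
    on the genome obtained by joining the chromosomes end to end."""
    offset = 0
    for name, length in lengths:
        if name == chrom:
            if not 1 <= pos <= length:
                raise ValueError(f"{pos} is outside {chrom}")
            return offset + pos
        offset += length  # skip over every chromosome that comes before
    raise KeyError(chrom)


print(genome_coordinate("Chr1", 100))  # 100
print(genome_coordinate("Chr6", 100))  # 1061000100
```

With these rounded lengths, bp 100 of Chr1 stays at 100 while bp 100 of Chr6 lands more than a billion bases further along, which is why the two positions have nothing to do with each other.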
Multiple independent mechanisms link gene polymorphisms in the region of ZEB2 with risk of coronary artery disease
Background and aims: Coronary artery disease (CAD) arises from the interaction of genetic and environmental factors. Although genome-wide association studies (GWAS) have identified multiple risk loci and single nucleotide polymorphisms (SNPs) associated with risk of CAD, they are predominantly located in non-coding or intergenic regions and their mechanisms of effect are largely unknown. Accordingly, our objective was to develop a data-driven informatics pipeline to understand complex CAD risk loci, and to apply this to a poorly understood cluster of SNPs in the vicinity of ZEB2.
Methods: We developed a unique informatics pipeline leveraging a multi-tissue CAD genetics-of-gene-expression dataset, GWAS datasets, and other resources. The pipeline first dissected SNP locations and their linkage disequilibrium relationships, and progressed through analyses of tissue-specific expression quantitative trait loci, and then gene-gene, gene-phenotype, SNP-phenotype relationships. The pipeline concluded by exploring CAD-relevant gene regulatory networks (GRNs).
Results: We identified three independent CAD risk SNPs in close proximity to the ZEB2 coding region (rs6740731, rs17678683 and rs2252641/rs1830321). Our pipeline determined that these SNPs likely act in concert via the atherosclerotic arterial wall and adipose tissues, by governing metabolic and lipid functions. In addition, ZEB2 is the top key driver of a liver-specific GRN that is related to lipid levels, metabolic and anthropometric measures, and CAD severity.
Conclusions: Using a novel informatics pipeline, we disclosed the multi-faceted mechanisms of action of the ZEB2-associated CAD risk SNPs. This pipeline can serve as a roadmap to dissect complex SNP-gene-tissue-phenotype relationships and to reveal targets for tissue- and gene-specific therapeutic interventions.
Keywords: Atherosclerosis; Coronary artery disease; Genome-wide association study; ZEB2.
Copyright © 2020 Elsevier B.V. All rights reserved.
Conflict of interest statement
The authors declared they do not have anything to disclose regarding conflict of interest with respect to this manuscript.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
2. The SNP databases
The first was developed >5 years ago and contains only a partial listing of the many polymorphisms that exist between N2 and CB4856. Despite being incomplete, this database employs a straightforward interface and, for initial mapping purposes, is likely to be sufficient for most users. The newer database contains information that is somewhat more preliminary and the site is currently under development. However, since this database has been assembled using the complete sequence of CB4856, this site lists many more candidate SNPs than the original database. At present, the newer database may be best suited for finer mapping such as that commonly encountered in the later stages of 3-point SNP mapping or for determining endpoints using 2-point methods (described in Section 5). Both databases are described in some detail below.
The original C. elegans SNP database can be accessed at: http://genome.wustl.edu/genome/celegans/celegans_snp.cgi. Despite having only incomplete genome coverage, this resource provides a very useful inventory of many SNPs for the strains N2 and CB4856. This database is organized according to the physical map by chromosomes, chromosomal subsegments, and cosmids. For example, at the top of sequence Segment 9 on Chromosome X (click ‘Chromosome X Polymorphisms’ at the bottom of the page, then click ’, or go to http://genome.wustl.edu/genome/celegans/chromX_layout.html then click on ’), you will find the SNP B0403:33022 S=CT. This means the polymorphism is on cosmid B0403 at nucleotide position 33,022 and that the two strains differ in having either a C or T at this position. SNPs listed in red lettering have presumably been experimentally confirmed, whereas SNPs listed in white lettering are as yet unconfirmed. In fact, our lab has had at least one bad experience with a “confirmed” SNP; thus it is essential to make sure that any SNP you work with behaves as expected in your own hands.
Clicking on the red letters of B0403:33022 S=CT, we bring up an additional window that shows the actual sequences surrounding the SNP in black lettering (usually ∼ 500 bp upstream and downstream) as well as the SNP itself in red lettering [C/T]. This designation indicates that N2 contains a C at this position whereas CB4856 contains a T. Also, if it is an RFLP-type SNP, the top of this page will show predicted digestion sites for the displayed DNA sequence from N2 and CB4856 (listed here as “HA” for Hawaiian), using one or more enzymes. Looking at this, we notice that in the CB4856 background the presence of the T results in the sequence AGATCT, which is the recognition site for the restriction enzyme BglII. This enzyme cuts once in this segment of the CB4856 sequence and not at all in N2. Thus if we were to amplify this region from N2 and CB4856 worms using PCR and cut the PCR product with BglII, CB4856 would produce a doublet of two fragments of about 500 bp each, whereas N2 would run as a single band of 1,000 bp. The other enzymes listed as distinguishing this polymorphism (e.g., MnlI and MboI), although technically correct, are not of much practical use, as they cut many times in both N2 and CB4856 sequences. Therefore, discerning these two largely identical digestion patterns (using a standard agarose gel) would be difficult or impossible.
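The predicted banding pattern can be checked with a small in silico digest. The sketch below is only an illustration: the 1,000-bp amplicons are toy sequences made of filler bases, and the assumption that the SNP is the fourth base of AGATCT is ours, not taken from the database. BglII cuts after the first A of its site (A^GATCT).

```python
BGLII_SITE = "AGATCT"  # BglII recognition site, cut as A^GATCT
CUT_OFFSET = 1         # the cut falls after the first base of the site


def fragment_sizes(seq, site=BGLII_SITE, cut_offset=CUT_OFFSET):
    """Return fragment lengths after complete digestion of a linear sequence."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Toy amplicons: 497 filler bases, a 6-bp window around the SNP, 497 filler bases.
flank = "G" * 497
n2_amplicon = flank + "AGACCT" + flank      # C allele (N2): no BglII site
cb4856_amplicon = flank + "AGATCT" + flank  # T allele (CB4856): creates AGATCT

print(fragment_sizes(n2_amplicon))      # [1000]
print(fragment_sizes(cb4856_amplicon))  # [498, 502]
```

On a standard agarose gel the 498-bp and 502-bp fragments would be seen together at about 500 bp, while the uncut N2 product stays at 1,000 bp.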
Moving down to the unconfirmed SNP just below B0403, we find C36B7:21571 S=CT. The presence of a C in N2, and an A in CB4856, leads to the creation of a new site for the enzyme ApoI (consensus RAATTY, where R is an A or G and Y is a C or T; for a complete listing of abbreviations, see the back of the NEB catalog). Here we see that ApoI cuts five times in strain CB4856 (59, 405, 500, 638, 648). Directly above this, we see that the N2 digest is listed as “none”. Beware: this does not mean that CB4856 cuts five times with ApoI and not at all in N2! In fact, N2 cuts four times with ApoI (59, 405, 638, 648), just not at the middle position where the actual SNP is located (500). This is obviously misleading. By “none”, they just mean that the polymorphism results in no new enzyme sites that specifically cut the N2 sequence. Another thing to be aware of is that for non-palindromic sites, it may be the bottom (non-scripted) strand of DNA that is relevant.
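Two things in this paragraph lend themselves to a quick check: matching a degenerate site such as RAATTY, and working out which cut position actually distinguishes the strains. The following is a minimal sketch. The helper name `site_positions` and the short test sequence are invented for illustration; only the two lists of ApoI positions come from the text above.

```python
import re

# IUPAC codes needed for RAATTY (R = A or G, Y = C or T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}


def site_positions(seq, site):
    """Return 1-based start positions of a (possibly degenerate) site.

    Only the given strand is scanned. That is enough for RAATTY, which reads
    the same on both strands; a non-palindromic site would also need a scan
    of the reverse complement.
    """
    pattern = "(?=" + "".join(IUPAC[base] for base in site) + ")"
    return [match.start() + 1 for match in re.finditer(pattern, seq)]


# Made-up sequence containing GAATTC and AAATTT, both instances of RAATTY.
print(site_positions("CCGAATTCCCAAATTTCC", "RAATTY"))  # [3, 11]

# ApoI positions as listed in the text for the displayed segment.
n2_sites = [59, 405, 638, 648]
cb4856_sites = [59, 405, 500, 638, 648]

# Sites present in one strain only are the diagnostic ones.
diagnostic = sorted(set(n2_sites) ^ set(cb4856_sites))
print(diagnostic)  # [500]
```

Comparing the full site lists, rather than reading the “none” entry literally, makes it clear that both strains are cut and that only position 500 differs.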
Because many of the listed SNPs are not experimentally confirmed, the question arises: how many SNPs are actually real and is it possible to intuitively distinguish the real ones from the false ones? (The false ones are simply due to errors in the single sequencing reads of CB4856). For all non-confirmed SNPs, a probability index (Psnp) is given at the top of the page that contains the sequence information. For C36B7:21571, the Psnp is 0.9427, meaning that there is supposedly a 94% chance that the SNP is real based on the quality of the read. For a non-confirmed SNP, this is as good as it gets. In contrast, it is our experience that SNPs with Psnp indices below 0.5 are invariably bogus. Also note that non-confirmed nucleotide substitutions can now be cross referenced using the newer SNP database, described below. In addition to low-scoring substitutions, SNPs that result in single base-pair deletions or insertions within a run of repetitive nucleotides (e.g., A7 versus A8) are often suspect. Although some of these may turn out to be real, common sense dictates that sequencing errors are more likely to occur when attempting to distinguish between these sorts of differences than when comparing sequences such as ATG and ACG. Thus, you will want to use some discretion in your true/false predictions beyond the Psnp index. Of course, you will always want to substantiate any unconfirmed SNP before attempting any significant mapping exercises, no matter what the probability index or your intuition tells you.
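The rules of thumb in this paragraph can be written down as a rough triage function. This is a sketch of the heuristics only, not a feature of the database: the function names, the category labels, and the minimum run length of 4 are assumptions chosen for illustration. The 0.5 cut-off reflects the experience reported above.

```python
def in_homopolymer(ref, alt, min_run=4):
    """True if both alleles are runs of one base that differ only in length,
    e.g. AAAAAAA versus AAAAAAAA (A7 versus A8)."""
    same_base = len(set(ref + alt)) == 1
    return same_base and len(ref) != len(alt) and min(len(ref), len(alt)) >= min_run


def triage(psnp, ref, alt, min_psnp=0.5):
    """Rough classification of an unconfirmed SNP before any bench work."""
    if psnp < min_psnp:
        return "likely false"
    if in_homopolymer(ref, alt):
        return "suspect"
    return "worth testing"


print(triage(0.9427, "C", "T"))            # worth testing
print(triage(0.30, "ATG", "ACG"))          # likely false
print(triage(0.90, "A" * 7, "A" * 8))      # suspect
```

Whatever the label, an unconfirmed SNP still has to be verified experimentally before it is used for mapping.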
The primary advantage of this database is that, as mentioned above, it is based on the complete sequence of CB4856, and thus in theory should identify all known SNPs. However, as of this writing, the database identifies only nucleotide substitutions, not small deletions and insertions. Given that this latter class comprises a substantial proportion of the differences between N2 and CB4856, this database is currently incomplete. The site developers are aware of this deficiency and a fix should be available in the near future.
From the page accessed by the above link, under the pulldown menus for the “group” and “track” inputs, select “Custom Tracks” and “cb4856_snps”, respectively. Use the defaults for all other categories. Under the “region” heading, select “position” and enter a specific chromosomal nucleotide location range, e.g., chrIV:500000-550000. Note that specific nucleotide numbers corresponding to any region of interest can be obtained from WormBase. For example, entering the cosmid C32F10 on WormBase and carrying out a “clone” search reveals the genomic location of this cosmid to be “I:5,804,218..5,834,319”, which would be entered into the position box on the SNP site as chrI:5804218-5834319.
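If you need to convert many locations, the reformatting can be scripted. The sketch below assumes the location is given as a chromosome name plus two comma-grouped coordinates; the helper name `to_position` and the accepted input format are our assumptions, not part of either website.

```python
import re


def to_position(chromosome, location):
    """Turn e.g. ("I", "5,804,218..5,834,319") into "chrI:5804218-5834319"."""
    # Pull out the two coordinates and drop the thousands separators.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", location)]
    if len(numbers) != 2 or numbers[0] > numbers[1]:
        raise ValueError(f"expected a start and an end coordinate, got {location!r}")
    start, end = numbers
    return f"chr{chromosome}:{start}-{end}"


print(to_position("I", "5,804,218..5,834,319"))  # chrI:5804218-5834319
```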
For the output section, several formats are available. For example, “All fields from selected table” tabulates the changes and positions of the SNPs for that region. This output also provides a score of 40 for each SNP, where higher numbers indicate greater reliability. On average the database contains a false positive rate of ∼ 5%. Also very useful is the “sequence” option. This takes you to a new page where you can enter the number of nucleotides on either side of the SNP that you would like displayed. For example, entering “50” into both the upstream and downstream boxes and clicking on “get sequence” will produce a list of SNPs, each displaying a 101-nucleotide sequence (50 bp per line). In this case, the location of the actual SNP will be at position 51, or the first nucleotide on the second line. Note that the sequence displayed is always the N2 sequence; however, the specific change is indicated above the sequence. Thus, C/T would indicate that the nucleotide at position 51 is a “C” in N2 and a “T” in CB4856. These types of sequences can then be readily pasted into standard DNA analysis software to detect changes in RFLP patterns.
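Because only the N2 sequence is displayed, the CB4856 version has to be built by substituting the central base. A minimal sketch, assuming 50 nucleotides of flank on each side and a change written as N2/CB4856 (for example C/T); the function name and the toy sequence are invented for illustration.

```python
def both_alleles(n2_sequence, change, flank=50):
    """Return (N2, CB4856) sequences for a SNP centred in n2_sequence.

    n2_sequence: N2 sequence of length 2 * flank + 1, SNP at position flank + 1.
    change: string such as "C/T", meaning C in N2 and T in CB4856.
    """
    n2_base, cb4856_base = change.split("/")
    centre = flank  # 0-based index of position flank + 1
    if len(n2_sequence) != 2 * flank + 1:
        raise ValueError("sequence length does not match the flank size")
    if n2_sequence[centre] != n2_base:
        raise ValueError("central base does not match the N2 allele")
    cb4856_sequence = n2_sequence[:centre] + cb4856_base + n2_sequence[centre + 1:]
    return n2_sequence, cb4856_sequence


# Toy 101-nt sequence with the SNP at position 51.
toy_n2 = "A" * 50 + "C" + "G" * 50
n2_seq, cb4856_seq = both_alleles(toy_n2, "C/T")
print(n2_seq[50], cb4856_seq[50])  # C T
```

The two returned sequences can then be digested in silico, or pasted into DNA analysis software, to compare restriction patterns.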
SNPs can also be accessed directly through WormBase, although somewhat less information is currently provided than the SNP-specific websites. To view these, simply go to your region of interest using the WormBase genome browser and select a reasonably-sized region (e.g., 20 kbp) for viewing under the “Scroll/Zoom” pulldown menu. Next, check the “SNPs” box towards the bottom of the page under “Variation Tracks” and click “Update Image”. This will display the predicted SNPs in the region as green or yellow diamonds, indicating RFLP and non-RFLP SNPs, respectively. In addition, SNPs that have been validated by additional sequencing or RFLP analysis are indicated. Clicking on the diamonds or adjacent text brings you to a new page where you have the option of viewing an expanded region (500 bp) surrounding the SNP. Alternatively, you can access SNPs through WormBase via: http://www.wormbase.org/db/searches/strains. Enter landmarks as directed and select “None” under the top Loci heading, “SNPs” under the middle option, and “All” under the bottom SNPs heading to view all verified and predicted SNPs in the region. Note that WormBase does not currently include reliability scores for predicted SNPs and there are no options for viewing different amounts of surrounding sequences or for identifying relevant restriction endonucleases. Nevertheless, the graphical interface is very straightforward and highly useful for visualizing the locations of SNPs within a small region.
With the consortium's sequencing of CB4856 and the expected future improvements of the databases, individual investigators' efforts to detect novel SNPs through sequencing relevant regions of CB4856 will likely be unnecessary in the very near future. Nevertheless, this can be accomplished by amplifying random intergenic sequences in one's region of interest from CB4856. In the past, we have usually amplified an ∼ 1,600-bp region from CB4856 and used two internal sequencing primers. More often than not, one will find at least a single difference within a region of this size.
Amplified Polymorphic Sequences
PCR can be used to amplify polymorphic regions. Polymorphism in the amplified sequences can appear as length variation in mini/microsatellites (VNTRs/STRs), where differences in the number of repeated elements produce what can be described as Amplified Fragment Length Polymorphisms (AFLPs). Cleaved Amplified Polymorphic Sequences (CAPS) result from PCR of loci known to contain polymorphic restriction sites. With CAPS, different alleles are revealed by the presence or absence of restriction enzyme (RE) digestion of the amplified products, which yields different banding patterns. In these cases, a SNP has historically introduced or ablated a specific restriction site, which allows the alleles to be distinguished. For SNPs that do not naturally create a restriction site, a modification of CAPS uses long primers that intentionally introduce a restriction site based on the SNP within the amplified region. This intentional creation or removal of a restriction site for one allele versus the other is referred to as derived Cleaved Amplified Polymorphic Sequences (dCAPS).
Single Nucleotide Polymorphisms (SNPs) and Single Nucleotide Variations (SNVs) are nucleotide changes at single genomic positions that differ between significant subsets of a population, or general mutations that often arise due to diseases such as cancer, respectively. While very common and known to cause many diseases, their effects on gene expression, protein binding, and ways in which they cause disease are not completely understood. Missense mutations in coding regions are easily linked to disease, since they cause translation of a defective protein, but most SNPs (∼93% of disease and trait associated SNPs in genome-wide association studies) occur in non-coding regions. Non-coding SNPs can appear in non-coding RNAs, introns, or in 5’ and 3’ untranslated regions (UTRs). Because these non-coding SNPs do not produce an altered protein, the pathways through which they cause disease are less well known, but they are still regularly associated with disease. Understanding the effect of these non-coding or same-sense SNPs has wide-ranging implications for understanding disease, as well as evolutionary genetics [6, 7].
A possible explanation of the effect on phenotype of SNPs in 5’ and 3’ UTRs or non-coding RNAs is that they affect crucial interactions between an RNA and other biomolecules. Indeed, RNAs naturally interact with RNA-binding proteins (RBPs), RNA-protein complexes like the ribosome and the spliceosome, as well as with other RNAs [8–10]. These interactions control every step in an RNA’s life cycle, such as the lifetime of an RNA molecule, its subcellular localization, and the recruitment of ribosomes to mRNA molecules and ultimately the amount of protein expressed per transcribed mRNA [11, 12]. Thus, it is not surprising that interrupting these interactions is known to cause disease. In line with their importance, there are over 1500 RNA binding proteins and thousands of microRNAs annotated in the human genome alone [14, 15].
It is clear that a SNP will affect protein or microRNA binding if it occurs directly on a binding site [16, 17]. However, as we will show, SNPs are also able to affect protein (or microRNA) binding “at a distance” through the involvement of RNA secondary structure. RNA secondary structures form due to the propensity of the nucleotides of an RNA to base pair. For structural RNAs these base pairings are a significant determinant of the functionally relevant physical shape of the RNA, but messenger and non-coding RNAs that are not necessarily designed for specific structures will also form base pairs and thus secondary structure. As microRNAs and a large fraction of RNA binding proteins bind to unpaired bases only, RNA secondary structure competes with binding of microRNAs or single-stranded RNA binding proteins and thus affects the binding affinity of the RNA for these molecules. For example, we have previously shown the existence of secondary structure mediated cooperativity between RNA binding proteins: binding of one protein to an RNA changes the ensemble of possible secondary structures by excluding the bases in its footprint from base-pairing [20, 21]. This change in secondary structures modifies the accessibility of the footprint for a second protein and thus the affinity of the RNA for this second protein. Depending on the specific sequence, one binding event can make the other binding event easier or harder.
It has also been shown experimentally that specific SNPs can affect the secondary structures of mRNAs, and that SNPs can cause disease through changes in RNA secondary structure [23–25]. Here, we show how single nucleotide changes in an RNA molecule can, by making different conformations energetically more or less favorable, also change secondary structure drastically enough to change the affinity of an RNA for an RNA binding protein or a microRNA, and that there is some evidence that this effect might be under selective pressure in the human transcriptome. For simplicity, in the rest of the paper we will refer to the molecules binding to RNAs as “proteins”, even though these binding events could equally occur with microRNAs, as shown in , or any other molecule that binds single-stranded RNA. Likewise, we will be referring to the effect of “SNPs” on RNA-protein binding, but these effects should occur equally with any point mutation including SNVs. By computationally folding RNAs using a modified version of the Vienna RNA Package, we are able to quantitatively measure the effect of SNPs on protein binding. Using known human SNPs and PAR-CLIP data, we investigate the genome wide effect of SNPs on HuR (ELAVL1) binding. HuR is an extensively studied RNA binding protein with nearly 500 articles on PubMed. It is a member of the ELAVL family of RNA-binding proteins that selectively bind AU rich sequences, and HuR binds with a 7 nucleotide footprint mostly in the UTRs of many mRNAs. HuR has diverse functions, including stabilizing mRNAs against degradation as a means of regulating gene expression and controlling nuclear export of mRNAs, and has been implicated in several diseases including cancer [28, 29]. We find that SNPs can have a many-fold effect on the binding affinity of HuR binding to RNA transcripts from tens of bases away, simply through changes in secondary structure, and propose this as a general mechanism through which SNPs can affect protein binding.
Refining the genomic location of SNP variation affecting Atlantic salmon maturation timing at a key large-effect locus
Efforts to understand the genetic underpinnings of phenotypic variation often lead to the identification of candidate regions showing signals of association and/or selection. These regions may contain multiple genes and therefore validation of which genes are actually responsible for the signal is required. In Atlantic salmon (Salmo salar) a large-effect locus for maturation timing occurs in a genomic region including two candidate genes, vgll3 and akap11, but data for clearly determining which of the genes (or both) contribute to the association have been lacking. Here, we take advantage of natural recombination events detected between the two candidate genes in a salmon broodstock to reduce linkage disequilibrium at the locus, and thus enabling delineation of the influence of variation at these two genes on maturation timing. By rearing 5895 males to maturation age, of which 81% had recombinant vgll3/akap11 allelic combinations, we found that vgll3 SNP variation was strongly associated with maturation timing, whereas there was little or no association between akap11 SNP variation and maturation timing. These findings provide strong evidence supporting vgll3 as the primary candidate gene in the chromosome 25 locus for influencing maturation timing. This will help guide future research for understanding the genetic processes controlling maturation timing. This also exemplifies the utility of natural recombinants to more precisely map causal variation underlying phenotypic diversity.
Of the 54 609 loci on the BovineSNP50 BeadChip, 21 131 (38.7%) SNPs were successfully genotyped in at least 90% of individuals, and 1068 (2.0% of the total; 5.1% of genotyped loci) were polymorphic in deer. In comparison, Pertoldi et al. successfully genotyped a far greater proportion of loci (96.7.7%) and detected 4% of loci as polymorphic using the same SNP chip in bison, and Miller et al. successfully genotyped over 90% of loci in closely related species of sheep using the OvineSNP50 BeadChip, yet found only 1.7% of sites to be polymorphic (868 out of a total of 49 034 loci). The lower rate of genotyping success in this study when compared with Pertoldi et al. and Miller et al. is expected, given the 25.1.1 million year divergence between Bovidae (B. taurus) and Cervidae (O. hemionus and O. virginianus). The level of polymorphism, however, is unexpectedly high and could result from historically high population sizes of mule deer, black-tailed deer and white-tailed deer in North America. In contrast, the bison species analyzed by Pertoldi et al. have undergone several severe population bottlenecks, while the wild sheep species investigated by Miller et al. live in relatively small, isolated populations. The identification of 1068 novel, polymorphic SNPs in this study demonstrates that commercial SNP chip technology is a viable and potentially underutilized means of discovering SNP loci in non-model species, even when used between highly divergent lineages.
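The percentages quoted for the deer data use two different denominators, which is easy to lose track of. The short check below recomputes them from the counts given above; the helper name `percent` is ours.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


total_loci = 54_609   # loci on the BovineSNP50 BeadChip
genotyped = 21_131    # loci genotyped in at least 90% of individuals
polymorphic = 1_068   # genotyped loci that were polymorphic in deer

print(percent(genotyped, total_loci))    # 38.7  (genotyping success)
print(percent(polymorphic, total_loci))  # 2.0   (polymorphic, of all loci)
print(percent(polymorphic, genotyped))   # 5.1   (polymorphic, of genotyped loci)
```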
Both neutral loci and loci potentially under selection were detected in this study, including 878 neutrally evolving, 116 under the influence of positive selection, and 74 influenced by balancing selection (Table S1). A suite of loci that includes both neutral and selected loci will be useful for a variety of applications. Most population genetic analyses, for example, assume that the genetic markers employed are selectively neutral. Loci under positive selection, however, can be essential in distinguishing between recently diverged species and populations that are otherwise difficult to distinguish using neutral markers. Characterizing genomic regions under balancing selection could identify advantageous genes and alleles that move between populations, such as loci involved in disease resistance. Thus, a necessary first step in any genetic study is to accurately characterize suites of loci that match study objectives and ensure the application of appropriate analytical models and correct interpretation of results.
Population genetic inferences made with the SNPs identified here were consistent with current taxonomic nomenclature and with previous studies of nuclear and Y-chromosome DNA and morphological characters that identified mule and black-tailed deer as closely related and white-tailed deer as a more divergent evolutionary lineage. All measures of genetic distance (FST, D and Dm) reported lower differentiation between mule deer and black-tailed deer than between white-tailed deer and either O. hemionus lineage (Figure 2). Consistent with the analyses of microsatellites performed here, the three lineages were clearly delineated using exact tests, assignment tests, and FCA using the dataset of all 1068 polymorphic SNPs or the 878 neutral SNPs. Extremely low P(ID) values both overall and within individual lineages suggest that these SNPs would be very useful for fine-scale population genetic analyses requiring unambiguous individual identification. In this study, we used only ‘pure’ representatives of each lineage (as identified by previous genetic analyses). Further characterization of these SNPs would be necessary to determine their power and accuracy for delineating lineages in areas of sympatry where individuals may be of mixed ancestry.
Figure 2. Measures of genetic distance: (a) FST (with standard deviation), (b) Jost’s D (with standard error) and (c) Nei’s minimum distance, Dm.
The level of within-population inbreeding (FIS) differed markedly between datasets (Table 2) and warrants further explanation here. The FIS statistic ranges from −1 to 1, with negative values indicating an excess of heterozygosity and positive values indicating excess homozygosity relative to expectations under HWE. For each lineage, deer were sampled from disparate locations, and as such are expected to belong to different populations and to therefore return positive FIS values consistent with homozygote excess (Wahlund effect). In accordance with these expectations, positive FIS values were returned for all lineages for microsatellites (although FIS was not significantly different from zero in white-tailed deer) and for SNPs in black-tailed deer and white-tailed deer. In contrast, statistically significant negative FIS values were returned in mule deer when all 1068 SNPs or the 878 neutral SNPs were analyzed (Table 2). The unexpected heterozygote excess in the SNP data in the mule deer lineage could be caused by a high proportion of low-frequency alleles in mule deer, which would in turn lead to an artificially high HO. Of the 429 loci that were polymorphic in mule deer, 54% (n =) had a minor allele frequency (MAF) less than 0.1 (Table 1). This was higher than the proportion of similarly low-frequency alleles found in black-tailed deer (46%; 200 of 434 polymorphic loci within the black-tailed deer lineage) and white-tailed deer, where the MAF could not be less than 0.125 on account of only 4 individuals being analyzed (if at a given locus only one of the four individuals is heterozygous, the MAF of that locus will be 0.125) (Table 1). Multilocus genotypes from additional individuals would be necessary to more fully evaluate potential mechanisms for the observed heterozygote excess in mule deer.
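Two quantities in this paragraph are simple to compute by hand. The sketch below uses the standard definition FIS = 1 − HO/HE and the fact that the smallest non-zero allele frequency among N diploid individuals is 1/(2N). The heterozygosity values passed to `f_is` are made-up numbers for illustration, not values from Table 2.

```python
def f_is(h_obs, h_exp):
    """Inbreeding coefficient: negative means heterozygote excess,
    positive means homozygote excess relative to HWE expectations."""
    return 1.0 - h_obs / h_exp


def min_nonzero_maf(n_individuals):
    """Smallest possible non-zero minor allele frequency in a diploid sample:
    one copy of the minor allele among 2 * n_individuals allele copies."""
    return 1 / (2 * n_individuals)


# Hypothetical heterozygosities, chosen only to show the sign of FIS.
print(round(f_is(h_obs=0.30, h_exp=0.25), 3))  # -0.2 (heterozygote excess)
print(round(f_is(h_obs=0.20, h_exp=0.25), 3))  # 0.2  (homozygote excess)

# Four white-tailed deer: a single heterozygote gives the minimum MAF.
print(min_nonzero_maf(4))  # 0.125
```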
Any process of SNPs discovery carries some risk of ascertainment bias, where the overall pattern of genetic diversity is not accurately represented by the sampled SNPs. In general, small screening panel size, overly stringent SNP identification algorithms, and bias toward polymorphic loci in SNP selection can lead to inaccurate inferences of genetic diversity, population genetic structure, and phylogenetic relationships , . The small sample size of deer initially screened for SNPs in the present study will almost certainly have led to some polymorphic sites not being detected, in particular those sites harboring rare alleles. In addition, the screening of SNPs identified in B. taurus for use in O. hemionus and O. virginianus is likely biased in favor of conserved genomic regions that still retain polymorphisms ancestral to the divergence between Cervidae and Bovidae. Such loci may not be representative of the evolutionary changes that have since occurred within the Cervidae family. The selection of SNPs for the Bovine SNP50 BeadChip that are distributed in a roughly even fashion across the B. taurus genome, however, should minimize the effects of this bias. Downstream applications can avoid compounding ascertainment bias by randomly selecting a panel of SNPs for analysis, rather than using only SNPs that exceed a minimum, predefined level of polymorphism .
One of the most attractive incentives for using model species to identify SNPs in non-model species is the availability of annotations that link SNP variation to DNA sequences and ultimately to biological processes. Although no deer genomes have yet been fully sequenced and annotated, the genomic location of each SNP identified in this study can be mapped on various versions of the B. taurus genome (e.g., the Btau 4.2 assembly, compiled by the Bovine HapMap Consortium, or the UMD3.1 assembly, compiled by the Center for Bioinformatics and Computational Biology at the University of Maryland). The position of each SNP on both Btau4.0 and UMD3.1 is provided in Table S1. However, the level of divergence between our model and non-model species (25 MYA) may not permit accurate chromosomal locations to be determined for all identified SNPs. Multiple chromosome rearrangements have occurred in the Bovidae and Cervidae lineages since their divergence, which is especially evident in a change in karyotype from 2n = in cervids O. virginianus and O. hemionus to 2n = in the bovid B. taurus. In spite of these large-scale rearrangements, alignment of deer DNA sequences to the B. taurus genome has been successful for next-generation sequences generated from O. virginianus, presumably owing to regional synteny. Still, caution is warranted when interpreting results obtained from alignments between such divergent lineages.
The SNPs characterized in this study would likely be useful in a variety of applications for an array of cervid species, given the high cross-species amplification success we observed. Neutral SNPs can be readily applied to more traditional population genetic analyses, such as characterizing population structure, quantifying genetic diversity and inferring migration rates. Loci under natural selection could be used to investigate genetic mechanisms underpinning natural selection and adaptation, or to differentiate recently diverged populations, species and ecotypes that are otherwise difficult to distinguish using neutral loci. Such investigations are relevant not only for evolutionary research but also for conservation and management of mule deer, black-tailed deer and white-tailed deer. These species are important game animals, and the U.S. Fish and Wildlife Service also lists the Cedros Island mule deer (O. h. cerrosensis), Florida Key white-tailed deer (O. v. clavium) and Columbian white-tailed deer in western Oregon (O. v. leucurus) as ‘endangered’. White-tailed deer are also threatened in Venezuela by overhunting and habitat loss. Thorough delimitation of subpopulation boundaries, identification of locally adapted populations and characterization of genetic diversity patterns will therefore be highly useful in informing regional conservation and management strategies. These commercial SNP chips could even be applied to other cervids of conservation or management concern; for example, those listed as threatened on the IUCN Red List (hog deer, Axis spp., revised to genus Hyelaphus; Père David’s deer, Elaphurus davidianus; Patagonian huemul, Hippocamelus bisulcus).
This study demonstrates the potential utility of commercially available SNP chip technology for identifying SNP loci in non-model organisms. As polymorphic SNPs were identified between lineages that diverged up to 30.1 MYA, SNP chips developed for model organisms can likely identify SNPs in a far wider range of organisms than previously realized. The porcine, ovine, equine and bovine SNP chips, for example, could be used collectively to develop a panel of SNPs for a wide range of highly divergent ungulates, while SNP chips developed for dogs (Canis lupus familiaris) could likely identify polymorphic SNPs in a wide range of Carnivora species that would otherwise require extensive DNA sequencing. The cross-species utilization of SNP chips is therefore an exciting avenue of future research.
An organism's genotype may not define its haplotype uniquely. For example, consider a diploid organism and two bi-allelic loci (such as SNPs) on the same chromosome. Assume the first locus has alleles A or T and the second locus G or C. Both loci, then, have three possible genotypes: (AA, AT, and TT) and (GG, GC, and CC), respectively. For a given individual, there are therefore nine possible two-locus genotype configurations (shown in the Punnett square below). For individuals who are homozygous at one or both loci, the haplotypes are unambiguous: at a homozygous locus both chromosome copies carry the same allele, so it does not matter which copy is paired with which allele at the other locus. For individuals heterozygous at both loci, the gametic phase is ambiguous: the genotype alone does not tell you whether the individual carries the haplotype pair AG and TC or the pair AC and TG.
|    | AA    | AT             | TT    |
| GG | AG AG | AG TG          | TG TG |
| GC | AG AC | AG TC or AC TG | TG TC |
| CC | AC AC | AC TC          | TC TC |
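The table can be reproduced mechanically. The short Python sketch below lists every haplotype pair that is compatible with a two-locus genotype. The function name and the string encoding of genotypes (for example "AT" and "GC") are assumptions made for this illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype1, genotype2):
    """Return the set of unordered haplotype pairs consistent with the
    genotypes at two bi-allelic loci, e.g. "AT" at locus 1, "GC" at locus 2.
    """
    pairs = set()
    # Choose which allele of each locus sits on the first chromosome copy;
    # the second copy then carries the remaining allele at each locus.
    for i, j in product(range(2), repeat=2):
        hap_a = genotype1[i] + genotype2[j]
        hap_b = genotype1[1 - i] + genotype2[1 - j]
        pairs.add(tuple(sorted((hap_a, hap_b))))
    return pairs


for g2 in ("GG", "GC", "CC"):
    for g1 in ("AA", "AT", "TT"):
        print(g1, g2, sorted(compatible_haplotype_pairs(g1, g2)))
```

Only the double heterozygote (AT at the first locus, GC at the second) returns two pairs; every other genotype combination returns exactly one, which is the sense in which those haplotypes are unambiguous.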
The only unequivocal method of resolving phase ambiguity is by sequencing. However, it is possible to estimate the probability of a particular haplotype when phase is ambiguous using a sample of individuals.
Given the genotypes for a number of individuals, the haplotypes can be inferred by haplotype resolution or haplotype phasing techniques. These methods work by applying the observation that certain haplotypes are common in certain genomic regions. Therefore, given a set of possible haplotype resolutions, these methods choose those that use fewer different haplotypes overall. The specifics of these methods vary - some are based on combinatorial approaches (e.g., parsimony), whereas others use likelihood functions based on different models and assumptions such as the Hardy–Weinberg principle, the coalescent theory model, or perfect phylogeny. The parameters in these models are then estimated using algorithms such as the expectation-maximization algorithm (EM), Markov chain Monte Carlo (MCMC), or hidden Markov models (HMM).
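As a rough illustration of the likelihood-based approach, the following Python sketch runs an expectation-maximization loop to estimate haplotype frequencies at two bi-allelic loci, assuming Hardy–Weinberg proportions. It is a toy version written for this example, not the algorithm of any particular phasing program; the function names, the genotype encoding, and the fixed iteration count are all assumptions.

```python
from collections import Counter
from itertools import product


def phase_options(g1, g2):
    """All unordered haplotype pairs consistent with a two-locus genotype."""
    opts = set()
    for i, j in product(range(2), repeat=2):
        opts.add(tuple(sorted((g1[i] + g2[j], g1[1 - i] + g2[1 - j]))))
    return sorted(opts)


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies from unphased two-locus genotypes.

    genotypes: list of (locus1, locus2) strings such as ("AT", "GC").
    Returns a dict mapping each candidate haplotype to its frequency.
    """
    haps = sorted({h for g1, g2 in genotypes
                   for pair in phase_options(g1, g2) for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}   # uniform starting point
    total = 2 * len(genotypes)                  # chromosomes in the sample
    for _ in range(n_iter):
        counts = Counter()
        for g1, g2 in genotypes:
            options = phase_options(g1, g2)
            # E-step: weight each possible phasing by the product of its
            # haplotype frequencies (Hardy-Weinberg assumption).
            weights = [freq[a] * freq[b] for a, b in options]
            norm = sum(weights)
            for (a, b), w in zip(options, weights):
                counts[a] += w / norm
                counts[b] += w / norm
        # M-step: new frequencies are the expected haplotype counts.
        freq = {h: counts[h] / total for h in haps}
    return freq


sample = [("AA", "GG")] * 4 + [("TT", "CC")] * 4 + [("AT", "GC")] * 2
print(em_haplotype_frequencies(sample))
```

In this toy sample the unambiguous homozygotes carry only AG and TC, so the loop assigns the two double heterozygotes almost entirely to the AG/TC phasing. This mirrors the observation that phasing methods favour resolutions that reuse haplotypes already common in the sample.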
Microfluidic whole genome haplotyping is a technique for the physical separation of individual chromosomes from a metaphase cell followed by direct resolution of the haplotype for each allele.
Unlike other chromosomes, Y chromosomes generally do not come in pairs. Every human male (excepting those with XYY syndrome) has only one copy of that chromosome. This means that there is no chance variation in which copy is inherited and (for most of the chromosome) no shuffling between copies by recombination. So, unlike autosomal haplotypes, there is effectively no randomisation of the Y-chromosome haplotype between generations. A human male should largely share the same Y chromosome as his father, give or take a few mutations; thus Y chromosomes tend to pass largely intact from father to son, with a small but accumulating number of mutations that can serve to differentiate male lineages. In particular, the Y-DNA represented as the numbered results of a Y-DNA genealogical DNA test should match, except for mutations.
UEP results (SNP results)
Unique-event polymorphisms (UEPs) such as SNPs represent haplogroups. STRs represent haplotypes. The results that comprise the full Y-DNA haplotype from the Y chromosome DNA test can be divided into two parts: the results for UEPs, sometimes loosely called the SNP results as most UEPs are single-nucleotide polymorphisms, and the results for microsatellite short tandem repeat sequences (Y-STRs).
The UEP results represent the inheritance of events that can be assumed to have happened only once in all human history. These can be used to identify the individual's Y-DNA haplogroup, his place in the "family tree" of the whole of humanity. Different Y-DNA haplogroups identify genetic populations that are often distinctly associated with particular geographic regions; their appearance in more recent populations located in different regions represents the migrations, tens of thousands of years ago, of the direct patrilineal ancestors of current individuals.
Y-STR haplotypes
Genetic results also include the Y-STR haplotype, the set of results from the Y-STR markers tested.
Unlike the UEPs, the Y-STRs mutate much more easily, which allows them to be used to distinguish recent genealogy. But it also means that, rather than the population of descendants of a genetic event all sharing the same result, the Y-STR haplotypes are likely to have spread apart, to form a cluster of more or less similar results. Typically, this cluster will have a definite most probable center, the modal haplotype (presumably similar to the haplotype of the original founding event), and also a haplotype diversity — the degree to which it has become spread out. The further in the past the defining event occurred, and the more that subsequent population growth occurred early, the greater the haplotype diversity will be for a particular number of descendants. However, if the haplotype diversity is smaller for a particular number of descendants, this may indicate a more recent common ancestor, or a recent population expansion.
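The idea of a modal haplotype and of a cluster's spread can be sketched in a few lines of Python. Here a Y-STR haplotype is represented as a tuple of repeat counts, one per marker. The modal haplotype is taken as the most common value at each marker, and the spread is summarized as the mean number of repeat steps separating each haplotype from the modal one. That spread measure is only one simple proxy chosen for this illustration, not a standard definition of haplotype diversity, and the repeat counts are made-up values.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common repeat count at each marker across the cluster."""
    return tuple(Counter(column).most_common(1)[0][0]
                 for column in zip(*haplotypes))


def mean_distance_from_modal(haplotypes):
    """Average number of repeat steps between each haplotype and the modal
    haplotype (an assumed, simple proxy for how spread out the cluster is)."""
    modal = modal_haplotype(haplotypes)
    distances = [sum(abs(a - b) for a, b in zip(h, modal))
                 for h in haplotypes]
    return sum(distances) / len(distances)


# Hypothetical results for three markers in four men.
cluster = [(13, 24, 14), (13, 24, 14), (13, 25, 14), (12, 24, 14)]
print(modal_haplotype(cluster))            # (13, 24, 14)
print(mean_distance_from_modal(cluster))   # 0.5
```

A tighter cluster gives a value closer to zero; an older or more diverse cluster gives a larger value.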
It is important to note that, unlike for UEPs, two individuals with a similar Y-STR haplotype may not necessarily share a similar ancestry. Y-STR events are not unique. Instead, the clusters of Y-STR haplotype results inherited from different events and different histories tend to overlap.
In most cases, it is a long time since the haplogroups' defining events, so typically the cluster of Y-STR haplotype results associated with descendants of that event has become rather broad. These results will tend to significantly overlap the (similarly broad) clusters of Y-STR haplotypes associated with other haplogroups. This makes it impossible for researchers to predict with absolute certainty to which Y-DNA haplogroup a Y-STR haplotype would point. If the UEPs are not tested, the Y-STRs may be used only to predict probabilities for haplogroup ancestry, but not certainties.
A similar scenario exists in trying to evaluate whether shared surnames indicate shared genetic ancestry. A cluster of similar Y-STR haplotypes may indicate a shared common ancestor, with an identifiable modal haplotype, but only if the cluster is sufficiently distinct from what may have happened by chance from different individuals who historically adopted the same name independently. Many names were adopted from common occupations, for instance, or were associated with habitation of particular sites. More extensive haplotype typing is needed to establish genetic genealogy. Commercial DNA-testing companies now offer their customers testing of larger sets of markers to improve definition of their genetic ancestry. The number of markers tested has increased from 12 during the early years to 111 more recently.
Establishing plausible relatedness between different surnames data-mined from a database is significantly more difficult. The researcher must establish that the very nearest member of the population in question, chosen purposely from the population for that reason, would be unlikely to match by accident. This is more than establishing that a randomly selected member of the population is unlikely to have such a close match by accident. Because of this difficulty, establishing relatedness between different surnames in such a scenario is likely to be impossible, except in special cases where there is specific information to drastically limit the size of the population of candidates under consideration.
First of all, PCA is a technique for dimension reduction. Basically, the goal is to compare tens of thousands of SNPs in Drosophila. Now if you only have 2 SNPs, you can plot them on a 2D scatter plot. If you have 3 SNPs, you may try a 3D plot. But now imagine you have 30,000 SNPs: you CANNOT draw a 30,000-dimensional plot. To visualize this high-dimensional data, what we can do is perform dimension reduction with a method like PCA. PCA tries to find a set of orthogonal axes that explain most of the variation in the data (if there is no variation, there is no information contained in the data, which essentially means there is no data). The idea is that PC1 carries the largest share of the variation that can be explained, and PC2 carries the second largest. Later PCs like PC50 or PC60 probably only carry noise in the data. Therefore, the leading PCs (PC1, PC2 and so on) effectively summarize the useful information in the data. So you can visualize the "structure" of the data in a 2D PCA plot.
By looking at the distance between points on a PCA plot, you can tell how similar two data points are. But if you see two populations that are perfectly separated on a PCA plot, it does not mean that the 2 populations differ completely at every SNP, because PCA is a summary of all the SNPs included.
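Here is a minimal sketch of that idea in Python with NumPy, assuming genotypes are coded as 0, 1 or 2 copies of one allele per SNP (a common coding, but an assumption here, as are the function name and the toy data). It centers each SNP, takes a singular value decomposition, and returns each individual's coordinates on the leading PCs together with the fraction of variance each PC explains.

```python
import numpy as np


def pca_scores(genotypes, n_components=2):
    """Project individuals onto the leading principal components.

    genotypes: individuals x SNPs array of allele counts (0, 1 or 2).
    Returns (scores, explained), where scores has one row per individual
    and explained holds the fraction of total variance per component.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                      # center each SNP
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]


# Toy data: rows 0-2 and rows 3-5 are two made-up populations.
# The fourth SNP is identical in everyone, so it cannot separate them.
toy = [
    [0, 0, 1, 2, 1],
    [0, 1, 0, 2, 1],
    [1, 0, 0, 2, 0],
    [2, 2, 1, 2, 1],
    [2, 1, 2, 2, 0],
    [1, 2, 2, 2, 1],
]
scores, explained = pca_scores(toy)
print(scores[:, 0])   # PC1 coordinate of each individual
print(explained)      # share of variance carried by PC1 and PC2
```

In this toy example the two groups fall on opposite sides of PC1 even though one SNP is identical in every individual and no single SNP separates the groups cleanly, which is the point made above: separation on the plot summarizes many SNPs at once. The sign of a PC is arbitrary, so which group lands on the positive side can vary.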
Single Nucleotide Polymorphism
The importance of SNPs comes from their ability to influence disease risk, drug efficacy and side-effects, to tell you about your ancestry, and to predict aspects of how you look and even act. SNPs are probably the most important category of genetic changes influencing common diseases. And in terms of common diseases, 9 of the top 10 leading causes of death have a genetic component and thus most likely one or more SNPs influence your risk.
All humans have almost the same sequence of 3 billion DNA bases (A, C, G, or T) distributed between their 23 pairs of chromosomes. But at certain locations there are differences; these variations are called polymorphisms. Polymorphisms are what make individuals different from one another. Current estimates indicate that up to 0.1% of our DNA may vary, meaning any two unrelated individuals may differ at up to roughly 3 million DNA positions (0.1% of 3 billion). While many variations (SNPs) are known, most have no known effect and may be of little or no importance.
SNPedia is a collection of the subset of SNPs that have been reported to be meaningful, either medically or for other reasons (such as for genealogy). The emphasis in SNPedia is on SNPs that have significant medical consequences, are common, are reproducible (or found in meta-analyses or studies of at least 500 patients), and/or have other historic or medical significance.
This example SNP rs1234 will introduce you to the report format used within SNPedia.
The most obvious DNA-based differences are external, such as rs1805009 which affects red hair color. Most polymorphisms have far less obvious effects though, and many of these may have medical consequences. We are just beginning to learn which of the 30 million or so possible polymorphisms influence health, either individually or in sets. Many polymorphisms are likely to have either no effect at all, or to have such subtle effects that it will be many years before their consequences are understood.
Thomas Mailund explains how scientists and statisticians determine which SNPs are related to which diseases.
A more recent discovery is larger duplications called Copy Number Variations. These CNVs are not yet as well systematized or studied as SNPs. The database dbVar is for structural variations.