Is there any source for raw data of SNP genotype frequency?

On sites like SNPedia, some pages list the frequency of the SNP in question in different populations, based on published research. I'm trying to write a script that takes 23andMe data and compares it against SNP frequencies to find rare SNPs that the user carries. I'm thinking that the only way to do this might be to scrape the frequencies from a SNP database. Is there anywhere you know of that makes this information available in a more accessible format, ideally preformatted for parsing?
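For what it's worth, the comparison step itself is small once a frequency table is in hand. Below is a minimal Python sketch of it. The 23andMe export is tab-separated (rsid, chromosome, position, genotype) with `#` comment lines; the frequency file layout (`rsid`, `allele`, `frequency` columns) and the function names are assumptions made up for illustration, not the format of any particular database.

```python
import csv
import io


def parse_23andme(handle):
    """Yield (rsid, genotype) pairs from a 23andMe raw-data export."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # malformed row
        rsid, _chrom, _pos, genotype = fields[:4]
        yield rsid, genotype


def load_frequencies(handle):
    """Read a hypothetical TSV with columns rsid, allele, frequency."""
    freqs = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        freqs[(row["rsid"], row["allele"])] = float(row["frequency"])
    return freqs


def rare_alleles(genotypes, freqs, threshold=0.01):
    """Return (rsid, allele, frequency) for carried alleles below threshold."""
    hits = []
    for rsid, genotype in genotypes:
        for allele in sorted(set(genotype)):
            if allele in "-ID":
                continue  # assumed no-call / indel codes, not base calls
            freq = freqs.get((rsid, allele))
            if freq is not None and freq < threshold:
                hits.append((rsid, allele, freq))
    return hits


# Tiny in-memory example standing in for real files.
RAW = "# comment\nrs1\t1\t100\tAG\nrs2\t1\t200\tCC\nrs3\t2\t300\t--\n"
FREQ = "rsid\tallele\tfrequency\nrs1\tA\t0.45\nrs1\tG\t0.004\nrs2\tC\t0.9\n"

hits = rare_alleles(
    parse_23andme(io.StringIO(RAW)),
    load_frequencies(io.StringIO(FREQ)),
)
print(hits)
```

Two caveats for real data: the frequency source has to report alleles on the same strand and genome build as the 23andMe file, and frequencies differ between populations, so the threshold only makes sense against a reference population chosen deliberately.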

You might be able to get some raw data on SNP frequency by batch querying the dbSNP database. I have not used it myself, though.

This is one of the reasons the 1000 genomes project was created.

Have you looked at ALFRED (The ALlele FREquency Database)? The data is from 2011, but seems extensive and has downloadable zip files at

Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics

Motivation: Genotyping a set of variants from a database is an important step for identifying known genetic traits and disease-related variants within an individual. The growing size of variant databases, together with the high depth of sequencing data, poses an efficiency challenge. In clinical applications, where time is crucial, alignment-based methods are often not fast enough. To fill the gap, Shajii et al. propose LAVA, an alignment-free genotyping method that genotypes single nucleotide polymorphisms (SNPs) more quickly; however, there remains considerable room for improvement in running time and accuracy.

Results: We present the VarGeno method for SNP genotyping from Illumina whole genome sequencing data. VarGeno builds upon LAVA by improving the speed of k-mer querying as well as the accuracy of the genotyping strategy. We evaluate VarGeno on several read datasets using different genotyping SNP lists. VarGeno performs 7-13 times faster than LAVA with similar memory usage, while improving accuracy.
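To make the alignment-free idea concrete, here is a toy Python sketch of k-mer based genotyping: index the k-mers that overlap each SNP site for both alleles, count how often reads hit them, and call a genotype from the counts. It illustrates the general approach only and is not the LAVA or VarGeno algorithm. The function names, the input layout, and the simple count threshold are all assumptions, and reverse-complement reads and sequencing errors are ignored.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(snps, k):
    """Map each k-mer overlapping a SNP site to (snp_id, 'ref' or 'alt').

    snps: dict snp_id -> (left_flank, ref_base, alt_base, right_flank),
    with flanks at least k-1 bases long and k >= 2.
    """
    index, ambiguous = {}, set()
    for snp_id, (left, ref, alt, right) in snps.items():
        for label, base in (("ref", ref), ("alt", alt)):
            window = left[-(k - 1):] + base + right[:k - 1]
            for km in kmers(window, k):
                if km in index and index[km] != (snp_id, label):
                    ambiguous.add(km)  # k-mer is not specific to one allele
                index[km] = (snp_id, label)
    for km in ambiguous:
        del index[km]
    return index


def genotype(reads, index, k, min_count=2):
    """Call 0/0, 0/1 or 1/1 per SNP from k-mer hit counts (None if too few)."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            hit = index.get(km)
            if hit:
                counts[hit] += 1
    calls = {}
    for snp_id in {s for s, _ in index.values()}:
        ref_n, alt_n = counts[(snp_id, "ref")], counts[(snp_id, "alt")]
        if ref_n + alt_n < min_count:
            calls[snp_id] = None
        elif alt_n == 0:
            calls[snp_id] = "0/0"
        elif ref_n == 0:
            calls[snp_id] = "1/1"
        else:
            calls[snp_id] = "0/1"
    return calls


snps = {"snp1": ("ACGTAC", "G", "T", "CATGCA")}
index = build_index(snps, k=4)
calls = genotype(["GTACGCATG", "GTACTCATG"], index, k=4)
print(calls)
```

A practical tool would replace the count threshold with a probabilistic genotype model and use a far more compact k-mer lookup structure; the sketch only shows why no read alignment is needed.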

Availability and implementation: VarGeno is freely available at:

Supplementary information: Supplementary data are available at Bioinformatics online.


Populus nigra is a major tree species from Eurasian riparian ecosystems and one of the 3 main parental species used in poplar breeding programs to develop highly productive interspecific cultivated hybrids. For these reasons, several initiatives have recently been set up to create genomic resources within this species as tools to improve conservation and breeding strategies [1, 2]. The main objective of such initiatives is to discover and type genomic variants like Single Nucleotide Polymorphisms (SNPs) for various applications, including the identification and quantification of introgressions from the cultivated compartment, the study of population structure and the identification of variants associated with economically or ecologically relevant phenotypes through association genetics.

Early studies in P. nigra have focused on re-sequencing specific candidate genes from the lignin pathway [3–5], but more recent work has broadened the scope of analyses through the development of a genotyping chip from SNPs detected by whole-genome sequencing [1, 2]. This genotyping tool was successfully used to study the structure of the genetic diversity of the species [1] and to identify some genomic regions associated with economically important traits [6]. However, the genotyping was limited to 7903 SNPs preferentially located within particular candidate regions underlying some Quantitative Trait Loci (QTLs) previously reported in biparental crosses. Moreover, the frequency of the SNPs within P. nigra populations appeared to be upwardly biased, limiting the analyses to common variants [1]. Consequently, the application of this chip especially in association genetics could be limited as underlined by the low number of significant associations reported [6]. Indeed, given the rapid Linkage Disequilibrium (LD) decay within this species and its genome size, an exhaustive genome-wide association study (GWAS) would require between 67,000 and 134,000 evenly spaced SNPs which is between 8 and 16 times more than the number of SNPs available from the chip cited above [7, 8].

To access the large number of SNPs typically needed for an exhaustive GWAS in P. nigra, several options relying on next-generation sequencing are available. While whole genome sequencing still appears too expensive for a fairly large number of genotypes, reducing the complexity of the genome prior to sequencing, for instance with restriction enzymes (GBS [9], RADseq [10]) or sequence capture (exome sequencing [11]), seems a promising way forward. Indeed, sequence capture has recently been used successfully to genotype around 350,000 SNPs in P. deltoides and to identify putative regulators of bioenergy traits [12]. RNA sequencing (RNAseq) is also a cost-effective way to reduce complexity while focusing on the expressed fraction of the genome [13]. However, to date, RNAseq has more often been used for SNP discovery than for direct genotyping of large populations. For instance, Geraldes et al. [14] found around 500,000 SNPs through RNAseq of developing secondary xylem in P. trichocarpa, and a SNP chip was later developed partly from these RNAseq SNPs [8] in order to carry out association scans [15, 16]. Nevertheless, recent studies have used RNAseq as a tool for both discovering and genotyping a large number of SNPs in populations [17–21], underlining the interest of this approach for population and quantitative genomics studies. However, to our knowledge, no study so far has evaluated the accuracy of SNP genotyping from RNAseq data.

The present study aims at evaluating RNAseq as a tool to type a sufficiently large number of SNPs within natural populations of P. nigra to carry out a GWAS. For that purpose, we performed RNAseq on pools of young differentiated xylem and cambium collected on 2 biological replicates of 12 genotypes originating from 6 natural populations. We further developed a dedicated bioinformatic pipeline to discover and type SNPs within the sequences. The accuracy of the resulting RNAseq-based SNPs was evaluated by (i) comparing their positions and alleles to those previously reported in candidate genes [3, 4], (ii) assessing their genotyping accuracy with respect to a SNP chip [1], and (iii) evaluating their interannual repeatability. Finally, the validated SNPs were used to perform basic genetic analyses to illustrate the usefulness of the released SNP dataset.


High-throughput genotyping, which leads to the identification of a large number of single-nucleotide polymorphisms (SNPs), is boosting the implementation of genome-wide association studies (GWAS), linking DNA variants to phenotypes of interest (Taranto et al., 2018). In crop species, GWAS enabled the mapping of genomic loci associated with economically important traits, including yield, resistance to biotic and abiotic stresses, and quality (Boyles et al., 2016; Pavan et al., 2017; Hou et al., 2018; Liu et al., 2018; He et al., 2019). This information has been further used to perform marker-assisted selection (MAS) in breeding programs and discover genes underlying phenotypic variation (Liu and Yan, 2019).

Several genotyping methods are available (reviewed by Scheben et al., 2017), which are usually performed by commercial parties upon the receipt of DNA samples. For application in GWAS, widely adopted genotyping options fall into three categories: whole genome resequencing (WGR), reduced representation sequencing (RRS), and SNP arrays. WGR and RRS are based on next-generation sequencing (NGS) technologies and bioinformatics pipelines that align reads to a reference genome and call both SNPs and genotypes (Nielsen et al., 2011). SNP arrays rely on allele-specific oligonucleotide (ASO) probes (including target SNP loci plus their flanking regions) fixed on a solid support, which are used to interrogate complementary fragments from DNA samples and infer genotypes based on the interpretation of the hybridization signal. Choosing the most appropriate (cost-effective) genotyping method for crop GWAS requires careful examination of several aspects, namely, the purpose and the scale of the study, crop-specific genomic features, and technical and economic matters associated with each genotyping method.

Raw SNP datasets resulting from genotyping experiments are typically inaccurate and incomplete. In addition, genes associated with phenotypes can have a small effect on genetic variance. In this scenario, quality control (QC) procedures are of pivotal importance to minimize false-positive or false-negative associations, referred to as type I and type II errors, respectively. QC includes filtering out poor-quality or suspected artifactual SNP loci, filtering out individuals on the basis of missing data, anomalous genotype calls and genetic synonymies, and characterizing ancestral relationships among individuals of the GWAS population. Excellent reviews have focused on QC of human SNP data (Turner et al., 2011; Marees et al., 2018); however, the QC procedure may be quite different for crop species. In this case, variables that need to be considered include the crop's prevailing mating system (self- or open-pollinating) and the breeding history of the specific GWAS population.

This review aims to provide recommendations on how to plan genotyping experiments and best practices on how to perform QC in crop species.



Single nucleotide polymorphisms (SNPs) are variable single base positions within a genome that represent the simplest and possibly most common type of genetic variation. Accordingly, SNPs have emerged as a powerful tool for tracking heredity and genetic variation, and have become especially popular for phenotype genome-wide association studies (1, 2). The critical role of the laboratory mouse has led to several efforts aimed at large-scale collection and analysis of mouse SNPs (3–7).

The Center for Genome Dynamics Single Nucleotide Polymorphism database (CGDSNPdb) was designed to bring together multiple sources of mouse SNP data, while checking them for accuracy and consistency among sources. CGDSNPdb is distinguished by the inclusion of two unique data sets:

The Imputed SNP Genotype Resource (IGR) (8) generated by a Hidden Markov Model (HMM) that assigns probable genotype and associated confidence levels for over 8 million SNPs in 74 strains of mice.

Data collected from over 140 strains of laboratory mice (filtered to 72 inbred strains in the current release, version 1.3) with the Mouse Diversity Genotyping Array [MusDiv (9)], a high density microarray with probes that target 623 124 SNPs and over 900 000 invariant genomic regions targeting features such as exons and copy number variations. MusDiv SNP data will also be submitted to dbSNP following publication of an analysis manuscript (in preparation).

The CGDSNPdb search engine facilitates a number of different queries, including search by chromosome region(s), nearby gene annotations, or SNP identifiers. Results can be returned as dynamic html or in flat-text comma-separated-value (CSV) format.

Annotations in CGDSNPdb include characteristics of the SNP (e.g. presence in CpG dinucleotide, major/minor allele frequencies), along with functional characteristics of protein-coding genes affected by the SNP (e.g. changes in amino-acid physical and chemical characteristics, changes in codon usage, and overlapping or closest neighboring genes). All annotations were generated using an automated analysis pipeline with subsequent quality controls, described below.

CGDSNPdb was constructed primarily as a resource to support the imputation and mouse diversity array projects; however, it is being made available as a somewhat reduced-size but high-confidence collection of mouse SNPs. Database updates will be driven by the availability of new or updated genome assemblies, updated releases of major external SNP data sets, new SNP data sources, and maintenance. Future growth of the database will be targeted primarily at large-scale projects such as the ‘mouse genomes project’, as well as data sets that can increase the represented strain diversity. Minor releases of CGDSNPdb may also be generated for improved data visualization or underlying quality control procedures. This manuscript provides a high-level overview of the main components of CGDSNPdb, as of version 1.3 (January 2010).

Materials and Methods

Sampling, DNA isolation and genotyping

A total of 240 cod individuals from 9 locations at 7 ICES (International Council for the Exploration of the Sea) subdivisions along a transect across the Baltic Sea, Kattegat and North Sea (Fig. 10, Table 5) were collected between October 2012 and August 2013. Fin clips were stored in 70% ethanol at −70 °C. Genomic DNA was isolated using the Qiagen DNeasy 96 blood and tissue kit according to the manufacturer’s instructions and stored at −20 °C. The concentration of DNA was determined by UV-vis spectroscopy using an Epoch Microplate Spectrophotometer (BioTek Instruments, Inc., Winooski, USA). After normalization, samples were genotyped on a custom Gadus morhua SNP-array (Illumina, USA) containing 10,923 SNP assays, developed by a Norwegian consortium composed of four research organisations: Norwegian University of Life Sciences (NMBU), University of Oslo (UiO), NOFIMA AS, and the Institute for Marine Research (IMR) [38, 68, 69]. Samples were processed according to the manufacturer’s instructions and genotypes were obtained from Genome Studio (V2011.1). After filtering to remove poorly clustering SNPs (failing assays, multisite variants), a total of 8221 diploid SNPs remained. This data set was further trimmed to remove SNPs with a relatively high level of missing data (over 20%; n = 15), monomorphic SNPs (n = 32), and SNPs with minor allele frequencies (MAF) < 0.01 (n = 98). The final data set included genotypes from 8076 loci.
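The trimming step described above (8221 − 15 − 32 − 98 = 8076 loci) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: genotypes are assumed to be coded as counts of one allele (0, 1, 2) with `None` for a missing call, the filters are assumed to be applied in the order listed, and the function names are hypothetical.

```python
def snp_stats(calls):
    """Return (missing_rate, minor_allele_frequency) for one SNP.

    calls: genotypes coded as the count of one allele (0, 1 or 2),
    or None for a missing call.
    """
    observed = [g for g in calls if g is not None]
    missing_rate = 1 - len(observed) / len(calls)
    if not observed:
        return missing_rate, 0.0
    freq = sum(observed) / (2 * len(observed))  # diploid: 2 alleles each
    return missing_rate, min(freq, 1 - freq)


def filter_snps(genotypes, max_missing=0.20, min_maf=0.01):
    """Split SNPs into kept ones and those dropped by each filter."""
    kept = {}
    dropped = {"missing": [], "monomorphic": [], "low_maf": []}
    for snp_id, calls in genotypes.items():
        missing_rate, maf = snp_stats(calls)
        if missing_rate > max_missing:
            dropped["missing"].append(snp_id)
        elif maf == 0:
            dropped["monomorphic"].append(snp_id)
        elif maf < min_maf:
            dropped["low_maf"].append(snp_id)
        else:
            kept[snp_id] = calls
    return kept, dropped


# Made-up example data, one list of calls per SNP.
genotypes = {
    "snpA": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],               # passes all filters
    "snpB": [0, None, None, None, 1, 0, 0, 0, 0, 0],      # 30% missing
    "snpC": [0] * 10,                                     # monomorphic
    "snpD": [0] * 99 + [1],                               # MAF = 0.005
}
kept, dropped = filter_snps(genotypes)
print(sorted(kept), dropped)
```

Because each SNP is assigned to the first filter it fails, the per-filter counts depend on the order of the checks, while the number of SNPs kept does not.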

Map showing sampling sites and ICES subdivisions. Sample locations and codes are detailed in Table 5. Thin lines show borders between ICES subdivisions.

Genetic Wild West: 23andMe Raw Data Contains 75 Alzheimer’s Mutations

The spit tube arrives in the mail in a pretty box, stamped with the cheery message “Welcome to You.” The journey into your genome begins a few weeks later, with an email inviting you, a 23andMe customer, to explore your DNA. You’ll learn fun facts about your ancestry, your Neanderthal vestiges, and whether you are likely to turn beet red from a single cocktail. But if you want, the company will also disclose genetic health risks, and this is where things turn serious. Approved in April 2017 by the FDA, 23andMe’s health reports estimate genetic risk for four diseases so far, including Parkinson’s and Alzheimer’s, the latter based solely on carriage of the ApoE4 allele. Customers also learn if they are unwitting carriers of any of 42 recessive alleles that pose no threat to them, but could harm their children. Importantly, with a few more clicks you can open Pandora’s box, filled to the brim with your genotype at some 600,000 single nucleotide polymorphism (SNP) positions. You can download this nondescript list of A, C, T, and G combinations to your computer. At this point, you are on your own. 23andMe warns that this data isn’t validated. Some SNPs have spurious associations in the published literature. Others could pack quite a blow. The 23andMe chip reveals your genotype at 75 of the dominantly inherited mutations known to cause Alzheimer’s disease. Once that data is on your computer, can you resist taking a peek?

The implications are exciting, as unfettered access to one’s own genetic data holds great opportunities: In the case of ApoE4, it may motivate carriers to join an AD prevention trial or adopt a healthier lifestyle. People who discover in their raw data that they carry an autosomal-dominant AD (ADAD) mutation may find their way to the Dominantly Inherited Alzheimer’s Network (DIAN). However, to the chagrin of AD researchers, 23andMe has not agreed to refer carriers of these mutations toward such clinical studies.

The implications are also unnerving. The company told Alzforum that 2 million people have ordered the 23andMe kit since 2007, but declined to say how many found out their ApoE status or accessed their raw data. Even so, the overall growth of the direct-to-consumer (DTC) testing industry makes it likely that, before long, millions of people will grapple with the meaning of the complex, unwieldy, and massive data set that is their genome. Health care providers and genetic counselors—trained to guide people toward making informed decisions about whether to order genotyping—are increasingly being asked to pick up the pieces afterward. This amounts to crisis counseling for distraught consumers caught off-guard by their results. Others may not tell a soul, perhaps fearing discrimination by employers or insurers based on their genes. The company’s efforts to link customers to genetic counselors to help them understand surprises lurking in the raw data are seen as insufficient by some.

In general, the rise of direct-to-consumer genetic testing, a field 23andMe pioneered, raises scientific, ethical, and social issues that must be addressed, said Carlos Cruchaga of Washington University in St. Louis. “Clearly this is happening,” he told Alzforum. “Human genetics is moving very quickly. It’s easy to obtain the data, but understanding how to deal with it has always been a challenge.” Researchers sharply disagree on the ethics of selling raw genetic data to the general public.

After Early Stumble, DTC Genetic Testing Is Off to the Races
23andMe began offering its direct-to-consumer genetic testing services more than a decade ago, at one time providing its customers with genetic risk assessments for 254 diseases in addition to information about their ancestry and other physical traits. That changed in the United States in 2013, when the FDA shut down the company’s genetic risk component until it could show its tests and interpretations were valid (see Feb 2014 news). In April 2017, the FDA authorized the company to market so-called genetic health risk reports for 10 diseases, each validated for technical reproducibility and clinical relevance based on the scientific literature. In addition to AD and PD, the approved risk reports interpret variants associated with α-1 antitrypsin deficiency and hereditary thrombophilia. Reports for Gaucher disease type 1, celiac disease, early onset primary dystonia, factor XI deficiency, glucose-6-phosphate dehydrogenase deficiency, and hereditary hemochromatosis have also been approved but are not available yet. Carriage of the ApoE4 allele is used to assess a customer’s AD risk, while the G2019S mutation in LRRK2 and the N370S mutation in GBA are used for PD. Nowadays, customers can choose between two products: the cheaper ancestry-only results, or ancestry plus health reports.

As part of its pre-authorization work with the FDA, 23andMe had to conduct a study to show that consumers could understand their results and what they meant. The company agreed to present customers with warnings and limitations prior to unlocking their genetic health risk reports. This includes statements that genetic risk is but one component of overall risk, that the tests do not diagnose a disease, and that the results may upset some people. The company also suggests some clients may benefit from genetic counseling—either before or after receiving test results—and directs them to the National Society of Genetic Counselors (NSGC) to search for a counselor.

Susan Hahn, an NSGC member who specializes in AD, told Alzforum that direct-to-consumer testing has shifted the workload of counselors toward post-counseling, rather than pre-counseling. Most counselors work in partnership with health care providers, and help clients decide ahead of time if genetic testing is the right choice for them. “A genetic counselor may raise the difficult questions you didn’t think to ask yourself,” she said. With the rise of DTC testing, Hahn said counselors are increasingly seeing clients only after the test results are in. “At that point, you’re often just doing damage control,” she said, referring to clients who took the tests without fully considering the consequences.

Processing unexpected genotype information can be difficult. Jamie Tyrone, 57, of San Diego found out she carried two copies of the ApoE4 allele as a participant in a research study exploring attitudes about genetic risk. She had joined the study to know more about the genetic underpinnings of multiple sclerosis, but was unaware she would find out her ApoE genotype. She was offered no genetic counseling as part of the study, she told Alzforum. “Had I seen a counselor, I would have decided not to participate,” she said. Tyrone’s father suffered from AD, so she knew the gravity of the disease. The result pitched her into a dark hole, she said, and she was diagnosed with post-traumatic stress disorder, a condition she claimed cost $40,000 in counseling. Ironically, the study Tyrone participated in concluded that most volunteers did not suffer clinically significant levels of anxiety or distress (Boeldt et al., 2015).

Nowadays, Tyrone participates in a longitudinal study sponsored by the Banner Alzheimer’s Institute in Phoenix, which aims to chart the preclinical course of AD. When she turns 60, she hopes to qualify for a clinical trial. “The opportunity to participate in research is the only benefit I’ve received from learning about my ApoE genotype,” Tyrone told Alzforum.

23andMe does warn its users about the implications of learning about ApoE4. Still, Tyrone told Alzforum that she considers these warnings inadequate to prepare people for the consequences.

Other ApoE4 carriers see it differently. Julie Gregory, 55, of Long Beach, Indiana, found out in 2012 that she carried two copies of ApoE4 using 23andMe. At that time, the company had yet to offer genetic health risk reports that explained the results. Shaken and confused, Gregory started commiserating with hundreds of other carriers on 23andMe’s customer forums. Those interactions motivated her to form the website ApoE4.Info, where ApoE4 carriers support each other and discuss the latest research. “Although I was initially traumatized, knowing my status has been a good thing for me,” she said. “Knowledge is power.”

Gregory told Alzforum that while genetic pre-counseling would have been helpful, she doubts most people use it. “It’s great advice, but I’ve never seen anyone follow it,” she said. More important is that people have a place to turn after finding out their results, and that they feel motivated to make lifestyle changes to counter their genetic risk, she said. Gregory added that while she knows people who suffered severe psychological harm after learning their genotype, including PTSD, they tend to be the exception.

Gregory’s observation has a basis in research. In a recent study, only 4 percent of DTC genetic testing customers sought counseling (Koeller et al., 2017). Scott Roberts of the University of Michigan in Ann Arbor headed the study, which surveyed more than 1,000 customers of 23andMe and Pathway Genomics, another genetic testing company, in 2012, before 23andMe ran afoul of the FDA. The few participants who did use counseling tended to have had prior experience with genetic counselors, were highly educated, more affluent, and younger. Far more people shared their genetic testing information with primary care providers than with genetic counselors. This deference to physicians could be a case of “first-stop shopping,” Roberts told Alzforum, or a consequence of the dearth of available genetic counselors. He added that many genetic counselors have long waiting lists; they also tend to give DTC customers low priority because they consider at-home tests less urgent than those ordered by a doctor.

Does 23andMe point its customers toward clinical studies they could join? Not in Alzheimer’s disease. As part of its genetic health risk reports, the company shares general research information with its customers, including risk statistics for each variant, other non-genetic risk factors (such as cardiovascular disease, education, and lifestyle), and data supporting the potential benefits of exercise and diet. However, the company does not point them toward specific studies geared to their genotype.

For ApoE4 homozygotes, an obvious choice would be the Generation program by Novartis and Banner. This set of two secondary prevention/early intervention trials is evaluating a BACE inhibitor and an Aβ vaccine from Novartis in asymptomatic ApoE4 homozygotes and heterozygotes (Sep 2016 conference news). These trials are ramping up, seeking a total of 3,340 participants. Based on the allele frequency of ApoE4, more than 100,000 people will have to be screened genetically to fill those trials alone. Researchers universally agree that recruiting asymptomatic, at-risk participants who are willing to learn their ApoE status represents a significant challenge for trial sponsors and participating sites in academia and industry.
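The link between allele frequency and screening effort can be made concrete with a Hardy-Weinberg estimate. The sketch below is only illustrative: the allele frequency of 0.14 is an assumed round figure, not a value given in this article, and the calculation ignores every other eligibility criterion the trials apply.

```python
from typing import Dict

# Assumption for illustration only: an ApoE4 allele frequency of 0.14.
ASSUMED_E4_FREQ = 0.14


def hardy_weinberg(p: float) -> Dict[str, float]:
    """Expected genotype fractions for an allele of frequency p."""
    q = 1.0 - p
    return {
        "homozygous": p * p,        # two copies of the allele
        "heterozygous": 2 * p * q,  # one copy
        "non_carrier": q * q,       # no copies
    }


freqs = hardy_weinberg(ASSUMED_E4_FREQ)
screened = 100_000
for genotype, fraction in freqs.items():
    # Expected head count per genotype among the people screened.
    print(f"{genotype}: about {round(fraction * screened)} of {screened}")
```

Because the homozygote fraction scales with the square of the allele frequency, homozygotes are far scarcer than heterozygotes, which is why a trial aimed at them needs a very large screening pool.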

23andMe could help with this challenge. After all, presumably many thousands of people have learned their ApoE genotype through the company. According to Jessica Langbaum at Banner, the institute has tried, but has not been able to come to such an agreement with 23andMe. Banner, Novartis, and 23andMe all declined to discuss the issue further with Alzforum, citing ongoing negotiations or legal concerns. In the past, 23andMe has monetized its genetic data, receiving $60 million from Genentech for access to its Parkinson’s data (Jan 2015 news). For now, absent a referral partnership with 23andMe, Langbaum told Alzforum that Banner is stepping up its own efforts to raise awareness about the Generation program, in hopes of catching the attention of the growing number of ApoE4 carriers who have learned their genotype.

Langbaum noted that several former 23andMe customers have found and joined the Generation program. Even though they entered the study already aware of their genetic status, these volunteers undergo the same extensive counseling as other participants in the trial, and then get genotyped again to confirm their status. Prior to joining a Generation trial, many of these participants had already done extensive private research about ApoE; alas, many of them only got a taste of one-on-one genetic counseling upon joining the study, said Langbaum. “For most people, this is their first chance to ask someone questions,” she said.

Your Data in the Raw: A Trip Down the Genomic Rabbit Hole?
If the impact of learning your ApoE genotype seems unpredictable, imagine opening a data file containing 600,000 genotypes. 23andMe gives customers access to their personal file, with the stipulation that variants other than the select few included in the genetic health risk reports are not validated for accuracy. The company tells customers that the raw data is only suitable for “research, educational, and informational use, and not for medical, diagnostic, or other use.” Still, inquiring minds may want to know.

Data in the Raw.

A text file of raw data from 23andMe lists the rsid number, chromosome position, and genotype associated with each of more than 600,000 polymorphisms.
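A file laid out this way is straightforward to read programmatically. The following is a minimal sketch, assuming the file is tab-separated with four columns (rsid, chromosome, position, genotype), that header lines start with "#", and that "--" marks a no-call. The function name and the sample rows are made up for illustration and are not real genotype data.

```python
import io
from typing import Dict, NamedTuple, TextIO


class Call(NamedTuple):
    chromosome: str
    position: int
    genotype: str


def read_raw_genotypes(handle: TextIO) -> Dict[str, Call]:
    """Map each rsid to its call, skipping comment and blank lines."""
    calls: Dict[str, Call] = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        calls[rsid] = Call(chromosome, int(position), genotype)
    return calls


# Made-up rows for illustration only.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAG\n"
    "rs0000002\t19\t2000\tCC\n"
    "rs0000003\tX\t3000\t--\n"  # "--" assumed to mean no call
)

calls = read_raw_genotypes(io.StringIO(SAMPLE))
print(calls["rs0000002"].genotype)
```

With a real download, the same function could be called on an open file handle instead of the in-memory sample.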

Customers can browse their raw genotype data—either by chromosome, gene, or SNP—on 23andMe’s secure website. They can also download it, and upload it to a third-party service for interpretation. For example, for $5 and with a simple click on a bright-green button on its website, Promethease, currently the most popular of these genome-interpretation services, will upload anyone’s 23andMe data. Less than 15 minutes later, the customer can browse their SNPs within Promethease. This site curates information about the potential meaning of DNA variations from SNPedia. For its part, SNPedia derives its data from a variety of sources, including the scientific literature, the NIH-supported database ClinVar, and even the Alzforum mutations database. The current 23andMe genotyping chip contains about 20,000 of the roughly 100,000 SNPs curated by SNPedia, according to SNPedia and Promethease co-founder Greg Lennon.

Perhaps most importantly, a 23andMe/Promethease customer can type in any of the genes known to harbor ADAD mutations—PSEN1, PSEN2, or APP—and voila, a list of potentially pathogenic mutations pops up, along with your predicted genotype, and whether that genotype is “good” or “bad,” and how bad on a scale of one to 10 (see example below).

Dominant Data. One of many ADAD mutations listed in Promethease, found by uploading a 23andMe raw data file and searching for PSEN1. “Magnitude” refers to the size of the variant’s effect. This sample person carries the T allele; if he or she carried the G allele in this location, “Repute” would indicate “Bad,” and “Magnitude” would likely indicate the maximum assigned for the pathogenic variant, in this case 7.

According to an analysis conducted by Cruchaga, the custom Illumina genotyping chip that 23andMe currently uses contains 75 familial AD mutations: 53 in PSEN1, nine in PSEN2, and 13 in APP. In addition, Alzforum found dozens of autosomal-dominant mutations for frontotemporal dementia (FTD) in the tau (MAPT), progranulin (GRN), and CHMP2B genes. A bevy of risk-associated SNPs are also represented, including many of the current top 10 GWAS hits listed for AD (see AlzGene), PD (PDGene), and ALS (ALSGene).

Outside of neurodegeneration, SNPs on the 23andMe chip have been linked to myriad other diseases or traits, including cancer, diabetes, cardiovascular disease, addictive behavior, even lack of empathy. Other SNPs were included on the chip to derive ancestry information.

Customers who open their raw data file can find out their genotype at any of the 600,000 SNP positions contained on the chip, far beyond the 18 SNPs that are used for the genetic health risk reports that 23andMe shares with FDA approval. The raw data is available to all customers, even those who purchase the less expensive ancestry-only product that does not come with genetic health risk reports or carrier status, Wu told Alzforum. Therefore, if an ancestry-only customer carrying an ApoE4 allele were to upload his or her raw data to Promethease, they could see their ApoE genotype (see below). 23andMe is not the only DTC genetic testing company that shares raw data with its customers. For example, customers can also riffle through their raw data, which contain myriad clinically relevant SNPs as well.

Ready or Not: ApoE4. A carrier uploading his or her data to Promethease would face this entry. gs141 designates the ApoE3/4 genotype. Carriers are referred to Julie Gregory’s ApoE4.Info website for support.

Surprises in a person’s raw data raise questions about accuracy and once again highlight the importance of genetic counseling. Consider Summer Warner, a Midwestern U.S. woman in her early 20s. Warner was just curious about her ancestry, but got blindsided by an apparent increased risk for developing a deadly neurodegenerative disease. She received her 23andMe results back in 2010. Uploading her raw data to Promethease a few years later, Warner discovered that she had variants associated with the C9ORF72 hexanucleotide expansion that can cause amyotrophic lateral sclerosis or frontotemporal dementia. Terrified, she followed a link from 23andMe’s website to Informed DNA, a genetic counseling service. The company informed Warner that they would not discuss the result with her, Warner told Alzforum. Seeking advice, she reached out to 23andMe, which assured her that Informed DNA would indeed talk with her. She tried again, and received no response. When she reached out to genetic counselors at Washington University in St. Louis, the closest major city, they told her they would not discuss raw data from 23andMe, as the data were not validated to clinical standards.

Nevertheless, Warner persisted, and ultimately found her way to Chris Shaw of King’s College London, a geneticist and co-author on the ALS research study mentioning her variant. Shaw relieved Warner’s worries over email. “From our analysis it appeared that the SNP data did not increase her risk of carrying the C9ORF72 disease allele,” Shaw told Alzforum separately. “Her haplotype itself is very common in Europeans and is in no way a proxy for the expansion mutation.” Warner said that while this information assuaged her anxiety, it would have been preferable to speak with a genetic counselor. “Instead, I had to go this crazy route,” Warner said.

Shaw declined to comment about 23andMe directly. He told Alzforum that he considers it irresponsible to share this kind of raw genetic information with people without counseling. Shaw added that complex genetic data, and its relationship with disease risk, is often misunderstood or disagreed upon even among geneticists, counselors, and clinicians, let alone the average person. Even supposedly solid risk factors such as ApoE4 have varied effects in different populations, he said. “Misinformation about genetic data can generate a lot of fear,” he said.

The amount and quality of the data backing up the alleged phenotypic meaning behind a given genotype varies greatly; it evolves along with the progress of human genetics research overall.

Adam Boxer of the University of California, San Francisco, said he would approach 23andMe’s raw data with a hefty dose of caution. In particular, Boxer questioned the ethics of handing over information about causal mutations outside of a clinical context, without counseling, and using unvalidated data. Boxer heads the ARTFL consortium, which conducts longitudinal studies on FTD (Nov 2014 conference news). Among ARTFL’s participants are asymptomatic carriers of autosomal-dominant mutations in tau, progranulin, and C9ORF72, some of which can be found on the 23andMe genotyping chip. Boxer said ARTFL adheres to strict protocols for disclosing information about causal mutations to its participants. While he stopped short of suggesting that people not be allowed to access their raw data files, he suggested some sort of warning system or firewall be put in place to alert carriers of severe mutations that they may be about to view potentially alarming information.

Bradley Boeve of the Mayo Clinic in Rochester, Minnesota, heads LEFFTDS, a subset of ARTFL that includes carriers of autosomal-dominant FTD mutations. Boeve told Alzforum that he would be “uncomfortable” with any DTC company offering testing for causal mutations. He cited the potential for psychological harm to people who learn the information in the wrong context. Echoing Shaw and Boxer, Boeve added that there are aspects of genetics that uninformed people would not understand without counseling, such as incomplete penetrance.

Some of the causal mutations genotyped on 23andMe’s chip, including most FTD mutations, require genetic know-how to find, as not all of them pop up readily via a third party like Promethease. Even so, in essence, 23andMe informally makes available hard-hitting genetic risk information far beyond the formal health reports sanctioned by the FDA.

Shirley Wu of 23andMe reiterated to Alzforum that the company does not recommend that its customers attempt to extract clinical information from the raw data, but added that they should be free to look through it. “We don’t want to be the gatekeepers of this information,” she said. “It is really the individual’s choice what they want to do with their data.”

According to Wu, the genetic counseling landscape is changing, as a growing cadre of counselors are beginning to specialize in the interpretation of direct-to-consumer genetic results. Brianne Kirkpatrick, owner of Watershed DNA in Crozet, Virginia, is such a counselor. She started her service after encountering anxious people in Warner’s situation, she told Alzforum. Kirkpatrick counsels people before and, more commonly nowadays, after they undergo direct-to-consumer genetic testing. She helps people interpret findings in their raw data files. Kirkpatrick told Alzforum that confusion and anxiety are common. For example, one recent client had a scare upon seeing the list of pathogenic Alzheimer’s mutations in Promethease, because she did not understand that she carried the normal, rather than pathogenic, variant of each one.

Kirkpatrick said she sometimes helps clients navigate the literature supporting disease-associated SNPs, making a point not to overinterpret weak or contradictory information. If a pathogenic mutation, such as an ADAD mutation, is reported, Kirkpatrick recommends that clients obtain a clinical-level genetic test to confirm it. She added that the disclaimers direct-to-consumer genetic testing companies use to caution clients against overinterpreting their raw data don’t work with most people. “Even with the disclaimer, people still believe the data is reliable,” Kirkpatrick said. “The disclaimers are not sufficient to educate the public.”

Case in point, Alzforum found several heart-stopping errors within a 23andMe raw data file obtained from a volunteer. Using rsid numbers associated with mutations in the AD and FTD Mutation Database, Alzforum found genotypes for 26 pathogenic progranulin mutations in the volunteer’s raw data file. According to the genotypes called by 23andMe, this person was homozygous for three separate dinucleotide deletions, each reported to cause FTD in an autosomal-dominant fashion. Homozygous progranulin mutations trigger the childhood lysosomal storage disease neuronal ceroid lipofuscinosis (NCL). However, this volunteer is a healthy adult with no family history of FTD or NCL.

Why the error? One possibility is that the mutations—which have been documented in the genetics literature in one to seven families each—are not actually pathogenic, or were mislabeled in the ADFTD mutation database. A more likely explanation in this case is that the volunteer’s genotypes were wrong, Jose Bras and Rita Guerreiro of University College London explained to Alzforum. Genotyping chips often interpret deletions and insertions incorrectly, they said. In fact, Bras and Guerreiro ran into this same issue with some progranulin mutations on the NeuroXChip—a genotyping chip for neurological disorders that they helped design. Bras said that Illumina, the company that manufactures both NeuroX and 23andMe’s custom chips, gives researchers an estimate of how likely a given mutation is to be called correctly. But in the end, the only way to know for sure is to try the chip out, they said. If it proves not to work for a given genotype, researchers know to ignore it. “Of course, unlike 23andMe, we are not giving this raw data to customers,” Bras said. The progranulin problem is a prime example of just how unvalidated the raw data can be.
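Since chips are described here as prone to miscalling insertions and deletions, a script that reads raw data might set such calls aside for confirmation instead of interpreting them. This is a rough sketch under two assumptions that the article does not state: that indel alleles are written as "D" and "I" in the genotype column, and that "--" denotes a no-call. The rsids and genotypes are invented.

```python
from typing import Dict, List

# Assumptions: indel alleles coded as "D"/"I"; "--" means no call.
INDEL_SYMBOLS = {"D", "I"}
NO_CALL = "--"


def is_unreliable_call(genotype: str) -> bool:
    """True for no-calls and for any genotype involving an indel allele."""
    return genotype == NO_CALL or any(a in INDEL_SYMBOLS for a in genotype)


def calls_to_confirm(genotypes: Dict[str, str]) -> List[str]:
    """Return the rsids whose calls should not be interpreted as they stand."""
    return sorted(r for r, g in genotypes.items() if is_unreliable_call(g))


# Invented example data.
genotypes = {
    "rs0000001": "AG",
    "rs0000002": "DD",
    "rs0000003": "--",
    "rs0000004": "DI",
}
flagged = calls_to_confirm(genotypes)
print(flagged)
```

A flag of this kind says nothing about whether a call is actually wrong; it only marks the categories that the researchers quoted above say need independent checking.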

People who have had their entire genomes sequenced—an undertaking that is becoming increasingly affordable—have run into similar mismatches between their genotypes and phenotypes. A recent study reported that 11 out of 50 people who had their genomes sequenced supposedly carried pathogenic variants for various disorders. However, only two of those people actually had outward signs of the disease (Vassy et al., 2017).

Given the validation the FDA required to approve the 10 genetic health risk reports 23andMe issues, how does the agency view the raw data? Nary a mention of the raw data file appears in FDA’s authorization letter to 23andMe. The letter does state, however, that the 23andMe personal genome service, which includes the approved genetic health risk reports, cannot be used for “assessing the presence of deterministic autosomal dominant variants.” In an email to Alzforum, the FDA wrote, “Excluded from this authorization are diagnostic tests that are often used as the sole basis for major treatment decisions. Tests providing diagnostic information would be a different intended use to the genetic health risk reports recently authorized by the FDA and require a separate FDA submission.”

Alberto Gutierrez, director of the FDA’s Office of In Vitro Diagnostics and Radiological Health, which approved 23andMe’s genetic health risk reports, told Alzforum that the agency regulates the interpretation of genetic data, not the data itself. “There is a very strong desire by some people to own what they think is their data,” Gutierrez told Alzforum, referring to the raw data files. “As long as 23andMe is not making medical claims about it, we’re allowing them to share it.” What about third-party companies, such as Promethease, that do offer data interpretation? Gutierrez said that the FDA is watching such companies closely, and plans to contact those that “cross the line,” although exactly what that means is unclear at the moment.

In a June 20 commentary about the FDA’s approval of genetic health risk reports published in the Annals of Internal Medicine, Julia Wynn and Wendy Chung from Columbia University in New York criticized the agency’s attempt to separate genetic information from medical claims. “To allow DTC genetic testing and not expect persons to use the information to inform medical decisions is disingenuous and irresponsible,” they wrote. “This ruling is confusing—in asymptomatic persons, the genetic test provides the only data to support the diagnosis of or increased risk for such conditions as Alzheimer or Parkinson disease, and could be the sole basis for a medical decision.”

Good examples of medically actionable information are the BRCA mutations known to drive risk for breast cancer, FDA spokesperson Tara Goodin told Alzforum, because carriers could seek preventive procedures such as a mastectomy. BRCA mutations are not included in the genetic health risk reports, though some are in the raw data file. But how about the 75 autosomal-dominant AD mutations lurking in the raw data? Goodin told Alzforum that as the data are not validated, nor interpreted by 23andMe, customers should not view them as diagnostic tests. If someone were to find they harbored such a mutation, the next step would be to confirm the result via clinical testing with a health care provider, she said.

Randall Bateman of Washington University in St. Louis partly agrees. However, he recommended that before seeing a doctor, people who discover one of these deterministic mutations in their 23andMe raw data first secure life and long-term care insurance, especially if they have a family history of AD. The reason is that, unlike workplace equality or access to medical insurance, access to life and long-term care insurance is not protected under the Genetic Information Nondiscrimination Act (GINA). Once a person has a validated clinical result on his or her medical record, he or she may be required to divulge it on insurance application forms. This issue has even come up for people who carry two copies of the ApoE4 allele (May 2017 New York Times article). After getting insurance, the next step would be to decide, with the guidance of a genetic counselor, whether to seek bona fide clinical testing for the mutation, Bateman said.

Bateman readily acknowledged that discovering a familial AD mutation in a raw data file could be disturbing. “Maybe you were just interested in finding out if you descended from Vikings, and then you find one of these mutations instead,” he said. However, he added that given the strong family history of autosomal-dominant Alzheimer’s disease, the revelation would come as little surprise for most carriers of ADAD mutations. People who discover such a mutation in their raw data file can contact the DIAN study through the DIAN Expanded Registry, where they will be guided toward counseling if they wish. In fact, a few participants discovered DIAN after perusing their 23andMe raw data, though they did so without referral help from the company.

Just as 23andMe has no partnership with Banner or Novartis to direct ApoE4 homozygotes to the Generation trials program, it also does not point carriers of ADAD mutations to DIAN. 23andMe spokesperson Andy Kill told Alzforum that as of now, the company does not keep tabs on how many of its customers carry ADAD mutations. However, Bateman said that directing these carriers to the DIAN registry would align with the company’s mission of sharing useful information with its customers. Like the Generation program, DIAN is expanding; its trials unit, DIAN-TU, in particular needs more participants to gain statistical power for its prevention studies.

Given the serious implications of carrying an ADAD mutation, Bateman raised the bar, suggesting that direct-to-consumer genetic testing companies have a responsibility to share this information with willing customers. After all, a person who has ordered this product has expressed an interest in genetic information, and arguably deserves follow-up in instances where there are concrete actions he or she can take in the face of distressing risk, such as joining a prevention drug study. Bateman proposed a notification process by which the testing company would ask both carriers and non-carriers whether they would want to be notified if they did harbor a serious mutation. For those who answer yes, the company could perform clinical testing to confirm the result, and cover the cost of genetic counseling. Boxer would like to see a similar effort to refer carriers to registries that feed research studies such as ARTFL.

“These companies could develop a process to enable individuals, in a safe and ethical way, to find information that may change their lives and the lives of their families,” Bateman said. “As holders of this information, DTC companies are in the position to make a difference. The question is, what kind of difference do they want to make?”

Besides 23andMe, another venue for recruitment to prevention trials would be genetic-interpretation companies such as Promethease. People who find out their genotype there could be directed to trials such as the Generation program (for ApoE4 carriers) or the DIAN registry (for ADAD mutation carriers). Of course, this pool would be limited to customers who took the step of analyzing their data via Promethease.

Lennon, the co-founder of SNPedia and Promethease, said his company is willing to direct mutation carriers to prevention studies. “We could do this automatically, and easily,” he told Alzforum. “But every foundation or nonprofit we’ve gone to has ultimately said no.” While he declined to name the organizations, he said that they were wary of referrals based on unvalidated data, or that they asked for too much exclusivity at the expense of other foundations.

Lennon told Alzforum that Promethease tries to give carriers of pathogenic variants medically useful information as available. One source is the evidence-based summaries developed by ClinGen’s Actionability Working Group. This NIH-funded panel scores the “clinical actionability” of various pathogenic variants, and gives recommendations. Promethease presents this to carriers, as exemplified by the BRCA2 mutation below. Neurodegenerative diseases are largely absent from this ClinGen list, due to the lack of approved drugs, Lennon said.

A Call to Action? Promethease brings in recommendations from ClinGen to help mutation carriers take action to prevent or treat disease.

Even as researchers would like to see DTC companies step up their games on counseling and referral, 23andMe has contributed to research in other areas, especially Parkinson’s. Spurred in part by the discovery that Sergey Brin, the ex-husband of 23andMe founder Anne Wojcicki, carries a pathogenic LRRK2 mutation, 23andMe tackled the genetic architecture of PD. Partnering with the Michael J. Fox Foundation, 23andMe gathered a deeply genotyped and phenotyped cohort of PD patients who have contributed to GWAS, informed biomarker research, and whose DNA is currently being plumbed in search of rare pathogenic variants (Jul 2014 news and Nalls et al., 2015).—Jessica Shugart


Minimal marker set

Up to week 18, the high-quality COG-UK sequence alignment comprised 14,277 sequences, as indicated in the accompanying metadata file. We found 41 SNPs meeting our criteria of a minimum minor allele frequency of 0.1%. Of these, our pipeline identified 22 as sufficient to provide the maximum possible discrimination between samples in the COG-UK dataset. Three SNPs were removed manually from this list as their flanking sequences (for probe design) either were overlapping or contained ambiguous bases (‘N’) close to the SNP of interest. Prior to wet-lab marker validation, we found that these 19 SNPs were capable of delineating 59 distinct variants from the COG-UK sequence alignment (S3 Table). To test the discriminatory power of the 19-marker set (hereafter named the test set), random pairs of haplotypes for our marker positions were sampled from the COG-UK sequence alignment without replacement. We found that 89.1% of 6,202 random sample pairs were distinct at one or more marker positions. The flanking sequences for the 19 selected SNPs of the test set (S1 Table) were sent to 3CR Biosciences for probe design.
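One way to picture the selection of a marker subset that still gives maximum discrimination is a greedy search. The sketch below is an assumption about how such a step could work, not the pipeline used in this study: at each round it adds the site that splits the samples into the most groups, and stops when no site adds a new group. The haplotype strings are toy data.

```python
from typing import List, Sequence


def count_groups(haplotypes: Sequence[str], markers: Sequence[int]) -> int:
    """Number of distinct profiles when only the given sites are read."""
    return len({tuple(h[m] for m in markers) for h in haplotypes})


def greedy_marker_set(haplotypes: Sequence[str]) -> List[int]:
    """Pick sites one at a time until no site separates any more samples."""
    n_sites = len(haplotypes[0])
    chosen: List[int] = []
    best = count_groups(haplotypes, chosen)  # a single group to start with
    while True:
        # Score every unused site; ties go to the lowest site index.
        scored = [
            (count_groups(haplotypes, chosen + [s]), -s)
            for s in range(n_sites)
            if s not in chosen
        ]
        if not scored:
            break
        top, neg_site = max(scored)
        if top <= best:
            break  # no remaining site creates a new group
        chosen.append(-neg_site)
        best = top
    return chosen


# Toy haplotypes over four variable sites (site 1 is uninformative).
haplotypes = ["AAAA", "AATA", "GATA", "GATC", "AAAA"]
markers = greedy_marker_set(haplotypes)
print(markers, count_groups(haplotypes, markers))
```

A greedy search of this kind always reaches the maximum number of groups, because any two samples that still share a group but differ somewhere can be split by one more site. It is not guaranteed to return the smallest possible set.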

Synonymous and non-synonymous SNPs.

All nineteen SNP markers in the test set target SNPs located in coding sequences. With regard to the codons within the open reading frame (ORF) of these genes, five of the SNPs were at position 1, six at position 2 and eight at position 3. Twelve of the SNPs were non-synonymous and would result in changes to the amino acid at the given position (Table 2).
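The codon position and the synonymous or non-synonymous status of a coding SNP follow directly from its offset within the open reading frame. The sketch below uses the standard genetic code and a made-up nine-base ORF; the function name and the 0-based offset convention are choices made for this illustration.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(orf: str, offset: int, alt: str) -> Tuple[int, str, str, bool]:
    """Return (codon position 1-3, reference aa, alternate aa, synonymous?).

    offset is the 0-based index of the SNP within the ORF.
    """
    within = offset % 3
    start = offset - within
    ref_codon = orf[start:start + 3]
    alt_codon = ref_codon[:within] + alt + ref_codon[within + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    return within + 1, ref_aa, alt_aa, ref_aa == alt_aa


# Made-up ORF: ATG GCT TTA encodes Met-Ala-Leu.
ORF = "ATGGCTTTA"
third = classify_snp(ORF, 5, "C")   # GCT -> GCC
first = classify_snp(ORF, 3, "A")   # GCT -> ACT
second = classify_snp(ORF, 7, "C")  # TTA -> TCA
print(third, first, second)
```

Changes at codon position 3 are the most likely to be synonymous because of redundancy in the genetic code, which is consistent with a panel containing a mix of both classes.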

Evaluation of the test set.

Initial evaluation of the test set was performed using the two cell culture propagated SARS-CoV-2 isolates GBR/Liverpool_strain/2020 and hCoV-19/England/02/2020. The two virus genomes vary at ten nucleotide positions (Table 1) but have no differences in the wt spike gene sequences. However, in addition to the wt viral genome, the hCoV-19/England/02/2020 virus stock was known to contain a variant genome that arose during viral passage in tissue culture, which had a 24 nt in frame deletion in the spike gene sequence (BrisΔS, Table 1). Genotypes were obtained for all 19 markers (Table 3).

Concordance between genotyping and sequencing.

The two SARS-CoV-2 isolates GBR/Liverpool_strain/2020 and hCoV-19/England/02/2020 had been sequenced, enabling a comparison with our genotyping data (Table 3). All genotyping results were concordant with the sequence data. In two cases, it was possible to confirm SNPs (at nts 11083 and 28144) differentiating the two wt SARS-CoV-2 isolates with both sequence and genotyping data. We also compared these data with the available COG-UK sequences from the 2020-05-08 dataset (representing PCR-positive samples circulating March–May 2020). This showed that the majority of genotype calls concord with the major allele found in the COG-UK database.

Genotyping clinical SARS-CoV-2 samples

To further evaluate the test set we genotyped 50 SARS-CoV-2 positive samples obtained from PHE (samples collected from the South West of England). For 41 of the 50 samples, results were obtained from at least 50% of the SNP markers in our panel; those that fell below this threshold were excluded from further analysis (S4 Table). For 22 of the remaining 41 samples, results were obtained for all 19 markers, and for a further 13 samples, results were obtained from at least 15 of the 19 markers.

We found that 11 of the 19 markers were polymorphic among the 50 PHE samples and could be used to assign them to 15 distinct groups (Fig 3 and S4 Table). To quantify the utility of our SNP panel in separating positive samples into distinct groups, we sampled random pairs of the 50 genotyped samples 1000 times and found that they were separated by at least one marker in 619 cases (61.9%).
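The pair-sampling estimate of discriminatory power can be sketched in a few lines. This is an illustration, not the authors' code: the marker profiles are toy strings, "?" is an assumed placeholder for a missing call, and treating positions with a missing call as uninformative is a choice made here.

```python
import random
from itertools import combinations
from typing import Sequence

MISSING = "?"  # assumed placeholder for a failed or masked call


def differ(a: str, b: str) -> bool:
    """True if two profiles disagree at any position called in both."""
    return any(x != y for x, y in zip(a, b) if MISSING not in (x, y))


def exact_discrimination(profiles: Sequence[str]) -> float:
    """Fraction of all sample pairs separated by at least one marker."""
    pairs = list(combinations(profiles, 2))
    return sum(differ(a, b) for a, b in pairs) / len(pairs)


def sampled_discrimination(
    profiles: Sequence[str], n_pairs: int, seed: int = 0
) -> float:
    """Monte Carlo version: draw n_pairs random pairs of distinct samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        a, b = rng.sample(list(profiles), 2)
        hits += differ(a, b)
    return hits / n_pairs


# Toy profiles over three markers.
profiles = ["AAT", "AAT", "GAT", "GCT", "A?T"]
exact = exact_discrimination(profiles)
estimate = sampled_discrimination(profiles, 1000)
print(exact, estimate)
```

For a set as small as 50 samples, the exact calculation over all pairs is cheap; random sampling matters more for alignments with thousands of sequences.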

SNPs with a single allele call per sample are marked in dark blue (major allele) or orange (minor allele). Mixed calls are shown in gold and missing data in light blue. Twelve out of 19 markers were polymorphic in our small test panel of PHE samples and cell lines (eleven out of 19 markers were polymorphic in PHE samples) and eight samples had mixed calls for one or more markers.

Marker fail rate in PHE samples.

The average fail rate by marker (that is, the marker produced no signal for some samples) was 19.4%, ranging from 4% (marker Bris_SARS-CoV-2_25429) to 32% (markers Bris_SARS-CoV-2_2558 and Bris_SARS-CoV-2_25350). The number of fails per sample ranged from 0% (22 of the samples) to 80% (2 of the samples); those samples with fewer than 10 calls (9 in total) were removed from further analysis (S4 Table).
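Per-marker fail rates and the per-sample call threshold can be computed from a simple table of calls. The sketch below assumes a layout of one list of calls per sample, with None standing for a marker that gave no signal; the sample names and calls are invented.

```python
from typing import Dict, List, Optional

Calls = Dict[str, List[Optional[str]]]


def marker_fail_rates(calls: Calls) -> List[float]:
    """Fraction of samples with no signal, for each marker in turn."""
    rows = list(calls.values())
    n_markers = len(rows[0])
    return [
        sum(row[m] is None for row in rows) / len(rows)
        for m in range(n_markers)
    ]


def filter_samples(calls: Calls, min_calls: int) -> Calls:
    """Keep only samples with at least min_calls successful markers."""
    return {
        name: row
        for name, row in calls.items()
        if sum(c is not None for c in row) >= min_calls
    }


# Invented calls for three samples and four markers.
calls: Calls = {
    "s1": ["A", "G", None, "T"],
    "s2": [None, None, None, "T"],
    "s3": ["A", "G", "C", "T"],
}
rates = marker_fail_rates(calls)
kept = filter_samples(calls, min_calls=2)
print(rates, sorted(kept))
```

With a 19-marker panel, a threshold of 10 calls corresponds to requiring results from at least half of the markers.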

An evolving target

The Microreact website [8] shows how SARS-CoV-2 lineage frequencies have changed during the outbreak, and similarly the SNPs we targeted in our panel also changed in frequency over time. To quantify the effect of alterations in SNP frequency over time on the discriminative power of the 19 SNP panel, it was tested bioinformatically against random pairs of samples drawn from week 19 through week 35 in the 2020-09-03 COG-UK data. The probability of the original marker set discriminating a random pair of samples decreased from 89.1 to 77.6%. There was, however, an anomaly in this analysis, as our G/T SNP at position 11,083, recorded as a variant in the 2020-05-08 COG-UK data and polymorphic in our genotyping results, is reported as the non-IUPAC character “?” in the 2020-09-03 COG-UK alignment due to it exhibiting homoplasy in phylogenetic reconstruction (Andrew Rambaut, personal communication). The loss of data for this marker from the latest COG-UK alignment means we will have underestimated the discriminatory power of our panel on more recent samples. Nonetheless, we re-ran the SNP marker discovery pipeline on the week 19–35 sequences and found that the number of SNPs present at a frequency greater than 0.001 had increased from 41 to 97 (noting that the SNP at 11,083 has been masked out of that alignment) and that 51 markers were now required to discriminate all samples to the maximum amount possible. However, the majority of variants were extremely rare, such that just the first 24 markers (S5 Table) were capable of discriminating 95% of randomly selected sample pairs.
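The frequency filter used for SNP discovery can be illustrated on a toy alignment. In this sketch the minor allele frequency of a column is taken to be the share of the second most common base among unambiguous A, C, G and T calls; that definition, and the choice to ignore every other character, are assumptions made for this example.

```python
from collections import Counter
from typing import List, Sequence


def minor_allele_frequency(column: str) -> float:
    """Share of the second most common unambiguous base in a column."""
    counts = Counter(b for b in column.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # invariant site, or nothing callable
    return counts.most_common(2)[1][1] / total


def variable_sites(alignment: Sequence[str], min_maf: float = 0.001) -> List[int]:
    """0-based column indices whose minor allele frequency meets min_maf."""
    return [
        i
        for i, col in enumerate(zip(*alignment))
        if minor_allele_frequency("".join(col)) >= min_maf
    ]


# Toy alignment of four sequences; N and ? are treated as missing.
alignment = ["ACG", "ACG", "ATG", "ACN"]
sites = variable_sites(alignment, min_maf=0.1)
print(sites, minor_allele_frequency("CCTC"))
```

A site that is masked with "?" in every sequence, as described for position 11,083, would drop out of such a filter because no unambiguous calls remain in its column.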

Promethease — a tool for anyone to understand genetic health risks

Promethease is a literature retrieval system that pulls its information from SNPedia, a vast wiki of research studies on how genes affect (predominantly) medical traits. The genetic information is then mapped against the genetic data you uploaded to generate a personal DNA report on genetic health risks. Medical information based on your raw data has been notably hard to come by in consumer DNA tests, owing to strict FDA oversight. Combine that with Promethease’s not-so-friendly user interface, and we have many people missing out on the benefits of a Promethease report — namely getting an idea of whether they have any medical risks to look out for.
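The mapping step described above amounts to a lookup of each uploaded genotype in an annotation table. The sketch below is hypothetical throughout: the table layout, the field names and every entry are invented for illustration and do not reflect how Promethease or SNPedia actually store data. It also ignores strand orientation and allele order, which a real tool would have to handle.

```python
from typing import Dict, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    magnitude: float
    repute: str
    summary: str


# Hypothetical annotation table keyed by (rsid, genotype); entries invented.
ANNOTATIONS: Dict[Tuple[str, str], Annotation] = {
    ("rs0000001", "AG"): Annotation(2.0, "Bad", "made-up example entry"),
    ("rs0000001", "AA"): Annotation(0.0, "Good", "made-up common genotype"),
    ("rs0000002", "CC"): Annotation(7.0, "Bad", "made-up high-magnitude entry"),
}


def build_report(
    genotypes: Dict[str, str],
    annotations: Dict[Tuple[str, str], Annotation],
    min_magnitude: float = 0.0,
) -> List[Tuple[str, str, Annotation]]:
    """Match each genotype to its annotation, highest magnitude first."""
    rows = []
    for rsid, genotype in genotypes.items():
        note = annotations.get((rsid, genotype))
        if note is not None and note.magnitude >= min_magnitude:
            rows.append((rsid, genotype, note))
    rows.sort(key=lambda row: row[2].magnitude, reverse=True)
    return rows


# Invented genotypes; rs0000003 has no annotation and is skipped.
genotypes = {"rs0000001": "AG", "rs0000002": "CC", "rs0000003": "TT"}
report = build_report(genotypes, ANNOTATIONS)
high_only = build_report(genotypes, ANNOTATIONS, min_magnitude=5.0)
print([row[0] for row in report], len(high_only))
```

A magnitude cut-off of this kind is one simple way to keep a first look at a report from being overwhelming.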

That being said, there are clear limitations rooted in how Promethease functions. Details like the genetic variants referenced in a report not being corroborated by other research studies or having SNPs with contradicting results can really throw you off course. Many of these limitations can be controlled for by making use of the many filters they have.

So before we get into how to interpret your report, let me share my guidelines for pre-filtering the results. The main purpose is so that you don’t get unnecessarily overwhelmed when you first see your report, so feel free to adjust the specific numbers as you see fit. (Once you’re more familiar with the format, I encourage you to play around with the numerous filter settings.)

