Whole Genome Sequencing and Chromosome Counts

In general, do standard whole genome sequencing techniques rely more on known chromosome counts, independently arrive at chromosome counts, and/or not directly address issues such as base number, aneuploidy, and polyploidy?

For example, would standard whole genome sequencing techniques have detected that humans have 46 rather than 48 chromosomes?

Maybe a better way to put this question would be as follows:

Given the whole genome sequence for humans and a belief that chimpanzees have 46 chromosomes, would whole genome sequencing of the chimpanzee most likely say:

1) Here are the sequences for those 46, and by the way there was material left over.

2) Here is the best representation of all the chimpanzee's DNA mapped onto 46 chromosomes.

3) Here are the sequences for those 46, and chimpanzees appear to have 48 chromosomes.

What Biomed_guy says is basically the answer; I just wanted to clarify a bit. When you sequence DNA, you do something during the preparation of your DNA library, such as shearing, to turn the DNA into small fragments if it isn't fragmented already. This gives you very small pieces of DNA that should still be quite unique when matched to a reference genome.

But as previously answered, the determining factor for how DNA is then mapped is the reference genome you are using. So the answer is what you state in both 1 and 2. When trying to figure out what DNA you have, you start with a reference sequence, which is all of the genomic DNA 'laid out' so that your fragments can be compared against it and assigned to the best-fitting region based on matching sequence. Anything that does not match the reference genome gets dumped into a 'junk' output file.

So conceptually, many gene sequences are highly conserved between humans and chimpanzees. This means that you could successfully align your sequence fragments to either genome. Importantly, this is agnostic to the number of chromosomes. It is the genes that matter.

Imagine that Ras, Myc, and Erk are all on chr 12 in humans but split up in chimps, so that Ras is on chr 12, Myc on chr 13, and Erk on chr 24. This would make no difference when you map back to the genome, assuming the genes are conserved between human and chimp. If you have a sequenced fragment from human Myc that is the same as chimp Myc, it would align to a location on chr 12 if you use human DNA as the reference, or to a location on chr 13 if you use chimp as the reference. Make sense?
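As a rough illustration of why mapping is agnostic to chromosome number, here is a minimal Python sketch. The toy reference dictionaries, the sequences, and the function name `map_reads` are all hypothetical, and real aligners tolerate mismatches rather than requiring exact substring matches as this one does.

```python
def map_reads(reads, reference):
    """Assign each read to the first reference chromosome that contains it exactly.

    reference: dict of chromosome name -> sequence (toy data, not real genomes).
    Returns (placed, unmapped), where placed maps read -> (chromosome, offset).
    """
    placed, unmapped = {}, []
    for read in reads:
        for chrom, seq in reference.items():
            pos = seq.find(read)
            if pos != -1:
                placed[read] = (chrom, pos)
                break
        else:
            # No chromosome in this reference contains the read.
            unmapped.append(read)
    return placed, unmapped


# The same made-up "gene" sits on different chromosomes in two toy references.
human_like = {"chr12": "AAACCCGGGTTTACGTACGT"}
chimp_like = {"chr12": "AAACCCGGG", "chr13": "TTTACGTACGT"}

reads = ["CCCGGG", "ACGTACG", "GATTACA"]
print(map_reads(reads, human_like))
print(map_reads(reads, chimp_like))
```

The read `ACGTACG` lands on chr12 with one reference and chr13 with the other, and `GATTACA` ends up in the unmapped list either way. The chromosome labels in the output come entirely from whichever reference was supplied.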

I'm pretty sure that it relies on a reference genome. As in, since the original human genome sequencing, most of the techniques used today rely on that original construction (with modifications).

That's how people use it to detect copy number variations (CNVs) and single nucleotide polymorphisms (SNPs) with these sequencing techniques (and aneuploidy as well).

The reads from the sequencing run are compared to the reference genome, and if statistically significant differences in the number of reads are seen, they are noted as differences in the genome. If there is extra unmapped DNA, the pipeline would likely not know where to place it and would simply report it as "extra material".

So #1.
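The read-count comparison described above can be sketched in Python as follows. This is a simplified illustration with invented numbers: the function name `copy_number_estimates` and the choice of the median as the "typical" depth are assumptions, and real CNV callers also correct for GC content and mappability. Note that it estimates dosage of the chromosomes already defined by the reference; it does not discover a new chromosome count.

```python
from statistics import median


def copy_number_estimates(read_counts, chrom_lengths, baseline_copies=2):
    """Estimate copies of each reference chromosome from mapped-read counts."""
    # Depth = mapped reads per base of reference sequence.
    depth = {c: read_counts[c] / chrom_lengths[c] for c in read_counts}
    typical = median(depth.values())
    # Scale so that the typical (median) depth equals the baseline copy number.
    return {c: baseline_copies * d / typical for c, d in depth.items()}


# Toy numbers: chrC has 1.5 times the typical depth, as expected for a trisomy.
lengths = {"chrA": 1000, "chrB": 2000, "chrC": 500}
counts = {"chrA": 100, "chrB": 200, "chrC": 75}
for chrom, copies in copy_number_estimates(counts, lengths).items():
    print(chrom, round(copies, 2))
```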

To be clear, it's not that chimps have lots of extra DNA that humans lack; they have two chromosomes that are fused into one in humans.

Theoretically, discrepancies between the reads and the reference could be discovered. Practically, with current short-read technology, it would be difficult to sort out a discrepancy that involved a repetitive region of the genome. The telomeres of chimp chromosomes 2A and 2B, and the telomere-like sequences in the region where those chromosomes fused in humans, are exactly that kind of region.

Types of DNA sequencing: Sanger sequencing, whole genome shotgun sequencing, next-generation sequencing

DNA sequencing is a process of determining the order of the four chemical building blocks - called "nitrogenous bases" - that make up the DNA molecule.

The process used to sequence DNA is known as chain termination sequencing or Sanger DNA sequencing. It relies on a modified form of the polymerase chain reaction.

Strategies Used in Sequencing Projects

The basic sequencing technique used in all modern day sequencing projects is the chain termination method (also known as the dideoxy method), which was developed by Fred Sanger in the 1970s. The chain termination method involves DNA replication of a single-stranded template with the use of a primer and a regular deoxynucleotide (dNTP), which is a monomer, or a single unit, of DNA. The primer and dNTP are mixed with a small proportion of fluorescently labeled dideoxynucleotides (ddNTPs). The ddNTPs are monomers that are missing a hydroxyl group (-OH) at the site at which another nucleotide usually attaches to form a chain (Figure 1).

Figure 1: A dideoxynucleotide is similar in structure to a deoxynucleotide, but is missing the 3' hydroxyl group (indicated by the box). When a dideoxynucleotide is incorporated into a DNA strand, DNA synthesis stops.

Each ddNTP is labeled with a different color of fluorophore. Every time a ddNTP is incorporated in the growing complementary strand, it terminates the process of DNA replication, which results in multiple short strands of replicated DNA that are each terminated at a different point during replication. When the reaction mixture is processed by gel electrophoresis after being separated into single strands, the multiple newly replicated DNA strands form a ladder because of the differing sizes. Because the ddNTPs are fluorescently labeled, each band on the gel reflects the size of the DNA strand and the ddNTP that terminated the reaction. The different colors of the fluorophore-labeled ddNTPs help identify the ddNTP incorporated at that position. Reading the gel on the basis of the color of each band on the ladder produces the sequence of the template strand (Figure 2).

Figure 2: Frederick Sanger's dideoxy chain termination method is illustrated. Using dideoxynucleotides, the DNA fragment can be terminated at different points. The DNA is separated on the basis of size, and these bands, based on the size of the fragments, can be read.
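The logic of reading the ladder can be sketched in a few lines of Python. The band data and the function names `read_ladder` and `reverse_complement` are invented for illustration. The sketch assumes one band per fragment length, and it makes explicit that the bands spell out the newly synthesized strand, from which the template is inferred by complementarity.

```python
def read_ladder(bands):
    """Read the synthesized strand from chain-terminated fragments.

    bands: iterable of (fragment_length, terminal_base) pairs, one per band.
    Shorter fragments terminated earlier, so sorting by length gives base order.
    """
    return "".join(base for _, base in sorted(bands))


def reverse_complement(seq):
    """Infer the template strand (5' to 3') from the synthesized strand."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


# Hypothetical bands in the order they happened to be recorded.
bands = [(2, "T"), (1, "A"), (4, "C"), (3, "G")]
synthesized = read_ladder(bands)
print(synthesized, reverse_complement(synthesized))
```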

Early Strategies: Shotgun Sequencing and Pair-Wise End Sequencing

In the shotgun sequencing method, several copies of a DNA fragment are cut randomly into many smaller pieces (somewhat like what happens to a round shot cartridge when fired from a shotgun). All of the segments are then sequenced using the chain-termination method. Then, with the help of a computer, the fragments are analyzed to see where their sequences overlap. By matching up overlapping sequences at the end of each fragment, the entire DNA sequence can be reformed. A larger sequence that is assembled from overlapping shorter sequences is called a contig. As an analogy, consider that someone has four copies of a landscape photograph that you have never seen before and know nothing about. The person then rips up each photograph by hand, so that pieces of different sizes are produced from each copy. The person then mixes all of the pieces together and asks you to reconstruct the photograph. In one of the smaller pieces you see a mountain. In a larger piece, you see that the same mountain is behind a lake. A third fragment shows only the lake, but it reveals that there is a cabin on the shore of the lake. Therefore, from looking at the overlapping information in these three fragments, you know that the picture contains a mountain behind a lake that has a cabin on its shore. This is the principle behind reconstructing entire DNA sequences using shotgun sequencing.
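The overlap-and-merge idea can be sketched as a greedy assembler in Python. This is a toy under stated assumptions: the fragments are error-free, all come from the same strand, and contain no long repeats. The function names and the minimum overlap of 3 bases are arbitrary choices, not how production assemblers work.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; the rest stay separate contigs
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGGCGTA", "CGTACGTT", "CGTTAGC"]))
```

With these three fragments the result is a single contig. Fragments that share no sufficiently long overlap are returned as separate contigs, which is why repetitive regions are hard to assemble from short pieces.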

Originally, shotgun sequencing only analyzed one end of each fragment for overlaps. This was sufficient for sequencing small genomes. However, the desire to sequence larger genomes, such as that of a human, led to the development of double-barrel shotgun sequencing, more formally known as pairwise-end sequencing. In pairwise-end sequencing, both ends of each fragment are analyzed for overlap. Pairwise-end sequencing is, therefore, more cumbersome than shotgun sequencing, but it is easier to reconstruct the sequence because there is more available information.

Next-generation Sequencing

Since 2005, the automated sequencing techniques used by laboratories have fallen under the umbrella of next-generation sequencing, which is a group of automated techniques used for rapid DNA sequencing. These automated low-cost sequencers can generate sequences of hundreds of thousands or millions of short fragments (25 to 500 base pairs) in the span of one day. These sequencers use sophisticated software to get through the cumbersome process of putting all the fragments in order.

Evolution Connection: Comparing Sequences

A sequence alignment is an arrangement of protein, DNA, or RNA sequences. It is used to identify regions of similarity between cell types or species, which may indicate conservation of function or structure. Sequence alignments may be used to construct phylogenetic trees. The following website uses a software program called BLAST (basic local alignment search tool).

Under "Basic Blast," click "Nucleotide Blast." Input the following sequence into the large "query sequence" box: ATTGCTTCGATTGCA. Below the box, locate the "Species" field and type "human" or "Homo sapiens". Then click "BLAST" to compare the inputted sequence against known sequences of the human genome. The result is that this sequence occurs in over a hundred places in the human genome. Scroll down below the graphic with the horizontal bars and you will see a short description of each of the matching hits. Pick one of the hits near the top of the list and click on "Graphics". This will bring you to a page that shows where the sequence is found within the entire human genome. You can move the slider that looks like a green flag back and forth to view the sequences immediately around the selected gene. You can then return to your selected sequence by clicking the "ATG" button.
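To give a feel for what a local alignment score is, here is a small Python sketch of the classic Smith-Waterman dynamic-programming recurrence. BLAST itself uses a faster seed-and-extend heuristic rather than this exhaustive method, and the scoring values (match 2, mismatch -1, gap -2) and the toy subject sequence are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


query = "ATTGCTTCGATTGCA"
subject = "GGG" + query + "GGG"  # toy subject containing the query exactly once
print(local_alignment_score(query, subject))
```

An exact 15-base match scores 15 × 2 = 30 under this scheme, and two sequences with no bases in common score 0.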

Even though the sequencing accuracy for each individual nucleotide is very high, the very large number of nucleotides in the genome means that if an individual genome is only sequenced once, there will be a significant number of sequencing errors. Furthermore, many positions in a genome contain rare single-nucleotide polymorphisms (SNPs). Hence to distinguish between sequencing errors and true SNPs, it is necessary to increase the sequencing accuracy even further by sequencing individual genomes a large number of times.

Ultra-deep sequencing

The term "ultra-deep" can sometimes also refer to higher coverage (>100-fold), which allows for detection of sequence variants in mixed populations. [4] [5] [6] In the extreme, error-corrected sequencing approaches such as Maximum-Depth Sequencing can make it so that coverage of a given region approaches the throughput of a sequencing machine, allowing coverages of >10^8. [7]

Transcriptome sequencing

Deep sequencing of transcriptomes, also known as RNA-Seq, provides both the sequence and frequency of RNA molecules that are present at any particular time in a specific cell type, tissue or organ. [8] Counting the number of mRNAs that are encoded by individual genes provides an indicator of protein-coding potential, a major contributor to phenotype. [9] Improving methods for RNA sequencing is an active area of research both in terms of experimental and computational methods. [10]

Sometimes a distinction is made between sequence coverage and physical coverage. Where sequence coverage is the average number of times a base is read, physical coverage is the average number of times a base is read or spanned by mate paired reads. [2] [11] [12]
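The difference between the two definitions can be made concrete with a short Python sketch. The fragment coordinates are invented, and the function assumes every fragment is at least twice the read length, so that the two reads of a pair do not overlap.

```python
def coverage(fragments, read_len, genome_len):
    """Return (sequence_coverage, physical_coverage) for paired reads.

    fragments: list of (start, end) coordinates, 0-based, end exclusive.
    Each fragment is read read_len bases inward from both of its ends.
    """
    sequenced = 2 * read_len * len(fragments)  # bases actually read
    spanned = sum(end - start for start, end in fragments)  # read or bridged by a pair
    return sequenced / genome_len, spanned / genome_len


# Three 500-base fragments on a 1,000-base toy genome, read 100 bases per end.
seq_cov, phys_cov = coverage([(0, 500), (250, 750), (500, 1000)], 100, 1000)
print(seq_cov, phys_cov)
```

Here the physical coverage (1.5) is larger than the sequence coverage (0.6), because the unread middle of each fragment still counts as spanned.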

Uses of Genome Sequences

DNA microarrays are methods used to detect gene expression by analyzing an array of DNA fragments that are fixed to a glass slide or a silicon chip to identify active genes and identify sequences. Almost one million genotypic abnormalities can be discovered using microarrays, whereas whole-genome sequencing can provide information about all six billion base pairs in the human genome. Although the study of medical applications of genome sequencing is interesting, this discipline tends to dwell on abnormal gene function. Knowledge of the entire genome will allow future onset diseases and other genetic disorders to be discovered early, which will allow for more informed decisions to be made about lifestyle, medication, and having children. Genomics is still in its infancy, although someday it may become routine to use whole-genome sequencing to screen every newborn to detect genetic abnormalities.

In addition to disease and medicine, genomics can contribute to the development of novel enzymes that convert biomass to biofuel, which results in higher crop and fuel production, and lower cost to the consumer. This knowledge should allow better methods of control over the microbes that are used in the production of biofuels. Genomics could also improve the methods used to monitor the impact of pollutants on ecosystems and help clean up environmental contaminants. Genomics has allowed for the development of agrochemicals and pharmaceuticals that could benefit medical science and agriculture.

It sounds great to have all the knowledge we can get from whole-genome sequencing. However, humans have a responsibility to use this knowledge wisely. Otherwise, it could be easy to misuse the power of such knowledge, leading to discrimination based on a person's genetics, human genetic engineering, and other ethical concerns. This information could also lead to legal issues regarding health and privacy.

What is Whole Genome Sequencing? | What is WGS?

Whole genome sequencing, also known as WGS, is a laboratory technique in which the sequence of the entire genome, both coding (exon) and non-coding regions, is determined. It provides a complete, comprehensive map of a person’s genetic makeup and allows extensive analysis of all genes to be performed. It’s a single lab test that obtains all of the data on all of your genes.

What is Whole Genome Sequencing?

Whole genome sequencing is a genetic testing technology that obtains comprehensive data on every gene and all of your chromosomes in your DNA. While other DNA tests obtain data on either one gene (PCR) or spots within your DNA (genotyping microarrays), whole genome sequencing is different.

WGS is the acronym for whole genome sequencing. It’s a commonly used abbreviation when discussing whole genome sequencing. They are used interchangeably as they mean the same thing.

WGS obtains data on every chromosomal coordinate from the beginning of chromosome 1 to the end of chromosome 1. Sequencing then obtains data on all of chromosome 2. Then chromosome 3 and so on. It also includes full sequencing of the sex chromosomes (chromosome X and, in males, chromosome Y) as well as the mitochondrial chromosome.

WGS is different from the types of DNA tests used by laboratories such as MyHeritage, 23andMe, and FamilyTreeDNA. Those companies use a type of genetic test known as genotyping using DNA microarrays. Genotyping obtains data on ‘spots’ throughout the genome. For example, 23andMe’s DNA test obtains data on around 700,000 data points throughout a person’s genome, while WGS obtains data on around 3 billion data points (100% of the genome).

This means that the types of tests performed by many of the direct-to-consumer genetic testing companies actually test less than 0.1% of the genome!

You may have seen WGS appear with a number and an ‘x’ before it, such as 0.4x WGS or 30x WGS. Those numbers refer to the ‘depth’ of sequencing. 30x WGS means that, on average, each position in the genome has been read 30 times. Another way to state this is that the genome was sequenced to a depth of 30.

Why would someone need to sequence their genome more than once? The ‘30x’ means that the laboratory generated enough sequence data to cover the genome about 30 times over. The data from all of these overlapping reads are combined and analyzed by a powerful computer.

While sequencing a genome one time may include errors in the data, having access to multiple copies of the same sequence enables the computer to quickly exclude any errors. When there are 30 copies of the sequence, which is 30x WGS, the computer software is able to generate a single highly accurate genome sequence.

As an example, imagine if you were to copy the dictionary from the first page to the last page. No matter how careful a person is, everyone would end up making some errors. Now imagine if you copied the dictionary another 29 times. You’d definitely have a very sore hand, and you’d have 30 copies that each have some errors.

In copy 1, an error may be made on page 488 but in the other 29 copies, page 488 was copied perfectly. If a computer had access to all 30 copies that you wrote, it would be able to easily identify and exclude the error on page 488 since it has 29 other copies to compare it to.

Now imagine copying six billion letters. This is what a sequencing machine does because each person’s genome contains six billion letters! While an advanced sequencing technology may be 99.9999% accurate, this means that an error will occur 0.0001% of the time.

An error rate of 0.0001% while sequencing six billion letters in a genome means that there will be 6,000 errors every time a genome is sequenced! But if a genome is sequenced 30 separate times, a computer program is able to easily identify those errors and create a single, highly accurate sequence.
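Both the arithmetic and the error-exclusion idea can be checked with a short Python sketch. It follows the dictionary analogy and treats each pass as a full-length copy, which is a simplification: real reads are short, overlapping fragments. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def expected_errors(genome_size, error_rate):
    """Expected number of miscopied letters in one pass over the genome."""
    return round(genome_size * error_rate)


def consensus(copies):
    """Majority vote at each position across equally long noisy copies."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


# 0.0001% expressed as a fraction is 1e-6.
print(expected_errors(6_000_000_000, 1e-6))

# Five toy copies; two of them each contain one different copying error.
copies = ["ACGTACGT", "ACGTACGT", "ACCTACGT", "ACGTACGA", "ACGTACGT"]
print(consensus(copies))
```

Because the errors fall at different positions in different copies, the correct letter wins the vote at every position.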

This is why sequencing a genome 30 times is so important. It means the sequence that you’ll receive is highly accurate!

If 30x means sequencing the same genome 30 times, what does 0.4x mean? We answer this question in our other blog post: What is 0.4x and 30x Whole Genome Sequencing?

How is Whole Genome Sequencing Done?

During WGS, a DNA sample is collected from a person. The sample contains chromosomal DNA as well as DNA contained in the mitochondria. The DNA is then broken into smaller fragments, and, using new technologies that allow for quick sequencing, the order of the nucleotides (A, T, C, and G bases) in each fragment is determined.

Once the DNA is sequenced, it is then put back together in the correct order using bioinformatics approaches. This step is referred to as “assembly.” When the genome assembly is completed, researchers, physicians, and citizen scientists can conduct further analysis such as determining whether genetic variations are associated with health conditions.

There are several methods of DNA sequencing available. Whole genome sequencing and whole exome sequencing are the two methods most used in healthcare and research to identify genetic variations. However, whole genome sequencing should not be confused with whole exome sequencing because whole exome sequencing only analyzes less than 1% of the genome.

Both approaches can use third-generation or second-generation sequencing technologies. SMRT sequencing, Illumina dye sequencing, and pyrosequencing are examples of these technologies. Other technologies developed recently use novel approaches. For example, in nanopore sequencing, a strand of DNA is passed through a protein nanopore offering real-time and direct DNA sequencing.

How Big is One Person’s Whole Genome?

It’s huge! A single genome contains around six billion letters.

When your genome is sequenced, it is stored as data files such as fastq, bam and vcf. The file containing a single person’s whole genome can be more than 300 GB in size! That means one person’s genome is larger than the entire hard drive of many computers!

This very large size usually means it is difficult to store your genome on your computer, and most people instead need to store their genome in the cloud.

The Importance of Whole Genome Sequencing

The very first human genome sequence was completed in 2003 as part of the international Human Genome Project. It cost $2.7 billion and took 13 years.

Today, it can be accomplished in as little as a few weeks and is increasingly priced at less than $1,000. As a result of this incredible price reduction, whole genome sequencing is becoming an affordable option for many people.

This is important because whole genome sequencing is our most powerful tool for testing for genetic disorders such as mutations that drive cancer development, heart disease, medication reactions and even tracking infectious disease outbreaks.

As a result of this, the Food and Drug Administration (FDA) is laying the foundation for the use of whole-genome sequencing in public health by identifying the genomic sequence of pathogens associated with foodborne illness outbreaks.

Whole genome sequencing is also becoming particularly useful in creating personalized treatment plans for patients with cancer and some genetic conditions. It is also starting to become the norm for citizen scientists and other consumers looking to learn a little bit more about their health.

Diagnostic Genomics and Clinical Bioinformatics

A. Haworth, N. Lench, in Medical and Health Genomics, 2016

Whole Genome Sequencing

Whole genome sequencing (WGS) offers the ability to interrogate the entire DNA sequence of the genome without the need to use selective capture techniques to isolate specific regions of DNA. Conceptually, this approach is very appealing and will enable the identification of additional classes of mutation that are refractory to detection by exome sequencing. These include the identification of large structural rearrangements, balanced translocations, uniparental isodisomy, and mosaicism. WGS also offers the opportunity to interrogate noncoding regions of DNA and identify functionally important sequence variants that influence gene expression. Removing the need to capture sequences removes selection bias so that coverage across sequences is more uniform. The main obstacles to the uptake of WGS include cost and dealing with the enormous amount of data produced. Moving, analyzing, interpreting, and storing large amounts of genetic data have significant resource and cost implications, many of which are currently beyond the majority of routine diagnostic laboratories. As the cost of sequencing continues to decrease and experience is gained in data analysis and interpretation, we can anticipate that WGS will be the method of choice for the clinical diagnosis of rare genetic disorders.

Whole Genome Sequencing Reveals Genetic Structural Secrets of Schizophrenia

A study co-led by Jin Szatkiewicz, PhD, Patrick Sullivan, MD, and colleagues at UNC-Chapel Hill and four Swedish universities found that extremely rare variants affect the boundaries of a 3D genome structure called topologically associated domains in people with schizophrenia much more than they affect people without the condition.

CHAPEL HILL, NC – April 16, 2020 – Most research about the genetics of schizophrenia has sought to understand the role that genes play in the development and heritability of schizophrenia. Many discoveries have been made, but there have been many missing pieces. Now, UNC School of Medicine scientists have conducted the largest-ever whole genome sequencing study of schizophrenia to provide a more complete picture of the role the human genome plays in this disease.

Published in Nature Communications, the study co-led by senior author Jin Szatkiewicz, PhD, associate professor in the UNC Department of Genetics, suggests that rare structural genetic variants could play a role in schizophrenia. Patrick Sullivan, MD, the Yeargen Distinguished Professor of Psychiatry and Genetics at the UNC School of Medicine and Director of the Center for Psychiatric Genomics, is co-senior author.

“Our results suggest that ultra-rare structural variants that affect the boundaries of a specific genome structure increase risk for schizophrenia,” Szatkiewicz said. “Alterations in these boundaries may lead to dysregulation of gene expression, and we think future mechanistic studies could determine the precise functional effects these variants have on biology.”

Previous studies on the genetics of schizophrenia have primarily involved common genetic variations known as SNPs (alterations in common genetic sequences, each affecting a single nucleotide), rare variations in the parts of DNA that provide instructions for making proteins, or very large structural variations (alterations affecting a few hundred thousand nucleotides). These studies give snapshots of the genome, leaving a large portion of the genome a mystery as it potentially relates to schizophrenia.

In the Nature Communications study, Szatkiewicz and colleagues examined the entire genome, using a method called whole genome sequencing (WGS). The primary reason WGS hasn’t been more widely used is that it is very expensive. For this study, an international collaboration pooled funding from National Institute of Mental Health grants and matching funds from Sweden’s SciLife Labs to conduct deep whole genome sequencing on 1,165 people with schizophrenia and 1,000 controls – the largest known WGS study of schizophrenia ever.

As a result, new discoveries were made. Previously undetectable mutations in DNA were found that scientists had never seen before in schizophrenia.

In particular, this study highlighted the role that a three-dimensional genome structure known as topologically associated domains (TADs) could play in the development of schizophrenia. TADs are distinct regions of the genome with strict boundaries between them that keep the domains from interacting with genetic material in neighboring TADs. Shifting or breaking these boundaries allows interactions between genes and regulatory elements that normally would not interact.

When these interactions occur, gene expression may be changed in undesirable ways that could result in congenital defects, formation of cancers, and developmental disorders. This study found that extremely rare structural variants affecting TAD boundaries in the brain occur significantly more often in people with schizophrenia than in those without it. Structural variants are large mutations that may involve missing or duplicated genetic sequences, or sequences that are not in the typical genome. This finding suggests that misplaced or missing TAD boundaries may also contribute to the development of schizophrenia. This study was the first to discover the connection between anomalies in TADs and the development of schizophrenia.
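The basic comparison described here, how often structural variants hit a TAD boundary in cases versus controls, can be sketched in Python as an interval check. All coordinates below are invented, the function names are hypothetical, and the published analysis relies on formal statistical testing rather than a raw rate comparison.

```python
def hits_boundary(variant, boundaries):
    """True if the variant interval (start, end) contains any boundary position."""
    start, end = variant
    return any(start <= b <= end for b in boundaries)


def boundary_hit_rate(variants, boundaries):
    """Fraction of structural variants that overlap at least one TAD boundary."""
    if not variants:
        return 0.0
    return sum(hits_boundary(v, boundaries) for v in variants) / len(variants)


# Hypothetical TAD boundary positions and structural variants on one chromosome.
boundaries = [1000, 5000, 9000]
case_variants = [(900, 1100), (2000, 3000), (4990, 5010)]
control_variants = [(2000, 3000), (6000, 7000), (8000, 8500)]

print(boundary_hit_rate(case_variants, boundaries))
print(boundary_hit_rate(control_variants, boundaries))
```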

This work has highlighted TADs-affecting structural variants as prime candidates for future mechanistic studies of the biology of schizophrenia.

“A possible future investigation would be to work with patient-derived cells with these TADs-affecting mutations and figure out what exactly happened at the molecular level,” said Szatkiewicz, an adjunct assistant professor of psychiatry at UNC. “In the future, we could use this information about the TAD effects to help develop drugs or precision medicine treatments that could repair disrupted TADs or affected gene expressions which may improve patient outcomes.”

This study will be combined with other WGS studies in order to increase the sample size to further confirm these results. This research will also help the scientific community build on the unfolding genetic mysteries of schizophrenia.

Co-first authors from UNC are Matthew Halvorsen, PhD, Ruth Huh, PhD, and Jia Wen, PhD, of the UNC Department of Genetics. Other UNC authors are Paola Giusto-Rodriguez, PhD, NaEshia Ancalade, Martilias Farrell, PhD, James Crowley, PhD, and Yun Li, PhD.

This research was a collaboration between researchers at UNC-Chapel Hill, Lund University, Chalmers University of Technology, the Karolinska Institutet, and Uppsala University.


In the present study, we provide results from the first whole-genome sequencing analysis at high coverage (90X) performed in 81 105+/110+ individuals (mean age: 106.6 ± 1.6 years) and 36 CTRL (mean age 68.0 ± 5.9 years) representative of one specific population (i.e. the Italian peninsula).

This study design attempts, for the first time, to deal with the main weaknesses encountered in the study of genetics of longevity that Sebastiani and colleagues recently highlighted:

A ‘relaxed definition’ of longevity as survival to age 85 and older, in order to increase the sample size through a meta-analysis. This inevitably increases the heterogeneity of the phenotype and to avoid this risk, we considered only individuals that reached the last decades of lifespan and individuals older than 100 years for the replication. The apparently low number of 105+/110+ is due to the fact that the recruitment of these most unique persons is complicated because of their very low number in the general population (considering individuals born in Italy in 1903, the number of people alive at age 105 was 78, given 100,000 alive at birth according to the Italian national registry ISTAT) and their delicate health conditions

The issue of population heterogeneity in terms of genetic ancestry and ethnicities. This study specifically focused on one population (the Italian one) to reduce the bias due to tangled population-specific dynamics (Giuliani et al., 2017; Yashin et al., 2014), taking into account the fact that population-specific evolutionary dynamics (such as demography or selection) can lead to high frequencies of certain variants linked to healthy aging or modern pathologies (Sazzini et al., 2016). We selected 105+/110+ individuals perfectly matched with controls for geographic origin (from North to South Italy) according to an ecological approach recently described (Franceschi et al., 2020; Giuliani et al., 2018a).

The choice of controls is challenging for studies on human longevity. Here we considered a group of healthy unrelated individuals selected from the general population as the control group. We are aware that, since they are still alive, some of them may eventually become 105+/110+, but we believe that in any case this number will be very small given the low prevalence of 105+/110+ in the general population.
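The argument that very few controls will ever become 105+/110+ can be checked with a back-of-the-envelope calculation. The sketch below uses the ISTAT figure quoted above (78 survivors to age 105 per 100,000 births) and the 36 controls. It applies the from-birth rate to every control, which is a simplifying assumption: the probability for someone already alive at about 68 would be somewhat higher, so the result is only a rough order of magnitude.

```python
# Figures quoted in the text
SURVIVORS_TO_105_PER_100K = 78
N_CONTROLS = 36

# Simplifying assumption: every control has the unconditional (from-birth)
# probability of reaching age 105.
p_reach_105 = SURVIVORS_TO_105_PER_100K / 100_000

# Expected number of controls who would eventually reach 105
expected_future_cases = N_CONTROLS * p_reach_105

# Probability that at least one control reaches 105 (independence assumed)
p_at_least_one = 1 - (1 - p_reach_105) ** N_CONTROLS

print(f"Probability of reaching 105: {p_reach_105:.5f}")
print(f"Expected controls reaching 105: {expected_future_cases:.4f}")
print(f"P(at least one control reaches 105): {p_at_least_one:.4f}")
```

Under these assumptions the expected number is far below one individual, which is consistent with the authors' expectation that misclassified controls are a minor concern.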

We identified five common variants in LD (rs7456688, rs10257700, rs10279856, rs69685881, and rs7805969) that were significant at an adjusted p-value threshold of 10%, all in the same region located between the COA1 and STK17A genes. The gene-based analysis of WGS data identified STK17A as the most significant gene, a result that was validated in Cohort 2.

The U-shaped trend in the frequency of the rs7456688-A allele showed that these variants are peculiar to 105+/110+ individuals, and this is the first study that includes a number of 105+/110+ high enough to detect this signal.

All these variants were replicated in Cohort 2 (unadjusted p-values < 0.05), which is made up of 333 Italian centenarians (>100 years) geographically matched to 358 controls (mean age: 60.7 ± 7.2).

One of these five variants, rs10279856, may play a regulatory role in the region, as supported by the results obtained from risk variant inference (Riviera) and the GTEx database. The SNP rs10279856 seems to play a pleiotropic role, as it is an eQTL for the STK17A gene and for two other genes (COA1 and BLVRA). The haplotype-based analysis confirmed that COA1 presented the most significant signal and identified a haplotype strongly associated with extreme longevity (chr7: 43720429…) (p-value = 1.84 × 10⁻⁸). Moreover, the comparison with existing data (Giuliani et al., 2018b) also identified one SNP (rs623108) with potential impact on STK17A expression, indicating that different signals from different SNPs in moderate LD seem to converge on regulating the expression of the COA1, STK17A, and BLVRA genes. Further functional studies are needed to elucidate the role of these genes.

Considering the four SNPs identified by the Riviera analysis (rs10279856, rs3779059, rs849166, and rs849175), we observed that the most frequent alleles in 105+/110+ (rs10279856-G reference allele and rs3779059-A, rs849166-A, rs849175-A alternative alleles) are associated with increased STK17A gene expression in heart (atrial and left ventricle), lung, nerve, and thyroid (data from the GTEx portal). STK17A is involved in the DNA damage response, in positive regulation of the apoptotic process (Sanjo et al., 1998), and in regulation of the reactive oxygen species (ROS) metabolic process. Moreover, it has been suggested that STK17A can be activated in response to external stimuli such as UV radiation and drugs (Sanjo et al., 1998). The rs7805969-A allele (located in the STK17A/COA1 region) was found to be associated with systemic lupus erythematosus (SLE) in a population from Southern Brazil (da Silva Fonseca et al., 2013), and a reduced expression of STK17A has been observed during the active phase of SLE (Sandrin-Garcia et al., 2009). These data suggest a possible role of this gene in the DNA damage response, as the variants associated with an increase in STK17A expression (in-silico prediction) were found to be more frequent in 105+/110+ than in controls, supporting the data by Gorbunova and colleagues on a central role of DNA repair mechanisms in aging and longevity (Gorbunova et al., 2007). They proposed the following sequence of events during aging: (1) mutation impairs the function of genes involved in stress response and DNA repair; (2) DNA repair becomes more error-prone, leading to accumulation of DNA damage; (3) this process accelerates age-related decline. In this model, genetic variants in STK17A may maintain DNA damage responses in 105+/110+, favoring healthy aging. In contrast, autoimmune diseases (such as SLE) are characterized by the accumulation of DNA double-strand breaks, possibly due to impaired repair (Souliotis et al., 2016), which is in line with data describing a reduced expression of STK17A. These data on human extreme longevity support a recent study on lifespan in mammals, which analyzed evolutionary constraints at the protein level and found DNA repair to be one of the mechanisms allowing an extended lifespan across species (Kowalczyk et al., 2020).

Moreover, the most frequent genotypes in 105+/110+ (rs10279856-G reference allele and rs3779059-A, rs849166-A, rs849175-A alternative alleles) are associated not only with STK17A expression but also with a reduced expression of the COA1 gene in adipose, artery, esophagus (mucosa), nerve (tibial), and skin. The COA1 gene product is a component of the MITRAC complex (mitochondrial translation regulation assembly intermediate of cytochrome c oxidase complex) that regulates cytochrome c oxidase assembly. MITRAC complexes regulate both the translation of mitochondrial-encoded components and the assembly of nuclear-encoded components imported into the mitochondrion, in particular for respiratory chain complexes I and IV. Our results constitute the first evidence of an association with longevity of nuclear loci mapping in a gene deeply involved in mitochondrial dynamics, supporting the hypothesis that nuclear/mitochondrial co-evolution may have a crucial role in human longevity and health (Garagnani et al., 2014). The same SNPs are associated with an increase in BLVRA expression in whole blood and a decrease in the expression of the same gene in artery (tibial) and esophagus (mucosa). The protein encoded by the BLVRA gene belongs to the biliverdin reductase family, members of which catalyze the conversion of biliverdin to bilirubin. Recently it has been established that a redox cycle based on BVRA activity provides physiologic cytoprotection, as BVRA depletion exacerbates the formation of reactive oxygen species (ROS) and increases cell death. Interestingly, BLVRA contributes significantly to modulation of the aging process by adjusting the cellular oxidative status (Kim et al., 2011). Moreover, biliverdin reductase A was previously shown to regulate the inflammatory response to endotoxin by inhibiting Toll-like receptor 4 (TLR4) gene expression (Wegiel et al., 2011).

Considering the complexity of the trait under study, it has recently been proposed that even suggestive and marginally significant p-values can be highly informative in the case of longevity (Erikson et al., 2016; Zeng et al., 2016), an argument supported by Yashin and colleagues, who showed that longevity also depends on several small-effect alleles (Yashin et al., 2010). In this context, pathway analysis is crucial, as the integration of many SNPs with modest p-values may identify biological functions and crucial pathways involved in longevity (Johnson et al., 2015). This analysis identified several pathways enriched in our cohort: axon guidance, calcium signaling, glycine, serine and threonine metabolism, long-term potentiation, melanogenesis, PPAR signaling, and taste transduction (see Supplementary Material 1 for more details).

In this study, APOE-e4, the allele identified in a large number of studies on human longevity, showed only a general trend, and no significant association with longevity was found in Cohort 1. This is in line with recent data published by the GEHA Consortium (European project on the Genetics of Healthy Ageing), where APOE-e4 did not show association with longevity in the Italian population. Factors explaining this discrepancy are the variability of this haplotype across Europe, the cline that led to its low frequency in Italy (APOE-e4 is around 8% in South Italy), the peculiar gene-environment interactions experienced by certain birth cohorts, and the gender effect (Giuliani et al., 2018a). The analysis of private mutations of 105+/110+ showed that some damaging and pathogenic variants are compatible with extreme longevity and healthy ageing (Supplementary file 9).

Rare variants analysis showed significant associations for the NME1 gene when all rare variants were considered, and for the PLEKHG4 (puratrophin-1) gene when only damaging rare variants were considered. NME1 is the first metastasis suppressor gene discovered (Steeg et al., 1988); its expression inhibits cell motility and metastasis in different human cancers. It regulates signalling pathways stimulated by various growth factors, including TGF-beta, platelet-derived growth factor, IGF1, lysophosphatidic acid, and serum, which suppresses metastasis (Russell et al., 1998). Recently, it has been demonstrated that NME1 is rapidly recruited to double-strand breaks, promoting DNA repair (Kaetzel et al., 2015). PLEKHG4 is associated with spinocerebellar ataxia (SCA), a neurodegenerative disease affecting cerebellar Purkinje cells. Atrophic Purkinje cells from these SCA patients have cytoplasmic aggregates containing puratrophin-1 and the actin-binding protein spectrin (Ishikawa et al., 2005). These regions seem to be preserved in 105+/110+ individuals, who largely postpone age-related diseases and cancers, among other common diseases (Ishikawa et al., 2005).

The analysis of somatic mutations suggests that 105+/110+ individuals seem to be protected from the accumulation of such mutations: we did not observe the increase that would be expected considering their age. 105+/110+ individuals are characterized by a lower prevalence of somatic mutations in six out of the seven genes considered, a difference that is statistically significant for the DNMT3A and ASXL1 genes. Focusing on somatic mutations with a potential impact on protein function, the prevalence was not different from the control group.

This supports recent longitudinal data that showed that somatic mutations in DNMT3A and TET2 genes previously linked to hematopoietic malignancies are common in the oldest old (Genome of The Netherlands Consortium et al., 2016).

These results show that 105+/110+ individuals seem spared from the age-related exponential increase of disruptive mutations, and this might have contributed to protecting them from cardiovascular disease (CVD) (Genovese et al., 2014; Jaiswal et al., 2014; Jaiswal and Ebert, 2019).

However, it should be noted that a depth of coverage of 90x is not the gold standard for calling somatic mutations, which requires a coverage of around 4000x, as performed in recent studies (Buscarlet et al., 2017). A lower sequencing depth is less sensitive for detecting low allele fraction variants. Other studies of somatic mutations have been performed using exome sequencing data or whole genome sequencing with only a 30x mean coverage (Zink et al., 2017; Jaiswal et al., 2014; Genovese et al., 2014, among others). The methodological variability (in terms of coverage and the part of the genome analysed) makes the comparison among existing studies difficult and not always possible.
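The effect of depth on sensitivity can be illustrated with a simple binomial model. In the sketch below, each read covering a site independently carries the variant with probability equal to the variant allele fraction (VAF), and a variant counts as detectable when at least `min_alt_reads` reads support it. The 2% VAF and the threshold of 5 supporting reads are hypothetical values chosen for illustration; the model ignores sequencing error and is not the calling procedure used in the study.

```python
from math import comb


def detection_probability(depth: int, vaf: float, min_alt_reads: int) -> float:
    """P(at least min_alt_reads variant-supporting reads) under Binomial(depth, vaf)."""
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1 - p_fewer


# Hypothetical illustration: a somatic variant present in 2% of reads,
# requiring at least 5 supporting reads to call it.
VAF = 0.02
MIN_ALT_READS = 5

for depth in (30, 90, 4000):
    prob = detection_probability(depth, VAF, MIN_ALT_READS)
    print(f"depth {depth:>4}x: P(detect) = {prob:.4f}")
```

Under these assumptions the detection probability is small at 30x and 90x and approaches 1 at 4000x, which is why studies run at different depths are hard to compare directly.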

In contrast, the existing polygenic risk score (PRS) for CVD showed that 105+/110+ are not protected from the risk of CVD, as the data showed no significant results when 105+/110+ were compared to controls. This can be due to three non-mutually exclusive reasons: (1) the PRS does not include population-specific dynamics and may not be specifically informative for the Italian population; (2) 105+/110+ have the same CVD risk variants as the general population; (3) the PRS may include variants whose effect can be neutralized by peculiar environmental factors or epistatic interactions. This result agrees with studies showing that centenarians and long-lived individuals are characterized by disease-associated variant frequencies similar to those of the general population (Bonafè et al., 1999, p. 53; Beekman et al., 2010; Sebastiani and Perls, 2012; Freudenberg-Hua et al., 2014; Erikson et al., 2016). Genetic data from 105+/110+ will be of extreme value in future studies to weigh the role of certain ‘risk’ variants and could be used to identify new informative PRS.
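For readers unfamiliar with the term, a polygenic risk score is usually computed as a weighted sum of risk-allele dosages (0, 1, or 2 copies per variant), with weights taken from a prior association study. The sketch below shows that general form only. The SNP identifiers, weights, and genotypes are hypothetical placeholders, and treating a missing genotype as zero dosage is a simplifying assumption; none of this reproduces the specific CVD score used in the study.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of risk-allele dosages: PRS = sum_i (weight_i * dosage_i).

    Variants missing from `dosages` are treated as dosage 0
    (a simplifying assumption for this sketch).
    """
    return sum(weight * dosages.get(snp, 0) for snp, weight in weights.items())


# Hypothetical effect sizes (e.g. log odds ratios) for three made-up variants
weights = {"snp_a": 0.10, "snp_b": -0.05, "snp_c": 0.20}

# Hypothetical genotypes: number of risk alleles carried at each variant
individual = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_risk_score(individual, weights)
print(f"PRS = {score:.2f}")
```

Because the weights are estimated in a reference population, the same score can be less informative in a population with different allele frequencies or environments, which is the first of the three reasons given above.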

Thus, the data reported here suggest that 105+/110+ escape CVD not because of genetic protection against cardiovascular risk but because they are protected from the burden of somatic mutations (mainly disruptive) observed during ageing.

We acknowledge the following main limitations of this study:

The relaxed cut-off used in the discovery phase, which, however, is motivated by the crucial role of small-effect genetic variants in longevity (Yashin et al., 2010) and by the difficulties in recruiting 105+/110+ because of the rarity of the phenotype (i.e. extreme longevity).

The unbalanced case/control ratio, where the case group is more than twice as large as the control group, whose sample size is low (N = 36). However, the control group analysed here is, to date, the only cohort representative of the whole Italian peninsula, including population clusters at the opposite ends of the cline of Italian variation (Sazzini et al., 2020). We decided not to include the TSI of the 1000 Genomes Project in the control group, first because their age is not known, second because they are not representative of the whole Italian peninsula (as Tuscany is located in the Centre of Italy), and third to maintain the matching with the 81 semi-supercentenarians, who come from Northern, Central, and Southern Italy.

The possibility that the signals identified here are peculiar to the Italian population. Gene-environment interactions are population-specific, also because of the variability in environmental and cultural settings (dietary habits and lifestyle, among others), and thus we cannot exclude that these results will not be generalizable. Only more data on semi-supercentenarians from other countries will clarify this point.

We selected a population of 105+/110+ perfectly matched with controls for geographical origin (from Northern to Southern Italy) to reduce bias due to genetic population variability; however, a potential limitation is inextricably intertwined with this experimental design. Gene-environment interactions are population-specific, also because of the variability in environmental and cultural settings (dietary habits and lifestyle, among others), and thus it is likely that interactions with genetics may be different and not generalizable. Population-driven studies in which environmental and cultural data are included are desirable in this sense.

The major strengths of the present study are the following: (1) the design of the study, based on the careful selection of individuals more than 105 years old in order to focus on a peculiar phenotype, that is, extreme longevity; (2) the selection of 105+/110+ and controls in a homogeneous population, all matched for geographical origin; (3) the use of a second validation cohort of centenarians from the same population; (4) the high coverage of the sequencing, which allowed somatic mutation analysis.

In conclusion, this study constitutes the first whole genome sequencing of extreme longevity at high coverage, which also allows somatic mutation analysis, in which 105+/110+ are compared with a group of geographically matched healthy individuals. The results showed that 105+/110+ are characterized by a peculiar genetic background associated with efficient DNA repair mechanisms, as evidenced by both germline data and somatic mutation patterns (low/similar mutation load compared to younger healthy controls from the general population). The model of 105+/110+ supports the recent literature suggesting that a genetic signature in DNA repair mechanisms and clonal haematopoiesis are crucial players in cellular homeostasis and in cardiovascular events, and that they may be the two central mechanisms that have protected 105+/110+ from age-related diseases, including CVDs.

Whole Genome Sequencing

During whole genome sequencing, researchers collect a DNA sample and then determine the identity of the 3 billion nucleotides that compose the human genome. The very first human genome was completed in 2003 as part of the Human Genome Project, which was formally started in 1990, and cost $2.7 billion. Today, sequencing technology is much more efficient, and a human genome can be sequenced in a matter of days for under $10,000. Most genetic testing still focuses on one or a few genes rather than the entire genome. However, with the falling cost of genome sequencing, more individuals are pursuing this option. Physicians can look at an entire genome to see how specific treatments for a disease will be affected by an individual’s unique genetics. For example, the physician may opt to look at genes involved in drug metabolism when deciding dosage. In the future, whole genome sequencing may enable everyone to develop a personalized treatment plan.

Advantages of Whole Genome Sequencing

* Creating personalized plans to treat disease may be possible based not only on the mutant genes causing a disease, but also other genes in the patient’s genome.

* Genotyping cancer cells and understanding what genes are misregulated allows physicians to select the best chemotherapy and potentially expose the patient to less toxic treatment since the therapy is tailored.

* Previously unknown genes may be identified as contributing to a disease state. Traditional genetic testing looks only at the common “troublemaker” genes.

* Lifestyle or environmental changes that can mediate the effects of genetic predisposition may be identified and then moderated.

Disadvantages of Whole Genome Sequencing

* The role of most of the genes in the human genome is still unknown or incompletely understood. Therefore, a lot of the “information” found in a human genome sequence is unusable at present.

* Most physicians are not trained in how to interpret genomic data.

* An individual’s genome may contain information that they DON’T want to know. For example, a patient has genome sequencing performed to determine the most effective treatment plan for high cholesterol. In the process, researchers discover an unrelated allele that assures a terminal disease with no effective treatment.

* The volume of information contained in a genome sequence is vast. Policies and security measures to maintain the privacy and safety of this information are still new.

Ng P and Kirkness E. Whole Genome Sequencing. Methods in Molecular Biology. 2010. 628: 215-226

