What are non-heritable changes to genomes?

I am told that mutations are heritable changes to the genome.

So this raises the question: what are non-heritable changes to the genome?

I don't know what you really mean by "heritable changes to the genome". I think you will understand why this sentence makes no sense after reading what follows. I start with some background and then try to address directly what confuses you.

Short introduction to the concept of heritability

The concept of heritability may have two meanings.

Heritability is a concept defined at the population level for one given trait. The broad-sense heritability ($h_B^2$) is the ratio of the genetic variance $V_g$ to the phenotypic variance $V_p$, where the phenotypic variance can itself be decomposed into environmental variance $V_e$ and genetic variance $V_g$ (plus their covariance, which we will neglect for the purpose of this question).

$$h_B^2 = \frac{V_{g}}{V_{p}} = \frac{V_{g}}{V_{e} + V_{g}}$$

When saying environmental variance $V_e$, we don't refer to the total variance in the environment (such as the variance in temperature for example) but we refer to the phenotypic variance (in a given population of a given trait of interest) that is caused by environmental variance. The same logic is true for the genetic variance.


While the concept of heritability applies to phenotypic traits, the concept of inheritance can apply to DNA material. A mutation is inherited if the offspring receives the sequence from its mother or from its father (assuming we are talking about a species that reproduces sexually with two sexes).

Somatic versus germline mutations

As pointed out by @Chris in the comments, in multicellular organisms not all mutations can be transmitted to the multicellular offspring. Most cells do not give rise to any other multicellular organism. For example, imagine that a mutation occurs while the biceps is developing at a very young age: it will be inherited by the daughter cells but not by the multicellular offspring. We call the line of cells that do not give rise to gametes (sperm and ova) the soma line, as opposed to the germ line, which gives rise to the gametes.

To answer your question

As soon as a given locus (position on the DNA) shows some variance, and this variance explains some phenotypic variance, the phenotypic trait it influences has a heritability greater than zero. If the variance at this locus has no effect on the phenotype, then the heritability is zero (because $V_g = 0$, although there is some variance in the actual sequence). If the phenotypic trait of interest does not show any variance (in the population considered), then the concept of heritability is undefined for this trait!
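The three cases above can be made concrete with a minimal Python sketch. The function name and the numbers are illustrative assumptions, and the covariance between genes and environment is neglected, as in the formula given earlier.

```python
from typing import Optional


def broad_sense_heritability(v_g: float, v_e: float) -> Optional[float]:
    """Return V_g / (V_g + V_e), or None when the trait shows no variance."""
    if v_g < 0 or v_e < 0:
        raise ValueError("variance components cannot be negative")
    v_p = v_g + v_e  # gene-environment covariance is neglected here
    if v_p == 0:
        return None  # no phenotypic variance: heritability is undefined
    return v_g / v_p


# Made-up variance components, one line per case discussed in the text.
print(broad_sense_heritability(2.0, 6.0))  # the locus explains some variance
print(broad_sense_heritability(0.0, 6.0))  # sequence varies, phenotype unaffected
print(broad_sense_heritability(0.0, 0.0))  # the trait does not vary at all
```

The `None` return value is just one way to flag the undefined case; raising an error would be an equally reasonable choice.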

I think you may be confusing "inherited trait" and "heritable trait". A mutation will necessarily be transmissible to the offspring (except if it happens to be in the somatic line, in which case it will only be transmitted to daughter cells but not to the multicellular offspring), but this doesn't mean that the mutation will necessarily explain some variance in a phenotypic trait. A new mutation is necessarily heritable (pay attention to the soma vs germ line), but that does not mean that a phenotypic trait will have a higher heritability thanks to this mutation.

Most mutations to the DNA are heritable, but not all are inherited.

Mutations occur in the DNA, which is then replicated and transmitted to offspring (cells or organisms). Generally all mutations are heritable because they can be inherited, but some are not inherited, either through random chance (drift) or because of deleterious fitness effects (selection). Any mutation that makes the cell less fit has a reduced chance of being inherited. Even mutations leading to cancer are inherited at some level (from mother cancer cell to daughter cancer cells), although at some point this lineage of inheritance will stop because it kills the host. This is no different from species or lineages of organisms going extinct: if I carried a novel mutation but didn't have children, then that mutation would not be inherited.

*I suppose a mutation which stops DNA from being replicated could be classed as not-heritable because the DNA is not going to be replicated, but that depends on there being no other sources of replication machinery (other cells perhaps) - a process I don't know well enough to offer any firm conclusions on.

Non-heritable changes to the genome

DNA can be altered beyond just sequence mutation, for example by methylation, but this can itself act as a heritable change to the genome (genomic imprinting). Perhaps gene expression with strong environmental effects could be considered not heritable (because if the environment is not "inherited" then the parent-offspring regression would be 0), but maybe that's a change to the transcriptome rather than the genome.
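To illustrate the parent-offspring regression mentioned above, here is a small Python sketch that computes the least-squares slope of offspring values on parental values. The data are invented for illustration: one trait tracks the parents, the other is set purely by an environment that is not "inherited".

```python
def regression_slope(parents, offspring):
    """Least-squares slope of offspring trait values on parental values."""
    if len(parents) != len(offspring) or len(parents) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    n = len(parents)
    mean_p = sum(parents) / n
    mean_o = sum(offspring) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(parents, offspring))
    var_p = sum((p - mean_p) ** 2 for p in parents)
    if var_p == 0:
        raise ValueError("parents show no variance, so the slope is undefined")
    return cov / var_p


# Hypothetical trait values for four families.
midparent = [1.0, 2.0, 3.0, 4.0]
tracks_parents = [0.5 * p + 3.0 for p in midparent]  # offspring resemble parents
environment_only = [5.0, 3.0, 3.0, 5.0]              # no resemblance to parents

print(regression_slope(midparent, tracks_parents))
print(regression_slope(midparent, environment_only))
```

With real data the slope would be estimated with noise, so a value near zero rather than exactly zero would be the signature of a trait with no heritable component.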


Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. [1] Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. [2] [3] Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain. [4]

The field also includes studies of intragenomic (within the genome) phenomena such as epistasis (effect of one gene on another), pleiotropy (one gene affecting more than one trait), heterosis (hybrid vigour), and other interactions between loci and alleles within the genome. [5]


Although the issue of progress is an ambiguous concept in evolutionary biology, it is hard to deny that some lineages have become more complex during evolution. One possible measure of complexity is the number of functionally distinct genes. There exists considerable variation between organisms in gene number. Although there are certain exceptions, it is generally true that eukaryotes have more genes than prokaryotes, and multicellular organisms have more genes than unicellulars (Miklos & Rubin, 1996). How can we understand the evolution of this apparent increase in genomic complexity over time?

Such an increase is by no means inevitable. Evolution by natural selection does not predict a general increase of complexity: adaptation to local environmental conditions can be fulfilled by loss or gain of genes, functions or morphological structures. The reduced genomes of numerous parasitic and endosymbiotic species are a case in point (e.g. Charles & Ishikawa, 1999; Fukuda et al., 1999).

According to the most widely accepted scenario, increased gene number arises as a by-product of local adaptation. For example, eukaryotes have a chromosome segregation mechanism that enables the replication of DNA to start at many points simultaneously, compared with the single origin in prokaryotes. As a by-product, these changes also enabled the increase of genome size (Cavalier Smith, 1985; Maynard Smith & Szathmáry, 1995). Similarly, the acquisition of mitochondria in early eukaryotes may have also reduced the energetic limits on genome size (Vellai & Vida, 1999).

While these sorts of forces might affect genome size, they do not necessarily impose limits on gene number. Further, neither of the above forces can explain the variation in gene number within eukaryotes (Cavalier Smith, 1985). Two hypotheses have been presented to suggest that gene number might be limited. Limits to the amount of functional DNA may be imposed by the accumulation of harmful mutations or by inappropriate gene expression.

There are a large number of related arguments suggesting that harmful mutations might impose a strict limit on the maximum number of genes. One type of argument emphasizes the impossibility of preserving the non-mutant (master) sequence with growing information content. Some theoretical models (Eigen, 1971; Maynard Smith, 1983; Higgs, 1994) show that if the genomic mutation rate exceeds a critical value, then selection cannot prevent the accumulation of harmful mutations, even under infinite population size. The consequence of this error threshold is the limit it sets on gene number. However, these models assume very specific interactions between mutations. All mutant variants are considered to have lower fitness, which is the same for all of them, irrespective of the number of mutations. It has also been shown that the conclusions of these models cannot be generalized to arbitrary interactions between mutations (Charlesworth, 1990): the error threshold phenomenon arises only under diminishing epistasis, which is rarely observed in extant organisms. Hence, there is no obvious biological justification of the error threshold concept.

Another argument concentrates on the loss of mutation-free variants under the combination of mutation pressure and genetic drift. With a given accuracy of replication and a growing number of genes, the genomic mutation rate also increases. Under finite population size, the enhanced genomic mutation rate may accelerate the accumulation of harmful mutations (Muller, 1964). This effect is even stronger if the decline in population fitness decreases the population size, which in turn facilitates the spread of harmful mutations (Lynch et al., 1993). It is important to note that the accumulation of harmful mutations is irreversible in asexual populations, while recombination may recreate mutation-free genotypes. Hence, it is possible that sexual reproduction may enable the maintenance of a higher number of genes (Hurst, 1995).
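As a rough illustration of the error threshold, the sketch below uses the textbook single-peak form of the model, which is an assumption brought in here rather than something spelled out in the text: every mutant has the same fitness, the master sequence replicates `sigma` times faster than the mutants, back mutation is ignored, and each of `length` sites is copied wrongly with probability `u`. Under those assumptions the master sequence persists only while sigma × (1 − u)^length exceeds 1. The function names are hypothetical.

```python
import math


def master_persists(sigma: float, u: float, length: int) -> bool:
    """True if selection can maintain the master sequence in the single-peak model."""
    error_free = (1.0 - u) ** length  # chance of copying every site correctly
    return sigma * error_free > 1.0


def max_length(sigma: float, u: float) -> int:
    """Largest sequence length for which the master sequence still persists."""
    if not 0.0 < u < 1.0 or sigma <= 1.0:
        raise ValueError("need 0 < u < 1 and sigma > 1")
    bound = math.log(sigma) / -math.log(1.0 - u)  # persistence requires length < bound
    return math.ceil(bound) - 1


# Illustrative parameter values only.
for per_site_rate in (1e-3, 5e-4):
    print(per_site_rate, max_length(10.0, per_site_rate))
```

Halving the per-site error rate roughly doubles the sustainable length, which is the sense in which such models tie the amount of maintainable information to the mutation rate.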
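The link between gene number and the loss of mutation-free variants can be sketched numerically. The snippet below assumes a standard approximation that is not stated in the text: at mutation-selection balance in an asexual population, the expected number of mutation-free individuals is about N × exp(−U/s), where U is the genomic deleterious mutation rate and s the selection coefficient against each mutation. The parameter values and function name are made up for illustration.

```python
import math


def least_loaded_class_size(pop_size: float, genes: int,
                            u_per_gene: float, s: float) -> float:
    """Expected number of mutation-free individuals, N * exp(-U / s)."""
    if s <= 0:
        raise ValueError("selection coefficient must be positive")
    genomic_rate = genes * u_per_gene  # U grows in proportion to gene number
    return pop_size * math.exp(-genomic_rate / s)


# Same population and per-gene mutation rate, increasing gene number.
for gene_number in (1_000, 5_000, 10_000):
    size = least_loaded_class_size(10_000, gene_number, 1e-5, 0.01)
    print(gene_number, round(size, 2))
```

Once the expected size of this class falls to a handful of individuals or fewer, drift can eliminate it easily, which is the situation in which the irreversible accumulation described above is expected to proceed quickly.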

If an error threshold places an upper limit on genomic complexity, we should expect to find an inverse relationship between gene number (or more precisely functional DNA content) and the per base pair mutation rate. Although this analysis has not been done precisely, Drake (1991) found that genome size and the per base pair mutation rate were negatively correlated in a wide range of unicellular organisms. If we suppose that for the organisms investigated there exists a positive correlation between gene number and genome size (as seems likely given that these organisms have little junk DNA), then the prediction holds. These data support the idea that gene number cannot increase indefinitely without a compensatory reduction in the per base pair mutation rate. If such compensation cannot occur indefinitely then mutations will impose a limit on gene number.

However, genetic errors alone are unlikely to explain all of the variation in gene number. The effective genomic mutation rate, which measures the total mutation rate in coding regions, is usually several orders of magnitude higher in plants and animals than in microbial eukaryotes (Drake et al., 1998). But, contrary to obvious expectations, it is the latter that have fewer genes. One possible way to reconcile theory and data is to assume that sexual processes occur much more frequently in higher organisms (Hurst, 1995). Unfortunately, much more data would be necessary to resolve this issue.

Furthermore, vertebrates have extensive DNA methylation, which is potently mutagenic, while also having more genes (Holliday & Grigg, 1993; Smith & Hurst, 1999). That vertebrates have extensive methylation fits much better with Bird’s hypothesis (Bird, 1995) that the sustainable number of genes might be intrinsically hampered by the imprecision of biochemical mechanisms governing gene expression. He considered the fact that these failures occur at a much higher rate than genetic mutations as support for his theory. Such errors are, however, not heritable and therefore cannot accumulate in populations.

Are these two sources of error equally likely to affect the evolution of gene number or is one intrinsically a much stronger force than the other? In this review we address this question by examining two issues. First, does the fact that gene expression failures cannot accumulate across generations mean that they are less important than heritable mutations? Second, when one asks about limits to genomic complexity, all previous analyses have been group selective. For example, they point out that a population with more genes can have a higher chance of extinction if the mutation rate is not adjusted accordingly. But, to understand limits on genomic complexity, we should address the issue by asking about limitations to the spread of a new gene as it enters the population. Therefore we shall ask whether mutations or gene expression errors have an immediate impact on the probability of the spread of a new gene. First, however, we shall discuss in more detail the forms of non-heritable errors.


An arthropod evolution resource

As a pilot project for the i5K initiative to sequence 5000 arthropod genomes [6], we sequenced and annotated the genomes of 28 arthropod species (Additional file 1: Table S1). These include a combination of species of agricultural or ecological importance, emerging laboratory models, and species occupying key positions in the arthropod phylogeny. We combined these newly sequenced genomes with those of 48 previously sequenced arthropods, creating a dataset comprising 76 species representing the four extant arthropod subphyla and spanning 21 taxonomic orders. Using the OrthoDB gene orthology database [7], we annotated 38,195 protein ortholog groups (orthogroups/gene families) among all 76 species (Fig. 1). Based on single-copy orthogroups within and between orders, we then built a phylogeny of all major arthropod lineages (Fig. 2). This phylogeny is mostly consistent with previous arthropod phylogenies [8,9,10], with the exception being that we recover a monophyletic Crustacea, rather than the generally accepted paraphyletic nature of Crustacea with respect to Hexapoda; the difference is likely due to our restricted taxon sampling (see “Methods”). We reconstructed the gene content and protein domain arrangements for all 38,195 orthogroups in each of the lineages for the 76 species in the arthropod phylogeny. This resource (available at and Additional file 1: Table S11) forms the basis for the analyses detailed below and is an unprecedented tool for identifying and tracking genomic changes over arthropod evolutionary history.

OrthoDB orthology delineation for the i5K pilot species. The bars show Metazoa-level orthologs for the 76 selected arthropods and three outgroup species (of 13 outgroup species used for orthology analysis) partitioned according to their presence and copy number, sorted from the largest total gene counts to the smallest. The 28 i5K species generated in this study with a total of 533,636 gene models are indicated in bold green font. A total of 38,195 orthologous protein groups were annotated among the total 76 genomes

Arthropod phylogeny inferred from 569 to 4097 single-copy protein-coding genes among the six multi-species orders, crustaceans, and non-spider chelicerates (Additional file 1: Table S13) and 150 single-copy genes for the orders represented by a single species and the deeper nodes. Divergence times estimated with non-parametric rate smoothing and fossil calibrations at 22 nodes (Additional file 1: Table S14). Species in bold are those sequenced within the framework of the i5K pilot project. All nodes, except those indicated with red shapes, have bootstrap support of 100 inferred by ASTRAL. Nodes of particular interest are labeled in orange and referred to in the text. Larger fonts indicate multi-species orders enabling CAFE 3.0 likelihood analyses (see “Methods”). Nodes leading to major taxonomic groups have been labeled with their node number and the number of genes inferred at that point. See Additional file 2: Figure S16 and Additional file 1: Table S12 for full node labels

Genomic change throughout arthropod history

Evolutionary innovation can result from diverse genomic changes. New genes can arise either by duplication or, less frequently, by de novo gene evolution [11]. Genes can also be lost over time, constituting an underappreciated mechanism of evolution [12, 13]. Protein domains are the basis of reusable modules for protein innovation, and the rearrangement of domains to form new combinations plays an important role in molecular innovation [14]. Together, gene family expansions and contractions and protein domain rearrangements may coincide with phenotypic innovations in arthropods. We therefore searched for signatures of such events corresponding with pivotal phenotypic shifts in the arthropod phylogeny.

Using ancestral reconstructions of gene counts (see “Methods”), we tracked gene family expansions and losses across the arthropod phylogeny. Overall, we inferred 181,157 gene family expansions and 87,505 gene family contractions. A total of 68,430 gene families were inferred to have gone extinct in at least one lineage, and 9115 families emerged in different groups. We find that, of the 268,662 total gene family changes, 5843 changes are statistically rapid (see “Methods”), with the German cockroach, Blattella germanica, having the most rapid gene family changes (Fig. 3e). The most dynamically changing gene families encode proteins involved in functions of xenobiotic defense (cytochrome P450s, sulfotransferases), digestion (peptidases), chitin exoskeleton structure and metabolism, multiple zinc finger transcription factor types, HSP20 domain stress response, fatty acid metabolism, chemosensation, and ecdysteroid (molting hormone) metabolism (Additional file 1: Table S15). Using the estimates of where in the phylogeny these events occurred, we can infer characteristics of ancestral arthropods. For example, we identified 9601 genes in the last insect common ancestor (LICA) and estimate 14,700 LICA genes after correcting for unobserved gene extinctions (Fig. 2, Additional file 2: Figure S1 and Additional file 1: Table S16). We reconstructed similar numbers for ancestors of the six well-represented arthropod taxa in our sample (Fig. 2 and Additional file 1: Table S16). Of the 9601 genes present in LICA, we identified 147 emergent gene families (i.e., lineage-restricted families with no traceable orthologs in other clades) which appeared concurrently with the evolution of insects (Fig. 3a, Fig. 2 node 62, Additional file 1: Table S18). Gene Ontology term analysis of these 147 gene families recovered multiple key functions, including cuticle and cuticle development (suggesting changes in exoskeleton development), visual learning and behavior, pheromone and odorant binding (suggesting the ability to sense in terrestrial/aerial environments rather than aquatic), ion transport, neuronal activity, larval behavior, imaginal disc development, and wing morphogenesis. These emergent gene families likely allowed insects to undergo substantial diversification by expanding chemical sensing, such as an expansion in odorant binding to locate novel food sources and fine-tune species self-recognition [15,16,17]. Others, such as cuticle proteins underlying differences in exoskeleton structure, may enable cuticle properties optimized for diverse environmental habitats or life history stages [18]. In contrast, the data reveal only ten gene families that arose along the ancestral lineage of the Holometabola (Fig. 3b, Additional file 1: Table S19), implying that genes and processes required for the transition to holometabolous development, such as imaginal disc development, were already present in the hemimetabolous ancestors. This is consistent with Truman and Riddiford’s model that the holometabolous insect larva corresponds to a late embryonic state of hemimetabolous insects [19].
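To make the vocabulary of expansions, contractions, extinctions and emergent families concrete, here is a deliberately simple Python sketch that labels the change in one gene family along one branch by comparing gene counts at the ancestral and descendant nodes. It is not the likelihood-based reconstruction used in the study; the category boundaries, family names and counts are assumptions chosen for illustration.

```python
from collections import Counter


def classify_change(ancestor_count: int, descendant_count: int) -> str:
    """Label the change in one gene family along one branch of the tree."""
    if ancestor_count < 0 or descendant_count < 0:
        raise ValueError("gene counts cannot be negative")
    if ancestor_count == 0 and descendant_count > 0:
        return "emergence"    # family first appears on this branch
    if ancestor_count > 0 and descendant_count == 0:
        return "extinction"   # family is lost entirely on this branch
    if descendant_count > ancestor_count:
        return "expansion"
    if descendant_count < ancestor_count:
        return "contraction"
    return "no change"


# Hypothetical gene counts (ancestral node, descendant node) for one branch.
branch_counts = {
    "family_A": (3, 7),
    "family_B": (4, 2),
    "family_C": (0, 1),
    "family_D": (2, 0),
    "family_E": (5, 5),
}
tally = Counter(classify_change(a, d) for a, d in branch_counts.values())
print(dict(tally))
```

In practice the ancestral counts are themselves inferred with uncertainty, so a real analysis would attach a statistical model to each of these labels rather than compare raw integers.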

Summary of major results from gene family, protein domain, and methylation analyses. a We identify 147 gene families emerging during the evolution of insects, including several which may play an important role in insect development and adaptation. b Contrastingly, we find only ten emergent gene families during the evolution of holometabolous insects, indicating many gene families were already present during this transition. c Among all lineage nodes, we find that the node leading to Lepidoptera has the most emergent gene families. d We find that rates of gene gain and loss are highly correlated with rates of protein domain rearrangement. Leafcutter ants have experienced high rates of both types of change. e Blattella germanica has experienced the highest number of rapid gene family changes, possibly indicating its ability to rapidly adapt to new environments. f We observe signals of CpG methylation in all Araneae (spiders) genomes investigated (species shown: the brown recluse spider, Loxosceles reclusa) and the genome of the bark scorpion, Centruroides exilicauda. The two peaks show different CG counts in different gene features, with depletion of CG sequences in the left peak due to methylated C’s mutating to T. This suggests epigenetic control of a significant number of spider genes. Additional plots for all species in this study are shown in Additional file 2: Figure S5

We identified numerous genes that emerged in specific orders of insects. Strikingly, we found 1038 emergent gene families in the first ancestral Lepidoptera node (Fig. 3c). This node has by far the most emergent gene families, with the next highest being the node leading to the bumble bee genus Bombus with 860 emergent gene families (Additional file 2: Figure S2). Emergent lepidopteran gene families show enrichment for functional categories such as peptidases and odorant binding. Among the other insect orders, we find 227 emergent families in the node leading to the Hymenoptera, 205 in that leading to Coleoptera, and 156 in that leading to Diptera. Though our sampling is extensive, it is possible that gene families we have classified as emergent may be present in unsampled lineages.

Similarly, we reconstructed the protein domain arrangements for all nodes of the arthropod phylogeny, that is, the permutations in protein domain type per (multi-domain) gene. In total, we can explain the underlying events for more than 40,000 domain arrangement changes within the arthropods. The majority of domain arrangements (48% of all observable events) were formed by a fusion of two ancestral arrangements, while the fission of an existing arrangement into two new arrangements accounts for 14% of all changes. Interestingly, 37% of observed changes can be explained by losses (either as part of an arrangement (14%) or the complete loss of a domain in a proteome (23%)), while emergence of a novel protein domain is a very rare event, comprising only 1% of total events.

We observe high concordance between rates of gene family dynamics and protein domain rearrangement (Fig. 4 and Additional file 2: Figure S3). In some cases, we find specific examples of overlap between gene family and protein domain evolution. For example, spiders have the characteristic ability to spin silk and are venomous. Correspondingly, we identify ten gene families associated with venom or silk production that are rapidly expanding within Araneae (spiders, Additional file 1: Table S20). In parallel, we find a high rate of new protein domains in the subphylum Chelicerata, including a large number within Araneae associated with venom and silk production. For example, “spider silk protein 1” (Pfam ID: PF16763), “Major ampullate spidroin 1 and 2” (PF11260), “Tubuliform egg casing silk strands structural domain” (PF12042), and “Toxin with inhibitor cystine knot ICK or Knottin scaffold” (PF10530) are all domains that emerged within the spider clade. Venom domains also emerged in other venomous chelicerates, such as the bark scorpion, Centruroides sculpturatus.

Rate of genomic change along the arthropod phylogeny: a frequency of amino acid substitutions per site, b gene gains/losses, and c domain changes. All rates are averaged per My and color-indicated as branches of the phylogenetic tree. Species names are shown on the right; specific subclades are highlighted by colors according to the taxonomic groups noted in Fig. 2

We identified gene family changes that may underlie unique phenotypic transitions. The evolution of eusociality among three groups in our study, bees and ants (both Hymenoptera), and termites (Blattodea), requires these insects to be able to recognize other individuals of their colony (such as nest mates of the same or different caste), or invading individuals (predators, slave-makers and hosts) for effective coordination. We find 41 functional terms enriched for gene family changes in all three groups, with multiple gene family gains related to olfactory reception and odorant binding (Additional file 1: Table S21) in agreement with previous chemoreceptor studies of these species [20, 21].

Finally, we observe species-specific gene family expansions that suggest biological functions under selection. The German cockroach, a pervasive tenant in human dwellings across the world, has experienced the highest number of rapidly evolving gene families among the arthropods studied here, in agreement with a previously reported major expansion of chemosensory genes [22]. We also find the largest number of domain rearrangement events in B. germanica. The impressive capability of this cockroach to survive many environments and its social behavior could be linked to these numerous and rapid evolutionary changes at the genomic level and warrants more detailed investigation.

Evolutionary rates within arthropod history

The rate of genomic change can reflect key events during evolution along a phylogenetic lineage. Faster rates might imply small population sizes or strong selective pressure, possibly indicative of rapid adaptive radiations, and slower rates may indicate stasis. Studying rates of change requires a time-calibrated phylogeny. For this, we used 22 fossil calibration points [8, 23] and obtained branch lengths for our phylogeny in millions of years (My) (Fig. 2) that are very similar to those obtained by Misof et al. [8] and Rota-Stabelli et al. [9].

We examined the rates of three types of genomic change: (i) amino acid substitutions, (ii) gene duplications and gene losses, and (iii) protein domain rearrangements, emergence, and loss. While clearly not changing in a clock-like manner, all types of genomic change have a strikingly small amount of variation in rate among the investigated species (Fig. 4). We estimate an average amino acid substitution rate of 2.54 × 10⁻³ substitutions per site per My with a standard deviation of 1.11 × 10⁻³. The slowest rate is found in the branch leading to the insect order Blattodea (cockroaches and termites), while the fastest rates are found along the short branches during the early diversification of Holometabola, suggesting a period of rapid evolution, a pattern similar to that found for amino acid sequence evolution during the Cambrian explosion [24]. Other branches with elevated amino acid divergence rates include those leading to Acarina (mites), and to the Diptera (flies).

Though we observe thousands of genomic changes across the arthropod phylogeny, they are mostly evenly distributed (Fig. 3d). Rates of gene duplication and loss show remarkably little variation, both across the tree and within the six multi-species orders (Additional file 1: Table S13). Overall, we estimate an average rate of 43.0 gains/losses per My, but with a high standard deviation of 59.0 that is driven by a few lineages with greatly accelerated rates. Specifically, the terminal branches leading to the leafcutter ants Atta cephalotes and Acromyrmex echinatior along with the internal node leading to the leafcutter ants and the red fire ant (node HY29) have exceptionally high gene gain/loss rates of 266, 277, and 370 per My, respectively (Fig. 3d). This is an order of magnitude higher than average, as previously reported among leafcutter ants [25]. Removing these nodes, the average becomes 27.2 gains/losses per My (SD 19.7). Interestingly, the high gain/loss rates observed in these ants, in contrast to other arthropods, are not due to large gene content change in a small number of gene families. They are instead due mostly to single gene gains or losses in a large number of gene families.

Regarding protein domain rearrangements, which mainly arise from duplication, fusion and terminal losses of domains [26], we estimate an average rate of 5.27 events per My, approximately eightfold lower than the rate of gene gain/loss. Interestingly, we discovered a strong correlation between rates of gene gain/loss and domain rearrangement (Figs. 3d and 4 and Additional file 2: Figure S3). For example, terminal branches within the Hymenoptera have an accelerated rate of domain rearrangement, which coincides with the increased rate of gene gains and losses observed along those branches. This novel finding is surprising, given that these processes follow largely from different underlying genetic events (see [27] for discussion of these processes).
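The rate comparisons quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the text; the variable names are illustrative.

```python
# Average rates reported in the text, in events per million years (My).
mean_gain_loss = 43.0          # gene gains/losses, all branches
trimmed_mean_gain_loss = 27.2  # same, with the three leafcutter-related nodes removed
mean_domain_events = 5.27      # protein domain rearrangements

# Gene gain/loss rates of the three accelerated nodes named in the text.
accelerated = {
    "Atta cephalotes": 266.0,
    "Acromyrmex echinatior": 277.0,
    "node HY29": 370.0,
}

fold_difference = mean_gain_loss / mean_domain_events
print(f"gain/loss vs domain rearrangement: {fold_difference:.2f}-fold")

for name, rate in accelerated.items():
    ratio = rate / trimmed_mean_gain_loss
    print(f"{name}: {ratio:.1f} times the trimmed average")
```

The first ratio comes out close to eight, matching the "approximately eightfold" statement, and each accelerated node sits at roughly ten times the trimmed average or more, in line with the "order of magnitude" description.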

Our examination found no correlation between variation in amino acid substitution rates and rates of gene gain/loss or domain rearrangement rates (Fig. 4 and Additional file 2: Figure S3). Branches with accelerated rates of amino acid substitution, such as the lineage leading to the most recent common ancestor of the insect superorder Holometabola, do not show corresponding increases in gene gain/loss rates. Similarly, the hymenopteran lineages displaying the fastest rate of gene gain/loss in our analysis do not display higher rates of amino acid substitutions.

Control of novel genes: methylation signals in arthropod genomes

Our description of gene family expansions in arthropods by gene duplication naturally suggests the need for differential control of duplicated genes. Insect epigenetic control by CpG methylation is important for caste development in honey bees [28] and polyphenism in aphids [29]. However, signals of methylation are not seen in every insect, and the entire Dipteran order appears to have lost the capacity for DNA methylation. Given this diversity in the use of, and capacity for, epigenetic control by DNA methylation, we searched for signals of CpG methylation in our broader sampling of arthropod genomes. We find several independent losses of the DNA methylation machinery across the arthropods (Additional file 2: Figure S4) [30]. This indicates that DNA methylation is not universally necessary for development and that the DNA methyltransferases in insects may function in ways not previously appreciated [31]. Additionally, putative levels of DNA methylation vary considerably across arthropod species (Additional file 2: Figures S4, S5). Notably, the hemimetabolous insects and non-insect arthropods show higher levels of DNA methylation signals than the holometabolous insects [30]. Araneae (spiders), in particular, show clear bimodal patterns of methylation (Fig. 3f and Additional file 2: Figure S5), with some genes displaying high methylation signals and others not. A possible connection between spider bimodal gene methylation and their proposed ancestral whole genome duplication will require additional investigation. This pattern is also found in some holometabolous insects, suggesting that the division of genes into methylated and unmethylated categories is a relatively ancient trait in Arthropoda, although many species have since lost this clear distinction. Finally, some taxa, particularly in Hymenoptera, show higher levels of CpG di-nucleotides than expected by chance alone, which may be a signal of strong effects of gene conversion in the genome [32].
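A common sequence-based proxy for historical methylation is the ratio of observed to expected CpG dinucleotides: because methylated cytosines tend to mutate to thymine, heavily methylated genes become depleted in CG. The text does not state the exact statistic behind its methylation signals, so the Python sketch below is an assumption that uses one widely used form of the ratio, (number of CG × sequence length) / (number of C × number of G). The function name and toy sequences are hypothetical.

```python
from typing import Optional


def cpg_observed_expected(sequence: str) -> Optional[float]:
    """Observed/expected CpG ratio, or None if the sequence lacks C or G."""
    seq = sequence.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return None  # the expected CpG count is zero, so the ratio is undefined
    n_cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return n_cg * len(seq) / (n_c * n_g)


# Toy sequences: CG-rich, CG at the expected level, and CG-depleted.
for toy in ("ACGT", "CCGG", "CTGA"):
    print(toy, cpg_observed_expected(toy))
```

Computed per gene across a genome, a histogram of this ratio with two peaks would correspond to the bimodal pattern described for spiders, with the low-ratio peak representing the CG-depleted genes.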

A new perspective on the genomes of archaic humans

A genome by itself is like a recipe without a chef -- full of important information, but in need of interpretation. So, even though we have sequenced genomes of our nearest extinct relatives -- the Neanderthals and the Denisovans -- there remain many unknowns regarding how differences in our genomes actually lead to differences in physical traits.

"When we're looking at archaic genomes, we don't have all the layers and marks that we usually have in samples from present-day individuals that help us interpret regulation in the genome, like RNA or cell structure," said David Gokhman, a postdoctoral fellow in biology at Stanford University.

"We just have the naked DNA sequence, and all we can really do is stare at it and hope one day we'd be able to understand what it means," he said.

Motivated by such hopes, a team of researchers at Stanford and the University of California, San Francisco (UCSF), have devised a new method to harvest more information from the genomes of archaic humans to potentially reveal the physical consequences of genomic differences between us and them.

Their work, published April 22 in eLife, focused on sequences related to gene expression -- the process by which genes are activated or silenced, which determines when, how and where DNA's instructions are followed. Gene expression tends to be the genetic detail that determines physical differences between closely related groups.

Starting with 14,042 genetic variants unique to modern humans, the researchers found 407 that specifically contribute to differences in gene expression between modern and archaic humans. In further analysis, they determined that the differences were more likely to be associated with the vocal tract and the cerebellum, which is the part of our brain that receives sensory information and controls voluntary movement, including walking, coordination, balance and speech.

"It just seems so implausible that you could make a call like, 'I think the voice box evolved,' from the information we have," said Dmitri Petrov, the Michelle and Kevin Douglas Professor in the School of Humanities and Sciences, who is co-senior author of the paper with Gokhman and Nadav Ahituv, a professor of bioengineering at UCSF. "The predictions are almost science fiction. If five years ago, somebody told me that this would be possible, I would not have put much money on it."

The path to modern humans

With such a large number of variants to examine, the researchers relied on a technique called a "massively parallel reporter assay" to test which sequences actually affect gene regulation. Their version of this technique, which was developed by Ahituv, involves packaging the DNA sequence variant into a "reporter gene" inside a virus. That virus is then put into a cell. If that variant affects gene expression, the reporter gene produces a barcoded molecule that identifies what DNA sequence it came from. The barcode allows the researchers to scan the products of a large number of variants at once.

Essentially, the whole process imitates an abridged version of how each variant would play out in a cell in real life and reports the results.
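The barcode readout described above can be sketched as a counting exercise. The example below assumes a common analysis convention for reporter assays, in which the RNA reads for each tested sequence are normalised by the DNA reads for the same sequence, so that a sequence is not scored as active merely because more copies of it entered the cells. All names are hypothetical, and this is not a description of the team's actual pipeline.

```python
import math
from collections import defaultdict


def activity_per_sequence(barcode_to_seq, dna_counts, rna_counts, pseudocount=1.0):
    """Return log2(RNA/DNA) for each tested sequence.

    barcode_to_seq: maps each barcode to the sequence it tags (hypothetical IDs).
    dna_counts / rna_counts: read counts per barcode; missing barcodes count as 0.
    pseudocount: added to both totals to avoid division by zero.
    """
    dna = defaultdict(float)
    rna = defaultdict(float)
    # Pool all barcodes that tag the same sequence.
    for barcode, seq_id in barcode_to_seq.items():
        dna[seq_id] += dna_counts.get(barcode, 0)
        rna[seq_id] += rna_counts.get(barcode, 0)
    return {
        seq_id: math.log2((rna[seq_id] + pseudocount) / (dna[seq_id] + pseudocount))
        for seq_id in dna
    }


def allele_effect(activity, modern_id, archaic_id):
    """Difference in log2 activity between the modern and archaic alleles."""
    return activity[modern_id] - activity[archaic_id]
```

A variant would be flagged as affecting expression when this difference is consistently far from zero across replicates; the statistical test used to decide that is outside the scope of this sketch.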

Lana Harshman, a graduate student at UCSF and co-lead author of the paper, infected three types of cells with the team's variant packages. These cells were related to the brain, skeleton and early development -- subjects that are most likely to reveal evolutionary differences between us and our most recent ancestors. Carly Weiss, a postdoctoral scholar in the Petrov lab and co-lead author of the paper, analyzed the results of these experiments.

In total, the researchers found 407 sequences that represented a change in expression in modern humans compared to our predecessors. Among that list, genes that affect the cerebellum and genes that affect the voice box, pharynx, larynx and vocal cords seem to be overrepresented.

"This would suggest some kind of rapid evolution of those organs or some kind of a path that is specific to modern humans," said Gokhman. The next step, he added, would be trying to understand more about these sequences and the roles they played in the evolution of modern humans.

Even with those unknowns, this technique by itself is a significant advance for evolutionary research, said Petrov.

"This goes beyond the sequencing of the DNA from the Neanderthal and Denisovan bones. This begins to put meaning on those differences," said Petrov. "It's an important conceptual step from just the sequence -- no tissue, no cells -- to biological information and will enable many future studies."

Hunter Fraser, associate professor of biology at Stanford, and Fumitaka Inoue (UCSF) are also co-authors of the paper. Fraser is also a member of Stanford Bio-X, the Maternal & Child Health Research Institute (MCHRI) and the Stanford Cancer Institute. Petrov is also a member of Stanford Bio-X and the Maternal & Child Health Research Institute (MCHRI), and an affiliate of the Stanford Woods Institute for the Environment.

This research was funded by Human Frontier, Rothschild and Zuckerman fellowships; the National Human Genome Research Institute; the National Institute of Mental Health; the Uehara Memorial Foundation; and the Stanford Center for Computational, Evolutionary and Human Genomics (CEHG).

Platypus and echidna genomes reveal mammalian biology and evolution

Egg-laying mammals (monotremes) are the only extant mammalian outgroup to therians (marsupial and eutherian animals) and provide key insights into mammalian evolution [1,2]. Here we generate and analyse reference genomes of the platypus (Ornithorhynchus anatinus) and echidna (Tachyglossus aculeatus), which represent the only two extant monotreme lineages. The nearly complete platypus genome assembly has anchored almost the entire genome onto chromosomes, markedly improving the genome continuity and gene annotation. Together with our echidna sequence, the genomes of the two species allow us to detect the ancestral and lineage-specific genomic changes that shape both monotreme and mammalian evolution. We provide evidence that the monotreme sex chromosome complex originated from an ancestral chromosome ring configuration. The formation of such a unique chromosome complex may have been facilitated by the unusually extensive interactions between the multi-X and multi-Y chromosomes that are shared by the autosomal homologues in humans. Further comparative genomic analyses unravel marked differences between monotremes and therians in haptoglobin genes, lactation genes and chemosensory receptor genes for smell and taste that underlie the ecological adaptation of monotremes.

Conflict of interest statement

J.K. is an employee of Pacific Biosciences, a company that develops single-molecule sequencing technologies.


Fig. 1. Chromosome assembly of monotreme and mammalian genome evolution.

Fig. 2. Origin and evolution of the sex chromosomes of the platypus.

Fig. 3. Interactions between the platypus sex chromosomes.

Fig. 4. Genomic features related to biological characteristics of the monotremes.

Extended Data Fig. 1. Platypus genome assembly and evaluation.

Extended Data Fig. 2. Mammalian genome evolution.

Extended Data Fig. 3. Evolution of immune gene family in monotremes.

Extended Data Fig. 4. Genomic composition of monotreme sex chromosomes.

Extended Data Fig. 5. Evolution of PARs after the platypus and echidna divergence.

Extended Data Fig. 6. Sex chromosome evolution in monotremes.

Extended Data Fig. 7. Chromatin conformation of monotreme sex chromosomes.

Extended Data Fig. 8. Loss of dietary-related genes in monotremes.

Extended Data Fig. 9. Taste-receptor evolution and olfactory-receptor organization in monotremes.

Extended Data Fig. 10. Genomic features related to haemoglobin clearance and reproduction in monotremes.

Open Research

Radiocarbon dates have been uploaded to the Arctic Data Center. Raw sequencing data generated from YG303.325 and YG188.42 are available in NCBI BioProject PRJNA727160 (SAMN19007726, SAMN19007727). All newly generated mitochondrial genomes have been uploaded to GenBank with ID nos. MW846090–MW846167. BEAST files and the final G-PhoCS “filter” file are available on Data Dryad. All scripts, as well as filtering criteria used to ascertain the set of putatively neutral loci, are published on GitHub.



The information in genomes provides the instruction set for producing each living organism on the planet. We have a growing understanding of the basic biochemical functions of many individual genes. Yet two problems remain among the deepest questions in all of science: how this encoded information is read out to orchestrate the production of incredibly diverse cell types and organ functions, and how different species use strikingly similar gene sets to produce fantastically diverse organismal morphologies with distinct survival and reproductive strategies. Moreover, we recognize that inherited or acquired variation in DNA sequence and changes in epigenetic states contribute to the causation of virtually every disease that afflicts our species. Spectacular advances in genetic and genomic analysis now provide the tools to answer these fundamental questions.

Members of the Department of Genetics conduct basic research using genetics and genomics of model organisms (yeast, fruit fly, worm, zebrafish, mouse) and humans to understand fundamental mechanisms of biology and disease. Areas of active investigation include genetic and epigenetic regulation of development; molecular genetics, genomics and cell biology of stem cells; the biochemistry of microRNA production and microRNA regulation of gene expression; and genetic and genomic analysis of diseases in model systems and humans, including cancer, cardiovascular and kidney disease, neurodegeneration and regeneration, and neuropsychiatric disease. Members of the Department have also been at the forefront of technology development for genetic analysis, including new methods for engineering mutations and for the production and analysis of large genomic data sets.

The Department sponsors a graduate program leading to the PhD in the areas of molecular genetics and genomics, development, and stem cell biology. Admission to the Graduate Program is through the Combined Programs in Biological and Biomedical Sciences (BBS).

In addition to these basic science efforts, the Department is also responsible for providing clinical care in Medical Genetics in the Yale New Haven Health System. Clinical genetics services include inpatient consultation and care, general, subspecialty, and prenatal genetics clinics, and clinical laboratories for cytogenetics, DNA diagnostics, and biochemical diagnostics. The Department sponsors a Medical Genetics Residency program leading to certification by the American Board of Medical Genetics. Admission to the Genetics Residency is directly through the Department.

Somatic Cell Genome Editing

At the close of every year, editors and writers at the journal Science review the progress that’s been made in all fields of science, from anthropology to zoology, to select the biggest advance of the past 12 months. In most cases, this Breakthrough of the Year is as tough to predict as the Oscar for Best Picture. Not in 2020. In a year filled with a multitude of challenges posed by the emergence of the deadly coronavirus disease 2019 (COVID-19), the breakthrough was the development of the first vaccines to protect against this pandemic that’s already claimed the lives of more than 360,000 Americans.

In keeping with its annual tradition, Science also selected nine runner-up breakthroughs. This impressive list includes at least three areas that involved efforts supported by NIH: therapeutic applications of gene editing, basic research understanding HIV, and scientists speaking up for diversity. Here’s a quick rundown of all the pioneering advances in biomedical research, both NIH and non-NIH funded:

Shots of Hope: A lot of things happened in 2020 that were unprecedented. At the top of the list was the rapid development of COVID-19 vaccines. In 10 months, public and private researchers accomplished what normally takes about 8 years, producing two vaccines for public use, with more on the way in 2021. In my more than 25 years at NIH, I’ve never encountered such a willingness among researchers to set aside their other concerns and gather around the same table to get the job done fast, safely, and efficiently for the world.

It’s also pretty amazing that the first two conditionally approved vaccines from Pfizer and Moderna were found to be more than 90 percent effective at protecting people from infection with SARS-CoV-2, the coronavirus that causes COVID-19. Both are innovative messenger RNA (mRNA) vaccines, a new approach to vaccination.

For this type of vaccine, the centerpiece is a small, non-infectious snippet of mRNA that encodes the instructions to make the spike protein that crowns the outer surface of SARS-CoV-2. When the mRNA is injected into a shoulder muscle, cells there will follow the encoded instructions and temporarily make copies of this signature viral protein. As the immune system detects these copies, it spurs the production of antibodies and helps the body remember how to fend off SARS-CoV-2 should the real thing be encountered.

It also can’t be overstated that both mRNA vaccines (one developed by Pfizer and the other by Moderna in conjunction with NIH’s National Institute of Allergy and Infectious Diseases) were rigorously evaluated in clinical trials. Detailed data were posted online and discussed in all-day meetings of an FDA Advisory Committee, open to the public. In fact, given the high stakes, the level of review probably was more scientifically rigorous than ever.

First CRISPR Cures: One of the most promising areas of research now underway involves gene editing. These tools, still relatively new, hold the potential to fix gene misspellings, and potentially cure a wide range of genetic diseases that were once thought to be out of reach. Much of the research focus has centered on CRISPR/Cas9. This highly precise gene-editing system relies on guide RNA molecules to direct a scissor-like Cas9 enzyme to just the right spot in the genome to cut out or correct a disease-causing misspelling.
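To make "just the right spot" concrete, here is a minimal sketch of how candidate Cas9 target sites can be located in a DNA string. It assumes the commonly used Cas9 from Streptococcus pyogenes, which needs a 20-letter target immediately followed by a short "NGG" motif (the PAM). The paragraph above does not say which Cas9 variant was used, so that choice, like the function name, is an assumption. The sketch scans only the strand it is given; real guide design also checks the opposite strand and screens for look-alike sites elsewhere in the genome.

```python
import re


def find_cas9_targets(seq, guide_len=20):
    """List candidate (position, target, PAM) sites on the given strand.

    Assumes an NGG PAM directly 3' of the target, as for S. pyogenes Cas9.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]
```

Each returned target is the stretch a guide RNA would be designed to match; Cas9 then cuts within that stretch, a few letters upstream of the PAM.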

In late 2020, a team of researchers in the United States and Europe succeeded for the first time in using CRISPR to treat 10 people with sickle cell disease and transfusion-dependent beta thalassemia. As published in the New England Journal of Medicine, several months after this non-heritable treatment, all patients no longer needed frequent blood transfusions and are living pain free [1].

The researchers tested a one-time treatment in which they removed bone marrow from each patient, modified the blood-forming hematopoietic stem cells outside the body using CRISPR, and then reinfused them into the body. To prepare for receiving the corrected cells, patients were given toxic bone marrow ablation therapy, in order to make room for the corrected cells. The result: the modified stem cells were reprogrammed to switch back to making ample amounts of a healthy form of hemoglobin that their bodies produced in the womb. While the treatment is still risky, complex, and prohibitively expensive, this work is an impressive start for more breakthroughs to come using gene editing technologies. NIH, including its Somatic Cell Genome Editing program, continues to push the technology to accelerate progress and make gene editing cures for many disorders simpler and less toxic.

Scientists Speak Up for Diversity: The year 2020 will be remembered not only for COVID-19, but also for the very public and inescapable evidence of the persistence of racial discrimination in the United States. Triggered by the killing of George Floyd and other similar events, Americans were forced to come to grips with the fact that our society does not provide equal opportunity and justice for all. And that applies to the scientific community as well.

Science thrives in safe, diverse, and inclusive research environments. It suffers when racism and bigotry find a home to stifle diversity—and community for all—in the sciences. For the nation’s leading science institutions, there is a place and a calling to encourage diversity in the scientific workplace and provide the resources to let it flourish to everyone’s benefit.

For those of us at NIH, last year’s peaceful protests and hashtags were noticed and taken to heart. That’s one of the many reasons why we will continue to strengthen our commitment to building a culturally diverse, inclusive workplace. For example, we have established the NIH Equity Committee. It allows for the systematic tracking and evaluation of diversity and inclusion metrics for the intramural research program for each NIH institute and center. There is also the recently founded Distinguished Scholars Program, which aims to increase the diversity of tenure track investigators at NIH. Recently, NIH also announced that it will provide support to institutions to recruit diverse groups or “cohorts” of early-stage research faculty and prepare them to thrive as NIH-funded researchers.

AI Disentangles Protein Folding: Proteins, which are the workhorses of the cell, are made up of long, interconnected strings of amino acids that fold into a wide variety of 3D shapes. Understanding the precise shape of a protein facilitates efforts to figure out its function, its potential role in a disease, and even how to target it with therapies. To gain such understanding, researchers often try to predict a protein’s precise 3D chemical structure using basic principles of physics—including quantum mechanics. But while nature does this in real time zillions of times a day, computational approaches have not been able to do this—until now.

Of the roughly 170,000 proteins mapped so far, most have had their structures deciphered using powerful imaging techniques such as x-ray crystallography and cryo–electron microscopy (cryo-EM). But researchers estimate that there are at least 200 million proteins in nature, and, as amazing as these imaging techniques are, they are laborious, and it can take many months or years to solve the 3D structure of a single protein. So, a breakthrough certainly was needed!

In 2020, researchers with the company DeepMind, London, developed an artificial intelligence (AI) program that rapidly predicts most protein structures as accurately as x-ray crystallography and cryo-EM can map them [2]. The AI program, called AlphaFold, predicts a protein’s structure by computationally modeling the amino acid interactions that govern its 3D shape.

Getting there wasn’t easy. While a complete de novo calculation of protein structure still seemed out of reach, investigators reasoned that they could kick start the modeling if known structures were provided as a training set to the AI program. Utilizing a computer network built around 128 machine learning processors, the AlphaFold system was created by first focusing on the 170,000 proteins with known structures in a reiterative process called deep learning. The process, which is inspired by the way neural networks in the human brain process information, enables computers to look for patterns in large collections of data. In this case, AlphaFold learned to predict the underlying physical structure of a protein within a matter of days. This breakthrough has the potential to accelerate the fields of structural biology and protein research, fueling progress throughout the sciences.

How Elite Controllers Keep HIV at Bay: The term “elite controller” might make some people think of video game whizzes. But here, it refers to the less than 1 percent of people living with human immunodeficiency virus (HIV) who’ve somehow stayed healthy for years without taking antiretroviral drugs. In 2020, a team of NIH-supported researchers figured out why this is so.

In a study of 64 elite controllers, published in the journal Nature, the team discovered a link between their good health and where the virus has inserted itself in their genomes [3]. When a cell transcribes a gene where HIV has settled, this so-called “provirus” can produce more virus to infect other cells. But if it settles in a part of a chromosome that rarely gets transcribed, sometimes called a gene desert, the provirus is stuck with no way to replicate. Although this discovery won’t cure HIV/AIDS, it points to a new direction for developing better treatment strategies.

In closing, 2020 presented more than its share of personal and social challenges. Among those challenges was a flood of misinformation about COVID-19 that confused and divided many communities and even families. That’s why the editors and writers at Science singled out “a second pandemic of misinformation” as its Breakdown of the Year. This divisiveness should concern all of us greatly, as COVID-19 cases continue to soar around the country and our healthcare gets stretched to the breaking point. I hope and pray that we will all find a way to come together, both in science and in society, as we move forward in 2021.

Bio Eats World: Viral Genomes from A to Z

If there is one rule in biology, it is that there is an exception to every rule. This includes even the basic biochemistry of DNA, which was once thought to be universal. On this episode, host Lauren Richardson and Judy Savitskaya (a16z bio deal team member and synthetic biology expert) discuss the results and implications of three related articles co-published in Science, which all advance our understanding of a very unusual kind of DNA.

If you open any biology textbook, it will say that the genetic code is made up of 4 bases: Adenine, Thymine, Cytosine, and Guanine, or ATCG. But, back in 1977, scientists discovered a phage (the technical term for a virus that infects bacteria) that encodes its genome in ZTCG. Z is a derivative of A that has an extra amino group tagged on, and while that may sound minor, it changes some of the key properties of DNA. These three new articles seek to understand how Z is made and how it is incorporated into DNA. This is essential information for taking Z from a weird, wild bio story into a practical application. The conversation covers what makes Z different from other bases, what these three articles reveal about the synthesis and polymerization of Z, and how we can use Z in a wide range of applications, from bio-containment to new therapeutics to DNA storage.

The three articles discussed are:

“A widespread pathway for substitution of adenine by diaminopurine in phage genomes” by Yan Zhou, Xuexia Xu, Yifeng Wei, Yu Cheng, Yu Guo, Ivan Khudyakov, Fuli Liu, Ping He, Zhangyue Song, Zhi Li, Yan Gao, Ee Lui Ang, Huimin Zhao, Yan Zhang, and Suwen Zhao

“A third purine biosynthetic pathway encoded by aminoadenine-based viral DNA genomes” by Dona Sleiman, Pierre Simon Garcia, Marion Lagune, Jerome Loc’h, Ahmed Haouz, Najwa Taib, Pascal Röthlisberger, Simonetta Gribaldo, Philippe Marlière, and Pierre Alexandre Kaminski

“Noncanonical DNA polymerization by aminoadenine-based siphoviruses” by Valerie Pezo, Faten Jaziri, Pierre-Yves Bourguignon, Dominique Louis, Deborah Jacobs-Sera, Jef Rozenski, Sylvie Pochet, Piet Herdewijn, Graham F. Hatfull, Pierre-Alexandre Kaminski, and Philippe Marlière