What sort of assay could be used to identify mutants with mutator phenotype?

By mutator phenotype, I mean being more prone to mutations, for example due to mutations in genes involved in DNA repair. I was thinking about exposing the cells to agents that damage DNA (UV light, for example). Compared to normal cells, mutators would be less likely to survive, because they lack the ability to repair damage in the genetic material. To a certain extent, that's the principle of radiotherapy (cancer cells are more susceptible to radiation). Are there other options? Moreover, is it possible to select for the mutators (instead of killing them)?

There are a number of ways to screen for mutator genes. One straightforward approach in bacteria is to take an antibiotic-sensitive non-mutator strain, grow it up, and expose it to different antibiotics. Cells that survive in the presence of multiple antibiotics have acquired multiple de novo resistance mutations and thus are highly enriched for mutator alleles. Here is one example: Wiegand et al. (2008) AAC.
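The enrichment argument can be illustrated with a back-of-the-envelope calculation. The sketch below is hypothetical: the resistance frequencies, the 100-fold mutator effect and the starting mutator fraction are illustrative assumptions (not values from Wiegand et al.), and resistance to each antibiotic is assumed to arise independently.

```python
def mutator_fraction_after_selection(
    initial_mutator_fraction: float,
    wt_resistance_freq: float,
    mutator_fold_increase: float,
    n_antibiotics: int,
) -> float:
    """Fraction of survivors that are mutators after selecting for
    resistance to n_antibiotics drugs (illustrative model only)."""
    mut_resistance_freq = wt_resistance_freq * mutator_fold_increase
    # Probability that one cell is resistant to every drug,
    # assuming independent resistance mutations.
    wt_survival = wt_resistance_freq ** n_antibiotics
    mut_survival = mut_resistance_freq ** n_antibiotics
    mut_survivors = initial_mutator_fraction * mut_survival
    wt_survivors = (1.0 - initial_mutator_fraction) * wt_survival
    return mut_survivors / (mut_survivors + wt_survivors)


# Assumed numbers: 1 in 100,000 cells is a mutator, wild-type cells become
# resistant at 1e-8 per drug, and mutators do so 100 times more often.
for n in (1, 2, 3):
    frac = mutator_fraction_after_selection(1e-5, 1e-8, 100.0, n)
    print(f"{n} antibiotic(s): mutators are {frac:.3f} of survivors")
```

Under these assumptions each extra antibiotic multiplies the mutator advantage by the fold increase, which is why selecting on several drugs enriches for mutators far more strongly than selecting on one.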

Lethal Mutant

Special classes of conditional lethal mutants are created when the condition is not environmental, but rather is the specific genetic background of the host strain. In this situation, when a mutation is introduced into one genetic background the strain remains viable, but when introduced into another genetic background, it is not viable. Viruses that could infect one strain but not another led to the discovery of this type of conditional lethality.

Several types of genetic background mutations can allow the growth of mutant viruses that cannot grow in the wild-type host. A host carrying a nonsense suppressor mutation allows growth of a virus carrying a nonsense mutation. Other types of suppressor mutations may compensate for an otherwise lethal mutation. Generally known as extragenic or second-site suppressors, these mutations allow a strain carrying an otherwise lethal mutation to grow. Genetic crosses must demonstrate that the original lethal mutation is still present, and that the second mutational change suppresses the original lethal effect and allows growth.

Other types of genetic background mutations prevent the viability of a strain carrying a mutant protein. In this situation, mutations are found to be lethal when combined, while the individual mutations are not lethal. Such mutations are termed synthetically lethal, because the mutations are only lethal when they are put together in the same strain. Figure 1 shows the phenotypes and genotypes of these different cases of conditional lethal mutations.

Figure 1. (a) ts-1 is a heat-sensitive conditional lethal mutant; cs-1 is a cold-sensitive conditional lethal mutant. The growth phenotypes at different temperatures are shown: (+) growth; (–) no growth. (b) Synthetic lethality occurs when two non-lethal mutations are combined and found to have a lethal phenotype in combination. (c) An extragenic suppressor is a mutation in another gene that can suppress the phenotype of the first mutation. In this case, a temperature-sensitive mutation (ts-1) was found to be suppressed by an extragenic suppressor, sup-1. In the strain with the two mutations ts-1 and sup-1, growth can occur at 42 °C, which is non-permissive for strains carrying the ts-1 mutation alone.

High-throughput screening technologies for enzyme engineering

Ultra-high-throughput technologies have emerged for enzyme engineering.

These technologies are enabling screens of millions of enzyme variants.

Cells or synthetic droplets can be used as compartments for enzyme reactions.

Microfluidic devices and microchamber arrays have been developed for screening.

Workflow, throughput, and assay format and flexibility vary for each technology.

Emerging technologies are enabling ultra-high-throughput screening of combinatorial enzyme libraries to identify variants with improved properties such as increased activity, altered substrate specificity, and increased stability. Each of these enzyme engineering platforms relies on compartmentalization of reaction components, similar to microtiter plate-based assays which have been commonly used for testing the activity of enzyme variants. The technologies can be broadly divided into three categories according to their spatial segregation strategy: (1) cells as reaction compartments, (2) in vitro compartmentalization via synthetic droplets, and (3) microchambers. Here, we discuss these emerging platforms, which in some cases enable the screening of greater than 10 million enzyme variants, and highlight benefits and limitations of each technology.

Transposons: Definition and Types (With Diagram)

The presence of transposable elements was first predicted by Barbara McClintock in maize (corn) in the late 1940s. After several careful studies, she found that certain genetic elements moved from one site to an entirely different site in the chromosome. She called this phenomenon transposition, and she called the genetic elements themselves controlling elements.

These controlling elements were later called transposable elements by Alexander Brink. In the late 1960s this phenomenon was also discovered in bacteria.

Consequently, molecular biologists called them transposons. A transposon may be defined as: “a DNA sequence that is able to move or insert itself at a new location in the genome.” The movement of a transposon to a new site in the genome is referred to as transposition.

Transposons encode a special protein called transposase, which catalyses the process of transposition. Transposons are particular to different groups of organisms. They constitute an appreciable fraction of the genome in organisms such as fungi, bacteria, plants, animals and humans. Transposons have had a major impact on the genetic composition of organisms.

Transposons, or transposable genetic elements, are also often referred to as ‘mobile genetic elements’. They can be categorized on different bases, such as their mode of transposition or the organisms in which they are present.

Types of Transposons:

Different transposons may change their sites by following different transposition mechanisms.

On the basis of their transposition mechanism, transposons may be categorized into following types:

(i) Cut-and-Paste Transposons:

They transpose by excision (cutting) of the transposable sequence from one position in the genome and its insertion (pasting) to another position within the genome (Fig. 1).

Cut-and-paste transposition involves two transposase subunits. Each transposase subunit binds to the specific sequences at one of the two ends of the transposon. These subunits then come together and bring about excision of the transposon.

This excised ‘transposon-transposase complex’ then integrates at the target recipient site. In this manner, the transposon is cut from one site and pasted at another site by a mechanism mediated by the transposase protein (Fig. 2).

Examples of cut-and-paste transposons are IS-elements, and P-elements and hobo-elements in Drosophila.

(ii) Replicative Transposons:

They transpose by a mechanism that involves replication of the transposable sequence; the copy so formed is inserted into the target site while the donor site remains unchanged (Fig. 3). Thus, in this type of transposition there is a gain of one copy of the transposon, and after transposition both the donor and the recipient DNA molecules carry one copy of the transposable sequence each.

Tn3-elements found in bacteria are good examples of such type of transposons.

(iii) Retro Elements:

Their transposition is accomplished through a process that involves the synthesis of DNA by reverse transcription (i.e. RNA → DNA), using the element's RNA as the template (Fig. 4). This type of transposition involves an RNA intermediate: the transposable DNA is transcribed to produce an RNA molecule.

This RNA is then used as a template for producing a complementary DNA by the enzyme reverse transcriptase. The single-stranded DNA copy so formed is then made double-stranded and inserted into the target DNA site. Transposable elements that require reverse transcriptase for their movement are called retrotransposons.

Retro elements may be viral or non-viral. Of these two, the non-viral retro elements are important and may be further classified as:

(A) Retrovirus-like elements:

They carry long terminal repeats (LTRs). Examples are the copia and gypsy elements in Drosophila.

(B) Elements without LTRs:

LTRs are absent. Examples are LINEs and SINEs in humans.

Transposable Elements in Prokaryotes:

Although the presence of transposons was first predicted in eukaryotes, the first observation at the molecular level was made in bacteria, which are prokaryotes.

Bacterial transposable elements are of the following types:

(a) Insertion Sequences or IS Elements:

They are the transposable sequences which can insert at different sites in the bacterial chromosomes.

IS-elements contain inverted terminal repeats (ITRs) and were first observed in E. coli. IS elements are relatively short, usually not exceeding 2500 bp. The ITRs present at the ends of IS-elements are an important feature that enables their mobility. The ITRs in the IS-elements of E. coli usually range between 18 and 40 bp.

The term ‘inverted terminal repeat’ (ITR) implies that the sequence at the 5′ end of one strand is identical to the sequence at the 5′ end of the other strand, but the two run in opposite directions (Fig. 5). The E. coli chromosome carries a number of copies of several IS-elements, such as IS1, IS2, IS3, IS4 and IS5.
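The definition of an inverted terminal repeat can be checked computationally: reading one strand, the right end of the element should be the reverse complement of the left end. The sketch below is a minimal illustration; the function names and the toy sequences are hypothetical and do not correspond to any real IS-element.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_inverted_terminal_repeats(element: str, repeat_len: int) -> bool:
    """True if the last repeat_len bases are the reverse complement
    of the first repeat_len bases of the element."""
    element = element.upper()
    if repeat_len <= 0 or 2 * repeat_len > len(element):
        return False
    left = element[:repeat_len]
    right = element[-repeat_len:]
    return right == reverse_complement(left)


# Hypothetical toy elements with an arbitrary internal sequence.
inverted = "GGAAGT" + "ATGCATGCAAT" + "ACTTCC"  # ends are inverted repeats
direct = "GGAAGT" + "ATGCATGCAAT" + "GGAAGT"    # ends are direct repeats

print(has_inverted_terminal_repeats(inverted, 6))
print(has_inverted_terminal_repeats(direct, 6))
```

The same check distinguishes inverted repeats from direct repeats, where the two ends are identical on the same strand.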

(b) Prokaryotic Transposon Element:

These are also called composite transposons and are denoted by the symbol Tn. A composite transposon is made up of two IS elements, one at each end of a DNA sequence containing genes whose functions are not related to the transposition process. These transposons have inverted repeats at their ends. The length of these inverted repeats ranges from a few nucleotides to about 1500 bp.

It can be said that these are the large transposons which are formed by capturing of an immobile DNA sequence within two insertion sequences thus enabling it to move. Examples of such transposons include the members of Tn series like Tn1, Tn5, Tn9, Tn10, etc.

Transposable Elements in Eukaryotes:

(a) Transposons in Maize:

Different types of transposons present in maize are described below:

The Ac-Ds system of transposable elements in maize was analysed and described by Barbara McClintock. Here Ac stands for Activator and Ds for Dissociation. McClintock found that the Ds and Ac elements were sometimes mobile and moved to different chromosomal locations, resulting in different kernel phenotypes.

The Ds element is activated by Ac, and on activation it serves as a site of chromosome breakage. Ac can move autonomously, while Ds can move only in the presence of Ac (Fig. 6). Transposition involving this Ac-Ds system produces altered kernel phenotypes.

Other transposable elements of maize are:

i. Spm (Suppressor-mutator) system,

iii. Mu (Mutator) system, etc.

(b) Transposons in Drosophila:

A number of transposable elements of different types are found in Drosophila, and they account for quite a high fraction of the Drosophila genome.

Some of these transposons are given below:

P-elements were discovered during the study of ‘hybrid dysgenesis’, which is a sterility-causing condition. They are 2.9 kb long and contain 31 bp inverted terminal repeats. A high rate of P-element transposition causes hybrid dysgenesis. P-elements encode a transposase enzyme which helps in their transposition. They are also useful as vectors for introducing foreign genes into Drosophila.

Transposition of copia elements causes mutations for eye colour in Drosophila. They are approximately 5-8 kb in size, with a direct terminal repeat (DTR) of about 276 bp at each end. Within each of these direct repeats is a short inverted repeat (IR) of about 17 bp. About 10-80 copia elements are present in the cell genome (Fig. 7).

Fold-back elements are present in the Drosophila genome. They are able to fold back to form a stem-and-loop structure because of their long inverted terminal repeats. Their transposition changes gene expression, either by causing insertion mutations or by affecting normal gene expression.

Other important types of transposable elements found in Drosophila are:

(c) Transposons in Humans:

Transposons in humans are in the form of repetitive DNA which consists of sequences that are interspersed within the entire human genome. These sequences are transposable and can move to different locations within the genome.

These are of following two types:

(1) SINEs (Short Interspersed Elements):

These are about 300 bp long and may be present about 5 lakh (500,000) times in the human genome. Alu sequences are the best characterized SINEs in humans.

These are termed ‘Alu’ elements because they contain specific nucleotide sequences which are cleaved by the restriction enzyme AluI. Alu elements contain direct terminal repeats (DTRs) of 7-20 bp. These DTRs help them in the insertion process during transposition.

(2) LINEs (Long Interspersed Elements):

These are about 6400 bp long and are present about 1 lakh (100,000) times in the human genome. The most prominent example is the L1 sequence. These transposable elements are some of the most abundant and common families of moderately repeated sequences in human DNA.

Significance of Transposable Elements:

1. Transposons may change the structural and functional characteristics of genome by changing their position in the genome.

2. Transposable elements cause mutation by insertion, deletion, etc.

3. Transposons make positive contribution in evolution as they have tremendous impact on the alteration of genetic organisation of organisms.

4. They are useful as cloning vectors also, in gene cloning. For example, P-elements are frequently used as vector for introducing transgenes into Drosophila.

5. Transposons may also be used as genetic markers while mapping the genomes.

6. Transposon-mediated gene tagging is done for searching and isolation of a particular gene.

Techniques of Gene Mapping:

A gene map is the detailed schematic representation of the positions of genes or sequences of interest in a chromosome. It may also provide details of relative distances between these genes. A gene map may also be referred to as a genome map or molecular map. It attempts to describe the structural and functional organization of genome of an organism.

Gene mapping has attracted much attention during the last two decades, not only in plants but also in animals. These maps are of immense benefit in the improvement of commercially important plants and animals. Mapping the human genome is the prime goal of the Human Genome Project (HGP), and it should help in tackling genetic disorders in humans.

Gene maps also aid in the assessment of genetic diversity and in the taxonomic classification of organisms by comparison. The most common technique for developing a gene map uses molecular markers. Polymorphism provides the basis on which molecular markers function.


Genome of the organism contains a number of polymorphisms (literal meaning being many forms). These polymorphisms are actually the positions in genome where the nucleotide sequence is not the same in every member of a population. These variable sites may be used as the DNA markers (or molecular marker) for genome mapping.

The detection of polymorphism can be done by either of the following ways:

This detection can be done on account of the differences in structure and composition of the proteins encoded by the polymorphic sites of genomes.

It can be done by visual examination, without involving any specialized biochemical or molecular technique.

(iii) Molecular Detection:

This is done at the DNA level i.e., the DNA sequences having the differences can be detected. The polymorphism occurring at the molecular level i.e., in DNA sequence, can be used efficiently as the molecular/DNA marker and hence is important for preparing gene maps.

Molecular Markers:

These may also be termed as the DNA markers. They represent primary method of development of gene maps. Molecular markers, actually display the variability at DNA level. Molecular markers may be described as the DNA sequences which reveal variations at DNA level which can be easily detected and monitored in the subsequent generations.

On the basis of principles and methods employed for the development and use of the molecular markers, they may be classified as the hybridization-based markers and PCR-based markers.

A good molecular marker must be polymorphic and should be evenly distributed within the genome. A molecular marker should be easy and quick to be detected. As stated earlier also, major use of molecular marker is in the development of gene maps.

Different important DNA markers which are utilized for gene mapping are given below:

(i) RFLP (Restriction Fragment Length Polymorphism):

It is a hybridization based molecular marker. The principle of RFLP marker is the restriction digestion of pure DNA sample by restriction endonuclease enzymes. It represents the polymorphism as single base changes.

(ii) RAPD (Random Amplified Polymorphic DNA):

It is a PCR (Polymerase Chain Reaction) based technique. Basic principle involved in the functioning of RAPD is the DNA amplification using PCR. RAPD is a quicker technique.

(iii) AFLP (Amplified Fragment Length Polymorphism):

It is a combination of features of both RAPD and RFLP. The basic principle of working of AFLP is the PCR amplification of fragments of genome produced by applying restriction enzymes. AFLP is a faster and less laborious technique.

(iv) VNTR (Variable Number of Tandem Repeats):

These are also called mini-satellites. Polymorphism in VNTRs is associated with the number of repeat units at a given position in a chromosome in different individuals.

(v) STRs (Short Tandem Repeats):

These are also termed microsatellites. They show polymorphism due to variation in the number of repeats. These are ideal markers for developing a high-resolution molecular map. Molecular markers are employed for the construction of genomic maps in plants as well as in animals and microorganisms. However, many other techniques, such as in-situ hybridisation, are also utilized for mapping.

Types of Gene Maps:

On the basis of strategies followed for the preparation of gene maps, there are following types of gene maps:

Genetic Map:

It is obtained by genetic studies based on Mendelian principles, crossing over, linkage, etc. Such maps are not considered to be very accurate. They only provide information about the position of the genes concerned (i.e., the chromosome on which they are present) and give a rough idea of the relative distances between them.

Linkage Map is the term which is given especially for those genetic maps in which the relative distances between genetic markers or concerned genes is measured in terms of recombination frequencies between them. The unit of distance in a linkage map is centiMorgan (cM) or map-unit. One cM is defined as that distance which allows 1% recombination between the genes.
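The definition of the centiMorgan translates directly into arithmetic: the map distance is the percentage of offspring that are recombinant. The sketch below assumes hypothetical offspring counts from a test cross; the function name is illustrative. The simple proportionality is only a reasonable approximation for closely linked genes, because double crossovers between distant genes go undetected.

```python
def map_distance_cm(recombinant_offspring: int, total_offspring: int) -> float:
    """Map distance in centiMorgans: 1 cM corresponds to 1% recombination."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("recombinant_offspring must lie between 0 and total")
    return 100.0 * recombinant_offspring / total_offspring


# Hypothetical test cross: 120 recombinants among 1000 offspring.
distance = map_distance_cm(120, 1000)
print(f"{distance:.1f} cM")
```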

During the preparation of linkage maps, recombination frequencies between the genes are studied. The data based on such recombination frequency is processed to measure the distance between genes.

In recent years, a number of advanced software packages have been introduced which help in fast and accurate processing of recombination-frequency data and thus in constructing linkage maps of genomes. Examples of such software are LINKAGE, MAPMAKER, CRI-MAP, etc.

One main drawback of genetic map is that sometimes, the distances between genes in a genetic map do not correspond to the actual distance between them on the chromosome. Many times there may remain gaps on the genetic maps.

This happens basically because the recombination frequencies (which are used as the measure of distance between genes) obviously get affected by environmental conditions and nature and position of mutants used for study.

Recently different techniques have been devised for genetic mapping. These involve the use of DNA sequences which are not genes but display variations in a population. Such sequences are called as the molecular markers.

Most important molecular markers used for genetic mapping are:

(a) RFLP: These polymorphic markers have been used for genetic mapping in a number of crop plants as well as in mice, human beings, Drosophila, etc. RFLP genetic maps have so far been developed successfully in wheat, rice, rye, barley, tomato, potato, etc.

(b) RAPD: Genetic maps have been successfully prepared using the PCR-based technique RAPD. These markers provide a more convenient and quicker method of genetic mapping than RFLPs. However, an RAPD genetic map provides less information.

(c) Genetic maps have also been successfully constructed using mini-satellites (also called variable number of tandem repeats, VNTRs) and microsatellites (i.e., short tandem repeats, STRs).

(d) AFLPs may also be used for producing genetic maps. The AFLP technique has features of both RFLP and RAPD, and it is a faster and more advantageous technique for constructing genetic maps.

Cytogenetic Map:

They may also be called as cytological maps or chromosome maps. When genes are assigned to specific chromosome arms and their distances from the centromere are also shown, then the map is known as a cytogenetic map and the technique is called the cytogenetic mapping.

A cytogenetic map makes it possible to locate the genes concerned not only on a specific chromosome but also in specific regions of that chromosome. Cytogenetic mapping is generally best suited to organisms that have larger, microscopically observable chromosomes.

There are different techniques which may be employed for construction of cytogenetic maps. Some of them are given below:

(a) FISH (Fluorescence In-situ Hybridization):

The technique of in-situ hybridization involves the detection and location of desired sequence of DNA directly inside the cell. When in situ hybridization involves labelling with fluorescent molecules, then, it is called as fluorescent in-situ hybridization (FISH). By using FISH, the genes can be located on the chromosome within the cell in correct order and thus provide a cytogenetic map.

(b) Use of Molecular Markers:

Cytogenetic maps can also be prepared by using molecular markers like RFLP. In this approach, an RFLP locus is assigned to a specific region of a particular chromosome.

(c) Use of somatic cell Hybrids:

The somatic cell hybrids having chromosomes of known identity have been used successfully for preparation of cytogenetic maps preferably in humans. For example, the man-mouse somatic cell hybrids have resulted in mapping of genes on human chromosome to more specific regions.

Physical Map:

These maps represent the correct order of genes on chromosome and they are based on the physical distances between genes or between a gene and the centromere.

In a physical map, distance is never given in centiMorgans; instead, it is given as the number of base pairs between the genes. Physical maps are considered more accurate than genetic maps. They ultimately lead to the complete sequence of the whole genome, together with knowledge of the physical distances between genes.

Restriction mapping is a type of physical mapping because distances are given in terms of base pairs. It is a successful technique for mapping prokaryotic genomes. Preparation of restriction maps involves the use of restriction endonuclease enzymes, which cleave the DNA at specific sites. To prepare a restriction map, one or more restriction endonucleases are used to cleave DNA at different sites (Fig. 9).

As a result, DNA fragments of varying lengths are obtained. The sample containing these DNA fragments of different sizes is then subjected to gel electrophoresis for separation. Consequently, a series of bands is obtained on the gel, where the position of each band depends on the size of its fragment.

This gel with different bands is then calibrated with the help of DNA fragments of known lengths, to obtain the sites of cleavage. These sites of cleavage are then identified and mapped together to produce a complete restriction map.
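The relationship between cleavage sites and fragment lengths can be sketched as an in-silico digest of a linear DNA molecule. This is a simplified illustration, not a laboratory protocol: the function name and the toy sequence are hypothetical, and only the EcoRI recognition site (GAATTC, cut after the first G) is taken as known.

```python
from typing import List


def digest_linear(seq: str, site: str, cut_offset: int) -> List[int]:
    """Fragment lengths after cutting a linear sequence at every
    occurrence of a recognition site.

    cut_offset is the number of bases from the start of the site
    to the cut position on the strand given.
    """
    seq = seq.upper()
    site = site.upper()
    cut_positions = []
    start = seq.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = seq.find(site, start + 1)
    # Fragment boundaries are the molecule ends plus every cut position.
    boundaries = [0] + cut_positions + [len(seq)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Hypothetical 47 bp molecule carrying two EcoRI sites.
toy_dna = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 5
fragments = digest_linear(toy_dna, "GAATTC", cut_offset=1)
print(fragments)  # fragment lengths in bp, listed from left to right
```

On a gel the fragments would appear ordered by size rather than by position, which is why a real restriction map is built by comparing single and combined digests.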

Different techniques for preparing physical maps are:

(a) ISH (in situ Hybridisation) Technique:

It can be used for physical mapping and has been successfully utilized for obtaining physical maps in rye, wheat and barley. An advanced technique, FISH (fluorescence in-situ hybridisation), can also be employed for physical mapping.

(b) Physical mapping can also be performed by using chromosomal aberrations like duplication, deletion and translocation.

(c) Physical maps can also be developed by using a mapping reagent. The mapping reagent is actually a collection of DNA fragments spanning a complete chromosome or the entire genome.

(d) The technique of chromosome walking is also helpful for developing physical maps. However, this technique is not of much use in case of higher eukaryotes due to the presence of highly repeated sequences.

(e) Using YAC (Yeast Artificial Chromosome) has also proved to be of great use for physical mapping of Drosophila and human beings.

Significance of Gene Maps:

1. Gene maps play an important role in the researches related to plant as well as animal biotechnology.

2. Genetic and physical maps are important for the Human Genome Project (HGP).

3. Gene maps aid geneticists in studying phylogenetic relationships and evolutionary patterns of organisms.

4. These are helpful for characterisation of genetic resources and estimation of genetic diversity.

5. Gene maps have enormous utility in crop improvement programmes.

6. They may provide an extra aid for solving the problems related to a number of harmful genetic disorders in human beings.

Chromosome Walking:

Chromosome walking is an important aspect of cytogenetics:

It is a method for analysing long stretches of DNA. By using this technique, large regions of a chromosome, about 1000 kb in length, can be readily characterized. In contrast, conventional cloning methods, PFGE (Pulsed-Field Gel Electrophoresis), etc., can characterize only about 100 kb long segments of a chromosome.

Chromosome walking is done on the DNA fragment containing the gene of interest:

The process starts with a known gene present near the gene of interest on the DNA fragment. For chromosome walking, clones of interest are derived from the genomic library.

During chromosome walking, the end-piece of a cloned DNA fragment is sub cloned and is used as a probe to recover another overlapping clone from the genomic library. Restriction mapping of such overlapping sequences may be used to construct the original sequence of DNA stretch under study.

Probes are labelled DNA or RNA molecules which are used to identify target genes or molecules.

Cloning a fragment of an existing clone is called sub cloning.

Steps in Chromosome Walking:

i. First clone of interest is selected from the genomic library after identifying by a probe.

ii. A small fragment from one end of this clone is sub cloned.

iii. This sub cloned fragment is now used as a probe and is hybridized with other clone from the genomic library.

iv. Now, the second clone hybridized with the sub clone of the first clone is identified due to the presence of overlapping region.

v. End piece of the second clone is then sub cloned and used for hybridization with another clone from library.

vi. Again, the third clone hybridized with the sub clone of second clone is also identified due to the presence of overlapping region.

vii. This process of sub cloning and probing the genomic library is repeated to recover overlapping clones until the gene of interest is reached.

viii. Restriction maps of the overlapping clones may be constructed so as to get the entire sequence of original DNA stretch (Fig. 10).
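The steps above can be sketched as a small simulation in which "hybridization" is modelled as a simple substring match. Everything here is a hypothetical toy: the genome, the clone boundaries, the 4-base probe length and the function name are assumptions made for illustration, and real probes are far longer.

```python
from typing import List


def chromosome_walk(library: List[str], start_probe: str, target: str,
                    probe_len: int = 4, max_steps: int = 20) -> List[str]:
    """Return the clones visited, in order, until one contains the target."""
    path: List[str] = []
    probe = start_probe
    for _ in range(max_steps):
        # "Hybridize" the probe to the library: unvisited clones containing it.
        hits = [c for c in library if probe in c and c not in path]
        if not hits:
            break
        # Prefer the clone that extends furthest beyond the probe.
        clone = max(hits, key=lambda c: len(c) - c.index(probe))
        path.append(clone)
        if target in clone:
            return path
        # Sub clone the end piece of this clone to use as the next probe.
        probe = clone[-probe_len:]
    raise LookupError("walk ended before reaching the target")


# Hypothetical 30-base "genome" and three overlapping clones from it.
genome = "ACGTTGCAAGGCTTACCGATAGGTCAATGC"
clone_a, clone_b, clone_c = genome[0:12], genome[8:20], genome[16:30]
library = [clone_c, clone_a, clone_b]  # library order is arbitrary

path = chromosome_walk(library, start_probe="ACGT", target="TCAA")
print(path)
```

The walk starts from the known marker, moves through each overlapping clone in turn, and stops at the clone carrying the sequence of interest.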

Applications of Chromosome Walking:

1. This technique is used successfully for the characterization of large regions of chromosomes.

2. Chromosome walking is applied for the identification of specific genes.

3. It can be used for the isolation of specific genes also.

4. Chromosome walking is a technique which is frequently used for the preparation of genome maps specially the physical maps.

5. It is of great importance for the identification of genetic disorders in humans, such as cystic fibrosis, muscular dystrophy, etc.

Limitations of Chromosome Walking:

1. This technique is time-consuming.

2. It is a laborious technique.

3. It requires the construction of a genomic library.

4. It is usually difficult to perform chromosome walking in complex eukaryotic genomes like those of human beings because they carry highly repetitive sequences.

3. Intragenic suppression by compensatory second site mutation

In principle, missense mutations that result in a non-functional protein can sometimes be suppressed by introducing another change elsewhere in the protein. Relatively few cases of this effect have been studied so far in C. elegans. An example is provided by the reversion of temperature-sensitive missense mutations in glp-1, which encodes a cell interaction protein; this experiment yielded a variety of second-site missense changes that are able to correct the original defect (Lissemore et al., 1993).


Sequencing of Independent Mutants

Using growth on glucose medium as a selection, 24 mutants with the desired phenotype were produced. The genomic DNA was pooled into 8 libraries, each consisting of exactly three strains. These libraries were tagged, combined, and sequenced in a single lane of a high-throughput Illumina Hi-Seq sequencer. The resulting fragments were filtered, aligned to the reference E. coli K-12 substr. MG1655 genome sequence, and scanned for sequence variants. Sequencing produced 145 million reads of 100 base pairs each for a total of 14.5 Gb of genomic sequence, of which approximately 118 million reads successfully demultiplexed (had an identifiable tag) and aligned to the reference genome. From the pooled libraries, we identified 2157 SNPs (1450 nonsynonymous, 707 synonymous) after filtering for quality and strand bias (see Methods), yielding approximately 100 mutations per strain. These SNPs showed a strong preference for GC sites, in line with the mutagenesis spectrum of NTG [38]. SNPs were detected in 1348 genes; 1012 genes had one or more nonsynonymous mutations.
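The summary figures in this paragraph can be cross-checked with simple arithmetic. The short script below uses only the numbers quoted in the text; the variable names are illustrative.

```python
reads_total = 145_000_000
read_length_bp = 100
reads_aligned = 118_000_000
strains = 24
snps_nonsynonymous = 1450
snps_synonymous = 707

total_bases_gb = reads_total * read_length_bp / 1e9
aligned_fraction = reads_aligned / reads_total
snps_total = snps_nonsynonymous + snps_synonymous
snps_per_strain = snps_total / strains

print(f"Total sequence: {total_bases_gb:.1f} Gb")
print(f"Reads demultiplexed and aligned: {aligned_fraction:.1%}")
print(f"Total SNPs: {snps_total}")
print(f"SNPs per strain: {snps_per_strain:.0f}")
```

The per-strain average comes out at roughly 90, which is the same order as the approximately 100 mutations per strain quoted in the text.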

Pathway-Phenoseq Analysis

We developed a method for scoring individual pathways, based on the number of non-synonymous mutations occurring in genes in each pathway (see Methods for details). As a comprehensive set of E. coli pathway annotations, we used the EcoCyc Functionally Associated Groups database, totalling 536 groups [24], of which 336 were hit by non-synonymous mutations in our sequencing dataset. (For simplicity, we will refer to these EcoCyc Functionally Associated Groups as "pathways".) We applied our scoring method (which we will refer to throughout as "pathway-phenoseq") to all 336 pathways, and ranked them by their p-values (Table 1). After the Bonferroni multiple hypothesis correction, the top five pathways (containing 12 genes total) were statistically significant.
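The multiple-testing step can be sketched as follows. This is not the authors' scoring method, whose details are in their Methods: the Poisson null model, the pathway length, the coding-genome length and the observed count below are all assumptions made for illustration. Only the number of tests (336), the number of nonsynonymous SNPs (1450) and the use of a Bonferroni correction come from the text.

```python
import math


def poisson_upper_tail(k: int, lam: float, n_terms: int = 500) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed directly in log space."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    log_lam = math.log(lam)
    tail = sum(
        math.exp(-lam + i * log_lam - math.lgamma(i + 1))
        for i in range(k, k + n_terms)
    )
    return min(1.0, tail)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


# Hypothetical pathway: 2500 bp of coding sequence in an assumed 4.0 Mb
# coding genome, with 8 observed nonsynonymous mutations out of 1450.
expected = 1450 * 2500 / 4.0e6
p_value = poisson_upper_tail(8, expected)
threshold = bonferroni_threshold(0.05, 336)

print(f"expected mutations under the null: {expected:.3f}")
print(f"p-value: {p_value:.3g}, Bonferroni threshold: {threshold:.3g}")
print("significant" if p_value < threshold else "not significant")
```

Summing the tail directly avoids the loss of precision that comes from computing one minus the cumulative probability when p-values are very small.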

Table 1

Group          Genes                                                     p-value (phenoseq)
PD04099        aceK iclR
CPLX0-2101     malE malF malG malK lamB
ABC-16-CPLX    malF malE malG malK
PD00237        malS malT
CPLX-155       chbA chbB chbC ptsH ptsI
PWY0-321       paaZ paaA paaB paaC paaD paaE paaF paaG paaH paaJ paaK
RNAP54-CPLX    rpoA rpoB rpoC rpoN
APORNAP-CPLX   rpoA rpoB rpoC rpoD

Gene-phenoseq identified three of these genes (iclR and aceK in pathway PD04099; malT in pathway PD00237) as statistically significant (Table 2). Thus pathway-phenoseq detected more than twice as many causal pathways for this phenotype, and four times as many genes, as the gene-phenoseq scoring. Even for pathways detected by both, gene-phenoseq had a much weaker p-value than pathway-phenoseq; this is expected to be true generally whenever signal is spread over multiple genes in a pathway.

Table 2


Assessment vs. Experimental Literature

We assessed these results against pathways shown to be involved in this specific phenotype in previous experimental literature. Fong et al. performed natural evolution experiments on a knockout strain, selecting for the same phenotype (recovery of the ability to grow on glucose as a carbon source) [24]. After 45 days of growth and selection, they obtained two mutant strains whose growth rates and glucose consumption rates were very similar to those of the wild-type strain and more than double those of the strain on day 0. While Fong et al. did not identify the specific causal mutations, they found that metabolic flux through the glyoxylate shunt (aceA and aceB) increased, and also that the expression level of these two genes increased. Two other studies found increased flux in the glyoxylate shunt in mutants [39], including a rescue of such a mutant by overexpression of the shunt [40].

These data validate our top pathway hit (PD04099, aceK and iclR), which regulates the glyoxylate shunt [41] [42], the pathway reported by Fong et al. to be specifically up-regulated in association with this phenotype. In addition, a separate set of mutagenesis studies has shown that mutations in iclR do indeed increase flux through the glyoxylate shunt [43].

Literature assessment of the top pathways highlights two distinct mechanisms for our growth phenotype (Fig. 2). On the one hand, the glyoxylate shunt provides an alternative source for the cell to make OAA (via the glyoxylate cycle, which produces two OAA molecules for every OAA molecule it consumes). The seventh top hit (PWY0-321) represents the phenylacetate degradation pathway, which produces succinyl-CoA from phenylacetate [44]. This matches the validated glyoxylate shunt mechanism for our growth phenotype; that is, it provides an alternative source for OAA synthesis, from phenylacetate to succinyl-CoA to OAA. On the other hand, the literature indicates that our other pathways can instead increase PEP levels sufficiently to induce its conversion to OAA via the PEPCK reverse reaction. For example, the second, third and fourth top hits (CPLX0-2101, ABC-16-CPLX, and PD00237) are all components of the maltose transport pathway, and the sixth pathway (PTS) is a separate transport pathway that consumes PEP to drive transport of glucose. The maltose transporter actively transports glucose into the cell using ATP as energy, whereas other glucose transporters such as PTS consume PEP [45]. So increased maltose transporter activity and decreased PTS activity would both increase PEP levels, and favor its reverse reaction via PEPCK to produce OAA. This has been demonstrated by several experimental studies: a laboratory evolution experiment selecting for increased growth and succinate production identified mutations in PTS that increased flux through PEPCK in the reverse direction [46]. In accordance with Le Chatelier's principle, increasing the level of cellular PEP leads to higher reverse PEPCK activity. Zhang et al. also showed that increased expression of the galactose permease (galP), in combination with deactivation of the PTS system, increased the PEP efficiency of glucose transport and succinate production [27]. Indeed, a number of studies have reported that mutations in the PEP-dependent phosphotransferase system (PTS) lead to increased flux from PEP to OAA (and on to succinate) [47] [48] [49]. The fifth top hit (GLYCOGENSYNTH-PWY) is not a sugar transport pathway, but instead the glycogen synthesis pathway. While it does not directly consume PEP, it consumes glucose-6P (G6P), a metabolic precursor of PEP. Loss-of-function (or reduced-function) mutations in this pathway would boost G6P and hence PEP levels, as well as glucose and ATP levels, which both decrease the consumption of PEP for glucose transport. It should be emphasized that these previous experimental studies do not prove that mutations in pathways 2–7 can cause our specific phenotype (growth of knockout strain on glucose), as they did not test this specific phenotype.

Bioinformatic Tests

As an additional test of the entire set of top scoring pathways, we computed a p-value for evidence of positive selection (Ka/Ks > 1) within this set ( Table 3 ). Whereas the phenoseq scoring is based on the total number of mutations in a region, the Ka/Ks is based on the ratio of non-synonymous vs. synonymous mutations (note that the latter are not considered by the phenoseq scoring function). The Ka/Ks ratio for the total dataset of 2157 SNPs was 1.0026, consistent with neutral selection, as expected from random mutagenesis. We therefore computed a p-value for the null hypothesis that mutations in the top pathways are drawn from the same background distribution as the total set of mutations (i.e. neutral) using the Fisher Exact Test (see Methods for details).

Table 3

Pathway    cumulative p-value    excluding iclR, aceK, malT

The top 10 pathway-phenoseq pathways contained a total of 103 non-synonymous mutations vs. only 21 synonymous mutations, which is strong evidence of positive selection. This evidence remains even when the genes detected by gene-phenoseq (iclR, aceK, malT) are left out. Furthermore, this evidence of positive selection extends throughout the top ten pathways. For example, if one leaves out pathways 6 through 10, the p-value becomes weaker (0.056 when iclR, aceK and malT are also left out). Indeed the p-value becomes stronger (smaller) with each additional pathway added to the analysis, indicating that each pathway shows evidence of positive selection. Note that at the level of single-gene analysis, only one gene (iclR, with 19 non-synonymous mutations and 1 synonymous mutation) could be detected as showing statistically significant evidence of positive selection; other genes simply did not have enough total mutation counts to attain significance. Only two gene groups (each combined into a single meta-gene), PD04099 containing iclR and aceK, and TRNA-CHARGING-PWY, have a Ka/Ks value greater than one with a p-value less than 0.1 from Fisher's exact test. The latter pathway is not obviously connected to the phenotype and is composed of 23 genes involved in many cellular functions.
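The comparison of non-synonymous and synonymous counts against the background can be sketched with a one-sided Fisher exact test written from the hypergeometric distribution. This is an illustration under stated assumptions rather than the authors' procedure: the 2 × 2 table compares mutations inside the top pathways (103 and 21) with the rest of the dataset (1450 − 103 and 707 − 21), which is one plausible reading of "drawn from the same background distribution", and the function name is hypothetical.

```python
from math import comb


def fisher_one_sided(nonsyn_in, syn_in, nonsyn_out, syn_out):
    """One-sided Fisher exact p-value for an excess of non-synonymous
    mutations inside the group, given fixed table margins."""
    n_in = nonsyn_in + syn_in                # mutations inside the group
    total_nonsyn = nonsyn_in + nonsyn_out    # all non-synonymous mutations
    total = n_in + nonsyn_out + syn_out      # all mutations
    total_syn = total - total_nonsyn
    upper = min(n_in, total_nonsyn)
    # Sum hypergeometric probabilities for k >= observed non-synonymous count.
    tail = sum(
        comb(total_nonsyn, k) * comb(total_syn, n_in - k)
        for k in range(nonsyn_in, upper + 1)
    )
    return tail / comb(total, n_in)


# Counts from the text; the in-group vs. rest-of-dataset split is an assumption.
p_top_pathways = fisher_one_sided(103, 21, 1450 - 103, 707 - 21)
print(f"one-sided p-value: {p_top_pathways:.2e}")
```

Exact integer arithmetic is used for the binomial coefficients so that the tail sum does not lose precision before the final division.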

It is interesting to ask what fraction of the genes in these pathways show evidence of causing the phenotype. It is evident (e.g. from the known experimentally validated genes) that real causal genes are present far below the 0.05 significance threshold of gene-phenoseq scoring (also found to be the case in a previous phenotype sequencing experiment [6]). To assess this, we took the top 50 gene-phenoseq genes, and asked what pathways were strongly enriched (Table 4). Given a top list of genes, one can assess whether they cluster within specific subgroups of a standard functional annotation using the hypergeometric p-value test [49]. This analysis identified statistically significant clustering within three EcoCyc pathways. Furthermore, six of the top ten pathways matched the top 10 pathway-phenoseq pathways. These data indicate that at least 9 of the genes in these pathways contribute causally to the phenotype (since they were individually detected among the top 50 gene-phenoseq hits). Only 28 pathways intersected the top 50 list.

Table 4

Group    Genes    Genes in top 20    p-value (hypergeometric)
ABC-16-CPLX malF malE malG malK
PD04099 aceK iclR
CPLX0-2101 malE malF malG malK lamB
CPLX-63 torY torZ
PD00237 malS malT
ABC-42-CPLX alsA alsB alsC
SECE-G-Y-CPLX secE secG secY
CPLX0-221 rpoA rpoB rpoC fecI

Causal Mutations Analysis

Finally, we sought to estimate the number of mutations in each group that actually help cause the phenotype (''causal mutations''). In principle, one can estimate this from the observed bias towards non-synonymous mutations (compared with that expected under neutral selection as observed in the total dataset). Specifically, we assume that all causal mutations must be non-synonymous, whereas non-causal mutations are drawn from the background mixture of synonymous + non-synonymous mutations (i.e. neutral selection). We can then estimate the fraction of mutations in each pathway that are causal, since the observed fraction of non-synonymous mutations in a pathway will reflect the mix of causal vs. non-causal mutations:

\[ \theta = f + (1 - f)\,\theta_0 \]

where \(\theta\) is the fraction of non-synonymous mutations observed in the pathway, \(f\) is the fraction of mutations in the pathway that are causal, and \(\theta_0\) is the fraction of non-synonymous mutations observed in the entire dataset (which almost exactly matches that expected for neutral selection). Then

\[ f = \frac{\theta - \theta_0}{1 - \theta_0} \]

We then estimated the number of causal mutations in a pathway as \(fN\), where \(f\) is the estimated causal fraction and \(N\) is the total number of mutations observed in the pathway (Table 5). It is striking, for example, that the estimated number of causal mutations in the top pathway (iclR + aceK) precisely equals the number of independent mutant strains sequenced (24). This suggests that each strain with this phenotype was mutated once in this pathway, and though there were at least three mutations in this pathway in each pool, we are unable to directly verify a mutation in every strain because of the pooling of mutant strains for sequencing. Nevertheless, given the low number of mutations per strain, it is statistically unlikely that any particular gene was mutated more than once per strain. The number of causal mutations estimated in the remaining pathways ranged from 4 to 9, suggesting that at least one additional mutation in these other pathways was present in each strain. For each pool of three strains, at least three nonsynonymous mutations were observed in the (iclR + aceK) pathway, so our data are consistent with the hypothesis that there must be a mutation in this pathway to achieve the phenotype.
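The mixture argument above (causal mutations are all non-synonymous, non-causal mutations follow the background ratio) can be turned into a short calculation. The sketch below solves observed = f + (1 − f) × background for f. It is illustrative only: the function name is hypothetical, clamping negative estimates to zero is an added assumption, and the example uses the single-gene counts for iclR quoted earlier (19 non-synonymous, 1 synonymous) rather than the pathway totals, which are not listed here.

```python
def causal_fraction(nonsyn, syn, background_nonsyn_fraction):
    """Estimate the fraction of mutations that are causal from the excess
    of non-synonymous mutations over the background fraction."""
    observed = nonsyn / (nonsyn + syn)
    f = (observed - background_nonsyn_fraction) / (1.0 - background_nonsyn_fraction)
    return max(0.0, f)  # assumption: treat a deficit as zero causal mutations


# Background fraction from the whole dataset: 1450 of 2157 SNPs are non-synonymous.
THETA_0 = 1450 / 2157

# Example with the iclR counts given in the text.
f_iclr = causal_fraction(19, 1, THETA_0)
n_causal_iclr = f_iclr * (19 + 1)
print(f"causal fraction: {f_iclr:.3f}, causal mutations: {n_causal_iclr:.1f}")
```

For iclR alone this gives roughly 17 causal mutations out of 20; adding the aceK counts would be needed to reproduce the pathway-level estimate discussed in the text.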

Table 5

Group    Synonymous Mutations    Non-synonymous Mutations    Causal Mutations


R35P and R168C topoisomerase I mutants show biased mutation spectra, enriched in sequence deletions and insertions

To explore the potential effect of topoisomerase I on the spectrum of mutational events, we genetically introduced the mutated topA alleles into an E. coli BW25113 genetic background (Materials and Methods). Whole-genome sequencing verified the replacement of the native topA gene with sequences encoding either the R35P or the R168C mutated enzyme variant. In addition to the mutations introduced through allelic replacement, in the R35P strain we identified four mutations in genes unrelated to DNA processing, which were most probably acquired during early handling of the strain (Supplementary Table S2). In addition, we observed that mutations in the gyrA or gyrB genes, with a positive fitness effect, often emerged in the R35P strain during continuous strain propagation. Compensatory mutations in gyrase genes can reduce the accumulation of DNA torsional stress, and were reported to be essential for the viability of topA null mutants (21).

Next, we quantified the spectrum of mutational events (i.e. point mutations, insertions and deletions) in the topA mutant strains and in a topA+ strain using two independent drug resistance assays (Materials and Methods). As selection agents, we used 2-deoxy-D-galactose (DOG), an inhibitor of galactokinase (encoded by galK), and azidothymidine (AZT), an inhibitor of thymidine/deoxyuridine kinase (encoded by tdk). Both drugs inhibit the growth of E. coli on minimal media via interactions with known molecular targets (22, 23). Briefly, we plated cells that were cultured in drug-free media on solid minimal media containing the toxic drug (either DOG or AZT) and a suitable carbon source (Materials and Methods). Resistant mutants that gave rise to colonies were isolated, and the resistance-conferring mutation in each colony was determined by sequencing the known drug target gene. To ensure the analysis of independent mutational events, only a single colony was sampled from every assay repeat.

As shown in Figure 1, the sequencing of galK, which is the molecular target of DOG, revealed that resistant colonies arising from topA mutants exhibit a significantly altered mutation spectrum in comparison to the topA+ strain (n = 24 for each strain, Supplementary Table S3). In colonies arising from the R168C mutant strain, we found that 50% of the resistance-conferring mutations in galK were sequence insertions, composed entirely of de novo tandem duplications with lengths of 2–19 bp. In the R35P mutant strain, we found that all of the resistance-conferring mutations in galK were due to short sequence deletions, with lengths of 1–7 bp. These results were in marked contrast with the point-mutation-dominated spectrum (≈66%) that was observed in resistant colonies arising from the topA+ strain. When we repeated the experiment using AZT as a selection drug, the results were in agreement with the values reported above (Supplementary Figure S1 and Table S3). In addition to the observed differences in the mutation spectra of topA mutants in comparison to the topA+ strain, we noted that both R35P and R168C mutants exhibited a strong localization pattern of mutations with well-defined hotspots (Figure 1B). This contrasts with the uniform spatial distribution of mutations observed in the topA+ strain, where mutations occurred across the galK coding sequence without noticeable hotspots. Similarly, a mutational hotspot was also observed in the topA mutants when AZT was used as a selection drug, where ≈50% of the resistance-conferring mutations were localized in a 20 bp region (Supplementary Figure S1). Comprehensive sequence data for the mutations identified in this experiment can be found in Supplementary Table S3. We computationally compared (using MEME suite 5.0.5) the hotspot mutation regions in the galK and tdk genes to a set of previously characterized topoisomerase I preferred cleavage motifs (24), but did not find that these regions match any of the known sequences.

Drug resistance assay reveals an enrichment of insertion and deletion mutations in topA mutants. (A) Mutation spectrum analysis of DOG resistant colonies arising in R168C and R35P mutant strains or from the topA+ strain. To determine the resistance conferring mutations, we used Sanger sequencing of PCR amplicons of the galK locus. Overall, we sampled 24 independent resistant colonies for each genetic background. Notably, only a single colony was sampled from every assay to ensure the analysis of independent mutational events. (B) Resistance conferring mutations arising in the topA mutants were mostly localized in two mutation hotspots in galK. (C) In contrast to the topA+ strain, where the majority of resistance conferring was due to point mutations, topA mutants are highly enriched in deletions (R35P) and tandem sequence duplications (R168C).

Estimation of genomic mutation rates in R168C and R35P topoisomerase I mutants using mutation accumulation lines

To quantify the genome-wide mutation rate of topA mutants, we conducted a mutation accumulation assay in which isogenic replicates were passaged through a single-colony bottleneck for 1200 or 600 generations. We used whole-genome sequencing to identify the mutations that accumulated in each line and calculated the rates of deletion, insertion, and point mutations for the different strains ( Supplementary Table S4 ). As shown in Figure 2, our analysis revealed that the rate of short sequence deletion events in the R35P mutant is significantly higher in comparison to the topA + strain (P-value < 0.001, t-test) with an ≈100-fold change in the mean rate of short deletions.
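The rate calculation behind a mutation accumulation assay can be sketched as the number of mutations found in each line divided by the number of generations it was propagated, averaged over lines. The example below is a minimal illustration with hypothetical per-line counts. Only the number of lines (11) and the generation counts (600 or 1200) are taken from the text, and the function names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def per_line_rates(mutation_counts, generations):
    """Mutations per genome per generation for each accumulation line."""
    return [count / generations for count in mutation_counts]


def rate_with_sem(mutation_counts, generations):
    """Mean rate across lines and its standard error of the mean."""
    rates = per_line_rates(mutation_counts, generations)
    sem = stdev(rates) / sqrt(len(rates)) if len(rates) > 1 else 0.0
    return mean(rates), sem


# Hypothetical counts of one mutation class in 11 lines, for illustration only.
hypothetical_counts = [6, 5, 7, 6, 4, 8, 6, 6, 5, 7, 6]
rate, sem = rate_with_sem(hypothetical_counts, generations=600)
print(f"rate: {rate:.4f} +/- {sem:.4f} per genome per generation")
```

Computing the rate per line first, rather than pooling all mutations, keeps the line-to-line spread available for the error bars.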

Rate and molecular spectrum of mutations in topA and wild-type strains as estimated by mutation accumulation lines. (A–C) Each circle represents the genome of a single mutation accumulation line (n=11). Mutations that were identified by whole-genome sequencing are positioned on the circle according to their chromosomal coordinates (clockwise). Dashed circles indicate genomes in which no mutation was detected. Detailed lists of mutations can be found in Supplementary Table S4 . (D) Average mutation rate per genome replication calculated for each mutation type. Error bars represent SEM. (E) Histogram of short deletion lengths that were identified in R35P mutant strains. (F) Mutations identified in isogenic lines (n = 9) of the R35P strain that contains a compensatory mutation in gyrA. Compensatory mutations in gyrase genes in topA mutants were reported to decrease the accumulation of DNA torsional stress and are essential for the viability of topA null mutants.

Since no sequence duplication events were observed in the topA+ lines during our experiment, we cannot directly determine the change in frequency of tandem duplication events. Previously published estimates of the sequence insertion rate in wild-type E. coli found it to be ≈1 × 10⁻⁴ mutations per genome per generation (25), a rate which is 5-fold lower than the rate we observed in the R168C mutant. However, this estimate of the insertion rate in the wild-type strain is dominated by the rate of short sequence insertions (1–3 bp), occurring mainly in homopolymer tracts. As longer insertions, particularly those outside the context of existing sequence repeats, are significantly rarer, we note that the increase in the rate of de novo duplication mutations in the R168C mutant is likely to be significantly higher. In contrast to the marked changes in the frequencies of deletion and duplication events, we found no significant difference in the rate of point mutations between the mutant and the topA+ strains. The point mutation rates measured for all of the strains in our analysis were in agreement with previously reported values of ≈1–2 × 10⁻³ mutations per genome per generation (25).

In addition, we used mutation accumulation lines to measure the effect of spontaneously arising compensatory mutations on the mutation rate in topA mutants. We isolated a clone in which a spontaneous mutation in gyrA had emerged in the R35P strain background. Based on our mutation accumulation analysis (n = 9), we found that the rate of short sequence deletions in this double mutant remained significantly higher than in the topA+ strain (P-value < 0.001, t-test). However, the mutator phenotype was attenuated, with an ≈2-fold decrease in the deletion rate in comparison to the R35P strain that lacks the compensatory mutation. Taken together, our results show that the R35P and R168C substitutions in topoisomerase I give rise to mutator phenotypes with distinct mutational biases.

R35P and R168C substitutions affect highly conserved residues in DNA topoisomerase I, lead to slower doubling time, and inhibit DNA relaxation activity

Sequence alignment of 1750 bacterial topoisomerase I homologs shows that both R35P and R168C substitution mutations affect conserved residues in the protein. Our analysis finds that R168 is conserved in all 1750 sequences while R35 is conserved in over 1350 sequences (75%) (Figure 3A and Supplementary item 1 ). Based on the structural analysis of topoisomerase I, Zhang et al. found that R168 participates in a network of ionic and hydrogen bond interactions which hold the DNA substrate in proper conformation for cleavage or re-ligation ( 26). The same study reports that an activity assay of a mutant enzyme in which these interactions have been perturbed (R168A) shows a significant (>80-fold) loss of relaxation activity when compared to the wild type enzyme.

R168C and R35P mutations affect conserved residues and decrease supercoiling relaxation activity. (A) Multiple sequence alignment of bacterial topoisomerases indicating that the mutated residues (labeled in pink), R35 and R168, are highly conserved (see also Supplementary item 1 for a broader phylogenetic analysis). (B) Doubling times of the topA+ strain in comparison to topA mutants in LB media. Error bars represent the standard deviation of biological repeats (n = 6). *P < 0.01, **P < 0.001. (C) In vitro plasmid relaxation assay shows a significant reduction in supercoiling relaxation activity in R168C and R35P in comparison to the native enzyme.

To determine the fitness effect of these mutations, we compared the growth rate of topA mutants to the native strain in LB and in glucose supplemented minimal media (Figure 3B and Supplementary Figure S2 ). We found that in all cases, mutations in topA resulted in fitness cost (P-value <0.01 for all mutant strains, t-test). The effect was most prominent in the R35P mutant, where average doubling time was ≈50% longer in comparison to the topA + strain. The compensating mutation in gyrA in the R35P background reduced the effect but did not eliminate the fitness cost. The smallest fitness cost was observed in the R168C mutant strain with ≈5% increase in doubling time in LB media. Similar fitness costs were observed when the strains were cultured in glucose-limited minimal media ( Supplementary Figure S2 ), indicating that the fitness cost is independent of nutrient availability and of the maximal growth rate.

Next, we tested the effect of the R35P and R168C mutations on the catalytic activity of the enzyme using an in vitro DNA relaxation assay of heterologously expressed topoisomerase I mutant enzymes (Figure 3C). For both of the mutant variants we observed a decrease in plasmid relaxation activity in comparison to the native enzyme. In the R168C mutant, we found that the minimal amount of enzyme required for complete plasmid relaxation was 4-fold higher in comparison to the wild-type enzyme, while in the R35P mutant protein we were not able to detect any plasmid relaxation activity.

The biased mutation spectra in topA mutants impact the emergence of antibiotic resistance

Previous studies have shown that the mutational spectra in mismatch repair E. coli mutator strains affect the distributions of beneficial mutations in an antibiotic resistance model system (11). As the mutational spectrum defines accessible beneficial mutations, we sought to experimentally study the effect of the biased mutation spectra observed in topA mutants on the emergence of antibacterial resistance. We performed fluctuation assays to compare the rate of spontaneously arising resistance to D-cycloserine (DCS), a broad-spectrum antibiotic, in the R35P mutant strain and in an MMR-deficient mutator strain (mutS). Previous studies demonstrated that mutS mutators exhibit an over 100-fold increase in the frequency of point mutations (27). This increase is comparable to the increase in the rate of deletion mutations observed in the R35P strain, although the two strains differ in their mutation spectra (point mutations versus short sequence deletions).

Under our experimental conditions, we find that DCS resistant colonies emerged in ≈50% of R35P topA mutant cultures, but only in ≈3% of the mutS cultures (Figure 4A, n = 96). Whole-genome sequencing of independently arising resistant colonies from R35P strain cultures (n = 5) revealed that all of the analyzed clones acquired mutations in ispB, an essential gene in the biosynthesis of isoprenoid quinones. Specifically, we observed the repeated occurrence of complex mutations, combining base insertions and deletions, in a localized hotspot in the ispB gene. Sanger sequencing of the ispB gene verified that in all five occurrences, these mutations impacted 2–3 amino acids within the coding sequence (Figure 4B and Supplementary Table S5 ). In contrast, resistant colonies arising from cultures of the mutS strain (n = 5) were dominated by point mutations with no single locus that was mutated in more than two clones ( Supplementary Table S5 ). The marked differences in DCS resistance rates and genotypes between the mutS and topA mutators strengthen previous observations regarding the key role of the mutational spectra on the emergence of antibiotic resistant phenotypes.
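One common way to read such fluctuation assay results is the p0 method of Luria and Delbrück, in which the expected number of resistance-conferring mutational events per culture is m = −ln(p0), with p0 the fraction of cultures that produced no resistant colonies. The sketch below applies it to the approximate fractions quoted above. The text does not state which estimator the authors used, so this is an assumption for illustration, and the function name is hypothetical.

```python
import math


def p0_mutations_per_culture(fraction_with_mutants):
    """p0 method: expected mutational events per culture, m = -ln(p0),
    where p0 is the fraction of cultures with no resistant colonies."""
    p0 = 1.0 - fraction_with_mutants
    if not 0.0 < p0 <= 1.0:
        raise ValueError("p0 must be in (0, 1] for the estimate to be defined")
    return -math.log(p0)


# Approximate fractions from the text: ~50% (R35P) and ~3% (mutS) of cultures.
m_r35p = p0_mutations_per_culture(0.50)
m_muts = p0_mutations_per_culture(0.03)
fold_difference = m_r35p / m_muts
print(f"R35P: {m_r35p:.3f}, mutS: {m_muts:.4f}, ratio: {fold_difference:.1f}")
```

Converting m into a per-cell mutation rate would also require the number of cells per culture, which is not given here, so only the ratio between strains is shown.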

Cycloserine resistance-conferring mutations in ispB observed in a topA mutant (R35P) but not in an MMR mutator (mutS). (A) Drug resistance fluctuation assay of R35P topA and mutS (topA+) mutants. Error bars represent the standard deviation of three experimental repeats, with 96 cultures per experiment. (B) Whole-genome sequencing of cycloserine-resistant colonies arising in the R35P topA mutant revealed complex mutations in ispB (n = 5). In all five resistant clones, at least two residues in the protein sequence were affected by a combination of sequence deletions (red) and insertions (green).

2. When the complementation test is not so simple

There are two ways the complementation test can mislead you. First, alleles of the same gene can sometimes complement each other, termed "intragenic complementation". Second, mutations of two separate genes can sometimes not complement one another, termed "non-allelic non-complementation". In both instances, complex allelic interactions are observed that can lead to problems knowing when and how to assign the mutant alleles to complementation groups. Although these allelic conundrums are found in many organisms, this discussion is focused on examples found in C. elegans .
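The naive bookkeeping behind complementation groups can be sketched as a grouping of alleles connected by failures to complement. The example below is a minimal illustration with entirely hypothetical allele names. It shows how intragenic complementation between two alleles can still be resolved through a third allele that fails to complement both, while an allele that complements everything is left on its own.

```python
def complementation_groups(alleles, non_complementing_pairs):
    """Group alleles that are linked, directly or indirectly, by failure
    to complement (connected components via union-find)."""
    parent = {allele: allele for allele in alleles}

    def find(allele):
        # Follow parent links to the group representative, compressing paths.
        while parent[allele] != allele:
            parent[allele] = parent[parent[allele]]
            allele = parent[allele]
        return allele

    for a, b in non_complementing_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for allele in alleles:
        groups.setdefault(find(allele), []).append(allele)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical alleles: x1 and x3 complement each other (intragenic
# complementation), but both fail to complement x2; y1 complements all.
alleles = ["x1", "x2", "x3", "y1"]
failures = [("x1", "x2"), ("x2", "x3")]
groups = complementation_groups(alleles, failures)
print(groups)
```

Non-allelic non-complementation would mislead this grouping in the opposite direction, by linking alleles of different genes into one group.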

2.1. Intragenic complementation

During intragenic complementation, alleles of the same gene complement one another, even though both alleles produce a faulty gene product. There are different means by which mutant alleles of the same gene can mutually correct one another. First, one mutant gene product may reduce the dosage of the other mutant product. Second, a faulty complex formed by one mutant gene product may be stabilized by the presence of an alternatively mutant gene product. Third, a gene product containing a mutation that affects one function may provide the function missing from an alternatively altered gene product.

2.1.1. Reducing dosage of a mutant product

Alleles of genes required for cuticle formation exhibit intragenic complementation (De Melo et al., 2002; Kramer et al., 1988; Kusch and Edgar, 1986). In the case of the cuticle collagen sqt-1, the recessive missense mutation, sc101, is alleviated in trans by a null mutation, sc103 (Table 2 and Table 3; Kramer and Johnson, 1993). Although both sc101 and sc103 homozygotes have tail defects, and sc101/sc101 is a long weak left roller, the trans-heterozygotes are wild-type (Kusch and Edgar, 1986; Kramer and Johnson, 1993). This result suggests that the presence of the sc101 product has a negative effect on normal collagen associations, since decreasing the amount of the mutant product (e.g. replacing one allele with an allele that produces no product) restores a wild-type phenotype. The poisoning effect of the Gly-X-Y mutations seems to be especially detrimental to collagen proteins and cuticle formation, as they are often found in alleles that exhibit non-allelic non-complementation with other collagen genes (see Non-allelic non-complementation).

Table 1. Phenotypes of lin-3 homozygous and heteroallelic strains

lin-3 allele   n378             e1417            n1058                                n1059
n378           97% Vul (n=266)  88% Vul (n=226)  78% Vul (n=364)                      100% Vul (n=665)
e1417                           89% Vul (n=351)  59% Vul (n=414)                      99.8% Vul (n=471)
n1058                                            Sterile; occasional arrested larvae  Arrested larvae
n1059                                                                                 Arrested larvae

lin-3 alleles exhibit intragenic complementation. e1417/n1058 and n378/n1058 trans-heterozygotes are viable and are less likely to be Vul than e1417 or n378 homozygotes. Abbreviations: Vul, vulvaless. Reprinted with permission from Ferguson and Horvitz, 1985. Copyright ©1985 the Genetics Society of America.

Table 2. Mutated domains of the non-allelic, non-complementing SQT gene products

Table 3. Complex interactions are observed between collagen genes, sqt-1 , 2 , 3 and rol-8

2.1.2. Stabilizing a complex

In 1964, Crick and Orgel proposed that inter-allelic complementation may be due to a "good corrects bad" mechanism (Crick and Orgel, 1964). In this scenario, a mutation in a subunit of a homomer or multimer would result in the mis-folding of the monomer subunit and disrupt the activity of the whole complex. Likewise, a subunit with a different mutation would also produce a faulty complex. However, a functional complex can be restored if one mutant subunit is able to structurally compensate for the other mutant subunit.

One example of this type of mutual correction may be demonstrated by alleles of let-2 . The initial identification of let-2 occurred in a screen for X-linked lethals and steriles (Meneely and Herman, 1979). In this screen, four alleles of let-2 were identified and were shown to exhibit a complex complementation pattern. In fact, every allele complemented at least one other allele, including two alleles identified independently in other labs. A second screen published a few years later identified another eight alleles, which also exhibited the same complex pattern, giving rise to a complicated web of allelic interactions (Figure 4A Meneely and Herman, 1981). These alleles also complement at least one other allele (see Table 4 Meneeley and Herman, 1981). The strongest allele, mn153 , is a mutation in an N-terminal splice site and does not complement any other allele except b246 . b246 exhibits the least severe phenotype of all the alleles it is embryonic lethal only at non-permissive temperatures. Unexpectedly, b246 complements all other let-2 alleles (see Table 4). let-2 encodes an α 2(IV) collagen chain, which is a major component of basement membrane (see Basement membranes Sibley et al., 1993). Like other collagen molecules, LET-2 has an extensive Gly-X-Y repeat domains, making up about three-quarters of the molecule, and conserved non-collagenous regions (Figure 4B). Type IV collagen molecules are made up of heterotrimers of one α 2(IV) chain and 2 α 1(IV) chains (Figure 4C). These heterotrimers dimerize via their conserved C-terminal NC1 domains and can tetramerize via their C-termini to form a collagen lattice that provides structural support for basement membranes. The strong mutations in the Gly-X-Y regions can disrupt trimer formation and cause the accumulation of misfolded heterotrimers in the cell in a temperature dependent manner (Gupta et al., 1997). 
The b246 mutation occurs in a region of relatively high thermal stability and is therefore unlikely to have as drastic an effect on triple helix formation, or on subsequent complex formation, as mutations in regions of lower thermal stability (such as the mutations noted in bold in Figure 4B) (Sibley et al., 1993). The presence of the b246 mutant product in the type IV collagen heterotrimers may help to stabilize other, more severely altered heterotrimers in trans-heterozygotes.

Table 4. Complementation tests among let-2 alleles*

Figure 4. Intragenic complementation of let-2 alleles. (A) let-2 alleles exhibit a very complex pattern of complementation. Lines between alleles indicate non-complementation (the expected outcome). Alleles that are not joined by a line complement one another to some extent. Reprinted with permission from Meneely and Herman, 1981. Copyright ©1981 the Genetics Society of America. (B) A schematic of the LET-2 α2(IV) collagen, adapted from Sibley et al. (1993). Conserved type IV collagen features include extensive Gly-X-Y repeats (in blue) with interspersed interruptions (vertical black lines) and non-collagenous termini (in grey) with a conserved NC1 domain at the C-terminus. Cross-hatched area denotes an alternative splice region. Locations of mutations are indicated, with the strongest mutations (>90% homozygous embryonic lethality at 20°C) in bold. mn153 (boxed in red) is the only mutation that does not complement any other allele aside from b246 (boxed in green), which complements all other alleles to some extent. The interruptions in the Gly-X-Y repeat domains lower the thermal stability of the region, resulting in more severe phenotypes of Gly-X-Y mutations (noted in bold). (C) Type IV collagen is composed of a heterotrimer of one α2(IV) collagen chain and two α1(IV) chains. These heterotrimers can dimerize at their C-terminal NC1 domains, and then form a complex lattice through tetramerization of their N-terminal regions and lateral interactions along their triple helical domains. Mutations in the Gly-X-Y repeats would disrupt heterotrimer formation and subsequent associations.

2.1.3. Providing the missing function

Many examples of intragenic complementation have been reported for genes that encode products with independently functioning domains, or loci that are involved in independently regulated functions. In these cases, a mutation can result in a gene product that specifically disrupts one process while functioning relatively normally in others. Trans-heterozygotes made with such alleles exhibit intragenic complementation. This appears to be the most frequently reported type of intragenic complementation in C. elegans , perhaps since it results in some very interesting interactions. This type of intragenic complementation has been observed for a number of genes, including lin-3 , gld-1 , unc-5 , unc-84 , and bli-4 .

As mentioned before, lin-3 is required in different developmental processes. lin-3 is also independently mutable, such that a mutation in one part of the gene can disrupt one process while having minimal effects on other LIN-3-requiring processes. For example, the e1417 mutation only disrupts vulval induction, having no effect on male spicule development, while the n1058 mutation only weakly disrupts vulval induction, but strongly disrupts male spicule development (Liu et al., 1999). These homozygous phenotypes suggest that each allele has residual function in certain tissues. Furthermore, the n1058 mutation can complement two other lin-3 alleles, e1417 and n378, for vulval induction (see Table 1 and Table 3 in Liu et al., 1999). The intragenic complementation in this case suggests that residual function from each allele is enough to complement the missing tissue-specific function of the other allele, especially since the null mutation, n1059, does not complement any of these alleles (Table 1). Molecular analysis of the lin-3 mutations supports these genetic observations. Specifically, e1417 has a mutation in an anchor-cell-specific enhancer, which is required for LIN-3 function in vulval development, but would not affect LIN-3 function in other tissues (Hwang and Sternberg, 2004; Figure 2B). Additionally, n378 has a missense mutation resulting in reduction of function, and n1058 has a splice site mutation, which may cause a truncated LIN-3 product or a lowered amount of the LIN-3 product to be made (Ferguson and Horvitz, 1985; Hwang and Sternberg, 2004; Liu et al., 1999). Neither of these mutations would completely remove LIN-3 function.

Such functional compensation has also been observed for alleles of gld-1, unc-5 and unc-84 (Francis et al., 1995; Merz et al., 2001; Malone et al., 1999). For example, gld-1 encodes an RNA binding protein required for several aspects of cell cycle progression during gametogenesis (Jones and Schedl, 1995). GLD-1 is required in a temporally and spatially regulated manner during germline development, and these requirements are reflected in the range of mutant phenotypes exhibited by gld-1 mutants. gld-1 mutations fall into five classes, A-E (Francis et al., 1995). Class C mutations result in the masculinization of the germline so that only sperm and no oocytes are produced. gld-1 class D mutations result in the opposite phenotype of feminization of the germline so that only oocytes and no sperm are produced. Trans-heterozygotes between class C and D alleles produce a wild-type phenotype, presumably because gene products carrying the class C mutation can provide GLD-1 function during spermatogenesis while products carrying the class D mutation provide GLD-1 function during oogenesis.
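The logic of "providing the missing function" can be sketched as a set-union test. The following is a toy illustration only: the encoding of each allele as a set of retained functions, and all names in the code, are assumptions made for this example and are not taken from the cited studies.

```python
# Toy model: each allele is described by the set of functions its product
# can still perform. The encoding below is hypothetical and illustrative.
REQUIRED = frozenset({"spermatogenesis", "oogenesis"})

ALLELES = {
    "wild-type": frozenset({"spermatogenesis", "oogenesis"}),
    "class C": frozenset({"spermatogenesis"}),  # only sperm are produced
    "class D": frozenset({"oogenesis"}),        # only oocytes are produced
    "null": frozenset(),
}

def functions_supplied(allele_a, allele_b):
    """Union of the functions provided by the two alleles of a trans-heterozygote."""
    return ALLELES[allele_a] | ALLELES[allele_b]

def complements(allele_a, allele_b):
    """True if, between them, the two alleles cover every required function."""
    return functions_supplied(allele_a, allele_b) >= REQUIRED

for pair in [("class C", "class D"), ("class C", "class C"), ("class C", "null")]:
    print(pair, "complement" if complements(*pair) else "fail to complement")
```

In this sketch a class C / class D trans-heterozygote is scored as complementing, while either allele over a null is not, which mirrors the pattern described above. Real alleles retain graded, not all-or-none, activity.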

Finally, if a mutation results in the lack of expression of gene product in a subset of cells, a compensating mutation can be one that restores that expression pattern. One example of compensation by expression is illustrated by mutations of bli-4. bli-4 encodes an essential Kex2/subtilisin-like protease that cleaves the N-terminus of pre-pro-peptides (Thacker et al., 1995). Two lethal alleles of bli-4, sy90 and h754, which do not complement other lethal alleles of bli-4, were found to complement the viable Blister allele, e937 (Peters et al., 1991; Rose and Baillie, 1980). Other lethal alleles exacerbate the Blister phenotype of e937. bli-4 encodes nine splice variants of the Kex2/subtilisin-like protease. All variants share the first 12 exons, including the coding region for the protease domain. Mutations in the protease domain affect all nine isoforms and result in embryonic arrest. e937 is a deletion that affects 5 of the 9 isoforms (Thacker et al., 2000). The sy90 and h754 mutations have not been identified, suggesting that these mutations may lie in regulatory elements. In addition, sy90 and h754 homozygous mutants arrest development during L1, which is later than the other lethal mutants, suggesting that these mutations alter expression or function of BLI-4 after the embryonic period. Thus, in the e937 / sy90 or e937 / h754 trans-heterozygote it appears that enough BLI-4 is supplied by the e937 allele to proceed past L1 development, while the sy90 or h754 alleles may provide enough BLI-4 in later stages to abrogate the Blister phenotype.

2.1.4. Other examples

One last example of intragenic complementation, which could occur through stabilizing a complex or by providing a compensatory function, is exhibited by eat-2 alleles. eat-2 encodes a subunit of the nicotinic acetylcholine-gated ion channel required for pharyngeal pumping (Raizen et al., 1995). Nicotinic acetylcholine-gated ion channels are hetero-pentamers that are activated by endogenous acetylcholine or drugs such as nicotine. The extracellular ligand binding sites are formed at the interfaces between two different subunits (Changeux and Edelstein, 1998). Alleles of eat-2 fall into five classes, A-E. Although class A alleles do not complement alleles in any class, alleles in classes B-E exhibit intragenic complementation (Figure 5A). The eat-2 mutations were sequenced, and the complementing mutations of classes B-E were located in the extracellular domain of the subunit (Figure 5B; McKay et al., 2004). In general, disruptions in extracellular domains would be expected to alter ligand-binding sites. One can imagine then that a channel composed of either faulty subunit would not form any stable ligand binding sites; however, two differently altered subunits might stabilize the channel and the ligand binding sites to allow some increase in channel activity. Alternatively, in an eat-2 trans-heterozygote with two faulty EAT-2 subunits, two different channels might be made with compensatory ligand specificities, thus restoring the functions of the EAT-2 nicotinic acetylcholine receptor.

Figure 5. (A) Intragenic complementation of eat-2 alleles. The number of alleles in each class is noted in parentheses below the class letter along the bottom row. Intragenic complementation results are highlighted in yellow. *These trans-heterozygotes are not fully wild-type; however, they are less defective than either homozygote. Data were extrapolated from Raizen et al. (1995). (B) A schematic of the EAT-2 subunit with approximate locations of the mutations that can complement other eat-2 mutations (highlighted in yellow). Class A non-complementing mutations are highlighted in purple. Data extrapolated from McKay et al. (2004). Abbreviations: Pmp, Pumping defective.

2.2. Non-allelic non-complementation

Non-allelic non-complementation, also referred to as second-site non-complementation†, intergenic non-complementation, unlinked non-complementation, or second-site dominant enhancer, occurs when alleles of two different loci behave as if they are alleles of the same locus, e.g. the double heterozygote m1/+ +/m2 looks like either homozygote m1 +/m1 + or + m2/+ m2. Such interactions are not routinely encountered during complementation testing, because such assays are usually done between mutations that have been mapped to the same region and thus have a physical connection on the chromosome. By contrast, this interaction between mutations does not result from their physical location in the genome, but instead reflects a functional connection between different gene products, and often implies that the gene products physically interact. Other experimental approaches may, however, reveal that this effect is not uncommon. Specifically, screens for new alleles of a gene have the potential to uncover non-allelic non-complementing mutations (Hays et al., 1989). Hawley and Walker extensively review second-site non-complementation, giving in-depth examples of this event found in Drosophila and other organisms (Hawley and Walker, 2003). The following examples from C. elegans complement their review. Non-allelic non-complementation can arise in two situations. First, one or both mutations act as poisons to protein complexes, so that functional protein complexes become a limiting factor in a process. Second, both mutations reduce gene product below a threshold level, so that the dosage of the individual gene products becomes a limiting factor in a process. These situations, referred to as the Poison and Dosage Models, were put forth by M. Fuller, T. Stearns and D. Botstein and are discussed in more detail below (Fuller et al., 1989; Stearns and Botstein, 1988).

2.2.1. Non-allelic non-complementation by poison

In the Poison Model of non-allelic non-complementation, an altered gene product impairs the protein complex with which it normally associates. Although the complex-poisoning effect of this mutation does not result in a visible phenotype on its own in heterozygotes, a simultaneous mutation in another member of the protein complex can reveal a visible defect. Such interactions have been observed between α and β tubulin genes in Drosophila and yeast, where altered α tubulins act as poisons by either sequestering β tubulin or by disrupting the polymerization of the microtubule (Fuller et al., 1989; Hays et al., 1989; Stearns and Botstein, 1988).

Such poisonous interactions have also been observed in C. elegans. Most notably, Kusch and Edgar observed many instances of unconventional allelic interactions in their study of mutations affecting body shape and morphology (Kusch and Edgar, 1986). Specifically, it was noted that sqt-1 alleles did not complement certain sqt-3 or rol-8 mutations (Table 3). sqt-1, sqt-3 and rol-8 all encode cuticle collagens (Cox et al., 1989; Kramer et al., 1988; Novelli et al., 2004; van der Keyl et al., 1994). Collagens undergo significant amounts of homo- and heteromeric interactions, propeptide processing and crosslinking that result in permanent associations (Figure 6B; Johnstone, 2000; Myllyharju and Kivirikko, 2004; Page and Winter, 2003). Thus, an aberrant collagen monomer can potentially disrupt the cuticle matrix at many stages of cuticle assembly. The prevalence of non-allelic non-complementation among alleles of cuticle collagen genes supports this notion. In addition, mutations among collagen alleles that exhibit non-allelic non-complementation interactions tend to disrupt a Gly-X-Y repeat domain, which is proposed to be required for monomer associations (Table 2, Figure 6; Kramer and Johnson, 1993; Novelli et al., 2004; van der Keyl et al., 1994). This suggests that mutations in the Gly-X-Y repeats act as poisons by causing disruptions in trimer formation and by sequestering wild-type monomers in non-functional protein interactions, thereby decreasing the amount of functional trimers.

Figure 6. Cuticle collagens have a conserved domain structure. (A) Collagens in vertebrates have been demonstrated to form propeptide homo- or heteromeric helical trimers. Trimer formation is preceded by the association of the cysteine-containing domains that are specific for each family of collagen. The association of the cysteine domains also puts the monomers in the right registration for trimer formation, which is carried out by the Gly-X-Y repeats in a zipping fashion starting from the C-terminal end. (B) The helical trimer propeptide is further processed by protease cleavage of the N-terminus by BLI-4 (in SQT-3 the C-terminus may also be a site of propeptide cleavage; Novelli et al., 2004). After proteolytic processing, the trimers are cross-linked to other trimers via N-terminal and C-terminal cross-linking sites to form the collagen matrix.

Non-allelic non-complementation has also been observed among genes required for synaptic vesicle fusion (Yook et al., 2001). In particular, a hypomorphic allele of unc-13, n2813, acts as a poison in synaptic transmission. unc-13(n2813) is a recessive allele that causes a jerky, Unc phenotype due to decreased synaptic transmission. However, n2813 as a trans-heterozygote with mutations in other synaptic function loci exhibits an Unc phenotype even when wild-type copies of both loci are present (E. Jorgensen, pers. comm.). Furthermore, in a sensitive drug assay, n2813 exhibits a poisonous effect on synaptic transmission. UNC-13 is a diacylglycerol binding protein with multiple C2 Ca²⁺ binding domains that physically interacts with UNC-64/syntaxin to prime vesicles for fusion to the plasma membrane (Ahmed et al., 1992; Maruyama and Brenner, 1991; Richmond and Broadie, 2002). It was demonstrated that unc-13(n2813) did not complement a null or hypomorphic allele of unc-64, although a null allele of unc-13 fully complemented the unc-64 null allele. These results suggest that synaptic transmission is not sensitive to the amount of individual UNC-13 or UNC-64 protein molecules; however, the number of functional UNC-13/UNC-64 protein complexes is a limiting factor. It is possible that UNC-13(n2813) acts as a poison to synaptic transmission by sequestering wild-type UNC-64 into non-functional complexes, thus making the amount of UNC-64 a limiting factor to synaptic transmission, following the Poison Model of non-allelic non-complementation (Figure 7A).

Figure 7. The Poison and Dosage Models of vesicle priming at the synapse. (A) A schematic of primed vesicles along the active zone at a wild-type synapse. In The Dosage Model, the unc-13(null)/+ unc-64(null)/+ trans-heterozygote would have half the amount of UNC-13 and UNC-64, which would result in a decrease in the number of primed vesicles ready for synaptic release. In The Poison Model, the unc-13(poison)/+ unc-64(poison)/+ trans-heterozygote would produce products that would interfere with the formation of functional complexes. Altered UNC-13 and altered UNC-64 can participate in the same complex and render it non-functional. In addition these altered products can form complexes with wild-type partners, sequestering them in non-functional complexes. These interactions would reduce the level of functional complexes to below half the normal amount. (B) The synaptic vesicle cycle at a cholinergic synapse. Synaptic vesicles and associated proteins are transported to the synapse from the cell body by the synaptic vesicle kinesin protein UNC-104. Acetylcholine, made by CHA-1, is packaged into vesicles by the acetylcholine transporter, UNC-17. Mature synaptic vesicles must be docked and primed at the synapse so that the vesicle can rapidly fuse with the plasma membrane when the neuron is depolarized. Docking and priming of the vesicle requires the dissociation of UNC-18 from UNC-64/syntaxin and the association of UNC-13 with UNC-64. The vesicle is fully primed when SNB-1 joins the UNC-13/UNC-64 complex. After vesicle fusion, the vesicle and its associated proteins are recovered from the plasma membrane through clathrin-mediated endocytosis, which utilizes the μ 2/DPY-23 containing AP-2 adaptor complex (W.S. Davis et al., WBPaper00023077). Recycling of the synaptic vesicles and their associated proteins is required for maintaining a readily releasable pool.

These results support the view that non-allelic non-complementation signifies a physical interaction between the mutant gene products; however, in the same study it was demonstrated that this is not always the case. Specifically, unc-13(n2813) failed to complement mutations in other genes required for synaptic vesicle dynamics (Figure 7B). Whereas the greatest degree of non-complementation occurred between mutations in genes whose products are known to have a physical interaction (i.e. unc-13 and unc-64, unc-13 and unc-18, unc-64 and snb-1), non-complementation was also observed between genes whose products are not known to have a physical interaction (i.e. unc-13 and snb-1, unc-13 and dpy-23, unc-13 and unc-104). These observations suggest that the unc-13(n2813) aberrant gene product makes synaptic transmission sensitive to perturbations in other synaptic function loci. In particular, the effects of the UNC-13 poison extend to those loci that affect the concentration of synaptic components at the synapse, such as unc-104 and dpy-23. However, there is a limit to this interaction, as non-complementation was not observed for unc-13 and cha-1, unc-13 and unc-17, or unc-13 and syd-1 trans-heterozygotes, as cha-1, unc-17 and syd-1 do not play a direct role in the synaptic vesicle cycle (see references in Yook et al., 2001).

2.2.2. Non-allelic non-complementation by dosage (i.e. combined-haplo-insufficiency)

In the Dosage Model, the limiting factor in a process is the total amount of gene product, such that a simultaneous decrease in the levels of expression of both genes results in a mutant phenotype. Dosage-sensitive processes have been reported for developmental pathways where events are controlled by protein gradients, such as observed in Drosophila and vertebrates (Jackson and Berg, 1999; Kidd et al., 1999; Rancourt et al., 1995). Few examples of dosage-sensitive processes have been reported in C. elegans; however, combined-haplo-insufficiency has been reported among ram genes required in male tail ray morphogenesis (Baird and Emmons, 1990). Baird and Emmons demonstrated that null-like mutations in ram-4 do not complement presumed null mutations in any of the other ram loci. Specifically, ram-4(bx25ts) behaves like the deficiency, mDf9, in its trans-heterozygous interactions with mutations at other ram loci. Unfortunately, these studies were limited to assaying gene interactions between ram alleles that were not necessarily nulls. Putative null alleles of the ram genes have since been obtained and verified, and combined-haplo-insufficiency has been observed for some but not all of the ram loci (K.L. Chow, pers. comm.). Specifically, ram-1(wx71), ram-2(bx76), and ram-4(bx48) exhibit non-allelic non-complementation with one another, but not with ram-6(wx66) (K.L. Chow, pers. comm.). Further work has demonstrated that ram-1, ram-2/ram-3 and ram-4 encode cuticular collagens (Tam et al., International C. elegans Meeting 2003, 81; Tam et al., in prep.; Yu and Chow, International C. elegans Meeting 2001, 424; Yu et al., in prep.).
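The difference between the Dosage and Poison Models can be made concrete with a small calculation. This is a deliberately crude sketch, not a model from the cited work: it assumes 1:1 complexes, that the scarcer partner limits how many complexes form, that altered and wild-type products pair with equal affinity, and that a complex works only if both partners are wild type. The function name and parameters are hypothetical.

```python
def functional_complexes(a_amount, a_wt_fraction, b_amount, b_wt_fraction):
    """Relative number of functional A/B complexes (wild type = 1.0).

    a_amount, b_amount: total product of each gene relative to wild type.
    a_wt_fraction, b_wt_fraction: fraction of each product that is unaltered.
    All simplifying assumptions are listed in the accompanying text.
    """
    complexes_formed = min(a_amount, b_amount)
    return complexes_formed * a_wt_fraction * b_wt_fraction

# Dosage Model: null/+ at both loci halves the amount of each product.
dosage_double_het = functional_complexes(0.5, 1.0, 0.5, 1.0)

# Poison Model: poison/+ at both loci keeps amounts normal, but half of
# each product is altered.
poison_double_het = functional_complexes(1.0, 0.5, 1.0, 0.5)

# A poison allele of one gene over a null allele of the other.
poison_over_null = functional_complexes(1.0, 0.5, 0.5, 1.0)

print(dosage_double_het, poison_double_het, poison_over_null)
```

Under these assumptions the double-null heterozygote retains half the normal number of functional complexes, whereas combinations involving a poison allele fall below half. A process that tolerates a 50% reduction but not a 75% reduction would therefore show non-complementation only when a poison allele is involved.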

2.2.3. Other examples of non-allelic non-complementation

Non-allelic non-complementation has been noted in other processes. In particular, in a screen for suppressors of glp-1 , sog-1 alleles were demonstrated to not complement alleles of five other sog loci for glp-1 suppression (Maine and Kimble, 1993). In addition, in a screen for suppressors of rol-3 lethality, srl-2(s2506) was demonstrated to not complement mutations in srl-1 (Barbazuk et al., 1994). Unfortunately, molecular information is currently unavailable for these genes and mutations.

More recently, Chang et al. have reported non-allelic non-complementation between genes required for establishing left-right asymmetry of the ASE chemosensory neurons (Chang et al., 2003). These researchers identified a number of lsy (lim-6 symmetry) mutants that had altered asymmetric expression of ASEL- or ASER-specific reporter constructs. Mutations in one class were identified as alleles of unc-37 and cog-1. Mutations in this class exhibited ectopic expression of an ASEL-specific reporter, gcy-7::gfp, in ASER. These researchers also observed ectopic expression of gcy-7::gfp in ASER in an unc-37(e262)/+ +/cog-1(ot28) trans-heterozygote, whereas there is only ASEL expression in either heterozygote alone. unc-37 encodes the C. elegans ortholog of the Groucho transcription co-factor and cog-1 encodes the ortholog of vertebrate Nkx6-type homeobox genes (Palmer et al., 2002; Pflugrad et al., 1997). Biochemical studies have demonstrated that the vertebrate COG-1 ortholog interacts with Drosophila Groucho through a conserved engrailed homology (eh1) domain; thus, it is likely that UNC-37 and COG-1 physically interact (Muhr et al., 2001). Other lsy mutants in the same class exhibit non-allelic non-complementation with unc-37 and cog-1; however, these have yet to be characterized (Oliver Hobert, pers. comm.).

† In Drosophila, "second-site non-complementation" or SSNC is the predominant title for this interaction; however, to avoid confusion with intragenic mutations that modify allelic mutations, as in the case of "second-site suppressors", we have opted to stay with the "non-allelic non-complementation" title.


Combining knowledge-based and lexical approaches for ontology integration

We developed and extended the PhenomeNET ontology to integrate several species-specific phenotype ontologies and identify mappings between phenotype classes. Here, we consider a mapping between two classes (in two ontologies) a formal relation between them, i.e., an axiomatic relation such as equivalence, sub- or super-class, or disjointness. An alignment between two ontologies is created by a set of mappings. Ontology matching is the process of finding mappings between classes in two ontologies. Ontology integration goes beyond identification of an ontology alignment in that two or more ontologies are merged into a single ontology that encompasses all classes in the original ontologies [14].

Phenotype classes in the HP and MP ontologies are formally defined using the Entity-Quality (EQ) pattern [4, 26]. Based on the EQ pattern, a phenotype is decomposed into an affected entity and a quality that specifies how the entity is affected. The Entity will usually be a class taken either from an anatomy ontology or a physiology ontology. For example, the phenotype class macroglossia ( HP:0000158 ) describes an anatomical abnormality and is defined as equivalent to ‘has part’ some (‘increased size’ and (‘inheres in’ some tongue) and (‘has modifier’ some abnormal)) , relying on the entity tongue (from the UBERON anatomy ontology [9]) and the quality increased size (from PATO) in its definition. The class abnormality of salivation ( HP:0100755 ) is a physiological abnormality and is defined as equivalent to ‘has part’ some (quality and (‘inheres in’ some ‘saliva secretion’) and (‘has modifier’ some abnormal)) , where saliva secretion is a class from the biological process branch of the Gene Ontology (GO) [10].

The general pattern for defining a phenotype class in both the HP and MP ontologies, given Entity E and Quality Q, is to declare them equivalent to ‘has part’ some (Q and ‘inheres in’ some E) . In some cases, the Entity E is further constrained, e.g., by a location in which a certain process may happen. The “E” classes are generally taken either from the UBERON cross-species anatomy ontology [9] or from the GO. As the use of anatomy and physiology ontologies (UBERON and GO) is shared between MP and HP, it is possible to integrate both ontologies directly, based on the axiom patterns used to constrain their classes. However, the type of axiom pattern used in both ontologies results in a classification that is primarily based on the PATO ontology, as the Quality Q is the main feature that distinguishes different classes.

In the PhenomeNET ontology, we rewrite all axioms in HP and MP using a pattern-based approach that allows us to utilize axioms from anatomy and physiology ontologies and enrich the classification of phenotype classes [11, 27]. In general, we declare phenotype classes defined using an Entity E and Quality Q as equivalent to ‘has part’ some (E and has-quality some Q) and we further add grouping classes that are defined as equivalent to ‘has part’ some ((‘part of’ some E) and has-quality some Quality) . For example, based on the axiom that defines macroglossia ( HP:0000158 ) as equivalent to ‘has part’ some (‘increased size’ and (‘inheres in’ some tongue) and (‘has modifier’ some abnormal)) , we generate two new axioms: macroglossia EquivalentTo: ‘has part’ some (tongue and has-quality some ‘increased size’) as well as ‘tongue abnormality’ EquivalentTo: ‘has part’ some ((‘part of’ some tongue) and has-quality some Quality) . These two axioms, together with the transitivity and reflexivity of the ‘part of’ relation, ensure that macroglossia becomes a subclass of tongue abnormality, and that all phenotypes affecting the tongue or a part of the tongue also become a subclass of tongue abnormality. The aim of rewriting the axioms is to base the classification of phenotype classes primarily on anatomical or physiological entities instead of the quality, and to utilize the axioms involving parthood in anatomy and physiology ontologies [11, 28]. Crucially, all axioms we generate fall in the OWL 2 EL profile [29, 30] and allow efficient automated reasoning using optimized OWL 2 EL reasoners such as ELK [31]. The first version of the PhenomeNET ontology (PhenomeNET-Plain) consists only of these axioms and no additional mappings.
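The rewriting pattern described above can be illustrated with plain string templates. This is only a sketch of the two axiom shapes given in the text, written in a Manchester-style notation; the function names are hypothetical, and the actual PhenomeNET pipeline presumably operates on OWL axioms, not strings.

```python
def eq_definition(entity, quality):
    """Original EQ-style definition, as used in HP and MP."""
    return (f"'has part' some ({quality} and ('inheres in' some {entity}) "
            f"and ('has modifier' some abnormal))")

def rewrite_axioms(phenotype, entity, quality):
    """Return the two rewritten axioms: the class axiom and the grouping axiom."""
    class_axiom = (f"{phenotype} EquivalentTo: "
                   f"'has part' some ({entity} and has-quality some {quality})")
    grouping_axiom = (f"'{entity} abnormality' EquivalentTo: "
                      f"'has part' some (('part of' some {entity}) "
                      f"and has-quality some Quality)")
    return class_axiom, grouping_axiom

print(eq_definition("tongue", "'increased size'"))
for axiom in rewrite_axioms("macroglossia", "tongue", "'increased size'"):
    print(axiom)
```

Running the sketch for macroglossia reproduces the two axioms quoted in the text, with the entity moved to the head of the class expression so that classification follows the anatomy rather than the quality.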

In addition to this knowledge-based approach to linking the HP and MP ontologies, we also add lexical mappings, mappings derived from cross-references in the ontologies [5], mappings between HP and MP from BioPortal [22], and mappings generated by the AgreementMaker Light (AML) [14] in its default settings with a score greater than 0.7. Each mapping is added as a single equivalent classes axiom to the PhenomeNET ontology (PhenomeNET-Plain) to generate a version of the PhenomeNET ontology with lexical mappings (PhenomeNET-Map).

Neither HP nor MP contain mappings to the DO or ORDO ontologies, despite a significant overlap between the four ontologies. Moreover, since neither DO nor ORDO contain axioms that follow a similar pattern to the axioms in HP and MP, we have to rely exclusively on lexical mappings in order to integrate DO and ORDO. To achieve this, we use the AML [14] in its default settings to generate mappings between HP and DO, HP and ORDO, MP and DO, MP and ORDO, and DO and ORDO (see Table 1). We then add an equivalent class axiom for each mapping AML identifies with a score greater than 0.7. The resulting ontology (PhenomeNET-Full) contains HP, MP, ORDO, and DO, and can be used to generate further mappings between these ontologies. Figure 1 provides an overview of the different data sources we used to generate the mappings for the three PhenomeNET ontologies.
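The thresholding step can be sketched as a simple filter over scored candidate mappings. The tuple layout, the scores, the `EX:` placeholder identifiers, and the output strings are assumptions made for illustration; they do not reflect AML's actual output format.

```python
SCORE_THRESHOLD = 0.7  # cut-off stated in the text

def equivalence_axioms(mappings, threshold=SCORE_THRESHOLD):
    """Keep mappings scoring strictly above the threshold and render each
    as a functional-syntax-style equivalent classes axiom (as a string)."""
    return [
        f"EquivalentClasses({source} {target})"
        for source, target, score in mappings
        if score > threshold
    ]

# Hypothetical candidate mappings; all scores are invented for this example.
candidates = [
    ("ORPHANET:155899", "HP:0005321", 0.95),
    ("DOID:2908", "HP:0005321", 0.82),
    ("EX:0000001", "EX:0000002", 0.70),  # exactly at the threshold: dropped
    ("EX:0000003", "EX:0000004", 0.41),
]

for axiom in equivalence_axioms(candidates):
    print(axiom)
```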

Figure 1. An overview of the data sources and strategies used to generate the PhenomeNET ontologies. On one side, we use mappings between HP, MP, DO, and ORDO, generated using the AML ontology matching system; on the other side, we use the axioms used to define classes in HP and MP together with the background knowledge in other ontologies to generate mappings formally. Using the ELK reasoner, we generate a hierarchical ontology structure (i.e., a taxonomy) from which we derive equivalent class, sub-class, and super-class mappings. The PhenomeNET-Full ontology is based on a combination of all these mapping approaches, while PhenomeNET-Map uses only the AML-generated mappings between HP and MP. PhenomeNET-Plain does not use any of the AML-generated mappings but solely relies on the axioms and background knowledge.

All versions of the PhenomeNET ontology contain the classes from the HP and MP ontologies as well as the subclass axioms between named classes asserted in these ontologies. Furthermore, the PhenomeNET ontology imports the ChEBI [32] and Mouse Pathology [33] ontologies using an OWL import statement. Additionally, PhenomeNET includes all classes from the UBERON, the GO, the BioSpatial Ontology [34], the Zebrafish Anatomy ontology [35], the PATO ontology [4], the Cell Ontology [36], and the Neuro-Behavior Ontology [37]. However, these ontologies are not directly imported but rather pre-processed so that all disjointness axioms from these ontologies are excluded while all other axioms contained within them are included in the PhenomeNET ontology. The aim of this pre-processing step is to avoid unsatisfiable classes due to different conceptualizations between anatomy and phenotype ontologies, or within anatomy ontologies (Zebrafish Anatomy and UBERON) [3].
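As a rough illustration of the pre-processing step, the sketch below drops disjointness axioms from a list of axioms and keeps everything else. The tuple representation and the class names are hypothetical, and only axioms of type `DisjointClasses` are treated as disjointness axioms here, which is a simplifying assumption.

```python
def strip_disjointness(axioms):
    """Return all axioms except those asserting disjointness between classes.

    Each axiom is a tuple whose first element names the axiom type; this
    representation is invented for the example.
    """
    return [axiom for axiom in axioms if axiom[0] != "DisjointClasses"]

# Illustrative axioms with placeholder class names.
source_axioms = [
    ("SubClassOf", "tongue", "organ"),
    ("DisjointClasses", "organ", "cell"),
    ("EquivalentClasses", "tongue", "lingua"),
]

kept = strip_disjointness(source_axioms)
print(kept)
```

Removing such axioms trades some logical strength for the guarantee that classes combined from differently conceptualized ontologies do not become unsatisfiable.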

Mappings between ontologies included in PhenomeNET are generated using the ELK reasoner [31]. We use ELK to classify the PhenomeNET ontology and identify pairs of equivalent classes C1 and C2 that belong to the ontologies to be aligned. These constitute equivalent class mappings. Furthermore, we also use ELK to identify pairs of classes C1 and C2 such that C1 is a proper sub- or super-class of C2 to generate sub- and super-class mappings. A reasoner such as ELK is also required to explore and visualize the PhenomeNET ontology structures, and the PhenomeNET-Map ontology can be explored and visualized in the AberOWL ontology repository [38].
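How mappings could be read off a classified taxonomy is sketched below. The sketch assumes the reasoner's output has already been converted into a dictionary whose keys are taxonomy nodes (sets of mutually equivalent classes) and whose values are the direct parent nodes; this data structure, the function names, and the `MP:example_parent` identifier are hypothetical and do not correspond to ELK's API.

```python
def ontology_of(class_id):
    """Ontology prefix of a class identifier such as 'HP:0012676'."""
    return class_id.split(":", 1)[0]

def ancestors(node, parents):
    """All strict ancestor nodes of a taxonomy node (transitive closure)."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def derive_mappings(parents, left="HP", right="MP"):
    """Return (left_class, relation, right_class) triples between two ontologies."""
    mappings = set()
    for node in parents:
        lefts = [c for c in node if ontology_of(c) == left]
        rights = [c for c in node if ontology_of(c) == right]
        # Classes sharing a node are equivalent.
        mappings.update((l, "equivalent", r) for l in lefts for r in rights)
        for upper in ancestors(node, parents):
            for c in upper:
                if ontology_of(c) == right:
                    mappings.update((l, "subclass", c) for l in lefts)
                elif ontology_of(c) == left:
                    mappings.update((c, "superclass", r) for r in rights)
    return mappings

copper = frozenset({"HP:0012676", "MP:0011214"})  # equivalent pair from the text
parent = frozenset({"MP:example_parent"})         # hypothetical parent class
taxonomy = {copper: {parent}, parent: set()}

for mapping in sorted(derive_mappings(taxonomy)):
    print(mapping)
```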

Evaluation of mappings: HP and MP

We employ the PhenomeNET ontology primarily for integrating the HP and MP ontologies. Using the axioms in the ontology alone (PhenomeNET-Plain), we identify 745 equivalent classes between the HP and MP ontologies (see Table 2). Additionally, a large number of sub- and super-class mappings can be identified based on querying the ontology using the ELK reasoner [31] for sub- or super-classes in the two ontologies.

The number of pairs of equivalent classes identified increases to 1536 when adding explicit mappings derived from AML. Of these, 370 are generated by automated reasoning and are also included in AML, 791 are generated from the AML-derived equivalent classes axioms, and 375 could only be derived through automated reasoning. For example, using the PhenomeNET ontology, we infer an equivalent class mapping between Copper accumulation in brain ( HP:0012676 ) and Increased brain copper level ( MP:0011214 ) based on their shared definition ‘has part’ some (‘increased amount’ and (‘inheres in’ some (‘copper atom’ and (‘part of’ some brain))) and (‘has modifier’ some abnormal)) . Such mappings are not easily identified by methods that do not consider the axioms constraining the ontology classes.
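The reported counts can be cross-checked with a few lines of arithmetic. The variable names are ours; the numbers are those given in the text.

```python
# Counts of equivalent class mappings between HP and MP, as reported above.
both_reasoning_and_aml = 370  # found by reasoning and also present in AML
aml_only = 791                # obtained only from AML-derived axioms
reasoning_only = 375          # obtained only through automated reasoning

total_with_aml = both_reasoning_and_aml + aml_only + reasoning_only
reasoning_subtotal = both_reasoning_and_aml + reasoning_only
aml_subtotal = both_reasoning_and_aml + aml_only

print(total_with_aml)      # all equivalent class mappings
print(reasoning_subtotal)  # mappings derivable by reasoning
print(aml_subtotal)        # mappings present in AML
```

The three parts sum to the 1536 pairs reported, and the reasoning-derived subtotal coincides with the 745 equivalent classes reported for PhenomeNET-Plain.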

Additionally, we observe an increase in the number of equivalent class mappings when adding the ORDO and DO ontologies to the PhenomeNET ontology. The increase in mappings (from 1536 to 1582 classes) is a result of additional inferences obtained from adding the mappings from HP and MP to ORDO and DO, and combining them with the axioms in the PhenomeNET ontology. For example, we infer a new mapping between decreased IgG level (MP:0001805) and agammaglobulinemia (HP:0004432) based on the equivalence axioms between both classes and agammaglobulinemia (DOID:2583) generated by AML (based on the asserted synonym “hypogammaglobulinemia” shared between the classes in DO and MP). Table 2 summarizes our results.

Evaluation of mappings: ORDO and DO

PhenomeNET is primarily designed for ontologies that follow the Entity-Quality definition pattern based on the PATO ontology. Neither ORDO nor DO follows this pattern, and both are included in the PhenomeNET ontology primarily through equivalent class axioms based on lexical mappings generated by AML. Notably, the number of mappings we generate increases when HP and MP are included. For example, we identify a mapping between mandibulofacial dysostosis (ORPHANET:155899) and Treacher Collins syndrome (DOID:2908), based on common AML-generated mappings to mandibulofacial dysostosis (HP:0005321).

OAEI evaluation

PhenomeNET participated in the Ontology Alignment Evaluation Initiative (OAEI) 2016 challenge, where several ontology alignment systems were evaluated according to the following criteria:

precision and recall with respect to a silver standard generated by voting over the outputs of the participating systems (using either two or three votes),

recall with respect to manually generated mappings,

and a manual assessment of the mappings that were unique to a particular system.
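The first criterion can be illustrated with a small sketch: a silver standard is built from mappings that receive at least a given number of votes, and a system is then scored against it. This is an assumption-laden toy, not the OAEI evaluation code. The function names, the system names and the mapping labels `m1` to `m4` are hypothetical, and a system's own output is assumed to count as one of the votes.

```python
from collections import Counter


def silver_standard(system_outputs, min_votes=2):
    """Return the mappings proposed by at least `min_votes` systems."""
    votes = Counter(m for out in system_outputs.values() for m in set(out))
    return {m for m, v in votes.items() if v >= min_votes}


def precision_recall_f(predicted, reference):
    """Compute precision, recall and F-measure of a mapping set."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy outputs of three hypothetical systems.
outputs = {
    "system_a": {"m1", "m2", "m3"},
    "system_b": {"m1", "m2"},
    "system_c": {"m2", "m4"},
}
silver = silver_standard(outputs, min_votes=2)
p, r, f = precision_recall_f(outputs["system_a"], silver)
```

Raising `min_votes` from two to three shrinks the reference set to mappings with broader agreement, which is why the same system can score differently under the two silver standards.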

In the first dataset, a silver standard reference alignment was generated from the systems participating in the OAEI challenge, using a vote of two of the participating systems. PhenomeNET-Full reached an F-measure of 0.829 in the HP-MP task and 0.886 in the DO-ORDO task. The LogMap system [39] achieved the highest F-measure in this evaluation, 0.925, for the HP-MP task, and the FCA_Map system [40] achieved an F-measure of 0.962 in the DO-ORDO task. Results are similar when evaluating with a silver standard reference alignment generated by three votes of systems participating in the challenge. In particular, in the DO-ORDO evaluation, PhenomeNET-Full achieved the second-highest F-measure of 0.935, while the LogMap system [39] achieved an F-measure of 0.937.

When evaluating against manually created mappings, PhenomeNET-Full achieved the highest recall of 0.897 in the HP-MP task but could not generate any of the manually created mappings between DO and ORDO. Furthermore, when evaluating mappings that were uniquely identified by individual systems, 89 mappings between HP and MP as well as 3 mappings between ORDO and DO were generated only by the PhenomeNET ontologies and no other participating system. These were manually assessed, and PhenomeNET obtained a precision of 1.0 both for the 89 unique mappings generated between HP and MP and for the 3 mappings generated between DO and ORDO. We provide the full evaluation for the OAEI as Additional file 1; results are also available at

As PhenomeNET relies on generating a taxonomic structure in which classes from HP and MP are combined, PhenomeNET also generated a large number of subclass and superclass mappings. While these were not explicitly evaluated, PhenomeNET was the only system explicitly focusing on this kind of mapping, while other participating systems primarily focused on identifying mappings representing class equivalence.

Predicting gene–disease associations

To determine the impact of the different mapping approaches in biomedical data analysis, we also apply the three ontologies in the task for which PhenomeNET was originally designed, predicting gene–disease associations based on semantic similarity between mouse model phenotypes and human phenotypes [3, 5]. For this purpose, we use the PhenomeNET ontology as an integrated version of both HP and MP so that semantic similarity can be computed simultaneously over both ontologies. Semantic similarity establishes a measure of relatedness between classes, or sets of classes, within an ontology (or, in some cases, between classes from multiple ontologies) [25].
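To make the idea of semantic similarity over an integrated ontology concrete, the sketch below compares two sets of phenotype classes by the overlap of their ancestor closures (a Jaccard index). This is a deliberately simple stand-in and not the measure cited as [24, 41]. The class names, the `parents` dictionary and the function names are all hypothetical.

```python
def ancestors(cls, parents):
    """Return `cls` together with all of its ancestors in the hierarchy."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def closure(classes, parents):
    """Union of the ancestor sets of every class in `classes`."""
    result = set()
    for cls in classes:
        result |= ancestors(cls, parents)
    return result


def jaccard_similarity(set_a, set_b, parents):
    """Jaccard index between the ancestor closures of two phenotype sets."""
    closure_a, closure_b = closure(set_a, parents), closure(set_b, parents)
    union = closure_a | closure_b
    return len(closure_a & closure_b) / len(union) if union else 0.0


# Toy integrated hierarchy: a mouse and a human class share a parent.
parents = {
    "MP:toy_copper": ["PHENO:brain_copper"],
    "HP:toy_copper": ["PHENO:brain_copper"],
    "PHENO:brain_copper": ["PHENO:root"],
    "HP:toy_other": ["PHENO:root"],
}
close = jaccard_similarity({"MP:toy_copper"}, {"HP:toy_copper"}, parents)
distant = jaccard_similarity({"MP:toy_copper"}, {"HP:toy_other"}, parents)
```

Because mouse and human classes sit in one hierarchy, a mouse phenotype and a human phenotype can share ancestors, and that shared ancestry is what lets a single similarity computation run across both ontologies.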

To evaluate the success of the three ontologies in disease gene prioritization, we obtain mouse model phenotypes associated with loss-of-function mutations in single genes from the MGI database [23] as well as human disease phenotypes associated with Mendelian diseases from the HPO database [12], and apply a semantic similarity measure [24, 41] to compare the phenotypic similarity between phenotypes associated with mouse mutants and human disease. We systematically compute phenotypic similarity between 9131 loss-of-function mouse mutants and 7066 diseases. We perform this experiment three times, once for each version of the PhenomeNET ontology (PhenomeNET-Plain, PhenomeNET-Map, and PhenomeNET-Full). Additionally, to determine the effect of PhenomeNET’s knowledge-based approach, we also generate an integrated ontology based only on an alignment between HP and MP generated by AML.

We test how well this approach recovers known gene–disease associations. We use two sets for this evaluation: human gene–disease associations observed in a clinical context and presented in the Online Mendelian Inheritance in Man (OMIM) database [42], and mutant mice identified by curators as models of a human disease represented in the MGI database [23]. The receiver operating characteristic (ROC) curves [43] for this evaluation are shown in Fig. 2. We find that the PhenomeNET-Map version, which focuses specifically on generating mappings between MP and HP, performs best among our ontologies in this evaluation (AUROC 0.794 for human gene–disease associations and 0.930 for mouse associations), followed by PhenomeNET-Full (AUROC 0.791 for human and 0.929 for mouse gene–disease associations) and PhenomeNET-Plain (AUROC 0.790 and 0.920 for human and mouse, respectively). An ontology generated only by mappings from AML, however, performs better than any of the PhenomeNET ontologies despite producing fewer mappings. Using an ontology based only on the AML-derived mappings, we achieve an AUROC of 0.795 and 0.934 for the human and mouse evaluation sets, respectively. However, none of the differences between the ontologies is statistically significant (p>0.05 for all 12 comparisons, Wilcoxon rank sum test, Bonferroni correction).

ROC curves for predicting gene–disease associations using the three different ontologies
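The AUROC values reported above summarize how often a known gene–disease association is ranked above an unrelated pair. A minimal sketch of that computation follows. It uses the pairwise (Mann-Whitney) formulation rather than any specific library, and the scores and labels are invented toy data, not results from the evaluation.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: similarity scores, higher meaning more likely associated.
    labels: truthy for known associations, falsy otherwise.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, l in zip(scores, labels) if l]
    negatives = [s for s, l in zip(scores, labels) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy ranking: two known associations among five candidate pairs.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auroc(toy_scores, toy_labels)
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to every known association being ranked above every non-association.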


We thank George Chaconas (University of Calgary, Calgary, AB, Canada) for the generous gift of rabbit anti-MuA polyclonal antibody. We acknowledge Keith Derbyshire (University of Albany, Albany, NY, USA) for providing the E. coli strains DH10B and HT321 as well as plasmid pNT105. We thank Sari Tynkkynen and Pirjo Rahkola for excellent technical assistance. This study was funded by the Academy of Finland, the Finnish National Technology Agency (TEKES) and the Finnish Cultural Foundation.
