# Maximum parsimony tree - Am I right, or is the correction model right?

Alright. On Monday we have a test, and I was just working through a practice test.

We have to build a maximum parsimony tree. We must do that again on Monday, so I want to know whether my reasoning is wrong or the correction model is incorrect.

The sequences are:

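| Seq | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-----|---|---|---|---|---|---|---|
| 1   | A | T | A | A | G | C | C |
| 2   | T | C | A | C | C | T | G |
| 3   | A | T | C | C | G | A | C |
| 4   | T | G | C | A | C | T | G |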

This results in:

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 |   |   |   |   |
| 2 | 6 |   |   |   |
| 3 | 3 | 6 |   |   |
| 4 | 6 | 3 | 6 |   |

And eventually, this results in:

|     | 1,3 | 2,4 |
|-----|-----|-----|
| 1,3 |     |     |
| 2,4 | 6   |     |

This is the tree in the correction model:

My tree has two `1.5`'s instead of the two `3`'s at the root of the tree. `6` is the last distance, and `6/2 = 3`, so both branches should get a distance of 3. But, according to some examples I saw on the internet, that is a distance of 3 from the root to the tips of the tree, not 3 to the next node. There is already a distance of `1.5` on the branches further down on both sides, so only `1.5` remains (because `3 - 1.5 = 1.5`).
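The distance matrices and the halving step above look like distance-based clustering (UPGMA-style) rather than a character-based parsimony search. Assuming that is what the exercise intends, the following minimal sketch recomputes the pairwise differences and the node heights for the four sequences in the question. The function names are illustrative only, and the tie between the two distances of 3 is broken alphabetically.

```python
from itertools import combinations

# Sequences copied from the question.
seqs = {"1": "ATAAGCC", "2": "TCACCTG", "3": "ATCCGAC", "4": "TGCACTG"}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def upgma(seqs):
    """Return a list of (cluster, node_height, left_branch, right_branch)."""
    clusters = {name: (1, 0.0) for name in seqs}  # name -> (size, height)
    dist = {frozenset(p): float(hamming(seqs[p[0]], seqs[p[1]]))
            for p in combinations(seqs, 2)}
    merges = []
    while len(clusters) > 1:
        # Closest pair; ties are broken alphabetically so the run is deterministic.
        pair = min(dist, key=lambda p: (dist[p], sorted(p)))
        a, b = sorted(pair)
        (size_a, h_a), (size_b, h_b) = clusters[a], clusters[b]
        height = dist[pair] / 2  # the new node sits halfway between the two clusters
        new = f"({a},{b})"
        # Each branch length is the new node height minus the child's own height.
        merges.append((new, height, height - h_a, height - h_b))
        for c in clusters:
            if c not in (a, b):
                d_ac = dist[frozenset((a, c))]
                d_bc = dist[frozenset((b, c))]
                dist[frozenset((new, c))] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        del clusters[a], clusters[b]
        clusters[new] = (size_a + size_b, height)
    return merges

merges = upgma(seqs)
for cluster, height, left, right in merges:
    print(cluster, height, left, right)
```

Under these assumptions the root sits at height 3, but the inner nodes already sit at height 1.5, so the two branches leaving the root each have length 3 - 1.5 = 1.5, which is the reasoning given in the question.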

So, am I wrong, or is the correction model wrong? Then I know how I must do it at the real test.

Sorry if my English is bad, and it's a little hard to explain because English is not my native language, but if I am right, you should know what I mean.

## Discrete morphology - Multistate Characters

Morphological data is commonly used for estimating phylogenetic trees from fossils. This tutorial will focus on estimating phylogenetic trees from discrete characters, those characters which can be broken into non-overlapping character states. This type of data has been used for estimation of phylogenetic trees for many years. In the past twenty years, Bayesian methods for estimating phylogeny from this type of data have become increasingly common.

This tutorial will give an overview of common models and assumptions when estimating a tree from discrete morphological data. We will use a dataset from Zamora et al. (2013). This dataset contains 27 extinct echinoderm taxa and 60 binary and multistate characters.

### Overview of Discrete Morphology Models

As technologies for obtaining low-cost and high-throughput nucleotide sequence data have become available, many scientists have become reliant on molecular data for phylogenetics. However, morphological data remain the only direct observations we have of most extinct organisms, and are an independent data source for understanding phylogeny. Many of the phylogenetic methods we will discuss in this tutorial were invented for use with sequence data. However, these methods are still very useful for discrete morphological data. We will examine some common assumptions for modeling data in a phylogenetic context, then move on to look at relaxing these assumptions.

Modeling discrete morphological data requires an understanding of the underlying properties of the data. When we work with molecular data, we know a priori that certain types of changes are more likely than others. For example, changes within a type of base (purine or pyrimidine) are much more likely than changes between types of bases. This information can be used to add parameters to the phylogenetic model. There are no equivalent and generalizable truths across characters in a morphological data matrix. For example, while 0 and 1 are commonly coded to “absence” and “presence”, this is not always the case, nor are all characters atomized at the same magnitude. At one character, a change of state may not reflect a large number of genetic changes. Theca shape (character 2 in the Zamora et al. 2013 dataset), for example, appears quite labile. At another, a change of state may reflect a rearrangement of genetic elements, or might have larger ramifications for the organism’s life and behavior. Character 38, the central plate of the lintel, may be one such character, as it seldom changes.

When we work with morphological data in a Bayesian context, we are performing these analyses after a long history of workers performing phylogenetic analysis in a maximum parsimony framework. Under maximum parsimony, trees are proposed. The number of changes in the data implied by each tree is then counted. The tree implying the fewest changes is considered the best. There may be multiple most parsimonious trees for a dataset. Parsimony has been the dominant method for estimating phylogenetic trees from discrete morphological data. Characters that cannot be used to discriminate between tree topologies are not typically collected by workers using parsimony. For example, characters that do not vary are not collected, as they have the same length (0 steps) on every tree. Likewise, autapomorphies (states unique to a single taxon) are typically not collected. As we will see later, this has ramifications for how we model the data.
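The counting step described above can be sketched with Fitch's algorithm for equally weighted, unordered characters. This is a minimal illustration, not code from the tutorial: it reuses the four toy sequences from the question at the top of this page, represents trees as nested tuples (an assumption made here for brevity), and scores the three possible groupings of four taxa.

```python
def fitch_site(tree, states):
    """Return (state_set, steps) for one character on a rooted binary tree.

    tree is either a leaf name (str) or a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_site(tree[0], states)
    right_set, right_steps = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_steps + right_steps
    # The children disagree: one change is implied at this node.
    return left_set | right_set, left_steps + right_steps + 1

def parsimony_length(tree, alignment):
    """Total number of changes the tree implies, summed over all characters."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

alignment = {"1": "ATAAGCC", "2": "TCACCTG", "3": "ATCCGAC", "4": "TGCACTG"}
trees = {
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
scores = {name: parsimony_length(tree, alignment) for name, tree in trees.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

For these data the grouping of 1 with 3 and 2 with 4 implies 11 changes, against 13 for each alternative, so it is the single most parsimonious tree.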

Graphical model showing the Mk model (left panel) and corresponding Rev code (right panel).

For many years, parsimony was the only way to estimate a phylogenetic tree from morphological data. In 2001, Paul Lewis published the Mk model of morphological evolution. The Mk model (Lewis 2001) is a generalization of the Jukes-Cantor model (Jukes and Cantor 1969) of nucleotide sequence evolution. This model, while simple, has allowed researchers to access the toolkit of phylogenetic methods available to researchers working with other discretely-valued data, such as nucleotides or amino acids.

#### The Mk Model

As mentioned above, the Mk model is a generalization of the JC model. This model assumes that all transitions between character states are equal, and that all characters in the matrix have the same transition matrix. The transition matrix for a binary trait looks like so:
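The matrix itself is missing from this copy. A sketch of what the description suggests for a binary character follows; the subscript notation is an assumption rather than the tutorial's own typesetting. Here $\mu_{ij}$ is the rate of change from state $i$ to state $j$, and each diagonal entry makes its row sum to zero.

$$
Q = \begin{pmatrix}
-\mu_{01} & \mu_{01} \\
\mu_{10} & -\mu_{10}
\end{pmatrix}
$$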

In this matrix, $\mu$ represents the transition probability between the two states that follow it. A transition matrix for multistate data simply expands.

However, the Mk model sets transitions to be equal from any state to any other state. In that sense, our multistate matrix really looks like this:
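This matrix is also missing here. Under the equal-rates assumption just described, a four-state version would take the following form. It is a reconstruction from the text, not the tutorial's original figure, and the choice of four states is for illustration only.

$$
Q = \begin{pmatrix}
-3\mu & \mu & \mu & \mu \\
\mu & -3\mu & \mu & \mu \\
\mu & \mu & -3\mu & \mu \\
\mu & \mu & \mu & -3\mu
\end{pmatrix}
$$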

You might notice that these transition rates are no different from what we might expect from an equal-weights parsimony matrix. In practice, the Mk model makes very few assumptions due to the complexity and non-generalizability of morphological data.
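Because every off-diagonal rate is the same, the probabilities of change along a branch have a closed form, just as for Jukes-Cantor. The sketch below assumes that `mu` is the rate from any state to any one other state and that `t` is the branch length in matching units. The function name is illustrative, not part of RevBayes.

```python
import math

def mk_transition_probs(k, mu, t):
    """Return (p_same, p_diff) for a k-state Mk model.

    p_same: probability of ending in the starting state after time t.
    p_diff: probability of ending in one particular other state.
    """
    decay = math.exp(-k * mu * t)
    p_same = 1.0 / k + (k - 1) / k * decay
    p_diff = (1.0 - decay) / k
    return p_same, p_diff

# With k = 4 and mu = 1/3 this is the Jukes-Cantor model with one expected change per unit time.
for t in (0.0, 0.1, 1.0, 10.0):
    p_same, p_diff = mk_transition_probs(4, 1.0 / 3.0, t)
    print(t, round(p_same, 4), round(p_diff, 4))
```

As the branch gets longer, both values approach 1/k, so on very long branches the starting state carries no information.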

This model may strike some readers as too simplistic to be adequate for morphological data. However, Bayesian methods are less likely to be misled by homoplasy than is parsimony (Felsenstein 1983). More recent work has demonstrated that the model outperforms parsimony in many situations, particularly those in which there is high homoplasy (Wright and Hillis 2014), with empirical work demonstrating that it fits many datasets reasonably well (Wright et al. 2016).

In the first part of this tutorial, we will estimate a tree under the Mk model as proposed by Lewis (2001). We will then relax core parameters of the model.

#### Ascertainment Bias

One remaining component of the model we have not yet discussed is ascertainment bias. Because workers using parsimony do not collect invariant characters and seldom collect autapomorphies, our data are biased. Imagine, for a moment, that you were to measure the average height in a room. But first, you asked the 10 shortest people to leave. Your estimate of the average height would be too high! In effect, this happens in morphological data as well. Because the characters with the fewest changes are not collected, we overestimate the amount of evolutionary change on the tree. At the time of publication, Lewis (2001) also included a correction factor for this bias.

These original corrections involved simulating parsimony non-informative characters along each proposed tree. These would be used to normalize the likelihood value. While this procedure is statistically valid, it is a bit slow. There are multiple ways to perform this correction (Allman and Rhodes 2008). RevBayes uses a dynamic likelihood approach to avoid repeated simulations.
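One common way to express the correction is to condition the likelihood on the character being variable, dividing each site likelihood by one minus the probability of a constant pattern. The sketch below illustrates that idea under strong simplifying assumptions: a star tree, uniform root frequencies, and a correction for constant characters only (not autapomorphies). It is not RevBayes's implementation, and all function names are hypothetical.

```python
import math
from itertools import product

def mk_probs(k, mu, t):
    """(P(same state), P(one particular other state)) under Mk after time t."""
    decay = math.exp(-k * mu * t)
    return 1.0 / k + (k - 1) / k * decay, (1.0 - decay) / k

def pattern_prob(pattern, branch_lengths, k=2, mu=1.0):
    """Probability of one tip pattern on a star tree with uniform root frequencies."""
    total = 0.0
    for root in range(k):
        p = 1.0 / k
        for state, t in zip(pattern, branch_lengths):
            p_same, p_diff = mk_probs(k, mu, t)
            p *= p_same if state == root else p_diff
        total += p
    return total

def prob_constant(branch_lengths, k=2, mu=1.0):
    """Total probability of the constant patterns, which were never collected."""
    n = len(branch_lengths)
    return sum(pattern_prob((s,) * n, branch_lengths, k, mu) for s in range(k))

def conditional_log_likelihood(patterns, branch_lengths, k=2, mu=1.0):
    """Log-likelihood of the data, conditioned on every character being variable."""
    log_norm = math.log(1.0 - prob_constant(branch_lengths, k, mu))
    return sum(math.log(pattern_prob(p, branch_lengths, k, mu)) - log_norm
               for p in patterns)

branches = [0.1, 0.2, 0.3]
variable = [p for p in product(range(2), repeat=3) if len(set(p)) > 1]
# After conditioning, the variable patterns form a proper probability distribution.
total_conditional = sum(
    pattern_prob(p, branches) / (1.0 - prob_constant(branches)) for p in variable
)
print(round(total_conditional, 12))
```

The check at the end shows why the normalization matters: without it, the probabilities of the patterns that could actually be observed would not sum to one.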

## Background

The inference of the Last Universal Common Ancestor (LUCA) of all modern cellular organisms can be approached in two ways. The “forward in time” approach uses the knowledge about conditions on the prebiotic Earth, tries to understand what kinds of replicating systems could emerge under these conditions, and proposes the mechanisms for these genetic systems to evolve into LUCA. The “backward in time” approach uses the information about currently living organisms – in particular, about completely sequenced genomes of Bacteria, Archaea, Eukarya, and even viruses – to reconstruct the traits of LUCA. The latter class of methods takes us directly to the last common ancestor of the currently living life forms, rather than to an ancestor of that ancestor [1], and the approach taken here is of that kind.

The problem of inference of ancestral gene content has been stated as follows: for each gene in every sequenced genome, determine its state as either ancestral, i.e., present in LUCA, or non-ancestral, i.e., absent from LUCA [1–4]. Since the task is prohibitively difficult for a gene that is found in just one genome, a practical modification of the problem is to label each set of orthologous genes, shared by several genomes, as either ancestral or non-ancestral (see [5] for definition of orthology and discussion of issues in practical detection of orthologs). In this study, we suggest a statistical approach to address this problem. We utilize two kinds of data: (a) the evolutionary history of a set of species, modeled as a species’ phylogenetic tree, the root of which is assumed to be the LUCA and (b) the record of presence and absence of orthologous genes in the same set of species, summarized as phyletic vectors, in which each coordinate represents the status of a gene in one species. As we argue in the last section of this paper, such a framework is a necessary prerequisite to more complex and realistic models of evolution, in particular those that would give the explicit account of horizontal gene transfer between species.

In the context of our current inference problem, there are two classes of evolutionary events that occur along the branches of a tree: gene gain, in which the state of a gene changes from absence to presence, and gene loss, in which it changes from presence to absence (in the simplest binary coding of presences and absences, gene gain is depicted as a change of state 0 → 1, and gene loss as 1 → 0). Any inference of the ancestral state of a gene relies on a quantitative model of such changes.

Different methods for ancestral state reconstruction have been introduced, including maximum parsimony (MP) [2, 6, 7] and approaches based on more extensive modeling, such as maximum likelihood (ML) and Bayesian inference (e.g., [8]). The MP approach infers the ancestral states by starting with the current states of each gene at the tips of the tree and proceeding backwards in time, to the root, minimizing the total number of events (gains and losses) during the evolutionary history of a given set of species. As always with parsimony approaches, it is possible that two or more scenarios consist of different events but have the same (minimal) number of them; this requires additional criteria for breaking the ties. More importantly, it is not clear that unweighted parsimony, which in effect postulates that a gain and a loss of a gene are equally likely, is best compatible with the data. Mirkin et al. [2] proposed the weighted parsimony approach, which takes into account the possible difference between gene gain rate and gene loss rate. This was done by using a parameter called gain penalty, defined as the ratio of gene gain rate to gene loss rate. It was observed, however, that the ancestral gene sets constructed with the gain penalty g = 1 tended to have the smallest number of genes whose predicted functions were biochemically coherent enough to sustain life, suggesting that the number of gene gains and losses encountered by a system may be at approximate equilibrium.
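To make the effect of weighting concrete, here is a minimal Sankoff-style sketch for a single presence/absence character. It assumes the gain penalty enters the score as the cost of one gain relative to a loss of cost 1, which is an interpretation made for illustration and not necessarily the exact formulation of Mirkin et al. The tree, taxon names, and phyletic vector are invented.

```python
INF = float("inf")

def sankoff(tree, presence, gain_penalty=1.0, loss_cost=1.0):
    """Return [cost if this node is absent, cost if this node is present].

    tree is a leaf name (str) or a tuple of subtrees; presence maps leaf -> 0/1.
    """
    if isinstance(tree, str):
        state = presence[tree]
        return [0.0 if state == 0 else INF, 0.0 if state == 1 else INF]
    # cost[parent_state][child_state]: a gain is 0 -> 1, a loss is 1 -> 0.
    cost = [[0.0, gain_penalty], [loss_cost, 0.0]]
    totals = [0.0, 0.0]
    for child in tree:
        child_costs = sankoff(child, presence, gain_penalty, loss_cost)
        for parent_state in (0, 1):
            totals[parent_state] += min(
                cost[parent_state][s] + child_costs[s] for s in (0, 1)
            )
    return totals

tree = (("A", "B"), ("C", "D"))
presence = {"A": 1, "B": 0, "C": 1, "D": 0}  # hypothetical phyletic vector

for g in (0.5, 1.0, 2.0):
    absent_cost, present_cost = sankoff(tree, presence, gain_penalty=g)
    print(g, absent_cost, present_cost)
```

With g = 1 the two root states tie, which illustrates the tie-breaking problem mentioned above. A penalty above 1 makes gains expensive and favours presence at the root, and a penalty below 1 favours absence.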

Methods based on maximum likelihood are of interest because they can take into account more information about the process of gene gains and gene losses, and because they can reflect the uncertainties in deciding the state of the gene at each ancestral node in the tree by assigning probabilities of presence and absence of each gene at this node. The likelihood framework can also incorporate the knowledge of branch lengths in the species tree and the lineage-specific differences between the frequencies of various classes of events across different genes.

Likelihood-based reconstructions of ancestral molecular traits have been attempted in recent years (see [9–13]), focusing mostly on inferring the ancestral nucleotide or protein sequences on the basis of sequences from present-day species. These approaches model the evolutionary history of an orthologous nucleotide or amino acid site as a continuous-time Markov process, in which the substitution rates are associated with time (tree branch length) and are estimated by maximizing the likelihood of the given phylogenetic tree and the sequences of a specific gene of interest. The most likely ancestral state of each site is then chosen by evaluating the marginal probability for each state. Many of these models can be modified to deal with the ancestral gene content problem.
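Adapted to gene content, the same machinery reduces to a two-state (absent/present) continuous-time Markov process with a gain rate and a loss rate. The sketch below is an illustration under simplifying assumptions, not the model of any cited study: a star tree, a root prior equal to the stationary distribution, and fixed rates instead of estimated ones. The function names and numbers are hypothetical.

```python
import math

def two_state_probs(gain, loss, t):
    """P[i][j]: probability of moving from state i to j (0 absent, 1 present) in time t."""
    total = gain + loss
    decay = math.exp(-total * t)
    p01 = gain / total * (1.0 - decay)
    p10 = loss / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def root_presence_probability(tip_states, branch_lengths, gain, loss):
    """Marginal posterior probability that the gene was present at the root."""
    prior = [loss / (gain + loss), gain / (gain + loss)]  # stationary frequencies
    joint = []
    for root in (0, 1):
        p = prior[root]
        for state, t in zip(tip_states, branch_lengths):
            p *= two_state_probs(gain, loss, t)[root][state]
        joint.append(p)
    return joint[1] / (joint[0] + joint[1])

# A gene present in three of four genomes, with losses six times as fast as gains.
p_root = root_presence_probability([1, 1, 1, 0], [0.2, 0.2, 0.3, 0.3], gain=0.5, loss=3.0)
print(round(p_root, 3))
```

Unlike parsimony, the result is a probability rather than a single reconstructed state, which is what allows the uncertainty at each ancestral node to be reported.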

Cohen et al. [8] have used a likelihood framework to analyze the binary gene presence-absence vectors for multiple orthologous genes in a set of existing species with completely sequenced genomes. Their analysis allowed the gene gain and loss rates to be unequal, and the results indicated that gain and loss rates that vary between gene families explain the observed data better than constant gain and loss rates. In another study, presences and absences were replaced with multiple states for the gene family size, to describe the history of a gene in relation to duplications and gene losses in the MP framework, without explicitly reconstructing gene content in LUCA [7].

Here we extend this class of models to examine the changes between the states of gene absence, of a single-copy gene presence, and presence of a group of in-paralogs, in the maximum likelihood framework. The calculation of the probability of the ancestral presence (“ancestrality”) of each gene uses the information on the changes in the number of in-paralogs of a gene in evolution. We explore several likelihood models of increasing complexity. Our results indicate that, when more than two states of genes are allowed, the estimated gene loss rates tend to be higher than estimated gene gain rates, with the loss-to-gain rate ratios around 6 for the majority of COGs. All models give relatively close estimates for the number of genes in LUCA, around 500 genes, but the identities of genes that are confidently placed into LUCA are different under different models. A probabilistic approach of this kind is a necessary step towards more detailed, quantitative reconstructions of gene content and metabolic networks in LUCA.

## Maximum parsimony tree - Am I right, or is the correction model right? - Biology

The aim of this project is to implement a versatile high-performance software library for phylogenetic analysis. The library should serve as a lower-level interface of PLL (Flouri et al. 2015) and should have the following properties:

• open source code with an appropriate open source license.
• 64-bit multi-threaded design that handles very large datasets.
• easy to use and well-documented.
• SIMD implementations of time-consuming parts.
• likelihood computations as fast as or faster than those of RAxML (Stamatakis 2014).
• fast implementation of the site repeats algorithm (Kobert 2017).
• functions for tree visualization.
• bindings for Python.
• generic and clean design.
• Linux, Mac, and Microsoft Windows compatibility.

Currently, libpll requires that GNU Bison and Flex are installed on the target system. On a Debian-based Linux system, the two packages can be installed using the command

`apt-get install flex bison`

The library also requires that a GNU system is available, as it uses several functions (e.g. `asprintf`) which are not present in the POSIX standard. This, however, will change in the future in order to have a more portable and cross-platform library.

The library can be compiled using either of the following two ways.

Cloning the repo: Clone the repo and build the executable and documentation using the following commands.

When using the cloned repository version, you will also need autoconf, automake and libtool installed. On a Debian-based Linux system, the packages can be installed using the command

The library will be installed on the operating system's standard paths. For some GNU/Linux distributions it might be necessary to add that standard path (typically /usr/local/lib ) to /etc/ld.so.conf and run ldconfig .

Microsoft Windows compatibility was tested with a cross-compiler and seems to work out-of-the-box using MinGW.

libpll currently implements the General Time Reversible (GTR) model (Tavaré 1986), which can be used for nucleotide and amino acid data. It supports models of variable rates among sites, the Inv+Γ (Gu et al. 1995), and has functions for computing the discretized rate categories for the gamma model (Yang 1994). Furthermore, it supports several methods for ascertainment bias correction (Kuhner et al. 2000, McGill et al. 2013, Lewis 2001, Leaché et al. 2015). Additional functionality includes tree visualization, functions for parsimony (minimum mutation cost) calculation and ancestral state reconstruction using Sankoff's method (Sankoff 1975, Sankoff and Rousseau 1975). The functions for computing partials, evaluating the log-likelihood and updating transition probability matrices are vectorized using the SSE3, AVX and AVX2 instruction sets.

Please refer to the wiki page and/or the examples directory.

libpll includes code from several other projects. We would like to thank the authors for making their source code available.

The code is written in C with some parts written using in-line assembler and intrinsic functions.

| File | Description |
|------|-------------|
| `compress.c` | Functions for compressing alignment into site patterns. |
| `core_derivatives_avx2.c` | AVX2 vectorized core functions for computing derivatives of the likelihood function. |
| `core_derivatives_avx.c` | AVX vectorized core functions for computing derivatives of the likelihood function. |
| `core_derivatives.c` | Core functions for computing derivatives of the likelihood function. |
| `core_derivatives_sse.c` | SSE vectorized core functions for computing derivatives of the likelihood function. |
| `core_likelihood_avx2.c` | AVX2 vectorized core functions for computing the log-likelihood. |
| `core_likelihood_avx.c` | AVX vectorized core functions for computing the log-likelihood. |
| `core_likelihood.c` | Core functions for computing the log-likelihood, that do not require partition instances. |
| `core_likelihood_sse.c` | SSE vectorized core functions for computing the log-likelihood. |
| `core_partials_avx2.c` | AVX2 vectorized core functions for updating vectors of conditional probabilities (partials). |
| `core_partials_avx.c` | AVX vectorized core functions for updating vectors of conditional probabilities (partials). |
| `core_partials.c` | Core functions for updating vectors of conditional probabilities (partials). |
| `core_partials_sse.c` | SSE vectorized core functions for updating vectors of conditional probabilities (partials). |
| `core_pmatrix_avx2.c` | AVX2 vectorized core functions for updating transition probability matrices. |
| `core_pmatrix_avx.c` | AVX vectorized core functions for updating transition probability matrices. |
| `core_pmatrix.c` | Core functions for updating transition probability matrices. |
| `core_pmatrix_sse.c` | SSE vectorized core functions for updating transition probability matrices. |
| `derivatives.c` | Functions for computing derivatives of the likelihood function. |
| `fasta.c` | Functions for parsing FASTA files. |
| `fast_parsimony_avx2.c` | AVX2 fast unweighted parsimony functions. |
| `fast_parsimony_avx.c` | AVX fast unweighted parsimony functions. |
| `fast_parsimony.c` | Non-vectorized fast unweighted parsimony functions. |
| `fast_parsimony_sse.c` | SSE fast unweighted parsimony functions. |
| `gamma.c` | Functions related to Gamma (Γ) function and distribution. |
| `hardware.c` | Hardware detection functions. |
| `lex_rtree.l` | Lexical analyzer for parsing newick rooted trees. |
| `lex_utree.l` | Lexical analyzer for parsing newick unrooted trees. |
| `likelihood.c` | Functions for computing the log-likelihood of a tree given a partition instance. |
| `maps.c` | Character mapping arrays for converting sequences to the internal representation. |
| `models.c` | Model parameters related functions. |
| `output.c` | Functions for output in terminal (i.e. conditional likelihood arrays, probability matrices). |
| `parse_rtree.y` | Functions for parsing rooted trees in newick format. |
| `parse_utree.y` | Functions for parsing unrooted trees in newick format. |
| `parsimony.c` | Parsimony functions. |
| `partials.c` | Functions for updating vectors of conditional probabilities (partials). |
| `phylip.c` | Functions for parsing phylip files. |
| `pll.c` | Functions for setting PLL partitions (instances). |
| `random.c` | Re-entrant multi-platform pseudo-random number generator. |
| `rtree.c` | Rooted tree manipulation functions. |
| `utree.c` | Unrooted tree manipulation functions. |
| `utree_moves.c` | Functions for topological rearrangements on unrooted trees. |
| `utree_svg.c` | Functions for SVG visualization of unrooted trees. |

The source code in the master branch is thoroughly tested before commits. However, mistakes may happen. All bug reports are highly appreciated. You may submit a bug report here on GitHub as an issue, or you could send an email to [email protected]

• Tomáš Flouri
• Diego Darriba
• Kassian Kobert
• Mark T. Holder
• Alexey Kozlov
• Alexandros Stamatakis

Special thanks to the following people for patches and suggestions:

Flouri T., Izquierdo-Carrasco F., Darriba D., Aberer AJ, Nguyen LT, Minh BQ, von Haeseler A., Stamatakis A. (2015) The Phylogenetic Likelihood Library. Systematic Biology, 64(2): 356-362. doi:10.1093/sysbio/syu084

Gu X., Fu YX, Li WH. (1995) Maximum Likelihood Estimation of the Heterogeneity of Substitution Rate among Nucleotide Sites. Molecular Biology and Evolution, 12(4): 546-557.

Kobert K., Stamatakis A., Flouri T. (2017) Efficient detection of repeating sites to accelerate phylogenetic likelihood calculations. Systematic Biology, 66(2): 205-217. doi:10.1093/sysbio/syw075

Leaché AL, Banbury LB, Felsenstein J., de Oca ANM, Stamatakis A. (2015) Short Tree, Long Tree, Right Tree, Wrong Tree: New Acquisition Bias Corrections for Inferring SNP Phylogenies. Systematic Biology, 64(6): 1032-1047. doi:10.1093/sysbio/syv053

Lewis PO. (2001) A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data. Systematic Biology, 50(6): 913-925. doi:10.1080/106351501753462876

Sankoff D. (1975) Minimal Mutation Trees of Sequences. SIAM Journal on Applied Mathematics, 28(1): 35-42. doi:10.1137/0128004

Sankoff D, Rousseau P. (1975) Locating the Vertices of a Steiner Tree in Arbitrary Metric Space. Mathematical Programming, 9: 240-246. doi:10.1007/BF01681346

Stamatakis A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9): 1312-1313. doi:10.1093/bioinformatics/btu033

Tavaré S. (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. American Mathematical Society: Lectures on Mathematics in the Life Sciences, 17: 57-86.

Yang Z. (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution, 39(3): 306-314. doi:10.1007/BF00160154

## Results

In addition to the analysis below, we provide the results of our simulations in full (see S1 Table).

### Probabilities

Fig 1 displays the success probability, separately visualized for each of the mutation probabilities q = 0.08, 0.16, ⋯, 0.48. The same information is given numerically in Table 1, where success probabilities of at least 0.90 are highlighted. In the most amenable setting (q = 0.08), having 64 characters is not enough to obtain a success probability of 0.90 except for the case n = 5; for 6 ≤ n ≤ 12, having 128 characters is sufficient. The extreme case q = 0.48 is intractable: even for five taxa and 256 characters, the true phylogeny could be inferred in only about 70% of the experiments. Fig 1 clearly shows that if the number of characters is kept fixed, the probability of success rapidly decreases as the number of taxa increases.

Homology. When features, including morphological characters and gene loci, are inherited from a common ancestor, for example, a gene in two species originating from a single ancestral gene.

Orthologs. Homologous sequences that have diverged due to speciation events.

Substitution models. Continuous-time Markov chain probabilistic models that describe changes between nucleotides or amino acids over evolutionary time.

Species tree. A phylogenetic tree for a set of species that underlies the gene trees at individual loci.

Paralogs. Homologous sequences that have diverged due to duplication events so that both copies have descended side by side during the history of an organism.

Xenologs. Homologous sequences originating from horizontal gene transfer (also known as lateral gene transfer).

Alignment. Insertion of gaps in homologous sequences so that nucleotides or amino acids in the same column are homologous.

Gene tree. The phylogenetic or genealogical tree of sequences at a gene locus or genomic region.

Systematic errors. Errors due to incorrect model assumptions.

Incomplete lineage sorting. Discordance of gene trees from the species tree due to ancestral polymorphism.

Topology. The branching pattern of a phylogenetic tree indicating relationships between taxa.

Long-branch attraction (LBA). The phenomenon of inferring an incorrect tree in which taxa with long branches are grouped together.

Clade. A group of taxa on a tree that includes their most recent common ancestor and all its descendants, also known as a monophyletic group.

Stochastic errors. Errors due to the finite length of sequences in the alignment.

Homogeneous model. A model that assumes the same substitution rate or process across alignment sites, taxa and time.

Compositional homogeneity. Homogeneity in nucleotide or amino acid frequencies across lineages of a phylogeny.

Site-heterogeneous models. Models that assume different substitution rates or processes across sites of the alignment.

Profile mixture models. Models that assume multiple sets of state frequencies for sites (for example, CAT, C10–C60).

Coalescent. The process of lineage joining when one traces the history of a sample of sequences backwards in time.

Genetic drift. The process of random changes in allele frequencies over generations due to the stochastic nature of reproduction.


## Investigation on the Conserved MicroRNA Genes in Higher Plants

Analysis of evolving microRNA repertoires within the plant domain can further corroborate our understanding of genome evolution and plasticity. An extensive collection of relatively unbiased miRBase-registered plant miRNAs and predicted unlisted MIRs from 23 plant ESTs was examined. As a result, 4324 pre-miRNAs were predicted and classified into 656 miRNA gene families, most of which (57.81%) were transposons. Of 216 newly identified pre-miRNAs, 103 distinct types belonged to reduced-complexity/repeated regions. No collinearity was found between the number of miRNAs in each species and the corresponding genome size. Duplications of MIRs were evident, with more MIR paralogs in Liliopsida than in dicots. Because no apparent phylogenetic pattern emerged, Dollo maximum parsimony was used, which established an acceleration of gains and potential losses of miRNA gene families within Mesangiospermae during the last 200 million years. The phylogenetic analysis of Liliopsida, in contrast to that of Eudicotyledons, agreed with the tree reconstructed from the possible expansion of distinct MIR families. In marked contrast to dicots, the degrees of resemblance in Liliopsida were higher than in their direct predecessors. Analyses of recent monophyletic lineages illustrated horizontal transfer of miRNA genes.
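To make the Dollo criterion concrete, here is a minimal Python sketch for a single presence/absence character, such as one miRNA gene family. It assumes the usual textbook reading of Dollo parsimony: the character is gained exactly once, at the most recent common ancestor of all taxa that carry it, and every absence below that point is explained by a loss. The tree, the taxon names (`A` to `D`) and the function name are hypothetical and are not taken from the study.

```python
# Hypothetical rooted tree: each internal node maps to its children.
# Leaves (A, B, C, D) do not appear as keys.
TREE = {"root": ["A", "n1"], "n1": ["n2", "D"], "n2": ["B", "C"]}


def dollo_gain_and_losses(tree, present, root="root"):
    """Place one gain and count losses for a binary character under Dollo.

    tree    -- dict mapping each internal node to a list of children
    present -- set of leaf names in which the character is present
    Returns (node where the single gain is placed, number of losses).
    """
    def has_present(node):
        # True if any leaf below (or at) this node carries the character.
        kids = tree.get(node, [])
        if not kids:
            return node in present
        return any(has_present(k) for k in kids)

    if not has_present(root):
        return None, 0  # character absent everywhere: nothing to explain

    # Walk down from the root while only one child lineage carries the
    # character; the stopping point is the MRCA of all carriers.
    origin = root
    while True:
        carrying = [k for k in tree.get(origin, []) if has_present(k)]
        if len(carrying) != 1:
            break
        origin = carrying[0]

    # Below the gain, every branch leading to a subtree with no carriers
    # counts as exactly one loss.
    losses = 0
    stack = [origin]
    while stack:
        node = stack.pop()
        for kid in tree.get(node, []):
            if has_present(kid):
                stack.append(kid)
            else:
                losses += 1
    return origin, losses


origin, losses = dollo_gain_and_losses(TREE, {"B", "D"})
print(origin, losses)  # n1 1
```

With the family present in `B` and `D` only, the single gain falls on node `n1` and one loss is needed on the branch to `C`. Summing such counts over all families and branches is one way to obtain gain and loss totals per lineage.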


## Ancestral Sequence Reconstruction and Infectious Disease

Ancestral sequence reconstruction can be used to understand viral evolution and to support therapeutic applications (Arenas 2020). Understanding the evolutionary histories of viruses can help identify target regions for future therapeutics and help predict new viral resistance to current drugs.
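As a rough illustration of what reconstructing an ancestral sequence involves, the sketch below applies the bottom-up pass of Fitch's parsimony algorithm to a toy alignment. This is a deliberately simple stand-in: the studies cited here may well rely on likelihood-based or Bayesian models instead. The tree, the strain names (`s1` to `s4`) and the sequences are invented for the example.

```python
# Hypothetical rooted binary tree and aligned tip sequences.
TREE = {"root": ["n1", "n2"], "n1": ["s1", "s2"], "n2": ["s3", "s4"]}
SEQS = {"s1": "ACGT", "s2": "ACGA", "s3": "ATGA", "s4": "ATGA"}


def fitch(tree, tip_states, node):
    """Fitch bottom-up pass for one alignment column.

    Returns (set of equally parsimonious states at `node`,
             minimum number of changes in the subtree below it).
    Assumes every internal node has exactly two children.
    """
    kids = tree.get(node, [])
    if not kids:
        return {tip_states[node]}, 0
    (left, c_left), (right, c_right) = (fitch(tree, tip_states, k) for k in kids)
    common = left & right
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, c_left + c_right
    # Children disagree: keep both options and pay one change.
    return left | right, c_left + c_right + 1


def root_state_sets(tree, seqs, root="root"):
    """Run the pass column by column; return root state sets and total changes."""
    length = len(next(iter(seqs.values())))
    sets, total = [], 0
    for i in range(length):
        column = {tip: seq[i] for tip, seq in seqs.items()}
        states, changes = fitch(tree, column, root)
        sets.append(states)
        total += changes
    return sets, total


ROOT_SETS, TOTAL_CHANGES = root_state_sets(TREE, SEQS)
```

For this toy input the root is resolved as `A` at the first and last sites and `G` at the third, while the second site stays ambiguous between `C` and `T`; the whole tree needs two changes. Resolving such ambiguities requires a second, top-down pass or a probabilistic model.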

Ancestral sequence reconstruction is also of emerging interest for vaccine technologies, especially for the development of vaccines against rapidly evolving viruses such as HIV and influenza strains (Gaschen et al. 2002; Ducatez et al. 2011). Using ancestrally derived sequences to create vaccine reagents takes advantage of the evolutionary history of the virus. This strategy contrasts with methods that construct a consensus sequence from different viral strains and so ignore phylogenetic structure. A vaccine reagent can be based on the last common ancestral sequence of all circulating strains, or on sequences from other points in the tree. For example, when the phylogenetic topology is skewed, the “center of tree” method may be used. The center of tree method considers the ancestral sequence that minimizes the evolutionary distance between different viral strains of interest (Nickle et al. 2003).
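The center of tree idea can be sketched in a few lines of Python. The version below makes two simplifying assumptions that are not taken from Nickle et al. (2003): the criterion is the plain sum of branch-length distances to the tips, and only internal nodes are considered as candidates, not points along branches. The ladder-shaped tree and its unit branch lengths are invented to mimic a skewed topology.

```python
from collections import defaultdict

# Hypothetical skewed (ladder-like) tree as weighted edges: (node, node, length).
EDGES = [
    ("root", "s1", 1), ("root", "n1", 1),
    ("n1", "s2", 1), ("n1", "n2", 1),
    ("n2", "s3", 1), ("n2", "n3", 1),
    ("n3", "s4", 1), ("n3", "s5", 1),
]
TIPS = {"s1", "s2", "s3", "s4", "s5"}


def build_adjacency(edges):
    """Turn the edge list into an undirected adjacency mapping."""
    adj = defaultdict(dict)
    for a, b, length in edges:
        adj[a][b] = length
        adj[b][a] = length
    return adj


def path_lengths(adj, start):
    """Branch-length distance from `start` to every node (paths in a tree are unique)."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adj[node].items():
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                stack.append(neighbor)
    return dist


def center_of_tree(adj, tips):
    """Internal node with the smallest summed distance to all tips."""
    best, best_total = None, float("inf")
    for node in sorted(adj):
        if node in tips:
            continue
        dist = path_lengths(adj, node)
        total = sum(dist[t] for t in tips)
        if total < best_total:
            best, best_total = node, total
    return best, best_total


ADJ = build_adjacency(EDGES)
print(center_of_tree(ADJ, TIPS))  # ('n2', 10.0)
```

On this tree the root, which plays the role of the last common ancestor, has a summed distance of 14 to the five tips, whereas node `n2` reaches them with a total of 10. That gap is the kind of difference a skewed topology can produce between the two choices.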

In the age of SARS-CoV-2, ancestral sequence reconstruction has become of immediate interest for vaccine development (Zhou et al. 2020). Like influenza and the retrovirus HIV, both of which evolve rapidly, SARS-CoV-2 is an RNA virus. However, a recent study used ancestral sequence reconstruction to demonstrate that, unlike in other RNA viruses, mutations in SARS-CoV-2 are rare, as the evolution rate is slower than the transmission rate. Because SARS-CoV-2 evolves slowly, a single vaccine candidate may be enough to match all currently circulating SARS-CoV-2 variants (Dearlove et al. 2020).

Beyond their role in disease, viruses are also developed as vehicles for gene therapy (Ivics et al. 1997). The adeno-associated virus (AAV) has been considered an efficient gene therapy vector for both inherited and infectious diseases. However, the complex structure of AAV and the diversity in its target receptor binding make designed variants difficult to assemble properly. Using ancestral sequence reconstruction, Zinn et al. (2015) were able to produce a virus with a structure that would remain evolutionarily resilient to future mutations and maintain broad clinical applicability.


### Science

Vol 323, Issue 5911
09 January 2009


By Iván F. Acosta, Hélène Laparra, Sandra P. Romero, Eric Schmelz, Mats Hamberg, John P. Mottinger, Maria A. Moreno, Stephen L. Dellaporta

Science 09 Jan 2009: 262-265

A gene that controls male floral development in maize is involved in synthesis of a hormone that suppresses female organ development.

1. Mugrel

I think this has already been discussed, use the search on the forum.

2. Yishai

I cannot solve it.

3. Williams

In my opinion, this is a big error.