
4.7: Comparative Genome Analysis - Biology


Paralogous Genes

  • Genes that are similar because of descent from a common ancestor are homologous.
  • Homologous genes that have diverged after speciation are orthologous.
  • Homologous genes that have diverged after duplication are paralogous.

One can identify paralogous groups of genes encoding proteins of similar but not identical function within a species, e.g., the ABC transporters, which have 80 members in E. coli.

Core proteomes vary little in size

Proteome: all the proteins encoded in a genome

To calculate the Core proteome:

Count each group of paralogous proteins only once

Number of distinct protein families in each organism

Species        Number of genes    Core proteome
Haemophilus    1709               1425
Yeast          6241               4383
Worm           18424              9453
Fly            13601              8065
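The counting rule above (each group of paralogous proteins counts once) can be sketched in a few lines. This is a minimal illustration, not the method used to build the table: the toy proteome and its family labels are hypothetical, and only the four rows of gene and core proteome counts come from the table.

```python
def core_proteome_size(protein_to_family):
    """Count distinct protein families, so each paralogous group counts once."""
    return len(set(protein_to_family.values()))


def core_fraction(genes, core):
    """Fraction of the gene count that remains after collapsing paralogs."""
    return core / genes


# Hypothetical toy proteome: three ABC transporter paralogs form one family.
toy_proteome = {
    "abcA": "ABC_transporter",
    "abcB": "ABC_transporter",
    "abcC": "ABC_transporter",
    "recA": "RecA",
    "dnaK": "Hsp70",
}

# (number of genes, core proteome) as listed in the table.
table = {
    "Haemophilus": (1709, 1425),
    "Yeast": (6241, 4383),
    "Worm": (18424, 9453),
    "Fly": (13601, 8065),
}

print(core_proteome_size(toy_proteome))  # 3
for species, (genes, core) in table.items():
    print(f"{species}: core/genes = {core_fraction(genes, core):.2f}")
```

The ratio falls as gene number rises, which is another way of saying that much of the increase in gene number comes from paralogs rather than new protein families.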

Figure 4.22. Little change in core proteome size in eukaryotes

Core proteomes are conserved

  • Many of the proteins in the core proteomes are shared among eukaryotes
  • 30% of fly genes have orthologs in worm
  • 20% of fly genes have orthologs in both worm and yeast
  • 50% of fly genes have likely orthologs in mammals

The function of proteins in flies (and worms and yeast) provides strong indicators of function in humans. Flies have orthologs to 177 of the 289 human disease genes.

Figure 4.23. Functional categories in eukaryotic proteomes

Figure 4.24. Distribution of the homologues of the predicted human proteins

Conserved Segments in the Human and Mouse Genomes

Figure 4.25. Regions of human chromosomes homologous to regions of mouse chromosomes (indicated by the colors). For example, virtually all of human chromosome 20 is homologous to a region on mouse chromosome 2, and almost all of human chromosome 17 is homologous to a region on mouse chromosome 11. More commonly, segments of a given human chromosome are homologous to different mouse chromosomes. Chromosomes from mouse have more rearrangements relative to humans than do chromosomes from many other mammals, but the homologous relationships are still readily apparent.

CHROMOSOMES and CHROMATIN

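Chromosomes are the cytological package for genes. Genomes are much longer than the cellular compartment they occupy, as a comparison of compartment dimensions with the length of the DNA shows:

  • Phage T4: compartment \(0.065 \times 0.10\ \mu\text{m}\); DNA \(55\ \mu\text{m} = 170\ \text{kb}\)
  • E. coli: compartment \(1.7 \times 0.65\ \mu\text{m}\); DNA \(1.3\ \text{mm} = 4.6 \times 10^3\ \text{kb}\)
  • Nucleus (human): compartment \(6\ \mu\text{m}\) diameter; DNA \(1.8\ \text{m} = 6 \times 10^6\ \text{kb}\)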

Definition: Packing ratio

\[ \text{Packing ratio} = \dfrac{\text{length of DNA}}{\text{length of the unit that contains it}} \]

The smallest human chromosome contains about

\[ 46 \times 10^6\ \text{bp} = 14{,}000\ \mu\text{m} = 1.4\ \text{cm of DNA}. \]

When condensed for mitosis, this chromosome is about 2 µm long. The packing ratio is therefore about 7000 (14,000 µm / 2 µm).
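The arithmetic behind this packing ratio can be checked with a short sketch. It assumes the lengths are read in micrometres (14,000 µm equals the 1.4 cm quoted for the DNA, and the condensed chromosome is taken as 2 µm long); the function name is only illustrative.

```python
def packing_ratio(dna_length_um, unit_length_um):
    """Length of DNA divided by the length of the unit that contains it."""
    if unit_length_um <= 0:
        raise ValueError("unit length must be positive")
    return dna_length_um / unit_length_um


# Smallest human chromosome: 14,000 um of DNA in a 2 um mitotic chromosome.
smallest_chromosome = packing_ratio(14_000, 2)
print(smallest_chromosome)  # 7000.0
```

Both lengths must be in the same unit for the ratio to be meaningful, which is why the 1.4 cm figure is converted to micrometres first.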

Loops, matrix and the chromosome scaffold

When DNA is released from mitotic chromosomes by removing most of the proteins, long loops of DNA are seen, emanating from a central scaffold that resembles the remnants of the chromosome.

Figure 4.26: EM analysis of intact nuclei shows network of fibers called a matrix.

Biochemical preparations using salt and detergent to remove proteins and nuclease to remove most of the DNA leave a "matrix" or "scaffold" preparation. Similar DNA sequences are found in these preparations; these sequences are called matrix attachment regions = MARs (or scaffold attachment regions = SARs). They tend to be A+T rich and have sites for cleavage by topoisomerase II. Topoisomerase II is one of the major components of the matrix preparation, but the composition of the matrix is still in need of further study.

Since it is attached at the base to the matrix, each loop is a separate topological domain and can accumulate supercoils of DNA.

From the measured sizes of loops, and calculations based on the amount of nicking required to relax DNA within the loops, we estimate that the average size of these loops is about 100 kb (85 kb based on nicking frequency for relaxation).

Some evidence suggests that replication and possibly some transcriptional control may be exerted at the bases of the loops.

Interphase chromatin and mitotic chromosomes

During interphase, i.e. between mitotic divisions, the highly condensed mitotic chromosomes spread out through the nucleus to form chromatin. Interphase chromatin is not very densely packed in most of the nucleus (euchromatin). In some regions it is very densely packed, comparable to a mitotic chromosome (heterochromatin).

Both interphase chromatin and mitotic chromosomes are made of a 30 nm fiber. The mitotic chromosome is much more tightly coiled than interphase chromatin.

Most transcription occurs in euchromatin.

  • Constitutive heterochromatin = nonexpressed regions that are condensed (compact) in all cells (e.g. centromeric simple repeats)
  • Facultative heterochromatin = inactive in only some cell lineages, active in others.

One example of heterochromatin is the inactive X chromosome in female mammals. The choice of which X chromosome to inactivate is random in various cell lineages, leading to mosaic phenotypes for some X-linked traits. For instance, one genetic determinant of coat color in cats is X-linked, and the patchy coloration on calico cats results from this random inactivation of one of the X chromosomes, leading to the lack of expression of this determinant in some but not all hair cells.

Cytologically visible bands in chromosomes

G bands and R bands in mammalian mitotic chromosomes (Figure 4.27)

Giemsa‑dark (G) bands tend to be A+T rich, with a large number of L1 repeats.

Giemsa‑light bands tend to be more G+C rich, with very few L1 repeats and many Alu repeats.

(R bands are about the same as Giemsa-light bands. They are visualized by a different preparative procedure so that the "reverse" of the Giemsa-stained images are seen.)

T bands are adjacent to telomeres, do not stain with Giemsa, and are extremely G+C rich, with lots of genes and myriad Alu repeats.

The functional significance of these bands is still under active investigation.

One can localize a gene to a particular region of a chromosome by in situ hybridization with a radioactive or, now more commonly, fluorescent probe for the gene. The region of hybridization is determined by simultaneously viewing the stained banding pattern and the hybridization pattern. Many spreads of mitotic chromosomes are viewed and scored, and the gene is localized to the chromosomal region with a significantly greater incidence of hybridization signal than that seen to the rest of the chromosomes.

Another common method of mapping the location of genes is by hybridization to DNA isolated from a panel of somatic cell hybrids, each hybrid cell carrying a small subset of, e.g., human chromosomes on a hamster background. Some hybrid cells carry broken human chromosomes, which allows even more precise localization (see Figure 1.8.2, "J-1 series").

Polytene chromosomes are visible in several Drosophila tissues

These contain many copies of the chromosomes, side by side in register. Thus most chromosomal regions are highly amplified in these tissues. Chromosomal stains reveal a characteristic banding pattern, which is the basis for the cytological map. The cytological map (of polytene bands) combined with the genetic map gives a cytogenetic map, which is a wonderful guide to the Drosophila genome. One can localize a gene to a particular region by in situ hybridization (in fact the technique was invented using Drosophila polytene chromosomes).

Multiple genes per band on mammalian chromosomes

Figure 4.27 gives a view of human chromosome 11 at several different levels of resolution. The region 11p15 has many genes of interest, including genes whose products regulate cell growth (HRAS), determination and differentiation of muscle cells (MYOD), carbohydrate metabolism (INS), and mineral metabolism (PTH). The β-globin gene (HBB) and its closely linked relatives are also in this region. A higher resolution view of 11p15, based on a compilation of genetic and physical mapping (Cytogenetics and Cell Genetics, 1995), is shown next to the classic ideogram (banding pattern). This is in a scale of millions of base pairs, and one can start to get a feel for gene density in this region. Interestingly, it varies quite a lot, with the gene-dense sub-bands near the telomeres; these may correspond to the T-bands discussed above. Other genes appear to be more widely separated. For instance, each of the β-like globin genes is separated by about 5 to 8 kb from each other (see the map of the YAC, or yeast artificial chromosome, carrying the β-like globin genes), and this gene cluster is about 1000 kb (i.e. 1 Mb) from the nearest genes on the map. However, further mapping will likely find many other genes in this region. Now even more information is available at the web sites mentioned earlier.

Figure 4.27.

The relationship between recombination distances and physical distances varies substantially among organisms. In human, one centiMorgan (or cM) corresponds to roughly 1 Mb, whereas in yeast 1 cM corresponds to about 2 kb, and this value varies at least 10-fold along the different yeast chromosomes. This is a result of the different frequencies of recombination along the chromosomes.
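As a rough illustration of these conversion factors, the sketch below turns a genetic distance into an approximate physical distance. The two rates are the averages quoted in the paragraph above; the dictionary and function names are hypothetical, and because recombination frequency varies along chromosomes (at least 10-fold in yeast), any single result should be read as an order-of-magnitude estimate.

```python
# Average physical distance per centiMorgan, in kb, as quoted in the text.
KB_PER_CM = {
    "human": 1000.0,  # roughly 1 Mb per cM
    "yeast": 2.0,     # about 2 kb per cM
}


def physical_distance_kb(genetic_distance_cm, organism):
    """Estimate physical distance (kb) from genetic distance (cM)."""
    return genetic_distance_cm * KB_PER_CM[organism]


print(physical_distance_kb(5, "human"))  # 5000.0
print(physical_distance_kb(5, "yeast"))  # 10.0
```

The same 5 cM interval therefore spans about 5 Mb in human but only about 10 kb in yeast.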

Specialized regions of chromosomes

Centromere: region responsible for segregation of chromosomes at mitosis and meiosis. The centromere is a constricted region (usually) toward the center of the chromosome (although it can be located at the end, as with mouse chromosomes.) It contains a kinetochore, a fibrous region to which microtubules attach as they pull the chromosome to one pole of the dividing cell. DNA sequences in this region are highly repeated simple sequences (in Drosophila, the unit of the repeat is about 25 bp long, repeated hundreds of times). Specific proteins are at the centromere, and are now intensely investigated.

Telomere: forms the ends of the linear DNA molecule that makes up the chromosome. In humans, the telomeres are composed of thousands of repeats of CCCTAA. Variants of this sequence are found in the telomeres of other species. Telomeres are formed by telomerase; this enzyme catalyzes the synthesis of more ends at each round of replication to stabilize linear molecules.
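The repeat is written here as CCCTAA; read on the complementary strand it is TTAGGG, the form in which the human telomeric repeat is often quoted. A minimal sketch of that relationship follows. The helper name and the four-repeat toy sequence are illustrative only.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


repeat = "CCCTAA"
print(reverse_complement(repeat))  # TTAGGG

# Toy telomere fragment: four tandem copies of the repeat.
fragment = repeat * 4
print(fragment.count(repeat))  # 4
```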

The Principal Proteins in Chromatin are Histones

Composition of chromatin: Various biochemical methods are available to isolate chromatin from nuclei. Chemical analysis of chromatin reveals proteins and DNA, with the most abundant proteins being the histones. A complex set of less abundant proteins is referred to as the nonhistone chromosomal proteins.

The histones and DNA are present in equal masses.

Mass Ratio DNA: histones: nonhistone proteins: RNA = 1: 1: 1: 0.1

Histones are small, basic (positively charged), highly conserved proteins. They bind to each other to form specific complexes, around which DNA wraps to form nucleosomes. The nucleosomes are the fundamental repeating unit of chromatin.

There are 5 histones, 4 in the core of the nucleosome and one outside the core.

H3, H4: Arg rich, most conserved sequence (CORE histones)

H2A, H2B: slightly Lys rich, fairly conserved (CORE histones)

H1: very Lys rich, most variable in sequence between species.

X-ray diffraction studies of histone complexes and the nucleosome core have provided detailed insight into how histones interact with each other and with DNA in this fundamental entity of chromatin structure.

Key reference: "Crystal structure of the nucleosome core particle at 2.8 Å resolution" by Luger, K., Mader, A., Richmond, R.K., Sargent, D.F. & Richmond, T.J. in Nature 389: 251-260 (1997)

Histone Interactions via the Histone fold

The core histones have a highly positively charged amino-terminal tail, and most of the rest of the protein forms an α-helical domain. Each core histone has at least 3 α-helices.

Figure 4.28

The α-helical domain forms a characteristic histone fold, in which shorter α1 and α3 helices are perpendicular to the longer α2 helix. The α-helices are separated by two loops, L1 and L2. The histone fold is the dimerization domain between pairs of histones, mediating the formation of crescent-shaped heterodimers H3-H4 and H2A-H2B. The histone-fold motifs of the partners in a pair are antiparallel, so that the L1 loop of one is adjacent to the L2 loop of the other.

Figure 4.29

A structure very similar to the histone fold has now been seen in other nuclear proteins, such as some subunits of TFIID, a key component in the general transcription machinery of eukaryotes. It also serves as a dimerization domain for these proteins.

Two H3-H4 heterodimers bind together to form a tetramer.

Nucleosomes are the Subunits of the Chromatin Fiber

The most extended chromatin fiber is about 10 nm in diameter. It is composed of a series of histone-DNA complexes called nucleosomes.

Principal lines of evidence for this conclusion are:

  1. Observations of this 10 nm fiber in the electron microscope showed a series of bodies that looked like beads on a string. We now recognize the beads as the nucleosomal cores and the string as the linker between them.
  2. Digestion of DNA in chromatin or nuclei with micrococcal nuclease releases a series of products that contain DNA of discrete lengths. When the DNA from the products of micrococcal nuclease digestion was run on an agarose gel, it was found to be a series of fragments of 200 bp, 400 bp, 600 bp, 800 bp, etc., i.e. integral multiples of 200 bp. This showed that cleavage by this nuclease, which has very little sequence specificity, was restricted to discrete regions in chromatin. Those regions of cleavage are the linkers.
  3. Physical studies, including both neutron diffraction and electron diffraction data on fibers and, most recently, X-ray diffraction of crystals, have provided more detailed structural information.
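The micrococcal nuclease result in item 2 can be illustrated with a toy model. The sketch assumes an idealized array in which every nucleosome (core plus linker) spans exactly 200 bp and the nuclease cuts only within linkers; real repeat lengths vary, and the function and its arguments are hypothetical.

```python
def ladder_fragments(n_nucleosomes, cut_linkers, repeat_bp=200):
    """Fragment lengths (bp) after cutting an array of nucleosomes at chosen linkers.

    cut_linkers holds indices i (1 .. n_nucleosomes - 1), each meaning a cut
    in the linker between nucleosome i and nucleosome i + 1.
    """
    cuts = sorted(set(cut_linkers))
    if any(c < 1 or c >= n_nucleosomes for c in cuts):
        raise ValueError("linker index must lie between 1 and n_nucleosomes - 1")
    boundaries = [0] + cuts + [n_nucleosomes]
    return [(end - start) * repeat_bp
            for start, end in zip(boundaries, boundaries[1:])]


# Partial digestion of a 6-nucleosome array, cut after nucleosomes 1 and 3.
fragments = ladder_fragments(6, [1, 3])
print(fragments)  # [200, 400, 600]
```

Whatever subset of linkers is cut, every fragment is an integral multiple of the repeat length, which is the ladder seen on the gel.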

The nucleosomal core is composed of an octamer of histones with 146 bp of duplex DNA wrapped around it in 1.65 very tight turns. The octamer of histones is actually a tetramer, (H3)₂(H4)₂, at the central axis, flanked by two H2A-H2B dimers (one at each end of the core).

Figure 4.30. Schematic views of the nucleosomal core

The 10 nm fiber is composed of a string of nucleosomal cores joined by linker DNA. The length of the linker DNA varies among tissues within an organism and between species, but a common value is about 60 bp. The nucleosome is the core plus the linker, and thus contains about 200 bp of DNA.

Figure 4.31. A string of nucleosomes

Detailed structure of the nucleosomal core.

Path of the DNA and tight packing

The 146 bp of DNA is wrapped around the histone octamer in 1.65 turns of a flat, left-handed toroidal superhelix. Thus 14 turns or "twists" of the DNA are in the 1.65 superhelical turns, presenting 14 major and 14 minor grooves to the histone octamer. Pancreatic DNase I will cleave DNA on the surface of the core about every 10 bp, when each twist of the DNA is exposed on the surface.

The DNA superhelix has an average radius of 41.8 Å and a pitch of 23.9 Å. This is a very tight wrapping of the DNA around the histones in the core - note that the duplex DNA on one turn is only a few Å from the DNA on the next turn! The DNA is not uniformly bent in this superhelix. As the DNA wraps around the histones, the major and then minor grooves are compressed, but not in a uniform manner for all twists of the DNA. G+C rich DNA favors the major groove compression, whereas A+T rich DNA favors the minor groove compression. This is an important feature in translational positioning of nucleosomes and could also affect the affinity of different DNAs for histones in nucleosomes.

The DNA phosphates have high mobility when not contacting histones; the DNA phosphates facing the solvent are much more mobile than is seen with other protein-DNA complexes.

Figure 4.32. A cross-sectional view of the nucleosome core showing histone heterodimers and contacts with DNA. This image corresponds to the proteins and DNA in about one half of the nucleosome.

The left-handed toroidal supercoiling of DNA in nucleosomal cores is the equivalent of a right-handed interwound, hence negative, supercoil. Thus the DNA in nucleosomes is effectively underwound.

Figure 4.33.

Histones in the nucleosome core particle

The protein octamer is composed of four dimers (2 H2A-H2B pairs and 2 H3-H4 pairs) that interact through the "histone fold". The two H3-H4 pairs interact through a 4-helix bundle formed between the two H3 proteins to make the (H3)₂(H4)₂ tetramer. Each H2A-H2B pair interacts with the (H3)₂(H4)₂ tetramer through a second 4-helix bundle between H2B and H4 histone folds.

The histone-fold regions of the (H3)₂(H4)₂ tetramer bind to the center of the DNA, covering a total of about 6 twists of the DNA, or 3 twists of DNA per H3-H4 dimer. Those of the H2A-H2B dimers cover a comparable amount of DNA, 3 twists per dimer. Additional helical regions extend from the histone fold regions and are an integral part of the core protein within the confines of the DNA superhelix.

Histone-DNA interactions in the core particle.

The histone-fold domain of each heterodimer (H3-H4 and H2A-H2B) binds 2.5 turns of DNA double helix, generating a 140° bend. The interaction with DNA occurs at two types of sites:

  1. The L1 plus L2 loops at the narrowly tapered ends of each heterodimer form a similar DNA binding site for each histone pair. The L1-L2 loops interact with DNA at each end of the 2.5 turns of DNA.
  2. The α1 helices of each partner in a pair form the convex surface in the center of the DNA binding site. The principal interactions are H-bonds between amino acids and the phosphate backbone of the DNA (there is little sequence specificity to histone-DNA binding). However, there are some exceptions, such as a hydrophobic contact between H3 Leu65 and the 5-methyl in thymine. An Arg side chain from a histone fold enters the minor groove at 10 of the 14 times it faces the histone octamer. The other 4 occurrences have Arg side chains from tail regions penetrating the minor groove.

Histone Tails

The histone N- and C-terminal tails make up about 28% of the mass of the core histone proteins, and are seen over about 1/3 of their total length in the electron density map, i.e. that much of their length is relatively immobile in the structure.

The tails of H3 and H2B pass through channels in the DNA superhelix created by 2 juxtaposed minor grooves. One H4 tail segment makes a strong interparticle connection, perhaps relevant to the higher-order structure of nucleosomes.

The most N-terminal regions of the histone tails are not highly ordered in the X-ray crystal structure. These regions extend out from the nucleosome core and hence could be involved in interparticle interactions. The sites for acetylation and de-acetylation of specific lysines are in these segments of the tails that protrude from the core. Post-translational modifications such as acetylation have been implicated in "chromatin remodeling" to allow or aid transcription factor binding. It seems likely that these modifications are affecting interactions between nucleosomal cores, but not changing the structure of the core particle.

Outside Links

  • Some excellent resources are available on the World Wide Web for visualizing and further investigating chromatin structure and its involvement in nuclear processes.
  • Dmitry Pruss maintains a site with many good images, including a dynamic, step-by-step view of the nucleosomal core beginning with the histone fold domains and ending with a complete core, with DNA. www.average.org/~pruss/nucleosome.html
  • Another good site is from J.R. Bone: rampages.onramp.net/~jrbone/chrom.html

Higher order chromatin structure

  1. The 10 nm fiber composed of nucleosomal cores and spacers is folded into higher order structures for much of the DNA in chromatin. In fact, the 10 nm fiber with the beads-on-a-string appearance in the electron microscope was prepared at very low salt concentrations and is free of histone H1.
  2. In the presence of H1 and at more physiological salt concentrations, chromatin forms a 30 nm fiber. The exact structure of this fiber remains a point of considerable debate, and one cannot rule out the possibility of multiple structures in this fiber.
  3. One reasonable model is that the 10 nm fiber coils around itself to generate a solenoid that is 30 nm in diameter, with 6 nucleosomes per turn of the solenoid.

Histone H1 binds to the outer surface of the nucleosomal core, interacting at the points of DNA entry and exit. H1 molecules can be cross-linked to each other with chemical reagents, indicating that the H1 proteins also interact with each other. Interactions between H1 proteins, each bound to a nucleosomal core, may be one of the forces driving the formation of the 30 nm fiber.

Figure 4.34. Model for one turn of the solenoid in the 30 nm fiber.

4. Each level of chromatin structure produces a more compact arrangement of the DNA. This can be described in terms of a packing ratio, which is the length of the DNA in an extended state divided by the length of the DNA in the more compact state.

For the 10 nm fiber, the packing ratio is about 7, i.e. there are 7 µm of DNA per µm of chromatin fiber. The packing ratio in the core is higher (see problems), but this does not include the additional, less compacted DNA in the spacer. In the 30 nm fiber, the packing ratio is about 40, i.e. there are 40 µm of DNA per µm of chromatin fiber.
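These two packing ratios can be approximated from numbers given earlier, under some assumptions that are not stated in the text: B-form DNA rises 0.34 nm per base pair, each nucleosome (about 200 bp) occupies roughly 10 nm along the 10 nm fiber, and one turn of the solenoid (6 nucleosomes) advances roughly 11 nm along the 30 nm fiber. The last two lengths are illustrative values chosen for this sketch, so the output is only a plausibility check.

```python
NM_PER_BP = 0.34  # rise per base pair of B-form DNA


def fiber_packing_ratio(base_pairs, fiber_length_nm):
    """Extended DNA length divided by the fiber length that contains it."""
    return base_pairs * NM_PER_BP / fiber_length_nm


# 10 nm fiber: one 200 bp nucleosome per ~10 nm of fiber (assumed).
ten_nm = fiber_packing_ratio(200, 10.0)

# 30 nm fiber: 6 nucleosomes (1200 bp) per solenoid turn of ~11 nm (assumed).
thirty_nm = fiber_packing_ratio(6 * 200, 11.0)

print(f"10 nm fiber: {ten_nm:.1f}")     # 6.8
print(f"30 nm fiber: {thirty_nm:.1f}")  # 37.1
```

Both results land close to the values of about 7 and about 40 quoted above.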

5. The 30 nm fiber is probably the basic constituent of both interphase chromatin and mitotic chromosomes. It can be compacted further by additional coils and loops. One of the key issues in gene regulation is the nature of the chromatin fiber in transcriptionally active euchromatin. Is it the 10 nm fiber? The 30 nm fiber? Some modification of the latter? Or even some higher order structure? These are topics for current research.


Comparative genome and transcriptome analysis of diatom, Skeletonema costatum, reveals evolution of genes for harmful algal bloom

Diatoms play a major role in carbon fixation, accounting for about 20% of global fixation. However, harmful algal blooms, also known as red tides, are a major problem for the environment and the fishery industry. Even though intensive studies have been conducted, the molecular mechanism behind harmful algal blooms is not fully understood. Two major diatoms have been sequenced, but more diatoms should be examined at the whole-genome level, and evolutionary genome studies are required to understand the landscape of the molecular mechanisms of harmful algal blooms.

Results

Here we sequenced the genome of Skeletonema costatum, which is the dominant diatom causing harmful algal blooms in Japan, and also performed RNA-sequencing analysis for conditions in which harmful algal blooms often occur. Both the evolutionary genomic and the comparative transcriptomic studies revealed that genes for oxidative stress response and response to cytokinin are key to the proliferation of the diatom.

Conclusions

Diatoms causing harmful algal blooms have gained multiple copies of genes related to oxidative stress response and response to cytokinin, and have acquired the ability to express these genes intensively during blooms.


Study of Comparative Genomics

In this article we will discuss the study of comparative genomics.

Not all of an organism's genome is functional, and the proportion varies among groups of organisms. For example, in bacteria about 3-5% is reported to be non-functional, whereas in humans the figure is about 97%. In addition, the level of evolutionary conservation of microbial proteins is rather uniform, with about 70% of the gene products from each of the sequenced genomes having homologs in distant genomes.

Thus, the function of many of these genes can be predicted by comparing different genomes and by transferring functional annotation of proteins from better-studied organisms to their orthologs in lesser-studied organisms.

Based on the above facts, comparative genomics has proved a powerful approach for achieving a better understanding of genomes and, subsequently, of the biology of the respective organisms. Recently, the genomes of several microorganisms, viz. Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii, Saccharomyces cerevisiae, Escherichia coli and Bacillus subtilis, have been fully sequenced.

Computational analysis of complete genomes requires databases (repositories of gene structures of organisms) that store genomic information, together with bioinformatics tools. To study completely sequenced genomes, analysis of nucleic acids, proteins, etc. is required. Nowadays, the analysis of protein sets has also proved to be a useful tool for genome analysis.

Thus, the functions of genes in lesser-studied organisms can be inferred by comparing different genomes and transferring the functional annotation of proteins from better-studied organisms to their orthologs [i.e. genes connected by vertical evolutionary descent (the "same" gene in different species)], as opposed to paralogs (i.e. genes related by duplication within a genome).

This makes comparative genomics a powerful approach to achieving a better understanding of the genomes, and subsequently of the biology of the respective organisms.
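One common heuristic for picking candidate orthologs between two genomes is the reciprocal best hit: gene a in genome A and gene b in genome B are paired if each is the other's highest-scoring match. The text does not prescribe this method, so the sketch below is only an illustration of the annotation-transfer idea; the gene names, similarity scores and annotations are all hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores is a dict {(query, subject): similarity_score}.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def transfer_annotation(pairs, annotations):
    """Copy annotations from genome A genes to their candidate orthologs in B."""
    return {b: annotations[a] for a, b in pairs if a in annotations}


# Hypothetical scores; b2p is a paralog of b2 within genome B.
a_vs_b = {("a1", "b1"): 90, ("a1", "b2"): 30, ("a2", "b2"): 80, ("a2", "b2p"): 60}
b_vs_a = {("b1", "a1"): 90, ("b2", "a2"): 80, ("b2p", "a2"): 60}
annotations = {"a1": "DNA repair", "a2": "ABC transporter"}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
print(transfer_annotation(pairs, annotations))
```

The paralog b2p matches a2 but is not a2's best hit, so it receives no transferred annotation. That is the cautious behaviour one wants when paralogs may have diverged in function.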

Databases for Comparative Genomics:

The World Wide Web (WWW) is accessible to anyone with an Internet connection.

This database gives information about the proteins, their three-dimensional structures, enzyme patterns, PROSITE patterns, Pfam domains, BLOCKS and SCOP domains, as well as PIR keywords and PIR superfamilies.

Clusters of Orthologous Groups (COGs) simplify evolutionary studies of complete genomes and improve functional assignments of individual proteins. The database comprises ~2,800 conserved families of proteins from the sequenced genomes.

Each COG contains orthologous sets of proteins from at least three phylogenetic lineages, which are assumed to have evolved from a single ancestral protein. Orthologs are generally assumed to have the same function in all organisms.

The protein families in the COGs database are separated into 17 functional groups that include a group of uncharacterized yet conserved proteins as well as a group of proteins for which only a general functional assignment appeared appropriate.

Because the COGs database stores diverse data on proteins, similarity searches against it can also yield some information about proteins that have no clear annotation in other databases. The database also serves as a tool for comparative analysis of complete genomes.

The Kyoto Encyclopedia of Genes and Genomes (KEGG), which centers on cellular metabolism, was described by Kanehisa and Goto (2000). It provides a comprehensive set of metabolic pathway charts, both general and specific, for each sequenced organism. In these charts, enzymes identified in a particular organism are colour-coded, so that one can easily trace the pathways.

It also lists the enzymes encoded by orthologous genes. Genes located adjacent to each other may form operon-like clusters; for example, two complete genomes can be compared for genes that are located relatively close or adjacent (within five genes) in both. This site is useful for obtaining information for the analysis of metabolism in various organisms.

The Microbial Genome Database (MBGD) is hosted at the University of Tokyo, Japan. This database helps to search microbial genomes. MBGD accepts several sequences at once (~2000 residues) for searching against all of the complete genomes available, and displays colour-coded functions of the detected homologs and their locations on a circular genome map. The database also gives information on functions, e.g. degradation of hydrocarbons or biosynthesis of nucleotides.

Similar to KEGG, the WIT ("What Is There") database gives information on metabolic reconstruction for completely sequenced genomes. WIT provides the sequence of reactions between two bifurcations and also includes proteins from many partially sequenced genomes. These features provide much more information on the sequences of the same proteins/enzymes obtained from different organisms.

Bioinformatics Subgroups:

Bioinformatics has several subgroups, viz. networking, sequence databases and alignment theories, phylogenetic analysis, secondary structure prediction and DNA analysis, biomolecular structures, dynamics and function, protein motifs, modeling and analysis of 3-D structures of macromolecules, applications in the discovery of synthetic molecules to treat human diseases, molecular mechanisms involved in gene regulation, etc.

Steps of Sequence Formation:

The tools of bioinformatics provide for the analysis of sequence information.

This process involves:

i. Identifying the genes in the DNA sequences from various organisms.

ii. Developing methods to study the structure and/or functions of newly identified sequences and corresponding structural RNA sequences.

iii. Identifying families of related sequences and the development of models.

iv. Aligning similar sequences and generating phylogenetic trees to examine evolutionary relationships.

To gain biological and biophysical knowledge, the sequence information must be interpreted. Analysis of a biological sequence can decipher the structural, functional and evolutionary clues encoded in the language of biological sequences. This language may be decomposed into sentences (proteins), words (motifs) and letters (amino acids), and the code may be tackled at any of these levels.

A single letter change within a word can sometimes change its meaning. For example, a change in the codon for glutamic acid (GAG) to one for valine (GUG) in the β-globin gene, in homozygous individuals, turns a normal healthy state into potentially fatal sickle cell anaemia.
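The effect of such a single-letter change can be shown with a small lookup. Only the four codons needed for the example are included (GAA and GAG encode glutamic acid, GUA and GUG encode valine in the standard genetic code); the function name and the partial table are illustrative.

```python
# Partial codon table (mRNA codons), just enough for this example.
CODON_TABLE = {
    "GAA": "Glu",
    "GAG": "Glu",
    "GUA": "Val",
    "GUG": "Val",
}


def point_mutate(codon, position, base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + base + codon[position + 1:]


for original in ("GAG", "GAA"):
    mutant = point_mutate(original, 1, "U")  # A -> U at the second position
    print(f"{original} ({CODON_TABLE[original]}) -> {mutant} ({CODON_TABLE[mutant]})")
```

In both cases an A to U change at the middle position converts a glutamic acid codon into a valine codon, so one letter alters the encoded amino acid.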

Basic Requirements:

Following are some of the requirements:

a. Biological research on the web.

b. Sequence analysis, pair wise alignment and database searching.

c. Multiple sequence alignments, trees and profiles.

d. Visualizing protein structures and computing structural properties.

e. Predicting protein structure and function from sequence.

f. Tools for genomics and proteomics.

The well-known software packages for DNA and protein sequence analysis include Staden and Gene world (for DNA and protein sequence), Gene Thesarus (access to public data and integration with proprietary data), Lasergene (for coding analysis, pattern site matching, structure and comparative analysis, restriction site analysis, PCR primer and probe design, sequence editing, assembly and analysis, etc.), CINEMA (facilities for motif identification using BLAST), EMBOSS (nucleotide sequence pattern analysis, codon usage analysis, gene identification tools, protein motif identification and rapid database searching with sequence patterns), and EGCG (for fragment assembly, mapping, multiple sequence analysis, pattern recognition, nucleotide and protein sequence analysis, etc.).

The biological data and information storage are given below in Table 27.12:

Classification of Databases:

The databases are broadly classified into two categories: sequence databases (covering both protein and nucleic acid sequences) and structural databases (covering only proteins).

They are also classified into three categories: primary, secondary and composite.

Primary databases contain information on the sequence or structure alone of either protein or nucleic acid, e.g. PIR for protein sequences, and GenBank and DDBJ for genome sequences. Secondary databases contain information derived from the primary databases, for example information on conserved sequences, signature sequences and active site residues of protein families, as in SCOP, eMOTIF, etc.

A composite database combines several primary sources, obviating the need to search multiple resources. SCOP is the Structural Classification of Proteins, in which proteins are classified into hierarchical levels such as classes, folds and superfamilies.

Comparative Modelling or Homology Modelling:

It is useful in aligning two sequences to identify segments that share similarity; the alignment is later used to identify the structure of the desired protein. After the structure of the homologue is predicted, a rigid-body assembly approach is applied to assemble the structure from the core, loop regions, side chains, etc. In the segment matching procedure, coordinates are calculated from the approximate positions of conserved atoms in the templates.

The alignment of the sequence of interest with one or more structural templates can be used to derive a set of distance constraints, which are then used in distance geometry, restrained energy minimization or restrained molecular dynamics to obtain the structure.
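The alignment step that comparative modelling starts from, finding the segments two sequences share, can be illustrated with a small local alignment routine. This is a minimal Smith-Waterman sketch with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are illustrative assumptions rather than values from the text, and real modelling pipelines use substitution matrices and dedicated alignment tools.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAAGGGTTT", "CCCGGGCCC"))
```

In the toy call, the two sequences share only the central `GGG` segment, so that segment is what the local alignment reports.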

Threading is a technique to match a sequence with a protein shape in the absence of any substantial sequence identity to proteins of known structure, whereas comparative modelling relies on sequence similarity to a protein of known structure.

Threading is followed by scoring, which either creates a profile for each site or uses a potential based on pairwise interactions. Potential energy functions may be obtained from ab initio quantum mechanical calculations, from thermodynamic, spectroscopic or crystallographic methods, or from a combination of these.

(b) Sequence analysis:

In order to understand protein and nucleic acid structure and evolution, analysis of their sequence data is required. Sequence analysis is the detection of homologous relationships, either orthologous (same function, different species) or paralogous (different but related functions within one organism), by means of routine database searches.

Some of the important resources are outlined in the following:


Comparative Genome Analysis and Global Phylogeny of the Toxin Variant Clostridium difficile PCR Ribotype 017 Reveals the Evolution of Two Independent Sublineages

The diarrheal pathogen Clostridium difficile consists of at least six distinct evolutionary lineages. The RT017 lineage is anomalous, as strains only express toxin B, compared to strains from other lineages that produce toxins A and B and, occasionally, binary toxin. Historically, RT017 initially was reported in Asia but now has been reported worldwide. We used whole-genome sequencing and phylogenetic analysis to investigate the patterns of global spread and population structure of 277 RT017 isolates from animal and human origins from six continents, isolated between 1990 and 2013. We reveal two distinct evenly split sublineages (SL1 and SL2) of C. difficile RT017 that contain multiple independent clonal expansions. All 24 animal isolates were contained within SL1 along with human isolates, suggesting potential transmission between animals and humans. Genetic analyses revealed an overrepresentation of antibiotic resistance genes. Phylogeographic analyses show a North American origin for RT017, as has been found for the recently emerged epidemic RT027 lineage. Despite having only one toxin, RT017 strains have evolved in parallel from at least two independent sources and can readily transmit between continents.

Keywords: Clostridium difficile; SNPs; antibiotic resistance; evolution; phylogenetics; phylogeny; ribotype 017; sequencing.

Copyright © 2017 Cairns et al.

Figures

Maximum-likelihood phylogenetic analysis of 277 global RT017 isolates based on core genome SNPs…

Bayesian evolutionary analysis of 277 global RT017 isolates based on core genome SNPs…

Maximum-likelihood phylogenetic analysis of the global RT017 isolates based on core genome SNPs…

Global transmission events inferred from Bayesian evolutionary analysis of RT017. From the geotemporal…


Conclusions

The isolation and genome sequencing of six L. brevis strains combined with thirteen additional, publicly available L. brevis genomes allowed a comparative genome analysis of the L. brevis species. The deduced pan-genome of these L. brevis isolates appears to be in a closed state, indicating that the representatives used in this study are sufficient to describe the genetic diversity of the taxon. Throughout evolution, it appears that L. brevis strains specialized and differentiated from one another by acquiring plasmids and prophages, despite the presence of CRISPR-Cas and R/M systems, which may have limited such foreign DNA invasion events. These latter systems are of relevance for future functional investigations that may necessitate the development of DNA transfer and/or mutagenesis tools. L. brevis strains represent a significant threat to the brewing industry, being the most common cause of beer spoilage; however, this spoiling ability is strain specific. The comparative genome analysis performed here highlights that L. brevis strains with the ability to grow in beer possess a higher number of CDSs in their overall chromosomal sequences. This observation suggests a link to evolution and adaptation to beer, in which the strain would have acquired novel genes and functions in order to adapt and survive in the harsh environment that beer represents. Analysis of the role(s) of the “acquired” or beer-specific CDSs revealed that almost a quarter of these are linked to oxido-reduction reactions, possibly playing a role in the response to oxidative stress. Another 22% are linked to transcription regulation, 21% encode cell surface proteins, and 14% encode membrane transport-related proteins, possibly associated with the extrusion of harmful compounds encountered by L. brevis strains when surviving and growing in beer. Additional genetic diversification of these L. brevis strains is expected to have occurred through plasmid acquisition, which also likely contributes to beer adaptation. The plasmid content analysis of the different L. brevis beer-spoiler strains highlighted the presence of unique proteins shared among these strains. These proteins are mostly hypothetical proteins, while approximately 30% are linked to membrane transport and cell-wall synthesis. These observations demonstrate the complexity of microorganisms’ beer spoilage ability and suggest that adaptation of L. brevis strains to beer is a complex process, not due to the action of only one specific gene, but more likely the result of a complex, multi-factorial response.
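The notion of a "closed" pan-genome can be made concrete with a small sketch. A pan-genome is usually described as closed when the total number of gene families stops growing as further genomes are added. The function below tracks pan-genome and core-genome size as genomes are added one at a time; the gene-family labels are hypothetical toy data, not results from the study.

```python
def pan_core_curve(genomes):
    """genomes: list of sets of gene-family IDs, in the order they are added.

    Returns a list of (pan_size, core_size) tuples, one per genome added.
    """
    pan, core, curve = set(), None, []
    for families in genomes:
        pan |= families                                   # union: every family seen so far
        core = set(families) if core is None else core & families  # intersection
        curve.append((len(pan), len(core)))
    return curve


# Toy example (hypothetical gene families).
toy_genomes = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c", "d"}]
print(pan_core_curve(toy_genomes))
```

In the toy data the pan-genome size stops increasing once the third genome is added, which is the plateau pattern associated with a closed pan-genome.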


Comparative Genome Analysis of Four Magnetotactic Bacteria Reveals a Complex Set of Group-Specific Genes Implicated in Magnetosome Biomineralization and Function

FIG. 1. Phylogenetic affiliation of best BLAST hits of all conserved ORFs from MSR-1. Bars represent the top-10 numbers of the best E-value hits from each conserved gene in MSR-1. (A) Distribution with all database species from genomesDB included. (B) Distribution after closest relatives AMB-1, MS-1, and R. rubrum were excluded from analysis.

FIG. 2. Comparative gene content analysis of MTB based on reciprocal best matches. The Venn diagrams illustrate the shared gene content between the four genomes. For visualization, individual diagrams for three genomes are shown. The numbers of species-specific genes and shared genes are indicated. (A) Shared gene content between MSR-1, AMB-1, and MS-1. (B) Shared gene content between MSR-1, AMB-1, and strain MC-1.

FIG. 3. Phylogenetic tree of MamH (MGR4089) orthologous and paralogous proteins including the MTB-related MGR4148 (maximum-likelihood analysis). MamH represents a typical example for an MTB-related protein defined in this study, i.e., it forms a coherent phylogenetic branch within its family tree. In addition, the newly identified MTB-related MGR4148 gene is related to MamH but forms a distinct group. The three major clusters are indicated by different colors. The numbers indicate the bootstrap support for selected nodes.

FIG. 4. Gene neighborhood representation of selected group-specific genes. Identical colors indicate homologous genes in the corresponding genomes. Arrows in bold lines indicate identification of the gene product within the magnetosome membrane. (A) mamXY cluster. Conserved gene neighborhood of MGR4148, mamX (MGR4149), and mamY (MGR4150) (top). Schematic representation of the different Pfam domain structure of the MTB-related gene MGR4148 compared to mamH (bottom). (B) Gene neighborhood of mtxA. The corrected annotation for the MGR0208 homolog of AMB-1 is shown. (C) Gene neighborhood of MGR3500. (D) Gene neighborhood of mmsF (MGR4072).
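The "reciprocal best matches" criterion mentioned for Fig. 2 can be sketched as follows. This is a generic illustration, not the authors' pipeline: two genes are paired only if each is the other's top-scoring hit. The input dictionaries, gene names and scores are hypothetical.

```python
def reciprocal_best_hits(hits_ab, hits_ba):
    """Pair genes that are each other's best hit.

    hits_ab: dict mapping a gene in genome A to a list of (gene_in_B, score).
    hits_ba: dict mapping a gene in genome B to a list of (gene_in_A, score).
    Returns a sorted list of (gene_in_A, gene_in_B) pairs.
    """
    def best(hits):
        # Keep only the highest-scoring subject for each query.
        return {q: max(subjects, key=lambda s: s[1])[0]
                for q, subjects in hits.items() if subjects}

    best_ab, best_ba = best(hits_ab), best(hits_ba)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical search results between two genomes.
hits_ab = {"a1": [("b1", 200), ("b2", 50)], "a2": [("b2", 80)], "a3": [("b1", 90)]}
hits_ba = {"b1": [("a1", 198), ("a3", 85)], "b2": [("a2", 75), ("a1", 40)]}
print(reciprocal_best_hits(hits_ab, hits_ba))
```

Here `a3` finds `b1` as its best hit, but `b1` prefers `a1`, so `a3` is not paired and would count as genome-specific in a Venn diagram of shared gene content.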

Conclusions

The bench-top sequencing revolution has led to a ‘democratization’ of sequencing, meaning most research laboratories can afford to sequence whole bacterial genomes when their work demands it. However, analysing the data is now a major bottleneck for most laboratories. We have provided a starting point for biologists to quickly begin working with their own bacterial genome data, without investing money in expensive software or training courses. The figures show examples of what can be achieved with the tools presented, and the accompanying tutorial gives step-by-step instructions for each kind of analysis.


What other genomes have been sequenced?

Researchers have sequenced the complete genomes of hundreds of animals and plants (more than 250 animal species and 50 species of birds alone), and the list continues to grow almost daily.

In addition to the sequencing of the human genome, which was completed in 2003, scientists involved in the Human Genome Project sequenced the genomes of a number of model organisms that are commonly used as surrogates in studying human biology. These include the rat, puffer fish, fruit fly, sea squirt, roundworm, and the bacterium Escherichia coli. For some organisms NHGRI has sequenced many varieties, providing critical data for understanding genetic variation.

DNA sequencing centers supported by NHGRI also have sequenced genomes of the chicken, dog, honey bee, gorilla, chimpanzee, sea urchin, fungi and many other organisms.



XuW, XiW, and WZ designed and coordinated the study and carried out the data analysis. XuW, XiW, LS, and WZ performed the bioinformatics analysis. XuW, XiW, JL, and RY carried out the experiments and interpreted data for the work. XuW, XiW, and WZ wrote the manuscript. GQ checked and edited the manuscript. All authors have read and approved the manuscript.

This work was supported by the National Natural Science Foundation of China (No. 31470230, 51320105006, 51604308), the Youth Talent Foundation of Hunan Province of China (No. 2017RS3003), Natural Science Foundation of Hunan Province of China (No. 2018JJ2486), Key Research and Development Projects in Hunan Province (2018WK2012), Fundamental Research Funds for the Central Universities of Central South University (No. 2018zzts767).


Materials and Methods

Cotton Materials

Plants of G. rotundifolium (accession number K201), G. arboreum (cultivar Shixiya-1) and G. raimondii (accession number D502) are maintained in the National Wild Cotton Nursery and are also cultivated in the greenhouse of Huazhong Agricultural University in Wuhan, China. Fresh young leaves were collected individually and immediately frozen in liquid nitrogen.

Library Construction and Nanopore Sequencing

High-quality genomic DNA from one plant was extracted and inspected for purity, concentration, and integrity using Nanodrop, Qubit, and 0.35% agarose gel electrophoresis, respectively. Large DNA fragments (20–150 Kb) were collected using the BluePippin system. DNA libraries were constructed using the SQK-LSK109 kit following the standard protocol of Oxford Nanopore Technologies (ONT). Briefly, DNA fragments were subject to optional fragmentation, end repair, ligation of sequencing adapters, and tether attachment. The Qubit machine was used to quantify each DNA library. DNA sequencing was performed on the PromethION platform (R9.4.1 FLO-PRO002; Biomarker Technologies). Nanopore data (binary fast5 format) were subjected to base calling using the Guppy software from the MinKNOW package. Processed reads were subject to removal of sequencing adapters and filtering of reads with low quality and/or short length (<2,000 bp), and surviving reads were converted to fastq format for subsequent analysis. For each accession, we also constructed DNA libraries using the NEBNext® Ultra™ DNA Library Prep Kit for sequencing on the Illumina Novaseq 6000 platform (paired-end, 150 bp).

Hi-C Experiment and Library Construction

Fresh leaves (1 g) from G. rotundifolium were chopped with sharp blades, fixed with 1% formaldehyde solution, frozen in liquid nitrogen, and were used for nuclear extraction. Nuclei were digested with 30–50 U HindIII/DpnII for 15 h at 37°C. Digested chromatin was end-labeled with biotin-14-dCTP, and the DNA product was purified after blunt-end ligation. Then, the DNA was fragmented by ultrasound to a length of less than 500 bp. DNA fragments of 300–500 bp were captured by Streptavidin T1 magnetic beads. The library was prepared from the DNA isolated by the magnetic beads using the DNA library kit (Vazyme, #NDM607), and the obtained DNA library was sequenced (paired-end 150 bp reads) using the MGI2000 system.

Genome Assembly and Assessment

Nanopore sequencing reads were corrected via Canu (v1.3) with the parameter “correctedErrorRate = 0.045” (Koren et al. 2017). Clean reads were subsequently subject to de novo assembly using wtdbg (Ruan and Li 2019) (https://github.com/ruanjue/wtdbg). Assembled contigs were calibrated using Racon (Vaser et al. 2017) and then polished with the Illumina sequencing reads using Pilon (Walker et al. 2014) (v1.22; parameters: --mindepth 10 --changes --fix bases) for three iterations. In total, we corrected 12.6 million (M), 6.0 M and 27.2 M SNPs, and 17.6 M, 9.2 M, and 31.0 M InDels in the A2, D5 and K2 assemblies, respectively. Assembly quality was assessed in three ways. First, Illumina reads were mapped to the contigs using BWA (-mem) (Li and Durbin 2009), and the properly mapped reads were counted using SAMTools (v0.1.19 -flagstat) (Li et al. 2009). Second, the assemblies were evaluated for the 458 conserved core genes found in the CEGMA (v2.5) database (Parra et al. 2007). Finally, the assemblies were also evaluated using the BUSCO embryophyta_odb9 data set, which contains 1,440 conserved eukaryotic genes (Simao et al. 2015).

Chromosome Assembly Using Hi-C

Hi-C data were used to construct chromosome-level assemblies for the three genomes. Hi-C data of G. arboreum and G. raimondii were previously published (Wang et al. 2018). Hi-C data of G. rotundifolium were newly generated here with two independent experiments (HindIII and DpnII for digestion of chromatin) (supplementary table 2, Supplementary Material online). Notably, up to 99.5% of A/B compartment regions and 96.4% of TAD boundaries overlapped in these two experiments (the methods for A/B compartment and TAD analysis are described below), and the HindIII Hi-C data were used for further analysis. The resolution of Hi-C data sets was estimated as 20 Kb for G. arboreum, 10 Kb for G. raimondii, and 20 Kb for G. rotundifolium using the method described previously (Rao et al. 2014). We performed a preassembly for error correction of contigs, which required splitting the contigs into segments of 50 Kb (on average). Hi-C data were mapped to these fragments and unique mappings were retained for the assembly using LACHESIS (v1.0) (Burton et al. 2013). Any two segments that showed inconsistent connections with information from the raw contigs were checked manually. Corrected contigs were used to construct chromosome-level assemblies using LACHESIS with the parameters (CLUSTER_MIN_RE_SITES = 10, CLUSTER_MAX_LINK_DENSITY = 2, CLUSTER_NONINFORMATIVE_RATIO = 2, ORDER_MIN_N_RES_IN_TRUNK = 219, ORDER_MIN_N_RES_IN_SHREDS = 216). To assess assembly quality, each assembly was split into 100-Kb bins to serve as a reference for Hi-C data mapping using HiC-Pro (v2.7.1) (Servant et al. 2015). Obvious placement and orientation errors in chromatin interaction patterns were manually adjusted. The interaction matrices generated by HiC-Pro were displayed with heatmaps at a 100 Kb resolution.

Transposon Prediction

We used both LTR_Finder (v1.07) (Xu and Wang 2007) with “-C -M 0.8” and RepeatScout (v1.0.5) (Price et al. 2005) with default parameters to construct a repetitive sequence library, representing structure-based prediction and ab initio prediction, respectively. PASTEClassifier (v1.0) was used to classify sequences in the library with respect to repeat type, and these were subsequently merged with Repbase (version 19.06) for the final repeat library (Bao et al. 2015). This library was used to predict repetitive sequences in each genome using RepeatMasker (-nolow -no_is -norna -engine wublast) (Tarailo-Graovac and Chen 2009).

LTR Retrotransposon Analysis

LTR_Finder (Xu and Wang 2007) was used with parameter settings (-C -M 0.8) to identify full-length LTRs in each genome. Long-terminal repeat (LTR) sequences were clustered from each full-length LTR element using the CD-HIT program (Fu et al. 2012) with parameter “-d 0 -c 0.8 -aL 0.80 -T 0 -M 1500000” for LTR family analysis. For each full-length LTR retrotransposon, the 5′ LTR and 3′ LTR sequences were aligned using MUSCLE (v3.8.1551) (Edgar 2004) and the divergence distance between them was calculated with a Kimura two-parameter (K2P) model using “distmat” from the EMBOSS toolkit (Rice et al. 2000). Divergence time was estimated using the formula T = K/2r (where K is the distance between two LTRs and r is the rate of nucleotide substitution per site per year, r = 3.5 × 10⁻⁹) (Chen et al. 2020; Huang et al. 2020). According to the time of divergence (5 Ma) among the three Gossypium species, full-length LTR retrotransposons were divided by burst time into ancient TEs (≥5 Ma) and young TEs (<5 Ma), depending on whether the burst was inferred to have occurred prior to or following divergence of these clades. The expression level of each transposon was calculated based on the definition of Reads Per Kilobase per Million mapped reads (RPKM), and those with RPKM greater than 0.1 were considered “expressed TEs.” Gossypium retrotransposable Gypsy-like element (Gorge3) sequences (Hawkins et al. 2006) were aligned against the full-length LTR elements from G. rotundifolium, G. arboreum, G. raimondii, and Gossypioides kirkii (Udall, Long, Ramaraj et al. 2019) using a reciprocal blastn (-e 1e-05) search. MAFFT (v7.453) (Katoh and Standley 2013) was used for multiple sequence alignment of the Gorge3 5′ LTR domain in the four species, and a phylogenetic tree was then constructed using the IQ-TREE program (Nguyen et al. 2015).
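The dating and expression calculations in this paragraph can be sketched in plain Python. This is an illustrative stand-in, not the EMBOSS "distmat" implementation: `k2p_distance` uses the standard Kimura two-parameter formula on a pre-aligned LTR pair, `insertion_time` applies T = K/2r with the stated rate, and `rpkm` follows the usual Reads Per Kilobase per Million mapped reads definition. All function names are hypothetical.

```python
import math

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    purines, pyrimidines = {"A", "G"}, {"C", "T"}
    n = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        n += 1
        if x == y:
            continue
        if {x, y} <= purines or {x, y} <= pyrimidines:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if n == 0:
        raise ValueError("no comparable aligned positions")
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def insertion_time(k, r=3.5e-9):
    """Divergence time in years, T = K / (2r)."""
    return k / (2 * r)

def age_class(t_years, split=5e6):
    """Ancient (>= 5 Ma) versus young (< 5 Ma), as in the text."""
    return "ancient" if t_years >= split else "young"

def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)

k = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")   # one transition in ten sites
print(k, insertion_time(k), age_class(insertion_time(k)))
print(rpkm(100, 1000, 1_000_000) > 0.1)        # "expressed" under the 0.1 cutoff
```

The factor of 2 in the denominator reflects that both LTRs of one element accumulate substitutions independently after insertion.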

Gene Prediction

To predict protein-coding genes, three different strategies were adopted: ab initio prediction, homolog-based prediction, and transcript-based prediction. Genscan (Burge and Karlin 1997), Augustus (v2.4) (Stanke and Morgenstern 2005), GlimmerHMM (v3.0.4) (Majoros et al. 2004), and SNAP (v2006-07-28) (Korf 2004) were used for ab initio prediction. GeMoMa (v1.3.1) (Keilwagen et al. 2018) was used for predicting genes based on homologous proteins from other species (Populus trichocarpa, Arabidopsis thaliana, Vitis vinifera, Theobroma cacao, and G. raimondii). Hisat2 (v2.0.4) (Kim et al. 2015) and Stringtie (v1.2.3) (Pertea et al. 2015) were used for reference-guided transcript assembly. PASA (v2.0.2) (Haas et al. 2003) was used to predict unigene sequences based on RNA-Seq data without reference-guided assembly. Finally, EVM (v1.1.1) (Haas et al. 2008) was used to integrate the prediction results obtained by the above three methods, and PASA (v2.0.2) (Haas et al. 2003) was used to modify gene models. To identify pseudogenes, GenBlastA (v1.0.4) (She et al. 2009) was used to scan each genome after masking predicted protein-coding sequences and GeneWise (v2.4.1) (Birney et al. 2004) was used to identify premature stop codons and frameshift mutations relative to the intact reference proteins. The functional annotation of predicted genes was performed using 1) InterProScan (v5.0) (Jones et al. 2014) with “-iprlookup -goterms” parameter settings, 2) NR (v20190625) with “-evalue 1e-05 -best_hit_overhang 0.25 -max_target_seqs 5”, and 3) The Arabidopsis Information Resource 10 (TAIR10) database (Lamesch et al. 2012). Gene Ontology (GO) enrichment analysis was performed using Fisher’s exact test (Carbon et al. 2019). GO enrichment analysis was performed for genes showing A-to-B and B-to-A compartment status change, using different background gene sets (K2 and A2 genes were combined as a reference set and orthologous gene pairs showing A/B compartment status change were used as a test set; similarly, A2 and D5 genes were combined as another reference set).
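For readers unfamiliar with the enrichment test, the following is a minimal sketch of a one-sided Fisher's exact test for a single GO term, computed as a hypergeometric tail probability. The text does not say whether a one-sided or two-sided test was used, so the one-sided form is an assumption, as are the function and argument names.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    k: genes in the test set annotated with the GO term
    n: size of the test set
    K: genes in the reference (background) set annotated with the term
    N: size of the reference set
    Returns P(X >= k) under the hypergeometric distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers: 5 of 10 background genes carry the term, and all 5 genes
# in the test set carry it.
print(fisher_enrichment_p(5, 5, 5, 10))
```

In practice one such p-value is computed per GO term and the set of p-values is then corrected for multiple testing, a step this sketch leaves out.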

Identification of Centromeric Regions

Previously identified centromeric regions from the published TM-1 reference genome, that is, GhCR1-5′LTR, GhCR2-5′LTR, GhCR3-5′LTR and GhCR4-5′LTR (Wang et al. 2015; Wang et al. 2019), were aligned to the K2, A2, and D5 genome sequences using MUMmer (v4.0) (Delcher et al. 2002), with the parameters “-c 90 -l 40” followed by “delta-filter -1,” to identify uniquely aligning regions. After manual filtering of alignments, the SPSS software (version 17.0) was used to calculate the 95% confidence interval for the median representing the centromeric region for each chromosome.

Comparative Genomes and Gene Synteny Analysis

The genomic sequences of G. rotundifolium, G. arboreum, and G. raimondii were aligned using MUMmer (v4.0) with the following parameters: 1) nucmer --maxmatch -c 90 -l 40 and 2) delta-filter -1. Syntenic blocks among the three genomes were constructed using MCScanX (Tang et al. 2008) with default settings and requiring a minimum of five homologous genes. The newly assembled A2 and D5 reference genomes were compared with published genomes (Paterson et al. 2012; Du et al. 2018; Udall, Long, Hanson et al. 2019; Huang et al. 2020) from the CottonGen website (https://www.cottongen.org/data/download) by MUMmer (v4.0) and MCScanX. The Chr01-Chr02 large translocation of A2-specific rearrangement and the Chr13-Chr05 large translocation of K2-specific rearrangement were confirmed by comparing with the published A1 (Huang et al. 2020), D1 (Grover et al. 2019), D10 (Udall, Long, Hanson et al. 2019; Udall, Long, Ramaraj et al. 2019) and F1 (Grover et al. 2020) genomes. The single-copy gene families among the three Gossypium genomes were extracted using an OrthoMCL analysis (Li et al. 2003).

Analysis of A and B Compartments

Hi-C interaction data can be used to partition the genome into two compartments, based on spatial organization of the chromatin and the relative paucity of interactions between compartments. Referred to as A/B compartments, these represent chromatin regions corresponding to open and closed chromatin, respectively. We evaluated each genome for the presence of A/B compartments, as described previously (Lieberman-Aiden et al. 2009). Briefly, Hi-C data for each species were aligned using HiC-Pro, as mentioned above. Valid interaction reads were used to construct heatmaps of each chromosome at resolutions of 20 Kb, 50 Kb, and 100 Kb. Raw contact maps were normalized using a sparse-based implementation of the iterative correction method embedded in HiC-Pro (v2.11.1) (Servant et al. 2015). The principal component analysis (PCA) method was used to identify A and B compartments with the HiTC (v1.0) package in R (Servant et al. 2012). Each chromosome was divided into consecutive 50 Kb bins for the construction of normalized interaction matrices as described in our previous study (Wang et al. 2018). Chromosomal bins with values greater than zero were regarded as “A compartment,” and bins with values less than zero were regarded as “B compartment.” At the chromosome level, the A compartment has a higher gene density and a lower transposon density than the B compartment. To analyze the A/B compartment status of homologous gene regions among the three Gossypium genomes, genomic sequences of the gene body plus 2 Kb upstream and downstream, regions known to be important for gene transcriptional regulation, were extracted. In this analysis, we only considered the regions where the first principal component value changes from positive (A) to negative (B) or vice versa.
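The sign rule for assigning compartments, and the detection of regions whose status differs between two genomes, can be sketched as below. The PC1 values are assumed to be already computed per homologous gene region (for example by HiTC); the region IDs and numbers are hypothetical.

```python
def compartment(pc1):
    """Label a region from its first principal component value."""
    if pc1 > 0:
        return "A"
    if pc1 < 0:
        return "B"
    return None  # exactly zero is left unassigned


def compartment_switches(pc1_ref, pc1_other):
    """Find regions whose compartment differs between two genomes.

    pc1_ref, pc1_other: dicts mapping a shared region ID to its PC1 value.
    Returns a dict mapping region ID to "A-to-B" or "B-to-A".
    """
    switches = {}
    for region in pc1_ref.keys() & pc1_other.keys():
        ref, other = compartment(pc1_ref[region]), compartment(pc1_other[region])
        if ref and other and ref != other:
            switches[region] = f"{ref}-to-{other}"
    return switches


# Hypothetical PC1 values for three orthologous gene regions.
pc1_genome1 = {"g1": 0.4, "g2": -0.2, "g3": 0.1}
pc1_genome2 = {"g1": -0.3, "g2": 0.5, "g3": 0.2}
print(compartment_switches(pc1_genome1, pc1_genome2))
```

Only `g1` and `g2` change sign between the two genomes, so only those two regions would enter the downstream enrichment analysis of compartment switches.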

Analysis of Topologically Associating Domains

Topologically associating domains (TADs) are regions of highly self-interacting chromatin that have distinct boundaries and which have been shown to align with coordinately regulated gene clusters in some species. TAD regions for each species were identified using the HiTAD (Wang et al. 2017) software with default settings. In this analysis, the raw chromatin interaction matrix for each chromosome was constructed using HiC-Pro at a resolution of 50 Kb. Each matrix file was transformed into the cooler format using the toCooler tool of HiCPeaks (https://github.com/XiaoTaoWang/HiCPeaks). In each species, TADs with a size of 300 Kb–2 Mb were retained for further analysis. To identify conserved and lineage-specific TADs, we compared TAD boundaries located in syntenic blocks from the results of MCScanX. Conserved boundaries were defined as those with a maximum boundary shift of three bins (150 Kb at 50 Kb resolution) and sequence similarity supported by the MUMmer alignments between two genomes.
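The size filter and the boundary-conservation rule can be sketched as follows. This simplified illustration assumes boundary positions from the two genomes have already been projected onto a shared coordinate system through the syntenic blocks, and it omits the sequence-similarity check; both simplifications, and the function names, are assumptions made for the example.

```python
BIN_SIZE = 50_000  # Hi-C matrix resolution used for TAD calling


def filter_tads(tads, min_size=300_000, max_size=2_000_000):
    """Keep TADs, given as (start, end) pairs, sized between 300 Kb and 2 Mb."""
    return [(start, end) for start, end in tads
            if min_size <= end - start <= max_size]


def conserved_boundaries(bounds_a, bounds_b, max_shift=3 * BIN_SIZE):
    """Return boundaries in A with a boundary in B within three bins (150 Kb).

    Assumes both lists are positions on a common, synteny-based coordinate axis.
    """
    return [a for a in bounds_a
            if any(abs(a - b) <= max_shift for b in bounds_b)]


print(filter_tads([(0, 200_000), (0, 500_000), (0, 3_000_000)]))
print(conserved_boundaries([100_000, 1_000_000], [250_000, 2_000_000]))
```

Boundaries in one genome with no counterpart within 150 Kb in the other would be candidates for lineage-specific boundaries.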

TAD Boundary Motif Analysis

In each genome, the 50 Kb flanking each TAD boundary was used to predict motifs with the findMotifsGenome.pl program in the HOMER (v5.0) (Heinz et al. 2010) software, with the parameters “-len 8,10,12 -size 200.” Putative motifs were filtered with cutoffs of P ≤ 0.01 for known and P ≤ 1e−10 for de novo prediction. We used 1,000 uniformly distributed random genomic regions that did not overlap with TAD boundaries as a control set for nonboundary regions.

RNA-Seq and Data Analysis

For each species, leaf total RNA was extracted using the Spectrum™ Plant Total RNA Kit (Sigma, STRN250). RNA libraries were constructed using the Illumina TruSeq RNA Library Preparation Kit (Illumina, San Diego, CA, USA) and sequenced on the Illumina HiSeq 4000 platform (paired-end 150 bp). After filtering of low-quality bases and sequence adapters, the clean RNA sequencing data were mapped to each genome using the hisat2 (v2.0.4) (Kim et al. 2015) software. High-quality mapping reads were extracted using SAMTools (v0.1.19 -q 25) (Li et al. 2009). After filtering PCR duplicates using samtools (rmdup), the remaining reads were used to calculate the expression level of genes using Stringtie (v1.2.3) (Pertea et al. 2015).

