What sequences are between adjacent genes?

The human genome has a lot of non-coding regions, which include regulatory elements, repetitive DNA, and introns. Suppose there are two adjacent genes on a chromosome, and their positions on the chromosome are, say,

For the first gene: 11,785,723 to 11,803,245 bp.

For the second gene: 11,806,096 to 11,806,143 bp. These values include their regulatory elements, promoter and introns as well.

So what sequence is present between bases 11,803,245 and 11,806,096? Are they satellite sequences? Is this non-coding, non-regulatory region 11,803,245 to 11,806,096 bp heterochromatin?

No. Intergenic regions are not necessarily heterochromatin. Chromatin states usually extend over long ranges and are not usually confined to a single gene. The spread of a chromatin state can be prevented by insulators/boundary elements, which again are not part of the transcribed region. Intergenic regions can also harbour distal regulatory elements such as enhancers and silencers, as well as microsatellites, transposons and so on. To see what a specific region contains, you can look at the different tracks in the UCSC Genome Browser.

In short, there can be functional and/or non-functional DNA in-between the transcribed regions.

Repeat regions need not be heterochromatinized and conversely not all heterochromatic regions consist of repeats.
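As a small illustration of the coordinates in the question, the intergenic interval can be computed directly from the two gene ranges. This is only a sketch: it assumes 1-based, inclusive coordinates on the same chromosome, and the function name `intergenic_interval` is made up for this example.

```python
def intergenic_interval(gene_a, gene_b):
    """Return the (start, end) interval strictly between two genes.

    Genes are (start, end) tuples in 1-based inclusive coordinates
    (an assumption of this sketch). Returns None if the genes touch
    or overlap, i.e. there is no base between them.
    """
    first, second = sorted([gene_a, gene_b])
    start = first[1] + 1   # first base after the upstream gene
    end = second[0] - 1    # last base before the downstream gene
    if start > end:
        return None
    return start, end


gene1 = (11_785_723, 11_803_245)
gene2 = (11_806_096, 11_806_143)

gap = intergenic_interval(gene1, gene2)
gap_length = gap[1] - gap[0] + 1
print(gap, gap_length)  # (11803246, 11806095) 2850
```

The resulting interval is what you would enter in a genome browser to inspect the annotation tracks for that region.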

Insertion Sequences


Insertion sequences (ISs) are small pieces of DNA which move within or between genomes using their own specialized recombination systems. They were discovered in the mid-1960s in studies of gene expression in Escherichia coli and its bacteriophages. Initially recognized by their ability to generate highly polar but unstable mutations in the gal and lac operons and in the early genes of bacteriophage lambda, they were later identified by electron microscopy as short insertions of DNA. The repeated isolation of a limited number of identical DNA sequences associated with these unstable mutations led to their being named ‘insertion sequences’.

The similarity of ISs and the mobile genetic elements described by Barbara McClintock in Zea mays in the 1940s became clear when it was realized that ISs formed an integral part of the E. coli genome and that their mutagenic activity was a result of their movement to new genetic locations. At about this time, transmissible resistance to antibiotics was also observed. Genetic studies of this phenomenon implicated an analogous mechanism of gene mobility in the distribution of these drug-resistance genes among the conjugal plasmids and phage involved in this transmission. Subsequently, ISs were shown, in many cases, to play a key role in mobilizing these genes.

DNA is made up of two strands. At one end of each strand there is a phosphate group attached to the carbon atom number 5 of the deoxyribose (this indicates the 5' terminal) and at the other end of each strand is a hydroxyl group attached to the carbon atom number 3 of the deoxyribose (this indicates the 3' terminal). The strands run in opposite directions and so we say that they are antiparallel. One strand runs in a 5'-3' direction and the other runs in a 3'-5' direction. Adjacent nucleotides are attached together via a bond between the phosphate group of one nucleotide and the carbon atom number 3 of the deoxyribose of the other nucleotide.

The bases of the two strands pair with each other via hydrogen bonds. Adenine and guanine are purines, as they have two rings in their molecular structure. Thymine and cytosine are pyrimidines, as they have only one ring in their molecular structure. A purine always pairs with a pyrimidine. Adenine and thymine pair by forming two hydrogen bonds, while guanine and cytosine pair by forming three hydrogen bonds.
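The antiparallel arrangement and the pairing rules can be sketched in a few lines of code. The snippet below builds the partner strand (read 5' to 3') for a given strand and counts the hydrogen bonds in the resulting duplex; the function names are illustrative only.

```python
# Watson-Crick partners: a purine always pairs with a pyrimidine.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Each base pair an A or T takes part in has 2 hydrogen bonds,
# each pair a G or C takes part in has 3.
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand_5_to_3):
    """Return the partner strand, also written 5' to 3'.

    Because the strands are antiparallel, the partner is read in the
    opposite direction, hence the reversal.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand_5_to_3))


def hydrogen_bond_count(strand):
    """Total hydrogen bonds holding this strand to its partner."""
    return sum(HYDROGEN_BONDS[base] for base in strand)


top = "ATGCGC"
bottom = reverse_complement(top)
bonds = hydrogen_bond_count(top)
print(bottom, bonds)  # GCGCAT 16
```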

Results and discussion

Genome assembly and annotation

We sequenced an individual Anser cygnoides genome using an Illumina HiSeq-2000 instrument, obtaining approximately 139.55 Gb with small-insert-size libraries (200 bp, 500 bp, or 800 bp; average read length: 100 bp) and large-insert-size libraries (2 kb, 5 kb, 10 kb, or 20 kb; average read length: 49 bp; Additional file 1: Table S1). Sequence data were assembled into a 1.12-Gb draft genome using SOAPdenovo software (Table 1). Our assembly covered >98% of the transcriptome-assembled unigenes (Additional file 1: Table S2), indicating that the genome sequence was of high quality. The average GC content of the goose genome is approximately 38%, similar to that of other birds such as chicken, duck, turkey, and zebra finch (Additional file 2: Figure S1). By combining homology-based, ab initio prediction and transcriptome-assisted methods, we predicted 16,150 genes (Additional file 1: Table S3), 75.7% of which are supported by homology-based evidence (Additional file 1: Table S4), and 77.7% are covered by transcriptome reads (Table 1). We found that 77.7% of the identified genes were well supported by public protein databases (Additional file 1: Table S5). The repeat content of the goose genome is similar to that of chicken, duck, turkey, and zebra finch (Additional file 1: Table S6). We also predicted 153 microRNAs (miRNAs), 69 rRNAs, 226 tRNAs, and 206 small nuclear RNAs (snRNAs) in the goose genome (Additional file 1: Table S7).

Comparative genomic analysis

We compared genome synteny and orthologous relationships among bird genomes. The goose genome has a high synteny with the duck genome [8], which covered approximately 81.09% and 82.35% of each genome, respectively (Additional file 1: Table S8 and Additional file 2: Figure S2), whereas approximately 592 goose scaffolds with lengths >5 kb mapped to and occupied 67.67% of the chicken genome [9] (Additional file 1: Table S8 and Additional file 2: Figure S3). In addition, we found that chromosomal rearrangements occur between the goose and chicken genomes (Additional file 1: Tables S9 and S10 and Additional file 2: Figure S4). For example, scaffold 45 is a goose genome sequence fragment, but it was in synteny with chromosomes 4 and 5 of the chicken genome. When comparing orthologs, 70% of the goose genes corresponded with 1:1 orthologs in the chicken gene-set (Additional file 2: Figure S5). Of the 1:1 orthologs for goose vs. duck (8,322 orthologs), however, 26.62% share up to 90% identity (Additional file 2: Figure S5). For chicken vs. turkey, 48.33% of the 1:1 orthologs (9,378 orthologs) share up to 90% identity (Additional file 2: Figure S5). For peregrine vs. saker, 57.87% of the 1:1 orthologs (10,569 orthologs) share up to 90% identity (Additional file 2: Figure S5).

A phylogenetic tree of eight avian species (goose, duck, chicken, turkey, zebra finch, pigeon, peregrine, and saker) was constructed using 4-fold degenerate sites from 5,081 single-copy orthologs. Analysis of the resulting tree revealed that geese and ducks belong to a subclade that was most likely derived from a common ancestor approximately 20.8 million years ago (Mya), whereas the chicken and turkey diverged 20.0 Mya, and the peregrine and saker diverged 1.3 Mya (Figure 1 and Additional file 2: Figure S6). Of the nine species, goose-specific gene families (other species lack these families) have enriched gene ontology (GO) functions, such as zinc ion binding, integrase activity, and DNA integration. Moreover, the olfactory receptor activity, DNA metabolic processing, G-protein coupled receptor activity, and transmembrane receptor activity GO categories exhibit the most significant gene-family expansion when compared with other birds (Additional file 1: Table S11), indicating that these functions were enhanced during goose evolution.

Divergence times for the nine species investigated in this study. A phylogenetic tree based on 4-fold degenerate sites in single-copy orthologous genes is shown. The divergence time estimates were calibrated using fossil data for lizard-bird and chicken-zebra finch. The estimated divergence times and associated 95% CIs are shown.

Rapidly and slowly evolved GO terms

To identify the GO categories that have undergone rapid or slow evolution in waterfowl, we compared two waterfowl (goose and duck) with terrestrial birds (chicken and turkey). We searched for functionally related genes with exceptionally high or low selection constraints in the goose and duck. For each category with at least 10 genes, the ω value (ω = Ka/Ks, where Ka = number of non-synonymous substitutions per non-synonymous site, and Ks = number of synonymous substitutions per synonymous site) was calculated and normalized using the median ω of each species pair. We identified 191 GO categories with elevated Ka/Ks ratios at the specified threshold between the waterfowl and terrestrial birds (Additional file 1: Table S12). Nineteen of these GO categories, including GTPase activity, galactosyltransferase activity, chloride transport, and GABA-A receptor activity, may have undergone significantly rapid evolution (Additional file 1: Table S12).
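The ω calculation and median normalization described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes that the normalizing median is taken over the retained categories for one species pair, the function names are hypothetical, and the input numbers are invented.

```python
from statistics import median


def omega(ka, ks):
    """Ka/Ks ratio; returns None when Ks is zero (ratio undefined)."""
    if ks == 0:
        return None
    return ka / ks


def normalized_category_omegas(category_rates, min_genes=10):
    """Normalize per-category omega by the median omega.

    category_rates maps a GO category name to (n_genes, Ka, Ks) for one
    species pair. Categories with fewer than min_genes genes are skipped,
    following the threshold stated in the text. Taking the median over
    the retained categories is an assumption of this sketch.
    """
    raw = {}
    for name, (n_genes, ka, ks) in category_rates.items():
        if n_genes < min_genes:
            continue
        value = omega(ka, ks)
        if value is not None:
            raw[name] = value
    if not raw:
        return {}
    mid = median(raw.values())
    if mid == 0:
        return {}
    return {name: value / mid for name, value in raw.items()}


# Invented example values, not data from the study.
example = {
    "category_A": (12, 0.02, 0.2),
    "category_B": (15, 0.04, 0.2),
    "category_C": (30, 0.01, 0.2),
    "category_D": (5, 0.50, 0.1),   # dropped: fewer than 10 genes
}
normalized = normalized_category_omegas(example)
```

A normalized value above 1 marks a category evolving faster than the typical category for that species pair, and a value below 1 marks a slower one.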

Positive selection

Ortholog identification was performed for goose, duck, zebra finch, chicken, turkey, and pigeon genome sequences, using the method applied for accelerated GO category analysis. Alignments of 7,861 orthologous genes were used to estimate the ratio of the rates of non-synonymous and synonymous substitutions per gene (ω), using the Codeml program under a branch-site model and F3x4 codon frequencies. We then performed a likelihood ratio test and identified 21 positively selected genes (PSGs) in waterfowl branches by means of FDR adjustment with Q-values <0.05 (Additional file 1: Table S13). Several of the PSGs, including eIF-3S1, GATA1, and eIF-3A, are involved in transcription or translation regulation. Kinase (PIK3R, FGFR2) and signaling molecule (KAI1) genes were also under positive selection, indicating that they may be involved in adaptation to an aquatic environment (Additional file 1: Table S13).
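The testing step can be sketched in code. The text does not say which reference distribution or FDR procedure was used, so the sketch below makes two loudly stated assumptions: the likelihood ratio statistic is compared with a chi-square distribution with one degree of freedom, and Q-values are obtained with the Benjamini-Hochberg procedure. The log-likelihoods would come from Codeml runs under the null and alternative models; here they are plain function arguments.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test, assuming chi-square with 1 df.

    For one degree of freedom the chi-square survival function has the
    closed form erfc(sqrt(x / 2)), so no external library is needed.
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return math.erfc(math.sqrt(statistic / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


def positively_selected(gene_lnls, q_threshold=0.05):
    """gene_lnls maps gene name -> (lnL_null, lnL_alt)."""
    names = list(gene_lnls)
    pvals = [lrt_pvalue(*gene_lnls[name]) for name in names]
    qvals = benjamini_hochberg(pvals)
    return [name for name, q in zip(names, qvals) if q < q_threshold]
```

Only genes whose adjusted value falls below the threshold (0.05 in the text) would be reported as positively selected.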

The resistance of waterfowl to disease

The major histocompatibility complex (MHC) gene is widely expressed in jawed vertebrates, and its function correlates with host disease resistance and immune responses [10-12]. Transposable elements in the chicken MHC region are more prevalent compared to the goose MHC region (54.62% in chicken vs. 15.11% in goose; Additional file 1: Table S14). Moreover, the distribution of the goose and chicken MHC region is different (Additional file 1: Table S15 and Additional file 2: Figure S7). In addition, we found that the goose genome exhibits substantial copy-number variations of innate immune response-related genes, as well as gene structures, when compared with chicken, turkey, zebra finch, human, and rat genomes (Additional file 1: Table S16). RNA viruses that escape toll-like receptors and infiltrate the cytoplasm are recognized by retinoic acid-inducible gene I (RIG-I), a pattern-recognition receptor that plays an important antiviral role [13-16]. Results from recent studies have shown that RIG-I is present in most mammals and some birds [17-19]. We found that RIG-I genes aligned well between goose and zebra finch (Additional file 1: Tables S17 and S18), but only fragments of the goose RIG-I aligned with the chicken and turkey RIG-I genes (Additional file 1: Table S19). We constructed a phylogenetic tree based on these data (Additional file 2: Figures S8 and S9) and found that the RIG-I gene is absent in chickens and turkeys. Compared to turkeys and chickens, some mammal and waterfowl species have increased resistance to the influenza virus [20,21]. This phenomenon may be because most mammals have two Myxovirus resistance (Mx) genes, while birds have only one. The Mx gene is a member of the guanine-3 phosphokinase gene family, and its expression is induced by interferons [21]. Many Mx proteins have been shown to provide influenza virus resistance at the cellular level [22].
Moreover, the different Mx proteins confer resistance to different diseases, and single base mutations can affect the ability of the protein to confer resistance [21,22]. In addition, the phylogenetic tree shows that mutations at key sites in the chicken and turkey Mx genes may inactivate the Mx protein, affecting antiviral activity and leading to diminished viral resistance (Additional file 2: Figures S10 and S11).

The susceptibility of geese to fatty liver

The liver is a vital organ that plays an important role in lipid metabolism, digestion, absorption, synthesis, decomposition, and transport. Under natural conditions, birds, especially some wild waterfowl, are more likely to show non-pathological hepatic steatosis as a result of energy storage before migration [23]. To identify the genetic mechanism underlying the occurrence of fatty liver, many previous studies have focused on goose fatty liver formation [5-7,24,25]. However, to date, the adaptive molecular mechanisms that induce higher synthesis of hepatic lipids, especially unsaturated fatty acids, in response to carbohydrate-rich diets remain to be understood in waterfowl species. To establish the molecular mechanism responsible for fat deposition in goose liver, we analyzed goose liver tissues in terms of cell morphology and plasma parameters, and performed tissue transcriptome and microRNA sequencing and analysis. After 20 d of overfeeding, the body weights of overfed geese were significantly higher than those of control geese. Liver weights were considerably higher in overfed geese (P <0.01) and accounted for 8.44% of the overall body weight, compared with 3.26% in the control geese (Additional file 1: Table S20). During the force-feeding period, overfeeding significantly increased the glucose, total cholesterol (TC), triglyceride (TG), and free fatty acid serum concentrations (Additional file 1: Table S21). Figure 2 shows that overfeeding of geese with a high-energy diet resulted in liver enlargement, with several lipid droplets deposited in the liver cells.
Transcriptome analysis showed that the gene expression levels of key enzymes involved in hepatocyte fatty acid synthesis (hk1, gpi, pfkm, pdh, cs, acly, mdh1, me1, acc, fasn, elovl6, scd, fads1, fads2, and dgat2) were significantly elevated (red italic lettering shown in Figure 3 and Table 2), while the activities of extracellular liver lipoprotein lipase (lpl) and the first key enzyme (pksG) involved in hepatic cholesterol synthesis were significantly reduced (green italic lettering in Figure 3 and Table 2). The expression of fatty acid transport protein genes (fatp), which are responsible for the transport of exogenous lipids into cells [26], was significantly increased (Figure 3 and Table 2). In contrast, expression of apolipoprotein B (apoB), which is responsible for binding with endogenous lipids and promoting their diffusion from liver cell membranes as very low-density lipoproteins (VLDLs) [27,28], was significantly attenuated (Figure 3 and Table 2). Previous studies have shown that lpl plays a major role in lipolysis of fatty acids from extracellular chylomicrons or VLDL, which can then be used or deposited in fat or muscle tissues [7,23]. The reduction in lpl activity increases the tendency for a large amount of extracellular lipids to diffuse into liver cells. These results suggest that the mechanism of goose fatty liver formation is mainly attributable to an imbalance between the storage and secretion (as plasma lipoproteins) of newly synthesized endogenous lipids and exogenous lipids in the cytoplasm. The liver lipid secretion capacity cannot offset the storage of newly synthesized cytoplasmic lipids, resulting in fat deposition in the liver.

Comparison of livers and liver tissue sections between overfed and control geese. (A) Goose liver tissue section after 3 weeks of overfeeding (200×); (a) goose liver after 3 weeks of overfeeding. (B) Normal goose liver tissue section (200×); (b) normal goose liver.

What the new pangenome reveals about bovine genes

Genome data from the original Brown Swiss were incorporated into the first pangenome of the domestic cattle. Credit: Colourbox

When researchers at ETH Zurich compared the reference genomes between several breeds of domestic cattle and closely related wild cattle, they discovered genes with previously unknown functions.

Modern genetic research often works with what are known as reference genomes. Such a genome comprises data from DNA sequences that scientists have assembled as a representative example of the genetic makeup of a species.

To create the reference genome, researchers generally use DNA sequences from a single or a few individuals, which can poorly represent the complete genomic diversity of individuals or sub-populations. The result is that a reference does not always correspond exactly to the set of genes of a specific individual.

Until a few years ago, it was very laborious, expensive and time-consuming to generate such reference genomes. For this reason, researchers concentrated on human genomes and the most important biological model organisms, such as the roundworm C. elegans.

However, as researchers now have access to fast sequencing machines, sophisticated algorithms that assemble DNA sequence readouts into complete chromosomes, and much greater computing power, creating reference genomes for other species has become increasingly practical. If researchers are to better understand evolution and other fundamental questions of biology, they need high-quality reference genomes for as many species as possible.

This includes livestock. For domestic cattle (Bos taurus), only a single reference genome was available until recently: from a Hereford cow called Dominette. Researchers had previously compared other DNA sequences of cattle against this reference to detect genetic variations and define corresponding genotypes. However, as it did not contain any genetic variants by which individuals differ, the previous reference did not reflect the diversity of the species.

A research team led by Hubert Pausch, Assistant Professor of Animal Genomics at ETH Zurich, has now filled this gap: with the genomes of three further breeds of domestic cattle, including the Brown Swiss (Original Schweizer Braunvieh), two closely related (sub-)species, the zebu and the yak, and the existing reference genome for domestic cattle, the researchers have created a "pangenome." The study detailing these findings has just been published in the scientific journal PNAS.

This cattle pangenome integrates sequences contained in the six individual reference genomes. "This means we can reveal very precisely which sequences are missing, for example, in the Hereford‑based reference genome, but are present in, say, our Brown Swiss genome or the genomes of other cattle breeds and species," Pausch says.
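The idea of asking which sequences are present in one genome but missing from the reference can be illustrated with a toy k-mer comparison. This is not the method used in the study, which is not described here; it is a deliberately simplified sketch on made-up strings that ignores the reverse strand, and every name in it is hypothetical.

```python
def kmers(sequence, k):
    """Set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def non_reference_kmers(reference, assemblies, k):
    """For each assembly, return the k-mers absent from the reference.

    assemblies maps a label (e.g. a breed name) to its sequence.
    """
    reference_set = kmers(reference, k)
    return {
        label: kmers(sequence, k) - reference_set
        for label, sequence in assemblies.items()
    }


# Made-up toy sequences: the second carries a short insertion.
toy_reference = "ACGTACGTAC"
toy_assemblies = {"other_breed": "ACGTACGGGTAC"}

novel = non_reference_kmers(toy_reference, toy_assemblies, k=4)
print(sorted(novel["other_breed"]))  # ['ACGG', 'CGGG', 'GGGT', 'GGTA']
```

The k-mers reported are exactly those spanning the inserted bases, which is the kind of signal that flags sequence missing from a single-individual reference.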

Family tree of the domestic cattle: This is how different cattle breeds and species are related to each other. The genomes of the respective breeds and (sub-)species (yak and Brahman) were incorporated into the pangenome. Credit: Graphic: ETH Zurich / Colourbox

New genes and functionalities discovered

In this way, the ETH researchers discovered numerous DNA sequences and even whole genes that were missing in the previous reference genome of the Hereford cow. In a further step, the researchers investigated the transcripts of these genes (messenger RNA molecules), which allowed them to classify some of the newly discovered sequences as functionally and biologically relevant. Many of the genes they discovered are connected with immune functions: in animals that had contact with pathogenic bacteria, these genes were more or less active than in animals that had no contact with the pathogens.

This project was made possible by a new sequencing technology that has been available at the Functional Genomics Center Zurich for a year now. With this new technology, the researchers are able to precisely read out long DNA sections, reducing the complexity of the computing process needed to correctly assemble the analyzed sections. "The new technology simplifies the genome assembly process. Now we can create reference genomes quickly and precisely from scratch," Pausch says. In addition, such analyses also cost less, meaning that researchers can now generate genomes in reference quality from many individuals of a species.

The ETH researchers are collaborating closely with the Bovine Pangenome Consortium, which wants to create a reference genome of at least one animal from every cattle breed worldwide. It also plans to analyze the genetic makeup of wild relatives of domestic cattle in this way.

More targeted breeding possible

The consortium and ETH professor Pausch hope that the reference genome collection will help them make useful discoveries such as genetic variants that are no longer present in domesticated animals, but that their wild relatives still possess. This would provide clues as to which genetic characteristics were lost as a result of domestication.

"Things get really exciting when we compare our indigenous cattle with the zebu or with breeds that are adapted to other climate conditions," Pausch explains. This lets researchers find out which genetic variants make animals in tropical environments more heat tolerant. The next step could be to deliberately use crossbreeding to introduce these variants into other cattle breeds or precisely introduce them through genome editing. However, that is still a long way off. For the present, researchers can benefit from the greater speed and precision that the new cattle pangenome brings to the process of detecting the genes and DNA variants that differ between cattle breeds.


Microbial profiles were analyzed from a total of ten colorectal cancer-associated studies, comprising 588 matched tumor and tumor-adjacent specimens (n = 294 pairs from nine studies) and 84 matched fecal and tumor biopsy specimens (n = 42 pairs from four studies; Tables 1 and 2). Principal coordinate analysis (PCoA) of paired tumor:tumor-adjacent samples revealed that these communities clustered primarily by study, then by platform and gene target. Although separation between these microbial communities was discernible, it was not completely distinct (S1 Fig). Tumor biopsy:fecal pairs from the same CRC case showed a compositional change in taxon abundances, especially in the investigations conducted by Chen et al. (Chen_V13_454) and Mira-Pascual et al. (Pascual_V13_454) (Panel A in S2 Fig). This difference was even more apparent when the PC3 axis was plotted against PC4 (Panel B in S2 Fig). Procrustes rotation revealed a moderate degree of similarity in most paired tumor:tumor-adjacent samples, while even greater similarity was observed in the studies conducted by Marchesi et al. (Marchesi_V13_454), Dejea et al. (Dejea_V35_454), Weir et al. (Weir_V4_454), and Kostic et al. (Kostic_V35_454) (Fig 1A and 1B). The overall correlation was 0.68 for axis 1 vs 2 (sum of squared deviations m² = 0.53) and 0.85 for axis 2 vs 3 (m² = 0.27; values for m² range from 0 (matrices are highly similar) to 1 (matrices are dissimilar)), with p = 0.001, rejecting the null hypothesis that the degree of congruence between the two Procrustes matrices is no greater than random (Fig 1A and 1B). The same Procrustes graphical superimposition showed a separation between the matched CRC tumor tissue and fecal samples (m² = 0.57 for axis 1 vs 2 and 0.25 for axis 2 vs 3, permutation-based p-value = 0.001; Fig 1C and 1D).

In Fig 1, the Procrustes analysis showed a moderate [in magnitude] but statistically significant difference between both the paired tumor and tumor-adjacent biopsy microbiomes (Fig 1A and 1B; m² = 0.68, p < 0.001) and the paired fecal and CRC tumor tissue samples (Fig 1C and 1D; m² = 0.65, p < 0.001) from the same case of CRC. Lines connect paired samples. Shapes indicate sample phenotype; colors indicate study cohort.

Phylum-level differences revealed that CRC tumor biopsy specimens harbored greater abundances of Fusobacteria and Actinobacteria, while their paired adjacent tissue counterparts harbored an elevated abundance of Firmicutes. Compared to their tumor biopsy counterparts, fecal samples harbored greater abundances of Verrucomicrobia and Euryarchaeota and fewer Proteobacteria (S3 Fig). In a pair-by-pair comparison of the most abundant annotated genera, CRC tumor samples exhibited greater mean abundances of Fusobacterium and Parvimonas, while tumor-adjacent samples presented greater mean abundances of Ruminococcaceae, Faecalibacterium and Parabacteroides, among others (Fig 2A). In the matched comparison, fecal samples yielded greater mean abundances of Roseburia, Blautia, and Bifidobacterium, while biopsy samples harbored greater mean abundances of Fusobacterium, Streptococcus, Prevotella, and Staphylococcus (Fig 2B). Within paired samples, there was considerable intra- and inter-study heterogeneity with respect to the magnitude and direction (elevated versus attenuated in tumor biopsy) of taxonomic changes. That said, a small number of taxa, e.g., Fusobacterium, Parvimonas, and Streptococcus, were consistently detected in greater abundance in tumor-associated samples, compared to both adjacent tissues and feces.

Boxplots indicate the distribution of the relative abundances of various taxa, and lines connect paired samples, depicting the direction of change in relative abundance of statistically significantly different families between CRC tumor biopsy samples (left) and the adjacent non-affected tissue microbiome (Fig 2A, n = 294 pairs, 588 samples) or fecal samples (Fig 2B, n = 42 pairs, 84 samples) for the various studies (colors). * indicates that mean relative abundance was statistically significantly different between the genera by paired Wilcoxon signed rank test, with p < 0.05 after FDR adjustment. All biopsy-based taxa presented in Fig 2A were statistically significantly different between tumor and tumor-adjacent biopsy samples by the above-mentioned test.

To identify robust, genus-specific associations across all studies, we performed differential abundance testing which accounted for the paired study design by assigning a ‘pair factor id’ to matched samples. Results from this per-study DESeq2 evaluation for 294 tumor:tumor-adjacent biopsy pairs were compared across the nine studies with a random effects model. Of the 80 genera analyzed, 41 were identified as being differentially abundant in 5 or more studies (i.e., >50% of studies analyzed), and 5 of these genera remained significant after FDR adjustment (p ≤ 0.1). Consistently observed were the increased abundances of Fusobacterium spp. (8/8 studies, adjusted REM model Log2fold change: 2.6, 95% CI: (0.9, 4.5), p = 0.002, FDR p = 0.02), Leptotrichia (5/8 studies, adjusted REM model Log2fold change: 1.4, 95% CI: (0.7, 3.7), p = 0.002, FDR p = 0.02), and Parvimonas (8/8 studies, adjusted REM model Log2fold change: 1.5, 95% CI: (0.6, 2.5), p < 0.001, FDR p = 0.001), along with Peptostreptococcus and Streptococcus, in tumor biopsy tissues relative to tumor-adjacent tissues. In contrast, an unclassified genus in the family Ruminococcaceae (8/8 studies, adjusted REM model Log2fold change: -0.7, 95% CI: (-1.1, -0.4), p = 1.9 × 10⁻⁵, FDR p = 0.001) and species of Faecalibacterium (8/8 studies, adjusted REM model Log2fold change: -0.7, 95% CI: (-1.1, -0.3), p = 0.001, FDR p = 0.02) were significantly more abundant in adjacent tissues than in tumor-associated specimens (Fig 3A and S2 Table).
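To make the pooling step concrete, here is a minimal sketch of a random effects meta-analysis of per-study log2 fold changes. The text does not state which estimator was used, so the DerSimonian-Laird estimator below is an assumption chosen because it is a common default; the helper names and the example numbers are invented for illustration and are not taken from the study.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a 95% interval


def se_from_ci(lower, upper):
    """Recover a standard error from a symmetric 95% confidence interval."""
    return (upper - lower) / (2.0 * Z_95)


def dersimonian_laird(effects, variances):
    """Pool per-study effects; returns (pooled effect, its SE, tau^2)."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q measures excess between-study variability.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum_re
    return pooled, math.sqrt(1.0 / sum_re), tau2


# Invented per-study log2 fold changes with 95% CIs for one genus.
study_effects = [2.1, 3.0, 2.4]
study_cis = [(1.0, 3.2), (1.5, 4.5), (1.6, 3.2)]
study_vars = [se_from_ci(lo, hi) ** 2 for lo, hi in study_cis]

pooled, pooled_se, tau2 = dersimonian_laird(study_effects, study_vars)
ci = (pooled - Z_95 * pooled_se, pooled + Z_95 * pooled_se)
```

When the studies agree closely, the between-study variance estimate is zero and the result coincides with a fixed effect (inverse-variance) average.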

Plots depict per-study and adjusted (REM model) log-fold changes across all studies for taxa that were differentially abundant in >50% of available studies, i.e., in ≥ five of the eight studies with paired CRC biopsy samples in Fig 3A (a shift to the right indicates taxa elevated in tumor; a shift to the left indicates taxa elevated in tumor-adjacent biopsy) and in ≥ three of the four studies with paired CRC fecal and biopsy samples in Fig 3B (a shift to the right indicates taxa elevated in tumor biopsies; a shift to the left indicates taxa elevated in feces from the CRC case). Individual log fold changes and FDR p-values for paired biopsy and paired fecal comparisons are provided in S2 and S3 Tables, respectively. Error bars denote 95% confidence intervals; the size of each point indicates the precision of the point estimate for individual studies [1/(95% CI upper bound − 95% CI lower bound)]. REM-model point size is fixed. Blank values for a particular study indicate that DESeq2 did not determine that taxon to be differentially abundant in that particular study cohort.

In evaluating fecal and biopsy samples from the same CRC case, a total of 42 pairs (n = 84 samples) from four distinct studies were considered. Of the 73 genera detected among these samples, 38 were differentially abundant in at least three of the four cohorts (i.e., >50% of studies analyzed), and three genera were significantly differentially abundant by the REM. These included the observed increase in abundance of Pseudomonas (3 of 4 studies, adjusted REM model Log2fold change: 4.0, 95% CI: (2.5, 5.5), p = 2.8 × 10⁻⁷, FDR p = 1.1 × 10⁻⁵), Streptococcus (3 of 4 studies, adjusted REM Log2fold change: 1.9, 95% CI: (0.8, 3.0), p < 0.001, FDR p = 0.006), and Porphyromonas (adjusted REM Log2fold change: 2.3, 95% CI: (0.7, 3.8), p = 0.004, FDR p = 0.05) in tumor-associated specimens relative to fecal samples. Although Fusobacterium and Parvimonas exhibited high REM adjusted Log2fold change values (1.8 in 3 of 4 studies and 2.0 in 4 of 4 studies, respectively), these did not retain statistical significance after FDR correction (Fig 3B and S3 Table). Per the RE model, four taxa were common across the paired biopsy and biopsy:fecal comparisons: species of Parvimonas, Porphyromonas, Phascolarctobacterium, and Lachnobacterium.

We evaluated the similarity (and dissimilarity) of taxa in biopsies and fecal samples. Of the 35 non-zero abundance genera detected, 6 were unique to biopsies, 21 were present in biopsies as well as fecal samples, and fecal samples had an additional 8 unique taxa (S4 Table). A random forest classifier to distinguish mucosal and fecal associated taxa performed with reasonable accuracy. With an area under the ROC curve of 82.5% (Fig 4), the taxa contributing to differentiation between the two sample types were members of the phylum Proteobacteria (Panel B in S4 Fig). It should be noted that the fecal-biopsy classifier was based on the relative abundances of microbial features rather than their simple presence or absence. We found many overlapping taxa between these ecological niches, and the RF model demonstrates that although the distribution of these taxa is shared, their richness or density varies based upon niche. The random forest model for classifying paired tumor biopsy samples and tumor-adjacent tissues exhibited an area under the ROC curve of 64.3% (Fig 4), suggesting that tumor-adjacent tissues harbor microbial communities that are more difficult to distinguish from, and thus more similar to, tumor-associated communities than tumor versus stool-associated communities. The more discriminatory taxa for the paired biopsy samples included those within the genera Fusobacterium and Faecalibacterium (Panel A in S4 Fig).

The tumor biopsy vs. fecal classifier [area under curve (AUC) = 82.5%] was better able to distinguish CRC fecal samples from tumor tissue samples than the tumor vs. tumor-adjacent biopsy classifier (AUC = 64.3%). Again, given the compositional overlap between these niches, these classifiers relied on differentially abundant features rather than niche-specific distribution.
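For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one. The sketch below computes it directly from that definition; it is independent of the random forest used in the study, and the labels and scores are invented.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from its pairwise-ranking definition.

    labels are 1 (positive, e.g. tumor biopsy) or 0 (negative, e.g.
    fecal sample); scores are the classifier outputs. Ties count half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking, which is why a value of 64.3% indicates communities that are much harder to tell apart than a value of 82.5%.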

The final aim of this study was to determine which functional differences may be present in tumor-associated communities and the degree to which these differences may be driven by the primary taxonomic perturbations we identified or were the result of subtle shifts among multiple taxa. The single-taxon filter in FishTaco was used to identify 14 differentially abundant KEGG pathways. Of these, six statistically significant pathways remained after being further evaluated in the multi-taxa mode (accounting for taxa co-variation) and subjected to multiple comparison adjustment. The relative abundances of pathways for tyrosine metabolism, glutathione metabolism, lipopolysaccharide (LPS) biosynthesis, polycyclic aromatic hydrocarbon degradation, ethylbenzene degradation, and stilbenoid, diarylheptanoid and gingerol biosynthesis differed significantly between tumor and tumor-adjacent tissue samples. Species of Fusobacterium and Leptotrichia were the primary CRC case-associated taxa associated with enrichment of tyrosine metabolism, LPS biosynthesis, and polycyclic aromatic hydrocarbon degradation (Panel A in Fig 5).

For each pathway presented, the top left bar shows the tumor biopsy-associated taxa that attenuate the functional shift and the top right bar shows the tumor biopsy-associated taxa that increase its magnitude; the bottom bars show the tumor-adjacent taxa (Fig 5A) or the fecal-associated taxa (Fig 5B). OTUs mentioned in the legend are classified to genus level. Red diamond markers indicate the cumulative metagenome-based shift in Wilcoxon score. In Fig 5A, which compares tumor (top bar) with tumor-adjacent biopsy (bottom bar) samples, Fusobacterium and Leptotrichia are tumor biopsy-associated and related to increased function. Parvimonas is also tumor biopsy-associated but related to attenuated functional shifts for most pathways. In Fig 5B, which compares tumor biopsy (top bar) and fecal samples (bottom bar) obtained from the same CRC patient, several different Proteobacteria (e.g., Xanthomonadaceae, Comamonadaceae, Enterobacteriaceae, Halomonas, and Morganella) were associated with tumor biopsy and with enrichment of the functional pathways.

In a paired tumor biopsy:fecal comparison, single-taxon permutation analyses identified 13 differentially abundant KEGG pathways that, when subjected to multi-taxa analysis coupled with Shapley orderings, yielded a total of six statistically significant functional pathways. These included synthesis and degradation of ketone bodies, which were largely impacted by differing abundances of Xanthomonadaceae, Shewanella, and Acinetobacter (all belonging to the phylum Proteobacteria). Pseudomonas, members of the families Comamonadaceae and Enterobacteriaceae, and Staphylococcus contributed marginally to valine, leucine, and isoleucine degradation, tyrosine metabolism, alpha-linolenic acid metabolism, and the renin-angiotensin system (Fig 5B).

Identification of genes that are associated with DNA repeats in prokaryotes

Using in silico analysis we studied a novel family of repetitive DNA sequences that is present among both domains of the prokaryotes (Archaea and Bacteria), but absent from eukaryotes or viruses. This family is characterized by direct repeats, varying in size from 21 to 37 bp, interspaced by similarly sized non-repetitive sequences. To appreciate their characteristic structure, we will refer to this family as the clustered regularly interspaced short palindromic repeats (CRISPR). In most species with two or more CRISPR loci, these loci were flanked on one side by a common leader sequence of 300-500 bp. The direct repeats and the leader sequences were conserved within a species, but dissimilar between species. The presence of multiple chromosomal CRISPR loci suggests that CRISPRs are mobile elements. Four CRISPR-associated (cas) genes were identified in CRISPR-containing prokaryotes that were absent from CRISPR-negative prokaryotes. The cas genes were invariably located adjacent to a CRISPR locus, indicating that the cas genes and CRISPR loci have a functional relationship. The cas3 gene showed motifs characteristic for helicases of the superfamily 2, and the cas4 gene showed motifs of the RecB family of exonucleases, suggesting that these genes are involved in DNA metabolism or gene expression. The spatial coherence of CRISPR and cas genes may stimulate new research on the genesis and biological role of these repeats and genes.
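The defining layout described above (short direct repeats separated by similarly sized, non-repetitive spacers) can be searched for with a simple k-mer scan. The sketch below is a toy illustration only, not the method used in the cited work: it requires exact repeat matches, whereas real CRISPR finders tolerate mismatches. The function name `find_repeat_arrays`, its default spacer bounds, and the toy sequence are all hypothetical.

```python
from collections import defaultdict


def find_repeat_arrays(seq, k, min_spacer=20, max_spacer=40, min_repeats=3):
    """Return {kmer: [start positions]} for exact k-mers that occur at
    least min_repeats times, with every pair of consecutive occurrences
    separated by a spacer of min_spacer..max_spacer bases."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    arrays = {}
    for kmer, starts in positions.items():
        if len(starts) < min_repeats:
            continue
        # Spacer length = gap between the end of one copy and the start of the next.
        spacers = [b - a - k for a, b in zip(starts, starts[1:])]
        if all(min_spacer <= s <= max_spacer for s in spacers):
            arrays[kmer] = starts
    return arrays


# Toy sequence (hypothetical): three copies of a 21-base repeat separated
# by two different 30-base spacers, with short flanks on either side.
repeat = "GATCGATACCCACCCCGAAGA"
seq = "TTTTT" + repeat + "A" * 30 + repeat + "C" * 30 + repeat + "TTTTT"
arrays = find_repeat_arrays(seq, k=len(repeat))
print(arrays)
```

Because the spacers differ from one another, only the repeat itself recurs at regular intervals; k-mers inside the homopolymer spacers overlap one another and are rejected by the spacer-length check.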

Why repetitive DNA is essential to genome function?

There are clear theoretical reasons and many well-documented examples which show that repetitive DNA is essential for genome function. Generic repeated signals in the DNA are necessary to format expression of unique coding sequence files and to organize additional functions essential for genome replication and accurate transmission to progeny cells. Repetitive DNA sequence elements are also fundamental to the cooperative molecular interactions forming nucleoprotein complexes. Here, we review the surprising abundance of repetitive DNA in many genomes, describe its structural diversity, and discuss dozens of cases where the functional importance of repetitive elements has been studied in molecular detail.

In particular, the fact that repeat elements serve either as initiators or boundaries for heterochromatin domains and provide a significant fraction of scaffolding/matrix attachment regions (S/MARs) suggests that the repetitive component of the genome plays a major architectonic role in higher order physical structuring. Employing an information science model, the ‘functionalist’ perspective on repetitive DNA leads to new ways of thinking about the systemic organization of cellular genomes and provides several novel possibilities involving repeat elements in evolutionarily significant genome reorganization. These ideas may facilitate the interpretation of comparisons between sequenced genomes, where the repetitive DNA component is often greater than the coding sequence component.

I Historical questions

II.1 Light chains (kappa or lambda)

II.1.1 Kappa chain: V-J rearrangements
II.1.2 Lambda chain: V-J rearrangements
II.1.3 Allele exclusion and isotype

II.2.1 V-D-J rearrangements
II.2.2 Isotype switching

II.3 Membrane and secreted Igs

III.1. Germline diversity: multigene families
III.2. Diversity due to DNA rearrangements
III.3. Diversity as a result of somatic hypermutations

An immunoglobulin (Ig) consists of 2 identical light chains (L) and 2 identical heavy chains (H) (for example, IgG-type). At the three-dimensional level, an Ig chain consists of one N-terminal variable domain, V, and one (for an L chain) or several (for an H chain) C-terminal constant domain(s), C.

The cells of the B lineage synthesize immunoglobulins. These are either membrane-bound (on the surface of B lymphocytes) or secreted (by plasma cells).

As soon as the main characteristics of the immunoglobulins were discovered, a number of questions arose:

II.1. Light chains (kappa or lambda)

II.1.1. Kappa chain: V-J rearrangements

NOTE: Only the genes for the immunoglobulins and T-receptors undergo DNA rearrangement.

Each IGKV gene is followed downstream (in the 3' position) by an RS consisting of a CACAGTG heptamer, and then by a 12-bp spacer, and then an ACAAAAACC nonamer.

Each IGKJ gene is preceded upstream (in the 5' position) by an RS consisting, between 5' and 3', of a GGTTTTTGT nonamer, a 23-bp spacer and a CACTGTG heptamer.
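The two recombination signals (RS) described above mirror each other: read on the opposite strand, the heptamer and nonamer upstream of an IGKJ gene are the same motifs that follow an IGKV gene, and only the spacer length differs (12 bp versus 23 bp). The short sketch below checks this and tests a flanking sequence for each signal. The helper names and the placeholder spacer bases (`N`) are hypothetical; only the heptamer, nonamer, and spacer lengths come from the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


# Motifs and spacer lengths as given in the text.
V_HEPTAMER, V_SPACER, V_NONAMER = "CACAGTG", 12, "ACAAAAACC"
J_NONAMER, J_SPACER, J_HEPTAMER = "GGTTTTTGT", 23, "CACTGTG"


def has_v_rs(downstream):
    """True if the sequence immediately 3' of a V gene begins with
    heptamer + 12-bp spacer + nonamer."""
    start = len(V_HEPTAMER) + V_SPACER
    nonamer = downstream[start:start + len(V_NONAMER)]
    return downstream.startswith(V_HEPTAMER) and nonamer == V_NONAMER


def has_j_rs(upstream):
    """True if the sequence immediately 5' of a J gene ends with
    nonamer + 23-bp spacer + heptamer."""
    tail = len(J_SPACER * "N") + len(J_HEPTAMER)   # bases after the nonamer
    nonamer = upstream[-(tail + len(J_NONAMER)):-tail]
    return upstream.endswith(J_HEPTAMER) and nonamer == J_NONAMER


# Hypothetical flanking sequences; "N" stands in for arbitrary spacer bases.
v_flank = V_HEPTAMER + "N" * V_SPACER + V_NONAMER + "TT"
j_flank = "TT" + J_NONAMER + "N" * J_SPACER + J_HEPTAMER

print(reverse_complement(V_HEPTAMER) == J_HEPTAMER)
print(reverse_complement(V_NONAMER) == J_NONAMER)
print(has_v_rs(v_flank), has_j_rs(j_flank))
```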

II.1.2. Lambda chain: V-J rearrangements

II.1.3. Allele exclusion and isotype

II.3. Membrane and secreted Igs

III.1. Germline diversity: multigene families

III.2. Diversity due to DNA rearrangements

III.3. Diversity as a result of somatic hypermutations

Finally, somatic mutations are extremely numerous (somatic hypermutations) and specifically target the rearranged V-J and V-D-J genes of the Ig, but the mechanism by which they arise is not yet known. AID (activation-induced cytidine deaminase) may be implicated both in the occurrence of the mutations and in the switch mechanism. The mutations appear during the differentiation of the B lymphocyte in the lymph nodes and contribute to increasing the diversity of the Igs by a further factor of 10^3, which makes it possible to achieve a potential diversity of 10^12 different Igs (answer to question A).

These different mechanisms of diversity make it possible to obtain 10^12 different immunoglobulins, capable of responding to the several million known antigens (answer to question A).

The number of different Igs is in fact limited by the number of B cells in a given species.