How to quantify the Abundance of a particular gene in Bacteria and Archaea?

How to quantify the Abundance of a particular gene in Bacteria and Archaea?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

For 16s rDNA metagenomic analysis, do I need to do the following steps?

DNA extraction-> Purification-> PCR amplification-> Clone -> Pick every single colony-> PCR amplify -> Sequence -> Phylogenetic Analysis

For the abundance of a particular gene abundance, what steps do I need to follow? Is it necessary to clone?



Many terrestrial habitats are limited in the major nutrients P and N (Elser et al. 2007). The P concentration of the parent rock material, the turnover of the internally bound organic P, and the sorption of P onto soil particles most importantly determine the P availability in forest soils, as fertilizer hardly plays a role (Walker and Syers 1976). Changes in P pools are strongly driven by microbial activities and dependent on microbial community composition and its activity pattern (Richardson and Simpson 2011 Rodríguez et al. 2006). Vice versa, the composition of P pools in soil shapes microbial communities and determines their functional potential (Bergkemper et al. 2016a). However, not only P pools and P availabilities drive P transformation processes, but also the overall nutrient stoichiometry strongly influences microbial P turnover, as microorganisms keep a stable ratio of macronutrients in their biomass (Cleveland and Liptzin 2007). For example, Sorkau et al. (2018) described a positive correlation of microbial P and N in soils of different temperate forest regions. Moreover, a positive correlation of bioavailable P fractions in soil and microbes potentially able to fix N was found (Bergkemper et al. 2016a). Vice versa, inorganic P (Pi) limitation can repress N assimilation as the respective genes are under the control of the Pho regulon, which contains several genes controlled by a two-component system, which detects Pi concentrations in the environment (Santos-Beneit 2015). Highest turnover rates for plant-available nutrients like N, P, or C have been described for biological hotspots like rhizosphere or drilosphere in soil (Hoang et al. 2016 Kuzyakov and Blagodatskaya 2015 Lipiec et al. 2016 Reinhold-Hurek et al. 2015 Schulz et al. 2013 Uksa et al. 2015). Biological soil crusts (biocrusts) can be considered as such hotspots. Biocrusts are mostly dominated by organisms like phototrophic cyanobacteria, microalgae, lichens, or mosses, which drive the input of nutrients into the system. Together with associated heterotrophic microorganisms such as archaea, bacteria, or fungi, they create stable microhabitats, for example, by the excretion of polysaccharides (Cania et al. 2020 Mugnai et al. 2018, Vuko et al. 2020). Biocrusts have been mostly studied in arid and nutrient-poor habitats (Belnap et al. 2001), where they are the predominant vegetation form and their growth is limited by water and nutrient availability. Biocrusts from those regions contribute about half of the terrestrial N2 fixation (Elbert et al. 2012). Moreover, it was demonstrated by Beraldi-Campesi et al. (2009) that biocrusts are not only enriched in N but that this also comes along with higher total P concentrations underlining the importance of balanced nutrient concentrations as proposed by Cleveland and Liptzin (2007). Less is known and only a few studies exist, which describe biocrusts as hotspots for nutrient turnover in temperate regions (Baumann et al. 2019 Brankatschk et al. 2013 Corbin and Thiet 2020 Gypser et al. 2016 Schaub et al. 2019 Schulz et al. 2016 Szyja et al. 2018), especially within well-developed ecosystems like forests (Baumann et al. 2017 Glaser et al. 2018 Williams et al. 2016). Atmospheric P inputs by dry and wet deposition in forests contribute as well since forests are dust traps also for particles from irrigated, highly fertilized agricultural soils (Aciego et al. 2017 Berthold et al. 2019) and those might be entrapped in the polymeric matrix of biocrusts. However, the relative contribution to P pools strongly depends on the bulk soil P concentration (Aciego et al. 2017). The role of biocrusts in P transformation in temperate forests has been described by Baumann et al. (2017), who found that, similar to biocrusts from arid regions, the P concentrations in biocrusts are enriched compared with adjacent bulks soils. Additionally, they demonstrated that especially the concentration of P-containing minerals decreased and of organic P concentrations increased in biocrusts compared with bulk soil. Thus, we hypothesize that (i) microorganisms colonizing biocrusts are involved in solubilization of mineral P and its transformation to biomass and thus abundances of those microorganisms are higher in biocrusts compared with bulk soil. (ii) As it was supposed that P and N turnover are closely linked, we further hypothesize that similar to P mineralization, also the potential for N mineralization is more pronounced in biocrusts, because of the higher microbial abundance, while the energy demanding fixation of N is less important in biocrusts from nutrient-rich forest soils. (iii) The strength of correlations between N and P turnover strongly depends on the N/P ratio of the bulk soil.

To test these hypotheses, we compared biocrust and bulk soil samples from temperate beech (Fagus sylvatica L.) forests, along a gradient in soil texture, nutrient concentrations, and pH values. Samples were taken once when biomass of biocrusts was highest. As the lifetime of biocrusts in temperate forests is short (they typically occur only at the end of the winter period before trees develop leaves) and the size of biocrusts is small as they newly develop every season, which excludes repeated sampling of the same biocrust, we abstained from repeated samplings but instead increased the number of analyzed replicates. Bacterial genes coding for proteins catalyzing P mineralization (alkaline phosphatase—phoD phosphonoacetaldehyde hydrolase—phnX), solubilization (quinoprotein glucose dehydrogenase—gcd), and uptake (substrate binding protein of the phosphate ABC transporter—pstS low-affinity Pi transporter—pitA) as well as N mineralization (bacterial chitinase group A—chiA alkaline metalloprotease—apr) and N2 fixation (dinitrogenase reductase subunit of the nitrogenase—nifH) were used as proxies for the abundance of the respective functional bacterial groups. We used qPCR to measure the abundance of bacterial genes involved in P and N turnover and correlated the data with the stable and labile P and N pools.


The nitrogen biogeochemical cycle carried out by microorganisms has been investigated previously in diverse environments including soils, estuaries, sediments, coral ecosystems and freshwater and ocean ecosystems. The most intensively studied process thus far has been the initial nitrification step (ammonia oxidation), which can be carried out by both bacteria and archaea (Prosser and Nicol, 2008). Since the discovery of the archaeal amoA (ammonia monooxygenase subunit alpha) from a metagenomic studies conducted in open ocean (Venter et al., 2004) and soil (Treusch et al., 2005), the ammonia-oxidizing archaea (AOA) have been recognized as the predominant ammonia-oxidizing microbial community in various environments, including soils (Leininger et al., 2006) and marine ecosystems (Beman et al., 2007). The detection of archaeal amoA sequences in harsh conditions such as alkaline soil (Shen et al., 2008), hot spring sediments (Reigstad et al., 2008) and cold deep sea water (Nakagawa et al., 2007) supports the general supposition of the worldwide ubiquity of the AOA. The determinative environmental factors for the AOA community may be as follows: ammonium levels, organic carbon, temperature, salinity, dissolved oxygen level, pH, sulfide and phosphate levels (Erguder et al., 2009). A great deal of detailed information about AOA ecology has already been compiled (Erguder et al., 2009).

Nitrate reductase encoded by the narGHJI operon is responsible for the reduction of nitrate to nitrite (Philippot and Hojberg, 1999). Two types of nitrite reductase catalyze the reduction of nitrite to nitric oxide: a cytochrome cd1 encoded for by nirS, or a Cu-containing enzyme encoded by nirK. Nitric oxide is subsequently reduced by the nitric oxide reductase encoded by norB, producing nitrous oxide, a powerful greenhouse gas. Nitrous oxide contributes to global warming and climate change much more profoundly than carbon dioxide or methane. The final step in denitrification is the reduction of nitrous oxide to gaseous dinitrogen by nitrous oxide reductase encoded by nosZ. However, some denitrifiers do not harbor complete sets of genes required for complete denitrification (Zumft, 1997).

The Antarctic terrestrial ecosystem is also subject to climate change. The rate of surface temperature increase has averaged more than 0.1 °C per decade over the past 50 years (Steig et al., 2009). The majority of climate change models predicts increases in overall annual precipitation and temperature, and also that the impact of these changes will be greater at the polar region (Maxwell, 1992). Changes in soil moisture will influence microbial activity, organic matter turnover rates and ammonium availability via mineralization. The nitrogen cycle in Antarctic soils will therefore be influenced by climate change. Nitrogen fixation could be inhibited by increased nitrogen availability, which functions as a negative feedback in this process. Increased mineralization could accelerate nitrification and may, in turn, promote denitrification and nitrous oxide production (Paul and Clark, 1996). However, studies of the nitrogen cycle in the polar region have primarily been carried out in ocean ecosystems, and only amoA has been of interest in many of these cases, aside from nitrogen fixation or denitrification genes. The nitrogen cycle and the impacts of climate change have yet to be adequately evaluated in Antarctic soils. Therefore, the principal objective of this study was to evaluate nitrogen cycle potential by assessing the abundance of genes involved in every step of the nitrogen pathway. Microbial responses to warming effects and nutrient availability were assessed at the microcosm scale.


B.L. acknowledges support from a NERC Algorithm PhD Studentship (NE/F008864/1) and from the NERC-funded programme Oceans 2025 (Theme 3: Coastal and shelf processes). We thank the crew of the RV Sepia for their help with sediment collection.

Fig. S1. Variation in monthly pelagic nutrient concentrations in Jennycliff Bay (5.3497 N, 04.1331 W) in the Western English Channel. (A) nitrite, (B) nitrate, (C) ammonium, (D) silicate, (E) phosphate. Inset legends show water depth in metres. Data are from the PML Benthic Survey Data Inventory (Woodward et al., 2013). (F) Variation in monthly temperature (closed circles black line) and salinity (open diamonds grey line) for the L4 site (50.225 N, 04.1944 W), in the Western English Channel, with lines connecting the average value for each month. Data are from the Western Channel Observatory Data Inventory (Fishwick, 2013).

Table S1. Sediment sampling strategy.

Table S2. Primer pairs and reaction conditions used for quantitative polymerase chain reaction (q-PCR) assays. q-PCR amplification and detection for all assays was carried out using an ABI 7000 sequence detection system (Applied Biosystems, Carlsbad CA, USA) with an initial denaturation for 5 min at 95°C, followed by 40 cycles of 95°C for 15 s and annealing temperatures as listed below for 1 min. All reactions were carried out in 25 μl final volume final primer concentrations are shown for this reaction volume. For each reaction, the standard curve was calculated using the ABI Prism 7000 sequence detection software. From the standard curve, the slope (m), y intercept and coefficient of determination (r 2 ) recorded and used to calculate the efficiency of the amplification (E) using the equation E = (10 1/ m − 1) × 100. Values for calculating the efficiency of each reaction are given, as well as the threshold cycle value (CT), which was determined using automatic analysis settings.

Table S3. Seasonal variation in the average gene abundance (gene copies g −1 wet sediment) in burrow (B) and surface (S) sediments, and comparison to abundances reported in the literature for similar marine environments and using similar gene primers. Full data are available from the Dryad Digital Repository (doi: to be confirmed). Absolute gene abundances were calculated from standard curves using the r 2 , y intercept and efficiency values given in Table S2.

Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.

How can microbial population genomics inform community ecology?

Populations are fundamental units of ecology and evolution, but can we define them for bacteria and archaea in a biologically meaningful way? Here, we review why population structure is difficult to recognize in microbes and how recent advances in measuring contemporary gene flow allow us to identify clearly delineated populations among collections of closely related genomes. Such structure can arise from preferential gene flow caused by coexistence and genetic similarity, defining populations based on biological mechanisms. We show that such gene flow units are sufficiently genetically isolated for specific adaptations to spread, making them also ecological units that are differentially adapted compared to their closest relatives. We discuss the implications of these observations for measuring bacterial and archaeal diversity in the environment. We show that operational taxonomic units defined by 16S rRNA gene sequencing have woefully poor resolution for ecologically defined populations and propose monophyletic clusters of nearly identical ribosomal protein genes as an alternative measure for population mapping in community ecological studies employing metagenomics. These population-based approaches have the potential to provide much-needed clarity in interpreting the vast microbial diversity in human and environmental microbiomes.

This article is part of the theme issue ‘Conceptual challenges in microbial community ecology’.

1. Introduction

Take any introductory biology textbook and you will probably find evolution defined as change in the genetic makeup of populations. Being defined as the locally coexisting representatives of species, populations are, in practice, also the units of diversity that are used when we wish to measure species diversity to assess ecological interactions as well as ecosystem stability and resilience [1]. For microbes, however, populations have been notoriously difficult to define [2], and we use arbitrary units of diversity to measure the genetic makeup of communities [3]. This difficulty in defining populations is, of course, rooted in the absence of a biologically meaningful species concept for bacteria and archaea [3–6]. Without clearly defined populations, many of the most fundamental questions in community ecology are difficult to answer. For example, do disturbances lead to changes in genotypic composition within populations or to species turnover? Differentiating between these possibilities is a meaningful question because shifts in genotype within a population may be far less disruptive to ecological networks than wholesale changes in species composition. Indeed, this question is at the heart of understanding the dynamics of key microbial communities, including the human microbiome.

Defining bacterial and archaeal populations, and by extension species, is, therefore, an important endeavour for community ecology, but can we do it? Is microbial diversity organized into natural units to which we can ascribe biologically meaningful properties? Specifically, do fundamental evolutionary processes organize coexisting genotypes into units through which adaptations can specifically spread, giving rise to ecological units with clearly different dynamics? If we can define microbial populations in such a way, then we may be able to apply the rich evolutionary and ecological theory developed for animals and plant populations [7,8] if not, then we might need fundamentally different theory and approaches [2].

Here, we explore the question of whether bacteria are organized into genetically clearly delineated, ecologically differentiated populations. We argue that although bacterial and archaeal recombination, both homologous and non-homologous, is unidirectional and promiscuous, environmental structure and selection have the potential to structure gene flow sufficiently for ecologically differentiated units to arise. We next discuss why recognizing such units has remained so difficult and show that, by estimating only very recent gene flow, congruent units of gene flow and ecology are indeed recovered. Although many more examples are still needed, these units may be the bacterial and archaeal equivalent of populations and their identification may ultimately contribute to solving the microbial species problem. We conclude by drawing implications for measuring biologically meaningful diversity in the environment.

2. Should we expect to find clearly delineated populations among bacteria and archaea?

Although gene flow is potentially promiscuous in the sense that any microbe can, in principle, share genes with any other [9,10], it need only be structured enough to allow for preferential adaptations to spread in order for populations as local ecological units to emerge [11,12]. Consider that populations, which occupy a defined habitat, consist of individuals that are under similar selective pressures because they coexist and carry out similar functions (figure 1). Such habitats may be small organic particles in soils or aquatic environments, or more expansive bodies of water with defined physical and chemical properties [13–15]. However, the key is that habitats are nearly always patchy and ephemeral, and that they allow for a subset of populations within the community to increase in abundance through preferential growth [13,16–18]. As a result, active populations have a higher probability of sharing genetic material because homologous recombination rates decrease exponentially with sequence divergence [19,20] and preferential microhabitat associations ensure higher encounter rates (figure 1).

Figure 1. The magnitude of gene flow between microbial populations is shaped predominantly by the genetic similarity and ecological overlap of the individual strains that make up those populations. While the efficiency of homologous recombination decreases exponentially with sequence divergence, the likelihood of transfer increases with greater physical contact between strains that occupy similar physical niches. (Online version in colour.)

This increased encounter and recombination of actively growing genotypes has important consequences for creating and maintaining ecological cohesion [12]. If an adaptation arises within a population, it will spread more easily within the population, owing to the combination of preferential gene flow and fitness increase in the genotypes carrying the adaptation [11]. In other words, depending on the balance between the strength of selection and rate of recombination, the adaptation may spread through the population by a selective sweep [12,21]. If the adaptation is useful to other coexisting populations, its fitness advantage to a particular population is short-lived because horizontal gene transfer probably makes it available to other populations [22]. However, the scenario can be quite different if trade-offs are associated with the carriage of the adaptation, meaning that it may not function as well in a different genomic or ecological background [12,23,24]. If this is the case, an adaptation may remain population- or species-specific for much longer and enforce ecological differentiation. Trade-offs can also initiate the process of speciation if genotypes carrying the adaptation are more fit in a new habitat but less so in the ancestral habitat [12,23]. This effect may induce physical separation and thus a gene flow barrier between the nascent populations [12,25].

The trade-offs discussed above are often difficult to identify because they require the examination of very recently speciated populations. Among more divergent species, too many genetic changes have typically accumulated as well as been lost to identify the trait associated with the trade-off. One clear example comes from recently speciated bacterial populations in the ocean [26]. A comparative genomic approach identified two populations of Vibrio cyclitrophicus that were differentially distributed in ocean samples, one being associated with organic particles and the other occurring free-living. Both populations contained genome regions that differentiated them, including regions that contained much reduced nucleotide diversity, indicating a recent sweep of a specific allele, as well as regions that showed differential gene presence as expected from recent population-specific additions or losses. Some of these differentiating alleles and genes were clearly associated with biofilm formation and attachment, leading to the hypothesis that the ability to associate with particles was either lost or gained in one of the populations [26].

This hypothesis of differential adaptation based on observed genetic differences was subsequently confirmed by behavioural observations of representatives of the two populations that suggested a competition–dispersal trade-off [27]. Microfluidics was used to create an ecological landscape resembling conditions in the ocean where small particles represent a habitat to which bacteria can attach and degrade the solid organic material [13,16]. This degradation process itself creates an ephemeral habitat of patches of dissolved organic material because the attached bacteria extracellularly degrade organic polymers faster than they can import the breakdown products into the cell [16]. A cloud of mono- or oligomers forms around the particle by diffusion, and this material can be consumed by motile bacteria [28]. When such conditions were simulated in the microfluidic system, the two populations appeared differentially adapted to the solid and dissolved resources, respectively. While one responded by attaching to the particles and growing in biofilms, the other was capable of efficient dispersal among particles, rapidly detecting them and swimming towards new particles [27]. This suggests that the latter population is indeed better adapted to the exploitation of ephemeral, soluble nutrient patches, while the first commits to the degradation of the solid organic material. Although difficult to prove, it was inferred from the genomic comparison that these behavioural differences were involved in the speciation process because the differential adaptations represent an ecological trade-off that cannot easily coexist in genomes.

Although the above example demonstrates the power of population genomics combined with fine-scale environmental sampling, the discovery of such recently speciated populations was nonetheless fortuitous. It was aided by the fact that a protein-coding gene used as a marker to differentiate isolates initially was linked to a sweep region and thus clearly differentiated these two populations [26]. In most cases, population structure cannot be inferred a priori and instead such inference requires an approach where some measure of diversity is mapped onto environmental samples. We next outline reasons for this difficulty of recognizing population or species boundaries among bacteria and archaea based on genetic information alone.

3. Why is it so difficult to define populations?

In a recent opinion piece, Rocha [2] outlined challenges in bacterial (and archaeal) population genetics in the light of the neutral theory of evolution. One of the most important problems is that it has been nearly impossible to define the object of the study because of its fuzzy nature. Similar arguments have been made earlier for species boundaries [29]. Such fuzziness is observed in phylogenetic trees of multiple loci across the genome because they result in different topologies. That is, although clustering is observed, it is inconsistent when different genes are considered, reflecting their divergent evolutionary history [29,30]. A recent paper even argued that recombination has been so promiscuous among Escherichia coli isolates that there is no majority tree, even though, paradoxically, a similar tree is always produced when averaging over different larger genome regions [31]. This is potentially problematic when, as in many recombination estimation methods, individual genes are compared to such a consensus tree that is supposed to reflect the clonal history (or clonal frame) of the population. Overall, these observations suggest that phylogenetic methods can encounter problems in delineating populations and species.

The issue with phylogenetic methods may be that they integrate over too long evolutionary timeframes to be useful for population differentiation. In particular, among recently speciated populations, only a very small fraction of the genome supports a distinction between them [26]. This is illustrated well in the analysis of two recently speciated V. cyclitrophicus populations, where essentially every genomic region they shared had its own unique evolutionary history and both populations appeared completely intermixed [26]. This is an apparent paradox: how can there be recombination across population boundaries while population-specific sweeps are observed? The answer lies in the time scales over which phylogenetic comparisons integrate. When a method was devised to analyse only the most recent recombination events, these were more frequent within populations. This suggests that while the two populations shared a common history of recombination, the most recent post-population-divergence recombination events were population-specific [26].

Even many methods designed to measure recombination may suffer from a similar problem of integrating over evolutionary timeframes that are too long to capture speciation events. We recently carried out a simple experiment where we simulated a burst of recombination among a group of otherwise clonally evolving genomes and observed how the signal of recombination decayed as mutations accumulated [32]. When recombination was analysed with two different methods that rely on the identification of homoplasies, there was still considerable signal long after gene flow was terminated. This is because homoplasies are only slowly erased by the random mutational process, so that methods relying on their measurement integrate over long periods of time and do not capture only the contemporary recombination process. Such integration over long timeframes becomes problematic when closely related populations or even species are being compared and suggests that methods capable of analysing more contemporary gene flow are needed to correctly recover population or species boundaries [32].

4. Can we estimate gene flow in the context of contemporary population structure?

If current methods cannot recover species or population boundaries, is there an alternative that can correctly identify such boundaries? We have recently proposed such a method that relies on measuring the homogenizing force of recombination between two genomes and is able to identify much more recent gene transfer than other methods [32]. This method, called populations as clusters of gene transfer (PopCOGenT), differs from others, in that it estimates recent gene transfer via shared identical genome regions (figure 2). Because such identical tracks between two closely related genomes can originate via vertical inheritance or horizontal gene transfer, PopCOGenT differentiates the two using a simple model of vertical (clonal) inheritance. If two genomes diverge clonally by mutational accumulation without recombination, they will have a characteristic length and frequency distribution of identical regions that can be estimated by a Poisson model of single-nucleotide polymorphisms [32]. Significant enrichment in identical regions above that expectation can then serve as an estimate of gene transfer (figure 2). The gene transfer signal decays by an order of magnitude within the time it takes for genomes to diverge by 0.1%, and PopCOGenT can, therefore, provide a much more contemporary measure of gene transfer than other methods [32].

Figure 2. The method ‘populations as clusters of gene transfer’ (PopCOGenT) estimates the amount of recent horizontal gene transfer by measuring the distribution of lengths of identical sequences shared by any two genomes. By comparing this distribution to a null model of clonal evolution (i), PopCOGenT determines a ‘transfer bias’ owing to horizontal gene transfer. After the cessation of horizontal transfer between genomes, this transfer bias decays rapidly owing to the accumulation of mutations. (Online version in colour.)

Importantly, the measure of gene transfer provided by PopCOGenT can be used to construct a network to examine how recombination structures genetic diversity (figure 3). In the example shown in figure 3, the individual genomes show different amounts of gene flow between them. Some isolates form a clearly isolated cluster, while others remain connected by considerable gene flow, yet are further structured into more weakly connected subclusters. As detailed below, such subclusters can be observed by applying a simple clustering algorithm to the raw gene flow network. Moreover, because PopCOGenT works with pairwise alignments, it can compare all shared regions, irrespective of whether these are shared by all isolates across a population. In that way, recently shared genetic material in both the core and flexible genome can be taken into account, i.e. in the gene complement that is shared by all and or subsets of isolates in a population, respectively.

Figure 3. PopCOGenT identifies populations through pairwise whole-genome alignments of environmentally derived isolate or single-cell genomes. It is often unclear how to group strains together into biologically meaningful populations from phylogenetic trees made from multiple genome alignments or concatenated marker genes (left). Further, the diversity in these phylogenetic trees can only ever depict the evolutionary history of core genomic regions. By performing pairwise alignments, PopCOGenT estimates gene transfer across all regions shared by any two genomes and identifies population structure without relying on rigid identity cut-offs (middle). While some populations are completely disconnected from other groups by gene flow, others remain interconnected, and the underlying population structure is revealed through clustering that identifies subclusters of extensively connected strains (right). The isolated clusters of genomes can be considered species-like owing to the properties they share with the biological species concept definition requirement of genetic isolation. (Online version in colour.)

When applied to several bacterial and archaeal model systems for which population structure has been estimated (using population genomics combined with ecological and physiological data), PopCOGenT was able to recapitulate the original predictions [32]. These model systems represent a critical test, as each has been shown to comprise closely related sister populations distinguished by cohesive properties, including differential dynamics in environmental samples. When PopCOGenT was used to construct a gene flow network among genomes from these model systems, the raw network was structured into gene flow clusters that were highly congruent with the previously identified genetic and ecological units.

These initial clusters in the raw gene flow network had no connection to other such clusters, indicating that recent gene flow between many ecological populations is essentially undetectable [32]. When a simple clustering algorithm was applied, however, the additional structure was revealed in some cases, i.e. subclusters of enriched gene flow within that maintain some gene flow between. These subclusters also recapitulated two models of recently diverged populations in V. cyclitrophicus and Sulfulobus icelandicus [26,33], indicating that PopCOGenT can correctly identify nascent populations separated by weaker gene flow discontinuities [32]. One of the datasets also consisted primarily of genomes amplified from single cells of the ocean cyanobacterium Prochlorococcus. Such single-cell genomes are usually difficult to compare by traditional methods because they are incomplete in random areas. However, PopCOGenT can handle incomplete information because it relies on pairwise comparisons as long as sufficient overlap between pairs is available. What constitutes sufficient information remains poorly explored and datasets can also easily be confounded by contaminating DNA that may be scored as gene transfer connections among unrelated genomes. Nonetheless, the potential to carry out population genomics with single-cell genomes and thus sidestep cultivation represents a potential advantage of PopCOGenT. Overall, the observation of clusters and subclusters among closely related genomes suggests that estimates of gene flow alone can be used to hypothesize genetic and ecological units. However, how can we be sure that the correct boundaries among these units have been identified?

5. How can we test if predicted population structure is biologically meaningful?

To answer this question, we return to the argument that for genetic and ecological units to be congruent, adaptations must be able to spread in a species- or population-specific manner. A critical test is, therefore, whether there are properties that differentiate the most closely related sister populations. Both examples of the speciation models of V. cyclitrophicus and S. islandicus suggest that such properties can be identified [26,33]. We, therefore, extended the logic of the gene flow analysis to the identification of alleles and genes that have swept in a population-specific manner [32] (figure 4). We reanalysed Ruminococcus gnavus genomes isolated from healthy individuals as well as patients with Crohn's disease and ulcerative colitis [34]. The application of PoCOGenT showed a connected network with three subclusters, two of which were sampled enough to test for adaptations in the form of population-specific alleles or genes [32]. For these adaptations to have arisen recently by population-specific sweeps, they should show much reduced diversity in the alleles or genes encoding them compared to the average nucleotide diversity across the genomes of the populations.

Figure 4. A major function of populations and species identified by gene flow is that they are the fundamental units through which adaptive traits radiate and spread. When alleles are acquired by a population (either through de novo mutation or horizontal acquisition from a distant relative), those alleles can be transferred to other members of the same population by homologous recombination. Further, if those traits provide a niche-specific benefit that substantially increases the fitness of their host, they will rise to fixation in that population owing to selection. Consequently, a hallmark of these regions when comparing genomes is locally diminished nucleotide diversity at the selected locus. The observation of these regions that have undergone recent selective sweeps are a useful confirmation that the predicted population structure is biologically meaningful. Indeed, randomized population groupings consistently prevent the identification of swept regions.

When a pipeline was developed to identify genome regions with significantly reduced nucleotide diversity compared to the population average (figure 4), several alleles in the core genome and genes in the flexible genome were identified that differentiated both populations [32]. These regions were all unlinked and distributed across the genome, indicating that they arose independently from each other. Many of these alleles and genes could not be annotated, but several encoded surface proteins, suggesting that they are involved in some form of communication with the environment. These results, therefore, suggest that gene flow is sufficiently biased in a population-specific manner to allow for adaptations to spread by recombination and serves as a strong confirmation that correct ecological units have been identified.

6. How can population structure evolve under horizontal gene transfer?

How can the observation of clearly delineated clusters in contemporary gene flow be reconciled with observations of horizontal gene transfer that has, in some cases, been called ‘rampant’ [35]? There is abundant evidence that there is a continuous uptake and incorporation of divergent genetic material into bacterial and archaeal genomes [25]. That is, each cell might at any point harbour genes that have recently been acquired from any number of other microbes. Although such incorporation of divergent genes will affect phylogenetic clustering of isolates, it will not disrupt the gene flow network sufficiently to mask population structure if the gene flow within populations is much higher than between, as we suggest here. Moreover, if the gene flow is fairly random, it will link strains between populations in a more or less haphazard way, so that connections are fairly unstructured. Indeed, many of the acquired genes may be lost fairly quickly if they are, as seems likely, at least slightly deleterious to the recipient genome [11]. Hence, populations and possibly species are indeed fuzzy units owing to horizontal gene transfer, but such fuzziness does not preclude their definition as ecological units if gene flow is sufficiently biased towards within-population recombination to allow for adaptations to sweep in a specific manner.

A constant sampling of genetic material from divergent sources can, in fact, provide the raw material for adaptation [11]. Although it is widely accepted that evolutionary innovation can arise by horizontal addition of genes into the genome, the extent to which even allelic sweeps arose horizontally rather than by mutation within the population was surprising in our recent analysis of the recently differentiated Ruminococcus populations discussed above. The vast majority of adaptive alleles we were able to identify were horizontally acquired from divergent sources [32]. Similarly, an adaptive radiation that differentiated closely related populations of ocean bacteria for different physical forms of the same polysaccharide was based on acquisition and loss dynamics of genes [36]. Even multiple copies of the same polysaccharide lyases originated by transfer rather than duplication, including some enzymes that were present in as many as seven copies per genome. These observations are consistent with previous analysis of diverse genomes that also showed duplication of genes within the same genome to be rare in microbes [37]. This is a fundamental difference to eukaryotes, where duplications are common and evolutionary innovation arises by mutation within the genome [38].

7. What are potential caveats of population structure predictions?

Considering that the results so far demonstrate the existence of surprisingly highly isolated gene flow clusters, are there potential scenarios where the horizontal transfer can mask or erase population structure? This aspect remains poorly explored, but several scenarios can at least be imagined. Recombination rates among microbes are highly variable [32,39], and if very low, the input of a larger set of genes from one other population may create a strong link with a subset of genomes in the population under consideration, confounding population structure analysis. The most likely scenario is a population with low recombination rates that receives a large mobile genetic element (MGE) that is under positive selection in both the donor and recipient population and thus connects a large fraction of the genomes. Such a case might originate if, for example, an antibiotic resistance plasmid moves through a microbiome under strong antibiotic selection. It is thus advisable to test population structure with and without MGEs, or to include closely related genomes from samples that have not been subject to antibiotic treatment. Moreover, it is possible that two related populations suddenly occupy similar niches owing to some environmental change. Such alteration in co-occurrence may allow for increased gene flow, especially if under selection, and lead to despeciation as has been postulated for some Campylobacter species in animal microbiomes [40]. Although these types of situations may lead to population structure that is less clear than those identified in the model systems we analysed, the gene flow patterns are nonetheless biologically relevant and may lead to interesting hypotheses about the environmental selection.

We stress that any population structure prediction represents a hypothesis in itself and needs to be carefully analysed as it may be affected by sampling and other factors. However, we believe that if populations carry signatures of specific adaptations, such as gene-specific sweeps (figure 4), these serve as some of the strongest possible evidence that the predicted population represents an ecological unit and hence the most relevant unit for community ecology.

8. What are key properties of populations defined by gene flow?

One striking feature of the populations identified here is that they contain relatively low nucleotide diversity in their core genome, i.e. in the genes shared by all. The genomes of both bacteria and archaea analysed so far are typically more than 98% similar in the nucleotide sequence within populations, which is consistent with data obtained from a different approach for predicting population structure [41]. Such high similarity would also ensure that homologous recombination within populations remains efficient, as its rate decays exponentially with sequence divergence [19,20]. It should also be noted that these low values are quite consistent with nucleotide diversity within animal and plant species. For example, human genomes differ by at most 0.2% of nucleotide sites compared to the human reference human genome [42].

If the populations defined by gene flow are taken as local representatives of species, they are considerably more narrowly defined than those resulting from the comparison of average nucleotide identity (ANI), which has become the basis of a popular species definition [43,44]. When ANI is compared across diverse groups of genomes, there is typically a minimum observed at around 95% ANI, the presumed species boundary [44]. However, this boundary probably does not conform to population or species boundaries for reasons similar to those voiced above concerning population boundaries estimated with some recombination methods. Once gene flow decreases owing to speciation, the genetic similarity between the nascent species will decay because recombination no longer acts as a homogenizing force [25]. Yet, this decay is a slow process and for the signal of genetic similarity to reach a minimum will take considerable time [32]. Hence, the population or species boundary may lie within the 95% similarity value, and, importantly, recently speciated populations may not be recognizable because their genomes have not diverged enough, masking ecological or disease associations as recently demonstrated [26,32,45]. Hence, while appealing for their simplicity, it is questionable whether ANI minima can define biologically meaningful species boundaries.

A further important property of populations defined by gene flow is that the pan-genome remains of considerable size [46]. That is, in spite of genomes being very closely related across the shared genes, they display a considerable number of genes that are not shared. Many of these genes remain unannotated and their role for population biology is thus unclear. However, there are an increasing number of examples which show that the flexible genome might, at least in part, be under negative frequency-dependent selection, a form of selection where the fitness of a genotype decreases as it becomes more frequent in the population [46]. This effect may be especially strong for organismal interaction such as public good production and predation. For example, the production of siderophores by some genotypes has been shown to be accompanied by the evolution of cheaters that lack the production genes but retain the uptake genes [47,48]. Moreover, viral receptors and defence genes are frequently relegated to the flexible genome, indicating that they cannot rise to high abundance within populations as protection against specific viruses decimating the population [46,49,50]. Finally, there is also increasing evidence that such flexible genome regions can be preferentially shared within populations by homologous recombination of the flanking regions so that rather than being repeatedly acquired de novo, many flexible regions are part of a population's biology [46].

9. What are the implications for measurement of diversity in the environment?

The approach for hypothesizing population structure based on gene flow followed by testing of the hypothesis by the identification of population-specific sweeps allows for a reverse ecology approach that predicts ecological units from genomic information alone [32,51]. In this way, the approach can provide an unbiased framework for identifying important variables that drive diversification in microbial populations by highlighting alleles and genes under strong selection. This approach thus provides a unique lens to delineate microbial niche space that is agnostic to being able to accurately measure where strains fall along environmental gradients. Of course, direct insights into ecological differentiation based on any genomic approaches depend heavily on the accuracy of gene annotations, which is currently patchy at best. But a reverse ecology approach can also help formulate hypotheses for relevant genes that need to be further characterized by other approaches such as molecular genetics or structural analysis and may thus help build a more structured approach towards solving the omnipresent annotation problem.

Loci under selection are particularly useful for assessing the abundance of populations in environmental samples because their within-population diversity is exceptionally low, while the diversity between populations is much higher because evidence so far indicates that most loci arose by horizontal gene transfer from divergent sources [32]. These properties mean that sweep loci can be detected with very high accuracy in environmental samples, and their prevalence throughout the genome of recombinogenic organisms adds statistical power in assessing the abundance of populations in complex communities. Accordingly, shotgun metagenomes of DNA extracted from microbial communities provide a convenient way to quantitatively assess the abundance of multiple loci in multiple samples. However, this approach is of limited use if target populations are rare in their environment. Sweep loci could also be targets for high-resolution assays such as digital polymerase chain reaction allowing researchers to rapidly measure the abundance of populations in different samples if greater sensitivity is required. These regions could also be targets for fluorescence in situ hybridization probes to directly visualize how closely related populations are differently distributed in the environment. We envision that this will allow for more targeted testing of fine-scale environmental associations that far exceeds the efficiency of traditional forward ecological approaches, which often rely on mapping microbial groups onto coarse environmental variables and then using genomics to find potential differences [12].

How do populations defined by gene flow compare to the traditional measurement of microbial diversity by 16S rRNA gene sequencing often used to map microbial populations onto environmental samples? To answer this question, we use an example from our own work where we have delineated Vibrionaceae bacteria into coexisting populations in ocean water. We typically find around 20 or so coexisting populations that were originally defined by the fine-scale environmental sampling of isolates, sequencing of protein marker genes and application of mathematical modelling to link genetic diversity to environmental structure [52–55]. These population predictions have recently been confirmed by the much simpler gene flow analysis [32] enabling direct comparison of one of the protein marker genes (hsp60) with different 16S rRNA gene fragments used to define operational taxonomic units (OTUs) for their potential to differentiate ecological units in samples.

This comparison shows disturbingly low resolution of the 16S rRNA genes when compared with populations defined by gene flow (figure 5). Especially 16S rRNA tags typically used in high throughput sequencing have essentially zero resolution for ecological populations. For the full-length gene, this is only slightly better, showing that speciation by far outpaces the resolution of the 16S rRNA genes. This means that the gene has very limited information when it comes to ecological dynamics of populations in environmental samples, and a unique sequence may mask many ecologically differentiated populations, an effect that obviously becomes worse the more broadly OTUs are defined in terms of sequence divergence.

Figure 5. 16S rRNA gene sequence clusters can distinguish 0–7 of 14 ecologically distinct Vibrionaceae populations depending on sequence length and clustering cut-off, while clusters in the hsp60 marker gene can distinguish all or nearly all. Phylogeny is based on 52 concatenated ribosomal proteins. A shaded box indicates that a taxon can be uniquely distinguished with the given gene length and clustering method, while a white box indicates that a taxon is merged with at least one other taxon, in at least one gene cluster. Habitat distribution descriptions are derived from a quantitative analysis of populations' distributions across three different sample sets by Preheim et al. [54]. Taxa without habitat descriptions were excluded from that analysis because of limited sampling. (Online version in colour.)

Considering that the prediction of population structure by gene flow requires isolates or single-cell genomes, is there a proxy that can be developed for species and population identification in metagenomes? Potentially yes. One interesting feature of the populations we have identified is that they are quite well approximated by nearly identical ribosomal protein sequences [32,45]. Although even for these, some structure can be masked because of rapid speciation, these genes can serve nonetheless as a much more accurate proxy for population structure. Whether this observation holds more broadly across many taxa will have to be explored in larger datasets [56], but importantly, identical ribosomal proteins can be extracted from metagenomic datasets and their dynamics thus easily analysed [57]. We, therefore, recommend targeting ribosomal proteins when species and population dynamics are of interest in metagenomic samples.

10. Concluding remarks

The identification of populations as gene flow clusters that are also ecological units has major implications for microbiology, which has long suffered from the fuzzy definition of populations [2]. We suggest that recent gene flow measured from collections of closely related genomes can clearly delineate population boundaries even at relatively early stages of differentiation. These populations are characterized by alleles and genes that have recently swept to fixation, indicating that positive selection can spread adaptations in a specific and exclusive manner. The identification of such gene-specific sweeps provides both confidence in the population boundaries and creates hypotheses of recent adaptations that differentiate populations from each other. Hence, these populations can be regarded as adaptively optimized units of bacteria and archaea equivalent to how populations are viewed in macroecology and evolution. Such populations then hold significance when we want to study community ecology, as they allow for sharper identifications of associations with biotic and abiotic factors.

Finally, considering that many of the populations defined here display a very high degree of genetic isolation, it is tempting to invoke the biological species concept, which posits that species are reproductively isolated groups of organisms [58]. However, we stress that the analyses for bacteria and archaea presented here primarily considered organisms that either coexist or live in separate locations connected by high migration. As we outlined here, genetic isolation of such populations may be enforced by selection. Yet, a characteristic of many species is that they consist of geographically separate populations connected by various degrees of gene flow. How such structure influences the delineation of clusters remains an open question, but this will be important to determine in the pursuit of a biologically meaningful species concept for bacteria and archaea.


3.1. Spatial patterns of microbial abundance and potential nitrification

Ordinary kriging plots revealed that the overall abundance of bacterial 16S rRNA, archaeal 16S rRNA, and fungal ITS genes were significantly (p <਀.05) higher in the woodland samples than in adjacent grassland samples with a distinction across the transition zone (Figure  1 upper panel Table  1 ). In general, bacterial abundance was consistently high along the grid with the number of gene copies varying between 10 7 and 10 8 per gram of dry soil. Conversely, archaeal (10 5 � 7 ) and fungal (10 4 � 7 ) abundance showed much larger variation in copy numbers. A number of small areas were found near the transition where the abundances of all three microbial kingdoms were high. Soil PNR closely resembled the spatial patterns of bacterial ammonia oxidizers (Figure  1 lower panel). Similar to overall microbial abundance, the abundance of ammonia oxidizers was significantly (p <਀.05) higher in the woodland than the grassland, and a visually distinct transition zone was also noted. Bacterial amoA displayed a significantly (R 2 = 0.317 p <਀.001) positive association with PNR across the ecotone (Supporting Information Figure S3). All five gene abundances and PNR displayed high spatial dependency (SPD =਀.631𠄰.999) and operated between 14 m and 36 m (Supporting Information Table S2). Spherical and Gaussian models showed significant (R 2  =਀.285𠄰.900 p <਀.01) spatial structure for all properties tested. Bacterial amoA and PNR structured at a smaller spatial range (14.7�.7 m) than archaeal amoA, which had a spatial range 㹐 m.

Geostatistical kriging plots showing the spatial patterns of microbial abundance and potential nitrification across the woodland‐grassland ecotone. A 50 m ×ꀠ m grid was established across woodland and grassland in Namadgi National Park, Australia. Ordinary kriging was performed after semivariance analysis and cross‐validation. The dotted lines on kriging maps indicate the boundary between woodland in the upper half of the map and grassland in the lower half

Table 1

Soil microbial indices and potential nitrification rate across the woodland‐grassland ecotone at the Namadgi National Park, Australia

Soil microbiota and activitiesEcotone components
Abundance (Log copies g 𢄡 dry soil)
Bacterial 16S rRNA8.40 (0.05) a 8.29 (0.08) a 7.84 (0.12) b
Archaeal 16S rRNA6.91 (0.11) a 6.96 (0.21) a 6.10 (0.16) b
Fungal ITS7.69 (0.09) a 7.06 (0.16) a 5.39 (0.66) b
Bacterial amoA4.55 (0.25) a 4.82 (0.20) a 3.54 (0.34) b
Archaeal amoA5.86 (0.28) a 5.12 (0.20) a 5.51 (0.19) a
Archaea42.2 (7.71) a 21.5 (1.91) b 22.1 (1.30) b
Bacteria1169 (43.10) b 1147 (96.03) b 1395 (21.87) a
Fungi374 (14.10) a 308 (27.03) b 337 (12.22) ab
Pielou's evenness
Archaea0.40 (0.04) a 0.09 (0.02) c 0.21 (0.03) b
Bacteria0.77 (0.01) a 0.80 (0.01) b 0.82 (0.01) b
Fungi0.61 (0.04) a 0.68 (0.04) a 0.72 (0.03) a
Diversity (Shannon‐Weaver)
Archaea1.45 (0.15) a 0.27 (0.07) c 0.66 (0.11) b
Bacteria5.48 (0.08) b 5.67 (0.15) b 6.00 (0.03) a
Fungi3.57 (0.23) a 3.86 (0.19) a 4.22 (0.21) a
Nitrification potential (PNR) (μg NO3‐NO2 g 𢄡 ਍ry soil hr 𢄡 )0.13 (0.04) a 0.36 (0.078) b 0.08 (0.01) a

Along the grid length (50 m), the first 20 m was woodland, 10 m was transition, and the last 20 m was grassland, resulting (n) in 20, 15, and 20 samples, respectively. For microbial diversity indices, n =ਆ.

Soil microbial properties were compared between woodland, transition, and grassland by performing one‐way ANOVA with Duncan post hoc test.

Different letters indicate statistical significance at < 0.05.

3.2. Microbial community structure, taxonomic composition, and diversity

Archaeal, bacterial, and fungal communities displayed a significant habitat effect with the woodland and grassland samples forming distinct clusters and those from the transition zone samples showing a gradient (Figure  2 upper panel). PERMANOVA confirmed such habitat effects for bacterial (Pseudo‐F =਄.79 p <਀.001), archaeal (Pseudo‐F =਄.42 p <਀.001), and fungal (Pseudo‐F =ਃ.43 p <਀.001) communities. Principal coordinates explained 52%, 81%, and 30% variation in bacterial, archaeal, and fungal communities, respectively. Similar to community structure, taxonomic composition was also influenced by the habitat edge (Figure  2 lower panel). For bacteria, Acidobacteria, Actinobacteria, and Alphaproteobacteria were the dominant members, representing more than 80% of total bacterial abundance across the ecotone. Acidobacteria and Alphaproteobacteria were more abundant in the woodland samples while Betaproteobacteria and Chloroflexi were the most common within the transition zone (p <਀.05). Nitrososphaerales was the predominant archaeal group, especially, in the grassland and transition zone samples where it represented up to 99% of the total archaea. On the other hand, a number of fungal groups showed significant (p <਀.01) differences across the ecotone. For example, Agaricomycetes and Leotiomycetes were 2𠄳 times more abundant in woodland samples than either the transition zone or grassland samples. Similarly, Saccharomycetes and Dothideomycetes were significantly more abundant (p <਀.05) in the transition zone and Agaricostilbomycetes (p <਀.001) in the grassland. Microbial alpha diversity indices showed contrasting patterns across the ecotone (Table  1 ). For example, richness, evenness, and diversity of bacteria were significantly (p <਀.05) lower in the woodland samples than the transition zone and grassland. Conversely, archaeal richness, evenness, and diversity were significantly (p <਀.05) higher in the woodland than either the transition zone or the grassland samples. This indicates that the woodland samples harbored a less diverse bacterial community and a highly diverse archaeal community. On the other hand, the relatively N‐ and P‐rich grassland samples supported a diverse bacterial community (Supporting Information Table S1). Fungi also had a significantly higher richness in woodland samples.

Principal coordinate analysis revealing community structure of bacteria, archaea, and fungi in woodland, grassland, and transition zone (upper panel). Stacked bar chart (bottom panel) showing relative abundance of various phyla and classes of bacteria, archaea, and fungi in woodland, grassland, and transition zone

3.3. Microbial co‐occurrences

Network analysis showed distinct co‐occurrences of archaeal, bacterial, and fungal members across the woodland, grassland, and transition zone (Figure  3 ). The microbial network including the top 1000 MIC scores comprised 324 nodes (260 bacterial, four archaeal, and 60 fungal OTUs Figure  4 ). Among the top 10 keystone taxa, six were bacteria and four were fungi (Table  2 Figure  3 a). Half of the bacterial keystone taxa belonged to the class Alphaproteobacteria whereas most fungal keystone taxa were members of the Ascomycetes. Two bacterial OTUs belonged to Rhizobiales and Burkholderiales. Random Forest Analysis showed that ammonium, total carbon, and C:N ratio were the major determinants of the abundance of microbial keystone taxa (Figure  3 b). The network across the ecotone consisted of 193 nodes of which woodland and grassland were associated with 116 and 74 nodes, and the transition zone connected to two nodes (Figure  3 c). The overall network comprised 1767 nodes and had a diameter of 6 and a radius of 4 (Supporting Information Figure S3). Woodland and grassland formed clusters away from the transition zone. While the network was dominated by bacteria, fungal and archaeal nodes were also abundant in the woodland. Overall, microbial and co‐occurrences were considerably different in three habitat components across the ecotone.

(a) Microbial network showing co‐occurrences of bacterial, archaeal, and fungal OTU s. This network of top 1,000 interactions consisted of 324 nodes. Enlarged nodes represent the top ten microbial keystone taxa of which six were bacterial and four fungal. (b) Results of Random Forest Analysis showing the edaphic drivers of microbial keystone taxa. The MSE indicates vector of mean square errors. (c) Microbial co‐occurrences in the woodland, grassland, and transition zone. This network comprised 193 nodes. To indicate the most important interactions, only strong positive (r >਀.8), strong negative (r < 𢄠.8), and strong nonlinear ( MIC  – ρ 2  >਀.8) relationships were shown in the networks. Oval nodes represent bacterial OTU s, rectangular nodes represent fungal OTU s, and triangular nodes represent archaeal OTU s. Color of the nodes represents different taxonomic groups while green, red, and wavy lines represent positive, negative, and nonlinear relationships, respectively. Only statistically significant (p <਀.05) relationships are shown

Relationship among microbial co‐occurrence, soil properties, and ecological processes. (a) Archaeal, bacterial, and fungal OTU s formed distinct clusters with soil chemical properties. Large clusters such as P consisted of 164 nodes, C:N comprised 76 nodes, and pH consisted of 42 nodes. (b) Clusters of microbial OTU s linked to potential nitrification and extracellular enzyme activities. The cluster of PNR comprised 49 nodes. To indicate the most important interactions, only strong positive (r >਀.8), strong negative (r < 𢄠.8), and strong nonlinear ( MIC  – ρ 2  >਀.8) relationships were shown in the networks. Oval nodes represent bacterial OTU s, rectangular nodes represent fungal OTU s, and triangular nodes represent archaeal OTU s. Color of the nodes represents different taxonomic groups while green, red, and wavy lines represent positive, negative, and nonlinear relationships, respectively. Only statistically significant (p <਀.05) relationships are shown

Table 2

Network features and taxonomy of top ten keystone taxa. OTUs with highest degree, highest closeness centrality, and lowest betweenness centrality were selected as the keystone taxa

OTUidNetwork featuresTaxonomy
Betweenness centralityCloseness centralityDegreeKingdomPhylum or classOrder
Botu7810.0240.502265Bacteria Acidobacteria Acidobacteriales
Fotu6710.0210.510254Fungi Eurotiomycetes Chaetothyriales
Fotu6950.0260.508250Fungi Dothideomycetes Capnodiales
Botu7060.0150.499246Bacteria Alphaproteobacteria Rhodospirillales
Fotu6260.0130.502241Fungi Zygomycota Mortierellales
Botu9140.0120.501238Bacteria Verrucomicrobia Pedosphaerales
Botu2570.0080.484227Bacteria Alphaproteobacteria Caulobacterales
Botu8900.0150.494220Bacteria Alphaproteobacteria Rhizobiales
Fotu5690.0110.486210Fungi Eurotiomycetes Chaetothyriales
Botu810.0050.476199Bacteria Betaproteobacteria Burkholderiales

3.4. Relationships among microbial co‐occurrences, soil properties, and ecological processes

Microbial co�undances were linked to soil properties and ecological processes, with the subnetworks comprising mainly bacterial OTUs (Figure  4 Supporting Information Figure S4). For soil properties, total soil P formed the cluster with maximum nodes followed by C:N ratio and pH (Figure  4 a). Mineral N, DON, C:N, and P clusters shared several nodes and were predominantly connected through positive and linear edges. In general, the subnetworks were dominated by Alphaproteobacteria and Actinobacteria OTUs in bacteria. For soil ecological processes, PNR formed a large and distinct cluster away from soil enzymes and predominantly consisted of bacterial nodes but no archaeal nodes, whereas soil enzymes formed individual clusters but were interconnected through shared nodes (Figure  4 b). This is especially true for cellobiohydrolase and phosphatase activities that had a large shared guild. β‐glucosidase had the smallest cluster but was connected with both cellobiohydrolase and phosphatase through a fungal node belonging to Dothideomycetes. Similar to soil P content, phosphatase formed the largest clusters. These clusters of soil enzymes were also dominated by bacterial OTUs. Potential nitrification showed consistently significant (p <਀.01) correlations with soil properties and extracellular enzymes (Supporting Information Table S3). Bacterial amoA gene copy number had a strong association with PNR activity in these relatively N‐rich soils, and this was also supported by the fact that bacterial amoA was strongly (p <਀.01) correlated with soil N content. Taken together, microbial co‐occurrences reflected the differences in soil properties and ecological processes.


How did the proteomes expand across the ToL, and how did chaperones evolve to support this expansion? To address this question, we compiled data from multiple sources and analyzed them under one roof, thus allowing a systematic, quantitative comparison, as summarized in Fig. 4. From the simplest free-living prokaryotes to plants and animals, proteomes have continuously expanded by both duplications and innovations. It is primarily due to the latter that proteome “complexity” has continuously increased in various ways that demand increased chaperone action. Across the ToL and especially when comparing prokaryotes with eukaryotes, we see a larger number of proteins per proteome (Fig. 4A) as well as larger proteins. The latter relates to multidomain proteins being increasingly represented. The number of different folds increases as well as of unique combinations of folds in multidomain proteins. Proteomes also contain a larger fraction of protein types that are prone to misfolding, such as repeat proteins and proteins comprising beta sheets, and those that are predicted to be highly aggregation-prone. The birth of a new fold, or of a new domain combination, likely results in poor foldability. With time, mutation and selection would improve foldability (59) and could ultimately render a newly born chaperone-independent protein. Nonetheless, the cumulative impact of newly evolved proteins, and of certain protein types (repeat or beta-rich proteins), likely demands increased chaperone capacity.

Summary figure describing the parallel expansion of proteomes and chaperones. Bar heights (y axis) were scaled such that the highest value per parameter assumed the same height (the absolute values are listed above the bars). (A) Bar graphs describing the expansion of proteomes in a nutshell. For the simplest free-living archaea, bacteria, fungi, plants, and chordates, plotted are the number of proteins per proteome (light gray), median protein length (light yellow), number of unique folds (gray), number of unique fold combinations per proteome (yellow), percentage of multidomain proteins (out of all proteins in the proteome orange), percentage of proteome length that corresponds to repeat proteins (calculated by residue length dark gray), percentage of proteins that have the beta-superfold architecture (wine), percentage of proteins predicted as highly aggregation-prone (dark yellow), and percentage of proteome length predicted as intrinsically disordered (red). (B) Same as A, for the expansion of chaperones. Plotted are the number of core- (cyan) and cochaperone families per proteome (navy), percentage of core-chaperone genes in the proteome (blue), relative mRNA abundance of core chaperones compared with ribosomal proteins (green), and relative protein abundance of core chaperones compared with all other proteins (dark green). (C) A schematic description of the expansion of the integrated chaperone network. Core chaperones are shown in various colors and with black outlines, while cochaperones are in gray with no outline. Cochaperones of HSP60, HSP70, and HSP90 are connected to their respective core chaperone by black lines. Cooperativity between core chaperones is represented by overlaps between circles, and substrate sharing between different core chaperones is shown by red arrows. Arrow direction and width represent the direction and magnitude of substrate sharing. Note that the network is shown for the simplest free-living archaea, bacteria, fungi, and chordates.

This dramatic increase in proteome complexity, and hence the demand for chaperone action, has not been met by the emergence of new core chaperones. Eukaryotes possess the same five core-chaperone families as prokaryotes, and metazoans and chordates have in fact lost HSP100 (Fig. 4B). Further, the relative representation of core-chaperone genes does not vary between prokaryotes and eukaryotes. Rather, the need for increased chaperone action was met in two ways: first, by an increased cellular abundance of core chaperones, and second, though, by the emergence of new cochaperones—while in bacteria ∼4 cochaperone families are found, in eukaryotes their number increased to 15 or even 20 in Mammalia.

Most of the expanding proteome features are likely the outcome of adaptive evolution (e.g., the emergence of new folds and domain combinations). However, expansion may also occur by drift, that is, by fixation of genetic changes by chance, due to population bottlenecks. Indeed, the effective population size (Ne) has dropped from prokaryotes (typically >10 8 ) to unicellular eukaryotes (∼10 7 ), invertebrates and land plants (∼10 6 ), and chordates (∼10 5 ) (60). Consequently, neutral, or even mildly deleterious mutations that would be purged in prokaryotes, might readily fix in multicellular eukaryotes. For example, drift may have driven the accumulation of hydrophobic residues on protein surfaces regardless of their protein–protein interaction potential, thus leading to lower protein stability and oligomerization but also increased aggregation propensity in eukaryotic proteins (61). Similarly, insertions fixed by drift could elongate disordered segments (62) and repeat proteins. The higher chaperone levels in eukaryotes (Fig. 3 C and D) may relate to the mitigation of the deleterious effects of such accumulating mutations (63). Pathogenic bacteria often experience severe population bottlenecks, and their chaperone expression levels are comparable to those of extremophiles (Fig. 3C and Dataset S11). Overall, the impact of drift on proteome and chaperone evolution merits further investigation.

The above trends highlight two features that comprise hallmarks of the chaperone machinery: the generalist nature of core chaperones, and their ability to act in a cooperative mode alongside cochaperones as an integrated network. HSP60, HSP70, HSP90, and HSP100 are core chaperones acting as generalist unfolding–refolding machineries that work on a broad range of differently misfolded and aggregated protein substrates, largely regardless of size [except HSP60 (64)], structure, and function (11, 31). While core chaperones can exert high-affinity binding to few specific substrates at their native folded state, they generally tend to bind misfolded and aggregated polypeptides that abnormally expose hydrophobic surfaces (31, 65, 66). The main driving force for duplication and specialization is a functional tradeoff—optimization of one function comes at the expense of other functions (67). However, given a “generalist” mode of function, the quality control of increasingly large and complex proteomes could be achieved by an elevated abundance of existing core chaperones, rather than by the emergence of new core-chaperone families. Indeed, although gene copy numbers of core chaperones have indeed increased by gene duplication, their relative representation compared with proteome size remained constant (Fig. 3B) and the resulting paralogous copies have mostly relocalized to different subcellular compartments or are expressed under different stress conditions (68). In parasitic microbes and photosynthetic organisms, duplicates of HSP70 and HSP90 have subspecialized to resist host immune responses and oxidative stress (54 ⇓ ⇓ –57). However, consistent with their generalist nature, the challenge of maintaining large, complex proteomes (Fig. 4A) has primarily been met by increased abundances of preexisting core chaperones rather than by the de novo emergence of new ones.

In healthy cells, an integrated chaperone network, comprising both core and cochaperones, controls protein quality (69 ⇓ –71). In this network (Fig. 4C), the highly abundant core chaperones operate cooperatively, namely they not only share, and exchange incompletely processed misfolded or unfolded protein substrates, but also trigger the activities of one another. HSP70 plays a critical role in this network by mediating cooperative communications between the other core chaperones. For example, HSP70 triggers the disaggregase activity of HSP100, and jointly they disaggregate aggregated proteins and promote their subsequent refolding (72 ⇓ –74). In another example, HSP20 can transfer misfolded substrates to HSP70 for ATP-driven unfolding, from which they can be further transferred to HSP60 for final refolding to the native state (75). Likewise, HSP90 can promote the maturation of incompletely processed HSP70 substrates (76, 77). Cooperativity and substrate sharing between core chaperones are schematically represented in Fig. 4C. Together, these generalist, cooperative core chaperones constitute the core of an integrated chaperone network that has emerged from a simple two-component system in LUCA (Fig. 4C).

Alongside the expansion of proteome complexity, the chaperone network has also expanded—primarily by the emergence of cochaperones (Fig. 4C). This expanding array of cochaperones augmented the ability of core chaperones to efficiently share substrates and to function cooperatively. In contrast to the generalist core chaperones, cochaperones are more diverse and accordingly seem to subspecialize in specific roles, including cochaperones that handle specific proteins. Examples include UNC45, a cochaperone that emerged in Fungi, and facilitates HSP90-mediated maintenance of myosin in metazoan skeletal and cardiac muscles (78). Another Fungi-born cochaperone, the Tsc1/2 heteromer, specializes in recruiting kinase and some nonkinase substrates to HSP90 (79). Other cochaperones mediate protein transport examples include Tom70 and P23 that facilitate protein trafficking through Golgi and mitochondrial membranes (80 ⇓ –82). The specialist mode of function of cochaperones coincides with how they expanded, namely by duplication and divergence of ancient prokaryote-born cochaperones but also via bona fide innovations, namely by the emergence of completely new specialized cochaperones in eukaryotes. As shown here, the emergence of new cochaperones coincides with the emergence of new proteins (i.e., by de novo emergence rather than by duplication of preexisting proteins). However, co-occurrence does not mean coevolution—indeed, we know very little about the latter. Did certain cochaperones emerge to support the de novo emergence of a specific protein or protein class? If so, does chaperone dependency persist, hence making codependency a “selfish” irreversible trait? Alternatively, as some newly emerged proteins evolved further, their foldability improved, allowing them to become chaperone-independent.

Thus, across the Tree of Life, proteomes have massively expanded, not just by duplication of preexisting proteins but also by the emergence of completely new ones. Eukaryotic proteomes became particularly large and specifically richer in repeat, beta-rich, and aggregation-prone proteins whose folding is inherently challenging. These changes in proteome size and composition intensified the demand for chaperone action. Curiously, however, no new core chaperones emerged in response to this increased demand. Instead, they increased in abundance relative to all other proteins in the cell. Foremost, an entire network of cochaperones had evolved that facilitate the basal core-chaperone activity.

A seminal advance from J. Craig Venter and his colleagues at The Institute for Genome Research in 1995 heralded a new era in genome analysis. They reported the complete sequence of the genome of the bacterium Haemophilus influenza, all 1,830,137 bp (Fleischmann et al., Science, vol. 269, pp. 496-512, 1995). In this method, genomic DNA is randomly sheared into small fragments about 1000 bp in size, cloned into plasmids, and determining the sequence from the ends of randomly picked clones (Figure 4.10). This process is repeated many times, until each nucleotide in the genome has been sequenced multiple times on average. If the genome is 3 million base pairs, then determining 9 million base pairs of sequence from random clones give 3X coverage of the genome. This is sufficient data from which an almost-complete sequence of a bacterial genome can be assembled by linking overlapping sequences, using computational tools. Some gaps remain, and these are filled with directed sequencing. Larger genomes can be sequenced (or at least a major portion of them) by going to higher coverage, e.g. 8X to 10X. This approach requires NO prior knowledge of the genes or their positions on the bacterial chromosome. Several bacterial genomes have been sequenced this way, and Dr. Venter and colleagues have used the same approach to sequence almost all of the genomes of Drosophila melanogaster(in a collaboration between his company Celera and a publicly funded effort) and Homo sapiens(in a competition with the publicly funded effort). Variations on this theme improve effectiveness, such as cloning and sequencing both small (1 kb) and large (10 kb) inserts into plasmids, and then using the sequences from the ends of the longer inserts to help assemble the overall sequence. A similar idea uses the sequence from the ends of BAC inserts, which are about 100 kb in size, for large-scale assembly.

Figure 4.10. Shotgun sequencing and assembly.

Other major genome sequencing projects, such as those that generated the Saccharomyces cerevisiaeand E. colisequences, started with a large set of mapped clones, which were then sequenced in a directed manner. This works well, and one has a high resolution genetic and physical map for years before the genome sequence is complete. It is slower than the random approach, but it may achieve a greater extent of completeness for large, complex genomes. This is essentially the approach that the publicly funded, international collaboration, referred to as the International Human Genome Sequencing Consortium (IHGSC), followed.

The most recent phase of this project made extensive use of BAC clones, with an average insert size of about 100 kb (Figure 4.11). Libraries of BAC clones containing human DNA inserts were ordered by a high throughput mapping effort. Restriction digests of each clone in the library were analyzed, and overlapping clones determined by finding fragments in common. The BAC clones were then organized into contiguous overlapping arrays, or contigs. A minimal tiling path needed to determine the sequence of each chromosome was established, and the ends of the BAC clones on that path were sequenced to provide a dense array of markers through the chromosome. BAC clones in the contigs were then sequenced, at this point using the shotgun sequencing of the BAC insert (100 kb), not the whole genome (3.2 million kb). Sequences of BAC clones at about 3X coverage are called draft sequences, and those at higher coverage with gaps filled by directed sequencing are considered finished sequences. A combination of draft and finished sequence data are being assembled using the BAC end sequences and other information. The assembly is publicly available at the Human Genome Browser at the University of California at Santa Cruz ( and the Ensembl site at the Sanger Center (

Figure 4.11. Directed sequencing of BAC contigs.

The results of the Celera and public collaboration on the fly sequence was published in early 2000, and descriptions of the human genome sequence were published separately by Celera and IHGSC in 2001. Neither genome is completely sequenced (as of 2001), but both are highly sequenced and are stimulating a major revolution in the life sciences.

The wisdom of which approach to take is still a matter of debate, and depends to some extent on how thoroughly one needs to sequence a complex genome. For instance, a publicly accessible sequence of the mouse genome at 3X coverage was recently generated by the shotgun approach. Other genomes will likely be &ldquolightly sequenced&rdquo at a similar coverage. But a full, high quality sequence of mouse will likely use aspects of the more directed approach. Also, the Celera assembly (primarily shotgun sequence) used the public data on the human genome sequence as well. Thus current efforts use both the rapid sequencing by shotgun methods and as well as sequencing mapped clones.

Survey of sequenced genomes

The genome sequences are available for many species now, covering an impressive phylogenetic range. This includes more than 28 eubacteria, at least 6 archaea, a fungus (the yeast Saccharomyces cerevisiae), a protozoan (Plasmodium falciparum), a worm (the nematode Caenorhabditis elegans), an insect (the fruitfly Drosophila melanogaster), two plants (Arabadopsisand rice (soon)), and two mammals (human Homo sapiensand mouse Mus domesticus). Some information about these is listed in Table 4.4.

Table 4.4.Sequenced genomes. This table is derived from the listing of &ldquoComplete Genomes Mapped on the KEGG Pathways (Kyoto Encyclopedia of Genes and Genomes)&rdquo at

Additional genomes have been added, but only samples of the bacterial sequences are listed.


Sample preparation and DNA extraction

A total of 897 subjects were enrolled in the Korean gut microbiome project. Subjects who had consumed antibiotics in the 3 months before the study or those with a history of major gastrointestinal diseases were excluded. Subjects consisted of 203 males and 694 females, and their age ranged from 20 to 90. All subjects completed a questionnaire that covers comprehensive demographic and lifestyle information and a food frequency questionnaire (FFQ) [49]. Subjects underwent blood biochemical tests and anthropometrical measurements. Metadata, including clinical measurements and food frequency details, are available in the independent study by Lim et al. [50]. Participants’ faecal samples were collected in both an OMNIgene-GUT tube (DNA Genotek, Ontario, Canada) and a sterile stool collection container. Microbial DNA was extracted using the QIAamp DNA stool mini kit (QIAGEN, Hilden, Germany) with additional bead-beating and heating steps [51]. The elution buffer volumes were 200 μL, and the extracted DNA samples were stored at − 20 °C until further use.

Library preparation and sequencing of archaeal 16S rRNA gene

To amplify the archaeal 16S rRNA genes, a nested PCR was performed as described previously [28] using MG 2X PCR mastermix (MGmed, Seoul, South Korea) and Accupower™ Hotstart PCR premix (Bioneer, Daejeon, South Korea) in the first (25 cycles) and second PCRs (25 cycles), respectively. Briefly, we used the primer pairs S-D-Arch-0344-a-S-20 and S-D-Arch-0911-a-A-20 in the first PCR, and S-D-Arch-0349-a-S-17 and S-D-Arch-0519-a-A-16, which included the Illumina adapter sequence, in the second PCR. The primer sequences are listed in Additional file 2: Supplementary Table S3. Three PCR products with the same template were pooled. To purify the products obtained in the first and second rounds, x-tracta™ Gel Extractor (Promega, Madison, WI, USA) and Qiaquick PCR & gel cleanup kit (QIAGEN) were used. Among the 897 samples, 381 were confirmed using electrophoresis on 1% agarose gel. To verify the size of the PCR-enriched fragments, template size distribution was checked using the Agilent Technologies 2100 bioanalyzer and a DNA 1000 chip. Libraries for archaeal 16S rRNA gene sequences were constructed using the Nextera DNA Flex library preparation kit (Illumina, San Diego, CA, USA), following the manufacturer’s instruction. A subsequent limited-cycle amplification step was performed to add multiplexing indices and Illumina sequencing adapters. The final products were normalised and pooled using PicoGreen, and the size of the libraries was verified using the TapeStation DNA screentape D1000 (Agilent, Santa Clara, CA, USA). Sequencing was performed using the HiSeq™ X platform (Illumina). We also investigated possible DNA contamination in all reagents used for DNA extraction. PCR analysis targeting the archaeal 16S rRNA gene (25 + 25 cycle reactions) revealed no apparent contamination (Additional file 1: Supplementary Fig. S7a). PCR amplicons from ‘blank’ negative DNA extraction/PCR controls (i.e. PCR products of template acquired from a sham extraction to which no faecal sample is added n = 3) were used as the negative controls (Additional file 1: Supplementary Fig. S7b).

Analysis of 16S rRNA gene sequence data

By using the bcl2fastq2 conversion software v. 2.20.0. (Illumina), the adapter sequences were trimmed from the raw FASTQ reads, and the trimmed reads were demultiplexed according to the samples. To generate the ASVs feature table, the sorted reads were imported and processed using QIIME2 v. 2018.11 [52]. In total, 275,909,328 imported paired reads were quality filtered, denoised and merged using the DADA2 plugin [53]. Chimeric sequences and singleton ASVs were excluded in further analyses. Rarefaction curves were constructed using plugin diversity alpha-rarefaction in QIIME2. For taxonomic classification, we used plugin q2-feature-classifier, the classify-sklearn method [54] and the pre-trained SILVA v. 132 database [55] with 99% identity. An overview of the human gut archaeal 16S rRNA gene sequence dataset is provided in Additional file 2: Supplementary Table S4.

To calculate the 16S rRNA gene similarities with the secondary structure, the ASVs were aligned using the RDP aligner ( Afterwards, the aligned ASVs were used to construct a phylogenetic tree using the neighbour joining algorithm based on the Kimura 2-parameter model with 1000 bootstraps in MEGA X [56]. Phylogenetic tree visualisation was performed using iTOL v. 5 [57]. Phylogenetic analysis of the haloarchaea-assigned sequences included 16S rRNA gene sequences of the type strains belonging to the class Halobacteria and clone sequences from Oxley et al. [36], whereas that of the methanogen-assigned sequences included 16S rRNA gene sequences of the type strains belonging to the genera Methanobrevibacter and Methanosphaera and the family Methanomethylophilaceae. To determine the species diversity in each human faecal sample, alpha and beta diversity analyses were performed using the plugin q2-diversity in QIIME2 v. 2018.11. Based on rarefaction results, we subsampled the sequences at a sampling depth of 10,000 (Additional file 1: Supplementary Fig. S1) and included 342 from the 381 archaeal sequence-positive samples. Subsequently, group difference was determined based on metadata.

Archaeal enterotypes were identified as previously described [58]. Briefly, the samples were clustered using PAM clustering [59] with the Bray-Curtis dissimilarity matrix [60] at the family level. The optimal number of clusters was determined using the silhouette index [61]. Enterotypes were visualised by the PCoA plot. To find the sequences assigned to haloarchaea in other sample cohorts, we trawled both the publicly available human metagenomic and metataxonomic datasets using the EBI MGnify [62]. All the data were obtained from the EBI MGnify database.

The archaeal 16S rRNA dataset generated using HiSeq for the negative controls is summarised in Additional file 2: Supplementary Table S5. The taxonomic annotation data for the negative controls are shown in Additional file 2: Supplementary Table S6. Majority of the assigned reads in the negative controls were highly unlikely to be present in the human gut, suggesting no (or very little) impact of contamination on the archaeal 16S rRNA gene analysis.

Estimation of archaeal abundance

DNA was extracted from the faecal samples as described above. In total, 150 samples were randomly selected and subjected to real-time quantitative PCR [11] in three replicates. The archaeal primer sequences are listed in Additional file 2: Supplementary Table S3. Bacterial 16S rRNA gene (primers Bac1055YF and Bac1392R) was used as the control [63]. PCR was performed using the CFX96™ real-time PCR detection system (Bio-Rad, Hercules, CA, USA) in a reaction volume of 20 μL containing 10 μL TOPreal qPCR 2X premix (Enzynomics, Daejeon, South Korea), 300 nM of each of the forward and reverse primers and 2 ng template DNA. The Cq values were determined using the Bio-Rad CFX Manager software version 3.1. Escherichia coli K12 and Haloplanus salinus JCM 18368 T were used to construct standard curves and perform quantitative analysis. The PCR efficiency and R 2 were 96.92% and 0.9995 for bacteria, and 97.57% and 0.9983 for archaea, respectively.

Fluorescence in situ hybridisation analysis

Fluorescence in situ hybridisation (FISH) analysis was performed according to a method by Hugenholtz [64] with minor modification. Briefly, the haloarchaea-specific probe (HALO775) was designed using the ARB software [65]. The specificity of the HALO775 was evaluated based on the TestProbe in SILVA (SILVA SSU database v. 138) and the ProbeMatch in ribosomal database project (RDP, v. 11.5) databases (Additional file 2: Supplementary Tables S1 and S2). The designed probe, the bacterial 16S rRNA gene-targeting probe EUB338 [66] and non-specific probe NONEUB [67] were synthesised and labelled at the 5′ end with Cy3, Cy5 and FAM by Macrogen (Seoul, South Korea), respectively. To evaluate the possible cross-binding activity, we tested the binding activities of HALO775 to a haloarchaeon (Haloplanus salinus JCM 18368 T ), bacterium (Escherichia coli K12) and another archaeon (Methanobrevibacter smithii JCM 30028 T ). The cultured bacterium and the archaea were fixed with 4% paraformaldehyde in PBS at 4 °C for 4 h, and the fixed samples were washed in PBS and gradually dehydrated in PBS-ethanol solution (final ratio of 1:1, vol/vol). Hybridisation was performed at 46 °C with 20% formamide hybridisation buffer. Afterwards, washing was performed again at 48 °C. Both non-binding probe (NONEUBFAM) and no-probe controls were always included to decipher false positives. Samples were observed under a confocal microscope (LSM710 Carl Zeiss, Oberkochen, Germany) with × 1000 magnification using the imaging software of ZEN v. 3.1 (blue edition, Carl Zeiss).

Measurements of salinity and inorganic elements

According to the relative abundance of haloarchaea, 20 faecal samples were selected. Each specimen (200 mg) was diluted in 1 mL distilled water, and the salinity was measured using a salinity refractometer (Atago, Japan). The inorganic elements from these samples were measured using ICP-MS, as previously described [68].

Statistical analysis

Normality tests (Shapiro-Wilk) were carried out prior to correlation coefficient analysis and comparison of multiple samples. Correlation coefficient analysis was performed based on the relative abundance of the proportionally abundant archaeal taxa with respect to dietary nutrients, clinical metadata and food categories. Multiple samples were compared using the nonparametric Kruskal-Wallis test, followed by Dunn’s multiple comparisons test. PERMANOVA analysis was done based on the Bray-Curtis and Jaccard dissimilarity matrices, with 999 permutations. All statistical analyses were performed using the GraphPad Prism software v. 7.05 (*P < 0.05, **P < 0.01 and ***P < 0.001).