Who has sequenced the avocado genome, and where?



I have read in several places that the avocado genome has been sequenced in Mexico, but I am not sure which university or researchers were involved.


You can search for genome sequences at the National Center for Biotechnology Information (NCBI) site.

NCBI has an entry for Persea americana (avocado). The page lists Hainan University as the submitter of the genome.
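If you would rather query NCBI from a script than through the website, the sketch below builds a search URL for the NCBI E-utilities `esearch` endpoint. Choosing the `assembly` database and the `[Organism]` field tag is my assumption about what fits this question, and the helper name is made up for illustration. The code only constructs the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_assembly_search_url(organism: str, retmax: int = 20) -> str:
    """Build an esearch URL that looks up genome assemblies for an organism.

    The database ("assembly") and field tag ("[Organism]") are assumptions
    chosen for this example, not something stated in the thread.
    """
    params = {
        "db": "assembly",
        "term": f"{organism}[Organism]",
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_assembly_search_url("Persea americana")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return a list of matching record IDs, which can then be looked up on the NCBI site.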


How the attempt to sequence "Bigfoot's genome" went badly off track


When we first looked at the report of the bigfoot genome, it was an odd mixture of things: standard methods and reasonable looking data thrown in with unusual approaches and data that should have raised warning flags for any biologist. We just couldn't figure out the logic of why certain things were done or the reasoning behind some of the conclusions the authors reached. So, we spent some time working with the reported genome sequences themselves and talked with the woman who helped put the analysis together, Dr. Melba Ketchum. While it didn't answer all of our questions, it gave us a clearer picture of how the work came to be.

The biggest clarification made was what the team behind the results considered their scientific reasoning, which makes sense of how they ran past warning signs that they were badly off track. It provided an indication of what motivated them to push the results into a publication that they knew would cause them grief.


Avocado Genome Has Been Sequenced

Scientists led by the National Laboratory of Genomics for Biodiversity (LANGEBIO) in Mexico, Texas Tech University, and the University at Buffalo in the USA have sequenced the avocado genome. The scientists report their findings in a paper published in the Proceedings of the National Academy of Sciences (PNAS). The study sheds light on the origins of the fruit and lays the groundwork for future improvements to farming.

The study reveals that the popular Hass avocado, which comprises the bulk of all avocados grown and eaten around the world, inherited about 61 percent of its DNA from Mexican varieties and about 39 percent from Guatemalan ones. Aside from the Hass avocado, the scientists also sequenced avocados from Mexico, Guatemala, and the West Indies, which are home to genetically distinct, native cultivars of the fruit. The paper also reports that the avocado went through two ancient polyploidy events. Many of the duplicated genes were eventually deleted, but some developed new and useful functions.

The research provides key reference material for learning about the function of individual avocado genes, and for using genetic engineering to boost productivity of avocado trees, improve disease resistance and create fruit with new tastes and textures.

For more details, read the article in the University at Buffalo News Center.


Guacamole lovers, rejoice! The avocado genome has been sequenced

Scientists have sequenced the avocado genome, shedding light on the ancient origins of this buttery fruit and laying the groundwork for future improvements to farming.

With regard to modern affairs, the study reveals for the first time that the popular Hass avocado inherited about 61 percent of its DNA from Mexican varieties and about 39 percent from Guatemalan ones. (Avocados come in many types, but Hass -- first planted in the 1920s -- comprises the bulk of avocados grown around the world.)

The research also provides vital reference material for learning about the function of individual avocado genes, and for using genetic engineering to boost productivity of avocado trees, improve disease resistance and create fruit with new tastes and textures.

The study is important for agriculture. The growing global market for avocados was worth about $13 billion in 2017, with Mexico, the largest producer, exporting some $2.5 billion worth of the fruit that year, according to Statista, a provider of market and consumer data. Around the world, avocados are spread on tortillas, mashed up to flavor toast, rolled into sushi and blended into milkshakes (a popular treat in parts of Southeast Asia).

Scientists sequenced not only the Hass avocado, but also avocados from Mexico, Guatemala and the West Indies, which are each home to genetically distinct, native cultivars of the fruit.

The project was led by the National Laboratory of Genomics for Biodiversity (LANGEBIO) in Mexico, Texas Tech University, and the University at Buffalo. The research was published on Aug. 6 in the Proceedings of the National Academy of Sciences.

"Avocado is a crop of enormous importance globally, but particularly to Mexico. Although most people will have only tasted Hass or a couple of other types, there are a huge number of great avocado varieties in the species' Mexican center of diversity, but few people will have tried them unless they travel south of the U.S. border. These varieties are genetic resources for avocado's future. We needed to sequence the avocado genome to make the species accessible to modern genomic-assisted breeding efforts," says Luis Herrera-Estrella, PhD, President's Distinguished Professor of Plant Genomics at Texas Tech University, who conceived of the study and completed much of the work at LANGEBIO, where he is Emeritus Professor, prior to joining Texas Tech University.

"Our study sets the stage for understanding disease resistance for all avocados," says Victor Albert, PhD, Empire Innovation Professor of Biological Sciences in the UB College of Arts and Sciences and a Visiting Professor at Nanyang Technological University, Singapore (NTU Singapore). Albert was another leader of the study with Herrera-Estrella. "If you have an interesting tree that looks like it's good at resisting fungus, you can go in and look for genes that are particularly active in this avocado. If you can identify the genes that control resistance, and if you know where they are in the genome, you can try to change their regulation. There's major interest in developing disease-resistant rootstock on which elite cultivars are grafted."

The family history of an eccentric, big-pitted fruit

While the avocado rose to international popularity only in the 20th century, it has a storied history as a source of sustenance in Central America and South America, where it has long been a feature of local cuisines. Hundreds of years ago, for example, Aztecs mashed up avocados to make a sauce called āhuacamolli.

Before that, in prehistoric times, avocados, with their megapits, may have been eaten by megafauna like giant sloths. (It's thought that these animals could have helped to disperse avocados by pooping out the seeds in distant locations, Albert says.)

The new study peers even further back into time. It uses genomics to investigate the family history of the avocado, known to scientists as Persea americana. "We study the genomic past of avocado to design the future of this strategic crop for Mexico," Herrera-Estrella said. "The long life cycle of avocado makes breeding programs difficult, so genomic tools will make it possible to create faster and more effective breeding programs for the improvement of this increasingly popular fruit."

The avocado belongs to a relatively small group of plants called magnoliids, which diverged from other flowering plant species about 150 million years ago. The new research supports -- but does not prove -- the hypothesis that magnoliids, as a group, predate the two dominant lineages of flowering plants alive today, the eudicots and monocots. (If this is right, it would not mean that avocados themselves are older than eudicots and monocots, but that avocados belong to a hereditary line that split off from other flowering plants before the eudicots and monocots did.)

"One of the things that we did in the paper was try to solve the issue of what is the relationship of avocados to other major flowering plants? And this turned out to be a tough question," Albert says. "Because magnoliids diverged from other major flowering plant groups so rapidly and so early on, at a time when other major groups were also diverging, the whole thing is totally damn mysterious. We made contributions toward finding an answer by comparing the avocado genome to the genomes of other plant species, but we did not arrive at a firm conclusion."

Magnoliids were estimated by a 2016 research paper to encompass about 11,000 known living species on Earth, including avocados, magnolias and cinnamon. In comparison, some 285,000 known species were counted as eudicots and monocots.

The avocado as a chemist, and the heritage of the hybrid Hass

Scientists don't know how old the avocado is, and the new study doesn't address this question. But the research does explore how the avocado has changed -- genetically -- since it became its own species, branching off from other magnoliids.

The paper shows that the avocado experienced two ancient "polyploidy" events, in which the organism's entire genome got copied. Many of the duplicated genes were eventually deleted. But some went on to develop new and useful functions, and these genes are still found in the avocado today. Among them, genes involved in regulating DNA transcription, a process critical to regulating other genes, are overrepresented.

The research also finds that avocados have leveraged a second class of copied genes -- tandem duplicates -- for purposes that may include manufacturing chemicals to ward off fungal attacks. (Tandem duplicates are the product of isolated events in which an individual gene gets replicated by mistake during reproduction.)

"In the avocado, we see a common story: Two methods of gene duplication resulting in very different functional results over deep time," Albert says.

"In plants, genes retained from polyploidy events often have to do with big regulatory things. And genes kept from the more limited one-off duplication events often have to do with biosynthetic pathways where you're making these chemicals -- flavors, chemicals that attract insects, chemicals that fight off fungi. Plants are excellent chemists," Herrera-Estrella says.

Having addressed some ancient mysteries of the avocado, the new study also moves forward in time to explore a modern chapter in the story of this beloved fruit: how humans have altered the species' DNA.

Because commercial growers typically cultivate avocados by grafting branches of existing trees onto new rootstocks, today's Hass avocados are genetically the same as the first Hass avocado planted in the 1920s. These modern-day Hass avocados are grown on Hass branches grafted onto various rootstock that are well adapted for particular geographic regions.

While the Hass avocado was long thought to be a hybrid, the details of its provenance -- 61 percent Mexican, 39 percent Guatemalan -- were not previously known. The scientists' new map of the Hass avocado genome reveals huge chunks of contiguous DNA from each parental type, reflecting the cultivar's recent origin.

"Immediately after hybridization, you get these giant blocks of DNA from the parent plants," Herrera-Estrella says. "These blocks break up over many generations as you have more reproductive events that scramble the chromosomes. But we don't see this scrambling in the Hass avocado. On chromosome 4, one whole arm appears to be Guatemalan, while the other is Mexican. We see big chunks of DNA in the Hass avocado that reflect its heritage."

"We hope that the Mexican Government keeps supporting these types of ambitious projects that use state-of-the-art technology to provide a deep understanding of the genetics and genomics of native Mexican plants," Herrera-Estrella said.

In addition to LANGEBIO, UB and Texas Tech University, the avocado genome sequencing team included scientists from the Swedish University of Agricultural Sciences; Instituto de Ecología, A.C.; Universidad Nacional Autónoma de México; Nanyang Technological University; University of Ottawa; VIB-UGent Center for Plant Systems Biology; Universidad de Guanajuato; University of Florida; University of Nevada, Reno; Queensland Alliance for Agriculture and Food Innovation; Universitat de Barcelona; USDA-ARS Subtropical Horticulture Research Station; Universidad Autónoma Chapingo; Natural History Museum of Denmark; and Université Paul Sabatier.

The research was funded by SAGARPA/CONACYT, the Governors University Research Initiative of the State of Texas, the U.S. National Science Foundation, Horticulture Innovation Australia Ltd. and the Australian Bureau of Agricultural and Resource Economics and Sciences.



By Charlotte Hsu

Release Date: August 6, 2019

BUFFALO, N.Y. -- We now know the DNA of guacamole.



Materials and methods

Reference genotype tissue and DNA

All source material was obtained from grafted ramets of our reference Pinus taeda genotype 20-1010. Our haploid target megagametophyte was dissected from a wind-pollinated pine seed collected from a tree in a Virginia Department of Forestry seed orchard near Providence Forge, Virginia. Diploid tissue was obtained from needles collected from trees at the Erambert Genetic Resource Management Area near Brooklyn, Mississippi and the Harrison Experimental Forest near Saucier, Mississippi. A detailed description of the preparation and QC of DNA from these tissue samples is contained in [9].

Sequencing, assembly, and validation

A detailed description of the whole genome shotgun sequencing, assembly, and validation of the V1.0 and V1.01 loblolly pine genomes is contained in [9].

To compare the contiguity of our V1.01 whole genome shotgun assembly with contemporary conifer genome assemblies, the scaffold sequences for the white spruce [7] and Norway spruce [8] genomes were obtained from GenBank.
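The text does not say which contiguity statistic was used for this comparison. One widely used choice is the scaffold N50, sketched below as an illustration only; the scaffold lengths in the example are invented and do not come from any of the assemblies mentioned.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Invented scaffold lengths, for illustration only.
example_scaffolds = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_scaffolds))
```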

CEGMA analysis of the core gene set [18] performed on the V1.0 and V1.01 loblolly pine genomes was obtained as described in [9]. Similarly, a Norway spruce analysis was performed with results consistent with those reported in [8]. The results for the white spruce assembly were taken directly from [7].

To assemble the mitochondrial genome, a subset of the WGS sequence consisting of 255 bp paired-end MiSeq reads from four Illumina paired-end libraries (median insert sizes: 325, 441, 565, and 637) was selected for an independent organelle assembly. The 28.5 Mbp of sequence, representing less than 0.3× nuclear genomic coverage, was assembled using SOAPdenovo2 (K = 127). The resulting contigs were aligned using nucmer to a database containing the loblolly pine chloroplast, sequencing vector, 102 BACs, and 50 complete plant mitochondria. Contigs were identified and labeled as mitochondrial if they aligned exclusively to existing mitochondrial sequence and had high coverage (≥ 8×) and G + C% (≥ 44%). The contigs were then combined with additional linking libraries, the LPMP_23 mate pair library and all DiTag libraries, and assembled a second time with SOAPdenovo2. Subsequently, intra-scaffold gaps were closed using GapCloser (v1.12). The assembled sequences were iteratively scaffolded and gaps were closed until no assembly improvements could be made.
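The contig labeling rule described above can be sketched as a simple filter. The `Contig` record, its field names, and the category labels are hypothetical and introduced only for this example. Only the two thresholds (coverage of at least 8× and G + C of at least 44%) and the requirement of aligning exclusively to mitochondrial sequence come from the text.

```python
from dataclasses import dataclass

MIN_COVERAGE = 8.0     # coverage threshold stated in the text (>= 8x)
MIN_GC_PERCENT = 44.0  # G + C threshold stated in the text (>= 44%)


@dataclass(frozen=True)
class Contig:
    """Hypothetical summary of one assembled contig."""
    name: str
    coverage: float            # mean read depth
    gc_percent: float          # G + C content in percent
    hit_categories: frozenset  # kinds of database sequence it aligned to


def is_mitochondrial(contig: Contig) -> bool:
    """Apply the three criteria: mitochondrial-only hits, coverage, G + C."""
    aligns_only_to_mito = contig.hit_categories == frozenset({"mitochondrion"})
    return (
        aligns_only_to_mito
        and contig.coverage >= MIN_COVERAGE
        and contig.gc_percent >= MIN_GC_PERCENT
    )


# Invented contigs, for illustration only.
contigs = [
    Contig("c1", 12.0, 45.1, frozenset({"mitochondrion"})),
    Contig("c2", 30.0, 38.0, frozenset({"chloroplast"})),
    Contig("c3", 9.0, 46.0, frozenset({"mitochondrion", "BAC"})),
    Contig("c4", 3.0, 47.0, frozenset({"mitochondrion"})),
]
mito_names = [c.name for c in contigs if is_mitochondrial(c)]
print(mito_names)
```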

Annotation

The assembled genome was annotated with the MAKER-P pipeline [19] as described in [20]. Prior to gene prediction, the sequence was masked with similarity searches against RepBase and the Pine Interspersed Element Resource (PIER) [32]. Following the annotation, the TRIBE-MCL pipeline [94] was used to cluster the 399,358 protein sequences from 14 species into orthologous groups as described in [20].

Repetitive DNA content

Interspersed repeat detection was carried out in two stages, homology-based and de novo as described in [20]. For homology-based identification, RepeatMasker 3.3.0 [95] was run against the PIER 2.0 repeat library [32] for both the full genome and introns. REPET 2.0 [96] was implemented with the pipeline described in [32] for de novo repeat discovery. Only the 63 longest scaffolds were used in the all-vs-all alignment (approximately 1% of the genome). In addition, PIER 2.0, the spruce repeat database, and publicly available transcripts from Pinus taeda and Pinus elliottii were utilized as input for known repeat and host gene recognition.

To identify tandem repeats, Tandem Repeat Finder (v4.0.7b) [97] was run on both the genome and transcriptome as described in [20]. Filtering of multimeric repeats and of overlaps with interspersed repeats helped assess total tandem coverage and relative frequencies of specific satellites.

Data availability

Primary sequence data may be obtained from NCBI and is indexed under BioProject PRJNA174450. The whole genome shotgun sequence obtained for this assembly is available from the sequence read archive (SRA: SRP034079). The V1.0 and V1.01 genome sequences are available at [98]. Access to gene models, annotations, and Genome Browsers [99, 100] is available through the TreeGenes database [93, 101].


Methods

Data sets

The Roadmap ChIP-seq and DNase-seq epigenomic data was downloaded from http://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/pval/. Only cell types that had at least five experiments done, and assays that had been run in at least five cell types, were used. These criteria resulted in 1014 histone modification ChIP-seq tracks spanning 127 cell types and 24 assays. The assays included 23 histone modifications and DNase sensitivity. RNA-seq bigwigs containing unstranded normalized read counts across the entire genome for 47 cell types were also downloaded for the purpose of downstream analyses, rather than for inclusion in the imputation task. The full set of 24 assays imputed by ChromImpute were downloaded from http://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidatedImputed/, and the full set of 24 assays from PREDICTD were downloaded from the ENCODE portal at https://www.encodeproject.org/.
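The two inclusion criteria above (cell types with at least five experiments, assays run in at least five cell types) can be sketched as a filter over (cell type, assay) pairs. The function name and the toy data are hypothetical. The sketch applies both criteria in a single pass over the unfiltered counts, which is an assumption, since the text does not say whether the filtering was repeated until stable.

```python
from collections import defaultdict


def filter_experiments(experiments, min_count=5):
    """Keep (cell_type, assay) pairs that satisfy both inclusion criteria.

    A pair is kept if its cell type has at least `min_count` experiments
    and its assay was run in at least `min_count` cell types, with counts
    taken on the unfiltered input (single pass, an assumption).
    """
    experiments = set(experiments)
    assays_per_cell = defaultdict(set)
    cells_per_assay = defaultdict(set)
    for cell, assay in experiments:
        assays_per_cell[cell].add(assay)
        cells_per_assay[assay].add(cell)
    return {
        (cell, assay)
        for cell, assay in experiments
        if len(assays_per_cell[cell]) >= min_count
        and len(cells_per_assay[assay]) >= min_count
    }


# Toy data: a full 5 x 5 grid plus one rare cell type and one rare assay.
grid = {(f"C{i}", f"A{j}") for i in range(5) for j in range(5)}
toy = grid | {("rare_cell", "A0"), ("C0", "rare_assay")}
kept = filter_experiments(toy)
print(len(kept))
```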

The specific ChIP-seq measurements downloaded were the −log10 p-values. These measurements correspond to the statistical significance of an enrichment at each genomic position, with a low signal value meaning that there is unlikely to be a meaningful enrichment at that position. Tracks that encode statistical significance, such as the −log10 p-value of the signal compared to a control track, typically have a higher signal-to-noise ratio than tracks that encode fold enrichment. Furthermore, to reduce the effect of outliers, we use the arcsinh-transformed signal

\[ \operatorname{arcsinh}(x) = \ln\left(x + \sqrt{1 + x^{2}}\right) \]

for both training of the Avocado model and all evaluations presented here. Other models, such as PREDICTD [5] and Segway [41], also use this transformation, because it sharpens the effect of the shape of the signal while diminishing the effect of large values.
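As a small illustration of why the arcsinh transform tames outliers, the sketch below applies it to a few invented signal values using Python's standard `math.asinh`. The function name and the input values are made up for the example.

```python
import math


def arcsinh_transform(signal):
    """Apply the inverse hyperbolic sine to each signal value."""
    return [math.asinh(x) for x in signal]


# Invented signal values spanning several orders of magnitude.
values = [0.0, 1.0, 10.0, 1000.0]
transformed = arcsinh_transform(values)
for raw, out in zip(values, transformed):
    print(f"{raw:8.1f} -> {out:.3f}")
```

Near zero the transform is close to the identity, while large values grow only logarithmically, so a single extreme bin contributes far less to a squared-error loss than it would on the raw scale.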

Gene bodies were defined as GENCODE v19 gene elements (https://www.gencodegenes.org/releases/19.html) from chromosomes 1 through 22 that had one of their transcripts annotated as the primary transcript for that gene. This resulted in 16,724 gene bodies.

Promoter regions were defined at the transcription start site for each of the GENCODE v19 gene elements that gene bodies were identified for, accounting for the strand of the gene. For the purpose of the MSEProm metric and for the gene expression prediction task, the span of the promoter was defined as 2 kbp upstream from the transcription start site. For the purpose of the visualization of promoters and enhancers, promoters were defined as ± 250 bp from the transcription start site.
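A strand-aware promoter window of the kind described above might be computed as follows. The coordinate convention (0-based, half-open intervals, with the transcription start site base itself excluded) and the function name are assumptions for this sketch. The text only specifies the 2 kbp upstream span and that strand is taken into account.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Return the half-open interval [start, end) upstream of a TSS.

    `tss` is a 0-based position. On the "+" strand, upstream means
    smaller coordinates; on the "-" strand, it means larger coordinates.
    """
    if strand == "+":
        return max(0, tss - upstream), tss
    if strand == "-":
        return tss + 1, tss + 1 + upstream
    raise ValueError(f"unknown strand: {strand!r}")


print(promoter_interval(10_000, "+"))
print(promoter_interval(10_000, "-"))
```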

Enhancer elements were defined using two sets of enhancers defined by the FANTOM5 consortium. For the purpose of the MSEEnh metric, the set of “permissive” enhancers was used, in order to get a wider view of potential enhancer activity. For the purpose of visualization of promoters and enhancers, enhancers were defined using ± 250 bp from the middle of each enhancer in the “robust” enhancer set. Both enhancer sets are available at http://slidebase.binf.ku.dk/human_enhancers/presets.

Promoter-enhancer interactions were obtained from the public GitHub repository for [8], available at https://github.com/shwhalen/targetfinder/tree/master/paper/targetfinder/combined/output-epw. This data set includes promoter-enhancer interactions as defined by ChIA-PET interactions for four cell lines—GM12878, HeLa-S3, IMR90, and K562. To correct a recently identified bias in this particular benchmark [30], the data set was further processed as described in Additional file 6.

Replication timing data was downloaded from http://www.replicationdomain.org. The resulting tracks encode early- and late-stage timing as continuous values, which are subsequently binarized using a threshold of 0.

FIRE scores were obtained from the supplementary material of [9] for the seven cell lines TRO, H1, NPC, GM12878, MES, IMR90, and MSC. These measurements are composed of binary indicators at 40-kbp resolution, resulting in 72,036 loci for each cell type.

Network topology

Avocado is a deep tensor factorization model, i.e., a tensor factorization model that uses a neural network instead of a scalar product to combine factors into a prediction. The tensor factorization component comprises five matrices of latent factors, also known as embedding matrices, that encode the cell type, assay, 25-bp genome, 250-bp genome, and 5-kbp genome factors. These matrices represent each element as a set of latent factors, with 32 factors per cell type, 256 factors per assay, 25 factors per 25-bp genomic position, 40 factors per 250-bp genomic position, and 45 factors per 5-kbp genomic position. For a specific prediction, the factors corresponding to the respective cell type, assay, and genomic position are concatenated together and fed into a simple feed-forward neural network. This network has two intermediate dense hidden layers that each have 2048 neurons before the regression output, for a total of three weight matrices to be learned. The network uses the ReLU activation function, ReLU(x) = max(0, x), on the hidden layers and no activation function on the prediction. The training process jointly optimizes the latent factors in the tensor factorization model and the neural network, rather than alternating between optimizing each.
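To make the shapes concrete, the following sketch works through the sizes implied by the description and runs a tiny forward pass in plain Python. It is not the authors' Keras implementation. All function names are hypothetical, bias terms are an assumption, and mapping a 25-bp bin to its coarser bins by integer division (by 10 and by 200) is an assumption based on the stated resolutions.

```python
# Latent factor counts per axis, as stated in the text.
FACTORS = {"cell_type": 32, "assay": 256, "bp25": 25, "bp250": 40, "kbp5": 45}


def genome_indices(pos_25bp):
    """Map a 25-bp bin index to (25-bp, 250-bp, 5-kbp) bin indices.

    Assumes coarser bins are aligned to the same origin, so ten 25-bp
    bins share one 250-bp bin and two hundred share one 5-kbp bin.
    """
    return pos_25bp, pos_25bp // 10, pos_25bp // 200


def input_width(factors=FACTORS):
    """Width of the concatenated factor vector fed to the network."""
    return sum(factors.values())


def weight_count(n_in, hidden=2048):
    """Entries in the three weight matrices (biases not counted)."""
    return n_in * hidden + hidden * hidden + hidden * 1


def relu(vec):
    return [max(0.0, x) for x in vec]


def dense(vec, weights, bias):
    """One dense layer: `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def predict(cell_vec, assay_vec, g25, g250, g5k, layers):
    """Concatenate the factors, apply ReLU hidden layers, then a linear output."""
    x = cell_vec + assay_vec + g25 + g250 + g5k
    *hidden_layers, (w_out, b_out) = layers
    for weights, bias in hidden_layers:
        x = relu(dense(x, weights, bias))
    return dense(x, w_out, b_out)[0]


print(input_width())                # width of the concatenated input
print(weight_count(input_width()))  # entries across the three weight matrices
print(genome_indices(1234))
```

With the factor counts from the text, the concatenated input has 398 values, and the three weight matrices hold about five million entries, most of them in the 2048 × 2048 matrix between the two hidden layers.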

The model was implemented using Keras [46] with the Theano backend [47], and experiments were run using Tesla K40c and GTX 1080 GPUs. For further background on neural network models, we recommend the comprehensive review by J. Schmidhuber [48].

Inputs and outputs

Avocado takes as input the indices corresponding to a genomic position, assay, and cell type, and outputs an imputed data value. The indices for each dimension are a set of sequential values that uniquely represent each of the possibilities for that dimension, e.g., a specific cell type, assay, or genomic position. Any data value in the Roadmap compendium can thus be uniquely represented by a triplet of indices, specifying the cell type, assay, and genomic position.

Training

Avocado is trained using standard neural network optimization techniques. The model was fit using the ADAM optimizer due to its widespread adoption and success across several fields [49]. Avocado’s loss function is the global mean squared error (MSE). Most training hyperparameters are set to their default values in the Keras toolkit. For the ADAM optimizer, this corresponds to an initial learning rate of 0.01, beta1 of 0.9, beta2 of 0.999, epsilon of $10^{-8}$, and a decay factor of $1 - 10^{-8}$. The embedding matrices are initialized with random uniform weights in the range [−0.5, 0.5]. Dense layers are initialized using the “glorot uniform” setting [50]. Using these settings, our experiments show that performance, as measured by MSE, was similar across different model initializations.
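For readers unfamiliar with ADAM, the sketch below shows the standard update rule for a single scalar parameter together with the MSE loss, using the hyperparameter values listed above. It is a textbook formulation, not the Keras source; the learning-rate decay is omitted, and the function and variable names are illustrative.

```python
import math


def mse(predictions, targets):
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def adam_step(param, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one ADAM update to a scalar parameter and return the new value.

    state holds the step count "t" and the running moment estimates "m" and "v".
    Learning-rate decay is left out of this sketch.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy problem: fit w so that w * x matches y, minimizing the MSE.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w, state = 0.0, {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(2000):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w = adam_step(w, grad, state)
print(w, mse([w * x for x in xs], ys))
```

Because the moment estimates are bias-corrected, the very first update has magnitude close to the learning rate regardless of the gradient's scale.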

Avocado does not fit a single model to the full genome because the genome latent factors could not fit in memory. Instead, training is performed in two steps. First, the model is trained on the selected training tracks but with the genomic positions restricted to those in the ENCODE Pilot Regions [51]. Second, the weights of the cell type factors, assay factors, and neural network parameters are frozen, and the genome factors are trained for each chromosome individually. This training strategy allows the model to fit in memory while also ensuring consistent parameters for the non-genomic aspects of the model across chromosomes, and for the latent factors learned on the genomic axis to be comparable across cell types. Both stages involve the same set of training experiments. During cross-validation, this procedure is repeated separately for each fold. We did not find that this procedure was sensitive to using other equally sized regions for the initial training step (Additional file 7).
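The freeze-then-refit idea can be illustrated with a deliberately tiny stand-in model. In the sketch below the prediction is just the product of one scalar cell-type factor and one scalar genome factor, fitted by plain SGD; this replaces the real embeddings, neural network, and ADAM optimizer and is an assumption made purely for illustration, as are all names and the toy data.

```python
import random


def sgd_fit(data, cell_factors, genome_factors, *, update_cells, epochs=200, lr=0.05):
    """Fit prediction = cell_factors[c] * genome_factors[p] by SGD on squared error.

    data: list of (cell_index, position, value). When update_cells is False the
    cell-type factors are treated as frozen and only genome factors change.
    """
    for _ in range(epochs):
        for cell, position, value in data:
            c, g = cell_factors[cell], genome_factors[position]
            error = c * g - value
            genome_factors[position] = g - lr * error * c
            if update_cells:
                cell_factors[cell] = c - lr * error * g


def two_stage_fit(pilot_data, chromosome_data, n_cells, rng):
    """Stage 1: fit everything on pilot regions. Stage 2: per-chromosome genome factors."""
    cell_factors = [rng.uniform(-0.5, 0.5) for _ in range(n_cells)]
    pilot_genome = {pos: rng.uniform(-0.5, 0.5) for _, pos, _ in pilot_data}
    sgd_fit(pilot_data, cell_factors, pilot_genome, update_cells=True)

    genome_by_chromosome = {}
    for chromosome, data in chromosome_data.items():
        # Fresh genome factors per chromosome; shared factors stay frozen.
        genome = {pos: rng.uniform(-0.5, 0.5) for _, pos, _ in data}
        sgd_fit(data, cell_factors, genome, update_cells=False)
        genome_by_chromosome[chromosome] = genome
    return cell_factors, genome_by_chromosome


# Toy data, for illustration only.
pilot = [(0, 0, 1.0), (1, 0, 2.0), (0, 1, -1.0), (1, 1, -2.0)]
chromosomes = {"chr20": [(0, 0, 0.5), (1, 0, 1.0)]}
cells, genome = two_stage_fit(pilot, chromosomes, n_cells=2, rng=random.Random(0))
print(cells, genome)
```

Only one chromosome's genome factors need to be held in memory at a time in the second stage, while every chromosome is scored with the same shared factors.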

The two steps of training have the same initial hyperparameters for the ADAM optimizer but are run for different numbers of epochs. Each epoch corresponds to a single pass through the genomic axis such that each 25-bp position is seen exactly once, with cell type and assays chosen randomly for each position. This definition of “epoch” ensures that the entire genome is seen the same number of times during training. Training is carried out for 800 epochs on the ENCODE Pilot regions and 200 epochs on each chromosome. No early stopping criterion is set, because models converge in terms of validation set performance for all chromosomes in fewer than 200 epochs but do not show evidence of over-fitting if given extra time to train.
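This definition of an epoch can be sketched as a sampler that visits every 25-bp position once and draws a training experiment at random for each. Visiting positions in shuffled order and drawing uniformly from the list of available experiments are assumptions; the experiment labels and the function name are hypothetical.

```python
import random


def epoch_samples(n_positions, experiments, rng):
    """Yield one (cell_type, assay, position) triplet per 25-bp position.

    experiments: list of (cell_type, assay) pairs available for training.
    Positions are visited in shuffled order (an assumption; the text only
    requires that each position is seen exactly once per epoch).
    """
    order = list(range(n_positions))
    rng.shuffle(order)
    for position in order:
        cell_type, assay = rng.choice(experiments)
        yield cell_type, assay, position


# Hypothetical training experiments, for illustration only.
experiments = [("E003", "H3K4me3"), ("E003", "DNase"), ("E017", "H3K27ac")]
samples = list(epoch_samples(1000, experiments, random.Random(0)))
print(samples[:3])
```

With this scheme the number of training examples per epoch equals the number of genomic positions, independent of how many experiments are available.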

Evaluation of variable genomic loci

For each assay, we evaluated the performance of Avocado, PREDICTD, and ChromImpute at genomic positions segregated by the number of cell types in which that genomic locus was called a peak by MACS2. We first calculated the number of cell types in which each genomic locus was called a peak by summing MACS2 narrow peak calls across chromosome 20, and discarded those positions that were never a peak. This resulted in a vector where each genomic locus was represented by the number of cell types in which it was a peak, ranging between 1 and the number of cell types in which that assay was performed. For each value in that range, we calculated the MSE, the recall, and the precision for each technique. Because precision and recall require binarized inputs, the predictions for each approach were binarized using a threshold of 2 on the −log10 p-value, corresponding to the same threshold that Ernst and Kellis used to binarize signals as input for ChromHMM.
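The stratified evaluation can be sketched as follows for a single assay. Two details are assumptions rather than statements from the text: the MSE is taken between imputed and observed signal, and the MACS2 peak call in each cell type serves as the true label for precision and recall. The data layout and function names are illustrative.

```python
from collections import defaultdict


def peak_counts(peak_calls):
    """peak_calls[c][i] is 1 if locus i is a peak in cell type c, else 0."""
    return [sum(column) for column in zip(*peak_calls)]


def stratified_metrics(peak_calls, observed, imputed, threshold=2.0):
    """Group loci by the number of cell types with a peak, then score each group.

    observed and imputed hold -log10 p-value signal with the same
    [cell type][locus] layout as peak_calls.
    """
    counts = peak_counts(peak_calls)
    groups = defaultdict(list)
    for c, calls in enumerate(peak_calls):
        for i, is_peak in enumerate(calls):
            if counts[i] > 0:  # discard loci that are never a peak
                groups[counts[i]].append((is_peak, observed[c][i], imputed[c][i]))

    results = {}
    for k, rows in sorted(groups.items()):
        # Binarize predictions at the threshold before computing precision and recall.
        true_pos = sum(1 for peak, _, pred in rows if peak and pred >= threshold)
        n_called = sum(1 for _, _, pred in rows if pred >= threshold)
        n_peaks = sum(1 for peak, _, _ in rows if peak)
        results[k] = {
            "mse": sum((pred - obs) ** 2 for _, obs, pred in rows) / len(rows),
            "precision": true_pos / n_called if n_called else float("nan"),
            "recall": true_pos / n_peaks,
        }
    return results


# Toy example: 2 cell types, 3 loci.
calls = [[1, 0, 1], [1, 0, 0]]
obs = [[3.0, 0.1, 2.5], [4.0, 0.2, 0.3]]
imp = [[2.0, 0.0, 1.5], [4.0, 0.0, 2.5]]
print(stratified_metrics(calls, obs, imp))
```

Loci with a count of 1 are cell-type-specific peaks, which is where an imputation method that simply predicts the average across cell types would be expected to do worst.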

Supervised machine learning model training

We performed three tasks that involved training a gradient boosted decision tree model to predict some genomic phenomenon across cell types. In each task, we used a 20-fold cross-validation procedure, where the data from a single cell type is split into 20 folds, of which 19 are used for training and 1 is used for model evaluation. This procedure was performed for each cell type, feature set, and task. These models were trained using XGBoost [52] with a maximum of 5000 estimators, a maximum depth of 6, and an early stopping criterion that stopped training if performance on a held-out validation set, one of the 19 folds used for training, did not improve after 20 epochs. No other regularization was used, and the remaining hyperparameters were kept at their default values.
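The fold bookkeeping for this procedure can be sketched without any modeling library. The sketch assumes that the validation fold used for early stopping is simply the fold following the test fold; the text only says it is one of the 19 training folds. The settings dictionary restates the values from the text under descriptive, hypothetical key names rather than actual XGBoost parameter names.

```python
import random

# Values from the text; the key names are descriptive, not library parameters.
SETTINGS = {"max_estimators": 5000, "max_depth": 6, "early_stopping_patience": 20}


def cv_splits(n_samples, n_folds=20, seed=0):
    """Yield (train, validation, test) index lists for each of the folds.

    The validation fold is carved out of the training folds for early stopping;
    picking the fold after the test fold is an assumption.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        j = (i + 1) % n_folds
        train = [x for k, fold in enumerate(folds) if k not in (i, j) for x in fold]
        yield train, folds[j], folds[i]


splits = list(cv_splits(100))
train, validation, test = splits[0]
print(len(train), len(validation), len(test))
```

Each sample appears in exactly one test fold across the 20 iterations, so every sample receives one out-of-fold prediction.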

For the task of predicting promoter-enhancer interactions, we used logistic regression as an additional safeguard against the bias issue described in Additional file 6. Rather than performing 20-fold cross-validation, we performed 5-fold cross-validation 20 times, shuffling the data set after each cross-validation. We adopted this approach due to the small number of positive samples in each cell type, such that there would be fewer than 10 positive samples in each fold of a 20-fold cross-validation. Additionally, we tuned the regularization strength in the default manner for scikit-learn, which considers 10 regularization strengths evenly spaced logarithmically between $10^{-4}$ and $10^{4}$ and chooses the strength that performs best on an internal 3-fold cross-validation on the training set.
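The regularization grid and the repeated cross-validation scheme can be written out directly, as in the sketch below. It reshuffles before each repeat rather than after, which yields the same kind of splits. The function names are illustrative and no scikit-learn calls are used.

```python
import math
import random


def regularization_grid(n=10, low_exp=-4, high_exp=4):
    """n strengths evenly spaced on a log10 scale between 10**low_exp and 10**high_exp."""
    step = (high_exp - low_exp) / (n - 1)
    return [10 ** (low_exp + i * step) for i in range(n)]


def repeated_kfold(n_samples, n_folds=5, n_repeats=20, seed=0):
    """Yield (repeat, train, test) for n_repeats reshuffled rounds of k-fold CV."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for repeat in range(n_repeats):
        rng.shuffle(indices)
        for fold in range(n_folds):
            test = indices[fold::n_folds]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield repeat, train, test


grid = regularization_grid()
rounds = list(repeated_kfold(50))
print([f"{c:.3g}" for c in grid])
print(len(rounds))
```

Five folds repeated 20 times gives 100 train/test evaluations, the same number of held-out evaluations per sample as 20 rounds would imply, while keeping each test fold four times larger than in a 20-fold split.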

We evaluate each model in each task according to the average precision (AP) on the test set, which summarizes a precision-recall curve in a single score. The score is calculated as

$$\mathrm{AP} = \sum_{n} \left( \mathrm{Recall}_n - \mathrm{Recall}_{n-1} \right) \mathrm{Precision}_n$$

where $\mathrm{Recall}_n$ and $\mathrm{Precision}_n$ are the recall and the precision at the nth calculated threshold, with one threshold for each data point.
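Average precision is commonly computed as the sum over thresholds of (Recall_n − Recall_{n−1}) × Precision_n, with thresholds taken at each data point in order of decreasing score. A minimal sketch under that definition follows; it assumes at least one positive label and no tied scores, and the function name is illustrative.

```python
def average_precision(labels, scores):
    """AP = sum_n (Recall_n - Recall_{n-1}) * Precision_n, one threshold per data point.

    labels are 0/1; assumes at least one positive label and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(labels)
    true_positives = 0
    previous_recall = 0.0
    ap = 0.0
    for n, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / n
        recall = true_positives / n_positive
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


print(average_precision([1, 0, 1, 1], [0.9, 0.8, 0.7, 0.6]))
```

Only thresholds that admit a new positive change the recall, so negatives contribute to the score solely by lowering the precision at later positives.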


Diagnostic Genomics and Clinical Bioinformatics

A. Haworth, N. Lench, in Medical and Health Genomics, 2016

Whole Genome Sequencing

Whole genome sequencing (WGS) offers the ability to interrogate the entire DNA sequence of the genome without the need to use selective capture techniques to isolate specific regions of DNA. Conceptually, this approach is very appealing and will enable the identification of additional classes of mutation that are refractory to detection by exome sequencing. These include the identification of large structural rearrangements, balanced translocations, uniparental isodisomy, and mosaicism. WGS also offers the opportunity to interrogate noncoding regions of DNA and identify functionally important sequence variants that influence gene expression. Removing the need to capture sequences removes selection bias so that coverage across sequences is more uniform. The main obstacles to the uptake of WGS include cost and dealing with the enormous amount of data produced. Moving, analyzing, interpreting, and storing large amounts of genetic data have significant resource and cost implications, many of which are currently beyond the majority of routine diagnostic laboratories. As the cost of sequencing continues to decrease and experience is gained in data analysis and interpretation, we can anticipate that WGS will be the method of choice for the clinical diagnosis of rare genetic disorders.


Applications of Genome Sequencing

Here are some of the applications of genome sequencing.

1. Diagnostics and Medicine

DNA sequencing has elaborate applications in screening the risk of genetic diseases, gene therapy-based treatments, genetic engineering, and gene manipulation.

2. Evolutionary biology

The ability to sequence the whole genome of many related organisms has allowed large-scale comparative genomics, phylogenetic and evolutionary studies.

3. Forensic Science

DNA sequencing has widespread applications in DNA profiling, forensic sampling and identification, and paternity testing.

4. Metagenomics

Shotgun sequencing of complex communities of microorganisms, metagenome sequencing of environmental or human microbiomes, and environmental profiling.

5. Agriculture

Sequencing of microorganisms to engineer resistant genes in crops. Mapping and whole-genome sequencing of food plants to increase productivity and nutritional contents as well as environmental tolerance.

6. Molecular Biology

Study of genotypes, genes, and proteins; gene-based studies of cancers; construction of endonuclease maps; detection of mutations; construction of molecular evolution maps; and transcriptome profiling.


Scientists Sequence Genome of Deep-Sea Snailfish

A team of researchers in China has sequenced and assembled a high-quality reference genome for the Yap hadal snailfish, which was captured at a depth of 7,000 m in the Yap Trench in the western Pacific Ocean.

The Yap hadal snailfish. Image credit: Mu et al., doi: 10.1371/journal.pgen.100953.

Hadal environments (depths below 6,000 m) are characterized by extremely high hydrostatic pressures, low temperatures, a scarce food supply, and little light.

Fish are the only vertebrates inhabiting the hadal zone, and hadal snailfishes have been found in at least five geographically separated marine trenches.

However, the genetic mechanisms that allow vertebrates to live in such extreme conditions are not well understood.

To understand how hadal snailfishes have adapted to life in the deep sea, Dr. Xinhua Chen of the Fujian Agriculture and Forestry University and colleagues sequenced the whole genome of a snailfish from the Yap Trench.

The analysis of the genome revealed multiple adaptations for living in a cold, dark, high-pressure environment.

The scientists also found that the Yap hadal snailfish carries extra genes for DNA repair, which may help keep its genome intact under high pressures.

The fish also has five copies of a gene for an enzyme that takes a compound produced by bacteria in its gut and transforms it into one that stabilizes the structure of proteins under high hydrostatic pressure.

It has also lost certain genes involved in vision, taste and smell, which are likely unnecessary in its dark, food-limited environment.

“Many genes associated with DNA repair show evidence of positive selection and have expanded copy numbers in the genome of the Yap hadal snailfish, which potentially reflect the difficulty of maintaining DNA integrity under high hydrostatic pressure,” Dr. Chen said.

“The five copies of the trimethylamine N-oxide (TMAO)-generating enzyme flavin-containing monooxygenase-3 gene (fmo3) and the abundance of trimethylamine (TMA)-generating bacteria in the gut of the Yap hadal snailfish could provide enough TMAO to improve protein stability under hadal conditions.”

“Our results provide new insights into the molecular mechanisms underlying the adaptation of hadal organisms to the deep-sea environment and valuable genomic resources that will help further clarify hadal adaptations,” the authors concluded.

The findings were published in the journal PLoS Genetics.

Y. Mu et al. 2021. Whole genome sequencing of a snailfish from the Yap Trench (

