What is the close and related genome used for in Gene models?

I am a little bit confused about using the related genome or reference genome.

When we have a reference genome, we can do alignment. Also we can do the assembly.

Can you give some more reasons why a related genome can help to improve gene models?

And if we do use a related genome, what kinds of problems or caveats may occur?

Why a related genome helps:

1) You can align the reads to the reference first and assemble afterwards.
2) The gene space is already defined (the genes and their coordinates are known), so if your assembly is fragmented or missing part of a gene, the gap can be filled in from the reference genome.

Limitations: Rather than assembling your own genome, you are forcing the reference genome to become part of your assembly. Any real differences between your genome and the reference can be washed out when you do a reference-guided assembly.
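
To make the trade-off concrete, here is a minimal Python sketch of a reference-guided consensus. It is a toy illustration, not how any real assembler works: the function name, the `(start, sequence)` read format and the sequences are all made up for the example. Positions with read coverage take the majority base from the reads, and positions without coverage are copied from the reference.

```python
from collections import Counter


def reference_guided_consensus(reference, aligned_reads, min_depth=1):
    """Build a consensus sequence from reads already aligned to a reference.

    aligned_reads is a list of (start, sequence) tuples (hypothetical format,
    gapless alignments only). Positions covered by fewer than min_depth reads
    are copied from the reference and reported in the second return value.
    """
    piles = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1

    consensus = []
    borrowed = []  # positions filled in from the reference
    for pos, pile in enumerate(piles):
        if sum(pile.values()) >= min_depth:
            consensus.append(pile.most_common(1)[0][0])
        else:
            consensus.append(reference[pos])
            borrowed.append(pos)
    return "".join(consensus), borrowed


# Toy data: the first read carries a real difference (C -> T at position 5),
# and positions 9-11 have no read coverage at all.
reference = "ATGGCCATTGTAATGGGCCGC"
reads = [(0, "ATGGCTATT"), (12, "ATGGGCCGC")]
consensus, borrowed = reference_guided_consensus(reference, reads)
print(consensus)  # ATGGCTATTGTAATGGGCCGC
print(borrowed)   # [9, 10, 11]
```

The borrowed positions show both sides of the answer above: the gap is filled, but nothing in the output sequence itself says that those bases came from the related genome and not from your own data.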

Transgenic Animals

Applications in Agriculture and the Pharmaceutical Industry

Transgenic animal models of human disease can be useful for preclinical drug testing. Animals engineered to be susceptible to human viruses, by introduction of viral receptors or other host range determinants, can also be used for testing human vaccines.

Transgenic animals can serve as ‘factories’ that, in some cases, may produce large amounts of proteins more efficiently than alternative expression systems such as bacteria, yeast, or mammalian cell cultures. Transgenic mice have been engineered to express human antibodies (which are superior to murine antibodies for use as drugs) by introducing large segments of human DNA encoding human immunoglobulin genes, and breeding these transgenic animals with strains in which the endogenous immunoglobulin loci are mutated. In transgenic large animals such as cows or sheep, proteins of pharmaceutical value can be produced in large quantity in milk (and later purified) by introducing the appropriate gene under the control of regulatory elements that direct expression in the mammary glands.

Transgenesis can in principle be used to alter many phenotypic properties that may increase the value of agriculturally important animals. These include growth rate, fat composition, milk production, and hair texture. It may also be possible to modify domestic animals such as pigs to make them more suitable as organ donors for human transplant patients.

A comprehensive map of the SARS-CoV-2 genome

In early 2020, a few months after the Covid-19 pandemic began, scientists were able to sequence the full genome of SARS-CoV-2, the virus that causes the Covid-19 infection. While many of its genes were already known at that point, the full complement of protein-coding genes was unresolved.

Now, after performing an extensive comparative genomics study, MIT researchers have generated what they describe as the most accurate and complete gene annotation of the SARS-CoV-2 genome. In their study, which appears today in Nature Communications, they confirmed several protein-coding genes and found that a few others that had been suggested as genes do not code for any proteins.

“We were able to use this powerful comparative genomics approach for evolutionary signatures to discover the true functional protein-coding content of this enormously important genome,” says Manolis Kellis, who is the senior author of the study and a professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) as well as a member of the Broad Institute of MIT and Harvard.

The research team also analyzed nearly 2,000 mutations that have arisen in different SARS-CoV-2 isolates since it began infecting humans, allowing them to rate how important those mutations may be in changing the virus’ ability to evade the immune system or become more infectious.

Comparative genomics

The SARS-CoV-2 genome consists of nearly 30,000 RNA bases. Scientists have identified several regions known to encode protein-coding genes, based on their similarity to protein-coding genes found in related viruses. A few other regions were suspected to encode proteins, but they had not been definitively classified as protein-coding genes.

To nail down which parts of the SARS-CoV-2 genome actually contain genes, the researchers performed a type of study known as comparative genomics, in which they compare the genomes of similar viruses. The SARS-CoV-2 virus belongs to a subgenus of viruses called Sarbecovirus, most of which infect bats. The researchers performed their analysis on SARS-CoV-2, SARS-CoV (which caused the 2003 SARS outbreak), and 42 strains of bat sarbecoviruses.

Kellis has previously developed computational techniques for doing this type of analysis, which his team has also used to compare the human genome with genomes of other mammals. The techniques are based on analyzing whether certain DNA or RNA bases are conserved between species, and comparing their patterns of evolution over time.
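
As a rough illustration of what "conserved between species" means computationally, the sketch below scores each column of a toy multiple alignment by the fraction of sequences that share the most common base. This is a deliberately simplified stand-in and not the method used in the study; the function name and the sequences are invented for the example.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, for each alignment column, the fraction of sequences that
    carry the most common base in that column (gaps count as disagreement)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*alignment):
        bases = [base for base in column if base != "-"]
        if not bases:
            scores.append(0.0)
            continue
        top_count = Counter(bases).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Four made-up aligned sequences, one per "species".
toy_alignment = ["ATGGCA", "ATGGTA", "ATGACA", "ATGGGA"]
scores = column_conservation(toy_alignment)
print(scores)  # [1.0, 1.0, 1.0, 0.75, 0.5, 1.0]
```

Columns that score 1.0 are identical in every sequence. Real comparative methods go further and also look at the pattern of changes over evolutionary time, as the paragraph above describes.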

Using these techniques, the researchers confirmed six protein-coding genes in the SARS-CoV-2 genome in addition to the five that are well established in all coronaviruses. They also determined that the region that encodes a gene called ORF3a also encodes an additional gene, which they name ORF3c. The gene has RNA bases that overlap with ORF3a but occur in a different reading frame. This gene-within-a-gene is rare in large genomes, but common in many viruses, whose genomes are under selective pressure to stay compact. The role for this new gene, as well as several other SARS-CoV-2 genes, is not known yet.
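
The idea of a gene within a gene in a different reading frame can be illustrated with a short Python sketch. The sequence below is a made-up toy, not taken from ORF3a or ORF3c, and `translate` is a hypothetical helper that uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2).

    RNA input is accepted (U is treated as T); '*' marks a stop codon.
    Trailing bases that do not fill a whole codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


toy_sequence = "ATGGCATGCCGATAA"  # invented for illustration
for frame in range(3):
    print(frame, translate(toy_sequence, frame))
# 0 MACR*
# 1 WHAD
# 2 GMPI
```

The same bases give three unrelated amino acid sequences depending on where reading starts, which is how one stretch of a compact viral genome can encode two different proteins.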

The researchers also showed that five other regions that had been proposed as possible genes do not encode functional proteins, and they also ruled out the possibility that there are any more conserved protein-coding genes yet to be discovered.

“We analyzed the entire genome and are very confident that there are no other conserved protein-coding genes,” says Irwin Jungreis, lead author of the study and a CSAIL research scientist. “Experimental studies are needed to figure out the functions of the uncharacterized genes, and by determining which ones are real, we allow other researchers to focus their attention on those genes rather than spend their time on something that doesn’t even get translated into protein.”

The researchers also recognized that many previous papers used not only incorrect gene sets, but sometimes also conflicting gene names. To remedy the situation, they brought together the SARS-CoV-2 community and presented a set of recommendations for naming SARS-CoV-2 genes, in a separate paper published a few weeks ago in Virology.

Fast evolution

In the new study, the researchers also analyzed more than 1,800 mutations that have arisen in SARS-CoV-2 since it was first identified. For each gene, they compared how rapidly that particular gene has evolved in the past with how much it has evolved since the current pandemic began.

They found that in most cases, genes that evolved rapidly for long periods of time before the current pandemic have continued to do so, and those that tended to evolve slowly have maintained that trend. However, the researchers also identified exceptions to these patterns, which may shed light on how the virus has evolved as it has adapted to its new human host, Kellis says.

In one example, the researchers identified a region of the nucleocapsid protein, which surrounds the viral genetic material, that had many more mutations than expected from its historical evolution patterns. This protein region is also classified as a target of human B cells. Therefore, mutations in that region may help the virus evade the human immune system, Kellis says.

“The most accelerated region in the entire genome of SARS-CoV-2 is sitting smack in the middle of this nucleocapsid protein,” he says. “We speculate that those variants that don't mutate that region get recognized by the human immune system and eliminated, whereas those variants that randomly accumulate mutations in that region are in fact better able to evade the human immune system and remain in circulation.”

The researchers also analyzed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. Many of the mutations that make those variants more dangerous are found in the spike protein, and help the virus spread faster and avoid the immune system. However, each of those variants carries other mutations as well.

“Each of those variants has more than 20 other mutations, and it’s important to know which of those are likely to be doing something and which aren’t,” Jungreis says. “So, we used our comparative genomics evidence to get a first-pass guess at which of these are likely to be important based on which ones were in conserved positions."

This data could help other scientists focus their attention on the mutations that appear most likely to have significant effects on the virus’ infectivity, the researchers say. They have made the annotated gene set and their mutation classifications available in the University of California at Santa Cruz Genome Browser for other researchers who wish to use it.

“We can now go and actually study the evolutionary context of these variants and understand how the current pandemic fits in that larger history,” Kellis says. “For strains that have many mutations, we can see which of these mutations are likely to be host-specific adaptations, and which mutations are perhaps nothing to write home about.”

The research was funded by the National Human Genome Research Institute and the National Institutes of Health. Rachel Sealfon, a research scientist at the Flatiron Institute Center for Computational Biology, is also an author of the paper.

Zebrafish Genome Found Strikingly Similar to Humans

According to a paper published in Nature, 70 per cent of protein-coding human genes are related to genes found in the zebrafish (Danio rerio), and 84 per cent of genes known to be associated with human disease have a zebrafish counterpart.

Orthologue genes shared between the zebrafish, human, mouse and chicken genome (Kerstin Howe et al)

The team developed a high-quality annotated zebrafish genome sequence to compare with the human reference genome. Only two other large genomes have been sequenced to this high standard: the human genome and the mouse genome. The completed zebrafish genome will be an essential resource that drives the study of gene function and disease in people.

Zebrafish are remarkably biologically similar to people and share the majority of the same genes as humans, making them an important model for understanding how genes work in health and disease.

“Our aim with this project, like with all biomedical research, is to improve human health. This genome will allow researchers to understand how our genes work and how genetic variants can cause disease in ways that cannot be easily studied in humans or other organisms,” said study senior author Dr Derek Stemple of the Wellcome Trust Sanger Institute.

Zebrafish research has already led to biological advances in cancer and heart disease research, and is advancing our understanding of muscle and organ development. Zebrafish have been used to verify the causal gene in muscular dystrophy disorders and also to understand the evolution and formation of melanomas or skin cancers.

“The vast majority of human genes have counterparts in the zebrafish, especially genes related to human disease. This high quality genome is testament to the many scientists who worked on this project and will spur biological research for years to come. By modeling these human disease genes in zebrafish, we hope that resources worldwide will produce important biological information regarding the function of these genes and possibly find new targets for drug development,” explained senior author Prof Jane Rogers, also of the Wellcome Trust Sanger Institute.

The zebrafish genome has some unique features, not seen in other vertebrates. They have the highest repeat content in their genome sequences so far reported in any vertebrate species: almost twice as much as seen in their closest relative, the common carp. Also unique to the zebrafish, the team identified chromosomal regions that influence sex determination.

The zebrafish genome contains few pseudogenes – genes thought to have lost their function through evolution – compared to the human genome.

The team identified 154 pseudogenes in the zebrafish genome, a fraction of the 13,000 or so pseudogenes found in the human genome.

“To realize the benefits the zebrafish can make to human health, we need to understand the genome in its entirety – both the similarities to the human genome and the differences. Armed with the zebrafish genome, we can now better understand how changes to our genomes result in disease,” said Prof Christiane Nüsslein-Volhard, co-author and Nobel laureate from the Max Planck Institute for Developmental Biology.

“This genome will help to uncover the biological processes responsible for common and rare disease and opens up exciting new avenues for disease screening and drug development,” Dr Stemple said.

Bibliographic information: Kerstin Howe et al. The Zebrafish Reference Genome Sequence and its Relationship to the Human Genome. Nature 496, 498–503 doi: 10.1038/nature12111

Catfish genomic studies: progress and perspectives

Yulin Jin, ... Zhanjiang Liu, in Genomics in Aquaculture, 2016

Gene knockout systems and their potential use in catfish

Gene knockout is considered a major component of the functional genomics toolbox, and is a top priority in revealing and clarifying the function of genes discovered by large-scale sequencing programs (Bouché and Bouchez, 2001). It is accomplished through a combination of techniques. Homologous recombination is a DNA repair mechanism that is employed in gene targeting to insert a designed mutation into the homologous genetic locus (Hall et al., 2009). In this way, it is feasible to introduce a mutation into a selected gene by directly using a potentially important genomic clone. This approach is widely used in yeast genetics to assess or modify gene function, and thousands of knockouts have been obtained in mice (Deutscher et al., 2006; Vogel, 2007). In animals, the knockout mouse has been viewed as a powerful tool for geneticists to identify the role of a gene in embryonic development and to discern its function in normal physiological homeostasis (Hall et al., 2009). In this regard, gene inactivation by knockout might be the best way to delineate the biological role of a protein.

Knockout requires recognition and replacement of the gene sequence by a defective copy via homologous recombination. However, gene targeting has never been easy in other organisms. In farmed fish, the lack of methodologies for homologous recombination and embryonic stem cell derivation makes it difficult to apply specific gene targeting technologies to unravel the function of genes (Li et al., 2013b). To date, only a few targeted gene knockouts have been reported in aquaculture species. The targeted disruption of the mstn gene using ZFNs was conducted in yellow catfish (Dong et al., 2011). The knockout of Dmrt1 and Foxl2 to investigate their effects on sex differentiation was conducted using TALENs in tilapia (Li et al., 2013b). As for catfish, with the completion of whole genome sequencing and genome annotation, it is now feasible to perform functional analysis by gene knockout or editing with state-of-the-art technologies such as TALEN and CRISPR/Cas9. It is time to establish an efficient and effective genome editing protocol to study functional genomics in catfish.

A collection of genomics, functional genomics, and genetics studies and links to their resulting datasets. This resource describes project scope, material, and objectives and provides a mechanism to retrieve datasets that are often difficult to find due to inconsistent annotation, multiple independent submissions, and the varied nature of diverse data types which are often stored in different databases.

Database that groups biomedical literature, small molecules, and sequence data in terms of biological relationships.

A collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database.

A database of known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data.

A collection of consolidated records describing proteins identified in annotated coding regions in GenBank and RefSeq, as well as SwissProt and PDB protein sequences. This resource allows investigators to obtain more targeted search results and quickly identify a protein of interest.

A collection of related protein sequences (clusters), consisting of Reference Sequence proteins encoded by complete prokaryotic and organelle plasmids and genomes. The database provides easy access to annotation information, publications, domains, structures, external links, and analysis tools.

A database that includes protein sequence records from a variety of sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.

Protein Family Models is a collection of models representing homologous proteins with a common function. It includes conserved domain architecture, hidden Markov models and BlastRules. A subset of these models are used by the Prokaryotic Genome Annotation Pipeline (PGAP) to assign names and other attributes to predicted proteins.

A collection of curated, non-redundant genomic DNA, transcript (RNA), and protein sequences produced by NCBI. RefSeqs provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. The RefSeq collection is accessed through the Nucleotide and Protein databases.


BLAST executables for local use are provided for Solaris, LINUX, Windows, and MacOSX systems. See the README file in the ftp directory for more information. Pre-formatted databases for BLAST nucleotide, protein, and translated searches also are available for downloading under the db subdirectory.

Sequence databases for use with the stand-alone BLAST programs. The files in this directory are pre-formatted databases that are ready to use with BLAST.

This site provides full data records for CDD, along with individual Position Specific Scoring Matrices (PSSMs), mFASTA sequences and annotation data for each conserved domain. See the README file for full details.

Sequence databases in FASTA format for use with the stand-alone BLAST programs. These databases must be formatted using formatdb before they can be used with BLAST.

The protein sequences corresponding to the translations of coding sequences (CDS) in GenBank are collected for each GenBank release. Please see the README file in the directory for more information.

This site contains all nucleotide and protein sequence records in the Reference Sequence (RefSeq) collection. The "release" directory contains the most current release of the complete collection, while data for selected organisms (such as human, mouse and rat) are available in separate directories. Data are available in FASTA and flat file formats. See the README file for details.


An online form that provides an interface for researchers, consortia and organizations to register their BioProjects. This serves as the starting point for the submission of genomic and genetic data for the study. The data does not need to be submitted at the time of BioProject registration.


Finds regions of local similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

Allows you to retrieve records from many Entrez databases by uploading a file of GI or accession numbers from the Nucleotide or Protein databases, or a file of unique identifiers from other Entrez databases. Search results can be saved in various formats directly to a local file on your computer.

COBALT is a protein multiple sequence alignment tool that finds a collection of pairwise constraints derived from conserved domain database, protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

A stand-alone application for viewing 3-dimensional structures from NCBI's Entrez retrieval service. Cn3D runs on Windows, Macintosh, and UNIX and can be configured to receive data from most popular web browsers. Cn3D simultaneously displays structure, sequence, and alignment, and has powerful annotation and alignment editing features.

Identifies the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (Reverse Position-Specific BLAST) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD).

Tools that provide access to data within NCBI's Entrez system outside of the regular web query interface. They provide a method of automating Entrez tasks within software applications. Each utility performs a specialized retrieval task, and can be used simply by writing a specially formatted URL.
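
As a small sketch of what "writing a specially formatted URL" looks like in practice, the Python snippet below builds an ESearch query URL without sending it. The helper name `esearch_url` and the example query are illustrative choices; consult the E-utilities documentation for the full parameter list and usage guidelines before making requests.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for the given Entrez database and query term.

    Only the URL is constructed here; no network request is made.
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


url = esearch_url("protein", "Tetrahymena thermophila[Organism]", retmax=5)
print(url)
```

Building the query with `urlencode` takes care of escaping spaces and the square brackets used for Entrez field tags.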

A utility for computing alignment of proteins to genomic nucleotide sequence. It is based on a variation of the Needleman Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, ProSplign is accurate in determining splice sites and tolerant to sequencing errors.
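
For readers unfamiliar with the underlying algorithm, here is a minimal Python sketch of plain Needleman-Wunsch global alignment scoring. It is not ProSplign's algorithm: it compares two sequences of the same type with a simple linear gap penalty and knows nothing about introns or splice signals. The scoring values are arbitrary assumptions chosen for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b
    using the classic Needleman-Wunsch dynamic programme."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("GATTACA", "GATACA"))   # 4
```

A splice-aware aligner extends this kind of recurrence with additional states and penalties so that long gaps corresponding to introns are treated differently from ordinary insertions and deletions.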

Provides a configurable graphical display of a nucleotide or protein sequence and features that have been annotated on that sequence. In addition to use on NCBI sequence database pages, this viewer is available as an embeddable webpage component. Detailed documentation including an API Reference guide is available for developers wishing to embed the viewer in their own pages.

What's New!

Interactions Methods Paper Published

A new paper on methods to identify protein interactions in Tetrahymena has been published. Functional proteomics protocol for the identification of interaction partners in Tetrahymena thermophila. Congratulations to the authors!

Special Issue of Microorganisms (ISSN 2076-2607): Ciliates as Model Organisms: From ‘omics’ to Genetics, Ecology and Signaling

This Special Issue is open to reporting all studies on ciliates as model organisms, seeking to understand their genetics, cell biology, biochemistry, evolution, ecological adaptation, and the complex mechanisms of signaling systems, from the genes involved to the changes in gene expression during cell response, and from the structure and involved evolution of signal molecules to the membrane traffic in the cells. The deadline for submission is 31 December 2021. We look forward to receiving your contributions. Prof. Dr. Cristina Miceli Prof. Dr. Adriana Vallesi Guest Editors Dr. Ronald Edward Pearlman Co-Guest Editor

Tetrahymena thermophila macronuclear genome fully completed

We have received an update of the Tetrahymena thermophila genome sequence from researchers at Ocean University of China. The gene pages, BLAST server, and Genome Browser have all been updated accordingly to display these new data. Congratulations to the team involved in this effort!

A publication explaining the new sequences and features has been published:
The completed macronuclear genome of a model ciliate Tetrahymena thermophila and its application in genome scrambling and copy number analyses.
Sheng Y, Duan L, Cheng T, Qiao Y, Stover NA, Gao S.
Sci China Life Sci. 2020. doi:10.1007/s11427-020-1689-4

Proteomics Review Published

A review of nuclear proteomics results described for Tetrahymena thermophila by Saettone, et al. has been published in the journal Genes. Congratulations to the authors!

Coregulation Data Harvester

The Co-regulation Data Harvester (CDH) is a software tool that allows the rapid collection of annotation data for co-regulated Tetrahymena genes. Thank you to Lev Tsypin and Aaron Turkewitz for contributing this valuable program to the community. See their publication The Co-regulation Data Harvester: Automating gene annotation starting from a transcriptome database in the journal Software X.

2018 GSA Ciliate Molecular Biology Meeting

Dear Colleagues, We are pleased to announce the 2018 GSA Ciliate Molecular Biology Meeting that will take place July 17-22, 2018 at American University in Washington DC. Boris Striepen (UPENN) will be the Keynote speaker. We hope you will join us for this interactive and engaging meeting covering a broad series of topics in ciliate molecular biology. Flights into Washington DC are convenient (Ronald Reagan National Airport (DCA), Washington Dulles International Airport (IAD), and Baltimore/Washington International Airport (BWI)). Please mark your calendars. See you in DC! 2018 GSA CMB Organizing Committee Chad Pearson, Naomi Stover and Martin Simon

Ciliate Research Video Released

Micronuclear genome structure published

The article Structure of the germline genome of Tetrahymena thermophila and relationship to the massively rearranged somatic genome has been published by Hamilton, et al. in the journal eLife. Sequence data from this paper are available at TGD under the prefix "2016_mic". The sequences have been added to GBrowse and as an option on the BLAST server. Congratulations to the authors!

ASCB Ciliate Lunch

The Ciliate Lunch at the 2015 ASCB meeting in San Diego will be held on Dec. 15th. Please contact Mark Winey (mark.winey(at) if you plan to attend this event.

2016 Ciliate Molecular Biology Conference

The 2016 Ciliate Molecular Biology Conference will occur next summer in a new exciting context that we call the Totally Awesome Genetics Conference (TAGC) that will unite for the first time, most of the GSA-sponsored model organism meetings at one venue, run concurrently. This is THE Ciliate Meeting for 2016, but it is much, much more!

The meeting will be in Orlando, Florida from July 13-17 at the Orlando World Center Marriott, which provides a campus-like environment for unparalleled networking. In Orlando you will find incredible room rates and abundant and inexpensive domestic and international flights. We have chosen this venue that can accommodate this great meeting while keeping your costs to participate as low as possible.

In addition to the exciting work in each organism, senior graduate students will have the opportunity to explore post-doctoral interests in other fields, postdocs will have the opportunity to network with faculty from other institutions, and everyone will enjoy a breadth of perspective never before possible at any model-organism focused meeting. So bring your shorts, shades and flip-flops and meet us in Orlando this July.

Organizer, 2016 Ciliates Molecular Biology Conference

2015 Ciliate Molecular Biology Conference

The 2015 Ciliate Molecular Biology Conference will be held at the University of Camerino (Camerino, Italy) from July 10 - July 16, 2015. Information about the meeting location and a preliminary program can be found on the conference website.

Gene Models Updated

The gene models in our database have been updated to match the 2014 annotation produced by JCVI. Over the coming weeks we will be adding new domain, homolog, GO, and other functional annotations to the website based on the new models. Thank you all for your patience as we work to improve the site.

Textpresso: Full text search

We have implemented Textpresso, the popular text-mining tool developed by Wormbase, at TGD Wiki. Textpresso for Tetrahymena allows searching of over 1700 full-text papers using keywords and semantic searches. More papers will be added to the library in the future.

Tetrahymena Annotation Workshop

A limited number of openings are available to attend a 2.5 day Tetrahymena genome annotation workshop that will be held at the J. Craig Venter Institute in Rockville, MD (outside Washington, DC) July 7-9, 2014. The principal intended audience will be faculty members interested in applying web-accessible tools for structural and functional gene annotation within an integrated research and education program. Our goal is to enable faculty and students (primarily undergraduates) to contribute, in a small or large way, to the ongoing improvement of the gene annotations available through the Tetrahymena Genome Database. Faculty should be committed to providing some such opportunities on a regular continuing basis, either in classes (genetics, molecular biology, bioinformatics, cell biology, etc.) and/or through independent research projects. The workshop will cover how to weigh various forms of evidence to make predictions of gene structure using the WebApollo interface and how to make functional assignments (gene names, GO terms, etc.) using information on protein domains and homology. We will also introduce Gbrowse tools for comparison of the macronuclear and micronuclear genomes and of the macronuclear genomes of T. thermophila and related species. Previous experience with genome annotation is not required.

Participants should plan on arriving on or before Sunday, July 6 and leaving July 9 in the evening, or later. All standard expenses will be covered by an award from the National Science Foundation for each faculty member and, if possible, an accompanying student (rising senior or junior) who may act as a Teaching Assistant for one or more of the faculty's classes and/or as a mentor to students in the lab. Members of under-represented minorities or faculty at institutions that serve such populations are especially encouraged to apply.

Please address all inquiries to one of the organizers, listed below. If you wish to apply, please send your name and contact information and a brief description of how meeting your research and education goals will benefit from the workshop to Bob Coyne, by April 25th.

Bob Coyne
rcoyne at

Nick Stover
nstover at

Emily Wiley
ewiley at

Three new species added

Three new Tetrahymena macronuclear genomes sequenced by the Broad Institute (T. malaccensis, T. elliotti, and T. borealis) have been added to TGD Wiki. Search these sequences in BLAST and GBrowse, or download them from our Genome Data page. The original data can be accessed at the Broad's Tetrahymena Comparative Database.

Pubmed Entries Updated

TGD Wiki has updated citation information from Pubmed to include papers from the last year. Please take a moment to annotate genes mentioned in these publications in the References section of the Gene Page.

Gene Names Extended to Four Letters

To help accommodate a wider variety of gene names, we have increased by one (from three to four) the number of letters allowed to form a Gene Name prefix. Names were previously limited to the format "ABC123". Names with the format "ABCD123" will now be accepted. Please note that the additional letter must be included before the numerals; letters after the numerals (e.g. "ABC123D") are currently not accepted. We hope this modification to the published gene naming guidelines helps with the push to name as many genes as possible by the end of the month.
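
For anyone checking a list of candidate names before submission, the rule above can be sketched as a regular expression in Python. This is an unofficial illustration: it assumes a prefix of exactly three or four uppercase letters followed by one or more digits, which is only what the examples in this announcement imply. The published naming guidelines remain the authority on details such as case and the number of digits.

```python
import re

# Assumed pattern: 3 or 4 uppercase letters, then at least one digit.
GENE_NAME_PATTERN = re.compile(r"[A-Z]{3,4}[0-9]+")


def is_valid_gene_name(name):
    """Return True if the whole name fits the assumed prefix-plus-number format."""
    return GENE_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["ABC123", "ABCD123", "ABC123D", "ABCDE123"]:
    print(candidate, is_valid_gene_name(candidate))
# ABC123 True
# ABCD123 True
# ABC123D False
# ABCDE123 False
```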

SUPRDB collects unpublished research

The Student/Unpublished Research Database (SUPRDB) has been established to accept unpublished data to aid in the annotation of the Tetrahymena genome. SUPRDB began as part of the Ciliates in the Classroom project, but we encourage contributions from all members of the research community. Reports in standard scientific format can be entered at the site. The SUPRDB ID for the report can then be used just like a Pubmed ID in the GO Annotations and Associated Literature sections of TGD. Think of it as Pubmed Central for all of the unpublished findings we've made over the years.

SUPRDB is the latest addition to the family of websites, which now includes genome databases for Tetrahymena, Ichthyophthirius, and Oxytricha. To sign up for write access for any of these sites, contact us at [email protected]

Gene Naming Drive

TGD Wiki and TetRA, the Tetrahymena Research Advisory Board, encourage all community members to name genes in their area of expertise over the next few weeks. To help with this effort, we have written a guide for naming genes based on simple BLAST searches. The criteria are straightforward and should allow us to quickly name conserved genes and gene families. This Gene Naming guide is posted under our Resources menu, or you can access it below.
Naming Genes using BLAST (PDF)

As a reminder, if you have published articles with new gene names, please take a moment to add these to TGD Wiki as well. Thanks for contributing!

New TGD Wiki paper

A new article about TGD Wiki has been published in Database: The Journal of Biological Databases and Curation. Enjoy!

2013 FASEB Ciliate Molecular Biology Conference

Please mark your calendar! The 2013 FASEB Ciliate Molecular Biology Conference will be held July 7-12, 2013 at the Steamboat Grand Resort (Steamboat Springs, Colorado). Information about the conference location can be found at the FASEB web site.

Sidebar Functions

The Recent Activity section of the left sidebar now shows the latest three gene pages edited by members of the community, including updates to gene names, GO annotations, and the list of related papers. Recent Papers shows the last three articles added to our index of Tetrahymena papers (downloaded regularly from Pubmed). Authors, please take a moment to link new papers to the genes they describe.

BLAST updated

We have updated the BLAST server and its collection of sequence datasets. The new BLAST software is capable of translating both query and database sequences using a variety of genetic codes. (Please note that Tetrahymena uses the "Ciliate Nuclear (6)" genetic code to translate mRNAs.) The most recent (v.2008) Protein, CDS, Assembly, and Trace sequences are currently available for search. For the time being, we also have a link to the legacy BLAST server at Stanford, which contains the v.2004 sequences. Please let us know if you find the new BLAST server lacks tools or datasets you found useful at the legacy server, which will soon be decommissioned.
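To illustrate why the genetic code setting matters, the sketch below translates the same coding sequence with the standard code and with the ciliate nuclear code (NCBI translation table 6), in which TAA and TAG encode glutamine and TGA is the only stop codon. The helper `translate` and the example sequence are made up for illustration and are not part of the BLAST server.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code in NCBI order
# (first, second and third base each cycling through T, C, A, G).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
STANDARD_CODE = dict(zip(CODONS, STANDARD))

# Table 6 reassigns TAA and TAG to glutamine (Q); TGA remains a stop codon.
CILIATE_CODE = {**STANDARD_CODE, "TAA": "Q", "TAG": "Q"}

def translate(cds: str, code: dict) -> str:
    """Translate a DNA coding sequence in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = code[cds[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

cds = "ATGTAAGCTTAGAAATGA"  # hypothetical example sequence
print(translate(cds, STANDARD_CODE))  # M
print(translate(cds, CILIATE_CODE))   # MQAQK
```

With the standard code the translation stops at the second codon, whereas the ciliate code reads through both TAA and TAG, which is why searches against Tetrahymena sequences should use code 6.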

Links to TetraFGD

The Expression Profile section of each gene page has been updated to link with the redesigned Tetrahymena Functional Genomics Database. TetraFGD shows RNA-seq, microarray, and gene network profiles for T. thermophila genes.

Preliminary Ich genome sequence browser

The preliminary annotation of the Ichthyophthirius multifiliis genome is now available for browsing and keyword searching at the genome browser. We will update the site with the official gene names and models once they have been finalized for publication. The Ich genome browser can be accessed directly at

Genome Browser updated

The genome browser has been updated to GBrowse 2 and is now showing the v.2008 annotation of the Tetrahymena genome sequence. We will maintain a link on the gene pages to the v.2004 browser until we are able to recreate the useful tracks available there, but please note that its sequences and gene models may be out of date. The TGD Wiki gene pages and v.2008 browser both show the current T. thermophila annotation.

TGD Wiki gets a new look!

The TGD Wiki website has been redesigned. Don't worry, all your favorite genes and tools are still here - but now it should be even more enjoyable to work on them! Thanks for this update go out to our newest programmer, Mike Bowen.

Article on TGD Wiki

Bradley University has featured TGD Wiki, a collaborative project between the Biology and Computer Science departments, in a Spotlight news article.

Registration is now open!

Tetrahymena researchers who wish to contribute to the community annotation effort can now register and begin editing TGD Wiki. Simply send the information requested on the User Registration page to ciliate-curator. Once your lab receives a user name and password, visit the new Wiki Edit Guide to see what kinds of annotations can be made to your favorite Tetrahymena genes.

2011 FASEB Conference

Just announced: The 2011 FASEB Ciliate Molecular Biology Conference will be held at the Orthodox Academy of Crete (Kolymvari, Chania, Greece), from July 11-15, 2011. Details are available here.

References updated

We have loaded citation information from Pubmed for Tetrahymena papers published in the last three years. From this point on, TGD Wiki will be updated regularly to include new papers that become available through Pubmed.

TGD Wiki is now online!

TGD Wiki is the new hub for information about the genes and proteins of Tetrahymena. TGD Wiki currently displays the most recent Tetrahymena gene/protein sequences and functional annotations from TIGR and other sources. In order to keep the information in our database as current as possible, we will soon be inviting the members of the Tetrahymena community to add and update these annotations to reflect published research. Check back here for updates as we continue to develop and improve this website.

Study revealing the secret behind a key cellular process refutes biology textbooks

New research has identified and described a cellular process that, despite what textbooks say, has remained elusive to scientists until now -- precisely how the copying of genetic material, once started, is properly turned off.

The finding concerns a key process essential to life: the transcription phase of gene expression, which enables cells to live and do their jobs.

During transcription, an enzyme called RNA polymerase wraps itself around the double helix of DNA, using one strand to match nucleotides to make a copy of genetic material -- resulting in a newly synthesized strand of RNA that breaks off when transcription is complete. That RNA enables production of proteins, which are essential to all life and perform most of the work inside cells.

Just as with any coherent message, RNA needs to start and stop in the right place to make sense. A bacterial protein called Rho was discovered more than 50 years ago because of its ability to stop, or terminate, transcription. In every textbook, Rho is used as a model terminator that, using its very strong motor force, binds to the RNA and pulls it out of RNA polymerase. But a closer look by these scientists showed that Rho wouldn't be able to find the RNAs it needs to release using the textbook mechanism.

"We started studying Rho, and realized it cannot possibly work in ways people tell us it works," said Irina Artsimovitch, co-lead author of the study and professor of microbiology at The Ohio State University.

The research, published online by the journal Science today, Nov. 26, 2020, determined that instead of attaching to a specific piece of RNA near the end of transcription and helping it unwind from DNA, Rho actually "hitchhikes" on RNA polymerase for the duration of transcription. Rho cooperates with other proteins to eventually coax the enzyme through a series of structural changes that end with an inactive state enabling release of the RNA.

The team used sophisticated microscopes to reveal how Rho acts on a complete transcription complex composed of RNA polymerase and two accessory proteins that travel with it throughout transcription.

"This is the first structure of a termination complex in any system, and was supposed to be impossible to obtain because it falls apart too quickly," Artsimovitch said.

"It answers a fundamental question -- transcription is fundamental to life, but if it were not controlled, nothing would work. RNA polymerase by itself has to be completely neutral. It has to be able to make any RNA, including those that are damaged or could harm the cell. While traveling with RNA polymerase, Rho can tell if the synthesized RNA is worth making -- and if not, Rho releases it."

Artsimovitch has made many important discoveries about how RNA polymerase so successfully completes transcription. She didn't set out to counter years of understanding about Rho's role in termination until an undergraduate student in her lab identified surprising mutations in Rho while working on a genetics project.

Rho is known to silence the expression of virulence genes in bacteria, essentially keeping them dormant until they're needed to cause infection. But these genes do not have any RNA sequences that Rho is known to preferentially bind. Because of that, Artsimovitch said, it has never made sense that Rho looks only for specific RNA sequences, without even knowing if they are still attached to RNA polymerase.

In fact, the scientific understanding of the Rho mechanism was established using simplified biochemical experiments that frequently left out RNA polymerase -- in essence, defining how a process ends without factoring in the process itself.

In this work, the researchers used cryo-electron microscopy to capture images of RNA polymerase operating on a DNA template in Escherichia coli, their model system. This high-resolution visualization, combined with high-end computation, made accurate modeling of transcription termination possible.

"RNA polymerase moves along, matching hundreds of thousands of nucleotides in bacteria. The complex is extremely stable because it has to be -- if the RNA is released, it is lost," Artsimovitch said. "Yet Rho is able to make the complex fall apart in a matter of minutes, if not seconds. You can look at it, but you can't get a stable complex to analyze."

Using a clever method to trap complexes just before they fall apart enabled the scientists to visualize seven complexes that represent sequential steps in the termination pathway, starting from Rho's engagement with RNA polymerase and ending with a completely inactive RNA polymerase. The team created models based on what they saw, and then made sure that these models were correct using genetic and biochemical methods.

Though the study was conducted in bacteria, Artsimovitch said this termination process is likely to occur in other forms of life.

"It appears to be common," she said. "In general, cells use similar working mechanisms from a common ancestor. They all learned the same tricks as long as these tricks were useful."

Artsimovitch, working with an international research team of collaborators, co-led the study with Markus Wahl, a former Ohio State graduate student now at Freie Universität Berlin.

This work was supported by grants from the German Research Foundation; the German Federal Ministry of Education and Research; the Indian Council of Medical Research; the Department of Biotechnology, Government of India; the National Institutes of Health; and the Sigrid Jusélius Foundation.


In order to understand functional genomics it is important to first define function. In their paper [1] Graur et al. define function in two possible ways. These are "Selected effect" and "Causal Role". The "Selected Effect" function refers to the function for which a trait (DNA, RNA, protein etc.) is selected for. The "Causal role" function refers to the function that a trait is sufficient and necessary for. Functional genomics usually tests the "Causal role" definition of function.

The goal of functional genomics is to understand the function of genes or proteins, eventually all components of a genome. The term functional genomics is often used to refer to the many technical approaches to study an organism's genes and proteins, including the "biochemical, cellular, and/or physiological properties of each and every gene product" [2] while some authors include the study of nongenic elements in their definition. [3] Functional genomics may also include studies of natural genetic variation over time (such as an organism's development) or space (such as its body regions), as well as functional disruptions such as mutations.

The promise of functional genomics is to generate and synthesize genomic and proteomic knowledge into an understanding of the dynamic properties of an organism. This could potentially provide a more complete picture of how the genome specifies function compared to studies of single genes. Integration of functional genomics data is often a part of systems biology approaches.

Functional genomics includes function-related aspects of the genome itself such as mutation and polymorphism (such as single nucleotide polymorphism (SNP) analysis), as well as the measurement of molecular activities. The latter comprise a number of "-omics" such as transcriptomics (gene expression), proteomics (protein production), and metabolomics. Functional genomics uses mostly multiplex techniques to measure the abundance of many or all gene products such as mRNAs or proteins within a biological sample. A more focused functional genomics approach might test the function of all variants of one gene and quantify the effects of mutants by using sequencing as a readout of activity. Together these measurement modalities endeavor to quantitate the various biological processes and improve our understanding of gene and protein functions and interactions.

At the DNA level

Genetic interaction mapping

Systematic pairwise deletion of genes or inhibition of gene expression can be used to identify genes with related function, even if they do not interact physically. Epistasis refers to the fact that effects for two different gene knockouts may not be additive; that is, the phenotype that results when two genes are inhibited may be different from the sum of the effects of single knockouts.
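The additive expectation described above can be written as a one-line calculation. The following is a minimal sketch, assuming each effect is measured as a change relative to wild type on a scale where effects are expected to add; the function name and the numbers are illustrative only.

```python
def epistasis_score(effect_a: float, effect_b: float, effect_ab: float) -> float:
    """Deviation of the double-knockout effect from the sum of the single effects."""
    return effect_ab - (effect_a + effect_b)

# Made-up example: growth-rate changes relative to wild type.
single_a = -0.1
single_b = -0.2
double_ab = -0.6

score = epistasis_score(single_a, single_b, double_ab)
# A score near zero means the effects look additive (no epistasis); here the
# double knockout is worse than expected, so the score is negative.
print(round(score, 3))
```

Other scales (for example multiplicative fitness models) are also used in practice; the additive form here simply mirrors the wording of the paragraph.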

DNA/Protein interactions

Proteins formed by the translation of mRNA (messenger RNA, which carries coded information from DNA for protein synthesis) play a major role in regulating gene expression. To understand how they regulate gene expression it is necessary to identify the DNA sequences that they interact with. Techniques have been developed to identify sites of DNA-protein interactions. These include ChIP-sequencing, CUT&RUN sequencing and Calling Cards. [4]

DNA accessibility assays

Assays have been developed to identify regions of the genome that are accessible. These regions of open chromatin are candidate regulatory regions. These assays include ATAC-seq, DNase-Seq and FAIRE-Seq.

At the RNA level

Microarrays

Microarrays measure the amount of mRNA in a sample that corresponds to a given gene or probe DNA sequence. Probe sequences are immobilized on a solid surface and allowed to hybridize with fluorescently labeled “target” mRNA. The intensity of fluorescence of a spot is proportional to the amount of target sequence that has hybridized to that spot, and therefore to the abundance of that mRNA sequence in the sample. Microarrays allow for identification of candidate genes involved in a given process based on variation between transcript levels for different conditions and shared expression patterns with genes of known function.


Serial analysis of gene expression (SAGE) is an alternate method of analysis based on RNA sequencing rather than hybridization. SAGE relies on the sequencing of 10–17 base pair tags which are unique to each gene. These tags are produced from poly-A mRNA and ligated end-to-end before sequencing. SAGE gives an unbiased measurement of the number of transcripts per cell, since it does not depend on prior knowledge of what transcripts to study (as microarrays do).
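The counting step behind SAGE can be sketched as follows. This is a simplified illustration, not a real SAGE pipeline: it assumes a fixed tag length of 10 bp, treats each read as tags joined directly end to end (ignoring linker or enzyme-site sequence), and uses a hypothetical tag-to-gene lookup table.

```python
from collections import Counter

TAG_LENGTH = 10  # assumption: a fixed length within the 10-17 bp range

def split_concatemer(read: str, tag_length: int = TAG_LENGTH) -> list:
    """Cut a concatenated read into consecutive fixed-length tags."""
    return [read[i:i + tag_length]
            for i in range(0, len(read) - tag_length + 1, tag_length)]

def count_tags(reads: list, tag_to_gene: dict) -> Counter:
    """Count how often each gene's tag is observed across all reads."""
    counts = Counter()
    for read in reads:
        for tag in split_concatemer(read):
            counts[tag_to_gene.get(tag, "unassigned")] += 1
    return counts

# Hypothetical tags and reads.
tag_to_gene = {"AAAAAAAAAA": "geneA", "CCCCCCCCCC": "geneB"}
reads = ["AAAAAAAAAACCCCCCCCCCAAAAAAAAAA", "CCCCCCCCCCGGGGGGGGGG"]
tag_counts = count_tags(reads, tag_to_gene)
print(tag_counts)
```

Tags that are not in the lookup table are kept under "unassigned", which reflects the point that SAGE does not require knowing in advance which transcripts to look for.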

RNA sequencing

RNA sequencing has overtaken microarray and SAGE technology in recent years (as noted in 2016) and has become the most efficient way to study transcription and gene expression. This is typically done by next-generation sequencing. [5]

A subset of sequenced RNAs are small RNAs, a class of non-coding RNA molecules that are key regulators of transcriptional and post-transcriptional gene silencing, or RNA silencing. Next generation sequencing is the gold standard tool for non-coding RNA discovery, profiling and expression analysis.

Massively Parallel Reporter Assays (MPRAs)

Massively parallel reporter assays (MPRAs) are a technology for testing the cis-regulatory activity of DNA sequences. [6] [7] MPRAs use a plasmid with a synthetic cis-regulatory element upstream of a promoter driving a synthetic gene such as Green Fluorescent Protein. A library of cis-regulatory elements is usually tested using MPRAs; a library can contain from hundreds to thousands of cis-regulatory elements. The cis-regulatory activity of the elements is assayed by using the downstream reporter activity. The activity of all the library members is assayed in parallel using barcodes for each cis-regulatory element. One limitation of MPRAs is that the activity is assayed on a plasmid and may not capture all aspects of gene regulation observed in the genome.
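The barcode-based readout can be sketched as a small calculation. The example below assumes one common normalization, the log2 ratio of reporter RNA barcode counts to plasmid DNA barcode counts summed per element; the text does not specify a formula, and the function name, pseudocount, and counts are illustrative.

```python
import math
from collections import defaultdict

def mpra_activity(barcode_to_element, rna_counts, dna_counts, pseudocount=1.0):
    """Per-element activity as log2((RNA + pseudocount) / (DNA + pseudocount))."""
    rna_total = defaultdict(float)
    dna_total = defaultdict(float)
    for barcode, element in barcode_to_element.items():
        rna_total[element] += rna_counts.get(barcode, 0)
        dna_total[element] += dna_counts.get(barcode, 0)
    return {
        element: math.log2((rna_total[element] + pseudocount)
                           / (dna_total[element] + pseudocount))
        for element in dna_total
    }

# Hypothetical library: two barcodes for one element, one for a control.
barcode_to_element = {"bc1": "enh1", "bc2": "enh1", "bc3": "ctrl"}
rna_counts = {"bc1": 30, "bc2": 33, "bc3": 15}
dna_counts = {"bc1": 7, "bc2": 8, "bc3": 15}

activity = mpra_activity(barcode_to_element, rna_counts, dna_counts)
print(activity)  # enh1 -> 2.0, ctrl -> 0.0
```

Dividing by the DNA counts corrects for how abundant each construct was in the plasmid pool, so an element is not scored as active merely because it was over-represented in the library.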

STARR-seq

STARR-seq is a technique similar to MPRAs to assay enhancer activity of randomly sheared genomic fragments. In the original publication, [8] randomly sheared fragments of the Drosophila genome were placed downstream of a minimal promoter. Candidate enhancers amongst the randomly sheared fragments will transcribe themselves using the minimal promoter. By using sequencing as a readout and controlling for input amounts of each sequence, the strength of putative enhancers is assayed by this method.

Perturb-seq

Perturb-seq couples CRISPR mediated gene knockdowns with single-cell gene expression. Linear models are used to calculate the effect of the knockdown of a single gene on the expression of multiple genes.
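A toy version of the linear-model step is shown below. It is a deliberately reduced sketch: one guide, a binary indicator of whether each cell carries it, and an ordinary least-squares slope per gene (which, for a binary predictor, equals the difference in mean expression between perturbed and unperturbed cells). Real analyses fit many guides and covariates jointly; all names and numbers here are hypothetical.

```python
from statistics import mean

def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def knockdown_effects(has_guide, expression):
    """Estimate the effect of one knockdown on every measured gene."""
    return {gene: ols_slope(has_guide, values)
            for gene, values in expression.items()}

# Four hypothetical cells: the first two carry the guide, the last two do not.
has_guide = [1, 1, 0, 0]
expression = {
    "geneX": [2, 4, 8, 10],  # lower in perturbed cells
    "geneY": [5, 5, 5, 5],   # unaffected
}

effects = knockdown_effects(has_guide, expression)
print(effects)  # geneX -> -6.0, geneY -> 0.0
```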

At the protein level

Yeast two-hybrid system

A yeast two-hybrid (Y2H) screen tests a "bait" protein against many potential interacting proteins ("prey") to identify physical protein–protein interactions. This system is based on a transcription factor, originally GAL4, [9] whose separate DNA-binding and transcription activation domains are both required in order for the protein to cause transcription of a reporter gene. In a Y2H screen, the "bait" protein is fused to the binding domain of GAL4, and a library of potential "prey" (interacting) proteins is recombinantly expressed in a vector with the activation domain. In vivo interaction of bait and prey proteins in a yeast cell brings the activation and binding domains of GAL4 close enough together to result in expression of a reporter gene. It is also possible to systematically test a library of bait proteins against a library of prey proteins to identify all possible interactions in a cell.

AP/MS

Affinity purification and mass spectrometry (AP/MS) is able to identify proteins that interact with one another in complexes. Complexes of proteins are allowed to form around a particular “bait” protein. The bait protein is identified using an antibody or a recombinant tag which allows it to be extracted along with any proteins that have formed a complex with it. The proteins are then digested into short peptide fragments and mass spectrometry is used to identify the proteins based on the mass-to-charge ratios of those fragments.

Deep mutational scanning

In deep mutational scanning every possible amino acid change in a given protein is first synthesized. The activity of each of these protein variants is assayed in parallel using barcodes for each variant. By comparing the activity to the wild-type protein, the effect of each mutation is identified. While it is possible to assay every possible single amino-acid change, combinatorics makes two or more concurrent mutations hard to test exhaustively. Deep mutational scanning experiments have also been used to infer protein structure and protein-protein interactions.
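The combinatorial problem is easy to quantify. The sketch below counts variants for a protein of a given length, assuming 19 alternative amino acids per position and ignoring stop codons and synonymous changes; the 100-residue length is just an example.

```python
from math import comb

ALTERNATIVES = 19  # the 20 standard amino acids minus the wild-type residue

def variant_count(protein_length: int, n_mutations: int) -> int:
    """Number of distinct variants carrying exactly n_mutations substitutions."""
    return comb(protein_length, n_mutations) * ALTERNATIVES ** n_mutations

length = 100  # hypothetical protein length
singles = variant_count(length, 1)
doubles = variant_count(length, 2)
print(singles)  # 1900
print(doubles)  # 1786950
```

For this example the library grows from under two thousand single mutants to nearly 1.8 million double mutants, and each additional concurrent mutation multiplies the size again.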

Loss-of-function techniques

Mutagenesis

Gene function can be investigated by systematically “knocking out” genes one by one. This is done by either deletion or disruption of function (such as by insertional mutagenesis) and the resulting organisms are screened for phenotypes that provide clues to the function of the disrupted gene.

RNAi

RNA interference (RNAi) methods can be used to transiently silence or knock down gene expression using ~20 base-pair double-stranded RNA, typically delivered by transfection of synthetic ~20-mer short-interfering RNA molecules (siRNAs) or by virally encoded short-hairpin RNAs (shRNAs). RNAi screens, typically performed in cell culture-based assays or experimental organisms (such as C. elegans), can be used to systematically disrupt nearly every gene in a genome or subsets of genes (sub-genomes); possible functions of disrupted genes can be assigned based on observed phenotypes.

CRISPR screens

CRISPR-Cas9 has been used to delete genes in a multiplexed manner in cell-lines. Quantifying the amount of guide-RNAs for each gene before and after the experiment can point towards essential genes. If a guide-RNA disrupts an essential gene it will lead to the loss of that cell and hence there will be a depletion of that particular guide-RNA after the screen. In a recent CRISPR-Cas9 experiment in mammalian cell-lines, around 2000 genes were found to be essential in multiple cell-lines. [11] [12] Some of these genes were essential in only one cell-line. Most of these genes are part of multi-protein complexes. This approach can be used to identify synthetic lethality by using the appropriate genetic background. CRISPRi and CRISPRa enable loss-of-function and gain-of-function screens in a similar manner. CRISPRi identified ~2100 essential genes in the K562 cell-line. [13] [14] CRISPR deletion screens have also been used to identify potential regulatory elements of a gene. For example, a technique called ScanDel was published which attempted this approach. The authors deleted regions outside a gene of interest (HPRT1, involved in a Mendelian disorder) in an attempt to identify regulatory elements of this gene. [15] Gasperini et al. did not identify any distal regulatory elements for HPRT1 using this approach; however, such approaches can be extended to other genes of interest.
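The before/after comparison of guide-RNA abundance can be sketched as a log2 fold change of library-normalized counts. This is an illustrative calculation under simple assumptions (normalization to total counts, a pseudocount of 1), not the scoring method of the cited studies; the guide names and counts are made up.

```python
import math

def guide_log2_fold_changes(before, after, pseudocount=1.0):
    """log2 of each guide's fraction after the screen over its fraction before."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    changes = {}
    for guide, n_before in before.items():
        frac_before = (n_before + pseudocount) / total_before
        frac_after = (after.get(guide, 0) + pseudocount) / total_after
        changes[guide] = math.log2(frac_after / frac_before)
    return changes

# Hypothetical counts: one guide targeting an essential gene, one neutral guide.
before = {"guide_essential": 100, "guide_neutral": 100}
after = {"guide_essential": 10, "guide_neutral": 190}

lfc = guide_log2_fold_changes(before, after)
for guide, value in lfc.items():
    print(guide, round(value, 2))
```

A strongly negative value marks a depleted guide, the signature expected when the targeted gene is essential; in practice several guides per gene are combined before calling a gene essential.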

Functional annotations for genes

Genome annotation

Putative genes can be identified by scanning a genome for regions likely to encode proteins, based on characteristics such as long open reading frames, transcriptional initiation sequences, and polyadenylation sites. A sequence identified as a putative gene must be confirmed by further evidence, such as similarity to cDNA or EST sequences from the same organism, similarity of the predicted protein sequence to known proteins, association with promoter sequences, or evidence that mutating the sequence produces an observable phenotype.

Rosetta stone approach

The Rosetta stone approach is a computational method for de-novo protein function prediction. It is based on the hypothesis that some proteins involved in a given physiological process may exist as two separate genes in one organism and as a single gene in another. Genomes are scanned for sequences that are independent in one organism and in a single open reading frame in another. If two genes have fused, it is predicted that they have similar biological functions that make such co-regulation advantageous.
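The scanning step can be sketched on toy data. The example assumes each protein has already been assigned to one or more sequence families (for instance by similarity search), which is a simplification introduced here; the function name, protein names, and family labels are hypothetical.

```python
from itertools import combinations

def rosetta_stone_pairs(genome_a, genome_b):
    """Find pairs of separate proteins in genome A whose families are fused in genome B.

    Both arguments map a protein name to the set of families found in it.
    """
    # Every pair of families that co-occur in a single protein of genome B.
    fused_pairs = set()
    for families in genome_b.values():
        fused_pairs.update(combinations(sorted(families), 2))

    predictions = set()
    for p, q in combinations(sorted(genome_a), 2):
        for fam_p in genome_a[p]:
            for fam_q in genome_a[q]:
                if fam_p != fam_q and tuple(sorted((fam_p, fam_q))) in fused_pairs:
                    predictions.add((p, q))
    return predictions

genome_a = {"A1": {"famX"}, "A2": {"famY"}, "A3": {"famZ"}}
genome_b = {"B1": {"famX", "famY"}, "B2": {"famZ"}}

linked = rosetta_stone_pairs(genome_a, genome_b)
print(linked)  # {('A1', 'A2')}
```

Here A1 and A2 are predicted to be functionally related because their families appear together in the single protein B1, while A3 has no fused counterpart.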

Because of the large quantity of data produced by these techniques and the desire to find biologically meaningful patterns, bioinformatics is crucial to analysis of functional genomics data. Examples of techniques in this class are data clustering or principal component analysis for unsupervised machine learning (class detection) as well as artificial neural networks or support vector machines for supervised machine learning (class prediction, classification). Functional enrichment analysis is used to determine the extent of over- or under-expression (positive or negative regulators in the case of RNAi screens) of functional categories relative to a background set. Gene ontology based enrichment analysis is provided by DAVID and gene set enrichment analysis (GSEA), [16] pathway based analysis by Ingenuity [17] and Pathway studio [18] and protein complex based analysis by COMPLEAT. [19]
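One common way to test whether a functional category is over-represented in a hit list is a one-sided hypergeometric test. The sketch below illustrates that general idea only; it is not a description of how DAVID, GSEA, or the other tools named above compute their scores, and the numbers are invented.

```python
from math import comb

def hypergeom_enrichment_p(background: int, in_category: int,
                           hits: int, overlap: int) -> float:
    """P(X >= overlap) when drawing `hits` genes without replacement
    from `background` genes, of which `in_category` belong to the category."""
    upper = min(in_category, hits)
    favourable = sum(
        comb(in_category, i) * comb(background - in_category, hits - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background, hits)

# Toy example: 20 background genes, 5 in the category, 5 hits, 3 of them in the category.
p_value = hypergeom_enrichment_p(background=20, in_category=5, hits=5, overlap=3)
print(round(p_value, 4))  # 0.0726
```

In a real analysis the test is repeated for many categories, so the resulting p-values need a multiple-testing correction before interpretation.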

New computational methods have been developed for understanding the results of a deep mutational scanning experiment. 'phydms' compares the result of a deep mutational scanning experiment to a phylogenetic tree. [20] This allows the user to infer if the selection process in nature applies similar constraints on a protein as the results of the deep mutational scan indicate. This may allow an experimenter to choose between different experimental conditions based on how well they reflect nature. Deep mutational scanning has also been used to infer protein-protein interactions. [21] The authors used a thermodynamic model to predict the effects of mutations in different parts of a dimer. Deep mutational scanning can also be used to infer protein structure. Strong positive epistasis between two mutations in a deep mutational scan can be indicative of two parts of the protein that are close to each other in 3-D space. This information can then be used to infer protein structure. A proof of principle of this approach was shown by two groups using the protein GB1. [22] [23]

Results from MPRA experiments have required machine learning approaches to interpret the data. A gapped k-mer SVM model has been used to infer the k-mers that are enriched within cis-regulatory sequences with high activity compared to sequences with lower activity. [24] These models provide high predictive power. Deep learning and random forest approaches have also been used to interpret the results of these high-dimensional experiments. [25] These models are beginning to help develop a better understanding of the role of non-coding DNA in gene regulation.

The ENCODE project

The ENCODE (Encyclopedia of DNA elements) project is an in-depth analysis of the human genome whose goal is to identify all the functional elements of genomic DNA, in both coding and noncoding regions. Important results include evidence from genomic tiling arrays that most nucleotides are transcribed as coding transcripts, noncoding RNAs, or random transcripts; the discovery of additional transcriptional regulatory sites; and further elucidation of chromatin-modifying mechanisms.

The Genotype-Tissue Expression (GTEx) project

The GTEx project is a human genetics project aimed at understanding the role of genetic variation in shaping variation in the transcriptome across tissues. The project has collected a variety of tissue samples (> 50 different tissues) from more than 700 post-mortem donors. This has resulted in the collection of >11,000 samples. GTEx has helped in understanding the tissue-sharing and tissue-specificity of eQTLs. [26]

DNA methylation age of human tissues and cell types

Background: It is not yet known whether DNA methylation levels can be used to accurately predict age across a broad spectrum of human tissues and cell types, nor whether the resulting age prediction is a biologically meaningful measure.

Results: I developed a multi-tissue predictor of age that allows one to estimate the DNA methylation age of most tissues and cell types. The predictor, which is freely available, was developed using 8,000 samples from 82 Illumina DNA methylation array datasets, encompassing 51 healthy tissues and cell types. I found that DNA methylation age has the following properties: first, it is close to zero for embryonic and induced pluripotent stem cells; second, it correlates with cell passage number; third, it gives rise to a highly heritable measure of age acceleration; and, fourth, it is applicable to chimpanzee tissues. Analysis of 6,000 cancer samples from 32 datasets showed that all of the considered 20 cancer types exhibit significant age acceleration, with an average of 36 years. Low age-acceleration of cancer tissue is associated with a high number of somatic mutations and TP53 mutations, while mutations in steroid receptors greatly accelerate DNA methylation age in breast cancer. Finally, I characterize the 353 CpG sites that together form an aging clock in terms of chromatin states and tissue variance.

Conclusions: I propose that DNA methylation age measures the cumulative effect of an epigenetic maintenance system. This novel epigenetic clock can be used to address a host of questions in developmental biology, cancer and aging research.
