Extract mutations from fasta sequences

I have a large number of protein sequences in .fasta format and I would like to extract only the amino acid mutations from these sequences, so that in the end I have a list that looks something like this: I456L, W675T, etc. Is there a program or way to do this? Thanks.
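One possible way to do this, given as a minimal sketch rather than an established tool: assuming the sequences have already been aligned to a reference (same length, with "-" marking gaps), substitutions can be listed by walking the two sequences in parallel. The function name list_mutations and the example sequences are made up for illustration; insertions and deletions are skipped, not reported.

```python
def list_mutations(reference, variant, gap="-"):
    """Return substitutions such as 'I456L' between two aligned sequences.

    Assumes both sequences come from the same alignment (equal length).
    Positions are 1-based and counted along the reference, ignoring gaps.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    position = 0
    for ref_aa, var_aa in zip(reference.upper(), variant.upper()):
        if ref_aa == gap:
            continue  # insertion relative to the reference: not reported
        position += 1
        if var_aa == gap or var_aa == ref_aa:
            continue  # deletion or identical residue: not reported
        mutations.append(f"{ref_aa}{position}{var_aa}")
    return mutations


# Hypothetical aligned sequences, for illustration only.
reference = "MKTIW"
aligned = {"variant_1": "MKTLW", "variant_2": "MRTIW"}
for name, sequence in aligned.items():
    print(name, ", ".join(list_mutations(reference, sequence)))
```

If the sequences are not yet aligned, they would first need to go through a pairwise or multiple sequence alignment program, with the aligned output fed into a function like this.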


Genome2D Genome Tools (Dr. Anne de Jong, Molecular Genetics, University of Groningen, The Netherlands) - this is my go-to site for all manner of analyses. Under "Genome Tools" select "Conversions." This will allow you to convert a GenBank flatfile (gbk) to GFF (General Feature Format, table), CDS (coding sequences), Proteins (FASTA Amino Acids, faa), DNA sequence (Fasta format).

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research. This web server makes analysis tools, genomic data, tutorial demonstrations, persistent workspaces, and publication services available to any scientist. Extensive user documentation applicable to any public or local Galaxy instance is available. It offers a huge variety of tools for analysis and file interconversion.

Sequence conversion ( Bioinf @ Bugaco ) - a huge suite of conversion tools.

Readseq, developed by D.G. Gilbert (Indiana University), reads and converts biosequences between a selection of common biological sequence formats, including EMBL, GenBank and FASTA.

EMBOSS Seqret reads and writes (returns) sequences. It is useful for a variety of tasks, including extracting sequences from databases, displaying sequences, reformatting sequences, producing the reverse complement of a sequence, extracting fragments of a sequence, sequence case conversion or any combination of the above functions.

Format Converter - This program takes as input a sequence or sequences (e.g., an alignment) in an unspecified format and converts the sequence(s) to a different user-specified format. Also converts *.gbk to *.gff3.

ApolloRNA Convert data - Transformation of TransTermHP, CRISPRfinder, MOSAIC, PatScan, DARN! (GFF), GenBank output data in GFF and GAME XML format data that can be read by Apollo.

GenBank Trans Extractor accepts a GenBank file as input and returns each of the protein translations described in the file in FASTA format. GenBank Trans Extractor should be used when you are more interested in the predicted protein translations of a DNA sequence than the DNA sequence itself. Part of the Sequence Manipulation Suite.

FeatureExtract 1.2L (light) Server - extracts sequence and feature annotation, such as intron/exon structure, from GenBank entries and other GenBank format files. ( Reference: R. Wernersson (2005) Nucleic Acids Res. 33(Web Server issue): W567-W569).

Sequence editor - converts DNA and RNA sequences. Generate antiparallel, complement and inverse sequences.

Format conversion - the input type (single sequence, set of sequences, alignment, tree, matrix, ...) and format are automatically recognized. Output: FASTA, NEXUS, PHYLIP, Clustal, EMBL, Newick, New Hampshire.

GenBank 2 Sequin ( P. Lehwark & S. Greiner, Max-Planck Institute for Molecular Plant Physiology, Germany ) - this extremely useful program is designed to convert revised GeSeq output into the Sequin format required for NCBI submission. Nonetheless, any custom GenBank file can be prepared for NCBI submission using GenBank 2 Sequin.

JaMBW ( European Molecular Biology Laboratory of Heidelberg, Germany) - Java based Molecular Biologist's Workbench. Select Chapter 1 for sequence format conversion (upper <-> lower case, T <-> U, reverse or complement sequence).

Nucleic Acid Sequence Massager (Allotron Biosensor Corporation) - in addition to removing spurious material (numbers, breaks, HTML, spaces), it changes the format (upper to lower case, complement, reverse, RNA to DNA, and triplets).

extractUpStreamDNA (A. Villegas, Public Health Ontario) - takes a GenBank flatfile (*.gbk) as input, parses it and, for every CDS that it finds, extracts a pre-determined length of upstream DNA (the length is an argument and includes 3 nt for the initiation codon). The output is an FFN file of these upstream DNA sequences. N.B. this ONLY works for prokaryotic sequences because it does not handle the Splits or Joins found in eukaryotic sequences. The data can then be analyzed with programs such as MEME. This program is temporarily unavailable online, though one can download it from here.

Convert GenBank to Fasta (G. Rocap, School of Oceanography, University of Washington, U.S.A. ) - Select a GenBank formatted file containing a feature table. Select whether to extract translated peptide sequences, the DNA sequence for each feature, or the entire DNA sequence of the whole record. If you choose "Peptide Sequence", your feature table must have "translation" sub-features.

FaBox (Palle Villesen Fredsted, Aarhus University, Denmark) - an online fasta sequence toolbox, including Fasta header editor, Fasta header replacer, Fasta sequence extractor, Fasta sequence subtractor, Fasta sequence joiner, Fasta dataset splitter/divider

FeatureExtract - this very useful service extracts sequence and feature annotation, such as intron/exon structure, from GenBank entries and other GenBank format files. ( Reference: R. Wernersson. 2005. Nucl. Acids Res. 33 ( Web Server issue): W567-W569). Also possible is extraction of 5' and 3' sequences.

SIGENAE Fasta Clean - SIGENAE means Information System of the AGENAE program. The AGENAE program (Analysis of Breeding Animals' Genome) is an Inra national program with the ambition to develop research in the domain of breeding animal genomics - pig, chicken, trout, cattle, rabbit, sheep.

SeqScrub - is a web application that cleans up FASTA file headers and appends information from external databases. ( Reference: Foley G et al. (2019) BioTechniques 67(2): 50-54).

PAL2NAL - is a program that converts a multiple sequence alignment of proteins and the corresponding DNA (or mRNA) sequences into a codon alignment. The program automatically assigns the corresponding codon sequence even if the input DNA sequence has mismatches with the input protein sequence, or contains UTRs, polyA tails. It can also deal with frame shifts in the input alignment, which is suitable for the analysis of pseudogenes. The resulting codon alignment can further be subjected to the calculation of synonymous (dS) and non-synonymous (dN) substitution rates. ( Reference: Suyama M et al. (2006) Nucl Acids Res 34: W609-W612).

Shuffle DNA and Sequence Randomizer permit one to randomize a sequence to compare with one's own.

Working with “big data”

The term “big data” is a bit nebulous. But it is certainly possible to create sequence files that are too big to be stored in RAM (Random Access Memory).

In these instances one needs to be able to read part of the file and “yield” records as the file is being processed. In other words the content of the whole file is never stored in RAM.

In the specific case of processing FASTA files one needs to be able to “yield” FASTA records as one works one’s way through the file. This means that one cannot create a function that creates a big list of FASTA records and returns that list. Rather, one needs to create a function that “yields” FASTA records as the FASTA file is being processed.

This can be achieved using the yield keyword.
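As a minimal sketch of that idea applied to FASTA files (the function name fasta_records is an assumption made for this illustration, and headers are returned without the leading ">"), a generator can hold just one record in memory at a time:

```python
def fasta_records(path):
    """Yield (header, sequence) tuples one record at a time.

    Only the record currently being assembled is held in memory.
    """
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    # A new header marks the end of the previous record.
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        # The last record has no following header, so emit it here.
        yield header, "".join(chunks)
```

A caller would loop over fasta_records("some_file.fasta") in a for loop; each record is produced only when the loop asks for it.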

To illustrate this imagine that one had a text file where the number of characters in each line was a key metric. Below is a function that takes the file name as an input parameter and returns a list of integers representing the number of characters in each line.
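The listing that belongs here did not survive in this copy; the following is a minimal reconstruction, assuming the trailing newline is not counted as a character. The name line_lengths is hypothetical, while data_collection matches the name used in the surrounding text.

```python
def line_lengths(file_name):
    """Return a list with the number of characters in each line."""
    data_collection = []
    with open(file_name) as handle:
        for line in handle:
            # Strip the newline so that only visible characters are counted.
            data_collection.append(len(line.rstrip("\n")))
    return data_collection
```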

The function above would run into trouble if the file was really big. This is because the data_collection list is stored in RAM. If the list grew larger than the memory available on the computer, calling the function would result in a MemoryError.

How come I don’t get a MemoryError when I open big files?

The open function does not actually read the file. It creates a “handle” to the file (a pointer to the start of the file).

The file handle has a read() function, which reads the entire content of the file into memory. If you called this function and the file was big enough you would run into a MemoryError. This is part of the reason this function is rarely used.

It is much more common to read a file one line at a time, for example with the readline() function, which returns a single line per call. This is why we can process big files. Note that the similarly named readlines() function reads every line into a list, so it has the same memory problem as read().

It is also possible to process a file line by line by placing the file handle in a for loop. The file handle then behaves as a so-called “iterator”, yielding one line at a time.

Below is a function that overcomes this issue by “yielding” the lengths of the lines as they are being processed, circumventing the need to store the whole data structure in RAM.
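The original listing is missing from this copy; a minimal sketch of such a generator, with the hypothetical name yield_line_lengths and the newline again left uncounted, might look like this:

```python
def yield_line_lengths(file_name):
    """Yield the number of characters in each line, one line at a time."""
    with open(file_name) as handle:
        for line in handle:
            # Nothing is accumulated: each length is handed to the caller
            # and can be discarded before the next line is read.
            yield len(line.rstrip("\n"))
```

Because the function contains a yield statement, calling it returns a generator object; no line is read until the caller starts iterating.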

How might this be used in practice? A practical application would be to write the results to file. This could be achieved using the code below.
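The code referred to is not preserved here. As a sketch under the same assumptions as before (the function names and the one-number-per-line output format are chosen for illustration), the generator can be consumed and written out without building a list:

```python
def yield_line_lengths(file_name):
    """Yield the number of characters in each line of a file."""
    with open(file_name) as handle:
        for line in handle:
            yield len(line.rstrip("\n"))


def write_line_lengths(input_name, output_name):
    """Write one line length per output line, streaming through the input."""
    with open(output_name, "w") as out:
        for length in yield_line_lengths(input_name):
            out.write(f"{length}\n")
```

At no point does either file need to fit in RAM: one input line is read, one number is written, and the loop moves on.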

Target Audience

Graduates, postgraduates and PIs working with or about to embark on analysis of data from next generation sequencing platforms (Illumina focus). A reference genome is required.

Prerequisites for attendance:

Basic familiarity with Linux environment and S, R, or Matlab. Must be able to complete and understand the following simple Linux and R tutorials before attending:

You will also require your own laptop computer with wireless internet capability. Minimum requirements: 1024x768 screen resolution, 1.5GHz CPU, 1GB RAM, recent versions of Windows, Mac OS X or Linux (most computers purchased in the past 3-4 years likely meet these requirements). If you do not have access to your own computer, you may borrow one from the CSHL. Please contact CSHL in advance to request a laptop.

Additional content can be found on the workshop page.


A robust Python module for fast random access to sequences from plain and gzipped FASTA/Q files

pyfastx is a lightweight Python C extension that gives users random access to sequences in plain and gzipped FASTA/Q files. This module aims to provide simple APIs for extracting sequences from FASTA files and reads from FASTQ files by identifier and index number. pyfastx builds indexes stored in a sqlite3 database file for random access, to avoid consuming excessive amounts of memory. In addition, pyfastx can parse standard (sequence is spread over multiple lines of the same length) and nonstandard (sequence is spread over one or more lines of different lengths) FASTA format. This module uses kseq.h, written by @attractivechaos in the klib project, to parse plain FASTA/Q files, and zran.c, written by @pauldmccarthy in the indexed_gzip project, to index gzipped files for random access.

This project was heavily inspired by @mdshw5's project pyfaidx and @brentp's project pyfasta.

  • Single file for the Python extension
  • Lightweight, memory efficient for parsing FASTA/Q file
  • Fast random access to sequences from gzipped FASTA/Q file
  • Read sequences from FASTA file line by line
  • Calculate N50 and L50 of sequences in FASTA file
  • Calculate GC content and nucleotides composition
  • Extract reverse, complement and antisense sequences
  • Excellent compatibility, support for parsing nonstandard FASTA file
  • Support for FASTQ quality score conversion
  • Provide command line interface for splitting FASTA/Q file

Currently, pyfastx supports Python 3.5, 3.6, 3.7, 3.8, 3.9. Make sure you have installed both pip and Python before starting.

You can install pyfastx from the Python Package Index (PyPI) with pip: pip install pyfastx

Pyfastx provides a simple and fast Python binding for kseq.h to iterate over sequences or reads in a FASTA/Q file. The FASTX object automatically detects the input sequence format (FASTA or FASTQ) and returns a different tuple for each.

When iterating over sequences with a FASTX object, a tuple (name, seq, comment) is returned; the comment is the content of the header line after the first whitespace character.

When iterating over reads with a FASTX object, a tuple (name, seq, qual, comment) is returned; the comment is the content of the header line after the first whitespace character.

Read a plain or gzipped FASTA file and build an index, which supports random access to the FASTA file.

Building the index may take some time, depending on the size of the FASTA file. Once the index is built, you can randomly access any sequence in the FASTA file. The index file is reused to save time the next time you read sequences from the same FASTA file.

The fastest way to iterate over a plain or gzipped FASTA file is without building an index; the iteration then returns a tuple containing the name and sequence.

You can also iterate sequence object from FASTA object like this:

Iteration with build_index=True (default) returns sequence objects, which allow you to access the attributes of each sequence. New in pyfastx 0.6.3.

Calculate the assembly N50 and L50; the result is returned as a tuple (N50, L50).
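To make the two metrics concrete, here is a plain-Python sketch of the usual definitions; it is not pyfastx's own implementation, and the function name n50_l50 is made up for this illustration. With sequences sorted from longest to shortest, N50 is the length of the sequence at which the running total first reaches half of the total length, and L50 is the number of sequences needed to get there.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50; 'count' sequences were needed (L50).
            return length, count


print(n50_l50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints (8, 3)
```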

Get counts of sequences whose length >= specified length

Subsequences can be retrieved from FASTA file by using a list of [start, end] coordinates
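The idea can be illustrated without the library. The sketch below assumes 1-based, inclusive [start, end] coordinates and joins the pieces in the order given; both choices are assumptions of this example, and fetch_intervals is a hypothetical helper, not a pyfastx function.

```python
def fetch_intervals(sequence, intervals):
    """Return the concatenated subsequences for 1-based inclusive intervals."""
    pieces = []
    for start, end in intervals:
        if start < 1 or end > len(sequence) or start > end:
            raise ValueError(f"invalid interval: ({start}, {end})")
        # Convert the 1-based inclusive interval to a Python slice.
        pieces.append(sequence[start - 1:end])
    return "".join(pieces)


print(fetch_intervals("ABCDEFGHIJ", [(1, 3), (8, 10)]))  # prints ABCHIJ
```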

Sometimes your FASTA file will have long headers which contain multiple identifiers and a description, for example, ">JZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence". In this case, both "JZ822577.1" and "contig1" can be used as the identifier. You can specify a key function to select one of them as the identifier.
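A key function is simply a function that takes the header text and returns the identifier to use. The sketch below shows two such functions applied to the example header, with the leading ">" already removed (an assumption of this example); the function names are made up. In pyfastx a function like these would be supplied when opening the file, through an argument that is believed to be named key_func; check the pyfastx documentation for the exact spelling.

```python
def first_token(header):
    """Use the first whitespace-separated word as the identifier."""
    return header.split()[0]


def second_token(header):
    """Use the second whitespace-separated word as the identifier."""
    return header.split()[1]


header = ("JZ822577.1 contig1 cDNA library of flower petals in tree peony "
          "by suppression subtractive hybridization Paeonia suffruticosa "
          "cDNA, mRNA sequence")
print(first_token(header))   # prints JZ822577.1
print(second_token(header))  # prints contig1
```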

Retrieving genome sequence data using SeqinR

Instead of going to the NCBI website to retrieve sequence data from the NCBI database, you can retrieve sequence data from NCBI directly from R, by using the SeqinR R package.

For example, you learnt above how to retrieve the DEN-1 Dengue virus genome sequence, which has NCBI accession NC_001477, from the NCBI website. To retrieve a sequence with a particular NCBI accession, you can use R function “getncbiseq()” below, which you will first need to copy and paste into R:

Once you have copied and pasted the function getncbiseq() into R, you can use it to retrieve a sequence from the NCBI Nucleotide database, such as the sequence for the DEN-1 Dengue virus (accession NC_001477):

The variable dengueseq is a vector containing the nucleotide sequence. Each element of the vector contains one nucleotide of the sequence. Therefore, to print out a certain subsequence of the sequence, we just need to type the name of the vector dengueseq followed by the square brackets containing the indices for those nucleotides. For example, the following command prints out the first 50 nucleotides of the DEN-1 Dengue virus genome sequence:

Note that dengueseq[1:50] refers to the elements of the vector dengueseq with indices from 1-50. These elements contain the first 50 nucleotides of the DEN-1 Dengue virus sequence.

DNA Sequence Quality - Phred - provides base calling, chromatogram display and high quality sequence region evaluation and presentation for up to five sequences simultaneously.

Sequence assembly - you don't need your own contig assembly program when you can use:

EGassembler - aligns and merges sequence fragments resulting from shotgun sequencing or gene transcripts (EST) fragments in order to reconstruct the original segment or gene ( Reference: A. Masoudi-Nejad et al. 2006. Nucl. Acids Res. 34: W459-462).

CGE Assembler 1.2 - assembles Illumina, 454, SOLiD and Ion Torrent data ( Reference: Larsen MV, et al. J. Clin. Microbiol. 2012. 50(4): 1355-1361).
CGE SPAdes 3.9 - assembles Illumina and Ion Torrent data ( Reference: S. Nurk et al. Research in Computational Molecular Biology: pp 158-170).

CAP3 (PBIL, France ), ( Reference: Huang,X. & Madan A. 1999. Genome Res. 9: 868-877), and here.
CAP EST Assembler (Istituto FIRC di Oncologia Molecolare, Italy) - Maximum sequence length for each sequence is 30 kb - Maximum number of sequences 10 kb

MicroScope web site (hosted at Genoscope), provides an environment for expert annotation and comparative genomics. Genome project: Annotation and comparative analyses of finished or draft genome sequences. For pre-annotated sequences, they only integrate annotations from NCBI RefSeq complete genome section. Metagenome project: Annotation and comparative analyses of assembled metagenomic sequences. Currently, they are able to integrate datasets below 20 Mb of contigs per bin.

NanoPipe - was developed in consideration of the specifics of the MinION sequencing technologies, providing accordingly adjusted alignment parameters. The range of the target species/sequences for the alignment is not limited, and the descriptive usage page of NanoPipe helps a user to succeed with NanoPipe analysis. The results contain alignment statistics, consensus sequence, polymorphisms data, and visualization of the alignment. ( Reference: Shabardina V et al. (2019) Gigascience 8(2). pii: giy169).

COV2HTML: a visualization and analysis tool of bacterial next generation sequencing (NGS) data for postgenomics life scientists - allows performing both coverage visualization and analysis of NGS alignments performed on prokaryotic organisms (bacteria and phages). It combines two processes: a tool that converts the huge NGS mapping or coverage files into light specific coverage files containing information on genetic elements and a visualization interface allowing a real-time analysis of data with optional integration of statistical results. ( Reference: Monot M. et al. 2014. OMICS 18(3): 184-95).

DCA Divide-and-Conquer Multiple Sequence Alignment ( Universitat Bielefeld, Germany) - is a program for producing fast, high quality simultaneous multiple sequence alignments of amino acid, RNA, or DNA sequences. ( Reference: Brinkmann, G. et al. Mathematical Programming 79: 71-97, 1997).

PhageTerm - is a fast and user-friendly software package which can be used to determine bacteriophage termini and packaging mode from randomly fragmented NGS data. It is part of the Galaxy package, and can be found in the "NGS: Mapping" directory. Ideal if you want an automated answer. ( Reference: Garneau JR, et al. 2017. Sci Rep. 7(1):8292).

QUAST - a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with and without a reference genome. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. ( Reference: A. Gurevich et al. 2013. Bioinformatics, 29(8): 1072-1075). N.B. This server is offline as of April 2020, but there are hopes that it will be back online (see here for software downloads).

Sequencing errors: - if your DNA sequence doesn't match the expected protein sequence you can check for errors at GeneWise (EMBL-EBI) which compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors. Other programs include:

FrameD ( Reference: T. Schiex et al. 2003. Nucl. Acids Res. 31: 3738-3741)
AMIGene - annotation of microbial genes ( Reference: Bocs S et al. (2003) Nucleic Acids Res. 31(13): 3723-3726).
path :: protein back-translation and alignment - addresses the problem of finding distant protein homologies where the divergence is the result of frameshift mutations and substitutions. Given two input protein sequences, the method implicitly aligns all the possible pairs of DNA sequences that encode them, by manipulating memory-efficient graph representations of the complete set of putative DNA sequences for each protein. ( Reference: Gîrdea M et al. 2010. Algorithms for Molecular Biology 5:) (Dr. Joseba Bikandi & co-workers, Faculty of Pharmacy, in the University of the Basque Country) - allows in silico experiments including theoretical PCR amplification, AFLP-PCR , restriction analysis and pulsed field gel electrophoresis [PFGE] with bacterial & archael genomes found in the public database.

NCBI Prokaryotic Genomes Automatic Annotation Pipeline. This will completely annotate your bacterial genome and provide you with a Sequin submission file. N.B. an NCBI Phage Automatic Annotation Pipeline is in development.

RAST (Rapid Annotation using Subsystem Technology) is a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. Requires registration. ( Reference: Aziz, RK et al. 2008. BMC Genomics 9:75.).

BASys Bacterial Annotation Tool - this incredible tool supports automated, in-depth annotation of bacterial genomic sequences. It accepts raw DNA sequence data and an optional list of gene identification information (Glimmer) and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. ( Reference: G.H. Van Domselaar et al. 2005. Nucl. Acids Res. 33(Web Server issue): W455-W459).

MicroScope - (CEA, Institut de Génomique - Genoscope, France) is a microbial genome annotation & analysis platform which provides access to a wide range of tools including COG analysis, comparative genomics . ( Reference: Vallenet D et al. (2017) Nucleic Acids Res. 45(D1): D517-D528). Requires registration.

MAKER Web Annotation Service (MWAS) is an easily configurable web-accessible genome annotation pipeline. Its purpose is to allow research groups with small to intermediate amounts of eukaryotic and prokaryotic genome sequence (i.e. BAC clones, small whole genomes, preliminary sequencing data, etc.) to independently annotate and analyse their data and produce output that can be loaded into a genome database. ( Reference: Holt, C. & Yandell, M. 2011. BMC Bioinformatics 12:491).

MITOS - a pipeline designed to provide consistent and high quality de novo annotation of Metazoan mitochondrial genome sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. ( Reference: M. Bernt et al. 2013. Molecular Phylogenetics & Evolution 69:313-319).

GenSAS - Genome Sequence Annotation Server - provides a one-stop website with a single graphical interface for running multiple structural and functional annotation tools, enabling visualization and manual curation of genome sequences. Users can upload sequences into their account and run gene prediction programs, protein homology searches, map ESTs, identify repeats, ORFs and SSRs with custom parameter settings. Each analysis is displayed on separate tracks of the graphical interface with custom editable tracks to select final annotation of features and create gff3 files for upload to genome browsers such as GBrowse. Additional programs can be easily added using this Drupal based software.

Viral Genome ORF Reader (VIGOR) - supports high throughput feature prediction and annotation. VIGOR employs an extrinsic strategy and boasts sensitivity and specificity greater than 98% for the RNA viral genomes we tested. Genome-specific features identified by VIGOR include frameshifts, ribosomal slippage, RNA editing, stop codon read-through, overlapping genes, embedded genes, and mature peptide cleavage sites. Genotyping capability for influenza and rotavirus is built into the program.
( Reference: S. Wang et al. 2010. BMC Bioinformatics 11:451)

FLAN (FLu ANnotation) is an NCBI web server for genome annotation of user-provided influenza A virus or influenza B virus sequences. It can validate and predict protein sequences encoded by an input flu sequence. ( Reference: Y. Bao et al. 2007. Nucleic Acids Res. 35(Web Server issue): W280-W284).

CpGAVAS ( Chloroplast Genome Annotation, Visualization, Analysis and GenBank Submission Tool) - allows accurate chloroplast genome annotation, the generation of circular maps, the provision of useful analysis results of the annotated genome, the creation of files that can be submitted to GenBank directly. ( Reference: C. Liu et al. 2012. BMC Genomics 13: 715)

Genome Annotation Transfer Utility (GATU) annotates a genome based on a very closely related reference genome. The proteins/mature peptides of the reference genome are BLASTed against the genome to be annotated in order to find the genes/mature peptides in the genome to be annotated ( Reference: T. Tcherepanov et al. 2006. BMC Genomics 7:150.)

BioGPS (The Scripps Research Institute, USA) - is a one-stop gene annotation portal that emphasizes user-customizability and community-extensibility. It is a customizable gene annotation portal and a complete resource for learning about gene and protein function.

BAGEL (Groningen Biomolecular Sciences and Biotechnology Institute, Haren, the Netherlands) - will determine from an existing or unsubmitted GenBank file the presence of bacteriocins based on a database containing information on known bacteriocins and adjacent genes involved in bacteriocin activity. An alternative site for bacteriocins is BACTIBASE which is a data repository of bacteriocin natural antimicrobial peptides. See LABioicin if you are interested in the topic of Lactic Acid Bacteria (LAB) and their bacteriocins.

MICheck (MIcrobial genome Checker) - enables rapid verification of sets of annotated genes and frameshifts in previously published bacterial genomes, or genomes for which the user has a *.gbk file. This tool can be seen as a preliminary step before the functional re-annotation step to check quickly for missing or wrongly annotated genes. It worked nicely with phage genomes from 43-135kb. ( Reference: S. Cruveiller et al. 2005. Nucl. Acids Res. 33: W471- W479).

WebGeSTer - Genome Scanner for Terminators - my favourite terminator search program is finally web enabled. Please note that if you want to analyze data from a *.gbk file you need to use their conversion program "GenBank2GeSTer" first. A complete description of each terminator including a diagram is produced by this program. This site is linked to an extensive database of transcriptional terminators in bacterial genomes (WebGeSTer DB) ( Reference: Mitra A. et al. 2011. Nucl. Acids Res. 39(Database issue):D129-35).

RibEx: Riboswitch Explorer - scans <40kb DNA for potential genes (which are linked to BLASTP) and several hundred regulatory elements, including riboswitches. If you click on "search for attenuators" it finds terminators and antiterminators. It presents the calculated genes and permits BLAST analysis at NCBI ( Reference: C. Abreu-Goodger & E. Merino. 2005. Nucl. Acids Res. 33: W690-W692).

tRNAs: tRNAscan-SE - is incredibly sensitive & also provides secondary structure diagrams of the tRNA molecules (Reference: Schattner, P. et al. 2005. Nucleic Acids Res. 33: W686-689). Alternatively use ARAGORN ( Reference: Laslett, D. & Canback. 2004. Nucleic Acids Research 32:11-16).
Test sequences.

LTR_Finder - is an efficient program for finding full-length LTR retrotransposons in genome sequences. The size of the input file is now limited to 50MB ( Reference: Z. Xu & H. Wang. 2007. Nucl. Acids Res. 35(Web Server issue): W265-W268).
RTAnalyzer - finds retrotransposons and detects L1 retrotransposition signatures ( Reference: J-F. Lucier et al. 2007. Nucl. Acids Res. 35(Web Server issue):W269-W274).

MG-RAST (Metagenome Rapid Annotation using Subsystem Technology) is a fully-automated service for annotating metagenome samples. It provides annotation of sequence fragments, their phylogenetic classification and an initial metabolic reconstruction. The service also provides means for comparing phylogenetic classifications and metabolic reconstructions of metagenomes ( Reference: F. Meyer et al. 2008. BMC Bioinformatics 9: 386).

The following four programs can be used to predict phage proteins:

PVPred ( Reference: Ding H et al (2014) Mol Biosyst 10(8): 2229-2235).
PHPred ( Reference: Ding H (2016) Computers Biol Med 71: 156-161).
PVP-SVM ( Reference: Manavalan B et al. (2018) Front Microbiol 9: 476).
PVPred-SCM ( Reference: Charoenkwan P et al. (2020) Cells 9(2) pii: E353).

Chromosome replication origin:

Ori-Finder and Ori-Finder 2 - are useful platforms for the identification and analysis of replication origins (oriCs) in the bacterial and archaeal genomes, respectively. ( Reference: Luo H et al. (2019) Brief Bioinform 20(4): 1114-1124). Please note that these tools have been used to create DoriC - a database of replication origins in prokaryotic genomes including chromosomes and plasmids. (Reference: Luo H & Gao F (2019) Nucleic Acids Res. 47(D1): D74-D77).

One of the problems with GenBank is that scientists do not update their submission data nor correct errors. In part this is due to laziness but is also due to the fact that GenBank is, in most cases, unwilling to accept a new version of the Sequin file. Tbl2asn is a command-line program that automates the creation of sequence records for submission to GenBank but, from my perspective, it is not easy to use. The only online program is GenBank 2 Sequin which generates not only a Sequin file (*.sqn), but also a five-column "Annotation Table" (*.tbl). This together with the fasta-formatted DNA sequence can be submitted to GenBank by Email ( [email protected] ). In its absence I recommend the perl script available for downloading here.

PlasmidFinder 1.3 - identifies plasmids in total or partial sequenced isolates of bacteria. The method uses BLAST for identification of replicons of plasmids belonging to the major incompatibility (Inc) groups of Enterobacteriaceae. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. See also pMLST ( Reference: Carattoli A et al. 2014. Antimicrob. Agents Chemother. 58: 3895-903)

PHACTS can be used to quickly classify the lifestyle of a phage (temperate or lytic). All that is needed is the proteome of the phage to be classified and PHACTS will predict the lifestyle of that phage and return a confidence value for that prediction. ( Reference: K. McNair et al. 2012. Bioinformatics 28: 614-618).

SpeciesFinder 1.0 (Danish Technical University) - predicts the species of bacteria from pre-assembled, complete or partial genomes, and short sequence reads. The prediction is based on the 16S rRNA gene.

CSI Phylogeny 1.1 (Call SNPs & Infer Phylogeny) - calls SNPs, filters the SNPs, does site validation and infers a phylogeny based on the concatenated alignment of the high quality SNPs. ( Reference: Kaas, R.S. et al. PLoS ONE 2014 9: e104984.)

KmerFinder 2.0 - predicts the species of bacteria from pre-assembled, complete or partial genomes, and short sequence reads. The prediction is based on the number of co-occurring k-mers (substrings of k nucleotides in DNA sequence data, in this case 16-mers) between the genomes of reference bacteria in a database and the genome provided by the user. ( Reference: Hasman H et al. 2013. J Clin Microbiol. 52:139-146)
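The notion of co-occurring k-mers can be sketched in a few lines. This is only an illustration of the counting idea, not KmerFinder's actual scoring scheme; the function names and the toy sequences are assumptions, and a small k is used in the test call so the example stays readable.

```python
def kmers(sequence, k=16):
    """Return the set of all substrings of length k in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_counts(query, references, k=16):
    """Count, for each reference, the distinct k-mers it shares with the query."""
    query_kmers = kmers(query, k)
    return {name: len(query_kmers & kmers(reference, k))
            for name, reference in references.items()}


# Toy sequences for illustration; real genomes would use k=16.
references = {"species_A": "TTACGTT", "species_B": "GGGGGG"}
print(shared_kmer_counts("ACGTAC", references, k=3))
```

The reference with the highest count would be the best candidate under this simplified scheme.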

VIOLIN: Vaccine Investigation and Online Information Network - allows easy curation, comparison and analysis of vaccine-related research data across various human pathogens VIOLIN is expected to become a centralized source of vaccine information and to provide investigators in basic and clinical sciences with curated data and bioinformatics tools for vaccine research and development. VBLAST: Customized BLAST Search for Vaccine Research allows various search strategies against against 77 genomes of 34 pathogens. ( Reference: He, Y. et al. 2014. Nucleic Acids Res. 42 (Database issue):D1124-32).

MLST 1.8 (MultiLocus Sequence Typing) - currently only works with assembled genomes and contigs ( Reference: Larsen MV et al. 2012. J. Clin. Microbiol. 50: 1355-1361).

ECFfinder - extracytoplasmic function (ECF) sigma factors - the largest group of alternative sigma factors - represent the third fundamental mechanism of bacterial signal transduction, with about six such regulators on average per bacterial genome. Together with their cognate anti-sigma factors, they represent a highly modular design that primarily facilitates transmembrane signal transduction. ( Reference: Staron A et al. (2009) Mol Microbiol 74(3): 557-581).

BacWGSTdb - is designed for monitoring the emergence and outbreak of important bacterial pathogens. In detail, it serves two particular purposes: Typing & Tracking. The former refers to an integrated genotyping at both the traditional multi-locus sequence typing (MLST) and whole-genome sequencing typing (WGST) level. The latter refers to source tracking (i.e., finding highly similar isolates) according to the typing result and isolate information stored in BacWGSTdb. ( Reference: Z. Ruan & Y. Feng, Nucleic Acids Research. 2016 44(D1): D682-D687).

SISTR: Salmonella In Silico Typing Resource - (Public Health Agency of Canada, Laboratory for Foodborne Zoonoses) is a bioinformatics resource for rapidly interpreting in silico data for multiple Salmonella subtyping methods from draft bacterial genome assemblies. In addition to performing serovar prediction by genoserotyping, this resource integrates sequence-based typing analyses for: Multi-Locus Sequence Typing (MLST), ribosomal MLST (rMLST), and core genome MLST (cgMLST). Google Chrome is recommended; Firefox is also supported, but the SVG visualizations within this app may not be as responsive. Internet Explorer is unsupported.

FSFinder2 (Frameshift Signal Finder) - Programmed ribosomal frameshifting is involved in the expression of certain genes from a wide range of organisms such as viruses, bacteria and eukaryotes, including humans. In programmed frameshifting, the ribosome switches to an alternative frame at a specific site in response to a special signal in a messenger RNA. Programmed frameshifting plays a role in viral particle morphogenesis, autogenous control, and alternative enzymatic activities. The most common frameshift is a -1 frameshift, in which the ribosome shifts a single nucleotide in the upstream direction. The major elements of -1 frameshifting consist of a slippery site, where the ribosome changes reading frames, and a stimulatory RNA structure such as a pseudoknot or stem-loop located a few nucleotides downstream. +1 frameshifts are much less common than -1 frameshifts but are observed in diverse organisms.
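As a rough illustration of scanning for candidate slippery sites, the sketch below searches a sequence for the commonly cited -1 slippery heptamer X XXY YYZ. That pattern is an assumption brought in for this example (it is not stated above), the function name `find_slippery_sites` is hypothetical, and the sketch does not reproduce FSFinder2's model, which also considers the downstream stimulatory structure.

```python
import re

# Assumed heptamer X XXY YYZ, written for a DNA alphabet:
# three identical bases, then AAA or TTT, then any base except G.
# The lookahead lets overlapping candidates be reported.
SLIPPERY = re.compile(r"(?=(([ACGT])\2\2(AAA|TTT)[ACT]))")


def find_slippery_sites(seq):
    """Return (0-based start, heptamer) for each candidate slippery site."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(seq)]


if __name__ == "__main__":
    print(find_slippery_sites("GGTTTAAACGG"))
```

A hit only marks a candidate; whether frameshifting actually occurs depends on context that this pattern match ignores.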

InBase, The Intein Database and Registry - Protein splicing is defined as the excision of an intervening protein sequence (the INTEIN) from a protein precursor and the concomitant ligation of the flanking protein fragments (the EXTEINS) to form a mature extein host protein and the free intein (Perler 1994). Protein splicing results in a native peptide bond between the ligated exteins. This is a database site which permits BLAST analysis. ( Reference: Perler, F.B. 2002. Nucleic Acids Res. 30: 383-384).

P2RP (Predicted Prokaryotic Regulatory Proteins) - users can input amino acid or genomic DNA sequences, and predicted proteins therein are scanned for the possession of DNA-binding domains and/or two-component system domains. RPs identified in this manner are categorised into families, unambiguously annotated. ( Reference: Barakat M, et al. 2013. BMC Genomics 14:269).

P2CS (Prokaryotic 2-Component Systems) is a comprehensive resource for the analysis of Prokaryotic Two-Component Systems (TCSs). TCSs are comprised of a receptor histidine kinase (HK) and a partner response regulator (RR) and control important prokaryotic behaviors. It can be searched using BLASTP. ( Reference: P. Ortet et al. 2015. Nucl. Acids Res. 43 (D1): D536-D541).

COG analysis - Clusters of Orthologous Groups - COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages and likely corresponds to an ancient conserved domain (CloVR) . Sites which offer this analysis include:

WebMGA ( Reference: S. Wu et al. 2011. BMC Genomics 12:444), RAST ( Reference: Aziz RK et al. 2008. BMC Genomics 9:75), BASys (Bacterial Annotation System; Reference: Van Domselaar GH et al. 2005. Nucleic Acids Res. 33(Web Server issue):W455-459) and JGI IMG (Integrated Microbial Genomes; Reference: Markowitz VM et al. 2014. Nucl. Acids Res. 42: D560-D567).

Other sites:

EggNOG - A database of orthologous groups and functional annotation that derives Nonsupervised Orthologous Groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. ( Reference: Powell S et al. 2014. Nucleic Acids Res. 42 (D1): D231-D239).

OrthoMCL - is another algorithm for grouping proteins into ortholog groups based on their sequence similarity. The process usually takes between 6 and 72 hours. ( Reference: Fischer S et al. 2011. Curr Protoc Bioinformatics Chapter 6:Unit 6.12.1-19).

KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST or GHOST comparisons against the manually curated KEGG GENES database. The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways. ( Reference: Moriya Y et al. 2007. Nucleic Acids Res. 35(Web Server issue):W182-185).

ResFinder (Acquired antimicrobial resistance gene finder) - uses BLAST for identification of acquired antimicrobial resistance genes in whole-genome data. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. Tested with 1411 different resistance genes with 100% identity. ( Reference: Zankari E et al. 2012. J Antimicrob Chemother. 67:2640-2644)

ARG-ANNOT (Antibiotic Resistance Gene-ANNOTation) is a new tool that was created to detect existing and putative new antibiotic resistance (AR) genes in bacterial genomes. ARG-ANNOT uses a local BLAST program in Bio-Edit software that allows the user to analyze sequences without a web interface ( Reference: Gupta, S.K. et al. 2014. Antimicrob Agents Chemother. 58: 212-220).

CARD (The Comprehensive Antibiotic Resistance Database) - a rigorously curated collection of known resistance determinants and associated antibiotics, organized by the Antibiotic Resistance Ontology (ARO) and AMR gene detection models ( Reference: Jia, B. et al. 2017. Nucleic Acids Research, 45: D566-573).

MEGARes - is a hand-curated antimicrobial resistance database and annotation structure that provides a foundation for the development of high throughput acyclical classifiers and hierarchical statistical analysis of big data ( Reference: Lakin, S.N. et al. 2017. Nucleic Acids Research, 45: D574-D580).

BacMet (Antibacterial Biocide & Metal Resistance Genes Database) - a database of biocide and metal resistance genes with highly reliable content. In BacMet version 1.1, the experimentally confirmed database contains 704 resistance genes, whereas the predicted database contains 40,556 resistance genes ( Reference: Pal, C. et al. 2014. Nucleic Acids Research, 42: D737-743 ) .

Specialized annotation - CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats):

CRISPRfinder - enables the easy detection of CRISPRs in locally-produced data and consultation of CRISPRs present in the database. It also gives information on the presence of CRISPR-associated (cas) genes when they have been annotated as such. ( Reference: I. Grissa et al. 2007. Nucl. Acids Res. 35 (Web Server issue): W52-W57).

CRISPRmap - provides a quick and detailed insight into repeat conservation and diversity of both bacterial and archaeal systems. It comprises the largest dataset of CRISPRs to date and enables comprehensive independent clustering analyses to determine conserved sequence families, potential structure motifs for endoribonucleases, and evolutionary relationships. ( Reference: S.J. Lange et al. 2013. Nucleic Acids Research, 41: 8034-8044).

CRISPI: a CRISPR Interactive database - includes a complete repertory of CRISPR-associated genes (CAS). A user-friendly web interface with many graphical tools and functions allows users to extract results, find CRISPR in personal sequences or calculate sequence similarity with spacers. ( Reference: Rousseau C et al. 2009. Bioinformatics. 25: 3317-3318).

CRISPRTarget - predicts the most likely targets of CRISPR RNAs. This can be used to discover targets in newly sequenced genomic or metagenomic data. ( Reference: Biswas A et al. 2013. RNA Biol. 10:817-827).

CRISPy-web - is an easy to use web tool based on CRISPy to design sgRNAs for any user-provided microbial genome. CRISPy-web allows researchers to interactively select a region of their genome of interest to scan for possible sgRNAs. After checks for potential off-target matches, the resulting sgRNA sequences are displayed graphically and can be exported to text files. ( Reference: K. Blin et al. 2016. Synthetic and Systems Biotechnology 1(2): 118-121).

Specialized annotation - virulence determinants: This is of particular interest to those working on bacteriophages for therapy

VirulenceFinder (Danish Technical University) - identification of virulence genes. The method uses BLAST for identification of known virulence genes in Escherichia coli. The method is being extended to also include virulence genes for Enterococcus and Staphylococcus aureus. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms.

ClanTox: a classifier of short animal toxins - predicts whether each sequence is toxin-like and provides a ranked list of positively predicted candidates according to statistical confidence. For each protein, additional information is presented including the presence of a signal peptide, the number of cysteine residues and the associated functional annotations. ( Reference: G. Naamati et al. 2009. Nucleic Acids Res. 37(Web Server issue): W363-W368).

t3db the Toxin and Toxin Target Database - combines detailed toxin data with comprehensive toxin target information. The database currently houses 3,053 toxins which are linked to 1,670 corresponding toxin target records. Each toxin record (ToxCard) contains over 50 data fields and holds information such as chemical properties and descriptors, toxicity values, molecular and cellular interactions, and medical information. ( Reference: Lim E et al. 2010. Nucleic Acids Res. 38(Database issue): D781-786).

TAfinder 2.0 - is a web-based tool to identify Type II toxin-antitoxin loci in bacterial genome ( Reference: Xie Y et al. (2018) Nucleic Acids Res. 46(D1): D749-D753 ).

DBETH Database of Bacterial ExoToxins for Humans is a database of sequences, structures, interaction networks and analytical results for 229 exotoxins, from 26 different human pathogenic bacterial genus. All toxins are classified into 24 different Toxin classes. The aim of DBETH is to provide a comprehensive database for human pathogenic bacterial exotoxins. ( Reference: Chakraborty A et al. 2012. Nucleic Acids Res. 40(Database issue): D615-620).

VFDB - is an integrated and comprehensive database of virulence factors for bacterial pathogens (also including Chlamydia and Mycoplasma). ( Reference : L.H. Chen et al. 2012. Nucleic Acids Res. 40(Database issue): D641-D645).

PAIDB (Pathogenicity Island Database) - Pathogenicity islands (PAIs) and resistance islands (REIs) are key to the evolution of pathogens and appear to play complementary roles in the process of bacterial infection. While PAIs promote disease development, REIs give a fitness advantage to the host against multiple antimicrobial agents. An ancillary program, PAI Finder, identifies PAI-like regions or REI-like regions in a multi-sequence query. ( Reference : S.H Yoon et al. 2015. Nucl. Acids Res. 43 (D1): D624-D630).

IslandViewer - includes a new interactive genome visualization tool, IslandPlot, and expanded virulence factor, antimicrobial resistance gene, and pathogen-associated gene annotations, as well as homologs of these genes in closely related genomes. Notably, incomplete genomes are accepted as input in IslandViewer 3, though they strongly urge users to use complete genomes whenever possible. ( Reference : B.K. Dhillon et al. 2015. Nucl. Acids Res. 43 (W1): W104-W108).

Gypsy Database - an open editable database about the evolutionary relationship of viruses, mobile genetic elements (MGEs Ty3/Gypsy, Retroviridae, Ty1/Copia and Bel/Pao LTR retroelements and the Caulimoviridae pararetroviruses of plants) and other genomic repeats. Equipped for BLAST and HMM searches. ( Reference : Llorens, C et al. 2011. Nucl. Acids Res. 39(suppl 1): D70-D74).

PanDaTox (Pan Genomic Database for Genomic Elements Toxic to Bacteria) - is a database of genes and intergenic regions that are unclonable in E. coli, to aid in the discovery of new antibiotics and biotechnologically beneficial functional genes. It is also designed to improve the efficiency of metabolic engineering. BLAST Search feature included. ( Reference : Amitai G & Sorek R. 2012. Bioengineered, 3: 218-221.)

PathogenFinder (predicts pathogenic potential) - Based on complete genomes from 513 bacteria annotated as human non-pathogens and 372 bacteria annotated as human pathogens, a database of protein families that are mainly associated with either non-pathogens or pathogens has been created. This database is then used for predicting the pathogenic potential of bacteria. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. ( Reference: Cosentino S et al. 2013. PLoS ONE 8: e77302)

VirulentPred - is a SVM based method to predict bacterial virulent proteins sequences, which can be used to screen virulent proteins in proteomes. Together with experimentally verified virulent proteins, several putative, non annotated and hypothetical protein sequences have been predicted to be high scoring virulent proteins by the prediction method. ( Reference: Garg A & Gupta G. 2008. BMC Bioinformatics 9: 62).

The Type III secretion system (T3SS) is an essential mechanism for host-pathogen interaction in the infection process. The proteins secreted through the T3SS machinery of many Gram-negative bacteria are known as T3SS effectors (T3SEs). These can either be localized subcellularly in the host, or be part of the needle tip of the T3SS that interacts directly with the host membrane to bring other effectors into the target cell. T3SEdb represents such an effort to assemble a comprehensive database of all experimentally determined and putative T3SEs into a web-accessible site. BLAST search is available. ( Reference: Tay DM et al. 2010. BMC Bioinformatics. 11 Suppl 7:S4).

Effective (University of Vienna, Austria & Technical University of Munich, Germany) - Bacterial protein secretion is the key virulence mechanism of symbiotic and pathogenic bacteria. Thereby effector proteins are transported from the bacterial cytosol into the extracellular medium or directly into the eukaryotic host cell. The Effective portal provides precalculated predictions on bacterial effectors in all publicly available pathogenic and symbiotic genomes as well as the possibility for users to predict effectors in their own protein sequence data.

SIEVE Server is a public web tool for prediction of type III secreted effectors. The SIEVE Server scores potential secreted effectors from genomes of bacterial pathogens with type III secretion systems using a model learned from known secreted proteins. The SIEVE Server requires only protein sequences of proteins to be screened and returns a conservative probability that each input protein is a type III secreted effector. ( Reference: McDermott JE et al. 2011. Infect Immun. 79:23-32).

T3SE - Type III secretion system effector prediction ( Reference: Löwer M, & Schneider G. 2009. PLoS One. 4:e5917. Erratum in: PLoS One. 2009 4(7)).

Phage_Finder - was created to identify prophage regions in completed bacterial genomes. Using a test dataset of 42 bacterial genomes whose prophages have been manually identified, Phage_Finder found 91% of the regions, resulting in 7% false positive and 9% false negative prophages. A search of 302 complete bacterial genomes predicted 403 putative prophage regions, accounting for 2.7% of the total bacterial DNA. Analysis of the 285 putative attachment sites revealed tRNAs are targets for integration slightly more frequently (33%) than intergenic (31%) or intragenic (28%) regions, while tmRNAs were targeted in 8% of the regions. ( Reference: D.E. Fouts. 2006. Nucleic Acids Res. 34 : 5839-5851).

Prophinder - is the tool used for detecting prophages in bacterial genomes. Select a GenBank formatted file.

PHAST (PHAge Search Tool) - is designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage "cornerstone" feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and GenBank files, provide richly annotated tables on prophage features and prophage "quality" and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. Furthermore, tests indicate that PHAST is as accurate as, or slightly more accurate than, all available phage finding tools, with sensitivity of 85.4% and positive predictive value of 94.2%. ( Reference: Zhou, Y. et al. 2011. Nucl. Acids Res. 39(suppl 2): W347-W352).
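For readers comparing prophage finders, the two benchmark figures quoted above follow the standard definitions of sensitivity and positive predictive value. A minimal sketch, with hypothetical function names and made-up counts that are not PHAST's benchmark data:

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real prophages that were detected: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos, false_pos):
    """Fraction of predictions that are real prophages: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    tp, fp, fn = 90, 5, 10
    print(f"sensitivity = {sensitivity(tp, fn):.1%}")
    print(f"PPV         = {positive_predictive_value(tp, fp):.1%}")
```

A tool can score high on one measure and low on the other, so both figures are needed when comparing tools.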

PHASTER PHAge Search Tool Enhanced Release - is a significant upgrade to PHAST for the rapid identification and annotation of prophage sequences within bacterial genomes and plasmids. Numerous software improvements and significant hardware enhancements have now made PHASTER faster, more efficient, more visually appealing and much more user friendly. In particular, PHASTER is now 4.3X faster than PHAST. ( Reference: D. Arndt et al. Nucleic Acids Res. 2016 44(W1):W16-21).

Prophage Hunter - provides a one-stop web service to extract prophage genomes from bacterial genomes, evaluate the activity of the prophages, identify phylogenetically related phages, and annotate the function of phage proteins. ( Reference: Song W et al. (2019) Nucleic Acids Res 47(W1): W74-W80).

IslandViewer - integrates two sequence composition GI prediction methods SIGI-HMM and IslandPath-DIMOB, and a single comparative GI prediction method IslandPick ( Reference: Langille et al. 2008. BMC Bioinformatics 9: 329).

PAIDB (PAthogenicity Island DataBase) has made an effort to collect known PAIs and to detect the potential PAI regions in the prokaryotic complete genomes. Pathogenicity islands (PAIs) are distinct genetic elements of pathogens encoding various virulence factors. ( Reference: Yoon SH et al. 2007. Nucleic Acids Res. 35 (Database Issue): D395-D400).

MTGIpick can identify genomic islands from a single genome, without annotated information of genomes or prior knowledge from other datasets. In simulations with alien fragments from artificial and real genomes, MTGIpick reported robust results across different experiments ( Reference: Dai Q et al. (2018) Brief Bioinform 19(3): 361-373).

SyntTax - is a web server linking synteny to prokaryotic taxonomy. SyntTax incorporates a full hierarchical taxonomic tree allowing intuitive access to all completely sequenced prokaryotes (Archaea and Bacteria). Single or multiple organisms can be chosen on the basis of their lineage by selecting the corresponding rank nodes in the tree. This is my favourite among the synteny programs ( Reference: Oberto J. 2013. BMC Bioinformatics. 14:4). The results below were generated using the heat-shock sigma factor (RpoH) from Salmonella Typhimurium against the Pseudomonadales.

Cinteny Server for Synteny Identification and Analysis of Genome Rearrangement (A. U. Sinha & J. Meller, University of Cincinnati, USA) - this server can be used for finding regions syntenic across multiple genomes and measuring the extent of genome rearrangement using reversal distance as a measure. You may create a project and upload your own data or work with pre-loaded prokaryote or eukaryote data.

SimpleSynteny - provides a pipeline for evaluating the synteny of a preselected set of gene targets across multiple organismal genomes. An emphasis has been placed on ease-of-use, and users are only required to submit FASTA files for their genomes and genes of interest. SimpleSynteny then guides the user through an iterative process of exploring and customizing genomes individually before combining them into a final high-resolution figure. ( Reference: Veltri D et al. 2016. Nucleic Acids Res. 44(Web Server issue): W41-W45).

Synteny Portal - eukaryotic genome users can easily (i) construct synteny blocks among multiple species by using prebuilt alignments in the UCSC genome browser database, (ii) visualize and download syntenic relationships as high-quality images, (iii) browse synteny blocks with genetic information and (iv) download the details of synteny blocks to be used as input for downstream synteny-based analyses, all in an intuitive and easy-to-use web-based interface. ( Reference: Lee J et al. 2016. Nucleic Acids Res 44(W1): W35-W40).

AutoGRAPH is an integrated web server for multi-species comparative genomic analysis. It is designed for constructing and visualizing synteny maps between two or three species, determination and display of macrosynteny and microsynteny relationships among species, and for highlighting evolutionary breakpoints.
The web server constructs synteny maps by pairwise comparison of marker/anchor orders between a reference chromosome and one or two tested genome(s). It permits users to visualize and characterize several features: Conserved segments (CS), Conserved Segments Ordered (CSO) and breakpoints. ( Reference: Derrien T et al. 2007. Bioinformatics 23:498-499).

Sibelia (University of California San Diego, USA) - is a tool for finding synteny blocks in multiple closely related microbial genomes using iterative de Bruijn graphs. Unlike most other tools, Sibelia can find synteny blocks that are repeated within genomes as well as blocks shared by multiple genomes. It represents synteny blocks in a hierarchical structure with multiple layers, each representing a different level of granularity.

Kablammo helps you create interactive visualizations of BLAST results from your web browser. Find your most interesting alignments, list detailed parameters for each, and export a publication-ready vector image. Incredibly easy to use - here are the results for a BLASTN comparison of Escherichia phages T1 (query) and ADB-2. ( Reference: Wintersinger JA et al. Bioinformatics 31:1305-1306).

M1CR0B1AL1Z3R - is a 'one-stop shop' for conducting microbial genomics data analyses via a simple graphical user interface. Some of the features implemented in M1CR0B1AL1Z3R are: (i) extracting putative open reading frames and comparative genomics analysis of gene content; (ii) extracting orthologous sets and analyzing their size distribution; (iii) analyzing gene presence-absence patterns; (iv) reconstructing a phylogenetic tree based on the extracted orthologous set; (v) inferring GC-content variation among lineages. M1CR0B1AL1Z3R facilitates the mining and analysis of dozens of bacterial genomes using advanced techniques. ( Reference: Avram O et al. (2019) Nucleic Acids Res. 47(W1): W88-W92).

GeneOrder 4.0 (D. Seto, Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) can be used to compare the gene order between two bacterial genomes ( Reference: Mahadevan P. & Seto D. 2010. BMC Research Notes 3:41).

CoreGenes (D. Seto & P. Mahadevan, Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) - tallies the total number of genes in common between the two genomes being compared, displays the percent value of genes in common with a specific genome, and determines the unique genes contained in a pair of proteomes. CoreGenes 3.5 is the batch CoreGenes server. I have extensively used this set of resources in the classification of bacterial viruses.

If you have a gbk file for a phage which has not yet been deposited in GenBank, you can use these instructions to convert your data into CoreGenes format for use here.

WebACT - this is the web version of ACT (Artemis Comparison Tool), a DNA sequence comparison viewer based on Artemis ( Reference: 21: 3422 - 3423). Visit the database page of EMBL-EBI and select EMBL and "Standard Query Form" to determine the EMBL accession number for the sequence you are interested in.

Panseq (Chad Laing, Public Health Agency of Canada) - a group of tools for the analysis of the 'pan genome' of a group of genomic sequences. The pan-genome of a bacterial species consists of a core genome and an accessory gene pool, the latter of which allows subpopulations of the organism to adapt to specific environments. These include Novel Region Finder, which will find sequences that are unique to a strain or group of strains with respect to another strain or group of strains. Pan-genome Analysis identifies the pan-genome among your sequences, finds SNPs in the core genome and determines the distribution of accessory genomic regions. Loci Selector identifies loci that offer the best discrimination among your dataset. ( Reference: Laing, C. et al. 2010. BMC Bioinformatics . 11: 461).
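The core/accessory split described above can be expressed with simple set operations. This is only a sketch under the assumption that each genome has already been reduced to a set of gene (or gene family) names; Panseq itself works on sequence data, and the function name `pan_genome` and the strain contents are hypothetical.

```python
def pan_genome(gene_sets):
    """Split genes into core (present in every genome) and accessory (the rest)."""
    genomes = [set(genes) for genes in gene_sets.values()]
    pan = set().union(*genomes)
    core = set.intersection(*genomes) if genomes else set()
    return {"pan": pan, "core": core, "accessory": pan - core}


if __name__ == "__main__":
    # Invented gene content, for illustration only.
    strains = {
        "strain1": {"dnaA", "recA", "stx1"},
        "strain2": {"dnaA", "recA"},
        "strain3": {"dnaA", "recA", "blaTEM"},
    }
    result = pan_genome(strains)
    print("core:     ", sorted(result["core"]))
    print("accessory:", sorted(result["accessory"]))
```

In practice the hard part is deciding which genes count as "the same" across genomes, which is what the orthology and BLAST-based tools in this list address.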

PARIGA - enables users to perform all-against-all BLAST searches on two sets of sequences selected by the user. Moreover, since it stores the two BLAST outputs in a Python-serialized-objects database, results can be filtered according to several parameters in real time, without re-running the process and avoiding additional programming effort. ( Reference: Orsini M. et al. 2013. PLoS One 8(5):e62224).

EDGAR (Efficient Database framework for comparative Genome Analyses using BLAST score Ratios) - EDGAR is designed to automatically perform genome comparisons in a high throughput approach and can be used for core genome, pan genome and singleton analysis, and Venn diagram construction. ( Reference: Blom J. et al. 2009. BMC Bioinformatics 10: 154).

OrthoVenn - is a web server for genome wide comparison and annotation of orthologous clusters across multiple species. It provides coverage of vertebrates, metazoa, protists, fungi, plants and bacteria for the comparison of orthologous clusters and also supports uploading of customized protein sequences from user-defined species. An interactive Venn diagram, summary counts, and functional summaries of the disjunction and intersection of clusters shared between species are displayed as part of the OrthoVenn result. OrthoVenn also includes in-depth views of the clusters using various sequence analysis tools. Furthermore, it identifies orthologous clusters of single copy genes and allows for a customized search of clusters of specific genes through key words or BLAST. ( Reference: Y. Yang et al. 2015. Nucl. Acids Res. 43 (W1): W78-W84). Also found here.

BEACON is a software tool that compares annotations of a particular genome from different Annotation Methods (AMs). It uses GenBank format as input and derives Extended Annotation (EA) alongside listing original annotations from individual AMs. ( Reference: Kalkatawi M, BMC Genomics. 2015 16(1): 1-8).

ANI (Average Nucleotide Identity) calculator - estimates the average nucleotide identity using both best hits (one-way ANI) and reciprocal best hits (two-way ANI) between two genomic datasets. Typically, the ANI values between genomes of the same species are above 95% (e.g., Escherichia coli). Values below 75% are not to be trusted, and AAI should be used instead. This tool supports both complete and draft genomes (multi-fasta). ( Reference: Goris J et al. 2007. Int J Syst Evol Microbiol. 57(Pt 1): 81-91).
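The rules of thumb stated above (above 95% typical within a species, below 75% not to be trusted) can be written as a small helper. The function name `interpret_ani` and the wording of the returned labels are hypothetical; the cut-offs are the ones given in the description, and they are guidelines rather than strict boundaries.

```python
def interpret_ani(ani_percent):
    """Map an ANI value (in percent) to the rule-of-thumb interpretation."""
    if not 0 <= ani_percent <= 100:
        raise ValueError("ANI must be a percentage between 0 and 100")
    if ani_percent < 75:
        return "unreliable: use AAI instead"
    if ani_percent > 95:
        return "within typical same-species range"
    return "below typical same-species range"


if __name__ == "__main__":
    for value in (98.5, 88.0, 70.2):
        print(value, "->", interpret_ani(value))
```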

Average Nucleotide Identity (ANI) calculator - their ANI Calculator uses the OrthoANIu algorithm, an improved iteration of the original OrthoANI algorithm, which uses USEARCH instead of BLAST ( Reference: Yoon, S. H. et al. (2017). Antonie van Leeuwenhoek. 110:1281-1286).

VIRIDIC (Virus Intergenomic Distance Calculator; C. Moraru, Institute for Chemistry and Biology of the Marine Environment, Germany) - the first level of bacteriophage classification by ICTV involves computing the overall DNA sequence identity between two viruses. This new tool computes pairwise intergenomic distances/similarities amongst phage genomes. To run it, upload a single fasta file with all phage genomes of interest, create a project and press run. Save the project ID that will be displayed when the project is created. You will need it to access the data if the calculations take a long time.
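If your phage genomes are stored as separate files, they first need to be merged into the single fasta file that the upload step asks for. A minimal sketch using only the Python standard library; the function name `combine_fasta` and the file names in the usage example are hypothetical.

```python
from pathlib import Path


def combine_fasta(input_paths, output_path):
    """Concatenate FASTA files into one multi-FASTA file; return the record count."""
    records = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            text = Path(path).read_text().strip()
            if not text:
                continue  # skip empty files
            # Each record starts with a ">" header line.
            records += sum(1 for line in text.splitlines() if line.startswith(">"))
            out.write(text + "\n")  # ensure records stay on separate lines
    return records


if __name__ == "__main__":
    # Hypothetical file names: every *.fasta file in the current directory.
    # Sort first so the output file (not yet created) is not picked up.
    inputs = sorted(Path(".").glob("*.fasta"))
    n = combine_fasta(inputs, "all_phages.fasta")
    print(f"wrote {n} genomes to all_phages.fasta")
```

It is worth checking that every header in the combined file is unique, since duplicate names make pairwise results hard to interpret.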

GGDC (Genome-To-Genome Distance Calculator) - provides methods for inferring whole-genome distances which are well able to mimic DNA-DNA hybridization (DDH). Values calculated with GGDC yield a somewhat better correlation with wet-lab DDH values than alternative approaches such as "ANI". These distance functions can also cope with heavily reduced genomes and repetitive sequence regions. Some of them are also very robust against missing fractions of genomic information (due to incomplete genome sequencing). Thus, this web service can be used for genome-based species delineation. ( Reference: Meier-Kolthoff JP et al. 2013. BMC Bioinformatics 14: 60).

POGO-DB - Based on computationally intensive whole-genome BLASTs, POGO-DB provides several metrics on pairwise genome comparisons: (a) Average Amino Acid Identity of all bi-directional best BLAST hits that covered at least 70% of the sequence and had 30% sequence identity; (b) Genomic Fluidity, which estimates the similarity in gene content between two genomes; (c) Number of orthologs shared between two genomes (as defined by two criteria); (d) Pairwise identity of the most similar 16S rRNA genes; (e) Pairwise identity of 73 additional globally-conserved marker genes (which were determined by the authors to exist in at least 90% of all the genomes). ( Reference: Lan Y et al. 2014. Nucl. Acids Res. 42 (D1): D625-D632).
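Metric (a) can be sketched as follows. This is an illustration of the general bidirectional-best-hit idea using the thresholds quoted above, not POGO-DB's actual pipeline. It assumes the best hit per protein has already been extracted from the BLAST output in each direction, treats the 30% identity figure as a minimum, and applies the thresholds to the forward hit only; the `BestHit` class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BestHit:
    query: str       # protein ID in the genome used as query
    subject: str     # its best-scoring match in the other genome
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the sequence covered by the alignment


def average_amino_acid_identity(forward, reverse,
                                min_identity=30.0, min_coverage=70.0):
    """Mean identity over bidirectional best hits that pass both thresholds."""
    reverse_best = {hit.query: hit.subject for hit in reverse}
    kept = [
        hit.identity
        for hit in forward
        if reverse_best.get(hit.subject) == hit.query  # reciprocal best hit
        and hit.identity >= min_identity
        and hit.coverage >= min_coverage
    ]
    return sum(kept) / len(kept) if kept else None


if __name__ == "__main__":
    # Invented hits, for illustration only.
    a_vs_b = [BestHit("a1", "b1", 80, 90), BestHit("a2", "b2", 60, 95),
              BestHit("a3", "b3", 90, 50), BestHit("a4", "b4", 70, 80)]
    b_vs_a = [BestHit("b1", "a1", 80, 90), BestHit("b2", "a2", 60, 95),
              BestHit("b3", "a3", 90, 50), BestHit("b4", "a9", 70, 80)]
    print(average_amino_acid_identity(a_vs_b, b_vs_a))
```

In the toy data, a3 is dropped for low coverage and a4 because its hit is not reciprocal, so only two pairs contribute to the average.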

VICTOR (Virus Classification and Tree Building Online Resource; Leibniz-Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH). This web service compares bacterial and archaeal viruses ("phages") using their genome or proteome sequences. The results include phylogenomic trees inferred using the Genome-BLAST Distance Phylogeny method (GBDP), with branch support, as well as suggestions for the classification at the species, genus and family level. (The service can be applied to other kinds of viruses, too, but has not yet been tested in this respect.) Upload your FASTA files, GenBank files and/or GenBank accession IDs. ( Reference: JP Meier-Kolthoff & M Göker. 2017. Bioinformatics 33(21): 3396-3404).

VIRFAM is dedicated to the recognition of head-neck-tail modules and of recombinase genes in phage genomes. You can use this server to search for remote homologs of specific protein families within protein sequences of bacteriophages. Input: protein sequences of your phage. Output includes a phylogenetic tree with the placement of your virus. ( Reference: Lopes A et al. Nucleic Acids Res. (2010) 38(12): 3952-62).

Seeker - is a deep-learning tool for reference-free identification of phage sequences. Seeker allows rapid detection of phages in sequence datasets and clean differentiation of phage sequences from bacterial ones, even for phages with little sequence similarity to established phage families. We comprehensively validate Seeker's ability to identify unknown phages and employ Seeker to detect unknown phages, some of which are highly divergent from known phage families. ( Reference: Auslander N et al. (2020)

VipTree - generates a "proteomic tree" of viral genome sequences based on genome-wide sequence similarities computed by tBLASTx. The original proteomic tree concept (i.e., "the Phage Proteomic Tree") was developed by Rohwer and Edwards, 2002. A proteomic tree is a dendrogram that reveals global genomic similarity relationships between tens, hundreds, and thousands of viruses. It has been shown that viral groups identified in a proteomic tree correspond well to established viral taxonomies. ( Reference: Nishimura Y et al. (2017) Bioinformatics 33: 2379–2380).

MiGA (Microbial Genomes Atlas) - a webserver that allows the classification of an unknown query genomic sequence, complete or partial, against all taxonomically classified taxa with available genome sequences, as well as comparisons to other related genomes including uncultivated ones, based on the genome-aggregate Average Nucleotide and Amino Acid Identity (ANI/AAI) concepts. ( Reference: Rodriguez-R et al (2018) Nucleic Acids Research 46(W1): W282-W288).

CGView Server - is a comparative genomics tool for circular genomes that allows sequence feature information to be visualized in the context of sequence analysis results. A genome sequence is supplied to the program in FASTA, GenBank, EMBL or raw format. Up to three comparison sequences (or sequence sets) in FASTA format can also be submitted. The CGView Server uses BLAST to compare the genome sequence to the comparison sequences, and then converts the results and any available feature information (from the GenBank, EMBL or optional GFF file) or analysis information (from an optional GFF file) into a high-quality graphical map showing the entire genome sequence, or a zoomed view of a region of interest. Several options are available for specifying how the BLAST comparisons are conducted, and for controlling how results are displayed. ( Reference: Grant JR & Stothard P. 2008. Nucleic Acids Res. 36(Web Server issue): W181-184)

Jena Prokaryotic Genome Viewer (JPGV) - from a GenBank flatfile (*.gbk), generates linear or circular plots which can, if desired, display GC content, GC skew, purine excess and keto excess. Also allows BLAST analysis against related genomes. Requires free registration.

GenomeVx - makes editable, publication-quality maps of mitochondrial and chloroplast genomes and of large plasmids. These maps show the location of genes and chromosomal features as well as a position scale. The program takes as input either raw feature positions or GenBank records. In the latter case, features are automatically extracted and colored, an example of which is given. Output is in the Adobe Portable Document Format (PDF) and can be edited by programs such as Adobe Illustrator. ( Reference: G. Conant & K. Wolfe. 2008. Bioinformatics 24:861-862).

myGenomeBrowser - is a web-based environment that provides biologists with a way to build, query and share their genome browsers. This tool, that builds on JBrowse, is designed to give users more autonomy while simplifying and minimizing intervention from system administrators. They have extended genome browser basic features to allow users to query, analyze and share their data. ( Reference: S. Carrere & J. Gouzy. Bioinformatics (2017) 33 (8): 1255-1257).

DNAPlotter - is an interactive Java application for generating circular and linear representations of genomes. Making use of the Artemis libraries to provide a user-friendly method of loading in sequence files (EMBL, GenBank, GFF) as well as data from relational databases, it filters features of interest to display on separate user-definable tracks. It can be used to produce publication-quality images for papers or web pages. ( Reference: Carver, T. et al. 2008. Bioinformatics 25:119-120)

GeneWiz (Center for Biological Sequence Analysis, Danish Technical University) produces linear or circular genome atlases. They have ready-made ones for most bacteria, but by uploading custom data in GenBank format (.gbk) you can make your own diagram showing the genetic and physical properties of your genome.

OrganellarGenomeDRAW - is a suite of software tools that enable users to create high-quality visual representations of both circular and linear annotated genome sequences provided as GenBank files or accession numbers. Although all types of DNA sequences are accepted as input, the software has been specifically optimized to properly depict features of organellar genomes. A recent extension facilitates the plotting of quantitative gene expression data, such as transcript or protein abundance data, directly onto the genome map. ( Reference: Lohse M, et al. 2013. Nucleic Acids Res. 41(Web Server issue):W575-81).

PlasmaDNA - Starting with a primary DNA sequence, PlasmaDNA looks for restriction sites, open reading frames, primer annealing sequences, and various common domains. The databases are easily expandable by the user to fit his most common cloning needs. PlasmaDNA can manage and graphically represent multiple sequences at the same time, and keeps in memory the overhangs at the end of the sequences if any. This means that it is possible to virtually digest fragments, to add the digestion products to the project, and to ligate together fragments with compatible ends to generate the new sequences. Excellent package for plasmids. (Reference: Angers-Loustau A et al. 2007. BMC Mol Biol. 2007 8:77).

GSDraw (Gene Structure Draw Server) is a web server for drawing gene structure schematic diagrams for gene families. Users can submit genomic, CDS and transcript sequences. GSDraw uses this information to obtain the gene structure, protein motifs and phylogenetic tree, and then draws a diagram of them. (Reference: Wang Y, et al. 2013. Nucleic Acids Res. 41(Database issue):D1159-66).

GECA is a user-friendly tool for representing gene exon/intron organization and highlighting changes in gene structure among members of a gene family. It relies on protein alignment, completed with the identification of common introns in the corresponding genes using CIWOG. GECA produces a main graphical representation showing the resulting aligned set of gene structures, where exons are to scale. The important and original feature of GECA is that it combines these gene structures with a symbolic display highlighting sequence similarity between subsequent genes. It is worth noting that this combination of gene structure with the indications of similarities between related genes allows rapid identification of possible events of gain or loss of introns, or points to erroneous structural annotations. The output image is generated in a portable network graphics format which can be used for scientific publications. ( Reference: Fawal N, et al. 2012. Bioinformatics 28:1398-9).

GeneDesign - is an excellent resource for designing synthetic genes. It includes tools for codon optimization and removal of restriction sites. ( Reference: Richardson, S.M. et al. 2006. Genome Research 16:550-556)

Orphelia - Orphelia is a metagenomic ORF finding tool for the prediction of protein coding genes in short, environmental DNA sequences with unknown phylogenetic origin. Orphelia is based on a two-stage machine learning approach that was recently introduced by our group. After the initial extraction of ORFs, linear discriminants are used to extract features from those ORFs. Subsequently, an artificial neural network combines the features and computes a gene probability for each ORF in a fragment. A greedy strategy computes a likely combination of high scoring ORFs with an overlap constraint. ( Reference: K.J. Hoff et al. 2009. Nucl. Acids Res. 37(Web Server issue): W101-W105).

WebMGA is a customizable web server for fast metagenomic analysis which includes over 20 commonly used tools for analyses such as ORF calling, sequence clustering, quality control of raw reads, removal of sequencing artifacts and contaminations, taxonomic analysis, functional annotation etc. All the tools behind WebMGA were implemented to run in parallel on our local computer cluster. ( Reference: Wu S, et al. 2011. BMC Genomics. 12:444).

MG-RAST (the Metagenomics RAST) server is an automated analysis platform for metagenomes providing quantitative insights into microbial populations based on sequence data. The server primarily provides upload, quality control, automated annotation and analysis for prokaryotic metagenomic shotgun samples. ( Reference: Wilke A, et al. 2016. Nucleic Acids Res. 44(D1):D590-4).

MetaBin Comprehensive Taxonomic Assignment of Metagenomic Sequences (Laboratory for Integrated Bioinformatics, RIKEN, Japan) web server and standalone program allow faster and more accurate taxonomic assignment of single and paired-end sequence reads of varying lengths (≥45 bp) obtained from both Sanger and next-generation sequencing platforms. Has a tutorial.

AmphoraNet - uses 31 bacterial and 104 archaeal protein coding marker genes for metagenomic and genomic phylotyping. Most of these are single copy genes, therefore AmphoraNet is suitable for estimating the taxonomic composition of bacterial and archaeal communities from metagenomic shotgun sequencing data. ( Reference: Kerepesi C, et al. 2014. Gene. 533:538-40).

METAGENassist - allows users to take bacterial census data from different environment sites or different biological hosts, and perform comprehensive multivariate statistical analyses on the data. These multivariate analyses can be done using either taxonomic or automatically generated phenotypic labels and visualized using a variety of high quality graphical tools. The bacterial census data can be derived from 16S rRNA data, NextGen shotgun sequencing or even classical microbial culturing techniques. Includes a tutorial. ( Reference: Arndt D, et al. 2012. Nucleic Acids Res. 40(Web Server issue):W88-95).

Real Time Metagenomics (Dr. Robert Edwards, San Diego State University, USA) - is the next revolution in metagenome annotation: Real time data processing and analysis. You can finally annotate a metagenome in real time, with no waiting. You can upload your own data for analysis. They accept either fasta or fastq files, and you can provide zip or gzip compressed data.

EBI Metagenomics (EMBL-EBI) - is an automated pipeline for the analysis and archiving of metagenomic data that aims to provide insights into the phylogenetic diversity as well as the functional and metabolic potential of a sample. You can freely browse all the public data in the repository. The service identifies rRNA sequences, using rRNASelector, and performs taxonomic analysis upon 16S rRNAs using Qiime. The remaining reads are submitted for functional analysis of predicted protein coding sequences using the InterPro sequence analysis resource. InterPro uses diagnostic models to classify sequences into families and to predict the presence of functionally important domains and sites. By utilising this resource, the service offers a powerful and sophisticated alternative to BLAST-based functional metagenomic analyses. Data submitted to the EBI Metagenomics service is automatically archived in the European Nucleotide Archive (ENA). Accession numbers are supplied for sequence data.

Kaiju - is a fast and sensitive taxonomic classification tool for metagenomics which takes nucleotide sequences in compressed FASTA or FASTQ format. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from bacterial, archaeal and viral genomes. By default, Kaiju uses either the available complete genomes from NCBI RefSeq or the microbial subset of the non-redundant protein database nr used by NCBI BLAST. Kaiju translates reads into amino acid sequences, which are then searched in the database using a modified backward search on a memory-efficient implementation of the Burrows-Wheeler transform, which finds maximum exact matches (MEMs), optionally allowing mismatches in the protein alignment. ( Reference: Menzel P et al. 2016. Nat. Commun. 7:11257)

PhyloPythiaS - is a fast and accurate sequence composition-based classifier that utilizes the hierarchical relationships between clades. Taxonomic assignments with the web server can be made with a generic model, or with sample-specific models that users can specify and create. Several interactive visualization modes and multiple download formats allow quick and convenient analysis and downstream processing of taxonomic assignments. ( Reference: Patil KR, et al. 2012. PLoS One. 7:e38581).

Virtual Metagenome - A web server to reconstruct metagenomes from 16S rRNA sequences. It offers a novel method for the rapid and efficient reconstruction of a virtual metagenome in environmental microbial communities without using large-scale genomic sequencing. We demonstrate this approach using 16S rRNA gene sequences obtained from denaturing gradient gel electrophoresis analysis, mapped to fully sequenced genomes, to reconstruct virtual metagenome-like organizations. ( Reference: Okuda S, et al. 2012. Nat Commun. 3:1203.)

MetaPhlAn2 (version 2.0.0) - is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data with species level resolution. It is also able to identify specific strains and to track strains across samples for all species. It allows for unambiguous taxonomic assignments, accurate estimation of organismal relative abundance, and species-level resolution for bacteria, archaea, eukaryotes and viruses. ( Reference: Segata N, et al. 2012. Nature Methods 8: 811–814).

CoMet-Universe - a web-server for comparative analysis of metagenomes based on protein domain signatures. Starting with an upload of your DNA sequences the CoMet pipeline performs all necessary steps for a comprehensive metagenome analysis including gene prediction, protein domain detection using Pfam 27, metabolic profiling based on KEGG pathways and taxon abundance estimation across all domains of life and viruses. ( Reference: Aßhauer KP et al. Int J Mol Sci. 2014 15(7):12364-78).

16S Classifier - is a tool for fast and accurate taxonomic classification of 16S rRNA hypervariable regions in metagenomic datasets. On real metagenomic datasets, it showed up to 99.7% accuracy at the phylum level and up to 99.0% accuracy at the genus level. ( Reference: N. Chaudhary et al. 2015. PLoS One 10(2): e0116106).

DNAATLAS (DNA2.0 Inc., U.S.A.) - A place for all your sequences. Easily import all your constructs including Genbank, Gene Designer, Excel, Word, and nearly any text-based format. DNA Atlas immediately parses your upload files and infers whether each sequence is a feature, construct, primer, DNA or amino acid. Upload features and primers to see them annotated in your sequences. Instantly view constructs annotated with our curated list of over 1000 features, or add your own. Use the BLAST-based sequence search to quickly align and compare your sequences. Keep track of your sequences, features, and primers. Categorize them using tags - from freezer locations to characterization data. (Requires registration.)

SuperPhy (Chad Laing & Vic Gannon, Public Health Agency of Canada) is an online tool for the predictive genomics of Escherichia coli. The platform integrates the analysis tools and genome sequence data for all publicly available E. coli genomes and facilitates the upload of new genome sequences from users under public or private settings. SuperPhy provides real-time analyses of thousands of genome sequences based on strain metadata, including geospatial and phylogenetic context.

Naming your bacteriophage: It is of prime importance for members of the bacterial virus community to name their newly isolated phages appropriately. A good place to start is "How to Name and Classify Your Phage: An Informal Guide" ( Reference: Adriaenssens E & Brister JR. 2017. Viruses 9(4). pii: E70), to which I will add the following points: (a) please check that the name you propose has not been used already and (b) do not name your phage Enterobacteria phage ø1234 or Enterobacteria phage 2017/ABC_567, since these names are incompatible with the creation of new species and genera taxa by the International Committee on Taxonomy of Viruses (ICTV). To find out whether your proposed name is unique, consult:

Phage Name Check (Stephen T. Abedon, Ohio State University, USA) - to see whether 'your' phage name is currently found on Google Scholar, Google Books, PubMed, or even Bacteriophage Names 2000.

CPT Phage Name Search (Center for Phage Technology at Texas A&M University)

BLAST Databases

No doubt readers familiar with BLAST have been curious: aren’t there databases of some kind involved in BLAST searches? Not necessarily. As we’ve seen, simple FASTA files will suffice for both the query and subject set. It turns out, however, that from a computational perspective, simple FASTA files are not easily searched. Thus BLAST+ provides a tool called makeblastdb that converts a subject FASTA file into an indexed and quickly searchable (but not human-readable) version of the same information, stored in a set of similarly named files (often at least three ending in .pin, .psq, and .phr for protein sequences, and .nin, .nsq, and .nhr for nucleotide sequences). This set of files represents the “database,” and the database name is the shared file name prefix of these files.

Running makeblastdb on a FASTA file is fairly simple: makeblastdb -in <fasta file> -out <database name> -dbtype <type> -title <title> -parse_seqids, where <type> is one of prot or nucl, and <title> is a human-readable title (enclosed in quotes if necessary). The -parse_seqids flag indicates that the sequence IDs from the FASTA file should be included in the database so that they can be used in outputs as well as by other tools like blastdbcmd (discussed below).
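As a minimal sketch of the command above, the call can be assembled and run from Python. The function names and the file and database names are hypothetical, chosen only for illustration; the flags are the ones listed in the text, and running the command assumes BLAST+ is installed and on the PATH.

```python
import shlex
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype, title):
    """Build the makeblastdb argument list using the flags described above."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-out", db_name,
        "-dbtype", dbtype,
        "-title", title,
        "-parse_seqids",
    ]


def build_database(fasta_path, db_name, dbtype, title):
    """Run makeblastdb; raises CalledProcessError if the tool reports failure."""
    subprocess.run(makeblastdb_command(fasta_path, db_name, dbtype, title), check=True)


# Hypothetical names, shown only to illustrate the resulting command line.
example = makeblastdb_command("proteins.fasta", "proteins_db", "prot", "My protein set")
print(shlex.join(example))
```

Passing the arguments as a list avoids shell quoting problems when the title contains spaces.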

Once a BLAST database has been created, other options can be used with blastn et al.:

  • -db <database name>
    • The name of the database to search against (as opposed to using -subject ).
  • -num_threads <integer>
    • Use <integer> CPU cores on a multicore system, if they are available.

    When using the -db option, the BLAST tools will search for the database files in three locations: (1) the present working directory, (2) your home directory, and (3) the paths specified in the $BLASTDB environment variable.

    The tool blastdbcmd can be used to get information about BLAST databases—for example, with blastdbcmd -db <database name> -info —and can show the databases in a given path with blastdbcmd -list <path> (so, blastdbcmd -list $BLASTDB will show the databases found in the default search paths). This tool can also be used to extract sequences or information about them from databases based on information like the IDs reported in output files. As always, reading the help and documentation for software like BLAST is highly recommended.
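The blastdbcmd calls mentioned above could be wrapped along these lines. This is a sketch, not part of BLAST+ itself: the helper names are hypothetical, and the -entry option (a comma-separated list of sequence IDs) assumes the database was built with -parse_seqids.

```python
import subprocess


def blastdbcmd_info(db_name):
    """Arguments for printing summary information about a database."""
    return ["blastdbcmd", "-db", db_name, "-info"]


def blastdbcmd_list(path):
    """Arguments for listing the databases found under a directory."""
    return ["blastdbcmd", "-list", path]


def blastdbcmd_entries(db_name, seq_ids):
    """Arguments for extracting sequences by ID (IDs joined with commas)."""
    return ["blastdbcmd", "-db", db_name, "-entry", ",".join(seq_ids)]


def fetch_fasta(db_name, seq_ids):
    """Run blastdbcmd and return the extracted sequences as FASTA text.

    Requires BLAST+ on the PATH and an existing database.
    """
    result = subprocess.run(
        blastdbcmd_entries(db_name, seq_ids),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

A typical use would be to collect subject IDs from a BLAST output table and pass them to fetch_fasta to retrieve the matching sequences.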

    File Format Conversion

    Suppose you have a GenBank file which you want to turn into a Fasta file. For example, let's consider the file which is included in the Biopython unit tests under the GenBank directory.

    You could read the file like this, using the Bio.SeqIO.parse() function:

    Notice that this file contains six records. Now instead of printing the records, let’s pass the SeqRecord iterator to the Bio.SeqIO.write() function, to turn this GenBank file into a Fasta file:

    Or more concisely using the Bio.SeqIO.convert() function (in Biopython 1.52 or later), just:

    In this example the GenBank file started like this:

    The resulting Fasta file looks like this:

    Note that all the Fasta file can store is the identifier, description and sequence.

    By changing the format strings, that code could be used to convert between any supported file formats.
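The three Bio.SeqIO calls described in this section can be sketched as follows. The wrapper function names and the file names in the final comment are placeholders, not the test file referred to above, and the code assumes Biopython is installed.

```python
from Bio import SeqIO  # requires Biopython


def count_records(path, fmt):
    """Count the records in a file by iterating over Bio.SeqIO.parse()."""
    return sum(1 for _ in SeqIO.parse(path, fmt))


def convert_via_write(in_path, in_fmt, out_path, out_fmt):
    """Pass the SeqRecord iterator from parse() straight to write().

    Returns the number of records written.
    """
    records = SeqIO.parse(in_path, in_fmt)
    return SeqIO.write(records, out_path, out_fmt)


def convert_direct(in_path, in_fmt, out_path, out_fmt):
    """The same conversion in a single call (Biopython 1.52 or later)."""
    return SeqIO.convert(in_path, in_fmt, out_path, out_fmt)


# Placeholder file names; substitute your own GenBank file:
# convert_direct("example.gb", "genbank", "example.fasta", "fasta")
```

Both conversion helpers take the format names as plain strings, so swapping "genbank" and "fasta" for other supported names changes the direction or type of conversion.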

    SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments

    Rapidly decreasing genome sequencing costs have led to a proportionate increase in the number of samples used in prokaryotic population studies. Extracting single nucleotide polymorphisms (SNPs) from a large whole genome alignment is now a routine task, but existing tools have failed to scale efficiently with the increased size of studies. These tools are slow, memory inefficient and are installed through non-standard procedures. We present SNP-sites which can rapidly extract SNPs from a multi-FASTA alignment using modest resources and can output results in multiple formats for downstream analysis. SNPs can be extracted from an 8.3 GB alignment file (1842 taxa, 22 618 sites) in 267 seconds using 59 MB of RAM and 1 CPU core, making it feasible to run on modest computers. It is easy to install through the Debian and Homebrew package managers, and has been successfully tested on more than 20 operating systems. SNP-sites is implemented in C and is available under the open source license GNU GPL version 3.
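To illustrate the general idea of pulling variable columns out of an alignment, here is a small pure-Python sketch. It is not the SNP-sites implementation (which is written in C and built for scale); the function names are hypothetical, and treating gaps and N as uninformative is a simplifying assumption made here.

```python
def variable_sites(records, ignore="N-"):
    """Return 0-based alignment columns where two or more distinct bases occur.

    `records` is a list of (name, aligned_sequence) tuples. Characters in
    `ignore` (unknown base, gap) are not counted as alleles.
    """
    seqs = [seq.upper() for _, seq in records]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same aligned length")
    sites = []
    for col, column in enumerate(zip(*seqs)):
        alleles = set(column) - set(ignore)
        if len(alleles) > 1:
            sites.append(col)
    return sites


def snp_alignment(records, sites):
    """Reduce each sequence to the characters at the variable columns only."""
    return [(name, "".join(seq[i] for i in sites)) for name, seq in records]


# Tiny made-up alignment: only the fifth column differs between real bases.
alignment = [("s1", "ACGTAC"), ("s2", "ACGTTC"), ("s3", "AC-TAC")]
sites = variable_sites(alignment)
print(sites)                          # [4]
print(snp_alignment(alignment, sites))  # [('s1', 'A'), ('s2', 'T'), ('s3', 'A')]
```

This version holds the whole alignment in memory, which is exactly the cost a dedicated tool avoids on multi-gigabyte inputs.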

    Extract mutations from fasta sequences

    I have a large number of aligned protein sequences in .fasta format and a reference sequence, all of the same length. I would like to extract only the amino acid mutations from these sequences, so that in the end I have a list that looks something like this: I456L, W675T, etc. Is there a program or any other way to do this? Thanks

    Pierre has a complete solution, but in case that does not work you could use blastp with -outfmt 3, which will identify the differences and show them in its output.

    Biopython blast parser may be able to help finish the rest.
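For sequences that are already aligned to the reference, as in the question, the comparison can also be sketched in plain Python without BLAST. This is an illustrative sketch, not any of the tools named in this thread: the function names are hypothetical, the first record is assumed to be the reference, positions are counted along the reference (gap columns in the reference are skipped), and gaps and X are ignored rather than reported.

```python
def parse_fasta(lines):
    """Parse an iterable of FASTA lines into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def mutations(reference, query, ignore="-X"):
    """List substitutions such as 'I456L' between two aligned sequences."""
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have the same length")
    ref_pos = 0
    found = []
    for ref_aa, alt_aa in zip(reference.upper(), query.upper()):
        if ref_aa != "-":
            ref_pos += 1  # 1-based residue number in the ungapped reference
        if ref_aa == alt_aa or ref_aa in ignore or alt_aa in ignore:
            continue
        found.append(f"{ref_aa}{ref_pos}{alt_aa}")
    return found


def mutations_per_sequence(records):
    """Compare every record against the first one, taken as the reference."""
    if not records:
        raise ValueError("no sequences found")
    (_, ref_seq), *others = records
    return [(name, mutations(ref_seq, seq)) for name, seq in others]


# Made-up example; for a real file use: parse_fasta(open("aligned.fasta"))
demo = parse_fasta(">ref\nMKTILW\n>a\nMKTILW\n>b\nMRTILT\n".splitlines())
for name, muts in mutations_per_sequence(demo):
    print(name, ", ".join(muts))
```

If insertions or deletions matter for your analysis, the `ignore` argument can be set to an empty string, in which case gap characters are reported like any other difference.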

    Using bioalcidaejdk and a fasta file where the very first sequence is the reference:

    Pierre Lindenbaum, I cannot find the dist folder or bioalcidaejdk.jar. There is a file named in the bioalcidae folder.

    Did you follow install instructions:

    java compiler SDK 11. Please check that this java is in the $. Setting JAVA_HOME is not enough : (e.g: )

    Thank you both, genomax and Pierre Lindenbaum. I had a problem with my JDK compiler; I have just solved it, got it running, and everything went well.
