How to determine the primary Uniprot accession number from a list of accession numbers?

Given a list of UniProt IDs that are linked to an Ensembl ID, is there a way to systematically determine which is the primary accession number with no other information?

According to ExPASy:

Researchers who wish to cite entries in their publications should always cite the first accession number. This is commonly referred to as the 'primary accession number'. 'Secondary accession numbers' are sorted alphanumerically.

But what if the order has been jumbled, or the list was compiled from a different source and re-sorted?

For example:

Ensembl:

ENSMUSG00000035642

Uniprot:

Q8R0P4, Q8CF11, D6RJK8, D6RJJ4, D3Z442, D3Z1Q3, D3YZD8, D3YY39, D3YX09, D3YWY5

This question is cross listed on the Bioinformatics stack-exchange.


I think there is an issue with the terminology. The "primary" accession number is the first accession number in cases where an entry has more than one accession number, as described in http://www.uniprot.org/help/accession_numbers:

Entries can have more than one accession number. This can be due to two distinct mechanisms:

a) When two or more entries are merged, the accession numbers from all entries are kept. The first accession number is referred to as the 'Primary (citable) accession number', while the others are referred to as 'Secondary accession numbers'. These are listed in alphanumerical order.

b) If an existing entry is split into two or more entries ('demerged'), new 'primary' accession numbers are attributed to all the split entries while all original accession numbers are retained as 'secondary' accession numbers.

Example: P29358 which has been 'demerged' into P68250 and P68251.

Both reviewed and unreviewed entries can have primary accession numbers.

What you probably mean, as previous posters understood, are accession numbers of reviewed entries as opposed to unreviewed ones.

In that case, you can indeed add "reviewed:yes" to your query, e.g. when you are using the UniProt ID mapping, http://www.uniprot.org/help/uploadlists
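The rule quoted above implies that a jumbled list alone cannot reveal which accession is primary; the order has to be read from the entry itself. As a minimal sketch, assuming the entries are available in UniProtKB flat-file (text) format, where the `AC` lines list accessions separated by semicolons with the primary one first, the distinction can be recovered as follows. The function names and the two abbreviated entries are hypothetical and only modelled on the demerge example.

```python
def parse_ac_lines(entry_text):
    """Return (primary, secondaries) from the AC lines of one flat-file entry."""
    accessions = []
    for line in entry_text.splitlines():
        if line.startswith("AC "):
            # AC lines hold semicolon-terminated accessions and may wrap
            # over several lines; the first accession overall is primary.
            accessions.extend(a.strip() for a in line[2:].split(";") if a.strip())
    if not accessions:
        raise ValueError("no AC line found")
    return accessions[0], accessions[1:]


def secondary_to_primary(entries):
    """Index each secondary accession by the primary accession(s) retaining it."""
    index = {}
    for entry_text in entries:
        primary, secondaries = parse_ac_lines(entry_text)
        for secondary in secondaries:
            index.setdefault(secondary, []).append(primary)
    return index


# Illustrative, heavily abbreviated entries (not real file contents).
entry_a = "ID   EXAMPLE_A\nAC   P68250; P29358;\n"
entry_b = "ID   EXAMPLE_B\nAC   P68251; P29358;\n"
index = secondary_to_primary([entry_a, entry_b])
```

A secondary accession can map to more than one primary accession after a demerge, which is why the index stores lists.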


Swiss-Prot is the reviewed section of UniProt's holdings. TrEMBL contains everything else.

Q8R0P4, Mth938 domain-containing protein (AAMDC_MOUSE), is the reviewed Swiss-Prot (i.e. reliable) identifier.

When searching in UniProt you can filter to only see reviewed (Swiss-Prot) identifiers; see the top-left corner of the link above.
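If the goal is to keep only the reviewed entries from a list such as the one in the question, the filter can be written as a single search string. The sketch below only assembles that string and does not contact any server. The `accession:` field name and the `reviewed:true` clause are assumptions about the current UniProt query syntax (the answer above uses `reviewed:yes`), so the clause is left as a parameter; `build_reviewed_query` is a hypothetical helper name.

```python
def build_reviewed_query(accessions, reviewed_clause="reviewed:true"):
    """Combine accessions into one query restricted to reviewed entries.

    The field names are assumptions; check them against the UniProt
    query documentation before relying on the result.
    """
    if not accessions:
        raise ValueError("at least one accession is required")
    ids = " OR ".join(f"accession:{acc}" for acc in accessions)
    return f"({ids}) AND {reviewed_clause}"


query = build_reviewed_query(["Q8R0P4", "Q8CF11", "D6RJK8"])
```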


EFI - Enzyme Similarity Tool

EFI-EST uses the UniProtKB protein sequence database (maintained by EMBL-EBI) for its annotations because it provides the ability for members of the community to modify and/or correct functional annotations. In addition, EFI-EST uses the Pfam and InterPro databases (also maintained by EMBL-EBI) to provide easy access to the complete memberships of a large number of curated protein families/superfamilies (16,712 families for Pfam 31.0; 30,876 families/domains/sites for InterPro 64.0). The InterPro database collects signature sequences from 12 different databases, including Pfam, to define its families. Because the different databases may define the "same" family with slightly different signature sequences, InterPro families almost always are larger than Pfam families.

The sequence similarity networks generated by this webserver utilize the full length sequences of the proteins that are identified via their UniProt accession IDs (by BLAST in Option A, members of specified Pfam and/or InterPro families in Option B, the headers in a FASTA file in Option C, when read, and from lists of accession IDs in Option D). As a result, the clusters that are generated and visualized in the networks will result from sequence similarities for the entire sequence.

Many proteins have multiple domains; for these proteins, the alignments used to calculate the alignment scores will not necessarily be for the domain in which you may be interested. However, we provide an "Advanced Option" for Option B that provides the capability to trim the full-length sequences of multidomain proteins to generate SSNs using domain boundaries defined by Pfam for the Pfam family that you enter. We recommend that you use this advanced option carefully: Pfam families "always" contain fragments of full-length sequences, and domains often are interrupted by insertions, both potentially complicating the interpretation of the SSN.

INPUT: Four options for generating SSNs are available.

Select the option you want to use and enter the required information. For each input method, an "Advanced Options" menu allowing modification of the default parameters is available.

By default, the all-by-all BLAST used to calculate the edges for the SSN returns a result only if the e-value is ≤ 10⁻⁵.

We recommend generating SSNs with the default value of 10⁻⁵ and examining the percent identity quartile plot to determine whether the default should be changed. For short sequences, e.g., < 100 residues, this e-value may be too small to allow an alignment score corresponding to 30% or less to be used for filtering in the Analyze Data step. The "Advanced Options" menu for each option allows you to select a larger upper limit for the e-value by entering an integer ≤ 5 (the negative log of the e-value); the lower limit for the input is 0.
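The relationship between the integer entered in the menu and the e-value cutoff can be sketched as below. This is an illustration of the arithmetic only, not EFI-EST code; the function names are hypothetical, and the 0 to 5 range is taken from the paragraph above.

```python
def evalue_threshold(neg_log10):
    """Convert the menu integer (negative log10 of the e-value) to a cutoff."""
    if not 0 <= neg_log10 <= 5:
        raise ValueError("the input must be an integer between 0 and 5")
    return 10.0 ** -neg_log10


def keep_edge(evalue, neg_log10=5):
    """Return True if a BLAST pair would be retained as an edge."""
    return evalue <= evalue_threshold(neg_log10)


# A pair with e-value 1e-3 is dropped at the default (5) but kept at 2.
dropped_by_default = not keep_edge(1e-3)
kept_when_relaxed = keep_edge(1e-3, neg_log10=2)
```

Lowering the integer raises the cutoff, so more (weaker) pairs survive, which is what short sequences may need.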

After the input has been entered for any of the four options on the start screen, as shown in Figure 1, enter your e-mail address (for data retrieval only; blue arrow), and hit "Submit Analysis" at the bottom of the screen (green arrow). EFI-EST will assemble the sequence dataset and perform the all-by-all BLAST. The all-by-all BLAST will return alignment scores/edges for those sequence pairs for which the BLAST e-values are less than an upper limit threshold of 10⁻⁵ (or a different threshold specified in the "Advanced Options"). For most families, the default threshold should provide sufficient internode connections (edges) in the networks that inferences about divergent evolution of protein function are possible.

If you are interested in detailed exploration of sequence-function relationships in families with more than 100,000 sequences, please submit a summary of your interests via the feedback form at https://efi.igb.illinois.edu//feedback.php and we may be able to assist.


Figure 1. Entire EFI-EST starting page.

Option A: Single sequence query

Networks for close homologs to a user-supplied sequence. Paste a protein sequence (without a FASTA header) into the input box (red arrow). A sequence dataset will be built containing the most closely related sequences retrieved from the UniProtKB database using a BLAST e-value upper limit threshold of 10⁻⁵. A default of 1,000 sequences is used, but the dataset may be smaller if < 1,000 sequences are found using a BLAST e-value upper limit of 10⁻⁵. A default of ≤ 1,000 sequences is used because, in most cases, a full network with all sequences (nodes) will be viewable without having to collapse nodes into representative nodes. Use this option if you are only interested in those proteins that are most similar to your protein of interest.
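The selection described for Option A amounts to a filter followed by a cap on the number of sequences. The sketch below illustrates that logic on (accession, e-value) pairs. Ranking the hits by e-value is an assumption about what "most closely related" means here, and `select_hits` is a hypothetical name.

```python
def select_hits(hits, max_sequences=1000, max_evalue=1e-5):
    """Keep hits at or below the e-value cutoff, best first, up to a maximum.

    hits: iterable of (accession, evalue) pairs from a BLAST search.
    """
    kept = [hit for hit in hits if hit[1] <= max_evalue]
    kept.sort(key=lambda hit: hit[1])  # smallest e-value = closest homolog
    return [accession for accession, _ in kept[:max_sequences]]


# Made-up hits for illustration.
hits = [("A", 1e-50), ("B", 1e-3), ("C", 1e-20), ("D", 1e-6)]
top_two = select_hits(hits, max_sequences=2)
all_passing = select_hits(hits)
```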


Figure 2. Settings for Option A.

Advanced Options (magenta arrows): By clicking on the Advanced Options tab below the input box, you can enter "custom" values for the maximum number of sequences that will be collected and the e-value used.

Maximum BLAST Sequences: Option A allows the user to collect a subset of sequences. It is possible to collect a maximum of 10,000 sequences. This option may be preferred if a full family network is difficult to handle in Cytoscape on memory limited computers. Alternatively, you can download a representative node network to visualize larger networks.

Option B: Pfam and/or InterPro families

Defined protein families are used to generate the SSN.

Pfam and/or InterPro family identifier(s) for your family of interest are used as input. The Pfam and/or InterPro families to which a protein belongs can be determined on the Pfam and InterPro websites.

More than one Pfam and/or InterPro family identifier can be entered as the input for Option B, in a comma-separated list (red arrow). The number of sequences that can be used in Option B is limited to ≤ 100,000. This limit is set to ensure that assembling the dataset/performing the all-by-all BLAST as well as generating the networks for most families can be completed within several hours (very large families may require several days). When the dataset is complete, you will receive an e-mail with a link to analyze the dataset. This link will be active for 14 days so that you may return at your convenience.

When an entry is recognized, the sequence count per family and estimated total count (there may be redundancy between families) are displayed (blue box).

Option B usually will result in a much larger dataset than Option A because all of the members of families are included. Full networks may be problematic to open in Cytoscape on memory limited computers when large families are analyzed. As an alternative to full networks, representative node networks are available to download on the result page.


Figure 3. Settings for Option B.

Advanced Options (magenta arrows): By clicking on the "Advanced Options" menu, you can enter a "custom" e-value used in the all-by-all BLAST. You also can select a fraction of the sequences in the input Pfam and/or InterPro family(ies) so that you can generate an "overview" of the families you are interested in. You can also choose to generate the SSN with the Pfam defined domains instead of the full-length sequences.

Fraction: If the dataset you initially select is too large (> 100,000 sequences), you can select the same dataset and specify a fraction of that dataset to be analyzed. This decreases the number of sequences but provides a representative overview of the original dataset. The value entered represents the divisor by which you wish to fractionate the dataset, e.g., 10 = only every 10th sequence in the total sequence dataset is used. The UniProt sequence dataset is not preorganized, so the sampling is "random".
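Taking every Nth sequence can be sketched with a simple slice. Which sequence the sampling starts from is not stated above, so starting at the first element is an assumption, and `fractionate` is a hypothetical name.

```python
def fractionate(accessions, divisor):
    """Return every `divisor`-th accession, starting with the first one."""
    if divisor < 1:
        raise ValueError("divisor must be a positive integer")
    return accessions[::divisor]


# A dataset of 250,000 placeholder IDs with divisor 10 leaves 25,000.
dataset = [f"seq{i}" for i in range(250_000)]
subset = fractionate(dataset, 10)
```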

Domains: It is difficult/impossible to infer the functional relationships between proteins that possess a single domain and ones composed of multiple domains using SSNs. Pfam defines N- and C-terminal domain boundaries for members of its families based on sequence, not structure, comparisons. Using these domain definitions, it is possible to trim full-length sequences of multi-domain proteins to obtain only the domain specified by the Pfam family ID.

For example, in nonribosomal peptide synthases (NRPSs), the domain definitions can be used to extract the individual domains (e.g., condensation domains, PF00668) and use these to generate an SSN. If the full-length sequence has multiple homologues of the same domain, all of the domains will be extracted and used to generate the SSN.

By using the "Enable Domain" option, the SSN will be generated with the sequences from the defined domain instead of the full-length sequences. In the networks, the N- and C-boundaries of the domain are appended to the UniProt accession ID for the full-length sequence (ID:N-terminus:C-terminus). This makes the produced SSN incompatible with the generation of a corresponding GNN and the use of the coloring utility.
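The trimming and relabeling can be illustrated as follows. The sketch assumes the domain boundaries are 1-based and inclusive, which is common for protein coordinates but is not stated above; `trim_to_domain` and the toy sequence are hypothetical.

```python
def trim_to_domain(accession, sequence, start, end):
    """Cut one domain out of a full-length sequence.

    Returns the node label in the ID:N-terminus:C-terminus form and the
    domain sequence. Boundaries are treated as 1-based and inclusive.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("domain boundaries fall outside the sequence")
    label = f"{accession}:{start}:{end}"
    return label, sequence[start - 1:end]


label, domain = trim_to_domain("P11444", "ABCDEFGHIJ", 3, 6)
```

A protein with several copies of the same domain would yield one label per copy, each with its own boundaries.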

Please be aware that Pfam families "always" include at least some fragments of full-length sequences as the result of sequencing errors, so these may complicate the analyses of networks for domains. In addition, in some proteins the domain belonging to one family may be inserted in the domain for a second family, resulting in two pieces of the second domain in the network.

Option C: User-supplied FASTA file


Figure 4. Settings for Option C.

Option C allows the user to input protein sequences in a FASTA format, using the direct input box or by uploading a file, and generate an SSN using those sequences (red arrows). The sequences submitted can be enriched with sequences from specified Pfam and/or InterPro families so that the provided sequences can be placed in the context of a protein family (orange arrow). When a protein family is supplied for enriching your initial submission, the number of sequences from this family is displayed, for information.

Option C provides two further options for handling the FASTA file (yellow arrow).

By default, the sequences from the FASTA file are used for generating the SSN. All characters of the FASTA header are used as the “Description” node attribute in the SSN for the corresponding protein sequence, and the number of residues is the value of the “Sequence_Length” node attribute. In addition, “shared name” and “name” node attributes are assigned individually to each sequence, and numbered sequentially starting with 0. The preceding characters (to make 6) in the “shared name” and “name” node attributes will be "z", e.g., zzz123.
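The "z"-padded names described above can be reproduced with ordinary string padding. This is a sketch of the naming scheme only; `z_name` is a hypothetical helper.

```python
def z_name(index, width=6):
    """Left-pad the sequence number with 'z' to the given width."""
    return str(index).rjust(width, "z")


names = [z_name(i) for i in (0, 7, 123)]
```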

If the option is activated by ticking the box, the FASTA header for each sequence is read, to import the accession IDs. UniProt IDs and/or NCBI IDs (RefSeq IDs, UniProt/Swiss-Prot IDs, GenBank IDs, PDB IDs, and/or “retired” NCBI GI numbers) present in the FASTA header are identified (following the formatting “rules” described below). A UniProt ID is used to directly identify the sequences and annotations for SSN node attributes in the UniProt database. An NCBI ID is used to query the idmapping file provided by UniProt to identify the equivalent UniProt ID, and the sequence and annotations for SSN node attributes are obtained from the UniProt database. For these entries (with UniProt or NCBI IDs in the header), two additional node attributes will be present: “Query_IDs” will list the UniProt and/or NCBI ID(s) from the FASTA header and “Sequence_Source” will indicate “USER”.

Not all NCBI IDs will identify an equivalent UniProt ID (the NCBI database is larger than the UniProt database). For these entries, the default information (FASTA header as the Description and Sequence Length) will be provided.

If the user enters Pfam and/or InterPro families IDs (orange arrow), the node attributes associated with these sequences will include “FAMILY” as the “Sequence_Source” node attribute. If a node is associated with both the FASTA file and a sequence from Pfam/InterPro family, the “Sequence_Source” node attribute will be “FAMILY+USER”.

The NCBI BLAST server provides FASTA files in which multiple FASTA headers often are provided for the same sequence. As a result, more than one header/accession ID may identify the same UniProt ID. Also, files from the NCBI BLAST can contain entries for the PDB structures of mutant proteins: the PDB ID for a mutant often will identify the UniProt ID for the wild type protein, so multiple PDB entries (for the wild type and mutant proteins) will identify the same UniProt ID. When this occurs, the SSN will contain a single node for the UniProt ID, and the “Query ID” node attribute will contain a list of all of the NCBI IDs that located the UniProt ID.

If a UniProt ID cannot be located for a sequence in the UniProt format because it is more recent than our database or the NCBI ID cannot be located in the idmapping file, the default information (FASTA header as the Description and Sequence Length) will be provided.

Two outcomes are possible if an NCBI ID cannot be located in the idmapping file:

  1. If the FASTA header is one of several associated with the same sequence (files from the NCBI BLAST server) and a UniProt ID can be identified for at least one of the headers, the NCBI ID will be included in the “Other_IDs” node attribute for each of the UniProt IDs that are identified for the sequence. The “shared name” and “name” attributes will have “z” format described previously.
  2. Otherwise, the sequence in the FASTA entry will be used for the SSN. As described for Option C, the “shared name” and “name” node attributes have a total of six characters. The sequences in the FASTA file are numbered sequentially starting with 0. The preceding characters (to make 6) will be "z", e.g., zzz123. The NCBI ID is included in the “Other_IDs” node attribute. If the sequence has more than one FASTA header with an NCBI ID that cannot be retrieved, all will be included in the “Other_IDs” node attribute.

When the “Read FASTA headers” is not selected, the FASTA header is not interrogated for an accession ID and is used only as the “Description” node attribute. The sequence in the FASTA file is used to generate the SSN. The node “name” and “shared name” node attributes will be generated as described two paragraphs above, e.g., zzz123. The sequences from the FASTA file will have USER as the “Sequence_Source”.

The acceptable formats for FASTA headers are provided in the following examples taken from output files from the UniProt and NCBI BLAST servers (accession ID highlighted):

UniProt (from UniProt BLAST TrEMBL and SwissProt, respectively)
>tr|R9RJF1|R9RJF1_PSEAI Mandelate racemase OS=Pseudomonas aeruginosa PE=4 SV=1
>sp|P11444|MANR_PSEPU Mandelate racemase OS=Pseudomonas putida GN=mdlA PE=1 SV=1

NCBI RefSeq (from NCBI BLAST)
>WP_016501748.1 mandelate racemase [Pseudomonas putida]

NCBI UniProt/Swiss-Prot ID (from NCBI BLAST)
>Q0TE80.1 RecName: Full=Enolase AltName: Full=2-phospho-D-glycerate hydro-lyase AltName: Full=2-phosphoglycerate dehydratase

NCBI GenBank ID (from NCBI BLAST)
>AAA25887.1 mandelate racemase (EC 5.1.2.2) [Pseudomonas putida]

NCBI PDB ID (from NCBI BLAST)
>pdb|1MDR|A Chain A, The Role Of Lysine 166 In The Mechanism Of Mandelate Racemase From Pseudomonas Putida: Mechanistic And Crystallographic Evidence For Stereospecific Alkylation By (r)-alpha-phenylglycidate

NCBI GI Number (from NCBI BLAST now retired)
>gi|347012980| 4-O-methyl-glucuronoyl methylesterase [Myceliophthora thermophila ATCC 42464]

Option C also accepts FASTA headers in which the IDs (formats described in Option D) immediately follow the “>” symbol, e.g., the following headers abbreviated from those shown above:

UniProt
>R9RJF1
>P11444

NCBI RefSeq
>WP_016501748.1

NCBI UniProt/Swiss-Prot ID
>Q0TE80.1

NCBI GenBank ID
>AAA25887.1

NCBI PDB ID
>1MDR

NCBI GI Number (now retired)
>347012980
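Both header styles shown above (full headers from the BLAST servers and the abbreviated form with the ID directly after ">") can be reduced to the accession ID with a small amount of string handling. The sketch below is an illustration based only on the example headers listed here, not the parser EFI-EST uses; `header_id` is a hypothetical name.

```python
# Prefixes that are followed by "|" and then the accession ID.
PIPE_PREFIXES = {"sp", "tr", "pdb", "gi"}


def header_id(header):
    """Return the accession ID token from one FASTA header line."""
    fields = header.lstrip(">").split()
    if not fields:
        return ""
    token = fields[0]
    parts = token.split("|")
    if len(parts) >= 2 and parts[0].lower() in PIPE_PREFIXES:
        return parts[1]  # e.g. "sp|P11444|MANR_PSEPU" -> "P11444"
    return token  # the ID immediately follows ">"


examples = [
    ">tr|R9RJF1|R9RJF1_PSEAI Mandelate racemase OS=Pseudomonas aeruginosa PE=4 SV=1",
    ">WP_016501748.1 mandelate racemase [Pseudomonas putida]",
    ">pdb|1MDR|A Chain A, The Role Of Lysine 166",
    ">gi|347012980| 4-O-methyl-glucuronoyl methylesterase",
    ">P11444",
]
ids = [header_id(h) for h in examples]
```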

Advanced Options (magenta arrows): By clicking on the Advanced Options tab below the input box, you can enter a “custom” value used in the all-by-all BLAST. You also can select a fraction of the sequences in the input Pfam and/or InterPro family(ies) so that you can generate a “representative” network for families ≤ 100,000 sequences.

Fraction: This advanced option applies ONLY to the sequences in the Pfam or InterPro family if so specified, not in the user-supplied FASTA file. As in Option B, although the number of sequences that can be used to generate an SSN is limited to ≤ 100,000, with this advanced option you can select a fraction of the total number of sequences for larger sequence sets to generate a network.

Option D: SSNs for a user-supplied text file of accession IDs.


Figure 5. Settings for Option D.

The user uploads a text file containing UniProt IDs, NCBI IDs (RefSeq IDs, UniProt/Swiss-Prot IDs, GenBank IDs, and/or “retired” GI numbers), and/or PDB IDs (red arrows). These are the most commonly encountered sequence database accession IDs that users may have for their “favorite” proteins.

A UniProt ID is used to directly identify the sequences and annotations for SSN node attributes in the UniProt database. An NCBI ID is used to query the idmapping file provided by UniProt to identify the equivalent UniProt ID, and the sequence and annotations for SSN node attributes are obtained from the UniProt database. For these entries (with UniProt or NCBI IDs in the header), two additional node attributes will be present: “Query_IDs” will list the UniProt and/or NCBI ID(s) from the FASTA header and “Sequence_Source” will indicate “USER”.

The formats for UniProt IDs, NCBI IDs, and PDB IDs are described below with examples:

UniProt IDs
A UniProtKB ID is 6 or 10 alphanumeric characters, for example:
P11444
T2HDW6
A0A0A7PVN6

NCBI RefSeq IDs
An NCBI RefSeq ID is 2 letters followed by an underscore followed by a series of digits, a period, and one or more digits for the sequence version number, e.g.,
WP_016501748.1
NP_708575.1
YP_002409124.1

NCBI UniProt/Swiss-Prot IDs
An NCBI UniProt/Swiss-Prot ID is the UniProt ID followed by a period and one or more digits for the sequence version number, e.g.,
Q31XL1.1
B7LEJ8.1
C4ZZT2.1

NCBI GenBank IDs
The format for NCBI GenBank IDs is 3 letters followed by five digits, a period, and one or more digits for the sequence version number, e.g.,
BAN56663.1
AAC15504.1
BAM38409.1

PDB IDs
The format for PDB IDs is one digit followed by three alphanumeric characters, e.g.,
1MDL
1MRA
3UXL

NCBI GI Numbers
An NCBI GI number (now retired) is a series of digits.
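The formats listed above can be expressed as regular expressions. In the sketch below, the UniProt pattern is the one published in UniProt's accession number help; the others are written directly from the descriptions in this section and may be narrower than what NCBI actually issues. Checking all-digit strings as GI numbers before trying the PDB pattern is an assumption, and `classify_id` is a hypothetical name.

```python
import re

UNIPROT = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Order matters: the first pattern that matches the whole string wins.
PATTERNS = [
    ("uniprot", re.compile(UNIPROT)),
    ("ncbi_swissprot", re.compile(UNIPROT + r"\.\d+")),
    ("refseq", re.compile(r"[A-Z]{2}_\d+\.\d+")),
    ("genbank", re.compile(r"[A-Z]{3}\d{5}\.\d+")),
    ("gi", re.compile(r"\d+")),
    ("pdb", re.compile(r"\d[A-Za-z0-9]{3}")),
]


def classify_id(identifier):
    """Return the name of the first format the identifier fully matches."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return "unknown"


kinds = {
    acc: classify_id(acc)
    for acc in ["P11444", "A0A0A7PVN6", "WP_016501748.1",
                "Q31XL1.1", "BAN56663.1", "1MDL", "347012980"]
}
```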

Sequences and annotations may not be retrievable for NCBI IDs, PDB IDs, and GI numbers because “equivalent” UniProt matches could not be located in the UniProt idmapping file (the UniProt database is smaller than the NCBI database; some GI numbers may not be current).

Option D reads the accession IDs in the user-uploaded text file. For a UniProt ID, the sequence and annotation information is retrieved from our local database downloaded from UniProt. Some UniProt IDs may not be in the database used to generate SSNs: because our database is downloaded with every other release of the UniProt database (every 8 weeks), the user’s input file may contain more recent UniProt IDs that are not in our database.

When an NCBI ID, PDB ID, or GI number is located in the idmapping file provided by UniProt, the “equivalent” UniProt ID is used to retrieve the sequence and annotation information from our database. In the SSN, the identity of the NCBI ID, PDB ID and/or GI number is included in the “Query_ID” node attribute.

Not all NCBI IDs and GI numbers are included in the idmapping file because the UniProt database is smaller than the NCBI database, so sequences and annotations will not be retrieved for some of the NCBI IDs. These IDs are added to the “nomatch” list that can be downloaded from the “Analyze Data” page. In the nomatch file, UniProt IDs that could not be located are designated “NOT_FOUND_DATABASE”; NCBI and PDB IDs that could not be located are designated “NOT_FOUND_IDMAPPING”. When several IDs locate the same UniProt ID, DUPLICATE is noted in the source attribute column.

The SSNs generated with Option D provide a node attribute (“Query ID”) that associates the UniProt IDs in the SSN (in the “name” and “shared name” node attributes) with the NCBI IDs, PDB IDs, and GI numbers provided in the input file. Multiple NCBI and PDB IDs can be associated with the same UniProt ID; when this occurs, the node attribute is a list of the IDs associated with the UniProt ID. This node attribute can be searched in Cytoscape so that the user can locate the sequences/node attributes for the input accession IDs.
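The bookkeeping described for Option D (one node per UniProt ID, a list of query IDs per node, and a nomatch list with a reason) can be sketched as below. This is an illustration under stated assumptions, not EFI-EST's implementation: `resolve_ids`, the `is_uniprot` predicate, and the small mapping table are all hypothetical, and the example mapping is for demonstration only.

```python
def resolve_ids(query_ids, local_db, idmapping, is_uniprot):
    """Group input IDs by UniProt ID and collect IDs that cannot be resolved.

    local_db: set of UniProt IDs present in the local database.
    idmapping: dict from NCBI/PDB/GI IDs to UniProt IDs.
    is_uniprot: callable deciding whether an ID is in UniProt format.
    """
    query_ids_by_node = {}  # UniProt ID -> list of input IDs ("Query ID")
    nomatch = []            # (input ID, reason) pairs
    for qid in query_ids:
        if is_uniprot(qid):
            if qid in local_db:
                query_ids_by_node.setdefault(qid, []).append(qid)
            else:
                nomatch.append((qid, "NOT_FOUND_DATABASE"))
        elif qid in idmapping:
            query_ids_by_node.setdefault(idmapping[qid], []).append(qid)
        else:
            nomatch.append((qid, "NOT_FOUND_IDMAPPING"))
    return query_ids_by_node, nomatch


# Illustrative inputs only.
local_db = {"P11444"}
idmapping = {"AAA25887.1": "P11444", "1MDR": "P11444"}
looks_like_uniprot = lambda s: "." not in s and "_" not in s and len(s) in (6, 10)
nodes, nomatch = resolve_ids(
    ["P11444", "AAA25887.1", "1MDR", "A0A0A7PVN6", "WP_000000000.1"],
    local_db, idmapping, looks_like_uniprot,
)
```

Here three input IDs collapse onto a single node, which is the situation in which the attribute becomes a list.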

As described for Option C, the user can specify one or more Pfam and/or InterPro families to be included in the SSN. The node attributes for the sequences in the Pfam/InterPro family members will be those provided in Option B. The SSN includes a node attribute that specifies whether the sequence is associated with a sequence in the input file (USER) or Pfam/InterPro family (FAMILY).

Advanced Options: same as those described for Option C.

Advanced Options (magenta arrows): By clicking on the Advanced Options tab below the input box, you can enter a “custom” value used in the all-by-all BLAST. You also can select a fraction of the sequences in the input Pfam and/or InterPro family(ies) so that you can generate an “overview” network for families ≤ 100,000 sequences.

Fraction: This advanced option applies ONLY to the sequences in the Pfam or InterPro family if so specified, not in the user-supplied FASTA file. As in Option B, although the number of sequences that can be used to generate an SSN is limited to ≤ 100,000, with this advanced option you can select a fraction of the total number of sequences for larger sequence sets to generate a network.

Utility for the identification and coloring of independent clusters within a SSN.


Figure 6. Settings for coloring utility.

The EFI-GNT server for generating genome neighborhood networks (GNNs; http://efi.igb.illinois.edu/efi-gnt/) retrieves genome neighborhood information for sequences in an input SSN. The input SSN is generated by EFI-EST (Options A, B, D, and E based on UniProt IDs) or exported by Cytoscape after analysis. EFI-GNT recognizes the clusters in the SSN and extracts the UniProt IDs for the sequences in each cluster. Each cluster is assigned a unique cluster number, and the nodes for the sequences in each cluster are assigned a unique color. This “colored SSN” is available for download, along with the GNNs. The colored SSN assists the user in analyzing the GNNs by allowing color-guided association of the cluster nodes in the GNNs with the clusters in the input SSN.

However, a colored SSN also is useful for analyses of SSNs. For example, instead of analyzing a monochromatic SSN, the colored SSN may provide the ability to more easily locate and identify clusters in complicated SSNs.

Also, the colors in a colored SSN can be used to identify how isofunctional clusters emerge as the alignment score is increased (vide infra). Sequences in clusters that are intermingled at low values of the alignment score and segregate into separate clusters as the alignment score is increased may share functional properties. This tracking of cluster separation is made “easy” if the colors assigned to the clusters in the “final” colored SSN with segregated clusters can be assigned to the nodes/sequences in SSNs filtered with smaller alignment scores.
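Recognizing clusters and giving each a number and a color is, in graph terms, a connected-components problem. The sketch below shows one way to do it with a depth-first search. Numbering clusters from largest to smallest and cycling through a fixed palette are assumptions (a palette shorter than the number of clusters would repeat colors), and `color_clusters` is a hypothetical name rather than the utility's actual code.

```python
from collections import defaultdict


def color_clusters(nodes, edges, palette):
    """Assign (cluster number, color) to every node of an SSN."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for node in nodes:
        if node in seen:
            continue
        # Depth-first walk collecting everything reachable from this node.
        stack, members = [node], []
        seen.add(node)
        while stack:
            current = stack.pop()
            members.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(members)

    clusters.sort(key=len, reverse=True)  # assumption: largest cluster is 1
    assignment = {}
    for number, members in enumerate(clusters, start=1):
        color = palette[(number - 1) % len(palette)]
        for member in members:
            assignment[member] = (number, color)
    return assignment


assignment = color_clusters(
    ["A", "B", "C", "D", "E", "F"],
    [("A", "B"), ("B", "C"), ("D", "E")],
    ["#FF0000", "#0000FF", "#00AA00"],
)
```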


Function

Plays a major role in tight junction-specific obliteration of the intercellular space, through calcium-independent cell-adhesion activity. (Manual assertion inferred from sequence similarity.)


DDBJ/EMBL/GenBank Accession Prefix Format

The formats for GenBank accession numbers are:

Nucleotide Accession Prefixes

Protein Accession Prefixes

Swiss-Prot/UniProtKB accession numbers follow a different format.

RefSeq Accession Format

The RefSeq projects are NCBI sequence annotation projects and are not part of DDBJ/EMBL/GenBank. RefSeq accession numbers can be distinguished from GenBank accessions by their distinct format of an underbar in the third position.
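The distinguishing rule stated above (an underscore in the third position) is easy to check in code. The sketch below applies only that rule and does not validate the rest of the accession; `is_refseq` is a hypothetical name.

```python
def is_refseq(accession):
    """Return True if the accession has an underscore in the third position."""
    return len(accession) > 2 and accession[2] == "_"


refseq_flags = {acc: is_refseq(acc) for acc in ["WP_016501748.1", "NP_708575.1", "AAA25887.1"]}
```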


3 RESULTS AND DISCUSSION

Initial population of the database takes ∼10 days indicating why it is very important that the system is able to be updated. An update corresponding to a new full release of UniProtKB/SwissProt takes <17 h. Approximate timings for populating the database and updating it are shown in Table 1.

Table 1. Approximate times required to populate and update the database, shown in hours (approximate wall-clock time)

Processing stage               Initial population   Updating
Processing SwissProt                          0.5        0.5
Processing trEMBL                             1.5        1.5
Processing PDB files                          2.0        0.1
Fixing cross-references, etc                  0.5        0.2
Brute-force scan                            216.0       13.0
Performing alignments                        13.5        0.6
Dumping results                               0.3        0.3
Database data analysis                        0.5        0.5
Total                                       234.8       16.7

Timings were on a system using an Athlon XP 2800+ processor, but are highly dependent on other parameters such as disk and network access speeds and, most importantly, the size of the database. ‘Database data analysis’ represents the time taken for PostgreSQL analyze steps to update the indexes—see text.

The PostgreSQL database is easily able to cope with the rather large tables. The ‘sprot’, ‘idac’ and ‘acac’ tables have more than 2 million rows each, while the ‘alignment’ table contains nearly 8 million rows. However, we found it was important to run the PostgreSQL analyze command at regular intervals while populating the database. This updates the statistics on the database contents and allows indexes to work with maximum efficiency. If this was not done, the main ‘postmaster’ process could start to crawl using lots of CPU time and achieving very little.

Table 2 shows the number of chains mapped to UniProt entries from each of the sources of information. The vast majority of entries mapped using a link in the PDB entry will also have a link from UniProt. However, since links from the PDB currently take priority over links from UniProtKB, this information is not recorded.
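The priority rule described above can be sketched as an ordered lookup. Only the precedence of PDB links over UniProtKB links is stated in the text; placing the brute-force scan last is an assumption, and `choose_mapping` and the source labels are hypothetical names borrowed from the table.

```python
# Sources in decreasing priority; the position of the brute-force scan
# is an assumption, not something the text states.
PRIORITY = ("PDB entry", "UniProtKB", "Brute-force scan")


def choose_mapping(candidates):
    """Pick the recorded source for a chain.

    candidates: dict from source name to the UniProt accession it proposes.
    Returns (source, accession), or ("Unmatched", None) if nothing is found.
    """
    for source in PRIORITY:
        accession = candidates.get(source)
        if accession:
            return source, accession
    return "Unmatched", None


# A chain linked from both directions is recorded under "PDB entry" only.
both = choose_mapping({"PDB entry": "P00282", "UniProtKB": "P00282"})
only_uniprot = choose_mapping({"UniProtKB": "P12520"})
none_found = choose_mapping({})
```

This is why the UniProtKB row of the table counts only chains that lack a link in the PDB entry.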

Table 2. Sources of link information in the complete mapping

Source of mapping data   Number of chains mapped
PDB entry                40 664
UniProtKB                15 057 a
Brute-force scan         10 324 b
DNA                      6261
Short peptides           1647
fasta33 failed           111
Unmatched                1063

a Since links from PDB to UniProtKB take priority over links in the other direction, this figure considers only those links from UniProtKB to PDB where links in the other direction are absent.

b While 10 324 chains were assigned by the brute-force scan, 815 of these were chains in multi-chain PDB files linked from UniProtKB/SwissProt but which were not identified as matching because other chains matched with a higher sequence identity. The true number of additional chains found by the brute-force scan is therefore 9509.

3.1 Comparison with the EBI mapping

As a validation of the mapping we have created, we have made some comparisons with the mapping produced and kindly provided to us by the EBI.

We have identified one case in which our method picked a protein from the wrong species. PDB entry 1rbf (blank chain name) is an exact match to UniProtKB/SwissProt entry P61824 from Bison bison. However, 1rbf is a structure of part of the chain from Bos taurus (P61823). Over the 104 residues of the sequence included in the structure, these two sequences are 100% identical. A second anomaly is chain A of PDB file 1aby (Looker et al., 1992), which consists of two copies of the haemoglobin alpha chain (UniProtKB/SwissProt entry P69907) spliced together. Currently our mapping and the EBI MSDLite mapping both match only one of these copies in the alignment. Thus far, we have identified no other anomalies in our data.

We did, however, find a small number of minor problems in the EBI mapping. PDB entry 1dsj corresponds to UniProtKB/SwissProt entry P12520 and the chain begins with a HETATM ‘ACE’ group (an N-terminal acetylation) and ends with an additional HETATM ‘NH2’ group. The most recent downloadable EBI mapping, dated September 21, 2004, maps both of these to real amino acids (Thr49 and Cys76 in the UniProtKB/SwissProt entry, respectively). However, the new mapping from UniProtKB/SwissProt to residue ranges within chains has corrected this error.

We also identified an error in the EBI's downloadable mapping for 5azu which contains four identical chains (A–D). All these match UniProtKB/SwissProt entry P00282. However, in their mapping residues 28–30 of the B chain were erroneously identified as coming from Q51325 (this is a secondary accession code for P19919). Again this error does not occur in the mapping from UniProtKB/SwissProt residue ranges to PDB chains.
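The Q51325/P19919 case shows why it helps to normalise accessions to their primary form before comparing two mappings. The following is a minimal sketch, assuming a two-column, whitespace-separated listing of secondary and primary accessions is available. The function names are hypothetical, and the pairs used in the demo are only those mentioned on this page. A demerged secondary accession can point to more than one primary accession, so each key maps to a list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_secondary_map(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build a secondary -> [primary, ...] lookup from 'secondary primary' lines."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip headers, blank lines and anything malformed
        secondary, primary = parts
        if primary not in mapping[secondary]:
            mapping[secondary].append(primary)
    return dict(mapping)


def to_primary(accession: str, mapping: Dict[str, List[str]]) -> List[str]:
    """Return the primary accession(s); unknown accessions are assumed primary."""
    return mapping.get(accession, [accession])


# Demo data: pairs quoted in the text, not a real download.
demo_listing = [
    "Q51325 P19919",
    "P02303 P61830",
    "P29358 P68250",  # demerged entry: one secondary, two primaries
    "P29358 P68251",
]
sec_map = load_secondary_map(demo_listing)
print(to_primary("Q51325", sec_map))
print(to_primary("P29358", sec_map))
print(to_primary("P00282", sec_map))
```

Treating an accession that is absent from the listing as already primary is an assumption of this sketch; an obsolete or mistyped accession would pass through unchanged.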

The mapping provided in the UniProtKB/SwissProt file gives a PDB chain and then specifies the range of residues within the UniProtKB/SwissProt entry that matches that chain. This scheme is unable to address chimeric sequences such as that found in PDB file 1a7m (Hinds et al., 1998). In this PDB file, residues 1–47 and 82–180 come from UniProtKB/SwissProt entry P09056 while residues 48–81 come from P15018. In these two UniProtKB/SwissProt entries, a cross-reference to PDB file 1a7m is provided, but the residue range is not given. Our system correctly addresses chimeric chains from the PDB, provided that DBREF records describing the chimeric construction are present. The exception to correct processing of chimeric chains is the ‘self-chimera’, 1aby chain A, described above.

While the downloadable mapping from the EBI is not regularly updated, the MSDLite web server also contains mapping data. We have noted some anomalies in these data as well. For example, while the downloadable mapping for PDB entry 487d adopts the same strategy as ours of simply ignoring non-standard amino acids (MSE at I113, I116 and I182), the MSDLite server correctly identifies the UniProtKB entries, but does not include an alignment at all. Similarly for PDB entry 1val, the MSDLite identifies the same UniProtKB entries as our server, but provides no alignment.

At the time of writing, we have identified 115 chimeric chains in the PDB for which residue range mappings are not present in UniProtKB/SwissProt. As shown in Table 2, the brute-force scan of our method identifies approximately 9500 additional chain mappings (representing ∼12.5% of chains in the PDB) for which cross-links were not present in either the PDB or UniProtKB/SwissProt. After accounting for DNA chains, short peptides and cases where fasta33 failed, only around 1050 chains (1.5% of chains in the PDB) were unassigned to UniProt sequences. Some chains, such as antibodies, are only partial assignments. The constant domain is assigned, but the variable domain is not because antibody variable domains do not appear in UniProt.

The procedure also identified a number of errors in the residue ranges specified in DBREF records of PDB files. For example, PDB file 1qsn (Rojas et al., 1999) contains a DBREF record which indicates that residues 9–19 of chain B should match residues 9–19 of UniProtKB/SwissProt entry P02303 (a secondary accession which has been replaced by P61830). However, the residues in chain B are numbered from 309, so this range should be 309–319. The DBREF record in PDB entry 1cxx gives a residue range of 81–193 for the A chain matching Q05158, but the ATOM records start from residue 117 and the SEQRES records appear to start from 82. Similar problems were identified in PDB entries 1a45, 1dj8, 1dox, 1doy, 1fo7, 1fv2, 1g50, 1g50, 1g6w, 1g6w, 1g6y, 1gd2, 1hgx, 1hqo, 1hqo, 1hr8, 1hr8, 1hr8, 1jid, 1b10, 1k0a, 1k0a, 1k0b, 1k0b, 1ltj, 1m1d, 1kna, 1kne, 4cat, 2pgk, 1bpl.

3.2 Search interface and availability

The complete mapping is available for download via the author's web site at Author Webpage. The site also provides a search interface allowing searches on the basis of PDB code (optionally with chain label), UniProtKB accession or UniProtKB/SwissProt identifier, all optionally with residue numbers. The results provide links to the PDB and full UniProtKB entries. The web interface also provides a REST-style (representational state transfer) API: an option to return results in plain text, which makes them easy to parse. This allows simple queries to be made from Perl scripts using the Perl LWP package, avoiding the need for ‘screen scraping’ of HTML. This is invaluable for users wishing to employ the results in automated scripts and provides an easy alternative to a SOAP interface. Full instructions are provided on the web site.

The author wishes to thank members of the MSD and SwissProt groups at the EBI (in particular, Sameer Valenka, Virginie Mittard, Phil McNeil, Rolf Apweiler and Kim Henrick) for making their PDB/SwissProt mapping available. This work was funded by a grant from the Wellcome Trust.


INTRODUCTION

We are at a critical point in the development of protein sequence databases. Continuing advances in next generation sequencing mean that for every experimentally characterized protein, there are now many hundreds of proteins that will never be experimentally characterized in the laboratory. In addition, there are new data types being introduced by developing high-throughput technologies in proteomics and genomics. The combination of both provides new opportunities for the life sciences and the biomedical domain. Therefore, it is crucial to identify experimental characterizations of proteins in the literature and to capture and integrate this knowledge into a framework in combination with high-throughput data and automatic annotation approaches to allow it to be fully exploited. UniProt facilitates scientific discovery by organizing biological knowledge and enabling researchers to rapidly comprehend complex areas of biology.

In brief, UniProt is composed of several important component parts. The section of UniProt that contains manually curated and reviewed entries is known as UniProtKB/Swiss-Prot and currently contains about half a million sequences. This section grows as new proteins are experimentally characterized (1). All other sequences are collected in the unreviewed section of UniProt known as UniProtKB/TrEMBL. This portion of UniProt currently contains around 80 million sequences and is growing exponentially. Although entries in UniProtKB/TrEMBL are not manually curated, they are supplemented by automatically generated annotation. UniProt also makes available three sets of sequences that have been made non-redundant at various levels of sequence identity: UniRef100, UniRef90 and UniRef50 (2). The UniParc database is a comprehensive set of all known sequences indexed by their unique sequence checksums and currently contains over 70 million sequence entries (3). The UniProt database has cross-references to over 150 databases and acts as a central hub to organize protein information. Its accession numbers are a primary mechanism for accurate and sustainable tagging of proteins in informatics applications.

In this manuscript we describe the latest progress on developing UniProt. There are numerous challenges facing UniProt's goal to organize and annotate the universe of protein sequences. In particular, the rapid growth in the number of microbial strain sequences has motivated us to create a new proteome identifier, which is described in more detail below. A central activity of UniProt is to curate information about proteins from the primary literature. In this paper we look at the annotation of enzymes with a focus on orphan enzyme activities. The UniProt database is used by thousands of scientists around the world every day and its website was visited by over 400 000 unique visitors in 2013. We describe a complete redevelopment of the website based on a user experience design process below.


Protein Sequence Alignment from Protein Databank to Cosmic or Uniprot

I would like to match up PDB files from the Protein Data Bank to the canonical amino acid sequences for the protein as displayed in COSMIC or UniProt. Specifically, I need to pull from the PDB file the alpha carbon atoms in the backbone and their xyz positions. I also need to pull their actual order in the protein's sequence. For structure 3GFT (KRAS, UniProt accession number P01116), this is easy: I can just take the ResSeq number. However, for some other proteins, I can't figure out how this is possible.

For example, for structure 2ZHQ (protein F2, UniProt accession number P00734), the SEQRES has the ResSeq numbers repeated for numbers "1" and "14", and the entries differ only in the iCode field. Further, the iCode entries are not in lexicographic order, so it's hard to tell in what order to extract them.

It gets even worse if you consider structure 3V5Q (UniProt accession number Q16288). For most of the protein, the ResSeq number matches the actual amino acid position from a source like COSMIC or UniProt. However, after position 711, it jumps to position 730. REMARK 465 (the missing residues) shows that for chain A, residues 726-729 are missing. However, after matching it up to the protein, those amino acids are actually 712-715.

I've attached code that works for the simple 3GFT example, but if someone is an expert in PDB files and can help me get the rest of it figured out, I would be much obliged.
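The attached code is not reproduced on this page. As a starting point for the insertion-code problem, here is a minimal sketch that reads alpha carbons straight from the fixed-column ATOM records and keeps them in file order, so residues such as 1H, 1G and 1 are never sorted by their iCode. It assumes file order follows chain order, which is the usual convention. The helper `_demo_line` and all coordinates are made up for illustration. Residues listed only in REMARK 465 have no ATOM records, so mapping to UniProt positions still needs a sequence alignment or an external residue-level mapping.

```python
from typing import Iterable, List, NamedTuple, Optional


class CaAtom(NamedTuple):
    chain: str
    res_seq: int
    i_code: str
    res_name: str
    x: float
    y: float
    z: float


def extract_ca_atoms(pdb_lines: Iterable[str],
                     chain_id: Optional[str] = None) -> List[CaAtom]:
    """Return alpha carbons in file order, using the PDB fixed-column layout."""
    atoms: List[CaAtom] = []
    seen = set()
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # only the first model of a multi-model file
        if not line.startswith("ATOM") or line[12:16] != " CA ":
            continue  # HETATM records (e.g. MSE, calcium ions) are skipped
        chain = line[21]
        if chain_id is not None and chain != chain_id:
            continue
        res_seq = int(line[22:26])
        i_code = line[26].strip()
        key = (chain, res_seq, i_code)
        if key in seen:
            continue  # keep the first alternate location only
        seen.add(key)
        atoms.append(CaAtom(chain, res_seq, i_code, line[17:20].strip(),
                            float(line[30:38]), float(line[38:46]),
                            float(line[46:54])))
    return atoms


def _demo_line(serial, res_name, chain, res_seq, i_code, x, y, z):
    # Hypothetical helper: builds a well-formed CA record for the demo only.
    return (f"ATOM  {serial:>5}  CA  {res_name:>3} {chain}{res_seq:>4}"
            f"{i_code:1}   {x:8.3f}{y:8.3f}{z:8.3f}")


# Made-up records imitating insertion codes that are not in alphabetical order.
demo_lines = [
    _demo_line(1, "THR", "L", 1, "H", 10.0, 20.0, 30.0),
    _demo_line(2, "PHE", "L", 1, "G", 11.0, 21.0, 31.0),
    _demo_line(3, "GLY", "L", 1, "", 12.0, 22.0, 32.0),
    _demo_line(4, "SER", "L", 2, "", 13.0, 23.0, 33.0),
]
atoms = extract_ca_atoms(demo_lines, chain_id="L")
for position, atom in enumerate(atoms, start=1):
    print(position, f"{atom.res_seq}{atom.i_code}", atom.res_name, atom.x, atom.y, atom.z)
```

The running `position` counter gives the order among observed residues only; it is not a UniProt position.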



The Gene Ontology (GO) project was established to provide a common language to describe aspects of a gene product's biology. A gene product's biology is represented by three independent structured, controlled vocabularies: molecular function, biological process and cellular component. For more information on GO, see the SGD GO Help page or the GO consortium home page.

To provide the most detailed information available, gene products are annotated to the most granular GO term(s) possible. For example, if a gene product is localized to the perinuclear space, it will be annotated to that specific term only and not the parent term nucleus. In this example the term perinuclear space is a child of nucleus. However, for many purposes, such as analyzing the results of microarray expression data, it is very useful to "calculate" on GO, moving up the GO tree from the specific terms used to annotate the genes in a list to find GO parent terms that the genes may have in common.

The GO Term Finder tool allows you to do this. It finds significant GO terms shared among a list of genes from your organism of choice, helping you discover what these genes may have in common (example results for SGD and a simple query list). To map granular GO annotations for genes in a list to more general terms, binning them into broad categories, please use the GO Term Mapper tool.

    Required Basic Input Options

    1. Enter a list of genes
      Either type the name of the genes (separate each gene by a return) in the input box or upload a file that contains the gene names. The upload file may be a single list of gene names, one name per line, or it may be an archive containing multiple files, each consisting of a list. For example, an archive might contain these files: By default all files will be processed. If the archive contains other files, specify the file name extension of the gene list files (for example 'txt' or 'list') in the advanced options section.

To create an archive using tar (most commonly found on UNIX or MacOS X), you could do something like this:
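The original command is not reproduced on this page. As a rough equivalent, here is a minimal sketch using Python's standard `tarfile` module. The function name and file names are hypothetical, and mode `"w"` writes an uncompressed archive, so the output should carry a `.tar` extension.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def make_gene_list_archive(list_files: Iterable[str], archive_path: str) -> str:
    """Bundle gene-list files into an uncompressed .tar archive."""
    with tarfile.open(archive_path, "w") as tar:
        for path in list_files:
            # Store each file under its base name, without directory parts.
            tar.add(path, arcname=Path(path).name)
    return archive_path


# Example call (hypothetical file names):
# make_gene_list_archive(["cluster1.txt", "cluster2.txt"], "gene_lists.tar")
```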

On Windows, use an archive utility such as WinZip to create a .zip or .tar file. Create a new archive file and just drag the files or directories into it that you wish to submit.

Once you have created the .tar or .zip file, simply hit "Browse" and select it as the file to upload. Note that the extension (.tar, .zip, etc.) must correctly match the file type in order for the server to properly process the file.

The table below lists the types of identifiers in the gene association files that the GO Term Finder program currently accepts as gene names. It also provides links to tools that convert from one identifier system to another, so that you can convert your identifiers into a type that GO Term Finder accepts.


    Enter Number of Gene Products Estimated for the Organism
    This total gene number is used to calculate the background distribution of GO terms.

GO Term Finder looks for significant GO terms shared among groups of genes in your list of input genes (see table below). To determine the statistical significance of a particular GO term associated with a group of genes in the list, GO Term Finder calculates the p-value - the probability or chance of seeing at least x number of genes out of the total n genes in the list annotated to a particular GO term, given that y number of genes out of the total N genes within the genome known to have that GO term annotation (i.e. given the background distribution). The closer the p-value is to zero, the more significant the particular GO term associated with the group of genes is (i.e. the less likely the observed annotation of the particular GO term to a group of genes occurs by chance).

Terms from the Function Ontology for Different Mouse Gene Numbers with P-value Cutoff of 0.01

Gene Ontology Term | Cluster Frequency | Genome Frequency of Use | P-value | Genes Annotated to the Term
calcium-transporting ATPase activity | 3 out of 9 genes (33.3%) | 5 out of 33884 genes (0.0%) | 2.46e-09 | MGI:105368, MGI:1347353, MGI:1889008
ATPase activity | 3 out of 9 genes (33.3%) | 237 out of 33884 genes (0.7%) | 0.00052 | MGI:105368, MGI:1347353, MGI:1889008
carrier activity | 3 out of 9 genes (33.3%) | 410 out of 33884 genes (1.2%) | 0.00265 | MGI:105368, MGI:1347353, MGI:1889008
calcium-transporting ATPase activity | 3 out of 9 genes (33.3%) | 5 out of 15000 genes (0.0%) | 2.83e-08 | MGI:105368, MGI:1347353, MGI:1889008
ATPase activity | 3 out of 9 genes (33.3%) | 237 out of 15000 genes (1.6%) | 0.00579 | MGI:105368, MGI:1347353, MGI:1889008
carrier activity | - | - | - | -

The p-value of a GO term associated with a group of genes in your list is affected by the total number of genes estimated for the organism. The higher that total, the closer the p-value is to zero and the more significant the GO term annotation for the group of genes in the list (see the table above; compare rows 1, 2 and 3 with rows 4, 5 and 6, respectively). For example, when searching the function ontology with a p-value cutoff of 0.01, no significant 'carrier activity' GO term was found for the list of 9 mouse genes when the total was set to 15,000 mouse genes (row 6, because the p-value was above the cutoff of 0.01). With the estimated total of 33,884 mouse genes (row 3), the 3 of the 9 genes annotated to 'carrier activity' gave a p-value of 0.00265, which is below the cutoff. Thus, although the same number of mouse genes (410) is annotated to 'carrier activity' in both cases, the higher total gene number (33,884 versus 15,000) lowers the frequency with which the term is used across the genome and therefore yields a lower p-value for the genes in the list annotated to that term.

The p-value of a GO term associated with a group of genes in your list is also affected by the number of genes within the organism that have that GO term annotation. The more genes in the organism that carry a particular GO term annotation, the further the p-value is from zero and the less significant the association of that term with the group of genes in the list. For example, as shown in the table above, although the same 3 mouse genes in the list are annotated to both 'calcium-transporting ATPase activity' (row 1) and 'carrier activity' (row 3), the 'calcium-transporting ATPase activity' term is more significant (i.e. has a lower p-value) than the 'carrier activity' term, because more genes in the mouse genome are annotated to 'carrier activity'.
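The probability described above (at least x of the n listed genes annotated to a term, given that y of the N genes in the genome carry it) is the upper tail of a hypergeometric distribution. The sketch below computes that raw tail probability with exact integer arithmetic. The function name is hypothetical. The tool may also apply a multiple-testing correction, so these raw values are not expected to reproduce the table, only the two trends discussed above.

```python
from math import comb


def hypergeom_pvalue(x: int, n: int, y: int, N: int) -> float:
    """P(at least x of n sampled genes carry the term | y of N genes carry it)."""
    total = comb(N, n)  # all ways to draw n genes from the genome
    tail = sum(
        comb(y, k) * comb(N - y, n - k)  # exactly k annotated genes in the draw
        for k in range(x, min(n, y) + 1)
    )
    return tail / total


# Same list (3 of 9 genes annotated), two background sizes for 'ATPase activity'.
p_large_genome = hypergeom_pvalue(3, 9, 237, 33884)
p_small_genome = hypergeom_pvalue(3, 9, 237, 15000)
print(p_large_genome < p_small_genome)  # larger background, smaller p-value

# Same background, rare term versus common term.
p_rare_term = hypergeom_pvalue(3, 9, 5, 33884)
p_common_term = hypergeom_pvalue(3, 9, 410, 33884)
print(p_rare_term < p_common_term)  # rarer term, smaller p-value
```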

For more information on how GO Term Finder determines the statistical significance of GO terms annotation, please see the Description of GO Term Finder Algorithm at SGD or How GO Term Finder Calculates P-values (also available in PDF ).

The Gene Association File Table lists the total annotated gene products and total estimated gene products for each organism. If the total estimated gene number of an organism is known, GO Term Finder uses it as the default total gene number for that organism. If not, GO Term Finder uses the total number of annotated genes in the organism's gene association file as the default.

If you prefer to use a different total gene number for an organism in the background distribution calculation of GO terms, you can type the number of gene products you estimate for the organism in the provided text box to override the program's default. However, if the number you enter is smaller than the total number of annotated genes in the organism's gene association file, GO Term Finder ignores it and uses the program's default total gene number for the organism.

The FDR is calculated by running 50 simulations with random genes and counting the average number of times a p-value as good as or better than a p-value generated from the real data is seen. This is used as the numerator. The denominator is the number of p-values in the real data that are as good as or better than it.
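The ratio described above can be sketched as follows. This is an illustration of the stated numerator and denominator, not the tool's actual code. The function name is hypothetical, the simulated p-values are assumed to have been generated already (one list per random-gene run), and capping the estimate at 1.0 is an added assumption.

```python
from typing import List, Sequence


def estimate_fdr(real_pvalues: Sequence[float],
                 simulated_runs: Sequence[Sequence[float]]) -> List[float]:
    """Estimate an FDR for each real p-value from random-gene simulations."""
    fdrs = []
    for p in real_pvalues:
        # Numerator: average count per simulation of p-values as good as or better than p.
        hits_per_run = [sum(1 for q in run if q <= p) for run in simulated_runs]
        expected_false = sum(hits_per_run) / len(simulated_runs)
        # Denominator: real p-values as good as or better than p (always at least 1).
        observed = sum(1 for q in real_pvalues if q <= p)
        fdrs.append(min(1.0, expected_false / observed))
    return fdrs


# Toy numbers for illustration; the tool is described as using 50 simulations.
real = [0.001, 0.01, 0.05]
simulations = [[0.02, 0.5], [0.005, 0.04]]
print(estimate_fdr(real, simulations))
```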

relationship: regulates
relationship: positively_regulates
relationship: negatively_regulates

With this option checked, terms that are related by regulation (and possibly in no other way) are also included in the search, in just the same way as the traditional links:

Gene Association File Table lists the organism default gene URLs used by the GO Term Finder program.

For example, 'http://db.yeastgenome.org/cgi-bin/SGD/locus.pl?locus=xxxx' is the GO Term Finder program's default gene URL for Saccharomyces cerevisiae, where xxxx is a SGD_ID, SGD gene name, or SGD systematic ORF name (e.g. http://db.yeastgenome.org/cgi-bin/SGD/locus.pl?locus=YPL250C). If you prefer to use the old Saccharomyces cerevisiae gene URL 'http://genome-www4.stanford.edu/cgi-bin/SGD/locus.pl?locus=', you can type the old gene URL in the provided text box to override the program's default gene url.

In general, the ontology and gene association files are downloaded nightly from the GO FTP site. Occasionally, a problem with a particular file delays its update. For example, sometimes an association file does not conform exactly to our understanding of the specification. In that case, the file is removed from the annotation selection pop-up menu, and a notice is printed below the pop-up menu, until the situation is resolved. There may be other reasons for a delay in updating a particular file.

The tables below show the version, GOC validation dates (where available and applicable), and other information for files that are currently in use.

Organism, Gene Associations, and Authority | Total Annotated Gene Products | Total Estimated Gene Products | Identifiers | Example IDs | Identifier Conversion Tool(s) | Evidence Code Counts
Skin parasite - Leishmania major
L. major GeneDB
gene_association.GeneDB_Lmajor
README
2778 Systematic_ID
Systematic_ID
L302.10
L2256.04
LM5.39
sample list
EXP(61) IDA(230) IPI(46) IMP(123) IGI(27) IEP(2) ISS(164) ISO(5105) ISA(200) ISM(184) IGC(1) RCA(53) TAS(8) IC(5)
Malaria parasite - Plasmodium falciparum
P. falciparum GeneDB
gene_association.GeneDB_Pfalciparum
README
2370 5400 Systematic Name
Systematic Name
PFL1830w
2277.t00366
PFL1830W
sample list
EXP(10) IDA(1890) IPI(122) IMP(32) IGI(17) IEP(5) ISS(2739) ISO(137) ISM(42) IGC(5) RCA(420) TAS(759) NAS(14) IC(56) ND(2)
Default URL template: http://www.genedb.org/genedb/Search?organism=malaria&name=
Trypanosome - Trypanosoma brucei
T. brucei GeneDB
gene_association.GeneDB_Tbrucei
README
6362 Systematic Name
Gene Name
Gene Synonym
Tb927.7.4670
RRP4
TB927.7.4670
sample list
EXP(123) IDA(10016) IPI(517) IMP(794) IGI(42) IEP(14) ISS(492) ISO(476) ISA(995) ISM(3606) RCA(1145) TAS(589) NAS(4) IC(50)
Default URL template: http://www.genedb.org/genedb/Search?organism=tryp&name=
Candida - Candida albicans
CGD
gene_association.cgd
README
63701 CGD_ID
Standard Name
Systematic name
CAL0004982
CaO19.6783
CA5922
Contig4-2621_0008
orf6.8848
sample list
IDA(2807) IPI(71) IMP(5928) IGI(932) IEP(46) ISS(1868) ISO(349) ISA(170) ISM(1328) TAS(48) NAS(173) IC(35) ND(16192) IEA(320654)
Default URL template: http://www.candidagenome.org/cgi-bin/locus.pl?locus=
Slime mold - Dictyostelium discoideum
DictyBase
gene_association.dictyBase
9504 12098 DictyBase_ID
Gene Name
Alias
DdP2X
DDB_G0272004
p2xA
sample list
IDA(3820) IPI(1086) IMP(2955) IGI(541) IEP(217) ISS(3398) IGC(80) TAS(415) NAS(6) IC(143) ND(6358) IEA(42365)
Default URL template: http://dictybase.org/db/cgi-bin/dictyBase/locus.pl?locus=
Fruit fly - Drosophila melanogaster
FlyBase
gene_association.fb
README
14527 16085 FlyBase_ID
Gene Symbol
Gene Synonym
FBGN0031491
alpha4GT1
4-N-acetylgalactosaminyltransferase-1
CG17223
alpha1
sample list
IDA(17446) IPI(3669) IMP(23605) IGI(3871) IEP(715) ISS(10968) ISO(3) ISA(134) ISM(2813) IGC(29) TAS(2751) NAS(1457) IC(1246) ND(7895) IEA(8261)
Default URL template: http://flybase.bio.indiana.edu/.bin/fbidq.html?
Bacterium coli - Escherichia coli
GOA @EBI
gene_association.goa_Ecoli
README
7187 7187 UniProt_Accession (or Ensembl_ID)
UniProt_ID (or Ensembl_ID)
International Protein Index
A3QXC6
A3QXC6_ECOLX
sample list
IDA(10) IPI(140) IEA(45310)
Chicken - Gallus gallus
GOA @EBI
gene_association.goa_chicken
README
16546 30837 UniProt_Accession (or Ensembl_ID)
UniProt_ID (or Ensembl_ID)
International Protein Index
FGB
IPI00588322
FIBB_CHICK
Q02020
sample list
EXP(3) IDA(1865) IPI(476) IMP(810) IGI(20) IEP(222) ISS(5774) ISO(36) ISA(581) ISM(22) RCA(11) TAS(689) NAS(138) IC(20) ND(67) IEA(92409)
Cow - Bos taurus
GOA @EBI
gene_association.goa_cow
README
19797 37225 UniProt_Accession (or Ensembl_ID)
UniProt_ID (or Ensembl_ID)
International Protein Index
FGG
P12799
IPI00699860
FIBG_BOVIN
sample list
EXP(4) IDA(1636) IPI(604) IMP(258) IGI(13) IEP(5) ISS(18865) ISA(151) RCA(2) TAS(665) NAS(52) IC(10) ND(102) IEA(115965)
Human - Homo sapiens
GOA @EBI
gene_association.goa_human
README
19751 UniProt_Accession (or Ensembl_ID)
UniProt_ID (or Ensembl_ID)
International Protein Index
TGFR1_HUMAN
IPI00005733
P36897
TGFBR1
sample list
EXP(463) IDA(79999) IPI(188168) IMP(23096) IGI(1892) IEP(898) ISS(26242) ISO(8) ISA(1489) ISM(723) IGC(1) RCA(469) TAS(103620) NAS(7251) IC(1319) ND(1785) IEA(75019)
Default URL template: http://www.ensembl.org/Homo_sapiens/geneview?gene=
Human - Homo sapiens
GOA @EBI + Ensembl
gene_association.goa_human_ensembl
README
19499 UniProt_Accession (or Ensembl_ID)
UniProt_ID (or Ensembl_ID)
International Protein Index with additional crossreferenced gene symbols
FZD6
B4DRN0_HUMAN
ENSG00000164930
B4DRN0
sample list
EXP(1271) IDA(70458) IPI(90026) IMP(19988) IGI(1469) IEP(893) ISS(21741) ISA(2) ISM(1) TAS(107837) NAS(7482) IC(1410) ND(1885) IEA(81176)
Default URL template: http://www.ensembl.org/Homo_sapiens/geneview?gene=
Human - Homo sapiens
GOA @EBI + XREFs
gene_association.goa_human_hgnc
README
19663 UniProt_Accession (or Ensembl_ID)
UniProt_ID (or Ensembl_ID)
International Protein Index with additional crossreferenced gene symbols
HGNC:4854
FZD6
O60353
HGNC:4044
4044
FZD6_HUMAN
sample list
EXP(1273) IDA(70998) IPI(97274) IMP(20223) IGI(1533) IEP(900) ISS(22483) ISO(8) ISA(1449) ISM(769) TAS(104438) NAS(8120) IC(1417) ND(1874) IEA(80560)
Default URL template: http://www.genenames.org/data/hgnc_data.php?hgnc_id=
Rice - Oryza sativa
Gramene
gene_association.gramene_oryza
README
41142 41521 Swiss-Prot/TrEMBL_ID
Gene Name/Symbol
O04138
LOC_Os04g41620
PR-3 CLASS IV CHITINASE
Os04g0493400
CHT4
sample list
IDA(122) IPI(6) IMP(151) IGI(44) IEP(65) ISS(374) RCA(46617) TAS(13) IC(2572)
Default URL template: http://www.gramene.org/perl/protein_search?acc=
Bacillus anthracis
gene_association.jcvi_Banthracis (1.47 03/18/2011)
README
5280 5507 JCVI Locus Name
Gene Symbol
dnaN-2
BA_2684
sample list
IDA(3) IMP(2) ISS(5955) TAS(15) NAS(4) ND(7054)
Coxiella burnetii
gene_association.jcvi_Cburnetii (1.39 03/18/2011)
README
2033 2095 JCVI Locus Name
Gene Symbol
CBU1815
CBU0002
sample list
ISS(2148) TAS(2) ND(2984)
Campylobacter jejuni
gene_association.jcvi_Cjejuni (1.40 03/18/2011)
README
1829 flaB
CJE_1526
sample list
IDA(1) IMP(15) IGI(15) ISS(2577) TAS(1) ND(1985)
Dehalococcoides ethenogenes
gene_association.jcvi_Dethenogenes (1.30 03/18/2011)
1584 DET_0079
tceA
sample list
ISS(2139) TAS(4) ND(1780)
Geobacter - Geobacter sulfurreducens PCA
gene_association.jcvi_Gsulfurreducens (1.39 03/18/2011)
README
3410 3533 JCVI Locus Name
Gene Symbol
GSU_0001
dnaN
sample list
IDA(4) ISS(4148) TAS(2) NAS(8) ND(3988)
Listeria monocytogenes
gene_association.jcvi_Lmonocytogenes (1.46 03/18/2011)
README
2822 LMOF2365_1337
polC
LMOf2365_1337
sample list
IMP(2) ISS(4198) TAS(9) ND(2963)
Methylococcus capsulatus
gene_association.jcvi_Mcapsulatus (1.41 03/18/2011)
README
2925 MCA_1120
sample list
IDA(2) ISS(3981) TAS(8) ND(3250)
Pseudomonas syringae
gene_association.jcvi_Psyringae (1.48 03/18/2011)
README
4012 5763 JCVI Locus Name
Gene Symbol
flgI
PSPTO_1942
sample list
IDA(377) IPI(20) IMP(7) IGI(22) IEP(3) ISS(4348) IGC(31) TAS(41) IC(45) ND(5401)
Shewanella oneidensis
gene_association.jcvi_Soneidensis (1.45 03/18/2011)
README
4842 4843 JCVI Locus Name
Gene Symbol
H
SO_2953
sample list
IMP(5) ISS(5253) TAS(48) ND(6813)
Silicibacter pomeroyi
gene_association.jcvi_Spomeroyi (1.41 03/18/2011)
README
4252 SPO_3786
sample list
IDA(2) ISS(6618) TAS(117) NAS(2) IC(15) ND(3974)
Cholera spirillum - Vibrio cholerae
gene_association.jcvi_Vcholerae (1.48 03/18/2011)
README
3858 3885 JCVI Locus Name
Gene Symbol
holB
VC_2015
sample list
IDA(6) IMP(11) IGI(28) ISS(4266) ND(5078)
Mouse - Mus musculus
MGI
gene_association.mgi
README
24799 MGI_ID
Gene Symbol
Gene_Symbol (old)
P2ry12
MGI:1918089
P2Y12
sample list
EXP(328) IDA(52682) IPI(17052) IMP(45279) IGI(9241) IEP(1546) ISS(1790) ISO(128018) ISA(4693) ISM(22) RCA(306) TAS(6491) NAS(622) IC(565) ND(16273) IEA(74228)
Default URL template: http://www.informatics.jax.org/searches/accession_report.cgi?id=
Yeast - Schizosaccharomyces pombe
PomBase
gene_association.pombase (11/25/2011)
README
5398 Systematic Name
Gene Name
Gene Synonym
SPCC191.07
cyc1
sample list
EXP(888) IDA(7726) IPI(2667) IMP(4593) IGI(799) IEP(25) ISS(1453) ISO(5144) ISM(1536) TAS(395) NAS(736) IC(1814) ND(2194) IEA(3333)
Default URL template: http://www.pombase.org/gene/
Pseudomonas - Pseudomonas aeruginosa PAO1
PseudoCAP
gene_association.pseudocap
1537 PA#
Gene Name
Alt. Gene Name (opt.)
fliD
PA1094
hook-associated protein
sample list
EXP(48) IDA(950) IPI(42) IMP(1222) IGI(66) IEP(13) ISS(1254) ISO(14) ISA(10) IGC(49) TAS(11) NAS(18) IEA(14)
Default URL template: http://www.pseudomonas.com/AnnotationByPAU.asp?PA=
Rat - Rattus norvegicus
RGD
gene_association.rgd
README
22793 RGD_ID (or Ensembl Id, or UniProt accession)
Gene Symbol (or UniProt Entry Name)
if GOA-provided, an International Protein Index identifier
Fgb
D3Z8Y5_RAT
D3Z8Y5
IPI00948614
sample list
EXP(317) IDA(30947) IPI(7938) IMP(9884) IGI(357) IEP(10852) ISS(25259) ISO(176196) RCA(5) TAS(3438) NAS(630) IC(216) ND(6595) IEA(80867)
Default URL template: http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=
Yeast - Saccharomyces cerevisiae
SGD
gene_association.sgd
README
6440 7166 SGD_ID
Gene Name
Systematic ORF Name
YJL166W
S000003702
COR5
QCR8
sample list
IDA(17523) IPI(2605) IMP(14077) IGI(5316) IEP(30) ISS(1133) ISO(7) ISA(316) ISM(446) TAS(307) NAS(75) IC(1418) ND(3641) IEA(50695)
Default URL template: http://www.yeastgenome.org/locus/
Common wallcress - Arabidopsis thaliana
TAIR
gene_association.tair
README
31860 TAIR Accession
Gene Name
Gene Alias
AT4G31210
AT4G31210.1
LOCUS:2128101
F8F16.30
F8F16_30
sample list
IDA(37497) IPI(17968) IMP(16238) IGI(3803) IEP(4729) ISS(8016) ISM(37757) RCA(3) TAS(6747) NAS(749) IC(213) ND(21120) IEA(20155)
Default URL template: http://www.arabidopsis.org/servlets/Search?type=general&search_action=detail&method=1&show_obsolete=F&sub_type=gene&SEARCH_EXACT=4&SEARCH_CONTAINS=1&name=
Worm - Caenorhabditis elegans
WormBase
gene_association.wb
README
14417 22246 Protein Name
Gene Name
Gene Symbol
casy-1
B0034.3
cdh-11
WBGENE00000403
sample list
IDA(7418) IPI(4044) IMP(9299) IGI(4616) IEP(174) ISS(1837) ISO(1) ISM(9) RCA(14) TAS(175) NAS(180) IC(112) ND(412) IEA(65278)
Default URL template: http://www.wormbase.org/db/gene/gene?name=
Zebrafish - Danio rerio
ZFIN
gene_association.zfin
README
25457 22409 ZFIN_ID
Gene Symbol
ZDB-GENE-030131-6506
mobkl1b
sample list
IDA(3878) IPI(937) IMP(16517) IGI(4852) IEP(154) ISS(6564) ISO(3) TAS(20) NAS(127) IC(89) ND(5937) IEA(128738)
Default URL template: http://zfin.org/cgi-bin/webdriver?MIval=aa-markerview.apg&OID=

Please note that the additional synonyms may result in greater ambiguity of terms.

Please cite the original manuscript for GO-TermFinder (the Perl module providing the core analysis methods used by this tool):

"GO::TermFinder--open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes." Boyle et al., Bioinformatics (2004)


The most important criterion for GO Consortium membership is that the members contribute something to the collection of resources that we make available to the public (almost all members contribute annotations; several contribute to the ontologies; a few contribute software). The scientists involved in working with GO in these member groups communicate via the GO mailing lists and GitHub to discuss development issues in the ontologies. If you represent a database that wishes to join the GO Consortium, please contact the GOC.

Anyone with a more general interest in the GO should subscribe to the Twitter feed (@news4go) to receive updates about the GO.


CONCLUSIONS

Overall, we have shown that advances in instrument control software and data collection strategies, coupled with improved data analysis, can allow the effective use of a benchtop high resolution mass spectrometer for the top-down analysis of highly complex proteoform mixtures such as those presented by the human proteome. The use of efficient, benchtop instrumentation alongside improved software and more structured handling/reporting of proteoforms will advance top-down proteomics.

