Data on Gene Position in Human Genome


I am trying to get some data on gene positions in the human genome and I need some help.

What I tried

I downloaded

I am only interested in gene position, so I kept only the first columns.

awk -F "." '{print $1}' /Users/remi/Downloads/gencode.v18.annotation.gtf >> HumanGenomePositions.txt

This operation will take a few minutes. The file contains information on exon position and transcript. I subsetted the table to get only the lines that concern genes

sed -i.bak '/gene/!d' HumanGenomePositions.txt

I am left with 57445 entries. 9872 are annotated by ENSEMBL and 47573 are annotated by HAVANA. Note that there is partial overlap between the two. According to Church et al. 2009, there are 19042 annotated genes in the human genome (reported from bionumbers). There is obviously something I am getting wrong!


Can you help me to get data on gene positions in humans in a handy format (see below)?

start end
15648 65487
129841 124984
…

I recommend filtering on the transcript_type value from the description column. You need only protein_coding genes. Right now you have an extra ~10K unprocessed pseudogenes, ~5K antisense genes, ~4K miRNA, ~7K lincRNA and more than thirty other categories of unprocessed pseudogenic stuff.

As far as I know, the current release for GRCh37 is version 19, not 18.
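The filtering suggested above can be sketched in Python. This is a minimal sketch, not a definitive parser. It assumes the standard nine-column, tab-separated GTF layout and that gene records carry a gene_type or transcript_type attribute in the last column. The function names and the file path in the usage note are hypothetical.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the GTF attribute column into a dict (last value wins for repeated keys)."""
    return dict(ATTR_RE.findall(field))


def protein_coding_gene_positions(lines, type_keys=("gene_type", "transcript_type")):
    """Return (chromosome, start, end) for every protein_coding gene record.

    type_keys is an assumption: a record passes if any of these attributes
    equals "protein_coding".
    """
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # keep gene records only, not exons or transcripts
        attrs = parse_attributes(cols[8])
        if any(attrs.get(key) == "protein_coding" for key in type_keys):
            rows.append((cols[0], int(cols[3]), int(cols[4])))
    return rows
```

Usage would look something like `with open("gencode.v18.annotation.gtf") as fh: rows = protein_coding_gene_positions(fh)`. Each row can then be written out as "start end" pairs in the format requested in the question.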

Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes

Background: Gene-expression analysis is increasingly important in biological research, with real-time reverse transcription PCR (RT-PCR) becoming the method of choice for high-throughput and accurate expression profiling of selected genes. Given the increased sensitivity, reproducibility and large dynamic range of this methodology, the requirements for a proper internal control gene for normalization have become increasingly stringent. Although housekeeping gene expression has been reported to vary considerably, no systematic survey has properly determined the errors related to the common practice of using only one control gene, nor presented an adequate way of working around this problem.

Results: We outline a robust and innovative strategy to identify the most stably expressed control genes in a given set of tissues, and to determine the minimum number of genes required to calculate a reliable normalization factor. We have evaluated ten housekeeping genes from different abundance and functional classes in various human tissues, and demonstrated that the conventional use of a single gene for normalization leads to relatively large errors in a significant proportion of samples tested. The geometric mean of multiple carefully selected housekeeping genes was validated as an accurate normalization factor by analyzing publicly available microarray data.

Conclusions: The normalization strategy presented here is a prerequisite for accurate RT-PCR expression profiling, which, among other things, opens up the possibility of studying the biological relevance of small expression differences.
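The core arithmetic of the abstract, a normalization factor taken as the geometric mean of several control genes, can be illustrated with a short sketch. The function names and the example quantities are hypothetical and are not taken from the paper. The sketch assumes the inputs are positive relative expression quantities on a linear scale.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalize(target_quantity, control_quantities):
    """Divide a target gene's quantity by the geometric mean of the control genes."""
    return target_quantity / geometric_mean(control_quantities)


# Hypothetical relative quantities for three control genes in one sample.
controls = [1.0, 4.0, 16.0]
factor = geometric_mean(controls)
```

Compared with an arithmetic mean, the geometric mean is less affected by a single control gene with an outlying value, which is one common argument for averaging on the log scale.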


The identification of signals of very recent positive selection provides information about the adaptation of modern humans to local conditions. We report here on a genome-wide scan for signals of very recent positive selection in favor of variants that have not yet reached fixation. We describe a new analytical method for scanning single nucleotide polymorphism (SNP) data for signals of recent selection, and apply this to data from the International HapMap Project. In all three continental groups we find widespread signals of recent positive selection. Most signals are region-specific, though a significant excess are shared across groups. Contrary to some earlier low resolution studies that suggested a paucity of recent selection in sub-Saharan Africans, we find that by some measures our strongest signals of selection are from the Yoruba population. Finally, since these signals indicate the existence of genetic variants that have substantially different fitnesses, they must indicate loci that are the source of significant phenotypic variation. Though the relevant phenotypes are generally not known, such loci should be of particular interest in mapping studies of complex traits. For this purpose we have developed a set of SNPs that can be used to tag the strongest ∼250 signals of recent selection in each population.

Citation: Voight BF, Kudaravalli S, Wen X, Pritchard JK (2006) A Map of Recent Positive Selection in the Human Genome. PLoS Biol 4(3): e72.

Academic Editor: Laurence Hurst, University of Bath, United Kingdom

Received: November 10, 2005 Accepted: January 10, 2006 Published: March 7, 2006

Copyright: © 2006 Voight et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Our project was supported by RO1 HG002772–1. BFV also received partial support from RO1 DK55889 to Nancy Cox.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: ASN, East Asian(s); CEU, northern and western European(s); EHH, extended haplotype homozygosity; iHH, integrated EHH; iHS, integrated haplotype score; SNP, single nucleotide polymorphism; YRI, Yoruba

Correction note: Because of a typesetting error, the symbol "σ" was incorrectly displayed as an "s" in the legends of Figures 1, 2, and 3. Corrected 3/6/06

First holistic view of how human genome actually works: ENCODE study produces massive data set

The Human Genome Project produced an almost complete order of the 3 billion pairs of chemical letters in the DNA that embodies the human genetic code -- but little about the way this blueprint works. Now, after a multi-year concerted effort by more than 440 researchers in 32 labs around the world, a more dynamic picture gives the first holistic view of how the human genome actually does its job.

During the new study, researchers linked more than 80 percent of the human genome sequence to a specific biological function and mapped more than 4 million regulatory regions where proteins specifically interact with the DNA. These findings represent a significant advance in understanding the precise and complex controls over the expression of genetic information within a cell. The findings bring into much sharper focus the continually active genome in which proteins routinely turn genes on and off using sites that are sometimes at great distances from the genes themselves. They also identify where chemical modifications of DNA influence gene expression and where various functional forms of RNA, a form of nucleic acid related to DNA, help regulate the whole system.

"During the early debates about the Human Genome Project, researchers had predicted that only a few percent of the human genome sequence encoded proteins, the workhorses of the cell, and that the rest was junk. We now know that this conclusion was wrong," said Eric D. Green, M.D., Ph.D., director of the National Human Genome Research Institute (NHGRI), a part of the National Institutes of Health. "ENCODE has revealed that most of the human genome is involved in the complex molecular choreography required for converting genetic information into living cells and organisms."

NHGRI organized the research project producing these results; it is called the Encyclopedia of DNA Elements, or ENCODE. Launched in 2003, ENCODE's goal of identifying all of the genome's functional elements seemed just as daunting as sequencing that first human genome. ENCODE was launched as a pilot project to develop the methods and strategies needed to produce results and did so by focusing on only 1 percent of the human genome. By 2007, NHGRI concluded that the technology had sufficiently evolved for a full-scale project, in which the institute invested approximately $123 million over five years. In addition, NHGRI devoted about $40 million to the ENCODE pilot project, plus approximately $125 million to ENCODE-related technology development and model organism research since 2003.

The scale of the effort has been remarkable. Hundreds of researchers across the United States, United Kingdom, Spain, Singapore and Japan performed more than 1,600 sets of experiments on 147 types of tissue with technologies standardized across the consortium. The experiments relied on innovative uses of next-generation DNA sequencing technologies, which had only become available around five years ago, due in large part to advances enabled by NHGRI's DNA sequencing technology development program. In total, ENCODE generated more than 15 trillion bytes of raw data and consumed the equivalent of more than 300 years of computer time to analyze.

"We've come a long way," said Ewan Birney, Ph.D., of the European Bioinformatics Institute, in the United Kingdom, and lead analysis coordinator for the ENCODE project. "By carefully piecing together a simply staggering variety of data, we've shown that the human genome is simply alive with switches, turning our genes on and off and controlling when and where proteins are produced. ENCODE has taken our knowledge of the genome to the next level, and all of that knowledge is being shared openly."

The ENCODE Consortium placed the resulting data sets as soon as they were verified for accuracy, prior to publication, in several databases that can be freely accessed by anyone on the Internet. These data sets can be accessed through the ENCODE project portal, as well as at the University of California, Santa Cruz genome browser, the National Center for Biotechnology Information, and the European Bioinformatics Institute.

"The ENCODE catalog is like Google Maps for the human genome," said Elise Feingold, Ph.D., an NHGRI program director who helped start the ENCODE Project. "Simply by selecting the magnification in Google Maps, you can see countries, states, cities, streets, even individual intersections, and by selecting different features, you can get directions, see street names and photos, and get information about traffic and even weather. The ENCODE maps allow researchers to inspect the chromosomes, genes, functional elements and individual nucleotides in the human genome in much the same way."

The coordinated publication set includes one main integrative paper and five related papers in the journal Nature; 18 papers in Genome Research; and six papers in Genome Biology. The ENCODE data are so complex that the three journals have developed a pioneering way to present the information in an integrated form that they call threads.

"Because ENCODE has generated so much data, we, together with the ENCODE Consortium, have introduced a new way to enable researchers to navigate through the data," said Magdalena Skipper, Ph.D., senior editor at Nature, which produced the freely available publishing platform on the Internet.

Since the same topics were addressed in different ways in different papers, the new website will allow anyone to follow a topic through all of the papers in the ENCODE publication set by clicking on the relevant thread at the Nature ENCODE explorer page. For example, thread number one compiles figures, tables, and text relevant to genetic variation and disease from several papers and displays them all on one page. ENCODE scientists believe this will illuminate many biological themes emerging from the analyses.

In addition to the threaded papers, six review articles are being published in the Journal of Biological Chemistry and two related papers in Science and one in Cell.

The ENCODE data are rapidly becoming a fundamental resource for researchers to help understand human biology and disease. More than 100 papers using ENCODE data have been published by investigators who were not part of the ENCODE Project, but who have used the data in disease research. For example, many regions of the human genome that do not contain protein-coding genes have been associated with disease. Instead, the disease-linked genetic changes appear to occur in vast tracts of sequence between genes where ENCODE has identified many regulatory sites. Further study will be needed to understand how specific variants in these genomic areas contribute to disease.

"We were surprised that disease-linked genetic variants are not in protein-coding regions," said Mike Pazin, Ph.D., an NHGRI program director working on ENCODE. "We expect to find that many genetic changes causing a disorder are within regulatory regions, or switches, that affect how much protein is produced or when the protein is produced, rather than affecting the structure of the protein itself. The medical condition will occur because the gene is aberrantly turned on or turned off or abnormal amounts of the protein are made. Far from being junk DNA, this regulatory DNA clearly makes important contributions to human health and disease."

Identifying regulatory regions will also help researchers explain why different types of cells have different properties. For example, why do muscle cells generate force while liver cells break down food? Scientists know that muscle cells turn on some genes that only work in muscle, but it has not been previously possible to examine the regulatory elements that control that process. ENCODE has laid a foundation for these kinds of studies by examining more than 140 of the hundreds of cell types found in the human body and identifying many of the cell type-specific control elements.

Despite the enormity of the dataset described in this historic collection of publications, it does not comprehensively describe all of the functional genomic elements in all of the different types of cells in the human body. NHGRI plans to invest in additional ENCODE-related research for at least another four years. During the next phase, ENCODE will increase the depth of the catalog with respect to the types of functional elements and cell types studied. It will also develop new tools for more sophisticated analyses of the data.

The Institute for Creation Research

The first rough drafts of the human genome were reported in 2001 (one in the private sector and one in the public sector). [1-2] Since then, after 20 years of intensive globally conducted research, the data has revealed a wealth of complexity that has completely upset all of the original evolutionary misconceptions. [3] Most importantly, the false evolutionary paradigm of "junk DNA" has been utterly debunked in favor of a new model, one containing pervasive functionality and network complexity. The reality of this seemingly unending complexity is only just beginning to be revealed, an inconvenient fact that points directly to an omnipotent Creator.

A recent cover story in the journal Nature briefly summarized the past 20 years since the original publications with the first drafts of the human genome hit the press. [3] When the first phase of research was completed in 2001, it was initially found that the genome contained about 25,000 protein coding genes and that the actual coding segments of these genes only accounted for about 2% of the total DNA sequence. Many evolutionists found affirmation in these initial reports. This was because the neutral model of evolutionary theory predicted that there should be vast regions of the human genome in evolutionary limbo (termed "junk DNA"). These alleged nonfunctional regions would then be randomly churning out new genes for nature to magically select. [4-5] Needless to say, this misguided evolutionary speculation was short-lived.

Since 2001, numerous research projects have demonstrated that these uncharted and mysterious regions of the human genome were not junk at all. Rather, they were vital to life and good health. In a subsection of the new Nature article entitled "Not Junk," the authors say, "With the HGP [human genome project] draft in hand, the discovery of non-protein-coding elements exploded. So far, that growth has outstripped the discovery of protein-coding genes by a factor of five, and shows no signs of slowing." They also said, "Thanks in large part to the HGP, it is now appreciated that the majority of functional sequences in the human genome do not encode proteins. Rather, elements such as long non-coding RNAs, promoters, enhancers and countless gene-regulatory motifs work together to bring the genome to life."

The main points of the past 20 years of research on the human genome can be summarized as follows:

1) The human genome is a complete storehouse of important information, and this fact negates the concept of junk DNA.

2) Protein-coding genes are largely a basic set of instructions within a complex and larger repertoire of regulatory DNA sequence.

3) Many more genes exist (compared to protein coding genes) that code for functional RNA molecules that are not used to make proteins, but do other jobs in the cell.

4) A vast number of regulatory switches and control features exist in the human genome that regulate its function.

The pervasive and complex design of the human genome is exactly what's gleaned from the Bible. After all, the scriptures say in Psalm 139:14, "I will praise You, for I am fearfully and wonderfully made; Marvelous are Your works, And that my soul knows very well."

1. Venter, J.C., et al. 2001. The Sequence of the Human Genome. Science. 291(2001):1304-1351.
2. International Human Genome Sequencing Consortium. 2001. Initial Sequencing and Analysis of the Human Genome. Nature. 409(2001):860-921.
3. Gates, A.J., D.M. Gysi, M. Kellis, and A.L. Barabási. 2021. A wealth of discovery built on the Human Genome Project - by the numbers. Nature. 590:212-215.
4. Tomkins, J. P. 2017. Evolutionary Clock Futility. Acts & Facts. 46 (3).
5. Tomkins, J. P. and J. Bergman. 2015. Evolutionary molecular genetic clocks - a perpetual exercise in futility and failure. Journal of Creation. 29 (2): 26-35.

*Dr. Tomkins is Director of Research at the Institute for Creation Research and earned his doctorate in genetics from Clemson University.

Prediction of complete gene structures in human genomic DNA

We introduce a general probabilistic model of the gene structure of human genomic sequences which incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived to account for the many substantial differences in gene density and structure observed in distinct C + G compositional regions of the human genome. In addition, new models of the donor and acceptor splice signals are described which capture potentially important dependencies between signal positions. The model is applied to the problem of gene identification in a computer program, GENSCAN, which identifies complete exon/intron structures of genes in genomic DNA. Novel features of the program include the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. GENSCAN is shown to have substantially higher accuracy than existing methods when tested on standardized sets of human and vertebrate genes, with 75 to 80% of exons identified exactly. The program is also capable of indicating fairly accurately the reliability of each predicted exon. Consistently high levels of accuracy are observed for sequences of differing C + G content and for distinct groups of vertebrates.

Why did it take 20 years?

Much of the newly sequenced material is the “heterochromatic” part of the genome, which is more “tightly packed” than the euchromatic genome and contains many highly repetitive sequences that are very challenging to read accurately.

These regions were once thought not to contain any important genetic information but they are now known to contain genes that are involved in fundamentally important processes such as the formation of organs during embryonic development. Among the 200 million newly sequenced base pairs are an estimated 115 genes predicted to be involved in producing proteins.

Two key factors made the completion of the human genome possible:

1. Choosing a very special cell type

The newly published genome sequence was created using human cells derived from a very rare type of tissue called a complete hydatidiform mole, which occurs when a fertilised egg loses all the genetic material contributed to it by the mother.

Most cells contain two copies of each chromosome, one from each parent and each parent’s chromosome contributing a different DNA sequence. A cell from a complete hydatidiform mole has two copies of the father’s chromosomes only, and the genetic sequence of each pair of chromosomes is identical. This makes the full genome sequence much easier to piece together.

2. Advances in sequencing technology

After decades of glacial progress, the Human Genome Project achieved its 2001 breakthrough by pioneering a method called “shotgun sequencing”, which involved breaking the genome into very small fragments of about 200 base pairs, cloning them inside bacteria, deciphering their sequences, and then piecing them back together like a giant jigsaw.

This was the main reason the original draft covered only the euchromatic regions of the genome — only these regions could be reliably sequenced using this method.

The latest sequence was deduced using two complementary new DNA-sequencing technologies. One was developed by PacBio, and allows longer DNA fragments to be sequenced with very high accuracy. The second, developed by Oxford Nanopore, produces ultra-long stretches of continuous DNA sequence. These new technologies allow the jigsaw pieces to be thousands or even millions of base pairs long, making the genome easier to assemble.
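The "jigsaw" idea of piecing fragments back together by their overlaps can be illustrated with a toy sketch. This is a greedy overlap merge on made-up sequences, offered only as an illustration. It is not the method used by the Human Genome Project or by the teams behind the latest sequence, and real assemblers must also handle sequencing errors, both strands and repeats. The function names and the fragments are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join; return the remaining contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three overlapping pieces of a made-up 14-base sequence.
contigs = greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"])
```

With short fragments, a repeated stretch longer than the fragment itself gives several equally good overlaps, so the merge order becomes ambiguous. Longer pieces that span the repeat remove that ambiguity.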

The new information has the potential to advance our understanding of human biology including how chromosomes function and maintain their structure. It is also going to improve our understanding of genetic conditions such as Down syndrome that have an underlying chromosomal abnormality.

Detailed Caption

The graphic shows the human genome annotated with data related to genes implicated in disease, regions of variation found in various populations, and regions of similarity between chromosomes.

The 24 individual chromosomes (1..22 [each present in pairs in the genome], X, Y) are arranged circularly (C), and represented by labeled (C3) ideograms on which the distance scale is displayed (C1).

Some chromosomes are shown at different physical scales to illustrate the rich pattern of the data (chr2 3x; chrs 18,19,20,21,22 2x; chrs 3,7,17 10x). Within each ideogram, cytogenetic bands are shown (C2). These are large-scale features used in cytogenetics to locate and reference gross changes.

On the outside of the ideograms, genomic variation between individuals and populations is represented by tracks (A) and (B). The number of catalogued locations at which single base pair changes have been observed within populations is shown as a histogram (A). Large regions which have been seen to vary in size and copy number between individuals are marked in (B).

Locations of genes associated with disease are superimposed on the ideograms (D). (D3) shows the location of genes implicated in cancer (very dark red), other disease (dark red) and all other genes (red). (D2) shows locations of genes implicated in lung, ovarian, breast, prostate, pancreatic, and colon cancer, colored in progressively darker shades of red. (D1) marks gene positions implicated in other diseases such as ataxia, epilepsy, glaucoma, heart disease, and neuropathy, colored in progressively darker shades of red, as well as diabetes (orange), deafness (green), and Alzheimer's disease (blue).

Grey lines (E) connect positions on ideograms associated with genes that participate in the same biochemical pathways. The shade of the link reflects the character of the gene: dark grey indicates that the gene is implicated in cancer, grey in disease, and light grey for all other genes. Colored links (F) connect a subset of genomic region pairs that are highly similar and illustrate the deep level of similarity between genomic regions (about 50% of the genome is in so-called repeat regions, which appear in the genome multiple times and in a variety of locations).

Genome Biology at Genome Informatics

At the start of the year, I was thinking about the conferences I attended last year. One highlight was Genome Informatics, which I went to in September, on behalf of Genome Biology.

Genome Informatics is an annual conference, focusing on computational approaches for understanding the biology of genomes. It alternates between the Wellcome Trust conference center in Hinxton, UK and Cold Spring Harbor Laboratories, NY, USA. Last year was the turn of Hinxton, so I went along, as I have the previous two times it was in the UK.

The two keynote presentations were from Katie Pollard (University of California San Francisco, USA) and Rafael Irizarry (Dana-Farber Cancer Institute, Boston, USA). Pollard discussed the use of machine learning in genomics research, and in particular the problems that can arise. She pointed out that you shouldn't use balanced training data if the problem you are looking at is very unbalanced (i.e., few positives and many negatives, as when identifying promoter sequences). She also noted that many machine learning models assume that data are independent and identically distributed, which is very much not the case with genomics data. Nevertheless, even when the assumptions of the model are violated, useful results can still be obtained.
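One way to see why class balance matters is a short arithmetic sketch. It is an illustration under stated assumptions and is not taken from Pollard's talk. The sensitivity, specificity and prevalence values are hypothetical, and the function name is made up for the example.

```python
def precision(sensitivity, specificity, prevalence):
    """Fraction of predicted positives that are true positives at a given class prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same hypothetical classifier, judged on balanced data and on realistic, unbalanced data.
balanced = precision(0.9, 0.9, 0.5)     # half of the examples are positives
unbalanced = precision(0.9, 0.9, 0.01)  # 1 in 100 examples is a positive
```

Under these assumed numbers, 90% of the calls are correct on the balanced set, while at 1% prevalence only about 1 call in 12 is. A model tuned and assessed on balanced data can therefore look far better than it will on genome-wide data.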


Irizarry’s talk also dealt with problems in analysis, and why you shouldn’t just blindly trust the results you get. Sometimes, you can get a good idea if your results are plausible just by eyeballing the data. This was a common theme in many talks. Irizarry gave an example of a study which reported that a quarter of genes expressed in blood were differentially expressed between two human populations. This seemed implausibly high, so he looked into it and found a batch effect from having the two populations sampled in two separate projects.

In previous editions of this conference, attendees have told me how it has changed since it first started: now there are more talks discussing the biology revealed by the informatics rather than the informatics methods themselves. This iteration was no different, with several talks about analyzing large numbers of cancer genomes to find variants, or large cohorts of personal genomes to find variants associated with developmental disorders. Going beyond identifying variants associated with a condition, Sri Kosuri (University of California Los Angeles, USA) talked about experiments in which he tested thousands of SNPs for their effects on splicing in a reporter gene construct.

One biology talk that I found particularly interesting was from Lucia Spangenberg (Institut Pasteur de Montevideo, Uruguay), who has been attempting to reconstruct the genome of the Charruas, the indigenous people of Uruguay who were exterminated in the 19th century. Spangenberg found that the genomes of ten modern-day Uruguayans between them contain enough Charruan DNA to be able to reconstruct 99% of the Charruan genome. In general, people’s native genetic ancestry was higher than their self-reported native identity.

Several talks discussed how modern techniques, such as long-read sequencing from Pacific Biosciences, linked reads from 10x Genomics, and genome contact information from Hi-C, can be used to improve genome assemblies. This was shown in a variety of systems: birds (Alexander Suh, Uppsala University, Sweden), donkeys (Nikka Keivanfar, 10x Genomics, USA), and moss (Sarah Carey, University of Florida, USA). Jeffrey Kidd (University of Michigan, USA) showed that PacBio can be used to produce a reference genome for dog that is more complete than the original one sequenced using Sanger technology.

One trend that particularly intrigued us at Genome Biology was the increased number of methods for representing genomes in a graph format, with variants shown as alternative branches, rather than the traditional linear reference representation. This was described for both prokaryotic genomes (Rachel Colquhoun, Oxford University, UK) and eukaryotic genomes (Prithicka Sritharan, Quadram Institute Bioscience, UK). We found this interesting, as we have been discussing this for a while, and have just issued a call for papers for an article collection on graph genomes.
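The graph representation described above can be pictured with a toy sketch: each node holds a stretch of sequence, a variant appears as two alternative branches, and each path through the graph spells one version of the genome. The node layout, the sequences and the function name are hypothetical. The sketch assumes a small acyclic graph and does not reflect any of the tools presented at the conference.

```python
def haplotypes(nodes, edges, start, end):
    """Yield every sequence spelled by a path from start to end in an acyclic graph."""
    def walk(node, prefix):
        seq = prefix + nodes[node]
        if node == end:
            yield seq
            return
        for nxt in edges.get(node, []):
            yield from walk(nxt, seq)

    yield from walk(start, "")


# A single-base variant (T or A) flanked by shared sequence.
nodes = {1: "ACG", 2: "T", 3: "A", 4: "GGC"}
edges = {1: [2, 3], 2: [4], 3: [4]}
sequences = sorted(haplotypes(nodes, edges, start=1, end=4))
```

A linear reference would store only one of the two sequences and describe the other as a difference from it. The graph keeps both branches on an equal footing.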

I am planning on attending this year’s Genome Informatics conference in Cold Spring Harbor, and it will be fascinating to see how the different location, with a different set of delegates, affects the feel and focus of the conference. However different it turns out to be, I predict it will be just as fascinating as last year’s conference.


The new release of BiologicalNetworks introduces extensive functionality for a more efficient integrated analysis and visualization of diverse data in studies of different biological systems concerning human diseases, host-pathogen interactions, metagenomics, meiosis in fungi, microbial metabolism, and whole-genome metabolic reconstruction in eukaryotes and prokaryotes. The BiologicalNetworks database has a general purpose graph architecture and is data-type neutral, thus there is the prospect of further data integration for more complete systems biology studies. The integration of additional, orthogonal sources of information, such as clinical data, will enable quantitative associations of clinical variables with the activities of molecular pathways and processes. We also demonstrated how BiologicalNetworks can be used to find disease-specific interaction networks, through the application of multi-level analysis of microarray, sequence, regulatory, and other data.

Besides customization at the level of selecting analysis methods/tools in BiologicalNetworks, the user has the option to change the parameters of each method; for example, to specify the homology level in the "Build Homology Wizard" when building the clusters of homologous genes/proteins, or to specify data sources, types of interactions, species, and p-values in the "Build Pathway Wizard". We also continually customize BiologicalNetworks, adding new features, methods, data formats and sources at users' requests.

To allow for the replication and comparison of the results presented in this work with other related analyses, all available demonstrated examples and data can be accessed in 'BMC Bioinformatics Demo Project', upon launching the BiologicalNetworks application. Additionally, the BiologicalNetworks Welcome Screen and front page of the web site contain a list of "driving" biological projects (for various species and types of analysis) which can be replicated by simply running the respective project.

BiologicalNetworks, along with the user Manual and Video tutorials and Quick Start Guide, is available at
