Probability of all alleles represented in a sample

I'm trying to wrap my head around some formulas presented in Chakraborty's 1992 paper, "Sample Size Requirements for Addressing the Population Genetic Issues of Forensic Use of DNA Typing", but I have not been able to.

Specifically, the right-hand side of formula (16) and its relation to formula (13).

$1-\sum\limits_{i=1}^{k}(1-p_{i})^{2n}$ (13)

$[1-(1-p)^{2n}]^{r}\geqslant 1-\alpha$ (16)

Formula 13 indicates the probability, for a locus with $k$ segregating alleles whose frequencies are contained in the vector $p$, that all alleles are represented in a given sample of size $n$, and the right hand side of formula 16 indicates the probability of $r$ alleles to be represented in a given sample of size $n$.

First of all, why, based on (13), does the expression inside the summation give the probability that an allele of frequency $p$ remains unobserved in a sample of size $n$?

I tried to understand this from the Hardy-Weinberg equation but did not have any success.

Second, why is the expression in (16) raised to the $r$-th power?

Which biological concepts am I missing?

I'm going to strictly answer the questions, rather than step through the proof, because it involves a lot of formatting that I'm not familiar with. Other folks are welcome to edit this!

Equation 13

This equation assumes a diploid genotype, hence the $2n$ power for $n$ individuals. For any ploidy greater than haploid, it's mathematically simpler to work with the probability that an allele is not present. As an example, see this calculation of a triploid Hardy-Weinberg equilibrium equation. Using this simplification,

$P(\text{single allele not present}) = \left[(1 - P(\text{allele present}))^{\text{ploidy}}\right]^{n}$

$= (1 - P(\text{allele present}))^{\text{ploidy} \cdot n}$

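With $k$ segregating alleles, each allele has its own non-presence probability. The probability that every allele is present is then approximately $1 - \sum P(\text{each non-presence})$, which is formula (13). Strictly, this is a lower bound, because samples that miss more than one allele are subtracted more than once.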
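As a rough numerical illustration of formula (13), the sketch below evaluates the per-allele non-presence probability and the resulting approximation. The function names and the example frequency vector are hypothetical and are not taken from the paper; the calculation assumes gene copies are sampled independently.

```python
def prob_allele_missed(p, n, ploidy=2):
    """Probability that an allele of frequency p is absent from
    all ploidy * n gene copies in a sample of n individuals."""
    return (1.0 - p) ** (ploidy * n)


def prob_all_alleles_seen(freqs, n, ploidy=2):
    """Formula (13): 1 - sum_i (1 - p_i)^(2n) for a diploid locus."""
    return 1.0 - sum(prob_allele_missed(p, n, ploidy) for p in freqs)


# Hypothetical allele frequencies for a locus with k = 4 alleles.
example_freqs = [0.5, 0.3, 0.15, 0.05]
for sample_size in (10, 30, 100):
    print(sample_size, round(prob_all_alleles_seen(example_freqs, sample_size), 4))
```

The rarest allele dominates the sum, so the required sample size is driven almost entirely by the smallest $p_i$.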

Equation 16

In this equation, the author describes the probability that all $r$ alleles are present when each is assigned the same frequency $p$. These allele presences are treated as independent of each other, so their probabilities multiply. Since the factor $1-(1-p)^{2n}$ is identical for every allele, the product of $r$ such factors simplifies to the power $^r$.
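To make inequality (16) concrete, here is a minimal sketch that searches for the smallest $n$ satisfying it. The function names and the brute-force search are illustrative assumptions, not the procedure used in the paper.

```python
def prob_r_alleles_seen(p, n, r):
    """Left-hand side of (16): [1 - (1 - p)^(2n)]^r."""
    return (1.0 - (1.0 - p) ** (2 * n)) ** r


def min_sample_size(p, r, alpha=0.05):
    """Smallest n (individuals) with [1 - (1 - p)^(2n)]^r >= 1 - alpha."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    n = 1
    while prob_r_alleles_seen(p, n, r) < 1.0 - alpha:
        n += 1
    return n


# Hypothetical inputs: r = 10 alleles, each with frequency 0.01.
print(min_sample_size(p=0.01, r=10, alpha=0.05))
```

A linear search is enough here because the left-hand side increases monotonically with $n$.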

The Hardy-Weinberg formulas allow scientists to determine whether evolution has occurred. Any changes in the gene frequencies in the population over time can be detected. The law essentially states that if no evolution is occurring, then an equilibrium of allele frequencies will remain in effect in each succeeding generation of sexually reproducing individuals. In order for equilibrium to remain in effect (i.e. that no evolution is occurring) then the following five conditions must be met:

  1. No mutations must occur so that new alleles do not enter the population.
  2. No gene flow can occur (i.e. no migration of individuals into, or out of, the population).
  3. Random mating must occur (i.e. individuals must pair by chance)
  4. The population must be large so that no genetic drift (random chance) can cause the allele frequencies to change.
  5. No selection can occur so that certain alleles are not selected for, or against.

Obviously, the Hardy-Weinberg equilibrium cannot exist in real life. Some or all of these forces act on living populations at various times, and evolution at some level occurs in all living organisms. The Hardy-Weinberg formulas allow us to detect some allele frequencies that change from generation to generation, thus providing a simplified method of determining that evolution is occurring. There are two formulas that must be memorized: p² + 2pq + q² = 1 and p + q = 1.

How to Make a Punnett Square

The most basic Punnett square is started by drawing a square and dividing it into 4 equal parts. The letters that go on the top and side of the Punnett square are the alleles contributed by each parent. Each allele gets a column or row. Now, to fill in the Punnett square, simply copy each letter into every box of the column or row it heads. When this is finished, you will have a Punnett square like the one below.

Each box within the Punnett square represents one possible genetic outcome for the offspring. In diploid organisms, each organism can only carry 2 alleles. There may be many more alleles present in the population at large, but between the two individuals mating, there can only be 4 different alleles total. Most simple Punnett square diagrams only consider 2 alleles.

The alleles are capitalized based on their relationship to other alleles. If the allele is dominant, and will mask the effects of other alleles, it is capitalized. Alleles which need two copies of themselves to produce a phenotype are considered recessive, and are given lower-case letters. Other relationships between alleles are indicated with superscripts, subscripts, and other designations to separate alleles with incomplete dominance or codominance.

Direct Instruction - Punnett Squares and Probabilities of Inheritance

Some of your students may have been introduced to the concept of Punnett squares in middle school life science, but many students do not remember how to develop a Punnett square to accurately predict the probability of a genetic cross.

It is important to remind your students that Punnett squares DO NOT TELL THE FUTURE! Punnett squares are a tool to organize the genotypes of the parents for a given trait, create the cross, and count up the possible genotypes/phenotypes that can result from the given cross. The Punnett square will not tell exactly what will occur, but will allow students to trace the genotypes of the parents, create potential genotypes of the offspring, and calculate the probability of each offspring genotype occurring in the population.

Punnett squares are a simple tool for students to create that will organize their data and provide a visual representation of the distribution of certain traits. Although Punnett squares are useful tools to predict outcomes, many students view them as riddles as they work to calculate the probable outcomes.

As a simple introduction, or possible review, students will record Probability Punnett Squares Lecture Notes that describe how to create a Punnett square and the rationale of using Punnett squares to predict probable genetic outcomes. The lecture notes contain five review questions to reinforce the main idea of creating and solving Punnett square practice problems.


To get started, first choose an arbitrary letter for the allele. I chose the letter f for feathers. It is standard in genetics to use a capital letter for the dominant allele and a lowercase letter for the recessive allele.

We now determine the following:

F/f - heterozygous (phenotype: blue)
F/F - homozygous dominant (phenotype: blue)
f/f - homozygous recessive (phenotype: white)

The parakeet crossing is between F/f x f/f.

When done, you should get two possible zygotes TWICE:

The question asks for the percentage of offspring with BLUE feathers, which are the heterozygotes F/f. Since two of every four offspring have this genotype, the ratio is:

2 offspring / 4 offspring = 1/2 or 0.5

Convert to a percentage by multiplying by 100%:

0.5 × 100% = 50%

A single copy of a dominant allele is enough to be expressed, but recessive alleles need two copies to express. For that reason, the possible genotypes and phenotypes for feather colors are:

The parakeet with blue feathers is heterozygous, so it has both a dominant allele for blue feathers (B) and a recessive allele for white feathers (b), and its genotype is Bb. The other parakeet is homozygous for white feathers, so it has two b alleles and the genotype bb.

When individuals produce gametes, the alleles separate, so the heterozygous parakeet will produce the gametes B and b, and the homozygous parakeet will produce two b gametes.

The Punnett Square shows the possible combination of the gametes produced by the two parents, and therefore the genotypes of the offspring:

        b                     b
B       Bb - blue feathers    Bb - blue feathers
b       bb - white feathers   bb - white feathers

2 out of four offspring (50%) will have the genotype Bb and blue feathers.
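The same cross can be enumerated programmatically. The following is a minimal sketch for a single-gene cross; the `punnett` helper is hypothetical, and it assumes each parent genotype is written as a two-letter string with an upper-case dominant allele.

```python
from collections import Counter
from itertools import product


def punnett(parent1, parent2):
    """Count offspring genotypes for a single-gene cross, e.g. 'Bb' x 'bb'."""
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Write the dominant (upper-case) allele first: 'Bb', never 'bB'.
        genotype = "".join(
            sorted(allele1 + allele2, key=lambda c: (c.lower(), c.islower()))
        )
        offspring[genotype] += 1
    return offspring


counts = punnett("Bb", "bb")
total = sum(counts.values())
# Any genotype carrying at least one dominant B allele has blue feathers.
blue = sum(n for genotype, n in counts.items() if "B" in genotype)
percent_blue = 100 * blue / total
print(dict(counts), percent_blue)
```

Each of the four gamete pairings corresponds to one box of the square, so the counts divided by four give the predicted proportions.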

A Punnett square is a tabular summary of the possible combinations of the paternal and maternal alleles, and is used to predict the genotypes of the offspring in a breeding experiment.

A heterozygous parakeet with blue feathers can be represented by the genotype Bb. A parakeet that is homozygous recessive with white feathers can be represented by the genotype bb. When a cross is made between these two, the percentage of offspring predicted to have blue feathers is 50%. It can be represented as follows:

Phenotype - Blue feathers x White feathers

Thus, the percentage of offspring with blue feathers is 50% with a genotype Bb.

According to the Punnett square for this cross, what percentage of offspring is predicted to have blue feathers?

Materials and methods

GATK ASEReadCounter tool and benchmarking

The tool and accompanying documentation are available in GATK v.3.4, which can be downloaded from [44]. The Python script which processes the output from SAMtools mpileup can be found at [45]. Benchmarking was run using GATK v.3.4 and SAMtools 1.2 on STAR aligned reads from the Geuvadis sample NA06986.2.M_111215_4 using heterozygous bi-allelic sites from 1000 Genomes phase 1. Reads were coordinate sorted, indexed, and WASP filtered to produce a BAM file containing 56,362,192 reads. Runtime benchmarking was performed using 100 %, 75 %, and 50 % of the reads sampled from the file, and is reported as the mean of 10 runs with the 95 % confidence interval shown. For comparison ASEQ v.1.1.8 was run in pileup mode. Benchmarking was run on CentOS 6.5 with Java version 1.6 on an Intel Xeon CPU E7- 8830 @ 2.13GHz.

Filtering homozygous sites

In order to identify potentially homozygous sites miscalled as a heterozygous SNP we model the number of reads that can be observed due to technical error of the experimental and upstream computational pipeline. Let us assume there are a total of n reads originating from a site homozygous for an allele R. Assuming a noise rate ε, by which a read can erroneously support another allele A, the distribution of the total number of reads aligned to allele A, n_A, is given by a binomial distribution. Hence, the probability of observing n_A or more reads assigned to allele A in a site homozygous for R is given by:

where BinCDF(n_A, n, ε) is the binomial cumulative distribution function. Conversely, the probability of observing n_R (where n = n_R + n_A) or more reads assigned to allele R in a site homozygous for A is given by:

under the assumption that the noise rate is equal for all alleles. Therefore, the probability of observing extreme allelic imbalance due to the null hypothesis, homozygosity for one of the alleles, can be calculated by summing up the two above probabilities corresponding to the two tails of the distribution. In order to derive an empirical estimate of the noise rate ε we used the ratio between the total sum of reads assigned to other alleles, those different from the designated reference or alternative allele at each site, to the total number of reads in a library divided by two. For this purpose we exclude the sites with more than 5 % of the reads aligned to other alleles from the analysis.
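The two-tailed test described in this subsection might be sketched as follows. This is not the published implementation: the function names are hypothetical, the upper-tail probability P(X ≥ k) is computed directly from the binomial mass function, and capping the sum at 1 is an added assumption.

```python
from math import comb


def binom_upper_tail(k, n, eps):
    """P(X >= k) for X ~ Binomial(n, eps)."""
    return sum(
        comb(n, i) * eps**i * (1.0 - eps) ** (n - i) for i in range(k, n + 1)
    )


def homozygous_pvalue(n_ref, n_alt, eps):
    """Probability of the observed counts if the site is really homozygous.

    Sums the two tails: site homozygous for R with n_alt noise reads, and
    site homozygous for A with n_ref noise reads. The cap at 1.0 is an
    assumption made for this sketch.
    """
    n = n_ref + n_alt
    p_if_hom_ref = binom_upper_tail(n_alt, n, eps)
    p_if_hom_alt = binom_upper_tail(n_ref, n, eps)
    return min(1.0, p_if_hom_ref + p_if_hom_alt)


# Hypothetical read counts and noise rate.
print(homozygous_pvalue(n_ref=30, n_alt=1, eps=0.001))
print(homozygous_pvalue(n_ref=15, n_alt=16, eps=0.001))
```

A site with only one or two reads on the minor allele gets a comparatively large probability and would be flagged as possibly homozygous, while a balanced site gets a probability near zero.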

Mapping strategies for AE analysis

For all analyses, unless otherwise noted, reads were mapped using STAR v.2.4.0f1 and the two-pass mapping strategy as recommended by the Broad Institute [39]. Briefly, splice junctions are detected during a first pass mapping, and these are used to inform a second round of mapping. All reads were mapped to hg19 and Gencode v19 annotations were used.

For mapping to a personalized genome, the vcf2diploid tool, part of AlleleSeq, was used to generate both a maternal and paternal genome for NA06986 from the phased 1000 Genomes phase 1 reference using het-SNPs only. Reads were then mapped to both genomes separately using STAR two-pass strategy (as above). Reads which aligned uniquely to only one genome were kept, and in cases where reads mapped uniquely to both genomes, the alignment with the higher alignment quality was used.

Mapping using GSNAP was performed with default settings and splice site annotations from hg19 refGene. Variant-aware alignment was performed using the “-d” option for NA06986 from the phased 1000 Genomes phase 1 reference using het-SNPs only, as described in the GSNAP documentation.

Multidimensional scaling clustering of samples by AE data

A pairwise distance matrix was produced for all Geuvadis samples using AE data and used for classical multidimensional scaling (cmdscale) in R. The first two dimensions were then plotted against each other for all samples. The distance between two samples was calculated as follows: Pairwise distance = Total number of sites with significant AE in only one sample/Total number of shared sites. A binomial test with a 5 % FDR was used for significance with either no effect size cutoff (Fig. 6c) or a minimum effect size of 0.15 (Fig. 6d).
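The distance defined above can be sketched as below. The data layout (a dict mapping site identifiers to a significance flag) and the choice to count discordant calls only among shared sites are assumptions made for illustration.

```python
def pairwise_distance(sig_a, sig_b):
    """Sites with significant AE in only one sample / number of shared sites.

    sig_a and sig_b map a site identifier to True when AE is significant
    in that sample. Only sites measured in both samples are compared.
    """
    shared = sig_a.keys() & sig_b.keys()
    if not shared:
        return float("nan")
    only_one = sum(1 for site in shared if sig_a[site] != sig_b[site])
    return only_one / len(shared)


# Hypothetical significance calls for two samples.
sample_1 = {"site_a": True, "site_b": False, "site_c": True}
sample_2 = {"site_a": False, "site_b": False, "site_d": True}
print(pairwise_distance(sample_1, sample_2))
```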

Measuring AE at eQTL genes

RNA-seq data from 343 Geuvadis European individuals was used to generate allele counts at het-SNPs. For each individual, AE (AE = |0.5 − Reference ratio |) was calculated for all sites with ≥16 reads, each site was intersected against all Geuvadis European genes with a significant eQTL (eGene, 5 % FDR), and the median AE of all sites covering each eGene was calculated. The genotype of each individual for the top eQTL for each gene was then determined to be either heterozygous or homozygous. For each eGene with at least 30 measurements of AE in both heterozygous and homozygous individuals the significance of the difference in AE between the two classes was calculated using a Wilcoxon rank sum test (1 % FDR). To determine the enrichment of sites within eSNP heterozygous eGenes across the AE spectrum, the percentage of these sites was calculated in bins of AE for each individual.

Units of AE

Reference ratio = Reference reads/Total reads

Allelic expression (effect size) = |0.5 – Reference ratio|
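These two definitions translate directly into code. The helper names below are hypothetical; the formulas are the ones stated above.

```python
def reference_ratio(ref_reads, total_reads):
    """Reference ratio = reference reads / total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return ref_reads / total_reads


def allelic_expression(ref_reads, total_reads):
    """Allelic expression (effect size) = |0.5 - reference ratio|."""
    return abs(0.5 - reference_ratio(ref_reads, total_reads))


# Hypothetical counts at a het-SNP covered by 16 reads.
print(allelic_expression(ref_reads=12, total_reads=16))
```

The effect size runs from 0 (perfectly balanced expression) to 0.5 (all reads from one allele).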

Data availability

RNA-seq data from the Geuvadis Consortium alongside 1000 Genomes phase 1 genotype data were used for all analyses. RNA-Seq FASTQ files are available from the European Nucleotide Archive under accession [ENA:ERP001942].

HW answers

PROBLEM #1. You have sampled a population in which you know that the percentage of the homozygous recessive genotype (aa) is 36%. Using that 36%, calculate the following:

  1. The frequency of the “aa” genotype. Answer: 36%, as given in the problem itself.
  2. The frequency of the “a” allele. Answer: The frequency of aa is 36%, which means that q² = 0.36, by definition. If q² = 0.36, then q = 0.6, again by definition. Since q equals the frequency of the a allele, then the frequency is 60%.
  3. The frequency of the “A” allele. Answer: Since q = 0.6, and p + q = 1, then p = 0.4. The frequency of A is by definition equal to p, so the answer is 40%.
  4. The frequencies of the genotypes “AA” and “Aa.” Answer: The frequency of AA is equal to p², and the frequency of Aa is equal to 2pq. So, using the information above, the frequency of AA is 16% (i.e. p² = 0.4 x 0.4 = 0.16) and Aa is 48% (2pq = 2 x 0.4 x 0.6 = 0.48).
  5. The frequencies of the two possible phenotypes if “A” is completely dominant over “a.” Answers: Because “A” is totally dominant over “a”, the dominant phenotype will show if either the homozygous “AA” or heterozygous “Aa” genotypes occur. The recessive phenotype is controlled by the homozygous aa genotype. Therefore, the frequency of the dominant phenotype equals the sum of the frequencies of AA and Aa, and the recessive phenotype is simply the frequency of aa. Therefore, the dominant frequency is 64% and, in the first part of this question above, you have already shown that the recessive frequency is 36%.
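The chain of steps used in this problem (recessive genotype frequency, then q, then p, then the remaining genotypes) can be sketched as a small helper. The function name is hypothetical, and the calculation assumes Hardy-Weinberg equilibrium with two alleles.

```python
from math import sqrt


def hardy_weinberg_from_recessive(aa_freq):
    """Derive allele and genotype frequencies from the aa genotype frequency."""
    q = sqrt(aa_freq)  # frequency of the recessive allele
    p = 1.0 - q        # frequency of the dominant allele, since p + q = 1
    return {
        "p": p,
        "q": q,
        "AA": p * p,
        "Aa": 2 * p * q,
        "aa": aa_freq,
        "dominant_phenotype": p * p + 2 * p * q,
    }


result = hardy_weinberg_from_recessive(0.36)
for name, value in result.items():
    print(name, round(value, 4))
```

The same helper applies to the later problems that start from a recessive phenotype frequency.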

PROBLEM #2. Sickle-cell anemia is an interesting genetic disease. Normal homozygous individuals (SS) have normal blood cells that are easily infected with the malarial parasite. Thus, many of these individuals become very ill from the parasite and many die. Individuals homozygous for the sickle-cell trait (ss) have red blood cells that readily collapse when deoxygenated. Although malaria cannot grow in these red blood cells, individuals often die because of the genetic defect. However, individuals with the heterozygous condition (Ss) have some sickling of red blood cells, but generally not enough to cause mortality. In addition, malaria cannot survive well within these “partially defective” red blood cells. Thus, heterozygotes tend to survive better than either of the homozygous conditions. If 9% of an African population is born with a severe form of sickle-cell anemia (ss), what percentage of the population will be more resistant to malaria because they are heterozygous (Ss) for the sickle-cell gene? Answer: 9% = 0.09 = ss = q². To find q, simply take the square root of 0.09 to get 0.3. Since p = 1 – 0.3, then p must equal 0.7. 2pq = 2 (0.7 x 0.3) = 0.42 = 42% of the population are heterozygotes (carriers).

PROBLEM #3. There are 100 students in a class. Ninety-six did well in the course whereas four blew it totally and received a grade of F. Sorry. In the highly unlikely event that these traits are genetic rather than environmental, if these traits involve dominant and recessive alleles, and if the four (4%) represent the frequency of the homozygous recessive condition, please calculate the following:

  1. The frequency of the recessive allele. Answer: Since we believe that the homozygous recessive for this gene (q²) represents 4% (i.e. = 0.04), the square root (q) is 0.2 (20%).
  2. The frequency of the dominant allele. Answer: Since q = 0.2, and p + q = 1, then p = 0.8 (80%).
  3. The frequency of heterozygous individuals. Answer: The frequency of heterozygous individuals is equal to 2pq. In this case, 2pq equals 0.32, which means that the frequency of individuals heterozygous for this gene is equal to 32% (i.e. 2 (0.8)(0.2) = 0.32).

PROBLEM #4. Within a population of butterflies, the color brown (B) is dominant over the color white (b). And, 40% of all butterflies are white. Given this simple information, which is something that is very likely to be on an exam, calculate the following:

  1. The percentage of butterflies in the population that are heterozygous.
  2. The frequency of homozygous dominant individuals. Answers: The first thing you’ll need to do is obtain p and q. So, since white is recessive (i.e. bb), and 40% of the butterflies are white, then bb = q² = 0.4. To determine q, which is the frequency of the recessive allele in the population, simply take the square root of q², which works out to be 0.632 (i.e. 0.632 x 0.632 = 0.4). So, q = 0.63. Since p + q = 1, then p must be 1 – 0.63 = 0.37. Now then, to answer our questions. First, what is the percentage of butterflies in the population that are heterozygous? Well, that would be 2pq, so the answer is 2 (0.37) (0.63) = 0.47. Second, what is the frequency of homozygous dominant individuals? That would be p² or (0.37)² = 0.14.

PROBLEM #5. A rather large population of Biology instructors have 396 red-sided individuals and 557 tan-sided individuals. Assume that red is totally recessive. Please calculate the following:

  1. The allele frequencies of each allele. Answer: Well, before you start, note that the allelic frequencies are p and q, and be sure to note that we don’t have nice round numbers and the total number of individuals counted is 396 + 557 = 953. So, the recessive individuals are all red (q²) and 396/953 = 0.416. Therefore, q (the square root of q²) is 0.645. Since p + q = 1, then p must equal 1 – 0.645 = 0.355.
  2. The expected genotype frequencies. Answer: Well, AA = p² = (0.355)² = 0.126, Aa = 2(p)(q) = 2(0.355)(0.645) = 0.458, and finally aa = q² = (0.645)² = 0.416 (you already knew this from part A above).
  3. The number of heterozygous individuals that you would predict to be in this population. Answer: That would be 0.458 x 953 = about 436.
  4. The expected phenotype frequencies. Answer: Well, the “A” phenotype = 0.126 + 0.458 = 0.584 and the “a” phenotype = 0.416 (you already knew this from part A above).
  5. Conditions happen to be really good this year for breeding and next year there are 1,245 young “potential” Biology instructors. Assuming that all of the Hardy-Weinberg conditions are met, how many of these would you expect to be red-sided and how many tan-sided? Answer: Simply put, The “A” phenotype = 0.584 x 1,245 = 727 tan-sided and the “a” phenotype = 0.416 x 1,245 = 518 red-sided ( or 1,245 – 727 = 518).

PROBLEM #6. A very large population of randomly-mating laboratory mice contains 35% white mice. White coloring is caused by the double recessive genotype, “aa”. Calculate allelic and genotypic frequencies for this population. Answer: 35% are white mice, which = 0.35 and represents the frequency of the aa genotype (or q²). The square root of 0.35 is 0.59, which equals q. Since p = 1 – q, then 1 – 0.59 = 0.41. Now that we know the frequency of each allele, we can calculate the frequency of the remaining genotypes in the population (AA and Aa individuals). AA = p² = 0.41 x 0.41 = 0.17, Aa = 2pq = 2 (0.59) (0.41) = 0.48, and as before aa = q² = 0.59 x 0.59 = 0.35. If you add up all these genotype frequencies, they should equal 1.

PROBLEM #7. After graduation, you and 19 of your closest friends (let's say 10 males and 10 females) charter a plane to go on a round-the-world tour. Unfortunately, you all crash land (safely) on a deserted island. No one finds you and you start a new population totally isolated from the rest of the world. Two of your friends carry (i.e. are heterozygous for) the recessive cystic fibrosis allele (c). Assuming that the frequency of this allele does not change as the population grows, what will be the incidence of cystic fibrosis on your island? Answer: There are 40 total alleles in the 20 people, of which 2 alleles are for cystic fibrosis. So, 2/40 = 0.05 (5%) of the alleles are for cystic fibrosis. That represents p. Thus, cc or p² = (0.05)² = 0.0025, or 0.25% of the F1 population will be born with cystic fibrosis.

PROBLEM #8. You sample 1,000 individuals from a large population for the MN blood group, which can easily be measured since co-dominance is involved (i.e., you can detect the heterozygotes). They are typed accordingly:

Phenotype  Genotype  Number  Frequency
M          MM        490     0.49
MN         MN        420     0.42
N          NN        90      0.09

Using the data provided above, calculate the following:

  1. The frequency of each allele in the population. Answer: Since MM = p², MN = 2pq, and NN = q², then p (the frequency of the M allele) must be the square root of 0.49, which is 0.7. Since q = 1 – p, then q must equal 0.3.
  2. Supposing the matings are random, the frequencies of the matings. Answer: This is a little harder to figure out. Try setting up a “Punnett square” type arrangement using the 3 genotypes and multiplying the numbers in a manner something like this:
                 MM (0.49)   MN (0.42)   NN (0.09)
    MM (0.49)    0.2401*     0.2058      0.0441
    MN (0.42)    0.2058      0.1764*     0.0378
    NN (0.09)    0.0441      0.0378      0.0081*
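The mating table can be reproduced with a few lines of code. This is a sketch under stated assumptions: the variable names are hypothetical, and the allele frequency is obtained by direct allele counting rather than by the square-root shortcut used in the answer above; for these data both give p = 0.7.

```python
# Observed genotype counts from the MN blood group sample.
genotype_counts = {"MM": 490, "MN": 420, "NN": 90}
total = sum(genotype_counts.values())
genotype_freq = {g: n / total for g, n in genotype_counts.items()}

# Allele counting: each MM individual carries two M alleles, each MN one.
p_m = (2 * genotype_counts["MM"] + genotype_counts["MN"]) / (2 * total)
q_n = 1.0 - p_m

# Under random mating, each pairing frequency is the product of the
# two genotype frequencies (one entry per cell of the table).
mating_freq = {
    (g1, g2): genotype_freq[g1] * genotype_freq[g2]
    for g1 in genotype_freq
    for g2 in genotype_freq
}

print(round(p_m, 4), round(q_n, 4))
for pair, freq in mating_freq.items():
    print(pair, round(freq, 4))
```

Off-diagonal pairings such as MM x MN appear twice in the table, once for each ordering of the parents, so their combined frequency is double the single cell value.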

PROBLEM #9. Cystic fibrosis is a recessive condition that affects about 1 in 2,500 babies in the Caucasian population of the United States. Please calculate the following:

The Normal Distribution

Inferring probabilities from data distributions (that's what we did last week. remember?) can be useful in a descriptive sense, but for inferential statistics we will be making use of theoretical distributions that we can apply to our null expectations. For example, if we were interested in determining whether two sample means represent different statistical populations with different population means, or two samples from a single population (read that again. this is the question that we are asking when we compare means to see if they differ), we would want to define the probability distribution for the difference between 2 sample means drawn from the same population. This is the null expectation, because it is defined by the condition where both sample means estimate the same population mean, rather than each sample mean representing a different statistical population. We use the null expectation because it is an efficient way of making a comparison. There is only one way that 2 sample means can represent a single statistical population, which means that we only have to consider one distribution, that of:

Where both sample means were drawn from a single statistical population. From this distribution, we can determine whether a difference that we observe is too improbable (remember that we defined this earlier as a probability less than 0.05) for us to accept the premise that both sample means were, in fact, drawn from the same statistical population.

Considering the probability distributions for the same difference where the two sample means were drawn from statistical populations with two different central tendencies produces an infinite number of possible distributions (one for each amount by which the two population means might differ). Thus, having only a single distribution to deal with (where both sample means estimate the same population mean) makes our analysis (and therefore our lives) much less complicated.

One probability distribution that (under certain specific circumstances that we will concern ourselves with later) does describe the distribution of differences between sample means drawn from a single population is the normal (or Gaussian) distribution. It is important that we become familiar with this distribution and its characteristics, as it plays an integral role in the assumptions of many of the analyses that we will learn. The normal distribution looks like this:

All 3 of the above distributions were drawn from a statistical population with μ = 10, and the standard deviation (σ), as indicated on the graphs themselves, varied from 1 to 3. If the change in shape of the distribution with increasing variance surprises you, please go back and review the section on descriptive statistics.

The normal distribution is clearly a symmetrical distribution, but not all symmetrical distributions can be considered to be normal. While all 3 of the above distributions may appear different, they are, in fact, all identical in one regard. The distribution of the observations around the mean is very precisely defined as:

68.27% of the observations lie within 1 standard deviation of the mean ( μ ± σ)

95.45% of the observations lie within 2 standard deviations of the mean ( μ ± 2σ)

99.73% of the observations lie within 3 standard deviations of the mean ( μ ± 3σ)

Or, in a slightly more usable format:

50% of the observations lie within 0.674 standard deviations of the mean ( μ ± 0.674σ)

95% of the observations lie within 1.960 standard deviations of the mean ( μ ± 1.960σ)

99% of the observations lie within 2.576 standard deviations of the mean ( μ ± 2.576σ)
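These percentages can be checked numerically. The sketch below uses the standard identity that the fraction of a normal distribution within k standard deviations of the mean equals erf(k/√2); the function name is hypothetical.

```python
from math import erf, sqrt


def coverage_within(k):
    """Fraction of a normal distribution lying within mu ± k*sigma."""
    return erf(k / sqrt(2.0))


for k in (1, 2, 3, 0.674, 1.960, 2.576):
    print(k, round(100 * coverage_within(k), 2))
```

The result does not depend on μ or σ, which is why all three distributions share the same coverage values.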

For this reason, the values of the normal distribution (and other probability distributions that we will employ in our analyses) typically are reported as standardized deviates:

Reporting the values as deviates (Y - μ) centers the distribution around zero, and dividing the deviate by the standard deviation (σ) expresses the X-variable (distance from the mean) in units of standard deviation. Applying this calculation to any of the 3 distributions shown above (or any normal distribution for that matter) produces the following distribution:

Many observations of biological processes and characteristics tend to follow a normal distribution. One potential reason for this is that these processes and characteristics tend to be influenced by numerous determinants and, if the effects of these determinants are additive, the resulting distribution should approach the parameters of a normal distribution. Let us recall Pascal's triangle and consider multiple draws from binomial probabilities of:

Each draw (remember that k is the number of draws) could represent a different genetic (one of 2 alleles) or environmental (one of 2 conditions) factor that influences a particular character. The probability p reflects the chance that a particular effect adds to that character, such that the value for a character is the sum of all the positive influences on that character. Conversely, q is the probability that the factor does not affect the character. If we assign a value of 1 for each addition to the character, then for a character influenced by only 2 factors, i.e., (p + q)^k where k = 2, we would expect the distribution of values for that character to reflect a value of 2 with a probability equal to p² (0.25 in our case), a value of 1 with a probability equal to 2pq (0.5 in our case), and a value of 0 with a probability q². This produces a symmetrical, but not normal, distribution.
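The expansion of (p + q)^k can be tabulated for any k. The following sketch lists the probability of each character value from 0 to k; the function name is hypothetical.

```python
from math import comb


def binomial_expansion(k, p):
    """Probabilities of character values 0..k from the terms of (p + q)^k."""
    q = 1.0 - p
    return [comb(k, x) * p**x * q ** (k - x) for x in range(k + 1)]


# k = 2, p = 0.5 gives the q^2, 2pq, p^2 terms described above.
print(binomial_expansion(2, 0.5))

# A larger k produces the bell-shaped profile discussed in the text.
print([round(v, 4) for v in binomial_expansion(10, 0.5)])
```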

The more factors influencing the value of the character, i.e., the greater k becomes, the closer the distribution of the values for that character approaches a normal distribution, as is demonstrated below, where the bars represent the distribution of values, and the red line is the expected normal distribution (generated using the NORM.DIST function in Excel) for the same mean and standard deviation:

The X-axis values in this case are displayed as distances from the mean, because the mean value of the character increases as k increases (the expected mean is pk). The data for the preceding animation were based on 1000 samples from binomial expansions with p = 0.5, and values of k as shown in the graphs.

This relatively rapid approach to a normal distribution is the result of p being equal to 0.5, which makes the distribution symmetrical at all values of k. For values of p other than 0.5, the approach to a normal distribution occurs much more slowly, as can be seen below (for p = 0.2) by comparing the values for k to those from the previous demonstration:

The data for the preceding animation were based on 1000 samples from binomial expansions with p = 0.2, and values of k as shown in the graphs. The data used in both the preceding animations were generated in R.

Question 1: Explain why many biological variables would be expected to exhibit a normal distribution.

It was noted above that the Excel function NORM.DIST was used to generate the red lines indicating the probability densities for the normal distribution given a specified mean and standard deviation. The syntax for the function is:


Where x is the value on the X-axis for which you wish to find the probability density. The logical argument at the end should be "false", unless a cumulative probability (as shown by the red line below) is desired:

Question 2: What is the difference between the density function (black line above) and the cumulative density function (red line above)?

If you prefer pencil and paper to Excel functions, the normal probability density function can be calculated as:
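For readers who prefer code to either option, the standard normal density, f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)), and its cumulative form can be sketched as below. The function names are hypothetical; the two functions are intended to correspond to Excel's NORM.DIST with the final argument set to FALSE and TRUE respectively.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Probability density of a normal distribution at x."""
    return exp(-((x - mu) ** 2) / (2.0 * sigma**2)) / (sigma * sqrt(2.0 * pi))


def normal_cdf(x, mu, sigma):
    """Cumulative probability P(X <= x) for a normal distribution."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


# Example using mu = 10 and sigma = 1, as in the first distribution above.
print(round(normal_pdf(10, 10, 1), 4))
print(round(normal_cdf(11, 10, 1) - normal_cdf(9, 10, 1), 4))
```

The density gives the height of the curve at x, while the cumulative function gives the area under the curve to the left of x.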

While we will make no real use of the normal distribution as a probability distribution for our inferential statistical analyses (which is why I am not putting you through the busy work of generating z-scores, another term for the standardized deviates of the normal distribution), the assumption that our observations are normally distributed will be required for most of our analyses. Although it may seem counterintuitive, we always test our assumptions. One might argue that they no longer should then be considered "assumptions", but that misinterpretation can easily be corrected by realizing that the assumptions are the assumptions of the analysis, and define the conditions under which the analysis will give us a result that can be properly interpreted. That is why we must test our data against those assumptions in order to determine whether the conclusion to which our analysis leads us is an appropriate one.

We have, in a sense, already evaluated several distributions for normality by a visual comparison of the bars to the red lines. Such a comparison, however, is strongly influenced by the size of a graph. As a young and impressionable lad, I was taught that draws from a binomial distribution, as was demonstrated above, would produce a distribution not distinguishable from a normal distribution on a graph printed on 8.5" x 11" paper when k > 25. This is absolutely true, but in the examples I used above, normality was not achieved until k ≈ 200 when p = 0.5, and when k ≈ 600 for p = 0.2. While this specific set of circumstances might not be broadly applicable, it does serve to illustrate the point that one must be cautious with visual comparisons.

The preceding narrative should also have suggested to you that there are other ways to test for normality. One possibility is generating the probability densities and using a Goodness-of-Fit test to compare the observed frequencies to those expected for a normal distribution. We will deal with such approaches later on when we explore "Analysis of Frequencies" in week 13. For now, take comfort in the fact that there is a far better approach.

The Shapiro-Wilk statistic is one of the most reliable and most widely applied tests for normality. Unfortunately (although you might think it fortunate) it is too cumbersome and computationally intensive for us to do by hand, so when we need to test the assumption of normality, the result of the Shapiro-Wilk test for normality will be provided to you.

One application of the normal distribution (or more correctly, distributions that describe the approach to normality) involves the calculation of confidence intervals.

Probability Tutorial for Biology 231

The aim of this tutorial is to guide you through the basics of probability. An understanding of probability is the key to success in Mendelian and evolutionary genetics. Along the way, you will be challenged with eight problems to test your understanding of the concepts.

    p(A) = the probability of outcome A.

The value of any probability must lie within the range of 0.0 and 1.0. If p(A) = 0.0, then outcome A is impossible. If p(A) = 1.0, then outcome A is guaranteed.

Consider a typical 6-sided die (the singular of dice). Assume that the die is "fair" (i.e., it is equally likely to land with any of its six sides facing up). Define A as 3. What is p(A)? It is simply the probability of rolling a 3: p(A) = 1/6.

    p(~A) = the probability of anything except outcome A.

Using the above example, what is p(~A)? It is the probability of rolling anything other than a 3. This can be calculated as one minus p(A): 1 - 1/6 = 5/6.

If outcomes A and B are mutually exclusive, then p(A,B) = p(A) + p(B). Put another way, the joint probability of outcomes A and B (the probability that either one occurs) equals the sum of their individual probabilities. This concept is central to the SUM RULE.

Here are a couple of examples using the same die. First, define A as the set {1,2}. Define B as the set {4,5,6}. In this example, p(A) = 2/6 (or 1/3) and p(B) = 3/6 (or 1/2). Because outcomes A and B are mutually exclusive, p(A,B) = 2/6 + 3/6 = 5/6.

Now let's redefine B as the set {1,3,5}. What is the joint probability of A and B? It is no longer the sum of the individual probabilities, because A and B are not mutually exclusive; they both have the outcome 1 in common. In this example, p(A,B) = p(1,2,3,5) = 4/6.

If outcomes A and B are independent, then p(AB) = p(A) × p(B), where p(AB) is the probability that both A and B occur. This concept is central to the PRODUCT RULE.
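
The die examples for the two rules can be checked by simple counting. The sketch below is one way to do it in Python; the helper name `prob` and the two-roll example for the Product Rule are illustrative assumptions rather than part of the tutorial.

```python
from fractions import Fraction

DIE = {1, 2, 3, 4, 5, 6}  # faces of a fair six-sided die

def prob(event):
    """Probability that a fair die lands on any face in `event`."""
    return Fraction(len(event & DIE), len(DIE))

A = {1, 2}
B = {4, 5, 6}
# Mutually exclusive outcomes: p(A or B) equals p(A) + p(B).
either_exclusive = prob(A | B)

B2 = {1, 3, 5}
# Not mutually exclusive: the shared face 1 must not be counted twice.
either_overlap = prob(A | B2)

# Product Rule for two independent rolls (assumed example): a 3, then another 3.
two_threes = prob({3}) * prob({3})

print(either_exclusive, either_overlap, two_threes)
```

Counting faces directly avoids the double counting that a naive sum would introduce when the two sets overlap.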

Applying Basic Probability to Mendelian Genetics.

    Mendel's First Law (Equal Segregation of Alleles). If an organism has the genotype Dd, Mendel's First Law tells us that half of its gametes should bear the D allele and half should bear the d allele. In terms of formal probability, p(D) = 0.5 and p(d) = 0.5. If an individual has the DD genotype, then p(D) = 1.0 and p(d) = 0.0. If an individual has the dd genotype, then p(D) = 0.0 and p(d) = 1.0.

How is this applicable to Mendelian genetics? Consider the following cross: Dd (parent #1) × dd (parent #2). Using formal probability, what is the chance that a particular offspring has the Dd genotype? We know there are only two ways that this can happen: either (i) parent #1 passes on a D allele and parent #2 passes on a d allele (outcome A), or (ii) parent #1 passes on a d allele and parent #2 passes on a D allele (outcome B). Here is what we want to know: p(A,B), the probability that either outcome A or outcome B occurs.

Since A and B are mutually exclusive outcomes, we can use the Sum Rule and simply add together p(A) and p(B). But we first have to calculate these.

Let's begin with p(A), the probability that the Dd individual passes on a D allele and the dd individual passes on a d allele. It should be apparent that we can use the Product Rule here, since the two parents are passing on alleles independently of each other. The probability that the Dd parent passes on a D allele is 0.5, and the probability that the dd parent passes on the d allele is 1.0. Therefore, p(A) = 0.5 × 1.0 = 0.5.

Now let's move on to p(B), the probability that the Dd parent passes on the d allele and the dd parent passes on the D allele. Again, we can use the Product Rule. The probability that the Dd parent passes on the d allele is 0.5, and the probability that the dd parent passes on the D allele is 0.0 (right?). Therefore, p(B) = 0.5 × 0.0 = 0.0.

So, to finish the problem, we use the Sum Rule. Remember, we want to solve for p(A,B). We've already accepted that the conditions for using the Sum Rule have been met, so p(A,B) = p(A) + p(B) = 0.5 + 0.0 = 0.5.

Was this easier than using a Punnett square? Probably not. However, all of this reasoning is implicit in a Punnett square! A Punnett square is just a visual shortcut for doing the same arithmetic.
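
The same arithmetic can be written out as a short program. This is a minimal sketch, assuming one gene with two alleles per parent; the function names `gametes` and `cross` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def gametes(genotype):
    """Map each allele of a one-gene genotype such as 'Dd' to its probability."""
    counts = Counter(genotype)
    return {allele: Fraction(n, len(genotype)) for allele, n in counts.items()}

def cross(parent1, parent2):
    """Offspring genotype probabilities for a one-gene cross."""
    offspring = Counter()
    pairs = product(gametes(parent1).items(), gametes(parent2).items())
    for (a1, p1), (a2, p2) in pairs:
        genotype = "".join(sorted(a1 + a2))  # 'dD' and 'Dd' are the same genotype
        offspring[genotype] += p1 * p2       # Product Rule, then Sum Rule via +=
    return dict(offspring)

print(cross("Dd", "dd"))
```

Each pair of gametes corresponds to one box of a Punnett square, and adding up the boxes with the same genotype is the Sum Rule at work.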

aBde AaBBDdEe AaBBddEe AaBbDdEe AaBbddEe
abDe AaBbDDEe AaBbDdEe AabbDDEe AabbDdEe
abdE AaBbDdEE AaBbddEE AabbDdEE AabbddEE
abde AaBbDdEe AaBbddEe AabbDdEe AabbddEe

Right. There are 32 boxes (we got off easy; there could have been 64 for 4 genes). Let's find the ones with AabbddEE.

aBde AaBBDdEe AaBBddEe AaBbDdEe AaBbddEe
abDe AaBbDDEe AaBbDdEe AabbDDEe AabbDdEe
abdE AaBbDdEE AaBbddEE AabbDdEE [AabbddEE]
abde AaBbDdEe AaBbddEe AabbDdEe AabbddEe

It looks like there's a 1/32 chance of getting this genotype. Now let's do it the easy way. Define outcome A as Aa, outcome B as bb, outcome D as dd and outcome E as EE. We are interested in determining p(ABDE), the probability of simultaneously seeing all four outcomes. Because the genes are independently assorting, we can use the Product Rule: p(ABDE) = p(A) × p(B) × p(D) × p(E).

  • p(A): the probability of getting Aa from a AA × aa cross is 1.0.
  • p(B): the probability of getting bb from a Bb × Bb cross is 0.25.
  • p(D): the probability of getting dd from a Dd × Dd cross is 0.25.
  • p(E): the probability of getting EE from a EE × Ee cross is 0.5.
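
Multiplying the four values gives 1.0 × 0.25 × 0.25 × 0.5 = 1/32, the same answer as counting boxes. The sketch below repeats that calculation gene by gene; the function names `locus_prob` and `genotype_prob` are hypothetical, and the approach assumes independently assorting genes with two alleles each.

```python
from fractions import Fraction
from itertools import product

def locus_prob(parent1, parent2, target):
    """Probability that a one-gene cross (e.g. 'Bb' x 'Bb') yields `target`."""
    hits = sum(1 for a1, a2 in product(parent1, parent2)
               if sorted(a1 + a2) == sorted(target))
    return Fraction(hits, len(parent1) * len(parent2))

def genotype_prob(loci):
    """Product Rule across independently assorting genes."""
    total = Fraction(1)
    for parent1, parent2, target in loci:
        total *= locus_prob(parent1, parent2, target)
    return total

# The four single-gene crosses listed above, with the target genotype at each.
loci = [("AA", "aa", "Aa"), ("Bb", "Bb", "bb"),
        ("Dd", "Dd", "dd"), ("EE", "Ee", "EE")]
print(genotype_prob(loci))
```

Because each gene is handled separately, the work grows with the number of genes rather than with the number of boxes in the full Punnett square.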


Let's cross AaBBCcDdEEffGGHh × AaBbccDDEeFfGgHh. Again, we'll assume that the genes are independently assorting.

First, what is the chance that a particular offspring has the AaBbccDDEeFfGghh genotype? If you choose to set up a Punnett square, beware! You'll have 16 columns and 64 rows, for a grand total of 1024 boxes. Don't make any mistakes.

From the same cross, what is the probability that the offspring has the dominant phenotype for all eight genes, assuming that upper-case alleles are dominant to lower-case alleles?

Conditional Probability

    p(A|B) = the probability of outcome A given condition B. This is not the same as a joint probability or a simultaneous probability.

It turns out that p(A|B) is very easy to calculate: p(A|B) = p(AB) ÷ p(B). Remember, p(AB) is the simultaneous probability of outcomes A and B. The conditional probability of A given B is their simultaneous probability divided by the probability of B.

Here is an example. Define A as 3. Define B as "odd numbers." First, determine p(A), the probability that a fair die lands on 3. The answer is 1/6.

Now, determine p(A|B), the probability of rolling 3 given that the die lands on an odd number. The answer is 1/3. Why did the answer change? It didn't. We are asking two different questions. In the first case, we wanted to know the overall probability of outcome A. In the second case, we were only interested in the chance of rolling 3 if condition B was satisfied. If the die had landed on 2, 4 or 6, then condition B would not have been satisfied.

Does the arithmetic described above work? The probability of outcome B (rolling an odd number) is 1/2. The simultaneous probability of A and B is the probability of rolling 3, which is 1/6 since this is the only outcome that satisfies both A and B. Using the formula p(A|B) = p(AB) ÷ p(B), our answer is 1/6 ÷ 1/2 = 1/3.
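
The formula can also be checked by counting die faces. This is a small sketch; the helper names `prob` and `conditional` are illustrative assumptions.

```python
from fractions import Fraction

DIE = {1, 2, 3, 4, 5, 6}  # faces of a fair six-sided die

def prob(event):
    """Probability that a fair die lands on any face in `event`."""
    return Fraction(len(event & DIE), len(DIE))

def conditional(a, b):
    """p(A|B) = p(AB) / p(B), where p(AB) is the probability of both A and B."""
    return prob(a & b) / prob(b)

A = {3}
ODD = {1, 3, 5}
EVEN = DIE - ODD  # the die does not land on an odd number

print(prob(A), conditional(A, ODD), conditional(A, EVEN))
```

Conditioning on "odd" shrinks the set of possible faces from six to three, which is why the chance of a 3 rises from 1/6 to 1/3; conditioning on "not odd" removes the 3 entirely.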

A slightly trickier problem: determine p(A|~B). We are now seeking the probability of rolling 3 given that the die does not land on an odd number. The answer, of course, is zero. Does the math work? The probability of not rolling an odd number is 1/2. However, the probability of simultaneously satisfying A and ~B (i.e., rolling 3 and not rolling an odd number) is zero. So p(A|~B) = 0 ÷ 1/2 = 0.


Let's apply this to a common Mendelian genetics problem. There is a gene in cats that affects development of the spine. Individuals with the MM genotype are phenotypically normal. Individuals with the Mm genotype are tailless (Manx) cats. The mm genotype is developmentally lethal, so zygotes with this genotype do not develop into kittens. If you cross two Manx cats, what fraction of the kittens are expected to be Manx?

Let's try a different problem. In fruit flies, brown eyes result from a homozygous recessive genotype (br/br). A pair of heterozygous parents produce a son with wild type eye color. He is mated with a brown-eyed female. What is the probability that their first offspring has brown eyes?

Probability in Statistical Analysis.

For many statistical tests, we are interested in the so-called p-value. This is the probability of obtaining a particular value of a test statistic (or greater) just by chance. In general, we are using the statistical test to contrast observed results (our data) to expected results (those predicted by the hypothesis being tested). [We usually must make certain assumptions about the data in order to use the p-value to reject or fail to reject the hypothesis.] If the difference between the observed and expected results is sufficiently great -- by convention, such that the p-value corresponding to the test statistic value is less than 0.05 -- we reject the hypothesis used to generate the expected results. If the p-value is greater than 0.05, we fail to reject the hypothesis.

How do we put this in terms of formal probability? Define A as "the observed results or any results less likely given the hypothesis" and B as "the hypothesis is correct." If all of the assumptions of the statistical test are valid, then the p-value = p(A|B): the probability of observing the results or any less likely results given that the hypothesis is correct.

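The p-value also tells us about the risk of a mistaken rejection: if the hypothesis is in fact correct and we reject it whenever the p-value falls below 0.05, we will wrongly reject it about 5% of the time. (Strictly speaking, the p-value is not the probability that the hypothesis is true, nor the probability that a particular rejection is a mistake.) Obviously, we don't like to make mistakes. So we feel better about rejecting a hypothesis if our statistical test gives us a very low p-value.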

The Binomial Distribution

A particularly broad class of repeated experiments falls into the category of Bernoulli Trials. By definition, Bernoulli trials have three characteristics:

    the result of each experiment (i.e., trial) is either success or failure (yes or no, true or false, etc.)
    the probability of success is the same in every trial
    the outcome of one trial does not affect the outcome of any other trial

If one knows in advance the probability of success (p), then one can predict the exact probability of k successes in N Bernoulli trials. This probability can be written formally as:

p(k|pN) = [N! ÷ (k! × (N-k)!)] × p^k × (1-p)^(N-k).

In terms of formal probability, this is the probability of k successes given N trials and given probability of success = p. [Note the awkward use of p for two different purposes in the equation.] This formula is the basis of the Binomial Distribution.

Perhaps a more proper way to think about the Binomial Distribution is to consider the distribution, itself. The Binomial Distribution describes the probabilities of all possible outcomes of N Bernoulli trials given probability of success = p. It should be evident that one could observe, in principle, any integer number of successes ranging from 0 to N.

To better understand the Binomial Distribution, it makes sense to break down the formula.

    [N! ÷ (k! × (N-k)!)]. If we perform N trials and don't care which of those trials represent the k successes, we must calculate the number of different ways that we can get k successes. Consider a die-rolling experiment, where we define success as rolling a 3. If we roll the die 4 times, how many different ways are there to get 0, 1, 2, 3 or 4 successes? The following table summarizes this.

k     Trial 1   Trial 2   Trial 3   Trial 4
k=0   Fail      Fail      Fail      Fail
k=1   Success   Fail      Fail      Fail
      Fail      Success   Fail      Fail
      Fail      Fail      Success   Fail
      Fail      Fail      Fail      Success
k=2   Success   Success   Fail      Fail
      Success   Fail      Success   Fail
      Success   Fail      Fail      Success
      Fail      Success   Success   Fail
      Fail      Success   Fail      Success
      Fail      Fail      Success   Success
k=3   Fail      Success   Success   Success
      Success   Fail      Success   Success
      Success   Success   Fail      Success
      Success   Success   Success   Fail
k=4   Success   Success   Success   Success

By comparison, the formula gives the following answers:

k   N!                    k!                    (N-k)!                N! ÷ (k! × (N-k)!)
0   1 × 2 × 3 × 4 = 24    1 (by definition)     1 × 2 × 3 × 4 = 24    24 ÷ (1 × 24) = 1
1   1 × 2 × 3 × 4 = 24    1                     1 × 2 × 3 = 6         24 ÷ (1 × 6) = 4
2   1 × 2 × 3 × 4 = 24    1 × 2 = 2             1 × 2 = 2             24 ÷ (2 × 2) = 6
3   1 × 2 × 3 × 4 = 24    1 × 2 × 3 = 6         1                     24 ÷ (6 × 1) = 4
4   1 × 2 × 3 × 4 = 24    1 × 2 × 3 × 4 = 24    1 (by definition)     24 ÷ (24 × 1) = 1

For the die-rolling experiment, the probability of success, p, is 1/6; the probability of failure, 1-p, is 5/6. The following table shows the probabilities of k of N successes for p=1/6:

k   p^k                 (1-p)^(N-k)         p^k × (1-p)^(N-k)
0   (1/6)^0 = 1.0000    (5/6)^4 = 0.4823    1.0000 × 0.4823 = 0.4823
1   (1/6)^1 = 0.1667    (5/6)^3 = 0.5787    0.1667 × 0.5787 = 0.0965
2   (1/6)^2 = 0.0278    (5/6)^2 = 0.6944    0.0278 × 0.6944 = 0.0193
3   (1/6)^3 = 0.0046    (5/6)^1 = 0.8333    0.0046 × 0.8333 = 0.0039
4   (1/6)^4 = 0.0008    (5/6)^0 = 1.0000    0.0008 × 1.0000 = 0.0008

k   [N! ÷ (k! × (N-k)!)] × [p^k × (1-p)^(N-k)] = p(k|pN)
0   1 × 0.4823 = 0.4823
1   4 × 0.0965 = 0.3858
2   6 × 0.0193 = 0.1157
3   4 × 0.0039 = 0.0154
4   1 × 0.0008 = 0.0008
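
The tables above can be reproduced in a few lines of Python. This is a minimal sketch using the standard-library function `math.comb` for the number of ways to get k successes; the function name `binomial_pmf` is an illustrative assumption.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 4, 1 / 6  # four rolls of a fair die; success means rolling a 3
table = [(k, comb(n, k), binomial_pmf(k, n, p)) for k in range(n + 1)]
for k, ways, probability in table:
    print(k, ways, round(probability, 4))
```

The probabilities are computed from unrounded intermediate values, which is why a product such as 4 × 0.0965 appears in the table as 0.3858 rather than 0.3860. The five probabilities add up to 1, as they must for a complete distribution.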

Below are binomial distribution plots for 10 Bernoulli trials with three different probabilities of success.

As the number of trials is increased, the binomial distribution becomes smoother. In fact, the normal distribution can be derived mathematically as the limit of a binomial distribution as N approaches infinity; the derivation is simplest for p = 0.5.
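
One rough way to see this convergence numerically is to compare each binomial probability with the density of a normal distribution that has the same mean, Np, and the same standard deviation, the square root of Np(1-p). The sketch below does this; the function name `max_gap` and the trial counts 10, 100 and 400 are assumptions chosen only for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def max_gap(n, p):
    """Largest gap between the binomial probabilities and the matching normal density."""
    normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return max(abs(binomial_pmf(k, n, p) - normal.pdf(k)) for k in range(n + 1))

# Trial counts chosen only for illustration (assumed, not from the text).
for n in (10, 100, 400):
    print(n, max_gap(n, 0.5))
```

The largest gap shrinks as the number of trials grows, which is the numerical counterpart of the bars closing in on the smooth curve.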


Do we really expect the expected results of a cross? Hmmm. In mice, individuals with either the BB or Bb genotype have black fur, while those with the bb genotype have brown fur. [We are ignoring other genes that can interact with this gene to produce other fur colors.] You cross true-breeding black and brown mice to produce heterozygotes, then cross these to produce an F2 generation with sixteen mouse pups. What is the exact probability that you will observe the expected result: twelve black mice and four brown mice?

Consider, then, the two closest outcomes: eleven black/five brown mice and thirteen black/three brown mice. How much more likely is the expected result than each of these alternative results?

In traditional statistical analysis, we are estimating the probability of observed data given the hypothesis. Sometimes, however, we are interested in the inverse: the probability of a hypothesis given the observed data.

Consider the following scenario. A female human (Gladys) with an autosomal recessive phenotype has mated with a male human (Mickey) with the dominant phenotype. They have three offspring, all of whom show the dominant phenotype. What is the probability that Mickey was a heterozygote?

If we define A as the observed results (i.e., the data) and B as the hypothesis that Mickey is heterozygous and Gladys is homozygous recessive, we are interested in the value of p(B|A). As a conditional probability,

p(B|A) = p(BA) ÷ p(A).

Likewise,

p(A|B) = p(AB) ÷ p(B).

It should be obvious that

p(AB) = p(BA).

Therefore, rearranging the formula for p(A|B) and substituting p(BA) for p(AB), we get

p(BA) = p(A|B) × p(B).

If we substitute this into the first formula, we get

p(B|A) = [p(A|B) × p(B)] ÷ p(A).

This equation represents Bayes' Theorem. It has three components:

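  • p(B) is the prior probability of B. In other words, it is the probability of B before we have any additional information.
  • p(A|B) is the probability of A given B, that is, the probability of the data if the hypothesis is correct.
  • p(B|A) is the posterior probability of B, that is, the probability of B given A.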

At first glance, solving the Mickey/Gladys problem might seem straightforward. We want to calculate the posterior probability of Mickey being a heterozygote given the observation that three children have the dominant phenotype. However, it turns out that only one of the terms on the right side of the formula can actually be calculated with the information provided:

  • p(A|B) is the probability of three dominant offspring from a cross of heterozygous and homozygous recessive parents. If we use G and g to represent the alleles of the relevant gene, the cross would be Gg × gg. This cross produces Gg and gg offspring with equal probability (0.5). Since each offspring is produced independently, we use the Product Rule. There is a 0.5 × 0.5 × 0.5 (1/8) chance of having three phenotypically dominant offspring.

The other two terms, p(A) and p(B), cannot be calculated with the information provided. We need one more piece of information: the prior probability that Mickey is heterozygous. That is, before we had any offspring data, what was the chance that Mickey was heterozygous? It depends on his parents. If they were both heterozygous, then there is a 2/3 chance that Mickey is heterozygous and a 1/3 chance that he is homozygous dominant. [Remember, we are conditioning these probabilities on the observation that Mickey has the dominant phenotype. Therefore, we only consider the outcomes of the cross that produce dominant offspring.] But if Mickey's parents had different genotypes, the chance that he is a heterozygote will change. So we need more information. Here it is: let's assume that we had prior information that led us to believe that both of Mickey's parents were heterozygous.

Now we can plug in a value for p(B), the prior probability that Mickey is heterozygous and Gladys is homozygous. We know that Gladys has the gg genotype. We also know that Mickey has the dominant phenotype, so his genotype must be either GG or Gg. If both of his parents were heterozygous, then there is a 2/3 chance that Mickey is heterozygous. Therefore, we will assume that the prior probability of Mickey being heterozygous and Gladys being homozygous, p(B), is 2/3.

What about p(A)? This actually still has to be calculated. In terms of formal probability,

p(A) = p(A|B) × p(B) + p(A|~B) × p(~B).

So, first, what is p(A|B)? We calculated this already! So, next, what is p(A|~B)? That is, what is the probability of seeing three phenotypically dominant offspring if Mickey is not heterozygous? Since Mickey has the dominant phenotype, this means he must have the homozygous dominant genotype. Therefore, there is a 1.0 probability that the three offspring are phenotypically dominant. [GG × gg can only produce Gg offspring.] The prior probability that Mickey is not heterozygous, p(~B), is 1 - 2/3 = 1/3. Therefore, using the formula above, p(A) = 1/8 × 2/3 + 1.0 × 1/3 = 1/12 + 4/12 = 5/12.

We are now ready to calculate the probability that Mickey is a heterozygote given the fact that he and Gladys have three phenotypically dominant offspring. From Bayes' Theorem, p(B|A) = [p(A|B) × p(B)] ÷ p(A) = 1/8 × 2/3 ÷ 5/12 = 1/8 × 2/3 × 12/5 = 0.2.
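
The whole calculation fits in a few lines of Python. This is a minimal sketch for a hypothesis and its complement; the function name `posterior` and its argument names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def posterior(prior_b, like_b, like_not_b):
    """Bayes' Theorem for a hypothesis B and its complement, given data A."""
    # Total probability of the data, summed over both hypotheses.
    p_a = like_b * prior_b + like_not_b * (1 - prior_b)
    return like_b * prior_b / p_a

prior_het = Fraction(2, 3)       # p(B): both of Mickey's parents assumed heterozygous
like_het = Fraction(1, 2) ** 3   # p(A|B): three dominant offspring from Gg x gg
like_hom = Fraction(1)           # p(A|not B): GG x gg always gives dominant offspring

p_het = posterior(prior_het, like_het, like_hom)
print(p_het, 1 - p_het)
```

Changing `prior_het` changes the result; for example, an assumed prior of 1/2 gives a posterior of 1/9 rather than 1/5.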

This is a very important point: if we had made different assumptions about the genotypes of Mickey's parents, we would have obtained a different answer.

This is another very important point: the posterior probability of a hypothesis is generally different than the prior probability of a hypothesis. This is because the posterior probability of a hypothesis is calculated after additional information (the data) has been provided.

Let's take the Mickey/Gladys problem one step farther. Given the data, what are the relative likelihoods of our two competing hypotheses: B, the hypothesis that Mickey is heterozygous, and ~B, the hypothesis that Mickey is homozygous (i.e., an individual with the dominant phenotype but not the Gg genotype)? We have already calculated the posterior probability that Mickey is heterozygous (assuming a prior probability of 2/3). We now must calculate the posterior probability that Mickey is homozygous (under the same prior assumptions). This can be written as

p(~B|A) = [p(A|~B) × p(~B)] ÷ p(A).

Here p(A|~B) = 1. This was calculated earlier. It is the chance of getting three dominant offspring if Mickey has the GG genotype. With p(~B) = 1/3 and p(A) = 5/12, p(~B|A) = 1 × 1/3 ÷ 5/12 = 0.8.

This should actually make sense. If we already calculated that the posterior probability of Mickey being a heterozygote is 0.2, then the posterior probability that he is not a heterozygote should be 1 - 0.2, or 0.8.

So, given the data, what are the relative likelihoods of the two competing hypotheses?

p(B|A) / p(~B|A) = 0.2 / 0.8 = 1/4.

In other words, it is four times more likely that Mickey is a homozygote than it is that he is a heterozygote.


Consider a scenario where healthy individuals heterozygous for a recessive genetic disease represent 18% of the general population, while those with the disease represent 1% of the general population. A healthy male has undergone testing for the recessive allele and learns that he is heterozygous. His spouse is also healthy, but we do not know her genotype. They have a healthy child. What is the posterior probability that she is homozygous?

This next problem is pretty challenging. How many healthy children must they have before she can be more than 95% confident that she is homozygous? [Note: if they have even one child with the disease, the question is moot. We would know that she is heterozygous.]

Calculating Relatedness

How can we calculate relatedness in inbred, mixed, or haplodiploid families? The procedure is essentially the same as with regular diploid families. We can trace genes from generation to generation and calculate the probability that they are shared or we can use a graphical technique similar to the one above. However, we can no longer assume that all steps reduce relatedness by a factor of 2 (multiplying by ½). Instead, we must label our family tree with the known relatedness at each step. As you make your path through the tree, write down the relatedness at each step. At the end, multiply all of the r values to obtain the coefficient of relatedness. The four trees below illustrate this for sample unrelated, inbred, mixed, and haplodiploid cases.

Unrelated. This tree simply adds the relatedness of ½ between parents and offspring and between siblings when there is no inbreeding.

Inbred. A was related to its mate by ½ (siblings) and B was related to its mate by ⅛ (first cousins). No other parents are related.

Mixed. In this tree, B and C share only one parent, A, reducing their relatedness to ¼. Similarly, D and E share only one parent, B.

Haplodiploid. This tree shows a family of wasps with multiple queens. Only G is male; all the others are female. (Non-reproductive females are not shown.)

For example, what is the relatedness between D and G in the Inbred tree? Following the path D-B-C-G, we cross relatednesses of 9/16, ¾, and ½, giving a relatedness of 27/128 (about 0.211). For comparison, D and G are related by ⅛ (0.125) in the Unrelated tree. In the Mixed tree, how are D and K related? The path D-E-K has ¼ and ½, for a relatedness of ⅛. For more practice, try the problems below, which refer to the family trees above unless otherwise indicated. (For other families, you need to draw customized trees using the parent-offspring and sibling-sibling relatednesses given earlier in this tutorial.)
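
Multiplying the values along a path is easy to check by machine. This is a minimal sketch using the step values quoted above; the function name `path_relatedness` is hypothetical, and the step values must still be read off the family tree by hand.

```python
from fractions import Fraction
from math import prod

def path_relatedness(steps):
    """Multiply the relatedness values written along one path through a family tree."""
    return prod(steps, start=Fraction(1))

# Path D-B-C-G in the Inbred tree and path D-E-K in the Mixed tree.
inbred_d_to_g = path_relatedness([Fraction(9, 16), Fraction(3, 4), Fraction(1, 2)])
mixed_d_to_k = path_relatedness([Fraction(1, 4), Fraction(1, 2)])

print(inbred_d_to_g, float(inbred_d_to_g), mixed_d_to_k)
```

Exact fractions keep the result in the same form as the tree labels, so it can be compared directly with values such as ⅛.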