Why are genes assumed to follow a multivariate normal distribution?

I wonder why gene expression data are so frequently modeled by multivariate normal distributions. What justifies the strong assumption that genes follow a multivariate Gaussian? Are there reasons specific to genetics, beyond the usual reasons for Gaussian assumptions (ease of calculation, etc.)?

Usually, if a quantity is not expected to follow any particular scheme, its measured values are assumed to be normal. This is not specific to gene expression; it applies to all kinds of measurements, such as the dimensions of an object, the luminosity of an electric bulb, or the range of a bullet. In any measurement, the random error is modeled using a normal distribution. I don't have a very intuitive explanation for why random errors follow a normal distribution, but mathematically it comes from the central limit theorem: the sum of many small independent contributions is approximately normal.

Now each gene is a variable, and the measurement of each gene suffers some random error, so a multivariate normal distribution is used.

When we reject a null hypothesis in a t-test or z-test, what we are actually doing is rejecting our parsimonious notion that a sample is drawn from a given normal distribution. This can mean one of two things:

  1. The sample belongs to some other normal distribution (different $\mu$ and $\sigma$)
  2. The sample follows some other distribution

But a t-test will never be able to point out the exact reason. All it tells you is that the sample is not from the given normal distribution.

Genes that are located on the same chromosome are called linked genes. Alleles for these genes tend to segregate together during meiosis, unless they are separated by crossing-over. Crossing-over occurs when two homologous chromosomes exchange genetic material during meiosis I. The closer together two genes are on a chromosome, the less likely their alleles will be separated by crossing-over. At the following link, you can watch an animation showing how genes on the same chromosome may be separated by ed%20Genes.htm.

Linkage explains why certain characteristics are frequently inherited together. For example, genes for hair color and eye color are linked, so certain hair and eye colors tend to be inherited together, such as blonde hair with blue eyes and brown hair with brown eyes. What other human traits seem to occur together? Do you think they might be controlled by linked genes?

Sex-Linked Genes

Genes located on the sex chromosomes are called sex-linked genes. Most sex-linked genes are on the X chromosome, because the Y chromosome has relatively few genes. Strictly speaking, genes on the X chromosome are X-linked genes, but the term sex-linked is often used to refer to them.

Mapping Linkage

Linkage can be assessed by determining how often crossing-over occurs between two genes on the same chromosome. Genes on different (nonhomologous) chromosomes are not linked. They assort independently during meiosis, so they have a 50 percent chance of ending up in different gametes. If genes show up in different gametes less than 50 percent of the time (that is, they tend to be inherited together), they are assumed to be on the same (homologous) chromosome. They may be separated by crossing-over, but this is likely to occur less than 50 percent of the time. The lower the frequency of crossing-over, the closer together on the same chromosome the genes are presumed to be. Frequencies of crossing-over can be used to construct a linkage map like the one in Figure below. A linkage map shows the locations of genes on a chromosome.
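The reasoning above can be sketched numerically. The following is a minimal illustration, assuming we have counted offspring (or gametes) that carry parental versus recombinant allele combinations for a pair of genes. The function name `recombination_frequency` and all counts are hypothetical and chosen only for illustration.

```python
def recombination_frequency(parental: int, recombinant: int) -> float:
    """Fraction of observations in which the two genes were separated."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("need at least one observation")
    return recombinant / total


# Hypothetical counts (parental, recombinant) for three gene pairs.
crosses = {
    ("A", "B"): (920, 80),   # rarely separated: presumed close together
    ("A", "C"): (700, 300),  # separated more often: presumed farther apart
    ("A", "D"): (500, 500),  # 50 percent: consistent with independent assortment
}

for pair, (parental, recombinant) in crosses.items():
    rf = recombination_frequency(parental, recombinant)
    status = "linked" if rf < 0.5 else "not linked"
    print(pair, f"{rf:.0%}", status)
```

Sorting gene pairs by this frequency gives the relative ordering that a linkage map is built from: the lower the frequency, the closer the genes are presumed to be.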

Linkage Map for the Human X Chromosome. This linkage map shows the locations of several genes on the X chromosome. Some of the genes code for normal proteins. Others code for abnormal proteins that lead to genetic disorders. Which pair of genes would you expect to have a lower frequency of crossing-over: the genes that code for hemophilia A and G6PD deficiency, or the genes that code for protan and Xm?


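Generation and parameters

Let Z be a standard normal variable, and let μ and σ > 0 be two real numbers. Then the distribution of the random variable X = exp(μ + σZ) is called the log-normal distribution with parameters μ and σ. These are the expected value (or mean) and standard deviation of the variable's natural logarithm, not the expectation and standard deviation of X itself.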

Probability density function

A positive random variable X is log-normally distributed (i.e., $X \sim \operatorname{Lognormal}(\mu_x, \sigma_x^2)$ [1]) if the natural logarithm of X is normally distributed with mean μ and variance σ²:

$$\ln(X) \sim \mathcal{N}(\mu, \sigma^2)$$

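Cumulative distribution function

The cumulative distribution function is $F_X(x) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right)$, where Φ is the cumulative distribution function of the standard normal distribution. This may also be expressed as follows: [2]

$$\frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\ln x - \mu}{\sigma\sqrt{2}}\right)\right] = \frac{1}{2}\operatorname{erfc}\left(-\frac{\ln x - \mu}{\sigma\sqrt{2}}\right)$$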

Multivariate log-normal

Since the multivariate log-normal distribution is not widely used, the rest of this entry only deals with the univariate distribution.

Characteristic function and moment generating function

All moments of the log-normal distribution exist and $\operatorname{E}[X^n] = e^{n\mu + \frac{1}{2}n^2\sigma^2}$.

However, a number of alternative divergent series representations have been obtained. [10] [11] [12] [13]

where W is the Lambert W function. This approximation is derived via an asymptotic method, but it stays sharp all over the domain of convergence of φ .

Probability in different domains

The probability content of a log-normal distribution in any arbitrary domain can be computed to desired precision by first transforming the variable to normal, then numerically integrating using the ray-trace method. [15] (Matlab code)

Probabilities of functions of a log-normal variable

Since the probability of a log-normal can be computed in any domain, this means that the cdf (and consequently pdf and inverse cdf) of any function of a log-normal variable can also be computed. [15] (Matlab code)

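Geometric or multiplicative moments

Note that the geometric mean is smaller than the arithmetic mean. This is due to the AM–GM inequality and is a consequence of the logarithm being a concave function. In fact,

$$\operatorname{E}[X] = e^{\mu + \frac{1}{2}\sigma^2} = e^{\mu} \cdot \sqrt{e^{\sigma^2}} = \operatorname{GM}[X] \cdot \sqrt{\operatorname{GVar}[X]}.$$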

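Arithmetic moments

For any real or complex number n, the n-th moment of a log-normally distributed variable X is given by [4]

$$\operatorname{E}[X^n] = e^{n\mu + \frac{1}{2}n^2\sigma^2}.$$

Specifically, the arithmetic mean, expected square, arithmetic variance, and arithmetic standard deviation of a log-normally distributed variable X are respectively given by: [2]

$$\operatorname{E}[X] = e^{\mu + \frac{1}{2}\sigma^2},$$
$$\operatorname{E}[X^2] = e^{2\mu + 2\sigma^2},$$
$$\operatorname{Var}[X] = \operatorname{E}[X^2] - \operatorname{E}[X]^2 = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right),$$
$$\operatorname{SD}[X] = \sqrt{\operatorname{Var}[X]} = e^{\mu + \frac{1}{2}\sigma^2}\sqrt{e^{\sigma^2} - 1}.$$

The arithmetic coefficient of variation is the ratio $\operatorname{SD}[X]/\operatorname{E}[X] = \sqrt{e^{\sigma^2} - 1}$. This estimate is sometimes referred to as the "geometric CV" (GCV), [19] [20] due to its use of the geometric variance. Contrary to the arithmetic standard deviation, the arithmetic coefficient of variation is independent of the arithmetic mean.

The parameters μ and σ can be obtained if the arithmetic mean and the arithmetic variance are known:

$$\mu = \ln\left(\frac{\operatorname{E}[X]^2}{\sqrt{\operatorname{Var}[X] + \operatorname{E}[X]^2}}\right), \qquad \sigma^2 = \ln\left(1 + \frac{\operatorname{Var}[X]}{\operatorname{E}[X]^2}\right).$$

A probability distribution is not uniquely determined by the moments $\operatorname{E}[X^n] = e^{n\mu + \frac{1}{2}n^2\sigma^2}$ for n ≥ 1. That is, there exist other distributions with the same set of moments. [4] In fact, there is a whole family of distributions with the same moments as the log-normal distribution. [ citation needed ]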

Mode, median, quantiles

The mode is the point of global maximum of the probability density function. In particular, by solving the equation $(\ln f)' = 0$, we get that:

$$\operatorname{Mode}[X] = e^{\mu - \sigma^2}.$$

Specifically, the median of a log-normal distribution is equal to its multiplicative mean, [21]

$$\operatorname{Med}[X] = e^{\mu}.$$

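Partial expectation

The partial expectation of a random variable X with respect to a threshold k is defined as $g(k) = \int_k^\infty x f_X(x)\,dx$. For a log-normal random variable, it is given by

$$g(k) = e^{\mu + \frac{1}{2}\sigma^2}\,\Phi\left(\frac{\mu + \sigma^2 - \ln k}{\sigma}\right),$$

where Φ is the normal cumulative distribution function. The derivation of the formula is provided in the discussion of this Wikipedia entry. [ where? ] The partial expectation formula has applications in insurance and economics; it is used in solving the partial differential equation leading to the Black–Scholes formula.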

Conditional expectation

Alternative parameterizations

  • LogNormal1(μ,σ) with mean, μ, and standard deviation, σ, both on the log-scale [24]: $P(x; \mu, \sigma) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right]$
  • LogNormal2(μ,υ) with mean, μ, and variance, υ, both on the log-scale: $P(x; \mu, v) = \frac{1}{x\sqrt{v}\sqrt{2\pi}} \exp\left[-\frac{(\ln x - \mu)^2}{2v}\right]$
  • LogNormal3(m,σ) with median, m, on the natural scale and standard deviation, σ, on the log-scale [24]: $P(x; m, \sigma) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left[-\frac{\ln^2(x/m)}{2\sigma^2}\right]$
  • LogNormal4(m,cv) with median, m, and coefficient of variation, cv, both on the natural scale: $P(x; m, cv) = \frac{1}{x\sqrt{\ln(cv^2+1)}\sqrt{2\pi}} \exp\left[-\frac{\ln^2(x/m)}{2\ln(cv^2+1)}\right]$
  • LogNormal5(μ,τ) with mean, μ, and precision, τ, both on the log-scale [25]: $P(x; \mu, \tau) = \sqrt{\frac{\tau}{2\pi}}\,\frac{1}{x} \exp\left[-\frac{\tau}{2}(\ln x - \mu)^2\right]$
  • LogNormal6(m,σg) with median, m, and geometric standard deviation, σg, both on the natural scale [26]: $P(x; m, \sigma_g) = \frac{1}{x\ln(\sigma_g)\sqrt{2\pi}} \exp\left[-\frac{\ln^2(x/m)}{2\ln^2(\sigma_g)}\right]$
  • LogNormal7(μN,σN) with mean, μN, and standard deviation, σN, both on the natural scale [27]: $P(x; \mu_N, \sigma_N) = \frac{1}{x\sqrt{2\pi\ln\left(1+\sigma_N^2/\mu_N^2\right)}} \exp\left(-\frac{\left[\ln x - \ln\frac{\mu_N}{\sqrt{1+\sigma_N^2/\mu_N^2}}\right]^2}{2\ln\left(1+\sigma_N^2/\mu_N^2\right)}\right)$

Examples for re-parameterization

Consider the situation when one would like to run a model using two different optimal design tools, for example PFIM [28] and PopED. [29] The former supports the LN2, the latter LN7 parameterization, respectively. Therefore, the re-parameterization is required, otherwise the two tools would produce different results.

All remaining re-parameterisation formulas can be found in the specification document on the project website. [30]
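As an illustration of such a re-parameterization, the sketch below converts between LN2 (mean and variance on the log scale) and LN7 (mean and standard deviation on the natural scale) using the standard log-normal moment relations. The function names `ln2_to_ln7` and `ln7_to_ln2` are hypothetical; they are not part of PFIM or PopED, and the example says nothing about how those tools perform the conversion internally.

```python
import math


def ln2_to_ln7(mu: float, v: float) -> tuple[float, float]:
    """(mean, variance) on the log scale -> (mean, sd) on the natural scale."""
    mean_n = math.exp(mu + v / 2.0)
    sd_n = mean_n * math.sqrt(math.exp(v) - 1.0)
    return mean_n, sd_n


def ln7_to_ln2(mean_n: float, sd_n: float) -> tuple[float, float]:
    """(mean, sd) on the natural scale -> (mean, variance) on the log scale."""
    v = math.log(1.0 + (sd_n / mean_n) ** 2)
    mu = math.log(mean_n) - v / 2.0
    return mu, v


# Round trip with arbitrary illustrative values.
mu, v = 0.5, 0.25
mean_n, sd_n = ln2_to_ln7(mu, v)
print(mean_n, sd_n)
print(ln7_to_ln2(mean_n, sd_n))  # recovers (mu, v) up to rounding
```

A round trip like this is a cheap sanity check before feeding the same model to two tools that expect different parameterizations.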

Multiple, Reciprocal, Power

Multiplication and division of independent, log-normal random variables

Multiplicative Central Limit Theorem

In fact, the random variables do not have to be identically distributed. It is enough for the distributions of $\ln(X_i)$ to all have finite variance and satisfy the other conditions of any of the many variants of the central limit theorem.

This is commonly known as Gibrat's law.
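A small simulation can make the multiplicative central limit theorem concrete. This is a sketch under stated assumptions: each factor is an independent "small percentage change" drawn uniformly from [0.9, 1.1] (an arbitrary choice), and skewness is used as a crude indicator of symmetry. If the theorem applies, the logarithm of the product should look roughly symmetric, while the product itself should be clearly right-skewed.

```python
import math
import random
import statistics


def skewness(xs):
    """Population skewness: third standardized moment."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


rng = random.Random(0)        # fixed seed for reproducibility
n_factors, n_samples = 200, 5000

products = []
for _ in range(n_samples):
    # The log of a product is the sum of the logs of its factors.
    log_sum = sum(math.log(rng.uniform(0.9, 1.1)) for _ in range(n_factors))
    products.append(math.exp(log_sum))

logs = [math.log(p) for p in products]
print("skewness of products:", skewness(products))
print("skewness of log-products:", skewness(logs))
```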

Other

A set of data that arises from the log-normal distribution has a symmetric Lorenz curve (see also Lorenz asymmetry coefficient). [31]

Log-normal distributions are infinitely divisible, [33] but they are not stable distributions, which can be easily drawn from. [34]

  • If $X \sim \mathcal{N}(\mu, \sigma^2)$ is a normal distribution, then $\exp(X) \sim \operatorname{Lognormal}(\mu, \sigma^2)$.
  • If $X \sim \operatorname{Lognormal}(\mu, \sigma^2)$ is distributed log-normally, then $\ln(X) \sim \mathcal{N}(\mu, \sigma^2)$ is a normal random variable. [1]
  • Let $X_j \sim \operatorname{Lognormal}(\mu_j, \sigma_j^2)$ be independent log-normally distributed variables with possibly varying σ and μ parameters, and $Y = \sum_{j=1}^{n} X_j$. The distribution of Y has no closed-form expression, but can be reasonably approximated by another log-normal distribution Z at the right tail. [35] Its probability density function in the neighborhood of 0 has been characterized [34] and it does not resemble any log-normal distribution. A commonly used approximation due to L.F. Fenton (but previously stated by R.I. Wilkinson and mathematically justified by Marlow [36]) is obtained by matching the mean and variance of another log-normal distribution: $\sigma_Z^2 = \ln\left[\frac{\sum_j e^{2\mu_j + \sigma_j^2}\left(e^{\sigma_j^2} - 1\right)}{\left(\sum_j e^{\mu_j + \sigma_j^2/2}\right)^2} + 1\right]$, $\mu_Z = \ln\left[\sum_j e^{\mu_j + \sigma_j^2/2}\right] - \frac{\sigma_Z^2}{2}$.

For a more accurate approximation, one can use the Monte Carlo method to estimate the cumulative distribution function, the pdf and the right tail. [37] [38]
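The moment-matching idea and the Monte Carlo check can be sketched together. The code below is an illustrative sketch, not a reference implementation: `fenton_wilkinson` is a hypothetical helper that matches the mean and variance of a sum of independent log-normals to a single log-normal, and the parameter values are arbitrary.

```python
import math
import random
import statistics


def fenton_wilkinson(params):
    """params: list of (mu_j, sigma_j). Returns (mu_Z, sigma_Z) of the
    log-normal whose mean and variance match those of the sum."""
    mean_sum = sum(math.exp(m + s * s / 2) for m, s in params)
    var_sum = sum(math.exp(2 * m + s * s) * (math.exp(s * s) - 1)
                  for m, s in params)
    sigma2_z = math.log(1 + var_sum / mean_sum ** 2)
    mu_z = math.log(mean_sum) - sigma2_z / 2
    return mu_z, math.sqrt(sigma2_z)


params = [(0.0, 0.5), (0.5, 0.3), (-0.2, 0.4)]   # arbitrary illustrative values
mu_z, sigma_z = fenton_wilkinson(params)

# Monte Carlo: simulate the sum directly and compare the means.
rng = random.Random(42)
sums = [sum(rng.lognormvariate(m, s) for m, s in params) for _ in range(20_000)]
print("simulated mean:", statistics.fmean(sums))
print("approximation mean:", math.exp(mu_z + sigma_z ** 2 / 2))
```

Matching two moments says nothing about the tails, which is why the simulated sample is the better guide when tail probabilities matter.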

The sum of correlated log-normally distributed random variables can also be approximated by a log-normal distribution [ citation needed ]

  • If $X \sim \operatorname{Lognormal}(\mu, \sigma^2)$ then X + c is said to have a three-parameter log-normal distribution with support $x \in (c, +\infty)$. [39] $\operatorname{E}[X + c] = \operatorname{E}[X] + c$, $\operatorname{Var}[X + c] = \operatorname{Var}[X]$.
  • The log-normal distribution is a special case of the semi-bounded Johnson's SU-distribution. [40]
  • If $X \mid Y \sim \operatorname{Rayleigh}(Y)$ with $Y \sim \operatorname{Lognormal}(\mu, \sigma^2)$, then $X \sim \operatorname{Suzuki}(\mu, \sigma)$ (Suzuki distribution).
  • A substitute for the log-normal whose integral can be expressed in terms of more elementary functions [41] can be obtained based on the logistic distribution to get an approximation for the CDF

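Estimation of parameters

For determining the maximum likelihood estimators of the log-normal distribution parameters μ and σ, we can use the same procedure as for the normal distribution. Note that

$$L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{x_i}\,\varphi_{\mu,\sigma}(\ln x_i),$$

where $\varphi_{\mu,\sigma}$ is the density function of the normal distribution $\mathcal{N}(\mu, \sigma^2)$. Since the factors $1/x_i$ do not depend on μ and σ, the likelihood is maximized by the same values as the normal likelihood of the log-transformed data: $\hat{\mu} = \frac{1}{n}\sum_k \ln x_k$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_k (\ln x_k - \hat{\mu})^2$.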

Statistics

The most efficient way to analyze log-normally distributed data consists of applying the well-known methods based on the normal distribution to logarithmically transformed data and then to back-transform results if appropriate.

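Scatter intervals

A basic example is given by scatter intervals: for the normal distribution, the interval [μ − σ, μ + σ] contains approximately two thirds (68%) of the probability (or of a large sample), and [μ − 2σ, μ + 2σ] contains 95%. Therefore, for a log-normal distribution, the interval [μ*/σ*, μ*·σ*] contains 2/3, and [μ*/(σ*)², μ*·(σ*)²] contains 95% of the probability, where μ* = exp(μ) and σ* = exp(σ). Using estimated parameters, approximately the same percentages of the data should be contained in these intervals.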
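The log-transform, analyze, back-transform workflow can be sketched as follows. This is a minimal example on simulated data, assuming true parameters μ = 1.0 and σ = 0.5 chosen only for illustration; `fit_lognormal` is a hypothetical helper that computes the maximum likelihood estimates from the logs. The interval built from the back-transformed estimates should hold roughly 68% of the data, as the one-sigma interval does for normal data.

```python
import math
import random
import statistics


def fit_lognormal(xs):
    """Maximum likelihood estimates of (mu, sigma) from positive data."""
    logs = [math.log(x) for x in xs]
    mu_hat = statistics.fmean(logs)
    sigma_hat = statistics.pstdev(logs, mu=mu_hat)  # 1/n, as in the MLE
    return mu_hat, sigma_hat


rng = random.Random(1)
data = [rng.lognormvariate(1.0, 0.5) for _ in range(10_000)]

mu_hat, sigma_hat = fit_lognormal(data)
median_hat = math.exp(mu_hat)   # back-transformed location (median)
gsd_hat = math.exp(sigma_hat)   # back-transformed multiplicative spread

# Multiplicative one-sigma scatter interval: [median / gsd, median * gsd].
lo, hi = median_hat / gsd_hat, median_hat * gsd_hat
inside = sum(lo <= x <= hi for x in data) / len(data)
print(mu_hat, sigma_hat, inside)
```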

Confidence interval for μ*

Extremal principle of entropy to fix the free parameter σ

The log-normal distribution is important in the description of natural phenomena. Many natural growth processes are driven by the accumulation of many small percentage changes, which become additive on a log scale. Under appropriate regularity conditions, the distribution of the resulting accumulated changes will be increasingly well approximated by a log-normal, as noted in the section above on "Multiplicative Central Limit Theorem". This is also known as Gibrat's law, after Robert Gibrat (1904–1980), who formulated it for companies. [46] If the rate of accumulation of these small changes does not vary over time, growth becomes independent of size. Even if that is not true, the size distributions at any age of things that grow over time tend to be log-normal.

A second justification is based on the observation that fundamental natural laws imply multiplications and divisions of positive variables. Examples are the simple gravitation law connecting masses and distance with the resulting force, or the formula for equilibrium concentrations of chemicals in a solution that connects concentrations of educts and products. Assuming log-normal distributions of the variables involved leads to consistent models in these cases.

Even if none of these justifications apply, the log-normal distribution is often a plausible and empirically adequate model. Examples include the following:

Human behaviors

  • The length of comments posted in Internet discussion forums follows a log-normal distribution. [47]
  • Users' dwell time on online articles (jokes, news etc.) follows a log-normal distribution. [48]
  • The length of chess games tends to follow a log-normal distribution. [49]
  • Onset durations of acoustic comparison stimuli that are matched to a standard stimulus follow a log-normal distribution. [18] solves, both general or by person, appear to follow a log-normal distribution. [citation needed]

In biology and medicine

  • Measures of size of living tissue (length, skin area, weight). [50]
  • For highly communicable epidemics, such as SARS in 2003, if public intervention control policies are involved, the number of hospitalized cases is shown to satisfy the log-normal distribution with no free parameters if an entropy is assumed and the standard deviation is determined by the principle of maximum rate of entropy production. [51]
  • The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth. [citation needed]
  • The normalised RNA-Seq readcount for any genomic region can be well approximated by log-normal distribution.
  • The PacBio sequencing read length follows a log-normal distribution. [52]
  • Certain physiological measurements, such as blood pressure of adult humans (after separation on male/female subpopulations). [53]
  • In neuroscience, the distribution of firing rates across a population of neurons is often approximately log-normal. This was first observed in the cortex and striatum [54] and later in the hippocampus and entorhinal cortex, [55] and elsewhere in the brain. [56][57] Intrinsic gain distributions and synaptic weight distributions also appear to be log-normal. [58]

In colloidal chemistry and polymer chemistry

Consequently, reference ranges for measurements in healthy individuals are more accurately estimated by assuming a log-normal distribution than by assuming a symmetric distribution about the mean.

9.2 Multidimensional scaling and ordination

Sometimes, data are not represented as points in a feature space. This can occur when we are provided with (dis)similarity matrices between objects such as drugs, images, trees or other complex objects, which have no obvious coordinates in $\mathbb{R}^n$.

In Chapter 5 we saw how to produce clusters from distances. Here our goal is to visualize the data in maps in low dimensional spaces (e.g., planes) reminiscent of the ones we make from the first few principal axes in PCA.

We start with an example showing what we can do with simple geographic data. In Figure 9.1 a heatmap and clustering of the approximate road distances between some of the European cities is shown.

Figure 9.1: A heatmap of the distances between some of the cities. The function has re-arranged the order of the cities, grouping the closest ones.

Given these distances between cities, multidimensional scaling (MDS) provides a ‘map’ of their relative locations. Of course, in this case the distances were originally measured as road distances (except for ferries), so we actually expect to find a two dimensional map that would represent the data well. With biological data, our maps are likely to be less clearcut. We call the function with:

We make a function that we can reuse to make the MDS screeplot from the result of a call to the cmdscale function:

Figure 9.2: Screeplot of the first 5 eigenvalues. The drop after the first two eigenvalues is very visible.

Make a barplot of all the eigenvalues output by the cmdscale function: what do you notice?

You will note that, unlike in PCA, there are some negative eigenvalues. These are due to the fact that the data do not come from a Euclidean space.

To position the points on the map we have projected them on the new coordinates created from the distances (we will discuss how the algorithm works in the next section). Note that while relative positions in Figure 9.3 are correct, the orientation of the map is unconventional: e.g., Istanbul, which is in the South-East of Europe, is at the top left.

Figure 9.3: MDS map of European cities based on their distances.

We reverse the signs of the principal coordinates and redraw the map. We also read in the cities’ true longitudes and latitudes and plot these alongside for comparison (Figure 9.4).

Figure 9.4: Left: same as Figure 9.3, but with axes flipped. Right: true latitudes and longitudes.

Which cities seem to have the worst representation on the PCoA map in the left panel of Figure 9.4?

It seems that the cities at the extreme West (Dublin, Madrid and Barcelona) have worse projections than the central cities. This is likely because the data are more sparse in these areas and it is harder for the method to ‘triangulate’ the outer cities.

We drew the longitudes and latitudes in the right panel of Figure 9.4 without much attention to aspect ratio. What is the right aspect ratio for this plot?

There is no simple relationship between the distances that correspond to 1 degree change in longitude and to 1 degree change in latitude, so the choice is difficult to make. Even under the simplifying assumption that our Earth is spherical and has a radius of 6371 km, it’s complicated: one degree in latitude always corresponds to a distance of 111 km ($6371 \times 2\pi/360$), as does one degree of longitude on the equator. However, at the latitude of Barcelona (41.4 degrees), this becomes 83 km, and at that of Sankt Petersburg (60 degrees), 56 km. Pragmatically, we could choose a value for the aspect ratio that’s somewhere in between, say, the cosine of 50 degrees. Check out the internet for information on the Haversine formula.
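The arithmetic in this answer is easy to reproduce. The sketch below assumes the same spherical Earth of radius 6371 km; the function name `km_per_degree_longitude` is ours, introduced only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Length of one degree of latitude (and of longitude at the equator).
KM_PER_DEG_LAT = EARTH_RADIUS_KM * 2 * math.pi / 360


def km_per_degree_longitude(latitude_deg: float) -> float:
    """Length of one degree of longitude at the given latitude."""
    return KM_PER_DEG_LAT * math.cos(math.radians(latitude_deg))


for place, lat in [("equator", 0.0), ("Barcelona", 41.4), ("Sankt Petersburg", 60.0)]:
    print(f"{place}: {km_per_degree_longitude(lat):.0f} km per degree of longitude")

# Compromise ratio of a longitude degree to a latitude degree at 50 degrees.
print(f"cos(50 degrees) = {math.cos(math.radians(50)):.2f}")
```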

Note: MDS creates similar output as PCA, however there is only one ‘dimension’ to the data (the sample points). There is no ‘dual’ dimension and biplots are unavailable. This is a drawback when coming to interpreting the maps. Interpretation can be facilitated by examining carefully the extreme points and their differences.

9.2.1 How does the method work?

Let’s take a look at what would happen if we really started with points whose coordinates were known. (Here we commit a slight ‘abuse’ by using the longitudes and latitudes of our cities as Cartesian coordinates and ignoring the curvature of the earth’s surface.) We put these coordinates into the two columns of a matrix with 24 rows. Now we compute the distances between points based on these coordinates. To go from the coordinates $X$ to distances, we write
$$d^2_{i,j} = (x_i^1 - x_j^1)^2 + \dots + (x_i^p - x_j^p)^2.$$
We will call the matrix of squared distances DdotD in R and $D \bullet D$ in the text. ($D^2$ would mean $D$ multiplied by itself, which is different from this.) We want to find points such that the square of their distances is as close as possible to the $D \bullet D$ observed.

The relative distances do not depend on the point of origin of the data. We center the data by using a matrix $H$: the centering matrix defined as $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^t$. Let’s check the centering property of $H$ using:

Call B0 the matrix obtained by applying the centering matrix both to the right and to the left of DdotD. Consider the points centered at the origin given by the $HX$ matrix and compute its cross product; we’ll call this B2. What do you have to do to B0 to make it equal to B2?

Therefore, given the squared distances between rows ($D \bullet D$) and the cross product of the centered matrix $B = (HX)(HX)^t$, we have shown:

$$-\frac{1}{2} H (D \bullet D) H = B \qquad (9.1)$$

This is always true, and we use it to reverse-engineer an $X$ which satisfies Equation (9.1) when we are given $D \bullet D$ to start with.

From $D \bullet D$ to $X$ using singular vectors.

We can go backwards from a matrix $D \bullet D$ to $X$ by taking the eigen-decomposition of $B$ as defined in Equation (9.1). This also enables us to choose how many coordinates, or columns, we want for the $X$ matrix. This is very similar to how PCA provides the best rank $r$ approximation.
Note: As in PCA, we can write this using the singular value decomposition of $HX$ (or the eigen decomposition of $HX(HX)^t$):

$$HX^{(r)} = U S^{(r)} V^t \quad \text{with } S^{(r)} \text{ the diagonal matrix of the first } r \text{ singular values,}$$

$$S^{(r)} = \begin{pmatrix} s_1 & 0 & 0 & 0 & \dots \\ 0 & s_2 & 0 & 0 & \dots \\ 0 & 0 & \dots & \dots & \dots \\ 0 & 0 & \dots & s_r & \dots \\ \dots & \dots & \dots & 0 & 0 \end{pmatrix}$$

This provides the best approximate representation in a Euclidean space of dimension $r$. The method is often called Principal Coordinates Analysis, or PCoA, which stresses the connection to PCA. The algorithm gives us the coordinates of points that have approximately the same distances as those provided by the $D$ matrix.

Classical MDS Algorithm.

In summary, given an $n \times n$ matrix of squared interpoint distances $D \bullet D$, we can find points and their coordinates $\tilde{X}$ by the following operations:

  1. Double center the interpoint distance squared and multiply it by $-\frac{1}{2}$: $B = -\frac{1}{2} H (D \bullet D) H$.
  2. Diagonalize $B$: $\quad B = U \Lambda U^t$.
  3. Extract $\tilde{X}$: $\quad \tilde{X} = U \Lambda^{1/2}$.
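The three operations can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the implementation behind the cmdscale function; `classical_mds` and the five test points are made up for the example. Because the points really are Euclidean, the recovered coordinates should reproduce the squared distances exactly (up to rotation, reflection and rounding).

```python
import numpy as np


def classical_mds(sq_dist, k=2):
    """Classical MDS from a matrix of squared distances.
    Returns (coordinates, all eigenvalues in decreasing order)."""
    n = sq_dist.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * H @ sq_dist @ H                  # double centering
    eigvals, eigvecs = np.linalg.eigh(B)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    lam = np.clip(eigvals[order], 0.0, None)    # guard against tiny negatives
    coords = eigvecs[:, order] * np.sqrt(lam)   # U Lambda^{1/2}
    return coords, eigvals[::-1]


# Hypothetical planar points; build their squared distance matrix.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0], [1.0, 1.0]])
diff = X[:, None, :] - X[None, :, :]
DdotD = (diff ** 2).sum(axis=-1)

X_tilde, eigvals = classical_mds(DdotD, k=2)
diff_new = X_tilde[:, None, :] - X_tilde[None, :, :]
print(np.allclose((diff_new ** 2).sum(axis=-1), DdotD))
print(np.round(eigvals, 6))
```

Plotting all returned eigenvalues gives the screeplot; for non-Euclidean input some of them come out negative, which is why the sketch clips before taking square roots.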

Finding the right underlying dimensionality.

As an example, let’s take objects for which we have similarities (surrogates for distances) but for which there is no natural underlying Euclidean space. In a psychology experiment from the 1950s, Ekman (1954) asked 31 subjects to rank the similarities of 14 different colors. His goal was to understand the underlying dimensionality of color perception. The similarity or confusion matrix was scaled to have values between 0 and 1. The colors that were often confused had similarities close to 1. We transform the data into a dissimilarity by subtracting the values from 1:

We compute the MDS coordinates and eigenvalues. We combine the eigenvalues in the screeplot shown in Figure 9.5:

Figure 9.5: The screeplot shows us that the phenomenon is two dimensional, giving a clean answer to Ekman’s question.

We plot the different colors using the first two principal coordinates as follows:

Figure 9.6: The layout of the scatterpoints in the first two dimensions has a horseshoe shape. The labels and colors show that the arch corresponds to the wavelengths.

Figure 9.6 shows the Ekman data in the new coordinates. There is a striking pattern that calls for explanation. This horseshoe or arch structure in the points is often an indicator of a sequential latent ordering or gradient in the data (Diaconis, Goel, and Holmes 2007) . We will revisit this in Section 9.5.

9.2.2 Robust versions of MDS

Robustness: A method is robust if it is not too influenced by a few outliers. For example, the median of a set of $n$ numbers does not change by a lot even if we change 20% of the numbers by arbitrarily large amounts; to drastically shift the median, we need to change more than half of the numbers. In contrast, we can change the mean by a large amount by just manipulating one of the numbers. We say that the breakdown point of the median is 1/2, while that of the mean is only $1/n$. Both mean and median are estimators of the location of a distribution (i.e., what is a “typical” value of the numbers), but the median is more robust. The median is based on the ranks; more generally, methods based on ranks are often more robust than those based on the actual values. Many nonparametric tests are based on reductions of data to their ranks.

Multidimensional scaling aims to minimize the difference between the squared distances as given by $D \bullet D$ and the squared distances between the points with their new coordinates. Unfortunately, this objective tends to be sensitive to outliers: one single data point with large distances to everyone else can dominate, and thus skew, the whole analysis. Often, we like to use something that is more robust, and one way to achieve this is to disregard the actual values of the distances and only ask that the relative rankings of the original and the new distances are as similar as possible. Such a rank based approach is robust: its sensitivity to outliers is reduced.
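The difference in breakdown points can be seen on a toy example. The numbers below are arbitrary: ten values, of which we corrupt one, then two (20%), then six (more than half).

```python
import statistics

values = list(range(1, 11))   # 1, 2, ..., 10
outlier = 10 ** 6

one_bad = values[:9] + [outlier]        # one corrupted value
two_bad = values[:8] + [outlier] * 2    # 20% corrupted
six_bad = values[:4] + [outlier] * 6    # more than half corrupted

for name, xs in [("clean", values), ("1 bad", one_bad),
                 ("2 bad", two_bad), ("6 bad", six_bad)]:
    print(name, "mean:", statistics.fmean(xs), "median:", statistics.median(xs))
```

A single corrupted value is enough to move the mean by orders of magnitude, while the median only gives way once more than half of the values are corrupted.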

We will use the Ekman data to show how useful robust methods are when we are not quite sure about the ‘scale’ of our measurements. Robust ordination, called non metric multidimensional scaling (NMDS for short) only attempts to embed the points in a new space such that the order of the reconstructed distances in the new map is the same as the ordering of the original distance matrix.

Non metric MDS looks for a transformation $f$ of the given dissimilarities in the matrix $d$ and a set of coordinates in a low dimensional space (the map) such that the distance in this new map is $\tilde{d}$ and $f(d) \thickapprox \tilde{d}$. The quality of the approximation can be measured by the standardized residual sum of squares (stress) function:

$$\text{stress}^2 = \frac{\sum (f(d) - \tilde{d})^2}{\sum d^2}.$$

NMDS is not sequential in the sense that we have to specify the underlying dimensionality at the outset and the optimization is run to maximize the reconstruction of the distances according to that number. There is no notion of percentage of variation explained by individual axes as provided in PCA. However, we can make a simili-screeplot by running the program for all the successive values of $k$ ($k = 1, 2, 3, \ldots$) and looking at how well the stress drops. Here is an example of looking at these successive approximations and their goodness of fit. As in the case of diagnostics for clustering, we will take the number of axes after the stress has a steep drop.

Because each calculation of a NMDS result requires a new optimization that is both random and dependent on the $k$ value, we use a similar procedure to what we did for clustering in Chapter 4. We execute the metaMDS function, say, 100 times for each of the four possible values of $k$ and record the stress values.

Let’s look at the boxplots of the results. This can be a useful diagnostic plot for choosing $k$ (Figure 9.7).

Figure 9.7: Several replicates at each dimension were run to evaluate the stability of the stress . We see that the stress drops dramatically with two or more dimensions, thus indicating that a two dimensional solution is appropriate here.

We can also compare the distances and their approximations using what is known as a Shepard plot, for $k = 2$ for instance, computed with:

Figure 9.8: The Shepard’s plot compares the original distances or dissimilarities (along the horizontal axis) to the reconstructed distances, in this case for $k = 2$ (vertical axis).

Both the Shepard’s plot in Figure 9.8 and the screeplot in Figure 9.7 point to a two-dimensional solution for Ekman’s color confusion study.

Let’s compare the output of the two different MDS programs, the classical metric least squares approximation and the nonmetric rank approximation method. The right panel of Figure 9.9 shows the result from the nonmetric rank approximation, the left panel is the same as Figure 9.6. The projections are almost identical in both cases. For these data, it makes little difference whether we use a Euclidean or nonmetric multidimensional scaling method.

Figure 9.9: Comparison of the output from the classical multidimensional scaling on the left (same as Figure 9.6) and the nonmetric version on the right.


In this article, we propose scDesign2, a transparent simulator for single-cell gene expression count data. Our development of scDesign2 is motivated by the pressing challenge to generate realistic synthetic data for various scRNA-seq protocols and other single-cell gene expression count-based technologies. Unlike existing simulators including our previous simulator scDesign, scDesign2 achieves six properties: protocol adaptiveness, gene preservation, gene correlation capture, flexible cell number and sequencing depth choices, transparency, and computational and sample efficiency. This achievement of scDesign2 is enabled by its unique use of the copula statistical framework, which combines marginal distributions of individual genes and the global correlation structure among genes. As a result, scDesign2 has the following methodological advantages that contribute to its high degree of transparency. First, it selects a marginal distribution from four options (Poisson, ZIP, NB, and ZINB) for each gene in a data-driven manner to best capture and summarize the expression characteristics of that gene. Second, it uses a Gaussian copula to estimate gene correlations, which will be used to generate synthetic single-cell gene expression counts that preserve the correlation structures. Third, it can generate gene expression counts according to user-specified sequencing depth and cell number.
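To illustrate the copula idea in general terms, the sketch below draws correlated counts for two hypothetical genes: correlated standard normals are mapped to uniforms through the normal CDF, and each uniform is then pushed through the quantile function of that gene's marginal. It is not the scDesign2 implementation. It assumes Poisson marginals (one of the four options named above) with made-up means, a made-up copula correlation of 0.8, and helper names of our own.

```python
import math
import random
from statistics import NormalDist, fmean


def poisson_quantile(u: float, lam: float) -> int:
    """Smallest k such that P(Poisson(lam) <= k) >= u."""
    k, pmf = 0, math.exp(-lam)
    cdf = pmf
    while cdf < u and pmf > 0.0:   # pmf > 0 guards against u extremely near 1
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def sample_two_genes(n_cells, rho, lam1, lam2, rng):
    """Gaussian copula with correlation rho and Poisson marginals."""
    std_normal = NormalDist()
    counts = []
    for _ in range(n_cells):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)   # uniform marginals
        counts.append((poisson_quantile(u1, lam1), poisson_quantile(u2, lam2)))
    return counts


def pearson(xs, ys):
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))


rng = random.Random(7)
counts = sample_two_genes(5000, rho=0.8, lam1=5.0, lam2=20.0, rng=rng)
gene1 = [c[0] for c in counts]
gene2 = [c[1] for c in counts]
print(fmean(gene1), fmean(gene2), pearson(gene1, gene2))
```

The marginals and the dependence are specified separately, which is what lets a copula-based simulator change one (for example, sequencing depth through the marginal means) without disturbing the other.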

We have performed a comprehensive set of benchmarking and real data studies to evaluate scDesign2 in terms of its accuracy in generating synthetic data and its efficacy in guiding experimental design and benchmarking computational methods. Based on four scRNA-seq protocols and 12 cell types, our benchmarking results demonstrate that scDesign2 better captures gene expression characteristics in real data than eight existing scRNA-seq simulators do. In particular, among the four simulators that aim to preserve gene correlations, scDesign2 achieves the best accuracy. Moreover, we demonstrate the capacity of scDesign2 in generating synthetic data of other single-cell count-based technologies including MERFISH and pciSeq, two single-cell spatial transcriptomics technologies. After validating the realistic nature of synthetic data generated by scDesign2, we use real data applications to demonstrate how scDesign2 can guide the selection of cell number and sequencing depth in experimental design, as well as how scDesign2 can benchmark computational methods for cell clustering and rare cell type identification.

In the last stage of manuscript finalization, we found another scRNA-seq simulator, SPsimSeq [79] (published in Bioinformatics as a 2.3-page software article), which can capture gene correlations. However, unlike scDesign2, SPsimSeq cannot generate scRNA-seq data with varying sequencing depths. To compare scDesign2 with SPsimSeq, we have benchmarked their synthetic data against the corresponding real data in two sets of analyses: (1) gene correlation matrices of the previously used 12 cell type–protocol combinations (3 cell types × 4 scRNA-seq protocols) and (2) 2D visualization plots of the 4 multi-cell type scRNA-seq datasets and one MERFISH dataset. The results are summarized in Additional file 2. We find that in most cases (10 out of 12 cases in the first set of analyses and 5 out of 5 cases in the second set), the synthetic data of scDesign2 better resemble the real data than the synthetic data of SPsimSeq do.

Since scRNA-seq data typically contain tens of thousands of genes, the estimation of the copula gene correlation matrix is a high dimensional problem. This problem can be partially avoided by only estimating the copula correlation matrix of thousands of moderately to highly expressed genes. We use a simulation study to demonstrate why this approach is reasonable (Additional file 1: Figures S42 and S43), and a more detailed discussion is in the “Methods” section. To summarize, the simulation results suggest that, to reach an average estimation accuracy of ±0.3 of true correlation values among the top 1000 highly expressed genes, at least 20 cells are needed. To reach an accuracy level of ±0.2 for the top 1500 highly expressed genes, at least 50 cells are needed. With 100 cells, an accuracy level of ±0.1 can be reached for the top 200 highly expressed genes, and a slightly worse accuracy level can be reached for the top 2000 genes.

In the implementation of the scDesign2 R package, we control the number of genes for which copula correlations need to be estimated by filtering out the genes whose zero proportions exceed a user-specified cutoff. For all the results in this paper, the cutoff is set as 0.8. In Additional file 1: Table S1, we summarize the number of cells (n), i.e., the sample size, and the number of genes included for copula correlation estimation (p) in each of the 12 datasets used for benchmarking simulators. Based on Additional file 1: Figures S42 and S43, we see that p appears to be too large for the CEL-Seq2, Fluidigm C1, and Smart-Seq2 datasets. This suggests that the results in this paper may be further improved by setting a more stringent cutoff for gene selection.
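The gene filter described above amounts to a one-line computation. The following is a minimal sketch, assuming a genes × cells count matrix; the function name `select_genes_for_copula` and the toy matrix are hypothetical and are not taken from the scDesign2 R package.

```python
import numpy as np


def select_genes_for_copula(counts, zero_prop_cutoff=0.8):
    """Return row indices of genes whose zero proportion does not exceed the cutoff.

    counts: array of shape (genes, cells).
    """
    counts = np.asarray(counts)
    zero_prop = (counts == 0).mean(axis=1)  # fraction of cells with a zero count
    return np.flatnonzero(zero_prop <= zero_prop_cutoff)


# Hypothetical toy data: 3 genes measured in 5 cells.
toy = np.array([
    [0, 0, 0, 2, 3],  # 60% zeros: kept
    [0, 0, 0, 0, 0],  # 100% zeros: filtered out
    [1, 2, 0, 4, 5],  # 20% zeros: kept
])
kept = select_genes_for_copula(toy)
```

Lowering `zero_prop_cutoff` makes the filter more stringent, which reduces p, the number of genes entering the copula correlation estimation.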

For future methodological improvement, there are other ways to address this high-dimensional estimation problem. For example, we can consider implementing sparse estimation (e.g., [97]) for the copula correlation matrix. Moreover, we can build a hierarchical model to borrow information across cell types/clusters. This will be useful for improving the model fitting for small cell types/clusters that may share similar gene correlation structures.

The current implementation of scDesign2 is restricted to single-cell datasets composed of discrete cell types, because the generative model of scDesign2 assumes that cells of the same type follow the same distribution of gene expression. However, many single-cell datasets exhibit continuous cell trajectories instead of discrete cell types. A nice property of the probabilistic model used in scDesign2 is that it is generalizable to account for continuous cell trajectories. First, we can use the generalized additive model (GAM) [52, 98, 99] to model each gene’s marginal distribution of expression as a function of cell pseudotime, which can be computationally inferred from real data [53, 54, 56]. Second, the copula framework can be used to incorporate gene correlation structures along the cell pseudotime. Combining these two steps into a generative model, this extension of scDesign2 has the potential to overcome the current challenge in preserving gene correlations encountered by existing simulators for single-cell trajectory data, such as Splatter Path [69], dyngen [77], and PROSSTT [68]. Another note is that scDesign2 does not generate synthetic cells based on outlier cells that do not cluster well with any cells in well-formed clusters. This is not necessarily a disadvantage, nor is it a feature unique to scDesign2. In fact, all model-based simulators that learn a generative model from real data must ignore certain outlier cells that do not fit their model well. Some outlier cells could represent an extremely rare cell type or could just be “doublets” [100–103], artifacts resulting from single-cell sequencing experiments. Hence, our stance is that ignoring outlier cells is a sacrifice that every simulator has to make; the open question is the degree to which outlier cells should be ignored, and proper answers to this question must resort to statistical model selection principles.

Regarding the use of scDesign2 to guide the design of scRNA-seq experiments, although scDesign2 can model and simulate data from various scRNA-seq protocols and other single-cell expression count-based technologies, the current scDesign2 implementation is not yet applicable to cross-protocol data generation (i.e., training scDesign2 on real data of one protocol and generating synthetic data for another protocol) because of complicated differences in data characteristics among protocols. To demonstrate this issue, we use a multi-protocol dataset of peripheral blood mononuclear cells (PBMCs) generated for benchmarking purposes [20]. We select data of five cell types measured by three protocols, 10x Genomics, Drop-Seq, and Smart-Seq2, and we train scDesign2 on the 10x Genomics data. Then, we adjust the fitted scDesign2 model for the Drop-Seq and Smart-Seq2 protocols by rescaling the mean parameters in the fitted model to account for the total sequencing depth and cell number, which are protocol-specific (see the “Methods” for details). After the adjustment, we use the model for each protocol to generate synthetic data. Additional file 1: Figure S44 illustrates the comparison of real data and synthetic data for each protocol. From the comparison, we observe that the synthetic cells do not mix well with the real cells for the two cross-protocol scenarios; only for 10x Genomics, the same-protocol scenario, do the synthetic cells mix well with the real cells.

To further illustrate the different data characteristics of different protocols, we compare individual genes’ mean expression levels in the aforementioned three protocols. We refer to Drop-Seq and Smart-Seq2 as the target protocols, and 10x Genomics as the reference protocol. First, we randomly partition the two target-protocol datasets and the reference-protocol dataset into two halves each; we repeat the partitions 100 times and collect 100 sets of partial datasets, with each set containing two target-protocol partial datasets (one Drop-Seq and one Smart-Seq2) and two reference-protocol partial datasets (split from the 10x Genomics dataset), one of which is randomly picked and referred to as the “reference data.” Second, for every gene in each cell type, we take each set of partial datasets and compute two cross-protocol ratios, defined as the gene’s mean expression levels in the target-protocol partial datasets divided by its mean expression level in the reference data, and a within-protocol ratio, defined as the gene’s mean expression level in the other reference-protocol partial dataset divided by that in the reference data. Together, with the 100 sets of partial datasets, every gene in each cell type has 100 ratios for each of the two cross-protocol comparisons and 100 ratios for the within-protocol comparison. We apply this procedure to the top 50 and 2000 highly expressed genes in five cell types. Additional file 1: Figures S45 and S46 show that, with the within-protocol ratios as a baseline control for each cell type and each target protocol, the cross-protocol ratios exhibit a strongly gene-specific pattern; moreover, there is no monotone relationship between the cross-protocol ratios and the mean expression levels of genes. This result confirms that there does not exist a single scaling factor to convert all genes’ expression levels from one protocol to another.
However, an interesting phenomenon is that, for each target protocol, the cross-protocol ratios have similar patterns across cell types. This phenomenon sheds light on a future research direction of cross-protocol simulation for the cell types that exist in only one protocol, if the two protocols have shared cell types. In this scenario, we may train a model for each cell type in each protocol, learn a gene-specific but cell type-invariant scaling factor from the shared cell types, and simulate data for the cell types missing in one protocol.
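The ratio procedure above can be sketched as follows for one target protocol and one cell type. This is an illustrative reading of the text, not the authors' code: the function name `mean_ratio_sets`, the simulated Poisson data, and the gene-specific factors are hypothetical, and the sketch assumes every gene has a nonzero mean in the reference data (as would hold for highly expressed genes).

```python
import numpy as np


def mean_ratio_sets(target, reference, n_splits=100, seed=0):
    """Cross- and within-protocol ratios of per-gene mean expression.

    target, reference: count matrices of shape (genes, cells).
    Returns two arrays of shape (n_splits, genes).
    """
    rng = np.random.default_rng(seed)
    cross, within = [], []
    for _ in range(n_splits):
        # Random half of the target-protocol cells.
        t_half = rng.permutation(target.shape[1])[: target.shape[1] // 2]
        # Random split of the reference-protocol cells into two halves.
        r_perm = rng.permutation(reference.shape[1])
        half = reference.shape[1] // 2
        ref_data = reference[:, r_perm[:half]].mean(axis=1)   # the "reference data"
        ref_other = reference[:, r_perm[half:]].mean(axis=1)  # the other half
        tgt = target[:, t_half].mean(axis=1)
        cross.append(tgt / ref_data)         # cross-protocol ratio
        within.append(ref_other / ref_data)  # within-protocol baseline
    return np.array(cross), np.array(within)


# Hypothetical data: 3 genes, 400 cells per protocol, gene-specific protocol effects.
rng = np.random.default_rng(1)
ref_means = np.array([10.0, 20.0, 40.0])
factors = np.array([0.5, 2.0, 1.0])
reference = rng.poisson(ref_means[:, None], size=(3, 400))
target = rng.poisson((ref_means * factors)[:, None], size=(3, 400))
cross, within = mean_ratio_sets(target, reference)
```

In this toy setup the within-protocol ratios concentrate around 1, while the cross-protocol ratios concentrate around a different value for each gene, which is the pattern that rules out a single scaling factor.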

We note that the above analysis is only conducted for the genes’ mean expression levels. The difficulty of cross-protocol simulation is in fact even larger because realistic simulation requires the rescaling of the other distributional parameter(s) in a two-parameter distribution such as NB and ZIP or a three-parameter distribution such as ZINB. Existing work has provided extensive empirical evidence on the vast differences between protocols in terms of data characteristics [42, 86].

In applications 2 and 3, we have demonstrated how to use scDesign2 to guide experimental design and benchmark computational methods for the tasks of cell clustering and rare cell type detection. Note that in these analyses, the optimized sequencing depths and cell numbers are only applicable to the same experimental protocols and biological samples. Yet, this limitation does not disqualify scDesign2 as a useful tool to guide experimental design. For example, researchers usually perform a coarse-grained, low-budget experiment to obtain a preliminary dataset, and then they may use scDesign2 to guide the optimal design of the later, more refined experiment. As another example, if scRNA-seq data need to be collected from many individuals, researchers usually first perform a pilot study on a small number of individuals. Then, they may train scDesign2 using the pilot data to guide the design of the subsequent, large-scale experiments. In addition to guiding the experimental design, scDesign2 is useful as a general benchmarking tool for various experimental protocols and computational methods. For example, the analyses we performed in applications 2 and 3 are easily generalizable to other computational methods for a more comprehensive benchmarking.

Although we only use cell clustering and rare cell type detection to demonstrate scDesign2’s use in guiding experimental design and benchmarking computational methods, we want to emphasize that scDesign2 has broad applications beyond these two tasks. Inheriting the flexible and transparent modeling nature of our previous simulator scDesign, scDesign2 can also benchmark other computational analyses we have demonstrated in our scDesign paper [35], including differential gene expression analysis and cell dimensionality reduction. Moreover, beyond its role as a simulator, scDesign2 may benefit single-cell gene expression data analysis by providing its estimated parameters about gene expression and gene correlations. Here, we discuss three potential directions. First, scDesign2 can assist differential gene expression analysis. Its estimated marginal distributions of individual genes in different cell types can be used to investigate more general patterns of differential expression (such as different variances and different zero proportions), in addition to comparing gene expression means between two groups of cells [104]. Second, its estimated gene correlation structures can be used to construct cell type-specific gene networks [105] and incorporated into gene set enrichment analysis to enhance statistical power [106, 107]. Third, scDesign2 has the potential to improve the alignment of cells from multiple single-cell datasets [108]. Its estimated gene expression parameters can guide the calculation of cell type or cluster similarities between batches, and its estimated gene correlation structures can be used to align cell types or clusters across batches based on the similarity in gene correlation structures. [109].


Consider system (23) in conjunction with the normality assumptions (25) and (26), and regard the vector Λyi as “data.” The model for the entire data vector can be written as in (35), where u comprises additive genetic effects for all individuals and all traits (u may include additive genetic effects of individuals without records), and Z is an incidence matrix of appropriate order. If all individuals have records for all traits, Z is an identity matrix of order NK × NK; otherwise, columns of 0's for effects of individuals without phenotypic measurements would be included in Z. In view of the normality assumptions (25) and (26), one can write u | G0 ~ N(0, A ⊗ G0) and e | R0 ~ N(0, I ⊗ R0), where A is a matrix of additive genetic relationships (or of twice the coefficients of coancestry) between individuals in a genealogy, and ⊗ indicates the Kronecker product. Note that I ⊗ R0 reflects the assumption that all individuals with records possess phenotypic values for each of the K traits. This is not a requirement, but it simplifies somewhat the treatment that follows.

Given u, the vectors Λyi are mutually independent (since all ei vectors are independent of each other), so the joint density of all Λyi is given by (36), where Zi is an incidence matrix that “picks up” the K breeding values of individual i (ui) and relates these to its phenotypic records yi. Making a change of variables from Λyi to yi (i = 1, 2, … , N), the determinant of the Jacobian of the transformation is |Λ|. Hence, the density of the yi is given by (37). This is the density of the product of the N normal distributions, highlighting that the data generation process can be represented in terms of the reduced model (24), with the only novelty here being the presence of the incidence matrix Zi, with the latter being a K × K identity matrix in (24). Hence, the entire data vector can be modeled as in (38), where XΛ is a matrix with NK rows (again, assuming that each of the N individuals has measurements for the K traits), and ZΛ has order NK × (N + P)K, where P is the number of individuals in the genealogy lacking phenotypic records (the corresponding columns of ZΛ being null). Observe that (38) is in the form of a standard multiple-trait mixed-effects linear model, save for the fact that the incidence matrices depend on the unknown structural coefficients contained in Λ. Hence, (39) holds, where RΛ is a block-diagonal matrix consisting of N blocks of order K × K, and all such blocks are equal to Λ⁻¹R0Λ′⁻¹. It follows that y | Λ, β, u, R0 ~ N(XΛβ + ZΛu, RΛ). Hence, if simultaneity or recursiveness holds, the estimator of the residual variance-covariance matrix from a reduced model analysis is actually estimating Λ⁻¹R0Λ′⁻¹; this has a bearing on the interpretation of the parameter estimates.

Since it is assumed that u | G0 ~ N(0, A ⊗ G0), the likelihood function is given by (40). This likelihood has the same form as that for a standard multivariate mixed-effects model, except that, here, additional parameters (the nonnull elements of Λ) appear in both the location and dispersion structures of the reduced model (38). A pertinent issue, then, is whether or not all parameters in the model, that is, Λ, β, R0, and G0, can be identified (i.e., estimated uniquely) from the likelihood. This is discussed in the following section.
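The statement that a reduced-model analysis estimates Λ⁻¹R0Λ′⁻¹ rather than R0 can be checked numerically. The sketch below is only an illustration under assumed values: the matrices `Lam` and `R0`, the choice K = 2 and N = 3, and the Monte Carlo sample size are hypothetical and do not come from the source.

```python
import numpy as np

K, N = 2, 3  # traits per individual, individuals with records

# Hypothetical structural matrix (trait 1 feeds into trait 2) and residual covariance.
Lam = np.array([[1.0, 0.0],
                [-0.5, 1.0]])
R0 = np.array([[1.0, 0.2],
               [0.2, 2.0]])

# One K x K block of the reduced-model residual covariance.
Lam_inv = np.linalg.inv(Lam)
block = Lam_inv @ R0 @ Lam_inv.T

# Block-diagonal matrix with N identical blocks, built as a Kronecker product.
R_Lam = np.kron(np.eye(N), block)

# Monte Carlo check: if e_i ~ N(0, R0), then Lam^{-1} e_i has covariance `block`.
rng = np.random.default_rng(0)
e = rng.multivariate_normal(np.zeros(K), R0, size=200_000)
reduced_resid = e @ Lam_inv.T          # each row is Lam^{-1} e_i
empirical = np.cov(reduced_resid, rowvar=False)
```

Comparing `empirical` with `block` and with `R0` shows that the reduced-model residual covariance matches the former and differs from the latter whenever Λ is not an identity matrix.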

  • Incomplete Dominance: The hybrid phenotype is a mixture of the expression of both alleles, resulting in a third intermediate phenotype. (Example: Red flower (RR) × White flower (rr) = Pink flower (Rr).)
  • Co-dominance: The hybrid phenotype is a combination of the expressed alleles, resulting in a third phenotype that includes both phenotypes. (Example: Red flower (RR) × White flower (rr) = Red and white flower (Rr).)
  • Incomplete Dominance: The phenotype may be expressed to varying degrees in the hybrid. (Example: A pink flower may have lighter or darker coloration depending on the quantitative expression of one allele versus the other.)
  • Co-dominance: Both phenotypes are fully expressed in the hybrid genotype.

The whole thing got started in about 2009, when Pozhitkov was a postdoctoral researcher at the Max Planck Institute for Evolutionary Biology in Germany. It was there that he got a chance to pursue a project he’d been thinking about for more than a decade.

Pozhitkov acquired about 30 zebrafish from the institute’s colony. (These tropical fish are commonly used in research because, among other things, they have transparent embryos, ideal for observing development.) He killed the animals by shocking them with a quick immersion in a cooler of ice water, then put them back in their regular 82-degree Fahrenheit tank.

Over the course of the next four days, he periodically scooped a few fish out of the tank, froze them in liquid nitrogen, and then analyzed their messenger RNA. These are threadlike molecules that do the work of translating DNA into proteins; each strand of messenger RNA is a transcript of some section of DNA. Later Pozhitkov and his colleagues repeated the same process with mice, although their death was meted out by broken neck rather than cold shock.

When Pozhitkov’s colleague Peter Noble, then a biochemist at the University of Washington, dug into the data on how active the messenger RNA was on each day after death, something amazed him. In both the fish and the mice, the translation of genes into proteins generally declined after death, as would be expected. But the count of messenger RNA indicated that about 1 percent of genes actually increased in transcription after death. Some were chugging along four days after life ceased.

It wasn’t that the researchers had expected a total cessation of activity the moment the zebrafish and mice shuffled off this mortal coil. But to detect increases in transcription rather than just the blinking off of the lights one by one? That was “the most bizarre thing I’ve ever seen,” Noble says.

Not everyone was impressed. Noble and Pozhitkov heard a lot of criticism after the story made the rounds, first on the preprint site bioRxiv in 2016 and then in a paper in Open Biology in 2017. The main critique was that they might have misinterpreted a statistical blip. Because cells die off at different rates, perhaps the transcripts recorded in still-living cells merely made up a greater proportion of all the total transcripts, says Peter Ellis, a lecturer in molecular biology at the University of Kent. Think of the transcripts as socks in a drawer, he says. If you lost some of the red ones, the remaining white socks would make up a larger percentage of your total sock collection, but you wouldn’t have acquired more of them.
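Ellis's sock analogy is a point about proportions versus absolute counts, and a few lines of arithmetic make it concrete. The sock counts below are hypothetical and chosen only for illustration.

```python
def proportions(counts):
    """Convert absolute counts into shares of the total."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


# Hypothetical drawer before and after some red socks are lost.
before = {"red": 60, "white": 40}
after = {"red": 20, "white": 40}

share_before = proportions(before)["white"]  # 40 out of 100 socks
share_after = proportions(after)["white"]    # 40 out of 60 socks
```

The number of white socks is the same in both drawers, yet their share of the drawer rises, which is how a relative measure of transcripts could suggest an increase without any new transcription.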

Since that original publication, though, there are hints that something more is going on in the cells that are still churning after the organism dies. In a study published in February in Nature Communications, other researchers examined human tissue samples and found hundreds of genes that alter their expression after death. Some genes declined in activity, but others increased. A gene that promotes growth, EGR3, began ramping up its expression four hours after death. Some fluctuated back and forth, like the gene CXCL2, which codes for a signaling protein that calls white blood cells to the site of inflammation or infection.

These changes weren’t merely the passive result of transcripts degrading at different rates like red socks being sporadically lost, says the University of Porto’s Pedro Ferreira, who led the study. Something, he says, was going on that actively regulated gene expression “even after the death of the organism.”

Surprising behavior of transcription factors challenges theories of gene regulation

Transforming progenitor cells into committed T-cell precursors in real time. Inset: Live imaging of a clone of future T cells, from progenitor stage (left) to commitment (right) in 3 days (courtesy, Mary A. Yui). Background: field of cells corresponding to a mixture of these stages, all processed to show individual molecules of RNA encoding key regulatory proteins. Runx1 (cyan dots) is expressed at similar levels in cells at early, middle, and late stages alike (courtesy, Wen Zhou). Credit: B. Shin

How cells develop and the diseases that arise when development goes wrong have been a decades-long research focus in the laboratory of Distinguished Professor of Biology Ellen Rothenberg. In particular, the lab studies the development of immune cells known as T cells, which act as "intelligence agents"—they circulate throughout the body, detect threats, and determine what kind of response the immune system should make. However, when the many stages of T cell development do not occur perfectly, leukemia occurs.

"Many of the genes that we study in normal developing T cells are the same genes that, when regulated incorrectly, lead to the cells becoming T-cell leukemia," says Rothenberg. "Understanding the precision of this process is really important. There's also an interesting aspect of irreversibility: Some of the genes we study only have activity at a specific time period in development, and then they turn off forever. But in leukemia, we see that these genes 'leak' back on again at a later stage when they are supposed to be off. We want to know the details of the process for turning genes on and keeping genes off, which will help us understand what goes wrong in leukemia."

Now, a new study from the Rothenberg lab examines certain proteins that supervise gene regulation in developing T cells and finds that these proteins behave in a manner quite different from that assumed in previous theory. The work suggests that theories of gene regulation may need to be reevaluated.

A paper describing the research appears in the journal Proceedings of the National Academy of Sciences on January 21, 2021. The study's first authors are Caltech postdoctoral scholar Boyoung Shin and former Caltech postdoctoral scholar Hiroyuki Hosokawa, now a faculty member at Tokai University in Japan.

Nearly every cell in the human body contains the same copy of the genome, but differences in the expression of particular genes give rise to different cell types, like muscles, neurons, and immune system cells. Gene expression can be thought of like a light bulb with a dimmer switch. Similar to how a light bulb on a dimmer switch can be turned on brightly, or dimly, or not at all, a gene can be expressed strongly, weakly, or be silenced. The "hands" that adjust these genomic dimmer switches are proteins called transcription factors, which bind to the genome to dial expression up or down.

There are many different kinds of transcription factors, with each acting upon defined sets of genes, sometimes with multiple transcription factors working together to regulate gene expression. The Rothenberg laboratory focused on two very similar transcription factors, Runx1 and Runx3, to find if they play a role during the cascade of sharp changes in gene expression that cause stem cell–like progenitors to become transformed into future T cells.

"The Runx transcription factors have traditionally been underappreciated in these early T cells—they are present in the cell at constant, steady levels throughout development, so scientists have reasoned that they must be unimportant in regulating genes that need to change in expression dramatically over time," says Rothenberg.

In previous studies, other researchers experimentally knocked out one of the Runx factors and subsequently found that little changed in the cell, leading to the conclusion that Runx was not very important. But in this new study, Rothenberg's team found that the two Runx transcription factors cover for each other, so that effects only show up when they are both removed—and those results now show that these transcription factors behave in very unexpected ways.

The conventional genetics theory is that when a factor regulates a target gene, the activity of the factor is correlated with the level of the target gene. But Rothenberg's study found that this was not the case for Runx factors. Although the Runx factors themselves stay active at steady levels through key developmental events, the great majority of genes that respond to the Runx factors change dramatically in expression during this period. In fact, the Runx factors act upon "incredibly important" genes for T cell development, according to Rothenberg, and regulate them strongly.

The findings open up new questions, such as how can the Runx factors cause these dramatic changes in gene expression when levels of Runx themselves do not change?

The team also found that the positions where the Runx factors bind to the genome change markedly over time, bringing Runx to different target DNA sites. At any one time, the study found, the factors are only acting on a fraction of the genes they could regulate; they shift their "attention" from one set to another over time. Interestingly, in many of these shifts, large groups of Runx proteins leave their initial sites and travel to occupy clusters of new sites grouped across large distances of the genome, as they act on different genes at different times.

"There's no good explanation yet for this group behavior, and we find that Runx are interacting with the physical genomic architecture in a complex way, as they're regulating genes that have totally different expression patterns than the transcription factors themselves," says Shin. "What is controlling the deployment of the transcription factors? We still don't know, and it's far more interesting than what we thought."

"This work has big implications for researchers trying to model gene networks and shows that transcription factors are more versatile in their actions than people have assumed," Rothenberg says.

The paper is titled "Runx1 and Runx3 drive progenitor to T-lineage transcriptome conversion in mouse T cell commitment via dynamic genomic site switching."

More information: Boyoung Shin et al. Runx1 and Runx3 drive progenitor to T-lineage transcriptome conversion in mouse T cell commitment via dynamic genomic site switching, Proceedings of the National Academy of Sciences (2021). DOI: 10.1073/pnas.2019655118

Supporting Information

Figure S1.

Power of tests described in the main text to detect a signal of selection on the mapped genetic basis of skin pigmentation [67] as an increasing function of the strength of selection (A), and a decreasing function of the genetic correlation between skin pigmentation and the selected trait with the effect of selection held constant at (B).

Figure S2.

Power of tests described in the main text to detect a signal of selection on the mapped genetic basis of BMI [74] as an increasing function of the strength of selection (A), and a decreasing function of the genetic correlation between BMI and the selected trait with the effect of selection held constant at (B).

Figure S3.

Power of tests described in the main text to detect a signal of selection on the mapped genetic basis of T2D [75] as an increasing function of the strength of selection (A), and a decreasing function of the genetic correlation between height and the selected trait with the effect of selection held constant at (B).

Figure S4.

Power of tests described in the main text to detect a signal of selection on the mapped genetic basis of CD [26] as an increasing function of the strength of selection (A), and a decreasing function of the genetic correlation between CD and the selected trait with the effect of selection held constant at (B).

Figure S5.

Power of tests described in the main text to detect a signal of selection on the mapped genetic basis of UC [26] as an increasing function of the strength of selection (A), and a decreasing function of the genetic correlation between UC and the selected trait with the effect of selection held constant at (B).

Figure S6.

The two components of the statistic for the skin pigmentation dataset, as described by the left and right terms in (14). The null distribution of each component is shown as a histogram. The expected value is shown as a black bar, and the observed value as a red arrow.

Figure S7.

The two components of the statistic for the BMI dataset, as described by the left and right terms in (14). The null distribution of each component is shown as a histogram. The expected value is shown as a black bar, and the observed value as a red arrow.

Figure S8.

The two components of the statistic for the T2D dataset, as described by the left and right terms in (14). The null distribution of each component is shown as a histogram. The expected value is shown as a black bar, and the observed value as a red arrow.

Figure S9.

The two components of the statistic for the CD dataset, as described by the left and right terms in (14). The null distribution of each component is shown as a histogram. The expected value is shown as a black bar, and the observed value as a red arrow.

Figure S10.

The two components of the statistic for the UC dataset, as described by the left and right terms in (14). The null distribution of each component is shown as a histogram. The expected value is shown as a black bar, and the observed value as a red arrow.

Figure S11.

The genetic values for height in each HGDP population plotted against the measured sex-averaged height taken from [127]. Only the subset of populations with an appropriately close match in the named population in [127]'s Appendix I are shown; the values used are given in Supplementary table S1.

Figure S12.

The genetic skin pigmentation score for each HGDP population plotted against that population's value on the skin pigmentation index map of Biasutti 1959. Data obtained from the Supplementary table of [69]. Note that the Biasutti map is interpolated, and so values are known to be imperfect. Values used are given in Supplementary table S2.

Figure S13.

The genetic skin pigmentation score for each HGDP population plotted against that population's value from the [68] mean skin reflectance (685nm) data (their Table 6). Only the subset of populations with an appropriately close match were used, as in the Supplementary table of [69]. Values and populations used are given in Table S2.

Figure S14.

The distribution of genetic height score across all 52 HGDP populations. Grey bars represent the confidence interval for the genetic height score of an individual randomly chosen from that population under Hardy-Weinberg assumptions.

Figure S15.

The distribution of genetic skin pigmentation score across all 52 HGDP populations. Grey bars represent the confidence interval for the genetic skin pigmentation score of an individual randomly chosen from that population under Hardy-Weinberg assumptions.

Figure S16.

The distribution of genetic BMI score across all 52 HGDP populations. Grey bars represent the confidence interval for the genetic BMI score of an individual randomly chosen from that population under Hardy-Weinberg assumptions.

Figure S17.

The distribution of genetic T2D risk score across all 52 HGDP populations. Grey bars represent the confidence interval for the genetic T2D risk score of an individual randomly chosen from that population under Hardy-Weinberg assumptions.

Figure S18.

The distribution of genetic CD risk score across all 52 HGDP populations. Grey bars represent the confidence interval for the genetic CD risk score of an individual randomly chosen from that population under Hardy-Weinberg assumptions.

Figure S19.

The distribution of genetic UC risk score across all 52 HGDP populations. Grey bars represent the confidence interval for the genetic UC risk score of an individual randomly chosen from that population under Hardy-Weinberg assumptions.

Table S1.

Genetic height scores compared with true heights for populations with a suitably close match in the dataset of [127]. See Figure S11 for a plot of genetic height score against sex-averaged height.

Table S2.

Genetic skin pigmentation score compared with values from Biasutti [69], [128] and [68]. We also calculate a genetic skin pigmentation score that includes previously reported associations at KITLG and OCA2 for comparison. See also Figures S12 and S13.

Table S3.

Conditional analysis at the regional level for the height dataset.

Table S4.

Conditional analysis at the individual population level for the height dataset.

Table S5.

Conditional analysis at the regional level for the skin pigmentation dataset.

Table S6.

Conditional analysis at the individual population level for the skin pigmentation dataset.

Table S7.

Conditional analysis at the regional level for the BMI dataset.

Table S8.

Conditional analysis at the individual population level for the BMI dataset.

Table S9.

Conditional analysis at the regional level for the T2D dataset.

Table S10.

Conditional analysis at the individual population level for the T2D dataset.

Table S11.

Conditional analysis at the regional level for the CD dataset.

Table S12.

Conditional analysis at the individual population level for the CD dataset.

Table S13.

Conditional analysis at the regional level for the UC dataset.

Table S14.

Conditional analysis at the individual population level for the UC dataset.

Table S15.

Corresponding statistics for all analyses presented in Table 2.

Table S16.

Corresponding statistics for all analyses presented in Table 2.
