How to calculate the Jaccard index

I want to calculate the Jaccard index between two compounds. What is the algorithm? I have searched for it, it just gives the formula but how to apply it on compounds is not known to me. Can you help?

The Jaccard index is a measure of similarity between two sets. Take a look at the Wikipedia article here. It is very easy to compute:

The Jaccard similarity coefficient for sets X and Y is defined as:

J(X,Y) = |intersection(X,Y)| / |union(X,Y)|

where | | indicates the size (number of elements) of a set. Imagine you have two sets X and Y defined as follows:

X = {A, B, C, D}
Y = {C, D, E, F, G}

Then:

intersection(X,Y) = {C, D} => |intersection(X,Y)| = 2
union(X,Y) = {A, B, C, D, E, F, G} => |union(X,Y)| = 7

Therefore: J(X,Y) = 2/7

Alternatively, the Jaccard distance would be D(X,Y) = 1 - J(X,Y) = 1 - 2/7 = 5/7
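The calculation can be sketched in a few lines of Python using the built-in set type. The function names are illustrative, and returning 0.0 for two empty sets is an assumption, since the ratio is undefined in that case. For compounds, the set elements could be compound identifiers instead of letters.

```python
def jaccard_index(x, y):
    """Return |X intersect Y| / |X union Y| for two collections, treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Both sets are empty: the ratio is undefined, 0.0 is an assumed convention.
        return 0.0
    return len(x & y) / len(union)


def jaccard_distance(x, y):
    """Return the Jaccard distance, 1 - J(X, Y)."""
    return 1.0 - jaccard_index(x, y)


X = {"A", "B", "C", "D"}
Y = {"C", "D", "E", "F", "G"}

# Two shared elements (C, D) out of seven distinct elements overall.
print(jaccard_index(X, Y))
print(jaccard_distance(X, Y))
```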

In Biology the Jaccard index has been used to compute the similarity between networks, by comparing the number of edges in common (e.g. Bass, Nature methods 2013)

Regarding applying it to compounds: if you have two sets containing different compounds, you can use this index to find how similar the two sets are. The elements of the sets, in this case the compounds, correspond to A, B, C, etc. in my example.

Jaccard Index / Similarity Coefficient

The Jaccard similarity index (sometimes called the Jaccard similarity coefficient) compares the members of two sets to see which members are shared and which are distinct. It’s a measure of similarity for the two sets of data, with a range from 0% to 100%. The higher the percentage, the more similar the two populations. Although it’s easy to interpret, it is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.

Jaccard Similarity Formula

“The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally given the French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity of sample sets.”

As the formula J(A,B) = |A ∩ B| / |A ∪ B| shows, the index depends on set A and set B: it is the size of the intersection of A and B (denoted ∩) divided by the size of the union of A and B (denoted ∪). It is basically a formula for measuring how much overlap there is between A and B.

The denominator can be rewritten as |A| + |B| - |A ∩ B|: the sum |A| + |B| counts every shared element twice, so it can be larger than |A ∪ B|, and we subtract the overlap |A ∩ B| to correct for that.

Evaluation metrics help us in telling the performance of our ML models. They help us in calculating an ML model’s accuracy. Accuracy tells us how good or bad our ML model is, i.e., how our ML model is going to perform on an unknown data sample, based on the training that it has received by the training set. For evaluating an ML model, we need a test set, which is usually different from the training set, that we feed into our ML model and see what the outputs are and compare these outputs with already known outputs. So now that we are clear with what evaluation metrics are, let’s move on to the actual topic of our blog, Jaccard Index.

Jaccard Index is one of the simplest ways to calculate and find out the accuracy of a classification ML model. Let’s understand it with an example. Suppose we have a labelled test set, with labels as –

And our model has predicted the labels as –

The above Venn diagram shows us the labels of the test set and the labels of the predictions, and their intersection and union.

The Jaccard Index is defined as the size of the intersection divided by the size of the union of the two labelled sets, with formula as –

J = |true labels ∩ predicted labels| / |true labels ∪ predicted labels|

So, for our example, we can see that the intersection of the two sets is equal to 8 (since eight values are predicted correctly) and the union is 10 + 10 – 8 = 12. So, the Jaccard index gives us the accuracy as –

J = 8 / 12 ≈ 0.67

So, the accuracy of our model, according to Jaccard Index, becomes about 0.67, or 67%.
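As a rough sketch, the same arithmetic can be written in Python from the three counts alone, using the identity union = |A| + |B| - |A ∩ B|. The function name is illustrative, and the zero-union fallback is an assumption.

```python
def jaccard_from_counts(size_a, size_b, size_intersection):
    """Jaccard index from set sizes, using union = |A| + |B| - |A intersect B|."""
    union = size_a + size_b - size_intersection
    if union == 0:
        # Both sets empty: undefined ratio, 0.0 is an assumed convention.
        return 0.0
    return size_intersection / union


# Counts from the example: 10 true labels, 10 predicted labels, 8 in common.
score = jaccard_from_counts(10, 10, 8)
print(round(score, 2))
```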

That was all there is to know about the Jaccard Index. Hope this blog was helpful to you. Thanks for reading.

Direct gradient analysis

Multivariate analyses are required for community data because we're interested in the response of many species, simultaneously

Multivariate analyses are used to summarize redundancy, reduce noise, elucidate relationships, and identify outliers

Multivariate analyses can relate communities to other kinds of data (e.g., environmental, historical data)

Results from multivariate analyses are designed to improve our understanding of communities, esp. community structure

Used to display distribution of organisms along gradients of important environmental factors

Devised by Ramensky (1930) and Gause (1930), but used extensively in ecological research after about 1950 (Whittaker)

Dix and Smeins (1967) took 100 community samples to represent the range of vegetation present in Nelson County, North Dakota

Homogeneous stands of 0.1 ha were sampled by recording frequency in 30, 0.5 × 0.5 m quadrats

Numerous environmental variables were recorded for each stand

Defined indicator species of a drainage class as a species w/ frequency at least 10% greater in that class than in any other class

Defined indicator value as drainage class of the indicator species

Goal: summarize frequency of all species --> single number for each stand

Stand Index Number = [Σ(rel. freq. × indicator value) / Σ(rel. freq. of indiv. sp.)] × 100

Stvi: rel. freq. 10, no indicator value (not an indicator for any drainage class)
Totals: rel. freq. 40*, rel. freq. × indicator value 65

* sum of RF for spp. w/ IV (20+15+5)

Stand Index 17 = (65/40) × 100 = 162
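A minimal Python sketch of the stand index calculation follows. The relative frequencies (20, 15, 5, and 10 for the non-indicator species) come from the notes, but the indicator values 1, 2 and 3 are hypothetical, chosen only so that the weighted sum equals the 65 shown above. The exact result is 162.5, which the notes report as 162.

```python
def stand_index(species):
    """Weighted-average drainage index for one stand.

    species: list of (relative_frequency, indicator_value) pairs, with
    indicator_value set to None for species that are not indicators.
    """
    indicators = [(rf, iv) for rf, iv in species if iv is not None]
    total_rf = sum(rf for rf, _ in indicators)
    if total_rf == 0:
        raise ValueError("stand has no indicator species")
    weighted_sum = sum(rf * iv for rf, iv in indicators)
    return weighted_sum / total_rf * 100


# Hypothetical indicator values (1, 2, 3); relative frequencies from the notes.
stand = [(20, 1), (15, 2), (5, 3), (10, None)]
index = stand_index(stand)
print(index)
```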

For all stands, stand index varied from 100 to 600

Divided this 500-unit gradient into 10, 50-unit classes:

Class | Stands w/in 50-unit class | Species frequency (A, B, C)

=========> Fig. 2 [Dix and Smeins 1967, p. 33]

They could have plotted frequency over the entire 500-unit gradient, but the graph would have been messy--10 drainage classes "smooths" the graph, making interpretation easier

The purpose of direct gradient analysis is to organize community and environmental data to answer questions such as:

    Precisely which environmental factor in a complex of factors principally affects distribution of organisms and communities?

While direct gradient analysis can be used to identify ecologically important environmental factors, experimental manipulations are needed to more precisely determine the importance of various environmental factors

Dix and Smeins derived an index for drainage based on the plants themselves: this may be easier, more accurate, and less expensive than other measures of drainage or soil moisture

Often difficult to evaluate because secondary gradients are overshadowed by primary gradients

Data are plotted along environmental axes which are generally accepted as given. Axes can be:

Species, communities, and community-level characteristics can be plotted

Several dimensions are possible

Some form of data-smoothing is usually employed prior to presentation

common smoothing technique is weighted average for each datum e.g.,

smoothed = (previous datum + 2 × current datum + next datum) / 4

resulting curve is less "noisy" than original data
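The 1-2-1 weighted average can be sketched in Python as below. Leaving the first and last values unchanged is an assumption, since the notes do not say how the ends of the gradient are handled.

```python
def smooth(values):
    """Apply (previous + 2 * current + next) / 4 to each interior datum."""
    if len(values) < 3:
        return list(values)
    smoothed = [values[0]]  # endpoints kept as-is (assumed convention)
    for prev, cur, nxt in zip(values, values[1:], values[2:]):
        smoothed.append((prev + 2 * cur + nxt) / 4)
    smoothed.append(values[-1])
    return smoothed


# A single noisy spike is pulled toward its neighbours.
print(smooth([0, 4, 0]))
```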

Whittaker offered the following conclusions about DGA:

    The general form for the distribution of a species population along an environmental complex-gradient is the bell-shaped curve

    The center (or mode) of a species population along a complex-gradient is not at its physiological optimum but is a center of maximum population success in competition with other species populations

One important qualification: in some cases, competing species appear to be not randomly but regularly distributed along environmental complex-gradients

According to Whittaker, these considerations imply the following:

Whittaker's conclusions were strongly influenced by his belief in bell-shaped curves of species distributions

The bell-shaped curve concept was challenged by Austin (1976, Vegetatio 33:33-41) in a summary of previously published data:

 linearbell symmetricskewedvery skewedbimodaltotal
Percent of Total1461624633 

            bell (%)   skewed   shouldered   plateau   bimodal   total
Smokies     8 (23%)    6        10           2         9         35
Siskiyous   14 (27%)   16       8            11        2         51

Austin therefore concluded that the general form of the species population is not normal, bell-shaped. And he was considering data which had already been smoothed

Werger (1983, Vegetatio 52:141-150) used a very conservative yardstick for "normal" distribution (50% of variation accounted for by curve)

31% of species normally distributed:

1 of 8 species (12%) on ridge tops

12 of 22 species (55%) midslope

5 of 32 species (16%) in swales

The data collected and summarized by Austin and Werger indicate that there is no a priori reason to assume bell-shaped normal curves for distributions of species on gradients

    DGA is of unquestionable value and utility in ecology as a means of

    data summarization and presentation, and

Circularity results from subjective (pre-conceived) sampling design--note that this was a criticism launched by Whittaker (among others) against the Clementsian approach of "seeing" communities and sampling w/in them.

The DGA-based conclusion of vegetation continuum results from arbitrary, subjective sampling (just as the discrete-community conclusion derives from sampling w/in well-defined communities which appear to be different).

Both schools describe, but do not answer "why"? Both groups base conclusions on descriptive data, w/o testing hypotheses.

Measuring Genealogical Similarity using the Jaccard Index

For some of the posts on this blog I’ll be using one way to measure the similarity of two sample sets of data. The statistic is called the Jaccard Index, or the Jaccard Similarity Coefficient. This post is a technical explanation of the calculation itself.

The sets of data are the unique ancestral surnames of my DNA matches. The question I’m asking for any two of my matches is: how similar are their lists of direct ancestral surnames?

If two lists of unique surnames are identical, they will have the exact same surnames. They will also have the same number of surnames in their lists, as each surname is only represented once regardless of how many times it appears in the direct tree. They will be 100% similar.

However, I’m also interested in trees that are “nearly” the same. Suppose two siblings create separate trees, and both get as far as all their great great grandparents. Tom’s research leads him to one pair of 3rd greats, and Joe finds a different pair. Neither are aware yet of the other’s research, but both have one extra maiden name each in their trees. Those lists will be very similar, and I’d like to highlight their similarity in some way.

So I need a way of defining the “similarity” of two lists of surnames. The Jaccard Similarity Index compares two sets (or lists) to see which members (surnames) are shared and which are different. It calculates the percentage of similarity from 0 to 100%. The math is pretty simple, and is described here in understandable terms.

In the simplest terms, we count the intersection of the lists i.e. the number of surnames common to both trees. We count the differences for each side, and we count the total number of surnames in all. The Jaccard index expresses this mathematically as:

J(X,Y) = |X∩Y| / |X∪Y| = |X∩Y| / (|X| + |Y| – |X∩Y|)

Taking our two brothers, Tom and Joe:
|X∩Y| is the number of shared surnames: 8 for the brothers.
|X| is the length of the set, or the number of surnames for Tom’s tree: 9.
|Y| is the length of the set, or the number of surnames for Joe’s tree: also 9.

So our equation is: 8 / (9 + 9 – 8) * 100 = 80% similarity for our brothers.
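Here is a small Python sketch of the brothers' calculation. The surnames are made up for illustration: eight shared names plus one extra maiden name on each side.

```python
def similarity_percent(surnames_x, surnames_y):
    """Jaccard similarity of two surname lists, as a percentage."""
    x, y = set(surnames_x), set(surnames_y)
    union = x | y
    if not union:
        return 0.0  # two empty lists: assumed convention
    return 100 * len(x & y) / len(union)


# Hypothetical surnames: 8 shared, plus 1 unique to each brother.
shared = ["Smith", "Jones", "Byrne", "Walsh", "Doyle", "Ryan", "Nolan", "Burke"]
tom = shared + ["Murphy"]
joe = shared + ["Kelly"]

print(similarity_percent(tom, joe))
```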

If the brothers had exactly the same trees, they’d be 100% similar. If the postman’s tree had no overlapping surnames with the brothers, his index compared to both would be 0%.

So the ultimate task is to compare every surname list within my matches with every other surname list. As the Jaccard index only works on two sets at a time, calculating the similarity across N sets requires N(N-1)/2 pairwise calculations, which grows roughly as N squared. This becomes unfeasible for large numbers of sets, and there are other methods that can be brought into play to reduce processing time. I had about 4.4 million pairs of sets to compare, which took a matter of hours to complete.
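The brute-force all-pairs comparison might look like the sketch below. The match names and surname sets are hypothetical, and this is only the naive approach: it visits each of the N(N-1)/2 unordered pairs once.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (0.0 if both are empty, by assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarities(surname_sets):
    """Map each unordered pair of match names to its Jaccard index."""
    return {
        (m1, m2): jaccard(surname_sets[m1], surname_sets[m2])
        for m1, m2 in combinations(sorted(surname_sets), 2)
    }


# Hypothetical matches; order within a set does not matter.
matches = {
    "match1": {"Smith", "Jones"},
    "match2": {"Jones", "Smith"},
    "match3": {"Reilly"},
}
results = pairwise_similarities(matches)
print(len(results))
```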

Note that for my current purposes, I am using unique surnames. If one match has entered father, grandfather and great-grandfather John Smith, his list has Smith represented once. This is to simplify data collection and computation.

Note also that for my current purposes, the direction of surnames is unimportant. Match #1 may have a two-person tree with Mary Smith as the mother of Bob Jones, while Match #2 has Anne Jones as the mother of Bob Smith. That is “Smith->Jones” and “Jones->Smith”. If I include direction, these lists are different. I am treating the lists as a “bag of words”, where direction is not important – so these two lists “Jones, Smith”, and “Smith, Jones” are the same. This is to simplify data collection and computation.

Two caveats must be considered with the Jaccard Index. One is that it can be erroneous for small sample sizes, so I intend to exclude small trees.
The other problem for the index is when there are missing observations in the data sets. It’s safe to say that most of my lists have missing observations, as I’m not drawing from a sample of relatives with perfect trees to four generations. The trees tend to be ragged i.e. people know more about one branch than another.

2 thoughts on “Measuring Genealogical Similarity using the Jaccard Index”

This is interesting – I’m looking forward to seeing what you do with it. How are you planning to handle surname spelling variations?

A great question that I am grappling with, and am undecided. What to do with O Raghallaigh/O’Reilly/Reilly/Riley: a particular Irish ancestral line may see all variants through successive generations.
The easiest sledgehammer is to collapse the variants into one, the first step being to strip the “O'”/”Mc”/”Mac” from surnames (I’m being very Irish-centric here but that is my personal domain challenge). Then going further, using various sources of name variations to collapse names to a single version.
Yet this could lose the richness that allows historical tracing. For example, a set of my matches share 4th/5th generation surnames of a very unusual and distinct variant of a common Irish surname. This allows easy tracking through U.S. records of their line.
Instead, I am thinking of using a similarity measure within surnames themselves, which would ensure that variants are not treated as completely dissimilar but do have an effect of lowering the overall similarity index. The challenge there is that the computation times get higher.
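The "sledgehammer" first step, stripping the prefixes, could be sketched as follows. The regular expression and function name are illustrative assumptions, not a tested normalisation scheme. It will mangle names that merely start with "Mac" (for example "Mack"), and it does nothing for spelling variants such as Reilly/Riley.

```python
import re

# Assumed prefix list from the comment above: O', O (with a space), Mc, Mac.
_PREFIX = re.compile(r"^(o['’]\s*|o\s+|mc|mac)", re.IGNORECASE)


def normalize_surname(name):
    """Strip a leading O'/O /Mc/Mac prefix and lower-case the remainder."""
    return _PREFIX.sub("", name.strip()).lower()


variants = ["O'Reilly", "O’Reilly", "Reilly", "O Raghallaigh"]
print({normalize_surname(v) for v in variants})
```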


The argument of this function is a list of three matrices, all of which are indexed in exactly the same manner: the rows of each matrix are indexed by the complexes, C-i, of the first bipartite graph, bg1, and the columns are indexed by the complexes, K-j, of the second bipartite graph, bg2.

The first matrix of the list is the intersect matrix, I. The (i,j) entry of I is the cardinality of the intersection of complex C-i of bg1 and complex K-j of bg2.

The second matrix of the list is the cminusk matrix, Q. The (i,j) entry of Q is the cardinality of the set difference between C-i and K-j.

The third matrix of the list is the kminusc matrix, P. The (i,j) entry of P is the cardinality of the set difference between K-j and C-i.

The Jaccard Coefficient between two sets (here between two complexes) C-i and K-j is given by the quotient of cardinality(C-i intersect K-j) and cardinality(C-i union K-j). Note that cardinality(C-i intersect K-j) is the (i,j) entry of I, and that cardinality(C-i union K-j) is the sum of the (i,j) entry of I, Q, P.
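The entry-wise computation can be illustrated with a short Python sketch using plain nested lists. This is not the function's own code: the name jaccard_matrix and the choice of 0.0 for an empty union are assumptions.

```python
def jaccard_matrix(intersect, cminusk, kminusc):
    """Entry-wise Jaccard coefficient I / (I + Q + P) for three equal-shape matrices."""
    result = []
    for row_i, row_q, row_p in zip(intersect, cminusk, kminusc):
        out_row = []
        for i_ij, q_ij, p_ij in zip(row_i, row_q, row_p):
            union = i_ij + q_ij + p_ij  # |C-i union K-j|
            out_row.append(i_ij / union if union else 0.0)
        result.append(out_row)
    return result


# One complex in bg1 (row) compared with two complexes in bg2 (columns).
I = [[2, 0]]
Q = [[1, 3]]
P = [[1, 2]]
print(jaccard_matrix(I, Q, P))
```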

sklearn.metrics.jaccard_score

The Jaccard index [1], or Jaccard similarity coefficient, defined as the size of the intersection divided by the size of the union of two label sets, is used to compare set of predicted labels for a sample to the corresponding set of labels in y_true .

Parameters y_true 1d array-like, or label indicator array / sparse matrix

Ground truth (correct) labels.

y_pred 1d array-like, or label indicator array / sparse matrix

Predicted labels, as returned by a classifier.

labels array-like of shape (n_classes,), default=None

The set of labels to include when average != 'binary' , and their order if average is None . Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

pos_label str or int, default=1

The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.

average {'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary'

If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':
Only report results for the class specified by pos_label. This is applicable only if targets (y_true, y_pred) are binary.

'micro':
Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':
Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.

'samples':
Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).

sample_weight array-like of shape (n_samples,), default=None

zero_division "warn", {0.0, 1.0}, default="warn"

Sets the value to return when there is a zero division, i.e. when there are no negative values in predictions and labels. If set to “warn”, this acts like 0, but a warning is also raised.

Returns score float (if average is not None) or array of floats, shape = [n_unique_labels]

jaccard_score may be a poor metric if there are no positives for some samples or classes. Jaccard is undefined if there are no true or predicted labels, and our implementation will return a score of 0 with a warning.
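To make the binary case concrete, here is a plain-Python sketch of what the score amounts to: true positives divided by true positives plus false positives plus false negatives. It is an illustration under that assumption, not scikit-learn's implementation, and the function name binary_jaccard is hypothetical.

```python
def binary_jaccard(y_true, y_pred, pos_label=1, zero_division=0.0):
    """TP / (TP + FP + FN) for the positive class of two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    denominator = tp + fp + fn
    if denominator == 0:
        # No true or predicted positives: the score is undefined.
        return zero_division
    return tp / denominator


y_true = [0, 1, 1]
y_pred = [1, 1, 1]
print(binary_jaccard(y_true, y_pred))
```

With scikit-learn installed, the corresponding library call would be jaccard_score(y_true, y_pred) imported from sklearn.metrics.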

Zero-columns extension of PWM

Lemma 1. Extending a PWM with any number of zero columns from the left or from the right does not change the score distribution or any P-value corresponding to any score threshold.

Proof: It is enough to have a proof for a single column appended from the right. A new extended matrix ME of size 4 × (m + 1) defines the scores for words ω ∈ A^(m+1). For the zero column, ME[α, m + 1] = 0 for all α in A, so S(ω, ME) = S(ω[1..m], M). The P-value can be calculated from the score distribution: P(ME, t) = Σ_{s ≥ t} Q(ME, s).

The word set ΩE = {ω ∈ A^(m+1) : S(ω, ME) ≥ s} can be obtained from the word set Ω by adding all 1-suffixes {ω[m + 1]} = A to any word ω[1..m] from Ω. If words are generated by an i.i.d. random model, their probabilities are the products of the letter probabilities p(α). So the probabilities of (m+1)-mers in ΩE factorize and the resulting probability does not change: each word ω in Ω contributes p(ω) × Σ_{α ∈ A} p(α) = p(ω), because the letter probabilities sum to 1.

Reverse complement transformation of PWM

Lemma 2. If the words are generated by an i.i.d. random model and the background probabilities comply with the conditions p(A) = p(T), p(C) = p(G) then the reverse complement transformation of PWM M does not change the score distribution and hence the P-values.

The assertion of this lemma directly follows from the definition of the score distribution after all substitutions made. For any word ω having a score s with M there is a corresponding hit with M ˜ , which is obtained as ω read backwards with substitutions A ⇔ T, G ⇔ C.

Alignment of PWMs of different widths

Lemma 3. Let there be an aligned pair of PWMs M1, M2 with the corresponding thresholds t1, t2, defining TFBS recognition models Ω1, Ω2. Extension of both PWMs with any number of zero columns does not change D1(Ω1, Ω2).

Proof: Again, it is enough to have a proof for a single column added from the right. The idea of the proof is very similar to that for Lemma 1. For the uniform probability distribution, let us consider the fraction J1(Ω1E, Ω2E) = |Ω1E ∩ Ω2E| / |Ω1E ∪ Ω2E|. Ω1E = Ω(M1E, t1) is obtained by adding all 1-suffixes to any word from Ω1 = Ω(M1, t1); the same is true for Ω2E = Ω(M2E, t2). Thus, if a word is in Ω(M1, t1) ∩ Ω(M2, t2) then its four possible extensions are in Ω(M1E, t1) ∩ Ω(M2E, t2) and |Ω1E ∩ Ω2E| = 4|Ω1 ∩ Ω2|.

All four 1-suffixes are added when passing from (Ω1, Ω2) to (Ω1E, Ω2E). Thus any (m+1)-mer from Ω1E or Ω2E has a single corresponding m-mer in Ω1 ∪ Ω2, and for each m-mer in Ω1 ∪ Ω2 there are four (m+1)-mers in Ω1E ∪ Ω2E. Thus |Ω1E ∪ Ω2E| = 4|Ω1 ∪ Ω2|.

Reducing the fraction by 4 proves the lemma. In the case of a non-uniform background distribution of probabilities p(α), it is important that the probability of an extended random word falling into Ω1E ∩ Ω2E is the same as for a non-extended random word falling into Ω1 ∩ Ω2. The proof of the above is very similar to that of Lemma 1. A similar equation is true for the denominator, which proves the lemma.
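The uniform-background case of Lemma 3 can be checked by brute force on a toy example, as in the Python sketch below. The two 2-column matrices and their thresholds are hypothetical, and enumerating all words is only practical for very short PWMs.

```python
from itertools import product

ALPHABET = "ACGT"
ZERO_COLUMN = {ch: 0.0 for ch in ALPHABET}


def word_set(pwm, threshold):
    """All words whose PWM score reaches the threshold (full enumeration)."""
    return {
        "".join(word)
        for word in product(ALPHABET, repeat=len(pwm))
        if sum(col[ch] for col, ch in zip(pwm, word)) >= threshold
    }


def jaccard(a, b):
    """Jaccard index of two word sets under a uniform background."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy PWMs (one dict per column) and thresholds.
pwm1 = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": -1.0},
        {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}]
pwm2 = [{"A": 1.0, "C": 0.5, "G": 0.0, "T": 0.0},
        {"A": 0.0, "C": 1.0, "G": 1.0, "T": 0.0}]
t1, t2 = 1.0, 1.5

j_before = jaccard(word_set(pwm1, t1), word_set(pwm2, t2))
# Append one zero column on the right of both matrices.
j_after = jaccard(word_set(pwm1 + [ZERO_COLUMN], t1),
                  word_set(pwm2 + [ZERO_COLUMN], t2))
print(j_before == j_after)
```

Both the intersection and the union grow by a factor of four after the extension, so the ratio is unchanged.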

Definition of the distance metric for TFBS models

Theorem: Distance D2(Ω1, Ω2) = 1 − J2(Ω1, Ω2) defines a proper metric in the space of TFBS models represented as PWMs with thresholds corresponding to the given P-value levels.

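Proof: To prove the theorem, one needs to demonstrate that D2 complies with the following metric properties:

1. D2(Ω1, Ω2) ≥ 0, with D2(Ω1, Ω2) = 0 only if Ω1 = Ω2
2. D2(Ω1, Ω2) = D2(Ω2, Ω1)
3. D2(Ω1, Ω2) ≤ D2(Ω1, Ω3) + D2(Ω2, Ω3) (the triangle inequality)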

The second property is clear from the D2 definition and the first property follows from the observation that X ∩ Y = X ∪ Y only in the case when X=Y and the probability of a word set increases with the number of words. It only remains to prove the triangle inequality.

Proof of the triangle inequality. Note that the matrices become extended with zero-columns if necessary while the optimal shift and orientation are selected. This can be safely done according to Lemma 3. Thus, we omit the E index for matrices and models for simplicity.

Let us use the Ω1|3 notation for the model defined by M1 optimally aligned versus M3. We start from separate alignments of M1 and M2 with M3 as a reference. Thus we obtain two optimal alignments, M1 vs M3 and M2 vs M3; the inherited alignment of M1 vs M2 is not necessarily optimal but is conditioned by the respective optimal alignments with M3.

Nevertheless, all three matrices M1, M2, M3 become aligned, and for this alignment the triangle inequality is valid [16]:

D1(Ω1|3, Ω2|3) ≤ D1(Ω1|3, Ω3) + D1(Ω2|3, Ω3)

By construction, D1(Ω1|3, Ω3) = D2(Ω1, Ω3), and it is possible to rewrite the latter equation as D1(Ω1|3, Ω2|3) ≤ D2(Ω1, Ω3) + D2(Ω2, Ω3). Finally, by definition:
