How to Calculate Jaccard Similarity in R

by Tutor Aspire

The Jaccard similarity index measures the similarity between two sets of data. It ranges from 0 to 1: a value of 0 means the two sets share no values, a value of 1 means they are identical, and higher values indicate greater overlap.

The Jaccard similarity index is calculated as:

Jaccard Similarity = (number of observations in both sets) / (number of distinct observations in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|
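As a quick worked instance of the formula, here is a minimal sketch using base R's `intersect()` and `union()`. The sets `A` and `B` and the names `n_both` and `n_either` are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Hypothetical example sets (assumed for illustration only)
A <- c(1, 2, 3)
B <- c(2, 3, 4)

# Size of the intersection: values present in both sets
n_both <- length(intersect(A, B))     # 2 (the values 2 and 3)

# Size of the union: distinct values present in either set
n_either <- length(union(A, B))       # 4 (the values 1, 2, 3, 4)

# J(A, B) = size of intersection / size of union
n_both / n_either                     # 0.5
```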

This tutorial explains how to calculate Jaccard Similarity for two sets of data in R.

Example: Jaccard Similarity in R

Suppose we have the following two sets of data:

a 
b 
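For illustration, the two sets could be defined as numeric vectors like the ones sketched here. The specific values are an assumption made for this sketch: they were picked so that four values are shared out of ten distinct values overall, which is consistent with the similarity of 0.4 reported in this example.

```r
# Illustrative values only (assumed for this sketch)
a <- c(0, 1, 2, 5, 6, 8, 9)
b <- c(0, 2, 3, 4, 5, 7, 9)

# Values shared by both sets
intersect(a, b)           # 0 2 5 9

# Number of distinct values across both sets
length(union(a, b))       # 10
```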

We can define the following function to calculate the Jaccard Similarity between the two sets:

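#define Jaccard Similarity function
jaccard <- function(a, b) {
    #count the values that appear in both sets
    intersection <- length(intersect(a, b))
    #count the distinct values in either set; unique() guards against duplicates
    union <- length(unique(a)) + length(unique(b)) - intersection
    return(intersection / union)
}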

#find Jaccard Similarity between the two sets 
jaccard(a, b)

[1] 0.4

The Jaccard Similarity between the two sets is 0.4.

Note that the function will return 0 if the two sets don’t share any values:

c 

And the function will return 1 if the two sets are identical:

e 
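Both boundary cases can be checked with a short sketch. It uses a compact variant of `jaccard()` built on base R's `intersect()` and `union()`, and the vectors `no_overlap_1`, `no_overlap_2`, `same_1`, and `same_2` are hypothetical values assumed for illustration.

```r
# Compact variant of the Jaccard function using base R set helpers
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical sets with no values in common
no_overlap_1 <- c(0, 1, 2, 3)
no_overlap_2 <- c(6, 7, 8, 9)
jaccard(no_overlap_1, no_overlap_2)   # 0

# Hypothetical sets containing exactly the same values
same_1 <- c(0, 1, 2, 5)
same_2 <- c(0, 1, 2, 5)
jaccard(same_1, same_2)               # 1
```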

The function also works for sets that contain strings:

g <- c('cat', 'dog', 'hippo', 'monkey')
h <- c('monkey', 'rhino', 'ostrich', 'salmon')

jaccard(g, h)

[1] 0.1428571

You can also use this function to find the Jaccard distance between two sets, which is the dissimilarity between two sets and is calculated as 1 – Jaccard Similarity.

#find Jaccard distance between sets a and b
1 - jaccard(a, b)

[1] 0.6
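If the distance is needed often, it may be convenient to wrap the subtraction in its own function. The sketch here assumes a helper named `jaccard_distance()`, which is a hypothetical name rather than part of base R, and applies it to the string sets `g` and `h` from the earlier example.

```r
# Compact variant of the Jaccard function using base R set helpers
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical helper: Jaccard distance = 1 - Jaccard similarity
jaccard_distance <- function(a, b) {
  1 - jaccard(a, b)
}

g <- c('cat', 'dog', 'hippo', 'monkey')
h <- c('monkey', 'rhino', 'ostrich', 'salmon')

# One shared value ('monkey') out of seven distinct values, so distance = 6/7
jaccard_distance(g, h)
```

A distance of 0 means the sets are identical, and a distance of 1 means they share no values.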

Refer to the Wikipedia page on the Jaccard index to learn more about the Jaccard Similarity Index.
