The Levenshtein distance between two strings is the minimum number of single-character edits required to turn one word into the other.
The word “edits” includes substitutions, insertions, and deletions.
For example, suppose we have the following two words:
- PARTY
- PARK
The Levenshtein distance between the two words (i.e. the number of edits we have to make to turn one word into the other) would be 2: substitute the T with a K (PARTY becomes PARKY), then delete the Y (PARKY becomes PARK).
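To make the counting concrete, here is a minimal dynamic-programming sketch in base R. The helper name lev_dist is hypothetical and is not part of any package; it is only meant to illustrate how the minimum number of edits can be found, not how any particular library implements it.

```r
# Hypothetical helper for illustration only (not part of stringdist)
lev_dist <- function(s, t) {
  a <- strsplit(s, "")[[1]]
  b <- strsplit(t, "")[[1]]
  n <- length(a)
  m <- length(b)

  # d[i + 1, j + 1] = distance between the first i characters of s
  # and the first j characters of t
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n  # turning a prefix of s into "" takes i deletions
  d[1, ] <- 0:m  # turning "" into a prefix of t takes j insertions

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0L else 1L
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution (or match when cost is 0)
      )
    }
  }
  d[n + 1, m + 1]
}

lev_dist("PARTY", "PARK")
# [1] 2
```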
In practice, the Levenshtein distance is used in many different applications including approximate string matching, spell-checking, and natural language processing.
This tutorial explains how to calculate the Levenshtein distance between strings in R by using the stringdist() function from the stringdist package.
This function uses the following basic syntax:
#load stringdist package
library(stringdist)

#calculate Levenshtein distance between two strings
stringdist("string1", "string2", method = "lv")

Note that this function can calculate many different distance metrics. By specifying method = "lv", we tell the function to calculate the Levenshtein distance. If we leave out the method argument, the function uses its default, "osa" (optimal string alignment), which is not the same metric.
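As a rough illustration of why the method argument matters, the sketch below runs the same inputs through a few of the metrics that stringdist() supports. It assumes the stringdist package is installed; the input strings are chosen only for illustration.

```r
library(stringdist)

# "ab" -> "ba" swaps two adjacent characters
stringdist("ab", "ba", method = "lv")   # 2: Levenshtein needs two single-character edits
stringdist("ab", "ba", method = "osa")  # 1: OSA counts an adjacent transposition as one edit

# Hamming distance is only defined for strings of equal length
stringdist("party", "park", method = "hamming")  # Inf
```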
Example 1: Levenshtein Distance Between Two Strings
The following code shows how to calculate the Levenshtein distance between the two strings “party” and “park” using the stringdist() function:
#load stringdist package
library(stringdist)

#calculate Levenshtein distance between two strings
stringdist('party', 'park', method = 'lv')

[1] 2
The Levenshtein distance turns out to be 2.
Example 2: Levenshtein Distance Between Two Vectors
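The following code shows how to calculate the Levenshtein distance between the corresponding elements of two different vectors (the first string in one vector against the first string in the other, and so on):
#load stringdist package
library(stringdist)

#define vectors
a <- c('Mavs', 'Spurs', 'Lakers', 'Cavs')
b <- c('Rockets', 'Pacers', 'Warriors', 'Celtics')

#calculate Levenshtein distance between two vectors
stringdist(a, b, method='lv')

[1] 6 4 5 5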
The way to interpret the output is as follows:
- The Levenshtein distance between ‘Mavs’ and ‘Rockets’ is 6.
- The Levenshtein distance between ‘Spurs’ and ‘Pacers’ is 4.
- The Levenshtein distance between ‘Lakers’ and ‘Warriors’ is 5.
- The Levenshtein distance between ‘Cavs’ and ‘Celtics’ is 5.
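Note that stringdist() compares the vectors element by element, which is why the output above has four values. If the distance between every possible pair is wanted instead, the same package provides stringdistmatrix(). The sketch below assumes the two vectors contain the team names listed above.

```r
library(stringdist)

# vectors assumed from the interpretation above
a <- c('Mavs', 'Spurs', 'Lakers', 'Cavs')
b <- c('Rockets', 'Pacers', 'Warriors', 'Celtics')

# 4 x 4 matrix: rows correspond to a, columns correspond to b
m <- stringdistmatrix(a, b, method = 'lv', useNames = 'strings')
m

# the diagonal reproduces the element-wise result from stringdist()
diag(m)
```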
Example 3: Levenshtein Distance Between Data Frame Columns
The following code shows how to calculate the Levenshtein distance between the strings in two different columns of a data frame, row by row:
#load stringdist package
library(stringdist)

#define data
data <- data.frame(a = c('Mavs', 'Spurs', 'Lakers', 'Cavs'),
                   b = c('Rockets', 'Pacers', 'Warriors', 'Celtics'))

#calculate Levenshtein distance
stringdist(data$a, data$b, method='lv')

[1] 6 4 5 5
We could then append the Levenshtein distance as a new column in the data frame if we’d like:
#save Levenshtein distance as vector
lev <- stringdist(data$a, data$b, method='lv')

#append Levenshtein distance as new column
data$lev <- lev

#view data frame
data

       a        b lev
1   Mavs  Rockets   6
2  Spurs   Pacers   4
3 Lakers Warriors   5
4   Cavs  Celtics   5
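Once the distance is stored as a column, it can be used like any other numeric column. As a small sketch, the code below keeps only the rows whose distance is within a cutoff. The cutoff of 4 and the name close_pairs are arbitrary choices made for illustration.

```r
library(stringdist)

# same data as in the example above
data <- data.frame(a = c('Mavs', 'Spurs', 'Lakers', 'Cavs'),
                   b = c('Rockets', 'Pacers', 'Warriors', 'Celtics'),
                   stringsAsFactors = FALSE)
data$lev <- stringdist(data$a, data$b, method = 'lv')

# keep only pairs within a (hypothetical) cutoff of 4 edits
close_pairs <- data[data$lev <= 4, ]
close_pairs
#       a      b lev
# 2 Spurs Pacers   4
```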
Additional Resources
How to Calculate Hamming Distance in R
How to Calculate Euclidean Distance in R
How to Calculate Manhattan Distance in R