Finding the Levenshtein distance in Python
Submitted by Stavros on Thu, 17/01/2008 - 22:35
I discovered today that Moneygement won’t accept unicode characters when someone adds transactions by email because of the editdist module I used to check it. Since I don’t really need a fast function to do it (it’s all eight-letter words on average), I decided to write my own Python version of the function and am sharing it here if anyone needs it (because I haven’t found a Python implementation anywhere). It’s released under a BSD license with attribution, meaning that I’d like it if my name was mentioned where it is used :)
def levenshtein_distance(first, second):
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
Tags:
Peter Norvig (of the AI book and Google fame) has an interesting article on spelling correction, which is a fairly typical use of levenshtein. In his case though he generates all of the words within edit distance 1 of a source word, which although very computationally intensive might be of interest as well.
His code, from the link, is below..
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
Submitted by Ross (not verified) on Wed, 23/04/2008 - 11:20.n = len(word)
return set([word[0:i]+word[i+1:] for i in range(n)] + # deletion
[word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
[word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
[word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet]) # insertion
Oh, I remember reading that, it was indeed very interesting, thanks for the link.
Submitted by Stavros on Wed, 23/04/2008 - 12:43.---
Vidi, Vici, Veni.
Just wanted to leave a link to a page I found with an implementation in multiple languages;
http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Python
You might want to consider checking it out, and altering your use of "range" to either xrange() or enumerate().
As you mentioned, if it's comparing small stuff then it's a minimal impact but an iterator would be better then building the entire array into memory.
Mostly just for posterity sake and to say thanks for sharing!
Submitted by jay (not verified) on Fri, 27/02/2009 - 17:17.Hi, your code does not work, e.g.
Submitted by Stephan (not verified) on Thu, 23/04/2009 - 17:29.print levenshtein_distance("abcdefghijk","cdefghijklm")
returns 2
4 would be correct.
The matrix is not correctly initialised.
You are, of course, correct. The code has been fixed, thank you for the report.
Submitted by Stavros on Thu, 23/04/2009 - 17:52.The code does not seem to be working correctly either, and everytime I compare two different string my result is 1.
(It does however work if the strings are identical, correctly producing 0)
I compared the matrix with the "kitten" "sitting" matrix from http://en.wikipedia.org/wiki/Levenshtein_distance
and they do not match up.
Any ideas?
Submitted by Anonymous (not verified) on Tue, 04/08/2009 - 15:12.PS.
Forgot to add my script:
http://pastebin.com/m194975fa
I simply added a print to the end.
I'm sure its probably just a stupid mistake on my part.
Help is appreciated.
Submitted by Anonymous (not verified) on Tue, 04/08/2009 - 15:18.I don't know what you mean, I compared both the matrices and the results from the Wikipedia page and it looks to be functioning correctly... I have a bunch of unit tests on that function too, and they're all correct...
Submitted by Stavros on Tue, 04/08/2009 - 18:35.The script doesn't seem to work for me.
It worked fine about a few days ago, but in my current program, it produces a difference of 1 for all strings.
I'll get screenshots of the output tomorrow.
Submitted by Anonymous (not verified) on Tue, 04/08/2009 - 19:52.I've been using python for about a week now, so its probably an error of mine.
That pastebin link has the script that I've been using.
Yeah, you got the indentation wrong. The if check is supposed to be inside the for.
Submitted by Stavros on Tue, 04/08/2009 - 21:48.Thanks, you've been a huge help.
Submitted by Anonymous (not verified) on Wed, 05/08/2009 - 09:16.