Finding the Levenshtein distance in Python

I discovered today that Moneygement won’t accept unicode characters when someone adds transactions by email because of the editdist module I used to check it. Since I don’t really need a fast function to do it (it’s all eight-letter words on average), I decided to write my own Python version of the function and am sharing it here if anyone needs it (because I haven’t found a Python implementation anywhere). It’s released under a BSD license with attribution, meaning that I’d like it if my name was mentioned where it is used :)

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
       distance_matrix[i][0] = i
    for j in range(second_length):
       distance_matrix[0][j]=j
    for i in xrange(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

Related material

Peter Norvig (of the AI book and Google fame) has an interesting article on spelling correction, which is a fairly typical use of levenshtein. In his case though he generates all of the words within edit distance 1 of a source word, which although very computationally intensive might be of interest as well.

His code, from the link, is below..


alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
n = len(word)
return set([word[0:i]+word[i+1:] for i in range(n)] + # deletion
[word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
[word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
[word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet]) # insertion

Submitted by Ross (not verified) on Wed, 23/04/2008 - 11:20.
Re: Norvig

Oh, I remember reading that, it was indeed very interesting, thanks for the link.
---
Vidi, Vici, Veni.

Submitted by Stavros on Wed, 23/04/2008 - 12:43.
Another option

Just wanted to leave a link to a page I found with an implementation in multiple languages;

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Python

You might want to consider checking it out, and altering your use of "range" to either xrange() or enumerate().

As you mentioned, if it's comparing small stuff then it's a minimal impact but an iterator would be better then building the entire array into memory.

Mostly just for posterity sake and to say thanks for sharing!

Submitted by jay (not verified) on Fri, 27/02/2009 - 17:17.
Hi, your code does not work,

Hi, your code does not work, e.g.
print levenshtein_distance("abcdefghijk","cdefghijklm")
returns 2
4 would be correct.
The matrix is not correctly initialised.

Submitted by Stephan (not verified) on Thu, 23/04/2009 - 17:29.
Re: Error

You are, of course, correct. The code has been fixed, thank you for the report.

Submitted by Stavros on Thu, 23/04/2009 - 17:52.
The code does not seem to be

The code does not seem to be working correctly either, and everytime I compare two different string my result is 1.
(It does however work if the strings are identical, correctly producing 0)
I compared the matrix with the "kitten" "sitting" matrix from http://en.wikipedia.org/wiki/Levenshtein_distance
and they do not match up.

Any ideas?

Submitted by Anonymous (not verified) on Tue, 04/08/2009 - 15:12.
PS. Forgot to add my

PS.

Forgot to add my script:

http://pastebin.com/m194975fa

I simply added a print to the end.

I'm sure its probably just a stupid mistake on my part.

Help is appreciated.

Submitted by Anonymous (not verified) on Tue, 04/08/2009 - 15:18.
I don't know what you mean, I

I don't know what you mean, I compared both the matrices and the results from the Wikipedia page and it looks to be functioning correctly... I have a bunch of unit tests on that function too, and they're all correct...

Submitted by Stavros on Tue, 04/08/2009 - 18:35.
The script doesn't seem to

The script doesn't seem to work for me.
It worked fine about a few days ago, but in my current program, it produces a difference of 1 for all strings.

I'll get screenshots of the output tomorrow.
I've been using python for about a week now, so its probably an error of mine.
That pastebin link has the script that I've been using.

Submitted by Anonymous (not verified) on Tue, 04/08/2009 - 19:52.
Yeah, you got the indentation

Yeah, you got the indentation wrong. The if check is supposed to be inside the for.

Submitted by Stavros on Tue, 04/08/2009 - 21:48.
Thanks, you've been a huge

Thanks, you've been a huge help.

Submitted by Anonymous (not verified) on Wed, 05/08/2009 - 09:16.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Ads