Damerau-Levenshtein distance

The Damerau-Levenshtein distance is a string metric used to measure the distance between two sequences of characters, typically strings. It is similar to the Levenshtein distance, which calculates the minimum number of single-character insertions, deletions, and substitutions required to transform one string into another. However, the Damerau-Levenshtein distance also allows for transpositions (i.e., swapping adjacent characters) as a valid operation.

The distance is computed by finding the minimum number of operations required to transform one string into the other, where each operation is either a deletion, insertion, substitution, or transposition. The distance is often used in spelling correction and approximate string matching applications.

The formula for calculating the Damerau-Levenshtein distance between two strings, a and b, is as follows:

D(a, b) = m - t

where m is the total number of operations required to transform string a into string b, and t is the number of transpositions. The operations allowed are deletion, insertion, substitution, and transposition. The transposition operation involves swapping two adjacent characters.

The Damerau-Levenshtein distance is a useful metric in various fields, including computer science, linguistics, and natural language processing. It has applications in spell checking, OCR, and DNA sequencing.

def damerau_levenshtein_distance(s1, s2):
    d = {}
    lenstr1 = len(s1)
    lenstr2 = len(s2)
    for i in range(-1,lenstr1+1):
        d[(i,-1)] = i+1
    for j in range(-1,lenstr2+1):
        d[(-1,j)] = j+1
 
    for i in range(lenstr1):
        for j in range(lenstr2):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            d[(i,j)] = min(
                           d[(i-1,j)] + 1, # deletion
                           d[(i,j-1)] + 1, # insertion
                           d[(i-1,j-1)] + cost, # substitution
                          )
            if i and j and s1[i] == s2[j-1] and s1[i-1] == s2[j]:
                d[(i,j)] = min(d[(i,j)], d[i-2,j-2] + 1) # transposition
 
    return d[lenstr1-1,lenstr2-1]

Additional links:

fastDamerauLevenshtein module 

pyxDamerauLevenshtein module 

Расстояние Левенштейна в MySQL и алгоритмы нечёткого поиска средствами PHP

Методы нечеткого поиска в SQL: Расстояние Левенштейна 

Комментарии

Популярные сообщения из этого блога

Data Table Filters within Charts in Power BI

Beneficiaries and Disiribution database