edit-distance – Tarik Billa

Levenshtein distance: how to better handle words swapping positions?

September 12, 2023 by Tarik

Figure out if a business name is very similar to another one – Python

May 15, 2023 by Tarik

I’ve recently done a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you’d consider for matching generic strings. First, I’d recommend taking a look at a paper, … Read more

String similarity metrics in Python [duplicate]

April 10, 2023 by Tarik

I realize it’s not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can wrap this in a function:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', … Read more

Levenshtein distance in T-SQL

January 5, 2023 by Tarik

I implemented the standard Levenshtein edit distance function in T-SQL with several optimizations that improve its speed over the other versions I’m aware of. In cases where the two strings have characters in common at their start (shared prefix), characters in common at their end (shared suffix), and when the strings are large and a … Read more
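
The shared-prefix and shared-suffix idea mentioned in the excerpt above can be sketched in Python (the language used elsewhere on this page, not T-SQL). This is not the author's implementation: the function name `levenshtein` and the two-row dynamic-programming layout are assumptions chosen for illustration, and the optimization for large strings that the truncated excerpt begins to describe is not attempted here.

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance, trimming any shared prefix and suffix first."""
    # Identical leading characters never require an edit, so skip them.
    start = 0
    limit = min(len(s), len(t))
    while start < limit and s[start] == t[start]:
        start += 1
    s, t = s[start:], t[start:]

    # Same reasoning for identical trailing characters.
    end = 0
    limit = min(len(s), len(t))
    while end < limit and s[-1 - end] == t[-1 - end]:
        end += 1
    if end:
        s, t = s[:-end], t[:-end]

    # If one side is now empty, the distance is the length of the other.
    if not s:
        return len(t)
    if not t:
        return len(s)

    # Keep the shorter string as the row to limit memory use.
    if len(s) < len(t):
        s, t = t, s

    # Standard dynamic programming, keeping only two rows at a time.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("prefix_abc_suffix", "prefix_xyz_suffix"))
```

In the second call only the three middle characters reach the dynamic-programming loop, because the common `prefix_` and `_suffix` parts are trimmed beforehand. That is where the saving comes from when strings share long runs at either end.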