edit-distance
Figure out if a business name is very similar to another one – Python
I’ve recently done a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you’d consider for matching generic strings. First, I’d recommend taking a look at a paper, …
String similarity metrics in Python [duplicate]
I realize it’s not the same thing, but this is close enough:

>>> import difflib
>>> a = "Hello, All you people"
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can turn this into a function:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', …
Levenshtein distance in T-SQL
I implemented the standard Levenshtein edit distance function in T-SQL with several optimizations that improve its speed over the other versions I’m aware of. In cases where the two strings have characters in common at their start (shared prefix), characters in common at their end (shared suffix), and when the strings are large and a …
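To illustrate the shared-prefix and shared-suffix idea mentioned in that excerpt, here is a minimal Python sketch. It is not the author's T-SQL function: the name levenshtein and the two-row layout are assumptions made for illustration, and it covers only the prefix and suffix trimming, not whatever further optimization the truncated excerpt goes on to describe.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance, trimming any shared prefix and suffix first (illustrative sketch)."""
    # Identical leading characters never contribute to the distance, so skip them.
    start = 0
    limit = min(len(s), len(t))
    while start < limit and s[start] == t[start]:
        start += 1
    s, t = s[start:], t[start:]

    # Do the same for identical trailing characters of what remains.
    end = 0
    limit = min(len(s), len(t))
    while end < limit and s[-1 - end] == t[-1 - end]:
        end += 1
    if end:
        s, t = s[:-end], t[:-end]

    # If one string is now empty, the distance is the length of the other.
    if not s or not t:
        return len(s) + len(t)

    # Standard dynamic programme, keeping only the previous and current rows.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

The trimming pays off most when the strings differ only in a short middle section, because the quadratic loop then runs over just that section. For two names that differ by a single trailing character, the loop is skipped entirely.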