String similarity score/hash

I believe what you’re looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output. As others have mentioned, there are inherent issues with forcing a … Read more

Algorithm to find articles with similar text

Edit distance isn’t a likely candidate, as it would be spelling/word-order dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of the documents you’d actually be interested in searching. Something like Lucene is the way to go. You index all your documents, and then when you … Read more

Comparing strings with tolerance

You could use the Levenshtein Distance algorithm. “The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.” – Wikipedia.com This one is from dotnetperls.com: using System; /// <summary> /// … Read more

How to calculate distance similarity measure of given 2 strings?

I just addressed this exact same issue a few weeks ago. Since someone is asking now, I’ll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x … Read more

Calculate cosine similarity given 2 sentence strings

A simple pure-Python implementation would be: import math import re from collections import Counter WORD = re.compile(r”\w+”) def get_cosine(vec1, vec2): intersection = set(vec1.keys()) & set(vec2.keys()) numerator = sum([vec1[x] * vec2[x] for x in intersection]) sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())]) sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())]) denominator = math.sqrt(sum1) … Read more

What’s the fastest way in Python to calculate cosine similarity given sparse matrix data?

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output: from sklearn.metrics.pairwise import cosine_similarity from scipy import sparse A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]]) A_sparse = sparse.csr_matrix(A) similarities = cosine_similarity(A_sparse) … Read more

A better similarity ranking algorithm for variable length strings

Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs that works really well for my purposes: http://www.catalysoft.com/articles/StrikeAMatch.html Simon has a Java version of the algorithm and below I wrote a PL/Ruby version of it (taken from the plain ruby version done in the related forum entry comment … Read more

Hata!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)