similarity – Page 2 – Tarik Billa

String similarity score/hash

April 5, 2023 by Tarik

I believe what you’re looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output. As others have mentioned, there are inherent issues with forcing a … Read more
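
To make the idea concrete, here is a minimal sketch of one well-known locality-sensitive scheme, SimHash, over word tokens. The excerpt does not name a specific algorithm, so the choice of SimHash, the MD5-based token hash, the 64-bit width, and the function names `simhash` and `hamming_distance` are all assumptions made for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed for this sketch)


def simhash(text):
    """Return a 64-bit SimHash fingerprint of the words in `text`."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to 64 bits; MD5 is used only as a stable hash here.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Texts that share most of their words tend to produce fingerprints that differ in only a few bits, so the Hamming distance between two fingerprints can serve as a rough dissimilarity score.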

Algorithm to find articles with similar text

March 21, 2023 by Tarik

Edit distance isn’t a likely candidate, as it would be spelling/word-order dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of the documents you’d actually be interested in searching. Something like Lucene is the way to go. You index all your documents, and then when you … Read more

Comparing strings with tolerance

February 24, 2023 by Tarik

You could use the Levenshtein Distance algorithm.

“The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.” – Wikipedia.com

This one is from dotnetperls.com:

    using System;

    /// <summary>
    /// … Read more
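
The definition quoted above can be sketched in a few lines of Python. This is not the dotnetperls.com code, which is cut off in the excerpt; it is a standard two-row dynamic-programming version, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

For a tolerance check, one option is to accept two strings when `levenshtein(a, b)` is at most some threshold that suits your data.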

How to calculate distance similarity measure of given 2 strings?

February 13, 2023 by Tarik

I just addressed this exact same issue a few weeks ago. Since someone is asking now, I’ll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x … Read more
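
The author's code is behind the cut, so what follows is not it. It is only a minimal sketch of why supplying a maximum distance can help: once every entry in a row of the edit-distance table exceeds the limit, the final distance must exceed it too, so the computation can stop. The function name `levenshtein_within` and the convention of returning `None` are assumptions of this sketch, and it makes no claim about the speedups quoted above.

```python
def levenshtein_within(a, b, max_distance):
    """Return the edit distance if it is <= max_distance, else None."""
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_distance:
        return None
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        # Row minimums never decrease, so the result cannot recover from here.
        if min(current) > max_distance:
            return None
        previous = current
    distance = previous[-1]
    return distance if distance <= max_distance else None
```

Faster implementations usually also restrict each row to a diagonal band of width `2 * max_distance + 1`; that refinement is left out here to keep the sketch short.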

Calculate cosine similarity given 2 sentence strings

January 30, 2023 by Tarik

A simple pure-Python implementation would be:

    import math
    import re
    from collections import Counter

    WORD = re.compile(r"\w+")

    def get_cosine(vec1, vec2):
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum([vec1[x] * vec2[x] for x in intersection])
        sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
        sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
        denominator = math.sqrt(sum1)

… Read more

What’s the fastest way in Python to calculate cosine similarity given sparse matrix data?

January 29, 2023 by Tarik

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

    import numpy as np
    from scipy import sparse
    from sklearn.metrics.pairwise import cosine_similarity

    A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
    A_sparse = sparse.csr_matrix(A)

    similarities = cosine_similarity(A_sparse)

… Read more

A better similarity ranking algorithm for variable length strings

November 14, 2022 by Tarik

Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs, and it works really well for my purposes: http://www.catalysoft.com/articles/StrikeAMatch.html

Simon has a Java version of the algorithm, and below I wrote a PL/Ruby version of it (taken from the plain ruby version done in the related forum entry comment … Read more
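
For readers who want to try the adjacent-character-pair idea without the Java or PL/Ruby versions, here is a minimal Python sketch. It is neither of those versions. It assumes the similarity is twice the number of shared letter pairs divided by the total number of pairs in both strings, with pairs taken per word and compared case-insensitively. The names `letter_pairs` and `pair_similarity` are made up for this example.

```python
from collections import Counter


def letter_pairs(text):
    """Count the adjacent letter pairs of every whitespace-separated word."""
    pairs = Counter()
    for word in text.upper().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def pair_similarity(a, b):
    """Return a score in [0, 1]; 1.0 means the pair multisets are identical."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0  # neither string contains a two-letter word
    # Counter intersection keeps the minimum count of each shared pair.
    shared = sum((pairs_a & pairs_b).values())
    return 2.0 * shared / total


print(pair_similarity("France", "French"))  # 0.4
print(pair_similarity("Healed", "Sealed"))  # 0.8
```

Because pairs are collected per word, swapping the order of words does not change the score, which is part of what makes this approach forgiving on variable-length strings.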

Checking images for similarity with OpenCV

October 22, 2022 by Tarik

This is a huge topic, with answers ranging from 3 lines of code to entire research magazines. I will outline the most common such techniques and their results.

Comparing histograms

One of the simplest and fastest methods. Proposed decades ago as a means to find picture similarities. The idea is that a forest will have a … Read more
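
As a rough illustration of histogram comparison, here is a dependency-free sketch that treats an image as a flat list of 8-bit grayscale values. The original answer uses OpenCV, which provides `cv2.calcHist` and `cv2.compareHist` for this; the pure-Python helpers `histogram` and `histogram_intersection` below are hypothetical stand-ins, and histogram intersection is just one of several possible comparison metrics.

```python
def histogram(pixels, bins=8, max_value=256):
    """Return a normalized histogram (fractions summing to 1) of pixel values."""
    counts = [0] * bins
    for p in pixels:
        # Map a value in [0, max_value) to a bin index in [0, bins).
        counts[p * bins // max_value] += 1
    total = len(pixels)
    if total == 0:
        return [0.0] * bins
    return [c / total for c in counts]


def histogram_intersection(h1, h2):
    """Return the overlap of two normalized histograms, between 0 and 1."""
    return sum(min(x, y) for x, y in zip(h1, h2))


dark = histogram([10, 20, 30, 40])
bright = histogram([200, 210, 220, 230])
print(histogram_intersection(dark, dark))    # 1.0
print(histogram_intersection(dark, bright))  # 0.0
```

A histogram ignores where pixels are located, so two very different pictures with similar overall tones can still score as similar.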

Find the similarity metric between two strings

September 19, 2022 by Tarik

There is a built-in:

    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

Using it:

    >>> similar("Apple", "Appel")
    0.8
    >>> similar("Apple", "Mango")
    0.0