Algorithm to find out whether the matches for two Glob patterns (or Regular Expressions) intersect

Now witness the firepower of this fully ARMED and OPERATIONAL battle station! (I have worked too much on this answer and my brain has broken; there should be a badge for that.) To determine whether two patterns intersect, I have created a recursive backtracking parser; when Kleene stars are encountered a new … Read more

Efficient string matching in Apache Spark

I wouldn’t use Spark in the first place, but if you are really committed to the particular stack, you can combine a bunch of ML transformers to get best matches. You’ll need Tokenizer (or split): import org.apache.spark.ml.feature.RegexTokenizer val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens") NGram (for example 3-gram) import org.apache.spark.ml.feature.NGram val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams") Vectorizer (for … Read more

Using Java Regex, how to check if a string contains any of the words in a set ?

TL;DR: For simple substrings, contains() is best, but for matching whole words only, regular expressions are probably better. The best way to see which method is more efficient is to test it. You can use String.contains() instead of String.indexOf() to simplify your non-regexp code. To search for several different words, the regular expression looks like this: … Read more
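The excerpt above is cut off before its pattern, so what follows is only a rough sketch of the general idea, written in Python rather than Java. It assumes the usual approach of joining the escaped words into one alternation wrapped in `\b` word boundaries; the function names `contains_any_word` and `contains_any_substring` are hypothetical and do not come from the original answer.

```python
import re


def contains_any_word(text, words):
    """Return True if `text` contains any of `words` as a whole word."""
    if not words:
        # An empty alternation would match at any word boundary, so bail out.
        return False
    # Escape each word so regex metacharacters are treated literally, then
    # join them into a single alternation bounded by \b on both sides.
    alternation = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:" + alternation + r")\b")
    return pattern.search(text) is not None


def contains_any_substring(text, words):
    """Plain substring check, the analogue of Java's String.contains()."""
    return any(w in text for w in words)


if __name__ == "__main__":
    words = {"cat", "dog"}
    print(contains_any_word("the cat sat", words))
    print(contains_any_word("concatenate", words))
    print(contains_any_substring("concatenate", words))
```

The two functions differ only for words embedded in longer words: the substring check accepts "concatenate" for "cat", while the word-boundary version rejects it.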

Finding how similar two strings are

Ok, so the standard algorithms are: 1) Hamming distance. Only good for strings of the same length, but very efficient. Basically it simply counts the number of positions at which the characters differ. Not useful for fuzzy searching of natural language text. 2) Levenshtein distance. The Levenshtein distance measures distance in terms of the number of “operations” required to … Read more
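As a rough illustration of the first two measures, here is a minimal Python sketch. It assumes the textbook definitions: Hamming distance counts differing positions in equal-length strings, and Levenshtein distance counts single-character insertions, deletions and substitutions. The function names are illustrative and not taken from the original answer.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`; only two rows are kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(levenshtein("kitten", "sitting"))
```

Hamming runs in time linear in the string length, while this Levenshtein version takes time proportional to the product of the two lengths, which is the cost of handling strings of different lengths.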

Are Regular Expressions a must for programming? [closed]

One could easily go without them, but one should (IMHO) know the basics, for two reasons. 1) There may come a time when a regex is the best solution to the problem at hand (see image below). 2) When you see a regex in someone else’s code, it shouldn’t be 100% mystical. preg_match('/summarycount">.*?([,\d]+)<\/div>.*?Reputation/s', $page, $rep); This … Read more