n-gram
Filename search with ElasticSearch
You have various problems with what you pasted: 1) Incorrect mapping. When creating the index, you specify: "mappings": { "files": { But your type is actually file, not files. If you checked the mapping, you would see that immediately: curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1' # { # "files" : { # "files" : { # "properties" : …
Understanding the `ngram_range` argument in a CountVectorizer in sklearn
Setting the vocabulary explicitly means no vocabulary is learned from data. If you don’t set it, you get: >>> v = CountVectorizer(ngram_range=(1, 2)) >>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_) {u'an': 0, u'an apple': 1, u'apple': 2, u'apple day': 3, u'away': 4, u'day': 5, u'day keeps': 6, u'doctor': 7, u'doctor away': 8, u'keeps': …
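To see where those entries come from, here is a minimal pure-Python sketch that imitates what `CountVectorizer(ngram_range=(1, 2))` does under its default settings as I understand them: lowercasing, plus a token pattern that keeps only tokens of two or more word characters (which would explain why `a` is absent from the vocabulary). The helper `learn_vocabulary` is hypothetical and is not scikit-learn's implementation.

```python
import re

# Assumed to mirror CountVectorizer's default token_pattern:
# only tokens of two or more word characters are kept.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def learn_vocabulary(docs, ngram_range=(1, 1)):
    """Hypothetical helper: map each n-gram to its index in sorted order."""
    min_n, max_n = ngram_range
    terms = set()
    for doc in docs:
        tokens = TOKEN_PATTERN.findall(doc.lower())
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the token list.
            for i in range(len(tokens) - n + 1):
                terms.add(" ".join(tokens[i:i + n]))
    return {term: index for index, term in enumerate(sorted(terms))}


vocab = learn_vocabulary(["an apple a day keeps the doctor away"], ngram_range=(1, 2))
print(vocab)
```

With `ngram_range=(1, 2)` the vocabulary holds both single words and adjacent word pairs; `(2, 2)` would keep only the pairs.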
Python: Reducing memory usage of dictionary
I cannot offer a complete strategy that would help improve memory footprint, but I believe it may help to analyse what exactly is taking so much memory. If you look at the Python implementation of dictionary (which is a relatively straightforward implementation of a hash table), as well as the implementation of the built-in string …
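One way to start that analysis is to measure the pieces separately. The sketch below uses `sys.getsizeof`, which reports the shallow size of a single object, so the key and value objects have to be counted on top of the hash table itself. The helper name `dict_footprint` and the sample data are made up for illustration, and the exact byte counts depend on the Python version and platform.

```python
import sys


def dict_footprint(d):
    """Hypothetical helper: return (table_bytes, keys_bytes, values_bytes).

    Each key and value object is counted once by identity; nested
    containers are not followed.
    """
    table = sys.getsizeof(d)  # the hash table only, not its contents
    seen = set()
    keys = values = 0
    for k, v in d.items():
        if id(k) not in seen:
            seen.add(id(k))
            keys += sys.getsizeof(k)
        if id(v) not in seen:
            seen.add(id(v))
            values += sys.getsizeof(v)
    return table, keys, values


sample = {"word%d" % i: i for i in range(1000)}
table, keys, values = dict_footprint(sample)
print("hash table:", table, "keys:", keys, "values:", values)
```

Comparing the three numbers shows whether the table or the stored strings dominate, which suggests where savings are worth looking for.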
Simple implementation of N-Gram, tf-idf and Cosine similarity in Python
Check out the NLTK package: http://www.nltk.org. It has everything you need. For cosine similarity: def cosine_distance(u, v): """ Returns the cosine of the angle between vectors v and u. This is equal to u.v / |u||v|. """ return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v))) For ngrams: def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None): """ …
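For readers who want to see the two pieces working together without extra dependencies, here is a minimal pure-Python sketch. It is not NLTK's code: `ngrams` and `cosine_similarity` below are illustrative stand-ins, the vectors are `Counter` objects of raw n-gram counts (a simplifying assumption), and tf-idf weighting is left out.

```python
import math
from collections import Counter


def ngrams(tokens, n, pad_left=False, pad_right=False, pad_symbol=None):
    """Return the list of n-grams (as tuples) over a token sequence."""
    tokens = list(tokens)
    if pad_left:
        tokens = [pad_symbol] * (n - 1) + tokens
    if pad_right:
        tokens = tokens + [pad_symbol] * (n - 1)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors: u.v / (|u||v|)."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


a = Counter(ngrams("the cat sat on the mat".split(), 2))
b = Counter(ngrams("the cat lay on the mat".split(), 2))
print(cosine_similarity(a, b))
```

The two sample sentences share three of their five bigrams, so the score comes out at about 0.6; identical inputs score 1 and inputs with no n-grams in common score 0.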
Elasticsearch: Find substring match
To search for partial field matches and exact matches, it will work better if you define the fields as “not analyzed” or as keywords (rather than text), then use a wildcard query. To use a wildcard query, add * to both ends of the string you are searching for: POST /my_index/my_type/_search { …
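As a rough illustration of such a request body, the sketch below builds a wildcard query as a Python dict and serializes it to JSON. The field name `filename`, the search term, and the helper `wildcard_query` are assumptions made for the example and are not taken from the original answer; sending the body to the `_search` endpoint is left to whatever HTTP client you use.

```python
import json


def wildcard_query(field, term):
    """Hypothetical helper: build a search body matching `term` anywhere in `field`."""
    return {"query": {"wildcard": {field: {"value": "*" + term + "*"}}}}


body = wildcard_query("filename", "report")
print(json.dumps(body, indent=2))
```

Patterns that start with `*` can be slow on large indices, so this approach is probably best kept for modest data sets.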
n-grams in python, four, five, six grams?
Other users have given great native Python answers, but here’s the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library). There is an ngram module in nltk that people seldom use. It’s not because it’s hard to read ngrams, but training a model based on …
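For comparison, the sliding-window idea behind such helpers can be sketched in plain Python with `zip`, and it works for any n, including four-, five-, and six-grams. The function name `find_ngrams` and the sample sentence are illustrative; NLTK ships its own `ngrams` utility, which this sketch does not use.

```python
def find_ngrams(tokens, n):
    """Return n-grams as tuples by zipping n progressively shifted views of the tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Made-up sample sentence of 12 tokens.
sentence = "this is a foo bar sentence and i want to ngramize it".split()
for n in (4, 5, 6):
    grams = find_ngrams(sentence, n)
    print(n, len(grams), grams[0])
```

Because `zip` stops at the shortest slice, a sequence shorter than n simply yields an empty list.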