n-gram – Tarik Billa

counting n-gram frequency in python nltk

January 1, 2024 by Tarik

N-gram generation from a sentence

December 20, 2023 by Tarik

Filename search with ElasticSearch

December 18, 2023 by Tarik

You have various problems with what you pasted:

1) Incorrect mapping

When creating the index, you specify:

    "mappings": {
        "files": {

But your type is actually file, not files. If you checked the mapping, you would see that immediately:

    curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'
    # {
    #    "files" : {
    #       "files" : {
    #          "properties" : …

Computing N Grams using Python

November 28, 2023 by Tarik

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

June 15, 2023 by Tarik

Setting the vocabulary explicitly means no vocabulary is learned from data. If you don't set it, you get:

    >>> v = CountVectorizer(ngram_range=(1, 2))
    >>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
    {u'an': 0, u'an apple': 1, u'apple': 2, u'apple day': 3, u'away': 4, u'day': 5, u'day keeps': 6, u'doctor': 7, u'doctor away': 8, u'keeps': …
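To see why the learned vocabulary mixes single words and word pairs, here is a minimal standard-library sketch that approximates what `ngram_range=(1, 2)` does. It is not scikit-learn's implementation: the helper names `word_ngrams` and `build_vocabulary` are made up for illustration, and the token pattern is an assumption meant to mirror CountVectorizer's documented default (lowercased tokens of two or more word characters, which is why "a" is dropped).

```python
import re

# Assumption: approximates CountVectorizer's default tokenization,
# which keeps only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for every n in the inclusive range."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(documents, ngram_range=(1, 2)):
    """Map each distinct n-gram to an index, assigned in sorted order."""
    terms = set()
    for doc in documents:
        terms.update(word_ngrams(doc, ngram_range))
    return {term: index for index, term in enumerate(sorted(terms))}


vocab = build_vocabulary(["an apple a day keeps the doctor away"])
for term, index in sorted(vocab.items(), key=lambda item: item[1]):
    print(index, term)
```

With `ngram_range=(1, 1)` the same helper yields only single words, and with `(2, 2)` only word pairs.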

Python: Reducing memory usage of dictionary

April 19, 2023 by Tarik

I cannot offer a complete strategy for reducing the memory footprint, but I believe it may help to analyse what exactly is taking so much memory. If you look at the Python implementation of dictionary (which is a relatively straightforward implementation of a hash table), as well as the implementation of the built-in string …
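One rough way to start that analysis is to measure the dictionary's own hash table separately from its keys and values. The sketch below uses only `sys.getsizeof`; the helper name `shallow_footprint` and the sample data are invented for illustration. The numbers are shallow and approximate: nested objects are not followed, and objects shared between entries are counted once per occurrence.

```python
import sys


def shallow_footprint(mapping):
    """Return (container_bytes, key_bytes, value_bytes) for a dict.

    This is a shallow estimate: it does not recurse into nested
    containers and does not deduplicate shared objects.
    """
    container = sys.getsizeof(mapping)  # the hash table itself
    keys = sum(sys.getsizeof(k) for k in mapping)
    values = sum(sys.getsizeof(v) for v in mapping.values())
    return container, keys, values


# Hypothetical sample: 1000 bigram-like string keys with integer counts.
counts = {f"word{i} word{i + 1}": i for i in range(1000)}
container, keys, values = shallow_footprint(counts)
print(f"dict table: {container} bytes")
print(f"keys:       {keys} bytes")
print(f"values:     {values} bytes")
```

The exact figures depend on the Python version and platform, so compare the three parts against each other instead of relying on absolute values.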

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

April 16, 2023 by Tarik

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For the cosine similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

For ngrams:

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """ …
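For readers without NumPy or NLTK installed, the same two ideas can be sketched with the standard library alone. This is an illustrative simplification, not NLTK's code: the function names are chosen here, the padding options from the signature above are left out, and the cosine function assumes two non-zero vectors of equal length. Despite the name `cosine_distance` in the excerpt, the quantity computed is the cosine similarity (1.0 for parallel vectors, 0.0 for orthogonal ones).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: u.v / (|u| |v|).

    Assumes both vectors have the same length and are non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simple_ngrams(sequence, n):
    """Return the n-grams of a sequence as a list of tuples (no padding)."""
    items = list(sequence)
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


bigrams = simple_ngrams("the quick brown fox".split(), 2)
print(bigrams)
print(cosine_similarity([1, 0], [0, 1]))
```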

Elasticsearch: Find substring match

March 5, 2023 by Tarik

To search for partial field matches and exact matches, it will work better if you define the fields as "not analyzed" or as keywords (rather than text), then use a wildcard query. To use a wildcard query, append * on both ends of the string you are searching for:

    POST /my_index/my_type/_search
    { …
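The request body is cut off in the excerpt, so the following is only a sketch of what a wildcard query body can look like, built as a Python dictionary. The field name `filename` and the helper `wildcard_query` are hypothetical and not taken from the original post. The helper does not escape `*` or `?` inside the search string, so it assumes plain text input.

```python
import json


def wildcard_query(field, substring):
    """Build a search body that matches `substring` anywhere in `field`.

    `field` should be a keyword (not analyzed) field. Wildcard characters
    inside `substring` are not escaped in this sketch.
    """
    return {
        "query": {
            "wildcard": {
                # "*" on both ends turns the term into a substring match.
                field: {"value": f"*{substring}*"}
            }
        }
    }


# Hypothetical field name, used for illustration only.
body = wildcard_query("filename", "report")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the `_search` request shown above. Leading wildcards can be slow on large indices, which is one reason n-gram analyzers are often used for substring search instead.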

n-grams in python, four, five, six grams?

October 25, 2022 by Tarik

Other users have given great native Python answers. But here's the NLTK approach (just in case the OP gets penalized for reinventing what already exists in the NLTK library). There is an ngram module in NLTK that people seldom use. It's not because it's hard to read ngrams, but training a model based on …