How does WordPiece tokenization help deal effectively with the rare-words problem in NLP?

WordPiece and BPE are two similar and commonly used techniques for segmenting words into subword units in NLP tasks. In both cases, the vocabulary is initialized with all the individual characters in the language, and the most frequent/likely combinations of symbols in the vocabulary are then iteratively added to the vocabulary. Consider the WordPiece algorithm …
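As a rough illustration of the merge loop described above, here is a minimal BPE-style sketch: it counts adjacent symbol pairs over a toy corpus and fuses the most frequent pair into a new symbol on each iteration. The corpus and merge count are made up for the example; real WordPiece differs in that it scores candidate merges by likelihood gain on a language model rather than by raw frequency.

```python
import re
from collections import Counter

def merge_step(corpus):
    """One BPE-style merge: count adjacent symbol pairs across the
    corpus and fuse the most frequent pair into a single symbol."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return corpus, None
    best = max(pairs, key=pairs.get)
    # Merge only whole symbols: the pair must be bounded by
    # whitespace or the string edges, not sit inside a larger symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
    new_corpus = {pattern.sub("".join(best), w): f for w, f in corpus.items()}
    return new_corpus, best

# Toy corpus: words pre-split into characters, with word frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(4):
    corpus, merged_pair = merge_step(corpus)
    print(merged_pair, corpus)
```

Because rare words end up segmented into these learned subword units, the model never sees an out-of-vocabulary token: an unseen word is simply decomposed into pieces that were frequent enough to be learned.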

CBOW vs. skip-gram: why invert context and target words?

Here is my oversimplified and rather naive understanding of the difference: as we know, CBOW learns to predict a word from its context, i.e., to maximize the probability of the target word given the context. This happens to be a problem for rare words. For example, given the context yesterday was a …
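To make the inversion concrete, here is a small sketch (names and the window size are my own, not from the answer) of how the two objectives generate training pairs from the same token stream. Note that in CBOW each position yields one example whose target is the center word, while in skip-gram the center word becomes the input and each surrounding word becomes its own target, so a rare word gets several training examples of its own instead of being averaged into a context bag.

```python
def training_pairs(tokens, window=2, mode="cbow"):
    """Generate (input, target) training pairs from a token stream.

    CBOW:      (context words) -> center word   (one example per position)
    Skip-gram: (center word)   -> context word  (one example per context word)
    """
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            yield context, center      # the whole context predicts the center
        else:
            for c in context:
                yield center, c        # the center predicts each context word

sentence = "yesterday was a delightful day".split()
print(list(training_pairs(sentence, mode="cbow")))
print(list(training_pairs(sentence, mode="skipgram")))
```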

Text Summarization Evaluation – BLEU vs ROUGE

In general: BLEU measures precision: how many of the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries. ROUGE measures recall: how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries. Naturally, these results are complementary, as is often the case with precision …
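The precision/recall asymmetry is easy to see in code. The sketch below computes the clipped unigram overlap between a candidate and a reference, then divides it by the candidate length (precision, the core of BLEU-1) and by the reference length (recall, ROUGE-1). It deliberately omits BLEU's brevity penalty and geometric mean over n-gram orders, and ROUGE's other variants such as ROUGE-L; the example strings are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precision_recall(candidate, reference, n=1):
    """Clipped n-gram precision (BLEU-style) and recall (ROUGE-style)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clip each candidate n-gram count at its count in the reference,
    # so repeating a reference word cannot inflate the overlap.
    shared = sum(min(count, ref[gram]) for gram, count in cand.items())
    precision = shared / max(sum(cand.values()), 1)  # divide by candidate size
    recall = shared / max(sum(ref.values()), 1)      # divide by reference size
    return precision, recall

reference = "the cat sat on the mat".split()
candidate = "the cat the cat on the mat".split()
print(precision_recall(candidate, reference))  # (5/7, 5/6): decent recall, lower precision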
