CBOW v.s. skip-gram: why invert context and target words?
Here is my oversimplified and rather naive understanding of the difference: As we know, CBOW is learning to predict the word by the context. Or maximize the probability of the target word by looking at the context. And this happens to be a problem for rare words. For example, given the context yesterday was a … Read more