Old question, but an answer would be useful for future visitors. So here are some of my thoughts.
There are some problems in the tensorflow implementation:
- `window` is the one-side size, so `window=5` means a `5*2+1 = 11`-word span.
- Note that with the PV-DM version of doc2vec, `batch_size` is the number of documents. So the `train_word_dataset` shape would be `batch_size * context_window`, while the `train_doc_dataset` and `train_labels` shapes would be `batch_size`.
- More importantly, `sampled_softmax_loss` is not `negative_sampling_loss`. They are two different approximations of `softmax_loss`.
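To make the shape claims above concrete, here is a minimal sketch in plain Python. The sizes are made up for illustration, and the variable names simply mirror the ones discussed, not any particular implementation:

```python
# Toy illustration of the PV-DM shapes discussed above; all sizes are made up.
window = 5                    # one-side size
span = window * 2 + 1         # 11 words: 10 context words + 1 center word (the label)
context_window = window * 2   # number of context words fed to the model

batch_size = 4                # hypothetical: one document per batch row

# train_word_dataset: batch_size x context_window word ids
train_word_dataset = [[0] * context_window for _ in range(batch_size)]
# train_doc_dataset / train_labels: one doc id / one target word id per row
train_doc_dataset = [0] * batch_size
train_labels = [0] * batch_size
```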
So for the OP’s listed questions:
- This implementation of `doc2vec` in `tensorflow` works and is correct in its own way, but it differs from both the `gensim` implementation and the paper. As noted above, `window` is the one-side size. If the document is shorter than the context size, the smaller of the two is used.
- There are several reasons why the `gensim` implementation is faster. First, `gensim` is heavily optimized; all operations are faster than naive Python operations, especially data I/O. Second, some preprocessing steps, such as `min_count` filtering in `gensim`, reduce the dataset size. Most importantly, `gensim` uses `negative_sampling_loss`, which is much faster than `sampled_softmax_loss`; I would guess this is the main reason.
- Is it easier to find something when there are many of them? Just kidding 😉
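For intuition about why negative sampling is cheaper, here is a hedged pure-Python sketch with toy vectors (this is not gensim's or tensorflow's actual code): negative sampling only scores the true pair plus a handful of sampled negatives, while a full softmax must normalize over the entire vocabulary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center, true_ctx, negatives):
    # Push up the score of the true (center, context) pair and push down
    # k sampled negatives: O(k) dot products per training pair.
    loss = -math.log(sigmoid(dot(center, true_ctx)))
    for neg in negatives:
        loss += -math.log(sigmoid(-dot(center, neg)))
    return loss

def full_softmax_loss(center, true_ctx, vocab_vectors):
    # Reference point: a full softmax touches every word in the vocabulary,
    # so the cost grows with vocab size rather than with k.
    logits = [dot(center, w) for w in vocab_vectors]
    log_z = math.log(sum(math.exp(l) for l in logits))
    return -(dot(center, true_ctx) - log_z)
```

(`sampled_softmax_loss` is a third variant: like negative sampling it only evaluates a subset of words, but it keeps a softmax-style correction for the sampling distribution, which is why the two losses are not interchangeable.)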
It’s true that this non-convex optimization problem has many solutions, so the model will simply settle into a local optimum. Interestingly, in neural networks most local optima are “good enough”. It has been observed that stochastic gradient descent tends to find better local optima than large-batch gradient descent, although why this happens is still an open question in current research.