You can use “treebank detokenizer” – TreebankWordDetokenizer
:
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'The quick brown'
There is also MosesDetokenizer
which was in nltk
but got removed because of the licensing issues, but it is available as a Sacremoses
standalone package.