Extracting text from MS Word files in Python

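Use the python-docx package (installed with pip install python-docx, imported as docx). It is a third-party library, not part of the standard library. Here’s how to extract all the paragraph text from a .docx file: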

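import docx  # provided by the python-docx package

filename = "example.docx"  # placeholder path; point this at your own file
document = docx.Document(filename)
# Join the text of every body paragraph, separated by blank lines
doc_text = "\n\n".join(
    paragraph.text for paragraph in document.paragraphs
)
print(doc_text)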

See the python-docx documentation site for more.

Also check out textract, which pulls out tables and other content as well.

Parsing XML with regexes invokes Cthulhu. Don’t do it!
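
If you do need to read the XML inside a .docx yourself, a real XML parser is the safer route. Here is a minimal sketch using only the standard library: a .docx is a zip archive, and the main text lives in word/document.xml. The names docx_paragraphs and make_sample_docx are made up for this illustration. The sample builder writes just enough XML to demonstrate the parsing; it does not produce a file Word would open, and it assumes the sample text has no XML special characters.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each <w:p> paragraph found in a .docx archive.

    `source` is a file path or a binary file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> elements.
        paragraphs.append(
            "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        )
    return paragraphs


def make_sample_docx(texts):
    """Build an in-memory zip holding only word/document.xml (demo only)."""
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    sample = make_sample_docx(["First paragraph", "Second paragraph"])
    print("\n\n".join(docx_paragraphs(sample)))
```

Because root.iter walks the whole tree, this also picks up paragraphs inside table cells. It ignores headers, footers and footnotes, which live in separate parts of the archive, so for anything beyond a quick look python-docx or textract is still the better choice.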
