If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
I once wrote an algorithm that did exactly what you describe, for a PDF editor product that is still the number one PDF editor in use today. I think there are a couple of reasons for what you mention, but the important one is focus. You are correct that PDF (usually) doesn’t contain any structure information. …