Tesseract training for a new font

You can use this tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font in Python, pass lang="Font" as the second parameter of the image_to_string function. It improves accuracy significantly but still makes mistakes, of course. …
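As a rough illustration of the steps above, the sketch below checks that a traineddata file is present in a tessdata folder and then passes its name as lang. The name "Font" and both helper names are placeholders rather than part of the original answer, and pytesseract and Pillow are assumed to be installed for the OCR call.

```python
from pathlib import Path


def has_traineddata(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in the given tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def ocr_with_font(image_path, lang="Font"):
    """OCR an image with a custom language model ("Font" is a placeholder)."""
    # Third-party imports are local so has_traineddata() works without them.
    import pytesseract
    from PIL import Image

    # lang is the second parameter of image_to_string.
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

Calling has_traineddata first gives a clearer failure than letting Tesseract complain about a missing language file.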

Tesseract OCR PDF as input

Just for documentation reasons, here is an example of OCR using tesseract and pdf2image to extract text from an image PDF.

    import pdf2image
    try:
        from PIL import Image
    except ImportError:
        import Image
    import pytesseract

    def pdf_to_img(pdf_file):
        return pdf2image.convert_from_path(pdf_file)

    def ocr_core(file):
        text = pytesseract.image_to_string(file)
        return text

    def print_pages(pdf_file):
        images = pdf_to_img(pdf_file)
        for pg, img in enumerate(images):
            …
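The excerpt is cut off inside print_pages. One possible shape for the page loop is sketched below; it is not the original author's remaining code. The names ocr_pdf and print_page_texts and the convert and ocr parameters are hypothetical, added so the loop can be tried without Tesseract or Poppler installed.

```python
def ocr_pdf(pdf_file, convert=None, ocr=None):
    """Return one OCR text string per page of an image-only PDF."""
    if convert is None:
        # Default: render each PDF page to a PIL image with pdf2image.
        import pdf2image
        convert = pdf2image.convert_from_path
    if ocr is None:
        # Default: recognise text with Tesseract through pytesseract.
        import pytesseract
        ocr = pytesseract.image_to_string
    return [ocr(img) for img in convert(pdf_file)]


def print_page_texts(pdf_file, **kwargs):
    """Print the recognised text of every page with a simple page header."""
    for pg, text in enumerate(ocr_pdf(pdf_file, **kwargs), start=1):
        print(f"--- page {pg} ---")
        print(text)
```

With the defaults, ocr_pdf("scan.pdf") would rely on pdf2image and pytesseract being installed; "scan.pdf" is only an example file name.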

How to know if a PDF contains only images or has been OCR scanned for searching?

Scanned images converted to PDF that were OCR’ed afterwards to make the text searchable normally contain the text parts rendered as “invisible”. So what you see on screen (or on paper when printed) is still the original image. But when you search successfully, you get the hits highlighted that are on the …
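A rough heuristic follows from this: invisibly rendered OCR text is still extractable text, so a PDF whose pages all yield no text probably contains only images. The sketch below assumes the pypdf package; the function names are illustrative, and the check is only a heuristic, not a definitive test.

```python
def has_text_layer(page_texts):
    """Return True if any page yielded non-whitespace text."""
    return any(text and text.strip() for text in page_texts)


def pdf_has_text_layer(pdf_path):
    """Check a PDF file for extractable text (assumes pypdf is installed)."""
    # Imported here so has_text_layer() can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return has_text_layer(page.extract_text() for page in reader.pages)
```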

Which OCR Engine is better: Tesseract or OCRopus? [closed]

Initially OCRopus actually used Tesseract as its recognition engine, but later they switched to their own brand-new engine. It is still fresh and not mature. We ran an accuracy comparison about a year ago, and OCRopus was definitely losing to Tesseract, not to mention commercial engines. Since then I stopped …

How to remove all lines and borders in an image while keeping text programmatically?

Since no one has posted a complete OpenCV solution, here’s a simple approach:

Obtain binary image. Load the image, convert to grayscale, and apply Otsu’s threshold.
Remove horizontal lines. We create a horizontal shaped kernel with cv2.getStructuringElement(), then find contours and remove the lines with cv2.drawContours().
Remove vertical lines. We do the same operation but with …
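The steps listed above can be sketched roughly as follows. The excerpt is truncated, so this is not the original answer's code: the use of a morphological opening to isolate the lines, the min_length value, and the drawing thickness are all assumptions, and remove_lines is an illustrative name.

```python
def remove_lines(image, min_length=40, thickness=5):
    """Paint long horizontal and vertical lines white in a BGR image."""
    # Imported here so the module loads even where OpenCV is missing.
    import cv2

    # Binary image: grayscale, then inverted Otsu threshold (ink becomes 255).
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )[1]

    result = image.copy()
    # A wide, flat kernel finds horizontal lines; a tall, thin one vertical.
    for size in ((min_length, 1), (1, min_length)):
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, size)
        # Assumed step: opening keeps only strokes at least as long as the kernel.
        detected = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)
        found = cv2.findContours(
            detected, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
        )
        # findContours returns 2 values in OpenCV 4 and 3 values in OpenCV 3.
        contours = found[0] if len(found) == 2 else found[1]
        # Draw over each detected line in white.
        cv2.drawContours(result, contours, -1, (255, 255, 255), thickness)
    return result
```

Text strokes are much shorter than min_length, so they do not survive the opening and are left untouched; the value would need tuning for the resolution of the input.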

Is there an OCR library that outputs coordinates of words found within an image? [closed]

Most commercial OCR engines will return word and character coordinate positions, but you have to work with their SDKs to extract the information. Even Tesseract OCR will return position information, but it has not been easy to get to. Version 3.01 will make it easier, but a DLL interface is still being worked on. Unfortunately, most …
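For Tesseract specifically, one route not mentioned in the excerpt is the pytesseract wrapper, whose image_to_data function returns per-word positions. The sketch below assumes pytesseract is installed; word_boxes, ocr_word_boxes, and the min_conf parameter are illustrative names, not part of any library.

```python
def word_boxes(data, min_conf=0):
    """Turn image_to_data dict output into (word, left, top, width, height)."""
    boxes = []
    for i, word in enumerate(data["text"]):
        # Rows for blocks, paragraphs, and lines carry empty text; skip them.
        if not word.strip():
            continue
        # Confidence may arrive as int, float, or str depending on version.
        if float(data["conf"][i]) < min_conf:
            continue
        boxes.append(
            (word, data["left"][i], data["top"][i],
             data["width"][i], data["height"][i])
        )
    return boxes


def ocr_word_boxes(image):
    """Run Tesseract on a PIL image and return word bounding boxes."""
    # Imported here so word_boxes() can be used without pytesseract.
    import pytesseract

    data = pytesseract.image_to_data(
        image, output_type=pytesseract.Output.DICT
    )
    return word_boxes(data)
```

Coordinates are in pixels, measured from the top-left corner of the image.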
