tesseract – Tarik Billa

Tesseract training for a new font

April 11, 2024 by Tarik

You can use this tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use tesseract with the new font in Python, pass `lang="Font"` as the second parameter of the `image_to_string` function. It improves accuracy significantly but still makes mistakes, of course. … Read more
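As a rough sketch of the two steps described here (putting the traineddata file into tessdata, then passing the language name to `image_to_string`), something like the following might work. The helper names `install_traineddata` and `ocr_with_font` are made up for illustration, the file is copied rather than moved, and the second function assumes `pytesseract` and Pillow are installed.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_file, tessdata_dir):
    """Copy a .traineddata file into the tessdata folder and return its language name."""
    src = Path(traineddata_file)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    dest_dir = Path(tessdata_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Copying (not moving) keeps the original file around as a backup.
    shutil.copy(src, dest_dir / src.name)
    # Tesseract refers to a model by its file name without the extension.
    return src.stem


def ocr_with_font(image_path, lang="Font"):
    """Run OCR with the custom font model (needs pytesseract and Pillow)."""
    # Imported lazily so the helper above works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

The value returned by `install_traineddata` is the name to pass as `lang`, so a file called `Font.traineddata` is used as `lang="Font"`.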

Can `tesseract-ocr` put the result to STDOUT?

April 9, 2024 by Tarik

The solution is: `tesseract input.jpg stdout`. But you need at least version 3.03.
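If you want to capture that output from Python rather than a shell, a minimal sketch could look like this. It assumes the `tesseract` binary is on your PATH; the function names and the `run` parameter (there only so the call can be swapped out in tests) are hypothetical.

```python
import subprocess


def stdout_command(image_path):
    """Build the command that makes tesseract print text instead of writing a file."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_to_string(image_path, run=subprocess.run):
    """Run tesseract and return the recognized text captured from standard output."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    result = run(stdout_command(image_path), capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `ocr_to_string("input.jpg")` should then return the same text that the command above prints to the terminal.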

Tesseract ocr PDF as input

April 7, 2024 by Tarik

Just for documentation reasons, here is an example of OCR using tesseract and pdf2image to extract text from an image PDF.

    import pdf2image
    try:
        from PIL import Image
    except ImportError:
        import Image
    import pytesseract

    def pdf_to_img(pdf_file):
        return pdf2image.convert_from_path(pdf_file)

    def ocr_core(file):
        text = pytesseract.image_to_string(file)
        return text

    def print_pages(pdf_file):
        images = pdf_to_img(pdf_file)
        for pg, img in enumerate(images):

… Read more
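The excerpt is cut off, so the following is not the author's code, just one possible self-contained sketch of the same idea: convert each PDF page to an image, then run OCR on it. The names `ocr_pdf` and `format_pages`, the page separator format, and the `to_images`/`to_text` parameters are all assumptions made for illustration; by default they fall back to `pdf2image.convert_from_path` and `pytesseract.image_to_string`, which must be installed.

```python
def ocr_pdf(pdf_file, to_images=None, to_text=None):
    """Return a list with the recognized text of each page of an image-only PDF."""
    if to_images is None:
        # Lazy import: pdf2image also needs poppler installed on the system.
        import pdf2image
        to_images = pdf2image.convert_from_path
    if to_text is None:
        import pytesseract
        to_text = pytesseract.image_to_string
    return [to_text(img) for img in to_images(pdf_file)]


def format_pages(page_texts):
    """Join per-page text into one string with a simple page header per page."""
    lines = []
    for pg, text in enumerate(page_texts, start=1):
        lines.append(f"--- page {pg} ---")
        lines.append(text.strip())
    return "\n".join(lines)
```

With the dependencies in place, `print(format_pages(ocr_pdf("scan.pdf")))` would print every page in order, where `scan.pdf` is a placeholder file name.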

How can I run tesseract with multiple languages one time?

January 4, 2024 by Tarik

Since tesseract 3.02 it is possible to specify multiple languages for the `-l` parameter.

`-l lang`: The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes.

An example: `tesseract myscan.png out -l deu+eng`
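When the list of languages comes from code, the plus-separated argument can be assembled programmatically. This is only a small sketch; `language_argument` and `multilang_command` are hypothetical helper names, and the result is meant to be handed to something like `subprocess.run`.

```python
def language_argument(languages):
    """Join language codes with '+' as expected by tesseract's -l option."""
    cleaned = [lang.strip() for lang in languages if lang.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def multilang_command(image_path, output_base, languages):
    """Build a tesseract command line that recognizes several languages at once."""
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "-l",
        language_argument(languages),
    ]
```

For the example above, `multilang_command("myscan.png", "out", ["deu", "eng"])` produces the same arguments as the command line shown.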

How to install language in tesseract OCR

December 31, 2023 by Tarik

On macOS, run `brew install tesseract-lang`. This installs all languages; you can check them with `tesseract --list-langs`.
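To check for a specific language from a script, the output of `--list-langs` can be parsed. The sketch below assumes the output is a header line starting with "List of available languages" followed by one language code per line. Both function names are made up, and reading stderr as a fallback is an assumption, since some releases have printed the list there.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line; everything else is treated as a language code.
    return [line for line in lines if not line.startswith("List of available languages")]


def installed_languages():
    """Ask the local tesseract binary which languages it can find."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Fall back to stderr when stdout is empty (assumed behavior of some versions).
    return parse_list_langs(result.stdout or result.stderr)
```

A check such as `"deu" in installed_languages()` then tells you whether German is available after the install.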

Which OCR Engine is better: Tesseract or OCRopus? [closed]

December 27, 2023 by Tarik

Initially OCRopus was actually using Tesseract as its recognition engine, but later they switched to their own brand-new engine. It is still fresh and not mature. We ran an accuracy comparison about a year ago, and OCRopus was definitely losing to Tesseract, not to mention commercial engines. Since then I stopped … Read more

OCR with the Tesseract interface

December 12, 2023 by Tarik

Take a look at tessnet

Using Tesseract from java

December 11, 2023 by Tarik

Tesseract is now provided by the javacv project. This is a far better option than using Tess4J, since all that is required is adding a single dependency to your pom file; the native libs for your platform will then be downloaded and linked automatically by the javacv tesseract version. I’ve created an example … Read more

Tesseract OCR simple example

November 29, 2023 by Tarik

OK, I found the solution here: "tessnet2 fails to load", in the answer given by Adam. Apparently I was using the wrong version of tessdata. I was following the source page's instructions intuitively, and that caused the problem. It says: Quick Tessnet2 usage: Download binary here, add a reference of the assembly Tessnet2.dll to your .NET … Read more

Recognize a number from an image

September 22, 2023 by Tarik

You will most likely need to do the following: Apply the Hough Transform algorithm to the entire page; this should yield a series of page sections. For each section you get, apply it again. If the current section yielded 2 elements, then you should be dealing with a rectangle similar to the above. Once … Read more