Inconsistencies in detection and extraction of text using tesseract #4255

Open
saanvib13 opened this issue May 30, 2024 · 4 comments

Comments

@saanvib13

saanvib13 commented May 30, 2024

Your Feature Request

I have provided the image from which I am trying to extract text using Tesseract OCR.
[Image: output]

I have also provided the text extracted from the image.
[Image: input]

As can be seen from the images, the extracted text is not very accurate: negative signs have been omitted, and some unwanted characters appear in the extracted text. (I have marked some of the incorrect results with blue boxes.)
I have tried to improve the results by preprocessing the images and changing the model parameters. I have tried:

  1. binarizing the images
  2. HDR processing of the images

Even then, such inconsistencies remain.

How can I improve the detection and extraction of text in Tesseract? I have also tried PaddleOCR for the same task. Even then, symbols such as the euro sign and some negative signs are not detected.
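For readers following the preprocessing steps above, here is a minimal sketch of one commonly tried setup: upscale, binarize with Otsu's method, then call Tesseract with a page segmentation mode and a character whitelist. The function names `build_tesseract_config` and `ocr_numeric_image` and the default whitelist are hypothetical, chosen only for illustration. `--psm` and `tessedit_char_whitelist` are existing Tesseract options, but whether they fix the errors on this particular image is untested here.

```python
def build_tesseract_config(psm: int = 6, whitelist: str = "0123456789.,-+%€") -> str:
    """Build a Tesseract config string (hypothetical helper).

    --psm 6 asks Tesseract to treat the image as a single uniform block of text.
    The whitelist is an assumption about which characters a numeric table contains.
    """
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def ocr_numeric_image(path: str, scale: float = 2.0) -> str:
    """Upscale, binarize, and OCR an image of numbers (illustrative only)."""
    # Imported lazily so the config helper works without OpenCV or pytesseract.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Small glyphs such as decimal points and minus signs may survive better
    # after upscaling; the factor 2.0 is a guess, not a recommendation.
    gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
    # Otsu's method picks the binarization threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return pytesseract.image_to_string(binary, config=build_tesseract_config())
```

A call such as `ocr_numeric_image("table.png")` (the file name is a placeholder) returns the recognized text as a string. Whether the euro sign is recognized also depends on the language model in use.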

@zdenop
Contributor

zdenop commented May 30, 2024

What about reading documentation?

@saanvib13
Author

@zdenop Thank you for your response. I tried every step mentioned in this documentation. Even then, some decimal points are omitted; for example, 22.5 is read as 225. Moreover, some numbers are wrongly detected; for example, -9 is extracted as = ). Some negative signs are also omitted.
I have tried preprocessing the images and have implemented the following:

  1. noise removal
  2. canny edge detection
  3. hough line transform
  4. binarization
  5. HDR processing

Please provide guidance and help me resolve this issue.
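One thing to check with table images is whether line removal also erases minus signs, since both are horizontal strokes. The sketch below removes only long horizontal runs from a binarized image, so short strokes are kept. It assumes the image is a list of rows of 0/1 values with 1 as foreground; the function name `remove_long_horizontal_runs` and the `min_length` parameter are hypothetical. That table rules cause the dropped signs in this issue is an assumption, not something the thread confirms.

```python
def remove_long_horizontal_runs(binary, min_length):
    """Clear horizontal foreground runs that are at least min_length pixels long.

    binary is a list of rows, each a list of 0 (background) or 1 (foreground).
    Returns a cleaned copy; the input is left unchanged.
    """
    cleaned = [row[:] for row in binary]
    for row in cleaned:
        x = 0
        width = len(row)
        while x < width:
            if row[x] == 1:
                start = x
                # Walk to the end of this run of foreground pixels.
                while x < width and row[x] == 1:
                    x += 1
                # Long runs are treated as table rules and erased;
                # short runs (for example a minus sign) are kept.
                if x - start >= min_length:
                    for i in range(start, x):
                        row[i] = 0
            else:
                x += 1
    return cleaned


image = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # a table rule spanning the full width
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1, 0],  # a short stroke (minus sign) and part of a digit
]
cleaned = remove_long_horizontal_runs(image, min_length=6)
```

On real images a similar effect is often obtained with a morphological opening using a wide, one-pixel-high kernel in OpenCV. In either case the length threshold needs to be clearly larger than the widest minus sign in the image.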

@zdenop
Contributor

zdenop commented May 31, 2024

And what did you learn about table recognition?
What do forum posts and other issues say about table recognition? You should check these sources BEFORE posting an issue.

@amitdo amitdo added the tables label Jun 4, 2024
@rmast

rmast commented Jun 16, 2024

This mod seems to do a slightly better job, still not flawless...
image
