Current Behavior
I'm using Tesseract indirectly as part of OCRmyPDF and I'm coming here from this issue.
When OCR'ing English (Latin) text with diacritics it doesn't always recognise them. The diacritics in my document are part of surnames originating from Hungary and Belgium.
I've tried with just English, English + Hungarian dictionaries, also tried with Latin script (which has extended character map) to no avail.
The words: poéme, pathétique, animé are recognised.
The words: Ysaÿe, Jenő, Petőfi, etc. are not recognised.
The words csárdás, Telmányi, Dvořák are recognised only with Latin script.
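To make the comparison between language settings repeatable, a small script can check which of the words above survive in the OCR output. The following is only a sketch: it assumes the recognised text is already available as a string (for example, extracted from the OCRmyPDF output), and the helper names are made up for illustration. It normalises to NFC so that precomposed and decomposed forms of "ő" compare equal, and it flags words whose base letters were found without their diacritics.

```python
import unicodedata

# Words taken from the report above.
EXPECTED = ["poéme", "pathétique", "animé", "Ysaÿe", "Jenő", "Petőfi",
            "csárdás", "Telmányi", "Dvořák"]


def nfc(s):
    """Normalise to NFC so 'o' + combining double acute equals 'ő'."""
    return unicodedata.normalize("NFC", s)


def strip_diacritics(s):
    """Remove combining marks, e.g. 'Jenő' becomes 'Jeno'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def missing_words(ocr_text, expected=EXPECTED):
    """Return the expected words that do not occur in the OCR text."""
    haystack = nfc(ocr_text)
    return [w for w in expected if nfc(w) not in haystack]


def lost_diacritics(ocr_text, expected=EXPECTED):
    """Return missing words whose diacritic-free form does occur in the text."""
    haystack = nfc(ocr_text)
    return [w for w in missing_words(ocr_text, expected)
            if strip_diacritics(w) in haystack]
```

Running `lost_diacritics` on the output of each language combination would show whether a word was dropped entirely or only lost its accent, which are different failure modes.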
Expected Behavior
The diacritics should be recognised.
Source files
tesseract -v
tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.2 : libjpeg 6b (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.2 zlib/1.3.1 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.8.0 OpenSSL/3.2.2 zlib/1.3.1 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 libssh2/1.11.0 nghttp2/1.62.1 librtmp/2.3 OpenLDAP/2.5.18
"ő" is included in hun.traineddata, so you could try Latin+hun, but training a new model would be better.
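For anyone who wants to try the combination outside OCRmyPDF, here is a rough sketch of calling the Tesseract CLI directly with several models joined by `+`. The image path `page.png` is hypothetical, and the model names are assumptions about the local install: depending on how the tessdata directory is laid out, the script model may need to be given as `script/Latin` rather than `Latin`.

```python
import subprocess


def build_command(image_path, languages):
    """Build a tesseract command that prints recognised text to stdout.

    `languages` is a list of model names as installed locally,
    e.g. ["Latin", "hun"]; they are joined with '+' for the -l option.
    """
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_ocr(image_path, languages):
    """Run tesseract and return the recognised text (requires tesseract on PATH)."""
    result = subprocess.run(
        build_command(image_path, languages),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


# Example (hypothetical file name, not run here):
# text = run_ocr("page.png", ["Latin", "hun"])
```

The order of the models may affect the result, so it could be worth trying both `Latin+hun` and `hun+Latin`.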
I tried with Hungarian and Latin too; it didn't always work. If training a new model is the only way forward, will I have to do it myself, or can it be added to the existing Tesseract models?
Operating System
Debian Testing (Bookworm)
uname -a