Dropout layers for Tesseract #4252

Open
yaofuzhou opened this issue May 27, 2024 · 5 comments

yaofuzhou commented May 27, 2024

Your Feature Request

I am trying to implement dropout layers for Tesseract. For now, the hope is to add an element such as "Dr0.2" to the VGSLSpecs syntax. I have implemented some of the code but have run into a few issues, and I figure this may be the place for discussion.
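
For illustration, the new element would slot into a spec string roughly like this (the spec below is close to the default English one; Dr0.2 is the proposed addition and its exact placement is still open):

  [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Dr0.2 Lfx96 Rx96 Lfx256 O1c111]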

  1. The files I have edited so far are:
 Changes to be committed:
   (use "git restore --staged <file>..." to unstage)
 	new file:   ../src/lstm/dropout.cpp
 	new file:   ../src/lstm/dropout.h
 
 Changes not staged for commit:
   (use "git add <file>..." to update what will be committed)
   (use "git restore <file>..." to discard changes in working directory)
 	modified:   ../Makefile.am
 	modified:   ../configure.ac (for my own environment and irrelevant to the new dropout feature)
 	modified:   ../src/lstm/fullyconnected.cpp
 	modified:   ../src/lstm/network.cpp
 	modified:   ../src/lstm/network.h
 	modified:   ../src/training/common/networkbuilder.cpp
 	modified:   ../src/training/common/networkbuilder.h
  2. The code compiles but cannot run:
  ~/Documents/OCR/tesstrain_units_6 (main*) » make training
  make[1]: Entering directory '~/Documents/OCR/tesstrain_units_6'
  ~/Documents/OCR/tesseract_dr/build/combine_lang_model \
 	--input_unicharset data/units/unicharset \
 	--script_dir data/langdata \
 	--numbers data/units/units.numbers \
 	--puncs data/units/units.punc \
 	--words data/units/units.wordlist \
 	--output_dir data \
 	 \
 	--lang units
  dyld[91402]: symbol not found in flat namespace '__ZN9tesseract7Network11DeSerializeEPNS_5TFileE'
  make[1]: *** [dr_training.mk:40: data/units/units.traineddata] Abort trap: 6
  make[1]: Leaving directory '~/Documents/OCR/tesstrain_units_6'
  make: *** [Makefile:17: training] Error 2

This is not surprising, as I am sure additional, essential modifications are needed in other parts of the codebase. (For reference, the missing symbol demangles to tesseract::Network::DeSerialize(tesseract::TFile*).)

  3. I need to be able to disable the dropout feature in the deployed .traineddata models, which may require further changes to network.cpp. I would like to ask the community about the best practice for adding a flag or switch for this purpose. (A minimal sketch of the intended train-only behaviour follows this list.)

  4. Ideally, when continuing training from a checkpoint, I would like to be able to adjust the dropout rate(s) to different values, including setting them to 0 (perhaps once training is converging). There is probably more than one way to do this, but I want to ask the community for the best practice.

  5. Let me know when you want to go over my already implemented modifications (that do not work yet).
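
For point 3, here is a minimal, self-contained sketch of the behaviour I have in mind: inverted dropout that is a pure pass-through whenever the network is not training. The class and names below are only illustrative, not my actual Dropout implementation; in Tesseract the training flag would presumably come from the surrounding Network state rather than from the layer itself.

  #include <random>
  #include <vector>

  // Illustrative inverted-dropout layer: active only while training,
  // a pure pass-through at inference, so deployed models need no rescaling.
  class DropoutSketch {
  public:
    explicit DropoutSketch(float rate) : rate_(rate), rng_(std::random_device{}()) {}

    // 'training' would come from the surrounding network state,
    // not from the layer itself.
    std::vector<float> Forward(const std::vector<float> &input, bool training) {
      if (!training || rate_ <= 0.0f) return input;  // inference: no-op
      std::bernoulli_distribution keep(1.0 - rate_);
      const float scale = 1.0f / (1.0f - rate_);     // inverted scaling at train time
      std::vector<float> output(input.size());
      for (size_t i = 0; i < input.size(); ++i)
        output[i] = keep(rng_) ? input[i] * scale : 0.0f;
      return output;
    }

  private:
    float rate_;
    std::mt19937 rng_;
  };

  int main() {
    DropoutSketch dr(0.2f);
    std::vector<float> x(8, 1.0f);
    auto train_out = dr.Forward(x, /*training=*/true);   // some units zeroed, rest scaled by 1.25
    auto infer_out = dr.Forward(x, /*training=*/false);  // identical to x
    return train_out.size() == infer_out.size() ? 0 : 1;
  }

The backward pass would need to reuse the same mask in the corresponding Backward() call, which I have left out here.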

amitdo (Collaborator) commented May 28, 2024

Let me know when you want to go over my already implemented modifications (that do not work yet).

I suggest putting it in a feature branch in your GitHub fork of Tesseract, so other people can see it.

amitdo (Collaborator) commented May 28, 2024

I reformatted your comment.

amitdo (Collaborator) commented May 28, 2024

CC @bertsky,

Maybe you can help @yaofuzhou with this new feature.

stweil (Contributor) commented May 28, 2024

I just pushed my own unfinished efforts: https://github.com/stweil/tesseract/tree/dropout.

yaofuzhou (Author) commented May 28, 2024

[Edited]

This is my implementation of the dropout feature so far:
https://github.com/yaofuzhou/tesseract
I have gone over @stweil's code, and it seems we are approaching the problem in a very similar way.

There are aspects of @stweil's code that I can learn from; I will try to incorporate those into my code and give full credit to @stweil in the process.

My original description remains the same, namely:

  1. My code compiles but does not run. Specifically, the lstmtraining and tesseract binaries yield the error messages
dyld[2292]: symbol not found in flat namespace '__ZN9tesseract7Network11DeSerializeEPNS_5TFileE'
[1]    2292 abort      ./lstmtraining
dyld[2292]: symbol not found in flat namespace '__ZN9tesseract7Network11DeSerializeEPNS_5TFileE'
[1]    2329 abort      ./tesseract

respectively, which means I am probably missing something elsewhere in the Tesseract codebase. I have searched for Convolve and Maxpool to see where those analogous components are hooked in, but have not found the solution yet. This is probably where I need help the most.

  2. I need to implement a flag or switch somewhere so that the dropout mechanism is only active during training (when running the lstmtraining binary) and not during normal use (when running the tesseract binary).

  3. Ideally, I need a mechanism to adjust the dropout_rate of each dropout layer when lstmtraining continues from a checkpoint, since it may be desirable to turn dropout off once training converges to a good finish. (A sketch of one possible approach follows.)
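
To make point 3 concrete, one possible approach (only a sketch, not the code in my branch or @stweil's) is to store the rate in the layer's serialized data and let the training driver override it after a checkpoint is loaded, for example from a new, hypothetical lstmtraining flag such as --dropout_rate. Tesseract layers would do this in their TFile-based Serialize/DeSerialize; plain FILE* is used below only to keep the example self-contained.

  #include <cstdio>
  #include <optional>

  // Illustrative dropout state whose rate travels with the checkpoint.
  struct DropoutState {
    float rate = 0.2f;

    bool Serialize(std::FILE *fp) const {
      return std::fwrite(&rate, sizeof(rate), 1, fp) == 1;
    }
    bool DeSerialize(std::FILE *fp) {
      return std::fread(&rate, sizeof(rate), 1, fp) == 1;
    }
  };

  // Called by the training driver after the checkpoint has been loaded,
  // e.g. driven by a (hypothetical) --dropout_rate command-line flag.
  // Setting the override to 0 effectively turns dropout off for the
  // remainder of training.
  void MaybeOverrideRate(DropoutState &layer, std::optional<float> override_rate) {
    if (override_rate.has_value()) {
      layer.rate = *override_rate;
    }
  }

  int main() {
    DropoutState layer;              // rate as loaded from the checkpoint
    MaybeOverrideRate(layer, 0.0f);  // anneal dropout to zero near convergence
    return layer.rate == 0.0f ? 0 : 1;
  }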
