PDF-translator-OpenAI-API

Experimental Python-based PDF/plaintext translator that utilizes the OpenAI API

Can be used to dump PDF content into JSON and from there into local databases or, e.g., LLM supplementation with RAG.
NOTE: this is a highly experimental WIP pipeline for dumping PDFs into plaintext and getting them translated through the OpenAI API.
I do NOT recommend running it without first studying the code, since the program is still just an early trial.

Prerequisites

Text extraction from PDF files requires pdfminer.six -- install with: pip install -U pdfminer.six
Token counting (to estimate API costs) requires transformers -- install with: pip install -U transformers
The translation module requires the openai package (pip install -U openai) and a working OpenAI API key.
Set the OpenAI API key in the OPENAI_API_KEY environment variable, or put it on a single line in api_token.txt in the program directory.
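
As an illustration of the two key locations described above, here is a minimal sketch of how a script could look the key up. The function name load_api_key and the lookup order (environment variable first, then api_token.txt) are assumptions made for this example, not necessarily what the scripts in this repository do.

```python
import os
from pathlib import Path


def load_api_key(token_file="api_token.txt"):
    """Return the OpenAI API key from the environment or a one-line file.

    Hypothetical helper: the name and the lookup order are assumptions.
    """
    # Prefer the environment variable when it is set and non-empty.
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    # Otherwise fall back to the first non-empty line of the token file.
    path = Path(token_file)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                return line.strip()
    raise RuntimeError(f"No API key found in OPENAI_API_KEY or {token_file}")
```

Failing early with a clear error is usually preferable to sending a request with an empty key and getting an authentication error back from the API.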

Functionalities so far / processing order

pdfget.py <directory> uses fitz (PyMuPDF) to dump the text in a natural reading order by approximating the text's position on the page. The current version adds a page separator and page counter between pages and writes the plaintext files to the txt_raw subdirectory.
page_fixing.py <directory> can then be run on the txt_raw directory to condense the formatting of each page while keeping the page splits. The output directory is txt_processed. Keep in mind that all of these are trial-and-error approaches that may not suit every use case.
pdf_reader_splitter.py <pdf file> dumps per-page splits straight from the PDF. It also supports a command-line option for splitting by characters. WIP, as usual.
openai_api_auto_translate.py <directory name> translates an entire directory (the one you dumped your splits into with pdf_reader_splitter.py). Edit config.ini to set your own translation parameters.
combine_translation.py <directory name> combines the splits back into one piece.
post_process.py <textfile> applies final touches, e.g. it adds an empty line between paragraphs that lack one and trims multiple empty lines.
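
The final clean-up step described above can be sketched roughly as follows. This is not the code from post_process.py; it is a simplified illustration that assumes every non-empty line is its own paragraph, and the function name post_process is chosen only for the example.

```python
def post_process(text):
    """Separate paragraphs with exactly one empty line.

    Simplifying assumption: each non-empty line is one paragraph.
    """
    # Dropping blank lines also collapses runs of multiple empty lines.
    paragraphs = [line.strip() for line in text.splitlines() if line.strip()]
    if not paragraphs:
        return ""
    # Re-join with one empty line between paragraphs and a final newline.
    return "\n\n".join(paragraphs) + "\n"


if __name__ == "__main__":
    print(post_process("First paragraph.\nSecond paragraph.\n\n\n\nThird paragraph.\n"))
```

If your source text wraps paragraphs over several lines, this sketch would break them apart, so check a sample of the output before applying it to a whole translation.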

Text parsing with `spacy` (for specific use case scenarios only)

Run pip install spacy and then download the packages you need, for example:
python -m spacy download <your spacy package>

WIP

gui-translator.py - an early alpha GUI for side-by-side / A/B type comparison.

Other stuff

pdfmine.py your_file.pdf dumps the text layer of a PDF to plaintext.
tokencounter.py estimates the number of tokens in a text file for a rough token usage estimate.
splitter.py textfile.txt splits the text file into pieces that are more suitable for LLMs such as GPT-3.5 or GPT-4. By default it splits at newlines with a 5000-char limit, which can be adjusted with the char_limit variable.
splitter.py currently also tries to auto-sanitize the PDF dump -- this might not suit your use case, so again -- look at the split dumps first before you run them through an LLM translation. GIGO (garbage in, garbage out) applies to NLP translations as well.
(Coming soon) pipeline to automate the actual translation process.
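
To make the splitting behaviour above concrete, here is a minimal sketch of newline-aware splitting with a character limit. It is not the code from splitter.py: the function name split_text is made up for the example, the sanitizing step is left out, and hard-splitting a single line that exceeds the limit is an assumption about how such a case could be handled.

```python
def split_text(text, char_limit=5000):
    """Split text into chunks of at most char_limit chars, breaking at newlines."""
    chunks = []
    current = ""
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit.
        if current and len(current) + len(line) > char_limit:
            chunks.append(current)
            current = ""
        # Assumption: a single line longer than the limit is hard-split.
        while len(line) > char_limit:
            chunks.append(line[:char_limit])
            line = line[char_limit:]
        current += line
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "first line\nsecond line\nthird line\n"
    for number, chunk in enumerate(split_text(sample, char_limit=25), start=1):
        print(number, repr(chunk))
```

Joining the returned chunks reproduces the input text exactly, which makes it easy to check that nothing was lost before sending the pieces to the API.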

Changelog

v0.14 - added token_count_estimator.py to run a token count estimate (with spacy and tokenizer)
v0.13 - added pdfget.py for natural reading order extraction using fitz (PyMuPDF)
v0.12 - early alpha test for the GUI; gui-translator.py
v0.11 - bugfixes
v0.10 - translation combining via combine_translation.py
v0.09 - token handling, naming policy
v0.08 - more changes to the API call functionality
v0.07 - API call updated and fixed for openai >v1.0
v0.06 - fixes to the API call
v0.05 - calculate the cost approximation
v0.04 - calculate both tokens and chars
v0.03 and earlier: rudimentary sketches

Todo

More streamlined automation for the translation process
Perhaps an optional GUI with a PDF reader
Looking into PDF file layers to see if we could replace the contents in-place (get text block layer from PDF page => sanitize => LLM translate => insert back in-place)

About

Started as a Grindmas (= Code-Grinding Christmas) project for Skrolli magazine
FlyingFathead w/ code whispers from ChaosWhisperer

