TikaDocumentConverter does not split content by page #7949

Open
vaclcer opened this issue Jun 28, 2024 · 4 comments
Labels: Contributions wanted! (Looking for external contributions), pdf

Comments

@vaclcer

vaclcer commented Jun 28, 2024

The Documents generated by "TikaDocumentConverter" from PDF files do not contain "\f" page separators, so the "DocumentSplitter" later in the pipeline cannot split them by page and instead produces one big "page" containing all the text.

Page separation works as expected when using "PyPDFToDocument".
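
To illustrate the symptom, here is a minimal sketch in plain Python. It does not use Haystack itself; `split_by_page` is a hypothetical stand-in for a splitter that treats the form feed character "\f" as the page boundary, and the sample strings are made up.

```python
from typing import List


def split_by_page(text: str) -> List[str]:
    """Hypothetical stand-in for page splitting: pages are delimited by form feed."""
    return text.split("\f")


# Text as a converter that emits page separators might produce it (made-up sample).
with_separators = "Page one text.\fPage two text.\fPage three text."

# Same content without separators, as described in this issue (made-up sample).
without_separators = "Page one text. Page two text. Page three text."

pages_ok = split_by_page(with_separators)
pages_merged = split_by_page(without_separators)

print(len(pages_ok))      # three separate pages
print(len(pages_merged))  # a single "page" holding all the text
```

Without any "\f" in the content there is nothing to split on, so everything ends up in one chunk.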

mrm1001 added the pdf label Jun 28, 2024
@anakin87
Member

Thanks, @vaclcer!

For context, the v1.x counterpart of TikaDocumentConverter (TikaConverter) did split documents by page, so it might make sense to check whether that logic is still valid and port it to v2.x.
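
One possible shape for such a port, as a rough sketch only: it is not taken from the v1.x code. It assumes Tika is asked for XHTML output and that each PDF page arrives wrapped in a `<div class="page">` element; `PageCollector` and `xhtml_to_paged_text` are hypothetical names, and the sample markup is made up.

```python
from html.parser import HTMLParser
from typing import List


class PageCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <div class="page">."""

    def __init__(self) -> None:
        super().__init__()
        self.pages: List[str] = []
        self._depth = 0  # div nesting depth inside the current page; 0 means outside
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1  # nested div inside a page
        elif ("class", "page") in attrs:
            self._depth = 1  # a new page starts here
            self._buffer = []

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:  # the page div itself was closed
            self.pages.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def xhtml_to_paged_text(xhtml: str) -> str:
    """Join the collected pages with form feeds so a page splitter can find them."""
    parser = PageCollector()
    parser.feed(xhtml)
    parser.close()
    return "\f".join(parser.pages)


# Made-up sample standing in for Tika's XHTML output (assumed structure).
sample = (
    "<html><body>"
    '<div class="page"><p>First page.</p></div>'
    '<div class="page"><p>Second page.</p></div>'
    "</body></html>"
)
paged = xhtml_to_paged_text(sample)
```

Joining with "\f" keeps the converter output compatible with page-based splitting further down the pipeline. Whether the real Tika output matches this structure would need to be verified against actual PDFs.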

anakin87 added the Contributions wanted! (Looking for external contributions) label Jun 28, 2024
@ghost

ghost commented Jul 8, 2024

Hello,
I'm new to contributing to open-source.
Can I take a shot at this?

@AnushreeBannadabhavi
Contributor

I'd like to take this up if no one is working on it.

@anakin87
Member

@AnushreeBannadabhavi feel free to work on this! 💙

(the user who had commented earlier removed his GitHub profile)
