TikaDocumentConverter does not split content by page #7949

vaclcer · 2024-06-28T08:36:05Z

The Documents that "TikaDocumentConverter" generates from PDF files do not contain "\f" page separators, so the "DocumentSplitter" later in the pipeline cannot split them by page and produces one big "page" containing all the text.

Page separation works as expected when using "PyPDFToDocument".
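
To illustrate the symptom, here is a minimal sketch in plain Python. It is not Haystack's actual splitter code; `split_by_page` is a hypothetical stand-in that assumes page splitting amounts to splitting on the form feed character "\f". When the converter output has no "\f", everything ends up in a single "page".

```python
def split_by_page(text: str) -> list[str]:
    """Hypothetical stand-in for page splitting: split on form feed."""
    return text.split("\f")


# Converter output that keeps page separators (what the splitter needs).
with_separators = "Page one text.\fPage two text.\fPage three text."

# Converter output without page separators (the behavior reported here).
without_separators = "Page one text.\nPage two text.\nPage three text."

pages_ok = split_by_page(with_separators)
pages_merged = split_by_page(without_separators)

print(len(pages_ok))      # number of pages when "\f" is present
print(len(pages_merged))  # number of pages when "\f" is missing
```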

anakin87 · 2024-06-28T08:40:51Z

Thanks, @vaclcer!

For context, TikaDocumentConverter split documents by page in v1.x (v1.x TikaConverter), so it might make sense to see if the logic is still valid and port it to v2.x.
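
One possible direction for such a port, sketched with the standard library only. This assumes that Tika's XHTML output for PDFs wraps each page in a `<div class="page">` element; that assumption, the `PageCollector` class, and the `xhtml_to_text_with_page_breaks` function are illustrative and are not taken from the Haystack code base. The idea is to collect the text of each page element and join the pages with "\f".

```python
from html.parser import HTMLParser


class PageCollector(HTMLParser):
    """Hypothetical helper: collects text per <div class="page"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.pages: list[str] = []
        self._depth = 0  # <div> nesting depth inside a page; 0 means outside
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # Nested <div> inside a page: track it so we close correctly.
            self._depth += 1
        elif "page" in (dict(attrs).get("class") or "").split():
            # Assumed page marker: start collecting a new page.
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            # The page <div> itself was closed: store its text.
            self.pages.append("\n".join(self._chunks))

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._chunks.append(data.strip())


def xhtml_to_text_with_page_breaks(xhtml: str) -> str:
    """Join the text of all page elements with form feed separators."""
    parser = PageCollector()
    parser.feed(xhtml)
    parser.close()
    return "\f".join(parser.pages)


# Hand-written sample input, not real Tika output.
sample = (
    "<html><body>"
    '<div class="page"><p>First page.</p></div>'
    '<div class="page"><p>Second page.</p><p>More text.</p></div>'
    "</body></html>"
)
content = xhtml_to_text_with_page_breaks(sample)
print(repr(content))
```

Text produced this way contains one "\f" between consecutive pages, which is what a page-based splitter would need.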

ghost · 2024-07-08T12:10:42Z

Hello,
I'm new to contributing to open-source.
Can I take a shot at this?

AnushreeBannadabhavi · 2024-07-14T17:10:49Z

I'd like to take this up if no one is working on it.

anakin87 · 2024-07-15T08:51:05Z

@AnushreeBannadabhavi feel free to work on this! 💙

(the user who had commented earlier removed his GitHub profile)

mrm1001 added the pdf label Jun 28, 2024

anakin87 added the Contributions wanted! Looking for external contributions label Jun 28, 2024
