The Documents that "TikaDocumentConverter" generates from PDF files do not contain "\f" (form feed) page separators. As a result, the "DocumentSplitter" later in the pipeline cannot split them by page and instead produces one big "page" containing all the text.
Page separation works as expected when using "PyPDFToDocument".
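To illustrate why the missing separator matters, here is a minimal sketch in plain Python. It does not use the Haystack classes; `split_by_page` is a hypothetical stand-in that assumes page splitting amounts to splitting the text on the form feed character, as described above.

```python
def split_by_page(text: str) -> list[str]:
    """Hypothetical stand-in for page splitting: split on the form feed character."""
    return text.split("\f")


# Text with "\f" between pages, as a page-aware converter would emit it.
with_separators = "Page one text\fPage two text\fPage three text"

# Text without any "\f", as reported for the Tika-based converter.
without_separators = "Page one text Page two text Page three text"

print(len(split_by_page(with_separators)))     # three separate pages
print(len(split_by_page(without_separators)))  # one big "page"
```

Without any "\f" in the content, there is nothing to split on, so the whole document comes back as a single chunk.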
For context, the v1.x counterpart of TikaDocumentConverter (the v1.x TikaConverter) split documents by page, so it might make sense to check whether that logic is still valid and port it to v2.x.
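As a rough sketch of what such a port could look like: this assumes Tika is asked for XHTML output and that it wraps each PDF page in a `<div class="page">` element. The names `PageCollector` and `xhtml_to_text_with_page_breaks` are hypothetical, and this is not the v1.x code itself, only an illustration of the idea using the standard library.

```python
from html.parser import HTMLParser


class PageCollector(HTMLParser):
    """Hypothetical helper: collects text per <div class="page"> block."""

    def __init__(self) -> None:
        super().__init__()
        self.pages: list[list[str]] = []
        self._depth = 0  # nesting depth of <div> elements inside the current page

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth == 0 and ("class", "page") in attrs:
            # Assumption: one <div class="page"> per PDF page.
            self.pages.append([])
            self._depth = 1
        elif self._depth > 0:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth > 0 and text:
            self.pages[-1].append(text)


def xhtml_to_text_with_page_breaks(xhtml: str) -> str:
    """Join the text of each page with a form feed so it can be split by page later."""
    parser = PageCollector()
    parser.feed(xhtml)
    parser.close()
    return "\f".join("\n".join(chunks) for chunks in parser.pages)


sample = (
    '<html><body>'
    '<div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div>'
    '</body></html>'
)
print(repr(xhtml_to_text_with_page_breaks(sample)))
```

The resulting string carries one "\f" between consecutive pages, which is what a page-based split relies on.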