
keep_in_memory=False makes dataset.get_vw_document() almost unusable #59

Open
Alvant opened this issue May 12, 2020 · 2 comments
Labels
discuss Not everything clear, further communication required

Comments

@Alvant
Collaborator

Alvant commented May 12, 2020

The method is too slow!

Do we really need dask.dataframe? Maybe it would be better to store documents on disk as individual files (rather than as one big .csv)?
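A minimal sketch of the per-file alternative (names like `save_documents` and `get_document` are hypothetical, not the actual TopicNet API): each document gets its own file, so fetching one text is a single O(1) file read instead of a dataframe lookup.

```python
import tempfile
from pathlib import Path

def save_documents(docs: dict, root: Path) -> None:
    """Write each document to its own file instead of one big .csv."""
    root.mkdir(parents=True, exist_ok=True)
    for doc_id, text in docs.items():
        (root / f"{doc_id}.txt").write_text(text, encoding="utf-8")

def get_document(doc_id: str, root: Path) -> str:
    """Read a single document directly from disk."""
    return (root / f"{doc_id}.txt").read_text(encoding="utf-8")

# usage
root = Path(tempfile.mkdtemp())
save_documents({"doc1": "first text", "doc2": "second text"}, root)
print(get_document("doc2", root))  # second text
```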

References:

@Alvant Alvant added the discuss Not everything clear, further communication required label May 12, 2020
@Evgeny-Egorov-Projects
Contributor

  1. Please describe your actual scenario: do you retrieve documents one by one, or a whole batch at once?
  2. It is possible that dask needs some fiddling with options before use (like running on a GPU), but we need to investigate that.

@Alvant
Collaborator Author

Alvant commented May 25, 2020

  1. Yes, my scenario is exactly the one-by-one case. The intratext coherence score cooperates with Dataset: it retrieves document texts under the hood (many documents, one by one, on each fit iteration).
  2. Maybe this might help... but so far I have doubts. The referenced notebook shows that reading a document with dask can take nearly 2 s, whereas reading the same document directly from disk takes approximately 0.005 s.
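For the one-by-one fit-iteration scenario, a bounded cache in front of per-file storage would let repeated retrievals hit disk only once without loading the whole corpus up front. A sketch (not TopicNet code; `DocumentStore` is a hypothetical name, and the cache is bounded by document count via `functools.lru_cache`):

```python
import tempfile
from functools import lru_cache
from pathlib import Path

class DocumentStore:
    """Per-file document storage with a bounded in-memory cache, so
    repeated one-by-one retrievals during fit iterations read each
    document from disk only once."""

    def __init__(self, root: Path, cache_size: int = 1024):
        self._root = root
        # cache is bounded by number of documents, not total memory
        self._read = lru_cache(maxsize=cache_size)(self._read_from_disk)

    def _read_from_disk(self, doc_id: str) -> str:
        return (self._root / f"{doc_id}.txt").read_text(encoding="utf-8")

    def get_document(self, doc_id: str) -> str:
        return self._read(doc_id)

# usage
root = Path(tempfile.mkdtemp())
(root / "a.txt").write_text("alpha", encoding="utf-8")
store = DocumentStore(root)
print(store.get_document("a"))  # alpha
```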
