
keep_in_memory=False makes dataset.get_vw_document() almost unusable #59

Open
Alvant opened this issue May 12, 2020 · 2 comments
Labels
discuss Not everything clear, further communication required

Comments

@Alvant
Collaborator

Alvant commented May 12, 2020

The method is too slow!

Do we really need dask.dataframe? Maybe it would be better to store documents on disk as individual files (rather than as one big .csv)?
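A minimal sketch of the per-file alternative (names like `save_documents` and `get_document` are hypothetical, not the actual TopicNet API): each document gets its own file, so fetching one text is a single O(1) file read instead of a dataframe lookup.

```python
import tempfile
from pathlib import Path

def save_documents(docs: dict, root: Path) -> None:
    """Write each document to its own file instead of one big .csv."""
    root.mkdir(parents=True, exist_ok=True)
    for doc_id, text in docs.items():
        (root / f"{doc_id}.txt").write_text(text, encoding="utf-8")

def get_document(doc_id: str, root: Path) -> str:
    """Read a single document directly from disk."""
    return (root / f"{doc_id}.txt").read_text(encoding="utf-8")

# usage
root = Path(tempfile.mkdtemp())
save_documents({"doc1": "first text", "doc2": "second text"}, root)
print(get_document("doc2", root))  # second text
```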

References:

@Alvant Alvant added the discuss Not everything clear, further communication required label May 12, 2020
@Evgeny-Egorov-Projects
Contributor

  1. Please describe your actual scenario: do you retrieve documents one by one, or a whole batch at once?
  2. It is possible that dask needs some fiddling with options before use (like running on a GPU), but we need to investigate that.

@Alvant
Collaborator Author

Alvant commented May 25, 2020

  1. Yes, my scenario is exactly the one-by-one case. The intratext coherence score cooperates with Dataset: it retrieves document texts under the hood (many documents, one by one, on each fit iteration).
  2. Maybe this might help... but so far I have doubts. The referenced notebook shows that reading a document with dask can take nearly 2 s, whereas reading the same document directly from disk takes approximately 0.005 s.
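For the one-by-one fit-iteration scenario, a bounded cache in front of per-file storage would let repeated retrievals hit disk only once without loading the whole corpus up front. A sketch (not TopicNet code; `DocumentStore` is a hypothetical name, and the cache is bounded by document count via `functools.lru_cache`):

```python
import tempfile
from functools import lru_cache
from pathlib import Path

class DocumentStore:
    """Per-file document storage with a bounded in-memory cache, so
    repeated one-by-one retrievals during fit iterations read each
    document from disk only once."""

    def __init__(self, root: Path, cache_size: int = 1024):
        self._root = root
        # cache is bounded by number of documents, not total memory
        self._read = lru_cache(maxsize=cache_size)(self._read_from_disk)

    def _read_from_disk(self, doc_id: str) -> str:
        return (self._root / f"{doc_id}.txt").read_text(encoding="utf-8")

    def get_document(self, doc_id: str) -> str:
        return self._read(doc_id)

# usage
root = Path(tempfile.mkdtemp())
(root / "a.txt").write_text("alpha", encoding="utf-8")
store = DocumentStore(root)
print(store.get_document("a"))  # alpha
```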
