Caching results for update #39

NickyFot · 2019-01-16T13:33:39Z

Your Environment

Operating System: Windows 10 Pro
Python Version Used: 3.6
Scattertext Version Used: 0.0.2.36

NickyFot · 2019-01-16T13:36:50Z

It looks like if we have a corpus that updates over time (e.g., a Twitter feed), we have to rerun the code for everything (e.g., st.produce_scattertext_explorer()).
We can pickle objects and reload them in the future, so maybe a method to run the explorer on new items only? It would significantly lower processing time on each update.

JasonKessler · 2019-01-16T17:05:25Z

Hi Nicky,

Appreciate the feedback.

Are you able to quickly build new Corpus objects when there are updates to documents? It would be somewhat straightforward to add, say, an "add_documents(parsed_documents, categories)" method to the TermDocMatrix family of classes.
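
As a rough illustration of the idea, here is a minimal sketch of incremental updating with a toy term counter. `ToyTermCounts` and its `add_documents` method are hypothetical stand-ins written for this example; they are not part of Scattertext, and the real TermDocMatrix classes store considerably more state (sparse matrices, term indices, metadata) than this does.

```python
from collections import Counter, defaultdict


class ToyTermCounts:
    """Hypothetical stand-in for a term-document matrix; not Scattertext's API."""

    def __init__(self):
        # Maps category name -> Counter of term frequencies in that category.
        self.counts = defaultdict(Counter)

    def add_documents(self, parsed_documents, categories):
        """Fold new, already-tokenized documents into the existing counts."""
        if len(parsed_documents) != len(categories):
            raise ValueError("need exactly one category per document")
        for tokens, category in zip(parsed_documents, categories):
            self.counts[category].update(tokens)
        return self


tdm = ToyTermCounts()
tdm.add_documents([["great", "movie"], ["bad", "movie"]], ["pos", "neg"])

# A later update only has to process the new document.
tdm.add_documents([["great", "acting"]], ["pos"])
```

Under these assumptions the raw counts are cheap to keep current, and an object like this could be pickled between runs. The expensive part is what happens downstream of the counts.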

Unfortunately, produce_scattertext_explorer uses term positioning, scoring, and selection methods which can depend on the distribution or rank of term frequencies. This means that adding a single document can change the scores or positions of every other term, which makes it very difficult to update the visualization without completely regenerating it.
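
To make the rank dependence concrete, here is a small self-contained sketch. It uses a plain dense ranking of term frequencies purely as an illustration of a rank-based position; the helper `dense_ranks` and the toy counts are assumptions made for this example and are not taken from Scattertext's code.

```python
from collections import Counter


def dense_ranks(counts):
    """Map each term to its dense rank by frequency (1 = least frequent)."""
    distinct = sorted(set(counts.values()))
    rank_of = {freq: i + 1 for i, freq in enumerate(distinct)}
    return {term: rank_of[freq] for term, freq in counts.items()}


before = Counter({"movie": 5, "great": 3, "plot": 3, "acting": 2})
ranks_before = dense_ranks(before)

# One new document arrives, containing a single extra occurrence of "plot".
after = before.copy()
after.update(["plot"])
ranks_after = dense_ranks(after)

# Terms whose rank moved even though their own counts did not change.
shifted = sorted(
    term
    for term in before
    if before[term] == after[term] and ranks_before[term] != ranks_after[term]
)
print(shifted)
```

Here "movie" never appears in the new document, yet its rank changes because "plot" now occupies a frequency level of its own. Any layout derived from such ranks would need to be recomputed for terms the update never touched.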

However, if you'd like to give either of these a shot, I'd be happy to add your improvements to the codebase.

Jason
