Caching results for update #39

Open
NickyFot opened this issue Jan 16, 2019 · 2 comments

Comments

@NickyFot

Your Environment

  • Operating System: Windows 10 Pro
  • Python Version Used: 3.6
  • Scattertext Version Used: 0.0.2.36
@NickyFot
Author

It looks like when a corpus updates over time (e.g. a Twitter feed), we have to rerun the code on everything (e.g. st.produce_scattertext_explorer()).
We can pickle objects and reload them later, so maybe there could be a method to run the explorer on new items only? It would significantly lower processing time on each update.

@JasonKessler
Owner

Hi Nicky,

Appreciate the feedback.

Are you able to quickly build new Corpus objects when there are updates to documents? It would be somewhat straightforward to add, say, an "add_documents(parsed_documents, categories)" method to the TermDocMatrix family of classes.
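
To make the idea concrete, here is a minimal sketch of how an incremental "add_documents" could behave. It assumes a toy count store: ToyTermCounts, its attributes, and the pre-tokenized input format are all hypothetical and are not part of Scattertext's API. Only the method name and argument order come from the suggestion above.

```python
from collections import Counter, defaultdict


class ToyTermCounts:
    """Hypothetical stand-in for a term-document matrix (not Scattertext's class)."""

    def __init__(self):
        # One term-frequency Counter per category label.
        self.counts_by_category = defaultdict(Counter)
        self.num_docs = 0

    def add_documents(self, parsed_documents, categories):
        """Fold new tokenized documents into the existing per-category counts."""
        if len(parsed_documents) != len(categories):
            raise ValueError("need exactly one category per document")
        for tokens, category in zip(parsed_documents, categories):
            self.counts_by_category[category].update(tokens)
            self.num_docs += 1
        return self


# Initial build, then a later update that touches only the new document.
store = ToyTermCounts()
store.add_documents([["good", "movie"], ["bad", "movie"]], ["pos", "neg"])
store.add_documents([["good", "plot"]], ["pos"])
```

Updating raw counts this way only costs work proportional to the new documents, which is why this half of the request is the more tractable one.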

Unfortunately, produce_scattertext_explorer uses term positioning, scoring, and selection methods which can depend on the distribution or rank of term frequencies. This means that adding a single document can change the scores or positions of every other term, which makes it very difficult to update the visualization without completely regenerating it.
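
The effect can be illustrated with a small, self-contained sketch. It assumes a simple dense-rank scaling of term frequencies to [0, 1] as a stand-in for rank-based positioning; the helper names and the toy documents are made up for illustration, and this is not the scaling code Scattertext itself runs.

```python
from collections import Counter


def term_counts(docs):
    """Count whitespace-separated, lowercased terms across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def dense_rank_scale(counts):
    """Map each term to the dense rank of its frequency, scaled to [0, 1]."""
    distinct = sorted(set(counts.values()))
    rank_of = {freq: i for i, freq in enumerate(distinct)}
    top = max(len(distinct) - 1, 1)  # avoid dividing by zero
    return {term: rank_of[freq] / top for term, freq in counts.items()}


old_docs = ["good movie", "good plot", "bad movie"]
new_docs = old_docs + ["bad bad bad acting"]  # a single added document

before = dense_rank_scale(term_counts(old_docs))
after = dense_rank_scale(term_counts(new_docs))

# Terms whose scaled position moved after the update.
changed = sorted(term for term in before if before[term] != after[term])
print(changed)
```

Here "good" and "movie" do not occur in the added document, yet their scaled positions move because a new frequency rank appears above them. Any cached coordinates for those terms would be stale after the update.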

However, if you'd like to give either of these a shot, I'd be happy to add your improvements to the codebase.

Jason
