Cache local model to reduce memory usage and delays #966
base: main
Conversation
Sorry for the delay on this PR. Really appreciate it! I think both the cache and singleton patterns are good, but I would prefer the singleton pattern because it's simpler. Maybe also because I'm more used to it. Do you see any advantage or disadvantage to either one?
No worries about the delay. I chose the Flask cache option since it seemed easier to implement: we need the cache/singleton to be accessible throughout the API, as a sort of global variable. Flask cache made this simple: just call the cache and you have your object. I wasn't sure how to do this cleanly with a singleton; admittedly, I'm not a Python expert either. I could rework the PR to use a singleton instead of the cache, I just wouldn't have as much time to test it anymore.
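For readers following along, a minimal sketch of the Flask-Caching approach described above could look like the following. The `create_llm` factory and the `"llm"` cache key are illustrative placeholders, not the PR's actual code:

```python
from flask import Flask
from flask_caching import Cache

app = Flask(__name__)
# An in-process cache, so the stored object survives across requests
# handled by the same worker.
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})

def create_llm():
    # Hypothetical stand-in for the expensive local-model load.
    return {"model": "local-llm"}

def get_llm():
    llm = cache.get("llm")
    if llm is None:
        llm = create_llm()
        # timeout=0 tells Flask-Caching to keep the entry indefinitely.
        cache.set("llm", llm, timeout=0)
    return llm
```

One caveat: in-process backends like SimpleCache typically pickle stored values, which can be awkward for objects holding GPU state, and each worker process still keeps its own copy.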
Hey @starkgate, thank you so much for this PR. I do think we will go in favour of a singleton.
I'm glad to hear it helped!
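Since the maintainers settled on the singleton pattern, a minimal sketch of that variant might look like this (class and method names are assumptions, not the merged code):

```python
import threading

class LLMSingleton:
    """Holds a single shared LLM instance for the whole API process."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        # Double-checked locking so concurrent requests don't each
        # trigger their own expensive model load.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls._create_llm()
        return cls._instance

    @staticmethod
    def _create_llm():
        # Hypothetical stand-in for loading the local model weights.
        return {"model": "local-llm"}
```

Any request handler can then call `LLMSingleton.get()` and receive the same instance, which mirrors what the cache gave the author as "a sort of global variable" without the extra dependency.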
Store the LLM after creation for future use. This is especially important for local LLMs. Previously, the LLM was recreated on each request, which caused long delays and doubled GPU memory consumption.
Fixes #945
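As a self-contained illustration of the before/after behaviour this description refers to (all names here are hypothetical):

```python
from typing import Optional

class LocalLLM:
    """Stand-in for a local model whose construction is expensive."""

    def __init__(self):
        # A real implementation would load model weights onto the GPU here.
        self.loaded = True

    def answer(self, prompt: str) -> str:
        return f"response to {prompt!r}"

# Before: every request built a fresh model, paying the load delay and
# briefly holding two copies of the weights in GPU memory.
def handle_request_before(prompt: str) -> str:
    return LocalLLM().answer(prompt)

# After: the instance is created once, stored, and reused.
_llm: Optional[LocalLLM] = None

def handle_request_after(prompt: str) -> str:
    global _llm
    if _llm is None:
        _llm = LocalLLM()
    return _llm.answer(prompt)
```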