
Cache local model to reduce memory usage and delays #966

Open

wants to merge 2 commits into base: main

Conversation

starkgate
Contributor

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

Store the LLM after creation for future reuse. This is especially important for local LLMs: previously the LLM was recreated for each request, which caused long delays and doubled GPU memory consumption. (A minimal sketch of the idea follows below.)

  • Why was this change needed? (You can also link to an open issue here)

Fixes #945

  • Other information:

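In outline, the change boils down to keeping one LLM instance per worker process and handing it back on later requests instead of rebuilding it each time. A minimal sketch of that idea (not the actual diff; the `LLMCreator.create_llm` factory, its signature, and the import path are assumptions here):

```python
# Minimal sketch, not the actual PR code: keep one LLM per model name in
# process memory and reuse it across requests.
from application.llm.llm_creator import LLMCreator  # assumed import path

_llm_cache = {}

def get_llm(llm_name, api_key=None):
    if llm_name not in _llm_cache:
        # The expensive step: for a local model this loads the weights
        # (and allocates GPU memory) once per worker instead of per request.
        _llm_cache[llm_name] = LLMCreator.create_llm(llm_name, api_key)
    return _llm_cache[llm_name]
```

Callers throughout the API would then go through `get_llm` rather than invoking the factory directly.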

vercel bot commented May 28, 2024

Someone is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.


vercel bot commented May 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| docs-gpt | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 14, 2024 0:04am |

@dartpain
Contributor

Sorry for the delay on this PR. Really appreciate it!
The main reason is that it's a relatively large conceptual change.
Personally, I think there should be no LLM inference on our API container; ideally it should be a separate one.
One kind of inference that I think is OK for the time being is embeddings, but it also causes memory issues. I had multiple problems because of it on our production servers, so I moved embedding creation to a singleton instead of a classic Flask cache.

PR for embeddings

I think both the cache and singleton patterns are good, but I would prefer the singleton pattern due to its simpler nature. Maybe also because I'm more used to it.

Is there any advantage or disadvantage to either one that you see?
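For reference, a minimal sketch of the singleton approach described above for embeddings. The class name, backend, and model name are illustrative assumptions, not the code from the linked PR:

```python
class EmbeddingsSingleton:
    """Illustrative process-wide holder for a single embeddings instance."""
    _instance = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            # Built once per worker; every later caller gets the same object,
            # so the embedding model is not reloaded for each request.
            cls._instance = cls._build()
        return cls._instance

    @staticmethod
    def _build():
        # Stand-in for the real construction step, e.g. loading a
        # sentence-transformers model via LangChain (assumed backend).
        from langchain_community.embeddings import HuggingFaceEmbeddings
        return HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
```

Functionally this is very close to a keyed cache; the practical difference is that the object's lifetime and access point live in the class itself rather than behind a cache extension tied to the Flask app.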

@starkgate
Contributor Author

starkgate commented Jun 17, 2024

> Sorry for the delay on this PR. Really appreciate it! The main reason is that it's a relatively large conceptual change. Personally, I think there should be no LLM inference on our API container; ideally it should be a separate one. One kind of inference that I think is OK for the time being is embeddings, but it also causes memory issues. I had multiple problems because of it on our production servers, so I moved embedding creation to a singleton instead of a classic Flask cache.
>
> PR for embeddings
>
> I think both the cache and singleton patterns are good, but I would prefer the singleton pattern due to its simpler nature. Maybe also because I'm more used to it.
>
> Is there any advantage or disadvantage to either one that you see?

No worries about the delay. I chose the Flask cache option since it seemed easier to implement: we need the cache/singleton to be accessible throughout the API, as a sort of global variable. Flask cache made this simple: just call the cache and you have your object. I wasn't sure how to do this cleanly with a singleton; admittedly, I'm not much of a Python expert. I could rework the PR to use a singleton instead of a cache, I just wouldn't have as much time to test it anymore.
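For context, the Flask-Caching access pattern described here looks roughly like this. This is a sketch only: `create_local_llm` is a hypothetical stand-in, the cache key is arbitrary, and it assumes an in-process backend whose (pickled) storage can hold the LLM wrapper:

```python
from flask import Flask
from flask_caching import Cache

app = Flask(__name__)
# SimpleCache keeps entries inside the worker process (values are pickled,
# so the cached object must be picklable).
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})

def create_local_llm():
    # Hypothetical stand-in for the expensive model construction that loads
    # weights onto the GPU in the real application.
    return object()

def get_llm():
    # Any view or service code can "just call the cache" to get the object.
    llm = cache.get("llm")
    if llm is None:
        llm = create_local_llm()
        cache.set("llm", llm, timeout=0)  # timeout=0: entry never expires
    return llm
```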

@dartpain
Contributor

Hey @starkgate, thank you so much for this PR. I do think we will go in favour of a singleton.
Also, thank you for bringing this up, as it surfaced some other issues in our application. We definitely need to change things. If you can rework it as a singleton, that would be amazing!

@starkgate
Contributor Author

> Hey @starkgate, thank you so much for this PR. I do think we will go in favour of a singleton. Also, thank you for bringing this up, as it surfaced some other issues in our application. We definitely need to change things. If you can rework it as a singleton, that would be amazing!

I'm glad to hear it helped!

Labels
application
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

🐛 Bug Report: Local model is recreated for each request, causing delay and out of memory errors
3 participants