
Cache local model to reduce memory usage and delays #966

Open

wants to merge 2 commits into base: main

Conversation

starkgate
Contributor

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

Store the LLM after creation for future reuse. This is especially important for local LLMs: previously the LLM was recreated for each request, which caused long delays and doubled GPU memory consumption. (A minimal sketch of the idea follows below.)

  • Why was this change needed? (You can also link to an open issue here)

Fixes #945

  • Other information:

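In outline, the change boils down to keeping one LLM instance per worker process and handing it back on later requests instead of rebuilding it each time. A minimal sketch of that idea (not the actual diff; the `LLMCreator.create_llm` factory, its signature, and the import path are assumptions here):

```python
# Minimal sketch, not the actual PR code: keep one LLM per model name in
# process memory and reuse it across requests.
from application.llm.llm_creator import LLMCreator  # assumed import path

_llm_cache = {}

def get_llm(llm_name, api_key=None):
    if llm_name not in _llm_cache:
        # The expensive step: for a local model this loads the weights
        # (and allocates GPU memory) once per worker instead of per request.
        _llm_cache[llm_name] = LLMCreator.create_llm(llm_name, api_key)
    return _llm_cache[llm_name]
```

Callers throughout the API would then go through `get_llm` rather than invoking the factory directly.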

vercel bot commented May 28, 2024

Someone is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.


vercel bot commented May 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| docs-gpt | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 14, 2024 0:04am |

@dartpain
Contributor

Sorry for the delay on this PR. Really appreciate it!
The main reason is that it's a relatively large conceptual change.
Personally, I think there should be no LLM inference on our API container; ideally it should be a separate one.
One kind of inference that I think is OK for the time being is embeddings, but it also causes memory issues. I had multiple problems because of it on our production servers, so I moved embedding creation to a singleton instead of a classic Flask cache.

PR for embeddings

I think both the cache and singleton patterns are good, but I would prefer the singleton pattern due to its simpler nature. Maybe also because I'm more used to it.

Is there any advantage or disadvantage to either one that you see?
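For reference, a minimal sketch of the singleton approach described above for embeddings. The class name, backend, and model name are illustrative assumptions, not the code from the linked PR:

```python
class EmbeddingsSingleton:
    """Illustrative process-wide holder for a single embeddings instance."""
    _instance = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            # Built once per worker; every later caller gets the same object,
            # so the embedding model is not reloaded for each request.
            cls._instance = cls._build()
        return cls._instance

    @staticmethod
    def _build():
        # Stand-in for the real construction step, e.g. loading a
        # sentence-transformers model via LangChain (assumed backend).
        from langchain_community.embeddings import HuggingFaceEmbeddings
        return HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
```

Functionally this is very close to a keyed cache; the practical difference is that the object's lifetime and access point live in the class itself rather than behind a cache extension tied to the Flask app.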

@starkgate
Contributor Author

starkgate commented Jun 17, 2024

> Sorry for the delay on this PR. Really appreciate it! The main reason is that it's a relatively large conceptual change. Personally, I think there should be no LLM inference on our API container; ideally it should be a separate one. One kind of inference that I think is OK for the time being is embeddings, but it also causes memory issues. I had multiple problems because of it on our production servers, so I moved embedding creation to a singleton instead of a classic Flask cache.
>
> PR for embeddings
>
> I think both the cache and singleton patterns are good, but I would prefer the singleton pattern due to its simpler nature. Maybe also because I'm more used to it.
>
> Is there any advantage or disadvantage to either one that you see?

No worries about the delay. I chose the Flask cache option since it seemed easier to implement: we need the cache/singleton to be accessible throughout the API, as a sort of global variable. Flask cache made this simple: just call the cache and you have your object. I wasn't sure how to do this cleanly with a singleton; admittedly, I'm not much of a Python expert. I could rework the PR to use a singleton instead of a cache, I just wouldn't have as much time to test it anymore.
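For context, the Flask-Caching access pattern described here looks roughly like this. This is a sketch only: `create_local_llm` is a hypothetical stand-in, the cache key is arbitrary, and it assumes an in-process backend whose (pickled) storage can hold the LLM wrapper:

```python
from flask import Flask
from flask_caching import Cache

app = Flask(__name__)
# SimpleCache keeps entries inside the worker process (values are pickled,
# so the cached object must be picklable).
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})

def create_local_llm():
    # Hypothetical stand-in for the expensive model construction that loads
    # weights onto the GPU in the real application.
    return object()

def get_llm():
    # Any view or service code can "just call the cache" to get the object.
    llm = cache.get("llm")
    if llm is None:
        llm = create_local_llm()
        cache.set("llm", llm, timeout=0)  # timeout=0: entry never expires
    return llm
```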

@dartpain
Contributor

Hey @starkgate, thank you so much for this PR. I do think we will go in favour of a singleton.
Also, thank you for bringing this up, as it surfaced some other issues in our application. We definitely need to change things. If you can rework it as a singleton, that would be amazing!

@starkgate
Contributor Author

> Hey @starkgate, thank you so much for this PR. I do think we will go in favour of a singleton. Also, thank you for bringing this up, as it surfaced some other issues in our application. We definitely need to change things. If you can rework it as a singleton, that would be amazing!

I'm glad to hear it helped!

Labels
application
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

🐛 Bug Report: Local model is recreated for each request, causing delay and out of memory errors
3 participants