A high-throughput and memory-efficient inference and serving engine for LLMs
Updated Jul 17, 2024 - Python
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Multi-node production AI stack. Run the best of open source AI easily on your own servers. Create your own AI by fine-tuning open source models. Integrate LLMs with APIs. Run gptscript securely on the server
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
🔮 SuperDuper: Bring AI to your database! Build, deploy and manage any AI application directly with your existing data infrastructure, without moving your data. This includes streaming inference, scalable model training and vector search.
Run any open-source LLM, such as Llama 2 or Mistral, as an OpenAI-compatible API endpoint in the cloud.
The easiest way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Multi-model Inference Graph/Pipelines, LLM/RAG apps, and more!
SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
A collection of hands-on notebooks for LLM practitioners
EmbeddedLLM: API server for Embedded Device Deployment. Currently supports IpexLLM/DirectML/CPU
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine
AICI: Prompts as (Wasm) Programs
A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
A production-ready REST API for vLLM
Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of papers on accelerating LLMs, currently focused mainly on inference acceleration; related work will be added gradually. Contributions are welcome!
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
A library to benchmark LLMs through their exposed APIs. For now, it is vLLM-oriented
An unofficial Go port of the official Tavily API Python Wrapper.