# llm-evaluation-framework

Here are 12 public repositories matching this topic.
Test your prompts, agents, and RAG pipelines. Use LLM evals to improve your app's quality and catch problems. Compare the performance of GPT, Claude, Gemini, Llama, and more, using simple declarative configs with command-line and CI/CD integration. (TypeScript, updated Jul 16, 2024)
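Since this entry centers on declarative eval configs, here is a minimal, hypothetical sketch of that pattern in Python: a config of prompts, providers, and assertions, plus a loop that runs them. The schema, provider IDs, and `call_model` stub are illustrative assumptions, not the project's actual API.

```python
# Hypothetical sketch of the declarative-config pattern: prompts x providers
# x assertions. The schema, provider names, and call_model() stub are
# illustrative assumptions, not this project's actual API.

CONFIG = {
    "prompts": ["Answer concisely: {question}"],
    "providers": ["gpt-4o-mini", "claude-3-5-sonnet"],  # assumed provider IDs
    "tests": [
        {
            "vars": {"question": "What is the capital of France?"},
            "assert": {"type": "icontains", "value": "paris"},
        },
    ],
}

def call_model(provider: str, prompt: str) -> str:
    """Stub standing in for a real API call to the named provider."""
    return "The capital of France is Paris."  # canned output so the sketch runs

def run_eval(config: dict) -> None:
    """Run every prompt/test pair against every provider and report pass/fail."""
    for provider in config["providers"]:
        for prompt in config["prompts"]:
            for test in config["tests"]:
                output = call_model(provider, prompt.format(**test["vars"]))
                expected = test["assert"]["value"]
                passed = expected in output.lower()  # case-insensitive 'icontains'
                print(f"{provider}: {'PASS' if passed else 'FAIL'}")

run_eval(CONFIG)
```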
The LLM Evaluation Framework. (Python, updated Jul 16, 2024)
The official evaluation suite and dynamic data release for MixEval. (Python, updated Jul 12, 2024)
Python SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications, from Parea AI (YC S23). (Python, updated Jul 12, 2024)
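To give a feel for what this kind of experiment/eval SDK workflow typically looks like, here is a generic, hypothetical sketch; none of these names are Parea's real API, they only illustrate the trace-then-score pattern.

```python
# Generic, hypothetical sketch of an eval/monitoring SDK workflow. These names
# are NOT Parea's actual API; they only illustrate tracing an LLM call and
# attaching an evaluation score to it.
import time
from typing import Callable

def trace(fn: Callable) -> Callable:
    """Record inputs, output, and latency per call (stand-in for SDK tracing)."""
    def wrapper(*args, **kwargs):
        start = time.time()
        output = fn(*args, **kwargs)
        print(f"trace: {fn.__name__} args={args} latency={time.time() - start:.3f}s")
        return output
    return wrapper

@trace
def answer(question: str) -> str:
    return "Paris"  # stand-in for a real LLM call

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the output matches the reference exactly (case-insensitive)."""
    return float(output.strip().lower() == expected.strip().lower())

score = exact_match(answer("Capital of France?"), "Paris")
print(f"eval score: {score}")
```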
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. (Python, updated Jul 12, 2024)
FM-Leaderboard-er lets you build a leaderboard to find the best LLM/prompt combination for your own business use case, based on your data, tasks, and prompts. (Python, updated May 8, 2024)
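The core idea reduces to ranking model/prompt pairs by their scores on your own tasks. A minimal sketch, assuming per-task scores have already been computed (the scores below are made-up placeholders):

```python
# Minimal leaderboard sketch: rank model/prompt pairs by mean score on your
# own tasks. The scores are made-up placeholders for illustration.
from statistics import mean

scores = {  # (model, prompt_id) -> per-task scores on your data
    ("model-a", "prompt-1"): [1.0, 0.0, 1.0, 1.0],
    ("model-a", "prompt-2"): [1.0, 1.0, 0.0, 1.0],
    ("model-b", "prompt-1"): [0.0, 1.0, 1.0, 0.0],
}

leaderboard = sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True)
for (model, prompt_id), task_scores in leaderboard:
    print(f"{model} / {prompt_id}: {mean(task_scores):.2f}")
```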
Code for "Prediction-Powered Ranking of Large Language Models", arXiv 2024. (Python, updated May 27, 2024)
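For context: prediction-powered inference combines a small set of human labels with a large pool of cheap model-judge predictions, keeping the low bias of the former while gaining the variance reduction of the latter. Below is a minimal sketch of the generic PPI mean estimator applied to a win rate, on synthetic data; it is not necessarily the exact procedure used in the paper.

```python
# Prediction-powered estimate of a win rate: combine a few human labels with
# many cheap model-judge predictions. Synthetic data; this is the generic PPI
# mean estimator, not necessarily the paper's exact procedure.
import numpy as np

rng = np.random.default_rng(0)

# Small labeled set: human labels y and the judge's predictions f on the same items.
y_labeled = rng.binomial(1, 0.60, size=100)   # human "model A wins" labels
flip = rng.random(100) < 0.10                 # judge disagrees ~10% of the time
f_labeled = np.where(flip, 1 - y_labeled, y_labeled)

# Large unlabeled pool: judge predictions only.
f_unlabeled = rng.binomial(1, 0.62, size=10_000)

# PPI mean estimator: judge mean on the unlabeled pool, corrected by the
# judge's measured bias on the labeled set.
theta_pp = f_unlabeled.mean() + (y_labeled - f_labeled).mean()
print(f"prediction-powered win-rate estimate: {theta_pp:.3f}")
```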
TypeScript SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications, from Parea AI (YC S23). (TypeScript, updated Jul 15, 2024)
The official PyTorch implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners". (Jupyter Notebook, updated Jul 9, 2024)
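The probing idea behind the paper is, roughly, to perturb superficial tokens in a reasoning problem while keeping its logic fixed, then check whether the model's answer survives the swap. A hypothetical sketch of generating such perturbed variants (the template and values are made up for illustration):

```python
# Hypothetical sketch of a token-bias probe: keep the logical structure of a
# problem fixed while swapping surface tokens, then check whether a model's
# answers stay consistent. Template and values are made up for illustration.
import itertools

TEMPLATE = ("A {item_a} and a {item_b} cost ${total:.2f} in total. "
            "The {item_a} costs ${diff:.2f} more than the {item_b}. "
            "How much does the {item_b} cost?")

def variants():
    """Yield (prompt, expected_answer) pairs that differ only in surface tokens."""
    items = [("bat", "ball"), ("pen", "eraser")]
    amounts = [(1.10, 1.00), (2.20, 2.00)]  # (total, diff); answer = (total - diff) / 2
    for (a, b), (total, diff) in itertools.product(items, amounts):
        prompt = TEMPLATE.format(item_a=a, item_b=b, total=total, diff=diff)
        yield prompt, (total - diff) / 2

for prompt, expected in variants():
    print(f"{prompt}\n  expected: {expected:.2f}\n")
    # feed `prompt` to the model under test and compare against `expected`
```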
Evaluation of language models in non-English languages. (Python, updated Jun 14, 2024)
(No description provided.) (Jupyter Notebook, updated Jun 1, 2024)
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities. (Jupyter Notebook, updated Jul 8, 2024)
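The paradigm here is to pack several independent problems into a single prompt and score them all from one response. A minimal hypothetical sketch of building such a prompt and parsing numbered answers (the prompt format and parsing convention are assumptions, not the paper's exact protocol):

```python
# Hypothetical sketch of multi-problem evaluation: bundle several questions
# into one prompt, then parse one numbered answer per question. The prompt
# format and parsing convention are illustrative assumptions.
import re

questions = [
    "What is 17 + 25?",
    "What is the capital of Japan?",
    "Is 91 a prime number? Answer yes or no.",
]

prompt = "Answer each question on its own line as 'N. answer'.\n" + "\n".join(
    f"{i}. {q}" for i, q in enumerate(questions, start=1)
)

def parse_answers(response: str, n: int) -> dict[int, str]:
    """Extract 'N. answer' lines from the model's response."""
    answers = {}
    for match in re.finditer(r"^\s*(\d+)\.\s*(.+)$", response, flags=re.MULTILINE):
        idx = int(match.group(1))
        if 1 <= idx <= n:
            answers[idx] = match.group(2).strip()
    return answers

fake_response = "1. 42\n2. Tokyo\n3. no"  # stand-in for a real model call
print(parse_answers(fake_response, len(questions)))
```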