# llm-evaluation-framework

Here are 12 public repositories matching this topic.
Test your prompts, agents, and RAG pipelines. Use LLM evals to improve your app's quality and catch problems. Compare the performance of GPT, Claude, Gemini, Llama, and more, using simple declarative configs with command-line and CI/CD integration. (TypeScript, updated Jul 16, 2024)
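Since this entry centers on declarative eval configs, here is a minimal, hypothetical sketch of that pattern in Python: a config of prompts, providers, and assertions, plus a loop that runs them. The schema, provider IDs, and `call_model` stub are illustrative assumptions, not the project's actual API.

```python
# Hypothetical sketch of the declarative-config pattern: prompts x providers
# x assertions. The schema, provider names, and call_model() stub are
# illustrative assumptions, not this project's actual API.

CONFIG = {
    "prompts": ["Answer concisely: {question}"],
    "providers": ["gpt-4o-mini", "claude-3-5-sonnet"],  # assumed provider IDs
    "tests": [
        {
            "vars": {"question": "What is the capital of France?"},
            "assert": {"type": "icontains", "value": "paris"},
        },
    ],
}

def call_model(provider: str, prompt: str) -> str:
    """Stub standing in for a real API call to the named provider."""
    return "The capital of France is Paris."  # canned output so the sketch runs

def run_eval(config: dict) -> None:
    """Run every prompt/test pair against every provider and report pass/fail."""
    for provider in config["providers"]:
        for prompt in config["prompts"]:
            for test in config["tests"]:
                output = call_model(provider, prompt.format(**test["vars"]))
                expected = test["assert"]["value"]
                passed = expected in output.lower()  # case-insensitive 'icontains'
                print(f"{provider}: {'PASS' if passed else 'FAIL'}")

run_eval(CONFIG)
```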
The LLM Evaluation Framework. (Python, updated Jul 16, 2024)
The official evaluation suite and dynamic data release for MixEval. (Python, updated Jul 12, 2024)
Python SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications, from Parea AI (YC S23). (Python, updated Jul 12, 2024)
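To give a feel for what this kind of experiment/eval SDK workflow typically looks like, here is a generic, hypothetical sketch; none of these names are Parea's real API, they only illustrate the trace-then-score pattern.

```python
# Generic, hypothetical sketch of an eval/monitoring SDK workflow. These names
# are NOT Parea's actual API; they only illustrate tracing an LLM call and
# attaching an evaluation score to it.
import time
from typing import Callable

def trace(fn: Callable) -> Callable:
    """Record inputs, output, and latency per call (stand-in for SDK tracing)."""
    def wrapper(*args, **kwargs):
        start = time.time()
        output = fn(*args, **kwargs)
        print(f"trace: {fn.__name__} args={args} latency={time.time() - start:.3f}s")
        return output
    return wrapper

@trace
def answer(question: str) -> str:
    return "Paris"  # stand-in for a real LLM call

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the output matches the reference exactly (case-insensitive)."""
    return float(output.strip().lower() == expected.strip().lower())

score = exact_match(answer("Capital of France?"), "Paris")
print(f"eval score: {score}")
```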
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. (Python, updated Jul 12, 2024)
FM-Leaderboard-er lets you build a leaderboard to find the best LLM/prompt combination for your own business use case, based on your data, tasks, and prompts. (Python, updated May 8, 2024)
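The core idea reduces to ranking model/prompt pairs by their scores on your own tasks. A minimal sketch, assuming per-task scores have already been computed (the scores below are made-up placeholders):

```python
# Minimal leaderboard sketch: rank model/prompt pairs by mean score on your
# own tasks. The scores are made-up placeholders for illustration.
from statistics import mean

scores = {  # (model, prompt_id) -> per-task scores on your data
    ("model-a", "prompt-1"): [1.0, 0.0, 1.0, 1.0],
    ("model-a", "prompt-2"): [1.0, 1.0, 0.0, 1.0],
    ("model-b", "prompt-1"): [0.0, 1.0, 1.0, 0.0],
}

leaderboard = sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True)
for (model, prompt_id), task_scores in leaderboard:
    print(f"{model} / {prompt_id}: {mean(task_scores):.2f}")
```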
Code for "Prediction-Powered Ranking of Large Language Models", arXiv 2024. (Python, updated May 27, 2024)
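For context: prediction-powered inference combines a small set of human labels with a large pool of cheap model-judge predictions, keeping the low bias of the former while gaining the variance reduction of the latter. Below is a minimal sketch of the generic PPI mean estimator applied to a win rate, on synthetic data; it is not necessarily the exact procedure used in the paper.

```python
# Prediction-powered estimate of a win rate: combine a few human labels with
# many cheap model-judge predictions. Synthetic data; this is the generic PPI
# mean estimator, not necessarily the paper's exact procedure.
import numpy as np

rng = np.random.default_rng(0)

# Small labeled set: human labels y and the judge's predictions f on the same items.
y_labeled = rng.binomial(1, 0.60, size=100)   # human "model A wins" labels
flip = rng.random(100) < 0.10                 # judge disagrees ~10% of the time
f_labeled = np.where(flip, 1 - y_labeled, y_labeled)

# Large unlabeled pool: judge predictions only.
f_unlabeled = rng.binomial(1, 0.62, size=10_000)

# PPI mean estimator: judge mean on the unlabeled pool, corrected by the
# judge's measured bias on the labeled set.
theta_pp = f_unlabeled.mean() + (y_labeled - f_labeled).mean()
print(f"prediction-powered win-rate estimate: {theta_pp:.3f}")
```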
TypeScript SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications, from Parea AI (YC S23). (TypeScript, updated Jul 15, 2024)
The official PyTorch implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners". (Jupyter Notebook, updated Jul 9, 2024)
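The probing idea behind the paper is, roughly, to perturb superficial tokens in a reasoning problem while keeping its logic fixed, then check whether the model's answer survives the swap. A hypothetical sketch of generating such perturbed variants (the template and values are made up for illustration):

```python
# Hypothetical sketch of a token-bias probe: keep the logical structure of a
# problem fixed while swapping surface tokens, then check whether a model's
# answers stay consistent. Template and values are made up for illustration.
import itertools

TEMPLATE = ("A {item_a} and a {item_b} cost ${total:.2f} in total. "
            "The {item_a} costs ${diff:.2f} more than the {item_b}. "
            "How much does the {item_b} cost?")

def variants():
    """Yield (prompt, expected_answer) pairs that differ only in surface tokens."""
    items = [("bat", "ball"), ("pen", "eraser")]
    amounts = [(1.10, 1.00), (2.20, 2.00)]  # (total, diff); answer = (total - diff) / 2
    for (a, b), (total, diff) in itertools.product(items, amounts):
        prompt = TEMPLATE.format(item_a=a, item_b=b, total=total, diff=diff)
        yield prompt, (total - diff) / 2

for prompt, expected in variants():
    print(f"{prompt}\n  expected: {expected:.2f}\n")
    # feed `prompt` to the model under test and compare against `expected`
```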
Evaluation of language models in non-English languages. (Python, updated Jun 14, 2024)
(No description provided.) (Jupyter Notebook, updated Jun 1, 2024)
Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities. (Jupyter Notebook, updated Jul 8, 2024)
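The paradigm here is to pack several independent problems into a single prompt and score them all from one response. A minimal hypothetical sketch of building such a prompt and parsing numbered answers (the prompt format and parsing convention are assumptions, not the paper's exact protocol):

```python
# Hypothetical sketch of multi-problem evaluation: bundle several questions
# into one prompt, then parse one numbered answer per question. The prompt
# format and parsing convention are illustrative assumptions.
import re

questions = [
    "What is 17 + 25?",
    "What is the capital of Japan?",
    "Is 91 a prime number? Answer yes or no.",
]

prompt = "Answer each question on its own line as 'N. answer'.\n" + "\n".join(
    f"{i}. {q}" for i, q in enumerate(questions, start=1)
)

def parse_answers(response: str, n: int) -> dict[int, str]:
    """Extract 'N. answer' lines from the model's response."""
    answers = {}
    for match in re.finditer(r"^\s*(\d+)\.\s*(.+)$", response, flags=re.MULTILINE):
        idx = int(match.group(1))
        if 1 <= idx <= n:
            answers[idx] = match.group(2).strip()
    return answers

fake_response = "1. 42\n2. Tokyo\n3. no"  # stand-in for a real model call
print(parse_answers(fake_response, len(questions)))
```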