The LLM Evaluation Framework
Updated Jul 16, 2024 - Python
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
NeurIPS 2023 - TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models Official Code
Python SDK for agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks like CrewAI, Langchain, and Autogen
The most comprehensive Python package for evaluating survival analysis models.
📈 Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
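Two of the simpler metrics in that list, RMSE and PSNR, can be sketched in plain Python. This is an illustrative implementation for intuition, not the package's own code; it assumes 8-bit images flattened into equal-length pixel sequences:

```python
import math

def rmse(img_a, img_b):
    """Root-mean-square error between two equal-length pixel sequences."""
    assert len(img_a) == len(img_b), "images must have the same number of pixels"
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a))

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means more similar."""
    err = rmse(img_a, img_b)
    if err == 0:
        return float("inf")  # identical images
    return 20 * math.log10(max_val / err)

# Toy 4-pixel "images" (hypothetical values, 8-bit range)
a = [0, 128, 255, 64]
b = [0, 120, 250, 70]
print(rmse(a, b))
print(psnr(a, b))
```

SSIM, FSIM, and the other structural metrics are considerably more involved (windowed statistics, gradient maps), which is where a dedicated package earns its keep.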
Python client for Kolena's machine learning testing platform
Evaluate custom and HuggingFace text-to-image/zero-shot-image-classification models like CLIP, SigLIP, DFN5B, and EVA-CLIP. Metrics include Zero-shot accuracy, Linear Probe, Image retrieval, and KNN accuracy.
Production-Grade Evaluation for LLM-Powered Applications
Python SDK for running evaluations on LLM generated responses
Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
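As a rough illustration of what a RAG metric measures, the sketch below computes a naive context-recall-style score: the fraction of answer tokens that appear anywhere in the retrieved contexts. This is a toy lexical heuristic for intuition only, not the library's actual (LLM-based) implementation:

```python
def naive_context_recall(answer: str, contexts: list[str]) -> float:
    """Toy metric: fraction of answer tokens found in any retrieved context."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(" ".join(contexts).lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

# Hypothetical question-answering example
score = naive_context_recall(
    "Paris is the capital of France",
    ["France's capital is Paris", "The Eiffel Tower is in Paris"],
)
print(score)
```

A low score suggests the generated answer contains claims the retriever never surfaced, which is the kind of signal these metrics formalize.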
Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
Evaluates neuron segmentations in terms of statistics related to the number of splits and merges
Awesome diffusion Video-to-Video (V2V). A collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, plus a video editing benchmark codebase.

A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
VELOCITI Benchmark Evaluation and Visualisation Code
Pip-compatible CodeBLEU metric implementation, available for Linux/macOS/Windows
Design and implement monitoring and evaluation frameworks. Measure and report on the impact of programs.
This repository contains analysis and predictive modeling of household electricity consumption using Python. It includes data cleaning, exploratory data analysis (EDA), time series forecasting (ARIMA, SARIMA, LSTM), and model evaluation to optimize energy usage.
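Model evaluation in a forecasting project like this typically comes down to error metrics on a held-out period. A minimal sketch in plain Python, using made-up daily kWh readings and a naive last-value forecast as the baseline (a real pipeline would substitute ARIMA/SARIMA/LSTM predictions for the baseline):

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error; penalizes large misses more than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical daily household consumption in kWh
history = [12.1, 11.8, 12.5, 13.0]
actual_next = [12.7, 12.9, 13.1]

# Naive baseline: repeat the last observed value for every future day
naive_forecast = [history[-1]] * len(actual_next)

print(mae(actual_next, naive_forecast))
print(rmse(actual_next, naive_forecast))
```

Comparing a trained model's MAE/RMSE against this naive baseline is a quick sanity check that the model is learning anything at all.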