The LLM Evaluation Framework
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluation of LLMs, aiming to explore the technical boundaries of generative AI.
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
The TypeScript SDK for Prompt Foundry, a prompt engineering, prompt management, and prompt testing tool.
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Open-Source Evaluation for GenAI Application Pipelines
The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.
Python SDK for running evaluations on LLM generated responses
🐢 Open-Source Evaluation & Testing for LLMs and ML models
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Superpipe - optimized LLM pipelines for structured data
Cookbooks and tutorials on Literal AI
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative mode, and more. It also lists available tools, methods, repos, and code for hallucination detection, LLM evaluation, and grading (a tool-agnostic sketch of this judge-style evaluation loop appears after this list).
A list of LLM tools & projects.
A compilation of referenced benchmark metrics to evaluate different aspects of knowledge for Large Language Models.
Awesome papers involving LLMs in Social Science.
LLM Evaluation
FM-Leaderboard-er allows you to create a leaderboard to find the best LLM/prompt for your own business use case, based on your own data, tasks, and prompts.
A prompt collection for testing and evaluation of LLMs.
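
Most of the tools listed above converge on the same core loop: run a set of test cases through the model under test, then score the outputs with deterministic metrics, rubric prompts, or an LLM acting as judge. Below is a minimal, tool-agnostic sketch of that loop; it is not the API of any specific project in this list, and the `generate` and `judge` callables are hypothetical placeholders for whichever SDK or model client you actually use.

```python
# Generic sketch of an LLM evaluation loop: deterministic exact-match scoring
# plus rubric-based grading by a judge model. The `generate` and `judge`
# callables are placeholders -- wire in your own SDK or API client.

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Deterministic metric: 1.0 when the output matches the reference."""
    return float(output.strip().lower() == expected.strip().lower())


def llm_judge(output: str, expected: str, judge: Callable[[str], str]) -> float:
    """Rubric-style metric: ask a judge model to grade the output on 0-10."""
    rubric = (
        "Grade the candidate answer against the reference on a 0-10 scale.\n"
        f"Reference: {expected}\nCandidate: {output}\n"
        "Reply with only the number."
    )
    try:
        return min(max(float(judge(rubric)), 0.0), 10.0) / 10.0
    except ValueError:
        return 0.0  # an unparsable judge reply counts as a failure


def run_eval(
    cases: list[EvalCase],
    generate: Callable[[str], str],
    judge: Callable[[str], str],
) -> dict[str, float]:
    """Run every case through the model under test and average both metrics."""
    em_scores, judge_scores = [], []
    for case in cases:
        output = generate(case.prompt)
        em_scores.append(exact_match(output, case.expected))
        judge_scores.append(llm_judge(output, case.expected, judge))
    n = len(cases) or 1
    return {
        "exact_match": sum(em_scores) / n,
        "judge_score": sum(judge_scores) / n,
    }


if __name__ == "__main__":
    cases = [EvalCase(prompt="What is 2 + 2?", expected="4")]
    # Stub model and judge so the sketch runs offline; replace with real calls.
    print(run_eval(cases, generate=lambda p: "4", judge=lambda r: "10"))
```

The listed frameworks differ mainly in what they layer on top of this loop: dataset management, CI/CD integration, regression tracking, observability, and richer metrics.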