A platform for private benchmarking of machine learning models. It supports evaluating models under different trust levels between model owners and dataset owners.
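A minimal sketch of how trust-level-based evaluation dispatch might look. The `TrustLevel` enum, the function names, and the three-way split below are illustrative assumptions, not the platform's actual API:

```python
from enum import Enum, auto

class TrustLevel(Enum):
    """Hypothetical trust relationships between model owner and dataset owner."""
    MUTUAL_TRUST = auto()       # both parties share their artifacts openly
    TRUSTED_EVALUATOR = auto()  # a neutral third party runs the evaluation
    NO_TRUST = auto()           # neither party reveals its artifact

def evaluate_in_sandbox(model, dataset) -> float:
    # Placeholder for an isolated evaluation run by a neutral third party.
    return sum(1 for x, y in dataset if model(x) == y) / len(dataset)

def run_benchmark(model, dataset, trust: TrustLevel) -> float:
    """Dispatch the evaluation according to the declared trust level."""
    if trust is TrustLevel.MUTUAL_TRUST:
        # Model and data can be co-located; evaluate directly.
        return sum(1 for x, y in dataset if model(x) == y) / len(dataset)
    if trust is TrustLevel.TRUSTED_EVALUATOR:
        # Both artifacts are handed to a neutral party's sandbox.
        return evaluate_in_sandbox(model, dataset)
    # NO_TRUST would require cryptographic techniques such as secure
    # multi-party computation; out of scope for this sketch.
    raise NotImplementedError("no-trust evaluation needs MPC or similar")

# Toy usage: a "model" that predicts the parity of an integer.
toy_model = lambda x: x % 2
toy_data = [(1, 1), (2, 0), (3, 1), (4, 0)]
print(run_benchmark(toy_model, toy_data, TrustLevel.MUTUAL_TRUST))  # 1.0
```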
The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area we are actively working on. This repository is actively maintained, and new features are continuously being added.
Evaluate open-source language models on agent use, formatted output, instruction following, long-text, multilingual, coding, and custom-task capabilities.
Evaluating and enhancing Large Language Models (LLMs) on mathematical datasets through an innovative Multi-Agent Debate architecture, without traditional fine-tuning or Retrieval-Augmented Generation (RAG) techniques. This project explores advanced strategies to boost LLM capabilities in mathematical reasoning.
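A minimal sketch of one generic multi-agent debate loop, assuming agents are simple prompt-to-answer callables and that final answers are aggregated by majority vote; the prompt wording and the voting rule are assumptions, not this project's exact protocol:

```python
from collections import Counter
from typing import Callable, List

Agent = Callable[[str], str]  # an agent maps a prompt to an answer string

def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a simple multi-agent debate and return the majority answer.

    Each round, every agent sees the other agents' previous answers and
    may revise its own; after the final round, answers are aggregated by
    majority vote.
    """
    answers = [agent(question) for agent in agents]
    for _ in range(rounds):
        revised = []
        for i, agent in enumerate(agents):
            others = [a for j, a in enumerate(answers) if j != i]
            prompt = (f"{question}\nOther agents answered: {others}.\n"
                      "Reconsider and give your final answer.")
            revised.append(agent(prompt))
        answers = revised
    return Counter(answers).most_common(1)[0][0]

# Toy usage with stub agents that always give the same answer:
if __name__ == "__main__":
    stub = lambda prompt: "42"
    print(debate("What is 6 * 7?", [stub, stub, stub]))  # -> "42"
```

The appeal of this setup is that improvement comes from inference-time interaction between agents rather than from updating model weights, which is why no fine-tuning or retrieval is required.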