Part of our final-year project work on complex NLP tasks, with experiments across various datasets and different LLMs
Evaluating and enhancing Large Language Models (LLMs) on mathematical datasets through an innovative Multi-Agent Debate architecture, without traditional fine-tuning or Retrieval-Augmented Generation techniques. This project explores advanced strategies to boost LLM capabilities in mathematical reasoning.
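A minimal sketch of what a multi-agent debate loop can look like for math questions follows; the `ask_model` helper, agent count, rounds, and prompts are illustrative assumptions, not this project's actual implementation.

```python
# Hedged sketch of a multi-agent debate round for math questions.
# `ask_model` is a hypothetical stand-in for any chat-completion call.

def ask_model(prompt: str) -> str:
    """Placeholder: replace with a real LLM API call."""
    return "42"  # canned answer for illustration

def debate(question: str, n_agents: int = 3, n_rounds: int = 2) -> str:
    # Each agent first answers independently.
    answers = [ask_model(f"Solve step by step: {question}") for _ in range(n_agents)]
    for _ in range(n_rounds):
        # Each agent then sees the others' answers and may revise its own.
        answers = [
            ask_model(
                f"Question: {question}\n"
                f"Other agents answered: {[a for j, a in enumerate(answers) if j != i]}\n"
                f"Your previous answer: {answers[i]}\n"
                "Reconsider and give a final numeric answer."
            )
            for i in range(n_agents)
        ]
    # Simple aggregation rule: majority vote over the final answers.
    return max(set(answers), key=answers.count)

print(debate("What is 6 * 7?"))
```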
See the top Artificial Intelligence projects based on real use cases. 😃 Why wait more when you have everything in one place? 😎
LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs). However, it is essential to recognize that these benchmarks have their own limitations. This interactive tool engages users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them.
Code and data for the paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"
Evaluate open-source language models on agent, formatted output, instruction following, long text, multilingual, coding, and custom task capabilities.
A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.
Source code for the accepted paper in ICSE-NIER'24: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.
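One common way to probe a generated regex for ReDoS-style catastrophic backtracking is to time it on a crafted worst-case input; the sketch below illustrates that general idea only, with a made-up time budget and attack string, and is not the paper's evaluation pipeline (a real harness would run the match in a separate process with a hard timeout, since `re.match` cannot be interrupted mid-match).

```python
# Hedged sketch: flag a regex as ReDoS-prone if matching a crafted input
# exceeds a time budget. Threshold and attack string are illustrative.
import re
import time

def redos_probe(pattern: str, attack_input: str, limit_s: float = 1.0) -> bool:
    """Return True if matching exceeds the time budget (possible ReDoS)."""
    start = time.perf_counter()
    re.match(pattern, attack_input)
    return (time.perf_counter() - start) > limit_s

# Classic vulnerable shape: nested quantifiers plus a non-matching suffix.
print(redos_probe(r"^(a+)+$", "a" * 24 + "!"))
```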
Needle-in-a-haystack evaluation for LLMs
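A minimal sketch of the needle-in-a-haystack idea, assuming a generic chat-completion call (`ask_model`, the filler text, and the needle are hypothetical placeholders, not this repository's API): hide a known fact at varying depths inside long filler text and check whether the model retrieves it.

```python
# Hedged sketch of a needle-in-a-haystack probe.

def ask_model(prompt: str) -> str:
    return "The secret code is 7491."  # placeholder for a real completion call

NEEDLE = "The secret code is 7491."
FILLER = "The sky was clear and the market was quiet that day. " * 2000

def run_probe(depth: float) -> bool:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) and test recall."""
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    answer = ask_model(haystack + "\n\nWhat is the secret code?")
    return "7491" in answer

for d in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"depth={d:.2f} retrieved={run_probe(d)}")
```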
Evaluation of Language Models in Non-English Languages
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
This repository contains a list of benchmarks used by major organizations to evaluate their LLMs.
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
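For context, a minimal sketch of one way to compare chain-of-thought against direct prompting: score the same questions with and without a "think step by step" instruction and compare accuracy. The `ask_model` call, toy dataset, and scoring rule are illustrative placeholders, not this framework's API.

```python
# Hedged sketch of a chain-of-thought vs. direct-prompting comparison.

def ask_model(prompt: str) -> str:
    return "4"  # stand-in for a real LLM completion

DATASET = [("What is 2 + 2?", "4"), ("What is 3 * 3?", "9")]

def accuracy(use_cot: bool) -> float:
    hits = 0
    for question, gold in DATASET:
        prompt = question + (" Let's think step by step." if use_cot else "")
        answer = ask_model(prompt)
        hits += gold in answer  # crude substring scoring for illustration
    return hits / len(DATASET)

print("direct:", accuracy(use_cot=False))
print("chain-of-thought:", accuracy(use_cot=True))
```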
The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, areas we are actively working on. This repository is actively maintained, and new features are continuously being added.
Join 15k builders in the Real-World ML Newsletter ⬇️⬇️⬇️
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Restore safety in fine-tuned language models through task arithmetic
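A minimal sketch of the general task-arithmetic idea, assuming the common formulation (a "safety vector" is the difference between aligned and base weights, added back into the fine-tuned model); the model names and scaling factor below are illustrative assumptions, not taken from this repository.

```python
# Hedged sketch of task arithmetic for restoring safety after fine-tuning.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model")           # placeholder: pre-trained
aligned = AutoModelForCausalLM.from_pretrained("aligned-model")     # placeholder: safety-aligned
finetuned = AutoModelForCausalLM.from_pretrained("task-finetuned")  # placeholder: downstream fine-tune

alpha = 0.5  # scaling factor for the safety vector (illustrative)
with torch.no_grad():
    for p_ft, p_base, p_aligned in zip(
        finetuned.parameters(), base.parameters(), aligned.parameters()
    ):
        p_ft.add_(alpha * (p_aligned - p_base))  # add the "safety" task vector back

finetuned.save_pretrained("task-finetuned-safety-restored")
```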