Part of our final-year project work on complex NLP tasks, with experiments across various datasets and different LLMs
Evaluating and enhancing Large Language Models (LLMs) on mathematical datasets through an innovative Multi-Agent Debate architecture, without traditional fine-tuning or Retrieval-Augmented Generation techniques. This project explores advanced strategies to boost LLM capabilities in mathematical reasoning.
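A minimal sketch of what a multi-agent debate loop can look like for math questions follows; the `ask_model` helper, agent count, rounds, and prompts are illustrative assumptions, not this project's actual implementation.

```python
# Hedged sketch of a multi-agent debate round for math questions.
# `ask_model` is a hypothetical stand-in for any chat-completion call.

def ask_model(prompt: str) -> str:
    """Placeholder: replace with a real LLM API call."""
    return "42"  # canned answer for illustration

def debate(question: str, n_agents: int = 3, n_rounds: int = 2) -> str:
    # Each agent first answers independently.
    answers = [ask_model(f"Solve step by step: {question}") for _ in range(n_agents)]
    for _ in range(n_rounds):
        # Each agent then sees the others' answers and may revise its own.
        answers = [
            ask_model(
                f"Question: {question}\n"
                f"Other agents answered: {[a for j, a in enumerate(answers) if j != i]}\n"
                f"Your previous answer: {answers[i]}\n"
                "Reconsider and give a final numeric answer."
            )
            for i in range(n_agents)
        ]
    # Simple aggregation rule: majority vote over the final answers.
    return max(set(answers), key=answers.count)

print(debate("What is 6 * 7?"))
```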
See the top Artificial Intelligence projects based on real use cases. 😃 Why wait more when you have everything in one place? 😎
LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs). However, it is essential to recognize that these benchmarks have their own limitations. This interactive tool engages users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them.
Code and data for the paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"
Evaluate open-source language models on agent, formatted output, instruction following, long text, multilingual, coding, and custom task capabilities.
A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.
Source code for the accepted paper in ICSE-NIER'24: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.
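One common way to probe a generated regex for ReDoS-style catastrophic backtracking is to time it on a crafted worst-case input; the sketch below illustrates that general idea only, with a made-up time budget and attack string, and is not the paper's evaluation pipeline (a real harness would run the match in a separate process with a hard timeout, since `re.match` cannot be interrupted mid-match).

```python
# Hedged sketch: flag a regex as ReDoS-prone if matching a crafted input
# exceeds a time budget. Threshold and attack string are illustrative.
import re
import time

def redos_probe(pattern: str, attack_input: str, limit_s: float = 1.0) -> bool:
    """Return True if matching exceeds the time budget (possible ReDoS)."""
    start = time.perf_counter()
    re.match(pattern, attack_input)
    return (time.perf_counter() - start) > limit_s

# Classic vulnerable shape: nested quantifiers plus a non-matching suffix.
print(redos_probe(r"^(a+)+$", "a" * 24 + "!"))
```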
Needle-in-a-haystack evaluation for LLMs
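A minimal sketch of the needle-in-a-haystack idea, assuming a generic chat-completion call (`ask_model`, the filler text, and the needle are hypothetical placeholders, not this repository's API): hide a known fact at varying depths inside long filler text and check whether the model retrieves it.

```python
# Hedged sketch of a needle-in-a-haystack probe.

def ask_model(prompt: str) -> str:
    return "The secret code is 7491."  # placeholder for a real completion call

NEEDLE = "The secret code is 7491."
FILLER = "The sky was clear and the market was quiet that day. " * 2000

def run_probe(depth: float) -> bool:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) and test recall."""
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    answer = ask_model(haystack + "\n\nWhat is the secret code?")
    return "7491" in answer

for d in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"depth={d:.2f} retrieved={run_probe(d)}")
```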
Evaluation of Language Models in Non-English Languages
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
This repository contains a list of benchmarks used by major organizations to evaluate their LLMs.
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
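For context, a minimal sketch of one way to compare chain-of-thought against direct prompting: score the same questions with and without a "think step by step" instruction and compare accuracy. The `ask_model` call, toy dataset, and scoring rule are illustrative placeholders, not this framework's API.

```python
# Hedged sketch of a chain-of-thought vs. direct-prompting comparison.

def ask_model(prompt: str) -> str:
    return "4"  # stand-in for a real LLM completion

DATASET = [("What is 2 + 2?", "4"), ("What is 3 * 3?", "9")]

def accuracy(use_cot: bool) -> float:
    hits = 0
    for question, gold in DATASET:
        prompt = question + (" Let's think step by step." if use_cot else "")
        answer = ask_model(prompt)
        hits += gold in answer  # crude substring scoring for illustration
    return hits / len(DATASET)

print("direct:", accuracy(use_cot=False))
print("chain-of-thought:", accuracy(use_cot=True))
```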
The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, areas we are actively working on. This repository is actively maintained, and new features are continuously being added.
Join 15k builders in the Real-World ML Newsletter ⬇️⬇️⬇️
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Restore safety in fine-tuned language models through task arithmetic
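A minimal sketch of the general task-arithmetic idea, assuming the common formulation (a "safety vector" is the difference between aligned and base weights, added back into the fine-tuned model); the model names and scaling factor below are illustrative assumptions, not taken from this repository.

```python
# Hedged sketch of task arithmetic for restoring safety after fine-tuning.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model")           # placeholder: pre-trained
aligned = AutoModelForCausalLM.from_pretrained("aligned-model")     # placeholder: safety-aligned
finetuned = AutoModelForCausalLM.from_pretrained("task-finetuned")  # placeholder: downstream fine-tune

alpha = 0.5  # scaling factor for the safety vector (illustrative)
with torch.no_grad():
    for p_ft, p_base, p_aligned in zip(
        finetuned.parameters(), base.parameters(), aligned.parameters()
    ):
        p_ft.add_(alpha * (p_aligned - p_base))  # add the "safety" task vector back

finetuned.save_pretrained("task-finetuned-safety-restored")
```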