What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
How good are LLMs at chemistry?
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Restore safety in fine-tuned language models through task arithmetic
Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Join 15k builders reading the Real-World ML Newsletter ⬇️⬇️⬇️
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
This repository contains a list of benchmarks used by big orgs to evaluate their LLMs.
Needle-in-a-haystack testing for LLMs
Source code for the accepted paper in ICSE-NIER'24: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.
Code and data for the paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"
LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs), but these benchmarks have their own limitations. This interactive tool engages users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them.
Evaluation of Language Models in Non-English Languages
Part of our final year project work involving complex NLP tasks along with experimentation on various datasets and different LLMs
Evaluating and enhancing Large Language Models (LLMs) on mathematical datasets through an innovative Multi-Agent Debate architecture, without traditional fine-tuning or Retrieval-Augmented Generation. This project explores advanced strategies for boosting LLM capabilities in mathematical reasoning.
Evaluate open-source language models on agent use, formatted output, instruction following, long text, multilingual, coding, and custom-task capabilities.