Two Word Test

Combinatorial Semantics Benchmark for Large Language Models (LLMs)

Nicholas Riccardi, Xuan Yang, and Rutvik Desai - University of South Carolina Department of Psychology

LLMs sometimes struggle with word-order effects and compositional (or combinatorial) language processes, especially when surrounding context is absent. Here, we provide the Two Word Test, a series of functions that compares LLM meaningfulness judgments of simple two word phrases to meaningfulness judgments made by humans (Graves et al., 2013; https://doi.org/10.3758/s13428-012-0256-3).

We provide a variety of statistical methods to quantify LLM performance and compare it to human performance. We test OpenAI's GPT-4 and GPT-3.5-turbo, and Google's Bard. Briefly, we find that GPT-3.5 and Bard fail dramatically at judging the meaningfulness of simple two word phrases without context. GPT-4 performs substantially better, but still fails in certain circumstances, especially when asked to make continuous instead of binary judgments.

Using the Two Word Test

To gather LLM meaningfulness ratings, we used the prompts detailed in our manuscript (closely mirroring the prompt used by Graves et al., 2013 to gather human ratings, but providing more examples for the LLMs). We collected binary and continuous ratings for all models. two-word-test.ipynb can be run as-is, taking LLM_ratings.csv and graves_2013.csv as input. LLM_ratings.csv can be updated by adding the ratings from other LLMs, which must then be specified in the models list within two-word-test.ipynb. A description of each function's purpose and brief explanations of the statistical tests can be found within.

Comments and questions can be posted in the discussion, or emailed to riccardn@email.sc.edu

Scripts

graves-gpt-api.ipynb

Sample query to GPT. Prompts and word lists can be edited to suit an experimenter's needs.

LLM_ratings.csv

Once ratings are collected from an LLM, they should be formatted identically to this document. This .csv is fed into two-word-test.ipynb

graves_2013.csv

Human ratings collected by Graves. Also includes similarity metrics for each phrase from some popular word embedding models. This .csv is also fed into two-word-test.ipynb

two-word-test.ipynb

The statistical tests used to compare LLM responses to humans. If changes are made to LLM_ratings.csv or graves_2013.csv, this script will have to be edited accordingly.

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
Generate_Stim_shuffled.ipynb		Generate_Stim_shuffled.ipynb
Generate_predictions_shuffled_GPT.ipynb		Generate_predictions_shuffled_GPT.ipynb
Generate_predictions_shuffled_claude.ipynb		Generate_predictions_shuffled_claude.ipynb
Generate_predictions_shuffled_gemini.ipynb		Generate_predictions_shuffled_gemini.ipynb
Graves2013.pdf		Graves2013.pdf
LICENSE.md		LICENSE.md
LLM_ratings.csv		LLM_ratings.csv
README.md		README.md
graves_2013.csv		graves_2013.csv
instruction_continuous.txt		instruction_continuous.txt
instruction_discrete.txt		instruction_discrete.txt
two-word-test_shuffle.ipynb		two-word-test_shuffle.ipynb

License

NickRiccardi/two-word-test

Folders and files

Latest commit

History

Repository files navigation

Two Word Test

Combinatorial Semantics Benchmark for Large Language Models (LLMs)

Nicholas Riccardi, Xuan Yang, and Rutvik Desai - University of South Carolina Department of Psychology

Using the Two Word Test

Scripts

graves-gpt-api.ipynb

LLM_ratings.csv

graves_2013.csv

two-word-test.ipynb

About

Topics

Resources

License

Stars

Watchers

Forks

Languages