Aligning Large Language Models on Information Extraction

Paper | Pretrained Models

We introduce ADELIE (Aligning large language moDELs on Information Extraction), an aligned LLM that effectively solves various IE tasks, including closed IE, open IE, and on-demand IE. We first collect and construct IEInstruct, a high-quality alignment corpus for IE. We then train ADELIE-SFT on IEInstruct with instruction tuning, and further train it with the direct preference optimization (DPO) objective, resulting in ADELIE-DPO. Extensive experiments on various held-out IE datasets demonstrate that both models (ADELIE-SFT and ADELIE-DPO) achieve state-of-the-art (SoTA) performance among open-source models. We further explore the general capabilities of ADELIE, and the experimental results reveal no noticeable decline in general capabilities.

An inference example
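
As a minimal inference sketch with 🤗 Transformers, assuming the Hub identifiers THU-KEG/ADELIE-SFT and THU-KEG/ADELIE-DPO; the prompt below is an illustrative closed-IE instruction, not necessarily the exact template used in training:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "THU-KEG/ADELIE-SFT"  # or "THU-KEG/ADELIE-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

# Illustrative IE-style prompt; the exact instruction format may differ.
prompt = "Extract all (head entity, relation, tail entity) triples from the following text:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))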

Installation

The code repository is based on PyTorch and Transformers. Please use the following command to install the necessary dependencies:

pip install -r requirements.txt

Pretrained models

We release two ADELIE models based on LLaMA-2 (7B). The models are available on the 🤗HuggingFace Hub.

| Model | IE Average F1 (%) | General Average Score (%) | 🤗HuggingFace Hub |
| ---------- | ---- | ---- | ---------- |
| ADELIE-SFT | 47.5 | 53.5 | ADELIE-SFT |
| ADELIE-DPO | 47.7 | 53.8 | ADELIE-DPO |

Generate the ADELIE dataset

ADELIE-SFT is trained on IEInstruct and further trained with the direct preference optimization (DPO) objective on IEFeedback, resulting in ADELIE-DPO.
Among our training and testing tasks, the copyright of TACRED, ACE 2005, and RichERE belongs to LDC, and we access them through our LDC membership. All the other datasets are open-source, and we strictly adhere to their licenses.
We remove the non-open-source datasets from IEInstruct and IEFeedback and make these two training datasets public. You can download the data from ADELIE Datasets.

IEInstruct

To access the full version of IEInstruct and the evaluation dataset, first prepare the entire raw dataset as described in the data/Readme.md file, then proceed with the following instructions:

# Generate a unified data format
sh ./scripts/generate_unified_data.sh

# Generate the IEInstruct mixture
sh ./scripts/generate_mixtural_train_data.sh
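
After these steps, each training instance is an instruction-style record. As a rough illustration only, the field names and values below are assumptions for exposition, not the repository's exact schema:

# Hypothetical shape of one IEInstruct record; see the generated files for the real schema.
record = {
    "instruction": "Extract all entities of type PERSON and ORGANIZATION from the text.",
    "input": "Steve Jobs co-founded Apple in 1976.",
    "output": "PERSON: Steve Jobs; ORGANIZATION: Apple",
}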

IEFeedback

# Generate sampled data
sh ./scripts/generate_dpo_sample_data.sh

# Sample outputs from ADELIE-SFT
sh ./train4llama/scripts/predict.sh

# Generate the IEFeedback mixture
sh ./scripts/generate_mixtural_dpo_data.sh
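
IEFeedback consists of preference pairs built from ADELIE-SFT samples. A hypothetical sketch of one pair follows; the field names are illustrative assumptions, not the repository's exact schema:

# Hypothetical shape of one IEFeedback preference pair.
pair = {
    "prompt": "Extract all relations from the text: ...",
    "chosen": "...",    # higher-quality extraction, preferred under the ranking criterion
    "rejected": "...",  # lower-quality extraction sampled from ADELIE-SFT
}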

Model training

First, you need to generate the ADELIE dataset.

Second, you can train ADELIE-SFT and ADELIE-DPO by running the following command.

# ADELIE-SFT: 
sh train4llama/scripts/finetune_with_accelerate.sh

# ADELIE-DPO: 
sh train4llama/scripts/dpo_train_with_accelerate.sh

Please note that the training data for DPO includes generations from ADELIE-SFT. Therefore, after completing ADELIE-SFT training, you need to generate the DPO training data following the IEFeedback generation steps above.
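
For intuition, the DPO objective trains the policy to prefer chosen over rejected responses relative to a frozen reference model (here, ADELIE-SFT). A minimal PyTorch sketch of the standard DPO loss, assuming per-sequence log-probabilities have already been computed; this illustrates the objective, not the repository's exact training code:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy against the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO maximizes the margin between chosen and rejected log-ratios
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()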

Our training code is based on open-instruct.

Evaluation

We have publicly released the preprocessed test datasets for evaluating IE capabilities, excluding the RichERE dataset. Execute the following command to run the IE evaluation.

Note: for the on-demand IE and open IE datasets, you first need to download the raw data from ODIE and ROBUST, respectively, and place them in the data directory before running the evaluation.

sh ./train4llama/scripts/eval.sh
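
The reported IE scores are F1-based. As a rough illustration of micro-F1 over extracted items (an assumption about the metric's general shape, not the repository's exact evaluation code):

def micro_f1(gold, pred):
    # gold, pred: one set of extracted items (e.g., triples) per example
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0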

Citation

@misc{qi2024adelie,
      title={ADELIE: Aligning Large Language Models on Information Extraction}, 
      author={Yunjia Qi and Hao Peng and Xiaozhi Wang and Bin Xu and Lei Hou and Juanzi Li},
      year={2024},
      eprint={2405.05008},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
