A collection of model and dataset projects for instruction-following large language models. These works may point toward the future direction of natural language processing research and industry.
✕ means "not available".
We welcome updates and suggestions.
Project | Model | Data | License | Note |
---|---|---|---|---|
Stanford Alpaca | LLaMA | alpaca | Apache 2.0 | Data is generated with text-davinci-003. |
AlpacaDataCleaned | / | alpaca cleaned | Apache 2.0 | A cleaned and curated version of the Alpaca dataset. |
ChatGLM | GLM | ✕ | Apache 2.0 | A bilingual language model based on the GLM framework. |
flan-alpaca | Flan-T5 | alpaca | Apache 2.0 | Extends Stanford Alpaca synthetic instruction tuning to Flan-T5. |
Dolly | Pythia | databricks-dolly-15k | MIT | An instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. |
Vicuna | LLaMA | ShareGPT filtered, ✕ | Apache 2.0 | User-shared conversations gathered from ShareGPT.com; low-quality samples are filtered out. |
ShareGPT unfiltered | / | ShareGPT unfiltered | Apache 2.0 | Conversations gathered from ShareGPT.com without filtering. |
GPTeacher | / | GPTeacher | MIT | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer. |
GPT-4-LLM | LLaMA | alpaca gpt4 data | Apache 2.0 | Instruction-following data generated by GPT-4. |
Luotuo | LLaMA, ChatGLM, ... | unknown | Apache 2.0 | A set of Chinese instruction-following large language models. |
Chinese-Vicuna | LLaMA | BELLE | Apache 2.0 | A Chinese instruction-following large language model trained with LoRA. |
BELLE | / | BELLE | Apache 2.0 | Chinese instruction data generated by ChatGPT using a method similar to Alpaca's. |
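
Many of the datasets above (alpaca, alpaca gpt4 data, BELLE, and their derivatives) share a similar record shape. As a minimal sketch — assuming the common `{"instruction", "input", "output"}` JSON schema popularized by Stanford Alpaca; check each dataset's own card for its exact fields and prompt template — a record can be rendered into a training prompt like this:

```python
# Sketch of an Alpaca-style instruction record and a prompt renderer.
# The field names and prompt wording are assumptions modeled on the
# common Alpaca schema, not the canonical template of any one project.
record = {
    "instruction": "Translate the sentence to French.",
    "input": "Hello, world.",
    "output": "Bonjour, le monde.",
}

def format_prompt(rec):
    """Render one record into a single prompt string.

    Records with a non-empty "input" get an extra Input section;
    records without one use the shorter instruction-only layout.
    """
    if rec.get("input"):
        return (
            "Below is an instruction that describes a task, "
            "paired with an input that provides further context.\n\n"
            f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n"
            f"### Response:\n{rec['output']}"
        )
    return (
        "Below is an instruction that describes a task.\n\n"
        f"### Instruction:\n{rec['instruction']}\n\n"
        f"### Response:\n{rec['output']}"
    )

print(format_prompt(record))
```

Datasets such as databricks-dolly-15k or the ShareGPT conversation dumps use different schemas (e.g. multi-turn conversation lists), so a loader usually needs one such renderer per source format.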