Vicuna-Chemical-Expert

Update Logs

2023.8.15: Create Chatbot_v2
- Add features: LangChain, ChromaDB (VectorDB)
- Add toggle switch for searching hydrogen papers
2023.7.30: Create Chatbot_v1
- Add multiple models: Chemical, Physics, Mathematics
- Create Streamlit app

Introduction

This is the repo for Vicuna Chemical Expert, a model that can help answer chemistry questions. It was finetuned from the sharded version of lmsys/vicuna-7b-v1.3 and can be trained on 4x V100 32GB GPUs.

Finetune

Use QLoRA tuning with PEFT
Vicuna 7B was finetuned on data from the chemistry and chemical industry domains.
Trainable parameters during training:

trainable params	all params	trainable%
13107200	6685086720	0.1961
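
The percentage in the table follows directly from the two parameter counts. A minimal check, using only the numbers reported above (the variable names are illustrative):

```python
# Parameter counts as reported in the table above.
trainable_params = 13_107_200
all_params = 6_685_086_720

# Share of parameters that are updated during finetuning, in percent.
trainable_pct = 100 * trainable_params / all_params
print(f"trainable%: {trainable_pct:.4f}")
```

Only about 0.2% of the weights are trained, which is why the adapter is small compared with the base model.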

When training is done, merge the LoRA weights back into the base model.
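
The merge step could look roughly like the sketch below. It is not necessarily what merge.py in this repo does. The function name, its arguments, and the paths in the usage comment are hypothetical; the calls themselves (`PeftModel.from_pretrained`, `merge_and_unload`, `save_pretrained`) come from PEFT and Transformers.

```python
def merge_lora_into_base(base_model_id: str, adapter_dir: str, output_dir: str) -> None:
    """Fold LoRA adapter weights into the base model and save a standalone checkpoint."""
    # Imported lazily so the heavy dependencies are only needed when the function runs.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model in half precision (not 4-bit) so the adapter
    # can be merged into ordinary dense weights.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)

    # Attach the trained adapter, then fold its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()

    # Save the merged model together with the tokenizer.
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)


# Example call (hypothetical paths):
# merge_lora_into_base("lmsys/vicuna-7b-v1.3", "./lora-adapter", "./vicuna-7b-merged")
```

After merging, the saved checkpoint can be loaded with plain Transformers, without PEFT installed.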
Below is the finetuning train/loss graph (Weights & Biases):

HuggingFace for Chemical: FelixChao/vicuna-7B-chemical
HuggingFace for Coder: FelixChao/vicuna-33b-coder

Setup

To run inference with this model on your local machine:

Create a development environment and activate it

conda create -n vicuna-chemical python=3.10 
conda activate vicuna-chemical

Download Chatbot_v1 or Chatbot_v2
Install dependencies

pip install -r requirements.txt

Run streamlit app

streamlit run app.py

Note: Make sure your GPU has enough memory (at least 16GB) to load the model; otherwise you may hit a CUDA Out Of Memory (OOM) error.
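
A back-of-the-envelope estimate shows where a figure of this size comes from. The sketch assumes roughly 6.7 billion parameters stored in 16-bit precision; both numbers are assumptions for illustration, not measurements from this repo.

```python
# Rough estimate of the GPU memory needed just to hold the model weights.
# Assumptions: about 6.7 billion parameters (a 7B LLaMA-family model),
# stored in 16-bit precision, i.e. 2 bytes per parameter.
num_params = 6.7e9
bytes_per_param = 2

weights_gib = num_params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")
```

Activations and the key/value cache used during generation come on top of the weights, which is consistent with the 16GB recommendation.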

Inference

Model Inference

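from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the finetuned tokenizer and model; device_map="auto" places the weights on the available GPU(s)
tokenizer = AutoTokenizer.from_pretrained("FelixChao/vicuna-7B-chemical")
model = AutoModelForCausalLM.from_pretrained("FelixChao/vicuna-7B-chemical", device_map="auto")

# example_text is a placeholder question; replace it with your own prompt
example_text = "What is the chemical formula of water?"
# Move the inputs to the same device as the model instead of hard-coding "cuda:0"
encoding = tokenizer(example_text, return_tensors="pt").to(model.device)
output = model.generate(input_ids=encoding.input_ids, attention_mask=encoding.attention_mask, max_new_tokens=512, do_sample=True, eos_token_id=tokenizer.eos_token_id)
predict = tokenizer.decode(output[0], skip_special_tokens=True)
print(predict)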
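
Vicuna models are usually prompted with a conversation template rather than a bare question. The sketch below builds `example_text` in the layout of the Vicuna v1.1+ template. Whether this finetuned model expects exactly this layout is an assumption, and `build_vicuna_prompt` is a hypothetical helper, not part of this repo.

```python
# Assumption: the prompt layout of the Vicuna v1.1+ conversation template.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_vicuna_prompt(question: str) -> str:
    """Wrap a single user question in a Vicuna-style conversation prompt."""
    return f"{SYSTEM_PROMPT} USER: {question.strip()} ASSISTANT:"


example_text = build_vicuna_prompt("What is the molar mass of water?")
print(example_text)
```

The resulting string can be passed to the tokenizer in place of a raw question.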

Pipeline inference (faster)

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="FelixChao/vicuna-7B-chemical")
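
Once the pipeline is created, it can be called directly on a prompt. The helper below is a small sketch of that call; `generate_answer` is a hypothetical name, and it relies on the text-generation pipeline returning a list of dicts with a `generated_text` key that, by default, includes the prompt.

```python
from typing import Callable


def generate_answer(pipe: Callable, prompt: str, max_new_tokens: int = 512) -> str:
    """Run a text-generation pipeline and return only the newly generated text."""
    outputs = pipe(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    full_text = outputs[0]["generated_text"]
    # By default the pipeline returns the prompt followed by the completion,
    # so strip the prompt if it is present.
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    return full_text.strip()


# Usage with the `pipe` object created above:
# answer = generate_answer(pipe, "What is the chemical formula of table salt?")
```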

Vector Database

ChromaDB and LangChain are used to run 4 similarity searches over the (hydrogen) papers.
This compensates for gaps in the dataset on which the base model was trained, effectively creating an augmented dataset.
Below is a demo that shows the difference between answers with and without the vector database.

Blue Hydrogen Problem

With ChromaDB ✅:

Without ChromaDB ❌:

Hydrogen Colors Problem

With ChromaDB ✅:

Without ChromaDB ❌:

These examples use recent data, and they show that the finetuned model gives better answers when it is connected to the vector database than when it is not.
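
The retrieval step described in this section can be sketched as follows. This is an illustration, not the code used in Chatbot_v2: it assumes that the "4 similarity searches" mean fetching the 4 most similar passages, that `store` behaves like a LangChain vector store such as Chroma (a `similarity_search(query, k=...)` method returning documents with a `page_content` attribute), and the prompt wording is hypothetical.

```python
def build_augmented_prompt(store, question: str, k: int = 4) -> str:
    """Prepend the k most similar passages from a vector store to the question."""
    # `store` is expected to expose the LangChain vector store interface.
    docs = store.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Hypothetical prompt wording; adjust it to the model's prompt format.
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The returned string is then sent to the finetuned model instead of the bare question, so the answer can draw on passages the model never saw during training.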

Demo

💡 When asked a question about photosynthesis, the model also gives the corresponding chemical formula.

Reference

This project is based on the open-source projects below, and I am very grateful to all the researchers and developers.

🧠 Base Model: Vicuna-7b-v1.3 by LMSYS and Vicuna-7b-v1.3-sharded-bf16 by CleverShovel
🎓 Finetune: Parameter-Efficient Fine-Tuning (PEFT) from Hugging Face and LoRA: Low-Rank Adaptation of Large Language Models from Microsoft
📚 Dataset: andersonbcdefg/chemistry by andersonbcdefg
🚀 Front-End: Streamlit
