
GPU Allocation Issue (QLoRa + Llama3-8B-IT) #1716

Open
2 of 4 tasks
DONGRYEOLLEE1 opened this issue May 8, 2024 · 1 comment
Comments


DONGRYEOLLEE1 commented May 8, 2024

System Info

peft: 0.10.1.dev0
accelerate: 0.30.0
bitsandbytes: 0.43.1
transformers: 4.39.3
GPU: A6000 * 2 (48GB each, 96GB total)
nvidia-driver version: 535.171.04
cuda: 11.8

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

I was training a Llama3-8B-IT model with QLoRA. The training itself succeeded, but memory was not allocated evenly across the two GPUs (see the nvidia-smi output below). Is this a version issue with peft or transformers, or with the NVIDIA driver? On a previous A100*8 server the model was spread evenly across the GPUs, so I don't know what is causing the imbalance in this case.

This is my script.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# QLoRA: 4-bit NF4 quantization with bf16 compute and double quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, 
    bnb_4bit_compute_dtype = torch.bfloat16, 
    bnb_4bit_quant_type = "nf4", 
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id

# device_map="auto" lets accelerate shard the model across both GPUs
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")

proc_data = data.map(process, remove_columns = data['train'].column_names)

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 16,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

train_args_trainer = TrainingArguments(
    num_train_epochs = 3,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-7b-4bit-lora-test2",
    optim = "paged_adamw_32bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-7b-4bit-lora-test2",
    remove_unused_columns = False
)

# Mark the model as model-parallel so the Trainer does not wrap it in DataParallel
model.is_parallelizable = True
model.model_parallel = True

trainer = Trainer(
    model = model,
    tokenizer = tok,
    args = train_args_trainer,
    train_dataset = tokenized_proc_data['train'],
    data_collator = DataCollatorForLanguageModeling(tok, mlm = False)
)

trainer.train()

nvidia-smi during training:
Wed May  8 06:54:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+
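
To quantify the imbalance over the course of training rather than from a single nvidia-smi snapshot, a small callback could be attached to the Trainer. This is only a sketch; GPUMemoryCallback and its output format are illustrative, not part of the original script.

import torch
from transformers import TrainerCallback

class GPUMemoryCallback(TrainerCallback):
    # Print per-GPU memory at every logging step to see whether the imbalance
    # is present from step 0 or grows during training.
    def on_log(self, args, state, control, logs=None, **kwargs):
        for i in range(torch.cuda.device_count()):
            alloc = torch.cuda.memory_allocated(i) / 2**20
            reserved = torch.cuda.memory_reserved(i) / 2**20
            print(f"step {state.global_step} | GPU {i}: "
                  f"{alloc:.0f} MiB allocated, {reserved:.0f} MiB reserved")

# usage, with the trainer from the script above:
# trainer.add_callback(GPUMemoryCallback())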

Expected behavior

I expect memory to be allocated evenly across both GPUs.
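
One knob that may help if the imbalance comes from how device_map="auto" splits the weights (not a confirmed fix; the per-GPU budgets below are illustrative assumptions, and MODEL_ID / quantization_config are the same as in the script above): pass an explicit max_memory per device and inspect the resulting placement.

# Sketch: cap how much each GPU may receive so automatic placement cannot
# over-fill one card, then check which modules landed where.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = "auto",
    max_memory = {0: "20GiB", 1: "20GiB"},  # illustrative values, not recommendations
)
print(model.hf_device_map)  # mapping of modules to GPU indices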

@BenjaminBossan
Member

Hmm, hard to say, and I can't easily reproduce this. Do you already see strange behavior after loading the model, before starting training? If you try without PEFT, do you see the same issue? (If you don't have enough memory without PEFT, you could e.g. turn off autograd on most of the layers to "simulate" parameter-efficient fine-tuning.) If yes, this could be an accelerate issue.
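
A minimal sketch of the two checks suggested above, assuming the same model and quantization_config as in the original script; the choice to leave only lm_head trainable is one illustrative way to "turn off autograd on most of the layers", not a prescribed recipe.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True,
)

# Check 1: is the split already uneven right after loading, before any training?
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config = quantization_config,
    device_map = "auto",
)
print(model.hf_device_map)

# Check 2: skip get_peft_model entirely and freeze everything except the
# (unquantized) lm_head, so only a small slice is trainable, as a rough
# stand-in for parameter-efficient fine-tuning. Then rerun the same Trainer setup.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("lm_head")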
