
GPU Allocation Issue (QLoRa + Llama3-8B-IT) #1716

Open
2 of 4 tasks
DONGRYEOLLEE1 opened this issue May 8, 2024 · 1 comment
Comments


DONGRYEOLLEE1 commented May 8, 2024

System Info

peft: 0.10.1.dev0
accelerate: 0.30.0
bitsandbytes: 0.43.1
transformers: 4.39.3
GPU: A6000 * 2 (48GB each, 96GB total)
nvidia-driver version: 535.171.04
cuda: 11.8

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

I was training a Llama3-8B-IT model with QLoRA. The training itself succeeded, but memory was not allocated evenly across the two GPUs (see the nvidia-smi output below). Is this a version issue with peft or transformers, or with the NVIDIA driver? On a previous A100*8 server the model was spread evenly across the GPUs, so I don't know what is causing the imbalance in this case.

This is my script.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# QLoRA: 4-bit NF4 quantization with bf16 compute and double quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, 
    bnb_4bit_compute_dtype = torch.bfloat16, 
    bnb_4bit_quant_type = "nf4", 
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id

# device_map="auto" lets accelerate shard the model across both GPUs
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")

proc_data = data.map(process, remove_columns = data['train'].column_names)

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 16,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

train_args_trainer = TrainingArguments(
    num_train_epochs = 3,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-7b-4bit-lora-test2",
    optim = "paged_adamw_32bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-7b-4bit-lora-test2",
    remove_unused_columns = False
)

# Mark the model as model-parallel so the Trainer does not wrap it in DataParallel
model.is_parallelizable = True
model.model_parallel = True

trainer = Trainer(
    model = model,
    tokenizer = tok,
    args = train_args_trainer,
    train_dataset = tokenized_proc_data['train'],
    data_collator = DataCollatorForLanguageModeling(tok, mlm = False)
)

trainer.train()

nvidia-smi during training:
Wed May  8 06:54:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+
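
To quantify the imbalance over the course of training rather than from a single nvidia-smi snapshot, a small callback could be attached to the Trainer. This is only a sketch; GPUMemoryCallback and its output format are illustrative, not part of the original script.

import torch
from transformers import TrainerCallback

class GPUMemoryCallback(TrainerCallback):
    # Print per-GPU memory at every logging step to see whether the imbalance
    # is present from step 0 or grows during training.
    def on_log(self, args, state, control, logs=None, **kwargs):
        for i in range(torch.cuda.device_count()):
            alloc = torch.cuda.memory_allocated(i) / 2**20
            reserved = torch.cuda.memory_reserved(i) / 2**20
            print(f"step {state.global_step} | GPU {i}: "
                  f"{alloc:.0f} MiB allocated, {reserved:.0f} MiB reserved")

# usage, with the trainer from the script above:
# trainer.add_callback(GPUMemoryCallback())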

Expected behavior

I expect memory to be allocated evenly across both GPUs.
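
One knob that may help if the imbalance comes from how device_map="auto" splits the weights (not a confirmed fix; the per-GPU budgets below are illustrative assumptions, and MODEL_ID / quantization_config are the same as in the script above): pass an explicit max_memory per device and inspect the resulting placement.

# Sketch: cap how much each GPU may receive so automatic placement cannot
# over-fill one card, then check which modules landed where.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = "auto",
    max_memory = {0: "20GiB", 1: "20GiB"},  # illustrative values, not recommendations
)
print(model.hf_device_map)  # mapping of modules to GPU indices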

@BenjaminBossan
Member

Hmm, hard to say, and I can't easily reproduce this. Do you already see strange behavior after loading the model, before starting training? If you try without PEFT, do you see the same issue? (If you don't have enough memory without PEFT, you could e.g. turn off autograd on most of the layers to "simulate" parameter-efficient fine-tuning.) If yes, this could be an accelerate issue.
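
A minimal sketch of the two checks suggested above, assuming the same model and quantization_config as in the original script; the choice to leave only lm_head trainable is one illustrative way to "turn off autograd on most of the layers", not a prescribed recipe.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True,
)

# Check 1: is the split already uneven right after loading, before any training?
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config = quantization_config,
    device_map = "auto",
)
print(model.hf_device_map)

# Check 2: skip get_peft_model entirely and freeze everything except the
# (unquantized) lm_head, so only a small slice is trainable, as a rough
# stand-in for parameter-efficient fine-tuning. Then rerun the same Trainer setup.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("lm_head")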
