Llama2 70B SFT with FSDP failing #9138

Open
satheeshkatipomu opened this issue May 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments


satheeshkatipomu commented May 8, 2024

Unable to fine-tune Llama2 70B with FSDP

I am trying to fine-tune the Llama2 70B model on a dataset. With TP=4, PP=8 it works fine, but with FSDP on 6 nodes it fails with the error below:

File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 670, in setup_environment
    for p in self.model.parameters():
AttributeError: 'NoneType' object has no attribute 'parameters'
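
For context, here is a minimal, self-contained sketch (not NeMo's actual code) of the failure mode shown in the traceback: setup_environment() iterates self.model.parameters(), and the AttributeError means self.model is still None at that point, presumably because no model has been attached to the strategy yet. The class and method names below mirror the traceback but are otherwise illustrative.

class FSDPStrategySketch:
    def __init__(self):
        # In this sketch, the model is only attached later; until then it is None.
        self.model = None

    def setup_environment(self):
        # Mirrors the failing loop from nlp_overrides.py line 670.
        for p in self.model.parameters():
            pass


strategy = FSDPStrategySketch()
try:
    strategy.setup_environment()
except AttributeError as err:
    # Prints: 'NoneType' object has no attribute 'parameters'
    print(err)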

Steps/Code to reproduce bug

  1. Converted the Llama2 70B base model checkpoint from Hugging Face to NeMo format.
  2. Started training on 6 nodes with the config below.
run:
  name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1_llama2_70b
  time_limit: 3-04:00:00
  dependency: singleton
  convert_name: convert_nemo
  model_train_name: llama2_70b
  convert_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/convert_nemo
  task_name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
  results_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
trainer:
  devices: 8
  accelerator: gpu
  num_nodes: 6
  precision: bf16
  logger: false
  enable_checkpointing: false
  use_distributed_sampler: false
  max_epochs: null
  max_steps: 13000
  log_every_n_steps: 10
  val_check_interval: 300
  gradient_clip_val: 1.0
exp_manager:
  explicit_log_dir: ~/Projects/NeMo/nemo_launcher/NeMo-Megatron-Launcher/launcher_scripts/results/llama2_70b/llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1/results
  exp_dir: null
  name: megatron_llama_llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
  create_wandb_logger: false
  wandb_logger_kwargs:
    project: nemo_llama_llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
    name: llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1_llama2_70b
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: validation_loss
    save_top_k: 5
    mode: min
    save_nemo_on_train_end: true
    filename: megatron_gpt_sft--{validation_loss:.3f}-{step}-{consumed_samples}
    model_parallel_size: 4
    save_best_model: true
model:
  seed: 1234
  tensor_model_parallel_size: 4
  pipeline_model_parallel_size: 1
  global_batch_size: 528
  micro_batch_size: 1
  restore_from_path: /workspace/llama2_models
  resume_from_checkpoint: null
  save_nemo_on_validation_end: false
  sync_batch_comm: false
  megatron_amp_O2: false
  sequence_parallel: true
  activations_checkpoint_granularity: selective
  activations_checkpoint_method: uniform
  activations_checkpoint_num_layers: null
  answer_only_loss: true
  gradient_as_bucket_view: false
  seq_len_interpolation_factor: null
  use_flash_attention: true
  hidden_dropout: 0.1
  attention_dropout: 0.1
  ffn_dropout: 0.1
  fsdp: true
  fsdp_sharding_strategy: full
  fsdp_grad_reduce_dtype: bf16
  fsdp_sharded_checkpoint: false
  fsdp_use_orig_params: false
  peft:
    peft_scheme: null
    restore_from_path: null
    adapter_tuning:
      type: parallel_adapter
      adapter_dim: 32
      adapter_dropout: 0.0
      norm_position: pre
      column_init_method: xavier
      row_init_method: zero
      norm_type: mixedfusedlayernorm
      layer_selection: null
      weight_tying: false
      position_embedding_strategy: null
    lora_tuning:
      adapter_dim: 32
      adapter_dropout: 0.0
      column_init_method: xavier
      row_init_method: zero
      layer_selection: null
      weight_tying: false
      position_embedding_strategy: null
    p_tuning:
      virtual_tokens: 10
      bottleneck_dim: 1024
      embedding_dim: 1024
      init_std: 0.023
    ia3_tuning:
      layer_selection: null
  data:
    chat: false
    train_ds:
      file_names:
      - ~/Projects/data/training.jsonl
      global_batch_size: 528
      micro_batch_size: 1
      shuffle: false
      num_workers: 4
      pin_memory: true
      max_seq_length: 4096
      min_seq_length: 1
      drop_last: true
      concat_sampling_probabilities:
      - 1.0
      context_key: input
      label_key: output
      add_eos: true
      add_sep: false
      add_bos: true
      separate_prompt_and_response_with_newline: true
      truncation_field: context
      index_mapping_dir: null
      prompt_template: '{input} {output}'
    validation_ds:
      file_names:
      - ~/Projects/data/validation.jsonl
      names:
      - llama_dolly_ft_fsdp_n_6_tp_4_pp_1_ddp_1_gbs_512_mbs_1
      global_batch_size: 528
      micro_batch_size: 1
      shuffle: false
      num_workers: 4
      pin_memory: true
      max_seq_length: 4096
      min_seq_length: 1
      drop_last: true
      context_key: input
      label_key: output
      add_eos: true
      add_sep: false
      add_bos: true
      separate_prompt_and_response_with_newline: true
      write_predictions_to_file: false
      output_file_path_prefix: null
      truncation_field: context
      index_mapping_dir: null
      prompt_template: '{input} {output}'
      metric:
        name: loss
        average: null
        num_classes: null
    test_ds:
      file_names:
      - ~/Projects/data/test.jsonl
      names: null
      global_batch_size: 528
      micro_batch_size: 1
      shuffle: false
      num_workers: 4
      pin_memory: true
      max_seq_length: 4096
      min_seq_length: 1
      drop_last: true
      context_key: input
      label_key: output
      add_eos: true
      add_sep: false
      add_bos: true
      separate_prompt_and_response_with_newline: true
      write_predictions_to_file: false
      output_file_path_prefix: null
      truncation_field: context
      index_mapping_dir: null
      prompt_template: '{input} {output}'
      metric:
        name: loss
        average: null
        num_classes: null
  optim:
    name: fused_adam
    lr: 1.0e-06
    weight_decay: 0.1
    betas:
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      monitor: validation_loss
      min_lr: 1.0e-08
      warmup_steps: 1000
      last_epoch: -1
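
For reference, a rough sanity check of the parallelism arithmetic implied by the config above, assuming standard Megatron-style sizing (data parallel = world size / (TP * PP)); the variable names are illustrative, not NeMo settings. Note that the run name says gbs_512 while the config sets global_batch_size: 528.

num_nodes = 6
devices_per_node = 8
tensor_parallel = 4          # tensor_model_parallel_size
pipeline_parallel = 1        # pipeline_model_parallel_size
micro_batch_size = 1
global_batch_size = 528

world_size = num_nodes * devices_per_node                             # 48 GPUs
data_parallel = world_size // (tensor_parallel * pipeline_parallel)   # 12 replicas
grad_accum = global_batch_size // (data_parallel * micro_batch_size)  # 44 accumulation steps

# The global batch size should be divisible by data_parallel * micro_batch_size.
assert global_batch_size % (data_parallel * micro_batch_size) == 0
print(world_size, data_parallel, grad_accum)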

Expected behavior

Llama2 70B SFT with FSDP completes without errors.

Environment details
Image: nvcr.io/nvidia/nemo:24.03.01.framework
Running on a Slurm cluster.

satheeshkatipomu added the bug label May 8, 2024
@xjohnxjohn

@satheeshkatipomu Which tool did you use to convert the Llama2 70B base model checkpoint from Hugging Face to NeMo format?

@satheeshkatipomu (Author)

I used the convert_llama_hf_to_nemo.py script to convert the Llama2 70B model from Hugging Face format to NeMo format. Here is the exact command:

python3 -u /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=/workspace/llama2_models --output_path=/workspace/llama2_models/llama2-70b-base.nemo
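
As a quick sanity check on the converted checkpoint: a .nemo file is a tar archive, so its contents can be listed without loading the 70B weights. A minimal sketch, using the output path from the command above:

import tarfile

nemo_path = "/workspace/llama2_models/llama2-70b-base.nemo"
with tarfile.open(nemo_path) as archive:
    # List the first few members (config and weight files) to confirm the archive is readable.
    for i, member in enumerate(archive):
        print(member.name, member.size)
        if i >= 20:
            break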
