generate.py not utilizing GPU in full #476

Closed
frankxu2004 opened this issue Dec 6, 2021 · 4 comments · May be fixed by #592

Comments

@frankxu2004

I tried to run text generation with prompts using generate.py. I provided a large list of prompts, approximately 20K, and ran generation on 10 RTX 8000 GPUs. However, nvidia-smi shows that GPU utilization during generation averages only about 50-60%, which is not ideal. Thank you!
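
For reference, I launch generation roughly like this (the config file names are placeholders for my actual files):

python ./deepy.py generate.py configs/text_generation.yml configs/model_config.yml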

My configuration is:

{
  # Text gen type: `input-file`, `unconditional` or `interactive`
  "text-gen-type": "input-file", #"input-file",
 
  # Params for all
  "maximum_tokens": 256,
  "temperature": 0.2,
  "top_p": 0.95,
  "top_k": 0,
  "recompute": false,
  
  # `unconditional`/`input-file`: samples
  "num-samples": 100,

  # input/output file
  "sample-input-file": "0",
  
  "data-path": "data/code/code_text_document",
  
  # or for weighted datasets: 
  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "train-data-weights": [1., 2.],
  # "test-data-weights": [2., 1.],
  # "valid-data-weights": [0.5, 0.4],

  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. 
  # WARNING: setting this to True will override any user provided weights
  # "weight_by_num_documents": false,
  # "weighted_sampler_alpha": 0.3,

  "vocab-file": "data/code-vocab.json",
  "merge-file": "data/code-merges.txt",

  "save": "checkpoints",
  "load": "checkpoints",
  "checkpoint_validation_with_forward_pass": False,
  
  "tensorboard-dir": "tensorboard",
  "log-dir": "logs",
  "use_wandb": True,
  "wandb_host": "https://api.wandb.ai",
  "wandb_project": "neox",
}

And the model config:

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 1,

   # model settings
   "num-layers": 32,
   "hidden-size": 2560,
   "num-attention-heads": 32,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": true,
   "bias-gelu-fusion": true,

   # optimizer settings
   "zero_allow_untested_optimizer": true,
   "optimizer": {
     "type": "adam",
     "params": {
       "lr": 0.00016,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 16,
   "gradient_accumulation_steps": 1,
   "data-impl": "mmap",
   "split": "989,10,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0,
   "hidden-dropout": 0,
   "attention-dropout": 0,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "initial_scale_power": 16,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 160000,
   "lr-decay-iters": 160000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.01,
   "save-interval": 1000,
   "eval-interval": 1000,
   "eval-iters": 10,

   # logging
   "log-interval": 100,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 1,
   "wall_clock_breakdown": true,
}

@StellaAthena
Member

Have you tried increasing the batch size?

@frankxu2004
Author

Thanks for the reply. How should I increase the batch size during generation? The configuration file only lists a batch size for training, such as "train_micro_batch_size_per_gpu".

@sdtblck
Contributor

sdtblck commented Dec 12, 2021

The same setting is used for inference.
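
In other words, something like this in the generation config (the value here is illustrative; use whatever fits in GPU memory at your sequence length):

# generate.py reuses this as the per-GPU batch size during generation
"train_micro_batch_size_per_gpu": 64,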

@sdtblck sdtblck closed this as completed Dec 12, 2021
@psinha30

@frankxu2004 @sdtblck Changing train_micro_batch_size_per_gpu doesn't work for inference mode ("input-file"). Should I change any other parameter?

@StellaAthena StellaAthena linked a pull request Mar 20, 2022 that will close this issue