generate.py not utilizing GPU in full #476

Closed
frankxu2004 opened this issue Dec 6, 2021 · 4 comments · May be fixed by #592

Comments

@frankxu2004

I tried to run text generation with prompts using generate.py. I provided a large list of prompts, approximately 20K, and ran generation on 10 RTX 8000 GPUs. However, nvidia-smi shows that GPU utilization during generation averages only about 50-60%, which is not ideal. Thank you!
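
For reference, I launch generation roughly like this (the config file names are placeholders for my actual files):

python ./deepy.py generate.py configs/text_generation.yml configs/model_config.yml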

My configuration is:

{
  # Text gen type: `input-file`, `unconditional` or `interactive`
  "text-gen-type": "input-file", #"input-file",
 
  # Params for all
  "maximum_tokens": 256,
  "temperature": 0.2,
  "top_p": 0.95,
  "top_k": 0,
  "recompute": false,
  
  # `unconditional`/`input-file`: samples
  "num-samples": 100,

  # input/output file
  "sample-input-file": "0",
  
  "data-path": "data/code/code_text_document",
  
  # or for weighted datasets: 
  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],
  # "train-data-weights": [1., 2.],
  # "test-data-weights": [2., 1.],
  # "valid-data-weights": [0.5, 0.4],

  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. 
  # WARNING: setting this to True will override any user provided weights
  # "weight_by_num_documents": false,
  # "weighted_sampler_alpha": 0.3,

  "vocab-file": "data/code-vocab.json",
  "merge-file": "data/code-merges.txt",

  "save": "checkpoints",
  "load": "checkpoints",
  "checkpoint_validation_with_forward_pass": False,
  
  "tensorboard-dir": "tensorboard",
  "log-dir": "logs",
  "use_wandb": True,
  "wandb_host": "https://api.wandb.ai",
  "wandb_project": "neox",
}

And the model config:

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 1,

   # model settings
   "num-layers": 32,
   "hidden-size": 2560,
   "num-attention-heads": 32,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": true,
   "bias-gelu-fusion": true,

   # optimizer settings
   "zero_allow_untested_optimizer": true,
   "optimizer": {
     "type": "adam",
     "params": {
       "lr": 0.00016,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 16,
   "gradient_accumulation_steps": 1,
   "data-impl": "mmap",
   "split": "989,10,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0,
   "hidden-dropout": 0,
   "attention-dropout": 0,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "initial_scale_power": 16,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 160000,
   "lr-decay-iters": 160000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.01,
   "save-interval": 1000,
   "eval-interval": 1000,
   "eval-iters": 10,

   # logging
   "log-interval": 100,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 1,
   "wall_clock_breakdown": true,
}

@StellaAthena
Member

Have you tried increasing the batch size?

@frankxu2004
Author

Thanks for the reply. How should I increase the batch size during generation? The configuration file only lists a batch size for training, such as "train_micro_batch_size_per_gpu".

@sdtblck
Contributor

sdtblck commented Dec 12, 2021

The same setting is used for inference.
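
In other words, something like this in the generation config (the value here is illustrative; use whatever fits in GPU memory at your sequence length):

# generate.py reuses this as the per-GPU batch size during generation
"train_micro_batch_size_per_gpu": 64,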

@sdtblck sdtblck closed this as completed Dec 12, 2021
@psinha30

@frankxu2004 @sdtblck Changing train_micro_batch_size_per_gpu doesn't work for inference mode ("input-file"). Should I change any other parameter?

@StellaAthena StellaAthena linked a pull request Mar 20, 2022 that will close this issue