train.py fails with TypeError: Object of type Tensor is not JSON serializable #314

Open
khayamgondal opened this issue Mar 8, 2024 · 0 comments

Towards the end of training, I see the following exception thrown:

100%|██████████| 203/203 [08:00<00:00,  2.37s/it]
Traceback (most recent call last):
  File "/home/khayam/notebooks/stanford_alpaca/train.py", line 222, in <module>
    train()
  File "/home/khayam/notebooks/stanford_alpaca/train.py", line 217, in train
    trainer.save_state()
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 1045, in save_state
    self.state.save_to_json(path)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
    json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/home/khayam/anaconda3/envs/alpaca/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable
{'loss': 1.0194, 'grad_norm': tensor(0.9940, device='cuda:0'), 'learning_rate': 1.0204081632653061e-07, 'epoch': 1.0}
{'train_runtime': 480.766, 'train_samples_per_second': 108.165, 'train_steps_per_second': 0.422, 'train_loss': 1.0709380231467374, 'epoch': 1.0}
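
The log line above shows that `grad_norm` is recorded as a CUDA tensor, which `json.dumps` cannot encode. A minimal sketch of one possible workaround follows, assuming the offending values are single-element tensors stored in the trainer's log history. The helper `to_jsonable` and the stand-in class `FakeScalarTensor` are hypothetical names introduced for illustration only, so that the sketch runs without torch installed; this is not a confirmed fix from the maintainers.

```python
import json


def to_jsonable(value):
    """Recursively replace scalar tensor-like objects with plain Python numbers."""
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    # Single-element torch tensors (and NumPy scalars) expose .item(),
    # which returns an ordinary Python number.
    if hasattr(value, "item") and callable(value.item):
        return value.item()
    return value


class FakeScalarTensor:
    """Hypothetical stand-in for a 0-dim torch.Tensor (assumption for the demo)."""

    def __init__(self, number):
        self._number = number

    def item(self):
        return self._number


# Shaped like the log entry in the report, with the tensor replaced by the stand-in.
log_history = [
    {"loss": 1.0194, "grad_norm": FakeScalarTensor(0.9940), "epoch": 1.0},
]

clean_history = to_jsonable(log_history)
json_string = json.dumps(clean_history, indent=2, sort_keys=True)
print(json_string)
```

In `train.py`, the same idea might be applied by assigning `trainer.state.log_history = to_jsonable(trainer.state.log_history)` just before the `trainer.save_state()` call. Whether that is sufficient for this setup is untested here.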

I am running the trainer like this:

torchrun --nproc_per_node={PROCS} --master_port=8080 train.py \
                --model_name_or_path {MODEL} \
                --data_path ./alpaca_data.json \
                --bf16 True \
                --output_dir {OUTPUT}  \
                --num_train_epochs 1  \
                --per_device_train_batch_size {BATCH} \
                --per_device_eval_batch_size {BATCH} \
                --gradient_accumulation_steps {GRADIENT} \
                --evaluation_strategy 'no'  \
                --save_strategy 'steps'  --save_steps 2000 \
                --save_total_limit 1  \
                --learning_rate 2e-5  --weight_decay 0. \
                --warmup_ratio 0.03  \
                --lr_scheduler_type 'cosine' \
                --logging_steps 1 \
                --tf32 True \
                --deepspeed {DEEPSPEED_CONFIG}
MODEL = "/mnt/dataset-storage/AI_MODELS/LLAMA-HF/llama-7b-hf/"
OUTPUT = "model_output"

DEEPSPEED_CONFIG="/mnt/dataset-storage/AI_MODELS/training/stanford_alpaca/configs/zero1.json"

PROCS=8
BATCH=4
GRADIENT=8

Package versions:

deepspeed==0.13.4
torch==2.2.1
accelerate==0.27.2