Describe the bug
When I use the most recent Megatron-LM fork, I get the following error:
```
make: Entering directory '/workspace/megatron-lm/megatron/core/datasets'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
make: Leaving directory '/workspace/megatron-lm/megatron/core/datasets'
ERROR:megatron.core.datasets.utils:Failed to compile the C++ dataset helper functions
```
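The log above shows the g++ invocation but not the compiler's actual error output, so the root cause stays hidden. One quick check (a sketch; it only assumes `python3` is on the PATH) is to verify that the Python include directory the `-I` flag points at actually contains `Python.h`, since missing Python development headers are a common reason this compile step fails:

```shell
# Locate the Python include directory that the helpers Makefile passes to g++
# via the -I flag, and check whether Python.h is actually present there.
inc=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['include'])")
echo "Python include dir: $inc"
if [ -f "$inc/Python.h" ]; then
    echo "Python.h found"
else
    echo "Python.h missing: the Python development headers are not installed"
fi
```

If the headers are present, rerunning `make` by hand inside the container's `megatron/core/datasets` directory should print the underlying g++ error directly.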
To Reproduce
```bash
#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --partition=batch                                      # Adjust this for your cluster
#SBATCH --output=/home/shamane/logs/training_scratch/log.out   # Adjust this for your cluster
#SBATCH --err=/home/shamane/logs/training_scratch/error.err    # Adjust this for your cluster

export MASTER_ADDR=$(hostname)
export GPUS_PER_NODE=8

# ---
export LD_LIBRARY_PATH=/usr/lib:/usr/lib64
export NCCL_TESTS_HOME=nccl-tests
export NCCL_DEBUG=INFO
export NCCL_ALGO=RING
export NCCL_IB_AR_THRESHOLD=0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_IB_SPLIT_DATA_ON_QPS=0
export NCCL_IB_QPS_PER_CONNECTION=2
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=enp27s0np0
export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_IGNORE_CPU_AFFINITY=1
# ---

nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
echo "Node IP: $head_node_ip"

# Specify the Docker image to use.
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"

# Define the path to the Megatron-LM directory on the head node.
MEGATRONE_PATH="/home/shamane/Megatron-LM-luke"  # Update with actual path. Path should be on the head node.

# Set paths for checkpoints and tokenizer data. These should be on a shared data directory.
SHARED_DIR="/data/fin_mixtral_2B/"

#MASTER_ADDR=${MASTER_ADDR:-"localhost"}
MASTER_ADDR=$head_node_ip
MASTER_PORT=${MASTER_PORT:-"6008"}
NNODES=${SLURM_NNODES:-"1"}
NODE_RANK=${RANK:-"0"}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "SLURM_NNODES: $SLURM_NNODES"
echo "SLURM_NODEID: $SLURM_NODEID"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "NNODES: $NNODES"
echo "MASTER_PORT: $MASTER_PORT"
echo "NODE_RANK: $NODE_RANK"

#module load docker

echo "-v $SHARED_DIR:/workspace/data"
echo "-v $MEGATRONE_PATH:/workspace/megatron-lm"
echo "$PYTORCH_IMAGE"
echo "bash -c \"pip install flash-attn sentencepiece && \
  bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
  /workspace/data/megatrone_checkpoints \
  /workspace/data/tokenizers/tokenizer.model \
  /workspace/data/processed_data/finance_2b_mixtral_text_document \
  $MASTER_ADDR \
  $MASTER_PORT \
  $NNODES \
  $NODE_RANK\""

# Run the Docker container with the specified PyTorch image.
srun docker run \
  -e SLURM_JOB_ID=$SLURM_JOB_ID \
  --gpus all \
  --ipc=host \
  --network=host \
  --workdir /workspace/megatron-lm \
  -v $SHARED_DIR:/workspace/data \
  -v $MEGATRONE_PATH:/workspace/megatron-lm \
  $PYTORCH_IMAGE \
  bash -c "pip install flash-attn sentencepiece wandb 'git+https://github.com/fanshiqing/grouped_gemm@v1.0' && \
    bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
    /workspace/data/mixtral8x7-instr-tp2-emp8-ggemm \
    /workspace/data/tokenizers/tokenizer.model \
    /workspace/data/processed_data/finance_2b_mixtral_text_document \
    $MASTER_ADDR \
    $MASTER_PORT \
    $NNODES \
    $NODE_RANK"

# This Docker command mounts the specified Megatron-LM and data directories, sets the working directory,
# and runs the 'run_mixtral_distributed.sh' script inside the container.
# This script facilitates distributed training using the specified PyTorch image, leveraging NVIDIA's optimizations.
```
Environment (please complete the following information):
```bash
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"
```
Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.
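One possible workaround (an assumption, not a verified fix): pre-build the dataset helpers inside the container before training starts, so any g++ error is printed directly to the job log instead of being swallowed, and make sure `pybind11` is installed alongside the other pip packages. This is a sketch of how the `bash -c` line of the `docker run` command above could be extended; the `...` stands for the unchanged script arguments:

```shell
# sketch (untested): install pybind11 and pre-build the C++ dataset helpers
# before launching training, so compile errors surface in the job log
bash -c "pip install pybind11 && \
  make -C /workspace/megatron-lm/megatron/core/datasets && \
  bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh ..."
```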
Additional context
This works well with the fork that I downloaded 4 days ago.