
[core dataset compilation error] #807

Open

shamanez opened this issue May 6, 2024 · 0 comments

shamanez commented May 6, 2024

Describe the bug
When I use the most recent Megatron-LM fork, I get the following error:

make: Entering directory '/workspace/megatron-lm/megatron/core/datasets'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
make: Leaving directory '/workspace/megatron-lm/megatron/core/datasets'
ERROR:megatron.core.datasets.utils:Failed to compile the C++ dataset helper functions
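As a diagnostic (a sketch assuming the container and mount layout from the script below), the helpers can be rebuilt by hand to capture the full compiler output, which the logged error above does not show:

# Inside the container: force a rebuild of the dataset helpers and keep the log.
cd /workspace/megatron-lm/megatron/core/datasets
make -B 2>&1 | tee /tmp/helpers_build.log

# Check whether the compiled extension actually imports; a failure here
# separates a build problem from a loading problem.
cd /workspace/megatron-lm
python -c "from megatron.core.datasets import helpers; print(helpers.__file__)"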

To Reproduce

#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --partition=batch  # Adjust this for your cluster
#SBATCH --output=/home/shamane/logs/training_scratch/log.out # Adjust this for your cluster
#SBATCH --err=/home/shamane/logs/training_scratch/error.err    # Adjust this for your cluster
export MASTER_ADDR=$(hostname)
export GPUS_PER_NODE=8

# ---

export LD_LIBRARY_PATH=/usr/lib:/usr/lib64
export NCCL_TESTS_HOME=nccl-tests
export NCCL_DEBUG=INFO
export NCCL_ALGO=RING

export NCCL_IB_AR_THRESHOLD=0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_IB_SPLIT_DATA_ON_QPS=0
export NCCL_IB_QPS_PER_CONNECTION=2
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=enp27s0np0
export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_IGNORE_CPU_AFFINITY=1

# ---


nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo "Node IP: $head_node_ip"


# Specify the Docker image to use.
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"

# Define the path to the Megatron-LM directory on the head node.
MEGATRONE_PATH="/home/shamane/Megatron-LM-luke" # Update with actual path. Path should be on the head node.

# Set paths for checkpoints and tokenizer data. These should be on a shared data directory.
SHARED_DIR="/data/fin_mixtral_2B/"

#MASTER_ADDR=${MASTER_ADDR:-"localhost"}
MASTER_ADDR=$head_node_ip
MASTER_PORT=${MASTER_PORT:-"6008"}
NNODES=${SLURM_NNODES:-"1"}
NODE_RANK=${RANK:-"0"}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "SLURM_NNODES: $SLURM_NNODES"
echo "SLURM_NODEID: $SLURM_NODEID"

echo "MASTER_ADDR: $MASTER_ADDR"
echo "NNODES: $NNODES"
echo "MASTER_PORT: $MASTER_PORT"
echo "NODE_RANK: $NODE_RANK"


#module load docker


echo "-v $SHARED_DIR:/workspace/data"
echo "-v $MEGATRONE_PATH:/workspace/megatron-lm"
echo "$PYTORCH_IMAGE"
echo "bash -c \"pip install flash-attn sentencepiece &&  \
           bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
           /workspace/data/megatrone_checkpoints \
           /workspace/data/tokenizers/tokenizer.model \
           /workspace/data/processed_data/finance_2b_mixtral_text_document \
           $MASTER_ADDR \
           $MASTER_PORT \
           $NNODES \
           $NODE_RANK\""



# Run the Docker container with the specified PyTorch image.
srun docker run \
  -e SLURM_JOB_ID=$SLURM_JOB_ID \
  --gpus all \
  --ipc=host \
  --network=host \
  --workdir /workspace/megatron-lm \
  -v $SHARED_DIR:/workspace/data \
  -v $MEGATRONE_PATH:/workspace/megatron-lm \
     $PYTORCH_IMAGE \
  bash -c "pip install flash-attn sentencepiece wandb 'git+https://github.com/fanshiqing/grouped_gemm@v1.0' &&  \
           bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
           /workspace/data/mixtral8x7-instr-tp2-emp8-ggemm \
           /workspace/data/tokenizers/tokenizer.model \
           /workspace/data/processed_data/finance_2b_mixtral_text_document \
           $MASTER_ADDR \
           $MASTER_PORT \
           $NNODES \
           $NODE_RANK"



# This Docker command mounts the specified Megatron-LM and data directories, sets the working directory,
# and runs the 'run_mixtral_distributed.sh' script inside the container.
# This script facilitates distributed training using the specified PyTorch image, leveraging NVIDIA's optimizations.
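The compile step can also be exercised in isolation, without the SLURM launch, to narrow the repro down to the image plus the mounted checkout. This is a sketch that assumes compile_helpers in megatron.core.datasets.utils (the module named in the error log) is the entry point that emits the error:

docker run --rm \
  -v /home/shamane/Megatron-LM-luke:/workspace/megatron-lm \
  --workdir /workspace/megatron-lm \
  nvcr.io/nvidia/pytorch:24.03-py3 \
  python -c "from megatron.core.datasets.utils import compile_helpers; compile_helpers()"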

Environment (please complete the following information):

PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"
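If more toolchain detail helps, the compiler and pybind11 versions inside the image can be recorded with a one-liner (a sketch; pybind11.get_include() is pybind11's standard API for its header path):

docker run --rm nvcr.io/nvidia/pytorch:24.03-py3 bash -c \
  "python --version; g++ --version | head -n1; \
   python -c 'import pybind11; print(pybind11.__version__, pybind11.get_include())'"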

Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.

Additional context
This works well with the fork that I downloaded 4 days ago.

@shamanez shamanez closed this as completed May 6, 2024
@shamanez shamanez reopened this May 6, 2024