Issues: NVIDIA/Megatron-LM
- [QUESTION] Why is expert parallelism not supported during fp16 training? (#810, opened May 7, 2024 by yutian-mt)
- [QUESTION] Is it expected to compute the grad norm on the dense optimizer and the MoE optimizer separately? (#785, opened Apr 19, 2024 by ezioliao; see the grad-norm sketch after this list)
- [QUESTION] Found NaN in local grad norm in the backward pass, before the data-parallel communication collective (#780, opened Apr 16, 2024 by ftgreat)
- [BUG] Interaction between the Megatron-core, transformer-impl, and flash-attention options (#778, opened Apr 12, 2024 by Baibaifan)
- [BUG] Passed the wrong type of argument to torch.distributed.broadcast (#774, opened Apr 11, 2024 by sandyhouse; see the broadcast sketch after this list)
- [QUESTION] vicuna-7b-v1.5 weight conversion from Hugging Face to Megatron-LM format (#773, opened Apr 10, 2024 by uehara-mech)
- [QUESTION] Why does megatron-core seem slower and use more GPU memory than legacy for GPT pretraining? (#770, opened Apr 9, 2024 by REIGN12)
- [QUESTION] Why is F.embedding() replaced with [] indexing in the VocabParallelEmbedding class? (#769, opened Apr 9, 2024 by starkhu; see the embedding sketch after this list)
- [BUG] How to checkpoint a specific microbatch in pipeline parallelism? (#767, opened Apr 7, 2024 by robotsp)
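For context on #785: when dense and MoE parameters live in separate optimizers and each computes its own gradient norm, a single global L2 norm can still be recovered for clipping, because the squared L2 norm is additive across disjoint parameter sets. The sketch below is a minimal illustration in plain PyTorch under that assumption; the function name and parameter grouping are hypothetical, not Megatron-LM's actual implementation.

```python
import torch

# Hypothetical sketch for the question in #785: if gradient norms are computed
# separately for dense and MoE parameters, the global L2 norm can be recovered
# as ||g||_2 = sqrt(||g_dense||_2^2 + ||g_moe||_2^2), since the parameter sets
# are disjoint. Not Megatron-LM's code; a plain-PyTorch illustration.

def combined_grad_norm(dense_params, moe_params) -> float:
    def l2(params):
        grads = [p.grad for p in params if p.grad is not None]
        if not grads:
            return torch.tensor(0.0)
        # Norm of per-tensor norms equals the norm of the concatenated grads.
        return torch.linalg.vector_norm(
            torch.stack([torch.linalg.vector_norm(g) for g in grads]))

    dense_norm, moe_norm = l2(dense_params), l2(moe_params)
    return float(torch.sqrt(dense_norm ** 2 + moe_norm ** 2))
```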
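For context on #774: torch.distributed.broadcast operates in place on a Tensor, so passing a Python int or float raises a type error. A minimal sketch of the correct pattern follows, assuming a process group has already been initialized (e.g. via dist.init_process_group) and rank 0 is the source; the helper name is hypothetical.

```python
import torch
import torch.distributed as dist

# Hypothetical helper illustrating the pitfall in #774: wrap a scalar in a
# tensor on the right device before broadcasting, because dist.broadcast
# mutates a Tensor in place and does not accept plain Python numbers.

def broadcast_scalar(value: float, src: int = 0) -> float:
    device = (torch.device("cuda", torch.cuda.current_device())
              if torch.cuda.is_available() else torch.device("cpu"))
    buf = torch.tensor([value], dtype=torch.float32, device=device)
    dist.broadcast(buf, src=src)  # in place; every rank receives rank src's value
    return buf.item()
```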
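For context on #769: in a vocab-parallel embedding, each tensor-parallel rank holds only a slice of the vocabulary, so token IDs outside the local shard must be masked before the lookup and their output rows zeroed afterwards; once that masking is done by hand, plain weight indexing does the job and F.embedding adds nothing. The sketch below illustrates the pattern only; the function name and arguments are assumptions, not the library's exact code.

```python
import torch

# Hypothetical illustration of the masked-lookup pattern asked about in #769.
# Each rank owns vocabulary IDs in [vocab_start, vocab_end); out-of-range IDs
# are parked at 0 for the lookup and their outputs zeroed afterwards.

def vocab_parallel_lookup(token_ids, weight, vocab_start, vocab_end):
    # Tokens whose IDs fall on other ranks' vocabulary shards.
    out_of_range = (token_ids < vocab_start) | (token_ids >= vocab_end)
    # Shift into the shard-local index space; masked IDs go to 0 so the
    # indexing stays in bounds.
    local_ids = token_ids - vocab_start
    local_ids[out_of_range] = 0
    # Plain tensor indexing -- the "[]" the issue title asks about.
    output = weight[local_ids]
    # Zero rows for tokens this shard does not own; under tensor parallelism
    # an all-reduce across ranks (omitted here) then sums the partial outputs.
    output[out_of_range] = 0.0
    return output

# Example: a rank owning vocab IDs [4, 8) with hidden size 3.
weight = torch.randn(4, 3)
ids = torch.tensor([[2, 5], [7, 9]])
print(vocab_parallel_lookup(ids, weight, 4, 8))
```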