  • University of Rome, La Sapienza
  • Rome
  • LinkedIn in/santurini

santurini/README.md

TL;DR

cats dogs cows

Click on the banner. Adopt, don't buy! πŸˆβ€β¬›
Machine Learning and Artificial Intelligence enthusiast 🧠
In theory also a Data Scientist, though I'm still not sure what that means πŸ“Š


My Contacts

Gmail LinkedIn Kaggle

Click or hover on the badge πŸ‘†πŸΌ


Main Projects

video-sr counting simsiam

A bunch of projects in Python, R, and a little bit of everything πŸ’»
Click on the badge πŸ‘†πŸΌ


Click on the Pacman for an easter egg πŸ₯š

Pinned

1. **Tutorial to setup a Distributed Data Parallel training in torch using mpirun instead of torchrun**

   To launch a distributed training in torch with _**mpirun**_ we have to:

   1. Configure a passwordless ssh connection with the nodes
   2. Set up the distributed environment inside the training script, in this case _**train.py**_ (see the sketch below)
   3. Launch the training from the MASTER node with _**mpirun**_
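
   A minimal sketch of step 2, assuming the job is launched with OpenMPI (so the `OMPI_COMM_WORLD_*` variables are exported to every process); the master address and port here are placeholders to fill in:

   ```
   import os

   import torch
   import torch.distributed as dist


   def setup_distributed(master_addr: str, master_port: str = "29500"):
       # mpirun exports these variables for every process it spawns
       rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
       world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
       local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

       # torch's default env:// rendezvous still needs the master's address and port
       os.environ["MASTER_ADDR"] = master_addr
       os.environ["MASTER_PORT"] = master_port

       dist.init_process_group("nccl", rank=rank, world_size=world_size)
       torch.cuda.set_device(local_rank)
       return rank, world_size, local_rank
   ```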
    
                  
2. **Tutorial to setup a Data and Model Parallel training with FastMoE**

   One of the main reasons Mixture of Experts models are gaining so much attention is their high degree of parallelization while still allowing the number of parameters to be scaled up dramatically.
   Usually this requires a lot of complex code and deep knowledge of distributed systems, but we can get it almost for free with the FastMoE library.

   First of all we need to define our Experts and specify, through the _**expert_dp_comm**_ attribute, which type of gradient reduction we would like to use (see the sketch below). The options include:

   - _**dp**_: reduced across the data-parallel group, which means that within the model-parallel group they are not synchronized.
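
   A minimal sketch, assuming FastMoE's `FMoE` layer and its `mark_parallel_comm()` helper work as described in the FastMoE documentation (the expert definition and sizes below are placeholders):

   ```
   import torch
   import torch.nn as nn
   from fmoe import FMoE  # assumption: FastMoE is installed and exposes FMoE


   class Expert(nn.Module):
       # a plain feed-forward expert; FastMoE may pass extra arguments to
       # forward, hence the *args/**kwargs
       def __init__(self, d_model, d_hidden=2048):
           super().__init__()
           self.fc1 = nn.Linear(d_model, d_hidden)
           self.fc2 = nn.Linear(d_hidden, d_model)

       def forward(self, x, *args, **kwargs):
           return self.fc2(torch.relu(self.fc1(x)))


   moe = FMoE(num_expert=4, d_model=512, expert=Expert)
   # tag the expert parameters so their gradients are reduced across the
   # data-parallel group only ("dp"), as described above
   moe.mark_parallel_comm(expert_dp_comm="dp")
   ```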
3. **DeepSpeed Multi-node Training Setup**

   In this tutorial we assume we are launching a distributed training on 2 nodes using DeepSpeed with the OpenMPI launcher.

   1. First of all, DeepSpeed needs a passwordless ssh connection to all the nodes, MASTER included:

      ```
      # generate a public/private ssh key and make sure NOT to insert a passphrase
      ssh-keygen -t rsa
      ```
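
   On the training-script side, a minimal sketch of how the DeepSpeed engine is usually created (the placeholder model and the `ds_config.json` file name are illustrative, not taken from the gist):

   ```
   import deepspeed
   import torch.nn as nn

   model = nn.Linear(10, 10)  # placeholder model

   # deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler)
   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config="ds_config.json",
   )
   ```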
4. **Step-by-step installation procedure for Intel Neural Speed**

   1. **Setup WSL**

      * Install wsl: `wsl --install -d Ubuntu`

      * Run PowerShell as Administrator and enter:
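
   Once inside the Ubuntu shell, a small sanity check I like to run (my own helper, not part of Neural Speed): confirm we are actually under WSL and see which x86 vector extensions the CPU exposes, since Neural Speed's kernels can take advantage of AVX2/AVX512/AMX where available.

   ```
   def read(path):
       with open(path) as f:
           return f.read()

   # WSL kernels report "microsoft" in /proc/version
   print("running under WSL:", "microsoft" in read("/proc/version").lower())

   flags = set()
   for line in read("/proc/cpuinfo").splitlines():
       if line.startswith("flags"):
           flags = set(line.split(":", 1)[1].split())
           break

   for feature in ("avx2", "avx512f", "avx512_vnni", "amx_tile"):
       print(feature, "->", "yes" if feature in flags else "no")
   ```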
5. **Step-by-step tutorial for FastMoE installation**

   Step-by-step tutorial to install FastMoE on your local machine:

   1. First of all you'll need to check your torch and NCCL versions, and make sure your CUDA version is compatible with the one torch was compiled against (in general, the latest torch version also works with the latest CUDA):

      ```
      # in a terminal, run this command to print the torch, CUDA and NCCL versions:
      python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"
      ```
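
   Once FastMoE is built and installed, a quick smoke test (assuming the package is importable as `fmoe`, as in the FastMoE examples):

   ```
   # should run without raising ImportError and print where the package was installed
   import fmoe

   print(fmoe.__file__)
   ```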
6. **Simple Tutorial to get started with the FastMoE library**

   In this tutorial we are going to consider a simple model in which we replace the MLP with a MoE (see the sketch below).
   The starting model is defined like this:

   ```
   class Net(nn.Module):
       ...
   ```
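
   A minimal sketch of the idea, assuming a toy `Net` (the actual model in the gist may differ) and FastMoE's `FMoETransformerMLP` as the drop-in MoE replacement for the dense MLP block:

   ```
   import torch.nn as nn
   from fmoe.transformer import FMoETransformerMLP  # assumption: FastMoE's MoE feed-forward layer


   class Net(nn.Module):
       # toy model: an input projection, a feed-forward block, a classification head
       def __init__(self, d_model=512, d_hidden=2048, num_classes=10, use_moe=False):
           super().__init__()
           self.proj = nn.Linear(784, d_model)
           if use_moe:
               # swap the dense MLP for a Mixture of Experts layer
               self.mlp = FMoETransformerMLP(num_expert=4, d_model=d_model, d_hidden=d_hidden)
           else:
               self.mlp = nn.Sequential(
                   nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
               )
           self.head = nn.Linear(d_model, num_classes)

       def forward(self, x):
           return self.head(self.mlp(self.proj(x)))


   dense_net = Net(use_moe=False)  # the starting model
   moe_net = Net(use_moe=True)     # same model with the MLP replaced by a MoE layer
   ```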