Deep learning with multiple GPUs
Set up CUDA and PyTorch on a multi-GPU machine in minutes.
As deep learning models (particularly LLMs) continue to grow, there is an increasing need for more GPU memory (VRAM) to develop and use the models locally. Building or acquiring a multi-GPU machine is only the first part of the challenge. Most libraries and applications use only one GPU by default. Therefore, the machine also requires appropriate drivers along with libraries that can take advantage of multi-GPU setups.
This story provides a guide on how to set up a multi-GPU (Nvidia) Linux machine with the essential libraries, saving you time on trial and error so you can start developing right away.
Finally, links are provided to popular open source libraries that can take advantage of multi-GPU setups for deep learning.
The goal
Set up a multi-GPU Linux system with the necessary libraries, such as the CUDA Toolkit and PyTorch, to get started with deep learning 🤖. The same steps apply to single-GPU machines as well.
To get started with deep learning frameworks like exllamaV2 and torchtune, install 1) the CUDA Toolkit, 2) Miniconda, and 3) PyTorch.
©️ All libraries and information mentioned in this story are open source and publicly available.
Start
To find out how many GPUs are installed on your machine, run the nvidia-smi command in the terminal. It should output a list of all GPUs installed on your machine. If there is a discrepancy, or the command does not work at all, first install the Nvidia drivers for your version of Linux.
If you haven't already installed the Nvidia driver, follow this page to install it.
How to install NVIDIA drivers on Ubuntu 22.04 (Source: linuxconfig.org)
Step 1: Install the CUDA Toolkit
💡 First check whether a CUDA folder already exists at /usr/local/cuda-xx. If it does, a version of CUDA is already installed. If the required CUDA toolkit is already present (verify with the nvcc --version terminal command), proceed to Step 2.
Check the CUDA version required by the PyTorch build you want: Start Locally | PyTorch (this guide installs CUDA 12.1).
Go to CUDA Toolkit 12.1 Downloads | NVIDIA Developer to obtain the Linux commands for installing CUDA 12.1 (select your OS version and the corresponding “deb (local)” installer type).
The base installer terminal commands are displayed according to the options you selected. Copy them and run them in your Linux terminal to install the CUDA toolkit. For example, for x86_64 Ubuntu 22.04, open a terminal in your downloads folder and run the following commands:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-ubuntu2204-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
⚠️ During the installation of the CUDA toolkit, the installer may request a kernel update. If a popup appears in your terminal asking to update the kernel, press the esc key to cancel. Do not update the kernel at this stage; it can corrupt the Nvidia drivers ☠️.
Reboot your Linux machine after the installation. If the nvcc command still does not work, you need to add your CUDA installation to your PATH. Open the .bashrc file with the nano editor:
nano /home/$USER/.bashrc
Scroll to the bottom of the .bashrc file and add the following two lines:
export PATH="/usr/local/cuda-12.1/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"
💡 Note that you may need to change cuda-12.1 to cuda-xx, where “xx” is the version of the CUDA toolkit you installed (or install in the future).
Save your changes and close the nano editor.
To save the changes, press the following on your keyboard:
ctrl + o --> save
enter or return --> accept changes
ctrl + x --> close the editor
Close and reopen the terminal. The nvcc --version command should now output the installed CUDA version to the terminal.
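If you prefer to double-check from Python that the PATH now resolves to the new toolkit, here is a minimal sketch using only the standard library:
import shutil
# should print something like /usr/local/cuda-12.1/bin/nvcc once .bashrc is applied
print(shutil.which("nvcc"))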
Step 2: Install Miniconda
Before installing PyTorch, we recommend that you install Miniconda and then install PyTorch inside your Conda environment. It is also useful to create a new Conda environment for each project.
Open a terminal in your downloads folder and run the following commands:
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
# initialize conda
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Close and reopen the terminal. The conda command should now work.
Step 3: Install PyTorch
(Optional) Create a new conda environment for your project. Replace <environment-name> with a name of your choice; I usually use the project name. 💡 You can use the conda activate <environment-name> and conda deactivate <environment-name> commands before and after working on a project.
conda create -n <environment-name> python=3.11
# activate the environment
conda activate <environment-name>
Install the CUDA-enabled version of the PyTorch library. The following command is for a cuda-12.1 installation:
pip3 install torch torchvision torchaudio
The above command is taken from the PyTorch installation guide: Start Locally | PyTorch.
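Before counting GPUs, it is worth confirming that the installed build was compiled against CUDA and can reach the driver. A minimal sketch (the exact version strings depend on your install):
import torch
print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the build targets, e.g. 12.1
print(torch.cuda.is_available())  # True if the driver and CUDA runtime are usable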
After installing PyTorch, check the number of GPUs that PyTorch can see in the terminal.
python
>>> import torch
>>> print(torch.cuda.device_count())
8
This prints the number of GPUs installed on the system (8 in my case), and it should match the number of GPUs listed by the nvidia-smi command.
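You can also inspect each GPU individually and run a small tensor operation on it as a smoke test. A minimal sketch:
import torch
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
    x = torch.randn(1024, 1024, device=f"cuda:{i}")
    y = x @ x  # tiny matmul to confirm the device actually computes
    print(f"GPU {i} OK, result on {y.device}")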
Voilà! Now you're ready to start working on a deep learning project that leverages multiple GPUs 🥳.
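As a quick first taste of multi-GPU compute (not specific to any of the libraries below), here is a minimal sketch that uses torch.nn.DataParallel to split a batch across all visible GPUs:
import torch
import torch.nn as nn
model = nn.Linear(1024, 10)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the module on every visible GPU
model = model.cuda()
x = torch.randn(64, 1024).cuda()  # the batch dimension is split across the GPUs
out = model(x)
print(out.shape)  # torch.Size([64, 10]), gathered back onto cuda:0
For serious training, torch.nn.parallel.DistributedDataParallel is generally preferred over DataParallel, but this is the quickest way to see all your GPUs light up in nvidia-smi.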
1. 🤗 First, you can clone popular models from Hugging Face.
2. 💬 For inference (running LLMs), clone and install exllamav2 in a separate environment. It uses all GPUs to speed up inference; see the sketch after this list for a library-agnostic example. (Check out my Medium page for a detailed tutorial.)
3. 👨🏫 For fine-tuning or training, clone and install torchtune. Follow the instructions for either a full finetune or a lora finetune of your model, leveraging all your GPUs. (Check out my Medium page for a detailed tutorial.)
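For example, a model cloned from Hugging Face can be sharded across all your GPUs with device_map="auto". This is a minimal sketch, assuming you have also installed the transformers and accelerate packages (not covered above); the model id is a placeholder for any causal LM you have access to:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # shards the layers across all visible GPUs
    torch_dtype=torch.float16,  # halves VRAM use compared to float32
)
inputs = tokenizer("Deep learning with multiple GPUs is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))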
This guide covers the machine setup required for multi-GPU deep learning. You can now start working on any project that takes advantage of multiple GPUs, such as torchtune, for faster development.
Stay tuned for more detailed tutorials on exllamaV2 and torchtune.