CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 bash finetune.sh

Troubleshooting

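To make CUDA report an error at the call where it actually occurs (kernel launches are asynchronous by default, so the traceback often points at a later line):
- CUDA_LAUNCH_BLOCKING=1 in front of the command if using bash
- os.environ['CUDA_LAUNCH_BLOCKING'] = "1" if using Python code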
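
For the Python variant, ordering matters: the variable is read when CUDA is initialized, so it has to be set before the first CUDA call, and setting it before importing torch is the safest choice. A minimal sketch (the torch import is left as a comment so the snippet also runs on a machine without PyTorch):

```python
import os

# Set this before CUDA is initialized; doing it before `import torch`
# is the safest ordering.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down noticeably, so this is something to switch on for debugging only.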

“The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.”

Fix: switch the default CUDA version (per Conda env)

# e.g. CUDA 11.7; run the following inside the target Conda env
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export CUDA_HOME=/usr/local/cuda-11.7' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# use >> here so this line is appended instead of overwriting the CUDA_HOME line
echo 'export PATH=/usr/local/cuda-11.7/bin:$PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

# deactivate and activate again
nvcc -V # should be 11.7
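
To confirm the switch worked, the release reported by nvcc -V can be compared with the CUDA version PyTorch was built with (torch.version.cuda). The sketch below is one way to do that; the helper names are made up for this note, and the demo at the end uses the two versions from the error message rather than calling torch, so it runs anywhere.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from `nvcc -V` output, or None if not found."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2))) if m else None


def system_cuda_version():
    """Run `nvcc -V` if nvcc is on PATH; return (major, minor) or None."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout
    return parse_nvcc_release(out)


def matches_torch(nvcc_version, torch_cuda):
    """Compare an nvcc (major, minor) tuple with a string like torch.version.cuda."""
    if nvcc_version is None or torch_cuda is None:
        return False
    major, minor = torch_cuda.split(".")[:2]
    return nvcc_version == (int(major), int(minor))


# Versions taken from the error message: system CUDA 12.1, PyTorch built with 11.8.
print(matches_torch((12, 1), "11.8"))  # False: the mismatch being reported
print(matches_torch((11, 8), "11.8"))  # True
```

In a real environment the call would be matches_torch(system_cuda_version(), torch.version.cuda); torch.version.cuda is None for CPU-only builds, which the helper treats as a mismatch.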

“ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found”

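mkdir -p $CONDA_PREFIX/etc/conda/activate.d
# >> appends, so settings already in env_vars.sh (e.g. CUDA_HOME) are kept;
# replace /home/yzhanglo/anaconda3 with your own Anaconda install path
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:/home/yzhanglo/anaconda3/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh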

CUDA+PyTorch+PyG

1. Get the driver version: nvidia-smi (e.g. 515.43.04)
2. Find an upper bound on the CUDA version: check Table 3 of https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html for the newest CUDA toolkit the driver version supports
3. Check which CUDA toolkits are installed on the machine: ls /usr/local/
   (for cuDNN, see https://developer.nvidia.com/rdp/cudnn-archive)
4. Check for a suitable Python version: look in https://download.pytorch.org/whl/torch_stable.html (make sure the filename contains ‘cu’, so that it is a CUDA build and not the CPU-only one)
   e.g. cu111/torch-1.10.2%2Bcu111-cp38-cp38-linux_x86_64.whl (cp38 means CPython 3.8)
5. Install
   1. (recommended) follow the PyTorch official website. If lucky, conda install pytorch torchvision torchaudio pytorch-cuda=11.7 pyg -c pytorch -c nvidia -c pyg will be enough.
   2. or use pip: pip install torch==1.10.2+cu111 torchvision==0.11.3+cu111 torchaudio==0.10.2+cu111 -f https://download.pytorch.org/whl/torch_stable.html (check the wheel index from step 4 to find matching versions)
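
The wheel filenames in the index encode everything needed to pick a matching build: torch version, CUDA tag, and Python tag. As a rough sketch of how to read them, the function below splits a name like the example above into its parts. The function name and the regular expression are written for this note, not taken from any PyTorch tooling, and only cover names shaped like that example.

```python
import re
from urllib.parse import unquote

# Pattern for names such as "cu111/torch-1.10.2+cu111-cp38-cp38-linux_x86_64.whl".
WHEEL_RE = re.compile(
    r"^(?:(?P<dir>[^/]+)/)?torch-(?P<version>[^+-]+)(?:\+(?P<build>[^-]+))?"
    r"-(?P<python>cp\d+)-(?P<abi>[^-]+)-(?P<platform>.+)\.whl$"
)


def parse_torch_wheel(name):
    """Split a torch wheel name into its parts; return None if it does not match."""
    m = WHEEL_RE.match(unquote(name))  # "%2B" in the index is a URL-encoded "+"
    if m is None:
        return None
    info = m.groupdict()
    # CUDA builds carry a tag such as "cu111"; CPU-only builds use "cpu".
    info["is_cuda"] = (info["build"] or "").startswith("cu")
    return info


info = parse_torch_wheel("cu111/torch-1.10.2%2Bcu111-cp38-cp38-linux_x86_64.whl")
print(info["version"], info["build"], info["python"], info["is_cuda"])
```

For the example wheel this yields torch 1.10.2, CUDA tag cu111 (CUDA 11.1), and Python tag cp38, which are the three values that have to agree with the machine and the Conda env.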

Sanity check

python3 -c "import torch; print(torch.cuda.nccl.version())"
locate nccl | grep "libnccl.so"
# should be the same version
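
The comparison can also be scripted. The sketch below pulls the version out of a versioned library filename such as libnccl.so.2.14.3 so it can be compared with what torch.cuda.nccl.version() prints. It assumes a recent PyTorch where that call returns a (major, minor, patch) tuple; the function name and the example path are hypothetical.

```python
import re


def nccl_version_from_path(path):
    """Extract (major, minor, patch) from a path like .../libnccl.so.2.14.3."""
    m = re.search(r"libnccl\.so\.(\d+)\.(\d+)\.(\d+)", path)
    return tuple(int(x) for x in m.groups()) if m else None


# Hypothetical path, standing in for one line of the `locate` output.
system_nccl = nccl_version_from_path("/usr/lib/x86_64-linux-gnu/libnccl.so.2.14.3")
print(system_nccl)

# With PyTorch installed, the check would then be:
#   import torch
#   print(system_nccl == tuple(torch.cuda.nccl.version()))
```

Unversioned names such as libnccl.so or libnccl.so.2 carry no full version, so the helper returns None for them.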

For “RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment”, see https://askubuntu.com/questions/1330041/runtimeerror-cuda-unknown-error-this-may-be-due-to-an-incorrectly-set-up-envi

python3
import torch
torch.__version__
torch.version.cuda
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.device(0)
torch.cuda.get_device_name(0)
torch.cuda.get_arch_list() # sm_86 for 3080Ti
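
The same checks can be collected into one function, which is handy for pasting into a bug report. This is a sketch rather than an official diagnostic: the function name and dictionary keys are invented here, and the per-device calls are guarded because torch.cuda.current_device() raises when no GPU is usable.

```python
def cuda_report():
    """Gather the sanity-check values into a dict; works without a GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        idx = torch.cuda.current_device()
        report["current_device"] = idx
        report["device_name"] = torch.cuda.get_device_name(idx)
        report["arch_list"] = torch.cuda.get_arch_list()
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```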

Optional dependencies of PyG:

pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu117.html

GPU info

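import pynvml

gpus = []
pynvml.nvmlInit()
ngpus_per_node = pynvml.nvmlDeviceGetCount()  # number of GPUs visible to NVML
for i in range(ngpus_per_node):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    info = pynvml.nvmlDeviceGetMemoryInfo(h)  # free/used are in bytes
    print(f'GPU {i}: {info.free / 1024 ** 2:.0f} MiB free, {info.used / 1024 ** 2:.0f} MiB used')
    gpus.append((info.free, i))
pynvml.nvmlShutdown()
gpus.sort(reverse=True)  # most free memory first
gpus = [i for free, i in gpus]  # GPU indices ordered by free memory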
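
One use for this ranking is to build the CUDA_VISIBLE_DEVICES value for a launch command like the finetune.sh one at the top. The sketch below does that from (index, free bytes) pairs; the function name and the sample readings are hypothetical, and it takes plain tuples so it can be tried without a GPU.

```python
def pick_gpus(free_by_index, k):
    """Return a CUDA_VISIBLE_DEVICES string for the k GPUs with the most free memory.

    free_by_index: list of (gpu_index, free_bytes) pairs.
    """
    ranked = sorted(free_by_index, key=lambda item: item[1], reverse=True)
    return ",".join(str(idx) for idx, _ in ranked[:k])


# Hypothetical free-memory readings (bytes) for four GPUs.
GIB = 1024 ** 3
sample = [(0, 2 * GIB), (1, 20 * GIB), (2, 11 * GIB), (3, 15 * GIB)]
print(pick_gpus(sample, 2))  # 1,3
```

Note that NVML enumerates GPUs in PCI bus order, while CUDA defaults to a "fastest first" order, so the indices may not line up on mixed-GPU machines unless CUDA_DEVICE_ORDER=PCI_BUS_ID is also set.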

CUDA+tensorflow