CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 bash finetune.sh
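The same restriction can be applied from inside Python, as long as the variable is set before CUDA is initialized (in practice, before importing torch). A minimal sketch; `restrict_gpus` is a hypothetical helper name, not part of any library:

```python
import os

def restrict_gpus(device_ids):
    """Expose only the given physical GPU indices to this process.

    Hypothetical helper: it must run before the first CUDA call
    (safest is before `import torch`).
    """
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

# Equivalent to: CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 bash finetune.sh
restrict_gpus(range(1, 8))
```

Inside the process the visible GPUs are renumbered from 0, so physical GPU 1 shows up as cuda:0.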
To make cuda report the error where it actually occurs:
CUDA_LAUNCH_BLOCKING=1 if using bash
os.environ['CUDA_LAUNCH_BLOCKING'] = "1" if using Python code
“The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.”
Switch system default CUDA version
# e.g. cuda 11.7, run the following within certain Conda env
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export CUDA_HOME=/usr/local/cuda-11.7' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export PATH=/usr/local/cuda-11.7/bin:$PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# deactivate and activate again
nvcc -V # should be 11.7
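To confirm the switch worked, the toolkit release reported by nvcc can be compared with the CUDA version PyTorch was built against. A rough sketch; the helper names are hypothetical, and it assumes `nvcc -V` prints a line containing `release X.Y`:

```python
import re
import shutil
import subprocess

def parse_nvcc_release(nvcc_output):
    """Return 'X.Y' from the 'release X.Y' part of `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

def compare_toolkit_with_torch():
    """Return (nvcc_release, torch_cuda); either may be None if unavailable."""
    nvcc_release = None
    if shutil.which("nvcc"):
        result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
        nvcc_release = parse_nvcc_release(result.stdout)
    try:
        import torch
        torch_cuda = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        torch_cuda = None
    return nvcc_release, torch_cuda

if __name__ == "__main__":
    nvcc_release, torch_cuda = compare_toolkit_with_torch()
    print(f"nvcc release: {nvcc_release}, torch built with CUDA: {torch_cuda}")
```

If the two values differ (as with 12.1 vs 11.8 in the error above), either point CUDA_HOME at a matching toolkit or install a PyTorch build for the toolkit that is present.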
“ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found”
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:/home/yzhanglo/anaconda3/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
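Before changing LD_LIBRARY_PATH, it can help to check which libstdc++.so.6 actually carries the missing ABI tag. The sketch below scans a library file for CXXABI_ version strings, roughly what `strings <lib> | grep CXXABI` does; `cxxabi_versions` and `provides` are hypothetical helpers, and the paths in the comments are only examples:

```python
import re

def cxxabi_versions(lib_path):
    """Return a sorted list of CXXABI_x.y.z tags found in a shared library file."""
    with open(lib_path, "rb") as f:
        data = f.read()
    tags = set(re.findall(rb"CXXABI_\d+(?:\.\d+)*", data))
    return sorted(tag.decode("ascii") for tag in tags)

def provides(lib_path, tag="CXXABI_1.3.9"):
    """True if the library file contains the given version tag."""
    return tag in cxxabi_versions(lib_path)

# Example calls (paths are assumptions; adjust to the machine):
# provides("/lib64/libstdc++.so.6")
# provides("<conda env prefix>/lib/libstdc++.so.6")
```

The library that returns True is the one whose directory needs to come first on the loader's search path.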
Get driver version: nvidia-smi (like 515.43.04)
Find an upper bound on the CUDA version: check https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html and look up the driver version in Table 3 to find a suitable CUDA version
Check available CUDA toolkits on the machine: ls /usr/local/
(for cuDNN, see https://developer.nvidia.com/rdp/cudnn-archive)
Check suitable Python version: look into https://download.pytorch.org/whl/torch_stable.html (make sure the filename contains ‘cu’, so that it is not the CPU-only version)
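The filename check in the last step can be made mechanical. A small sketch that pulls the PyTorch version, build tag, and Python tag out of a wheel filename; `parse_torch_wheel` is a hypothetical helper, and the example filename is written from memory to illustrate the naming pattern, so verify it against the index page:

```python
import re
from urllib.parse import unquote

# torch-<version>[+<build>]-<python tag>-...
WHEEL_RE = re.compile(
    r"torch-(?P<version>[\d.]+)(?:\+(?P<build>\w+))?-(?P<python>cp\d+)-"
)

def parse_torch_wheel(filename):
    """Return version/build/python tags from a torch wheel name, or None."""
    # Index pages URL-encode '+' as '%2B' and may prefix a directory.
    name = unquote(filename).rsplit("/", 1)[-1]
    match = WHEEL_RE.match(name)
    if match is None:
        return None
    info = match.groupdict()
    # A build tag such as 'cu111' marks a CUDA wheel; 'cpu' or no tag does not.
    info["is_cuda"] = bool(info["build"]) and info["build"].startswith("cu")
    return info

print(parse_torch_wheel("cu111/torch-1.10.2%2Bcu111-cp39-cp39-linux_x86_64.whl"))
```

The `cp39` part is the CPython tag (Python 3.9), which is what has to match the interpreter in the environment.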
Install
With conda, the following will be enough: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 pyg -c pytorch -c nvidia -c pyg
With pip: pip install torch==1.10.2+cu111 torchvision==0.11.3+cu111 torchaudio==0.10.2+cu111 -f https://download.pytorch.org/whl/torch_stable.html (check the website in [5] to decide version correspondence)
Sanity check
python3 -c "import torch; print(torch.cuda.nccl.version())"
locate nccl| grep "libnccl.so"
# should be the same version
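The two versions can also be compared programmatically. This sketch extracts the version from a libnccl.so.X.Y.Z filename so it can be set beside the value of `torch.cuda.nccl.version()`; the helper name and the sample path are hypothetical, and it assumes the library file carries the full version in its name:

```python
import re

def nccl_version_from_path(path):
    """Return (major, minor, patch) parsed from a libnccl.so.X.Y.Z path, or None."""
    match = re.search(r"libnccl\.so\.(\d+)\.(\d+)\.(\d+)", path)
    return tuple(int(part) for part in match.groups()) if match else None

# Sample path for illustration only; use a line from the `locate` output.
print(nccl_version_from_path("/usr/lib/x86_64-linux-gnu/libnccl.so.2.14.3"))
```

Recent PyTorch releases return a tuple from `torch.cuda.nccl.version()`, which compares directly with the result here; older releases may return a single integer instead, so check the type before comparing.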
https://askubuntu.com/questions/1330041/runtimeerror-cuda-unknown-error-this-may-be-due-to-an-incorrectly-set-up-envi
python3
import torch
torch.__version__
torch.version.cuda
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.device(0)
torch.cuda.get_device_name(0)
torch.cuda.get_arch_list() # sm_86 for 3080Ti
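The same checks can be collected into one script that does not crash on a machine without a GPU. A minimal sketch; `cuda_report` is a hypothetical name, and the per-device queries are only made when `torch.cuda.is_available()` is true:

```python
def cuda_report():
    """Collect the sanity-check values above into a dict."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    report = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        index = torch.cuda.current_device()
        report["current_device"] = index
        report["device_name"] = torch.cuda.get_device_name(index)
        report["arch_list"] = torch.cuda.get_arch_list()
    return report

for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

A `built_with_cuda` of None together with `cuda_available: False` usually points to a CPU-only wheel rather than a driver problem.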
Optional dependency of PyG:
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu117.html
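The index URL has to match the installed PyTorch build. A sketch that derives it from a version string shaped like `torch.__version__` (for example `2.0.0+cu117`); `pyg_wheel_index` is a hypothetical helper, and whether an index page exists for a given version and CUDA pair is an assumption to verify on data.pyg.org:

```python
def pyg_wheel_index(torch_version):
    """Build the PyG wheel index URL for a version string like '2.0.0+cu117'."""
    base, _, build = torch_version.partition("+")
    # Assumption: a version string without a '+tag' suffix is treated as CPU-only.
    build = build or "cpu"
    return f"https://data.pyg.org/whl/torch-{base}+{build}.html"

print(pyg_wheel_index("2.0.0+cu117"))
```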
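import pynvml

# Rank GPUs by free memory, most free first
gpus = []
pynvml.nvmlInit()
# Assumption: use every GPU NVML can see; replace with your own ngpus_per_node if needed
ngpus_per_node = pynvml.nvmlDeviceGetCount()
for i in range(ngpus_per_node):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    info = pynvml.nvmlDeviceGetMemoryInfo(h)
    print(f'GPU {i}: {info.free / 1024 ** 2:.0f} MiB free, {info.used / 1024 ** 2:.0f} MiB used')
    gpus.append((info.free, i))
gpus.sort(reverse=True)
gpus = [i for free, i in gpus]  # GPU indices, most free memory first
pynvml.nvmlShutdown()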
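Once the GPUs are ranked, the top entries can be turned into a CUDA_VISIBLE_DEVICES value. A sketch on plain (free_bytes, index) pairs so it runs without pynvml; `pick_gpus` and the sample numbers are made up for illustration:

```python
def pick_gpus(free_by_index, k=1):
    """Return a CUDA_VISIBLE_DEVICES string for the k GPUs with the most free memory."""
    ranked = sorted(free_by_index, reverse=True)
    return ",".join(str(index) for _, index in ranked[:k])

# Made-up (free_bytes, gpu_index) pairs standing in for NVML readings.
sample = [(4 * 1024**3, 0), (20 * 1024**3, 1), (11 * 1024**3, 2)]
print(pick_gpus(sample, k=2))  # 1,2
```

NVML enumerates devices in PCI bus order, while CUDA by default may order them differently; setting CUDA_DEVICE_ORDER=PCI_BUS_ID alongside CUDA_VISIBLE_DEVICES keeps the two numberings aligned.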