CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 bash finetune.sh

Troubleshooting

CUDA+PyTorch+PyG

  1. Get the driver version: nvidia-smi (e.g. 515.43.04)

  2. Find an upper bound for CUDA: check https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html and pick a CUDA version compatible with the driver version (Table 3)

  3. Check the CUDA toolkits available on the machine: ls /usr/local/

  4. (If cuDNN is needed, see https://developer.nvidia.com/rdp/cudnn-archive)

  5. Check the matching PyTorch/Python version: look through https://download.pytorch.org/whl/torch_stable.html (make sure the filename contains 'cu', i.e. it is not a CPU-only build)

    1. e.g. cu111/torch-1.10.2%2Bcu111-cp38-cp38-linux_x86_64.whl (cu111 = CUDA 11.1, cp38 = CPython 3.8)
  6. Install

    1. (Recommended) Follow the PyTorch official website. If lucky, conda install pytorch torchvision torchaudio pytorch-cuda=11.7 pyg -c pytorch -c nvidia -c pyg will be enough.
    2. Or use pip: pip install torch==1.10.2+cu111 torchvision==0.11.3+cu111 torchaudio==0.10.2+cu111 -f https://download.pytorch.org/whl/torch_stable.html; check the website in step 5 to decide the version correspondence.
  7. Sanity check (a combined check script is also sketched after this list)

    python3 -c "import torch; print(torch.cuda.nccl.version())"
    locate nccl | grep "libnccl.so"
    # the two NCCL versions should match
    

    (If you hit "RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment", see https://askubuntu.com/questions/1330041/runtimeerror-cuda-unknown-error-this-may-be-due-to-an-incorrectly-set-up-envi)

    python3
    import torch
    torch.__version__             # PyTorch version
    torch.version.cuda            # CUDA version PyTorch was built against
    torch.cuda.is_available()     # should be True
    torch.cuda.device_count()
    torch.cuda.current_device()
    torch.cuda.device(0)
    torch.cuda.get_device_name(0)
    torch.cuda.get_arch_list()    # e.g. includes sm_86 for a 3080 Ti
    
  8. Optional dependencies of PyG:

pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu117.html
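
The per-step checks above can be combined into one quick script. This is a minimal sketch, assuming pynvml, torch, and (optionally) torch_geometric are installed; it only reports versions and availability, it does not fix anything.

import pynvml
import torch

# Driver version reported by the NVIDIA driver (step 1).
# Note: may come back as bytes on older pynvml versions.
pynvml.nvmlInit()
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()

# PyTorch build and the CUDA version it was compiled against (step 7)
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

# Optional: verify that PyG imports against this torch build (step 8)
try:
    import torch_geometric
    print("PyG:", torch_geometric.__version__)
except ImportError as e:
    print("PyG not installed:", e)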

GPU info

import pynvml

pynvml.nvmlInit()
ngpus_per_node = pynvml.nvmlDeviceGetCount()
gpus = []
for i in range(ngpus_per_node):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    info = pynvml.nvmlDeviceGetMemoryInfo(h)
    print(f'GPU {i}: {info.free / 1024 ** 2:.0f} MiB free, {info.used / 1024 ** 2:.0f} MiB used')
    gpus.append((info.free, i))
# sort by free memory, most free first, keeping only the device indices
gpus.sort(reverse=True)
gpus = [i for free, i in gpus]
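
One way this ordering might be used (an assumption, not part of the original script): expose only the GPU with the most free memory to the process via CUDA_VISIBLE_DEVICES, which must be set before CUDA is initialized in that process.

import os
import pynvml

# Pick the GPU index with the most free memory (same ordering as above)
pynvml.nvmlInit()
free = [(pynvml.nvmlDeviceGetMemoryInfo(
            pynvml.nvmlDeviceGetHandleByIndex(i)).free, i)
        for i in range(pynvml.nvmlDeviceGetCount())]
pynvml.nvmlShutdown()

# Expose only that GPU; do this before the first CUDA call (e.g. before importing torch)
os.environ["CUDA_VISIBLE_DEVICES"] = str(max(free)[1])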

CUDA+TensorFlow