PyTorch error on H20 and A800 machines with NVLink

Published: 2025-06-26

Background

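On an H20 machine with NVLink, PyTorch fails to initialize CUDA even after the driver and CUDA have been installed.
An A800 machine shows the same error.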

(pytorch2.4) root@xx-dev-H20:~# python
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct 2 2023, 17:29:18) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/root/anaconda3/envs/pytorch2.4/lib/python3.12/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /opt/conda/conda-bld/pytorch_1724789220573/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

Solution

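Error 802 (system not yet initialized) typically means a driver daemon that CUDA depends on is not running; on NVSwitch-based NVLink systems that daemon is NVIDIA Fabric Manager. From the NVIDIA fabricmanager download site (linked in the references at the end of this post), pick the Fabric Manager release whose version matches the driver installed on the H20 machine, install it, and start it. After that, CUDA initializes normally: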

(pytorch2.4) root@xx-dev-H20:/opt/fabricmanager-linux-x86_64-550.163.01-archive# python
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct  2 2023, 17:29:18) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
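
The version matching in the fix above can be checked before starting anything. The following is a minimal sketch, not part of the original procedure: the helper names (`extract_version`, `versions_match`, `check_fabric_manager`) are hypothetical, and it assumes that `nvidia-smi` and `nv-fabricmanager` are on the PATH and that `nv-fabricmanager --version` prints a dotted version number somewhere in its output.

```python
import re
import subprocess

# Matches dotted versions such as 550.163.01 (assumed format).
VERSION_RE = re.compile(r"\d+\.\d+(?:\.\d+)?")


def extract_version(text):
    """Return the first dotted version number found in text, or None."""
    match = VERSION_RE.search(text)
    return match.group(0) if match else None


def versions_match(driver_version, fm_version):
    """Fabric Manager is expected to be the same release as the driver."""
    return driver_version is not None and driver_version == fm_version


def run(cmd):
    """Run a command; return stdout plus stderr, or None if it is missing or fails."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return result.stdout + result.stderr


def check_fabric_manager():
    """Return (driver_version, fabric_manager_version, match) for this machine."""
    driver_out = run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
    # Assumption: the version flag prints a dotted version in its output.
    fm_out = run(["nv-fabricmanager", "--version"])
    driver = extract_version(driver_out or "")
    fm = extract_version(fm_out or "")
    return driver, fm, versions_match(driver, fm)
```

Calling `check_fabric_manager()` on the affected machine returns both versions and whether they agree; a mismatch, or `None` for the Fabric Manager version, suggests the package still needs to be installed or replaced.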

Check NVLink throughput (the second command refreshes every second):
nvidia-smi nvlink --getthroughput d
watch -n 1 'nvidia-smi nvlink -gt d'
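
If the readings need to be collected from a script rather than watched in a terminal, the same command can be polled from Python. This is only a sketch under assumptions: `read_nvlink_throughput` and `sample` are hypothetical helper names, and the output is kept as raw text because its exact layout is not shown in this post.

```python
import subprocess
import time


def read_nvlink_throughput():
    """Return the raw output of `nvidia-smi nvlink -gt d`, or None on failure."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "nvlink", "-gt", "d"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return result.stdout


def sample(reader=read_nvlink_throughput, count=5, interval=1.0):
    """Call reader `count` times, `interval` seconds apart, and collect the results."""
    samples = []
    for i in range(count):
        samples.append(reader())
        # Sleep between samples, but not after the last one.
        if i < count - 1:
            time.sleep(interval)
    return samples
```

For example, `sample(count=10, interval=1.0)` gathers ten one-second snapshots, which can then be compared while a multi-GPU job is running.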

References

Fabric Manager downloads
https://developer.download.nvidia.cn/compute/nvidia-driver/redist/fabricmanager/linux-x86_64/
NCCL communication over NVLink (environment variables)
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

