Running LLaMA-Factory with multiple GPUs raises an error:
Traceback (most recent call last):
  File "/home/Mmm/anaconda3/envs/llama/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/data/Mmm/LLaMA-Factory/src/llamafactory/cli.py", line 130, in main
    process = subprocess.run(
  File "/home/Mmm/anaconda3/envs/llama/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '4', '--master_addr', '127.0.0.1', '--master_port', '33837', '/data/Mmm/LLaMA-Factory/src/llamafactory/launcher.py', '/data/Mmm/Params/train_2025-07-23-13-30-31/training_args.yaml']' returned non-zero exit status 1.
Solution: add the following to LLaMA-Factory/src/llamafactory/launcher.py:
import os
# Pin training to a specific GPU ID (for example, use only GPU 0)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Then, before running llamafactory-cli webui in the terminal, run these two commands first:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
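If you would rather not re-export these in every new shell, the same flags can also be set from Python. A minimal sketch, assuming the lines run before torch.distributed initializes NCCL (for example next to the CUDA_VISIBLE_DEVICES line added to launcher.py above):

import os

# Disable NCCL's P2P and InfiniBand transports; must be set before
# torch.distributed initializes NCCL, e.g. at the top of launcher.py.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"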
Otherwise you will get this error:
File "/home/Mmm/anaconda3/envs/llama/lib/python3.10/site-packages/accelerate/state.py", line 311, in __init__
raise NotImplementedError(
NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
With the steps above the error goes away, at the cost of not actually using multiple GPUs: training falls back to the single visible GPU.
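To sanity-check the workaround, a small standalone script (hypothetical, not part of LLaMA-Factory) can mirror the launcher.py change and confirm what PyTorch actually sees:

import os

# Mirror the launcher.py change; must happen before torch is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

# With only GPU 0 visible, this should print 1.
print(torch.cuda.device_count())
# These should both print "1" if the exports ran in this shell.
print(os.environ.get("NCCL_P2P_DISABLE"), os.environ.get("NCCL_IB_DISABLE"))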