Fixing the multi-GPU error when running llamafactory

Published: 2025-07-28

Running llamafactory with multiple GPUs raises the following error:

Traceback (most recent call last):
      File "/home/Mmm/anaconda3/envs/llama/bin/llamafactory-cli", line 8, in <module>
        sys.exit(main())
      File "/data/Mmm/LLaMA-Factory/src/llamafactory/cli.py", line 130, in main
        process = subprocess.run(
      File "/home/Mmm/anaconda3/envs/llama/lib/python3.10/subprocess.py", line 526, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '4', '--master_addr', '127.0.0.1', '--master_port', '33837', '/data/Mmm/LLaMA-Factory/src/llamafactory/launcher.py', '/data/Mmm/Params/train_2025-07-23-13-30-31/training_args.yaml']' returned non-zero exit status 1.
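This CalledProcessError is raised by cli.py, which launches torchrun through subprocess.run with check=True; any nonzero exit status from the worker processes bubbles up as this exception. A minimal sketch of that mechanism (the failing child command here is a stand-in, not the real torchrun invocation):

```python
import subprocess
import sys

# subprocess.run(..., check=True) raises CalledProcessError when the
# child process exits with a nonzero status -- the same mechanism
# behind the traceback above.
try:
    subprocess.run(
        [sys.executable, "-c", "raise SystemExit(1)"],  # child that fails
        check=True,
    )
except subprocess.CalledProcessError as e:
    print("exit status", e.returncode)  # -> exit status 1
```

So the traceback only tells you that a torchrun worker failed; the real cause appears earlier in the worker's own output.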

Workaround: add the following to LLaMA-Factory/src/llamafactory/launcher.py

    import os
    # Restrict the process to a specific GPU ID (e.g. only GPU 0)
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Then, before running llamafactory-cli webui in the terminal, run these two commands first:

 export NCCL_P2P_DISABLE=1
 export NCCL_IB_DISABLE=1
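If you prefer not to rely on shell exports (which only affect the current session), the same two variables can be set from Python before training starts. A minimal sketch, equivalent to the exports above:

```python
import os

# Equivalent to the two shell exports: disable NCCL peer-to-peer and
# InfiniBand transports, which RTX 4000-series consumer cards lack.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")
os.environ.setdefault("NCCL_IB_DISABLE", "1")

print(os.environ["NCCL_P2P_DISABLE"], os.environ["NCCL_IB_DISABLE"])  # -> 1 1
```

Like CUDA_VISIBLE_DEVICES, these must be in place before NCCL initializes its communicators.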

Otherwise you will get this error:
      File "/home/Mmm/anaconda3/envs/llama/lib/python3.10/site-packages/accelerate/state.py", line 311, in __init__
        raise NotImplementedError(
    NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.

Following the steps above resolves the error by restricting training to a single GPU.

