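Prerequisites
A GPU cluster is already up and running, and a new node (with a 3090 card) needs to be added to share the workload. The work breaks down roughly into the following parts:
- 1, Install the GPU driver
- 2, Install docker
- 3, Install cri-dockerd
- 4, Install Nvidia-container-toolkit offline
- 5, Install the k8s components and keys from binaries
Each part is covered below.
1, Install the GPU driver
For details, see: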
https://blog.csdn.net/m0_62464865/article/details/145487945?spm=1001.2014.3001.5502
2, Install docker
For details, see:
https://blog.csdn.net/m0_62464865/article/details/145491293?spm=1001.2014.3001.5502
3, Install cri-dockerd
3.1 Download and extract
wget https://github.com/Mirantis/cri-dockerd/releases/download/v0.3.16/cri-dockerd-0.3.16.arm64.tgz
tar -zxvf cri-dockerd-0.3.16.arm64.tgz
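The commands above fetch the arm64 build. On an amd64 machine, use this URL instead (and adjust the file name in the tar command to match):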
https://github.com/Mirantis/cri-dockerd/releases/download/v0.3.16/cri-dockerd-0.3.16.amd64.tgz
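To avoid picking the wrong archive by hand, the URL can be derived from the machine's architecture. The following is only a minimal sketch: the function names cri_dockerd_arch and cri_dockerd_url are hypothetical helpers made up for this example, and the URL pattern is simply the one used by the two links above.

```shell
# Map a `uname -m` machine name to the architecture suffix used in the
# cri-dockerd release file names shown above (amd64 or arm64).
# Hypothetical helper, not part of cri-dockerd.
cri_dockerd_arch() {
  case "$1" in
    x86_64|amd64) echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the release URL for a version (e.g. 0.3.16) and a machine name.
# Assumes the same URL layout as the two links above.
cri_dockerd_url() {
  local arch
  arch="$(cri_dockerd_arch "$2")" || return 1
  echo "https://github.com/Mirantis/cri-dockerd/releases/download/v$1/cri-dockerd-$1.${arch}.tgz"
}
```

Usage, assuming wget is available on the node: wget "$(cri_dockerd_url 0.3.16 "$(uname -m)")"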
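3.2 Copy the binary to the bin directory
The archive unpacks into a cri-dockerd directory that contains the binary of the same name, so copy the file inside it:
sudo cp cri-dockerd/cri-dockerd /usr/bin/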
3.3 Configure the systemd unit files
sudo vim /usr/lib/systemd/system/cri-docker.service
[Unit]
Description=CRI Interface for Docker Application Container Engine
Documentation=https://docs.mirantis.com
After=network-online.target firewalld.service docker.service
Wants=network-online.target
Requires=cri-docker.socket
[Service]
Type=notify
ExecStart=/usr/bin/cri-dockerd --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.7
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always
StartLimitBurst=3
StartLimitInterval=60s
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
KillMode=process
[Install]
WantedBy=multi-user.target
sudo vim /usr/lib/systemd/system/cri-docker.socket
[Unit]
Description=CRI Docker Socket for the API
PartOf=cri-docker.service
[Socket]
ListenStream=%t/cri-dockerd.sock
SocketMode=0660
SocketUser=root
SocketGroup=docker
[Install]
WantedBy=sockets.target
3.4 Start cri-docker and enable it at boot
sudo systemctl daemon-reload
sudo systemctl enable cri-docker --now
sudo systemctl status cri-docker
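Besides the status output, it can be useful to confirm that the socket declared in cri-docker.socket actually exists. The snippet below is a rough sketch, not an official check: check_cri_socket is a hypothetical helper, and the default path assumes that %t in ListenStream resolves to /run, as it does for system units.

```shell
# Report whether a Unix socket exists at the given path.
# Default path is an assumption: %t/cri-dockerd.sock with %t = /run.
# Hypothetical helper, not part of cri-dockerd.
check_cri_socket() {
  local sock
  sock="${1:-/run/cri-dockerd.sock}"
  if [ -S "$sock" ]; then
    echo "ok: $sock"
  else
    echo "missing: $sock"
    return 1
  fi
}
```

Usage: run check_cri_socket with no argument after starting the service; a "missing" result suggests that the socket unit did not start or listens on a different path.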
4, Install Nvidia-container-toolkit offline
For details, see:
https://blog.csdn.net/m0_62464865/article/details/145500004?spm=1001.2014.3001.5502
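5, Install the k8s components and keys from binaries
This part covers a lot of ground, so a detailed write-up on building a k8s GPU cluster offline and adding new nodes to it will follow when time permits.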