MaskFormer is a transformer-based semantic segmentation codebase.
Repository:
https://github.com/facebookresearch/MaskFormer/tree/main
Test dataset: ADE20K
The dataset can be downloaded by following the link above. The training split contains 20,210 images and the validation split contains 2,000 images. The SceneParsing directory holds the scene-parsing (semantic segmentation) label images, and the InstanceSegmentation directory holds the instance segmentation label images.
1. Environment Setup
I ran my experiments on a Linux server with Python 3.10, CUDA 11.8, and torch 2.1.0. After installing torch via pip, follow the instructions in INSTALL.md to install detectron2 and the remaining dependencies.
A few points to note:
1. Install the headless build of OpenCV:
pip install opencv-python-headless
2. Install a 1.x version of numpy:
pip install numpy==1.26.0
3. When the model is loaded through timm, some imports fail because newer timm releases moved these helpers out of timm.models.layers. In mask_former/modeling/backbone/swin.py, change the import as follows:
# from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from timm.layers import DropPath, to_2tuple, trunc_normal_
4. Install the panopticapi package:
git clone https://github.com/cocodataset/panopticapi.git
cd panopticapi
python setup.py build_ext --inplace
python setup.py build_ext install
My working environment is listed below:
Package Version Editable project location
----------------------- ------------------ ------------------------------------
absl-py 2.2.1
antlr4-python3-runtime 4.9.3
black 25.1.0
certifi 2025.1.31
charset-normalizer 3.4.1
click 8.1.8
cloudpickle 3.1.1
coloredlogs 15.0.1
contourpy 1.3.1
cycler 0.12.1
Cython 3.0.12
detectron2 0.6 /home/shengpeng/downloads/detectron2
filelock 3.18.0
flatbuffers 25.2.10
fonttools 4.56.0
fsspec 2025.3.0
fvcore 0.1.5.post20221221
grpcio 1.71.0
h5py 3.13.0
huggingface-hub 0.29.3
humanfriendly 10.0
hydra-core 1.3.2
idna 3.10
iopath 0.1.9
Jinja2 3.1.6
kiwisolver 1.4.8
Markdown 3.7
markdown-it-py 3.0.0
MarkupSafe 3.0.2
matplotlib 3.10.1
mdurl 0.1.2
mpmath 1.3.0
mypy-extensions 1.0.0
networkx 3.4.2
numpy 1.26.0
omegaconf 2.3.0
onnx 1.17.0
onnx-simplifier 0.4.36
onnxruntime 1.21.0
opencv-python-headless 4.11.0.86
packaging 24.2
panopticapi 0.1
pathspec 0.12.1
pillow 11.1.0
pip 25.0
platformdirs 4.3.7
portalocker 3.1.1
protobuf 6.30.2
pycocotools 2.0.8
Pygments 2.19.1
pyparsing 3.2.3
python-dateutil 2.9.0.post0
PyYAML 6.0.2
requests 2.32.3
rich 13.9.4
safetensors 0.5.3
scipy 1.15.2
setuptools 75.8.0
shapely 2.0.7
six 1.17.0
sympy 1.13.3
tabulate 0.9.0
tensorboard 2.19.0
tensorboard-data-server 0.7.2
termcolor 2.5.0
timm 1.0.15
tomli 2.2.1
torch 2.1.0+cu118
torchvision 0.16.0+cu118
tqdm 4.67.1
triton 2.1.0
typing_extensions 4.13.0
urllib3 2.3.0
Werkzeug 3.1.3
wheel 0.45.1
yacs 0.1.8
Next, download a pretrained model and run demo/demo.py with the config file and the pretrained weights to run inference on an image and check the predictions:
python demo/demo.py \
--config-file configs/ade20k-150/maskformer_R50_bs16_160k.yaml \
--input images/ADE/ADE_test_00000001.jpg \
--opts MODEL.WEIGHTS weights/MaskFormer_seg_R50_512x512.pkl
Training script:
python train_net.py \
--num-gpus 2 \
--config-file configs/ade20k-150/maskformer_R50_bs16_160k.yaml
The dataset root must be set in train_net.py:
os.environ['DETECTRON2_DATASETS']='/home/shengpeng/code/github_proj2/ADE2016/SceneParsing'
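For reference, MaskFormer's dataset registration expects an ADE20K layout roughly like the following under DETECTRON2_DATASETS (see datasets/README.md in the repo; annotations_detectron2 is generated by datasets/prepare_ade20k_sem_seg.py):
$DETECTRON2_DATASETS/
    ADEChallengeData2016/
        images/
            training/
            validation/
        annotations/
            training/
            validation/
        annotations_detectron2/
            training/
            validation/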
Training on two RTX 3090 cards took roughly one night. Note that even the smallest model, built on the R50 backbone, weighs in at over 160 MB.
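To re-run evaluation on a trained checkpoint, the standard detectron2 flags of train_net.py should work (the checkpoint name below is a placeholder; use whatever your run produced):
python train_net.py \
  --eval-only \
  --num-gpus 2 \
  --config-file configs/ade20k-150/maskformer_R50_bs16_160k.yaml \
  MODEL.WEIGHTS output/model_final.pth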
2. Converting the torch Model to ONNX
This codebase does not ship with an ONNX export script, so we have to write our own.
In the downloaded detectron2 source, in detectron2/detectron2/engine/defaults.py, rewrite the __call__ method of class DefaultPredictor as follows:
def __call__(self, original_image):
    with torch.no_grad():
        # BGR (OpenCV) -> RGB, matching the config's input format
        image = original_image[:, :, ::-1]
        # HWC uint8 -> CHW float32, then add the batch dimension
        input_blob = torch.as_tensor(image.astype("float32").transpose(2, 0, 1))
        input_blob = input_blob.unsqueeze(0)
        # normalize here so the exported graph takes a plain tensor
        pixel_mean = torch.Tensor(self.cfg.MODEL.PIXEL_MEAN).view(-1, 1, 1)
        pixel_std = torch.Tensor(self.cfg.MODEL.PIXEL_STD).view(-1, 1, 1)
        input_blob = (input_blob - pixel_mean) / pixel_std
        input_blob = input_blob.to(self.cfg.MODEL.DEVICE)
        print('input_blob.shape:', input_blob.shape)
        predictions = self.model(input_blob)[0]
        return predictions
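After this change, the predictor takes a bare BGR image and returns the first element of processed_results, i.e. a {"sem_seg": ...} dict. A minimal sketch of exercising it (the image path is a placeholder):
import cv2
from detectron2.engine import DefaultPredictor

predictor = DefaultPredictor(cfg)  # cfg built as in the export script below
img = cv2.imread('images/ADE/ADE_test_00000001.jpg')  # HWC, BGR
img = cv2.resize(img, (512, 512))  # match the export resolution
pred = predictor(img)              # {"sem_seg": per-pixel class score tensor}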
Then rewrite the forward() method of class MaskFormer in MaskFormer/mask_former/mask_former_model.py so that it consumes a plain tensor instead of detectron2's batched inputs:
def forward(self, input_blob):
    # input_blob is a normalized NCHW tensor rather than the original
    # list of {"image": ...} dicts, which makes the graph traceable for export
    input_h, input_w = input_blob.shape[2], input_blob.shape[3]
    features = self.backbone(input_blob)
    outputs = self.sem_seg_head(features)
    if self.training:
        # target preparation is dropped here; this branch is not used for export
        targets = None
        # bipartite matching-based loss
        losses = self.criterion(outputs, targets)
        for k in list(losses.keys()):
            if k in self.criterion.weight_dict:
                losses[k] *= self.criterion.weight_dict[k]
            else:
                # remove this loss if not specified in `weight_dict`
                losses.pop(k)
        return losses
    else:
        mask_cls_results = outputs["pred_logits"]
        mask_pred_results = outputs["pred_masks"]
        # upsample masks to the network input resolution
        mask_pred_results = F.interpolate(
            mask_pred_results,
            size=(input_h, input_w),
            mode="bilinear",
            align_corners=False,
        )
        print('mask_cls_results:', mask_cls_results.shape)
        print('mask_pred_results:', mask_pred_results.shape)
        processed_results = []
        if self.sem_seg_postprocess_before_inference:
            mask_pred_results = sem_seg_postprocess(
                mask_pred_results, [input_h, input_w], input_h, input_w
            )
        # semantic segmentation inference
        r = self.semantic_inference(mask_cls_results, mask_pred_results)
        if not self.sem_seg_postprocess_before_inference:
            r = sem_seg_postprocess(r, [input_h, input_w], input_h, input_w)
        processed_results.append({"sem_seg": r})
        return processed_results
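For reference, semantic_inference as it appears in the upstream repo fuses the per-query class logits and per-query masks into a per-pixel class score map:
def semantic_inference(self, mask_cls, mask_pred):
    # softmax over classes, dropping the trailing "no object" logit
    mask_cls = F.softmax(mask_cls, dim=-1)[..., :-1]
    mask_pred = mask_pred.sigmoid()
    # weight each query's mask by its class probabilities
    semseg = torch.einsum("qc,qhw->chw", mask_cls, mask_pred)
    return semseg
Note that this einsum is written per-image; since the rewritten forward() passes batched tensors, if the shapes do not line up you may need to switch the subscripts to "bqc,bqhw->bchw".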
Create a new export script, tools/convert_torchvision_to_onnx.py:
import argparse
import os
# fmt: off
import sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))
# fmt: on
import onnx
import torch
from detectron2.config import get_cfg
from detectron2.projects.deeplab import add_deeplab_config
from mask_former import add_mask_former_config
from demo.predictor import VisualizationDemo


def setup_cfg(args):
    # load config from file and command-line arguments
    cfg = get_cfg()
    add_deeplab_config(cfg)
    add_mask_former_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    return cfg


def get_parser():
    parser = argparse.ArgumentParser(description="MaskFormer ONNX export")
    parser.add_argument("--config-file", default="configs/ade20k-150/maskformer_R50_bs16_160k.yaml")
    parser.add_argument("--input", nargs="+")
    parser.add_argument(
        "--output", help="A file or directory to save output visualizations. "
        "If not given, will show output in an OpenCV window.")
    parser.add_argument(
        "--confidence-threshold", type=float, default=0.5,
        help="Minimum score for instance predictions to be shown")
    parser.add_argument(
        "--opts",
        help="Modify config options using the command-line 'KEY VALUE' pairs",
        default=['MODEL.WEIGHTS', 'output/model_0159999.pth'],
        nargs=argparse.REMAINDER,
    )
    return parser
if __name__ == "__main__":
    args = get_parser().parse_args()
    cfg = setup_cfg(args)
    demo = VisualizationDemo(cfg)
    net = demo.predictor.model
    net.to('cpu')
    net.eval()
    input_model_path = cfg.MODEL.WEIGHTS
    print('input_model_path:%s' % (input_model_path))
    output_model_path = input_model_path.replace('.pth', '.onnx')
    # dummy input, NCHW; the graph is traced at this fixed 512x512 resolution
    im = torch.zeros(1, 3, 512, 512).to('cpu')
    input_layer_names = ["images"]
    output_layer_names = ["output"]
    dynamic = False
    # Export the model
    print(f'Starting export with onnx {onnx.__version__}.')
    torch.onnx.export(net,
                      im,
                      f=output_model_path,
                      verbose=False,
                      opset_version=12,
                      training=torch.onnx.TrainingMode.EVAL,
                      do_constant_folding=True,
                      input_names=input_layer_names,
                      output_names=output_layer_names,
                      dynamic_axes={'images': {0: 'batch'}, 'output': {0: 'batch'}} if dynamic else None)
    # Checks
    model_onnx = onnx.load(output_model_path)  # load onnx model
    onnx.checker.check_model(model_onnx)       # check onnx model
    # Simplify onnx
    simplify = True
    if simplify:
        import onnxsim
        print(f'Simplifying with onnx-simplifier {onnxsim.__version__}.')
        onnx_sim_model, check = onnxsim.simplify(model_onnx)
        assert check, 'onnx-simplifier check failed'
        # save the simplified model, not the original one
        onnx.save(onnx_sim_model, output_model_path)
    print('Onnx model saved as {}'.format(output_model_path))
The conversion then succeeds and produces the corresponding ONNX model, which can be loaded with onnxruntime for inference.
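A minimal onnxruntime sketch (the file names are placeholders, and the mean/std values are assumed to match cfg.MODEL.PIXEL_MEAN / PIXEL_STD of this config):
import cv2
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('output/model_0159999.onnx',
                            providers=['CPUExecutionProvider'])
img = cv2.imread('images/ADE/ADE_test_00000001.jpg')
img = cv2.resize(img, (512, 512))
blob = img[:, :, ::-1].astype('float32').transpose(2, 0, 1)[None]  # BGR->RGB, NCHW
mean = np.array([123.675, 116.28, 103.53], dtype='float32').reshape(1, 3, 1, 1)
std = np.array([58.395, 57.12, 57.375], dtype='float32').reshape(1, 3, 1, 1)
blob = np.ascontiguousarray((blob - mean) / std)  # normalize like the predictor
sem_seg = sess.run(None, {'images': blob})[0]     # input name from the export script
sem_seg = np.squeeze(sem_seg)                     # assumed (150, 512, 512)
pred = sem_seg.argmax(axis=0).astype(np.uint8)    # per-pixel class ids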
3. Inference Speed Test
In C++, I loaded each ONNX model, converted it to a TensorRT FP16 engine, and compared segformer's 14 MB model against this 161 MB MaskFormer model, both at 512x512 input resolution:
segformer_b0: ~10 ms
maskformer_R50: ~220 ms
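For reference, an FP16 engine can be built from the exported model with trtexec before timing it from C++ (paths are placeholders):
trtexec --onnx=output/model_0159999.onnx --fp16 --saveEngine=maskformer_r50_fp16.engine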
These results show that MaskFormer is not a good fit for latency-critical applications; it is better suited to scenarios with many classes or to panoptic segmentation.