MNN, a Lightweight High-Performance Inference Engine: Study Notes 02. Main MNN APIs

1. Main MNN APIs

Note: these study notes cover only the APIs I used most often while learning; for the rest of the MNN API, see the official documentation.

1.1. Inference workflow

Create the Interpreter: createFromFile()
Create a Session from the Interpreter: createSession()
Set the input data: getSessionInput(), map(), unmap(), copyFromHostTensor()
Run inference with the Session: runSession()
Fetch the results: getSessionOutput(), map(), unmap(), copyToHostTensor()
Release the Interpreter: delete
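
The six steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not an official example: runOnce is a hypothetical helper name, the all-zero input is a placeholder, and elementSize() and MNN::NO_ERROR are MNN names that these notes do not otherwise cover. It uses the copy-based calls from the list. The __has_include guard (C++17) is only there so the file still builds on a machine without the MNN headers, in which case the function simply returns false.

```cpp
#include <cassert>
#include <memory>

#if __has_include(<MNN/Interpreter.hpp>)
#include <MNN/Interpreter.hpp>
#include <MNN/Tensor.hpp>
#define NOTES_HAVE_MNN 1
#else
#define NOTES_HAVE_MNN 0
#endif

// Hypothetical helper: runs one inference on the model at modelPath.
// Returns true only if every step succeeded.
bool runOnce(const char* modelPath) {
    assert(modelPath != nullptr);
#if NOTES_HAVE_MNN
    // 1. Create the Interpreter; unique_ptr performs the final delete.
    std::unique_ptr<MNN::Interpreter> net(MNN::Interpreter::createFromFile(modelPath));
    if (!net) return false;

    // 2. Create a Session with the default (CPU) configuration.
    MNN::ScheduleConfig config;
    MNN::Session* session = net->createSession(config);
    if (session == nullptr) return false;

    // 3. Fill the input through a host-side NCHW tensor (placeholder data).
    MNN::Tensor* input = net->getSessionInput(session, nullptr);
    if (input == nullptr) return false;
    MNN::Tensor hostInput(input, MNN::Tensor::CAFFE);
    float* in = hostInput.host<float>();
    for (int i = 0; i < hostInput.elementSize(); ++i) in[i] = 0.0f;
    input->copyFromHostTensor(&hostInput);

    // 4. Run inference.
    if (net->runSession(session) != MNN::NO_ERROR) return false;

    // 5. Copy the first output back to host memory.
    MNN::Tensor* output = net->getSessionOutput(session, nullptr);
    if (output == nullptr) return false;
    MNN::Tensor hostOutput(output, MNN::Tensor::CAFFE);
    output->copyToHostTensor(&hostOutput);

    // 6. net goes out of scope here and releases the Interpreter.
    return true;
#else
    return false;  // MNN headers not available in this build
#endif
}
```

A model with dynamic input shapes may need extra setup before step 3; that is outside the scope of this sketch.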

1.2. Interpreter

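MNN inference uses two levels of abstraction: the Interpreter and the Session. The Interpreter owns the model data; a Session is created from an Interpreter and owns the inference data. Several inferences can share one model, that is, multiple Sessions can share a single Interpreter.

Once all Sessions have been created, and you no longer need to create Sessions or update the model data, the Interpreter can free the model data with releaseModel to save memory.

1.2.1. Creating an Interpreter

Creating from a file on disk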

/**
 * @brief create net from file.
 * @param file  given file.
 * @return created net if success, NULL otherwise.
 */
static Interpreter* createFromFile(const char* file);

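The returned Interpreter instance is allocated with new; be sure to release it with delete once it is no longer needed, otherwise it leaks memory.

1.3. Session

A Session is normally created with Interpreter::createSession: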

/**
 * @brief create session with schedule config. created session will be managed in net.
 * @param config session schedule config.
 * @return created session if success, NULL otherwise.
 */
Session* createSession(const ScheduleConfig& config);

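The returned Session is managed by the Interpreter and is released when the Interpreter is destroyed, so you usually do not need to manage it yourself. To reduce memory usage, you can also release it earlier by calling Interpreter::releaseSession.

Creating a Session usually takes a long time, but a Session can be reused across many inference runs, so create it once and reuse it.

1.4. ScheduleConfig

Simple mode: no extra scheduling configuration is needed; the function works out the scheduling path and the inputs and outputs from the model structure. For example: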

ScheduleConfig conf;
Session* session = interpreter->createSession(conf); // create the Session

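In this mode inference runs on the CPU.

Advanced mode: you set the scheduling configuration yourself through the fields of ScheduleConfig, which is defined as follows: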

/** session schedule config */
struct ScheduleConfig {
    /** which tensor should be kept */
    std::vector<std::string> saveTensors;
    /** forward type */
    MNNForwardType type = MNN_FORWARD_CPU;
    /** CPU:number of threads in parallel , Or GPU: mode setting*/
    union {
        int numThread = 4;
        int mode;
    };

    /** subpath to run */
    struct Path {
        std::vector<std::string> inputs;
        std::vector<std::string> outputs;

        enum Mode {
            /**
             * Op Mode
             * - inputs means the source op, can NOT be empty.
             * - outputs means the sink op, can be empty.
             * The path will start from source op, then flow when encounter the sink op.
             * The sink op will not be compute in this path.
             */
            Op = 0,

            /**
             * Tensor Mode
             * - inputs means the inputs tensors, can NOT be empty.
             * - outputs means the outputs tensors, can NOT be empty.
             * It will find the pipeline that compute outputs from inputs.
             */
            Tensor = 1
        };

        /** running mode */
        Mode mode = Op;
    };
    Path path;

    /** backup backend used to create execution when designated backend do NOT support any op */
    MNNForwardType backupType = MNN_FORWARD_CPU;

    /** extra backend config */
    BackendConfig* backendConfig = nullptr;
};

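During inference, the primary backend is specified by type and defaults to CPU. If the model contains operators that the primary backend does not support, those operators run on the backup backend specified by backupType.

The inference path consists of all operators on the way from the inputs to the outputs of path; when path is not specified, it is derived automatically from the model structure. To save memory, MNN reuses the memory of tensors other than the outputs. To keep the result of an intermediate tensor, list it in saveTensors so that its memory is not reused.

For CPU inference, numThread adjusts the concurrency and the thread count. numThread determines the degree of concurrency, but the actual thread count and the parallel efficiency do not depend on numThread alone: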

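On iOS, the thread count is decided by the system's GCD;
With MNN_USE_THREAD_POOL enabled, the thread count is determined by the first configured numThread that is greater than 1;
With OpenMP, the thread count is a global setting, and the actual count is determined by the most recently configured numThread.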
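
To make the "first value greater than 1" and "last value" rules concrete, here is a toy model of them. It is not MNN code and does not reflect MNN internals: ThreadingMode and effectiveThreads are hypothetical names, and the starting value of 1 before any configuration is my own assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical names; this models the two rules as stated in the notes only.
enum class ThreadingMode { ThreadPool, OpenMP };

// Returns the thread count in effect after sessions were configured, in
// order, with the numThread values in `requested`.
int effectiveThreads(ThreadingMode mode, const std::vector<int>& requested) {
    int result = 1;  // assumed: single-threaded until configured otherwise
    for (int n : requested) {
        assert(n >= 1);
        if (mode == ThreadingMode::ThreadPool) {
            // Thread pool: the first numThread greater than 1 fixes the size.
            if (result == 1 && n > 1) result = n;
        } else {
            // OpenMP: global setting, so the latest configuration wins.
            result = n;
        }
    }
    return result;
}
```

For the sequence 1, 4, 2 the model gives 4 under the thread pool rule and 2 under the OpenMP rule.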

For GPU inference, mode selects some of the GPU runtime options (currently OpenCL only). The GPU mode flags are:

typedef enum {
    // choose one tuning mode Only
    MNN_GPU_TUNING_NONE    = 1 << 0,/* Forbidden tuning, performance not good */
    MNN_GPU_TUNING_HEAVY  = 1 << 1,/* heavily tuning, usually not suggested */
    MNN_GPU_TUNING_WIDE   = 1 << 2,/* widely tuning, performance good. Default */
    MNN_GPU_TUNING_NORMAL = 1 << 3,/* normal tuning, performance may be ok */
    MNN_GPU_TUNING_FAST   = 1 << 4,/* fast tuning, performance may not good */
    
    // choose one opencl memory mode Only
    /* User can try OpenCL_MEMORY_BUFFER and OpenCL_MEMORY_IMAGE both, then choose the better one according to performance*/
    MNN_GPU_MEMORY_BUFFER = 1 << 6,/* User assign mode */
    MNN_GPU_MEMORY_IMAGE  = 1 << 7,/* User assign mode */
} MNNGpuMode;

At present the tuning level and the GPU memory type can both be set freely by the user. For example:

MNN::ScheduleConfig config;
config.mode = MNN_GPU_TUNING_NORMAL | MNN_GPU_MEMORY_IMAGE;

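The higher the tuning level, the longer the first initialization takes and the better the inference performance. If long initialization is a concern, choose MNN_GPU_TUNING_FAST or MNN_GPU_TUNING_NONE; you can also combine this with the cache mechanism described below so that only the first initialization is slow. For GPU memory, you can specify MNN_GPU_MEMORY_BUFFER or MNN_GPU_MEMORY_IMAGE, whichever performs better. If neither is set, the framework picks one by its own default heuristic (not guaranteed to be the fastest).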
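
Because each flag occupies its own bit, a mode value is just the bitwise OR of one tuning flag and one memory flag. The sketch below copies the flag values from the MNNGpuMode listing into a demo namespace so that it builds without MNN; the masks and the helpers tuningPart, memoryPart, and isWellFormed are hypothetical names of mine, not part of MNN.

```cpp
#include <cassert>

namespace demo {
// Local copy of the values from the MNNGpuMode listing.
enum GpuMode {
    MNN_GPU_TUNING_NONE   = 1 << 0,
    MNN_GPU_TUNING_HEAVY  = 1 << 1,
    MNN_GPU_TUNING_WIDE   = 1 << 2,
    MNN_GPU_TUNING_NORMAL = 1 << 3,
    MNN_GPU_TUNING_FAST   = 1 << 4,
    MNN_GPU_MEMORY_BUFFER = 1 << 6,
    MNN_GPU_MEMORY_IMAGE  = 1 << 7,
};

constexpr int kTuningMask = MNN_GPU_TUNING_NONE | MNN_GPU_TUNING_HEAVY |
                            MNN_GPU_TUNING_WIDE | MNN_GPU_TUNING_NORMAL |
                            MNN_GPU_TUNING_FAST;
constexpr int kMemoryMask = MNN_GPU_MEMORY_BUFFER | MNN_GPU_MEMORY_IMAGE;

// Extract the tuning bits or the memory bits from a combined mode.
constexpr int tuningPart(int mode) { return mode & kTuningMask; }
constexpr int memoryPart(int mode) { return mode & kMemoryMask; }

// True when at most one bit of v is set (zero means "left unset").
constexpr bool atMostOneBit(int v) { return (v & (v - 1)) == 0; }

// A mode is well formed when it picks at most one tuning flag, at most one
// memory flag, and sets no bits outside the two groups.
constexpr bool isWellFormed(int mode) {
    return atMostOneBit(tuningPart(mode)) && atMostOneBit(memoryPart(mode)) &&
           (mode & ~(kTuningMask | kMemoryMask)) == 0;
}
}  // namespace demo
```

For example, MNN_GPU_TUNING_NORMAL | MNN_GPU_MEMORY_IMAGE is 8 | 128 = 136, while combining two tuning flags is rejected by isWellFormed.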

The CPU numThread and the GPU mode are members of a union and share the same memory. Set only one of numThread and mode; do not set both.
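
The reason for setting only one of the two can be seen with a stand-in struct. This is a sketch: demo::Config mirrors just the union from the ScheduleConfig listing, and sharesStorage and storedValue are hypothetical helpers. It shows that both members live at the same address, so writing one replaces whatever was stored through the other.

```cpp
#include <cassert>
#include <cstring>

namespace demo {
// Stand-in that mirrors only the union part of ScheduleConfig.
struct Config {
    union {
        int numThread = 4;
        int mode;
    };
};

// True when both members start at the same address.
inline bool sharesStorage(const Config& c) {
    return static_cast<const void*>(&c.numThread) ==
           static_cast<const void*>(&c.mode);
}

// Reads the shared storage as raw bytes, whichever member was written last.
inline int storedValue(const Config& c) {
    int value = 0;
    std::memcpy(&value, &c, sizeof value);
    return value;
}
}  // namespace demo
```

After `c.mode = 136;` the stored value is 136 and the default thread count of 4 is gone, which is why a CPU configuration should only touch numThread and a GPU configuration only mode.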

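To address slow GPU initialization, MNN provides a cache mechanism: later runs can load the cache directly to speed up initialization.

For usage, see how setCacheFile is used to set up the cache in tools/cpp/MNNV2Basic.cpp.
When the model's input size takes a limited number of different values, call updateCacheFile after each resizeSession to update the cache file.
When the input size varies without bound, set config.mode to 1 (MNN_GPU_TUNING_NONE) to turn GPU tuning off.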

1.5. Input data

1.5.1. Getting the input tensor

/**
 * @brief get input tensor for given name.
 * @param session   given session.
 * @param name      given name. if NULL, return first input.
 * @return tensor if found, NULL otherwise.
 */
Tensor* getSessionInput(const Session* session, const char* name);

/**
 * @brief get all input tensors.
 * @param session   given session.
 * @return all input tensors mapped with name.
 */
const std::map<std::string, Tensor*>& getSessionInputAll(const Session* session) const;

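Interpreter provides two methods for getting input tensors: getSessionInput returns a single input tensor, and getSessionInputAll returns a map from name to input tensor.

When the model has only one input tensor, pass NULL to getSessionInput to get it.

1.5.2. [Recommended] Filling data by mapping

Map the input tensor's memory; on some backends this avoids a data copy.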

auto input = interpreter->getSessionInput(session, NULL);
void* host = input->map(MNN::Tensor::MAP_TENSOR_WRITE, input->getDimensionType());
// fill host memory data
input->unmap(MNN::Tensor::MAP_TENSOR_WRITE,  input->getDimensionType(), host);

1.5.3. [Not recommended] Filling data by copying

NCHW example, suitable for models converted from ONNX / Caffe / TorchScript:

auto inputTensor = interpreter->getSessionInput(session, NULL);
auto nchwTensor = new Tensor(inputTensor, Tensor::CAFFE);
// nchwTensor->host<float>()[x] = ...
inputTensor->copyFromHostTensor(nchwTensor);
delete nchwTensor;

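With this copy-based approach, you only need to care about the data layout of the tensor you created yourself; copyFromHostTensor handles the layout conversion and the copy between backends where needed.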
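
Caring about the layout of your own tensor mostly means computing the right index. As an illustration, assuming a single-batch input and an interleaved 8-bit image (height, width, channel order), the sketch below rearranges the pixels into the planar channel-first order that an NCHW host tensor expects. nchwOffset and hwcToChw are hypothetical helper names, and the sketch applies no normalization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offset of element (c, h, w) in a single-batch NCHW buffer of size C*H*W.
std::size_t nchwOffset(std::size_t c, std::size_t h, std::size_t w,
                       std::size_t H, std::size_t W) {
    return (c * H + h) * W + w;
}

// Converts an interleaved HWC 8-bit image into planar CHW floats.
std::vector<float> hwcToChw(const std::vector<std::uint8_t>& hwc,
                            std::size_t H, std::size_t W, std::size_t C) {
    assert(hwc.size() == H * W * C);
    std::vector<float> chw(hwc.size());
    for (std::size_t h = 0; h < H; ++h) {
        for (std::size_t w = 0; w < W; ++w) {
            for (std::size_t c = 0; c < C; ++c) {
                // Source index is interleaved; destination is planar.
                chw[nchwOffset(c, h, w, H, W)] =
                    static_cast<float>(hwc[(h * W + w) * C + c]);
            }
        }
    }
    return chw;
}
```

The resulting buffer could then be copied element by element into nchwTensor->host<float>() before calling copyFromHostTensor, provided its size matches the tensor's element count.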

1.6. Running the session

In MNN, Interpreter provides three interfaces for running a Session, but the simple run is enough for the vast majority of cases.

1.6.1. Simple run

/**
 * @brief run session.
 * @param session   given session.
 * @return result of running.
 */
ErrorCode runSession(Session* session) const;

1.7. Getting the output tensor

/**
 * @brief get output tensor for given name.
 * @param session   given session.
 * @param name      given name. if NULL, return first output.
 * @return tensor if found, NULL otherwise.
 */
Tensor* getSessionOutput(const Session* session, const char* name);

/**
 * @brief get all output tensors.
 * @param session   given session.
 * @return all output tensors mapped with name.
 */
const std::map<std::string, Tensor*>& getSessionOutputAll(const Session* session) const;

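Interpreter provides two methods for getting output tensors: getSessionOutput returns a single output tensor, and getSessionOutputAll returns a map from name to output tensor.

When the model has only one output tensor, pass NULL to getSessionOutput to get it.

1.7.1. [Recommended] Reading output data by mapping

Map the output tensor's memory; on some backends this avoids a data copy.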

auto outputTensor = interpreter->getSessionOutput(session, NULL);
void* host = outputTensor->map(MNN::Tensor::MAP_TENSOR_READ,  outputTensor->getDimensionType());
// use host memory by yourself
outputTensor->unmap(MNN::Tensor::MAP_TENSOR_READ,  outputTensor->getDimensionType(), host);

1.7.2. [Not recommended] Reading output data by copying

NCHW example, suitable for models converted from ONNX / Caffe / TorchScript:

auto outputTensor = interpreter->getSessionOutput(session, NULL);
auto nchwTensor = new Tensor(outputTensor, Tensor::CAFFE);
outputTensor->copyToHostTensor(nchwTensor);
auto score = nchwTensor->host<float>()[0];
auto index = nchwTensor->host<float>()[1];
// ...
delete nchwTensor;

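With this copy-based approach, you only need to care about the data layout of the tensor you created yourself; copyToHostTensor handles the layout conversion and the copy between backends where needed.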

enum

MNNForwardType

The default is MNN_FORWARD_CPU = 0, which means inference runs on the CPU backend.

typedef enum {
    MNN_FORWARD_CPU = 0,

    /*
     Firstly find the first available backends not equal to CPU
     If no other backends, use cpu
     */
    MNN_FORWARD_AUTO = 4,

    /*Hand write metal*/
    MNN_FORWARD_METAL = 1,

    /*NVIDIA GPU API*/
    MNN_FORWARD_CUDA = 2,

    /*Android / Common Device GPU API*/
    MNN_FORWARD_OPENCL = 3,
    MNN_FORWARD_OPENGL = 6,
    MNN_FORWARD_VULKAN = 7,

    /*Android 8.1's NNAPI or CoreML for ios*/
    MNN_FORWARD_NN = 5,

    /*User can use API from Backend.hpp to add or search Backend*/
    MNN_FORWARD_USER_0 = 8,
    MNN_FORWARD_USER_1 = 9,
    MNN_FORWARD_USER_2 = 10,
    MNN_FORWARD_USER_3 = 11,

    MNN_FORWARD_ALL = 12,

    /* Apply arm extension instruction set to accelerate some Ops, this forward type
       is only used in MNN internal, and will be active automatically when user set forward type
       to be MNN_FORWARD_CPU and extension instruction set is valid on hardware.
    */
    MNN_FORWARD_CPU_EXTENSION = 13,
    // use for shared memory on android device
    
    MNN_MEMORY_AHARDWAREBUFFER = 14
} MNNForwardType;

1.8. References

Session API使用 (Using the Session API), MNN-Doc 2.1.1 documentation
