Overview
MindIE does not currently support deploying embedding or rerank models. To deploy these two model types on Ascend hardware, the upstream TEI (Text Embeddings Inference) framework has to be patched for NPU support and compiled; this article summarizes the errors I hit while compiling TEI and how I resolved them. Afterwards, the embedding and rerank models must be converted to TorchScript (pt) models before they can be loaded and served. For the conversion procedure, see the reference at the end of this article:
I performed the build inside a MindIE 1.0 Docker container; the hardware is an Ascend 310P.
Errors and solutions
1. linker cc not found
Error message:
error: linker `cc` not found
|
= note: No such file or directory (os error 2)
error: could not compile `serde_json` (build script) due to 1 previous error
warning: build failed, waiting for other jobs to finish...
Solution:
yum -y install gcc gcc-c++
2. yum fails
Error message:
File "/usr/lib64/python3.11/site-packages/libdnf/error.py", line 10, in <module>
from . import _error
ImportError: /usr/lib64/libldap.so.2: undefined symbol: EVP_md2, version OPENSSL_3.0.0
Cause: this is triggered by running source /usr/local/Ascend/mindie/set_env.sh.
Solution: re-enter the container and simply do not source that script.
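If restarting the container is inconvenient, the conflict can usually be confirmed and worked around in place. This is a diagnostic sketch, based on the assumption that set_env.sh prepends Ascend-bundled library paths to LD_LIBRARY_PATH, so libldap ends up resolving symbols against the wrong OpenSSL build:

```shell
# Check which libcrypto libldap actually resolves against
# (assumption: an Ascend-bundled OpenSSL is shadowing the system one)
ldd /usr/lib64/libldap.so.2 | grep -i crypto

# Inspect what set_env.sh added to the search path
echo "$LD_LIBRARY_PATH" | tr ':' '\n'

# Run yum once without the variable, without restarting the container
env -u LD_LIBRARY_PATH yum makecache
```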
3. OpenSSL library missing
Error message:
warning: openssl-sys@0.9.107: Could not find directory of OpenSSL installation, and this `-sys` crate cannot proceed without this knowledge. If OpenSSL is installed and this crate had trouble finding it, you can set the `OPENSSL_DIR` environment variable for the compilation process. See stderr section below for further information.
error: failed to run custom build command for `openssl-sys v0.9.107`
Solution:
dnf install openssl-devel pkg-config
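If OpenSSL ends up in a non-standard prefix, the error message's own suggestion of setting OPENSSL_DIR also works. A sketch; the /usr prefix is an assumption, adjust it to wherever your OpenSSL installation actually lives:

```shell
# Install the development headers and pkg-config (as above)
dnf install -y openssl-devel pkg-config

# Alternatively, point openssl-sys at the installation explicitly
# (prefix below is an example, not prescriptive)
export OPENSSL_DIR=/usr
```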
4. protoc not found
Error message:
error: failed to run custom build command for `backend-grpc-client v1.2.3
Could not find protoc installation and this build crate cannot proceed without
this knowledge. If protoc is installed and this crate had trouble finding
it, you can set the PROTOC environment variable with the specific path to your
installed protoc binary. If you're on debian, try apt-get install protobuf-compiler or download it from https://github.com/protocolbuffers/protobuf/releases
Solution:
dnf install protobuf-compiler
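As the error text itself notes, the PROTOC environment variable can instead point the build crate at a manually downloaded binary. A sketch; the install path below is hypothetical:

```shell
# Confirm the compiler is now on PATH
protoc --version

# Or, if protoc was downloaded from the protobuf releases page instead,
# point the build crate at it directly (path is an example)
export PROTOC=/opt/protoc/bin/protoc
```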
5. Header file not found
Error when converting the embedding model:
/usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux/tikcpp/tikcfw/impl/kernel_macros.h:24:10: fatal error: 'cstdint' file not found
#include <cstdint>
^~~~~~~~~
1 error generated.
.........[2025-04-09 22:18:38.514 +08:00] [59501] [281473373997632] [rt] [ERROR] [BuilderImpl.cpp:574] : Build model Failed!
[2025-04-09 22:18:38.892 +08:00] [59501] [281473373997632] [rt] [ERROR] [OmParser.cpp:101] : Model data is null!
Error when converting the rerank model:
/usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux/tikcpp/tikcfw/impl/kernel_macros.h:24:10: fatal error: 'cstdint' file not found
#include <cstdint>
Solution:
On Linux, CPLUS_INCLUDE_PATH is an environment variable honored by C++ compilers (g++, clang++) that specifies additional header search paths; C_INCLUDE_PATH is the equivalent for the C compiler only.
export CPLUS_INCLUDE_PATH=/usr/include/c++/12:$CPLUS_INCLUDE_PATH
Verify: echo | gcc -xc++ -E -v -
The "#include <...> search starts here:" section of the output should now include /usr/include/c++/12.
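Putting the fix and the check together (the c++/12 directory corresponds to gcc 12 on this openEuler image; match the version to what gcc -dumpversion reports on yours):

```shell
# Locate the libstdc++ header directory for the installed gcc version
ls -d /usr/include/c++/"$(gcc -dumpversion)"

# Add it to the C++ header search path
export CPLUS_INCLUDE_PATH=/usr/include/c++/12:$CPLUS_INCLUDE_PATH

# Verify: the path should appear under "#include <...> search starts here:"
echo | gcc -xc++ -E -v - 2>&1 | grep -A8 'search starts here'
```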
6. Another missing header
/usr/include/c++/12/cstdint:38:10: fatal error: 'bits/c++config.h' file not found
#include <bits/c++config.h>
^~~~~~~~~~~~~~~~~~
1 error generated.
The fix is the same idea as the previous issue: add the header directory to the environment variable:
export CPLUS_INCLUDE_PATH=/usr/include/c++/12:/usr/include/c++/12/aarch64-openEuler-linux/
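If you are unsure of the target-specific directory name (aarch64-openEuler-linux here), you can locate bits/c++config.h directly before exporting. A sketch:

```shell
# Find where the target-specific C++ config header lives
find /usr/include/c++ -name 'c++config.h'

# Add both the generic and the target-specific directories
export CPLUS_INCLUDE_PATH=/usr/include/c++/12:/usr/include/c++/12/aarch64-openEuler-linux/
```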
7. TEI fails to start
2025-04-10T03:12:21.021449Z WARN python-backend: text_embeddings_backend_python::logging: backends/python/src/logging.rs:39: Could not import Flash Attention enabled models: No module named 'dropout_layer_norm'
2025-04-10T03:12:22.835879Z ERROR python-backend: text_embeddings_backend_python::logging: backends/python/src/logging.rs:40: Error when initializing model
File "/usr/local/lib64/python3.11/site-packages/torch_npu/npu/utils.py", line 58, in set_device
torch_npu._C._npu_setDevice(device_id)
RuntimeError: Initialize:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:222 NPU function error: c10_npu::SetDevice(device_id_), error code is 107001
[ERROR] 2025-04-10-11:12:22 (PID:72640, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[Error]: Invalid device ID.
Check whether the device ID is valid.
EE1001: [PID: 72640] 2025-04-10-11:12:22.170.736 The argument is invalid.Reason: Set visible device failed, invalid device=7, input visible devices:7
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
rtSetDevice execute failed, reason=[device id error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
open device 7 failed, runtime result = 107001.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
Cause: I had set ASCEND_RT_VISIBLE_DEVICES=7, so from inside the container npu 7 is the first (and only) visible card. My earlier setting of TEI_NPU_DEVICE=7 was therefore wrong.
Solution:
export TEI_NPU_DEVICE=0
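The relationship between the two variables is easiest to see in one place. The key point is that ASCEND_RT_VISIBLE_DEVICES renumbers the visible cards, so TEI must use the post-renumbering index:

```shell
# Expose only physical card 7 to this container/process;
# inside, it is renumbered and addressed as device 0
export ASCEND_RT_VISIBLE_DEVICES=7

# TEI therefore selects device 0, not 7
export TEI_NPU_DEVICE=0
```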
8. Embedding service error
2025-04-14T14:41:22.160588Z ERROR embed:embed_pooled{inputs=String("value\":{") truncate=false normalize=true}: text_embeddings_core::infer: core/src/infer.rs:331: Server error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/transformers/models/bert/modeling_bert.py", line 10, in forward
input_2: Tensor) -> Tuple[Tensor, Tensor]:
__torch___transformers_models_bert_modeling_bert_BertModel_aie_engine_0 = getattr(input_0, "__torch__.transformers.models.bert.modeling_bert.BertModel_aie_engine_0")
_0 = ops.mrt.execute_engine([input_2, input_1], __torch___transformers_models_bert_modeling_bert_BertModel_aie_engine_0)
~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_1, _2, = _0
return (_1, _2)
Traceback of TorchScript, original code (most recent call last):
RuntimeError: [ERROR thrown at ascend-inference-ptplugin/core/runtime/execute_engine.cpp:311] [runtime] SetInputShape failed, name: input_0, index: 0, dims: [216 76 ]
Cause: the forwarding layer is asynchronous, so multiple client calls may be merged into a single larger batch, while the converted model supports a batch size of at most 128 by default.
Solution: limit the batch size via the TEI launch arguments:
--max-concurrent-requests 256 --max-batch-requests 128 --max-batch-tokens 1100000 --max-client-batch-size 128
Parameter descriptions:
--max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse clients requests instead of having them wait for too long and is usually good to handle backpressure correctly
--max-batch-requests <MAX_BATCH_REQUESTS>
Optionally control the maximum number of individual requests in a batch
--max-client-batch-size <MAX_CLIENT_BATCH_SIZE>
Control the maximum number of inputs that a client can send in a single request
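For context, a full launch with these limits might look as follows. text-embeddings-router is TEI's standard router binary; the model path and port are placeholders for illustration, not values from this deployment:

```shell
# Example TEI launch with batching limits (model path/port are placeholders)
text-embeddings-router \
  --model-id /path/to/converted-embedding-model \
  --port 8080 \
  --max-concurrent-requests 256 \
  --max-batch-requests 128 \
  --max-batch-tokens 1100000 \
  --max-client-batch-size 128
```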
References
Converting embedding and rerank models on Ascend: select MindIE Torch as the model backend