因为 cutlass 2.0.0 版本相对简单一些,这里先分析 这个版本。
0. 系统环境
x86_64
rtx-2080ti
ubuntu 22.04
cutlass 2.0.0
cuda 12.9
1. 下载cutlass
https://github.com/NVIDIA/cutlass/tree/v2.0.0
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
git checkout v2.0.0
2.编译运行 cutlass_profiler
cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all
会有一个较长的时间自动下载 googletest
编译 cutlass_profiler:
make cutlass_profiler -j
运行:
./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096
3. cutlass 文件夹布局
CUTLASS Templates are implemented by header files in the following directory structure:
include/ # Top-level include directory. Client applications should target this path.
cutlass/ # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
arch/ # direct exposure of architecture features (including instruction-level GEMMs)
*
gemm/ # code specialized for general matrix product computations
thread/ # thread-level operators
warp/ # warp-level operators
threadblock/ # CTA-level operators
kernel/ # CUDA kernel entry points
device/ # launches kernel(s) over a full device
* # scope-agnostic components and basic vocabular type definitions for GEMM
layout/ # layout definitions for matrices, tensors, and other mathematical objects in memory
*
reduction/ # bandwidth-limited reduction kernels that do not fit the "gemm" models
thread/ # thread-level operators
warp/ # warp-level operators
threadblock/ # CTA-level operators
kernel/ # CUDA kernel entry points
device/ # launches kernel(s) over a full device
* # scope-agnostic components and basic vocabular type definitions
transform/ # code specialized for layout, type, and domain transformations
thread/ # thread-level operators
warp/ # warp-level operators
threadblock/ # CTA-level operators
kernel/ # CUDA kernel entry points
device/ # launches kernel(s) over a full device
* # scope-agnostic components and basic vocabulary type definitions
util/ # miscellaneous CUTLASS components
*
* # core vocabulary types and fundamental arithmetic operators
编程参考文档:
https://github.com/NVIDIA/cutlass/blob/v2.0.0/media/docs/programming_guidelines.md
4. 分析 cutlass_profiler
4.1. 编译 debug 版本 cutlass_profiler
首先,编译一个debug 版本的cutlass 的测试app程序。
cmake 配置时加一个 -DCMAKE_BUILD_TYPE=Debug,具体命令:
$ cd cutlass/
$ make build_gdb/
$ cd build_gdb/
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all -DCMAKE_BUILD_TYPE=Debug
$ make cutlass_profiler -j VERBOSE=1
如果遇到了这样的报错,可以给这个 dummy 变量赋个初始值:
cutlass/build_gdb/_deps/googletest-src/googletest/src/gtest-death-test.cc:1008:24: error: ‘dummy’ may be used uninitialized [-Werror=maybe-uninitialized]
1008 | StackLowerThanAddress(&dummy, &result);
4.2. debug cutlass_profiler
gdb 载入主程序,并设置命令行参数
gdb ./tools/profiler/cutlass_profiler
(gdb) set args --kernels=sgemm --m=4352 --n=4096 --k=4096
(gdb) start
(gdb) layout src
除了 return profiler();里边会运行到 cuda kernel, 前边的几行代码是在设置参数等。
会执行到这段代码:
/// Execute the program
int CutlassProfiler::operator()() {
... ....
if (options_.execution_mode == ExecutionMode::kProfile ||
options_.execution_mode == ExecutionMode::kDryRun ||
options_.execution_mode == ExecutionMode::kTrace) {
// Profiles all operations
profile_();
}
... ...
return 0;
}
主要内容在函数 profile_(); 中调用。
/// Profiles all operations
int CutlassProfiler::profile_() {
library::Manifest manifest;
Status status = manifest.initialize();
if (status != Status::kSuccess) {
return -1;
}
int result = 0;
DeviceContext device_context;
// For all profilers
for (auto & profiler : operation_profilers_) {
if (options_.operation_kind == library::OperationKind::kInvalid ||
options_.operation_kind == profiler->kind()) {
result = profiler->profile_all(options_, manifest, device_context);
if (result) {
return result;
}
}
}
return result;
}
主要是这个函数的调用:
result = profiler->profile_all(options_, manifest, device_context);
通过
nvprof ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096
可以发现 这个 cutlass_profiler 调用 了cublas Gemm,
通过 grep -rn cublas | grep emm 发现了 cublasGemmEx 存在于 cutlass 源代码中;
再结合 gdb 断点到 cublasGemmEx行,continue过去之后,bt,发现 调用栈 :
/// Executes GEMM using these arguments
cublasStatus_t operator()(cublasHandle_t handle) {
return cublasGemmEx(
handle,
trans_A,
trans_B,
configuration.problem_size.m(),
configuration.problem_size.n(),
configuration.problem_size.k(),
arguments.alpha,
arguments.A,
data_type_A,
int(configuration.lda),
arguments.B,
data_type_B,
int(configuration.ldb),
arguments.beta,
arguments.D,
data_type_C,
int(configuration.ldc),
compute_type,
algo
);
}
};
back trace 的结果:
588 return cublasGemmEx(
(gdb) bt
#0 cutlass::profiler::detail::cublasGemmExDispatcher::operator() (this=0x7fffffffce50, handle=0x555556d7b330) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/gemm_operation_profiler.cu:588
#1 0x00005555555e4b29 in cutlass::profiler::GemmOperationProfiler::verify_with_cublas_ (this=0x5555567acf80, options=..., report=..., device_context=..., operation=0x5555564d1cd0, problem_space=...,
problem=std::vector of length 25, capacity 32 = {...}) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/gemm_operation_profiler.cu:691
#2 0x00005555555e455b in cutlass::profiler::GemmOperationProfiler::verify_cutlass (this=0x5555567acf80, options=..., report=..., device_context=..., operation=0x5555564d1cd0, problem_space=...,
problem=std::vector of length 25, capacity 32 = {...}) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/gemm_operation_profiler.cu:440
#3 0x00005555555db955 in cutlass::profiler::OperationProfiler::profile_all (this=0x5555567acf80, options=..., manifest=..., device_context=...) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/operation_profiler.cu:320
#4 0x000055555555f6f6 in cutlass::profiler::CutlassProfiler::profile_ (this=0x7fffffffda90) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/cutlass_profiler.cu:131
#5 0x000055555555f577 in cutlass::profiler::CutlassProfiler::operator() (this=0x7fffffffda90) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/cutlass_profiler.cu:95
#6 0x000055555555d3b4 in main (argc=5, arg=0x7fffffffe158) at /home/hipper/ex_cutlass/tmp2_20250623/cutlass/tools/profiler/src/main.cpp:44