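Repository: https://gitee.com/mrxiao_com/2d_game_3
Recap: the cycle-counting code
We are optimizing this project's code for performance. The routine currently being optimized is down to 48 cycles. To see the effect of each change more clearly, we built a small tool that counts cycles and instructions.
The tool works by redefining the intrinsics as macros that tally an estimated cycle cost for each instruction. As the code passes through them, the per-instruction costs are accumulated into a total estimated cycle count.
Comparing the tool's numbers with the original code shows that the manual optimizations did improve efficiency. Along the way, though, some harder problems surfaced that need more careful thought, particularly around controlling the low-level instructions and adjusting the optimization strategy.
We are now in the fine-tuning stage: there are some preliminary results, but more analysis and adjustment are needed.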
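The throughput-based instruction counts are inaccurate: they do not properly account for the CPU's ability to overlap different operations
While optimizing, we ran into a problem with the counting method itself, specifically how to charge cycles to each instruction. Our earlier approach folded throughput directly into the count. For example, a throughput of one third means the processor can issue three of the same instruction per cycle, so this way of counting already includes the effect of parallel execution.
That approach turned out not to be precise enough. The throughput figure accounts for parallelism among identical instructions, but not every instruction can be issued several times per cycle. Additions (ADD) and multiplications (MUL), for example, are typically each handled by a single hardware unit, so their throughput is one instruction per cycle, whereas a logical AND can be issued several times per cycle.
The problem is that our counting did not capture the fact that an add and a multiply can overlap with each other, which muddles the throughput numbers. If an add and a multiply can execute in the same cycle, each should be charged half a cycle rather than a full one. Conversely, since several ANDs can execute in one cycle, the way they are counted needs to be revisited.
So the suggestion is either to charge adds and multiplies half a cycle each, or to drop the throughput adjustment for instructions like AND and count them the same simple way as adds and multiplies. The key decision is whether to use a consistent overestimate or a consistent underestimate; mixing the two distorts the result.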
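One way to keep the estimate consistent is to count work per execution port instead of per instruction. The sketch below assumes a made-up machine with three ports: multiplies can only use port 0, adds can only use port 1, and logic operations can use any port. The port layout and the names `InstructionMix` and `EstimateCycles` are illustrative assumptions, not data for a real CPU.

```cpp
// Hypothetical three-port machine (an assumption for illustration only):
//   port 0: multiplies and logic
//   port 1: adds and logic
//   port 2: logic only
struct InstructionMix
{
    int Muls;   // can only issue on port 0
    int Adds;   // can only issue on port 1
    int Logic;  // can issue on any of the three ports
};

constexpr double MaxOf(double A, double B) { return (A > B) ? A : B; }

// Lower bound on cycles per iteration, ignoring latency and dependencies.
// Two limits apply: a restricted instruction type cannot go faster than its
// single port, and all micro-ops together cannot go faster than 3 per cycle.
constexpr double EstimateCycles(InstructionMix Mix)
{
    return MaxOf(MaxOf((double)Mix.Muls, (double)Mix.Adds),
                 (Mix.Muls + Mix.Adds + Mix.Logic) / 3.0);
}
```

With four multiplies and four adds this gives 4 cycles rather than 8, which is the "half a cycle each" accounting described above, while a block dominated by logic operations is limited by the total number of ports instead.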
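The precise approach is to write a tool that simulates the CPU, as was done for the Xbox 360
A more accurate way to count cycles is to simulate the processor. You write a small program that models the CPU's execution: it takes the assembly code and, following the processor's architecture, its quirks and what the documentation says, works out what would actually happen when the code runs.
Such a program can identify exactly where stalls occur (for example, delays because certain instructions cannot issue in the same cycle). In the end it reports the cycle count for the ideal case, where nothing slows the code down other than the pipeline's own constraints. This kind of simulation gives a more realistic picture of how efficiently the code executes and guides further optimization.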
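To make the idea concrete, here is a deliberately tiny model of such a simulator. Each instruction names the single port it needs, at most one earlier instruction it depends on, and a result latency; an instruction starts as soon as its port is free and its operand is ready. Everything here (`ToyInstruction`, `SimulateCycles`, the four ports, the latencies) is a hypothetical sketch, far simpler than a real CPU model.

```cpp
// Toy model, for illustration only: at most 16 instructions and 4 ports.
struct ToyInstruction
{
    int Port;       // execution port this instruction must use (0..3)
    int DependsOn;  // index of an earlier instruction, or -1 for none
    int Latency;    // cycles until the result is available
};

// Returns the cycle at which the last result becomes available.
constexpr int SimulateCycles(const ToyInstruction *Code, int Count)
{
    int PortFree[4] = {0, 0, 0, 0}; // first cycle at which each port is free
    int Ready[16] = {};             // cycle at which each result is ready
    int LastReady = 0;
    for (int Index = 0; Index < Count; ++Index)
    {
        const ToyInstruction &I = Code[Index];
        int Start = PortFree[I.Port];
        if ((I.DependsOn >= 0) && (Ready[I.DependsOn] > Start))
        {
            Start = Ready[I.DependsOn]; // stall: wait for the operand
        }
        PortFree[I.Port] = Start + 1;   // a port accepts one op per cycle
        Ready[Index] = Start + I.Latency;
        if (Ready[Index] > LastReady) { LastReady = Ready[Index]; }
    }
    return LastReady;
}

// Two 4-cycle operations on different ports: independent vs. dependent.
constexpr ToyInstruction IndependentPair[] = {{0, -1, 4}, {1, -1, 4}};
constexpr ToyInstruction DependentPair[]   = {{0, -1, 4}, {1,  0, 4}};
```

The independent pair finishes after 4 cycles because the two operations overlap, while the dependent pair needs 8 because the second has to wait for the first. That difference is exactly what a per-instruction throughput count cannot see.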
Alternatively, use the Intel Architecture Code Analyzer (IACA) (no longer maintained)
https://www.intel.com/content/www/us/en/developer/archive/tools/architecture-code-analyzer.html
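One tool that came up during this work is Intel's Architecture Code Analyzer. It models how the processor executes a piece of code and shows what is likely to happen at run time, including stalls and parallel execution. Since it comes from Intel, which knows the internals of its processors better than anyone, it should in principle capture the processor's behavior accurately.
We had not used the tool before, but it was recommended for analyzing CPU behavior, particularly with respect to the processor architecture and instruction execution. So we decided to try it and see whether it provides useful information about how the program actually executes.
Installing and running it was not entirely smooth. The tool has to be downloaded and run, and on the first attempt to launch it Windows appeared to have a compatibility problem, so it did not work right away. Despite the technical hiccups we kept going to see whether it would yield useful data.
Overall the tool looks like a strong aid for analyzing processor behavior and optimizing code, but getting it to work takes some patience, especially when sorting out system compatibility problems.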
iaca
iacaMarks.h
intel-architecture-code-analyzer-3-0-users-guide-157552.pdf
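Overview of how to use IACA with your code
The IACA header (iacaMarks.h) does not normally need to be part of the project. It exists so that the tool can find the code to analyze, and it is included only temporarily. Regular builds do not depend on it; it is added to the code only when an analysis is needed.
In other words, you include the header while running the analyzer so that it can locate and analyze the code, and stop using it once the analysis is done, so the normal build is unaffected.
Marking the code section with IACA_START and IACA_END
Special markers identify the block of code to analyze. They let the tool find and analyze one specific part of the program. The markers do not change what the program computes; placing a start marker and an end marker around a section simply tells the tool to analyze that section.
For example, with a "start" marker before a region and an "end" marker after it, the tool analyzes only the code between the two. The analysis is thereby restricted to the part of the program you care about, and the rest is left alone.
After adding the markers, the build may need a small adjustment, such as adding the directory containing the tool's header to the include path. The tool can then reliably find and process the marked region.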
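A minimal sketch of how the markers might be wired into a loop is shown below. `USE_IACA`, `BEGIN_ANALYSIS`, `END_ANALYSIS` and `SumPixels` are hypothetical names introduced for this example. `IACA_VC64_START` and `IACA_VC64_END` are assumed to be the variants that iacaMarks.h provides for 64-bit MSVC, which has no inline assembly; check the header shipped with your IACA version. For a loop, the start marker goes at the top of the loop body and the end marker right after the loop.

```cpp
// Normal builds leave USE_IACA undefined, so the markers compile to nothing
// and iacaMarks.h is not needed at all.
#if defined(USE_IACA)
#include "iacaMarks.h" // requires the IACA directory on the include path
#if defined(_MSC_VER) && defined(_WIN64)
// Assumption: 64-bit MSVC needs the VC64 variants of the markers.
#define BEGIN_ANALYSIS IACA_VC64_START
#define END_ANALYSIS   IACA_VC64_END
#else
#define BEGIN_ANALYSIS IACA_START
#define END_ANALYSIS   IACA_END
#endif
#else
#define BEGIN_ANALYSIS
#define END_ANALYSIS
#endif

// Stand-in for the real inner loop of the routine being analyzed.
int SumPixels(const int *Pixels, int Count)
{
    int Sum = 0;
    for (int Index = 0; Index < Count; ++Index)
    {
        BEGIN_ANALYSIS // start marker: first thing inside the loop body
        Sum += Pixels[Index];
    }
    END_ANALYSIS       // end marker: immediately after the loop
    return Sum;
}
```

Keeping the markers behind one switch means the header is only needed while analyzing. A binary built with the markers enabled is meant to be fed to the analyzer rather than shipped.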
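64-bit compatibility, and watch the letter case when including the file
The build problem raised the question of whether to use the 32-bit or the 64-bit variant. It may be necessary to switch to the matching one, for example the VC64 variant, for the build to work.
The build took some effort, but with a few configuration changes it should succeed with the correct variant in use. Along the way some other tools and techniques caught our interest, such as vector operations, but they did not help with the immediate problem.
In short, platform compatibility, file paths and the variant in use all need careful attention for the build to go through.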
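Running IACA from the command line
In theory the tool can now be run, given the target architecture. With the architecture and other parameters specified, it should print information about how the code executes. Running it does produce an actual cycle count for the marked block.
The output includes the per-iteration port binding and cycle counts. These numbers help analyze the code's performance, although the meaning of some fields, such as "DV" and "D", is not fully clear at first. Even so, the tool provides a table that can be used for performance analysis.
Terms such as "data fetch pipe" are not fully understood yet either, but they name the resources involved in execution. This information gives deeper insight into how the code behaves on the hardware.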
..\..\..\..\iaca-win64\iaca.exe -arch SKL game.dll
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - game.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 85.00 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 64.5 9.0 | 64.5 | 33.5 26.9 | 33.5 27.1 | 14.0 | 23.0 | 6.0 | 1.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1* | | | | | | | | | mov r9, r12
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm13, xmmword ptr [r10]
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x30], xmm2
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm2, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm5, xmm4, xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x80], xmm2
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x60], xmm13
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x70], xmm5
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movsxd rdx, dword ptr [rsp+r9*1+0x40]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [rsp+r9*1+0x30]
| 1* | | | | | | | | | test edx, edx
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp edx, dword ptr [r14+0x10]
| 0*F | | | | | | | | | jle 0xa
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [0x0], r12d
| 1* | | | | | | | | | test eax, eax
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp eax, dword ptr [r14+0x14]
| 0*F | | | | | | | | | jle 0xa
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [0x0], r12d
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movsxd r8, dword ptr [r14+0x18]
| 1 | | 1.0 | | | | | | | imul eax, r8d
| 1 | | | | | | | 1.0 | | movsxd rcx, eax
| 1 | | | | | | 1.0 | | | lea rdx, ptr [rcx+rdx*4]
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | add rdx, qword ptr [r14+0x20]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [rdx]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0x110], eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rdx*1]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0x100], eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rdx*1+0x4]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xf0], eax
| 1 | | | | | | | 1.0 | | add r9, 0x4
| 1* | | | | | | | | | cmp r9, 0x10
| 0*F | | | | | | | | | jl 0xffffffffffffff94
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm8, xmmword ptr [rsp+0xf0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm10, xmmword ptr [rsp+0x100]
| 1 | 1.0 | | | | | | | | vsubps xmm0, xmm15, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm7, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm1, xmm15, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm9, xmm0, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm4, xmm1, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm5, xmm2, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm8, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm6, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm10, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm5, xmm6
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x30], xmm4
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm4, xmm3, xmm4
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm5, xmm4, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm4, xmmword ptr [rsp+0x110]
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm1, xmm4, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm1, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm3, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm7, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm1, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm5, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm2, xmm14
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x130], xmm0
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm8, 0x8
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x1e0], xmm9
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm9, xmmword ptr [rip+0x8d4a]
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm4, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm14, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm10, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm12, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm13, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm13, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm8, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm4, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm15, xmm3, xmm3
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm6, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm10, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm0, xmmword ptr [rsp+0x60]
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm0, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm8, xmm9
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x40], xmm7
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm7, xmm3, xmm3
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm5, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm1, xmm11
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm4, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm0, xmm10, xmmword ptr [rip+0x8c80]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm11, xmm2, xmm2
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm2, xmmword ptr [rip+0x8c84]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm9, xmm1, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm0, xmmword ptr [rsp+0x60]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm8, xmm1, xmm2
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm0, xmm0, xmmword ptr [rip+0x8c5a]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm4, xmmword ptr [rsp+0x70]
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm10, xmm1, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm1, xmm4, xmmword ptr [rsp+0x80]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm7, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm3, xmm3
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm2, xmm0, xmmword ptr [rsp+0x30]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm2, xmm1
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm1, xmm7, xmmword ptr [rsp+0x40]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm6, xmm6
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0x1e0]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0x120]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm2, xmmword ptr [rip+0x8c45]
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vsubps xmm6, xmm2, xmmword ptr [rsp+0x130]
| 1* | | | | | | | | | vxorps xmm1, xmm1, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm1, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm1, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm5, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm0, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm0, xmm3, xmmword ptr [rip+0x8c38]
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm2, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x70]
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm5, xmm2, 0x10
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm12, xmm12
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm12, xmm0, xmmword ptr [rsp+0x80]
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x30]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm15, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm3, xmm0
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm0, xmm15, xmmword ptr [rsp+0x40]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm15, xmmword ptr [rip+0x8bde]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm14, xmm14
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm2, xmm1, xmmword ptr [rsp+0x1e0]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm2, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x140]
| 1* | | | | | | | | | vxorps xmm0, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm0, xmm3, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm0, xmm15
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm13, xmm13
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm13, xmmword ptr [rip+0x8bc5]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm1, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm1, xmm3, xmm13
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm0, xmm2, 0x8
| 1 | | | | | | 1.0 | | | vpor xmm5, xmm5, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm11, xmm12
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm12, xmmword ptr [rip+0x8b3a]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm8, xmm8
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm2, xmm1, xmmword ptr [rsp+0x30]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm8, xmmword ptr [rsp+0x60]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm2, xmm0
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm0, xmm11, xmmword ptr [rsp+0x40]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm11, xmmword ptr [rip+0x8b27]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm9, xmm9
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x1e0]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm3, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x150]
| 1* | | | | | | | | | vxorps xmm0, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm0, xmm3, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm0, xmm15
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm10, xmm10
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm1, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm1, xmm3, xmm13
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm8, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm2, xmm0, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm4, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm1, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm3, xmm6
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vaddps xmm2, xmm0, xmmword ptr [rsp+0x130]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm2, xmm13
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm9, xmmword ptr [rsp+0xb0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm6, xmmword ptr [rsp+0xc0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm10, xmmword ptr [rsp+0xd0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm14, xmmword ptr [rsp+0xe0]
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm0, xmm3, 0x18
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm4, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm0, xmmword ptr [rsp+0x160]
| 1 | | | | | | 1.0 | | | vpor xmm1, xmm5, xmm2
| 1 | | | | | | 1.0 | | | vpand xmm3, xmm1, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm1, xmmword ptr [rsp+0x90]
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vaddps xmm1, xmm1, xmmword ptr [rip+0x8aa8]
| 1 | | | | | | 1.0 | | | vpandn xmm0, xmm0, xmm8
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm8, xmmword ptr [rsp+0xa0]
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm3, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm3, xmmword ptr [rsp+0x170]
| 2^ | | | | | 1.0 | | | 1.0 | vmovdqu xmmword ptr [r10], xmm2
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm2, xmmword ptr [rsp+0x180]
| 1 | | | | | | | 1.0 | | add r10, 0x10
Total Num Of Uops: 247
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.
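This is the throughput analysis report produced by iaca. It estimates where the performance bottleneck lies by looking at how many cycles one pass through the block (here, one loop iteration) takes and at how the instructions are bound to the CPU's execution ports.
The report can be read as follows:
1. Block Throughput:
- 85.00 Cycles: the number of cycles one pass through the analyzed block is expected to take. Smaller is faster.
2. Throughput Bottleneck:
- Backend: the bottleneck is in the back end of the CPU, i.e., the part that allocates and executes micro-ops (the execution ports and their supporting resources), rather than the front end that fetches and decodes instructions.
3. Loop Count:
- 22: IACA analyzes the code statically and does not run it, so this is not a run-time iteration count. It is most likely the number of loop iterations the analyzer considered.
4. Port Binding In Cycles Per Iteration:
This shows, per iteration, how many cycles of work are bound to each execution port. Each port leads to a different set of execution units, and several ports can work on different instructions in the same cycle.
- The Cycles row gives the cycles of work bound to each port. Different ports serve different kinds of work, such as loads or arithmetic.
Port binding table: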
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 64.5 9.0 | 64.5 | 33.5 26.9 | 33.5 27.1 | 14.0 | 23.0 | 6.0 | 1.0 |
--------------------------------------------------------------------------------------------------
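Port 0 - DV: port 0, which also hosts the divider pipe. The first number, 64.5, is the cycles of work bound to port 0 per iteration; the second, 9.0, is the cycles the divider pipe is busy.
Port 1: 64.5 cycles of work per iteration.
Port 2 - D and Port 3 - D: the load ports. For port 2, 33.5 is the cycles of work bound to the port and 26.9 is the cycles its data fetch pipe is busy; port 3 shows 33.5 and 27.1.
Ports 4 to 7: the remaining ports, with 14.0, 23.0, 6.0 and 1.0 cycles of work respectively.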
DV - Divider pipe (on port 0): the divider unit reached through port 0. In this listing it is used by vsqrtps.
D - Data fetch pipe (on ports 2 and 3): the load pipeline on ports 2 and 3, used by instructions that read memory.
F - Macro Fusion with the previous instruction occurred: this instruction was fused with the previous one into a single micro-op, as with the test/js and cmp/jl pairs in the listing.
* - instruction micro-ops not bound to a port: the instruction's micro-ops were not assigned to any execution port.
^ - Micro Fusion occurred: two micro-ops belonging to the same instruction (for example a load and the operation that consumes it) are tracked as one fused micro-op.
# - ESP Tracking sync uop was issued: a micro-op was issued to synchronize the stack pointer (ESP), usually in connection with function calls or stack operations.
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected: an SSE instruction comes right after an AVX256 or AVX512 instruction, and the switch between the instruction sets is expected to cost dozens of cycles.
X - instruction not supported, was not accounted in Analysis: the analyzer does not support the instruction, so it is left out of the results.
These symbols annotate how each instruction affects the CPU's execution resources. In a performance analysis they flag fusion, possible bottlenecks and potential penalties.
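Reading the IACA results
This part looks at how instructions execute inside the processor, using the tool's output to see how each instruction is broken down and assigned to execution units.
Micro-ops: an instruction may be split into several micro-ops. The processor does not necessarily execute the instruction we wrote as a single unit; it may break it into smaller operations and execute those separately. An instruction that looks simple can therefore involve more work, and take longer, than expected.
Ports: instructions are dispatched to execution ports. Instructions that need the same port compete for it, which limits how much they can overlap. Instructions bound to different ports (say port 5 and port 0) can in principle execute in the same cycle without interfering.
Cycles (CP): "CP" was initially guessed to mean the clock cycles an instruction needs. In older IACA reports it marks instructions on the critical path, and the v3.0 listing above does not show it. Some instructions do take several cycles to complete, depending on their type (multiplies, for example).
What the tool gives us: it shows how instructions execute on the processor selected with -arch (SKL in the run above) and which execution resources are the bottleneck. That is more useful than the counting macros used earlier, because it shows the load on every port in detail.
The per-instruction table shows how each instruction's micro-ops are distributed over the execution ports and how much pressure (in cycles) each port receives. Each row is one instruction. The fields are explained below.
Columns:
Num Of Uops: the number of micro-ops the instruction generates. Micro-ops are the CPU's internal breakdown of an instruction; complex instructions are split into several. For example, "mov r9, r12" and the load "vmovdqu xmm13, xmmword ptr [r10]" each generate 1 micro-op, while the store "vmovdqu xmmword ptr [rsp+0x30], xmm2" generates 2.
Ports pressure in cycles: how many cycles of each port the instruction occupies.
- 0 - DV (Divider pipe): port 0. The second number, when present, is time spent in the divider pipe (used here by vsqrtps).
- 1: a general arithmetic port.
- 2 - D (Data fetch pipe): a load port, used when an instruction reads data from memory.
- 3 - D (Data fetch pipe): like port 2, also part of the data fetch path.
- 4 - 7: the remaining ports, assigned according to the kind of operation. In this listing port 4 appears on stores, and ports 5 and 6 on logic and scalar integer instructions.
F: macro fusion occurred; two instructions were merged into a single micro-op.
^: micro fusion occurred; two micro-ops of the same instruction are kept together as one fused micro-op.
X: the instruction is not supported by the analyzer and is not included in the analysis.
#: an ESP tracking sync micro-op was issued, normally to synchronize stack pointer operations.
@: an SSE instruction executes after an AVX256/AVX512 instruction, which is expected to cost dozens of cycles because of the transition between the two instruction sets.
Example rows:
- mov r9, r12: a simple register move. It is marked * (not bound to a port), so its port columns are blank.
- vmovdqu xmm13, xmmword ptr [r10]: loads from memory into a register using the load ports (2 and 3). The pressure of 0.5 on each means the load can go to either port, so on average it costs half a cycle on each.
- vmovdqu xmmword ptr [rsp+0x30], xmm2: writes xmm2 to memory. It puts 0.5 on each of ports 2 and 3 and 1.0 on port 4, and its two micro-ops are micro-fused (^).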
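Working out what the letters in the IACA table mean
The tool also reports details such as whether micro fusion or macro fusion occurred. Micro fusion keeps two micro-ops of one instruction together as a single fused micro-op, which reduces the number of micro-ops the front end has to handle. Macro fusion merges two adjacent instructions into a single micro-op.
For example, a conditional jump and the instruction that sets the flags it tests can be fused, so the pair costs one micro-op instead of two. The tool marks which instructions were macro-fused and which were micro-fused.
The tool additionally lists micro-op counts and port usage for every instruction, which makes it much easier to see which operations are likely to be the bottleneck.
Overall, a tool like this provides detailed timing estimates and shows how instructions relate to each other. It can apparently also produce graph output for further analysis. All of this is very helpful for optimizing code and improving processor utilization.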
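Can IACA output a graph?
The report gives a block throughput of 85 cycles, so one pass through the marked block should take about 85 cycles. The figure follows from the number and type of instructions and how they execute in the CPU (whether micro fusion occurs, whether ports conflict, and so on), and it gives a baseline for further optimization analysis.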
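IACA reports a maximum throughput of 85 cycles
According to the analysis, one iteration takes 85 cycles. Each iteration computes four pixels, so dividing 85 by 4 gives roughly 21 to 22 cycles per pixel. That is about half of what was measured when actually running the code, which suggests the real code is being held up by something the analysis does not model.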
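The arithmetic can be written out directly. The sketch below only restates numbers from the report above (block throughput and the per-port cycle row); the names `CyclesPerPixel` and `BusiestPort` are introduced here for illustration.

```cpp
// Values copied from the IACA report above.
constexpr double BlockThroughput = 85.0;
constexpr double PortCycles[8] = {64.5, 64.5, 33.5, 33.5, 14.0, 23.0, 6.0, 1.0};

// The routine is 4-wide SIMD: four pixels per loop iteration.
constexpr int PixelsPerIteration = 4;

constexpr double CyclesPerPixel(double BlockCycles, int Pixels)
{
    return BlockCycles / Pixels;
}

// Cycles of work on the most heavily loaded port.
constexpr double BusiestPort(const double *Cycles, int Count)
{
    double Max = 0.0;
    for (int Index = 0; Index < Count; ++Index)
    {
        if (Cycles[Index] > Max) { Max = Cycles[Index]; }
    }
    return Max;
}

constexpr double PerPixel = CyclesPerPixel(BlockThroughput, PixelsPerIteration); // 21.25
constexpr double Busiest = BusiestPort(PortCycles, 8);                           // 64.5
```

The busiest ports (0 and 1) account for 64.5 of the 85 cycles, so port pressure alone does not explain the block throughput. The report's own note attributes the rest to back-end allocation stalls.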
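There may still be room for optimization…
The analysis says 85 cycles, which suggests that the measured code is spending time waiting on memory and that techniques such as prefetching might help. We are not going to pursue that now, but it shows there is room for improvement. Most importantly, the tool spares us the tedious manual counting, which is very encouraging.
IACA is really great!
A tool like IACA is not built into the compiler, and a few of the steps are a bit fiddly, but overall it is not complicated and far simpler than writing a processor simulator ourselves. Without complete processor documentation, a home-made simulator would involve a lot of guesswork. Intel can be expected to get this right, so their tool gives a more authoritative picture of how many cycles the code takes, which is very useful.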
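Adding some macros to turn IACA on and off
For now the IACA-specific code is taken out, because most builds do not need it. It may be needed again later, so it is only commented out and can easily be restored.
Thanks to Fabian for the suggestions
Today's focus is optimizing and tidying the code. The earlier work on this routine was mostly a scalar-to-SIMD conversion without any deep optimization. There were a few simple optimizations, such as hoisting some operations out of the loop to save cycles, but nothing structural. The suggested improvement is to reorder the math to make it more efficient, which means reorganizing the algorithm.
Fabian: the bilinear blend and the squaring do not need floating point
The suggestion is that the math, in particular the linear blend and the squaring, does not have to be done in floating point. Even when floats are used, the values do not have to be mapped into the 0 to 1 range first. The multiplies that were used earlier to bring values into that range can be avoided by handling the range in a cheaper way.
Moving the sRGB -> linear conversion after the bilinear blend
The suggestion also notes that many multiplies can be avoided, especially around the linear interpolation and the squaring, by moving work so that it is applied to the final result. In the earlier code the multiplies were scattered across every stage, and if only one multiply can issue per cycle, every extra one costs time. Concentrating them in a final step and removing the redundant intermediate ones is more efficient. In particular, the conversion and its multiply can be done once after the bilinear interpolation rather than at every stage.
Folding the normalization into the color computation
This removes repeated multiplies. In the earlier code each value was multiplied by a color constant. The optimized version folds the normalizing divisor into a constant that is already being applied, so the extra multiply disappears. The normalization factor was originally 1/255, but because the values have now been squared, their range is no longer 0 to 255 but 0 to 255 squared, so the factor has to be squared as well. With that adjustment the multiplies are simplified, and the result was verified to be correct.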
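The adjustment of the normalization factor can be checked with a small scalar sketch. It assumes the squaring is the approximate sRGB to linear conversion applied to a channel value in the 0 to 255 range; the function names are illustrative, and the real routine does this on SIMD lanes rather than on single floats.

```cpp
constexpr float Inv255 = 1.0f / 255.0f;
constexpr float Inv255Squared = Inv255 * Inv255;

constexpr float Square(float A) { return A * A; }

// Before: normalize the channel to 0..1 first, then square it.
constexpr float NormalizeThenSquare(float Channel255)
{
    return Square(Inv255 * Channel255);
}

// After: square in 0..255 space, then apply one squared normalization
// constant. That constant can be merged into another multiply later on.
constexpr float SquareThenNormalize(float Channel255)
{
    return Inv255Squared * Square(Channel255);
}

// Tolerance-based comparison, since float rounding may differ slightly.
constexpr bool NearlyEqual(float A, float B)
{
    return (A - B < 1e-6f) && (B - A < 1e-6f);
}
```

Both orderings give the same result up to rounding, which is why the per-channel multiply by 1/255 can be dropped as long as the later constant uses 1/(255*255).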
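Works correctly (not much improvement)
With fewer multiplies, the measured time dropped to about 45 cycles. Even so, it is still not possible to say whether the routine is limited by memory bandwidth, because it is unclear how much of the change is a real performance gain. To investigate, the influence of texture sampling may need to be removed, for instance by substituting fixed values for the actual samples, but more analysis is needed to decide how best to do that.
Removing unnecessary multiplies made the code somewhat more efficient, and folding the constants inline avoided introducing extra ones. It is not certain that the memory bottleneck is gone, but the change clearly reduces the amount of computation.
Removing several multiplies by staying in 0-255 space (no improvement)
Next, the color handling was changed to keep the computation in the 0 to 255 range, avoiding further multiplies. In theory only a single multiply is then needed to adjust the output range. For alpha, no square root is taken, which avoids an unnecessary operation and keeps alpha at its original value.
In testing, these changes made little difference: the compiler seems to have already optimized away most of the unnecessary operations, and the IACA figure only went from 85.00 cycles to 74.47 cycles. Because the compiler had apparently done some of this work on its own, the manual changes did not bring a significant gain.
Comparing the IACA output before and after removing the multiplies
The attempt here was to improve performance by deleting unnecessary multiplies. Comparing the two versions, however, shows that removing them did not reduce the cycle count much, and the compiler may already have optimized some of them away, so the final gain is small. To understand the differences, the two IACA outputs were compared with a diff tool (apparently WinDiff). Even with the extra multiplies deleted, the cycle count is close to the original version and the total number of multiplies is still high, which is both puzzling and a little frustrating.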
Before removal
..\..\..\..\iaca-win64\iaca.exe -arch SKL game.dll
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - game.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 85.00 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 64.5 9.0 | 64.5 | 33.5 26.9 | 33.5 27.1 | 14.0 | 23.0 | 6.0 | 1.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1* | | | | | | | | | mov r9, r12
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm13, xmmword ptr [r10]
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x30], xmm2
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm2, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm5, xmm4, xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x80], xmm2
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x60], xmm13
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x70], xmm5
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movsxd rdx, dword ptr [rsp+r9*1+0x40]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [rsp+r9*1+0x30]
| 1* | | | | | | | | | test edx, edx
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp edx, dword ptr [r14+0x10]
| 0*F | | | | | | | | | jle 0xa
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [0x0], r12d
| 1* | | | | | | | | | test eax, eax
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp eax, dword ptr [r14+0x14]
| 0*F | | | | | | | | | jle 0xa
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [0x0], r12d
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movsxd r8, dword ptr [r14+0x18]
| 1 | | 1.0 | | | | | | | imul eax, r8d
| 1 | | | | | | | 1.0 | | movsxd rcx, eax
| 1 | | | | | | 1.0 | | | lea rdx, ptr [rcx+rdx*4]
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | add rdx, qword ptr [r14+0x20]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [rdx]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0x110], eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rdx*1]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0x100], eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rdx*1+0x4]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xf0], eax
| 1 | | | | | | | 1.0 | | add r9, 0x4
| 1* | | | | | | | | | cmp r9, 0x10
| 0*F | | | | | | | | | jl 0xffffffffffffff94
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm8, xmmword ptr [rsp+0xf0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm10, xmmword ptr [rsp+0x100]
| 1 | 1.0 | | | | | | | | vsubps xmm0, xmm15, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm7, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm1, xmm15, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm9, xmm0, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm4, xmm1, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm5, xmm2, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm8, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm6, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm10, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm5, xmm6
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x30], xmm4
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm4, xmm3, xmm4
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm5, xmm4, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm4, xmmword ptr [rsp+0x110]
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm1, xmm4, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm1, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm3, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm7, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm1, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm5, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm2, xmm14
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x130], xmm0
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm8, 0x8
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x1e0], xmm9
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm9, xmmword ptr [rip+0x8d4a]
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm4, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm14, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm10, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm12, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm13, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm13, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm8, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm4, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm15, xmm3, xmm3
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm6, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm10, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm0, xmmword ptr [rsp+0x60]
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm0, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm9
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm8, xmm9
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x40], xmm7
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm7, xmm3, xmm3
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm5, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm1, xmm11
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm4, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm0, xmm10, xmmword ptr [rip+0x8c80]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm11, xmm2, xmm2
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm2, xmmword ptr [rip+0x8c84]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm9, xmm1, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm0, xmmword ptr [rsp+0x60]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm8, xmm1, xmm2
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm0, xmm0, xmmword ptr [rip+0x8c5a]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm4, xmmword ptr [rsp+0x70]
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm10, xmm1, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm1, xmm4, xmmword ptr [rsp+0x80]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm7, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm3, xmm3
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm2, xmm0, xmmword ptr [rsp+0x30]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm2, xmm1
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm1, xmm7, xmmword ptr [rsp+0x40]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm6, xmm6
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0x1e0]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0x120]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm2, xmmword ptr [rip+0x8c45]
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vsubps xmm6, xmm2, xmmword ptr [rsp+0x130]
| 1* | | | | | | | | | vxorps xmm1, xmm1, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm1, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm1, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm5, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm0, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm0, xmm3, xmmword ptr [rip+0x8c38]
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm2, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x70]
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm5, xmm2, 0x10
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm12, xmm12
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm12, xmm0, xmmword ptr [rsp+0x80]
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x30]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm15, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm3, xmm0
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm0, xmm15, xmmword ptr [rsp+0x40]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm15, xmmword ptr [rip+0x8bde]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm14, xmm14
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm2, xmm1, xmmword ptr [rsp+0x1e0]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm2, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x140]
| 1* | | | | | | | | | vxorps xmm0, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm0, xmm3, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm0, xmm15
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm13, xmm13
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm13, xmmword ptr [rip+0x8bc5]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm1, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm1, xmm3, xmm13
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm0, xmm2, 0x8
| 1 | | | | | | 1.0 | | | vpor xmm5, xmm5, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm11, xmm12
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm12, xmmword ptr [rip+0x8b3a]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm8, xmm8
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm2, xmm1, xmmword ptr [rsp+0x30]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm8, xmmword ptr [rsp+0x60]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm2, xmm0
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm0, xmm11, xmmword ptr [rsp+0x40]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm11, xmmword ptr [rip+0x8b27]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm9, xmm9
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x1e0]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm3, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x150]
| 1* | | | | | | | | | vxorps xmm0, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm0, xmm3, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm0, xmm15
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm10, xmm10
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm1, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm1, xmm3, xmm13
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm8, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm2, xmm0, xmm12
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm4, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm1, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm3, xmm6
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vaddps xmm2, xmm0, xmmword ptr [rsp+0x130]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm2, xmm13
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm9, xmmword ptr [rsp+0xb0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm6, xmmword ptr [rsp+0xc0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm10, xmmword ptr [rsp+0xd0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm14, xmmword ptr [rsp+0xe0]
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm0, xmm3, 0x18
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm4, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm0, xmmword ptr [rsp+0x160]
| 1 | | | | | | 1.0 | | | vpor xmm1, xmm5, xmm2
| 1 | | | | | | 1.0 | | | vpand xmm3, xmm1, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm1, xmmword ptr [rsp+0x90]
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vaddps xmm1, xmm1, xmmword ptr [rip+0x8aa8]
| 1 | | | | | | 1.0 | | | vpandn xmm0, xmm0, xmm8
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm8, xmmword ptr [rsp+0xa0]
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm3, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm3, xmmword ptr [rsp+0x170]
| 2^ | | | | | 1.0 | | | 1.0 | vmovdqu xmmword ptr [r10], xmm2
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm2, xmmword ptr [rsp+0x180]
| 1 | | | | | | | 1.0 | | add r10, 0x10
Total Num Of Uops: 247
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.
In the earlier test, this part appears to have been left enabled.
Enable the IACA macros.
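As a rough sketch of what enabling the macros can look like: the markers `IACA_VC64_START` and `IACA_VC64_END` come from iacaMarks.h (the 64-bit MSVC variants), with the start marker at the top of the loop body and the end marker right after the loop. The `USE_IACA` switch and the `sum_pixels` stand-in loop are hypothetical names used only for illustration; they are not from the project.

```c
#include <assert.h>
#include <stddef.h>

/* USE_IACA is a hypothetical build switch: define it only for the analysis build. */
#ifdef USE_IACA
#include "iacaMarks.h"    /* provides IACA_VC64_START / IACA_VC64_END on 64-bit MSVC */
#else
#define IACA_VC64_START   /* expands to nothing in normal builds */
#define IACA_VC64_END
#endif

/* Stand-in for the real inner loop (hypothetical): sums an array of pixels. */
unsigned sum_pixels(const unsigned *pixels, size_t count)
{
    assert(pixels != NULL || count == 0);
    unsigned sum = 0;
    for (size_t i = 0; i < count; ++i) {
        IACA_VC64_START   /* start marker: top of the loop body */
        sum += pixels[i];
    }
    IACA_VC64_END         /* end marker: immediately after the loop */
    return sum;
}
```

With the switch off, the markers vanish and the normal build is unaffected, which matches the idea of including the header only temporarily for analysis.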
Build the release version, then run the IACA command to inspect the result:
..\..\..\..\iaca-win64\iaca.exe -arch SKL game.dll
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - game.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 74.47 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 53.5 9.0 | 53.5 | 28.5 19.8 | 28.5 20.2 | 18.0 | 23.0 | 7.0 | 1.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1* | | | | | | | | | mov r9, r12
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm13, xmmword ptr [r10]
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm0, xmm3, xmm1
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x70], xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x180], xmm13
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x60], xmm2
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0xe0], xmm14
| 1 | | | | | | | 1.0 | | data16 nop
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movsxd rdx, dword ptr [rsp+r9*1+0x50]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [rsp+r9*1+0x60]
| 1* | | | | | | | | | test edx, edx
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp edx, dword ptr [r14+0x10]
| 0*F | | | | | | | | | jle 0xa
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [0x0], r12d
| 1* | | | | | | | | | test eax, eax
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp eax, dword ptr [r14+0x14]
| 0*F | | | | | | | | | jle 0xa
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [0x0], r12d
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movsxd r8, dword ptr [r14+0x18]
| 1 | | 1.0 | | | | | | | imul eax, r8d
| 1 | | | | | | | 1.0 | | movsxd rcx, eax
| 1 | | | | | | 1.0 | | | lea rdx, ptr [rcx+rdx*4]
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | add rdx, qword ptr [r14+0x20]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [rdx]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xb0], eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rdx*1]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xd0], eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rdx*1+0x4]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xc0], eax
| 1 | | | | | | | 1.0 | | add r9, 0x4
| 1* | | | | | | | | | cmp r9, 0x10
| 0*F | | | | | | | | | jl 0xffffffffffffff94
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm5, xmmword ptr [rip+0x8e0a]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm11, xmmword ptr [rsp+0xb0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm4, xmmword ptr [rsp+0xc0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm3, xmmword ptr [rsp+0xd0]
| 1 | 1.0 | | | | | | | | vpsrld xmm0, xmm11, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x50], xmm2
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm11, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x60], xmm2
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm11, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x150], xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm4, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm6, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm4, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm10, xmm1
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm4, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm7, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm3, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x110], xmm2
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm3, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm9, xmm1
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm3, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm1, xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x140], xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm1, xmm13, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm1, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x130], xmm2
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm1, xmm13, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm1, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm8, xmm0
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm13, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm0, xmm1
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm1, xmmword ptr [rsp+0x70]
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x170], xmm0
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm0, xmm12, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm15, xmm0, xmm14
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm2, xmm12, xmm14
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm12, xmm1, xmmword ptr [rsp+0xe0]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm13, xmm0, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm14, xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm4, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm5, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm3, 0x18
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm1, xmm0, xmmword ptr [rip+0x8ceb]
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm14
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm1, xmm11, 0x18
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm2, xmm1, xmmword ptr [rip+0x8cd4]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm12, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm3, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm0, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm0, xmm13
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm15, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x70], xmm13
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm13, xmm0, xmmword ptr [rsp+0xf0]
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm1, xmm13, xmmword ptr [rip+0x8cb0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm0, xmmword ptr [rip+0x8cd8]
| 1 | 0.5 | 0.5 | | | | | | | vsubps xmm11, xmm0, xmm1
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x60]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm5, xmm10, xmm10
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm10, xmmword ptr [rsp+0x70]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm0, xmm10
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm15, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm6, xmm6, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm7, xmm7, xmm7
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm9, xmm9
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm9, xmmword ptr [rip+0x8cbd]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm0, xmm14
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm12, xmm5
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0x100]
| 1* | | | | | | | | | vxorps xmm1, xmm1, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm1, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm1, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm8, xmm8
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm0, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 1 | | 1.0 | | | | | | | vcvtps2dq xmm0, xmm3
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm5, xmm0, 0x8
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x50]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm1, xmm10
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm15, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm2, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x110]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm1, xmm14
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm12, xmm6
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm3, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x120]
| 1* | | | | | | | | | vxorps xmm8, xmm8, xmm8
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm0, xmm3, xmm8
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm0, xmm9
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x130]
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm1, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x140]
| 1 | | 1.0 | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm1, xmm3
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm2, xmm1, 0x10
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm0, xmm14
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x150]
| 1 | | | | | | 1.0 | | | vpor xmm6, xmm5, xmm2
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm0, xmm10
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm12, xmm7
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm4, xmm3, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm1, xmm15, xmm7
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm7, xmmword ptr [rsp+0x180]
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm2, xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm4, xmm2
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0x160]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x170]
| 1 | 0.5 | 0.5 | | | | | | | vmaxps xmm1, xmm3, xmm8
| 1 | 0.5 | 0.5 | | | | | | | vminps xmm4, xmm1, xmm9
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm2, xmm0, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 1 | | 1.0 | | | | | | | vcvtps2dq xmm5, xmm3
| 1 | 0.5 | 0.5 | | | | | | | vpsrld xmm0, xmm7, 0x18
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm1, xmm0, xmmword ptr [rip+0x8b28]
| 1 | 0.5 | 0.5 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 0.5 | 0.5 | | | | | | | vmulps xmm3, xmm2, xmm11
| 1 | 0.5 | 0.5 | | | | | | | vaddps xmm0, xmm3, xmm13
| 1 | 0.5 | 0.5 | | | | | | | vcvtps2dq xmm1, xmm0
| 1 | 0.5 | 0.5 | | | | | | | vpslld xmm2, xmm1, 0x18
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm1, xmmword ptr [rsp+0x190]
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm5, xmm2
| 1 | | | | | | 1.0 | | | vpor xmm0, xmm6, xmm3
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm3, xmmword ptr [rsp+0x1a0]
| 1 | | | | | | 1.0 | | | vpand xmm4, xmm0, xmm1
| 1 | | | | | | 1.0 | | | vpandn xmm1, xmm1, xmm7
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm4, xmm1
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm1, xmmword ptr [rsp+0x30]
| 2^ | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vaddps xmm1, xmm1, xmmword ptr [rip+0x8b2d]
| 2^ | | | | | 1.0 | | | 1.0 | vmovdqu xmmword ptr [r10], xmm2
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm2, xmmword ptr [rsp+0x1b0]
| 1 | | | | | | | 1.0 | | add r10, 0x10
Total Num Of Uops: 219
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.
windiff
windiff's colors are too saturated and hard on the eyes.
Use VS Code's compare view instead.
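Removing 28 instructions did not improve the throughput reported by IACA
Remove the test code further down and see how many instructions that saves.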
..\..\..\..\iaca-win64\iaca.exe -arch SKL game.dll
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - game.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 77.05 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 54.0 9.0 | 54.0 | 29.5 21.6 | 29.5 21.4 | 17.0 | 23.0 | 6.0 | 1.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1* | | | | | | | | | mov r9, r12
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm13, xmmword ptr [r10]
| 1 | 1.0 | | | | | | | | vsubps xmm14, xmm4, xmm0
| 1 | | 1.0 | | | | | | | vsubps xmm0, xmm3, xmm1
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x70], xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x170], xmm13
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x60], xmm2
| 0X | | | | | | | | | nop word ptr [rax+rax*1], ax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movsxd rdx, dword ptr [rsp+r9*1+0x50]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [rsp+r9*1+0x60]
| 1* | | | | | | | | | test edx, edx
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp edx, dword ptr [r14+0x10]
| 0*F | | | | | | | | | jle 0xa
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [0x0], r12d
| 1* | | | | | | | | | test eax, eax
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp eax, dword ptr [r14+0x14]
| 0*F | | | | | | | | | jle 0xa
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [0x0], r12d
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movsxd r8, dword ptr [r14+0x18]
| 1 | | 1.0 | | | | | | | imul eax, r8d
| 1 | | | | | | | 1.0 | | movsxd rcx, eax
| 1 | | | | | | 1.0 | | | lea rdx, ptr [rcx+rdx*4]
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | add rdx, qword ptr [r14+0x20]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [rdx]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xb0], eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rdx*1]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xd0], eax
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov eax, dword ptr [r8+rdx*1+0x4]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xc0], eax
| 1 | | | | | | | 1.0 | | add r9, 0x4
| 1* | | | | | | | | | cmp r9, 0x10
| 0*F | | | | | | | | | jl 0xffffffffffffff94
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm5, xmmword ptr [rip+0x8e1a]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm10, xmmword ptr [rsp+0xb0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm4, xmmword ptr [rsp+0xc0]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm3, xmmword ptr [rsp+0xd0]
| 1 | 1.0 | | | | | | | | vpsrld xmm0, xmm10, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm2, xmm1
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x60], xmm2
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm10, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm2, xmm1
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x50], xmm2
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm10, xmm5
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm1, xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x130], xmm1
| 1 | 1.0 | | | | | | | | vpsrld xmm0, xmm4, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm11, xmm1
| 1 | 1.0 | | | | | | | | vpsrld xmm0, xmm4, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm6, xmm1
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm4, xmm5
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm7, xmm0
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm3, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm8, xmm1
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm3, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm3, xmm5
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm1, xmm0
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x140], xmm1
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x100], xmm2
| 1 | 1.0 | | | | | | | | vpsrld xmm1, xmm13, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm1, xmm5
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm9, xmm0
| 1 | 1.0 | | | | | | | | vpsrld xmm1, xmm13, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm1, xmm5
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm2, xmm0
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm13, xmm5
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm0, xmm1
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm1, xmmword ptr [rsp+0x70]
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x160], xmm0
| 1 | | 1.0 | | | | | | | vsubps xmm0, xmm12, xmm1
| 1 | 1.0 | | | | | | | | vmulps xmm13, xmm0, xmm14
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x120], xmm2
| 1 | | 1.0 | | | | | | | vsubps xmm2, xmm12, xmm14
| 1 | 1.0 | | | | | | | | vmulps xmm15, xmm0, xmm2
| 1 | | 1.0 | | | | | | | vmulps xmm12, xmm2, xmm1
| 1 | 1.0 | | | | | | | | vmulps xmm14, xmm1, xmm14
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm4, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm5
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm5, xmm1
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm3, 0x18
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm1, xmm0, xmmword ptr [rip+0x8cff]
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | | 1.0 | | | | | | | vmulps xmm3, xmm2, xmm12
| 1 | 1.0 | | | | | | | | vpsrld xmm1, xmm10, 0x18
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm2, xmm1, xmmword ptr [rip+0x8ce8]
| 1 | | 1.0 | | | | | | | vmulps xmm0, xmm14, xmm5
| 1 | 1.0 | | | | | | | | vaddps xmm4, xmm3, xmm0
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm0, xmm2
| 1 | 1.0 | | | | | | | | vmulps xmm3, xmm0, xmm15
| 1 | | 1.0 | | | | | | | vmulps xmm1, xmm13, xmm5
| 1 | 1.0 | | | | | | | | vaddps xmm2, xmm3, xmm1
| 1 | | 1.0 | | | | | | | vaddps xmm0, xmm4, xmm2
| 2^ | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rsp+0x70], xmm12
| 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm12, xmm0, xmmword ptr [rsp+0xe0]
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm1, xmm12, xmmword ptr [rip+0x8cc4]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm0, xmmword ptr [rip+0x8cec]
| 1 | 1.0 | | | | | | | | vsubps xmm10, xmm0, xmm1
| 1 | | 1.0 | | | | | | | vmulps xmm0, xmm8, xmm8
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm8, xmmword ptr [rsp+0x70]
| 1 | 1.0 | | | | | | | | vmulps xmm2, xmm0, xmm8
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x60]
| 1 | | 1.0 | | | | | | | vmulps xmm5, xmm11, xmm11
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm14, xmm5
| 1 | | 1.0 | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | 1.0 | | | | | | | | vmulps xmm6, xmm6, xmm6
| 1 | | 1.0 | | | | | | | vmulps xmm7, xmm7, xmm7
| 1 | 1.0 | | | | | | | | vaddps xmm4, xmm2, xmm1
| 1 | | 1.0 | | | | | | | vmulps xmm3, xmm0, xmm15
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm13, xmm5
| 1 | | 1.0 | | | | | | | vaddps xmm2, xmm3, xmm1
| 1 | 1.0 | | | | | | | | vaddps xmm0, xmm4, xmm2
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0xf0]
| 1* | | | | | | | | | vxorps xmm11, xmm11, xmm11
| 1 | 1.0 | | | | | | | | vmaxps xmm1, xmm3, xmm11
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vminps xmm4, xmm1, xmmword ptr [rip+0x8cad]
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm9, xmm9
| 1 | | 1.0 | | | | | | | vmulps xmm2, xmm0, xmm10
| 1 | 1.0 | | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 1 | | 1.0 | | | | | | | vcvtps2dq xmm0, xmm3
| 1 | | 1.0 | | | | | | | vpslld xmm5, xmm0, 0x10
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x50]
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm0, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm2, xmm1, xmm15
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm13, xmm6
| 1 | | 1.0 | | | | | | | vaddps xmm4, xmm2, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x100]
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm0, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm3, xmm1, xmm8
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm14, xmm6
| 1 | | 1.0 | | | | | | | vaddps xmm2, xmm3, xmm0
| 1 | 1.0 | | | | | | | | vaddps xmm1, xmm4, xmm2
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm1, xmmword ptr [rsp+0x110]
| 1 | 1.0 | | | | | | | | vmaxps xmm0, xmm3, xmm11
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vminps xmm4, xmm0, xmmword ptr [rip+0x8c47]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x120]
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm0, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm2, xmm1, xmm10
| 1 | 1.0 | | | | | | | | vaddps xmm0, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm0
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x130]
| 1 | | 1.0 | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | | 1.0 | | | | | | | vcvtps2dq xmm1, xmm3
| 1 | 1.0 | | | | | | | | vpslld xmm2, xmm1, 0x8
| 1 | | 1.0 | | | | | | | vmulps xmm3, xmm0, xmm15
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x140]
| 1 | | | | | | 1.0 | | | vpor xmm6, xmm5, xmm2
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm2, xmm0, xmm8
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm8, xmmword ptr [rsp+0x80]
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm13, xmm7
| 1 | | 1.0 | | | | | | | vaddps xmm4, xmm3, xmm1
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm14, xmm7
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm7, xmmword ptr [rsp+0x170]
| 1 | | 1.0 | | | | | | | vaddps xmm2, xmm2, xmm1
| 1 | 1.0 | | | | | | | | vaddps xmm0, xmm4, xmm2
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0x150]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm0, xmmword ptr [rsp+0x160]
| 1 | 1.0 | | | | | | | | vmaxps xmm1, xmm3, xmm11
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vminps xmm4, xmm1, xmmword ptr [rip+0x8bb7]
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm2, xmm0, xmm10
| 1 | 1.0 | | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm7, 0x18
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | vpand xmm1, xmm0, xmmword ptr [rip+0x8b39]
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 1.0 | | | | | | | | vcvtps2dq xmm5, xmm3
| 1 | | 1.0 | | | | | | | vmulps xmm3, xmm2, xmm10
| 1 | 1.0 | | | | | | | | vaddps xmm0, xmm3, xmm12
| 1 | | 1.0 | | | | | | | vcvtps2dq xmm1, xmm0
| 1 | 1.0 | | | | | | | | vpslld xmm2, xmm1, 0x18
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovdqu xmm1, xmmword ptr [rsp+0x180]
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm5, xmm2
| 1 | | | | | | 1.0 | | | vpor xmm0, xmm6, xmm3
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm3, xmmword ptr [rsp+0x190]
| 1 | | | | | | 1.0 | | | vpand xmm4, xmm0, xmm1
| 1 | | | | | | 1.0 | | | vpandn xmm1, xmm1, xmm7
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm7, xmmword ptr [rsp+0x200]
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm4, xmm1
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm1, xmmword ptr [rsp+0x30]
| 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | vaddps xmm1, xmm1, xmmword ptr [rip+0x8b31]
| 2^ | | | | | 1.0 | | | 1.0 | vmovdqu xmmword ptr [r10], xmm2
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovups xmm2, xmmword ptr [rsp+0x1a0]
| 1 | | | | | | | 1.0 | | add r10, 0x10
Total Num Of Uops: 219
Analysis Notes:
There was an unsupported instruction(s), it was not accounted in Analysis.
Backend allocation was stalled due to unavailable allocation resources.
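Removing 28 instructions left the block throughput unchanged. The listing still shows multiplies spread across the whole block, and the stretches that contain no multiplies are busy with other work, such as fetching texels, so they are under pressure as well. Removing a few multiplies therefore did not speed up the block, which suggests that memory traffic or some other limit, rather than the multiply count, is the bottleneck.
It looks like we are doing the same number of multiplies either way
Even though some instructions were removed, the number of multiplies actually executed barely changed; they are just arranged differently. That is surprising. It suggests the compiler had already optimized these multiplies on its own. Confirming that would take a closer look at how the compiler handles these instructions.
Was the compiler smart enough to do these transformations itself?
The compiler turned out to be very smart here. It had already folded constants and merged repeated computations, and it did so more effectively than our manual changes. We do not usually expect the compiler to do well at this level of detail, but this time it was impressive: it had already done the part we were trying to do by hand. We did not end up beating the compiler, but the result does show how capable its optimizer can be.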
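What other optimizations are left?
One option is to square the values before the conversion to float, which might relieve port pressure, in particular by using integer multiplies. This could help, but it is an invasive change that requires reworking much of the current flow, so it may not be a good fit at this stage.
Looking through the Intel intrinsics, an integer multiply could in theory do the job, but there is no instruction that fits directly, especially once the product has to be split into high and low halves. The idea is appealing, but it does not drop in as easily as hoped and needs more thought.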
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_mul
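Use _mm_mulhi_epi16 to do the squaring on wider integer registers before converting to float?
The idea is to do the multiply on a narrower data width (16-bit integers) before the conversion. Once the values are converted to float, each one occupies a full 32-bit lane, so staying in integer form for the squaring lets one instruction process more values per register.
Squaring maps well onto integer arithmetic. The square root does not: it is awkward to do in fixed point, so it probably cannot stay in this mode. Overall, the most promising plan is to restructure the shifts and masks so that less work goes into conversion, and to do the multiply (the squaring) on the packed integers.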
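Blackboard: mask off A and G, keeping only R and B aligned to 16-bit SIMD lanes
The plan is to merge operations. Previously each channel was masked out on its own. The new idea is to mask off two channels at once (A and G), leaving R and B each sitting alone in a 16-bit lane, so that a 16-bit multiply can square both of them in one instruction.
The masked data is then squared in place, without converting it to float first. Squaring the 16-bit values means choosing between `_mm_mulhi_epi16` and `_mm_mullo_epi16`, and the difference between the two matters for getting the right result. The scheme looks like it could improve efficiency, although the details still need to be confirmed in the implementation.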
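As a minimal scalar sketch of the blackboard idea, the following models a single 32-bit texel instead of a full SIMD register. The `0xAARRGGBB` layout, the mask name, and the helper names are assumptions made for illustration, not code from the project.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed texel layout: 0xAARRGGBB. This mask keeps R and B, each alone
   in its own 16-bit lane, and clears A and G. */
#define MASK_RB 0x00FF00FFu

/* Squares both 16-bit lanes of a 32-bit word independently, keeping the
   low 16 bits of each product, as a per-lane 16-bit multiply would. */
static uint32_t square_lanes16(uint32_t v)
{
    uint32_t lo = v & 0xFFFFu;
    uint32_t hi = v >> 16;
    lo = (lo * lo) & 0xFFFFu;
    hi = (hi * hi) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Returns R*R in bits 31..16 and B*B in bits 15..0. */
static uint32_t square_rb(uint32_t texel)
{
    return square_lanes16(texel & MASK_RB);
}
```

For the texel `0x80FF4002` this gives `0xFE010004`: R = 255 squares to 65025 (`0xFE01`) and B = 2 squares to 4, with A and G gone before the multiply.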
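Blackboard: _mm_mullo_epi16 vs. _mm_mulhi_epi16
Multiplying two 16-bit values can produce a result up to 32 bits wide. It is the same as in decimal: 10 times 10 is 100, which needs more digits than either input. The result of a 16-bit SIMD multiply has to fit back into a 16-bit lane, so the multiplier lets you choose which 16-bit half of the product to keep.
In our case each input is really only 8 bits wide, so its square always fits in 16 bits. The low 16 bits are therefore the whole answer and the high 16 bits are always zero. Using `mullo` to take the low 16 bits is enough to do the squaring, with no need to look at the high half.
That simplifies the pipeline and removes unnecessary steps. Some problems still show up in the later stages, but this is the basic approach to the squaring.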
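A scalar model of one lane can make the low/high split concrete. The function names here are illustrative. Note that `_mm_mulhi_epi16` is the signed variant and `_mm_mulhi_epu16` the unsigned one; the sketch models the unsigned case.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits of the 32-bit product: what _mm_mullo_epi16 keeps per lane. */
static uint16_t mullo16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) & 0xFFFFu);
}

/* High 16 bits of the unsigned 32-bit product (the _mm_mulhi_epu16 case). */
static uint16_t mulhi16u(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}
```

With 8-bit inputs the largest product is 255 * 255 = 65025, which fits in 16 bits, so `mulhi16u` returns 0 and `mullo16` carries the full square. With larger inputs, such as 300 * 300 = 90000, the high half becomes non-zero and `mullo16` alone would lose information.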
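Problem: this squares our alpha channel too
This is a problem, because it rules out the clean halving of the work that we were hoping for. In the ideal case it would have worked, but with alpha in the way the optimization is less neat than it first looked, which is a little disappointing.
We need other instructions to handle the alpha channel
The simplest fix is to mask the alpha value out separately, which avoids adding much complexity. It is not ideal, though: the point was to reduce the number of operations, and this adds work back, so the gain is smaller than expected.
In addition, even once we convert from integer to float, there is still the issue that the R and B values share one register and the A and G values share another. An extra step is needed to pull them apart so that each channel is processed on its own.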
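Extracting the components from the 16-bit lanes with shifts and masks
Masking is a simple and effective way to get at the values we need. The B channel can be extracted with a mask, and R with a right shift. G is already in the right position and only needs a mask, and A can likewise be extracted with the appropriate mask.
The approach is fairly clear, but the shifts and masks have to be worked out carefully so that each channel is separated correctly.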
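Wrong result! It is Q&A time, but let's try to debug first…
Debugging turned up the problem: the values are now 16 bits wide, but they were still being masked as if they were 8 bits wide. The original masks were written to strip everything except an 8-bit channel. Applied to a 16-bit squared value, an 8-bit mask throws away the upper half of the result, so the data that came out was wrong and the image was wrong with it.
With the masks adjusted so that each one keeps exactly the 16-bit lane it is meant to keep, every channel comes out correctly and the whole pipeline works.
Found it: we should be masking 16 bits, not 8
The mask has to cover all 16 bits of the lane rather than only 8. It is an obvious mistake once seen, and it needs fixing. After that, some of the temporary settings have to be removed so that the real output is visible.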
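The bug described above can be reproduced on a single 32-bit word. This is a sketch under the assumption that the word holds R*R in its high 16-bit lane and B*B in its low lane; the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* rb2 is assumed to hold R*R in bits 31..16 and B*B in bits 15..0. */

/* The bug: an 8-bit mask drops bits 15..8 of the squared value. */
static uint32_t b2_masked_8(uint32_t rb2)
{
    return rb2 & 0xFFu;
}

/* The fix: a 16-bit mask keeps the whole lane. */
static uint32_t b2_masked_16(uint32_t rb2)
{
    return rb2 & 0xFFFFu;
}

/* R*R sits in the high lane, so a shift by 16 is all it needs. */
static uint32_t r2_shifted(uint32_t rb2)
{
    return rb2 >> 16;
}
```

For B = 200 the square is 40000 (`0x9C40`). The 8-bit mask returns 64 (`0x40`), while the 16-bit mask returns the full 40000.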
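Better, but the result still looks a little odd
Some parts are now handled correctly, but the output still does not look quite right. Probably not every mask was updated, which would explain the remaining errors. Even so, the overall approach still seems sound.
How do we avoid squaring alpha?
Masking alone is not enough, because the multiply has to leave alpha untouched. The goal is for alpha to keep its original value, unsquared, so that it stays linear in the final result. More instructions would do it, but it would be nicer to find a simpler way to pass alpha through unchanged.
Just extract the alpha channel before squaring?
On second thought, the earlier idea may have been overcomplicated. Why not pull alpha out first and then square the rest? That does not look hard to do; the earlier worry was probably just overthinking.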
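A scalar sketch of "take alpha out first, then square the rest" might look like the following. The struct, the function name, and the `0xAARRGGBB` layout are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: squared colour channels plus untouched alpha. */
typedef struct {
    uint32_t r2, g2, b2, a;
} squared_texel;

/* Assumes a 0xAARRGGBB texel. Alpha is read before any squaring happens,
   so it stays linear. */
static squared_texel unpack_and_square(uint32_t texel)
{
    squared_texel out;
    uint32_t rb = texel & 0x00FF00FFu;   /* R and B in 16-bit lanes */
    uint32_t g  = (texel >> 8) & 0xFFu;  /* G alone, alpha already dropped */

    out.a  = texel >> 24;                /* extracted first, never squared */
    out.r2 = (rb >> 16) * (rb >> 16);
    out.b2 = (rb & 0xFFFFu) * (rb & 0xFFFFu);
    out.g2 = g * g;
    return out;
}
```

For `0x80FF4002` this yields alpha 128 unchanged, with 65025, 4096 and 4 for the squared R, G and B.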
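… and that works
Everything now runs correctly, and it is fast. Next, let's see what happens if we carry the change through to the rest of the code.
Now convert everything to use the 16-bit square
Converting the rest is a matter of repeating the same code with the texel substituted: what was written for texel A is repeated for texel B, then texel C, then texel D. A macro would be a reasonable way to do this, since every conversion follows exactly the same steps and only the texel name changes.
… we are actually 8 cycles worse than before
The result is now worse than it was. The "ghosting" visible in the output was probably caused by a typo somewhere along the way. The earlier version performed better, so something in the new code has gone wrong and needs to be tracked down.
Why is that? Let's run it through IACA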
..\..\..\..\iaca-win64\iaca.exe -arch SKL game.dll
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File - game.dll
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 77.42 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 52.0 9.0 | 52.0 | 29.0 25.0 | 29.0 19.0 | 15.0 | 20.0 | 6.0 | 1.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1* | | | | | | | | | mov r9, rsi
| 1 | | | 1.0 1.0 | | | | | | vmovdqu xmm12, xmmword ptr [r10]
| 2^ | | | | 1.0 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x50], xmm1
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm1, xmm2
| 1 | 1.0 | | | | | | | | vsubps xmm15, xmm3, xmm1
| 2^ | | | 1.0 | | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x160], xmm12
| 2^ | | | | 1.0 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0x60], xmm2
| 1 | | 1.0 | | | | | | | vsubps xmm5, xmm4, xmm0
| 0X | | | | | | | | | nop dword ptr [rax+rax*1], eax
| 1 | | | 1.0 1.0 | | | | | | movsxd rdx, dword ptr [rsp+r9*1+0x50]
| 1 | | | | 1.0 1.0 | | | | | mov eax, dword ptr [rsp+r9*1+0x60]
| 1* | | | | | | | | | test edx, edx
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 1.0 1.0 | | | | 1.0 | | cmp edx, dword ptr [r14+0x10]
| 0*F | | | | | | | | | jle 0x9
| 2^ | | | | 1.0 | 1.0 | | | | mov dword ptr [0x0], esi
| 1* | | | | | | | | | test eax, eax
| 0*F | | | | | | | | | js 0x8
| 2^ | | | 1.0 1.0 | | | | 1.0 | | cmp eax, dword ptr [r14+0x14]
| 0*F | | | | | | | | | jle 0x9
| 2^ | | | | 1.0 | 1.0 | | | | mov dword ptr [0x0], esi
| 1 | | | 1.0 1.0 | | | | | | movsxd r8, dword ptr [r14+0x18]
| 1 | | 1.0 | | | | | | | imul eax, r8d
| 1 | | | | | | | 1.0 | | movsxd rcx, eax
| 1 | | | | | | 1.0 | | | lea rdx, ptr [rcx+rdx*4]
| 2^ | | | | 1.0 1.0 | | | 1.0 | | add rdx, qword ptr [r14+0x20]
| 1 | | | 1.0 1.0 | | | | | | mov eax, dword ptr [r8+rdx*1]
| 2 | | | | 1.0 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xc0], eax
| 1 | | | 1.0 1.0 | | | | | | mov eax, dword ptr [r8+rdx*1+0x4]
| 2 | | | | 1.0 | 1.0 | | | | mov dword ptr [rsp+r9*1+0xb0], eax
| 1 | | | | | | | 1.0 | | add r9, 0x4
| 1* | | | | | | | | | cmp r9, 0x10
| 0*F | | | | | | | | | jl 0xffffffffffffffa0
| 1 | | | 1.0 1.0 | | | | | | vmovdqu xmm4, xmmword ptr [rsp+0xb0]
| 1 | | | | 1.0 1.0 | | | | | vmovdqu xmm3, xmmword ptr [rsp+0xc0]
| 1 | | | 1.0 1.0 | | | | | | vmovdqu xmm0, xmmword ptr [rip+0x8de4]
| 2^ | | | | 1.0 | 1.0 | | | | vmovdqu xmmword ptr [rsp+0xa0], xmm0
| 1 | 1.0 | | | | | | | | vpsrld xmm0, xmm4, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm13
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm11, xmm1
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm4, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm13
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm6, xmm1
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm4, xmm13
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm8, xmm0
| 1 | 1.0 | | | | | | | | vpsrld xmm0, xmm3, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm13
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm10, xmm1
| 1 | 1.0 | | | | | | | | vpsrld xmm0, xmm3, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm13
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm2, xmm1
| 2^ | | | 1.0 | | 1.0 | | | | vmovups xmmword ptr [rsp+0x100], xmm2
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm3, xmm13
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm1, xmm0
| 2^ | | | | 1.0 | 1.0 | | | | vmovups xmmword ptr [rsp+0x140], xmm1
| 1 | | 1.0 | | | | | | | vpsrld xmm1, xmm12, 0x10
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm1, xmm13
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm9, xmm0
| 1 | | 1.0 | | | | | | | vpsrld xmm1, xmm12, 0x8
| 1 | | | | | | 1.0 | | | vpand xmm0, xmm1, xmm13
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm2, xmm0
| 2^ | | | 1.0 | | 1.0 | | | | vmovups xmmword ptr [rsp+0x130], xmm2
| 1 | | 1.0 | | | | | | | vsubps xmm2, xmm7, xmm5
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm12, xmm13
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm0, xmm1
| 2^ | | | | 1.0 | 1.0 | | | | vmovups xmmword ptr [rsp+0xa0], xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm12, xmm2, xmm15
| 1 | 1.0 | | | | | | | | vsubps xmm0, xmm7, xmm15
| 1 | | 1.0 | | | | | | | vmulps xmm7, xmm0, xmm2
| 1 | 1.0 | | | | | | | | vmulps xmm14, xmm0, xmm5
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm4, 0x18
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm13
| 1 | 1.0 | | | | | | | | vpsrld xmm0, xmm3, 0x18
| 1 | | 1.0 | | | | | | | vmulps xmm15, xmm15, xmm5
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm5, xmm1
| 1 | | | | | | 1.0 | | | vpand xmm1, xmm0, xmm13
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | 1.0 | | | | | | | | vmulps xmm3, xmm2, xmm12
| 2^ | | 1.0 | 1.0 1.0 | | | | | | vmulps xmm2, xmm7, xmmword ptr [rsp+0xd0]
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm15, xmm5
| 1 | | 1.0 | | | | | | | vaddps xmm4, xmm3, xmm0
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm14, xmm5
| 1 | | 1.0 | | | | | | | vaddps xmm0, xmm2, xmm1
| 1 | | | | 1.0 1.0 | | | | | vmovdqu xmm1, xmmword ptr [rip+0x8d3a]
| 1 | 1.0 | | | | | | | | vaddps xmm2, xmm4, xmm0
| 2^ | | 1.0 | 1.0 1.0 | | | | | | vmulps xmm13, xmm2, xmmword ptr [rsp+0xe0]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | vmulps xmm0, xmm13, xmmword ptr [rip+0x8cf5]
| 2^ | | | 1.0 | | 1.0 | | | | vmovups xmmword ptr [rsp+0x50], xmm12
| 1 | | 1.0 | | | | | | | vsubps xmm12, xmm1, xmm0
| 1 | | | | 1.0 1.0 | | | | | vmovdqu xmm1, xmmword ptr [rip+0x8cc3]
| 1 | 1.0 | | | | | | | | vpmullw xmm0, xmm1, xmm1
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm0, 0x10
| 1 | 1.0 | | | | | | | | vcvtdq2ps xmm1, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm0, xmm10, xmm10
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm10, xmmword ptr [rsp+0x50]
| 1 | 1.0 | | | | | | | | vmulps xmm4, xmm0, xmm10
| 2^ | | | | 1.0 | 1.0 | | | | vmovups xmmword ptr [rsp+0x60], xmm7
| 1 | | 1.0 | | | | | | | vmulps xmm7, xmm6, xmm6
| 1 | 1.0 | | | | | | | | vmulps xmm6, xmm11, xmm11
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm11, xmmword ptr [rsp+0x60]
| 1 | | 1.0 | | | | | | | vmulps xmm3, xmm1, xmm11
| 1 | 1.0 | | | | | | | | vmulps xmm2, xmm14, xmm6
| 1 | | 1.0 | | | | | | | vaddps xmm5, xmm3, xmm2
| 1 | 1.0 | | | | | | | | vmulps xmm1, xmm15, xmm6
| 1 | | 1.0 | | | | | | | vaddps xmm2, xmm4, xmm1
| 1 | 1.0 | | | | | | | | vaddps xmm0, xmm5, xmm2
| 2^ | | 1.0 | | 1.0 1.0 | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0xf0]
| 1* | | | | | | | | | vxorps xmm6, xmm6, xmm6
| 1 | 1.0 | | | | | | | | vmaxps xmm1, xmm3, xmm6
| 2^ | | 1.0 | 1.0 1.0 | | | | | | vminps xmm4, xmm1, xmmword ptr [rip+0x8ccf]
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm9, xmm9
| 1 | | 1.0 | | | | | | | vmulps xmm2, xmm0, xmm12
| 1 | 1.0 | | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | | 1.0 | | | | | | | vmulps xmm8, xmm8, xmm8
| 1 | | | | 1.0 1.0 | | | | | vmovdqu xmm9, xmmword ptr [rsp+0x160]
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 1 | | 1.0 | | | | | | | vcvtps2dq xmm0, xmm3
| 1 | 1.0 | | | | | | | | vpslld xmm5, xmm0, 0x10
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm0, xmmword ptr [rsp+0x100]
| 1 | | 1.0 | | | | | | | vmulps xmm1, xmm0, xmm0
| 1 | 1.0 | | | | | | | | vmulps xmm2, xmm1, xmm10
| 2^ | | 1.0 | | 1.0 1.0 | | | | | vmulps xmm1, xmm11, xmmword ptr [rsp+0x110]
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm15, xmm7
| 1 | | 1.0 | | | | | | | vaddps xmm4, xmm2, xmm0
| 1 | 1.0 | | | | | | | | vmulps xmm3, xmm14, xmm7
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm7, xmmword ptr [rsp+0x1f0]
| 1 | | 1.0 | | | | | | | vaddps xmm0, xmm3, xmm1
| 1 | 1.0 | | | | | | | | vaddps xmm2, xmm4, xmm0
| 2^ | | 1.0 | | 1.0 1.0 | | | | | vmulps xmm3, xmm2, xmmword ptr [rsp+0x120]
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm0, xmmword ptr [rsp+0x130]
| 1 | 1.0 | | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | | 1.0 | | | | | | | vmulps xmm2, xmm0, xmm12
| 1 | 1.0 | | | | | | | | vmaxps xmm1, xmm3, xmm6
| 2^ | | 1.0 | | 1.0 1.0 | | | | | vminps xmm4, xmm1, xmmword ptr [rip+0x8c47]
| 1 | 1.0 | | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 1 | | | 1.0 1.0 | | | | | | vmovdqu xmm1, xmmword ptr [rip+0x8bc7]
| 1 | | 1.0 | | | | | | | vcvtps2dq xmm0, xmm3
| 1 | | 1.0 | | | | | | | vpslld xmm2, xmm0, 0x8
| 1 | | | | | | 1.0 | | | vpor xmm6, xmm5, xmm2
| 1 | 1.0 | | | | | | | | vpmullw xmm0, xmm1, xmm1
| 2^ | | | | 1.0 1.0 | | 1.0 | | | vpand xmm1, xmm0, xmmword ptr [rip+0x8bbe]
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm0, xmm1
| 1 | 1.0 | | | | | | | | vmulps xmm3, xmm0, xmm11
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm0, xmmword ptr [rsp+0x140]
| 1 | | 1.0 | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | 1.0 | | | | | | | | vmulps xmm4, xmm0, xmm10
| 1 | | | | 1.0 1.0 | | | | | vmovups xmm10, xmmword ptr [rsp+0x90]
| 1 | | 1.0 | | | | | | | vmulps xmm2, xmm14, xmm8
| 1 | 1.0 | | | | | | | | vaddps xmm5, xmm3, xmm2
| 1 | | 1.0 | | | | | | | vmulps xmm1, xmm15, xmm8
| 1 | 1.0 | | | | | | | | vaddps xmm2, xmm4, xmm1
| 1 | | 1.0 | | | | | | | vaddps xmm0, xmm5, xmm2
| 2^ | 1.0 | | 1.0 1.0 | | | | | | vmulps xmm3, xmm0, xmmword ptr [rsp+0x150]
| 1 | | | | 1.0 1.0 | | | | | vmovups xmm0, xmmword ptr [rsp+0xa0]
| 1 | | 1.0 | | | | | | | vmulps xmm0, xmm0, xmm0
| 1 | 1.0 | | | | | | | | vmulps xmm2, xmm0, xmm12
| 1 | | 1.0 | | | | | | | vpsrld xmm0, xmm9, 0x18
| 1* | | | | | | | | | vxorps xmm8, xmm8, xmm8
| 1 | 1.0 | | | | | | | | vmaxps xmm1, xmm3, xmm8
| 2^ | | 1.0 | 1.0 1.0 | | | | | | vminps xmm4, xmm1, xmmword ptr [rip+0x8bb1]
| 1 | 1.0 | | | | | | | | vaddps xmm1, xmm4, xmm2
| 1 | 1.0 3.0 | | | | | | | | vsqrtps xmm3, xmm1
| 2^ | | | | 1.0 1.0 | | 1.0 | | | vpand xmm1, xmm0, xmmword ptr [rip+0x8b41]
| 1 | | 1.0 | | | | | | | vcvtdq2ps xmm2, xmm1
| 1 | | 1.0 | | | | | | | vcvtps2dq xmm5, xmm3
| 1 | 1.0 | | | | | | | | vmulps xmm3, xmm2, xmm12
| 1 | | 1.0 | | | | | | | vaddps xmm0, xmm3, xmm13
| 1 | | | 1.0 1.0 | | | | | | vmovdqu xmm13, xmmword ptr [rip+0x8b27]
| 1 | 1.0 | | | | | | | | vcvtps2dq xmm1, xmm0
| 1 | | 1.0 | | | | | | | vpslld xmm2, xmm1, 0x18
| 1 | | | | 1.0 1.0 | | | | | vmovdqu xmm1, xmmword ptr [rsp+0x170]
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm5, xmm2
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm5, xmmword ptr [rsp+0x190]
| 1 | | | | | | 1.0 | | | vpor xmm0, xmm6, xmm3
| 1 | | | | 1.0 1.0 | | | | | vmovups xmm3, xmmword ptr [rsp+0x180]
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm6, xmmword ptr [rsp+0x80]
| 1 | | | | | | 1.0 | | | vpand xmm4, xmm0, xmm1
| 1 | | | | | | 1.0 | | | vpandn xmm1, xmm1, xmm9
| 1 | | | | 1.0 1.0 | | | | | vmovups xmm9, xmmword ptr [rsp+0x70]
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm4, xmm1
| 2^ | | | | | 1.0 | | | 1.0 | vmovdqu xmmword ptr [r10], xmm2
| 1 | | | 1.0 1.0 | | | | | | vmovups xmm2, xmmword ptr [rsp+0x30]
| 2^ | 1.0 | | | 1.0 1.0 | | | | | vaddps xmm2, xmm2, xmmword ptr [rip+0x8b1c]
| 1 | | | | | | | 1.0 | | add r10, 0x10
Total Num Of Uops: 210
Analysis Notes:
There was an unsupported instruction(s), it was not accounted in Analysis.
Backend allocation was stalled due to unavailable allocation resources.
The analysis shows that the 16-bit square performs worse than expected: the cycle count went up noticeably.
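One way to read the report above is to compare the busiest port against the reported block throughput. The sketch below copies the per-port cycle counts from the "Port Binding In Cycles Per Iteration" row; the array and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles per iteration bound to ports 0..7, copied from the IACA report. */
static const double kPortCycles[8] = {52.0, 52.0, 29.0, 29.0, 15.0, 20.0, 6.0, 1.0};

/* The block cannot run faster than its busiest port allows, so the
   maximum over all ports is a lower bound on cycles per iteration. */
static double port_pressure_bound(const double *cycles, size_t count)
{
    double worst = 0.0;
    for (size_t i = 0; i < count; ++i) {
        if (cycles[i] > worst) {
            worst = cycles[i];
        }
    }
    return worst;
}
```

Here the bound is 52 cycles (ports 0 and 1), while IACA reports a block throughput of 77.42 cycles. The gap suggests that something other than raw port pressure is costing time, which is consistent with the "Backend allocation was stalled" note in the report.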
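Throughput bottleneck: inter-iteration dependency? A good question for Fabian
The bottleneck is reported as "inter-iteration", meaning that something between one iteration and the next limits the speed. This is probably a question for someone with more expertise.
The uop total went down (247 -> 306 -> 210), but throughput got worse
The total number of micro-ops is well below where it started, down from 247 to 210, but because the mix of operations changed, the actual throughput did not improve and in some respects got worse. That is frustrating, and it feels like there is no breakthrough to be had. At this point it is worth stopping to rethink: either back the change out and adjust the strategy, or keep looking for a better way around the problem. The best move may be to set it aside for a moment and decide what to do next.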
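We could use the same technique when loading the destination, but it may not fit well
Changing the data representation (moving between the float version and the wide version) is an option, but it is not clear that it would be any faster. Some operations might benefit from being in the float pipeline, while others would be stuck in the integer pipeline, and that trade-off needs careful analysis. The open question is how to relieve the pressure in the region that is currently the bottleneck, in either the float version or the wide version, and there is no clear answer yet. Switching to the wide format will not necessarily bring a visible improvement, so the next step is more thought and more experiments.
CP stands for critical path
Operations on the critical path determine the overall bottleneck, so we do not want to see many "CP" marks. Those are the regions to avoid loading further if the loop is to run efficiently.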
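IACA shows Port 1 is the bottleneck, so reducing multiplies won't help
There was a misreading of which port is the bottleneck. At first port 2 was taken to be the bottleneck, but it is actually port 1, so working on the multiplies does not help. The other puzzle was why a 1.0 showed up in the port 0 column for mulps. In fact mulps is supposed to be able to go to port 0, and that was the source of the confusion.
An inter-iteration dependency means that iteration x of the loop depends on the result of the previous iteration
An inter-iteration dependency means that the current iteration of a loop depends on the result of the one before it. That does not seem to apply here, though, because the iterations of this loop are independent. In principle they could all run completely independently, so there should be no dependency between them.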
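To make the term concrete, here is a small sketch contrasting a loop whose iterations are independent with one that carries a dependency from iteration to iteration. Both functions are illustrative and unrelated to the renderer code.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i], so the CPU is
   free to overlap work from different iterations. */
static void square_all(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
}

/* Inter-iteration dependency: iteration i needs the sum produced by
   iteration i - 1, so the additions form one serial chain. */
static void running_sum(const float *in, float *out, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += in[i];
        out[i] = sum;
    }
}
```

The pixel loop analysed above is expected to look like the first case, which is why an "inter-iteration" bottleneck in the report is surprising.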
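Try hoisting TexturePitch/TextureMemory out of the loop (little effect)
Pulling the dereferences of the texture pointer out of the inner loop and into local variables is a good suggestion. The compiler may be re-reading the fields on every iteration because it has to assume aliasing is possible. Once values such as `TexturePitch` are copied into locals, the compiler knows they cannot change and no longer has to reload them each time, which removes that overhead.
After the change, the listing shows fewer reads. The compiler had apparently been treating the texture fields as values that might change, and so reloaded them every time; with the locals in place those redundant memory reads are gone. The throughput reported by IACA stayed the same, however, so this cleanup removed some memory operations without moving the bottleneck.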
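The hoisting pattern can be sketched as follows. The `texture` struct and `copy_first_column` are hypothetical stand-ins, not the project's actual types; only the names `TexturePitch` and `TextureMemory` come from the notes above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical texture description; field names are assumptions. */
typedef struct {
    int32_t  Pitch;   /* bytes per row */
    uint8_t *Memory;  /* first texel */
} texture;

/* Copies the first 32-bit texel of each row into Dest. */
static void copy_first_column(const texture *Texture, uint32_t *Dest, int Rows)
{
    /* Read once into locals. Without this, a store through Dest could in
       principle alias the struct fields, so the compiler may reload them
       on every iteration. */
    const int32_t  TexturePitch  = Texture->Pitch;
    const uint8_t *TextureMemory = Texture->Memory;

    for (int Y = 0; Y < Rows; ++Y) {
        uint32_t Texel;
        memcpy(&Texel, TextureMemory + (ptrdiff_t)Y * TexturePitch, sizeof Texel);
        Dest[Y] = Texel;
    }
}
```

Whether the compiler actually emits fewer loads depends on the compiler and flags, so the disassembly (or an IACA run) is the place to check.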
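How do you support AVX? How are the registers saved on a context switch?
To support newer instructions that are not yet universal (such as AVX), the usual approach is to check the CPU feature flags (CPUID) and then set function pointers according to the instruction sets available, so that the right implementation gets called. This lets the program pick an instruction set at run time based on the hardware it finds.
As for saving registers on a context switch: if the operating system does not save and restore the registers that a new instruction set (such as AVX) adds, there is a problem. Some operating systems handle this transparently, but on one that does not, using the new instructions could leave registers unsaved across a switch and cause failures. It may therefore be necessary to check the combination of operating system support and CPUID to confirm that these registers are saved and restored before relying on them.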
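A sketch of the check-then-dispatch pattern is shown below. To stay portable it takes the raw register values as parameters instead of executing CPUID and XGETBV itself; in real code `cpuid1_ecx` would come from CPUID leaf 1 and `xcr0` from XGETBV. The two sum routines and `select_sum` are hypothetical placeholders for a baseline build and an AVX build of the same function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AVX is usable only if the CPU reports AVX (CPUID.1:ECX bit 28), the OS
   has enabled XSAVE (bit 27), and XCR0 shows that the OS saves both the
   XMM state (bit 1) and the YMM state (bit 2) on context switches. */
static int avx_usable(uint32_t cpuid1_ecx, uint64_t xcr0)
{
    const uint32_t need = (1u << 27) | (1u << 28);
    if ((cpuid1_ecx & need) != need) {
        return 0;
    }
    return (xcr0 & 0x6u) == 0x6u;
}

typedef float (*sum_fn)(const float *, size_t);

/* Placeholder for the baseline implementation. */
static float sum_baseline(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

/* Placeholder standing in for an AVX build of the same routine. */
static float sum_wide(const float *v, size_t n)
{
    float s[2] = {0.0f, 0.0f};
    for (size_t i = 0; i < n; ++i) s[i & 1] += v[i];
    return s[0] + s[1];
}

/* Picks the implementation once, at startup. */
static sum_fn select_sum(uint32_t cpuid1_ecx, uint64_t xcr0)
{
    return avx_usable(cpuid1_ecx, xcr0) ? sum_wide : sum_baseline;
}
```

The caller stores the returned pointer and calls through it afterwards, so the feature check happens once instead of on every call.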
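Replace sqrt with mul/rsqrt?
The square root can be replaced with the faster reciprocal square root. The reciprocal square root instruction returns an approximation of $\frac{1}{\sqrt{x}}$ rather than the square root itself, and it usually has better throughput because it is cheaper to compute than a full square root.
Concretely, to compute $\sqrt{x}$, first compute $\frac{1}{\sqrt{x}}$ and then recover the square root algebraically: multiplying the reciprocal square root by $x$ gives $x \cdot \frac{1}{\sqrt{x}} = \sqrt{x}$. This avoids issuing the square root instruction at all.
The substitution adds a few instructions, but it can still lower the total cycle cost and improve throughput. Whether to keep using the wide square root instruction or try other optimizations is still being explored, and the real effect will vary with the hardware and the implementation.
Whether to adopt this approach therefore depends on measurement, and it may take more experiments and debugging to find the best solution.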
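The algebra above can be sketched in scalar C. The hardware estimate (for example `_mm_rsqrt_ps`) is replaced here by the well-known bit-trick approximation so that the sketch runs anywhere; that stand-in, the single Newton-Raphson refinement step, and the function names are all assumptions of this example, not what the stream used.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rough estimate of 1/sqrt(x) for positive, normal x. Stand-in for a
   hardware reciprocal square root estimate. */
static float rsqrt_estimate(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step: y' = y * (1.5 - 0.5 * x * y * y). */
static float rsqrt_refined(float x)
{
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}

/* sqrt(x) = x * (1 / sqrt(x)); the result is approximate. */
static float sqrt_via_rsqrt(float x)
{
    return x * rsqrt_refined(x);
}
```

The result is an approximation (within roughly a fraction of a percent after one refinement step), so whether it is accurate enough for the final 8-bit colour value has to be checked against the output.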
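The result is the same as before
Comments on Port 1 pressure
The port assignments show that the multiplies and the adds are both piling onto one particular port (port 1), which puts that port under heavy pressure. If the multiplies could be sent to a less busy port (such as port 0), the load would ease. Floating-point multiplies in particular would ideally go to port 0 so that they do not add to the pressure on port 1.
The difficulty is balancing the number of multiplies against the ports they occupy. If the ports could be used more effectively, more work could in theory be done per multiply, but it is not clear how to achieve that in practice. For now this needs more thought and experimentation to find a good balance.
The float path already uses port 1 for the max and min operations, so moving further toward the wide version may make things harder, because port 1 is already heavily loaded. Making the wide version pay off would probably require some creative way of redistributing the operations so that the busy port does not get busier. It is probably best to pause here and do more experiments before committing to a direction.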
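If sqrt runs on the multiply port rather than the add port, how does removing it help?
Removing the square root can help because, although it issues on the multiply port, its throughput is poor: about 16 cycles per instruction, by the figure quoted here. Even with pipelining, three square roots then take about 3 * 16 = 48 cycles, and that is a long stretch for the processor to fill with other useful work, especially at the end of the loop, where the only thing left to do may be fetching the next data. So even if the square root does not sit on the most contended port, its long execution time still drags performance down.
More specifically, the computation that follows the square root may stall, because the processor cannot execute everything at once. It has to respect the dependency order, and when one operation takes that long, overall throughput suffers. That may be one of the reasons performance is lower.
One optimization is therefore to replace the square root with something cheaper, such as the reciprocal square root: compute $\frac{1}{\sqrt{x}}$ and derive the final result from it with a little extra arithmetic, reducing the time spent.
The session was also about learning to use the analysis tool. It is a brand-new tool for us, but with some experimentation and analysis it can point toward effective optimizations, and the results will feed into further work on the code.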
Further reading
For Linux C++ performance and llvm-mca, see
<< The Art of Writing Efficient Programs >>
A machine translation of the book on GitHub
https://github.com/apachecn/apachecn-c-cpp-zh-pt2/tree/master/docs/art-write-effec-prog
For additional notes, see this blog
https://blog.csdn.net/TM1695648164/article/details/130033992