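1. Problem environment
Hardware environment |
Model | QEMU KVM Virtual Machine (virtual machine) |
System type/architecture | arm |
Firmware version | BIOS 0.0.0 02/06/2015 |
Software environment |
Operating system version | Kylin Linux Advanced Server release V10 (Tercel) |
Kernel version | 4.19.90-23.8.v2101.ky10.aarch64 |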
2. Problem description
At around 15:00 on September 20, 2022, the system crashed while running.
3. Problem analysis
3.1. Analyzing the vmcore-dmesg.txt log
[8559662.187756] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080
[8559662.188245] Mem abort info:
[8559662.188408] ESR = 0x96000006
[8559662.188580] Exception class = DABT (current EL), IL = 32 bits
[8559662.188909] SET = 0, FnV = 0
[8559662.189082] EA = 0, S1PTW = 0
[8559662.189261] Data abort info:
[8559662.189427] ISV = 0, ISS = 0x00000006
[8559662.189730] CM = 0, WnR = 0
[8559662.189955] user pgtable: 64k pages, 48-bit VAs, pgdp = 000000008e382035
[8559662.190297] [0000000000000080] pgd=00000005b0170003, pud=00000005b0170003, pmd=0000000000000000
[8559662.190748] Internal error: Oops: 96000006 [#1] SMP
[8559662.191004] Modules linked in: ppp_mppe arc4 ppp_async ppp_generic slhc fuse nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfkill binfmt_misc sunrpc vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ofpart ghash_ce cmdlinepart sha2_ce sha256_arm64 cfi_cmdset_0001 sha1_ce cfi_probe cfi_util gen_probe physmap_of chipreg mtd uio_pdrv_genirq uio sch_fq_codel ip_tables sr_mod cdrom virtio_scsi virtio_net virtio_console net_failover failover bochs_drm
[8559662.193156] Process pool-11-thread- (pid: 1747398, stack limit = 0x00000000fd966dc6)
[8559662.193556] CPU: 1 PID: 1747398 Comm: pool-11-thread- Kdump: loaded Not tainted 4.19.90-23.8.v2101.ky10.aarch64 #1
[8559662.194061] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[8559662.194409] pstate: 60400005 (nZCv daif +PAN -UAO)
[8559662.194680] pc : nfs_updatepage+0x4fc/0x8c0 [nfs]
[8559662.194929] lr : (null)
[8559662.195119] sp : ffff80057dd2fb90
[8559662.195304] x29: ffff80057dd2fb90 x28: 0000000000000000
[8559662.195581] x27: 00000000000000d4 x26: ffff800065347eb0
[8559662.195862] x25: 0000000000000080 x24: ffff800065347d40
[8559662.196135] x23: ffff80056903d4c0 x22: 00000000000009de
[8559662.196408] x21: ffff80056903d480 x20: ffff7fe001301140
[8559662.196681] x19: 00000000000000d4 x18: 0000000000000000
[8559662.196972] x17: 0000000000000000 x16: 0000000000000000
[8559662.197245] x15: 0000000000000000 x14: 355b656d69742070
[8559662.197517] x13: 614d6c616544746e x12: ffffffffffffffb8
[8559662.197790] x11: ffffffffffffffb8 x10: 0000000000000040
[8559662.198068] x9 : 0000000000000000 x8 : ffff80056903d500
[8559662.198361] x7 : 0000000000000000 x6 : 000000000000003f
[8559662.198720] x5 : 0000000000000040 x4 : 0000000000000000
[8559662.199126] x3 : 0000000000000000 x2 : 0000000000000001
[8559662.199491] x1 : 0000000000000000 x0 : 0000000000000080
[8559662.199873] Call trace:
[8559662.200077] nfs_updatepage+0x4fc/0x8c0 [nfs]
[8559662.200390] nfs_write_end+0x70/0x320 [nfs]
[8559662.200726] generic_perform_write+0xfc/0x188
[8559662.201043] nfs_file_write+0xb8/0x230 [nfs]
[8559662.201355] new_sync_write+0xcc/0x130
[8559662.201641] __vfs_write+0x74/0x80
[8559662.201892] vfs_write+0xac/0x1c0
[8559662.202143] ksys_write+0x5c/0xc8
[8559662.202390] __arm64_sys_write+0x24/0x30
[8559662.202679] el0_svc_common+0x78/0x130
[8559662.202951] el0_svc_handler+0x38/0x78
[8559662.203222] el0_svc+0x8/0x1b0
[8559662.203451] Code: d2800001 aa1903e0 d2800022 2a0103fe (88fe7f22)
[8559662.203884] SMP: stopping secondary CPUs
[8559662.204851] Starting crashdump kernel...
[8559662.205149] Bye!
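The ESR value in the log can be decoded by hand to confirm what the kernel printed. The sketch below assumes the standard AArch64 ESR_ELx field layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], and for data aborts WnR in ISS bit 6 and DFSC in ISS bits [5:0]). The lookup tables list only the codes relevant to this report, and the function name is illustrative.

```python
# Decode an AArch64 ESR_ELx value as printed in the Oops ("ESR = 0x96000006").
# Assumed field layout: EC = bits [31:26], IL = bit 25, ISS = bits [24:0];
# for data aborts, WnR = ISS bit 6 and DFSC = ISS bits [5:0].

# Partial tables: only the codes needed for this report.
EXCEPTION_CLASSES = {
    0x24: "DABT (lower EL)",
    0x25: "DABT (current EL)",
}
FAULT_STATUS = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr: int) -> dict:
    """Split an ESR value into the fields used for data aborts."""
    iss = esr & 0x1FFFFFF
    return {
        "ec": (esr >> 26) & 0x3F,
        "il": (esr >> 25) & 0x1,
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,
        "dfsc": iss & 0x3F,
    }


fields = decode_esr(0x96000006)
print(EXCEPTION_CLASSES.get(fields["ec"], "other"))
print(FAULT_STATUS.get(fields["dfsc"], "other"))
print("write" if fields["wnr"] else "read")
```

Under these assumptions the value decodes to a data abort taken at the current exception level with a level 2 translation fault, which agrees with the "DABT (current EL)" line and with the pmd=0000000000000000 entry in the page table walk above.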
In nfs_updatepage+0x4fc/0x8c0 [nfs], 0x4fc is the offset of the faulting instruction within the function and 0x8c0 is the size of the function.
[root@localhost vmcore]# gcc-nm /usr/lib/debug/usr/lib/modules/4.19.90-23.8.v2101.ky10.aarch64/kernel/fs/nfs/nfs.ko.debug | grep nfs_updatepage
0000000000019c58 T nfs_updatepage
gcc-nm reports the symbol address 0000000000019c58. Adding the offset 0x4fc gives 0x1A154, the address of the faulting instruction inside nfs.ko.debug.
[root@localhost vmcore]# addr2line -e /usr/lib/debug/usr/lib/modules/4.19.90-23.8.v2101.ky10.aarch64/kernel/fs/nfs/nfs.ko.debug 00000000001A154
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./arch/arm64/include/asm/atomic_lse.h:479
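The address arithmetic used above (symbol address from nm plus the offset from the Oops) can be sketched as follows. The helper names are illustrative and not part of any tool; the input strings are copied from the log and from the gcc-nm output.

```python
import re


def parse_symbol_offset(location: str):
    """Split a 'func+0xOFF/0xSIZE' location into (name, offset, size)."""
    m = re.match(r"(\w+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)", location)
    if m is None:
        raise ValueError(f"unrecognised location: {location!r}")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def parse_nm_line(line: str):
    """Split an nm output line 'ADDRESS TYPE NAME' into (address, name)."""
    addr, _type, name = line.split()
    return int(addr, 16), name


name, offset, size = parse_symbol_offset("nfs_updatepage+0x4fc/0x8c0 [nfs]")
base, nm_name = parse_nm_line("0000000000019c58 T nfs_updatepage")

# Sanity checks: same symbol, and the offset lies inside the function.
assert nm_name == name and offset < size

target = base + offset
print(hex(target))
```

The resulting value is the address to pass to addr2line together with the debug file via -e, as in the command above.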
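3.2. Analyzing the vmcore
The register values shown in the vmcore backtrace no longer match those in the Oops log. The register block that bt prints after el0_svc appears to be the user-space exception frame saved at system call entry (note the user-space PC), so the kernel register values at the time of the fault have to be taken from vmcore-dmesg.txt.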
This GDB was configured as "aarch64-unknown-linux-gnu"...
KERNEL: /usr/lib/debug/usr/lib/modules/4.19.90-23.8.v2101.ky10.aarch64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Tue Sep 20 15:00:09 CST 2022
UPTIME: 99 days, 01:41:17
LOAD AVERAGE: 0.39, 0.23, 0.20
TASKS: 1564
NODENAME: localhost.localdomain
RELEASE: 4.19.90-23.8.v2101.ky10.aarch64
VERSION: #1 SMP Mon May 17 17:07:38 CST 2021
MACHINE: aarch64 (unknown Mhz)
MEMORY: 24 GB
PANIC: "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080"
PID: 1747398
COMMAND: "pool-11-thread-"
TASK: ffff800527400000 [THREAD_INFO: ffff800527400000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 1747398 TASK: ffff800527400000 CPU: 1 COMMAND: "pool-11-thread-"
#0 [ffff80057dd2f5e0] machine_kexec at ffff0000080a2e24
#1 [ffff80057dd2f640] __crash_kexec at ffff0000081aecb8
#2 [ffff80057dd2f7b0] crash_kexec at ffff0000081aedc0
#3 [ffff80057dd2f7e0] die at ffff00000808f754
#4 [ffff80057dd2f820] die_kernel_fault at ffff0000080aa8cc
#5 [ffff80057dd2f850] __do_kernel_fault at ffff0000080aa59c
#6 [ffff80057dd2f880] do_page_fault at ffff000008bf7294
#7 [ffff80057dd2f970] do_translation_fault at ffff000008bf778c
#8 [ffff80057dd2f9a0] do_mem_abort at ffff000008081284
#9 [ffff80057dd2fb80] el1_ia at ffff00000808310c
#10 [ffff80057dd2fb90] nfs_updatepage at ffff0000024aa150 [nfs]
#11 [ffff80057dd2fbf0] nfs_write_end at ffff000002497abc [nfs]
#12 [ffff80057dd2fc40] generic_perform_write at ffff000008270bf0
#13 [ffff80057dd2fcc0] nfs_file_write at ffff0000024984a4 [nfs]
#14 [ffff80057dd2fd00] new_sync_write at ffff0000083268e0
#15 [ffff80057dd2fd90] __vfs_write at ffff000008329268
#16 [ffff80057dd2fdc0] vfs_write at ffff000008329478
#17 [ffff80057dd2fe00] ksys_write at ffff0000083297a8
#18 [ffff80057dd2fe40] __arm64_sys_write at ffff000008329838
#19 [ffff80057dd2fe60] el0_svc_common at ffff00000809834c
#20 [ffff80057dd2fea0] el0_svc_handler at ffff00000809843c
#21 [ffff80057dd2fff0] el0_svc at ffff000008084084
PC: 0000fffe2b802570 LR: 0000fffe2b802558 SP: 0000fffdd900bde0
X29: 0000fffdd900bde0 X28: 0000fffe1c0d7640 X27: 0000fffdd900bea8
X26: 0000fffdd900bea8 X25: 0000000000000042 X24: 0000fffdd900deb0
X23: 0000fffe2a2d0fe0 X22: 0000000000000091 X21: 0000fffdd900bea8
X20: 00000000000000d4 X19: 0000000000000091 X18: 000000008262ccb8
X17: 0000fffe2b802500 X16: 0000fffe2a2cfba0 X15: 00000000820d8fc0
X14: 0000000000000008 X13: 355b656d69742070 X12: 614d6c616544746e
X11: 0a205d305b657a69 X10: 73202c5d355b656d X9: 69742070614d6c61
X8: 0000000000000040 X7: 0000000000000091 X6: 0000000000000000
X5: 0000fffdd900f0e0 X4: 00000000ffffffbb X3: 0000000000000000
X2: 00000000000000d4 X1: 0000fffdd900bea8 X0: 0000000000000091
ORIG_X0: 0000000000000091 SYSCALLNO: 40 PSTATE: 80001000
0xffff0000024aa138 <nfs_updatepage+1248>: add x23, x21, #0x40
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./include/asm-generic/bitops/atomic.h: 38
0xffff0000024aa13c <nfs_updatepage+1252>: tbz w0, #0, 0xffff0000024aa268 <nfs_updatepage+1552>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./include/linux/spinlock.h: 180
0xffff0000024aa140 <nfs_updatepage+1256>: add x25, x3, #0x80
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./arch/arm64/include/asm/atomic_lse.h: 479
0xffff0000024aa144 <nfs_updatepage+1260>: mov x1, #0x0 // #0
0xffff0000024aa148 <nfs_updatepage+1264>: mov x0, x25
0xffff0000024aa14c <nfs_updatepage+1268>: mov x2, #0x1 // #1
0xffff0000024aa150 <nfs_updatepage+1272>: mov w30, w1
0xffff0000024aa154 <nfs_updatepage+1276>: .inst 0x88fe7f22 ; undefined
0xffff0000024aa158 <nfs_updatepage+1280>: mov w0, w30
crash> sym 0xffff0000024aa154
ffff0000024aa154 (T) nfs_updatepage+1276 [nfs] /usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./arch/arm64/include/asm/atomic_lse.h: 479
vim /usr/src/kernels/4.19.90-23.8.v2101.ky10.aarch64/arch/arm64/include/asm/atomic_lse.h +479
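The disassembler prints the faulting word as ".inst 0x88fe7f22 ; undefined", most likely because the embedded gdb does not know the ARMv8.1 LSE atomic instructions. The sketch below decodes the word by hand, assuming the A64 compare-and-swap encoding (size in bits [31:30], fixed bits 0010001 in [29:23], L in bit 22, Rs in [20:16], o0 in bit 15, Rn in [9:5], Rt in [4:0]). The function name is illustrative, byte and halfword variants are left out, and register number 31 is not given special treatment.

```python
def decode_cas(word: int):
    """Decode a 32-bit word as an A64 CAS-family instruction, else return None.

    Assumed encoding: size[31:30] 0010001[29:23] L[22] 1[21] Rs[20:16]
    o0[15] 11111[14:10] Rn[9:5] Rt[4:0].
    """
    # Check the fixed bits: [29:23] = 0010001, [21] = 1, [14:10] = 11111.
    if (word & 0x3FA07C00) != 0x08A07C00:
        return None
    size = (word >> 30) & 0x3
    if size < 2:
        return None  # byte/halfword forms (CASB/CASH) are not handled here
    acquire = (word >> 22) & 0x1   # L bit: acquire semantics
    release = (word >> 15) & 0x1   # o0 bit: release semantics
    rs = (word >> 16) & 0x1F       # register holding the compare value
    rn = (word >> 5) & 0x1F        # base register holding the memory address
    rt = word & 0x1F               # register holding the new value
    mnemonic = "cas" + ("a" if acquire else "") + ("l" if release else "")
    prefix = "x" if size == 3 else "w"
    return f"{mnemonic} {prefix}{rs}, {prefix}{rt}, [x{rn}]"


decoded = decode_cas(0x88FE7F22)
print(decoded)

# The decimal offset printed by crash equals the hex offset in the Oops.
print(hex(1276))
```

Under these assumptions the word is a 32-bit compare-and-swap with acquire semantics whose base register is x25, which fits a lock acquisition inlined from spinlock.h and atomic_lse.h. Since x25 was computed as x3 + 0x80 two instructions earlier, the access goes to address 0x80 when x3 is 0.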

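3.3. Reference search
Similar stack traces can be found in online references. It is not known whether the current kernel already contains fixes for the issues below.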
RHEL7: Kernel crash at nfs_readpage_async+0x43 or nfs_updatepage+0x1b9 - Red Hat Customer Portal
201705 – Oops when copying large file from xfs to nfs, happens every time
nfs_page_async_flush returning 0 for fatal errors on writeback - Calum Mackay
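4. Problem summary
The faulting address 0x80 is obtained by adding 0x80 to the value of register x3 (add x25, x3, #0x80 in the disassembly).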
[8559662.199126] x3 : 0000000000000000 x2 : 0000000000000001
[8559662.199491] x1 : 0000000000000000 x0 : 0000000000000080
The register values printed when the fault occurred also show that x3 was 0.
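This register argument can be checked mechanically against the Oops text. The sketch below is a minimal illustration: the four input lines are copied from vmcore-dmesg.txt with the timestamps removed, and the helper name is hypothetical.

```python
import re

# Lines copied from the Oops in vmcore-dmesg.txt (timestamps removed).
OOPS_LINES = [
    "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080",
    "x25: 0000000000000080 x24: ffff800065347d40",
    "x3 : 0000000000000000 x2 : 0000000000000001",
    "x1 : 0000000000000000 x0 : 0000000000000080",
]


def parse_registers(lines):
    """Collect 'xN: value' pairs from Oops register lines into a dict."""
    regs = {}
    for line in lines:
        for reg, value in re.findall(r"\b(x\d+)\s*:\s*([0-9a-f]{16})\b", line):
            regs[reg] = int(value, 16)
    return regs


regs = parse_registers(OOPS_LINES)
fault_addr = int(
    re.search(r"virtual address ([0-9a-f]+)", OOPS_LINES[0]).group(1), 16
)

# add x25, x3, #0x80 followed by mov x0, x25: both must equal the fault address.
lock_addr = regs["x3"] + 0x80
print(hex(lock_addr), lock_addr == fault_addr == regs["x25"] == regs["x0"])
```

The check confirms that the pointer held in x3 was NULL and that the faulting access targeted the field at offset 0x80 from it.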
This may be a use-after-free.
According to the R&D team's analysis, the crash matches the following bug:
nfs_page_async_flush returning 0 for fatal errors on writeback - Calum Mackay
5. Resolution
Upgrade to the latest available kernel.