[Troubleshooting in Practice] Galaxy Kylin Advanced Server OS case study: handling a server crash in an information system

Published: 2024-05-22

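1. Problem environment

Hardware environment

Machine model: QEMU KVM Virtual Machine (virtual machine)
System type/architecture: arm (aarch64)
Firmware version: BIOS 0.0.0 02/06/2015

Software environment

Operating system version: Kylin Linux Advanced Server release V10 (Tercel)
Kernel version: 4.19.90-23.8.v2101.ky10.aarch64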

2. Problem description

At about 15:00 on September 20, 2022, the system crashed while running.

3. Problem analysis

3.1. Analyzing the vmcore-dmesg.txt log

[8559662.187756] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080
[8559662.188245] Mem abort info:
[8559662.188408]   ESR = 0x96000006
[8559662.188580]   Exception class = DABT (current EL), IL = 32 bits
[8559662.188909]   SET = 0, FnV = 0
[8559662.189082]   EA = 0, S1PTW = 0
[8559662.189261] Data abort info:
[8559662.189427]   ISV = 0, ISS = 0x00000006
[8559662.189730]   CM = 0, WnR = 0
[8559662.189955] user pgtable: 64k pages, 48-bit VAs, pgdp = 000000008e382035
[8559662.190297] [0000000000000080] pgd=00000005b0170003, pud=00000005b0170003, pmd=0000000000000000
[8559662.190748] Internal error: Oops: 96000006 [#1] SMP
[8559662.191004] Modules linked in: ppp_mppe arc4 ppp_async ppp_generic slhc fuse nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfkill binfmt_misc sunrpc vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ofpart ghash_ce cmdlinepart sha2_ce sha256_arm64 cfi_cmdset_0001 sha1_ce cfi_probe cfi_util gen_probe physmap_of chipreg mtd uio_pdrv_genirq uio sch_fq_codel ip_tables sr_mod cdrom virtio_scsi virtio_net virtio_console net_failover failover bochs_drm
[8559662.193156] Process pool-11-thread- (pid: 1747398, stack limit = 0x00000000fd966dc6)
[8559662.193556] CPU: 1 PID: 1747398 Comm: pool-11-thread- Kdump: loaded Not tainted 4.19.90-23.8.v2101.ky10.aarch64 #1
[8559662.194061] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[8559662.194409] pstate: 60400005 (nZCv daif +PAN -UAO)
[8559662.194680] pc : nfs_updatepage+0x4fc/0x8c0 [nfs]
[8559662.194929] lr :           (null)
[8559662.195119] sp : ffff80057dd2fb90
[8559662.195304] x29: ffff80057dd2fb90 x28: 0000000000000000
[8559662.195581] x27: 00000000000000d4 x26: ffff800065347eb0
[8559662.195862] x25: 0000000000000080 x24: ffff800065347d40
[8559662.196135] x23: ffff80056903d4c0 x22: 00000000000009de
[8559662.196408] x21: ffff80056903d480 x20: ffff7fe001301140
[8559662.196681] x19: 00000000000000d4 x18: 0000000000000000
[8559662.196972] x17: 0000000000000000 x16: 0000000000000000
[8559662.197245] x15: 0000000000000000 x14: 355b656d69742070
[8559662.197517] x13: 614d6c616544746e x12: ffffffffffffffb8
[8559662.197790] x11: ffffffffffffffb8 x10: 0000000000000040
[8559662.198068] x9 : 0000000000000000 x8 : ffff80056903d500
[8559662.198361] x7 : 0000000000000000 x6 : 000000000000003f
[8559662.198720] x5 : 0000000000000040 x4 : 0000000000000000
[8559662.199126] x3 : 0000000000000000 x2 : 0000000000000001
[8559662.199491] x1 : 0000000000000000 x0 : 0000000000000080
[8559662.199873] Call trace:
[8559662.200077]  nfs_updatepage+0x4fc/0x8c0 [nfs]
[8559662.200390]  nfs_write_end+0x70/0x320 [nfs]
[8559662.200726]  generic_perform_write+0xfc/0x188
[8559662.201043]  nfs_file_write+0xb8/0x230 [nfs]
[8559662.201355]  new_sync_write+0xcc/0x130
[8559662.201641]  __vfs_write+0x74/0x80
[8559662.201892]  vfs_write+0xac/0x1c0
[8559662.202143]  ksys_write+0x5c/0xc8
[8559662.202390]  __arm64_sys_write+0x24/0x30
[8559662.202679]  el0_svc_common+0x78/0x130
[8559662.202951]  el0_svc_handler+0x38/0x78
[8559662.203222]  el0_svc+0x8/0x1b0
[8559662.203451] Code: d2800001 aa1903e0 d2800022 2a0103fe (88fe7f22)
[8559662.203884] SMP: stopping secondary CPUs
[8559662.204851] Starting crashdump kernel...
[8559662.205149] Bye!
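
The ESR value in the log can be decoded by hand to cross-check what the kernel printed. The following is a minimal sketch, assuming the AArch64 ESR_EL1 layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in bit 6 and DFSC in bits 5:0). The function name decode_esr is hypothetical and is not part of any kernel tool.

```python
def decode_esr(esr):
    """Split an AArch64 ESR value into the fields the oops message prints."""
    iss = esr & 0x1FFFFFF              # instruction specific syndrome, bits 24:0
    return {
        "ec": (esr >> 26) & 0x3F,      # exception class, bits 31:26
        "il": (esr >> 25) & 0x1,       # 1 means a 32-bit instruction
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,       # write-not-read flag for data aborts
        "dfsc": iss & 0x3F,            # data fault status code
    }


fields = decode_esr(0x96000006)
print({name: hex(value) for name, value in fields.items()})
```

EC 0x25 is a data abort taken from the current exception level, which agrees with "DABT (current EL)" in the log, and DFSC 0x06 is a level 2 translation fault, which is consistent with pmd=0000000000000000 in the page table walk line.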

In nfs_updatepage+0x4fc/0x8c0 [nfs], 0x4fc is the offset of the faulting instruction from the start of the function, and 0x8c0 is the total size of the function.

[root@localhost vmcore]#  gcc-nm /usr/lib/debug/usr/lib/modules/4.19.90-23.8.v2101.ky10.aarch64/kernel/fs/nfs/nfs.ko.debug | grep nfs_updatepage
0000000000019c58 T nfs_updatepage

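gcc-nm reports that nfs_updatepage starts at 0000000000019c58 inside the module. Adding the offset 0x4fc gives 0x1A154 (0x19c58 + 0x4fc), which is the address passed to addr2line.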

[root@localhost vmcore]# addr2line -e  /usr/lib/debug/usr/lib/modules/4.19.90-23.8.v2101.ky10.aarch64/kernel/fs/nfs/nfs.ko.debug  00000000001A154
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./arch/arm64/include/asm/atomic_lse.h:479
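
The address arithmetic above can be checked with a short script. This is only an illustrative sketch: the helper name module_offset is hypothetical, and the symbol address is the one gcc-nm printed for this specific nfs.ko.debug.

```python
import re


def module_offset(symbol_addr, frame):
    """Return (address, offset, size) for a frame like 'func+0x4fc/0x8c0'."""
    match = re.search(r"\+0x([0-9a-f]+)/0x([0-9a-f]+)", frame)
    if match is None:
        raise ValueError("not a symbol+offset/size frame: %r" % frame)
    offset = int(match.group(1), 16)
    size = int(match.group(2), 16)
    if offset >= size:
        raise ValueError("offset lies outside the function")
    # The address inside the module is the symbol start plus the offset.
    return symbol_addr + offset, offset, size


addr, offset, size = module_offset(0x19C58, "nfs_updatepage+0x4fc/0x8c0 [nfs]")
print(hex(addr), offset, hex(size))
```

The decimal offset 1276 is the same value that crash later shows as nfs_updatepage+1276.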

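3.2. Analyzing the vmcore

The register dump that crash prints at the end of bt does not match the registers in the oops log. It shows the user-space context saved at syscall entry (PC and SP are user-space addresses), so the kernel register values at the time of the fault have to be taken from vmcore-dmesg.txt.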

This GDB was configured as "aarch64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/usr/lib/modules/4.19.90-23.8.v2101.ky10.aarch64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Tue Sep 20 15:00:09 CST 2022
      UPTIME: 99 days, 01:41:17
LOAD AVERAGE: 0.39, 0.23, 0.20
       TASKS: 1564
    NODENAME: localhost.localdomain
     RELEASE: 4.19.90-23.8.v2101.ky10.aarch64
     VERSION: #1 SMP Mon May 17 17:07:38 CST 2021
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 24 GB
       PANIC: "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080"
         PID: 1747398
     COMMAND: "pool-11-thread-"
        TASK: ffff800527400000  [THREAD_INFO: ffff800527400000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 1747398  TASK: ffff800527400000  CPU: 1   COMMAND: "pool-11-thread-"
 #0 [ffff80057dd2f5e0] machine_kexec at ffff0000080a2e24
 #1 [ffff80057dd2f640] __crash_kexec at ffff0000081aecb8
 #2 [ffff80057dd2f7b0] crash_kexec at ffff0000081aedc0
 #3 [ffff80057dd2f7e0] die at ffff00000808f754
 #4 [ffff80057dd2f820] die_kernel_fault at ffff0000080aa8cc
 #5 [ffff80057dd2f850] __do_kernel_fault at ffff0000080aa59c
 #6 [ffff80057dd2f880] do_page_fault at ffff000008bf7294
 #7 [ffff80057dd2f970] do_translation_fault at ffff000008bf778c
 #8 [ffff80057dd2f9a0] do_mem_abort at ffff000008081284
 #9 [ffff80057dd2fb80] el1_ia at ffff00000808310c
#10 [ffff80057dd2fb90] nfs_updatepage at ffff0000024aa150 [nfs]
#11 [ffff80057dd2fbf0] nfs_write_end at ffff000002497abc [nfs]
#12 [ffff80057dd2fc40] generic_perform_write at ffff000008270bf0
#13 [ffff80057dd2fcc0] nfs_file_write at ffff0000024984a4 [nfs]
#14 [ffff80057dd2fd00] new_sync_write at ffff0000083268e0
#15 [ffff80057dd2fd90] __vfs_write at ffff000008329268
#16 [ffff80057dd2fdc0] vfs_write at ffff000008329478
#17 [ffff80057dd2fe00] ksys_write at ffff0000083297a8
#18 [ffff80057dd2fe40] __arm64_sys_write at ffff000008329838
#19 [ffff80057dd2fe60] el0_svc_common at ffff00000809834c
#20 [ffff80057dd2fea0] el0_svc_handler at ffff00000809843c
#21 [ffff80057dd2fff0] el0_svc at ffff000008084084
     PC: 0000fffe2b802570   LR: 0000fffe2b802558   SP: 0000fffdd900bde0
    X29: 0000fffdd900bde0  X28: 0000fffe1c0d7640  X27: 0000fffdd900bea8
    X26: 0000fffdd900bea8  X25: 0000000000000042  X24: 0000fffdd900deb0
    X23: 0000fffe2a2d0fe0  X22: 0000000000000091  X21: 0000fffdd900bea8
    X20: 00000000000000d4  X19: 0000000000000091  X18: 000000008262ccb8
    X17: 0000fffe2b802500  X16: 0000fffe2a2cfba0  X15: 00000000820d8fc0
    X14: 0000000000000008  X13: 355b656d69742070  X12: 614d6c616544746e
    X11: 0a205d305b657a69  X10: 73202c5d355b656d   X9: 69742070614d6c61
     X8: 0000000000000040   X7: 0000000000000091   X6: 0000000000000000
     X5: 0000fffdd900f0e0   X4: 00000000ffffffbb   X3: 0000000000000000
     X2: 00000000000000d4   X1: 0000fffdd900bea8   X0: 0000000000000091
    ORIG_X0: 0000000000000091  SYSCALLNO: 40  PSTATE: 80001000

0xffff0000024aa138 <nfs_updatepage+1248>:       add     x23, x21, #0x40
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./include/asm-generic/bitops/atomic.h: 38
0xffff0000024aa13c <nfs_updatepage+1252>:       tbz     w0, #0, 0xffff0000024aa268 <nfs_updatepage+1552>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./include/linux/spinlock.h: 180
0xffff0000024aa140 <nfs_updatepage+1256>:       add     x25, x3, #0x80
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./arch/arm64/include/asm/atomic_lse.h: 479
0xffff0000024aa144 <nfs_updatepage+1260>:       mov     x1, #0x0                        // #0
0xffff0000024aa148 <nfs_updatepage+1264>:       mov     x0, x25
0xffff0000024aa14c <nfs_updatepage+1268>:       mov     x2, #0x1                        // #1
0xffff0000024aa150 <nfs_updatepage+1272>:       mov     w30, w1
0xffff0000024aa154 <nfs_updatepage+1276>:       .inst   0x88fe7f22 ; undefined
0xffff0000024aa158 <nfs_updatepage+1280>:       mov     w0, w30

crash> sym 0xffff0000024aa154
ffff0000024aa154 (T) nfs_updatepage+1276 [nfs] /usr/src/debug/kernel-4.19.90/linux-4.19.90-23.8.v2101.ky10.aarch64/./arch/arm64/include/asm/atomic_lse.h: 479

vim /usr/src/kernels/4.19.90-23.8.v2101.ky10.aarch64/arch/arm64/include/asm/atomic_lse.h +479
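
The disassembler in this session prints the faulting word as ".inst 0x88fe7f22 ; undefined", probably because it does not decode the ARMv8.1 LSE atomic instructions. The sketch below decodes the word by hand, assuming the A64 compare-and-swap (CAS) encoding: size in bits 31:30, L (acquire) in bit 22, Rs in bits 20:16, o0 (release) in bit 15, Rn in bits 9:5 and Rt in bits 4:0. The function decode_cas is hypothetical and handles only the 32-bit and 64-bit CAS forms.

```python
def decode_cas(word):
    """Decode a 32/64-bit A64 CAS, CASA, CASL or CASAL instruction word."""
    # Fixed bits of this family: bit 31 set, bits 29:24 = 0b001000,
    # bit 23 and bit 21 set, bits 14:10 (Rt2) all ones.
    if (word & 0xBFA07C00) != 0x88A07C00:
        raise ValueError("not a 32/64-bit CAS-family instruction")
    prefix = "x" if (word >> 30) & 1 else "w"   # operand width
    acquire = (word >> 22) & 1
    release = (word >> 15) & 1
    rs = (word >> 16) & 0x1F                    # compare value / result
    rn = (word >> 5) & 0x1F                     # base address register
    rt = word & 0x1F                            # new value

    def reg(number):
        return prefix + ("zr" if number == 31 else str(number))

    mnemonic = "cas" + ("a" if acquire else "") + ("l" if release else "")
    base = "sp" if rn == 31 else "x%d" % rn
    return "%s %s, %s, [%s]" % (mnemonic, reg(rs), reg(rt), base)


print(decode_cas(0x88FE7F22))
```

Under these assumptions the word decodes to casa w30, w2, [x25]: a compare-and-swap with acquire semantics on the address held in x25, expecting 0 (w30 was loaded from w1) and writing 1 (w2). That fits the surrounding mov instructions and the atomic_lse.h:479 source line, and x25 was computed as x3 + 0x80.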

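3.3. Searching for related reports

Several reports found online show similar call stacks. It is not yet known whether the kernel currently installed already contains fixes for the issues below.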

RHEL7: Kernel crash at nfs_readpage_async+0x43 or nfs_updatepage+0x1b9 - Red Hat Customer Portal

201705 – Oops when copying large file from xfs to nfs, happens every time

nfs_page_async_flush returning 0 for fatal errors on writeback - Calum Mackay

4. Summary

The faulting address 0x80 is produced by adding 0x80 to register x3 (add x25, x3, #0x80 in the disassembly above).

[8559662.199126] x3 : 0000000000000000 x2 : 0000000000000001

[8559662.199491] x1 : 0000000000000000 x0 : 0000000000000080

The register values printed at the time of the fault confirm that x3 was 0.
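
As a quick cross-check, the two register lines can be parsed and compared against the faulting address. This is a minimal sketch; parse_registers is a hypothetical helper, and the 0x80 offset is taken from the add x25, x3, #0x80 instruction in the disassembly.

```python
import re

LOG = """\
[8559662.199126] x3 : 0000000000000000 x2 : 0000000000000001
[8559662.199491] x1 : 0000000000000000 x0 : 0000000000000080
"""


def parse_registers(text):
    """Map register names such as 'x3' to integer values from oops lines."""
    pairs = re.findall(r"\b(x\d+)\s*:\s*([0-9a-f]{16})", text)
    return {name: int(value, 16) for name, value in pairs}


regs = parse_registers(LOG)
fault_addr = 0x80                       # from the "virtual address" line
computed = regs["x3"] + 0x80            # add x25, x3, #0x80
print(hex(computed), computed == fault_addr)
```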

This may be a use-after-free.

According to the R&D team's analysis, the crash is consistent with this bug:

nfs_page_async_flush returning 0 for fatal errors on writeback - Calum Mackay

5. Resolution

Upgrade to the latest kernel currently available.

