"BUG: Bad page state in process pve-ha-lrm" causing server to become unresponsive

liddack

Hello,

I run a 3-node cluster with hyper-converged Ceph storage. Several hours ago I got an alert that one of the servers (razor01) was down. Strangely, the Ceph cluster also became slow and at times unresponsive, reporting that some PGs were stuck in the peering state, which caused some of my VMs to become unresponsive as well. Since I was away from the servers at the time of the incident, I worked around this by marking all razor01 OSDs out, and the Ceph cluster eventually became responsive again.
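
For reference, that workaround amounts to running `ceph osd out` once per OSD on the failed host. A minimal sketch in Python that only builds the command lines, without executing anything; the helper name `build_out_commands` and the OSD IDs are hypothetical, and the real IDs would have to be read from `ceph osd tree`:

```python
from typing import Iterable, List


def build_out_commands(osd_ids: Iterable[int]) -> List[List[str]]:
    """Return one `ceph osd out <id>` argv list per distinct OSD ID.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    return [["ceph", "osd", "out", str(osd_id)] for osd_id in sorted(set(osd_ids))]


if __name__ == "__main__":
    # Hypothetical OSD IDs, for illustration only.
    for argv in build_out_commands([0, 1, 2]):
        print(" ".join(argv))
```

Each argv list could then be passed to `subprocess.run(argv, check=True)` on a node that has the Ceph admin keyring.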

A few hours ago I arrived at work and restarted the faulty node. Everything went back to normal.

Looking through the razor01 /var/log/syslog file, this is the first error trace I found:

Code:
Jul 25 23:58:03 razor01 kernel: [267560.978432] BUG: Bad page map in process pvescheduler  pte:8000000825e75805 pmd:10908df067
Jul 25 23:58:03 razor01 kernel: [267560.978447] page:00000000b6c7db5a refcount:3 mapcount:-254 mapping:0000000000000000 index:0x55e3cf046 pfn:0x825e75
Jul 25 23:58:03 razor01 kernel: [267560.978461] memcg:ffff98922c222000
Jul 25 23:58:03 razor01 kernel: [267560.978463] anon flags: 0x17ffffc008001c(uptodate|dirty|lru|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
Jul 25 23:58:03 razor01 kernel: [267560.978471] raw: 0017ffffc008001c ffffda752096d6c8 ffffda75209a4488 ffff98936f0451a1
Jul 25 23:58:03 razor01 kernel: [267560.978473] raw: 000000055e3cf046 0000000000000000 00000003ffffff01 ffff98922c222000
Jul 25 23:58:03 razor01 kernel: [267560.978474] page dumped because: bad pte
Jul 25 23:58:03 razor01 kernel: [267560.978476] addr:000055e3cf046000 vm_flags:08100073 anon_vma:ffff98a03ce90068 mapping:0000000000000000 index:55e3cf046
Jul 25 23:58:03 razor01 kernel: [267560.978481] file:(null) fault:0x0 mmap:0x0 readpage:0x0
Jul 25 23:58:03 razor01 kernel: [267560.978489] CPU: 44 PID: 3039315 Comm: pvescheduler Tainted: P    B      O      5.15.108-1-pve #1
Jul 25 23:58:03 razor01 kernel: [267560.978492] Hardware name: Default string Default string/X99-D8-MAX, BIOS 5.11 06/13/2022
Jul 25 23:58:03 razor01 kernel: [267560.978493] Call Trace:
Jul 25 23:58:03 razor01 kernel: [267560.978496]  <TASK>
Jul 25 23:58:03 razor01 kernel: [267560.978508]  dump_stack_lvl+0x4a/0x63
Jul 25 23:58:03 razor01 kernel: [267560.978514]  dump_stack+0x10/0x16
Jul 25 23:58:03 razor01 kernel: [267560.978516]  print_bad_pte.cold+0x87/0xdf
Jul 25 23:58:03 razor01 kernel: [267560.978522]  ? __mod_lruvec_page_state+0x6b/0xb0
Jul 25 23:58:03 razor01 kernel: [267560.978527]  unmap_page_range+0x8c5/0xfa0
Jul 25 23:58:03 razor01 kernel: [267560.978531]  unmap_single_vma+0x7f/0xf0
Jul 25 23:58:03 razor01 kernel: [267560.978533]  unmap_vmas+0x77/0xf0
Jul 25 23:58:03 razor01 kernel: [267560.978535]  exit_mmap+0xa2/0x200
Jul 25 23:58:03 razor01 kernel: [267560.978539]  mmput+0x63/0x150
Jul 25 23:58:03 razor01 kernel: [267560.978548]  do_exit+0x2fc/0xa20
Jul 25 23:58:03 razor01 kernel: [267560.978552]  do_group_exit+0x3b/0xb0
Jul 25 23:58:03 razor01 kernel: [267560.978555]  __x64_sys_exit_group+0x18/0x20
Jul 25 23:58:03 razor01 kernel: [267560.978557]  do_syscall_64+0x5c/0xc0
Jul 25 23:58:03 razor01 kernel: [267560.978560]  ? handle_mm_fault+0xd8/0x2c0
Jul 25 23:58:03 razor01 kernel: [267560.978564]  ? exit_to_user_mode_prepare+0x37/0x1b0
Jul 25 23:58:03 razor01 kernel: [267560.978581]  ? irqentry_exit_to_user_mode+0x9/0x20
Jul 25 23:58:03 razor01 kernel: [267560.978591]  ? irqentry_exit+0x1d/0x30
Jul 25 23:58:03 razor01 kernel: [267560.978593]  ? exc_page_fault+0x89/0x170
Jul 25 23:58:03 razor01 kernel: [267560.978594]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 25 23:58:03 razor01 kernel: [267560.978598] RIP: 0033:0x7fa8579ddbd9
Jul 25 23:58:03 razor01 kernel: [267560.978600] Code: Unable to access opcode bytes at RIP 0x7fa8579ddbaf.
Jul 25 23:58:03 razor01 kernel: [267560.978601] RSP: 002b:00007ffdbb70f7e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
Jul 25 23:58:03 razor01 kernel: [267560.978626] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa8579ddbd9
Jul 25 23:58:03 razor01 kernel: [267560.978628] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
Jul 25 23:58:03 razor01 kernel: [267560.978629] RBP: 000055e3cd5642a0 R08: ffffffffffffff80 R09: 000055e3cc6e15e0
Jul 25 23:58:03 razor01 kernel: [267560.978630] R10: fffffffffffffd8c R11: 0000000000000206 R12: 0000000000000000
Jul 25 23:58:03 razor01 kernel: [267560.978631] R13: 0000000000000000 R14: 000055e3cd569ca8 R15: 000055e3cd689f00
Jul 25 23:58:03 razor01 kernel: [267560.978632]  </TASK>
Jul 25 23:58:03 razor01 kernel: [267560.981319] BUG: Bad page map in process pvescheduler  pte:8000000825e75805 pmd:145fcd9067
Jul 25 23:58:03 razor01 kernel: [267560.981326] page:00000000b6c7db5a refcount:2 mapcount:-255 mapping:0000000000000000 index:0x55e3cf046 pfn:0x825e75
Jul 25 23:58:03 razor01 kernel: [267560.981328] memcg:ffff98922c222000
Jul 25 23:58:03 razor01 kernel: [267560.981329] anon flags: 0x17ffffc008001c(uptodate|dirty|lru|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
Jul 25 23:58:03 razor01 kernel: [267560.981332] raw: 0017ffffc008001c ffffda752096d6c8 ffffda75209a4488 ffff98936f0451a1
Jul 25 23:58:03 razor01 kernel: [267560.981334] raw: 000000055e3cf046 0000000000000000 00000002ffffff00 ffff98922c222000
Jul 25 23:58:03 razor01 kernel: [267560.981334] page dumped because: bad pte
Jul 25 23:58:03 razor01 kernel: [267560.981335] addr:000055e3cf046000 vm_flags:08100073 anon_vma:ffff98a72f117958 mapping:0000000000000000 index:55e3cf046
Jul 25 23:58:03 razor01 kernel: [267560.981338] file:(null) fault:0x0 mmap:0x0 readpage:0x0
Jul 25 23:58:03 razor01 kernel: [267560.981349] CPU: 43 PID: 3039316 Comm: pvescheduler Tainted: P    B      O      5.15.108-1-pve #1
Jul 25 23:58:03 razor01 kernel: [267560.981351] Hardware name: Default string Default string/X99-D8-MAX, BIOS 5.11 06/13/2022
Jul 25 23:58:03 razor01 kernel: [267560.981351] Call Trace:
Jul 25 23:58:03 razor01 kernel: [267560.981353]  <TASK>
Jul 25 23:58:03 razor01 kernel: [267560.981354]  dump_stack_lvl+0x4a/0x63
Jul 25 23:58:03 razor01 kernel: [267560.981357]  dump_stack+0x10/0x16
Jul 25 23:58:03 razor01 kernel: [267560.981359]  print_bad_pte.cold+0x87/0xdf
Jul 25 23:58:03 razor01 kernel: [267560.981369]  ? __mod_lruvec_page_state+0x6b/0xb0
Jul 25 23:58:03 razor01 kernel: [267560.981372]  unmap_page_range+0x8c5/0xfa0
Jul 25 23:58:03 razor01 kernel: [267560.981375]  unmap_single_vma+0x7f/0xf0
Jul 25 23:58:03 razor01 kernel: [267560.981377]  unmap_vmas+0x77/0xf0
Jul 25 23:58:03 razor01 kernel: [267560.981379]  exit_mmap+0xa2/0x200
Jul 25 23:58:03 razor01 kernel: [267560.981381]  mmput+0x63/0x150
Jul 25 23:58:03 razor01 kernel: [267560.981384]  do_exit+0x2fc/0xa20
Jul 25 23:58:03 razor01 kernel: [267560.981387]  do_group_exit+0x3b/0xb0
Jul 25 23:58:03 razor01 kernel: [267560.981389]  __x64_sys_exit_group+0x18/0x20
Jul 25 23:58:03 razor01 kernel: [267560.981392]  do_syscall_64+0x5c/0xc0
Jul 25 23:58:03 razor01 kernel: [267560.981394]  ? handle_mm_fault+0xd8/0x2c0
Jul 25 23:58:03 razor01 kernel: [267560.981395]  ? exit_to_user_mode_prepare+0x37/0x1b0
Jul 25 23:58:03 razor01 kernel: [267560.981398]  ? irqentry_exit_to_user_mode+0x9/0x20
Jul 25 23:58:03 razor01 kernel: [267560.981400]  ? irqentry_exit+0x1d/0x30
Jul 25 23:58:03 razor01 kernel: [267560.981402]  ? exc_page_fault+0x89/0x170
Jul 25 23:58:03 razor01 kernel: [267560.981404]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 25 23:58:03 razor01 kernel: [267560.981409] RIP: 0033:0x7fa8579ddbd9
Jul 25 23:58:03 razor01 kernel: [267560.981410] Code: Unable to access opcode bytes at RIP 0x7fa8579ddbaf.
Jul 25 23:58:03 razor01 kernel: [267560.981411] RSP: 002b:00007ffdbb70f7e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
Jul 25 23:58:03 razor01 kernel: [267560.981413] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa8579ddbd9
Jul 25 23:58:03 razor01 kernel: [267560.981414] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
Jul 25 23:58:03 razor01 kernel: [267560.981415] RBP: 000055e3cd5642a0 R08: ffffffffffffff80 R09: 000055e3cc6e15e0
Jul 25 23:58:03 razor01 kernel: [267560.981416] R10: fffffffffffffd8c R11: 0000000000000206 R12: 0000000000000000
Jul 25 23:58:03 razor01 kernel: [267560.981417] R13: 0000000000000000 R14: 000055e3cd569ca8 R15: 000055e3cd689f00
Jul 25 23:58:03 razor01 kernel: [267560.981419]  </TASK>

The same error dump apparently repeated about once a minute for an hour. After that, the errors in attachments #1 and #2 showed up in the log, which is the moment razor01 became unresponsive.
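
One way to check how regularly the dump recurs is to count the kernel `BUG:` lines per minute. A minimal sketch in Python, assuming the classic syslog timestamp format seen in the trace above; `count_bug_lines_per_minute` is an illustrative helper name, not an existing tool:

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "Jul 25 23:58:03 razor01 kernel: [267560.978432] BUG: Bad page map ..."
# and captures the timestamp truncated to the minute ("Jul 25 23:58").
BUG_LINE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+kernel:.*\bBUG: ")


def count_bug_lines_per_minute(lines: Iterable[str]) -> Counter:
    """Count kernel 'BUG:' lines, keyed by 'Mon DD HH:MM'."""
    counts: Counter = Counter()
    for line in lines:
        match = BUG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Three lines taken from the trace above; only two of them are 'BUG:' lines.
    sample = [
        "Jul 25 23:58:03 razor01 kernel: [267560.978432] BUG: Bad page map in process pvescheduler  pte:8000000825e75805 pmd:10908df067",
        "Jul 25 23:58:03 razor01 kernel: [267560.978461] memcg:ffff98922c222000",
        "Jul 25 23:58:03 razor01 kernel: [267560.981319] BUG: Bad page map in process pvescheduler  pte:8000000825e75805 pmd:145fcd9067",
    ]
    for minute, n in count_bug_lines_per_minute(sample).items():
        print(minute, n)
```

To run it against the real log, the sample list can be replaced with an open file handle on /var/log/syslog.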

What may be causing this issue?

The output of pveversion -v is included as attachment #3.
 

Attachments

  • 1-general-protection-fault.txt (7 KB)
  • 2-bad-page-state-pve-ha-lrm.txt (5.2 KB)
  • 3-pveversion.txt (1.6 KB)

Hi,
I'd check for bad RAM first to rule that out, e.g. with memtest86+, which can be started from the advanced options of the Proxmox VE installation ISO.
 
