Proxmox randomly fails, but seems to be the UI only

foxyproxy

Hello everyone!

Proxmox (and I think it's just the UI) has a problem where, roughly once every two to four weeks, all of the containers, storage and VMs appear offline.

Sorry I don't have a better image. I thought I had taken a high-res screenshot, but hopefully this one shows the problem: everything has a question mark and all the names are gone.

[Attachment: 1723064015882.png]

Some containers work, others don't. The VMs all seem to work, but I can't manage them. The only way to get around this is a hard power off. I can still use the console, and I have tried `shutdown now` and `reboot`; some containers and VMs go into shutdown mode, others don't.

I also managed to get some logs as soon as this happened. Hopefully this helps. Items in bold were coloured red in the logs.

Aug 07 11:12:29 proxmox-ve kernel: BUG: Bad page state in process java pfn:1737bf5
Aug 07 11:12:29 proxmox-ve kernel: page:000000001d48a063 refcount:0 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0x1737bf5
Aug 07 11:12:29 proxmox-ve kernel: flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
Aug 07 11:12:29 proxmox-ve kernel: raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
Aug 07 11:12:29 proxmox-ve kernel: raw: 0000000000000000 0000000080270027 00000000fffffffe 0000000000000000
Aug 07 11:12:29 proxmox-ve kernel: page dumped because: nonzero mapcount
Aug 07 11:12:29 proxmox-ve kernel: Modules linked in: udp_diag dm_snapshot nfnetlink_acct wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel tcp>
Aug 07 11:12:29 proxmox-ve kernel: intel_powerclamp snd_soc_core i915 coretemp snd_compress ac97_bus snd_pcm_dmaengine kvm_intel drm_buddy snd_hda_intel iwlmvm snd_intel_dspcfg ttm snd_intel_sdw_acpi kvm mac80211 irqbypass drm_d>
Aug 07 11:12:29 proxmox-ve kernel: xhci_pci_renesas crc32_pclmul ahci intel_lpss_pci i2c_i801 nvme_core e1000e xhci_hcd spi_intel_pci intel_lpss i2c_smbus libahci spi_intel nvme_common idma64 video wmi
Aug 07 11:12:29 proxmox-ve kernel: CPU: 0 PID: 1431832 Comm: java Tainted: P D W O 6.2.16-3-pve #1
Aug 07 11:12:29 proxmox-ve kernel: Hardware name: ASRock B660M Pro RS/AX/B660M Pro RS/AX, BIOS 4.03 11/24/2022
Aug 07 11:12:29 proxmox-ve kernel: Call Trace:
Aug 07 11:12:29 proxmox-ve kernel: <TASK>
Aug 07 11:12:29 proxmox-ve kernel: dump_stack_lvl+0x48/0x70
Aug 07 11:12:29 proxmox-ve kernel: dump_stack+0x10/0x20
Aug 07 11:12:29 proxmox-ve kernel: bad_page+0x76/0x120
Aug 07 11:12:29 proxmox-ve kernel: free_page_is_bad_report+0x66/0x80
Aug 07 11:12:29 proxmox-ve kernel: free_pcppages_bulk+0x1bc/0x2f0
Aug 07 11:12:29 proxmox-ve kernel: free_unref_page_commit+0xf1/0x190
Aug 07 11:12:29 proxmox-ve kernel: free_unref_page_list+0x1e7/0x450
Aug 07 11:12:29 proxmox-ve kernel: release_pages+0x152/0x520
Aug 07 11:12:29 proxmox-ve kernel: free_pages_and_swap_cache+0x4a/0x60
Aug 07 11:12:29 proxmox-ve kernel: tlb_batch_pages_flush+0x43/0x80
Aug 07 11:12:29 proxmox-ve kernel: tlb_finish_mmu+0x73/0x1a0
Aug 07 11:12:29 proxmox-ve kernel: exit_mmap+0x14d/0x310
Aug 07 11:12:29 proxmox-ve kernel: __mmput+0x41/0x140
Aug 07 11:12:29 proxmox-ve kernel: mmput+0x2e/0x40
Aug 07 11:12:29 proxmox-ve kernel: do_exit+0x2ef/0xac0
Aug 07 11:12:29 proxmox-ve kernel: ? __futex_unqueue+0x29/0x60
Aug 07 11:12:29 proxmox-ve kernel: ? futex_unqueue+0x3d/0x70

Aug 07 11:12:29 proxmox-ve kernel: arch_do_signal_or_restart+0x42/0x280
Aug 07 11:12:29 proxmox-ve kernel: exit_to_user_mode_prepare+0x11b/0x190
Aug 07 11:12:29 proxmox-ve kernel: syscall_exit_to_user_mode+0x29/0x50
Aug 07 11:12:29 proxmox-ve kernel: do_syscall_64+0x67/0x90
Aug 07 11:12:29 proxmox-ve kernel: ? exc_page_fault+0x91/0x1b0
Aug 07 11:12:29 proxmox-ve kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Aug 07 11:12:29 proxmox-ve kernel: RIP: 0033:0x7f968dedfd61
Aug 07 11:12:29 proxmox-ve kernel: Code: Unable to access opcode bytes at 0x7f968dedfd37.
Aug 07 11:12:29 proxmox-ve kernel: RSP: 002b:00007f968c7d4730 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Aug 07 11:12:29 proxmox-ve kernel: RAX: fffffffffffffe00 RBX: 00007f9688018550 RCX: 00007f968dedfd61
Aug 07 11:12:29 proxmox-ve kernel: RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007f968801857c
Aug 07 11:12:29 proxmox-ve kernel: RBP: 00007f968c7d4770 R08: 0000000000000000 R09: 00000000ffffffff
Aug 07 11:12:29 proxmox-ve kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Aug 07 11:12:29 proxmox-ve kernel: R13: 0000000000000000 R14: 00007f9688018528 R15: 00007f968801857c
Aug 07 11:12:29 proxmox-ve kernel: </TASK>
Aug 07 11:12:29 proxmox-ve kernel: BUG: Bad rss-counter state mm:00000000819475f3 type:MM_FILEPAGES val:-1
Aug 07 11:12:29 proxmox-ve kernel: BUG: Bad rss-counter state mm:00000000819475f3 type:MM_ANONPAGES val:1

Aug 07 11:12:29 proxmox-ve kernel: docker0: port 3(veth3cd6817) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: vethe0f8f22: renamed from eth0
Aug 07 11:12:29 proxmox-ve kernel: docker0: port 3(veth3cd6817) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: device veth3cd6817 left promiscuous mode
Aug 07 11:12:29 proxmox-ve kernel: docker0: port 3(veth3cd6817) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: fwbr118i0: port 2(veth118i0) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: fwbr118i0: port 2(veth118i0) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: device veth118i0 left promiscuous mode
Aug 07 11:12:29 proxmox-ve kernel: fwbr118i0: port 2(veth118i0) entered disabled state
Aug 07 11:12:30 proxmox-ve kernel: br-1a87481796fb: port 5(veth431c3e2) entered disabled state
Aug 07 11:12:30 proxmox-ve kernel: vethb53d107: renamed from eth0

Aug 07 11:13:02 proxmox-ve kernel: usb 1-3: USB disconnect, device number 12
Aug 07 11:13:04 proxmox-ve pvedaemon[696653]: <root@pam> starting task UPID:proxmox-ve:000EC79D:073EB478:66B2CA20:vncshell::root@pam:
Aug 07 11:13:04 proxmox-ve pvedaemon[968605]: starting termproxy UPID:proxmox-ve:000EC79D:073EB478:66B2CA20:vncshell::root@pam:
Aug 07 11:13:04 proxmox-ve pvedaemon[696653]: <root@pam> successful auth for user 'root@pam'
Aug 07 11:13:04 proxmox-ve login[968608]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)
Aug 07 11:13:04 proxmox-ve login[968608]: pam_systemd(login:session): Failed to create session: Transport endpoint is not connected
Aug 07 11:13:04 proxmox-ve login[968613]: ROOT LOGIN on '/dev/pts/0'
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: new low-speed USB device number 13 using xhci_hcd
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: New USB device found, idVendor=0764, idProduct=0501, bcdDevice= 0.02
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: New USB device strings: Mfr=3, Product=1, SerialNumber=2
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: Product: VP700ELCD
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: Manufacturer: CPS
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: SerialNumber:
Aug 07 11:13:06 proxmox-ve kernel: hid-generic 0003:0764:0501.1A6DA: hiddev0,hidraw0: USB HID v1.10 Device [CPS VP700ELCD] on usb-0000:00:14.0-3/input0
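
A note on reading the trace above: the `Tainted: P D W O` field on the `CPU:` line lists the kernel's taint flags. The `D` flag means the kernel had already hit an oops or BUG at some earlier point, which suggests the first fault may sit further back in the journal than this excerpt. Below is a minimal Python sketch for decoding that field. The letter meanings are taken from the kernel's tainted-kernels documentation, only a subset is included, and `TAINT_FLAGS` and `decode_taint` are illustrative names rather than part of any tool.

```python
import re

# Subset of the single-letter taint flags documented by the Linux kernel.
# Letters missing from this table are reported as "not in this subset".
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "D": "kernel has died recently (an earlier OOPS or BUG)",
    "W": "a kernel warning was issued earlier",
    "O": "out-of-tree module was loaded",
    "B": "bad page referenced or unexpected page flags",
    "M": "machine check exception was reported",
    "E": "unsigned module was loaded",
    "L": "soft lockup occurred",
}


def decode_taint(log_line):
    """Return (letter, meaning) pairs for the 'Tainted:' field of a kernel log line."""
    match = re.search(r"Tainted:\s+((?:[A-Z]\s+)+)", log_line)
    if match is None:
        return []  # "Not tainted" or a line without a taint field
    letters = match.group(1).split()
    return [(c, TAINT_FLAGS.get(c, "not in this subset")) for c in letters]


# The CPU line from the trace above.
line = ("Aug 07 11:12:29 proxmox-ve kernel: CPU: 0 PID: 1431832 "
        "Comm: java Tainted: P D W O 6.2.16-3-pve #1")
for letter, meaning in decode_taint(line):
    print(f"{letter}: {meaning}")
```

If `D` or `W` shows up, searching the older journal entries for the first `BUG:`, `Oops` or `WARNING:` line is a reasonable next step.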

I also found this log entry this morning. Proxmox is still working fine, but I thought I'd include it.

proxmox-ve kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
  TDH                  <39>
  TDT                  <72>
  next_to_use          <72>
  next_to_clean        <38>
buffer_info[next_to_clean]:
  time_stamp           <1007c7e00>
  next_to_watch        <39>
  jiffies              <1007c7fd1>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
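
The fields in this dump are printed by the e1000e driver in hexadecimal. TDH and TDT are the transmit descriptor ring's head (advanced by the NIC) and tail (advanced by the driver), so the gap between them is the number of descriptors still waiting to be processed, and `jiffies` minus `time_stamp` is how long the oldest buffer has been stuck. A rough Python sketch of that arithmetic follows. `parse_hang_dump` and the other helper names are made up for illustration, ring wrap-around is ignored (TDT is ahead of TDH here), and the tick rate `hz=250` is an assumption that should be checked against the running kernel's `CONFIG_HZ`.

```python
import re

# The hang report from above, pasted as plain text.
HANG_DUMP = """\
proxmox-ve kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
TDH <39>
TDT <72>
next_to_use <72>
next_to_clean <38>
buffer_info[next_to_clean]:
time_stamp <1007c7e00>
next_to_watch <39>
jiffies <1007c7fd1>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
"""


def parse_hang_dump(text):
    """Collect every 'name <hex>' line into a dict of integers."""
    pairs = re.findall(r"^\s*(.+?)\s*<([0-9a-fA-F]+)>\s*$", text, re.MULTILINE)
    return {name: int(value, 16) for name, value in pairs}


def pending_descriptors(fields):
    """Descriptors queued by the driver (tail) but not yet processed by the NIC (head)."""
    return fields["TDT"] - fields["TDH"]  # ignores ring wrap-around


def stalled_jiffies(fields):
    """Ticks between queuing the oldest buffer and the hang report."""
    return fields["jiffies"] - fields["time_stamp"]


def stalled_seconds(fields, hz=250):
    """Convert the stall to seconds; hz=250 is an assumed CONFIG_HZ."""
    return stalled_jiffies(fields) / hz


fields = parse_hang_dump(HANG_DUMP)
print("pending TX descriptors:", pending_descriptors(fields))
print("stalled for", stalled_jiffies(fields), "jiffies")
print("approx. seconds at assumed HZ=250:", stalled_seconds(fields))
```

With the values above this gives 57 pending descriptors and a stall of 465 jiffies, so the transmit queue had stopped draining when the driver reported the hang.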


Any help would be appreciated.
 
What hardware and NIC are you using? In any case, I think you'll find more troubleshooting ideas here:
https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-8#post-390709
 
Thanks for the link, I'll check it out now.

Here are my network details in case they help, although I'm not sure they're related to the hang I experienced.

lspci -vnn | grep -i net
00:14.3 Network controller [0280]: Intel Corporation Alder Lake-S PCH CNVi WiFi [8086:7af0] (rev 11)
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (17) I219-V [8086:1a1d] (rev 11)
        Subsystem: ASRock Incorporation Ethernet Connection (17) I219-V [1849:1a1d]

ethtool enp0s31f6
Settings for enp0s31f6:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        MDI-X: off (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

lshw -class network
  *-network:0 DISABLED
       description: Wireless interface
       product: Alder Lake-S PCH CNVi WiFi
       vendor: Intel Corporation
       physical id: 14.3
       bus info: pci@0000:00:14.3
       logical name: wlp0s20f3
       version: 11
       serial: 70:1a:b8:99:5a:b1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress msix bus_master cap_list ethernet physical wireless
       configuration: broadcast=yes driver=iwlwifi driverversion=6.2.16-3-pve firmware=72.a764baac.0 so-a0-gf-a0-72.uc latency=0 link=no multicast=yes wireless=IEEE 802.11
       resources: iomemory:600-5ff irq:18 memory:6001114000-6001117fff

  *-network:1
       description: Ethernet interface
       product: Ethernet Connection (17) I219-V
       vendor: Intel Corporation
       physical id: 1f.6
       bus info: pci@0000:00:1f.6
       logical name: enp0s31f6
       version: 11
       serial: 9c:6b:00:17:3e:cd
       size: 1Gbit/s
       capacity: 1Gbit/s
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=e1000e driverversion=6.2.16-3-pve duplex=full firmware=0.21-4 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
       resources: irq:125 memory:70a00000-70a1ffff