Proxmox randomly fails, but seems to be the UI only

foxyproxy

Hello everyone!

Proxmox (and I think it's just the UI) has a problem where, about once every two weeks to a month, all of the containers, storage and VMs appear offline.

Sorry I don't have a better image; I thought I had taken a high-res screenshot, but hopefully it shows the problem. Everything has a question mark and all the names are gone.

1723064015882.png

Some containers work, others don't. The VMs all seem to work, but I can't manage them. The only way to recover is a hard power-off. I can still use the console, and I have tried `shutdown now` and `reboot`; some containers and VMs go into shutdown mode, others don't.

I also managed to get some logs as soon as this happened. Hopefully this helps. Items in bold were coloured red in the logs.

Aug 07 11:12:29 proxmox-ve kernel: BUG: Bad page state in process java pfn:1737bf5
Aug 07 11:12:29 proxmox-ve kernel: page:000000001d48a063 refcount:0 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0x1737bf5
Aug 07 11:12:29 proxmox-ve kernel: flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
Aug 07 11:12:29 proxmox-ve kernel: raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
Aug 07 11:12:29 proxmox-ve kernel: raw: 0000000000000000 0000000080270027 00000000fffffffe 0000000000000000
Aug 07 11:12:29 proxmox-ve kernel: page dumped because: nonzero mapcount
Aug 07 11:12:29 proxmox-ve kernel: Modules linked in: udp_diag dm_snapshot nfnetlink_acct wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel tcp>
Aug 07 11:12:29 proxmox-ve kernel: intel_powerclamp snd_soc_core i915 coretemp snd_compress ac97_bus snd_pcm_dmaengine kvm_intel drm_buddy snd_hda_intel iwlmvm snd_intel_dspcfg ttm snd_intel_sdw_acpi kvm mac80211 irqbypass drm_d>
Aug 07 11:12:29 proxmox-ve kernel: xhci_pci_renesas crc32_pclmul ahci intel_lpss_pci i2c_i801 nvme_core e1000e xhci_hcd spi_intel_pci intel_lpss i2c_smbus libahci spi_intel nvme_common idma64 video wmi
Aug 07 11:12:29 proxmox-ve kernel: CPU: 0 PID: 1431832 Comm: java Tainted: P D W O 6.2.16-3-pve #1
Aug 07 11:12:29 proxmox-ve kernel: Hardware name: ASRock B660M Pro RS/AX/B660M Pro RS/AX, BIOS 4.03 11/24/2022
Aug 07 11:12:29 proxmox-ve kernel: Call Trace:
Aug 07 11:12:29 proxmox-ve kernel: <TASK>
Aug 07 11:12:29 proxmox-ve kernel: dump_stack_lvl+0x48/0x70
Aug 07 11:12:29 proxmox-ve kernel: dump_stack+0x10/0x20
Aug 07 11:12:29 proxmox-ve kernel: bad_page+0x76/0x120
Aug 07 11:12:29 proxmox-ve kernel: free_page_is_bad_report+0x66/0x80
Aug 07 11:12:29 proxmox-ve kernel: free_pcppages_bulk+0x1bc/0x2f0
Aug 07 11:12:29 proxmox-ve kernel: free_unref_page_commit+0xf1/0x190
Aug 07 11:12:29 proxmox-ve kernel: free_unref_page_list+0x1e7/0x450
Aug 07 11:12:29 proxmox-ve kernel: release_pages+0x152/0x520
Aug 07 11:12:29 proxmox-ve kernel: free_pages_and_swap_cache+0x4a/0x60
Aug 07 11:12:29 proxmox-ve kernel: tlb_batch_pages_flush+0x43/0x80
Aug 07 11:12:29 proxmox-ve kernel: tlb_finish_mmu+0x73/0x1a0
Aug 07 11:12:29 proxmox-ve kernel: exit_mmap+0x14d/0x310
Aug 07 11:12:29 proxmox-ve kernel: __mmput+0x41/0x140
Aug 07 11:12:29 proxmox-ve kernel: mmput+0x2e/0x40
Aug 07 11:12:29 proxmox-ve kernel: do_exit+0x2ef/0xac0
Aug 07 11:12:29 proxmox-ve kernel: ? __futex_unqueue+0x29/0x60
Aug 07 11:12:29 proxmox-ve kernel: ? futex_unqueue+0x3d/0x70

Aug 07 11:12:29 proxmox-ve kernel: arch_do_signal_or_restart+0x42/0x280
Aug 07 11:12:29 proxmox-ve kernel: exit_to_user_mode_prepare+0x11b/0x190
Aug 07 11:12:29 proxmox-ve kernel: syscall_exit_to_user_mode+0x29/0x50
Aug 07 11:12:29 proxmox-ve kernel: do_syscall_64+0x67/0x90
Aug 07 11:12:29 proxmox-ve kernel: ? exc_page_fault+0x91/0x1b0
Aug 07 11:12:29 proxmox-ve kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Aug 07 11:12:29 proxmox-ve kernel: RIP: 0033:0x7f968dedfd61
Aug 07 11:12:29 proxmox-ve kernel: Code: Unable to access opcode bytes at 0x7f968dedfd37.
Aug 07 11:12:29 proxmox-ve kernel: RSP: 002b:00007f968c7d4730 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Aug 07 11:12:29 proxmox-ve kernel: RAX: fffffffffffffe00 RBX: 00007f9688018550 RCX: 00007f968dedfd61
Aug 07 11:12:29 proxmox-ve kernel: RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007f968801857c
Aug 07 11:12:29 proxmox-ve kernel: RBP: 00007f968c7d4770 R08: 0000000000000000 R09: 00000000ffffffff
Aug 07 11:12:29 proxmox-ve kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Aug 07 11:12:29 proxmox-ve kernel: R13: 0000000000000000 R14: 00007f9688018528 R15: 00007f968801857c
Aug 07 11:12:29 proxmox-ve kernel: </TASK>
Aug 07 11:12:29 proxmox-ve kernel: BUG: Bad rss-counter state mm:00000000819475f3 type:MM_FILEPAGES val:-1
Aug 07 11:12:29 proxmox-ve kernel: BUG: Bad rss-counter state mm:00000000819475f3 type:MM_ANONPAGES val:1

Aug 07 11:12:29 proxmox-ve kernel: docker0: port 3(veth3cd6817) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: vethe0f8f22: renamed from eth0
Aug 07 11:12:29 proxmox-ve kernel: docker0: port 3(veth3cd6817) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: device veth3cd6817 left promiscuous mode
Aug 07 11:12:29 proxmox-ve kernel: docker0: port 3(veth3cd6817) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: fwbr118i0: port 2(veth118i0) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: fwbr118i0: port 2(veth118i0) entered disabled state
Aug 07 11:12:29 proxmox-ve kernel: device veth118i0 left promiscuous mode
Aug 07 11:12:29 proxmox-ve kernel: fwbr118i0: port 2(veth118i0) entered disabled state
Aug 07 11:12:30 proxmox-ve kernel: br-1a87481796fb: port 5(veth431c3e2) entered disabled state
Aug 07 11:12:30 proxmox-ve kernel: vethb53d107: renamed from eth0

Aug 07 11:13:02 proxmox-ve kernel: usb 1-3: USB disconnect, device number 12
Aug 07 11:13:04 proxmox-ve pvedaemon[696653]: <root@pam> starting task UPID:proxmox-ve:000EC79D:073EB478:66B2CA20:vncshell::root@pam:
Aug 07 11:13:04 proxmox-ve pvedaemon[968605]: starting termproxy UPID:proxmox-ve:000EC79D:073EB478:66B2CA20:vncshell::root@pam:
Aug 07 11:13:04 proxmox-ve pvedaemon[696653]: <root@pam> successful auth for user 'root@pam'
Aug 07 11:13:04 proxmox-ve login[968608]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)
Aug 07 11:13:04 proxmox-ve login[968608]: pam_systemd(login:session): Failed to create session: Transport endpoint is not connected
Aug 07 11:13:04 proxmox-ve login[968613]: ROOT LOGIN on '/dev/pts/0'
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: new low-speed USB device number 13 using xhci_hcd
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: New USB device found, idVendor=0764, idProduct=0501, bcdDevice= 0.02
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: New USB device strings: Mfr=3, Product=1, SerialNumber=2
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: Product: VP700ELCD
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: Manufacturer: CPS
Aug 07 11:13:06 proxmox-ve kernel: usb 1-3: SerialNumber:
Aug 07 11:13:06 proxmox-ve kernel: hid-generic 0003:0764:0501.1A6DA: hiddev0,hidraw0: USB HID v1.10 Device [CPS VP700ELCD] on usb-0000:00:14.0-3/input0
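
For anyone who wants to pull just the red "BUG:" lines out of a saved journal like the one above, here is a minimal sketch. It assumes the kernel log was first saved to a text file (the name `kernel.log` and the function name `kernel_bugs` are only placeholders), and it only filters text; it does not diagnose anything.

```shell
#!/bin/sh
# Print only the kernel "BUG:" lines from a saved journal dump.
# Assumption: the log was saved beforehand, for example with
#   journalctl -k -b -1 > kernel.log
# (-k = kernel messages, -b -1 = previous boot; this needs a
# persistent journal, and kernel.log is a hypothetical file name).
kernel_bugs() {
    # $1 is the path to the saved log file.
    grep -E 'kernel: BUG: ' "$1"
}
```

Calling `kernel_bugs kernel.log` on the excerpt above should list the "Bad page state" line and the two "Bad rss-counter state" lines, which makes it easier to compare timestamps across several hangs.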

I also found this log entry this morning, but Proxmox is still working fine. I just thought I'd include it.

proxmox-ve kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
  TDH                  <39>
  TDT                  <72>
  next_to_use          <72>
  next_to_clean        <38>
buffer_info[next_to_clean]:
  time_stamp           <1007c7e00>
  next_to_watch        <39>
  jiffies              <1007c7fd1>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>


Any help would be appreciated.
 

What's the hardware and NIC? Anyhow, I think you'll find more troubleshooting ideas here:
https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-8#post-390709
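
One workaround that is often discussed for the e1000e "Detected Hardware Unit Hang" message is turning off segmentation offload on the NIC. Whether that is what the linked post recommends, or whether it helps on this board, is not confirmed here, so treat the following as a sketch only. The interface name is taken from the log above; the `disable_offload` function and the `DRY_RUN` variable are made up for this example and are not part of ethtool.

```shell
#!/bin/sh
# Build, and optionally run, the ethtool command that switches off
# TCP and generic segmentation offload (tso/gso) on one interface.
# DRY_RUN is a hypothetical variable of this sketch: it defaults to 1,
# so the command is only printed unless DRY_RUN=0 is set.
disable_offload() {
    iface="$1"
    cmd="ethtool -K $iface tso off gso off"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"     # show what would be run, change nothing
    else
        $cmd            # needs root; the setting is lost on reboot
    fi
}
```

Running `disable_offload enp0s31f6` prints the command without touching the NIC. The current offload state can be checked beforehand with `ethtool -k enp0s31f6` (lowercase `-k`).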
 
Thanks for the link, I'll check it out now.

Here are my network details in case they help, although I'm not sure they're related to the hang I experienced.

lspci -vnn | grep -i net
00:14.3 Network controller [0280]: Intel Corporation Alder Lake-S PCH CNVi WiFi [8086:7af0] (rev 11)
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (17) I219-V [8086:1a1d] (rev 11)
        Subsystem: ASRock Incorporation Ethernet Connection (17) I219-V [1849:1a1d]

ethtool enp0s31f6
Settings for enp0s31f6:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        MDI-X: off (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

lshw -class network
  *-network:0 DISABLED
       description: Wireless interface
       product: Alder Lake-S PCH CNVi WiFi
       vendor: Intel Corporation
       physical id: 14.3
       bus info: pci@0000:00:14.3
       logical name: wlp0s20f3
       version: 11
       serial: 70:1a:b8:99:5a:b1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress msix bus_master cap_list ethernet physical wireless
       configuration: broadcast=yes driver=iwlwifi driverversion=6.2.16-3-pve firmware=72.a764baac.0 so-a0-gf-a0-72.uc latency=0 link=no multicast=yes wireless=IEEE 802.11
       resources: iomemory:600-5ff irq:18 memory:6001114000-6001117fff

  *-network:1
       description: Ethernet interface
       product: Ethernet Connection (17) I219-V
       vendor: Intel Corporation
       physical id: 1f.6
       bus info: pci@0000:00:1f.6
       logical name: enp0s31f6
       version: 11
       serial: 9c:6b:00:17:3e:cd
       size: 1Gbit/s
       capacity: 1Gbit/s
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=e1000e driverversion=6.2.16-3-pve duplex=full firmware=0.21-4 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
       resources: irq:125 memory:70a00000-70a1ffff
 
