[Partially Solved] Failed cluster node addition, Windows 2012 freezing on boot, high RAM usage, backup fails (got backup working and VM running).

Stryker777

New Member
Oct 14, 2023
I recently set up a server for testing (48 cores, 32 GB RAM, ZFS). It was originally running 8.0.2 and was updated to 8.0.4 a week ago; I didn't reboot it until last night and didn't realize it had been updated. In my scenario, I set up a second node. I created the cluster on the first PVE node, ran pveversion and saw 8.0.4, so I ran updates on the second node and then tried to join it to the cluster. The join failed with a cert error:
'/etc/pve/nodes/pve2/pve-ssl.pem' does not exist! (500)

I didn't think anything of the VM that was already running on the first PVE node. It was Server 2012, with:

12 GB RAM, 8 cores, SeaBIOS, default display, pc-i440fx-8.0 machine type, VirtIO SCSI controller, a SATA drive on ZFS, and an e1000 NIC.

When the join failed, I had to remove the second node with pvecm delnode and then reboot.

After rebooting, the Windows Server 2012 install no longer boots successfully. It gets to the spinning circle, RAM usage starts to climb, then the circle freezes and the status readouts stop updating. Then the VNC console fails. I tried rebooting it again, tried to get into Safe Mode, and checked the logs; this is all I found:
VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries

Afterwards, I wondered whether the update might have exposed a hardware incompatibility with my server (it is quite new), so I tried to back the VM up (stopped) using the GUI, so I could load it on another machine and test other hardware. The disk is 1 TB; the backup took about 4 hours to reach 99%, then stayed there for the next 5 hours. I have tried twice in the GUI, both failures. I am running it from the command line now to see, with about 30 minutes left.

***Update*** The backup succeeded from the CLI.
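For reference, this is roughly the CLI invocation I used; the storage name "local" here is a placeholder, substitute your actual backup target:

```shell
# Sketch of a stopped-mode CLI backup of VM 100 with vzdump.
# "local" is a placeholder storage name; substitute your backup target.
VMID=100
STORAGE=local
CMD="vzdump $VMID --mode stop --storage $STORAGE --compress zstd"
echo "$CMD"   # run the printed command as root on the PVE host
```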

Any information would be very helpful.
Could creating a cluster have caused the problem? An update error? Incompatible hardware?

Thank you.
 
Did it again and see this in the log when it fails:

Code:
Oct 14 10:47:30 pve kernel: device tap100i0 entered promiscuous mode
Oct 14 10:47:30 pve kernel: vmbr0: port 2(tap100i0) entered blocking state
Oct 14 10:47:30 pve kernel: vmbr0: port 2(tap100i0) entered disabled state
Oct 14 10:47:30 pve kernel: vmbr0: port 2(tap100i0) entered blocking state
Oct 14 10:47:30 pve kernel: vmbr0: port 2(tap100i0) entered forwarding state
Oct 14 10:47:31 pve pvedaemon[1138322]: <root@pam> end task UPID:pve:0001703A:001E473D:652AB812:qmstart:100:root@pam: OK
Oct 14 10:47:31 pve pvedaemon[94314]: starting vnc proxy UPID:pve:0001706A:001E47A5:652AB813:vncproxy:100:root@pam:
Oct 14 10:47:31 pve pvedaemon[3080470]: <root@pam> starting task UPID:pve:0001706A:001E47A5:652AB813:vncproxy:100:root@pam:
Oct 14 10:48:14 pve kernel: BUG: unable to handle page fault for address: ff1781e6182d3cff
Oct 14 10:48:14 pve kernel: #PF: supervisor write access in kernel mode
Oct 14 10:48:14 pve kernel: #PF: error_code(0x0003) - permissions violation
Oct 14 10:48:14 pve kernel: PGD 849001067 P4D 849002067 PUD 48373c063 PMD 118212063 PTE 80000001182d3161
Oct 14 10:48:14 pve kernel: Oops: 0003 [#1] PREEMPT SMP NOPTI
Oct 14 10:48:14 pve kernel: CPU: 18 PID: 105816 Comm: z_rd_int_2 Tainted: P           O       6.2.16-15-pve #1
Oct 14 10:48:14 pve kernel: Hardware name: Dell Inc. PowerEdge R760/0DK96C, BIOS 1.4.4 05/15/2023
Oct 14 10:48:14 pve kernel: RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
Oct 14 10:48:14 pve kernel: Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d 65 35 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
Oct 14 10:48:14 pve kernel: RSP: 0018:ff2cbbbea2bcb7f0 EFLAGS: 00010082
Oct 14 10:48:14 pve kernel: RAX: 00000000ffffffff RBX: ff1781eab6ce3000 RCX: ff1781e6182d1000
Oct 14 10:48:14 pve kernel: RDX: 00000000ffffffff RSI: ff1781eab6ce3000 RDI: ff2cbbbea2bcb940
Oct 14 10:48:14 pve kernel: RBP: ff2cbbbea2bcb7f0 R08: 0000000000000000 R09: 0000051a2c270000
Oct 14 10:48:14 pve kernel: R10: ff1781e9f304c920 R11: 0000000000000000 R12: ff1781eab6ce4000
Oct 14 10:48:14 pve kernel: R13: ff2cbbbea2bcb940 R14: 0000000000001000 R15: 0000000000000000
Oct 14 10:48:14 pve kernel: FS:  0000000000000000(0000) GS:ff1781e96fe40000(0000) knlGS:0000000000000000
Oct 14 10:48:14 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 14 10:48:14 pve kernel: CR2: ff1781e6182d3cff CR3: 00000002d53ec002 CR4: 0000000000773ee0
Oct 14 10:48:14 pve kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 14 10:48:14 pve kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Oct 14 10:48:14 pve kernel: PKRU: 55555554
Oct 14 10:48:14 pve kernel: Call Trace:
Oct 14 10:48:14 pve kernel:  <TASK>
Oct 14 10:48:14 pve kernel:  ? show_regs+0x6d/0x80
Oct 14 10:48:14 pve kernel:  ? __die+0x24/0x80
Oct 14 10:48:14 pve kernel:  ? page_fault_oops+0x176/0x500
Oct 14 10:48:14 pve kernel:  ? kfpu_begin+0x31/0xa0 [zcommon]
Oct 14 10:48:14 pve kernel:  ? kernelmode_fixup_or_oops+0xb2/0x140
Oct 14 10:48:14 pve kernel:  ? __bad_area_nosemaphore+0x1a5/0x2c0
Oct 14 10:48:14 pve kernel:  ? bad_area_nosemaphore+0x16/0x30
Oct 14 10:48:14 pve kernel:  ? do_kern_addr_fault+0x7b/0xa0
Oct 14 10:48:14 pve kernel:  ? exc_page_fault+0x10a/0x1b0
Oct 14 10:48:14 pve kernel:  ? asm_exc_page_fault+0x27/0x30
Oct 14 10:48:14 pve kernel:  ? kfpu_begin+0x31/0xa0 [zcommon]
Oct 14 10:48:14 pve kernel:  fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
Oct 14 10:48:14 pve kernel:  abd_fletcher_4_iter+0x71/0xe0 [zcommon]
Oct 14 10:48:14 pve kernel:  abd_iterate_func+0x104/0x1e0 [zfs]
Oct 14 10:48:14 pve kernel:  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
Oct 14 10:48:14 pve kernel:  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
Oct 14 10:48:14 pve kernel:  abd_fletcher_4_native+0x89/0xd0 [zfs]
Oct 14 10:48:14 pve kernel:  zio_checksum_error_impl+0x1b3/0x800 [zfs]
Oct 14 10:48:14 pve kernel:  ? __slab_free+0xe9/0x2f0
Oct 14 10:48:14 pve kernel:  ? update_load_avg+0x82/0x810
Oct 14 10:48:14 pve kernel:  ? __slab_free+0xe9/0x2f0
Oct 14 10:48:14 pve kernel:  zio_checksum_error+0x6e/0xf0 [zfs]
Oct 14 10:48:14 pve kernel:  vdev_raidz_io_done+0x225/0x810 [zfs]
Oct 14 10:48:14 pve kernel:  zio_vdev_io_done+0x81/0x240 [zfs]
Oct 14 10:48:14 pve kernel:  zio_execute+0x94/0x170 [zfs]
Oct 14 10:48:14 pve kernel:  taskq_thread+0x2ac/0x4d0 [spl]
Oct 14 10:48:14 pve kernel:  ? __pfx_default_wake_function+0x10/0x10
Oct 14 10:48:14 pve kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
Oct 14 10:48:14 pve kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
Oct 14 10:48:14 pve kernel:  kthread+0xe6/0x110
Oct 14 10:48:14 pve kernel:  ? __pfx_kthread+0x10/0x10
Oct 14 10:48:14 pve kernel:  ret_from_fork+0x29/0x50
Oct 14 10:48:14 pve kernel:  </TASK>
Oct 14 10:48:14 pve kernel: Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog nfnetlink_log nfnetlink sunrpc binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel mgag200 sha512_ssse3 drm_shmem_helper aesni_intel crypto_simd acpi_ipmi drm_kms_helper cryptd i2c_algo_bit syscopyarea cmdlinepart pmt_telemetry pmt_crashlog dell_smbios ipmi_si sysfillrect intel_sdsi mei_me rapl spi_nor ipmi_devintf dell_wmi_descriptor dcdbas pmt_class isst_if_mmio isst_if_mbox_pci sysimgblt isst_if_common intel_vsec zfs(PO) wmi_bmof input_leds idxd mtd pcspkr ipmi_msghandler intel_cstate mei idxd_bus mac_hid acpi_power_meter zunicode(PO) zzstd(O) zlua(O)
Oct 14 10:48:14 pve kernel:  zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb cdc_ether usbnet mii uas usb_storage hid_generic usbkbd usbmouse usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci xhci_pci_renesas crc32_pclmul megaraid_sas bnxt_en i2c_i801 xhci_hcd ahci spi_intel_pci libahci spi_intel i2c_smbus i2c_ismt wmi pinctrl_emmitsburg
Oct 14 10:48:14 pve kernel: CR2: ff1781e6182d3cff
Oct 14 10:48:14 pve kernel: ---[ end trace 0000000000000000 ]---
 
***Update 2*** I added clearcpuid=600 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then ran update-grub and rebooted. The VM boots properly now.
The bug report for that is here: https://bugzilla.proxmox.com/show_bug.cgi?id=4836
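The change can be sketched like this. It is exercised on a temporary copy of the file so it can run anywhere; on the real host you edit /etc/default/grub directly, and your existing GRUB_CMDLINE_LINUX_DEFAULT value may differ from the "quiet" default assumed here:

```shell
# Sketch of the clearcpuid=600 workaround, run against a temp copy
# of a stock /etc/default/grub so it is safe to execute anywhere.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX=""
EOF
# Append clearcpuid=600 to the default kernel command line
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="quiet"$/GRUB_CMDLINE_LINUX_DEFAULT="quiet clearcpuid=600"/' "$tmp"
grep '^GRUB_CMDLINE_LINUX_DEFAULT' "$tmp"
# On the real host, follow with: update-grub && reboot
```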

So I have a backup, and the VM is booting correctly.

Now for the last issue: why does adding a node to the cluster keep failing? I'll try to find some logs on that next.
Thanks for reading my banter.
 
I have done the following to clean up the cluster data after the many attempts:

Code:
rm /etc/pve/pve-root-ca.pem
rm /etc/pve/priv/pve-root-ca.key
rm /etc/pve/nodes/pve/pve-ssl.pem
rm /etc/pve/nodes/pve2/pve-ssl.pem
rm /etc/pve/nodes/pve/pve-ssl.key
rm /etc/pve/nodes/pve2/pve-ssl.key
rm /etc/pve/authkey.pub
rm /etc/pve/priv/authkey.key
rm /etc/pve/priv/authorized_keys
pvecm updatecerts -f
systemctl restart pvedaemon pveproxy
mv /root/.ssh/known_hosts /root/.ssh/known_hosts_old

Then I shut down the VMs and rebooted each server, refreshed the webpage, and accepted the new cert.

Afterwards, I had the same issues. I have verified the time, and it is correct on both servers.
Here are my known errors.

On pve2: permission denied invalid pve ticket (401)
On pve (the main one): '/etc/pve/nodes/pve2/pve-ssl.pem' does not exist! (500)

Code:
Oct 14 14:13:47 pve pvedaemon[90077]: starting termproxy UPID:pve:00015FDD:000289C9:652AE86B:vncshell::root@pam:
Oct 14 14:13:47 pve pvedaemon[3231]: <root@pam> starting task UPID:pve:00015FDD:000289C9:652AE86B:vncshell::root@pam:
Oct 14 14:13:47 pve pvedaemon[3230]: <root@pam> successful auth for user 'root@pam'
Oct 14 14:13:47 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16 85
Oct 14 14:13:47 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16 86
Oct 14 14:13:47 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
Oct 14 14:13:47 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
Oct 14 14:13:48 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
Oct 14 14:13:48 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
Oct 14 14:13:49 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
Oct 14 14:13:49 pve pveproxy[3239]: '/etc/pve/nodes/pve2/pve-ssl.pem' does not exist!
Oct 14 14:13:50 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
Oct 14 14:13:50 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
Oct 14 14:13:50 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
Oct 14 14:13:50 pve corosync[3178]: [TOTEM ] Retransmit List: 15 16
 
My SSL files do exist before I try to join the cluster, but they disappear during the join. Still no resolution on that.
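This is a sketch of the check I'm running before and after each join attempt; the node names pve and pve2 are from my setup, so adjust them for yours:

```shell
# Check that the per-node SSL cert/key pairs exist under /etc/pve.
# Node names "pve" and "pve2" are from this thread; adjust to your hostnames.
for node in pve pve2; do
  for f in pve-ssl.pem pve-ssl.key; do
    path="/etc/pve/nodes/$node/$f"
    if [ -e "$path" ]; then
      echo "OK      $path"
    else
      echo "MISSING $path"
    fi
  done
done
```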
 
