Problem with web accessibility on one node.

LionBurr · Dec 28, 2022

Hi all,

I'm running Proxmox VE 7.3-3 on a 3-node cluster. One of my nodes rebooted today after replication to it from the 2 other nodes failed. I've got a bunch of these messages in /var/log/syslog:

The crashes were here:

Code:

Dec 27 10:17:27 Jaguar kernel: [    0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Dec 27 10:17:27 Jaguar kernel: [   10.690899] pstore: Using crash dump compression: deflate
Dec 27 10:54:46 Jaguar kernel: [    0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Dec 27 10:54:46 Jaguar kernel: [    3.752694] pstore: Using crash dump compression: deflate
Dec 27 11:03:39 Jaguar kernel: [    0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Dec 27 11:03:39 Jaguar kernel: [    3.755702] pstore: Using crash dump compression: deflate

And these are recent errors.

Code:

Dec 27 11:12:58 Jaguar kernel: [  564.795861] x86/split lock detection: #AC: CPU 3/KVM/3327 took a split_lock trap at address: 0xfffff8064da24b6f
Dec 27 11:15:08 Jaguar kernel: [  693.905122] x86/split lock detection: #AC: CPU 4/KVM/3328 took a split_lock trap at address: 0xfffff8064da24b6f
Dec 27 17:41:23 Jaguar kernel: [23869.265255] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
Dec 27 18:22:06 Jaguar kernel: [26311.967076] x86/split lock detection: #AC: CPU 0/KVM/3324 took a split_lock trap at address: 0xfffff8064da79c9f
Dec 27 21:16:27 Jaguar kernel: [36773.504957] perf: interrupt took too long (3144 > 3138), lowering kernel.perf_event_max_sample_rate to 63500
Dec 27 22:03:23 Jaguar kernel: [39589.576370] x86/split lock detection: #AC: CPU 0/KVM/3324 took a split_lock trap at address: 0xfffff8064da79c9f
Dec 27 22:05:39 Jaguar kernel: [39725.185644] x86/split lock detection: #AC: CPU 1/KVM/3325 took a split_lock trap at address: 0xfffff8064da21396
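
For anyone wanting to see how often each guest address shows up in these traps, here is a minimal sketch. The function name `count_split_lock_traps` is made up for illustration, and it assumes the log lines look like the ones above, with the address as the last field on the line.

```shell
#!/bin/sh
# Hypothetical helper: tally split_lock traps per trapping address
# in a syslog-style file. Assumes the address is the last field.
count_split_lock_traps() {
    awk '/took a split_lock trap at address/ { n[$NF]++ }
         END { for (a in n) print n[a], a }' "$1" | sort -rn
}

# Example call on the node (log path as mentioned in the post):
# count_split_lock_traps /var/log/syslog
```

Each output line is a count followed by an address, so a single address repeating many times would point at one hot code path in the guest.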

I've tried restarting pveproxy and rebooting the node. All of my VMs that are in HA say they're still in HA and replication is working. However, when I try to reach the node on port 8006, the web management console times out on that node only. If I try to access it from another node, only some things (like the remote shell for the main host) come up.

System is an HP Elite 800 mini with Intel i7-12700t Alder Lake, 64GB DDR5 RAM, 2 NVMe 1TB SSDs in RAIDz1, a 1.92TB enterprise SATA SSD, and 64GB. There is an internal cluster interface on 192.168.x.x and externals with public IPs though port 8006 is blocked from outside the network. The system had a 29 or so day uptime until this happened. Any help is appreciated.

Thanks,
Bear

nunner · Dec 28, 2022

Hi,

can you still reach the node via ssh? Or is it completely unreachable?

LionBurr said:
I've tried restarting pveproxy

Anything interesting when you check the status of pveproxy? Maybe also try reloading pvedaemon, just in case. Do your requests show up in /var/log/pveproxy/access.log?

LionBurr · Dec 28, 2022

nunner said:
Hi,

can you still reach the node via ssh? Or is it completely unreachable?

Anything interesting when you check the status of pveproxy? Maybe also try reloading pvedaemon, just in case. Do your requests show up in /var/log/pveproxy/access.log?

I can ssh to it.

Status for pveproxy gives me:


Dec 28 08:43:10 Jaguar pveproxy[40475]: starting 2 worker(s)
Dec 28 08:43:10 Jaguar pveproxy[40475]: worker 2758053 started
Dec 28 08:43:10 Jaguar pveproxy[40475]: worker 2758054 started
Dec 28 08:43:10 Jaguar pveproxy[2758054]: unable to open log file '/var/log/pveproxy/access.log' - Permission denied
Dec 28 08:43:10 Jaguar pveproxy[2758053]: unable to open log file '/var/log/pveproxy/access.log' - Permission denied
Dec 28 08:43:10 Jaguar pveproxy[2757864]: worker exit
Dec 28 08:43:10 Jaguar pveproxy[40475]: worker 2757864 finished
Dec 28 08:43:10 Jaguar pveproxy[40475]: starting 1 worker(s)
Dec 28 08:43:10 Jaguar pveproxy[40475]: worker 2758055 started
Dec 28 08:43:10 Jaguar pveproxy[2758055]: unable to open log file '/var/log/pveproxy/access.log' - Permission denied

Catting /var/log/pveproxy/access.log gives me nothing.



root@Jaguar:/var/log/pveproxy# ls -la
total 53
drwx------  2 www-data www-data   10 Dec 27 10:17 .
drwxr-xr-x 15 root     root       87 Dec 28 00:00 ..
-rw-------  1 root     root        0 Dec 27 00:00 access.log
-rw-r-----  1 www-data www-data  115 Dec 27 10:17 access.log.1
-rw-r-----  1 www-data www-data  124 Dec 24 21:05 access.log.2.gz
-rw-r-----  1 www-data www-data  199 Dec 22 13:19 access.log.3.gz
-rw-r-----  1 www-data www-data  143 Dec 20 10:26 access.log.4.gz
-rw-r-----  1 www-data www-data  953 Dec 17 22:40 access.log.5.gz
-rw-r-----  1 www-data www-data  206 Dec 11 16:59 access.log.6.gz
-rw-r-----  1 www-data www-data 7204 Dec  8 12:47 access.log.7.gz

nunner · Dec 28, 2022

The permissions for the log file seem to be messed up. Try running

Code:

chown www-data:www-data access.log
chmod 640 access.log

and restart pveproxy, just for good measure.
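
To confirm the result afterwards (or to spot the problem on other nodes), a small check along these lines might help. This is only a sketch: `check_log_perms` is a made-up helper name, and it relies on GNU `stat` from coreutils, which should be present on a Debian-based PVE host.

```shell
#!/bin/sh
# Hypothetical helper: print "ok" when FILE has the expected
# owner:group and octal mode, otherwise report what was found.
check_log_perms() {
    file=$1
    want_owner=$2
    want_mode=$3
    have_owner=$(stat -c '%U:%G' "$file") || return 2
    have_mode=$(stat -c '%a' "$file") || return 2
    if [ "$have_owner" = "$want_owner" ] && [ "$have_mode" = "$want_mode" ]; then
        echo "ok: $file"
    else
        echo "mismatch: $file is $have_owner $have_mode, expected $want_owner $want_mode"
        return 1
    fi
}

# Example call with the path and values used in this thread:
# check_log_perms /var/log/pveproxy/access.log www-data:www-data 640
```

With the listing posted above, this would have reported root:root 600 for access.log, which matches the "Permission denied" messages from the pveproxy workers.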

LionBurr · Dec 28, 2022

Thanks, Leo - I've got web access back.

Now I'd like to figure out what exactly happened to mess up the perms, as well as what caused the system to do this prior to rebooting. Does it have something to do with the Split Lock detection errors?

nunner · Dec 28, 2022

LionBurr said:
Now I'd like to figure out what exactly happened to mess up the perms

Hmm, I've seen several threads now with the same problem, where the permissions for the access log were set to root-only. I would suggest opening a report on the bugtracker [1], so that others will also be able to discuss this issue.

LionBurr said:
Does it have something to do with the Split Lock detection errors?

Is it the first time that this happened? You can try turning split lock detection off; this thread [2] should be a good reference.

Could you maybe post a few more details about your setup and the crash that occurred?

[1] https://bugzilla.proxmox.com/
[2] https://forum.proxmox.com/threads/x86-split-lock-detection.111544/

LionBurr · Dec 28, 2022

Sure, I’ll file one tonight once I get back from a trip.

This is the first time this has happened. The cluster has run for over 3 months without any issues except a bad NIC in another node.

What other info would you like on the setup? Any specific command outputs you think would be beneficial?

LionBurr · Dec 29, 2022

Added a Bug - https://bugzilla.proxmox.com/show_bug.cgi?id=4434 for the access log perms getting changed.

Also,
Edited /etc/kernel/cmdline


root=ZFS=rpool/ROOT/pve-1 boot=zfs
split_lock_detect=off

Ran proxmox-boot-tool refresh

On reboot, still seeing:

Code:

Dec 28 19:38:59 Jaguar kernel: [  157.133469] x86/split lock detection: #AC: CPU 5/KVM/3579 took a split_lock trap at address: 0xfffff8074aa79c9f
Dec 28 19:40:53 Jaguar kernel: [  271.086553] x86/split lock detection: #AC: CPU 0/KVM/3574 took a split_lock trap at address: 0xfffff8074aa79c9f
Dec 28 19:41:10 Jaguar kernel: [  288.946079] x86/split lock detection: #AC: CPU 3/KVM/3577 took a split_lock trap at address: 0xfffff8074aa79c9f

Neobin · Dec 29, 2022

LionBurr said:
Also,
Edited /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs split_lock_detect=off

Is it all on one single line in the actual file? Here in the forum, at least, it doesn't look like it, but it needs to be: [1]

You can check the active/booted kernel command line with: cat /proc/cmdline

[1] https://pve.proxmox.com/wiki/Host_Bootloader#sysboot_edit_kernel_cmdline
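
Both checks can be scripted. The following is a rough sketch; the function names `cmdline_line_count` and `cmdline_has_param` are invented for this example, and it assumes parameters are separated by spaces or tabs.

```shell
#!/bin/sh
# Hypothetical helper: count the non-empty lines in a kernel
# command line file. For /etc/kernel/cmdline this should print 1.
cmdline_line_count() {
    grep -c . "$1"
}

# Hypothetical helper: succeed if PARAM appears as a whole
# whitespace-separated word in FILE.
cmdline_has_param() {
    tr -s ' \t' '\n\n' < "$1" | grep -qxF -- "$2"
}

# Example calls on the node:
# cmdline_line_count /etc/kernel/cmdline
# cmdline_has_param /proc/cmdline split_lock_detect=off && echo "active"
```

The second call looks at /proc/cmdline on purpose, since that reflects what the running kernel was actually booted with rather than what the file says.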

LionBurr · Dec 29, 2022

Neobin said:
Is it all on one single line in the actual file? Here in the forum, at least, it doesn't look like it, but it needs to be: [1]

You can check the active/booted kernel command line with: cat /proc/cmdline

[1] https://pve.proxmox.com/wiki/Host_Bootloader#sysboot_edit_kernel_cmdline

D'oh! It wasn't on a single line. Fixed. Thank you!


root@Jaguar:~# cat /proc/cmdline
initrd=\EFI\proxmox\5.15.74-1-pve\initrd.img-5.15.74-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs split_lock_detect=off

LionBurr · Dec 29, 2022

Also curious though - Is the commonality between systems crashing with split locks that KVM is running Windows Server guests? I've seen that mentioned a few times, and in the referenced thread. This issue took a month to show up, so I'm not sure if setting the kernel line option will resolve it or not...at least I won't know immediately.
