x86/split lock detection

stefan6
Jun 28, 2022
Hi,

last week we installed 6 new servers with the following hardware:

2xIntel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
512GB RAM
6x25G QSFP28 NICs
2x 1.92TB NVMe
+ NFS Storage (multiple NVMes here)

About 90% of the VMs we host are Windows Server 2019.

In dmesg we see the following:

[1459519.912125] x86/split lock detection: #AC: kvm/2275802 took a split_lock trap at address: 0xfffff8002d424d73
[1460057.912295] x86/split lock detection: #AC: kvm/2275793 took a split_lock trap at address: 0xfffff8002d424d73
[1460461.912321] x86/split lock detection: #AC: kvm/2275794 took a split_lock trap at address: 0xfffff8002d424d73
[1460846.693425] x86/split lock detection: #AC: kvm/2275799 took a split_lock trap at address: 0xfffff8002d424d73
[1460941.740885] x86/split lock detection: #AC: kvm/2279407 took a split_lock trap at address: 0xfffff8064bf0ed61
[1462788.772002] x86/split lock detection: #AC: kvm/2279360 took a split_lock trap at address: 0xfffff8031b4c4d73

From time to time, the Windows VMs lose connectivity to the storage where their disks are hosted. The network for the VMs then goes dead.

We use Proxmox Virtual Environment 7.2-5 with the newest updates and in cluster mode.

Example VM config:
Code:
balloon: 0
boot: order=virtio0;ide2;net0
cores: 16
ide2: storage02_disks:iso/ExchangeServer2019-x64-CU9.ISO,media=cdrom,size=6119220K
memory: 131072
meta: creation-qemu=6.2.0,ctime=1653902151
name: MX04
net0: virtio=AA:6D:F5:EF:3C:A5,bridge=vmbr1014,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=fb8ffb0a-b2f0-4dbe-bee6-fa900cc3720d
sockets: 1
virtio0: storage02_disks:110/vm-110-disk-0.qcow2,discard=on,size=100G
virtio1: storage02_disks:110/vm-110-disk-1.qcow2,discard=on,size=101G
virtio2: storage02_disks:110/vm-110-disk-2.qcow2,size=2T
vmgenid: fb6f740c-e2d2-4981-80a1-990610864358
 
Hi,

can you try disabling split lock detection and see if that improves the issue? To do so you need to set a kernel command line option. The manual details how to set these [1]. In your case you need to add split_lock_detect=off to turn off split lock detection.

If you are using systemd-boot, make sure that you enter the entire kernel command line as one line only!

[1]: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_edit_kernel_cmdline
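
If you are booting via GRUB, a minimal sketch of the change (keep any options you already have on that line and just append the new one):

Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet split_lock_detect=off"

Afterwards run update-grub and reboot; the kernel log should then report x86/split lock detection: disabled.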
 
Hi,

thanks for your answer. We installed these machines normally from your ISOs.

Also, I added the following in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet split_lock_detect=off"

Is that correct?
 
Hi,
thanks for your answer. We installed these machines normally from your ISOs.
"normal" is a bit tricky, the actual setup depends on your selection in the installer and other factors.

Also, I added the following in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet split_lock_detect=off"
Yes that should be correct, now run update-grub and reboot. You can then take a look at your journal with journalctl -b. You should see a line that says something like:
Jun 29 12:42:55 hostname kernel: x86/split lock detection: disabled

However, if you see a line like this:
Jun 29 12:42:55 hostname kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Then please double check whether you followed all steps correctly. Also, shortly before that line there should be an entry that shows the kernel command line, please post that if this doesn't work for you. It looks something like this:
Jun 29 12:42:55 hostname kernel: Command line: initrd=\EFI\proxmox\5.15.39-1-pve\initrd.img-5.15.39-1-pve quiet
 
I see these messages
Jul 05 11:56:26 host3 kernel: x86/split lock detection: #AC: kvm/797055 took a split_lock trap at address: 0xfffff80016761edd
as well, but I don't notice any storage or connectivity issue (VM's storage is located on local disks).
Should I be concerned about these messages?
 
Hi @ales,

if you don't see any issues with your VM, it should be fine to leave the kernel command line as is. Split lock detection *should* have a fairly minimal impact on most systems.

In an ideal world, you could change whatever software is running within your VM that causes the issue to not trigger split locks. Turning split lock detection off just reverts the kernel's behavior back to what it used to be before split lock detection was added.
 
Hello @stefan6, I am new to Proxmox and am also facing a similar problem with latest-gen Intel servers on the Proxmox 7.2-5 release. I am curious whether your problem is solved and, if so, how? Thanks in advance.
 
Hello @stefan6, I am new to Proxmox and am also facing a similar problem with latest-gen Intel servers on the Proxmox 7.2-5 release. I am curious whether your problem is solved and, if so, how? Thanks in advance.

I am also facing the same problems with a relatively new Intel system (2x Xeon Silver 4309Y).

We are getting huge amounts of split lock error messages.

We are also running Windows Server 2019. It's the only machine we left running for debugging purposes.

In addition, we can see that the whole system becomes unresponsive once every 7-14 days. Unfortunately, there are no exceptional syslog messages.
 
Getting the same issue here with a newly purchased mini PC with an 11th Gen Core i5-11320H.

Is there a way to find out which VM is causing it, and why? It's mostly the same address, but once in a while a different one. How do I find out which of the only two VMs I have is causing it? There are 4-5 different kvm numbers alternating/repeating.

Also, what's the risk/impact of disabling it? I think I read this is to prevent memory from one VM being read by another. Having that many attempts at reading other VM memory seems a bit problematic and cause for considerable concern. Is there a bug either in Proxmox or in 11th generation CPUs (or elsewhere)?
 
Hi,

Is there a way to find out which VM is causing it, and why? It's mostly the same address, but once in a while a different one. How do I find out which of the only two VMs I have is causing it? There are 4-5 different kvm numbers alternating/repeating.
You can use the PID of the KVM process to figure that out. Just run pstree -asp $pid (where $pid is the PID of the KVM process). You can find the PID in the logs, it's the number after "kvm/" in x86/split lock detection: #AC: kvm/2275802....
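
For example, a minimal sketch using the first PID from the log lines quoted earlier in this thread (the PID will of course differ on your system):

Code:
# 2275802 is the number after "kvm/" in the split lock log line
pstree -asp 2275802

The output should include the kvm process with its -id and -name arguments, which tell you which VM the PID belongs to.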

Also, what's the risk/impact of disabling it? I think I read this is to prevent memory from one VM being read by another. Having that many attempts at reading other VM memory seems a bit problematic and cause for considerable concern. Is there a bug either in Proxmox or in 11th generation CPUs (or elsewhere)?
Can you point me towards a source for that? As far as I can tell split locks (and, thus, their detection) occur because on x86 memory is not required to be aligned to a given word size (think of that as a unit of memory). Sometimes you will now encounter a situation where you want to read from memory in an atomic fashion (meaning without interruption by another thread/process) but the word you want to read spans two cache lines, which is only possible because the memory does not need to be aligned. This presents a problem because reading a cache line is typically atomic by nature but reading two after each other will allow other actors to change the data between reads. This means you need a mechanism to prevent such a scenario: split locks.

On x86 this mechanism is rather simple: just lock the entire memory bus. While this works, it is also really inefficient: typically reading from cache will cost somewhere between 2 and maybe 100 CPU cycles (approximately, you'll find different numbers floating around and it also depends on the cache level etc.). However, when you need to acquire a split lock first that will take 1000+ cycles. Hence, developers should take care that their programs observe memory alignment regardless of whether the CPU requires it or not.

You can read more about split lock detection in the description of the patch that added it [1].

[1]: https://lwn.net/Articles/810317/
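
If you want a rough feel for how often this happens on a given host, you can simply count the trap messages in the kernel log; a minimal sketch (the grep pattern just matches the log lines quoted above):

Code:
# count split lock traps logged since the current boot
journalctl -k -b | grep -c "took a split_lock trap"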
 
Thanks for the info. I don't know where I got the story about split lock preventing memory leakage. I can't find it and don't remember the keywords / threads I went down to get there. Perhaps I just misunderstood the info I found.

The issue is erratic. Sometimes there are none for several days, then a whole bunch. Today after a restart, using the pstree -asp $pid command for three separate KVMs showed a very concise indication of the VM in question for the first two (which are the only two VMs running):

Code:
systemd,1
  └─kvm,1620 -id 100 -name SophosCT.4.254 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/100.qmp,server=on,wait=off -mon ...
      └─{kvm},1660

and

Code:
systemd,1
  └─kvm,2063 -id 101 -name Windows-10-Pro -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/101.qmp,server=on,wait=off -mon ...
      └─{kvm},2128

but then the third offending kvm had a much lengthier story:

Code:
systemd,1
  ├─agetty,886 -o -p -- \\u --noclear tty1 linux
  ├─chronyd,890 -F 1
  │   └─chronyd,893 -F 1
  ├─cron,1552 -f
  ├─dbus-daemon,702 --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
  ├─dhclient,793 -pf /run/dhclient.vmbr1.pid -lf /var/lib/dhcp/dhclient.vmbr1.leases vmbr1
  │   ├─{dhclient},794
  │   ├─{dhclient},795
  │   └─{dhclient},796
  ├─dmeventd,438 -f
  │   ├─{dmeventd},441
  │   └─{dmeventd},442
  ├─iscsid,863
  ├─iscsid,864
  ├─kvm,1620 -id 100 -name SophosCT.4.254 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/100.qmp,server=on,wait=off -mon ...
  │   ├─{kvm},1621
  │   ├─{kvm},1660
  │   ├─{kvm},1661
  │   ├─{kvm},1662
  │   ├─{kvm},1663
  │   ├─{kvm},1666
  │   ├─{kvm},12584
  │   ├─{kvm},12711
  │   ├─{kvm},12729
  │   ├─{kvm},12857
  │   ├─{kvm},12942
  │   ├─{kvm},12943
  │   ├─{kvm},12951
  │   └─{kvm},12952
  ├─kvm,2063 -id 101 -name Windows-10-Pro -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/101.qmp,server=on,wait=off -mon ...
  │   ├─{kvm},2064
  │   ├─{kvm},2128
  │   ├─{kvm},2129
  │   ├─{kvm},2130
  │   ├─{kvm},2131
  │   ├─{kvm},2132
  │   ├─{kvm},2133
  │   ├─{kvm},2136
  │   ├─{kvm},2138
  │   ├─{kvm},12345
  │   ├─{kvm},12950
  │   └─{kvm},12953
  ├─lxc-monitord,845 --daemon
  ├─lxcfs,704 /var/lib/lxcfs
  │   ├─{lxcfs},712
  │   └─{lxcfs},714
  ├─master,1545 -w
  │   ├─pickup,1546 -l -t unix -u -c
  │   └─qmgr,1547 -l -t unix -u
  ├─pmxcfs,926
  │   ├─{pmxcfs},927
  │   ├─{pmxcfs},1549
  │   ├─{pmxcfs},1550
  │   ├─{pmxcfs},1551
  │   └─{pmxcfs},1600
  ├─pve-firewall,1560
  ├─pve-ha-crm,1595
  ├─pve-ha-lrm,1605
  ├─pve-lxc-syscall,705 --system /run/pve/lxc-syscalld.sock
  │   ├─{pve-lxc-syscall},713
  │   ├─{pve-lxc-syscall},715
  │   ├─{pve-lxc-syscall},716
  │   └─{pve-lxc-syscall},717
  ├─pvedaemon,1587
  │   ├─pvedaemon worke,1588
  │   │   └─task UPID:thibw,9343
  │   │       └─termproxy,9344 5900 --path /nodes/thibworldpx2 --perm Sys.Console -- /bin/login -f root
  │   │           └─login,9353 -f
  │   │               └─bash,9374
  │   │                   └─pstree,12954 -asp 169318
  │   ├─pvedaemon worke,1589
  │   └─pvedaemon worke,1590
  ├─pvefw-logger,700
  │   └─{pvefw-logger},701
  ├─pveproxy,1596
  │   ├─pveproxy worker,1597
  │   ├─pveproxy worker,1598
  │   └─pveproxy worker,1599
  ├─pvescheduler,2140
  ├─pvestatd,1559
  ├─qmeventd,709 /var/run/qmeventd.sock
  ├─rpcbind,690 -f -w
  ├─rrdcached,906 -B -b /var/lib/rrdcached/db/ -j /var/lib/rrdcached/journal/ -p /var/run/rrdcached.pid -l unix:/var/run/rrdcached.sock
  │   ├─{rrdcached},907
  │   ├─{rrdcached},914
  │   ├─{rrdcached},915
  │   ├─{rrdcached},916
  │   ├─{rrdcached},917
  │   ├─{rrdcached},918
  │   ├─{rrdcached},1814
  │   └─{rrdcached},8021
  ├─rsyslogd,707 -n -iNONE
  │   ├─{rsyslogd},728
  │   ├─{rsyslogd},729
  │   └─{rsyslogd},730
  ├─smartd,708 -n
  ├─spiceproxy,1603
  │   └─spiceproxy work,1604
  ├─sshd,872
  ├─systemd,9359 --user
  │   └─(sd-pam),9360
  ├─systemd-journal,426
  ├─systemd-logind,711
  ├─systemd-udevd,450
  ├─veeamservice,991 --pidfile /var/run/veeamservice.pid --daemon
  │   ├─{veeamservice},994
  │   ├─{veeamservice},995
  │   ├─{veeamservice},996
  │   ├─{veeamservice},997
  │   ├─{veeamservice},998
  │   ├─{veeamservice},999
  │   ├─{veeamservice},1000
  │   ├─{veeamservice},1001
  │   ├─{veeamservice},1002
  │   ├─{veeamservice},1003
  │   ├─{veeamservice},1004
  │   ├─{veeamservice},1005
  │   ├─{veeamservice},1006
  │   ├─{veeamservice},1007
  │   ├─{veeamservice},1008
  │   ├─{veeamservice},1009
  │   ├─{veeamservice},1010
  │   ├─{veeamservice},1011
  │   ├─{veeamservice},1012
  │   ├─{veeamservice},1013
  │   ├─{veeamservice},1014
  │   ├─{veeamservice},1015
  │   ├─{veeamservice},1016
  │   ├─{veeamservice},1017
  │   ├─{veeamservice},1018
  │   ├─{veeamservice},1019
  │   ├─{veeamservice},1020
  │   ├─{veeamservice},1021
  │   ├─{veeamservice},1022
  │   ├─{veeamservice},1023
  │   ├─{veeamservice},1024
  │   ├─{veeamservice},1025
  │   ├─{veeamservice},1026
  │   ├─{veeamservice},1027
  │   ├─{veeamservice},1028
  │   ├─{veeamservice},1029
  │   ├─{veeamservice},1030
  │   ├─{veeamservice},1031
  │   ├─{veeamservice},1032
  │   ├─{veeamservice},1033
  │   ├─{veeamservice},1067
  │   ├─{veeamservice},1068
  │   ├─{veeamservice},1069
  │   └─{veeamservice},1070
  ├─watchdog-mux,718
  └─zed,720 -F
      ├─{zed},726
      └─{zed},727

The third one looks to me like the Proxmox OS has encountered the issue - is that right? (Does it cause the issue or encounter the issue?)

I suspect it's having an impact on performance. Is the message just a warning of the issue that, in fact, does nothing more than indicate it's happening? Or is the kernel doing something about it that protects the system?

It seems like I could tell the kernel to "Send SIGBUS to applications that cause split lock" and that this would kill the application, which could mean killing the VM in question? Is that right?

I have two different machines configured the same way, yet one does not report split locks (or maybe they're just very rare and I haven't caught it). From this and from the article you reference, this appears to be all hardware-level stuff. Is that correct? Is it because the hardware doesn't have virtualization support - or has a flaw in its support for virtualization?

Thanks again
 
The third one looks to me like the Proxmox OS has encountered the issue - is that right? (Does it cause the issue or encounter the issue?)
Could you please tell what the pid in question was? This output looks like you entered "1" as the pid, which would mean you printed the process tree for systemd. I don't see how that would show up as a log of the format "x86/split lock detection: #AC: kvm/2275802...". So maybe also post the logs, please. Thanks!

I suspect it's having an impact on performance. Is the message just a warning of the issue that, in fact, does nothing more than indicate it's happening? Or is the kernel doing something about it that protects the system?

It seems like I could tell the kernel to "Send SIGBUS to applications that cause split lock" and that this would kill the application, which could mean killing the VM in question? Is that right?
Split lock detection by default should be cosmetic and just add more lines to your journal. Turning it to "fatal" may kill the processes in question, but that really depends on how "SIGBUS" is handled by them. Usually the program will be terminated, though.

I have two different machines configured the same way, yet one does not report split locks (or maybe they're just very rare and I haven't caught it). From this and from the article you reference, this appears to be all hardware-level stuff. Is that correct? Is it because the hardware doesn't have virtualization support - or has a flaw in its support for virtualization?
Yes, this is all relatively low-level stuff. If your CPU isn't new enough, it just doesn't support split lock detection. It would need to be from the last 2-3 years or so. You can check whether it supports split lock detection with lscpu | grep split_lock_detect.
 
Could you please tell what the pid in question was?
The first two PIDs were 1660 and 2128. The third one was 169318. I ran the command again, making sure there were no typos, and got the same results.

Split lock detection by default should be cosmetic and just add more lines to your journal. Turning it to "fatal" may kill the processes in question, but that really depends on how "SIGBUS" is handled by them. Usually the program will be terminated, though.


Yes, this is all relatively low-level stuff. If your CPU isn't new enough, it just doesn't support split lock detection. It would need to be from the last 2-3 years or so. You can check whether it supports split lock detection with lscpu | grep split_lock_detect.
The problem system has an 11th Gen Intel Core i5-11320H, launched Q2 '21.

see https://ark.intel.com/content/www/u...ocessor-8m-cache-up-to-4-50-ghz-with-ipu.html

Code:
lscpu | grep split_lock_detect
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities


The one not reporting the issue is an i3-4030U, launched Q2 '14, and nothing is returned for lscpu | grep split_lock_detect.
 
The first two PIDs were 1660 and 2128. The third one was 169318. I ran the command again, making sure there were no typos, and got the same results.
Ah yeah, I guess that makes sense: the given process has already terminated or exited and doesn't exist anymore. So the command just gives you the entire pstree.

The problem system has an 11th Gen Intel Core i5-11320H, launched Q2 '21.
The one not reporting the issue is an i3-4030U, launched Q2 '14, and nothing is returned for lscpu | grep split_lock_detect.
So that's pretty much what you'd expect then. A CPU without split lock detection won't log anything - not necessarily because it doesn't encounter split locks, but mostly because the CPU does not support raising a split lock trap.
 
Thanks. What's the impact of leaving the detection on vs. turning it off? I read that with it on the kernel does an "OOPS" and #AC. I don't know what either is (as well as many of the other terms), as this is all over my head. Most importantly, will it lower the performance impact or raise it if I turn off detection? Should I test having the kernel send "SIGBUS"? I don't think so, but again this is beyond my knowledge. Please advise.
Thanks again
 
Thanks. What's the impact of leaving the detection on vs. turning it off?
if you don't see any issues with your VM, it should be fine to leave the kernel command line as is. Split lock detection *should* have a fairly minimal impact on most systems.

In an ideal world, you could change whatever software is running within your VM that causes the issue to not trigger split locks. Turning split lock detection off just reverts the kernel's behavior back to what it used to be before split lock detection was added.
@jaytee129 I basically already answered that in one of my previous comments :)

I read that with it on the kernel does an "OOPS" and #AC. I don't know what either is (as well as many of the other terms), as this is all over my head.
Here are some more details: If split lock detection isn't set to off (so either warn or fatal) an OOPs will occur. A kernel OOPs indicates a serious but recoverable issue that the kernel has encountered. This may contribute to system instability if split locks are turned on and, thus, if you encounter problems with split lock detection and instability, I'd recommend turning it off.

#AC is the alignment check exception, which you can think of as just a different word for a split lock trap. So basically, when the processor encounters a split lock it will raise this exception.
Most importantly, will it lower the performance impact or raise it if I turn off detection?
I'd assume that performance might improve, as the kernel does not have to handle the raised exception. As far as I can tell (and I'd happily be corrected), from a user perspective there isn't much you can do about split locks. It's really up to whoever developed and/or compiled the program causing the split lock detection to go off.
Should I test having the kernel send "SIGBUS"? I don't think so, but again this is beyond my knowledge. Please advise.
If you are already struggling with instability, turning split lock detection to "fatal" will probably make the situation worse. A process that is accidentally causing split locks is, imo, unlikely to handle "SIGBUS" events neatly. Thus, more instability might be the result.
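
For reference, a quick summary of the kernel parameter values discussed above, as I understand them from the kernel documentation (not an exhaustive list):

Code:
split_lock_detect=off    # no detection; the kernel behaves as it did before the feature existed
split_lock_detect=warn   # default on supported CPUs: log a warning about the offending process
split_lock_detect=fatal  # send SIGBUS to user-space processes that trigger a split lock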
 
Ok, I tried to turn off split lock detection using information from earlier in this thread, namely:


Also, I added the following in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet split_lock_detect=off"
Yes that should be correct, now run update-grub and reboot. You can then take a look at your journal with journalctl -b. You should see a line that says something like:

Jun 29 12:42:55 hostname kernel: x86/split lock detection: disabled
However, if you see a line like this:
Jun 29 12:42:55 hostname kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Then please double check whether you followed all steps correctly. Also, shortly before that line there should be an entry that shows the kernel command line, please post that if this doesn't work for you. It looks something like this:
Jun 29 12:42:55 hostname kernel: Command line: initrd=\EFI\proxmox\5.15.39-1-pve\initrd.img-5.15.39-1-pve quiet


Here's the line in my particular /etc/default/grub:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=pt split_lock_detect=off"

I did update-grub and rebooted.

The command journalctl -b returns nothing and the command lscpu | grep split returns split_lock_detect as one of the flags.


Code:
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities

What am I missing?
 
The command journalctl -b returns nothing
Hm, interesting - that should output the journal since the last boot. Your process tree from before also shows an active systemd-journal process, so maybe run systemctl status systemd-journald.service to check that it's fine?

You can also check the kernel command line of the current boot with cat /proc/cmdline. So maybe check that too.
 
Hm, interesting - that should output the journal since the last boot. Your process tree from before also shows an active systemd-journal process, so maybe run systemctl status systemd-journald.service to check that it's fine?

You can also check the kernel command line of the current boot with cat /proc/cmdline. So maybe check that too.

Sorry, my bad. I meant to say journalctl -b | grep split returned nothing.

Also, it worked after a reboot this morning. Maybe the system didn't reboot for some reason yesterday when I called for it, or it took two reboots, because it's there this morning.

Code:
journalctl -b | grep split
Jul 27 09:31:41 thibworldpx2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.39-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=pt split_lock_detect=off
Jul 27 09:31:41 thibworldpx2 kernel: x86/split lock detection: disabled
Jul 27 09:31:41 thibworldpx2 kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.39-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=pt split_lock_detect=off
Jul 27 09:31:41 thibworldpx2 kernel: Unknown kernel command line parameters "BOOT_IMAGE=/boot/vmlinuz-5.15.39-1-pve split_lock_detect=off", will be passed to user space.
Jul 27 09:31:41 thibworldpx2 kernel:     split_lock_detect=off

Not sure what happened there. Thanks for the quick reply.
 
Hi,

I'm running an HCI cluster with four nodes, and from time to time one node freezes completely. I changed the default GRUB file as recommended in the posts above, but I get the message

x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks

when I run journalctl -b

My boot command-line is:
Command line: initrd=\EFI\proxmox\5.15.30-2-pve\initrd.img-5.15.30-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs

/proc/cpuinfo shows:

processor : 47
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz
stepping : 6
microcode : 0xd0002e0
cpu MHz : 3383.006
cache size : 18432 KB
physical id : 1
siblings : 24
core id : 11
cpu cores : 12
apicid : 87
initial apicid : 87
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips : 6002.20
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 57 bits virtual
power management:


uname -a:

Linux alb-pve-02 5.15.30-2-pve #1 SMP PVE 5.15.30-3 (Fri, 22 Apr 2022 18:08:27 +0200) x86_64 GNU/Linux

Proxmox Version: 7.2-3 on a Supermicro Motherboard

Any suggestions?

best regards
Peter
 
