VM crashes after configuring CPU affinity

jaytee129

Member
Jun 16, 2022
142
10
23
I used the GUI in 7.3 to assign CPU affinity to a Windows 10 gaming VM with 8 vCPUs and a passed-through GPU. The VM worked fine before, but stuttered when playing games. With the affinity configured, the system crashed hard (multiple times) while simply downloading the latest update to a game. While it is a big download, the point is the VM didn't last very long.

Yes I did align the CPU affinity to a CCX on my AMD EPYC 7282 16c/32t CPU. Or at least I think I did. Maybe I got that wrong and it caused the crash?
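
In case it helps check that assumption, I believe the host CPUs that sit in the same CCX (i.e. share an L3) can be read out of sysfs, roughly like this (the CPU numbers are just the ones from my affinity line, and this assumes the usual Linux layout where cache index3 is the L3):

Code:
# For each host CPU in the affinity list, print which CPUs share its L3 cache.
for c in 12 13 14 15 28 29 30 31; do
    printf 'cpu%s shares its L3 with: ' "$c"
    cat /sys/devices/system/cpu/cpu$c/cache/index3/shared_cpu_list
done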

Here's the /etc/pve/qemu-server/<vmid>.conf

Code:
affinity: 12,13,14,15,28,29,30,31
agent: 1
bios: ovmf
boot: order=scsi0
cores: 8
cpu: EPYC-Rome
efidisk0: local-zfs:vm-201-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:c1:00,pcie=1
machine: pc-q35-7.1
memory: 12288
meta: creation-qemu=7.1.0,ctime=1672936369
name: GamerVM
net1: virtio=7E:2F:14:5D:F2:F6,bridge=vmbr1
numa: 0
onboot: 1
ostype: win10
scsi0: local-zfs:vm-201-disk-1,discard=on,iothread=1,size=250G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=50cd3213-74fe-412f-a65b-316cb02568e0
sockets: 1
vmgenid: 67eb436f-f0ec-40ee-b7fe-1f2d918b950d

Here's the lscpu -e output; the CPUs I assigned are 12-15 and 28-31, i.e. the ones sharing L3 cache 3:

Code:
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 2800.0000 1500.0000
1 0 0 1 1:1:1:0 yes 2800.0000 1500.0000
2 0 0 2 2:2:2:0 yes 2800.0000 1500.0000
3 0 0 3 3:3:3:0 yes 2800.0000 1500.0000
4 0 0 4 4:4:4:1 yes 2800.0000 1500.0000
5 0 0 5 5:5:5:1 yes 2800.0000 1500.0000
6 0 0 6 6:6:6:1 yes 2800.0000 1500.0000
7 0 0 7 7:7:7:1 yes 2800.0000 1500.0000
8 0 0 8 8:8:8:2 yes 2800.0000 1500.0000
9 0 0 9 9:9:9:2 yes 2800.0000 1500.0000
10 0 0 10 10:10:10:2 yes 2800.0000 1500.0000
11 0 0 11 11:11:11:2 yes 2800.0000 1500.0000
12 0 0 12 12:12:12:3 yes 2800.0000 1500.0000
13 0 0 13 13:13:13:3 yes 2800.0000 1500.0000
14 0 0 14 14:14:14:3 yes 2800.0000 1500.0000
15 0 0 15 15:15:15:3 yes 2800.0000 1500.0000

16 0 0 0 0:0:0:0 yes 2800.0000 1500.0000
17 0 0 1 1:1:1:0 yes 2800.0000 1500.0000
18 0 0 2 2:2:2:0 yes 2800.0000 1500.0000
19 0 0 3 3:3:3:0 yes 2800.0000 1500.0000
20 0 0 4 4:4:4:1 yes 2800.0000 1500.0000
21 0 0 5 5:5:5:1 yes 2800.0000 1500.0000
22 0 0 6 6:6:6:1 yes 2800.0000 1500.0000
23 0 0 7 7:7:7:1 yes 2800.0000 1500.0000
24 0 0 8 8:8:8:2 yes 2800.0000 1500.0000
25 0 0 9 9:9:9:2 yes 2800.0000 1500.0000
26 0 0 10 10:10:10:2 yes 2800.0000 1500.0000
27 0 0 11 11:11:11:2 yes 2800.0000 1500.0000
28 0 0 12 12:12:12:3 yes 2800.0000 1500.0000
29 0 0 13 13:13:13:3 yes 2800.0000 1500.0000
30 0 0 14 14:14:14:3 yes 2800.0000 1500.0000
31 0 0 15 15:15:15:3 yes 2800.0000 1500.0000
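
To double-check that the pinning actually lands on the running QEMU process, I understand its affinity can be queried directly; something like the following should work, assuming the PID file is in the usual /run/qemu-server location:

Code:
# Show the CPU affinity of the running QEMU process for VM 201.
taskset -cp "$(cat /run/qemu-server/201.pid)"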

Why is my VM crashing?
 
Hi,
please post the output of pveversion -v and provide /var/log/syslog from around the time the crash happens.
 
So it turns out the VM is no longer stable, even without CPU affinity. It was crashing regularly overnight. Not sure what happened. The main change I made (yesterday) was to add a UPS at the Proxmox/host level (which is generating a lot of syslog messages because of intermittent loss of connection that I also need to look into).

So it crashed at 7:25 AM this morning then again at 7:36 AM as I was scanning Windows Event Viewer looking for yesterday's crash times.

Here is the output of pveversion -v:

Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-2
pve-kernel-5.15: 7.3-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-2
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.7-pve3

Here is the syslog from just before to just after this morning's two crashes in a row:


Code:
Jan 20 07:17:01 thibworldpx5 CRON[630864]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 20 07:24:50 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:24:55 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:00 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:05 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:10 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:11 thibworldpx5 upssched[638040]: Event: commbad
Jan 20 07:25:11 thibworldpx5 upssched-cmd: UPS communication lost
Jan 20 07:25:15 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:20 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:25 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:30 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:35 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:40 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:25:41 thibworldpx5 upssched[638040]: Event: upsgone
Jan 20 07:25:41 thibworldpx5 upssched-cmd: UPS has been gone too long - restarting NUT driver
Jan 20 07:25:41 thibworldpx5 usbhid-ups[566303]: Signal 15: exiting
Jan 20 07:25:41 thibworldpx5 upsd[3936]: UPS [MyUPS] data is no longer stale
Jan 20 07:25:41 thibworldpx5 upsd[3936]: Can't connect to UPS [MyUPS] (usbhid-ups-MyUPS): No such file or directory
Jan 20 07:25:44 thibworldpx5 usbhid-ups[643509]: Startup successful
Jan 20 07:25:45 thibworldpx5 pvedaemon[607385]: <root@pam> successful auth for user 'root@pam'
Jan 20 07:25:45 thibworldpx5 upsd[3936]: Connected to UPS [MyUPS]: usbhid-ups-MyUPS
Jan 20 07:25:45 thibworldpx5 upsmon[3939]: Communications with UPS MyUPS@localhost established
Jan 20 07:25:45 thibworldpx5 upssched[638040]: New timer: commok (10 seconds)
Jan 20 07:25:45 thibworldpx5 upssched[638040]: Cancelling timer: upsgone2
Jan 20 07:25:55 thibworldpx5 upssched[638040]: Event: commok
Jan 20 07:25:55 thibworldpx5 upssched-cmd: UPS communication established; Charge: 100; Load: W
Jan 20 07:26:11 thibworldpx5 upssched[638040]: Timer queue empty, exiting
Jan 20 07:34:41 thibworldpx5 dhclient[2767]: DHCPREQUEST for 192.168.10.51 on vmbr1 to 192.168.10.1 port 67
Jan 20 07:34:41 thibworldpx5 dhclient[2767]: DHCPACK of 192.168.10.51 from 192.168.10.1
Jan 20 07:34:41 thibworldpx5 dhclient[2767]: bound to 192.168.10.51 -- renewal in 34506 seconds.
Jan 20 07:36:09 thibworldpx5 upsd[3936]: Data for UPS [MyUPS] is stale - check driver
Jan 20 07:36:10 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:10 thibworldpx5 upsmon[3939]: Communications with UPS MyUPS@localhost lost
Jan 20 07:36:10 thibworldpx5 upssched[654997]: Timer daemon started
Jan 20 07:36:11 thibworldpx5 upssched[654997]: New timer: commbad (30 seconds)
Jan 20 07:36:11 thibworldpx5 upssched[654997]: Cancel commok, event: 10
Jan 20 07:36:11 thibworldpx5 upssched-cmd: Unrecognized command: 10
Jan 20 07:36:11 thibworldpx5 upssched[654997]: New timer: upsgone (60 seconds)
Jan 20 07:36:11 thibworldpx5 upssched[654997]: New timer: upsgone2 (900 seconds)
Jan 20 07:36:15 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:20 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:25 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:30 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:35 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:40 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:41 thibworldpx5 upssched[654997]: Event: commbad
Jan 20 07:36:41 thibworldpx5 upssched-cmd: UPS communication lost
Jan 20 07:36:45 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:50 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:36:55 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:37:00 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:37:05 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:37:10 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Data stale
Jan 20 07:37:11 thibworldpx5 upssched[654997]: Event: upsgone
Jan 20 07:37:11 thibworldpx5 upssched-cmd: UPS has been gone too long - restarting NUT driver
Jan 20 07:37:11 thibworldpx5 usbhid-ups[643509]: Signal 15: exiting
Jan 20 07:37:11 thibworldpx5 upsd[3936]: UPS [MyUPS] data is no longer stale
Jan 20 07:37:11 thibworldpx5 upsd[3936]: Can't connect to UPS [MyUPS] (usbhid-ups-MyUPS): No such file or directory
Jan 20 07:37:15 thibworldpx5 kernel: [76047.012733] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
Jan 20 07:37:15 thibworldpx5 upsmon[3939]: Poll UPS [MyUPS@localhost] failed - Driver not connected
Jan 20 07:37:16 thibworldpx5 kernel: [76048.164557] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
Jan 20 07:37:16 thibworldpx5 usbhid-ups[660838]: Startup successful
Jan 20 07:37:17 thibworldpx5 upsd[3936]: Connected to UPS [MyUPS]: usbhid-ups-MyUPS
Jan 20 07:37:20 thibworldpx5 upsmon[3939]: Communications with UPS MyUPS@localhost established
Jan 20 07:37:20 thibworldpx5 upssched[654997]: New timer: commok (10 seconds)
Jan 20 07:37:20 thibworldpx5 upssched[654997]: Cancelling timer: upsgone2
Jan 20 07:37:30 thibworldpx5 upssched[654997]: Event: commok
 
I should mention that what happens is a reboot, not a hard stop/blue screen type event, although not what Windows considers a clean one. It seems more like a power-cycle type event.

Also, looking back through the Windows Event Viewer logs for the last 8-10 hours or so, it's been occurring every 11-12 minutes or so of uptime. With that many crashes now, could the VM's OS itself be corrupt and responsible for the crashes?
 
@fiona, any idea why my Windows VM is no longer stable after changing CPU affinity (using the GUI) and/or adding NUT for my UPS?
Can you correlate the crashes with certain events in the syslog? Maybe try to remove the UPS again and see if the VM is stable then. Another candidate might be the PCI passthrough, does removing that make the VM stable?
 
Can you correlate the crashes with certain events in the syslog? Maybe try to remove the UPS again and see if the VM is stable then. Another candidate might be the PCI passthrough, does removing that make the VM stable?
I've disabled all UPS-related services and removed the GPU that was passed through, but the VM still crashed. Unfortunately the syslog provides no clue - why is that? Does it point to Windows OS corruption? Is there a setting I can change to increase the messages that end up in syslog, e.g. a "debug" level or something?

The only event I can correlate with the VM starting to crash regularly is the addition of CPU affinity using the GUI, which I've since removed. Could it have caused a crash that corrupted the OS, so that it is now the OS that crashes rather than Proxmox killing the VM (despite messages in Windows Event Viewer saying the last reboot was due to a power failure)?

Are there other utilities/commands that can help isolate what happens around the time of the crash?
 
So I restored a backup of the VM from before I messed with the CPU config and all was fine - stable, no crashing - but then I changed the number of cores as a first step and the VM crashed after the usual 11-12 minutes or so.

Is there anything I need to do before or when changing the number of CPU cores?

Is there anything I can do to capture more data before and after making these changes?

Syslog did not capture anything related to the crash. For example, I started the VM at 17:32 and saw it blue screen in the console at 17:43.
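
For reference, the same window can also be pulled straight from the host journal with a time filter (journald is the default logger on PVE; the times below are just the ones from my test):

Code:
# Host-side log for the window around the crash (times are interpreted as today).
journalctl --since "17:30" --until "17:45"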

Here are the syslog entries from 17:32 until 17:44, just after the crash (I fixed my UPS issues, so there are no UPS-related messages cluttering things up this time):

Code:
Jan 23 17:32:00 thibworldpx5 pvedaemon[512029]: start VM 201: UPID:thibworldpx5:0007D01D:00034708:63CF0AE0:qmstart:201:root@pam:
Jan 23 17:32:00 thibworldpx5 pvedaemon[4127]: <root@pam> starting task UPID:thibworldpx5:0007D01D:00034708:63CF0AE0:qmstart:201:root@pam:
Jan 23 17:32:00 thibworldpx5 systemd[1]: Started 201.scope.
Jan 23 17:32:00 thibworldpx5 systemd-udevd[512050]: Using default interface naming scheme 'v247'.
Jan 23 17:32:00 thibworldpx5 systemd-udevd[512050]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.348741] device tap201i1 entered promiscuous mode
Jan 23 17:32:01 thibworldpx5 systemd-udevd[512049]: Using default interface naming scheme 'v247'.
Jan 23 17:32:01 thibworldpx5 systemd-udevd[512049]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 23 17:32:01 thibworldpx5 systemd-udevd[512050]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 23 17:32:01 thibworldpx5 systemd-udevd[512060]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 23 17:32:01 thibworldpx5 systemd-udevd[512060]: Using default interface naming scheme 'v247'.
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.384860] vmbr1: port 5(fwpr201p1) entered blocking state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.385409] vmbr1: port 5(fwpr201p1) entered disabled state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.386565] device fwpr201p1 entered promiscuous mode
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.387309] vmbr1: port 5(fwpr201p1) entered blocking state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.387956] vmbr1: port 5(fwpr201p1) entered forwarding state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.398144] fwbr201i1: port 1(fwln201i1) entered blocking state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.398775] fwbr201i1: port 1(fwln201i1) entered disabled state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.399390] device fwln201i1 entered promiscuous mode
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.399984] fwbr201i1: port 1(fwln201i1) entered blocking state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.400541] fwbr201i1: port 1(fwln201i1) entered forwarding state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.409540] fwbr201i1: port 2(tap201i1) entered blocking state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.410174] fwbr201i1: port 2(tap201i1) entered disabled state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.410935] fwbr201i1: port 2(tap201i1) entered blocking state
Jan 23 17:32:01 thibworldpx5 kernel: [ 2148.411561] fwbr201i1: port 2(tap201i1) entered forwarding state
Jan 23 17:32:02 thibworldpx5 kernel: [ 2149.952475] vfio-pci 0000:c1:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Jan 23 17:32:02 thibworldpx5 kernel: [ 2149.953076] vfio-pci 0000:c1:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Jan 23 17:32:02 thibworldpx5 kernel: [ 2149.953589] vfio-pci 0000:c1:00.0: vfio_ecap_init: hiding ecap 0x26@0xc1c
Jan 23 17:32:02 thibworldpx5 kernel: [ 2149.954076] vfio-pci 0000:c1:00.0: vfio_ecap_init: hiding ecap 0x27@0xd00
Jan 23 17:32:02 thibworldpx5 kernel: [ 2149.954552] vfio-pci 0000:c1:00.0: vfio_ecap_init: hiding ecap 0x25@0xe00
Jan 23 17:32:02 thibworldpx5 kernel: [ 2149.988429] vfio-pci 0000:c1:00.1: vfio_ecap_init: hiding ecap 0x25@0x160
Jan 23 17:32:04 thibworldpx5 pvedaemon[4127]: <root@pam> end task UPID:thibworldpx5:0007D01D:00034708:63CF0AE0:qmstart:201:root@pam: OK
Jan 23 17:32:07 thibworldpx5 pvedaemon[4124]: VM 201 qmp command failed - VM 201 qmp command 'guest-ping' failed - got timeout
Jan 23 17:32:49 thibworldpx5 pvedaemon[529468]: starting vnc proxy UPID:thibworldpx5:0008143C:00035A50:63CF0B11:vncproxy:201:root@pam:
Jan 23 17:32:49 thibworldpx5 pvedaemon[4127]: <root@pam> starting task UPID:thibworldpx5:0008143C:00035A50:63CF0B11:vncproxy:201:root@pam:
Jan 23 17:40:12 thibworldpx5 pveproxy[4158]: worker exit
Jan 23 17:40:12 thibworldpx5 pveproxy[4156]: worker 4158 finished
Jan 23 17:40:12 thibworldpx5 pveproxy[4156]: starting 1 worker(s)
Jan 23 17:40:12 thibworldpx5 pveproxy[4156]: worker 585270 started
Jan 23 17:41:20 thibworldpx5 pvedaemon[4124]: <root@pam> successful auth for user 'root@pam'

What is "VM 201 qmp command failed - VM 201 qmp command 'guest-ping' failed - got timeout"? Is that a clue, or unrelated? I haven't seen this message before.
 
I've disabled all UPS-related services and removed the GPU that was passed through, but the VM still crashed. Unfortunately the syslog provides no clue - why is that? Does it point to Windows OS corruption? Is there a setting I can change to increase the messages that end up in syslog, e.g. a "debug" level or something?

The only event I can correlate with the VM starting to crash regularly is the addition of CPU affinity using the GUI, which I've since removed. Could it have caused a crash that corrupted the OS, so that it is now the OS that crashes rather than Proxmox killing the VM (despite messages in Windows Event Viewer saying the last reboot was due to a power failure)?

Are there other utilities/commands that can help isolate what happens around the time of the crash?
Since the crash seems to happen within the VM and not for the QEMU process (that would show up in the syslog), looking inside the guest for logs is the next step.

So I restored a backup of the VM from before I messed with the CPU config and all was fine - stable, no crashing - but then I changed the number of cores as a first step and the VM crashed after the usual 11-12 minutes or so.
Just a guess, but maybe Windows doesn't like that the number of CPUs changed?

Is there anything I need to do before or when changing the number of CPU cores?

Is there anything I can do to capture more data before and after making these changes?
Logs inside the guest might give more hints.

What is "VM 201 qmp command failed - VM 201 qmp command 'guest-ping' failed - got timeout"? Is that a clue, or unrelated? I haven't seen this message before.
It means that the QEMU guest agent didn't respond which is natural if the guest crashed.
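
If you want to check the agent by hand, it can be pinged from the host; for example, something like this should return promptly while the agent is healthy and time out the same way once the guest has crashed:

Code:
# Ping the QEMU guest agent of VM 201 from the host.
qm guest cmd 201 ping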
 
Thanks. It's strange, because I've changed the core count on Windows VMs many times before without problems, although in this case the change was large - from 20 down to 8. Perhaps one can only make small changes?

As a test I restored the VM again, but this time changed the core count at restore time (using the option to do so in the GUI), and that seems to have worked.
 
Thanks. It's strange, because I've changed the core count on Windows VMs many times before without problems, although in this case the change was large - from 20 down to 8. Perhaps one can only make small changes?
I don't know. Maybe Windows logged some information why the bluescreen happened.

As a test I restored the VM again, but this time changed the core count at restore time (using the option to do so in the GUI), and that seems to have worked.
Since you don't have CPU hotplug active in the config, I assume that last time the change was also applied while the VM was shut down, so there shouldn't really be a difference there.
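
For reference (and as far as I know), hotplug would only be in play if the VM config contained a hotplug line that includes cpu, e.g. something like the following, which yours doesn't:

Code:
# Example hotplug setting in /etc/pve/qemu-server/<vmid>.conf; without 'cpu'
# in the list, core count changes only take effect on the next cold start.
hotplug: disk,network,usb,cpu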
 
I didn't get any usable info from the Windows logs; even attempts to create a crash dump failed for some reason.

Yes, I only changed the core count with the VM off.

I gave up on this VM and cloned another, relatively newly built one. I changed the core count and set CPU affinity, then reinstalled the pieces I needed, and it's been stable for at least 12 hours so far, well beyond the crash interval of the old one.

So I think I'll be okay, although I don't know the root cause. I'll take lots of snapshots and backups as I go along...
 
In case someone else runs into the same problem: I found something that may explain the issue I ran into, which hasn't happened since I stopped running a benchmarking tool (Novabench). The VMs never properly recover from having run a benchmark, so I followed the advice my doctor gives me when I say "It hurts when I do this": he says "Don't do that".

The Proxmox documentation (as of today) has the tip below. Note that that documentation page is apparently due to be updated, and the new page doesn't mention this - at least not yet (again, as of today).

NVIDIA Tips

Some Windows applications like GeForce Experience, Passmark Performance Test and SiSoftware Sandra can crash the VM. You need to add:

echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
 
