Sudden bulk stop of all VMs?

ProxyUser

Hi

Every so often, all of a sudden, all guest VMs terminate. Can you please help me troubleshoot why that happens?

I see "Bulk start VMs and Containers" in the log (screenshot attached: Proxmox.png).

dmesg on the Proxmox host shows only normal start-up messages. Nothing about why the VMs were terminated.

All VMs are running Windows 11.

Should I be looking anywhere else for logs?

root@pve:~# dmesg
[ 0.000000] Linux version 6.5.11-7-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-7 (2023-12-05T09:44Z) ()
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.11-7-pve root=/dev/mapper/pve-root ro quiet
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Hygon HygonGenuine
[ 0.000000] Centaur CentaurHauls
[ 0.000000] zhaoxin Shanghai
[ 0.000000] BIOS-provided physical RAM map:


Thanks in advance!

SMK
 
Are you sure they really get properly stopped? My guess would be they just get killed because your host is running out of RAM. Search the syslog for "oom" messages. And search it for "reboot" in case your whole server is crashing and restarting and therefore bulk starting the VMs.
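For example (a rough sketch; on current PVE the syslog lives in the journal):
Code:
# look for out-of-memory kills across all recorded boots
journalctl | grep -i oom
# look for unexpected restarts / watchdog activity
journalctl | grep -iE 'reboot|watchdog'
# list recorded boots; extra entries mean the whole host restarted
journalctl --list-boots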
 
Thank you @Dunuin

I am not sure if they get properly stopped. I was hoping the Proxmox log files would tell me that. It's random for sure. When I restart the Windows VMs, they seem to come back up just fine.

I do not think memory on the Proxmox host is the issue. I have plenty of unused RAM: 64 GB on the host, with 16 GB assigned for only 2 running VMs. journalctl | grep oom returns nothing.

Searching for reboot:
root@pve:~# journalctl | grep reboot
Jan 09 12:46:20 pve kernel: secureboot: Secure boot disabled
Jan 09 12:46:20 pve kernel: secureboot: Secure boot disabled
Jan 09 12:46:22 pve kernel: softdog: soft_reboot_cmd=<not set> soft_active_on_boot=0
Jan 09 12:46:24 pve cron[1542]: (CRON) INFO (Running @reboot jobs)

Around the time when guests stopped:
# journalctl
Jan 09 12:51:46 pve systemd[1]: Created slice qemu.slice - Slice /qemu.
Jan 09 12:51:46 pve systemd[1]: Started 100.scope.
Jan 09 12:51:46 pve kernel: kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable
Jan 09 12:51:46 pve pvedaemon[1580]: <root@pam> end task UPID:pve:00000974:0000814A:659DB1E1:qmstart:100:root@pam: OK
Jan 09 12:51:48 pve pvedaemon[1580]: <root@pam> starting task UPID:pve:000009FE:00008262:659DB1E4:qmstart:102:root@pam:
Jan 09 12:51:48 pve pvedaemon[2558]: start VM 102: UPID:pve:000009FE:00008262:659DB1E4:qmstart:102:root@pam:
Jan 09 12:51:49 pve systemd[1]: Started 102.scope.
Jan 09 12:51:49 pve pvedaemon[1580]: <root@pam> end task UPID:pve:000009FE:00008262:659DB1E4:qmstart:102:root@pam: OK
Jan 09 13:01:27 pve pvedaemon[1578]: <root@pam> successful auth for user 'root@pam'

Jan 09 13:01:27 pve systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Jan 09 13:01:27 pve systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Jan 09 13:01:27 pve systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Jan 09 13:01:27 pve systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Jan 09 13:16:28 pve pvedaemon[1580]: <root@pam> successful auth for user 'root@pam'

The restart of the guest VMs happened around the times shown above.

What might be going on?

Thanks in advance!
 
Hi,
can you please share the full output of journalctl -b (assuming you didn't reboot yet, otherwise you'll need to use -b-1 instead, or -b-2 if rebooted twice etc.)? Please also share the output of pveversion -v and VM configuration qm config <ID>.
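A minimal sketch of collecting those into files for attaching (VM ID 100 is just an example; repeat for each VM):
Code:
journalctl -b   > journalctl-b.txt     # current boot
journalctl -b-1 > journalctl-b-1.txt   # previous boot
pveversion -v   > pveversion.txt
qm config 100   > qm-config-100.txt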
 
Hi,
can you please share the full output of journalctl -b (assuming you didn't reboot yet, otherwise you'll need to use -b-1 instead, or -b-2 if rebooted twice etc.)? Please also share the output of pveversion -v and VM configuration qm config <ID>.
Thank you @fiona

I stand corrected: a host reboot did occur when this incident happened, at 12:46 on Jan 9th. The host reboot was caused by the issue.

Attaching the full output of journalctl -b, journalctl -b-1, pveversion -v and qm config

Best Regards
 

Attachments

  • journalctl -b.txt (186.3 KB)
  • journalctl -b-1.txt (238.9 KB)
  • pveversion.txt (1.4 KB)
  • qm config 100.txt (591 bytes)
Attaching the full output of journalctl -b, journalctl -b-1, pveversion -v and qm config
The last log lines for journalctl -b-1 are from 12:17:01, so about half an hour before the sudden crash/reset occurred. You could try connecting via SSH from a different host, running journalctl -f and waiting for the next crash. If you are lucky, there will be more log there (it might be that it never makes it to disk because of the sudden reset).

Do you have the latest CPU microcode and BIOS updates installed? For the former, see https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
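Roughly, for an AMD host on Proxmox VE 8 / Debian 12 this amounts to the following (a sketch only; see the linked chapter for the authoritative steps):
Code:
# make sure the 'non-free-firmware' component is enabled in the APT sources,
# then install the AMD microcode package and reboot
apt update
apt install amd64-microcode
# after the reboot, check which microcode revision is active
journalctl -k | grep -i microcode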
 
The last log lines for journalctl -b-1 are from 12:17:01, so about half an hour before the sudden crash/reset occurred. You could try connecting via SSH from a different host, running journalctl -f and waiting for the next crash. If you are lucky, there will be more log there (it might be that it never makes it to disk because of the sudden reset).

Do you have the latest CPU microcode and BIOS updates installed? For the former, see https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
Thank you for the guidance.

I too suspect logs are not making it to the disk because of the sudden reset.

I do not have the latest CPU microcode although I have a recent Ryzen 7950X CPU. I will apply the CPU microcode and BIOS updates now. I will observe and get back with my findings.
 
Confirmed that changing the guest VM processor type from host to qemu64 or x86-64-v4 fixes the random reboot problem. Yay! Problem almost solved!
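For reference, the CLI equivalent of that change (VM ID 100 is just an example; the VM needs a full stop/start afterwards for the new CPU type to apply):
Code:
qm set 100 --cpu x86-64-v4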

But then these new CPU types cause problems with nested virtualization.

Physical processor type is Ryzen 7950x.

Per documentation, is CPU type "host" the only one that has svm? What is the best guest CPU type with svm when the host is a Ryzen 7950X?
 
Confirmed that changing the guest VM processor type from host to qemu64 or x86-64-v4 fixes the random reboot problem. Yay! Problem almost solved!
Glad you were able to find a workaround!
But then these new CPU types cause problems with nested virtualization.

Physical processor type is Ryzen 7950x.

Per documentation, is CPU type "host" the only one that has svm? What is the best guest CPU type with svm when the host is a Ryzen 7950X?
Please see: https://forum.proxmox.com/threads/selecting-cpu-type-x86-64-v2-aes.142869/post-641945

The question is also whether the original issue was triggered by using nested virtualization or just by using the host CPU model.
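For context, the mechanism Proxmox VE documents for adding an extra flag such as svm to a predefined model is a custom CPU model in /etc/pve/virtual-guest/cpu-models.conf; a minimal, untested sketch (the model name is illustrative):
Code:
cpu-model: qemu64-svm
    flags +svm
    reported-model qemu64
A VM would then reference it as cpu: custom-qemu64-svm (e.g. qm set <ID> --cpu custom-qemu64-svm).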
 
Glad you were able to find a workaround!

Please see: https://forum.proxmox.com/threads/selecting-cpu-type-x86-64-v2-aes.142869/post-641945

The question is also whether the original issue was triggered by using nested virtualization or just by using the host CPU model.

The host reboot was not triggered by using nested virtualization, but rather just by using the host CPU model. Otherwise, random reboots are consistently reproducible by simultaneously running AIDA64 in 3+ Win 11 guest VMs when using the host CPU type.

- Without configuring a custom CPU (because they are not tested), is there any way to use CPU type kvm64/qemu64 with svm?
- How can I help get the crashes with CPU type host fixed?

Thank you, @fiona, you are awesome!

Best Regards
SMK
 
The host reboot was not triggered by using nested virtualization, but rather just by using the host CPU model. Otherwise, random reboots are consistently reproducible by simultaneously running AIDA64 in 3+ Win 11 guest VMs when using the host CPU type.

- Without configuring a custom CPU (because they are not tested), is there any way to use CPU type kvm64/qemu64 with svm?
I mean the combination of a predefined model with the svm flag is not explicitly tested by us. But QEMU does support using those models with the flags, so I'm pretty sure some people are using this. I'm not aware of any other way. I'd just give it a shot and see if it works for you.
- How can I help get the crashes with CPU type host fixed?
Did you already try getting a log using journalctl -f via SSH like suggested above? If you can find a kernel version that does not expose the issue, that would also help. Without logs or such additional info, we're really in the dark unfortunately. You could also try to gather traces from KVM, but those are huge if there is no hint what to filter for specifically.
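For example (a sketch; "pve" stands in for the host's name or address):
Code:
# run from a different machine so the lines survive the sudden reset
ssh root@pve 'journalctl -f' | tee pve-live-journal.log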
 
I mean the combination of a predefined model with the svm flag is not explicitly tested by us. But QEMU does support using those models with the flags, so I'm pretty sure some people are using this. I'm not aware of any other way. I'd just give it a shot and see if it works for you.

Did you already try getting a log using journalctl -f via SSH like suggested above? If you can find a kernel version that does not expose the issue, that would also help. Without logs or such additional info, we're really in the dark unfortunately. You could also try to gather traces from KVM, but those are huge if there is no hint what to filter for specifically.

I will test the combination of CPU type x86-64-v4 with svm and keep you posted.

I have attached 2 logs generated via SSH when using the host CPU type. I got these crashes when starting multiple VMs one right after another.

Also attached is the log file NoCrash-x86-64-v4-CPU, generated when using the x86-64-v4 CPU type, which resulted in no host reboot and a stable host.

Please let me know what these logs have to say.

Best Regards
SMK
 

Attachments

  • Crashed - Host CPU - Test 1.txt (8 KB)
  • Crashed - Host CPU - Test 2.txt (7.1 KB)
  • NoCrash-x86-64-v4-CPU.txt (6.9 KB)
Code:
Mar 12 08:50:24 pve kernel: kvm: vcpu 7: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 23: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 27: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 12: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 18: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 30: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 24: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 11: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 28: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 29: requested 59824 ns lapic timer period limited to 200000 ns
Code:
Mar 12 08:54:18 pve kernel: hrtimer: interrupt took 10685 ns

Assuming those happened right before the crashes, this might indicate that the issue is related to clocks/timers somehow, but it might also just be a symptom of the real issue. How many cores/sockets do you have assigned to your other VMs? What does the system load look like before the crash happens?
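One simple way to record the load until the crash, assuming the same over-SSH approach as for the journal:
Code:
# timestamped system statistics every 5 seconds (procps vmstat)
ssh root@pve 'vmstat -t 5' | tee pve-load.log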
 
Code:
Mar 12 08:50:24 pve kernel: kvm: vcpu 7: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 23: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 27: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 12: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 18: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 30: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 24: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 11: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 28: requested 59824 ns lapic timer period limited to 200000 ns
Mar 12 08:50:24 pve kernel: kvm: vcpu 29: requested 59824 ns lapic timer period limited to 200000 ns
Code:
Mar 12 08:54:18 pve kernel: hrtimer: interrupt took 10685 ns

Assuming those happened right before the crashes, this might indicate that the issue is related to clocks/timers somehow, but it might also just be a symptom of the real issue. How many cores/sockets do you have assigned to your other VMs? What does the system load look like before the crash happens?
Yes, these did happen right before the crash.

Physical host CPU is 1 socket: 16 cores / 32 threads (Ryzen 7950X).

VM1 Processors: 32 (1 socket, 32 cores)
VM2 Processors: 2 (1 socket, 2 cores)
VM3 Processors: 32 (1 socket, 32 cores)
VM4 Processors: 32 (1 socket, 32 cores)

I over-allocated based on this statement in the documentation: "It is perfectly safe if the overall number of cores of all your VMs is greater than the number of cores on the server."

But you are right: when I reduce the number of cores assigned to VMs such that the sum of cores assigned to all VMs does not exceed 16 (16 = my physical cores), the random host reboots are gone and the host is stable! With 16 cores assigned in total, stress testing with AIDA64 shows only 50% CPU load in the Proxmox web UI, which leads one to believe that it should be possible to go beyond 16 cores, but I just learnt the hard way that it is not.
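As a quick sanity check, this rough one-liner (standard qm/awk only) sums the cores setting of all VMs on the node; it ignores the sockets setting, which is 1 everywhere here:
Code:
for id in $(qm list | awk 'NR>1 {print $1}'); do
    qm config "$id" | awk '/^cores:/ {print $2}'
done | awk '{s+=$1} END {print "total assigned cores:", s}'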

Thank you @fiona for troubleshooting this all the way to resolution. You are awesome!

Can you please mark this thread "Solved" so others may benefit?

Best Regards
SMK
 
VM1 Processors: 32 (1 socket, 32 cores)
VM2 Processors: 2 (1 socket, 2 cores)
VM3 Processors: 32 (1 socket, 32 cores)
VM4 Processors: 32 (1 socket, 32 cores)

I over-allocated based on this statement in the documentation: "It is perfectly safe if the overall number of cores of all your VMs is greater than the number of cores on the server."
It is OK to have more total cores assigned than you physically have, but it is a bad idea to give all of your cores to one VM, and you have done that with three. I think if you just reduce those that are 32 down to 16, then you will be fine, even though the total will still be > 32.

Having all 32 assigned to one VM means that scheduling that VM leaves no resources for the host, which is going to cause problems. For example, how is disk IO supposed to get done?
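For reference, the CLI form of that change (GUI: Hardware > Processors; VM ID and value are just examples):
Code:
qm set 100 --sockets 1 --cores 16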
 
It is OK to have more total cores assigned than you physically have, but it is a bad idea to give all of your cores to one VM, and you have done that with three. I think if you just reduce those that are 32 down to 16, then you will be fine, even though the total will still be > 32.

Having all 32 assigned to one VM means that scheduling that VM leaves no resources for the host, which is going to cause problems. For example, how is disk IO supposed to get done?

Thanks @BobhWasatch. I've tested the following configs, which seem to indicate that the host always crashes once the sum of cores assigned to all VMs exceeds 16 (16 = my physical cores):

                      Host Crash  Host Crash  Host Crash  Host Crash  Host Crash  No Crash
VM 1                           8           8           8           6           6         4
VM 2                           2           2           2           2           2         2
VM 3                          12          10           8           8           6         4
VM 4                           6           6           6           6           6         6
Total assigned cores          28          26          24          22          20        16
 
Yes, these did happen right before the crash.

Physical host CPU is 1 socket: 16 cores / 32 threads (Ryzen 7950X).

VM1 Processors: 32 (1 socket, 32 cores)
VM2 Processors: 2 (1 socket, 2 cores)
VM3 Processors: 32 (1 socket, 32 cores)
VM4 Processors: 32 (1 socket, 32 cores)

I over-allocated based on this statement in the documentation: "It is perfectly safe if the overall number of cores of all your VMs is greater than the number of cores on the server."

But you are right: when I reduce the number of cores assigned to VMs such that the sum of cores assigned to all VMs does not exceed 16 (16 = my physical cores), the random host reboots are gone and the host is stable! With 16 cores assigned in total, stress testing with AIDA64 shows only 50% CPU load in the Proxmox web UI, which leads one to believe that it should be possible to go beyond 16 cores, but I just learnt the hard way that it is not.
Glad you were able to find a workaround. It's still a bit strange, but maybe it's some kind of quirk with your CPU and KVM interaction.

Can you please mark this thread "Solved" so others may benefit?
You can do this with the Edit Thread button above the first post and selecting the [SOLVED] prefix.
 
I have noticed random reboots recently. I have an AMD 5800X system with ECC memory and am also using the "host" type for the VMs. I only have 1 Windows 11 VM and 1 TrueNAS VM currently; the rest are LXC containers.

If I understand this correctly, if the total number of cores allocated across all VMs exceeds the physical core count of your CPU while using the host CPU type, and those VMs are stressed, this will cause the reboot.

In order to replicate the issue, has it been confirmed that you need to stress test at least 3 VMs with AIDA64? I tried with one Windows 11 VM, which has 8 cores assigned to it, and did not experience a crash.

I did check the logs but could not find anything obvious.
 
We also use the 7950X.
We had these crashes without any traces up until last year, and they stopped as soon as we stopped using the "host" CPU type for our Windows VMs. But now that we need nested virtualization again, we have gone back to using "host", and since yesterday the crashes have reappeared. Is there any experience with the "svm" flag for other CPU types and/or custom CPU types (like here)?
Or is the solution really to limit my hypervisor to not use more than its physically available cores when using CPU type "host"? That seems like a really odd solution.
Additionally, it is only the Windows VMs that cause the crashes; another host, which is also overprovisioned, has been running smoothly the whole time.
 
Just a random thought, but are you guys confident that the PSU is capable of handling a 7950X under sustained high load?

It could be that your PSU can only support the 7950X at full load for a limited amount of time, and then heat or some other factor starts to destabilize your PSU and everything goes down in a blaze of glory.
 
