[SOLVED] Windows VM shutdown/stop since migration to new Server

mircsicz

Hi all,

8 weeks ago I migrated a client from an HPE Gen8 DL360 to a new Gen10 machine... Since then the server randomly shuts down or stops, luckily only at night!

There is no cluster. On that machine there are 2 Linux VMs and 4 Win2012r2 VMs, and it's always the same two Windows servers giving me that issue...

There's absolutely no mention of the issue in the logs, or maybe I don't know what to look for.

So please give me a hand with this issue... From my perspective the machine is fully up to date:

Code:
root@pve:~# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-13 (running version: 7.0-13/7aa7e488)
pve-kernel-helper: 7.1-2
pve-kernel-5.11: 7.0-8
pve-kernel-5.11.22-5-pve: 5.11.22-10
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-10
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-12
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.11-1
proxmox-backup-file-restore: 2.0.11-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.1-1
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-4
pve-firmware: 3.3-2
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-16
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
 
Hi,
Check the CPU flags.
If you are using the "host" CPU model, you need to enable these flags manually...
If no model is declared, activate all the CPU flags that fit this machine (do not enable the AMD flags if you're using an Intel CPU ^^)
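For example, the model and the flags can also be set on the CLI instead of the GUI checkboxes. Just a sketch, with a placeholder VMID (102) and two common Intel-side flags picked as examples:

Code:
# set a named CPU model and explicitly enable extra flags for VM 102
# (flags are separated by ';', so the value needs quoting in the shell)
qm set 102 --cpu 'cputype=kvm64,flags=+pcid;+spec-ctrl'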
 
THX for the hint, but they are the same on all of the machines. As there's no cluster, what would be the suggested model?

Reply to myself:

Here's what could be interpreted as best practice:

In short, if you care about live migration and moving VMs between nodes, leave the kvm64 default. If you don’t care about live migration or have a homogeneous cluster where all nodes have the same CPU, set the CPU type to host, as in theory this will give your guests maximum performance.
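To see what the affected guests are currently using (and to optionally fall back to the default), something like this should do - just a sketch with a placeholder VMID:

Code:
# show the configured CPU model of a guest (no output means the kvm64 default)
qm config 102 | grep '^cpu:'

# switch back to the migration-safe default model
qm set 102 --cpu kvm64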
 

Attachments

  • Bildschirmfoto 2021-10-25 um 07.27.38.png (66.8 KB)
  • Bildschirmfoto 2021-10-25 um 07.25.54.png (65.1 KB)
Hey,

You're right about live migration ;) Indeed, kvm & kvm64 are the recommended CPU models for a heterogeneous cluster.

BUT, if you use a defined CPU model you don't need to enable flags, they're activated by the selected model. If your model is "host", the flags need to be declared manually :)

Did you solve your problem?
 
Hi hi, I had one server running for a week now (but on host mode without flags), and last night after 7 days of uptime I was so f...ing happy... and this morning it crashed again...

EDIT: see the attached screenshot for what I changed the flags to ;-)
 

Attachments

  • Bildschirmfoto 2021-11-04 um 00.06.55.png (27.6 KB)
Even with those flags those two servers keep turning off randomly.

I'm ready to buy some support to get it fixed once and for all!
 

Attachments

  • Bildschirmfoto 2021-11-15 um 09.42.51.png (8.9 KB)
it's not quite clear to me what is shutting down - a VM? the VMs? the hypervisor host? in any case we'd need logs and more information ;)
 
@fabian THX for the reply.

I thought I was clear, seems I'm not... It's 2 VMs that randomly shut down at night. The host is totally fine!

Problem is I can't find any log entry on the host mentioning the shutdown.


Code:
root@pve:~# find /var/log/pve/tasks/ -name *qmstop*
/var/log/pve/tasks/C/UPID:pve:00053D49:000B45C5:6116A77C:qmstop:101:root@pam:
/var/log/pve/tasks/2/UPID:pve:000C50A9:0226E07B:611C2EA2:qmstop:101:root@pam:
/var/log/pve/tasks/5/UPID:pve:0035CC66:0012A142:6116DC15:qmstop:101:root@pam:
/var/log/pve/tasks/5/UPID:pve:0001F66F:000070E7:6116AD85:qmstop:101:root@pam:
/var/log/pve/tasks/E/UPID:pve:000B1FBC:0226A677:611C2E0E:qmstop:100:root@pam:
/var/log/pve/tasks/0/UPID:pve:000B615B:0226BA1D:611C2E40:qmstop:100:root@pam:
/var/log/pve/tasks/3/UPID:pve:003B8E4C:284447A5:617DC113:qmstop:104:root@pam:
/var/log/pve/tasks/3/UPID:pve:003F93DE:0009226B:6116A203:qmstop:102:root@pam:
/var/log/pve/tasks/A/UPID:pve:000256E6:0000985C:6116ADEA:qmstop:100:root@pam:

I'm pretty sure it's Windows just turning off! These are all the qmstop runs I could find...
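Side note for reading these task files: the fifth colon-separated field of a UPID is the task start time as hex epoch seconds, so the stop times can be decoded without opening each file. A small sketch using one of the UPIDs above:

Code:
# decode the start time of the most recent qmstop task of VM 104 (hex epoch is field 5)
upid='UPID:pve:003B8E4C:284447A5:617DC113:qmstop:104:root@pam:'
date -d "@$((16#$(echo "$upid" | cut -d: -f5)))"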

102 & 104 are the machines shutting down randomly:

Code:
root@pve:~# find /var/log/pve/tasks/ -name *qmstart*|egrep "102|104"
/var/log/pve/tasks/C/UPID:pve:003B8823:284444EE:617DC10C:qmstart:104:root@pam:
/var/log/pve/tasks/1/UPID:pve:002F0262:05C9FA85:61257E41:qmstart:104:root@pam:
/var/log/pve/tasks/1/UPID:pve:000D33F7:000959B9:6116A291:qmstart:102:root@pam:
/var/log/pve/tasks/6/UPID:pve:000CFEBB:022713DC:611C2F26:qmstart:104:root@pam:
/var/log/pve/tasks/6/UPID:pve:00114BEE:0BF22E6D:61354156:qmstart:102:root@pam:
/var/log/pve/tasks/D/UPID:pve:002D93A9:000A3E7D:6116C69D:qmstart:104:root@pam:
/var/log/pve/tasks/D/UPID:pve:0034B0B6:0E305CD5:613AFF3D:qmstart:104:root@pam:
/var/log/pve/tasks/8/UPID:pve:001153E6:24529210:6173A838:qmstart:104:root@pam:
/var/log/pve/tasks/8/UPID:pve:003B53D5:25DB3F16:61779578:qmstart:104:root@pam:
/var/log/pve/tasks/8/UPID:pve:000A7EB0:24AB3188:61748B18:qmstart:102:root@pam:
/var/log/pve/tasks/8/UPID:pve:0039F50C:2A86F315:61838A78:qmstart:102:root@pam:
/var/log/pve/tasks/8/UPID:pve:0004CC34:2A591345:61831508:qmstart:104:root@pam:
/var/log/pve/tasks/2/UPID:pve:0000CF74:00001225:6116AC92:qmstart:102:root@pam:
/var/log/pve/tasks/2/UPID:pve:003D97C3:1E7DC336:6164BAA2:qmstart:102:root@pam:
/var/log/pve/tasks/2/UPID:pve:00174920:15383941:614CFEE2:qmstart:104:root@pam:
/var/log/pve/tasks/2/UPID:pve:000D1B66:0C70977C:613684F2:qmstart:104:root@pam:
/var/log/pve/tasks/5/UPID:pve:002BFD26:2A02155C:61823655:qmstart:104:root@pam:
/var/log/pve/tasks/5/UPID:pve:000726BC:08923D9A:612C9DA5:qmstart:104:root@pam:
/var/log/pve/tasks/5/UPID:pve:001E50DA:07D3CF4C:612AB625:qmstart:102:root@pam:
/var/log/pve/tasks/5/UPID:pve:0031A71F:090C6EBA:612DD675:qmstart:104:root@pam:
/var/log/pve/tasks/E/UPID:pve:002F8869:26E9E28A:617A4A4E:qmstart:104:root@pam:
/var/log/pve/tasks/E/UPID:pve:000BD8CF:1686E2C4:6150579E:qmstart:102:root@pam:
/var/log/pve/tasks/E/UPID:pve:0022B925:302F6545:6192067E:qmstart:104:root@pam:
/var/log/pve/tasks/0/UPID:pve:003BAE24:28445F56:617DC150:qmstart:104:root@pam:
/var/log/pve/tasks/0/UPID:pve:003D4653:2C911495:6188C320:qmstart:104:root@pam:
/var/log/pve/tasks/0/UPID:pve:00287215:01D4A57E:611B5C20:qmstart:102:root@pam:
/var/log/pve/tasks/0/UPID:pve:003B6C3E:142ED0C9:614A5770:qmstart:104:root@pam:
/var/log/pve/tasks/F/UPID:pve:0027B47D:2DF41591:618C4FEF:qmstart:104:root@pam:
/var/log/pve/tasks/F/UPID:pve:00016106:2D9F6449:618B771F:qmstart:104:root@pam:
/var/log/pve/tasks/F/UPID:pve:001BAE02:29E1D288:6181E3BF:qmstart:102:root@pam:
/var/log/pve/tasks/4/UPID:pve:00082766:0224ED0B:611C29A4:qmstart:102:root@pam:
/var/log/pve/tasks/4/UPID:pve:000C7BC0:22C62D88:616FB174:qmstart:104:root@pam:
/var/log/pve/tasks/3/UPID:pve:0021407D:16BF0574:6150E753:qmstart:104:root@pam:
/var/log/pve/tasks/3/UPID:pve:002A957E:0A6ABF21:61315743:qmstart:104:root@pam:
/var/log/pve/tasks/3/UPID:pve:0032B16D:0970A012:612ED6F3:qmstart:104:root@pam:
/var/log/pve/tasks/3/UPID:pve:001375F1:2DEACFE8:618C3833:qmstart:102:root@pam:
/var/log/pve/tasks/9/UPID:pve:002EF066:1FACA79C:6167C209:qmstart:104:root@pam:
/var/log/pve/tasks/9/UPID:pve:003F9E13:1777B8E4:6152C029:qmstart:104:root@pam:
/var/log/pve/tasks/9/UPID:pve:003249FA:1A7DDF83:615A7D79:qmstart:104:root@pam:
/var/log/pve/tasks/9/UPID:pve:00130111:109A33F4:61412CE9:qmstart:104:root@pam:
/var/log/pve/tasks/9/UPID:pve:002DEB62:2348D331:6170FFE9:qmstart:102:root@pam:
/var/log/pve/tasks/9/UPID:pve:00210A9D:2557C0A4:617644D9:qmstart:102:root@pam:
/var/log/pve/tasks/A/UPID:pve:0020D84C:2F88AA4E:61905BAA:qmstart:104:root@pam:
/var/log/pve/tasks/A/UPID:pve:002D157D:0002B97E:6116919A:qmstart:102:root@pam:
 
well, I'd check the logs inside the VMs and the hypervisor journal for anything out of the ordinary (journalctl -b --since ... for example). if a VM is shutting down, it should be visible in the qmeventd logs. if it crashes, you should see a message about that in the journal as well.
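for example (just a sketch, adjust the time window and filters to the night in question):

Code:
# hypervisor journal of the current boot, starting yesterday
journalctl -b --since yesterday

# kernel messages only, filtered for OOM / kill traces
journalctl -k -b | grep -iE 'oom|killed process'

# qmeventd logs a cleanup entry whenever a guest process goes away
journalctl -b -u qmeventd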
 
Thx for pointing out "journalctl"

found it:


Code:
Oct 28 05:21:27 pve kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=qemu.slice,mems_allowed=0,global_oom,task_memcg=/qemu.slice/104.scope,task=kvm,pid=3888245,uid=0
Oct 28 05:21:27 pve kernel: Out of memory: Killed process 3888245 (kvm) total-vm:18476872kB, anon-rss:16814900kB, file-rss:3968kB, shmem-rss:4kB, UID:0 pgtables:34116kB oom_score_adj:0
Oct 28 05:21:27 pve systemd[1]: 104.scope: A process of this unit has been killed by the OOM killer.
Oct 28 05:21:27 pve kernel:  zd96: p1 p2
Oct 28 05:21:29 pve kernel: oom_reaper: reaped process 3888245 (kvm), now anon-rss:0kB, file-rss:36kB, shmem-rss:4kB
Oct 28 05:21:30 pve kernel:  zd112: p1 p2
Oct 28 05:21:30 pve kernel: vmbr0: port 9(tap104i0) entered disabled state
Oct 28 05:21:30 pve kernel: vmbr0: port 9(tap104i0) entered disabled state
Oct 28 05:21:30 pve systemd[1]: 104.scope: Succeeded.
Oct 28 05:21:30 pve systemd[1]: 104.scope: Consumed 13h 36min 45.712s CPU time.
Oct 28 05:21:31 pve qmeventd[2345577]: Starting cleanup for 104
Oct 28 05:21:31 pve qmeventd[2345577]: Finished cleanup for 104

So the system is running out of memory and the OOM killer kills the VM.

What I don't get is why: the VMs have a total of 60GB assigned, out of the 96GB available on the host...
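I guess I need the full memory picture of the host to figure that out; ZFS is in use here, so the ARC also takes its share of the 96GB beyond what is assigned to the VMs. A quick sketch of what I'd look at:

Code:
# overall host memory
free -h

# current ZFS ARC size and its upper limit (host RAM outside of the VMs)
awk '/^size|^c_max/ {printf "%-8s %6.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats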
 
Is there a way to find out why only two of the seven VMs were killed?

Code:
root@pve:~# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       100 pbx                  running    1024             16.00 754029
       101 mgr                  running    1024              7.00 823598
       102 dc                   running    16384            55.00 1275637
       103 fs                   running    12288            55.00 676650
       104 ex                   running    16384            64.00 2275647
       105 ms                   running    8192             50.00 2096741
       198 xmr8                 running    4096              8.00 3304511
 
the OOM killer tries to find appropriate candidate processes (by looking at memory usage, OOM scores, etc). it should print a summary of the state when it is triggered, that might give you a clue. but most likely they were the two VMs which used the most RAM at that point. basically the idea is "if I have to kill a process to get free memory, I kill something that gives me a lot of memory so I don't have to kill many processes"
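if you want to see what the OOM killer saw, something along these lines should do (just a sketch; on PVE the QEMU guest processes are simply named "kvm"):

Code:
# the full report the kernel wrote when the killer was triggered
journalctl -k | grep -A 40 'invoked oom-killer'

# resident memory and current OOM score of every running guest process
for pid in $(pgrep -x kvm); do
    printf 'pid=%-8s rss_kB=%-10s oom_score=%s\n' "$pid" \
        "$(awk '/^VmRSS/ {print $2}' /proc/"$pid"/status)" \
        "$(cat /proc/"$pid"/oom_score)"
done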
 
THX again, yes those two are the only VMs with 16GB of RAM assigned, all the others have less...
 
Hey,

If it's really necessary to have 16GB RAM in these 2 VMs, maybe try the ballooning technology?
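A sketch of what that would look like for one of the two guests (VMID 104 as an example): keep the 16GB ceiling but let the balloon shrink the guest towards a lower minimum when the host is under memory pressure. The Windows guests need the virtio balloon driver installed for this to have any effect.

Code:
# let VM 104 float between 8GB and 16GB depending on host memory pressure
qm set 104 --memory 16384 --balloon 8192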
 
