Hello everyone.
We are having this issue on a couple of clusters.
VMs randomly freeze, with no network response and the CPU stuck between 100% and 102%.
When we start a live migration to a different node, the migration works flawlessly
and the VM starts working just fine on the new node.
While the VM is frozen, there is no log activity in the OS inside the VM.
The guest OS is various Ubuntu versions, from 12 to 20.
This also happens on different storage types: local-lvm, NFS, and LVM over an iSCSI LUN.
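For reference, the recovery is nothing more than a normal online migration; a minimal sketch of the CLI equivalent (the target node name "pve02" is only a placeholder, not one of our actual nodes):

Code:
# "pve02" is a hypothetical target node name; --online keeps the guest running during the move
qm migrate 7110 pve02 --online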
This is the version information from the node where the last occurrence happened:
# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.19.17-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-2
pve-kernel-5.15: 7.3-1
pve-kernel-5.19: 7.2-14
pve-kernel-5.19.17-1-pve: 5.19.17-1
pve-kernel-5.19.7-2-pve: 5.19.7-2
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u2
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-2
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.7-pve3
# qm config 7110
agent: 1
balloon: 8192
boot: order=virtio0;ide2
cores: 8
ide2: none,media=cdrom
memory: 65536
meta: creation-qemu=7.1.0,ctime=1675781477
name: statsdata03.sw
net0: virtio=C6:BB:7B:39:03:00,bridge=vmbr0,mtu=9000,tag=25
numa: 0
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=e1996e5b-0152-4590-8caf-83229d59e377
sockets: 1
virtio0: local-lvm:vm-7110-disk-0,backup=0,discard=on,iothread=1,size=3584G
vmgenid: 7785e57f-9800-4ab9-bf8b-81a5cf7024d9
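For completeness, the host-side state of the QEMU process can be checked while the guest is frozen; a minimal sketch, assuming the standard Proxmox pidfile location and this example VMID:

Code:
# list every thread of the VM's QEMU process with its scheduler state and kernel wait channel
ps -L -o pid,tid,psr,stat,wchan:20,comm -p "$(cat /var/run/qemu-server/7110.pid)"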
While the VM was frozen, we took some strace output from this VM and from others that were running fine.
This is the summary over 10 seconds; we found it strange:
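A summary like the two below can be produced by attaching strace to the guest's QEMU process for about 10 seconds; a sketch of the command, again assuming the standard Proxmox pidfile path:

Code:
# -c counts syscalls, -f follows all QEMU threads; the summary is printed when timeout detaches strace
timeout 10 strace -c -f -p "$(cat /var/run/qemu-server/7110.pid)"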
Frozen VM:
Code:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.52   32.214249        1482     21735           ppoll
  1.86    0.615001          40     15164           write
  0.38    0.125828          34      3700           recvmsg
  0.24    0.077628          19      3900           read
  0.00    0.000074           5        14         6 futex
  0.00    0.000069           0        73           sendmsg
  0.00    0.000022           1        16           close
  0.00    0.000017           1        15           accept4
  0.00    0.000010           0        30           fcntl
  0.00    0.000005           0        15           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00   33.032903         739     44662         6 total
Running VM:
Code:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.51    0.325470        1179       276           ioctl
 30.79    0.230316          11     19617           ppoll
 19.08    0.142736        3244        44         3 futex
  5.93    0.044384       44384         1         1 restart_syscall
  0.41    0.003098           1      2177           write
  0.12    0.000900           1       581           read
  0.11    0.000823           1       523           recvmsg
  0.01    0.000095          11         8           io_submit
  0.01    0.000075          75         1           clone
  0.00    0.000035           3        10           sendmsg
  0.00    0.000025           4         6           fdatasync
  0.00    0.000016           8         2           rt_sigprocmask
  0.00    0.000010           5         2           close
  0.00    0.000007           7         1           prctl
  0.00    0.000007           3         2           accept4
  0.00    0.000005           5         1           madvise
  0.00    0.000003           1         2           getsockname
  0.00    0.000003           0         4           fcntl
  0.00    0.000003           3         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.748011          32     23259         4 total
I hope someone has seen this before and can give us a clue.
It's a strange issue, especially because it gets resolved by simply migrating the VM.
Thanks.