I've seen a lot of threads without a fix, but I'm not sure they are the same issue, because this is a fresh install from the 7.1-7 ISO, running a 2-node cluster while we finish testing before migrating the 3rd node.
The Ceph cluster runs with a health warning because we don't have the 3rd server free yet, but it works fine.
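For reference, this is roughly how I check the cluster and Ceph state (standard Proxmox/Ceph commands, nothing specific to this setup):

ceph -s              # overall Ceph health and the current warning
ceph health detail   # details behind the health warning
pvecm status         # Proxmox cluster membership and quorum with only 2 nodes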
We made VM backups on the old 4.4-1 cluster to an NFS NAS and then restored them here (rough commands after the list below). We also did a few fresh VM installs:
A few old and well tested Ubuntu/CentOS
One W2013SVR (in production for years)
A few new Ubuntu/CentOS.
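The backup/restore itself used the standard tools, roughly like this (VM ID, storage names, and archive name are only examples, not the real ones):

vzdump 101 --storage nfs-nas --mode snapshot --compress lzo
    # on the old 4.4-1 cluster: dump the VM to the NFS storage
qmrestore /mnt/pve/nfs-nas/dump/vzdump-qemu-101-....vma.lzo 101 --storage zurqui_ceph
    # on the new cluster: restore the archive onto the Ceph pool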
Everything worked fine for more than 24 hours, then the running VMs started timing out on ping; when we looked at them there was no console or remote access...
Rebooting a VM just gets stuck on boot (Linux gives VNC access but the VM does nothing); Windows says "Failed to run vncproxy".
Two VMs are still running; we have not rebooted them yet to avoid adding more issues.
The nodes have not been rebooted yet.
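For anyone wanting to check the same thing, these are the standard commands to look at a stuck guest from the node (VM ID 100 is just the example below):

qm status 100 --verbose   # detailed status as PVE sees it; the frozen guests still report 'running'
qm monitor 100            # opens the QEMU human monitor for that VM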
Some examples are shown below:
*************
VM 101 - Windows 2003 SVR (restored from backup)
Status running
HA State none
Node node00
CPU usage 0.00% of 8 CPU(s)
Memory usage 4.04% (2.58 GiB of 64.00 GiB)
bootdisk: ide0
cores: 4
ide0: zurqui_ceph:vm-101-disk-0,size=650G
ide1: zurqui_ceph:vm-101-disk-1,size=1T
ide2: cdrom,media=cdrom
memory: 65536
name: win
net0: e1000=E6:FF:30:F97:4A,bridge=vmbr0
numa: 0
onboot: 1
ostype: win8
scsihw: virtio-scsi-pci
smbios1: uuid=bb19a350-d155-4c08-9499-65f445d1b71b
sockets: 2
*************
Dec 27 12:42:47 node00 pvestatd[1977]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
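That QMP timeout means pvestatd cannot reach the QEMU process over its control socket. A quick manual check (the socket path is the standard /var/run/qemu-server/<vmid>.qmp; 101 is just the VM above, and the D-state check is a general hint, not something specific to this bug):

ls -l /var/run/qemu-server/101.qmp     # the QMP socket pvestatd is timing out on
ps -o pid,stat,wchan:30,cmd -C kvm     # STAT 'D' = uninterruptible I/O wait, i.e. the process is stuck on storage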
*************
VM 100 - Fresh CentOS 7 - stuck on "Booting from Hard Disk..."
Status running
HA State none
Node node00
CPU usage 25.09% of 4 CPU(s) (stuck at this value)
Memory usage 0.14% (58.95 MiB of 40.00 GiB) (stuck at this value)
Bootdisk size 360.00 GiB
IPs No Guest Agent configured
agent: 0
boot: order=scsi0;ide2;net0
cores: 2
ide2: cdrom,media=cdrom
memory: 40960
meta: creation-qemu=6.1.0,ctime=1640579359
name: in
net0: virtio=C2:32:84:B8:ED:8F,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: zurqui_ceph:vm-100-disk-0,aio=native,size=360G
scsihw: virtio-scsi-pci
smbios1: uuid=c67bc33d-75c8-46c2-a257-ed7bf332af58
sockets: 2
vmgenid: cf18aad3-365a-4ec8-ad1d-7db410c60e9a
*************
VM 108 - Old Ubuntu - stuck on "Booting from Hard Disk..."
Status running
HA State none
Node node00
CPU usage 25.31% of 4 CPU(s)
Memory usage 0.61% (49.07 MiB of 7.81 GiB)
Bootdisk size 512.00 GiB
IPs No Guest Agent configured
***
agent: 0
bootdisk: scsi0
cores: 2
ide2: none,media=cdrom
memory: 8000
name: Naza
net0: virtio=12:A6:F7:2B:60:92,bridge=vmbr0
numa: 0
ostype: l26
scsi0: zurqui_ceph:vm-108-disk-0,size=512G
scsihw: virtio-scsi-pci
smbios1: uuid=be1faba3-bd4c-477f-85ae-eeeaff9acaa2
sockets: 2
*************
Linux node00 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100) x86_64 GNU/Linux
*************
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
***** U P D A T E ****
At some point the whole system became unstable. We tried to shut down, but after 3 hours nothing had happened, so the only way out was a power cycle.
After that, rbd reported: "rbd error: rbd: listing images failed: (2) No such file or directory (500)".
There was an image in the pool that was not shown by "rbd -p pool list"; it may have caused the post-reboot issue, but I have no idea about the first one.
With the command "rbd -p pool list --long" I could see it, and after removing it everything is working now (rough commands below). I'll wait a few days of normal work to see if I can replicate the error, otherwise I'll close the thread.
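For reference, roughly the commands involved in that cleanup (pool and image names are placeholders, not the real ones):

rbd -p <pool> list           # the listing that failed with (2) No such file or directory
rbd -p <pool> list --long    # this variant still showed the leftover image
rbd -p <pool> rm <image>     # removing it brought the listing back to normal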
***** U P D A T E T W O****
2021-12-30: Now we're getting the same issues again.
***** U P D A T E T H R E E****
2021-12-30, 8 pm
Dec 30 20:28:11 node00 ceph-osd[188013]: 2021-12-30T20:28:11.526-0600 7f7aad3c5700 -1 osd.1 471 get_health_metrics reporting 32 slow ops, oldest is osd_op(client.624146.0:1 1.0 1.bce88a6b (undecoded) ondisk+retry+read+known_if_redirected e470)
Now we are getting SLOW OPS. If I stop the OSD, the W2008K (VM 101) machine stops responding after a few seconds; once I start the OSD again, the slow ops count starts climbing again from 1.
I guess that when the errors pile up faster than they can be handled, the system freezes the way it did before.
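To see what is actually stuck on that OSD, the standard Ceph commands are (osd.1 taken from the log line above; the daemon command has to run on the node hosting that OSD):

ceph health detail                     # which OSDs are reporting slow ops and how many
ceph daemon osd.1 dump_ops_in_flight   # the individual ops currently stuck on osd.1, with their age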
I'll modify some parameters in the VM and reboot; if that doesn't work, I'll migrate it to the old cluster and see what happens. The kind of change I have in mind is sketched below.
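Just as an example of the syntax (not a confirmed fix; the disk definitions are the current ones from the configs above with a different cache/aio setting added):

qm set 101 --ide0 zurqui_ceph:vm-101-disk-0,size=650G,cache=writeback   # e.g. try a different cache mode on the Windows disk
qm set 100 --scsi0 zurqui_ceph:vm-100-disk-0,size=360G,aio=threads      # e.g. drop aio=native on the CentOS disk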
Is nobody else seeing similar issues? It's so weird to see this kind of error on a fresh install.