Unable to connect to VM qmp socket - timeout after 31 retries, and slow ops, on a fresh v7.1-7 install

ti2kol

New Member
Dec 27, 2021
I've seen a lot of threads about this without a fix, but I'm not sure it's the same issue, because we just did a fresh install from the 7.1 ISO (now on 7.1-7), running a 2-node cluster while we wait on testing to migrate the 3rd node.

The Ceph cluster runs with a health warning because we don't have the 3rd server free yet, but otherwise it works fine.

We made VM backups on 4.4-1 to an NFS NAS, then restored them here; we also did a few fresh VM installs.
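(For reference, the CLI equivalent of that backup/restore path is roughly the following sketch - the NFS storage name 'nfs-nas' and the archive timestamp are placeholders, not the real ones; 'zurqui_ceph' is the Ceph storage from the configs below:

# on the old 4.4-1 cluster: back up VM 101 to the NFS storage
vzdump 101 --storage nfs-nas --mode snapshot --compress lzo

# on the new 7.1-7 cluster: restore the archive onto the Ceph pool
qmrestore /mnt/pve/nfs-nas/dump/vzdump-qemu-101-<timestamp>.vma.lzo 101 --storage zurqui_ceph
)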

A few old and well-tested Ubuntu/CentOS VMs
One W2013SVR (in production for years)
A few new Ubuntu/CentOS VMs

Everything worked fine for more than 24 hours; then running VMs started getting ping timeouts, and when we looked at them there was no console or remote access...

Rebooting a VM just gets it stuck on boot (Linux VMs give VNC access but the VM does nothing; Windows says 'Failed to run vncproxy').

2 VMs remain running; we haven't rebooted them yet, to avoid adding more issues.

The nodes have not been rebooted yet.

Some examples:

*************
VM 101 - Windows 2003 SVR (restored from backup)
Status running
HA State none
Node node00
CPU usage 0.00% of 8 CPU(s)
Memory usage 4.04% (2.58 GiB of 64.00 GiB)

bootdisk: ide0
cores: 4
ide0: zurqui_ceph:vm-101-disk-0,size=650G
ide1: zurqui_ceph:vm-101-disk-1,size=1T
ide2: cdrom,media=cdrom
memory: 65536
name: win
net0: e1000=E6:FF:30:F9:D7:4A,bridge=vmbr0
numa: 0
onboot: 1
ostype: win8
scsihw: virtio-scsi-pci
smbios1: uuid=bb19a350-d155-4c08-9499-65f445d1b71b
sockets: 2

*************

Dec 27 12:42:47 node00 pvestatd[1977]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
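(For anyone else hitting this: the socket the error refers to is /var/run/qemu-server/<vmid>.qmp. A quick way to check whether the QEMU process still answers at all - a sketch, assuming VM 101 on the local node:

# check that the KVM process is alive and the socket file exists
qm status 101 --verbose
ls -l /var/run/qemu-server/101.qmp

# talk to QMP directly; a healthy VM prints a greeting banner on connect
socat - UNIX-CONNECT:/var/run/qemu-server/101.qmp
# then type {"execute":"qmp_capabilities"} and expect {"return": {}}

If QEMU's main loop is blocked, e.g. on storage I/O, the connect or the reply hangs, which matches the 'timeout after 31 retries' above.)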

*************
VM 100 - Fresh CentOS 7 - stuck on 'Booting from Hard Disk...'
Status running
HA State none
Node node00
CPU usage 25.09% of 4 CPU(s) (doesn't change)

Memory usage 0.14% (58.95 MiB of 40.00 GiB) (doesn't change)
Bootdisk size 360.00 GiB

IPs No Guest Agent configured

agent: 0
boot: order=scsi0;ide2;net0
cores: 2
ide2: cdrom,media=cdrom
memory: 40960
meta: creation-qemu=6.1.0,ctime=1640579359
name: in
net0: virtio=C2:32:84:B8:ED:8F,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: zurqui_ceph:vm-100-disk-0,aio=native,size=360G
scsihw: virtio-scsi-pci
smbios1: uuid=c67bc33d-75c8-46c2-a257-ed7bf332af58
sockets: 2
vmgenid: cf18aad3-365a-4ec8-ad1d-7db410c60e9a

*************

VM 108 - Old Ubuntu - stuck on 'Booting from Hard Disk...'

Status running
HA State none
Node node00
CPU usage 25.31% of 4 CPU(s)
Memory usage 0.61% (49.07 MiB of 7.81 GiB)
Bootdisk size 512.00 GiB
IPs No Guest Agent configured

***

agent: 0
bootdisk: scsi0
cores: 2
ide2: none,media=cdrom
memory: 8000
name: Naza
net0: virtio=12:A6:F7:2B:60:92,bridge=vmbr0
numa: 0
ostype: l26
scsi0: zurqui_ceph:vm-108-disk-0,size=512G
scsihw: virtio-scsi-pci
smbios1: uuid=be1faba3-bd4c-477f-85ae-eeeaff9acaa2
sockets: 2



*************

Linux node00 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100) x86_64 GNU/Linux

*************

proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3


***** U P D A T E ****

At some point the whole system became unstable. We tried to shut down, but after 3 hours nothing had happened, so the only way out was a power cycle.

After that, rbd reported: 'rbd error: rbd: listing images failed: (2) No such file or directory (500)'.
There was an image in the pool that didn't show up with 'rbd -p pool list'; it may have caused the post-reboot issue, but I have no idea about the first one.
With 'rbd -p pool list --long' I could see the image, and after removing it everything works now. I'll wait a few days of normal use to see whether the error can be reproduced, otherwise I'll close the thread.
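(For anyone searching later, the cleanup was roughly this - a sketch where 'pool' is the pool name and 'vm-XXX-disk-N' stands in for the stale image, whose real name I didn't keep:

# the plain listing failed / didn't show the stale image
rbd -p pool list

# the long listing opens each image and revealed the leftover one
rbd -p pool list --long

# remove the stale image
rbd -p pool rm vm-XXX-disk-N
)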


***** U P D A T E T W O****

2021-12-30: Now we've got the same issues again.


[Attached: two screenshots of node01 in the Proxmox web UI, 2021-12-30]


***** U P D A T E T H R E E****

2021-12-30, 8 pm

Dec 30 20:28:11 node00 ceph-osd[188013]: 2021-12-30T20:28:11.526-0600 7f7aad3c5700 -1 osd.1 471 get_health_metrics reporting 32 slow ops, oldest is osd_op(client.624146.0:1 1.0 1.bce88a6b (undecoded) ondisk+retry+read+known_if_redirected e470)

Now getting SLOW OPS. If I disable the OSD, the W2008K (VM 101) machine stops responding after a few seconds; once I start the OSD again, the slow ops count starts over from 1.

I guess that when there are too many errors to handle, the system freezes, as it did before.
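(A way to see what those slow ops actually are while they happen - a sketch, assuming osd.1 on node00 as in the log line above:

# cluster-wide detail on the slow ops warning
ceph health detail

# on the node hosting osd.1: dump the requests currently stuck in flight
ceph daemon osd.1 dump_ops_in_flight

# stop/start the OSD - this is what I mean by 'disable the osd'
systemctl stop ceph-osd@1
systemctl start ceph-osd@1
)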

I'll modify some parameters in the VM and reboot; if that doesn't work, I'll migrate it to the old cluster and see what happens.

Nobody with similar issues? It's so weird to see this kind of error on a fresh install.
 
FYI - I have been trying to get a new PVE 7.1 Ceph cluster install working for two weeks and have continued to encounter similar issues. On this cluster I do not have any VMs, just the four nodes (all Dell R730, dual E5-2660 10-core, 384 GB RAM, SSD RAID 1 for the OS, 17 non-RAID 1 TB SSDs for OSDs). I have reformatted and reinstalled everything three times. All looks good until I add about the 30th OSD, at which point I start getting pg errors and slow ops errors. This cluster is intended to replace a cluster running version 6 with 8 Dell R620 nodes. I am about to install version 6 on these machines to see what happens.
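(A minimal way to catch that tipping point while adding OSDs one at a time - standard Ceph commands, nothing PVE-specific:

# watch cluster state transitions live as each OSD comes in
ceph -w

# once pg or slow ops warnings appear, get the specifics
ceph health detail
ceph pg stat
)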
 
Hi Russel, thanks for the advice. Last night I just stopped all VMs and processes, rebooted, and yes, got the same errors on the OSDs.

I'm migrating from 4.4, which actually works fine, and it's so frustrating to see this on the new versions.

I will migrate the updated VMs to the old cluster, then destroy the pools and delete the partitions to start from scratch, and see what happens. I hope someone from the Proxmox staff can verify in the meantime, because it's weird.
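(The teardown I have in mind is roughly this sketch - pool name, OSD id, and device are placeholders, and 'zap --destroy' wipes the disk, so double-check the device first:

# remove the Ceph pool via the PVE tooling
pveceph pool destroy <poolname>

# for each OSD: stop it, destroy it, and wipe its disk for reuse
systemctl stop ceph-osd@<id>
pveceph osd destroy <id> --cleanup
ceph-volume lvm zap /dev/sdX --destroy
)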

For the tests I'm using:

R630 16 x Intel(R) Xeon(R) CPU X5550 @ 2.67GHz, 256G RAM, 3 x 10Gb and 2 x 1Gb NICs - 2 mirrored boot disks and 7 SSDs
R710 64 x Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz, 220G RAM, 2 x 10Gb and 4 x 1Gb NICs - 2 mirrored boot disks and 6 HDDs

These units work great on older Proxmox versions.
 
It turned out that my problem had to do with network performance after all. We use Juniper switches for the 10G network and had set the MTU to 9126 to match the MTU on the bonded ports of the servers, but the Juniper EX4550s had to be set to an MTU of 9216 for it to work correctly. We determined that it was a network performance issue by using iperf3 to test the network. We now have four Dell R730xd servers running the latest PVE 7 version with 68 OSDs (1 TB SSDs).
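(One quick way to confirm jumbo frames actually pass end to end - a sketch; 10.10.10.2 stands in for a peer node's address, and the ping payload is the MTU minus 28 bytes of IP+ICMP headers:

# with MTU 9126 on the hosts: 9126 - 28 = 9098 bytes, don't fragment
ping -M do -s 9098 10.10.10.2

If the switch side is set too small, these pings fail with 'message too long' or just time out, while ordinary small pings still work.)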
 
I'm still working on this without success. I'm now using a MikroTik CRS312, and I removed all working VMs, with the same results. So I see you are running MTU 9126 on the clients and 9216 on the switch?

How did you get those values?
 
The MTU values can be found in the switch documentation.
Update - I have since modified my bonds to use the standard Ethernet MTU of 1500, and did the same on the Juniper switches.
To determine whether an issue exists with the network, use iperf3.
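(A minimal iperf3 check between two nodes - a sketch; 10.10.10.1 stands in for the server node's address on the storage network:

# on node A
iperf3 -s

# on node B: 4 parallel streams for 30 seconds
iperf3 -c 10.10.10.1 -P 4 -t 30

A healthy 10G link reports somewhere near 9.4 Gbit/s; a large shortfall points at the network rather than at Ceph.)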
 
