Guest ping failed for a VM

Corsin
Hi,

I have a VM where the guest ping from the node to the VM fails, and only rebooting the VM solves the problem. The QEMU guest agent was running in the machine, and restarting that service did not help either. The failing ping doesn't happen every day, only from time to time, and I could not find any deterministic pattern...
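When it happens again, the agent can be checked by hand roughly like this (a minimal sketch, assuming VM ID 129 as in the logs below and the standard Debian guest setup):
Code:
# From the host: talk to the agent directly (or: qm guest cmd 129 ping)
qm agent 129 ping
# Inside the guest: agent service state and the virtio-serial port it uses
systemctl status qemu-guest-agent
ls -l /dev/virtio-ports/org.qemu.guest_agent.0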

On the host I see the following in syslog (the ping failed for the first time at 00:58:02):
Code:
Dec 09 00:58:02 strg-2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=cmk-update-agent comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 09 00:58:02 strg-2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=cmk-update-agent comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 09 00:58:03 strg-2 pvedaemon[4102194]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - unable to connect to VM 129 qmp socket - timeout after 51 retries
Dec 09 00:58:04 strg-2 pvestatd[4335]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - unable to connect to VM 129 qmp socket - timeout after 51 retries
Dec 09 00:58:04 strg-2 pvedaemon[4115731]: VM 129 qmp command failed - VM 129 qmp command 'guest-ping' failed - got timeout
Dec 09 00:58:05 strg-2 pvestatd[4335]: status update time (8.669 seconds)
Dec 09 00:58:06 strg-2 audit[4166582]: NETFILTER_CFG table=filter family=7 entries=4 op=xt_replace pid=4166582 subj=unconfined comm="ebtables-restor"
Dec 09 00:58:06 strg-2 audit[4166582]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=5607a869d740 items=0 ppid=4336 pid=4166582 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebtables-restor" exe="/usr/sbin/ebtables-legacy-restore" subj=unconfined key=(null)
Dec 09 00:58:06 strg-2 audit: PROCTITLE proctitle="ebtables-restore"
...
Dec 09 04:35:14 strg-2 pvestatd[4335]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - unable to connect to VM 129 qmp socket - timeout after 51 retries
Dec 09 04:35:14 strg-2 pvestatd[4335]: status update time (8.678 seconds)
Dec 09 04:35:15 strg-2 pvedaemon[315834]: VM 129 qmp command failed - VM 129 qmp command 'guest-ping' failed - unable to connect to VM 129 qga socket - timeout after 31 retries
Dec 09 04:35:16 strg-2 audit[398406]: NETFILTER_CFG table=filter family=7 entries=4 op=xt_replace pid=398406 subj=unconfined comm="ebtables-restor"
Dec 09 04:35:16 strg-2 audit[398406]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=565339ed7740 items=0 ppid=4336 pid=398406 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebtables-restor" exe="/usr/sbin/ebtables-legacy-restore" subj=unconfined key=(null)
Dec 09 04:35:16 strg-2 audit: PROCTITLE proctitle="ebtables-restore"
Dec 09 04:35:22 strg-2 pvestatd[4335]: status update time (5.488 seconds)
Dec 09 04:35:26 strg-2 audit[398640]: NETFILTER_CFG table=filter family=7 entries=4 op=xt_replace pid=398640 subj=unconfined comm="ebtables-restor"
Dec 09 04:35:26 strg-2 audit[398640]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=5606309f9740 items=0 ppid=4336 pid=398640 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebtables-restor" exe="/usr/sbin/ebtables-legacy-restore" subj=unconfined key=(null)
Dec 09 04:35:26 strg-2 audit: PROCTITLE proctitle="ebtables-restore"
Dec 09 04:35:29 strg-2 pvedaemon[158581]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - got timeout
Dec 09 04:35:34 strg-2 pvedaemon[158581]: VM 129 qmp command failed - VM 129 qmp command 'guest-ping' failed - got timeout
Dec 09 04:35:34 strg-2 pvestatd[4335]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - unable to connect to VM 129 qmp socket - timeout after 51 retries
Dec 09 04:35:35 strg-2 pvestatd[4335]: status update time (8.669 seconds)
Dec 09 04:35:36 strg-2 audit[398873]: NETFILTER_CFG table=filter family=7 entries=4 op=xt_replace pid=398873 subj=unconfined comm="ebtables-restor"

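When these timeouts show up, the QMP/QGA sockets can also be probed by hand on the host (a rough sketch; the paths are the usual ones under /var/run/qemu-server on my node):
Code:
# Are the monitor and guest-agent sockets for VM 129 still present?
ls -l /var/run/qemu-server/129.qmp /var/run/qemu-server/129.qga
# Interactive QMP check via the human monitor
qm monitor 129
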
And on the VM itself I see the following:
Code:
2024-12-09T00:57:40.993870+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T00:57:51.505187+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T00:57:51.520772+01:00 diogenes systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
2024-12-09T00:57:56.577693+01:00 diogenes fstrim[40166]: /media/blobs/blob1: 479.1 GiB (514468536320 bytes) trimmed on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi4-part1
2024-12-09T00:57:56.577779+01:00 diogenes fstrim[40166]: /media/blobs/blob2: 488.2 GiB (524191854592 bytes) trimmed on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-part1
2024-12-09T00:57:56.577807+01:00 diogenes fstrim[40166]: /media/blobs/blob3: 981 GiB (1053306609664 bytes) trimmed on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part1
2024-12-09T00:57:56.577826+01:00 diogenes fstrim[40166]: /media/blobs/blob4: 447.6 GiB (480600289280 bytes) trimmed on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1-part1
2024-12-09T00:57:56.577843+01:00 diogenes fstrim[40166]: /var/tmp: 918.7 MiB (963297280 bytes) trimmed on /dev/mapper/OS-VARTMP
2024-12-09T00:57:56.577859+01:00 diogenes fstrim[40166]: /var/log/audit: 2.5 GiB (2653528064 bytes) trimmed on /dev/mapper/OS-AUDIT
2024-12-09T00:57:56.577877+01:00 diogenes fstrim[40166]: /var/log: 2.3 GiB (2505572352 bytes) trimmed on /dev/mapper/OS-LOG
2024-12-09T00:57:56.577890+01:00 diogenes fstrim[40166]: /var: 21.8 GiB (23382044672 bytes) trimmed on /dev/mapper/OS-VAR
2024-12-09T00:57:56.577906+01:00 diogenes fstrim[40166]: /tmp: 8.9 GiB (9521762304 bytes) trimmed on /dev/mapper/OS-TMP
2024-12-09T00:57:56.577920+01:00 diogenes fstrim[40166]: /home: 20.4 GiB (21925990400 bytes) trimmed on /dev/mapper/OS-HOME
2024-12-09T00:57:56.577936+01:00 diogenes fstrim[40166]: /boot/efi: 522.9 MiB (548352000 bytes) trimmed on /dev/sda1
2024-12-09T00:57:56.577949+01:00 diogenes fstrim[40166]: /boot: 327.5 MiB (343375872 bytes) trimmed on /dev/sda2
2024-12-09T00:57:56.577968+01:00 diogenes fstrim[40166]: /: 30.3 GiB (32583655424 bytes) trimmed on /dev/mapper/OS-os
2024-12-09T00:57:56.578231+01:00 diogenes systemd[1]: fstrim.service: Deactivated successfully.
2024-12-09T00:57:56.578295+01:00 diogenes systemd[1]: Finished fstrim.service - Discard unused blocks on filesystems from /etc/fstab.
2024-12-09T01:00:45.344398+01:00 diogenes kernel: [200336.540302] INFO: task jbd2/dm-2-8:676 blocked for more than 120 seconds.
2024-12-09T01:00:45.344416+01:00 diogenes kernel: [200336.540325]       Not tainted 6.1.0-28-amd64 #1 Debian 6.1.119-1
2024-12-09T01:00:45.344417+01:00 diogenes kernel: [200336.540338] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-12-09T01:00:45.344418+01:00 diogenes kernel: [200336.540352] task:jbd2/dm-2-8     state:D stack:0     pid:676   ppid:2      flags:0x00004000
2024-12-09T01:00:45.344419+01:00 diogenes kernel: [200336.540357] Call Trace:
2024-12-09T01:00:45.344419+01:00 diogenes kernel: [200336.540359]  <TASK>
2024-12-09T01:00:45.344420+01:00 diogenes kernel: [200336.540363]  __schedule+0x34d/0x9e0
2024-12-09T01:00:45.344420+01:00 diogenes kernel: [200336.540380]  schedule+0x5a/0xd0
2024-12-09T01:00:45.344421+01:00 diogenes kernel: [200336.540384]  io_schedule+0x42/0x70
2024-12-09T01:00:45.344421+01:00 diogenes kernel: [200336.540387]  bit_wait_io+0xd/0x60
2024-12-09T01:00:45.344422+01:00 diogenes kernel: [200336.540390]  __wait_on_bit+0x4b/0x140
2024-12-09T01:00:45.344422+01:00 diogenes kernel: [200336.540393]  ? bit_wait+0x60/0x60
2024-12-09T01:00:45.344422+01:00 diogenes kernel: [200336.540398]  out_of_line_wait_on_bit+0x91/0xb0
2024-12-09T01:00:45.344423+01:00 diogenes kernel: [200336.540402]  ? sugov_init+0x350/0x350
2024-12-09T01:00:45.344423+01:00 diogenes kernel: [200336.540408]  jbd2_journal_commit_transaction+0x164b/0x1a40 [jbd2]
2024-12-09T01:00:45.344424+01:00 diogenes kernel: [200336.540423]  ? srso_alias_return_thunk+0x5/0x7f
2024-12-09T01:00:45.344424+01:00 diogenes kernel: [200336.540427]  ? srso_alias_return_thunk+0x5/0x7f
2024-12-09T01:00:45.344424+01:00 diogenes kernel: [200336.540430]  ? finish_task_switch.isra.0+0x90/0x2d0
2024-12-09T01:00:45.344425+01:00 diogenes kernel: [200336.540436]  kjournald2+0xa9/0x280 [jbd2]
2024-12-09T01:00:45.344425+01:00 diogenes kernel: [200336.540445]  ? cpuusage_read+0x10/0x10
2024-12-09T01:00:45.344425+01:00 diogenes kernel: [200336.540450]  ? jbd2_fc_wait_bufs+0xa0/0xa0 [jbd2]
2024-12-09T01:00:45.344426+01:00 diogenes kernel: [200336.540457]  kthread+0xda/0x100
2024-12-09T01:00:45.344426+01:00 diogenes kernel: [200336.540461]  ? kthread_complete_and_exit+0x20/0x20
2024-12-09T01:00:45.344426+01:00 diogenes kernel: [200336.540464]  ret_from_fork+0x22/0x30
2024-12-09T01:00:45.344427+01:00 diogenes kernel: [200336.540472]  </TASK>
...
2024-12-09T01:02:46.176450+01:00 diogenes kernel: [200457.370942]  ? bit_wait+0x60/0x60
2024-12-09T01:02:46.176451+01:00 diogenes kernel: [200457.370946]  out_of_line_wait_on_bit+0x91/0xb0
2024-12-09T01:02:46.176451+01:00 diogenes kernel: [200457.370950]  ? sugov_init+0x350/0x350
2024-12-09T01:02:46.176451+01:00 diogenes kernel: [200457.370953]  jbd2_journal_commit_transaction+0x164b/0x1a40 [jbd2]
2024-12-09T01:02:46.176451+01:00 diogenes kernel: [200457.370965]  ? srso_alias_return_thunk+0x5/0x7f
2024-12-09T01:02:46.176452+01:00 diogenes kernel: [200457.370968]  ? srso_alias_return_thunk+0x5/0x7f
2024-12-09T01:02:46.176452+01:00 diogenes kernel: [200457.370970]  ? finish_task_switch.isra.0+0x90/0x2d0
2024-12-09T01:02:46.176452+01:00 diogenes kernel: [200457.370976]  kjournald2+0xa9/0x280 [jbd2]
2024-12-09T01:02:46.176452+01:00 diogenes kernel: [200457.370984]  ? cpuusage_read+0x10/0x10
2024-12-09T01:02:46.176453+01:00 diogenes kernel: [200457.370987]  ? jbd2_fc_wait_bufs+0xa0/0xa0 [jbd2]
2024-12-09T01:02:46.176453+01:00 diogenes kernel: [200457.370994]  kthread+0xda/0x100
2024-12-09T01:02:46.176453+01:00 diogenes kernel: [200457.370997]  ? kthread_complete_and_exit+0x20/0x20
2024-12-09T01:02:46.176453+01:00 diogenes kernel: [200457.371001]  ret_from_fork+0x22/0x30
2024-12-09T01:02:46.176454+01:00 diogenes kernel: [200457.371007]  </TASK>
2024-12-09T04:35:22.020852+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.020928+01:00 diogenes systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
2024-12-09T04:35:22.020961+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.021058+01:00 diogenes systemd[1]: systemd-journald.service: Killing process 362 (systemd-journal) with signal SIGABRT.
2024-12-09T04:35:22.021092+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.021169+01:00 diogenes dailyaidecheck[40960]: WARN: it is not possible to use mail(1) unless aide is run as root. See /usr/share/doc/aide-common/README.Debian.gz for more information.
2024-12-09T04:35:22.021194+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.021265+01:00 diogenes dailyaidecheck[41538]: fold: write error: Broken pipe
2024-12-09T04:35:22.021287+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.021335+01:00 diogenes qemu-ga: info: guest-ping called
...

Can it be related to the fstrim, or was it just a coincidence that the first failing ping came right after the fstrim? The five disks are on Ceph storage. I have a lot of other VMs where I don't see this kind of problem (they run different processes, of course, but their disks are also on the Ceph storage).
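
For context, this is how I would check which disks have discard enabled and when fstrim runs (a minimal sketch, assuming VM ID 129 and the standard fstrim.timer in the guest):
Code:
# On the host: which Ceph-backed disks of VM 129 have discard enabled?
qm config 129 | grep -E '^scsi[0-9]'
# In the guest: when does fstrim run, and what would it trim?
systemctl list-timers fstrim.timer
fstrim --fstab --verbose --dry-run
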
Used versions:
pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-4-pve)
QEMU Guest Agent 7.2.13

As a side note:
I'm running a Nexus instance in that VM where a cleanup task was running at 01:00. Maybe that could also be part of the problem?
Code:
2024-12-09 01:00:00,001+0100 INFO  [quartz-9-thread-20]  *SYSTEM org.sonatype.nexus.quartz.internal.task.QuartzTaskInfo - Task 'Cleanup unused maven2 blobs from nexus' [assetBlob.cleanup] state change WAITING -> RUNNING
2024-12-09 01:00:00,002+0100 INFO  [quartz-9-thread-19]  *SYSTEM org.sonatype.nexus.quartz.internal.task.QuartzTaskInfo - Task 'Cleanup unused nuget blobs from nexus' [assetBlob.cleanup] state change WAITING -> RUNNING
...
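
If it helps, disk load inside the guest around that window could be checked like this (a sketch; it assumes the sysstat package is installed and enabled in the guest so historical data gets collected):
Code:
# Live view of per-device utilisation while the cleanup task runs
iostat -x 5
# Historical per-device statistics for the night in question (sa09 = 9th of the month)
sar -d -f /var/log/sysstat/sa09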


Does anyone have an idea what it could be or how I could solve it? Many thanks in advance.
 
