Hi,
I have a VM where the ping from a node to the VM fails, and only rebooting the VM solves the problem. The QEMU guest agent was running on the machine, and restarting that service did not help either. The failing ping also doesn't happen every day, only from time to time, and I could not find any deterministic pattern...
On the host I see the following in the syslog (the ping failed for the first time at 00:58:02):
Code:
Dec 09 00:58:02 strg-2 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=cmk-update-agent comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 09 00:58:02 strg-2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=cmk-update-agent comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 09 00:58:03 strg-2 pvedaemon[4102194]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - unable to connect to VM 129 qmp socket - timeout after 51 retries
Dec 09 00:58:04 strg-2 pvestatd[4335]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - unable to connect to VM 129 qmp socket - timeout after 51 retries
Dec 09 00:58:04 strg-2 pvedaemon[4115731]: VM 129 qmp command failed - VM 129 qmp command 'guest-ping' failed - got timeout
Dec 09 00:58:05 strg-2 pvestatd[4335]: status update time (8.669 seconds)
Dec 09 00:58:06 strg-2 audit[4166582]: NETFILTER_CFG table=filter family=7 entries=4 op=xt_replace pid=4166582 subj=unconfined comm="ebtables-restor"
Dec 09 00:58:06 strg-2 audit[4166582]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=5607a869d740 items=0 ppid=4336 pid=4166582 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebtables-restor" exe="/usr/sbin/ebtables-legacy-restore" subj=unconfined key=(null)
Dec 09 00:58:06 strg-2 audit: PROCTITLE proctitle="ebtables-restore"
...
Dec 09 04:35:14 strg-2 pvestatd[4335]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - unable to connect to VM 129 qmp socket - timeout after 51 retries
Dec 09 04:35:14 strg-2 pvestatd[4335]: status update time (8.678 seconds)
Dec 09 04:35:15 strg-2 pvedaemon[315834]: VM 129 qmp command failed - VM 129 qmp command 'guest-ping' failed - unable to connect to VM 129 qga socket - timeout after 31 retries
Dec 09 04:35:16 strg-2 audit[398406]: NETFILTER_CFG table=filter family=7 entries=4 op=xt_replace pid=398406 subj=unconfined comm="ebtables-restor"
Dec 09 04:35:16 strg-2 audit[398406]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=565339ed7740 items=0 ppid=4336 pid=398406 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebtables-restor" exe="/usr/sbin/ebtables-legacy-restore" subj=unconfined key=(null)
Dec 09 04:35:16 strg-2 audit: PROCTITLE proctitle="ebtables-restore"
Dec 09 04:35:22 strg-2 pvestatd[4335]: status update time (5.488 seconds)
Dec 09 04:35:26 strg-2 audit[398640]: NETFILTER_CFG table=filter family=7 entries=4 op=xt_replace pid=398640 subj=unconfined comm="ebtables-restor"
Dec 09 04:35:26 strg-2 audit[398640]: SYSCALL arch=c000003e syscall=54 success=yes exit=0 a0=3 a1=0 a2=80 a3=5606309f9740 items=0 ppid=4336 pid=398640 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebtables-restor" exe="/usr/sbin/ebtables-legacy-restore" subj=unconfined key=(null)
Dec 09 04:35:26 strg-2 audit: PROCTITLE proctitle="ebtables-restore"
Dec 09 04:35:29 strg-2 pvedaemon[158581]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - got timeout
Dec 09 04:35:34 strg-2 pvedaemon[158581]: VM 129 qmp command failed - VM 129 qmp command 'guest-ping' failed - got timeout
Dec 09 04:35:34 strg-2 pvestatd[4335]: VM 129 qmp command failed - VM 129 qmp command 'query-proxmox-support' failed - unable to connect to VM 129 qmp socket - timeout after 51 retries
Dec 09 04:35:35 strg-2 pvestatd[4335]: status update time (8.669 seconds)
Dec 09 04:35:36 strg-2 audit[398873]: NETFILTER_CFG table=filter family=7 entries=4 op=xt_replace pid=398873 subj=unconfined comm="ebtables-restor"
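When this happens again, it might help to check whether QEMU itself still answers on its QMP socket, independent of pvestatd. A minimal probe sketch (assuming the standard Proxmox socket path `/var/run/qemu-server/<vmid>.qmp` and that `socat` is installed; VM ID 129 is taken from the log above):

```shell
# Probe a VM's QMP socket directly. Assumptions: Proxmox-style socket path
# and socat available on the host.
probe_qmp() {
    vmid="${1:-129}"
    sock="/var/run/qemu-server/${vmid}.qmp"
    if [ -S "$sock" ]; then
        # QMP sends a greeting on connect; qmp_capabilities must be
        # negotiated before any other command is accepted.
        printf '{"execute":"qmp_capabilities"}\n' | socat -t 5 - "UNIX-CONNECT:$sock"
    else
        echo "no QMP socket at $sock"
    fi
}
probe_qmp 129
```

If the socket answers here while `guest-ping` still times out, the hang is more likely inside the guest (which would fit the jbd2 stack traces in the guest log) than in QEMU on the host.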
And on the VM:
Code:
2024-12-09T00:57:40.993870+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T00:57:51.505187+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T00:57:51.520772+01:00 diogenes systemd[1]: Starting fstrim.service - Discard unused blocks on filesystems from /etc/fstab...
2024-12-09T00:57:56.577693+01:00 diogenes fstrim[40166]: /media/blobs/blob1: 479.1 GiB (514468536320 bytes) trimmed on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi4-part1
2024-12-09T00:57:56.577779+01:00 diogenes fstrim[40166]: /media/blobs/blob2: 488.2 GiB (524191854592 bytes) trimmed on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-part1
2024-12-09T00:57:56.577807+01:00 diogenes fstrim[40166]: /media/blobs/blob3: 981 GiB (1053306609664 bytes) trimmed on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part1
2024-12-09T00:57:56.577826+01:00 diogenes fstrim[40166]: /media/blobs/blob4: 447.6 GiB (480600289280 bytes) trimmed on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi1-part1
2024-12-09T00:57:56.577843+01:00 diogenes fstrim[40166]: /var/tmp: 918.7 MiB (963297280 bytes) trimmed on /dev/mapper/OS-VARTMP
2024-12-09T00:57:56.577859+01:00 diogenes fstrim[40166]: /var/log/audit: 2.5 GiB (2653528064 bytes) trimmed on /dev/mapper/OS-AUDIT
2024-12-09T00:57:56.577877+01:00 diogenes fstrim[40166]: /var/log: 2.3 GiB (2505572352 bytes) trimmed on /dev/mapper/OS-LOG
2024-12-09T00:57:56.577890+01:00 diogenes fstrim[40166]: /var: 21.8 GiB (23382044672 bytes) trimmed on /dev/mapper/OS-VAR
2024-12-09T00:57:56.577906+01:00 diogenes fstrim[40166]: /tmp: 8.9 GiB (9521762304 bytes) trimmed on /dev/mapper/OS-TMP
2024-12-09T00:57:56.577920+01:00 diogenes fstrim[40166]: /home: 20.4 GiB (21925990400 bytes) trimmed on /dev/mapper/OS-HOME
2024-12-09T00:57:56.577936+01:00 diogenes fstrim[40166]: /boot/efi: 522.9 MiB (548352000 bytes) trimmed on /dev/sda1
2024-12-09T00:57:56.577949+01:00 diogenes fstrim[40166]: /boot: 327.5 MiB (343375872 bytes) trimmed on /dev/sda2
2024-12-09T00:57:56.577968+01:00 diogenes fstrim[40166]: /: 30.3 GiB (32583655424 bytes) trimmed on /dev/mapper/OS-os
2024-12-09T00:57:56.578231+01:00 diogenes systemd[1]: fstrim.service: Deactivated successfully.
2024-12-09T00:57:56.578295+01:00 diogenes systemd[1]: Finished fstrim.service - Discard unused blocks on filesystems from /etc/fstab.
2024-12-09T01:00:45.344398+01:00 diogenes kernel: [200336.540302] INFO: task jbd2/dm-2-8:676 blocked for more than 120 seconds.
2024-12-09T01:00:45.344416+01:00 diogenes kernel: [200336.540325] Not tainted 6.1.0-28-amd64 #1 Debian 6.1.119-1
2024-12-09T01:00:45.344417+01:00 diogenes kernel: [200336.540338] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-12-09T01:00:45.344418+01:00 diogenes kernel: [200336.540352] task:jbd2/dm-2-8 state:D stack:0 pid:676 ppid:2 flags:0x00004000
2024-12-09T01:00:45.344419+01:00 diogenes kernel: [200336.540357] Call Trace:
2024-12-09T01:00:45.344419+01:00 diogenes kernel: [200336.540359] <TASK>
2024-12-09T01:00:45.344420+01:00 diogenes kernel: [200336.540363] __schedule+0x34d/0x9e0
2024-12-09T01:00:45.344420+01:00 diogenes kernel: [200336.540380] schedule+0x5a/0xd0
2024-12-09T01:00:45.344421+01:00 diogenes kernel: [200336.540384] io_schedule+0x42/0x70
2024-12-09T01:00:45.344421+01:00 diogenes kernel: [200336.540387] bit_wait_io+0xd/0x60
2024-12-09T01:00:45.344422+01:00 diogenes kernel: [200336.540390] __wait_on_bit+0x4b/0x140
2024-12-09T01:00:45.344422+01:00 diogenes kernel: [200336.540393] ? bit_wait+0x60/0x60
2024-12-09T01:00:45.344422+01:00 diogenes kernel: [200336.540398] out_of_line_wait_on_bit+0x91/0xb0
2024-12-09T01:00:45.344423+01:00 diogenes kernel: [200336.540402] ? sugov_init+0x350/0x350
2024-12-09T01:00:45.344423+01:00 diogenes kernel: [200336.540408] jbd2_journal_commit_transaction+0x164b/0x1a40 [jbd2]
2024-12-09T01:00:45.344424+01:00 diogenes kernel: [200336.540423] ? srso_alias_return_thunk+0x5/0x7f
2024-12-09T01:00:45.344424+01:00 diogenes kernel: [200336.540427] ? srso_alias_return_thunk+0x5/0x7f
2024-12-09T01:00:45.344424+01:00 diogenes kernel: [200336.540430] ? finish_task_switch.isra.0+0x90/0x2d0
2024-12-09T01:00:45.344425+01:00 diogenes kernel: [200336.540436] kjournald2+0xa9/0x280 [jbd2]
2024-12-09T01:00:45.344425+01:00 diogenes kernel: [200336.540445] ? cpuusage_read+0x10/0x10
2024-12-09T01:00:45.344425+01:00 diogenes kernel: [200336.540450] ? jbd2_fc_wait_bufs+0xa0/0xa0 [jbd2]
2024-12-09T01:00:45.344426+01:00 diogenes kernel: [200336.540457] kthread+0xda/0x100
2024-12-09T01:00:45.344426+01:00 diogenes kernel: [200336.540461] ? kthread_complete_and_exit+0x20/0x20
2024-12-09T01:00:45.344426+01:00 diogenes kernel: [200336.540464] ret_from_fork+0x22/0x30
2024-12-09T01:00:45.344427+01:00 diogenes kernel: [200336.540472] </TASK>
...
2024-12-09T01:02:46.176450+01:00 diogenes kernel: [200457.370942] ? bit_wait+0x60/0x60
2024-12-09T01:02:46.176451+01:00 diogenes kernel: [200457.370946] out_of_line_wait_on_bit+0x91/0xb0
2024-12-09T01:02:46.176451+01:00 diogenes kernel: [200457.370950] ? sugov_init+0x350/0x350
2024-12-09T01:02:46.176451+01:00 diogenes kernel: [200457.370953] jbd2_journal_commit_transaction+0x164b/0x1a40 [jbd2]
2024-12-09T01:02:46.176451+01:00 diogenes kernel: [200457.370965] ? srso_alias_return_thunk+0x5/0x7f
2024-12-09T01:02:46.176452+01:00 diogenes kernel: [200457.370968] ? srso_alias_return_thunk+0x5/0x7f
2024-12-09T01:02:46.176452+01:00 diogenes kernel: [200457.370970] ? finish_task_switch.isra.0+0x90/0x2d0
2024-12-09T01:02:46.176452+01:00 diogenes kernel: [200457.370976] kjournald2+0xa9/0x280 [jbd2]
2024-12-09T01:02:46.176452+01:00 diogenes kernel: [200457.370984] ? cpuusage_read+0x10/0x10
2024-12-09T01:02:46.176453+01:00 diogenes kernel: [200457.370987] ? jbd2_fc_wait_bufs+0xa0/0xa0 [jbd2]
2024-12-09T01:02:46.176453+01:00 diogenes kernel: [200457.370994] kthread+0xda/0x100
2024-12-09T01:02:46.176453+01:00 diogenes kernel: [200457.370997] ? kthread_complete_and_exit+0x20/0x20
2024-12-09T01:02:46.176453+01:00 diogenes kernel: [200457.371001] ret_from_fork+0x22/0x30
2024-12-09T01:02:46.176454+01:00 diogenes kernel: [200457.371007] </TASK>
2024-12-09T04:35:22.020852+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.020928+01:00 diogenes systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
2024-12-09T04:35:22.020961+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.021058+01:00 diogenes systemd[1]: systemd-journald.service: Killing process 362 (systemd-journal) with signal SIGABRT.
2024-12-09T04:35:22.021092+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.021169+01:00 diogenes dailyaidecheck[40960]: WARN: it is not possible to use mail(1) unless aide is run as root. See /usr/share/doc/aide-common/README.Debian.gz for more information.
2024-12-09T04:35:22.021194+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.021265+01:00 diogenes dailyaidecheck[41538]: fold: write error: Broken pipe
2024-12-09T04:35:22.021287+01:00 diogenes qemu-ga: info: guest-ping called
2024-12-09T04:35:22.021335+01:00 diogenes qemu-ga: info: guest-ping called
...
Can it be related to the fstrim, or was it just a coincidence that the first failing ping occurred right after the fstrim? The five disks are on Ceph storage. I have a lot of other VMs where I don't have this kind of problem (different processes run there, for sure, but their disks are also on the Ceph storage, etc.).
Used versions:
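If it does turn out to be triggered by the fstrim, one way to test that theory (just an assumption on my part, not a confirmed fix) would be to move `fstrim.timer` inside the guest away from the 1:00 window with a systemd drop-in, for example:

```ini
# /etc/systemd/system/fstrim.timer.d/override.conf (inside the VM)
# Move the trim to Saturday 03:00 so it no longer overlaps the nightly jobs.
[Timer]
OnCalendar=
OnCalendar=Sat 03:00
```

The empty `OnCalendar=` line clears the default schedule before setting the new one; after `systemctl daemon-reload`, `systemctl list-timers fstrim.timer` shows when it will fire next.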
pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-4-pve)
QEMU Guest Agent 7.2.13
As a side note:
I'm running a Nexus instance in that VM, and a cleanup task was running there at 1:00. Maybe that could also be part of the problem?
Code:
2024-12-09 01:00:00,001+0100 INFO [quartz-9-thread-20] *SYSTEM org.sonatype.nexus.quartz.internal.task.QuartzTaskInfo - Task 'Cleanup unused maven2 blobs from nexus' [assetBlob.cleanup] state change WAITING -> RUNNING
2024-12-09 01:00:00,002+0100 INFO [quartz-9-thread-19] *SYSTEM org.sonatype.nexus.quartz.internal.task.QuartzTaskInfo - Task 'Cleanup unused nuget blobs from nexus' [assetBlob.cleanup] state change WAITING -> RUNNING
...
Does anyone have an idea what it could be, or how I could solve it? Many thanks in advance.