Replication hangs before freezing the guest filesystem

Jun 24, 2024
Hi,

I have two systems running in replication. Occasionally (about once or twice per day) the replication hangs for one VM.
The log only shows the following:

2024-06-23 15:02:00 105-0: start replication job
2024-06-23 15:02:00 105-0: guest => VM 105, running => 2066997
2024-06-23 15:02:00 105-0: volumes => nvme:vm-105-disk-0,sata:vm-105-disk-0,sata:vm-105-disk-2

It then usually hangs there for about an hour. Restarting pvescheduler fixes the problem.
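For reference, the restart is simply the systemd service restart (assuming the default unit name):

systemctl restart pvescheduler.service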

Does anybody have an idea why it doesn't freeze the guest filesystem, or how I could get more details?

Thanks in advance.
 
Hi,
please check the output of ps faxl on the source node and locate the migration task to see at which command it hangs. My first guess is that it's SSH-related.
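If it is hard to catch the hang manually, a small logging loop is an option; this is just a sketch, the interval, filter and log path are arbitrary:

while true; do date; ps faxl | grep -B1 -A5 '[p]vescheduler'; sleep 30; done >> /root/replication-hang.log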
 
Hi,

thank you! It wasn't easy to time the "ps faxl" execution to the exact moment the error happens, but I finally managed.

Usually only the following is shown:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 4429 1 20 0 348892 115612 hrtime Ss ? 0:05 pvescheduler
1 0 917366 4429 20 0 356404 116892 do_sel S ? 0:00 \_ pvescheduler

Sometimes:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 4429 1 20 0 348892 115612 hrtime Ss ? 0:05 pvescheduler
1 0 917366 4429 20 0 356404 116892 do_sel S ? 0:00 \_ pvescheduler
0 0 917372 917366 20 0 11184 7424 do_sys S ? 0:00 \_ /usr/bin/ssh -e none -o BatchMode=yes -o HostKeyAlias=pve2 -o UserKnownHostsFile=/etc/pve/nodes/pve2/ssh_known_hosts -o GlobalKnownHostsFile=none root@192.168.30.32 -- pvesr prepare-local-job 103-0 --scan mediapool,sata mediapool:vm-103-disk-0 sata:vm-103-disk-0 sata:vm-103-disk-1 --last_sync 1719596701

The problem has occurred since the update from 7.4 to 8.2.

Is there any other way I could get more details? Maybe a debug log level?
 
Usually only the following is shown:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 4429 1 20 0 348892 115612 hrtime Ss ? 0:05 pvescheduler
1 0 917366 4429 20 0 356404 116892 do_sel S ? 0:00 \_ pvescheduler
Is this also the output while the hour-long hang happens, or while it is working?

Sometimes:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 4429 1 20 0 348892 115612 hrtime Ss ? 0:05 pvescheduler
1 0 917366 4429 20 0 356404 116892 do_sel S ? 0:00 \_ pvescheduler
0 0 917372 917366 20 0 11184 7424 do_sys S ? 0:00 \_ /usr/bin/ssh -e none -o BatchMode=yes -o HostKeyAlias=pve2 -o UserKnownHostsFile=/etc/pve/nodes/pve2/ssh_known_hosts -o GlobalKnownHostsFile=none root@192.168.30.32 -- pvesr prepare-local-job 103-0 --scan mediapool,sata mediapool:vm-103-disk-0 sata:vm-103-disk-0 sata:vm-103-disk-1 --last_sync 1719596701
Here, it seems to be stuck preparing the job on the remote side. In this case, it would be interesting to see what happens on the target side, i.e. the ps faxl tree below the pvesr command. If there is no pvesr command, the issue is likely already with SSH; otherwise, my guess would be storage-related.
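A rough way to check this from the source while the hang is active, assuming the target is the 192.168.30.32 node from your process list:

ssh root@192.168.30.32 "ps faxl | grep -A 10 '[p]vesr'"

If a pvesr process is there, its children (e.g. a zfs recv for the incoming stream) would show up indented below it.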

What does grep '' /proc/pressure/* show on both source and target when the issue is happening? Anything in the system journal on one of the nodes?
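For example, with the time window just a placeholder around the hang from your first post:

grep '' /proc/pressure/*
journalctl --since "2024-06-23 15:00" --until "2024-06-23 16:10"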

The problem has occurred since the update from 7.4 to 8.2.
You could try booting an older kernel to see if it improves the situation, but it's just a guess.
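As a sketch, assuming proxmox-boot-tool manages your boot entries (the version string is only an example, pick one from the list output):

proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.15.149-1-pve
reboot

The pin can be removed again with proxmox-boot-tool kernel unpin once you are done testing.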
Is there any other way I could get more details? Maybe a debug log level?
You could enable verbose logging with SSH, but we don't know that the issue really lies there.
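One option, without touching the replication code, would be a per-host entry in root's SSH client config on the source node; this is just an assumption about where extra verbosity helps, and the patterns match the target from your process list:

# /root/.ssh/config on the source node
Host pve2 192.168.30.32
    LogLevel DEBUG3

The extra output goes to stderr of the ssh process and should then show up in the replication task log, but again, we don't know yet that SSH is the culprit.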
 
