VM lockup at backup time

arubenstein

New Member
Jul 17, 2023
Having a strange issue. The environment is a 5-node cluster with Ceph underneath, all SSD. About 73 VMs are running, with plenty of RAM and CPU available.

A guest, a CentOS 7.9 Linux box, occasionally locks up around the time a backup occurs. We are backing up to a remote PBS. The QEMU guest agent is installed in the guest and communicating with the host. We back up nightly, and it doesn't lock up every night, maybe 1 or 2 times a week. When I say lock up: the console login prompt lets you type a username, but a password prompt never comes back. The console is spammed with errors as shown in the screenshot below. The guest runs Linux kernel 3.10. From the research I have done, it does appear to be some sort of disk subsystem inaccessibility issue. However, many, many other VMs operate on this cluster and that node with no issue, with uptimes of hundreds of days. I will say that we don't run a lot of CentOS (mostly Debian and Windows). I haven't power-cycled it yet, so if there are any commands I can run on the hypervisor to help troubleshoot, let me know and I will run them. Any help or pointers appreciated!


Code:
root@pvea2:~# qm status 142 --verbose
balloon: 17179869184
ballooninfo:
        actual: 17179869184
        free_mem: 3602145280
        last_update: 1704892454
        major_page_faults: 1568
        max_mem: 17179869184
        mem_swapped_in: 0
        mem_swapped_out: 0
        minor_page_faults: 722860004
        total_mem: 16655044608
blockstat:
        scsi0:
                account_failed: 1
                account_invalid: 1
                failed_flush_operations: 0
                failed_rd_operations: 0
                failed_unmap_operations: 0
                failed_wr_operations: 0
                failed_zone_append_operations: 0
                flush_operations: 1045232
                flush_total_time_ns: 1325208395575
                idle_time_ns: 31477011384646
                invalid_flush_operations: 0
                invalid_rd_operations: 0
                invalid_unmap_operations: 0
                invalid_wr_operations: 0
                invalid_zone_append_operations: 0
                rd_bytes: 1180863488
                rd_merged: 0
                rd_operations: 63496
                rd_total_time_ns: 84561163209
                timed_stats:
                unmap_bytes: 0
                unmap_merged: 0
                unmap_operations: 0
                unmap_total_time_ns: 0
                wr_bytes: 178115051520
                wr_highest_offset: 322119630848
                wr_merged: 0
                wr_operations: 10838567
                wr_total_time_ns: 2099085771014
                zone_append_bytes: 0
                zone_append_merged: 0
                zone_append_operations: 0
                zone_append_total_time_ns: 0
cpus: 12
disk: 0
diskread: 1180863488
diskwrite: 178115051520
freemem: 3602145280
maxdisk: 0
maxmem: 17179869184
mem: 13052899328
name: PNET-voipmonitor.voice.planet.net
netin: 759955683
netout: 397782855
nics:
        tap142i0:
                netin: 274156480
                netout: 249630649
        tap142i1:
                netin: 485799203
                netout: 148152206
pid: 4011717
proxmox-support:
        backup-max-workers: 1
        pbs-dirty-bitmap: 1
        pbs-dirty-bitmap-migration: 1
        pbs-dirty-bitmap-savevm: 1
        pbs-library-version: 1.4.1 (UNKNOWN)
        pbs-masterkey: 1
        query-bitmap-info: 1
qmpstatus: running
running-machine: pc-i440fx-8.1+pve0
running-qemu: 8.1.2
status: running
uptime: 512854
vmid: 142

[screenshot attachment: console error messages during the hang]
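
For reference, next time it hangs these are the kinds of host-side checks I figure I can run with the standard qm tooling (142 being this VM's ID); just a sketch of what I have in mind:

Code:
# does the guest agent still answer?
root@pvea2:~# qm guest cmd 142 ping

# look at VM and block-job state via the QEMU human monitor
root@pvea2:~# qm monitor 142
qm> info status
qm> info block-jobs
# exit the monitor with Ctrl+C -- typing 'quit' there would stop the VM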
 
Here is another tidbit: I was talking to another admin this morning, and on a separate cluster in a different datacenter there is a CentOS machine that also locks up from time to time during backups. Could this be something to do with kernel 3.10?
 
Another tidbit: the VMs in question were fine for a very long time and only started exhibiting this behavior when we upgraded from '8.0.3_amd64' to '8.1.3_amd64'.
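
(For anyone wanting to compare, the exact package versions on a node can be pulled with pveversion; the grep pattern is just my guess at the relevant packages:)

Code:
root@pvea2:~# pveversion -v | grep -E 'pve-qemu|qemu-server|proxmox-backup'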
 
Another tidbit: the lockup seems to occur right around the time of an fsfreeze:

Code:
Jan  9 23:29:22 vxx qemu-ga: info: guest-ping called
Jan  9 23:29:22 vxx qemu-ga: info: guest-fsfreeze called
Jan  9 23:29:22 vxx qemu-ga: info: executing fsfreeze hook with arg 'freeze'


Those are the last entries in the syslog.
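
If the agent is still reachable when the VM is in that state, I assume something like this would show whether the guest filesystems are stuck frozen, and could force a thaw (untested on my side; it will presumably just time out if the agent itself is hung):

Code:
# is the guest still frozen?
root@pvea2:~# qm guest cmd 142 fsfreeze-status

# if it reports "frozen", force a thaw
root@pvea2:~# qm guest cmd 142 fsfreeze-thaw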