Host Freezes Once a Week

Potatoes1921 · Dec 8, 2024

Hi all, about once a week or so my host freezes up: all VMs show ?? over them, though webconfig still somewhat works. A couple of my VMs remain operational, but not fully. The host ends up in a very weird limbo state that can only be resolved by force rebooting the system. Rebooting via webconfig worked once but never again; it initiates all the shutdown commands, but nothing happens until the button is pressed.

My first thought is that it is mainly centered around backups: I'll get a backup that hangs for a day for unknown reasons, possibly because the NFS share they are saved to freezes or loses connection. The next backups then hang with:
INFO: trying to get global lock - waiting...
ERROR: can't acquire lock '/var/run/vzdump.lock' - got timeout
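
The lock timeout above generally means an earlier vzdump task is still holding /var/run/vzdump.lock. One way to check whether that task is stuck waiting on I/O is to look for processes in uninterruptible sleep (state D). A minimal sketch; the function name `list_blocked` is made up for illustration:

```shell
#!/usr/bin/env bash
# Print the header line plus every process whose STAT column starts with "D"
# (uninterruptible sleep, typically blocked on disk or NFS I/O).
# Reads `ps` output on stdin, so it also works on saved output.
list_blocked() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on the host:
#   ps -eo pid,stat,wchan:32,args | list_blocked
```

If a vzdump or "task UPID" process shows up here, the lock will likely stay held until the underlying I/O completes or the host is rebooted.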

The NFS share is on an OMV VM that has disks passed through to it. I'm not too sure where to start checking. I've cut down my massive number of backups and been a bit more selective about them. I'm no longer backing up the OMV VM, as I thought that might have been the issue, but it seems not. I'll post more info, I'm just not sure what else is needed.

PVEVERSION: pve-manager/8.2.7/3e0176e6bb2ade3b (running kernel: 6.8.12-3-pve)

Potatoes1921 · Dec 11, 2024

Happened again. This time it got stuck backing up my homeassistant VM, which I believe was the same one as last time. Not sure if that means it's an issue with the VM, or just that the timing lines up that way. Below is the log from that backup, which hung at the 10% mark (last line). Also, FWIW, my foundryvtt server is still accessible and functional, but the NFS and SMB shares to my OMV bulk_storage aren't, which makes me think that's why the backup freezes. I don't know why the shares would stop working, though.

INFO: starting new backup job: vzdump 100 104 108 --quiet 1 --prune-backups 'keep-last=10,keep-monthly=3,keep-weekly=4' --fleecing 0 --mailnotification always --mode snapshot --notes-template '{{guestname}}' --storage bulk_storage --compress zstd
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2024-12-09 21:00:01
INFO: status = running
INFO: VM Name: homeassistant
INFO: include disk 'scsi0' 'vmstorage:100/vm-100-disk-1.raw' 32G
INFO: include disk 'efidisk0' 'vmstorage:100/vm-100-disk-0.raw' 4M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/bulk_storage/dump/vzdump-qemu-100-2024_12_09-21_00_01.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '4d478dc0-5482-4f7c-a2da-ddc64487f773'
INFO: resuming VM again
INFO: 3% (1.0 GiB of 32.0 GiB) in 3s, read: 346.8 MiB/s, write: 243.1 MiB/s
INFO: 5% (1.8 GiB of 32.0 GiB) in 6s, read: 274.3 MiB/s, write: 216.3 MiB/s
INFO: 10% (3.4 GiB of 32.0 GiB) in 9s, read: 554.6 MiB/s, write: 132.4 MiB/s

Maximiliano · Dec 11, 2024

Hello,

Do you see any crash in the system logs? You can view the system logs with:

Code:

journalctl --since "2024-12-09 20:00" --until "2024-12-09 21:00"

Please modify the date above to match the timestamp of the crash.

What kernel are you using? You can query this with:

Code:

uname -a
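
To narrow a long journal down, the same kind of time window can be limited to kernel messages, which is where NFS timeouts and hung-task traces end up. A sketch, with the function name `kernel_messages` and the example timestamps as assumptions; it uses the `_TRANSPORT=kernel` match rather than `-k`, because `-k` implies the current boot only and the interesting entries here are from before a forced reboot:

```shell
#!/usr/bin/env bash
# Show only kernel messages for a given time window.
# Arguments: start timestamp, end timestamp.
kernel_messages() {
    journalctl _TRANSPORT=kernel --no-pager --since "$1" --until "$2"
}

# Usage (adjust the window to the time of the freeze):
#   kernel_messages "2024-12-09 20:00" "2024-12-09 22:00"
```

This only returns entries from earlier boots if the journal is stored persistently.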

Potatoes1921 · Dec 14, 2024

I've attached the logs from a couple of days ago, exported with: journalctl --since "2024-12-12 18:00" --until "2024-12-13 22:00" > /home/crash.txt
I haven't had a chance to look at them too closely, but there sure is an awful lot of weird stuff in there that I can tell isn't right; I'm just not sure why or how it is happening.

As for kernel version: Linux tonyserver 6.8.12-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-3 (2024-10-23T11:41Z) x86_64 GNU/Linux

Same issue again last night/tonight, frozen on the same backup right around 10% again:

INFO: starting new backup job: vzdump 100 104 108 --compress zstd --mode snapshot --mailnotification always --notes-template '{{guestname}}' --storage bulk_storage --quiet 1 --fleecing 0 --prune-backups 'keep-last=10,keep-monthly=3,keep-weekly=4'
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2024-12-12 21:00:01
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: homeassistant
INFO: include disk 'scsi0' 'vmstorage:100/vm-100-disk-1.raw' 32G
INFO: include disk 'efidisk0' 'vmstorage:100/vm-100-disk-0.raw' 4M
INFO: creating vzdump archive '/mnt/pve/bulk_storage/dump/vzdump-qemu-100-2024_12_12-21_00_01.vma.zst'
INFO: starting kvm to execute backup task
INFO: started backup task '72f11880-d1fa-4ede-b955-586b03c206bf'
INFO: 3% (1007.8 MiB of 32.0 GiB) in 3s, read: 335.9 MiB/s, write: 232.2 MiB/s
INFO: 5% (1.8 GiB of 32.0 GiB) in 6s, read: 275.5 MiB/s, write: 221.8 MiB/s
INFO: 10% (3.4 GiB of 32.0 GiB) in 10s, read: 424.1 MiB/s, write: 104.1 MiB/s

Potatoes1921 · Sunday at 05:38

I was able to catch it around 10:30 pm, while the backup that keeps failing was still running. Neither the VMs nor the host were frozen yet; however, the NFS pool was offline. I was able to restart the host remotely. I've just removed VM 100 (homeassistant) from the schedule, so I'll see if that was the issue.

Potatoes1921 · Monday at 04:45

It seems to be daily now, or it might have been before and I just never noticed. This time it was the same backup job, just a different VM, so it was backing up to the same NFS shared storage. Not sure why the share is having issues with backups now when it used to work without any problems.

INFO: starting new backup job: vzdump 108 104 --notes-template '{{guestname}}' --mode snapshot --compress zstd --quiet 1 --prune-backups 'keep-last=10,keep-monthly=3,keep-weekly=4' --storage bulk_storage --mailnotification always --fleecing 0
INFO: Starting Backup of VM 104 (qemu)
INFO: Backup started at 2024-12-15 21:00:02
INFO: status = running
INFO: VM Name: FoundryVTT
INFO: include disk 'scsi0' 'vmstorage:104/vm-104-disk-0.qcow2' 64G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating vzdump archive '/mnt/pve/bulk_storage/dump/vzdump-qemu-104-2024_12_15-21_00_02.vma.zst'
INFO: started backup task '60c50ce4-4a97-4981-8dc7-bb8ecdd66e3d'
INFO: resuming VM again
INFO: 2% (1.5 GiB of 64.0 GiB) in 3s, read: 519.6 MiB/s, write: 267.9 MiB/s
INFO: 3% (2.5 GiB of 64.0 GiB) in 6s, read: 317.8 MiB/s, write: 284.4 MiB/s
INFO: 5% (3.5 GiB of 64.0 GiB) in 9s, read: 349.2 MiB/s, write: 155.5 MiB/s
INFO: 6% (4.1 GiB of 64.0 GiB) in 12s, read: 226.0 MiB/s, write: 225.8 MiB/s
INFO: 8% (5.4 GiB of 64.0 GiB) in 15s, read: 435.0 MiB/s, write: 204.8 MiB/s
INFO: 12% (8.3 GiB of 64.0 GiB) in 18s, read: 976.3 MiB/s, write: 71.3 MiB/s
INFO: 13% (8.9 GiB of 64.0 GiB) in 21s, read: 219.9 MiB/s, write: 217.9 MiB/s
INFO: 14% (9.2 GiB of 64.0 GiB) in 24s, read: 99.1 MiB/s, write: 99.1 MiB/s

Maximiliano · Monday at 14:42

Here are a few log entries that caught my attention:

Code:

Dec 12 18:02:22 tonyserver postfix/smtp[590183]: connect to alt1.gmail-smtp-in.l.google.com[172.217.197.27]:25: Connection timed out
Dec 12 21:01:18 tonyserver smartd[1096]: Device: /dev/sdd [SAT], 20 Offline uncorrectable sectors
Dec 12 21:03:39 tonyserver kernel: nfs: server 192.168.1.3 not responding, still trying
Dec 13 21:52:22 tonyserver kernel: pcieport 0000:00:01.3: AER: Multiple Correctable error message received from 0000:03:01.0
Dec 13 21:52:26 tonyserver mount[1223]: error 2 (No such file or directory) opening credential file /home/username/.smbcredentials
Dec 12 21:10:14 tonyserver pvescheduler[626701]: VM 100 qmp command failed - VM 100 qmp command 'query-backup' failed - got timeout

- Could you please double-check whether your network connection is working reliably?
- The smartd message suggests that disk sdd might be failing.
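
To follow up on the smartd warning, the raw SMART counters for reallocated, pending, and offline-uncorrectable sectors (commonly attribute IDs 5, 197, and 198) are the usual ones to watch. A rough sketch that filters `smartctl -A` output; the function name `sector_counters` is hypothetical, and attribute IDs can differ between drive vendors:

```shell
#!/usr/bin/env bash
# Reads `smartctl -A` output on stdin and prints the attribute name and
# raw value (last column) for attribute IDs 5, 197 and 198.
sector_counters() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2, $NF }'
}

# Usage on the host (needs root and the smartmontools package):
#   smartctl -A /dev/sdd | sector_counters
#   smartctl -t long /dev/sdd     # start an extended self-test
```

Counters that keep growing between checks are a stronger signal than a single non-zero value.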

This might be even more interesting:

Code:

Dec 12 21:11:30 tonyserver kernel: nfs: server 192.168.1.3 not responding, still trying
Dec 12 21:13:05 tonyserver kernel: INFO: task task UPID:tonys:626701 blocked for more than 122 seconds.
Dec 12 21:13:05 tonyserver kernel:       Tainted: P           O       6.8.12-3-pve #1
Dec 12 21:13:05 tonyserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 21:13:05 tonyserver kernel: task:task UPID:tonys state:D stack:0     pid:626701 tgid:626701 ppid:1      flags:0x00004002
Dec 12 21:13:05 tonyserver kernel: Call Trace:
Dec 12 21:13:05 tonyserver kernel:  <TASK>
Dec 12 21:13:05 tonyserver kernel:  __schedule+0x401/0x15e0
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? nfs_pageio_complete+0xee/0x140 [nfs]
Dec 12 21:13:05 tonyserver kernel:  schedule+0x33/0x110
Dec 12 21:13:05 tonyserver kernel:  io_schedule+0x46/0x80
Dec 12 21:13:05 tonyserver kernel:  folio_wait_bit_common+0x136/0x330
Dec 12 21:13:05 tonyserver kernel:  ? __pfx_wake_page_function+0x10/0x10
Dec 12 21:13:05 tonyserver kernel:  folio_wait_bit+0x18/0x30
Dec 12 21:13:05 tonyserver kernel:  folio_wait_writeback+0x2b/0xa0
Dec 12 21:13:05 tonyserver kernel:  __filemap_fdatawait_range+0x90/0x100
Dec 12 21:13:05 tonyserver kernel:  filemap_write_and_wait_range+0x94/0xc0
Dec 12 21:13:05 tonyserver kernel:  nfs_wb_all+0x27/0x130 [nfs]
Dec 12 21:13:05 tonyserver kernel:  nfs4_file_flush+0x7e/0xe0 [nfsv4]
Dec 12 21:13:05 tonyserver kernel:  filp_flush+0x38/0x90
Dec 12 21:13:05 tonyserver kernel:  __x64_sys_close+0x34/0x90
Dec 12 21:13:05 tonyserver kernel:  x64_sys_call+0x1a20/0x24b0
Dec 12 21:13:05 tonyserver kernel:  do_syscall_64+0x81/0x170
Dec 12 21:13:05 tonyserver kernel:  ? __pte_offset_map+0x1c/0x1b0
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? __handle_mm_fault+0xbd3/0xed0
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? __count_memcg_events+0x6f/0xe0
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? count_memcg_events.constprop.0+0x2a/0x50
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? handle_mm_fault+0xad/0x380
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? do_user_addr_fault+0x337/0x660
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? irqentry_exit_to_user_mode+0x7e/0x260
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? irqentry_exit+0x43/0x50
Dec 12 21:13:05 tonyserver kernel:  ? srso_return_thunk+0x5/0x5f
Dec 12 21:13:05 tonyserver kernel:  ? exc_page_fault+0x94/0x1b0
Dec 12 21:13:05 tonyserver kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Dec 12 21:13:05 tonyserver kernel: RIP: 0033:0x7bce473098e0
Dec 12 21:13:05 tonyserver kernel: RSP: 002b:00007fffa95f1098 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
Dec 12 21:13:05 tonyserver kernel: RAX: ffffffffffffffda RBX: 0000633b57ffd2a0 RCX: 00007bce473098e0
Dec 12 21:13:05 tonyserver kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000e
Dec 12 21:13:05 tonyserver kernel: RBP: 000000000000000e R08: 0000000000000000 R09: 0000000000000000
Dec 12 21:13:05 tonyserver kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 0000633b5ea46d20
Dec 12 21:13:05 tonyserver kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
Dec 12 21:13:05 tonyserver kernel:  </TASK>

It seems the NFS server is not responding, and that caused the backup to freeze/stop.
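
The kinds of lines quoted above can be pulled out of a saved journal export with a simple filter. A sketch, assuming the export is a plain-text file such as the /home/crash.txt mentioned earlier in the thread; the function name `storage_events` is made up:

```shell
#!/usr/bin/env bash
# Print journal lines that point at storage trouble: NFS timeouts,
# hung-task warnings, smartd reports and failed QMP commands.
# Argument: path to a plain-text journal export.
storage_events() {
    grep -E 'nfs: server .* not responding|blocked for more than [0-9]+ seconds|smartd\[[0-9]+\]|qmp command .*failed' "$1"
}

# Usage:
#   storage_events /home/crash.txt
```

Comparing the timestamps of the first NFS timeout and the first hung-task warning against the backup start time helps show which came first.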

Potatoes1921 · Tuesday at 03:58

I've had no issues with my network on the server or across any other systems, and everything is hardwired. Here are the SMART results for sdd. It seems to be passing, but it is surely on its way out. sdd is my main VM storage, which the VMs run on. Is there an easy way to copy that drive and its data to a new one? Is it recommended to have VMs stored on a RAID/ZFS array?
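
On moving the VM disks off sdd: Proxmox VE can relocate a disk to another storage with `qm disk move`, which avoids cloning the whole drive by hand. A cautious sketch that only prints the commands for review; `newstorage` is a placeholder for a storage ID that would have to exist already, and the function name `plan_disk_moves` is made up. The VM IDs and disk names in the usage line are the ones visible in the backup logs in this thread:

```shell
#!/usr/bin/env bash
# Reads "VMID DISK" pairs on stdin and prints one `qm disk move` command
# per pair, targeting the storage given as the first argument.
# Nothing is executed; the output is meant to be reviewed first.
plan_disk_moves() {
    local target="$1" vmid disk
    while read -r vmid disk; do
        if [ -n "$vmid" ]; then
            printf 'qm disk move %s %s %s\n' "$vmid" "$disk" "$target"
        fi
    done
}

# Usage (placeholder storage name):
#   printf '%s\n' '100 scsi0' '100 efidisk0' '104 scsi0' | plan_disk_moves newstorage
```

How existing snapshots affect a move (VM 104 has some, per its backup log) is worth checking before running any of the printed commands.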

I agree that sure seems to be what's happening, but I'm not too sure why. I'm also slightly curious why Proxmox doesn't kill the process after it hangs for so long, and instead the entire host freezes. Then again, I was unable to kill/stop the backup myself after it had been hanging for 1-2 hours, so maybe Proxmox can't either.
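
On why the task cannot be killed: with the default `hard` NFS mount behaviour, the client keeps retrying I/O indefinitely, and a process waiting on writeback generally sits in uninterruptible sleep (state D, as in the trace earlier in the thread) until the server answers. A small sketch to see which options the NFS mounts actually use; the function name `nfs_mount_options` is hypothetical:

```shell
#!/usr/bin/env bash
# Reads /proc/mounts-style input on stdin and prints the mount point and
# options of every NFS mount (filesystem type "nfs" or "nfs4").
nfs_mount_options() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2, $4 }'
}

# Usage on the host:
#   nfs_mount_options < /proc/mounts
```

The options to look for are `hard` or `soft`, plus `timeo=` and `retrans=`. Switching to `soft` trades hangs for possible I/O errors and incomplete writes, so it is a judgment call rather than a fix.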

I've disabled all backups going to the NFS-shared ZFS pool for now, and currently just have my FoundryVTT VM being backed up to the same disk that the VMs are on (obviously dumb, but the only other option is the Proxmox host disk, which I figured was a worse idea).

Potatoes1921 · 2024-12-20T04:43:34+0100

So far, about 3 full days have gone by with 0 issues. The NFS share and ZFS pool are both up and stable, Proxmox has not crashed, and my one backup job for my FoundryVTT server has been running fine when backing up to the vmstorage SSD. Not sure what conclusion to draw from this; I'll play around a bit more when I have the time.
