pve freezes during backup job

spongi

New Member
Oct 13, 2023
Hello everyone!

I'm using one of these CWWK/Topton/... Nxxx quad-NIC router devices as my PVE host. Two days ago I migrated from a unit with the N100 to the exact same unit, only with the N305 CPU, which means I just swapped over the two NVMe drives, the SATA OS drive and the RAM. My daily backup (PBS on the same system, snapshot mode to an external USB HDD) runs at night, so in the morning I found the system frozen. The host no longer responds to any input, and the only solution is a hard reset via the power button. I can reproduce this behavior simply by starting a manual backup of one of my VMs.

Oct 12 15:14:02 pve proxmox-backup-proxy[190647]: error during snapshot file listing: 'unable to load blob '"/mnt/pve/backup/vm/101/2023-10-12T12:18:27Z/index.json.blob"' - No such file or directory (os error 2)'
Oct 12 15:14:08 pve pvedaemon[190730]: <root@pam> starting task UPID:pve:0004799A:00040CC4:6527F120:imgdel:101@pbs:root@pam:
Oct 12 15:14:08 pve proxmox-backup-[190647]: pve proxmox-backup-proxy[190647]: removing backup snapshot "/mnt/pve/backup/vm/101/2023-10-12T12:18:27Z"
Oct 12 15:14:08 pve pvedaemon[190730]: <root@pam> end task UPID:pve:0004799A:00040CC4:6527F120:imgdel:101@pbs:root@pam: OK
Oct 12 15:14:11 pve pvedaemon[190729]: <root@pam> starting task UPID:pve:000479CE:00040DCE:6527F123:vzdump:101:root@pam:
Oct 12 15:14:11 pve pvedaemon[293326]: INFO: starting new backup job: vzdump 101 --notes-template '{{guestname}}' --remove 0 --mode snapshot --node pve --storage pbs
Oct 12 15:14:11 pve pvedaemon[293326]: INFO: Starting Backup of VM 101 (qemu)
Oct 12 15:14:11 pve proxmox-backup-proxy[190647]: starting new backup on datastore 'backup' from ::ffff:192.168.178.8: "vm/101/2023-10-12T13:14:11Z"
Oct 12 15:14:11 pve proxmox-backup-proxy[190647]: download 'index.json.blob' from previous backup.
Oct 12 15:14:11 pve proxmox-backup-proxy[190647]: register chunks in 'drive-scsi0.img.fidx' from previous backup.
Oct 12 15:14:12 pve proxmox-backup-proxy[190647]: download 'drive-scsi0.img.fidx' from previous backup.
Oct 12 15:14:12 pve proxmox-backup-proxy[190647]: created new fixed index 1 ("vm/101/2023-10-12T13:14:11Z/drive-scsi0.img.fidx")
Oct 12 15:14:12 pve proxmox-backup-proxy[190647]: register chunks in 'drive-scsi1.img.fidx' from previous backup.
Oct 12 15:14:12 pve proxmox-backup-proxy[190647]: download 'drive-scsi1.img.fidx' from previous backup.
Oct 12 15:14:12 pve proxmox-backup-proxy[190647]: created new fixed index 2 ("vm/101/2023-10-12T13:14:11Z/drive-scsi1.img.fidx")
Oct 12 15:14:12 pve proxmox-backup-proxy[190647]: add blob "/mnt/pve/backup/vm/101/2023-10-12T13:14:11Z/qemu-server.conf.blob" (420 bytes, comp: 420)
Oct 12 15:14:15 pve proxmox-backup-proxy[190647]: error during snapshot file listing: 'unable to load blob '"/mnt/pve/backup/vm/101/2023-10-12T13:14:11Z/index.json.blob"' - No such file or directory (os error 2)'
Oct 12 15:14:15 pve pveproxy[190871]: worker exit
Oct 12 15:14:15 pve pveproxy[2330]: worker 190871 finished
Oct 12 15:14:15 pve pveproxy[2330]: starting 1 worker(s)
Oct 12 15:14:15 pve pveproxy[2330]: worker 300986 started
Oct 12 15:17:01 pve CRON[600396]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 12 15:17:01 pve CRON[600397]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 12 15:17:01 pve CRON[600396]: pam_unix(cron:session): session closed for user root
Oct 12 15:20:33 pve kernel: perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Oct 12 15:20:34 pve zed[1264433]: eid=8 class=checksum pool='datastore' vdev=nvme-Samsung_SSD_970_EVO_Plus_2TB_S4J4NM0W615790R-part1 algorithm=fletcher4 size=8192 offset=211223277568 priority=0 err=52 flags=0x380880 bookmark=671:1:0:33183492
Oct 12 15:21:06 pve zed[1346716]: eid=9 class=checksum pool='datastore' vdev=nvme-Samsung_SSD_970_EVO_Plus_2TB_S4J4NM0W615820H-part1 algorithm=fletcher4 size=8192 offset=332496965632 priority=0 err=52 flags=0x380880 bookmark=671:1:0:37461792
Oct 12 15:21:19 pve kernel: BUG: unable to handle page fault for address: 00000e66129fa240
Oct 12 15:21:19 pve kernel: #PF: supervisor read access in kernel mode
-- Rebooted --

I think the interesting line is this one: pve kernel: BUG: unable to handle page fault for address: 00000e66129fa240

My system is completely up to date with the packages from the no-subscription repo. The first time it occurred the system was still on an older kernel version; I have since updated it, hoping that this would fix the issue.

Does anyone have an idea what the problem could be?
 
I have this same problem with a dual-NIC N100 unit, though for me it sometimes happens without any logs; I have an HDMI monitor hooked up and it appears that it literally just stops. I have PBS running inside Proxmox.

I have run memtest on the RAM, stress-ng on the CPU and badblocks/smartctl on the SSD, and everything was fine. You mentioned that you are backing up to an external USB drive; I am doing the same, using passthrough. Did that work fine on your N100 system?

I have a separate, similar but not identical N100 unit with only SATA drives, and it doesn't seem to have this problem, so I wonder if it is USB related. I also have a 90W adapter on the working unit but only 60W on the non-working one (both are what came with the units). I'm trying to find out whether I can use a 19V adapter on the problematic one to improve the power supply situation.
 
I think I have resolved this issue. My unit uses DDR5 RAM (CWWK x86-P5), and it seems that the UEFI ships with DDR5 on-die ECC disabled by default, even though on-die ECC support is mandatory for DDR5. My unit has a single 32GB Crucial (Micron) 4800MT/s SODIMM installed (so the RAM should be of reasonable quality), and it has now run for about a week without crashing with on-die ECC enabled.
 
Sorry for the late reply, I just saw your replies to the thread. I "solved" the problem by using a USB-C SSD instead of the USB 3.0 HDD (which has no power supply of its own). I never had the page fault problem again after this change, but my system remained unstable in general. A few days ago it just hung, and even after a reboot it hung again a few minutes later. After debugging, I found out that an automatic scrub of the zpool had started, and I assume I ran out of memory, because after limiting the memory of my VMs everything is fine again.
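In case it helps anyone debugging something similar, here is a minimal sketch (my own, nothing official) that compares the ZFS ARC size and limit against available memory. It assumes the standard OpenZFS kstat file under /proc/spl/kstat/zfs/ is present; the idea is that if the ARC is still allowed to grow by more than what is currently available while the VMs have their memory committed, a scrub can push the host over the edge:

Code:
#!/usr/bin/env python3
# Rough sketch: compare ZFS ARC size/limit against available memory to judge
# whether a scrub could push the host toward OOM. Uses the standard OpenZFS
# kstat file and /proc/meminfo; adjust paths if your setup differs.

def read_kstat(path, key):
    """Read one value from a ZFS kstat file like /proc/spl/kstat/zfs/arcstats."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == key:
                return int(parts[-1])
    raise KeyError(key)

def meminfo_bytes(key):
    """Read a field from /proc/meminfo (values there are in KiB)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1]) * 1024
    raise KeyError(key)

arc_size = read_kstat("/proc/spl/kstat/zfs/arcstats", "size")   # current ARC, bytes
arc_max  = read_kstat("/proc/spl/kstat/zfs/arcstats", "c_max")  # ARC limit, bytes
avail    = meminfo_bytes("MemAvailable")

print(f"ARC size : {arc_size / 2**30:.1f} GiB (limit {arc_max / 2**30:.1f} GiB)")
print(f"Available: {avail / 2**30:.1f} GiB")
if arc_max - arc_size > avail:
    print("Warning: the ARC may still grow by more than the available memory;"
          " a scrub plus fully committed VMs could exhaust RAM.")

Limiting zfs_arc_max (or, as I did, the VM memory) keeps that headroom positive.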

Thank you for your update regarding the on-die ECC; I didn't know that this is a requirement for DDR5. I'm also using a Crucial module, since yesterday a 48 GB one. Could you tell me where I have to look in the UEFI settings?
 
I also had a strange issue during a PVE backup: the system switched itself off... very weird. I don't want to hijack your topic, but maybe this feeds it with some search content...

For sure nobody pressed the N305's power button... the NAS is shut down at night and is not a backup location for this backup job.

Code:
Dec 12 04:05:00 pve kernel: nfs: server 172.16.1.2 not responding, timed out
Dec 12 04:05:01 pve CRON[483307]: pam_unix(cron:session): session closed for user root
Dec 12 04:05:05 pve pvestatd[2773687]: storage 'NFS-HDD' is not online
Dec 12 04:05:07 pve pvescheduler[480158]: INFO: Finished Backup of VM 105 (00:01:25)
Dec 12 04:05:07 pve pvescheduler[480158]: INFO: Starting Backup of VM 107 (qemu)
Dec 12 04:05:08 pve pvestatd[2773687]: storage 'backup-remote' is not online
Dec 12 04:05:08 pve pvestatd[2773687]: status update time (6.210 seconds)
Dec 12 04:05:08 pve kernel: nfs: server 172.16.1.2 not responding, timed out
Dec 12 04:05:14 pve pvestatd[2773687]: storage 'backup-remote' is not online
Dec 12 04:05:17 pve pvestatd[2773687]: storage 'NFS-HDD' is not online
Dec 12 04:05:17 pve pvestatd[2773687]: status update time (5.228 seconds)
Dec 12 04:05:18 pve kernel: nfs: server 172.16.1.2 not responding, timed out
Dec 12 04:05:23 pve kernel: nfs: server 172.16.1.2 not responding, timed out
Dec 12 04:05:26 pve pvestatd[2773687]: storage 'NFS-HDD' is not online
Dec 12 04:05:29 pve pvestatd[2773687]: storage 'backup-remote' is not online
Dec 12 04:05:30 pve pvestatd[2773687]: status update time (7.296 seconds)
Dec 12 04:05:34 pve kernel: nfs: server 172.16.1.2 not responding, timed out
Dec 12 04:05:39 pve pvestatd[2773687]: storage 'backup-remote' is not online
Dec 12 04:05:41 pve systemd-logind[3981267]: Power key pressed short.
Dec 12 04:05:41 pve systemd-logind[3981267]: Powering off...
Dec 12 04:05:41 pve systemd-logind[3981267]: System is powering down.
Dec 12 04:05:41 pve systemd[1]: 102.scope: Deactivated successfully.
Dec 12 04:05:41 pve systemd[1]: Stopped 102.scope.
 
The system still remains up with full stability. I have TrueNAS running in a VM with 16GB of RAM, and I have hammered the USB enclosure (QEMU passthrough) with repeated ZFS scrubs; it has no issues, even while I am writing to the drive during a scrub.

As for my settings:

The EFI model of my device is CW-ADLNT-1C2L.

If you go to Chipset => Memory Configuration you can enable In-Band ECC. Note that I *thought* this setting was on-die ECC, but it is actually in-band ECC, which I took to be the traditional ECC found in servers; so at first I couldn't see why it would have fixed anything, since I have normal DDR5 without dedicated ECC support. But the setting could be doing something more than is obvious here, perhaps.

Some other things I have changed previously that may have contributed:

- Advanced => CPU - Power Management Control => Boot performance mode set to "Max Non-Turbo Performance"
- Advanced => CPU - Power Management Control => C states set to Disabled
- Advanced => ACPI Settings => Enable Hibernation set to Disabled

I also disabled Intel ME under Advanced => PCH-FW Configuration
 
Looking at this article: https://www.anandtech.com/show/18732/asrock-industrial-nucs-box1360pd4-review-raptor-lakep-ecc/2

It states: "It can be noted that the amount of hardware reserved memory in the 'In-Band ECC'-enabled case is 2GB higher than the default case. This points to 1/32 of the total memory capacity being reserved for ECC storage." And indeed, Proxmox reports only 30GB of RAM instead of the 32GB total in my case.
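As a quick back-of-the-envelope check of that 1/32 figure (just my own arithmetic, not from the article):

Code:
# Rough check of the 1/32 in-band ECC reservation for a few DIMM sizes.
for total_gib in (16, 32, 48, 64):
    reserved = total_gib / 32          # 1/32 of capacity used for ECC data
    print(f"{total_gib:2d} GiB installed -> ~{reserved:.1f} GiB for ECC, "
          f"~{total_gib - reserved:.1f} GiB left")

For my 32GB module that works out to roughly 1GB for ECC, so part of the ~2GB gap I see is presumably the usual firmware/iGPU reservation on top of it.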

Turns out that regular server ECC is called "side-band ECC", whereas "in-band ECC" does essentially the same thing but by consuming some of the regular RAM (which is what is happening here). This feature is relatively new on Intel CPUs.

So while it is possible that I have a bad DIMM that is causing the problems, I suspect it is more likely that there are marginal signal-integrity issues in the memory routing on the motherboard. It might be interesting to go back to your setup that had issues and then enable this in-band ECC feature to see whether it fixes the problem.

Some more information here: https://docs.zephyrproject.org/latest/hardware/peripherals/edac/ibecc.html
 
It seems that there is an EDAC driver for this form of ECC: https://patchwork.kernel.org/project/linux-edac/patch/20201105074914.3866-1-qiuxu.zhuo@intel.com/ and it was merged into kernel 5.11. I'm not sure whether it is actually enabled in the pve-kernel build, though.
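One way to check, assuming the usual Debian-style /boot/config-* files that Proxmox ships (my own quick sketch; the config symbol for that driver is CONFIG_EDAC_IGEN6):

Code:
# Check whether the running kernel was built with the igen6 EDAC driver.
import platform

cfg = f"/boot/config-{platform.release()}"   # e.g. the config file for the running pve-kernel
with open(cfg) as f:
    hits = [line.strip() for line in f if "EDAC_IGEN6" in line]
print("\n".join(hits) if hits else f"EDAC_IGEN6 not mentioned in {cfg}")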

Also, edac-util doesn't appear to work on Proxmox since version 6:

- https://forum.proxmox.com/threads/edac-utils-shows-incomplete-information-in-proxmox-6-1.64293/
- https://forum.proxmox.com/threads/edac-utils-shows-incomplete-information-in-proxmox-7-1.106470/

So there is no way I can see to view the memory error statistics.
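What should still work, assuming the driver is loaded at all, is reading the generic EDAC counters straight from sysfs instead of going through edac-util (again just a sketch using the standard kernel paths, not Proxmox-specific tooling):

Code:
#!/usr/bin/env python3
# Read corrected/uncorrected error counts from the generic EDAC sysfs interface.
from pathlib import Path

base = Path("/sys/devices/system/edac/mc")
controllers = sorted(base.glob("mc*")) if base.exists() else []
if not controllers:
    print("No EDAC memory controllers registered (driver not loaded or IBECC disabled).")
for mc in controllers:
    name = (mc / "mc_name").read_text().strip()
    ce = (mc / "ce_count").read_text().strip()   # corrected error count
    ue = (mc / "ue_count").read_text().strip()   # uncorrected error count
    print(f"{mc.name}: {name}  corrected={ce}  uncorrected={ue}")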
 