[SOLVED] PVE backups have been crashing the host for the past two days (external + internal backup targets)

denis_pve

New Member
Jan 29, 2023
Hello.

I'm running pve-manager/7.3-6/723bb6ec, and for about two days now the host has been crashing while backing up my VMs.

Code:
Node 'pve' summary (week average):
CPU usage: 0.59% of 6 CPU(s)
IO delay: 0.00%
Load average: 0.09, 0.14, 0.08
RAM usage: 15.84% (4.94 GiB of 31.15 GiB)
KSM sharing: 0 B
/ HD space: 12.66% (50.66 GiB of 400.19 GiB)
SWAP usage: N/A
CPU(s): 6 x Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (1 Socket)
Kernel Version: Linux 5.15.85-1-pve #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z)
PVE Manager Version: pve-manager/7.3-6/723bb6ec
Repository Status: Proxmox VE updates (non-production-ready repository enabled!)

Backup task log:
INFO: starting new backup job: vzdump --notes-template '{{guestname}}' --storage synology_nas --prune-backups 'keep-last=3,keep-monthly=1,keep-weekly=1,keep-yearly=1' --quiet 1 --compress zstd --mode snapshot --exclude 100,302,401 --all 1 --mailnotification always
INFO: Starting Backup of VM 301 (qemu)
INFO: Backup started at 2023-02-27 00:00:00
INFO: status = running
INFO: VM Name: ubuntu-vm1-influxdb
INFO: include disk 'scsi0' 'local-zfs:vm-301-disk-0' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/synology_nas/dump/vzdump-qemu-301-2023_02_27-00_00_00.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '464c99a2-9a94-48be-a4be-1249d30357c0'
INFO: resuming VM again
INFO:   5% (2.7 GiB of 50.0 GiB) in 3s, read: 935.7 MiB/s, write: 213.7 MiB/s
INFO:  12% (6.2 GiB of 50.0 GiB) in 6s, read: 1.2 GiB/s, write: 172.3 MiB/s
INFO:  21% (10.8 GiB of 50.0 GiB) in 9s, read: 1.5 GiB/s, write: 180.7 MiB/s
INFO:  23% (11.8 GiB of 50.0 GiB) in 12s, read: 358.9 MiB/s, write: 213.7 MiB/s
INFO:  28% (14.1 GiB of 50.0 GiB) in 15s, read: 793.9 MiB/s, write: 188.2 MiB/s
INFO:  33% (16.5 GiB of 50.0 GiB) in 18s, read: 815.1 MiB/s, write: 175.9 MiB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 301 failed - job failed with err -5 - Input/output error
INFO: Failed at 2023-02-27 00:00:37
INFO: Starting Backup of VM 303 (qemu)
INFO: Backup started at 2023-02-27 00:00:37
INFO: status = running
INFO: VM Name: haos-vm1-homeassistant
INFO: include disk 'scsi0' 'local-zfs:vm-303-disk-1' 40G
INFO: include disk 'efidisk0' 'local-zfs:vm-303-disk-0' 4M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/synology_nas/dump/vzdump-qemu-303-2023_02_27-00_00_37.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '06fd39d2-0bff-41e5-bf5b-72be4fccd64e'
INFO: resuming VM again
INFO:   2% (1.2 GiB of 40.0 GiB) in 3s, read: 404.8 MiB/s, write: 226.5 MiB/s
INFO:   4% (1.9 GiB of 40.0 GiB) in 6s, read: 239.2 MiB/s, write: 233.3 MiB/s
INFO:   6% (2.6 GiB of 40.0 GiB) in 9s, read: 227.2 MiB/s, write: 217.3 MiB/s
INFO:   7% (2.9 GiB of 40.0 GiB) in 11s, read: 163.2 MiB/s, write: 122.4 MiB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 303 failed - job failed with err -5 - Input/output error
INFO: Failed at 2023-02-27 00:01:06
INFO: Backup job finished with errors
TASK ERROR: job errors

Code:
Feb 27 00:00:00 pve pvescheduler[1436066]: <root@pam> starting task UPID:pve:0015E9A3:079011E2:63FBE470:vzdump::root@pam:
Feb 27 00:00:00 pve pvescheduler[1436067]: INFO: starting new backup job: vzdump --notes-template '{{guestname}}' --storage synology_nas --prune-backups 'keep-last=3,keep-monthly=1,keep-weekly=1,keep-yearly=1' --quiet 1 --compress zstd --mode snapshot --exclude 100,302,401 --all 1 --mailnotification always
Feb 27 00:00:00 pve pvescheduler[1436067]: INFO: Starting Backup of VM 301 (qemu)
Feb 27 00:00:15 pve systemd[1]: Starting Rotate log files...
Feb 27 00:00:15 pve systemd[1]: Starting Daily man-db regeneration...
Feb 27 00:00:15 pve systemd[1]: Stopping Proxmox VE firewall logger...
Feb 27 00:00:15 pve pvefw-logger[28204]: received terminate request (signal)
Feb 27 00:00:15 pve pvefw-logger[28204]: stopping pvefw logger
Feb 27 00:00:15 pve systemd[1]: man-db.service: Succeeded.
Feb 27 00:00:15 pve systemd[1]: Finished Daily man-db regeneration.
Feb 27 00:00:15 pve systemd[1]: pvefw-logger.service: Succeeded.
Feb 27 00:00:15 pve systemd[1]: Stopped Proxmox VE firewall logger.
Feb 27 00:00:15 pve systemd[1]: pvefw-logger.service: Consumed 5.824s CPU time.
Feb 27 00:00:15 pve systemd[1]: Starting Proxmox VE firewall logger...
Feb 27 00:00:15 pve pvefw-logger[1438833]: starting pvefw logger
Feb 27 00:00:15 pve systemd[1]: Started Proxmox VE firewall logger.
Feb 27 00:00:15 pve systemd[1]: logrotate.service: Succeeded.
Feb 27 00:00:15 pve systemd[1]: Finished Rotate log files.
Feb 27 00:00:37 pve pvescheduler[1436067]: ERROR: Backup of VM 301 failed - job failed with err -5 - Input/output error
Feb 27 00:00:37 pve pvescheduler[1436067]: INFO: Starting Backup of VM 303 (qemu)
Feb 27 00:01:06 pve pvescheduler[1436067]: ERROR: Backup of VM 303 failed - job failed with err -5 - Input/output error
Feb 27 00:01:06 pve pvescheduler[1436067]: INFO: Backup job finished with errors
Feb 27 00:01:06 pve pvescheduler[1436067]: job errors
Feb 27 00:03:04 pve smartd[1304]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Drive_Temperature changed from 63 to 62
Feb 27 00:17:01 pve CRON[1452370]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 27 00:17:01 pve CRON[1452371]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 27 00:17:01 pve CRON[1452370]: pam_unix(cron:session): session closed for user root
Feb 27 00:33:04 pve smartd[1304]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Drive_Temperature changed from 62 to 63
Feb 27 00:35:15 pve systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
Feb 27 00:35:15 pve systemd[1]: fstrim.service: Succeeded.
Feb 27 00:35:15 pve systemd[1]: Finished Discard unused blocks on filesystems from /etc/fstab.
Feb 27 01:03:04 pve smartd[1304]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Drive_Temperature changed from 63 to 62
-- Reboot --

The same thing happens whether I back up to the NAS or to an internal ZFS storage in mirror mode. There is enough free space on both backup targets, so that can't be the cause.

To be honest, I have no clue why this started happening now.

The memory and disks are brand new, so I wouldn't expect a hardware error. And with backups disabled, there are no crashes.

Does anyone have an idea why this happens? Could the disks be getting too hot? (The ZFS storage holds the VMs.) I'm using Intel enterprise SSDs.

Any help would be appreciated. Thank you very much!

Denis
 
This morning I got one more crash.

This time a kernel panic was logged:

Code:
Feb 28 07:44:19 pve kernel: BUG: unable to handle page fault for address: ffffffff8e88a940
Feb 28 07:44:19 pve kernel: #PF: supervisor write access in kernel mode
Feb 28 07:44:19 pve kernel: #PF: error_code(0x0002) - not-present page
Feb 28 07:44:19 pve kernel: PGD 6d8415067 P4D 6d8415067 PUD 6d8416063 PMD 10eb8a063 PTE 800ffff927775062
Feb 28 07:44:19 pve kernel: Oops: 0002 [#1] SMP PTI
Feb 28 07:44:19 pve kernel: CPU: 4 PID: 678197 Comm: z_rd_int Tainted: P           O      5.15.85-1-pve #1
Feb 28 07:44:19 pve kernel: Hardware name: Dell Inc. OptiPlex 5060/0654JC, BIOS 1.7.1 07/02/2020
Feb 28 07:44:19 pve kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x1e1/0x240
Feb 28 07:44:19 pve kernel: Code: 41 89 ce 44 0f b7 e8 41 83 ee 01 49 c1 e5 05 4d 63 f6 49 81 c5 40 19 03 00 49 81 fe ff 1f 00 00 77 49 4e 03 2c f5 e0 0a ec 8d <4d> 89 65 00 41 8b 44 24 08 85 c0 75 0b f3 90 41 8b 44 24 08 85 c0
Feb 28 07:44:19 pve kernel: RSP: 0018:ffffa2552192bc28 EFLAGS: 00010282
Feb 28 07:44:19 pve kernel: RAX: 0000000000000000 RBX: ffff8f41b47bf0c0 RCX: 0000000000000010
Feb 28 07:44:19 pve kernel: RDX: 0000000000140000 RSI: 0000000000140000 RDI: ffff8f41b47bf0c0
Feb 28 07:44:19 pve kernel: RBP: ffffa2552192bc50 R08: 0000000000000001 R09: 9ae16a3b2f90404f
Feb 28 07:44:19 pve kernel: R10: 0000000000000000 R11: ffff8f4155cc2000 R12: ffff8f469c531940
Feb 28 07:44:19 pve kernel: R13: ffffffff8e88a940 R14: 000000000000000f R15: 0000000000140000
Feb 28 07:44:19 pve kernel: FS:  0000000000000000(0000) GS:ffff8f469c500000(0000) knlGS:0000000000000000
Feb 28 07:44:19 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 28 07:44:19 pve kernel: CR2: ffffffff8e88a940 CR3: 00000006d8410004 CR4: 00000000003726e0
(...) the rest is in the attached kernel_panic.txt

Does anyone have an idea whether I can do something about this? I really don't know what could have changed, since it had been running stable for weeks.

What I saw before the crash was this syslog entry: Feb 28 06:43:11 pve smartd[1761]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Drive_Temperature changed from 62 to 63

Perhaps someone has an idea.

The disks are two new Intel SSD D3-S4510 480GB 2.5" SATA drives running in a ZFS mirror.
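
Since the VMs sit on that ZFS mirror, one generic first check is whether ZFS itself has recorded device errors; a minimal sketch (the pool name rpool is a placeholder for the actual pool):

Code:
# per-device READ/WRITE/CKSUM error counters and overall pool health
zpool status -v
# optionally verify all data on the pool, then re-check status afterwards
zpool scrub rpool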
 

Attachments

  • kernel_panic.txt (30.4 KB)
Here are the SMART values from both disks:

Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       444
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       19
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2479 (20 26682)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   064   064   000    Old_age   Always       -       36 (Min/Max 35/36)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       36
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       22834
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       10
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       14
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       26590
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2479 (20 26682)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       22834
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       4170
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       35005

SMART Error Log Version: 1
No Errors Logged

Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       445
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       19
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2467 (20 26706)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   062   062   000    Old_age   Always       -       38 (Min/Max 37/38)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       38
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       22843
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       10
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       14
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       26589
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2467 (20 26706)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       22843
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       3985
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       34973

SMART Error Log Version: 1
No Errors Logged
 
What I don't get is that it shows me Feb 28 08:27:36 pve smartd[2872]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Drive_Temperature changed from 65 to 64, but hddtemp and the SMART values in the GUI show between 36 and 38 degrees Celsius at the same time.

Manually running a backup is no longer possible either, since the system crashes during every backup. It gets to about 10-20% of the backup and then crashes.
 
Here are the package versions, in case that matters:

Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-4
pve-kernel-5.15: 7.3-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-1
lxcfs: 5.0.3-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
 
I can now add that cloning new VMs from templates doesn't work either. The host just crashes, with no kernel panic whatsoever.
 
Try dd if=/dev/sda of=/dev/null bs=10M
Do you get Input/output errors?
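For example, reading both mirror members end-to-end and then checking the kernel log (assuming the members are /dev/sda and /dev/sdb):

Code:
for d in /dev/sda /dev/sdb; do
    dd if="$d" of=/dev/null bs=10M status=progress
done
# any fresh I/O errors triggered by the reads?
dmesg | grep -i error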
What I don't get is that it shows me Feb 28 08:27:36 pve smartd[2872]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Drive_Temperature changed from 65 to 64, but hddtemp and the SMART values in the GUI show between 36 and 38 degrees Celsius at the same time.
In the VALUE and WORST columns it shows 062, so that's probably what smartd reports, not necessarily the real temperature. Do the drives feel hot?
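To compare the normalized VALUE with the raw temperature reading side by side (same assumption that the disks are /dev/sda and /dev/sdb):

Code:
# attribute 190 (drive temperature) and 194 (temperature in Celsius)
smartctl -A /dev/sda | grep -E '^19[04]'
smartctl -A /dev/sdb | grep -E '^19[04]'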
 
Just one follow-up: the machine is now running fine again. Removing the faulty memory stick solved all the issues above. In the end it was good that I was using ZFS, since it uses memory heavily and made me aware of the problem early enough. Unfortunately, ECC RAM is not possible on my desktop machine. :)
 
Hi Denis_pve,

We've got the same problem here: the PVE host is crashing during backup.

How did you pin down the problem with your RAM stick?

Best regards

Pascal
 
We've got the same problem here: the PVE host is crashing during backup.
Are there any errors/warnings in the system logs?
How did you pin down the problem with your RAM stick?
To check your RAM, you can use a tool like memtest86+ (you can select it when you boot into the Proxmox installer ISO).
 
To check your RAM, you can use a tool like memtest86+ (you can select it when you boot into the Proxmox installer ISO).
Caveat: While memtest errors rarely have false positives (errors shown but the RAM is actually good), and even those indicate some problem worth investigating, it is not uncommon to pass bad modules because it can't test every error situation. Sometimes errors show up only with unusual patterns (see also: rowhammer), sometimes there are interactions with other devices (e.g. RAM only acting up when the CPU or GPU has an extreme speed/power draw change).
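
If taking the host offline for a long memtest86+ run is impractical, a userspace spot check with the memtester tool can complement it (a rough sketch; it only tests RAM the kernel is willing to hand over, so it is weaker than a boot-time test):

Code:
apt install memtester
# lock 4 GiB of RAM and run 3 test passes over it (needs root and enough free RAM)
memtester 4G 3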
 
Are there any errors/warnings in the system logs?

To check your RAM, you can use a tool like memtest86+ (you can select it when you boot into the Proxmox installer ISO).
Hi Fiona,

I tried one pass of memtest86+ without errors. The PVE host works "well" in production, but when PBS backs up one particular CT on the host, it reboots.

The reboot can occur after 5 minutes or after 2 hours.

The other CT on the PVE backs up fine.

There is nothing in the logs: no kernel panic, no alert in syslog nor in /var/log/messages.

Network latency is normal. At the moment I have started a backup to another PBS server.

The hardware was delivered 2 days ago for our cloud platform: AMD Ryzen 9 5900X 12-Core Processor / 128G ECC / 4x 3.8TB NVMe.

I don't know where to investigate next...

If you have an idea, you are welcome.
 
I tried one pass of memtest86+ without errors. The PVE host works "well" in production, but when PBS backs up one particular CT on the host, it reboots.

The reboot can occur after 5 minutes or after 2 hours.
Do you mean this amount of time after the backup is started or finished? Or do you mean it doesn't matter when you start the backup and it'll always reboot during backup?

Can you try doing a backup of that container to a non-PBS target?

The other CT on the PVE backs up fine.
Can you share the container configuration for both of them (pct config <ID>) and the output of pveversion -v?

There is nothing in the logs: no kernel panic, no alert in syslog nor in /var/log/messages.
That's unfortunate.
 
It's the amount of time after the backup starts; the backup never finishes.

How can I back up a CT to a non-PBS target?

Here is the pct config:
arch: amd64
description: vzdump backup snapshot%0A
hostname: dmp-mail-clone
memory: 32768
mp0: ZFS_HDD:subvol-100-disk-0,mp=/home/,backup=1,size=2700G
nameserver: XX.XX.XX.XX
net0: name=eth0,bridge=vmbr0,gw=XX.XX.XX.1,hwaddr=XX:XX:XX:XX:XX:XX,ip=XX.XX.XX.XX/24,link_down=1,type=veth
onboot: 1
ostype: ubuntu
rootfs: ZFS_NVME:subvol-100-disk-0,size=30G
swap: 512
 
It's the amount of time after the backup starts; the backup never finishes.
Can you share the backup log up until it gets stuck? What does the task log on PBS look like?

How can I back up a CT to a non-PBS target?
You'd need to add a file-based backup storage for that, e.g. use a spare disk and add it as a directory storage.
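A minimal sketch of that, assuming a spare disk /dev/sdX and the storage name backup-test (both placeholders):

Code:
mkfs.ext4 /dev/sdX1
mkdir -p /mnt/backup-test
mount /dev/sdX1 /mnt/backup-test
# register the mount point as a directory storage that accepts backups
pvesm add dir backup-test --path /mnt/backup-test --content backup
# one-off backup of the suspect container (ID 100 here) to it
vzdump 100 --storage backup-test --mode snapshot --compress zstd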

Here is the pct config:
arch: amd64
description: vzdump backup snapshot%0A
hostname: dmp-mail-clone
memory: 32768
mp0: ZFS_HDD:subvol-100-disk-0,mp=/home/,backup=1,size=2700G
nameserver: XX.XX.XX.XX
net0: name=eth0,bridge=vmbr0,gw=XX.XX.XX.1,hwaddr=XX:XX:XX:XX:XX:XX,ip=XX.XX.XX.XX/24,link_down=1,type=veth
onboot: 1
ostype: ubuntu
rootfs: ZFS_NVME:subvol-100-disk-0,size=30G
swap: 512
Since you are using ZFS, did you look at https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_zfs_limit_memory_usage already?
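For reference, the ARC limit described in that chapter comes down to something like this (16 GiB here is just an example value, in bytes):

Code:
# cap the ZFS ARC at 16 GiB -- example value only, size it for your workload
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all    # takes effect on the next boot, or apply it live:
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max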

How much RAM do you have in total? How does the RAM usage behave during backup? Is the other container also using the same storages? Does it have disks of comparable size?
 
Here are the last log lines:

Code:
2023-06-07T15:39:32+02:00: POST /dynamic_chunk: 400 Bad Request: error reading a body from connection: connection reset
2023-06-07T15:39:32+02:00: POST /dynamic_chunk: 400 Bad Request: error reading a body from connection: connection reset
2023-06-07T15:39:32+02:00: backup failed: connection error: connection reset
2023-06-07T15:39:32+02:00: removing failed backup
2023-06-07T15:39:32+02:00: TASK ERROR: connection error: connection reset


There is 128G in the server; at the moment there is just this one CT on the PVE.

RAM is OK, we have limited the ZFS RAM settings, and there is no OOM killer in the logs...

The other CT uses the same storage, and its volume is about 1.8T.
 
Some new facts.

We migrated another heavy CT to the host. The PVE rebooted a few seconds after the CT started. The problem occurred twice. We moved the CT back to the original PVE => no problem.

I think we have a hardware problem. I'll install kernel 6.2 today to see if it fixes the problem.

Is there a way to activate "extra logs" on the PVE?

Best regards

Pascal
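 
As a general note on the "extra logs" question: when a host resets without leaving local traces, netconsole can stream kernel messages over UDP to another machine before the crash takes the box down. A minimal sketch (the interface name, IPs, and receiver MAC are placeholders):

Code:
# on the crashing host: send kernel messages from 192.168.1.10 (eno1) to 192.168.1.20:6666
modprobe netconsole netconsole=6666@192.168.1.10/eno1,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
# on the receiving machine: listen for them (flag syntax varies by netcat flavor)
nc -u -l 6666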
 