PBS just hangs during or after backups (hard to tell); it's just gone after a couple of hours

Jun 18, 2023
I have been having this issue forever, but now I am frustrated enough to ask for help.

I have a PBS and 2x Proxmox hosts with several VMs. Ever since I started using PBS, it just crashes during or after a backup has finished. It is installed on a 3-disk ZFS "RAID".

Sometimes it ran through for a day or two, but more often than not it just hung and I had to reset it.

A couple of weeks/months ago I was googling this issue and found somewhere in this forum (can't remember where) that I would need to set some zfs_arc_max parameter so that it would run through. I tried that, and it was a little better (it ran a few days), but it still crashes now.

These are my specs (the NVMe disk is currently not used):

System:
Host: proxmox-backup Kernel: 6.8.4-3-pve arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
Console: pty pts/0 Distro: Debian GNU/Linux 12 (bookworm)
Machine:
Type: Desktop Mobo: ASUSTeK model: TUF GAMING X570-PLUS v: Rev X.0x serial: 200670021600043
UEFI: American Megatrends v: 5003 date: 10/07/2023
Memory:
RAM: total: 31.25 GiB used: 20.06 GiB (64.2%)
Array-1: capacity: 128 GiB slots: 4 EC: None max-module-size: 32 GiB note: est.
Device-1: DIMM_A1 type: no module installed
Device-2: DIMM_A2 type: no module installed
Device-3: DIMM_B1 type: DDR4 size: 16 GiB speed: 2133 MT/s
Device-4: DIMM_B2 type: DDR4 size: 16 GiB speed: 2133 MT/s
CPU:
Info: 12-core model: AMD Ryzen 9 3900X bits: 64 type: MT MCP arch: Zen 2 rev: 0 cache:
L1: 768 KiB L2: 6 MiB L3: 64 MiB
Speed (MHz): avg: 3800 min/max: 2200/4672 boost: enabled cores: 1: 3800 2: 3800 3: 3800
4: 3800 5: 3800 6: 3800 7: 3800 8: 3800 9: 3800 10: 3800 11: 3800 12: 3800 13: 3800 14: 3800
15: 3800 16: 3800 17: 3800 18: 3800 19: 3800 20: 3800 21: 3800 22: 3800 23: 3800 24: 3800
bogomips: 182059
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3
Graphics:
Device-1: AMD Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] vendor: Tul / PowerColor
driver: amdgpu v: kernel arch: RDNA-2 bus-ID: 0e:00.0
Display: server: No display server data found. Headless machine? tty: 220x81
resolution: 2560x1440
API: OpenGL Message: GL data unavailable in console for root.
Audio:
Device-1: AMD Navi 21/23 HDMI/DP Audio driver: snd_hda_intel v: kernel bus-ID: 0e:00.1
Device-2: AMD Starship/Matisse HD Audio vendor: ASUSTeK driver: snd_hda_intel v: kernel
bus-ID: 10:00.4
API: ALSA v: k6.8.4-3-pve status: kernel-api
Network:
Device-1: Realtek RTL8125 2.5GbE driver: r8169 v: kernel port: c000 bus-ID: 06:00.0
IF: enp6s0 state: up speed: 2500 Mbps duplex: full mac: 00:e0:4c:2a:04:54
Device-2: Realtek RTL8125 2.5GbE driver: r8169 v: kernel port: b000 bus-ID: 07:00.0
IF: enp7s0 state: down mac: 00:e0:4c:2a:04:55
Device-3: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
vendor: ASUSTeK RTL8111/8168/8211/8411 driver: r8169 v: kernel port: d000 bus-ID: 08:00.0
IF: enp8s0 state: up speed: 1000 Mbps duplex: full mac: 24:4b:fe:06:03:1e
IF-ID-1: bonding_masters state: N/A speed: N/A duplex: N/A mac: N/A
RAID:
Device-1: rpool type: zfs status: ONLINE level: raidz1-0 raw: size: 21.8 TiB free: 16.7 TiB
zfs-fs: size: 14.41 TiB free: 10.98 TiB
Components: Online: 1: sda3 2: sdb3 3: sdc3
Drives:
Local Storage: total: raw: 22.35 TiB usable: 14.94 TiB used: 3.44 TiB (23.0%)
ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 950 PRO 512GB size: 476.94 GiB temp: 39.9 C
ID-2: /dev/sda vendor: Western Digital model: WD80EFZZ-68BTXN0 size: 7.28 TiB
ID-3: /dev/sdb vendor: Western Digital model: WD80EFZZ-68BTXN0 size: 7.28 TiB
ID-4: /dev/sdc vendor: Western Digital model: WD80EFZZ-68BTXN0 size: 7.28 TiB
Partition:
ID-1: / size: 14.41 TiB used: 3.44 TiB (23.8%) fs: zfs logical: rpool/ROOT/pbs-1
Swap:
Alert: No swap data was found.
Sensors:
System Temperatures: cpu: 42.2 C mobo: N/A gpu: amdgpu temp: 45.0 C
Fan Speeds (RPM): N/A gpu: amdgpu fan: 0
Info:
Processes: 491 Uptime: 54m Init: systemd target: graphical (5) Compilers: N/A Packages: 511
Shell: Bash v: 5.2.15 inxi: 3.3.26


This is the config I changed back then:

root@proxmox-backup:~# cat /sys/module/zfs/parameters/zfs_arc_max
17179869184
root@proxmox-backup:~# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184
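
For reference, the 17179869184 above is exactly 16 GiB, i.e. half of the installed 32 GiB. If the hangs are memory-pressure related, a lower limit might be worth trying; a minimal sketch, where the 8 GiB figure is only an example value, not a recommendation from this thread:

```shell
# 8 GiB expressed in bytes: 8 * 1024^3
ARC_MAX=$((8 * 1024 * 1024 * 1024))
echo "$ARC_MAX"

# Apply at runtime (takes effect immediately, lost on reboot):
echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max

# Make it persistent across reboots:
echo "options zfs zfs_arc_max=$ARC_MAX" > /etc/modprobe.d/zfs.conf
update-initramfs -u   # so the setting is also picked up in early boot
```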


This is the log, which unfortunately doesn't show anything suspicious:

Aug 24 11:23:28 proxmox-backup proxmox-backup-proxy[1593]: host/archive/2024-08-16T19:00:36Z keep
Aug 24 11:23:28 proxmox-backup proxmox-backup-proxy[1593]: host/archive/2024-08-17T07:00:37Z keep
Aug 24 11:23:28 proxmox-backup proxmox-backup-proxy[1593]: host/archive/2024-08-24T07:00:35Z keep
Aug 24 11:23:28 proxmox-backup proxmox-backup-proxy[1593]: TASK OK
Aug 24 11:34:57 proxmox-backup smartd[1226]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 125 to 124
Aug 24 11:34:57 proxmox-backup smartd[1226]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 125 to 124
Aug 24 11:35:10 proxmox-backup proxmox-backup-proxy[1593]: write rrd data back to disk
Aug 24 11:35:10 proxmox-backup proxmox-backup-proxy[1593]: starting rrd data sync
Aug 24 11:35:10 proxmox-backup proxmox-backup-proxy[1593]: rrd journal successfully committed (25 files in 0.069 seconds)
Aug 24 12:04:57 proxmox-backup smartd[1226]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 124 to 125
Aug 24 12:05:10 proxmox-backup proxmox-backup-proxy[1593]: write rrd data back to disk
Aug 24 12:05:10 proxmox-backup proxmox-backup-proxy[1593]: starting rrd data sync
Aug 24 12:05:10 proxmox-backup proxmox-backup-proxy[1593]: rrd journal successfully committed (25 files in 0.054 seconds)
Aug 24 12:10:01 proxmox-backup dhclient[1362]: DHCPREQUEST for 192.168.0.252 on enp8s0 to 192.168.0.1 port 67
Aug 24 12:10:01 proxmox-backup dhclient[1362]: DHCPACK of 192.168.0.252 from 192.168.0.1
Aug 24 12:10:01 proxmox-backup dhclient[1362]: bound to 192.168.0.252 -- renewal in 3529 seconds.
Aug 24 12:17:01 proxmox-backup CRON[2900]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 24 12:17:01 proxmox-backup CRON[2901]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 24 12:17:01 proxmox-backup CRON[2900]: pam_unix(cron:session): session closed for user root
Aug 24 12:35:10 proxmox-backup proxmox-backup-proxy[1593]: write rrd data back to disk
Aug 24 12:35:10 proxmox-backup proxmox-backup-proxy[1593]: starting rrd data sync
Aug 24 12:35:10 proxmox-backup proxmox-backup-proxy[1593]: rrd journal successfully committed (25 files in 0.098 seconds)
-- Reboot --
Aug 25 10:57:44 proxmox-backup kernel: Linux version 6.8.4-3-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) ()
Aug 25 10:57:44 proxmox-backup kernel: Command line: initrd=\EFI\proxmox\6.8.4-3-pve\initrd.img-6.8.4-3-pve root=ZFS=rpool/ROOT/pbs-1 boot=zfs

If you need any more info, please shoot; I am more than happy to provide it.

Thanks a lot!
 
Looks like you are doing a barbecue with your HDDs: normal temperature (depending on climate or not) is between 25-45°C, not around 125°C. That's no good condition for the bits and bytes on your platters and circuit boards, regardless of ZFS or any other filesystem.
 
SMART values aren't values you can use without conversion, so a "value" of 125 does not equal a temperature of 125°C.
The normalized value goes from 0 to 255, with (at least according to [1]) 255 being really cold and 0 being really hot. So while it is good to check, the value seems to be fine (but check the SMART value specs for your drive as to which value maps to which temperature, to be sure).

[1] https://serverfault.com/questions/693163/is-smartd-really-reporting-this-drive-is-too-hot
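
To see the actual temperature rather than the normalized figure that smartd logs, `smartctl -A` shows both side by side. A sketch with a hypothetical sample line (the numbers are illustrative, not from the OP's drives):

```shell
# On the real machine you would run: smartctl -A /dev/sdb
# Hypothetical sample output line for attribute 194, in smartctl's column layout:
# ID  ATTRIBUTE_NAME       FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE
line="194 Temperature_Celsius 0x0022 125 103 000 Old_age Always - 30"

# The 4th field is the normalized 0-255 value smartd reports;
# the last field (RAW_VALUE) is the real temperature in deg C.
echo "$line" | awk '{print "normalized:", $4, " raw degC:", $NF}'
```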
 
I was about to respond with something similar.

I checked the SMART section in PBS, and (apart from the "what is the real temperature here?" question) the reported max temperature was 115 and the current temperature 23°C.
 
Also, if you look at the usage graphs of the hosts around Aug 24 12:35:10, do they show any increased CPU, memory, or network usage compared to the couple of hours before?

And have you also tried limiting the processing/transfer speed of the backup? If it is a load problem somewhere, that could at least help, at the only 'cost' of backups taking a bit longer.

Do note that I don't use PBS myself, only PVE.

Oh, and one more thought: do you have HA set up on the hosts? Because if so, and especially since you only have 2 hosts, the backup could cause the latency to go up too much or for too long, causing an out-of-sync reset. (Even without that, do you at least have a Q-device, or are you perhaps running that on the PBS server?)
 
Hmm, I am not aware of how I could look at historical graphs; I will read up on that.

Regarding the backup, I just let my Proxmox do the backups and have nothing configured; I'd think it does that serially. This is the Advanced tab:

[screenshot: backup job Advanced tab]

Up until now I didn't even know I could limit anything, lol. I will try maybe 50 MB/s and see if it is stable.

That said, I would expect PBS to be able to handle the normal throughput of a backup operation and not crash. So I'd consider this at most a test to "debug" the error (don't get me wrong, I very much appreciate your help; this was more directed towards the "vendor").

No Ceph / HA.

Edit: what is a Q-device?

Edit 2: the graphs should show the history, but since it always crashes and I never really notice when it did, there's just no more history available right now. I will try to check this more frequently so that more hourly history is available.
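
If you want to set that 50 MB/s limit on the command line as well as in the GUI: vzdump's bandwidth limit is given in KiB/s. A sketch (the VMID 100 and storage name are placeholders, not from this thread):

```shell
# 50 MiB/s expressed in KiB/s for vzdump's bwlimit option:
echo $((50 * 1024))   # KiB/s

# One-off backup of a single guest with the limit applied:
# vzdump 100 --bwlimit 51200 --storage my-pbs-storage

# Or set it as the node-wide default in /etc/vzdump.conf:
# bwlimit: 51200
```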
 
A Q-device or "external vote daemon" [1] is a service running on something OTHER than the cluster, to make sure that quorum is broken in case of a tie.
Proxmox clusters rely on MORE than half of the nodes voting on something before performing any actions. In an even-numbered cluster (especially a small one), there are situations where only half (so not MORE than half) of the nodes can vote, and the cluster then halts any changes. For example: if only one of your two servers is online, you can't start any VMs/containers on it, change settings, restore backups, or do all the other things you would probably want in case the other server dies. (At least not without temporarily overriding this safety feature.) Unless, of course, these two servers aren't in a cluster (they are just standalone servers); in that case you're fine without a Q-device too (1 is an odd number, after all ;) )

For the graphs, I meant those in Proxmox itself, which you can find by clicking on the node and looking in the Summary tab (optionally changing the range in the top-right dropdown menu from 1 hour to longer periods). But I must have been half asleep when I wrote that and didn't notice that it was the PBS server that was crashing, rather than one of the PVE nodes.

And of course it should be able to handle more, but maybe this lets us either focus on or rule out a hardware-related issue. Also, when it does crash, does it show anything more on the console screen?

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
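
For completeness, the usual way to add such an external vote daemon, per the linked wiki page (<QDEVICE-IP> is a placeholder for the address of the external machine providing the third vote):

```shell
# On the external machine (e.g. a small box outside the cluster):
apt install corosync-qnetd

# On every PVE cluster node:
apt install corosync-qdevice

# Then, on one cluster node (needs root SSH access to the external machine):
pvecm qdevice setup <QDEVICE-IP>

# Verify the extra vote afterwards:
pvecm status
```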
 
Thank you for this. No, I do not have such a device; I had never even heard of such a thing before, but it sounds really cool. My whole setup is just a home "lab". I have thought about adding a third "server" (everything is just consumer hardware) to the cluster, maybe one day.

Regarding the graphs, yeah, I will monitor a little more eagerly to catch it when it fails, so that there is history to show.

I will see what the "throttling" brings, and I am also looking forward to what the Proxmox team might throw into the discussion; maybe they have a hint about what might cause this crash.

I really appreciate your help and all the insights here, @sw-omit!
 
Alright, here's a screenshot of the load now.

[screenshot: PBS load graphs]
There are basically three different backup operations plus a garbage collection after the reboot in this picture:
  1. the GC starting after I discovered that it had crashed and rebooted the box, yesterday at ~11:00
  2. yesterday at ~11:30, personal files from outside the Proxmox realm
  3. today at ~0:00, another round of backing up the personal files
  4. today at ~3:00, the VMs/containers from the Proxmox servers

The good news is that it ran through today, but we aren't out of the woods yet. Sometimes it ran through for a couple of days and still crashed a few days later.
 
Ah, I missed the question about console output.

Sometimes it did, sometimes it didn't. I thought I snapped a picture one time, but I can't find it anymore. However, it made me google, and I ended up modifying those ZFS options I mentioned in the initial post.
 
Ok, good that it worked this time. Like you said, let's keep it running for a few days (up to a week maybe) to see if it still crashes. If it now doesn't, it could be something hardware- or load-related, so running some stress tests might be a logical next step; and if it does, we at least now have a "clean" graph to compare against, to see if we notice something different (for example an out-of-memory condition, or a process that hangs at high CPU for too long).
 
Just took a look at our servers, where smartctl attribute 194 shows 28-40°C for different types of HDDs and SSDs.
 
