PBS goes off regularly

manuelkamp

Member
May 10, 2022
I have been running PBS for a few months without any issues. In the last week, it went offline several times and needed a manual restart each time. I cannot find any cause (errors) in the logs:

Code:
May 09 16:05:39 backup proxmox-backup-proxy[1167]: Size: 540672
May 09 16:05:39 backup proxmox-backup-proxy[1167]: Chunk count: 1
May 09 16:05:39 backup proxmox-backup-proxy[1167]: Upload size: 4734976 (875%)
May 09 16:05:39 backup proxmox-backup-proxy[1167]: Duplicates: 0+1 (100%)
May 09 16:05:39 backup proxmox-backup-proxy[1167]: Compression: 0%
May 09 16:05:39 backup proxmox-backup-proxy[1167]: successfully closed fixed index 1
May 09 16:05:39 backup proxmox-backup-proxy[1167]: add blob "/mnt/datastore/Backup/vm/201/2022-05-09T14:02:35Z/index.json.blob" (446 bytes, comp: 446)
May 09 16:05:39 backup proxmox-backup-proxy[1167]: successfully finished backup
May 09 16:05:39 backup proxmox-backup-proxy[1167]: backup finished successfully
May 09 16:05:39 backup proxmox-backup-proxy[1167]: TASK OK
May 09 16:05:39 backup proxmox-backup-proxy[1167]: Upload backup log to Backup/vm/201/2022-05-09T14:02:35Z/client.log.blob
May 09 16:17:01 backup CRON[448408]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 09 16:17:01 backup CRON[448409]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 09 16:17:01 backup CRON[448408]: pam_unix(cron:session): session closed for user root
May 09 16:19:07 backup smartd[905]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 78 to 82
May 09 16:19:07 backup smartd[905]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 52 to 50
May 09 16:19:07 backup smartd[905]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 48 to 50
May 09 16:19:07 backup smartd[905]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 78 to 82
May 09 16:19:07 backup smartd[905]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 65
May 09 16:19:07 backup smartd[905]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 51
May 09 16:19:07 backup smartd[905]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 49
May 09 16:19:07 backup smartd[905]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 82 to 65
May 09 16:19:07 backup smartd[905]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 151 to 147
May 09 16:19:11 backup proxmox-backup-proxy[1167]: write rrd data back to disk
May 09 16:19:11 backup proxmox-backup-proxy[1167]: starting rrd data sync
May 09 16:19:11 backup proxmox-backup-proxy[1167]: rrd journal successfully committed (23 files in 0.007 seconds)
-- Reboot --
May 10 10:30:38 backup kernel: Linux version 5.13.19-6-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200) ()
May 10 10:30:38 backup kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.13.19-6-pve root=/dev/mapper/pbs-root ro quiet
May 10 10:30:38 backup kernel: KERNEL supported cpus:
May 10 10:30:38 backup kernel:   Intel GenuineIntel
May 10 10:30:38 backup kernel:   AMD AuthenticAMD
May 10 10:30:38 backup kernel:   Hygon HygonGenuine
May 10 10:30:38 backup kernel:   Centaur CentaurHauls
May 10 10:30:38 backup kernel:   zhaoxin   Shanghai 
May 10 10:30:38 backup kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 10 10:30:38 backup kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
May 10 10:30:38 backup kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
May 10 10:30:38 backup kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
May 10 10:30:38 backup kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
May 10 10:30:38 backup kernel: BIOS-provided physical RAM map:

Does anyone have an idea what to look for and how to solve this?

Thanks!
 
Could you give a bit more information about the system? PBS version, storage used for the datastores, anything special about the hardware?
 
Sure, no secrets here since it is only my private playground in my homelab.

PBS 2.1-6
CPU: 4 x AMD Ryzen 3 1200 Quad-Core Processor (1 Socket)
Kernel: Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200)
Storage root: Crucial MX500 500GB, SATA SSD
Storage Data: ZFS Raidz1 3x ST12000VE0008 12TB Harddisk Seagate SkyHawk AI - Video
RAM: 32GB DDR4
Network: 2x Gigabit PCIe cards (both are the same model; I am not sure exactly which, but I can look it up if necessary), LACP (802.3ad) Linux bond
DNS and Gateway are configured and working (HA cluster).

No other hardware is installed in this 2U 19" unit.

The PBS also votes in the quorum of my 2-node HA cluster.
No other services are running on the PBS unit.

EDIT:
Backups of 11 CTs and 7 VMs are scheduled daily at 03:00.
Prune is scheduled at 21:00.
GC runs weekly on Sundays.
Attachments: 1.png, 2.png, 3.png
 
I'd check your ZFS ARC settings - possibly you are running out of memory. Also, the current 5.15 kernel series seems to be a lot more stable on Zen hardware in general, so it is probably worth a try as well. If you haven't already, disabling power-saving states in the BIOS should also improve stability on such hardware.
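
As a rough illustration of the ARC suggestion: on ZFS on Linux, the upper bound of the ARC is controlled by the `zfs_arc_max` module parameter, given in bytes. The sketch below only prints the line one would place in `/etc/modprobe.d/zfs.conf`; the function name is made up for this example and the 8 GiB figure is an arbitrary assumption, not a recommendation for this system.

```shell
#!/bin/sh
# Print a modprobe option line that caps the ZFS ARC at the given size.
# The size in GiB is an illustrative assumption; pick a value that fits the host.
arc_max_option() {
    gib="$1"
    # zfs_arc_max expects bytes: GiB * 1024^3
    echo "options zfs zfs_arc_max=$((gib * 1024 * 1024 * 1024))"
}

arc_max_option 8
```

The printed line would take effect the next time the zfs module is loaded. The same byte value can usually also be written to `/sys/module/zfs/parameters/zfs_arc_max` to apply it at runtime.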
 
Thank you. Based on the usage graphs from the time of the issue, it does not look like an out-of-memory problem to me (the amount of used RAM looks similar for the other outages):
Attachment: 1.png
Code:
ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    50.2 %    7.9 GiB
        Target size (adaptive):                        50.6 %    7.9 GiB
        Min size (hard limit):                         6.2 %  1001.0 MiB
        Max size (high water):                           16:1   15.6 GiB
        Most Frequently Used (MFU) cache size:          0.4 %   33.3 MiB
        Most Recently Used (MRU) cache size:           99.6 %    7.8 GiB
        Metadata cache size (hard limit):              75.0 %   11.7 GiB
        Metadata cache size (current):                  1.8 %  212.3 MiB
        Dnode cache size (hard limit):                 10.0 %    1.2 GiB
        Dnode cache size (current):                     0.9 %   10.2 MiB

ARC hash breakdown:
        Elements max:                                              80.7k
        Elements current:                             100.0 %      80.7k
        Collisions:                                                 1.0k
        Chain max:                                                     2
        Chains:                                                      753

ARC misc:
        Deleted:                                                     552
        Mutex misses:                                                  0
        Eviction skips:                                                3
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                     0 Bytes
        L2 eligible evictions:                                   8.3 MiB
        L2 eligible MFU evictions:                      2.7 %  224.5 KiB
        L2 eligible MRU evictions:                     97.3 %    8.0 MiB
        L2 ineligible evictions:                                64.0 KiB

ARC total accesses (hits + misses):                               353.5k
        Cache hit ratio:                               95.0 %     335.9k
        Cache miss ratio:                               5.0 %      17.6k
        Actual hit ratio (MFU + MRU hits):             94.8 %     335.0k
        Data demand efficiency:                        99.0 %      37.0k
        Data prefetch efficiency:                       0.0 %        730

Cache hits by cache type:
        Most frequently used (MFU):                    71.9 %     241.3k
        Most recently used (MRU):                      27.9 %      93.7k
        Most frequently used (MFU) ghost:               0.0 %          0
        Most recently used (MRU) ghost:                 0.0 %          0
        Anonymously used:                               0.2 %        817

Cache hits by data type:
        Demand data:                                   10.9 %      36.6k
        Demand prefetch data:                           0.0 %          0
        Demand metadata:                               88.9 %     298.4k
        Demand prefetch metadata:                       0.2 %        820

Cache misses by data type:
        Demand data:                                    2.0 %        361
        Demand prefetch data:                           4.1 %        730
        Demand metadata:                               77.1 %      13.6k
        Demand prefetch metadata:                      16.7 %       2.9k

DMU prefetch efficiency:                                            3.8k
        Hit ratio:                                     26.7 %       1.0k
        Miss ratio:                                    73.3 %       2.8k
Is there a better way to detect memory issues in PBS?
Power saving is already disabled, but I checked it again just now to be sure.
I will give the 5.15 kernel a try and report back on how it behaves over the next 1-2 days.
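
One possible way to look for memory problems after the fact, sketched under the assumption that the journal is persistent (the "-- Reboot --" marker in the log above suggests it is): search the kernel messages of the previous boot for OOM-killer entries. The helper name `find_oom_lines` is made up for this example. A hard freeze may leave no such entries at all, so an empty result does not rule memory out.

```shell
#!/bin/sh
# Filter kernel log text on stdin for typical OOM-killer messages.
# Returns success even when nothing matches, so it is safe under "set -e".
find_oom_lines() {
    grep -i -E 'out of memory|oom-kill|invoked oom-killer' || true
}

# Typical use on the host: kernel messages from the previous boot.
# journalctl -k -b -1 --no-pager | find_oom_lines
```

If this prints nothing, the kernel did not log an OOM kill before the outage, which would fit the usage graphs.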
 
If nothing else helps, you might want to try a net or serial console - it might capture some output that is otherwise lost.
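
A minimal sketch of the netconsole variant, assuming the kernel's `netconsole` module is available: it sends kernel messages as UDP packets to another machine. The helper below only builds the module parameter string; the helper name, the addresses, the interface name and the MAC are placeholders, and whether netconsole works on top of this particular bond setup is untested here.

```shell
#!/bin/sh
# Build the netconsole module parameter string.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
netconsole_param() {
    src_ip="$1"; dev="$2"; tgt_ip="$3"; tgt_mac="$4"
    # 6665 and 6666 are the documented default source and target ports.
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
}

# All addresses and the interface name below are placeholders.
netconsole_param 192.0.2.10 bond0 192.0.2.20 00:11:22:33:44:55
```

The resulting string would be passed to `modprobe netconsole` on the PBS host, with a UDP listener on port 6666 (for example netcat; its flags differ between variants) running on the receiving machine.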
 
it's not the default (yet), but has been available for quite a while. our kernels are pretty much in sync across PVE, PBS and PMG.
 
I was aware of its availability on PVE (I have been running it almost since it became available), but how do I install it on PBS? Like this?

Code:
apt update && apt install pve-kernel-5.15
 
yeah, we still have that prefix in some places where "proxmox-" would be more appropriate nowadays ;)
 
