Hello, I have an issue with a server and its backup. We run a backup every day, and once every 4 to 8 days it reboots the server along with the VMs, with no warning.
The version is 8.1 with the latest kernel. The hardware is an ASUS RS520A-E11-RS12U/800W/12NVME with 2 Samsung SSD 980 PRO M.2 500GB drives, 5 Samsung PM9A3 960GB NVMe PCIe 4.0 x4 U.2 drives, and 64GB of registered ECC DDR4 memory. The errors we have encountered are caused by the PCIe bridge (80.03.1), and we are not sure what to do to stop them from happening. Below is a log with the errors. They are reported as corrected, but the server still reboots.
Feb 29 13:20:15 gc pveproxy[3184]: starting 1 worker(s)
Feb 29 13:20:15 gc pveproxy[3184]: worker 2312962 started
Feb 29 13:29:03 gc pvedaemon[847300]: <root@pam> successful auth for user 'root@pam'
Feb 29 14:17:02 gc CRON[2329788]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 29 14:17:02 gc CRON[2329789]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 29 14:17:02 gc CRON[2329788]: pam_unix(cron:session): session closed for user root
Feb 29 15:17:01 gc CRON[2347530]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 29 15:17:01 gc CRON[2347531]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 29 15:17:01 gc CRON[2347530]: pam_unix(cron:session): session closed for user root
Feb 29 16:17:01 gc CRON[2365287]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 29 16:17:01 gc CRON[2365288]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 29 16:17:01 gc CRON[2365287]: pam_unix(cron:session): session closed for user root
Feb 29 17:17:01 gc CRON[2382887]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 29 17:17:01 gc CRON[2382888]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 29 17:17:01 gc CRON[2382887]: pam_unix(cron:session): session closed for user root
Feb 29 18:17:01 gc CRON[2400600]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 29 18:17:01 gc CRON[2400601]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 29 18:17:01 gc CRON[2400600]: pam_unix(cron:session): session closed for user root
Feb 29 19:17:01 gc CRON[2418014]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 29 19:17:01 gc CRON[2418015]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 29 19:17:01 gc CRON[2418014]: pam_unix(cron:session): session closed for user root
Feb 29 20:17:01 gc CRON[2435617]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 29 20:17:01 gc CRON[2435618]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 29 20:17:01 gc CRON[2435617]: pam_unix(cron:session): session closed for user root
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: event severity: corrected
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: Error 0, type: corrected
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: section_type: PCIe error
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: port_type: 4, root port
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: version: 0.2
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: command: 0x0407, status: 0x0010
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: device_id: 0000:80:01.4
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: slot: 0
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: secondary_bus: 0x84
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1483
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: class_code: 060400
Feb 29 20:29:22 gc kernel: {5}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0012
Feb 29 20:29:22 gc kernel: pcieport 0000:80:01.4: AER: aer_status: 0x00000040, aer_mask: 0x00000000
Feb 29 20:29:22 gc kernel: pcieport 0000:80:01.4: [ 6] BadTLP
Feb 29 20:29:22 gc kernel: pcieport 0000:80:01.4: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
Feb 29 21:17:01 gc CRON[2453098]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 29 21:17:01 gc CRON[2453099]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 29 21:17:01 gc CRON[2453098]: pam_unix(cron:session): session closed for user root
Feb 29 22:00:00 gc pvescheduler[2465821]: <root@pam> starting task UPID:gc:0025A020:02CC8D54:65E0F050:vzdump::root@pam:
Feb 29 22:00:00 gc pvescheduler[2465824]: INFO: starting new backup job: vzdump 102 101 103 104 105 --storage local --mailnotification always --mode snapshot --notification-mode legacy-sendmail --prune-backups 'keep-weekl>
Feb 29 22:00:00 gc pvescheduler[2465824]: INFO: Starting Backup of VM 101 (qemu)
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: event severity: corrected
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: Error 0, type: corrected
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: section_type: PCIe error
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: port_type: 4, root port
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: version: 0.2
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: command: 0x0407, status: 0x0010
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: device_id: 0000:80:01.4
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: slot: 0
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: secondary_bus: 0x84
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1483
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: class_code: 060400
Feb 29 22:00:44 gc kernel: {6}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0012
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: event severity: corrected
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: Error 0, type: corrected
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: fru_text: PcieError
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: section_type: PCIe error
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: port_type: 0, PCIe end point
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: version: 0.2
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: command: 0x0406, status: 0x0010
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: device_id: 0000:84:00.0
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: slot: 0
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: secondary_bus: 0x00
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa80a
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: class_code: 010802
Feb 29 22:00:44 gc kernel: {7}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
Feb 29 22:00:44 gc kernel: pcieport 0000:80:01.4: AER: aer_status: 0x00000040, aer_mask: 0x00000000
Feb 29 22:00:44 gc kernel: pcieport 0000:80:01.4: [ 6] BadTLP
Feb 29 22:00:44 gc kernel: pcieport 0000:80:01.4: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
Feb 29 22:00:44 gc kernel: nvme 0000:84:00.0: AER: aer_status: 0x00003000, aer_mask: 0x00000000
Feb 29 22:00:44 gc kernel: nvme 0000:84:00.0: [12] Timeout
Feb 29 22:00:44 gc kernel: nvme 0000:84:00.0: [13] NonFatalErr
Feb 29 22:00:44 gc kernel: nvme 0000:84:00.0: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
Feb 29 22:02:01 gc pvescheduler[2465824]: INFO: Finished Backup of VM 101 (00:02:01)
Feb 29 22:02:01 gc pvescheduler[2465824]: INFO: Starting Backup of VM 102 (qemu)
Feb 29 22:07:37 gc pvescheduler[2465824]: INFO: Finished Backup of VM 102 (00:05:36)
Feb 29 22:07:38 gc pvescheduler[2465824]: INFO: Starting Backup of VM 103 (qemu)
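In case it helps when reading the log: the aer_status values look like plain bitmasks, and the bracketed numbers the kernel prints next to each error name ([ 6], [12], [13]) match the set bit positions. A minimal sketch, assuming that reading is right; the function name set_bits is just something I made up for illustration:

```python
def set_bits(mask: int) -> list[int]:
    """Return the positions of the bits set in a 32-bit status mask."""
    return [i for i in range(32) if mask & (1 << i)]


# The two aer_status values copied from the log above.
for raw in ("0x00000040", "0x00003000"):
    print(raw, set_bits(int(raw, 16)))
```

For the two values in the log this gives bit 6 (logged as BadTLP on the root port) and bits 12 and 13 (logged as Timeout and NonFatalErr on the NVMe device).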
Thank you for any tips.