Reboot after heavy I/O load

Tasslehoff

New Member
Dec 22, 2023
Hi, I have a lab environment with three PVE 8.1.4 hosts, each of which has an SSD mirror for PVE itself and a SATA ZFS mirror as the datastore for my VMs.
For backups I have another host with PBS 3.1-5, with an SSD mirror for the OS and a ZFS RAIDZ datastore for backups.
Each of these hosts has a 1 Gbps NIC, and during backups or transfers between PVE hosts I usually saturate the Gbps bandwidth between them.

So far so good, except when I have some I/O intensive load on a host. Today, for example, I started a backup of a new VM (~500 GB) to PBS; the backup started with no problems, but after a few minutes some VMs became unresponsive and the PVE host running the new VM (the backup source) rebooted.

After checking my environment I started looking for the cause of this reboot/crash, and the only trace I found was a plain entry in journald for the reboot: no errors before it, no sign of instability, nothing.

Code:
Apr 23 10:17:01 drakaris02 CRON[198750]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 23 10:17:01 drakaris02 CRON[198751]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 23 10:17:01 drakaris02 CRON[198750]: pam_unix(cron:session): session closed for user root
Apr 23 10:18:51 drakaris02 pmxcfs[1267]: [status] notice: received log
Apr 23 10:19:47 drakaris02 pvedaemon[3445903]: <root@pam> starting task UPID:drakaris02:00031189:204F0DBB:66276F23:vzdump:101:root@pam:
Apr 23 10:19:47 drakaris02 pvedaemon[201097]: INFO: starting new backup job: vzdump 101 --notification-mode auto --notes-template '{{guestname}}' --remove 0 --storage pbs-archive --mode snapshot --node drakaris02
Apr 23 10:19:47 drakaris02 pvedaemon[201097]: INFO: Starting Backup of VM 101 (qemu)
Apr 23 10:21:30 drakaris02 pvestatd[1394]: status update time (6.575 seconds)
-- Boot 42b0c380cd6f401bbb6445e19f5267be --
Apr 23 10:28:41 drakaris02 kernel: Linux version 6.5.11-8-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) ()
Apr 23 10:28:41 drakaris02 kernel: Command line: BOOT_IMAGE=/vmlinuz-6.5.11-8-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Apr 23 10:28:41 drakaris02 kernel: KERNEL supported cpus:
Apr 23 10:28:41 drakaris02 kernel:   Intel GenuineIntel
Apr 23 10:28:41 drakaris02 kernel:   AMD AuthenticAMD
Apr 23 10:28:41 drakaris02 kernel:   Hygon HygonGenuine
Apr 23 10:28:41 drakaris02 kernel:   Centaur CentaurHauls
Apr 23 10:28:41 drakaris02 kernel:   zhaoxin   Shanghai
Apr 23 10:28:41 drakaris02 kernel: BIOS-provided physical RAM map:

Has anyone experienced this kind of behavior before?
Do you know how I can debug this kind of incident?
Besides journald or /var/log/syslog, is there any other PVE-specific log I can check to find the cause of this reboot/crash?
 
If each node has only a single 1 Gbps connection and you see issues during a backup or other heavy transfer, the backup traffic is most likely interfering with the cluster traffic between the nodes.

Set a bandwidth limit on the backups to avoid the congestion.
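For example (just a sketch - 51200 KiB/s, roughly 400 Mbit/s, is an arbitrary value chosen to leave headroom on a 1 Gbps link), you can set a global default in /etc/vzdump.conf or pass a limit for a single job:

Code:
# /etc/vzdump.conf - default limit for all backup jobs, value in KiB/s
bwlimit: 51200

# or only for a single job on the CLI
vzdump 101 --storage pbs-archive --mode snapshot --bwlimit 51200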
 
It's actually preferable to run corosync on its own separate network; I believe the docs mention this.

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
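
As a rough sketch of what that looks like when building a cluster (the cluster name and 10.10.10.x addresses are just placeholders for whatever subnet you dedicate to corosync); for an existing cluster the linked docs describe moving the link by editing /etc/pve/corosync.conf instead:

Code:
# on the first node: bind corosync link0 to the dedicated network
pvecm create mycluster --link0 10.10.10.1

# on each further node: join via an existing member, giving this node's own address on that network
pvecm add 10.10.10.1 --link0 10.10.10.2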

@Tasslehoff it may be worth looking into setting up a separate 2.5 Gbit network - either with PCIe Intel-based adapters and/or USB3 (likely Realtek chipset) - as your backups will complete faster and won't be interfering with cluster comms.

2.5 GbE is basically the last gasp for Cat5e cables. Just for consideration, I run 172.16.25.0/24 for 2.5 GbE and 172.16.10.0/24 for 10 Gbit.

You can set up a small VM with IPFire or the like for DHCP + NTP/time service on that net.
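
If it helps, a minimal sketch of how the dedicated NIC could look in /etc/network/interfaces on a node (enp3s0 and the .11 address are placeholders for your 2.5 GbE adapter and host):

Code:
# dedicated 2.5 GbE link for backup/migration traffic
auto enp3s0
iface enp3s0 inet static
        address 172.16.25.11/24
        # no gateway here - the default route stays on the existing 1 Gbps network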