Hi, I have a lab environment with three PVE 8.1.4 hosts, each of which has an SSD mirror for PVE itself and a SATA ZFS mirror as the datastore for my VMs.
For backups I have another host running PBS 3.1-5, with an SSD mirror for the OS and a ZFS raidz datastore for backups.
Each of these hosts has a 1 Gbps NIC, and during backups or transfers between PVE hosts I usually saturate the 1 Gbps link between them.
So far so good, except when there is an I/O-intensive load on a host. Today, for example, I started a backup of a new VM (~500 GB) to PBS. The backup started with no problems, but after a few minutes some VMs became unresponsive and the PVE host running the new VM (the backup source) rebooted.
After checking my environment I started looking for the cause of this reboot/crash, and the only trace I found was the reboot itself in journald: no errors beforehand, no sign of instability, nothing.
Has anyone experienced this kind of behavior before?
Do you know how I can debug this kind of incident?
Besides journald or /var/log/syslog, is there any other PVE-specific log I can check to find the cause of this reboot/crash?
Code:
Apr 23 10:17:01 drakaris02 CRON[198750]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 23 10:17:01 drakaris02 CRON[198751]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 23 10:17:01 drakaris02 CRON[198750]: pam_unix(cron:session): session closed for user root
Apr 23 10:18:51 drakaris02 pmxcfs[1267]: [status] notice: received log
Apr 23 10:19:47 drakaris02 pvedaemon[3445903]: <root@pam> starting task UPID:drakaris02:00031189:204F0DBB:66276F23:vzdump:101:root@pam:
Apr 23 10:19:47 drakaris02 pvedaemon[201097]: INFO: starting new backup job: vzdump 101 --notification-mode auto --notes-template '{{guestname}}' --remove 0 --storage pbs-archive --mode snapshot --node drakaris02
Apr 23 10:19:47 drakaris02 pvedaemon[201097]: INFO: Starting Backup of VM 101 (qemu)
Apr 23 10:21:30 drakaris02 pvestatd[1394]: status update time (6.575 seconds)
-- Boot 42b0c380cd6f401bbb6445e19f5267be --
Apr 23 10:28:41 drakaris02 kernel: Linux version 6.5.11-8-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) ()
Apr 23 10:28:41 drakaris02 kernel: Command line: BOOT_IMAGE=/vmlinuz-6.5.11-8-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Apr 23 10:28:41 drakaris02 kernel: KERNEL supported cpus:
Apr 23 10:28:41 drakaris02 kernel: Intel GenuineIntel
Apr 23 10:28:41 drakaris02 kernel: AMD AuthenticAMD
Apr 23 10:28:41 drakaris02 kernel: Hygon HygonGenuine
Apr 23 10:28:41 drakaris02 kernel: Centaur CentaurHauls
Apr 23 10:28:41 drakaris02 kernel: zhaoxin Shanghai
Apr 23 10:28:41 drakaris02 kernel: BIOS-provided physical RAM map: