Random host crashes after upgrade to 6.2-9

ilia987

We have 10 hosts in the cluster, and once every few days (usually adjacent to a backup task) some hosts reboot at random.
The reboots all happen at the same time.


Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-9 (running version: 6.2-9/4d363c5b)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-19
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-29-pve: 4.15.18-57
pve-kernel-4.15.18-28-pve: 4.15.18-56
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.0-11
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-10
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-9
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-8
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

This is the job that caused all the issues.
Any idea which logs I should investigate?
Screenshot from 2020-07-20 10-26-01.png
 
Any idea which logs I should investigate?

Check that task log (double-click it) and check the syslog around that time; without any error messages or more specific information it's hard to tell what's going on.
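A minimal sketch of how to pull that window out of the logs (assuming journald is keeping persistent logs; adjust the timestamps to the backup run in question):

Code:
# journal entries around the backup window
journalctl --since "2020-07-19 23:00:00" --until "2020-07-19 23:30:00"

# or grep the plain syslog file for that hour
grep "Jul 19 23:" /var/log/syslog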
 
Check that task log (double-click it) and check the syslog around that time; without any error messages or more specific information it's hard to tell what's going on.
I looked there but found nothing.
Here are the last log lines in the task (it was a task that backs up multiple containers and VMs):
Code:
INFO: Matched data: 4,001,484 bytes
INFO: File list size: 65,521
INFO: File list generation time: 0.001 seconds
INFO: File list transfer time: 0.000 seconds
INFO: Total bytes sent: 2,593,016
INFO: Total bytes received: 32,211
INFO: sent 2,593,016 bytes  received 32,211 bytes  71,924.03 bytes/sec
INFO: total size is 21,623,562,964  speedup is 8,236.84
INFO: final sync finished (36 seconds)
INFO: resume vm
INFO: guest is online again after 36 seconds
INFO: creating vzdump archive '/mnt/pve/nfs-backup/dump/vzdump-lxc-112-2020_07_19-23_10_45.tar.lzo'
 
Is the log from the failing backup? Because it looks like a simple successful backup to me, which does not bring us much further.
Do you have syslogs from before the hosts reboot? Is it always the same hosts?
 
Is the log from the failing backup? Because it looks like a simple successful backup to me, which does not bring us much further.
Do you have syslogs from before the hosts reboot? Is it always the same hosts?
Yes, it is from the task with the error.
From the syslog:
Code:
Jul 19 22:30:02 pve-srv1 vzdump[13451]: INFO: starting new backup job: vzdump 110 112 115 119 114 101 126 129 --mode suspend
...
...
...
Jul 19 23:17:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:17:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:17:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:17:01 pve-srv1 CRON[2337]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 19 23:18:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:18:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:18:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:18:11 pve-srv1 postfix/qmgr[14522]: 21BCA227862: from=<root@pve-srv1.MYDOMAIN.com>, size=1257, nrcpt=1 (queue active)
Jul 19 23:18:11 pve-srv1 postfix/qmgr[14522]: 739F8227021: from=<root@pve-srv1.MYDOMAIN.com>, size=81844, nrcpt=1 (queue active)
Jul 19 23:18:41 pve-srv1 postfix/smtp[13270]: connect to MYDOMAIN.com[XXX.XXX.XX.251]:25: Connection timed out
Jul 19 23:18:41 pve-srv1 postfix/smtp[13286]: connect to MYDOMAIN.com[XXX.XXX.XX..251]:25: Connection timed out
Jul 19 23:19:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:19:11 pve-srv1 postfix/smtp[13270]: connect to MYDOMAIN.com[XXX.XXX.XX..241]:25: Connection timed out
Jul 19 23:19:11 pve-srv1 postfix/smtp[13286]: connect to MYDOMAIN.com[XXX.XXX.XX.241]:25: Connection timed out
Jul 19 23:19:11 pve-srv1 postfix/smtp[13270]: 21BCA227862: to=<yitzikc@MYDOMAIN.com>, relay=none, delay=414394, delays=414334/0.11/60/0, dsn=4.4.1, status=deferred (connect to MYDOMAIN.com[XXX.XXX.XX.241]:25: Connection timed out)
Jul 19 23:19:11 pve-srv1 postfix/smtp[13286]: 739F8227021: to=<iliak@MYDOMAIN.com>, relay=none, delay=246262, delays=246202/0.05/60/0, dsn=4.4.1, status=deferred (connect to MYDOMAIN.com[XXX.XXX.XX.241]:25: Connection timed out)
Jul 19 23:19:13 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:19:13 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:20:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:20:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:20:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:21:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:21:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:21:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:22:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:22:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:22:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:22:16 pve-srv1 pvestatd[14836]: storage 'regression' is not online
Jul 19 23:22:16 pve-srv1 pvestatd[14836]: status update time (12.384 seconds)
Jul 19 23:23:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:23:21 pve-srv1 watchdog-mux[13204]: client watchdog expired - disable watchdog updates
Jul 19 23:25:52 pve-srv1 systemd[1]: Starting Flush Journal to Persistent Storage...
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000] Linux version 5.4.44-2-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) ()
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.4.44-2-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000] KERNEL supported cpus:
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000]   Intel GenuineIntel
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000]   AMD AuthenticAMD
 
Is it always the same hosts? Do you have a cluster? If yes, then is HA enabled?
 
Is it always the same hosts? Do you have a cluster? If yes, then is HA enabled?
It happens on multiple hosts, and as far as I know always on the same ones (all hosts that run the latest kernel). We have a few more hosts which I did not reboot after the upgrade because they also work as Ceph storage. (I'll reboot them in a few weeks, once we get a few more servers and increase the Ceph node count.)

Yes, HA is enabled.
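For reference, this is roughly how I check which kernel a node is running and what HA manages on the cluster (a sketch using the standard PVE command-line tools):

Code:
# kernel currently running on this node
uname -r

# overall HA status (quorum, node states) and the configured HA resources
ha-manager status
ha-manager config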
 
It occurred again today. I found a few more details about the flow:
  1. a scheduled backup task started (for 8 LXC containers)
  2. 3 LXC backups finished successfully
  3. the next container backup started:
    Code:
    INFO: Starting Backup of VM 114 (lxc) (this is the end of the task log; after this all the relevant hosts restarted)
    INFO: Backup started at 2020-07-21 23:30:32
    INFO: status = running
    INFO: backup mode: suspend
    INFO: ionice priority: 7
    INFO: CT Name: centos-1
    INFO: including mount point rootfs ('/') in backup
    INFO: including mount point mp0 ('/home/local/MYDOMAIN/') in backup
    INFO: excluding bind mount point mp1 ('/mnt/trade_data') from backup (not a volume)
    INFO: excluding bind mount point mp2 ('/home/filer') from backup (not a volume)
    INFO: excluding bind mount point mp3 ('/mnt/docs') from backup (not a volume)
    INFO: excluding bind mount point mp4 ('/mnt/ftd') from backup (not a volume)
    INFO: excluding bind mount point mp5 ('/mnt/scratch') from backup (not a volume)
    INFO: excluding bind mount point mp6 ('/mnt/regression') from backup (not a volume)
    INFO: starting first sync /proc/33755/root// to /var/tmp/vzdumptmp39077
  4. a full host reboot occurs on every host that has one of the containers in the backup task
  5. the backed-up container does not start automatically after the reboot, because it is in a locked state, and I have to manually unlock it (pct unlock 114; see the sketch below)

* Just tested: running the backup task manually finishes successfully.
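For completeness, this is roughly what I run after such a reboot to recover the container and to re-test the backup by hand (a sketch; container 114 and the 'nfs-backup' storage name are taken from the log paths above):

Code:
# check whether the container still carries the stale backup lock
pct config 114 | grep -i lock

# clear the lock left over from the interrupted backup and start the container
pct unlock 114
pct start 114

# re-run the backup for this single container manually
vzdump 114 --mode suspend --storage nfs-backup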
 
I think I found something that causes it; I hope you can recreate it:

I just ran a manual backup of an LXC container (Ubuntu 18.04) with multiple NFS mount points, while one of the NFS mounts was unavailable (down).
The backup failed with the following error:
Code:
INFO: excluding bind mount point mp7 ('/mnt/recordings') from backup (not a volume)
INFO: excluding bind mount point mp8 ('/mnt/data-raw') from backup (not a volume)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
failed to open /mnt/data-raw: Input/output error
INFO: creating vzdump archive '/mnt/pve/nfs-backup/dump/vzdump-lxc-127-2020_09_14-13_04_46.tar.zst'

and it caused the entire NODE to reboot.

/mnt/data-raw is an NFS mount to an external location (that we have trouble maintaining a stable connection to); hopefully it will be fixed in a few days.
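In case it helps with reproducing it: this is a quick way to see whether such an NFS mount is hung before triggering a backup (a sketch; the 5-second timeout is arbitrary):

Code:
# list the NFS mounts currently known to the node
findmnt -t nfs,nfs4

# a hung NFS mount usually blocks on stat; give up after 5 seconds
timeout 5 stat /mnt/data-raw >/dev/null && echo "mount responsive" || echo "mount hung or down"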
 
