Random host crashes after upgrade to 6.2-9

ilia987

We have 10 hosts in the cluster, and once every few days (usually adjacent to a backup task) some hosts reboot at random.
The reboots all happen at the same time.


Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-9 (running version: 6.2-9/4d363c5b)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-19
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-29-pve: 4.15.18-57
pve-kernel-4.15.18-28-pve: 4.15.18-56
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.0-11
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-10
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-9
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-8
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

This is the job that caused all the issues.
Any idea which logs I should investigate?
Screenshot from 2020-07-20 10-26-01.png
 
Any idea which logs I should investigate?

Check that task log (double-click it) and check the syslog around that time; without any error messages or more specific information it's hard to tell what's going on.
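A minimal sketch of how to pull that window out of the logs (assuming journald is keeping persistent logs; adjust the timestamps to the backup run in question):

Code:
# journal entries around the backup window
journalctl --since "2020-07-19 23:00:00" --until "2020-07-19 23:30:00"

# or grep the plain syslog file for that hour
grep "Jul 19 23:" /var/log/syslog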
 
Check that task log (double-click it) and check the syslog around that time; without any error messages or more specific information it's hard to tell what's going on.
I looked there but found nothing.
Here are the last log lines in the task (it was a task that backs up multiple containers and VMs):
Code:
INFO: Matched data: 4,001,484 bytes
INFO: File list size: 65,521
INFO: File list generation time: 0.001 seconds
INFO: File list transfer time: 0.000 seconds
INFO: Total bytes sent: 2,593,016
INFO: Total bytes received: 32,211
INFO: sent 2,593,016 bytes  received 32,211 bytes  71,924.03 bytes/sec
INFO: total size is 21,623,562,964  speedup is 8,236.84
INFO: final sync finished (36 seconds)
INFO: resume vm
INFO: guest is online again after 36 seconds
INFO: creating vzdump archive '/mnt/pve/nfs-backup/dump/vzdump-lxc-112-2020_07_19-23_10_45.tar.lzo'
 
Is the log from the failing backup? Because it looks like a simple successful backup to me, which does not bring us much further.
Do you have syslogs from before the hosts reboot? Is it always the same hosts?
 
Is the log from the failing backup? Because it looks like a simple successful backup to me, which does not bring us much further.
Do you have syslogs from before the hosts reboot? Is it always the same hosts?
Yes, it is from the task with the error.
From the syslog:
Code:
Jul 19 22:30:02 pve-srv1 vzdump[13451]: INFO: starting new backup job: vzdump 110 112 115 119 114 101 126 129 --mode suspend
...
...
...
Jul 19 23:17:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:17:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:17:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:17:01 pve-srv1 CRON[2337]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 19 23:18:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:18:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:18:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:18:11 pve-srv1 postfix/qmgr[14522]: 21BCA227862: from=<root@pve-srv1.MYDOMAIN.com>, size=1257, nrcpt=1 (queue active)
Jul 19 23:18:11 pve-srv1 postfix/qmgr[14522]: 739F8227021: from=<root@pve-srv1.MYDOMAIN.com>, size=81844, nrcpt=1 (queue active)
Jul 19 23:18:41 pve-srv1 postfix/smtp[13270]: connect to MYDOMAIN.com[XXX.XXX.XX.251]:25: Connection timed out
Jul 19 23:18:41 pve-srv1 postfix/smtp[13286]: connect to MYDOMAIN.com[XXX.XXX.XX..251]:25: Connection timed out
Jul 19 23:19:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:19:11 pve-srv1 postfix/smtp[13270]: connect to MYDOMAIN.com[XXX.XXX.XX..241]:25: Connection timed out
Jul 19 23:19:11 pve-srv1 postfix/smtp[13286]: connect to MYDOMAIN.com[XXX.XXX.XX.241]:25: Connection timed out
Jul 19 23:19:11 pve-srv1 postfix/smtp[13270]: 21BCA227862: to=<yitzikc@MYDOMAIN.com>, relay=none, delay=414394, delays=414334/0.11/60/0, dsn=4.4.1, status=deferred (connect to MYDOMAIN.com[XXX.XXX.XX.241]:25: Connection timed out)
Jul 19 23:19:11 pve-srv1 postfix/smtp[13286]: 739F8227021: to=<iliak@MYDOMAIN.com>, relay=none, delay=246262, delays=246202/0.05/60/0, dsn=4.4.1, status=deferred (connect to MYDOMAIN.com[XXX.XXX.XX.241]:25: Connection timed out)
Jul 19 23:19:13 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:19:13 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:20:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:20:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:20:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:21:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:21:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:21:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:22:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:22:01 pve-srv1 systemd[1]: pvesr.service: Succeeded.
Jul 19 23:22:01 pve-srv1 systemd[1]: Started Proxmox VE replication runner.
Jul 19 23:22:16 pve-srv1 pvestatd[14836]: storage 'regression' is not online
Jul 19 23:22:16 pve-srv1 pvestatd[14836]: status update time (12.384 seconds)
Jul 19 23:23:00 pve-srv1 systemd[1]: Starting Proxmox VE replication runner...
Jul 19 23:23:21 pve-srv1 watchdog-mux[13204]: client watchdog expired - disable watchdog updates
Jul 19 23:25:52 pve-srv1 systemd[1]: Starting Flush Journal to Persistent Storage...
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000] Linux version 5.4.44-2-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) ()
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.4.44-2-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000] KERNEL supported cpus:
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000]   Intel GenuineIntel
Jul 19 23:25:52 pve-srv1 kernel: [    0.000000]   AMD AuthenticAMD
 
Is it always the same hosts? Do you have a cluster? If yes, then is HA enabled?
 
Is it always the same hosts? Do you have a cluster? If yes, then is HA enabled?
It happens on multiple hosts, and as far as I know always on the same ones (all hosts that run the latest kernel). We have a few more hosts which I did not reboot after the upgrade because they also work as Ceph storage. (I'll reboot them in a few weeks, once we get a few more servers and increase the Ceph node count.)

Yes, HA is enabled.
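For reference, this is roughly how I check which kernel a node is running and what HA manages on the cluster (a sketch using the standard PVE command-line tools):

Code:
# kernel currently running on this node
uname -r

# overall HA status (quorum, node states) and the configured HA resources
ha-manager status
ha-manager config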
 
It occurred again today. I found a few more details about the flow:
  1. a scheduled backup task started (for 8 LXC containers)
  2. 3 LXC backups finished successfully
  3. the next container backup started:
    Code:
    INFO: Starting Backup of VM 114 (lxc) (this is the end of the task log; after this all the relevant hosts restarted)
    INFO: Backup started at 2020-07-21 23:30:32
    INFO: status = running
    INFO: backup mode: suspend
    INFO: ionice priority: 7
    INFO: CT Name: centos-1
    INFO: including mount point rootfs ('/') in backup
    INFO: including mount point mp0 ('/home/local/MYDOMAIN/') in backup
    INFO: excluding bind mount point mp1 ('/mnt/trade_data') from backup (not a volume)
    INFO: excluding bind mount point mp2 ('/home/filer') from backup (not a volume)
    INFO: excluding bind mount point mp3 ('/mnt/docs') from backup (not a volume)
    INFO: excluding bind mount point mp4 ('/mnt/ftd') from backup (not a volume)
    INFO: excluding bind mount point mp5 ('/mnt/scratch') from backup (not a volume)
    INFO: excluding bind mount point mp6 ('/mnt/regression') from backup (not a volume)
    INFO: starting first sync /proc/33755/root// to /var/tmp/vzdumptmp39077
  4. a full host reboot occurs on every host that has one of the containers in the backup task
  5. the backed-up container does not start automatically after the reboot, because it is in a locked state, and I have to manually unlock it (pct unlock 114; see the sketch below)

* Just tested: running the backup task manually finishes successfully.
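For completeness, this is roughly what I run after such a reboot to recover the container and to re-test the backup by hand (a sketch; container 114 and the 'nfs-backup' storage name are taken from the log paths above):

Code:
# check whether the container still carries the stale backup lock
pct config 114 | grep -i lock

# clear the lock left over from the interrupted backup and start the container
pct unlock 114
pct start 114

# re-run the backup for this single container manually
vzdump 114 --mode suspend --storage nfs-backup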
 
I think I found something that causes it; I hope you can recreate it:

I just ran a manual backup of an LXC container (Ubuntu 18.04) with multiple NFS mount points, while one of the NFS mounts was unavailable (down).
The backup failed with the following error:
Code:
INFO: excluding bind mount point mp7 ('/mnt/recordings') from backup (not a volume)
INFO: excluding bind mount point mp8 ('/mnt/data-raw') from backup (not a volume)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
failed to open /mnt/data-raw: Input/output error
INFO: creating vzdump archive '/mnt/pve/nfs-backup/dump/vzdump-lxc-127-2020_09_14-13_04_46.tar.zst'

and it caused the entire NODE to reboot.

/mnt/data-raw is an NFS mount to an external location (that we have trouble maintaining a stable connection to); hopefully it will be fixed in a few days.
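In case it helps with reproducing it: this is a quick way to see whether such an NFS mount is hung before triggering a backup (a sketch; the 5-second timeout is arbitrary):

Code:
# list the NFS mounts currently known to the node
findmnt -t nfs,nfs4

# a hung NFS mount usually blocks on stat; give up after 5 seconds
timeout 5 stat /mnt/data-raw >/dev/null && echo "mount responsive" || echo "mount hung or down"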
 
