Unsolicited reboot

Znuf

Member
Mar 29, 2021
Hello, I have a problem with my Proxmox server. For some reason, about once a month the server reboots on its own.

Today it happened between 17:00 and 17:04.



My syslog:

Code:
Mar 29 16:57:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 16:58:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 29 16:58:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 29 16:58:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 16:59:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 29 16:59:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 29 16:59:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 17:00:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 29 17:00:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 29 17:00:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 17:01:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 29 17:01:00 pve systemd[1]: pvesr.service: Succeeded.
Mar 29 17:01:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 29 17:04:21 pve systemd-modules-load[1602]: Inserted module 'iscsi_tcp'
Mar 29 17:04:21 pve systemd-modules-load[1602]: Inserted module 'ib_iser'
Mar 29 17:04:21 pve systemd-modules-load[1602]: Inserted module 'vhost_net'
Mar 29 17:04:21 pve systemd[1]: Starting Flush Journal to Persistent Storage...
Mar 29 17:04:21 pve systemd[1]: Started Flush Journal to Persistent Storage.
Mar 29 17:04:21 pve systemd[1]: Started udev Coldplug all Devices.
Mar 29 17:04:21 pve systemd[1]: Starting udev Wait for Complete Device Initialization...
Mar 29 17:04:21 pve systemd[1]: Starting Helper to synchronize boot up for ifupdown...
Mar 29 17:04:21 pve systemd[1]: Started udev Kernel Device Manager.
Mar 29 17:04:21 pve systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Mar 29 17:04:21 pve systemd[1]: Reached target Local File Systems (Pre).
Mar 29 17:04:21 pve systemd-udevd[1667]: Using default interface naming scheme 'v240'.
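
Note that the log jumps straight from normal operation at 17:01 to early-boot messages at 17:04, with no shutdown sequence in between, so it looks like a hard reset rather than a clean reboot. Assuming persistent journaling is enabled, the tail of the previous boot's journal can be checked like this (a minimal sketch):

Code:
# list all recorded boots with their time ranges
journalctl --list-boots
# show the last 50 journal lines from the boot before the current one
journalctl -b -1 -n 50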

Do you know what the problem could be? Thanks.

Hardware:
CPU: AMD Ryzen 7 3700X 8-Core Processor
MB: Asus PRIME B450M-A
RAM: 32 GB
HDD: 2 TB, ZFS
 
hi,

* is the server clustered?

* is HA enabled?

* is it regularly once a month? maybe it's a cron job or systemd timer? (a few commands to check are sketched below)

* pveversion -v
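
to rule out a scheduled job, the usual places to look are the user crontabs, the system cron directories, and the systemd timers; a quick sketch:

Code:
# per-user and system crontabs
crontab -l
cat /etc/crontab
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly /etc/cron.monthly
# active systemd timers
systemctl list-timers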
 
Hi,

Thanks for your reply.
No, it's not clustered.
Yes, HA is enabled.
No, it's not regular.

pveversion -v:
Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.11-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-4
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

My crontab has only one line: 0 4 * * * rsync -av --delete /var/lib/vz/dump/ /home/network/
 
No, it's not regular.
if it's not happening regularly, then we can rule out a cron job or timer



HA is enabled but you don't have a cluster? I'm not sure that's really the case
can you post the output from ha-manager status --verbose?
 
ha-manager status --verbose
Code:
quorum OK
master pve (active, Tue Mar 30 17:01:41 2021)
lrm pve (active, Tue Mar 30 17:01:38 2021)
service vm:100 (pve, started)
service vm:104 (pve, started)
service vm:103 (pve, ignored)
service vm:105 (pve, ignored)
full cluster state:
{
   "lrm_status" : {
      "pve" : {
         "mode" : "active",
         "results" : {
            "X3vJPWMZG9yQmKPTVHZTbg" : {
               "exit_code" : 0,
               "sid" : "vm:104",
               "state" : "started"
            },
            "uSkhQE6v+jLEHyIVZGxSXw" : {
               "exit_code" : 0,
               "sid" : "vm:100",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1617116498
      }
   },
   "manager_status" : {
      "master_node" : "pve",
      "node_status" : {
         "pve" : "online"
      },
      "service_status" : {
         "vm:100" : {
            "node" : "pve",
            "running" : 1,
            "state" : "started",
            "uid" : "sj1AxEho+X111fPZSZKFhw"
         },
         "vm:104" : {
            "node" : "pve",
            "running" : 1,
            "state" : "started",
            "uid" : "0xSd9FuSUOVWy5yd58jYwQ"
         }
      },
      "timestamp" : 1617116501
   },
   "quorum" : {
      "node" : "pve",
      "quorate" : "1"
   }
}
 
without a cluster, setting up HA will only complicate things (your node could be getting fenced, which would explain the reboots)
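
if fencing is what happened, the watchdog should show up in the logs right before the reset. a sketch of what to check, and how to take the guests out of HA management (using the sids from your ha-manager output; removing a resource does not stop the VM):

Code:
# watchdog activity from the boot before the reset (needs persistent journal)
journalctl -b -1 -u watchdog-mux
# remove the guests from HA management; the VMs themselves keep running
ha-manager remove vm:100
ha-manager remove vm:104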

just to be sure, could you also check systemctl list-timers?
 
In my case, HA is useful when my Windows VM does an upgrade. If I don't activate HA, sometimes the machine doesn't reboot.

systemctl list-timers:

Code:
NEXT                          LEFT     LAST                          PASSED   UNIT                         ACTIVATES
Fri 2021-04-09 17:27:00 CEST  8s left  Fri 2021-04-09 17:26:00 CEST  51s ago  pvesr.timer                  pvesr.service
Sat 2021-04-10 00:00:00 CEST  6h left  Fri 2021-04-09 00:00:00 CEST  17h ago  logrotate.timer              logrotate.service
Sat 2021-04-10 00:00:00 CEST  6h left  Fri 2021-04-09 00:00:00 CEST  17h ago  man-db.timer                 man-db.service
Sat 2021-04-10 00:25:14 CEST  6h left  Fri 2021-04-09 07:13:46 CEST  10h ago  apt-daily.timer              apt-daily.service
Sat 2021-04-10 05:34:52 CEST  12h left Fri 2021-04-09 02:53:46 CEST  14h ago  pve-daily-update.timer       pve-daily-update.service
Sat 2021-04-10 06:41:21 CEST  13h left Fri 2021-04-09 06:02:00 CEST  11h ago  apt-daily-upgrade.timer      apt-daily-upgrade.service
Sat 2021-04-10 17:21:00 CEST  23h left Fri 2021-04-09 17:21:00 CEST  5min ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service

7 timers listed.
Pass --all to see loaded but inactive timers, too.
 
I found the problem, but not the solution.

It's the I/O: if the SSD is put under too much load, the server dies and reboots.
If I copy a large file from one VM to another, or if I run CrystalDiskMark, I'm sure the server will fail.
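
A possible mitigation (not a fix for the underlying crash, just a sketch) is to cap the disk bandwidth of the busy VMs and to limit the ZFS ARC so it doesn't compete with the guests for RAM. The disk and storage names below (scsi0, local-zfs, vm-100-disk-0) are placeholders and have to be adjusted to the actual setup:

Code:
# cap reads and writes on the VM disk to ~200 MB/s
qm set 100 -scsi0 local-zfs:vm-100-disk-0,mbps_rd=200,mbps_wr=200
# limit the ZFS ARC to 8 GiB (value in bytes), then rebuild the initramfs and reboot
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u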