Proxmox often doesn't start the machines after back-up

Czam · New Member · Apr 21, 2023
Currently running 4 virtual machines (all Debian) in a cluster with Virtual Environment 7.2-7 - Linux 5.15.39-2-pve #1 SMP PVE 5.15.39-2
After the backup runs in the night, the machines don't get started.... sometimes only one comes up, rarely all of them like today.... and sometimes they start on time without any problems.
The load of the cluster is pretty low:
Load average 1.82,2.20,2.32
RAM usage 34.23% (21.49 GiB of 62.78 GiB)
HD space 54.84% (501.46 GiB of 914.37 GiB)

Any idea is welcome, thank you!
 

Attachments

  • Conf.png
  • LoadCluster.png
Hi,

Can you check the Syslog around the backup time, especially when it finishes? You can use journalctl to filter for the specific time range, e.g.:

Bash:
journalctl --since "2023-04-21 00:00" --until "2023-04-21 08:45" > /tmp/Syslog.txt

You may have to change the time/date in the above command.
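If that output is too noisy, you can also narrow it down to the services involved. A rough sketch; the unit names below are the usual PVE ones, adjust them to your setup:

Bash:
# backup scheduler, guest autostart and cluster messages only, for the backup window
journalctl --since "2023-04-21 03:00" --until "2023-04-21 04:30" \
    -u pvescheduler -u pve-guests -u corosync > /tmp/Syslog-filtered.txt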
 
For each of the backups, the backup entry in Proxmox will have a log of what happened with each machine:
(screenshot: example backup task log)
The above is to give you an idea of where to look. Others here who know more will want that info; could you post the backup logs for the VMs which failed to start, and for ones which did start OK?
 
This is the task viewer output for one of the machines:

INFO: starting new backup job: vzdump 10224 --node proxmaster --quiet 1 --prune-backups 'keep-last=7' --compress zstd --mode snapshot --storage NFSSlave02 --mailnotification always
INFO: Starting Backup of VM 10224 (qemu)
INFO: Backup started at 2023-04-21 03:30:14
INFO: status = running
INFO: include disk 'sata0' 'local:10224/vm-10224-disk-0.raw' 20G
INFO: include disk 'sata1' 'local:10224/vm-10224-disk-1.raw' 35G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/NFSSlave02/dump/vzdump-qemu-10224-2023_04_21-03_30_13.vma.zst'
INFO: started backup task '836dea1c-1e6e-4c36-8222-7c900f68a0bc'
INFO: resuming VM again
INFO: 0% (174.0 MiB of 55.0 GiB) in 3s, read: 58.0 MiB/s, write: 49.9 MiB/s
# didn't include all the percentage lines because they're not relevant
INFO: 98% (53.9 GiB of 55.0 GiB) in 19m 24s, read: 48.3 MiB/s, write: 47.8 MiB/s
INFO: 99% (54.9 GiB of 55.0 GiB) in 19m 28s, read: 251.0 MiB/s, write: 45.3 MiB/s
INFO: 100% (55.0 GiB of 55.0 GiB) in 19m 29s, read: 99.9 MiB/s, write: 47.3 MiB/s
INFO: backup is sparse: 2.17 GiB (3%) total zero data
INFO: transferred 55.00 GiB in 1169 seconds (48.2 MiB/s)
INFO: archive file size: 18.31GB
INFO: prune older backups with retention: keep-last=7
INFO: removing backup 'NFSSlave02:backup/vzdump-qemu-10224-2023_04_12-03_30_00.vma.zst'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 10224 (00:19:32)
INFO: Backup finished at 2023-04-21 03:49:45
INFO: Backup job finished successfully
TASK OK


Logs from the cluster

Apr 21 03:30:14 proxmaster pvescheduler[3943208]: INFO: Starting Backup of VM 10224 (qemu)
Apr 21 03:39:21 proxmaster smartd[1278]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 65 to 67
Apr 21 03:39:21 proxmaster smartd[1278]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 123 to 125
Apr 21 03:39:21 proxmaster smartd[1278]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 65
Apr 21 03:46:49 proxmaster corosync[1782]: [KNET ] link: host: 3 link: 0 is down
Apr 21 03:46:49 proxmaster corosync[1782]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 21 03:46:49 proxmaster corosync[1782]: [KNET ] host: host: 3 has no active links
Apr 21 03:46:51 proxmaster corosync[1782]: [KNET ] rx: host: 3 link: 0 is up
Apr 21 03:46:51 proxmaster corosync[1782]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 21 03:46:52 proxmaster corosync[1782]: [TOTEM ] Token has not been received in 2737 ms
Apr 21 03:49:45 proxmaster pvescheduler[3943208]: INFO: Finished Backup of VM 10224 (00:19:32)
Apr 21 03:49:45 proxmaster pvescheduler[3943208]: INFO: Backup job finished successfully
Apr 21 04:00:02 proxmaster pmxcfs[1617]: [status] notice: received log
Apr 21 04:03:37 proxmaster kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.39-2-pve root=UUID=51bb2a4e-3a36-47f4-9cf7-ee67b0a99f2c ro quiet
Apr 21 04:03:37 proxmaster kernel: KERNEL supported cpus:
Apr 21 04:03:37 proxmaster kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.39-2-pve root=UUID=51bb2a4e-3a36-47f4-9cf7-ee67b0a99f2c ro quiet
Apr 21 04:03:37 proxmaster kernel: Unknown kernel command line parameters "BOOT_IMAGE=/boot/vmlinuz-5.15.39-2-pve", will be passed to user space.
Apr 21 04:03:37 proxmaster kernel: sd 9:0:0:0: [sdf] Media removed, stopped polling
Apr 21 04:03:37 proxmaster kernel: sd 9:0:0:1: [sdg] Media removed, stopped polling
Apr 21 04:03:37 proxmaster kernel: sd 9:0:0:2: [sdh] Media removed, stopped polling
Apr 21 04:03:37 proxmaster kernel: sd 9:0:0:0: [sdf] Attached SCSI removable disk
Apr 21 04:03:37 proxmaster kernel: sd 9:0:0:1: [sdg] Attached SCSI removable disk
Apr 21 04:03:37 proxmaster kernel: sd 9:0:0:2: [sdh] Attached SCSI removable disk
Apr 21 04:03:37 proxmaster kernel: async_tx: api initialized (async)
Apr 21 04:03:37 proxmaster kernel: PM: Image not found (code -22)
Apr 21 04:03:37 proxmaster kernel: EXT4-fs (md126p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Apr 21 04:03:37 proxmaster systemd: Inserted module 'autofs4'
Apr 21 04:03:37 proxmaster systemd: systemd 247.3-7 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
Apr 21 04:03:37 proxmaster systemd: Detected architecture x86-64.
Apr 21 04:03:37 proxmaster systemd: Set hostname to <proxmaster>.
Apr 21 04:03:37 proxmaster systemd: Queued start job for default target Graphical Interface.
Apr 21 04:03:37 proxmaster systemd: Created slice system-getty.slice.
Apr 21 04:03:37 proxmaster systemd: Created slice system-modprobe.slice.
Apr 21 04:03:37 proxmaster systemd: Created slice system-postfix.slice.
Apr 21 04:03:37 proxmaster systemd: Created slice system-zfs\x2dimport.slice.
Apr 21 04:03:37 proxmaster systemd: Created slice User and Session Slice.
Apr 21 04:03:37 proxmaster systemd: Started Dispatch Password Requests to Console Directory Watch.
Apr 21 04:03:37 proxmaster systemd: Started Forward Password Requests to Wall Directory Watch.
Apr 21 04:03:37 proxmaster systemd: Set up automount Arbitrary Executable File Formats File System Automount Point.
Apr 21 04:03:37 proxmaster systemd: Reached target ceph target allowing to start/stop all ceph-fuse@.service instances at once.
Apr 21 04:03:37 proxmaster systemd: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
Apr 21 04:03:37 proxmaster systemd: Reached target Local Encrypted Volumes.
Apr 21 04:03:37 proxmaster systemd: Reached target Paths.
Apr 21 04:03:37 proxmaster systemd: Reached target Slices.
Apr 21 04:03:37 proxmaster systemd: Listening on Device-mapper event daemon FIFOs.
Apr 21 04:03:37 proxmaster systemd: Listening on LVM2 poll daemon socket.
Apr 21 04:03:37 proxmaster systemd: Listening on RPCbind Server Activation Socket.
Apr 21 04:03:37 proxmaster systemd: Listening on Syslog Socket.
Apr 21 04:03:37 proxmaster systemd: Listening on fsck to fsckd communication Socket.
Apr 21 04:03:37 proxmaster systemd: Listening on initctl Compatibility Named Pipe.
Apr 21 04:03:37 proxmaster systemd: Listening on Journal Audit Socket.
Apr 21 04:03:37 proxmaster systemd: Listening on Journal Socket (/dev/log).
Apr 21 04:03:37 proxmaster systemd: Listening on Journal Socket.
Apr 21 04:03:37 proxmaster systemd: Listening on udev Control Socket.
Apr 21 04:03:37 proxmaster systemd: Listening on udev Kernel Socket.
Apr 21 04:03:37 proxmaster systemd: Mounting Huge Pages File System...
Apr 21 04:03:37 proxmaster systemd: Mounting POSIX Message Queue File System...
Apr 21 04:03:37 proxmaster systemd: Mounting NFSD configuration filesystem...
Apr 21 04:03:37 proxmaster systemd: Mounting RPC Pipe File System...
Apr 21 04:03:37 proxmaster systemd: Mounting Kernel Debug File System...
Apr 21 04:03:37 proxmaster systemd: Mounting Kernel Trace File System...
Apr 21 04:03:37 proxmaster systemd: Condition check resulted in Kernel Module supporting RPCSEC_GSS being skipped.
Apr 21 04:03:37 proxmaster systemd: Starting Set the console keyboard layout...
Apr 21 04:03:37 proxmaster systemd: Starting Create list of static device nodes for the current kernel...
Apr 21 04:03:37 proxmaster systemd: Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Apr 21 04:03:37 proxmaster systemd: Starting Load Kernel Module configfs...
Apr 21 04:03:37 proxmaster systemd: Starting Load Kernel Module drm...
Apr 21 04:03:37 proxmaster systemd: Starting Load Kernel Module fuse...
Apr 21 04:03:37 proxmaster systemd: Condition check resulted in Set Up Additional Binary Formats being skipped.
Apr 21 04:03:37 proxmaster systemd: Condition check resulted in File System Check on Root Device being skipped.
Apr 21 04:03:37 proxmaster systemd: Starting Journal Service...
Apr 21 04:03:37 proxmaster systemd: Starting Load Kernel Modules...
Apr 21 04:03:37 proxmaster systemd: Starting Remount Root and Kernel File Systems...
Apr 21 04:03:37 proxmaster systemd: Starting Coldplug All udev Devices...
Apr 21 04:03:37 proxmaster systemd: Mounted Huge Pages File System.
Apr 21 04:03:37 proxmaster systemd: Mounted POSIX Message Queue File System.
Apr 21 04:03:37 proxmaster systemd: Mounted Kernel Debug File System.
Apr 21 04:03:37 proxmaster systemd: Mounted Kernel Trace File System.
Apr 21 04:03:37 proxmaster systemd: Finished Create list of static device nodes for the current kernel.
Apr 21 04:03:37 proxmaster systemd: modprobe@configfs.service: Succeeded.
Apr 21 04:03:37 proxmaster systemd: Finished Load Kernel Module configfs.
Apr 21 04:03:37 proxmaster systemd: Finished Load Kernel Module fuse.
 

Attachments

  • Settings of Backup.png
If there is anything more I should post, please let me know, thank you!
Last night only two backups ran, both of the same machine (10224), at 00:00 and at 03:30, each one to a different disk (both successful).
 

Attachments

  • BackUP.png
Thank you for the outputs!

Code:
Apr 21 03:46:49 proxmaster corosync[1782]: [KNET ] link: host: 3 link: 0 is down
Apr 21 03:46:49 proxmaster corosync[1782]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 21 03:46:49 proxmaster corosync[1782]: [KNET ] host: host: 3 has no active links
Apr 21 03:46:51 proxmaster corosync[1782]: [KNET ] rx: host: 3 link: 0 is up
Apr 21 03:46:51 proxmaster corosync[1782]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 21 03:46:52 proxmaster corosync[1782]: [TOTEM ] Token has not been received in 2737 ms

The cluster did not have quorum; this might cause the issue. We recommend having a separate NIC for Corosync and/or adding an additional ring_X to the Corosync config [0].


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
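To get a quick look at the current quorum and link state on a node, the standard tools should be enough, for example:

Bash:
# cluster membership and quorum state
pvecm status
# knet link status per node and ring
corosync-cfgtool -s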
 
So, in theory, the VM in your example resumed:

Code:
INFO: resuming VM again

Initiate backup -> suspend VM -> prepare disk for snapshot -> snapshot -> redirect writes -> start VM.

Check the ones which failed. Do they all say "resuming VM again"? I'm wondering if it's a race condition, as that would produce the random behaviour you see. Regarding the link to the NFS share, what network backbone are you running, and is it dedicated? [Could something interfere?]
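A quick way to spot the odd ones out, assuming the default vzdump log location (adjust the path if your logs are stored elsewhere):

Bash:
# list per-VM backup logs that do NOT contain the resume message
grep -L "resuming VM again" /var/log/vzdump/qemu-*.log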

I notice you also have mail notification configured; is it sending the emails to you okay?

Edit: Moayad is onto it :)
 
There is another cluster with the same settings, more stressed and with less computing power and RAM (and the same disks).
All backups there work with "resuming VM again", without quorum, and without any hiccups or problems.
The other weird problem: about a month ago, on the problematic cluster, I tried to restore a machine while the cluster was under normal load, and the virtual machines started to hang, so I stopped it.
Other than backup/restore, all VMs are working without problems.
S.M.A.R.T. of the disks is also perfectly fine.
 
If it was a saturation problem, how would that prevent them from booting? The backups run during the night, and the second cluster (on the same shared local network) is never affected. If only one machine runs a backup in the night and it finishes successfully, why doesn't it boot? Yet on Saturday/Sunday, when all of them back up, they sometimes all boot without issues. If it was saturation, why does it happen even when only one machine is backing up, and sometimes not happen when all of them are?
 
What you posted shows "--mode snapshot", so the VMs aren't being shut down for backup. Are you sure something later on isn't what is causing them to be down?
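If you want to double-check where that mode is actually configured, on PVE 7.x the scheduled jobs and the vzdump defaults are usually kept here (older setups used /etc/pve/vzdump.cron instead of jobs.cfg):

Bash:
# scheduled backup jobs, including their --mode
cat /etc/pve/jobs.cfg
# node-local vzdump defaults
cat /etc/vzdump.conf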
 
So, what I see in the screenshots indicates a cluster, which is likely using Ceph storage. As Ceph storage is shared between the nodes, each system using Ceph for its disks will write to all nodes. That requires network traffic, which is why I've asked if you have a separate network for Ceph.
During a backup, a lot of data is transferred over the network; if this is also the network which carries the Ceph traffic, then the backup data could very well saturate the network used by Ceph, slowing all writes. The behaviour of systems suffering from that could be "interesting".
 
The disks are directly connected to the machine; there is no Ceph installed.
The disks for the other cluster are the same model, just another pair, again directly connected to that cluster.
 
Well, it's simple...
Two clusters, each one with a 1 TB SSD (Samsung QVO) for the OSes and a 4 TB WD Red disk for backups (SATA disks).
(The two clusters are two separate physical machines.)
They communicate with each other over a gigabit network.
 
All good :)

So, looking at this, is it possible to run a little experiment?
If your machines can handle it, dividing the traffic for transporting the backup data from the normal operational data may help.

So, one switch carries the backup data, and another carries the day-to-day data. It doesn't need to be sophisticated; it just has to function as a test.
At the very least, it would rule out a series of possible problems as being the cause of what you're seeing.

As an aside, there are also other things which can cause fun faults:
1) "flapping" ports, where the system uses automatic negotiation to determine the link speed, and it keeps jumping between 1Gb and 100Mb.
2) Traffic loss due to the switch going faulty; easy to test, just test with another.
3) faulty network cables; truly fun.

I'm not sure if you've done some quick tests on those; again, just to rule things out. A couple of quick checks are sketched below.
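For points 1) and 2), something like this gives a first impression; eno1 is just a placeholder for whichever NIC carries the backup traffic:

Bash:
# negotiated speed/duplex and link state (replace eno1 with your NIC)
ethtool eno1 | grep -E 'Speed|Duplex|Link detected'
# RX/TX error and drop counters
ip -s link show eno1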
 
What you posted shows "--mode snapshot", so the VMs aren't being shut down for backup. Are you sure something later on isn't what is causing them to be down?
Thank you for the carefully noted message! Now I think the machines are not getting shut down at all (I was told otherwise, and that a script is used for that). I think they just hang during a backup, and that they are normally running constantly....

The problem is why it's happening only on the first machine.
The second machine is configured the same way and its VMs never hang.
 
Update on the logs:

Apr 21 00:00:11 proxmaster pvescheduler[3900396]: INFO: starting new backup job: vzdump 10224 --prune-backups 'keep-last=7' --compress zstd --node proxmaster --quiet 1 --storage BackupMaster --mode stop --mailnotification always
Apr 21 00:00:11 proxmaster pvescheduler[3900396]: INFO: Starting Backup of VM 10224 (qemu)

So the machine gets stopped and then:
-- Reboot --
Apr 21 04:03:37 proxmaster kernel: Linux version 5.15.39-2-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.39-2 (Wed, 20 Jul 2022 17:22:19 +0200) ()
Apr 21 04:03:37 proxmaster kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.39-2-pve root=UUID=51bb2a4e-3a36-47f4-9cf7-ee67b0a99f2c ro quiet
Apr 21 04:03:37 proxmaster kernel: KERNEL supported cpus:
Apr 21 04:03:37 proxmaster kernel: Intel GenuineIntel
Apr 21 04:03:37 proxmaster kernel: AMD AuthenticAMD
Apr 21 04:03:37 proxmaster kernel: Hygon HygonGenuine
Apr 21 04:03:37 proxmaster kernel: Centaur CentaurHauls
Apr 21 04:03:37 proxmaster kernel: zhaoxin Shanghai
Apr 21 04:03:37 proxmaster kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Apr 21 04:03:37 proxmaster kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Apr 21 04:03:37 proxmaster kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Apr 21 04:03:37 proxmaster kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Apr 21 04:03:37 proxmaster kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Apr 21 04:03:37 proxmaster kernel: signal: max sigframe size: 1776
Apr 21 04:03:37 proxmaster kernel: BIOS-provided physical RAM map:
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x0000000000000000-0x00000000000927ff] usable
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x0000000000092800-0x000000000009ffff] reserved
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000791dafff] usable
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x00000000791db000-0x000000007999cfff] reserved
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x000000007999d000-0x00000000799f2fff] ACPI data
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x00000000799f3000-0x000000007a000fff] ACPI NVS
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x000000007a001000-0x000000008fffffff] reserved
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
Apr 21 04:03:37 proxmaster kernel: BIOS-e820: [mem 0x0000000100000000-0x000000107fffffff] usable
Apr 21 04:03:37 proxmaster kernel: NX (Execute Disable) protection: active
Apr 21 04:03:37 proxmaster kernel: SMBIOS 3.0 present.
Apr 21 04:03:37 proxmaster kernel: DMI: HPE CL3100/Lego, BIOS 2F4C2150 12/27/2016
Apr 21 04:03:37 proxmaster kernel: tsc: Fast TSC calibration using PIT
Apr 21 04:03:37 proxmaster kernel: tsc: Detected 2499.930 MHz processor

But it doesn't start, and I can't see in the logs why....
I'm checking the machines on the second cluster; so far they don't get shut down....
Will update you on the network experiment and the quick checks as soon as I can.
 
