Node random reboot at around specific time

scania471

Hi everyone!

I currently have a 3-node cluster in which nodes 1-2 are physical servers and the 3rd one is a VM running on unRAID.
(Proxmox: v8.1.4 [16GB of RAM, 2 HDDs in RAID1 using ZFS, i3-4150 in both node1-2]; unRAID: 6.12.4 [16GB of 48GB, 4 cores of a 6-core i5-8400, 2 VM disks on different HDDs in RAID1 using ZFS])
For some reason node3 sometimes reboots between 2:00 am and 2:20 am. It does not reboot every night, so it's kinda random.
The node comes back online in 1-3 mins but because of this, all VMs are migrated every time.

Maybe it's related to performance issues during backups? There is one backup that starts at 1:30 am (1 VM) and another at 2:00 am (1 VM); the 1:30 one finishes before the next starts. Both VMs are on that 3rd node, so maybe it's some kind of performance bottleneck? The assigned resources are the same as on the physical servers, with the 2 VM disks on different drives of the host server. It can't be out of memory, as that should kill the VM permanently rather than reboot it, and it has plenty of memory anyway, both the node and the host server.

In the Proxmox log I can't see anything related to it; 3 minutes before, there is a replication timeout, which usually happens when a backup is running. The 2:00 am backup finished in 7 minutes and the node rebooted 5 minutes later without any sign, just a
Code:
-- Reboot --
in the log.

Plex starts its scheduled tasks at 2:00 am on unRAID, so I moved them to 3:00 am, but I don't think it's related. I'm planning to move the node to a physical server in the future, but I'm not there yet.

How can I troubleshoot this, and where should I start?

Thank you in advance!
 
The potentially valuable part of the log file is missing though.


Have you considered looking at the log of the host of the VM that is a cluster node?

NB Did you consider ditching the 2+1 setup and keeping that unRAID VM as a QDevice instead?
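
(For reference, in case you go the QDevice route: the setup is roughly the following, where 10.11.12.164 is just a placeholder for whatever IP the voter VM gets.)
Code:
# on the voter VM (any Debian-ish box that is not a PVE node):
apt install corosync-qnetd

# on every PVE cluster node:
apt install corosync-qdevice

# then, from any one PVE node:
pvecm qdevice setup 10.11.12.164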
 
What does journalctl -u corosync say? Please also share:

* /etc/pve/corosync.conf
* /etc/network/interfaces

Are your servers in sync regarding NTP? If not, please configure an NTP server.
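
For a quick check, something along these lines should do on each node (assuming chrony, which is the PVE 8 default):
Code:
timedatectl status     # expect "System clock synchronized: yes"
chronyc tracking       # current offset and stratum
chronyc sources -v     # which time sources are actually being used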
 
What does journalctl -u corosync say? Please also share:

* /etc/pve/corosync.conf
* /etc/network/interfaces

Are your servers in sync regarding NTP? If not, please configure an NTP server.
journalctl -u corosync
Code:
...
Feb 12 01:05:17 anton4 corosync[1091]:   [KNET  ] pmtud: Global data MTU changed to: 1397
-- Boot 823bcd6fe9c24bccb24881b2ac4946d0 --
Feb 14 02:14:37 anton4 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Feb 14 02:14:38 anton4 corosync[1115]:   [MAIN  ] Corosync Cluster Engine  starting up
Feb 14 02:14:38 anton4 corosync[1115]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Feb 14 02:14:38 anton4 corosync[1115]:   [TOTEM ] Initializing transport (Kronosnet).
Feb 14 02:14:39 anton4 corosync[1115]:   [TOTEM ] totemknet initialized
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] pmtud: MTU manually set to: 0
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Feb 14 02:14:39 anton4 corosync[1115]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Feb 14 02:14:39 anton4 corosync[1115]:   [QB    ] server name: cmap
Feb 14 02:14:39 anton4 corosync[1115]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Feb 14 02:14:39 anton4 corosync[1115]:   [QB    ] server name: cfg
Feb 14 02:14:39 anton4 corosync[1115]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 14 02:14:39 anton4 corosync[1115]:   [QB    ] server name: cpg
Feb 14 02:14:39 anton4 corosync[1115]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Feb 14 02:14:39 anton4 corosync[1115]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Feb 14 02:14:39 anton4 corosync[1115]:   [WD    ] Watchdog not enabled by configuration
Feb 14 02:14:39 anton4 corosync[1115]:   [WD    ] resource load_15min missing a recovery key.
Feb 14 02:14:39 anton4 corosync[1115]:   [WD    ] resource memory_used missing a recovery key.
Feb 14 02:14:39 anton4 corosync[1115]:   [WD    ] no resources configured.
Feb 14 02:14:39 anton4 corosync[1115]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Feb 14 02:14:39 anton4 corosync[1115]:   [QUORUM] Using quorum provider corosync_votequorum
Feb 14 02:14:39 anton4 corosync[1115]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 14 02:14:39 anton4 corosync[1115]:   [QB    ] server name: votequorum
Feb 14 02:14:39 anton4 corosync[1115]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 14 02:14:39 anton4 corosync[1115]:   [QB    ] server name: quorum
Feb 14 02:14:39 anton4 corosync[1115]:   [TOTEM ] Configuring link 0
Feb 14 02:14:39 anton4 corosync[1115]:   [TOTEM ] Configured link number 0: local addr: 10.11.12.163, port=5405
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 1 has no active links
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 1 has no active links
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 1 has no active links
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 2 has no active links
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 2 has no active links
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] host: host: 2 has no active links
Feb 14 02:14:39 anton4 corosync[1115]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Feb 14 02:14:39 anton4 corosync[1115]:   [QUORUM] Sync members[1]: 3
Feb 14 02:14:39 anton4 corosync[1115]:   [QUORUM] Sync joined[1]: 3
Feb 14 02:14:39 anton4 corosync[1115]:   [TOTEM ] A new membership (3.2039) was formed. Members joined: 3
Feb 14 02:14:39 anton4 corosync[1115]:   [QUORUM] Members[1]: 3
Feb 14 02:14:39 anton4 corosync[1115]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 14 02:14:39 anton4 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Feb 14 02:14:41 anton4 corosync[1115]:   [KNET  ] rx: host: 2 link: 0 is up
Feb 14 02:14:41 anton4 corosync[1115]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Feb 14 02:14:41 anton4 corosync[1115]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 14 02:14:41 anton4 corosync[1115]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Feb 14 02:14:41 anton4 corosync[1115]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 14 02:14:41 anton4 corosync[1115]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 14 02:14:41 anton4 corosync[1115]:   [QUORUM] Sync members[3]: 1 2 3
Feb 14 02:14:41 anton4 corosync[1115]:   [QUORUM] Sync joined[2]: 1 2
Feb 14 02:14:41 anton4 corosync[1115]:   [TOTEM ] A new membership (1.203d) was formed. Members joined: 1 2
Feb 14 02:14:41 anton4 corosync[1115]:   [QUORUM] This node is within the primary component and will provide service.
Feb 14 02:14:41 anton4 corosync[1115]:   [QUORUM] Members[3]: 1 2 3
Feb 14 02:14:41 anton4 corosync[1115]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 14 02:14:42 anton4 corosync[1115]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Feb 14 02:14:42 anton4 corosync[1115]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Feb 14 02:14:42 anton4 corosync[1115]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Uptime Kuma sent the offline notification at 02:13:48 am; the node came back at 02:14:52 am today.

/etc/pve/corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: anton1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.11.12.160
  }
  node {
    name: anton2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.11.12.161
  }
  node {
    name: anton4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.11.12.163
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Anton
  config_version: 13
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

/etc/network/interfaces
Code:
auto lo
iface lo inet loopback

iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.11.12.163/24
        gateway 10.11.12.1
        bridge-ports enp1s0
        bridge-stp off
        bridge-fd 0

Note that in this case, anton4 is the 3rd node referred to in the post. Previously it was a 4-node cluster, but since the 3rd one got removed, I did not rename node 4 or change its IP address to move it to 3.

NTP seems OK; all 3 servers show the correct time, and systemctl status chrony.service does not show errors.
 
The potentially valuable part of the log file is missing though.



Have you considered looking at the log of the host of the VM that is a cluster node?

NB Did you consider ditching the 2+1 setup and keeping that unRAID VM as a QDevice instead?
That is what really annoys me: it does not log anything, as if the OS was killed instantly.
I already checked the logs on unRAID; unfortunately they did not show anything useful. The VM's log shows the start commands plus qxl_send_events: spice-server bug: guest stopped, ignoring (no timestamp), but I think that is related to VNC, which actually works perfectly.

Previously the VM node was not running anything; it only acted as a 3rd voter for the cluster. But since I got more RAM in the unRAID server, and nodes 1-2 are already maxed out with 16GB, I "upgraded" that 3rd node to be a proper cluster node to balance the load and to be able to run everything in case one node dies. With only 2 nodes I don't have enough RAM to run every service, plus ZFS really consumes quite a lot. Maybe I will look into Ceph in the future.
 
That is what really annoys me: it does not log anything, as if the OS was killed instantly.

But you have not included the logs; it's not just corosync that could be read from there. Is there something you want to redact? Feel free.

Or at least show all the relevant PVE services like:

journalctl -b -1 --since "yyyy-mm-dd hh:mm:ss" -u pvedaemon -u pveproxy -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux

And for corosync especially it would be good to see the logs from all three nodes for the same time before the reboot.

Show at least the last hour before the reboot; if it's too big, attach it as a file.
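
Something like this should give you a file you can attach directly (the filename is just an example):
Code:
journalctl -b -1 --since "yyyy-mm-dd hh:mm:ss" -u pvedaemon -u pveproxy -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux > /root/anton4-before-reboot.txt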

I already checked the logs on unRAID; unfortunately they did not show anything useful. The VM's log shows the start commands plus qxl_send_events: spice-server bug: guest stopped, ignoring (no timestamp), but I think that is related to VNC, which actually works perfectly.

Well it's hard to have any alternative opinion from no logs posted.

Previously the VM node was not running anything; it only acted as a 3rd voter for the cluster. But since I got more RAM in the unRAID server, and nodes 1-2 are already maxed out with 16GB, I "upgraded" that 3rd node to be a proper cluster node to balance the load and to be able to run everything in case one node dies.

Was the node rebooting then already? Do you run any HA groups to avoid having the third node overwhelmed?
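
(If not, a restricted group that keeps the important guests on the two physical nodes would look roughly like this; the group name and VMID are made up, adjust to your setup.)
Code:
ha-manager groupadd prefer-physical --nodes anton1,anton2 --restricted 1
ha-manager set vm:100 --group prefer-physical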

With only 2 nodes I don't have enough RAM to run every service, plus ZFS really consumes quite a lot. Maybe I will look into Ceph in the future.

You should not run so many services as HA, and Ceph is not going to improve the situation in this scenario.
 
But you have not included the logs; it's not just corosync that could be read from there. Is there something you want to redact? Feel free.

Or at least show all the relevant PVE services like:

journalctl -b -1 --since "yyyy-mm-dd hh:mm:ss" -u pvedaemon -u pveproxy -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux

And for corosync especially it would be good to see the logs from all three nodes for the same time before the reboot.

Show at least the last hour before the reboot; if it's too big, attach it as a file.

I attached the results from nodes 1-2-3 for the following command: journalctl --since "2024-02-13 22:00:00" -u pvedaemon -u pveproxy -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux

Well it's hard to have any alternative opinion from no logs posted.
The logs around that time from the host server contain these lines:
Code:
Feb 14 01:58:35 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 01:58:57 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:02:02 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:04:43 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:08:56 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:09:42 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:12:01 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:13:33 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:15:28 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:15:28 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:16:19 New-Whonnock dnsmasq[15480]: reading /etc/resolv.conf
Feb 14 02:16:19 New-Whonnock dnsmasq[15480]: using nameserver 1.1.1.1#53
Feb 14 02:16:19 New-Whonnock dnsmasq[15480]: using nameserver 8.8.4.4#53
Feb 14 02:16:19 New-Whonnock dnsmasq[15480]: using nameserver 8.8.8.8#53
Feb 14 02:16:19 New-Whonnock dnsmasq[15480]: using nameserver 2001:4860:4860::8888#53
Feb 14 02:16:19 New-Whonnock dnsmasq[15480]: using nameserver 2606:4700:4700::1111#53
Feb 14 02:16:19 New-Whonnock dnsmasq[15480]: using nameserver ::#53
Feb 14 02:16:19 New-Whonnock dnsmasq[15480]: using nameserver 2001:4c4c:21df:1c00:5054:ff:fe17:3fdf#53
Feb 14 02:16:51 New-Whonnock dnsmasq[15480]: reading /etc/resolv.conf
Feb 14 02:16:51 New-Whonnock dnsmasq[15480]: using nameserver 1.1.1.1#53
Feb 14 02:16:51 New-Whonnock dnsmasq[15480]: using nameserver 8.8.4.4#53
Feb 14 02:16:51 New-Whonnock dnsmasq[15480]: using nameserver 8.8.8.8#53
Feb 14 02:16:51 New-Whonnock dnsmasq[15480]: using nameserver 2001:4860:4860::8888#53
Feb 14 02:16:51 New-Whonnock dnsmasq[15480]: using nameserver 2606:4700:4700::1111#53
Feb 14 02:16:51 New-Whonnock dnsmasq[15480]: using nameserver ::#53
Feb 14 02:16:51 New-Whonnock dnsmasq[15480]: using nameserver 2001:4c4c:21df:1c00:ccaa:36d7:4caf:99b5#53
Feb 14 02:17:24 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:17:46 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f
Feb 14 02:17:47 New-Whonnock kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f

Before and after this it just spams kernel: i915 0000:00:02.0: [drm] *ERROR* Unexpected DP dual mode adaptor ID 7f for hours, so I won't include it here as it does not seem to be helpful. (This error is connected to the iGPU according to this post.)
Was the node rebooting then already? Do you run any HA groups to avoid having the third node overwhelmed?
Can't really remember if it had problems back then but I will try to migrate every VM from node3 and just wait.
What do you mean by HA groups to avoid having the third node overwhelmed? Node3 has the least amount of stress in terms of guests.
I have 2 groups: one contains node3 (this group has 3 VMs), the other contains only node1-2 (actually this one can be removed, as these VMs are no longer required to be in HA).
 

Attachments

  • merged.txt
    26.1 KB · Views: 1
So the anton4 link clearly goes down at 2:12, but there's nothing in this log from the machine itself since Feb 14 01:52:06 ... can you check the full log for the "missing period"? If you have IPs/names to redact, you can just sanitize it. Also it would be good to see how that node booted at that particular time (to see e.g. how services started, including kernel messages).

EDIT:

What hardware is the host? How is the network setup done between the three physically in terms of switching and routing?

Also the VMs do not have trouble auto-migrating out once the node is considered fenced, so that's not the issue. I wonder if anything is replicating* e.g. to anton4 during that time. What about your other VMs on the host of anton4? Do they keep running just fine? Do you have any monitoring of the load and e.g. RAM usage on the host? And do you know if the VMs inside anton4 keep running till the last minute, i.e. 2:12?

*I realised you are using ... shared storage for all of the three from the unraid? What's the storage setup of the cluster?
 
It can't be out of memory, as that should kill the VM permanently rather than reboot it, and it has plenty of memory anyway, both the node and the host server.

You are talking about the unraid, correct? How does the free memory situation look inside the node when it's hosting the vms and prior to the reboot?
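
If you have shell access shortly before the usual reboot window, something like this from inside anton4 would show it (arc_summary should be there with the ZFS tools on PVE):
Code:
free -h                    # overall RAM / swap usage inside the node
arc_summary | head -n 25   # how much the ZFS ARC is currently holding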
 
Can't really remember if it had problems back then but I will try to migrate every VM from node3 and just wait.

Can you check in the logs from the times when you used to run it with no VMs on anton4 whether it had this sort of reboots?
 
So the anton4 link clearly goes down at 2:12, but there's nothing in this log from the machine itself since Feb 14 01:52:06 ... can you check the full log for the "missing period"? If you have IPs/names to redact, you can just sanitize it. Also it would be good to see how that node booted at that particular time (to see e.g. how services started, including kernel messages).
I attached the whole syslog (if I understand correctly this is what you mean under full log) from 0am to 3am, about 2047 lines.

How is the network setup done between the three physically in terms of switching and routing?
All 3 servers are connected with the built-in NIC to a 16 port TP-Link switch, fixed IPs everywhere, nothing extreme.

Also the VMs do not have trouble auto-migrating out once the node is considered fenced, so that's not the issue. I wonder if anything is replicating* e.g. to anton4 during that time. What about your other VMs on the host of anton4? Do they keep running just fine?
The cluster replicates 4 VMs every 15 mins, the last message I see in syslog about replication is at 2:09 and it got timed out. The 2:00 backup finished at 2:07, so I think the replication should have finished without problem 2 mins later. It usually times out only if a backup is running.
I only run 2 VMs on anton4 because of this whole problem, both of them are migrated to the other servers and working fine after that (wordpress vm and pihole vm).
Do you have any monitoring of the load and e.g. RAM usage on the host? And do you know if the VMs inside anton4 keep running till the last minute, i.e. 2:12?
Fortunately yes: it shows a stable 12.2GB of usage for hours before the reboot, and the last data from the guests arrived at 2:12, so I guess yes. Uptime Kuma sent the offline notification at 2:13:12 and it checks every 60s.
I realised you are using ... shared storage for all of the three from the unraid? What's the storage setup of the cluster?
The VMs are stored on the local ZFS arrays on the nodes (RAID 1, 2 hard drives for OS + VMs in the physical servers; node2 has an additional drive assigned to a VM for big archives). The shared storage is an SMB share running on unRAID; all VMs are backed up to this share at different times during the night.
You are talking about the unraid, correct? How does the free memory situation look inside the node when it's hosting the vms and prior to the reboot?
Yup. I barely go above 90% on unRAID, and a stable 12.2GB on anton4 as I said before; 20 mins before the reboot it decreased by 200MB, which I think can be related to the backup that runs at 2:00 every night.
 

Attachments

  • syslog.txt
    189.1 KB · Views: 1
Can you check in the logs from the times when you used to run it with no VMs on anton4 whether it had this sort of reboots?
I went back in time in the Grafana dashboard: I started using anton4 in August and began putting VMs on it at the beginning of September. Scrolling through the syslog from that period, it did not reboot at night before it was running VMs, but these reboots started once VMs were on it.
If it helps, the reboots don't happen every night (same thing back then), for example, it happened today, feb 10, feb 8, feb 4, jan 31, jan 24, ...
 
I attached the whole syslog (if I understand correctly this is what you mean under full log) from 0am to 3am, about 2047 lines.

Just to be clear, do you use syslog to capture the logs somewhere outside of the machine over the network? See [1] and [2] so that we are on the same page. I suspect this is the local output of the journalctl command, i.e. what it managed to flush onto the drive.

[1] https://www.loggly.com/ultimate-guide/linux-logging-with-systemd/
[2] https://www.loggly.com/ultimate-guide/centralizing-with-syslog/
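
If you ever want to ship the logs off the box so a hard reset can't eat them, a minimal rsyslog forwarding rule would be roughly this (assuming rsyslog is installed; 10.11.12.50 is a hypothetical collector):
Code:
# /etc/rsyslog.d/90-forward.conf
*.*  @@10.11.12.50:514    # @@ = TCP, a single @ = UDP

# then:
systemctl restart rsyslog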

The issue I want to ask about right away is e.g.

Feb 14 01:50:35 anton4 postfix/smtp[949416]: connect to example.com[public_ip]:25: No route to host

Is this a normal misconfiguration on your side, or did the machine lose outside routing, or maybe your ISP router is rebooting?

All 3 servers are connected with the built-in NIC to a 16 port TP-Link switch, fixed IPs everywhere, nothing extreme.

How is that unraid configured? Bridging? Do you have any VLANs? Topology would help.

The cluster replicates 4 VMs every 15 mins, the last message I see in syslog about replication is at 2:09 and it got timed out. The 2:00 backup finished at 2:07, so I think the replication should have finished without problem 2 mins later. It usually times out only if a backup is running.

I don't think it should be timing out, do you have these timeouts regularly?

Feb 14 01:30:42 anton4 pvestatd[1114]: unable to activate storage 'new-whonnock' - directory '/mnt/pve/new-whonnock' does not exist or is unreachable

Do you have an idea what this is about?

I only run 2 VMs on anton4 because of this whole problem, both of them are migrated to the other servers and working fine after that (wordpress vm and pihole vm).

That is possible to see from the log: the remaining 2 nodes recover those services just fine, and there's nothing really horrible in terms of cluster health; you just have a weirdly behaving node.

The VMs are stored on the local ZFS arrays on the nodes (RAID 1, 2 hard drives for OS + VMs in the physical servers; node2 has an additional drive assigned to a VM for big archives). The shared storage is an SMB share running on unRAID; all VMs are backed up to this share at different times during the night.

So the nodes are all ZFS for the VMs, and SMB is just for backup. How come the replications are taking so long? There's the possibility that a replication could saturate the network and corosync traffic may suffer, but that is not what was seen in the logs.

I went back in time in the Grafana dashboard: I started using anton4 in August and began putting VMs on it at the beginning of September. Scrolling through the syslog from that period, it did not reboot at night before it was running VMs, but these reboots started once VMs were on it.
If it helps, the reboots don't happen every night (same thing back then), for example, it happened today, feb 10, feb 8, feb 4, jan 31, jan 24, ...

What you can do is use the journalctl command to see those. You can start with journalctl --list-boots; then, if you are running it mostly 24/7, every boot in the period when you had load on it is suspicious. You can then check the end of the log of a particular boot (or rather the one before it) with journalctl -b <offset/ID>. Unless there is a clean shutdown sequence at the end, it was not a clean reboot/shutdown.
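
For example (-1 simply means "the boot before the current one"; --list-boots gives you the offsets/IDs to plug in):
Code:
journalctl --list-boots
journalctl -b -1 -e                   # jump to the end of the previous boot's log
journalctl -b -1 -n 200 --no-pager    # or just its last 200 lines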

NB You may want to redact your attached log some more; there were some email addresses in it. I do not think it's a privacy issue (given how all the bug reports and the devel mailing list work), but just letting you know. Anyhow, no one can download attachments from the forum unless logged in.
 
Just to be clear, do you use syslog to capture the logs somewhere outside of the machine over the network? See [1] and [2] so that we are on the same page. I suspect this is the local output of the journalctl command, i.e. what it managed to flush onto the drive.
No, I don't store the logs anywhere else (that is a future project). I meant the Syslog menu under System in the Proxmox GUI, where I can select the timestamp.

The issue I want to ask about right away is e.g.

Feb 14 01:50:35 anton4 postfix/smtp[949416]: connect to example.com[public_ip]:25: No route to host

Is this a normal misconfiguration on your side, or did the machine lose outside routing, or maybe your ISP router is rebooting?
I run my own mail server (I know, not the best idea, and I'm planning to move away from it) and, as far as I know, port 25 is blocked by the ISP (that is why I use Sendgrid for sending emails). I do get the emails from the servers in my inbox, so it seems to be working. Now that you mention it, I will check whether I ever got an email from anton4; I regularly get emails from anton1-2, can't remember about 4, but I will check.
(I also have problems with the router, but that is a different story.)
I don't think it should be timing out, do you have these timeouts regularly?
All the time when a backup is running. I read somewhere on this forum that with hard drives (not the newest models) and ZFS it's kind of expected; also, the I/O delay is on the Moon during backups. When the backups are done, no timeouts.

Feb 14 01:30:42 anton4 pvestatd[1114]: unable to activate storage 'new-whonnock' - directory '/mnt/pve/new-whonnock' does not exist or is unreachable

Do you have an idea what this is about?
That is actually interesting; it can see the shared storage and can back up to it, so that seems to be a new thing to troubleshoot.
So the nodes are all ZFS for the VMs, and SMB is just for backup. How come the replications are taking so long? There's the possibility that a replication could saturate the network and corosync traffic may suffer, but that is not what was seen in the logs.
There is a cache drive in the unRAID server, but the backups purposely don't use it; they write to the array directly. I checked one task and it transferred 16.2GB in 315s (52.7MB/s).
I could make it faster by using the cache drive for faster writes, but I think that would kill it in no time, and since it runs at night, there is no rush. :)

What you can do is use the journalctl command to see those. You can start with journalctl --list-boots; then, if you are running it mostly 24/7, every boot in the period when you had load on it is suspicious. You can then check the end of the log of a particular boot (or rather the one before it) with journalctl -b <offset/ID>. Unless there is a clean shutdown sequence at the end, it was not a clean reboot/shutdown.
I checked the boots, and every time it randomly rebooted at night it was running a VM beforehand. I randomly picked a date and attached the logs. I see these SSH sessions from the other servers; this is expected, right? Kinda creepy to see this many logins.

NB You may want to redact your attached log some more; there were some email addresses in it. I do not think it's a privacy issue (given how all the bug reports and the devel mailing list work), but just letting you know. Anyhow, no one can download attachments from the forum unless logged in.
Understood! Will check for them in the future. Thank you for the heads up!

Also thank you for your help in this situation! :)
 

Attachments

  • nov22.txt
    22.1 KB · Views: 2
I was really most curious about this one. :) Especially now that you also mentioned "router issues".
Sorry, missed that one :)
No VLANS, bridged network adapter, custom assigned IP/MAC address.
(The router issues are not related to the servers; I just can't log in to the router, it returns a login_retry_error when I try. I will need to talk to the ISP, but I don't want to reset it for now, as the latest note about open ports is 3 years old :)))) )
 
There is a cache drive in the unRAID server, but the backups purposely don't use it; they write to the array directly. I checked one task and it transferred 16.2GB in 315s (52.7MB/s).

Are you saying a replication job every 15min has 16GB worth of data snapshot to replicate?

The /mnt/pve/new-whonnock is SMB for backups?

Sorry, missed that one :)
No VLANS, bridged network adapter, custom assigned IP/MAC address.

Can you run arping -b 10.11.12.163 from some of the "normal" nodes and post?
 
Are you saying a replication job every 15min has 16GB worth of data snapshot to replicate?

The /mnt/pve/new-whonnock is SMB for backups?
Sorry, I misread your message; I meant that the 16GB was for the backup, not the replication. The replication moves about 50-150MB of data between the nodes.

Yes, the New-Whonnock is an SMB share for backups.
Can you run arping -b 10.11.12.163 from some of the "normal" nodes and post?
It times out from both node1-2. Without -b it works; the time is between 300-500 usec. What is the reason to set the host IP to 255.255.255.255?
 
Sorry, I misread your message; I meant that the 16GB was for the backup, not the replication. The replication moves about 50-150MB of data between the nodes.

Sorry too, I was reacting to the unRAID part, where you had meanwhile clarified this was for backup only. So the replications ... why do they time out with 50-150MB payload when they have 50M/s network with two VMs only?

It times out from both node1-2. Without -b it works; the time is between 300-500 usec. What is the reason to set the host IP to 255.255.255.255?

Can you run it with -0? I just want to know there's not more than one machine that thinks it has that IP address on the segment.
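
As a cross-check, you could also compare the MAC that the other nodes have cached for that IP against anton4's bridge MAC; if they differ, something else is answering for it (vmbr0 taken from your interfaces file):
Code:
# on anton1 / anton2:
ip neigh show 10.11.12.163

# on anton4, for comparison:
ip -br link show vmbr0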
 
why do they time out with 50-150MB payload when they have 50M/s network with two VMs only?
This happens on all nodes, not just the problematic one. I think it's because of the drives; during a backup on a node, even regular tasks become really slow. Updating 2 Linux VMs on one node? Naah, forget it, it will take ages. :) During a backup, the I/O delay is usually 60-80%.

Can you run it with -0? I just want to know there's not more than one machine that thinks it has that IP address on the segment.
-0 works fine. -b does not.
 
