I'm running the latest Proxmox VE 7.4-13, with kernel 6.1.15-1 at the moment, on my QNAP TS-h973AX NAS. The device is built around an AMD Ryzen V1500B APU and has two Intel I225-V 2.5GbE network interfaces plus a Marvell (Aquantia) AQC107 10GbE interface.
I'm only using the 10GbE interface at the moment, although the other two are part of the same default bridge.
A peculiar issue has started happening: every few days I lose network connectivity for no apparent reason. The kernel and system logs show absolutely nothing; then corosync suddenly notices that it can't see my other node and begins complaining about the lack of quorum:
Code:
Jun 13 01:18:58 nas systemd[1]: Starting Daily PVE download activities...
Jun 13 01:19:00 nas pveupdate[680312]: <root@pam> starting task UPID:nas:000A617D:01CA3B6F:6487B5F4:aptupdate::root@pam:
Jun 13 01:19:02 nas pveupdate[680317]: update new package list: /var/lib/pve-manager/pkgupdates
Jun 13 01:19:05 nas pveupdate[680312]: <root@pam> end task UPID:nas:000A617D:01CA3B6F:6487B5F4:aptupdate::root@pam: OK
Jun 13 01:19:05 nas systemd[1]: pve-daily-update.service: Succeeded.
Jun 13 01:19:05 nas systemd[1]: Finished Daily PVE download activities.
Jun 13 01:19:05 nas systemd[1]: pve-daily-update.service: Consumed 6.238s CPU time.
Jun 13 02:09:29 nas pmxcfs[2256]: [dcdb] notice: data verification successful
Jun 13 02:45:13 nas pmxcfs[2256]: [status] notice: received log
Jun 13 02:45:17 nas pmxcfs[2256]: [status] notice: received log
Jun 13 03:09:29 nas pmxcfs[2256]: [dcdb] notice: data verification successful
Jun 13 03:10:01 nas CRON[695793]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 13 03:10:01 nas CRON[695794]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jun 13 03:10:01 nas CRON[695793]: pam_unix(cron:session): session closed for user root
Jun 13 03:33:35 nas corosync[2261]: [KNET ] link: host: 1 link: 0 is down
Jun 13 03:33:35 nas corosync[2261]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 13 03:33:35 nas corosync[2261]: [KNET ] host: host: 1 has no active links
Jun 13 03:33:37 nas corosync[2261]: [TOTEM ] Token has not been received in 2250 ms
Jun 13 03:33:37 nas corosync[2261]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 13 03:33:41 nas corosync[2261]: [QUORUM] Sync members[1]: 2
Jun 13 03:33:41 nas corosync[2261]: [QUORUM] Sync left[1]: 1
Jun 13 03:33:41 nas corosync[2261]: [TOTEM ] A new membership (2.3b5) was formed. Members left: 1
Jun 13 03:33:41 nas corosync[2261]: [TOTEM ] Failed to receive the leave message. failed: 1
Jun 13 03:33:41 nas corosync[2261]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 13 03:33:41 nas corosync[2261]: [QUORUM] Members[1]: 2
Jun 13 03:33:41 nas corosync[2261]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 13 03:33:41 nas pmxcfs[2256]: [dcdb] notice: members: 2/2256
Jun 13 03:33:41 nas pmxcfs[2256]: [status] notice: node lost quorum
Jun 13 03:33:41 nas pmxcfs[2256]: [status] notice: members: 2/2256
Jun 13 03:33:41 nas pmxcfs[2256]: [dcdb] crit: received write while not quorate - trigger resync
Jun 13 03:33:41 nas pmxcfs[2256]: [dcdb] crit: leaving CPG group
Jun 13 03:33:41 nas pve-ha-lrm[2325]: unable to write lrm status file - unable to open file '/etc/pve/nodes/nas/lrm_status.tmp.2325' - Permission denied
Jun 13 03:33:42 nas pmxcfs[2256]: [dcdb] notice: start cluster connection
Jun 13 03:33:42 nas pmxcfs[2256]: [dcdb] crit: cpg_join failed: 14
Jun 13 03:33:42 nas pmxcfs[2256]: [dcdb] crit: can't initialize service
Jun 13 03:33:48 nas pmxcfs[2256]: [dcdb] notice: members: 2/2256
Jun 13 03:33:48 nas pmxcfs[2256]: [dcdb] notice: all data is up to date
Jun 13 03:34:10 nas pvescheduler[699050]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 13 03:34:10 nas pvescheduler[699049]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 13 03:35:10 nas pvescheduler[699185]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 13 03:35:10 nas pvescheduler[699184]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
The network interface is reported as up and shows traffic, yet the NAS becomes unreachable and can't reach the outside world either. Bringing `vmbr0` down and then back up restores the host's network connection, but the VMs remain unreachable until a reboot.
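For reference, the manual workaround I apply is essentially the following (a sketch; I'm using the ifupdown2 commands that Proxmox VE ships by default, and `vmbr0` is the default bridge from my setup):

```shell
# Bounce the bridge to restore host connectivity after the hang.
# This only fixes the host; the VMs still need a reboot afterwards.
ifdown vmbr0 && ifup vmbr0

# Equivalent low-level alternative that bypasses ifupdown2 entirely:
ip link set vmbr0 down && ip link set vmbr0 up
```

Both variants behave the same for me: the host comes back immediately, the guests don't.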
This is incredibly annoying, as I can't leave the NAS running while I'm away, nor can I access it remotely once it goes down.
I've tried excluding the Intel interfaces from the bridge, as well as downgrading and upgrading the kernel, all to no avail.
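For context, this is roughly what my `/etc/network/interfaces` looks like with the Intel ports excluded from the bridge (interface names and addresses here are placeholders, not my exact values):

```
auto lo
iface lo inet loopback

# Intel I225-V 2.5GbE ports, currently not enslaved to the bridge
iface enp2s0f0 inet manual
iface enp2s0f1 inet manual

# AQC107 10GbE port
iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports enp1s0
        bridge-stp off
        bridge-fd 0
```

The hangs happen in exactly the same way whether the Intel ports are listed under `bridge-ports` or not.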
What exactly could be going wrong here?