I'm out of ideas and looking for assistance with this problem; I'm not sure what else I can test or which logs I can check for data to solve it.
The setup is a new cluster of 7 nodes: Ryzen 5950X ASRock Rack servers, 128 GB ECC RAM, each node with 4x 2 TB SSDs in RAID 10 and a 2x 10G LACP bond to Brocade switches. There is no shared storage and no HA ZFS setup; each node has its own ZFS pool (NodeX-Tank) for local VM storage.
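To be precise about "RAID 10": on ZFS this is a stripe of mirrors. Each node's pool was created roughly like this (pool name follows the NodeX-Tank scheme; the device paths are placeholders, not my actual disk IDs):

```shell
# Sketch of one node's local pool: two 2-way mirror vdevs, striped together.
# Replace the by-id paths with the real SSD device IDs on that node.
zpool create Node3-Tank \
  mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B \
  mirror /dev/disk/by-id/ssd-C /dev/disk/by-id/ssd-D
```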
When I online-migrate a VM from one node to another, it sometimes works; other times the source node just hard-reboots, and other times the target node (yes, the node, not the VM) reboots.
I ran MemTest86 on each node for 48 hours with no errors. I ran iperf between VMs on all the nodes with no crashes, and I can get 9 Gb/s between VMs on different nodes.
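For reference, the iperf test between VMs was along these lines (the IP is a placeholder; I used iperf3 here, adjust if you use classic iperf):

```shell
# In a VM on the destination node: start the receiver.
iperf3 -s

# In a VM on a different node: send for 60 seconds to the receiver VM's IP.
iperf3 -c <receiver-vm-ip> -t 60
```

This sustained ~9 Gb/s across the 2x 10G LACP bond without triggering any reboot, which is why I don't suspect raw network throughput itself.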
I am running Proxmox VE with the enterprise repository, updated to the latest version on each node.
When I start the migration, it gets to varying points of progress, and then the source or target node just reboots.
Output from migration window:
Bash:
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 40s
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 41s
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 42s
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 43s
drive-scsi0: transferred 24.8 GiB of 100.0 GiB (24.77%) in 1m 44s
drive-scsi0: Cancelling block job
drive-scsi0: Done.
2022-08-14 20:08:26 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled
2022-08-14 20:08:26 aborting phase 2 - cleanup resources
2022-08-14 20:08:26 migrate_cancel
client_loop: send disconnect: Broken pipe
Use of uninitialized value $res in string eq at /usr/share/perl5/PVE/Tunnel.pm line 110, <GEN845> line 2.
Use of uninitialized value $res in concatenation (.) or string at /usr/share/perl5/PVE/Tunnel.pm line 113, <GEN845> line 2.
2022-08-14 20:08:28 ERROR: tunnel replied '' to command 'quit'
2022-08-14 20:08:28 ERROR: migration finished with problems (duration 00:01:49)
TASK ERROR: migration problems
Output from syslog on source node (node3), showing node4 crashed and corosync doing its thing:
Bash:
Aug 14 20:06:38 node3 pvedaemon[3175]: <root@pam> starting task UPID:node3:000058D2:0001DF45:62F98E0E:qmigrate:101:root@pam:
Aug 14 20:06:39 node3 pmxcfs[3011]: [status] notice: received log
Aug 14 20:06:41 node3 pmxcfs[3011]: [status] notice: received log
Aug 14 20:07:08 node3 corosync[3108]: [KNET ] link: host: 4 link: 0 is down
Aug 14 20:07:08 node3 corosync[3108]: [KNET ] link: host: 4 link: 1 is down
Aug 14 20:07:08 node3 corosync[3108]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 14 20:07:08 node3 corosync[3108]: [KNET ] host: host: 4 has no active links
Aug 14 20:07:08 node3 corosync[3108]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 14 20:07:08 node3 corosync[3108]: [KNET ] host: host: 4 has no active links
Aug 14 20:07:11 node3 corosync[3108]: [TOTEM ] Token has not been received in 4687 ms
Aug 14 20:07:20 node3 corosync[3108]: [QUORUM] Sync members[6]: 1 2 3 5 6 7
Aug 14 20:07:20 node3 corosync[3108]: [QUORUM] Sync left[1]: 4
Aug 14 20:07:20 node3 corosync[3108]: [TOTEM ] A new membership (1.3f6) was formed. Members left: 4
Aug 14 20:07:20 node3 corosync[3108]: [TOTEM ] Failed to receive the leave message. failed: 4
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: members: 1/2948, 2/3207, 3/3011, 5/2905, 6/3247, 7/3316
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: starting data syncronisation
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: members: 1/2948, 2/3207, 3/3011, 5/2905, 6/3247, 7/3316
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: starting data syncronisation
Aug 14 20:07:20 node3 corosync[3108]: [QUORUM] Members[6]: 1 2 3 5 6 7
Aug 14 20:07:20 node3 corosync[3108]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: received sync request (epoch 1/2948/0000000A)
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: received sync request (epoch 1/2948/0000000A)
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: received all states
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: leader is 1/2948
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: synced members: 1/2948, 2/3207, 3/3011, 5/2905, 6/3247, 7/3316
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: all data is up to date
Aug 14 20:07:20 node3 pmxcfs[3011]: [dcdb] notice: dfsm_deliver_queue: queue length 6
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: received all states
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: all data is up to date
Aug 14 20:07:20 node3 pmxcfs[3011]: [status] notice: dfsm_deliver_queue: queue length 51
Output from syslog on destination node (node4):
Bash:
Aug 14 20:06:38 node4 pmxcfs[3176]: [status] notice: received log
Aug 14 20:06:38 node4 systemd[1]: Started Session 7 of user root.
Aug 14 20:06:39 node4 systemd[1]: session-7.scope: Succeeded.
Aug 14 20:06:39 node4 systemd[1]: Started Session 8 of user root.
Aug 14 20:06:39 node4 systemd[1]: session-8.scope: Succeeded.
Aug 14 20:06:39 node4 systemd[1]: Started Session 9 of user root.
Aug 14 20:06:39 node4 systemd[1]: session-9.scope: Succeeded.
Aug 14 20:06:39 node4 systemd[1]: Started Session 10 of user root.
Aug 14 20:06:39 node4 qm[25178]: <root@pam> starting task UPID:node4:0000625B:0002D30F:62F98E0F:qmstart:101:root@pam:
Aug 14 20:06:39 node4 qm[25179]: start VM 101: UPID:node4:0000625B:0002D30F:62F98E0F:qmstart:101:root@pam:
Aug 14 20:06:40 node4 systemd[1]: Created slice qemu.slice.
Aug 14 20:06:40 node4 systemd[1]: Started 101.scope.
Aug 14 20:06:40 node4 systemd-udevd[25239]: Using default interface naming scheme 'v247'.
Aug 14 20:06:40 node4 systemd-udevd[25239]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 20:06:41 node4 kernel: [ 1853.566721] device tap101i0 entered promiscuous mode
Aug 14 20:06:41 node4 systemd-udevd[25239]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 20:06:41 node4 systemd-udevd[25240]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 20:06:41 node4 systemd-udevd[25240]: Using default interface naming scheme 'v247'.
Aug 14 20:06:41 node4 systemd-udevd[25239]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 20:06:41 node4 kernel: [ 1853.591041] vmbr601: port 2(fwpr101p0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.591062] vmbr601: port 2(fwpr101p0) entered disabled state
Aug 14 20:06:41 node4 kernel: [ 1853.591110] device fwpr101p0 entered promiscuous mode
Aug 14 20:06:41 node4 kernel: [ 1853.591155] vmbr601: port 2(fwpr101p0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.591167] vmbr601: port 2(fwpr101p0) entered forwarding state
Aug 14 20:06:41 node4 kernel: [ 1853.593100] fwbr101i0: port 1(fwln101i0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.593121] fwbr101i0: port 1(fwln101i0) entered disabled state
Aug 14 20:06:41 node4 kernel: [ 1853.593626] device fwln101i0 entered promiscuous mode
Aug 14 20:06:41 node4 kernel: [ 1853.593774] fwbr101i0: port 1(fwln101i0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.593786] fwbr101i0: port 1(fwln101i0) entered forwarding state
Aug 14 20:06:41 node4 kernel: [ 1853.595971] fwbr101i0: port 2(tap101i0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.595991] fwbr101i0: port 2(tap101i0) entered disabled state
Aug 14 20:06:41 node4 kernel: [ 1853.596065] fwbr101i0: port 2(tap101i0) entered blocking state
Aug 14 20:06:41 node4 kernel: [ 1853.596076] fwbr101i0: port 2(tap101i0) entered forwarding state
Aug 14 20:06:41 node4 qm[25178]: <root@pam> end task UPID:node4:0000625B:0002D30F:62F98E0F:qmstart:101:root@pam: OK
Aug 14 20:06:41 node4 systemd[1]: session-10.scope: Succeeded.
Aug 14 20:06:41 node4 systemd[1]: Started Session 11 of user root.