Servers Desync

virtux

New Member
Sep 8, 2011
Hello,

I had a big problem on Friday: my Proxmox cluster went out of sync ("desync").

I found that server 3 was the cause (the cluster has 4 servers).

I could not control server 3 and had to force a shutdown.

What caused the failure on my server?

I have the logs from the day of the failure.

Sans titre.png
 
The few times I have had this same issue, it was caused by an IO problem on the affected node.

For example, the other day my USB backup disk failed during a backup.
Proxmox stopped syncing because of the stalled IO to the backup disk.
As in your case, I had to power cycle the node to recover.

/var/log/messages might have a clue.

In my case I see entries like this:
Code:
Jan 29 00:14:12 vm2 kernel: lost page write due to I/O error on dm-9
Jan 29 00:14:43 vm2 kernel: usb 1-6.4: reset high speed USB device using ehci_hcd and address 30
Jan 29 00:15:14 vm2 kernel: usb 1-6.4: reset high speed USB device using ehci_hcd and address 30
Jan 29 00:15:45 vm2 kernel: usb 1-6.4: reset high speed USB device using ehci_hcd and address 30
Jan 29 00:16:16 vm2 kernel: usb 1-6.4: reset high speed USB device using ehci_hcd and address 30
Jan 29 00:16:47 vm2 kernel: usb 1-6.4: reset high speed USB device using ehci_hcd and address 30
Jan 29 00:17:18 vm2 kernel: usb 1-6.4: reset high speed USB device using ehci_hcd and address 30
Jan 29 00:17:18 vm2 kernel: sd 34:0:0:0: [sde] Unhandled error code
Jan 29 00:17:18 vm2 kernel: sd 34:0:0:0: [sde] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Jan 29 00:17:18 vm2 kernel: sd 34:0:0:0: [sde] CDB: Write(10): 2a 00 02 09 25 c7 00 00 08 00
Jan 29 00:17:18 vm2 kernel: lost page write due to I/O error on dm-9
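
If the log is large, a small script can count the suspicious kernel lines instead of reading it by eye. This is only a rough sketch: the patterns are taken from the excerpt above, the function names are made up for illustration, and the log path is an assumption that may differ on your system.

```python
import re
from collections import Counter

# Patterns copied from the kernel messages quoted above; extend as needed.
IO_PATTERNS = {
    "io_error": re.compile(r"I/O error"),
    "usb_reset": re.compile(r"reset high speed USB device"),
    "scsi_error": re.compile(r"Unhandled error code|hostbyte=DID_"),
}


def scan_lines(lines):
    """Count how many lines match each IO-related pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in IO_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_file(path="/var/log/messages"):
    """Scan a log file; the default path is an assumption."""
    with open(path, errors="replace") as fh:
        return scan_lines(fh)
```

Calling `scan_file()` on the node that dropped out returns a count per pattern. A burst of matches around the time of the desync would point toward an IO problem, though an empty result does not rule one out.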
 
Hello,

Thank you for your answer, but I do not think this is the same problem.
We do not back up to a USB hard drive.
We use iSCSI targets hosted on the storage server.
 
A network glitch could cause IO to the iSCSI targets to stall, resulting in a similar situation.
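
One way to check for stalled IO while the node is still reachable is to look for processes in uninterruptible sleep (state D), which is where tasks waiting on storage usually sit. A minimal sketch, assuming a Linux /proc filesystem; the function names are hypothetical, and a few short-lived D-state processes are normal on a busy host.

```python
import os


def state_from_stat(stat_line):
    """Extract (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return comm, state


def blocked_processes(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                comm, state = state_from_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or its stat file was unreadable
        if state == "D":
            blocked.append((int(entry), comm))
    return blocked
```

If `blocked_processes()` keeps returning the same PIDs over several minutes, that would be consistent with IO that never completes.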

Every time I have suffered from Proxmox not syncing, the offending node had some sort of IO issue.

The USB issue is simply the most recent example.
Buggy WD RE3 firmware has caused this: numerous disks dropped from the RAID array, leaving no local disk IO.
Bad RAM has caused this: the local filesystem went into read-only mode.
Buggy RAID card firmware has caused this: the card locked up.

In all of these cases I had to power cycle to recover, because Linux will not shut down gracefully when it cannot flush dirty buffers to disk.
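
To see how much unflushed data the kernel is holding, you can read the Dirty and Writeback fields from /proc/meminfo. A small sketch with made-up function names; what counts as "too high" depends on the workload, so watch whether the values drain over time rather than looking at a single reading.

```python
def parse_meminfo(text):
    """Return {field: kB} for the Dirty and Writeback lines of /proc/meminfo."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            # Values are reported as "<number> kB".
            result[name] = int(rest.split()[0])
    return result


def read_dirty(path="/proc/meminfo"):
    """Read the current dirty and writeback page totals in kB."""
    with open(path) as fh:
        return parse_meminfo(fh.read())
```

Calling `read_dirty()` a few times, some seconds apart, shows whether the numbers fall. Values that stay put while nothing is writing would fit the picture of buffers that cannot be flushed.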

While the specific cause of the IO issue may differ, based on your description I suspect your problem was caused by some sort of IO issue.