Proxmox Node Lockup - VM Lost connection - Reboots Following Lockup

FatJoeGames

New Member
Dec 2, 2022
Good Evening all, I hope you're all well!

I am looking for some assistance with an issue I have recently come across. One of my Proxmox nodes keeps locking up: it shows as offline in the cluster, but the summary page shows it is still running. The lockups happen when a program inside the VM (Pterodactyl Wings) performs a backup, but I have been able to rule out Pterodactyl itself as the cause, since the problem persists on a fresh Proxmox install with a freshly built virtual machine.


Hardware:
Dell OptiPlex 5050 SFF - Node02
  • i5-7500
  • 32GB of DDR4 @ 2444MHz
  • 1x 240GB SanDisk SATA SSD
  • 1x 500GB WD Blue HDD
Dell OptiPlex 5050 SFF - Node01
  • i5-7500
  • 16GB of DDR4 @ 2444MHz
  • 1x 240GB SanDisk SATA SSD
  • 1x 500GB WD Blue HDD

Proxmox Version & Others:
  • Linux 5.15.74-1-pve #1 SMP PVE 5.15.74-1
  • pve-manager/7.3-3/c3928077
  • Proxmox VE 7.3-3

Configuration:
  • Storage is set up as a RAID0 ZFS pool
  • The ZFS ARC is capped at a max of 2GB and a min of 1GB (set as in the sketch below)
  • Both nodes have been updated to the latest patches
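
For reference, the ARC limits were set the usual way via a module option. This is a minimal sketch, assuming the standard /etc/modprobe.d method (values are in bytes):

# /etc/modprobe.d/zfs.conf - persistent ARC limits, values in bytes
options zfs zfs_arc_min=1073741824
options zfs zfs_arc_max=2147483648

# if root is on ZFS, refresh the initramfs so the change applies at boot
update-initramfs -u -k all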

Virtual Machine Configuration:
  • Processor: 1 socket, 4 cores
  • RAM: 24GB
  • BIOS: SeaBIOS
  • Machine: i440fx
  • SCSI Controller: VirtIO SCSI
  • Hard Drive: 200GB on ZFS - iothread=1


The issue:

As mentioned in the introduction, one node in my cluster keeps locking up. When it happens I cannot access it to reboot it; it eventually reboots itself and then spends a few minutes configuring storage. Once it has come back, the machine works as expected with no issues. The VM runs a game server which performs backups at 01:00:00 every day, and I have found that once this backup kicks off, both the VM and the node become uncontactable - still running, but frozen.

[Screenshots attached: 1669943343123.png, 1669943355014.png]

This can also be seen in a Zabbix graph, where the host goes offline and does not come back; it also shows random increases and decreases in drive space:
[Screenshot attached: 1669944389607.png]


After it recovers, I normally have to start the VM or the game server again, and it then runs without fault until the next backup, when it again goes into a state where it is still running but cannot be reached. I was able to capture the logs from just before this happened, but they don't show anything of concern or anything that would indicate an issue:
[Screenshot attached: 1669943467468.png]

I have since replicated this by manually running a backup inside the VM, which takes both the host and the VM offline. I have not been able to get it to log anything further; I have checked through the logs under /var/log/ and none of them show anything at that time.
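
For anyone wanting to repeat the check, this is roughly what I was digging through after each freeze. It is only a sketch and assumes persistent journalling, so the previous boot's journal survives the reboot:

# list boots, then pull warnings and above from the previous boot around the backup window
journalctl --list-boots
journalctl -b -1 -p warning --since "00:50" --until "01:30"

# kernel-only view, filtered for hung-task and ZFS messages
journalctl -b -1 -k | grep -iE "blocked for more than|txg|zfs"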

Following this, I rebuilt the VM and started it again, then used SFTP to transfer the old files back to the server. Lo and behold, the same issue appeared as soon as I gave the machine some work to do, such as moving or transferring files. Sending from the old machine before I rebuilt it had caused no issues, but the new one receiving the files took the host and VM straight offline again.


Again I had no logs whatsoever for this, so I suspected corruption and rebuilt the node from scratch. There were no issues at all until I transferred the files; it lasted a lot longer this time, but it did hit this state again. As I had the machine on my tech bench, I was able to hit Enter on the console, and it showed the following:
Dec 02 00:32:59 SRVDEBPROXNOD02 systemd[1]: Stopped Journal Service.
Dec 02 00:32:59 SRVDEBPROXNOD02 systemd[1]: Starting Journal Service...
Dec 02 00:32:59 SRVDEBPROXNOD02 systemd-journald[249823]: Journal started
Dec 02 00:32:59 SRVDEBPROXNOD02 systemd-journald[249823]: System Journal (/var/log/journal/df051a9f50f046179f11e7e5158513a6) is 1.4M, max 4.0G, 3.9G free.
Dec 02 00:32:07 SRVDEBPROXNOD02 systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
Dec 02 00:29:20 SRVDEBPROXNOD02 pve-firewall[1375]: firewall update time (14.015 seconds)
Dec 02 00:32:07 SRVDEBPROXNOD02 systemd[1]: systemd-journald.service: Killing process 551 (systemd-journal) with signal SIGABRT.
Dec 02 00:32:58 SRVDEBPROXNOD02 pve-firewall[1375]: firewall update time (217.217 seconds)
Dec 02 00:32:58 SRVDEBPROXNOD02 pveproxy[1474]: proxy detected vanished client connection
It also had a lot of ZFS errors, which I will paste below:
Dec 02 00:25:55 SRVDEBPROXNOD02 pvestatd[1381]: status update time (120.726 seconds)
Dec 02 00:25:55 SRVDEBPROXNOD02 pve-firewall[1375]: firewall update time (120.891 seconds)
Dec 02 00:25:56 SRVDEBPROXNOD02 CRON[214726]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 02 00:25:56 SRVDEBPROXNOD02 CRON[214779]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi)
Dec 02 00:25:56 SRVDEBPROXNOD02 CRON[214726]: pam_unix(cron:session): session closed for user root
Dec 02 00:25:59 SRVDEBPROXNOD02 pve-ha-crm[1465]: loop take too long (123 seconds)
Dec 02 00:26:00 SRVDEBPROXNOD02 pve-ha-lrm[1479]: loop take too long (128 seconds)
Dec 02 00:26:16 SRVDEBPROXNOD02 pmxcfs[1275]: [dcdb] notice: data verification successful
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: INFO: task zvol:255 blocked for more than 120 seconds.
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: Tainted: P O 5.15.74-1-pve #1
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: task:zvol state:D stack: 0 pid: 255 ppid: 2 flags:0x00004000
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: Call Trace:
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: <TASK>
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: __schedule+0x34e/0x1740
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: ? kmem_cache_alloc+0x1ab/0x2f0
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: schedule+0x69/0x110
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: io_schedule+0x46/0x80
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: cv_wait_common+0xae/0x140 [spl]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: ? wait_woken+0x70/0x70
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: __cv_wait_io+0x18/0x20 [spl]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: txg_wait_synced_impl+0xda/0x130 [zfs]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: txg_wait_synced+0x10/0x50 [zfs]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: dmu_tx_wait+0x1ee/0x410 [zfs]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: dmu_tx_assign+0x170/0x4f0 [zfs]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: zvol_write+0x184/0x4b0 [zfs]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: zvol_write_task+0x13/0x30 [zfs]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: taskq_thread+0x29c/0x4d0 [spl]
Dec 02 00:28:42 SRVDEBPROXNOD02 kernel: ? wake_up_q+0x90/0x90
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: ? zvol_write+0x4b0/0x4b0 [zfs]
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: ? taskq_thread_spawn+0x60/0x60 [spl]
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: kthread+0x127/0x150
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: ? set_kthread_struct+0x50/0x50
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: ret_from_fork+0x1f/0x30
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: </TASK>
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: INFO: task txg_sync:377 blocked for more than 120 seconds.
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: Tainted: P O 5.15.74-1-pve #1
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: task:txg_sync state:D stack: 0 pid: 377 ppid: 2 flags:0x00004000
Dec 02 00:28:43 SRVDEBPROXNOD02 kernel: Call Trace:

I have also received RRD update errors for the ZFS storage within the local PVE log, where it complains that updates must be at least one second apart; these can be seen in the attached log.

I have uploaded a file named "2.12.2022.log.txt" with an export of the log. Following this, I was able to get the VM to come back to life and it ran again; the cluster showed the node as offline and I could not contact it for a good five minutes before it came back online. Going offline killed my SFTP transfer, which appears to be what allowed it to recover, but I am at a real loss as to why this is freezing and taking the entire host offline.

Would anyone have any suggestions on what I can try? I can't really move the VM to my other node due to limited resources, though I can have a go if needed, but this node is meant to handle a lot of the heavy lifting.

Any help is much appreciated. =)
 
Hi again,

So after some further work, I found that the desktop/server still had Intel Rapid Storage (RAID mode) enabled. I have turned this off and there has definitely been an improvement.

Not enough, though, for me to say this was the whole issue. I previously had a ZFS ARC max size of 2GB with a min of 1GB, but I have now raised the max to 5GB, and I have started my test again by copying 22GB across to the VM.
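
For completeness, the bump to 5GB was done roughly like this (the value is in bytes; the modprobe.d line also needs updating for it to survive a reboot):

# raise the ARC ceiling to 5 GiB on the fly, takes effect immediately
echo 5368709120 > /sys/module/zfs/parameters/zfs_arc_max

# check the current ARC size and targets afterwards
grep -E "^(size|c_min|c_max)" /proc/spl/kstat/zfs/arcstats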

With that, I can keep the machine online, but IO delay peaks at 90%, drops to 23%, and repeats. This is still a lot better than with the 2GB ARC, which sat at a constant 93% and ended up freezing the node again.
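
The 90% to 23% sawtooth is easy to watch from the node itself; these are the standard tools I had running while the copy went on (iostat comes from the sysstat package, which may need installing):

# per-disk utilisation and latency, refreshed every 2 seconds
iostat -xm 2

# the same picture from ZFS's side, per vdev, including latency columns
zpool iostat -vly 2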

This level of IO load still feels wrong, though: I have another four VMs on a machine with less RAM and a slower SSD that never passes 2% IO delay. They aren't heavy writers, but they don't fault during their internal backups either.

I can see from the ARC summary that the ARC is maxed out most of the time while uploading these files:
[Screenshot attached: 1670013341986.png]
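
(The screenshot above is the arc_summary output; the same numbers can be watched live with arcstat, and both tools ship with the ZFS utilities:)

# one-off ARC report
arc_summary

# live view of ARC size versus target and hit rate, every 2 seconds
arcstat 2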

I am also not seeing outright errors on the write side, but then again 8.71M write throughput does seem low for a SATA SSD:
[Screenshot attached: 1670013392910.png]

On the plus side, I haven't had any crashes or issues so far.

Other than the fact that it continues to eat all of the RAM and hit IO issues; I can even see it in the SFTP client, where the copy stalls and then retries. I have never seen that before!

Besides this, I previously ran a single node that hosted everything, including this game server, and it handled updates, backups and everything else without fault. Since moving to these new machines I just have a headache with ZFS. Can anyone see anything in the above that points to an issue with the drive or the setup?

Many thanks in advance! Cheers!
 

Attachments: 1670013483559.png
Further information points to this being a bad drive. It appears the SanDisk SSD Plus 240GB has well-known issues; I currently have 154 bad blocks, a count that has increased rapidly over the last few months:
[Screenshot attached: 1670016967842.png]
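
The bad-block figure comes from SMART, roughly as below; this assumes smartmontools is installed and the SSD shows up as /dev/sda on your box, so adjust the device accordingly:

# full SMART report - watch Reallocated_Sector_Ct and the error log at the bottom
smartctl -a /dev/sda

# run a short self-test, then read the result a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda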

I am also getting a lot of ZFS Deadman faults:
[Screenshot attached: 1670017007338.png]
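
The deadman reports show up in the ZFS event log, alongside whatever zpool status has counted against the device; for anyone chasing the same thing:

# recent ZFS events, deadman (hung I/O) reports included
zpool events -v

# pool health plus per-device read/write/checksum error counters
zpool status -v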

Based on this, I am going to replace the drive. I suspect the reason the old server didn't fault is that its ZFS pool consisted of an Intel SSD alongside another SSD, which more than likely masked the problem enough that I didn't notice.

If anyone has any further advice, please let me know.
 
Thought it was worth a final update for tonight. I found that IO wait was not dropping after trying to copy a 22GB file directly to the Proxmox node; the transfer was estimated to take 5 hours, and the drive showed a huge IO wait for no apparent reason. I have spent a good amount of time testing reads and writes, and I can't see any explanation other than the drive failing.
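
The read/write testing was nothing fancy, roughly the fio runs below. It is only a sketch: the /rpool/fio.test path is just an example, so point it at a dataset on the affected pool.

# sustained sequential writes, with a periodic fsync so it is not just filling the ARC
fio --name=seqwrite --filename=/rpool/fio.test --size=4G --rw=write \
    --bs=1M --ioengine=psync --fsync=16 --group_reporting

# random 4k sync writes - the pattern cheap consumer SSDs struggle with most
fio --name=randsync --filename=/rpool/fio.test --size=1G --rw=randwrite \
    --bs=4k --ioengine=psync --fsync=1 --runtime=60 --time_based

rm /rpool/fio.test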

As a final attempt (something I didn't want to do, since it is my other, main stable node), I tried the same thing on the node with an Intel SSDSC2BB240G6 in it, which has similar read/write speeds to the SanDisk. There was no issue at all, albeit that node has 7GB of ARC cache, is already running 5 other servers and normally sits at 5 - 20% IO delay. I copied the same files across; it took 5 minutes and never went above 30% IO wait.

So from my research I'd have to say: don't use the cheap SanDisk drives. I hadn't had issues with this one for the last 2 years, so either it has failed, or now that it is isolated on its own host and out of a larger ZFS pool it is showing its true colours.

I have another Intel SSD arriving tomorrow and will report back on whether it changes the outcome. Who knows, this might help someone in the future who falls into the same issue I did :)
 
