IO delay troubleshooting

I am running a small cluster (3 x AMD EPYC 7313P, 448GB RAM, 10 x PCIe 4.0 enterprise NVMe drives in a RAIDZ2 array, PVE 8.2.3) with all three nodes set up using two bonded 10Gbps fiber connections for migrations and admin.

I am running into an issue where one of the three nodes intermittently shows really high IO Delay (30-50% at times), while the two identical servers running the same load sit at about 0.03% IO Delay pretty much all the time. I am fairly sure it isn't related to a specific VM, as I have moved all the VMs off that node and put different ones on it, and the issue persists.

The problem is especially bad during backups (to both a Proxmox Backup Server and an NFS storage server) and during migration of VMs. About 2 minutes after a task finishes, IO Delay drops back down again.

  1. Are there any tools available for diagnosing the issue?
  2. Is IO Delay strictly a measure of disk I/O speed, or could this be a network card issue?
  3. The disks are about 2 years old; the issue has only been noticed over the last few months. None of the disks report more than 9% wear.
  4. I suspect a heat throttling issue more than anything else, but I would greatly appreciate any words of wisdom.
  5. I am seeing some odd logs that look like the corosync network is going up and down, but I can't see any actual connection issues. Example below; a couple of link-status checks are sketched after the excerpt.
Aug 29 16:17:01 c51 CRON[418671]: pam_unix(cron:session): session closed for user root
Aug 29 16:17:12 c51 pmxcfs[5680]: [status] notice: received log
Aug 29 16:17:28 c51 corosync[5783]: [KNET ] link: host: 1 link: 0 is down
Aug 29 16:17:28 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:17:28 c51 corosync[5783]: [KNET ] host: host: 1 has no active links
Aug 29 16:17:30 c51 corosync[5783]: [KNET ] rx: host: 1 link: 0 is up
Aug 29 16:17:30 c51 corosync[5783]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Aug 29 16:17:30 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:17:30 c51 corosync[5783]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 29 16:19:28 c51 corosync[5783]: [KNET ] link: host: 1 link: 0 is down
Aug 29 16:19:28 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:19:28 c51 corosync[5783]: [KNET ] host: host: 1 has no active links
Aug 29 16:19:30 c51 corosync[5783]: [KNET ] rx: host: 1 link: 0 is up
Aug 29 16:19:30 c51 corosync[5783]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Aug 29 16:19:30 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:19:30 c51 corosync[5783]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 29 16:24:57 c51 corosync[5783]: [KNET ] link: host: 1 link: 0 is down
Aug 29 16:24:57 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:24:57 c51 corosync[5783]: [KNET ] host: host: 1 has no active links
Aug 29 16:24:59 c51 corosync[5783]: [KNET ] rx: host: 1 link: 0 is up
Aug 29 16:24:59 c51 corosync[5783]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Aug 29 16:24:59 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:24:59 c51 corosync[5783]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 29 16:26:14 c51 pmxcfs[5680]: [dcdb] notice: data verification successful
Aug 29 16:26:56 c51 corosync[5783]: [KNET ] link: host: 1 link: 0 is down
Aug 29 16:26:56 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:26:56 c51 corosync[5783]: [KNET ] host: host: 1 has no active links
Aug 29 16:26:58 c51 corosync[5783]: [KNET ] rx: host: 1 link: 0 is up
Aug 29 16:26:58 c51 corosync[5783]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Aug 29 16:26:58 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:26:59 c51 corosync[5783]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 29 16:26:59 c51 corosync[5783]: [TOTEM ] Token has not been received in 2737 ms
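
For reference, the link state can also be checked directly on the node; these are stock commands on a PVE host, sketched here for completeness (the bond interface name is an assumption):
  # quorum and membership as Proxmox sees it
  pvecm status
  # corosync's own per-link view (enabled/connected for each knet link)
  corosync-cfgtool -s
  # if the cluster link rides on the bonded interface, look for slave flaps there too
  cat /proc/net/bonding/bond0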
 
In my experience, high IO delay only occurs when intensive reads/writes hit HDDs (though that can be avoided by passing the SATA controller through to the VM).
 
Yeah, I have ten Kioxia CD6-5 disks in a ZFS array with a huge amount of RAM available. These things have a theoretical max speed of 5800MB/s with 700k IOPS per disk. It is weird too, because regardless of migration direction it is the same server that freaks out, the healthy servers never exceed 1% utilization, and usually don't even hit 0.01%.

Thank you for the input, I would appreciate any other thoughts!
 
The server next to it has double the workload and has 0.01% IO Delay currently. It seems like any kind of heavy read/write causes huge IO delays. I just can't tell if it is network or disk traffic causing it.
 

Attachments

  • Screenshot 2024-08-30 at 9.14.51 PM.png
Perhaps a faulty disk? What are you booting off? I had a slow SD card as boot media and corosync does not like that. You should use something like Prometheus or NetData to gather all the statistics; then you can see things like slow disks, network trouble, etc.
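
For example (a minimal sketch; both packages are in the standard Debian repositories and the ports mentioned are their defaults):
  # NetData: per-second disk, network and pressure metrics with a local web UI on port 19999
  apt install netdata
  # or, for a Prometheus-based setup, expose node metrics on port 9100 and scrape them centrally
  apt install prometheus-node-exporter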
 
I suspect a heat throttling issue more than anything else, but I would greatly appreciate any words of wisdom.
You have a point, it could be heat. What is the ambient temperature, and what temperature is the hardware reporting? Check your IPMI.

NVMe drives are pretty reliable and fail rarely, unlike earlier SATA drives.

Can you remove a few drives to test, or is it running in production? Can you put the VMs on the other servers and free this one up to test the parts one by one?
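
Something along these lines should show it (assuming ipmitool is installed on the node; the BMC web UI exposes the same readings):
  # all temperature sensors the BMC reports (ambient, CPU, VRM, backplane, ...)
  ipmitool sdr type temperature
  # full sensor list with thresholds, useful for spotting anything running close to a limit
  ipmitool sensor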
 
Perhaps a faulty disk? What are you booting off? I had a slow SD card as boot media and corosync does not like that. You should use something like Prometheus or NetData to gather all the statistics; then you can see things like slow disks, network trouble, etc.

Both servers are running off of a pair of Toshiba XG6 KXG60ZNV512G 512GB NVMe M.2 drives, PCIe 4.0, in a basic ZFS striped RAID. Stupid fast, and only about 20% full. Also reporting healthy SMART status.

I will take a look at Prometheus and NetData to see what I can get and report back.
 
You have a point, it could be heat. What is the ambient temperature, and what temperature is the hardware reporting? Check your IPMI.

NVMe drives are pretty reliable and fail rarely, unlike earlier SATA drives.

Can you remove a few drives to test, or is it running in production? Can you put the VMs on the other servers and free this one up to test the parts one by one?

All drives on all servers are reporting between 32° and 35° Celsius, and no drive stands out. These drives are U.2 PCIe 4.0 hot-swappable drives and have relatively light read/write cycles on them as far as I can tell.
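
(For reference, per-drive temperature and wear can be pulled with nvme-cli; a sketch, device names will differ:)
  # composite temperature, percentage used (wear) and error counters for one drive
  nvme smart-log /dev/nvme0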

I will clear that server off and see what kind of isolation I can do.

All three servers are in production unfortunately, so I can't be too aggressive in ripping things up. It does help that all three machines are identical in configuration and setup, down to which SFP+ modules are used, the arrangement of ports, the OS version, and reboot time.

I will check into the IPMI.
 
I pulled temperature data from IPMI; the baseline when not migrating looks like this.
 

Attachments

  • Screenshot 2024-08-31 at 7.45.23 PM.png
  • Screenshot 2024-08-31 at 7.45.36 PM.png
Yeah, I have ten Kioxia CD6-5 disks in a ZFS array with a huge amount of RAM available. These things have a theoretical max speed of 5800MB/s with 700k IOPS per disk. It is weird too, because regardless of migration direction it is the same server that freaks out, the healthy servers never exceed 1% utilization, and usually don't even hit 0.01%.

Thank you for the input, I would appreciate any other thoughts!
You should not underestimate the write delay with RAIDZ2. I have a file server with the same NVMe drives (but only six of them) in RAIDZ1 and do not achieve more than 300 MiB/s when writing.
If you run backups and a VM then starts writing heavily, this is exactly what could happen. Next time it happens, can you check whether a VM is writing a lot?
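
Something like this on the host should show it (just a sketch; iotop and sysstat may need to be installed first):
  # accumulate per-process I/O and show only processes actually doing I/O;
  # QEMU guests appear as kvm processes with "-id <vmid>" in the command line
  iotop -oPa
  # or sample per-process disk throughput once per second
  pidstat -d 1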
 
First of all, never put your boot drives on a stripe. Second, in any proper server system temperature is rarely a problem; if you had fan issues, you would have CPU issues too.

You say 30% on the array, which could mean one disk sitting at 100%. The SMART health status is not very telling if you haven't hit the manufacturer limits yet; it could also be a link issue, etc. Pull the full details with smartctl and see if there are any issues; on enterprise drives it should report checksum errors both at the bus and the data layer. If you are doing a test, check the disk busy stats with iostat and see if one disk stands out, then replace it.
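
Concretely, something along these lines (just a sketch; device names will differ):
  # full SMART/health output for one NVMe, including the error log and media errors
  smartctl -a /dev/nvme0
  # extended per-device stats every second; watch await and %util for one disk standing out
  iostat -xm 1
  # per-vdev / per-disk latency inside the ZFS pool itself
  zpool iostat -vl 1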

Have you engaged with the hardware vendor at all?
 
First of all, never put your boot drives on a stripe. Second, in any proper server system temperature is rarely a problem; if you had fan issues, you would have CPU issues too.

You say 30% on the array, which could mean one disk sitting at 100%. The SMART health status is not very telling if you haven't hit the manufacturer limits yet; it could also be a link issue, etc. Pull the full details with smartctl and see if there are any issues; on enterprise drives it should report checksum errors both at the bus and the data layer. If you are doing a test, check the disk busy stats with iostat and see if one disk stands out, then replace it.

Have you engaged with the hardware vendor at all?

I misspoke. It is a mirrored pair, not a striped one. I'm checking out iostat and smartctl now.

The hardware is Gigabyte, and they aren't exactly stellar in my experience. My warranty ended 60 days ago too.
 
You should not underestimate the write delay with RAIDZ2. I have a file server with the same NVMe drives (but only six of them) in RAIDZ1 and do not achieve more than 300 MiB/s when writing.
If you run backups and a VM then starts writing heavily, this is exactly what could happen. Next time it happens, can you check whether a VM is writing a lot?
Any idea what would cause the issue to follow the device, not the VM?
 
Any idea what would cause the issue to follow the device, not the VM?
Often the VMs are just where the problems are noticed first.
So if there are problems in VMs, it is worth keeping an eye on the host at the same time. Slow storage or network problems in particular quickly show up inside a VM.
 
