IO delay troubleshooting

I am running a small cluster (3 x AMD EPYC 7313P, 448GB RAM, 10 x PCIe 4.0 enterprise NVMe drives in a RAIDZ2 array, PVE 8.2.3) with all three nodes set up using two bonded 10Gbps fiber connections for migrations and admin.

I am running into an issue where one of the three nodes intermittently shows really high IO Delay (30-50% at times), while the two identical servers running the same load sit at about 0.03% IO Delay pretty much all the time. I am fairly sure it isn't related to a specific VM, as I have moved all the VMs off that node and put different ones on it, and the issue persists.

The problem is especially bad during backups (to both a Proxmox Backup Server and an NFS storage server) and during migration of VMs. About 2 minutes after a task finishes, IO Delay drops back down again.

  1. Are there any tools available for diagnosing the issue?
  2. Is IO Delay strictly a measure of disk I/O speed, or could this be a network card issue?
  3. The disks are about 2 years old; the issue has only been noticed over the last few months. None of the disks report more than 9% wear.
  4. I suspect a heat throttling issue more than anything else, but I would greatly appreciate any words of wisdom.
  5. I am seeing some odd logs that look like the corosync network is going up and down, but I can't see any actual connection issues. Example below; a couple of link-status checks are sketched after the excerpt.
Aug 29 16:17:01 c51 CRON[418671]: pam_unix(cron:session): session closed for user root
Aug 29 16:17:12 c51 pmxcfs[5680]: [status] notice: received log
Aug 29 16:17:28 c51 corosync[5783]: [KNET ] link: host: 1 link: 0 is down
Aug 29 16:17:28 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:17:28 c51 corosync[5783]: [KNET ] host: host: 1 has no active links
Aug 29 16:17:30 c51 corosync[5783]: [KNET ] rx: host: 1 link: 0 is up
Aug 29 16:17:30 c51 corosync[5783]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Aug 29 16:17:30 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:17:30 c51 corosync[5783]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 29 16:19:28 c51 corosync[5783]: [KNET ] link: host: 1 link: 0 is down
Aug 29 16:19:28 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:19:28 c51 corosync[5783]: [KNET ] host: host: 1 has no active links
Aug 29 16:19:30 c51 corosync[5783]: [KNET ] rx: host: 1 link: 0 is up
Aug 29 16:19:30 c51 corosync[5783]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Aug 29 16:19:30 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:19:30 c51 corosync[5783]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 29 16:24:57 c51 corosync[5783]: [KNET ] link: host: 1 link: 0 is down
Aug 29 16:24:57 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:24:57 c51 corosync[5783]: [KNET ] host: host: 1 has no active links
Aug 29 16:24:59 c51 corosync[5783]: [KNET ] rx: host: 1 link: 0 is up
Aug 29 16:24:59 c51 corosync[5783]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Aug 29 16:24:59 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:24:59 c51 corosync[5783]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 29 16:26:14 c51 pmxcfs[5680]: [dcdb] notice: data verification successful
Aug 29 16:26:56 c51 corosync[5783]: [KNET ] link: host: 1 link: 0 is down
Aug 29 16:26:56 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:26:56 c51 corosync[5783]: [KNET ] host: host: 1 has no active links
Aug 29 16:26:58 c51 corosync[5783]: [KNET ] rx: host: 1 link: 0 is up
Aug 29 16:26:58 c51 corosync[5783]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Aug 29 16:26:58 c51 corosync[5783]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 29 16:26:59 c51 corosync[5783]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 29 16:26:59 c51 corosync[5783]: [TOTEM ] Token has not been received in 2737 ms
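
For reference, the link state can also be checked directly on the node; these are stock commands on a PVE host, sketched here for completeness (the bond interface name is an assumption):
  # quorum and membership as Proxmox sees it
  pvecm status
  # corosync's own per-link view (enabled/connected for each knet link)
  corosync-cfgtool -s
  # if the cluster link rides on the bonded interface, look for slave flaps there too
  cat /proc/net/bonding/bond0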
 
In my experience, high IO delay only occurs when intensive reads/writes hit HDDs (though that can be avoided by passing the SATA controller through to the VM).
 
Yeah, I have ten Kioxia CD6-5 disks in a ZFS array with a huge amount of RAM available. These things have a theoretical max speed of 5800MB/s with 700k IOPS per disk. It is weird too, because regardless of migration direction it is the same server that freaks out, the healthy servers never exceed 1% utilization, and usually don't even hit 0.01%.

Thank you for the input, I would appreciate any other thoughts!
 
The server next to it has double the workload and has 0.01% IO Delay currently. It seems like any kind of heavy read/write causes huge IO delays. I just can't tell if it is network or disk traffic causing it.
 

Attachments

  • Screenshot 2024-08-30 at 9.14.51 PM.png
Perhaps a faulty disk? What are you booting off? I had a slow SD card as boot media and corosync does not like that. You should use something like Prometheus or NetData to gather all the statistics; then you can see things like slow disks, network trouble, etc.
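
For example (a minimal sketch; both packages are in the standard Debian repositories and the ports mentioned are their defaults):
  # NetData: per-second disk, network and pressure metrics with a local web UI on port 19999
  apt install netdata
  # or, for a Prometheus-based setup, expose node metrics on port 9100 and scrape them centrally
  apt install prometheus-node-exporter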
 
I suspect a heat throttling issue more than anything else, but I would greatly appreciate any words of wisdom.
You have a point, it could be heat. What is the ambient temperature, and what temperature is the hardware reporting? Check your IPMI.

NVMe drives are pretty reliable and fail rarely, unlike earlier SATA drives.

Can you remove a few drives to test, or is it running in production? Can you put the VMs on the other servers and free this one up to test the parts one by one?
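
Something along these lines should show it (assuming ipmitool is installed on the node; the BMC web UI exposes the same readings):
  # all temperature sensors the BMC reports (ambient, CPU, VRM, backplane, ...)
  ipmitool sdr type temperature
  # full sensor list with thresholds, useful for spotting anything running close to a limit
  ipmitool sensor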
 
Perhaps a faulty disk? What are you booting off? I had a slow SD card as boot media and corosync does not like that. You should use something like Prometheus or NetData to gather all the statistics; then you can see things like slow disks, network trouble, etc.

Both servers are running off of a pair of Toshiba XG6 KXG60ZNV512G 512GB NVMe M.2 drives, PCIe 4.0, in a basic ZFS striped RAID. Stupid fast, and only about 20% full. Also reporting healthy SMART status.

I will take a look at Prometheus and NetData to see what I can get and report back.
 
You have a point, it could be heat. What is the ambient temperature, and what temperature is the hardware reporting? Check your IPMI.

NVMe drives are pretty reliable and fail rarely, unlike earlier SATA drives.

Can you remove a few drives to test, or is it running in production? Can you put the VMs on the other servers and free this one up to test the parts one by one?

All drives on all servers are reporting between 32° and 35° Celsius, and no drive stands out. These drives are U.2 PCIe 4.0 hot-swappable drives and have relatively light read/write cycles on them as far as I can tell.
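
(For reference, per-drive temperature and wear can be pulled with nvme-cli; a sketch, device names will differ:)
  # composite temperature, percentage used (wear) and error counters for one drive
  nvme smart-log /dev/nvme0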

I will clear that server off and see what kind of isolation I can do.

All three servers are in production unfortunately, so I can't be too aggressive in ripping things up. It does help that all three machines are identical in configuration and setup, down to which SFP+ modules are used, the arrangement of ports, the OS version, and reboot time.

I will check into the IPMI.
 
I pulled temperature data from IPMI; the baseline when not migrating looks like this.
 

Attachments

  • Screenshot 2024-08-31 at 7.45.23 PM.png
  • Screenshot 2024-08-31 at 7.45.36 PM.png
Yeah, I have ten Kioxia CD6-5 disks in a ZFS array with a huge amount of RAM available. These things have a theoretical max speed of 5800MB/s with 700k IOPS per disk. It is weird too, because regardless of migration direction it is the same server that freaks out, the healthy servers never exceed 1% utilization, and usually don't even hit 0.01%.

Thank you for the input, I would appreciate any other thoughts!
You should not underestimate the write delay with RAIDZ2. I have a file server with the same NVMe drives (but only six of them) in RAIDZ1 and do not achieve more than 300 MiB/s when writing.
If you run backups and a VM then starts writing heavily, this is exactly what could happen. Next time it happens, can you check whether a VM is writing a lot?
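
Something like this on the host should show it (just a sketch; iotop and sysstat may need to be installed first):
  # accumulate per-process I/O and show only processes actually doing I/O;
  # QEMU guests appear as kvm processes with "-id <vmid>" in the command line
  iotop -oPa
  # or sample per-process disk throughput once per second
  pidstat -d 1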
 
First of all, never put your boot drives on a stripe. Second, in any proper server system temperature is rarely a problem; if you had fan issues, you would have CPU issues too.

You say 30% on the array, which could mean one disk sitting at 100%. The SMART health status is not very telling if you haven't hit the manufacturer limits yet; it could also be a link issue, etc. Pull the full details with smartctl and see if there are any issues; on enterprise drives it should report checksum errors both at the bus and the data layer. If you are doing a test, check the disk busy stats with iostat and see if one disk stands out, then replace it.
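
Concretely, something along these lines (just a sketch; device names will differ):
  # full SMART/health output for one NVMe, including the error log and media errors
  smartctl -a /dev/nvme0
  # extended per-device stats every second; watch await and %util for one disk standing out
  iostat -xm 1
  # per-vdev / per-disk latency inside the ZFS pool itself
  zpool iostat -vl 1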

Have you engaged with the hardware vendor at all?
 
First of all, never put your boot drives on a stripe. Second, in any proper server system temperature is rarely a problem; if you had fan issues, you would have CPU issues too.

You say 30% on the array, which could mean one disk sitting at 100%. The SMART health status is not very telling if you haven't hit the manufacturer limits yet; it could also be a link issue, etc. Pull the full details with smartctl and see if there are any issues; on enterprise drives it should report checksum errors both at the bus and the data layer. If you are doing a test, check the disk busy stats with iostat and see if one disk stands out, then replace it.

Have you engaged with the hardware vendor at all?

I misspoke. It is a mirrored pair, not a striped one. I'm checking out iostat and smartctl now.

The hardware is Gigabyte, and they aren't exactly stellar in my experience. My warranty ended 60 days ago too.
 
You should not underestimate the write delay with RAIDZ2. I have a file server with the same NVMe drives (but only six of them) in RAIDZ1 and do not achieve more than 300 MiB/s when writing.
If you run backups and a VM then starts writing heavily, this is exactly what could happen. Next time it happens, can you check whether a VM is writing a lot?
Any idea what would cause the issue to follow the device, not the VM?
 
Any idea what would cause the issue to follow the device, not the VM?
Often the VMs are just where the problems are noticed first.
So if there are problems in VMs, it is worth keeping an eye on the host at the same time. Slow storage or network problems in particular quickly show up inside a VM.
 
