Hi guys,
I have been struggling since the 4.4-12 to 5.1-36 upgrade (to be fair, it's a fresh deployment) with terrible I/O performance via iSCSI (and, after some testing, NFS also seems affected). The problem doesn't always show up, but I have been able to reproduce it as follows:
Freshly booted Linux VM, connected via iSCSI to the old PVE 4.4-12 server.
dstat output while running "dd bs=10M count=250 if=/dev/zero of=test conv=fdatasync":
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 99 0 0 0| 523k 3404k| 0 0 | 0 0 |1433 2976
0 0 100 0 0 0| 0 0 | 136B 158B| 0 0 | 60 103
0 0 100 0 0 0| 0 0 | 200B 830B| 0 0 | 52 89
0 0 100 0 0 0| 0 0 | 136B 350B| 0 0 | 47 85
0 0 100 0 0 0| 0 0 | 200B 350B| 0 0 | 381 83
0 0 100 0 0 0| 0 36k| 202B 700B| 0 0 | 367 87
0 0 100 0 0 0| 0 0 | 200B 66B| 0 0 | 55 91
0 0 100 0 0 0| 0 0 | 136B 358B| 0 0 | 54 93
0 0 100 0 0 0| 0 0 | 200B 350B| 0 0 | 385 85
0 0 100 0 0 0| 0 0 | 378B 802B| 0 0 |1012 83
0 12 88 0 0 0| 0 182M| 428B 610B| 0 0 | 64k 178k
0 13 87 0 0 0| 0 235M| 136B 42B| 0 0 | 78k 224k
0 13 87 0 0 0| 0 213M| 200B 358B| 0 0 | 74k 207k
0 13 86 0 0 0| 0 256M| 136B 358B| 0 0 | 77k 214k
0 13 87 0 0 0| 0 231M| 200B 358B| 0 0 | 79k 218k
0 13 85 2 0 0|8192B 222M| 136B 358B| 0 0 | 75k 206k
0 12 88 0 0 0| 0 231M| 200B 358B| 0 0 | 78k 221k
0 13 87 0 0 0| 0 225M| 136B 366B| 0 0 | 80k 220k
0 13 87 0 0 0| 0 231M| 200B 358B| 0 0 | 82k 228k
0 14 86 0 0 0| 0 225M| 136B 358B| 0 0 | 79k 218k
0 12 85 3 0 0| 0 248M| 398B 1024B| 0 0 | 69k 199k
0 0 100 0 0 0| 0 0 | 136B 174B| 0 0 | 55 83
0 0 100 0 0 0| 0 0 | 200B 358B| 0 0 | 897 85
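(For reference, the output above is just dstat's default columns; I keep it running in a second SSH session with its default 1-second sampling while the dd runs, roughly like this:)

# in a second terminal on the VM, while the dd runs in the first one
dstat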
Afterwards, I ran this:
sysbench --test=oltp --oltp-table-size=1000000 --mysql-db=test --mysql-user=root --mysql-password=password prepare
and then the actual test:
sysbench --test=oltp --oltp-table-size=1000000 --mysql-db=test --mysql-user=root --mysql-password=password --max-time=60 --oltp-read-only=on --max-requests=0 --num-threads=8 run
The dstat output while performing this second command was the following:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
3 2 90 6 0 0| 100k 35M| 502B 902B| 0 0 | 10k 22k
5 2 87 6 0 0| 40k 49M| 266B 252B| 0 0 |9509 24k
6 2 88 5 0 0| 48k 52M| 136B 358B| 0 0 | 10k 26k
2 1 95 2 0 0| 16k 20M| 200B 366B| 0 0 |4175 11k
6 2 79 13 0 0| 40k 88M| 136B 366B| 0 0 | 11k 26k
2 1 91 5 0 0|8192B 34M| 290B 732B| 0 0 |4482 11k
3 1 91 5 0 0| 0 39M| 136B 90B| 0 0 |7414 17k
5 2 83 11 0 0| 0 69M| 200B 366B| 0 0 |8284 19k
3 1 89 7 0 0| 0 45M| 136B 358B| 0 0 |4709 11k
2 1 95 3 0 0| 0 25M| 200B 358B| 0 0 |4928 13k
6 3 78 13 0 0| 0 84M| 136B 358B| 0 0 | 12k 31k
3 1 89 7 0 0| 0 46M| 266B 358B| 0 0 |6173 16k
3 1 90 6 0 0| 0 36M| 202B 818B| 0 0 |5548 12k
4 2 86 8 0 0| 0 52M| 200B 66B| 0 0 |6038 15k
4 1 85 10 0 0| 0 66M| 136B 358B| 0 0 |7198 18k
3 1 90 6 0 0| 0 39M| 200B 358B| 0 0 |4946 11k
2 1 92 5 0 0| 0 34M| 136B 358B| 0 0 |6363 15k
Finally, after this test was completed, all I did was re-run the very first test, that is: "dd bs=10M count=250 if=/dev/zero of=test conv=fdatasync"
This is where things start to differ between the old and the new PVE.
This is the output on the older system, and you can see that it's what you would expect:
0 13 87 0 0 0| 0 224M| 200B 102B| 0 0 | 88k 216k
0 13 87 0 0 0|4096B 234M| 136B 358B| 0 0 | 87k 221k
0 13 87 0 0 0|4096B 238M| 200B 358B| 0 0 | 91k 222k
0 2 95 3 0 0| 0 186M| 136B 366B| 0 0 | 71k 182k
0 12 88 0 0 0| 0 231M| 200B 366B| 0 0 | 88k 217k
0 7 92 1 0 0| 0 218M| 136B 358B| 0 0 | 81k 208k
0 12 87 0 0 0| 0 235M| 200B 358B| 0 0 | 90k 222k
0 9 90 1 0 0| 0 227M| 136B 358B| 0 0 | 85k 216k
Unfortunately, on the new installation, this is what happens:
0 0 88 12 0 0| 0 1256k| 136B 102B| 0 0 | 900 1305
0 0 88 12 0 0| 0 2180k| 200B 358B| 0 0 |1129 2099
0 0 88 12 0 0| 0 2484k| 370B 782B| 0 0 |1186 2119
0 0 88 12 0 0| 0 2520k| 200B 102B| 0 0 |1207 2226
0 0 91 9 0 0| 0 1484k| 334B 1016B| 0 0 | 826 1383
As you can see, I/O wait sits constantly around 10% and the write speed is roughly 1/100th of what it was. The problem is that this happens AFTER the test has completed, and iotop confirms that no other process is writing to the disk. The only way I've found to get back to the initial performance is to reboot the VM.
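(For what it's worth, this is roughly the iotop check I use to confirm that nothing else is writing, in case my invocation is the problem:)

# batch mode, a few samples, only show processes actually doing I/O
iotop -o -b -n 5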
Here are some other specs:
PVE 4.4-12 (works OK) - 8 x Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz (1 Socket) 32GB ECC DDR3
PVE 5.1-36 (poor performance) - 8 x Intel(R) Xeon(R) CPU E3-1260L v5 @ 2.90GHz (1 Socket) 64GB ECC DDR4
The test VM is THE EXACT SAME ONE, accessed at different times first from the old and then from the new PVE host. I might now try deploying PVE 5.1-36 on the older hardware to see if anything changes.
If you have any idea of what is going on, any help would be immensely appreciated.
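In case someone wants to try to reproduce this, here is the whole sequence I run inside the VM in one place (the "test" database and the root/password credentials are just my throwaway test setup; dstat keeps running in a separate terminal the whole time):

# 1) baseline sequential write, flushed to disk at the end
dd bs=10M count=250 if=/dev/zero of=test conv=fdatasync

# 2) same sysbench OLTP workload as above: prepare, then a 60s read-only run with 8 threads
sysbench --test=oltp --oltp-table-size=1000000 --mysql-db=test --mysql-user=root --mysql-password=password prepare
sysbench --test=oltp --oltp-table-size=1000000 --mysql-db=test --mysql-user=root --mysql-password=password --max-time=60 --oltp-read-only=on --max-requests=0 --num-threads=8 run

# 3) same dd again; on the PVE 5.1-36 host this is where throughput collapses to a couple of MB/s
dd bs=10M count=250 if=/dev/zero of=test conv=fdatasync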