Abnormal activity after 3.00AM

Lucasc · Apr 30, 2019

Hi, every morning at 3.00AM proxmox starts to backup 1 vm and run apt-get update for new updates
After that, the CPU of all my nodes has grown up from 2-3% of activity to over 60%

pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-44
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1

Top command

oguz · Apr 30, 2019

Hi.

Lucasc said:
very morning at 3.00AM proxmox starts to backup 1 vm

Judging from the top screenshot where we see kvm is using the most CPU, I'd guess this is the reason it goes up to 60%.
How long does it take for you to backup this VM normally, and does the same thing happen?

Lucasc · Apr 30, 2019

oguz said:
Hi.

Judging from the top screenshot where we see kvm is using the most CPU, I'd guess this is the reason it goes up to 60%.
How long does it take for you to backup this VM normally, and does the same thing happen?

Normally it takes 10 minutes to backup this VM, but the backup run only in one node and the problem is present in all nodes..

(Probably, but I'm not sure) The CPU goes down to 5-10% when happens this:

Apr 30 15:03:16 itsshpe02 systemd-logind[2036]: Removed session 223.
Apr 30 15:03:16 itsshpe02 systemd[1]: Stopping User Manager for UID 0...
Apr 30 15:03:16 itsshpe02 systemd[12210]: Stopped target Default.
Apr 30 15:03:16 itsshpe02 systemd[12210]: Stopped target Basic System.
Apr 30 15:03:16 itsshpe02 systemd[12210]: Stopped target Paths.
Apr 30 15:03:16 itsshpe02 systemd[12210]: Stopped target Timers.
Apr 30 15:03:16 itsshpe02 systemd[12210]: Stopped target Sockets.
Apr 30 15:03:16 itsshpe02 systemd[12210]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Apr 30 15:03:16 itsshpe02 systemd[12210]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Apr 30 15:03:16 itsshpe02 systemd[12210]: Closed GnuPG cryptographic agent and passphrase cache.
Apr 30 15:03:16 itsshpe02 systemd[12210]: Closed GnuPG cryptographic agent (access for web browsers).
Apr 30 15:03:16 itsshpe02 systemd[12210]: Reached target Shutdown.
Apr 30 15:03:16 itsshpe02 systemd[12210]: Starting Exit the Session...
Apr 30 15:03:16 itsshpe02 systemd[12210]: Received SIGRTMIN+24 from PID 25620 (kill).
Apr 30 15:03:16 itsshpe02 systemd[12213]: pam_unix(systemd-user:session): session closed for user root
Apr 30 15:03:16 itsshpe02 systemd[1]: Stopped User Manager for UID 0.
Apr 30 15:03:16 itsshpe02 systemd[1]: Removed slice User Slice of root.

and then the cpu grow up again when happens this:

Apr 30 15:03:34 itsshpe02 systemd[25949]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
Apr 30 15:03:34 itsshpe02 systemd[25949]: Reached target Paths.
Apr 30 15:03:34 itsshpe02 systemd[25949]: Listening on GnuPG cryptographic agent and passphrase cache.
Apr 30 15:03:34 itsshpe02 systemd[25949]: Reached target Timers.
Apr 30 15:03:34 itsshpe02 systemd[25949]: Listening on GnuPG cryptographic agent (access for web browsers).
Apr 30 15:03:34 itsshpe02 systemd[25949]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Apr 30 15:03:34 itsshpe02 systemd[25949]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Apr 30 15:03:34 itsshpe02 systemd[25949]: Reached target Sockets.
Apr 30 15:03:34 itsshpe02 systemd[25949]: Reached target Basic System.
Apr 30 15:03:34 itsshpe02 systemd[25949]: Reached target Default.
Apr 30 15:03:34 itsshpe02 systemd[25949]: Startup finished in 53ms.
Apr 30 15:03:34 itsshpe02 systemd[1]: Started User Manager for UID 0.

Lucasc · Apr 30, 2019

The problem is still present also restarting the node

Does anyone have an idea?

Top:

oguz · May 2, 2019

What do you see in the syslog? Is there still GnuPG crypto agent running while this problem occurs? Or anything else weird? The top output in the last picture looks relatively normal to me? Are you sure this is the same host?

Lucasc · May 3, 2019

oguz said:
What do you see in the syslog? Is there still GnuPG crypto agent running while this problem occurs? Or anything else weird? The top output in the last picture looks relatively normal to me? Are you sure this is the same host?

I used as workaround: systemctl stop pve-cluster and CPU goes down to 1-3%.
If i restart pve-cluster, after 30-40 seconds CPU grow up to 60-70%

oguz · May 3, 2019

I don't know if that's a valid workaround.

* Can you send your network configurations? (Mask any public IP addresses or other sensitive information)

* Not sure if you mistyped in the subject (3:00AM), but the logs you sent are from 3:00PM. Is that when it starts happening? Can you check the `dmesg` output and the syslog for what happens when this issue starts happening by itself? My guess is it has to do with your storage (which is most likely a shared storage I guess?)

* What kind of storage are you using exactly? How are the I/O loads looking while this issue occurs?

Lucasc · May 3, 2019

oguz said:
I don't know if that's a valid workaround.

* Can you send your network configurations? (Mask any public IP addresses or other sensitive information)

* Not sure if you mistyped in the subject (3:00AM), but the logs you sent are from 3:00PM. Is that when it starts happening? Can you check the `dmesg` output and the syslog for what happens when this issue starts happening by itself? My guess is it has to do with your storage (which is most likely a shared storage I guess?)

* What kind of storage are you using exactly? How are the I/O loads looking while this issue occurs?

The storage is a ZFS RAID 1 2x480GB SSD
With iotop, I noticed that the txg_sync process takes up 99.99% of resources

Lucasc · May 3, 2019

cat /proc/spl/kstat/zfs/arcstats
13 1 0x01 96 26112 6776683189 4098160845547
name type data
hits 4 1509465
misses 4 437815
demand_data_hits 4 513396
demand_data_misses 4 255603
demand_metadata_hits 4 960028
demand_metadata_misses 4 29321
prefetch_data_hits 4 34557
prefetch_data_misses 4 151427
prefetch_metadata_hits 4 1484
prefetch_metadata_misses 4 1464
mru_hits 4 442798
mru_ghost_hits 4 0
mfu_hits 4 1031750
mfu_ghost_hits 4 0
deleted 4 65
mutex_miss 4 0
access_skip 4 1
evict_skip 4 233
evict_not_enough 4 0
evict_l2_cached 4 0
evict_l2_eligible 4 907264
evict_l2_ineligible 4 4096
evict_l2_skip 4 0
hash_elements 4 469095
hash_elements_max 4 469162
hash_collisions 4 10580
hash_chains 4 4349
hash_chain_max 4 2
p 4 33788046336
c 4 67576092672
c_min 4 4223505792
c_max 4 67576092672
size 4 4462628744
compressed_size 4 4113528832
uncompressed_size 4 4387236352
overhead_size 4 183423488
hdr_size 4 154416736
data_size 4 4098509824
metadata_size 4 198442496
dbuf_size 4 3840128
dnode_size 4 5474280
bonus_size 4 1945280
anon_size 4 11694080
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 3450466816
mru_evictable_data 4 3181412864
mru_evictable_metadata 4 35782656
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 834791424
mfu_evictable_data 4 750106112
mfu_evictable_metadata 4 39217152
mfu_ghost_size 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
memory_all_bytes 4 135152185344
memory_free_bytes 4 111677140992
memory_available_bytes 3 109565390848
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 364118920
arc_meta_limit 4 50682069504
arc_dnode_limit 4 5068206950
arc_meta_max 4 405803336
arc_meta_min 4 16777216
sync_wait_for_async 4 5002
demand_hit_predictive_prefetch 4 81136
arc_need_free 4 0
arc_sys_free 4 2111752896

Lucasc · May 4, 2019

Resolved with: zfs set primarycache=metadata {pool}

Thanks for all @oguz

oguz · May 6, 2019

You're welcome. Glad your problem is solved!

You can mark the thread as [SOLVED] if you edit your first post.

Cheers!

Search

Search

Abnormal activity after 3.00AM

Lucasc

New Member

oguz

Proxmox Retired Staff

Lucasc

New Member

Lucasc

New Member

oguz

Proxmox Retired Staff

Lucasc

New Member

oguz

Proxmox Retired Staff

Lucasc

New Member

Lucasc

New Member

Lucasc

New Member

oguz

Proxmox Retired Staff