Cluster master failed... And now?

copymaster

Member
Nov 25, 2009
183
0
16
I know the solution for the case where a node has to become master: pveca -m.

In my cluster, every now and then the servers in the cluster become inaccessible, and I think this is related to the error described below.

But in my situation the master does not seem to be "really" down. I get several errors in /var/log/kern.log indicating (I think) a problem with a SATA drive. Here is the error log I already posted:

Code:
Jan 12 08:04:05 Donald kernel: sd 7:0:0:0: [sdb] 2147518464 512-byte hardware sectors (1099529 MB)
Jan 12 08:04:05 Donald kernel: sd 7:0:0:0: [sdb] Write Protect is off
Jan 12 08:04:05 Donald kernel: sd 7:0:0:0: [sdb] Mode Sense: bd 00 00 08
Jan 12 08:04:05 Donald kernel: sd 7:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jan 12 08:05:07 Donald kernel: ata2.00: qc timeout (cmd 0xa0)
Jan 12 08:05:07 Donald kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 12 08:05:07 Donald kernel: ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jan 12 08:05:07 Donald kernel:         cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
Jan 12 08:05:07 Donald kernel:         res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
Jan 12 08:05:07 Donald kernel: ata2.00: status: { DRDY ERR }
Jan 12 08:05:12 Donald kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Jan 12 08:05:17 Donald kernel: ata2: device not ready (errno=-16), forcing hardreset
Jan 12 08:05:17 Donald kernel: ata2: soft resetting link
Jan 12 08:05:18 Donald kernel: ata2.01: NODEV after polling detection
Jan 12 08:05:18 Donald kernel: ata2.00: configured for UDMA/25
Jan 12 08:05:18 Donald kernel: ata2: EH complete
Jan 12 08:07:00 Donald kernel: ata2.00: qc timeout (cmd 0xa0)
Jan 12 08:07:00 Donald kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 12 08:07:00 Donald kernel: ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jan 12 08:07:00 Donald kernel:         cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
Jan 12 08:07:00 Donald kernel:         res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
Jan 12 08:07:00 Donald kernel: ata2.00: status: { DRDY ERR }
Jan 12 08:07:05 Donald kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Jan 12 08:07:10 Donald kernel: ata2: device not ready (errno=-16), forcing hardreset
Jan 12 08:07:10 Donald kernel: ata2: soft resetting link
Jan 12 08:07:10 Donald kernel: ata2.01: NODEV after polling detection
Jan 12 08:07:11 Donald kernel: ata2.00: configured for UDMA/25
Jan 12 08:07:11 Donald kernel: ata2: EH complete
I googled this error and it seems that a fresh installation may cure the server's pain. But now the question:

What is the procedure if I want to reinstall a cluster master?
Say I want to shut the master down, reinstall it, and then bring it back as master again.

Is that possible? And what is the right way to do it?
 


  1. Promote a node to master.
  2. Install the new server (check the hardware/disk system).
  3. Join the new one to the cluster.
  4. Then promote that node to master again.
  5. Tell all other servers to sync from this master (see the command sketch below).
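
On Proxmox VE 1.x this maps roughly to the pveca commands below. Take it as a hedged sketch only: the IP addresses are illustrative, so double-check each step against your own cluster before running anything.

Code:
# On one of the surviving nodes: promote it to master
pveca -m

# Reinstall the failed server, then on the freshly installed machine:
# join the cluster, pointing -h at the current master (IP illustrative)
pveca -a -h 192.168.0.71

# Once it is back in sync and should become master again, still on that machine:
pveca -m

# On every other node: sync the configuration from the restored master (IP illustrative)
pveca -s -h 192.168.0.70

# Check the cluster state on each node
pveca -l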
 
Thanks, I did as you advised...

I promoted one node to be master. Then the former master disappeared.
I reinstalled the server and brought it up again, then added it back into the cluster.
Then I promoted it to master again with pveca -m.

Then the node which had been master in the meantime disappeared.

I tried to tell the nodes to sync from the master... Now DISASTER!
I think the SSH keys are completely mixed up and my cluster is broken.

I have 3 servers, and here is the pveca -l output from each of them:

Donald (192.168.0.70)
Code:
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
 3 : 192.168.0.72    N     S    2 days 03:56   1.26    59%     2%
 4 : 192.168.0.70    M     A           00:22   0.03     2%     1%
Tick (192.168.0.71) was master during reinstallation of Donald
Code:
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
 2 : 192.168.0.71    M     A    2 days 03:57   0.64    61%     1%
 3 : 192.168.0.72    N     S    2 days 03:56   1.32    59%     2%
 4 : 192.168.0.70    N     A           00:22   0.01     2%     1%
Trick (192.168.0.72)
Code:
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
 1 : 192.168.0.70    M     ERROR: 500 Can't connect to 127.0.0.1:50000 (connect: Connection refused)

 2 : 192.168.0.71    N     A    2 days 03:58   0.46    61%     1%
 3 : 192.168.0.72    N     S    2 days 03:57   1.64    59%     2%
When I log into the new master, I have no access to the "Storage" section, which is essential because that's where all the VMs are.

And if I choose "Cluster" I can only see the master and Tick as a node, which is in state NOSYNC.

HELP please!

A pveca -s on node Trick gives:
Code:
syncing master configuration from '192.168.0.70'
syncing master configuration from '192.168.0.70' failed (rsync --rsh=ssh -l root -o BatchMode=yes -lpgoq 192.168.0.70:/etc/pve/* /etc/cron.d/vzdump /etc/pve/master/ --exclude *~) : command 'rsync --rsh=ssh -l root -o BatchMode=yes -lpgoq 192.168.0.70:/etc/pve/* /etc/cron.d/vzdump /etc/pve/master/ --exclude *~' failed with exit code 255:
Host key verification failed.
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at io.c(635) [receiver=3.0.3]
 

Did you install the old server with the same IP/name? Then the other servers have a wrong known_hosts file.

Files to check (on all servers):
/etc/pve/cluster.cfg
/root/.ssh/known_hosts

Fix the known_hosts file (remove the old entry, or just delete the whole file). If you can ssh from each server to every other one without specifying a password, the cluster communication works again.
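
A minimal sketch of that cleanup, assuming the reinstalled master is 192.168.0.70 (substitute your own hosts):

Code:
# Remove the stale host key entry for the reinstalled server
ssh-keygen -R 192.168.0.70

# ...or, more drastically, delete the whole file and let it be repopulated
# rm /root/.ssh/known_hosts

# Then test passwordless root ssh between all nodes and accept the new host keys
ssh root@192.168.0.70 hostname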
 
OK, I removed the known_hosts files from all nodes and ssh'ed into each server from every other one.

The connection works again, and I can manually do a pveca -s.

But the cluster.cfg is different on every host.
The new master only shows one node.

Can I just edit the cluster.cfg so that all nodes show up, and copy this config to all other nodes?

By the way: the "Storage" section in the web interface still doesn't work.
 

The cluster.cfg has to be equal on all hosts.

Storage: I did not understand this, details?

Also post your /etc/pve/storage.cfg (from the master); this file is synchronized to all nodes.
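
For orientation, a default /etc/pve/storage.cfg with just the local directory storage looks roughly like this (illustrative only; the actual entries depend on the storages that were defined):

Code:
dir: local
        path /var/lib/vz
        content images,iso,vztmpl,rootdir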
 
Hi Tom,

I got it back up again in the meantime. I just copied a working storage.cfg to all nodes and synced the cluster.cfg.

After that I ssh'ed from each machine into every other one and ran a pveca -s -h <master>.
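
In terms of commands, that was roughly the following (a sketch only; 192.168.0.70 is the master and the node IPs are taken from the listings above):

Code:
# On the master: push the known-good config files to the other nodes
scp /etc/pve/storage.cfg /etc/pve/cluster.cfg root@192.168.0.71:/etc/pve/
scp /etc/pve/storage.cfg /etc/pve/cluster.cfg root@192.168.0.72:/etc/pve/

# On each node: sync the remaining configuration from the master
pveca -s -h 192.168.0.70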

I think that's it. Unfortunately this didn't solve my problem with the ata2 error.
I changed all drives but the error remains; I reinstalled, and the error remains.

I think the only part of the server that can be the culprit is the storage controller...
I will have a look at that today...