Fwd: Re: [HECnet] More clustering fun

Mark Wickens mark at wickensonline.co.uk
Fri Sep 16 11:27:56 PDT 2011


On 16/09/11 11:22, hvlems at zonnet.nl wrote:
Did you adjust EXPECTED_VOTES on the alpha server?

From: Mark Wickens <mark at wickensonline.co.uk>
Sender: owner-hecnet at Update.UU.SE
Date: Fri, 16 Sep 2011 11:09:04 +0100
To: <hecnet at Update.UU.SE>
ReplyTo: hecnet at Update.UU.SE
Subject: Fwd: Re: [HECnet] More clustering fun

Just to add to the picture - if I reduce VOTES on the satellite to 0 I get this happening:

-------- Original Message --------
Subject: Re: [HECnet] More clustering fun
Date: Fri, 16 Sep 2011 09:33:47 +0000
From: hvlems at zonnet.nl
Reply-To: hvlems at zonnet.nl
To: Mark Wickens <mark at wickensonline.co.uk>


All I can think of is this:
1) Slave and aleph both use the same VMSCLUSTER license
2) The cluster id and cluster password are different on the two nodes.
On an Alpha you can modify this in SYSMAN (use HELP in SYSMAN to find the correct command). On a VAX the command is buried in SYSGEN.
Hans
-----Original Message-----
From: Mark Wickens <mark at wickensonline.co.uk>
Date: Fri, 16 Sep 2011 10:11:41 
Cc: <hvlems at zonnet.nl>
Subject: Re: [HECnet] More clustering fun

Hi Hans,

I didn't want to pre-empt what I thought happened last time as I wasn't 
sure I'd got it right, but it has happened again so it's definitely an 
issue.

I've updated VOTES in MODPARAMS.DAT on ALEPH (the satellite), run
AUTOGEN and rebooted the cluster.

Now when ALEPH attempts to join the cluster I get these messages repeatedly:

%CNXMAN,  sending VAXcluster membership request to system SLAVE
%CNXMAN,  sending VAXcluster membership request to system SLAVE
%CNXMAN,  sending VAXcluster membership request to system SLAVE
%CNXMAN,  sending VAXcluster membership request to system SLAVE
%CNXMAN,  sending VAXcluster membership request to system SLAVE

and I see this on SLAVE:

$$
    %CNXMAN,  Received VMScluster membership request from system ALEPH
%CNXMAN,  Proposing addition of system ALEPH
%CNXMAN,  Completing VMScluster state transition
%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:35.79  %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) received VMScluster membership 
request from node ALEPH

%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:35.79  %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) proposed addition of node ALEPH

%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:35.79  %%%%%%%%%%%
10:05:35.79 Node SLAVE (csid 00010001) completed VMScluster state transition

$$
    %CNXMAN,  Received VMScluster membership request from system ALEPH
%CNXMAN,  Proposing addition of system ALEPH
%CNXMAN,  Completing VMScluster state transition
%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:39.04  %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) received VMScluster membership 
request from node ALEPH

%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:39.04  %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) proposed addition of node ALEPH

%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:39.04  %%%%%%%%%%%
10:05:39.04 Node SLAVE (csid 00010001) completed VMScluster state transition

$$
    %CNXMAN,  Received VMScluster membership request from system ALEPH
%CNXMAN,  Proposing addition of system ALEPH
%CNXMAN,  Completing VMScluster state transition
%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:43.02  %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) received VMScluster membership 
request from node ALEPH

%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:43.02  %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) proposed addition of node ALEPH

%%%%%%%%%%%  OPCOM  16-SEP-2011 10:05:43.02  %%%%%%%%%%%
10:05:43.02 Node SLAVE (csid 00010001) completed VMScluster state transition

This just repeats forever.

Some more information from SLAVE (the ALPHA server):

SHOW CLUSTER:

View of Cluster from system ID 4345  node: SLAVE                16-SEP-2011 10:06:13
+--------------------------------------------------------+---------+
|                         SYSTEMS                        | MEMBERS |
+--------+--------------------------------+--------------+---------+
|  NODE  |             HW_TYPE            |   SOFTWARE   |  STATUS |
+--------+--------------------------------+--------------+---------+
| SLAVE  | AlphaServer 1000A 5/300        | VMS V8.3     | MEMBER  |
| ALEPH  | VAXstation 4000-VLC            | VMS V7.3     | NEW     |
+--------+--------------------------------+--------------+---------+
+------------------------------------------------------------------------------------+
|                                      CLUSTER                                       |
+--------+-----------+----------+------------+-------------------+-------------------+
| CL_EXP | CL_QUORUM | CL_VOTES | CL_MEMBERS |       FORMED      |  LAST_TRANSITION  |
+--------+-----------+----------+------------+-------------------+-------------------+
|      1 |         1 |        1 |          1 | 16-SEP-2011 09:56 | 16-SEP-2011 09:56 |
+--------+-----------+----------+------------+-------------------+-------------------+

SYSGEN>  SHOW EXPECTED_VOTES
%CNXMAN,  Completing VMScluster state transition
Parameter Name            Current    Default     Min.       Max.   Unit  Dynamic
--------------            -------    -------   -------    -------  ----  -------
EXPECTED_VOTES                  1          1         1        127 Votes


Any ideas why this is going wrong?

Thanks for the help, much appreciated,

Mark.

On 16/09/11 09:40, hvlems at zonnet.nl wrote:
> Regarding the AlphaServer: check the value of EXPECTED_VOTES in SYSGEN.
> In a cluster with non-voting satellites only, its value must be less than Votes+1
> Hans
> ------Original message------
> From: Mark Wickens
> Sender: owner-hecnet at Update.UU.SE
> To: hecnet at Update.UU.SE
> Reply-To: hecnet at Update.UU.SE
> Subject: [HECnet] More clustering fun
> Sent: 16 September 2011 10:14
>
> I've now refreshed the VAX satellite's system drive and installed it in the
> ALPHA server. The one problem I have remaining is that the VOTES the
> satellite is contributing to the cluster is 1. I believe for a proper
> satellite this should be 0.
>
> Is this a case of updating MODPARAMS.DAT on the satellite, running AUTOGEN
> and rebooting? Do I need to do anything with the ALPHA server's configuration?
>
> Presumably I will need to reboot the ALPHA server as well.
>
> Thanks for the help,
>
> Kind regards, Mark.
>

The EXPECTED_VOTES on the ALPHA is set to 1, which I assume is the value we would expect to work?

Even when both the server and the satellite have VOTES set to 1, EXPECTED_VOTES remains at 1 on the server, although SHOW CLUSTER shows that the quorum is set to 2:

View of Cluster from system ID 4345  node: SLAVE                16-SEP-2011 11:21:15
+--------------------------------------------------------+---------+
|                         SYSTEMS                        | MEMBERS |
+--------+--------------------------------+--------------+---------+
|  NODE  |             HW_TYPE            |   SOFTWARE   |  STATUS |
+--------+--------------------------------+--------------+---------+
| SLAVE  | AlphaServer 1000A 5/300        | VMS V8.3     | MEMBER  |
| ALEPH  | VAXstation 4000-VLC            | VMS V7.3     | MEMBER  |
+--------+--------------------------------+--------------+---------+
+-------------------------------------------------------------------------------
|                                       CLUSTER
+--------+-----------+----------+------------+-------------------+--------------
| CL_EXP | CL_QUORUM | CL_VOTES | CL_MEMBERS |       FORMED      |  LAST_TRANSIT
+--------+-----------+----------+------------+-------------------+--------------
|      2 |         2 |        2 |          2 | 16-SEP-2011 11:04 | 16-SEP-2011 1
+--------+-----------+----------+------------+-------------------+--------------
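
The CL_EXP and CL_QUORUM columns in the two SHOW CLUSTER displays are consistent with the quorum formula given in the OpenVMS Cluster documentation: quorum = (EXPECTED_VOTES + 2) / 2, with the fraction dropped. Below is a minimal Python sketch of that arithmetic. The function names are made up for illustration, ALEPH's own EXPECTED_VOTES setting of 1 is an assumption (it is not shown anywhere in this thread), and the rule that the cluster works from the larger of the highest EXPECTED_VOTES setting and the total votes present is a simplified reading of the documentation, not something confirmed here.

```python
def quorum_from_expected(expected_votes):
    """Documented OpenVMS formula: (EXPECTED_VOTES + 2) / 2, fraction dropped."""
    return (expected_votes + 2) // 2


def cluster_expected_votes(expected_votes_settings, member_votes):
    """Assumed simplification: use the larger of the highest EXPECTED_VOTES
    setting among the members and the sum of the members' VOTES."""
    return max(max(expected_votes_settings), sum(member_votes))


# SLAVE on its own: EXPECTED_VOTES = 1, VOTES = 1.
alone = cluster_expected_votes([1], [1])
print(alone, quorum_from_expected(alone))

# SLAVE and ALEPH both with VOTES = 1 (ALEPH's EXPECTED_VOTES = 1 is assumed).
both = cluster_expected_votes([1, 1], [1, 1])
print(both, quorum_from_expected(both))
```

Under those assumptions the sketch gives expected votes 1 and quorum 1 for SLAVE alone, and expected votes 2 and quorum 2 once both nodes vote, which matches the two displays even though the SYSGEN parameter on SLAVE stays at 1.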


With the satellite's VOTES set to 0 (which causes the endless %CNXMAN messages), if I turn off the satellite at that point I get the following:

$$ 
     %CNXMAN,   Quorum lost, blocking activity

$$ 
     %CNXMAN,   Timed-out lost connection to system ALEPH
%CNXMAN,   Proposing reconfiguration of the VMScluster
%CNXMAN,   Discovered system ALEPH
%CNXMAN,   Removed from VMScluster system ALEPH
%CNXMAN,   Completing VMScluster state transition
%CNXMAN,   Established connection to system ALEPH

and the server hangs. I end up having to reboot the server, because the satellite never joins the cluster successfully.
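
For comparison, here is a rough Python sketch of the vote count one would expect when the satellite drops out, using the quorum values from the two SHOW CLUSTER displays earlier in this message. The helper names are hypothetical, and the sketch only does the arithmetic; it does not model what the connection manager actually does during a state transition.

```python
def votes_remaining(member_votes, lost_node):
    """Sum the VOTES of every member except the one that dropped out."""
    return sum(v for node, v in member_votes.items() if node != lost_node)


def has_quorum(votes_present, quorum):
    """Activity continues only while the votes present reach quorum."""
    return votes_present >= quorum


# Case 1: satellite non-voting, quorum 1 (first SHOW CLUSTER display).
case1 = votes_remaining({"SLAVE": 1, "ALEPH": 0}, "ALEPH")
print(has_quorum(case1, quorum=1))

# Case 2: both nodes voting, quorum 2 (second SHOW CLUSTER display).
case2 = votes_remaining({"SLAVE": 1, "ALEPH": 1}, "ALEPH")
print(has_quorum(case2, quorum=2))
```

The first case keeps quorum and the second loses it. The "Quorum lost" message above looks more like the second case, which might mean the cluster was still counting on a vote from ALEPH at that point, but that is only a guess.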

It's all fun!

Regards, Mark.


