[HECnet] Tops-20 SETNOD Failure

Thomas DeBellis tommytimesharing at gmail.com
Wed May 5 13:05:37 PDT 2021


I got annoyed at the thought of having to wait a few more months for the 
error condition to show up and, instead of having the batch job run more 
frequently (and thus beating on poor MIM::), I wrote another batch job 
which took every single file that I have /ever/ downloaded from MIM:: 
and inserted it.  So that's 75 files and it failed on number 54.

15:36:51 USER   SETNOD>**T OLDS:NODE-DATA.TXT.54*
15:36:52 USER
15:36:52 USER   SETNOD>**List Total*
15:36:52 USER
15:36:52 USER
15:36:52 USER   TOTAL NODES FOUND: 869
15:36:52 USER
15:36:52 USER   SETNOD>**Insert*
15:36:52 USER
15:36:52 USER   ?SETNOD: Failed at node RSX11M (2.298), Item 650 of 869, Error: _-11_
15:36:52 USER   SETNOD>

It is interesting that it is failing on node 2.298 again, even though 
this file dates from before that number was reassigned to REACH::.  The 
returned error code, negative 11, means "Component in Wrong State" (aka 
NF.CWS), which I didn't find immediately informative.  However, now I've 
got something to look around for.
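
For anyone who wants to scan batch logs for the same failure, here is a minimal sketch in Python (it is not part of SETNOD) that pulls the node name, address, item counts and error code out of a failure line.  The regular expression is an assumption based only on the message shapes quoted in this thread, the function name is made up for illustration, and the one table entry is the only code whose meaning is given here.

```python
import re

# Pattern assumed from the messages quoted in this thread, for example:
#   ?SETNOD: Failed at node RSX11M (2.298), Item 650 of 869, Error: -11
# Everything after the node name is optional, because the older SETNOD
# printed only "?SETNOD: Failed at node REACH".
FAILURE_RE = re.compile(
    r"\?SETNOD: Failed at node (?P<name>[A-Z0-9]+)"
    r"(?: \((?P<area>\d+)\.(?P<node>\d+)\))?"
    r"(?:, Item (?P<item>\d+) of (?P<total>\d+))?"
    r"(?:,\s*Error:\s*_?(?P<error>-?\d+)_?)?"
)

# Only the code explained in this thread; other codes are deliberately absent.
KNOWN_ERRORS = {-11: "Component in Wrong State (NF.CWS)"}


def parse_setnod_failure(line):
    """Return a dict of fields from a SETNOD failure line, or None."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    fields = {}
    for key, value in m.groupdict().items():
        if key == "name" or value is None:
            fields[key] = value          # keep the name as text, gaps as None
        else:
            fields[key] = int(value)     # numeric fields become integers
    return fields


if __name__ == "__main__":
    sample = ("15:36:52 USER   ?SETNOD: Failed at node RSX11M (2.298), "
              "Item 650 of 869, Error: -11")
    info = parse_setnod_failure(sample)
    print(info, KNOWN_ERRORS.get(info["error"], "unknown"))
```

The optional groups are there so the same function accepts the terse message from the unmodified SETNOD as well as the extended one.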

I still can't imagine why there would be anything particularly 
diabolical about the number 2.298.
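
For reference when staring at 2.298 in a dump, a small Python sketch of the standard DECnet Phase IV packing of area.node into 16 bits (6-bit area, 10-bit node number) follows.  The function names are hypothetical, and nothing in this thread establishes that SETNOD or the monitor's hash table mishandles this particular value; the sketch only shows what the number looks like in decimal, octal and hex.

```python
def decnet_address(area, node):
    """Pack area.node into the 16-bit Phase IV form (6-bit area, 10-bit node)."""
    if not (1 <= area <= 63):
        raise ValueError("area must be in 1..63")
    if not (1 <= node <= 1023):
        raise ValueError("node must be in 1..1023")
    return (area << 10) | node


def split_address(addr):
    """Inverse of decnet_address: return (area, node)."""
    return addr >> 10, addr & 0x3FF


if __name__ == "__main__":
    addr = decnet_address(2, 298)
    print(addr, oct(addr), hex(addr))   # prints: 2346 0o4452 0x92a
```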
> ------------------------------------------------------------------------
>
> On 5/5/21 12:38 AM, Thomas DeBellis wrote:
>
> I finished the modifications to SCLINK to properly return error values 
> which are negative and to JNTMAN to return the error value in AC3 if 
> .NDINT doesn't succeed in inserting all the nodes.  Then I modified 
> SETNOD to get this extended error information and print it.  I put the 
> new monitor and SETNOD up, rebooted *…AND*…
>
>     SETNOD>set nod 2.298 name REACH
>     SETNOD>ins
>     SETNOD>
>
> It works perfectly because, of course, it does…
>
> So, as usual, Johnny's guess is pretty close to the mark, even if he 
> isn't a 36 bit'er.  "Slightly broken"?  Yeah, 'slightly' enough so 
> that it can't be easily reproduced…
>
> The only thing I can think of is that the system had been up over 15 
> weeks when I saw this.  I had looked at the storage space utilization 
> with SYSDPY and didn't notice anything maxing out.  I restarted the 
> GETNOD batch job on VENTI2::.  Maybe in another 15 weeks, it will 
> break again.
>
> /Annoyed/…
>
>> ------------------------------------------------------------------------
>> On 5/4/21 10:31 PM, Thomas DeBellis wrote:
>>
>> Personally, I don't see how it could /possibly/ be anything to do 
>> with the REACH:: node definition, but I have been known to 
>> occasionally overlook the utterly obvious, particularly when it's 
>> near night-night.  Maybe not this time.
>>
>> Right now, the way to figure it out is to get the minor error data 
>> and see where that takes things.  So I'm making a change to JNTMAN to 
>> have .NDINT return the lower level code on an incomplete insert. 
>> SCLINK appears to have a problem in that it is mangling return values, 
>> which I'm currently investigating.
>>
>> You can't just blithely assume somebody got it wrong and 'fix' 
>> things; sometimes it's a certain way for a reason.
>>
>> On 5/4/21 8:46 PM, Johnny Billquist wrote:
>>> On 2021-05-05 00:54, Mike Kostersitz wrote:
>>>> Ouch, that is one of my nodes 😊 @Johnny Billquist 
>>>> <mailto:bqt at softjar.se>, is there anything you can think of, since 
>>>> we just renamed my old RSX11M node to REACH?
>>>
>>> Well. It is something slightly broken in Tops-20, so there isn't 
>>> really anything we can do about it.
>>>
>>> Except hope that Thomas can figure it out and fix it.
>>>
>>>  Johnny
>>>
>>>>
>>>> Mike
>>>>
>>>> *From: *Thomas DeBellis <mailto:tommytimesharing at gmail.com>
>>>> *Sent: *Tuesday, May 4, 2021 15:16
>>>> *To: *HECnet <mailto:hecnet at update.uu.se>
>>>> *Subject: *[HECnet] Re: Tops-20 SETNOD Failure
>>>>
>>>> I fixed a few things in SETNOD to get some more information about 
>>>> the error.  In particular,
>>>>
>>>>   * Allow listing of AREA 1 (this was specifically disallowed, I don't
>>>>     know why)
>>>>   * More consistent error reporting (via ESOUT%)
>>>>   * List more than one node when doing an area list (it would only
>>>>     list a single node)
>>>>   * List nodes with more than three digits in the node number when
>>>>     doing columnar output
>>>>
>>>> So now you get the expected results:
>>>>
>>>>     SETNOD>lis a 1
>>>>     [Area 1]
>>>>     A1RTR   1023    ATHENA   620    ATLE     605    AURORA   606    BANAI    770
>>>>     BANX25   771    BEA       19    BIZET    800    BJARNE     7    BLINKY   266
>>>>     CATWZL   302    CLYDE    269    COOPER   263    CRISPS   201    CYGNUS   259
>>>>     DAVROS   254    DBIT     351    DE1RSX   450    DE1RSY   452    DOCTOR   252
>>>>     ELIN     616    ELMER    617    ERNIE      2    ERSATZ   350    FLETCH   100
>>>>     FNATTE     3    FREJ     608    GAXP     730    GNAT      16    GNOME      6
>>>>     GOBLIN     4    GVAX     731    HAGMAN   262    HARPER   261    HORSE    150
>>>>     HUGIN    602    HYUNA    500    INKY     268    JIMIN    501    JOCKE     21
>>>>     JOSSE     17    KLIO     451    KRILLE     8    LOKE     607    MACARO   303
>>>>     MACRA    258    MAGICA     1    MASTER   251    MIM       13    MUNIN    603
>>>>     NIPPER   202    NOMAD    610    NOXBIT   720    ORACLE   301    PACMAN   265
>>>>     PAI      541    PALLAS   621    PAMINA    18    PIDP11   560    PINKY    267
>>>>     PISTON   520    PLINTH   200    PMAVS2   510    PONDUS    15    PONY      12
>>>>     PUFF      22    QEMUNT   151    REI      540    ROCKY     11    ROJIN    542
>>>>     RSX124   306    RSX145   304    RSX170   305    RSX184   307    RUTAN    255
>>>>     SHARPE   260    SIDRAT   253    SIGGE     10    SPEEDY    24    TARDIS   250
>>>>     TEMPO      9    THOROS   257    TINA      14    TIPSY    604    TONGUE   264
>>>>     TOPSY    601    VALAR    400    VAROS    256    WXP       20    WXP2      23
>>>>     YMER     609    ZEKE       5
>>>>     Total nodes in area 1: 92
>>>>     SETNOD>exit
>>>>
>>>> Regarding the error, I have reproduced it with a single entry, viz:
>>>>
>>>>     !setnod
>>>>     SETNOD>_set nod 2.298 name REACH_
>>>>     SETNOD>_insert_
>>>>     ?SETNOD: Failed at node REACH (2.298), Item 0 of 1
>>>>     SETNOD>
>>>>
>>>> The high level code to do the entry is in JNTMAN.  It loops through 
>>>> the table passed to it via .NDINT, calling a lower level routine 
>>>> called SCTAND in SCLINK.  An error here is passed up to JNTMAN, but 
>>>> it is not passed back to the user. There are some other problems in 
>>>> SCLINK pertaining to negative return values, so some minor work is 
>>>> necessary there, also.
>>>>
>>>> I'll make some changes to these two modules, generate a new monitor 
>>>> for VENTI2 and see what happens in a few days.
>>>>
>>>> Right now, if any Tops-20 user is using SETNOD to update DECnet 
>>>> tables, the insert appears to fail.  If anybody else is seeing it or 
>>>> can reproduce it, I'd like to hear about it.
>>>>
>>>>     On 5/4/21 11:15 AM, Thomas DeBellis wrote:
>>>>
>>>>     Has anybody ever seen SETNOD fail to insert the entire node list?
>>>>     I just did.
>>>>
>>>>     Shortly after I put my 20's up on HECnet, I wrote a recurring
>>>>     batch job that fires once a week on Sundays to pull the latest
>>>>     node list (T20.FIX) from MIM::.  I use the highly venerable FILCOM
>>>>     program to do a difference of it with the previous week's list.  I
>>>>     don't do anything in particular with the output except save it in
>>>>     case I feel like looking at it for some reason.
>>>>
>>>>     The batch job always inserts the entire list, rewriting whatever
>>>>     might be in the monitor's data base.  I have always been
>>>>     unsatisfied with doing things that way because it seemed to me to
>>>>     be inefficient as the node list grew.  The HECnet node list count
>>>>     was 716 on 9-Jun-19 and it's now up to 884 as of the latest version
>>>>     that I've pulled, 30-Apr-21.  The other problem is the microscopic
>>>>     possibility that a node is in Tops-20's monitor database (a hash
>>>>     table) that isn't in the HECnet node list.
>>>>
>>>>     Nodes can get removed, although I think that is infrequent.  Nodes
>>>>     could get inserted outside of the batch job, but I think that most
>>>>     unlikely in my situation.  Nodes can get renamed, as evidenced by
>>>>     2.299 below, which went from THEPIT to THEARK.  None of this
>>>>     should break anything, and none of it ever has.
>>>>
>>>>     However, it's been in the back of my mind to do two enhancements,
>>>>     one to Tops-20 and one to SETNOD.  The NODE% JSYS should have an
>>>>     additional feature to return the current monitor data base.  The
>>>>     SETNOD program should be enhanced to take that and compute the set
>>>>     difference with the new list.  This would show additions, renames
>>>>     and deletions.  That would bring the update operation down from
>>>>     several hundred items to fewer than ten, on average.  This would
>>>>     obviously make more of a difference on huge DECnets in the tens of
>>>>     thousands of nodes.  Another NODE% feature should probably be to
>>>>     whack the entire monitor database except for the local node, which
>>>>     would be useful for troubleshooting.
>>>>
>>>>     Last Sunday, the batch job failed with the following error:
>>>>
>>>>     18:33:40 USER   SETNOD>*TAKE SYSTEM:NODE-DATA.TXT.0
>>>>     18:33:40 USER
>>>>     18:33:40 USER   [Fork SETNOD opening <SYSTEM>NODE-DATA.TXT.1 for reading]
>>>>     18:33:41 USER   SETNOD>*SAVE
>>>>     18:33:41 USER
>>>>     18:33:41 USER   [Fork SETNOD opening <SYSTEM>NODE-DATA.BIN.74 for reading, writing]
>>>>     18:33:41 USER   SETNOD>*INSERT
>>>>     18:33:41 USER
>>>>     18:33:41 USER *?SETNOD: Failed at node REACH*
>>>>     18:33:41 USER   SETNOD>
>>>>
>>>>     I had a look at the SETNOD source and the HECnet node list and
>>>>     have concluded a few things.  First, there doesn't seem to be
>>>>     anything syntactically wrong with REACH::'s definition:
>>>>     "set nod 2.298 name REACH".  Second, there don't appear to be any
>>>>     semantic issues.  2.298 wasn't in use and it shouldn't matter if
>>>>     it was.
>>>>
>>>>     In the case of INSERT, there are two kinds of errors from NODE%: a
>>>>     general failure of the JSYS and an incomplete insertion.  The
>>>>     error is from the second case.  Unfortunately, SETNOD isn't
>>>>     reporting enough information about the error, so I have to make
>>>>     some changes there.  It's also possible that SETNOD is building an
>>>>     inconsistent database for the monitor to swallow; at least the LIST
>>>>     command is giving me some odd results, viz:
>>>>
>>>>         SETNOD>list arEA 2
>>>>
>>>>         [AREA 2]
>>>>         A2RTR
>>>>
>>>>         TOTAL NODES FOUND: 1
>>>>
>>>>         SETNOD>
>>>>
>>>>     That's clearly wrong, viz:
>>>>
>>>>         !i dec
>>>>           Local DECNET node: VENTI2.  Nodes reachable: 7.
>>>>           Accessible DECNET nodes are:    A2RTR    BOINGO    LEGATO
>>>>           TOMMYT    VENTI2    VENTI    ZITI
>>>>
>>>>     The Exec output should probably be changed to say, "Nodes
>>>>     reachable in local area" and "Online nodes in area are:"
>>>>
>>>>     Anybody have any ideas?  Hunches?  Clues?
>>>>
>>>> File 1) OLDF:[4,120]    created: 1241 15-Apr-21
>>>> File 2) NEWF:[1,1]      created: 0102 30-Apr-21
>>>>
>>>> 1)1     set nod 44.9 name OSMIUM
>>>> ****
>>>> 2)1     set nod 2.292 name OSIRIS
>>>> 2)      set nod 44.9 name OSMIUM
>>>> **************
>>>> 1)1     set nod 13.3 name RED
>>>> ****
>>>> 2)1 *set nod 2.298 name REACH *
>>>> 2)      set nod 13.3 name RED
>>>> **************
>>>> 1)1     set nod 2.298 name RSX11M
>>>> 1)      set nod 1.306 name RSX124
>>>> ****
>>>> 2)1     set nod 1.306 name RSX124
>>>> **************
>>>> 1)1     set nod 42.5 name SPARKY
>>>> ****
>>>> 2)1     set nod 2.291 name SPARK
>>>> 2)      set nod 42.5 name SPARKY
>>>> **************
>>>> 1)1     set nod 2.299 name THEPIT
>>>> 1)      set nod 35.70 name THOMAS
>>>> ****
>>>> 2)1     set nod 2.299 name THEARK
>>>> 2)      set nod 35.70 name THOMAS
>>>> **************
>>>>
>>>
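
The set-difference idea proposed in the quoted message (send the monitor only additions, renames and deletions rather than the whole list) can be sketched off-line in Python.  This is an illustration only and not how SETNOD or NODE% work: the function names are invented, and keying each entry by its area.node address is an assumption about what counts as "the same node".  The only input format it relies on is the "set nod A.N name NAME" lines shown in the FILCOM output above.

```python
import re

# Matches lines such as "set nod 2.298 name REACH" (format taken from the
# node-list excerpts in this thread).
SET_NOD_RE = re.compile(
    r"^\s*set\s+nod\s+(\d+)\.(\d+)\s+name\s+(\S+)\s*$", re.IGNORECASE
)


def parse_node_list(text):
    """Map (area, node) -> NAME for every 'set nod' line in text."""
    nodes = {}
    for line in text.splitlines():
        m = SET_NOD_RE.match(line)
        if m:
            nodes[(int(m.group(1)), int(m.group(2)))] = m.group(3).upper()
    return nodes


def diff_node_lists(old, new):
    """Return (added, renamed, deleted), each keyed by (area, node)."""
    added = {a: n for a, n in new.items() if a not in old}
    deleted = {a: n for a, n in old.items() if a not in new}
    renamed = {
        a: (old[a], new[a])
        for a in old.keys() & new.keys()
        if old[a] != new[a]
    }
    return added, renamed, deleted


if __name__ == "__main__":
    old = parse_node_list("set nod 2.298 name RSX11M\nset nod 2.299 name THEPIT")
    new = parse_node_list("set nod 2.298 name REACH\nset nod 2.299 name THEARK\n"
                          "set nod 2.292 name OSIRIS")
    print(diff_node_lists(old, new))
```

Fed the two lists behind the FILCOM output above, a sketch like this would report 2.291 and 2.292 as additions and 2.298 and 2.299 as renames, which is the handful of changes the proposed enhancement would hand to the monitor.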