[HECnet] Tops-20 SETNOD Failure
Johnny Billquist
bqt at softjar.se
Wed May 5 14:03:21 PDT 2021
It might not have anything in particular to do with the node number
(2.298); it could just be that it is the nth entry being poked at. Or
possibly it is because it redefines an existing node.
Try clearing the node out and then defining it?
Adding a dummy extra entry in the file before this one is another idea.
Johnny
On 2021-05-05 22:05, Thomas DeBellis wrote:
> I got annoyed at the thought of having to wait a few more months for the
> error condition to show up and, instead of having the batch job run more
> frequently (and thus beating on poor MIM::), I wrote another batch job
> which took every single file that I have /ever/ downloaded from MIM::
> and inserted it. So that's 75 files and it failed on number 54.
>
> 15:36:51 USER SETNOD>**T OLDS:NODE-DATA.TXT.54*
> 15:36:52 USER
> 15:36:52 USER SETNOD>**List Total*
> 15:36:52 USER
> 15:36:52 USER
> 15:36:52 USER TOTAL NODES FOUND: 869
> 15:36:52 USER
> 15:36:52 USER SETNOD>**Insert*
> 15:36:52 USER
> 15:36:52 USER ?SETNOD: Failed at node RSX11M (2.298), Item 650 of 869, Error: _-11_
> 15:36:52 USER SETNOD>
>
> It is interesting that it is failing on node 2.298, but this is before
> that number had been reassigned to REACH::. The negative 11 error
> returns means "Component in Wrong State" (aka NF.CWS), which I didn't
> find immediately informative. However, now I've got something to look
> around for.
>
> I still can't imagine why there would be anything particularly
> diabolical about the number 2.298.
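For anyone collecting these failures across many batch logs, the extended message above is regular enough to pick apart mechanically. What follows is only a minimal sketch in Python, not anything SETNOD or the monitor does: the function name `parse_failure` and the `KNOWN_ERRORS` table are made up for illustration, and the table holds only the one code identified in this thread.

```python
import re

# Matches the extended failure message quoted in this thread. The error part
# is optional because the earlier SETNOD version did not print a code.
FAILURE_RE = re.compile(
    r"\?SETNOD: Failed at node (?P<name>\S+) "
    r"\((?P<area>\d+)\.(?P<node>\d+)\), "
    r"Item (?P<item>\d+) of (?P<total>\d+)"
    r"(?:,\s*Error:\s*_?(?P<error>-?\d+)_?)?"
)

# Hypothetical lookup table: only the code named in the thread is listed.
KNOWN_ERRORS = {-11: "Component in Wrong State (NF.CWS)"}

# Sample taken from the log above, without the mail quoting prefix.
SAMPLE = (
    "15:36:52 USER ?SETNOD: Failed at node RSX11M (2.298), Item 650 of 869,\n"
    "Error: -11\n"
)


def parse_failure(text):
    """Return a dict describing the first SETNOD failure in text, or None."""
    # Batch logs wrap long lines, so collapse all whitespace before matching.
    flat = " ".join(text.split())
    m = FAILURE_RE.search(flat)
    if m is None:
        return None
    error = int(m.group("error")) if m.group("error") is not None else None
    return {
        "name": m.group("name"),
        "area": int(m.group("area")),
        "node": int(m.group("node")),
        "item": int(m.group("item")),
        "total": int(m.group("total")),
        "error": error,
        "meaning": KNOWN_ERRORS.get(error),
    }


if __name__ == "__main__":
    print(parse_failure(SAMPLE))
```

Run over all 75 logs, this would show at a glance whether the failing item number or the failing address is the constant.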
>> ------------------------------------------------------------------------
>>
>> On 5/5/21 12:38 AM, Thomas DeBellis wrote:
>>
>> I finished the modifications to SCLINK to properly return error values
>> which are negative and JNTMAN to return the error value in AC3 if
>> .NDINT doesn't succeed inserting all the nodes. Then I modified
>> SETNOD to get this extended error information and print it. I put the
>> new monitor and SETNOD up, rebooted *…AND*…
>>
>> SETNOD>set nod 2.298 name REACH
>> SETNOD>ins
>> SETNOD>
>>
>> It works perfectly because, of course, it does…
>>
>> So, as usual, Johnny's guess is pretty close to the mark, even if he
>> isn't a 36 bit'er. "Slightly broken"? Yeah, 'slightly' enough so
>> that it can't be easily reproduced…
>>
>> The only thing I can think of is that the system had been up over 15
>> weeks when I saw this. I had looked at the storage space utilization
>> with SYSDPY and didn't notice anything maxing out. I restarted the
>> GETNOD batch job on VENTI2::. Maybe in another 15 weeks, it will
>> break again.
>>
>> /Annoyed/…
>>
>>> ------------------------------------------------------------------------
>>> On 5/4/21 10:31 PM, Thomas DeBellis wrote:
>>>
>>> Personally, I don't see how it could /possibly/ be anything to do
>>> with the REACH:: node definition, but I have been known to
>>> occasionally overlook the utterly obvious, particularly when it's
>>> near night-night. Maybe not this time.
>>>
>>> Right now, the way to figure it out is to get the minor error data
>>> and see where that takes things. So I'm making a change to JNTMAN to
>>> have .NDINT return the lower-level code on an incomplete insert.
>>> SCLINK appears to have a problem that it is mangling return values,
>>> which I'm currently investigating.
>>>
>>> You can't just blithely assume somebody got it wrong and 'fix'
>>> things; sometimes it's a certain way for a reason.
>>>
>>> On 5/4/21 8:46 PM, Johnny Billquist wrote:
>>>> On 2021-05-05 00:54, Mike Kostersitz wrote:
>>>>> Ouch that is one of my nodes 😊 @Johnny Billquist
>>>>> <mailto:bqt at softjar.se> anything you could think of since we just
>>>>> renamed my old RSX11M node to REACH.
>>>>
>>>> Well. It is something slightly broken in Tops-20, so there isn't
>>>> really anything we can do about it.
>>>>
>>>> Except hope that Thomas can figure it out and fix it.
>>>>
>>>> Johnny
>>>>
>>>>>
>>>>> Mike
>>>>>
>>>>> *From: *Thomas DeBellis <mailto:tommytimesharing at gmail.com>
>>>>> *Sent: *Tuesday, May 4, 2021 15:16
>>>>> *To: *HECnet <mailto:hecnet at update.uu.se>
>>>>> *Subject: *[HECnet] Re: Tops-20 SETNOD Failure
>>>>>
>>>>> I fixed a few things in SETNOD to get some more information about
>>>>> the error. In particular,
>>>>>
>>>>> * Allow listing of AREA 1 (this was specifically disallowed, I don't
>>>>>   know why)
>>>>> * More consistent error reporting (via ESOUT%)
>>>>> * List more than one node when doing an area list (it would only
>>>>>   list a single node)
>>>>> * List nodes with more than three digits in the node number when
>>>>>   doing columnar output
>>>>>
>>>>> So now you get the expected results:
>>>>>
>>>>> SETNOD>lis a 1
>>>>> [Area 1]
>>>>> A1RTR  1023  ATHENA  620  ATLE    605  AURORA  606  BANAI   770
>>>>> BANX25  771  BEA      19  BIZET   800  BJARNE    7  BLINKY  266
>>>>> CATWZL  302  CLYDE   269  COOPER  263  CRISPS  201  CYGNUS  259
>>>>> DAVROS  254  DBIT    351  DE1RSX  450  DE1RSY  452  DOCTOR  252
>>>>> ELIN    616  ELMER   617  ERNIE     2  ERSATZ  350  FLETCH  100
>>>>> FNATTE    3  FREJ    608  GAXP    730  GNAT     16  GNOME     6
>>>>> GOBLIN    4  GVAX    731  HAGMAN  262  HARPER  261  HORSE   150
>>>>> HUGIN   602  HYUNA   500  INKY    268  JIMIN   501  JOCKE    21
>>>>> JOSSE    17  KLIO    451  KRILLE    8  LOKE    607  MACARO  303
>>>>> MACRA   258  MAGICA    1  MASTER  251  MIM      13  MUNIN   603
>>>>> NIPPER  202  NOMAD   610  NOXBIT  720  ORACLE  301  PACMAN  265
>>>>> PAI     541  PALLAS  621  PAMINA   18  PIDP11  560  PINKY   267
>>>>> PISTON  520  PLINTH  200  PMAVS2  510  PONDUS   15  PONY     12
>>>>> PUFF     22  QEMUNT  151  REI     540  ROCKY    11  ROJIN   542
>>>>> RSX124  306  RSX145  304  RSX170  305  RSX184  307  RUTAN   255
>>>>> SHARPE  260  SIDRAT  253  SIGGE    10  SPEEDY   24  TARDIS  250
>>>>> TEMPO     9  THOROS  257  TINA     14  TIPSY   604  TONGUE  264
>>>>> TOPSY   601  VALAR   400  VAROS   256  WXP      20  WXP2     23
>>>>> YMER    609  ZEKE      5
>>>>> Total nodes in area 1: 92
>>>>> SETNOD>exit
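The columnar fix is essentially a field-width question: a column sized for three digits breaks as soon as a node like A1RTR (1023) shows up. A rough sketch of one way to avoid that, in Python rather than the MACRO that SETNOD is actually written in, is to size the columns from the data. The function name `format_columns` and the five-per-row default are assumptions made for illustration, not SETNOD's real logic.

```python
def format_columns(nodes, per_row=5):
    """Format (name, number) pairs in fixed-width columns, per_row per line."""
    if not nodes:
        return ""
    # Size each field from the data so a four-digit node number still fits.
    name_w = max(len(name) for name, _ in nodes)
    num_w = max(len(str(num)) for _, num in nodes)
    cells = [f"{name:<{name_w}} {num:>{num_w}}" for name, num in nodes]
    rows = [cells[i:i + per_row] for i in range(0, len(cells), per_row)]
    return "\n".join("  ".join(row) for row in rows)


if __name__ == "__main__":
    sample = [("A1RTR", 1023), ("ATHENA", 620), ("BEA", 19), ("ERNIE", 2)]
    print(format_columns(sample, per_row=2))
```

Computing the widths costs one extra pass over the list, which is nothing at a few hundred nodes.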
>>>>>
>>>>> Regarding the error, I have reproduced it with a single entry, viz:
>>>>>
>>>>> !setnod
>>>>> SETNOD>_set nod 2.298 name REACH_
>>>>> SETNOD>_insert_
>>>>> ?SETNOD: Failed at node REACH (2.298), Item 0 of 1
>>>>> SETNOD>
>>>>>
>>>>> The high level code to do the entry is in JNTMAN. It loops through
>>>>> the table passed to it via .NDINT, calling a lower level routine
>>>>> called SCTAND in SCLINK. An error here is passed up to JNTMAN, but
>>>>> it is not passed back to the user. There are some other problems in
>>>>> SCLINK pertaining to negative return values, so some minor work is
>>>>> necessary there, also.
>>>>>
>>>>> I'll make some changes to these two modules, generate a new monitor
>>>>> for VENTI2 and see what happens in a few days.
>>>>>
>>>>> Right now, if any Tops-20 user is using SETNOD to update DECnet
>>>>> tables, this appears to fail. If anybody else is seeing it or can
>>>>> reproduce it, I'd like to hear about it.
>>>>>
>>>>> On 5/4/21 11:15 AM, Thomas DeBellis wrote:
>>>>>
>>>>> Has anybody ever seen SETNOD fail to insert the entire node
>>>>> list? I just did.
>>>>>
>>>>> Shortly after I put my 20's up on HECnet, I wrote a recurring
>>>>> batch job that fires once a week on Sundays to pull the latest
>>>>> node list (T20.FIX) from MIM::. I use the highly venerable FILCOM
>>>>> program to do a difference of it with the previous week's list. I
>>>>> don't do anything in particular with the output except save it in
>>>>> case I feel like looking at it for some reason.
>>>>>
>>>>> The batch job always inserts the entire list, rewriting whatever
>>>>> might be in the monitor's data base. I have always been
>>>>> unsatisfied with doing things that way because it seemed to me to
>>>>> be inefficient as the node list grew. The HECnet node list count
>>>>> was 716 on 9-Jun-19 and it's now up to 884 as of the latest
>>>>> version that I've pulled, 30-Apr-21. The other problem is the
>>>>> microscopic possibility that a node is in Tops-20's monitor
>>>>> database (a hash table) that isn't in the HECnet node list.
>>>>>
>>>>> Nodes can get removed, although I think that is infrequent. Nodes
>>>>> could get inserted outside of the batch job, but I think that most
>>>>> unlikely in my situation. Nodes can get renamed, as evidenced by
>>>>> 2.299 below, which went from THEPIT to THEARK. None of this
>>>>> should break, or has broken, anything.
>>>>>
>>>>> However, it's been in the back of my mind to do two enhancements,
>>>>> one to Tops-20 and one to SETNOD. The NODE% JSYS should have an
>>>>> additional feature to return the current monitor data base. The
>>>>> SETNOD program should be enhanced to take that to compute the set
>>>>> difference with the new list. This would show additions, renames
>>>>> and deletions. That would bring the update operation down from
>>>>> several hundred items to less than ten, on average. This would
>>>>> obviously make more of a difference on huge DECnets in the tens
>>>>> of thousands of nodes. Another NODE% feature should probably be to
>>>>> whack the entire monitor database except for the local node, which
>>>>> would be useful for troubleshooting.
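The set-difference idea can be tried out offline, before any NODE% change exists, by comparing two node list files. The following is only a sketch in Python under stated assumptions: the previous week's list stands in for the monitor database, entries are keyed by (area, node) address, and the names `parse_node_list` and `diff_node_lists` are invented for illustration. The sample lines are the ones from the FILCOM output at the end of the quoted message.

```python
import re

# One "set nod <area>.<node> name <NAME>" definition per line, as in T20.FIX.
SET_NOD_RE = re.compile(
    r"^\s*set\s+nod\w*\s+(\d+)\.(\d+)\s+name\s+(\S+)\s*$", re.IGNORECASE
)


def parse_node_list(text):
    """Return {(area, node): NAME} for every definition line in text."""
    nodes = {}
    for line in text.splitlines():
        m = SET_NOD_RE.match(line)
        if m:
            nodes[(int(m.group(1)), int(m.group(2)))] = m.group(3).upper()
    return nodes


def diff_node_lists(old, new):
    """Return (added, removed, renamed), each keyed by (area, node)."""
    added = {a: new[a] for a in new.keys() - old.keys()}
    removed = {a: old[a] for a in old.keys() - new.keys()}
    renamed = {
        a: (old[a], new[a])
        for a in old.keys() & new.keys()
        if old[a] != new[a]
    }
    return added, removed, renamed


# Lines taken from the FILCOM comparison of the 15-Apr-21 and 30-Apr-21 lists.
OLD = """set nod 44.9 name OSMIUM
set nod 13.3 name RED
set nod 2.298 name RSX11M
set nod 1.306 name RSX124
set nod 42.5 name SPARKY
set nod 2.299 name THEPIT
set nod 35.70 name THOMAS
"""
NEW = """set nod 2.292 name OSIRIS
set nod 44.9 name OSMIUM
set nod 2.298 name REACH
set nod 13.3 name RED
set nod 1.306 name RSX124
set nod 2.291 name SPARK
set nod 42.5 name SPARKY
set nod 2.299 name THEARK
set nod 35.70 name THOMAS
"""

if __name__ == "__main__":
    added, removed, renamed = diff_node_lists(
        parse_node_list(OLD), parse_node_list(NEW)
    )
    print("added:  ", sorted(added.items()))
    print("removed:", sorted(removed.items()))
    print("renamed:", sorted(renamed.items()))
```

Keying by address rather than by name is what makes a rename show up as a rename instead of a deletion plus an addition. Whether the monitor's hash table would be walked in that shape is a separate question.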
>>>>>
>>>>> Last Sunday, the batch job failed with the following error:
>>>>>
>>>>> 18:33:40 USER SETNOD>*TAKE SYSTEM:NODE-DATA.TXT.0
>>>>> 18:33:40 USER
>>>>> 18:33:40 USER [Fork SETNOD opening <SYSTEM>NODE-DATA.TXT.1 for reading]
>>>>> 18:33:41 USER SETNOD>*SAVE
>>>>> 18:33:41 USER
>>>>> 18:33:41 USER [Fork SETNOD opening <SYSTEM>NODE-DATA.BIN.74 for reading, writing]
>>>>> 18:33:41 USER SETNOD>*INSERT
>>>>> 18:33:41 USER
>>>>> 18:33:41 USER *?SETNOD: Failed at node REACH*
>>>>> 18:33:41 USER SETNOD>
>>>>>
>>>>> I had a look at the SETNOD source and the HECnet node list and
>>>>> have discovered and concluded a few things. First, there doesn't
>>>>> seem to be anything syntactically wrong with REACH::'s definition:
>>>>> "set nod 2.298 name REACH". Second, there don't appear to be any
>>>>> semantic issues. 2.298 wasn't in use and it shouldn't matter if it
>>>>> was.
>>>>>
>>>>> In the case of INSERT, there are two kinds of errors from NODE%, a
>>>>> general failure of the JSYS and an incomplete insertion. The
>>>>> error is from the second case. Unfortunately, SETNOD isn't
>>>>> reporting enough information about the error, so I have to make
>>>>> some changes there. It's also possible that SETNOD is building an
>>>>> inconsistent database for the monitor to swallow; at least the
>>>>> LIST command is giving me some odd results, viz:
>>>>>
>>>>> SETNOD>list arEA 2
>>>>>
>>>>> [AREA 2]
>>>>> A2RTR
>>>>>
>>>>> TOTAL NODES FOUND: 1
>>>>>
>>>>> SETNOD>
>>>>>
>>>>> That's clearly wrong, viz:
>>>>>
>>>>> !i dec
>>>>> Local DECNET node: VENTI2. Nodes reachable: 7.
>>>>> Accessible DECNET nodes are: A2RTR BOINGO LEGATO
>>>>> TOMMYT VENTI2 VENTI ZITI
>>>>>
>>>>> The Exec output should probably be changed to say, "Nodes
>>>>> reachable in local area" and "Online nodes in area are:"
>>>>>
>>>>> Anybody have any ideas? Hunches? Clues?
>>>>>
>>>>> File 1) OLDF:[4,120] created: 1241 15-Apr-21
>>>>> File 2) NEWF:[1,1] created: 0102 30-Apr-21
>>>>>
>>>>> 1)1 set nod 44.9 name OSMIUM
>>>>> ****
>>>>> 2)1 set nod 2.292 name OSIRIS
>>>>> 2) set nod 44.9 name OSMIUM
>>>>> **************
>>>>> 1)1 set nod 13.3 name RED
>>>>> ****
>>>>> 2)1 *set nod 2.298 name REACH *
>>>>> 2) set nod 13.3 name RED
>>>>> **************
>>>>> 1)1 set nod 2.298 name RSX11M
>>>>> 1) set nod 1.306 name RSX124
>>>>> ****
>>>>> 2)1 set nod 1.306 name RSX124
>>>>> **************
>>>>> 1)1 set nod 42.5 name SPARKY
>>>>> ****
>>>>> 2)1 set nod 2.291 name SPARK
>>>>> 2) set nod 42.5 name SPARKY
>>>>> **************
>>>>> 1)1 set nod 2.299 name THEPIT
>>>>> 1) set nod 35.70 name THOMAS
>>>>> ****
>>>>> 2)1 set nod 2.299 name THEARK
>>>>> 2) set nod 35.70 name THOMAS
>>>>> **************
>>>>>
>>>>
--
Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: bqt at softjar.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol