[HECnet] KLH10 halting at random

Thomas DeBellis tommytimesharing at gmail.com
Thu Sep 3 16:39:24 PDT 2020


Shortly after sending this, I wedged my development machine by 
mistakenly beating on the file system; this time by running SPEAR to 
pull out events around the DTEKPA BUGCHK.  There was too much activity 
(I have a very large ERROR.SYS, thanks to DECnet) and I got a DTEKPA.  
Once this happens, the machine hangs shortly afterwards.  This finally 
caused me to have a look at DTESRV.

KPALIV is a variable that is incremented by Tops-20 in a number of 
circumstances by SCHED, APRSRV and (oddly) CFSSRV.  It's a keep alive 
counter that both the front end and Tops-20 pay attention to.  An 
examination of the live monitor shows that it is monotonically increasing:

    1,,COMBAS+5[   417,,424521
    1,,COMBAS+5[   417,,426524
    1,,COMBAS+5[   417,,510532
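
As a quick check of the "monotonically increasing" claim, the three DDT lines above can be decoded into 36-bit values and compared. This is only a sketch: it assumes the text after the "[" is the usual left,,right pair of 18-bit octal halfwords, and the helper name ddt_line_to_word is made up for illustration.

```python
def ddt_line_to_word(line):
    """Decode the 'left,,right' octal halfwords that follow '[' on a DDT line.

    Assumes two 18-bit halfwords displayed in octal; returns a 36-bit integer.
    """
    contents = line.split("[", 1)[1].strip()
    left, right = contents.split(",,")
    return (int(left, 8) << 18) | int(right, 8)


samples = [
    "1,,COMBAS+5[   417,,424521",
    "1,,COMBAS+5[   417,,426524",
    "1,,COMBAS+5[   417,,510532",
]

values = [ddt_line_to_word(s) for s in samples]
# Difference between each successive pair of samples.
deltas = [later - earlier for earlier, later in zip(values, values[1:])]
# The counter is strictly increasing if every difference is positive.
is_increasing = all(d > 0 for d in deltas)
```

The left halfword is the same in all three samples, so only the right halfword moved between examinations.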

It is updated approximately every 500 milliseconds; let's call that a 
keep-alive tick.  If it isn't updated in two ticks, the front end is 
declared down and reload action is initiated.  A number of things are 
done and it appears that KLH10 is not properly handling them.  Since the 
KLH10 DTE service is not running in a separate process (there are 
vestigial hooks to do this), it does not handle a ten-triggered reload.

Tops-20 waits for the reload to complete, KLH10 does nothing and you're 
hung.

Fortunately, there is some code for the master DTE which checks a 
variable called FEDBSW, Front End Debugging Switch.  If this is 
non-zero, then the keep-alive count is incremented, but it's never 
checked.  So I set it to -1 (it was zero) and then proceeded to beat on 
the file system with wild abandon.
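
To make the logic concrete, here is a toy model of a tick-based keep-alive check with a debugging switch that bypasses it. It is a sketch of the behavior described above, not the DTESRV code: the class name, the method name, and the way the counter is sampled are all assumptions for illustration. Only the two-tick limit and the "non-zero switch means never check" rule come from the description.

```python
class KeepAliveWatcher:
    """Toy model of a tick-based keep-alive check (not the real DTESRV logic)."""

    def __init__(self, max_missed_ticks=2, debug_switch=0):
        self.max_missed_ticks = max_missed_ticks
        self.debug_switch = debug_switch  # stands in for FEDBSW
        self.last_count = None
        self.missed = 0

    def tick(self, count):
        """Sample the counter once per tick.

        Returns True when the peer would be declared down.
        """
        if self.debug_switch != 0:
            # Debugging switch set: keep tracking the count, never check it.
            self.last_count = count
            return False
        if count == self.last_count:
            self.missed += 1
        else:
            self.missed = 0
            self.last_count = count
        return self.missed >= self.max_missed_ticks


# Counter stalls at 2 for two consecutive ticks.
stalled = [1, 2, 2, 2]
normal = KeepAliveWatcher()
normal_results = [normal.tick(c) for c in stalled]

# Same stall, but with the switch set to -1 as in the workaround.
bypassed = KeepAliveWatcher(debug_switch=-1)
bypassed_results = [bypassed.tick(c) for c in stalled]
```

With the switch at zero the stall trips the check on the second missed tick; with it non-zero the same stall is ignored.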

For periods of intense disk activity, the machine appeared to hang.  
After about 10 to 20 seconds, it came right back as if nothing had ever 
happened.  Interesting...

Right now, my working assumption is that the PI system is getting 
saturated so that the clock interrupt somehow isn't making it through.  
For now, I'm thinking of rewriting the service routine so that instead 
of checking for two ticks, it checks elapsed time which can then be set 
to some 'reasonable' value.
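
A rough sketch of what an elapsed-time check could look like, as opposed to counting ticks. Nothing here is from the monitor sources: the class, the millisecond timestamps passed in by the caller, and the 30 second timeout are hypothetical, with the timeout picked only because it is longer than the 10 to 20 second stalls seen above.

```python
class ElapsedTimeWatcher:
    """Declare the peer down only after the counter has been stuck for timeout_ms."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_count = None
        self.last_change_ms = None

    def check(self, count, now_ms):
        """Return True if the counter has not changed for longer than timeout_ms."""
        if count != self.last_count:
            # Counter moved: remember the new value and when we saw it.
            self.last_count = count
            self.last_change_ms = now_ms
            return False
        return now_ms - self.last_change_ms > self.timeout_ms


watcher = ElapsedTimeWatcher(timeout_ms=30_000)  # hypothetical 'reasonable' value

results = [
    watcher.check(5, 0),        # first sample
    watcher.check(5, 1_000),    # stuck for 1 s (two 500 ms ticks)
    watcher.check(5, 20_000),   # stuck for 20 s, like the observed stalls
    watcher.check(5, 30_001),   # stuck for longer than the timeout
    watcher.check(6, 30_500),   # counter moves again
]
```

The point of the design is that a stall shorter than the timeout, such as the 10 to 20 second pauses under heavy disk activity, no longer triggers a reload, while a genuinely dead peer still does.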

If you think this may be what is hanging you, then you can try it.  For 
me, FEDBSW is at octal 1,,304544.  Thus far, I'm up 42:44:57 (1 Day, 18 
Hours, 44 Minutes, 57 Seconds and 615 Milliseconds).
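
For anyone comparing uptimes, the hours:minutes:seconds form converts to days as below. A small sketch; the function name is made up, and it assumes the hours field simply keeps counting past 24 as in the uptime displays quoted in this thread.

```python
def split_uptime(hhmmss):
    """Split an 'HH:MM:SS' uptime, where HH may exceed 24, into days, hours, minutes, seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds


mine = split_uptime("42:44:57")      # uptime with FEDBSW set
reported = split_uptime("52:10:47")  # uptime at the reported halt
```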

> ------------------------------------------------------------------------
>
> On 8/31/20 9:03 PM, Thomas DeBellis wrote:
>
> Do you know what program is displaying those three lines?
>
> I'm unaware of a PANDA distribution that didn't announce itself as a 
> PANDA distribution in the system banner.   The date and time display 
> is odd.  Tops-20 native time output has been Y2K compliant since 
> forever.  It's the Tops-10 programs (MACRO, CREF, Etc.), plus 
> Tops-10'ish programs (GLXLIB, Quasar, Etc.) that needed Y2K patches.
>
> Tops-20 DAP needed a small modification to handle Y2K and to not break 
> RSX.
>
> The Tops-10 system that I use has a number of non-Y2K times, which 
> surprised me.  While I have had the freedom to remediate, I simply 
> don't have the time.  But it's jarring.
>
> I also found it interesting that the banner says DEC10 Development; 
> 20's were sometimes called DEC20's, but never DEC10's (well, 1031 
> might have been an exception).
>
> I could have sworn you were showing us something off of a Tops-10 CTY...
>
>> ------------------------------------------------------------------------
>> On 8/31/20 7:13 PM, Supratim Sanyal wrote:
>>
>> I will keep digging - but it is possibly interesting this happens 
>> between approx 52 and an indeterminate number of hours of solid uptime
>>> ------------------------------------------------------------------------
>>>
>>> On Aug 31, 2020, at 5:00 PM, Thomas DeBellis 
>>> <tommytimesharing at gmail.com> wrote:
>>>
>>> If you are running a standard PANDA distribution, then DDT is in the 
>>> monitor and you may fail to it.  Did it come up?  Did you do an 
>>> examine from the KLH10 micro-engine to see what instruction it was 
>>> failing on? Did you see what module it is failing in?
>>>
>>> My monitor is modified from the base PANDA distribution to include 
>>> several local enhancements, so when I looked at that address, it 
>>> showed up as in the entry of CHKOPC, which is what is checking for 
>>> deferred closes on virtual circuits.  This is in PHYKLP which is the 
>>> KLIPA driver (a.k.a. the CI).  Since KLH10 (sadly) does not 
>>> implement the CI, there is no way you should be executing in that 
>>> module as there is nothing for it to talk to.
>>>
>>> Moreover, there is no JRST 4 there.  So probably you have something 
>>> else at that address.
>>>
>>> I have been running KLH10 for a /very/ long time; since late 
>>> December 2002 and have made modifications there, too, to fix an issue 
>>> with locking memory and to better support Linux (recent Ubuntu).  It 
>>> is remarkably robust; despite intensive development, I have stayed 
>>> up well over a year at a time (i.e., hit UP2LNG BUGHLT's).
>>>
>>> I have found one problem; if you are running it on an _extremely_ 
>>> fast machine with SSD storage (in other words, you're basically 
>>> never waiting for anything) and you seriously beat on the file 
>>> system, then the keep-alive counter can get out of sync with the 20 
>>> thinking the front end has died and the KLH10 DTE simulator 
>>> apparently not understanding what to do.
>>>
>>> The 20 typed an initial BUGCHK and then in the middle of the second 
>>> one, it hangs waiting for the front end.
>>>
>>> It's on my list of things to investigate.
>>>
>>>> ------------------------------------------------------------------------
>>>> On 8/31/20 4:15 PM, Supratim Sanyal wrote:
>>>>
>>>> hi all - my panda distribution instance is halting after a couple 
>>>> of days with the following message. is this a known problem for 
>>>> which there is some workaround?
>>>>
>>>> Monitor RF434E DEC10 Development
>>>> System uptime 52:10:47
>>>> Current date/time Wednesday 29-Jul-120 6:01:04
>>>>
>>>> [HALTED: Program Halt, PC = 22013]
>>>>
>>>> thanks
>>>>
>>>> Supratim
>>>>