[HECnet] KLH10 halting at random

Fri Sep 4 09:37:44 PDT 2020

You have my sympathies; I saw the storm coming and it finally got me off 
my butt to plug in a UPS that I had been dawdling on for about eight 
months.  That saved about half of everything, including the 20's.  I got 
another UPS two days afterwards for the other half.

    Conditioned power is /Good/...

While you're having all that fun, don't forget to update your DECnet 
hosts (SETNODE/SYSTEM:NODE-DATA.TXT); the last time I checked, you were 
very out of date--you don't have definitions for my systems and I've 
been on HECnet since June of last year.

I'm putting some code into DTESRV to wait longer than a second to 
declare a front-end down.  It's a little tricky because some of it is 
running in scheduler context, outside of fork context, which means that 
I can't use certain wait paradigms.  Hopefully then, I'll be able to use 
some of that data to see what is keeping KLH10 from updating the master 
DTE keep-alive counter (KPALIV).
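
To make the idea concrete, here is a rough sketch of an elapsed-time keep-alive check, written in Python purely for illustration. It is not the DTESRV code. The class name, the 5-second grace period and the debug_switch flag (loosely modeled on FEDBSW) are all assumptions; the only thing it takes from the monitor is the notion of a counter that must keep advancing.

```python
from dataclasses import dataclass


@dataclass
class KeepAliveWatchdog:
    """Toy model of a keep-alive check based on elapsed time, not tick counts."""

    timeout_ms: int = 5000       # hypothetical grace period, not a DTESRV value
    debug_switch: bool = False   # stands in for FEDBSW: keep counting, never declare down
    last_count: int = 0          # last counter value observed
    last_change_ms: int = 0      # time at which the counter last advanced

    def poll(self, count: int, now_ms: int) -> bool:
        """Return True if the peer should be declared down at time now_ms."""
        if count != self.last_count:
            # The counter moved, so the peer is alive; restart the timer.
            self.last_count = count
            self.last_change_ms = now_ms
            return False
        if self.debug_switch:
            # Debugging switch set: the stall is noted but never acted upon.
            return False
        # Declare down only after the counter has been stuck longer than the timeout.
        return now_ms - self.last_change_ms > self.timeout_ms


if __name__ == "__main__":
    dog = KeepAliveWatchdog(timeout_ms=5000)
    dog.poll(1, 0)
    print(dog.poll(1, 1000))   # stalled for 1 second: still considered up
    print(dog.poll(1, 6000))   # stalled for 6 seconds: declared down
```

The point of comparing timestamps rather than counting missed ticks is that a stall shorter than the grace period costs nothing, and the grace period can be tuned without touching the tick rate.
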
> ------------------------------------------------------------------------
>
> On 9/4/20 12:08 AM, Supratim Sanyal wrote:
>
> ok, will do. first I have to find and restore the KLH10 instance from 
> backup, thanks to an unexpectedly violent storm that triggered tornado 
> warnings and consecutive brown-outs. thanks.
>
>> ------------------------------------------------------------------------
>> On 9/3/20 7:39 PM, Thomas DeBellis wrote:
>>
>> Shortly after sending this, I wedged my development machine by 
>> mistakenly beating on the file system; this time by running SPEAR to 
>> pull out events around the DTEKPA BUGCHK.  There was too much 
>> activity (I have a very large ERROR.SYS, thanks to DECnet) and I got 
>> a DTEKPA.  Once this happens, the machine hangs shortly afterwards.  
>> This finally caused me to have a look at DTESRV.
>>
>> KPALIV is a variable that is incremented by Tops-20 in a number of 
>> circumstances by SCHED, APRSRV and (oddly) CFSSRV. It's a keep alive 
>> counter that both the front end and Tops-20 pay attention to.  An 
>> examination of the live monitor shows that it is monotonically 
>> increasing:
>>
>>     1,,COMBAS+5[   417,,424521
>>     1,,COMBAS+5[   417,,426524
>>     1,,COMBAS+5[   417,,510532
>>
>> It is updated approximately every 500 milliseconds; let's call that a 
>> keep-alive tick.  If it isn't updated in two ticks, the front end is 
>> declared down and reload action is initiated.  A number of things are 
>> done and it appears that KLH10 is not properly handling them.  Since 
>> the KLH10 DTE service is not running in a separate process (there are 
>> vestigial hooks to do this), it does not handle a ten-triggered reload.
>>
>> Tops-20 waits for the reload to complete, KLH10 does nothing and 
>> you're hung.
>>
>> Fortunately, there is some code for the master DTE which checks a 
>> variable called FEDBSW, Front End Debugging Switch. If this is 
>> non-zero, then the keep-alive count is incremented, but it's never 
>> checked.  So I set it to -1 (it was zero) and then proceeded to beat 
>> on the file system with wild abandon.
>>
>> For periods of intense disk activity, the machine appeared to hang.  
>> After about 10 to 20 seconds, it came right back as if nothing had 
>> ever happened.  Interesting...
>>
>> Right now, my working assumption is that the PI system is getting 
>> saturated so that the clock interrupt somehow isn't making it 
>> through.  For now, I'm thinking of rewriting the service routine so 
>> that instead of checking for two ticks, it checks elapsed time which 
>> can then be set to some 'reasonable' value.
>>
>> If you think this may be what is hanging you, then you can try it.  
>> For me, FEDBSW is at octal 1,,304544.  Thus far, I'm up 42:44:57 (1 
>> Day, 18 Hours, 44 Minutes, 57 Seconds and 615 Milliseconds).
>>
>>> ------------------------------------------------------------------------
>>>
>>> On 8/31/20 9:03 PM, Thomas DeBellis wrote:
>>>
>>> Do you know what program is displaying those three lines?
>>>
>>> I'm unaware of a PANDA distribution that didn't announce itself as a 
>>> PANDA distribution in the system banner.   The date and time display 
>>> is odd.  Tops-20 native time output has been Y2K compliant since 
>>> forever.  It's the Tops-10 programs (MACRO, CREF, Etc.), plus 
>>> Tops-10'ish programs (GLXLIB, Quasar, Etc.) that needed Y2K patches.
>>>
>>> Tops-20 DAP needed a small modification to handle Y2K and to not 
>>> break RSX.
>>>
>>> The Tops-10 system that I use has a number of non-Y2K times, which 
>>> surprised me.  While I have had the freedom to remediate, I simply 
>>> don't have the time.  But it's jarring.
>>>
>>> I also found it interesting that the banner says DEC10 Development; 
>>> 20's were sometimes called DEC20's, but never DEC10's (well, 1031 
>>> might have been an exception).
>>>
>>> I could have sworn you were showing us something off of a Tops-10 CTY...
>>>
>>>> ------------------------------------------------------------------------
>>>> On 8/31/20 7:13 PM, Supratim Sanyal wrote:
>>>>
>>>> I will keep digging - but it is possibly interesting this happens 
>>>> between approx 52 and an indeterminate number of hours of solid uptime
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> On Aug 31, 2020, at 5:00 PM, Thomas DeBellis 
>>>>> <tommytimesharing at gmail.com> wrote:
>>>>>
>>>>> If you are running a standard PANDA distribution, then DDT is in 
>>>>> the monitor and you may fail to it. Did it come up?  Did you do an 
>>>>> examine from the KLH10 micro-engine to see what instruction it was 
>>>>> failing on?  Did you see what module it is failing in?
>>>>>
>>>>> My monitor is modified from the base PANDA distribution to include 
>>>>> several local enhancements, so when I looked at that address, it 
>>>>> showed up as in the entry of CHKOPC, which is what is checking for 
>>>>> deferred closes on virtual circuits.  This is in PHYKLP which is 
>>>>> the KLIPA driver (a.k.a. the CI). Since KLH10 (sadly) does not 
>>>>> implement the CI, there is no way you should be executing in that 
>>>>> module as there is nothing for it to talk to.
>>>>>
>>>>> Moreover, there is no JRST 4 there.  So probably you have 
>>>>> something else at that address.
>>>>>
>>>>> I have been running KLH10 for a /very/ long time; since late 
>>>>> December 2002 and have made modifications there, too, to fix an 
>>>>> issue with locking memory and to better support Linux (recent 
>>>>> Ubuntu).  It is remarkably robust; despite intensive development, 
>>>>> I have stayed up well over a year at a time (I.E., hit UP2LNG 
>>>>> BUGHLT's)
>>>>>
>>>>> I have found one problem; if you are running it on an _extremely_ 
>>>>> fast machine with SSD storage (in other words, you're basically 
>>>>> never waiting for anything) and you seriously beat on the file 
>>>>> system, then the keep-alive counter can get out of sync with the 
>>>>> 20 thinking the front end has died and the KLH10 DTE simulator 
>>>>> apparently not understanding what to do.
>>>>>
>>>>> The 20 typed an initial BUGCHK and then in the middle of the 
>>>>> second one, it hangs waiting for the front end.
>>>>>
>>>>> It's on my list of things to investigate.
>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> On 8/31/20 4:15 PM, Supratim Sanyal wrote:
>>>>>>
>>>>>> hi all - my panda distribution instance is halting after a couple 
>>>>>> of days with the following message. is this a known problem for 
>>>>>> which there is some workaround?
>>>>>>
>>>>>> Monitor RF434E DEC10 Development
>>>>>> System uptime 52:10:47
>>>>>> Current date/time Wednesday 29-Jul-120 6:01:04
>>>>>>
>>>>>> [HALTED: Program Halt, PC = 22013]
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>> Supratim
>>>>>>
> -- 
> Supratim Sanyal, W1XMT
> 39.19151 N, 77.23432 W
> QCOCAL::SANYAL via HECnet