[HECnet] KLH10 halting at random

Thu Sep 3 21:08:47 PDT 2020

ok, will do. first I have to find and restore the KLH10 instance from 
backup, thanks to an unexpectedly violent storm that triggered tornado 
warnings and consecutive brown-outs. thanks.

On 9/3/20 7:39 PM, Thomas DeBellis wrote:
>
> Shortly after sending this, I wedged my development machine by 
> mistakenly beating on the file system; this time by running SPEAR to 
> pull out events around the DTEKPA BUGCHK.  There was too much activity 
> (I have a very large ERROR.SYS, thanks to DECnet) and I got a DTEKPA.  
> Once this happens, the machine hangs shortly afterwards.  This finally 
> caused me to have a look at DTESRV.
>
> KPALIV is a variable that Tops-20 increments in a number of 
> circumstances, from SCHED, APRSRV and (oddly) CFSSRV.  It's a keep-alive 
> counter that both the front end and Tops-20 pay attention to.  An 
> examination of the live monitor shows that it is monotonically increasing:
>
>     1,,COMBAS+5[   417,,424521
>     1,,COMBAS+5[   417,,426524
>     1,,COMBAS+5[   417,,510532
>
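
A minimal sketch of how the three samples above can be checked by hand,
assuming the display is the usual left,,right octal halfword form of a
36-bit word.  The helper name parse_halfwords is hypothetical and is not
part of any Tops-20 or KLH10 tool.

```python
def parse_halfwords(text):
    """Convert a 'left,,right' octal halfword pair into one 36-bit integer."""
    left, right = text.split(",,")
    return (int(left, 8) << 18) | int(right, 8)


# The three KPALIV samples quoted above, in the order they were taken.
samples = ["417,,424521", "417,,426524", "417,,510532"]
values = [parse_halfwords(s) for s in samples]

# Strictly increasing means each sample is larger than the one before it.
is_increasing = all(a < b for a, b in zip(values, values[1:]))

# Differences between consecutive samples, kept as plain integers.
deltas = [b - a for a, b in zip(values, values[1:])]

print(is_increasing, [oct(d) for d in deltas])
```

Only the right halves change between samples, so the comparison would
come out the same on the right halfword alone; the full word is used
here so that a carry into the left half would not be missed.
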
> It is updated approximately every 500 milliseconds; let's call that a 
> keep-alive tick.  If it isn't updated in two ticks, the front end is 
> declared down and reload action is initiated.  A number of things are 
> done and it appears that KLH10 is not properly handling them.  Since 
> the KLH10 DTE service is not running in a separate process (there are 
> vestigial hooks to do this), it does not handle a ten triggered reload.
>
> Tops-20 waits for the reload to complete, KLH10 does nothing and 
> you're hung.
>
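
The two-tick rule described above can be modelled roughly as follows.
This is an illustration of the counting logic only, not the DTESRV
code; the class name TickWatchdog and its max_missed parameter are
assumptions made for the example.

```python
class TickWatchdog:
    """Declare the peer down after max_missed consecutive ticks with no change."""

    def __init__(self, max_missed=2):
        self.max_missed = max_missed
        self.last_seen = None   # last counter value observed
        self.missed = 0         # consecutive ticks without an update

    def tick(self, counter):
        """Call once per keep-alive tick; returns True while the peer looks alive."""
        if counter != self.last_seen:
            self.last_seen = counter
            self.missed = 0
        else:
            self.missed += 1
        return self.missed < self.max_missed


wd = TickWatchdog()
# The counter advances once, then stalls for two ticks.
history = [wd.tick(c) for c in [10, 11, 11, 11]]
print(history)
```

With a tick of roughly 500 milliseconds, a stall of only about one
second is enough to trip a check like this, which is why a short burst
of heavy disk activity can look like a dead front end.
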
> Fortunately, there is some code for the master DTE which checks a 
> variable called FEDBSW, Front End Debugging Switch.  If this is 
> non-zero, then the keep-alive count is incremented, but it's never 
> checked.  So I set it to -1 (it was zero) and then proceeded to beat 
> on the file system with wild abandon.
>
> For periods of intense disk activity, the machine appeared to hang.  
> After about 10 to 20 seconds, it came right back as if nothing had 
> ever happened.  Interesting...
>
> Right now, my working assumption is that the PI system is getting 
> saturated so that the clock interrupt somehow isn't making it 
> through.  For now, I'm thinking of rewriting the service routine so 
> that instead of checking for two ticks, it checks elapsed time which 
> can then be set to some 'reasonable' value.
>
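
A rough sketch of the elapsed-time idea, under the assumption that the
check can read some millisecond clock.  The names ElapsedWatchdog,
timeout_ms and now_ms are hypothetical, and the 30 second timeout is
just a placeholder for the 'reasonable' value, chosen to sit above the
10 to 20 second stalls mentioned above.

```python
class ElapsedWatchdog:
    """Declare the peer down only after timeout_ms without a counter change."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_seen = None        # last counter value observed
        self.last_change_ms = None   # clock reading when it last changed

    def check(self, counter, now_ms):
        """Returns True while the counter has changed within the timeout."""
        if counter != self.last_seen:
            self.last_seen = counter
            self.last_change_ms = now_ms
        return (now_ms - self.last_change_ms) < self.timeout_ms


wd = ElapsedWatchdog(timeout_ms=30_000)
results = [
    wd.check(5, 0),        # first sample
    wd.check(6, 500),      # counter advanced
    wd.check(6, 20_500),   # stalled for 20 seconds: still considered alive
    wd.check(6, 30_500),   # stalled for 30 seconds: declared down
]
print(results)
```

Passing the clock reading in as an argument keeps the check independent
of how often it is called, so a delayed or skipped tick does not by
itself count against the peer.
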
> If you think this may be what is hanging you, then you can try it.  
> For me, FEDBSW is at octal 1,,304544.  Thus far, I'm up 42:44:57 (1 
> Day, 18 Hours, 44 Minutes, 57 Seconds and 615 Milliseconds).
>
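
The uptime arithmetic above can be checked with a few lines, assuming
the figure is hours:minutes:seconds with an unbounded hour field.  The
function name split_uptime is made up for this example.

```python
def split_uptime(text):
    """Split an 'HH:MM:SS' uptime string into (days, hours, minutes, seconds)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds


print(split_uptime("42:44:57"))
```
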
>> ------------------------------------------------------------------------
>>
>> On 8/31/20 9:03 PM, Thomas DeBellis wrote:
>>
>> Do you know what program is displaying those three lines?
>>
>> I'm unaware of a PANDA distribution that didn't announce itself as a 
>> PANDA distribution in the system banner.   The date and time display 
>> is odd.  Tops-20 native time output has been Y2K compliant since 
>> forever.  It's the Tops-10 programs (MACRO, CREF, Etc.), plus 
>> Tops-10'ish programs (GLXLIB, Quasar, Etc.) that needed Y2K patches.
>>
>> Tops-20 DAP needed a small modification to handle Y2K and to not 
>> break RSX.
>>
>> The Tops-10 system that I use has a number of non-Y2K times, which 
>> surprised me.  While I have had the freedom to remediate, I simply 
>> don't have the time.  But it's jarring.
>>
>> I also found it interesting that the banner says DEC10 Development; 
>> 20's were sometimes called DEC20's, but never DEC10's (well, 1031 
>> might have been an exception).
>>
>> I could have sworn you were showing us something off of a Tops-10 CTY...
>>
>>> ------------------------------------------------------------------------
>>> On 8/31/20 7:13 PM, Supratim Sanyal wrote:
>>>
>>> I will keep digging - but it is possibly interesting this happens 
>>> between approx 52 hours and an indeterminate amount of solid uptime
>>>> ------------------------------------------------------------------------
>>>>
>>>> On Aug 31, 2020, at 5:00 PM, Thomas DeBellis 
>>>> <tommytimesharing at gmail.com> wrote:
>>>>
>>>> If you are running a standard PANDA distribution, then DDT is in 
>>>> the monitor and you may fail to it.  Did it come up?  Did you do an 
>>>> examine from the KLH10 micro-engine to see what instruction it was 
>>>> failing on?  Did you see what module it is failing in?
>>>>
>>>> My monitor is modified from the base PANDA distribution to include 
>>>> several local enhancements, so when I looked at that address, it 
>>>> showed up as in the entry of CHKOPC, which is what is checking for 
>>>> deferred closes on virtual circuits.  This is in PHYKLP which is 
>>>> the KLIPA driver (a.k.a. the CI). Since KLH10 (sadly) does not 
>>>> implement the CI, there is no way you should be executing in that 
>>>> module as there is nothing for it to talk to.
>>>>
>>>> Moreover, there is no JRST 4 there.  So probably you have something 
>>>> else at that address.
>>>>
>>>> I have been running KLH10 for a /very/ long time, since late 
>>>> December 2002, and have made modifications there too, to fix an 
>>>> issue with locking memory and to better support Linux (recent 
>>>> Ubuntu). It is remarkably robust; despite intensive development, I 
>>>> have stayed up well over a year at a time (i.e., hit UP2LNG BUGHLT's).
>>>>
>>>> I have found one problem; if you are running it on an _extremely_ 
>>>> fast machine with SSD storage (in other words, you're basically 
>>>> never waiting for anything) and you seriously beat on the file 
>>>> system, then the keep-alive counter can get out of sync, with the 20 
>>>> thinking the front end has died and the KLH10 DTE simulator 
>>>> apparently not understanding what to do.
>>>>
>>>> The 20 typed an initial BUGCHK and then in the middle of the second 
>>>> one, it hangs waiting for the front end.
>>>>
>>>> It's on my list of things to investigate.
>>>>
>>>>> ------------------------------------------------------------------------
>>>>> On 8/31/20 4:15 PM, Supratim Sanyal wrote:
>>>>>
>>>>> hi all - my panda distribution instance is halting after a couple 
>>>>> of days with the following message. is this a known problem for 
>>>>> which there is some workaround?
>>>>>
>>>>> Monitor RF434E DEC10 Development
>>>>> System uptime 52:10:47
>>>>> Current date/time Wednesday 29-Jul-120 6:01:04
>>>>>
>>>>> [HALTED: Program Halt, PC = 22013]
>>>>>
>>>>> thanks
>>>>>
>>>>> Supratim
>>>>>
-- 
Supratim Sanyal, W1XMT
39.19151 N, 77.23432 W
QCOCAL::SANYAL via HECnet
