[HECnet] KLH10 halting at random
Supratim Sanyal
supratim at riseup.net
Thu Sep 3 21:08:47 PDT 2020
ok, will do. first I have to find and restore the KLH10 instance from
backup, thanks to an unexpectedly violent storm that triggered tornado
warnings and consecutive brown-outs. thanks.
On 9/3/20 7:39 PM, Thomas DeBellis wrote:
>
> Shortly after sending this, I wedged my development machine by
> mistakenly beating on the file system; this time by running SPEAR to
> pull out events around the DTEKPA BUGCHK. There was too much activity
> (I have a very large ERROR.SYS, thanks to DECnet) and I got a DTEKPA.
> Once this happens, the machine hangs shortly afterwards. This finally
> caused me to have a look at DTESRV.
>
> KPALIV is a variable that is incremented by Tops-20 in a number of
> circumstances by SCHED, APRSRV and (oddly) CFSSRV. It's a keep alive
> counter that both the front end and Tops-20 pay attention to. An
> examination of the live monitor shows that it is monotonically increasing:
>
> 1,,COMBAS+5[ 417,,424521
> 1,,COMBAS+5[ 417,,426524
> 1,,COMBAS+5[ 417,,510532
>
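The three values above are written in the usual PDP-10 half-word notation: two 18-bit octal halves of a 36-bit word, separated by ",,". As a small illustration only, this Python sketch combines the halves into one integer and checks that the samples really do increase. The function name `parse_halfwords` is made up for the example; nothing here comes from the monitor sources.

```python
def parse_halfwords(text):
    """Convert 'LH,,RH' (both octal) into a single 36-bit integer."""
    left, right = text.strip().split(",,")
    return (int(left, 8) << 18) | int(right, 8)


# The three KPALIV samples quoted in the message.
samples = ["417,,424521", "417,,426524", "417,,510532"]
values = [parse_halfwords(s) for s in samples]

# Strictly increasing means every sample is larger than the one before it.
is_increasing = all(a < b for a, b in zip(values, values[1:]))

# Difference between consecutive samples, in counter units.
deltas = [b - a for a, b in zip(values, values[1:])]
```

The deltas only say how far the counter moved between examinations; the message does not say how much time passed between them, so no rate can be derived from these samples alone.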
> It is updated approximately every 500 milliseconds; let's call that a
> keep-alive tick. If it isn't updated in two ticks, the front end is
> declared down and reload action is initiated. A number of things are
> done and it appears that KLH10 is not properly handling them. Since
> the KLH10 DTE service is not running in a separate process (there are
> vestigial hooks to do this), it does not handle a ten-triggered reload.
>
> Tops-20 waits for the reload to complete, KLH10 does nothing and
> you're hung.
>
> Fortunately, there is some code for the master DTE which checks a
> variable called FEDBSW, Front End Debugging Switch. If this is
> non-zero, then the keep-alive count is incremented, but it's never
> checked. So I set it to -1 (it was zero) and then proceeded to beat
> on the file system with wild abandon.
>
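To make the two-tick rule and the FEDBSW bypass described above concrete, here is a minimal Python sketch. It is not a transcription of DTESRV: the class, method, and field names are hypothetical, and it assumes only what the message states, namely that a counter which has not changed for two ticks means "declared down", and that a non-zero debugging switch skips the check entirely.

```python
class TickKeepAlive:
    """Tick-based keep-alive check (illustrative only, not DTESRV code)."""

    # Per the description: declared down after two ticks without an update.
    MAX_MISSED_TICKS = 2

    def __init__(self, debug_switch=0):
        self.debug_switch = debug_switch  # stands in for FEDBSW
        self.last_count = None            # last counter value seen
        self.missed = 0                   # consecutive ticks with no change

    def on_tick(self, count):
        """Call once per keep-alive tick; True means 'declare the peer down'."""
        if self.debug_switch != 0:
            # Debugging switch set: the counter still runs but is never checked.
            return False
        if count == self.last_count:
            self.missed += 1
        else:
            self.last_count = count
            self.missed = 0
        return self.missed >= self.MAX_MISSED_TICKS


normal = TickKeepAlive()
normal_results = [normal.on_tick(c) for c in (1, 2, 2, 2)]

debug = TickKeepAlive(debug_switch=-1)
debug_results = [debug.on_tick(c) for c in (1, 1, 1, 1)]
```

With the switch at zero, the counter stalling at 2 trips the check on the second missed tick; with the switch at -1, the same stall is ignored.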
> For periods of intense disk activity, the machine appeared to hang.
> After about 10 to 20 seconds, it came right back as if nothing had
> ever happened. Interesting...
>
> Right now, my working assumption is that the PI system is getting
> saturated so that the clock interrupt somehow isn't making it
> through. For now, I'm thinking of rewriting the service routine so
> that instead of checking for two ticks, it checks elapsed time which
> can then be set to some 'reasonable' value.
>
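The elapsed-time idea in the paragraph above might look roughly like the following Python sketch. It is an assumption-laden illustration, not the proposed monitor change: the class name, the explicit `now` argument, and the 30-second timeout are all invented for the example (the message only says the limit would be some "reasonable" value).

```python
class ElapsedTimeWatchdog:
    """Declare the peer down only after the counter has been stuck for
    longer than timeout_s seconds, however many ticks that spans."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_count = None   # last counter value seen
        self.last_change = None  # time at which the counter last changed

    def check(self, count, now):
        """Return True if the counter has been unchanged for too long."""
        if count != self.last_count:
            self.last_count = count
            self.last_change = now
            return False
        return (now - self.last_change) > self.timeout_s


watchdog = ElapsedTimeWatchdog(timeout_s=30.0)  # 30 s is a made-up value
first = watchdog.check(5, now=0.0)              # first sample, nothing to compare
short_stall = watchdog.check(5, now=15.0)       # 15 s stall: tolerated
recovered = watchdog.check(6, now=20.0)         # counter moved again
long_stall = watchdog.check(6, now=51.0)        # 31 s stall: exceeds the limit
```

A 10 to 20 second pause of the kind reported during heavy disk activity would pass under such a limit, whereas a check that gives up after two roughly 500-millisecond ticks would not.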
> If you think this may be what is hanging you, then you can try it.
> For me, FEDBSW is at octal 1,,304544. Thus far, I'm up 42:44:57 (1
> Day, 18 Hours, 44 Minutes, 57 Seconds and 615 Milliseconds).
>
>> ------------------------------------------------------------------------
>>
>> On 8/31/20 9:03 PM, Thomas DeBellis wrote:
>>
>> Do you know what program is displaying those three lines?
>>
>> I'm unaware of a PANDA distribution that didn't announce itself as a
>> PANDA distribution in the system banner. The date and time display
>> is odd. Tops-20 native time output has been Y2K compliant since
>> forever. It's the Tops-10 programs (MACRO, CREF, Etc.), plus
>> Tops-10'ish programs (GLXLIB, Quasar, Etc.) that needed Y2K patches.
>>
>> Tops-20 DAP needed a small modification to handle Y2K and to not
>> break RSX.
>>
>> The Tops-10 system that I use has a number of non-Y2K times, which
>> surprised me. While I have had the freedom to remediate, I simply
>> don't have the time. But it's jarring.
>>
>> I also found it interesting that the banner says DEC10 Development;
>> 20's were sometimes called DEC20's, but never DEC10's (well, 1031
>> might have been an exception).
>>
>> I could have sworn you were showing us something off of a Tops-10 CTY...
>>
>>> ------------------------------------------------------------------------
>>> On 8/31/20 7:13 PM, Supratim Sanyal wrote:
>>>
>>> I will keep digging - but it is possibly interesting that this happens
>>> between approx 52 and an indeterminate number of hours of solid uptime
>>>> ------------------------------------------------------------------------
>>>>
>>>> On Aug 31, 2020, at 5:00 PM, Thomas DeBellis
>>>> <tommytimesharing at gmail.com> wrote:
>>>>
>>>> If you are running a standard PANDA distribution, then DDT is in
>>>> the monitor and you may fail to it. Did it come up? Did you do an
>>>> examine from the KLH10 micro-engine to see what instruction it was
>>>> failing on? Did you see what module it is failing in?
>>>>
>>>> My monitor is modified from the base PANDA distribution to include
>>>> several local enhancements, so when I looked at that address, it
>>>> showed up as in the entry of CHKOPC, which is what is checking for
>>>> deferred closes on virtual circuits. This is in PHYKLP which is
>>>> the KLIPA driver (a.k.a. the CI). Since KLH10 (sadly) does not
>>>> implement the CI, there is no way you should be executing in that
>>>> module as there is nothing for it to talk to.
>>>>
>>>> Moreover, there is no JRST 4 there. So probably you have something
>>>> else at that address.
>>>>
>>>> I have been running KLH10 for a /very/ long time; since late
>>>> December 2002 and have made modifications there, too, to fix an
>>>> issue with locking memory and to better support Linux (recent
>>>> Ubuntu). It is remarkably robust; despite intensive development, I
>>>> have stayed up well over a year at a time (I.E., hit UP2LNG BUGHLT's)
>>>>
>>>> I have found one problem; if you are running it on an _extremely_
>>>> fast machine with SSD storage (in other words, you're basically
>>>> never waiting for anything) and you seriously beat on the file
>>>> system, then the keep-alive counter can get out of sync, with the 20
>>>> thinking the front end has died and the KLH10 DTE simulator
>>>> apparently not understanding what to do.
>>>>
>>>> The 20 typed an initial BUGCHK and then, in the middle of the second
>>>> one, it hung waiting for the front end.
>>>>
>>>> It's on my list of things to investigate.
>>>>
>>>>> ------------------------------------------------------------------------
>>>>> On 8/31/20 4:15 PM, Supratim Sanyal wrote:
>>>>>
>>>>> hi all - my panda distribution instance is halting after a couple
>>>>> of days with the following message. is this a known problem for
>>>>> which there is some workaround?
>>>>>
>>>>> Monitor RF434E DEC10 Development
>>>>> System uptime 52:10:47
>>>>> Current date/time Wednesday 29-Jul-120 6:01:04
>>>>>
>>>>> [HALTED: Program Halt, PC = 22013]
>>>>>
>>>>> thanks
>>>>>
>>>>> Supratim
>>>>>
--
Supratim Sanyal, W1XMT
39.19151 N, 77.23432 W
QCOCAL::SANYAL via HECnet