[HECnet] KLH10 halting at random

Mon Aug 31 16:13:47 PDT 2020

I will keep digging, but it is possibly interesting that this happens after somewhere between approximately 52 hours and an indeterminate number of hours of solid uptime.

> On Aug 31, 2020, at 5:00 PM, Thomas DeBellis <tommytimesharing at gmail.com> wrote:
> 
> If you are running a standard PANDA distribution, then DDT is in the monitor and you may fail to it.  Did it come up?  Did you do an examine from the KLH10 micro-engine to see what instruction it was failing on?  Did you see what module it is failing in?
> 
> My monitor is modified from the base PANDA distribution to include several local enhancements, so when I looked at that address, it showed up as being in the entry of CHKOPC, which checks for deferred closes on virtual circuits.  This is in PHYKLP, which is the KLIPA driver (a.k.a. the CI).  Since KLH10 (sadly) does not implement the CI, there is no way you should be executing in that module, as there is nothing for it to talk to.
> 
> Moreover, there is no JRST 4 there.  So probably you have something else at that address.
> 
> I have been running KLH10 for a very long time, since late December 2002, and have made modifications there too, to fix an issue with locking memory and to better support Linux (recent Ubuntu).  It is remarkably robust; despite intensive development, I have stayed up well over a year at a time (i.e., hit UP2LNG BUGHLTs).
> 
> I have found one problem: if you are running it on an extremely fast machine with SSD storage (in other words, you're basically never waiting for anything) and you seriously beat on the file system, then the keep-alive counter can get out of sync, with the 20 thinking the front end has died and the KLH10 DTE simulator apparently not understanding what to do.
> The 20 typed an initial BUGCHK and then, in the middle of the second one, it hung waiting for the front end.
> 
> It's on my list of things to investigate.
> 
>> On 8/31/20 4:15 PM, Supratim Sanyal wrote:
>> 
>> hi all - my panda distribution instance is halting after a couple of days with the following message. is this a known problem for which there is some workaround? 
>> 
>> Monitor RF434E DEC10 Development 
>> System uptime 52:10:47 
>> Current date/time Wednesday 29-Jul-120 6:01:04 
>> 
>> [HALTED: Program Halt, PC = 22013] 
>> 
>> thanks 
>> 
>> Supratim 
>> 