[HECnet] KLH10 front-end reload??

Sun Oct 11 20:19:20 PDT 2020

After working on the DTEKPA issue and coming up with a workaround, I 
went searching to see if the problem had ever been reported.  In fact, 
it had been.  By me.  ...Well over a decade ago...

Basically, the default behavior for Tops-20 is to note that a front-end 
counter isn't incrementing and--if on the next check it still hasn't 
changed--to declare the associated PDP-11 front end down and to initiate 
a reboot.  This gives the 11 about a millisecond to get its act together 
before the KL whacks it.
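
A minimal sketch in C of the two-strikes check described above.  This 
is not the DTESRV code; the structure and function names are 
hypothetical and only illustrate the logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical watchdog state; names are illustrative only. */
struct ka_watch {
    uint32_t last_seen;  /* keep-alive value seen at the previous check */
    int      stalls;     /* consecutive checks with no change */
};

/* Call once per check interval with the current keep-alive value.
 * Returns true when the counter has been unchanged on two consecutive
 * checks, i.e. when the front end would be declared down. */
bool ka_check(struct ka_watch *w, uint32_t counter)
{
    assert(w != NULL);
    if (counter != w->last_seen) {
        w->last_seen = counter;   /* front end is alive; start over */
        w->stalls = 0;
        return false;
    }
    w->stalls++;                  /* first stall is noted, second is fatal */
    return w->stalls >= 2;
}
```

The first unchanged check only records the stall; the second one trips 
the reload, which is why a short scheduling hiccup is enough to do it.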

Of course, that will never work in KLH10 (see below and previous); 
you're hung because the KLH10 DTE emulator doesn't implement code to 
simulate a reboot action, so the KL loops forever looking for the 
response.  Of course, why would it?  There should be no reason for 
Tops-20 to ever think it's down.  Seeing as I like writing assembler 
more than C, I decided not to tweak the KLH10 code (I have tweaked it 
for other cases and might rethink this).

The workaround is to define some additional functions for the BOOT% JSYS 
to set some variables in resident storage in STG and modify DTESRV to 
use them.  You can now set an elapsed time to wait before declaring the 
PDP-11 down.  I default to five minutes of non-incrementing keep-alive. 
Depending on how hard I am beating on things, the front end 'appears' to 
go away for anywhere from 5 to 15 seconds.
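
As an illustration only, the elapsed-time variant might look like this 
in C.  The names, the millisecond clock, and the structure layout are 
assumptions; the real change lives in the monitor, not in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state; in the monitor this would be resident storage. */
struct ka_timer {
    uint32_t last_count;      /* keep-alive value at the last change */
    uint64_t last_change_ms;  /* time at which it last changed */
    uint64_t timeout_ms;      /* settable limit, e.g. 5 * 60 * 1000 */
};

/* Returns true only once the counter has been unchanged for at least
 * timeout_ms.  Any change in the counter restarts the clock. */
bool ka_expired(struct ka_timer *t, uint32_t count, uint64_t now_ms)
{
    assert(t != NULL);
    if (count != t->last_count) {
        t->last_count = count;
        t->last_change_ms = now_ms;
        return false;
    }
    return now_ms - t->last_change_ms >= t->timeout_ms;
}
```

With a five minute limit, a 15 second stall never comes close to 
triggering a reload.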

I think probably the real fix is to not depend on an OS interrupt to 
increment the counter.  A thread should be spawned which uses 
nanosleep() to bump the counter every 500 microseconds, no matter what 
the rest of KLH10 might be doing.  KLH10 is already using multiple forks 
for the disks, tape, NI, etc. (one reason I've preferred it over SimH), 
so maybe this won't be a big deal.
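
A rough sketch of that idea using POSIX threads.  Nothing here is taken 
from KLH10: the counter, the stop flag, and the function names are 
hypothetical stand-ins, and a real change would have to write to 
wherever the DTE emulation keeps its keep-alive word.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for the emulated keep-alive word and a flag
 * used to shut the ticker down. */
static atomic_uint ka_counter;
static atomic_bool ka_stop;

/* Thread body: sleep 500 microseconds, bump the counter, repeat. */
static void *ka_ticker(void *arg)
{
    (void)arg;
    while (!atomic_load(&ka_stop)) {
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 500 * 1000 };
        /* If a signal interrupts the sleep, finish the remainder. */
        while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
            ;
        atomic_fetch_add(&ka_counter, 1);
    }
    return NULL;
}

/* Start the ticker; returns 0 on success, like pthread_create(). */
int ka_ticker_start(pthread_t *tid)
{
    assert(tid != NULL);
    atomic_store(&ka_stop, false);
    return pthread_create(tid, NULL, ka_ticker, NULL);
}

/* Ask the ticker to exit and wait for it; returns 0 on success. */
int ka_ticker_stop(pthread_t tid)
{
    atomic_store(&ka_stop, true);
    return pthread_join(tid, NULL);
}
```

The counter is atomic so the emulator can read it from another thread 
without a lock.  The host scheduler can still delay the thread, so 
this narrows the window rather than closing it.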

> ------------------------------------------------------------------------
> *From*: Mark Crispin <MRC at Lingling.Panda.COM>
> *Subject*: Re: KLH10 front-end reload??
> *To*: Thomas DeBellis <slogin at acedsl.com>
> *Date*: Sun, 29 Nov 2009 11:20:46 -0800 (PST)
> *In-Reply-To*: <4B11CFEA.4020906 at acedsl.com>
> *Message-ID*: <alpine.OSX.2.00.0911291100070.245 at hsinghsing.panda.com>
>
> KLH10 implements enough for the front end DTE protocol for TOPS-20 to 
> think that it is talking to a front end, albeit one with just a CTY 
> (no KLINIK, DL11 lines, or DECnet).
>
> There is a keepalive timer in both TOPS-20 (to reboot the front end 
> when the front end crashes) and in RSX-11F (to reboot TOPS-20 when it 
> crashes).
>
> The front end also keeps time well enough to set TOPS-20's clock 
> following a crash-reboot; this is superseded in KLH10 as the timebase 
> instructions get the time from the host OS.  In addition, Panda 
> monitors try to run a program called TIMCHK which will synchronize 
> with NTP servers.
>
> So, what happened was that the DTE protocol stopped for some reason. 
> TOPS-20 tried to reboot the front end in an attempt to get it going, 
> but of course that was futile.
>
> Here's something that may help:
>
> On some Linux systems the esoteric real-time interrupt mechanisms in 
> KLH10 don't work well.  So, it may be necessary to set 
> KLH10_ITIME_SYNC instead of the default KLH10_ITIME_INTRP. Note that 
> doing so will make KLH10 burn much more CPU on the host system.
>
> Usually, though, if you need to do this, it becomes pretty obvious at 
> once, with nasty DTE errors from KLH10 shortly after booting (and any 
> time you type on the CTY).
>
> One reason why I haven't upgraded Lingling's host CPU is that most of 
> the newer machines that I've run KLH10 on have required doing this.  
> It's quite annoying.
>
>> ------------------------------------------------------------------------
>> *From*: Thomas DeBellis <slogin at acedsl.com>
>> *Subject*: KLH10 front-end reload??
>> *To*: Tops-20 Wizards <TOPS-20 at lingling.panda.com>
>> *Date*: Sat, 28 Nov 2009 20:35:38 -0500
>> *Message-ID*: <4B11CFEA.4020906 at acedsl.com>
>>
>> Tommy Timesharing hung earlier today; it had been up over 175 days. 
>> I got an error around 1:27PM-EST that the front end had hung and was 
>> rebooted.  By the time I noticed at 4:08, the system was completely 
>> wedged.
>>
>> I couldn't get in on the CTY, but KLH10 appeared to be working. 
>> However, in the process of poking around, I completely destroyed some 
>> information, so I am unable to determine exactly what was going on.  
>> Sigh...
>>
>> As I had been up since early June (and the middle of March before 
>> that because of a power failure), this does not appear to be of 
>> immediate concern.  The system had not had a single issue during all 
>> this time (not even a BUGINF)
>>
>> However ...  Ideas, anyone?  Should I think about getting nervous?  I 
>> mean, there IS no front-end on KLH10, right?
>> ________________________________________________________________________
>>
>> ************************************************************************
>> TOPS-20 BUGHLT-BUGCHK
>>  Logged on Sat 28 Nov 2009 13:27:08      Monitor uptime was 175 days 19:54:26
>>     Detected on system # 3699.
>>     Record sequence number:    17527.
>> ************************************************************************
>>
>> Error information:
>>     Date/Time of error:    Sat 28 Nov 2009 13:27:05
>>     Errors since reload:    1.
>>     Fork # & Job #:        777777,777777
>>     User's logged in dir:    unknown
>>     Program name:
>>     Error:            BUGINF
>>     Address of error:    1137031
>>     Name:            DTEKPA
>>     Description:        DTE keep alive fail
>>     CONI APR:        007740,,000003 = No error bits detected
>>     CONI PAG:        000000,,660151
>>     DATAI PAG:        700101,,002750
>>     Contents of ACs:
>>              0:    000000,,575700
>>              1:    777777,,000000
>>              2:    000000,,000000
>>              3:    000000,,277242
>>              4:    000100,,206260
>>              5:    000000,,247445
>>              6:    000000,,000000
>>              7:    000000,,000000
>>             10:    777775,,000002
>>             11:    000000,,000000
>>             12:    000000,,614101
>>             13:    777772,,000012
>>             14:    777777,,777650
>>             15:    777305,,353304
>>             16:    620012,,000000
>>             17:    777115,,246540
>>     PI status:        000000,,000175
>>     Additional data items:    1
>>                 000000,,000000
>>
>>     ERA:            000000,,000000 = word #0 Memory read
>>     Base phyiscal memory
>>      address at failure:    0
>>
>> ************************************************************************
>> FRONT END RELOADED
>>  Logged on Sat 28 Nov 2009 13:28:04      Monitor uptime was 175 days 19:55:21
>>     Detected on system # 3699.
>>     Record sequence number:    17528.
>> ************************************************************************
>>     CPU # :,,Front end #:    0,0
>>     Status at reload:     No error bits detected
>>     Retries:    3
>>     Filename for DUMP: <SYSTEM>0DMP11.BIN.1,28-Nov-2009 13:27:05