[HECnet] Thousands of DECnet errors on Tops-20

Mon Jan 18 11:26:42 PST 2021

16 bytes?   Some kind of internal header?  I'm not sure I fully 
understand what Tops-20 is doing with the buffer; I'll have to 
investigate further to see whether my suspicion about the use of 
meta-data at the beginning is correct.
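
For reference, the size rule as Paul describes it in the quoted message below (a local Ethernet buffer size of 591, limited to the smallest block size reported in the neighbors' hellos, then minus 16) can be sketched roughly as follows. This is only an illustration of that description, not PyDECnet's actual code; the function and constant names are hypothetical.

```python
# Sketch only: names are hypothetical, not taken from PyDECnet.
LOCAL_ETHERNET_BLKSIZE = 591   # local Ethernet buffer size Paul quotes
UPDATE_OVERHEAD = 16           # the (so far unexplained) 16 that gets subtracted


def max_update_size(neighbor_blksizes, local_blksize=LOCAL_ETHERNET_BLKSIZE):
    """Largest routing update size under the rule described in the thread."""
    # Take the smallest of the local size and every neighbor's advertised size.
    smallest = min([local_blksize, *neighbor_blksizes])
    return smallest - UPDATE_OVERHEAD


# Two Tops-20 neighbors advertising blksize 1476 in their hellos:
print(max_update_size([1476, 1476]))  # 575
```

Under that rule the result can never exceed 575 with the default local size, which is why a 1478 byte routing message is surprising.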

Would it help to see more of the messages that I grabbed with tcpdump?  
I did a tcpdump -e -i en0 --immediate-mode -n decnet and have about 92K 
of traffic that I could send off-list.  I'll let Bob get you traces.
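
If anyone wants to go through a similar capture, here is a minimal sketch for scanning saved tcpdump text output. It assumes the line format shown in the quoted messages below (without the asterisks the archive added for emphasis), and it compares the printed frame length directly against the smallest advertised blksize, as I did. Whether those two numbers count the same bytes is still an open question, so treat hits as candidates only. The function name is hypothetical.

```python
import re

# Patterns follow the tcpdump -e text format seen in this thread.
HELLO_RE = re.compile(r"router-hello .*? src (\S+) blksize (\d+)")
ROUTING_RE = re.compile(r"length (\d+): lev-1-routing src (\S+)")


def find_oversized(lines):
    """Return (src, length, limit) for lev-1-routing frames over the smallest blksize."""
    blksizes = {}   # node address -> blksize advertised in its router hello
    routing = []    # (source node, printed frame length) per routing message
    for line in lines:
        hello = HELLO_RE.search(line)
        if hello:
            blksizes[hello.group(1)] = int(hello.group(2))
            continue
        update = ROUTING_RE.search(line)
        if update:
            routing.append((update.group(2), int(update.group(1))))
    if not blksizes:
        return []   # no hellos seen, so there is nothing to compare against
    limit = min(blksizes.values())
    return [(src, length, limit) for src, length in routing if length > limit]


sample = [
    "22:04:21.018507 aa:00:04:00:0a:0a > ab:00:00:03:00:00, ethertype DN (0x6003), "
    "length 60: router-hello l1rout vers 2 eco 0 ueco 0 src 2.522 blksize 1476 pri 5 hello 15",
    "22:04:30.749823 aa:00:04:00:ff:0b > ab:00:00:03:00:00, ethertype DN (0x6003), "
    "length 1478: lev-1-routing src 2.1023 {ids 0-726 cost 0 hops 0",
]
print(find_oversized(sample))  # [('2.1023', 1478, 1476)]
```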

When fixing this, could you put in some kind of configuration command to 
change the fixed behavior back to the current?  I might need this when 
chasing down my suspicion, above.

This is probably happening to _all_ 20's on HECnet that are getting 
messages from PyDECnet.  I'm checking further into my logs, which has 
been tedious.  Beyond the DTEKPA crashes, the size of the error log can 
cause SPEAR to trigger a problem with PA1050; you get an Illegal 
reference to address 301000 at user 547217. And the extracts are so huge 
that not even the Tops-20 gnuemacs can hold them.  You have to transfer 
them to some other system. If you're not running my Extended mode FTP 
server, that will take days.

I only have access to one (TWENEX::), but when I checked, its 
SERR:ERROR.SYS file was over 70K pages, which is what you'd expect.  I 
didn't have time yesterday to run SPEAR to actually pull and now I can't 
seem to get to it. TWENEX::'s DECnet configuration does not appear to be 
up to date.  When I looked at available DECnet nodes, it didn't know 
any.  It also didn't know what node my CTERM was coming in on.

Does anybody have a Tops-20 node that is talking to PyDECnet that can 
either check or give me a guest account?  It would be instructive to 
double check.

> ------------------------------------------------------------------------
> On 1/18/21 12:39 PM, Paul Koning wrote:
>
> No, 1478 doesn't make any sense.
>
> Looking at my code, I use a local buffer size on Ethernet of 591.  But then I track what buffer sizes are reported by the router neighbors in their hello messages, and limit the routing message size to the smallest of all these numbers.
>
> Then I noticed I subtract 16 from that when calculating the size of the update messages.  Why that is I don't remember.
>
> So in any case, 1478 should never be a routing protocol message size coming out of PyDECnet.
>
> I'd like to see message traces.  A trace level log from A2RTR would do the job.  Something is very strange here.
>
> 	paul
>
>> ------------------------------------------------------------------------
>> On Jan 18, 2021, at 1:54 AM, Johnny Billquist <bqt at softjar.se> wrote:
>>
>> Thomas, this is pretty much exactly what I expected (and I suspect Paul expected as well).
>>
>> The level 1 routing messages are (as we said) the ones that can grow big. And the advertised length is not used by the other side to limit what they send. It is essentially a hint of how large the messages you send are.
>>
>> And Paul also noted that on ethernet the Python code is using a larger buffer size (essentially the size an ethernet frame can be) instead of putting any lower limit on it. While this is perfectly legal from a protocol point of view, both TOPS-20 and VMS, it would seem, can't really control the size of the low layer buffer, and therefore fail if you use large packets without also having a large DECnet segment buffer size.
>>
>> So Paul's PyDECnet works the same as I have managed to have RSX work here. And you get the same problem towards some OSes.
>>
>> The obvious, and easy fix is to just lower the buffer size used over ethernet to more closely match what the DECnet segment buffer size is.
>>
>> The sad thing with that is that, at least for RSX, it means you run the risk of hanging the ethernet when running TCP/IP. The best would be if all OSes could separate the two buffer sizes properly.
>> But I just realized that I might just hack RSX DECnet here, to not use the large buffer size for the link messages... Hmm... Gotta look into this.
>>
>> Meanwhile, the fix that Paul already mentioned that he has prepared and ready should fix this for you.
>>
>> Alternatively, if you change that 1504-%RTEHS to instead actually say something like 1500, or 1504, you should probably also be good. (My guess would be 1500.)
>>
>>   Johnny
>>
>> ------------------------------------------------------------------------
>> On 2021-01-18 04:45, Thomas DeBellis wrote:
>>> I think I may have finally gotten to the bottom of this.  It's a level 1 routing message that I'm getting from 2.1023 (A2RTR) that does not appear to be respecting lengths, viz:
>>> *22:04:30*.749823 aa:00:04:00:ff:0b > ab:00:00:03:00:00, ethertype DN (0x6003), length *1478*: lev-1-routing src 2.1023 {ids 0-726 cost 0 hops 0
>>> This is two (2) bytes over the maximum that Tops-20 can accept.
>>>     NCP>*SHOW LINE NI-0 CHARACTERISTICS *
>>>     NCP>
>>>     22:16:04     NCP
>>>     Request # 23; Show Line Characteristics Completed
>>>     Line = NI-0
>>>        Receive Buffers = 6
>>>        Controller = Normal
>>>        Protocol = Ethernet
>>>        Hardware Address = 00 1F 16 EC CE 47
>>>        Receive buffer size = *1476*
>>> It would appear that the 20's are advertising this length in their layer 1 hello messages:
>>> 22:04:21.018507 aa:00:04:00:0a:0a > ab:00:00:03:00:00, ethertype DN (0x6003), length 60: router-hello l1rout vers 2 eco 0 ueco 0 src 2.522 blksize *1476* pri 5 hello 15
>>> 22:04:21.082680 aa:00:04:00:08:0a > ab:00:00:03:00:00, ethertype DN (0x6003), length 60: router-hello l1rout vers 2 eco 0 ueco 0 src 2.520 blksize *1476* pri 5 hello 15
>>> About two seconds after the message comes in from A2RTR, the following appears in the error log:
>>>     ***********************************************
>>>     DECNET ENTRY
>>>       LOGGED ON 17-Jan-2021 *22:04:32*-EST MONITOR UPTIME WAS 1 day(s) 1:17:54
>>>              DETECTED ON SYSTEM # 3691.
>>>              RECORD SEQUENCE NUMBER: 70952.
>>>     ***********************************************
>>>     DECNET Event type 5.15, Receive failed
>>>      From node 2.520 (TOMMYT), occurred 17-JAN-2021 22:04:08
>>>        Line NI-0-0
>>>        Failure reason = Frame too long
>>>        Ethernet header = AB 00 00 03 00 00 / AA 00 04 00 0A 0A
>>> So... no way I can get around this without some /serious/ hacking of DNADLL and ROUTER (see below), which would probably take me a few months to learn and debug.  Of course, then maybe I could put level 2 routing into Tops-20, which I've been daydreaming about...
>>> Paul, what does this suggest to you?
>>>> ------------------------------------------------------------------------
>>>> On 1/17/21 7:39 PM, Johnny Billquist wrote:
>>>>> ------------------------------------------------------------------------
>>>>> On 2021-01-18 00:17, Thomas DeBellis wrote:
>>>>>
>>>>> Well, the frames certainly won't be larger than 1,500 bytes, right?  So I'm guessing they'll be the maximum.  Problem is, all of that stuff is hidden under several layers of drivers, so I'm not sure how I'm going to get the overage passed back.  And I also need to put in some BUGINF logic to alert if I get more of these than whatever I decide the interval to be.
>>>> That depends on what they count. Like I said - ethernet payload is 1500. Then you have the ethernet header, which is 14 bytes, plus the crc trailer, which is 4 bytes. If you count them, you end up at 1518 bytes.
>>>> Depends on the hardware I guess.   I have no idea what the NIA-20 exposes.
>>> I meant the maximum frame size; I suspect this is 1500 for the NI, but I don't actually know.  My speculation is that DECnet is using part of the buffer to piggy back node and other information into it instead of holding this meta-data separately.  I don't know what Multinet does, but there you can configure the NI to have a packet size of 1500.
>>>>> If you are a DDP (LD.DDP), then you are not CPU dependent and you go ahead always, otherwise, you have to be on the CPU that owns the device (.CPCPN) So I'm not sure if it makes any difference, but DDP is not CPU dependent; not sure if that is a synonym for 'shared'.  If I stumble over something more, I'll report it.
>>>> It's actually the same in RSX. The DDCMP layer is sort of between the hardware driver and the higher level protocols, and it's not tied to any specific CPU.
>>>>
>>>> But that code would suggest that LD.DDP is just an indication of whether something is CPU dependent or not, and wouldn't have anything to do with DDCMP.
>>>  From looking at the routing code, it seems LD.DDP is used when something is getting handed to the NSP to play with; I guess that would be going through some kind of layering.
>> -- 
>> Johnny Billquist                  || "I'm on a bus
>>                                   ||  on a psychedelic trip
>> email: bqt at softjar.se             ||  Reading murder books
>> pdp is alive!                     ||  tryin' to stay hip" - B. Idol