[HECnet] Booting VMS 5.5.2 cluster satellite node [SIMH 4.0] over slow link

Mark Pizzolato - Info Comm Mark at infocomm.com
Mon Sep 28 13:15:14 PDT 2020


On Monday, September 28, 2020 at 10:23 AM, Vladimir Machulsky wrote:
> I'm trying to boot the latest SIMH (4.0-current) as a LAVC satellite node over a
> link with about 100 ms latency.
> The boot node is VAX/VMS 5.5.2 with the latest PEDRIVER ECO.
> Result: the MOP part of the boot sequence works without a hitch, but the SCS
> part fails miserably.
> 
> The most frequent result: the SIMH console fills with 10 x "%VAXcluster-W-NOCONN,
> No connection to disk server" messages, then halts with "%VAXcluster-F-CTRLERR,
> boot driver virtual circuit to SCSSYSTEM 0000 Failed".
> Sometimes it gets a little further:
> ...
> %VAXcluster-W-RETRY, Attempting to reconnect to a disk server
> %VAXcluster-W-NOCONN, No connection to disk server
> %VAXcluster-W-RETRY, Attempting to reconnect to a disk server
> %VAXcluster-W-NOCONN, No connection to disk server VULCAN
> %VAXcluster-W-RETRY, Attempting to reconnect to a disk server
> %VAXcluster-W-NOCONN, No connection to disk server
> %VAXcluster-W-RETRY, Attempting to reconnect to a disk server
> %VAXcluster-I-CONN, Connected to disk server VULCAN
> %VAXcluster-W-NOCONN, No connection to disk server VULCAN
> %VAXcluster-W-RETRY, Attempting to reconnect to a disk server
> ...
> It halts after a minute or so of filling the console with those messages.
> 
> Whenever I set up throttling in SIMH to 2500K ops/s, the node boots successfully,
> joins the cluster, and works flawlessly, but slowly; the boot process takes about
> half an hour. After boot, raising the throttle value to 3500K ops/s still works,
> but increasing it further breaks the system with the same messages about the
> disk server.
> Throttled SIMH performance is about 5 VUPS.
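
For reference, the throttling described above corresponds to SIMH commands along
these lines. This is only a sketch: the Ethernet device and attachment are
placeholders and will differ between configurations.

    ; Cap the simulated CPU at roughly 2.5 million instructions per second so
    ; that instruction-counted delays stay closer to real-VAX wall-clock time.
    SET THROTTLE 2500K
    ; Placeholder LAVC/MOP network attachment.
    SET XQ ENABLE
    ATTACH XQ eth0
    ; Once the satellite has joined the cluster, the cap can be raised at the
    ; sim> prompt, e.g.  SET THROTTLE 3500K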
> 
> The only information I have found about maximum channel latency is in the
> "Guidelines for OpenVMS Cluster Configurations" manual:
> "When an FDDI is used for OpenVMS Cluster communications, the ring
> latency when the FDDI ring is idle should not exceed 400 ms."
> So I suppose a 100 ms latency link should be good enough for booting
> satellite nodes over it.
> 
> My understanding of the situation is that the combination of PEDRIVER (and
> PEBTDRIVER within NISCS_LOAD) with fast hardware and slow links is the primary
> cause of this behavior. Please correct me if I'm wrong.
> 
> Does anyone have experience with booting VMS clusters over slow links? Any OS
> version recommendations?
> Perhaps some VMS tunable variables exist for making PEDRIVER happy on fast
> hardware?
> Having the PEDRIVER listings could shed some light on this behavior.
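
On the tunables question, the cluster-related SYSGEN parameters usually mentioned
in this context are RECNXINTERVAL and TIMVCFAIL, which control reconnection and
virtual-circuit failure timing. Whether they help here is unclear, since the
satellite's boot driver runs before the node's own parameter file is in play;
the following is only a sketch of how one would inspect and change them on a
running node, with an example value rather than a recommendation.

    $ RUN SYS$SYSTEM:SYSGEN
    SYSGEN> USE CURRENT
    SYSGEN> SHOW RECNXINTERVAL    ! seconds allowed to re-establish a lost connection
    SYSGEN> SHOW TIMVCFAIL        ! virtual circuit failure detection time (1/100 s)
    SYSGEN> SET RECNXINTERVAL 60  ! example value only
    SYSGEN> WRITE CURRENT         ! takes effect on the next boot
    SYSGEN> EXIT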
> 
> Link details:
> 
> Two Cisco 1861 routers, connected to the Internet via ADSL on one side and 3G
> HSDPA on the other.
> TCP/IP between the sites is routed over an IPsec site-to-site VPN. Ping between
> the sites is about 100 ms.
> On top of that VPN, a DECnet-family (eth.type = 0x6000..0x600F) bridge is built
> using an L2TPv3 VPN.

Fantastic detective work, and it is certainly amazing that you got this far.

If throttling affects the results, then it seems to me that the SCS capabilities 
(which PEDRIVER leverages) aren't using anything that tracks wall-clock time to 
measure delays.  I suspect that where the SCS layer is "thinking about" timing 
stuff it is using the TIMEDWAIT macro in the kernel/driver code.
As I recall, TIMEDWAIT effectively spins on a processor-model-specific loop 
count until the presumably appropriate amount of time has elapsed.
Since most simh host systems actually run significantly faster than the 
original hardware did, these TIMEDWAIT macro invocations don't track 
wall-clock time very well.  The VAST majority of TIMEDWAIT uses in VMS 
drivers relate to the CPU interacting with a device which is internal to the 
simulator, and SIMH's interactions with devices are measured in instructions 
(which aligns well with the internal implementation of the TIMEDWAIT macro).
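
As a rough illustration of the difference (this is not the actual TIMEDWAIT code, 
just a generic sketch in C): a delay implemented as a calibrated spin loop shrinks 
in wall-clock terms when the instruction rate goes up, while a delay measured 
against a real clock does not.

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    /* Loop count assumed to deliver ~10 microseconds on the machine it was
     * calibrated for (stand-in for a processor-model-specific delay constant). */
    #define LOOPS_PER_10US 100UL

    /* Spin-style delay: the elapsed wall-clock time depends entirely on how
     * fast the host executes the loop, so it collapses on a fast simulator. */
    static void spin_delay(unsigned long tens_of_us)
    {
        volatile unsigned long i;
        unsigned long n;
        for (n = 0; n < tens_of_us; n++)
            for (i = 0; i < LOOPS_PER_10US; i++)
                ;  /* burn instructions, not time */
    }

    /* Clock-style delay: elapsed time is measured against the wall clock and
     * is therefore independent of how fast the CPU happens to be. */
    static void clock_delay(unsigned long tens_of_us)
    {
        struct timespec ts = { 0, (long)tens_of_us * 10000L };  /* nanoseconds */
        nanosleep(&ts, NULL);
    }

    int main(void)
    {
        spin_delay(100);   /* "1 ms" only if the calibration still matches the CPU */
        clock_delay(100);  /* ~1 ms regardless of CPU speed */
        printf("done\n");
        return 0;
    }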

I vaguely recall that DEC had conceptual support for clusters located at 
physically different sites, connected by relatively high-speed, low-latency 
network links.  I believe that the "supported" configurations assumed much 
higher speeds and much lower latencies than your setup is seeing.

Good Luck,

- Mark


