Request for ideas: How to allow 'retryable' OPEN/CLOSE in QDOS/SMSQE

martyn_hill · Post by **martyn_hill** » Wed May 12, 2021 11:25 am

Hi QDOS gurus!

I'm about 90% of the way through (re)coding the NET and FSERVE drivers to work alongside the 'Message Queue server' that drives the QLUB Adapter, and still have one fundamental design item to overcome to fully support the original drivers.

I can partially work around the issue with a kludgey approach, but thought to reach out for any ideas you might have developed already or could think about before I release the next version.

It's a bit tricky to explain, but here goes:

Both the simple NET driver and the client side of FSERVE (Nx_) rely on the ability to 'retry' an operation when OPENing or CLOSEing a remote channel. It's an effective but multitasking-unfriendly approach that kills job scheduling and other IO during the process, which may last up to 25 seconds.

QDOS's native IOSS retry mechanism works perfectly well for TRAP #3 operations such as byte/string input and output, but the same facility is not extended to TRAP #2 operations such as OPEN or CLOSE, which are rightly considered 'memory management' type operations and should therefore be atomic. That is, you get one chance to OPEN or CLOSE before an error or success is reported, whereas TRAP #3 IO can be deferred by returning a 'Not Complete' error, in which case the IOSS will continue to reschedule/retry the operation until it times out or succeeds.

When it comes to network ops, the actual op could take several 50Hz 'ticks' to complete, with several timeouts occurring in between before success. For normal IO, this is fine due to the retry mechanism, as QDOS will give CPU time to other jobs and IO between attempts, but for OPEN or CLOSE, we end up stalling QDOS and anything else running in the meantime, whilst we stay in Supervisor mode throughout. The native NET and FSERVE client drivers manually retry and time out after 20-odd seconds if they don't complete before then. To see that in action on a regular QL, start the TK2 CLOCK job, then attempt to open a non-existent file (OPEN #n, "Nx_somefile") on another remote QL that's running FSERVE. The clock won't get updated until the OPEN times out.
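To make the stall concrete, here is a toy model in Python (not driver code, and not how QDOS is implemented). It assumes one loop iteration per 50Hz tick and a single background clock job; the function name `simulate` and its parameters are made up for illustration.

```python
def simulate(total_ticks, op_ticks, blocking):
    """Return the ticks on which a background clock job got to run.

    op_ticks -- ticks the network operation needs before it completes
    blocking -- True:  the driver retries by hand in supervisor mode
                       (the OPEN/CLOSE case)
                False: the driver reports 'not complete' and is retried
                       later (the TRAP #3 IO case)
    """
    clock_ran = []
    remaining = op_ticks
    for tick in range(total_ticks):
        if remaining > 0:
            remaining -= 1
            if blocking:
                # Still inside the driver: the scheduler never runs,
                # so the clock job misses this tick entirely.
                continue
        clock_ran.append(tick)
    return clock_ran
```

With `blocking=True` the clock job only starts running once the operation has finished; with `blocking=False` it runs on every tick while the operation is retried in the background.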

Likewise and a bit more irritating, if you are also running FSERVE on the client QL, its own file-server job will be inactive and thus not respond to remote inbound requests whilst it attempts to access that non-existent file at the other end.

Now, when we introduce the QLUB Adapter connected to a PC/Mac/Unix box running a QDOS/SMSQE emulator, we inherently add some additional latency to the NET IO as the messages/packets get queued up to pass out the SERial/USB port and then we wait for a reply from the QLUB to say 'all done' - rescheduling jobs and other IO between polls for a reply from the QLUB. This doesn't hurt overall throughput much (if at all) as the QLUB takes care of the actual bit-banging down the NET line, freeing up the emulator between polls, and it can retry packets much more rapidly than a native QL anyway.

However, given the principal aim of running FSERVE on the emulator to host files and devices accessible to other native QL stations, it does mean that we either live with an intermittently unresponsive file-server or else limit the emulated QDOS from making its own FSERVE Client requests outbound.

Furthermore, when sending a file through a simple NETO_x channel from the emulator/QLUB, this will also leverage that 'manual' blocking retry mechanism when the very last packet/block (flagged as EOF) is being delivered to the remote QL each time the NET channel is CLOSEd.

Not a huge limitation and, like I say, it could be worked around to an extent.

But I don't like it

So, any ideas about how we might effectively and safely re-enter the Scheduler from within an OPEN/CLOSE operation (which is running in Supervisor mode) and thus allow for these long-ish timeframes without stalling other jobs/IO running concurrently in the emulator?

tofro · Post by **tofro** » Wed May 12, 2021 12:12 pm

Martyn,

Very very generally, Trap #2 calls don't fall into the "partially atomic" category of system calls: They're not expected to re-enter the scheduler (but see below), thus should be safely callable from a job in supervisor mode. Your intended approach seems to be maybe breaking this rule (depends on implementation).

I think the answer needs to be two-fold, specific to IO.OPEN and IO.CLOSE:

IO.CLOSE actually has a mechanism to defer the actual closing (and freeing of channel memory) to a later point in time (which effectively can be the same as a "retry later"). This mechanism is just normal, because there still might be bytes to transfer in internal buffers when the program using the channel is already closing it. The mechanism is described in the Technical Guide as "wake yourself in a scheduler loop task, see if everything is done, then close the channel from the physical layer" (Page 36ff).
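As a rough illustration of that deferred-close idea, here is a small Python model (not QDOS code). The class and method names are hypothetical, and it assumes the scheduler task can push out one buffered item per pass.

```python
class Channel:
    def __init__(self, pending):
        self.pending = list(pending)  # items still waiting in the driver's buffer
        self.freed = False            # True once the channel memory is released


class Driver:
    def __init__(self):
        self.sent = []       # what actually went out on the wire
        self.deferred = []   # channels whose close is still in progress

    def close(self, chan):
        """Called once from CLOSE; must return straight away."""
        if chan.pending:
            # Not finished yet: hand the channel to the scheduler task.
            self.deferred.append(chan)
        else:
            chan.freed = True
        return 0  # the caller is always told 'no error'

    def scheduler_task(self):
        """Called on each scheduler pass; sends one item per channel."""
        for chan in list(self.deferred):
            if chan.pending:
                self.sent.append(chan.pending.pop(0))
            if not chan.pending:
                # Everything is out: now really free the channel.
                chan.freed = True
                self.deferred.remove(chan)
```

Note that `close` reports success before the data has gone out, which is exactly why this works for a best-effort CLOSE but is awkward for anything that has to report a late failure.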

Technically, I see no reason why a similar mechanism shouldn't work for IO.OPEN. Other than:

For IO.OPEN, I think there's actually a good reason for requiring it to be atomic and synchronous. This has to do with exclusive opens: as the system is implemented, there is good reason for a job that did an exclusive open of a channel and didn't receive any error to assume it now has exclusive access to the file or device. And I'm pretty sure quite a number of existing programs will rely on this. Any "deferred open" might break such mechanisms and introduce all sorts of weird behaviour.

On second thought (this is apparently not a problem for CLOSE, as that is best effort): how would you handle errors in the deferred IO.OPEN? You have already told the application "all good", and have no way of telling it the opposite now.

martyn_hill · Post by **martyn_hill** » Wed May 12, 2021 1:13 pm

Thank you Tobias!

The paragraph you refer to in the Guide for CLOSE does indicate a sensible way forward, though I'll have to investigate some more to understand how I can leverage this approach to effectively suspend the current job who asked for the CLOSE from within the driver's CLOSE routine itself. I'll take a look at how some std devices like MDV do this.

As for OPEN, I think in this particular use-case, a similar approach might still work. Why? Because the 'all-good' return would only ever be generated once the remote NET station (running FSERVE) has successfully opened (and possibly exclusively locked) the file/device and returned that result back, via the QLUB and the MQ Server.

So, based on your really helpful explanation, I might refine my original question to become:

"What are the steps involved in suspending the current job from within an OPEN/CLOSE in the device driver and thus allow rescheduling to recommence, such that the suspended job will retry the (open/close) operation upon a later release?"

The MQ Server is already built around a Scheduler task - the 'half' that reads and processes asynchronous replies arriving back from the QLUB - so I can enhance it to also check for and release a suspended job, once a suitable reply arrives over the NET (via the QLUB). Currently, the MQ Server only updates an internal Message Table to reflect the status change (and buffers any packet data in the reply), and relies on the other 'half' of the solution to actually 'deliver' the reply/packet to the requesting client.

That other half of the MQ Server is a vectored routine (findable through a named THG) that handles the inbound requests from 'clients' such as the NET and FSERVE client drivers. It either triggers the generation and sending of a new command to the QLUB, or else checks the latest status of an earlier CMD request by looking up the unique message-ID in the Message Table managed by the MQ Server's scheduled task, which may have received and processed the message reply since the last call/check by the driver/client.
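Purely as a sketch of the two halves described above, here is a Python model (the real thing is a QDOS driver, so every name here is hypothetical, as are the status values and the shape of the table entries).

```python
PENDING, DONE = "pending", "done"


class MessageQueueServer:
    def __init__(self):
        self.table = {}    # message-ID -> {"status": ..., "data": ...}
        self.outbox = []   # commands queued for the QLUB
        self.next_id = 1

    # --- client half: the vectored routine called by NET/FSERVE drivers ---
    def submit(self, command):
        """Queue a new command and return its unique message-ID."""
        msg_id = self.next_id
        self.next_id += 1
        self.table[msg_id] = {"status": PENDING, "data": None}
        self.outbox.append((msg_id, command))
        return msg_id

    def check(self, msg_id):
        """Look up the latest status (and any buffered data) of a request."""
        entry = self.table[msg_id]
        return entry["status"], entry["data"]

    # --- scheduler half: processes replies arriving from the QLUB ---
    def scheduler_task(self, replies):
        """Record each (message-ID, data) reply in the Message Table."""
        for msg_id, data in replies:
            entry = self.table.get(msg_id)
            if entry is not None:
                entry["status"] = DONE
                entry["data"] = data
```

A client would call `submit` once, then call `check` on later retries until the status changes, which fits the TRAP #3 retry model but not a one-shot OPEN.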

Hummm - it's slowly coming together in my head!

tofro · Post by **tofro** » Wed May 12, 2021 1:30 pm

martyn_hill wrote: Thank you Tobias!

The paragraph you refer to in the Guide for CLOSE does indicate a sensible way forward, though I'll have to investigate some more to understand how I can leverage this approach to effectively suspend the current job who asked for the CLOSE from within the driver's CLOSE routine itself. I'll take a look at how some std devices like MDV do this.

I don't think that on IO.CLOSE the current drivers actually suspend the calling job until the deferred CLOSE has actually happened after buffers have been emptied. CLOSE is simply assumed to succeed, and the job just carries on (which becomes clear when you read on and see that you should supply special handling for the case of something being immediately re-opened that you haven't even closed yet). It might be challenging to change this - you're not supposed to trigger a re-schedule from a driver in supervisor mode (which you effectively do when suspending your calling job).

martyn_hill · Post by **martyn_hill** » Wed May 12, 2021 1:42 pm

Ah! I misunderstood!

Ok. Well, the 'close/add a scheduler call to complete' scenario would at least take care of the NETO_x on the last packet.

I may need to simply accept a compromise on the support for FSERVE client use on the emulated QL/QLUB.

Thanks again, Tobi!

mk79 · Post by **mk79** » Wed May 12, 2021 5:44 pm

Yeah, this is where some of the QDOS concepts start to show their age. Only one job can be in the kernel/IO system at a time. What happens with Trap #3, at least in SMSQ/E, is that after the driver says it is not done yet, the saved program counter is altered to the location of the TRAP call, the supervisor stack is tidied up (all registers restored etc.) and then a re-schedule is triggered. When the job is scheduled again the Trap #3 will be executed again as if this was the first time, except that D3 is now one lower. This works because the calls are designed to be re-callable.
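A very simplified Python model of that retry loop, for illustration only: the "job" is just a dict with `pc`, `d3` and `d0` fields, the driver is reduced to a list of canned results, and the timeout is assumed to be a plain non-negative tick count. None of this is actual SMSQ/E code.

```python
NOT_COMPLETE = "not complete"
OK = "ok"


def run_job(driver_results, timeout):
    """Execute one TRAP #3 call, re-running it while the driver is not done.

    Returns (final D0, remaining D3, number of reschedules).
    """
    job = {"pc": 0, "d3": timeout, "d0": None}  # pc 0 = sitting at the TRAP
    results = iter(driver_results)
    reschedules = 0
    while job["pc"] == 0:
        outcome = next(results)          # the driver is entered afresh
        if outcome == NOT_COMPLETE and job["d3"] > 0:
            job["d3"] -= 1               # D3 is now one lower
            reschedules += 1             # other jobs get to run here
            continue                     # pc untouched: the TRAP runs again
        job["d0"] = outcome              # success, or timeout used up
        job["pc"] = 1                    # step past the TRAP
    return job["d0"], job["d3"], reschedules
```

The key point the model tries to capture is that between attempts the job is outside the kernel, parked at the TRAP instruction, so the driver has to cope with being entered from scratch each time.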

So it's not like the job waits in the kernel for another chance; it waits in user space. But there is no mechanism for re-trying an OPEN call without changes to the operating system. So I think this is an interesting problem, but I fear you cannot do much about it without some serious effort.

martyn_hill · Post by **martyn_hill** » Wed May 12, 2021 5:52 pm

Thanks Marcel!

Yes, that was the conclusion I was coming to

I can let that one go and focus on completing the re-coding of the driver for the time being.

The Sinclair QL Forum
