Let me backtrack here a bit, re 'blocking' and 'non-blocking' tasks.
Once a second CPU is in the picture to handle IO, almost all tasks become 'non-blocking' in 'Dave-speak'.
The reason some things take over the entire machine is mostly historic. The underlying principle, though, is always present, and it rears its head whenever something like interrupt handling overhead takes up more time than is available between data transfers. This of course changes depending on the required data rate and CPU speed.
I will give two examples, one very relevant to the UART questions.
The first is a floppy controller. Since it does not have any sort of read/write buffer, data transfer from RAM to disk and back has to happen completely real-time. In effect, once a 'read sector' or 'write sector' command is given (used here as the quintessential example for the problem at hand), the CPU polls for data by looking at a bit in a status register, and then reads or writes data when the bit state tells it that data is available or required, as the case may be.
All floppy controller chips ever used on the QL can generate interrupts for the same condition, basically an 'I need/have data' interrupt.
Two problems happen here:
1) Interrupt overhead + processing takes more time than there is between data requests, or close enough. In our example there is an underlying problem, and that is that various floppy drive bit densities require various speeds of reaction. HD floppy is already fast enough that the old QL would not be able to process this under interrupt. It should be noted that when the interrupt occurs, the handler already knows a few things: whether it is for read or write, where the data is coming from and where it needs to go, and the data count. If one were to write a short snippet of code that does this, it would indeed be very short - a few instructions. But the OS arrives at them only after going through several layers of other code, plus the CPU's own interrupt latency + processing overhead. Compared to all of that, the actual data move takes only a VERY short fraction of the time.
2) Suppose the system CAN in principle cope with the frequency of interrupts. The problem that arises now, given this is a multitasking OS, is that certain operations (mostly regarding resource management or driving other peripherals) require intervals when the executing code must not be pre-empted, i.e. must not be interrupted, and the most common way to ensure this is to mask the relevant interrupt. Since there is really only one interrupt level on the QL, and masking of it is sometimes abused, this adds an almost completely unpredictable component to interrupt latency (the time from an interrupt being raised to its being handled), which may break normal IO operation for such devices.
Once we are in danger of this, the only available solution is to switch to direct polling, and yes, because we do not want anything to interrupt the data transfer or data may be lost, interrupts are disabled during that time, and the machine is prevented from doing anything else except data transfer. However, if we examine the data transfer loop, we find it spends most of its time running in a tight loop waiting for the relevant status bit to change, to know when data is needed or is available, and then one instruction to transfer data.
The second example is a serial port. Old serial chips could also raise an interrupt when data was available or requested, and this is also how handshake was handled. The 16550 is a notorious example, as it is in fact derived from an even older and simpler 8250. Unlike a floppy controller, the serial port can influence the data transfer by using handshake (that is of course, if it is used and enabled in the first place). Even so - once the data rate increases, we run into the exact same problem as with the floppy controller, with the one difference that we CAN stop the data flow in order to give the CPU some breathing room, IF we can react soon enough between byte transfers. This in fact still requires the very same guaranteed short interrupt latency as a floppy controller, except it has to operate at a (several) byte level, rather than at a floppy sector level (2^N x 512 bytes). For most cases it's the same problem, by the logic of 'if you can do it for one byte, you can do it for many', as the whole process operates byte by byte.
The obvious idea here is: is there a way to work at a 'several bytes by several bytes' level? An attempt was made in the update from the 16450 serial port chip to the 16550: FIFO buffers (16 bytes) were added in the data path. The idea was that when a byte is received, an interrupt is generated, and if data continues to arrive while the CPU has not yet reacted to the interrupt, there is a buffer capable of receiving it without data loss. As long as the CPU reacts within the time it takes to receive up to 16 bytes, things will be fine. Also, to make it possible to do things at a several-bytes rather than per-byte level, a FIFO threshold system was added so that an interrupt does not happen until a number of bytes have been received, lowering the frequency of the interrupt at the cost of a tighter latency margin. Unfortunately it was one of the more half-a**ed attempts, as it did not add hardware handshake, so handshake still has to be handled by interrupt, and this is a problem for transmitting bytes, where the provided FIFO loses practically all utility. Fortunately, even newer versions FINALLY implemented real hardware handshake, preventing data loss while still relaxing latency requirements for the CPU.
The FIFO buffer capitalizes on the capability of the CPU to transfer data quickly once it does react to the interrupt - and since the reaction time is usually the major part of the time the CPU spends transferring X amount of data, you get significantly more X in almost the same time by handling more than one byte of data per interrupt.
Now I will return for a short while to the floppy controller example - for one, because 1Mbit/s transfer rates are not unknown with modern serial ports, and that just happens to be the transfer speed of a HD floppy. The latest generation of floppy controllers (and I think this includes the one on the GC/SGC) also include a FIFO in the data path - considering the floppy works on sector-sized chunks of data, this could be really useful, but it was never used on the QL. Fortunately, when hard drives became available, the data transfer speed was already high enough that the designers thought to include a sector (or multi-sector) buffer, and the CPU got interrupted only when it was filled up with newly read data or the data filled in by the CPU had been written to disk. There is more, see below.
Finally, let me also write a bit on a method that can also be used to handle the problems above (and this is VERY relevant for the QL).
Given that there is a maximum speed for the above peripherals, which we can calculate in advance, we can also calculate the worst-case time it takes to fill up a FIFO. For instance, if we decide we run the serial port at a maximum of 1Mbit/s, that translates to a 100kbytes/s transfer rate, and if the receive FIFO is 16 bytes, it can fill up at most 6250 times a second (100000/16). So, if we generate a periodic interrupt at a rate always higher than that, we can use it to poll for service requests from such devices - and in fact from more than one, as transferring up to 16 bytes from the FIFO takes only a very short time. You pay one single interrupt latency on every poll, and then go through a list of devices that may have received data within that time frame, and transfer it if it is indeed there. As far as I remember, this is the way the Q40/60 and probably the Q68 work, by using a much higher frequency of poll interrupt.
In fact, I wish more capable QL OS systems had a fast poll interrupt along with a slow poll one. The fast one would ONLY ever do data transfers, and the interrupt handler linked list would basically have a prototype handler that gets a pointer to a register structure of the IO device, knows what to do to check if the device needs to be serviced, and transfers X data between data port and memory, and that's it. It has to be as short as possible, especially the 'check for service' part. And it can also have a limit on how much data it can transfer in one poll. If more granularity is needed in which device gets more or less bandwidth, the maximum amount of data transferred can be adjusted, as well as the number of times and at which positions in the poll list the device is linked (which is an interesting technique TT himself discussed once regarding his Stella RTOS).
One step above that (but possibly still internally using the same principle) is using an entirely separate CPU to do exactly the above, or indeed react to interrupts directly. It is HIGHLY advantageous if that CPU executes its interrupt handling code in separate memory, or even better, in on-chip cache. If this is done, the main CPU is only ever bothered for actual data transfers (or control register reads), it never sees IO related interrupts, the system does not see the cost of interrupt reaction latency (it's the other CPU spending time on that), and data transfers can be very fast and also (unlike DMA, which was kind of a beta version of this idea) as clever as you need them to be.
An alternative solution is to again use a second CPU that handles IO entirely, and communicates with the main CPU through a common memory buffer or a set of FIFO buffers - in other words, that second CPU or uC then buffers the data, handles the protocols, and in general takes care of the particulars of the IO devices, while presenting a simplified 'abstraction' to the main CPU, and only bothering it when the low level tasks of assembling data into a buffer or transmitting data from a buffer has been done - or the main CPU can also poll the status of the buffers.
It goes without saying that both of the above approaches can be (and often are) used together (in fact the latter one is precisely how most mass storage devices work internally), and either of them lightens the load on the main CPU quite significantly.