Re: ULA ZX8301 - TV Picture Capabilities
Posted: Wed Nov 29, 2017 4:28 pm
In detail:
The 8301 operation is very much a slave of the process of picture generation.
This repetitive process reads out the 32k of screen RAM as RGB pixels about 50 times a second, as a frame of lines of pixels.
The process is based on CRT technology, but even the most modern monitors use a version of the same. It is repeated over and over because the actual screens do not retain information for long, much like Dynamic RAM, so the contents must be refreshed. Indeed, the contents must also be able to change dynamically, so a 'new version' comes out of the RGB connector every 20ms.
Because the picture was originally drawn (almost literally) by a cathode ray on a luminescent surface, it is composed of a number of pixels in a line, followed by a sync pulse (which is a kind of 'carriage return' + 'new line' for monitors), followed by a period of black pixels which corresponds to the time needed for the beam to return to the starting position.
In a similar fashion, lines are displayed one under the other, each drawn from left to right, until the bottom end of the screen is reached. Then comes a sync pulse (this time it's the 'vertical sync'), followed by a number of lines filled with black pixels, during which the beam returns to its top left starting position.
In reality the sync pulses do not happen immediately after the visible pixels but slightly after, so there is a bit of an unused 'border' on all sides. Although, as we know, when monitor mode is selected on the QL, i.e. the full width of the screen is used, some of the contents will end up just off the edges of the screen. It will soon be apparent why.
The 15MHz crystal
At first glance it's not easy to figure out why 15MHz was used except that you get the 7.5MHz CPU clock out of it by dividing by 2, which is a trivial operation in digital electronics. However, a look into the traces explains this, as well as calculating the requirements of the video standard.
The total length of one line should be 64us as defined in the standard. The visible portion should be some 48us, into which all the visible pixels in a line should fit. For the QL in mode 4, this is 512 pixels. These have to be shifted out using some clock that is available. At 15MHz, 48us would hold 720 pixels, but we know it's 512. Fitting 512 pixels into 48us would need 93.75ns per pixel. The closest we can get from the available 15MHz is to divide it by 1.5, getting us 10MHz, or 100ns per pixel. But this gives us 51.2us as the visible area, more than the 48us available, so 3.2us end up displayed in the 'invisible part' - and now we know why at full 512x256 resolution, a small portion (left, right or both sides) of the screen is not visible.
However, 640 total pixels each 100ns 'wide' gives us exactly 64us for the total line length, i.e. the horizontal (or composite) sync period, exactly what we need. Further, this is also exactly divisible by 66.6666ns (the period of the 15MHz clock) and produces 960 periods of the 15MHz clock - important because our CPU runs at half this clock, so in essence the CPU runs in sync with the screen refresh process. Therefore a repetitive algorithm can be used to satisfy both sides - the CPU and the screen generation. Finally, if one studies the CPU datasheet carefully, one sees that all the CPU access cycles actually use both edges of the CPU clock - so having a clock at double the CPU speed is a very big plus for any logic that generates signals based on the CPU clock, as logic is normally triggered on a single edge (once per clock cycle). This gives us a 15MHz clock pulse for each CPU clock edge, and thus the ability not only to track but also to PREDICT the CPU timing.
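To make the arithmetic above easy to re-check, here is a small Python sketch. It only restates the numbers from this post (48us visible, 512 pixels, 15MHz crystal); the variable names are mine and are not taken from any QL documentation.

```python
from fractions import Fraction

VISIBLE_US = 48      # visible part of a line, as quoted above
PIXELS = 512         # mode 4 pixels per line
CRYSTAL_MHZ = 15

# Pixels that would fit into the visible part at the raw crystal rate.
pixels_at_crystal = VISIBLE_US * CRYSTAL_MHZ                # 720

# Pixel period needed to fit exactly 512 pixels into 48us.
ideal_pixel_ns = Fraction(VISIBLE_US * 1000, PIXELS)        # 93.75ns

# Dividing 15MHz by 1.5 gives the pixel clock actually used.
pixel_clock_mhz = Fraction(CRYSTAL_MHZ) / Fraction(3, 2)    # 10MHz
pixel_ns = 1000 / pixel_clock_mhz                           # 100ns

# 512 pixels at 100ns each, and how far that overshoots 48us.
visible_us = PIXELS * pixel_ns / 1000                       # 51.2us
overshoot_us = visible_us - VISIBLE_US                      # 3.2us

print(pixels_at_crystal, float(ideal_pixel_ns), float(pixel_ns))
print(float(visible_us), float(overshoot_us))
```

Fractions are used instead of floats so that values like 93.75 and 3.2 come out exact.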
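The same relationships can be counted in whole clock edges, which is roughly how logic inside a chip would have to see them. This is just a sketch in Python with names of my own choosing; it assumes nothing beyond the 15MHz crystal, the divide by 1.5 and the 640 pixel line.

```python
CRYSTAL_HZ = 15_000_000
TOTAL_PIXELS = 640

# One pixel lasts 1.5 crystal periods, i.e. 3 half-periods (clock edges).
HALF_PERIODS_PER_PIXEL = 3

half_periods_per_line = TOTAL_PIXELS * HALF_PERIODS_PER_PIXEL    # 1920
crystal_periods_per_line = half_periods_per_line // 2            # 960
cpu_clocks_per_line = crystal_periods_per_line // 2              # 480

# The CPU clock has two edges per period, so there is exactly one
# 15MHz period for every CPU clock edge.
cpu_edges_per_line = 2 * cpu_clocks_per_line                     # 960

line_us = crystal_periods_per_line * 1_000_000 / CRYSTAL_HZ      # 64.0

print(crystal_periods_per_line, cpu_clocks_per_line, line_us)
```

Everything divides without a remainder, which is the whole point: pixel clock, crystal and CPU clock all start over together on every line.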
As it turns out, this is exactly what the 8301 does.
Video timing
The 8301 generates 312 lines of 640 mode 4 pixels, each line also corresponding to 480 CPU clock periods.
Each line is divided into 40 chunks, 32 of which contain the 512 visible pixels in each line, and 8 of which are the retrace periods. So, chunks 0 to 31 are visible, and 32 to 39 are forced black, i.e. invisible. The horizontal portion of the CSYNCHL signal is a pulse that is active during chunks 34, 35 and 36.
Out of the 312 lines, 256 are used for the picture and 56 are forced black, with VSYNCH occurring at approximately line 288. If someone needs a precise number I'll re-measure this.
The importance of the 40 chunks within each line might not be apparent until one considers that they are nicely expressed by a whole number of both the 10MHz pixel clock periods (16 pixels) and CPU clock periods (12 clocks, or 24 15MHz clocks).
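For anyone who wants to play with the numbers, the chunk structure can be written down as a few constants and one helper. This is a Python sketch; the function name and the labels it returns are my own invention, and the chunk numbering simply follows the description above.

```python
CHUNKS_PER_LINE = 40
PIXELS_PER_CHUNK = 16
CPU_CLOCKS_PER_CHUNK = 12
LINES_PER_FRAME = 312
LINE_US = 64

pixels_per_line = CHUNKS_PER_LINE * PIXELS_PER_CHUNK            # 640
cpu_clocks_per_line = CHUNKS_PER_LINE * CPU_CLOCKS_PER_CHUNK    # 480
crystal_clocks_per_chunk = 2 * CPU_CLOCKS_PER_CHUNK             # 24
visible_pixels = 32 * PIXELS_PER_CHUNK                          # 512

# Frame time follows from 312 lines of 64us each.
frame_us = LINES_PER_FRAME * LINE_US                            # 19968
frame_hz = 1_000_000 / frame_us                                 # about 50.08

def chunk_kind(chunk):
    """Classify chunk 0..39 of a line as described in the text."""
    if not 0 <= chunk < CHUNKS_PER_LINE:
        raise ValueError("chunk must be 0..39")
    if chunk < 32:
        return "visible"
    if chunk in (34, 35, 36):
        return "blank+hsync"    # horizontal part of CSYNCHL active
    return "blank"

print(pixels_per_line, cpu_clocks_per_line, frame_us, round(frame_hz, 2))
```

The frame comes out at 19.968ms, which is the 'every 20ms' and 'about 50 times a second' mentioned at the start.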
The 10MHz clock is generated from the 15MHz clock by using double-edge triggering on the 15MHz clock, and counting 3 edges of the 15MHz clock for each 10MHz clock period. This normally generates a 10MHz clock with a 2:1 duty cycle. However, this clock is not directly visible anywhere outside the 8301, so the exact edge to edge correspondence is not easy to find out. This is of some importance (see below) but only if one wants to 'read' the RGB outputs in order to do something clever with them, such as produce a 16 color mode out of 2 subsequent mode 4 pixels with external hardware.
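A divide by 1.5 of this kind can be modelled as a modulo 3 counter that advances on every edge (rising and falling) of the 15MHz clock. The Python sketch below is only an illustration of the principle: which phase is high and where the counter starts are assumptions on my part, since the real clock cannot be observed from outside the 8301.

```python
def pixel_clock_levels(n_edges, high_edges=2):
    """Return the 10MHz clock level following each 15MHz clock edge.

    A modulo 3 counter advances on every edge of the 15MHz clock.
    The output is high for `high_edges` counts out of every 3.
    Phase and polarity are assumed, not measured.
    """
    levels = []
    count = 0
    for _ in range(n_edges):
        levels.append(1 if count < high_edges else 0)
        count = (count + 1) % 3
    return levels

# 12 edges = 6 periods of 15MHz = 400ns = 4 periods of the pixel clock.
levels = pixel_clock_levels(12)
print(levels)
```

Each list entry stands for one half-period of the 15MHz clock (about 33.3ns), so a pattern of two entries high and one low is the 2:1 duty cycle.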
Accessing video data
The 8301 uses a fixed scheme of access that it repeats within each of the 40 chunks of 12 CPU clocks which make up each display line.
There are a maximum of 3 combinations:
1) When no screen data is accessed (during chunks 32 to 39), a scheme using 4 CPU clock cycles and double edge triggered logic is used to generate RAM timings. The CPU can start an access on any rising edge (which is how the 68008 normally works), and it will take 4 cycles to complete. If the CPU attempts to start an access 3 or fewer cycles before a chunk where video data needs to be accessed, it will be ignored and operation will continue as described below. The important thing to say here is that ONLY during chunks 32 to 39, so ONLY 20% of the time, does the CPU have more or less full speed access to the motherboard RAM.
2a) When screen data is accessed (during chunks 0 to 31 for each of the 256 visible display lines), the first 8 CPU clock cycles of each chunk are dedicated to screen RAM data access, during which the VDA and TXOEL signals are high, preventing any contact between CPU and RAM. If the CPU starts a RAM access cycle during this time or less than 3 cycles before chunk 0, it will be ignored at first, and the CPU is then given access during the last 4 CPU cycles of the total 12 in a chunk. At this point the standard 4 cycle DRAM timing will be performed for the CPU. This means that 80% of the time, the CPU only has access to the RAM 4 out of 12 cycles, i.e. at only 1/3 of the maximum theoretical speed.
2b) - and this one is not nice - during chunks 0 to 31 for each of the 56 invisible screen lines, no data needs to be accessed for the screen, BUT the 8301 behaves exactly the same, except that it does not access data but rather refreshes the DRAM. In actuality, it still uses 8 CPU clock cycles out of 12 for itself, but does not activate any of the CASL lines, thus turning the usual screen RAM access into a refresh cycle. This means that even for the 56 lines when no screen data is needed (56 of 312 lines, or nearly 18% of total time), the CPU is still slowed down the same as for visible lines.
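The three cases can be put into a small lookup, which makes it easy to count what is left for the CPU. This is a Python sketch with hypothetical names; it assumes lines are numbered so that 0 to 255 are the visible ones, which is a convenience of the model and not something I have measured.

```python
VISIBLE_LINES = 256      # lines 0..255 carry picture (numbering assumed)
LINES_PER_FRAME = 312
CHUNKS_PER_LINE = 40

def ram_slot(line, chunk):
    """Return (owner of the first 8 clocks, CPU clocks usable in this chunk)."""
    if not (0 <= line < LINES_PER_FRAME and 0 <= chunk < CHUNKS_PER_LINE):
        raise ValueError("line or chunk out of range")
    if chunk >= 32:
        return ("cpu", 12)       # case 1: nothing reserved
    if line < VISIBLE_LINES:
        return ("video", 4)      # case 2a: 8 clocks of screen data fetch
    return ("refresh", 4)        # case 2b: 8 clocks of refresh, no CASL

def cpu_clocks_in_line(line):
    """Sum the CPU-usable clocks over all 40 chunks of one line."""
    return sum(ram_slot(line, c)[1] for c in range(CHUNKS_PER_LINE))

print(cpu_clocks_in_line(0), cpu_clocks_in_line(300))
```

Both a visible line and a blanked line give the same total, which is exactly the 'not nice' part of case 2b. The model ignores accesses lost because the CPU asks at an awkward moment.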
The video data itself is accessed using DRAM page mode, which is a short 'burst' access mode that reads consecutive data within the same RAM row, in this case 4 bytes.
DRAM in general is organized as a roughly square array of memory cells, which is why RAS (row address strobe) and CAS (column address strobe) signals are given to the chips, and why the address is multiplexed, row first, then column. Internally, the RAM actually reads a whole row of bits - in the case of a 64k x 1 bit RAM as used in the QL, 256 bits are read at once and held in a 'column register'. The column address then selects one bit out of those 256. However, once the row has been read, data within it can be accessed very quickly by changing the column address only.
This is what the 8301 uses to access video data. It sets up the row address, drives RASL low to latch it into the RAM chips, then sets up a column address and drives CAS0L low to read the data, then sets up the next column address, drives CAS0L high then low again to access the next consecutive bit in each chip, and does this 4 times total. So, instead of accessing one byte in 4 clocks as would be the case for random access, it manages to get 4 bytes in 8 clocks, twice the regular access speed. But, as was explained above, even that penalizes the CPU severely. Out of every 480 clocks in each display line, only 224 are available for CPU access, and even then some might be lost due to sync (as when the CPU does not start a cycle on a modulo 4 clock boundary because of internal operations). This means the CPU can access motherboard RAM at most at about 46.7% (224/480) of the theoretical maximum speed.
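As an illustration of the multiplexed addressing, here is a Python sketch that splits a 16 bit cell address into an 8 bit row and an 8 bit column, as a 64k x 1 chip needs. Note that putting the row in the high bits is purely an assumption for the example; the real assignment of CPU address lines to row and column on the QL is different and is not modelled here.

```python
ROW_BITS = 8     # 256 rows
COL_BITS = 8     # 256 columns, 256 x 256 = 64k cells

def split_address(cell):
    """Split a 16 bit cell address into (row, column).

    The row is taken from the high bits here. That mapping is an
    assumption for illustration, not the QL's actual wiring.
    """
    if not 0 <= cell < 1 << (ROW_BITS + COL_BITS):
        raise ValueError("cell address must fit in 16 bits")
    return cell >> COL_BITS, cell & ((1 << COL_BITS) - 1)

def same_row(a, b):
    """True if both cells can be reached with a single RAS (page mode)."""
    return split_address(a)[0] == split_address(b)[0]

print(split_address(0x1234), same_row(0x1200, 0x12FF), same_row(0x12FF, 0x1300))
```

Cells for which same_row is true are the ones that can be read back to back by changing only the column address.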
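A few cross-checks on these figures, written as a Python sketch. The names are mine; the only facts used from outside this post are that mode 4 uses 2 bits per pixel and that the 68008 moves one byte per bus access.

```python
from fractions import Fraction

BYTES_PER_BURST = 4
CLOCKS_PER_BURST = 8
CLOCKS_PER_RANDOM_ACCESS = 4
CPU_HZ = 7_500_000

# Page mode versus random access, in bytes per CPU clock.
page_mode_rate = Fraction(BYTES_PER_BURST, CLOCKS_PER_BURST)    # 1/2
random_rate = Fraction(1, CLOCKS_PER_RANDOM_ACCESS)             # 1/4
speedup = page_mode_rate / random_rate                          # 2

# One burst per visible chunk: 16 pixels at 2 bits each is 4 bytes.
bytes_per_chunk = 16 * 2 // 8                                   # 4
screen_bytes = bytes_per_chunk * 32 * 256                       # 32768

# CPU share of the RAM: 4 clocks in chunks 0..31, 12 in chunks 32..39.
cpu_clocks_free = 32 * 4 + 8 * 12                               # 224
cpu_share = Fraction(cpu_clocks_free, 480)                      # 7/15

# Upper bound on CPU RAM accesses per second (one byte each).
max_accesses_per_s = CPU_HZ * cpu_share / CLOCKS_PER_RANDOM_ACCESS

print(speedup, screen_bytes, float(cpu_share), int(max_accesses_per_s))
```

The burst size times 32 chunks times 256 lines lands exactly on the 32k of screen RAM, which is a nice confirmation that the 4 byte burst per chunk is right. The access figure is a best case, before any losses due to sync.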
Also, the 8301 only uses RAM bank 0 (CAS0L) for video data access.
Issue 5 boards and 8302
On issue 5 boards, the 8302 is connected to the RAM bus and for all intents and purposes, accessing it has exactly the same characteristics as accessing RAM - and is subject to the same slowdown.
One not so apparent problem here is that the 8301 needs rather substantial drivers on RAM address and data pins - as there are 16 chips there, with all addresses in parallel, so each address line drives 16 input pins on the RAM chips, the pin on the 74LS257 multiplexer, and a whole lot of copper trace. However, for each bit, the RAM chips used have separate inputs and outputs which are tied together, as well as tied together to the same pair connected to the corresponding bit in the other bank - so, each data line on the 8302, as connected on issue 5 boards, drives 6 pins, 2 on each DRAM in bank 0, 2 on each DRAM in bank 1, the 8301 (this only ever reads data), and the 74LS245 bus transceiver.
So, not only is there a timing inaccuracy (due to 8301 screen read - this probably results in problems with net access) when accessing the 8302, it also needs to drive more chips than expected. On issue 6, both are solved - the 8302 is decoded directly by the HAL and it drives 4 pins - the CPU, the 74LS245 and two ROM chips.
When the 8301 detects A17=0 and A16=1, it assumes an IO access. As a result, no DRAM bank is selected via CASL lines, even though a fake RAM access cycle is generated. Instead, PCENL is generated, with timing very similar to CASL.
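The decode described here is simple enough to write out. This Python sketch looks only at A17 and A16, as stated above; any decoding of higher address lines done elsewhere on the board is not modelled, and the function names are made up for the example.

```python
def is_io_access(address):
    """True when A17=0 and A16=1, the condition the 8301 treats as IO."""
    a17 = (address >> 17) & 1
    a16 = (address >> 16) & 1
    return a17 == 0 and a16 == 1

def strobe_for(address):
    """Name of the strobe driven in the (possibly fake) RAM cycle.

    Returns 'PCENL' for an IO access and 'CASL' otherwise. Which of
    the CASL lines is used for a real RAM access is not modelled.
    """
    return "PCENL" if is_io_access(address) else "CASL"

for addr in (0x00000, 0x18000, 0x20000, 0x30000):
    print(hex(addr), strobe_for(addr))
```

Only the second address in the loop has A17=0 and A16=1, so it is the only one that gets PCENL instead of a CASL strobe.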
When the 8301 acts as a DRAM controller for the CPU, it uses a simple sequence - ROWL is initially low to present the row address to the DRAM, after which RASL goes low to latch this address into the DRAM, then ROWL goes high with a slight delay (because DRAM expects the row address to persist for a short time after RASL goes low, before switching over to the column address), and then on the next half cycle, i.e. after a bit more delay, CASL goes low.
In order to know if its internal MC register is to be written, or the 8302 registers are to be accessed, the 8301 needs address line A6, which it does not have available as a signal from a pin. Instead, it can read its state from the RAM address lines, as it is contained within the column address. Because of this, it cannot generate PCENL, which is the 8302 chip select signal, until ROWL goes high and A6 becomes available on DRAM address bit DA3. Consequently, the 8302 access time is actually quite a bit faster than a 68008 can manage, and it should work just fine with a faster CPU provided it's connected as on issue 6 and later motherboards.