8301 (ZX8301, the QL's Master Chip MC) - facts and figures

Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: 8301

Post by Nasta »

Here is a bit more about how the 8301 works.
Beware, there will be some numbers to understand, as well as a lot about timing based on various clock cycles.
I will start with the more complicated bit, which is how the 8301 actually manages to read the screen RAM and use the data to create the image on the screen.

A bit about CRT screen basics is needed here - and the plus side is, much of the way these used to work is still the underlying logic for more modern flat panel displays.
QL users already know that the MODE 4 resolution is 512 x 256 pixels - so, 512 pixels on each of the 256 lines. However, a CRT monitor does not actually have 'pixels' in the usual sense; rather, each line has a part within which you can display an image, and a part that is not seen and falls 'outside of the screen'. This already gives us a clue as to why some of the QL's picture gets clipped at the sides - the part the QL uses to display pixels is actually slightly wider than standard and extends outside the screen, into the 'invisible area'.
The CRT displays a picture using a 'raster' of lines: basically it draws the display using a focused 'dot' of electrons hitting the phosphors on the screen. Depending on the beam current (which is basically the amount of electrons hitting the screen), the point will be more or less bright. The dot is made to move from the top left-hand corner in horizontal lines until it reaches the right end of the screen, then it returns quickly back to the left-hand side but a bit lower, so the next line it describes lands under the previous one. It repeats this until it reaches the bottom of the screen, then returns to the top left-hand corner.
While it is doing that, the actual video signal modulates the electron dot and thus produces a picture.
In order for the monitor to know when to return from the right side of the screen to the left, there is a horizontal synch pulse that starts the 'retrace' back to the left side. Similarly, there is a vertical synch pulse that tells the monitor when the last line has been drawn and the dot should return to the top of the screen. In actuality, horizontal and vertical movement are independent, so it is up to the device driving the monitor to properly generate the synch signals to get a stable picture.
One thing to know is that actual timing back then was based on a standard TV specification, and it is quite rigid. To top it off, it takes time for the dot to react to the synch signals and to return to the left side, and during that time the dot travels backwards at a higher speed. The video signal should therefore be turned off (black level), or it would also write an image backwards, with less precision and resolution, since the dot is now travelling faster and not necessarily as precisely as in the 'usable' left-to-right direction. As the definition of one display line is basically the time from one synch pulse to the next, this is why there is a portion of the line within which video can be displayed, while the other, non-usable part is called the retrace period.
The same exact logic applies to the vertical direction, but now the period between the synch pulses is expressed in lines. Similar to how each line is composed from a visible and invisible part, so is the entire frame of lines composed of visible and invisible lines, where the invisible lines now form the vertical retrace period.
For PAL TV, on which the QL video is based, each line takes 64us and there are 312 lines per frame, so the entire frame takes about 20ms to draw, resulting in a frame frequency of 50Hz.

Quick aside: Real PAL TV uses two consecutive frames with 312.5 lines each (.5 meaning the vertical synch pulse happens at about half the 313th line of the even frame and at the end of the 312th line of the odd frame) to get a total of 625 lines of vertical resolution by drawing first the even and then odd lines, at an effective 25Hz rate, the contents of the picture rarely being wildly different between the even and odd frame so the interlacing reduces flicker. However, the QL simply uses the non-interlaced version which reduces the vertical resolution to 312 lines but with twice the refresh rate, which is more suitable to a computer display where contents of lines can be completely independent, so flicker would be quite annoying at 25Hz even at twice the vertical resolution.

Now, remember that I said that not all of a line can be used, and neither can all of the lines be used to display pixels. Also, the signal that modulates the electron dot has a 'bandwidth', or, simply put, a maximum frequency so the number of pixels one can put into the allotted 64us of time is also limited. For a color system it's about 400 or so pixels in ideal circumstances, realistically around 320 if you used relatively high spec commercial video circuits and ICs. However, if you can drive the 'dot' directly, then it comes down only to the circuits in the actual monitor, and the sharpness of the dot - but the basic timing remained (back then) based on standard TV, if you wanted to avoid going bankrupt.
What this means is, one line is 64us, out of which 48us should be used for the actual pixels, and there are 312 lines out of which up to 288 in theory could be used for pixels, in other words, this defines the 'visible screen area'.

Since the QL was intended to have an 80-character text display, and (I will jump ahead a bit here) we know the pixels come out at a 10MHz rate, each pixel takes up 0.1us; so if 48us is visible, at 0.1us per pixel, that gives us 480 pixels per line. Using a 6-pixel-wide character, we get exactly 80 characters per line. And if we wanted to use all of the 288 lines available for display, that would give us 138240 total pixels at a 480x288 resolution, which at 2 bits per pixel comes out to 34560 bytes. And we do know that computers rather like things to be numbered in powers of 2, because it simplifies addressing of those bytes.
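As a quick sanity check, here is the arithmetic above as a small Python sketch (plain back-of-envelope calculation, nothing QL-specific):

```python
# Back-of-envelope check of the 480 x 288 figures above.
visible_us = 48.0                              # usable part of a 64us PAL line
pixel_us = 0.1                                 # one pixel at 10 MHz
pixels_per_line = round(visible_us / pixel_us) # 480
chars_per_line = pixels_per_line // 6          # 6-pixel-wide characters
visible_lines = 288
total_pixels = pixels_per_line * visible_lines
bytes_needed = total_pixels * 2 // 8           # 2 bits per pixel
print(pixels_per_line, chars_per_line, total_pixels, bytes_needed)
# 480 80 138240 34560
```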

Let's explain this in a bit more detail.
The way various timing signals are generated by digital logic is to use a master clock and then count cycles of it to get the various periods and frequencies. Again, knowing that 10MHz is used to drive the video system, all timings are derived from units of 0.1us. In this case it means that a whole display line (visible plus invisible part) takes 640 clocks at 10MHz, which is why the horizontal synch frequency is 10MHz divided by 640, giving us 15625Hz. This signal is then used to count lines, again visible and invisible ones, 312 total.
If one uses standard binary counting, starting at 0, this means that one line has 640 cycles numbered 0 to 639. Lines are numbered 0 to 311. In binary, 639 requires 10 bits to encode, and the pixels in a line get numbered from 0000000000b to 1001111111b, the last number being 639=512+127. Once count 640 (1010000000b) happens, the counter is reset to 0, so one could detect the combination 1010000000 to reset the line pixel counter - and in fact, this is done by detecting only the two 1s in the entire 10-bit counter, which makes the entire 'reset' circuit very simple.
For the line counter, 311 requires 9 bits to encode, and the numbers go from 000000000b to 100110111b. Once state 312 comes up the counter is reset, and 312 being 100111000b, the hardware can detect the four 1s in there using a 4-input AND gate to reset the counter, so still not too bad for reset logic.
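The 'detect only the 1 bits' reset trick can be sketched in a few lines; the masks are simply the binary values of 640 and 312 quoted above:

```python
# Terminal-count detection using only the bits that are 1 in the
# terminal value: 640 = 0b1010000000, 312 = 0b100111000.
H_MASK = 0b1010000000   # two 1s  -> a 2-input AND gate in hardware
V_MASK = 0b100111000    # four 1s -> a 4-input AND gate

def h_reset(count):
    return (count & H_MASK) == H_MASK

def v_reset(count):
    return (count & V_MASK) == V_MASK

# No count below the terminal value trips the detector, because the
# masked bits alone already sum to the terminal value.
assert h_reset(640) and not any(h_reset(c) for c in range(640))
assert v_reset(312) and not any(v_reset(c) for c in range(312))
```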
However, deriving the address of the data to read from screen RAM (which goes from 0 to 34559) from the state of the pixel and line counters, at the right time, is so complex that it is easier to have a separate 16-bit counter for the address (16 bits being what is needed to encode numbers 0 to 34559), reset it when the vertical counter is reset, and let it count only for certain combinations of states of the horizontal and vertical counters - namely, only while the horizontal counter is counting from 0 to 479 and the vertical counter from 0 to 287. So along with a 16-bit counter (which can be a problem: in a simple ripple implementation it takes time for the carry from the lower bits to propagate into the higher bits, so it could get too slow, and a synchronous counter would have to be used instead, which needs a LOT more logic for long counters), one would also need additional logic to figure out when the counter should advance and when not.

So what happens if the horizontal and vertical resolutions were some convenient power of 2? This is now getting awfully familiar, as using the closest values would be 512 horizontal and 256 vertical.
This makes the pixel count go from 0 to 511, which fits exactly into 9 bits (the reason we used a power of 2 in the first place!), and the line count from 0 to 255, which needs exactly 8 bits. So the numbers go from 000000000b to 111111111b horizontally and 00000000b to 11111111b vertically. Since we are counting starting with visible pixels and lines, the initial states of the counters (up to 511 horizontally and 255 vertically) correspond exactly to the address of the pixel, so concatenating the 8 bits of the vertical counter with the 9 bits of the horizontal counter directly gives us a pixel address. Since there are 4 pixels per byte (given 2 bits per pixel), the bottom 2 bits of such an address give you the position of the 2-bit pixel within a byte, and the remaining 15 bits give you a byte address, from 0 to 32767, which is exactly 32k. All of this is basically re-using horizontal and vertical counter bits with no additional logic - a considerable simplification, which is what you want when designing custom logic that is supposed to be as cheap as possible.
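Following the simplified 2-bits-per-pixel model used above, the counter-bit re-use can be written out like this (a sketch of the principle, not the QL's actual screen byte layout):

```python
# Pixel/byte address formed by concatenating counter bits (2 bpp model):
# x = horizontal counter 0..511 (9 bits), y = vertical counter 0..255 (8 bits).
def pixel_address(x, y):
    return (y << 9) | x               # 8-bit y glued in front of 9-bit x

def byte_address(x, y):
    return pixel_address(x, y) >> 2   # 4 pixels per byte

assert byte_address(511, 255) == 32767   # last byte of the 32k screen
assert byte_address(0, 1) == 128         # each 512-pixel line is 128 bytes
```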
The consequence is that the horizontal resolution is now increased to 512 pixels, which uses 51.2us of the complete display line and breaks the 48us standard. So, there are 32 extra pixels - the timing is adjusted so that around 16 are added to each side of the 480-pixel visible area, and this is how TV mode was born: have 512 pixels horizontally, and limit the usable pixels in software. Simplifying the logic to 9- and 8-bit addressing for the visible pixels and lines also simplifies the logic that generates the vertical synch pulse and 'blanks' the display, i.e. determines when pixels are to be fed to the monitor and when 'black' should be generated during the various invisible or unused areas of the screen - simply look at the top bit of the counter, and if it is 1, generate blank (black) pixels.
Even better - it also makes it much simpler to write software for. Figuring out the byte address for a 480-pixel-wide display requires 'take the x coordinate and add it to the y coordinate times 120', i.e. a 'real' multiplication, while calculating with a 512-pixel width means simply shifting and splicing bytes.
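In software terms, the difference looks like this (hypothetical helper names, and again the simplified 2-bits-per-pixel model):

```python
# Byte address of pixel (x, y): the 480-wide layout needs a real multiply,
# the 512-wide layout needs only shifts.
def byte_addr_480(x, y):
    return y * 120 + (x >> 2)     # 120 bytes per line -> multiplication

def byte_addr_512(x, y):
    return (y << 7) | (x >> 2)    # 128 = 2^7 bytes per line -> shift only

assert byte_addr_512(8, 3) == 3 * 128 + 2
```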

There are also other ways this simplifies the actual logic that reads the data from RAM as well as generates the required timing for the RAM when the CPU reads or writes it, but more on this in the next post.


dilwyn
Mr QL
Posts: 2753
Joined: Wed Dec 01, 2010 10:39 pm

Re: 8301

Post by dilwyn »

Thanks Nasta, I have enjoyed reading these articles about the 8301.


Nasta

Re: 8301

Post by Nasta »

Now before I go into the details of what 8301 actually does on a signal and nanosecond basis, an explanation is needed on the setup of the actual hardware.

As we know, the CPU is, contextually, the 'master' of the system, and does its work by accessing various parts of memory and input/output devices, using three buses:
1) The address bus, which transmits the address, or in a way, WHERE to find or put data.
2) The data bus, which carries data back and forth, from CPU to other devices, or from devices to the CPU.
3) The control bus, which is a collective name for a set of signals that time the transactions and control when the addresses and data are valid, as well as what direction the data is to travel.

However, since the QL also has a video system to let its user see a graphical representation of what is going on, the CPU is not the only thing that needs to access memory.
As I explained, the way the picture is generated on the screen is a repetitive process, because the screen only retains the image for a short time (milliseconds) before it fades again, so it needs to be continually refreshed. I also mentioned that the entire picture is a special 32k area of memory whose contents are interpreted as pixels on the screen. Since the contents are dynamic under program control, this is obviously RAM.
So, this area needs to be read out over and over again, 50 times a second - and more importantly, this MUST be done at certain intervals and speed, no stopping, no waiting.
We also know that memory is in general 'single port' which means that only one request can be served at a time, so at any given time, it's either going to be accessed by the CPU or read for screen generation purposes - and, since the latter must be exactly timed, should the CPU want access at the same time, it will have to wait.

In order to somewhat mitigate the problem, the QL design splits the bus between RAM and everything else. There is a 'switch' between the CPU bus and the RAM bus, which makes it possible for RAM data to travel on its own bus to the 8301 when it reads it to refresh the screen, while the CPU can concurrently access everything else, like the ROM, or anything on the expansion bus. However, should the CPU want to access RAM, the 8301 has priority and the CPU has to wait.

* Small aside: depending on the version of the motherboard, the companion 8302 (and the IPC that communicates through it) can be on the RAM bus side or the CPU bus side. For instance, on ISS5 boards the 8302 is on the RAM side, which means accessing the IO registers within it was subject to the same wait while the 8301 is accessing RAM. This changed on later boards with the HAL chip.
For people somewhat knowledgeable of the QL motherboard, the bus 'switch' is made out of 3 LSTTL chips, 74LS245 (for the data bus) and 2x 74LS257 (for the address bus). The address bus is also multiplexed by the 74LS257 on the way to the RAM, which is the normal way DRAM - which is the kind used in the QL - is addressed.

So, now we get to the nitty-gritty of how the 8301 does its work.
Somewhere at the beginning I also mentioned that the 8301 is the main system address decoder. So, amongst other things it enables or disables the 'bridge' depending on what address the CPU wants to access. Of course, it has to monitor where the CPU is within the course of the access, as well as where the screen access is in its own timing, and synchronize the two. Aside from that, the 8301 lets the CPU access ROM and add-ons at full speed, and asynchronously - basically it lets the CPU 'time itself' when it does that, counting on the fact that the speed of the ROM is sufficient for even the fastest transfer the 68008 is capable of at a 7.5MHz clock.
However, if RAM is accessed, since the screen refresh gets priority, this is the part that actually determines how fast accessing the RAM happens.

* Again, a small aside. DRAM comes in certain chip sizes and organizations, and while we know the 8301 can support two screen areas, 64k total, it actually controls two 64k DRAM banks and treats them as a single block of RAM, so screen access will slow down the CPU not only when it is accessing the screen area, but on any address within the 128k of on-board DRAM. This is an obvious cost-cutting measure, as it is not as easy to set up the 68008 to control DRAM, compared to, say, the Z80 in the Spectrum. Spectrum-knowledgeable people will know that the ULA there, which also controls screen refresh, had shared access only to the first 16k of RAM, while the added 32k (making 48k total) was not slowed down. Unfortunately, in order to save more TTL chips and logic, the 8301 controls the entire RAM, even though it can only use half of it to store two screen areas.

So, finally we get to the 'nasty' stuff.

I mentioned in the paragraphs above that the entire timing of the screen refresh is based on the specification of a standard TV raster line. We already know that the QL displays 512 visible pixels in each line out of 640 total, and that each line, having 512 pixels made out of 2 bits each (to get 4 colors), therefore is made out of 128 bytes. It takes 51.2us to display the pixels, so this means that, effectively, 128 bytes have to be read out of RAM to be translated into these 512 pixels, which means the memory bandwidth required is 128/0.0000512 bytes per second, or 2.5Mbytes per second. Just for a sanity check, let's see how fast the CPU can access memory - it takes a minimum of 4 clock cycles at 7.5MHz, so the maximum bandwidth is roughly 7.5/4 Mbytes per second, or 1.875Mbytes per second. In other words - the screen refresh process requires MORE bandwidth than the CPU! But then, let us see how fast the actual DRAM can work, and this roughly comes out to... drumroll: 2.5Mbytes per second.
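The bandwidth comparison above, re-done as a two-line sketch:

```python
# Bandwidth sanity check from the figures above.
screen_bw = 128 / 51.2e-6   # 128 bytes per 51.2 us of visible line -> 2.5 MB/s
cpu_bw = 7.5e6 / 4          # one byte per 4-clock access at 7.5 MHz -> 1.875 MB/s

# The screen refresh alone needs more bandwidth than the CPU can even ask for.
assert screen_bw > cpu_bw
```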
So.... what now? It seems that there is not enough bandwidth to fit both screen refresh and CPU access!
Well... remember that the screen refresh 'only' happens for 51.2us out of 64us, so that means that in theory the RAM is used for screen refresh 4/5 of the time and the CPU can access it 1/5 of the time. While this would work, it would be downright crippling to the CPU and slow it down by a factor of 5 if it was executing code or fetching or storing data in RAM. We know that 8301 slows the RAM down a LOT but it's not this bad.
So, in one of the rare bouts of being clever, the 8301 uses the fact that screen refresh reads RAM from consecutive addresses, to speed up (well... somewhat) RAM access.

And here is how it does it:
Since we know that the pixels are shifted out of the 8301 at a rate of 10MHz, and its clock is 15MHz, this means that it is using some convenient common time base for the two clocks. And indeed it does - it divides every display line into chunks 16 pixels in length, which takes 16 cycles at 10MHz and 24 cycles at 15MHz. The 10MHz clock is actually generated from the 15MHz clock by taking 3 half-cycles at 15MHz and using them as a full cycle at 10MHz, so as the 15MHz clock goes 010101 etc., the 10MHz clock goes 001001 etc., this pattern then repeating.
Within the 24 cycles of the 15MHz clock, 16 cycles are used to read 4 consecutive bytes from screen RAM and 8 cycles are kept open for the CPU to access the RAM. Please note that 8 cycles at 15MHz is exactly 4 cycles at 7.5MHz, and this is the 'natural' shortest access the CPU can generate if it is left to drive the bus at maximum speed. That being said, using 8 cycles at 15MHz as the clock to drive the generation of the signals to the DRAM is the natural way to do it with a 68008 (and in general up to the 68030), because the CPU itself generates its control signals at twice the clock rate - it divides every 4-clock access into 8 half-cycles.

* More detail in here:
The 8301 generates a 'page mode' DRAM access to read 4 consecutive bytes much faster than if they were read as random bytes. Out of the 16 cycles at 15MHz it has to do this, it uses 3 to set up the access and then repeats the same 3-cycle sequence 4 times (12 cycles total) to read the 4 bytes, and then one more cycle to finish the access. So, it needs 16 cycles to get 4 bytes.
During the 8 cycles it can use for the CPU, it uses 3 cycles to set up the access, 3 to perform it and 2 to finish it - so it takes 8 cycles for 1 (yes - ONE) byte.
In other words, the bandwidth the RAM is capable of for consecutive data is twice as much as for random byte accesses.
Since the 4 bytes it reads consecutively will make up 16 pixels, they need to be buffered internally in the 8301, because they get read during 16 cycles of the 15MHz clock and get 'stretched' to 24 cycles at 15MHz (which is exactly 16 cycles at 10MHz). It is not exactly easy to figure out how the buffering is done, but one could make an educated guess given the need to save on logic. There are probably two 2-byte (1 word) buffers, and the first 8 pixels out of the total 16 read start at cycle 12 of each 24-cycle timing chunk.
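The cycle accounting above can be totted up in a couple of lines (numbers straight from the description):

```python
# One 24-cycle chunk at 15 MHz, per the description above.
refresh_cycles = 3 + 4 * 3 + 1   # setup + 4 page-mode reads + finish = 16
cpu_cycles = 3 + 3 + 2           # setup + access + finish = 8
assert refresh_cycles + cpu_cycles == 24

# Bytes per cycle: sequential page-mode reads come out exactly twice as
# efficient as a lone random-byte access.
assert 4 / refresh_cycles == 2 * (1 / cpu_cycles)
```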

* Aside: the 16-pixel timing chunk can sometimes be seen when the digital RGB signal from the 8301 is used to drive an analog input on a monitor or TV. When the screen displays white, one can see 32 vertical bars that are slightly darker, at every 16-pixel interval. They often flicker slightly depending on what the computer is doing, because it modulates the power supply, and this voltage is directly output by the 8301 as logic 1 on the RGB lines.

Now, I laid out that the 8301 uses chunks of 24 cycles, out of which 8 are dedicated to the CPU. In case you missed it, let me rephrase it a bit: the CPU gets only 1/3 of the total time to access the RAM. And, in case you were wondering, yes, it does get slowed down to about 1/3 of its maximum speed when it does. However, we know that when one measures the speed of the CPU executing code from RAM, it comes out as working at about half speed. So where's the difference?
Well, I did mention that only 512 out of the total 640 pixels in a line are visible, i.e. 51.2us are used out of 64. The entire line therefore consists of 40 16-pixel chunks, but only 32 are used for the actual displayed pixels, during which the 8301 does its 1/3rd-bandwidth thing, while in the remaining 8 chunks it lets the CPU run at full speed. If we look at it in CPU accesses, at 4 clock cycles at 7.5MHz, one display line can fit a maximum of 120 accesses, divided into 40 groups of 3. In the first 32 groups only one out of 3 is available to the CPU; in the last 8 groups, all are available. This means a total of 56 out of the theoretical 120 accesses are usable by the CPU, which gives us an effective CPU speed of around 46.7% of maximum, or a slowdown factor of about 2.143. In reality this of course depends on the actual instructions executed, as some have 'internal' clock cycles during which the CPU does not use the bus, and these can partially overlap with the 8301 reading the screen data.
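The per-line slot arithmetic, as a sketch:

```python
# CPU access slots in one 64us display line (40 chunks x 3 slots = 120).
visible_chunks, blank_chunks = 32, 8
usable = visible_chunks * 1 + blank_chunks * 3   # 1 of 3 slots during refresh
total = (visible_chunks + blank_chunks) * 3
assert usable == 56 and total == 120
print(round(usable / total, 3))   # 0.467
```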

But wait, you might say, did I not say that also not all display lines are used for actual visible pixels? Yes I did, and you would be right - 256 are used out of 312. So, along with 80% of any line being used for visible pixels, roughly 82% of all lines are used for visible pixels, so it would logically follow that along with 20% of each line time being accessible to the CPU full speed, the same full speed could be had for 18% of all display lines. Alas - the cleverness of the 8301, such as it is, does not stretch that far. Sadly, it still does the same even during the invisible lines, just does no actual reading of data.

* Aside: let's explore for a minute what the effective speed of the CPU would be when running from 8301-controlled RAM, if the 8301 only claimed the bus during the lines actually required to read screen data:
We already calculated that during the 256 lines of the visible display it lets the CPU have 56 out of the theoretical 120 access slots. During the remaining 56 lines it could in theory give the CPU all 120 slots. On a full-screen basis, we therefore have 120x312=37440 slots, of which the actual implementation of the 8301 makes 56x312=17472 available. If the invisible lines were not used for fake 8301 accesses, this number would be 56x256+120x56=21056. This would mean the CPU would work at about 56.2% of maximum speed, rather than ~46.7%. It does not look like a significant improvement put that way, but comparing the two numbers directly, the larger is about a 20% improvement over the actual situation. Given that in its time people complained about this, a 20% improvement would not have been unwelcome - compare that to the mere 6.67% improvement if the clock were upped to the maximum 8MHz the CPU would support.

So, why was this not done? The partial reason comes down to the need of the DRAM to be refreshed periodically to guarantee the integrity of the data stored in it. But more on that, as well as other quirks of the 8301 in the final post.


martyn_hill
Aurora
Posts: 909
Joined: Sat Oct 25, 2014 9:53 am

Re: 8301

Post by martyn_hill »

Hanging-on every word, I'm looking forward to the final installment!


Nasta

Re: 8301

Post by Nasta »

Well, perhaps I will stretch the final part over several posts, as it will be easier to comment on the pictures.

So, for starters, here is a logic analyzer trace of the 8301 managing a write to RAM, the CPU having tried to access the RAM before/while the 8301 was reading screen data:
8301_write_screen_RAM.gif
The green part is the 8301 accessing screen RAM, and the grayish part is the actual CPU access to RAM.
I will explain some of the signals in detail, in order to follow how the actual logic I described makes the relevant signals behave in real life.
The top 8 traces are signals generated by the 8301 that control the RAM and the 'bridge' (buffer and multiplexer) circuits that separate the CPU bus from the RAM bus.
The bottom 8 traces are signals the CPU generates or looks at in order to signal the various devices on the bus what it is about to do, as well as signals the devices must generate in order to tell the CPU how they are reacting. Unlike the top 8 signals, 6 of the bottom 8 are generated by the CPU and 2 are generated by the 8301 - these being the clock signal (7M5, standing for '7.5MHz') and the /DTACK signal, which tells the CPU when the device has finished with the current access so the CPU can proceed with the next one.

Here is a short explanation of the signals by name. First the CPU signals:
7M5 is the CPU 7.5MHz clock, the ULA's 15MHz clock divided by 2. As this is the only clock on the screen, both states/edges are marked by colored bars, since every 15MHz clock cycle generates one edge on the 7M5 signal. Both the 8301 and the CPU effectively use both 7.5MHz edges.
A6, A15, A16, A17 are some of the CPU address signals which are relevant for decoding parts of the address map of the QL.
/DS is the CPU data strobe, when it goes low, it signals the 8301 that the CPU is starting an access. It also signals that the state of the address bus and data bus on write are stable.
/WR is the CPU read/write signal, shortened here to /WR as it goes low when the CPU wants to write data, i.e. output data on the data bus.
/DTACK is produced by the device the CPU is accessing when it has finished doing what the CPU wanted from it :) - at this point the device pulls this line low. The CPU detects this, finishes the current cycle and continues with the next one. In other words, this signal can be used to extend the access when needed. Since the 8301 is the decoder for all internal devices on the QL motherboard, it is responsible for generating this signal.

And here are some relevant 8301 signals.
/CSYNCH is the composite synch signal. It was added to the complement of signals so the logic analyzer can trigger a signal trace on it. The underlying reason is that /CSYNCH goes low when the currently displayed line of pixels ends, so triggering on it can tell the analyzer to start tracing signals at the beginning of a line.
VDA is generated by the 8301 and is high when the 8301 accesses display data. It is actually used to disable the address multiplexers on the motherboard (74LS257), switching out the address coming from the CPU and replacing it with one the 8301 generates to access the required display data. The 8301 also multiplexes its own internal counters that contain the address, but this is done internally, so the 8301 only has pins for a multiplexed address. When VDA is low, the CPU is free to access the RAM (*) if it wants to.
/TXOE is similar to VDA but works for the data bus. When high, it disables the data bus buffer (74LS245) which then disconnects the RAM data bus from the CPU data bus, in order that the 8301 can read it without it conflicting with data the CPU might be writing or reading to parts of the address map that are not RAM. There is a difference, in that the signal only goes low (and enables the data buffer) when there is actual data to be transferred, as not all cycles in a CPU access cycle are being used by the CPU to access data, some need to be used to set up control signals.
A subset of the ULA signals are specific to the way dynamic RAM works, so let me explain them as a separate group. These are /RAS, /CAS (two of them), /WE and, indirectly, ROW.

* A bit on DRAM operation: for various reasons, one of which is reducing the number of pins on the chip, the address bus of the DRAM is multiplexed, and the address is communicated to the DRAM chip in two 'halves', usually half the number of bits of the total address each. The QL uses 64k DRAM chips, and 64k requires 16 address bits. The actual DRAM uses 8 address lines, and the 16 bits are input 8 at a time, as the row and column address of a 256x256 matrix. Internally, the DRAM actually accesses a whole row of data once the row address is input, and the column address then selects the part of that data (1 bit in the case of the QL's chips) that is to be output to the CPU or replaced by the CPU's data. After that, the whole row actually gets written back, which is also how the data is refreshed. This is a crucially important property of dynamic RAM - data is actually stored in a matrix of capacitors, and left alone they discharge slowly, so data will be lost if it is not refreshed periodically. Also, in order to fit the maximum number of bits into the minimum area, the capacitors are so tiny that reading their state also destroys the data, which is why it then has to be written back from the row buffer.
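The row/column multiplexing can be sketched as follows; which half of the address serves as the row is an assumption here for illustration, the principle is what matters:

```python
# Splitting a 16-bit DRAM address into two 8-bit halves for a 64k x 1 part.
# Which half is presented as 'row' first is an assumption, not QL-verified.
def split_address(addr):
    row = (addr >> 8) & 0xFF   # latched into the DRAM when /RAS goes low
    col = addr & 0xFF          # latched when /CAS goes low
    return row, col

assert split_address(0x1234) == (0x12, 0x34)
```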

So, on to the actual signals:
/WE is obvious: when low it tells the RAM to write the data it is given on the data bus. There are slight differences in the way particular DRAM chips treat it, and the 8301 is quite conservative in the way it generates it, so many types could be used - obviously to make it easy to use almost any ol' cr*p :)
/RAS goes low to latch the row address into the DRAM chip. It takes the states of the 8 address lines and 'remembers' them internally for the duration of the access.
/CAS goes low to tell the DRAM that the column address is present on the address bus.
/ROW is used to drive the external multiplexer when the CPU accesses the DRAM, it selects the address lines from the CPU to present to the DRAM as row (when 0) or column (when 1) - this is the select input to the 74LS257 multiplexers. Note it is wrongly labeled as high active (without the / in front).

The way the DRAM works can be observed in the grayish area on the picture. The timing diagram goes left to right (shows how the signals change as time advances). The gray area is the part of the 'timing chunk' the 8301 uses to let the CPU access the RAM.
In state 0, the VDA signal goes low, for the CPU to access the RAM. The /DS signal is already low, telling us the CPU has already requested an access of RAM well before it got to actually be performed. In fact, if you look carefully, the CPU signals show that /DS is low and /WR is low right from the start, even before the part where the 8301 reads RAM happens, meaning the CPU has started the cycle well before the events we are observing in the diagram. So, it has already been made to wait at least 8 clock cycles. This means that all the address and data lines in the buses have long been stable.

* Aside: if the 8301 sees from the state of the address lines that the CPU is trying to access the RAM, but /DS has not gone low by 4 full clock cycles before the screen access starts, it will not let the CPU perform the RAM access even if it was 'its turn', because it would not be able to finish it by the time the 8301 needs to read screen data. So it follows that the CPU can end up taking 16 clock cycles to perform a single byte access - 4x slower than maximum speed. 'Fortunately' the waits due to screen data access are so long that it has enough time to use the upcoming time slots when they come, as it usually starts a cycle sometime close to the start of the next 8301 access (see the right-hand side of the diagram, the green part after the gray one).
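The arbitration rule in the aside can be boiled down to one comparison. This is only my reading of the behaviour described above, not a verified description of the 8301's internals; times are in 7.5MHz clock ticks and all names are illustrative:

```python
def cpu_slot_granted(ds_low_at: int, screen_fetch_at: int) -> bool:
    """The CPU gets its RAM slot only if /DS went low at least 4 full
    clocks before the next screen access starts; otherwise it must
    wait for the slot after that one."""
    return screen_fetch_at - ds_low_at >= 4

# Missing the deadline by a single clock costs a whole extra slot,
# which is how one byte access can stretch to 16 clocks.
```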

What we see next is that /ROW stays low for a while after VDA goes low. The 74LS257 multiplexers have already selected the row address bits (a bit more about this later on), but they are put on the multiplexed address pins of the DRAM only when VDA goes low - before that, it was the address from the 8301 that was there. Next, /RAS goes low in state 2 and latches the row address into the RAM chips - all of them. A short time after that (and not clock related) /ROW goes high, replacing the row address with the column address. The delay is part of the logic but is welcome, as some DRAM chips require the row address to remain stable for a while after /RAS goes low (this is called a 'hold time'). Again, there is a delay before one of the /CAS signals goes low in state 3, because time is required for the signals to switch over and stabilize (this is called a 'setup time').
As can be seen, there are two /CAS signals, /CAS0 and /CAS1. These are connected to the chips making up the lower and upper 64k of RAM respectively. In this case /CAS0 goes low, meaning the CPU is accessing the lower 64k - and this can actually be seen from the available address lines in the trace. /CAS needs to stay low for a while, for the RAM's access time to pass and the data to get written into it.
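The /CAS0 / /CAS1 bank split is effectively a one-bit decode. A sketch, assuming bit 16 of a 17-bit RAM address picks the bank (the exact decode is my assumption; active-low signals are modeled as 0 = asserted):

```python
def cas_select(addr: int) -> tuple[int, int]:
    """Return (/CAS0, /CAS1) for a 17-bit address into 128k of RAM."""
    if addr & 0x10000:
        return (1, 0)   # upper 64k: pulse /CAS1
    return (0, 1)       # lower 64k: pulse /CAS0
```

Note that /RAS still goes to all the chips; only the /CAS pulse decides which bank actually completes the access.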

* Note: there is a larger delay from the clock signal to /CAS going low than from the clock signal to /RAS going low. This is most likely because there is a clock-generated internal /CAS signal that then gets split into /CAS0 and /CAS1 using combinatorial logic, which adds a small additional delay.

We can also see that the /TXOE signal has gone low in order to let the data on the CPU data bus through onto the RAM data bus, and /WE has gone low to tell the RAM it is supposed to write it. /WE is a sort of gated and buffered CPU /WR signal. The 8301 has rather strong outputs for the address bits, /RAS, /CAS and /WE, as they need to feed the signal to at least 8, or the full 16, RAM chips. Supplying the required current is one of the reasons the 8301 generates heat.
Finally, if we go down to the CPU signals, we see the 8301 has now also generated the /DTACK signal, telling the CPU that it has gained access to the RAM and the data has been written.
The somewhat curious thing to note here is that /DTACK actually goes low ahead of the data write in the strict sense, but also that the data has long been written by the time the CPU figures out it is time to get on with the next access. This is down to two things:
First, the RAM latches data on the falling edge of /CAS if it finds /WE low at that time. For this to work, the data has to be present and stable on the RAM data pins before /CAS goes low, as does the /WE signal - and as one can see, /TXOE has indeed enabled the data buffer so the CPU data can reach the RAM, and /WE is indeed already low when /CAS goes low. This means the CPU could just as well continue on its merry way just a bit after /CAS has gone low (remember, a short hold time is required), but:
Second, the 68008 bus protocol is such that the CPU starts looking for /DTACK being low at a certain point (well past in this case), and once it detects it, it takes one and a half more clocks to finish the current cycle.
So, while the 8301 could have been implemented to optimize writes, the logic used has been simplified so that it works the same for both reads and writes, between which there are certain subtle differences. The takeaway point is that the 8301 is built around the 68008 bus protocol and expects a certain reaction from the CPU as it gives it various signals in response to an access attempt, SYNCHRONOUSLY with the 7.5MHz clock it generates. In other words, the 8301 logic expects the 68008 to work off the 7.5MHz clock supplied by the 8301, so it can work in lock-step with the CPU.

The access observed in the CPU portion of the trace (gray) is the simplest mode of access of a DRAM. When data is read, the only differences are that /WE is not low, and the 74LS245 buffer has its direction changed, so data goes from RAM to CPU. This time, however, the data must be present and stable on the CPU bus as provided by the RAM. This happens some time after /CAS has gone low and is defined as the /CAS access time; worst case this will be about 1 clock cycle at 7.5MHz. But we are back to the way the 68008 protocol works, with the 8301 expecting to work in lock-step with the CPU, clock by clock. Knowing that the CPU takes time to recognize the /DTACK signal, the 8301 sets it low before the actual data is ready, because the protocol is such that once /DTACK is recognized as low, the actual data is taken by the CPU one clock cycle later - and the 8301 logic anticipates this. This is why the CPU cannot be run from an asynchronous clock at a frequency much different from the actual 7.5MHz: if it is much higher, the 8301 assumes it has one clock cycle at 7.5MHz to provide valid data to the CPU, but the CPU will expect the data one clock cycle later at its higher clock rate - i.e. after a shorter period than the 8301 is counting on. What happens then is that data is usually still written correctly (because the RAM internally 'stores' the data to be written when /CAS goes low, before the actual internal write happens - a buffer of sorts), but reads will be incorrect.

Now, let's look at the green portion of the diagram. The reason I decided to explain this one after the CPU part is that the address multiplexing and the switching from row to column are hidden inside the 8301, so they are not easy to follow without knowing how the same thing happens outside - which is visible when the CPU does it, as discrete logic chips implement the multiplexing and the data buffer.
However, what we can see immediately is that during the green part VDA is high, so the address from the CPU is disabled (and replaced by the address from the 8301) directly at the multiplexed signal level, and /TXOE is high, meaning the data buffer is disabled, preventing the data being read from the RAM by the 8301 from 'leaking' onto the CPU bus.
Also, /WE is kept high, meaning that data is read - and indeed the 8301 only ever reads data to generate the display.
At the beginning things look familiar: first /RAS goes low in state 2, then a while later, in state 4 (comparable with how it happens when the CPU is accessing RAM), /CAS goes low - and in this case it is also /CAS0. In fact, it will always be /CAS0, as the 8301 can fetch screen data only from the bottom 64k of RAM. In all probability, in state 3 the row address is replaced by the column address using an internal multiplexer that gets the relevant row and column counter bits onto the multiplexed address bus pins.
But then something interesting happens: in state 6 /CAS goes high again, and the 3-half-clock-period sequence repeats 3 more times, while /RAS stays low.
What is not seen in the diagram is that the two lowest address bits in the column address also change, counting from the initial 00 binary in state 3 up through 01b, 10b and finally 11b, the change happening each time /CAS goes back high.
This is a faster mode of access for addresses that all fall within the same row of the RAM - as I said, internally a whole row is read, and then the column address chooses just one of the 256 bits making up that row. In this case, the 8301 signals to the RAM that it needs to keep the same row without reading it anew, and reads 4 consecutive bytes from the already-read row buffer 'in one go', taking about 2x the time needed to read a single byte when the CPU does it. This can be done much faster because the row does not need to be re-read every time. Since the 8301 does not access RAM almost randomly like a CPU does (you never know what program is running and what order of access it requires at a given moment), but strictly in sequence, from the first to the last pixel, it is easy to implement this sort of 'shortcut' to get more speed. One could argue that things would be better if more consecutive bytes could be read and buffered this way, but the problem is exactly the buffer required - remember, 8 bytes are read in 8 cycles, but their contents are spread out as pixels coming out sequentially over 12 cycles of the 7.5MHz clock. So this was the best that could be done with the least amount of buffering.
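The page-mode burst described above can be sketched as follows. This is a hedged model, not the 8301's actual logic: the byte-wide 'row buffer' here stands in for the eight 1-bit chips read in parallel, and the function name is mine:

```python
def page_mode_burst(row_buffer: list[int], col_base: int) -> list[int]:
    """Read 4 consecutive bytes from an open row without re-issuing /RAS:
    /RAS stays low, /CAS is pulsed once per byte, and only the two low
    column bits change, counting 00b..11b."""
    out = []
    for low_bits in range(4):                 # column low bits 00, 01, 10, 11
        col = (col_base & ~0x03) | low_bits   # upper column bits stay fixed
        out.append(row_buffer[col])           # one /CAS pulse per byte
    return out

row_buffer = list(range(256))                 # the row latched by /RAS
data = page_mode_burst(row_buffer, 8)         # -> bytes at columns 8..11
```

The speed gain comes purely from skipping the row read-out on bytes 2 to 4, which is why it only works for addresses within the same row.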

* I have not been able to time the pixels versus the accesses exactly, but I am reasonably confident that the 16-pixel block generated from the data being read begins sometime in cycle 9, with the just-read second byte of data being written directly to a shift register, from where it will be shifted out 2 bits at a time. The next 2 bytes are stored in an intermediate buffer in state 16 and get transferred to the shift register in state 5 of the CPU access slot.


Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: 8301

Post by Peter »

Thank you very much Nasta! It is good to have this written down, before the knowledge is lost. Designing new hardware had pushed a lot of this to the back of my mind.

If you had to design an 8301 replacement today, wouldn't it contain the RAM and just use DA0..DA7 to grab the address info?


Pr0f
QL Wafer Drive
Posts: 1298
Joined: Thu Oct 12, 2017 9:54 am

Re: 8301

Post by Pr0f »

I was looking into the idea of using on board RAM in an FPGA, but even with the basic QL display resolution, the amount of block ram you need means using quite a large FPGA in relative terms. And if you want higher resolutions, it just ramps up.

Thinking about the way the video is produced in the 8301, I wonder if a 16-bit-wide dynamic RAM would make more sense? Being able to interleave CPU and video access would be ideal, so that no delay is caused to CPU writes/reads when it needs access to RAM locations in the video space. How fast would a RAM need to be to provide reasonable resolutions? An FPGA could be used both to provide the different display resolutions and to gate access to the RAM from the CPU / video generator.


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: 8301

Post by Nasta »

Peter wrote:Thank you very much Nasta! It is good to have this written down, before the knowledge is lost. Designing new hardware had pushed a lot of this to the back of my mind.
If you had to design a 8301 replacement today, wouldn't it contain the RAM and just use DA0..DA7 to grab the address info?
For sure - although you then still have a big heap of inactive hardware that nevertheless consumes power and influences signal integrity. But that is a story for a later post.


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: 8301

Post by Nasta »

Pr0f wrote:I was looking into the idea of using on board RAM in an FPGA, but even with the basic QL display resolution, the amount of block ram you need means using quite a large FPGA in relative terms. And if you want higher resolutions, it just ramps up.

Thinking about the way the video is produced in the 8301, I wonder if a 16-bit-wide dynamic RAM would make more sense? Being able to interleave CPU and video access would be ideal, so that no delay is caused to CPU writes/reads when it needs access to RAM locations in the video space. How fast would a RAM need to be to provide reasonable resolutions? An FPGA could be used both to provide the different display resolutions and to gate access to the RAM from the CPU / video generator.
A logical conclusion, but still difficult to do with reasonable FPGAs for the reasons outlined. External RAM could be added, though. There are also some tricks that could be used to make the whole thing smarter with less hardware.
There is an underlying problem though: for higher resolutions, more pixels need to be matched with a faster CPU to move the data around (in the absence of other hardware to actually generate and move the pixels), otherwise the system gets too slow to be really usable (even if it looks nice while just standing there :) ).


Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: 8301

Post by Peter »

Pr0f wrote:I was looking into the idea of using on board RAM in an FPGA, but even with the basic QL display resolution, the amount of block ram you need means using quite a large FPGA in relative terms.
The lowest-end Lattice ECP5 family member, for 4,25 € (at quantity 1), already has 64 KB of block RAM. Not as comfortable as the Flash-based chip used on the Q68, but still quite "nice".
Pr0f wrote:And if you want higher resolutions, it just ramps up.
Roughly "double price, double RAM" in the same package. And I'm not sure an 8301 replacement would be good for much higher resolutions, given the 68008 bus limitations.
Pr0f wrote:Thinking about the way the video is produced in the 8301, I wonder if a 16 bit wide dynamic RAM would make more sense?
Yes, it could make more sense. If it were my personal task, I could then copy logic from the Q68. An 8301 replacement would come as a PCB, not a single chip, anyway - obviously there are no PLDs/FPGAs in large DIL cases.

