Faster/wider CPU...

Brane2
Trump Card
Posts: 179
Joined: Fri Dec 30, 2011 5:42 pm
Location: Ljubljana, Slovenia

Re: Faster/wider CPU...

Postby Brane2 » Fri Feb 10, 2012 7:20 pm

If anything usable ever pans out of this, I see it as an ecosystem rather than one machine.

More like Linux on modern hardware- one fundamental concept that governs behaviour of hardware and firmware/software.

At every level, "low-fat" minimalistic principles would come first, and binary compatibility with some ancient machine would be more of an afterthought. That would be trivial to accommodate most of the time, but if/when it collides with just about any other interest, it would lose.


On the journey of life I chose the psycho path...
Nasta
Gold Card
Posts: 339
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Faster/wider CPU...

Postby Nasta » Tue Feb 14, 2012 9:22 pm

Dave wrote:The EC and HC variants both do that. I can, however, obtain EC variants for less than half the cost of a similarly clocked HC.

I'm curious if there's a simple way to have a faster CPU than the 7.5MHz of the 68008, and/or if there's a simple way to have wider memory accesses, and still use the onboard facilities, like the gold cards do, but without lots of heavyweight custom logic.

I am looking at whatever comes out of this thread being an open-source design.


OK I'm quoting this from page 1, so doing a 'where have you been' here :) but:
In fact the least custom logic trickery is required with the 68020. That being said, a piece of code is required to set it up as a 68020, mostly to do with the interrupt stack pointer.
In any case, to get a faster and wider QL, your best bet would be a 68(EC)020.
Disregarding for the moment some members of the 68300 family (which are actually not that easy to get and quite expensive), the 68020 is the only one capable of 8 and 16 bit width accesses with minimal need for external logic (the logic needed is an expanded address decoder that tells the 68020 which addresses have what bus width). Getting a 68(EC)000 to do the same is 'a bit' more complicated as it requires logic to break every 16-bit bus cycle into two 8-bit ones when an access is attempted to addresses that use an 8-bit bus, this also includes some data bus multiplexing and latching etc.
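A minimal sketch of the 'expanded address decoder' idea, written in Python rather than as logic equations. The memory map and the names PORT_MAP and port_width are assumptions made up for illustration, not taken from any real QL design; on real hardware the result would be signalled to the 68020 through its DSACK0/DSACK1 inputs.

```python
# Hypothetical memory map: (start, end inclusive, port width in bits).
# The address ranges are illustrative assumptions, not a real QL design.
PORT_MAP = [
    (0x00000, 0x3FFFF, 8),    # original ROM, IO and motherboard RAM: 8-bit
    (0x40000, 0xFFFFF, 16),   # assumed new, wider expansion RAM
]

def port_width(address):
    """Return the bus width (in bits) the decoder would report for an address."""
    for start, end, width in PORT_MAP:
        if start <= address <= end:
            return width
    raise ValueError(f"address {address:#x} is not decoded")

print(port_width(0x20000))  # screen RAM on the 8-bit motherboard bus
print(port_width(0x40000))  # first address of the assumed 16-bit area
```

The point is only that the decoder is a pure function of the address, which is why so little external logic is needed compared with splitting bus cycles for a 68(EC)000.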
Further, unlike other 'higher' members of the 68k family, the 68020 implements all the original 68000 instructions (notably MOVEP is missing from others). It also adds a bunch of instructions, some of which are missing from other more advanced chips.
Most of the logic associated with getting the 68020 to work 'QL style' i.e. capable of accessing the old ULA chips, revolves around slowing it down sufficiently.
In particular, the ZX8301 is the main problem here due to its sharing the memory bus in order to generate video output. It relies on some 68008/68000 particulars regarding bus access, and also on the CPU clock being close to 7.5MHz. In theory it should be able to run completely asynchronously, but it does not (it will work with the CPU on its own independent clock as long as this CPU clock is not too far from 7.5MHz; IIRC about 9MHz is the maximum before problems start).
The 8302 however just emulates a bunch of simple registers and it's not too finicky on timing. That being said, things like Microdrives (spit, spit, spit!!!) and network rely on some timing implemented in software so a faster CPU will upset this and they will not work (a good question would be who would need them to work...). Most of the software on the SGC revolves around patching this up. In fact, it could probably be patched up itself to run a 68020 with more RAM, for instance.

When expanding the RAM of QDOS/SMSQ machines, there is a requirement that the top 3 address lines (A29, A30, A31) remain 'don't care' as far as memory decoding is concerned. In effect, this limits the maximum size of the total usable memory area to 512M bytes. This has nothing to do with the OS, but rather with one of the SuperBasic compilers (Qliberator?) using the top 3 address bits in its data structures, for something. The aforementioned 512M should contain all the ROM, RAM and IO, including any expansion areas.
The 68EC020 however has only 24 address lines and therefore can address only 16M of RAM, unlike the full 4G of the regular 68020. Both the regular 020 and the EC020 IIRC were available up to 33MHz and both can actually be overclocked a fair amount assuming the data and address lines are suitably buffered.
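The three sizes mentioned above follow directly from the number of address lines that take part in decoding. A quick check in Python (the helper name addressable_bytes is just for illustration):

```python
def addressable_bytes(address_lines):
    """Number of distinct byte addresses reachable with this many address lines."""
    return 1 << address_lines

MIB = 1 << 20

full_020 = addressable_bytes(32)          # regular 68020, A0..A31
qdos_limit = addressable_bytes(32 - 3)    # A29..A31 treated as don't care
ec020 = addressable_bytes(24)             # 68EC020, A0..A23

print(full_020 // MIB, qdos_limit // MIB, ec020 // MIB)  # 4096 512 16
```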
One particular impression from long ago - the regular 68020 despite being largely NMOS actually uses less power than the original 68008, even though it was running at 25 MHz in my case :)
I actually made a small board that made the 68020 run with an 8-bit external bus, a bit of logic and a 7805 regulator, inspired by what I read about the Thor 20. It actually worked up to the F1/F2 display at which point it froze probably due to the interrupt stack pointer issue.


Dave
SandySuperQDave
Posts: 2429
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Faster/wider CPU...

Postby Dave » Tue Feb 14, 2012 9:52 pm

Stupid question...

Why can't I provision 16-bit memory for the video memory etc and just have the video take the bottom 8 bits? The top 8 bits would be invisible, no?

Alternatively, the video memory could be read out by a new 16-bit video system that gave VGA-compatible or better output? Flash would be lost, maybe, but who cares?

Sorry for being a smooth-brain! ;)


Nasta
Gold Card
Posts: 339
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Faster/wider CPU...

Postby Nasta » Wed Feb 15, 2012 12:47 am

Dave wrote:Stupid question...

Why can't I provision 16-bit memory for the video memory etc and just have the video take the bottom 8 bits? The top 8 bits would be invisible, no?

Alternatively, the video memory could be read out by a new 16-bit video system that gave VGA-compatible or better output? Flash would be lost, maybe, but who cares?

Sorry for being a smooth-brain! ;)


The ULA expects the contents of the screen to reside contiguously inside the screen area, and assumes byte addresses. The CPU in your case in turn assumes word-wide memory. If you connect the ULA to only one half of the data bus (either top or bottom 8 bits) it will display only the even or odd byte within a 64k (note, NOT 32k) area, i.e. either the even or odd byte of SCR0 in the top half of the picture, and then even or odd byte of SCR1 in the bottom half of the screen, depending on what byte you connect it to (upper or lower). So, you COULD connect it that way, but then all your screen drawing / printing routines would have to be modified to take this into account. Not that it could not be done but a display format where every other byte is unused is extremely cumbersome to work with, never mind the loss of memory space.
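To make the every-other-byte effect concrete, here is a small sketch. It assumes the ULA's address lines end up selecting 16-bit words and that it is wired to a single byte lane; the function name and the lane parameter are hypothetical.

```python
SCREEN_BASE = 0x20000  # start of the screen area as seen by the CPU

def cpu_address_seen_by_ula(ula_address, lane):
    """CPU byte address whose contents the ULA would fetch.

    ula_address: address the ULA puts out (it believes it addresses bytes).
    lane: 0 for the even byte of each word, 1 for the odd byte (assumed wiring).
    """
    return SCREEN_BASE + 2 * ula_address + lane

# The ULA's first four fetches land on every other CPU byte.
print([hex(cpu_address_seen_by_ula(a, 0)) for a in range(4)])
# The last fetch of a 32k screen reaches almost 64k into CPU address space.
print(hex(cpu_address_seen_by_ula(0x7FFF, 0)))
```

The second print shows why the displayed data spans a 64k area: 32k fetches, each two CPU bytes apart.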

Let's examine the inner workings of the ZX8301 a bit further, to see if there is a way around this.

It has one RAS line to the RAM and two CAS lines. CAS can also be viewed as a chip select of sorts.
The ZX8301 never uses CAS1 for screen access (which IMHO was really stupid as it would have enabled 4 screen areas to be used with VERY little extra logic). It also has an enable line which is used to switch-off the CPU from the RAM while the screen memory contents are being transferred to the ZX8301 and to the screen.

The RAM itself is an array organized as 256x256 bytes (one 1-bit chip is used for each bit of the byte) adding up to 64k. Two such arrays are implemented with 16 chips, one is connected to CAS0 and this one normally resides at address 20000h, and holds the two screen areas. The other array is connected to CAS1 and resides at address 30000h.

The RAM itself has only 8 address lines so the RAS and CAS lines are used to latch the upper and lower byte of the address.
Peculiar to the RAM used, and common to nearly all dynamic RAM, the internal organization is 'only' 256 words of 256 bits each, and the row address (which is signalled to the RAM by the RAS line), once latched, actually completes the reading of data within the RAM. The column address, which is signalled to the RAM by the CAS line, then only selects one out of 256 bits within this long word, and puts it on the data out pin of the RAM. Because the whole 256 bit word, once read by RAS, remains in a local 'buffer' within the RAM chip, it is possible to get to other bits of it MUCH faster than reading a new 256 bit word. This sort of access is called 'page mode' access, and the 256-bit word is normally called a RAM page.
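The row/column split described above can be sketched in a few lines. This assumes, as stated, that the upper address byte is latched by RAS and the lower by CAS; the function names are only illustrative.

```python
def split_dram_address(address):
    """Split a 16-bit address into (row, column) for a 256 x 256 DRAM array."""
    row = (address >> 8) & 0xFF   # upper byte, latched by RAS
    column = address & 0xFF       # lower byte, latched by CAS
    return row, column

def same_page(a, b):
    """True if two addresses share a row, so page mode can be used between them."""
    return split_dram_address(a)[0] == split_dram_address(b)[0]

print(split_dram_address(0x7FFC))   # (127, 252)
print(same_page(0x0000, 0x0003))    # True: four consecutive bytes, one page
print(same_page(0x00FF, 0x0100))    # False: crossing into the next row
```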

Why all this?
Well, the ZX8301 uses page mode to read 4 consecutive bytes out of the RAM when it reads display data. It buffers the 4 bytes internally and assembles RGB pixels from this buffer, and while doing so, lets the CPU have RAM access, until it needs to fill the buffer again. The reason this is done is to shorten the time needed to read the data - it takes about half the time to do this kind of access than it would take reading each byte the usual way.

Incidentally, this is one reason why it is difficult to adapt the ZX8301 to a 16-bit wide bus. On a 16-bit bus, it would take only two accesses to get 4 bytes, not 4. Of course, the ZX8301 assumes an 8-bit bus so there is no way to stop it from doing its 4 accesses. But, there is a way we could make them appear as 4 accesses, twice to each of two consecutive addresses, but it requires external logic.

Normally the ZX8301 starts at the lowest address and it generates its screen accesses something like this:
1st. 4 bytes, address 20000h..20003h (note that only the least significant 16 bits are transferred to the RAM, in this example, 0000h..0003h, the upper bits are used to decode what we are accessing):
00h -> RAS
00h -> CAS0, 01h -> CAS0, 02h -> CAS0, 03h -> CAS0

2nd 4 bytes, address 20004h..20007h:
00h -> RAS
04h -> CAS0, 05h -> CAS0, 06h -> CAS0, 07h -> CAS0

etc, until the last 4 bytes, address 27FFCh..27FFFh
7Fh -> RAS
FCh -> CAS0, FDh -> CAS0, FEh -> CAS0, FFh -> CAS0
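The listing above can be reproduced with a short sketch, assuming (as in the listing) that the row is the upper byte of the low 16 address bits and each burst covers four consecutive columns. The function name is hypothetical.

```python
def ula_burst_8bit(byte_address):
    """RAS/CAS values for one 4-byte page-mode burst on the original 8-bit bus.

    byte_address: CPU address of the first byte of the burst (multiple of 4).
    Returns (row, [four column values]); only the low 16 bits reach the RAM.
    """
    low16 = byte_address & 0xFFFF
    row = low16 >> 8
    columns = [(low16 + i) & 0xFF for i in range(4)]
    return row, columns

for addr in (0x20000, 0x20004, 0x27FFC):
    row, cols = ula_burst_8bit(addr)
    print(f"{addr:05X}h: {row:02X}h -> RAS, "
          + ", ".join(f"{c:02X}h -> CAS0" for c in cols))
```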

Now, in a 16-bit system, the RAM addresses address 16-bit words, not bytes. Hence, 64k addresses address 64k words, or 128k bytes. This means that an 8-bit chip such as the ZX8301 needs all its addresses divided by 2, and the least significant bit of the address must be used to select the low or high byte. We need to make the ZX8301 look like this, to the RAM:

1st. 4 bytes, address 20000h..20003h (note that bits 15 to 1 of the address are transferred to the RAM, and bit 0 is used as a byte select within a word):
00h -> RAS
00h -> CAS0:H, 00h -> CAS0:L, 01h -> CAS0:H, 01h -> CAS0:L

2nd 4 bytes, address 20004h..20007h:
00h -> RAS
02h -> CAS0:H, 02h -> CAS0:L, 03h -> CAS0:H, 03h -> CAS0:L

etc, until the last 4 bytes, address 27FFCh..27FFFh
3Fh -> RAS
FEh -> CAS0:H, FEh -> CAS0:L, FFh -> CAS0:H, FFh -> CAS0:L

Where CAS0:H means the high byte is passed to the ZX8301 data bus, and CAS0:L means the low byte is passed to the ZX8301 data bus.
Some examination shows that what actually happens is, the address lines of the ZX8301 are shifted one down, so A7 connects to A6 on the RAM, A6 connects to A5 on the RAM etc, down to A1 connecting to A0 on the RAM.
But, what about A0? Well, this is where the problem lies. A0 has to be latched by RAS in order to use it as A7 when the CAS signals appear. Also, A0 during CAS must be used to select which of the bytes out of a word (upper or lower) should be sent to the ZX8301.
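The remapped access pattern can be sketched the same way. This is only a model of the address arithmetic, not of the latching logic itself; it assumes the byte at the even address is the high byte of the word (as on the 68k) and the function name is made up for illustration.

```python
def ula_burst_16bit(byte_address):
    """RAS/CAS values and byte lanes for one 4-byte burst on a 16-bit RAM.

    Returns (row, [(column, lane), ...]) where lane is "H" for the byte at the
    even address and "L" for the byte at the odd address, as in the listing.
    """
    low16 = byte_address & 0xFFFF
    accesses = []
    for i in range(4):
        byte_addr = low16 + i
        word_addr = byte_addr >> 1            # address lines shifted down by one
        lane = "L" if byte_addr & 1 else "H"  # A0 selects the byte within the word
        accesses.append((word_addr & 0xFF, lane))
    row = (low16 >> 1) >> 8
    return row, accesses

for addr in (0x20000, 0x20004, 0x27FFC):
    row, acc = ula_burst_16bit(addr)
    print(f"{addr:05X}h: {row:02X}h -> RAS, "
          + ", ".join(f"{c:02X}h -> CAS0:{lane}" for c, lane in acc))
```

Note that the column is the same for each H/L pair, which is exactly why A0 has to be handled separately from the other address lines.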

All of this pertains ONLY to the ZX8301 accesses to the screen RAM. Everything else one can leave as needed. Still, there are two problems that need to be addressed:
1) When the ZX8301 is not accessing the screen RAM, its internal control register (the one holding the 3 display control bits) is accessed when the proper address appears. The connection to the ZX8301 should be such that the register appears at the proper address expected by the OS.
2) RAM refresh - I am not entirely certain if the ZX8301 generates refresh cycles as such, I've never investigated this. The screen refresh process is theoretically not enough to provide refresh since this requires all rows of the RAM to be read at least once every 4ms, and the screen refresh takes 20ms and only reads half the rows. In the above modification, the row address is shifted down so whatever row address sequence the ZX8301 uses, only half the rows would be used. I'm not sure if or how this would reflect on proper RAM refresh. The RAM also has a refresh mode where it counts its rows itself; if the ZX8301 uses this method to refresh the RAM, the number of refresh cycles stays the same so refresh is guaranteed. The fact that tweaks to the logic around the ZX8301 were used to produce internal 512k RAM expansions using larger chips (which also have more rows to refresh!) does suggest that it might indeed be so, and that refresh might not be an issue.
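A rough way to see the refresh concern in numbers: count how many distinct row addresses a scan of one 32k screen touches with and without the address shift. This sketch assumes the row is the upper address byte and that the screen scan is the only source of row accesses, which is exactly the open question above.

```python
def rows_touched(shift):
    """Set of DRAM row addresses hit while the ULA scans one 32k screen.

    shift=0 models the original byte-wide wiring, shift=1 the modified wiring
    where every address is divided by two before reaching the RAM.
    """
    return {((byte_addr >> shift) >> 8) & 0xFF for byte_addr in range(0x8000)}

print(len(rows_touched(0)), len(rows_touched(1)))  # 128 64
```

So out of 256 rows, the original scan reaches 128 and the shifted one only 64; if the scan were the refresh mechanism, the modification would make matters worse.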

All that being said, it's a big question whether it's even a good idea to use the ZX8301 since they are getting scarcer by the day, and are not the pinnacle of reliability to begin with.

Some additional data on how the GC and SGC do this:
Both only write to the QL's motherboard RAM, specifically to the screen area(s) for the benefit of the ZX8301. At the same time, the same exact data is written to the internal GC/SGC memory at the same address. It is always read from the internal GC/SGC memory. The reason is of course speed - the GC RAM is at least 4 times as fast as QL motherboard RAM at its fastest (and nearly 3x on top of that on the SGC). This method of access is called shadowing, in general it means two or more copies of the same memory space exist in various physical memory chips, for various purposes.
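As a toy model of shadowing (not the actual GC/SGC logic, which is done in hardware), the rule is simply: every screen write goes to both memories, every read comes from the fast one. The class and attribute names here are invented for the example.

```python
class ShadowedScreen:
    """Toy model: writes go to both memories, reads come only from the fast one."""

    def __init__(self, size):
        self.fast = bytearray(size)   # stands in for the GC/SGC on-board RAM
        self.slow = bytearray(size)   # stands in for the QL motherboard screen RAM
        self.slow_writes = 0          # how often the slow bus was actually used

    def write(self, offset, value):
        self.fast[offset] = value
        self.slow[offset] = value     # kept in step so the ZX8301 can display it
        self.slow_writes += 1

    def read(self, offset):
        return self.fast[offset]      # the slow copy is never read by the CPU

screen = ShadowedScreen(0x8000)
screen.write(0x10, 0xAA)
print(hex(screen.read(0x10)), screen.slow[0x10] == screen.fast[0x10])
```

Only writes pay the price of the slow motherboard bus; reads never touch it.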

The sadly never realized design of the GoldFire added one more level of trickery, a write buffer, which enabled the CPU to write up to one long word to its IO interface, and go about its business while the actual data was transferred to the external bus. It would only have to wait if it needed to write something else before the current write was completed.
Aurora, on the other hand, used dual-port RAM for the screen (in all modes) in order to speed up access even with a standard 68008. This scheme only prevented the CPU from immediately accessing the RAM 5% of the time worst case, whereas with the ZX8301 it approaches 50%.


Brane2
Trump Card
Posts: 179
Joined: Fri Dec 30, 2011 5:42 pm
Location: Ljubljana, Slovenia

Re: Faster/wider CPU...

Postby Brane2 » Wed Feb 15, 2012 1:30 am

One question:

Why even bother with the ZX8301? It was a piece of crap anyway, and all that effort to use it in wider systems exceeds even the effort to recreate it from scratch.

Just use two fast SRAMs and SIMPLE code inside a small CPLD and you will by far exceed anything ULA1 had to offer. Picture generation in such a way is trivial and all the trickery ULA1 had to use you can simply forget. Same with DRAM refresh etc crap. And address decoding for the rest of the machine is a simple deal.

Why complicate things with negative performance gain ?


On the journey of life I chose the psycho path...
Brane2
Trump Card
Posts: 179
Joined: Fri Dec 30, 2011 5:42 pm
Location: Ljubljana, Slovenia

Re: Faster/wider CPU...

Postby Brane2 » Wed Feb 15, 2012 1:56 am

But, just for the sake of argument, lets stipulate that:

- one of your goals includes preserving as much of the original machine unchanged as possible, or at least contemporary.

- you limit yourself to the plain old 68/S/EC/HC/000.

- besides CPLD you don't use anything that uncle Clive could not use at that time.

I would then replace all 16 pieces of 64kx1 with 256Kx1 with small adaptations. I would use 120ns chips as they were deemed "fast" then.

On each chip I would lift A8 ( pin 1 ) and corresponding Din ( pin 2 i think) and Dout ( pin 14 IIRC).

This would give me 512K of main RAM and the possibility to do rw cycles.
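The 512K figure is easy to check: sixteen 1-bit-wide chips form two byte-wide banks, so the total is chips times bits per chip, divided by 8. The helper name is just for illustration.

```python
def array_bytes(chips, bits_per_chip):
    """Total capacity in bytes of a bank of 1-bit-wide DRAM chips."""
    return chips * bits_per_chip // 8

KIB = 1024
original = array_bytes(16, 64 * KIB)    # sixteen 64Kx1 chips
upgraded = array_bytes(16, 256 * KIB)   # sixteen 256Kx1 chips
print(original // KIB, upgraded // KIB)  # 128 512
```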

Then I would wrap the RAM pages from 512x1 to something more square like 32x16.

I would use existing FP tricks but with a bit bigger internal buffer so that video could afford to wait for CPU access if needed. And with all registers 16-bit of course.

I would open to the CPU the possibility to use FP cycles either by itself ( subsequent accesses to the same area with much lower access times) or through the CPLD.

I would open the possibility to access RAM in R&W cycles. That could be used for example, to clear the screen ( or fill it with some colour or pattern) just after reading each word for picture generation.

With extra fast RAM expansion such QL would not look very differently from existing one, but it would be significantly faster. Especially with slow RAM used just for picture generation and extra fast RAM for everything else...


On the journey of life I chose the psycho path...
Brane2
Trump Card
Posts: 179
Joined: Fri Dec 30, 2011 5:42 pm
Location: Ljubljana, Slovenia

Re: Faster/wider CPU...

Postby Brane2 » Wed Feb 15, 2012 2:27 am

There are probably a couple more tricks I have forgotten to mention.

You could skew CAS as it was done on ATARI ST ( and thus achieve 250 ns instead of 280ns cycle time with 150 ns DRAMs and correspondingly faster with 120 ns).

Also, RWC could be very effectively used for drawing routines, especially with higher clocked CPUs.

So, within one fast page, CPU could read a word, modify it and then write it back.

Another trick would be to use FPGA instead of CPLD and its internal RAM banks for EPROM emulation and QDOS.

With higher clocked 68000 access times become problematic, especially for FLASH. But if your "FLASH" is actually very fast SRAM that is on the die to boot, then it is easier to keep low wait-state number even on very highly OC-ed CPU.

Another cool trick would be a prefetch buffer where the CPU could order the "ULA" to read a group of consecutive words and store them in internal registers, and a write buffer with the opposite role.

All this would bring small physical change, but relatively big performance increase. All the user would see is small board, covering the area of CPU and ULA1 and a bunch of wires, leading through RAM field and missing SPC, EPROMs and possibly ULA2.

Also, you could replace existing SPC with PIC or similar chip which could generate synchronisation and cooperate with main "ULA" as it was done on ST. This would bring you programmable resolution and timings of picture generation.

It would of course bring instantaneous communication with keyboard, speaker, COM port registers etc as a side benefit. And mouse interface etc. And real, working, battery backed RTC and/or NVRAM or EEPROM, etc etc.


On the journey of life I chose the psycho path...
Dave
SandySuperQDave
Posts: 2429
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Faster/wider CPU...

Postby Dave » Wed Feb 15, 2012 5:26 pm

The nice thing about cool tricks is, well, ummm...

If you're at Nasta's skill level, the tricks are easy, and it's just a matter of making a plan, a design, testing and implementing and testing, going through a few iterations at moderate but limited expense until you have something that works as intended.

At my skill and budget level, it's a different matter. I have not kept current, know little of modern components beyond their basic principles, and am happiest with 74-series logic. I can do PCB design to a higher level due to radio work I have done in the past. I successfully cloned Nasta's QubIDE to make a CF card fast storage device (neat, but lost now) but Aurora-level designs are beyond my sole capabilities.

The point of this thread is to produce a least cost QL clone with moderately better performance at least development cost. This includes dropping anything tricky or needless like microdrive support, etc. The point of this thread is to draw together like-minded people to throw together a "least effort" effort ;)


nichtsnutz
ROM Dongle
Posts: 24
Joined: Wed Apr 13, 2011 6:33 pm

Re: Faster/wider CPU...

Postby nichtsnutz » Wed Feb 15, 2012 8:42 pm

Hello Dave,

as you would like to use the 68EC000, a first minimal expansion
you could build could use it in 8 bit mode but with the clock doubled
together with a 512KB sram for the address range $40000 to $BFFFF.
So to say, you would have a 68008 but with 15MHz when accessing the
expansion memory.
Controlling the speed would be done over DTACK.
When accessing the onboard memory you would use the DTACK generated
by the ZX8301 ula.
When accessing the expanded memory you have to disable the ULAs via DSMCL
high and would generate a fast DTACK based on the 15MHz clock.
Things you would need :
1) 68EC000 in 8 bit mode via the MODE pin.
2) 512KB sram chip.
3) A PLL to double the 7.5MHz clock of the expansion port.
This is a bit tricky. Maybe use an ICS501 or an ICS9173B.
4) Some address decoding and control logic. I think this could be made
with some standard TTL (gates, counter).
5) New DTACK generation. The cpu would always run with 15MHz, DTACK
is doing the magic!
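The address decoding in the list above boils down to one range check. Here is a rough sketch of that decision in Python; the function and constant names are illustrative only, and the real thing would of course be a handful of TTL gates rather than software.

```python
FAST_START = 0x40000   # first byte of the proposed SRAM window
FAST_END = 0xBFFFF     # last byte of the proposed SRAM window

def select_dtack(address):
    """Decide which DTACK source answers a CPU access (illustrative names).

    Returns "fast" for the new SRAM (ULAs disabled, DTACK derived from the
    15MHz clock) or "ula" for everything still served by the motherboard.
    """
    return "fast" if FAST_START <= address <= FAST_END else "ula"

print((FAST_END - FAST_START + 1) // 1024)   # window size in KB
print(select_dtack(0x20000), select_dtack(0x40000), select_dtack(0xC0000))
```

The first print confirms that $40000 to $BFFFF is exactly the 512KB that one SRAM chip provides.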

I think this would be the simplest possible thing to do, although the details
can of course cause some headaches!

I have also a 68SEC000 as a spare part and will think about doing this,
but at the moment I am doing other measurements and have some other
smaller hardware tinkering running to assist me in measuring.

If you need, I can post measurements of the cpu access to the ZX8301 ula,
because at the moment Daniele (the author of Q-emulator) and I are doing some
ula timing research to get the Q-emulator as close as possible to the real hardware.
If you need some timing support to get your project running I would like
to help as I can.

I have attached a timing diagram where you can see a cpu access that is stalled
by the ula and also the page mode access for the 4 bytes as the user Nasta
already explained.


Many Greetings,
Vassilis

CPUIF.PNG
cpu - ula timing


Brane2
Trump Card
Posts: 179
Joined: Fri Dec 30, 2011 5:42 pm
Location: Ljubljana, Slovenia

Re: Faster/wider CPU...

Postby Brane2 » Wed Feb 15, 2012 9:36 pm

Dave wrote:If you're at Nasta's skill level, the tricks are easy, and it's just a matter of making a plan, a design, testing and implementing and testing, going through a few iterations at moderate but limited expense until you have something that works as intended.


IMHO your philosophy is totally wrong. There are no Nasta's or any other "skill levels". AFAIK Nasta has never run QL tinkering seminars and issued special licenses to folks that would pass some test etc ( like some RedHat license or something :mrgreen: )

The world is full of stuff that presents specific challenges. Some of them you know how to overcome, some you don't (yet). Gray stuff inside your skull and internet connectivity enable you to push the boundaries on the matter that interests you, be it changing the oil in your car engine, replacing the old leaky faucet in the house, learning Mandarin or something totally different.
There is no substitute for will to learn.

At my skill and budget level, it's a different matter. I have not kept current, know little of modern components beyond their basic principles,


Then it's time to learn. No way around that. CPLD and FPGA stuff is not that expensive, in fact it is probably cheaper than your existing route. Basic development tools are free, chips are almost free.
Programming gear does have its cost ( Xilinx USB thingie costs around €200), but:

- there are simple DIY substitutes
- there are Chinese copies for €25

and that's about it. Gear for programming microcontrollers is even simpler/cheaper. For all of Microchip's microcontrollers, you need just a Pickit3 dongle (€35) and a free MPLABX. A simple PIC16F1527 costs you around €1 per chip. Solder the chip on the board, place a small ( 6 pin?) header next to it and reprogram it a bazillion times in the circuit. No need for special adapters, programmers etc etc.

and am happiest with 74-series logic.


I know the feeling. My happy place is on top of Angelina Jolie. But let's be realistic, shall we ? :mrgreen:


The point of this thread is to produce a least cost QL clone with moderately better performance at least development cost. This includes dropping anything tricky or needless like microdrive support, etc. The point of this thread is to draw together like-minded people to throw together a "least effort" effort ;)


Take a few steps back and take another look at this - you are thinking about REDESIGNING A COMPUTER and you expect to find a "non-tricky" way to do it ? :roll:
Last edited by Brane2 on Wed Feb 15, 2012 10:22 pm, edited 1 time in total.


On the journey of life I chose the psycho path...
