Native 68k vs Coldfire vs FPGA vs recompilation ?

Discussion and advice about emulating the QL on other machines.
M68008
Trump Card
Posts: 223
Joined: Sat Jan 29, 2011 1:55 am

Re: Native 68k vs Coldfire vs FPGA vs recompilation ?

Post by M68008 »

For 7: QL software is written in various languages (the OS in assembler), and for a lot of it no sources are available, so binary recompilation may be the route to choose. The closest thing to this was probably my QemuFast experiment, which is just a faster emulator based on just-in-time 68000-to-x86 translation. For even (much) greater speed, one could try recompiling entire binaries offline with one of the major compiler engines, but this has its own problems; for example, it may be impossible to tell which parts of a binary are code and which are data. QemuFast worked well; the only real problem was self-modifying code, which is common in old QL programs.
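To illustrate why self-modifying code is the painful case for any translation-cache approach, here is a minimal sketch in C (hypothetical names, not how QemuFast actually does it): translated blocks are cached by guest address, so every emulated write has to check whether it lands inside an already-translated block and, if so, throw that translation away.

#include <stdint.h>
#include <stddef.h>

/* One cached translation: a run of 68000 code starting at guest_pc,
 * guest_len bytes long, translated into callable native (x86) code. */
typedef struct {
    uint32_t guest_pc;
    uint32_t guest_len;
    void   (*host_code)(void);   /* NULL = slot empty or invalidated */
} TBlock;

#define CACHE_SLOTS 4096
static TBlock cache[CACHE_SLOTS];    /* direct-mapped translation cache */

/* Find a previously translated block for this guest PC, if still valid. */
static TBlock *lookup(uint32_t pc)
{
    TBlock *b = &cache[(pc >> 1) % CACHE_SLOTS];
    return (b->host_code && b->guest_pc == pc) ? b : NULL;
}

/* Called on every emulated memory write: if the write hits code we have
 * already translated, the cached native code is now stale and must go. */
static void notify_guest_write(uint32_t addr)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        TBlock *b = &cache[i];
        if (b->host_code && addr >= b->guest_pc && addr < b->guest_pc + b->guest_len)
            b->host_code = NULL;     /* force retranslation next time */
    }
}

A real translator would track dirty pages rather than scanning the whole cache on every write, but the invalidation requirement is the same, and that is exactly what self-modifying QL code keeps triggering.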
If the only benefit is speed, however, this is unlikely to make much of a difference: emulators already run fast on current CPUs. Probably the biggest challenge for a modern QL system would be modernizing the OS.


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Native 68k vs Coldfire vs FPGA vs recompilation ?

Post by Dave »

I went through this exact thought process a few years ago, and remember making an almost identical (though less well argued) post around 2005. The conclusion I drew was that it would be best to recompile QDOS/SMSQ to target the ARM architecture.

I got a resounding "whatever dude".

Also, most people felt more favourably towards emulation: QDOS needs a machine to run on - that machine can be implemented in HW or SW.

At this stage, it no longer matters what is optimal. A simpler, less optimal project that completes is FAR more use to the community than an ideal project that doesn't.

Funny, it seems we are in "violent agreement"...

Do you have Skype?


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Native 68k vs Coldfire vs FPGA vs recompilation ?

Post by Nasta »

Well, having found this forum quite by accident, it's perhaps time to post something on it and show that news of my demise may after all have been premature :)
Brane2 wrote:I am not sure where to file this post: HW/SW/EMU, but here goes.
I was pondering the CPU problem with new QL-compatibles and, after going the route many have gone many times before me, I decided to take an unexpected left turn somewhere:
All existing solutions known to me fall into one of the following categories:
1..4 are basically derivatives of the original 68000 architecture.
As was pointed out, most are obsolete, and the only current ones are not entirely compatible, but I will get back to this in a moment.
There is one more 68k compatible which deserves mentioning, a member of the 683xx family, specifically a subset called 'DragonBall'. The most advanced of these, now sadly also obsolete and long in the tooth, was the DragonBall SZ, which would have been an almost ideal platform for a small QL replacement, mostly aimed at portability and embedded or PLC applications (*).
Its most interesting feature is the inclusion of a recompiled 68k core capable of operating at 66MHz. Even though the number sounds interesting compared to existing native or next-gen hardware, it should be noted that this is a 68k core without cache and with 16-bit wide buses, so not as fast as the MHz figure would suggest, but on the other hand completely compatible and very low power.
The chip has a bunch of peripherals on it which, if you are willing to read the 1000+ page datasheet, add up to a 'connect the black boxes' system that also includes a simple frame-buffer graphics implementation, capable of WXGA resolutions in something like Aurora 8-bit colour, mainly intended for an LCD display. A very complete system could have been built around this chip, although it requires a master of PCB design and probably a 6-layer board, due to the very tiny BGA package with 0.8mm ball pitch and some 200 pins.

The ColdFire route seemed interesting at first when CF V1 was announced, which was basically a recompiled 68EC040 with a multiplexed bus, implemented that way to reduce the pin count.
This was also the first and last ColdFire that could be made to run like a 68k-compatible CPU, at least as much as any other 68040 or 060 chip could. From then on things became unusable as far as binary compatibility is concerned, with CF V2 and V3. However, V3 did show that Freescale thought better of its efforts to cut down the 68k: some addressing modes and instructions were put back into the V3 core, and trapping of the other unused opcodes was again implemented. V4/V5 offered even more and basically came back to the specs of a 68060, at much higher clock rates and with extra features such as some DSP elements, caches, and tons of peripherals.
With one exception - whereas the 68020 already separated the usual 2 stacks of the 68k into 3, the CF parts all use a single one. There is also some other stack behaviour which is incompatible and, unfortunately, cannot be trapped. There might be a way around this using some hardware support, but that in turn slows everything down a lot, essentially disabling the caches. What it comes down to is that even though the current CF is within a hair's breadth of being fully compatible, in the end you have to use emulation to make it so, just as you would for almost any other modern CPU. At first it sounds like less effort, due to the 99.9% compatibility, but in fact it is incredibly difficult to trap that remaining 0.1% - about the same effort (±3dB) as complete emulation. This is a great pity, as running native on a modern CF would essentially beat everything else by a wide margin.
5. FPGA

Advantages: A modern FPGA can host a 68000 trivially. In fact, designs capable of beating the 68060 on a clock-for-clock basis have been demonstrated. Since designers can tailor everything to their needs, compatibility is not a problem. Also, achievable frequencies are in excess of 100MHz.

Disadvantages: Development is tricky and slow, and FPGAs are not that cheap. The results are great, but they don't pay for all that development effort, which could be better spent designing something else for a bigger crowd. Also, such a QL somewhat resembles a VW Beetle assembled from a special kind of LEGO made of titanium: great as an attraction, but not so great as a car. Not that cheap altogether and, really, when you take everything into account, without much reason to exist.
I have to agree on this. Because of ColdFire, Freescale never released rights to the 68k core, and people are only allowed to do their own functional clones, as the architecture itself is not patented. Unfortunately, this dilutes the effort. If the core had been released, there would by now be several free implementations, as there is a wider market for this sort of thing, mostly in communications and the military. These sorts of clients, however, are not interested in solutions where whoever is offering them cannot guarantee full compatibility and reliability, and to date, to my knowledge, no-one can.
What is worse, Freescale DO have a compilable 68k core, the very one that was used in the DragonBall series of CPUs. In fact, a bit of history: the 68040 was Motorola/Freescale's first fully compiled core, except for the caches, FPU and MMU. By the time the 68060 appeared, it was fully compiled. Mot/Freescale was one of the pioneers in this, and it enabled them to produce tons of custom-synthesized chips for application- or customer-specific needs, which indeed is the way they survived and still do.
6. Emulation:

Advantages: Relatively good effects on modern HW.

Disadvantages: NO point doing it, outside some simple data conversion and transfer from QL to PC and vice versa. The result preserves nothing of the original machine. It is neither minimal, effective nor cheap.

7. recompilation for new target

So, I am asking whether route 7 - recompilation of existing binaries and/or sources for a new target - is the optimal route.

The QL was designed as an efficient, minimalistic machine. The 68K wasn't pivotal to the QL's design; it could very well have been any other CPU. What made the QL is its philosophy, not the concrete solution that materialized in the end.

So, as long as the same principles were honoured, the end user shouldn't care much if his CPU has no Motorola/Freescale logo on it but is, for example, a MIPS or something else.

This would put the main strain on analysis and recompilation tools, but that was long overdue anyway. The QL's development path was peppered with incompatibilities of many sorts, and at every step of the way some programs have fallen off the compatibility bandwagon.

What do you think ?
There is also a middle way, call it 6.5.
Apple is using it for the second time now, and it's based on a mixture of emulation and recompilation.
Use of this sort of technology is not unknown in the QL world either - it has been used for a long time in the QPC emulator, although in a rather specific way, by implementing some traps (e.g. maths functions) in native code. QemuFast is another example. Before that, we had Spectrum emulators on the QL that did the same. And there once was a prime example of this technology which, incidentally, could well have been used to emulate a 68k+ architecture: the Transmeta Crusoe and Efficeon CPUs.
The basics of this technology are a (preferably hardware-assisted) software emulator, and a (preferably hardware-assisted) method of differentiating between emulated and native code.
A basic implementation would be a traditional emulator. This would be a piece of code designed to emulate a 68k CPU, taking into account the various hardware specifics of the actual computer we wish to emulate, like address maps, IO emulation etc. Normally this code is self-contained and invisible from the emulated side. Depending on the actual architecture of the CPU, or indeed the computer, used for emulation, several levels of sophistication can be added, some of which can improve performance quite drastically:
- hardware-assisted instruction decode. In cases where the native CPU is close in architecture to the emulated CPU, or where the instruction to be emulated is very simple, decoding the actual instruction and finding the address of the equivalent native code to be executed takes MUCH more time than the actual operation that implements the emulated instruction - 10-20x more is not uncommon. Hardware-based decoding can speed this up drastically.
- emulation code caching, which is a form of compilation, e.g. just-in-time (JIT) compilation techniques. This can also be hardware-assisted, and works by the emulator 'remembering' how it emulated a given code fragment using native code, so that when the fragment is jumped to again (for instance in a loop) it does not need to be decoded and emulated again; the emulator just executes what it 'remembered' the last time it went through that code. Simplified, this technique is based on tracking program-flow-change instructions and condition-test instructions. A purely software-based emulator of this kind will perform badly on very short, seldom-used code, but offers orders of magnitude of speed-up on more complex code, sometimes approaching the performance of the native CPU.
- exploiting emulated machine or OS specifics to replace non-volatile (i.e. non-self-modifying) code with native code (a small sketch follows this list). The QL is a prime example, since its OS API is organized around software-generated exceptions, which are easy for the emulator to recognize. Since most of the OS routines are well known, their functionality can be duplicated in native code without the need to emulate any code at all. In fact, large parts of the OS could be rewritten in native code, and as development of the whole thing goes on, the whole OS could be rewritten in native code.
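As a minimal sketch of that last point - assuming hypothetical helpers fetch16, emulate_one and native_qdos_trap, so this is an illustration rather than any existing emulator's code - the emulator's main loop only needs to recognize the 68k TRAP opcodes ($4E40-$4E4F) that the QDOS API is built on and route the known ones to native implementations:

#include <stdint.h>

/* Emulated 68k register file: data regs, address regs, PC, status. */
typedef struct { uint32_t d[8], a[8], pc; uint16_t sr; } M68kState;

extern uint16_t fetch16(M68kState *s);                  /* read next opcode word      */
extern void     emulate_one(M68kState *s, uint16_t op); /* plain interpretation       */
extern void     native_qdos_trap(M68kState *s, int n);  /* native OS implementations  */

void run(M68kState *s)
{
    for (;;) {
        uint16_t op = fetch16(s);
        if ((op & 0xFFF0) == 0x4E40) {      /* TRAP #0..#15                        */
            int n = op & 0x000F;
            if (n <= 3) {                   /* QDOS only uses TRAP #0..#3          */
                native_qdos_trap(s, n);     /* skip emulating the ROM routine      */
                continue;
            }
        }
        emulate_one(s, op);                 /* everything else: interpret as usual */
    }
}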
Once some of the above is in place as the new architecture of the whole system, a decision needs to be made whether it is in the system's best interest to keep it virtualized, with the user and programmer unaware of the actual CPU architecture doing the work, OR to open a way for the programmer to write native code should they want to, in order to circumvent the emulation and get maximum performance. This is where the (preferably hardware-assisted) way of differentiating between emulated and native code comes in. Traditionally this is done in two ways, and in most cases simultaneously:
- The initial assumption is emulated mode. A software-generated trap (usually an unimplemented instruction) in the emulated architecture is used to tell the emulator that native code follows (see the sketch after these two points). The end of native code execution is usually signalled using the exception system of the native CPU, so that a return from exception at the end of the native code cleanly ends the native segment and returns to emulation.
- In native-code mode, a 'native code equivalent' OS API is provided so that OS facilities can be used equivalently from native code, and the API takes care of whether a particular part of the OS is emulated or native.
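To make the first point a little more concrete, here is a hypothetical illustration (the marker opcode is an arbitrary word from the 68k's unimplemented A-line range, not an established QL convention, and guest_to_host is an assumed helper): when the emulator's single-step routine sees the marker, it calls the host code that follows it directly and resumes emulation wherever that code says.

#include <stdint.h>

#define ENTER_NATIVE 0xA0F0u    /* illustrative choice of unimplemented opcode */

typedef struct { uint32_t d[8], a[8], pc; uint16_t sr; } M68kState;
typedef uint32_t (*NativeEntry)(M68kState *s);      /* returns the PC to resume at */

extern void    *guest_to_host(uint32_t guest_addr); /* map emulated address to host pointer */
extern uint16_t fetch16(M68kState *s);
extern void     emulate_one(M68kState *s, uint16_t op);

void step(M68kState *s)
{
    uint16_t op = fetch16(s);
    if (op == ENTER_NATIVE) {
        /* The bytes after the marker are native code; call them directly. */
        NativeEntry fn = (NativeEntry)guest_to_host(s->pc);
        s->pc = fn(s);          /* run native code, then carry on emulating */
        return;
    }
    emulate_one(s, op);
}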
Although the principle is fairly simple compared to the emulation part, careful attention to detail is needed, for things such as actual hardware-generated exceptions (interrupts etc.) and stacks. Things become quite a bit more complicated if memory management and virtual memory are added to the mix, but fortunately this is not the case for the QL. In fact, it could be argued that only minimal MMU functions would ever be needed for the QL even if the OS were expanded in this manner.

One more point:
It DOES help a LOT if the emulating CPU is similar in architecture to the one being emulated: in particular, the endianness is the same, the number and size of registers are equal or greater, and there are no restrictions on register use except for special registers such as stack pointers and status. In short, nearly any RISC-based architecture will be suitable; what it comes down to is availability, choice of peripherals, speed, ease of use and cost. When you look at it, the finger points mainly at ARM CPUs, which is no wonder, since the 68k was largely the role model for this type of CPU when it was originally designed.


(*) - these niche markets are still, IMHO, viable for a 'QL clone' machine, and although not massive, still vastly greater than the maximum 200-unit market based solely on QL users.


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Native 68k vs Coldfire vs FPGA vs recompilation ?

Post by Nasta »

Brane2 wrote:@Nasta:
So, you are alive and (hopefully) well. :P
Well, I'll take that as good wishes as I may be alive, but simple words like well don't even scratch the surface of how I am, so let's just say I'm reasonably Ok and leave it at that :)
WRT ARM - just as an inquiry - what do you think about its capability to conditionally execute an instruction (flag conditions encoded in the instruction word)? I don't see much advantage in that, as such an inactive instruction is rarely "free".
If nothing else, one execution slot is unused, and even if you had an instruction decoder capable of decoding more instructions than the execution units could execute, this would often lower code density.
And with several consecutive or nearly consecutive "dead" instructions in a loop, the cost would increase sharply.
It doesn't seem to simplify or speed up dependency calculation either, since condition codes are not conditionally set, but read and acted upon...
Interesting question, as I just recently had a discussion about it with the main software guy at the place I worked at until recently, both of us having pored over some compiler-generated code for the ARM M3 and M5.
Actually, IMHO this is not so much a feature as it is a side-effect of the way ARM decodes instructions. Nothing is lost except perhaps code space to encode instructions in, and it does have some rather interesting ways of simplifying code. One which we actually found very useful is saturation arithmetic, normally a rather bothersome thing for standard CPUs. This is the case where you want to 'decrement register' but keep it from underflowing, so you need to add an 'if the register underflowed, load it with zero' piece of code, which requires a conditional jump and a load instruction. With ARM you simply, for example, XOR the register with itself conditionally. The great difference is that there is no change of program flow, which keeps anything pipelined flowing smoothly, and if there is branch-prediction hardware it is avoided completely, keeping the branch cache free for more important branches.
This is a very simple example, but examining the code shows that the compiler is quite good at folding instructions into this sort of CPU peculiarity. The most important benefit is with time-critical IO operations: just consider how many low-level IO routines have 'check a flag and do something simple if it is set, or something even simpler if not' code in them, often within loops too. This conditional-execution feature comes in very handy there, because the jump overheads are sometimes far more time-consuming than a short series of conditionally executed instructions. BUT - this kind of code is not easily humanly readable or understandable (which is almost a rule with RISC machine code).
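To make the saturation example concrete, here is the C equivalent. On a classic ARM, a compiler is free to turn the 'if' into a conditionally executed instruction or two (roughly a compare followed by a conditionally executed subtract), so no branch is ever taken; the exact instruction choice is the compiler's, and the comments are only an illustration.

/* Saturating decrement: count down but clamp at zero instead of wrapping. */
unsigned int sat_dec(unsigned int x)
{
    if (x != 0)      /* only decrement if it will not underflow...             */
        x = x - 1;   /* ...which a compiler can fold into a conditional SUB    */
    return x;        /* x == 0 stays at 0: saturated, with no jump taken       */
}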
WRT version "6.5", wouldn't it be simpler just to glue on an extra RAM chip and use its bits for such purposes (like having 32+8 bits of RAM...)?
Yes and no - depending on how you go about the whole thing. If you are counting on the various fast memory units (which includes caches) inside the chip itself to speed up critical code execution, then you would be fresh out of luck with this approach, as you get your RAM pre-organized by the manufacturer of the chip. Even if we were talking just caching (and real speed is practically impossible without it these days), the concept falls apart - UNLESS you build your own emulating CPU in an FPGA. Then you can basically do whatever you want.
However, external RAM connected in various interesting ways could be used for, e.g., lookup tables to speed up instruction decode for emulation. But then you could just as well implement the decoder in an FPGA. One might argue that's already halfway to implementing a CPU core inside an FPGA, but the argument falls flat if the intention is eventually moving to native code for a different CPU.
WRT having the core in an FPGA - for small applications, who cares whether the RTL is available or not? The 68000 is relatively simple, and I doubt that anyone is really interested in having a cycle-exact copy.
Besides, FPGAs are by their nature very easily reprogrammable, so even if a bug surfaced, it could be relatively painlessly either removed or promoted to a "feature". :mrgreen:
I think there is also a cool niche for an FPGA implementation. One is the world of small microcontrollers.
A cheap FPGA is not that much more expensive than a MIPS-based chip or an ARM, and it can be customized to an extreme degree.
Somehow I can't push that "MachXO2 guy" from my mind. This stuff is interesting and prices do start quite low. Unlike other decent FPGAs, the MachXO2 doesn't have hardware multipliers, but they can be either synthesized or avoided. For a simple CPU32-like core this should be quite adequate.
Such a contraption could be used for "classic" applications or as an advanced QL. Or both.
True, the core couldn't compete in speed with an ARM etc., but the periphery could kill anything in sight. So, all in all, the end product could win some applications...
The reason it's a pity the core is not available is that having it, obviously well tested, would remove the need to invest time in making yet another compatible core implementation.
Unless the implementation is allowed to be quite inefficient (arguably less efficient than even the first 4MHz version of the 68000, with its 5-clock bus cycle), debugging even a core as relatively simple and flat/symmetrical as the 68k is not a trivial task. Especially if one runs into problems with the synthesis tools, and it seems lately this is not at all uncommon. It has been my experience, designing several platforms since early 2003, that development systems are consistently getting worse in quality, although they try to peddle quantity (of options... which mostly do not work!) as a selling point.
Also, because synthesis tools are largely re-used, there is less and less user control over how logic is actually compiled and fitted into an FPGA, and sometimes the actual performance of a compiler (as opposed to what the marketing department presents) is abysmal - for instance, unable to properly use the very extra features that may make your chosen FPGA the best fit for the task at hand, namely implementing a CPU. Unfortunately, there is really little one can do about this. In the world of core synthesis, FPGAs are used to verify the logic, not to be the final optimized implementation - that last part is handled by a silicon compiler. This normally means that even for a very simple core, the design is done on the largest and fastest FPGA available until it is fully verified - even if it ends up as a $1 microcontroller with some peripherals in the end. Not something anyone can attempt, unfortunately.
Even then, errors happen. In the first ARM project I did, based on a TMS ARM M3 implementation, we found 2 bugs in the CPU and 4 in the GPU. These will actually never be corrected - the chips have already been declared obsolete, and new ones are out, with their own bugs. Sometimes the very implementation is asking for it: the last ARM-based MPU I used could easily be set up to have memory cycles that end before they even start - a completely needless and idiotic lack of thinking on the developer's part, because it does not get trapped by the in-circuit development hardware but instead crashes the core... so out comes the 20-year-old trusted Tektronix scope. So much for advanced development tools.


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Native 68k vs Coldfire vs FPGA vs recompilation ?

Post by Nasta »

Brane2 wrote:And even if you take a look at ARM-based microcontrollers in that price range, they all seem to be M3 or some such, with operating frequencies in the 20-50 MHz range.
This shouldn't be tough to reach even in a small FPGA...
If one were to do an FPGA implementation, it would be simpler to actually implement a cut-down 68k (such as a ColdFire V2), but give it a way to handle unimplemented instructions or addressing modes through emulation, say from an 'emulation RAM' internal to the FPGA for fastest access. In other words, provide partial hardware acceleration of decoding the unimplemented instructions and modes, and use that to automatically launch code that works on a different register bank, without the need to save the CPU context. This is not a regular type of exception and in fact should be invisible from the standpoint of the emulated code.
In fact, a whole different CPU 'architecture' could be used to do this emulation bit, one which works outside the normal CPU model. Incidentally, this is nothing new - a long while ago a company called NexGen developed such technology, including a sort of emulation cache (so that the decoding and translation lookup would not have to be performed over and over again in loops etc.), and developed a core around it that, after AMD acquired them, became the AMD K5, then the K6, K6-2 and K7, also known as the Athlon - and you know the rest.
Even more curious, some parts of this architecture can be traced as far back as the tragically demised Inmos Transputer. This in turn had a very interesting way of transferring lots of instructions to the core to be executed, by packing them into 'atoms' of 4 bits each, so up to 8 could be packed into a single 32-bit word. Where this was not enough, an extension format was used for larger word sizes. But the main trick in getting so many instructions into so few bits was making the instructions zero-address. This means that the addresses of operands for these instructions are always implied and do not have to be contained in the instruction word. How is this done? Simple - by using a stack. Yes, the Transputer could function as a stack-based CPU, something Forth users will readily recognize. Various forms of this architecture are almost ideal for emulation, provided the initial decoding is done in hardware.
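Purely to illustrate why that kind of packing decodes so cheaply (this follows the description above rather than the Transputer's actual encoding, and the opcode set is invented for the example), a toy zero-address machine might unpack and execute one 32-bit word like this:

#include <stdint.h>

/* A handful of invented 4-bit, zero-address opcodes. */
enum { OP_NOP = 0, OP_PUSH1, OP_ADD, OP_DUP, OP_DROP };

/* Execute the eight 4-bit 'atoms' packed into one 32-bit word.
 * All operands are implied: they live on a small evaluation stack. */
void run_word(uint32_t word, int32_t *stack, int *sp)
{
    for (int i = 0; i < 8; i++) {
        unsigned op = (word >> (i * 4)) & 0xF;    /* next 4-bit atom */
        switch (op) {
        case OP_PUSH1: stack[(*sp)++] = 1;                        break;
        case OP_ADD:   stack[*sp - 2] += stack[*sp - 1]; (*sp)--; break;
        case OP_DUP:   stack[*sp] = stack[*sp - 1]; (*sp)++;      break;
        case OP_DROP:  (*sp)--;                                   break;
        default:       /* OP_NOP and unused codes: do nothing */  break;
        }
    }
}

Decode is just a shift and a mask, and no operand addresses ever appear in the instruction stream - which is also why this style of architecture lends itself so well to hardware-assisted emulation.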


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Native 68k vs Coldfire vs FPGA vs recompilation ?

Post by Nasta »

On ARM conditional instruction execution
Brane2 wrote:
Actually, nothing is lost except perhaps code space to encode instructions in, and it does have some rather interesting ways of simplifying code.
I don't see how "nothing is lost". Dead instructions still have to be decoded, if nothing else. So you have to have a decoder faster than the rest of the chip. That is, the decoder has to be able to decode more instructions per clock than the rest of the chip is able to retire.
And that can only go so far. Maybe one or two extra instructions, if that. So you can't have many "dead by condition" instructions in the loop, or your pipeline will start to see some serious bubbles.
Hm. I really don't see where this 'dead instruction' idea comes from, as most of the time instructions are live, not dead, i.e. the condition for execution is 'always'. This just means the bit field denoting the condition is mostly a constant within each instruction word.
There is absolutely no reason not to decode such an instruction, as the condition evaluation takes place during instruction execution, namely before the end result is written (either truly or effectively, depending on the actual core implementation). This is just like decoding a condition test or a conditional jump: it has to be decoded anyway for the condition to be evaluated. When looking at the basics of the architecture, one should look at the first implementation - so no condition prediction, no instruction folding, no superscalarity etc., just a basic fetch-decode-execute pipeline where 'execute' may be further divided into more pipeline stages. The condition must be evaluated at instruction execution to ensure it has been correctly set by any previous instructions, hence there is really no difference in pipeline flow whether the condition is met or not - if not, the instruction effectively becomes a NOP. Obviously, in this simple implementation there is an energy penalty, as the instruction really is executed; the result is just either stored where it is supposed to go, or discarded.
Again, in this simple model, any change to the program flow other than pure linear flow is to be avoided for best efficiency. When conditional instruction execution is used in clever code, unlike even the simplest 'skip next instruction', the program flow remains linear, preventing any pipeline stall or flushing of pipelined instructions that will not be executed when a condition is met and a skip occurs. From the standpoint of a simple RISC, it really does not matter how far the PC moves: as soon as it does anything but simply increment, you get to do 'garbage collection' on your pipeline.
Also, one should remember that there are various (usually compiler-implemented) tricks that can be used to prevent a pipeline stall or flush. For instance, a classic example would be a loop that breaks on a condition being met. Basically, it ends with a 'jump back to the beginning of the loop if the break condition is not met', and at that point you have N more instructions waiting in the pipeline, which would normally have to be thrown away. Certainly, if handling the jump's effect on the pipeline could be avoided, it would simplify instruction fetch-decode a lot. So, assume it is not handled in hardware at all, and that N=3. The proper way of encoding this loop would be the following:
;Loop start
Instruction 1
Instruction 2
Instruction 3
Loop re-entry:
<remainder of the loop>
Jump to loop re-entry if break condition not met
Instruction 1 if break condition not met
Instruction 2 if break condition not met
Instruction 3 if break condition not met
;End of loop

What happens here?
Since N=3, this means that 3 instructions after the jump instruction are already pre-fetched in the pipeline at the point where the jump is executed. In order to execute the loop correctly, the first 3 instructions of the loop are duplicated after the jump instruction, so that when the jump executes they are already pre-fetched and can be executed just as they would have been on a traditional CPU had the jump gone back to the loop start. Instead, it goes back to a point 3 instructions after the start - exactly the same 3 that are already in the pipeline - so nothing needs to be done to the pipeline despite the jump.
However, these 3 instructions are conditional, and the condition is the same as the condition to jump. This is done because, once the loop breaks, the 3 instructions following the jump are again pre-fetched and will execute, even though they shouldn't, as they are really part of the loop code. However, since the condition for the loop break has been met, these instructions effectively become NOPs, and although they use up some clocks, again nothing needs to be done to the pipeline.
The premise is, of course, that the loop goes round many times while it breaks only once, so this additional 3-clock overhead falls outside the loop and consumes negligible time compared to the loop itself.
Now, if it were me designing the CPU, and I found that I could just about squeeze condition codes into every instruction (perhaps losing a bit or two that could maybe have got me a few more seldom-used instructions into a single word), I would find it a great trade-off if it meant my pipeline stays simple and needs no hardware to deal with stalls or flushes.
As you can see, the decode unit is not required to run any faster than the rest of the CPU; in fact, things stay nice and in sync. At least until an interrupt comes along :)

Another thing to keep in mind here: some of these actually rather simple techniques are the IP of certain CPU manufacturers, and in some cases chip designers had to work around this. One interesting piece of trickery like that is used by the SPARC processor (and others, I'm told) in order to keep instructions executing at a rate of 1 clock per instruction and keep the pipeline 'flat'. 'Load 32-bit constant into register' is the instruction of choice here.
Normally this would be an op-code followed by a 32-bit constant, which is of course a 'special case', considering all other instructions consist of a single instruction word with all the data contained within it. Well, some clever guy at Sun figured to abolish this instruction altogether and instead have a 'load high word' instruction, because the instruction code and a word-sized constant fit together into a single 32-bit instruction word.
So LOAD.L Rn, #$12345678 becomes 2 instructions: LOAD.H Rn, #$1234 followed by LOAD.L Rn, #$5678.
The nice thing is that both instructions take one 32-bit instruction word and execute in 1 clock: no special cases, and no need to meddle with the pipeline.
RISC code is quite abundant with things like this, which would not be immediately clear to a regular assembler user; neither would a 'feature' intended to implement them be immediately obvious, until one looks at some real code. In fact, where the manufacturer provides a means to program the CPU in assembler, there are pseudo-instructions or macros that convert to the proper instructions and hide the trickery even at the assembly level.
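In C terms, such a macro simply splits the constant into two halves that each fit into an instruction word. The mnemonics in the comments follow my illustration above rather than real SPARC syntax, so treat this only as a worked example of the idea:

#include <stdint.h>

uint32_t build_constant(void)
{
    uint32_t r = 0x1234u << 16;   /* 'LOAD.H Rn,#$1234' : set the high half          */
    r |= 0x5678u;                 /* 'LOAD.L Rn,#$5678' : merge in the low half      */
    return r;                     /* r == 0x12345678, built in two single-word steps */
}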
One which we actually found very useful, is saturation arithmetic, normally a rather bothersome thing for standard CPUs.
Yes, but:
1. You can always find that one instruction that makes your life easier in that particular case.
True, but that's exactly the point I made about the 'feature' actually being a side-effect rather than something intended for clever programming, even though it can be used for that sort of thing. Keep in mind we are really talking 1st-gen RISC here, and those were designed to be simple hardware-wise, not clever to program.
2. Saturation logic is a discipline for FPGAs, or at least vector computing. If you want to intensively massage some bigger fields, you usually get yourself a chip with a decent vector unit.
3. Your example uses conditionally executing one instruction in the loop. This offers very limited possibilities, and yet eats precious bits in every instruction word...
See above - besides, keep in mind the history of the ARM. Its first implementation was intended to do pretty much everything in a computer, rather reminiscent of Sir Clive's designs. No fancy new technologies then, but you still needed to manipulate bits and do sort-of DSP for graphics and sound, and IIRC it did a rather good job, too.
1. With MIPS you just conditionally skip (jump over) the next instruction, with a similar result.
2. Other designs can detect a branch to either the same cache line or the one prefetched after it, with a similar result.
Again, see above: the idea behind this is way older than branch prediction and is in fact geared at simplifying the operation of the pipeline by avoiding 'traditional' jump implementations, in order to keep the pipeline full. In essence the idea is to avoid the need to clean up the pipeline - hence no need for any hardware to do it - by avoiding actual program-flow changes whenever possible.
Yes and no - depending on how you go around the whole thing. If you are counting on various fast memory units (which includes caches) inside the chip itself, in order to speed-up critical code execution, then you would be fresh out of luck with this approach as you get your RAM pre-organized by the manufacturer of the chip.
Yes, but you could modify your approach for that. The CPU can usually signal the nature of a fetch. So, if it is an instruction fetch and you sense that in a 16-byte group, say, the first two long words are native code and the last two are not, you could simply insert TRAP instructions there and fill the appropriate table with the original opcodes of the supplanted instructions.
If the CPU then executes only legitimate code, everything will go "as usual". If it tries to execute trapped code, it will trap, read from the table what the trapped opcode was, and then decide what to do.
True, but in fact you are to an extent duplicating the trap mechanism, and not exactly with trivial hardware. This sort of thing would be a bonus if you were building a machine primarily intended to emulate other machines or CPUs. With 6.5 as the approach, the idea is to eventually migrate as much as possible to native code, hence your new programs (or routines, modules etc.) would in the end all start with a TRAP (to native code) and not leave that mode until they end - and here we could just as well be talking about a whole application.
On the other hand, basing the OS API on TRAPs (or equivalent) provides a well-defined line where the application (user) crosses into OS (supervisor/privileged) mode. This line is absolutely crucial to QDOS/SMSQ, and is in fact the way the most basic and IMHO most important idea of that OS is implemented - using the lowest possible hardware-based mechanism in the CPU itself to provide atomicity, time slicing and real-time functions. This is in fact how QDOS/SMSQ avoids the use of semaphores and other traditional arbitration schemes, effectively putting them into the domain of hardware, where they cannot be unduly influenced by software. From this standpoint, the overhead of the TRAPs is not only unavoidable but necessary.
One thing this mechanism offers that is not very obvious is based on the fact that a TRAP implies a machine state change. Mostly we look at it as a change from user to supervisor mode, but if one looks a bit deeper, the idea is based on changing from one virtual CPU to another. This, not so obviously, implies that the concept can be expanded so that the changeover does not have to be between virtual CPUs but can be between real ones. And further, why assume the CPUs should be the same, since it is really only the API definition that tells us how parameters and results (i.e. information) are passed from one state to the other? This 'portal' of sorts is the basic foundation of QDOS/SMSQ, and as such has been there from the very beginning, which makes this OS much more easily adaptable to multiprocessing, be it virtual or real.
Funnily enough, the absence of an MMU also makes things easier in this respect - memory management in multiprocessor systems is no small feat, and doing it across several different CPUs, or even different virtual ones (such as a 'native' and an 'emulated' CPU), is harder still.
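To sketch that 'portal' in concrete terms - the register conventions are QDOS's (TRAP #1 calls are selected by D0, with parameters and results passed in registers), while the table and helper names are purely hypothetical - a dispatcher on the OS side of the boundary could look like this, and the calling program never knows whether the work was done by emulated 68k code or by native code:

#include <stdint.h>

typedef struct { uint32_t d[8], a[8]; } Regs;   /* the emulated 68k register file    */

typedef void (*OsCall)(Regs *r);                /* a native implementation of a call */
extern OsCall native_table[256];                /* indexed by D0; NULL = not ported  */
extern void   emulate_rom_trap(Regs *r, int trapno);

void do_trap(Regs *r, int trapno)
{
    uint8_t key = (uint8_t)r->d[0];             /* QDOS selects the call via D0      */
    if (trapno == 1 && native_table[key])
        native_table[key](r);                   /* cross the portal into native code */
    else
        emulate_rom_trap(r, trapno);            /* fall back to the emulated ROM     */
    /* Results come back in the same registers either way, so the caller
       cannot tell which side of the 'portal' actually did the work. */
}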
It has been my experience in designing several platforms since early 2003, that development systems are consistently getting worse in quality, although they try to peddle quantity (of options... which mostly do not work!) as a selling point.
Also, because synthesis tools are largely re-used, there is less and less user control over how logic is actually compiled and fitted into a FPGA, and sometimes the actual (not presented by the marketing department!) performance of a compiler is abysmal - for instance, unable to actually properly use the very extra features that may make your chosen FPGA best fit the task at hand, namely implementing a CPU. Unfortunately, there is really little one can do about this.
Just out of curiosity - wouldn't some open-source IDE solve this problem? I know FPGA manufacturers keep the bitstream structure a secret, but the bitstream generator is usually available as a separate executable inside their tools. All I would need for simple projects is some graphical tool with the chip's layout, so I could preset LUTs etc.

IOW, I could use FPGA schematics with predefined native elements (without the switch matrix) and then place and route them inside the FPGA much as one would on a PCB. With all elements predefined, I could then just employ the closed-source bitstream generator and get myself a final bitstream...
Manufacturers also, unfortunately, keep the routing-resource details and the configuration bit mapping a secret. So... no go, at least as far as I have been able to see. Simply put, if you want it fast, you need to pay for the next bigger and faster FPGA. Then, when you finish development, hope and pray it compiles for a smaller one.
Older CPLDs normally had a much simpler structure, so you could 'massage' things in there with a few properly placed directives and locks. But for truly complex logic (and not just lots of simple copy-paste stuff), FPGAs are IT, and whatever tools you can get with the money at your disposal, you are stuck with.


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Native 68k vs Coldfire vs FPGA vs recompilation ?

Post by Nasta »

Brane2 wrote:OK.
This is getting tedious to read, so to summarize for further reckoning:
WRT QL TNG:
1. Emulation of the original QL seems nonsensical. Even if needed, we already have the PC for this.
Not entirely - it depends on what you want to achieve. Looking at the QL as a 'personal computer', it is largely nonsensical, in the sense that it would be yet another emulator, and there are already several. There are some exceptions to this, like writing an emulator for, say, a tablet PC.
However, if you look at the QL from the standpoint of an OS (forgetting for the moment its connection with the 68k) in some sort of embedded environment, then some sort of emulation, like what I have discussed above, does have its merit, until the OS is largely translated to a different CPU.
In this sort of scenario we're not actually looking for gigahertz speeds, but for ease of programming. And when I say ease of programming, I mean the kind of programming 'from scratch' where existing applications, with the exception of development tools, are not important. After all, there are lots of such systems around running an RTOS that is really less RT than old QDOS, on 8051 clones which, despite the 24MHz clock, are really abysmally slow (even with the new 1 clock/cycle cores) and cumbersome to program.
2. The ColdFire route is effectively closed, no matter how similar to the 68K v5 may be. BTW, has anyone besides Yeti actually seen a v5 ColdFire?
It seems unlikely to appear as a non-application-specific chip. It was a struggle to get the previous versions in that form too, and then only after they were already running in dozens and dozens of application-specific platforms.
3. recompilation should be possible, but is problematic.
This is really the main point of the discussion. Recompilation is a partial form of emulation; emulation is a partial form of recompilation. A decent re-compiler actually has to be capable of complete emulation, but in an untraditional form. The big question is whether there is a 'middle way': an emulator based on JIT compilation techniques which would be able to produce a 'snapshot' of the emulated program. In theory, at some point in the run time of the program, this snapshot is equal to a recompilation.
4. HW assisted emulation of the original ISA seems possible, but has its problems, especially on new CPU/MCU chips with cache, smart prefetcher etc.
Yes, however here we have a different trade-off, namely the same one originally exploited by RISC architectures: implementing a simplified RISC-like ISA plus a means to emulate the remaining instructions, in order to simplify the logic and thus gain speed, results in a simpler hardware implementation that would on average produce decent performance. This sort of thing points to the use of LUT-based or embedded RAM/Flash-based FPGAs, trading off relatively expensive routing resources for relatively cheap memory resources. In the end you are looking at simpler but faster FPGAs, which might give a means of producing cheap midrange-performance implementations of the ISA, simply scalable to higher cost and speed by using a faster version of the same FPGA, or a larger one (larger in case we want other hardware integrated with the CPU, which is almost a given these days). Basic idea: trade smart for fast, just like the original RISCs.

...and we didn't even touch on DSP chips as a possible means of emulation.
DSPs actually present an interesting pseudo-alternative to regular CPUs. I say 'pseudo' because most currently popular DSPs are not that different from most currently popular CPUs, and this convergence started with the TMS320 once the phrase 'simple software model, easily programmable in C' was first uttered in conjunction with a DSP of any kind. Still, DSPs have some peculiarities which enable fast look-up tables, table branches, interesting address manipulations etc., which can be useful for classic CPU emulation. And most of today's DSPs can be programmed in C or assembler, whereas in the past things like the Motorola 56k DSPs were indeed programmable in assembler, though the programs were usually wider than they were long. The 56k is perhaps the first large-scale commercial implementation of what has in the meantime become known as a VLIW processor, for 'Very Long Instruction Word'. Most RISC processors also fall into this category, but the number and type of instructions are simplified so that the 'Very Long Instruction Word' ended up fitting comfortably into something like 32 bits :)

Re ARM - yes, conditional execution is not fundamental, just a side-effect of what they tried to do with the pipeline implementation, but it is somewhat usable on its own.

