Symmetric multiprocessing...

Nagging hardware related question? Post here!
Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Symmetric multiprocessing...

Post by Nasta »

If there was ever a Pandora's box of a topic, here it is.
That being said, it's REALLY important to know what symmetric multiprocessing is, and how many different kinds there are, to give the topic a proper treatment.


1024MAK
Super Gold Card
Posts: 592
Joined: Sun Dec 11, 2011 1:16 am
Location: Looking forward to summer in Somerset, UK...

Re: Symmetric multiprocessing...

Post by 1024MAK »

Going in the other direction... See the other thread :?

Mark


:!: Standby alert :!:
“There are four lights!”
Step up to red alert. Sir, are you absolutely sure? It does mean changing the bulb :!:
Looking forward to summer in Somerset later in the year :)

QL, Falcon, Atari 520STFM, Atari 1040STE, more PCs than I care to count and an assortment of 8 bit micros (Sinclair and Acorn) (nearly forgot the Psions)
Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Symmetric multiprocessing...

Post by Dave »

The only thing that can probably be counted on, sadly, is that no matter what the hardware is, it would likely get very little OS and software support, very late. This is true for any new hardware. So it becomes a critical design goal to make the hardware as easy to use and access as possible. The main difficulty there is that using a second CPU will, at a minimum, require knowledge of assembly.

Unless someone writes a program to put the S*BASIC part of QDOSMSQ onto a separate CPU, and an S*BASIC extension to allow job handling by dispatching jobs to that CPU.

Another aspect is that each CPU "chain" of supporting hardware has a cost of about $20. People may be unwilling to invest in a card with 8 CPUs on it when they may only need 2 or 3. So it seems there would be a backplane to take a "CPU carrier", and the carrier could be populated separately with up to 8 individual cards. This would also allow a person to invest in, say, 2-3 cards with a 68000 and 8- or 16-bit private RAM, and a few 68030 cards with 32-bit private RAM. It also opens up the option of CPUs with math co-processors: 68882s are quite cheap right now.


Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: Symmetric multiprocessing...

Post by Peter »

Dave wrote:What if a "QL" had a main processor, then a gang of smaller processors, each allocated a few kilobytes of exclusive dual port RAM? Programs on the main processor could load programs and data into the workspace of each CPU, which could then run ongoing programs.
I wonder why nobody mentioned the almost obvious here: This is almost ideal for FPGA!
While recent FPGA technology progress did not provide much more CPU frequency (this also has to do with the 68K architecture), it did provide more block RAM and more logic space. Today, I could fit about 4 CPU cores into a chip costing roughly the same as the one I used for the Q68.

In a reasonable price range, FPGA block RAM would still not be sufficient to run 4 complete QL operating systems, so local cache and a shared external memory are mandatory. My plan is to first develop cache for the Q68. This has a double advantage: updates for the existing systems without new hardware, and a starting point for future QL architectures.

Whether I will finally go for multiprocessing, or more single-core complexity, is an open question. I will decide this based on the overall QL situation when the time comes. Even in an ideal case, I'd expect a maximum of one or two software developers supporting me - all of us being hobbyists. These limitations have more impact on finding a realistic way forward than what is technically desirable.


Derek_Stewart
Font of All Knowledge
Posts: 3932
Joined: Mon Dec 20, 2010 11:40 am
Location: Sunny Runcorn, Cheshire, UK

Re: Symmetric multiprocessing...

Post by Derek_Stewart »

Dave wrote:The only thing that can probably be counted on, sadly, is that no matter what the hardware is, it would likely get very little OS and software support, very late. This is true for any new hardware. So it becomes a critical design goal to make the hardware as easy to use and access as possible. The main difficulty there is that using a second CPU will, at a minimum, require knowledge of assembly.

Unless someone writes a program to put the S*BASIC part of QDOSMSQ onto a separate CPU, and an S*BASIC extension to allow job handling by dispatching jobs to that CPU.

Another aspect is that each CPU "chain" of supporting hardware has a cost of about $20. People may be unwilling to invest in a card with 8 CPUs on it when they may only need 2 or 3. So it seems there would be a backplane to take a "CPU carrier", and the carrier could be populated separately with up to 8 individual cards. This would also allow a person to invest in, say, 2-3 cards with a 68000 and 8- or 16-bit private RAM, and a few 68030 cards with 32-bit private RAM. It also opens up the option of CPUs with math co-processors: 68882s are quite cheap right now.
Hi,

If you are proposing to use the DP RAM as a buffer to the other CPU, then all that would be needed on the QL would be the foreign CPU's set of opcode instructions.

For example, the 6502 has a small set of 8-bit opcodes, for which I was writing a macro translation table in QMAC to take 6502 source code and reassemble it as 68000. Still a project in progress.

Maybe instead of using an assembler to convert the code to M68K, a translation system like the QXL-to-8086 one could be used to send the correct binary patterns to allow the QL to program the other CPU. This may mean a second monitor connected to the computer/games console to display the output.
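To illustrate the translation table idea, here is a sketch in C rather than QMAC macros - the opcodes shown and their 68000 "equivalents" are simplified examples, not a verified mapping:

Code:

    /* Illustrative only: the idea of a 6502 -> 68000 translation table.
       Register assignments (A -> d0, X -> d1) and the 68000 equivalents
       are assumptions for the example. */
    #include <stdio.h>

    struct map { unsigned char op6502; const char *mnem; const char *m68k; };

    static const struct map tbl[] = {
        { 0xA9, "LDA #imm", "move.b #imm,d0" },  /* load accumulator */
        { 0xAA, "TAX",      "move.b d0,d1"   },  /* transfer A to X  */
        { 0xE8, "INX",      "addq.b #1,d1"   },  /* increment X      */
        { 0xEA, "NOP",      "nop"            },
    };

    int main(void) {
        unsigned char code[] = { 0xA9, 0xAA, 0xE8, 0xEA };  /* 6502 stream */
        for (size_t i = 0; i < sizeof code; i++)
            for (size_t j = 0; j < sizeof tbl / sizeof tbl[0]; j++)
                if (tbl[j].op6502 == code[i])
                    printf("%02X %-9s -> %s\n", code[i], tbl[j].mnem, tbl[j].m68k);
        return 0;
    }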


Regards,

Derek
Pr0f
QL Wafer Drive
Posts: 1298
Joined: Thu Oct 12, 2017 9:54 am

Re: Symmetric multiprocessing...

Post by Pr0f »

I think if you are going down the route of symmetric multiprocessing, you are talking about 68K-compatible cores.

Something along the lines of - I've got a job to run...

...How many processors are registered with me ?

… There are 8 of us on line - cpu 5 is the next one to get a new job

--- Dispatch job on cpu 5


You would have to write some kind of code to handle the registration and queuing of processors into a pool.

You would have to rewrite the job handling traps to allow jobs to be dispatched to one of several processors, instead of just one.

Some other traps may need to be recoded to work with either local or common resources (such as memory and other resource allocations).
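A minimal sketch in C of that registration and round-robin dispatch (all names are invented for illustration; in a real system this would live inside the rewritten job handling traps):

Code:

    /* CPU pool: processors register themselves, jobs are handed out
       round-robin to the next online CPU. */
    #include <stdio.h>

    #define MAX_CPUS 8

    struct cpu_pool {
        int online[MAX_CPUS];   /* 1 = CPU registered and available */
        int count;              /* number of registered CPUs        */
        int next;               /* round-robin cursor               */
    };

    /* A CPU announces itself to the pool at boot. */
    void cpu_register(struct cpu_pool *p, int id) {
        if (id >= 0 && id < MAX_CPUS && !p->online[id]) {
            p->online[id] = 1;
            p->count++;
        }
    }

    /* Pick the next online CPU for a new job. Returns -1 if none. */
    int cpu_dispatch(struct cpu_pool *p) {
        if (p->count == 0) return -1;
        for (int i = 0; i < MAX_CPUS; i++) {
            int id = (p->next + i) % MAX_CPUS;
            if (p->online[id]) {
                p->next = (id + 1) % MAX_CPUS;
                return id;
            }
        }
        return -1;
    }

    int main(void) {
        struct cpu_pool pool = { {0}, 0, 0 };
        for (int id = 0; id < 8; id++) cpu_register(&pool, id);
        for (int job = 0; job < 4; job++)
            printf("Dispatch job %d on cpu %d\n", job, cpu_dispatch(&pool));
        return 0;
    }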


Pr0f
QL Wafer Drive
Posts: 1298
Joined: Thu Oct 12, 2017 9:54 am

Re: Symmetric multiprocessing...

Post by Pr0f »

You could also take the approach that each CPU would be its own QL - effectively virtualising the QL into a logical entity with a real CPU and memory, and only having the I/O in common.

A new device could act as an inter-processor gateway.


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Symmetric multiprocessing...

Post by Dave »

Pr0f wrote:You could also take the approach that each CPU would be its own QL - effectively virtualising the QL into a logical entity with a real CPU and memory, and only having the I/O in common.

A new device could act as an inter-processor gateway.
The way I am picturing this, the DPRAM is mapped in at $0000, and a loader program is given to the peer 68000. This is just a small machine code loader. Blocks are passed to the program through the top 1K of the DPRAM, with info about where each block is to be relocated stored a few bytes below that. This could be program-specific code, or it could be all of QDOS/Minerva.

Once complete, the last step is to remap the DPRAM to an arbitrary location >$7FFFF, copy the vectors into the SRAM now mapped at $0, and reset the CPU so it starts the loaded OS or function.
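As a host-side sketch of that hand-off in C: only the "top 1K carries blocks, relocation info a few bytes below" layout comes from the description above - the offsets, flag values and struct layout are hypothetical:

Code:

    /* Host side of the DPRAM block-transfer protocol (2K DPRAM assumed,
       as on the experimentation board). Offsets and flags are guesses. */
    #include <stdint.h>

    #define DPRAM_SIZE 2048
    #define BLOCK_OFF  1024              /* top 1K carries the payload */
    #define INFO_OFF   (BLOCK_OFF - 8)   /* relocation info just below */

    struct block_info {
        uint32_t dest;                   /* where the loader relocates */
        uint16_t len;                    /* bytes valid in this block  */
        uint16_t flag;                   /* 0 = empty, 1 = full        */
    };

    /* Hand one block to the peer's loader; wait until it has drained
       the previous one. */
    void send_block(volatile uint8_t *dpram, uint32_t dest,
                    const uint8_t *src, uint16_t len) {
        volatile struct block_info *info =
            (volatile struct block_info *)(dpram + INFO_OFF);
        while (info->flag == 1)
            ;                            /* loader still copying out   */
        for (uint16_t i = 0; i < len; i++)
            dpram[BLOCK_OFF + i] = src[i];
        info->dest = dest;
        info->len  = len;
        info->flag = 1;                  /* mark full for the loader   */
    }

    int main(void) {
        static uint8_t dpram[DPRAM_SIZE];        /* simulated DPRAM    */
        const uint8_t payload[16] = "hello loader";
        send_block(dpram, 0x00028000UL, payload, sizeof payload);
        return 0;
    }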

I am using a hybrid ARM Cortex M0/CPLD device as the arbiter for my experimentation, as this will be cleaner and more flexible than using multiple GALs, but a little easier than using CPLDs or FPGAs. Unless someone wants to show me a 5V tolerant CPLD that has a schematic capture system and C-based IDE that is also free. My experimentation board will have just two 68SEC000s in 8-bit mode running at 40MHz. 2K DP SRAM and 512K private SRAM per CPU.

This is just a fun side project.


Pr0f
QL Wafer Drive
Posts: 1298
Joined: Thu Oct 12, 2017 9:54 am

Re: Symmetric multiprocessing...

Post by Pr0f »

I thought there were some heavy license costs with those embedded ARM-core FPGAs?

I was looking at the BeMicro CV A9 board when it was available, which had an A9 ARM core on board, but I was mainly interested in the large logic element count, ideal for making a 'big' VHDL Propeller chip implementation. The freebie design software supported this board, but it apparently took hours to compile projects, as they really wanted you to pay a couple of K and get an updated version of the design tools. Not so hobbyist-friendly.


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Symmetric multiprocessing...

Post by Nasta »

May I interject with some thoughts before this gets a bit out of hand?

Having a decade+ of experience with various modes of multiprocessing (you would not believe how heavily multi-processed some apparently mundane applications can get once security and redundancy come into play), and even a diploma thesis in that area, let me stress here that Symmetric MultiProcessing (SMP) is a VERY, VERY wide term. Once things get closer to actual implementation, solutions can vary to the point of appearing to be completely different things, all with the same name. The BIG question in setting up an SMP machine is: what sort of problem will this machine be solving?

I'm going to skip a HUGE discussion here (which is actually very interesting for people interested in multiprocessing, so if anyone wants it, ask me) and just state that the holy grail of a 'universal' parallel processing computer has not yet been found. It is also unlikely to be found in a real sense as long as programming models are based on sequential algorithms, also known as 'control flow computing'. I will only mention that there are other ways, most notably data flow computing, but to anyone used to 'mainstream' computing this concept is very exotic, though some of its concepts 'leaked' into the mainstream (notably object oriented languages). It was experimented with as far back as the 70s, but Moore's law pushed it into the dustbin of history regardless of its very promising results. So, in the text that follows, I will NOT be talking about 'exotic' computing models.

Before I go further, let me quickly toss out a very simple definition of SMP, by which it's mostly (but not completely!) distinguishable from ASMP, asymmetric multiprocessing, as discussed in the other topic. SMP is 'symmetric' because all of its computing units (which do not actually have to be processors, nor even processor cores) are in principle capable of executing the same set of operations as any other - at least to a large extent they have no specialization, unlike in the asymmetric case, where the processors each fulfill one aspect of the total intent and use of the machine.
Astute readers will notice that nothing has been said about how a given problem is then distributed among the available processing units - and this is the core problem of any parallel processing system. In a way, asymmetric processing is one of the 'low hanging fruit' solutions to MP in general: at the very least a general computer has IO and processing tasks to do, so that is a natural separation. It should be noted that this separation goes along another important delimiting line, that between data-processing-driven and real-world-event-driven task scheduling. Data processing in general is scheduled as the data to be processed is made available by IO, but at that point it's already abstracted into channels, pipes, buffers etc. To get there, the IO system had to manage the protocols and peculiarities of the actual hardware interfaces, which have real world events and consequences to deal with, and these determine what is processed when. In a classic system, this separation resulted in the introduction of the concept of interrupts. As it turns out, interrupts become a whole field of contention with regular non-exotic MP (data flow computing, for instance, has no notion of interrupts, nor indeed of memory, either data or program, in the sense standard computers have) - but that's a story for later, should we get there. The bottom line here is that this division by type of work is one obvious exploit for ASMP.

With SMP, things get more complicated. Once direct real world interaction is off the table and we are faced with raw data, other types of low hanging fruit emerge, such as large data sets that all need the same processing done on them, but with little dependency between data in one place in the set and another. Think of compressing and encoding video or audio streams, or generating fractals - each processing element gets one part of the data to work with and largely does not need to peek into the other parts to do its job. This sort of thing lends itself to an architecture where one IO processor communicates with multiple nodes of an MP machine and gives them each a largely independent task to work with, so the result is a sort of 'hub and spoke' configuration with the IO in the middle and a communication channel of some sort between it and each node.
The next level up requires each node to access parts or the whole of the common data set, but rarely compared to the time it needs to process the data. Here we have things like finite element simulations, 3D rendering, and similar. While the previous architecture can manage this, a node has to ask the central hub for any data it does not have locally, which can present a large overhead. In this case an architecture where the processing nodes can communicate directly between themselves is a better solution, as each of them frequently needs data that is local to its particular part of the whole task, but only occasionally requests data from, and serves requests by, other nodes. Indeed, in this model any one of the nodes (or indeed several) can also devote a part of its processing power to fulfilling the role of the IO processor.

But what about the regular fruit?
In the above examples, the architecture of the machine and its programming are constructed knowing the problem in advance and exploiting its inherent parallelism. But what about the things we do with computers every day - apparently a single task or a small number of tasks, which may or may not be organized as multiple tasks 'behind the curtain', but without that being known in advance? The architecture of the machine has to be fairly universal - a sort of compromise. One could say, of course, that such an architecture is a jack of all trades but master of none, and - at least as things currently stand - they would be right. This brings us (and be aware I have skipped a HUGE amount of intermediate stuff) to 'universal' MP.

In a way, universal MP can be both SMP and ASMP, but since SMP has all processing units largely the same, it covers both equal and unequal use of them. Given a universal SMP machine, it's easy to devote some nodes to specialist tasks while splitting other tasks among several nodes, as the problem requires - or as use of the machine shows that certain things become more efficient when some nodes are devoted to specific tasks, in general system housekeeping and IO. In this sense, ASMP is a subset of SMP.
Here we introduce another important concept, and that is 'granularity'. In effect, this tells us at what level a task is split into several smaller tasks that are then given to each processing unit. It could be at the level of a single machine instruction, and it goes up all the way to such nebulously defined things as 'applications'. Again, to an extent, using the extremes is yet another level of 'low hanging fruit' for such machines.

On the level of an instruction, we have CPUs with multiple execution units that can execute several physically successive instructions at once if their data is not interdependent (meaning one instruction does not require the result of another), which is often referred to as 'superscalar' processing. Note that it is only possible to decide which instructions can be executed at the same time within a small window of neighboring instructions, which are written sequentially but can actually be executed in any order, and even in parallel - the latter being the 'exploit' this system uses. The problem here is that once the number of units gets over about 3, there are no instructions to feed them with unless the code is incredibly inefficient OR specifically written for that scenario. Since the code executed is often legacy code, not written for that scenario, the efficiency of such a system at parallelizing things falls off dramatically with the number of execution units. In fact, when such CPUs are built, existing code is carefully analyzed in order to find the most optimal configuration.

One interesting historic example is AMD K6-2/3 vs. Intel Pentium, and this battle is still raging today, though at a higher granulation level. When the aforementioned were introduced, many more integer instructions were executed than floating point ones, hence AMD had multiple integer units but a single FP unit. Intel then introduced and pushed its MMX system, which used the floating point unit to do DSP-type instructions that are often heavily parallelizable, and introduced multiple FP units in its later Pentiums, gaining a performance advantage over AMD. This sort of trade-off is still being waged today, and some rather underhanded methods are used, such as compilers preferring ways of code generation that favor one architecture over another.

Here I should mention one interesting innovation from Intel, which they dubbed hyper-threading (although it's actually a misnomer of sorts): they realized that they could add another instruction decode unit and register file to use the otherwise idle execution units, performing like a second CPU, which could then potentially re-use unused resources and provide extra performance similar to having a whole other CPU - IF the OS could schedule tasks to it.
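A toy C example of the data dependence limit just described:

Code:

    /* In the first pair, the second statement needs the result of the
       first, so a superscalar core must execute them in order. In the
       second pair, the statements are independent and can issue in the
       same cycle on two execution units. */
    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, c = 3, d = 4;

        int x = a + b;      /* dependent chain: y must wait for x */
        int y = x * c;

        int p = a + b;      /* independent: p and q can issue     */
        int q = c + d;      /* in parallel                        */

        printf("%d %d %d %d\n", x, y, p, q);
        return 0;
    }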

This brings us to the other end of the spectrum - granularity at the top level, that of whole applications. This has been well known all the way back to the 70s, and experimented with even before, in multiuser systems. These are by definition also multi-tasking, but in the beginning it was a case of one user - one task. The important thing to notice here is the underlying notion that the tasks are almost completely independent of each other. The idea was: different users have different problems, which they solve by performing computing on their data sets to generate their results, independently of other users.
Of course, these systems soon became the multitasking systems we know today - where what the machine does, for one or for many users, is divided into multiple tasks that are largely independent and, where they are not, communicate through well defined abstracted data structures, either through IO or within memory.
In this model the notion of one or many users becomes blurred - the abstraction does not even care about users, only about multiple tasks.
In the beginning, the tasks were executed in a time sliced manner on a single CPU. What is crucial here is that each task is presented with a 'virtualization' of the CPU, as if it had one all to itself. Inside the task, unless it is requested explicitly, there is no notion of time by which the task could 'know' it is not being executed 'in one go', but rather in a series of bursts separated by the execution of other tasks. The underlying operating system takes care of scheduling, as well as the 'housekeeping' that has to do with sharing (requesting, securing, using and then relinquishing) the resources of the machine.
Since each task in essence thinks it's running on its own CPU, there is in principle no problem in actually giving it its own CPU to run on. HOWEVER - there is also an underlying assumption that tasks can request resources from a common shared pool, as it's not determined in advance which one will use what - remember, this is supposed to be a more or less universal system. This means that, at least notionally, each CPU can get any share of any resource as long as the total is less than or equal to what the machine actually has built in. This implies a HEAVY bias in the actual hardware architecture of the machine towards common memory and IO.
The advent of multitasking within the scope of a single problem introduced the wider spread of the concept of multithreading. In principle there is no great difference between a task and a thread, except that when a task is split into several smaller tasks, and even these are then split into still smaller tasks, the unit of granularity - the thing the system can schedule among several real or virtual processing units - becomes a thread. A thread is a program that can be executed in parallel with other threads or other copies of itself. This is intimately connected to the notion of re-entrant code, where multiple instances of the same task do not use multiple copies of the code, but rather 'several threads' - one for each task - pass through the same code with different (and possibly shared or partially shared) data. Multitasking is VERY close to multithreading; in fact the only real difference is the ability to specify, within programs, parts that can be treated as multiple threads - more on this later.
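A small C illustration of re-entrancy: only the second function can safely have several threads 'passed through' it at once, because all of its state arrives via parameters instead of living in shared static storage:

Code:

    #include <stdio.h>

    int sum_static(int x) {                 /* NOT re-entrant */
        static int total = 0;               /* one shared copy of the state:
                                               two threads here would clash */
        total += x;
        return total;
    }

    int sum_reentrant(int *total, int x) {  /* re-entrant */
        *total += x;                        /* state is owned by the caller */
        return *total;
    }

    int main(void) {
        int thread_a = 0, thread_b = 0;     /* per-thread data areas */
        printf("%d\n", sum_reentrant(&thread_a, 5));   /* 5 */
        printf("%d\n", sum_reentrant(&thread_b, 7));   /* 7, independent */
        printf("%d\n", sum_static(5));                 /* 5, but fragile */
        return 0;
    }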

This brings us to the most prevalent model of SMP in use today, the one you are likely using every day in your PC and mobile phone, each of which has a multi-core CPU. It should be noted that the notion of a CPU as was prevalent up to the 80s of the last century, i.e. as originally defined, is actually a 'core'; I will expand on this a bit later.
The main idea of this sort of SMP is that there is always a number of threads to be executed. Since we do not know in advance what kinds, as this depends on the use(r) of the machine, whatever processing resources we have should be able to handle any thread scheduling combination. Also, since we don't know in advance what each thread will need in terms of memory and IO resources, there is a means of sharing these among the processing units. Because of the attempt to be 'universal', this kind of SMP lends itself to shared memory systems. The sharing is largely transparent to the threads and handled by the OS, just as in multitasking - the only difference is that threads can be distributed in a time sharing manner as before, as well as among multiple processing units. In general both are utilized at the same time, and the situation changes as threads are created, executed, and retired. If there are more threads than processing units, they are time sliced on the processing units themselves - the idea being, of course, that this is more or less optimized. The important part to understand is that the threads do not know this - for them, execution is on a 'dedicated CPU', and they can request memory or IO channels, with the latter also used to communicate with other threads if needed. In general, the fact that common structures or data, or even re-entrant code used by several threads, exist outside the context of a given thread is unknown to it.

If this was not already apparent, let me emphasize that this architecture can emulate all of the others mentioned before - but of course not with the same efficiency, just as the ones before will not be efficient at being 'universal' (and in fact may fall flat on their faces in that regard).
Implemented in the simplest way, this architecture can suffer from many bottlenecks that render it potentially no more efficient than a simpler single-processor multitasking CPU, but in most cases the performance will never degrade below that level - the question is how far above it gets. For starters, since resources are shared, when all processing units attempt to use the same resource, for instance RAM, the total RAM bandwidth has to be shared between them. However, most of the time one processing unit will access the same small portion of RAM and rarely other portions, so the logical extension to this architecture is a processor with cache memory. The negative side of that is that if shared memory is cached and one CPU changes the contents, all copies of it in all caches also need to be changed. Much of this can be managed by the OS, giving the thread code a choice to ask for private or shared memory. Since it's still likely that private memory will be accessed the grand majority of the time, even making shared memory non-cacheable will not significantly impact performance. One step further would be to predict what the typical shared vs private memory allocation sizes would be, given the total amount of memory. This can then be leveraged so that the total available memory is not a strictly independent shared sub-system, but rather distributed among the processing units, with only a smaller part dedicated purely as shared memory. If more is needed than is available for sharing, each processor can still access the others' memory, but at a speed penalty for both.
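A hypothetical sketch of that private-vs-shared choice expressed as an allocation call (the names and flags are invented for illustration; QDOS has no such call):

Code:

    #include <stdlib.h>

    enum mem_kind { MEM_PRIVATE, MEM_SHARED };

    void *os_alloc(enum mem_kind kind, size_t len) {
        /* A real kernel would carve MEM_PRIVATE from the calling node's
           local RAM (fast, cacheable) and MEM_SHARED from the common pool
           (uncached or kept coherent). Here both just come from the heap. */
        (void)kind;
        return malloc(len);
    }

    int main(void) {
        char *scratch = os_alloc(MEM_PRIVATE, 4096);  /* hot, node-local */
        char *mailbox = os_alloc(MEM_SHARED, 256);    /* cross-node data */
        free(mailbox);
        free(scratch);
        return 0;
    }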

It might now be quite obvious that this sort of thing is not too far from the regular multitasking model we have on the QL. The difference is that the OS would have to schedule tasks not only in time but in 'place', distributing them over multiple CPUs. There is a whole discussion regarding load balancing to be had here which I am simply going to skip, and just say that it means the scheduler has to be more clever - but even a simple extension will work.
Here we have to return to two concepts I have mentioned before.

The simpler one in this scenario is the notion of a thread. This actually exists implicitly in QDOS (for the purposes of this post, QDOS = QL OS of any kind), as tasks can start and fork other tasks, but there is no direct way for applications to specify 'threadable' code. A simple example of this would be the following: suppose that an application is given N bytes of data to process. It would normally process them one by one, but given multiple CPUs it could divide the N bytes into parts between them, e.g. if there are 3 CPUs, each would get N/3 of the data. Multithreaded code has OS calls to tell the system about this, i.e. that the following code is re-entrant and can be divided into P copies processing N/P of the data each, where P is the number of available processing threads. It is then up to the system to set this up based on the number of available processing units; the code is written for the general case. Alternatively, the distribution can be left to the code itself, which sets it up internally based on info it gets from the OS about the number of processing units it has managed to reserve. For programs that were not written for this, performance remains single-threaded.
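A sketch of that N/P split, using POSIX threads as a stand-in for the hypothetical QDOS calls:

Code:

    #include <pthread.h>
    #include <stdio.h>

    #define N 30                     /* bytes to process */
    #define P 3                      /* processing units */

    static unsigned char data[N];

    struct slice { int start, len; };

    void *worker(void *arg) {        /* re-entrant: P copies run at once */
        struct slice *s = arg;
        for (int i = s->start; i < s->start + s->len; i++)
            data[i] += 1;            /* the per-byte "processing"        */
        return NULL;
    }

    int main(void) {
        pthread_t tid[P];
        struct slice sl[P];
        int chunk = N / P;

        for (int p = 0; p < P; p++) {            /* give each unit N/P */
            sl[p].start = p * chunk;
            sl[p].len   = (p == P - 1) ? N - p * chunk : chunk;
            pthread_create(&tid[p], NULL, worker, &sl[p]);
        }
        for (int p = 0; p < P; p++)
            pthread_join(tid[p], NULL);

        printf("processed %d bytes on %d threads\n", N, P);
        return 0;
    }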

The more complex one is interaction with the outside world, basically IO and the notion of an interrupt.
Readers who remember a bit of PC history will recall that Intel at one time deliberately prevented some of its CPUs, back in the Pentium II and III era, from working in dual or multiple CPU motherboards, by disabling the on-chip advanced interrupt controller hardware.
This is because in such systems we also have to solve the part of MP that is not symmetric: usually, which CPU(s) handle the OS calls to do with allocation and release of resources, IO (and hence interrupts), managing and scheduling threads (potentially also under interrupt, when the system has to fall back to time slicing), and of course, booting.
In most PC level systems today, there is a 'boot' or default CPU or core, which initially starts the system as if it had only one CPU, does all the memory checking, IO setup, etc. After that is done, it also boots the other cores and enables them as needed. This core or CPU will also handle most of the IO and all OS management calls. Sometimes OS management and IO are separated onto two cores to reduce overhead, but in most cases this part of the system operation is not heavily distributed. This also implies that most of the IO related interrupts are routed to the one IO 'node', while a common interrupt used as a system tick or timing interrupt is routed to all nodes. To an extent this already 'unbalances' the nodes, as some are by default occupied some of the time with IO and housekeeping, and the scheduler has to take that into account. This is usually done by preferring an otherwise unused node for active thread execution.
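A sketch of that boot sequence in C (illustrative only; real systems use vendor-specific release mechanisms):

Code:

    #include <stdatomic.h>

    static atomic_int go[4];            /* one release flag per core */

    void boot_core(void) {
        /* check memory, set up IO, install interrupt routing ...    */
        for (int core = 1; core < 4; core++)
            atomic_store(&go[core], 1); /* release the other cores   */
        /* ... then keep IO and OS housekeeping for itself           */
    }

    void secondary_core(int id) {
        while (!atomic_load(&go[id]))
            ;                           /* spin until released       */
        /* enter the scheduler and start accepting threads           */
    }

    int main(void) {                    /* single-threaded simulation */
        boot_core();
        for (int id = 1; id < 4; id++)
            secondary_core(id);         /* flags already set: no spin */
        return 0;
    }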
There are enough subtleties in this design to warrant a large number of books, and indeed they are out there, so again, be aware that I have skipped a LOT and simplified even more. One thing I have not mentioned is a path less often traveled, where this same system loops back to some of the more exotic models of processing - simple examples being the Propeller processor and the offerings from XMOS... more on that some other time, if there is interest.

Next time, I will again mention my favorite vaporware, the ill-fated GoldFire, which could use two CPUs and in fact implemented a system as described above - a lot of thought went into that, in particular into how the existing OS could one day be extended to take advantage of multiple CPUs.

