Coding partner sought...

Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX


Post by Dave »

Here's a brief outline of some projects I am slowly working on. The main reason for 'slowly' is that I am not a programmer.

Fast, modern serial card:
Using a 16C754 UART or similar - fast serial interface with FIFOs and expansion possibilities.

Looking for someone to write a driver, either loadable or in ROM. All documentation will be provided, and you will receive prototype and finished hardware, which you can keep. Flat fee offered, or a share of sales. The resulting hardware will be open sourced, but the driver may or may not be, depending on your wishes. I would not be critiquing your code -

This is critical to implement some other key tech on the QL in a pleasing way.

Video frame buffer:
Raspberry Pi C++ and/or ARM assembly programmer sought to co-develop a novel frame buffer video card for the QL. All hardware and documentation can be provided, and kept. Based on the CM3 module and a custom board, it will implement dual port RAM and support the basic QL modes with small changes - notably FLASH will be changed to BRITE. If nobody offers to co-operate on this, I can reasonably muddle through by myself but it would go a lot more quickly with help. This is a speculative side project to discover the capabilities of the memory mapped GPIO of the Broadcom BCM2837. To minimise IO, the BCM would increment an external address counter. Options exist to use traditional dual port SRAM with an A and B port, or to use a dual port RAM that has a traditional A port and a SAM B port. A third option is to use a memory switching technique using VERY fast RAM to steal dead cycles from the CPU.
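The external address counter idea can be pictured with a small simulation. This is only a sketch under assumptions: AddressCounter and scan_frame are hypothetical names, and the bytearray stands in for the dual port RAM that real code would read through the memory mapped GPIO.

```python
class AddressCounter:
    """Hypothetical model of an external counter driving the RAM address lines."""

    def __init__(self, size):
        self.size = size
        self.value = 0
        self.strobes = 0          # number of clock pulses issued so far

    def reset(self):
        self.value = 0

    def clock(self):
        # One GPIO strobe advances the address; no address lines are driven.
        self.value = (self.value + 1) % self.size
        self.strobes += 1


def scan_frame(dpram, counter):
    """Copy the whole RAM using one data read and one strobe per byte."""
    counter.reset()
    frame = bytearray(len(dpram))
    for i in range(len(dpram)):
        frame[i] = dpram[counter.value]   # stands in for one GPIO data read
        counter.clock()
    return frame


dpram = bytearray(b"QL screen bytes")     # stand-in for the real screen RAM
counter = AddressCounter(len(dpram))
frame = scan_frame(dpram, counter)
print(frame == dpram, counter.strobes)
```

The point of the design is that each transfer costs one data read plus one strobe line, rather than a full address write per byte.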

Goals: HDMI output, extended resolutions, palette capability, separate image scanning and frame rate by using a full frame buffer.

This project is speculative - a lot of people think it won't work, but I have a CM3 here running an example program that defines an 8 bit port in memory mapped GPIO, and transfers a copy of that byte into internal memory. Current rate is 8.7 million samples per second, even with my very non-optimized code. BCM forced to full clock speed, C++, most other OS tasks removed. No processing of data being done, no efforts to be code or cache efficient.

This implies a scan rate of 265 frames per second is possible, which suggests resolutions of up to 1024 x 768 would be quite responsive (>40 updates/sec), though the BCM would still produce the standard output frame rate even if the frame buffer contents were overwritten as rarely as 10 times a second. The other option is to offer Aurora/Q60 format extended bit depths.

The BCM can, with maybe 5 lines of code change, go from 8-bit transfers to 16- or 32-bit transfers, and they would be the same speed. This has a LOT of potential for dynamic bus sizing CPUs like the '020, '030 and up.
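A rough check of those figures, as a minimal sketch. It assumes the 265 frames per second refers to the standard 32K QL screen read one byte at a time, and that the 1024 x 768 case means 8 bits per pixel read 32 bits at a time; neither assumption is stated outright in the post.

```python
SAMPLES_PER_SEC = 8_700_000      # measured GPIO sample rate quoted above
QL_SCREEN_BYTES = 32 * 1024      # standard QL screen memory


def frames_per_sec(samples_per_sec, bytes_per_sample, frame_bytes):
    """Full-frame scans per second at a given transfer width."""
    return samples_per_sec * bytes_per_sample / frame_bytes


# 8-bit reads of the 32K QL screen.
ql_fps = frames_per_sec(SAMPLES_PER_SEC, 1, QL_SCREEN_BYTES)

# Assumption: 1024 x 768 at one byte per pixel, read as 32-bit words.
big_frame_bytes = 1024 * 768
big_fps = frames_per_sec(SAMPLES_PER_SEC, 4, big_frame_bytes)

print(f"32K QL screen, 8-bit reads:    {ql_fps:.1f} frames/s")
print(f"1024x768 at 8bpp, 32-bit reads: {big_fps:.1f} frames/s")
```

Under those assumptions the numbers come out at roughly 265 and 44 frames per second, which lines up with the 265 and ">40" quoted.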

This project is generally applicable to almost every vintage micro, and can provide a cheaper and more elegant solution than almost anything else out there. The best thing is, the Pi Foundation has promised to continue the CM form factor, so as new devices happen people will simply be able to swap the card and upgrade their video capabilities. On average, Pi performance has doubled every 2 years.

There's some other smaller projects too, but these are the big ones that I really need help with.

The other side of this is, if you have a project and you need someone to help with hardware, hit me up. I have plenty of free time, and lots of enthusiasm.


Derek_Stewart
Font of All Knowledge
Posts: 3932
Joined: Mon Dec 20, 2010 11:40 am
Location: Sunny Runcorn, Cheshire, UK

Re: Coding partner sought...

Post by Derek_Stewart »

Hi Dave,

There may be a solution already in existence for Acorn computers:
RGB to HDMI using a Pi Zero and a small CPLD wrote: Summary: Use a Raspberry Pi Zero as an RGB to HDMI converter that's optimised for the video timings of the Model B/Master/Electron and supports all screen modes (including mode 7), together with automatic calibration. Do this as cheaply as possible, using a small CPLD for level shifting and pixel sampling. The advantage of using HDMI, compared to VGA, is that almost all LCD TVs/Monitors with HDMI inputs should support 50Hz, whereas few VGA monitors do.
The link to the Stardot.org.uk thread is:

https://stardot.org.uk/forums/viewtopic.php?f=3&t=14430


Regards,

Derek
Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Coding partner sought...

Post by Dave »

I looked at that system a few weeks ago. What it does is quite different from what I am proposing.

It is using a CPLD to sample the output video image. What the Pi does is similar, and the code is definitely interesting ;)

I am planning to directly access the address and data bus, either to read the video memory or to have the CM3 emulate video RAM. The CM3 is tens of times faster than the Pi Zero. The Zero can scan through 32K of video RAM about 20 times per second. With the same code, the CM3 manages nearly 300 frames per second.


mk79
QL Wafer Drive
Posts: 1349
Joined: Sun Feb 02, 2014 10:54 am
Location: Esslingen/Germany

Re: Coding partner sought...

Post by mk79 »

Dave wrote:I am planning to directly access the address and data bus to either read the video memory or to have the CM3 emulate video RAM. The CM3 is tens of times faster than the Pi Zero. The Zero can scan through 32K video ram about 20 frames per second. With the same code, the CM3 does nearly 300 frames per second.
I basically did the same thing a year or two ago using a RasPi A+. It was barely fast enough to sample the bus and still do something meaningful with the data. I had problems with the GAL that was doing the address decoding, which had the interesting effect that I could see ALL the memory writes on screen. I ultimately abandoned it when I got my first Tetroid GoldCard, as it then occupied the space I was going to use for the RasPi.

These days I'm thinking more about an FPGA based external solution or about replacing the ZX8301 altogether.


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Coding partner sought...

Post by Dave »

The Raspi A+ is a single core 700MHz device. Of *course* it wasn't fast enough.

The CM3 is a quad core 1.4GHz device. It has memory mapped GPIO. In tests using the non-optimized, single threaded pigpio C++ daemon it happily manages 600,000 8-bit reads per second, which is plenty fast enough for our needs. The GPIO cycle time is 25ns for a read and 6.5ns for a write. It handles GPIO at ten times the rate that the A+ could.

The important thing is that those reads can be 8-, 16- or 32-bit and they take the same time. By simply reading the 8-bit data from DPRAM or RAM with a SAM port and latching it as 32-bit data, I quadruple the transfer rate. Even if I only achieve half of what's possible, that is still 1.2MBytes/sec, or 36 frames per second equivalent throughput.
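The arithmetic behind those two numbers can be sketched as follows, assuming "frames" means full scans of the standard 32K QL screen (the post does not say so explicitly).

```python
READS_PER_SEC = 600_000        # pigpio figure quoted above
BYTES_PER_READ = 4             # four 8-bit bytes latched as one 32-bit read
QL_SCREEN_BYTES = 32 * 1024    # assumption: one frame = the 32K QL screen

peak_bytes_per_sec = READS_PER_SEC * BYTES_PER_READ   # best case
half_bytes_per_sec = peak_bytes_per_sec / 2           # "half of what's possible"
frames_per_sec = half_bytes_per_sec / QL_SCREEN_BYTES

print(f"peak: {peak_bytes_per_sec / 1e6:.1f} MB/s")
print(f"half: {half_bytes_per_sec / 1e6:.1f} MB/s")
print(f"full-screen scans per second at half rate: {frames_per_sec:.1f}")
```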

However, with the system knowing if the RAM was written to and only accessing the written-to addresses, only the updated screen areas would be scanned. Most of the time, on BBQL systems not running a window system, that will just be a cursor blinking.

People who have written bare metal code have the CM3 doing 7,000,000 GPIO reads/writes per second - including post processing of that data.

You may also not have considered the implications of this. If video is being output by something else, the 8301 goes away and so does the need for any DRAM. The QL speeds up because it is no longer being held back, stopped while video memory is accessed. ALL the memory can now be SRAM. What the CM3 notion might lack in finesse or solid performance would be more than made up for by the large CPU performance gains, with the CM3 insulated on the other side of the DPRAM. The other element is that if it doesn't work well for video it will certainly work well enough for IO - which is really the same problem.

This lets us push the bus out to 16- or 32-bits very easily. It also pushes us into the 3.3v domain which increases the device choices a LOT.

So, would you rather have a 32-bit system with rock steady but maybe slow video sometimes, or an 8-bit system with video that just works but uses over half the memory bandwidth?


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Coding partner sought...

Post by Nasta »

I am thoroughly surprised a 700MHz ARM based SOC could not read fast enough from GPIO. Someone is not programming it very well... usually bare GPIO (ports) can run at 25+ MHz speed, which is actually enough since once a byte/word is read, the rest of processing works at full speed. I mean, QXL could work reasonably well with a back end at least one order of magnitude slower, never mind an even slower communications protocol.

Let's have a look at 3 basic cases:

1) Bare QL style 68008, DRAM. The idea is to take the 68k off the bus and 'spoof' it to read the contents of the screen RAM. If the system is a real bare QL with the 8301 replaced by some sort of simpler RAM controller (just to generate proper timing and perhaps refresh, though the latter would automatically happen if reading the DRAM by the PI back end is handled properly), the throughput is limited the same way as with the 8301, except some optimization could be done so not all of the screen has to be read all the time. Even so, without some extra hardware to do DRAM page mode reads, the old 8-bit DRAM is inefficient and throughput is limited to what the 8301 does (otherwise we'd have better specs with the 8301 in the first place). In other words, with helper hardware we get 8301 resolution and QL side speed.
1a) replace DRAM with SRAM: needs more multiplexing and could potentially provide a bit more speed for the PI but again, we are stealing cycles from the 68008.

2) Bare QL style 68008, Dual Port RAM. This being static, interfacing is dead simple, and the PI back end can benefit from being able to read the DPRAM in words, not necessarily just bytes, and at a much faster rate than the 68008 can ever fill it. DPRAM of sufficient size is expensive, but it is a lot faster than the original DRAM and the rest of the system can be re-designed to take advantage of that on the QL side. Typical throughput on the PI side could easily be 4-5 times the original QL maximum theoretical rate (that would come out at about 10Mbytes/s) even for a byte-wide DPRAM. This is far in excess of what the QL side is capable of filling, but we are limited to more or less old QL style resolutions. Expanding the screen resolution and color is VERY expensive in this sort of a system.

3) Any QL style, VRAM. Since VRAM is a sub-species of DRAM, it needs refresh. It does also have a serial port. A special DRAM style cycle is used to transfer a whole row of DRAM data into a parallel-in, serial-out register. This also refreshes the DRAM row that was transferred. During this operation, the CPU must be taken off-bus, and some cycles are stolen. For the rest of the cycles, it behaves like regular DRAM, and would need some cycles stolen for refresh, so the two can sometimes be folded into one cycle steal event. The difference is that the PI can initiate a DRAM->SAM transfer, possibly even of a row set up by the PI side. Once that is done, the PI can read out the row data sequentially without any addressing, and typical data rates for 8-bit VRAM are 33-50Mbytes/s. This is enough for 1024x512 in 256 colors at 50Hz if the PI can do it that fast. It can be further optimized by only transferring rows that have been changed, as well as partial rows if the resolution selected does not use up the complete row (usually one DRAM row equals the size of the SAM, and typically 1, 2 or 4 display lines will fit in there). Even without that optimization, no 68k CPU can fill that RAM any faster so the PI interface is not a bottleneck.
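The VRAM claim in case 3 can be checked with a few lines, as a sketch only. It assumes 256 colors means one byte per pixel and that every frame is read out in full over the serial port, with no changed-row optimization.

```python
def required_bandwidth(width, height, bits_per_pixel, refresh_hz):
    """Bytes per second needed to read every pixel of every frame."""
    return width * height * bits_per_pixel // 8 * refresh_hz


SAM_RATE_LOW = 33_000_000    # lower end of the 33-50Mbytes/s quoted above
SAM_RATE_HIGH = 50_000_000   # upper end

needed = required_bandwidth(1024, 512, 8, 50)
print(f"1024x512, 256 colors, 50Hz needs {needed / 1e6:.1f} MB/s")
print(f"fits at the slow end of the range: {needed <= SAM_RATE_LOW}")
print(f"headroom at the slow end: {SAM_RATE_LOW / needed:.2f}x")
```

Even at the slow end of the quoted range there is headroom, which is consistent with the point that the serial port is not the bottleneck.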

All of these methods need some extra hardware, to emulate the video control register(s) so the PI knows what resolution and color depth is selected.
Method 1 requires reasonably sophisticated helper hardware, so that the PI does not need to bit-bang an emulation of 68k or DRAM signals. Method 2 requires the least sophisticated helping hardware (a counter to read sequentially so the PI does not need to provide (all) address lines), and takes no cycles from the CPU at all. Method 3 requires helper hardware to emulate DRAM/VRAM timing. This is however needed anyway if only the DRAM port is used as regular DRAM so can't be avoided. It is a bit less sophisticated than in case 1, but can also be made more sophisticated so the PI has to work less.
All of these can be augmented by keeping track of what was changed, by dividing the screen RAM into areas convenient for the given method. This is separate hardware.
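In software terms, the change tracking could look something like the sketch below. The tracking itself would be separate hardware, so this only models its behaviour. DirtyTracker and the one-display-line granularity are hypothetical choices; the screen base of $20000, the 32K size and 128 bytes per line are the standard QL layout.

```python
SCREEN_BASE = 0x20000    # start of QL screen RAM
SCREEN_SIZE = 0x8000     # 32K
AREA_BYTES = 128         # hypothetical granularity: one QL display line


class DirtyTracker:
    """Models hardware that flags which screen areas have been written."""

    def __init__(self):
        self.dirty = [False] * (SCREEN_SIZE // AREA_BYTES)

    def note_write(self, address):
        """Called for every bus write; writes outside screen RAM are ignored."""
        offset = address - SCREEN_BASE
        if 0 <= offset < SCREEN_SIZE:
            self.dirty[offset // AREA_BYTES] = True

    def take_dirty(self):
        """Return the areas that need re-reading and clear the flags."""
        areas = [i for i, flag in enumerate(self.dirty) if flag]
        self.dirty = [False] * len(self.dirty)
        return areas


tracker = DirtyTracker()
tracker.note_write(0x20000)    # first byte of line 0
tracker.note_write(0x20085)    # somewhere in line 1
tracker.note_write(0x28000)    # just past screen RAM, ignored
first = tracker.take_dirty()
second = tracker.take_dirty()
print(first, second)
```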

Finally, there is also a very different method, used by the first QL emulator on the ATARI ST computers, which used an actual 8301 with its own RAM.
The method used was to capture write cycles to the screen RAM and then buffer them on the way to the 8301 using a hardware FIFO. For every write cycle the FIFO stored an address and data to be written, and this was then re-coded into an emulation of an 68008 style bus cycle on the 8301 side. Given that a succession of only writes capable of overflowing the FIFO is extremely unlikely, the 'write queue' would always manage to empty itself in time. This also handled conversion of the 16-bit wide 68000 bus in the ST to an 8-bit wide 8301 bus. I am not exactly sure but I have seen the actual board and I think there were two 9-bit wide FIFOs on board, one to store a multiplexed address and the other to store a word of data, with the validity of each byte being stored in the 9-th FIFO bit. If this sort of thing was used to pass data to the PI, a different interface would be implemented with possibly only one FIFO needed, running at higher speed (to store both the high and low address and the data byte to be written). In this case it's up to the PI side to decode what needs to be written where, or use helper hardware for that purpose. Careful attention would have to be paid to the required FIFO size as well as what happens when it fills up, but the basic method of 're-constructing' the screen data inside the PI would be the same as used on the emulator. Again, this method needs to have an emulated display control register for the PI to know how to interpret the data. Like the DPRAM method, it requires no intervention on the QL side except perhaps if the FIFO fills up.
The transfer speed of this method is a complex topic as it depends on the actual amount of data in the FIFO. Obviously it only ever transfers bytes that actually change. It also means that the PI would have to keep a copy of the QL side screen bitmap and use one thread to update it via FIFO reads, and another to transcode it into real PI video memory. It can't actually do the transcoding on the fly, as otherwise changing modes would not work. The main problem is keeping the FIFO from filling up.
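A minimal model of the write queue idea, assuming a single FIFO holding (address, data) pairs and a PI side that applies them to a shadow copy of the screen. WriteFifo and drain are hypothetical names, and how real hardware should react to a full FIFO is left open here just as it is above.

```python
from collections import deque

SCREEN_BASE = 0x20000    # start of QL screen RAM
SCREEN_SIZE = 0x8000     # 32K


class WriteFifo:
    """Models a hardware FIFO that captures screen RAM write cycles."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()
        self.overflows = 0

    def push(self, address, data):
        """QL side: one captured write, already decoded as a screen address."""
        if len(self.entries) >= self.depth:
            self.overflows += 1      # the write is lost in this simple model
            return False
        self.entries.append((address, data))
        return True

    def pop(self):
        """PI side: oldest pending write, or None when the FIFO is empty."""
        return self.entries.popleft() if self.entries else None


def drain(fifo, shadow):
    """Apply all pending writes to the PI's shadow copy of the screen."""
    applied = 0
    entry = fifo.pop()
    while entry is not None:
        address, data = entry
        shadow[address - SCREEN_BASE] = data
        applied += 1
        entry = fifo.pop()
    return applied


fifo = WriteFifo(depth=4)
shadow = bytearray(SCREEN_SIZE)
for i in range(5):                       # five writes into a four-deep FIFO
    fifo.push(SCREEN_BASE + i, 0xA0 + i)
applied = drain(fifo, shadow)
print(applied, fifo.overflows, shadow[:5].hex())
```

A second thread would then transcode the shadow copy into PI video memory, using the emulated display control register to decide how to interpret it.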


mk79
QL Wafer Drive
Posts: 1349
Joined: Sun Feb 02, 2014 10:54 am
Location: Esslingen/Germany

Re: Coding partner sought...

Post by mk79 »

Dave wrote:The Raspi A+ is a single core 700MHz device. Of *course* it wasn't fast enough.
I didn't say it wasn't fast enough, I said it barely managed it, in the sense that it did but it wasn't easy. RasPi bare metal programming is no walk in the park.
Dave wrote:The CM3 is a quad core 1.4GHz device. It has memory mapped GPIO. In tests using the non-optimized, single threaded pigpio C++ daemon it happily manages 600,000 8-bit reads per second, which is plenty fast enough for our needs.
My bus snooping needs at least 4 times this speed and I still managed it with the A+.
Dave wrote:However, with the system knowing if the RAM was written to and only accessing the written-to addresses, only the updated screen areas would be scanned. Most of the time, on BBQL systems not running a window system, that will just be a cursor blinking.
So you're doing VRAM *and* differential updates? Then why are you talking about constantly scanning the VRAM?


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Coding partner sought...

Post by Dave »

Thank you for the insightful post, Nasta.

One thought I had was to also have a small 2kx9 FIFO mapped to the internal IO area, so messages could be passed to the intelligent video controller.

A register set could take AxB@XxY

... and scroll it up, down, left, right N pixels.
... or store it in Pi RAM for re-insertion later - much faster than the QL could, AND.... 600MB of free RAM means a LOT of windows
... or

... or just handle all the IO too.


mk79
QL Wafer Drive
Posts: 1349
Joined: Sun Feb 02, 2014 10:54 am
Location: Esslingen/Germany

Re: Coding partner sought...

Post by mk79 »

Nasta wrote:I am thoroughly surprised a 700MHz ARM based SOC could not read fast enough from GPIO. Someone is not programming it very well... usually bare GPIO (ports) can run at 25+ MHz speed, which is actually enough since once a byte/word is read, the rest of processing works at full speed.
New screen data can come in at 1.8MHz or something, so you only have a few hundred clock cycles to process the data. And you pay dearly for every memory access you make.
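The "few hundred clock cycles" can be put into numbers with a quick sketch. It takes the 1.8MHz write rate at face value and uses the clock speeds quoted earlier in the thread (700MHz for the A+, 1.4GHz for one CM3 core); it ignores memory latency, which is exactly the cost being warned about.

```python
WRITE_RATE_HZ = 1_800_000    # "1.8MHz or something", taken at face value


def cycles_per_write(cpu_hz, write_rate_hz=WRITE_RATE_HZ):
    """CPU clock cycles available between two back-to-back bus writes."""
    return cpu_hz / write_rate_hz


a_plus_budget = cycles_per_write(700_000_000)      # RasPi A+ figure from the thread
cm3_core_budget = cycles_per_write(1_400_000_000)  # CM3 figure from the thread, one core

print(f"A+:           {a_plus_budget:.0f} cycles per write")
print(f"CM3 (1 core): {cm3_core_budget:.0f} cycles per write")
```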
Nasta wrote:I mean, QXL could work reasonably well with a back end at least one order of magnitude slower, never mind an even slower communications protocol.
If it only worked as well as the QXL then it is not worth doing ;)

My method differed from all the ones you mention. It's most like the emulator example, but with no FIFO. The idea was to attach the RasPi to the bus with only a few level shifters and snoop the write accesses. As I said, it worked for what it's worth and with the knowledge I gained from the QL-SD I might have been able to finish it. But alas I have other projects. Last weekend I implemented some Aurora compatible graphics modes in Verilog for example ;)


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Coding partner sought...

Post by Dave »

mk79 wrote:I didn't say it wasn't fast enough, I said it barely managed it, in the sense that it did but it wasn't easy. RasPi bare metal program is no walk in the park.
I'm glad you recognize that the A+ and the CM3 are very, very different beasts... I have a Risc PC 700 with a StrongARM 287MHz, from back in my Acorn days. Not many of those with US power supplies! Any time you run a multi-tasking OS on a single core, timing is bound to be sketchy. Being able to force the task to run on one core and other tasks on others.... Much nicer. I learned ARM assembly on an Archimedes A440/1.
mk79 wrote:So you're doing VRAM *and* differential updates? Then why are you talking about constantly scanning the VRAM?
I'm just looking at LOTS of ways of doing it, and seeing what the machine would be happiest with. With some VRAM that has a SAM port, I can set any row and clock out any multiple of 128 or 256 bytes.

I get it that you're skeptical, and I get it that you're not going to help. But don't tell me it's hard. You've done hard things. If I get this to work, a lot of people will have a use for it.
Last edited by Dave on Tue Mar 05, 2019 7:17 pm, edited 1 time in total.

