TerribleFire accelerator for the QL accelerator card

Zarchos
Trump Card
Posts: 152
Joined: Mon May 08, 2017 11:49 am

Re: TerribleFire accelerator for the QL accelerator card

Post by Zarchos »

Do you know what the FPGA uses to complete a multiplication in 1 cycle?
I am intrigued, really.


Owner of various QLs including accelerated beasts, and also a happy Q68 owner ;)
Now porting SOTB to the Archies, to then port it to the Q68.
https://www.youtube.com/user/Archimedes ... +%28100%25
Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am


Post by Peter »

It has so-called "sysDSP" blocks for high-performance multiply and accumulate, which by the way are far from being used to their full potential at 40 MHz and only 16 bit.



Post by Zarchos »

Peter wrote: It has so-called "sysDSP" blocks for high-performance multiply and accumulate, which by the way are far from being used to their full potential at 40 MHz and only 16 bit.
Wow !
Btw, this 1-cycle multiply is great for a fast algo to plot sprites on a background. At the edges of a sprite, a word must hold not only the background but also the edge of your sprite. Instead of loading the background and bit-clearing the pixels where the sprite's pixels must go, my algo shifts by p positions in one direction, then by the same p positions in the other direction (this does the bit clearing without the need to load and apply a mask).
On the Archie, a shift by a constant applied to the 3rd operand costs no extra cycle.
Here, with this 1-cycle multiplication, you can get the same result, at least when the order in memory is background pixels then sprite pixels (multiplying by 2^p) ***
Even better if multiplication with addition is 1 cycle too (a 2nd multiplication by p positions plus the addition of the sprite pixels is a perfect use of this MULA).
Excellent news for fast 2D action games with lots of sprites on the Q68 ;-)

*** or the other way around, as I believe the MC68000 and ARM do not use the same endian system.
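
For readers following along, here is a minimal Python sketch of the shift trick described above. It assumes 32-bit words and a shift count p given in bits; the function names are made up for illustration and this is not Zarchos's actual routine. It only shows that a pair of opposite shifts clears bits without a mask, and that the left shift (but not the right shift) can be expressed as a multiplication by 2^p.

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit register


def clear_low_bits_by_shifting(word, p):
    """Shift right by p, then left by p: the low p bits end up cleared."""
    return ((word >> p) << p) & MASK32


def clear_high_bits_by_shifting(word, p):
    """Shift left by p, then right by p: the high p bits end up cleared."""
    return ((word << p) & MASK32) >> p


def shift_left_by_multiply(word, p):
    """A left shift by p is the same as multiplying by 2**p (mod 2**32)."""
    return (word * (1 << p)) & MASK32


def clear_low_bits_by_mask(word, p):
    """Reference version using an explicit mask, for comparison."""
    return word & (MASK32 << p) & MASK32
```

Note that only the left shift maps onto a multiply; the shift back in the other direction still needs a real shift instruction (or a different trick).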
Last edited by Zarchos on Wed Dec 13, 2017 7:27 pm, edited 1 time in total.


tofro
Font of All Knowledge
Posts: 2685
Joined: Sun Feb 13, 2011 10:53 pm
Location: SW Germany


Post by tofro »

There's no need for masking, shifting and other bit-fumbling in the linear hi-color modes of the Q68 - A pixel is either a byte or a short, and all of it is colour. Working with sprites is just moving memory around, as fast as possible.

(Still nice to have fast multiplication, though)

Tobias


ʎɐqǝ ɯoɹɟ ǝq oʇ ƃuᴉoƃ ʇou sᴉ pɹɐoqʎǝʞ ʇxǝu ʎɯ 'ɹɐǝp ɥO

Post by Zarchos »

tofro wrote:There's no need for masking, shifting and other bit-fumbling in the linear hi-color modes of the Q68 - A pixel is either a byte or a short, and all of it is colour. Working with sprites is just moving memory around, as fast as possible.

(Still nice to have fast multiplication, though)

Tobias
Isn't there a speed benefit in working with dwords instead of words? (Yes, if loading a dword is faster than loading a word twice.)
You see, I am not familiar with the 68000 timings ... this is where it will be interesting to work together.



Post by tofro »

Zarchos wrote:
tofro wrote:There's no need for masking, shifting and other bit-fumbling in the linear hi-color modes of the Q68 - A pixel is either a byte or a short, and all of it is colour. Working with sprites is just moving memory around, as fast as possible.

(Still nice to have fast multiplication, though)

Tobias
Isn't there a speed benefit to work with dwords instead of words ?
There is one, yes. If you have 2 bytes to move around, use a word; if it's 4, move a long. But moving one single byte is still way faster (because it is only one instruction) than masking out the high-order bits of a long and ORing it into the screen as, for example, a long (which is at least two instructions, probably even more, and is what you probably need to do on the ARM).



Post by Zarchos »

tofro wrote:
Zarchos wrote:
tofro wrote:There's no need for masking, shifting and other bit-fumbling in the linear hi-color modes of the Q68 - A pixel is either a byte or a short, and all of it is colour. Working with sprites is just moving memory around, as fast as possible.

(Still nice to have fast multiplication, though)

Tobias
Isn't there a speed benefit to work with dwords instead of words ?
There is one, yes. If you have 2 bytes to move around, use a word, if it's 4, move a long. But moving one single byte is still way faster (because it is only one instruction) than mask out the high-order bits of a long and OR it into the screen as, for example, a long (which is at least two instructions, probably even more,... what you probably need to do on the ARM).
Loading the background is not always necessary. It is true that there are no 8-bit screen modes on the Q68; I had my routines on the Archie in mind, a computer that works fast on 32-bit aligned addresses with 32-bit words.
That is not the case for the Q68.
So to plot sprites fast on the Q68 in hi-res modes, it is possible that there will in fact never be any need to load the background ...

Let's work with this example; I need to check something and I need your 68000 expertise.
We are in a hi-colour screen mode.
b is one background pixel
p is one sprite pixel
You want this arrangement on screen, linearly, starting on a word (16-bit) aligned address:
bp pp pp pp

Isn't it interesting to load the background to get b, arrange the bp sequence in a long word, and then store
bp pp pp pp with 1 instruction using dwords?
Or is it faster to
store the 1st p with 1 instruction using words,
then
store pp pp pp with another instruction using dwords?

Can you give the cycle counts?
I am a noob at the moment with all this.

And if you think such questioning to save maybe only one cycle isn't worth it, then think again:
when you have hi-res, hi-colour screen modes and you want to put as many sprites as you can on screen for a fast 2D game and still run at 50 fps, each cycle saved per sprite segment plotted is multiplied by the number of segments of your sprite on a scanline, by the sprite height, and all this for each sprite ... yes, you can save a lot.
To me, it is important to get the sprite plotting routines perfect from day 1.
Ultimately it does make a difference.
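
To put a rough number on that multiplication, here is a small Python sketch. All the figures in the usage note (segments per scanline, sprite height, sprite count) are hypothetical values picked for illustration, not measurements from any real game or from the Q68.

```python
def cycles_saved_per_frame(cycles_per_segment, segments_per_scanline,
                           sprite_height, sprite_count):
    """Cycles saved in one frame when every plotted segment gets cheaper."""
    return (cycles_per_segment * segments_per_scanline
            * sprite_height * sprite_count)


def cycles_saved_per_second(per_frame, frames_per_second=50):
    """Scale the per-frame saving up to one second at the target frame rate."""
    return per_frame * frames_per_second
```

With 1 cycle saved per segment, 2 segments per scanline, sprites 32 lines high and 40 sprites on screen (all assumed), `cycles_saved_per_frame(1, 2, 32, 40)` gives 2560 cycles per frame, and `cycles_saved_per_second(2560)` gives 128000 cycles per second at 50 fps.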



Post by tofro »

That is simple:

Byte and word move instructions have exactly the same timing (a byte moves in 4 clock cycles in the best case, a word exactly the same). Addressing modes come on top, but we're just trying to compare.

Long word (32-bit) moves have a 0-50% penalty on top of the word (i.e., they take 50% longer in the worst case, once you work in memory). And I'm using standard 68k timing here - the Q68 implementation might vary.

So a .B plus a .W move to drop 3 pixels into screen memory in 8-bit mode would use about 1 unit plus 1 unit (one unit being the cycles you need to move a byte or a word).

On the other hand, handling the whole thing as a long and masking the sprite data in would end up as:

1. long AND (clear the 3 lower bytes of the screen long): 1.5 units
2. move (fetch 4 bytes from sprite data into a register): 1.5 units, but don't count it, as we didn't count the sprite fetch above either
3. AND (mask out the upper byte of the sprite in the register): 1.5 units
4. OR (blend the 3 bytes of the sprite into the background, leaving the 1st byte untouched): 1.5 units

Logical instructions have roughly the same timing as move instructions, so counting steps 1, 3 and 4 we have at least 3 * 1.5 = 4.5 units, compared to 2 above.

The first version has the advantage that it doesn't need to read any screen data - just drop the bytes/words where they belong. Step 3 above can only be done on a copy of the sprite data in a register, as you don't want to permanently modify your sprite.

The second method is what you need to use in the original QL screen modes (not with bytes there, but with bits) because of the weird intermingled bitplanes. That is why it's so much faster to set a pixel in the Q68 hi-colour modes.
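
The two methods compared above can be sketched in Python as follows. This is only an illustration under assumptions: the screen is modelled as a big-endian bytearray with one byte per pixel, the function names are invented, and the returned costs simply reuse the rough "units" from this post (1 per byte/word move, 1.5 per long operation in the worst case) rather than real Q68 timings.

```python
UNIT_BYTE_OR_WORD = 1.0  # cost of one .B or .W move, in the post's "units"
UNIT_LONG = 1.5          # worst-case cost of one long operation


def plot_direct(screen, offset, sprite):
    """Method 1: write 3 sprite bytes after the background byte.

    One byte move plus one word move; the screen is never read.
    """
    screen[offset + 1] = sprite[0]                  # like a .B move
    screen[offset + 2:offset + 4] = sprite[1:3]     # like a .W move
    return 2 * UNIT_BYTE_OR_WORD


def plot_masked(screen, offset, sprite_long):
    """Method 2: treat everything as one long and mask the sprite in."""
    background = int.from_bytes(screen[offset:offset + 4], "big")
    background &= 0xFF000000                 # step 1: clear 3 lower bytes
    sprite_bits = sprite_long & 0x00FFFFFF   # step 3: drop sprite's top byte
    merged = background | sprite_bits        # step 4: blend
    screen[offset:offset + 4] = merged.to_bytes(4, "big")
    return 3 * UNIT_LONG                     # step 2 (fetch) not counted
```

Both functions leave the same four bytes on screen; with these assumed unit costs the direct version comes out at 2 units against 4.5 for the masked one.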

Tobias
Last edited by tofro on Wed Dec 13, 2017 8:48 pm, edited 1 time in total.



Post by Zarchos »

I'll reread, but the logical instructions are in fact the fast 1-cycle MUL instructions on the Q68 ... that makes 2 cycles.
I don't understand your answer, as I don't use a mask (steps 3 and 4 are not in my algo).
Last edited by Zarchos on Wed Dec 13, 2017 8:57 pm, edited 1 time in total.



Post by tofro »

Not sure how multiply and add would help here - I would use it to shift by 8 bits and mask in a lower byte, but that's not the case here. But maybe I simply don't see it.

You can shift out the upper byte of a long by multiplying by 256 and then shifting back with an expensive instruction. That is not much help: it could replace step 3 above, but would cost more than the 1.5 units we had there.

BTW, I think the "multiply and add" that Peter referred to above is an FPGA function - the m68k doesn't have such an instruction. I would guess Peter uses it to implement the MULx instructions of the 68k (which "only" multiply) and simply always adds 0.
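
As a sketch of that guess (and it is only a guess, not a description of the actual Q68 core), a plain multiply can be modelled in Python as a multiply-accumulate whose accumulator input is tied to 0. The function names are hypothetical; the only 68k fact relied on is that MULU.W multiplies two unsigned 16-bit operands into a 32-bit result.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def multiply_accumulate(a, b, acc):
    """Model of a multiply-accumulate block: a * b + acc, kept to 32 bits."""
    return (a * b + acc) & MASK32


def mulu_w(dest, src):
    """Model of MULU.W: unsigned 16 x 16 -> 32 bit multiply.

    Only the low words of both operands take part; the accumulator
    input is simply held at 0, as guessed in the post above.
    """
    return multiply_accumulate(dest & MASK16, src & MASK16, 0)
```

For example, `mulu_w(0xFFFF, 0xFFFF)` yields 0xFFFE0001, the largest possible 16 x 16 product.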

Tobias

