So now that I had the basic ray casting going, it was time to get some textures in. I toyed with a few ways of doing this, from normal code to using the new instructions. I’d gotten it down to about 56 T-states per pixel when Jim Bagley started asking about the engine. He’d been thinking of doing one, and had a 48 T-state rendering loop. We chatted about it, and I noticed his wasn’t quite right, but I combined my rendering with some of his methods to get mine down to 48 T-states anyway.

texture_macro macro
        exx             ; 4
        ld      a,(de)  ; 7 read texel
        add     hl,bc   ; 11
        ld      e,h     ; 4
        exx             ; 4 = 30

        ld      (hl),a  ; 7
        add     hl,de   ; 11 = 48
        endm

The idea here is to make a macro, then unroll the loop 76 times (the height of the lowres render area). We then work out how many pixels we have to render and jump into the middle of this code, so that only that many copies run. This works “okay” for the most part…
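As a rough sketch of the “jump into the middle” idea: each copy of texture_macro assembles to 7 one-byte opcodes, so skipping (76 - n) copies means adding (76 - n) * 7 to the start of the unrolled block. The routine below works that out with the Z80N `mul d,e` instruction. `VTLoop` is the label for the unrolled block used in the setup code further down (which reads the offset from a table rather than multiplying); `CalcEntry` and `LINES` are names made up for this illustration, and directive syntax such as `equ` may differ in your assembler.

```asm
LINES   equ  76                 ; copies of texture_macro in the unrolled block (assumed)

; In:  a  = pixels to draw (0..LINES)
; Out: hl = address to jump to inside the unrolled block
; Uses: a, de
CalcEntry:
        neg                     ; a = -n
        add  a,LINES            ; a = LINES - n = copies to skip
        ld   d,a
        ld   e,7                ; each copy is 7 one-byte opcodes
        mul  d,e                ; Z80N: de = d * e
        ld   hl,VTLoop          ; address of the first copy
        add  hl,de              ; step past the copies we do not need
        ret
```

With n = 76 the offset is 0 and every copy runs; with n = 1 the offset is 75 * 7 = 525, which lands on the last copy.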

The biggest problem with this method is the setup. The Texture_Vertical call takes a load of variables to work everything out:

; *************************************************************************************
;
; Fill a textured vertical line in lowres
;
; e = X pixel on screen - where column lives
; d = start (Y pos)
; l = end (Y pos)
; a = line height (number of pixels to render)
; h = full pixel height (FULL height - including clipped area)
; b = Tex X (column to render)
; c = tile to render
;
; *************************************************************************************

From this list of things, we then work out the texture scaling values, the tile and texture address, the bank it’s in, and the screen address plus column offset, not to mention the point we need to “jump” to in order to render all the pixels. Lastly, we also need to “clip” off the top of the screen, although this ends up being a very tight, simple loop that adds the deltas until it’s “on” screen. This code looks pretty ugly… but here it is…

Texture_Vertical:
        ld   (tmp+1),a
        ld   a,h
        ld   (tmp),a
        ld   a,b
        ld   (texel_X+1),a
 
        ; work out tile bank, and offset
        ld   a,c
        and  1
        ld   (tile_id+1),a
        ld   a,c
        srl  a
        add  a,TilesSeg
        NextReg $52,a

        ld   a,l
        sub  d
        push af  ; store length

        ; work out lowres screen address (y*128)+x
        ld   a,e
        ld   e,0
        sra  d  ; *128
        rr   e
        add  a,e  ; +x
        ld   e,a
        ld   hl,ScreenAddress
        add  hl,de
        ld   de,$0080
        exx   ; screen rendering in "alternate" registers

        ld   a,(tmp)
        ld   de,Scales
        add  de,a

        pop  af
        add  a,a
        ld   hl,ScaleLookup
        add  hl,a

        ld   a,(de)  ; get texel scale
        ld   c,a
        inc  d
        ld   a,(de)
        ld   b,a
 

        ld   e,(hl)  ; get jump table
        inc  hl
        ld   d,(hl)
        ld   hl,VTLoop
        add  hl,de
        ld   (Jmp+1),hl

        ld   a,(tile_id+1)
        add  a,a
        add  a,a
        add  a,a
        add  a,a  ; = $1000
        ld   h,a
        ld   l,0
        add  hl,Tiles
        ex   de,hl
        ld   hl,(texel_X)
        srl  h  ; /2 = *128
        rr   l
        srl  h  ; /4 = *64
        rr   l
        add  hl,de
        ex   de,hl
        ld   h,e
        ld   l,0


        ; clip top?
        ld   a,(tmp+1)
        and  a
        jr   z,@SkipCLip
@Clip:  add  hl,bc
        dec  a
        jp   nz,@Clip
        ld   e,h
@SkipCLip:
        exx
Jmp     jp   $0000

The engine calls this 128 times, once per column of the screen, so this setup time is very expensive, and something I’d need to try and optimise at some point. I did have a large scaler divide table in here to help speed up working out the texture scaling. It took up a huge amount of space, but got rid of 128 ugly divides, which take about 1000 T-states each (ish), so it was well worth it.

Now that this was in, I had the very basics of a running engine. It was working, and it showed this was very possible. Yes, I still had to add 3D sprites, but remembering how slow 3D games were on the original machine, this was very workable.

If this had come out back in the day, I’d have wet my pants!

So back to optimisation. I unrolled the 16-bit divide I was using, which shaved a little off, and did some other work to save T-states here and there. But all these were just tiny amounts. What I needed was a big saving…
I replaced the start/end line calculation with a simple table lookup, which saved a chunk of clipping code. That was a good start: a couple of hundred T-states per column (so 128*200-ish, around 25,000 T-states in total), which is pretty good. I then added to this by optimising the tile address calculation, but I was back to saving a few T-states here and there.
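To illustrate the kind of table lookup meant here, this is a minimal sketch and not the engine’s actual code. It assumes two hypothetical 256-byte tables, `StartTab` and `EndTab`, indexed by the projected wall height and already clamped to the visible area, with `EndTab` placed exactly 256 bytes after `StartTab`. `GetSpan` and both table names are invented for the example; the output registers match the d = start, l = end inputs of Texture_Vertical.

```asm
; In:  a = projected wall height in pixels (0..255)
; Out: d = first visible line, l = last visible line
; Uses: h (overwritten)
; Assumes: EndTab = StartTab + 256, both precomputed and pre-clipped
GetSpan:
        ld   hl,StartTab
        add  hl,a               ; Z80N: hl += a (zero-extended)
        ld   d,(hl)             ; d = start line for this height
        inc  h                  ; same index in the next table (+256 bytes)
        ld   l,(hl)             ; l = end line for this height
        ret
```

Because the clamping is baked into the tables, the compare-and-branch clipping code disappears, leaving two memory reads.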

I decided to look at the DDA stepping code, and shuffled some of the instructions around, saving a memory read inside the map lookup code, which happens every step. From here, I decided I’d like to make the MULs and DIVs macros, so I didn’t have to call them. There are about 20 or so calls for each vertical, so that’s 20-ish*128*(CALL+RET) T-states; at 17 T-states for a CALL and 10 for a RET, that comes to roughly 69,000 T-states. This was again a fair saving, but it doesn’t half balloon the code!! This made debugging tricky, so I stuck it on a compile flag to make testing easier.

I added in some more textures so that I could get a little depth to the walls, and this really did make a huge difference. This was done by simply adding darker textures; all texture numbers were then multiplied by 2, and the “side” the ray hit added on. All these changes made the engine run in 2 to 3 frames, which was definitely in the ballpark I was aiming for!
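The tile selection described above boils down to a couple of instructions. This is only a sketch: `ShadeTile` is a made-up name, and which side gets the darker texture is an assumption. The result is left in c, matching the “c = tile to render” input of Texture_Vertical.

```asm
; In:  a = wall texture number from the map (0..127)
;      c = side of the map cell the ray hit (0 or 1)
; Out: c = tile to render: even = normal texture, odd = darker version (assumed order)
ShadeTile:
        add  a,a                ; texture * 2
        add  a,c                ; + side
        ld   c,a
        ret
```

Since Texture_Vertical splits the tile number with `and 1` and `srl a`, each normal/dark pair appears to end up in the same bank, so only the $1000 offset changes between the two.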

You can see from the video that this is definitely usable. Yes, we still have sprites to add… but it’s pretty fast and fun to run around in.

Next up was adding in some 3D sprites….