14 Jun 2015
“Life is good for only two things, discovering mathematics and teaching mathematics.” — Simeon Poisson
In the previous installment of this journey, I’d basically proven (informally) the correctness of the design for Forth’s outer interpreter. Along the way, I performed some additional refinements, while glossing over others. One of those refinements that I glossed over was how to accumulate digits into a single number.
Accumulation of digits into a complete number is actually pretty simple: you first multiply what you have already by the base, then add the new stuff. I’ll quickly walk through the code I’ve written so far.
First, we need to get the next character from the token being converted:
jal rt, numbNextChar ori t1, a0, 0 jal rt, numbEatChar
We check to make sure the value falls within our current base setting. For brevity, I don’t properly account for hexadecimal or higher bases. Just assume that it “magically” works for now.
addi t1, t1, -48 ld a1, zpBase(x0) bltu t1, a1, nTUN1
If we discover a character which doesn’t qualify as a digit, we mark its location and early-exit from the procedure.
ld a0, zpWordIndex(x0) addi a0, a0, -1 sd a0, zpWordIndex(x0) jal x0, nTUN0
Otherwise, we “accumulate” the digit.
Note that register
A1 will already contain our base value from the check above.
We invoke a multiplication subroutine because the (first-generation) Polaris CPU definition for Kestrel-3 explicitly lacks support for multiplication.
I made this decision specifically to ease hardware implementation and verification.
I’ll add multiplication support later, once I have the rest of the CPU implementation working.
In the mean time, we need to emulate it in software:
nTUN1: ld a0, zpValue(x0) jal rt, math_multiply add a0, a0, t1 sd a0, zpValue(x0) jal x0, nTUN2
In various unit tests, I stub out
math_multiply as needed.
Now it’s time to define our requirements for
This procedure is going to be somewhat different from the rest,
in that I’m using CPU registers instead of memory-resident variables for communication.
A1 hold our input multiplicands.
Looking at the code above, we also see that
A0 holds the least significant 64 bits of the result as well.
We also cannot change any other register.
I don’t need to add support for signed multiplication.
As you can see from the pseudo-code in the previous two parts,
numeric conversion always happens with unsigned numbers (sign is applied only after conversion of the unsigned part of the number).
However, I would like this routine to be re-used to implement the
*/ operators in Forth.
Thus, I’ll consider sign up front.
Also, because I’m aiming to support
*/ out of the box,
the result of the multiplication should be a full, 128-bit product.
In other words,
A0 holds the least-significant 64 bits of the product, while
A1 holds the most significant.
As it happens, signed and unsigned multiplication is the same algorithm. You can check this out yourself using any ANSI or Forth-83 compatible Forth system:
-5 -3 UM* DROP . 15 OK -5 3 UM* DROP . -15 OK 5 -3 UM* DROP . -15 OK 5 3 UM* DROP . 15 OK _
Indeed, if you look at Bill Muench’s EForth, he explicitly defines
Thus, we need only concern ourselves with the needs of performing 128-bit arithmetic on a 64-bit RISC architecture without a carry bit. :) I choose to use the Russian Peasant’s Algorithm.
PROCEDURE Multiply(A0, A1): QWORD Save registers. Sign-extend A1 into T4:T0. Sign-extend A0 into T2:T1. Set product in A1:A0 to 0. WHILE T0 is not zero DO IF T0 is odd THEN Add T2:T1 into product. END Shift T4:T0 right by 1 logically. Shift T2:T1 left by 1 logically. END. Restore registers. RETURN product.
I won’t bother proving the correctness of this algorithm; I suspect several hundred to several thousand years of use has done an adequate job of that. Instead, I want to focus on finer-grained details of the algorithm.
We sign-extend our inputs to 128-bits to ensure mathematically correct results. Both inputs are so extended; however, below, we illustrate the process for only one parameter. The process is the same for the other; only the choice of registers differs.
First, we realize that the lower 64 bits remains intact, so we simply preserve them as-is.
Move A0 to T1.
To perform the sign-extension, however, I compute either 0 for positive, and 1 for negative, then negate this value, giving me either 0 or -1. As it happens, the choice of 0 or 1 matches the binary value of bit 63 of A0. So, I simply shift bits around, like so.
Shift top bit of A0 into T2. Negate T2.
If bit 63 of
A0 is zero, then negating zero in
T2 should yeild zero.
Otherwise, if bit 63 of
A0 is one, then
T2 will be set to negative one.
In this way,
T2 properly sign-extends
Being a RISC-V processor, Polaris can do this without destroying the value in
ori t1, a0, 0 srli t2, a0, 63 xori t2, t2, -1 addi t2, t2, 1
Addition of two quad-words is perhaps a bit non-obvious at first, but it’s actually fairly easy to do. We start by performing a carry-save addition of the two 64-bit halves. Then, we fix the result up by manually propegating any pending carry.
Add low words together. Add high words together. Propegate carry.
We propegate carry like so:
IF low result is less than the low half of an addend, THEN Add 1 to the top half of the result. END
In assembly language, this translates to:
add a0, a0, t1 add a1, a1, t2 bgeu a0, t1, *+8 addi a1, a1, 1
If you pretend the registers are only 6 or 8 bits wide, you can easily confirm this implementation works.
We need to shift our addend left by one bit, and the multiplicand right by one bit. Problem is, each of these are 128-bit quantities, and Polaris is only a 64-bit CPU without a carry flag.
We use a combination of shifts and logical operators to treat this as a bit-field access problem. For example, to shift to the addend left one bit, we need to first make room for incoming bits:
slli t2, t2, 1
We then take whatever is in bit 63 of the lower word and place it at bit 0, to be logically-ORed with the recently vacated bit 0 in the high word:
srli t3, t1, 63 or t2, t2, t3
Finally, shift the low word accordingly.
slli t1, t1, 1
We take the same approach for the multiplicand, but in the opposite direction:
srli t0, t0, 1 slli t3, t4, 63 or t0, t0, t3 srli t4, t4, 1
I now list the implementation of the multiplication routine, as of this writing. I’ve changed some of the registers around to reduce the cognative load of the reader, and to ease debugging effort.
I’ll elide tests in this blog post, but check the repo (look for
src/forth/test_math.asm and its related files).
math_multiply: addi rp, rp, -48 sd rt, 0(rp) sd t0, 8(rp) sd t1, 16(rp) sd t2, 24(rp) sd t3, 32(rp) sd t4, 40(rp) ori t0, a1, 0 ; 128-bit control multiplacand srli t4, a1, 63 xori t4, t4, -1 addi t4, t4, 1 ori t1, a0, 0 ; 128-bit partial sum multiplacand srli t2, a0, 63 xori t2, t2, -1 addi t2, t2, 1 ori a0, x0, 0 ; Zero product ori a1, x0, 0 m_m_again: bne t4, x0, *+8 beq t0, x0, m_m_done ; Complete when no more 1 bits exist. andi t3, t0, 1 ; Multiplicand odd? beq t3, x0, m_m_even add a0, a0, t1 ; Yes, perform 128-bit partial addition. add a1, a1, t2 bgeu a0, t1, m_m_even addi a1, a1, 1 m_m_even: srli t0, t0, 1 slli t3, t4, 63 or t0, t0, t3 srli t4, t4, 1 slli t2, t2, 1 srli t3, t1, 63 or t2, t2, t3 slli t1, t1, 1 jal x0, m_m_again m_m_done: ld rt, 0(rp) ld t0, 8(rp) ld t1, 16(rp) ld t2, 24(rp) ld t3, 32(rp) ld t4, 40(rp) addi rp, rp, 48 jalr x0, 0(rt)
Happily, this implementation actually gets computation of the upper 64 bits correct.
If you test
UM* on a 64-bit GForth 0.7 version under Linux, you’ll find that the high 64 bits take on bizarre values.
-5 -3 UM* . . -8 15 OK -5 3 UM* . . 4 -15 OK 5 -3 UM* . . 2 -15 OK 5 3 UM* . . 0 15 OK
The core of the multiplication loop would be far simpler and faster if we restricted our math to just 64-bits.
However, Forth’s support for the
*/ operator requires double-width products.
The code above provides correct results for both signed and unsigned inputs.
Eventually, when Polaris supports actual multiply instructions, this whole piece of code disappears, replaced with:
mul a0, xa, xb mulhi a1, xa, xb
Until that happens, though, this code will need to suffice.
Software engineer by day. Amateur computer engineer by night. Founded the Kestrel Computer Project as a proof-of-concept back in 2007, with the Kestrel-1 computer built around the 65816 CPU. Since then, he's evolved the design to use a simple stack-architecture CPU with the Kestrel-2, and is now in the process of refining the design once more with a 64-bit RISC-V compatible engine in the Kestrel-3.
Samuel is or was:
Samuel seeks inspirations in many things, but is particularly moved by those things which moved or enabled him as a child. These include all things Commodore, Amiga, Atari, and all those old Radio-Electronics magazines he used to read as a kid.
Today, he lives in the San Francisco Bay Area with his beautiful wife, Steph, and four cats; 13, 6.5, Tabitha, and Panther.