I have yet to see objective comparisons between MISC, RISC, and CISC architectures. MISC processors often get a bad rap for their stack-based micro-architectures; however, generally-available evidence to support these negative views seems lacking. I try to put these processor architectures into perspective using real-world experience with some processors I’ve used directly in the past. I hope the data revealed will help others in answering questions on which architecture proves right for them.
In this article, I highlight the S16X4 processor, a design which I built for the Kestrel-2 home-made computer. Details of this processor may be found in its datasheet.
The S16X4 derives from the Steamer16, a word-address machine capable of 16-bit wide words. The S16X4 includes instructions for addressing memory with byte-granularity, which I exploit in our example program for the purposes of plotting an 8x8 glyph onto a bitmap. However, this does not materially affect program size. Even were I to stick with pure word addressing, I would just say that the font data is packed with two bits per pixel. This results in a program with exactly the same number of instructions to execute.
Different routines in the program have been sized:
Name Size (%) ==================== plt 16 9.8% pltc 18 11% zerop (0p) 8 4.8% zeroq (0q) 14 8.5% addrs 8 4.9% crsr 33 20% plotch 15 9.1% setOutSrc 52 32% ==================== TOTAL 164 words (328 bytes) ====================
I count 102 uses of the
li instruction, which means 102 of the 164 words contains literal information, much of it repeated.
With a richer instruction set, one could find opportunities to reduce the dependency on the
li instruction, thus helping to improve code-density.
Without exception, all S16X4 instructions takes one clock cycle to execute. Additionally, the CPU further takes an additional clock cycle to fetch a batch of four instructions. These simple rules makes estimating run-time performance easy, as the following table illustrates.
Name Cycles (%) ==================== plt 29+(0*0) pltc 11+(8*(16+plt)) zerop (0p) 15+(0*0) zeroq (0q) 27+(0*0) addrs 12+(0*0)+0p+0q crsr 55+(0*0) plotch 22+(0*0)+addrs+crsr+pltc setOutSrc 95+(0*0) ==================== TOTAL 266+(344) = 610 cycles ====================
This table tells us that initializing a bitmap and printing a single character takes 610 cycles to complete, minimum.
Additional characters may be plotted with only 515 cycles, for no need exists to invoke
setOutSrc for each character intended.
pltc procedure requires the lion’s share of this time, taking 371 cycles on its own (11+(8*(16+29))).
plt takes seven cycles to transfer a single byte from the font bitmap to the destination bitmap.
It takes a further seven cycles to increment
p, and another seven for
Within this routine, at least, the S16X4 compares with a classic CISC processor equipped with many general-purpose registers, such as the MC68000.
Cost of Structure Field Access
Accessing fields off of a base pointer takes seven cycles and four words to complete:
- Fetch the
li base; fwm; li field_offset; addbundle.
- Spend four cycles executing these instructions.
- Fetch the
fwmin the next bundle.
- Spend one cycle executing this instruction.
Accessing an absolute memory location takes two cycles on average, three worst-case, while requiring half as many words of memory to encode. Clearly, executing references to absolute memory locations proves much faster, and as well, consumes far less program memory.
If left as-is, the S16X4 MISC would behave not entirely unlike a Motorola 68000 when reading or writing structure fields.
I’ve optimized the software in listing 1 to predominantly access absolute memory locations.
To ensure these absolute locations always reflect the caller’s choice of OutputSurface, the caller must invoke
setOutSfc before using
plotch to render text.
Developers of MISC software will recognize this style of software design as more idiomatic,
largely for the aforementioned performance and program size reasons.
Observe that outsfc pointer never changes while it’s in normal use by either the calling or called subprogram.
setOutSfc works like a cache refill operation — it writes fields that change back to original structure, then caches fields from new structure into absolute locations.
Caching, then, amortizes the cost of prefetching fields across all absolute memory references corresopnding to a field access.
It turns out, as shown in listing 1, the program makes 15 field accesses for each call to
setOutSfc routine consumes 95 cycles, assuming it writes back as well.
Thus, when writing a character, those 95 cycles virtually “spread out” over the 15 field accesses.
This corresponds to (95/15)=6.333 cycles per field access.
Since each absolute reference averages two cycles, the
plotch routine runs as if each field access consumed 2+6.333 = 8.333 cycles.
It turns out, caching holds no value when plotting only a single character on the bitmap.
However, when repeatedly invoking
plotch, e.g., when printing a string of characters, overhead drops very rapidly.
Printing the text, “Hello”, involves only five characters.
Yet, since the overhead for
setOutSfc remains constant, amortized overhead becomes (95/(5*15))=1.267 cycles,
plotch behaves as if each field access takes only 3.267 cycles.
If printing the average-sized line of text, say 40 characters, the overhead drops further to sub-unity: 0.158 cycles!
If printing an average-sized screen, say in the ballpark of 1000 characters for an 80x25 bitmap, overhead drops to insignificance: 0.00633 cycles per field access.
The less frequently a program calls
setOutSfc, the closer to two cycles each field access becomes.
Therefore, even though we needed to alter our strategy a bit to realize its benefits, the S16X4 MISC processor competes favorably against very sophisticated CISC architectures like the MC68000 for structure field references in the best case, and equals their performance in the pathologically worst case. The MIPS R3000 still performs better due to its dependency on pipelining, but only by a factor of two in the long term.
Cost of Indirect Subroutine Call
The S16X4 lacks any kind of subroutine mechanism in hardware. Thus, a programmer or compiler must synthesize this capability from existing instructions and conventions. Idiomatically, the S16X4 run-time environment stores return addresses in absolute memory locations, to save time. Of course, this prevents re-entrancy and recursion without more complex prolog and epilog procedures.
A normal subroutine, then, involves a code sequence such as:
li return_address li subroutine_address go
This takes four cycles to complete, not including latency introduced by the subroutine’s prolog and epilog. In this case, the S16X4 falls between CISC and RISC in subroutine performance.
Because the programmer must synthesize subroutine calls of all kinds, supporting indirect calls comes naturally and easily.
li return_address li pointer_to_subr_address fwm go
Note the simple introduction of a fetch instruction just before the go opcode. This adds a cycle to the operation.
In this case, the S16X4 runs closer to RISC speeds than classic CISC. Indeed, the 65C816 and MC68000 both take something on the order of 20 clock cycles to accomplish the same task. (This will be revealed in several articles.)
With 62% of the program space consumed by 16-bit literals, the S16X4 provides a code density on par with a 32-bit RISC.
I was surprised to find that the S16X4, despite lacking overt support for structured data access, potentially performs better at that task than it does moving bytes into a bitmap, provided one adopts some means of amortizing the effective address calculations, such as with field caching. If one cannot avoid the effective address calculations (e.g., as with creating or destroying a recursive subroutine’s activation frame), performance will never be worse than a classic CISC.
The S16X4’s unusually high indirect subroutine performance makes it surprisingly adept at object-oriented, functional, and highly modular software design.
The S16X4, despite having only 12 instructions, remains a surprisingly powerful MISC processor that offers performance somewhere in between classic CISC and RISC. Explicit exposure of effective address calculations provides opportunities for the programmer to amortize them into near insignificance. With careful attention to program structure, performance can be weighted closer to RISC levels of performance.
outsfc: .word 0 os_bitmap equ 0 os_fontBase equ 2 os_flip equ 4 os_scroll equ 6 os_x equ 8 os_y equ 10 os_ch equ 12 p: .word 0 q: .word 0 ; Extra storage for temporaries, etc. plt_pc: .word 0 pltc_pc: .word 0 zerop_pc: .word 0 zeroq_pc: .word 0 addrs_pc: .word 0 plotch_pc: .word 0 setOutSfc_pc: .word 0 t0: .word 0 outsfcNew: .word 0 ; Cached fields of current OutputSurface. cc_bitmap: .word 0 cc_fontBase: .word 0 cc_flip: .word 0 cc_scroll: .word 0 cc_x: .word 0 cc_y: .word 0 cc_ch: .word 0 plt: li plt_pc swm li p fwm fbm li q fwm sbm li p fwm li 256 add li p swm li q fwm li 80 add li q swm li plt_pc fwm go pltc: li pltc_pc swm li 8 li t0 swm pltc_l1: li pltc_l2 li plt go pltc_l2: li t0 fwm li -1 add li t0 swm li t0 fwm li pltc_l1 nzgo li pltc_pc fwm go zerop: li zerop_pc swm li cc_fontBase fwm li cc_ch fwm add li p swm li zerop_pc fwm go xref mulBy640 zeroq: li zeroq_pc swm li cc_y fwm li cc_y fwm add li mulBy640 add fwm li cc_bitmap fwm add li cc_x fwm add li q swm li zeroq_pc fwm go addrs: li addrs_pc swm li addrs_l1 li zerop go addrs_l1: li addrs_pc fwm li zeroq go crsr: li crsr_pc swm li cc_x fwm li -79 add li $8000 and li crsr_l1 zgo li cc_x fwm li 1 add li cc_x swm li crsr_pc fwm go crsr_l1: li 0 li cc_x swm li cc_y fwm li -24 add li $8000 and li crsr_l2 zgo li cc_y fwm li 1 add li cc_y swm crsr_l3: li crsr_pc fwm go crsr_l2: li crsr_l3 li cc_scroll fwm go xdef plotch plotch: li plotch_pc swm li plotch_l1 li addrs go plotch_l1: li plotch_l2 li pltc go plotch_l2: li plotch_l3 li crsr go plotch_l3: li plotch_pc fwm li cc_flip fwm go xdef setOutSfc setOutSfc: li setOutSfc_pc swm li outsfc fwm li setOutSfc_l1 zgo li cc_x fwm li outsfc fwm li os_x add swm li cc_y fwm li outsfc fwm li os_y add swm setOutSfc_l1: li outsfcNew fwm li outsfc swm li outsfc fwm li os_bitmap add fwm li cc_bitmap swm li outsfc fwm li os_fontBase add fwm li cc_fontBase swm li outsfc fwm li os_flip add fwm li cc_flip swm li outsfc fwm li os_scroll add fwm li cc_scroll swm li outsfc fwm li os_x add fwm li cc_x swm li outsfc fwm li os_y add fwm li cc_y swm li outsfc fwm li os_ch add fwm li cc_ch swm li setOutSfc_pc fwm go