Software Survivalist, by Samuel A. Falvo II (kc5tja@arrl.net)
<h1 id="on-subroutine-threading-for-the-65816">On Subroutine Threading for the W65C816 Processor (2021-08-18)</h1>
<p>When implementing a subroutine threaded, dual-stack virtual machine for the W65C816 processor,
it might be non-obvious how to attain the highest possible performance.
The roles you assign to registers with the 65816 can significantly impact run-time performance.
Conventional implementations typically prefer a particular register allotment, but without explaining why.
I compare two competing methods of implementing subroutine-threaded code for this processor architecture.
I can finally explain why there exists only a single preferred embodiment of subroutine-threaded code on the 65816 processor.</p>
<h1 id="introduction">Introduction</h1>
<p>Running a stack-architecture virtual machine on the W65C816 processor,
such as what you find with the Forth programming language,
incurs significant run-time overhead.
Operations that the processor can natively perform in under 5 clock cycles
consistently consume between 20 and 80 clock cycles, depending upon implementation technique used.
A number of different techniques have been developed to minimize interpreter inefficiencies,
filling out a spectrum of size/speed tradeoffs.
If space consumption is not a primary concern,
one of the fastest methods of running such a virtual machine is to compile stack architecture code
into a representation called <em>subroutine threaded code</em>.</p>
<p>A quick review of the 65816 processor’s registers that are relevant to subroutine-threading will be helpful,
especially for readers unfamiliar with the 65816 programming model.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 15        0
+-----------+
|     A     |
+-----------+
|     X     |
+-----------+
|     Y     |
+-----------+
|     S     |
+-----------+
|    PC     |
+-----------+
</code></pre></div></div>
<p>Note that other registers which the 65816 supports,
such as the bank registers and direct page register,
are not listed, as they are not relevant for our purposes.</p>
<ul>
<li>The accumulator (A) is the sole data processing register. It can add, subtract, etc.; however, it cannot address memory.</li>
<li>The index registers (X, Y) are address offset registers. Used with a base address of zero, they can be thought of as pure address registers. However, they can only be incremented and decremented. They cannot participate in address arithmetic directly.</li>
<li>The stack pointer (S) is used to maintain the processor’s hardware stack. It works similarly to the index registers, in that it can only be incremented and decremented. Unlike the index registers, the CPU itself maintains this register automatically.</li>
<li>Finally, the program counter (PC) always points to the instruction the processor intends to fetch next.</li>
</ul>
<p>As you can see, the 65816 is severely register-starved compared to contemporary processors.
It is even register-starved compared to an ideal Forth CPU’s register set!
Even so, we can still make use of two registers as our virtual machine’s stack pointers.
Because the <em>dp,X</em> addressing mode only supports the X register,
there are two ways of doing this:</p>
<ol>
<li>The X register can hold the data stack pointer, and the S register the return stack pointer; or,</li>
<li>The X register can hold the return stack pointer, and the S register the data stack pointer.</li>
</ol>
<p>Which of these approaches is better isn’t obvious at first.
To help find out, I will take a look at both implementation techniques and
compare their performance relative to each other
by examining a hypothetical compiler’s output after consuming Forth code at different levels of abstraction.</p>
<h1 id="the-problem">The Problem</h1>
<p>Conventional subroutine threading techniques typically allocate the S register to the role of the return stack pointer,
and X to the role of the data stack pointer.
This allows the data stack to be overlaid with the direct page segment,
allowing the <em>dp,X</em> addressing mode to be used for addressing the stack.
However, the X register must be maintained manually in software,
and it is adjusted often, since the data stack sees far more traffic than the return stack.[2]</p>
<p>Consider how a VM implementer might program a simple binary addition primitive:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LISTING 1.
jsr enter_add ; 6 cycles
...
enter_add:
lda 0,x ; 5 cycles
clc ; 2 cycles
adc 2,x ; 5 cycles
sta 2,x ; 5 cycles
inx ; 2 cycles
inx ; 2 cycles
rts ; 6 cycles
</code></pre></div></div>
<p>According the the instruction set listings in [1],
the above fragment of code should take 33 cycles to complete a 16-bit addition, including the JSR needed to invoke it.
Since the addition itself takes 7 cycles (CLC; ADC combination), the rest is overhead.
Can this overhead be reduced if we swap the conventional roles of the X and S registers?</p>
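As a sanity check on those figures, the per-instruction cycle counts annotated in Listing 1 can be tallied mechanically. A quick Python sketch (all numbers taken from the listing's comments):

```python
# Cycle counts from Listing 1's comments:
# jsr, lda 0,x, clc, adc 2,x, sta 2,x, inx, inx, rts
cycles = [6, 5, 2, 5, 5, 2, 2, 6]
total = sum(cycles)
print(total)            # 33 cycles, including the JSR

# Only the CLC/ADC pair (2 + 5 cycles) performs the actual addition.
useful = 2 + 5
print(total - useful)   # 26 cycles of pure overhead
```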
<h1 id="the-idea">The Idea</h1>
<p>It would seem logical to exploit the processor’s built-in stack hardware
to implement the most frequently used stack.
Pushes and pops perform stores and loads from the stack,
while the processor decrements or increments the stack pointer for us.
Since a typical stack architecture is expected to use the data stack much more frequently than the return stack,
wouldn’t it make more sense to let S refer to the data stack instead of the return stack?
The X register, the pointer for the software-managed stack,
would then be used for the least-frequently used resource,
where cost of manual pointer adjustments would be more easily amortized.</p>
<h1 id="the-details">The Details</h1>
<p>Performance of 65816 machine code can depend on a number of factors.
To establish a baseline for comparison, I make some assumptions about the runtime environment.
All code discussed herein assumes that we’re compiling a 16-bit dialect of Forth,
that the Forth dictionary and its stacks reside in the first 64KiB of memory,
that the direct page base register is aligned to a 256-byte boundary,
that the CPU is in native-mode, and
that all registers are in 16-bit mode.
This allows the processor to run the Forth code in the fewest number of cycles.</p>
<p>Regardless of which register allotment is used,
a null subroutine costs only 12 clock cycles:
the JSR to invoke it, and a single RTS to return from it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LISTING 2.
jsr enter_null
....
enter_null:
rts
</code></pre></div></div>
<p>However, this isn’t a useful metric, since no work is achieved.
Thus, we must look first to the smallest unit of useful work in the VM: the <em>primitive</em>.
Listing 1, which you’ve seen before, illustrates a simple, 16-bit addition primitive using conventional subroutine threading register assignments.
Recall that it took 33 cycles to run, including subroutine call overhead imposed by the processor itself.</p>
<p>Consider now a subroutine threaded implementation with the roles of X and S reversed.
Since the <code class="language-plaintext highlighter-rouge">jsr</code> and <code class="language-plaintext highlighter-rouge">rts</code> instructions use the built-in hardware stack for accessing return addresses,
and since we’re now using that stack for data instead of return addresses,
we need to introduce prolog and epilog code with each subroutine
to ferry return addresses between the data and return stacks.</p>
<p>We start with primitives because they represent the simplest virtual machine unit of functionality.
They know ahead of time that they won’t be calling other subroutines,
and so can take some liberties with their implementation to optimize for performance.
Let’s re-implement our addition primitive; but, with S and X register roles swapped:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LISTING 3.
jsr enter_add ; 6 cycles
...
enter_add:
ply ; 5 cycles
pla ; 5 cycles
clc ; 2 cycles
adc 1,s ; 5 cycles
sta 1,s ; 5 cycles
phy ; 4 cycles
rts ; 6 cycles
</code></pre></div></div>
<p>In this case, since we don’t need it for anything else at the moment,
we use the Y register to cache the return address temporarily while the addition takes place.
This implementation takes 38 clock cycles to complete.
Compared with the code in listing 1 at 33 clock cycles,
this implementation takes a 15% performance hit.</p>
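The same tally for Listing 3 shows where the extra cycles go; a short Python check (cycle counts from the listing's comments):

```python
# Listing 3: jsr, ply, pla, clc, adc 1,s, sta 1,s, phy, rts
swapped = sum([6, 5, 5, 2, 5, 5, 4, 6])
conventional = 33                       # Listing 1's total
print(swapped)                          # 38 cycles
print(round((swapped / conventional - 1) * 100, 1))  # ~15.2% slower
```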
<p>Not all operations in a program consist of strings of primitives, however.
Forth software development practices, for instance, favor highly factored code.
Words can call out to other words, which call other words, and so forth, until primitives are encountered at the leaves of the call tree.
(This all assumes a non-optimizing compiler.)</p>
<p>We cannot just pop the return address into a register and hope for the best,
since a subsequent level of nesting would destroy the cached return address.
This is where the software-managed return stack comes into play.
For example, let’s say we want to write a word that adds four numbers together.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: sum4 ( n1 n2 n3 n4 -- n ) + + + ;
</code></pre></div></div>
<p>This might be compiled as follows using a conventional subroutine-threaded compiler:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LISTING 4.
jsr enter_sum4 ; 6 cycles
...
enter_sum4:
jsr enter_add ; 33 cycles
jsr enter_add ; 33 cycles
jsr enter_add ; 33 cycles
rts ; 6 cycles
</code></pre></div></div>
<p>Total execution time for this fragment of code is 111 clock cycles.</p>
<p>If we swap the roles of X and S,
we find the resulting fragment must take the following form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LISTING 5.
jsr enter_sum4 ; 6 cycles
...
enter_sum4:
pla ; 5 cycles
sta 0,x ; 5 cycles
dex ; 2 cycles
dex ; 2 cycles
jsr enter_add ; 38 cycles
jsr enter_add ; 38 cycles
jsr enter_add ; 38 cycles
inx ; 2 cycles
inx ; 2 cycles
lda 0,x ; 5 cycles
pha ; 4 cycles
rts ; 6 cycles
</code></pre></div></div>
<p>This comes to a total run-time of 153 cycles, a 37% performance hit.</p>
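Both totals, and the ratio between them, can be reproduced from the listings' cycle comments; a quick Python sketch:

```python
# Listing 4: jsr + three 33-cycle additions + rts
conventional = 6 + 3 * 33 + 6
# Listing 5: jsr + prolog (pla, sta 0,x, dex, dex) + three 38-cycle
# additions + epilog (inx, inx, lda 0,x, pha) + rts
prolog = 5 + 5 + 2 + 2
epilog = 2 + 2 + 5 + 4
swapped = 6 + prolog + 3 * 38 + epilog + 6
print(conventional, swapped)                         # 111 153
print(round((swapped / conventional - 1) * 100, 1))  # ~37.8% slower
```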
<p>Interestingly, while the performance penalty grows with nesting depth, it does not compound without bound;
instead, it appears to approach a fixed asymptote as the call tree deepens.
If we go one level of abstraction higher, in an attempt to sum 10 numbers instead of just 4:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: sum10 ( n1 .. n10 -- n ) sum4 sum4 sum4 ;
</code></pre></div></div>
<p>we can compare the two register allotment methods.
First, conventional subroutine threading, which comes in at 345 cycles.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LISTING 6.
jsr enter_sum10 ; 6 cycles
...
enter_sum10:
jsr enter_sum4 ; 111 cycles
jsr enter_sum4 ; 111 cycles
jsr enter_sum4 ; 111 cycles
rts ; 6 cycles
</code></pre></div></div>
<p>Now, let’s examine the swapped register allotment approach:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LISTING 7.
jsr enter_sum10 ; 6 cycles
...
enter_sum10:
pla ; 5 cycles
sta 0,x ; 5 cycles
dex ; 2 cycles
dex ; 2 cycles
jsr enter_sum4 ; 153 cycles
jsr enter_sum4 ; 153 cycles
jsr enter_sum4 ; 153 cycles
inx ; 2 cycles
inx ; 2 cycles
lda 0,x ; 5 cycles
pha ; 4 cycles
rts ; 6 cycles
</code></pre></div></div>
<p>This version takes 498 clock cycles to complete, a 44% performance regression relative to the 345-cycle conventional version.</p>
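To see how the penalty scales beyond three levels, the pattern in the listings can be iterated as a recurrence: each word costs one JSR/RTS pair plus three calls to the level below, with the swapped allotment paying an additional 27-cycle prolog/epilog per word. A Python sketch, assuming (as the listings do) exactly three calls per word:

```python
def costs(max_depth):
    """Yield (depth, conventional, swapped) invocation costs in cycles."""
    conv, swap = 33, 38                  # Listing 1 vs. Listing 3 primitives
    for depth in range(1, max_depth + 1):
        yield depth, conv, swap
        conv = 6 + 3 * conv + 6          # jsr + three calls + rts
        swap = 6 + 27 + 3 * swap + 6     # ...plus the 27-cycle prolog/epilog

for depth, conv, swap in costs(6):
    print(depth, conv, swap, f"{swap / conv - 1:.1%}")
# Depth 2 reproduces Listings 4 and 5 (111 vs. 153 cycles);
# depth 3 reproduces Listings 6 and 7 (345 vs. 498 cycles).
```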
<p>You’re probably wondering about simply swapping stack pointers and pushing and popping accordingly.
It is an approach that works wonderfully on the Intel 8086 architecture, for example.
However, the 65816 lacks a swap-stack-pointer instruction, so this approach is not cost effective.
The closest we can do is synthesize the effect using a series of register-to-register transfer instructions.
For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LISTING 8.
jsr enter_sum4 ; 6 cycles
...
enter_sum4:
ply ; 5 cycles
tsa ; 2 cycles
txs ; 2 cycles
phy ; 4 cycles
tsx ; 2 cycles
tas ; 2 cycles
jsr enter_add ; 38 cycles
jsr enter_add ; 38 cycles
jsr enter_add ; 38 cycles
tsa ; 2 cycles
txs ; 2 cycles
ply ; 5 cycles
tsx ; 2 cycles
tas ; 2 cycles
phy ; 4 cycles
rts ; 6 cycles
</code></pre></div></div>
<p>This code would take 160 clock cycles to run, almost 4.6% slower than the code in listing 5.</p>
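Listing 8's total can be verified the same way, again using the cycle counts from its comments:

```python
# jsr, then prolog: ply, tsa, txs, phy, tsx, tas
prolog = [6, 5, 2, 2, 4, 2, 2]
calls = [38, 38, 38]                 # three enter_add invocations
# epilog: tsa, txs, ply, tsx, tas, phy, then rts
epilog = [2, 2, 5, 2, 2, 4, 6]
total = sum(prolog + calls + epilog)
print(total)                               # 160 cycles
print(round((total / 153 - 1) * 100, 1))   # ~4.6% slower than Listing 5
```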
<h1 id="discussion">Discussion</h1>
<p>So, why is the counter-intuitive approach of placing the data stack pointer in X so much faster?
I’ve identified two reasons which interact with each other.</p>
<ol>
<li>When the S register is used to point into the return stack, prolog and epilog code become virtually unnecessary. With S and X’s roles swapped, the smallest prolog and epilog achievable, that of a <code class="language-plaintext highlighter-rouge">ply</code> and <code class="language-plaintext highlighter-rouge">phy</code> pair found in most primitives, contributes 9 cycles to <em>every</em> primitive call. Higher-level words are demonstrated to have steeper costs. With S pointing into the return stack, however, <em>all</em> of these costs are saved.</li>
<li>When the X register is used to point into the data stack, the <em>dp,X</em> addressing mode may be used to elide unnecessary pointer adjustments until the very end of the subroutine. For example, notice how in listing 1 we do not adjust the X register until the very end of the primitive.</li>
</ol>
<p>This is essentially the same phenomenon that can be observed with contemporary RISC architecture processors,
where prolog and epilog code consists of a single adjustment to a stack pointer,
followed by repeated reads or writes into memory relative to the adjusted stack pointer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>addi sp,sp,-16
sw s0,0(sp)
sw s1,4(sp)
sw s2,8(sp)
sw ra,12(sp)
; productive computation here
lw ra,12(sp)
lw s2,8(sp)
lw s1,4(sp)
lw s0,0(sp)
addi sp,sp,16
jalr x0,0(ra)
</code></pre></div></div>
<p>You’d think that it’d take longer to run such code,
but in fact it doesn’t, because it has the knock-on effect of freeing the interior of the procedure of all bookkeeping responsibilities.
This amortizes the cost of stack management over the cost of the procedure’s more productive computational task.</p>
<h1 id="related-work">Related Work</h1>
<p>Subroutine threading isn’t the only approach to executing a stack-based virtual machine.
Indirect, direct, and token threaded code representations also exist,
and have been successfully used to implement a variety of different languages for constrained systems. [2]
However, for the 65816 processor, none of these alternative approaches can come close to subroutine threading for performance.
What they lack in performance, however, they make up for in code compactness, which explains their popularity on smaller systems.
You can learn more about these alternative techniques in [2].</p>
<h1 id="conclusion">Conclusion</h1>
<p>Counter-intuitively,
the conventional approach to implementing subroutine threaded code on the 65816 processor is, indeed, the superior approach.
It works because it consumes fewer cycles to accomplish the same computations,
thanks to fewer instructions executed both between subroutines and within them.
Although this study is not exhaustive for all types of software,
it is believed to generalize to the entire virtual machine,
as more sophisticated software is hierarchically composed of these lower-level primitives.</p>
<h1 id="references">References</h1>
<p>[1] Eyes, David and Ron Lichty. <em>Programming the 65816</em>. 1986. ISBN 0-89303-789-3.</p>
<p>[2] Rodriguez, Brad. <em>Moving Forth</em>. Accessed 2021 Aug 17. <a href="http://www.bradrodriguez.com/papers/moving1.htm">http://www.bradrodriguez.com/papers/moving1.htm</a></p>
<h1 id="plan-9-shower-thoughts">Plan 9 Shower Thoughts (2020-08-19)</h1>
<p>As I woke up this morning, I had one of those random thoughts which makes me think, “Huh, why not?” We have a number of “tiny Unixes” for 8-bit processors of <a href="http://www.fuzix.org/">all</a> <a href="http://lng.sourceforge.net/">ilk</a>. While far from POSIX compliant, they all are perhaps surprisingly faithful to early Unix capabilities. So, this got me thinking: what would a “tiny Plan 9” look like? I make NO pretense of compatibility with real Plan 9, any more than Lunix is compatible with Linux. But, even so, it is a fun thought experiment.</p>
<p>For now, I’m considering only smallish 8-bit systems like the RC2014 or Commodore 64. One of the first and most important tasks for any kernel is memory management. Plan 9 has a tiny set of system calls for managing memory; however, they still assume a page-based MMU. <code class="language-plaintext highlighter-rouge">segattach(2)</code> and <code class="language-plaintext highlighter-rouge">segdetach(2)</code> can be shoehorned into a flat, single address space under very tightly controlled circumstances. However, because of the need for scatter-loading binaries, <code class="language-plaintext highlighter-rouge">brk(2)</code>, <code class="language-plaintext highlighter-rouge">sbrk(2)</code>, and <code class="language-plaintext highlighter-rouge">segfree(2)</code> are out of the picture entirely. (For those wondering, segattach(2) would need the virtual address parameter, <code class="language-plaintext highlighter-rouge">va</code>, to be set to zero at all times. Further, the <code class="language-plaintext highlighter-rouge">attr</code> input would be ignored.)</p>
<p>The next task for the kernel is managing running programs. Nearly all early 8-bit CPUs are built with absolute memory references in mind; the idea of pointer-relative addressing, much less PC-relative, just wasn’t “a thing” back in the 70s. Thus, the kernel will need to load programs using a relocatable binary file format (see my article on my Kestrel blog, <a href="http://kestrelcomputer.github.io/kestrel/2018/02/01/on-elf-2">On ELF, Part 2</a>). And, once loaded, those binaries are not moving.</p>
<p>Since this kind of <em>scatter loading</em> will all happen in a single address space, <code class="language-plaintext highlighter-rouge">rfork(2)</code> and <code class="language-plaintext highlighter-rouge">exec(2)</code> are similarly out of the picture. An alternative method of launching programs will be necessary. Also, <code class="language-plaintext highlighter-rouge">segattach(2)</code> and <code class="language-plaintext highlighter-rouge">segdetach(2)</code> will need to be vigilant about minimizing address space fragmentation. Quite doable, but at some cost in implementation complexity.</p>
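To make the fragmentation concern concrete, here is a toy first-fit allocator for a single flat address space. This is purely illustrative Python, not Plan 9 code; the class name and the policy are invented here:

```python
class FlatSpace:
    """Toy first-fit allocator for one flat address space.

    Illustrative only: the names and policy are invented, not Plan 9's.
    """
    def __init__(self, size):
        self.free = [(0, size)]               # sorted (base, length) holes

    def segattach(self, length):
        """Return the base of the first hole big enough, else None."""
        for i, (base, hole) in enumerate(self.free):
            if hole >= length:
                if hole == length:
                    del self.free[i]          # hole consumed exactly
                else:
                    self.free[i] = (base + length, hole - length)
                return base
        return None                           # space too fragmented

    def segdetach(self, base, length):
        """Return a segment; this naive version does not coalesce holes."""
        self.free.append((base, length))
        self.free.sort()
```

A real kernel would coalesce adjacent holes in <code class="language-plaintext highlighter-rouge">segdetach</code>; without coalescing, repeated attach/detach traffic splinters the address space, which is exactly the hazard noted above.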
<p>All the other system calls should be doable, however. IPC <em>could</em> be handled via anonymous pipes, just like it’s done in Plan 9 currently (pipe buffers would need to be smaller than 4KB though!). If we were to stop here, I’d reckon you’d need a minimum of a 96KB RAM system to run everything (32KB for pipe buffers, 64KB for kernel and small set of processes) to make a halfway usable system. Obviously, more RAM is better; <em>most</em> of your overhead will be in IPC buffering. A C128 could be a good platform for this configuration.</p>
<p>However, if we’re willing to bend what it means to be a “Plan 9”, it might be more appropriate to use a system of callbacks instead, since these don’t use any buffers beyond what the client prepares ahead of time. A system similar to VMS’ Asynchronous System Traps or L4’s synchronous message passing might perhaps be used to implement this. But, now we’re getting deeper into microkernel territory.</p>
<p>So, for example, if process A calls open(“/aMountPt/foo”), it’d result in a system call which looks up /aMountPt/foo, sees that it doesn’t exist, then tries /aMountPt, sees that it <em>does</em> exist, and schedules a call to that mount point’s “open” callback. It must be scheduled since it must run in the context of another process. You could use a system of “migrating threads”, but now we’re veering well off both the Plan 9 AND microkernel paths, and that approach has a whole bunch of its own problems.</p>
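The lookup sequence described above amounts to a longest-prefix match over a mount table, after which the kernel schedules the winning mount point's callback. A minimal sketch in Python, with all names hypothetical:

```python
mounts = {"/aMountPt": "fs_server_A", "/": "rootfs"}   # mount point -> handler

def resolve(path):
    """Pick the mount whose prefix covers the most of `path`.

    Returns (handler, remainder) so the kernel can schedule that mount
    point's "open" callback in the serving process's context.
    """
    best = max((m for m in mounts
                if path == m or path.startswith(m.rstrip("/") + "/")),
               key=len)
    return mounts[best], path[len(best):].lstrip("/")

print(resolve("/aMountPt/foo"))   # ('fs_server_A', 'foo')
```

Here /aMountPt/foo fails as an exact entry, but /aMountPt matches as a prefix, so that mount's handler is chosen with “foo” as the remainder.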
<p>For systems with more advanced memory management, such as a 65816-based system with a PMMU driving the ABORT# pin, or a dual-CPU system where one CPU is the “user CPU” and the other the “supervisor CPU”, it’s possible that <code class="language-plaintext highlighter-rouge">rfork(2)</code>, <code class="language-plaintext highlighter-rouge">exec(2)</code>, and the more complete set of segment functionality can be made available again. Since pages are not very granular, however, you’d probably need a system with at least 1MB to 2MB of RAM before you end up with a usable foundation.</p>
<p>So, can a tiny Plan 9 be constructed for small systems? I don’t know for sure; I’ve never actually tried. However, I <em>think</em> it can as long as you’re willing to let go of those system facilities which have a hard dependency upon a PMMU. Such a system would look like, at a minimum, a Commodore 64 equipped with an off-line bank of RAM for use as pipe buffers. More likely than not, though, since not every system has readily available RAM expanders like the Commodore REU, you’d also need to let go of the use of pipes altogether as your go-to choice for IPC, and start relying on asynchronous system traps or upon microkernel-like synchronous message queues as your IPC mechanism.</p>
<h1 id="black-boxes-and-magic">Black Boxes and Magic (2019-10-01)</h1>
<p>I recently read an article online titled, <a href="https://queue.acm.org/detail.cfm?id=1016991">First, Do No Harm: A Hippocratic Oath for Software Developers.</a> I recommend reading this article; it’s fantastic. I’d like to focus on one aspect that really stood out to me, though.</p>
<p>In it, the author talks about the nature of abstractions as a synonym for reusable code. While not specifically dismissing them as useless, he raises an excellent point, one which I myself have been rather <em>vociferously</em> accused of being an inadequate developer for supporting. With respect to reusable software’s increasing tendency to seem like “black boxes and magic”, the author writes, “I am not a Luddite, but my fear – based on observations of hundreds of practitioners – is that we adopt these aforementioned technologies without fully understanding what they do or how they do it.”</p>
<p>Now, <em>I</em> am a nobody. I’m a college drop-out, uncredentialed in just about anything except having a driver’s license, and just kind of float from job to job. The author, however, is a Ph.D. at a prestigious university with books and articles out the wazoo to his name. Yet, we’ve both been saying the same thing, him from observing literally hundreds of other practitioners, and myself from personal, first-hand experience. So, maybe just this once, give what I have to say in this article a little credence before accusing me of inadequacy?</p>
<p>I go even further and suggest that this age of abstractions <em>proactively discourages</em> a practitioner from understanding them. How <em>can</em> you successfully apply an abstraction if you don’t have any intuition about its applicability in the first place? That is, after all, what understanding is all about.</p>
<p>One fallacious analogy people often make is, “You don’t need to understand how a car works to drive a car.” This is a fantastic metaphor for abstractions and black boxes, because unknown to those who frequently resort to it, it works both ways. Yes, the statement is entirely correct; however, if your car blows a tire or it starts leaking oil, you’ll be right screwed. This is the automotive equivalent of having some on-call IT developer’s pager that goes off at 3AM exactly 24 hours into their weekend. Do you know enough to limp the car to the nearest garage? Or, will you just sit in the middle of rush-hour traffic (if you’re lucky enough to have this happen in a city with decent infrastructure!), waiting for a response to your call to 2nd-level support/your nearest towing company? If you’re a programmer, in any capacity, I’d expect you to have <em>some</em> facility with how a computer actually works, how the software stack that sits on top of it works, and the resulting interactions between the two. Just as if you’re planning on going on a road trip out into the desert, I <em>expect</em> you to carry enough provisions not just for you, but for your <em>car</em>, to limp yourself to a garage if trouble happens. It’s just common sense. 24-hour road service doesn’t mean they’ll get to you right away; it <em>only</em> means you can call 24-hours a day to get help.</p>
<p>This is one of a couple of reasons why I prefer Forth as it is practiced by Chuck Moore – everything is in your face, and there is a very close correspondence between your source code and what gets executed by the computer. Even when writing code that relies heavily upon abstractions and what looks like black magic, you can search through the source code to find the meaning of a definition, study it, and create your own if it’s not suitable for your needs. You can <em>reliably</em> study the existing code to learn why it doesn’t work, and inform your superiors not only why things are broken, but how long it’ll take to fix (assuming you have that experience, of course). In fact, writing your own implementation of <em>everything</em> you come to depend upon is encouraged by Chuck Moore, from a simple multiplication routine to your own development tools (possibly including Forth itself!), depending upon context of course. As a result, I think some people early in the Forth community’s history have misunderstood Moore by taking his suggestions too literally, and let the pendulum swing too far the other way from black boxes and magic, to the point where writing your own Forth is seen as something of a rite of passage. I know, because I went through this phase myself. I now know that this is not at all what Moore was trying to get at. Years later, Moore lamented publicly that too many people are playing games with their Forth implementations and are not busy writing real-world applications.</p>
<p>The point Moore was trying to make is so frequently missed, it’s almost embarrassing. It’s simply this: you’re a technician; and, like most technicians, you need a set of tools on which you can rely. By rely, I mean intuitively <em>know</em> what solution is the right solution for a given job. When is a hammer better than a mallet? When is a crescent wrench better than a hex driver? In most cases, you can reasonably substitute one for the other; but, not always! You don’t always have to make your own hammer from scratch; but, if you’ve ever used a block of wood to diffuse the blow of a hammer, you basically re-invented your own, purpose-built, task-optimized mallet. That kind of resourcefulness, based on your understanding of first principles, is the point Moore was trying to get at. His goal was to encourage the development and understanding of first principles.</p>
<p>You can’t do that with software unless you have a <em>profound</em> understanding of that very same software, making this a grotesque chicken and egg situation. That means not just having ready access to the source code, but having it in an easy to understand, easy to adjust form. If necessary, how-to-hack and how-this-thing-works documentation would be required. The value proposition of open source is severely undercut if the source listing to a component you depend heavily upon but need to change or better understand consists of tens of thousands of lines of code strewn about hundreds or thousands of source files with no statically determinable path of understanding how one module relates to another. If your editing experience requires more than 8 files open concurrently to understand an interface or to implement a new feature, you <em>might</em> want to ask yourself if there’s a better way of organizing the code you’re working on. I would even argue that <em>four</em> is a more reasonable limit to strive for.</p>
<p>In my opinion, informed from personal experience, this is why I tend to be more successful with code written in C and Forth than I am with code written in Python or Smalltalk; all the IDE black magic in the world falls on its face with a goopy <em>splat!</em> the moment you introduce heavy reliance upon polymorphism into the mix. Trying to figure out how programs which rely heavily upon polymorphism works is a nigh impossible task for me.</p>
<p>Moore is often proud of having written his own multiplication and square root programs in Forth. But, did he write this code for every single platform he developed on? Turns out, <em>not exactly</em>. He kept a <em>personal library</em> of all the reusable concepts and code he’d written himself over the years. With each new project, he’d contribute new code to it for future reference. If an earlier routine needed adjustment for his new target application and/or platform, then he would do so.</p>
<p>Forth isn’t just a language; it’s an entire <em>way</em> of writing software. It’s how we <em>approach</em> the profession; the entire mindset of the developer. These things simply <em>cannot</em> be meaningfully standardized, no matter how much ANSI or Forth 2K folks want it to be. I mean, despite the existence of these standards, the aphorism “If you’ve seen one Forth, you’ve seen one Forth” still very much applies. Worse, it frequently applies across different <em>versions</em> of the same product. GForth 0.4.0 32-bit is a very different creature from 0.7.0 64-bit. Meanwhile a good Forth programmer is at home in <em>any</em> Forth environment; not because he can depend on a standardized vocabulary (though that helps), but because he understands how the Forth environment <em>as a tool</em> works, and its relationship to the underlying machine. <strong>Yes, I might have to hand-type code in that might be reusable elsewhere</strong> instead of just linking against it. However, that takes much less time in practice than what’s spent literally <em>shopping</em> for the right solution on the Internet, reading its frequently incomplete and even wrong documentation on how to install it on your platform, writing some test code to familiarize yourself with its API, kicking off debuggers when things go horribly wrong (because what should have been a library is really a framework), etc. before you can successfully deploy it in staging, much less in production.</p>
<h1 id="thoughts-on-forth">Some Thoughts on Forth vis-a-vis Oracle and Java SE (2016-12-18)</h1>
<p>This whole Java SE bullshido that Oracle is all up on their high horse about
has made me think about why I really enjoy Forth.
I’d like to lay down some of my thoughts before I forget them,
because I think it’s worth preserving and discussing.</p>
<p>One of the reasons why Forth has suffered in the greater computing community is,
“If you’ve seen one Forth, you’ve seen one Forth.”
Coming about in an age before the Internet, and with such a huge variety of platforms then available,
standardization proved difficult and, to some, even undesirable.
Yet, in light of Oracle’s brutish behavior concerning Java SE,
I find Forth, and what little remains of its once-prominent ecosystem, <em>comforting.</em>
I’m going to argue that its unbridled freedom of choice is actually its core strength and saving grace.
I’m going to go on record saying that
Forth will outlive Java, even if it does so in relative obscurity.</p>
<p>Forth has been used for everything ranging from stand-up video arcade games to aerospace applications, and
literally and figuratively, everything in between.
<a href="https://www.forth.com/resources/forth-apps/">This page</a>
lists some of its current terrestrial and extra-terrestrial applications.
Yet, this small but impressive portfolio exists despite the lack of standard libraries or modularity constructs of any kind.
How is this even possible, and why haven’t these things evolved over the 46 years Forth’s been alive?
Who is actually in <em>charge</em> of Forth? What does it <em>mean</em> for something to be called Forth at all?</p>
<p>Honestly, we Forth programmers don’t know the answers either. Here’s what I do know.</p>
<p>Forth encourages libertarian thinking about programming,
its own language definition,
and especially its runtime environment.
Forth gestated when you carried your own deck of punch cards for commonly used library routines from mainframe to mainframe,
and of course, every mainframe was different, even those from the same vendor.
IBM 1401, 7094, and System/360 systems all used unique operating systems, filing systems, and calling conventions.
Then there were Univacs, and Honeywells, and Burroughs, and a whole litany of other computers, all special snowflakes of their own.
Considering that each of these systems typically had an average of three competing operating systems at any given time,
it’s only natural that one uniquely identifiable tenet of Forth evolved:
“Reduce your dependencies at all costs.”
You might hear of several pithy alternative quotes today:
“Identify and remove non-problems” is perhaps the most recent incantation;
but, you’ll also hear about “Write your own subroutines” quite frequently too.
This philosophy enabled easy portability of the language not only between mainframes,
but also between minicomputers, and eventually, 8-bit home computers as well.</p>
<p>With all these different platforms,
you’d expect Forth to be this grand abstraction that hides every detail of a computer from the programmer.
Somewhat counter-intuitively, this is not the case.
Forth also embraces diversity.
Forth implementations seek to rationally expose, not abstract, the unique hardware features of the computer it runs on.
This is not to say that abstractions are universally frowned upon;
Forth implementations are not exokernels.
Yet, as a general rule,
abstractions are treated as adaptation layers that sit between your desired application and the surrounding environment.
Often, these layers are bundled with the <em>application</em>, and not provided as a standard feature of the language.</p>
<p>Put more plainly,
<em>the Forth environment trusts the application to do the right thing, but the application never trusts the Forth environment.</em>
This differs starkly from the usual view where
the application trusts the runtime or operating system,
but the runtime environment or operating system does not trust the application.</p>
<p>When you have total freedom over the programming landscape like this, won’t chaos ensue?
Yes and no.
It’s quite clear that without some flavor of standardization,
you cannot take a program for one Forth and run it as-is on another.
Some standards do exist:</p>
<ol>
<li>Forth depends on a two-stack environment: a data stack for evaluation, and a return stack to hold continuations.</li>
<li>Forth depends on a dictionary that can grow or shrink, logically similar to a stack, if not in fact.</li>
<li>You define new words with <strong>:</strong> (colon), and terminate their definitions with <strong>;</strong> (semicolon).</li>
<li>You can define words in such a way that they execute at compile-time (“immediate” words) as well as run-time.</li>
<li>You can manually switch from compile-mode to interpret-mode and back again (e.g., with <strong>[</strong> and <strong>]</strong>).</li>
<li>Some method of off-line storage is provided, such that it allows you to (at run-time) swap in new code at will.</li>
<li>Some method exists of recycling space in the dictionary.</li>
</ol>
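<p>For the skeptical programmer, the seven attributes above can be illustrated with a toy sketch. This is <em>not</em> any real Forth — every name in it is invented for illustration — but it shows the two-stack, growable-dictionary model in a handful of lines:</p>

```python
# Toy sketch (not any real Forth) of the two-stack model listed above:
# a data stack for evaluation, a return stack for continuations, and a
# dictionary of words that grows as definitions are added.

class ToyForth:
    def __init__(self):
        self.data = []      # data stack: operands and results
        self.rstack = []    # return stack: continuations
        self.words = {}     # dictionary: word name -> behavior

    def define(self, name, fn):
        self.words[name] = fn

    def colon(self, name, body):
        """':' definition — a word that runs a sequence of other words."""
        def run():
            self.rstack.append(name)     # remember where to "return" to
            for w in body:
                self.execute(w)
            self.rstack.pop()
        self.words[name] = run

    def execute(self, token):
        if token in self.words:
            self.words[token]()
        else:
            self.data.append(int(token))  # unrecognized tokens are numbers

vm = ToyForth()
vm.define("+", lambda: vm.data.append(vm.data.pop() + vm.data.pop()))
vm.define("dup", lambda: vm.data.append(vm.data[-1]))
vm.colon("double", ["dup", "+"])          # like  : double  dup + ;

for tok in "21 double".split():
    vm.execute(tok)
print(vm.data)  # -> [42]
```

<p>The immediate-word and off-line storage attributes are omitted for brevity, but they hang off the same dictionary mechanism.</p>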
<p>Note the specific absence of <em>detailed methods</em> of accomplishing or implementing these attributes.
A typical C programmer will look at this and say, “That’s nice, but without more detail, you can’t easily port a program.”
That’s because this programmer is not used to how a Forth environment works.</p>
<p>As someone with first-hand experience porting non-ANSI Forth programs to ANSI Forth and vice versa,
porting software is simply not as big of a problem as it is for, say, a system built around a typical C toolchain.
Forth translates directly from source code to binary with no intervening link editing steps,
and thus concerns of <em>binary compatibility</em> generally don’t arise.
Combined with the fact that most Forth systems make use of a
<a href="http://wiki.c2.com/?HyperStaticGlobalEnvironment">hyper-static global environment,</a>
it’s often more than sufficient to just introduce a <em>compatibility layer</em> ahead of the program you actually want to run.
In other words, it’s often easy to “virtualize” a figForth environment on an ANSI Forth system,
and it’s often as easy to “virtualize” a sufficient subset of an ANS Forth environment on eForth 1.0.
There are, of course, always exceptions.
If your Forth environment lacks the ANSI Forth WORD-LIST facility,
but the application you want to run uses those features extensively,
you’ll need to find a way to emulate them.
This takes a lot more work than simply writing wrappers around words with different names or permuted inputs or outputs.
But, honestly, this sort of thing happens only very rarely.
It’s usually cheaper to modify the application itself so as to not require that dependency in the first place.
To this day,
a majority of the Forth programming community is able to research and utilize Forth contributions
written long before ANSI Forth ever existed,
going as far back as Forth-79 even,
on contemporary Forth deployments with only minor adjustments.</p>
<p>I invite you to contrast this user experience with that typically found in a POSIX environment.
If you’ve ever sat next to an ops or site reliability engineer
(both of which I’ll collectively lump together as SREs in this article),
you’ll see the surface benefits of standardization right away:
people using off-the-shelf software components to accomplish a specific company mandate, and
using generic, off-the-shelf automation tools to bind them together in a “deployment”.
Huzzah for software reuse!</p>
<p>What’s not made manifest is the months of learning that went into making that happen.
Consider how much an SRE costs a typical company, now consider how many SREs exist in that company.
Few ever consider that the very <em>need</em> for an SRE at all is a cost of standardization.
Just look at the tools the SRE is forced to work with,
and read up on their documentation, if any such documentation exists at all.
Look how <em>much</em> documentation exists,
and how much of it must be understood before being able to competently use the described tool.
Look on any typical Apache Foundation or Github repository for such tools,
and by reading its front matter, answer me these questions:</p>
<ol>
<li>What problem does this tool actually solve?</li>
<li>Why is this a problem in the first place?</li>
<li>How does this tool actually solve that problem?</li>
<li>Why should I care? How does this tool compare with other (known) tools?</li>
</ol>
<p>If you cannot answer <em>all four</em> of these questions from reading the front-matter of a tool’s documentation,
I guarantee you’ll be spending months to years learning about the tool one way or another,
only to find that it makes increasingly less sense to continue using.
I’ve seen this happen time and time and time again in commercial industry.
After all that SRE investment, you’re stuck with it.
As the cold, hard reality sets in,
the original engineers responsible for the decision typically bail for greener pastures in a new start-up or project.
That’s where you now come in, and realize the gravity of the mess you’re stuck with,
and you start to rationalize it as “guaranteed employment.”
Tell me this hasn’t happened to you even once.</p>
<p>Don’t get me started on conferences entirely dedicated to these tools, either.
Seriously: if they’re that easy to use,
that learnable,
that reliable,
and provide that much value to their clients,
why do you need conferences to support the community?
Think carefully about that the next time you attend Hadoop Summit or DockerCon.
Pay <em>close</em> attention to what’s really happening at these conferences.
It’s not about you or developing your knowledge; those are secondary effects at best.
It’s all about big business:
the vendors, the workshops with said vendors talking about stuff they’re working on and/or selling today, etc.
It’s not about you at all, and it sure as hell isn’t about your customers either.</p>
<p>Major Forth applications are written with the operational demands of their users in mind,
requiring a significantly reduced need for administration overhead.
This allows the developers and the businesses alike to just get on with what they both do best.
Generalized solutions are written for a generalized, platonic, non-existent organization,
and you, the SRE, are left holding the bag in an attempt to get things to work per your organization’s business rules.
You’re not developing. You’re not creating. You’re duct-taping.</p>
<p>Forth developers generally believe standards bind much more strongly to the communications between modules
(on-disk formats, line encodings, etc.)
than to the modules themselves.
What happens inside the computer <em>stays</em> inside the computer, unless there’s a documented need to know otherwise.
If you’re skeptical that this model works, you haven’t been paying attention.
Consider the Internet itself, built on the compatibility guarantees promised by “RFCs”.
This web page you’re reading right now
is served from a computer very different from your own,
with a different operating system,
and is connected via infrastructure which doesn’t even use the same kinds of microprocessor technology.
If you’re the kind of person who believes in the sanctity of IETF standards and how they support the Internet,
then you have a good understanding already of how a typical Forth programmer views the world.
It’s not uncommon to find a Forth programmer caring about how a module is implemented
in direct proportion to how broken its interface is.
We don’t want APIs. We want standards. We’ll take APIs if that’s all we’re given, but we much prefer to make our own.</p>
<p>That gives us freedom:
freedom to create,
freedom to learn and understand the nature of the problem being solved,
and perhaps most importantly,
the freedom to use our software as we see fit without legal interference from some obsolete has-been in the computer industry.
Remember how SCO’s final days went down?</p>
SPI: You're Doing It Wrong2016-11-28T00:00:00+00:00http://sam-falvo.github.io/2016/11/28/spi-youre-doing-it-wrong<h1 id="spi-youre-doing-it-wrong">SPI: You’re Doing It Wrong</h1>
<p>Are you planning on using a serial interconnect to replace a large number of wires
with a smaller, more manageable number in a custom peripheral or microcontroller project?
If so, you were probably about to do it completely wrong.</p>
<p>Oh, don’t get <em>me</em> wrong;
it probably would have been right enough to work for your needs.
However, I still encourage you to think carefully about the <em>semantics</em> of such an interface.
As soon as you need to multiplex an amount of data greater than the number of wires you have available to you,
you introduce the need for (semi-)intelligent endpoints, and a protocol that these endpoints must understand.
In this article, I’ll talk about SPI in particular; but,
be aware that it’s generally applicable to any narrow interconnect.</p>
<p>Based on my own research idly Googling for articles concerning SPI,
it seems like an awful lot of people run into situations where
<a href="https://www.google.com/search?client=ubuntu&channel=fs&q=SPI+locking+up&ie=utf-8&oe=utf-8">the SPI slave just locks up.</a>
We need to remember that
<a href="https://www.cs.cmu.edu/afs/cs/academic/class/15213-s13/www/lectures/23-concurrent-programming.pdf">concurrent programming is hard</a>,
and even worse, extremely difficult to prove correct even with formal and automated provers.
I have a hunch
that this is due to people fundamentally misunderstanding the role that SPI plays in their design.</p>
<h2 id="understand-the-role-of-spi">Understand the Role of SPI</h2>
<p>The whole point of your exercise is undoubtedly to save money in your embedded (or not so embedded) design.
Reducing the amount of wires in an interconnect not only reduces PCB space needed to route traffic,
but it also reduces the cost of the packages you need to place on the PCB as well.
In some cases, it even lets you transmit data faster than you could with a parallel bus due to secondary effects of
reducing capacitance, skew, and easier transmission line construction.
Copper traces on a printed circuit board do not fail except under extreme physical duress,
so how do you implement the SPI interconnect to be at least as robust as those wires you’re replacing?</p>
<p><strong>The interconnect should remain functional even if the peripheral it’s attached to is not.</strong></p>
<p>I can’t re-iterate this enough:
the purpose of SPI is <em>not</em> to just route message-oriented traffic to other chips.
You can do that with a wider bus just as well,
and do so faster and with less burden on the software in your controllers in the process.</p>
<p>Its <em>actual</em> purpose is to emulate the presence of a wider interconnect.
This is a fundamental principle which I see routinely violated,
both in simple hacked-together projects and, occasionally, in commercial products.</p>
<h2 id="recap-how-spi-works">Recap: How SPI Works</h2>
<p>SPI works on exactly two principles:</p>
<ol>
<li>
<p>SPI exposes a strict, master/slave relationship. When the master talks, the slave <em>must</em> listen. There can be no exception to this rule. Note the absence of feedback signals like interrupt requests, retries, bus errors, and the like. Some devices have them, but they’re not universally supported, and their meaning often changes from device to device. The only constant is MOSI, MISO, SS#, and CLK. That’s <em>all</em> you get to depend on.</p>
</li>
<li>
<p>SPI peripherals expect you to adhere to a strict request/response protocol. The slave can communicate only while the master is sending something. This means, after the master issues its command to the slave, the master must <a href="https://en.wikipedia.org/wiki/Busy_waiting">busy-wait</a> on the device while it’s busy formulating a response. As a performance optimization (which is really an accident of SPI’s implementation), a master <em>may</em> send another command while it’s receiving a reply to a former request, or it <em>may</em> even queue multiple commands if the slave allows for it. Again, not all slaves support such sophisticated protocols, and if one does, not all commands may be supported in a common manner. Consider: an SD card will allow a master to send a command to terminate an in-progress block-read operation, but how will it react if you try to queue a block-write command?</p>
</li>
</ol>
<p>The simplest possible SPI interconnection between a master and slave looks, schematically, something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Master Slave
+--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+--+--+
+--->| | | | | | | | |-------->| | | | | | | | |----+
| +--+--+--+--+--+--+--+--+ +--+--+--+--+--+--+--+--+ |
| |
+-------------------------------------------------------------------+
</code></pre></div></div>
<p>As bits are shifted out of the master, they’re shifted into the slave.
At the same time, bits from the slave are shifted into the master.
This arrangement allows for serial communications between peers
with about half of the resources needed by even the smallest UART.
Recall that a UART requires an independently functioning transmit and receive shift register,
so even the smallest of UARTs will require four registers between both peers.</p>
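<p>The exchange in the diagram can be simulated in a few lines. This sketch is illustrative only (the function name is invented), but it demonstrates the key property: after eight clocks, the master’s and slave’s shift registers have simply swapped contents.</p>

```python
# Simulation of the exchange in the diagram: on every clock, the
# master's MSB goes out on MOSI while the slave's MSB comes back on
# MISO, each shifting into the other end's LSB. After eight clocks the
# two 8-bit shift registers have swapped contents.

def spi_clock_8(master_reg, slave_reg):
    for _ in range(8):
        mosi = (master_reg >> 7) & 1               # bit leaving the master
        miso = (slave_reg >> 7) & 1                # bit leaving the slave
        master_reg = ((master_reg << 1) | miso) & 0xFF
        slave_reg = ((slave_reg << 1) | mosi) & 0xFF
    return master_reg, slave_reg

m, s = spi_clock_8(0xA5, 0x3C)
print(hex(m), hex(s))   # -> 0x3c 0xa5
```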
<p>When the slave has something to send to the master,
it loads its transceiver register with a byte, so that
the next time the master tries to send something, it receives that data.
The key word here is <strong>when</strong>;
what happens if your slave is busy waiting on something of its own?
It can be a while before it responds;
the master has no idea if the slave is OK during this time.
Indeed, the slave could well be <em>locked up completely</em>,
in which case the master will wait indefinitely (at least until it times out).</p>
<h2 id="isolation">Isolation</h2>
<p>Just as a big bunch of wires connecting a peripheral ought never to fail,
neither should the tiny bundle of wires that <em>replaces</em> said big bunch of wires.
SPI is a replacement for a big bunch of wires,
not for what travels over those wires.
That <em>directly implies</em> that whatever is responsible for handling the SPI slave interface must never fail, either.</p>
<h3 id="controller-based-isolation">Controller-based Isolation</h3>
<p>The simplest method to achieve this goal of autonomy
is to dedicate an entire microcontroller to the interconnect.
The programming on this microcontroller sits in a tight loop, constantly monitoring the SPI link for activity,
and upon receiving commands, formulates responses as quickly as possible.
The actual <em>function</em> of the peripheral is none of the SPI controller’s concern;
it’s merely a <em>relay</em> for reading and writing data, command and control, and telemetry.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------+   |   +-----+     +--------+
|        |<----->| SPI |<--->|        |
|        |   |   +-----+     |        |
|        |   |               |        |
+--------+   |               +--------+
  Master     |           Slave
</code></pre></div></div>
<p>This is where most projects that use SPI seem to fall down.
They tightly integrate reacting to SPI events into their project’s main loop.
This is dangerous, for precisely the same reasons that
<a href="https://en.wikipedia.org/wiki/Cooperative_multitasking">cooperative multitasking</a>
is dangerous on general purpose computers.
It takes only one bad actor to derail the project’s main event loop, and when it happens, you lose everything.</p>
<p>In contrast, a project with a dedicated interconnect controller has an additional level of safety.
Look at the failure modes of a project with a split controller:</p>
<table>
<thead>
<tr>
<th style="text-align: center">SPI Controller</th>
<th style="text-align: center">Function Controller</th>
<th style="text-align: left">Perceived Failure</th>
<th style="text-align: left">Remediation</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Working.</td>
<td style="text-align: center">Working.</td>
<td style="text-align: left">None.</td>
<td style="text-align: left">None.</td>
</tr>
<tr>
<td style="text-align: center">Working.</td>
<td style="text-align: center">Seized.</td>
<td style="text-align: left">Status flags and performance counters report no activity from the function controller in a reasonable period of time.</td>
<td style="text-align: left">Master can issue reset command(s) in an attempt to interrupt the function controller, or perhaps to even reboot it via its reset pin directly.</td>
</tr>
<tr>
<td style="text-align: center">Seized.</td>
<td style="text-align: center">Does not matter.</td>
<td style="text-align: left">No response from the SPI controller for even the most basic of queries or commands.</td>
<td style="text-align: left">None; this is likely a power outage, disconnected cable, or unseated/missing/under-powered chip.</td>
</tr>
</tbody>
</table>
<p>With a separate SPI microcontroller,
you can at least send a command via SPI to reset the peripheral controller as a means of attempting to regain control.
Depending on your design, you might even have independent control over what gets reset or re-initialized.
With a unified microcontroller, you have no way of telling what’s going on.
Anyone who has ever suffered protocol errors with SD cards
knows full well just how prone those things are to seizing hard,
showing no lifesigns until the card is pulled out and re-inserted.
Or until the master <em>emulates</em> this by power-cycling the card with dedicated circuitry for that task.
But I digress; point is, SD cards are, in my experience, guilty of using a single controller for SPI and protocol control,
and as a result, are fickle in the face of errors.</p>
<h3 id="process-based-isolation">Process-based Isolation</h3>
<p>However, if you’re so cost conscious that you cannot afford two separate controllers,
then you should consider instead using at least two <em>processes</em> running on the same controller.
Here, I define <em>process</em> in the same way Erlang might:
a domain of protection intended to isolate one program from another,
where no two processes share state, and
only communicate through a well-trusted signaling or message passing system.
For mid-grade microcontrollers,
it’s unlikely you’ll have memory management units to help enforce this,
so you will need to resort to a <em>prioritized, preemptively</em> scheduled multi-threading kernel with carefully written code,
manually making sure that your SPI driver <em>never</em> touches any other task’s memory, and vice versa.
This depends on a lot of “what-ifs”,
so it’s clearly the least provable method to achieve isolation.
But, with a careful choice in programming languages, you can probably pull this off without much trouble.</p>
<p>A better solution would be to use a larger microcontroller that has a hardware memory management unit (MMU) of some kind.
This does not have to be a paging MMU, either.
For example, the current
<a href="https://riscv.org/wp-content/plugins/pdf-viewer/stable/web/viewer.html?file=https://content.riscv.org/wp-content/uploads/2016/11/riscv-privileged-v1.9.1.pdf#page=54&zoom=auto,-16,535">RISC-V privileged instruction set specification</a>
provides support for what’s called <em>base and bounds</em> protection.
(Essentially a crude form of segmentation.)
This is plenty sufficient to achieve the desired outcome.
Alternatively, some PowerPC- or POWER-based microcontrollers provide special
Block Address Translation, or BAT, registers which could be used here.</p>
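<p>As a rough illustration of base-and-bounds protection: every process-relative access is offset by the process’s base, and faulted if it exceeds the bound. This model is illustrative only — real hardware performs this check in the load/store path, and nothing below is a real RISC-V or PowerPC interface:</p>

```python
# Illustrative-only model of base-and-bounds protection. Real MMUs do
# this check in hardware on every memory access.

def translate(vaddr, base, bound):
    """Map a process-relative address into a physical one, faulting if
    it falls outside the process's region of `bound` bytes at `base`."""
    if not 0 <= vaddr < bound:
        raise MemoryError("access outside process bounds")
    return base + vaddr

print(hex(translate(0x10, base=0x4000, bound=0x1000)))  # -> 0x4010
```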
<h3 id="comparing-controllers-vs-processes">Comparing Controllers vs Processes</h3>
<p>With a dedicated controller, you have a dedicated processor tending to the I/O channel.
This means, plainly, that you never have complex timing relationships between the transactions that come and go on the link.
With processes, however, you need to make sure that the SPI driver process is scheduled in hard real-time,
so as to avoid overrun errors in the SPI data register/FIFO.
For that matter, you also need to make sure the process is scheduled when you have a response to send as well.
You don’t want to risk the master timing out due to jitter in how your OS schedules your driver thread.
This is why I emphasized using a <em>prioritized, preemptively</em> scheduled kernel above.
A cooperatively scheduled kernel,
prioritized or not,
will just put you right back in the same situation as a naive implementation.
It’s <em>doable</em>, but only if you can statically prove timing closure for all possible inputs,
<em>including those not anticipated in the field.</em></p>
<p>On the other hand, a dedicated controller will cost you printed circuit board space.
If there’s one thing I’ve noticed over the years,
it’s that PCBs are surprisingly expensive.
When PCB area starts to dominate the cost of adding an additional MCU,
you might want to consider upgrading to a more powerful MCU with memory protection, and relying on processes instead.
There’s also the cost of coupling the SPI controller to the function controller as well.</p>
<h2 id="turtles-all-the-way-down">Turtles All the Way Down</h2>
<p>Above, I mentioned that the purpose of SPI is to emulate a wider interconnect.
Typically, wider interconnects have only a small number of “commands.”
Meanwhile, narrower interconnects tend to have a bewildering array of commands.
Count the number of commands defined by the MMC/SD interface standards.
I’ll wait.</p>
<p>Here are the commands exposed by the 6502 microprocessor, for example:</p>
<ul>
<li>Read byte at address A15-A0.</li>
<li>Write byte D7-D0 at address A15-A0.</li>
<li>Fetch opcode D7-D0 at address A15-A0. Sample interrupts.</li>
</ul>
<p>That’s it.
The rest of the CPU’s interface is, in some form,
a means of controlling how fast the CPU interacts with the addressed peripheral.</p>
<p>So, here’s a question:
if we were to build a 6502 emulator that runs over an SPI interface of some kind,
thus letting us replace a 40-pin parallel interconnect with a 6-pin PMOD interface,
do we encode a series of higher-level messages like, “Fetch bytes starting at address”,
or do we remain faithful to the original parallel interconnect?</p>
<p>There are, of course, arguments in favor of both methods; but,
I’m going to argue that you should <em>stay faithful to the original interconnect</em> as a basis.
You can always add enhanced functionality later if it’s warranted.</p>
<p>Sticking with the original interface semantics has several benefits.
First and foremost, a 6502 doesn’t care if a RAM or I/O controller responds to a read request.
It’ll happily read garbage if nothing’s there to respond to the data request.
I argue that the SPI link’s command set <em>must</em> react in the same manner.</p>
<p>How would one handle a slow peripheral?
SPI devices tend to send $FF as an idle character,
so the master would busy-wait on the slave until it received some byte <em>other</em> than $FF.
Then it would know the next several bytes contains the response to the previous command.
In this case, a single byte containing the value requested.</p>
<p>We can’t wait forever, however, and the SPI slave controller knows this.
There are two ways of handling this situation:</p>
<ol>
<li>
<p>The master <em>negates</em> SS# when it wants to cancel the command in progress. This causes the slave interface to abort its wait on results, and probably should also notify the function controller to cancel its pending operation as well. This frees the slave interface up to respond to another SPI command.</p>
</li>
<li>
<p>The slave interface controller <em>times out</em> itself (remember, it’s logically independent of the actual function controller!), and sends back an error response to the master, which can decide to re-issue the command or do something else later.</p>
</li>
</ol>
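<p>Here’s a sketch of the master-side wait loop implied above, with option 2’s timeout bolted on. All names are hypothetical, and <code>xfer</code> stands in for whatever byte-exchange primitive your SPI master provides:</p>

```python
# Sketch of the master-side wait loop described above. The slave shifts
# out $FF while idle, so the master clocks dummy bytes until it sees
# anything else, or gives up after a bounded number of polls (option 2's
# timeout). All names here are hypothetical.

IDLE = 0xFF

def await_response(xfer, max_polls=1000):
    """xfer(byte) exchanges one byte with the slave. Return the first
    non-idle byte of the response, or None if we time out."""
    for _ in range(max_polls):
        b = xfer(IDLE)                # keep CLK running with dummy traffic
        if b != IDLE:
            return b                  # first byte of the slave's response
    return None                       # caller may negate SS# and/or retry

# Fake slave: "busy" for five polls, then answers with 0x42.
state = {"polls": 0}
def fake_xfer(_mosi):
    state["polls"] += 1
    return 0x42 if state["polls"] > 5 else IDLE

print(hex(await_response(fake_xfer)))   # -> 0x42
```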
<p>Either one of these approaches is OK by me.
The first option lets the master control its own timing parameters to some extent,
while option 2 allows for simpler master software stacks.
I should point out that options 1 and 2 are not mutually exclusive;
in fact, I’d go further to say that option 1 isn’t even an option;
it really is the minimum functionality expected of an SPI link.</p>
<p>This brings us to the whole concept of <em>layering</em> protocols within protocols.
It’s how the Internet works, for instance, and it can work for SPI as well.
The command set supported by the SPI slave interface should <em>terminate</em> at the SPI slave interface.
Requests for the function being controlled should appear <em>inside</em> the commands ferried to the SPI slave.</p>
<p>For instance, if I were to redesign the SD protocol from scratch,
I would have to insist on using commands not entirely dissimilar to what IBM mainframes use when talking to disk units:</p>
<ul>
<li>Read N sectors of data.</li>
<li>Write N sectors of data.</li>
<li>Receive the next N bytes of command data, and have the function execute it.</li>
<li>Sense up to N bytes of status information from the function.</li>
<li>Is the function alive? Alternatively, report the latest metric M of the function.</li>
<li>Reset the function.</li>
</ul>
<p>(In point of fact, I strongly urge readers of this article
to study the
<a href="https://www.google.com/search?client=ubuntu&channel=fs&q=z%2FArchitecture+principles+of+operation&ie=utf-8&oe=utf-8">IBM z/Architecture Principles of Operation</a>,
in particular the section on how Channel I/O works.)</p>
<p>So, to read a 1024-byte block starting at sector 1234,
I would first “command” the card to seek to sector 1234,
then send the “read” command for 1024 bytes,
then receive the data requested.
If, for some reason this transaction failed, I can use the “is alive?” function to determine health,
and if it’s found to not be in a working state,
I can issue a reset function in an attempt to reset it and try again.</p>
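<p>The seek/read/recover sequence just described might be sketched like this. None of this is a real card protocol — the verbs and classes are invented — but it shows the link-level instructions terminating at the SPI slave controller while the function-level request rides inside a command:</p>

```python
# Hypothetical sketch of the layered transaction described above.
# The link-level verbs (command, read, is_alive, reset) terminate at the
# SPI slave controller; the function-level request ("seek to sector N")
# rides *inside* a command. This is not a real card protocol.

class FakeDisk:
    """Stand-in for the function controller behind the SPI slave."""
    def __init__(self):
        self.healthy = True
        self.sector = 0

    def execute(self, payload):
        op, arg = payload
        if op == "seek":
            self.sector = arg

    def read_data(self, nbytes):
        return bytes(nbytes)          # pretend sector data

class Link:
    """Stand-in for the SPI slave controller's own command set."""
    def __init__(self, function):
        self.fn = function

    def command(self, payload):       # ferry bytes to the function controller
        self.fn.execute(payload)

    def read(self, nbytes):           # bulk data path, targets raw storage
        return self.fn.read_data(nbytes) if self.fn.healthy else None

    def is_alive(self):
        return self.fn.healthy

    def reset(self):
        self.fn.healthy = True

def read_block(link, sector, nbytes):
    link.command(("seek", sector))    # function-level request, encapsulated
    data = link.read(nbytes)          # link-level bulk read
    if data is None and not link.is_alive():
        link.reset()                  # attempt to regain control...
        data = link.read(nbytes)      # ...and try again
    return data

link = Link(FakeDisk())
print(len(read_block(link, 1234, 1024)))  # -> 1024
```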
<p>Likewise, to determine card capacity, I would first:
“command” the card to produce card capacity, then
“sense” the results back.
Note the differences between command and write:
the former works on bytes and targets the function controller itself,
while the latter works on whole sectors, and targets raw storage.
Likewise with sense versus read.
But all the while, notice that instructions like read, write, command, and sense
<em>target the SPI slave controller itself.</em></p>
<h2 id="conclusion">Conclusion</h2>
<p>I’d like to conclude by recapitulating some key take-aways from this article:</p>
<ol>
<li>SPI (or <em>any</em> serial interconnect) only proposes to replace a wider interconnect; it <em>does not</em> propose to serve any higher-level purpose than that.</li>
<li>SPI interconnects should never fail except in cases where a wider interconnect would also be expected to fail.</li>
<li>Functional units should be designed to expose basic telemetry to the master, so the master can gauge health of the slave as a whole.</li>
<li>SPI slave controllers (or their software equivalents) should not be involved with interpreting the commands intended for the function.</li>
<li>Commands intended for the function should be <em>encapsulated</em> inside of commands intended for the SPI slave controller, just as TCP is encapsulated inside of IP packets. Similarly with responses.</li>
<li>The command set supported by the SPI slave interface should be small enough that you can <em>easily</em> prove correctness <em>without</em> an automated theorem prover. Build up from there.</li>
</ol>
<p>So if you’re planning on building an SPI slave,
please keep the above points in mind.
Now go forth and build something not just awesome,
but reliable too!</p>
Some Thoughts on Defined Processes2016-04-26T00:00:00+00:00http://sam-falvo.github.io/2016/04/26/defined-processes<h1 id="some-thoughts-on-defined-processes">Some Thoughts on Defined Processes</h1>
<p>As my role with my current employer evolves, I find it increasingly difficult to remember even the most mundane tasks that I need to perform. I’m sure <a href="https://en.wikipedia.org/wiki/Memory_and_aging">age has something to do with it</a>, but I strongly suspect that it has more to do with <a href="https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_or_Minus_Two">The Magical Number Seven, Plus or Minus Two.</a> Between remembering reading about the value of checklists in Watts Humphrey’s <a href="http://www.amazon.com/Discipline-Software-Engineering-Watts-Humphrey/dp/0201546108">A Discipline for Software Engineering</a> and the new emphasis on data-driven methods here at work, I decided to follow Watts’ advice. I’ve tried other methods before, even <a href="http://www.medicaldaily.com/why-using-pen-and-paper-not-laptops-boosts-memory-writing-notes-helps-recall-concepts-ability-268770">taking the time to hand-write my tasks on a daily basis</a>, but nothing so far has delivered <em>consistent</em> results. I’m hoping having a regular checklist will at least allow me to consciously decide when to and when not to perform a task, versus simply <em>forgetting</em> all the time.</p>
<h2 id="its-not-skill-its-something-else">It’s Not Skill, It’s Something Else.</h2>
<p>There was a time when I was an ardent agile process supporter. Why shouldn’t I have been? It revolutionized how I wrote software; I was able to produce clean, relatively bug-free software in relatively short order. More importantly, it was a <em>repeatable</em> process. I no longer had to <em>explicitly think</em> about quality – it just sort of happened on its own. I promoted it everywhere I went, and actively did my work using it, even in uncooperative organizations.</p>
<p>Today, I’m finding my ability to keep track of lots of details atrophying slowly yet surely with each passing day. I’m faced with a harsh reality: getting old sucks, and while I’m still not old enough to be thought of as “old” by society’s standards, I’m definitely starting to wake up to the realities of my age bracket. Keeping track of the big-picture architectural details necessary to navigate the code you’re developing becomes quite fuzzy over time. I often find myself spending one or two <em>days</em> just trying to remember how to do something important to my task at hand.</p>
<p>Ironically, while my sensitivity to details seems to be reducing with age, my <em>skill</em> level doesn’t appear to change all that much overall. Once I have a firm understanding of what I need to produce or perform, I can execute with the same alacrity I had when I was 20 years younger. I still understand the basic principles of what I do every day. So the question is, how do I compensate not for faltering skill, but rather for my faltering ability to process large amounts of information seemingly at the same time?</p>
<p>I’m forced to re-evaluate a topic which I’ve not only discarded in the past, but actively campaigned against in my distant youth: <em>disciplined processes.</em> Of course, I still campaign against the top-down mandate for such processes; however, if implemented bottom-up, then by definition such processes are fit-for-purpose. They can then deliver (and have delivered) many benefits to teams and individuals alike.</p>
<p>While I’m obviously inspired by Watts Humphrey’s <a href="https://en.wikipedia.org/wiki/Personal_software_process">Personal Software Process</a>, which even Humphrey himself admitted is a high discipline process, please be aware that this is <em>me</em> taking the plunge, not my employer. <em>I’m</em> doing this because <em>I</em> feel inadequately capable of keeping up with my peers and delivering <em>with confidence</em> value to my fellow team members.</p>
<h2 id="it-starts-with-a-checklist">It Starts With a Checklist.</h2>
<p>Yesterday, I wrote my first personal process checklist. It’s a script which I follow every Monday. It actually contains a list of rather mundane things, including but not limited to the following:</p>
<ul>
<li>Look for specific e-mails and copy their contents to other, well-known wiki pages. This is something that ought to be easy to automate, but experience suggests that manually spending the time once a week is actually cheaper than investing the time to automate it.</li>
<li>Look at all Github issues opened since the previous week, and assign severity and priority labels to them. This has more to do with the company’s recent push for data-driven metrics on defect rates than with any particular value that it delivers to the team; hence why we assign these asynchronously from issue creation.</li>
<li>Look at my own Github issue history, and using that information, compose our own version of the infamous <a href="https://en.wikipedia.org/wiki/TPS_report">TPS report</a>, the equally infamous <a href="https://en.wikipedia.org/wiki/Progress,_plans,_problems">PPP report</a>. In case you’re wondering, PPP stands for Progress, Plans, Problems; exactly the kinds of things you discuss during a daily stand-up meeting, but then for some reason have to repeat once more in a more formal manner as you report back to your manager what you did that previous week.</li>
<li>Schedule semi-regular meetings with the engineering managers, and re-assess current priorities so as to make sure that the QA department can deliver maximum value to the product department at all times.</li>
<li>Check for vacation time, and if I have any coming up, make sure I register it with the company.</li>
</ul>
<p>and so it goes. I will, if need be, devise scripts for other days of the week as well.</p>
<p>Why bother, though? Wouldn’t a calendar application serve just as well to offer reminders? If this works for you, then great. For me, however, the answer is a solid no; the events I plug into a calendar are transient at best, even when configured to be recurring. There’s no such thing as “reading my calendar for the day”; I quickly forget what’s scheduled after I read it. (While writing that sentence, in fact, I had to check my calendar to make sure I wasn’t about to miss a meeting. This would have been the fourth time I checked it today.) The reminders are only helpful if I can see them. <em>If I’m away from my computer, then I’ll have no conscious memory of any pending events.</em> Anyone can create invites without my knowledge or permission. Especially if management creates an invite, I cannot often just say “no, I’m busy”. Usually, all I can do is reschedule. Finally, calendars are really poor vehicles to communicate exactly <em>how</em> to perform a certain task. You really want a channel that properly supports rich, long-form communications.</p>
<p>So, for me, calendars are transient, out of sight/out of mind, often incontrovertible, short-form, and asynchronous in exactly the wrong sort of way. They’re useful; don’t get me wrong! I don’t think I could function as efficiently without one. Yet, they clearly aren’t covering all the use-cases for helping me get my day-to-day activities done. So far, I’m finding checklists to be more helpful for those kinds of rote, daily tasks.</p>
<h2 id="context-switching">Context Switching</h2>
<p>I find checklists are also excellent at supporting context switching. Let’s suppose I need to go to a meeting (from one of those calendar invites I was talking about). After I come back from the meeting, which might last for an hour, I then have to pick up where I left off. What do I do now? With a suitable checklist, I may not be able to completely switch contexts as ideally as I’d like, but it’ll be a whole heck of a lot faster than if I didn’t have it. My Mondays are interrupted by several meetings, so getting my daily tasks done requires several such context swaps. My checklist has already proved its worth.</p>
<h2 id="estimating-size-resources">Estimating Size, Resources.</h2>
<p>This is more an issue for non-rote tasks. I’m not estimating any task sizes or resources yet; but, it’s something I’d like to start doing for all my non-rote, deliverable-oriented tasks. From these estimates, I can then build simplistic <a href="https://en.wikipedia.org/wiki/Earned_value_management">Earned Value charts</a> which I can then use to help inform my management of when things will be complete.</p>
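<p>As a rough illustration of the idea (mine, not the author’s actual tooling), earned value reduces to weighting completed tasks by their planned size:</p>

```python
# Toy earned-value sketch.  Units are task-hours (or gummy bears -- any
# consistent unit works); tasks are (estimate, finished?) pairs.

def earned_value(tasks):
    """Return percent complete, weighted by planned task size."""
    budget = sum(est for est, _ in tasks)             # budget at completion
    earned = sum(est for est, done in tasks if done)  # value of finished work
    return earned / budget * 100.0
```

<p>Comparing that earned-value percentage against the planned percentage for the current date is what lets you project a completion date with some confidence.</p>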
<p>I actually have some experience using EVM and task breakdowns to predict delivery dates on two previous occasions. First, I used it to plan out my own work in developing some aspects of the Kestrel-2. Second, I used it to develop a functioning prototype, “Project Eagle Eye,” a mechanism that combined my team’s documentation with programming examples and our CI/CD infrastructure to continuously test the example code. My goal was to help ensure our technical examples were always correct. While both efforts demonstrated excellent fidelity to the initially planned completion dates, I’m particularly proud of Eagle-Eye, where I delivered the proof of concept within 2 days of the planned delivery date. That was a schedule slip of only 1.11%.</p>
<p>I tried doing this with a sub-project for the Kestrel-3 recently, only to discover half-way through the project that I’d done the EVM wrong. Whoops. By sheer luck, it turns out that I was able to continue to schedule things, but the report it produces communicates the wrong metrics. (The only reason it still works is because my basis for comparison used the same units.) Had I used a checklist, the probability that I’d make this mistake would have been minimized, or even eliminated.</p>
<p>A necessary prerequisite for getting EVM to work for you is to <a href="https://en.wikipedia.org/wiki/Work_breakdown_structure">break tasks down</a> into atomic (or nearly atomic) units of work, and estimate how long they’ll take to complete (PSP recommends using task hours, but you could just as well use gummy bears or other abstract units as long as they’re used <em>consistently</em>). The smaller the unit of work, the better. This is something I’m notoriously bad at; so much so, I’m surprised I’m still employed at all. I typically end up decomposing a task into units of work which are not atomic in the slightest, and that usually means I take a lot longer than expected to complete a high-level task. After being in the industry for so long, you’d think I’d’ve learned by now.</p>
<p>The fact that I <em>haven’t</em> learned this, however, seems to suggest that this activity is <em>also</em> best addressed with a checklist of some kind. What are the common criteria that are used to decompose tasks? Record them in a checklist, and make sure to reference the checklist as often as possible, so that I can remember to look for those facets. Obviously, if the checklist fails me, then during a retrospective, revise the checklist for next time.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Clearly, using a defined process has worked for me on several occasions in limited capacities, so it’s baffling to me why I bothered going back to working in a more ad-hoc manner instead of increasing my adoption of the recommended practices. It’s clear that I need to get back into the habit. It’s like dieting, really; once you’re on a diet, you’re <em>always</em> on that diet. As soon as you diverge even a little bit, you regain the weight you worked so hard to shed. So it is with personal processes: the moment you diverge from a proven personal software development process, the wheels start to fall off the wagon. Whether dieting or adhering to a defined process, the real challenge can be found in sticking with it.
I’m hoping I can stick with it this time.</p>
Bring It On!2015-08-13T00:00:00+00:00http://sam-falvo.github.io/2015/08/13/bring-it-on<p>This past week,
I became very flustered
with myself.
On Tuesday, it started at work,
but I won’t discuss the origins.
They’re not important to my story.
Instead, let me talk about
my experiences at Ving Tsun class.</p>
<h2 id="tuesdays-blow-to-the-head">Tuesday’s Blow to the Head</h2>
<p>The goal of training on Tuesday was
to let the junior student experience
the rush of,
just to take one example,
receiving a blow to the side of the head,
despite the aggressor being more or less
in front of the target.
The idea is simple, at least as I understand it:
get used to the feeling,
so you’ll be calm enough
to render a proper, direct response.
Will you get hit?
Maybe!
And, that’s kind of the point.
If you don’t fear receiving strikes,
you’ll be in more control of the situation.
(Obviously, we’re not striking “for reals.”
If you get hit, the most you’ll feel is someone landing
knuckles or a palm on your temple.
There’s no actual impact at this level of training.)</p>
<p>OK, so, I start the night’s practice.
The problem is, I’m a horrible fighter.
My attacks inevitably ended up
being completely formulaic.
So, my si hing comes over to
give me advice.
As I understood the advice, however,
it seemed diametrically opposed
to what I thought I should be doing
as instructed at the beginning of the class.
Why should I focus on my center when
my goal as attacker is to attack off the line?
Wait, what do you mean I should attack from the side?
You just told me to focus on my center!
Hold up, there; I can’t clear the center and
attack to the side at the same time!
Gaaahhh!!!</p>
<p>I didn’t want to say anything, though,
because (a) he’s my senior by many years,
and (b) let’s face it,
it’d straight up be disrespectful.
So, to the best of my ability,
I tried to accommodate the instruction, but to no avail.
Throughout the evening,
it felt like everything I was doing was wrong,
judging by how frequently my si hing offered instruction.</p>
<p>Clearly, Joe Rogan I am not.</p>
<p>I became so flustered with myself
that I started to shut down.
The smile on my face had dissipated to a scowl,
and every si hing in the class knew not to talk to me
unless it was important.
Don’t worry; I was still quite cordial and respectful.
But, it was <em>very</em> clear to all that I was <em>not</em> happy.</p>
<p>After class, I went to see my wife at work, and then home.
I have to admit, unloading some tension
by talking to my wife helped.
Getting a reasonable night’s sleep also helped.
But, by Wednesday, that frustration was still nagging me.</p>
<h2 id="wednesdays-blow-to-the-ego">Wednesday’s Blow to the Ego</h2>
<p>After work, I decided to skip Aikido out of convenience and
go to Ving Tsun again.
Maybe this time, class will be better,
and I can forget yesterday’s woes and
just focus on my forms.
Because, frankly, they suck.</p>
<p>Remember that: focus on my forms. It’ll be important in a bit.</p>
<p>I get to class, and it’s a packed house.
And, worse, everyone there is practicing toy ma
or a related hand art.
So, just as soon as I change,
and step onto the floor and start my chum kiu stance,
I get tagged by a fellow student,
and start work on lap dah and dan chi sau.
So much for moving meditation!</p>
<p>That was pleasant enough,
but eventually, a si hing grabs me
to work on toy ma.
He’s pushing me all over the place.
Not only that, but he’s pushing me all over the place
entirely too easily.
I’ve never felt toy ma that powerful before.
I mentioned that it was the most aggressive toy ma I’d experienced,
and he said he’s just getting back into it.</p>
<p>At first, it was fun.
I got pushed around like a pallet in the video game <a href="https://en.wikipedia.org/wiki/Sokoban">Sokoban</a>,
which is entirely normal.
My chest received numerous blows,
and is still bruised as I type this.
All good stuff!
The physical effort was helping me to relax from yesterday’s woes.</p>
<p>Only, they were bound to return.
Another si hing stopped us, and offered a lot of corrections for each of us.
One thing he mentioned to me was,
“To the best that you can, you should hold your ground.”
I knew this already, from previous training in toy ma.
And that’s what set me off again:
why do you keep telling me this, if I already know it?</p>
<p>So, again, I double-down on my efforts to not be moved by my opponent.
The harder I tried, to a point, the more controllable I became.
Eventually, I discovered a few body positions where I <em>was</em> able to hold my ground.
The problem was, it ceased being ving tsun at that point.
I was <em>fighting</em> my opponent, not relying on toy ma.
But, I was too numb to the sensation to pick up on that.
So, again, my si hing pulled me aside and offered corrections.</p>
<p>Correction after correction after correction.
They just kept coming,
and it seemed like nothing I could do would stop them.
Since I felt more comfortable with this si hing than I did
with yesterday’s si hing,
I kind of broke down, and vented a bit of steam.
I’m <strong>so</strong> glad he understood and took it the intended way.
It was, in retrospect, unprofessional of me,
and honestly, I should have just gone home.</p>
<p>But, I’m glad I didn’t.</p>
<h2 id="the-talk">The Talk</h2>
<p>My si hing and I had a good conversation that evening;
not one word sank in until much later the following day (namely, today).</p>
<p>One of the things he mentioned, numerous times,
was, “hold your ground the best you can.”
That’s what I was trying to do, of course,
and it ultimately culminated in gridlock with my opponent,
which only made things worse when corrected.
But, the whole time, I was focusing on the wrong thing:
holding my ground.
What I should have been focusing on instead
was the part where he said <em>the best you can.</em>
It was while watching several other si hings,
also practicing toy ma and running into the <em>exact</em> same problem I was
despite having 20+ years on me,
that the seeds of reconciliation with myself had been planted.
I just didn’t know it yet.</p>
<p>The point of toy ma,
as my si hing also repeatedly mentioned,
was to help “develop my horse” (improve my structure)
and to develop the sensitivity to know
when I have an opening.
Holding my ground doesn’t necessarily mean I stand still.
In fact, he offered many examples of where you <em>want</em> to step back
during the practice.
If I get mowed over like grass,
that’s because I have problems with my structure.
This could be an inadequate horse posture,
or it could be that my hands are in the wrong region of space for my opponent’s build,
whatever.
Either way, I need to adjust.
And I can’t adjust if I’m fighting.</p>
<p>Another aspect of toy ma
is that it helps you acclimate to being hit.
It is at this level that your opponent will now find openings
in your structure and exploit them.</p>
<p>Even these words may have a dual-meaning though.
My si hing is fond of saying to me,
“Toy ma is probably one of the hardest levels of ving tsun to get through.
So many people just stop here, and never go further,
because it is so hard.”
I took these words for granted,
because up until now, I’ve thoroughly enjoyed the experience
of playing toy ma.
Maybe those who quit just didn’t find being pushed around and slapped a lot fun.
But, for some reason, last night,
I felt like just packing it in and going home.
I felt so thoroughly defeated, that I actually asked myself the question,
“Why do I even bother?”</p>
<p>Today, it occurred to me,
what if “acclimating yourself to being hit”
doesn’t mean what it means on the surface?
Sure, you’re getting yourself beaten into a wall
all too frequently.
But, that may not be the point.
What if, instead, it <em>really</em> means that it’s <em>intentionally</em> ripping your ego to shreds,
so you can pick up the pieces and put them back together again later on?
If I could completely miss the fine detail behind “to the best of my ability,”
why couldn’t I miss the subtlety of a Chinese teaching translated to English,
and perhaps losing some of its meaning?
I don’t know that this is the case.
Maybe I’m reading too much into it.
But, it is <strong>possible</strong>, and that’s due some consideration.</p>
<p>My ego <em>was</em> hurt.
My pride <em>was</em> damaged.
Between both nights,
I learned that my Siu Nim Tau and Chum Kiu forms are basically <em>wrong</em>.
Borderline garbage, even.
So many details that I somehow missed or misinterpreted,
and that after nine months (has it been that long already?)
I finally realize I’ve been practicing these forms incorrectly?
<em>Of course</em> my toy ma sucks —
my <em>entire foundation</em> in the martial art is flawed!</p>
<p>And, of course, how do these forms even <em>relate?</em>
It’s not been made clear to me how SNT and CK even come into toy ma,
or a lot of other things in ving tsun.
Sure, on a case by case basis,
I’m shown how pieces of SNT <em>can</em> relate to <em>some</em> aspects of a technique.
But, of what value is SNT and CK if (a) I have nobody to check my correctness, and
(b) even if I did everything perfectly, I have no means of experiencing this
“stored relaxation” thing that si hing always talks about?
There’s nothing to express this “stored relaxation” against!
None of this makes any sense!!
More frustration!
Gaaahhhh!!!</p>
<p>If only I’d asked someone to check my forms.
If only I’d <em>known</em> to ask someone to check my forms.</p>
<h2 id="reconciliation">Reconciliation</h2>
<p>But then,
isn’t that the <strong>true value</strong> in playing toy ma?
Isn’t that the <strong>true value</strong> in playing off-line fighting techniques?
I have to remind myself,
I’ve only been working with toy ma for a few months, and even then,
not on an every-day basis.
And Tuesday’s off-line fighting?
That was my <strong>very first day</strong> of ever doing something like that.
Ever.
Not one day in all my years of Aikido experience
have I ever played the role of a true aggressor.</p>
<p><strong>Of course my art is busted.</strong>
I’m still such a newbie!
I have to remember,
just because I’ve been training for 9 months,
I’m not Bruce Lee.
I can never be Bruce Lee,
and actually,
don’t even want to <em>be</em> Bruce Lee.
But, I do want the confidence to know
that I can handle myself in a combative situation.
If I’m never put <em>in</em> a combative situation,
how can I possibly know what questions to ask of myself?</p>
<p>The point of toy ma,
I’ve come to realize,
at least for <em>my</em> kung fu,
is not to hold your ground.
Not at all.
Don’t even try!
If I’m pushed to the floor,
so be it.
It’s to hold your ground <em>the best that you can</em>.
Because, if you <em>can’t</em> hold your ground,
that’s <em>feedback</em> to the practitioner
to ask questions.</p>
<p>By way of analogy,
if you iron a piece of cloth that doesn’t lay flat on the ironing board,
it doesn’t fight you.
Instead, it gives, and it creases.
If you have enough awareness about you,
you can actually <em>feel</em> this crease as you pass the iron over the fabric.
This is your cue, “What did I do wrong?
Oh, haha, silly me; I forgot to flatten the fabric.”
And then, you correct your course of actions by flattening the fabric,
and continue ironing.
The caveat is that you must have that awareness.
Without that sensitivity, you’ll never have wrinkle-free clothes.</p>
<p>So it is with toy ma.
Without the sensitivity to detect wrinkles in your structure,
how can you know a problem exists?
What problems exist with my basic forms?
How can I improve my horse outside of toy ma?
How do I <em>recognize</em> when elements of SNT and CK can inform what to do and how to do it?
Why am I fighting so much?
How can I best relax?
How can I overcome my pride and seek help,
without simultaneously being a nag?</p>
<p>One question that should never come up
is, “Am I even ready for this practice yet?”
You may not be.
If my si hing is right, though, nobody ever is.</p>
<p>There will be more frustrating days ahead,
I’m sure.
I just need to remind myself,
it’s all OK.
I’m not a bad student.
It is not wasted effort.
No, I am <em>not</em> ready for what comes next in this art.
I’m just going to have to be OK with that.
Because, were I truly ready for it,
wouldn’t I already know enough ving tsun to not bother?</p>
<p>It’s 6:00PM, and time for my next ving tsun class.
I hear there’s a spot on the wall I haven’t been pushed into yet.</p>
Meatprogramming, not Metaprogramming2015-06-12T00:00:00+00:00http://sam-falvo.github.io/2015/06/12/meatprograms-not-metaprograms<p>Let me tell you a story of something that happened at work recently.</p>
<p>I’ve been put in charge of contributing a plugin to a reasonably popular systems integration mocking tool.
It offers support for so-called “control plane” APIs (a topic for another day),
which enables system/integration tests to control how this mocking tool behaves at run-time.
This tool is written in Python, a programming language I’ve been using since version 1.4.
This tool is written by many incredibly intelligent software engineers, all of whom I respect greatly.</p>
<p>Contributing my plug-in ought to be relatively easy to figure out.
I mean, I’ve been using Python since version 1.4 was a thing.
I know object oriented programming and decomposition techniques.
I’m aware of many different kinds of patterns and anti-patterns.</p>
<p>Yet, I can’t even write one line of productive code for my plugin without severe code-coddling from my peers.</p>
<p>While discussing this with some other contributors,
it was concluded that the problem was inadequate documentation.
Now, people who know me know that I’m a huge proponent for literate programming,
so I’m hardly one to impede the documentation efforts of code authors.
However, and perhaps for the first time in history,
I think this is merely addressing the symptom, not the cause, of the problem.</p>
<p>I think the real cause of my problem is more fundamental:
an accidental regression from <a href="http://en.wikipedia.org/wiki/Structured_programming">structured programming</a>.</p>
<h2 id="structured-programming">Structured Programming</h2>
<p>Structured programming came about to solve three different kinds of problems in computer science:
code clarity,
software quality,
and, improved productivity.</p>
<h3 id="clarity">Clarity</h3>
<p>Structured programming improves clarity by establishing rules about how to read and write, and thus how to think about, code.
Replacing a chunk of code with a “black box”,
typically but not necessarily always in the form of a subroutine,
enables the code reader to gloss over details irrelevant to understanding the code at hand.
Before block structure and, more formally, lexical scoping,
code sprawled all over the place (so-called <a href="http://en.wikipedia.org/wiki/Spaghetti_code">spaghetti code</a>),
impeding a coder’s understanding of the control flow to the point of submission.</p>
<h3 id="quality">Quality</h3>
<p>Thanks to block structure and the principle of <a href="http://anthonysteele.co.uk/the-single-exit-point-law">Single Entry, Single Exit</a>,
a coder could actually <em>prove</em> correctness by treating subordinate code, already proven correct in its own right, axiomatically.
When this principle is applied to the <em>design</em> process, we know it as <a href="http://www.sqa.org.uk/e-learning/Planning02CD/page_14.htm">Stepwise Refinement</a>,
and it’s actually the same technique I’m using to develop the <a href="http://sam-falvo.github.io/kestrel/2015/05/17/stepwise-refinement-of-forth-interpreter/">Kestrel-3’s firmware</a>.</p>
<p>Now, it turns out that the SESE principle is a bit too restrictive in practice;
today, we know that SEME (Single Entry, Multiple Exit) works better.
How do we know this?
By using SESE itself and formally reasoning about code written in both SESE and SEME styles, we can easily establish an equivalence.
In fact, if you look under the hood, you’ll find that most compilers <em>compile</em> SEME code into SESE code anyway,
so you still get the benefits the SESE principle provides.</p>
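<p>To make that equivalence concrete, here is the same trivial function written in both styles (my own illustration, not an example from the structured-programming literature):</p>

```python
def sign_seme(x):
    # SEME: single entry, multiple exits -- early returns keep nesting flat.
    if x < 0:
        return -1
    if x == 0:
        return 0
    return 1


def sign_sese(x):
    # SESE: single entry, single exit -- one return at the bottom, with the
    # result accumulated in a variable.  A compiler reduces SEME to this form.
    if x < 0:
        result = -1
    elif x == 0:
        result = 0
    else:
        result = 1
    return result
```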
<p>See? SESE works.</p>
<h3 id="productivity">Productivity</h3>
<p>Structured programming aims to reduce coding effort, and thus time expended,
by cataloging a <em>small</em> number of highly orthogonal <a href="http://en.wikipedia.org/wiki/Software_design_pattern">design patterns</a> which frequently appeared in high-quality software.
These break down into three broad categories:
<strong>Sequence</strong> (ordered collection of statements or their equivalents, such as subroutines),
<strong>Selection</strong> or <strong>Alternation</strong> (execute one of <em>n</em> different code paths based on some criteria), and,
<strong>Iteration</strong> (execute the same code path <em>n</em> times, or until a condition becomes satisfied).
These patterns are so well understood these days that all major programming languages today support explicit syntax for each of them.
It’s hard to believe, but even as late as 1984, many languages actually lacked a <code class="language-plaintext highlighter-rouge">while</code> construct or the ability to invoke a subroutine without an explicit <code class="language-plaintext highlighter-rouge">CALL</code> keyword.</p>
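<p>All three patterns can be seen in just a few lines (my own illustration):</p>

```python
def sum_of_positives(values):
    total = 0            # Sequence: statements executed one after another
    for v in values:     # Iteration: repeat for each element of the input
        if v > 0:        # Selection: choose one of two code paths
            total += v
    return total
```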
<h3 id="compound-effects-and-hierarchical-design">Compound Effects and Hierarchical Design</h3>
<p>By combining block structure,
introducing rules allowing correctness proofs (even if informally), thus allowing one to verify their own understanding of the code,
and by offering a standard catalog of common design patterns,
the gains <em>compounded</em>, allowing the programmer development speeds far faster <em>and</em> with fewer errors than with competing methodologies of the time.</p>
<p>Applying all three concepts with skill leads to software which has a natural hierarchy to it.
Higher level code tends to mediate or coordinate lower-level code.
Lower-level code tends to consist mostly of operations and/or data accessors.
I encourage readers to look into <a href="http://geekswithblogs.net/theArchitectsNapkin/archive/2015/04/29/the-ioda-architecture.aspx">Ralf Westphal’s IODA Architecture</a> if you want to know more.
<strong>Even if your code doesn’t actually run in a predominantly sequential fashion, your code can still benefit from this architecture.</strong>
Indeed, browsing around Ralf’s website will reveal illustrations and examples of event-driven applications written IODA-style.
The code is a joy to read.</p>
<h2 id="the-problem-with-dynamic-languages-and-metaprogramming">The Problem with Dynamic Languages and Metaprogramming</h2>
<p>I’m a Forth programmer. Metaprogramming, as it is with Lisp, is in my blood.
However, there’s little disagreement from me that metaprogramming can be easily abused.
In Forth, it comes in the form of <code class="language-plaintext highlighter-rouge">IMMEDIATE</code> words. In Lisp, it comes in the form of macros.
But, in Python and other object-oriented, dynamically-typed programming languages, it comes in the form of <em>functions.</em></p>
<p>Forth is also the ultimate in dynamically-typed languages: it has <em>no</em> types of any kind, except the machine word.
Thus, I receive no benefit from the compiler or interpreter when I make a type-related error.
Which I make quite frequently;
certainly, enough for me to swear in public that I’d never use Forth again.
(And, yet, I always go back to Forth.)</p>
<p>Speaking as someone fluent in Forth, Python, Ruby, and other dynamic languages,
I mention this only because I feel that dynamically-typed languages <em>encourage</em> the use of metaprogramming <em>too much.</em>
With a more restrictive environment that focuses only on the essentials,
you may find your desire for cleverness increases, but begrudgingly,
you will write more maintainable code.</p>
<p>For instance, while working on the Kestrel-2’s system software,
I frequently lamented <a href="http://sam-falvo.github.io/kestrel/2013/12/08/programming-without-lambdas/">not having access to a return stack</a>.
I used a dialect of Forth that simply didn’t support application-custom immediate words, <code class="language-plaintext highlighter-rouge">CREATE</code>, <code class="language-plaintext highlighter-rouge">>R</code> or <code class="language-plaintext highlighter-rouge">R></code>, or other metaprogramming tools,
mainly because it was a target compiler, and because the underlying hardware just didn’t support that functionality.
You know what?
I can still read and support the code today, many years later.
The code compiles to this day, without compile-time or run-time error.</p>
<p>So what’s my beef, then?
Although patently contrived,
here’s an example that’s actually inspired by code in that mocking tool I talked about earlier.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def create_comparator_class(low_bound, high_bound):
    class Comparator(object):
        """docstring goes here."""
        def compare(c_self):
            """another docstring here."""
            if c_self.get_left() < c_self.get_right():
                return -1
            elif c_self.get_left() == c_self.get_right():
                return 0
            else:
                return 1
    def get_lower_bound(c_self):
        return low_bound
    def get_high_bound(c_self):
        return high_bound
    setattr(Comparator, "get_left", get_lower_bound)
    setattr(Comparator, "get_right", get_high_bound)
    return Comparator
</code></pre></div></div>
<p>With this code, we can now create any number of comparator classes simply by mentioning a set of parameters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CompareLetters = create_comparator_class('a', 'z')
CompareNumbers = create_comparator_class(1, 10)
</code></pre></div></div>
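<p>For what it’s worth, the synthesized classes do behave like ordinary hand-written classes. Here is a condensed sketch of the same factory (the bound-returning helpers are collapsed into lambdas purely for brevity) alongside a quick check:</p>

```python
# Condensed sketch of the factory above; the lambdas stand in for the
# get_lower_bound/get_high_bound helpers attached via setattr.
def create_comparator_class(low_bound, high_bound):
    class Comparator(object):
        """A comparator over a fixed pair of bounds."""
        def compare(c_self):
            if c_self.get_left() < c_self.get_right():
                return -1
            elif c_self.get_left() == c_self.get_right():
                return 0
            else:
                return 1
    # Attach the accessors dynamically, as the original does with setattr.
    setattr(Comparator, "get_left", lambda c_self: low_bound)
    setattr(Comparator, "get_right", lambda c_self: high_bound)
    return Comparator

CompareLetters = create_comparator_class('a', 'z')
CompareNumbers = create_comparator_class(1, 10)

print(CompareLetters().compare())  # prints -1, since 'a' < 'z'
```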
<p>It should be readily apparent to the reader that we’re <em>effectively</em> synthesizing code on-the-fly.
In other words, we’re exploiting <a href="http://en.wikipedia.org/wiki/Self-modifying_code">self-modifying code</a>, in spirit if not in fact.</p>
<p>There are several problems with this technique:</p>
<ol>
<li>
<p>How do you document the <code class="language-plaintext highlighter-rouge">create_comparator_class</code> function? I mean, really think about it. Python docstring formatting conventions are not well equipped to adequately document, <em>to the same level as a statically declared class docstring</em>, what the resulting classes do, what their methods are capable of, what pre- and post-conditions exist, and so forth. In fact, I’m willing to bet you that if you use code like this in your project, it probably won’t have a docstring longer than 15 lines. The example above, as simplistic as it is, already is 5 lines longer than that, and we didn’t even consider how to dynamically generate docstrings for the <code class="language-plaintext highlighter-rouge">get_left</code> and <code class="language-plaintext highlighter-rouge">get_right</code> methods. This compromises structured programming’s goal for increased code clarity.</p>
</li>
<li>
<p>It increases cyclomatic complexity, impeding the maintainer’s ability to understand what’s going on, and more importantly, <em>why</em>. Some projects I’ve worked on actually have a continuous-integration gate on cyclomatic complexity because of how problematic things can get. This compromises both the quality <em>and</em> the clarity of the code.</p>
</li>
<li>
<p>I don’t say this often; but I’ll say it now. This is one case where Python’s indent-based blocking is a real disadvantage. While maintaining Python code like this, you have to read and comprehend the entire outer definition <em>and all inner definitions</em> before you can even <em>begin</em> to consider, “Hey, are these inner definitions indented properly?”. Are you <em>sure</em> all your <code class="language-plaintext highlighter-rouge">def</code>s are properly aligned? Without static checking, you cannot know unless the code is actually executed and all code paths have been exercised. I hope you have a good code coverage tool! If you didn’t see the <code class="language-plaintext highlighter-rouge">setattr</code> functions at the end of the definition, or if they happened to be buried in a lower-level function somewhere out of sight, would you have complained about the methods not having the right indent level? These are not hypothetical concerns; they happen in real projects, in real time, every day. The time you spend checking and double-checking this impacts your overall productivity.</p>
</li>
<li>
<p>Up to an inflection point, it’s more code than you need to write. Software defect rates are known to be correlated with total lines of code, regardless of programming language used. Therefore, why choose to write more lines when you can write fewer? Again, a compromise on code quality.</p>
</li>
</ol>
<p>It turns out Python has a perfectly serviceable method of creating new classes as we need them in the source code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class ComparatorBase(object):
    def compare(self):
        """Return the result of comparing two values."""
        return self.a < self.b
class C1(ComparatorBase):
    def __init__(self):
        self.a = 1
        self.b = 10
class C2(ComparatorBase):
    def __init__(self):
        self.a = 'a'
        self.b = 'z'
</code></pre></div></div>
<p>That’s 12 lines of code, as compared to 19, despite the more verbose attribute assignment.
Indeed, with the original approach,
we don’t see a savings in total lines of code until we instantiate more than four subclasses,
and even then, we can introduce a “maker” function that is still substantially simpler:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def create_comparator_class(low_bound, high_bound):
    class Cx(ComparatorBase):
        def __init__(self):
            self.a = low_bound
            self.b = high_bound
    return Cx
</code></pre></div></div>
<p>When errors are measured in defects per 1000-lines-of-code, this small difference is insignificant.
But, when you approach 2000 or more lines of code in your project, suddenly it becomes measurable, even to an individual.</p>
<p>We gain other benefits from this simplification as well.</p>
<ol>
<li>
<p>The code flows are patently obvious to the reader.
This <em>reduces your documentation burden</em>, so that
you don’t <em>have</em> to document how things work at such minute levels that you’re basically documenting how the language works.</p>
</li>
<li>
<p>It’s provably correct <em>by inspection</em>.
But, if you don’t trust your ability to inspect code,
or you inherently distrust code written by 3rd party teams,
it’s also fully compatible with, e.g., Python 3’s ability to support tooling around optional static typing.</p>
</li>
<li>
<p>It’s substantially easier to document. You can use normal Python-style docstring techniques to document the base class, and refer to it in <code class="language-plaintext highlighter-rouge">create_comparator_class</code>’s docstring as needed.</p>
</li>
</ol>
<p>Those are important benefits exploited simply by using <em>static</em> program structures in favor of <em>dynamic</em> structures.</p>
<p>Alas, in the mocking tool, we <em>can’t</em> use statically structured code like my illustration above,
because of a problem introduced by yet another metaprogramming facility: <strong>decorators.</strong></p>
<p>Unlike Java’s annotations, Python’s decorators are intended to be compile-time functions of other Python constructs
(usually classes and functions themselves), and
this particular application uses them to specify URLs for RESTful endpoint dispatching.
It follows that we can’t easily parameterize a RESTful URL without somehow digging into the internals of the REST framework’s guts.
<strong>This utterly defeats the benefits and purpose of modular programming in toto.</strong></p>
<p>Had the REST framework been architected with more static constructs in mind,
the decorator would have been written in terms of a more general-purpose mechanism for adding URL routes,
and this whole mess could have been avoided.
We could have used a much simpler interface for plugin writers,
such as using a <em>dictionary</em> to map URL to handler method or class,
arguably far more obvious to the code reader as it provides better separation of concerns.</p>
<p>Alas, they didn’t, and it’s not, and so the decorator is the <em>only</em> (documented) method of adding a route to the web-app class instance.</p>
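<p>To make that alternative concrete, here is a hypothetical sketch of dictionary-based route registration. Every name in it is invented for illustration; it is not the actual framework’s API:</p>

```python
# Hypothetical sketch: registering RESTful routes with a plain dictionary
# instead of a decorator.  All handler and function names are invented.
class ItemHandler(object):
    def get(self, item_id):
        return "item %s" % item_id

class ItemListHandler(object):
    def get(self):
        return "all items"

# A static route table: URL pattern -> handler class.  A plugin can build
# this table programmatically (here, parameterizing the URL prefix)
# without digging into the framework's internals.
def make_routes(prefix):
    return {
        "%s/items" % prefix: ItemListHandler,
        "%s/items/<item_id>" % prefix: ItemHandler,
    }

routes = make_routes("/v2")
```

Because the table is just data, it reads at a glance, and parameterizing it is ordinary function application rather than metaprogramming.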
<h2 id="you-suck">You Suck.</h2>
<p>At this point, you’re probably thinking that I’m just not that good of a Python programmer.
Or, more generally, not that good at higher-level thinking in general.
I hear you. And, you’re probably right.
I’ll be the first to admit that my aptitude and proclivities align towards lower-level, simpler programming.
Honestly, though, feedback from my fellow maintainers suggests that I’m not that bad as engineers go.
And, yes, I speak with my peers frequently about my self-perceived deficiencies.</p>
<p>So, at the end of the day,
if you’re reading this and thinking that I’m just not up to snuff or somehow “not good enough” to hack “real” Python code,
then you’re falling prey to the <a href="http://rationalwiki.org/wiki/No_True_Scotsman">No True Scotsman</a> fallacy.
Indeed, I can reverse the argument back on to you:
if you can’t write semantically clean, easily maintainable code without resorting to cleverness,
you’re not that experienced an engineer yourself.
But, then, we’d just end up in a flame war that goes nowhere, wouldn’t we?</p>
<h2 id="conclusion-and-my-plea-to-you">Conclusion and My Plea To You</h2>
<p>The code is already written; I have to bite the bullet and deal with it.
But, I’d like to plea with you, the reader, for mercy when writing new code.</p>
<p>If this blog post serves any purpose at all,
it is hopefully to get you, the reader, the high-level, dynamically-typed
programming language coder, to think twice every time you <em>even consider</em> a metaprogramming solution.
This includes macros in Lisp, immediate words in Forth, and decorators in Python.</p>
<p>I’m not alone in this. <a href="https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=thoughts%20on%20metaprogramming">If you do a Google search for “thoughts on metaprogramming”</a>,
you’ll find a litany of webpages describing how people just <em>hate</em> metaprogrammed solutions, for one reason or another.</p>
<p>Programming languages serve two audiences: humans, and computers.
You’ve already mastered instructing the computer.
Now you need to master how to write code to support your fellow human being.
You need to write <a href="http://therealadam.com/2011/12/09/why-metaprogram-when-you-can-program/">meatprograms</a>, not metaprograms.
Your fellow engineers will thank you, and most importantly,
your employer will be thankful for not having to spend loads of cash on engineers trying to reverse-engineer some obscure bit of cleverness when they could be making progress instead.</p>
Dragonfly2015-03-17T11:00:00+00:00http://sam-falvo.github.io/2015/03/17/dragonfly<p>Today, I fed two homeless, hungry people. The surprise, shock, and pure
appreciation was palpable from both of them. That alone was enough to really
make my day. What happened afterwards, though, was pure icing on the cake.</p>
<p>As I walked towards my office, still several blocks away and in the middle of a
construction zone, a HUGE dragonfly decided he wasn’t in the least afraid of
me, and flew right dead-center onto my chest, feet just an inch above the solar
plexus. It must have measured three inches long, with a wingspan to match; big
enough to give me pause. If you’ve ever seen an Amethystium Em-blem T-shirt,
then you’ll know exactly where it landed. Primarily brown and copper in color,
the wings were nicely irridescent, making for an interesting contrast. I
wanted to grab a picture, but as my phone was in my pocket, it flew off before
I could extract it.</p>
<p>I’ve worked in San Francisco for many years now, and not once have I ever seen
a dragonfly inside the city that wasn’t part of some science exhibit. For one
as big as this to just descend out from nowhere, weaving in between other
people, and land directly on me as though that were its purpose all along was
nothing short of delightful.</p>
<p>To that dragonfly, thank you. You’ve given me some much needed joy.</p>
A Case for Literate Programming at Work2014-10-10T17:25:00+00:00http://sam-falvo.github.io/2014/10/10/case-for-lit-prog-at-work<h2 id="abstract">Abstract</h2>
<p>Recently, I suffered difficulties trying to utilize a testing tool at work.
The primary contributors were unavailable for questioning as a result of their other daily tasks, leaving me blocked on my own tasks.
If they used literate programming techniques when developing their product, I could have learned what I needed to know about it much faster.
The improved asynchrony of our respective teams implies greater productivity of the company as a whole.</p>
<h2 id="my-frustration">My Frustration</h2>
<p>Several weeks ago,
I took on the task of writing a regression test suite for a project at the office.
At first, I was going to write them in Go, as I’m familiar with the language and its tools;
the built-in framework does exactly what we want, and
I could have built on top of another Go project’s tests as a foundation.
However, management encouraged me to look into using a more widely accepted framework inside the product-QE group.
I said OK, as I couldn’t see a good reason why not at the time.</p>
<p>However, it’s taken me several weeks of effort to learn how to get this tool to invoke any tests I write.
The installation instructions were solid and well-written.
However, the documentation fails to provide any tutorial documentation to let me replicate such a simple task successfully.
(I intend on helping to fill this gap, of course.)
Fellow engineers in my office who have worked with previous versions of this tool quickly found the latest version behaved differently from their expectations as well.
I’ve sent e-mails to trusted coworkers on this issue, who either still maintain the framework, or know who does, but without receiving any substantive response to date.
I resorted to trying to reverse engineer the design and data flow via the published source code.</p>
<p>Initially, this proved difficult; like any good object-oriented program, it relies on a significant amount of polymorphism to function.
I looked through the code, line by line.
Sometimes, I had to guess which package or class to look in next, due to the nature of polymorphism.
I never did find the spot where my tests were invoked;
however, I did locate the place where the framework should have discovered which tests to actually run.
In case you’re curious, the problem resided in the code responsible for test class discovery.
Even if I did find the dispatcher, I’d’ve spent a fair amount of time working back to this point anyway.</p>
<p>This story ends happily, thankfully.
But did it have to take this long?
I argue it didn’t have to.</p>
<h2 id="my-history">My History</h2>
<p>But, first, another historical digression to establish some context.</p>
<p>For decades now, I ardently promulgated and supported the agile cause.
Since the days of extreme programming, I used to believe in the agile manifesto’s second credo, “Working software over documentation,”
especially after I let the authors convince me of the value of self-documenting code in the book, Extreme Programming Installed.
Why bother with extensive commentary when the code can comment itself?
Years after writing my own code this way, even without any flavor of comments, I found I was able to still read and understand my work.
Clearly, Kent Beck and friends were right.</p>
<p>I failed to understand, though, that the value of self-documenting code applies only as long as you retain any amount of state from working on the project.
Since everything I write for myself is, by definition, <em>my</em> state, it follows therefore that I can still grok my code that I wrote way back in 1995.
I feel two reasons explain why this works:
Common coding conventions, and a common lexicon of terms and the semantics that go with them.
A big, big problem happens, however, whenever I attempt to read someone else’s code, for which I have no prior project exposure.
It might shine as an exemplar of clean code, even with the occasional helpful comment sprinkled in for good measure;
I’d still find that if it wasn’t written in assembly language, I couldn’t make sense of it.
All these years, the coding community pushed for and focused on having coding convention standards (PEP-8, go fmt, et. al.), and yes, they’re <em>really</em> nice to have.
However, I claim, albeit without proof, that it’s the lexical differences that do the real damage to your productivity.
PEP-8 did nothing to help me find what I was looking for with the testing tool, for instance.
Learning the various lexical patterns used in the codebase proved far more valuable.</p>
<p>“Whoa, wait a minute!”, I hear you say as you suddenly realize what I’d written. “What does assembly language have to do with anything?”
Surprisingly, a lot;
however, if you’ll indulge me, I’ll try to show why later.
For now, just keep it in the back of your mind.</p>
<p>As time passed, I frequently found myself utterly confounded by code written by others.
After so many years of this phenomenon, I started to blame myself whenever this happened.
I’d get really quite depressed about it.
I always feel so inadequate as a software engineer because I can’t follow someone else’s code.
I mean, everyone <em>else</em> can read this code and figure out what it does, so why can’t I?</p>
<p>Even when an author followed Beck’s advice on writing self-documenting code,
figuring out how one part of a program related to another ranged from extremely difficult to utterly impossible once the scale of the program grew too big.
Too often, I found myself asking, “What purpose does this code serve? Why do I care about it? <em>Should</em> I care about it?”
This problem only became worse with the popularity boom of object-oriented software, thanks to polymorphism.
(I completely gave up trying to understand aspect-oriented software.)
So it is with the testing tool I’ve been tasked to use.
I must admit some embarrassment, for during its incubatory year of development, their developers sought my involvement to help shape its evolution.</p>
<h2 id="my-epiphany">My Epiphany</h2>
<p>In order for the second credo to hold any value, developers must uphold the first credo just as seriously, if not moreso:
“Individuals and interactions over processes and tools.”
Most engineers, without further qualifications, automatically take this to mean “with teammates or the stakeholders.”
They almost never consider it in the context of fellow employees in <em>other</em> teams; because, you know, they’re smart enough to figure it out on their own.
They’re engineers, after all!</p>
<p>This is, I feel, the agile movement’s Achilles’ Heel.
Not only are we not all extroverts, but we just can’t always afford to conduct face-to-face interaction.
How many people do you know who’ve worked hard on a project ABC, seen it through to a successful deployment,
moved onto a different project DEF, and suddenly became unavailable for even the most basic of support queries concerning ABC?
It’s happened to me at Cari.Net, at least thrice at Google, at Attributor, twice at Ning, and now here at Rackspace.
Paradoxically, I received the worst treatment at Google, perhaps one of the most pro-agile bunch of coders I’ve ever worked with:
one of the three engineers I sought help from to understand some old code actually put an auto-responder on his e-mail address and stopped answering his phone after I tried contacting him.
Now; that, right there, is exemplary teamwork.</p>
<p>Having experienced the outcome of my folly, I now consider this behavior, however accidental, just shy of irresponsible not only to your fellow partners but also to all future partners as well.
How many times have you seen months-old (or older) support tickets that paraphrase as, “Tried to contact engineer about possible bug in ABC, but wasn’t available. Escalating.”
Or, tickets reporting a genuine, reproducible bug that has gone unresolved for several <em>years</em>.
Here’s an actual, real-world example:
how many times have you seen this output produced from Firefox running under Linux,
even though it’s now a decade old?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(firefox:2932): Gtk-CRITICAL **: IA__gtk_clipboard_set_width_data: assertion `targets != NULL` failed
** (firefox:2932): CRITICAL **: gst_app_src_set_size: assertion `GST_IS_APP_SRC (appsrc)` failed
</code></pre></div></div>
<p>Probably never, since most run it from a menu or double-clicking an icon.
Try running it from a console, though, and see what happens.
After a single day’s worth of browsing, your console will fill with messages such as these.
Yet, Firefox obviously works; so why bother going through all that low-level implementation detail just to fix an innocuous non-bug like this, right?
(Though, at least these messages include line numbers to help in the debugging effort.)</p>
<h2 id="my-proposition">My Proposition</h2>
<p>OK, I think I’ve laid out adequate examples demonstrating why I think lack of code documentation is a very real problem.
I know I’m not the only one suffering from it, either.
Looking at the academic and professional literature alike, I see <em>continuing</em> work in tricks and techniques to help facilitate rapid code comprehension.
We are even starting to see entire conferences devoted to software documentation efforts.
While many made positive impacts (e.g., generating API documentation via tools like Doxygen or JavaDoc),
few as yet have attempted to tackle the workplace productivity losses due to code comprehension issues.</p>
<p>So, what do we do about it?</p>
<p>I’d like to propose the use of <em>literate programming</em>.
Yes, <em>that</em> literate programming, the one everyone seems to hate, the one developed and advocated by Donald Knuth.</p>
<p>Remember when I said that assembly listings rarely posed a problem for me?
So many find assembly opaque that assembly coders adopted a very discursive style of commenting.
Most assembly coders demanded high-quality documentation; coders never considered it a luxury.
It wasn’t uncommon to look at some random piece of assembly and, from the context of the comments alone, figure out what, how, and why a program did what it did.</p>
<p>Experience shows, however, that commentary-inside-code tends to have more limitations with most contemporary programming languages than code-inside-commentary.
The commentary must appear along with the code in the order that the language demands it.
This makes great sense for API documentation, but not so much for exposition covering how the software works, or why it has the design it does.
Historically, such documentation, if it exists at all, exists out-of-band from the code, which means it’s almost always out of date as soon as the next code commit.</p>
<p>Little did I know that better tools have existed since the early 1980s.
While I’d love to go into a full demonstration of available literate programming technologies today,
that lies beyond the scope of this article.
Instead, I want to go into <em>why</em> I claim we should adopt to this style of code commentary.</p>
<h2 id="my-rationale">My Rationale</h2>
<p>By far, I find the code-walkthrough aspect of a literate program its greatest value proposition.
Most literate programming tools enforce (or, at least, strongly encourage) a structure to your documents, roughly described by this EBNF:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Document = Preamble (Explanation Code)*
</code></pre></div></div>
<p>Of course, some tools allow exceptions now and again; but, all in all, this format seems the most successful structure for literate programs today.
When you think about it, though, it makes perfect sense.
When you stand in front of your fellow team members during a code walk-through,
you often point out a region of code, then talk about it.
People ask questions about it,
and then when it’s time to move on, you repeat the process: highlight a new region of code, and start talking about <em>that</em>.</p>
<p>Whether you’re standing in front of a code review panel, or writing prose in a literate program,
it often helps to start talking about your code with a high-level overview first, and then refine down to lower-level details on an as-needed basis.
Note that this often opposes the bottom-up approach to product development.
Literate Programming enables you to work bottom up, while encouraging a top-down dissertation on how the software works.</p>
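<p>The Explanation/Code alternation also keeps the mechanical half of the job simple. As a toy sketch (loosely borrowing noweb’s <code class="language-plaintext highlighter-rouge">&lt;&lt;chunk&gt;&gt;=</code> syntax; real tools do far more, including chunk reordering and cross-referencing), the “tangle” step merely extracts the code chunks from the prose:</p>

```python
# Toy "tangle" sketch: extract the Code halves from a literate document
# structured as Preamble (Explanation Code)*.  The <<name>>= / @ chunk
# delimiters are borrowed loosely from noweb; real tools do much more.
def tangle(document):
    chunks, in_code, current = {}, False, None
    for line in document.splitlines():
        if line.startswith("<<") and line.rstrip().endswith(">>="):
            current = line.rstrip()[2:-3]   # chunk name between << and >>=
            chunks[current] = []
            in_code = True
        elif line.strip() == "@":           # '@' returns to explanation
            in_code = False
        elif in_code:
            chunks[current].append(line)
    return {name: "\n".join(body) for name, body in chunks.items()}

doc = """Preamble text.
Explanation of the main routine.
<<main>>=
print("hello")
@
More explanation follows here."""

print(tangle(doc)["main"])  # prints: print("hello")
```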
<p>Literate programs offer several advantages over traditional code reviews.
For instance, your software can come with a table of contents, allowing people to jump directly to the code that most interests them.
With a traditional code walk-through, people sit through the whole meeting,
while you’re wasting the majority’s time because only one or two people in the group take interest in any part of your program at a time.
Also on the list of benefits, you’re free to include helpful diagrams and other explanatory figures in your explanation.
When I give code walk-throughs, I often have to draw diagrams on the whiteboard, taking the attention of my reviewers away from the code.
Last, but not least, every fragment of code you write sits physically adjacent to its corresponding documentation.
Too many times I’ve performed code reviews and walk-throughs where I wanted to talk about something, and either didn’t have the time, or forgot completely.
Because many tools will flag an undocumented code chunk as an error, the probability that I’ve forgotten to document some important topic drops asymptotically to zero.</p>
<p>A repository of literate programs largely decouples me from their respective authors.
Organizations should find obvious value from this.
Now, I’m not here to vilify those who don’t want or have time to chat with me about some esoteric aspect of a project long forgotten;
many of them are actually good friends of mine and have quite valid reasons for not responding to me.
One must ask, though, at what point do my priorities influence theirs?
If they’d written the testing tool using a literate style and published the source code “documentation,” the probability I’d’ve been blocked would’ve dropped precipitously.</p>
<p>It also avoids the game of telephone: every time you recall how the program works, you’ll find the details will change.
I can’t tell how many times I’ve wanted to talk about a project at work, and every person I mention it to gets, inadvertently, a different story.
The same thing happens during code reviews.
With a literate program, the story you write in the documentation remains intact, regardless of how many copies are downloaded, printed, or whatever.</p>
<h2 id="my-conclusion">My Conclusion</h2>
<p>Look, I’m not asking for Knuth’s “book-publishable” or “camera-ready” quality documentation.
While I agree they’re really nice to look at, I think people who focus on that aspect of literate programming are missing the much wider array of benefits.
I only need enough information to let me do my job.
By facilitating asynchrony between different teams, code documented in a literate programming manner can help encourage a self-serve engineering atmosphere.
It can also significantly reduce new employee ramp-up times, for it makes “read the code” actually meaningful.
This can potentially cut expenses because the time invested writing the documentation in the first place amortizes over every team which depends on it.
Because the documentation sits adjacent to the relevant code, the probability of it remaining current exceeds more traditional, out-of-band mechanisms of documentation, such as wiki pages and the like.</p>
<p>I hope I adequately made the case for adopting some flavor of literate programming in professional software engineering organizations.
Thanks for your time.</p>
Planned Processes: Not So Evil After All?2014-05-14T18:24:37+00:00http://sam-falvo.github.io/2014/05/14/planned-processes<p>This post is not politically correct in the contemporary software engineering community.
I probably just blogged myself out of any future employment.
But, when has that ever stopped me from saying what needs to be said?</p>
<p>As some of my readers know,
I work at a sizable company called Rackspace.
We’re not super-huge; nothing like Google or Microsoft or even IBM.
However, we’re not exactly small potatoes either.
Historically a company focusing purely on excellent customer support,
Rackspace today pushes itself to become more of a technology company with their OpenStack-related offerings.
Rackspace adopted a (more or less) agile approach to software engineering and product testing alike.
For the last two years, this approach worked wonders.
Rackspace helped define a new market,
and in the face of the recent Oracle-vs-Google ruling,
we and our customers stand to benefit from OpenStack’s open-source API and implementation alike.
Life’s good!
Or is it?</p>
<p>Last week, my supervisor passed along a mandate that product testing must happen according to
a documented test strategy that conforms to a standard, comprehensive template.
When I received exposure to this template for the first time, I thought it looked quite onerous.
This template includes all manner of compulsory, and occasionally optional, fields to fill in.
While filling in the blanks is simple enough if you have the data at-hand,
it’s not always so simple to research the data you need to fill those blanks in with.
Tools exist to help the prepared engineer with this, such as Watts Humphrey’s Personal Software Process,
or burn-down charts if you’re a SCRUM master, so I won’t consider this topic further,
except to say that most engineers are not so prepared.
With this new template,
part of the responsibility for project planning now falls squarely on the product’s quality engineers (QEs) assigned to that project,
including forecasting which different phases of testing are needed,
when they start and for how long,
what is known to be in- and out-of-scope,
an architecture diagram or description,
and more.
No part of this form seemed to comply with our previously pro-agile approach to product testing.
My gut felt sick, with thoughts of pitchforks and torches dancing in my head.
In short, I panicked.</p>
<p>Now, let’s fast-forward to today.
I needed to perform perhaps one of the simplest possible tasks one could do for any project:
contribute changes to my group’s customer-facing, online documentation.
It’s maintained using Sphinx, and the repository is on Github, so it ought to be a simple task, right?</p>
<p>Never underestimate the power of poor communication.</p>
<p>After spending about an hour and a half researching and repairing a botched Sphinx installation<sup>1</sup>,
I tried for an hour more to get the damned thing to build the documentation website.
Nothing I tried worked, so I issued several pleas to my more senior coworkers for help.
Surprisingly, two senior engineers deferred me to yet another, as they couldn’t remember the steps they had taken to let them edit the docs.
After close to an hour of this, I finally get a helpful response from the third engineer.
It turns out I need to install two repositories, one on top of the other, for the build to work correctly.
Tell me again why this isn’t written somewhere?</p>
<p>And that’s when it hit me like a ton of bricks.
It occurred to me at this moment exactly <em>why</em> my organization (which focuses on quality, security, and repeatability across all of Rackspace),
as distinct from my group (which focuses on delivering a single product to our customers),
now requires QEs to fill out strategies using the recommended template.
That form collects the wisdom and hard lessons learned over several years of other quality engineers running into exactly the kinds of problems I did above.
How do I…? What does … do? All I did was …. and the whole system broke! Why? Who do I contact if … ?
And, last but not least, why isn’t it written somewhere?</p>
<p>The template embodies a simple philosophy: <em>pay it forward</em>.
It recognizes certain repeatable communications failures in teams regardless of their process, be it SCRUM, Lean, or cowboy coding.
Namely, engineers and business-folk tend to ask the same kinds of questions, but engineers are loath to respond.
By filling in this template for your product, it frees you from having to respond at all in most cases, at least beyond pointing to the document.
In a way, the template is a generalized FAQ.
You just provide the details unique for your product or service.</p>
<p>Some questions are asked so frequently that any time you start a new project,
or if you aim to maintain an existing one,
you just <em>know</em> you’re going to need to answer them at some point anyway.
Not only that, but you can even predict who is going to ask them, and depending on the demographic, when.
Business folks, those handling the money behind the project’s funding, want to know you’re making regular and controlled progress.
They want to know that what they’re paying for the software (consider engineer salaries, insurance, and benefits) is worth it to the business.
Customer support folks want to know where they can find high-level overviews of the product
so they can answer customer questions in a timely manner.
System administrators want to know who to escalate level-2 or level-3 questions to.
Engineers want to know the nitty-gritty details so that when a bug comes in, they know where to look
without having to waste hours to days narrowing down the subsystems.</p>
<p>Knowing that these questions are coming, you might as well just take the time to answer all of them up-front.
That’s exactly what the new template aims to encapsulate.
Spending a few days to a week to perform a good (enough) job at planning can save a lot more time down the road.
If nothing else, treat it like a FAQ.
When junior engineers ask a question the strategy document already answers, just point to the document then and there.
When management wants to know your progress, point them to the form and give a small earned value update, a percentage, or something.
I know you’re busy. Hell, we’re all busy.
But pointing at the document alleviates the mental burden of answering the question later, and it saves a crap-ton of the junior engineer’s time too.
As if that weren’t enough, as engineers come and go from your project, and they will,
you’ll ultimately save them countless hours of ramp-up investment.
Brooks’s Law will still apply, just not as much.
Heck, interested engineers can read about your product and how it works even before their start date.
How’s that for hitting the ground running?
Isn’t it worth it to spend a couple of days drawing up the plan?</p>
<p>But what if the plan isn’t correct?
Isn’t planning up front just another form of big-design-up-front?
News flash!
Plans will <strong>always</strong> be incorrect.
By definition, they’re glorified estimates!
As long as (1) it’s close enough, and (2) deviations are reported to stakeholders at the time of the discovery, nobody will care if it’s technically wrong.
It’s been my experience that senior management doesn’t generally have a problem with issues that come up in a project.
They only want them to be reported as soon as they’re discovered, so that they have enough “runway” to take corrective action.
It turns out that these values, core to many agile processes, apply to planned processes too.</p>
<p>So what does this have to do with planned processes?
Well, everything.
See, the template is itself a planned process script, by definition.
This could well be Rackspace’s first planned process script for software engineers.
The template instructs the reader how to fill it out, even going so far as to provide examples.
Once filled out, the resulting strategy serves as a kind of checklist to the QE, a calendar to business folk, an introduction to the system for new engineers, etc. all at once.
From the examples cited in the template, it does quite an adequate job at it too.
I’ve learned a lot about several of our internal projects which I had no knowledge of before.
Already, as an engineer, I benefited from the form.</p>
<p>This means, I think, that Rackspace, the technology company, is maturing.
Our products are both stable and mission-critical enough to warrant more rigid processes,
so that our sales and executives can offer commitments to high-stakes customers with agreed-upon service level agreements.
And, unlike Microsoft’s products, if our products break, we end up costing our customers <em>lots</em> of money, which we may be liable for under some SLAs.
There’s no such thing as “just reboot or reinstall the cloud,” no matter what anyone tries to tell you.
Obviously, we don’t want to put either customers or ourselves in such a costly position.
So, while I doubt we even register on the CMMI maturity scale yet,
that we’re organically recognizing the need for greater maturity in our processes can only be a <em>good thing.</em></p>
<p>Of course, as with anything else, you can always have too much process imposed from on-high.
When management imposes reporting requirements, it becomes a slippery slope.
Let’s hope this doesn’t become a habit.
Too much process can easily be as stifling as no process at all.
But, for the time being, I look forward to stepping up my game and growing my skills as a professional software engineer.</p>
<hr />
<p><sup>1</sup> Not Sphinx’s or Python’s fault; it looks like damage incurred when I upgraded my Mac to Mavericks. Still, why did it happen in the first place?! Clearly a quality engineering issue!</p>
On XML vs JSON2014-03-09T18:24:37+00:00http://sam-falvo.github.io/2014/03/09/on-XML-vs-JSON<p>In a recent discussion I engaged in on Google Plus about XML vs JSON, I was sent a webpage, written by David Lee, which successfully attempts to illustrate that, under specialized conditions, <a href="http://www.balisage.net/Proceedings/vol10/html/Lee01/BalisageVol10-Lee01.html#FileSizeTable">XML and JSON are comparable to each other</a> in compactness, and thus, transmission and storage efficiency. It fails to convince me, however, that XML compares favorably against JSON at parsing performance. The latter issue hasn’t been a real concern until fairly recently, though. It further fails to convince me, in toto, that I should even consider XML again, for anything outside of document markup, especially as more and more enterprises take greater interest in their transactions-per-watt metric.</p>
<p>First, I want to say that I mostly agree with David’s findings, given the narrow scope of his research thesis. I’m still not satisfied with the report’s lack of illustrations using namespaces, but other than that, it seems pretty comprehensive and well researched. No doubt, with a skilled practitioner defining an XML format, an end-user may find working with XML quite pleasant to work with, from an operational as well as usability point of view. In some cases, XML may even yield <em>smaller</em> uncompressed encodings than its equivalent JSON representation. If you serialize your object fields as attributes instead of nested constructs, <em>and</em> you don’t use characters which require entity expansion, you actually save four characters per attribute over JSON strings, and two bytes for numbers and booleans. It only takes a handful of attributes to recover XML’s framing overhead in that narrow case, assuming very short tag names. The Books corpus tests in figure 9 show this nicely. Alas, as every other corpus test shows, not everything renders so nicely.</p>
<p>This explains why I said <em>specialized conditions</em> above; it takes active software engineering effort to make XML as compact as JSON. Indeed, throughout the whole article, it’s as if David continually paraphrases the mantra, “See, if we only perform this best-practice, XML can be as compact as JSON.”<sup>1</sup> This is telling: it illustrates how JSON exhibits the compact-by-design property more than XML (though I certainly feel we can compact JSON further; I find the need for quotes around key names largely superfluous, for example). It’s pretty clear just by looking at the available applications of XML in the real world that few willingly expend the effort necessary to ensure a good quality, compact XML format. Anyone who’s had to work with Spring configuration files, or Maven dependency files, or fixing corrupted IDE project configurations, or synthesizing SOAP payloads for integration testing purposes, or … can tell you just how much of a nightmare XML is, purely from a usability point of view, and more rarely, an operational point of view as well.</p>
<p>Though the research suggests that there’s no real gain to be had over well-designed XML with the use of JSON, it <em>never</em> successfully states the contrapositive, namely that if you <em>fail</em> to exercise discipline with JSON format design, you can end up with JSON as fat as typical XML. The closest contraposition found in the research comes from the use of JSON naively auto-generated from a source of already suboptimal XML. While I can’t prove a negative, I can speculate that this never happened in the real world, for any commonly used wire transmission or storage format. Do detail-oriented developers simply have a predisposition to use JSON? David’s research cannot answer to that; nonetheless, without a supporting contrapositive taken from a corpus of JSON from the real world as David uses with his XML data, one cannot reliably refer to David’s research to justify XML over JSON for an over-the-wire or storage format.</p>
<p>Up to this point, I only discussed data which equates JSON and XML. Already, we find little incentive to reconsider XML as a viable format for much of anything outside of legacy applications. However, careful examination of the data in David’s research may hint at a reduction in the number of new XML applications going forward.</p>
<p>Presumably in an effort to show superiority of XML over JSON in parsing performance, figure 16 shows JSON parsing takes longer than XML for most payload sizes. However, he’s using a parser that the greater JSON and Javascript communities shunned on account of its known performance and security hazards. Let me reiterate – <em>nobody</em> I know of uses Javascript’s <code class="language-plaintext highlighter-rouge">eval()</code> function to parse anything but the most trivial JSON payloads, and even then, only for illustrative purposes, typically security vulnerabilities. <strong>No</strong> secure, high-performance production environment, be it on the client-side or server-side, <strong>ever</strong> uses <code class="language-plaintext highlighter-rouge">eval()</code>, <strong>period.</strong> This explains why jQuery and Node have their own, custom-implemented JSON parsers in the first place. <code class="language-plaintext highlighter-rouge">eval()</code> also bypasses any Javascript JIT, hence its lackluster performance. For these reasons, I consider that specific test categorically invalid.</p>
<p>jQuery’s own JSON parser, thanks to modern tracing-JIT technology, better approximates the performance found in such languages as Go, Java, PyPy, et al., as ultimately we’re executing real machine instructions. To illustrate, figure 17 paints a different picture, which David completely ignores in his conclusion<sup>2</sup>. Suddenly the built-in XML parsers start to look pretty slow in comparison. As your payloads get bigger, even despite well-formatted XML, parsing XML demands ever greater CPU time, across the entire range of the chart.</p>
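<p>David’s measurements ran in JavaScript engines, but the shape of the comparison is easy to reproduce at small scale. The sketch below is a rough stdlib-level analogue in Python (the miniature payload is invented for illustration, not drawn from David’s corpus). Note that the XML side needs explicit tree-walking and type conversion on top of the parse itself, while the JSON parser, like <code class="language-plaintext highlighter-rouge">JSON.parse</code> and unlike <code class="language-plaintext highlighter-rouge">eval()</code>, executes no code:</p>

```python
import json
import xml.etree.ElementTree as ET

# Invented miniature payload, encoded both ways.
json_payload = '[{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]'
xml_payload = '<books><book id="1" title="A"/><book id="2" title="B"/></books>'

# JSON: one call yields native data structures.
from_json = json.loads(json_payload)

# XML: parse, then walk the tree and convert attribute strings by hand.
from_xml = [
    {"id": int(book.get("id")), "title": book.get("title")}
    for book in ET.fromstring(xml_payload).findall("book")
]

assert from_json == from_xml  # identical data; only the parsing cost differs
```

<p>To race the two, wrap each parse in <code class="language-plaintext highlighter-rouge">timeit.timeit(lambda: json.loads(json_payload), number=100000)</code> and compare; absolute numbers will of course vary by machine and payload shape.</p>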
<p>So why do I find this important? More and more enterprises and individuals alike host their corporate functions on VMs<sup>3</sup>, either internally or externally via providers like Rackspace. Having efficient parsers not only means less performance drain for your own application, it also means greater performance for your fellow tenants on the physical host. This means reduced IT and support loads for both the enterprise and the hosting provider as well. As enterprises increasingly pay attention to their electric bill, higher performance translates to increased transactions per second per watt consumed, which translates into more efficient compute resource utilization for their dollars spent.</p>
<p>While I agree with the individual who sent me the link that XML has been abused over the years, it’s clear that David’s research does nothing to convince me to return to XML for any reason whatsoever, nor does it convince me that, in the real world, XML should even be considered as anything but a legacy format. David’s thesis, despite being validated with research, only goes so far as to say that XML only approaches the JSON asymptote, and exceeds it only in the most specialized of circumstances. In fact, David’s own data works against the thesis that one can justify new applications of XML, as I’ve pointed out above. This implies that JSON remains the superior over-the-wire protocol for textual formats. XML, like HTML and SGML before it, remains a <em>document</em> markup format.<sup>4</sup> But, I digress. If we’re going to argue using the right tool for the right job, at least provide a compelling use-case for your side, preferably one which doesn’t include data supporting your opposition.</p>
<hr />
<p><sup>1</sup> He’s right, of course, which explains why I still agree with David’s thesis while concurrently disagreeing with how his thesis is being used to justify continued use of XML. After all, nobody except David asked the question, “Given the subset of applications where XML and JSON compete, <em>can</em> XML be as compact and useful as JSON, for all applications of its use?”. The question remains to this day, “Given the subset of applications where XML and JSON compete, <em>is</em> XML as compact and useful as JSON, for all applications of its use?” David’s research shows that it can be, as long as you pay attention to your schema. Yet, I find his data set too narrow to satisfy my skepticism in the general case. Put more simply, theory and reality are only theoretically related; I want to know to what degree they diverge.</p>
<p><sup>2</sup> David writes, <q>Pure JavaScript parsing generally performs better with XML then with JSON but not always</q>, yet his own data directly contradicts this analysis.</p>
<p><sup>3</sup> And, increasingly, on containers <em>within</em> a single VM instance.</p>
<p><sup>4</sup> However, I, and many others, question even this, as formats like ROFF, GNU Info, Markdown, and AsciiDoc all have the benefit of working far better with revision control tools like Git and Subversion. Besides containing minimal syntax, which allows me to focus more on the problem I’m documenting, and a lot less on the syntax of the markup, documents in these formats tend to organize around lines, which are natural units of work for <code class="language-plaintext highlighter-rouge">diff</code> and related tools. What about their relative extensibility? Examining ROFF in particular, we find a rich ecosystem surrounding the ROFF format to provide embedded mathematical equations, camera-ready tables and line figures, and more, long before SVG and MathML, two XML applications, came around. Thus, ROFF proved every bit as “extensible” as XML claims to be. Compared to HTML, it merely lacked anchor points and browsers capable of hypertext navigation; nothing fundamentally prevents its inclusion, yet today nobody uses ROFF anymore except to write Unix man-pages in. Note that GNU Info format, AsciiDoc, and Markdown directly support hyperlinking, and at least Markdown and AsciiDoc provide means to escape to HTML and/or XML when native markup proves insufficient (rare).</p>
Subroutine Performance in J 701b2014-01-05T00:00:00+00:00http://sam-falvo.github.io/2014/01/05/subroutine-performance-in-j<p>I couldn’t find micro-benchmarks on J’s subroutine calling performance.
I provide some measurements to fill this gap.
I focus on J’s ability to invoke verbs in the context of writing a game,
where I expect to manipulate lots of short vectors in a control-heavy environment.
I also show how it compares to an equivalent program written in GForth.</p>
<h3 id="introduction">Introduction</h3>
<p>During Christmas and New Year’s vacation, I decided to re-install the J programming language and get back into it.
To help me really learn the language, I’m planning on porting Equilibrium, a game I originally wrote in Forth.
I’m looking to tackle this challenge more for fun and self-education than to produce anything serious.
This task will require familiarity with several aspects of J, including how to invoke foreign functions
and how to write software such that tacit and explicit definitions work together.
This naturally raises the question, how much code should exist tacitly versus explicitly?
Put a slightly different way, how much time does J consume when invoking tacit versus explicit verbs?</p>
<p>While Forth’s subroutine calling overhead is well-known and documented for several different platforms,
the same doesn’t appear to be the case for J.
I’ve seen numerous but unsubstantiated claims of high performance for the J interpreter.
I fully recognize subroutine call overhead may not necessarily influence overall program performance;
however, it certainly has an impact on how one writes software in that language.
I remain unaware of any published micro-benchmarks relating to the J programming language at all,
much less one that focuses on subroutine calls.</p>
<p>Considering J’s strong emphasis on mathematics,
any published benchmarks will likely involve lengthy vectors or wide matrices, neither of which will exist in Equilibrium.
For instance, the classic demonstration of finding the average of a long sequence of numbers appears in virtually every J tutorial on-line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(+/%#) i.100
</code></pre></div></div>
<p>Indeed, if we widen our vector and time it, we find J apparently runs very fast indeed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> (6!:2)'(+/%#) i.1e6'
0.003648
</code></pre></div></div>
<p>If we amortize the time across the million elements provided by <code class="language-plaintext highlighter-rouge">i.</code>, we find the computer spent about 3.648 <em>nanoseconds</em> processing each number in the vector.
That includes memory overhead for creating the vector, taking the summation, etc.
This seems impossibly fast, at 274e6 or so numbers processed every second.
My PC runs at 3.4GHz, meaning my machine spent somewhere in the vicinity of 12 cycles for each number.
Remember, this amortization includes memory allocation overhead and other run-time bookkeeping that J performs for you.</p>
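<p>The amortized figures above follow from simple division; here is a quick Python sanity check, taking the 3.4GHz clock rate as stated:</p>

```python
elapsed = 0.003648   # seconds, as reported by (6!:2)
elements = 1e6       # vector length produced by i.1e6
clock_hz = 3.4e9     # the PC's clock rate

per_element = elapsed / elements     # seconds spent per number
throughput = 1.0 / per_element       # numbers processed per second
cycles = per_element * clock_hz      # clock cycles per number

print(per_element)   # 3.648e-09, i.e. 3.648 ns
print(throughput)    # roughly 2.74e8, i.e. ~274e6 numbers/second
print(cycles)        # roughly 12.4 cycles per number
```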
<p>Since I write highly factored code and anticipate vectors no longer than 4 atoms in length,
I predict the overhead of invoking functions on short vectors will dominate J’s performance, and will slow down appreciably.
This experiment aims to falsify or verify my prediction.</p>
<h3 id="problem-statement">Problem Statement</h3>
<p>As I write this article,
I’m in the process of cloning Spike Dislike, a game available for several mobile platforms by the author Jayenkai, for my Forth-based Kestrel home-made computer.
An important requirement of the game involves moving the spikes across the screen.
With a very simple CPU architecture and lack of sprite hardware,
I need to break down a spike’s X-coordinate into two pieces to help make drawing faster:
a 16-pixel column number, and a pixel offset from the left-edge of that column.
This won’t necessarily apply for games implemented in J for Linux,
where I intend to rely on the SDL and byte-per-pixel graphics layouts.
Nonetheless, I retain the logic here, since
it’s representative of a real-world design decision which directly influences performance on the slower Kestrel architecture.
Equilibrium involves similar logic for its combatants and particle effects.</p>
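<p>Splitting an absolute X coordinate into a word column and a pixel offset is plain integer division by the 16-pixel column width. A Python sketch (the helper name is mine, not from the Kestrel code):</p>

```python
def split_x(x):
    """Split an absolute X pixel coordinate into a 16-pixel word
    column and the pixel offset within that column."""
    column, offset = divmod(x, 16)
    return column, offset

assert split_x(0) == (0, 0)
assert split_x(35) == (2, 3)      # 35 = 2*16 + 3
assert split_x(639) == (39, 15)   # rightmost pixel of the 640-wide playfield
```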
<p>In summary, we can describe the locations of our four spikes using a vector of coordinate triads:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 0 0 -- spike 0
0 0 0 -- spike 1
0 0 0 -- spike 2
0 0 0 -- spike 3
...
| | |
| | +----- bitmap row at which it appears
| +------- pixel offset (0 <= offset < 16)
+--------- word column (0 <= column < 40)
</code></pre></div></div>
<p>With the Kestrel version of Spike Dislike, we have no greater than four spikes in the Kestrel’s 640x200, monochrome playfield at any given time.
Thus, we create this matrix simply in J using the shape operation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>locs =: 4 3$0
</code></pre></div></div>
<p>Spikes will need to move to the left during game-play.
To facilitate this, an operation called <code class="language-plaintext highlighter-rouge">nudge</code> exists, which adjusts the X-axis coordinates appropriately.
Its simplest definition, omitting any hypothetical drawing-related code for brevity, appears in listing 1.
We apply it to all spikes using the rank adverb:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>locs =: nudge"1 locs
</code></pre></div></div>
<p>With Spike Dislike, the only five things that move on the screen include the ball and the four spikes.
However, with Equilibrium, we include particle effects as well, which increases the number of on-screen objects into the hundreds.
The more efficiently J can support processing these individual, 3-element vectors in a loop,
the more objects I can keep on the screen before flicker or sluggishness becomes noticeable to the player.
Determining the best way to accomplish this task, therefore, directly influences player experience.</p>
<h3 id="experiment-setup">Experiment Setup</h3>
<p>I used two platforms:
the first is my desktop computer, running under Linux on a four-core Intel 64-bit processor, at 3.4GHz, with 8GB of RAM.
See Listing 4 for one of the four <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code> reports corresponding to this computer.
The second platform is a Samsung Galaxy Tab 3, model GT-P5210, equipped with 1GB of memory.</p>
<p>For the PC version of J, I downloaded the version found at <a href="http://www.jsoftware.com/download/j701a_linux64.sh">http://www.jsoftware.com/download/j701a_linux64.sh</a>, accessed December, 2013.
For the Galaxy Tab 3, I run the Android version of J, which I found at <a href="https://github.com/mdykman/jconsole_for_android/blob/master/dist/j-console.apk">https://github.com/mdykman/jconsole_for_android/blob/master/dist/j-console.apk</a>, accessed January, 2014.</p>
<p>To ensure a sufficiently large data set to measure performance of invoking a function on a small vector,
I use a significantly taller matrix of locations than discussed in the introduction.
To ensure the computer keeps busy enough to average out random noise,
I elected empirically to use 1.3 million location tuples, arranged in a cube:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>locs =: 130 10000 3 $ 0
</code></pre></div></div>
<p>Originally, I used a shape of <code class="language-plaintext highlighter-rouge">1300000 3$0</code>, but this yielded slightly slower performance.
I’m at a loss to explain why, but it might serve as a topic for future research.</p>
<p>I first started out with the most obvious implementation, that of the explicit script in J.
Structurally, this script looks as you’d expect from any functional programming language.
When moving a spike to the left one pixel, we check for underflow of the pixel offset, and if so, adjust the word column accordingly.</p>
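<p>Listing 1 itself is not reproduced here, but the behavior just described (decrement the pixel offset, borrowing from the word column on underflow) can be sketched in Python; the function name mirrors the J verb:</p>

```python
def nudge(loc):
    """Move one spike left by one pixel.

    loc is a (column, offset, row) triple, as in the locs matrix."""
    column, offset, row = loc
    if offset == 0:
        return (column - 1, 15, row)   # underflow: borrow one word column
    return (column, offset - 1, row)

assert nudge((5, 3, 100)) == (5, 2, 100)
assert nudge((5, 0, 100)) == (4, 15, 100)   # borrow case
```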
<p>Next, in listing 2, I rewrote the explicit script in tacit form.
<code class="language-plaintext highlighter-rouge">pfn</code> stands for “point-free nudge.”
Each of the conditional cases appear respectively in <code class="language-plaintext highlighter-rouge">pfna</code> (pixel offset equal to zero) and <code class="language-plaintext highlighter-rouge">pfnb</code> (pixel offset non-zero).
<code class="language-plaintext highlighter-rouge">mez</code> tests to see if the pixel offset is zero (“middle equals zero”).</p>
<p>At first, I conducted the tests only with these two versions.
However, I later thought about a third case: an explicit definition whose body consists of a tacit implementation.
Listing 3 shows how I arranged the code.
I generated this code by copying the result of <code class="language-plaintext highlighter-rouge">pfn f.</code> at the JConsole into the new definition, <code class="language-plaintext highlighter-rouge">pfns</code> (point-free nudge, scripted).</p>
<p>Finally, I created the following mnemonic verb to return the time taken per row in the matrix.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time=:(6!:2)%1300000"_
</code></pre></div></div>
<p>With this, I invoked the following code and recorded the resulting figures for each of the platforms considered:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>time'locs=:nudge"1 locs'
time'locs=:pfn"1 locs'
time'locs=:(pfn f.)"1 locs'
time'locs=:pfns"1 locs'
</code></pre></div></div>
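<p>For readers who want to reproduce the shape of this measurement outside J, here is a rough Python analogue of the <code class="language-plaintext highlighter-rouge">time</code> verb: total wall time divided by row count. The nudge logic here is my reconstruction of listing 1:</p>

```python
import timeit

ROWS = 1_300_000  # matches the article's data set size

def nudge(loc):
    # Reconstruction of listing 1: move left one pixel, borrow on underflow.
    column, offset, row = loc
    if offset == 0:
        return (column - 1, 15, row)
    return (column, offset - 1, row)

locs = [(0, 0, 0)] * ROWS
total = timeit.timeit(lambda: [nudge(loc) for loc in locs], number=1)
per_row = total / ROWS  # seconds spent per 3-element row
print(f"{per_row * 1e9:.1f} ns/row")
```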
<p>As a point of comparison, I wrote the same software in GForth (see Listing 5).
Due in part to the substantially faster word execution performance over J,
and in part due to the substantially reduced timer precision provided by Unix command-line tools,
I needed to increase the number of rows that <code class="language-plaintext highlighter-rouge">pfn</code> needs to process to 1.3e9.
This caused GForth to run for just about a full minute on the PC, allowing <code class="language-plaintext highlighter-rouge">time gforth pfn.fs</code> to report meaningful results.
You’ll notice I allocate only 1.3e8 cells of memory in the code, but iterate over the data set ten times.
I tried allocating 1.3e9 cells (5.2GB) directly, but at least with GForth 0.7.0 64-bit running under Linux,
I cannot allocate that much memory.</p>
<h3 id="data">Data</h3>
<p>Table 1 presents the raw data collected on the PC and Tab architectures.
Table 2 presents the same data normalized to the fastest for each platform.
The data for table 2 was derived from the following J expressions for the PC and the Tab, respectively:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(]%(<./)) 5.15617e_6 2.56521e_6 1.41128e_6 1.26321e_5
(]%(<./)) 5.87663e_5 2.63402e_5 1.46162e_5 1.41435e_4
</code></pre></div></div>
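<p>The normalization train above divides each timing by the fastest one; the same computation in Python, using the PC figures:</p>

```python
pc = [5.15617e-6, 2.56521e-6, 1.41128e-6, 1.26321e-5]  # seconds per row

# Divide every timing by the fastest, as the J train does.
fastest = min(pc)
normalized = [t / fastest for t in pc]

print([round(r, 2) for r in normalized])  # [3.65, 1.82, 1.0, 8.95]
```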
<p>The fastest possible execution happens, at roughly 1.4us per coordinate row on the PC, when you invoke an explicitly flattened, tacit definition.
The next fastest comes from invoking the tacit definition alone, costing almost 1.8 times as much execution time.
In the latter case, I suspect vocabulary look-up for <code class="language-plaintext highlighter-rouge">pfna</code>, <code class="language-plaintext highlighter-rouge">pfnb</code>, and <code class="language-plaintext highlighter-rouge">mez</code> verbs happens for each row.
Flattening performs this look-up once, creating an anonymous verb with the same definition as the named verb, with all dependencies filled in ahead of time.
Thus, by the time the interpreter recognizes all the inputs for the anonymous verb, it only needs to look through a single tree to evaluate it;
this amortizes the cost of looking up <code class="language-plaintext highlighter-rouge">pfn</code>’s dependencies across all the available rows to process.</p>
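<p>The effect is loosely analogous to hoisting name lookups out of a hot loop in a dynamic language. The Python sketch below is an analogy only, not a claim about J’s internals: binding a name once outside the loop plays the role that flattening with <code class="language-plaintext highlighter-rouge">f.</code> plays for a tacit verb:</p>

```python
data = list(range(100_000))

def with_lookup():
    out = []
    for x in data:
        out.append(abs(x))   # "abs" and "out.append" resolved on every pass
    return out

def hoisted():
    out = []
    f = abs                  # bind once up front, like flattening with f.
    append = out.append
    for x in data:
        append(f(x))
    return out

assert with_lookup() == hoisted()   # same result; only lookup cost differs
```

<p>Timing each with <code class="language-plaintext highlighter-rouge">timeit.timeit(with_lookup, number=10)</code> versus <code class="language-plaintext highlighter-rouge">timeit.timeit(hoisted, number=10)</code> shows the hoisted form winning, for the same reason flattening wins in J: the lookup cost is paid once instead of per element.</p>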
<p>The explicit definition <code class="language-plaintext highlighter-rouge">nudge</code> incurs a 3.6 to 4.0 factor performance penalty, coming in second slowest of the four variants.
J seems to parse and interpret a script, line by line, token by token if not character by character, using an RR parser every time it’s invoked.</p>
<p>That a tacit definition inside an explicit definition takes almost an order of magnitude longer to run than a flattened tacit definition truly caught me by surprise.
I originally believed that the lack of complex control flow or the need for vocabulary look-ups would have made parsing and execution performance fall between explicit and tacit without flattening.
Even knowing what I know now, I would still expect the performance hit to not exceed 7.0.
J seems to expend a <em>significant</em> amount of effort when invoking scripts with tacit functions in them.
I speculate this excess work comes from the dynamic synthesis of anonymous verbs and the memory management overhead that entails.</p>
<p>I measured the GForth equivalent software dispatch speed at close to 38ns per row.
It would appear that J requires close to 37 times longer to invoke a verb than Forth does.
My gut tells me this excess latency comes mostly from dynamic memory management overhead.
Forth, as with C and other static languages, operates in-place on memory.
J, however, may need to construct new vectors.
While I’m aware that J recognizes and optimizes in-place updates for the <code class="language-plaintext highlighter-rouge">m}</code> operation,
it’s clear that J misses the same opportunity for in-place updates with a statement such as <code class="language-plaintext highlighter-rouge">locs =: pfn"1 locs</code>.
My transcript for measuring GForth’s performance appears in listing 6.</p>
<h3 id="discussion">Discussion</h3>
<p>It should be noted that J <em>always</em> executes what it can, when it can.
J will <em>immediately</em> interpret adverbial phrases, like <code class="language-plaintext highlighter-rouge">{.@}.</code>, to dynamically construct anonymous verbs at parse-time.
Thus, by the time the interpreter reaches the <code class="language-plaintext highlighter-rouge">=:</code> copula, the <em>value</em> of the right-hand side will be a complete anonymous verb,
itself consisting of invocations of anonymous sub-verbs, etc.
The leaves of this data flow tree have no dependencies of their own, and so take their inputs from the left- or right-hand side of the expression this anonymous verb finds itself in.
If insufficient inputs exist, the phrase remains in verb form for later processing when all inputs become available.</p>
<p>If bound to a label, such as the case with <code class="language-plaintext highlighter-rouge">pfn</code>, J doesn’t have to reparse and rebuild the tree.
Every reference to <code class="language-plaintext highlighter-rouge">pfn</code> will place in J’s parse stack a reference to the definition that already exists.
However, when a tacit phrase executes inline (which, based on observational evidence, is what happens inside a script),
J must resynthesize the anonymous verb every time.
<p>With J’s best case taking 37 times longer than Forth’s case on the PC hardware,
it’s clear that J’s claim to high performance must come from exploiting relatively lengthy vectors or wide matrices.
Even an unscientific test seems to confirm this: if we let <code class="language-plaintext highlighter-rouge">xs=:1.3e6$0</code> and execute <code class="language-plaintext highlighter-rouge">time'<:"0 xs'</code>, we see an average invocation cost of 2.9ns per cell.
Wrapping <: into a verb of its own fails to produce a measurable difference in performance.
With data so arranged, the tables turn, with J outperforming Forth by almost an order of magnitude.</p>
<h3 id="error-sources">Error Sources</h3>
<p>It’s possible my understanding of how J works under the hood fails to match reality.
I have studied the J interpreter implementation, but am far from mastering an understanding of it.
Therefore, you should take my speculation of what influences runtime performance with some skepticism.</p>
<p>Factors influencing my performance measurements may not influence your own.
Always make sure to <em>measure</em> your application’s performance against your needs.</p>
<h3 id="conclusion">Conclusion</h3>
<p>I’ve measured the relative performance of invoking an explicit and tacit verb, each with variations.
Flattened, tacit definitions run the fastest, with each invocation of <code class="language-plaintext highlighter-rouge">pfn</code> taking 1.4us (est.) on the PC, and 14us (est.) on the Tab 3.
Meanwhile,
tacit definitions expressed inside of explicit definitions run the slowest by close to an order of magnitude.</p>
<p>Based on the findings,
I recommend writing software using tacit definitions where at all possible.
You gain significant reduction of source code size while claiming a performance reward for free.
Try to minimize the use of explicit definitions for processing vectors of data, as they incur an estimated 2x performance penalty.
Instead, reserve fairly simple explicit definitions for control-oriented processing, such as any side-effecting feature of a program.
If you cannot avoid invoking a script over a vector, avoid embedding tacit definitions inline within the script at all costs.
Instead, define them tacitly as verbs outside the script, and refer to them by name.
Keep names in scripts short yet mnemonic, since J will parse those names every single time the verb runs.</p>
<p>If performance remains an issue, and assuming most of your software exists in tacit form,
you may employ the <code class="language-plaintext highlighter-rouge">f.</code> operator to, in essence, “compile” the verb’s call-graph into a single abstract tree,
thus roughly doubling run-time performance due to removal of vocabulary look-ups.
This costs a little extra memory, however, as the anonymous verb <code class="language-plaintext highlighter-rouge">f.</code> creates must duplicate not only the named verb’s implementation,
but also the implementation of all its dependencies.
It may also exhibit deleterious effects when attempting to invoke polymorphic methods on a vector of objects,
as the flattened abstract tree <code class="language-plaintext highlighter-rouge">f.</code> creates may not agree with all object types in the (presumably boxed) list.</p>
<p>Finally, consider re-arranging the layout or representation of your data structures.
As indicated above, J works best with <em>wide</em> vectors or matrices.
For example, in the case of Equilibrium,
my best interest lies with avoiding N×3 matrices, and opting instead for 3×N matrices.
Better still, use <em>parallel vectors</em> — arrays synchronized against each other, and explicitly named.
This forms a relational table of sorts, with each variable naming a column of atoms.
Based on my informal tests, such an organization would have improved performance by almost <em>three</em> orders of magnitude in the average case.
Such a reorganization, however, comes at the cost of code legibility, for related data items no longer co-reside in the source code.</p>
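<p>To make the parallel-vector idea concrete outside of J, here is a minimal Python sketch contrasting the two layouts. The column names and the per-row logic echo the <code class="language-plaintext highlighter-rouge">'xc xp yp'</code> unpacking in Listing 1; everything else is illustrative, not part of Equilibrium itself.</p>

```python
# Row-major ("array of structs"): each element is one (xc, xp, yp) row,
# mirroring the N-by-3 locs matrix described in the article.
rows = [(3, 0, 7), (5, 2, 9)]

# Parallel vectors ("struct of arrays"): one explicitly named vector per
# column, forming a relational table of sorts.
xcs = [3, 5]
xps = [0, 2]
yps = [7, 9]

def nudge_rows(rows):
    """Per-row processing, analogous to nudge"1 in Listing 1."""
    out = []
    for xc, xp, yp in rows:
        if xp == 0:
            out.append((xc - 1, 15, yp))
        else:
            out.append((xc, xp - 1, yp))
    return out

def nudge_columns(xcs, xps, yps):
    """Whole-column processing: each operation sweeps one flat vector,
    which is the wide access pattern J's interpreter rewards."""
    new_xcs = [xc - 1 if xp == 0 else xc for xc, xp in zip(xcs, xps)]
    new_xps = [15 if xp == 0 else xp - 1 for xp in xps]
    return new_xcs, new_xps, list(yps)

# Both layouts compute the same result; only the traversal order differs.
assert nudge_rows(rows) == list(zip(*nudge_columns(xcs, xps, yps)))
```

The row-major form keeps related data together in the source, while the column form trades that legibility for long, homogeneous vectors.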
<h3 id="table-1--time-taken-to-process-a-single-row-of-the-locs-matrix-in-seconds">Table 1. Time taken to process a single row of the <code class="language-plaintext highlighter-rouge">locs</code> matrix, in seconds.</h3>
<table class="table table-bordered table-responsive">
<tbody><tr>
<th> </th>
<th><tt>nudge"1</tt></th>
<th><tt>pfn"1</tt></th>
<th><tt>(pfn f.)"1</tt></th>
<th><tt>pfns"1</tt></th>
</tr>
<tr>
<th>PC</th>
<td>5.15617e_6</td>
<td>2.56521e_6</td>
<td>1.41128e_6</td>
<td>1.26321e_5</td>
</tr>
<tr>
<th>Galaxy Tab 3</th>
<td>5.87663e_5</td>
<td>2.63402e_5</td>
<td>1.46162e_5</td>
<td>1.41435e_4</td>
</tr>
</tbody></table>
<h3 id="table-2--relative-performance-versus-the-different-ways-to-process-locs">Table 2. Relative performance versus the different ways to process <code class="language-plaintext highlighter-rouge">locs</code>.</h3>
<table class="table table-bordered table-responsive">
<tbody><tr>
<th> </th>
<th><tt>nudge"1</tt></th>
<th><tt>pfn"1</tt></th>
<th><tt>(pfn f.)"1</tt></th>
<th><tt>pfns"1</tt></th>
</tr>
<tr>
<th>PC</th>
<td>3.65354</td>
<td>1.81765</td>
<td>1.00000</td>
<td>8.95081</td>
</tr>
<tr>
<th>Galaxy Tab 3</th>
<td>4.02063</td>
<td>1.80212</td>
<td>1.00000</td>
<td>9.67659</td>
</tr>
</tbody></table>
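<p>Table 2 follows directly from Table 1: each timing is divided by that of the fastest variant, <code class="language-plaintext highlighter-rouge">(pfn f.)"1</code>. A quick Python check of the arithmetic, with the PC values copied from Table 1:</p>

```python
# PC timings in seconds per row, copied from Table 1.
pc = {'nudge"1': 5.15617e-6, 'pfn"1': 2.56521e-6,
      '(pfn f.)"1': 1.41128e-6, 'pfns"1': 1.26321e-5}

# Table 2 normalizes against the fastest variant.
baseline = pc['(pfn f.)"1']
relative = {name: t / baseline for name, t in pc.items()}

assert abs(relative['nudge"1'] - 3.65354) < 1e-4
assert abs(relative['pfn"1'] - 1.81765) < 1e-4
assert abs(relative['(pfn f.)"1'] - 1.00000) < 1e-9
assert abs(relative['pfns"1'] - 8.95081) < 1e-4
```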
<h3 id="listing-1">Listing 1.</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> nudge =: 3 : 0
'xc xp yp' =. y
if. xp=0 do. (xc-1),15,yp
else. xc,(xp-1),yp end.
)
</code></pre></div></div>
<h3 id="listing-2">Listing 2.</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mez =: 0&=@({.@}.)
pfna =: _1&+@{.,15"_,{:
pfnb =: {.,_1&+@({.@}.),{:
pfn =: pfnb`pfna@.mez
</code></pre></div></div>
<h3 id="listing-3">Listing 3.</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> pfns =: 3 : '({.,_1&+@({.@}.),{:)`(_1&+@{.,15"_,{:)@.(0&=@({.@}.)) y'
NB. I created this fully-expanded definition by issuing pfn f. at J console,
NB. and copying result into the script.
</code></pre></div></div>
<h3 id="listing-4">Listing 4.</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
stepping : 9
cpu MHz : 3401.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 6820.10
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
</code></pre></div></div>
<h3 id="listing-5">Listing 5.</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>3 cells constant /row
130000000 /row * constant /locs
/locs allocate throw constant locs
locs /locs + constant locs)
: pfna -1 swap cell+ +! ;
: pfnb -1 over +! 15 swap cell+ ! ;
: mez cell+ @ 0= ;
: pfn dup mez if pfnb exit then pfna ;
: row+ /row + ;
: pfn"1 begin dup locs) xor while dup pfn row+ repeat drop ;
locs pfn"1
locs pfn"1
locs pfn"1
locs pfn"1
locs pfn"1
locs pfn"1
locs pfn"1
locs pfn"1
locs pfn"1
locs pfn"1
bye
</code></pre></div></div>
<h3 id="listing-6">Listing 6.</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kc5tja@deneb ~ $ time gforth pfn.fs
real 0m49.110s
user 0m40.919s
sys 0m0.980s
kc5tja@deneb ~ $ time gforth pfn.fs
real 0m48.459s
user 0m41.027s
sys 0m0.880s
kc5tja@deneb ~ $ time gforth pfn.fs
real 0m48.907s
user 0m40.871s
sys 0m1.024s
kc5tja@deneb ~ $ time gforth pfn.fs
real 0m49.629s
user 0m40.931s
sys 0m0.968s
kc5tja@deneb ~ $ time gforth pfn.fs
real 0m50.148s
user 0m40.999s
sys 0m0.908s
kc5tja@deneb ~ $ ./j64-701/bin/jconsole
(+/%#) 49.110 48.459 48.907 49.629 50.148
49.2506
49.2506%1.3e9
3.78851e_8
</code></pre></div></div>
How Does MISC Stand Up to RISC, CISC? (3/7)2013-12-04T00:00:00+00:00http://sam-falvo.github.io/2013/12/04/how-does-misc-stand-up-3<p>I have yet to see objective comparisons between MISC, RISC, and CISC architectures.
MISC processors often get a bad rap for their stack-based micro-architectures; however, generally-available evidence to support these negative views seems lacking.
I try to put these processor architectures into perspective using real-world experience with some processors I’ve used directly in the past.
I hope the data revealed will help others in answering questions on which architecture proves right for them.</p>
<p>In this article, I highlight the F18A core, which comprises the computing elements of the Green Arrays’ GA4 and GA144 chips.
Details of this processor may be found <a href="http://www.greenarraychips.com/home/documents/">at the Green Arrays website</a>.</p>
<!-- more -->
<h3 id="space-consumption">Space Consumption</h3>
<p>As implemented in any Green Arrays product, the F18A is incapable of supporting a program as long as what I’ve coded in this article.
However, there’s nothing fundamental about the F18A core architecture that prevents its use with systems containing larger quantities of memory.
The analysis that follows assumes such a hypothetical machine.</p>
<p>As before, the different routines in the program have been sized. The table below refers to the program found in Listing 1.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name Size (%)
====================
plt 14 13.3
pltc 5 4.8
zerop (0p) 12 11.4
zeroq (0q) 24 22.9
addrs 2 1.9
flip 5 4.8
crsr 39 37.1
plotch 4 3.8
====================
TOTAL 105 words (236.25 bytes)
====================
</code></pre></div></div>
<p>The F18A demonstrates significantly smaller code than the S16X4, despite an 18-bit vs 16-bit word size discrepancy.</p>
<p>I counted 41 uses of the <code class="language-plaintext highlighter-rouge">li</code> instruction, which means 41 of the 105 words (39%) contain literal information, much of it repeated.
Unlike the S16X4, the F18A supports very rapid subroutine calls and returns in hardware.
Idiomatic development of F18A software would factor out frequently repeated sequences of instructions, such as effective address calculations, into subroutines.
This helps improve code density while incurring only a modest performance penalty.
Listing 2 illustrates how one might refactor listing 1.
The table below illustrates how this refactoring affects the final program size.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name Size (%)
====================
plt 14 18.2
pltc 5 6.5
_outsfc__ 3 3.9
_outsfc_os_x 2 2.6
_outsfc_os_y 2 2.6
zerop (0p) 7 9.1
zeroq (0q) 15 19.5
addrs 2 2.6
flip 3 3.9
crsr 20 26.0
plotch 4 5.2
====================
TOTAL 77 words (173.25 bytes)
====================
</code></pre></div></div>
<p>That simple change yields a 26.7% reduction in program size.</p>
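<p>The reduction quoted above follows directly from the two size tables; a quick check of the arithmetic, including the byte figures implied by the 18-bit word size:</p>

```python
# Program sizes in 18-bit words, from the two tables above.
before, after = 105, 77

# Fractional size reduction from the refactoring.
reduction = 1 - after / before
assert abs(reduction - 0.267) < 0.001   # the 26.7% quoted in the text

# Byte totals follow from the 18-bit word size.
assert before * 18 / 8 == 236.25
assert after * 18 / 8 == 173.25
```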
<h3 id="runtime-performance">Runtime Performance</h3>
<p>Without exception, <em>all</em> F18A instructions take one instruction cycle to execute.
(Observe that the F18A lacks external clocking, being a fully asynchronous architecture.
Nonetheless, it offers a uniform instruction execution cycle length.)
However, to conserve the maximum amount of energy, the arithmetic unit inside the core relies on ripple-carry; thus, calculating a sum frequently requires more than one cycle.
This explains why you need a <code class="language-plaintext highlighter-rouge">nop</code> in front of all <code class="language-plaintext highlighter-rouge">add</code> instructions (unless you can statically prove ripple delays are insignificant).</p>
<p>Unlike the S16X4, the F18A requires that the programmer preload a memory address to fetch from or store to into a special register.
Thus, all memory references require no fewer than two instructions as well.</p>
<p>Finally, as with the S16X4, the CPU takes an additional instruction cycle to fetch the next batch of up to four instructions.
However, with a truncated fourth slot, only a subset of instructions may appear there, often resulting in a word containing no more than three instructions.
This results in an increased instruction fetch rate.</p>
<p>The following chart refers to the program in listing 1, for it represents the fastest implementation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name Cycles (%)
====================
plt 33+(0*0)
pltc 11+(8*(7+plt))
zerop (0p) 29+(0*0)
zeroq (0q) 107+(0*0)
addrs 4+(0*0)+0p+0q
flip 13+(0*0)
crsr 57+(0*0) (worst case)
plotch 8+(0*0)+addrs+crsr+pltc
====================
TOTAL 262+(320) = 582 cycles
====================
</code></pre></div></div>
<p>This table tells us that initializing a bitmap pointer and printing a single character takes 582 cycles to complete, minimum.
Note that I made no attempt to cache structure field references, and yet, the F18A still proves faster than the S16X4 by 4.6%.
I believe the existence of the return stack and its supporting instructions account for the performance boost.
In particular, since the return stack holds return addresses (hence its name), we no longer pay a 7-cycle cost on every subroutine invocation just to manage return addresses.
Moreover, for finite loops, the F18A hardware provides the <code class="language-plaintext highlighter-rouge">next</code> and <code class="language-plaintext highlighter-rouge">unext</code> instructions, which combine a decrement of the counter at the top of the return stack with a branch-if-non-zero into a single instruction.
With the S16X4, I had to decrement and branch if non-zero manually, further adding to a subroutine’s latency.</p>
<p>Structure field caching would significantly reduce the fixed overhead of calling <code class="language-plaintext highlighter-rouge">plotch</code>, from 262 cycles to 191 (a 16.3% improvement over S16X4 baseline), and would actually bring program size down as well.
What’s more, it would enable faster character plots after the bitmap calculations have already been done — additional characters would require only 320 cycles.
Replacing the multiplication in zeroq with a table lookup would save 58 more cycles, reducing the fixed overhead further to 133 cycles (a final boost of 25.7% over baseline).</p>
<p>Finally, a fair amount of memory accesses involve read-modify-write operations (e.g., incrementing <code class="language-plaintext highlighter-rouge">p</code> and <code class="language-plaintext highlighter-rouge">q</code>).
With the S16X4, I had to push the addresses onto the data stack each time a memory access became necessary.
With the F18A, the A register caches the effective address, allowing me to amortize a read and write into three instruction cycles (loading A, fetch, then store) versus four on the S16X4 (load address onto stack, fetch, load address again, store).</p>
<h3 id="cost-of-structure-field-access">Cost of Structure Field Access</h3>
<p>Accessing fields off of a base pointer takes eleven cycles and five words to complete:</p>
<ol>
<li>Fetch the <code class="language-plaintext highlighter-rouge">li base; lda; fma</code> bundle.</li>
<li>Spend three clock cycles executing these instructions.</li>
<li>Fetch the <code class="language-plaintext highlighter-rouge">li field_offset; nop; add</code> bundle.</li>
<li>Spend three cycles executing these instructions.</li>
<li>Fetch the <code class="language-plaintext highlighter-rouge">lda; fma</code> bundle.</li>
<li>Spend two cycles executing these instructions.</li>
</ol>
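<p>The eleven-cycle figure is just the sum of the enumerated steps. A sketch of the tally, pairing each bundle fetch with the cycles spent executing it:</p>

```python
# (fetch_cycles, execute_cycles) per bundle, following the steps above.
bundles = [
    (1, 3),  # bundle 1: load base pointer into A, fetch the pointer
    (1, 3),  # bundle 2: load field offset, add to base
    (1, 2),  # bundle 3: load computed address into A, fetch the field
]

total = sum(fetch + execute for fetch, execute in bundles)
assert total == 11  # the eleven cycles quoted above
```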
<p>Accessing an absolute memory location takes three cycles on average, four worst-case, while requiring significantly fewer words of memory to encode.
If the effective address to fetch from or store to remains unchanged, a cycle may be avoided, as the A register already contains the desired address.
Clearly, executing references to absolute memory locations proves much faster, and as well, consumes far less program memory.
Though not illustrated in this example, the F18A also supports accessing memory and post-incrementing the A register.
For some classes of software, this can yield significant savings in time.</p>
<p>If left as-is, the F18A MISC would actually be slower than a Motorola 68000 when reading or writing structure fields.
In our specific example, the savings from having a hardware return stack slightly exceeds the cost of repeated structure field references.</p>
<h3 id="cost-of-indirect-subroutine-call">Cost of Indirect Subroutine Call</h3>
<p>Idiomatically, the F18A environment stores return addresses on the hardware return stack.
Implementing an indirect subroutine call to any arbitrary address involves pushing a fake return address on the stack, then returning to that address.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>li subroutine_pointer
lda
fma
push
rfs ; or 'ex' if you're calling from inside a larger subroutine.
</code></pre></div></div>
<p>This takes seven cycles to complete; since the processor maintains the linkage information on the return stack, no additional time need be spent.</p>
<p>In this case, the S16X4 runs closer to RISC speeds than the F18A does, which still runs faster than classic CISC.
On the other hand, the F18A doesn’t require managing return addresses.
While the indirect call may require more time, the fact that the called subroutine need not have any prolog and/or epilog code may end up with a net savings in time.</p>
<h3 id="conclusion">Conclusion</h3>
<p>With only 39% of the program space consumed by 18-bit literals, the F18A provides significantly superior code density compared to the S16X4, despite packing fewer instructions per packet.</p>
<p>The F18A’s reasonably fast indirect subroutine performance makes it adept at object-oriented, functional, and highly modular software design.</p>
<p>As with the S16X4, explicit exposure of effective address calculations provides opportunities for the programmer to amortize them into near insignificance, particularly if exploiting sequential memory access patterns.
With careful attention to program structure, performance can be weighted closer to RISC levels of performance.</p>
<h3 id="listing-1">Listing 1.</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>outsfc: ds 1
os_bitmap equ 0
os_fontBase equ 1
os_flip equ 2
os_scroll equ 3
os_x equ 4
os_y equ 5
os_ch equ 6
p: ds 1
q: ds 1
plt: li p
sta
fma
sta
fma
li q
sta
fma
sta
sma
li p
sta
fma
li 256
nop
add
sma
li q
sma
fma
li 80
nop
add
sma
rfs
pltc: li 8
push
L01: call plt
next L01
rfs
zerop: li outsfc
lda
fma
li os_fontBase
nop
add
lda
fma
li outsfc
lda
fma
li os_ch
nop
add
lda
fma
nop
add
li p
lda
sma
rfs
zeroq: li outsfc
lda
fma
li os_y
nop
add
lda
fma
li 640
lda
li 18
push
li 0
L02: nop
muls
unext
drop
drop
sta
li outsfc
lda
fma
li os_bitmap
nop
add
lda
fma
add
li outsfc
lda
fma
li os_x
nop
add
lda
fma
nop
add
li q
lda
sma
rfs
addrs: call zerop
jmp zeroq
flip: li outsfc
lda
fma
li os_flip
nop
add
lda
fma
push
rfs
crsr: li outsfc
lda
fma
li os_x
nop
add
lda
fma
li -79
nop
add
jpl L03
drop
li outsfc
lda
fma
li os_x
nop
add
lda
fma
li 1
nop
add
sma
rfs
L03: drop
li 0
li outsfc
lda
fma
li os_x
nop
add
lda
sma
li outsfc
lda
fma
li os_y
nop
add
lda
fma
li -24
nop
add
jpl L04
drop
li outsfc
lda
fma
li os_y
nop
add
lda
fma
li 1
nop
add
sma
rfs
L04: drop
li outsfc
lda
fma
li os_scroll
nop
add
lda
fma
push
rfs
plotch: call addrs
call pltc
call crsr
jmp flip
</code></pre></div></div>
<h3 id="listing-2">Listing 2.</h3>
<p>This program should behave as listing 1, but has been refactored for maximum code density.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>outsfc: ds 1
os_bitmap equ 0
os_fontBase equ 1
os_flip equ 2
os_scroll equ 3
os_x equ 4
os_y equ 5
os_ch equ 6
p: ds 1
q: ds 1
plt: li p
sta
fma
sta
fma
li q
sta
fma
sta
sma
li p
sta
fma
li 256
nop
add
sma
li q
sma
fma
li 80
nop
add
sma
rfs
pltc: li 8
push
L01: call plt
next L01
rfs
_outsfc__: li outsfc
lda
fma
nop
add
lda
rfs
_outsfc_os_x: li os_x
jmp _outsfc__
_outsfc_os_y: li os_y
jmp _outsfc__
zerop: li os_fontBase
call _outsfc__
fma
li os_ch
call _outsfc__
fma
nop
add
li p
lda
sma
rfs
zeroq: call _outsfc_os_y
fma
li 640
lda
li 18
push
li 0
L02: nop
muls
unext
drop
drop
sta
li os_bitmap
call _outsfc__
fma
nop
add
call _outsfc_os_x
fma
nop
add
li q
lda
sma
rfs
addrs: call zerop
jmp zeroq
flip: li os_flip
call _outsfc__
fma
push
rfs
crsr: call _outsfc_os_x
fma
li -79
nop
add
jpl L03
drop
fma
li 1
nop
add
sma
rfs
L03: drop
li 0
sma
call _outsfc_os_y
fma
li -24
nop
add
jpl L04
drop
fma
li 1
nop
add
sma
rfs
L04: drop
li os_scroll
call _outsfc__
fma
push
rfs
plotch: call addrs
call pltc
call crsr
jmp flip
</code></pre></div></div>
How Does MISC Stand Up to RISC, CISC? (2/7)2013-10-24T00:00:00+00:00http://sam-falvo.github.io/2013/10/24/how-does-misc-stand-up-2<p>I have yet to see objective comparisons between MISC, RISC, and CISC architectures.
MISC processors often get a bad rap for their stack-based micro-architectures; however, generally-available evidence to support these negative views seems lacking.
I try to put these processor architectures into perspective using real-world experience with some processors I’ve used directly in the past.
I hope the data revealed will help others in answering questions on which architecture proves right for them.</p>
<p>In this article, I highlight the S16X4 processor, a design which I built for the Kestrel-2 home-made computer.
Details of this processor may be found in <a href="https://github.com/sam-falvo/kestrel/blob/master/cores/S16X4/doc/datasheet.pdf">its datasheet</a>.</p>
<!-- more -->
<h3 id="space-consumption">Space Consumption</h3>
<p>The S16X4 derives from the Steamer16, a word-addressed machine operating on 16-bit words.
The S16X4 includes instructions for addressing memory with byte granularity,
which I exploit in the example program to plot an 8x8 glyph onto a bitmap.
However, this does not materially affect program size.
Even were I to stick with pure word addressing, I would just say that the font data is packed with two bits per pixel.
This results in a program with exactly the same number of instructions to execute.</p>
<p>Different routines in the program have been sized:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name Size (%)
====================
plt 16 9.8%
pltc 18 11.0%
zerop (0p) 8 4.9%
zeroq (0q) 14 8.5%
addrs 8 4.9%
crsr 33 20.1%
plotch 15 9.1%
setOutSfc 52 31.7%
====================
TOTAL 164 words (328 bytes)
====================
</code></pre></div></div>
<p>I count 102 uses of the <code class="language-plaintext highlighter-rouge">li</code> instruction, which means 102 of the 164 words (62%) contain literal information, much of it repeated.
With a richer instruction set, one could find opportunities to reduce the dependency on the <code class="language-plaintext highlighter-rouge">li</code> instruction, thus helping to improve code-density.</p>
<h3 id="runtime-performance">Runtime Performance</h3>
<p>Without exception, <em>all</em> S16X4 instructions take one clock cycle to execute.
Additionally, the CPU takes an extra clock cycle to fetch each batch of four instructions.
These simple rules make estimating run-time performance easy, as the following table illustrates.</p>
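<p>Those two rules admit a simple back-of-the-envelope cycle estimator. The sketch below is my own model inferred from the article's figures, not an official timing formula; counting the instructions in Listing 1's <code class="language-plaintext highlighter-rouge">plt</code> routine gives 23, which the model maps to the 29 cycles charged to it below.</p>

```python
import math

def s16x4_cycles(n_instructions):
    """Estimated S16X4 cycle count: one cycle per instruction executed,
    plus one cycle to fetch each four-instruction bundle.  Assumes li
    literal words add no extra fetch cycles, consistent with the plt
    figure in the table."""
    return n_instructions + math.ceil(n_instructions / 4)

# plt in Listing 1 contains 23 instructions; the table charges it 29 cycles.
assert s16x4_cycles(23) == 29
```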
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name Cycles (%)
====================
plt 29+(0*0)
pltc 11+(8*(16+plt))
zerop (0p) 15+(0*0)
zeroq (0q) 27+(0*0)
addrs 12+(0*0)+0p+0q
crsr 55+(0*0)
plotch 22+(0*0)+addrs+crsr+pltc
setOutSfc 95+(0*0)
====================
TOTAL 266+(344) = 610 cycles
====================
</code></pre></div></div>
<p>This table tells us that initializing a bitmap and printing a single character takes 610 cycles to complete, minimum.
Additional characters may be plotted in only 515 cycles, for no need exists to invoke <code class="language-plaintext highlighter-rouge">setOutSfc</code> for each character intended.</p>
<p>The <code class="language-plaintext highlighter-rouge">pltc</code> procedure requires the lion’s share of this time, taking 371 cycles on its own (11+(8*(16+29))).
<code class="language-plaintext highlighter-rouge">plt</code> takes seven cycles to transfer a single byte from the font bitmap to the destination bitmap.
It takes a further seven cycles to increment <code class="language-plaintext highlighter-rouge">p</code>, and another seven for <code class="language-plaintext highlighter-rouge">q</code>.
Within this routine, at least, the S16X4 compares with a classic CISC processor equipped with many general-purpose registers, such as the MC68000.</p>
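<p>The cycle counts quoted above can be reproduced from the table's formulas; a quick Python check:</p>

```python
plt = 29            # cycles for one plt call, from the table above
call_overhead = 16  # per-iteration cycles around each plt call
pltc_fixed = 11

# pltc = 11+(8*(16+plt)), as given in the table and quoted in the text.
pltc = pltc_fixed + 8 * (call_overhead + plt)
assert pltc == 371

# First character costs 610 cycles; later ones skip the 95-cycle setOutSfc.
total, setoutsfc = 610, 95
assert total - setoutsfc == 515
```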
<h3 id="cost-of-structure-field-access">Cost of Structure Field Access</h3>
<p>Accessing fields off of a base pointer takes seven cycles and four words to complete:</p>
<ol>
<li>Fetch the <code class="language-plaintext highlighter-rouge">li base; fwm; li field_offset; add</code> bundle.</li>
<li>Spend four cycles executing these instructions.</li>
<li>Fetch the <code class="language-plaintext highlighter-rouge">fwm</code> in the next bundle.</li>
<li>Spend one cycle executing this instruction.</li>
</ol>
<p>Accessing an absolute memory location takes two cycles on average, three worst-case, while requiring half as many words of memory to encode.
Clearly, executing references to absolute memory locations proves much faster, and as well, consumes far less program memory.</p>
<p>If left as-is, the S16X4 MISC would behave not entirely unlike a Motorola 68000 when reading or writing structure fields.
I’ve optimized the software in listing 1 to predominantly access absolute memory locations.
To ensure these absolute locations always reflect the caller’s choice of OutputSurface, the caller must invoke <code class="language-plaintext highlighter-rouge">setOutSfc</code> before using <code class="language-plaintext highlighter-rouge">plotch</code> to render text.
Developers of MISC software will recognize this style of software design as more idiomatic,
largely for the aforementioned performance and program size reasons.</p>
<p>Observe that the <code class="language-plaintext highlighter-rouge">outsfc</code> pointer never changes while in normal use by either the calling or called subprogram.
<code class="language-plaintext highlighter-rouge">setOutSfc</code> works like a cache refill operation — it writes changed fields back to the original structure, then caches fields from the new structure into absolute locations.
Caching, then, amortizes the cost of prefetching fields across all absolute memory references corresponding to a field access.</p>
<p>It turns out, as shown in listing 1, the program makes 15 field accesses for each call to <code class="language-plaintext highlighter-rouge">plotch</code>.
The <code class="language-plaintext highlighter-rouge">setOutSfc</code> routine consumes 95 cycles, assuming it writes back as well.
Thus, when writing a character, those 95 cycles virtually “spread out” over the 15 field accesses.
This corresponds to (95/15)=6.333 cycles per field access.
Since each absolute reference averages two cycles, the <code class="language-plaintext highlighter-rouge">plotch</code> routine runs <em>as if</em> each field access consumed 2+6.333 = 8.333 cycles.</p>
<p>As it happens, caching holds no value when plotting only a single character on the bitmap.
However, when repeatedly invoking <code class="language-plaintext highlighter-rouge">plotch</code>, e.g., when printing a <em>string</em> of characters, overhead drops very rapidly.
Printing the text, “Hello”, involves only five characters.
Yet, since the overhead for <code class="language-plaintext highlighter-rouge">setOutSfc</code> remains constant, amortized overhead becomes (95/(5*15))=1.267 cycles,
such that <code class="language-plaintext highlighter-rouge">plotch</code> behaves as if each field access takes only 3.267 cycles.
If printing an average-sized line of text, say 40 characters, the overhead drops further to sub-unity: 0.158 cycles!
If printing an average-sized screen, say in the ballpark of 1000 characters for an 80x25 bitmap, overhead drops to insignificance: 0.00633 cycles per field access.
The less frequently a program calls <code class="language-plaintext highlighter-rouge">setOutSfc</code>, the closer to two cycles each field access becomes.</p>
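<p>The amortization argument above reduces to one formula: the effective cost per field access is the two-cycle absolute reference plus the 95-cycle refill spread over every field access made before the next <code class="language-plaintext highlighter-rouge">setOutSfc</code>. A small Python model, with parameters mirroring the article's figures:</p>

```python
def amortized_field_access(chars, base=2.0, refill=95, accesses_per_char=15):
    """Effective cycles per field access when one setOutSfc refill
    (95 cycles) is spread over all field accesses made while plotting
    `chars` characters (15 accesses each, per the text)."""
    return base + refill / (chars * accesses_per_char)

assert abs(amortized_field_access(1) - 8.333) < 0.001      # one character
assert abs(amortized_field_access(5) - 3.267) < 0.001      # "Hello"
assert abs(amortized_field_access(40) - 2.158) < 0.001     # one line
assert abs(amortized_field_access(1000) - 2.00633) < 1e-4  # full screen
```

As the character count grows, the result approaches the bare two-cycle absolute reference, exactly as the prose argues.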
<p>Therefore, even though we needed to alter our strategy a bit to realize its benefits,
the S16X4 MISC processor competes favorably against very sophisticated CISC architectures like the MC68000 for structure field references in the best case,
and equals their performance in the pathologically worst case.
The MIPS R3000 still performs better due to its dependency on pipelining, but only by a factor of two in the long term.</p>
<h3 id="cost-of-indirect-subroutine-call">Cost of Indirect Subroutine Call</h3>
<p>The S16X4 lacks <em>any</em> kind of subroutine mechanism in hardware.
Thus, a programmer or compiler must synthesize this capability from existing instructions and conventions.
Idiomatically, the S16X4 run-time environment stores return addresses in absolute memory locations, to save time.
Of course, this prevents re-entrancy and recursion without more complex prolog and epilog procedures.</p>
<p>A normal subroutine, then, involves a code sequence such as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>li return_address
li subroutine_address
go
</code></pre></div></div>
<p>This takes four cycles to complete, not including latency introduced by the subroutine’s prolog and epilog.
In this case, the S16X4 falls between CISC and RISC in subroutine performance.</p>
<p>Because the programmer must synthesize subroutine calls of all kinds, supporting indirect calls comes naturally and easily.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>li return_address
li pointer_to_subr_address
fwm
go
</code></pre></div></div>
<p>Note the simple introduction of a fetch instruction just before the go opcode.
This adds a cycle to the operation.</p>
<p>In this case, the S16X4 runs closer to RISC speeds than classic CISC.
Indeed, the 65C816 and MC68000 both take something on the order of 20 clock cycles to accomplish the same task.
(Later articles in this series will bear this out.)</p>
<h3 id="conclusion">Conclusion</h3>
<p>With 62% of the program space consumed by 16-bit literals, the S16X4 provides a code density on par with a 32-bit RISC.</p>
<p>I was surprised to find that the S16X4, despite lacking overt support for structured data access, potentially performs better at that task than it does moving bytes into a bitmap,
<em>provided</em> one adopts some means of amortizing the effective address calculations, such as with field caching.
If one cannot avoid the effective address calculations (e.g., as with creating or destroying a recursive subroutine’s activation frame), performance will <em>never</em> be worse than a classic CISC.</p>
<p>The S16X4’s unusually high indirect subroutine performance makes it surprisingly adept at object-oriented, functional, and highly modular software design.</p>
<p>The S16X4, despite having only 12 instructions, remains a surprisingly powerful MISC processor that offers performance somewhere in between classic CISC and RISC.
Explicit exposure of effective address calculations provides opportunities for the programmer to amortize them into near insignificance.
With careful attention to program structure, performance can be weighted closer to RISC levels of performance.</p>
<h3 id="listing-1">Listing 1.</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>outsfc: .word 0
os_bitmap equ 0
os_fontBase equ 2
os_flip equ 4
os_scroll equ 6
os_x equ 8
os_y equ 10
os_ch equ 12
p: .word 0
q: .word 0
; Extra storage for temporaries, etc.
plt_pc: .word 0
pltc_pc: .word 0
zerop_pc: .word 0
zeroq_pc: .word 0
addrs_pc: .word 0
plotch_pc: .word 0
setOutSfc_pc: .word 0
t0: .word 0
outsfcNew: .word 0
; Cached fields of current OutputSurface.
cc_bitmap: .word 0
cc_fontBase: .word 0
cc_flip: .word 0
cc_scroll: .word 0
cc_x: .word 0
cc_y: .word 0
cc_ch: .word 0
plt: li plt_pc
swm
li p
fwm
fbm
li q
fwm
sbm
li p
fwm
li 256
add
li p
swm
li q
fwm
li 80
add
li q
swm
li plt_pc
fwm
go
pltc: li pltc_pc
swm
li 8
li t0
swm
pltc_l1: li pltc_l2
li plt
go
pltc_l2: li t0
fwm
li -1
add
li t0
swm
li t0
fwm
li pltc_l1
nzgo
li pltc_pc
fwm
go
zerop: li zerop_pc
swm
li cc_fontBase
fwm
li cc_ch
fwm
add
li p
swm
li zerop_pc
fwm
go
xref mulBy640
zeroq: li zeroq_pc
swm
li cc_y
fwm
li cc_y
fwm
add
li mulBy640
add
fwm
li cc_bitmap
fwm
add
li cc_x
fwm
add
li q
swm
li zeroq_pc
fwm
go
addrs: li addrs_pc
swm
li addrs_l1
li zerop
go
addrs_l1: li addrs_pc
fwm
li zeroq
go
crsr: li crsr_pc
swm
li cc_x
fwm
li -79
add
li $8000
and
li crsr_l1
zgo
li cc_x
fwm
li 1
add
li cc_x
swm
li crsr_pc
fwm
go
crsr_l1: li 0
li cc_x
swm
li cc_y
fwm
li -24
add
li $8000
and
li crsr_l2
zgo
li cc_y
fwm
li 1
add
li cc_y
swm
crsr_l3: li crsr_pc
fwm
go
crsr_l2: li crsr_l3
li cc_scroll
fwm
go
xdef plotch
plotch: li plotch_pc
swm
li plotch_l1
li addrs
go
plotch_l1: li plotch_l2
li pltc
go
plotch_l2: li plotch_l3
li crsr
go
plotch_l3: li plotch_pc
fwm
li cc_flip
fwm
go
xdef setOutSfc
setOutSfc: li setOutSfc_pc
swm
li outsfc
fwm
li setOutSfc_l1
zgo
li cc_x
fwm
li outsfc
fwm
li os_x
add
swm
li cc_y
fwm
li outsfc
fwm
li os_y
add
swm
setOutSfc_l1: li outsfcNew
fwm
li outsfc
swm
li outsfc
fwm
li os_bitmap
add
fwm
li cc_bitmap
swm
li outsfc
fwm
li os_fontBase
add
fwm
li cc_fontBase
swm
li outsfc
fwm
li os_flip
add
fwm
li cc_flip
swm
li outsfc
fwm
li os_scroll
add
fwm
li cc_scroll
swm
li outsfc
fwm
li os_x
add
fwm
li cc_x
swm
li outsfc
fwm
li os_y
add
fwm
li cc_y
swm
li outsfc
fwm
li os_ch
add
fwm
li cc_ch
swm
li setOutSfc_pc
fwm
go
</code></pre></div></div>
How Does MISC Stand Up to RISC, CISC? (1/7)2013-10-21T00:00:00+00:00http://sam-falvo.github.io/2013/10/21/how-does-misc-stand-up-to-risc-cisc<p>We often hear anecdotes about how RISC so deftly outperforms traditional CISC architectures.
(Indeed, it’s important to qualify, for most CISC architectures today are really RISC machines in disguise.)
But, if you’re involved with Forth at all, you’ve almost certainly also heard of an architecture style called MISC.
No doubt, you’ve probably heard how favorably MISC compares to RISC.
However, I’ve never seen an objective comparison between CISC, MISC, and RISC architecture styles from the software perspective.</p>
<p>I try to put these processor architectures into perspective using real-world experience with processors I’ve used in the past.
This article introduces my methodology for accomplishing this.
The next article will dive into the S16X4 MISC architecture.</p>
<!-- more -->
<h3 id="features-im-interested-in-comparing">Features I’m Interested In Comparing</h3>
<p>Efficient expression evaluation almost goes without saying, yet it isn’t the only thing one can do with a processor.
Without the ability to retrieve data from memory and store data back, expression evaluation serves no purpose.
Modern software design relies heavily on <em>records</em>, each of which contains <em>fields</em>, which a program may randomly access.
Thus, we want to specifically exercise this method of information storage and retrieval.
However, even information access and its subsequent evaluation no longer cuts it for, e.g., desktop- or server-grade software architecture.
Today, a good architecture will naturally support polymorphic <em>interfaces</em> as well.</p>
<p>Everywhere you look, test-driven development techniques aid in high-quality software production.
Their benefits, being of process and not of technology, transcend both the choice of processor and the choice of programming language.
Software produced in this way tends to heavily depend on polymorphic interfaces to provide linkage to other components,
typically written independently and not necessarily in the order of dependency.
Explicit processor support for polymorphism, in the form of fast, indirect subroutine calls,
aids in writing independently testable chunks of code that still run fast in production environments.</p>
<h3 id="program-description">Program Description</h3>
<p>To provide a consistent basis for comparison,
I’m going to implement programs for several processors, so as to demonstrate expression evaluation,
access to parameters found in control blocks pointed to through a single pointer,
and calling services through explicit interfaces.
The problem to solve involves placing text glyphs on a bitmap at some current cursor position, then advancing the cursor.
I’ll write the software in such a way as to not depend on any particular OS or hardware feature.
Thus, access to system-level services must occur through a (polymorphic) interface.</p>
<p>The listing below contains the basic program, written in high-level Forth.
It contains a mix of indirect subroutine calls, expression evaluation, and structure access sufficient for comparison purposes.
Since MISC architectures (so far) always implement stack architectures, I assume the reader understands Forth.
However, just in case you do not, I also list the closest C equivalent to aid understanding.</p>
<p>Listing 1.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>\ Forth Program
variable outsfc \ pointer to an OutputSurface structure
0 cells constant os_bitmap \ pointer to bitmap memory
1 cells constant os_fontBase \ pointer to font memory
2 cells constant os_flip \ pointer to function to display bitmap
3 cells constant os_scroll \ pointer to function to scroll bitmap
4 cells constant os_x \ where to put char (0 <= x < 80)
5 cells constant os_y \ where to put char (0 <= y < 25)
6 cells constant os_ch \ Character to display (0..255)
variable p
variable q
\ Plot character by moving font data into the bitmap.
: plt p @ c@ q @ c! 256 p +! 80 q +! ;
: pltc 8 for plt next ;
\ Configure p and q temporaries to point to the correct values prior to pasting
\ the character.
: 0p outsfc @ os_fontBase + @ outsfc @ os_ch + @ + p ! ;
: 0q outsfc @ os_y + @ 640 * outsfc @ os_bitmap + @ + outsfc @ os_x + @ +
q ! ;
: addrs 0p 0q ;
\ After drawing, we "flip" the display to render it to the user's display.
: flip outsfc @ os_flip + @ execute ;
\ Crsr updates the cursor position on your behalf. If necessary, it will
\ scroll as well.
: crsr outsfc @ os_x + @ 79 u< if 1 outsfc @ os_x + +! exit then
0 outsfc @ os_x + !
outsfc @ os_y + @ 24 u< if 1 outsfc @ os_y + +! exit then
outsfc @ os_scroll + @ execute ;
\ Public entry point.
: plotch addrs pltc crsr flip ;
</code></pre></div></div>
<p>Listing 2.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Equivalent C program */</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="n">OutputSurface</span> <span class="p">{</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">bitmap</span><span class="p">;</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">fontBase</span><span class="p">;</span>
<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">flip</span><span class="p">)();</span>
<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">scroll</span><span class="p">)();</span>
<span class="kt">int</span> <span class="n">x</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">y</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">ch</span><span class="p">;</span>
<span class="p">}</span> <span class="n">OutputSurface</span><span class="p">;</span>
<span class="n">OutputSurface</span> <span class="o">*</span><span class="n">outsfc</span><span class="p">;</span>
<span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">plotch</span><span class="p">()</span> <span class="p">{</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">outsfc</span><span class="o">-></span><span class="n">fontBase</span> <span class="o">+</span> <span class="n">outsfc</span><span class="o">-></span><span class="n">ch</span><span class="p">;</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">outsfc</span><span class="o">-></span><span class="n">bitmap</span> <span class="o">+</span> <span class="n">outsfc</span><span class="o">-></span><span class="n">y</span> <span class="o">*</span> <span class="mi">640</span> <span class="o">+</span> <span class="n">outsfc</span><span class="o">-></span><span class="n">x</span><span class="p">;</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="o">*</span><span class="n">q</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
<span class="n">p</span> <span class="o">+=</span> <span class="mi">256</span><span class="p">;</span>
<span class="n">q</span> <span class="o">+=</span> <span class="mi">80</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span><span class="p">(</span><span class="n">outsfc</span><span class="o">-></span><span class="n">x</span> <span class="o"><</span> <span class="mi">79</span><span class="p">)</span> <span class="p">{</span>
<span class="n">outsfc</span><span class="o">-></span><span class="n">x</span><span class="o">++</span><span class="p">;</span>
<span class="n">outsfc</span><span class="o">-></span><span class="n">flip</span><span class="p">();</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">outsfc</span><span class="o">-></span><span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">if</span><span class="p">(</span><span class="n">outsfc</span><span class="o">-></span><span class="n">y</span> <span class="o"><</span> <span class="mi">24</span><span class="p">)</span> <span class="p">{</span>
<span class="n">outsfc</span><span class="o">-></span><span class="n">y</span><span class="o">++</span><span class="p">;</span>
<span class="n">outsfc</span><span class="o">-></span><span class="n">flip</span><span class="p">();</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">outsfc</span><span class="o">-></span><span class="n">scroll</span><span class="p">();</span>
<span class="n">outsfc</span><span class="o">-></span><span class="n">flip</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="selection-of-processors">Selection of Processors</h3>
<p>I will compare microprocessors with which I have direct and/or relatively recent experience in each of the architecture styles.
In the MISC category, I’ll include the S16X4 (a MISC of my own design, powering my Kestrel-2 computer) and the F18A MISC core designed by Chuck Moore himself.
Because of how reliable and predictable MISC architectures are, I’ll make regular comparisons to an as-yet unimplemented CPU intended for the Kestrel-3, the eP64.</p>
<p>In the CISC category, I’ll showcase the Western Design 65816 microprocessor running in 16-bit native mode, and
the Motorola 68000 microprocessor, at least as used in the Commodore-Amiga 500.
Other variants of the 68000 exist these days, but as I’ve not used those, I cannot guarantee their timing remains consistent with the original design.</p>
<p>I’ve the least experience with overtly RISC processors, but I once worked with MIPS R3000-based hardware back at Hifn, a long time ago.
I’ll try to muddle my way through trying to remember MIPS mnemonics and delay slot rules.
However, if I remember rightly, instructions for each processor behaved very predictably.
In the case of the R3000, provided no pipeline stalls, it retired instructions at a rate of 1 cycle per evaluating instruction, and 2 cycles for loads and stores.
If someone with more recent MIPS R3000 experience finds errors, please report them to me via Github issues.
I’ll happily revise this article with verifiable corrections.</p>
<p>While I have used PowerPC and ARM devices, I’ve never written assembly-language for these platforms.
Nonetheless, observations made about the R3000 can apply generally to other RISC designs as a first-order approximation of expected performance.</p>
<p>For every processor with a cache, I’m explicitly assuming <em>no</em> cache misses.</p>
<h3 id="coming-up-next">Coming Up Next</h3>
<p>This article is already getting a bit long, so I’m going to cut it here.
In the next article, I provide a translation of the above program into the S16X4 MISC assembly language, and provide a simple analysis of its code.</p>
Neo-Retro Computing2013-10-06T19:40:00+00:00http://sam-falvo.github.io/2013/10/06/neo-retro<p>In the previous article, I described myself as a <a href="http://sam-falvo.github.io/blog/software/survivalism/2013/10/06/software-survivalism.html">software survivalist</a>.
I said then that I looked to <em>neo-retro computing</em> as a means of securing my ability to perform interesting hacks going forward in the future.
Well, interesting to <em>me</em> at least.</p>
<p>In this article, I define what my vision of neo-retro computing is.</p>
<h3 id="not-quite-scientific---">Not Quite Scientific . . .</h3>
<p>At its core,
neo-retro computing involves at least four of the
<a href="https://en.wikipedia.org/wiki/Scientific_method">five steps outlined in the scientific method</a>:</p>
<ol>
<li>
<p><strong>Formulation of a question.</strong> When looking at contemporary computing infrastructure, all we see is <em>complexity.</em> Does it <em>always</em> need to be so complex? Often, for our own purposes, the answer is <strong>no</strong>; but, that raises another question: how and/or why did this complexity arise in the first place? Thus the motivation for engaging in the neo-retro community — the <em>desire to know</em> the answers to these questions.</p>
</li>
<li>
<p><strong>Hypothesis.</strong> Sometimes, you might call this step <em>bravado</em>, depending on your attitude towards software and/or hardware. It is in this stage that you look at some aspect of computing and say to yourself, “I can probably do this simpler/better/easier.” If you’re smart, you’ll base your speculation on prior experience, either in coding or in product management. If you’re not that smart, you will be soon enough.</p>
</li>
<li>
<p><strong>Testing.</strong> Also known as <em>coding</em> or <em>hacking</em>. It’s here that you actually commit to writing your software, or if you’re looking to include computer hardware in your project scope, schematic capture, board manufacture, etc.</p>
</li>
<li>
<p><strong>Analysis.</strong> Also known as <em>retrospective</em>. It’s here that you record your experiences. You either write blog posts, or chapters of a book, whatever.</p>
</li>
</ol>
<p>You’ll notice that <strong>Prediction</strong> doesn’t exist in the list above. That’s because it’s not a prerequisite to participate in the neo-retro movement. Of course, there’s nothing preventing you from making predictions. I’ve been known to do it myself now and again. However, at least as often, I just want to attempt to reproduce a piece of computing history, just to play and learn. In such cases, I (try to) have no preconceived notions. I let the experiment guide me in my understanding why computing is the way it is.</p>
<p>In summary, neo-retro computing is about demanding from history proof that concepts need to be as complicated as they appear for the circumstances they’re used in.</p>
<h3 id="examples-of-neo-retro-computing">Examples of Neo-Retro Computing</h3>
<p>My Kestrel computer family obviously qualifies;
my experiences driving their evolution and development have informed my definition of neo-retro computing directly.</p>
<p>I think Jeri Ellsworth’s Commodore-One platform certainly qualifies;
it was arguably the first commercially viable, reconfigurable computing system.
Although it aims to support running classic computers (which itself isn’t neo-retro),
the fact that you can use the Commodore-One to question commonly held design philosophy and
develop your own completely new computers makes the Commodore-One <em>itself</em> neo-retro.</p>
<h3 id="examples-that-arent-neo-retro-computing">Examples That Aren’t Neo-Retro Computing</h3>
<p>Building a classic computer implementation in an FPGA fails to qualify.
For example, Jeri Ellsworth’s C64DTV doesn’t qualify as a Neo-Retro Computing project because
it never challenges the status quo of the Commodore 64 design.
Instead, it takes the Commodore 64 system architecture as a given.
Innovations like adapting it to use an SDRAM chip instead of regular DRAM definitely interest me, but
I feel it doesn’t address the fundamental core of the complexities found in the Commodore 64 design.</p>
<h3 id="case-study-the-kestrel-family">Case Study: The Kestrel Family</h3>
<p>This blog doesn’t generally cover Kestrel-related material,
but I’ll mention a quick retrospective here since I haven’t yet set up the Kestrel-specific blog(s).
(I’ll announce on this blog when I complete it/them.)</p>
<p>The Kestrel computer family of completely home-made computers exists to ask these questions for quite a number of different aspects of the computing world today.
Too many questions to list here, in fact.
Indeed, for each generation of Kestrel, everything within my financial reach is custom-designed, with the goal of learning, refining, proving, and improving the next generation.</p>
<h4 id="kestrel-1--2004">Kestrel-1 (???-2004)</h4>
<ol>
<li>
<p><strong>Question:</strong> How simply can I make a single-board computer?</p>
</li>
<li>
<p><strong>Hypothesis:</strong> (with some aspects of <strong>Prediction</strong> mixed in for good measure.)
Given a Western Design 65C816 microprocessor,
if I can remove the need for ROM,
the address decoder becomes a simple NAND gate configured as a simple inverter.
The remaining three NAND gates found in a 74ACT00 can decode the clock and R_W lines to form output- and write-enable signals.
When the high address bit is low, select a single VIA chip.
The W65C22 VIA can interface with SPI devices to provide virtually unlimited I/O capability as applications require.
When the high address bit is high, select RAM.
While the CPU is in RESET state, use a crude DMA circuit attached to the host PC to upload the initial program into high RAM.</p>
</li>
<li>
<p><strong>Testing/Hacking:</strong>
The finished design consisted of three breadboards.
The first contained the CPU and clock driving circuitry.
The second contained the RAM and VIA chip; and, as well, a handful of LEDs to illustrate I/O was, in fact, working.
The third contained a collection of 74ACT595 chips which my desktop PC drove to upload a program into RAM.</p>
</li>
<li>
<p><strong>Analysis:</strong>
It turns out you can build a moderately capable computer in a small form-factor and for greatly reduced costs
if you can find a way to reduce address decoding to a simple binary decision.
Additionally, the computer would need to depend on an initial program loading mechanism in order to be useful.</p>
</li>
</ol>
<h4 id="kestrel-2-2005-2012">Kestrel-2 (2005-2012)</h4>
<p>I think the earliest recollection I have of the Kestrel-2 dates to around 2005 or 2006,
just one year before moving to the Bay Area for a new job.</p>
<p>The Kestrel-2 attempts to answer a number of questions all at once, addressed independently, and in a very unstructured way.
Originally, bolstered with confidence from the success of the Kestrel-1,
I wanted the Kestrel-2 to be a kind of hybrid of the Apple IIgs and the Commodore-Amiga.
Based on the 65816 CPU running at 14MHz, capable of graphics with resolutions up to 640x480 and 256 colors out of a palette of 65536,
and polyphonic, DMA-driven audio channels, it was to be a home computer of my dreams.
Something that combined the ease of benchtop hacking that the Commodore 8-bit computers offered, and the usability of the Commodore Amiga.</p>
<p>Well, it didn’t turn out that way.
I needed to answer a lot more questions before I could get there.
For example, how to implement the video controller at all, much less one as capable as the Amiga’s AGA chipset.
The issue of cost entered the picture at this point as well: FPGA development boards were still well outside my reach.
Expansion buses required designing, as I wanted to slowly expand my RAM over time on the one hand, and I/O cards on the other.
The original Kestrel-2 design proved entirely and thoroughly over-ambitious for my meager experience and resources.</p>
<p>Something got my juices flowing again around 2011, however.
The <a href="https://github.com/sam-falvo/kestrel/commit/964da7376681dc224447f9595782c5cfcd8be7fd">earliest commit record</a>
I have for the Kestrel-2 Github project dates back to June of 2011.
I started writing a software emulator for the J1 stack-architecture CPU.
Eventually, I even managed to get real hardware working in my then brand-new
<a href="http://www.digilentinc.com/Products/Detail.cfm?NavPath=2,400,789&Prod=NEXYS2">Digilent Nexys2</a>
FPGA development board.</p>
<p>Unfortunately, I cannot recall the circumstances that caused me to change from the
<a href="https://github.com/sam-falvo/kestrel/tree/master/2/nexys2/j1a">J1A</a>
to the
<a href="https://github.com/sam-falvo/kestrel/blob/master/cores/S16X4/doc/datasheet.pdf">S16X4</a>
CPU it currently uses today.
I do remember byte-addressability and code density concerns were factors;
though, the details are lost to the mists of time now.</p>
<h4 id="kestrel-2-2012-2014">Kestrel-2 (2012-2014)</h4>
<p><strong>Questions:</strong>
How far can I get without interrupts?
<em>(Quite far.)</em>
Can I embed a working Forth environment in system memory and still have enough left over for programs?
<em>(Not with a 16KB system; 32KB minimum memory required, 44KB recommended. Font and bitmapped text output consumes too much space.)</em>
Can I make a working operating system to make up for limited memory resources?
<em>(Yes, as long as you’re careful about the OS’ own memory consumption.)</em>
Can Forth make a good systems programming language?
<em>(Yes, as long as you’re careful with code-reviews and proper test-driven techniques.)</em>
Is MISC as compact as hyped?
<em>(No. The 65816 in native-mode often produces smaller executables.)</em>
Is MISC as fast as hyped?
<em>(This depends on the workload; however,
on average, even with something as limited as the S16X4,
pure expression computation and simple effective address generation takes less time to complete than the equivalent 68000 or 65816 code.)</em>
More questions exist, of course, but are too numerous to list here.</p>
<p>The contemporary Kestrel-2 seems unrecognizable compared to my previously lofty goals.
The modern Kestrel-2 addresses no greater than 64KB of memory space,
of which up to 44KB can be program or data RAM,
4KB for I/O space,
and 16KB for video display RAM.
No audio support currently exists.
No interrupts.
No configurable video modes.
Limited video RAM space forces the display to 640x200 resolution.
No color — it’s black and white only.</p>
<p>The reason for the significant scope reduction, as you might expect, involves answering questions related to achieving my ultimate goal.
It was the perfect test mule to learn about video display circuitry, for example.
The lack of interrupts made for interesting programming challenges.
Surprisingly, you can write some amazingly sophisticated programs without them, provided your hardware supports alternative tools.
For example, the keyboard controller has, built-into its hardware, a 16-byte FIFO.
Most computers implement this in software, managed through an interrupt service routine.
Putting the FIFO right in hardware significantly simplified software and hardware design.
The microprocessor, now a very simple MISC-architecture design, supported byte-addressability right in the instruction set.</p>
<h4 id="kestrel-3-2014-">Kestrel-3 (2014-???)</h4>
<p><strong>Questions:</strong>
Can I make a 64-bit CPU run efficiently?
How can I support interrupts and traps?
Can I access external SRAM or SDRAM with any modicum of efficiency?
Do I absolutely need byte-, word-, dword-, and qword-specific accessors? Or, can I get by with word-addressing only and use byte-banding?
Can I <em>finally</em> upgrade the monochrome graphics interface adapter (MGIA) to support multiple resolutions and color depths (CGIA)?
Will the eP64 have enough code density to embed a functional Forth environment as the power-on language environment without requiring more than 32KB of memory?
Can I port Tripos relatively easily?
Can I embed Forth in ROM for when I cannot boot Tripos?</p>
<p>As you can see, some of the lofty goals from the original Kestrel-2 make a come-back with the Kestrel-3.
You’ll notice I’m not tackling everything at once this time, however.
For example, I’m still not addressing the needs of audio playback, nor of expansion buses.
We do see the return of interrupt support, but system software will likely under-utilize this new feature for now.</p>
<p>Here’s my one prediction that I’m quite sure of:
the eP64 has a data bus wider than that of external RAM by a factor of four (64 bits versus 16 bits).
As a consequence of this, driving the microprocessor at RAM speeds (no faster than 14MHz) will result in a substantial reduction in real-world performance.
Indeed, if we run the CPU at 13MHz (1/5th the dot-clock frequency for a 1024-pixel wide display),
we can expect the eP64 to function as though it were clocked at only 3.25MHz.
This will put the performance of the Kestrel-3 computer firmly in the Commodore Plus/4 or Atari 7800 level of performance.
Thus, the higher video resolutions will be useful for productivity applications only.
No games or demo-scene programs at 1024x768 yet!</p>
<p>The only way to attain a higher performance is to make use of FPGA block-RAM resources as instruction and data caches.
However, that sounds like a job for the Kestrel-4.</p>
Software Survivalism2013-10-06T13:45:55+00:00http://sam-falvo.github.io/2013/10/06/software-survivalism<p>I consider myself something of a software survivalist, and therefore am interested in the neo-retro world as a means of securing a safe haven for my hacking interests.</p>
<p>What does this even <em>mean</em>?</p>
<p>As the first post to my new blog, I should perhaps explain what I think a software survivalist is.
I’ll provide some clarification of what the neo-retro movement is in the next article.
The two are related, but not the same.</p>
<p>From <a href="http://en.wikipedia.org/wiki/Survivalism">Wikipedia</a>, excerpted on 2013-Oct-06:</p>
<blockquote>
<p>Survivalism is a movement of individuals or groups (called survivalists or
preppers) who are actively preparing for emergencies, including possible
disruptions in social or political order, on scales from local to
international.</p>
</blockquote>
<p>A software survivalist takes this concept of preparedness and applies the survivalist ethos to the realm of software and the hardware which runs it.</p>
<p>For example, I definitely see an imminent threat in the computing world, and especially to my freedom to hack.
Miniaturization, for all its benefits, led to unhackable commercial hardware.
There was a five to six year gap between the release of USB and the first home-built piece of hardware exploiting it.
Even today, most “USB” projects are just USB-to-RS232 converters connected to legacy microcontroller interfaces.
I won’t even begin to expound on the sheer cliff that describes the learning curve behind the USB’s various protocols.
Anyone wanting to make an external USB-based QWERTY-to-Dvorak converter, something that was utterly trivial with a microcontroller or two with the PS/2 protocol, now requires a complete USB host <em>and</em> slave interface in the same package,
plus the required software stack for each side of the connection.
What was once trivial now requires the resources of Logitech’s hardware engineers just to get a prototype off the ground.
If you find USB too slow for your needs and desire instead to use PCI-e, you’ll run into similar difficulties.</p>
<p>Trusted computing initiatives and their straight-jacket end-user license agreements reduce pocket-sized supercomputers to mere appliances.
Even devices running an ostensibly open platform such as Android require explicit jail-breaking.
Making matters worse, we’re starting to see dedicated chipsets just for encryption of traffic over a bus.
That PCI-e card you wanted to create?
More likely than not, you’re going to need a counterpart device on your peripheral.
Where do you get one of these for cheap?
Who keeps track of the keys used for encryption?
I see things getting out of hand pretty quickly, and hobbyists left in the dust.</p>
<p>With increasing frequency and severity, our rights as consumers erode.
I bought my computer with my own, hard-earned cash.
I’m not renting it.
Basic understanding of capitalism says I should have the ability to do whatever I want with it, at any level of abstraction I choose.
Yet, hardware and software trends conspire to prevent this from happening without an absurd amount of resources brought to bear on the task.</p>
<p>Surprisingly, this situation doesn’t bug me as much as you might think.
Rather, the lack of any truly open alternatives bugs me far more.
If you want to spend the money for a computing home appliance, by all means.
I certainly would too, I have, and I will continue to do so.
However, if I cannot program it simply or hack hardware easily, I’m left with two options.
One, I can just grit my teeth and wait for a Kickstarter project to come along that does just what I want it to; or,
two, I could make my own computer and software for that computer where I <em>can</em> hack as I see fit.
The hopeful hobbyist will choose the first option, while the software survivalist will choose the second option.
Both let the free market evolve as it should;
however, the software survivalist takes a more pro-active role in securing his ability to hack on interesting projects in the future.</p>
<p>That leads me to what I call the neo-retro movement.
I’ll discuss this more in the next article.
Stay tuned.</p>
The Declarative, Imperative, then Inquisitive Pattern2010-02-27T16:30:00+00:00http://sam-falvo.github.io/2010/02/27/declarative-imperative-then-inquisitive<div class="alert alert-info">
<p>
Based on <a href="http://www.forth.org/svfig/kk/02-2010-Falvo.pdf" class="alert-link">a presentation</a> I gave at the <a href="http://www.forth.org/svfig" class="alert-link">Silicon Valley Forth Interest Group</a>,
I wrote this article some time before 2010 Feb 27; however, I don't recall exactly when.
I originally published it on one of my earlier blogs, now long since gone.
The content, including all errors, remains intact from its original publication; I made no attempt to clean up the prose below.
I only edited it so that it may conform to contemporary formatting methods (e.g., Markdown).
While details have changed over the years, the core essence of this pattern remains as true today as it did when I first published it.
Maybe some day, I'll revisit this pattern and post an updated pattern reference, complete with examples in a variety of languages instead of focusing exclusively on Forth.
Until then, enjoy this piece of personal archaeology.
More will be coming as time permits.
</p>
<p class="pull-right">
— <em>Samuel A. Falvo II, 2013-Dec-07.</em>
</p>
<p> </p> <!-- needed to overcome weirdness with pull-right class. Admittedly, I *am* abusing it in the context of a paragraph, but still... -->
</div>
<p>Many interacting tensions need resolution if one desires a “well-written” Forth program.
Unfortunately, except for the relatively scattered and often contradictory tips and suggestions offered in the book <em>Thinking Forth</em>,
documentation of common problems and their solutions hardly exists.
To remedy this problem, I present a programming pattern which I call <strong>Declarative, Imperative, then Inquisitive.</strong>
Hopefully, this article will inspire others to contribute their own patterns.</p>
<!-- more -->
<h1 id="name">Name</h1>
<p>Declarative, Imperative, then Inquisitive (DItI).</p>
<h1 id="problem">Problem</h1>
<p>Reasoning about software is extremely difficult.
Any code can potentially cause any change to the state of the machine;
conversely, any change the code makes might (or might not!) be intentional, even if it doesn’t match the function’s name.
Therefore, every programmer learns the importance of clearly naming their procedures, and many coders learn to use code without side effects.
But, the side effects manifested in one procedure might be the intended main effect of another, and much benefit often exists in allowing side effects.</p>
<p>We therefore define a pattern which clearly marks procedures as being either Declarative, Imperative, or Inquisitive.
A Declarative procedure states (in its name) that a given action or state is accomplished or reached;
its semantics should ensure that the named action is accomplished or state is achieved,
and furthermore that the action will only be accomplished at most once (that is, if the procedure is called multiple times with the same state, no further action will be taken).
An Imperative procedure commands (by its name) that an action be done, and its semantics should perform that action every time it’s called, regardless of whether it was already done.
An Inquisitive procedure states a question, and it should do nothing more than return the answer to that question, not changing the state of the system.</p>
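<p>The three word classes can be sketched outside of Forth as well. Here is a minimal Python illustration; all names (<code class="language-plaintext highlighter-rouge">connected</code>, <code class="language-plaintext highlighter-rouge">disconnect</code>, <code class="language-plaintext highlighter-rouge">is_connected</code>) are invented for this sketch, not taken from the article:</p>

```python
# A toy connection store demonstrating the three word classes.
connections = set()  # global state, standing in for Forth's dictionary state

def connected(remote, local):
    """Declarative: ensure the fact 'remote and local are connected' holds.
    Idempotent -- asserting the same truth twice changes nothing further."""
    connections.add((remote, local))

def disconnect(remote, local):
    """Imperative: perform the action every time it is called."""
    connections.discard((remote, local))

def is_connected(remote, local):
    """Inquisitive: answer the question without changing any state."""
    return (remote, local) in connections

connected(2, 4)
connected(2, 4)               # re-asserting: no further effect
assert is_connected(2, 4)
disconnect(2, 4)
assert not is_connected(2, 4)
```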
<p>Furthermore, we specify that whenever possible, actions should be specified in the most limited manner possible, which means that most words should be Declarative rather than Imperative;
and because Inquisitive words do nothing but allow decisions to be made, they should appear relatively rarely in order to reduce the complexity of the system.
Thus, we list the names in their desired order of frequency.</p>
<p>Allow me to demonstrate with a more concrete example.</p>
<p>When I first started the HDLC networking component of my digital optical transceiver project,
I needed a data store for the local/remote station connection relationships.
Implementing this required a simple means of locating a record based on remote and local addresses (which I dubbed <code class="language-plaintext highlighter-rouge">row</code>, for it conceptually returned a database row).
But, I did not want to deal with error conditions at that point —
this meant that my queries <em>must always</em> produce the intended value, no matter what.
I would handle errors elsewhere, where it proved more convenient.
That kept my program logic clean and readable, unfettered by irrelevant logic.</p>
<p>When I first conceived <code class="language-plaintext highlighter-rouge">row</code>, I didn’t think in terms of declarative coding techniques;
rather, I thought of the query word as a <em>procedure</em>;
e.g., something I dictated to the computer: first do this, then do that, finally do those.
The procedural thinking I first used yielded the following definition (<code class="language-plaintext highlighter-rouge">dropDlc</code> is the procedure that calls <code class="language-plaintext highlighter-rouge">row</code> in this case, hence the bizarre comment on the 7th line):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: row ( ra.la )
0 >r
begin ( ra.la : 0 <= rel <= nextDlc )
r@ nextDlc @ = if
2drop r> drop
0 r> drop
( exit dropDlc w/ f : 0 <= rel = nextDlc )
exit
then
( ra.la : 0 <= rel < nextDlc )
over r@ remoteA + @ =
over r@ localA + @ = and if
( ra.la : [0 <= rel < nextDlc] /\ isDlc?[ra.la] )
2drop r>
( rel : [0 <= rel < nextDlc] )
exit
then
( ra.la : 0 <= rel < nextDlc ==> 0 < rel+/row <= nextDlc )
r> cell+ >r
( ra.la : 0 < rel <= nextDlc )
( ra.la : 0 <= rel <= nextDlc )
again ;
</code></pre></div></div>
<p>While I was quite pleased with how easy it was to use <code class="language-plaintext highlighter-rouge">row</code>,
I was not at all happy with how complex this word turned out to be.
<code class="language-plaintext highlighter-rouge">row</code> seemed irreducible to me at the time<sup>1</sup>,
for I could find no meaningful way to simplify it.
To help ensure the word’s correctness,
I placed proof annotations interstitially in the body of the definition.
The word definitely works — I’ve proved it “on paper”,
and it also worked spectacularly well in practice.</p>
<p>Later, I encountered a most difficult bug where incoming connections were not appropriately accounted for in the data link connection state (DLCS) table;
after hours of trying to track the bug down, process of elimination indicated that the bug had to reside in one of two places:
inside <code class="language-plaintext highlighter-rouge">row</code>, or somewhere inside the frame dispatcher module.
Despite the formal proof that this code worked,
I decided it was the most likely cause of the failure,
due to its visual complexity (maybe I had forgotten something).
Just in case, I decided to rewrite it.<sup>2</sup></p>
<p>This time, however, I wanted to re-implement <code class="language-plaintext highlighter-rouge">row</code> using definitions of similar structure to those using <code class="language-plaintext highlighter-rouge">row</code> itself,
for I had discovered that using <code class="language-plaintext highlighter-rouge">row</code> in other definitions proved remarkably easy,
and made for very readable code.
I started with the most obvious use-case:
I knew that if a row wasn’t found in the DLCS table, we had to return a zero to the caller of whatever word invoked <code class="language-plaintext highlighter-rouge">row</code>
(remember: <code class="language-plaintext highlighter-rouge">row</code> cannot itself return without having a valid record number as its result).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: row -found 2drop 0 r> drop ;
</code></pre></div></div>
<p>A word about notation: since ASCII lacks the boolean symbol for logical negation (this thing: ¬),
I’m forced to choose the character with the closest iconographic resemblance:
words starting with a dash usually read as, “<em>not</em> word” or “<em>no</em> word.”
In this case, <code class="language-plaintext highlighter-rouge">-found</code> reads as <em>not found</em>.</p>
<p>Notice the structure of the aforementioned definition.
<code class="language-plaintext highlighter-rouge">-found</code>, taken otherwise completely out of context, <em>states a fact or pre-condition</em> about the code which follows,
which the reader safely assumes must hold for all subsequent code.
Hence, <code class="language-plaintext highlighter-rouge">2drop 0 r> drop</code> executes with <em>full confidence</em> that the record sought is genuinely not in the database.</p>
<p>Eventually, I had to implement <code class="language-plaintext highlighter-rouge">-found</code>.
Once again, I decided to engineer the code so that it stated only facts,
without obvious concern to what would happen had these facts been wrong:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: -found 0 begin dup nextDlc @ < while -match cell+ repeat drop ;
</code></pre></div></div>
<p>For those not familiar with idiomatic Forth coding conventions, I defined <code class="language-plaintext highlighter-rouge">-found</code> to mean, literally, <em>for all allocated records in the table, no match exists.</em>
Note that <code class="language-plaintext highlighter-rouge">-found</code>, while itself a declaration, also makes use of another declaration: <code class="language-plaintext highlighter-rouge">-match</code>.
<code class="language-plaintext highlighter-rouge">cell+</code> runs with full confidence that no match has been discovered thus far.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: -match hit? if nip nip r> r> 2drop exit then ;
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">-match</code> ensures that no match (“hit”) exists insofar as it concerns <code class="language-plaintext highlighter-rouge">-found</code> and <code class="language-plaintext highlighter-rouge">row</code>.
What happens, though, if it discovers a match?
Considering the context we’ve established so far, we <em>cannot</em> just return to the caller because the caller’s subsequent code <em>depends</em> on there not being any match!
Likewise, we cannot return to the word who called <code class="language-plaintext highlighter-rouge">row</code>’s caller.
Only returning to the word <em>which called <code class="language-plaintext highlighter-rouge">row</code> itself</em> remains, making sure we also return the record number we promised for subsequent code to use.</p>
<p>It was at this time I realized how widely applicable declarative programming in Forth can be.
From examples like <code class="language-plaintext highlighter-rouge">2 4 connected</code> to establish the relationship that remote station 2 and local station 4 were connected to each other,
to using <code class="language-plaintext highlighter-rouge">row</code> to guarantee a database record number,
to using preconditions in a declarative mode, as above,
I now had a consistent pattern of declarative programming wherein the legibility of code significantly improved while also improving code reliability at the same time.</p>
<h1 id="context">Context</h1>
<p>Declarative, Imperative, then Inquisitive applies whenever you desire:</p>
<ul>
<li>easy to read and maintain source code. The best Forth code tends to read horizontally, not vertically, through the use of the rule of thumb, “One line, one definition.” DItI provides a more structured means of achieving this goal.</li>
<li>greater code reliability. Tony Hoare was one of the first computer scientists to identify the concept of <em>preconditions</em>, later popularized by the Eiffel programming language through its <em>Design By Contract</em> system. He defined a precondition as a predicate which must hold true for any subsequent software to produce valid, correct results. Declaratively documenting preconditions at the beginning of Forth words both documents the requirement and provides a means to trap on erroneous input.</li>
</ul>
<h1 id="forces">Forces</h1>
<ul>
<li>Readability — The DItI pattern, in effect, calls for a coding convention. Inasmuch, as with all conventions, familiarity with it results in measurable improvements in reading and comprehending unfamiliar pieces of code.</li>
<li>Correctness — The DItI pattern improves correctness by encouraging input parameter checking through preconditions.</li>
<li>Performance — In naive compilers, a subroutine call will occur for every declaration, even for those used only once. Hence, depending on your compiler, using DItI may impact runtime performance in timing sensitive event handlers or tight loops.</li>
</ul>
<h1 id="solution">Solution</h1>
<p>We have already observed the structure of <code class="language-plaintext highlighter-rouge">row</code>. I repeat the code fragment below, sans interstitial comments for greater clarity:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: hit? >r over remoteA r@ + @ = over localA r@ + @ = and r> swap ;
: -match hit? if nip nip r> r> 2drop exit then ;
: -found 0 begin dup nextDlc @ < while -match cell+ repeat drop ;
: row -found 2drop 0 r> drop ;
</code></pre></div></div>
<p>Notice how <code class="language-plaintext highlighter-rouge">row</code>, <code class="language-plaintext highlighter-rouge">-found</code>, and <code class="language-plaintext highlighter-rouge">-match</code> exist as declarations — these words state or establish some truth, which code that uses them can rely on. However, <code class="language-plaintext highlighter-rouge">hit?</code> is inquisitive in nature.</p>
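<p>For readers more comfortable outside Forth, here is a hedged Python rendering of the same four definitions. Forth’s return-stack surgery (<code class="language-plaintext highlighter-rouge">r&gt; drop</code>) has no direct Python analog, so an exception models the multi-level early exit; the list-based table and all names here are assumptions of this sketch, not the article’s actual data layout:</p>

```python
# Sketch of row / -found / -match / hit? with an exception standing in
# for the return-stack trick that aborts row's caller.
table = []  # each entry: (remote, local), indexed by record number

class _NotFound(Exception):
    """Raised when no record matches -- the analog of '2drop 0 r> drop',
    which exits row's caller with a false flag."""

def hit(ra, la, rec):
    # Inquisitive: does record rec match (ra, la)?
    return table[rec] == (ra, la)

def row(ra, la):
    # Declarative: the matching record number *is* the result afterward.
    for rec in range(len(table)):   # the '-found' scan
        if hit(ra, la, rec):        # the '-match' check
            return rec
    raise _NotFound                 # no match anywhere: escape upward

def drop_dlc(ra, la):
    # A caller of row; the except clause receives control when row
    # aborts, just as dropDlc received the false flag in the Forth.
    try:
        rec = row(ra, la)
    except _NotFound:
        return False                # 'exit dropDlc w/ f' in the original
    del table[rec]
    return True
```

<p>The exception preserves the essential property: code after a successful <code class="language-plaintext highlighter-rouge">row</code> runs with full confidence that a valid record number exists.</p>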
<p>Note the following conventions:</p>
<table class="table table-bordered table-responsive">
<tr>
<th>Word Type</th>
<th>Attributes</th>
</tr>
<tr>
<td>Inquisitive</td>
<td>
<ul>
<li>Typically uses a question mark to ask a question.</li>
<li>Typically past- or present-tense. E.g., <code>connect<b>ed</b>?</code>, <code>reus<b>able</b>?</code>.</li>
<li><b>Always</b> idempotent. Assuming no other externally induced state changes (including but not limited to time-sensitive properties, including the current time itself), invoking a predicate with the same parameters <b>must</b> return the same results.</li>
</ul>
</td>
</tr>
<tr>
<td>Imperative</td>
<td>
<ul>
<li>Typically named with a command-phrase. E.g., <code>sortArray</code>, <code>printError</code>.</li>
<li>Per the principle of command/query separation, imperatives almost never return anything to their callers.</li>
<li>Rarely idempotent. E.g., when printing, <code>newPage newPage</code> should cause the current page to finish, followed by a blank page.</li>
</ul>
</td>
</tr>
<tr>
<td>Declarative</td>
<td>
<ul>
<li>Typically named as a verb-derived adjective. With the most common form of expression as a past participle form of a verb (ending in <b>-ed</b>, as in <code>connect<b>ed</b></code>), we understand the program state to reflect the results of some previous action and to remain so to the present time. Depending on the context, you may find a present progressive form more suitable (ending in <b>-ing</b>, as in <code>connect<b>ing</b></code>). In still other cases, perhaps more rarely encountered depending on the kind of software written, suitable names appear as a verb with some other adjectival suffix (such as <code>reus<b>able</b></code>)<sup>3</sup>.</li>
<li>Most declarations have eponymously named queries.</li>
<li><b>Always</b> idempotent. Although declarative words may effect new state, re-asserting the same truth more than once has no further effect. E.g., <code>1 2 connected</code> will establish the fact that 1 and 2 are somehow connected. However, re-executing the expression will, in effect, do nothing, for the computer already knows that 1 and 2 are connected.</li>
<li>A fact exists in at least one of two possible times: before a declaration executes, or after it's finished executing. As a result, declarations come in two basic forms: preconditions, which <b>confirm</b> facts known ahead of time, or state-changing words, which perform requisite actions to <b>effect</b> new knowledge.</li>
<li>Preconditions typically <b>do not</b> consume their stack arguments, instead preserving them for subsequent computations in the event that the precondition holds. If a precondition fails, however, it takes immediate action to handle the exceptional condition, consuming parameters if necessary.</li>
<li>State changing words <b>do</b> consume all of their stack-resident arguments, often treating the stacked data as a representation to interpret, rather than raw data to store verbatim. The word takes all actions necessary, <b>if any at all</b>, to alter the relevant state according to the representation given on the data stack. (C.f., Representational State Transfer, or REST.)</li>
</ul>
</td>
</tr>
</table>
<h1 id="resulting-context">Resulting Context</h1>
<ul>
<li>Declarative words (“declarations”)</li>
<li>always express eponymously-named facts.</li>
<li>are <em>effective</em> — internal data and/or control-flow state may change to ensure the named facts actually <em>are</em>.</li>
<li>
<p>are <em>idempotent</em> — they take no unnecessary actions beyond that required to effect their stated truth.</p>
</li>
<li>Imperative words (“imperatives”)</li>
<li>provide the know-how responsible for making queries and declarations work.</li>
<li>generally appear at the lower abstraction levels, and therefore remain hidden from external programs.</li>
<li>
<p>typically deal with the realities of memory layout, pointer arithmetic, etc.</p>
</li>
<li>Inquisitive words (“queries”)</li>
<li>provide a read-only view on the relevant state.</li>
<li>
<p>can answer yes/no questions (e.g., are there enough bytes to read?) or reconstruct a representation of some internal state (e.g., where is the current cursor position in the Cartesian coordinate system?). Many declarations have eponymously named queries.</p>
</li>
<li>Runtime Performance</li>
<li>Potentially compromised due to increased subroutine call overhead.</li>
<li>
<p>Recovery possible through creative use of immediate and/or macro definitions.</p>
</li>
<li>Program Architecture</li>
<li>State maintenance most likely requires some form of database. Any suitable database architecture will work, including but not limited to key-value, object-oriented, relational, hierarchical, navigational, et al. Any persistence model will work as well, including disk-backed, distributed RAM cache, or even ordinary record/object member fields.</li>
<li>Languages lacking aggregate data types like records or objects tend to rely more heavily on relational(-like) database concepts. Languages with native support for more sophisticated aggregate types tend to find more navigational styles of data management easier.</li>
</ul>
<p>The programmer takes responsibility for choosing appropriate names for his or her words.
Try to choose names that make sense in the context of the problem being solved, not for the underlying data structures used.</p>
<p>For example, if you have a queue of objects to process, <code class="language-plaintext highlighter-rouge">enqueued</code> and <code class="language-plaintext highlighter-rouge">enqueued?</code> likely will make for poor names even though they’re academically correct.
Since the enqueue function takes both a datum and a queue to put it on, one must wonder what queue <code class="language-plaintext highlighter-rouge">enqueued</code> refers to.
Does <code class="language-plaintext highlighter-rouge">enqueued?</code> query the same queue that <code class="language-plaintext highlighter-rouge">enqueue</code> uses?</p>
<p>Queues appear in many different kinds of software; even within any single project, any number of queues may exist, each serving a unique purpose.
Therefore, the programmer must recognize this and ask <em>why</em> something needs queueing in the first place.
Put another way, if something deposits an object on a queue, what significance lies behind it?
What <em>becomes true</em> about the object once it’s been queued?
If you experience difficulty answering any one of these questions, remain aware that answers to them always exist;
when found, the answer provides a valuable source of ideas for choosing among candidate names.
Having a thesaurus nearby helps too.</p>
<p>External software tends to use declarative words, usually overwhelmingly due to their ease of use.
Try to minimize reliance on queries, for they exhibit a tendency to break a module’s encapsulation.
The most frequently used predicates tend to reflect the safety of performing an action or ability to affect state.</p>
<p>Software tends to read more conversationally, at least once you’re used to the adopted notational conventions; it communicates more naturally with the human maintainer, as humans think declaratively.
Contrast against imperative-only coding (both procedural and object-oriented variants), where the communication emphasis lies with the machine, or functional coding, where the emphasis lies with algebraic formulation and evaluation.</p>
<h1 id="examples">Examples</h1>
<p>To help illustrate this pattern, we consider the relatively simple task of tracking a text input cursor on the screen. Consider a 640x480 pixel display, with an 8-pixel fixed-width, 8-pixel tall font. This yields a character matrix 80 columns wide, 60 rows tall on the screen. As a first cut, we know we need to keep track of the cursor’s coordinates:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>variable cx
variable cy
: at cy ! cx ! ;
: at? cx @ cy @ ;
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">at</code> repositions the cursor on the screen, while <code class="language-plaintext highlighter-rouge">at?</code> queries its current location. Because queries may idempotently provide different views on some state, we might want to return a byte offset into a bitmap corresponding to the current cursor location. We’ll define a word <code class="language-plaintext highlighter-rouge">tile</code> to return the base address of a character tile. Although a query, notice that <code class="language-plaintext highlighter-rouge">tile</code> lacks a question mark:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: tile cy @ 80 * cx @ + ;
</code></pre></div></div>
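<p>As a quick check of the arithmetic, the same offset computation in Python (a sketch; the function signature is mine):</p>

```python
# tile maps a column/row pair to a linear offset in the 80x60 matrix.
def tile(cx, cy):
    return cy * 80 + cx

assert tile(0, 0) == 0        # top-left corner
assert tile(79, 59) == 4799   # bottom-right corner, last of 80*60 cells
```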
<p>However, the behaviors of <code class="language-plaintext highlighter-rouge">at?</code> and <code class="language-plaintext highlighter-rouge">tile</code> make sense only for the case where (0 ≤ <em>x</em> < 80) ∧ (0 ≤ <em>y</em> < 60). If this condition doesn’t hold, then we get strange effects, including the possibility of memory corruption elsewhere in the system. So, let’s constrain our coordinate space:</p>
<pre>
: <b>constrained</b> 0 max 59 min swap 0 max 79 min swap ;
: at <b>constrained</b> cy ! cx ! ;
</pre>
<p><code class="language-plaintext highlighter-rouge">constrained</code> demonstrates a declarative word which ensures our precondition by constraining the cursor to the visible bounds of the screen. Notice it takes no unnecessary actions, it leaves the stack as-is for subsequent code, and also deals with exceptional cases aggressively. It’s also unlikely that this word find use by outside software, so choosing a more generic name for it doesn’t make sense here. However, if it’s desired to use <code class="language-plaintext highlighter-rouge">constrained</code> elsewhere, then rename the word to something more appropriate (e.g., <code class="language-plaintext highlighter-rouge">visiblyConstrained</code>) as a refactoring step.</p>
<p>When editing text, few things occur more frequently than rendering a character and advancing the cursor. Thus, we can define a word which bumps the cursor to the right:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>: bumped 1 cx +! ;
</code></pre></div></div>
<p>Of course, this only works when 0 ≤ <em>x</em> < 79; when <em>x</em> = 79, we need to wrap the cursor to the left-side of the screen:</p>
<pre>
: <b>r-edge?</b> cx @ 79 = ;
: <b>-wrap</b> <b>r-edge?</b> if 0 cx ! 1 cy +! r> drop then ;
: bumped <b>-wrap</b> 1 cx +! ;
</pre>
<p>While an improvement, we still neglect the case where <em>y</em> = 59. To prevent our invariant from being violated, we need to refine our code once more:</p>
<pre>
: <b>b-edge?</b> cy @ 59 = ;
: <b>-scroll</b> <b>b-edge?</b> if scrolledUp r> r> 2drop then ;
: r-edge? cx @ 79 = ;
: -wrap r-edge? if 0 cx ! <b>-scroll</b> 1 cy +! r> drop then ;
: bumped -wrap 1 cx +! ;
</pre>
<p>Observe how you can take <em>any single line of code</em> and understand it in complete isolation of the others, provided you have the over-arching context of the problem being solved (in this case, bumping the cursor to the right). You’ll find this occurs quite frequently with DItI, and helps to explain why DItI-style code proves easier to maintain.</p>
<h1 id="known-uses">Known Uses</h1>
<p>Chuck Moore uses some declarative programming throughout ColorForth, albeit with highly abbreviated names often hiding their declarative characteristics.
See <a href="http://www.colorforth.com/ide.html">http://www.colorforth.com/ide.html</a> for the ColorForth IDE harddrive source code, published circa 2001.
The <code class="language-plaintext highlighter-rouge">bsy</code> and <code class="language-plaintext highlighter-rouge">rdy</code> words, responsible for ensuring the IDE controller is not busy and is ready to receive or send data respectively,
fulfill the declarative requirements established in this pattern.
However, <code class="language-plaintext highlighter-rouge">sector</code>, <code class="language-plaintext highlighter-rouge">read</code>, and <code class="language-plaintext highlighter-rouge">write</code> take imperative forms.</p>
<p>Samuel A. Falvo II uses declarative programming style extensively in his HDLC network implementation.</p>
<h1 id="related-patterns">Related Patterns</h1>
<p>At least with Forth, Partial Continuation often appears as the only way to satisfy the declarative coding style.
Relying on sentinel return values, even if in a separate stack item, often complicates program flow.
<code class="language-plaintext highlighter-rouge">CATCH</code> and <code class="language-plaintext highlighter-rouge">THROW</code>, while useful in their own right, still work with sentinel values at some point.</p>
<p>I suggest applying Aggressive Handling as well, for by definition, every exceptional case to a documented truth needs dealing with in the word establishing that truth.
Eliminating run-time error dispatching will, in most cases, significantly simplify software maintainability.</p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>I would like to thank Billy Tanksley for offering his extremely limited time to proof-reading this article, and offering valuable input concerning the expression of concepts and ideas herein.</p>
<hr />
<p><sup>1</sup> If you strip away the comments of the above definition of <code class="language-plaintext highlighter-rouge">row</code>, you will find the core logic simple enough.
You’d be forgiven if you, too, thought that refactoring the word would yield no tangible benefits.</p>
<p><sup>2</sup> It turns out that <code class="language-plaintext highlighter-rouge">row</code> was <em>not</em> in error; the proofs were correct.
However, had I not mistrusted myself, I don’t think this pattern would ever have been recognized and documented.</p>
<p><sup>3</sup> Some might feel that such forms do not make explicit the idempotency of a procedure.
While I do not feel this to be the case (e.g., once something is reus<b>able</b>, it will always remain reusable until such time as it is reus<b>ed</b>), that one person can think this implies others can too.
Regrettably, I cannot prescribe formulaic rules for this; only experience can inform how and when to use such names.</p>
Haskell Monads: Another View (Repost)2007-03-22T15:00:00+00:00http://sam-falvo.github.io/2007/03/22/haskell-monads-another-view<p>
This article was first published 2007 March 22, and is converted from a doctools document. Some translation errors may remain. I’ve tested viewing this page under both Linux and Macintosh environments, with Chrome and Firefox. (I do not have access to Internet Explorer, sorry.)
</p>
<h1>1 Introduction</h1>
<p>When I first started coding in Haskell, not one month ago, the concept of
<em>monads</em> just thoroughly confused the heck out of me. Sure, like any clueless
newbie, I could code in Haskell without knowledge of monads. But who wants to
use a gun without knowing how to clear the chamber first? Who wants to cook
without knowing how the stove works? Who wants to <em>learn Haskell</em> without
knowing about monads? After all -- it's only natural!</p>
<p>After much frustration, I finally came to know them well enough to write this
article.</p>
<h1>2 Monads as Constructors</h1>
<p>To grossly over-simplify, monadic expressions are <em>constructors.</em> You read
that right -- <em>constructors</em>. As in object constructors. To understand why
this is the case, we need to look at the simplest possible monad: the <em>State
Monad.</em></p>
<p>OK, scenario one: you are trying to LR-parse data from some input stream, and
you need to maintain some kind of state while iterating over the input. Let's
conveniently ignore <tt>unfoldr</tt> for the time being; you're inexperienced, and
you are trying to apply your knowledge of how to code something in C towards
the Haskell solution. At least, that is, for this example.</p>
<p>Scenario two: you're writing a window manager for the X Window System, and you
need to maintain the list of windows currently visible on the screen. How do
you do this in a purely function environment, especially when you <em>also</em> need
to process events from a multitude of different sources, parse (<em>ahem</em>)
configuration files, etc?</p>
<p>Due to the genericity of the problem of maintaining state, I won't get into
specific data structures for such state. Instead, I will concentrate on the plumbing
required to make the magic of maintaining state happen in a functional
language.</p>
<h1>3 Observation on Sequential Evaluation</h1>
<p>In any functional language, you <em>cannot</em> guarantee the order of execution of
functions. At least, you're not <em>supposed</em> to. Just think: according to all
the math texts out there, the results of a function magically appear only once
all its inputs are satisfied. Functions are supposed to "just work." There's
no real way to impose an order of evaluation on them. Or is there?</p>
<p>A function cannot be evaluated meaningfully without providing all the parameter
values it needs to produce a result. Thus, it follows that you can
artificially impose a sequential evaluation order by <em>threading</em> the output of
one function into the input of another, like this:</p>
<pre>
> let hello w = w ++ "!"
> in "hello" : (hello "world") : []
</pre>
<p>The above code looks like a normal list construction, but remember that you
can't build the list until you have the nodes to stuff into it in the first
place. Hence, it <em>must</em> evaluate the <tt>hello</tt> function before the list can be
fully evaluated.</p>
<h1>4 Observation on Threading State</h1>
<p>Now that we know how to thread the execution of functions properly, we turn our
attention to the obvious problem of maintaining state from one function to the
next. Well, this is actually pretty simple -- simply pass, and return, that
state as parameters and/or tuples. For example:</p>
<pre>
> data Stack a = Stack a (Stack a)
> push :: a -> Stack a -> Stack a
> push a stk = Stack a stk
> pop :: Stack a -> (a, Stack a)
> pop (Stack a stk) = (a, stk)
</pre>
<p>Let's ignore errors for the time being. What is important to observe in the
above code is that (a) we're passing explicit state (though it's not strictly
true, you can think of <tt>stk</tt> as an object, and its use as analogous to
<tt>self</tt> or <tt>this</tt> in more traditional object oriented languages) at all
times, and (b) the stack "methods" are always <em>returning</em> an updated state as a
result; this is <em>implicit</em> in object oriented languages, where state is updated
in-place. This isn't necessarily true in functional languages, though
optimizations usually produce equivalent code.</p>
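<p>The same explicit state threading can be sketched in Python, with nested tuples standing in for the Haskell <tt>Stack</tt> type; every operation takes the state and returns the updated state rather than mutating in place:</p>

```python
# Immutable stack threaded through each call, as in the Haskell example.
def push(a, stk):
    return (a, stk)       # Stack a stk

def pop(stk):
    a, rest = stk
    return a, rest        # (a, Stack a)

s = None                  # the empty stack
s = push(1, s)
s = push(2, s)            # output of one call threads into the next
top, s = pop(s)
assert top == 2
```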
<p>Well, as you can see, having to maintain all that state, manually threading the
output of one function to the input of another, can get substantially tedious!
This can lead to code that is harder to read, even harder to debug, and all but
impossible to modify at some future date.</p>
<p>There has to be some way of factoring this hand-threaded code out.</p>
<p>It turns out that there is. But first, we need one more observation about the
nature of software before we can put all the pieces together.</p>
<h1>5 Observation on Threading of Flow Control</h1>
<p>In any given imperative programming language, like Python for instance, it's
generally accepted that when you write something like:</p>
<pre>
>>> a = 5
>>> b = 4
>>> c = a+b
>>> print c
</pre>
<p>it is pretty clear that <tt>a</tt> is assigned a value, <em>then</em> <tt>b</tt> is assigned a
value, <em>then</em> the sum of those values is printed to the screen. In other
words, order of execution is determined <em>exclusively</em> by order of listing.
Control flow follows, strictly, from top down. Unless told otherwise, of
course, usually by a for loop, or some such. But those are minor details.</p>
<p>It is pretty clear, in the above code, that we cannot evaluate the value of
<tt>c</tt> meaningfully without <em>first</em> having evaluated both <tt>a</tt> and <tt>b</tt>.
Sound familiar? Yes -- that's right -- nested functions. In this case, if
you'll allow me some literary license, we can rewrite the above code using
lambda expressions in a continuation passing notation, like so:</p>
<pre>
>>> (lambda a: (lambda b: (lambda c: print(c))(a + b))(4))(5)
</pre>
<p><em>Yikes!</em> Maybe Python isn't such a good language to express lambda
substitutions with bound variables in. (Indeed, at one point there was even talk
of removing lambdas from Python altogether.) So we might as well look at the
equivalent code in a language that directly supports the notion:</p>
<pre>
> show $ (\a -> (\b -> (\c -> c) (a+b)) 4) 5
</pre>
<p>What, this still isn't clear? OK, let me show you again, only this time we'll
take it statement by statement. Or, rather, their equivalents.</p>
<pre>
> show $ (\a -> ...) 5
</pre>
<p>If this looks like normal function calling syntax, that's because it is. What
we're doing is using a lambda expression to <em>assign</em> the value 5 to the formal
variable <tt>a</tt>. Pretty slick, eh? Now, let's look at the next "statement":</p>
<pre>
> show $ (\a -> (\b -> ...) 4) 5
</pre>
<p>Notice a pattern yet? Note how "the rest of the program" is treated as a
single function, expressed in terms of the preceding variable assignments?
Note how the only way to successfully evaluate the functions is to evaluate
them in the proper order? Yes; that's right -- a <em>program,</em> mathematically
speaking, is just a <em>nested set of functions.</em> Evaluate them in the proper
order, and you will get the same results as an imperative language program.</p>
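<p>If you'd like to convince yourself of the equivalence, the nested-lambda version really does compute the same answer (a quick sketch, not from the original listing):</p>

```haskell
-- a = 5; b = 4; c = a + b; print c, as nested lambdas:
-- each "statement" is a function applied to the values so far.
nested :: String
nested = show ((\a -> (\b -> (\c -> c) (a + b)) 4) 5)
```

<p>Evaluating <tt>nested</tt> yields the string "9", just as the imperative program prints 9.</p>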
<h1>6 Observation: Thinking Algebraically</h1>
<p>By now, you're likely to have had at least one epiphany on where we're going.
If not, I'll now make things more explicit by making one more observation: at
any point in an imperative program, you can always split it <em>between</em>
statements, such that evaluation of the top part is a necessary precondition to
the execution of the bottom part. It seems pretty obvious at first glance, but
it's a <em>critical</em> observation to make explicit. Because, after all, if we can
manipulate functions like anything else in higher-order languages, and we can,
then it should be possible to build some <em>function</em> which <em>returns</em> a properly
composed <em>sequence</em> of <em>other</em> functions, that does precisely that. I mean,
execute A before B, that is. For the sake of argument, let's call this <tt>>></tt>.
We can therefore rewrite our Python example like so:</p>
<pre>
(a = 5) >> (b = 4) >> (c = a+b) >> (print c)
</pre>
<p>The associativity of <tt>>></tt> really doesn't matter; it can be shown that:</p>
<pre>
(a = 5) >> ((b = 4) >> ((c = a+b) >> (print c)))
</pre>
<p>is equal to:</p>
<pre>
(((a = 5) >> (b = 4)) >> (c = a+b)) >> (print c)
</pre>
<p>Hence, we don't usually bother drawing parentheses around such constructs.
Haskell happens to declare <tt>>></tt> as left-associative, but, strictly
speaking, it doesn't have to be. Anyway, drawn yet another way:</p>
<pre>
a = 5
>>          -- Note how these operators sit "between"
b = 4       -- statements, conceptually.
>>
c = a+b
>>
print c
</pre>
<p>By the definition of <tt>>></tt>, whatever is on the left-hand (top) side of the
operator <em>must</em> evaluate before the right-hand (bottom) side.</p>
<p>The <tt>>></tt> operator properly arranges for sequences of code execution, but it
certainly doesn't address that sticky issue of transferring state from one
statement to the next. How, after all, does one <em>bind</em> variables to values?
Well, remember that dirty trick of nested lambda expressions? <tt>>></tt> creates
those but, as shown above, doesn't thread state from one portion of a program
to the next. Fortunately, we have <tt>>>=</tt> to do that for us:</p>
<pre>
> return 5 >>= \a -> return 4 >>= \b -> return (a+b) >>= \c -> return (show c)
</pre>
<p>Remember the "we don't care about associativity because it just works out" rule
for <tt>>></tt>? It's the same for <tt>>>=</tt> too, since it, too, sequences bits of
the program correctly. In fact:</p>
<pre>
> a >> b = a >>= \_ -> b
</pre>
<p>In other words, <tt>>></tt> is just a special case of <tt>>>=</tt>, where we don't
particularly care about the result of <tt>a</tt> at all.</p>
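<p>Nothing here is specific to our hypothetical operators; Haskell's built-in <tt>Maybe</tt> monad, for instance, exhibits exactly this relationship (a sketch for illustration only):</p>

```haskell
-- (>>=) binds the left-hand result to a name; (>>) evaluates the
-- left-hand side and then throws its result away.
bindExample :: Maybe Int
bindExample = Just 5 >>= \a ->
              Just 4 >>= \b ->
              Just (a + b)        -- both a and b are in scope here

seqExample :: Maybe Int
seqExample = Just 5 >> Just 4     -- the 5 is computed, then discarded
```

<p>And, per the definition above, <tt>Just 5 >> Just 4</tt> and <tt>Just 5 >>= \_ -> Just 4</tt> are interchangeable.</p>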
<h1>7 Putting it Together: The Not So Obvious</h1>
<p>So, we now have <em>all</em> the pieces we need to finally explain just what the heck
monads are. So, let's get to work building our very own <em>state</em> monad!</p>
<h2>7.1 Data Types</h2>
<p>We need <em>something</em> to stuff into our state, but we don't know precisely what.
Hence, we're going to use parametric types to describe it. One thing <em>is</em>
for sure, though: we know that properly sequenced functions depend on nested lambda
expressions. Therefore:</p>
<pre>
> newtype ExState s a = ExS (s -> (s,a))
</pre>
<p>That's right; our data type basically <em>is</em> itself a function. It takes some
state as input, and returns (hopefully) some value of interest, along with a
modified state. Remember the observation about how we threaded state as input
and got another state as output? The plumbing is right there, but is
abstracted in a <tt>newtype</tt>. Note that a <tt>data</tt> type could work here just as
well.</p>
<p>Next, we need some function to <em>compose</em> pieces of our program together -- we
need to define the <tt>>>=</tt> operator. It's pretty clear that a program takes
raw input and provides a transformed version of those inputs in some capacity
(otherwise, what's the point of writing the program?). Even if the program
provides no useful value for <tt>a</tt>, the fact that it updated some kind of state
<em>somewhere</em> is of great value. Hence, <em>the result of a program</em> <strong>must</strong> <em>be
of some <tt>ExState</tt> type</em>. Otherwise, what's the point?</p>
<pre>
> (>>=) :: ExState s a -> (a -> ExState s b) -> ExState s b
> top >>= btm = ExS (\initialState ->
>     let (sTop, vTop) = perform top initialState
>         (sBtm, vBtm) = perform (btm vTop) sTop
>         perform (ExS f) = f
>     in (sBtm, vBtm))
</pre>
<p>Look at what it's doing. The result of <tt>>>=</tt> is a kind of <em>function</em>, just
like we said it should be before. It takes an <tt>initialState</tt>, and returns
the pair <tt>(sBtm, vBtm)</tt>, where <tt>vBtm</tt> is the return value from the complete
computation, and <tt>sBtm</tt> is the new state. And, as both names and intuition
would suggest, these values correspond to the results you'd get by finishing
the program at the <em>bottom</em> of its listing; just like in any other programming
language. We see that <tt>perform (btm vTop) sTop</tt> is used to compute these
values. Note that <tt>btm vTop</tt> must, by definition, be some kind of <em>ExState</em>;
we use the helper function <tt>perform</tt> to yank the function out of it. That
allows us to invoke the bottom part of the program's functionality.</p>
<p>Remember in our earlier discussion how <tt>>>=</tt> bound only a single variable?
Well, it turns out that by doing so, and requiring that function to return a
monadic value itself, we get the following algebraic substitutions:</p>
<pre>
> let (sBtm, vBtm) = perform (btm vTop) sTop
> let (sBtm, vBtm) = (\someState -> (anotherState, anotherValue)) sTop
> let (sBtm, vBtm) = (finalState, finalValue)
</pre>
<p>Pretty slick, eh? Note the middle line, where substitution produces a function.
But where does <tt>sTop</tt> come from? Yes; it comes from the first let-binding:</p>
<pre>
> let (sTop, vTop) = perform top initialState
</pre>
<p>So, in order to properly evaluate the bottom, we <em>must</em> first evaluate the top
portion of the program. The result of the binding is itself a function which,
upon evaluation, evaluates these "sub-programs" in the proper order. And, as
you might expect, sub-programs can consist of sub-sub-programs, and
sub-sub-sub-programs, ad infinitum. The associativity rules of <tt>>>=</tt> allow
us to just keep on threading and threading and threading.</p>
<p>And since we're passing around all these crazy data structures with functions
containing functions containing functions, and states being carefully threaded
from function to function, now you see why I said, earlier in this article,
that monadic expressions are constructors. What we're building, literally, is
a <em>direct-threaded</em> representation of a program. The name <em>direct-threaded</em>
isn't accidental; the technique dates to the 1960s and was popularized by the
Forth programming language, and its purpose was to thread together a bunch of
functions, passing state from function to function. Just like what we're doing
here!</p>
<p>But, there is one last issue involved with all this: once we are working inside
a monad, how do we actually make use of the state we're maintaining?</p>
<h2>7.2 Accessors</h2>
<p>There are generally two kinds of accessors: <em>getters</em> and <em>setters</em>. Haskell
makes this patently clear when working with monads, especially State monads,
because the only way to access the "hidden" state is through such accessors.</p>
<pre>
> getState >>= \st -> doSomethingWith st >>= . . .
</pre>
<p>It's clear that <tt>getState</tt> must be a function that returns <tt>ExState</tt> in
this case, since executing it must produce the pair <tt>(sTop, vTop)</tt> inside the
<tt>>>=</tt> function.</p>
<pre>
> getState = ExS (\s -> (s,s))
</pre>
<p>Yes, it really <em>is</em> as simple as that. What we're doing is taking our
state and returning it as the next return value, while also leaving it
unchanged as the next state.</p>
<p>Changing state is accomplished with a similar, but complementary, function:</p>
<pre>
> putState x = ExS (\s -> (x,x))
</pre>
<p>Note that we "return" x as well; it's not strictly necessary, since 99.999% of
the time, <tt>putState</tt> would be used with <tt>>></tt> rather than <tt>>>=</tt>, thus
discarding the result anyway. But, no matter what, it's patently clear that
the <em>state</em> half of the pair returned is, in fact, being set to <tt>x</tt>;
precisely what we want.</p>
<p>One final "accessor" that really isn't is the <tt>return</tt>. This is a kind of
hybridized <tt>putState</tt>; its purpose is simple: return a value, leaving the
state otherwise unaltered:</p>
<pre>
> return x = ExS (\s -> (s, x))
</pre>
<h2>7.3 Running "Programs" in the State Monad</h2>
<p>So, let's recap. We have a set of operators that construct structures in
memory representing what has to be performed (at least conceptually; I should
point out that compilers designed to work with monads often optimize this step
out), at what time, and with what state. Speaking of state, we have operators
which grant us access to it, for both alteration and query purposes. We could
construct the cure for world hunger with these basic primitives -- if only we
could actually invoke the programs we create, and get at the results!</p>
<p>With the facilities for all that plumbing in place, and with the concept of
"program execution" now clearly defined in terms of simple algebraic
functions, we can turn our attention to actually <em>doing</em> something useful
with it. Like incrementing a counter, for example:</p>
<pre>
> inc = getState >>= \s -> (putState (s+1))
</pre>
<p>Looks pretty straightforward; admittedly, it's a far cry from curing world
hunger, but it is a start! In fact, we can thread several invocations of
this function along too:</p>
<pre>
> counterIncrementByThree = inc >> inc >> inc
</pre>
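<p>Everything developed so far fits in a screenful. Here is a self-contained sketch of the whole thing; to keep it loadable alongside the Prelude, the operator is spelled <tt>bind</tt> instead of <tt>>>=</tt>:</p>

```haskell
newtype ExState s a = ExS (s -> (s, a))

-- Unwrap the function hiding inside an ExState.
perform :: ExState s a -> s -> (s, a)
perform (ExS f) = f

-- Our (>>=): run the top part, feed its value and state to the bottom.
bind :: ExState s a -> (a -> ExState s b) -> ExState s b
bind top btm = ExS (\s0 ->
    let (s1, v1) = perform top s0
    in  perform (btm v1) s1)

getState :: ExState s s
getState = ExS (\s -> (s, s))

putState :: s -> ExState s s
putState x = ExS (\_ -> (x, x))

inc :: ExState Int Int
inc = getState `bind` \s -> putState (s + 1)

incByThree :: ExState Int Int
incByThree = inc `bind` \_ -> inc `bind` \_ -> inc

runExS :: ExState s a -> s -> a
runExS (ExS prg) s0 = snd (prg s0)
```

<p>Evaluating <tt>runExS incByThree 0</tt> yields 3, with all of the state threading happening behind the scenes.</p>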
<p>Before you know it, we'll be curing <em>AIDS.</em> But, this syntax, while useful at
times, is pretty ugly and hard to manage overall, so Haskell allows us to use
conventional <em>imperative-style</em> programming constructs, called <em>do-notation</em>,
to write clearer code:</p>
<pre>
> inc = do
>     s &lt;- getState
>     putState (s+1)
</pre>
<p>See what's happening? The <tt>&lt;-</tt> operator maps to the <tt>>>=</tt> operator,
complete with lambda variable binding. Quite convenient indeed! But, we
needn't always use variables:</p>
<pre>
> counterIncrementByThree = do
>     inc
>     inc
>     inc
</pre>
<p>In this case, because we're not assigning variables, the Haskell compiler knows
to use the <tt>>></tt> operator. But, remember how <tt>>></tt> was defined in terms of
<tt>>>=</tt> earlier? That's why we didn't need to <em>explicitly</em> define our own
<tt>>></tt> operator.</p>
<p>This is all fine, but, the observant reader will point out that we still have
yet to explain how to actually reap the rewards of our programming. We need to
extract the results of the computation. This is usually done with a <em>runner</em>. In fact, we have already seen such a runner:</p>
<pre>
> perform (ExS f) = f
</pre>
<p>That's the one. Previously, we defined it to be local to the <tt>>>=</tt> operator, and to not bother with the details of passing initial state. A more complete runner, however, does exactly that -- deal with the state, I mean:</p>
<pre>
> runExS (ExS prg) initialState = snd $ prg initialState
</pre>
<p>Since the result of running a monadically constructed function is a pair
containing (state', result), we use the <tt>snd</tt> function to extract only the
result. In some cases, you'll be more interested in the state; in this case,
you can define another function if you like:</p>
<pre>
> stateExS (ExS prg) initialState = fst $ prg initialState
</pre>
<p>If both are of interest to you, then instead of evaluating <tt>prg</tt> twice, which
can be a gratuitous waste of time if you're computing the digits of <em>pi</em> to a
billion places, you can forgo the pair selection functions and just
return the whole pair itself:</p>
<pre>
> valueAndStateExS (ExS prg) initialState = prg initialState
</pre>
<p>Typically, you'd use it something like this in a real-world program:</p>
<pre>
> main = do
>     let (someState, someValue) = valueAndStateExS myCounter 0
>     putStr $ "Resulting value: " ++ (show someValue)
>     putStr $ "Resulting state: " ++ (show someState)
</pre>
<p>Doesn't look like much, does it? But note that <tt>0</tt> hanging off to the right
on the call to <tt>valueAndStateExS</tt>? That is the program's <em>initial state</em> --
in C or C++, this is equivalent to the global state, usually established by
numeric constants in main(), or through global (static, hopefully!) variables
in some module.</p>
<h1>8 Conclusion</h1>
<p>Now that you know how the state monad works, you can apply this concept to
any other monad, including the IO monad. There are other kinds of monads that
are <em>not</em> stateful or state-like though. These include the list monad (<tt>[]</tt>)
and specialized data types like <tt>Maybe</tt>. But, for now, I'll let these
specializations of the concept rest. You've already been through a lot, and
I'm itching to get this paper online. And, besides, having gone through all
this work, you can now <em>finally</em> appreciate the <tt>Control.Monad.State</tt> and
related libraries.</p>
<p>Well, there you have it; I must bring this essay to a close now, knowing that
you have read yet another tutorial on monads and what they actually are. My
essay differs from most of the others in that it doesn't invoke the concept of
containers, which apparently has confused a lot of people. It also doesn't
invoke any complex mathematics (like category theory, where monads come from).
Instead, I resort to normal, day-to-day programming experience, possessed by
any programmer of any imperative programming language. I hope that this
explanation has proven as useful to you, as it has to me.</p>
<h1>9 Appendix</h1>
<p>The software contained in the previous sections is made up of valid Haskell
fragments. However, they don't tell the <em>whole</em> story; some details I've had to
leave out for brevity's sake. But fear not -- contained herein is the complete
program that allowed me to write this essay. It's pretty short and dense, but
by following along with the earlier parts of this essay, hopefully you'll be
able to see how all the data flows fit together with the monadic plumbing. I
should also warn you, this code is far from optimal -- it's written so that it
works, and is clear. If you look at, e.g., the <tt>Control.Monad.State</tt> library,
you'll find a substantially terser definition that does the same basic things.</p>
<pre>
> newtype ExState s a = ExS (s -> (s,a))
>
> instance Monad (ExState s) where
>     top >>= btm = ExS (\initialState ->
>         let (sTop, vTop) = perform top initialState
>             (sBtm, vBtm) = perform (btm vTop) sTop
>             perform (ExS f) = f
>         in (sBtm, vBtm))
>
>     return x = ExS (\initialState -> (initialState, x))
>
> getState = ExS (\initialState -> (initialState, initialState))
> putState x = ExS (\initialState -> (x, x))
>
> stateExS (ExS prg) initialState = fst $ prg initialState
> runExS (ExS prg) initialState = snd $ prg initialState
> valueAndStateExS (ExS prg) initialState = prg initialState
</pre>
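<p>One caveat if you try to compile the listing above on a modern GHC: since the Applicative-Monad changes, every <tt>Monad</tt> instance also needs <tt>Functor</tt> and <tt>Applicative</tt> instances, and <tt>return</tt> defaults to <tt>pure</tt>. A version that compiles today might look like this sketch:</p>

```haskell
import Control.Monad (ap, liftM)

newtype ExState s a = ExS (s -> (s, a))

perform :: ExState s a -> s -> (s, a)
perform (ExS f) = f

-- Modern GHC requires Functor and Applicative before Monad;
-- both can be obtained mechanically from (>>=) and pure.
instance Functor (ExState s) where
  fmap = liftM

instance Applicative (ExState s) where
  pure x = ExS (\s -> (s, x))
  (<*>)  = ap

instance Monad (ExState s) where
  top >>= btm = ExS (\s0 ->
      let (s1, v1) = perform top s0
      in  perform (btm v1) s1)

getState :: ExState s s
getState = ExS (\s -> (s, s))

putState :: s -> ExState s s
putState x = ExS (\_ -> (x, x))

runExS :: ExState s a -> s -> a
runExS (ExS prg) s0 = snd (prg s0)

-- do-notation now works exactly as described earlier.
inc :: ExState Int Int
inc = do
  s <- getState
  putState (s + 1)
```

<p>With those instances in place, <tt>runExS (inc >> inc >> inc) 0</tt> evaluates to 3, just as before.</p>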
<h1>10 Acknowledgements</h1>
<p>I would like to take this time to thank the folks in the #Haskell IRC channel
for opening my eyes to this topic. I could not have come to this understanding
without their input. I especially would like to single out Cale, Dolio, Kowey,
and SamB, whose patience with me has been the only thing keeping my interest in
understanding monads alive.</p>
<p>Special thanks to Cale and Dolio for reviewing this document. Their
contributions were invaluable at helping to clarify various points.</p>
<p>And, oddly and finally, thanks to James Burke, for providing a literary style
that I continuously endeavor to match. It's highly conversational, and very
engaging; it's not the dreary doldrum you'd expect from a paper on, of all
things, stuff called "Monads." I mean, c'mon, who ever heard of <em>monads?</em> They
sound like medical conditions! Anyway, his creative use of <em>suspense</em> and his
masterful <em>touch</em> of humor in educational literature (as distinct from
<em>academic</em> literature) lures the reader into <em>wanting</em> to learn more, which is
precisely the effect I'm looking for. I hope I've succeeded.</p>